posse-post-discovery: reduce number of NDB operations #199

Merged
merged 3 commits into from Jun 23, 2014
49 changes: 38 additions & 11 deletions models.py
@@ -629,6 +629,14 @@ def query_by_original(cls, source, url):
return cls.query(cls.original == url,
ancestor=source.key).get()

@classmethod
def query_by_originals(cls, source, urls):
# 30 item limit for IN queries, chain together multiple queries
urls = list(urls)
return itertools.chain.from_iterable(
cls.query(cls.original.IN(urls[i:i + 30]), ancestor=source.key)
for i in xrange(0, len(urls), 30))
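For reference, the batching arithmetic behind query_by_originals can be sketched standalone (plain Python, no ndb dependency; names here are illustrative): the datastore rejects IN filters with more than 30 values, so the URL list is sliced into windows of 30 and the per-slice results are chained back into a single iterable.

```python
from itertools import chain

def batched(seq, size=30):
    """Slice seq into consecutive chunks of at most `size` items,
    mirroring the 30-value limit on datastore IN queries."""
    seq = list(seq)
    return [seq[i:i + size] for i in range(0, len(seq), size)]

urls = ['http://example.com/post/%d' % i for i in range(65)]
chunks = batched(urls)
sizes = [len(c) for c in chunks]  # 65 urls -> chunks of 30, 30, and 5

# chaining the per-chunk results back together preserves input order
flattened = list(chain.from_iterable(chunks))
```

Each chunk would become one IN query in the real method; chaining the query iterators keeps the caller's interface identical to a single query.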

@classmethod
def query_by_syndication(cls, source, url):
return cls.query(cls.syndication == url,
@@ -638,11 +646,22 @@ def query_by_syndication(cls, source, url):
@ndb.transactional
def get_or_insert_by_syndication_url(cls, source, syndication,
original):
"""Insert a relationship from syndication-url -> original.
"""Insert a relationship from syndication-url -> original, replacing
Owner:

i haven't fully followed the logic here yet, but i'm curious what will happen if we have an existing (non-blank) SyndicatedPost where one of the values is the same, but the other is different. i'm guessing we'll leave it, and not store the new relationship?

i haven't thought through what we should do in that case, but i'm guessing we should allow overlapping entities like that, ie leave the existing one and also store the new one.

one catch is when an existing relationship is removed from an h-entry. the concrete case would be you posse to twitter, we grab your rel-syndication link, then later you notice that your possed tweet had a typo, so you delete it, post a new correct one, and update your rel-syndication link.

doesn't have to be handled in this PR, but worth thinking about.

Contributor Author:

Good point. Right now the refetch behavior explicitly only tries to update blanks...does not support changes yet. Sigh :)

> i'm curious what will happen if we have an existing (non-blank) SyndicatedPost where one of the values is the same, but the other is different.

Differs depending on which value is new.

If there is an existing this original -> different syndication relationship, a) we should not be in this method because there is an earlier check for a preexisting relationship that should have prevented it, and b) if we somehow are here, this would add a new SyndicatedPost for the same original URL.

If there is an existing different original -> this syndication relationship, this method would not save a new one.

Owner:

ok, thanks. i'm fine with just adding a TODO, or even just a comment if we're happy with the current behavior for now. we can file an issue to eventually support changed rel-syndication urls in the future if we want.

btw, if you're pretty confident that we should never get here if there's an existing this original -> different syndication relationship, then maybe add an assert?

Contributor Author:

OK I added some comments and restructured the method to be clearer. There is already a TODO to add support for updated URLs in tasks.refetch_hfeed if that is good enough.

And actually it should be OK for there to be multiple syndication URLs (even for the same Source) pointing to the same original, like in the case of syndicating a longer post to a pmarca-style tweetstorm. So I was wrong to say that we shouldn't ever be there with this original -> different syndication.

Owner:

thanks!

so, i'm still thinking through the logic and all possible cases here. it sounds like this is a fully many-to-many relationship, ie we expect that we'll see some SyndicatedPosts with the same original and different syndications (use case: pmarca), and we also expect some with the same syndication and different originals (use case: amending a previous post).

given that, i suspect we need to fully replace query_by_original with query_by_originals, and same with syndication, and change all callers to handle (usually iterate over) multiple results. does that sound right?

apologies if this is all scope creep for this PR. i'm honestly not sure myself. :P

also, minor: consider renaming this method get_or_insert_with, or something similar, since it provides both original and syndication.

blank placeholder relationships if they exist.

This does a check-and-set inside a transaction to avoid putting
duplicates in the database because we assume each syndicated post
can only have one original.
duplicates in the database because we assume each syndication URL
can only have one original. If there is already a non-blank
SyndicatedPost for this syndication URL, this function will return
without saving anything.

If there is a pre-existing non-blank SyndicationPost for this
original, this function will add another relationship for the same
original.

If there are pre-existing syndication->None or original->None
relationships, this function will remove them before adding a
new non-blank relationship.

Args:
source: models.Source subclass
@@ -651,14 +670,22 @@ def get_or_insert_by_syndication_url(cls, source, syndication,
"""
relationship = cls.query_by_syndication(source, syndication)

# replace blank relationships with newly discovered ones
if relationship and original and not relationship.original:
relationship.key.delete()
relationship = None
# do not overwrite a preexisting relationship
if relationship and relationship.original:
Owner:

(if we do change the query callers to handle multiple results, i think we'd look for a SyndicatedPost that matches both original and syndication here, and continue if we don't find one, right?)

Contributor Author:

Yes definitely, when we handle >1 original per syndication URL in the future, this will only need to give up in the case of exact duplicates.

Right now, the syndication column is supposed to be unique, so it gives up on the existence of any non-blank SyndicatedPost for this syndication URL.

return relationship

# if this is a non-blank relationship, remove pre-existing blanks
if original and syndication:
# remove syndication->None relationships
if relationship and not relationship.original:
relationship.key.delete()

if not relationship:
relationship = cls(parent=source.key, original=original,
syndication=syndication)
relationship.put()
# remove original->None relationships too
rel_by_original = cls.query_by_original(source, original)
if rel_by_original and not rel_by_original.syndication:
rel_by_original.key.delete()

relationship = cls(parent=source.key, original=original,
syndication=syndication)
relationship.put()
return relationship
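Setting the transaction machinery aside, the check-and-set flow above can be modeled with an in-memory list of relationships (a sketch only; the function name and the plain-dict representation are illustrative stand-ins for the NDB entities, not Bridgy's actual API):

```python
def get_or_insert(store, syndication, original):
    """Model of get_or_insert_by_syndication_url's branching, with a
    list of {'original', 'syndication'} dicts standing in for the
    datastore."""
    matches = [r for r in store if r['syndication'] == syndication]
    rel = matches[0] if matches else None

    # never overwrite a preexisting non-blank relationship
    if rel and rel['original']:
        return rel

    # a non-blank relationship replaces any blank placeholders
    if original and syndication:
        # drop a blank syndication->None placeholder
        if rel and not rel['original']:
            store.remove(rel)
        # drop blank original->None placeholders too
        for blank in [r for r in store
                      if r['original'] == original
                      and not r['syndication']]:
            store.remove(blank)

    rel = {'original': original, 'syndication': syndication}
    store.append(rel)
    return rel
```

A blank placeholder inserted first is thus deleted and replaced once the real original is discovered, while a later conflicting original for the same syndication URL is ignored, matching the "do not overwrite" early return in the diff.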
38 changes: 19 additions & 19 deletions original_post_discovery.py
@@ -252,10 +252,15 @@ def _process_author(source, author_url, refetch_blanks=False):
for permalink in child['properties'].get('url', []):
permalinks.add(permalink)

# query all preexisting permalinks at once, instead of once per link
preexisting = {r.original: r for r in
SyndicatedPost.query_by_originals(source, permalinks)}

results = {}
for permalink in permalinks:
logging.debug('processing permalink: %s', permalink)
results.update(_process_entry(source, permalink, refetch_blanks))
results.update(_process_entry(source, permalink, refetch_blanks,
preexisting))

if results:
# keep track of the last time we've seen rel=syndication urls for
@@ -269,36 +274,31 @@ def _process_author(source, author_url, refetch_blanks=False):
return results


def _process_entry(source, permalink, refetch_blanks):
def _process_entry(source, permalink, refetch_blanks, preexisting):
"""Fetch and process an h-entry, saving a new SyndicatedPost to the
DB if successful.

Args:
permalink: url of the unprocessed post
syndication_url: url of the syndicated content
refetch_blanks: boolean whether we should ignore blank preexisting
SyndicatedPosts
preexisting: dict of original url to SyndicatedPost

Return:
a map from syndicated url to new models.SyndicatedPosts
a dict from syndicated url to new models.SyndicatedPosts
"""
results = {}
preexisting_relationship = preexisting.get(permalink)

# TODO replace this with one query for the Source as a
# whole. querying each permalink individually is expensive.
preexisting_relationship = SyndicatedPost.query_by_original(source, permalink)

# refetching to look for SyndicatedPosts that didn't have
# a rel=syndication url the first time we checked
if (refetch_blanks and preexisting_relationship
and not preexisting_relationship.syndication):
logging.debug('deleting blank SyndicatedPost for original %s',
permalink)
preexisting_relationship.key.delete()
preexisting_relationship = None

# if the post has already been processed. do not add to the results
# since this method only returns *newly* discovered relationships
# if the post has already been processed, do not add to the results
# since this method only returns *newly* discovered relationships.
if preexisting_relationship:
return results
# if we're refetching blanks and this one is blank, do not return
if refetch_blanks and not preexisting_relationship.syndication:
logging.debug('ignoring blank relationship for original %s', permalink)
else:
return results

syndication_urls = set()
parsed = None
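The refetch branching at the top of the reworked _process_entry can be summarized as a small predicate (a sketch; the function name and the dict shape are hypothetical, not code from the PR):

```python
def should_process(preexisting_rel, refetch_blanks):
    """Return True when _process_entry should fetch the permalink:
    either it has never been seen before, or it is a blank
    placeholder and refetch_blanks asks us to retry it."""
    if preexisting_rel is None:
        return True   # newly discovered permalink: process it
    if refetch_blanks and not preexisting_rel.get('syndication'):
        return True   # blank placeholder: try again to fill it in
    return False      # known relationship: skip, only report new ones
```

This captures why the method returns early for known non-blank relationships: it only reports *newly* discovered syndication links.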