New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
posse-post-discovery: reduce number of NDB operations #199
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -629,6 +629,14 @@ def query_by_original(cls, source, url): | |
return cls.query(cls.original == url, | ||
ancestor=source.key).get() | ||
|
||
@classmethod | ||
def query_by_originals(cls, source, urls): | ||
# 30 item limit for IN queries, chain together multiple queries | ||
urls = list(urls) | ||
return itertools.chain.from_iterable( | ||
cls.query(cls.original.IN(urls[i:i + 30]), ancestor=source.key) | ||
for i in xrange(0, len(urls), 30)) | ||
|
||
@classmethod | ||
def query_by_syndication(cls, source, url): | ||
return cls.query(cls.syndication == url, | ||
|
@@ -638,11 +646,22 @@ def query_by_syndication(cls, source, url): | |
@ndb.transactional | ||
def get_or_insert_by_syndication_url(cls, source, syndication, | ||
original): | ||
"""Insert a relationship from syndication-url -> original. | ||
"""Insert a relationship from syndication-url -> original, replacing | ||
blank placeholder relationships if they exist. | ||
|
||
This does a check-and-set inside a transaction to avoid putting | ||
duplicates in the database because we assume each syndicated post | ||
can only have one original. | ||
duplicates in the database because we assume each syndication URL | ||
can only have one original. If there is already a non-blank | ||
SyndicatedPost for this syndication URL, this function will return | ||
without saving anything. | ||
|
||
If there is a pre-existing non-blank SyndicationPost for this | ||
original, this function will add another relationship for the same | ||
original. | ||
|
||
If there are pre-existing syndication->None or original->None | ||
relationships, this function will remove them before adding a | ||
new non-blank relationship. | ||
|
||
Args: | ||
source: models.Source subclass | ||
|
@@ -651,14 +670,22 @@ def get_or_insert_by_syndication_url(cls, source, syndication, | |
""" | ||
relationship = cls.query_by_syndication(source, syndication) | ||
|
||
# replace blank relationships with newly discovered ones | ||
if relationship and original and not relationship.original: | ||
relationship.key.delete() | ||
relationship = None | ||
# do not overwrite a preexisting relationship | ||
if relationship and relationship.original: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (if we do change the query callers to handle multiple results, i think we'd look for a SyndicatedPost that matches both original and syndication here, and continue if we don't find one, right?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes definitely, when we handle >1 original per syndication URL in the future, this will only need to give up in the case of exact duplicates. Right now, the |
||
return relationship | ||
|
||
# if this is a non-blank relationship, remove pre-existing blanks | ||
if original and syndication: | ||
# remove syndication->None relationships | ||
if relationship and not relationship.original: | ||
relationship.key.delete() | ||
|
||
if not relationship: | ||
relationship = cls(parent=source.key, original=original, | ||
syndication=syndication) | ||
relationship.put() | ||
# remove original->None relationships too | ||
rel_by_original = cls.query_by_original(source, original) | ||
if rel_by_original and not rel_by_original.syndication: | ||
rel_by_original.key.delete() | ||
|
||
relationship = cls(parent=source.key, original=original, | ||
syndication=syndication) | ||
relationship.put() | ||
return relationship |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i haven't fully followed the logic here yet, but i'm curious what will happen if we have an existing (non-blank) SyndicatedPost where one of the values is the same, but the other is different. i'm guessing we'll leave it, and not store the new relationship?
i haven't thought through what we should do in that case, but i'm guessing we should allow overlapping entities like that, ie leave the existing one and also store the new one.
one catch is when an existing relationship is removed from an h-entry. the concrete case would be you posse to twitter, we grab your rel-syndication link, then later you notice that your possed tweet had a typo, so you delete it, post a new correct one, and update your rel-syndication link.
doesn't have to be handled in this PR, but worth thinking about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Right now the refetch behavior explicitly only tries to update blanks...does not support changes yet. Sigh :)
Differs depending on which value is new.
If there is an existing
this original -> different syndication
relationship, a) we should not be in this method because there is an earlier check for a preexisting relationship that should have prevented it, and b) if we somehow are here, this would add a new SyndicatedPost for the same original URL.If there is an existing
different original -> this syndication
relationship, this method would not save a new one.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks. i'm fine with just adding a TODO, or even just a comment if we're happy with the current behavior for now. we can file an issue to eventually support changed rel-syndication urls in the future if we want.
btw, if you're pretty confident that we should never get here if there's an existing
this original -> different syndication
relationship, then maybe add an assert?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I added some comments and restructured the method to be clearer. There is already a TODO to add support for updated URLs in tasks.refetch_hfeed if that is good enough.
And actually it should be OK for there to be multiple syndication URLs (even for the same Source) pointing to the same original... Like in the case of syndicating a longer post to a pmarca-style tweetstorm. So I was wrong to say that we shouldn't ever be there with
this original -> different syndication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
so, i'm still thinking through the logic and all possible cases here. it sounds like this is a fully many-to-many relationship, ie we expect that we'll see some SyndicatedPosts with the same original and different syndications (use case: pmarca), and we also expect some with the same syndication and different originals (use case: amending a previous post).
given that, i suspect we need to fully replace
query_by_original
withquery_by_originals
, and same with syndication, and change all callers to handle (usually iterate over) multiple results. does that sound right?apologies if this is all scope creep for this PR. i'm honestly not sure myself. :P
also, minor: consider renaming this method
get_or_insert_with
, or something similar, since it provides both original and syndication.