
Retrieve Metadata query limit fixes #433

Merged
merged 5 commits into zotero:4.0 from aurimasv:retrieve-meta on Jan 31, 2014

Conversation

@aurimasv
Contributor

aurimasv commented Dec 31, 2013

It doesn't seem that we were recognizing HTTP 503 responses from GS as Query Limit Reached.

Also added CAPTCHA support. It works surprisingly well in my tests.

Some other tweaks:

  • Don't stop retrieving metadata when we hit a limit (with this patch, retrieval only stops if the user fails/refuses to solve the CAPTCHA, or if GS changes its CAPTCHA mechanism and we no longer recognize it), because we may still be able to retrieve ISBN-/DOI-based metadata
  • Don't close window on blur. I'm not sure why this was ever a good idea. Sometimes you may want to analyze the retrieved metadata and see if everything is right. Since the dialog box is not modal, you can actually scroll through your items in Zotero, which is nice. IMO, there's really no good reason to close the dialog automatically.

I'm sure this will need some code restructuring, but I think the general idea works extremely well.
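
As a rough illustration of the 503/CAPTCHA handling described above (function names and the selector are illustrative, not the actual patch):

// Illustrative sketch only: treat an HTTP 503 from Google Scholar as a
// query-limit response and look for a CAPTCHA to show the user, instead of
// aborting the whole batch (DOI/ISBN-based lookups can still proceed).
function isScholarQueryLimit(xhr) {
    return xhr.status == 503;
}

function extractCaptchaImage(doc) {
    // The selector is a guess at the block page's markup, not GS's actual HTML.
    var img = doc.querySelector('img[src*="sorry"]');
    return img ? img.src : null;
}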

@aurimasv

Contributor

aurimasv commented Jan 2, 2014

FWIW, here's 155 PDFs that go the GS route (no DOI, no ISBN, positive matches on GS) https://www.dropbox.com/s/lp82ijrbho38wmd/GS_pdfs.tar.gz (for testing purposes only. I'll take it down when this gets pushed)

@aurimasv

Contributor

aurimasv commented Jan 12, 2014

Done. Let me know if it's ready to be squashed

@dstillman

Member

dstillman commented Jan 12, 2014

I'll try it out, but I'm going to let @simonster review this, since the original code is his.

@simonster

Member

simonster commented Jan 13, 2014

I'm not sure how I feel about removing the auto-blur close. I think you're right for cases where the user retrieves metadata for a lot of items, but for 1-2 items it should probably close on blur, since the user probably doesn't care about the information in the dialog. We could use the translate progress window for 1-2 items, increase the delay before closing on blur, or make the behavior change depending on the number of items (but that seems kind of weird).

promise;
Zotero.debug(allText);
if(!doi) {
// Look for a JSTOR stable URL, which can be converted to a DOI by prepending 10.2307
doi = firstChunk.match(/www\.jstor\.org\/stable\/(\S+)/i);

@simonster

simonster Jan 13, 2014

Member

Is this always true now? I vaguely remember that at one point not all JSTOR items had registered DOIs.

@aurimasv

aurimasv Jan 13, 2014

Contributor

No, I don't think all of their DOIs are registered, but I don't really see a downside to trying it. I don't think this would produce false negatives.

@simonster

simonster Jan 13, 2014

Member

You're right; we'll just fall back on GS if it's in CrossRef.
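
For context, a minimal sketch of the fallback under discussion (the helper name is made up): derive a candidate DOI from a JSTOR stable URL by prepending 10.2307, and treat it only as a hint, since not every JSTOR item has a registered DOI.

// Sketch: turn a JSTOR stable URL into a candidate DOI.
// e.g. "http://www.jstor.org/stable/1234567" -> "10.2307/1234567"
function jstorDoiFromText(text) {
    var m = text.match(/www\.jstor\.org\/stable\/(\S+)/i);
    return m ? '10.2307/' + m[1] : null;
}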

@simonster

Member

simonster commented Jan 13, 2014

If we keep the close on blur, we should also clear the close timer when the window is refocused.
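
A minimal sketch of that behavior (the 5-second figure matches what was later adopted in this thread; everything else is illustrative):

// Sketch: start a close timer when the dialog loses focus,
// cancel it if the dialog is refocused before it fires.
var _blurCloseTimer = null;
window.addEventListener('blur', function () {
    _blurCloseTimer = window.setTimeout(function () { window.close(); }, 5000);
}, false);
window.addEventListener('focus', function () {
    if (_blurCloseTimer) {
        window.clearTimeout(_blurCloseTimer);
        _blurCloseTimer = null;
    }
}, false);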

@aurimasv

Contributor

aurimasv commented Jan 13, 2014

I'm not sure how I feel about removing the auto-blur close. I think you're right for cases where the user retrieves metadata for a lot of items, but for 1-2 items it should probably close on blur, since the user probably doesn't care about the information in the dialog. We could use the translate progress window for 1-2 items, increase the delay before closing on blur, or make the behavior change depending on the number of items (but that seems kind of weird).

Close on blur is just not a feature I have encountered often (if ever) on Windows. The behavior is quite unexpected, particularly since there is a Close button, and the overhead of clicking it is minimal.

For a single file, the contents of the dialog may not be that useful to the user, since the file is automatically selected in the library. I would still argue, though, that if an error is produced, the user might want the dialog not to disappear while he goes to type the error into his/her browser. We could of course skip the onBlur action in the case of an error (I'm not actually sure if we do), but then the behavior of that dialog becomes inconsistent across cases.

For more than one item (as few as 2 or 3), even successful retrieval of all items may warrant keeping the dialog open after it loses focus. Say the first item was a false positive, so it would not be highlighted after retrieval. If the user was retrieving metadata within a large collection or My Library, he may want to search for this item. If we close the dialog, he may lose the title of the new false-positive item and have to resort to sorting by date added to find it (not the most obvious step for most users).

I just don't see enough utility in closing this dialog on blur to warrant the potential headache for the user.

@simonster

Member

simonster commented Jan 13, 2014

I'll let @dstillman be the tie breaker on whether this should close on blur.

One additional question is whether we should even try to handle the case where there are multiple retrieve metadata processes happening at once. This is not trivial, since they might not even have been opened from the same window. But maybe this is good enough, since that's not going to be particularly common, and I think with this PR we'd just show a captcha for each of them, which is not all that bad.

@aurimasv

Contributor

aurimasv commented Jan 13, 2014

I haven't considered that case. I'll have to see if anything can be broken with multiple simultaneous metadata retrievals.

@aurimasv

Contributor

aurimasv commented Jan 13, 2014

I hope that's what you meant. The relationship between Zotero_RecognizePDF and ItemRecognizer is a bit weird. It doesn't seem like ItemRecognizer should be calling a Zotero_RecognizePDF method (https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L465), but it works, so whatever.

@simonster

Member

simonster commented Jan 13, 2014

I agree that ItemRecognizer should probably not be calling a method on Zotero_RecognizePDF. If you want to change that I'd accept a PR for it.

@simonster

Member

simonster commented Jan 13, 2014

(But this looks good for now, provided that you don't see any problems with multiple retrieve metadata windows open simultaneously.)

@aurimasv

Contributor

aurimasv commented Jan 13, 2014

The behavior with two Retrieve Metadata windows is a bit messed up when it comes to captchas. Partially it's because Google Scholar will only accept a correct response from (probably) the latest captcha and the user is given two of them to respond to. I'll have to think about how to best approach this.

@dstillman

Member

dstillman commented Jan 13, 2014

I'm inclined to keep the close on blur, perhaps with a slightly longer timeout and with the close-on-blur disabled altogether if the window is refocused after being blurred once (so that you could switch away and back and then it wouldn't disappear).

I agree that it's a bit weird for it to disappear, but I don't like the idea of there potentially being a whole bunch of these windows lying around in the background after multiple Retrieve Metadata attempts. If we always sent results to the same window if one was already open, I might change my mind.

@aurimasv

Contributor

aurimasv commented Jan 13, 2014

but I don't like the idea of there potentially being a whole bunch of these windows lying around in the background

The dialog is opened with a "dependent" flag, so it's always on top, but I see that this is not implemented on Mac OS X, which is why it presents this problem. Could we make the on-blur behavior Mac-only? (I'll check Ubuntu to make sure it works as it does on Windows.)

Edit: or we could use "alwaysRaised" to make the behavior consistent.
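
For reference, the window features being compared look roughly like this (the chrome URL and arguments are placeholders, not the actual call in recognizePDF.js):

// "dependent": dialog stays above its opener window (per the discussion,
// not honored on OS X at the time) and closes with it.
win.openDialog('chrome://example/content/progress.xul', '',
    'chrome,centerscreen,dependent', args);

// "alwaysRaised": dialog stays on top of all windows, which would make the
// behavior consistent across platforms.
win.openDialog('chrome://example/content/progress.xul', '',
    'chrome,centerscreen,alwaysRaised', args);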

@dstillman

Member

dstillman commented Jan 14, 2014

Ah, OK. I agree that the auto-close is weirder when the window is on top, so I'm all right with making this Mac-only (and Linux depending on how it behaves).

I don't think alwaysRaised works on OS X (or, at least, it didn't used to, but I imagine if it did 'dependent' would work the same way).

@aurimasv

Contributor

aurimasv commented Jan 14, 2014

OK. So I moved all of the GS code into its own singleton object. GS queries now get queued up and processed one at a time, so there is only ever a single query to GS in flight and CAPTCHAs can't get mixed up no matter how many windows are open at the same time. The downside is that each window has to wait for its turn to get GS metadata, so it's a bit slower, but I think that's what we want anyway.

There's one tiny pitfall left, which I don't think is really worth addressing (at least I don't see a straightforward way). If Google Scholar blocks us while we are trying to fetch the BibTeX (not when we're performing the search), then we get an "Unexpected Error". Odds of that are not too high and the user can redo retrieval afterwards.

I was also able to trigger the 403 error from GS, which is basically a page saying "We think you're a bot, try again later" with no captcha to solve. The good news is that clearing a single cookie gets you back to the 503 response with a captcha. With that in mind, I added some code that will clear the cookie and retry (it will ask the user about this when using Firefox, since I'm not sure what the consequences of clearing this cookie are in terms of staying logged in or persisting some Google per-session settings). Unfortunately, after I put that code in place, I was no longer able to trigger the 403 error, so it's not really tested.

Finally, I added the on blur close timer for Mac with a reset after re-focusing the window. I bumped the time up to 5 seconds.
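
A stripped-down sketch of the queueing approach described above (the names are illustrative; the actual code is the singleton added by this PR):

// Sketch: funnel every Google Scholar request through one promise chain so
// only a single query (and a single CAPTCHA) is in flight at a time,
// regardless of how many Retrieve Metadata windows are open.
var GSQueue = {
    _chain: Promise.resolve(),
    add: function (queryFn) {
        var result = this._chain.then(queryFn);
        // Keep the chain alive even if one query fails.
        this._chain = result.then(null, function () {});
        return result;
    }
};

// Usage (hypothetical): GSQueue.add(function () { return queryGoogleScholar(item); });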

@dstillman

Member

dstillman commented Jan 14, 2014

And we don't have any reason to believe that Google responds with an IP block if too many requests hit the 403 (after repeated cookie clearings)?

Finally, I added the on blur close timer for Mac with a reset after re-focusing the window. I bumped the time up to 5 seconds.

I assume we don't close on blur if the process is still running, right?

@aurimasv

Contributor

aurimasv commented Jan 14, 2014

And we don't have any reason to believe that Google responds with an IP block if too many requests hit the 403 (after repeated cookie clearings)?

I don't exactly know what triggers 403. But after clearing the cookie, you are presented with a captcha. You fill it out and you're back in business. In my experience subsequent blocks have been 503s until a random 403 pops up again. I think it may be some left-over older code that is still being triggered somehow.

I assume we don't close on blur if the process is still running, right?

Of course.

I'll fix the messages.

@dstillman

Member

dstillman commented Jan 14, 2014

I think it may be some left-over older code that is still being triggered somehow.

Oh, I doubt that. I've always assumed that it's just a higher throttling threshold being reached at which they no longer particularly care to give you a captcha— hence my concern about persisting after the 403, in case there's a third threshold at which they block by IP address. But given that they're Google and the prevalence of NAT and proxies, I assume usage from an IP has to be pretty extreme for them to do that.

@aurimasv

Contributor

aurimasv commented Jan 14, 2014

FWIW, I found some discussion (in Russian) here (through Google Translate). Halfway down the page, they discuss the CAPTCHA page, the 403 page, and the GDSESS cookie. My observations don't quite match theirs in terms of getting the 403 page to display (which I haven't been able to do today) and the role of the GDSESS cookie. If anyone feels like toying around and trying to figure this out, that would be great. I'll keep fiddling with this myself.

@aurimasv

Contributor

aurimasv commented Jan 15, 2014

I didn't notice any side effects of removing the GDSESS cookie. I've hit the 403 page quite a few times today, and removing the cookie brought me back to the CAPTCHA page every time (except maybe once, but I can't be sure the cookie got removed properly). I never got permanently blocked, and I probably ended up querying Google well over 1000 times. I also commented out the user prompt.

I think this should be ready for release.

@dstillman

Member

dstillman commented Jan 15, 2014

OK, thanks for testing all of that. I'm still a bit uncomfortable silently doing all of this on the user's behalf, though.

  1. I think we need a bit more explanatory text in the CAPTCHA window (which I haven't seen, but from the code it looks like it's just a title, image, and textbox) saying that this is for Google Scholar. Not totally sure how to phrase that, since it will be news to most users that Google Scholar has anything to do with this feature, so we may need to communicate that as well. "Zotero uses Google Scholar to help identify PDFs. To continue using Google Scholar, please enter the text from the image below." I'm not sure if we want to be any more specific about why they're getting a CAPTCHA.

  2. Do we slow down at all when we hit these CAPTCHAs, or do we just present them to the user and then carry on? Particularly if we're not identifying our requests (in Firefox, though Standalone does have its own user agent string), I'm a little uncomfortable sending potentially a huge number of requests to Google without showing at least some deference for these blocks. They're mostly to prevent scraping, presumably, but in theory they could also be for load management (if this weren't, you know, Google), and if you take them as a form of communication that the request volume is too high, it's a little inappropriate to keep going and even automate around the more severe block they put in place. I'm mostly concerned here that they would start IP-blocking Zotero users, but I also think we want to generally consider this an API and be respectful in our use of it. Now that we better handle multiple windows and the like, slowing down by a second or two when we hit a CAPTCHA and letting these requests run in the background seems like less of a problem. We could maybe even speed up again if a certain amount of time has passed without a CAPTCHA?

@dstillman

Member

dstillman commented Jan 15, 2014

I'm OK with not showing the user anything special when we clear the cookie, though, since that's client side and too complicated to explain anyway.

I could see adding a 1-2 seconds to the request delay each time we hit a CAPTCHA or 403, and then reducing the delay by 1-2 seconds after some period of time without a CAPTCHA (chosen by someone with a better sense of Google blocking patterns). Not sure if the delay should reset to 2 for a fresh attempt or persist (the latter assuming enough time hasn't passed).
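
A sketch of the adaptive delay being proposed here; the 2-second baseline comes from the discussion below, while the step, cap, and quiet period are placeholder numbers:

// Sketch: grow the delay between GS requests each time we hit a CAPTCHA or
// 403, and fall back to the baseline after a quiet period with no blocks.
var GSDelay = {
    baseline: 2000,                 // usual wait between GS requests (ms)
    step: 2000,                     // added per CAPTCHA/403
    max: 10000,
    quietPeriod: 10 * 60 * 1000,    // placeholder: 10 minutes without a block
    _delay: 2000,
    _lastBlock: 0,
    onBlock: function () {
        this._lastBlock = Date.now();
        this._delay = Math.min(this._delay + this.step, this.max);
    },
    current: function () {
        if (this._lastBlock && Date.now() - this._lastBlock > this.quietPeriod) {
            this._delay = this.baseline;
        }
        return this._delay;
    }
};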

@aurimasv

Contributor

aurimasv commented Jan 16, 2014

Do we slow down at all when we hit these CAPTCHAs, or do we just present them to the user and then carry on?

There's no additional slow-down besides the usual 2-second wait that we do between requests. I suppose we could bump it up to 5 seconds, though I don't think it would help us avoid further CAPTCHAs. Based on what I've read around, the limits imposed by Google Scholar are far too low for us to stay under them and still appear to be doing anything. So in terms of being respectful of the API, yes, maybe we can back off a bit, but we're still going to be hitting the CAPTCHAs, so I don't see much point. Perhaps make the delay longer in general (i.e. always wait 5 seconds between GS queries)? I'm not sure whether that would help avoid the 403 pages; I'd have to test it.

Also, I'm pretty sure we're not imposing a delay between querying GS and retrieving the BibTeX file for the result. I think that would probably be a bit harder to do.

Particularly if we're not identifying our requests (in Firefox, though Standalone does have its own user agent string), I'm a little uncomfortable sending potentially a huge number of requests to Google without showing at least some deference for these blocks.

Are you suggesting that we should send a unique User-Agent header?

They're mostly to prevent scraping, presumably, but in theory they could also be for load management (if this weren't, you know, Google), and if you take them as a form of communication that the request volume is too high, it's a little inappropriate to keep going and even automate around the more severe block they put in place.

Yes, but then we're mostly back to square one. Have the users wait out the 403 errors?

Now that we better handle multiple windows and the like, slowing down by a second or two when we hit a CAPTCHA and letting these requests run in the background seems like less of a problem.

What do you mean by running in the background?

We could maybe even speed up again if a certain amount of time has passed without a CAPTCHA?

In my experience you end up hitting the CAPTCHA every 30-40 PDFs, so there's not really much time to speed up again. I suppose with PDFs that are more diverse (i.e. including DOIs) it could take longer.

I wouldn't want Zotero to be flagged by Google Scholar, but I also feel that we can't really stay under the limits imposed by Google Scholar and not be annoying to the users. It would be great if we could communicate with Google about this, but obviously that's not going to happen. Basically, I don't know what the best querying scheme would be. Since we're probably not actually causing any damage to Google with our automated queries, I think we could just give the current system a shot and see what happens. Otherwise, whatever throttling you decide on will be fine by me.

@dstillman

Member

dstillman commented Jan 16, 2014

Haven't we also seen cases where people were locked out in all browsers after retrieving metadata (so some sort of IP address block), or am I imagining that?

@aurimasv

Contributor

aurimasv commented Jan 16, 2014

Are you suggesting that we should send a unique User-Agent header?

Not necessarily, but if we did, we could maybe worry less about the request rate, since if they had a problem with it they could block us specifically (or, ideally, contact us) — or they might see that it's Zotero, realize it's fairly innocuous usage, and ignore it.

I rather like this. I think it's fair to announce ourselves. It would also send the message that we're not really trying to abuse the service.
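
If that route were taken, a sketch of what it might look like from privileged code (the UA string here is purely a placeholder; nothing in this thread settles on one):

// Sketch only: identify Zotero's automated GS requests with a distinct
// User-Agent. Privileged (chrome) XHRs are allowed to override this header.
var xhr = new XMLHttpRequest();
xhr.open('GET', scholarSearchUrl);   // hypothetical URL variable
xhr.setRequestHeader('User-Agent', 'Zotero RecognizePDF (placeholder string)');
xhr.send();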

In my experience you end up hitting the CAPTCHA every 30-40 PDFs, so there's not really much time to speed up again.

Does that change at all with a 5-second delay?

My preliminary testing seems to indicate that there's no difference.

Well, we can certainly clear in Standalone. I think it's probably fine to clear it in Firefox too if it's Scholar-specific. It'd be unfortunate to unexpectedly clear the user's Scholar prefs (though doesn't the translator modify those anyway?), but it's worse to unexpectedly lock them out of Google Scholar.

If that cookie stores non-Scholar stuff, too, we probably shouldn't touch it.

Both of those are set on the "google.com" domain. I'm fairly confident that GDSESS doesn't really play much role in normal browsing (it seems to only appear when CAPTCHAs are involved). PREF, on the other hand, I think is always present. It certainly doesn't log you out and I'm not sure what sort of temporary preferences it can be affecting, since if you're logged in, your important preferences should be permanent. I guess, on the other hand, if we clear it when the user is not logged in, it may affect something.

Haven't we also seen cases where people were locked out in all browsers after retrieving metadata (so some sort of IP address block), or am I imagining that?

https://forums.zotero.org/discussion/12036/locked-out-of-google-scholar/ is the closest thing I found. But it seems to be about the 403 page that does not present a captcha, which we now know how to bypass. I can still see how users may encounter this if they use a combination of Retrieve Metadata and Safari/Chrome/Opera, or if they use the URL bar icon to retrieve a lot of sources. We could make a note somewhere on the website that they can clear cookies and will then be presented with a captcha.
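
For reference, the cookie-clearing step described above amounts to something like this in Gecko (using the cookie manager signature of that era; the domain and cookie name follow the discussion above):

// Sketch: drop the GDSESS cookie for google.com so the next request gets the
// 503 + CAPTCHA page instead of the captcha-less 403 block page.
Components.utils.import('resource://gre/modules/Services.jsm');
Services.cookies.remove('.google.com', 'GDSESS', '/', false);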

@aurimasv

Contributor

aurimasv commented Jan 16, 2014

GUI design is not my cup of coffee. The CAPTCHA pop-up looks presentable to me, but you'll probably want to tweak it to look better.

@aurimasv

Contributor

aurimasv commented Jan 22, 2014

Since we're fixing all of this up, here are two more issues that I'm thinking of:

  • Do not close the window when the user presses Cancel during retrieval; just stop it. More relevant for multiple items than for a single item.
  • Make the columns resizable. Sometimes it's nice to see more of the filename to determine whether the metadata is correct. Column size doesn't have to persist.
@dstillman

Member

dstillman commented Jan 22, 2014

Those both sound good to me.

@aurimasv

Contributor

aurimasv commented Jan 22, 2014

I think I covered all but #446. I don't have time for that one right now.

@aurimasv

Contributor

aurimasv commented Jan 22, 2014

done

@dstillman

Member

dstillman commented Jan 22, 2014

OK, great, thanks. That's it from me, at least for the moment. @simonster is going to test this some more in the next day or two.

@simonster

Member

simonster commented Jan 23, 2014

Except for that tiny nit, looks good to me and seems to work quite well. Thanks for doing this!

@jgosmann

jgosmann commented Jan 24, 2014

I am switching to Zotero and had to retrieve metadata for about 600 to 700 papers. Of course I got locked out of Google Scholar when I first tried. But then I checked out this pull request and it worked like a charm (occasionally entering a captcha). Thanks!

@aurimasv

Contributor

aurimasv commented Jan 26, 2014

@jgosmann, thanks for testing! Glad to see that it's working well.

@simonster, @dstillman: I fixed the nit and also fixed #446. Please take a look.

Edit: and I guess I'm tired now, since my commit messages don't make sense. I'll fix them when squashing

@aurimasv

Contributor

aurimasv commented Jan 28, 2014

Except for the _rowIDs, I made all the requested changes. Also added/fixed comments. Let me know if I should clean up the commits and get this ready for merging.

@simonster

Member

simonster commented Jan 31, 2014

Looks good to me. Thanks for your patience.

aurimasv added some commits Dec 31, 2013

[Retrieve Metadata] Recognize HTTP 503 code as Google Scholar CAPTCHA + other tweaks.

* Stop metadata retrieval when cancelled
* Display CAPTCHA dialog
* Don't close window on blur
* Use Zotero.Utilities.cleanISBN to validate ISBNs
[Retrieve Metadata] Use a single queue to query Google Scholar. Window closing tweaks.

* Close window on blur after completion on Mac (revert previous change)
* Don't close window when canceling
* Add Esc handler to cancel/close window
* Allow columns to be resized
* Fixes #445
* Fixes #444
@aurimasv

Contributor

aurimasv commented Jan 31, 2014

Great. Trimmed down the commits.

simonster added a commit that referenced this pull request Jan 31, 2014

Merge pull request #433 from aurimasv/retrieve-meta
Retrieve Metadata query limit fixes

simonster merged commit 71a3751 into zotero:4.0 on Jan 31, 2014

aurimasv deleted the aurimasv:retrieve-meta branch on Jan 31, 2014

@aurimasv

Contributor

aurimasv commented Jan 31, 2014

Doesn't look like this automatically closed #444, #445, or #446.

@dstillman

Member

dstillman commented Jan 31, 2014

Issues don't get closed automatically until the commit hits the master branch.
