Retrieve Metadata query limit fixes #433

Merged
merged 5 commits into zotero:4.0 from aurimasv:retrieve-meta on Jan 31, 2014


@aurimasv
Contributor

It doesn't seem that we were recognizing HTTP 503 responses from Google Scholar (GS) as "Query Limit Reached".

Also added captcha support. Works surprisingly well on my tests.

Some other tweaks:

  • Don't stop retrieving metadata if we hit a limit (with this patch, we only stop if the user fails/refuses to solve the CAPTCHA, or if the GS CAPTCHA mechanism changes and we don't recognize it), because we may still be able to retrieve ISBN-/DOI-based metadata
  • Don't close the window on blur. I'm not sure why this was ever a good idea. Sometimes you may want to analyze the retrieved metadata and check that everything is right. Since the dialog box is not modal, you can actually scroll through your items in Zotero, which is nice. IMO, there's really no good reason to close the dialog automatically.

I'm sure this will need some code restructuring, but I think the general idea works extremely well.
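
A minimal sketch of the 503 handling idea, assuming (as the Zotero codebase does) that Zotero.HTTP.promise rejects with an exception carrying the response status and xmlhttp; gsQueryUrl, _parseGSResults, and _solveCaptcha here are hypothetical placeholders, not the actual code:

	Zotero.HTTP.promise("GET", gsQueryUrl, {"responseType":"document"})
	.then(function(xmlhttp) {
		return _parseGSResults(xmlhttp.response);
	})
	.fail(function(e) {
		// GS signals the query limit with a 503 page containing a CAPTCHA form
		if(e.status == 503) {
			return _solveCaptcha(e.xmlhttp.response);
		}
		throw e;
	});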

@aurimasv
Contributor
aurimasv commented Jan 2, 2014

FWIW, here are 155 PDFs that go the GS route (no DOI, no ISBN, positive matches on GS): https://www.dropbox.com/s/lp82ijrbho38wmd/GS_pdfs.tar.gz (for testing purposes only; I'll take it down when this gets pushed)

@dstillman dstillman commented on an outdated diff Jan 10, 2014
chrome/content/zotero/recognizePDF.js
-	return false;
+
+	formData.input.captcha = io.dataOut.captcha;
+	var url = '', prop;
+	for(prop in formData.input) {
+		url += '&' + encodeURIComponent(prop) + '='
+			+ encodeURIComponent(formData.input[prop]);
+	}
+
+	url = formData.action + '?' + url.substr(1);
+
+	return Zotero.HTTP.promise("GET", url, {"responseType":"document"})
+	.then(function() {
+		throw new Zotero_RecognizePDF.CaptchaResult(true);
+	})
+	.fail(function(e) {
@dstillman
dstillman Jan 10, 2014 Member

This (and the other instances) should be changed to .catch, which is the new way to do error handling in the promise spec. (We'll be bulk-changing these on master (along with a switch to Bluebird from Q), but merged code would likely get missed.)
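
For example (Q already supports both spellings; .catch is simply the standards-aligned alias of .fail):

	// old style
	Zotero.HTTP.promise("GET", url).fail(function(e) { /* handle error */ });
	// preferred going forward, and what Bluebird expects
	Zotero.HTTP.promise("GET", url).catch(function(e) { /* handle error */ });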

@dstillman dstillman commented on an outdated diff Jan 10, 2014
chrome/content/zotero/recognizePDF.js
+	if(tries === undefined) tries = 3;
+	if(!tries) throw new Zotero_RecognizePDF.CaptchaResult(false);
+	tries--;
+
+	var formData = doc && _extractCaptchaFormData(doc);
+	if(!formData) throw new Zotero.Exception.Alert('recognizePDF.limit');
+
+	var io = { dataIn: {
+		imgUrl: formData.img
+	}};
+
+	_progressWindow.openDialog("chrome://zotero/content/captcha.xul", "",
+		"chrome,modal,resizable=no,centerscreen", io);
+
+	if(!io.dataOut) {
+		return Q.fcall(function () {
@dstillman
dstillman Jan 10, 2014 Member

And this should be Q.try().
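
That is, something like this (Q.try invokes the function immediately and turns a synchronous throw into a rejected promise, behaving like Q.fcall with no extra arguments):

	if(!io.dataOut) {
		return Q.try(function () {
			throw new Zotero_RecognizePDF.CaptchaResult(false);
		});
	}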

@aurimasv
Contributor

Done. Let me know if it's ready to be squashed

@dstillman
Member

I'll try it out, but I'm going to let @simonster review this, since the original code is his.

@simonster
Member

I'm not sure how I feel about removing the auto-blur close. I think you're right for cases where the user retrieves metadata for a lot of items, but for 1-2 items it should probably close on blur, since the user probably doesn't care about the information in the dialog. We could use the translate progress window for 1-2 items, increase the delay before closing on blur, or make the behavior change depending on the number of items (but that seems kind of weird).

@simonster simonster commented on the diff Jan 13, 2014
chrome/content/zotero/recognizePDF.js
promise;
Zotero.debug(allText);
+	if(!doi) {
+		// Look for a JSTOR stable URL, which can be converted to a DOI by prepending 10.2307
+		doi = firstChunk.match(/www\.jstor\.org\/stable\/(\S+)/i);
@simonster
simonster Jan 13, 2014 Member

Is this always true now? I vaguely remember that at one point not all JSTOR items had registered DOIs.

@aurimasv
aurimasv Jan 13, 2014 Contributor

No, I don't think all of their DOIs are registered, but I don't really see a downside to trying it. I don't think this would produce false negatives.

@simonster
simonster Jan 13, 2014 Member

You're right; we'll just fall back on GS if it's not in CrossRef.
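
The conversion itself is cheap; a sketch, reusing the variable names from the surrounding diff:

	var m = firstChunk.match(/www\.jstor\.org\/stable\/(\S+)/i);
	if(m) {
		// Prepend the JSTOR DOI prefix. If this DOI isn't registered
		// with CrossRef, the lookup simply fails and we fall back on GS.
		doi = '10.2307/' + m[1];
	}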

@simonster simonster commented on an outdated diff Jan 13, 2014
chrome/content/zotero/recognizePDF.js
	this._progressWindow.document.getElementById("cancel-button").addEventListener("command", function() {
		me.stop();
		me._progressWindow.close();
	}, false);
	this._progressWindow.addEventListener("close", function() { me.stop() }, false);
+	Zotero_RecognizePDF._gsQueryLimitReached = false; // Clear query limit flag
@simonster
simonster Jan 13, 2014 Member

Maybe it would be cleaner if this were a public property of ItemRecognizer

@simonster simonster and 1 other commented on an outdated diff Jan 13, 2014
chrome/content/zotero/recognizePDF.js
+	tries--;
+
+	var formData = doc && _extractCaptchaFormData(doc);
+	if(!formData) throw new Zotero.Exception.Alert('recognizePDF.limit');
+
+	var io = { dataIn: {
+		imgUrl: formData.img
+	}};
+
+	_progressWindow.openDialog("chrome://zotero/content/captcha.xul", "",
+		"chrome,modal,resizable=no,centerscreen", io);
+
+	if(!io.dataOut) {
+		return Q.try(function () {
+			throw new Zotero_RecognizePDF.CaptchaResult(false);
+		});
@simonster
simonster Jan 13, 2014 Member

I think this can just be return Q.reject(new Zotero_RecognizePDF.CaptchaResult(false))

@aurimasv
aurimasv Jan 13, 2014 Contributor

I thought that in Q, .reject() was only for deferred promises. I did see that it was a nice feature of the Bluebird promise library.

Edit: nvm, found it.
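
(For reference, the static form does exist in Q and is equivalent to wrapping the throw:)

	// Wrapping a synchronous throw:
	return Q.try(function () { throw new Zotero_RecognizePDF.CaptchaResult(false); });
	// Equivalent, more direct:
	return Q.reject(new Zotero_RecognizePDF.CaptchaResult(false));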

@simonster
Member

If we keep the close on blur, we should also clear the close timer when the window is refocused.

@aurimasv
Contributor

I'm not sure how I feel about removing the auto-blur close. I think you're right for cases where the user retrieves metadata for a lot of items, but for 1-2 items it should probably close on blur, since the user probably doesn't care about the information in the dialog. We could use the translate progress window for 1-2 items, increase the delay before closing on blur, or make the behavior change depending on the number of items (but that seems kind of weird).

Close on blur is just not a feature that I have encountered often (if ever) on Windows. The behavior is quite unexpected, particularly since there is a Close button, and the overhead of clicking it is minimal.

For a single file, the contents of the dialog may not be that useful to the user, since the file is automatically selected in the library. Although I would still argue that if an error is produced, the user might want the dialog to stay open while they type the error into their browser. We could of course skip the onBlur action in the case of an error (I'm not actually sure if we do), but then the behavior of that dialog becomes chaotic across different cases.

For more than one item (as few as 2 or 3), even successful retrieval of all items may warrant keeping the dialog open after it loses focus. Say the first item was a false positive, so it would not be highlighted after retrieval. If the user was retrieving metadata within a large collection or My Library, they may want to search for this item. If we close the dialog, they may lose the title of the new false-positive item and will have to resort to sorting by date added to find it (not the most obvious step for most users).

I just don't see enough utility in closing this dialog on blur to warrant the potential headache for the user.

@simonster
Member

I'll let @dstillman be the tie breaker on whether this should close on blur.

One additional question is whether we should even try to handle the case where there are multiple retrieve metadata processes happening at once. This is not trivial, since they might not even be opened from the same window. But maybe this is good enough, since that's not going to be particularly common, and I think with this PR we'd just show a captcha for each of them, which is not all that bad.

@aurimasv
Contributor

I haven't considered that case. I'll have to see if anything can be broken with multiple simultaneous metadata retrievals.

@aurimasv
Contributor

I hope that's what you meant. The relationship between Zotero_RecognizePDF and ItemRecognizer is a bit weird. It doesn't seem like ItemRecognizer should be calling a Zotero_RecognizePDF method (https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L465), but it works, so whatever.

@simonster
Member

I agree that ItemRecognizer should probably not be calling a method on Zotero_RecognizePDF. If you want to change that I'd accept a PR for it.

@simonster
Member

(But this looks good for now, provided that you don't see any problems with multiple retrieve metadata windows open simultaneously.)

@aurimasv
Contributor

The behavior with two Retrieve Metadata windows is a bit messed up when it comes to captchas. That's partly because Google Scholar will only accept a correct response to (probably) the most recent captcha, while the user is given two of them to respond to. I'll have to think about how best to approach this.

@dstillman
Member

I'm inclined to keep the close on blur, perhaps with a slightly longer timeout and with the close-on-blur disabled altogether if the window is refocused after being blurred once (so that you could switch away and back and then it wouldn't disappear).

I agree that it's a bit weird for it to disappear, but I don't like the idea of there potentially being a whole bunch of these windows lying around in the background after multiple Retrieve Metadata attempts. If we always sent results to the same window if one was already open, I might change my mind.

@aurimasv
Contributor

but I don't like the idea of there potentially being a whole bunch of these windows lying around in the background

The dialog is opened with a "dependent" flag, so it's always on top, but I see that this is not implemented on Mac OS X, which is why it presents this problem. Could we make the on-blur behavior present only on Mac? (I'll check Ubuntu to make sure it works as it does on Windows)

Edit: or we could use "alwaysRaised" to make the behavior consistent.

@dstillman
Member

Ah, OK. I agree that the auto-close is weirder when the window is on top, so I'm all right with making this Mac-only (and Linux depending on how it behaves).

I don't think alwaysRaised works on OS X (or, at least, it didn't used to, but I imagine if it did 'dependent' would work the same way).

@aurimasv
Contributor

OK. So I moved all of the GS code into its own singleton object. GS queries are now queued and processed one at a time, so there is only ever a single query to GS in flight, and captchas cannot get mixed up no matter how many windows are open at the same time. The downside is that each window has to wait its turn to get GS metadata, so it's a bit slower, but I think that's what we want anyway.
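
Roughly, the shape of the queue (a simplified sketch with illustrative names; the real object lives in recognizePDF.js, and _queryGS here stands in for the actual request + CAPTCHA handling):

	var GSQuery = new function() {
		var queue = [], inProgress = false;

		this.findItem = function(query) {
			var deferred = Q.defer();
			queue.push({ deferred: deferred, query: query });
			_processNext();
			return deferred.promise;
		};

		function _processNext() {
			if(inProgress || !queue.length) return;
			inProgress = true;
			var item = queue.shift();
			_queryGS(item.query)	// placeholder for the actual GS request
				.then(item.deferred.resolve, item.deferred.reject)
				.fin(function() {	// Q's "finally": always move on to the next queued query
					inProgress = false;
					_processNext();
				});
		}
	};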

There's one tiny pitfall left, which I don't think is really worth addressing (at least I don't see a straightforward way). If Google Scholar blocks us while we are fetching the BibTeX (not while performing the search), then we get an "Unexpected Error". The odds of that are not too high, and the user can redo the retrieval afterwards.

I was also able to trigger the 403 error from GS, which is basically a page saying "We think you're a bot, try again later" with no captcha to solve. The good news is that clearing a single cookie gets you back to the 503 response with a captcha. With that in mind, I added some code that will clear the cookie and retry (it will ask the user about this when using Firefox, since I'm not sure what the consequences of clearing this cookie are in terms of staying logged in or persisting some Google per-session settings). Unfortunately, after I put that code in place, I was no longer able to trigger the 403 error, so it's not really tested.
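
The clearing step itself is small; a sketch of the approach using the standard Mozilla cookie manager (the real code, which also prompts the user in Firefox, is in the diff further down):

	Components.utils.import("resource://gre/modules/Services.jsm");

	// Remove the GDSESS cookie for the Scholar host so that GS falls back
	// from the hard 403 block to the 503 + CAPTCHA response
	var cookies = Services.cookies.getCookiesFromHost("scholar.google.com");
	while(cookies.hasMoreElements()) {
		var cookie = cookies.getNext().QueryInterface(Components.interfaces.nsICookie2);
		if(cookie.name === "GDSESS") {
			Services.cookies.remove(cookie.host, cookie.name, cookie.path, false);
		}
	}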

Finally, I added the on-blur close timer for Mac, with a reset after re-focusing the window. I bumped the time up to 5 seconds.

@dstillman
Member

And we don't have any reason to believe that Google responds with an IP block if too many requests hit the 403 (after repeated cookie clearings)?

Finally, I added the on-blur close timer for Mac, with a reset after re-focusing the window. I bumped the time up to 5 seconds.

I assume we don't close on blur if the process is still running, right?

@aurimasv
Contributor

And we don't have any reason to believe that Google responds with an IP block if too many requests hit the 403 (after repeated cookie clearings)?

I don't know exactly what triggers the 403. But after clearing the cookie, you are presented with a captcha. You fill it out and you're back in business. In my experience, subsequent blocks have been 503s until a random 403 pops up again. I think it may be some left-over older code that is still being triggered somehow.

I assume we don't close on blur if the process is still running, right?

Of course.

I'll fix the messages.

@dstillman
Member

I think it may be some left-over older code that is still being triggered somehow.

Oh, I doubt that. I've always assumed that it's just a higher throttling threshold being reached, at which point they no longer particularly care to give you a captcha, hence my concern about persisting after the 403, in case there's a third threshold at which they block by IP address. But given that they're Google, and the prevalence of NAT and proxies, I assume usage from an IP has to be pretty extreme for them to do that.

@aurimasv
Contributor

FWIW, I found some discussion (in Russian) here (through Google Translate). Halfway down the page, they discuss the CAPTCHA page, the 403 page, and the GDSESS cookie. My observations don't quite match theirs in terms of getting the 403 page to display (which I haven't been able to do today) and the role of the GDSESS cookie. If anyone feels like toying around and trying to figure this out, that would be great. I'll keep fiddling with this myself.

@aurimasv aurimasv commented on an outdated diff Jan 15, 2014
chrome/content/zotero/recognizePDF.js
	// scroll to this item
-	me._progressWindow.document.getElementById("tree").treeBoxObject.scrollToRow(Math.max(0, me._itemTotal-me._items.length-5));
+	me._progressWindow.document.getElementById("tree").treeBoxObject.scrollToRow(Math.max(0, me._itemTotal-me._items.length-4));
@aurimasv
aurimasv Jan 15, 2014 Contributor

Just wanted to note that I altered this. At least on Windows, the item being looked up was always just below the visible area.

@aurimasv
Contributor

I didn't notice any side effects of removing the GDSESS cookie. I've hit the 403 page quite a few times today, and removing the cookie brought me back to the CAPTCHA page every time (except once, I think, but I can't be sure that the cookie got removed properly). I never got permanently blocked, and I probably ended up querying Google well over 1000 times. I also commented out the user prompt.

I think this should be ready for release.

@dstillman
Member

OK, thanks for testing all of that. I'm still a bit uncomfortable silently doing all of this on the user's behalf, though.

  1. I think we need a bit more explanatory text in the CAPTCHA window (which I haven't seen, but from the code it looks like it's just a title, image, and textbox) saying that this is for Google Scholar. Not totally sure how to phrase that, since it will be news to most users that Google Scholar has anything to do with this feature, so we may need to communicate that as well. "Zotero uses Google Scholar to help identify PDFs. To continue using Google Scholar, please enter the text from the image below." I'm not sure if we want to be any more specific about why they're getting a CAPTCHA.

  2. Do we slow down at all when we hit these CAPTCHAs, or do we just present them to the user and then carry on? Particularly if we're not identifying our requests (in Firefox, though Standalone does have its own user agent string), I'm a little uncomfortable sending potentially a huge number of requests to Google without showing at least some deference for these blocks. They're mostly to prevent scraping, presumably, but in theory they could also be for load management (if this weren't, you know, Google), and if you take them as a form of communication that the request volume is too high, it's a little inappropriate to keep going and even automate around the more severe block they put in place. I'm mostly concerned here that they would start IP-blocking Zotero users, but I also think we want to generally consider this an API and be respectful in our use of it. Now that we better handle multiple windows and the like, slowing down by a second or two when we hit a CAPTCHA and letting these requests run in the background seems like less of a problem. We could maybe even speed up again if a certain amount of time has passed without a CAPTCHA?

@dstillman
Member

I'm OK with not showing the user anything special when we clear the cookie, though, since that's client side and too complicated to explain anyway.

I could see adding 1-2 seconds to the request delay each time we hit a CAPTCHA or 403, and then reducing the delay by 1-2 seconds after some period of time without a CAPTCHA (the period chosen by someone with a better sense of Google blocking patterns). Not sure if the delay should reset to 2 seconds for a fresh attempt or persist (the latter assuming enough time hasn't passed).
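
Something along these lines (a hypothetical sketch of that policy; the constants are placeholders, not settled values):

	var BASE_DELAY = 2000;			// normal wait between GS requests (ms)
	var MAX_DELAY = 10000;
	var COOL_OFF = 10 * 60 * 1000;	// CAPTCHA-free period before easing off
	var delay = BASE_DELAY, lastBlock = 0;

	function onCaptchaOr403() {
		delay = Math.min(delay + 2000, MAX_DELAY);
		lastBlock = Date.now();
	}

	function currentDelay() {
		if(delay > BASE_DELAY && Date.now() - lastBlock > COOL_OFF) {
			delay -= 2000;
			lastBlock = Date.now();	// require another quiet period per step down
		}
		return delay;
	}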

@aurimasv
Contributor

Do we slow down at all when we hit these CAPTCHAs, or do we just present them to the user and then carry on?

There's no additional slow-down besides the usual 2-second wait between requests. I suppose we could bump it up to 5 seconds, though I don't think it would help us avoid further CAPTCHAs. Based on what I've read, the limits imposed by Google Scholar are way too low for us to stay under them and still seem like we're actually doing something. So in terms of being respectful of the API, yes, maybe we can back off a bit, but we're still going to hit the CAPTCHAs, so I don't see much point. Perhaps make the delay longer in general (i.e., always wait 5 seconds between GS queries)? I'm not sure if that would help avoid the 403 pages. I'd have to test it.

Also, I'm pretty sure we're not imposing a delay between querying GS and retrieving the BibTeX file for the result. I think that would probably be a bit harder to do.

Particularly if we're not identifying our requests (in Firefox, though Standalone does have its own user agent string), I'm a little uncomfortable sending potentially a huge number of requests to Google without showing at least some deference for these blocks.

Are you suggesting that we should send a unique User-Agent header?

They're mostly to prevent scraping, presumably, but in theory they could also be for load management (if this weren't, you know, Google), and if you take them as a form of communication that the request volume is too high, it's a little inappropriate to keep going and even automate around the more severe block they put in place.

Yes, but then we're mostly back to square one. Have the users wait out the 403 errors?

Now that we better handle multiple windows and the like, slowing down by a second or two when we hit a CAPTCHA and letting these requests run in the background seems like less of a problem.

What do you mean by running in the background?

We could maybe even speed up again if a certain amount of time has passed without a CAPTCHA?

In my experience you end up hitting the CAPTCHA every 30-40 PDFs, so there's not really much time to speed up again. I suppose with PDFs that are more diverse (i.e. including DOIs) it could take longer.

I wouldn't want Zotero to be flagged by Google Scholar, but I also feel that we can't really stay under the limits imposed by Google Scholar and not be annoying to the users. Would be great if we could communicate with Google about this, but obviously that's not going to happen. Basically, idk what the best querying scheme would be. Since we're probably not actually causing any damage to Google with our automated queries, I think we could just give the current system a shot and see what happens. Otherwise, whatever throttling you decide on will be fine by me.

@aurimasv aurimasv commented on the diff Jan 16, 2014
chrome/locale/en-US/zotero/zotero.dtd
@@ -253,7 +255,6 @@
<!ENTITY zotero.recognizePDF.cancel.label "Cancel">
<!ENTITY zotero.recognizePDF.pdfName.label "PDF Name">
<!ENTITY zotero.recognizePDF.itemName.label "Item Name">
-<!ENTITY zotero.recognizePDF.captcha.label "Type the text below to continue retrieving metadata.">
@aurimasv
aurimasv Jan 16, 2014 Contributor

No longer used, so I removed it.

@dstillman
Member

Based on what I've read around, the limits imposed by Google Scholar are way too low for us to try to avoid them and still seem like we're actually doing something. So in terms of being respectful of the API, yes, maybe we can back off a bit, but we're still going to be hitting the CAPTCHAs, so I don't see much point.

That's fair.

Perhaps make the delay longer in general (i.e. always wait 5 seconds in-between GS queries)?

I don't know. I don't particularly want to make the initial interval shorter.

Are you suggesting that we should send a unique User-Agent header?

Not necessarily, but if we did, we could maybe worry less about the request rate, since if they had a problem with it they could block us specifically (or, ideally, contact us) — or they might see that it's Zotero, realize it's fairly innocuous usage, and ignore it.

if you take them as a form of communication that the request volume is too high, it's a little inappropriate to keep going and even automate around the more severe block they put in place.

Yes, but then we're mostly back to square one. Have the users wait out the 403 errors?

No, sorry, I wasn't suggesting that we shouldn't clear the cookie. I meant keep going without any sort of change in behavior.

Now that we better handle multiple windows and the like, slowing down by a second or two when we hit a CAPTCHA and letting these requests run in the background seems like less of a problem.

What do you mean by running in the background?

That the user can leave this running while they do other things, more easily than before. I do think that as long as the progress window works well and can run without interfering with other usage, it doesn't matter all that much how long this takes.

In my experience you end up hitting the CAPTCHA every 30-40 PDFs, so there's not really much time to speed up again.

Does that change at all with a 5-second delay?

@aurimasv aurimasv and 1 other commented on an outdated diff Jan 16, 2014
chrome/content/zotero/recognizePDF.js
+		//ask user first
+		var response = Components.classes["@mozilla.org/embedcomp/prompt-service;1"]
+			.getService(Components.interfaces.nsIPromptService)
+			.confirm(null, "Clear Google Scholar cookies?",
+				"Google Scholar is attempting to block further queries. We can "
+				+ "clear certain cookies and try again. This may affect some "
+				+ "temporary Google preferences or it may log you out. May we clear"
+				+ " your Google Scholar cookies?");
+		if(!response) return;
+	}*/
+
+	//find GDSESS cookie
+	var removed = false, cookies = cookieService.getCookiesFromHost(host);
+	while(cookies.hasMoreElements()) {
+		var cookie = cookies.getNext().QueryInterface(Components.interfaces.nsICookie2);
+		if(["GDSESS", "PREF"].indexOf(cookie.name) !== -1) {
@aurimasv
aurimasv Jan 16, 2014 Contributor

I was doing some more 403 testing, and it seems that sometimes (I think if you're logged in) removing GDSESS is not sufficient. You also need to remove PREF (we can imagine what that cookie controls; it doesn't log you out, though). I wouldn't be opposed to not removing this cookie and just letting the user deal with it. It's hackish as it is.

@dstillman
dstillman Jan 16, 2014 Member

Well, we can certainly clear it in Standalone. I think it's probably fine to clear it in Firefox too if it's Scholar-specific. It'd be unfortunate to unexpectedly clear the user's Scholar prefs (though doesn't the translator modify those anyway?), but it's worse to unexpectedly lock them out of Google Scholar.

If that cookie stores non-Scholar stuff too, we probably shouldn't touch it.

@aurimasv
aurimasv Jan 22, 2014 Contributor

According to http://www.google.com/policies/technologies/types/, nothing critical, but of course it's your call

@dstillman
dstillman Jan 22, 2014 Member

Ah, good find. I'm fine with clearing that, given the alternative.

@aurimasv aurimasv and 1 other commented on an outdated diff Jan 16, 2014
chrome/content/zotero/captcha.js
+			errorMsg.textContent = this._io.dataIn.error;
+			errorMsg.hidden = false;
+		} else {
+			errorMsg.hidden = true;
+		}
+
+		document.getElementById('zotero-captcha-image').src = this._io.dataIn.imgUrl;
+		document.getElementById('zotero-captcha-input').focus();
+	}
+
+	this.imageOnLoad = function() {
+		var width = document.getElementById('zotero-captcha-image')
+			.getBoundingClientRect().width;
+		Zotero.debug(width);
+		document.getElementById('zotero-captcha-description').width = width;
+		document.getElementById('zotero-captcha-error').width = width;
@aurimasv
aurimasv Jan 16, 2014 Contributor

I cannot for the life of me get the description to wrap instead of stretching the pop-up. The above doesn't work either. Any thoughts? I'd like to avoid setting an explicit width (we could use a min-width), since I feel that the image should be the limiting factor here.

@dstillman
dstillman Jan 16, 2014 Member

I've had that problem with descriptions in the prefs. Setting an explicit width was the only way I was able to fix it.
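
e.g., the workaround would look something like this (a sketch; the 300px is arbitrary):

	// Give the XUL <description> an explicit width so its text wraps
	// instead of stretching the dialog
	document.getElementById('zotero-captcha-description').style.width = '300px';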

@dstillman
Member

Haven't we also seen cases where people were locked out in all browsers after retrieving metadata (so some sort of IP address block), or am I imagining that?

@aurimasv
Contributor

Are you suggesting that we should send a unique User-Agent header?

Not necessarily, but if we did, we could maybe worry less about the request rate, since if they had a problem with it they could block us specifically (or, ideally, contact us) — or they might see that it's Zotero, realize it's fairly innocuous usage, and ignore it.

I rather like this. I think it's fair to announce ourselves. It would also send the message that we're not really trying to abuse the service.
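
Purely as an illustration (neither the string nor the mechanism was settled in this thread):

	// Append an identifying token to the UA for GS requests from Firefox
	xmlhttp.setRequestHeader("User-Agent", navigator.userAgent + " Zotero/4.0");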

In my experience you end up hitting the CAPTCHA every 30-40 PDFs, so there's not really much time to speed up again.

Does that change at all with a 5-second delay?

My preliminary testing seems to indicate that there's no difference.

Well, we can certainly clear it in Standalone. I think it's probably fine to clear it in Firefox too if it's Scholar-specific. It'd be unfortunate to unexpectedly clear the user's Scholar prefs (though doesn't the translator modify those anyway?), but it's worse to unexpectedly lock them out of Google Scholar.

If that cookie stores non-Scholar stuff too, we probably shouldn't touch it.

Both of those are set on the "google.com" domain. I'm fairly confident that GDSESS doesn't play much of a role in normal browsing (it seems to appear only when CAPTCHAs are involved). PREF, on the other hand, I think is always present. It certainly doesn't log you out, and I'm not sure what sort of temporary preferences it could be affecting, since if you're logged in, your important preferences should be permanent. On the other hand, if we clear it when the user is not logged in, it may affect something.

Haven't we also seen cases where people were locked out in all browsers after retrieving metadata (so some sort of IP address block), or am I imagining that?

https://forums.zotero.org/discussion/12036/locked-out-of-google-scholar/ is the closest thing I found, but it seems to be about the 403 page that does not present a captcha, which we now know how to bypass. I can still see how users might encounter this if they use a combination of Retrieve Metadata and Safari/Chrome/Opera, or if they used the URL bar icon to retrieve a lot of sources. We could make a note somewhere on the website that they can clear cookies and will then be presented with a captcha.

@aurimasv
Contributor

GUI design is not my cup of coffee. The CAPTCHA pop-up looks presentable to me, but you'll probably want to tweak it to look better.

@aurimasv
Contributor

Since we're fixing all of this up, here are two more issues that I'm thinking of:

  • Do not close the window when the user presses Cancel during retrieval; just stop it. More relevant for multiple items than for a single item.
  • Make the columns resizable. Sometimes it's nice to see more of the filename to determine if the metadata is correct. Also, column size doesn't have to persist.
@dstillman
Member

Those both sound good to me.

@aurimasv
Contributor

I think I covered all but #446. I don't have time for that one right now.

@aurimasv
Contributor

done

@dstillman
Member

OK, great, thanks. That's it from me, at least for the moment. @simonster is going to test this some more in the next day or two.

@simonster simonster commented on an outdated diff Jan 23, 2014
chrome/content/zotero/recognizePDF.js
+	Components.utils.import("resource://gre/modules/Services.jsm");
+	var cookieService = Services.cookies;
+
+	this.resetQueryLimit = function() {
+		queryLimitReached = false;
+	};
+
+	this.findItem = function(lines, libraryID, stopCheckCallback) {
+		if(!inProgress && queryLimitReached) {
+			//there's no queue, so we can reject immediately
+			return Q.reject(new Zotero.Exception.Alert("recognizePDF.limit"));
+		}
+
+		var deferred = Q.defer();
+		queue.push({
+			promise: deferred,
@simonster
simonster Jan 23, 2014 Member

Can we call this property deferred instead of promise?

@simonster
Member

Except for that tiny nit, looks good to me and seems to work quite well. Thanks for doing this!

@jgosmann

I am switching to Zotero and had to retrieve metadata for about 600 to 700 papers. Of course I got locked out of Google Scholar when I first tried. But then I checked out this pull request and it worked like a charm (occasionally entering a captcha). Thanks!

@aurimasv
Contributor

@jgosmann, thanks for testing! Glad to see that it's working well.

@simonster, @dstillman fixed the nit and also fixed #446. Please take a look

Edit: and I guess I'm tired now, since my commit messages don't make sense. I'll fix them when squashing

@aurimasv
Contributor

Except for the _rowIDs, I made all the requested changes. Also added/fixed comments. Let me know if I should clean up the commits and get this ready for merging.

@simonster
Member

Looks good to me. Thanks for your patience.

aurimasv added some commits Dec 31, 2013
@aurimasv aurimasv [Retrieve Metadata] Recognize HTTP 503 code as Google Scholar CAPTCHA + other tweaks.

* Stop metadata retrieval when cancelled
* Display CAPTCHA dialog
* Don't close window on blur
* Use Zotero.Utilities.cleanISBN to validate ISBNs
b3da19e
@aurimasv aurimasv [Retrieve Metadata] Look for JSTOR stable URLs and convert to DOIs 4bedb61
@aurimasv aurimasv [Retrieve Metadata] Use a single queue to query Google Scholar. Window closing tweaks.

* Close window on blur after completion on Mac (revert previous change)
* Don't close window when canceling
* Add Esc handler to cancel/close window
* Allow columns to be resized
* Fixes #445
* Fixes #444
57350fa
@aurimasv aurimasv [Retrieve Metadata] Focus item in Zotero pane when double-clicking
* Fixes #446
859e99f
@aurimasv aurimasv [Retrieve Metadata] Add/fix comments 3c21e7c
@aurimasv
Contributor

Great. Trimmed down the commits.

@simonster simonster merged commit 71a3751 into zotero:4.0 Jan 31, 2014
@aurimasv aurimasv deleted the aurimasv:retrieve-meta branch Jan 31, 2014
@aurimasv
Contributor

Doesn't look like this automatically closed #444, #445, or #446

@dstillman
Member

Issues don't get closed automatically until the commit hits the master branch.
