-
Notifications
You must be signed in to change notification settings - Fork 749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Semantic Scholar extend PDF extraction + fix errors when logged in #2103
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This looks fine. I only have some small comments for simplification of the code as well as make the extraction of arXiv IDs more general as well as one question. Everything should be easy to implement. Let me know if my comments/suggestions are not yet clear.
Semantic Scholar.js
Outdated
pdfLinkElement = rawData.primaryPaperLink; | ||
} | ||
else if (rawData.alternatePaperLinks) { | ||
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) { | |
for (let alternateElement of rawData.alternatePaperLinks) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the following line is not anymore needed (and you never use the variable i
anyways here), so this is a simplification of your code.
Semantic Scholar.js
Outdated
else if (rawData.alternatePaperLinks) { | ||
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) { | ||
let alternateElement = rawData.alternatePaperLinks[i]; | ||
if (alternateElement.url.endsWith('.pdf')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (alternateElement.url.endsWith('.pdf')) { | |
if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then delete the break
below and move the code about the arXiv ID up here. (This then covers the cases that there is pdf link which we use to the element, but there is also also another link following to arXiv.)
Semantic Scholar.js
Outdated
mimeType: 'application/pdf' | ||
}); | ||
|
||
if (pdfLinkElement.linkType == 'arxiv') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this code block up.
Semantic Scholar.js
Outdated
"itemID": "Dalvi2018TrackingSC", | ||
"libraryCatalog": "Semantic Scholar", | ||
"proceedingsTitle": "NAACL-HLT", | ||
"publicationTitle": "NAACL-HLT", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are both fields here come from Scaffold?
This looks strange as proceedingsTitle
is only some sort of alias to publicationTitle
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scaffold doesn't work well with updating tests for some reason -- the element used to determine type in detectWeb
isn't found for some reason, even with defer: true
. This is despite it appearing when I wget or curl a page. Not sure why this is, but the tests run/pass as usual.
Thanks for the review -- I don't think I understood the exact changes you suggested with the arXiv IDs, but I think the amended code should incorporate the idea |
I believe this is superseded by #2112 which includes PDF scraping, but haven't compared closely |
Yep, I think that's the case |
Some relatively minor changes
hasPDF
property