Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic Scholar extend PDF extraction + fix errors when logged in #2103

Closed
wants to merge 2 commits into from

Conversation

GuyAglionby
Copy link
Contributor

Some relatively minor changes

  • Change method of checking if a PDF is available, as they removed the hasPDF property
  • Go through list of alternative paper URLs to find a PDF if the main link isn't one (the ones identified the old way, via 's2' and 'arxiv', all ended in .pdf already).
  • More robust way of combing through the encoded data, as the previous method broke when you were logged in
  • Updated the tests to reflect underlying changes on website (including new URLs). Removed one as it now redirects to a different paper. It wasn't covering anything the others don't.

@GuyAglionby GuyAglionby changed the title Extend PDF extraction + fix errors when logged in Semantic Scholar extend PDF extraction + fix errors when logged in Dec 31, 2019
@zuphilip zuphilip added the Improvements Pull requests that are improving existing translators label Dec 31, 2019
Copy link
Contributor

@zuphilip zuphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This looks fine. I only have some small comments for simplification of the code as well as make the extraction of arXiv IDs more general as well as one question. Everything should be easy to implement. Let me know if my comments/suggestions are not yet clear.

pdfLinkElement = rawData.primaryPaperLink;
}
else if (rawData.alternatePaperLinks) {
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
for (let alternateElement of rawData.alternatePaperLinks) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the following line is not anymore needed (and you never use the variable i anyways here), so this is a simplification of your code.

else if (rawData.alternatePaperLinks) {
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
let alternateElement = rawData.alternatePaperLinks[i];
if (alternateElement.url.endsWith('.pdf')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (alternateElement.url.endsWith('.pdf')) {
if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then delete the break below and move the code about the arXiv ID up here. (This then covers the cases that there is pdf link which we use to the element, but there is also also another link following to arXiv.)

mimeType: 'application/pdf'
});

if (pdfLinkElement.linkType == 'arxiv') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this code block up.

"itemID": "Dalvi2018TrackingSC",
"libraryCatalog": "Semantic Scholar",
"proceedingsTitle": "NAACL-HLT",
"publicationTitle": "NAACL-HLT",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are both fields here come from Scaffold?

This looks strange as proceedingsTitle is only some sort of alias to publicationTitle...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scaffold doesn't work well with updating tests for some reason -- the element used to determine type in detectWeb isn't found for some reason, even with defer: true. This is despite it appearing when I wget or curl a page. Not sure why this is, but the tests run/pass as usual.

@GuyAglionby
Copy link
Contributor Author

Thanks for the review -- I don't think I understood the exact changes you suggested with the arXiv IDs, but I think the amended code should incorporate the idea

@adam3smith
Copy link
Collaborator

I believe this is superseded by #2112 which includes PDF scraping, but haven't compared closely

@GuyAglionby
Copy link
Contributor Author

Yep, I think that's the case

@adam3smith adam3smith closed this Jan 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Improvements Pull requests that are improving existing translators
Development

Successfully merging this pull request may close these issues.

3 participants