Semantic Scholar extend PDF extraction + fix errors when logged in #2103

GuyAglionby · 2019-12-31T17:43:38Z

Some relatively minor changes:

- Change the method of checking whether a PDF is available, as they removed the hasPDF property.
- Go through the list of alternative paper URLs to find a PDF if the main link isn't one (the ones identified the old way, via 's2' and 'arxiv', all ended in .pdf already).
- More robust way of combing through the encoded data, as the previous method broke when you were logged in.
- Updated the tests to reflect underlying changes on the website (including new URLs). Removed one as it now redirects to a different paper; it wasn't covering anything the others don't.
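
For readers skimming the thread, the PDF lookup described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: `primaryPaperLink`, `alternatePaperLinks` and `url` are field names that appear in the diff, while the function name `findPdfLink`, the `.pdf` suffix check on the main link, and the sample data are assumptions.

```javascript
// Hypothetical helper (not from the PR): return the first link whose URL
// ends in ".pdf", preferring the main link over the alternatives.
function findPdfLink(rawData) {
	const primary = rawData.primaryPaperLink;
	// Assumption: a main link counts as a PDF when its URL ends in ".pdf".
	if (primary && primary.url && primary.url.endsWith('.pdf')) {
		return primary;
	}
	// Otherwise fall back to the alternative links, in order.
	for (const alternate of rawData.alternatePaperLinks || []) {
		if (alternate.url && alternate.url.endsWith('.pdf')) {
			return alternate;
		}
	}
	return null;
}

// Made-up sample data: the main link is not a PDF, the second alternative is.
const sample = {
	primaryPaperLink: { url: 'https://example.org/paper/123' },
	alternatePaperLinks: [
		{ url: 'https://example.org/landing.html' },
		{ url: 'https://example.org/paper-123.pdf' }
	]
};
console.log(findPdfLink(sample).url);
```

Returning `null` when nothing matches lets the caller skip the attachment entirely instead of attaching a non-PDF URL.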

zuphilip

Thank you! This looks fine. I only have some small comments to simplify the code and to make the extraction of arXiv IDs more general, plus one question. Everything should be easy to implement. Let me know if any of my comments or suggestions are unclear.

zuphilip · 2019-12-31T21:37:54Z

Semantic Scholar.js

+				pdfLinkElement = rawData.primaryPaperLink;
+			}
+			else if (rawData.alternatePaperLinks) {
+				for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {


Suggested change

-				for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
+				for (let alternateElement of rawData.alternatePaperLinks) {

Then the following line is no longer needed (you never use the variable i here anyway), so this simplifies your code.

zuphilip · 2019-12-31T21:39:53Z

Semantic Scholar.js

+			else if (rawData.alternatePaperLinks) {
+				for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
+					let alternateElement = rawData.alternatePaperLinks[i];
+					if (alternateElement.url.endsWith('.pdf')) {


Suggested change

-					if (alternateElement.url.endsWith('.pdf')) {
+					if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) {

Then delete the break below and move the code about the arXiv ID up here. (This then covers the case where there is a PDF link, which we assign to pdfLinkElement, but there is also another link after it that points to arXiv.)
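
A minimal sketch of the loop shape this comment seems to describe, assuming the link objects carry `url` and `linkType` as in the diff. The function name `collectLinks`, the regular expression for the arXiv ID, and the sample data are hypothetical and not taken from the PR.

```javascript
// Hypothetical sketch: one pass over the alternative links, with no break,
// so that a later arXiv link is still seen after a PDF has been found.
function collectLinks(rawData) {
	let pdfLinkElement = null;
	let arxivId = null;
	for (const alternateElement of rawData.alternatePaperLinks || []) {
		// Keep only the first PDF link; the guard replaces the old break.
		if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) {
			pdfLinkElement = alternateElement;
		}
		// Assumption: the arXiv ID is the path part after /abs/ or /pdf/,
		// without a trailing ".pdf".
		if (!arxivId && alternateElement.linkType == 'arxiv') {
			const match = alternateElement.url.match(/arxiv\.org\/(?:abs|pdf)\/([^?#]+?)(?:\.pdf)?$/);
			if (match) {
				arxivId = match[1];
			}
		}
	}
	return { pdfLinkElement, arxivId };
}

// Made-up data: a publisher PDF comes first, an arXiv abstract link follows.
const result = collectLinks({
	alternatePaperLinks: [
		{ url: 'https://example.org/a.pdf', linkType: 'publisher' },
		{ url: 'https://arxiv.org/abs/1234.56789', linkType: 'arxiv' },
		{ url: 'https://example.org/b.pdf', linkType: 's2' }
	]
});
console.log(result.pdfLinkElement.url, result.arxivId);
```

With a `break` after the first PDF, the arXiv link in the second position would never be inspected; the `!pdfLinkElement` guard keeps the first PDF while letting the loop continue.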

zuphilip · 2019-12-31T21:44:05Z

Semantic Scholar.js

+					mimeType: 'application/pdf'
+				});
+
+				if (pdfLinkElement.linkType == 'arxiv') {


Move this code block up.

zuphilip · 2019-12-31T21:51:51Z

Semantic Scholar.js

 				"itemID": "Dalvi2018TrackingSC",
 				"libraryCatalog": "Semantic Scholar",
 				"proceedingsTitle": "NAACL-HLT",
+				"publicationTitle": "NAACL-HLT",


Do both fields here come from Scaffold?

This looks strange, as proceedingsTitle is only a sort of alias for publicationTitle...

Scaffold doesn't work well for updating the tests: the element used to determine the type in detectWeb isn't found, even with defer: true, although it does appear when I wget or curl a page. I'm not sure why this is, but the tests run and pass as usual.

GuyAglionby · 2020-01-10T12:44:31Z

Thanks for the review -- I don't think I understood the exact changes you suggested with the arXiv IDs, but I think the amended code should incorporate the idea

adam3smith · 2020-01-19T01:38:01Z

I believe this is superseded by #2112, which includes PDF scraping, but I haven't compared closely.

GuyAglionby · 2020-01-19T01:39:33Z

Yep, I think that's the case

Extend PDF extraction + fix errors when logged in

ad09094

GuyAglionby changed the title ~~Extend PDF extraction + fix errors when logged in~~ Semantic Scholar extend PDF extraction + fix errors when logged in Dec 31, 2019

zuphilip added the Improvements Pull requests that are improving existing translators label Dec 31, 2019

zuphilip reviewed Dec 31, 2019

Semantic scholar PDF fixes

f75a1e6

adam3smith closed this Jan 19, 2020
