OSF Preprint: fix broken detection for multiple search results; unify search-result & individual-article scraping #3162

zoe-translates · 2023-10-19T09:15:46Z

- Update search-result page URL pattern and selectors for search-result entries.
- Don't use a separate code path for scraping the current page; use the same scrape() function for both search results and current page.
- Asyncify the network requests.
- Clean up any HTML entities in the API-returned text fields (title, abstract, etc.).

- Update search-result page URL pattern and selectors for search-result entries.
- Be less eager in monitoring DOM change; do this only when the page could possibly be a search-result page as determined by URL.
- Update a test case due to change in canonical URL field in output.
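As a rough illustration of gating DOM monitoring on the URL, here is a minimal sketch. The path segments `search` and `discover`, the function names, and the injected `monitor` callback are all assumptions made for the example; the PR does not quote the actual pattern or the monitoring call.

```javascript
// Hypothetical check: treat a page as a possible search-result page only if
// its path contains a "search" or "discover" segment (assumed, not from the PR).
function couldBeSearchPage(url) {
	const segments = new URL(url).pathname.split("/").filter(Boolean);
	return segments.includes("search") || segments.includes("discover");
}

// Start monitoring only when the URL qualifies. `monitor` is a caller-supplied
// stand-in for whatever DOM-observing function the translator uses.
function maybeMonitor(url, target, monitor) {
	if (!couldBeSearchPage(url)) {
		return false;
	}
	monitor(target);
	return true;
}
```

With this shape, an individual preprint page never registers an observer, which is the "less eager" behavior the commit message describes.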

- Don't use a separate code path for scraping the current page; use the same scrape() function for both search results and current page
- Asyncify the network requests
- Clean up any HTML entities in the API-returned text fields (title, abstract, etc.)
- More reliable way to extract the "id" of individual preprints: it's the last segment in the path
- Overall reduction of code duplication
- Update and add tests
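The "last segment in the path" rule could look roughly like the sketch below. The function name `getPreprintID` and the example URLs are hypothetical; only the rule itself comes from the commit message.

```javascript
// Return the last non-empty path segment of a preprint URL, or null if the
// path is empty. Query strings and trailing slashes are ignored.
function getPreprintID(url) {
	const segments = new URL(url).pathname.split("/").filter(Boolean);
	return segments.length ? segments[segments.length - 1] : null;
}

// Hypothetical usage:
// getPreprintID("https://osf.io/preprints/psyarxiv/abc12/") gives "abc12"
```

Parsing the URL first, rather than splitting the raw string, keeps query parameters and fragments out of the extracted id.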

The API endpoint may respond with HTML (for human consumption) depending on a variety of factors. To prevent this, explicitly add "Accept:" header to the request.
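A minimal sketch of sending an explicit Accept header follows. The endpoint shape (`https://api.osf.io/v2/preprints/<id>/`), the media type `application/vnd.api+json`, and the helper names are assumptions for illustration and are not quoted in this PR; `requestJSON` is passed in by the caller so the sketch does not depend on any particular request helper.

```javascript
// Build the request URL and options for one preprint id.
// Endpoint and media type are assumed, not taken from the PR.
function buildAPIRequest(id) {
	return {
		url: `https://api.osf.io/v2/preprints/${encodeURIComponent(id)}/`,
		options: {
			headers: { Accept: "application/vnd.api+json" }
		}
	};
}

// Fetch the preprint record using a caller-supplied JSON request function
// with the signature (url, options) => Promise<object>.
async function fetchPreprintJSON(id, requestJSON) {
	const { url, options } = buildAPIRequest(id);
	return requestJSON(url, options);
}
```

Stating the desired media type up front removes the server's freedom to pick an HTML representation, whatever factors would otherwise trigger it.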

AbeJellinek · 2023-10-27T15:43:40Z

OSF Preprints.js

-	item.date = attr.date_published;
+	// let type = inputJSON.data.type
+	item.title = ZU.unescapeHTML(attributes.title);
+	item.abstractNote = ZU.unescapeHTML(attributes.description);


Probably should do

Suggested change

-	item.abstractNote = ZU.unescapeHTML(attributes.description);
+	item.abstractNote = attributes.description && ZU.unescapeHTML(attributes.description);

because the current implementation of ZU.unescapeHTML() (which is weird, to say the least) looks like it will error on a null/undefined input.

zoe-translates · 2023-10-27

That does seem to be the case. I changed these lines to the form ZU.unescapeHTML(something || "") so that a nullish value is never passed in.
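The two guards discussed in this thread behave slightly differently, which the sketch below tries to show. `strictUnescape` is a hypothetical stand-in that, like the behavior described above, fails on nullish input; it is not the real `ZU.unescapeHTML()` and handles only a handful of entities.

```javascript
// Hypothetical stand-in for an unescaping helper; throws a TypeError when
// given null or undefined because it calls .replace() on its argument.
function strictUnescape(text) {
	return text
		.replace(/&lt;/g, "<")
		.replace(/&gt;/g, ">")
		.replace(/&quot;/g, '"')
		.replace(/&#39;/g, "'")
		.replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;", not "<"
}

// Guard in the style adopted in the PR: always pass a string.
function unescapeOrEmpty(value) {
	return strictUnescape(value || "");
}

// Guard in the style of the review suggestion: skip the call when falsy.
function unescapeIfPresent(value) {
	return value && strictUnescape(value);
}
```

Both avoid the error. The `|| ""` form always yields a string (empty when the field is missing), whereas the `&&` form passes the original falsy value through unchanged.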

AbeJellinek · 2023-10-27T15:45:45Z

OSF Preprints.js

 				"libraryCatalog": "OSF Preprints",
-				"repository": "OSF Preprints",
+				"repository": "Open Science Framework",


Not sure anyone actually calls it this - their site only uses "OSF" and I had no idea that's what it stands for.

zoe-translates · 2023-10-28

This change comes from the data source. In our code we have

item.publisher = embeds.provider.data.attributes.name;

and it appears that the exact phrasing the OSF uses to describe itself has changed to "Open Science Framework".

I think the original intent behind setting publisher/repository from the embedded provider data was that some OSF projects (PsyArXiv, for example) identify themselves programmatically in this way. That lets us get the more specific repository name (e.g. PsyArXiv) rather than the umbrella OSF.
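To make the provider lookup concrete, here is a small sketch that reads the name from the embedded provider data. The fallback value and the sample objects are assumptions added for illustration; the translator line quoted above assigns the name directly, without a fallback.

```javascript
// Read the repository name from embedded provider data, falling back to a
// generic label when the embed is missing (the fallback is hypothetical).
function getRepositoryName(embeds, fallback = "OSF Preprints") {
	const name = embeds?.provider?.data?.attributes?.name;
	return name || fallback;
}

// Hypothetical sample shaped like the path used in the translator:
const sampleEmbeds = {
	provider: { data: { attributes: { name: "PsyArXiv" } } }
};
```

Here `getRepositoryName(sampleEmbeds)` yields the provider-specific name, while an empty embeds object yields the fallback.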

zoe-translates added 3 commits October 19, 2023 15:35

OSF: Add Accept: header to API request just in case

76929f9

zoe-translates mentioned this pull request Oct 19, 2023

PsyArXiv: Error saving last item on search-results page #3154

Open

AbeJellinek requested changes Oct 27, 2023

zoe-translates added 4 commits October 28, 2023 00:05

OSF Preprints: Prevent passing nullish value into ZU.unescapeHTML()

dce6216

OSF Preprints: Update metadata fields for attachment and authors

78fbbb5

- The PDF attachment is named "Preprint PDF"
- Use `ZU.cleanAuthor()` to normalize author name in consistency with other translators

OSF Preprints: Simplify target/identification regexes

185e3b1

The domain osf.io is now the host of the discipline-specific projects, so we only need to match that.
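A single-host match of this kind might be sketched as follows. The pattern and names below are illustrative assumptions, not the regex committed in the PR, and the example URLs are hypothetical.

```javascript
// Hypothetical target pattern: any HTTPS page hosted on osf.io itself.
const OSF_TARGET_RE = /^https:\/\/osf\.io\//;

function isOSFPage(url) {
	return OSF_TARGET_RE.test(url);
}
```

Anchoring at the start and requiring the slash after the host keeps look-alike hosts such as `osf.io.example.com` from matching.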

OSF Preprints: Bump minVersion to 6.0, unconditionally support preprint

1351404

zoe-translates requested a review from AbeJellinek October 28, 2023 15:22
