OSF Preprint: fix broken detection for multiple search results; unify search-result & individual-article scraping #3162

Open
wants to merge 7 commits into
base: master

zoe-translates
Collaborator

  • Update search-result page URL pattern and selectors for search-result entries.
  • Don't use a separate code path for scraping the current page; use the same scrape() function for both search results and current page
  • Asyncify the network requests
  • Clean up any HTML entities in the API-returned text fields (title, abstract, etc.)

- Update search-result page URL pattern and selectors for search-result
  entries.
- Be less eager in monitoring DOM changes; do this only when the page
  could possibly be a search-result page, as determined by its URL.
- Update a test case due to change in canonical URL field in output.
- Don't use a separate code path for scraping the current page; use the
  same scrape() function for both search results and current page
- Asyncify the network requests
- Clean up any HTML entities in the API-returned text fields (title,
  abstract, etc.)
- More reliable way to extract the "id" of individual preprints: it's
  the last segment in the path
- Overall reduction of code duplication
- Update and add tests
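
As an illustration of the "last segment in the path" point above, here is a minimal sketch. The helper name `getPreprintID` and the example URLs are hypothetical and not taken from the translator; it only assumes the ID is the final non-empty path segment.

```javascript
// Hypothetical helper (not the translator's actual code): return the last
// non-empty path segment of a preprint URL, or null if the path is empty.
function getPreprintID(url) {
	// The URL parser drops the query string and fragment from `pathname`.
	const { pathname } = new URL(url);
	// filter(Boolean) removes empty segments caused by leading/trailing slashes.
	const segments = pathname.split("/").filter(Boolean);
	return segments.length ? segments[segments.length - 1] : null;
}

// Made-up example URL, used only to exercise the helper.
console.log(getPreprintID("https://osf.io/preprints/psyarxiv/abc12/"));
```

Taking the last segment avoids depending on how many path components precede the ID, which is the property the commit message relies on.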
The API endpoint may respond with HTML (for human consumption) depending
on a variety of factors. To prevent this, explicitly add "Accept:"
header to the request.
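
A rough sketch of what adding the header could look like. The function name `buildPreprintRequest`, the `embed=provider` query parameter, and the choice of the JSON:API media type are assumptions made for illustration; the PR may use different values.

```javascript
// Assumption: the API speaks JSON:API, whose registered media type is this.
const JSON_API_TYPE = "application/vnd.api+json";

// Hypothetical helper: build the URL and request options for one preprint.
function buildPreprintRequest(id) {
	return {
		// Assumed endpoint shape; the ID is encoded defensively.
		url: `https://api.osf.io/v2/preprints/${encodeURIComponent(id)}/?embed=provider`,
		// An explicit Accept header asks for JSON rather than the HTML view.
		options: { headers: { Accept: JSON_API_TYPE } }
	};
}

const { url, options } = buildPreprintRequest("abc12");
console.log(url, options.headers.Accept);
```

Inside a translator, the resulting `url` and `options` could then be handed to the request helper, for example something like `await requestJSON(url, options)`.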
OSF Preprints.js Outdated
item.date = attr.date_published;
// let type = inputJSON.data.type
item.title = ZU.unescapeHTML(attributes.title);
item.abstractNote = ZU.unescapeHTML(attributes.description);
Member

Probably should do

Suggested change
- item.abstractNote = ZU.unescapeHTML(attributes.description);
+ item.abstractNote = attributes.description && ZU.unescapeHTML(attributes.description);

because the current implementation of ZU.unescapeHTML() (which is weird, to say the least) looks like it will error on a null/undefined input.

Collaborator Author

That seems to be the case. I changed those lines to something like ZU.unescapeHTML(something || "") so that a nullish value is never passed in.
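
To make the guard concrete, here is a small sketch. `unescapeHTMLStandIn` is a simplified stand-in written for this example, not the real `ZU.unescapeHTML()`; it only mimics the relevant behavior of throwing when given null or undefined. `cleanField` is likewise a hypothetical name.

```javascript
// Stand-in for ZU.unescapeHTML(): decodes a handful of entities and, like the
// behavior discussed above, throws a TypeError on null/undefined input.
function unescapeHTMLStandIn(text) {
	return text
		.replace(/&lt;/g, "<")
		.replace(/&gt;/g, ">")
		.replace(/&quot;/g, '"')
		.replace(/&#39;/g, "'")
		// Decode &amp; last so "&amp;lt;" becomes "&lt;" rather than "<".
		.replace(/&amp;/g, "&");
}

// Hypothetical wrapper: substitute an empty string for any falsy value
// before decoding, so a missing API field never reaches the decoder.
function cleanField(value) {
	return unescapeHTMLStandIn(value || "");
}

console.log(cleanField("Fish &amp; Chips"));
console.log(JSON.stringify(cleanField(null)));
```

The two variants in this thread differ slightly: `value && f(value)` leaves a missing field as null or undefined, while `f(value || "")` turns it into an empty string.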

OSF Preprints.js Outdated
"libraryCatalog": "OSF Preprints",
"repository": "OSF Preprints",
"repository": "Open Science Framework",
Member

Not sure anyone actually calls it this - their site only uses "OSF" and I had no idea that's what it stands for.

Collaborator Author

This change was caused by the data source. In our code we have

	item.publisher = embeds.provider.data.attributes.name;

and it appears that the name the OSF uses to describe itself in that field has changed to "Open Science Framework".

I think the original intent behind setting publisher/repository from the embedded provider data was that some OSF projects (psyarxiv, for example) identify themselves programmatically in this way. Therefore we could get the more specific repository name (e.g. psyarxiv) as opposed to the umbrella OSF.
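
For illustration only, reading the provider name defensively might look like the sketch below. The function name `getRepositoryName`, the fallback value, and the sample objects are hypothetical; only the property path `embeds.provider.data.attributes.name` comes from the code quoted above.

```javascript
// Hypothetical helper: read the provider name from the embedded data, and
// fall back to a fixed label if any link in the chain is missing.
function getRepositoryName(embeds, fallback = "OSF Preprints") {
	// Optional chaining yields undefined instead of throwing on a missing level.
	const name = embeds?.provider?.data?.attributes?.name;
	return name || fallback;
}

// Made-up sample payloads, shaped like the property path quoted above.
const specific = { provider: { data: { attributes: { name: "PsyArXiv" } } } };
const missing = { provider: {} };

console.log(getRepositoryName(specific));
console.log(getRepositoryName(missing));
```

With this shape, a discipline-specific provider reports its own name, and the umbrella label is used only when the API omits the field.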

- The PDF attachment is named "Preprint PDF"
- Use `ZU.cleanAuthor()` to normalize author names, for consistency with
  other translators
The domain osf.io is now the host of the discipline-specific projects,
so we only need to match that.
2 participants