OSF Preprint: fix broken detection for multiple search results; unify search-result & individual-article scraping #3162
zoe-translates
commented
Oct 19, 2023
- Update search-result page URL pattern and selectors for search-result entries.
- Don't use a separate code path for scraping the current page; use the same scrape() function for both search results and current page
- Asyncify the network requests
- Clean up any HTML entities in the API-returned text fields (title, abstract, etc.)
- Update search-result page URL pattern and selectors for search-result entries.
- Be less eager in monitoring DOM change; do this only when the page could possibly be a search-result page as determined by URL.
- Update a test case due to change in canonical URL field in output.

- Don't use a separate code path for scraping the current page; use the same scrape() function for both search results and current page
- Asyncify the network requests
- Clean up any HTML entities in the API-returned text fields (title, abstract, etc.)
- More reliable way to extract the "id" of individual preprints: it's the last segment in the path
- Overall reduction of code duplication
- Update and add tests
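The "last segment in the path" rule for the preprint id could be sketched roughly as follows. This is only an illustration, not the translator's actual code: the function name `getPreprintID` and the example URLs are hypothetical, and it assumes the standard `URL` class is available in the runtime.

```javascript
// Hypothetical helper: take the last non-empty path segment as the preprint id.
function getPreprintID(url) {
  // Parsing with URL drops the query string and fragment from consideration.
  const { pathname } = new URL(url);
  // filter(Boolean) discards empty segments caused by leading/trailing slashes.
  const segments = pathname.split("/").filter(Boolean);
  return segments.length ? segments[segments.length - 1] : null;
}

// Example URLs are made up for illustration only.
const exampleID = getPreprintID("https://osf.io/preprints/psyarxiv/abc12/");
```

Working on the parsed path rather than the raw string means a trailing slash or a query string does not change the result.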
The API endpoint may respond with HTML (for human consumption) depending on a variety of factors. To prevent this, explicitly add an "Accept:" header to the request.
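A minimal sketch of the idea, under stated assumptions: the helper names `buildApiRequestOptions` and `fetchPreprintJSON` are hypothetical, the media types in the `Accept` value are a guess (the header value actually used in the PR is not shown here), and a `fetch`-style function stands in for whatever request helper the translator calls.

```javascript
// Hypothetical helper: request options that explicitly ask for JSON.
function buildApiRequestOptions(extraHeaders = {}) {
  return {
    headers: {
      // Assumed media types; the PR's real header value is not shown.
      Accept: "application/vnd.api+json, application/json",
      ...extraHeaders,
    },
  };
}

// Hypothetical wrapper: fetchImpl is injectable so it can be stubbed in tests.
async function fetchPreprintJSON(apiURL, fetchImpl = fetch) {
  const response = await fetchImpl(apiURL, buildApiRequestOptions());
  const contentType = response.headers.get("content-type") || "";
  // Fail loudly if the server still answered with HTML.
  if (!contentType.includes("json")) {
    throw new Error(`Expected JSON but got "${contentType}"`);
  }
  return response.json();
}
```

Checking the response's content type as well is optional, but it turns a silent HTML response into an explicit error.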
OSF Preprints.js
item.date = attr.date_published;
// let type = inputJSON.data.type
item.title = ZU.unescapeHTML(attributes.title);
item.abstractNote = ZU.unescapeHTML(attributes.description);
Probably should do
item.abstractNote = attributes.description && ZU.unescapeHTML(attributes.description);
instead, because the current implementation of ZU.unescapeHTML() (which is weird, to say the least) looks like it will error on a null/undefined input.
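A small sketch of guarding against a nullish input, assuming the unescape function throws on `null`/`undefined`. `unescapeHTMLStandIn` is a simplified stand-in written for this example, not the real `ZU.unescapeHTML()`, and `safeUnescape` is a hypothetical name.

```javascript
// Stand-in for ZU.unescapeHTML (NOT the real implementation): decodes a few
// common entities and, like any string method call, throws on null/undefined.
function unescapeHTMLStandIn(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&"); // decode &amp; last to avoid double-unescaping
}

// Hypothetical wrapper: substitute "" for nullish or empty input before unescaping.
function safeUnescape(value, unescape = unescapeHTMLStandIn) {
  return unescape(value || "");
}
```

Note the difference between the two guard styles: `value && unescape(value)` leaves a nullish field as `null`/`undefined`, whereas `unescape(value || "")` always yields a string.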
It seems to be the case. I changed these lines to something like ZU.unescapeHTML(something || "") to prevent passing in a nullish something.
"libraryCatalog": "OSF Preprints",
"repository": "OSF Preprints",
"repository": "Open Science Framework",
Not sure anyone actually calls it this - their site only uses "OSF" and I had no idea that's what it stands for.
This change was caused by the data source. In our code we have
item.publisher = embeds.provider.data.attributes.name;
and it appears that the exact phrasing used by the OSF to describe itself has changed to this ("Open Science Framework").
I think the original intent behind setting publisher/repository from the embedded provider data was that some OSF projects (psyarxiv, for example) identify themselves programmatically in this way. Therefore we could get the more specific repository name (e.g. psyarxiv) as opposed to the umbrella OSF.
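As a rough illustration of reading the repository name from the embedded provider data, here is a defensive sketch. Only the `embeds.provider.data.attributes.name` path comes from the discussion; the function name `getRepositoryName` and the fallback label are assumptions made for this example.

```javascript
// Hypothetical fallback label; not taken from the translator.
const DEFAULT_REPOSITORY = "OSF Preprints";

// Hypothetical helper: prefer the provider's self-reported name when present.
function getRepositoryName(embeds) {
  // Optional chaining avoids a TypeError if any level of the path is missing.
  const name = embeds?.provider?.data?.attributes?.name;
  return typeof name === "string" && name.trim()
    ? name.trim()
    : DEFAULT_REPOSITORY;
}
```

With this shape, a discipline-specific provider reports its own name, and the umbrella label is used only when the embedded data is absent.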
- The PDF attachment is named "Preprint PDF"
- Use `ZU.cleanAuthor()` to normalize author names, consistent with other translators
The domain osf.io is now the host of the discipline-specific projects, so we only need to match that.