
exception while iterating "documents_by_identifiers()" #10

Open
bnewbold opened this issue May 5, 2020 · 3 comments

bnewbold commented May 5, 2020

[...]
  File "./dump_scielo.py", line 73, in run_article_ids
    for ident in cl.documents_by_identifiers(only_identifiers=True):
  File "/home/bnewbold/scratch/ingests/scielo/.venv/lib/python3.7/site-packages/articlemeta/client.py", line 496, in documents_by_identifiers
    identifiers = self._do_request(url, params=params).get('objects', [])
AttributeError: 'NoneType' object has no attribute 'get'

Python version: 3.7
articlemetaapi version: 1.26.6

This error happens after many timeouts, maybe due to HTTP 429 back-off responses? Perhaps the result of self._do_request(url, params=params) should be assigned to a variable first and checked (e.g. for None) before .get() is called on it.
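For illustration, a minimal sketch of that guard, based on the failing line in the traceback above (the error message is my own suggestion, not existing library code):

    # Sketch of a guard around the failing line in
    # articlemeta/client.py (documents_by_identifiers).
    response = self._do_request(url, params=params)
    if response is None:
        # _do_request seems to return None once its attempts are
        # exhausted; fail descriptively instead of with AttributeError
        # (or back off and retry here instead).
        raise RuntimeError(
            'request failed after all attempts: %s' % url)
    identifiers = response.get('objects', [])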

@jamilatta (Contributor) commented

@bnewbold Is this error constant, or is it sporadic?

I would like to know whether this occurs throughout the whole run. I need to know whether you are unable to retrieve SciELO metadata at all, so that we can classify and prioritize this issue.

bnewbold commented May 5, 2020

@jamilatta Thank you for your rapid reply!

This error occurred on my first attempt, after iterating through about 19,700 identifiers. Here is the script I am writing:

https://gist.github.com/bnewbold/9918634282f6013e13174badbce64a93

I am running a second time now and have gotten past 50,000 identifiers, so this is probably sporadic. I'll note that I almost immediately get requests.exceptions.ReadTimeout errors (in both cases, trying from two separate machines). The complete failure happens if:

fail retrieving data from (http://articlemeta.scielo.org/api/v1/article/identifiers) attempt(1/10)

... all the attempts fail. I assume this is due to rate limiting, as mentioned in the source code. Perhaps there should be an extra delay by default to prevent these timeouts?
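In the meantime, a caller-side delay along these lines might help (a sketch; the 0.1s value is a guess on my part, not a documented threshold, and I am assuming RestfulClient is the client class in use):

    import time

    from articlemeta.client import RestfulClient

    cl = RestfulClient()
    for ident in cl.documents_by_identifiers(only_identifiers=True):
        print(ident)  # stand-in for whatever the caller does with each item
        # Slow the overall request rate to ease pressure on the API.
        time.sleep(0.1)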

As some context, I am hoping to extract the full metadata for all 900k to 1 million articles as a JSON snapshot, to archive and include in https://fatcat.wiki, particularly articles which do not have a DOI. If there is a more efficient way to achieve this, please let me know!
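For reference, the rough shape of what my script does is something like this (a sketch only; documents() and the .data attribute are my reading of the library, so treat them as assumptions):

    import json

    from articlemeta.client import RestfulClient

    cl = RestfulClient()
    with open('scielo_dump.jsonl', 'w') as f:
        for article in cl.documents():
            # Assumption: documents() yields parsed article objects whose
            # raw metadata dict is exposed as a .data attribute.
            f.write(json.dumps(article.data) + '\n')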

Thank you for maintaining articlemetaapi.

@jamilatta (Contributor) commented

@bnewbold I will think of a way to avoid having all the attempts fail.

Let me talk it over with my coworkers, and I will get back to you soon.

Thanks.
