Move checking against api_limit to cache? #18
How will the cache perform on large amounts of data? I think your idea is good, but we should add a note that Unpaywall also offers database snapshots for faster access.
@naustica I have not tested it with more than ~20 DOIs. Right now it saves the entire cache with pickle every time a new DOI is searched, so that could get pretty slow. We should probably replace that with something more high-performance.
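For reference, a SQLite-backed cache would avoid re-serializing the whole cache on every new DOI, since each record is written individually. This is only a minimal sketch under that assumption; the `SqliteCache` class and its `get`/`set` methods are hypothetical names, not part of unpywall:

```python
import json
import sqlite3


class SqliteCache:
    """DOI-keyed cache backed by SQLite. Each set() writes one row,
    instead of pickling the entire cache on every new DOI."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cache (doi TEXT PRIMARY KEY, record TEXT)')

    def get(self, doi):
        """Return the cached record for a DOI, or None if not cached."""
        row = self.conn.execute(
            'SELECT record FROM cache WHERE doi = ?', (doi,)).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, doi, record):
        """Insert or update a single DOI record."""
        self.conn.execute(
            'INSERT OR REPLACE INTO cache (doi, record) VALUES (?, ?)',
            (doi, json.dumps(record)))
        self.conn.commit()
```

Since the Unpaywall API returns JSON anyway, storing the record as a JSON string keeps the schema trivial while still giving indexed lookups by DOI.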
@naustica What do you think about including an alternate "offline" backend? Maybe the user could give the path to the database snapshot, and then use the normal functions to interact with the local copy.
I like the idea, but the performance will depend on the user's hardware, because there are millions of entries in the Unpaywall data dumps.
@naustica I see. I guess it only makes sense to test the remote end. I have never used an Unpaywall data dump before, so I will have to find out how hard it is to work with the data on my machine, just to have a sense of perspective. Or have you used it before?
Actually, I wrote a blog post on how to use Python and Google BigQuery with Unpaywall database snapshots (https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_python/). But since we want to use a local database, I can't say much about it. I think you could transform the database snapshots into a SQL database (maybe with sqlite https://docs.python.org/3.7/library/sqlite3.html). The data dump uses JSON Lines (http://jsonlines.org/examples/), so I think it won't be that difficult. I could help here too. Maybe the integration of Google BigQuery would also be a nice feature.
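Loading the JSON Lines snapshot into SQLite could look roughly like this. A sketch only: `snapshot_to_sqlite` is a hypothetical helper, and it assumes each line of the dump is one JSON record with a top-level `doi` field (which the Unpaywall data format provides):

```python
import gzip
import json
import sqlite3


def snapshot_to_sqlite(jsonl_path, db_path):
    """Load an Unpaywall JSON Lines snapshot into a SQLite table
    keyed by DOI, so local lookups use an index instead of a scan."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS unpaywall (doi TEXT PRIMARY KEY, record TEXT)')
    # Snapshots are usually distributed gzip-compressed.
    opener = gzip.open if jsonl_path.endswith('.gz') else open
    with opener(jsonl_path, 'rt', encoding='utf-8') as f:
        rows = ((json.loads(line)['doi'], line.strip())
                for line in f if line.strip())
        conn.executemany(
            'INSERT OR REPLACE INTO unpaywall (doi, record) VALUES (?, ?)', rows)
    conn.commit()
    return conn
```

Because `executemany` consumes the generator lazily, the full dump never has to fit in memory, which matters when the snapshot has millions of lines.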
@naustica Yes, integrating BigQuery sounds good. You might have to take the lead on that part, though, because I have not used it before.
@bganglia ok, I will create a new issue for this. |
Maybe the cache could keep track of the number of requests made in the past 24 hours and raise an error when it exceeds the limit.
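One way to sketch that: keep a deque of request timestamps, evict entries older than the 24-hour window, and raise before the limit would be exceeded. The class name and structure here are illustrative, and the default of 100,000 requests assumes Unpaywall's documented 100,000-calls-per-day REST API limit:

```python
import time
from collections import deque


class ApiLimitTracker:
    """Track request timestamps and raise once the rolling
    24-hour API limit would be exceeded."""

    def __init__(self, api_limit=100_000, window=24 * 3600):
        self.api_limit = api_limit
        self.window = window
        self._timestamps = deque()

    def record_request(self, now=None):
        """Register one request; raise if the limit is already hit."""
        now = time.time() if now is None else now
        # Drop timestamps that have aged out of the 24-hour window.
        while self._timestamps and now - self._timestamps[0] >= self.window:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.api_limit:
            raise RuntimeError(
                'api_limit reached: too many requests in the past 24 hours')
        self._timestamps.append(now)
```

A deque keeps both eviction and insertion O(1), and since timestamps are appended in order, only the left end ever needs checking.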
https://github.com/unpywall/unpaywall-python/blob/7de915fe0b1e1390054281e6722778cd9bdd72a0/unpywall/__init__.py#L47