Move checking against api_limit to cache? #18

Open
bganglia opened this issue Apr 15, 2020 · 8 comments

@bganglia
Collaborator

Maybe the cache could keep track of the number of requests made in the past 24 hours and raise an error when it exceeds the limit.

https://github.com/unpywall/unpaywall-python/blob/7de915fe0b1e1390054281e6722778cd9bdd72a0/unpywall/__init__.py#L47
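
Roughly what I have in mind (a minimal sketch; the class name, interface, and the 100,000-calls-per-day figure from the Unpaywall docs are my assumptions, not the current code):

```python
import time


class APILimitError(Exception):
    pass


class RequestLog:
    """Sketch of a request counter the cache could keep.

    The default limit reflects Unpaywall's documented courtesy limit of
    100,000 calls per day; names and interface are made up for illustration.
    """

    def __init__(self, api_limit=100_000, window=24 * 60 * 60):
        self.api_limit = api_limit
        self.window = window  # seconds
        self.timestamps = []

    def record(self):
        now = time.time()
        # Forget requests that fell out of the 24-hour window.
        self.timestamps = [t for t in self.timestamps if now - t < self.window]
        if len(self.timestamps) >= self.api_limit:
            raise APILimitError(
                f'More than {self.api_limit} requests in the past 24 hours')
        self.timestamps.append(now)
```

The cache would call record() just before each real HTTP request, so cache hits never count against the limit.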

@naustica
Member

How will the cache perform on large amounts of data? I think your idea is good, but we should add a note that Unpaywall also offers database snapshots for faster access.

@bganglia
Collaborator Author

@naustica I have not tested it with more than ~20 DOIs. Right now it saves the entire cache with pickle every time a new DOI is searched, so that could get pretty slow. We should probably replace that with something more performant.
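
For example, the standard-library shelve module would let us persist one entry at a time instead of re-pickling everything (a sketch, not the current implementation; the file name is a placeholder):

```python
import shelve


class ShelveCache:
    """Sketch: persist one DOI's record at a time via shelve.

    Unlike pickling the whole cache dict on every write, shelve only
    serializes the entry being stored.
    """

    def __init__(self, path='unpaywall_cache.db'):
        self.path = path

    def get(self, doi):
        with shelve.open(self.path) as db:
            return db.get(doi)

    def set(self, doi, record):
        with shelve.open(self.path) as db:
            db[doi] = record  # writes just this entry, not the full cache
```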

@bganglia
Collaborator Author

@naustica What do you think about including an alternate "offline" backend? Maybe the user could give the path to the database snapshot, and then use the normal functions to interact with the local copy.
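
Interface-wise, something like this (names are placeholders, and a linear scan over the gzipped JSON Lines snapshot would be far too slow in practice; it only shows the idea):

```python
import gzip
import json


class SnapshotBackend:
    """Sketch of an 'offline' backend over an Unpaywall snapshot.

    A real version would build an index instead of scanning the whole
    file on every lookup; this only illustrates the interface.
    """

    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path

    def get(self, doi):
        with gzip.open(self.snapshot_path, 'rt', encoding='utf-8') as f:
            for line in f:
                record = json.loads(line)
                if record.get('doi') == doi:
                    return record
        return None
```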

@naustica
Member

I like the idea, but performance will depend on the user's hardware, since the Unpaywall data dumps contain millions of entries.

@bganglia
Collaborator Author

@naustica I see. I guess it only makes sense to test the remote end.

I have never used an Unpaywall data dump before, so I guess I will have to find out how hard it is to work with the data on my machine, just to get a sense of perspective. Or have you used it before?

@naustica
Member

Actually, I wrote a blog post on how to use Python and Google BigQuery with Unpaywall database snapshots (https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_python/). But since we want to use a local database, I can't say much about it.

I think you could transform the database snapshots into a SQL database (maybe with SQLite: https://docs.python.org/3.7/library/sqlite3.html). The data dump uses JSON Lines (http://jsonlines.org/examples/), so I think it won't be that difficult. I could help here too.
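
Untested, but roughly like this (I'm only keeping the DOI plus the raw JSON line; a real schema could extract more columns, and the file names are placeholders):

```python
import gzip
import json
import sqlite3

conn = sqlite3.connect('unpaywall.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (doi TEXT PRIMARY KEY, data TEXT)')

# Each line of the snapshot is one JSON record with a 'doi' field.
with gzip.open('unpaywall_snapshot.jsonl.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        doi = json.loads(line)['doi']
        conn.execute('INSERT OR REPLACE INTO records VALUES (?, ?)', (doi, line))
conn.commit()

# A lookup is then a single indexed query (example DOI from the Unpaywall docs):
row = conn.execute('SELECT data FROM records WHERE doi = ?',
                   ('10.1038/nature12373',)).fetchone()
```

For millions of rows you would want batched inserts (executemany plus periodic commits), but this shows the shape.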

Maybe the integration of Google BigQuery would also be a nice feature.

@bganglia
Collaborator Author

@naustica Yes, integrating BigQuery sounds good. You might have to take the lead on that part, though, because I have not used it before.

@naustica
Member

@bganglia ok, I will create a new issue for this.
