Add cache status to objects #125

ale-de-vries · 2019-09-23T13:42:08Z

For any of the data entities (i.e. AuthorRetrieval, ContentAffiliationRetrieval, AbstractRetrieval, and conceivably also the search types) it would be helpful to include a property/method that indicates whether a local data cache already exists for that entity, and if so, how old it is. This allows a script to check whether the data needs to be fetched or refreshed from the REST endpoint, and therefore whether throttling needs to be applied.

Background:
Note that the Scopus API endpoints enforce throttling; any requests that exceed the default requests-per-second limit will fail. Also, any client that continuously exceeds the throttling limits risks having its API key suspended. This means that the client needs to monitor and control the rate at which it calls the API to avoid such failed requests, e.g. by including a timeout (`sleep`) when looping over API calls.
The challenge is that this timeout is unnecessary when initiating a retrieval/search object for which a cache already exists, because no API call is made for cached objects. In fact, the timeout is counterproductive there: looping with a timeout over a series of cached objects makes initiating them take longer than needed, unnecessarily increasing program run time.
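To make the use case concrete, here is a minimal sketch of a loop that pauses only before uncached items. It assumes the script can tell whether a cache file exists; `fetch_all`, `cache_dir`, `fetch`, and the one-file-per-identifier layout are hypothetical stand-ins for illustration, not pybliometrics internals.

```python
import time
from pathlib import Path


def fetch_all(ids, cache_dir, fetch, delay=0.5, sleep=time.sleep):
    """Call fetch(identifier) for every id, pausing only before uncached ones.

    Assumption: one cache file per identifier inside cache_dir.
    This layout is illustrative only.
    """
    cache_dir = Path(cache_dir)
    results = {}
    for identifier in ids:
        cached = (cache_dir / str(identifier)).exists()
        if not cached:
            # Only uncached items lead to an API call, so only they need the pause.
            sleep(delay)
        results[identifier] = fetch(identifier)
    return results
```

With a cache-status property on the objects themselves, the `exists()` check would no longer have to know anything about the cache layout.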

(A more elegant approach would be for pybliometrics to enforce throttling itself, e.g. by building a timeout into the get_content.py module. That, however, requires the module to persist the timestamp of the last request made to api.elsevier.com one way or another, which isn't trivial: the timestamp has to be either persisted on disk or maintained in memory, as the elsapy library does.)
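The in-memory variant could look roughly like the sketch below. The `Throttle` class and its parameters are hypothetical; this is neither how elsapy implements it nor what pybliometrics ships, only an illustration of keeping the last-request timestamp in memory.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls, tracked in memory only."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` right before each HTTP request would be enough within one process. The timestamp is lost when the interpreter exits, which is exactly the limitation described above.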

Michael-E-Rose · 2019-09-25T12:18:36Z

Hi @ale-de-vries and thanks so much for this issue. You raise many connected issues, all of which are worth thinking about!

I respond in reverse order:

1. We cannot enforce throttling at a global level in pybliometrics (between different queries) without a lot of change to the backend. But we can easily slow down requests within one query. A colleague of mine actually experimented with this once in an effort to reduce the number of broken requests and missing data within one query, but to no avail. But well, if it should help in principle, let's do it.
2. I have also long thought about adding a property to all classes that tells the user when the file was last cached (i.e. created or modified). Doing so requires a new base class from which both the Search() and the Retrieval() classes inherit. Getting the modified timestamp via os is easy.
3. Using the timestamp from 2., I plan to adapt the refresh parameters slightly. Users will be able to provide an integer in addition to a boolean. The integer will be interpreted as the maximum age of the cache in days. If the file is older than the provided value, pybliometrics refreshes the file.
4. Given these, I don't really see the point of having a property that tells the user whether the file has been cached or not. First, there is the download parameter in the search classes. If it's set to False and the file exists, the relevant parameters are still filled. So that's how users see whether the file exists. Second, I don't see a use case for having information on the cache status if it's not True. That is, why would someone be interested in knowing whether the cached file is already there and then decide not to retrieve the corresponding information? Of course, I am open to discussion here.
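The timestamp lookup and the bool-or-integer refresh logic described above could be sketched as follows. The function names and signatures are hypothetical and only illustrate the idea; they are not the pybliometrics implementation.

```python
import os
import time

SECONDS_PER_DAY = 86400


def cache_age_days(path, now=None):
    """Return the age of a cached file in days, or None if it is missing."""
    if not os.path.exists(path):
        return None
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / SECONDS_PER_DAY


def needs_refresh(path, refresh=False, now=None):
    """Decide whether to download again.

    refresh may be a bool (always/never refresh an existing file) or an
    integer giving the maximum cache age in days.
    """
    age = cache_age_days(path, now)
    if age is None:
        return True  # nothing cached yet, so a download is needed
    if isinstance(refresh, bool):  # test bool first: bool is a subclass of int
        return refresh
    return age > refresh
```

For example, `needs_refresh(path, refresh=30)` would trigger a new download only if the cached file is more than 30 days old.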

Michael-E-Rose · 2020-03-29T16:14:09Z

With fde4a8c, any pybliometrics class can show how old the cached file is. That's certainly a good step in the right direction.

Michael-E-Rose · 2021-06-15T08:22:57Z

Throttling implemented in e32c349

Michael-E-Rose added Backend Effort: High labels Sep 25, 2019

Michael-E-Rose mentioned this issue Apr 2, 2020

Gateway timeout #140

Closed

Michael-E-Rose mentioned this issue Nov 16, 2022

Feature Request: Automatically update objects retrieved on older versions #269

Closed
