
Generator function for getting new records #185

Closed
trangdata opened this issue Oct 26, 2023 · 4 comments

@trangdata (Collaborator)

Moved part of #182 here.

@rkrug:

What would you think about a hook, i.e. a function that is called after each page is downloaded, receiving the returned JSON, the page number, etc.? This would make it possible to implement the saving outside openalexR while adding no maintenance cost. To hide them from the arguments, the hook functions could live in an option that needs to be set, so nothing would change for the regular user, but a power user could set a hook (possibly different ones in the future?) to implement functionality well beyond what we envision at the moment. A function like setHooks() could be used to set the hooks, and getHooks() to return a list containing all hooks.
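
A minimal sketch of what such an option-based registry could look like. None of this exists in openalexR; setHooks(), getHooks(), the option name, and the after_page signature are all hypothetical:

    # hypothetical hook registry stored in an R option, so no oa_* arguments change
    setHooks <- function(...) {
      hooks <- utils::modifyList(getHooks(), list(...))
      options(openalexR.hooks = hooks)
      invisible(hooks)
    }

    getHooks <- function() {
      getOption("openalexR.hooks", default = list())
    }

    # a power user could then register an after-page hook, e.g. to stream each
    # downloaded page straight to disk
    setHooks(after_page = function(page_json, page_number) {
      saveRDS(page_json, sprintf("page-%03d.rds", page_number))
    })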

@yjunechoe:

I'm personally a little cautious about exposing a feature that lets users inject code in the middle of a function executing queries. setHooks()/getHooks() would also require persistent package state that can store closures, which feels too bulky to maintain for a light API wrapper package like openalexR.
For what it's worth, a power user could implement hook-like behavior from the outside by trace()-ing the function to inject code. Or paging can be controlled more precisely with something like {chromote}, which does have the kind of hook ("callback") argument you're describing. Generally, I think it'd be nice to know how many users would benefit from such a power-user feature, given the availability of more flexible alternatives (writing their own version of oa_request() building off oa_query(), or even just writing http/curl calls directly), before we push further in this direction (so I could be convinced otherwise!).

@trangdata:
I mostly agree with @yjunechoe. Our vision for the package covers the common use case of accessing a smaller number of OpenAlex records that fits in memory. I would argue that the use cases involving larger amounts of data can likely be served by the OpenAlex snapshot download/bulk export.

Nonetheless, @rkrug, while I'm not completely clear on the setHooks/getHooks idea, I have implemented a generator function in #184 with the new optional dependency coro. Would this cover your use case? @yjunechoe any thoughts?

@rkrug commented Oct 27, 2023

I agree completely that the package should keep its simplicity, and I also like that it is very light on dependencies.

Using openalexR is really straightforward (and the refactoring of the parameters discussed in #182 will make it even easier). But the problem is, as you point out, the memory limit and, I assume (I have no test cases), the slowdown when downloading a large number of pages (memory reallocations when the result list grows). I will not go into that second part here; it is an independent discussion, but it comes in later as a by-product.

So we are at the memory limit.

The primary question, as I see it, is how this package can deal with the memory limit and avoid hitting it.

So I see two steps here:

  1. detect in advance whether the memory limit will be hit, and
  2. decide how to deal with it.

I have only looked at `oa_request()` so far, and I assume that the main memory limitation occurs when downloading multiple pages, which are then saved in memory in the result list. The most promising approach could be, since the number of pages is known, to allocate the memory for all downloads up front and then just overwrite the individual page results. If I understand R memory management correctly, this should also be faster, as it does not need to re-allocate memory each time. And it would tell you before the retrieval starts whether there is enough unfragmented memory.
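
A minimal illustration of the pre-allocation idea (not openalexR internals; download_page() is a hypothetical stand-in for one API call):

    n_pages <- 50                                # assumed: known from the first response
    pages <- vector("list", n_pages)             # allocate all slots up front
    download_page <- function(p) list(page = p)  # hypothetical stand-in for one page download
    for (p in seq_len(n_pages)) {
      pages[[p]] <- download_page(p)             # overwrite slot p; no reallocation of the list
    }
    # compare with the growing pattern, which copies the accumulated list on every iteration:
    # pages <- list(); for (p in seq_len(n_pages)) pages <- c(pages, list(download_page(p)))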

But there is still the question of how to deal with the case when the memory cannot be allocated at the beginning. There are two (or three?) approaches:

  1. The function aborts with a useful error message, so that the user knows why and can take action; or
  2. the function internally falls back to a different approach when the memory cannot be allocated, e.g. using intermediate files to process the data (this might be difficult and could run into additional memory limits); or
  3. to accompany (1), an option would be a hook function or using coro (which I had never heard of and which sounds complicated to understand) that can be used to circumvent the memory issue.

I am in general in favour of automatic functions, so that the user can concentrate on the question (which has to do with the literature) and is not sidetracked into thinking about complicated ways of handling too many hits in a search or snowballing from too many papers, but that is, as I see it, the most difficult to implement. The easiest is an error message when the memory cannot be allocated, plus a hook function which, if it returns TRUE, continues processing and, if it returns FALSE, skips processing and fetches the next page (see the sketch below).
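
Purely as an illustration of that TRUE/FALSE contract (no such hook exists in openalexR; the name and signature are hypothetical):

    after_page_hook <- function(page_json, page_number) {
      saveRDS(page_json, sprintf("page-%03d.rds", page_number))  # persist the page first
      page_number <= 10  # TRUE: keep processing in memory; FALSE: skip and fetch the next page
    }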

@trangdata (Collaborator, Author)

Hi Rainer, a few thoughts:

  • Determining whether memory is sufficient to store all records at the beginning is not straightforward. I think the (power) user should be in charge of this step, combining count_only = TRUE with what they know about their machine's RAM to decide whether they should write out intermediate results themselves (see the sketch after the list below).

  • Currently, there are two approaches to combat this memory issue:

    1. Use paging to control the number of pages to write in each chunk (#183, "improve paging control"):
    # download pages in chunks of five and write each chunk to disk
    chunk_size <- 5
    for (i in 1:3) {
      start <- chunk_size * (i - 1) + 1
      end <- chunk_size * i
      oa_fetch(
        cites = "W2755950973",
        pages = start:end,
        per_page = 20,
        verbose = TRUE
      ) |>
        saveRDS(file = sprintf("tfc-nature_p%s-%s.rds", start, end))
    }
    2. Use oa_generate() to go through the records one at a time. This doesn't sound very efficient, but it actually is: I have implemented the function so that we only query the OpenAlex database once every 200 records. With this method, the user does not have to worry about paging at all (#184, "Using coro to create a generator"):
    query_url <- "https://api.openalex.org/works?filter=cites%3AW2755950973"
    oar <- oa_generate(query_url)
    for (i in seq.int(1000)) {
      record_i <- oar()
      if (coro::is_exhausted(record_i)) break  # stop once the generator runs out of records
      # process record_i here, for example:
      # saveRDS(record_i, paste0("rec-", i, ".rds"))
    }
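
As a rough sketch of the count_only check mentioned in the first bullet (the ~5 KB per parsed record is an assumed ballpark, not a figure reported by openalexR, and the exact shape of the count_only return value may differ across versions):

    library(openalexR)
    # count_only = TRUE returns only metadata about the result set, not the records
    meta <- oa_fetch(entity = "works", cites = "W2755950973", count_only = TRUE)
    n_records <- as.numeric(meta[, "count"])  # check str(meta) for the exact structure
    message("roughly ", round(n_records * 5e3 / 1e6), " MB of parsed records to hold in memory")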

@rkrug commented Oct 30, 2023

Thanks for the clarifications, @trangdata. Please feel free to close this issue.

@trangdata (Collaborator, Author)

Done in #184
