
Generator function for getting new records #185

Closed
trangdata opened this issue Oct 26, 2023 · 4 comments

@trangdata (Collaborator)

Moved part of #182 here.

@rkrug:

What would you think about a hook, i.e. a function that is called after each page is downloaded, receiving the returned JSON, the page number, etc.? This would make it possible to implement the saving outside openalexR while adding no maintenance cost. To hide them from the arguments, the hook functions could live in an option that needs to be set, so nothing would change for the regular user, but a power user could set a hook (possibly different ones in the future?) to implement functionality well beyond what we envision at the moment. A function like setHooks() could be used to set the hooks, and getHooks() to return a list containing all hooks.
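
A minimal sketch of what such an option-based registry could look like. None of this exists in openalexR; setHooks(), getHooks(), the option name, and the after_page signature are all hypothetical:

    # hypothetical hook registry stored in an R option, so no oa_* arguments change
    setHooks <- function(...) {
      hooks <- utils::modifyList(getHooks(), list(...))
      options(openalexR.hooks = hooks)
      invisible(hooks)
    }

    getHooks <- function() {
      getOption("openalexR.hooks", default = list())
    }

    # a power user could then register an after-page hook, e.g. to stream each
    # downloaded page straight to disk
    setHooks(after_page = function(page_json, page_number) {
      saveRDS(page_json, sprintf("page-%03d.rds", page_number))
    })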

@yjunechoe:

I'm personally a little cautious about exposing a feature that lets users inject code in the middle of a function executing queries. setHooks()/getHooks() would also require persistent package state that can store closures, which feels too bulky to maintain for a light API wrapper package like openalexR.
For what it's worth, a power user could implement hook-like behavior from the outside by trace()-ing the function to inject code. Or paging can be controlled more precisely with something like {chromote}, which does have the kind of hook ("callback") argument you're describing. Generally, I think it'd be nice to know how many users would benefit from such a power-user feature, given the availability of more flexible alternatives (writing their own version of oa_request() building off oa_query(), or even just writing http/curl calls directly), before we push further in this direction (so I could be convinced otherwise!).

@trangdata:
I mostly agree with @yjunechoe. Our vision for the package covers the common use case of accessing a smaller number of OpenAlex records that fits in memory. I would argue that the use cases involving larger amounts of data can likely be served by the OpenAlex snapshot download/bulk export.

Nonetheless, @rkrug, while I'm not completely clear on the setHooks/getHooks idea, I have implemented a generator function in #184 with the new optional dependency coro. Would this cover your use case? @yjunechoe any thoughts?

@rkrug commented Oct 27, 2023

I agree completely that the package should keep its simplicity, and I also like that it is very light on dependencies.

Using openalexR is really straightforward (and the refactoring of the parameters discussed in #182 will make it even easier). But the problem is, as you point out, the memory limit and, I assume (I have no test cases), the slowdown when downloading a large number of pages (memory reallocations when the result list grows). I will not go into that second part here; it is an independent discussion, but it comes in later as a by-product.

So we are at the memory limit.

The primary question, as I see it, is how this package can deal with the memory limit and avoid hitting it.

So I see two steps here:

  1. detect in advance whether the memory limit will be hit, and
  2. decide how to deal with it.

I have only looked at `oa_request()` so far, and I assume that the main memory limitation occurs when downloading multiple pages, which are then saved in memory in the result list. The most promising approach could be, since the number of pages is known, to allocate the memory for all downloads up front and then just overwrite the individual page results. If I understand R memory management correctly, this should also be faster, as it does not need to re-allocate memory each time. And it would tell you before the retrieval starts whether there is enough unfragmented memory.
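
A minimal illustration of the pre-allocation idea (not openalexR internals; download_page() is a hypothetical stand-in for one API call):

    n_pages <- 50                                # assumed: known from the first response
    pages <- vector("list", n_pages)             # allocate all slots up front
    download_page <- function(p) list(page = p)  # hypothetical stand-in for one page download
    for (p in seq_len(n_pages)) {
      pages[[p]] <- download_page(p)             # overwrite slot p; no reallocation of the list
    }
    # compare with the growing pattern, which copies the accumulated list on every iteration:
    # pages <- list(); for (p in seq_len(n_pages)) pages <- c(pages, list(download_page(p)))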

But there is still the question of how to deal with the case when the memory cannot be allocated at the beginning. There are two (or three?) approaches:

  1. The function aborts with a useful error message, so that the user knows why and can take action; or
  2. the function internally falls back to a different approach when the memory cannot be allocated, e.g. using intermediate files to process the data (this might be difficult and could run into additional memory limits); or
  3. to accompany (1), an option would be a hook function or using coro (which I had never heard of and which sounds complicated to understand) that can be used to circumvent the memory issue.

I am in general in favour of automatic functions, so that the user can concentrate on the question (which has to do with the literature) and is not sidetracked into thinking about complicated ways of handling too many hits in a search or snowballing from too many papers, but that is, as I see it, the most difficult to implement. The easiest is an error message when the memory cannot be allocated, plus a hook function which, if it returns TRUE, continues processing and, if it returns FALSE, skips processing and fetches the next page (see the sketch below).
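
Purely as an illustration of that TRUE/FALSE contract (no such hook exists in openalexR; the name and signature are hypothetical):

    after_page_hook <- function(page_json, page_number) {
      saveRDS(page_json, sprintf("page-%03d.rds", page_number))  # persist the page first
      page_number <= 10  # TRUE: keep processing in memory; FALSE: skip and fetch the next page
    }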

@trangdata (Collaborator, Author)

Hi Rainer, a few thoughts:

  • Determining whether memory is sufficient to store all records at the beginning is not straightforward. I think the (power) user should be in charge of this step, combining count_only = TRUE with what they know about their machine's RAM to decide whether they should write out intermediate results themselves (see the sketch after the list below).

  • Currently, there are two approaches to combat this memory issue:

    1. Use paging to control the number of pages to write in each chunk (#183, "improve paging control"):
    # download pages in chunks of five and write each chunk to disk
    chunk_size <- 5
    for (i in 1:3) {
      start <- chunk_size * (i - 1) + 1
      end <- chunk_size * i
      oa_fetch(
        cites = "W2755950973",
        pages = start:end,
        per_page = 20,
        verbose = TRUE
      ) |>
        saveRDS(file = sprintf("tfc-nature_p%s-%s.rds", start, end))
    }
    2. Use oa_generate() to go through the records one at a time. This doesn't sound very efficient, but it actually is: I have implemented the function so that we only query the OpenAlex database once every 200 records. With this method, the user does not have to worry about paging at all (#184, "Using coro to create a generator"):
    query_url <- "https://api.openalex.org/works?filter=cites%3AW2755950973"
    oar <- oa_generate(query_url)
    for (i in seq.int(1000)) {
      record_i <- oar()
      if (coro::is_exhausted(record_i)) break  # stop once the generator runs out of records
      # process record_i here, for example:
      # saveRDS(record_i, paste0("rec-", i, ".rds"))
    }
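
As a rough sketch of the count_only check mentioned in the first bullet (the ~5 KB per parsed record is an assumed ballpark, not a figure reported by openalexR, and the exact shape of the count_only return value may differ across versions):

    library(openalexR)
    # count_only = TRUE returns only metadata about the result set, not the records
    meta <- oa_fetch(entity = "works", cites = "W2755950973", count_only = TRUE)
    n_records <- as.numeric(meta[, "count"])  # check str(meta) for the exact structure
    message("roughly ", round(n_records * 5e3 / 1e6), " MB of parsed records to hold in memory")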

@rkrug commented Oct 30, 2023

Thanks for the clarifications, @trangdata. Please feel free to close this issue.

@trangdata (Collaborator, Author)

Done in #184
