Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve paging control #183

Merged
merged 1 commit into from
Oct 22, 2023
Merged

improve paging control #183

merged 1 commit into from
Oct 22, 2023

Conversation

trangdata
Copy link
Collaborator

Resolves #166.

Hi @rkrug, could you try installing this branch and see if this works for your use case in #166?

# install.packages("remotes")
remotes::install_github("ropensci/openalexR@generalize-paging")

For example, what you can now do is specifying the pages:

chunk_size <- 5
for (i in 1:3){
  start <- chunk_size*(i-1) + 1
  end <- chunk_size*i
  oa_fetch(
    search = "transformative change", 
    pages = start:end,
    per_page = 20,
    verbose = TRUE
  ) |>
    saveRDS(file = sprintf("tfc-nature_p%s-%s.rds", start, end))
}

One thing I'd like to note is that concatenating these rds files later may still raise memory issues if you're trying to do this in R. I recommend checking out arrow::write_parquet to save these outputs as parquet files which would likely make it easier to combine later, potentially outside of R.

Copy link
Collaborator

@yjunechoe yjunechoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@trangdata trangdata merged commit b0a2770 into main Oct 22, 2023
9 checks passed
@trangdata trangdata deleted the generalize-paging branch October 22, 2023 12:18
@rkrug
Copy link

rkrug commented Oct 23, 2023

@trangdata Thanks a lot - this looks good, but there is one shortcoming: I have to do it essentially manually. I have to get the number of pages (an example should show how) and then iterate through the pages. I still think, that an automatic saving (and I will come to the format in a second) would be the best.

The following should work for an automatic workflow:

  1. add argument save_pages_directory, which you can set if you want to save automatically the individual pages. There are many options one could fine tune this.
  2. add second argument save_only which if TRUE only saves and returns a character string containing the directory name whee the pages are saved - this would make it possible, to use a temporary directory, and does not try to concatenate all results.

Regarding the format of saving: I was also thinking of arrow - and I would personally use it, but I do not think it is suitable for the default. That is why I included writeRDS(). But arrow sliced by page would be perfect. One could always, if preferred, re-slice it later by e.g. year.

To offer the flexibility of saving in different formats, one could introduce in oa_request()an argumentsave_function = saveRDS()which specifies how to save. This function should have the signaturefunction(x, file, ...)where onlyxis andfile` is used. This also gives the possibility for more complex saving mechanisms.

But these are many options, and I think that where and how these parameter / options are set should be discussed in the new discussion.

One final point: I see the pages argument as a power user feature, and the oa_fetch() as a "normal" user command. I would suggest to consider leave oa_fetch() as it is, and put the new features in the function oa_request(), and mention to use oa_query() |> oa_request() instead of oa_fetch() when power user features are needed.

@trangdata
Copy link
Collaborator Author

I have to get the number of pages (an example should show how) and then iterate through the pages.

I agree. We should include an example to calculate the expected number of pages. Perhaps you could contribute a vignette on "Paging control" given your use case?

I would suggest to consider leave oa_fetch() as it is, and put the new features in the function oa_request(), and mention to use oa_query() |> oa_request() instead of oa_fetch() when power user features are needed.

I think we want to stay consistent with the behavior of oa_fetch in general, i.e. it's simply a wrapper around oa_query and oa_request. It inherits almost all parameters from these lower-level functions. Again, a conversation about which "power user" params should go into options should continue in #182, but I'm currently against making any parameters exclusive to oa_request.

On your larger point of providing an option for saving the individual pages, I still want to keep any serialization/IO stuff outside of the package. Especially with the flexibility that we would potentially have to support: directory, file name/type, save functions, number of pages to save in each iteration, and other options like you said, I think it's best to leave this on the user's side of things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Download large number of works - automatic slicing?
3 participants