iterate_next_request() -> resp_next_request() #341
Actually, since we're introducing […].

This would mean changing the arguments so that it takes a request as the first argument, and maybe automatically attaching the parsed data and `next_request` call back to that request. That possibly implies that registering a parser is a more general feature, which we could possibly implement lazily, i.e. you'd register a parser in the request (with `req_parser()`?), and then call `resp_parsed()` (or maybe `resp_body_parsed()`) to access the parsed body. We'd need to add an environment to the response so that we could do the parsing once and cache it. And then maybe have some helper that does `resp_parse(req_perform(req))`? And something similar for `req_perform_parallel()`? That would make `req_perform_iterate()` fit in better.

@mgirlich these are very quick and rough notes, but does this make sense to you?
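To make the lazy-parsing idea concrete, here's a minimal sketch of the parse-once-and-cache mechanism, assuming a hypothetical `resp_parsed()` and a plain list standing in for the response object (none of these names or fields are real httr2 API):

```r
# Sketch only: parse on first access, reuse the cached value afterwards.
resp_parsed <- function(resp) {
  cache <- resp$cache
  if (!exists("parsed", envir = cache, inherits = FALSE)) {
    cache$parsed <- resp$parser(resp)  # the expensive parse happens once
  }
  cache$parsed
}

# Stand-in "response" carrying a cache environment and a registered parser
resp <- list(
  body = '{"data": [1, 2, 3]}',
  parser = function(resp) jsonlite::fromJSON(resp$body),
  cache = new.env(parent = emptyenv())
)

resp_parsed(resp)  # parses the body
resp_parsed(resp)  # returns the cached result
```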
I agree that it might make sense to add a separate […]. And we definitely need a helper like […]. Automatically parsing and […].
It currently takes a request as the first argument. Did you mean response?
Yeah, sorry, a response as the first argument.
Maybe we need […]. (Also I think it makes sense for […].)
Or maybe we could do iteration by making […]:

# cursor pagination; metadata in body
request("https://example.com/cursor") %>%
req_url_query(per_page = 100) %>%
req_parse(function(resp, req) {
json <- resp_body_json(resp)
iterative_response(
data = json$data,
next_req = req %>% req_url_param(resp_link, cursor = json$next_cursor),
total = json$count
)
})
# keyset/seek pagination; metadata in headers
request("https://example.com/keyset") %>%
req_parse(function(resp, req) {
data <- resp_body_json(resp)
iterative_response(
data = data,
next_req = req %>% req_url_param(after_id = max(data$id)),
total = resp %>% resp_header()
)
})
# HATEOAS pagination (e.g. GitHub); metadata in headers
request("https://api.github.com/repositories") %>%
req_url_query(per_page = 100) %>%
req_parse(function(resp, req) {
iterative_response(
data = resp_body_json(resp),
next_req = req %>% req_url(resp_link_url(resp, "next")),
total = resp %>% resp_link_url("last") %>% url_parse() %>% .$query$page
)
})

Offset pagination doesn't work well with this framing, but that might make sense because offset pagination is random access, so it's really a better fit for […].
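By contrast, here's a rough sketch of offset pagination treated as random access: once the total is known, every request can be built up front and performed with `req_perform_parallel()` (the endpoint, parameter names, and known total are assumptions for illustration):

```r
library(httr2)

page_size <- 100
total <- 1000  # assumed to be known already, e.g. from a first request

reqs <- lapply(seq(0, total - 1, by = page_size), function(offset) {
  request("https://example.com/offset") |>
    req_url_query(limit = page_size, offset = offset)
})

# every page is independent, so the requests can run in parallel
resps <- req_perform_parallel(reqs)
```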
Best article I found on pagination was https://ignaciochiazzo.medium.com/paginating-requests-in-apis-d4883d4c1c4c, as it includes a bunch of links to various styles in popular APIs. From the Hacker News comments:
Good write-up of why offset pagination is suboptimal on the backend: https://use-the-index-luke.com/no-offset
Having […].
I agree that it's a bit less discoverable, but it's simpler and more general, and I think we could make up the difference with docs.
We could also add […]. And do you have an idea how we want to handle page/offset pagination? They are not really iterative, but you need to do the first request to find out how many pages there are, so creating a list of requests also doesn't work nicely.
Let me noodle on this a bit more. I had forgotten about the wrinkle of requiring one request to figure out the total number of pages: it's like an iterative request that then spawns parallel requests. It also feels like there's maybe some connection to tidyverse/rvest#193, where you want to crawl/spider a bunch of pages. There it makes more sense to think about a queue where each request could add one or more pages to the queue, which are then potentially processed in parallel. That is a less common pattern for APIs, but a queue is a nicely general structure.
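A rough sketch of that queue idea (purely illustrative, not httr2 API): each processed response can push zero or more follow-up requests onto the queue.

```r
crawl <- function(first_req, perform, next_reqs) {
  queue <- list(first_req)
  resps <- list()
  while (length(queue) > 0) {
    req <- queue[[1]]
    queue <- queue[-1]
    resp <- perform(req)
    resps[[length(resps) + 1]] <- resp
    # next_reqs() returns a (possibly empty) list of new requests to enqueue
    queue <- c(queue, next_reqs(resp, req))
  }
  resps
}
```

A real crawler would also want de-duplication, and could drain the queue in parallel batches.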
Obviously we'd want some higher-level wrapper, but maybe the internals could look something like this:

page_size <- 100

# Paginated query
request("https://example.com/paginated") %>%
req_url_query(per_page = page_size) %>%
req_parse(function(resp, req) {
json <- resp_body_json(resp)
index <- seq_len(ceiling(json$count / page_size))
pages <- lapply(index, function(i) {
req %>%
req_parse(\(resp) resp_body_json(resp)$data) %>%
req_url_param(resp_link, index = i)
})
multi_response(
data = json$data,
responses = pages
)
})
I think I've swung back towards the idea of providing a way to register a parser so that you can call:

req <- req_parse_resp(req, resp_body_json)
resp <- req_perform(req)
# actually parses:
resp_body(resp)
# uses cached value:
resp_body(resp)

Alternatively we could just make each of the existing […]. The next change I'd suggest is that instead of providing the callback to generate the next request in […], you'd pass it to `req_perform_iterative()`:

req <- request("https://example.com/cursor") |>
req_url_query(per_page = 100) |>
req_parse_body(resp_body_json)
# cursor
resps <- req_perform_iterative(req, function(resp, req) {
body <- resp_body(resp)
if (is.null(body$next_cursor)) {
return()
}
req |> req_url_param(cursor = body$next_cursor)
})
# indexed
resps <- req_perform_iterative(req, function(resp, req) {
body <- resp_body(resp)
index <- seq_len(ceiling(body$count / page_size))[-1]
lapply(index, \(i) req |> req_url_param(resp_link, index = i))
})

This formulation still gives us the ability to write wrappers for common situations, e.g.:

paginate_cursor <- function(param_name, param_value) {
check_string(param_name)
check_function2(param_value, "resp")
function(resp, req) {
value <- param_value(resp)
if (is.null(value)) {
return()
}
req |> req_url_param(!!param_name := value)
}
}
resps <- req_perform_iterative(req, paginate_cursor("cursor", \(body) body$cursor))

Then […]:

resps_combine <- function(resps, data) {
check_function2(data, "body")
resps <- resps_responses(resps)
body <- lapply(resps, resp_body)
data <- lapply(body, data)
vctrs::vec_c(!!!data)
}
resps_is_resp <- function(resps) {
vapply(resps, is_response, logical(1))
}
resps_responses <- function(resps) {
resps[resps_is_resp(resps)]
}
resps_errors <- function(resps) {
resps[!resps_is_resp(resps)]
}

That would make error handling much more general because we just put the tools in the hands of the user.
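For example, a usage sketch assuming the helpers above and that failed requests come back as error objects in the result list:

```r
resps <- req_perform_iterative(req, paginate_cursor("cursor", \(body) body$cursor))

data   <- resps_combine(resps, \(body) body$data)  # errors are dropped before combining
errors <- resps_errors(resps)                      # inspect the failures separately
```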
Open questions: […]
Some more questions for […].
Thanks for taking a look! Since you're on board with the plan, I'll start turning it into PRs. Currently I'd think that […]. I hadn't thought about […].
Started an implementation and it feels really good. Because the […]:

iterate_by_page <- function(param_name, start = 1, offset = 1) {
check_string(param_name)
check_number_whole(start)
check_number_whole(offset, min = 1)
i <- start
function(resp, req) {
old_i <- i
i <<- i + offset
req %>% req_url_query(!!param_name := old_i)
}
}

We couldn't use a technique like this before because the current page state would be shared across all requests.
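A quick usage sketch of the factory above, showing the counter living in the closure (the endpoint and the `resp = NULL` shortcut are assumptions, and the factory's helpers such as check_string(), the magrittr pipe, and rlang's `:=` injection are assumed to be in scope):

```r
library(httr2)

req <- request("https://example.com/paged")

next_page <- iterate_by_page("page")
req1 <- next_page(resp = NULL, req = req)  # adds page=1
req2 <- next_page(resp = NULL, req = req)  # adds page=2; state advanced inside the closure

fresh <- iterate_by_page("page")
req3 <- fresh(resp = NULL, req = req)      # independent iterator, starts at page=1 again
```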
New `req_perform_iteratively()` takes the callback and returns a list of responses. Paired with `iterate_with_offset()` and friends to do the iteration, and `resps_combine()` and friends to work with the results. Fixes #341. Fixes #298. Co-authored-by: Maximilian Girlich <maximilian.girlich@metoda.com>
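For reference, a rough sketch of how that API might look in use, with the function names as given above (exact arguments, and whether they match what finally shipped, are assumptions):

```r
library(httr2)

req <- request("https://example.com/paged") |>
  req_url_query(per_page = 100)

# iterate_with_offset() supplies the next-request callback
resps <- req_perform_iteratively(req, iterate_with_offset("page"))

# combine the parsed bodies into a single result
data <- resps_combine(resps, \(body) body$data)
```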