multiple sources (wish) #80

Open

mdsumner opened this issue Mar 1, 2019 · 6 comments

@mdsumner (Collaborator) commented Mar 1, 2019

Old code experiment

# Tidy file collection
#
# Tidy files include a trick to only store the full metadata of the
# first file in the collection. This is extracted using `ncmeta::nc_meta` in
# a form that may change substantially. The current plan is to always
# have an *estimable* set of metadata using the first file as a sample,
# but to be able to scan the full metadata as required.
# This is based on the model where a set of files *do match*, or can be expected
# to match, in every shape other than the *unlimited* dimension, the one
# that is dispersed over files (usually "time").
# @param x raadfiles, data frame with "fullname", or character vector
# @param all scan metadata from all files?
# @param ... ignored
#
# @return data frame of filename "fullname" and "meta" in a list-col
# @noRd
#
# @examples
# # library(dplyr)
# # files <- raadfiles::oisst_daily_files() %>% dplyr::filter(as.Date(date) > (Sys.Date() - 40))
# # f <- tidyfile(files, all = FALSE)
tidyfile <- function(x, all = FALSE, ...) {
  UseMethod("tidyfile")
}
# @name tidyfile
# @noRd
tidyfile.data.frame <- function(x, all = FALSE, ...) {
  if (is.null(x$fullname)) stop("x must have at least column 'fullname'")
  ## stop here rather than re-dispatching to this method forever
  if (!all(file.exists(x$fullname))) stop("all files in 'fullname' must exist")
  class(x) <- c("raadfiles", class(x))
  tidyfile(x, all = all)
}
# @name tidyfile
# @noRd
tidyfile.raadfiles <- function(x, all = FALSE, ...) {
  if (all) {
    metalist <- tidyfile(x$fullname)
  } else {
    ## scan metadata for the first file only, placeholders for the rest
    ml <- tidyfile(x$fullname[1L])
    metalist <- c(ml, vector("list", nrow(x) - 1L))
  }
  structure(dplyr::mutate(x, fullname = x$fullname,
                          meta = metalist),
            class = c("tidyfile", "tbl_df", "tbl", "data.frame"))
}
# @name tidyfile
# @noRd
tidyfile.character <- function(x, all = FALSE, ...) {
  purrr::map(x, ncmeta::nc_meta)
}

# # @name tidync
# # @noRd
# tidync.tidyfile <- function(x, ...) {
#   meta <- ncmeta::nc_meta(x$fullname[1L])
#   out <- structure(list(file = x,
#                         shape = shapes(meta),
#                         dimension = dimensions(meta)),
#                    class = "tidync")
#   activate(out, shapes(meta)$shape[1])
# }
# # @name tidync
# # @noRd
# tidync.data.frame <- function(x, ...) {
#   tidync(tidyfile(x))
# }
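For reference, the intended usage from the @examples above would look like this (assuming the raadfiles package and its configured file collection are available):

library(dplyr)
## take roughly the last 40 days of OISST daily files and build a
## collection, scanning metadata from the first file only (all = FALSE)
files <- raadfiles::oisst_daily_files() %>%
  dplyr::filter(as.Date(date) > (Sys.Date() - 40))
f <- tidyfile(files, all = FALSE)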

@aweyant commented Oct 6, 2021

Hello,

I am constantly working with multi-file netCDF datasets, and often have questions that span long time periods.

Suppose, as in my case, that you have netCDF files (which are stratified by time) with dimensions lat, lon, and time. To be clear, the file directory would look something like this:

6km_reanalysis_194801.nc
6km_reanalysis_194802.nc
...
6km_reanalysis_201712.nc

A goal one might have is to apply a function across time (the unlimited dimension) at each lat, lon point, on one or more variables.

At the moment, I do not think this is possible with tidync (and I am not aware of a "clean" and efficient way of doing it in R). If the files were small enough, the solution would be to first concatenate the files, then proceed by calling tidync on a single file.

I am wondering if it is possible for tidync to have a lazy-loading solution to this problem. In essence, a tidync equivalent of xarray's open_mfdataset() and its lazy evaluation.

If tidync worked as I am suggesting, the first argument could be a vector of file names (in our case, 6km_reanalysis*.nc). A tidync object would be created by calling tidync() on this vector, with metadata derived mostly from the first file (as suggested above). At this point, no data would be loaded.

Now, suppose I have a function written to work on a dataframe grouped by location (unique lat, lon pairs in the case of the .nc file). That is to say, if I had a dataframe with columns lat, lon, time, var1, var2, and var3, my function would output a dataframe with columns lat, lon, out_var1, and out_var2. Notice that the time dimension has "collapsed" here, but geographic points are still kept distinct.

The problem is that the data are split into multiple files over time. To construct such a dataframe for a few geographic points, we would currently need to loop through all the file names, extract the .nc data as tbls or dataframes, then bind them together. This involves a lot of reading and writing and dirty code. The size of the geographical chunks would need to depend on the resolution, the number of input variables and output variables, the temporal resolution, and the timespan.

Ideally, we could create a tidync object from a list of file names, lazily command that it be made into a tbl/df, then apply a function to this tbl/df (with sequential pipes). In this process, only small geographical chunks would be converted to a tbl/df at a time, and the function is applied to these chunks (sequentially or simultaneously on different cores). The result would be a tbl with one dimension collapsed or reduced. In the case of time, there might not be a time column anymore, or time might now represent the end of a time range over which a function was applied (think of a climatological summary).
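To make the idea concrete, here is a rough, untested sketch of that chunked pattern using only the current API - a loop over hypothetical lon bands, with fl_names standing in for the file paths and f() for the time-collapsing function:

library(dplyr)
library(tidync)

## hypothetical inputs: fl_names (netCDF paths), f() (collapses time
## within a chunk), and the lon band edges that define the chunks
lon_breaks <- seq(-126, -114, by = 2)
chunks <- vector("list", length(lon_breaks) - 1L)
for (j in seq_along(chunks)) {
  lo <- lon_breaks[j]
  hi <- lon_breaks[j + 1L]
  ## read only this lon band from every file, then row-bind across time
  band <- bind_rows(lapply(fl_names, function(fn) {
    tidync(fn) %>%
      hyper_filter(lon = lon >= lo & lon < hi) %>%
      hyper_tibble()
  }))
  chunks[[j]] <- f(band)  ## collapse the time dimension for this chunk
}
result <- bind_rows(chunks)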

I ask whether it is technically possible for tidync to behave this way. I am ignorant of the low-level workings of R, so I might be asking too much. However, xarray has shown me that such things can be done in Python. Unfortunately, I do not think any of the Python plotting libraries approach the quality of ggplot2, and dplyr is likewise unmatched. I would be over the moon if R had a good way of dealing with multi-file netCDF datasets.

Many thanks.

@mdsumner (Collaborator, Author) commented Oct 6, 2021

thanks for the detailed questions, I think I understand what you're asking - fully lazy and chunked processing is a lot of work, but I think you can achieve what you're after with a bit of custom application.

I'd like to know what workflow you have atm - are you getting the output as hyper_tibble()? Are you applying hyper_filter(), or do you ultimately want to use the entire file? Could you show the process with a "time series" that involves just two or three files? What would you do now with hyper_tibble() and/or hyper_filter() if your times series actually was short? That will give me a sense of what's needed, you can run the entire thing for the full workflow and just say "scale that up to 10000 files", for example.

@aweyant commented Oct 16, 2021

Hello,

I have created an example analogous to my use case.

I have a function that works on a dataframe/tibble which summarizes a variable (currently just one variable) in terms of "events". These are defined as streaks of timesteps in which the selected variable surpasses a threshold. In this example, a precipitation event is a streak of hours in which hourly precipitation surpasses a rate of 1 mm/hr.

These events are summarized with their starting time, their cumulative magnitude (e.g. total precip), maximum magnitude (e.g. max hourly precip rate), and length.

The netCDF files with my data are broken along the time dimension. In this example, there is a file for each day and each day has a timestep for every hour.

In order to create precipitation events, one must consider multiple files at once. In the case of precipitation in the Western US, one would need to consider as many files as there are days in a rainy season (this depends on your exact location and the purpose of your study, but the most you would need to consider is 366 files, with the cutoff on a day in which precipitation is very unlikely).

Here is my create_events() function, which takes in a "raw" df/tibble and returns a summary of events, as described above:

library(tidyverse)
library(tidync)

create_events <- function(df,
                          metadata_coords,
                          unique_id_coords,
                          event_var,
                          event_func = NULL, # currently unused
                          event_var_threshold) {
  df <- df %>%
    # mark time steps where the event variable exceeds the threshold as "active"
    mutate(active = as.numeric(.data[[event_var]] > event_var_threshold)) %>%
    # uniquely identify geographic points by the specified combination of coords
    unite("unique_id", {{ unique_id_coords }}, remove = FALSE) %>%
    group_by(unique_id) %>%
    # number the runs of consecutive "active" steps (events) at each point
    mutate(event_number = {
      r <- rle(active)
      r$values <- cumsum(r$values) * r$values
      inverse.rle(r)
    }) %>%
    ungroup() %>%
    # summarize events at each point
    filter(active == 1) %>%
    group_by(interaction(unique_id, event_number)) %>%
    mutate(event_length = sum(active, na.rm = TRUE)) %>%
    summarize(
      unique_id = first(unique_id), # one row per event, not one per time step
      total = sum(.data[[event_var]]),
      max_rate = max(.data[[event_var]]),
      length = min(event_length),
      event_number = min(event_number),
      across({{ metadata_coords }}, ~min(., na.rm = TRUE))
    ) %>%
    ungroup()
  return(df[, -1]) # drop the ugly interaction() grouping column
}
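A quick self-contained check of create_events() on a toy tibble (all values made up) shows the intended output shape:

## two grid cells, six hourly steps each; hypothetical TP values
toy <- tibble(
  lon  = rep(c(-120, -119), each = 6),
  lat  = 40,
  time = rep(1:6, times = 2),
  TP   = c(0, 2, 3, 0, 1.5, 0,   0, 0, 5, 5, 0, 2)
)
create_events(toy,
              metadata_coords = c("lon", "lat", "time"),
              unique_id_coords = c("lon", "lat"),
              event_var = "TP",
              event_var_threshold = 1)
## expect four events, two per grid cell, each with its total, max_rate,
## length, event_number, and the coordinates of its first time step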

This is how I can currently apply the function to a tibble obtained from a for-loop of tidync over 31 files (only 1 month, or 31 * 24 timesteps).

path <- "~/Documents/era5/"
fl_names <- Sys.glob(paste0(path, "*.nc")) %>% sort()
# min lon, max lon, min lat, max lat for the west coast of the US,
# including all of Nevada
w_us_bnds <- c(-126, -114, 32, 49)

event_var <- "TP"
event_var_threshold <- 1 # mm
metadata_coords <- c("lon", "lat")
unique_id_coords <- c()


# Inefficiently (?) Create a Tibble from Multiple netCDF Files ------------
for (i in seq_along(fl_names)) {
  if (i == 1) {
    current_nc <- tidync(fl_names[i])
    #print(current_nc)

    big_tbl <- current_nc %>%
      hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
                   lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
      hyper_tibble(select_var = event_var)
  } else {
    current_nc <- tidync(fl_names[i])

    big_tbl <- big_tbl %>%
      bind_rows(current_nc %>%
                  hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
                               lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
                  hyper_tibble(select_var = event_var))
  }
}

big_tbl_event_summary <- big_tbl %>%
  create_events(metadata_coords = c("lon","lat","time"),
                unique_id_coords = c("lon", "lat"),
                event_var = event_var,
                #event_func,
                event_var_threshold = event_var_threshold)

This creates the desired output after a bit of a wait.

It would be great if I could do this with the following syntax:

event_summary_tbl <- tidync(fl_names) %>% # all file names passed at once
  hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
               lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
  hyper_tibble(select_var = event_var) %>%
  create_events(metadata_coords = c("lon", "lat", "time"),
                unique_id_coords = c("lon", "lat"),
                event_var = event_var,
                #event_func,
                event_var_threshold = event_var_threshold)

Going even further, it would be useful to not even need to materialize this intermediate event_summary_tbl in memory at all. If my goal is to summarize the properties of some subset of events at each grid cell, it would be nice to write

event_summary_tbl <- tidync(fl_names) %>% # all file names passed at once
  hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
               lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
  hyper_tibble(select_var = event_var) %>%
  create_events(metadata_coords = c("lon", "lat", "time"),
                unique_id_coords = c("lon", "lat"),
                event_var = event_var,
                #event_func,
                event_var_threshold = event_var_threshold) %>%
  group_by(unique_id) %>%
  summarize(
    lat = min(lat),
    lon = min(lon),
    median_length = median(length),
    q99_total = quantile(total, 0.99))

Do you think this would be technically feasible?

Three of the netCDF files and an example image of some of the output are available on this drive.

Many thanks for your time and interest!

@mdsumner (Collaborator, Author) commented:

honestly, it's probably the bind_rows that is taking all the time - but thanks for the example, I will try it out and try to explain more.

you could collect in a list and do the bind_rows() in one final step and that might be fast enough, untested code:

listof <- vector("list", length(fl_names))
for (i in seq_along(fl_names)) {
  current_nc <- tidync(fl_names[i])
  #print(current_nc)

  listof[[i]] <- current_nc %>%
    hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
                 lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
    hyper_tibble(select_var = event_var)
}

big_tbl <- bind_rows(listof)
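(An equivalent one-step version with purrr, also untested - map_dfr() maps over the files and row-binds the results at the end:)

big_tbl <- purrr::map_dfr(fl_names, function(fn) {
  tidync(fn) %>%
    hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
                 lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
    hyper_tibble(select_var = event_var)
})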

I hear you about better ways ... but realistically I probably can't achieve that without external help.

@BarbaraRobson commented:

If you create an ncml file that contains the metadata for your list of files, you can use that ncml file as your file name and otherwise use tidync as normal.
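For anyone trying this: an ncml aggregation file is a small XML document. A minimal, untested sketch (following Unidata's NcML "joinExisting" aggregation, and assuming the era5 directory from the example above) could be written from R like so:

## write a minimal NcML aggregation that joins the files along "time";
## whether the netCDF library underneath tidync reads NcML directly is
## not confirmed in this thread
writeLines(c(
  '<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">',
  '  <aggregation dimName="time" type="joinExisting">',
  '    <scan location="~/Documents/era5/" suffix=".nc" />',
  '  </aggregation>',
  '</netcdf>'
), "era5_aggregation.ncml")
## then, if the library supports it: tidync("era5_aggregation.ncml")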

@mdsumner (Collaborator, Author) commented:

oh wow, I always wondered about that but never saw it in action, that would be awesome to wrap up 👌
