
No support for *.op.gz files in reformat_GSOD()? #76

Closed
taraskaduk opened this issue Dec 19, 2019 · 13 comments
Assignees
@taraskaduk

I may be confusing something here, but I think that a while back, reformat_GSOD() worked with *.op.gz files obtained from the NOAA FTP server.
It seems this is no longer the case. Has something changed?

This is not an issue per se, it's a point of confusion. I couldn't rerun my code from a year ago, and I'm trying to see what changes I need to make to get it working again.

@adamhsparks
Member

You're not crazy. I'm sorry, it did. When NCEI changed the format of the data they serve, I had to update GSODR a few months ago, and reformat_GSOD() uses the same code behind the scenes as get_GSOD(). So it no longer supports the .op.gz format, because NCEI now serves .gz files of .csv files.

@taraskaduk
Author

I figured as much after poking around this repo. Thanks. I repointed my code to load data from the other source rather than from the FTP server, but ran into a different issue, and now I'm trying to simply use get_GSOD() for consistency. (My previous workflow avoided get_GSOD() and downloaded full-year archives instead, because get_GSOD() was erroring out and I couldn't figure out why.)

@taraskaduk
Author

I remember my problem with get_GSOD() now. It takes very long to run for many stations (global scale) over many years (e.g. 10). I found that a curl_download() of the entire year's archive of all the data is a much faster route.

In the old process, when I downloaded data from the FTP server, each .op.gz file was named stationid-year.op.gz. In the current source that I assume you point to, the .csv files are named after the station only, with no year. I now just need to figure out a renaming process so that files from different years don't overwrite each other.

get_GSOD() would be a much more straightforward way to pull the data; it's a shame it doesn't work well for me.
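One possible sketch of that renaming step, in base R (the paths and year range here are illustrative assumptions, not part of GSODR): since each annual archive unpacks into its own folder, the year can be appended to each station file's name before moving the files into one directory.

```r
# Sketch: append the year to each station CSV so files from
# different years don't overwrite each other. The directories
# and year range are hypothetical examples.
dest_dir <- "data/gsod"
dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

for (yr in 2010:2019) {
  src_dir <- file.path(tempdir(), yr)
  files <- list.files(src_dir, pattern = "\\.csv$", full.names = TRUE)
  # e.g. "010010-99999.csv" -> "010010-99999-2010.csv"
  new_names <- paste0(tools::file_path_sans_ext(basename(files)), "-", yr, ".csv")
  file.copy(files, file.path(dest_dir, new_names))
}
```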

@adamhsparks
Member

adamhsparks commented Dec 19, 2019

Ah, interesting. It's faster to download the entire annual archive and sort it out afterwards? How many stations are you fetching at once?

I hadn't considered this case. I thought users would either download everything or just a few selected stations, not many.

@taraskaduk
Author

taraskaduk commented Dec 19, 2019

My case is exactly in between: I pull 10 years' worth of data for every station within a 25 km radius of every major city in the world.
What I do instead is download the full-year archives, delete the files of the stations that I don't need, and then process the remaining station files with reformat_GSOD().

Here is the function that does it all for me:

# Requires: lubridate (year, today), curl, purrr, magrittr (%>%), GSODR
library(lubridate)
library(magrittr)

get_weather <- function(yrs = seq(year(today()) - 11, year(today())),
                        stns = stations_v) {
  # Download each annual archive (skipping years already on disk) and unpack it
  for (yr in yrs) {
    file <- paste0(yr, '.tar.gz')
    destfile <- paste0('data/gsod/', file)
    if (!file.exists(destfile)) {
      link <- paste0('https://www.ncei.noaa.gov/data/global-summary-of-the-day/archive/', file)
      curl::curl_download(link, destfile)
    }
    untar(destfile, exdir = paste(tempdir(), yr, sep = "/"))
  }

  # Go through all unpacked files and decide what to remove and what to keep
  # based on the stations of interest
  files_all <- list.files(path = tempdir(), pattern = "\\.csv$",
                          recursive = TRUE, full.names = FALSE)

  # Cartesian join of all stations of interest and all years,
  # yielding relative paths like "<year>/<station>.csv"
  files_stations <-
    purrr::cross(list(x1 = paste0(yrs, "/"), x2 = paste0(stns, ".csv"))) %>%
    purrr::map(purrr::lift(paste0)) %>%
    purrr::as_vector()

  files_keep <- subset(files_all, files_all %in% files_stations)

  # Transform weather data ----------------------------------------------------
  out <- GSODR::reformat_GSOD(file_list = paste(tempdir(), files_keep, sep = "/"))
  unlink(tempdir(), force = TRUE, recursive = TRUE)
  out
}
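For reference, a minimal usage sketch of the function above (the station IDs here are illustrative placeholders; real IDs would come from NOAA's station history):

```r
# Hypothetical call: pull the two most recent complete years
# for two example stations.
library(lubridate)
stations_v <- c("010010-99999", "010014-99999")  # placeholder IDs
weather <- get_weather(yrs = c(year(today()) - 2, year(today()) - 1),
                       stns = stations_v)
```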

@taraskaduk
Author

Here is the repo; the code snippet above is from functions.R:

https://github.com/taraskaduk/weather

@adamhsparks
Member

Awesome, thanks! I'll have a look and see if I can improve the package.

@taraskaduk
Author

Sweet! Let me know if I can contribute in any way.

@adamhsparks adamhsparks self-assigned this Jan 10, 2020
@adamhsparks adamhsparks reopened this Jan 10, 2020
@adamhsparks
Member

adamhsparks commented Jan 11, 2020

I've updated the internal functionality to check how many requests are being made. If the number of stations is greater than 10, GSODR will download the entire global annual file and sort out the needed station files locally. If there are 10 or fewer stations, it will download each requested station individually.

I ran a few tests to find the point at which individual requests become slower than downloading the whole archive. The number is not exact due to things I can't control (Internet conditions), but this should make it faster in most cases to request a large number of stations that is neither ALL stations nor just a few.

if (is.null(station) | length(station) > 10) {
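In context, that condition dispatches between the two download strategies. A hedged sketch of the logic described above, with hypothetical helper names standing in for GSODR's actual internals:

```r
# Sketch of the request-size dispatch described above.
# download_annual_archive() and download_station() are
# illustrative stand-ins, not real GSODR functions.
fetch_gsod <- function(years, station = NULL) {
  if (is.null(station) | length(station) > 10) {
    # All stations, or more than 10: fetch each global annual
    # archive once and filter to the requested stations locally.
    download_annual_archive(years, keep = station)
  } else {
    # 10 or fewer stations: individual downloads are faster.
    lapply(station, download_station, years = years)
  }
}
```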

@taraskaduk
Author

taraskaduk commented Jan 12, 2020

Sweet! I'll make sure to drop my extra piece of code that pulls the full archives from my analysis on my next update.

@adamhsparks
Member

Thank you for letting me know about the bottleneck. Are you OK with me adding you as a contributor to the package for the ideas/input?

@taraskaduk
Author

taraskaduk commented Jan 13, 2020

I would be honored! To be honest, I'd open a pull request for this item (and a couple of others), but I've never done a PR before and was afraid of messing things up 🤷‍♂️

@adamhsparks
Member

adamhsparks commented Jan 13, 2020 via email
