<a name="top"></a>
<div style="width:1000px">

<div style="float:right; width:98px; height:98px;">
<img src="https://cdn.miami.edu/_assets-common/images/system/um-logo-gray-bg.png" alt="Miami Logo" style="height: 98px;">
</div>

<div style="float:right; width:98px; height:98px;">
<img src="https://media.licdn.com/dms/image/C4E0BAQFlOZSAJABP4w/company-logo_200_200/0/1548285168598?e=2147483647&v=beta&t=g4jl8rEhB7HLJuNZhU6OkJWHW4cul_y9Kj_aoD7p0_Y" alt="STI Logo" style="height: 98px;">
</div>


<h1>Data Download</h1>
By: Kayla Besong, PhD
    <br>
Last Edited: 01/09/24
<br>
<br>    
<br>
This code is designed to download fire weather variables from multiple reanalysis and forecast products, including the HRRR, NAM, NARR, CONUS404, NCEP Reanalysis II, and ERA5. A complementary notebook, 'Data_Grab_Functions.ipynb', hosts the suite of functions tailored to each product's download process that this notebook leverages. Each product has its own input parameters, such as the dates the product is available for, time steps, and variables. This notebook was designed with the intention that those inputs be changed depending on user need. Below, it is tailored to fire weather metrics including temperature, u- and v-wind components, relative humidity, soil moisture, planetary boundary layer height (mixing height), CAPE, and either precipitation accumulation, precipitation rate, or both. If additional or different variables are desired, it is often necessary to find the level or surface your variable is on and align it with the examples below. Each function generates a file tree and saves the files to it; the only object returned is a list of unavailable files for your specified dates/hours/variables. Lastly, be patient: the servers you are requesting files from can be slow. Approximate run times for example cases are provided in each section.
    
<div style="clear:both"></div>
</div>

<hr style="height:2px;">

### The integral notebook of functions to run

In [1]:
%run ../../Universal_Functions/Data_Grab_Functions.ipynb

# HRRR

source: AWS or Google
<br>
file type: zarr or grib2
<br>
zarr dates available from: 2016-08-23 to Present; grib2 dates = 2014-Present
<br>
analysis time steps: 0 to 23hr by 1 (not all always available) 
<br>
domain: all of the HRRR domain is downloaded covering most of North America
<br>
<br>
Resources on HRRR AWS Zarr:
<br>
https://registry.opendata.aws/noaa-hrrr-pds/
<br>
https://hrrrzarr.s3.amazonaws.com/index.html
<br>
https://mesowest.utah.edu/html/hrrr/
<br>
<br>
Resources on HRRR Google Cloud:
<br>
https://console.cloud.google.com/marketplace/product/noaa-public/hrrr?project=python-232920&pli=1
<br>
<br>
Estimated time to run zarr: 1-year, 4x daily, 9 vars = ~18 hours; ~0.5 TB
<br>
<br>
output:
<br>
1. A list of all missing or incomplete files
2. A file tree structured HRRR/variable/hrrr_variable_YEARMONDAY_HR.nc


On AWS Zarr, the files are stored as analysis (F00) and forecast (F01-FXX). Of the 2016-08-23 to present archive, only the analysis is available until mid-2018. There is another bucket that stores the grib2 files, but using it would require simple-caching as in the code for UFS (see the function script if curious). The second, grib2-Google script was developed because of this lack of forecast data in zarr format. So to get anything such as precipitation for F01 (blank for F00) prior to 2018, you need to use the Google method, which treats F00 as analysis and downloads F00 only by default unless you specify "fcst_hr_step" below.
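
As a rough illustration of how a date range and `hour_range` expand into the file tree listed under output, the sketch below builds the expected paths for one variable using only the standard library. The helper name `expected_hrrr_paths` is hypothetical, and reading `YEARMONDAY_HR` as `YYYYMMDD_HH` is an assumption; the actual naming is set inside `Data_Grab_Functions.ipynb`.

```python
from datetime import date, timedelta


def expected_hrrr_paths(start_date, end_date, hours, variable, output_dir="HRRR"):
    """Hypothetical helper: list the paths one variable would occupy in the file tree."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    paths = []
    day = start
    while day <= end:
        for hour in hours:
            # assumed stamp layout: YYYYMMDD_HH
            stamp = f"{day:%Y%m%d}_{int(hour):02d}"
            paths.append(f"{output_dir}/{variable}/hrrr_{variable}_{stamp}.nc")
        day += timedelta(days=1)
    return paths


paths = expected_hrrr_paths("2016-08-23", "2016-08-24", range(0, 24, 6), "TMP")
print(len(paths))  # 2 days x 4 hours per day
print(paths[0])
```

Comparing a list like this against what is on disk is one way to cross-check the list of missing files that the grabber returns.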

## AWS ZARR

In [None]:
### Variables are set up in a dictionary format as not all variables are stored with the same coordinates.
### The keys represent the 'level' or coordinate each variable is found on and are used in determining the path
### to obtain the data. Each variable of interest is stored in a list as the 'values' of the dictionary,
### pairing to the level 'keys' of the dictionary. 


variables = {'2m_above_ground': ['TMP', 'RH'],
             '10m_above_ground': ['UGRD', 'VGRD'],
             'surface': ['HPBL', 'CAPE', 'GUST'],
             '0m_underground': ['MSTAV']}
             

In [None]:
### inputs to the hrrr_zarr_grabber function are established here.
### start_date, end_date, and output_dir = strings
### hour_range = array of timesteps you would like to grab. For example, np.arange(0, 24, 6) will produce [00, 06, 12, 18] in the function. Leave in numerics.


start_date = '2016-08-23'
end_date = '2018-12-31'
hour_range = np.arange(0, 24, 6)
output_dir = 'HRRR'

In [None]:
%%time

### Calling the HRRR grabber function to download all data of interest. The only item that is returned is a
### list of files that do not exist (missing or otherwise) for one to be aware of and double check on.


### OPTIONAL ####
### choose if you want the analysis (i.e. F00) or the forecast (F01-FXX). For some variables such as PRATE, only forecast is available. 
### The input time/hour will be the forecast init time. Function returns all forecasted hours. 

non_exist_hrr = hrrr_zarr_grabber(start_date, end_date, hour_range, variables, output_dir, forecast = None)

## Google Grib2

In [None]:
### Variables are set up in a dictionary format as not all variables are stored with the same coordinates.
### The keys can be whatever label you want. Each value is a list of two items: a list of the variables of interest, followed by
### the 'filter_by_keys' dictionary required by xarray-cfgrib to obtain those variables.
### Each variable at a minimum requires 'typeOfLevel' while others require more such as below. If you add/change variables, some investigation
### as to which filters are needed may be required. Presently, you cannot provide multiple choices per key-value pair. For example, the first
### two entries are both on typeOfLevel: 'heightAboveGround' but require separate 'level' values; level: [2, 10] will return an empty xr.Dataset
### due to xarray-cfgrib not "liking" the different 'coordinates'. Best of luck!


variables = {'heightAboveGround2': [['t2m', 'd2m'], {'typeOfLevel': 'heightAboveGround', 'level': 2}],
             'heightAboveGround10': [['u10', 'v10'], {'typeOfLevel': 'heightAboveGround', 'level': 10}],
             'surface1': [['gust', 'blh', 'cape', 'prate', 'lsm'], {'typeOfLevel': 'surface', 'stepType': 'instant'}],
             'surface2': [['tp'], {'typeOfLevel': 'surface', 'stepType': 'accum'}],
             'depthBelowLandLayer': [['mstav'], {'typeOfLevel': 'depthBelowLand'}]}


precip_vars = {'surface1': [['prate'], {'typeOfLevel': 'surface', 'stepType': 'instant'}],
              'surface2': [['tp'], {'typeOfLevel': 'surface', 'stepType': 'accum'}]}
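
To make the one-value-per-filter rule concrete, here is a minimal sketch that turns each dictionary entry into the keyword arguments cfgrib accepts through `xr.open_dataset`. The helper `cfgrib_open_kwargs` is hypothetical, and how `hrrr_google_grabber` consumes the dictionary internally is an assumption. The sketch only builds the arguments, so it runs without a GRIB2 file.

```python
def cfgrib_open_kwargs(variables):
    """Hypothetical helper: yield (group, short_names, kwargs) for each dictionary entry."""
    for group, (short_names, filters) in variables.items():
        for key, value in filters.items():
            # one value per filter key; a list such as level: [2, 10] is rejected up front
            if isinstance(value, (list, tuple)):
                raise ValueError(f"{group}: '{key}' must be a single value, got {value!r}")
        kwargs = {"engine": "cfgrib", "backend_kwargs": {"filter_by_keys": dict(filters)}}
        yield group, short_names, kwargs


example = {
    "heightAboveGround2": [["t2m", "d2m"], {"typeOfLevel": "heightAboveGround", "level": 2}],
    "heightAboveGround10": [["u10", "v10"], {"typeOfLevel": "heightAboveGround", "level": 10}],
}

for group, names, kwargs in cfgrib_open_kwargs(example):
    # With a local GRIB2 file this would become: xr.open_dataset(path, **kwargs)[names]
    print(group, names, kwargs["backend_kwargs"]["filter_by_keys"])
```

Keeping the 2 m and 10 m fields in separate groups means each group is opened on its own, which avoids asking cfgrib to combine two different 'level' values in one dataset.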

In [None]:
### inputs to the hrrr_google_grabber function are established here.
### start_date, end_date, and output_dir = strings
### hour_range = array of timesteps you would like to grab. For example, np.arange(0, 24, 6) will produce [00, 06, 12, 18] in the function. Leave in numerics.


start_date = '2014-07-30'
end_date = '2018-12-31'
hour_range = np.arange(0, 24, 6)
output_dir = 'HRRR2'

In [None]:
%%time

### Calling the HRRR grabber function to download all data of interest. The only item that is returned is a
### list of files that do not exist (missing or otherwise) for one to be aware of and double check on.


### OPTIONAL ####
### choose if you want the analysis (i.e. F00) or the forecast (F01-FXX). For some variables such as PRATE, only forecast is available. 
### The input time/hour will be the forecast init time. Function returns all forecasted hours. 


non_exist_hrr = hrrr_google_grabber(start_date, end_date, hour_range, variables, output_dir, fcst_hr_step = None)

In [None]:
%%time

non_exist_hrr = hrrr_google_grabber(start_date, end_date, hour_range, precip_vars, output_dir, fcst_hr_step = [1])

# NAM

source: NCEI Data https://www.ncei.noaa.gov/data/north-american-mesoscale-model/
<br>
file type: grib, grib2 
<br>
dates available from: 2004-03-03 - 2020-05-15
<br>
hour time steps: 00, 06, 12, 18
<br>
lead times per time step: 00, 03, 06 (through 2012-12-31, and on 2020-05-15); 00, 01, 02, 03, 06 (2013-01-01 through 2020-05-14)
<br>
domain: all of the NAM domain is downloaded covering most of North America
<br>
<br>
Resources on NAM:
<br>
https://www.ncei.noaa.gov/products/weather-climate-models/north-american-mesoscale
<br>
<br>
Estimated time to run: 1-year, 4x daily, 6-11 vars = ~5.5 hours; ~72GB
<br>
<br>
output:
<br>
1. A list of all missing or incomplete files
2. A file tree structured NAM/variable/nam_variable_YEARMONDAY_HR_FCSTHR.nc

NOTE: to read grib/grib2 off of this server, this code stores temporary files that can be deleted manually later. Controlling where the files are cached has proven difficult, and this method may or may not work on your machine depending on how it is set up. Within the nam_grabber function in Data_Grab_Functions.ipynb there are 5 commented-out lines that delete the cached grib files as you go. If the cached files are causing a problem, you can uncomment these lines to delete them automatically. Before doing so, print 'file' to see where the cached files are being stored or generated: this method uses shutil.rmtree() and could delete things you do not want deleted.
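
One way to reduce the risk from `shutil.rmtree()` is to refuse any deletion outside a directory you have explicitly named as the cache root. The sketch below is not the code inside `nam_grabber`; `remove_cache_dir` is a hypothetical helper, demonstrated on throwaway temporary directories.

```python
import shutil
import tempfile
from pathlib import Path


def remove_cache_dir(target, allowed_root):
    """Delete `target` only if it is a directory strictly inside `allowed_root`."""
    target = Path(target).resolve()
    allowed_root = Path(allowed_root).resolve()
    if target == allowed_root or allowed_root not in target.parents:
        print(f"refusing to delete {target}: not inside {allowed_root}")
        return False
    if not target.is_dir():
        print(f"nothing to delete at {target}")
        return False
    shutil.rmtree(target)
    return True


# Demonstration on throwaway directories
root = Path(tempfile.mkdtemp())
cache = root / "grib_cache"
cache.mkdir()
(cache / "nam_example.grb2").write_text("placeholder")

print(remove_cache_dir(cache, root))        # target is inside the allowed root
print(remove_cache_dir(root.parent, root))  # target is outside the allowed root
```

Resolving both paths first means symbolic links and relative paths are compared by their real locations.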


In [None]:
### Variables are set up in a dictionary format as not all variables are stored with the same coordinates.
### The keys can be whatever label you want. Each value is a list of two items: a list of the variables of interest, followed by
### the 'filter_by_keys' dictionary required by xarray-cfgrib to obtain those variables.
### Each variable at a minimum requires 'typeOfLevel' while others require more such as below. If you add/change variables, some investigation
### as to which filters are needed may be required. Presently, you cannot provide multiple choices per key-value pair. For example, the first
### two entries are both on typeOfLevel: 'heightAboveGround' but require separate 'level' values; level: [2, 10] will return an empty xr.Dataset
### due to xarray-cfgrib not "liking" the different 'coordinates'. Best of luck!

### For NAM specifically, the variables change after an update that occurs in spring 2017. Thus, if you need data from before and/or after the update, different dictionaries
### of variables and their levels will need to be passed. You only have to input the ones you created; each is optional, but at least one is required.

In [None]:
variables_b4_04092017 = {'heightAboveGround2': [['t2m', 'r'], {'typeOfLevel': 'heightAboveGround', 'level': 2}],
                         'heightAboveGround10': [['u10', 'v10'], {'typeOfLevel': 'heightAboveGround', 'level': 10}],
                         'surface': [['tp'], {'typeOfLevel': 'surface', 'stepType': 'accum'}],
                         'depthBelowLandLayer': [['sm'], {'typeOfLevel': 'depthBelowLandLayer', 'shortName': 'sm'}]}


In [None]:
variables_grib2 = {'heightAboveGround2': [['t2m', 'r2'], {'typeOfLevel': 'heightAboveGround', 'level': 2}],
             'heightAboveGround10': [['u10', 'v10'], {'typeOfLevel': 'heightAboveGround', 'level': 10}],
             'surface': [['gust', 'hpbl', 'cape', 'lsm', 'hindex'], {'typeOfLevel': 'surface', 'stepType': 'instant'}],
             'surface_accum': [['tp'], {'typeOfLevel': 'surface', 'stepType': 'accum'}],
             'depthBelowLandLayer': [['soilw'], {'typeOfLevel': 'depthBelowLandLayer', 'shortName': 'soilw'}]}


In [None]:
### inputs to the nam_grabber function are established here. 
### start_date, end_date, and output_dir = strings
### hour_range = array of timesteps you would like to grab. For example, np.arange(0, 24, 6) will produce [00, 06, 12, 18] in the function. Leave in numerics.

start_date = '2017-11-27'
end_date = '2018-12-31'
hour_range = np.arange(0, 24, 6)
output_dir = 'NAM'


### OPTIONAL ####
### fcst_hr_step = list of integers. The default = None in the function, which will return only the ['00'] forecast for each init hour and date.
### Edit and uncomment the list of integers below to set your desired lead times if ['00'] is not sufficient. The list below is all that is available in thredds. Info above.

## note that if you are doing total precipitation the 0 hr fcst will be blank; a 1 hr forecast is needed to get a signal

### fcst_hr_step = [0, 1, 2, 3, 6] 

In [None]:
%%time
#non_exist_nam = nam_grabber(start_date, end_date, hour_range, output_dir, variables_grib = None, variables_grib2 =None, fcst_hr_step = None)

non_exist_nam = nam_grabber(start_date, end_date, hour_range, output_dir, variables_grib = variables_b4_04092017, variables_grib2 =variables_grib2, fcst_hr_step = None)

# NARR

source: THREDDS
<br>
file type: grib
<br>
dates available from: 1971-01-01 to 2014-10-02
<br>
hour time steps: 0 to 21 by 3 ([ 0,  3,  6,  9, 12, 15, 18, 21])
<br>
domain: all of the NARR domain is downloaded covering all of North America including Alaska and Hawai'i 
<br>
<br>
Resources on NARR:
<br>
https://www.ncei.noaa.gov/products/weather-climate-models/north-american-regional
<br>
<br>
Estimated time to run: 1-year, 4x daily, 10 vars = ~6 hours; ~6.6GB
<br>
<br>
output:
<br>
1. A list of all missing or incomplete files
2. A file tree structured NARR/variable/narr_variable_YEARMONDAY_HR.nc
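
Because each product offers different hour steps, it can help to check `hour_range` against the steps listed above before starting a long download. The following is a small sketch with a hypothetical helper, `check_hours`; it is not part of `Data_Grab_Functions.ipynb`.

```python
def check_hours(requested, available):
    """Split requested hours into those a product offers and those it does not."""
    available = {int(h) for h in available}
    ok = [int(h) for h in requested if int(h) in available]
    missing = [int(h) for h in requested if int(h) not in available]
    return ok, missing


narr_hours = range(0, 24, 3)  # 0 to 21 by 3, as listed for NARR

ok, missing = check_hours(range(0, 24, 6), narr_hours)
print(ok, missing)

ok_4h, missing_4h = check_hours(range(0, 24, 4), narr_hours)
print(ok_4h, missing_4h)
```

The same check works with `np.arange(0, 24, 6)` as the requested hours, since each value is converted to a plain integer.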


In [None]:
### variables are just a list of variable names, each with its associated level appended

variables = ['Temperature_height_above_ground', 'u-component_of_wind_height_above_ground', 'v-component_of_wind_height_above_ground', 'Relative_humidity_height_above_ground',
             'Soil_moisture_content_layer_between_two_depths_below_surface_layer', 'Planetary_boundary_layer_height_surface', 'Convective_Available_Potential_Energy_surface',
             'Convective_Available_Potential_Energy_layer_between_two_pressure_difference_from_ground_layer',
             'Total_precipitation_surface_3_Hour_Accumulation', 'Precipitation_rate_surface']

In [None]:
### inputs to the narr_grabber function are established here. 
### start_date, end_date, and output_dir = strings
### hour_range = array of timesteps you would like to grab. For example, np.arange(0, 24, 6) will produce [00, 06, 12, 18] in the function. Leave in numerics.

start_date = '2011-01-01'
end_date = '2011-12-31'
hour_range = np.arange(0, 24, 6)
output_dir = 'NARR'


In [None]:
%%time

### Calling the NARR grabber function to download all data of interest. The only item that is returned is a
### list of files that do not exist (missing or otherwise) for one to be aware of and double check on.

non_exist_narr = narr_grabber(start_date, end_date, hour_range, variables, output_dir)

# CONUS404

source: THREDDS
<br>
file type: netCDF
<br>
dates available from: 1979-10-01 to 2022-09-30; by water year --> starts in October of year N, ends in September of year N+1
<br>
hour time steps: 0 to 23 by 1 (all hours of day)
<br>
domain: all of the CONUS404 domain is downloaded covering all of the U.S. 
<br>
<br>
Resources on CONUS404:
<br>
https://rda.ucar.edu/datasets/ds559.0/
<br>
https://journals.ametsoc.org/view/journals/bams/104/8/BAMS-D-21-0326.1.xml
<br>
<br>
Estimated time to run: 1-year, 4x daily, 10 vars = ~4 hours; ~200GB
<br>
<br>
output:
<br>
1. A list of all missing or incomplete files
2. A file tree structured CONUS404/variable/conus404_variable_YEARMONDAY_HR.nc
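
Since the archive is organized by water year, a date such as 2016-01-01 belongs to the water year that began the previous October. Here is a minimal sketch of that mapping, following the "October of year N to September of year N+1" description above. The helper name is hypothetical, and how the archive itself labels each water year is not confirmed here.

```python
from datetime import date


def conus404_water_year_start(day):
    """Return N such that `day` falls in the water year running Oct of N to Sep of N+1."""
    return day.year if day.month >= 10 else day.year - 1


for d in (date(2015, 9, 30), date(2015, 10, 1), date(2016, 1, 1)):
    n = conus404_water_year_start(d)
    print(d, f"-> water year {n}-10-01 to {n + 1}-09-30")
```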


In [2]:
### variables are just a list of variables  

#variables = ['T2', 'U10', 'V10', 'TD2', 'SMOIS', 'PBLH', 'SBCAPE', 'MLCAPE', 'PREC_ACC_NC']
variables = ['SMOIS']

In [3]:
### inputs to the conus404_grabber function are established here. 
### start_date, end_date, and output_dir = strings
### hour_range = array of timesteps you would like to grab. For example, np.arange(0, 24, 6) will produce [00, 06, 12, 18] in the function. Leave in numerics.

start_date = '2016-01-01'
end_date = '2016-01-01'
hour_range = np.arange(0, 24, 6)
output_dir = 'CONUS404'


In [4]:
%%time

### Calling the CONUS404 grabber function to download all data of interest. The only item that is returned is a
### list of files that do not exist (missing or otherwise) for one to be aware of and double check on.

non_exist_conus404 = conus404_grabber(start_date, end_date, hour_range, variables, output_dir)

Struct() takes at most 1 argument (3 given)


all of 2016-01-01 00:00:00 has been saved
CPU times: user 3.75 s, sys: 971 ms, total: 4.72 s
Wall time: 15.6 s


# NCEP Reanalysis II 

source: THREDDS
<br>
file type: netCDF
<br>
dates available from: 1979 to present; BY YEAR 
<br>
domain: the domain is subset to 85N-0N, 180W-360W in this code to cover all of North America including Alaska and Hawai'i; global available
<br>
<br>
Resources on NCEP Reanalysis II:
<br>
https://www.ncei.noaa.gov/products/weather-climate-models/reanalysis-1-2
<br>
<br>
Estimated time to run: 1-year, 4x daily, 7 vars, 1 level each = ~ 30min; ~130MB
<br>
<br>
output:
<br>
1. A list of all missing or incomplete files
2. A file tree structured NCEP/variable/ncep_variable_YEAR.nc or ncep_variable_YEAR_LEVELmb.nc


In [None]:
### Variables are set up in a dictionary format as not all variables are stored with the same coordinates.
### The keys represent the 'level' or coordinate each variable is found on and are used in determining the path
### to obtain the data. Each variable of interest is stored in a list as the 'values' of the dictionary,
### pairing to the level 'keys' of the dictionary. 

variables = {'pressure': ['air', 'uwnd', 'vwnd', 'rhum'],
             'gaussian_grid': ['soilw.0-10cm.gauss', 'prate.sfc.gauss']}


In [None]:
### inputs to the ncep_grabber function are established here.
### start_date, end_date, and output_dir = strings
### domain == array of integers --> [N, S, E, W] presently set up for NA domain


start_date = '1979-12-01'
end_date = '2022-02-28'
output_dir = 'NCEP'
domain = [85, 0, 180, 360]

### OPTIONAL ####
### levels = list of float values. The default = None in the function which will return only the [1000] level for each year. 
### edit and uncomment the list of floats below to your desired levels. This is only for variables on pressure surfaces!! 

### levels = [1000.,  925.,  850.,  700.,  600.,  500.,  400.,  300.,  250.,  200., 150.,  100.,   70.,   50.,   30.,   20.,   10.] 


In [None]:
%%time

### Calling the NCEP grabber function to download all data of interest. The only item that is returned is a
### list of files that do not exist (missing or otherwise) for one to be aware of and double check on.
### warning: the opendap process with this function can be particular or slow 

non_exist_ncep = ncep_grabber(start_date, end_date, variables, output_dir, domain, levels = None)

# ERA5 on single levels 

source: copernicus
<br>
file type: netCDF
<br>
dates available from: 1940 to present
<br>
hour time steps: 0 to 23 by 1 (all hours of day)
<br>
domain: the domain is subset to 85N-0N, 180W-360W in this code to cover all of North America including Alaska and Hawai'i; global available
<br>
<br>
Resources on ERA5:
<br>
https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation
<br>
<br>
Estimated time to run: 1-year, 4x daily, 11 vars = ~30 min download, 1 hour waiting; ~7.5GB
<br>
<br>
output:
<br>
1. A file tree structured ERA5/era5_YEAR.nc


## This process uses dask and a multiprocess pool to expedite the download.
Getting compatible versions of the required packages may be difficult.
<br>

## You also need to install the Copernicus client.

Incorrect input will immediately lead to an error. Another immediate error, related to 'signing in' to the Copernicus CDS, may pop up. If you click the link and sign in, it should be fine. More instructions are available below:
<br>
Linux: https://cds.climate.copernicus.eu/api-how-to
<br>
Mac: https://confluence.ecmwf.int/display/CKB/How+to+install+and+use+CDS+API+on+macOS
<br>
Windows: https://confluence.ecmwf.int/display/CKB/How+to+install+and+use+CDS+API+on+Windows
<br>
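
The Copernicus client reads its credentials from a `.cdsapirc` file in your home directory containing a `url:` line and a `key:` line. The sketch below checks for that file and those two entries before a long request is started. `cdsapirc_problems` is a hypothetical helper, and the demonstration uses a throwaway file with placeholder values rather than real credentials.

```python
import tempfile
from pathlib import Path


def cdsapirc_problems(path=None):
    """Return a list of problems found with a .cdsapirc file (empty list = looks fine)."""
    path = Path(path) if path is not None else Path.home() / ".cdsapirc"
    if not path.is_file():
        return [f"{path} not found"]
    keys = set()
    for line in path.read_text().splitlines():
        if ":" in line:
            # keep only the entry name before the first colon
            keys.add(line.split(":", 1)[0].strip())
    return [f"missing '{k}:' entry" for k in ("url", "key") if k not in keys]


# Demonstration with a throwaway file and placeholder values
demo = Path(tempfile.mkdtemp()) / ".cdsapirc"
demo.write_text("url: <url-from-your-CDS-profile>\nkey: <your-key-here>\n")
print(cdsapirc_problems(demo))
```

Calling `cdsapirc_problems()` with no argument checks the file in your own home directory.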

In [None]:
### variables are just a list of variables

variables = ['10m_u_component_of_wind', '10m_v_component_of_wind', '2m_dewpoint_temperature',
    '2m_temperature'] 

#, 'boundary_layer_height', 'convective_available_potential_energy',
    # 'convective_precipitation', 'instantaneous_10m_wind_gust', 'large_scale_rain_rate',
    # 'total_precipitation', 'volumetric_soil_water_layer_1',]

In [None]:
### inputs to the era_grabber function are established here.
### start_date, end_date, and output_dir = strings
### hour_range = array of timesteps you would like to grab. For example, np.arange(0, 24, 6) will produce [00, 06, 12, 18] in the function. Leave in numerics.
### domain == array of integers --> [N, S, E, W] presently set up for NA domain

start_date = '1979-12-01'
end_date = '2022-02-28'
hour_range = np.arange(0, 24, 6)
output_dir = 'ERA5'
domain = [85, 0, 180*-1, 360]


In [None]:
%%time

### Calling the ERA5 grabber function to download all data of interest. Nothing is returned, files are just saved.
### warning: the download process with this function can be particular or slow at times 
### ''INFO Request is queued'' = you are waiting on their server

era_grabber(start_date, end_date, hour_range, variables, output_dir, domain)

### NOTE:
This file tree is not like the others: each yearly file lumps all variables together. To have the files re-organized as variable_year.nc, run the following function with 'infile_vars'. These are the variable names in the returned netCDF, which differ from the request variable names. You might have to peek at one file for the accurate names or check the docs.

In [None]:
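import os

import xarray as xr


def era_dir_maker(input_dir, variables):
    """Split each yearly ERA5 file in input_dir into one file per variable."""

    for files in os.listdir(input_dir):
        # yearly downloads are named era5_YEAR.nc; skip anything not starting with 'e'
        if files[0] == 'e':
            print(files)

            for v in variables:

                # dir_maker is assumed to be loaded from Data_Grab_Functions.ipynb
                dir_maker(f'{input_dir}/{v}')

                # files[0:4] = 'era5', files[5:9] = YEAR
                filename = f'{files[0:4]}_{v}_{files[5:9]}.nc'

                if filename in os.listdir(f'{input_dir}/{v}'):

                    print(f'{filename} has already been saved')

                else:

                    # the context manager closes the yearly file after writing
                    with xr.open_dataset(f'{input_dir}/{files}') as tfile:
                        tfile[v].to_netcdf(f'{input_dir}/{v}/{filename}')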



In [None]:
infile_vars = ['u10', 'v10', 'd2m', 't2m', 'blh', 'cape', 'cp', 'i10fg', 'lsrr', 'tp', 'swvl1']
era_dir_maker('ERA5', infile_vars)
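
The request names and the in-file names can be kept together in one lookup so the two lists do not drift apart. The pairing below is inferred from the order of the two lists in this notebook, so treat it as an assumption and confirm it against one downloaded file. The helper `infile_names` is hypothetical.

```python
# Request name -> variable name inside the returned netCDF (pairing assumed from list order)
ERA5_SHORT_NAMES = {
    '10m_u_component_of_wind': 'u10',
    '10m_v_component_of_wind': 'v10',
    '2m_dewpoint_temperature': 'd2m',
    '2m_temperature': 't2m',
    'boundary_layer_height': 'blh',
    'convective_available_potential_energy': 'cape',
    'convective_precipitation': 'cp',
    'instantaneous_10m_wind_gust': 'i10fg',
    'large_scale_rain_rate': 'lsrr',
    'total_precipitation': 'tp',
    'volumetric_soil_water_layer_1': 'swvl1',
}


def infile_names(request_variables):
    """Translate request variable names into the names expected inside the file."""
    unknown = [v for v in request_variables if v not in ERA5_SHORT_NAMES]
    if unknown:
        raise KeyError(f"no in-file name recorded for: {unknown}")
    return [ERA5_SHORT_NAMES[v] for v in request_variables]


requested = ['10m_u_component_of_wind', '10m_v_component_of_wind',
             '2m_dewpoint_temperature', '2m_temperature']
print(infile_names(requested))
```

Passing `infile_names(requested)` to `era_dir_maker` would then split out only the variables that were actually requested.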