# Point Extraction of WRF Climate Data
## Purpose
The purpose of this Jupyter Notebook is to extract data from WRF datasets at two locations (latitude and longitude coordinates) provided by Kannon Lee at PND Engineers. Kannon's application is thermal analysis, so we'll extract the following variables at a **daily** resolution:
 - `t2`: Two meter daily average air temperature (Kelvin)
 - `t2max`: Two meter daily maximum air temperature (Kelvin)
 - `t2min`: Two meter daily minimum air temperature (Kelvin)
 
 and the following variables at **hourly** resolution:
 - `t2`: Two meter air temperature (Kelvin)
 
The following daily datasets will be queried for each variable:
 - The ERA-Interim Reanalysis Historical Baseline (1979 - 2015)
 - The NCAR-CCSM4 Historical Model Run (1970 - 2005)
 - The NCAR-CCSM4 RCP 8.5 Model Run (2006 - 2100)
 
 The following hourly datasets will be queried for each variable:
 - The ERA-Interim Reanalysis Historical Baseline (1979 - 2015)
 - The NCAR-CCSM4 Historical Model Run (1970 - 2005)
 - The NCAR-CCSM4 RCP 8.5 Model Run (2006 - 2100)
 - The GFDL-CM3 Historical Model Run (1970 - 2005)
 - The GFDL-CM3 RCP 8.5 Model Run (2006 - 2100)

 
## Objective
Create minimal temperature datasets suitable for ingestion into Kannon's thermal analysis. The data product will consist of several CSV files organized by location, temperature variable, and the particular reanalysis / climate model / emissions scenario combination. Files with data in kelvin and in degrees Celsius will be provided for convenience.

## Background
The data set for extraction is: [Historical and Projected Dynamically Downscaled Climate Data for the State of Alaska and surrounding regions at 20km spatial resolution and hourly temporal resolution](http://ckan.snap.uaf.edu/dataset/historical-and-projected-dynamically-downscaled-climate-data-for-the-state-of-alaska-and-surrou)

The data have a 20 km spatial resolution and an hourly temporal resolution, with daily summaries also available.
Any missing data will be tracked and reported with an appropriate "no data" value.
The extracted data will be provided both intact and in a "dropped" version, with the missing records removed, for convenience.


## Implementation
This Jupyter Notebook weaves together the extraction code, narrative, and documentation. The cells are intended to be run from top to bottom in sequential order.

In [2]:
# import the necessary libraries
import xarray as xr
import numpy as np
import pandas as pd
import warnings
from pathlib import Path
from pyproj import Transformer
from pyproj.crs import CRS
warnings.filterwarnings("ignore", category=FutureWarning) 

In [3]:
# coordinates provided via a KMZ file from Kannon
wgs84_coords = {
    "Chefornak": (60.158787283199999, -164.281157796799988),
    "Ambler": (67.08675982, -157.8569733),
}

In [4]:
# climate variables of interest for daily summaries
climate_vars = ["t2", "t2max", "t2min"]

In [5]:
# paths to local copy of the data
daily_basepath = Path("/rcs/project_data/wrf_data/daily/")
hourly_basepath = Path("/rcs/project_data/wrf_data/hourly_fix/")

In [6]:
# empty list to track netCDF files that lack coordinates
no_coord_files = []

In [7]:
"""This code block defines functions to fetch the paths to the netCDF files
containing the WRF outputs for the variables and combinations of interest."""


def fetch_target_directory(basepath, climate_variable):
    target_dir = basepath.joinpath(climate_variable)
    return target_dir


def fetch_target_filepaths(target_dir, filename_match):
    target_fps = [fp for fp in target_dir.glob(f"*{filename_match}*.nc")]
    return target_fps

In [8]:
"""This code block defines functions to extract and store the actual data from netCDF files
for a particular set of geospatial coordinates."""


def get_data(fp, var):
    """This function extracts data from a single netCDF file.
    Coordinates are supplied via the global dictionary."""
    with xr.open_dataset(fp) as ds:
        # project WGS84 coordinates using proj string from WRF file
        wrf_proj_str = ds.attrs["proj_parameters"]
        wrf_crs = CRS.from_proj4(wrf_proj_str)
        transformer = Transformer.from_crs("epsg:4326", wrf_crs)
        wrf_coords = {
            p_name: transformer.transform(*coords)
            for p_name, coords in wgs84_coords.items()
        }

        # query xarray dataset using "method" parameter to
        # choose nearest cell to each coordinate
        try:
            temp_data = {
                p_name: ds[var].sel(xc=coords[0], yc=coords[1], method="nearest").values
                for p_name, coords in wrf_coords.items()
            }
            # make a pandas dataframe with time series from points as columns
            df = pd.DataFrame(temp_data, index=ds.time.values)
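        except Exception:
            # the file could not be queried by xc / yc (e.g. missing coordinates):
            # record it and fill the time series with NaN instead
            no_coord_files.append(fp)
            temp_data = {p_name: np.nan for p_name in wrf_coords.keys()}
            df = pd.DataFrame(temp_data, index=ds.time.values)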
        return df.round(2)
    

def get_data_from_all_filepaths(filepaths, var):    
    list_of_temp_data_from_single_files = []
    for fp in filepaths:
        temp_data_from_one_file = get_data(fp, var)
        list_of_temp_data_from_single_files.append(temp_data_from_one_file)
    all_temp_data = pd.concat(list_of_temp_data_from_single_files)
    # consider dropping NaN rows here.
    return all_temp_data


#### Notes on the above extraction code

The WRF coordinate reference system is obtained from the proj4 string in the netCDF dataset attributes. Then the `pyproj.Transformer` is used to project the coordinates to the CRS used in the `xarray` dataset. 

The `xarray.Dataset` object has a `.sel` method to query the data by location. The parameter `method="nearest"` must be set to query the grid cell nearest to the specified input coordinates; otherwise the method will look for the specified coordinates exactly as provided and likely fail. 
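To make the nearest-cell lookup concrete, here is a minimal sketch of what a nearest-neighbor selection amounts to on a regular grid, written with plain NumPy rather than `xarray`. The grid axes, the field values, and the helper name `nearest_index` are assumptions made up for illustration; they are not taken from the WRF files.

```python
import numpy as np


def nearest_index(axis_values, target):
    """Return the index of the grid coordinate closest to target."""
    return int(np.abs(np.asarray(axis_values) - target).argmin())


# hypothetical projected grid axes in meters with 20 km spacing
xc = np.arange(0, 100_000, 20_000)  # 0, 20000, 40000, 60000, 80000
yc = np.arange(0, 60_000, 20_000)   # 0, 20000, 40000

# hypothetical field with shape (yc, xc)
field = np.arange(yc.size * xc.size, dtype=float).reshape(yc.size, xc.size)

# a point that does not sit exactly on a grid node
x_pt, y_pt = 47_500.0, 12_000.0
ix = nearest_index(xc, x_pt)  # 47,500 m is closest to the node at 40,000 m
iy = nearest_index(yc, y_pt)  # 12,000 m is closest to the node at 20,000 m
value = field[iy, ix]
```

An exact-match lookup for `x_pt` or `y_pt` would find nothing here, because neither value is a grid node; the nearest-neighbor rule is what makes point queries at arbitrary coordinates workable.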

In [9]:
"""This code block extracts data at the daily resolution."""

# initialize empty dictionary object for results
climate_data_di = {}

# get extraction results for each climate variable and store in the dictionary
for var in climate_vars:
    
    climate_data_di[var] = {}
    
    # set the file paths
    target_dir = fetch_target_directory(daily_basepath, var)    
    rcp85_fps = fetch_target_filepaths(target_dir, "rcp85")
    era_fps = fetch_target_filepaths(target_dir, "ERA-Interim")
    historicalcm3_fps = fetch_target_filepaths(target_dir, "GFDL-CM3_historical")
    
    # do the extraction
    climate_data_di[var]["daily_rcp85_cm3"] = get_data_from_all_filepaths(rcp85_fps, var)
    climate_data_di[var]["daily_historical_era_interim_reanalysis"] = get_data_from_all_filepaths(era_fps, var)
    climate_data_di[var]["daily_historical_modeled_cm3"] = get_data_from_all_filepaths(historicalcm3_fps, var)

In [10]:
"""This code block extracts data at the hourly resolution."""

# set the filepaths
var = "t2" # no min and max for hourly resolution
target_dir = fetch_target_directory(hourly_basepath, var)
hourly_ccsm4_fps = fetch_target_filepaths(target_dir, "CCSM4")
hourly_cm3_fps = fetch_target_filepaths(target_dir, "CM3")
cm3_historical_model_fps = [x for x in hourly_cm3_fps if "historical" in x.name]
cm3_rcp85_fps = [x for x in hourly_cm3_fps if "rcp85" in x.name]
ccsm4_historical_model_fps = [x for x in hourly_ccsm4_fps if "historical" in x.name]
ccsm4_rcp85_fps = [x for x in hourly_ccsm4_fps if "rcp85" in x.name]
hourly_era_fps = fetch_target_filepaths(target_dir, "ERA-Interim")

# do the extraction
climate_data_di[var]["hourly_reanalysis"] = get_data_from_all_filepaths(hourly_era_fps, var)
climate_data_di[var]["hourly_cm3_rcp85"] = get_data_from_all_filepaths(cm3_rcp85_fps, var)
climate_data_di[var]["hourly_historical_modeled_cm3"] = get_data_from_all_filepaths(cm3_historical_model_fps, var)
climate_data_di[var]["hourly_ccsm4_rcp85"] = get_data_from_all_filepaths(ccsm4_rcp85_fps, var)
climate_data_di[var]["hourly_historical_modeled_ccsm4"] = get_data_from_all_filepaths(ccsm4_historical_model_fps, var)


In [11]:
# inspect results by variable
climate_data_di["t2min"]

{'daily_rcp85_cm3':              Chefornak      Ambler
 2006-01-02  262.850006  259.059998
 2006-01-03  267.850006  268.679993
 2006-01-04  263.320007  248.679993
 2006-01-05  255.899994  245.720001
 2006-01-06  253.250000  241.639999
 ...                ...         ...
 2100-12-27  278.220001  271.640015
 2100-12-28  277.940002  270.350006
 2100-12-29  274.410004  268.989990
 2100-12-30  274.820007  266.399994
 2100-12-31  277.950012  266.869995
 
 [34697 rows x 2 columns],
 'daily_historical_era_interim_reanalysis':              Chefornak      Ambler
 1979-01-02  272.570007  260.920013
 1979-01-03  274.220001  263.690002
 1979-01-04  275.510010  256.720001
 1979-01-05  275.089996  256.920013
 1979-01-06  273.380005  255.539993
 ...                ...         ...
 2015-10-25  277.589996  268.660004
 2015-10-26  276.899994  272.739990
 2015-10-27  277.170013  270.970001
 2015-10-28  275.700012  263.359985
 2015-10-29  271.829987  259.839996
 
 [13450 rows x 2 columns],
 'daily_historic

In [12]:
# inspect results by variable and output type
climate_data_di["t2min"]["daily_rcp85_cm3"]

Unnamed: 0,Chefornak,Ambler
2006-01-02,262.850006,259.059998
2006-01-03,267.850006,268.679993
2006-01-04,263.320007,248.679993
2006-01-05,255.899994,245.720001
2006-01-06,253.250000,241.639999
...,...,...
2100-12-27,278.220001,271.640015
2100-12-28,277.940002,270.350006
2100-12-29,274.410004,268.989990
2100-12-30,274.820007,266.399994


In [13]:
# inspect results by variable and output type and place
climate_data_di["t2min"]["daily_rcp85_cm3"]["Ambler"]

2006-01-02    259.059998
2006-01-03    268.679993
2006-01-04    248.679993
2006-01-05    245.720001
2006-01-06    241.639999
                 ...    
2100-12-27    271.640015
2100-12-28    270.350006
2100-12-29    268.989990
2100-12-30    266.399994
2100-12-31    266.869995
Name: Ambler, Length: 34697, dtype: float32

In [16]:
# inspect results by variable and output type and place and time
climate_data_di["t2min"]["daily_rcp85_cm3"]["Ambler"].loc["2099-07-04"]

292.8795

In [17]:
# inspect results by variable and output type and place and time range
climate_data_di["t2min"]["daily_rcp85_cm3"]["Ambler"].loc["2099-07-04":"2099-07-11"]

2099-07-04    292.879486
2099-07-05    290.896820
2099-07-06    287.966492
2099-07-07    286.342255
2099-07-08    290.897614
2099-07-09    292.350128
2099-07-10    295.230804
2099-07-11    294.635559
Name: Ambler, dtype: float32

In [21]:
for j in climate_data_di.keys():
    for k in climate_data_di[j]:
        
        print("Ambler_"+ k + "_" + j)

Ambler_daily_rcp85_cm3_t2
Ambler_daily_historical_era_interim_reanalysis_t2
Ambler_daily_historical_modeled_cm3_t2
Ambler_hourly_reanalysis_t2
Ambler_hourly_cm3_rcp85_t2
Ambler_hourly_historical_modeled_cm3_t2
Ambler_hourly_ccsm4_rcp85_t2
Ambler_hourly_historical_modeled_ccsm4_t2
Ambler_daily_rcp85_cm3_t2max
Ambler_daily_historical_era_interim_reanalysis_t2max
Ambler_daily_historical_modeled_cm3_t2max
Ambler_daily_rcp85_cm3_t2min
Ambler_daily_historical_era_interim_reanalysis_t2min
Ambler_daily_historical_modeled_cm3_t2min
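The names printed above could serve as CSV file stems. The following is a minimal sketch of one way to write the per-location files in both kelvin and degrees Celsius. The function name `export_point_csvs`, the `_K` / `_C` suffixes, the `time` index label, and the toy input are assumptions for illustration, not part of the notebook's actual export step.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_point_csvs(data_di, out_dir):
    """Write one Kelvin and one Celsius CSV per place, dataset, and variable."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for var_name, datasets in data_di.items():
        for dataset_name, df in datasets.items():
            for place in df.columns:
                stem = f"{place}_{dataset_name}_{var_name}"
                kelvin = df[place].rename(f"{var_name}_K")
                # kelvin to degrees Celsius: subtract 273.15
                celsius = (df[place] - 273.15).round(2).rename(f"{var_name}_C")
                for suffix, series in (("K", kelvin), ("C", celsius)):
                    fp = out_dir / f"{stem}_{suffix}.csv"
                    series.to_csv(fp, index_label="time")
                    written.append(fp)
    return written


# toy stand-in with the same nesting as climate_data_di
toy = {
    "t2": {
        "daily_rcp85_cm3": pd.DataFrame(
            {"Ambler": [273.15, 283.15]},
            index=pd.to_datetime(["2006-01-02", "2006-01-03"]),
        )
    }
}

out_dir = tempfile.mkdtemp()
paths = export_point_csvs(toy, out_dir)
```

Writing one file per place keeps each CSV to a single time column and a single value column, which is usually the easiest shape to ingest elsewhere.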


In [None]:
climate_data_di[var]["hourly_cm3_rcp85"].isnull().values.any()
climate_data_di[var]["hourly_cm3_rcp85"].isnull().sum()

In [None]:
x = climate_data_di[var]["hourly_cm3_rcp85"]
# NaN never compares equal to anything (including itself), so use isna()
x[x.isna().any(axis=1)]
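As a small illustration of the missing-data bookkeeping described in the Background section, the sketch below locates missing values and builds the "intact" and "dropped" versions of a table. The toy DataFrame and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# toy stand-in for one extracted DataFrame, with one all-NaN row
idx = pd.date_range("2006-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {
        "Chefornak": [262.85, np.nan, 263.32, 255.90],
        "Ambler": [259.06, np.nan, 248.68, 245.72],
    },
    index=idx,
)

# an equality test against NaN never matches, because NaN != NaN
equality_hits = int((df == np.nan).values.sum())

# isna() is the reliable way to locate missing values
missing_per_place = df.isna().sum()
missing_rows = df[df.isna().any(axis=1)]

# "intact" keeps the gaps; "dropped" removes rows where every place is missing
intact = df
dropped = df.dropna(axis=0, how="all")
```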

In [None]:
ds = xr.open_dataset(hourly_era_fps[0])
ds

In [18]:
no_coord_files

[PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_ERA-Interim_historical_1994.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_ERA-Interim_historical_2002.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_ERA-Interim_historical_2010.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_rcp85_2026.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_rcp85_2042.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_rcp85_2097.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_rcp85_2050.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_historical_1970.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_historical_2002.nc'),
 PosixPath('/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_historical_1980.nc'),
 PosixPath('/rcs/project_data/w
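A list like the one above is easier to report when it is tallied by model and scenario. The sketch below does that by parsing the file names; it assumes every name follows the `<var>_<freq>_wrf_<model>_<scenario>_<year>.nc` pattern seen in the paths above, and the helper name `summarize_missing` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_missing(filepaths):
    """Count files per (model, scenario) based on the file name parts."""
    counts = Counter()
    for fp in filepaths:
        # assumed layout: <var>_<freq>_wrf_<model>_<scenario>_<year>
        _var, _freq, _wrf, model, scenario, _year = Path(fp).stem.split("_")
        counts[(model, scenario)] += 1
    return counts


# three of the paths listed in the output above
example = [
    Path("/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_ERA-Interim_historical_1994.nc"),
    Path("/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_rcp85_2026.nc"),
    Path("/rcs/project_data/wrf_data/hourly_fix/t2/t2_hourly_wrf_GFDL-CM3_rcp85_2042.nc"),
]
summary = summarize_missing(example)
```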

In [54]:
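def compute_mean(df, location, time_start, time_stop):
    """Mean of one location's series between time_start and time_stop (inclusive)."""
    # drop rows where every location is missing, then sort by time so that
    # label-based slicing works even if files were concatenated out of order
    series = df.dropna(axis=0, how='all')[location].sort_index()
    return series.loc[time_start:time_stop].mean()

def compute_bias(modeled_mean, observed_mean):
    """Modeled minus observed mean (positive when the model runs warm)."""
    return modeled_mean - observed_mean

def apply_bias(df, location, bias):
    """Add bias to the location column in place and return the DataFrame."""
    df[location] = df[location] + bias
    return df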
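The arithmetic behind a mean-bias adjustment can be shown on a toy example. This sketch follows one common convention, in which the bias is defined as modeled minus observed and is then subtracted from the model series so that the two means agree over the reference period. Note that `apply_bias` above adds its `bias` argument, so the sign of the value passed in determines the direction of the shift. All values and names below are made up for illustration.

```python
import pandas as pd

idx = pd.date_range("1980-01-01", periods=4, freq="D")

# hypothetical reference (reanalysis) and model series in kelvin
observed = pd.Series([270.0, 271.0, 272.0, 273.0], index=idx)
modeled = pd.Series([271.0, 272.0, 273.0, 274.0], index=idx)  # runs 1 K warm

# bias defined as modeled minus observed over the reference period
bias = modeled.mean() - observed.mean()

# removing the bias aligns the model mean with the reference mean
corrected = modeled - bias
```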

In [56]:
for var in climate_vars:
    for place in wgs84_coords.keys():
        era_mean = compute_mean(climate_data_di[var]["daily_historical_era_interim_reanalysis"],
                                place, "1980-01-01", "2009-12-31")
        cm3_historical_mean = compute_mean(climate_data_di[var]["daily_historical_modeled_cm3"],
                                place, "1980-01-01", "2009-12-31")
        bias = compute_bias(cm3_historical_mean, era_mean)
        apply_bias(climate_data_di[var]["daily_rcp85_cm3"], place, bias)
        

In [62]:
climate_data_di["t2"]["hourly_historical_modeled_ccsm4"]

Unnamed: 0,Chefornak,Ambler
2002-01-01 00:00:00,249.160004,242.169998
2002-01-01 01:00:00,249.039993,243.100006
2002-01-01 02:00:00,248.509995,243.500000
2002-01-01 03:00:00,247.979996,243.669998
2002-01-01 04:00:00,247.770004,243.610001
...,...,...
1977-12-31 19:00:00,226.339996,238.289993
1977-12-31 20:00:00,226.649994,238.630005
1977-12-31 21:00:00,226.800003,238.479996
1977-12-31 22:00:00,226.880005,238.679993
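The index above starts in 2002 and ends in 1977, which suggests the per-file results were concatenated in whatever order the files were listed. A minimal sketch, using toy frames rather than the real extraction, of checking for and fixing that before slicing by date:

```python
import pandas as pd

# toy frames standing in for per-file extractions read in arbitrary order
later = pd.DataFrame(
    {"Ambler": [242.17, 243.10]},
    index=pd.to_datetime(["2002-01-01 00:00", "2002-01-01 01:00"]),
)
earlier = pd.DataFrame(
    {"Ambler": [238.29, 238.63]},
    index=pd.to_datetime(["1977-12-31 19:00", "1977-12-31 20:00"]),
)

combined = pd.concat([later, earlier])
was_sorted = combined.index.is_monotonic_increasing

# sorting by time makes date-range slicing well defined
combined_sorted = combined.sort_index()
window = combined_sorted.loc["1977-01-01":"1977-12-31"]
```

Sorting once after concatenation is usually simpler than trying to control the order in which files are read.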


In [57]:
var = "t2"  # hourly data exist only for t2
for place in wgs84_coords.keys():
    # reference mean from the hourly ERA-Interim reanalysis at this place
    era_mean = compute_mean(climate_data_di[var]["hourly_reanalysis"],
                            place, "1980-01-01", "2009-12-31")
    cm3_historical_mean = compute_mean(climate_data_di[var]["hourly_historical_modeled_cm3"],
                            place, "1980-01-01", "2009-12-31")
    ccsm4_historical_mean = compute_mean(climate_data_di[var]["hourly_historical_modeled_ccsm4"],
                        place, "1980-01-01", "2009-12-31")

    cm3_bias = compute_bias(cm3_historical_mean, era_mean)
    ccsm4_bias = compute_bias(ccsm4_historical_mean, era_mean)
    
    apply_bias(climate_data_di[var]["hourly_ccsm4_rcp85"], place, ccsm4_bias)
    apply_bias(climate_data_di[var]["hourly_cm3_rcp85"], place, cm3_bias)


{'daily_rcp85_cm3':              Chefornak      Ambler
 2006-01-02  263.279114  259.760315
 2006-01-03  268.279114  269.380310
 2006-01-04  263.749115  249.380310
 2006-01-05  256.329102  246.420319
 2006-01-06  253.679108  242.340317
 ...                ...         ...
 2100-12-27  278.649109  272.340332
 2100-12-28  278.369110  271.050323
 2100-12-29  274.839111  269.690308
 2100-12-30  275.249115  267.100311
 2100-12-31  278.379120  267.570312
 
 [34697 rows x 2 columns],
 'daily_historical_era_interim_reanalysis':              Chefornak      Ambler
 1979-01-02  272.570007  260.920013
 1979-01-03  274.220001  263.690002
 1979-01-04  275.510010  256.720001
 1979-01-05  275.089996  256.920013
 1979-01-06  273.380005  255.539993
 ...                ...         ...
 2015-10-25  277.589996  268.660004
 2015-10-26  276.899994  272.739990
 2015-10-27  277.170013  270.970001
 2015-10-28  275.700012  263.359985
 2015-10-29  271.829987  259.839996
 
 [13450 rows x 2 columns],
 'daily_historic

In [43]:
era_reanalysis_mean

265.98804

In [44]:
compute_bias(model_historical_mean, era_reanalysis_mean)

0.7003174

268.5943