# Exploratory Data Analysis (Model Runs)

The purpose of this notebook is to perform exploratory data analysis (EDA) for the NCAR statistically downscaled (BCSD) Alaska Near Surface Meteorology Daily Averages dataset. The data have a 12 km resolution and cover the period 1950–2099.

The goal of this EDA notebook is to work through the usual first-pass questions (what is here? what is missing? etc.) and to understand the structures and value ranges within the data.

The source data are annual netCDF files with a daily time step, each containing the data for a single model and scenario. There are 10 models and 2 scenarios. Within each model-scenario combination, the Alaska Near Surface Meteorology Daily Averages files have names like `ACCESS1-3_rcp45_BCSD_met_1958.nc`, where the `met` tag indicates that the file contains the following climate variables:

    tmax (Maximum Daily 2-m air temperature, degrees C)
    tmin (Minimum Daily 2-m air temperature, degrees C)
    pcp (Daily precipitation, mm per day)
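The naming convention above can be unpacked with a few string splits. The following is a minimal sketch, not part of the original notebook: `MetFileInfo` and `parse_met_filename` are hypothetical names, and the sketch assumes that model names never contain an underscore.

```python
from pathlib import Path
from typing import NamedTuple


class MetFileInfo(NamedTuple):
    """Fields encoded in a file name (hypothetical container for illustration)."""
    model: str
    scenario: str
    year: int


def parse_met_filename(path):
    """Split a name like ACCESS1-3_rcp45_BCSD_met_1958.nc into its parts."""
    # Assumption: model names may contain hyphens but never underscores,
    # so "_" is a safe field separator.
    parts = Path(path).name.split("_")
    model, scenario = parts[0], parts[1]
    # The last field is "<year>.<extension>"; keep only the year.
    year = int(parts[-1].split(".")[0])
    return MetFileInfo(model, scenario, year)


info = parse_met_filename("ACCESS1-3_rcp45_BCSD_met_1958.nc")
print(info)
```

Because only the text before the first dot of the last field is used, the same helper should work regardless of the file extension.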



In [1]:
import os
import time
from pathlib import Path

import xarray as xr
from tqdm import tqdm
import numpy as np
import dask
import dask.array as da
from dask.distributed import Client
from dask import delayed
import pandas as pd
import matplotlib.pyplot as plt

from config import DATA_DIR, daymet_dir, models, scenarios

First we will verify that the expected number of files exists. There are ten (10) models, two (2) scenarios, and 150 years (1950-2099) of output. We should therefore have 10 * 2 * 150 total files.

In [2]:
expected_total_files = 10 * 2 * 150
expected_total_files

3000

In [3]:
projected_model_files = []
for model in models:
    model_path = DATA_DIR / model
    input_data = list(model_path.rglob("*.nc*"))
    if len(input_data) != 300:
        print(model)
        print(len(input_data))
    projected_model_files.extend(input_data)

assert len(projected_model_files) == expected_total_files

In [4]:
years = list(range(1950, 2100))

# one empty bucket per model, scenario, and year
model_di = {model: [] for model in models}
scenario_di = {scenario: [] for scenario in scenarios}
year_di = {year: [] for year in years}

for nc_file in projected_model_files:
    # file names look like <model>_<scenario>_BCSD_met_<year>.<ext>
    name_parts = nc_file.name.split("_")
    file_model = name_parts[0]
    file_scenario = name_parts[1]
    file_year = name_parts[-1].split(".")[0]

    model_di[file_model].append(nc_file)
    scenario_di[file_scenario].append(nc_file)
    year_di[int(file_year)].append(nc_file)

# assert that no matter how we group the data (by model, by scenario, by year)
# the number of files in each group is equal (no missing or duplicated data)

assert {len(files) for files in model_di.values()} == {300}

assert {len(files) for files in scenario_di.values()} == {1500}

assert {len(files) for files in year_di.values()} == {20}

In [5]:
normal_dim = [209, 299, 365]
leap_dim = [209, 299, 366]

with xr.open_dataset(DATA_DIR / "CCSM4" / "rcp85" / "CCSM4_rcp85_BCSD_met_2005.nc4") as ds:
    met_ref_coords = ds.coords
    
unruly_files = []
ds_dims = []
ds_indices = []
ds_coords = []

for model in models:
    for nc_file in tqdm(model_di[model], desc=f"Scanning {model} files..."):
        with xr.open_dataset(nc_file) as ds:

            # check for the expected three dimension sizes (two spatial plus time)
            dims = list(ds.sizes.values())
            if sorted(dims) not in (normal_dim, leap_dim):
                print(f"{nc_file.name} has unusual dimensions of {dims}")
                unruly_files.append(nc_file)

            # check daily frequency, allowing for leap years
            if ds.coords["time"].shape[0] not in (365, 366):
                print(f"{nc_file.name} has unusual coordinates of {ds.coords}")
                unruly_files.append(nc_file)

            ds_indices.append(ds.indexes)

            # check expected variables exist in each file as a DataArray
            ref_vars = {"tmin", "tmax", "pcp"}
            data_vars = set(ds.data_vars.keys())
            if not ref_vars.issubset(data_vars):
                print(f"{nc_file.name} only has the following data variables: {data_vars}")



Scanning ACCESS1-3 files...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:02<00:00, 103.22it/s]
Scanning CanESM2 files...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:03<00:00, 79.00it/s]
Scanning CCSM4 files...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:04<00:00, 74.10it/s]
Scanning CSIRO-Mk3-6-0 files...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:03<00:00, 81.69it/s]
Scanning GFDL-ESM2M files...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:03<00:00, 84.33it/s]
Scann

HadGEM2-ES_rcp45_BCSD_met_2005.nc4 has unusual dimensions of [209, 299, 334]
HadGEM2-ES_rcp45_BCSD_met_2005.nc4 has unusual coordinates of Coordinates:
    latitude   (y, x) float64 ...
    longitude  (y, x) float64 ...
  * time       (time) datetime64[ns] 2005-01-01 2005-01-02 ... 2005-11-30


Scanning HadGEM2-ES files...:  63%|████████████████████████████████████████████████████████████████████████████████▊                                                | 188/300 [00:02<00:01, 70.56it/s]

HadGEM2-ES_rcp85_BCSD_met_2005.nc4 has unusual dimensions of [209, 299, 334]
HadGEM2-ES_rcp85_BCSD_met_2005.nc4 has unusual coordinates of Coordinates:
    latitude   (y, x) float64 ...
    longitude  (y, x) float64 ...
  * time       (time) datetime64[ns] 2005-01-01 2005-01-02 ... 2005-11-30


Scanning HadGEM2-ES files...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:03<00:00, 75.64it/s]
Scanning inmcm4 files...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:04<00:00, 63.28it/s]
Scanning MIROC5 files...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:05<00:00, 59.01it/s]
Scanning MPI-ESM-MR files...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:05<00:00, 57.97it/s]
Scanning MRI-CGCM3 files...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:05<00:00, 59.91it/s]


In [6]:
if len(np.unique(ds_indices)) == 1:
    print("All files have the same indices.")
else:
    print("Some files have different indices.")
    print(np.unique(ds_indices))
np.unique(unruly_files)

All files have the same indices.


array([PosixPath('/atlas_scratch/Base_Data/AK_NCAR_12km/met/HadGEM2-ES/rcp45/HadGEM2-ES_rcp45_BCSD_met_2005.nc4'),
       PosixPath('/atlas_scratch/Base_Data/AK_NCAR_12km/met/HadGEM2-ES/rcp85/HadGEM2-ES_rcp85_BCSD_met_2005.nc4')],
      dtype=object)

Two files do not contain a full year of data: both end on 2005-11-30, so they are missing December. These files are `HadGEM2-ES_rcp45_BCSD_met_2005.nc4` and `HadGEM2-ES_rcp85_BCSD_met_2005.nc4`.
After contacting NCAR we learned that this model run did not complete, and there is no plan to re-run it.
We will set this model (HadGEM2-ES) aside when processing. Aside from these two files the data appear homogeneous: the files have the expected variables and share the same 209 × 299 spatial grid and a daily time step, including leap years. The next step is to sample value ranges and nodata values.
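Setting the model aside can be as simple as filtering the file list on the first field of the file name. This is a minimal sketch under that assumption; `EXCLUDED_MODELS`, `drop_excluded`, and the example list are hypothetical stand-ins, not names from the notebook.

```python
from pathlib import Path

# Models to skip during processing (incomplete 2005 files, see above).
EXCLUDED_MODELS = {"HadGEM2-ES"}


def drop_excluded(files, excluded=EXCLUDED_MODELS):
    """Return only the files whose model (first file-name field) is not excluded."""
    return [f for f in files if Path(f).name.split("_")[0] not in excluded]


# Hypothetical stand-in for the notebook's projected_model_files list.
example_files = [
    Path("CCSM4_rcp85_BCSD_met_2005.nc4"),
    Path("HadGEM2-ES_rcp45_BCSD_met_2005.nc4"),
    Path("HadGEM2-ES_rcp85_BCSD_met_2004.nc4"),
]
kept = drop_excluded(example_files)
print([f.name for f in kept])
```

Note that this drops every HadGEM2-ES file, not just the two incomplete ones, which matches the decision to set the whole model aside.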

Next we will check the value ranges for `tmin` and `tmax`. These data are in °C (they will be converted to °F later). For reference, here are the observed record temperatures for Alaska, from Wikipedia:
 - highest is 100 °F (38 °C) in Fort Yukon
 - lowest is −80 °F (−62 °C) in Prospect Creek
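The unit conversion and a comparison against these records can be sketched as follows. The function names, the `margin_c` parameter, and the input values in the last two lines are illustrative assumptions and do not come from the notebook.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


# Observed Alaska records quoted above, in degrees C (rounded values).
RECORD_HIGH_C = 38.0   # Fort Yukon
RECORD_LOW_C = -62.0   # Prospect Creek


def outside_records(value_c, margin_c=0.0):
    """True if a temperature falls outside the observed record range.

    margin_c is a hypothetical tolerance for values slightly beyond the records.
    """
    return value_c > RECORD_HIGH_C + margin_c or value_c < RECORD_LOW_C - margin_c


print(round(c_to_f(RECORD_HIGH_C), 1))  # 100.4
print(round(c_to_f(RECORD_LOW_C), 1))   # -79.6
print(outside_records(50.0))   # illustrative input
print(outside_records(-40.0))  # illustrative input
```

A value outside the observed records is not necessarily wrong for modeled data, particularly for future projections, so a check like this is better suited to flagging values for review than to rejecting them.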

In [7]:
# define a delayed function to compute tmin/tmax stats for a single file
@delayed
def compute_stats_for_file(file):
    with xr.open_dataset(file) as ds:
        file_stats = {}
        for variable in ["tmin", "tmax"]:
            # named `arr` to avoid shadowing the `dask.array as da` import
            arr = ds[variable].chunk()  # chunk the data for parallel processing
            file_stats[variable] = {
                "filename": file.name,
                "min_vals": float(arr.min()),
                "max_vals": float(arr.max()),
                # also count no-data cells while the file is open
                "nan_count": float(arr.isnull().sum()),
            }
        return file_stats

In [8]:
summary_stat_di = {}


for model in models:
    all_file_stats = []

    for nc_file in tqdm(model_di[model], desc=f"Sampling values from the {model} model..."):
        all_file_stats.append(compute_stats_for_file(nc_file))
    stat_result = dask.compute(*all_file_stats)
    init_df = pd.DataFrame.from_dict(stat_result).T

    output_dfs = []

    for idx in init_df.index:
        row_dict = init_df.loc[idx].to_dict()
        df = pd.DataFrame(row_dict).T
        df["variable"] = idx
        output_dfs.append(df)
    summary_stat_di[model] = pd.concat(output_dfs)

Sampling values from the ACCESS1-3 model...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 12378.30it/s]
Sampling values from the CanESM2 model...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 11204.63it/s]
Sampling values from the CCSM4 model...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 14674.63it/s]
Sampling values from the CSIRO-Mk3-6-0 model...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 16462.02it/s]
Sampling values from the GFDL-ESM2M model...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 15415.89it/s]
Sampl

In [12]:
for k in summary_stat_di.keys():
    print(f"Minimum value summary (all scenarios) for model {k}")
    mindf = summary_stat_di[k].drop(["filename"], axis=1).groupby("variable").min().round(1)
    print(mindf)

Minimum value summary (all scenarios) for model ACCESS1-3
          min_vals  max_vals   nan_count
variable                                
tmax         -68.3      29.5  17607965.0
tmin         -67.2      14.5  17607965.0
Minimum value summary (all scenarios) for model CanESM2
          min_vals  max_vals   nan_count
variable                                
tmax         -70.9      28.5  17607965.0
tmin         -70.6      14.2  17607965.0
Minimum value summary (all scenarios) for model CCSM4
          min_vals  max_vals   nan_count
variable                                
tmax         -68.8      28.4  17607965.0
tmin         -67.4      14.5  17607965.0
Minimum value summary (all scenarios) for model CSIRO-Mk3-6-0
          min_vals  max_vals   nan_count
variable                                
tmax         -69.3      28.6  17607965.0
tmin         -67.3      14.1  17607965.0
Minimum value summary (all scenarios) for model GFDL-ESM2M
          min_vals  max_vals   nan_count
variable      

In [13]:
for k in summary_stat_di.keys():
    print(f"Maximum value summary (all scenarios) for model {k}")
    maxdf = summary_stat_di[k].drop(["filename"], axis=1).groupby("variable").max().round(1)
    print(maxdf)

Maximum value summary (all scenarios) for model ACCESS1-3
          min_vals  max_vals   nan_count
variable                                
tmax         -45.0      49.0  17656206.0
tmin         -46.0      38.7  17656206.0
Maximum value summary (all scenarios) for model CanESM2
          min_vals  max_vals   nan_count
variable                                
tmax         -43.6      50.8  17656206.0
tmin         -47.5      30.3  17656206.0
Maximum value summary (all scenarios) for model CCSM4
          min_vals  max_vals   nan_count
variable                                
tmax         -45.2      48.3  17656206.0
tmin         -46.5      28.0  17656206.0
Maximum value summary (all scenarios) for model CSIRO-Mk3-6-0
          min_vals  max_vals   nan_count
variable                                
tmax         -45.6      44.8  17656206.0
tmin         -44.8      35.1  17656206.0
Maximum value summary (all scenarios) for model GFDL-ESM2M
          min_vals  max_vals   nan_count
variable      

The per-file minimums look OK, although there are some very extreme temperatures here: the coldest `tmax` and `tmin` values in the entire dataset are about -71°C, which is about -96°F! That could be realistic for January on top of Denali or somewhere similar. It is unexpected that the coldest `tmax` is actually colder than the coldest `tmin`, but the values are close enough that this might just be a downscaling / bias-correction artifact (a known possible issue that we have seen elsewhere). It is good to see a stable minimum `nan_count` (other than for HadGEM2-ES, which we already established is missing data). Put another way, the maximum data extent for each of these variables is very likely identical.
The hottest daily maximum temperatures in the entire dataset are definitely scary (~50°C or 122°F). Ultimately we will create an average daily temperature dataset, which will moderate these minima and maxima.
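The notebook does not say how the average daily temperature will be computed. As a rough sketch, assuming it is the midpoint of `tmin` and `tmax`, the calculation and a check for cells where `tmax` is colder than `tmin` might look like this. The arrays are tiny synthetic values, not data from the files.

```python
import numpy as np

# Tiny synthetic (time, y, x) arrays in degrees C; NaN marks no-data cells.
tmin = np.array([[[-30.0, -12.5], [np.nan, 1.0]]])
tmax = np.array([[[-31.0, -4.5], [np.nan, 9.0]]])

# Assumed definition: daily average as the midpoint of tmin and tmax.
tavg = (tmin + tmax) / 2.0

# Flag cells where tmax is colder than tmin. Comparisons involving NaN
# evaluate to False, so no-data cells are never flagged.
inverted = tmax < tmin

print(tavg)
print(int(inverted.sum()))
```

With a midpoint definition, an inverted cell still yields a sensible `tavg`, which is one reason the artifact may matter less for the averaged product than for `tmin` and `tmax` individually.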

## EDA Takeaways
* Data are generally homogeneous
* Discard HadGEM2-ES for now
* The data variables `tmin` and `tmax` have extreme values that are plausible but worth noting; creating a `tavg` data variable will dampen this variability somewhat
* These data contain leap years
* If you are plotting slices of the source data, beware that the auto-labeling will be incorrect for `tmin`/`tmax`
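Because the data contain leap years, a completeness check can be made stricter than accepting either 365 or 366 time steps for any year. The sketch below assumes the files use the standard Gregorian calendar; `expected_days` and `has_full_year` are hypothetical helper names.

```python
import calendar


def expected_days(year):
    """Number of daily time steps a complete file for this year should hold."""
    return 366 if calendar.isleap(year) else 365


def has_full_year(year, n_time_steps):
    """True if the time dimension length matches the calendar year."""
    return n_time_steps == expected_days(year)


print(has_full_year(2004, 366))  # True: 2004 is a leap year
print(has_full_year(2005, 334))  # False: the length of the HadGEM2-ES 2005 files
print(has_full_year(2005, 366))  # False: one day too many for a non-leap year
```

The last case is one that the 365-or-366 check would not catch.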