# Arctic Rivers exploratory data analysis

This notebook explores data from the Arctic Rivers Project, sponsored by NSF NNA. The goal is to understand what the dataset contains, asking questions that are relevant to how this data could be integrated into NCR or similar.

Context: the project co-developed, with Indigenous partners, a model chain that involves dynamical downscaling (4 km) of a calibrated RASM/WRF/CSM configuration for a historical 30-yr period, two mid-century simulations using the delta method, and four downscaled CESM2 LE members, each for historical and mid-century 30-yr periods. The simulations have been routed to estimate streamflow and (calibrated) river temperatures (paper in press) for AK and YK rivers.

In [23]:
import numpy as np
import pandas as pd
import xarray as xr

from pathlib import Path

# directory holding the Arctic Rivers Project files
data_dir = Path("/beegfs/CMIP6/arctic-cmip6/Arctic_Rivers_Data/")

Looks like all files are netCDFs:

In [15]:
fps = list(data_dir.glob("*"))
# every entry in the directory should be a netCDF file
assert all(fp.suffix == ".nc" for fp in fps)
print(f"Number of files: {len(fps)}")

Number of files: 1056


And all files have a 3 part name structure:

In [17]:
assert all([len(fp.name.split("_")) == 3 for fp in fps])
print("Some random files:")

print(fps[0].name)
print(fps[200].name)
print(fps[999].name)

Some random files:
2056_fC2LE2_climate.nc
2019_hC2LE9_WT.nc
2009_historical_climate.nc


Looks like we have `<year>_<model>_<variables>.nc`, with the following possible values:

In [37]:
years, models, variables = zip(*[fp.name.split(".")[0].split("_") for fp in fps])
models = list(set(models))
variables = list(set(variables))
# years are 4-digit strings, so string min/max agrees with numeric order
print(f"year -- first year: {min(years)}, last year: {max(years)}")
print(f"models: {sorted(models)}")
print(f"variables: {set(variables)}")

year -- first year: 1990, last year: 2065
models: ['fC2LE2', 'fC2LE4', 'fC2LE7', 'fC2LE9', 'fPGWh', 'fPGWm', 'hC2LE2', 'hC2LE4', 'hC2LE7', 'hC2LE9', 'historical']
variables: {'WT', 'Q', 'climate'}
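
The naming pattern above can also be parsed into a small catalog, which makes it easy to check how many years each model/variable pair covers. This is a minimal sketch, assuming the three-part `<year>_<model>_<variable>.nc` pattern holds for every file. `build_catalog` is a hypothetical helper (not part of the notebook), and the sample names are just the three file names printed earlier.

```python
import pandas as pd


def build_catalog(names):
    """Parse `<year>_<model>_<variable>.nc` file names into a DataFrame.

    Hypothetical helper: assumes every name has exactly three
    underscore-separated parts before the extension.
    """
    records = []
    for name in names:
        stem = name.rsplit(".", 1)[0]
        year, model, variable = stem.split("_")
        records.append(
            {"year": int(year), "model": model, "variable": variable, "name": name}
        )
    return pd.DataFrame(records)


# Sample names copied from the output above; in the notebook this could be
# build_catalog(fp.name for fp in fps) instead.
sample_names = [
    "2056_fC2LE2_climate.nc",
    "2019_hC2LE9_WT.nc",
    "2009_historical_climate.nc",
]
catalog = build_catalog(sample_names)

# Year coverage per model/variable pair: file count, first year, last year.
coverage = catalog.groupby(["model", "variable"])["year"].agg(["count", "min", "max"])
print(coverage)
```

Run on the full file list, the `count` column would show whether every model/variable pair has the same number of yearly files.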


Let's get an idea of sizes:

In [47]:
rows = []
for model in models:
    for var in variables:
        # sizes (in MB) of the yearly files that exist for this model/variable pair
        modvar_sizes = []
        for year in range(1990, 2066):
            fp = data_dir.joinpath(f"{year}_{model}_{var}.nc")
            if fp.exists():
                # st_size is in bytes; divide by 1e6 for decimal megabytes
                modvar_sizes.append(fp.stat().st_size / 1e6)

        rows.append(
            {
                "model": model,
                "var": var,
                "mean_size": f"{int(np.mean(modvar_sizes))} MB",
                "std_size": f"{int(np.std(modvar_sizes))} MB",
            }
        )

df = pd.DataFrame(rows)

Mean file size for each model/variable pair:

In [48]:
df.pivot(index="var", columns="model", values="mean_size")

| var | fC2LE2 | fC2LE4 | fC2LE7 | fC2LE9 | fPGWh | fPGWm | hC2LE2 | hC2LE4 | hC2LE7 | hC2LE9 | historical |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Q | 50 MB | 50 MB | 50 MB | 50 MB | 50 MB | 50 MB | 50 MB | 50 MB | 50 MB | 50 MB | 52 MB |
| WT | 594 MB | 594 MB | 594 MB | 594 MB | 595 MB | 595 MB | 594 MB | 594 MB | 594 MB | 594 MB | 595 MB |
| climate | 969 MB | 969 MB | 971 MB | 971 MB | 966 MB | 969 MB | 986 MB | 990 MB | 987 MB | 988 MB | 981 MB |


Standard deviations:

In [49]:
df.pivot(index="var", columns="model", values="std_size")

| var | fC2LE2 | fC2LE4 | fC2LE7 | fC2LE9 | fPGWh | fPGWm | hC2LE2 | hC2LE4 | hC2LE7 | hC2LE9 | historical |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Q | 2 MB | 2 MB | 2 MB | 2 MB | 2 MB | 2 MB | 2 MB | 2 MB | 2 MB | 2 MB | 2 MB |
| WT | 28 MB | 28 MB | 28 MB | 28 MB | 26 MB | 26 MB | 28 MB | 28 MB | 28 MB | 28 MB | 26 MB |
| climate | 46 MB | 50 MB | 46 MB | 48 MB | 46 MB | 45 MB | 47 MB | 47 MB | 47 MB | 48 MB | 45 MB |
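
The size summary could also be computed in a single pass over the directory followed by a `groupby`, rather than nested loops over models, variables, and years. The sketch below assumes the same `<year>_<model>_<variable>.nc` naming. `size_table` is a hypothetical helper, and the demo writes two small throwaway files to a temporary directory instead of touching the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def size_table(data_dir):
    """Mean and population std of file size (MB) per variable/model pair.

    Hypothetical helper: assumes `<year>_<model>_<variable>.nc` file names.
    """
    rows = []
    for fp in Path(data_dir).glob("*.nc"):
        _year, model, var = fp.stem.split("_")
        rows.append({"model": model, "var": var, "size_mb": fp.stat().st_size / 1e6})
    grouped = pd.DataFrame(rows).groupby(["var", "model"])["size_mb"]
    # ddof=0 matches the default of np.std, which the loop version uses
    return pd.DataFrame({"mean_size": grouped.mean(), "std_size": grouped.std(ddof=0)})


# Demo on synthetic files of 1 MB and 3 MB (not real netCDF content).
with tempfile.TemporaryDirectory() as tmp:
    for year, n_bytes in [(1990, 1_000_000), (1991, 3_000_000)]:
        Path(tmp, f"{year}_historical_Q.nc").write_bytes(b"\0" * n_bytes)
    table = size_table(tmp)

# Same layout as the pivoted tables above: variables as rows, models as columns.
print(table["mean_size"].unstack("model"))
```

Keeping the sizes numeric until display, instead of formatting them as `"… MB"` strings up front, leaves the table available for further arithmetic such as totals per model.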
