# Explore LLNL ESGF holdings

Use this notebook to explore the LLNL ESGF holdings and determine which grid we should use for each of the models. Most models only have one grid, but some have more and we need to explicitly define the best choice. The bottom of the notebook includes the justification for each model.

In [38]:
import numpy as np
import pandas as pd
from itertools import chain

Read a CSV containing info on holdings. Since empty directories are listed here, remove any rows where number of files is 0.

In [39]:
holdings_df = pd.read_csv("llnl_esgf_holdings.csv")
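# where() only masks failing rows with NaN, so drop the fully masked rows to actually remove them
holdings_df = holdings_df.where(holdings_df.n_files > 0).dropna(how="all")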
holdings_df

Unnamed: 0,model,scenario,variant,table_id,variable,grid_type,version,n_files,filenames
0,ACCESS-CM2,historical,r10i1p1f1,Amon,tas,gn,v20220819,2.0,['tas_Amon_ACCESS-CM2_historical_r10i1p1f1_gn_...
1,ACCESS-CM2,historical,r1i1p1f1,Amon,tas,gn,v20191108,1.0,['tas_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_1...
2,ACCESS-CM2,historical,r2i1p1f1,Amon,tas,gn,v20191125,1.0,['tas_Amon_ACCESS-CM2_historical_r2i1p1f1_gn_1...
3,ACCESS-CM2,historical,r3i1p1f1,Amon,tas,gn,v20200306,1.0,['tas_Amon_ACCESS-CM2_historical_r3i1p1f1_gn_1...
4,ACCESS-CM2,historical,r4i1p1f1,Amon,tas,gn,v20210607,1.0,['tas_Amon_ACCESS-CM2_historical_r4i1p1f1_gn_1...
...,...,...,...,...,...,...,...,...,...
57150,MPI-ESM1-2-LR,ssp585,r5i1p1f1,Eday,hfss,gn,v20190710,5.0,['hfss_Eday_MPI-ESM1-2-LR_ssp585_r5i1p1f1_gn_2...
57151,MPI-ESM1-2-LR,ssp585,r6i1p1f1,Eday,hfss,gn,v20190710,5.0,['hfss_Eday_MPI-ESM1-2-LR_ssp585_r6i1p1f1_gn_2...
57152,MPI-ESM1-2-LR,ssp585,r7i1p1f1,Eday,hfss,gn,v20190710,5.0,['hfss_Eday_MPI-ESM1-2-LR_ssp585_r7i1p1f1_gn_2...
57153,MPI-ESM1-2-LR,ssp585,r8i1p1f1,Eday,hfss,gn,v20190710,5.0,['hfss_Eday_MPI-ESM1-2-LR_ssp585_r8i1p1f1_gn_2...
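
The distinction between masking rows and dropping them is easy to miss, so here is a minimal sketch on made-up data (the `toy` frame and its values are illustrative only, not taken from the holdings CSV). It compares `DataFrame.where`, which keeps every row and fills the failing ones with NaN, against boolean indexing, which removes them.

```python
import pandas as pd

# Toy stand-in for the holdings table (made-up values)
toy = pd.DataFrame(
    {
        "model": ["A", "A", "B"],
        "n_files": [2, 0, 5],
    }
)

# where() keeps the original shape and replaces failing rows with NaN
masked = toy.where(toy.n_files > 0)

# boolean indexing drops the failing rows and leaves the integer dtype alone
filtered = toy[toy.n_files > 0]

print(len(masked), len(filtered))
print(masked.n_files.dtype, filtered.n_files.dtype)
```

The NaN fill is also why an integer column such as `n_files` shows up as floats after `where`.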


Next, load the specific variant for each model, as determined in `select_variants.ipynb`. (These are stored in `transfers.config.py`. For now, models that still need to be assessed are set to `None`.)

In [40]:
prod_variant_lu = {
    "CNRM-CM6-1-HR": "r1i1p1f2",
    "EC-Earth3-Veg": "r1i1p1f1",
    "GFDL-ESM4": "r1i1p1f1",
    "HadGEM3-GC31-LL": "r1i1p1f3",
    "HadGEM3-GC31-MM": "r1i1p1f3",
    "KACE-1-0-G": "r1i1p1f1",
    "MIROC6": "r1i1p1f1",
    "MPI-ESM1-2-HR": "r1i1p1f1",
    "MPI-ESM1-2-LR": None,
    "MRI-ESM2-0": "r1i1p1f1",
    "NorESM2-MM": "r1i1p1f1",
    "TaiESM1": "r1i1p1f1",
    "CESM2-WACCM": "r1i1p1f1",
    "E3SM-1-0": None,
    "E3SM-1-1-ECA": None,
    "E3SM-2-0": None,
    "E3SM-2-0-NARRM": None,
}

Subset the holdings using the model/variant dictionary.

In [41]:
results = []
for model, variant in prod_variant_lu.items():
    if variant is not None:
        # keep only the production variant for this model
        results.append(holdings_df.query(f"model == '{model}' & variant == '{variant}'"))
    else:
        # no variant chosen yet, so keep every variant for this model
        results.append(holdings_df.query(f"model == '{model}'"))
df = pd.concat(results)
df

Unnamed: 0,model,scenario,variant,table_id,variable,grid_type,version,n_files,filenames
1492,CNRM-CM6-1-HR,historical,r1i1p1f2,day,tas,gr,v20191021,4.0,['tas_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr...
1496,CNRM-CM6-1-HR,historical,r1i1p1f2,day,tasmax,gr,v20191021,4.0,['tasmax_day_CNRM-CM6-1-HR_historical_r1i1p1f2...
1498,CNRM-CM6-1-HR,historical,r1i1p1f2,day,tasmin,gr,v20191021,4.0,['tasmin_day_CNRM-CM6-1-HR_historical_r1i1p1f2...
1500,CNRM-CM6-1-HR,historical,r1i1p1f2,day,pr,gr,v20191021,7.0,['pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_...
1502,CNRM-CM6-1-HR,historical,r1i1p1f2,day,psl,gr,v20191021,4.0,['psl_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr...
...,...,...,...,...,...,...,...,...,...
43346,CESM2-WACCM,ssp585,r1i1p1f1,SIday,sithick,gn,v20200702,5.0,['sithick_SIday_CESM2-WACCM_ssp585_r1i1p1f1_gn...
43351,CESM2-WACCM,ssp585,r1i1p1f1,Amon,hfls,gn,v20200702,5.0,['hfls_Amon_CESM2-WACCM_ssp585_r1i1p1f1_gn_201...
43356,CESM2-WACCM,ssp585,r1i1p1f1,day,hfls,gn,v20200702,29.0,['hfls_day_CESM2-WACCM_ssp585_r1i1p1f1_gn_2015...
43366,CESM2-WACCM,ssp585,r1i1p1f1,Amon,hfss,gn,v20200702,5.0,['hfss_Amon_CESM2-WACCM_ssp585_r1i1p1f1_gn_201...
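
As an alternative to building query strings, the same model/variant subset can be expressed with boolean masks. The sketch below uses made-up rows and a hypothetical `toy_variant_lu` lookup, where `None` means "no variant chosen yet, keep everything for that model". It is an illustration of the idea, not a replacement for the cell above.

```python
import pandas as pd

# Made-up rows standing in for the holdings table
toy = pd.DataFrame(
    {
        "model": ["A", "A", "B", "B", "C"],
        "variant": ["r1i1p1f1", "r2i1p1f1", "r1i1p1f1", "r1i1p1f2", "r1i1p1f1"],
    }
)

# Hypothetical lookup: None means the variant has not been chosen yet
toy_variant_lu = {"A": "r1i1p1f1", "B": None}

# Start with nothing selected and switch on the rows each model asks for
mask = pd.Series(False, index=toy.index)
for model, variant in toy_variant_lu.items():
    model_mask = toy["model"] == model
    if variant is not None:
        model_mask &= toy["variant"] == variant
    mask |= model_mask

subset = toy[mask]
print(subset)
```

Unlike the `pd.concat` approach, this keeps the rows in their original order rather than grouping them in dictionary order.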


## Grids

Here we explore which grids we want to mirror on the ACDN. For a given model, one grid may cover more variables and temporal frequencies than another, and we want to select the grid with the most coverage. We also need to take the shape of the grid into account, since we prefer rectangular grids.

So, do any grids have more data than the others for a given model and scenario?

Determine the data availability for each model, scenario, and grid, in terms of number of variables x temporal frequencies. We will combine those two fields to get an idea of representation across both daily and monthly frequencies.

In [42]:
# create a column that is just a concatenation of temporal frequency and variable name to simplify
df["freq_var"] = df["table_id"] + "_" + df["variable"]

Next, group by model, scenario, and grid type and tally the number of unique variable-table ID combinations:

In [43]:
rep_df = pd.DataFrame(
    df.groupby(["model", "scenario", "grid_type"])["freq_var"]
    .nunique()
)
rep_df

model,scenario,grid_type,freq_var
CESM2-WACCM,historical,gn,46
CESM2-WACCM,ssp126,gn,51
CESM2-WACCM,ssp245,gn,44
CESM2-WACCM,ssp370,gn,44
CESM2-WACCM,ssp585,gn,52
...,...,...,...
TaiESM1,historical,gn,45
TaiESM1,ssp126,gn,48
TaiESM1,ssp245,gn,48
TaiESM1,ssp370,gn,45
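
To make the tally concrete, here is a small sketch on made-up rows (the `toy` frame and its values are illustrative, not from the holdings). It shows that `nunique` counts each variable-table ID combination once per group even when several rows repeat it, and that `unstack` can lay the scenarios side by side, leaving NaN where a grid has no data for a scenario.

```python
import pandas as pd

# Made-up rows; model "A" repeats Amon_tas twice under ssp585 on grid "gr"
toy = pd.DataFrame(
    {
        "model": ["A", "A", "A", "A", "A", "B", "B"],
        "scenario": ["historical", "historical", "historical", "ssp585", "ssp585", "historical", "ssp585"],
        "grid_type": ["gn", "gr", "gr", "gr", "gr", "gn", "gn"],
        "freq_var": ["Omon_tos", "Amon_tas", "day_tas", "Amon_tas", "Amon_tas", "Amon_tas", "Amon_tas"],
    }
)

# Count distinct freq_var combos per model, scenario, and grid
counts = toy.groupby(["model", "scenario", "grid_type"])["freq_var"].nunique()

# Move scenarios into columns so grids can be compared row by row
wide = counts.unstack("scenario")
print(wide)
```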


Make a table for each model that shows the number of variable-table ID combinations for each grid type.

In [44]:
models = df.model.unique()
model_grids = {}

for model in models:
    model_df = rep_df.loc[model]
    repr_df = (
        model_df.reset_index()
        .pivot(index="grid_type", columns="scenario", values="freq_var")
    )
    print(model)
    print(repr_df, "\n")
    # keep track of models that have more than one grid type
    if len(repr_df.index) > 1:
        model_grids[model] = list(repr_df.index)

CNRM-CM6-1-HR
scenario   historical  ssp126  ssp245  ssp370  ssp585
grid_type                                            
gn                  3       5       4       4       5
gr                 25      44      41      41      44 

EC-Earth3-Veg
scenario   historical  ssp126  ssp245  ssp370  ssp585
grid_type                                            
gn                  7       6       6       6       6
gr                 50      47      47      47      47 

GFDL-ESM4
scenario   historical  ssp126  ssp245  ssp370  ssp585
grid_type                                            
gn                  4       5       5       5       5
gr1                45      44      41      43      41 

HadGEM3-GC31-LL
scenario   historical  ssp126  ssp245  ssp585
grid_type                                    
gn                 56      52      54      52 

HadGEM3-GC31-MM
scenario   historical  ssp126  ssp585
grid_type                            
gn                 53      48      48 

KACE-1-0-G
scenario 

A few models have multiple grid types. In each case one grid type clearly holds the majority of the data, but some variables may only be available on a particular grid. Let's compare the data between grid types to see what is going on.
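
The comparison boils down to set arithmetic: for each grid, the combos found nowhere else are a set difference, and the combos held in common are a set intersection. A minimal sketch, assuming a made-up `toy_grid_vars` mapping from grid type to its frequency/variable combos:

```python
# Made-up combos per grid type (illustrative only)
toy_grid_vars = {
    "gn": {"Omon_tos", "SImon_siconc"},
    "gr": {"Amon_tas", "day_pr", "Omon_tos"},
}

report = {}
for grid, combos in toy_grid_vars.items():
    # pool the combos of every other grid into one set
    others = set().union(*(v for g, v in toy_grid_vars.items() if g != grid))
    unique = sorted(combos - others)   # only on this grid
    shared = sorted(combos & others)   # also on at least one other grid
    report[grid] = (unique, shared)

print(report)
```

If every grid reports an empty `shared` list, the grids hold disjoint sets of variables, and choosing a single grid would drop whatever is unique to the others.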


In [45]:
model_grids

{'CNRM-CM6-1-HR': ['gn', 'gr'],
 'EC-Earth3-Veg': ['gn', 'gr'],
 'GFDL-ESM4': ['gn', 'gr1']}

In [51]:
for model, grids in model_grids.items():
    print(f"{model}\n")
    print(f"Grids: {grids}")
    # group by grid type and collect the sorted unique freq_var combos for each grid
    freq_vars = (
        df[df["model"] == model]
        .groupby(["grid_type"])["freq_var"]
        .apply(lambda x: list(np.unique(x)))
    )
    for test_grid in grids:
        # pool the freq_var combos of every other grid for this model into one set
        other_grids = [g for g in grids if g != test_grid]
        other_grid_freq_vars = set(chain.from_iterable(freq_vars[g] for g in other_grids))

        # combos found only on the test grid
        unique = [t for t in freq_vars[test_grid] if t not in other_grid_freq_vars]
        # combos the test grid has in common with at least one other grid
        shared = [t for t in freq_vars[test_grid] if t in other_grid_freq_vars]

        print(f"Frequency/variable combos unique to grid {test_grid}:  {unique}")
        print(f"Frequency/variable combos shared with another grid besides {test_grid}: {shared}")
        print("\n")

CNRM-CM6-1-HR

Grids: ['gn', 'gr']
Frequency/variable combos unique to grid gn:  ['Oday_tos', 'Omon_tos', 'SIday_siconc', 'SIday_sithick', 'SImon_siconc', 'SImon_sithick']
Frequency/variable combos shared with another grid besides gn: []


Frequency/variable combos unique to grid gr:  ['Amon_clt', 'Amon_evspsbl', 'Amon_hfls', 'Amon_hfss', 'Amon_hus', 'Amon_huss', 'Amon_pr', 'Amon_prsn', 'Amon_psl', 'Amon_rlds', 'Amon_rsds', 'Amon_sfcWind', 'Amon_ta', 'Amon_tas', 'Amon_tasmax', 'Amon_tasmin', 'Amon_ts', 'Amon_ua', 'Amon_uas', 'Amon_va', 'Amon_vas', 'LImon_snd', 'LImon_snw', 'Lmon_mrro', 'Lmon_mrsos', 'day_clt', 'day_hfls', 'day_hfss', 'day_hus', 'day_huss', 'day_mrro', 'day_mrsos', 'day_pr', 'day_prsn', 'day_psl', 'day_rlds', 'day_rsds', 'day_sfcWind', 'day_sfcWindmax', 'day_snw', 'day_ta', 'day_tas', 'day_tasmax', 'day_tasmin', 'day_ua', 'day_uas', 'day_va', 'day_vas', 'fx_orog', 'fx_sftlf']
Frequency/variable combos shared with another grid besides gr: []


EC-Earth3-Veg

Grids: ['gn'

In [None]:
len(freq_vars)

2

In [None]:
freq_vars.index.to_list()

['gn', 'gr1']

In [None]:
freq_vars['gn']

['Oday_tos',
 'Omon_tos',
 'SIday_siconc',
 'SIday_sithick',
 'SImon_siconc',
 'SImon_sithick']
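
One way to turn the coverage counts into a candidate choice is to pick, for each model, the grid with the most frequency/variable combos summed across scenarios. The sketch below does this on made-up counts (`toy_counts` and its values are hypothetical). The result is only a starting point, since grid shape and variables unique to a smaller grid still need to be weighed by hand.

```python
import pandas as pd

# Made-up counts indexed like rep_df: (model, scenario, grid_type)
toy_counts = pd.Series(
    [3, 5, 25, 44, 56, 52],
    index=pd.MultiIndex.from_tuples(
        [
            ("A", "historical", "gn"),
            ("A", "ssp585", "gn"),
            ("A", "historical", "gr"),
            ("A", "ssp585", "gr"),
            ("B", "historical", "gn"),
            ("B", "ssp585", "gn"),
        ],
        names=["model", "scenario", "grid_type"],
    ),
    name="freq_var",
)

# Sum coverage over scenarios, then spread grid types into columns
totals = toy_counts.groupby(level=["model", "grid_type"]).sum().unstack("grid_type")

# For each model, take the grid column with the largest total (NaN is skipped)
candidate_grids = totals.idxmax(axis=1).to_dict()
print(candidate_grids)
```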


The results of the grid selection will be recorded here once they are decided:

In [None]:
# TBD: grid selections have not been made yet, so every model maps to None for now
prod_grid_lu = {
    "CNRM-CM6-1-HR": None,
    "EC-Earth3-Veg": None,
    "GFDL-ESM4": None,
    "HadGEM3-GC31-LL": None,
    "HadGEM3-GC31-MM": None,
    "KACE-1-0-G": None,
    "MIROC6": None,
    "MPI-ESM1-2-HR": None,
    "MPI-ESM1-2-LR": None,
    "MRI-ESM2-0": None,
    "NorESM2-MM": None,
    "TaiESM1": None,
    "CESM2-WACCM": None,
    "E3SM-1-0": None,
    "E3SM-1-1-ECA": None,
    "E3SM-2-0": None,
    "E3SM-2-0-NARRM": None,
}