# Explore LLNL ESGF holdings

Use this notebook to explore the LLNL ESGF holdings.

In [1]:
import pandas as pd

Read a CSV containing info on holdings for an ESGF node (LLNL only for now):

In [2]:
df = pd.read_csv("llnl_esgf_holdings.csv")
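The cells that follow reference the columns `model`, `scenario`, `variant`, `frequency`, `variable`, and `grid_type`. A minimal sketch of a sanity check is shown here, assuming a hypothetical helper named `missing_columns` (not part of the original notebook) and a made-up toy frame in place of the real CSV:

```python
import pandas as pd

# Columns referenced by later cells in this notebook
REQUIRED_COLUMNS = ["model", "scenario", "variant", "frequency", "variable", "grid_type"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Hypothetical helper: list the required column names absent from frame."""
    return [col for col in required if col not in frame.columns]


# Toy frame standing in for the holdings table; the values are invented
toy = pd.DataFrame(
    {"model": ["A"], "scenario": ["historical"], "variant": ["r1i1p1f1"]}
)
print(missing_columns(toy))
```

Running `missing_columns(df)` on the real table right after loading would surface a renamed or absent column before the grouping steps fail further down.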

## Variants

Here we explore which variants we want to mirror on the ACDN. Some variants may cover more variables and frequencies than others, so for each model we want to select the variant with the broadest coverage.

So, do any variants have more data than the others for a given model and scenario?

Determine the data availability for each model, scenario, and variant, measured as the number of unique variable-frequency combinations. Combining those two fields gives an idea of representation across both daily and monthly frequencies.

In [3]:
# create a column that concatenates temporal frequency and variable name, to simplify the tally
df["freq_var"] = df["frequency"] + "_" + df["variable"]

Next, group by model, scenario, and variant and tally the number of unique variable-frequency combinations:

In [4]:
rep_df = pd.DataFrame(
    df[df["grid_type"].notna()].groupby(["model", "scenario", "variant"])["freq_var"].nunique()
)
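As a small illustration of the tally, the sketch below builds a made-up four-row table (the values are invented, not taken from the CSV) and applies the same concatenate-then-`nunique` steps. The repeated `day`/`tas` row stands in for one variable-frequency pair that is listed twice; how such duplicates arise in the real holdings is an assumption.

```python
import pandas as pd

# Toy holdings table; all values are invented for illustration
toy = pd.DataFrame(
    {
        "model": ["A", "A", "A", "A"],
        "scenario": ["historical", "historical", "historical", "ssp585"],
        "variant": ["r1i1p1f1"] * 4,
        "frequency": ["day", "mon", "day", "day"],
        "variable": ["tas", "tas", "tas", "tas"],
    }
)

# Same two steps as the cells above: concatenate, then count unique combinations
toy["freq_var"] = toy["frequency"] + "_" + toy["variable"]
toy_rep = toy.groupby(["model", "scenario", "variant"])["freq_var"].nunique()
print(toy_rep)
```

Because `nunique` counts distinct values, the duplicated `day_tas` row contributes only once to the historical tally.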

Then, for each model, look to see if there are any variants that have the maximum representation for all desired scenarios. In other words, check for a variant that has the most variable-frequency combinations for all target scenarios:

In [28]:
models = df.model.unique()

# sorted list of target scenarios; a variant's sorted unique scenarios must equal this list to count as complete
target_scenarios = ["historical", "ssp126", "ssp245", "ssp370", "ssp585"]

for model in models:
    model_df = rep_df.loc[model]
    max_rep = model_df.max()

    # Ideal case: a variant reaches the model's maximum representation in all five target scenarios.
    # mrv_df ("max rep variants") keeps only the scenario/variant rows at that maximum.
    mrv_df = model_df[model_df >= max_rep].dropna().reset_index()
    # sort the unique scenarios represented by each variant and assign as a "best" variant if all target_scenarios are found
    mrv_scenarios = mrv_df.groupby("variant")["scenario"].unique().apply(sorted)
    best_variants = mrv_scenarios[mrv_scenarios.isin([target_scenarios])].index.values

    print(model, best_variants, "\n")

ACCESS-CM2 ['r1i1p1f1' 'r4i1p1f1' 'r5i1p1f1'] 

CESM2 ['r11i1p1f1'] 

CNRM-CM6-1-HR [] 

EC-Earth3-Veg [] 

GFDL-ESM4 [] 

HadGEM3-GC31-LL [] 

HadGEM3-GC31-MM [] 

KACE-1-0-G [] 

MIROC6 [] 

MPI-ESM1-2-LR ['r10i1p1f1' 'r11i1p1f1' 'r12i1p1f1' 'r13i1p1f1' 'r14i1p1f1' 'r15i1p1f1'
 'r16i1p1f1' 'r17i1p1f1' 'r18i1p1f1' 'r19i1p1f1' 'r1i1p1f1' 'r20i1p1f1'
 'r21i1p1f1' 'r22i1p1f1' 'r23i1p1f1' 'r24i1p1f1' 'r25i1p1f1' 'r26i1p1f1'
 'r27i1p1f1' 'r28i1p1f1' 'r29i1p1f1' 'r2i1p1f1' 'r30i1p1f1' 'r3i1p1f1'
 'r4i1p1f1' 'r5i1p1f1' 'r6i1p1f1' 'r7i1p1f1' 'r8i1p1f1' 'r9i1p1f1'] 

NorESM2-MM ['r1i1p1f1'] 
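The selection in the loop above compares sorted lists of scenarios. A sketch of the same idea using sets is given below, on invented counts for a single model. Note one difference from the original: the subset test `<=` also accepts variants that have extra scenarios beyond the five targets, whereas the list comparison requires an exact match.

```python
import pandas as pd

target_scenarios = {"historical", "ssp126", "ssp245", "ssp370", "ssp585"}

# Toy representation counts for one model; the numbers are invented
toy_rep = pd.DataFrame(
    {
        "scenario": ["historical", "ssp126", "ssp245", "ssp370", "ssp585",
                     "historical", "ssp126"],
        "variant": ["r1i1p1f1"] * 5 + ["r2i1p1f1"] * 2,
        "freq_var": [30, 30, 30, 30, 30, 30, 19],
    }
)

# Keep only the rows that reach the model-wide maximum count
max_rep = toy_rep["freq_var"].max()
at_max = toy_rep[toy_rep["freq_var"] == max_rep]

# Collect the set of scenarios each variant covers at that maximum
scenario_sets = {
    variant: set(group["scenario"]) for variant, group in at_max.groupby("variant")
}

# A variant qualifies when every target scenario is in its set
best_variants = sorted(v for v, s in scenario_sets.items() if target_scenarios <= s)
print(best_variants)
```

Sets avoid any dependence on the order in which scenarios appear, so no sorting step is needed before the comparison.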



Wow, it looks like only four of the eleven models (ACCESS-CM2, CESM2, MPI-ESM1-2-LR, and NorESM2-MM) have variants that reach the maximum representation for all target scenarios.

Instead, we should probably just make a table for each model that shows the number of variable-frequency combinations for each variant.

In [68]:
for model in models:
    model_df = rep_df.loc[model]
    repr_df = model_df.reset_index().pivot(
        index="variant", columns="scenario", values="freq_var"
    ).sort_values(by=["historical"], ascending=False)
    print(model)
    print(repr_df, "\n")

ACCESS-CM2
scenario   historical  ssp126  ssp245  ssp370  ssp585
variant                                              
r1i1p1f1         30.0    30.0    30.0    30.0    30.0
r4i1p1f1         30.0    30.0    30.0    30.0    30.0
r5i1p1f1         30.0    30.0    30.0    30.0    30.0
r10i1p1f1        19.0     NaN     NaN     NaN     NaN
r2i1p1f1         19.0    19.0    19.0    19.0    18.0
r3i1p1f1         19.0    19.0    19.0    19.0    19.0
r6i1p1f1         19.0     NaN     NaN     NaN     NaN
r7i1p1f1         19.0     NaN     NaN     NaN     NaN
r8i1p1f1         19.0     NaN     NaN     NaN     NaN
r9i1p1f1         19.0     NaN     NaN     NaN     NaN 

CESM2
scenario   historical  ssp126  ssp245  ssp370  ssp585
variant                                              
r11i1p1f1        26.0    26.0    26.0    26.0    26.0
r1i1p1f1         22.0    22.0    22.0    22.0    22.0
r2i1p1f1         22.0    24.0    16.0    26.0    26.0
r3i1p1f1         22.0     NaN    12.0    12.0     NaN
r4i1p1f1 

MPI-ESM1-2-LR gets a medal.
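To shortlist candidates from per-model tables like the ones printed above, one option is to drop variants that are missing any scenario. The sketch below hand-copies three ACCESS-CM2 rows from the printed output into a small table; the name `complete` and the filtering rule are illustrative assumptions, not the notebook's stated selection method.

```python
import pandas as pd

scenarios = ["historical", "ssp126", "ssp245", "ssp370", "ssp585"]
nan = float("nan")

# Three ACCESS-CM2 rows copied from the printed table above
table = pd.DataFrame(
    [
        [30.0, 30.0, 30.0, 30.0, 30.0],
        [19.0, nan, nan, nan, nan],
        [19.0, 19.0, 19.0, 19.0, 18.0],
    ],
    index=pd.Index(["r1i1p1f1", "r10i1p1f1", "r2i1p1f1"], name="variant"),
    columns=pd.Index(scenarios, name="scenario"),
)

# Keep variants that have a count for every scenario
complete = table.dropna()

# Rank the remaining variants by their weakest scenario
ranked = complete.min(axis=1).sort_values(ascending=False)
print(ranked)
```

Ranking by the minimum across scenarios favors variants with consistently high coverage over those that are strong in only some scenarios.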