# Explore LLNL ESGF holdings

Use this notebook to explore the LLNL ESGF holdings and determine if we should use a variant besides r1i1p1f1 for any of the models. The bottom of the notebook includes the justification for each model.

In [1]:
import numpy as np
import pandas as pd

Read a CSV containing info on holdings for an ESGF node (LLNL only for now):

In [2]:
df = pd.read_csv("llnl_esgf_holdings.csv")

## Variants

Here we explore which variants we want to mirror on the ACDN. Since some variants may cover more variables and temporal frequencies than others, we want to select the one with the most coverage.

So, do any variants have more data than the others for a given model and scenario?

Determine the data availability for each model, scenario, and variant, in terms of number of variables x temporal frequencies. We will combine those two fields to get an idea of representation across both daily and monthly frequencies.

In [3]:
# create a column that is just a concatenation of temporal frequency and variable name to simplify
df["freq_var"] = df["table_id"] + "_" + df["variable"]

Next, group by model, scenario, and variant and tally the number of unique variable-table ID combinations:

In [4]:
rep_df = pd.DataFrame(
    df[df["grid_type"].notna()]
    .groupby(["model", "scenario", "variant"])["freq_var"]
    .nunique()
)
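
As a minimal sketch of what the two cells above compute, here is the same logic on a tiny, made-up table. The column names follow the notebook; the model name `M1` and all values are hypothetical and do not come from the real holdings CSV.

```python
import pandas as pd

# Hypothetical rows mimicking the holdings CSV (values are made up).
toy = pd.DataFrame(
    {
        "model": ["M1"] * 5,
        "scenario": ["historical", "historical", "historical", "ssp585", "ssp585"],
        "variant": ["r1i1p1f1", "r1i1p1f1", "r2i1p1f1", "r1i1p1f1", "r1i1p1f1"],
        "table_id": ["day", "Amon", "day", "day", "day"],
        "variable": ["tas", "tas", "tas", "tas", "pr"],
        # a missing grid_type is treated as "no data files", as in the notebook
        "grid_type": ["gn", "gn", "gn", "gn", None],
    }
)

# same concatenation as in the notebook
toy["freq_var"] = toy["table_id"] + "_" + toy["variable"]

# count unique table ID / variable pairs per model, scenario, and variant
toy_rep = (
    toy[toy["grid_type"].notna()]
    .groupby(["model", "scenario", "variant"])["freq_var"]
    .nunique()
)
print(toy_rep)
```

The row with a missing `grid_type` is dropped before counting, so the `ssp585` entry for `r1i1p1f1` counts only `day_tas`.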

Then, for each model, look to see if there are any variants that have the maximum representation for all desired scenarios. In other words, check for a variant that has the most variable-table ID combinations for all target scenarios:

In [5]:
models = df.model.unique()

# unique sorted list of scenarios represented for each variant should be this if all desired scenarios are present
target_scenarios = ["historical", "ssp126", "ssp245", "ssp370", "ssp585"]

for model in models:
    model_df = rep_df.loc[model]
    max_rep = model_df.max()

    # ideal case: a variant with max representation for all 5 target scenarios
    # keep only the (scenario, variant) rows that reach max representation
    mrv_df = model_df[model_df >= max_rep].dropna().reset_index()
    # sort the unique scenarios represented by each variant and assign as a "best" variant if all target_scenarios are found
    mrv_scenarios = mrv_df.groupby("variant")["scenario"].unique().apply(sorted)
    best_variants = mrv_scenarios[mrv_scenarios.isin([target_scenarios])].index.values

    print(model, best_variants, f"max representation: {max_rep.values[0]}", "\n")

ACCESS-CM2 ['r4i1p1f1' 'r5i1p1f1'] max representation: 60 

CESM2 [] max representation: 56 

CNRM-CM6-1-HR [] max representation: 49 

EC-Earth3-Veg [] max representation: 57 

GFDL-ESM4 [] max representation: 49 

HadGEM3-GC31-LL [] max representation: 56 

HadGEM3-GC31-MM [] max representation: 53 

KACE-1-0-G [] max representation: 48 

MIROC6 [] max representation: 60 

MRI-ESM2-0 ['r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1' 'r4i1p1f1' 'r5i1p1f1'] max representation: 58 

NorESM2-MM ['r1i1p1f1'] max representation: 55 

TaiESM1 [] max representation: 48 

CESM2-WACCM [] max representation: 52 

MPI-ESM1-2-HR ['r1i1p1f1'] max representation: 58 

MPI-ESM1-2-LR [] max representation: 63 

E3SM-1-0 [] max representation: 18 

E3SM-1-1 [] max representation: 17 

E3SM-1-1-ECA [] max representation: 17 



So it looks like there are only four models which have variants where the max representation (number of variable x table ID pairings) exists for all target scenarios. 
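
The check performed in the loop above can also be written with explicit set comparisons, which may be easier to read. This is only a sketch on made-up counts: the helper name `full_coverage_variants` is hypothetical, the target scenario list is shortened, and it uses a subset test rather than the exact list equality used in the notebook.

```python
import pandas as pd

# Hypothetical per-model counts (one row per scenario/variant pair).
counts = pd.DataFrame(
    [
        ("historical", "r1i1p1f1", 10),
        ("ssp126", "r1i1p1f1", 10),
        ("ssp245", "r1i1p1f1", 10),
        ("historical", "r2i1p1f1", 10),
        ("ssp126", "r2i1p1f1", 8),
        ("ssp245", "r2i1p1f1", 10),
    ],
    columns=["scenario", "variant", "freq_var"],
)


def full_coverage_variants(counts, target_scenarios):
    """Return variants that reach the maximum count in every target scenario."""
    max_rep = counts["freq_var"].max()
    at_max = counts[counts["freq_var"] == max_rep]
    # set of scenarios in which each variant reaches the maximum
    scenarios_by_variant = at_max.groupby("variant")["scenario"].apply(set)
    return sorted(
        variant
        for variant, scenarios in scenarios_by_variant.items()
        if set(target_scenarios) <= scenarios
    )


best = full_coverage_variants(counts, ["historical", "ssp126", "ssp245"])
print(best)
```

Here `r2i1p1f1` falls short in `ssp126`, so only `r1i1p1f1` is returned.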

Instead, we should probably just make a table for each model that shows the number of variable-table ID combinations for each variant.

In [6]:
for model in models:
    model_df = rep_df.loc[model]
    repr_df = (
        model_df.reset_index()
        .pivot(index="variant", columns="scenario", values="freq_var")
        .sort_values(by=["historical"], ascending=False)
    )
    print(model)
    print(repr_df, "\n")

ACCESS-CM2
scenario   historical  ssp126  ssp245  ssp370  ssp585
variant                                              
r4i1p1f1           60      60      60      60      60
r5i1p1f1           60      60      60      60      60
r1i1p1f1           59      59      59      59      59
r10i1p1f1          47      17      17      17      22
r2i1p1f1           47      47      47      47      47
r3i1p1f1           47      47      47      47      47
r6i1p1f1           47      18      17      15      22
r7i1p1f1           47      18      17      17      22
r8i1p1f1           47      18      17      16      22
r9i1p1f1           47      18      17       9      22 

CESM2
scenario   historical  ssp126  ssp245  ssp370  ssp585
variant                                              
r11i1p1f1        53.0    56.0    54.0    54.0    56.0
r1i1p1f1         47.0    51.0    48.0    49.0    51.0
r2i1p1f1         47.0    47.0    32.0    53.0    56.0
r3i1p1f1         47.0     NaN    25.0    32.0     NaN
r4i1p1f1 

MPI-ESM1-2-LR gets a medal.

Looks like we need to rule out **MPI-ESM1-2-HR**, as there is no ScenarioMIP data for it.

Finally, we want to select variants based on this.

First filter out all rows from data frame that do not have data files:

In [7]:
valid_df = df[df["grid_type"].notna()]

#### ACCESS-CM2

For ACCESS-CM2, there is a two-way tie between r4i1p1f1 and r5i1p1f1, although r1i1p1f1 is only 1 behind each! Let's check to see whether the variable-table ID combinations are the same between them:

In [8]:
test_variants = ["r1i1p1f1", "r4i1p1f1", "r5i1p1f1"]
freq_vars = (
    valid_df.query("model == 'ACCESS-CM2' & variant in @test_variants")
    .groupby("variant")["freq_var"]
    .apply(lambda x: list(np.unique(x)))
)
np.unique(freq_vars).shape

(2,)

Okay, so two of those variants should have the exact same variable-table ID combinations, which of course are r4i1p1f1 and r5i1p1f1. What is r1i1p1f1 missing here?

In [9]:
set(freq_vars.iloc[1]) - set(freq_vars.iloc[0])

{'day_sfcWindmax'}


Looks like r1i1p1f1 does not have the daily maximum near-surface wind speed. This would be nice to have, so we will choose one of the other variants: r4i1p1f1.
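
The positional `iloc` lookup above relies on the groupby result being sorted by variant name. A label-based comparison avoids that dependency. The sketch below uses a hypothetical helper, `missing_freq_vars`, and made-up lists rather than the real holdings.

```python
import pandas as pd


def missing_freq_vars(freq_vars, reference, other):
    """Return freq_vars present in `reference` but absent from `other`."""
    return set(freq_vars[reference]) - set(freq_vars[other])


# Hypothetical Series shaped like `freq_vars`: one list per variant.
toy_freq_vars = pd.Series(
    [
        ["Amon_tas", "day_tas"],
        ["Amon_tas", "day_sfcWindmax", "day_tas"],
    ],
    index=["r1i1p1f1", "r4i1p1f1"],
)

print(missing_freq_vars(toy_freq_vars, "r4i1p1f1", "r1i1p1f1"))
# {'day_sfcWindmax'}
```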

#### CESM2 (NCAR)

It looks like the r11i1p1f1 variant has more variable-table ID combinations than the others, so that should be the obvious choice. But let's compare those to another variant just to check:

In [10]:
test_variants = ["r11i1p1f1", "r1i1p1f1"]
freq_vars = (
    valid_df.query("model == 'CESM2' & variant in @test_variants")
    .groupby("variant")["freq_var"]
    .apply(lambda x: list(np.unique(x)))
)

Show the variable-frequencies in r11i1p1f1 that are not in r1i1p1f1:

In [11]:
set(freq_vars["r11i1p1f1"]).difference(set(freq_vars["r1i1p1f1"]))

{'Eday_mrsol', 'Eday_snd', 'day_mrro', 'day_mrsos', 'day_prsn', 'day_snw'}

Now look at the opposite:

In [12]:
set(freq_vars["r1i1p1f1"]).difference(set(freq_vars["r11i1p1f1"]))

set()

Yeah, so we will go with r11i1p1f1, as it has six more variable-table ID combinations of interest and is missing none that r1i1p1f1 has.

#### CNRM-CM6-1-HR

There is only one choice here! r1i1p1f2

#### EC-Earth3-Veg

It looks like the r1i1p1f1 variant has more variable-table ID combos for historical and all desired projected scenarios.

There are five other variants here that all have high numbers of variable-table ID combos for historical and all desired projected scenarios: r2i1p1f1, r3i1p1f1, r12i1p1f1, r14i1p1f1, r4i1p1f1. Let's compare them for representation:

In [13]:
test_variants = ["r1i1p1f1", "r2i1p1f1", "r3i1p1f1", "r12i1p1f1", "r14i1p1f1", "r4i1p1f1"]
freq_vars = (
    valid_df.query("model == 'EC-Earth3-Veg' & variant in @test_variants")
    .groupby("variant")["freq_var"]
    .apply(lambda x: list(np.unique(x)))
)
np.unique(freq_vars).shape

(3,)

OK, there are only 3 distinct sets of variable-table ID combinations among the 6 variants, so let's compare our leader r1i1p1f1 with each of the others.

In [14]:
for tv in test_variants[1:]:
    print(set(freq_vars["r1i1p1f1"]).difference(set(freq_vars[tv])))

set()
set()
{'fx_sftlf'}
{'fx_sftlf'}
{'Ofx_sftof', 'fx_sftlf'}


Since the only differences are fixed variables (which we aren't using right now anyway), we will just go with r1i1p1f1.
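
The `np.unique(freq_vars).shape` check used above reports how many distinct sets exist, but not which variants share them. One possible way to see the grouping directly is to key on a `frozenset` of each variant's list. The helper name `group_identical_variants` and the data below are hypothetical.

```python
import pandas as pd


def group_identical_variants(freq_vars):
    """Group variant labels whose freq_var collections are identical."""
    groups = {}
    for variant, fv in freq_vars.items():
        # frozenset is hashable, so it can serve as a dictionary key
        groups.setdefault(frozenset(fv), []).append(variant)
    return list(groups.values())


# Hypothetical Series shaped like `freq_vars`: one list per variant.
toy_freq_vars = pd.Series(
    [["day_tas", "fx_sftlf"], ["day_tas", "fx_sftlf"], ["day_tas"]],
    index=["r1i1p1f1", "r2i1p1f1", "r12i1p1f1"],
)

print(group_identical_variants(toy_freq_vars))
# [['r1i1p1f1', 'r2i1p1f1'], ['r12i1p1f1']]
```

The number of groups matches what the shape check would report, and each group lists the variants that are interchangeable in terms of coverage.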



#### GFDL-ESM4

This model has a clear winner, with more freq-vars in every scenario than the other variants: r1i1p1f1

#### HadGEM3-GC31-LL

This model also has a clear winner, as there is only one variant that has data for SSP1-2.6, and it has many more variables for SSP5-8.5: r1i1p1f3

#### HadGEM3-GC31-MM

This one doesn't have the SSP2-4.5 or SSP3-7.0 scenarios. We may want to consider dropping it.

#### KACE-1-0-G

Again, a fairly clear winner here: r1i1p1f1 has more freq-vars for each scenario.

#### MIROC6

Another winner here: r1i1p1f1 has the most representation.

#### MPI-ESM1-2-HR

Only one variant has historical and scenario data, so we have to go with r1i1p1f1.

#### MPI-ESM1-2-LR

This one is interesting - most variants have the same representation for all scenarios. Let's see if any stand out:

In [15]:
freq_vars = (
    valid_df.query("model == 'MPI-ESM1-2-LR'")
    .groupby("variant")["freq_var"]
    .apply(lambda x: list(np.unique(x)))
)
np.unique(freq_vars).shape

(5,)

Nope! They all have the same representation. We will go with the classic r1i1p1f1.

#### NorESM2-MM

r1i1p1f1 is another winner here, as it has the most representation across scenarios. 



#### TaiESM1

Again, it's gotta be r1i1p1f1.

#### CESM2-WACCM

r1i1p1f1 again. 

However, we should compare CESM2-WACCM vs plain CESM2, using the best variant of each:

In [16]:
for scenario in ["historical", "ssp245", "ssp585"]:
    cesm2_waccm_vars = set(
        valid_df.query(
            "model == 'CESM2-WACCM' & variant == 'r1i1p1f1' & scenario == @scenario"
        )
        .groupby("variant")["freq_var"]
        .apply(lambda x: list(np.unique(x)))
        .values[0]
    )
    cesm2_vars = set(
        valid_df.query(
            "model == 'CESM2' & variant == 'r11i1p1f1' & scenario == @scenario"
        )
        .groupby("variant")["freq_var"]
        .apply(lambda x: list(np.unique(x)))
        .values[0]
    )
    print(f"Scenario: {scenario}")
    print(
        "CESM2 freq_vars not in CESM2-WACCM:", cesm2_vars.difference(cesm2_waccm_vars)
    )
    print(
        "CESM2-WACCM freq_vars not in CESM2:",
        cesm2_waccm_vars.difference(cesm2_vars),
        "\n",
    )

Scenario: historical
CESM2 freq_vars not in CESM2-WACCM: {'day_prsn', 'Amon_prsn', 'Eday_mrsol', 'day_mrro', 'day_snw', 'Eday_snd', 'day_mrsos'}
CESM2-WACCM freq_vars not in CESM2: set() 

Scenario: ssp245
CESM2 freq_vars not in CESM2-WACCM: {'day_prsn', 'Amon_prsn', 'Eday_mrsol', 'day_tasmin', 'day_mrro', 'Amon_tasmax', 'day_snw', 'day_tasmax', 'Amon_tasmin', 'day_mrsos'}
CESM2-WACCM freq_vars not in CESM2: set() 

Scenario: ssp585
CESM2 freq_vars not in CESM2-WACCM: {'day_prsn', 'day_snw', 'Eday_mrsol', 'day_mrsos'}
CESM2-WACCM freq_vars not in CESM2: set() 



So, if we want monthly snowfall flux, daily runoff, snowfall, and snow mass, then we need to go with CESM2 instead.

### E3SM Models

There is no ScenarioMIP data for these models - they should be excluded from the transfer.
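
Once the variants are chosen, the selections could be collected into a lookup and used to filter the holdings table. The sketch below is one way this might look. Only a few of the selections from the sections above are shown, and the names `selected_variants` and `filter_to_selected` as well as the toy rows are hypothetical.

```python
import pandas as pd

# A few of the variants chosen in the sections above (not the full list).
selected_variants = {
    "ACCESS-CM2": "r4i1p1f1",
    "CESM2": "r11i1p1f1",
    "CNRM-CM6-1-HR": "r1i1p1f2",
    "HadGEM3-GC31-LL": "r1i1p1f3",
}


def filter_to_selected(holdings, selections):
    """Keep only rows whose variant is the one selected for that model."""
    # models without a selection map to NaN and therefore never match
    wanted = holdings["model"].map(selections)
    return holdings[holdings["variant"] == wanted]


# Hypothetical rows standing in for the real holdings table.
toy_holdings = pd.DataFrame(
    {
        "model": ["ACCESS-CM2", "ACCESS-CM2", "CESM2", "E3SM-1-0"],
        "variant": ["r1i1p1f1", "r4i1p1f1", "r11i1p1f1", "r1i1p1f1"],
    }
)

transfer_df = filter_to_selected(toy_holdings, selected_variants)
print(transfer_df)
```

Models left out of the lookup, such as the E3SM models here, are dropped automatically.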