# Explore grids of CMIP6

This notebook explores the grids used by CMIP6 models. We are mainly interested in how much resolution varies: we assume all grids are rectilinear WGS84 grids that differ mostly in resolution, but we do not know by how much. We also want to check other things, such as whether each model uses the same grid across all of its files.

In [70]:
from pathlib import Path
import numpy as np
import pandas as pd
import xarray as xr
from math import radians
from multiprocessing import Pool
import tqdm

# ignore SerializationWarning from xarray for datasets with multiple fill values
import warnings
warnings.filterwarnings("ignore", category=xr.SerializationWarning)

In [71]:
cmip6_dir = Path("/beegfs/CMIP6/arctic-cmip6/CMIP6")

Get one file for each model. Each model should have a monthly temperature (`tas`) file starting in 2015-01:

In [72]:
fps = list(cmip6_dir.joinpath("ScenarioMIP").glob("*/*/ssp585/*/Amon/tas/*/*/*201501*.nc"))
print("Unique models (files):", len(fps))
fps

Unique models (files): 12


[PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-ESM4_ssp585_r1i1p1f1_gr1_201501-210012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NIMS-KMA/KACE-1-0-G/ssp585/r1i1p1f1/Amon/tas/gr/v20190920/tas_Amon_KACE-1-0-G_ssp585_r1i1p1f1_gr_201501-210012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM-CM6-1-HR/ssp585/r1i1p1f2/Amon/tas/gr/v20191202/tas_Amon_CNRM-CM6-1-HR_ssp585_r1i1p1f2_gr_201501-210012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NCC/NorESM2-MM/ssp585/r1i1p1f1/Amon/tas/gn/v20191108/tas_Amon_NorESM2-MM_ssp585_r1i1p1f1_gn_201501-202012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/AS-RCEC/TaiESM1/ssp585/r1i1p1f1/Amon/tas/gn/v20200901/tas_Amon_TaiESM1_ssp585_r1i1p1f1_gn_201501-210012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/MOHC/HadGEM3-GC31-MM/ssp585/r1i1p1f3/Amon/tas/gn/v20200515/tas_Amon_HadGEM

In [73]:
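def get_res(fp):
    """Return the (lat, lon) grid spacing in degrees, taken from the first pair of coordinate values"""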
    with xr.open_dataset(fp) as ds:
        lat_res, lon_res = np.diff(ds.lat)[0].round(2), np.diff(ds.lon)[0].round(2)
        
    return lat_res, lon_res
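
`get_res` reads only the first difference along each axis, so it implicitly assumes evenly spaced coordinates. Gaussian grids, for example, have unevenly spaced latitudes. The following is a minimal sketch of how that assumption could be checked. The helper `spacing_summary` and the coordinate arrays are hypothetical and synthetic, not values read from any CMIP6 file.

```python
import numpy as np


def spacing_summary(coords, decimals=2):
    """Summarize the spacing between consecutive 1-D coordinate values."""
    steps = np.abs(np.diff(np.asarray(coords, dtype=float)))
    return {
        "first": float(np.round(steps[0], decimals)),
        "min": float(np.round(steps.min(), decimals)),
        "max": float(np.round(steps.max(), decimals)),
        # True only if every step is numerically equal to the first one
        "uniform": bool(np.allclose(steps, steps[0])),
    }


# Synthetic coordinates for illustration only
regular_lat = np.arange(-90, 90.5, 0.5)
irregular_lat = np.array([-88.0, -86.5, -85.2, -84.0])

print(spacing_summary(regular_lat))
print(spacing_summary(irregular_lat))
```

If `uniform` is False for a real file, the single value returned by `get_res` describes only the first cell and should be read as approximate.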

In [74]:
rows = []
for fp in fps:
    # path layout: .../<institution>/<model>/<scenario>/<variant>/<frequency>/<variable>/<grid>/<version>/<file>
    model = fp.parents[6].name
    inst = fp.parents[7].name
    lat_res, lon_res = get_res(fp)
    ysize_km = np.round(lat_res * 110)
    xsize_km = np.abs(np.round(lon_res * (111.320 * np.cos(radians(65)))))
    rows.append({
        "inst_model": f"{inst}_{model}",
        "lat_res": lat_res,
        "lon_res": lon_res,
        "ysize_km": ysize_km,
        "xsize_km": xsize_km
    })

res_df = pd.DataFrame(rows)
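
The conversion from degrees to kilometres above scales longitude spacing by the cosine of a reference latitude (65°N here), because meridians converge toward the poles. Below is a minimal standalone sketch of the same arithmetic. The function name `cell_size_km` and the constant `KM_PER_DEG` are hypothetical, and the sketch uses 111.32 km per degree for both axes, whereas the cell above uses 110 km per degree for latitude.

```python
import numpy as np

KM_PER_DEG = 111.32  # approximate length of one degree of latitude, in km


def cell_size_km(lat_res, lon_res, ref_lat_deg=65.0):
    """Approximate grid cell height and width (km) at a reference latitude."""
    ysize = abs(lat_res) * KM_PER_DEG
    # one degree of longitude shrinks by cos(latitude) away from the equator
    xsize = abs(lon_res) * KM_PER_DEG * np.cos(np.radians(ref_lat_deg))
    return round(float(ysize)), round(float(xsize))


print(cell_size_km(1.0, 1.25))
```

Changing `ref_lat_deg` shows how strongly the east-west cell width depends on the latitude chosen as representative of the study area.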

In [106]:
res_df

| | inst_model | lat_res | lon_res | ysize_km | xsize_km |
|---|---|---|---|---|---|
| 0 | NOAA-GFDL_GFDL-ESM4 | 1.0 | 1.25 | 110.0 | 59.0 |
| 1 | NIMS-KMA_KACE-1-0-G | 1.25 | 1.88 | 138.0 | 88.0 |
| 2 | CNRM-CERFACS_CNRM-CM6-1-HR | 0.5 | 0.5 | 55.0 | 24.0 |
| 3 | NCC_NorESM2-MM | 0.94 | 1.25 | 103.0 | 59.0 |
| 4 | AS-RCEC_TaiESM1 | 0.94 | 1.25 | 103.0 | 59.0 |
| 5 | MOHC_HadGEM3-GC31-MM | 0.56 | 0.83 | 62.0 | 39.0 |
| 6 | MOHC_HadGEM3-GC31-LL | 1.25 | 1.88 | 138.0 | 88.0 |
| 7 | MIROC_MIROC6 | 1.39 | 1.41 | 153.0 | 66.0 |
| 8 | EC-Earth-Consortium_EC-Earth3-Veg | 0.7 | 0.7 | 77.0 | 33.0 |
| 9 | NCAR_CESM2 | 0.94 | 1.25 | 103.0 | 59.0 |


It looks like one grid is used more than the others: the grid used by CESM2, CESM2-WACCM, TaiESM1 and NorESM2-MM. The latter three models are based on CESM2, so this makes sense. This grid also sits in the middle of the range of resolutions, so it is a reasonable candidate target grid for regridding.
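
One way to make the "most common grid" observation explicit is to group models by their rounded resolution. The sketch below rebuilds a small table from the ten rows of `res_df` displayed above, so it does not include any model missing from that display. The name `grid_counts` is hypothetical.

```python
import pandas as pd

# Resolutions copied from the res_df output displayed above
res_df = pd.DataFrame({
    "inst_model": [
        "NOAA-GFDL_GFDL-ESM4", "NIMS-KMA_KACE-1-0-G", "CNRM-CERFACS_CNRM-CM6-1-HR",
        "NCC_NorESM2-MM", "AS-RCEC_TaiESM1", "MOHC_HadGEM3-GC31-MM",
        "MOHC_HadGEM3-GC31-LL", "MIROC_MIROC6", "EC-Earth-Consortium_EC-Earth3-Veg",
        "NCAR_CESM2",
    ],
    "lat_res": [1.0, 1.25, 0.5, 0.94, 0.94, 0.56, 1.25, 1.39, 0.7, 0.94],
    "lon_res": [1.25, 1.88, 0.5, 1.25, 1.25, 0.83, 1.88, 1.41, 0.7, 1.25],
})

# Count how many models share each (lat_res, lon_res) pair and list them
grid_counts = (
    res_df.groupby(["lat_res", "lon_res"])
    .agg(
        n_models=("inst_model", "count"),
        models=("inst_model", lambda s: ", ".join(s)),
    )
    .sort_values("n_models", ascending=False)
    .reset_index()
)
print(grid_counts)
```

Because the resolutions are rounded to two decimals, models grouped together here share a nominal resolution but not necessarily identical coordinate values.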

## Intramodel consistency of grids

Here we will check that all files for a particular model have the same grid.

Define some functions to help with comparing grids among files within models:

In [192]:
GRID_VARS = ["lat", "lon", "lat_bnds", "lon_bnds"]

def get_grid(fp):
    """Read the info from a grid for a single file"""
    grid_di = {}
    with xr.open_dataset(fp) as ds:
        for varname in GRID_VARS:
            try:
                grid_di[varname] = ds[varname].values
            except KeyError:
                grid_di[varname] = None
    grid_di["filepath"] = fp
        
    return grid_di


def read_grids(fps):
    """Read the grid info from all files in fps, using multiprocessing and with a progress bar"""
    grids = []
    with Pool(32) as pool:
        for grid_di in tqdm.tqdm(
            pool.imap_unordered(get_grid, fps), total=len(fps)
        ):
            grids.append(grid_di)
            
    return grids


def compare_grids(fps):
    """Generate a table of results from comparing grids of all files with the first file"""
    grids = read_grids(fps)
    ref_grid = grids[0]
    results = []
    for grid_di in grids:
        result = {"filepath": grid_di["filepath"]}
        for varname in GRID_VARS:
            if isinstance(grid_di[varname], np.ndarray):
                # if array, check that the shapes are the same, call it dim_error if not
                if grid_di[varname].shape == ref_grid[varname].shape:
                    if np.all(grid_di[varname] == ref_grid[varname]):
                        result[varname] = "match"
                    elif np.allclose(grid_di[varname], ref_grid[varname]):
                        # equal within floating point tolerance, but not exactly
                        result[varname] = "close"
                    else:
                        result[varname] = "mismatch"
                else:
                    result[varname] = "dim_error"
            elif grid_di[varname] is None:
                result[varname] = "missing"
            else:
                # if the grid is not None or an array, we have a problem
                result[varname] = "problem"
        
        results.append(result)
    return pd.DataFrame(results), grids[0]["filepath"]
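
The labels assigned in `compare_grids` can be illustrated on small synthetic arrays. The sketch below is a simplified standalone version of the same decision order (missing, shape, exact equality, tolerance). The function `classify` is hypothetical, and unlike the function above it also returns "missing" when the reference itself is None.

```python
import numpy as np


def classify(arr, ref):
    """Label arr relative to ref: missing, dim_error, match, close, or mismatch."""
    if arr is None or ref is None:
        return "missing"
    if arr.shape != ref.shape:
        return "dim_error"
    if np.array_equal(arr, ref):
        return "match"
    if np.allclose(arr, ref):
        # differs only within NumPy's default floating point tolerances
        return "close"
    return "mismatch"


# Synthetic reference coordinate for illustration only
ref = np.linspace(-90, 90, 5)

print(classify(ref.copy(), ref))
print(classify(ref + 1e-12, ref))
print(classify(ref + 0.5, ref))
print(classify(ref[:-1], ref))
print(classify(None, ref))
```

The distinction between "match" and "close" matters here because coordinate values that were written with different precision, or recomputed, can differ in the last few bits while describing the same grid.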

Iterate through each model, collect all of its filepaths, and compare every file against a reference file (the first grid returned by `read_grids`):

In [193]:
results = []
ref_fps = {}
for inst_model in res_df.inst_model:
    inst, model = inst_model.split("_")
    fps = list(cmip6_dir.joinpath("ScenarioMIP").glob(f"{inst}/{model}/*/*/*/*/*/*/*.nc"))
    compare_df, ref_fp = compare_grids(fps)
    results.append(compare_df)
    ref_fps[model] = ref_fp

100%|██████████| 291/291 [00:15<00:00, 18.41it/s]
100%|██████████| 143/143 [00:09<00:00, 15.22it/s]
100%|██████████| 546/546 [00:09<00:00, 58.65it/s]
100%|██████████| 864/864 [00:13<00:00, 65.98it/s]
...

By this point we should expect that none of the four grid variables will show "match" for every file:

In [194]:
results_df = pd.concat(results)
for varname in GRID_VARS:
    print(f"{varname} all match: {(results_df[varname] == 'match').all()}")

lat all match: False
lon all match: False
lat_bnds all match: False
lon_bnds all match: False


Now for the fun part: picking through these results to figure out why some grids differ within the same model.

View the types of "mismatches" that occurred for each grid variable:

In [195]:
for varname in GRID_VARS:
    print(f"{varname}:", results_df.query(f"{varname} != 'match'")[varname].unique())

lat: ['dim_error' 'close']
lon: ['mismatch' 'close']
lat_bnds: ['dim_error' 'missing' 'close']
lon_bnds: ['mismatch' 'missing' 'close']
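
Beyond listing which statuses occur, it can help to count how often each one occurs per grid variable. The sketch below shows one way to do that with `value_counts`, using a small synthetic stand-in (`demo_df`, hypothetical) rather than the real `results_df`.

```python
import pandas as pd

GRID_VARS = ["lat", "lon", "lat_bnds", "lon_bnds"]

# Synthetic stand-in for results_df, for illustration only
demo_df = pd.DataFrame({
    "lat": ["match", "match", "close", "dim_error"],
    "lon": ["match", "mismatch", "close", "match"],
    "lat_bnds": ["match", "missing", "close", "dim_error"],
    "lon_bnds": ["match", "missing", "close", "mismatch"],
})

# One row per status, one column per grid variable; absent statuses become 0
status_counts = (
    demo_df[GRID_VARS]
    .apply(pd.Series.value_counts)
    .fillna(0)
    .astype(int)
)
print(status_counts)
```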


Let's check out the frequency of mismatches by attributes of the data. Pull out those attributes into separate columns:

In [198]:
def fp_to_attrs(fp):
    """Pull the model, scenario, frequency, variable and timeframe from a CMIP6 filepath"""
    varname = fp.parent.parent.parent.name
    frequency = fp.parent.parent.parent.parent.name
    scenario = fp.parent.parent.parent.parent.parent.parent.name
    model = fp.parent.parent.parent.parent.parent.parent.parent.name
    timeframe = fp.name.split("_")[-1].split(".nc")[0]
    
    return model, scenario, frequency, varname, timeframe

results_df["model"], results_df["scenario"], results_df["frequency"], results_df["variable"], results_df["timeframe"] = zip(*results_df["filepath"].map(fp_to_attrs))
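
An alternative to chaining `.parent` is to slice the path components and zip them with labels. The sketch below assumes the directory layout seen in the file listing earlier in this notebook. The key names in `DRS_KEYS` and the function `parse_drs` are hypothetical labels chosen for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical labels for the nine directory levels above each file
DRS_KEYS = [
    "activity", "institution", "model", "scenario", "variant",
    "frequency", "variable", "grid_type", "version",
]


def parse_drs(fp):
    """Map the directory levels and filename of a CMIP6 path to named attributes."""
    fp = PurePosixPath(fp)
    # the nine directories immediately above the file, in order
    attrs = dict(zip(DRS_KEYS, fp.parts[-10:-1]))
    # the last underscore-separated token of the filename is the time range
    attrs["timeframe"] = fp.stem.split("_")[-1]
    return attrs


example = (
    "/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/"
    "r1i1p1f1/Amon/tas/gr1/v20180701/"
    "tas_Amon_GFDL-ESM4_ssp585_r1i1p1f1_gr1_201501-210012.nc"
)
print(parse_drs(example))
```

Returning a dict keeps the attribute names next to their values, which avoids relying on tuple order when assigning the new columns.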

Then group by those columns and count the files in each group:

In [214]:
groupby_cols = ["model", "scenario", "frequency", "variable"]
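# `varname` is not set in this cell; it carries over from the last loop over GRID_VARS (so likely "lon_bnds")
mismatch = results_df.query(f"{varname} == 'mismatch'").groupby(groupby_cols).filepath.count().to_frame().reset_index()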

In [218]:
mismatch.model.unique()

array(['HadGEM3-GC31-LL', 'HadGEM3-GC31-MM', 'KACE-1-0-G'], dtype=object)

In [219]:
# as above, `varname` carries over from the last loop over GRID_VARS
missing = results_df.query(f"{varname} == 'missing'").groupby(groupby_cols).filepath.count().to_frame().reset_index()

In [221]:
missing

| | model | scenario | frequency | variable | filepath |
|---|---|---|---|---|---|
| 0 | CNRM-CM6-1-HR | ssp126 | Amon | evspsbl | 1 |
| 1 | CNRM-CM6-1-HR | ssp126 | Amon | hus | 2 |
| 2 | CNRM-CM6-1-HR | ssp126 | Amon | huss | 1 |
| 3 | CNRM-CM6-1-HR | ssp126 | Amon | pr | 1 |
| 4 | CNRM-CM6-1-HR | ssp126 | Amon | prsn | 1 |
| ... | ... | ... | ... | ... | ... |
| 101 | CNRM-CM6-1-HR | ssp585 | day | tas | 2 |
| 102 | CNRM-CM6-1-HR | ssp585 | day | ua | 18 |
| 103 | CNRM-CM6-1-HR | ssp585 | day | uas | 3 |
| 104 | CNRM-CM6-1-HR | ssp585 | day | va | 42 |
