# Explore grids of CMIP6

This notebook explores the grids used by CMIP6 models. We assume all grids are rectilinear WGS84 grids that differ mainly in resolution, but we do not know how much the resolution varies, so that is the main focus. We also want to check other things, such as whether each model uses a single grid consistently.

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
import xarray as xr
from math import radians
from multiprocessing import Pool
import tqdm

# ignore xarray's SerializationWarning for datasets with multiple fill values
import warnings
warnings.filterwarnings("ignore", category=xr.SerializationWarning)

In [2]:
cmip6_dir = Path("/beegfs/CMIP6/arctic-cmip6/CMIP6")

Get one file for each model. Every model should have monthly temperature (`tas`) data, with a file starting in 2015-01:

In [3]:
fps = list(cmip6_dir.joinpath("ScenarioMIP").glob("*/*/ssp585/*/Amon/tas/*/*/*201501*.nc"))
print("Unique models (files):", len(fps))
fps

Unique models (files): 12


[PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-ESM4_ssp585_r1i1p1f1_gr1_201501-210012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NIMS-KMA/KACE-1-0-G/ssp585/r1i1p1f1/Amon/tas/gr/v20190920/tas_Amon_KACE-1-0-G_ssp585_r1i1p1f1_gr_201501-210012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM-CM6-1-HR/ssp585/r1i1p1f2/Amon/tas/gr/v20191202/tas_Amon_CNRM-CM6-1-HR_ssp585_r1i1p1f2_gr_201501-210012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NCC/NorESM2-MM/ssp585/r1i1p1f1/Amon/tas/gn/v20191108/tas_Amon_NorESM2-MM_ssp585_r1i1p1f1_gn_201501-202012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/AS-RCEC/TaiESM1/ssp585/r1i1p1f1/Amon/tas/gn/v20200901/tas_Amon_TaiESM1_ssp585_r1i1p1f1_gn_201501-210012.nc'),
 PosixPath('/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/MOHC/HadGEM3-GC31-MM/ssp585/r1i1p1f3/Amon/tas/gn/v20200515/tas_Amon_HadGEM
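The model and institution names used later in this notebook are read off the directory structure of these paths. As a minimal sketch, assuming every file sits at the same depth as the paths listed above, `parents` can index the directory levels directly. The helper name `parse_cmip6_path` and its key names are hypothetical and not part of the notebook:

```python
from pathlib import PurePosixPath


def parse_cmip6_path(fp):
    """Split a CMIP6 file path into its directory-level components.

    Assumes the layout seen in the listing above:
    .../<institution>/<model>/<scenario>/<variant>/<frequency>/<variable>/<grid>/<version>/<file>
    """
    fp = PurePosixPath(fp)
    keys = [
        "version", "grid_label", "varname", "frequency",
        "variant", "scenario", "model", "institution",
    ]
    # parents[0] is the directory holding the file, parents[1] is one level up, and so on
    return {key: fp.parents[i].name for i, key in enumerate(keys)}


example = (
    "/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/"
    "r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-ESM4_ssp585_r1i1p1f1_gr1_201501-210012.nc"
)
attrs = parse_cmip6_path(example)
print(attrs["institution"], attrs["model"], attrs["grid_label"])
```

Indexing by depth like this only works while the directory layout is uniform; a file stored at a different depth would silently yield the wrong names.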

In [4]:
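def get_res(fp):
    """Return the (lat, lon) grid spacing in degrees, taken from the first two coordinate values"""
    with xr.open_dataset(fp) as ds:
        # only the first step is used, so this assumes roughly regular spacing
        lat_res = np.diff(ds.lat)[0].round(2)
        lon_res = np.diff(ds.lon)[0].round(2)

    return lat_res, lon_res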

In [5]:
rows = []
for fp in fps:
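    # layout: .../<inst>/<model>/<scenario>/<variant>/<frequency>/<variable>/<grid>/<version>/<file>
    model = fp.parents[6].name
    inst = fp.parents[7].name
    lat_res, lon_res = get_res(fp)
    # approximate cell size in km: ~110 km per degree of latitude, and
    # 111.320 * cos(latitude) km per degree of longitude, evaluated at 65°N
    ysize_km = np.round(lat_res * 110)
    xsize_km = np.abs(np.round(lon_res * (111.320 * np.cos(radians(65)))))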
    rows.append({
        "inst_model": f"{inst}_{model}",
        "lat_res": lat_res,
        "lon_res": lon_res,
        "ysize_km": ysize_km,
        "xsize_km": xsize_km
    })

res_df = pd.DataFrame(rows)

In [6]:
res_df

| | inst_model | lat_res | lon_res | ysize_km | xsize_km |
|---|---|---|---|---|---|
| 0 | NOAA-GFDL_GFDL-ESM4 | 1.0 | 1.25 | 110.0 | 59.0 |
| 1 | NIMS-KMA_KACE-1-0-G | 1.25 | 1.88 | 138.0 | 88.0 |
| 2 | CNRM-CERFACS_CNRM-CM6-1-HR | 0.5 | 0.5 | 55.0 | 24.0 |
| 3 | NCC_NorESM2-MM | 0.94 | 1.25 | 103.0 | 59.0 |
| 4 | AS-RCEC_TaiESM1 | 0.94 | 1.25 | 103.0 | 59.0 |
| 5 | MOHC_HadGEM3-GC31-MM | 0.56 | 0.83 | 62.0 | 39.0 |
| 6 | MOHC_HadGEM3-GC31-LL | 1.25 | 1.88 | 138.0 | 88.0 |
| 7 | MIROC_MIROC6 | 1.39 | 1.41 | 153.0 | 66.0 |
| 8 | EC-Earth-Consortium_EC-Earth3-Veg | 0.7 | 0.7 | 77.0 | 33.0 |
| 9 | NCAR_CESM2 | 0.94 | 1.25 | 103.0 | 59.0 |
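
The `ysize_km` and `xsize_km` columns come from a flat approximation: roughly 110 km per degree of latitude, and 111.320 km per degree of longitude scaled by the cosine of a reference latitude (65°N in the cell above). The sketch below repeats that arithmetic with only the standard library. The function name `cell_size_km` and the `ref_lat` parameter are illustrative, not part of the notebook:

```python
from math import cos, radians


def cell_size_km(lat_res, lon_res, ref_lat=65.0):
    """Approximate (north-south, east-west) cell size in km for a grid spacing in degrees."""
    # a degree of latitude is taken as a constant ~110 km
    y_km = lat_res * 110
    # a degree of longitude shrinks with the cosine of the latitude
    x_km = abs(lon_res * 111.320 * cos(radians(ref_lat)))
    return round(y_km), round(x_km)


# the CESM2 spacing from the table, at 65°N and at the equator
print(cell_size_km(0.94, 1.25))
print(cell_size_km(0.94, 1.25, ref_lat=0.0))
```

The east-west size depends strongly on the reference latitude, so the `xsize_km` values are only meaningful for high northern latitudes.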


It looks like one grid is used more than the others: the grid shared by CESM2, CESM2-WACCM, TaiESM1 and NorESM2-MM. The latter three models are based on CESM2, so this makes sense. This grid also sits in the middle of the range of resolutions, which makes it a good candidate target for regridding.

## Intra-model consistency of grids

Here we assess how much grids vary within each model. You would expect all files from a given model to share the same grid, but this is not the case. We will scrape some info from each file to characterize its grid, and build an identifier from that info so that each unique grid gets a unique identifier. A simple approach is to concatenate the various grid attributes, such as the min and max of the lat and lon variables, into a single string.

Define some functions to help with comparing grids among files within models:

In [130]:
GRID_VARS = ["lat", "lon", "lat_bnds", "lon_bnds"]


def fp_to_attrs(fp):
    """pull the data attributes from a filepath"""
    # layout: .../<model>/<scenario>/<variant>/<frequency>/<variable>/<grid>/<version>/<file>
    varname = fp.parents[2].name
    frequency = fp.parents[3].name
    scenario = fp.parents[5].name
    model = fp.parents[6].name
    timeframe = fp.name.split("_")[-1].split(".nc")[0]
    
    attr_di = {
        "model": model,
        "scenario": scenario,
        "frequency": frequency,
        "varname": varname,
        "timeframe": timeframe
    }
    
    return attr_di


def get_grid(fp):
    """Read the info from a grid for a single file"""
    grid_di = {}
    with xr.open_dataset(fp) as ds:
        for varname in GRID_VARS:
            if (varname in ds.dims) or (varname in ds.data_vars):
                grid_di[f"{varname}_min"] = ds[varname].values.min()
                grid_di[f"{varname}_max"] = ds[varname].values.max()
                grid_di[f"{varname}_size"] = ds[varname].values.shape[0]
                grid_di[f"{varname}_step"] = np.diff(ds[varname].values)[0]
            else:
                grid_di[f"{varname}_min"] = None
                grid_di[f"{varname}_max"] = None
                grid_di[f"{varname}_size"] = None
                grid_di[f"{varname}_step"] = None
            
    # create a grid identifier that is a concatenation of all of these values
    grid_di["grid"] = "_".join(str(value) for value in grid_di.values())
    # pull out file attributes (model scenario etc)
    grid_di.update(fp_to_attrs(fp))
    # also keep the filename for reference
    grid_di["fp"] = fp
        
    return grid_di


def read_grids(fps):
    """Read the grid info from all files in fps, using multiprocessing and with a progress bar"""
    grids = []
    with Pool(32) as pool:
        for grid_di in tqdm.tqdm(
            pool.imap_unordered(get_grid, fps), total=len(fps)
        ):
            grids.append(grid_di)
            
    return grids

Iterate through each model, collect all of its filepaths, and read the grid info from every file:

In [131]:
results = []

for inst_model in res_df.inst_model:
    # split on the first underscore only, in case a model name ever contains one
    inst, model = inst_model.split("_", 1)
    fps = list(cmip6_dir.joinpath("ScenarioMIP").glob(f"{inst}/{model}/*/*/*/*/*/*/*.nc"))
    results.append(read_grids(fps))

100%|██████████| 291/291 [00:16<00:00, 17.86it/s]
100%|██████████| 143/143 [00:09<00:00, 14.34it/s]
100%|██████████| 546/546 [00:09<00:00, 58.40it/s]
100%|██████████| 864/864 [00:13<00:00, 63.25it/s]
100%|██████████

Combine results into a DataFrame, with the grid info for each file contained in a single row:

In [133]:
results_df = pd.concat([pd.DataFrame(rows) for rows in results])

Group by model and count the unique grids for each:

In [134]:
results_df.groupby("model").grid.value_counts()

model            grid                                                                                                                                                                                                          
CESM2            -90.0_90.0_192_0.9424083769633569_0.0_358.75_288_1.25_-90.0_90.0_192_[0.47120419]_-0.625_359.375_288_[1.25]                                                                                                        420
                 -90.0_90.0_192_0.9424057006835938_0.0_358.75_288_1.25_-90.0_90.0_192_[0.47120667]_-0.625_359.375_288_[1.25]                                                                                                         40
CESM2-WACCM      -90.0_90.0_192_0.9424083769633569_0.0_358.75_288_1.25_-90.0_90.0_192_[0.47120419]_-0.625_359.375_288_[1.25]                                                                                                        731
                 -90.0_90.0_192_0.9424083769633569_0.0_358.75_288_1.25_-90.0_90.

This information is probably not required for regridding. It is mostly for reference, and for deciding which of the CESM2 grids to use as the regridding target.

Notice that some of the models share the more common of the two CESM2 grids, the one used in 420 of the 460 files for that model:

In [135]:
cesm2_grids = results_df.query("model == 'CESM2'").grid.value_counts()
print(cesm2_grids)

-90.0_90.0_192_0.9424083769633569_0.0_358.75_288_1.25_-90.0_90.0_192_[0.47120419]_-0.625_359.375_288_[1.25]    420
-90.0_90.0_192_0.9424057006835938_0.0_358.75_288_1.25_-90.0_90.0_192_[0.47120667]_-0.625_359.375_288_[1.25]     40
Name: grid, dtype: int64


Note that the differences are very subtle: the two grids differ only after the fifth significant digit of the latitude and `lat_bnds` step sizes.
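
Differences this small can arise from numeric precision alone. One plausible explanation, which is an assumption and has not been checked against the files here, is that some files store their coordinates in single precision. The sketch below builds a synthetic 192-point latitude axis (not read from any file), casts it to `float32`, and compares the first step of each version:

```python
import numpy as np

# 192 evenly spaced latitudes from -90 to 90, matching the size and range reported above
lat64 = np.linspace(-90, 90, 192)
# assumption: emulate a file that stores the same coordinate in single precision
lat32 = lat64.astype(np.float32)

step64 = float(np.diff(lat64)[0])
step32 = float(np.diff(lat32)[0])
print(step64, step32)
print("absolute difference:", abs(step64 - step32))

# rounding the step makes the two variants indistinguishable
print(round(step64, 4) == round(step32, 4))
```

If the grid identifier were built from rounded values, grids that differ only at this level would collapse into one. Whether that is desirable depends on how exact the regridding target needs to be.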

Check that this more common grid is also used in the other three models based on CESM2:

In [136]:
cesm2_gr1, cesm2_gr2 = cesm2_grids.index
print("Models with CESM2's first (common) grid:", np.unique(results_df.query("grid == @cesm2_gr1").model))
print("Models with CESM2's second grid:", np.unique(results_df.query("grid == @cesm2_gr2").model))

Models with CESM2's first (common) grid: ['CESM2' 'CESM2-WACCM' 'NorESM2-MM' 'TaiESM1']
Models with CESM2's second grid: ['CESM2' 'CESM2-WACCM']
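
The same question can be phrased as a set intersection: which grids appear in every one of these models? The sketch below uses hypothetical toy records, with `"A"` and `"B"` standing in for the two long CESM2 grid identifiers, arranged to mirror the output above. It does not read `results_df`:

```python
from collections import defaultdict

# hypothetical (model, grid) pairs mirroring the output above
records = [
    ("CESM2", "A"), ("CESM2", "B"),
    ("CESM2-WACCM", "A"), ("CESM2-WACCM", "B"),
    ("NorESM2-MM", "A"),
    ("TaiESM1", "A"),
]

# collect the set of grids seen for each model
grids_by_model = defaultdict(set)
for model, grid in records:
    grids_by_model[model].add(grid)

# a grid is a safe common target only if every model uses it
shared = set.intersection(*grids_by_model.values())
print(shared)
```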


This tells us that the more common version of the CESM2 grid is also used by the derivative models, so we can regrid to it. Choose a monthly file to serve as the reference for all regridding:

In [137]:
use_row = results_df.query("grid == @cesm2_gr1 & model == 'CESM2' & frequency == 'Amon' & varname == 'tas'").iloc[0]
print("Use this file for regridding:", use_row["fp"])

Use this file for regridding: /beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NCAR/CESM2/ssp370/r11i1p1f1/Amon/tas/gn/v20200528/tas_Amon_CESM2_ssp370_r11i1p1f1_gn_206501-210012.nc


There we have it: we can use `/beegfs/CMIP6/arctic-cmip6/CMIP6/ScenarioMIP/NCAR/CESM2/ssp370/r11i1p1f1/Amon/tas/gn/v20200528/tas_Amon_CESM2_ssp370_r11i1p1f1_gn_206501-210012.nc` as the reference grid for regridding all other files.