# EDA: GIPL IEM Edition

The purpose of this notebook is to examine the most recent set of GIPL model outputs provided to SNAP by Dr. Sergey Marchenko. At a high level, we already know that this set of GIPL outputs has a few distinguishing features, including:

 - Spatial extent over the entire Integrated Ecosystem Model (IEM) project domain. This includes portions of Western Canada.
 - Includes monthly data rather than just annual summaries.
 - Includes model outputs driven by two climate models and two emissions scenarios.
 - Includes a "historical modeled" baseline set of outputs forced by CRU TS (1901-2015).
 - The permafrost variables are as expected, except that some sets of outputs are missing the ground temperature at 4 m depth (`magt_4m`). For example, the CRU TS data do not have `magt_4m`.
 
The projected data span years 2006 through 2100. The spatial resolution of the source dataset is 1 km<sup>2</sup>.

## Decompress Source Dataset: Unzip to Flat

The source dataset has a somewhat inconsistent structure. Each model-scenario (or historical) set of outputs has a directory that contains one .zip file per variable for the annual data. Some of the monthly data have the same structure, but some have monthly subdirectories as well. The key thing here is to verify that we extracted everything we thought we would, and to tag each extracted file with its parent directory so it is clear where each file came from. We can clean up those filenames later. To add zest, some of the monthly data use numerals to indicate the month (like `01`) and some use the abbreviation (`Jan`). There is also a fun mix of hyphens and underscores in the filenames, as well as various upper vs. lower case situations. Some filenames have leading underscores as well, which is a bit odd.

In general, a flat data structure where the filenames can be parsed is a little easier to work with in my opinion. We'll try to accomplish that here. The fastest and most portable way to do this seems to be a shell script. We can spawn these processes from this notebook, and handle the monthly and annual data separately to keep this simple. We'll also separate out the decompression of the flat and nested monthly data, because the nested data may need to be renamed to deconflict filenames. The source dataset is 174 GB compressed, so this could take a while because we'll end up with over ten thousand GeoTIFFs.
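For readers without the shell scripts at hand, the same flatten-and-tag idea can be sketched with the Python standard library. This is only an illustration and not the scripts called below: the function name `unzip_flat` and the `parent__member` naming pattern are assumptions made for this sketch.

```python
import zipfile
from pathlib import Path


def unzip_flat(source_dir, target_dir):
    """Extract every .zip under source_dir into one flat target_dir.

    Each extracted file is prefixed with the name of the directory that
    held its archive, so the origin stays visible in the filename.
    (The `parent__member` pattern is an assumption of this sketch.)
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in sorted(source_dir.rglob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.infolist():
                if member.is_dir():
                    continue
                # drop any folders inside the archive, keep only the file name
                flat_name = f"{zip_path.parent.name}__{Path(member.filename).name}"
                out_path = target_dir / flat_name
                # reads the whole member into memory before writing it out
                out_path.write_bytes(zf.read(member))
                extracted.append(out_path)
    return extracted
```

Prefixing with the parent directory also reduces the chance that two archives overwrite each other's files in the flat target, although identical names from the same directory would still collide.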

In [49]:
import subprocess
import re
from pathlib import Path

import numpy as np
import pandas as pd

from config import SOURCE_DIR, TARGET_DIR, extracted_annual_dir, extracted_monthly_dir, mo_names, months

In [2]:
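UNZIP = False
if UNZIP:
    # cell magics such as %%time only work on the first line of a cell,
    # so the timing magic is left out of this conditional block
    command = f"bash ./annual_unzipit.sh {SOURCE_DIR} {extracted_annual_dir}"
    output = subprocess.check_output(command, shell=True)
    print(output.decode())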

In [3]:
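UNZIP = False
if UNZIP:
    # cell magics such as %%time only work on the first line of a cell,
    # so the timing magic is left out of this conditional block
    command = f"bash ./monthly_nested_unzipit.sh {SOURCE_DIR} {extracted_monthly_dir}"
    output = subprocess.check_output(command, shell=True)
    print(output.decode())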

In [4]:
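UNZIP = False
if UNZIP:
    # cell magics such as %%time only work on the first line of a cell,
    # so the timing magic is left out of this conditional block
    command = f"bash ./monthly_flat_unzipit.sh {extracted_monthly_dir}"
    output = subprocess.check_output(command, shell=True)
    print(output.decode())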

## Examine Annual Data

Now that everything has been extracted we should verify some basic expectations about the annual dataset.

 - For the historical CRU baseline data, there should be 115 years of data across 9 variables. However, it looks like year 1901 is missing the talik variable. It is possible that this particular variable requires initialization from the previous year's data, which would explain why it does not exist for the first year of the historical time series.
 - We would expect each model-scenario combination to have the same number of files: same variables for the same time series.
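The expected file counts follow from simple arithmetic. A small sketch, assuming the 1901-2015 span and the 9 variables stated above; the helper name `expected_annual_count` is hypothetical.

```python
def expected_annual_count(first_year, last_year, n_variables, n_missing=0):
    """Files expected when every variable has one GeoTIFF per year."""
    n_years = last_year - first_year + 1  # inclusive span
    return n_years * n_variables - n_missing


# CRU TS baseline: 115 years x 9 variables, minus the absent 1901 talik file
print(expected_annual_count(1901, 2015, 9, n_missing=1))  # 1034
# a projected model-scenario with 10 variables over 2006-2100
print(expected_annual_count(2006, 2100, 10))  # 950
```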
 

In [5]:
# check historical CRU
annual_cru = [x for x in extracted_annual_dir.glob("*CRU*.tif")]
cru_1901 = [x for x in extracted_annual_dir.glob("*1901.tif")]

print("Notice that there is no talik variable for the 1901 data:")
for f in [x.name for x in cru_1901]:
    print(f)

assert len(annual_cru) == 115 * 9 - 1

Notice that there is no talik variable for the 1901 data:
gipl2_1km_CRU_TS4.0_magt_0.5m_1901.tif
gipl2_1km_CRU_TS4.0_magt_1m_1901.tif
gipl2_1km_CRU_TS4.0_magt_2m_1901.tif
gipl2_1km_CRU_TS4.0_magt_3m_1901.tif
gipl2_1km_CRU_TS4.0_magt_5m_1901.tif
gipl2_1km_CRU_TS4.0_pfBase_1901.tif
gipl2_1km_CRU_TS4.0_pfTop_1901.tif
gipl2_1km_CRU_TS4.0_Surf_1901.tif


In [6]:
# check projected and chunk up the data by model and scenario
annual_projected = [x for x in extracted_annual_dir.glob("**/*") if "CRU" not in x.name]

rcp45_fps = [x for x in annual_projected if "rcp45" in x.name.lower()]
rcp85_fps = [x for x in annual_projected if "rcp85" in x.name.lower()]

mri_rcp45_fps = [x for x in rcp45_fps if "mri" in x.name.lower()]
ncar_rcp45_fps = [x for x in rcp45_fps if "ncar" in x.name.lower()]

mri_rcp85_fps = [x for x in rcp85_fps if "mri" in x.name.lower()]
ncar_rcp85_fps = [x for x in rcp85_fps if "ncar" in x.name.lower()]

In [7]:
proj_years = np.array(range(2006, 2101))
proj_years

array([2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
       2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027,
       2028, 2029, 2030, 2031, 2032, 2033, 2034, 2035, 2036, 2037, 2038,
       2039, 2040, 2041, 2042, 2043, 2044, 2045, 2046, 2047, 2048, 2049,
       2050, 2051, 2052, 2053, 2054, 2055, 2056, 2057, 2058, 2059, 2060,
       2061, 2062, 2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 2071,
       2072, 2073, 2074, 2075, 2076, 2077, 2078, 2079, 2080, 2081, 2082,
       2083, 2084, 2085, 2086, 2087, 2088, 2089, 2090, 2091, 2092, 2093,
       2094, 2095, 2096, 2097, 2098, 2099, 2100])

Validate that for each model-scenario the number of files per year is constant over the time series, and that all years are represented.
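One way to package this validation is a small helper that reports missing years, extra years, and the distinct per-year file counts in one pass. This is a sketch, not the approach used in the cells below; `summarize_years` is a hypothetical name, and it assumes the year is the last underscore-delimited token of the file stem.

```python
from collections import Counter
from pathlib import Path


def summarize_years(paths, expected_years):
    """Compare the years parsed from filenames against an expected range.

    Assumes names end in `_<year>.tif`, e.g. `..._pfTop_2006.tif`.
    """
    counts = Counter(int(Path(p).stem.split("_")[-1]) for p in paths)
    expected = set(expected_years)
    return {
        "missing_years": sorted(expected - set(counts)),
        "extra_years": sorted(set(counts) - expected),
        "files_per_year": sorted(set(counts.values())),
    }


# made-up filenames purely for illustration
demo = [f"model_rcp85_{var}_{yr}.tif"
        for yr in (2006, 2007, 2101) for var in ("pfTop", "Surf")]
print(summarize_years(demo, range(2006, 2009)))
# {'missing_years': [2008], 'extra_years': [2101], 'files_per_year': [2]}
```

Unlike an element-wise array comparison, this does not raise when the number of years differs from the expected range. It names the offending years instead.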

In [8]:
mri_rcp85_years = [x.name.split("_")[-1][:-4] for x in mri_rcp85_fps]
mri_rcp85_year_count = np.unique(mri_rcp85_years, return_counts=True)
assert ((mri_rcp85_year_count[0].astype(int) - proj_years) == 0).all()
mri_rcp85_year_count

(array(['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
        '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
        '2022', '2023', '2024', '2025', '2026', '2027', '2028', '2029',
        '2030', '2031', '2032', '2033', '2034', '2035', '2036', '2037',
        '2038', '2039', '2040', '2041', '2042', '2043', '2044', '2045',
        '2046', '2047', '2048', '2049', '2050', '2051', '2052', '2053',
        '2054', '2055', '2056', '2057', '2058', '2059', '2060', '2061',
        '2062', '2063', '2064', '2065', '2066', '2067', '2068', '2069',
        '2070', '2071', '2072', '2073', '2074', '2075', '2076', '2077',
        '2078', '2079', '2080', '2081', '2082', '2083', '2084', '2085',
        '2086', '2087', '2088', '2089', '2090', '2091', '2092', '2093',
        '2094', '2095', '2096', '2097', '2098', '2099', '2100'],
       dtype='<U4'),
 array([9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
        9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9

Interpret the above like this: MRI-CGCM3 RCP 8.5 has 9 files for each year of the time series.

In [9]:
ncar_rcp85_years = [x.name.split("_")[-1][:-4] for x in ncar_rcp85_fps]
ncar_rcp85_year_count = np.unique(ncar_rcp85_years, return_counts=True)
assert ((ncar_rcp85_year_count[0].astype(int) - proj_years) == 0).all()
ncar_rcp85_year_count

(array(['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
        '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
        '2022', '2023', '2024', '2025', '2026', '2027', '2028', '2029',
        '2030', '2031', '2032', '2033', '2034', '2035', '2036', '2037',
        '2038', '2039', '2040', '2041', '2042', '2043', '2044', '2045',
        '2046', '2047', '2048', '2049', '2050', '2051', '2052', '2053',
        '2054', '2055', '2056', '2057', '2058', '2059', '2060', '2061',
        '2062', '2063', '2064', '2065', '2066', '2067', '2068', '2069',
        '2070', '2071', '2072', '2073', '2074', '2075', '2076', '2077',
        '2078', '2079', '2080', '2081', '2082', '2083', '2084', '2085',
        '2086', '2087', '2088', '2089', '2090', '2091', '2092', '2093',
        '2094', '2095', '2096', '2097', '2098', '2099', '2100'],
       dtype='<U4'),
 array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
        10, 10, 10, 10, 10, 10, 10, 10, 10, 10

Well, we can see now that NCAR-CCSM4 RCP 8.5 has one more file for each year of the time series than MRI-CGCM3 RCP 8.5 does. Let's look at the variables next.

In [10]:
mri_rcp85_vars = [x.name.split("_")[-2] for x in mri_rcp85_fps]
np.unique(mri_rcp85_vars, return_counts=True)

(array(['0.5m', '1m', '2m', '3m', '5m', 'ALT', 'Surf', 'Talik', 'pfBase'],
       dtype='<U6'),
 array([95, 95, 95, 95, 95, 95, 95, 95, 95]))

In [11]:
ncar_rcp85_vars = [x.name.split("_")[-2] for x in ncar_rcp85_fps]
np.unique(ncar_rcp85_vars, return_counts=True)

(array(['0.5m', '1m', '2m', '3m', '4m', '5m', 'Surf', 'Talik', 'pfBase',
        'pfTop'], dtype='<U6'),
 array([95, 95, 95, 95, 95, 95, 95, 95, 95, 95]))

In [12]:
set(ncar_rcp85_vars) - set(mri_rcp85_vars)

{'4m', 'pfTop'}

In [13]:
set(mri_rcp85_vars) - set(ncar_rcp85_vars)

{'ALT'}

Well that explains that - the NCAR data have two variables that the MRI data lack: Mean Annual Ground Temperature (MAGT) at 4 m depth and Permafrost Top (pfTop), but the MRI data then have one variable that the NCAR data lack: Active Layer Thickness (ALT). Let's try the same analysis for the RCP 4.5 versions of these two models.
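The pairwise set differences can also be generalized to any number of model-scenario groups by comparing each group against the union of all variables. A minimal sketch; `compare_variable_sets` is a hypothetical helper, and the variable tokens are copied from the outputs above.

```python
def compare_variable_sets(variables_by_group):
    """For each group, list the variables it lacks relative to the union of all groups."""
    all_vars = set().union(*variables_by_group.values())
    return {name: sorted(all_vars - set(vs)) for name, vs in variables_by_group.items()}


groups = {
    "MRI-CGCM3 RCP 8.5": ["0.5m", "1m", "2m", "3m", "5m", "ALT", "Surf", "Talik", "pfBase"],
    "NCAR-CCSM4 RCP 8.5": ["0.5m", "1m", "2m", "3m", "4m", "5m", "Surf", "Talik", "pfBase", "pfTop"],
}
print(compare_variable_sets(groups))
# {'MRI-CGCM3 RCP 8.5': ['4m', 'pfTop'], 'NCAR-CCSM4 RCP 8.5': ['ALT']}
```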

In [14]:
mri_rcp45_years = [x.name.split("_")[-1][:-4] for x in mri_rcp45_fps]
mri_rcp45_year_count = np.unique(mri_rcp45_years, return_counts=True)
assert ((mri_rcp45_year_count[0].astype(int) - proj_years) == 0).all()
mri_rcp45_year_count

(array(['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
        '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
        '2022', '2023', '2024', '2025', '2026', '2027', '2028', '2029',
        '2030', '2031', '2032', '2033', '2034', '2035', '2036', '2037',
        '2038', '2039', '2040', '2041', '2042', '2043', '2044', '2045',
        '2046', '2047', '2048', '2049', '2050', '2051', '2052', '2053',
        '2054', '2055', '2056', '2057', '2058', '2059', '2060', '2061',
        '2062', '2063', '2064', '2065', '2066', '2067', '2068', '2069',
        '2070', '2071', '2072', '2073', '2074', '2075', '2076', '2077',
        '2078', '2079', '2080', '2081', '2082', '2083', '2084', '2085',
        '2086', '2087', '2088', '2089', '2090', '2091', '2092', '2093',
        '2094', '2095', '2096', '2097', '2098', '2099', '2100'],
       dtype='<U4'),
 array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
        10, 10, 10, 10, 10, 10, 10, 10, 10, 10

In [15]:
ncar_rcp45_years = [x.name.split("_")[-1][:-4] for x in ncar_rcp45_fps]
ncar_rcp45_year_count = np.unique(ncar_rcp45_years, return_counts=True)
assert ((ncar_rcp45_year_count[0].astype(int) - proj_years) == 0).all()

ValueError: operands could not be broadcast together with shapes (121,) (95,) 

Uh oh.

In [16]:
ncar_rcp45_year_count

(array(['1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988',
        '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996',
        '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
        '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012',
        '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020',
        '2021', '2022', '2023', '2024', '2025', '2026', '2027', '2028',
        '2029', '2030', '2031', '2032', '2033', '2034', '2035', '2036',
        '2037', '2038', '2039', '2040', '2041', '2042', '2043', '2044',
        '2045', '2046', '2047', '2048', '2049', '2050', '2051', '2052',
        '2053', '2054', '2055', '2056', '2057', '2058', '2059', '2060',
        '2061', '2062', '2063', '2064', '2065', '2066', '2067', '2068',
        '2069', '2070', '2071', '2072', '2073', '2074', '2075', '2076',
        '2077', '2078', '2079', '2080', '2081', '2082', '2083', '2084',
        '2085', '2086', '2087', '2088', '2089', '2090', '2091', 

Wow so this is definitely not expected - the NCAR-CCSM4 data for RCP 4.5 seems to go back to the 1980s? But then the time series extends to 2101?

In [17]:
set(np.unique(ncar_rcp45_years)) - set(np.unique(mri_rcp45_years))

{'1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2101'}

In [18]:
set(np.unique(mri_rcp45_years)) - set(np.unique(ncar_rcp45_years)) 

set()

Alright well it seems that there are just bonus years in the NCAR-CCSM4 RCP 4.5 dataset? We'll want to confirm that these are not needed, but this isn't a dealbreaker because we can just omit the extra years when building a coverage. Let's look at the RCP 4.5 variables next.
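Omitting the extra years is a simple filter on the year token. A sketch under the same filename assumption as before (the year is the last underscore-delimited token of the stem); `keep_years` is a hypothetical helper and the demo filenames are made up.

```python
from pathlib import Path


def keep_years(paths, first_year=2006, last_year=2100):
    """Keep only files whose trailing year token falls inside the span."""
    kept = []
    for p in paths:
        year = int(Path(p).stem.split("_")[-1])
        if first_year <= year <= last_year:
            kept.append(p)
    return kept


demo = [f"example_rcp45_pfTop_{yr}.tif" for yr in (1981, 2005, 2006, 2100, 2101)]
print(keep_years(demo))
# ['example_rcp45_pfTop_2006.tif', 'example_rcp45_pfTop_2100.tif']
```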

In [19]:
mri_rcp45_vars = [x.name.split("_")[-2] for x in mri_rcp45_fps]
np.unique(mri_rcp45_vars, return_counts=True)

(array(['0.5m', '1m', '2m', '3m', '4m', '5m', 'Surf', 'Talik', 'pfBase',
        'pfTop'], dtype='<U6'),
 array([95, 95, 95, 95, 95, 95, 95, 95, 95, 95]))

In [20]:
ncar_rcp45_vars = [x.name.split("_")[-2] for x in ncar_rcp45_fps]
np.unique(ncar_rcp45_vars, return_counts=True)

(array(['0.5m', '1m', '2m', '3m', '4m', '5m', 'Surf', 'Talik', 'pfBase',
        'pfTop'], dtype='<U6'),
 array([121, 121, 121, 121, 121, 121, 120, 120, 120, 120]))

In [21]:
set(ncar_rcp45_vars) - set(mri_rcp45_vars)

set()

In [22]:
set(mri_rcp45_vars) - set(ncar_rcp45_vars)

set()

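The variables are consistent between these two cases, which is good. Notice that `pfTop` and `magt` at 4 m are present, but `ALT` is not. I'm guessing the "correct" variable set is this one: with `pfTop` and 4 m `magt`, and without `ALT`. The NCAR-CCSM4 file counts are not uniform, though: four variables have 120 files rather than 121, so each of those appears to be missing one year. Now we can check not just between models, but between scenarios as well.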

In [23]:
set(mri_rcp85_vars) - set(mri_rcp45_vars)

{'ALT'}

In [24]:
set(mri_rcp45_vars) - set(mri_rcp85_vars)

{'4m', 'pfTop'}

In [25]:
set(ncar_rcp85_vars) - set(ncar_rcp45_vars)

set()

In [26]:
set(ncar_rcp45_vars) - set(ncar_rcp85_vars)

set()

That makes sense - we have the anomalous `ALT` appearance, but otherwise the variables are consistent.

**So, to document some inconsistencies in the annual dataset:**

- Why do the CRU TS 4.0 outputs for 1901 not have a Talik variable?
- Why is the entire CRU TS series missing data for MAGT @ 4m depth?
- The NCAR-CCSM4 RCP 8.5 data have two variables that the MRI-CGCM3 RCP 8.5 data do not: Mean Annual Ground Temperature at 4 m depth and Permafrost Top (pfTop). However, the MRI-CGCM3 RCP 8.5 data have one variable that the NCAR-CCSM4 RCP 8.5 data lack: Active Layer Thickness (ALT). No other model-scenario combinations have ALT, so can we assume that the intended "correct" variable set excludes ALT and includes MAGT 4m and pfTop?
- The NCAR-CCSM4 RCP 4.5 data go back to the 1980s and also contain output for the year 2101. All other model-scenario time series run 2006-2100.

## Examine Monthly Data

Our expectations for the monthly dataset are basically the same as for the annual data.

- The data have a monthly frequency, so we'd expect each month to have the same amount of data.
- Recall there is no historical CRU baseline for the monthly data.
- We would expect each model-scenario combination to have the same number of files: same variables for the same time series.
 

In [27]:
monthly_projected = [x for x in extracted_monthly_dir.rglob("*.tif")]

mo_rcp45_fps = [x for x in monthly_projected if "rcp45" in x.name.lower()]
mo_rcp85_fps = [x for x in monthly_projected if "rcp85" in x.name.lower()]

In [28]:
len(monthly_projected)

36155

In [29]:
len(mo_rcp85_fps)

18161

In [30]:
len(mo_rcp45_fps)

17994

In [31]:
# ensure we are parsing the scenario correctly
assert len(mo_rcp85_fps) + len(mo_rcp45_fps) == len(monthly_projected)

Well we can immediately see that there are more files for RCP 8.5 than RCP 4.5 - we'd expect those two to have the same number of files. Let's take a closer look at the RCP 8.5 data.

In [32]:
mo_mri_rcp85_fps = [x for x in mo_rcp85_fps if "mri" in x.name.lower()]
mo_ncar_rcp85_fps = [x for x in mo_rcp85_fps if "ncar" in x.name.lower()]

In [33]:
len(mo_mri_rcp85_fps)

9033

In [34]:
len(mo_ncar_rcp85_fps)

9128

In [35]:
# ensure we are parsing the model correctly for the RCP 8.5 data
assert len(mo_mri_rcp85_fps) + len(mo_ncar_rcp85_fps) == len(mo_rcp85_fps)

In [36]:
len(mo_ncar_rcp85_fps) - len(mo_mri_rcp85_fps)

95

OK so even within the RCP 8.5 data, it seems like MRI-CGCM3 is missing some data compared to NCAR-CCSM4. 95 is the number of years in the time series, so it is likely that MRI-CGCM3 is missing a variable for a single month (one file per year).

In [37]:
mo_mri_rcp45_fps = [x for x in mo_rcp45_fps if "mri" in x.name.lower()]
mo_ncar_rcp45_fps = [x for x in mo_rcp45_fps if "ncar" in x.name.lower()]

In [38]:
len(mo_mri_rcp45_fps)

8884

In [39]:
len(mo_ncar_rcp45_fps)

9110

In [40]:
# ensure we are parsing the model correctly for the RCP 4.5 data
assert len(mo_mri_rcp45_fps) + len(mo_ncar_rcp45_fps) == len(mo_rcp45_fps)

In [41]:
len(mo_ncar_rcp45_fps) - len(mo_mri_rcp45_fps)

226

Same pattern for the RCP 4.5 data, but different numbers and a different delta between the two. Let's try to break down this data on a monthly basis.

In [42]:
month_dict = {}

for month_num in months:

    month_abbr = mo_names[month_num]

    # format the month numeral with leading zeroes where needed
    month_str = f'{month_num:02}'

    month_dict[month_str] = month_abbr

month_dict

{'01': 'jan',
 '02': 'feb',
 '03': 'mar',
 '04': 'apr',
 '05': 'may',
 '06': 'jun',
 '07': 'jul',
 '08': 'aug',
 '09': 'sep',
 '10': 'oct',
 '11': 'nov',
 '12': 'dec'}

In [43]:
def parse_month_abbr(filename):
    match = re.search(r'jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec', filename, re.IGNORECASE)
    if match:
        return match.group().lower()
    else:
        return None

In [44]:
def parse_month_numeral(file_name):
    # Define a regular expression to match the month numeral
    pattern = r'(?:^|_)(\d{2})(?:_|$)'
    
    # Use the re.search() function to find the month numeral in the file name
    match = re.search(pattern, file_name)
    
    # Return the month numeral if found, or None otherwise
    return match.group(1) if match else None
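The two parsers can also be folded into one function that always returns a two-digit month. The sketch below is a stricter variant rather than a drop-in copy of the functions above: it assumes month tokens are delimited by underscores or hyphens, and it only accepts numerals from `01` to `12`. The name `parse_month` and the first demo filename are hypothetical.

```python
import re

MONTH_ABBRS = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
# a month token must sit between delimiters (or at the start/end of the name)
ABBR_RE = re.compile(r"(?:^|[_-])(" + "|".join(MONTH_ABBRS) + r")(?:[_-]|$)", re.IGNORECASE)
NUM_RE = re.compile(r"(?:^|[_-])(0[1-9]|1[0-2])(?:[_-]|$)")


def parse_month(file_name):
    """Return the month as a two-digit string, or None if none is found."""
    match = ABBR_RE.search(file_name)
    if match:
        month_number = MONTH_ABBRS.index(match.group(1).lower()) + 1
        return f"{month_number:02}"
    match = NUM_RE.search(file_name)
    return match.group(1) if match else None


print(parse_month("example_mmgt_Jan_0.5m_2011.tif"))  # 01
print(parse_month("_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_Surf_08_2505.tif"))  # 08
print(parse_month("gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2006.tif"))  # None
```

Requiring delimiters around the abbreviation avoids accidental matches inside longer words, at the cost of missing any file that embeds the month without a separator.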


In [45]:
file_di = {}

file_di["NCAR-CCSM4 RCP 8.5"] = {}
file_di["MRI-CGCM3 RCP 8.5"] = {}
file_di["NCAR-CCSM4 RCP 4.5"] = {}
file_di["MRI-CGCM3 RCP 4.5"] = {}


file_di["NCAR-CCSM4 RCP 8.5"]["files"] = mo_ncar_rcp85_fps
file_di["MRI-CGCM3 RCP 8.5"]["files"] = mo_mri_rcp85_fps
file_di["NCAR-CCSM4 RCP 4.5"]["files"] = mo_ncar_rcp45_fps
file_di["MRI-CGCM3 RCP 4.5"]["files"] = mo_mri_rcp45_fps

for model_scenario in file_di.keys():
    file_di[model_scenario]["no_month_parsed"] = []
    for month in mo_names[1:]:
        file_di[model_scenario][f"{month}_files"] = []

In [46]:
for model_scenario in file_di.keys():
    for fp in file_di[model_scenario]["files"]:
        mo_abbr = parse_month_abbr(fp.name)
        if mo_abbr is not None:
            file_di[model_scenario][f"{mo_abbr}_files"].append(fp)
        else:
            mo_num = parse_month_numeral(fp.name)
            if mo_num is not None:
                file_di[model_scenario][f"{month_dict[mo_num]}_files"].append(fp)
            else:
                file_di[model_scenario]["no_month_parsed"].append(fp)

In [48]:
for model_scenario in file_di.keys():
    print([x.name for x in file_di[model_scenario]["no_month_parsed"]])

[]
['gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2006.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2007.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2008.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2009.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2010.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2011.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2012.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2013.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2014.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2015.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2016.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2017.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2018.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2019.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2020.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2021.tif', 'gipl2_IEM_Domain_1km_MRI-CGCM3-RCP8

So these files are a mystery - there is no month information (and these were originally zipped by variable, not by month), and the variable naming scheme is conflicting as well, with both `mmgt` and `pfTop` in the filename. We can take a peek at one file at least.

In [51]:
# call gdalinfo and capture the output
gdalinfo_output = subprocess.check_output(["gdalinfo", file_di["MRI-CGCM3 RCP 8.5"]["no_month_parsed"][0]])
print(gdalinfo_output.decode())

Driver: GTiff/GeoTIFF
Files: /atlas_scratch/cparr4/GIPL_IEM/extracted_monthly/gipl2_IEM_Domain_1km_MRI-CGCM3-RCP85_mmgt_pfTop_2006.tif
Size is 2560, 1861
Coordinate System is:
PROJCRS["NAD_1983_Albers",
    BASEGEOGCRS["NAD83",
        DATUM["North American Datum 1983",
            ELLIPSOID["GRS 1980",6378137,298.257222101004,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4269]],
    CONVERSION["Albers Equal Area",
        METHOD["Albers Equal Area",
            ID["EPSG",9822]],
        PARAMETER["Latitude of false origin",50,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8821]],
        PARAMETER["Longitude of false origin",-154,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8822]],
        PARAMETER["Latitude of 1st standard parallel",55,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAM

Still no clues as to what month these data represent, but judging from the statistics fields this is the `pfTop` variable because the minimum value is zero, and not any mean monthly ground temperature (`mmgt`). 

In [57]:
for model_scenario in file_di.keys():
    file_di[model_scenario]["no_month_count"] = len(file_di[model_scenario]["no_month_parsed"])
    for month in mo_names[1:]:
        file_di[model_scenario][f"{month}_count"] = len(file_di[model_scenario][f"{month}_files"])

In [59]:
df = pd.DataFrame.from_dict(file_di)[-13:]
df

| | NCAR-CCSM4 RCP 8.5 | MRI-CGCM3 RCP 8.5 | NCAR-CCSM4 RCP 4.5 | MRI-CGCM3 RCP 4.5 |
|---|---|---|---|---|
| jan_count | 760 | 760 | 758 | 95 |
| feb_count | 760 | 760 | 760 | 760 |
| mar_count | 760 | 761 | 760 | 760 |
| apr_count | 760 | 760 | 759 | 760 |
| may_count | 760 | 760 | 759 | 760 |
| jun_count | 760 | 760 | 757 | 760 |
| jul_count | 760 | 760 | 760 | 760 |
| aug_count | 760 | 760 | 760 | 1177 |
| sep_count | 760 | 760 | 760 | 760 |
| oct_count | 768 | 665 | 759 | 772 |


Huh, well I think we just need to go through these and find out why we have such differences across the months and models and scenarios. It seems like the most consistent "standard" here is 760 files per month per model-scenario. So, we'd expect 8 variables for each month for each year, for the 95 years in the time series.

In [60]:
760 / 8

95.0

We'll create a reference based on this so we can take a closer look at each case that does not match this standard.
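As a first pass, the benchmark can be applied directly to the monthly counts to flag which months deviate and by how much. A sketch, assuming 8 variables and the 95 projected years; the counts are copied from the MRI-CGCM3 RCP 4.5 column of the table above (only the rows shown there), and `flag_months` is a hypothetical helper.

```python
N_VARIABLES = 8
N_YEARS = len(range(2006, 2101))  # 95
BENCHMARK = N_VARIABLES * N_YEARS  # 760


def flag_months(counts, benchmark=BENCHMARK):
    """Return {month: difference from benchmark} for months that deviate."""
    return {mo: n - benchmark for mo, n in counts.items() if n != benchmark}


mri_rcp45 = {"jan": 95, "feb": 760, "mar": 760, "apr": 760, "may": 760,
             "jun": 760, "jul": 760, "aug": 1177, "sep": 760, "oct": 772}
print(flag_months(mri_rcp45))
# {'jan': -665, 'aug': 417, 'oct': 12}
```

A negative difference points to missing variables or years, and a positive one to extra years or duplicate files.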

In [61]:
for mo in mo_names[1:]:
    mo_years = [x.name.split("_")[-1][:-4] for x in file_di["NCAR-CCSM4 RCP 8.5"][f"{mo}_files"]]
    mo_year_count = np.unique(mo_years, return_counts=True)
    if mo_year_count[0].size != 95:
        print(mo, mo_year_count)

oct (array(['2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012',
       '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020',
       '2021', '2022', '2023', '2024', '2025', '2026', '2027', '2028',
       '2029', '2030', '2031', '2032', '2033', '2034', '2035', '2036',
       '2037', '2038', '2039', '2040', '2041', '2042', '2043', '2044',
       '2045', '2046', '2047', '2048', '2049', '2050', '2051', '2052',
       '2053', '2054', '2055', '2056', '2057', '2058', '2059', '2060',
       '2061', '2062', '2063', '2064', '2065', '2066', '2067', '2068',
       '2069', '2070', '2071', '2072', '2073', '2074', '2075', '2076',
       '2077', '2078', '2079', '2080', '2081', '2082', '2083', '2084',
       '2085', '2086', '2087', '2088', '2089', '2090', '2091', '2092',
       '2093', '2094', '2095', '2096', '2097', '2098', '2099', '2100'],
      dtype='<U4'), array([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 

So the NCAR-CCSM4 RCP 8.5 data look OK, except the month of October has data for 2005. We'll note that for sure, but otherwise it could just be omitted when building a coverage. We can see the MRI-CGCM3 RCP 8.5 data is a little less consistent, so we'll look at that next.

In [62]:
for mo in mo_names[1:]:
    mo_years = [x.name.split("_")[-1][:-4] for x in file_di["MRI-CGCM3 RCP 8.5"][f"{mo}_files"]]
    mo_year_count = np.unique(mo_years, return_counts=True)
    if mo_year_count[0].size != 95:
        print(mo, mo_year_count)

mar (array(['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
       '2022', '2023', '2024', '2025', '2026', '2027', '2028', '2029',
       '2030', '2031', '2032', '2033', '2034', '2035', '2036', '2037',
       '2038', '2039', '2040', '2041', '2042', '2043', '2044', '2045',
       '2046', '2047', '2048', '2049', '2050', '2051', '2052', '2053',
       '2054', '2055', '2056', '2057', '2058', '2059', '2060', '2061',
       '2062', '2063', '2064', '2065', '2066', '2067', '2068', '2069',
       '2070', '2071', '2072', '2073', '2074', '2075', '2076', '2077',
       '2078', '2079', '2080', '2081', '2082', '2083', '2084', '2085',
       '2086', '2087', '2088', '2089', '2090', '2091', '2092', '2093',
       '2094', '2095', '2096', '2097', '2098', '2099', '2100', '2101'],
      dtype='<U4'), array([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 

In [63]:
[x.name for x in file_di["MRI-CGCM3 RCP 8.5"]["dec_files"] if 'pftop' in x.name.lower()]

[]

MRI-CGCM3 RCP 8.5 March has 1 variable for 2101 - we could just omit that. December is missing the permafrost top variable for all years, and also has 7 variables for 2005. What about October?

In [64]:
[x.name for x in file_di["MRI-CGCM3 RCP 8.5"]["oct_files"] if 'pftop' in x.name.lower()]

[]

In [65]:
760 - file_di["MRI-CGCM3 RCP 8.5"]["oct_count"]

95

MRI-CGCM3 RCP 8.5 October is missing `pfTop` for all years (2006-2100). Next we'll look at NCAR-CCSM4 RCP 4.5. We can see from the table that all the anomalous monthly file counts are less than the 760 benchmark, so maybe these are missing `pfTop` or missing years as well.

In [66]:
for mo in mo_names[1:]:
    mo_years = [x.name.split("_")[-1][:-4] for x in file_di["NCAR-CCSM4 RCP 4.5"][f"{mo}_files"]]
    mo_year_count = np.unique(mo_years, return_counts=True)
    if mo_year_count[0].size != 95:
        print("Strange number of years:")
        print(mo, mo_year_count)
    if mo_year_count[1].min() != 8:
        print(mo)

jan
apr
may
jun
oct
nov
dec


So the year sequence is fine (2006-2100), but the above months are missing variables.

In [67]:
for mo in mo_names[1:]:
    mo_years = [x.name.split("_")[-1][:-4] for x in file_di["NCAR-CCSM4 RCP 4.5"][f"{mo}_files"]]
    mo_year_count = np.unique(mo_years, return_counts=True)
    for yr, ct in zip(list(mo_year_count[0]), list(mo_year_count[1])):
        if ct != 8:
            print(f"\n{mo} is missing {8 - ct} variables for {yr}:")
            print(f"list of current {mo} {yr} files:")
            # arbitrary clipping to make this output more legible
            print([x.name[42:-8] for x in file_di["NCAR-CCSM4 RCP 4.5"][f"{mo}_files"] if yr in x.name])


jan is missing 1 variables for 2011:
list of current jan 2011 files:
['nad83_mmgt_Jan_0.5m_', 'nad83_mmgt_Jan_1m_', 'nad83_mmgt_Jan_2m_', 'nad83_mmgt_Jan_3m_', 'nad83_mmgt_Jan_5m_', 'nad83_mmgt_Jan_pfTop_', 'nad83_mmgt_Jan_Surf_']

jan is missing 1 variables for 2042:
list of current jan 2042 files:
['nad83_mmgt_Jan_1m_', 'nad83_mmgt_Jan_2m_', 'nad83_mmgt_Jan_3m_', 'nad83_mmgt_Jan_4m_', 'nad83_mmgt_Jan_5m_', 'nad83_mmgt_Jan_pfTop_', 'nad83_mmgt_Jan_Surf_']

apr is missing 1 variables for 2044:
list of current apr 2044 files:
['nad83_mmgt_Apr_0.5m_', 'nad83_mmgt_Apr_1m_', 'nad83_mmgt_Apr_2m_', 'nad83_mmgt_Apr_3m_', 'nad83_mmgt_Apr_4m_', 'nad83_mmgt_Apr_5m_', 'nad83_mmgt_Apr_pfTop_']

may is missing 1 variables for 2006:
list of current may 2006 files:
['nad83_mmgt_May_1m_', 'nad83_mmgt_May_0.5m_', 'nad83_mmgt_May_2m_', 'nad83_mmgt_May_3m_', 'nad83_mmgt_May_4m_', 'nad83_mmgt_May_pfTop_', 'nad83_mmgt_May_Surf_']

jun is missing 1 variables for 2038:
list of current jun 2038 files:
['nad8

Alright, so for NCAR-CCSM4 RCP 4.5:
 - January 2011 is missing MMGT 4 m.
 - January 2042 is missing MMGT 0.5 m.
 - April 2044 is missing MMGT Surface.
 - May 2006 is missing MMGT 5 m.
 - June 2038 is missing MMGT Surface.
 - June 2072 is missing MMGT 3 m.
 - June 2094 is missing MMGT 2 m.
 - October 2064 is missing MMGT 2 m.
 - November 2060 is missing MMGT 4 m.
 - December 2006 is missing MMGT Surface.
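Rather than reading the gaps off each listing by eye, the missing variable can be named with a set difference against the expected monthly variable tokens. A sketch; `EXPECTED_VARS` is assembled from the tokens seen in the monthly filenames in this notebook, and `missing_variables` is a hypothetical helper.

```python
EXPECTED_VARS = {"0.5m", "1m", "2m", "3m", "4m", "5m", "pfTop", "Surf"}


def missing_variables(found, expected=EXPECTED_VARS):
    """Return the expected variable tokens that are absent from `found`."""
    return sorted(set(expected) - set(found))


# tokens read off the January 2011 listing above
jan_2011 = ["0.5m", "1m", "2m", "3m", "5m", "pfTop", "Surf"]
print(missing_variables(jan_2011))  # ['4m']
```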
 
The final case, MRI-CGCM3 RCP 4.5, has the most variance. January only has 95 files, and two months have more files than the 760 benchmark.

In [68]:
for mo in mo_names[1:]:
    mo_years = [x.name.split("_")[-1][:-4] for x in file_di["MRI-CGCM3 RCP 4.5"][f"{mo}_files"]]
    mo_year_count = np.unique(mo_years, return_counts=True)
    if mo_year_count[0].size != 95:
        print(mo, mo_year_count)

aug (array(['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
       '2022', '2023', '2024', '2025', '2026', '2027', '2028', '2029',
       '2030', '2031', '2032', '2033', '2034', '2035', '2036', '2037',
       '2038', '2039', '2040', '2041', '2042', '2043', '2044', '2045',
       '2046', '2047', '2048', '2049', '2050', '2051', '2052', '2053',
       '2054', '2055', '2056', '2057', '2058', '2059', '2060', '2061',
       '2062', '2063', '2064', '2065', '2066', '2067', '2068', '2069',
       '2070', '2071', '2072', '2073', '2074', '2075', '2076', '2077',
       '2078', '2079', '2080', '2081', '2082', '2083', '2084', '2085',
       '2086', '2087', '2088', '2089', '2090', '2091', '2092', '2093',
       '2094', '2095', '2096', '2097', '2098', '2099', '2100', '2101',
       '2102', '2103', '2104', '2105', '2106', '2107', '2108', '2109',
       '2110', '2111', '2112', '2113', '2114', '2115', '2116', '2117',
 

Wow, so October has two files for some bonus years of 2101 through 2106! And August just has loads of bonus years...between 2101 and...2505!

In [69]:
[x.name for x in file_di["MRI-CGCM3 RCP 4.5"]["aug_files"] if "2505" in x.name]

['_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_Surf_08_2505.tif']

In [70]:
[x.name for x in file_di["MRI-CGCM3 RCP 4.5"]["aug_files"] if "2101" in x.name]

['_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_08_2101.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_1m_08_2101.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_Surf_08_2101.tif']

In [71]:
[x.name for x in file_di["MRI-CGCM3 RCP 4.5"]["aug_files"] if "2100" in x.name]

['_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_08_2100.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_1m_08_2100.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_2m_08_2100.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_3m_08_2100.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_4m_08_2100.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_5m_08_2100.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_pfTop_08_2100.tif',
 '_08_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_Surf_08_2100.tif']

In [72]:
[x.name for x in file_di["MRI-CGCM3 RCP 4.5"]["oct_files"] if "2100" in x.name]

['_10_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_10_2100.tif',
 '_10_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_1m_10_2100.tif',
 '_10_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_2m_10_2100.tif',
 '_10_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_3m_10_2100.tif',
 '_10_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_4m_10_2100.tif',
 '_10_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_5m_10_2100.tif',
 '_10_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_pfTop_10_2100.tif',
 '_10_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_Surf_10_2100.tif']

In [73]:
[x.name for x in file_di["MRI-CGCM3 RCP 4.5"]["jan_files"]]

['_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2006.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2007.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2008.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2009.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2010.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2011.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2012.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2013.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2014.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2015.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2016.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2017.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mmgt_0.5m_01_2018.tif',
 '_01_gipl2_MRI-CGCM3-RCP45_IEM_Domain_1km_nad83_mm

In [75]:
# quick check to make sure no files fell through the cracks here
assert df["MRI-CGCM3 RCP 4.5"].sum() == len(mo_mri_rcp45_fps)
assert df["NCAR-CCSM4 RCP 4.5"].sum() == len(mo_ncar_rcp45_fps)
assert df["NCAR-CCSM4 RCP 8.5"].sum() == len(mo_ncar_rcp85_fps)
assert df["MRI-CGCM3 RCP 8.5"].sum() == len(mo_mri_rcp85_fps)

### Monthly Inconsistencies

 - NCAR-CCSM4 RCP 8.5 October has 8 variables for 2005.
 - MRI-CGCM3 RCP 8.5 March has 1 variable for 2101.
 - MRI-CGCM3 RCP 8.5 December has 7 variables for 2005, and 7 variables for all other years as well (missing `pfTop`).
 - MRI-CGCM3 RCP 8.5 has 95 files with no month information and conflicting (`mmgt` and `pfTop`) variables encoded in the filename.
 - MRI-CGCM3 RCP 8.5 October is missing `pfTop` for all years (2006-2100).
     - It seems likely that the 95 files missing clear month and variable names are these missing data!
 - NCAR-CCSM4 RCP 4.5 January 2011 is missing MMGT 4 m.
 - NCAR-CCSM4 RCP 4.5 January 2042 is missing MMGT 0.5 m.
 - NCAR-CCSM4 RCP 4.5 April 2044 is missing MMGT Surface.
 - NCAR-CCSM4 RCP 4.5 May 2006 is missing MMGT 5 m.
 - NCAR-CCSM4 RCP 4.5 June 2038 is missing MMGT Surface.
 - NCAR-CCSM4 RCP 4.5 June 2072 is missing MMGT 3 m.
 - NCAR-CCSM4 RCP 4.5 June 2094 is missing MMGT 2 m.
 - NCAR-CCSM4 RCP 4.5 October 2064 is missing MMGT 2 m.
 - NCAR-CCSM4 RCP 4.5 November 2060 is missing MMGT 4 m.
 - NCAR-CCSM4 RCP 4.5 December 2006 is missing MMGT Surface.
 - MRI-CGCM3 RCP 4.5 October has 2 variables for years 2101 through 2106.
 - MRI-CGCM3 RCP 4.5 August has 3 variables for years 2101 through 2106, and then one variable for many other far-future years up until year 2505.
 - MRI-CGCM3 RCP 4.5 January is missing all variables except MMGT 0.5 m.
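As a sanity check, the MRI-CGCM3 RCP 4.5 August and October totals in the count table can be reconciled with the explanations listed above. The arithmetic below assumes the pattern exactly as summarized: the full 760-file benchmark, plus the stated extra files for 2101-2106, plus one August file per year from 2107 through 2505.

```python
BENCHMARK = 8 * 95  # 8 variables x 95 years (2006-2100)

# October: 2 extra files for each of 2101-2106
october = BENCHMARK + 2 * len(range(2101, 2107))
# August: 3 extra files for each of 2101-2106, then 1 file per year for 2107-2505
august = BENCHMARK + 3 * len(range(2101, 2107)) + len(range(2107, 2506))

print(october, august)  # 772 1177
```

Both values match the table, which suggests the extra years account for the whole surplus in those two months.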