# WRF 10/21 cleanup audit

We need to free up space on `/rcs/`, which is used for post-processing the 20km WRF data. We want to remove redundant data, but it is not clear which files are actually redundant. This notebook audits all data in `/rcs/`.

Note: since these files will eventually be removed and reorganized, the code in this notebook will likely go stale. It simply serves as a record of the state of things prior to the 10/21 cleanup.

## Data folders

There are two folders in `/rcs/project_data/` that hold the bulk of the data: `wrf_data/` and `WRF_DATA_SEP2019/`.

### `WRF_DATA_SEP2019/`

This folder is definitely incomplete. It seems to have all variables, but not all year/model combinations. This was seen while investigating the [wind issue](https://github.com/ua-snap/wrf_utils/blob/287e607684d10290eeee5b3ba4e2ea4e76577f1d/snap_wrf_data_prep/wind-issue/repair_winds_issue.ipynb). 

For most (potentially all) variables, there seem to be ~35 years of CCSM RCP 8.5, 70 years of GFDL RCP 8.5, sometimes 1 historical year from each GCM, and usually no ERA-Interim. Here are the files present for a few variables:

In [55]:
ls /rcs/project_data/WRF_DATA_SEP2019/acsnow

ACSNOW_wrf_hourly_ccsm_hist_1970.nc   ACSNOW_wrf_hourly_gfdl_rcp85_2023.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2005.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2024.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2006.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2025.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2007.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2026.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2008.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2027.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2009.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2028.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2010.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2029.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2011.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2030.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2012.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2031.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2013.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2032.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2014.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2033.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2015.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2034.nc
ACSNOW_wrf_hourly_ccsm_rcp85_2016.nc  ACSNOW_wrf_hourly_gfdl_rcp85_2035.nc
ACSNOW_wrf_hourly_ccsm_rc

In [58]:
ls /rcs/project_data/WRF_DATA_SEP2019/t

T_wrf_hourly_ccsm_hist_1970.nc   T_wrf_hourly_gfdl_rcp85_2023.nc
T_wrf_hourly_ccsm_rcp85_2005.nc  T_wrf_hourly_gfdl_rcp85_2024.nc
T_wrf_hourly_ccsm_rcp85_2006.nc  T_wrf_hourly_gfdl_rcp85_2025.nc
T_wrf_hourly_ccsm_rcp85_2007.nc  T_wrf_hourly_gfdl_rcp85_2026.nc
T_wrf_hourly_ccsm_rcp85_2008.nc  T_wrf_hourly_gfdl_rcp85_2027.nc
T_wrf_hourly_ccsm_rcp85_2009.nc  T_wrf_hourly_gfdl_rcp85_2028.nc
T_wrf_hourly_ccsm_rcp85_2010.nc  T_wrf_hourly_gfdl_rcp85_2029.nc
T_wrf_hourly_ccsm_rcp85_2011.nc  T_wrf_hourly_gfdl_rcp85_2030.nc
T_wrf_hourly_ccsm_rcp85_2012.nc  T_wrf_hourly_gfdl_rcp85_2031.nc
T_wrf_hourly_ccsm_rcp85_2013.nc  T_wrf_hourly_gfdl_rcp85_2032.nc
T_wrf_hourly_ccsm_rcp85_2014.nc  T_wrf_hourly_gfdl_rcp85_2033.nc
T_wrf_hourly_ccsm_rcp85_2015.nc  T_wrf_hourly_gfdl_rcp85_2034.nc
T_wrf_hourly_ccsm_rcp85_2016.nc  T_wrf_hourly_gfdl_rcp85_2035.nc
T_wrf_hourly_ccsm_rcp85_2017.nc  T_wrf_hourly_gfdl_rcp85_2036.nc
T_wrf_hourly_ccsm_rcp85_2018.nc  T_wrf_hourly_gfdl_rcp85_2037.nc
T_wrf_hourly_ccsm_rcp85_2

In [60]:
ls /rcs/project_data/WRF_DATA_SEP2019/pcpt

PCPT_wrf_hourly_ccsm_hist_1970.nc   PCPT_wrf_hourly_gfdl_rcp85_2023.nc
PCPT_wrf_hourly_ccsm_rcp85_2005.nc  PCPT_wrf_hourly_gfdl_rcp85_2024.nc
PCPT_wrf_hourly_ccsm_rcp85_2006.nc  PCPT_wrf_hourly_gfdl_rcp85_2025.nc
PCPT_wrf_hourly_ccsm_rcp85_2007.nc  PCPT_wrf_hourly_gfdl_rcp85_2026.nc
PCPT_wrf_hourly_ccsm_rcp85_2008.nc  PCPT_wrf_hourly_gfdl_rcp85_2027.nc
PCPT_wrf_hourly_ccsm_rcp85_2009.nc  PCPT_wrf_hourly_gfdl_rcp85_2028.nc
PCPT_wrf_hourly_ccsm_rcp85_2010.nc  PCPT_wrf_hourly_gfdl_rcp85_2029.nc
PCPT_wrf_hourly_ccsm_rcp85_2011.nc  PCPT_wrf_hourly_gfdl_rcp85_2030.nc
PCPT_wrf_hourly_ccsm_rcp85_2012.nc  PCPT_wrf_hourly_gfdl_rcp85_2031.nc
PCPT_wrf_hourly_ccsm_rcp85_2013.nc  PCPT_wrf_hourly_gfdl_rcp85_2032.nc
PCPT_wrf_hourly_ccsm_rcp85_2014.nc  PCPT_wrf_hourly_gfdl_rcp85_2033.nc
PCPT_wrf_hourly_ccsm_rcp85_2015.nc  PCPT_wrf_hourly_gfdl_rcp85_2034.nc
PCPT_wrf_hourly_ccsm_rcp85_2016.nc  PCPT_wrf_hourly_gfdl_rcp85_2035.nc
PCPT_wrf_hourly_ccsm_rcp85_2017.nc  PCPT_wrf_hourly_gfdl_rcp85_2036.nc
PCPT_w
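
The year counts quoted above can be tallied from the filenames rather than read off the listings. The following is a minimal sketch, assuming every file follows the `VAR_wrf_hourly_<model>_<scenario>_<year>.nc` pattern seen in the listings; the helper name `summarize_years` and the regular expression are illustrative and not part of the existing codebase.

```python
import re
from collections import defaultdict

# Pattern assumed from the listings above: VAR_wrf_hourly_<model>_<scenario>_<year>.nc
FNAME_RE = re.compile(
    r"^(?P<var>[A-Za-z0-9]+)_wrf_hourly_"
    r"(?P<model>[a-z]+)_(?P<scenario>[a-z0-9]+)_(?P<year>\d{4})\.nc$"
)


def summarize_years(filenames):
    """Group filenames by (model, scenario) and report year count and range."""
    years = defaultdict(set)
    for name in filenames:
        m = FNAME_RE.match(name)
        if m is None:
            continue  # skip anything that does not follow the assumed pattern
        years[(m["model"], m["scenario"])].add(int(m["year"]))
    return {
        key: {"n_years": len(yrs), "first": min(yrs), "last": max(yrs)}
        for key, yrs in sorted(years.items())
    }


# A few names copied from the acsnow listing above
sample = [
    "ACSNOW_wrf_hourly_ccsm_hist_1970.nc",
    "ACSNOW_wrf_hourly_ccsm_rcp85_2005.nc",
    "ACSNOW_wrf_hourly_ccsm_rcp85_2006.nc",
    "ACSNOW_wrf_hourly_gfdl_rcp85_2023.nc",
]
summary = summarize_years(sample)
for key, info in summary.items():
    print(key, info)
```

To run this against a real variable folder, the list of names could come from something like `[p.name for p in Path("/rcs/project_data/WRF_DATA_SEP2019/acsnow").glob("*.nc")]`.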

#### `hourly` / `hourly_fix`?

Another point of confusion is whether these data correspond to the restacked but un-"fixed" data or to the final post-processed product. According to the wind-repair notebook linked above, these data do match data restacked with the most recent version of the restacking code. They should therefore correspond to the data in `/rcs/project_data/wrf_data/hourly/`, provided *those* data were also restacked using the most recent code. Here is a check of whether these data actually match the data in `/rcs/project_data/wrf_data/hourly/` for a variety of variable/year/model combinations:

In [109]:
import xarray as xr
import numpy as np
import itertools
from pathlib import Path

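def compare_restacked_data(restacked_fp, original_fp, var, year, month=None, day=None):
    """Compare the 00:00 slice of var on one day between two files."""
    # np.random.randint excludes the upper bound, so use 13 and 29
    # to sample months 1-12 and days 1-28
    if month is None:
        month = str(np.random.randint(1, 13)).zfill(2)
    if day is None:
        day = str(np.random.randint(1, 29)).zfill(2)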
        
    with xr.open_dataset(restacked_fp) as ds:
        restacked_arr = ds[var].sel(time=f"{year}-{month}-{day}T00:00:00").values

    with xr.open_dataset(original_fp) as ds:
        original_arr = ds[var].sel(time=f"{year}-{month}-{day}T00:00:00").values

    print("match:", np.all(original_arr == restacked_arr))


original_dir = Path("/rcs/project_data/wrf_data/hourly")
sep19_dir = Path("/rcs/project_data/WRF_DATA_SEP2019")

# some random variables (no wind variables, on purpose: this
# directory does not have them because of the wind issue)
variables = ["acsnow", "pcpt", "t2", "lwupb", "snow", "ght"]
models = ["ccsm", "gfdl"]
year = 2012

for var, model in itertools.product(variables, models):
    og_fp = list(original_dir.joinpath(var).glob(f"*{model}*{year}*"))[0]
    s19_fp = list(sep19_dir.joinpath(var).glob(f"*{model}*{year}*"))[0]
    print(f"Variable: {var}; Model: {model}", end="......")
    print("size:", s19_fp.stat().st_size == og_fp.stat().st_size, end="...")
    compare_restacked_data(s19_fp, og_fp, var.upper(), year)

Variable: acsnow; Model: ccsm......match: True
Variable: acsnow; Model: gfdl......match: True
Variable: pcpt; Model: ccsm......match: True
Variable: pcpt; Model: gfdl......match: True
Variable: t2; Model: ccsm......match: True
Variable: t2; Model: gfdl......match: True
Variable: lwupb; Model: ccsm......match: True
Variable: lwupb; Model: gfdl......match: True
Variable: snow; Model: ccsm......match: True
Variable: snow; Model: gfdl......match: True
Variable: ght; Model: ccsm......match: True
Variable: ght; Model: gfdl......match: True


This is somewhat unexpected, because according to the WRF wind repair efforts, the wind data in this folder were different from what was in `wrf_data/hourly`. This can't be re-verified directly because the non-matching files were assumed bad and cleaned out of there. But some were saved in `/rcs/project_data/wrf_data/wind-issue/incorrect_samples/hourly`, so compare with those to make sure:

In [145]:
bad_dir = Path("/rcs/project_data/wrf_data/wind-issue/incorrect_samples/hourly")

# wind variables only, to compare against the saved
# incorrect samples from the wind issue
variables = ["U10", "UBOT", "V10", "VBOT"]
models = ["ccsm", "gfdl"]
year = 2006

for var, model in itertools.product(variables, models):
    bad_fp = list(bad_dir.glob(f"*{var.upper()}*{model}*{year}*"))[0]
    s19_fp = list(sep19_dir.joinpath(var).glob(f"*{var.upper()}*{model}*{year}*"))[0]
    print(f"Variable: {var}; Model: {model}", end="......")
    print("size:", s19_fp.stat().st_size == bad_fp.stat().st_size, end="...")
    compare_restacked_data(s19_fp, bad_fp, var.upper(), year)

Variable: U10; Model: ccsm......size: False...match: False
Variable: U10; Model: gfdl......size: False...match: False
Variable: UBOT; Model: ccsm......size: False...match: False
Variable: UBOT; Model: gfdl......size: False...match: False
Variable: V10; Model: ccsm......size: False...match: False
Variable: V10; Model: gfdl......size: False...match: False
Variable: VBOT; Model: ccsm......size: False...match: False
Variable: VBOT; Model: gfdl......size: False...match: False
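
A caveat on reading `match: False`: the elementwise `==` used in `compare_restacked_data` treats NaN as unequal to itself, so two identical arrays containing NaN would also report `False`. Whether these files contain NaN at all is not checked here. A minimal sketch of a NaN-tolerant comparison follows, using `np.array_equal` with `equal_nan=True` (available in NumPy 1.19 and later); the helper name `arrays_match` is hypothetical.

```python
import numpy as np


def arrays_match(a, b):
    """Return True if a and b have the same shape and values, treating NaN as equal to NaN."""
    return bool(np.array_equal(a, b, equal_nan=True))


# Two identical arrays that happen to contain a NaN
a = np.array([[1.0, np.nan], [3.0, 4.0]])
b = a.copy()

print("elementwise ==:", bool(np.all(a == b)))  # NaN != NaN under elementwise comparison
print("NaN-aware:", arrays_match(a, b))
```

Since the file sizes differ as well in the output above, NaN handling is probably not the explanation for these mismatches, but the distinction matters for any file pair whose sizes agree and values appear not to.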


Okay, so it looks like the wind variables don't match, but other variables potentially do.

This means we should verify that all non-wind variables in `/rcs/project_data/wrf_data/hourly/` match what's in `WRF_DATA_SEP2019`. We will do so by checking, for every file, that the file sizes match and that the values in a random time slice agree:

In [342]:
from multiprocessing import Pool
import time

def compare_files(fp, seed, search_dir=original_dir):
    """Lookup a file with same name as fp in 
    search_dir and compare sizes, and whether
    file values for a random time slice are the same.
    """
    test_fp = Path(search_dir.joinpath(f"{fp.parent.name}/{fp.name}"))
    var = fp.parent.name
    year = fp.name.split("_")[-1].split(".")[0]
    local_state = np.random.RandomState(seed)
    # randint excludes the upper bound: 13 and 29 give months 1-12, days 1-28
    month = str(local_state.randint(1, 13)).zfill(2)
    day = str(local_state.randint(1, 29)).zfill(2)
    
    if test_fp.exists():
        size_result = test_fp.stat().st_size == fp.stat().st_size
        try:
            with xr.open_dataset(fp) as ds1:
                arr1 = ds1[var.upper()].sel(time=f"{year}-{month}-{day}T00:00:00", x=slice(100, 150), y=slice(100, 150)).values
            bad_fp = False
        except OSError:
            bad_fp = fp
        except KeyError:
            bad_fp = f"{fp} TYPE 2"
            
        try:
            with xr.open_dataset(test_fp) as ds2:
                arr2 = ds2[var.upper()].sel(time=f"{year}-{month}-{day}T00:00:00", x=slice(100, 150), y=slice(100, 150)).values
            bad_test_fp = False
        except OSError:
            bad_test_fp = test_fp
        except KeyError:
            bad_test_fp = f"{test_fp} TYPE 2"
        
        if not bad_fp and not bad_test_fp:
            arr_result = np.all(np.isclose(arr1, arr2))
        else:
            arr_result = f"bad fp: {bad_fp}. Bad test fp: {bad_test_fp}"
    else:
        size_result = None
        arr_result = None
    
    return {
        "fp": fp.name, 
        "variable": fp.parent.name, 
        "year": year, 
        "month": month, 
        "day": day, 
        "size_result": size_result, 
        "arr_result": arr_result
    }
    

# get all files in WRF_DATA_SEP2019
s19_fps = list(sep19_dir.glob("*/*.nc"))

args = list(zip(s19_fps, np.arange(len(s19_fps))))

tic = time.perf_counter()

# commented out to resume from 2575
#results = []
for arg in args[2575:]:
    results.append(compare_files(*arg))

print(f"Done, {round((time.perf_counter() - tic) / 60, 1)}m")

Done, 64.7m
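
Once the loop finishes, `results` is a list of dicts with the keys returned by `compare_files`. One way to get an overview is to tally the outcomes. The sketch below assumes that structure; the helper name `summarize_results`, the category labels, and the example records are made up for illustration and are not actual audit results.

```python
from collections import Counter


def summarize_results(results):
    """Tally comparison outcomes from a list of compare_files-style dicts."""
    counts = Counter()
    for r in results:
        if r["size_result"] is None and r["arr_result"] is None:
            # compare_files found no file of the same name in search_dir
            counts["missing in search_dir"] += 1
        elif isinstance(r["arr_result"], str):
            # compare_files stores a message string when a file could not be read
            counts["unreadable file or variable"] += 1
        elif r["size_result"] and r["arr_result"]:
            counts["size and values match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# Synthetic records for illustration only (not real audit output)
example_results = [
    {"fp": "example_1.nc", "size_result": True, "arr_result": True},
    {"fp": "example_2.nc", "size_result": False, "arr_result": False},
    {"fp": "example_3.nc", "size_result": None, "arr_result": None},
    {"fp": "example_4.nc", "size_result": True, "arr_result": "bad fp: example_4.nc. Bad test fp: False"},
]
counts = summarize_results(example_results)
for label, n in counts.items():
    print(f"{label}: {n}")
```

Files counted under "size and values match" would be the candidates for redundancy; anything else needs a closer look before removal.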
