# Profiling for dates associated with Nome storms

This notebook is for visualizing and analyzing the results from profiling the analog forecast method using dates on which Nome, AK was hit by a storm (provided by collaborators).


The dates provided will be referred to as the "dates of interest." We applied the skill profiling framework to the third and fifth days preceding the dates of interest.

In [1]:
import os
from pathlib import Path
import pandas as pd
import xarray as xr
import numpy as np
# local
import luts
from config import data_dir
import analog_forecast as af

Concatenate the result files into two tables, one for the analog forecasts and one for the naive forecasts:

In [2]:
analog_df = pd.concat([pd.read_csv(fp) for fp in Path("results").glob("*.csv") if "naive" not in fp.name])
naive_df = pd.concat([pd.read_csv(fp) for fp in Path("results").glob("*naive.csv")])

## QC

First, some validation to make sure the results are what we expect. Let's take the first row of the analog results, and later the last row for a different variable, and manually check each step of the algorithm.

In [3]:
row = analog_df.iloc[0]

Inspect the row, then load the ERA5 anomaly data that we will use to search for analogs:

In [4]:
row

variable                      t2m
spatial_domain             alaska
anomaly_search               True
reference_date         2004-10-08
forecast_day_number             1
forecast_error              4.851
Name: 0, dtype: object

In [5]:
varname = row["variable"]
ref_date = row["reference_date"]
ds = xr.load_dataset(data_dir.joinpath(luts.varnames_lu[varname]["anom_filename"]))

Subset to the spatial domain:

In [6]:
spatial_domain = row["spatial_domain"]
bbox = luts.spatial_domains[spatial_domain]["bbox"]
sub_da = ds[varname].sel(latitude=slice(bbox[3], bbox[1]), longitude=slice(bbox[0], bbox[2]))
print("Original shape:", ds[varname].shape)
print("Shape after spatial subset:", sub_da.shape)

Original shape: (23011, 361, 1440)
Shape after spatial subset: (23011, 129, 221)
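
The latitude slice above runs from `bbox[3]` to `bbox[1]`, which suggests that latitude is stored north-to-south (as is common for ERA5 files) and that the bbox is ordered `(west, south, east, north)`. The sketch below illustrates the same kind of selection on a toy 1-degree grid using plain NumPy masks. The grid, the bbox order, and the bbox values are assumptions for illustration only, not the contents of `luts.spatial_domains`.

```python
import numpy as np

# Toy 1-degree grid (assumed); latitude is stored north-to-south
lat = np.arange(90, -1, -1)   # 90, 89, ..., 0
lon = np.arange(0, 360)       # 0, 1, ..., 359

# Hypothetical bounding box in (west, south, east, north) order
bbox = (180, 50, 230, 75)

# Toy data cube with dimensions (time, latitude, longitude)
data = np.zeros((3, lat.size, lon.size))

# Keep grid cells whose coordinates fall inside the box (bounds inclusive)
lat_mask = (lat <= bbox[3]) & (lat >= bbox[1])
lon_mask = (lon >= bbox[0]) & (lon <= bbox[2])
sub = data[:, lat_mask, :][:, :, lon_mask]

print("Original shape:", data.shape)
print("Shape after spatial subset:", sub.shape)
```

With a descending latitude coordinate, a label-based slice in xarray must also be given in descending order, which is why the north bound comes first in the `.sel` call.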


Compute the RMSE against the reference date for all time slices before and after the reference date and forecast window:

In [7]:
%%time
analogs = af.find_analogs(sub_da, ref_date)

CPU times: user 4.44 s, sys: 3.6 s, total: 8.04 s
Wall time: 8.04 s
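
To make the search step concrete, here is a minimal sketch of an RMSE-based analog search in NumPy. It is not the implementation of `af.find_analogs`, whose internals are not shown in this notebook. The function name `find_analogs_sketch`, the `exclude` window, and the `n_analogs` default are all hypothetical.

```python
import numpy as np


def find_analogs_sketch(data, ref_idx, exclude=5, n_analogs=3):
    """Return indices and RMSE values of the closest time slices.

    data: array with dimensions (time, latitude, longitude).
    ref_idx: index of the reference time slice.
    exclude: number of slices on each side of the reference to skip (assumed).
    """
    ref = data[ref_idx]
    # RMSE between the reference field and every time slice
    rmse = np.sqrt(((data - ref) ** 2).mean(axis=(1, 2)))
    # Mask out the reference slice and its neighbours
    lo = max(ref_idx - exclude, 0)
    hi = ref_idx + exclude + 1
    rmse[lo:hi] = np.inf
    # The smallest remaining errors are the analogs
    best = np.argsort(rmse)[:n_analogs]
    return best, rmse[best]


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4, 5))
# Plant a near-copy of the reference field so there is a known best analog
data[40] = data[10] + 0.01

idx, err = find_analogs_sketch(data, ref_idx=10)
print(idx, err)
```

The planted slice at index 40 should come back as the best analog, and no index inside the excluded window around the reference should appear.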


Load the raw (non-anomaly) version of the data for generating and checking the forecast:

In [8]:
%%time
raw_ds = xr.load_dataset(data_dir.joinpath(luts.varnames_lu[varname]["filename"]))

CPU times: user 1min 2s, sys: 1min 6s, total: 2min 8s
Wall time: 2min 14s


Subset the raw data spatially and compute the forecast as the mean of the arrays for day t+1 for each of the analogs:

In [9]:
raw_sub_da = raw_ds[varname].sel(latitude=slice(bbox[3], bbox[1]), longitude=slice(bbox[0], bbox[2]))
forecast = (raw_sub_da.sel(time=analogs.time.values + pd.to_timedelta(1, "d")).values).mean(axis=0)

Compute the RMSE between forecast and the date after the reference date, and cross check with the results:

In [10]:
test_rmse = np.sqrt(
    ((raw_sub_da.sel(
        time=pd.to_datetime(ref_date + " 12:00:00") + pd.to_timedelta(1, "d")
    ) - forecast) ** 2).mean()
).round(3)

assert test_rmse == row["forecast_error"].astype(np.float32)
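
The check above can be sketched on synthetic data as follows. The analog indices, array sizes, and the use of `np.isclose` with a tolerance are assumptions for illustration. A tolerance-based comparison is shown here because exact equality between floats of different precision can fail for values that agree to three decimals.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data cube with dimensions (time, latitude, longitude)
data = rng.normal(size=(30, 3, 4)).astype(np.float32)

ref_idx = 12                       # hypothetical reference time index
analog_idx = np.array([2, 7, 20])  # hypothetical analog time indices
offset = 1                         # forecast day number

# Forecast: mean of the fields `offset` days after each analog
forecast = data[analog_idx + offset].mean(axis=0)

# Verification: field `offset` days after the reference date
observed = data[ref_idx + offset]
rmse = np.sqrt(((observed - forecast) ** 2).mean())

# Value as it would be stored after rounding to three decimals
stored = round(float(rmse), 3)

# Compare with a tolerance matching the rounding precision
assert np.isclose(rmse, stored, atol=5e-4)
```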

Make sure memory is freed up for loading different datasets:

In [12]:
import gc

# Drop references to the large objects, skipping any name that is not defined
for name in ("ds", "sub_da", "raw_sub_da", "raw_ds"):
    globals().pop(name, None)
gc.collect()

538

Repeat the check above for another row with a different variable, spatial domain, and forecast day number:

In [13]:
%%time
# ensure it is a different variable
row = analog_df.query("variable == 'sst'").iloc[-1]
varname = row["variable"]
ref_date = row["reference_date"]
spatial_domain = row["spatial_domain"]
bbox = luts.spatial_domains[spatial_domain]["bbox"]
ds = xr.load_dataset(data_dir.joinpath(luts.varnames_lu[varname]["filename"]))
sub_da = ds[varname].sel(latitude=slice(bbox[3], bbox[1]), longitude=slice(bbox[0], bbox[2]))
print("Original shape:", ds[varname].shape)
print("Shape after spatial subset:", sub_da.shape)
analogs = af.find_analogs(sub_da, ref_date)
date_offset = row["forecast_day_number"]
forecast = (sub_da.sel(time=analogs.time.values + pd.to_timedelta(date_offset, "d")).values).mean(axis=0)
test_rmse = np.sqrt(
    ((sub_da.sel(
        time=pd.to_datetime(ref_date + " 12:00:00") + pd.to_timedelta(date_offset, "d")
    ) - forecast) ** 2).mean()
).round(3)

assert test_rmse == row["forecast_error"].astype(np.float32)

Original shape: (23011, 361, 1440)
Shape after spatial subset: (23011, 361, 240)
CPU times: user 1min 16s, sys: 1min 26s, total: 2min 43s
Wall time: 2min 50s


## Save tables

Save the combined tables for collaborators to have a look at.

Make a table with the forecast date included, and limit it to rows whose forecast date is one of the dates of interest provided by partners, for forecast days 3 and 5.

In [36]:
ref_dates = ["2004-10-11", "2004-10-18", "2005-09-22", "2013-11-06", "2004-05-09", "2015-11-09", "2015-11-23"]

analog_df["forecast_date"] = (pd.to_datetime(analog_df["reference_date"]) + pd.to_timedelta(analog_df["forecast_day_number"], unit="d"))
analog_df.query("forecast_date in @ref_dates & forecast_day_number in [3, 5]").to_csv("analog_profiling_results_Nome.csv", index=False)
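
As a small illustration of the filter above, the sketch below builds a toy table and keeps only rows whose forecast date lands on a date of interest. The rows and the `dates_of_interest` name are made up for the example. The dates are converted with `pd.to_datetime` so that both sides of the comparison are datetimes.

```python
import pandas as pd

# Toy results table (hypothetical rows)
df = pd.DataFrame({
    "reference_date": ["2004-10-08", "2004-10-06", "2004-10-08"],
    "forecast_day_number": [3, 5, 5],
})

# Forecast date = reference date + forecast day number
df["forecast_date"] = (
    pd.to_datetime(df["reference_date"])
    + pd.to_timedelta(df["forecast_day_number"], unit="d")
)

# Keep rows that forecast a date of interest at a lead time of 3 or 5 days
dates_of_interest = pd.to_datetime(["2004-10-11"])
subset = df[
    df["forecast_date"].isin(dates_of_interest)
    & df["forecast_day_number"].isin([3, 5])
]
print(subset)
```

The first two rows survive because 2004-10-08 plus 3 days and 2004-10-06 plus 5 days both land on 2004-10-11. The third row forecasts 2004-10-13 and is dropped.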

Also save a table of the naive forecast results subset to only days 3 and 5:

In [35]:
naive_df.query("forecast_day_number in [3, 5]").to_csv("naive_profiling_results_Nome.csv", index=False)
