# EOF analysis of tide gauge (TG) data

## Learning Goals

* Fundamentals of Principal Component Analysis in climate science (EOFs)
* Matrix operations using numpy
* Defining a function in Python
* Creating a Dataset from scratch in Xarray

### Import Packages

In [8]:
%matplotlib widget
import os                         
import xarray as xr 
import pandas as pd
import matplotlib                 
import matplotlib.pyplot as plt   
import cartopy                  
import cartopy.crs as ccrs      
import plotly.express as px
import hvplot.xarray
import holoviews as hv
hv.extension('bokeh')
import numpy as np
import panel as pn
import panel.widgets as pnw
import ipywidgets as ipw
from sklearn import datasets, linear_model
from scipy import stats
from sklearn.metrics import mean_squared_error, r2_score
import glob2
import hvplot.pandas  # noqa

<head>
    <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/all.min.css" rel="stylesheet">
</head>


### Load the Data

Load the dataset we created and saved in the previous chapter.

In [2]:
path = '/Users/ynorden/Research/code/coastal-sl/data/'
tg_ds=xr.open_dataset(path+'ds_tg.nc')

In [3]:
tg_ds.hvplot.scatter(x='lon', y='lat', c='id',geo=True, coastline=True)



## EOF analysis 

EOF (Empirical Orthogonal Function) analysis is a statistical technique widely used in climate science and other areas of Earth science data analysis. \
It is a dimensionality reduction technique that retains the dominant modes of variability present in a given dataset. \
The key characteristic of EOFs is that they are orthogonal to each other, and the corresponding principal component time series are mutually uncorrelated (a weaker property than statistical independence). This allows each EOF to capture a distinct pattern of variability in the dataset.
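
As an illustration of the orthogonality property described above, here is a minimal sketch on synthetic data. The array sizes, the random signal, and all variable names are assumptions made for this example and are not part of the tide gauge dataset. It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic anomaly matrix: 500 time steps x 4 stations (hypothetical data)
nt, ns = 500, 4
signal = rng.standard_normal((nt, 1)) @ np.array([[1.0, 0.8, 0.6, -0.5]])
data = signal + 0.3 * rng.standard_normal((nt, ns))
data = data - data.mean(axis=0)          # remove the time mean at each station

covmat = data.T @ data / nt              # (ns, ns) covariance matrix
eigvals, eofs = np.linalg.eigh(covmat)   # eigenvalues returned in ascending order
order = np.argsort(eigvals)[::-1]        # reorder modes by decreasing variance
eigvals, eofs = eigvals[order], eofs[:, order]

pcs = data @ eofs                        # principal components, shape (nt, ns)

eof_gram = eofs.T @ eofs                 # identity matrix if the EOFs are orthonormal
pc_cov = pcs.T @ pcs / nt                # diagonal matrix if the PCs are uncorrelated
```

In this sketch `eof_gram` is the identity matrix (up to rounding) and `pc_cov` is diagonal with the eigenvalues on its diagonal, which is what "orthogonal EOFs, uncorrelated PCs" means in practice.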


```{admonition} Important
:class: tip-yellow
:icon: fas fa-lightbulb

EOF analysis is a purely mathematical technique and does not imply anything about the physical mechanisms behind the patterns that arise. It is up to the researcher to interpret the results.
```

In [None]:
# TODO: more on eigenvalue decomposition

<style>
    .bright-yellow-cell {
        background-color: yellow;
        color: black;
        font-weight: bold;
        padding: 8px;
        border: 1px solid black;
        border-radius: 4px;
    }
</style>

## EOF function

In [97]:
print(tg_ds.sla.shape)

(31, 9784)


In [9]:
def eof_tg(tg_ds):
    '''
    EOF analysis of tide gauge sea level anomalies.

    Input:  tg_ds, an xarray Dataset with variable `sla` on dimensions (id, time)
            Example. tg_ds.sla.shape : (31, 9784)
    Output: eof_df (stations x modes), pc_df (time x modes) and
            data_from_eof_df, the data reconstructed from all modes (time x stations)
    '''
    data = tg_ds.sla.T.data                    # (time, station)
    nt = len(tg_ds.time)
    covmat = (data.T @ data) / nt              # covariance matrix (treats sla as zero-mean anomalies)
    eigvals, eigvecs = np.linalg.eig(covmat)   # columns of eigvecs are the EOFs
    pc = data @ eigvecs                        # project the data onto the EOFs

    c = len(tg_ds.id)+1
    eof_names = ['EOF%s' % k for k in range(1, c)]
    pc_names = ['PC%s' % l for l in range(1, c)]
    eof_df = pd.DataFrame(eigvecs, index=tg_ds.id, columns=eof_names)
    pc_df = pd.DataFrame(pc, index=tg_ds.time, columns=pc_names)

    # PCs @ EOFs.T recovers the original data because the EOF matrix is orthogonal
    data_from_eof_df = pc_df.values @ eof_df.T.values
    data_from_eof_df = pd.DataFrame(data_from_eof_df, index=tg_ds.time, columns=tg_ds.id)

    return eof_df, pc_df, data_from_eof_df

In [10]:
eof_df, pc_df, data_from_eof_df = eof_tg(tg_ds)
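
One detail worth keeping in mind: `np.linalg.eig` does not guarantee that the eigenvalues come back in any particular order. A possible alternative for a symmetric covariance matrix is `np.linalg.eigh` followed by an explicit sort. The sketch below uses synthetic data, and `sorted_eigh` is a hypothetical helper name introduced only for this example.

```python
import numpy as np

def sorted_eigh(covmat):
    """Eigen-decompose a symmetric matrix and sort modes by decreasing eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(covmat)   # real output, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # indices for descending order
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(1)
data = rng.standard_normal((200, 5))            # hypothetical (time, station) anomalies
data -= data.mean(axis=0)
covmat = data.T @ data / data.shape[0]

eigvals, eofs = sorted_eigh(covmat)
```

With this ordering, `EOF1` is always the mode with the largest variance. Note that the sign of each eigenvector is arbitrary, so spatial patterns and PCs may come out flipped relative to another solver.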

In [11]:
pc_ds = xr.Dataset(
    {'pc_n': (('time', 'pc'), pc_df.values)},
    coords={'time': tg_ds.time, 'pc': np.arange(1, tg_ds.id.size + 1)},
)
# an xarray DataArray might be a better fit here
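
For comparison, a single variable like this can also be stored as an `xr.DataArray`. The following is a minimal sketch with small made-up arrays standing in for `pc_df.values` and `tg_ds.time`; the names `pc_da`, `pc_values` and `pc_ds_demo` are assumptions for this example only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-ins for pc_df.values and tg_ds.time
time = pd.date_range("1993-01-01", periods=5, freq="D")
pc_values = np.arange(15, dtype=float).reshape(5, 3)

pc_da = xr.DataArray(
    pc_values,
    dims=("time", "pc"),
    coords={"time": time, "pc": np.arange(1, pc_values.shape[1] + 1)},
    name="pc_n",
)

pc1 = pc_da.sel(pc=1)             # first principal component as a time series
pc_ds_demo = pc_da.to_dataset()   # Dataset with a single variable named "pc_n"
```

A DataArray keeps the labelled selection (`.sel(pc=1)`) and can still be converted to a Dataset when several variables need to live together.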

In [124]:
# saving a dataset to use in the next chapter  
#pc_ds.to_netcdf(path+'pc_ds.nc')

Note: this section is a placeholder until the Pythia cookbook material is ready, at which point readers will be referred to it. Still to do: explain more clearly what the eigenvectors and eigenvalues represent, include the variance of each mode, and add a section on the variance explained at each location.

### Plots

#### Spatial Patterns

In [25]:
eof_df_g = eof_df.copy()
eof_df_g['lat']=tg_ds.lat.data
eof_df_g['lon']=tg_ds.lon.data

In [24]:
eof_df_g 

id,EOF1,EOF2,EOF3,EOF4,EOF5,EOF6,EOF7,EOF8,EOF9,EOF10,...,EOF24,EOF25,EOF26,EOF27,EOF28,EOF29,EOF30,EOF31,lat,lon
8410140,-0.111759,-0.064118,0.151425,-0.193984,0.303376,-0.001483,0.272311,0.058981,0.040844,0.010152,...,-0.104914,-0.28935,0.059312,-0.124085,0.054359,0.076516,0.02043,0.007549,44.904598,44.904598
8418150,-0.141232,-0.112236,0.181463,-0.148933,0.249461,-0.115628,0.166304,0.06932,0.0485,-0.025495,...,0.17164,0.136686,-0.274324,0.397766,-0.133845,0.002612,0.210897,-0.176941,43.65806,43.65806
8443970,-0.188163,-0.109269,0.169621,-0.172418,0.222142,-0.035318,0.101072,0.110958,0.047713,0.010957,...,-0.176911,-0.044033,0.282733,-0.428173,0.197988,-0.062411,-0.150792,0.350241,42.354801,42.354801
8447930,-0.183028,-0.088591,0.139571,-0.124425,0.166466,-0.060155,0.02981,0.019439,0.027926,0.061553,...,0.258986,0.354732,-0.143197,-0.056737,0.050288,0.409783,0.063988,0.401042,41.523613,41.523613
8449130,-0.181594,-0.092365,0.100638,-0.203448,0.184764,-0.01855,0.177009,0.093149,0.039548,0.273261,...,0.06664,-0.052739,-0.060813,0.147265,-0.074266,0.016855,0.040109,-0.320175,41.285278,41.285278
8452660,-0.186126,-0.098641,0.141271,-0.085431,0.125278,-0.083479,-0.023434,0.021854,0.015917,-0.070373,...,0.063412,0.171363,0.589761,0.156862,-0.28387,-0.466514,-0.135206,-0.110736,41.504333,41.504333
8461490,-0.212953,-0.112007,0.145516,-0.061612,0.076055,-0.059158,-0.102719,0.062262,0.005895,-0.201495,...,-0.285042,-0.084026,-0.602158,-0.014221,0.199944,-0.378919,-0.27838,-0.019598,41.3717,41.3717
8467150,-0.228104,-0.135224,0.139187,0.026386,-0.036037,-0.164821,-0.220339,0.06808,-0.038363,-0.273836,...,0.056921,-0.393664,0.158711,-0.049294,-0.038828,0.433199,0.254297,-0.15545,41.173302,41.173302
8518750,-0.243364,-0.116148,0.085427,0.079643,-0.109172,-0.104912,-0.341069,-0.043449,-0.05061,-0.144254,...,-0.008219,0.058985,0.021188,0.004323,-0.006924,0.00499,-0.030387,0.018702,40.700556,40.700556
8531680,-0.277924,-0.111745,0.072979,0.053675,-0.091611,-0.046845,-0.286315,-0.019356,-0.014532,0.149509,...,0.005029,0.07738,-0.003023,0.007223,0.018565,-0.022463,0.016033,-0.018677,40.466944,40.466944


In [37]:
eig_eof_1 = eof_df_g.hvplot.points('lon','lat',c='EOF1',geo=True,projection=ccrs.PlateCarree(),
                                    coastline=True,cmap='Spectral',line_width=5, title = 'EOF1: spatial pattern of variability',colorbar=True)
eig_eof_2 = eof_df_g.hvplot.points('lon','lat',c='EOF2',geo=True,projection=ccrs.PlateCarree(),
                                    coastline=True,cmap='Spectral',line_width=5, title = 'EOF2: spatial pattern of variability',colorbar=True)
eig_eof_3 = eof_df_g.hvplot.points('lon','lat',c='EOF3',geo=True,projection=ccrs.PlateCarree(),
                                    coastline=True,cmap='Spectral',line_width=5, title = 'EOF3: spatial pattern of variability',colorbar=True)
eig_eof_4 = eof_df_g.hvplot.points('lon','lat',c='EOF4',geo=True,projection=ccrs.PlateCarree(),
                                    coastline=True,cmap='Spectral',line_width=5, title = 'EOF4: spatial pattern of variability',colorbar=True)

#### Principal Components

In [12]:
pc_df.head()

time,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC22,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30,PC31
1993-01-01,0.129575,-0.057145,0.021785,-0.037181,-0.046846,0.308842,0.109787,-0.029381,-0.004198,0.010261,...,0.000545,-0.006427,0.001542,0.03909,0.027648,-0.004983,-0.014321,-0.0262,0.028805,-0.000147
1993-01-02,0.457598,0.3223,-0.374299,0.041988,-0.021173,0.302031,-0.032742,0.018476,0.060944,-0.028407,...,3.3e-05,-0.024126,0.027133,0.052957,0.025456,-0.002528,-0.049843,-0.022028,-0.020625,-0.006009
1993-01-03,0.140078,0.371345,-0.173575,-0.032414,-0.045105,0.130998,-0.279333,0.046727,0.067727,-0.024421,...,-0.013811,-0.016556,-0.004022,0.006613,0.015825,-0.006091,-0.026559,-0.013624,-0.004098,-0.018203
1993-01-04,0.036469,0.407508,0.040033,0.021855,-0.122062,0.134408,-0.092999,0.004606,0.034009,-0.025902,...,-0.03062,0.001755,-0.025642,0.001179,0.029968,0.004056,-0.013653,-0.015378,0.006626,-0.017922
1993-01-05,0.031347,0.420764,0.24327,0.098134,0.106355,0.135519,0.052505,-0.083359,-0.013644,-0.124486,...,-0.037327,0.009473,-0.023338,0.00666,0.024226,-0.010894,-0.008484,-0.016809,-0.006559,-0.00083


In [35]:
pc_1 = hv.Curve(pc_df['PC1']).opts(width=700,title ='PC1: temporal pattern of variability')
pc_2 = hv.Curve(pc_df['PC2']).opts(width=700,title ='PC2: temporal pattern of variability')
pc_3 = hv.Curve(pc_df['PC3']).opts(width=700,title ='PC3: temporal pattern of variability')
pc_4 = hv.Curve(pc_df['PC4']).opts(width=700,title ='PC4: temporal pattern of variability')
pc_5 = hv.Curve(pc_df['PC5']).opts(width=700,title ='PC5: temporal pattern of variability')

In [41]:
hv.Layout(eig_eof_1+pc_1)

In [43]:
hv.Layout(eig_eof_2+pc_2)

In [46]:
hv.Layout(eig_eof_3+pc_3)

In [None]:
#eof_df, pc_df, data_from_eof_df = eof_tg(tg_data, tg_meta)

### EOF function (taking an xarray Dataset)

In [None]:
if I have 

## EOF analysis (step by step approach)

Below is the same analysis as in `eof_tg`, carried out step by step.

In [56]:
dat_00 = tg_ds.sla.T.data                  # (time, station)
nt = len(tg_ds.time)                       # number of time steps
covdat_00 = (dat_00.T @ dat_00)/nt         # covariance matrix
eigvals_00, eigvecs_00 = np.linalg.eig(covdat_00)
pc_00 = dat_00 @ eigvecs_00                # project the data onto the EOFs

In [65]:
pc_00.shape

(9784, 31)

In [66]:
eof_names = ['EOF%s' % k for k in range(1,len(tg_ds.id)+1)]
pc_names = ['PC%s' % l for l in range(1,len(tg_ds.id)+1)]
eof_df = pd.DataFrame(eigvecs_00, index=tg_ds.id,columns = eof_names) # EOFs are the eigenvectors (columns of eigvecs_00)
pc_df = pd.DataFrame(pc_00,index=tg_ds.time,columns = pc_names)

In [93]:
pc_df

time,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC22,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30,PC31
1993-01-01,0.129575,-0.057145,0.021785,-0.037181,-0.046846,0.308842,0.109787,-0.029381,-0.004198,0.010261,...,0.000545,-0.006427,0.001542,0.039090,0.027648,-0.004983,-0.014321,-0.026200,0.028805,-0.000147
1993-01-02,0.457598,0.322300,-0.374299,0.041988,-0.021173,0.302031,-0.032742,0.018476,0.060944,-0.028407,...,0.000033,-0.024126,0.027133,0.052957,0.025456,-0.002528,-0.049843,-0.022028,-0.020625,-0.006009
1993-01-03,0.140078,0.371345,-0.173575,-0.032414,-0.045105,0.130998,-0.279333,0.046727,0.067727,-0.024421,...,-0.013811,-0.016556,-0.004022,0.006613,0.015825,-0.006091,-0.026559,-0.013624,-0.004098,-0.018203
1993-01-04,0.036469,0.407508,0.040033,0.021855,-0.122062,0.134408,-0.092999,0.004606,0.034009,-0.025902,...,-0.030620,0.001755,-0.025642,0.001179,0.029968,0.004056,-0.013653,-0.015378,0.006626,-0.017922
1993-01-05,0.031347,0.420764,0.243270,0.098134,0.106355,0.135519,0.052505,-0.083359,-0.013644,-0.124486,...,-0.037327,0.009473,-0.023338,0.006660,0.024226,-0.010894,-0.008484,-0.016809,-0.006559,-0.000830
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-10-11,-2.936219,0.083368,-0.174679,-0.100367,-0.268591,-0.323296,-0.045517,0.057186,0.056732,0.131321,...,0.068744,-0.017393,0.011616,0.042399,0.026486,-0.031288,0.036376,-0.010764,0.025256,-0.039344
2019-10-12,-2.632806,0.244800,-0.235762,-0.030300,-0.057545,-0.131455,0.087946,-0.051197,0.003279,-0.025928,...,0.013184,-0.011453,0.058669,0.031386,0.014869,-0.031105,0.035423,-0.008737,0.005026,-0.031160
2019-10-13,-1.638232,0.704313,-0.230098,0.025144,0.126025,-0.053695,0.174941,-0.104357,-0.035668,-0.045590,...,0.043003,0.016979,0.041636,0.007109,0.019979,0.000293,0.021653,-0.003226,0.027769,-0.022753
2019-10-14,-1.398735,0.827116,0.139969,-0.000277,0.158596,-0.108451,0.011442,-0.041105,-0.103715,0.085600,...,0.010244,0.003827,0.009510,0.011328,0.018099,0.002473,0.005859,-0.004041,0.023538,0.001580


In [67]:
val_sum0 = eigvals_00.sum()  # total variance = sum of all eigenvalues
var_exp0 = np.round((eigvals_00*100/val_sum0), decimals=2)  # % of variance explained by each mode
# first four modes individually, remaining modes lumped together as 'rest'
d0 = {'EOF1':var_exp0[0],'EOF2':var_exp0[1],'EOF3':var_exp0[2],'EOF4':var_exp0[3],'rest':var_exp0[4:].sum()}
var_table0 = pd.DataFrame(d0,index=['%'])
var_table0

| | EOF1 | EOF2 | EOF3 | EOF4 | rest |
|---|---|---|---|---|---|
| % | 53.59 | 17.790001 | 8.23 | 5.29 | 15.110001 |


In [116]:
var_exp0

array([5.359e+01, 1.779e+01, 8.230e+00, 5.290e+00, 3.810e+00, 2.720e+00,
       2.060e+00, 1.150e+00, 7.500e-01, 6.200e-01, 5.400e-01, 4.600e-01,
       3.700e-01, 3.400e-01, 2.700e-01, 2.600e-01, 2.400e-01, 2.200e-01,
       1.900e-01, 1.800e-01, 1.700e-01, 1.400e-01, 1.100e-01, 1.000e-01,
       8.000e-02, 4.000e-02, 4.000e-02, 5.000e-02, 6.000e-02, 8.000e-02,
       6.000e-02], dtype=float32)
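
The same eigenvalues and eigenvectors can also be used to estimate how much of the variance at each individual station is captured by each mode. The sketch below works on synthetic data; the array sizes and the names `var_exp` and `local_var` are assumptions for this example. It relies on the identity that the variance at station j equals the sum over modes k of eigenvalue k times the squared EOF loading at that station.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic anomalies: two shared signals plus noise at 6 stations (hypothetical data)
nt, ns = 1000, 6
data = rng.standard_normal((nt, 2)) @ rng.standard_normal((2, ns))
data += 0.2 * rng.standard_normal((nt, ns))
data -= data.mean(axis=0)

covmat = data.T @ data / nt
eigvals, eofs = np.linalg.eigh(covmat)
order = np.argsort(eigvals)[::-1]                # decreasing variance
eigvals, eofs = eigvals[order], eofs[:, order]

# Percentage of total variance explained by each mode
var_exp = 100 * eigvals / eigvals.sum()

# Percentage of the variance at each station explained by each mode,
# shape (station, mode); each row sums to 100
local_var = 100 * eigvals * eofs**2 / np.diag(covmat)[:, None]
```

Plotting one column of `local_var` on a map would show where a given mode matters most, which is often easier to interpret than the raw EOF loadings.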

### Reconstructing data from EOF

In [69]:
# PCs @ EOFs.T recovers the original data because the EOF matrix is orthogonal
data_from_eof_df = pc_df.values @ eof_df.T.values
data_from_eof_df = pd.DataFrame(data_from_eof_df, index = tg_ds.time, columns = tg_ds.id)

In [72]:
data_from_eof_df

time,8410140,8418150,8443970,8447930,8449130,8452660,8461490,8467150,8518750,8531680,...,8725110,8726520,8727520,8728690,8729108,8729840,8761724,8771450,8775870,8779770
1993-01-01,-0.002500,-0.019833,0.001667,-0.055167,-0.005833,-0.018875,-0.069000,-0.093875,-0.112958,-0.077125,...,0.051542,-0.018333,0.021500,0.061083,-0.025917,-0.065833,-0.209167,-0.027083,0.037125,0.024292
1993-01-02,-0.217333,-0.207083,-0.169500,-0.213792,-0.126917,-0.161000,-0.221083,-0.237917,-0.224833,-0.210833,...,-0.018542,-0.051958,-0.025792,0.063500,0.001083,-0.030500,-0.203167,0.025458,0.133292,0.056792
1993-01-03,-0.183042,-0.152917,-0.108125,-0.130333,-0.122792,-0.063083,-0.056833,-0.020333,-0.030667,-0.028625,...,0.000250,-0.039500,-0.005042,0.096542,0.061875,0.073375,-0.118375,0.171292,0.163917,0.112250
1993-01-04,-0.056333,-0.096542,-0.108708,-0.096417,-0.156542,-0.015000,-0.038667,-0.019917,-0.015500,-0.024708,...,0.076667,0.054708,0.104417,0.175500,0.124792,0.141875,-0.012125,0.179917,0.134583,0.109292
1993-01-05,0.065708,-0.005833,-0.021833,-0.038917,-0.124333,0.017042,-0.007583,-0.036083,-0.030833,-0.044625,...,0.133458,0.147833,0.211167,0.197250,0.187667,0.166083,-0.002250,0.029667,0.113417,0.136125
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-10-11,0.215458,0.305625,0.509917,0.500417,0.621750,0.492833,0.536667,0.617333,0.753458,0.826083,...,0.245208,0.202958,0.212000,0.276625,0.268125,0.374208,0.319125,0.351458,0.278000,0.297042
2019-10-12,0.255292,0.305167,0.434375,0.429583,0.509042,0.416417,0.438958,0.477792,0.606000,0.647708,...,0.236833,0.260333,0.305500,0.286042,0.269958,0.343958,0.192208,0.191042,0.365042,0.387542
2019-10-13,0.161000,0.157000,0.202000,0.256042,0.257208,0.234208,0.209750,0.219000,0.259583,0.286750,...,0.259458,0.284833,0.368333,0.305333,0.297625,0.324375,0.189292,0.287708,0.316583,0.372417
2019-10-14,0.169667,0.169125,0.222292,0.250417,0.241958,0.233000,0.207417,0.217625,0.250750,0.275292,...,0.361667,0.371417,0.435333,0.350708,0.335292,0.370125,0.236000,0.377708,0.321375,0.332208
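
Using all modes reproduces the original data exactly, so the reconstruction becomes more informative when only the leading modes are kept. The following is a minimal sketch on synthetic data; the value `k = 2` and all variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic anomalies: 400 time steps x 5 stations (hypothetical data)
nt, ns = 400, 5
data = rng.standard_normal((nt, ns)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])
data -= data.mean(axis=0)

covmat = data.T @ data / nt
eigvals, eofs = np.linalg.eigh(covmat)
order = np.argsort(eigvals)[::-1]              # decreasing variance
eigvals, eofs = eigvals[order], eofs[:, order]
pcs = data @ eofs

recon_full = pcs @ eofs.T                      # all modes: recovers the data

k = 2                                          # number of leading modes to keep
recon_k = pcs[:, :k] @ eofs[:, :k].T           # truncated reconstruction

# Variance left in the residual equals the sum of the discarded eigenvalues
residual_var = ((data - recon_k) ** 2).sum() / nt
```

Here `residual_var` matches `eigvals[k:].sum()`, so the eigenvalue spectrum tells you in advance how much variance a truncated reconstruction will miss.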


In [77]:
data_from_eof_df['8449130'][:100].hvplot()

In [92]:
tg_ds.sel(id='8449130',time=slice('1993-01-01','1993-11-01')).hvplot()+data_from_eof_df['8449130'][:300].hvplot()

In [82]:
data_from_eof_df['8449130'].hvplot()*tg_ds.sel(id='8449130').hvplot().opts(line_dash='dashed',color='yellow')