# Using a Generalized Linear Model with Feature Vectors for Event Detection

We now use the **feature vectors** described in the [previous notebook](https://github.com/tingsyo/dlgridsat/blob/main/notebook/08_DataFlow_for_Experiments.ipynb) together with a [**Generalized Linear Model**](https://en.wikipedia.org/wiki/Generalized_linear_model) to detect and predict the events described in the [weather events notebook](https://github.com/tingsyo/dlgridsat/blob/main/notebook/00_weather_events.ipynb).


## Define the terms

### Weather Events

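- **HRD**: Precipitation >= 40 mm/day
- **HRH**: Precipitation >= 10 mm/hr
- **CS**: Cold surge: the Taipei station records 10°C or below in any hour within a 24-hour period
- **TYW**: The Central Weather Bureau issues a typhoon warning
- **NWPTC**: A tropical cyclone is present in the Northwest Pacific (column `NWPTY` in the event table)
- **FT**: Central Weather Bureau surface weather chart; from 2000 onward, the 00Z chart is used
- **NE**: Daily mean wind direction at the Pengjiayu station is northeasterly (15-75 degrees) and wind speed reaches 4 m/s
- **SWF**: In CFSR 850 hPa data, within the red region: mean u > 0, mean v > 0, and mean wind speed of at least 3 m/s; or wind speeds > 6 m/s cover 30% of the red region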
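
The two precipitation events are plain thresholds, so they can be illustrated on a toy table. This is only a sketch, not the labeling code used to build the event dataset: the hourly values are made up, the names `precip`, `precip_mm`, and `events_toy` are hypothetical, and we assume that HRD applies to the daily total while HRH applies to the largest hourly value within the day.

```python
import pandas as pd

# Hypothetical hourly precipitation (mm); only three hours per day for brevity.
precip = pd.DataFrame({
    "date": ["20130101"] * 3 + ["20130102"] * 3,
    "precip_mm": [2.0, 12.5, 30.0, 0.0, 4.0, 5.5],
})

# Daily total and largest hourly value per day.
daily = precip.groupby("date")["precip_mm"].agg(total="sum", max_hourly="max")

# Assumed reading of the definitions: HRD on the daily total, HRH on the hourly peak.
events_toy = pd.DataFrame({
    "HRD": (daily["total"] >= 40).astype(int),
    "HRH": (daily["max_hourly"] >= 10).astype(int),
})
print(events_toy)
```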


### Feature Vectors

- **PCA** : Principal Component Analysis with 2,048 components.
- **CAE** : Convolutional Auto-Encoder encoded vectors, with a length of 2,048.
- **CVAE**: Variational Auto-Encoder encoded vectors, with a length of 2,048.
- **PTBE**: [ResNet50 pre-trained with Big-Earth dataset](https://tfhub.dev/google/remote_sensing/bigearthnet-resnet50/1), with a length of 2,048.
- **PTIN**: [ResNet50 pre-trained with ImageNet dataset](https://tfhub.dev/tensorflow/resnet_50/feature_vector/1), with a length of 2,048.
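
To make the idea of a feature vector concrete, here is a minimal sketch of the PCA case with scikit-learn. It is not the pipeline from the previous notebook: the images are random numbers, the 16x16 size is arbitrary, and we use 8 components only to keep the example small (the feature vectors in this notebook have 2,048).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for satellite images: 200 samples of 16x16 pixels.
images = rng.normal(size=(200, 16, 16))

# PCA expects one row per sample, so flatten each image to a vector.
flat = images.reshape(len(images), -1)

# Project each flattened image onto the leading principal components.
pca = PCA(n_components=8, random_state=0)
fv_toy = pca.fit_transform(flat)

print(fv_toy.shape)
print(pca.explained_variance_ratio_.sum())
```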


## Data Partition

We hold out the year 2016 for validation (the testing data in the code below) and use 2013 to 2015 for training and development. Within the training and development data, we use cross-validation for model tuning.
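
The tuning step itself is not shown in this notebook. As a rough sketch of what cross-validated tuning could look like, the example below searches over the regularization strength `C` of a logistic regression. The data are synthetic, and the names `X_dev` and `y_dev`, the grid values, the number of folds, and the F1 scoring are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training/development data (about 30% positives).
X_dev, y_dev = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)

# Five stratified folds keep the event rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Try a few regularization strengths and score each by F1 on the held-out fold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=cv,
)
search.fit(X_dev, y_dev)

print(search.best_params_, round(search.best_score_, 3))
```

Shuffled folds ignore the time ordering of daily data; `TimeSeriesSplit` from the same module is one alternative if that matters.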

### Event Dataset


In [1]:
import numpy as np
import pandas as pd
import os, re

# Read all events
events = pd.read_csv('../data/tad_filtered.csv', index_col=0)

print(events.head())
print(events.shape)
for c in events.columns:
    print(c + '\t counts: ' + str(events[c].sum()) + '\t prob:' + str(events[c].sum()/events.shape[0])) 

          CS  TYW  NWPTY  FT  NE  SWF  HRD  HRH
20130101   0  0.0      1   0   0    0    1    0
20130102   0  0.0      1   0   1    0    0    0
20130103   0  0.0      1   0   1    0    1    1
20130104   0  0.0      1   0   1    0    1    0
20130105   0  0.0      1   0   1    0    1    1
(1461, 8)
CS	 counts: 12	 prob:0.008213552361396304
TYW	 counts: 65.0	 prob:0.044490075290896644
NWPTY	 counts: 702	 prob:0.4804928131416838
FT	 counts: 244	 prob:0.16700889801505817
NE	 counts: 471	 prob:0.32238193018480493
SWF	 counts: 406	 prob:0.2778918548939083
HRD	 counts: 420	 prob:0.2874743326488706
HRH	 counts: 520	 prob:0.35592060232717315


## Feature Vectors

In [2]:
# PCA
fv_pca = pd.read_csv('../data/fv_pca.zip', compression='zip', index_col=0)
print(fv_pca.head())
print(fv_pca.shape)

                  0          1         2         3         4         5  \
20130101  10.258309  -4.656837 -3.172782  0.632345  2.013453  3.179133   
20130102   8.503345  -6.036067 -2.736589 -3.880598  2.831698  2.121468   
20130103   7.555368 -10.951109 -2.391194 -0.284879  2.710904 -1.378869   
20130104   7.380538 -10.050537 -3.675916  1.638377  2.332644  0.610044   
20130105   7.389496  -7.858804 -3.157701 -1.307389  2.557700  0.431949   

                 6         7         8         9  ...      2038      2039  \
20130101 -1.438060  0.716551  3.222975 -1.491041  ... -0.066616  0.034927   
20130102  1.351488  1.792430 -0.812001 -1.764388  ... -0.019680 -0.009342   
20130103  0.165945  2.193658 -2.956463  0.340051  ... -0.026055  0.284996   
20130104  0.575185  2.279190 -0.802140 -1.236430  ... -0.106318  0.014793   
20130105 -0.242018  2.094918  0.823646 -2.035190  ... -0.031707  0.090493   

              2040      2041      2042      2043      2044      2045  \
20130101  0.031727 -

## Cleaning and Splitting the Dataset

The satellite dataset contains missing data, so we need to remove those entries before fitting the model. We also split the dataset into training and testing data.

In [3]:
# Partition training/testing data: row 1095 is the first day of 2016
index_20160101 = 1095
print(events.index[index_20160101])

x_train = fv_pca.iloc[:index_20160101,:]
x_test = fv_pca.iloc[index_20160101:,:]

y_train = events.iloc[:index_20160101, :]
y_test = events.iloc[index_20160101:,:]

print(x_train.shape)
print(x_test.shape)

20160101
(1095, 2048)
(366, 2048)


In [4]:
# Drop NA
x_train = x_train.dropna(axis=0, how='any')
y_train = y_train.loc[x_train.index,:]

print(x_train.shape)
print(y_train.shape)

x_test = x_test.dropna(axis=0, how='any')
y_test = y_test.loc[x_test.index,:]

print(x_test.shape)
print(y_test.shape)

(1082, 2048)
(1082, 8)
(366, 2048)
(366, 8)
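
The cells above split by row position, which relies on knowing that row 1095 is 2016-01-01. A possible alternative is to split on the index labels themselves. The sketch below assumes the index holds integer dates such as `20130101` (as the printed tables suggest) and uses a small made-up table; `fv_toy` and `events_toy` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy feature vectors and events sharing an integer YYYYMMDD index.
dates = [20151229, 20151230, 20151231, 20160101, 20160102]
fv_toy = pd.DataFrame(
    {"f0": [0.1, np.nan, 0.3, 0.4, 0.5], "f1": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=dates,
)
events_toy = pd.DataFrame({"HRD": [0, 1, 0, 1, 0]}, index=dates)

# Split on the date label instead of a hard-coded row number.
is_test = fv_toy.index >= 20160101
x_train = fv_toy[~is_test].dropna(axis=0, how="any")
x_test = fv_toy[is_test].dropna(axis=0, how="any")

# Keep only the labels whose feature rows survived dropna().
y_train = events_toy.loc[x_train.index]
y_test = events_toy.loc[x_test.index]

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```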


In [5]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, max_iter=1000).fit(x_train, y_train['HRD'])

In [6]:
from sklearn.metrics import confusion_matrix

# Training CM
y_pred = clf.predict(x_train)
y_true = y_train['HRD']
cm_train = confusion_matrix(y_true, y_pred)
print("Training Confusion Matrix")
print(cm_train)

# Testing CM
y_pred = clf.predict(x_test)
y_true = y_test['HRD']
cm_test = confusion_matrix(y_true, y_pred)
print("Testing Confusion Matrix")
print(cm_test)

Training Confusion Matrix
[[794   0]
 [  0 288]]
Testing Confusion Matrix
[[199  39]
 [ 83  45]]
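
To read these matrices: scikit-learn puts the true classes in rows and the predicted classes in columns, so the four cells are true negatives, false positives, false negatives, and true positives. The short sketch below unpacks the testing matrix printed above and compares the model's accuracy with the accuracy of always predicting "no event". The choice of metrics is ours, for illustration.

```python
import numpy as np

# Testing confusion matrix for GLM-PCA on HRD, copied from the output above.
cm_test = np.array([[199, 39],
                    [83, 45]])

# Row-major order of a 2x2 scikit-learn confusion matrix: tn, fp, fn, tp.
tn, fp, fn, tp = cm_test.ravel()

# Share of all test days classified correctly.
accuracy = (tp + tn) / cm_test.sum()
# Accuracy obtained by always predicting the majority class (no event).
baseline = (tn + fp) / cm_test.sum()
# Share of actual event days that the model detected.
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f} recall={recall:.3f}")
```

The model classifies 244 of the 366 test days correctly, against 238 for the always-negative baseline, and detects 45 of the 128 event days. The training matrix, by contrast, has no errors at all.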


## Other Feature Vectors
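
The cells in this section repeat the same split, drop, fit, and score steps, changing only the file path. One way to avoid the repetition is to wrap those steps in a function. The sketch below does this with a hypothetical helper, `evaluate_glm`, and runs it on synthetic data rather than on the notebook's files.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix


def evaluate_glm(fv, events, event, split_pos):
    """Split by row position, drop NaN rows, fit, and return both confusion matrices."""
    x_train = fv.iloc[:split_pos].dropna(axis=0, how="any")
    x_test = fv.iloc[split_pos:].dropna(axis=0, how="any")
    # Align labels with the feature rows that survived dropna().
    y_train = events.loc[x_train.index, event]
    y_test = events.loc[x_test.index, event]

    clf = LogisticRegression(random_state=0, max_iter=1000).fit(x_train, y_train)
    cm_train = confusion_matrix(y_train, clf.predict(x_train), labels=[0, 1])
    cm_test = confusion_matrix(y_test, clf.predict(x_test), labels=[0, 1])
    return cm_train, cm_test


# Synthetic demo: 120 "days", 5 features, label driven by the first feature.
rng = np.random.default_rng(0)
fv_demo = pd.DataFrame(rng.normal(size=(120, 5)),
                       columns=[f"f{i}" for i in range(5)])
events_demo = pd.DataFrame({"HRD": (fv_demo["f0"] > 0).astype(int)})
fv_demo.iloc[3, 0] = np.nan  # one incomplete row to exercise dropna()

cm_train, cm_test = evaluate_glm(fv_demo, events_demo, "HRD", split_pos=90)
print(cm_train)
print(cm_test)
```

With the notebook's variables, the call would look like `evaluate_glm(fv, events, EVENT, index_20160101)` after reading `fv` from `FVPATH`.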


In [7]:
# Define feature vector(X) and event(Y)
FVPATH = '../data/fv_cae.zip'
EVENT = 'HRD'

print('Feature Vectors:' + FVPATH)
print('Event: ' + EVENT)
# Read feature Vector
fv = pd.read_csv(FVPATH, compression='zip', index_col=0)

# Splitting training/testing
x_train = fv.iloc[:index_20160101,:]
x_test = fv.iloc[index_20160101:,:]

y_train = events.iloc[:index_20160101, :]
y_test = events.iloc[index_20160101:,:]

# Drop NA
print('Original training data shape: ' + str(x_train.shape))
x_train = x_train.dropna(axis=0, how='any')
y_train = y_train.loc[x_train.index,:]
print('After dropping NaNs: ' + str(x_train.shape))

print()
print('Original testing data shape: ' + str(x_test.shape))
x_test = x_test.dropna(axis=0, how='any')
y_test = y_test.loc[x_test.index,:]
print('After dropping NaNs: ' + str(x_test.shape))

print()
print('Fitting GLM:')
clf = LogisticRegression(random_state=0, max_iter=1000).fit(x_train, y_train[EVENT])

# Training CM
y_pred = clf.predict(x_train)
y_true = y_train[EVENT]
cm_train = confusion_matrix(y_true, y_pred)
print("Training Confusion Matrix")
print(cm_train)

# Testing CM
y_pred = clf.predict(x_test)
y_true = y_test[EVENT]
cm_test = confusion_matrix(y_true, y_pred)
print("Testing Confusion Matrix")
print(cm_test)

Feature Vectors:../data/fv_cae.zip
Event: HRD
Original training data shape: (1095, 2048)
After dropping NaNs: (1082, 2048)

Original testing data shape: (366, 2048)
After dropping NaNs: (366, 2048)

Fitting GLM:
Training Confusion Matrix
[[794   0]
 [  4 284]]
Testing Confusion Matrix
[[194  44]
 [ 84  44]]


In [8]:
# Define feature vector(X) and event(Y)
FVPATH = '../data/fv_cvae.zip'
EVENT = 'HRD'

print('Feature Vectors:' + FVPATH)
print('Event: ' + EVENT)

# Read feature Vector
fv = pd.read_csv(FVPATH, compression='zip', index_col=0)

# Splitting training/testing
x_train = fv.iloc[:index_20160101,:]
x_test = fv.iloc[index_20160101:,:]

y_train = events.iloc[:index_20160101, :]
y_test = events.iloc[index_20160101:,:]

# Drop NA
print('Original training data shape: ' + str(x_train.shape))
x_train = x_train.dropna(axis=0, how='any')
y_train = y_train.loc[x_train.index,:]
print('After dropping NaNs: ' + str(x_train.shape))

print()
print('Original testing data shape: ' + str(x_test.shape))
x_test = x_test.dropna(axis=0, how='any')
y_test = y_test.loc[x_test.index,:]
print('After dropping NaNs: ' + str(x_test.shape))

print()
print('Fitting GLM:')
clf = LogisticRegression(random_state=0, max_iter=1000).fit(x_train, y_train[EVENT])

# Training CM
y_pred = clf.predict(x_train)
y_true = y_train[EVENT]
cm_train = confusion_matrix(y_true, y_pred)
print("Training Confusion Matrix")
print(cm_train)

# Testing CM
y_pred = clf.predict(x_test)
y_true = y_test[EVENT]
cm_test = confusion_matrix(y_true, y_pred)
print("Testing Confusion Matrix")
print(cm_test)

Feature Vectors:../data/fv_cvae.zip
Event: HRD
Original training data shape: (1095, 2048)
After dropping NaNs: (1082, 2048)

Original testing data shape: (366, 2048)
After dropping NaNs: (366, 2048)

Fitting GLM:
Training Confusion Matrix
[[794   0]
 [  0 288]]
Testing Confusion Matrix
[[182  56]
 [104  24]]


In [9]:
# Define feature vector(X) and event(Y)
FVPATH = '../data/fv_ptbe.zip'
EVENT = 'HRD'

print('Feature Vectors:' + FVPATH)
print('Event: ' + EVENT)

# Read feature Vector
fv = pd.read_csv(FVPATH, compression='zip', index_col=0)

# Splitting training/testing
x_train = fv.iloc[:index_20160101,:]
x_test = fv.iloc[index_20160101:,:]

y_train = events.iloc[:index_20160101, :]
y_test = events.iloc[index_20160101:,:]

# Drop NA
print('Original training data shape: ' + str(x_train.shape))
x_train = x_train.dropna(axis=0, how='any')
y_train = y_train.loc[x_train.index,:]
print('After dropping NaNs: ' + str(x_train.shape))

print()
print('Original testing data shape: ' + str(x_test.shape))
x_test = x_test.dropna(axis=0, how='any')
y_test = y_test.loc[x_test.index,:]
print('After dropping NaNs: ' + str(x_test.shape))

print()
print('Fitting GLM:')
clf = LogisticRegression(random_state=0, max_iter=1000).fit(x_train, y_train[EVENT])

# Training CM
y_pred = clf.predict(x_train)
y_true = y_train[EVENT]
cm_train = confusion_matrix(y_true, y_pred)
print("Training Confusion Matrix")
print(cm_train)

# Testing CM
y_pred = clf.predict(x_test)
y_true = y_test[EVENT]
cm_test = confusion_matrix(y_true, y_pred)
print("Testing Confusion Matrix")
print(cm_test)

Feature Vectors:../data/fv_ptbe.zip
Event: HRD
Original training data shape: (1095, 2048)
After dropping NaNs: (1082, 2048)

Original testing data shape: (366, 2048)
After dropping NaNs: (366, 2048)

Fitting GLM:
Training Confusion Matrix
[[789   5]
 [ 58 230]]
Testing Confusion Matrix
[[195  43]
 [ 94  34]]


In [10]:
# Define feature vector(X) and event(Y)
FVPATH = '../data/fv_ptin.zip'
EVENT = 'HRD'

print('Feature Vectors:' + FVPATH)
print('Event: ' + EVENT)

# Read feature Vector
fv = pd.read_csv(FVPATH, compression='zip', index_col=0)

# Splitting training/testing
x_train = fv.iloc[:index_20160101,:]
x_test = fv.iloc[index_20160101:,:]

y_train = events.iloc[:index_20160101, :]
y_test = events.iloc[index_20160101:,:]

# Drop NA
print('Original training data shape: ' + str(x_train.shape))
x_train = x_train.dropna(axis=0, how='any')
y_train = y_train.loc[x_train.index,:]
print('After dropping NaNs: ' + str(x_train.shape))

print()
print('Original testing data shape: ' + str(x_test.shape))
x_test = x_test.dropna(axis=0, how='any')
y_test = y_test.loc[x_test.index,:]
print('After dropping NaNs: ' + str(x_test.shape))

print()
print('Fitting GLM:')
clf = LogisticRegression(random_state=0, max_iter=1000).fit(x_train, y_train[EVENT])

# Training CM
y_pred = clf.predict(x_train)
y_true = y_train[EVENT]
cm_train = confusion_matrix(y_true, y_pred)
print("Training Confusion Matrix")
print(cm_train)

# Testing CM
y_pred = clf.predict(x_test)
y_true = y_test[EVENT]
cm_test = confusion_matrix(y_true, y_pred)
print("Testing Confusion Matrix")
print(cm_test)

Feature Vectors:../data/fv_ptin.zip
Event: HRD
Original training data shape: (1095, 2048)
After dropping NaNs: (1082, 2048)

Original testing data shape: (366, 2048)
After dropping NaNs: (366, 2048)

Fitting GLM:
Training Confusion Matrix
[[794   0]
 [ 10 278]]
Testing Confusion Matrix
[[152  86]
 [ 82  46]]


For detecting daily precipitation of 40 mm or more, **GLM-PCA** gives the best testing result of the five feature vectors (244 of 366 days classified correctly, with 45 of the 128 event days detected). We will examine the other events with scripts.
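
To put the five testing confusion matrices side by side, the sketch below copies them from the outputs above and tabulates a few standard scores. The matrices are the notebook's own results; the choice of accuracy, precision, recall, and F1 as summary scores is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Testing confusion matrices for HRD, copied from the outputs above.
cms = {
    "PCA":  [[199, 39], [83, 45]],
    "CAE":  [[194, 44], [84, 44]],
    "CVAE": [[182, 56], [104, 24]],
    "PTBE": [[195, 43], [94, 34]],
    "PTIN": [[152, 86], [82, 46]],
}

rows = {}
for name, cm in cms.items():
    # scikit-learn layout: rows are true classes, columns are predictions.
    tn, fp, fn, tp = np.array(cm).ravel()
    rows[name] = {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

summary = pd.DataFrame.from_dict(rows, orient="index").round(3)
print(summary)
```

On these numbers PCA has the highest accuracy, precision, and F1, while PTIN has marginally higher recall (46 detected event days against 45).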