# Data-flow for Experiments

After testing in the previous notebooks, we wrap up here with a concrete plan for the experiments.

## Input Data: NOAA GridSat in East Asia

### Raw Data
The details of the data are described in the [official documentation](https://developers.google.com/earth-engine/datasets/catalog/NOAA_CDR_GRIDSAT-B1_V2).

The downloaded NOAA-GridSat-B1 images are stored in netCDF4 format (.nc files). The main variable, brightness temperature, is stored as int16 under the name 'irwin_cdr', with a scale factor of 0.01 and an offset of 200. Missing values are flagged as -31999.
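The packing described above can be undone with a few lines of NumPy. This is a minimal sketch using only the scale factor, offset, and missing-value flag quoted in the text; the function name `decode_irwin` and the sample values are illustrative assumptions, and readers such as netCDF4-python usually apply this conversion automatically when the variable is read.

```python
import numpy as np

SCALE_FACTOR = 0.01      # from the description above
ADD_OFFSET = 200.0       # from the description above
MISSING_VALUE = -31999   # from the description above

def decode_irwin(raw):
    """Convert packed int16 counts to brightness temperature; missing -> NaN.

    `decode_irwin` is a hypothetical helper name, not part of any library.
    """
    raw = np.asarray(raw)
    values = raw.astype(np.float32) * SCALE_FACTOR + ADD_OFFSET
    values[raw == MISSING_VALUE] = np.nan
    return values

# Made-up packed values (NOT real data), including one missing pixel.
packed = np.array([[0, 1000], [-31999, 9500]], dtype=np.int16)
decoded = decode_irwin(packed)
print(decoded)
```

The missing pixel becomes NaN so that later steps (cropping, resizing, rescaling) can decide explicitly how to treat it.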

### Preprocessing
1. Read raw data: -70° to 69.93°N, -180° to 179.94°E, at 0.07° intervals. shape=(2000, 5143)
2. Crop to the East Asia region (100-160°E, 0-60°N) -> shape=(858, 858)
3. Resize the cropped data to a (2^N, 2^N) domain for easier processing (using OpenCV, `cv2`; `img` stands for the cropped array):
    - cv2.resize(img, (512, 512)) -> (512, 512)
    - cv2.resize(img, (256, 256)) -> (256, 256)
4. Rescale the brightness temperature values to (0, 1) by dividing by the maximum value.
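Steps 2 and 4 can be sketched with NumPy alone. The coordinate arrays below are rebuilt from the grid description in step 1, the helper names `crop_east_asia` and `rescale_by_max` are illustrative assumptions, and the input image is synthetic. Step 3 would sit between the two calls (for example `cv2.resize(cropped, (256, 256))`) and is left out so that the sketch has no dependency beyond NumPy.

```python
import numpy as np

# Grid from step 1: 0.07-degree spacing starting at -70 (lat) and -180 (lon).
# Rounding guards the comparisons below against floating-point noise.
lat = np.round(-70.0 + 0.07 * np.arange(2000), 2)
lon = np.round(-180.0 + 0.07 * np.arange(5143), 2)

def crop_east_asia(field, lat, lon):
    """Step 2: keep grid points with 0 <= lat <= 60 and 100 <= lon <= 160."""
    rows = np.where((lat >= 0) & (lat <= 60))[0]
    cols = np.where((lon >= 100) & (lon <= 160))[0]
    return field[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def rescale_by_max(field):
    """Step 4: divide by the largest non-NaN value so the maximum becomes 1."""
    return field / np.nanmax(field)

# Synthetic stand-in for one decoded image (values are NOT real data).
rng = np.random.default_rng(0)
image = 180.0 + 140.0 * rng.random((2000, 5143), dtype=np.float32)

cropped = crop_east_asia(image, lat, lon)
scaled = rescale_by_max(cropped)
print(cropped.shape)
```

With this grid the crop should come out as (858, 858), matching step 2. Note that dividing by the maximum puts the largest value at exactly 1 but does not move the minimum to 0.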


## Output Data: Weather Events in Taiwan Area

- **HRD**: Precip >= 40mm/day
- **HRH**: Precip >= 10mm/hr
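- **CS**: Cold surge; the Taipei station records a temperature below 10°C in any hour within a 24-hour period
- **TYW**: The Central Weather Bureau issues a typhoon warning
- **NWPTC**: A tropical cyclone is present in the northwestern Pacific (this flag appears as column `NWPTY` in the event table below)
- **FT**: Based on the Central Weather Bureau surface weather chart; from 2000 onward the 00Z chart is used as the representative
- **NE**: The daily mean wind direction at the Pengjiayu station is northeasterly (15-75 degrees) and the wind speed reaches 4 m/s
- **SWF**: In CFSR at 850 hPa, within the red region, mean u > 0 and mean v > 0 and the mean wind speed reaches 3 m/s, or winds above 6 m/s cover 30% of the red region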
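Two of the simpler definitions, HRD (daily precipitation of at least 40 mm) and NE (daily mean wind direction within 15-75 degrees with a speed of at least 4 m/s), can be written as threshold tests. The sketch below is only an illustration: the function names, the inclusive bounds, and the sample values are assumptions, and the event table used in this notebook is read from a file rather than computed this way.

```python
import numpy as np

def flag_hrd(daily_precip_mm):
    """HRD: 1 where daily precipitation is at least 40 mm, else 0."""
    return (np.asarray(daily_precip_mm) >= 40.0).astype(int)

def flag_ne(mean_wind_dir_deg, mean_wind_speed_ms):
    """NE: 1 where direction is within 15-75 degrees and speed is at least 4 m/s."""
    direction = np.asarray(mean_wind_dir_deg)
    speed = np.asarray(mean_wind_speed_ms)
    return ((direction >= 15) & (direction <= 75) & (speed >= 4.0)).astype(int)

# Made-up values for four days (NOT real observations).
precip = [0.0, 40.0, 12.5, 85.0]
wind_dir = [30.0, 200.0, 60.0, 75.0]
wind_speed = [5.2, 6.0, 3.1, 4.0]

hrd = flag_hrd(precip)
ne = flag_ne(wind_dir, wind_speed)
print(hrd, ne)
```

Each function returns a 0/1 array with one entry per day, the same encoding used by the columns of the event table.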

In [1]:
import numpy as np
import pandas as pd
import os, re

# Read all events (index: dates as YYYYMMDD integers; columns: 0/1 event flags)
events = pd.read_csv('../data/tad_filtered.csv', index_col=0)

print(events.head())
print(events.shape)
# Report the number of event days and the empirical frequency of each event type
for c in events.columns:
    count = events[c].sum()
    print(f'{c}\t counts: {count}\t prob:{count / events.shape[0]}')

          CS  TYW  NWPTY  FT  NE  SWF  HRD  HRH
20130101   0  0.0      1   0   0    0    1    0
20130102   0  0.0      1   0   1    0    0    0
20130103   0  0.0      1   0   1    0    1    1
20130104   0  0.0      1   0   1    0    1    0
20130105   0  0.0      1   0   1    0    1    1
(1461, 8)
CS	 counts: 12	 prob:0.008213552361396304
TYW	 counts: 65.0	 prob:0.044490075290896644
NWPTY	 counts: 702	 prob:0.4804928131416838
FT	 counts: 244	 prob:0.16700889801505817
NE	 counts: 471	 prob:0.32238193018480493
SWF	 counts: 406	 prob:0.2778918548939083
HRD	 counts: 420	 prob:0.2874743326488706
HRH	 counts: 520	 prob:0.35592060232717315


## Methods

Feature extraction (dimension reduction), followed by a generalized linear model (logistic regression) fitted on the extracted features.
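The classifier itself is not shown in this notebook, so the following is only a sketch of what fitting a logistic regression on a feature matrix involves. It uses plain batch gradient descent in NumPy as a stand-in for a library implementation; the function names, hyperparameters, and synthetic data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)       # predicted event probabilities
        grad = p - y                 # gradient of the loss w.r.t. the logits
        w -= lr * (X.T @ grad) / n_samples
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-in: 200 "days", 5 features, event driven mostly by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)

w, b = fit_logistic(X, y)
accuracy = ((sigmoid(X @ w + b) > 0.5) == (y == 1)).mean()
print(f"training accuracy: {accuracy:.2f}")
```

In the experiments, `X` would be one of the feature tables built below and `y` one column of the event table, with one model fitted per event type.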

### Principal Component Analysis
> python utils/ipca_transform_preproc_gridsatb1.py -i \[PATH_TO_DATA\] -o \[PREFIX_FOR_OUTPUT\] -m \[PATH_TO_MODEL\] 
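The script itself is not reproduced here. As a rough illustration of what a PCA projection does to a stack of images, the sketch below flattens each image, centers the data, and projects onto the leading principal axes obtained from an SVD. The `ipca` prefix in the script name suggests an incremental (batch-wise) variant, but that is an inference; this sketch uses ordinary PCA on synthetic data, and the helper names are assumptions.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the training mean and the top principal axes (as rows) of X."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Express each centered sample in the basis of the principal axes."""
    return (X - mean) @ components.T

# Synthetic stand-in: 50 random 16x16 "images" (NOT real data).
rng = np.random.default_rng(0)
images = rng.random((50, 16, 16))
X = images.reshape(len(images), -1)   # flatten to (50, 256)

mean, components = fit_pca(X, n_components=8)
features = project(X, mean, components)
print(features.shape)
```

The notebook cell that follows reads 2048 such projected columns per image from the script's output file.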


In [2]:
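# PCA
tmp = pd.read_csv('D:/workspace/noaa/ws.pca/w256_proj.csv')
# Turn dotted timestamp strings into integer keys comparable with events.index
tmp.index = [int(d.replace('.', '')) for d in tmp.timestamp]
# Select the 2048 feature columns for the event dates; reindex() leaves dates
# that are missing from the feature file as NaN rows instead of raising KeyError
fv_pca = tmp.reindex(index=events.index, columns=np.arange(2048).astype('str'))
print(fv_pca.head())
print(fv_pca.shape)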

                  0          1         2         3         4         5  \
20130101  10.258309  -4.656837 -3.172782  0.632345  2.013453  3.179133   
20130102   8.503345  -6.036067 -2.736589 -3.880598  2.831698  2.121468   
20130103   7.555368 -10.951109 -2.391194 -0.284879  2.710904 -1.378869   
20130104   7.380538 -10.050537 -3.675916  1.638377  2.332644  0.610044   
20130105   7.389496  -7.858804 -3.157701 -1.307389  2.557700  0.431949   

                 6         7         8         9  ...      2038      2039  \
20130101 -1.438060  0.716551  3.222975 -1.491041  ... -0.066616  0.034927   
20130102  1.351488  1.792430 -0.812001 -1.764388  ... -0.019680 -0.009342   
20130103  0.165945  2.193658 -2.956463  0.340051  ... -0.026055  0.284996   
20130104  0.575185  2.279190 -0.802140 -1.236430  ... -0.106318  0.014793   
20130105 -0.242018  2.094918  0.823646 -2.035190  ... -0.031707  0.090493   

              2040      2041      2042      2043      2044      2045  \
20130101  0.031727 -




### Convolutional Auto-Encoder
> python utils/cae_encode_preproc_gridsatb1.py -i \[PATH_TO_DATA\] -o \[PREFIX_FOR_OUTPUT\] -m \[PATH_TO_MODEL\] 


In [3]:
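# CAE
tmp = pd.read_csv('D:/workspace/noaa/ws.cae/cae256_encoded.csv')
# Turn dotted timestamp strings into integer keys comparable with events.index
tmp.index = [int(d.replace('.', '')) for d in tmp.timestamp]
# Select the 2048 feature columns for the event dates; reindex() leaves dates
# that are missing from the feature file as NaN rows instead of raising KeyError
fv_cae = tmp.reindex(index=events.index, columns=np.arange(2048).astype('str'))
print(fv_cae.head())
print(fv_cae.shape)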

                 0    1         2    3         4    5         6    7  \
20130101  0.729401  0.0  0.709286  0.0  0.606639  0.0  0.433708  0.0   
20130102  0.530478  0.0  1.175602  0.0  0.779702  0.0  0.934944  0.0   
20130103  0.779093  0.0  0.913878  0.0  0.692427  0.0  0.786661  0.0   
20130104  0.706526  0.0  0.808630  0.0  0.629371  0.0  0.457699  0.0   
20130105  0.612746  0.0  0.688398  0.0  0.703917  0.0  1.020927  0.0   

                 8    9  ...      2038  2039      2040  2041      2042  2043  \
20130101  0.921396  0.0  ...  0.442833   0.0  0.441170   0.0  0.549486   0.0   
20130102  0.977582  0.0  ...  0.695305   0.0  0.531373   0.0  0.576476   0.0   
20130103  0.966315  0.0  ...  0.534694   0.0  0.601885   0.0  0.647570   0.0   
20130104  0.489022  0.0  ...  0.593137   0.0  0.661233   0.0  0.652576   0.0   
20130105  0.812755  0.0  ...  0.647179   0.0  0.635999   0.0  0.637713   0.0   

              2044  2045      2046  2047  
20130101  0.573760   0.0  0.635431   0.0  


### Variational Auto-Encoder
> python utils/cvae_encode_preproc_gridsatb1.py -i \[PATH_TO_DATA\] -o \[PREFIX_FOR_OUTPUT\] -m \[PATH_TO_MODEL\] 


In [4]:
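# CVAE
tmp = pd.read_csv('D:/workspace/noaa/ws.cvae/cvae256_encoded.csv')
# Turn dotted timestamp strings into integer keys comparable with events.index
tmp.index = [int(d.replace('.', '')) for d in tmp.timestamp]
# Select the 2048 feature columns for the event dates; reindex() leaves dates
# that are missing from the feature file as NaN rows instead of raising KeyError
fv_cvae = tmp.reindex(index=events.index, columns=np.arange(2048).astype('str'))
print(fv_cvae.head())
print(fv_cvae.shape)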

                 0         1         2         3         4         5  \
20130101  0.414171 -0.466501  0.752024  1.123145 -0.331437 -1.435205   
20130102 -1.547030  2.590840  1.038288  2.328560 -1.464619  0.307699   
20130103  1.062752  0.015614 -0.039290  1.356836 -0.407510 -0.546152   
20130104  0.449923 -0.004109  0.743570  1.791207  0.150745  0.199603   
20130105 -0.435087 -2.080692  0.073629  0.404752 -1.289809 -0.265119   

                 6         7         8         9  ...      2038      2039  \
20130101  1.552950 -0.028973 -1.523870 -0.214476  ...  0.039675  1.934730   
20130102 -0.806043  1.047620  1.701204  0.253724  ...  0.340050 -2.037837   
20130103 -3.368214 -0.734818  1.048689  1.018091  ...  0.418223 -0.170373   
20130104 -1.170279 -0.008397  0.759227  0.591091  ... -0.916299  0.219440   
20130105  1.564287  0.818954  0.542338  0.280204  ...  0.493177 -1.972590   

              2040      2041      2042      2043      2044      2045  \
20130101 -1.876483 -0.039899  1.

### Pre-trained ResNet50

- [ResNet50 for BigEarthNet](https://tfhub.dev/google/remote_sensing/bigearthnet-resnet50/1)
> python ../utils/pretrained_encode_preproc_gridsatb1.py -i ../data/256/ -o rn50bigearth -m "https://tfhub.dev/google/remote_sensing/bigearthnet-resnet50/1" -b 128


- [Feature vectors of images with ResNet 50](https://tfhub.dev/tensorflow/resnet_50/feature_vector/1)
> python ../utils/pretrained2_encode_preproc_gridsatb1.py -i ../data/256/ -o rn50imagenet -b 128

In [5]:
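# ResNet50 pre-trained on the BigEarthNet dataset
tmp = pd.read_csv('D:/workspace/noaa/ws.pretr/rn50bigearth_features.csv')
# Turn dotted timestamp strings into integer keys comparable with events.index
tmp.index = [int(d.replace('.', '')) for d in tmp.timestamp]
# Select the 2048 feature columns for the event dates; reindex() leaves dates
# that are missing from the feature file as NaN rows instead of raising KeyError
fv_ptbe = tmp.reindex(index=events.index, columns=np.arange(2048).astype('str'))
print(fv_ptbe.head())
print(fv_ptbe.shape)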

                 0         1         2        3         4         5         6  \
20130101  0.910615  0.022137  0.033863  0.00000  0.227198  0.083715  0.132397   
20130102  1.191630  0.017410  0.017347  0.00000  0.199252  0.011307  0.131674   
20130103  1.306608  0.057959  0.019189  0.02313  0.465368  0.101091  0.082207   
20130104  1.419659  0.035566  0.000000  0.00000  0.136999  0.147608  0.258210   
20130105  1.204523  0.064296  0.031918  0.00000  0.145212  0.036379  0.105308   

                 7         8         9  ...      2038      2039      2040  \
20130101  0.202215  0.966795  0.097639  ...  0.218303  0.695137  0.069968   
20130102  0.276117  1.002557  0.040593  ...  0.112136  0.479326  0.021380   
20130103  0.539429  1.222262  0.488504  ...  0.032564  0.756178  0.017557   
20130104  0.276583  1.239150  0.271416  ...  0.304184  0.385106  0.013154   
20130105  0.393798  0.725334  0.102611  ...  0.199956  0.930488  0.071619   

              2041      2042      2043      2044  

In [6]:
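# ResNet50 pre-trained on ImageNet
tmp = pd.read_csv('D:/workspace/noaa/ws.pretr/rn50imagenet_features.csv')
# Turn dotted timestamp strings into integer keys comparable with events.index
tmp.index = [int(d.replace('.', '')) for d in tmp.timestamp]
# Select the 2048 feature columns for the event dates; reindex() leaves dates
# that are missing from the feature file as NaN rows instead of raising KeyError
fv_ptin = tmp.reindex(index=events.index, columns=np.arange(2048).astype('str'))
print(fv_ptin.head())
print(fv_ptin.shape)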

                 0         1    2         3         4         5    6  \
20130101  0.073251  0.159820  0.0  2.294162  0.686011  0.007495  0.0   
20130102  0.233406  0.037223  0.0  1.995400  0.210638  0.000000  0.0   
20130103  0.461574  0.304058  0.0  1.916351  0.848359  0.000000  0.0   
20130104  0.055840  0.043495  0.0  2.017454  0.301700  0.000000  0.0   
20130105  0.159815  0.080647  0.0  1.476036  0.031678  0.000000  0.0   

                 7         8    9  ...      2038      2039      2040  \
20130101  0.228225  0.000000  0.0  ...  0.041753  0.000000  0.000000   
20130102  0.052133  0.000000  0.0  ...  0.000000  0.000000  0.000000   
20130103  0.102642  0.000817  0.0  ...  0.000000  0.000000  0.755562   
20130104  0.371948  0.044949  0.0  ...  0.316018  0.000000  0.003577   
20130105  0.176256  0.005338  0.0  ...  0.000000  0.001598  0.000000   

              2041      2042      2043      2044      2045      2046      2047  
20130101  0.021072  0.146149  0.000000  0.000000  0.0

In [7]:
# Save feature vectors
fv_pca.to_csv('../data/fv_pca.csv')
fv_cae.to_csv('../data/fv_cae.csv')
fv_cvae.to_csv('../data/fv_cvae.csv')
fv_ptbe.to_csv('../data/fv_ptbe.csv')
fv_ptin.to_csv('../data/fv_ptin.csv')
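Before one of these tables is fed to a model, dates without feature values need to be handled. The sketch below assumes that such dates show up as all-NaN rows after alignment with the event table and simply drops them; the tiny tables and the choice of the `HRD` column are made up for illustration, and dropping is only one possible policy.

```python
import pandas as pd

# Tiny stand-ins for `events` and one feature table (values are made up).
events = pd.DataFrame({'HRD': [1, 0, 1, 0]},
                      index=[20130101, 20130102, 20130103, 20130104])
features = pd.DataFrame({'0': [0.5, 1.5, 2.5], '1': [1.0, 2.0, 3.0]},
                        index=[20130101, 20130102, 20130104])

# Align on the event dates; 20130103 has no features and becomes a NaN row.
aligned = features.reindex(events.index)

# Keep only the dates for which every feature is present.
complete = aligned.dropna(how='any')
X = complete.to_numpy()
y = events.loc[complete.index, 'HRD'].to_numpy()

print(X.shape, y.tolist())
```

Selecting `y` through `complete.index` keeps the labels and the feature rows in the same order, which matters once rows have been dropped.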