<a href="https://colab.research.google.com/github/thainguyen222/KHDLUD_NHOM1/blob/main/deep_lstm_to_predict_rainfall.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Forked from https://www.kaggle.com/ilya16/lstm-models?scriptVersionId=10420679, with refactoring, simplification, and some changes to the model.

The data preprocessing and the deep LSTM model are inspired by the top solution described here:
http://simaaron.github.io/Estimating-rainfall-from-weather-radar-readings-using-recurrent-neural-networks/

In [None]:
!rm -rf KHDLUD_NHOM1
!git clone https://github.com/thainguyen222/KHDLUD_NHOM1.git
!mv KHDLUD_NHOM1 ../input

Cloning into 'KHDLUD_NHOM1'...
remote: Enumerating objects: 140, done.
remote: Counting objects: 100% (49/49), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 140 (delta 18), reused 0 (delta 0), pack-reused 91
Receiving objects: 100% (140/140), 21.27 MiB | 6.86 MiB/s, done.
Resolving deltas: 100% (50/50), done.


In [None]:
import numpy as np
import pandas as pd

import os
print(os.listdir("../input"))

['test.csv', 'README.md', 'train.csv', '.git']


In [None]:
N_FEATURES = 22

# Upper bound on `Expected` used to drop outliers; value taken from
# http://simaaron.github.io/Estimating-rainfall-from-weather-radar-readings-using-recurrent-neural-networks/
THRESHOLD = 73

# Data preprocessing

## Training set

In [None]:
train_df = pd.read_csv("../input/train.csv")

In [None]:
# to reduce memory consumption
train_df[train_df.columns[1:]] = train_df[train_df.columns[1:]].astype(np.float32)

In [None]:
train_df.shape

(274999, 24)

Remove the Ids whose `Ref` value is NaN for every observation (sequences for which we have no radar data at all). An Id with at least one valid `Ref` reading is kept in full.

In [None]:
good_ids = set(train_df.loc[train_df['Ref'].notna(), 'Id'])
train_df = train_df[train_df['Id'].isin(good_ids)]
train_df.shape

(182313, 24)
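
To make the filter concrete, here is a minimal sketch on a toy frame (hypothetical data, not taken from `train.csv`). It shows that an Id survives as soon as one of its rows has a valid `Ref`, and that only Ids with no valid reading are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Id 1 has one valid Ref, Id 2 has none, Id 3 has only valid ones.
toy_df = pd.DataFrame({
    "Id":  [1, 1, 2, 2, 3],
    "Ref": [np.nan, 10.0, np.nan, np.nan, 5.0],
})

# Same two-step filter as in the cell above:
# collect the Ids with any non-NaN Ref, then keep all of their rows.
good_ids = set(toy_df.loc[toy_df["Ref"].notna(), "Id"])
filtered = toy_df[toy_df["Id"].isin(good_ids)]

print(sorted(good_ids))  # Ids that survive the filter
print(len(filtered))     # rows kept, including the NaN row of Id 1
```

The remaining NaN rows inside kept sequences are what the next step fills with zeros.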

Replace NaN values with zeros

In [None]:
train_df.fillna(0.0, inplace=True)
train_df.reset_index(drop=True, inplace=True)
train_df.head()

| | Id | minutes_past | radardist_km | Ref | Ref_5x5_10th | Ref_5x5_50th | Ref_5x5_90th | RefComposite | RefComposite_5x5_10th | RefComposite_5x5_50th | ... | RhoHV_5x5_90th | Zdr | Zdr_5x5_10th | Zdr_5x5_50th | Zdr_5x5_90th | Kdp | Kdp_5x5_10th | Kdp_5x5_50th | Kdp_5x5_90th | Expected |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1.0 | 2.0 | 9.0 | 5.0 | 7.5 | 10.5 | 15.0 | 10.5 | 16.5 | ... | 0.998333 | 0.375 | -0.125 | 0.3125 | 0.875 | 1.059998 | -1.410004 | -0.350006 | 1.059998 | 1.016001 |
| 1 | 2 | 6.0 | 2.0 | 26.5 | 22.5 | 25.5 | 31.5 | 26.5 | 26.5 | 28.5 | ... | 1.005 | 0.0625 | -0.1875 | 0.25 | 0.6875 | 0.0 | 0.0 | 0.0 | 1.409988 | 1.016001 |
| 2 | 2 | 11.0 | 2.0 | 21.5 | 15.5 | 20.5 | 25.0 | 26.5 | 23.5 | 25.0 | ... | 1.001667 | 0.3125 | -0.0625 | 0.3125 | 0.625 | 0.349991 | 0.0 | -0.350006 | 1.759995 | 1.016001 |
| 3 | 2 | 16.0 | 2.0 | 18.0 | 14.0 | 17.5 | 21.0 | 20.5 | 18.0 | 20.5 | ... | 1.001667 | 0.25 | 0.125 | 0.375 | 0.6875 | 0.349991 | -1.059998 | 0.0 | 1.059998 | 1.016001 |
| 4 | 2 | 21.0 | 2.0 | 24.5 | 16.5 | 21.0 | 24.5 | 24.5 | 21.0 | 24.0 | ... | 0.998333 | 0.25 | 0.0625 | 0.1875 | 0.5625 | -0.350006 | -1.059998 | -0.350006 | 1.759995 | 1.016001 |


In [None]:
train_df.shape

(182313, 24)

Exclude outliers from the training set: drop every row whose `Expected` value is at or above `THRESHOLD`.

In [None]:
train_df = train_df[train_df['Expected'] < THRESHOLD]

In [None]:
train_df.shape

(178995, 24)

### Grouping and padding into sequences

In [None]:
train_groups = train_df.groupby("Id")
train_groups.size()

Id
2        12
4        13
7        15
8        12
10       12
         ..
24376    14
24379    14
24381    11
24383    12
24385    11
Length: 14538, dtype: int64

In [None]:
train_groups = train_df.groupby("Id")
train_size = len(train_groups)

In [None]:
MAX_SEQ_LEN = train_groups.size().max()
MAX_SEQ_LEN

19

In [None]:
X_train = np.zeros((train_size, MAX_SEQ_LEN, N_FEATURES), dtype=np.float32)
y_train = np.zeros(train_size, dtype=np.float32)

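for i, (_, group) in enumerate(train_groups):
    X = group.values
    seq_len = X.shape[0]
    # columns 1..22 (minutes_past through Kdp_5x5_90th) are the features
    X_train[i, :seq_len, :] = X[:, 1:23]
    # the label is read from the first row of the group
    y_train[i] = X[0, 23]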
    
del train_groups
X_train.shape, y_train.shape

((14538, 19, 22), (14538,))
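
The padding step above can be illustrated on a tiny example. This is a minimal sketch with hypothetical sequences and only two features per time step; it follows the same pattern of pre-allocating zeros and copying each sequence into the leading time steps.

```python
import numpy as np

# Hypothetical sequences of different lengths, each with 2 features per time step.
sequences = [
    np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32),  # length 2
    np.array([[5.0, 6.0]], dtype=np.float32),              # length 1
]
max_len = max(len(s) for s in sequences)

# Pre-allocate zeros, then copy each sequence into the first len(seq) time steps.
padded = np.zeros((len(sequences), max_len, 2), dtype=np.float32)
for i, seq in enumerate(sequences):
    padded[i, :len(seq), :] = seq

print(padded.shape)
print(padded[1])  # the trailing time step of the shorter sequence stays all zeros
```

Note that the padding zeros look the same as the zeros written by `fillna(0.0)`, and the models in this notebook do not mask them.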

## Test set

In [None]:
test_df = pd.read_csv("../input/test.csv")
test_df[test_df.columns[1:]] = test_df[test_df.columns[1:]].astype(np.float32)
test_ids = test_df['Id'].unique()

# Convert all NaNs to zero
test_df = test_df.fillna(0.0)
test_df = test_df.reset_index(drop=True)

In [None]:
test_groups = test_df.groupby("Id")
test_size = len(test_groups)

X_test = np.zeros((test_size, MAX_SEQ_LEN, N_FEATURES), dtype=np.float32)

for i, (_, group) in enumerate(test_groups):
    X = group.values
    seq_len = X.shape[0]
    # same feature columns as in the training set
    X_test[i, :seq_len, :] = X[:, 1:23]
    
del test_groups
X_test.shape

(14570, 19, 22)
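
One caveat: `test_ids` comes from `unique()`, which keeps the order of appearance in the file, while `groupby("Id")` iterates in sorted Id order. The two agree only if `test.csv` is already sorted by Id. The sketch below, on hypothetical data, records the Ids from the groupby itself and also truncates any sequence longer than the padded length. Both are suggested safeguards under these assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted test frame with a single feature column.
toy_test = pd.DataFrame({
    "Id":   [7, 7, 3, 3, 3],
    "feat": [1.0, 2.0, 3.0, 4.0, 5.0],
})
max_len = 2  # stand-in for MAX_SEQ_LEN

groups = toy_test.groupby("Id")
ids = np.empty(len(groups), dtype=np.int64)
X = np.zeros((len(groups), max_len, 1), dtype=np.float32)

for i, (group_id, group) in enumerate(groups):
    values = group[["feat"]].to_numpy()[:max_len]  # truncate overly long sequences
    X[i, :len(values), :] = values
    ids[i] = group_id                              # Id recorded in groupby order

print(ids.tolist())                       # order in which the rows of X were filled
print(toy_test["Id"].unique().tolist())   # order of appearance in the file
```

Using `ids` rather than the `unique()` result when building the submission keeps each prediction attached to the right Id regardless of file order.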

# Models

In [None]:
from keras.layers import (
    Input,
    Dense,
    LSTM,
    AveragePooling1D,
    TimeDistributed,
    Flatten,
    Bidirectional,
    Dropout
)
from keras.models import Model

In [None]:
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=5)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, min_delta=0.01)

In [None]:
BATCH_SIZE = 1024
N_EPOCHS = 30

## Deep model

A deep bidirectional LSTM network inspired by the top solution linked at the top of the notebook.

In [None]:
def get_model_deep(shape=(19, 22)):
    inp = Input(shape)
    x = Dense(16)(inp)
    x = Bidirectional(LSTM(64, return_sequences=True))(x)
    x = TimeDistributed(Dense(64))(x)
    x = Bidirectional(LSTM(128, return_sequences=True))(x)
    x = TimeDistributed(Dense(1))(x)
    x = AveragePooling1D()(x)
    x = Flatten()(x)
    x = Dropout(0.5)(x)
    x = Dense(1)(x)

    model = Model(inp, x)
    return model

In [None]:
model = get_model_deep((19,22))
model.compile(optimizer='adam', loss='mae')
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 19, 22)]          0         
                                                                 
 dense (Dense)               (None, 19, 16)            368       
                                                                 
 bidirectional (Bidirectiona  (None, 19, 128)          41472     
 l)                                                              
                                                                 
 time_distributed (TimeDistr  (None, 19, 64)           8256      
 ibuted)                                                         
                                                                 
 bidirectional_1 (Bidirectio  (None, 19, 256)          197632    
 nal)                                                            
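
The parameter counts listed in the summary can be reproduced by hand. The sketch below uses the usual counts for a dense layer (weights plus biases) and for an LSTM with biases (four gates, doubled for the bidirectional wrapper). The helper names are hypothetical, and the pooled length assumes the Keras defaults for `AveragePooling1D` (`pool_size=2`, stride equal to the pool size, `'valid'` padding).

```python
def dense_params(n_in, n_out):
    # one weight per input/output pair plus one bias per output unit
    return n_in * n_out + n_out


def bilstm_params(n_in, units):
    # 4 gates, each with input weights, recurrent weights and a bias; 2 directions
    return 2 * 4 * (units * (n_in + units) + units)


counts = {
    "dense": dense_params(22, 16),
    "bidirectional": bilstm_params(16, 64),
    "time_distributed": dense_params(128, 64),
    "bidirectional_1": bilstm_params(64, 128),
}
for name, n_params in counts.items():
    print(name, n_params)

# Length of the time axis after average pooling 19 steps with window 2, stride 2.
pooled_len = (19 - 2) // 2 + 1
print(pooled_len)
```

Each bidirectional layer outputs twice its `units` per time step, which is why the inputs to the following dense layers are 128 and 256 wide.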
                                                             

In [None]:
model.fit(X_train, y_train,
          batch_size=BATCH_SIZE, epochs=N_EPOCHS,
          validation_split=0.2, callbacks=[early_stopping, reduce_lr])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30


<keras.callbacks.History at 0x7f18c6f68c10>
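
For orientation, the arithmetic behind `validation_split=0.2` and `BATCH_SIZE = 1024` can be sketched as follows. Keras takes the validation samples from the end of the arrays, before shuffling; the exact rounding used here (floor of the training fraction) is an assumption.

```python
import math

n_samples = 14538        # X_train.shape[0] from the cell output above
validation_split = 0.2
batch_size = 1024

# Validation samples are the last ones in the arrays; the rest are used for training.
n_train = int(math.floor(n_samples * (1.0 - validation_split)))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```

Because `X_train` was filled in sorted Id order, the held-out 20% consists of the sequences with the largest Ids rather than a random sample.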

In [None]:
y_pred = model.predict(X_test, batch_size=BATCH_SIZE)
submission = pd.DataFrame({'Id': test_ids, 'Expected': y_pred.reshape(-1)})
submission.to_csv('submission.csv', index=False)