# PyTorch LSTM with TensorFlow-like initialization

The purpose of this notebook is to reproduce the notebook by Dmitry Uarov,

https://www.kaggle.com/dmitryuarov/ventilator-pressure-eda-lstm-0-189/notebook

(public LB 0.189) with PyTorch. This sounds easy, but it is not always so. At first I got a significantly worse score of about 0.3 using the same features and the same model. Since I had heard that the default weight initializations differ between PyTorch and TensorFlow, I try to make them as similar as I can in this notebook.

The default weight initializations, according to the official documentation
([Keras](https://keras.io/api/layers/recurrent_layers/lstm/),
[PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)), are as follows:

| Parameter | TensorFlow/Keras | PyTorch |
| ---       | --- | --- |
| weight_ih | Xavier (Glorot) uniform | U(-1/√hidden_size, 1/√hidden_size) |
| weight_hh | orthogonal     | same as above                    |
| bias      | 1 for forget gate, 0 otherwise | same as above    |
| linear    | Xavier (Glorot) uniform | U(-1/√in_features, 1/√in_features) |
I wrote `_reinitialize()` in class `Model`, which is the main content of this notebook.

For me, using Xavier uniform for the fully connected (linear) layers after the LSTM mattered most, even though it looked like the least important change to me. The TensorFlow initialization scheme for the LSTM helped, too.

Remaining uncertainties:

* Each LSTM weight is actually 4 matrices packed in one tensor. Should I initialize the 4 matrices separately?
* The two biases `bias_ih` and `bias_hh` in the PyTorch LSTM seem redundant. For the forget gate, I set only one of them to 1 because I saw somewhere that Keras has only one bias, but I am not sure.
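
On the second point, a minimal sketch, assuming the gate equations given in the PyTorch `nn.LSTM` documentation, where the two biases only ever appear as a sum. Setting the forget-gate slice of `bias_ih` to 1 and leaving `bias_hh` at 0 then gives an effective forget bias of 1. As far as I can tell, Keras also keeps the four gate kernels packed in one array and initializes it as a whole, so initializing the packed PyTorch tensor in one call seems to be the closer analogue. The sizes below are small hypothetical values, for illustration only.

```python
import torch
import torch.nn as nn

hidden = 4  # hypothetical small size, for illustration only
lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)

# PyTorch packs the gates along dim 0 in the order
# input (i), forget (f), cell (g), output (o).
print(tuple(lstm.weight_ih_l0.shape))  # (4 * hidden, input_size)
print(tuple(lstm.bias_ih_l0.shape))    # (4 * hidden,)

with torch.no_grad():
    lstm.bias_ih_l0.zero_()
    lstm.bias_hh_l0.zero_()
    lstm.bias_ih_l0[hidden:2 * hidden].fill_(1)  # forget-gate slice only

# The gate pre-activations use bias_ih + bias_hh, so this sum is the
# effective bias: one row per gate, the forget row is all ones.
effective_bias = (lstm.bias_ih_l0 + lstm.bias_hh_l0).detach()
print(effective_bias.view(4, hidden))
```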

Change log

* Version 3: The public score is computed from 5 models (5 folds) trained locally. In version 2 it was computed from 2 folds trained in the notebook.

In [None]:
import numpy as np
import pandas as pd
import math
import time
import pickle
import argparse
import sklearn.preprocessing
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau
from sklearn.model_selection import KFold

debug = False

def set_seed(seed=42):
    np.random.seed(seed)
    torch.manual_seed(seed)
set_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

## Features and Dataset

From: https://www.kaggle.com/dmitryuarov/ventilator-pressure-eda-lstm-0-189/notebook

In [None]:
def create_features(df):
    df = df.copy()
    df['area'] = df['time_step'] * df['u_in']
    df['area'] = df.groupby('breath_id')['area'].cumsum()

    df['u_in_cumsum'] = (df['u_in']).groupby(df['breath_id']).cumsum()

    df['u_in_lag2'] = df['u_in'].shift(2).fillna(0)
    df['u_in_lag4'] = df['u_in'].shift(4).fillna(0)

    df['R'] = df['R'].astype(str)
    df['C'] = df['C'].astype(str)
    df = pd.get_dummies(df)

    g = df.groupby('breath_id')['u_in']
    df['ewm_u_in_mean'] = g.ewm(halflife=10).mean()\
                           .reset_index(level=0, drop=True)
    df['ewm_u_in_std'] = g.ewm(halflife=10).std()\
                          .reset_index(level=0, drop=True)
    df['ewm_u_in_corr'] = g.ewm(halflife=10).corr()\
                           .reset_index(level=0, drop=True)

    df['rolling_10_mean'] = g.rolling(window=10, min_periods=1).mean()\
                             .reset_index(level=0, drop=True)
    df['rolling_10_max'] = g.rolling(window=10, min_periods=1).max()\
                            .reset_index(level=0, drop=True)
    df['rolling_10_std'] = g.rolling(window=10, min_periods=1).std()\
                            .reset_index(level=0, drop=True)

    df['expand_mean'] = g.expanding(2).mean()\
                         .reset_index(level=0, drop=True)
    df['expand_max'] = g.expanding(2).max()\
                        .reset_index(level=0, drop=True)
    df['expand_std'] = g.expanding(2).std()\
                        .reset_index(level=0, drop=True)
    df = df.fillna(0)

    df.drop(['id', 'breath_id'], axis=1, inplace=True)
    if 'pressure' in df.columns:
        df.drop('pressure', axis=1, inplace=True)

    return df


class Dataset(torch.utils.data.Dataset):
    def __init__(self, X, y, w):
        if y is None:
            y = np.zeros(len(X), dtype=np.float32)

        self.X = X.astype(np.float32)
        self.y = y.astype(np.float32)
        self.w = w.astype(np.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        return self.X[i], self.y[i], self.w[i]

In [None]:
n = 100*1024 if debug else None

di = '/kaggle/input/ventilator-pressure-prediction/'
train = pd.read_csv(di + 'train.csv', nrows=n)
test = pd.read_csv(di + 'test.csv', nrows=n)
submit = pd.read_csv(di + 'sample_submission.csv', nrows=n)

features = create_features(train)
rs = sklearn.preprocessing.RobustScaler()
features = rs.fit_transform(features)  # => np.ndarray

X_all = features.reshape(-1, 80, features.shape[-1])
y_all = train.pressure.values.reshape(-1, 80)
w_all = 1 - train.u_out.values.reshape(-1, 80)  # weights for the score, but not used in this notebook

input_size = X_all.shape[2]

print(len(X_all))

## Model

In [None]:
class Model(nn.Module):
    def __init__(self, input_size):
        hidden = [400, 300, 200, 100]
        super().__init__()
        self.lstm1 = nn.LSTM(input_size, hidden[0],
                             batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden[0], hidden[1],
                             batch_first=True, bidirectional=True)
        self.lstm3 = nn.LSTM(2 * hidden[1], hidden[2],
                             batch_first=True, bidirectional=True)
        self.lstm4 = nn.LSTM(2 * hidden[2], hidden[3],
                             batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * hidden[3], 50)
        self.selu = nn.SELU()
        self.fc2 = nn.Linear(50, 1)
        self._reinitialize()

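    def _reinitialize(self):
        """
        TensorFlow/Keras-like initialization.

        PyTorch packs the four LSTM gates along dim 0 of each weight and
        bias in the order (input, forget, cell, output), so the forget
        gate occupies the second quarter.
        """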
        for name, p in self.named_parameters():
            if 'lstm' in name:
                if 'weight_ih' in name:
                    nn.init.xavier_uniform_(p.data)
                elif 'weight_hh' in name:
                    nn.init.orthogonal_(p.data)
                elif 'bias_ih' in name:
                    p.data.fill_(0)
                    # Set forget-gate bias to 1
                    n = p.size(0)
                    p.data[(n // 4):(n // 2)].fill_(1)
                elif 'bias_hh' in name:
                    p.data.fill_(0)
            elif 'fc' in name:
                if 'weight' in name:
                    nn.init.xavier_uniform_(p.data)
                elif 'bias' in name:
                    p.data.fill_(0)

    def forward(self, x):
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(x)
        x, _ = self.lstm3(x)
        x, _ = self.lstm4(x)
        x = self.fc1(x)
        x = self.selu(x)
        x = self.fc2(x)

        return x

In [None]:
model = Model(input_size)
for name, p in model.named_parameters():
    print('%-32s %s' % (name, tuple(p.shape)))

## Training

In [None]:
criterion = torch.nn.L1Loss()

def evaluate(model, loader_val):
    tb = time.time()
    was_training = model.training
    model.eval()

    loss_sum = 0
    score_sum = 0
    n_sum = 0
    y_pred_all = []

    for ibatch, (x, y, w) in enumerate(loader_val):
        n = y.size(0)
        x = x.to(device)
        y = y.to(device)
        w = w.to(device)

        with torch.no_grad():
            y_pred = model(x).squeeze(-1)  # (batch, 80, 1) -> (batch, 80), safe for batch size 1

        loss = criterion(y_pred, y)

        n_sum += n
        loss_sum += n*loss.item()
        
        y_pred_all.append(y_pred.cpu().detach().numpy())

    loss_val = loss_sum / n_sum

    model.train(was_training)

    d = {'loss': loss_val,
         'time': time.time() - tb,
         'y_pred': np.concatenate(y_pred_all, axis=0)}

    return d

In [None]:
nfold = 5
kfold = KFold(n_splits=nfold, shuffle=True, random_state=228)
epochs = 2 if debug else 300
lr = 1e-3
batch_size = 1024
max_grad_norm = 1000
log = {}

for ifold, (idx_train, idx_val) in enumerate(kfold.split(X_all)):
    print('Fold %d' % ifold)
    tb = time.time()
    model = Model(input_size)
    model.to(device)
    model.train()

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=10)

    X_train = X_all[idx_train]
    y_train = y_all[idx_train]
    w_train = w_all[idx_train]
    X_val = X_all[idx_val]
    y_val = y_all[idx_val]
    w_val = w_all[idx_val]

    dataset_train = Dataset(X_train, y_train, w_train)
    dataset_val = Dataset(X_val, y_val, w_val)
    loader_train = torch.utils.data.DataLoader(dataset_train, shuffle=True,
                         batch_size=batch_size, drop_last=True)
    loader_val = torch.utils.data.DataLoader(dataset_val, shuffle=False,
                         batch_size=batch_size, drop_last=False)

    losses_train = []
    losses_val = []
    lrs = []
    time_val = 0
    best_score = np.inf
   
    print('epoch loss_train loss_val lr time time_val')
    for iepoch in range(epochs):
        loss_train = 0
        n_sum = 0
        
        for ibatch, (x, y, w) in enumerate(loader_train):
            n = y.size(0)
            x = x.to(device)
            y = y.to(device)

            optimizer.zero_grad()

            y_pred = model(x).squeeze(-1)  # (batch, 80, 1) -> (batch, 80)

            loss = criterion(y_pred, y)
            loss_train += n*loss.item()
            n_sum += n

            loss.backward()
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

            optimizer.step()

        val = evaluate(model, loader_val)
        loss_val = val['loss']
        time_val += val['time']

        losses_train.append(loss_train / n_sum)
        losses_val.append(val['loss'])
        lrs.append(optimizer.param_groups[0]['lr'])

        print('%3d %9.6f %9.6f %7.3e %7.1f %6.1f' %
              (iepoch + 1,
               losses_train[-1], losses_val[-1], 
               lrs[-1], time.time() - tb, time_val))

        scheduler.step(losses_val[-1])


    ofilename = 'model%d.pth' % ifold
    torch.save(model.state_dict(), ofilename)
    print(ofilename, 'written')

    log['fold%d' % ifold] = {
        'loss_train': np.array(losses_train),
        'loss_val': np.array(losses_val),
        'learning_rate': np.array(lrs),
        'y_pred': val['y_pred'],
        'idx': idx_val
    }
    
    if ifold >= 1: # due to time limit
        break


In [None]:
print('Fold loss_train loss_val best loss_val')
for ifold in range(2):
    d = log['fold%d' % ifold]
    print('%4d %9.6f %9.6f %9.6f' % (ifold, d['loss_train'][-1], d['loss_val'][-1], np.min(d['loss_val'])))

I trained all 5 folds locally. The final losses, one row per fold, are:

```
loss_train loss_val
0.119303  0.184425 
0.105525  0.184154
0.109591  0.179805
0.127961  0.191654
0.141102  0.202042
```

The corresponding losses of the original TensorFlow notebook at the end of training are:

```
loss val_loss (epoch 300)
0.1351 0.1902
0.1365 0.1897
0.1292 0.1972
0.1221 0.1970
0.1276 0.1976
```

I am satisfied with the similarity.

Note that the loss here is the overall MAE including the expiratory phase, which is not the evaluation metric.
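
For reference, a minimal sketch of an MAE restricted to the inspiratory phase, using the same `w = 1 - u_out` weights that are prepared in this notebook. The function name `masked_mae` and the tiny arrays are made up for illustration; this is my reading of the metric, not an official implementation.

```python
import numpy as np

def masked_mae(y_true, y_pred, w):
    """MAE over the time steps where w == 1 (inspiratory phase, u_out == 0)."""
    w = w.astype(np.float64)
    return float(np.sum(w * np.abs(y_true - y_pred)) / np.sum(w))

# Made-up example: 2 breaths of 4 steps each (real breaths have 80 steps).
y_true = np.array([[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]])
y_pred = np.array([[1.5, 2.0, 0.0, 0.0], [2.0, 3.0, 9.0, 9.0]])
u_out = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
w = 1 - u_out

score = masked_mae(y_true, y_pred, w)                 # inspiratory steps only
overall = float(np.mean(np.abs(y_true - y_pred)))     # all steps, like the loss above
print(score, overall)
```

With the training log of this notebook, the same function could be applied per fold as `masked_mae(y_all[d['idx']], d['y_pred'], w_all[d['idx']])`.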

## Predict and submit

In [None]:
features = create_features(test)
features = rs.transform(features)

X_test = features.reshape(-1, 80, features.shape[-1])
y_test = np.zeros(len(features)).reshape(-1, 80)
w_test = 1 - test.u_out.values.reshape(-1, 80)

dataset_test = Dataset(X_test, y_test, w_test)
loader_test = torch.utils.data.DataLoader(dataset_test, batch_size=batch_size)

y_pred_folds = np.zeros((len(test), 5), dtype=np.float32)
for ifold in range(5):
    model = Model(input_size)
    model.to(device)
    filename = '/kaggle/input/pytorchlstmwithtensorflowlikeinitialization/' \
               'model%d.pth' % ifold
    model.load_state_dict(torch.load(filename, map_location=device))
    model.eval()
    
    y_preds = []
    for x, y, _ in loader_test:
        x = x.to(device)
        with torch.no_grad():
            y_pred = model(x).squeeze(-1)  # (batch, 80, 1) -> (batch, 80)

        y_preds.append(y_pred.cpu().numpy())
    
    y_preds = np.concatenate(y_preds, axis=0)
    y_pred_folds[:, ifold] = y_preds.flatten()

submit.pressure = np.mean(y_pred_folds, axis=1)
submit.to_csv('submission.csv', index=False)
print('submission.csv written')

Minor differences from the original, for no particular reason:

* The learning rate scheduler is ReduceLROnPlateau, which is used in [other notebooks](https://www.kaggle.com/tenffe/finetune-of-tensorflow-bidirectional-lstm);
* I have not implemented early stopping;
* Random seeds other than the one for KFold are not fixed in the original. Mine is not strictly deterministic either.
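
If early stopping were added, it could look something like the following minimal sketch. The class `EarlyStopping`, its `patience` and `min_delta` parameters, and the loss values are all hypothetical; nothing like this is used in this notebook or taken from the original.

```python
class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=30, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, loss_val):
        """Record one epoch; return True when training should stop."""
        if loss_val < self.best - self.min_delta:
            self.best = loss_val
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, for illustration only.
stopper = EarlyStopping(patience=2)
history = [0.30, 0.25, 0.26, 0.27, 0.20]
stopped_at = None
for epoch, loss_val in enumerate(history):
    if stopper.step(loss_val):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In the training loop, `stopper.step(losses_val[-1])` would be called once per epoch, next to `scheduler.step(...)`, and the best weights would have to be saved separately.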

Public notebooks offer more features, other loss functions, better aggregation of the fold predictions, and so on, but my goal here is to reproduce a score as good as the TensorFlow one using the same model and features.