# Preface

In this notebook, we consider a sequence prediction problem. Our goal is to illustrate a setting different from the IMDB sentiment analysis problem: here the model predicts an output at every time step rather than a single label per sequence.

Goals:
1. Use the `return_sequences` keyword
2. Exploit properties of the task to choose the right scaling and activation functions

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pathlib
import tqdm
sns.set(font_scale=1.5, style='darkgrid')

# Covid 19 Dataset

Information: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

In [None]:
import kaggle
kaggle.api.authenticate()

In [None]:
kaggle.api.dataset_download_files(
    'sudalairajkumar/novel-corona-virus-2019-dataset',
    path='./data/covid19',
    quiet=False,
    unzip=True,
    force=False,
)

# Some Minimal Data Exploration

Let us read in the CSV files and look at their contents.

In [None]:
data_confirmed = pd.read_csv('data/covid19/time_series_covid_19_confirmed.csv')
data_deaths = pd.read_csv('data/covid19/time_series_covid_19_deaths.csv')

In [None]:
data_confirmed.head(5)

In [None]:
data_deaths.head(5)

We extract NumPy arrays of the counts (the first four columns hold metadata, so they are skipped), along with the country and province names for labelling.

In [None]:
number_confirmed = np.asarray(data_confirmed)[:, 4:].astype('float64')
number_deaths = np.asarray(data_deaths)[:, 4:].astype('float64')
countries = np.asarray(data_confirmed['Country/Region'])
provinces = np.asarray(data_confirmed['Province/State'].fillna(''))
names = [f'{c} {p}' for c, p in zip(countries, provinces)]

The numbers are rather large, so we apply a logarithmic scaling to control their magnitude. Why is 1.0 added?

In [None]:
number_confirmed = np.log(1.0 + number_confirmed)
number_deaths = np.log(1.0 + number_deaths)
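
On the question above: a count of zero would give `log(0) = -inf`, and many series start with zero deaths. Adding 1.0 maps a zero count to exactly 0 and keeps every transformed value non-negative. Below is a small sketch with made-up counts (the array `counts` is illustrative, not taken from the dataset). `np.log1p` and `np.expm1` are NumPy's built-in forms of this transform and its inverse.

```python
import numpy as np

# Illustrative cumulative counts, including zeros (not real data)
counts = np.array([0.0, 0.0, 3.0, 10.0, 250.0])

# Plain log breaks on zeros: log(0) evaluates to -inf
with np.errstate(divide='ignore'):
    plain_log = np.log(counts)

# log(1 + x) maps 0 -> 0 and keeps every value non-negative
scaled = np.log(1.0 + counts)

# np.log1p computes the same quantity; np.expm1 inverts it
recovered = np.expm1(np.log1p(counts))

print(plain_log[:2])
print(scaled)
print(recovered)
```

The inverse is useful later if predictions on the log scale need to be read as raw counts.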

Here, we can plot the numbers and see some rough trends

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
for i in range(5):
    ax[0].plot(number_confirmed[i], label=countries[i])
    ax[1].plot(number_deaths[i], label=countries[i])
for a in ax:
    a.legend()
    a.set_xlabel('days')
    a.set_ylabel('numbers (log)')

# LSTM Model

Here, we will build a model that links the confirmed cases to the number of deaths.

We know that there is a link, but there is also a time lag: we cannot just use the same day's confirmed cases to predict that day's number of deaths.

However, we should expect a link if the model can look at the whole history of cumulative confirmed counts up to that day.

We now hold out 20% of the countries/provinces as a test set.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test, names_train, names_test = train_test_split(
    number_confirmed[:, :, None], number_deaths[:, :, None], names,
    test_size=0.2, random_state=123)
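
The `[:, :, None]` indexing above appends a trailing feature axis, since recurrent layers in Keras expect inputs of shape `(samples, timesteps, features)`. Here is a minimal sketch with a placeholder array (the sizes 6 and 4 are arbitrary assumptions):

```python
import numpy as np

# Placeholder data: 6 regions, 4 days each (arbitrary sizes)
series = np.arange(24, dtype='float64').reshape(6, 4)

# Indexing with None appends an axis of length 1: one feature per time step
series_3d = series[:, :, None]

print(series.shape)
print(series_3d.shape)

# np.expand_dims is an equivalent, more explicit spelling
same = np.expand_dims(series, axis=-1)
```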

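Now, we build a simple LSTM model for this, using the built-in layers from `keras`. Since the log-scaled counts are never negative, a ReLU activation on the output layer matches the range of the targets.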

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tqdm.keras import TqdmCallback

In [None]:
model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=[None, x_train.shape[-1]]))
model.add(Dense(1, activation='relu'))
model.compile(loss='mse', optimizer=Adam(1e-4))
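
To see what `return_sequences=True` changes, here is a rough NumPy sketch of a plain recurrent layer. It is deliberately simpler than an LSTM and uses random weights; the function `simple_rnn` and all sizes are hypothetical and only for illustration. With `return_sequences=True` the layer emits its hidden state at every time step, so the following `Dense(1)` layer yields one prediction per day. With `False`, only the final state is returned.

```python
import numpy as np

def simple_rnn(x, w_in, w_rec, return_sequences=False):
    """Toy tanh RNN over x of shape (batch, timesteps, features)."""
    batch, timesteps, _ = x.shape
    hidden = np.zeros((batch, w_rec.shape[0]))
    states = []
    for t in range(timesteps):
        # New hidden state depends on the current input and the previous state
        hidden = np.tanh(x[:, t, :] @ w_in + hidden @ w_rec)
        states.append(hidden)
    if return_sequences:
        # One hidden state per time step: (batch, timesteps, units)
        return np.stack(states, axis=1)
    # Only the last hidden state: (batch, units)
    return hidden

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 7, 1))          # 2 series, 7 days, 1 feature
w_in = rng.normal(size=(1, 4)) * 0.1    # input-to-hidden weights, 4 units
w_rec = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights

all_states = simple_rnn(x, w_in, w_rec, return_sequences=True)
last_state = simple_rnn(x, w_in, w_rec, return_sequences=False)

print(all_states.shape)
print(last_state.shape)
```

The last time step of the full sequence is the same state that the `return_sequences=False` call returns, so the keyword only controls how much of the computation is exposed.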

In [None]:
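# File in which the trained weights are cached between runs
model_save_path = pathlib.Path('covid_lstm.h5')

if model_save_path.exists():
    # Reuse previously trained weights instead of retraining
    model.load_weights(str(model_save_path))
else:
    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=32,
        validation_data=(x_test, y_test),
        epochs=100,
        verbose=0,  # silence the default log; tqdm shows progress instead
        callbacks=[TqdmCallback(verbose=1)]
    )
    model.save_weights(str(model_save_path))
    # Training curves are only available when the model was trained in this run
    results = pd.DataFrame(history.history)
    results['epoch'] = history.epoch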

Finally, let us look at the predictions for the test countries/provinces.

In [None]:
y_pred = np.squeeze(model.predict(x_test))

In [None]:
n_rows = 3
n_cols = 5

fig, ax = plt.subplots(n_rows, n_cols, figsize=(5*n_cols, 4*n_rows), sharex=True, sharey=True)

for i in range(n_rows):
    for j in range(n_cols):
        count = i * n_cols + j
        ax[i,j].plot(y_test[count], label='True')
        ax[i,j].plot(y_pred[count], label='Predicted')
        
        ax[i,j].legend()
        ax[i,j].set_title(names_test[count])
        ax[i,j].set_xlabel('days')
        ax[i,j].set_ylabel('numbers (log)')

fig.tight_layout()

# Exercise

1. Try to modify the target to be a 10-day advance prediction, i.e. the task is to predict the number of deaths 10 days ahead, given the confirmed cases known so far.
2. Try without log scaling, or without the ReLU activation. Experiments of this kind are called ablation studies.
3. Try improving the model in other ways (we will learn some techniques in the following classes)
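
For the first exercise, one possible way to build the shifted targets is to pair the confirmed counts on day `t` with the deaths on day `t + 10`. This is a sketch with placeholder arrays, not the only approach; the helper `make_lagged_pairs` and the variable names are hypothetical, and the real inputs would be `number_confirmed` and `number_deaths`.

```python
import numpy as np

def make_lagged_pairs(inputs, targets, lag):
    """Align inputs at day t with targets at day t + lag.

    Both arrays have shape (samples, timesteps); the last `lag` days of
    the inputs have no target yet, so they are dropped.
    """
    if lag <= 0:
        raise ValueError('lag must be a positive number of days')
    return inputs[:, :-lag], targets[:, lag:]

# Placeholder series: 3 regions, 30 days (not real data)
days = np.arange(30, dtype='float64')
confirmed = np.tile(days, (3, 1))
deaths = np.tile(days * 0.1, (3, 1))

x_lag, y_lag = make_lagged_pairs(confirmed, deaths, lag=10)
print(x_lag.shape, y_lag.shape)
```

Both outputs would still need the trailing feature axis (`[:, :, None]`) before being passed to the model.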