# CS451 Covid Prediction

This notebook contains the group project solution for the cs451 final project, which aims to predict the spread of COVID-19 in Canada through usage of large quantities of Twitter data and other sources. The collection and sentiment analysis of Twitter data is performed separately, and is then used as an extra input alongside other covid-related data with the hopes of improving the model accuracy. We have chosen to implement the techniques outlined in a paper which boasts improvements on state-of-the-art techniques, including ARIMA, Simple Moving Average witha 6-day window, and Double Exponential Moving Average. Their technique uses a bidirectional LSTM and clusters countries by demographic, socioeconomic and health sector indicators to train the model on a richer dataset. We will include the Twitter sentiment analysis as an extra feature and see if this can improve the results even further.

In order to expand our dataset for better accuracy, we performed K-Means clustering on countries using several demographic, socioeconomic and health sector indicators to find countries similar to Canada. We then collected data pertaining to the degree of lockdown measures for each country as supplemental features to our model. The data consists of:

*   School closing:
   * 0: No measures.
   * 1: Safety precautions are required.
   * 2: Recommended closing.
   * 3: Require closing (only some levels/categories).
   * 4: Require closing at all levels.
*   Workplace closing:
   * 0: No measures.
   * 1: Safety precautions are required.
   * 2: Recommended closing or working from home.
   * 3: Require closing or working from home for some sectors or categories of workers.
   * 4: Require closing for all sectors except for essential workplaces.
*   Restrictions on gatherings:
   * 0: No restrictions.
   * 1: Restrictions on very large gatherings (limit > 1000 people).
   * 2: Restrictions on gatherings between 101-1000 people.
   * 3: Restrictions on gatherings between 11-100 people.
   * 4: Restrictions on gatherings of 10 people or less.
*   Public transport shutdown:
   * 0: No restrictions.
   * 1: Recommended closing or significantly reduce volume or routes or means of transportation available.
   * 2: Require closing.
*   International travel controls:
   * 0: No restrictions.
   * 1: Screening arrivals.
   * 2: Quarantine arrivals from some or all regions.
   * 3: Ban arrivals from some regions.
   * 4: Ban on all regions or total border closure.


In [2]:
from datetime import date

# Data for Canada was collected from https://www.cihi.ca/en/covid-19-intervention-timeline-in-canada
# In case of discrepancy for certain measures between provinces, the most common date
# of the measure implementation between the most populous provinces was selected.
canada_lockdown_measures = {
    'school-closing': [
        (date(2020, 1, 23), 0),
        (date(2020, 3, 17), 4),
        (date(2020, 9, 8), 1),
    ],
    'workplace-closing': [
        (date(2020, 1, 23), 0),
        (date(2020, 3, 17), 3),
        (date(2020, 3, 25), 4),
        (date(2020, 5, 5), 3),
    ],
    'gatherings': [
        (date(2020, 1, 23), 0),
        (date(2020, 3, 16), 2),
        (date(2020, 3, 27), 4),
    ],
    'public-transport': [
        (date(2020, 1, 23), 0),
    ],
    'international-travel': [
        (date(2020, 1, 23), 0),
        (date(2020, 3, 16), 2),
    ],
}

In [16]:
from datetime import timedelta
import pandas as pd
import numpy as np

def daterange(start_date, end_date, x):
    for n in range(int((end_date - start_date).days)):
        yield (start_date + timedelta(n), x)

measures = ['school-closing', 'workplace-closing', 'gatherings', 'public-transport', 'international-travel']

def generate_time_series(lockdown_measures):
    end_date = date(2020, 12, 10)
    data = {
        'school-closing': [],
        'workplace-closing': [],
        'gatherings': [],
        'public-transport': [],
        'international-travel': [],
    }
    for measure in measures:
        curr_date = date(2020, 1, 23)
        curr_val = 0
        for d, val in lockdown_measures[measure]:
            data[measure] += list(daterange(curr_date, d, curr_val))
            curr_date = d
            curr_val = val
        if curr_date < end_date:
            data[measure] += list(daterange(curr_date, end_date, curr_val))
    return data

canada_data = generate_time_series(canada_lockdown_measures)
canada_df = pd.DataFrame.from_dict(canada_data)
canada_df

Unnamed: 0,school-closing,workplace-closing,gatherings,public-transport,international-travel
0,"(2020-01-23, 0)","(2020-01-23, 0)","(2020-01-23, 0)","(2020-01-23, 0)","(2020-01-23, 0)"
1,"(2020-01-24, 0)","(2020-01-24, 0)","(2020-01-24, 0)","(2020-01-24, 0)","(2020-01-24, 0)"
2,"(2020-01-25, 0)","(2020-01-25, 0)","(2020-01-25, 0)","(2020-01-25, 0)","(2020-01-25, 0)"
3,"(2020-01-26, 0)","(2020-01-26, 0)","(2020-01-26, 0)","(2020-01-26, 0)","(2020-01-26, 0)"
4,"(2020-01-27, 0)","(2020-01-27, 0)","(2020-01-27, 0)","(2020-01-27, 0)","(2020-01-27, 0)"
...,...,...,...,...,...
317,"(2020-12-05, 1)","(2020-12-05, 3)","(2020-12-05, 4)","(2020-12-05, 0)","(2020-12-05, 2)"
318,"(2020-12-06, 1)","(2020-12-06, 3)","(2020-12-06, 4)","(2020-12-06, 0)","(2020-12-06, 2)"
319,"(2020-12-07, 1)","(2020-12-07, 3)","(2020-12-07, 4)","(2020-12-07, 0)","(2020-12-07, 2)"
320,"(2020-12-08, 1)","(2020-12-08, 3)","(2020-12-08, 4)","(2020-12-08, 0)","(2020-12-08, 2)"


In [None]:
import torch

# Use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
import torch.nn as nn

# Define the model
class BiLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, seq_len=1, batch_size=1):
        super(BiLSTM, self).__init__()

        # Number of features
        self.input_size = input_size

        # Number of features in hidden state
        self.hidden_size = hidden_size

        # Number of stacked recurrent layers
        self.num_layers = num_layers

        # Size of each input sequence 
        self.seq_len = seq_len
        self.batch_size = batch_size
        
        # We use a bidirectional LSTM to continuously update older predictions
        # based on newer data, which will hopefully improve accuracy by using
        # greater context.
        self.lstm = nn.LSTM(
            input_size,
            hidden_size,
            num_layers,
            bidirectional=True,
            batch_first=True
        )

        # The input to the linear layer will be the output of the LSTM.
        # Since we are using a bidirectional LSTM, we will have both the outputs
        # of the forward-LSTM and the backward-LSTM concatenated, which we will then
        # use to create our prediction. This is why the input size is twice the
        # size of the hidden output dimension. Our output size is 1 since we are
        # performing regression, and need a single value.
        self.fc = nn.Linear(hidden_size * 2, 1)
        
    def init_hidden(self):
        return (torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_size).to(device),
            torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_size).to(device))
    
    def forward(self, x, hidden):
        out, (hn, cn) = self.lstm(x.view(self.batch_size, self.seq_len, self.input_size), hidden)
        return self.fc(out.view(-1, self.batch_size, self.hidden_size * 2)), (hn.detach(), cn.detach())

Next we define a dataset class inheriting from `torch.utils.data.Dataset`, which will make loading data for training a cleaner and easier process. Our data consists of various relevant features, including:
 

*   The current day's Twitter sentiment related to COVID-19
*   The number of confirmed COVID-19 cases
*   The number of 

COVID-19, the number of confirmed COVID-19 cases, the number of

In [None]:
from torch.utils.data import Dataset

class CovidDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return torch.FloatTensor(self.data[idx]).to(device), \
            torch.FloatTensor([self.targets[idx]]).to(device)

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Set seed to ensure our results are reproducible
np.random.seed(42)

# Read data
df = pd.read_csv('canada-covid-data.csv', usecols=['location', 'total_cases'])
df = df[df['location'] == 'Canada']
dataset = np.array(df['total_cases'].dropna().values)
dataset = np.expand_dims(dataset, axis=1)

# Since LSTMs are sensitive to the scale of the input data,
# we normalize the inputs:
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)

dataset = np.expand_dims(dataset, axis=1)
print(dataset.shape)

# Split the data into train and test sets
train_size = int(len(dataset) * 0.75)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

(319, 1, 1)


Our input data will be a sequence of values, and our output will be the single value for the cumulative COVID-19 cases. Given input data $\{x_1, x_2, ..., x_{n-1}\}$ and output data $\{y_1, y_2, ..., y_n\}$, if we would like to predict a value $y_i$, we will use a sequence $x_{i-k-1},...,x_{i-1}$ to make the prediction. For this purpose, we have defined the `create_dataset` method to reshape the data in this format, where `look_back` corresponds to $k$.

In [None]:
from torch.utils.data import DataLoader

def create_dataset(dataset, look_back=1):
    data_x, data_y = [], []
    for i in range(len(dataset) - look_back-1):
        a = dataset[i:(i+look_back), 0]
        data_x.append(a)
        data_y.append(dataset[i + look_back, 0])
    return np.array(data_x), np.array(data_y)

look_back = 6
train_x, train_y = create_dataset(train, look_back)
test_x, test_y = create_dataset(test, look_back)

train_loader = DataLoader(CovidDataset(train_x, train_y), shuffle=False)
test_loader = DataLoader(CovidDataset(test_x, test_y), shuffle=False)

Now that our data is formatted we will create the model and perform the training.

In [None]:
import torch.optim as optim

epochs = 100
batch_size = 1
num_layers = 2
input_size = train_x.shape[2]
hidden_size = 256

model = BiLSTM(input_size, hidden_size, num_layers, look_back, batch_size=1)
model.to(device)

criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [None]:
train_mse = []
test_mse = []
torch.autograd.set_detect_anomaly(True)

for epoch in range(epochs):
    hidden = model.init_hidden()
    train_loss_tot = 0
    train_ctr = 0
    for i, (data, target) in enumerate(train_loader):
        train_ctr += 1
        optimizer.zero_grad()

        # Forward
        prediction, hidden = model(data, hidden)

        # Calculate loss
        loss = criterion(prediction, target.squeeze())

        # Backpropagate
        loss.backward()

        # Perform Adam step
        optimizer.step()

        train_loss_tot += loss.item()

    train_mse += [train_loss_tot / train_ctr]

    hidden = model.init_hidden()
    test_loss_tot = 0
    test_ctr = 0
    with torch.no_grad():
        for i, (data, target) in enumerate(test_loader):
            test_ctr += 1

            # Calculate prediction
            prediction, hidden = model(data, hidden)
            
            # Calculate loss
            loss = criterion(prediction, target.squeeze())

            test_loss_tot += loss.item()
    
    test_mse += [test_loss_tot / test_ctr]
    print('[INFO] epoch: {}, train MSE: {:.5f}, test MSE: {:.5f}'.format(epoch, train_mse[-1], test_mse[-1]))


  return F.mse_loss(input, target, reduction=self.reduction)


[INFO] epoch: 0, train MSE: 0.00025, test MSE: 0.09050
[INFO] epoch: 1, train MSE: 0.00080, test MSE: 0.10920
[INFO] epoch: 2, train MSE: 0.00113, test MSE: 0.10763
[INFO] epoch: 3, train MSE: 0.00123, test MSE: 0.10974
[INFO] epoch: 4, train MSE: 0.00134, test MSE: 0.10504
[INFO] epoch: 5, train MSE: 0.00138, test MSE: 0.10742
[INFO] epoch: 6, train MSE: 0.00135, test MSE: 0.10493
[INFO] epoch: 7, train MSE: 0.00136, test MSE: 0.10686
[INFO] epoch: 8, train MSE: 0.00129, test MSE: 0.10312
[INFO] epoch: 9, train MSE: 0.00133, test MSE: 0.10434


KeyboardInterrupt: ignored