# Final Project - COVID Case/Death Prediction


## Introduction
Throughout this course we have learned various machine learning methods. Using packages such as sklearn, keras, xgboost, and imblearn, we applied different learning techniques to different datasets. In this project, we train a neural network to predict the cumulative COVID-19 cases and deaths in Canada for April 18, 19, and 20, 2022. The model is trained on past data, and the most recent data is then fed to it to produce the predictions.

## Packages
The packages we need to build these neural networks are numpy and pandas, MinMaxScaler from sklearn.preprocessing, the Sequential model class from keras.models, and the LSTM, Dense, and Dropout layers from keras.layers.

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM,Dense ,Dropout
from keras import callbacks

#Needed packages

## Seed
Here we use NumPy to set a seed so that the results of training the model do not vary too much from run to run. We set future to 3, the number of days we plan to predict, and past to 56, the number of past days each prediction is based on.

In [3]:
np.random.seed(17) #Set seed
future = 3 #How many predictions
past = 56  #How many past observations these predictions are based on
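
As a small illustration of what seeding does, the sketch below (not part of the original notebook) resets NumPy's generator with the same seed twice and gets the same draws both times. Note that np.random.seed controls only NumPy's own generator; whether it also fixes the initial weights of a Keras model depends on the Keras version and backend, so treat that as an assumption to verify.

```python
import numpy as np

np.random.seed(17)               # same seed as the notebook
first_draw = np.random.rand(3)

np.random.seed(17)               # reset the generator to the same state
second_draw = np.random.rand(3)

np.random.seed(18)               # a different seed starts a different stream
other_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))   # identical draws after reseeding
print(np.array_equal(first_draw, other_draw))
```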

## Training Case Data Set
In this cell, we use Pandas to load the dataset from which we build our training set. Initially, the dataset consists of 177,143 observations and 67 variables. The training set only needs total_cases and total_deaths from Canada's observations, so the next step is to filter the dataset by location so that only Canadian rows remain. We then select the needed column by position using iloc, convert it to a NumPy array using values, and store it in its own variable. The last 56 days are held back for the test set.

In [4]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv' #Link to dataset
dataset = pd.read_csv(url,index_col=0,parse_dates=['date']) #Load dataset, parsing the date column (not the iso_code index) as dates
dataset = dataset[dataset['location']=='Canada'] #Filter for Canada data
dataset = dataset.fillna(0) #Replace null values with 0
training_setcase = dataset.iloc[:-past,3:4].values #Create training set of total_cases as an array
dataset.tail(3)

| iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | total_cases_per_million | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CAN | North America | Canada | 2024-08-02 | 4819017.0 | 0.0 | 46.71 | 55282.0 | 0.0 | 0.29 | 124133.45 | ... | 16.6 | 0.0 | 2.5 | 82.43 | 0.93 | 38454328 | 0.0 | 0.0 | 0.0 | 0.0 |
| CAN | North America | Canada | 2024-08-03 | 4819017.0 | 0.0 | 46.71 | 55282.0 | 0.0 | 0.29 | 124133.45 | ... | 16.6 | 0.0 | 2.5 | 82.43 | 0.93 | 38454328 | 0.0 | 0.0 | 0.0 | 0.0 |
| CAN | North America | Canada | 2024-08-04 | 4819055.0 | 38.0 | 5.43 | 55282.0 | 0.0 | 0.0 | 124134.43 | ... | 16.6 | 0.0 | 2.5 | 82.43 | 0.93 | 38454328 | 0.0 | 0.0 | 0.0 | 0.0 |
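
To make the positional selection easier to follow, here is a minimal sketch on a tiny synthetic frame (the values are made up and only the first few OWID columns are reproduced). It shows that, once iso_code is the index, iloc[:, 3:4] picks total_cases and iloc[:, 6:7] picks total_deaths, and that selecting by column name gives the same array.

```python
import pandas as pd

# Tiny synthetic frame mimicking the OWID column order (values are made up)
toy = pd.DataFrame(
    {
        "iso_code": ["CAN", "CAN", "USA"],
        "continent": ["North America"] * 3,
        "location": ["Canada", "Canada", "United States"],
        "date": ["2022-01-01", "2022-01-02", "2022-01-01"],
        "total_cases": [100.0, 120.0, 900.0],
        "new_cases": [10.0, 20.0, 50.0],
        "new_cases_smoothed": [9.0, 15.0, 45.0],
        "total_deaths": [1.0, None, 7.0],
    }
).set_index("iso_code")

canada = toy[toy["location"] == "Canada"].fillna(0)   # filter rows, replace nulls with 0

by_position = canada.iloc[:, 3:4].to_numpy()          # what the notebook does
by_name = canada[["total_cases"]].to_numpy()          # same column, selected by name
deaths_column = canada.iloc[:, 6:7].columns[0]        # name of the column at position 6

print(by_position.shape, deaths_column)
```

Selecting by name is more robust if the column order of the source file ever changes.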


## Training Data Set Case, Ctd
After the training set has been created, the next step is to scale it. We use MinMaxScaler to normalize the case counts to the range 0 to 1.

In [23]:
sc = MinMaxScaler(feature_range=(0,1)) #create function to normalize data between 0 and 1
training_set_scaledcase = sc.fit_transform(training_setcase) # normalize the total_cases array
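
For reference, with feature_range=(0,1) the scaling amounts to x' = (x - min) / (max - min) per column. The sketch below reproduces that arithmetic with plain NumPy on made-up numbers; it is an illustration of the formula, not a replacement for MinMaxScaler.

```python
import numpy as np

# Made-up cumulative case counts, shaped (n_samples, 1) like training_setcase
cases = np.array([[100.0], [150.0], [300.0], [500.0]])

lo, hi = cases.min(), cases.max()
scaled = (cases - lo) / (hi - lo)        # x' = (x - min) / (max - min)
restored = scaled * (hi - lo) + lo       # the inverse transform

print(scaled.ravel())
```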

## Training Data Set, Ctd

We now prepare the training data by creating two empty lists, one for the inputs and one for the targets. The past and future variables defined earlier dictate how many values each prediction is based on and how many predictions are made. In this case we use 8 weeks (56 days) of data to predict the next three days. A for loop slides over the scaled series and, at each position, appends a window of past values to the input list and the following future values to the target list. After the lists are built, we convert them to NumPy arrays and reshape the input array to three dimensions (samples, time steps, features) so that it can be used to train our model.

In [24]:
x_traincase = [] #empty list for x train
y_traincase = [] #empty list for y train
#Slide over the series: each sample pairs `past` input values with the next `future` target values
for i in range(0,len(training_set_scaledcase)-past-future+1):
    x_traincase.append(training_set_scaledcase[i : i + past , 0])
    y_traincase.append(training_set_scaledcase[i + past : i + past + future , 0 ])
x_traincase , y_traincase = np.array(x_traincase), np.array(y_traincase)   #transform list to array
x_traincase = np.reshape(x_traincase, (x_traincase.shape[0] , x_traincase.shape[1], 1) )  #reshape dataset
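
The windowing is easier to see on a toy series. The sketch below runs the same loop on the numbers 0 to 9 with smaller, hypothetical window sizes (4 past values, 2 future values) so the resulting shapes can be checked by hand.

```python
import numpy as np

series = np.arange(10, dtype=float).reshape(-1, 1)   # stand-in for the scaled series
past_demo, future_demo = 4, 2                         # smaller than 56 and 3, for readability

x_demo, y_demo = [], []
for i in range(0, len(series) - past_demo - future_demo + 1):
    x_demo.append(series[i : i + past_demo, 0])                           # input window
    y_demo.append(series[i + past_demo : i + past_demo + future_demo, 0])  # targets that follow it

x_demo, y_demo = np.array(x_demo), np.array(y_demo)
x_demo = np.reshape(x_demo, (x_demo.shape[0], x_demo.shape[1], 1))  # (samples, time steps, features)

print(x_demo.shape, y_demo.shape)
print(x_demo[0, :, 0], y_demo[0])
```

A series of length n yields n - past - future + 1 samples, which is 5 in this toy case.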

## Model Building for Cases
Here we begin building our model. We start with Sequential to build a layered model. The layers use LSTM, which stands for Long Short-Term Memory. We chose an LSTM network after research, since it "has only recently become a viable and powerful forecasting technique". (https://towardsdatascience.com/lstm-framework-for-univariate-time-series-prediction-d9e7252699e#:~:text=LSTM%20methodology%2C%20while%20introduced%20in,based%20models%20like%20LSTM%20offer) The model has four layers ahead of the output: three LSTM layers followed by a Dropout layer. The first two LSTM layers set return_sequences to True so that each one passes its full output sequence, rather than only its last output, on to the next LSTM layer. The Dropout layer uses a standard rate of 0.2 to help mitigate overfitting. A final Dense layer outputs one value per predicted day. After adding all the layers, we compile the model with the adam optimizer, mean absolute error as the loss, and accuracy as a metric. Because this is a regression problem, the loss is the more informative number to watch; accuracy is not very meaningful for continuous targets. We then fit the model to our x and y training values with 100 epochs and a batch_size of 64.

In [None]:
regressorcase = Sequential()
#Three stacked LSTM layers; the first two return full sequences for the next LSTM layer
regressorcase.add(LSTM(units = past, return_sequences = True, input_shape = (x_traincase.shape[1], 1)))
regressorcase.add(LSTM(units = past, return_sequences = True))
regressorcase.add(LSTM(units = past))
regressorcase.add(Dropout(0.2)) #Dropout to help mitigate overfitting
regressorcase.add(Dense(units = future)) #Output layer: one value per predicted day
#Compile the model using the adam optimizer, mean_absolute_error for loss and accuracy for metrics
regressorcase.compile(optimizer ="adam", loss ="mean_absolute_error",
                                             metrics =['accuracy'])
#Fit the model to x train and y train with 100 epochs and a batch size of 64
regressorcase.fit(x_traincase, y_traincase, epochs=100,batch_size= 64)

Epoch 1/100
4808/6709 ━━━━━━━━━━━━━━━━━━━━ 3:18 105ms/step - accuracy: 0.3315 - loss: 0.0020
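
To make the loss value in the log concrete, here is a minimal sketch of mean absolute error computed by hand on made-up scaled targets and predictions. The exact-match score next to it is only a stand-in for a classification-style metric (it is not how Keras resolves 'accuracy'); it illustrates why such metrics say little about continuous targets.

```python
import numpy as np

# Made-up scaled targets and predictions: two samples, three days each
y_true = np.array([[0.50, 0.52, 0.54], [0.60, 0.61, 0.63]])
y_pred = np.array([[0.51, 0.52, 0.53], [0.58, 0.62, 0.63]])

mae = np.mean(np.abs(y_true - y_pred))     # the quantity minimised as the loss
exact_match = np.mean(y_true == y_pred)    # fraction of predictions that hit exactly

print(round(float(mae), 4), round(float(exact_match), 4))
```

Here the predictions are off by less than 0.01 on average, yet only a third of them match exactly, so the loss describes the fit far better than a hit-or-miss score.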

## Test Data Set and Prediction for Cases

Our test dataset is the last 8 weeks of data from the same dataset we used to create our training data.

Just as we did for the training data, we normalize the test data with the scaler fitted on the training set, convert it to a NumPy array, and reshape it into a single sample of 56 time steps. We then feed this test sample to the model to predict the next three days of COVID cases. Finally, we reverse the scaling that was originally applied so that the predictions are in readable units. The predictions come out as a NumPy array.

In [None]:

testconfirm = dataset.iloc[-past:,3:4].values #create test dataset from the last 56 days of total_cases
testingcase = sc.transform(testconfirm.reshape(-1,1)) #scale the test data with the scaler fitted on the training set
testingcase = np.array(testingcase) #make sure it is an array
testingcase = np.reshape(testingcase,(testingcase.shape[1],testingcase.shape[0],1)) #reshape to one sample of 56 time steps: (1, 56, 1)
predictedcase = regressorcase.predict(testingcase) #predict the next days based on the testing data
predictedcase = sc.inverse_transform(predictedcase) #reverse the process of normalizing the dataset
predictedcase = np.reshape(predictedcase,(predictedcase.shape[1],predictedcase.shape[0])) #reshape the predictions from (1, 3) to (3, 1)
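
The shape handling in this step can be sketched without a trained model. In the example below, the training range, the last 56 values, and the "prediction" are all hypothetical stand-ins; only the reshaping and the inverse scaling mirror the cell above.

```python
import numpy as np

past_demo, future_demo = 56, 3
train_min, train_max = 0.0, 4_000_000.0          # hypothetical training-set range

# Hypothetical last 56 days of cumulative cases, shaped (56, 1)
last_window = np.linspace(3_500_000.0, 3_600_000.0, past_demo).reshape(-1, 1)

scaled = (last_window - train_min) / (train_max - train_min)
model_input = scaled.reshape(1, past_demo, 1)    # one sample, 56 time steps, 1 feature

# Stand-in for regressorcase.predict(model_input): shape (1, future)
fake_prediction = np.array([[0.90, 0.91, 0.92]])

cases = fake_prediction * (train_max - train_min) + train_min   # undo the scaling
cases = cases.reshape(future_demo, 1)                            # one row per predicted day

print(model_input.shape, cases.shape)
```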


## Process for Deaths
We repeat the same process for the deaths prediction.

In [None]:
training_setdeath = dataset.iloc[:-past,6:7].values
training_set_scaleddeath = sc.fit_transform(training_setdeath)
x_traindeath = []
y_traindeath = []
for i in range(0,len(training_set_scaleddeath)-past-future+1):
    x_traindeath.append(training_set_scaleddeath[i : i + past , 0])
    y_traindeath.append(training_set_scaleddeath[i + past : i + past + future , 0 ])
x_traindeath , y_traindeath = np.array(x_traindeath), np.array(y_traindeath)
x_traindeath = np.reshape(x_traindeath, (x_traindeath.shape[0] , x_traindeath.shape[1], 1) )
regressordeath = Sequential()
regressordeath.add(LSTM(units = past, return_sequences = True, input_shape = (x_traindeath.shape[1], 1)))
regressordeath.add(LSTM(units = past, return_sequences = True))
regressordeath.add(LSTM(units = past))
regressordeath.add(Dropout(0.2))
regressordeath.add(Dense(units = future))
regressordeath.compile(optimizer ="adam", loss ="mean_absolute_error",
                                             metrics =['accuracy'])
regressordeath.fit(x_traindeath, y_traindeath, epochs=100,batch_size= 64 )
testdeath = dataset.iloc[-past:,6:7].values
testingdeath = sc.transform(testdeath.reshape(-1,1))
testingdeath = np.array(testingdeath)
testingdeath = np.reshape(testingdeath,(testingdeath.shape[1],testingdeath.shape[0],1))
predicteddeath = regressordeath.predict(testingdeath)
predicteddeath = sc.inverse_transform(predicteddeath)
predicteddeath = np.reshape(predicteddeath,(predicteddeath.shape[1],predicteddeath.shape[0]))
#repeat the exact same process except with deaths instead of cases

## Final Results

We then convert the values in these arrays to integers (astype(int) drops the decimal part rather than rounding to the nearest integer) and put both arrays into a list. We use pandas to build a DataFrame from this list. The DataFrame holds the predicted values for each category and date.

In [None]:
case = (np.squeeze(predictedcase.astype(int)))
death =  (np.squeeze(predicteddeath.astype(int)))#from float to integer for both case and death array
combined = [case, death] #put case and death into a list
column_names = ['April 18, 2022', 'April 19, 2022', 'April 20, 2022'] #create column names for the DataFrame
df = pd.DataFrame(columns = column_names, index=[ 'Total Cases', 'Total Deaths'], data=combined) #turn the predictions to DataFrame
df.index.name = 'Canada' #create index name
print(df) #print final results
df.to_csv("FinalResults.csv")
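
The difference between truncating and rounding is shown in the short sketch below, using made-up prediction values. The np.rint variant is an alternative offered for comparison, not what the notebook does.

```python
import numpy as np

# Made-up predictions shaped (3, 1), like predictedcase after the final reshape
predicted_demo = np.array([[3601234.9], [3605678.2], [3609999.6]])

truncated = np.squeeze(predicted_demo.astype(int))           # what the notebook does
rounded = np.squeeze(np.rint(predicted_demo).astype(int))    # nearest-integer alternative

print(truncated)
print(rounded)
```

For cumulative counts in the millions the difference is at most one case per value, so either choice is reasonable as long as it is stated.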

## Reference
https://medium.com/analytics-vidhya/weather-forecasting-with-recurrent-neural-networks-1eaa057d70c3 (used for guidance)