COVID-19's effect on the entire world is apparent as we are struggling to fight this deadly virus. 
The first COVID-19 case was reported on December 31, 2019, in Wuhan, China. 
On January 21, 2020, the CDC (Centers for Disease Control and Prevention) confirmed the first COVID-19 case in the U.S. ("A Timeline of Covid-19 Developments in 2020"). 

Since then, cases have risen exponentially and numerous precautionary measures have been taken to prevent the spread of the virus. By May 26, 2021, there were 168 million cases around the world with 3.49 million deaths. AI, machine learning, and data science allow us to analyze the spread of COVID-19 and better understand the virus. They also help inform the world of new discoveries about COVID-19. 

![download](https://user-images.githubusercontent.com/75640165/119728300-2691f100-be28-11eb-9dcd-3a74360577aa.jpg)

Predictions for confirmed cases, recoveries, and deaths were made based on the dataset. To make these predictions, I used Prophet.
[Prophet](https://facebook.github.io/prophet/#:~:text=Prophet%20is%20a%20procedure%20for,daily%20seasonality%2C%20plus%20holiday%20effects.&text=Prophet%20is%20open%20source%20software,download%20on%20CRAN%20and%20PyPI.) is a forecasting tool from Facebook that uses historical data to predict future behavior. Prophet models the underlying trend in the data rather than producing point predictions alone.
It builds forecasts from yearly, weekly, and daily seasonality, plus the effects of irregular holidays.



# Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
import plotly.express as px
%matplotlib inline

In [None]:
world = pd.read_csv('../input/corona-virus-report/worldometer_data.csv')

In [None]:
full = pd.read_csv('../input/corona-virus-report/full_grouped.csv') # used in Prophet model

In [None]:
covid = pd.read_csv('../input/corona-virus-report/country_wise_latest.csv')

In [None]:
day = pd.read_csv('../input/corona-virus-report/day_wise.csv')

In [None]:
covid.head()

In [None]:
day.head()

In [None]:
world.head()

In [None]:
full.head()

In [None]:
covid.info()

In [None]:
covid.describe()

The describe() method presents quick summary statistics for the numeric columns of the data (count, mean, standard deviation, minimum, quartiles, and maximum).
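
As a rough illustration of what describe() reports for a single numeric column, the sketch below recomputes the same statistics with Python's standard library. The numbers are made up for the example and the helper name `describe_column` is hypothetical; pandas is not used here.

```python
import statistics

def describe_column(values):
    """Return describe()-style summary statistics for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" interpolates linearly between data points,
    # which mirrors the default quantile interpolation in pandas.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation (n - 1)
        "min": ordered[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": ordered[-1],
    }

# Made-up values, not taken from the dataset.
summary = describe_column([10, 20, 30, 40, 50])
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```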

In [None]:
day.info()

Using the info() method, we can see the data type and the number of non-null items for each column in the data frame.

In [None]:
day.describe()

In [None]:
world.info()

In [None]:
world.describe()

In [None]:
full.info()

In [None]:
full.describe()

# Exploratory Data Analysis

In [None]:
data = dict(type='choropleth',
            locations = covid['Country/Region'],
            locationmode = 'country names',
            z = covid['Confirmed'],
            text = covid['Country/Region'],
            colorbar = {'title':'Confirmed Cases'}
            ) 

layout = dict(title='Covid Cases (CONFIRMED)',
             geo=dict(showframe=False,
                     projection={'type':'natural earth'}))

choromap1=go.Figure(data=[data],layout=layout)
iplot(choromap1)

Confirmed COVID-19 cases are concentrated in the U.S. and Brazil, which have the most confirmed cases of any country. This data is visualized with a choropleth map created using Plotly.

In [None]:
data = dict(type='choropleth',
            locations = world['Country/Region'],
            locationmode = 'country names',
            z = world['TotalCases'],
            text = world['Country/Region'],
            colorbar = {'title':'# of Total Cases'},
            colorscale = 'viridis'
            ) 

layout = dict(title='Covid Cases (TOTAL)',
             geo=dict(showframe=False,
                     projection={'type':'natural earth'}))

choromap2=go.Figure(data=[data],layout=layout)
iplot(choromap2)



As in the previous map, the U.S. and Brazil have the most total cases. India and Russia also have a significant number of total cases. China is not present in this dataset, so it is missing from the map.

In [None]:
data = dict(type='choropleth',
            locations = world['Country/Region'],
            locationmode = 'country names',
            z = world['TotalRecovered'],
            text = world['Country/Region'],
            colorbar = {'title':'# of Recovered Cases'},
            colorscale = 'blues'
            ) 

layout = dict(title='Recovered',
             geo=dict(showframe=False,
                     projection={'type':'natural earth'}))

choromap3=go.Figure(data=[data],layout=layout)
iplot(choromap3)

Although the U.S. has the most cases, it also has the most recoveries. This is likely because it had the most cases to begin with. 

In [None]:
data = dict(type='choropleth',
            locations = world['Country/Region'],
            locationmode = 'country names',
            z = world['TotalDeaths'],
            text = world['Country/Region'],
            colorbar = {'title':'# of Deaths'},
            colorscale = 'reds'
            ) 

layout = dict(title='Deaths',
             geo=dict(showframe=False,
                     projection={'type':'natural earth'}))

choromap4=go.Figure(data=[data],layout=layout)
iplot(choromap4)

As in the previous maps, the U.S. and Brazil also have the most deaths.

In [None]:
sns.set_style('darkgrid')

In [None]:
plt.figure(figsize=(8,8))
sns.kdeplot(x = 'Recovered',data=covid, hue="WHO Region")

This graph shows the distribution of recovered cases for each WHO region plotted on top of each other.

In [None]:
# One KDE plot per numeric column, each drawn in its own figure
kde_columns = {
    'Confirmed': 'r', 'Deaths': 'g', 'Recovered': 'b', 'Active': 'y',
    'New cases': 'purple', 'New deaths': 'gray', 'New recovered': 'pink',
}
for column, color in kde_columns.items():
    plt.figure(figsize=(10,4))
    sns.kdeplot(data=covid, x=column, color=color)
    plt.show()  # plt.show() does not take an Axes argument

These KDE (kernel density estimate) plots show the distributions of several numerical features: confirmed, deaths, recovered, active, and the corresponding new counts.
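
To make the idea behind a KDE plot concrete, here is a minimal sketch of a Gaussian kernel density estimate written with the standard library only. The sample values and the fixed bandwidth are assumptions for illustration; seaborn chooses the bandwidth automatically, so its curves will not match this sketch exactly.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Made-up values standing in for one numeric column.
samples = [1.0, 2.0, 2.5, 4.0, 7.0]
density = gaussian_kde(samples, bandwidth=1.0)

# Riemann sum over a wide grid: a density should integrate to about 1.
step = 0.01
grid = [-10.0 + i * step for i in range(3001)]  # covers [-10, 20]
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

A smaller bandwidth gives a spikier curve and a larger one a smoother curve, which is why heavily skewed columns such as case counts produce a tall peak near zero with a long tail.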

In [None]:
px.pie(world[:25], values='TotalCases', names='Country/Region', 
       title='Top 25 Countries')

This pie chart shows each country's share of total cases among the 25 countries with the most cases. The U.S. accounts for almost a third of the cases among these 25 countries. The chart was created using Plotly.

In [None]:
europe = world[world['Continent'] == 'Europe']
asia = world[world['Continent'] == 'Asia']
north_america = world[world['Continent'] == 'North America']
south_america = world[world['Continent'] == 'South America']
australia_oceania = world[world['Continent'] == 'Australia/Oceania']

In [None]:
px.pie(europe[:25], values='TotalCases', names='Country/Region', 
       title='Top 25 Countries/Regions in Europe')

Russia has over a quarter of the total cases in Europe. 

In [None]:
px.pie(asia[:25], values='TotalCases', names='Country/Region', 
       title='Top 25 Countries/Regions in Asia')

India has the most cases in Asia. This is consistent with the previous geographical maps.

In [None]:
px.pie(north_america[:15], values='TotalCases', names='Country/Region', 
       title='Top 15 Countries/Regions in North America')

The U.S. has 85% of the total cases in North America.

In [None]:
px.pie(south_america, values='TotalCases', names='Country/Region', 
       title='Countries/Regions in South America')

Brazil has the majority of the total cases in South America.

In [None]:
px.pie(australia_oceania, values='TotalCases', names='Country/Region', 
       title='Countries/Regions in Australia/Oceania')

Australia has the most cases in this region. This makes sense, as Australia is the largest country in the region.

In [None]:
fig = px.bar(world[:50], x = 'Country/Region', y = 'TotalRecovered',color = 'Country/Region')
fig

This interactive bar plot shows the total recovered cases for the first 50 countries in the dataset.

In [None]:
fig2 = px.bar(world[:50], x = 'Country/Region', y = 'TotalDeaths',color = 'Country/Region')
fig2

The following plots show the number of confirmed, recovered, and death cases in each WHO Region. Each bar is divided into segments, one per country, so the largest segments belong to the countries with the most cases.

In [None]:
fig3 = px.bar(covid, x = 'WHO Region', y = 'Confirmed',color = 'WHO Region')
fig3

In [None]:
fig4 = px.bar(covid, x = 'WHO Region', y = 'Recovered',color = 'WHO Region')
fig4

In [None]:
fig5 = px.bar(covid, x = 'WHO Region', y = 'Deaths',color = 'WHO Region')
fig5

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(covid.select_dtypes('number').corr(),cmap='viridis',annot=True) # correlate numeric columns only

With this heatmap, we can see which features are most correlated with the number of confirmed cases. 
Most correlated features:
- Deaths (93%)
- Active (93%)
- Recovered (91%)
- New cases (91%)
- New deaths (87%)
- New recovered (86%)
- Confirmed last week (100%)
- 1 week change (95%)

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(day.select_dtypes('number').corr(),cmap='coolwarm',annot=True) # correlate numeric columns only

Most correlated with confirmed:
- Recovered (99%)
- Active (99%)
- Deaths (98%)
- New cases (96%)
- New recovered (94%)
- Recovered/100 Cases (72%)
- No. of Countries (60%)
- New deaths (56%)

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(world.select_dtypes('number').corr(),cmap='Pastel2_r',annot=True) # correlate numeric columns only

Most correlated with total cases:
- NewCases (100%)
- NewDeaths (100%)
- NewRecovered (100%)
- ActiveCases (97%)
- SeriousCritical (97%)
- TotalDeaths (94%)
- TotalTests (89%)
- Population (55%)

# Data Preprocessing

In [None]:
day.head()

In [None]:
covid.head()

In [None]:
covid.isnull().sum() # no null values

In [None]:
day.isnull().sum() # no null values

In [None]:
day['Date']=pd.to_datetime(day['Date']) # converting the Date column to datetime

In [None]:
day['Day'] = day['Date'].dt.day      # day of the month
day['Month'] = day['Date'].dt.month
day['Year'] = day['Date'].dt.year

This creates separate numerical columns for the year, month, and day.
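
For a single date, the three parts look like this. The sketch uses the standard library's `date` type with an arbitrary example date; pandas timestamps expose attributes with the same names (`year`, `month`, `day`).

```python
from datetime import date

d = date.fromisoformat("2020-07-27")  # arbitrary example date
parts = {"Year": d.year, "Month": d.month, "Day": d.day}
print(parts)  # {'Year': 2020, 'Month': 7, 'Day': 27}

# weekday() is a different quantity: 0 is Monday, 6 is Sunday.
weekday = d.weekday()
print(weekday)  # 0
```

The day of the week is easy to confuse with the day of the month, so it is worth checking one row by hand after building the columns.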

In [None]:
day_temp = day.copy()
day_temp = day_temp.drop('Date',axis=1)
day_temp.head()

`day_temp` will be used in the model.

# Model for Confirmed Cases

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = day_temp.drop('Confirmed',axis=1).values
y = day_temp['Confirmed'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

## Building Neural Network Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
X_train.shape

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

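MinMaxScaler rescales each feature to a given range ([0, 1] by default). This is known as feature scaling: it normalizes the range of the features so that no single feature dominates because of its units, which generally helps the neural network converge faster. The scaler is fit on the training set only and then applied to the test set, so no information from the test set leaks into training.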

To learn more about why feature scaling is important, [read this article](https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35).
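
The scaling itself is a one-line formula, `(x - min) / (max - min)`, applied per column. Below is a minimal sketch in plain Python with made-up numbers; the helper names `fit_min_max` and `transform_min_max` are hypothetical, and the handling of constant columns is a simplification rather than a copy of scikit-learn's behavior.

```python
def fit_min_max(rows):
    """Learn per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def transform_min_max(rows, mins, maxs):
    """Scale rows with previously learned bounds: (x - min) / (max - min)."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi != lo else 0.0  # constant column -> 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

train = [[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]]
test = [[25.0, 6.0]]

mins, maxs = fit_min_max(train)           # bounds come from the training set only
train_scaled = transform_min_max(train, mins, maxs)
test_scaled = transform_min_max(test, mins, maxs)
print(train_scaled)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(test_scaled)   # [[0.75, 1.25]]
```

Note that a test value can land outside [0, 1] when it exceeds the range seen in training, as the second column of the test row does here.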

In [None]:
model = Sequential()

model.add(Dense(19,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(19,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(19,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(19,activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1))

model.compile(optimizer='adam',loss='mse')

When building the model, it is important to use [dropout layers](https://keras.io/api/layers/regularization_layers/dropout/) to prevent overfitting. The activation function is [ReLU](https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/#:~:text=The%20rectified%20linear%20activation%20function,otherwise%2C%20it%20will%20output%20zero.&text=The%20rectified%20linear%20activation%20function%20overcomes%20the%20vanishing%20gradient%20problem,learn%20faster%20and%20perform%20better.) (rectified linear unit), and the optimizer is [Adam](https://www.tensorflow.org/swift/api_docs/Classes/Adam). ReLU is a piecewise linear function that outputs its input when the input is positive and zero otherwise; it is easy to train and is generally the default activation function. The Adam optimizer is an extension of stochastic gradient descent that adapts the learning rate for each weight, which usually makes training more efficient.
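
To show what these two building blocks do to a layer's values, here is a small sketch in plain Python. It is an illustration of the ideas, not the Keras implementation: the function names are hypothetical, and the dropout shown is the training-time "inverted" form in which surviving values are scaled up.

```python
import random

def relu(values):
    """Rectified linear unit: negative inputs become 0, others pass through."""
    return [max(0.0, v) for v in values]

def dropout(values, rate, rng):
    """Inverted dropout as applied during training.

    Each value is zeroed with probability `rate`; survivors are scaled by
    1 / (1 - rate) so the expected sum of the layer stays the same.
    """
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]

dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 or twice the original value
```

With a rate of 0.5 and only 19 units per layer, about half of each layer is silenced on every training step, which is a fairly strong regularizer for a dataset this small.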

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
#early_stop = EarlyStopping(monitor='val_loss',mode='min',verbose=1,patience=25)
model.fit(x=X_train,y=y_train,epochs=600,validation_data=(X_test,y_test),callbacks=[])

[EarlyStopping](https://en.wikipedia.org/wiki/Early_stopping) is another way to prevent overfitting, but I found that this model did better without it. 
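
The rule behind early stopping with a `patience` setting can be sketched in a few lines. This is a simplified illustration with made-up validation losses, not the Keras callback itself; the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the last epoch index is returned.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1

losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.59]
print(early_stopping_epoch(losses, patience=3))  # 5
print(early_stopping_epoch(losses, patience=4))  # 6
```

In this made-up run, a patience of 3 stops just before the loss reaches a new best, while a patience of 4 lets training continue. A noisy validation loss can cause the same effect in practice, which is one possible reason a model does better without the callback.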

In [None]:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

In [None]:
from sklearn.metrics import mean_squared_error, explained_variance_score

In [None]:
predictions = model.predict(X_test)
np.sqrt(mean_squared_error(y_test,predictions))

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
mean_absolute_error(y_test,predictions)

In [None]:
explained_variance_score(y_test,predictions)
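
For reference, the three metrics computed above can be written out by hand. The sketch below uses plain Python and made-up values; it is meant to show the definitions, not to replace the scikit-learn functions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true), using population variance."""
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)

# Made-up targets and predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

print(rmse(y_true, y_pred))                          # 10.0
print(mae(y_true, y_pred))                           # 10.0
print(round(explained_variance(y_true, y_pred), 3))  # 0.992
```

RMSE and MAE are in the same units as the target (here, confirmed cases), so they should be read against the scale of the data; explained variance is unitless, with 1.0 being the best possible score.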

## Prophet Model

Since the neural network model did not do a great job at predicting the number of confirmed cases, let's try using Prophet. 

[Prophet](https://research.fb.com/blog/2017/02/prophet-forecasting-at-scale/) is a forecasting tool from Facebook that uses the data to predict future behavior. It is an additive regression model, a type of nonparametric model (these are constructed using information from the data rather than taking a predetermined form). One benefit is that additive regression models are more flexible than regular linear regression models. 
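
Written out, the additive form described in the Prophet documentation and its accompanying paper is:

$$
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
$$

Here $g(t)$ is the trend, $s(t)$ is the periodic seasonality, $h(t)$ is the effect of holidays, and $\varepsilon_t$ is the error that the model does not explain.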

Prophet's model has four main components:
- list of holidays (from user)
- weekly seasonal component with dummy variables
- yearly seasonal component with Fourier series
- piecewise linear/logistic curve trend (selects points in the data to detect changes in trends) 

To learn the basics of Prophet, see the [quick start guide](https://facebook.github.io/prophet/docs/quick_start.html#python-api).
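
Two of the components listed above, the Fourier-series seasonality and the piecewise linear trend, can be sketched in plain Python. This is an illustration of the ideas under stated assumptions, not Prophet's code: the function names, coefficients, and changepoints are all hypothetical, and Prophet fits such values from the data rather than taking them by hand.

```python
import math

def fourier_features(t_days, period, order):
    """Return [sin, cos, sin, cos, ...] terms for one time value.

    t_days is time in days, period is the cycle length in days
    (365.25 for a yearly cycle), and order is the number of harmonics.
    """
    features = []
    for n in range(1, order + 1):
        angle = 2.0 * math.pi * n * t_days / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features

def seasonal_value(t_days, period, coefficients):
    """Weighted sum of the Fourier terms (weights would be fitted)."""
    order = len(coefficients) // 2
    terms = fourier_features(t_days, period, order)
    return sum(c * x for c, x in zip(coefficients, terms))

def piecewise_linear_trend(t, base_rate, offset, changepoints, deltas):
    """Continuous trend whose slope changes by deltas[j] at changepoints[j]."""
    rate = base_rate
    intercept = offset
    for s, delta in zip(changepoints, deltas):
        if t >= s:
            rate += delta
            intercept -= s * delta  # keeps the line continuous at s
    return rate * t + intercept

# Hand-picked coefficients, not fitted to any data.
coefficients = [1.0, 0.5, 0.2, -0.1]
today = seasonal_value(30.0, 365.25, coefficients)
next_year = seasonal_value(30.0 + 365.25, 365.25, coefficients)
print(abs(today - next_year) < 1e-9)  # True: the component repeats every period

# Slope 1 until day 5, then slope 3 afterwards.
trend_at_10 = piecewise_linear_trend(10.0, 1.0, 0.0, [5.0], [2.0])
print(trend_at_10)  # 20.0
```

The changepoints are what let the trend bend when the growth rate of cases changes, while the Fourier terms capture effects that repeat on a fixed cycle.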

In [None]:
from fbprophet import Prophet  # releases from 1.0 onward are imported as: from prophet import Prophet

In [None]:
conf_data = full[['Date', 'Confirmed']].groupby('Date', as_index = False).sum()
conf_data.columns = ['ds', 'y']
conf_data.ds = pd.to_datetime(conf_data.ds)

In [None]:
conf_data.head()

In [None]:
proph = Prophet()
proph.fit(conf_data)

In [None]:
confirmed_pred = proph.make_future_dataframe(periods=60)
confirmed_pred.tail()

In [None]:
confirmed_forecast = proph.predict(confirmed_pred)
confirmed_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

`yhat` is the forecast, and `yhat_lower` and `yhat_upper` are the bounds of the uncertainty interval (80% by default).

In [None]:
fig1 = proph.plot(confirmed_forecast)

The black dots represent the observed y values, while the blue line represents the forecast and the shaded band shows the uncertainty interval. From this graph, we can see that the Prophet model fits the observed data closely, as the dots line up with the forecast line. 

In [None]:
fig2 = proph.plot_components(confirmed_forecast)

This shows the overall trend and the weekly seasonality of the cases. The trend increases steadily from February 2020 to October 2020. The weekly graph shows that Saturday tends to have the most cases and Tuesday the fewest.

# Model for Deaths

In [None]:
deaths_data = full[['Date', 'Deaths']].groupby('Date', as_index = False).sum()
deaths_data.columns = ['ds', 'y']
deaths_data.ds = pd.to_datetime(deaths_data.ds)

In [None]:
deaths_data.head()

In [None]:
proph2 = Prophet()
proph2.fit(deaths_data)

In [None]:
deaths_pred = proph2.make_future_dataframe(periods=60)
deaths_pred.tail()

In [None]:
deaths_forecast = proph2.predict(deaths_pred)
deaths_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

In [None]:
fig3 = proph2.plot(deaths_forecast)

In [None]:
fig4 = proph2.plot_components(deaths_forecast)

Deaths were nearly flat at the beginning of 2020, but they began increasing as time went on. Saturday has the most deaths and Monday has the fewest.

# Model for Recovered

In [None]:
rec_data = full[['Date', 'Recovered']].groupby('Date', as_index = False).sum()
rec_data.columns = ['ds', 'y']
rec_data.ds = pd.to_datetime(rec_data.ds)

In [None]:
rec_data.head()

In [None]:
proph3 = Prophet()
proph3.fit(rec_data)

In [None]:
rec_pred = proph3.make_future_dataframe(periods=60)

In [None]:
rec_forecast = proph3.predict(rec_pred) # use the future frame built from the recovered model
rec_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

In [None]:
fig5 = proph3.plot(rec_forecast)

In [None]:
fig6 = proph3.plot_components(rec_forecast)

Recovered cases show an increasing trend. Saturday tends to have the most recovered cases, while Tuesday has the fewest. 