# Prediction: COVID-19 in Brazil
The following notebook is an attempt to predict the COVID-19 progression curves using polynomial regression and the datasets published by the Brazilian Health Ministry.

![COVID-19](https://www.iol.pt/multimedia/oratvi/multimedia/imagem/id/5e8ba4820cf2cd6069eb2d1d/1024)

---

# Table of contents

1. [Acknowledgment](#acknowledgment)
1. [Why Polynomial Regression?](#why)
1. [COVID-19 in Brazil](#brazil)
1. [Methodology and results](#methodology)
1. [Conclusion](#conclusion)
1. [Disclaimer](#disclaimer)

---

<div id="acknowledgment"></div>
# Acknowledgment
* The [COVID-19 cases dataset](https://www.kaggle.com/unanimad/corona-virus-brazil) used in this notebook was kindly built and provided by Raphael Fontes, using official data published daily by the Brazilian Health Ministry.

<div id="why"></div>
# Why Polynomial Regression?
Over a limited time window, the curve of cumulative COVID-19 cases can be approximated by a polynomial. A polynomial does not grow exponentially, so the fit is only meaningful over a short horizon, where it captures the relationship between time and the total number of cases. Due to frequent underreporting and undetected cases, it is quite difficult to build a model that accurately reflects reality.
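
For reference, a polynomial regression of degree $d$ models the cumulative count on day $t$ as written below. The symbols $\hat{y}$, $t$, and $\beta_k$ are notation introduced here for illustration; the coefficients are estimated by ordinary least squares on the expanded powers of $t$.

$$\hat{y}(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \dots + \beta_d t^d$$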

<div id="brazil"></div>
# COVID-19 in Brazil
Recently, Brazil passed 10K cases on April 4th and 500 deaths on April 6th. Currently, the Brazilian pandemic has multiple foci, mainly in São Paulo, Rio de Janeiro, and Distrito Federal, with the number of cases doubling every 4 days. The Brazilian public healthcare system is testing only a small group of people, mostly the most severe cases, due to the lack of viral tests. This leads to a high number of undetected cases and, probably, a sharp increase in the number of confirmed cases in the coming days.
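
As a rough check on the doubling figure above: if growth were purely exponential with a constant doubling time of 4 days, the implied daily growth rate would be $2^{1/4} - 1$. The sketch below works this out; the helper names are hypothetical and not part of the notebook.

```python
import math

def daily_growth_from_doubling(doubling_days):
    """Daily growth rate implied by a constant doubling time (assumes pure exponential growth)."""
    return 2 ** (1 / doubling_days) - 1

def doubling_time_from_growth(daily_growth):
    """Inverse relation: days needed to double at a constant daily growth rate."""
    return math.log(2) / math.log(1 + daily_growth)

rate = daily_growth_from_doubling(4)
print('Daily growth: %.1f%%' % (rate * 100))  # about 18.9%
print('Doubling time: %.1f days' % doubling_time_from_growth(rate))
```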

<div id="methodology"></div>
# Methodology and results
First, I downloaded the COVID-19 dataset and pre-processed the data. To better understand the data, new columns were generated to help visualize the growth rate of the number of cases and of the number of deaths.

In [None]:
import math, datetime, pandas as pd, warnings
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, f1_score
from sklearn.preprocessing import PolynomialFeatures

In [None]:
warnings.filterwarnings('ignore')

In [None]:
print('Last update on', pd.to_datetime('now'))

In [None]:
df = pd.read_csv('../input/corona-virus-brazil/brazil_covid19.csv').groupby('date').sum()[27:].reset_index()
brazil = pd.DataFrame({
    'date': pd.to_datetime(df['date'], format='%Y/%m/%d'),
    'cases': df['cases'], 
    'new_cases': df['cases'].diff().fillna(0).astype(int),
    'growth_cases': df['cases'].diff().fillna(0).astype(int)/df['cases'],
    'deaths': df['deaths'],
    'new_deaths': df['deaths'].diff().fillna(0).astype(int),
    'growth_deaths': df['deaths'].diff().fillna(0).astype(int)/df['deaths'],
    'mortality_rate': df['deaths']/df['cases']
})
brazil.fillna(0).tail()
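
To make the derived columns concrete, here is a small sketch on made-up numbers (the `toy` frame is illustrative, not the real dataset). The notebook divides the daily increase by the current day's total, whereas pandas' `pct_change()` divides by the previous day's total, so the two definitions give different values.

```python
import pandas as pd

# Made-up cumulative counts, for illustration only
toy = pd.DataFrame({'cases': [100, 150, 300]})

toy['new_cases'] = toy['cases'].diff().fillna(0).astype(int)
# Same definition as the notebook: increase relative to the current total
toy['growth_cases'] = toy['new_cases'] / toy['cases']
# Alternative definition: increase relative to the previous day's total
toy['pct_change'] = toy['cases'].pct_change().fillna(0)

print(toy)
```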

Then, a polynomial regression function, `poly_reg()`, was created; it returns the prediction for the given input.

In [None]:
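def poly_reg(x, y, x_test, d):
    # Expand the inputs into polynomial terms up to degree d
    poly = PolynomialFeatures(degree=d)
    x_poly = poly.fit_transform(x)
    # Fit ordinary least squares on the expanded features
    model = LinearRegression()
    model.fit(x_poly, y)
    # Reuse the fitted transformer on the test inputs
    return model.predict(poly.transform(x_test))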

def score(y, yhat):
    r2 = r2_score(y,yhat)
    rmse = np.sqrt(mean_squared_error(y,yhat))
    return (r2,rmse)
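
As a sanity check for this kind of fit, the sketch below applies the same `PolynomialFeatures` plus `LinearRegression` combination to synthetic data generated from a known cubic. The data and variable names are made up for illustration, and `make_pipeline` is used here only as a compact alternative to chaining the two steps by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data from a known cubic: y = 2 + 3x + 0.5x^3
x_demo = np.arange(0, 10).reshape(-1, 1)
y_demo = 2 + 3 * x_demo.ravel() + 0.5 * x_demo.ravel() ** 3

# Expand x into polynomial terms, then fit ordinary least squares
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x_demo, y_demo)
y_fit = model.predict(x_demo)

r2 = r2_score(y_demo, y_fit)
rmse = np.sqrt(mean_squared_error(y_demo, y_fit))
print('R2: %.3f RMSE: %.3f' % (r2, rmse))

# Predict one step beyond the training range (x = 10)
print(model.predict(np.array([[10]])))
```

Because the synthetic data is an exact cubic, a degree-3 fit should reproduce it almost perfectly; real case counts are noisy, so scores this clean should not be expected there.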

Next, an evaluation plot was created to verify the accuracy of the prediction, using a polynomial degree of 4. With this setting, the model reaches an R2 score of 0.998; the RMSE is printed on the plot. Higher polynomial degrees were tested, but the 4th degree was kept to avoid overfitting.

In [None]:
# Defines the range
start = 17
end = len(brazil)

# Sets the samples
x = np.asarray(range(start,end)).reshape(-1,1)
y = brazil.iloc[start:,1]

# Creates polynomial model and predict
yhat = poly_reg(x, y, x, 4)

# Plot the line chart
fig, ax = plt.subplots(figsize=(14, 10))
# Label each artist explicitly so the legend does not depend on drawing order
plt.scatter(x, y, s=40, label='cases')
plt.plot(x, yhat, color='magenta', linestyle='solid', linewidth=4, alpha=0.5, label='prediction')
plt.title('Evaluating the model', fontsize=18, fontweight='bold', color='#333333')
plt.legend(fontsize=12)
plt.text(0.01,1.0,s='R2: %.3f RMSE: %.3f' % score(y, yhat), transform=ax.transAxes, fontsize=9)
plt.grid(which='major', axis='y')
ax.set_axisbelow(True)
ax.set_ylim(0)
[ax.spines[side].set_visible(False) for side in ['left','right','top']]
plt.show();
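
One way to probe the overfitting concern mentioned above is to hold out the last few days and compare errors across degrees. The sketch below does this on synthetic data, using `np.polyfit` as a stand-in for the scikit-learn fit; the series, the 5-day hold-out, and the candidate degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cumulative-like series: quadratic trend plus noise (illustration only)
t = np.arange(30, dtype=float)
series = 50 + 10 * t + 4 * t ** 2 + rng.normal(0, 20, size=t.size)

# Hold out the last 5 points to mimic forecasting unseen days
t_train, t_test = t[:-5], t[-5:]
s_train, s_test = series[:-5], series[-5:]

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

results = {}
for degree in (1, 2, 4, 6):
    coeffs = np.polyfit(t_train, s_train, degree)
    train_err = rmse(s_train, np.polyval(coeffs, t_train))
    test_err = rmse(s_test, np.polyval(coeffs, t_test))
    results[degree] = (train_err, test_err)
    print('degree %d: train RMSE %.1f, hold-out RMSE %.1f' % (degree, train_err, test_err))
```

In a least-squares fit the training error cannot increase as the degree grows, so it is the hold-out error that indicates whether the extra flexibility actually helps.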

Afterward, the model was fitted with data from day one to the current date and a mixed plot was created, showing the actual number of cases, the number of deaths and the number of new cases by day.

In [None]:
dates = pd.date_range(start=brazil.iloc[0,0], end='2020-05-31') #.strftime('%d/%m').to_list()

In [None]:
fig, ax = plt.subplots(figsize=(14, 10))
plt.title('COVID-19: mortality rate in Brazil', fontsize=18, fontweight='bold', color='#333333')
plt.plot(brazil['date'][17:], brazil['mortality_rate'][17:], color='red', linewidth=4, marker='o')
ax.xaxis.set_minor_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_major_locator(mdates.DayLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d/%m"))
[ax.spines[side].set_visible(False) for side in ['right','top']]
plt.grid(which='major', color='#EEEEEE')
plt.grid(which='minor', color='#EEEEEE', linestyle=':')
plt.xticks(rotation=90)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(14, 10))
plt.plot(dates[start:end], y, color='limegreen', linewidth=8, alpha=0.5, marker='o')
plt.plot(brazil['date'][17:], brazil['deaths'][17:], color='red', linewidth=4, marker='o')
plt.bar(brazil['date'][17:], brazil['new_cases'][17:])
plt.title('COVID-19: number of cases in Brazil', fontsize=18, fontweight='bold', color='#333333')
plt.legend(labels=['cases','deaths','new cases'], fontsize=12)
plt.xticks(fontsize=10, rotation=90)
plt.grid(which='major', axis='y')
ax.set_axisbelow(True)
#[ax.annotate('%s' % y, xy=(x,y+100), fontsize=10) for x,y in zip(brazil['date'][17:], brazil['cases'][17:])]
ax.xaxis.set_minor_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_major_locator(mdates.DayLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d/%m"))
ax.yaxis.set_major_locator(plt.MultipleLocator(50000))
ax.yaxis.set_minor_locator(plt.MultipleLocator(10000))
[ax.spines[side].set_visible(False) for side in ['right','top']]
plt.grid(which='major', color='#EEEEEE')
plt.grid(which='minor', color='#EEEEEE', linestyle=':')
plt.show();

Finally, a new plot was created to predict the values for the next 7 days, based on the training set.

In [None]:
# Creates polynomial model and predict
x_test = np.asarray(range(start, len(dates))).reshape(-1,1)
yhat = poly_reg(x, y, x_test, 4)
yhat_deaths = yhat * 0.06  # predicted deaths, assuming a constant 6% mortality rate

# Plot the line chart
fig, ax = plt.subplots(figsize=(14, 10))
plt.plot(dates[start:end], y, color='limegreen', linewidth=8, alpha=0.5)
plt.plot(brazil['date'][17:], brazil['deaths'][17:], color='magenta', linewidth=8, alpha=0.5)

plt.plot(dates[start:len(dates)], yhat, color='green', linestyle='None', marker='o')
plt.plot(dates[start:len(dates)], yhat_deaths, color='darkorchid', linestyle='None', marker='o')

#plt.bar(brazil['date'][17:], brazil['new_cases'][17:])
plt.title('COVID-19: cases prediction in Brazil', fontsize=18, fontweight='bold', color='#333333')
plt.legend(labels=['cases','deaths', 'cases prediction', 'deaths prediction'], fontsize=14)

plt.text(0.01,1.01,s='R2: %.3f RMSE: %.2f' % score(y, yhat[:len(y)]), transform=ax.transAxes, fontsize=10)
plt.xticks(rotation=90)
plt.tick_params(axis='y', length = 0)
ax.set_axisbelow(True)
#[ax.annotate('%s' % y, xy=(x,y+300), fontsize=10) for x,y in zip(dates[start:len(dates)], yhat.astype(int))]
#[ax.annotate('%s' % y, xy=(x,y+500), fontsize=10) for x,y in zip(dates[len(brazil['date']):len(dates)], yhat[len(brazil['date'][17:]):].astype(int))]
ax.xaxis.set_minor_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d/%m"))
ax.yaxis.set_major_locator(plt.MultipleLocator(50000))
ax.yaxis.set_minor_locator(plt.MultipleLocator(10000))
ax.set_ylim(0)
[ax.spines[side].set_visible(False) for side in ['right','top']]
plt.grid(which='major', color='#EEEEEE')
plt.grid(which='minor', color='#EEEEEE', linestyle=':')
plt.show();

<div id='conclusion'></div>
# Conclusion
Despite underreporting and undetected cases, the current model achieved a goodness of fit (R2 score) of `99.9%` and a root mean square error (RMSE) of `~700`. The prediction curve closely follows the observed growth curve. To limit data bias and loss of prediction accuracy, the model is retrained daily on updated data, keeping a prediction range of at most 7 days.

In [None]:
# Pad the observed cases with zeros for the days that are only predicted
c = y.to_list() + [0] * (len(yhat) - len(y))
pred = pd.DataFrame({'date':dates[17:], 'cases':c,'predicted':yhat.astype(int)})
pred.tail(30)

In [None]:
print('R2 score: %.3f \nRMSE: %.2f' % score(pred[(pred['cases'] > 0)]['cases'], pred[(pred['cases'] > 0)]['predicted']))
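
For reference, the two metrics returned by `score()` follow the standard definitions, with $y_i$ the observed values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the observed values, and $n$ the number of points (this notation is introduced here for illustration):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

The RMSE is expressed in number of cases, so it tends to grow as the series grows, while $R^2$ is scale-free.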

<div id='disclaimer'></div>
# Disclaimer
Due to the complex nature of a pandemic, this work is not intended to be an accurate projection or a model that reproduces the complexity of reality. The main goal of this project is to encourage reflection on the importance of social distancing, quarantine, and other infection prevention measures to minimize the effects of the pandemic and help flatten the infection curve.

## Thank you and stay home! :)