In [None]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sns

# Loading the dataset:

In [None]:
covid = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv')
covid

I will create a new dataframe containing only the data about Brazil (my country) to do some EDA:

In [None]:
df_BR = covid[covid['Country/Region'] == 'Brazil']
df_BR

Let's see the ``ObservationDate`` and the ``Confirmed`` cases in a graph:

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lineplot(x='ObservationDate', y='Confirmed', data=df_BR);

Interestingly, the cases increase as the days go by, but once the curve passes the 250,000 mark there is a sharp drop.
* Is this drop the result of a quarantine or of more stringent measures?
* Did it happen during the period when the [president of the country decided that he would stop reporting the total number of cases](https://oglobo.globo.com/sociedade/governo-esconde-totais-de-mortes-casos-da-covid-19-tira-site-do-ar-1-24466314), and did that have an impact on the results?

Right after the sharp drop the cases start growing again, although more gradually than before the drop.
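A third possibility is worth ruling out before reading too much into the drop. ``Confirmed`` is a cumulative count, and seaborn's ``lineplot`` averages rows that share the same x value by default. If the dataset switches from one national row per day to one row per state, the plotted line falls even though the national total does not. The sketch below uses a small invented table (the values, state names and the ``MM/DD/YYYY`` date format are assumptions, not the real data) to show how to count the rows per date and sum them into a national total:

```python
import pandas as pd

# Hypothetical miniature of covid_19_data.csv: one national row per day at
# first, then one row per state. All numbers are invented for illustration.
sample = pd.DataFrame({
    'ObservationDate': ['05/18/2020', '05/19/2020', '05/20/2020',
                        '05/20/2020', '05/21/2020', '05/21/2020'],
    'Country/Region': ['Brazil'] * 6,
    'Province/State': [None, None, 'Sao Paulo', 'Rio de Janeiro',
                       'Sao Paulo', 'Rio de Janeiro'],
    'Confirmed': [100, 120, 90, 50, 100, 60],
})

# More than one row for a date means the line plot is averaging those rows.
rows_per_date = sample.groupby('ObservationDate').size()

# Sum over states so that each date carries a single national total.
national = sample.groupby('ObservationDate', as_index=False)['Confirmed'].sum()
national['ObservationDate'] = pd.to_datetime(national['ObservationDate'],
                                             format='%m/%d/%Y')
national = national.sort_values('ObservationDate')

print(rows_per_date)
print(national)
```

Running the same two ``groupby`` calls on ``df_BR`` would show whether the drop survives once the rows are summed per date.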

Now I will create a new dataframe with ``Confirmed``, ``Deaths`` and ``Recovered`` summed over all countries for each ``ObservationDate``:

In [None]:
df_sum = covid.groupby('ObservationDate').agg({'Confirmed': 'sum', 'Deaths': 'sum', 'Recovered': 'sum'}).reset_index()
df_sum

Let's create a comparative chart of ``Confirmed``, ``Deaths`` and ``Recovered``:

In [None]:
plt.stackplot(df_sum['ObservationDate'], [df_sum['Confirmed'], df_sum['Deaths'], df_sum['Recovered']],
              labels = ['Confirmed', 'Deaths', 'Recovered'])
plt.legend(loc = 'upper left')

These charts show that the number of deaths is small compared with the number of confirmed and recovered cases. Despite the difficult period the world is going through, the number of recovered people exceeds the number of deaths, which is good news, even though the number of confirmed cases is very high and continues to grow.
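The comparison can be made more precise by expressing deaths and recoveries as shares of the confirmed cases. This is a minimal sketch on an invented two-row table with the same column names as ``df_sum``; the ``DeathShare`` and ``RecoveredShare`` columns are hypothetical names introduced here:

```python
import pandas as pd

# Invented totals with the same columns as df_sum (not real figures).
totals = pd.DataFrame({
    'Confirmed': [1000, 2000],
    'Deaths': [30, 80],
    'Recovered': [400, 1100],
})

# Share of confirmed cases that ended in death or in recovery.
totals['DeathShare'] = totals['Deaths'] / totals['Confirmed']
totals['RecoveredShare'] = totals['Recovered'] / totals['Confirmed']

print(totals)
```

Applied to ``df_sum``, the same two divisions give one ratio per date, which is easier to compare over time than the raw stacked areas.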

In [None]:
sns.pairplot(covid)

# Loading another dataset
This dataset contains patient-level information

In [None]:
covid_line_list = pd.read_csv('../input/novel-corona-virus-2019-dataset/COVID19_line_list_data.csv')
covid_line_list

Analyzing patients by `` age``:

In [None]:
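# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(covid_line_list['age'].dropna(), kde=True)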

The graph above shows that most infections are among people between 25 and 70 years old. There is a clear peak at approximately 60 to 70 years, an age range that is part of the risk group because of age.
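Reading age ranges off a density curve is approximate, so it can help to count patients per age band. The sketch below uses ``pd.cut`` on a handful of invented ages; the 20-year bin edges are an arbitrary choice made for this illustration:

```python
import pandas as pd

# Invented ages, including a missing value as in the real line list.
ages = pd.Series([23, 35, 35, 47, 52, 61, 64, 66, 68, 75, None])

# Left-closed 20-year bands: [0, 20), [20, 40), ...
bins = [0, 20, 40, 60, 80, 120]
age_groups = pd.cut(ages.dropna(), bins=bins, right=False)

# Number of patients in each band, ordered by age.
counts = age_groups.value_counts().sort_index()
print(counts)
```

With the real data, ``covid_line_list['age']`` would take the place of ``ages``.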

# Predicting the deaths by COVID-19:

In [None]:
deaths = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths.csv')
deaths

Let's look at the deaths in Brazil:

In [None]:
deaths[deaths['Country/Region'] == 'Brazil']

In Brazil, the number of deaths grows over a short period of time.

As the goal is to predict the number of deaths, we will create a variable called ``columns`` holding the column names, so that we can then keep only the date columns.

In [None]:
columns = deaths.keys()
columns

The date columns start at index 4 (the fifth column), so let's recreate the ``deaths`` dataframe from that column to the last one. The result holds only the cumulative number of deaths recorded on each date.

In [None]:
deaths = deaths.loc[:, columns[4]:columns[-1]]
deaths

Now we can see the number of deaths on each date. Let's see how many days of data we have in this dataset:

In [None]:
len(deaths.keys())
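The value returned by ``len(deaths.keys())`` counts daily columns rather than months. A rough conversion, assuming the first and last labels are ``'1/22/20'`` and ``'8/12/20'`` (the two dates this notebook sums) and an average month of 30.44 days, could look like this:

```python
import pandas as pd

# Assumed first and last column labels of the time series.
labels = ['1/22/20', '8/12/20']
first, last = pd.to_datetime(labels, format='%m/%d/%y')

# Both end dates are included, hence the + 1.
n_days = (last - first).days + 1

# Approximate length in months, using the average month length.
approx_months = n_days / 30.44

print(n_days, round(approx_months, 2))
```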

That is one column per day, covering just over 6.5 months. Now let's sum the deaths across all regions for a given day:

In [None]:
deaths['1/22/20'].sum()

By this day, a total of 17 deaths had been recorded worldwide.

In [None]:
deaths['8/12/20'].sum()

In [None]:
# Sum each date column over all regions to get the worldwide total per day
dates = deaths.keys()
y = []
for i in dates:
    y.append(deaths[i].sum())

In [None]:
print(y)

The variable ``y`` is a ``list``. To use it with the Machine Learning algorithms it needs to be converted to a NumPy ``array`` and reshaped into a single column:

In [None]:
y = np.array(y).reshape(-1,1) #will be transformed into matrix format
y
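The loop and the conversion can also be written as a single vectorized expression. This is a sketch on a tiny invented frame (``toy`` is a hypothetical stand-in for ``deaths``) showing that ``DataFrame.sum(axis=0)`` yields the same column vector as the loop:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the deaths frame: rows are regions, columns are dates.
toy = pd.DataFrame({'1/22/20': [1, 2], '1/23/20': [3, 4], '1/24/20': [5, 6]})

# Loop version, as in the cells above.
y_loop = np.array([toy[c].sum() for c in toy.keys()]).reshape(-1, 1)

# Vectorized version: sum over rows for every column at once.
y_vec = toy.sum(axis=0).to_numpy().reshape(-1, 1)

print(y_vec)
```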

Our ``x`` variable, the day index used to predict the deaths:

In [None]:
x = np.arange(len(dates)).reshape(-1,1) # one integer per date: 0, 1, ..., len(dates) - 1
x

Let's create a new ``forecast`` variable that extends the day index by 15 days, so that we can forecast beyond the last date:

In [None]:
forecast = np.arange(len(dates) + 15).reshape(-1,1)
forecast

In [None]:
x.shape, y.shape, forecast.shape 

**Training the model**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.10, shuffle = False)

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree = 2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

Because we use ``degree = 2``, each day index ``x`` is expanded into three columns: the constant 1, ``x`` and ``x²``. The shapes below confirm it:

In [None]:
X_train.shape, X_test.shape, X_train_poly.shape, X_test_poly.shape
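To make the transformation concrete, here is a small sketch on four made-up day indices (``x_demo`` and ``poly_demo`` are illustration-only names). Each row becomes ``[1, x, x²]``:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Four day indices as a column vector: 0, 1, 2, 3.
x_demo = np.arange(4).reshape(-1, 1)

# Degree 2 with one input feature gives the columns 1, x and x**2.
poly_demo = PolynomialFeatures(degree=2)
x_demo_poly = poly_demo.fit_transform(x_demo)

print(x_demo_poly)
```

The linear regression fitted afterwards learns one coefficient per column, which is what turns a linear model into a quadratic trend over time.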

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train_poly, Y_train)

**The prediction**

In [None]:
poly_pred = lr.predict(X_test_poly)
plt.plot(poly_pred, linestyle = 'dashed')
plt.plot(Y_test)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
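# scikit-learn metrics take the true values first, then the predictions
print('MAE:', mean_absolute_error(Y_test, poly_pred))
print('MSE:', mean_squared_error(Y_test, poly_pred))
# RMSE is the square root of the MSE (not of the MAE)
print('RMSE:', np.sqrt(mean_squared_error(Y_test, poly_pred)))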
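For reference, RMSE is defined as the square root of the MSE, so it is expressed in the same unit as the target. A small worked example with made-up numbers (``y_true`` and ``y_hat`` are hypothetical) shows how the three metrics relate:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true values and predictions; the errors are 2, 0 and 4.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 20.0, 26.0])

mae = mean_absolute_error(y_true, y_hat)   # (2 + 0 + 4) / 3
mse = mean_squared_error(y_true, y_hat)    # (4 + 0 + 16) / 3
rmse = np.sqrt(mse)                        # square root of the MSE

print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)
```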

Now let's use the ``forecast`` variable to make the forecast

In [None]:
X_train_all = poly.transform(forecast)
pred_all = lr.predict(X_train_all)

plt.plot(forecast[:-15], y, color='red')
plt.plot(forecast, pred_all, linestyle='dashed')
plt.title('DEATHS of COVID-19')
plt.xlabel('Days since 1/22/2020')
plt.ylabel('Number of deaths')
plt.legend(['Death cases', 'Predictions']);
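The split, polynomial expansion, fit and forecast steps above are repeated for each series in this notebook, so they could be wrapped in one helper. This is a sketch rather than the code used here: ``fit_and_forecast`` is a hypothetical name, and it relies on scikit-learn's ``make_pipeline`` to chain ``PolynomialFeatures`` and ``LinearRegression``. It is tried on an invented, exactly quadratic series:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def fit_and_forecast(y, degree, horizon=15, test_size=0.10):
    """Fit a polynomial trend on the first part of y and predict past its end."""
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    x = np.arange(len(y)).reshape(-1, 1)

    # Keep the time order: the last observations form the test set.
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, shuffle=False)

    # The pipeline applies the polynomial expansion before the regression.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_train, y_train)

    # Day indices covering the observed period plus the forecast horizon.
    x_future = np.arange(len(y) + horizon).reshape(-1, 1)
    return model, model.predict(x_test), model.predict(x_future)

# Invented series that follows 2 + 3*d + d**2 exactly.
days = np.arange(40)
toy_series = 2 + 3 * days + days ** 2
model, test_pred, future_pred = fit_and_forecast(toy_series, degree=2)

print(future_pred[-1])
```

With the real data, the call would take ``y``, ``y_c`` or ``y_r`` and the degree chosen for that series.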

# The prediction of the confirmed cases:

In [None]:
confirmed = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')
confirmed

In [None]:
confirmed[confirmed['Country/Region'] == 'Brazil']

In [None]:
columns1 = confirmed.keys()
columns1

In [None]:
confirmed = confirmed.loc[:, columns1[4]:columns1[-1]]
confirmed

In [None]:
confirmed['1/22/20'].sum(), confirmed['8/12/20'].sum()

In [None]:
dates1 = confirmed.keys()
y_c = []
for i in dates1:
    y_c.append(confirmed[i].sum())

In [None]:
print(y_c)

In [None]:
y_c = np.array(y_c).reshape(-1,1)
y_c

In [None]:
x_c = np.arange(len(dates1)).reshape(-1,1)
x_c

In [None]:
forecast_c = np.arange(len(dates1) + 15).reshape(-1,1)
forecast_c

In [None]:
x_c.shape, y_c.shape, forecast_c.shape 

**Training the model**

In [None]:
X_train_c, X_test_c, Y_train_c, Y_test_c = train_test_split(x_c, y_c, test_size = 0.10, shuffle = False)

In [None]:
poly_c = PolynomialFeatures(degree = 4)
X_train_poly_c = poly_c.fit_transform(X_train_c)
X_test_poly_c = poly_c.transform(X_test_c)

In [None]:
X_train_c.shape, X_test_c.shape, X_train_poly_c.shape, X_test_poly_c.shape

In [None]:
lr_c = LinearRegression()
lr_c.fit(X_train_poly_c, Y_train_c)

**The prediction**

In [None]:
poly_pred_c = lr_c.predict(X_test_poly_c)
plt.plot(poly_pred_c, linestyle = 'dashed')
plt.plot(Y_test_c)

In [None]:
print('MAE:', mean_absolute_error(Y_test_c, poly_pred_c))
print('MSE:', mean_squared_error(Y_test_c, poly_pred_c))
# RMSE is the square root of the MSE (not of the MAE)
print('RMSE:', np.sqrt(mean_squared_error(Y_test_c, poly_pred_c)))

In [None]:
X_train_all_c = poly_c.transform(forecast_c)
pred_all_c = lr_c.predict(X_train_all_c)

plt.plot(forecast_c[:-15], y_c, color='red')
plt.plot(forecast_c, pred_all_c, linestyle='dashed')
plt.title('CONFIRMED of COVID-19')
plt.xlabel('Days since 1/22/2020')
plt.ylabel('Number of confirmed')
plt.legend(['Confirmed cases', 'Predictions']);

# The prediction of the recovered

In [None]:
recovered = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_recovered.csv')
recovered

In [None]:
recovered[recovered['Country/Region'] == 'Brazil']

In [None]:
columns2 = recovered.keys()
columns2

In [None]:
recovered = recovered.loc[:, columns2[4]:columns2[-1]]
recovered

In [None]:
recovered['1/22/20'].sum(), recovered['8/12/20'].sum()

In [None]:
dates2 = recovered.keys()
y_r = []
for i in dates2:
    y_r.append(recovered[i].sum())

In [None]:
print(y_r)

In [None]:
y_r = np.array(y_r).reshape(-1,1)
y_r

In [None]:
x_r = np.arange(len(dates2)).reshape(-1,1)
x_r

In [None]:
forecast_r = np.arange(len(dates2) + 15).reshape(-1,1)
forecast_r

In [None]:
x_r.shape, y_r.shape, forecast_r.shape

**Training the model**

In [None]:
X_train_r, X_test_r, Y_train_r, Y_test_r = train_test_split(x_r, y_r, test_size = 0.30, shuffle = False)

In [None]:
poly_r = PolynomialFeatures(degree = 3)
X_train_poly_r = poly_r.fit_transform(X_train_r)
X_test_poly_r = poly_r.transform(X_test_r)

In [None]:
X_train_r.shape, X_test_r.shape, X_train_poly_r.shape, X_test_poly_r.shape

In [None]:
lr_r = LinearRegression()
lr_r.fit(X_train_poly_r, Y_train_r)

**The prediction**

In [None]:
poly_pred_r = lr_r.predict(X_test_poly_r)
plt.plot(poly_pred_r, linestyle = 'dashed')
plt.plot(Y_test_r)

In [None]:
print('MAE:', mean_absolute_error(Y_test_r, poly_pred_r))
print('MSE:', mean_squared_error(Y_test_r, poly_pred_r))
# RMSE is the square root of the MSE (not of the MAE)
print('RMSE:', np.sqrt(mean_squared_error(Y_test_r, poly_pred_r)))

In [None]:
X_train_all_r = poly_r.transform(forecast_r)
pred_all_r = lr_r.predict(X_train_all_r)

plt.plot(forecast_r[:-15], y_r, color='red')
plt.plot(forecast_r, pred_all_r, linestyle='dashed')
plt.title('RECOVERED of COVID-19')
plt.xlabel('Days since 1/22/2020')
plt.ylabel('Number of recovered')
plt.legend(['Recovered', 'Predictions']);

# CONCLUSIONS: 
COVID-19 is a new virus and there is still much to research and understand. These results show that, fortunately, more people have recovered than died, but the forecasts give no sign of a drop in the number of deaths or confirmed cases. Although people in the risk group are being infected, the results showed that a large part of those currently being infected are young, even though the peak is among the elderly aged 60 to 70 years. It is worth mentioning that many of these young people may be infected but asymptomatic, which increases the risk for anyone who has had contact with them and may have been infected. Many countries do not have the best healthcare network and treatment conditions:

* lack of professionals
* lack of suitable materials
* problems with testing

The poor conditions of health systems point to a further increase in deaths. In addition, many countries did not adhere to the quarantine, and others are reopening and have already relaxed it, which may contribute to a continued rise in the number of infected people.