This is Week 4 of Kaggle's COVID-19 forecasting series, the fourth competition round launched in the series and the follow-up to the Week 3 competition. All of the prior discussion forums have been migrated to this competition for continuity.

**Background**
The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) in an attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO).

**The Challenge**
Kaggle is launching companion COVID-19 forecasting challenges to help answer a subset of the NASEM/WHO questions. While the challenge involves forecasting confirmed cases and fatalities between April 15 and May 14 by region, the primary goal isn't only to produce accurate forecasts. It’s also to identify factors that appear to impact the transmission rate of COVID-19.


**Companies and Organizations**
There is also a call to action for companies and other organizations: If you have datasets that might be useful, please upload them to Kaggle’s dataset platform and reference them in this forum thread. That will make them accessible to those participating in this challenge and a resource to the wider scientific community.

**Acknowledgements**
Thanks to JHU CSSE for making the data available to the public and to the White House OSTP for pulling together the key open questions. The image comes from the Centers for Disease Control and Prevention (CDC).

![COVID-19](https://unsplash.com/photos/ci2rHJqgC1M)

**Importing the Libraries and Data**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style="ticks", color_codes=True)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

First, I downloaded the dataset in CSV format, which contains the confirmed COVID-19 cases and fatalities worldwide. The dataset comes from Kaggle.

I used the COVID-19 train and test datasets to draw as many meaningful insights as possible about the spread of the COVID-19 pandemic and to forecast it.

**Reading of the Dataset**

In [None]:
dt_train = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-4/train.csv') 
dt_test = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-4/test.csv')
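
The `Date` column is read as plain text by default. As a minimal sketch, assuming the column layout of `train.csv`, `parse_dates` can convert it while reading. The three rows below are a made-up sample, not real data.

```python
import io
import pandas as pd

# Hypothetical three-row sample using the column names assumed for train.csv
sample_csv = io.StringIO(
    "Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities\n"
    "1,,Afghanistan,2020-01-22,0.0,0.0\n"
    "2,,Afghanistan,2020-01-23,0.0,0.0\n"
    "3,Alabama,US,2020-01-22,0.0,0.0\n"
)

# parse_dates converts the Date column to datetime64 while reading
sample = pd.read_csv(sample_csv, parse_dates=["Date"])

# Check the resulting dtype and count the rows without a province
date_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Date"])
missing_provinces = int(sample["Province_State"].isna().sum())
print(date_is_datetime, missing_provinces)
```

With real dates in place, time-based plots and sorting no longer depend on string ordering.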

**Data Exploration**

In [None]:
display(dt_train.head())
display(dt_train.describe())
dt_train.info()  # info() prints its summary itself and returns None

In [None]:
viz = dt_train[["ConfirmedCases", "Fatalities"]]
viz.hist()
plt.show()

Now let's plot the features above against each other and against the dates, regions, and provinces/states to see how linear their relationships are.

In [None]:
plt.scatter(dt_train.ConfirmedCases, dt_train.Fatalities)
plt.xlabel("ConfirmedCases")
plt.ylabel("Fatalities")
plt.show()

In [None]:
# Parse the date strings so the x-axis is a real time axis
plt.scatter(pd.to_datetime(dt_train.Date), dt_train.Fatalities)
plt.xlabel("Date")
plt.ylabel("Fatalities")
plt.show()

In [None]:
plt.figure(figsize=(18,50))
plt.scatter(dt_train.ConfirmedCases, dt_train.Country_Region)
plt.xlabel("ConfirmedCases")
plt.ylabel("Country_Region")
plt.show()

**Extracting Data from United States of America (USA) and then Plotting it**

In [None]:
usa = dt_train[dt_train["Country_Region"]=="US"]

In [None]:
dt_train

In [None]:
usa.head()

The output above shows that states such as Alabama had no confirmed cases or fatalities as of 2020-01-22.

In [None]:
usa.tail()

However, when we check the month of April and the latest date in the dataset, we see a large number of confirmed cases, with Wyoming at 261.
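
Reading the latest figure off `tail()` only covers the last state in the table. A minimal sketch of pulling the most recent count for every state follows. The frame is a made-up stand-in for `usa`: only the value 261 for Wyoming comes from the text, while the dates and the Alabama figure are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the usa frame; only Wyoming's 261 is taken from the text
toy = pd.DataFrame({
    "Province_State": ["Alabama", "Alabama", "Wyoming", "Wyoming"],
    "Date": ["2020-01-22", "2020-04-10", "2020-01-22", "2020-04-10"],
    "ConfirmedCases": [0.0, 100.0, 0.0, 261.0],
})

# Sort by date, then keep the last (most recent) row of each state
latest = (
    toy.sort_values("Date")
       .groupby("Province_State")
       .tail(1)
       .set_index("Province_State")["ConfirmedCases"]
)
print(latest)
```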

**Plotting of the Graph**

This graph shows the number of confirmed cases in the USA.

In [None]:
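# usa has one row per state per day, so sum across states for each date
usa_cases = usa.groupby(pd.to_datetime(usa["Date"]))["ConfirmedCases"].sum()
plt.figure(figsize=(10,8))
plt.plot(usa_cases.index, usa_cases.values)
plt.xlabel("Time")
plt.ylabel("The Number of Confirmed Cases in USA")
plt.show()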

This graph shows the number of fatalities in the USA.

In [None]:
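# Sum the fatalities across states for each date before plotting against time
usa_deaths = usa.groupby(pd.to_datetime(usa["Date"]))["Fatalities"].sum()
plt.figure(figsize=(10,8))
plt.plot(usa_deaths.index, usa_deaths.values)
plt.xlabel("Time")
plt.ylabel("The Number of Fatalities in USA")
plt.show()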

The two graphs above show the growth of both the confirmed cases and the fatalities in the USA alone over time.
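
Since the counts are cumulative and each state has its own row per day, growth can also be read as a day-over-day increase of the national total. This is only a sketch on made-up numbers, with two hypothetical states `A` and `B`.

```python
import pandas as pd

# Hypothetical cumulative counts for two states over three days
toy = pd.DataFrame({
    "Province_State": ["A", "B", "A", "B", "A", "B"],
    "Date": ["2020-03-01", "2020-03-01", "2020-03-02",
             "2020-03-02", "2020-03-03", "2020-03-03"],
    "ConfirmedCases": [1.0, 2.0, 3.0, 5.0, 6.0, 9.0],
})

# National cumulative total per date: sum over the states
national = toy.groupby("Date")["ConfirmedCases"].sum()

# Day-over-day increase; the first day keeps its own total
daily_new = national.diff().fillna(national.iloc[0])
print(daily_new)
```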

In [None]:
tab_info = pd.DataFrame(dt_train.dtypes).T.rename(index={0:'column Type'})
null_count = pd.DataFrame(dt_train.isnull().sum()).T.rename(index={0:'null values (nb)'})
null_pct = pd.DataFrame(dt_train.isnull().sum()/dt_train.shape[0]*100).T.rename(index={0: 'null values (%)'})
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
tab_info = pd.concat([tab_info, null_count, null_pct])
tab_info

**Checking for the Number of States Represented in the Dataset**

In [None]:
usa_states = dt_train[dt_train["Country_Region"]=="US"]["Province_State"].unique()

In [None]:
def province(state, country):
    if state == "nan":
        return country
    return state

In [None]:
dt_train = dt_train.fillna("nan")

In [None]:
dt_train["Province_State"] = dt_train.apply(lambda x: province(x["Province_State"], x["Country_Region"]), axis = 1)
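
The same fill can be written without the `"nan"` placeholder and the row-wise `apply`. A minimal sketch on a made-up three-row frame, assuming missing provinces are real `NaN` values as they are right after `read_csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two countries without a province, one US state
toy = pd.DataFrame({
    "Province_State": [np.nan, "Alabama", np.nan],
    "Country_Region": ["Afghanistan", "US", "Italy"],
})

# Where Province_State is missing, take Country_Region from the same row
toy["Province_State"] = toy["Province_State"].fillna(toy["Country_Region"])
filled = toy["Province_State"].tolist()
print(filled)
```

This leaves the numeric columns untouched, whereas filling the whole frame with a string affects every column that contains a missing value.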

In [None]:
dt_train

In [None]:
print("Number of US states in the dataset:", len(usa_states))
usa_states

Now we have all the states in the USA that are represented in the dataset.

**Visual Forecasting of the Spread**

In [None]:
# Correlate only the numeric columns; corr() cannot use the string columns
corr = dt_train.select_dtypes(include="number").corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr, 
            annot=True, fmt=".3f",
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()

In [None]:
# Pair plot of the numeric columns only; pairwise scatter plots over the string
# columns (Date, Province_State, Country_Region) are slow and unreadable.
# dt_train was already filled above, so a separate fillna(0) variant gives the same plot.
sns.pairplot(dt_train, vars=["ConfirmedCases", "Fatalities"])

In [None]:
for name, group in dt_train.groupby(["Province_State", "Country_Region"]):
    plt.title(name)
    plt.scatter(range(len(group)), group["ConfirmedCases"])
    plt.show()
    break

**Creating a Model using the Training Dataset**

A simple linear regression model with coefficients B = (B1, ..., Bn) is fitted to minimize the residual sum of squares between the observed y values in the dataset and the y values predicted by the linear approximation.

In [None]:
# Using sklearn package to model the data
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(dt_train[['ConfirmedCases']])
train_y = np.asanyarray(dt_train[['Fatalities']])
regr.fit(train_x, train_y)

# The coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)

The coefficient and intercept are the two parameters that define the fitted line used for the estimates.
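
With a single feature, the least-squares slope and intercept have a closed form. The sketch below computes them by hand on made-up points, not on the COVID-19 data, to show what the coefficient and intercept represent.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 0.5 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 2.5, 4.5, 6.5, 8.5])

# Slope = covariance(x, y) / variance(x); the intercept makes the line pass through the means
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Residual sum of squares of the fitted line (zero here, because the points are collinear)
rss = np.sum((y - (intercept + slope * x)) ** 2)
print(slope, intercept, rss)
```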

**Plotting of the Model Output**

In [None]:
plt.scatter(dt_train.ConfirmedCases, dt_train.Fatalities,  color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("ConfirmedCases")
plt.ylabel("Fatalities")

**Evaluating the Model using the Test Dataset**

Here the actual values collected in the COVID-19 dataset are compared with the values predicted by the model in order to measure its accuracy. This gives more insight into the areas that require more attention.

I will use the MSE evaluation metric, alongside MAE and R2, to measure the accuracy of my model. MSE puts more weight on large errors and reflects how close the data are to the fitted regression line. Note that the evaluation cell further down builds its inputs from `dt_train`, so the scores are in-sample.
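
To illustrate why MSE emphasizes large errors, here is a small sketch with made-up predictions. Both sets have the same mean absolute error, but the one with a single large miss has a much higher MSE.

```python
import numpy as np

y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_small = np.array([12.0, 8.0, 12.0, 8.0])      # four errors of size 2
y_outlier = np.array([10.0, 10.0, 10.0, 18.0])  # one error of size 8

def mae(a, b):
    """Mean absolute error."""
    return float(np.mean(np.abs(a - b)))

def mse(a, b):
    """Mean squared error."""
    return float(np.mean((a - b) ** 2))

print(mae(y_true, y_small), mse(y_true, y_small))      # 2.0 4.0
print(mae(y_true, y_outlier), mse(y_true, y_outlier))  # 2.0 16.0
```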

In [None]:
display(dt_test.head())
display(dt_test.describe())
dt_test.info()  # info() prints its summary itself and returns None

**Using the Model to Predict the Unknown Values of the Potential Spread of the COVID-19**

In [None]:
from sklearn.metrics import r2_score

# Note: these arrays come from dt_train, so the metrics below are in-sample
test_x = np.asanyarray(dt_train[['ConfirmedCases']])
test_y = np.asanyarray(dt_train[['Fatalities']])
test_y_ = regr.predict(test_x)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
# r2_score expects the true values first, then the predictions
print("R2-score: %.2f" % r2_score(test_y, test_y_))
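
Scores computed on the rows the model was fitted on tend to look optimistic. A hedged sketch of a hold-out check follows, using synthetic data and `np.polyfit` in place of the notebook's `LinearRegression`. The 80/20 split, the seed, and the data-generating line are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: y depends linearly on x plus noise (hypothetical values)
x = rng.uniform(0, 1000, size=200)
y = 0.05 * x + rng.normal(0, 5, size=200)

# Shuffle the row indices and hold out the last 20% for evaluation
idx = rng.permutation(len(x))
train_idx, test_idx = idx[:160], idx[160:]

# Fit a degree-1 polynomial (a straight line) on the training rows only
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
pred = intercept + slope * x[test_idx]

# Score on the held-out rows
mae = np.mean(np.abs(pred - y[test_idx]))
mse = np.mean((pred - y[test_idx]) ** 2)
ss_res = np.sum((y[test_idx] - pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print("MAE: %.2f  MSE: %.2f  R2: %.2f" % (mae, mse, r2))
```

For time-ordered data such as these case counts, holding out the most recent dates instead of random rows may be the more realistic check.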

In [None]:
dt_train.to_csv('submission.csv', index = False)
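
The cell above writes the training frame unchanged. A submission would normally hold one prediction row per row of `test.csv`. The sketch below assumes the expected columns are `ForecastId`, `ConfirmedCases`, and `Fatalities`, which this notebook does not confirm, and uses a made-up test frame with placeholder predictions.

```python
import pandas as pd

# Hypothetical stand-in for dt_test: one row per ForecastId
toy_test = pd.DataFrame({
    "ForecastId": [1, 2, 3],
    "Province_State": ["Alabama", "Alabama", "Wyoming"],
    "Country_Region": ["US", "US", "US"],
    "Date": ["2020-04-02", "2020-04-03", "2020-04-02"],
})

# Placeholder predictions; in the notebook these would come from a fitted model
pred_cases = [100.0, 120.0, 30.0]
pred_fatalities = [2.0, 3.0, 0.0]

# Assumed submission layout: the id plus the two predicted columns
submission = pd.DataFrame({
    "ForecastId": toy_test["ForecastId"],
    "ConfirmedCases": pred_cases,
    "Fatalities": pred_fatalities,
})

# Without a path, to_csv returns the CSV text; pass 'submission.csv' to write the file
csv_text = submission.to_csv(index=False)
print(csv_text)
```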

**Thank you and stay safe!**