<font size="4">Section 1: Data Preparation</font>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# Loading datasets required for analysis

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)
import warnings # current version of seaborn generates a bunch of warnings that we'll ignore
warnings.filterwarnings("ignore")

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input/"))

Step 1: Reading and understanding the data

In [None]:
full_table = pd.read_csv('../input/corona-virus-report/covid_19_clean_complete.csv', 
                         parse_dates=['Date'])
full_table.head()

The output above shows the column headers and the first five rows of the table.

Step 2: Preparing the data

In [None]:
# Defining COVID-19 cases as per classifications 
cases = ['Confirmed', 'Deaths', 'Recovered', 'Active']

# Defining Active Case: Active Case = confirmed - deaths - recovered
full_table['Active'] = full_table['Confirmed'] - full_table['Deaths'] - full_table['Recovered']

# Renaming Mainland china as China in the data table
full_table['Country/Region'] = full_table['Country/Region'].replace('Mainland China', 'China')

# filling missing values 
full_table[['Province/State']] = full_table[['Province/State']].fillna('')
full_table[cases] = full_table[cases].fillna(0)

# cases in the ships
ship = full_table[full_table['Province/State'].str.contains('Grand Princess')|full_table['Country/Region'].str.contains('Cruise Ship')]

# china and the row
china = full_table[full_table['Country/Region']=='China']
row = full_table[full_table['Country/Region']!='China']

# latest
full_latest = full_table[full_table['Date'] == max(full_table['Date'])].reset_index()
china_latest = full_latest[full_latest['Country/Region']=='China']
row_latest = full_latest[full_latest['Country/Region']!='China']

# latest condensed: totals per country (or per province for China) on the latest date
# columns are selected with a list, since tuple-style selection fails on recent pandas versions
full_latest_grouped = full_latest.groupby('Country/Region')[cases].sum().reset_index()
china_latest_grouped = china_latest.groupby('Province/State')[cases].sum().reset_index()
row_latest_grouped = row_latest.groupby('Country/Region')[cases].sum().reset_index()

In [None]:
df = pd.DataFrame(full_table)

Step 3: Creating a consolidated table that gives the country-wise totals for each case type

In [None]:
temp = full_table.groupby(['Country/Region', 'Province/State'])[cases].max()

In [None]:
temp = full_table.groupby('Date')[cases].sum().reset_index()
temp = temp[temp['Date']==max(temp['Date'])].reset_index(drop=True)
temp.style.background_gradient(cmap='Pastel1')

In [None]:
temp_f = full_latest_grouped.sort_values(by='Confirmed', ascending=False)
temp_f = temp_f.reset_index(drop=True)


**Elle's Analysis**

Elle: Rename the column to 'Country' and 'US' to 'United States' so that the names match those in the new data tables.

In [None]:
temp_f_pd = pd.DataFrame(temp_f)
temp_f_pd.rename(columns = {'Country/Region':'Country'}, inplace = True)
# assign the result back; an in-place replace on a selected column may not update the frame
temp_f_pd['Country'] = temp_f_pd['Country'].replace({'US': 'United States'})

Covid = temp_f_pd
df1 = pd.DataFrame(Covid)

Covid.head(10)

Elle: Add in the 2020 Pollution Index per country data set

In [None]:
pollution = pd.read_csv('../input/pollution-data/pollution_data.csv')
df2 = pd.DataFrame(pollution)

pollution.head(10)

Add in population data to create deaths per capita

In [None]:
population = pd.read_csv('../input/population-by-country-2020/population_by_country_2020.csv')
population.rename(columns = {'Country (or dependency)':'Country'}, inplace = True)
population.rename(columns = {'Population (2020)':'Population'}, inplace = True)
df3 = pd.DataFrame(population)

population.head(10)

Elle: Merge the two tables with an inner join. A right join proved unsuccessful because several countries are named differently in the two tables, and I was too lazy to rename all of them.

In [None]:
covid_pol = pd.merge(df1, df2, on="Country", how="inner")
covid_pol.head(10)
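
One way to see which country names fail to match, instead of letting the inner join drop them silently, is an outer merge with `indicator=True`. The sketch below uses two small made-up frames: `covid_demo` and `pollution_demo` are hypothetical stand-ins for `df1` and `df2`, and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (Covid) and df2 (pollution); values are made up.
covid_demo = pd.DataFrame({
    "Country": ["United States", "Italy", "South Korea"],
    "Confirmed": [100, 80, 60],
})
pollution_demo = pd.DataFrame({
    "Country": ["United States", "Italy", "Korea, South"],
    "Pollution Index": [40.0, 55.0, 60.0],
})

# An outer merge keeps every row; the "_merge" column records which table each row came from.
check = pd.merge(covid_demo, pollution_demo, on="Country", how="outer", indicator=True)

# Rows not marked "both" are names that exist in only one of the two tables.
unmatched = check.loc[check["_merge"] != "both", ["Country", "_merge"]]
print(unmatched)
```

The names listed in `unmatched` are the ones that would need renaming before an inner join could keep them.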

Merge with the population data set

In [None]:
dataset = pd.merge(covid_pol, df3, on="Country", how="inner")

dataset.head(10)

Create the "Cases per capita" and "Deaths per capita" variables.
*Note: Both variables are multiplied by 1000, so they should be interpreted as cases and deaths per 1000 people.*

In [None]:
cases_pc = ((dataset.Confirmed)/(dataset.Population))*1000
deaths_pc = ((dataset.Deaths)/(dataset.Population))*1000
dataset = dataset.assign(CasesperCap=cases_pc)
dataset = dataset.assign(DeathsperCap=deaths_pc)
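
As a quick check on the scaling, the same calculation on a single made-up row shows how to read the result. The numbers below are invented and are not taken from the dataset; `demo` is a hypothetical frame.

```python
import pandas as pd

# Made-up example row: 500 confirmed cases and 10 deaths in a population of 1,000,000.
demo = pd.DataFrame({"Confirmed": [500], "Deaths": [10], "Population": [1_000_000]})

# Same formula as above: divide by population, then scale to "per 1000 people".
demo = demo.assign(
    CasesperCap=demo["Confirmed"] / demo["Population"] * 1000,
    DeathsperCap=demo["Deaths"] / demo["Population"] * 1000,
)
print(demo[["CasesperCap", "DeathsperCap"]])
```

Here `CasesperCap` comes out at 0.5, meaning 0.5 cases per 1000 people, or one case per 2000 people.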

In [None]:
dataset = dataset.sort_values(by='CasesperCap', ascending=False)
dataset = dataset.reset_index(drop=True)
dataset.head(26).style.background_gradient(cmap='Reds')

Elle: Create a scatter plot to eyeball correlation. Need to annotate the points. 

In [None]:
dataset.plot(x = 'Pollution Index',y = 'CasesperCap',style ='o', alpha = 0.75 )
plt.title('Covid Cases vs Pollution level')  
plt.xlabel('Pollution Index')  
plt.ylabel('No. of Cases per 1000 people')  
plt.gcf().set_size_inches((20, 15))   
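
The annotation mentioned above could be done with `Axes.annotate`, one call per point. This is a minimal sketch on made-up data: `demo` and its values are hypothetical, and the column names are assumed to match those in `dataset`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; column names mirror those used in `dataset`.
demo = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Pollution Index": [30.0, 55.0, 80.0],
    "CasesperCap": [0.9, 0.4, 0.1],
})

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(demo["Pollution Index"], demo["CasesperCap"], alpha=0.75)

# Label each point with its country name, offset a little so the text does not cover the marker.
for _, r in demo.iterrows():
    ax.annotate(r["Country"], (r["Pollution Index"], r["CasesperCap"]),
                xytext=(5, 5), textcoords="offset points")

ax.set_xlabel("Pollution Index")
ax.set_ylabel("No. of Cases per 1000 people")
```

With many countries the labels will overlap, so it may be more readable to annotate only the points that stand out.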


There does not appear to be a correlation at first glance, but there are a few 'outliers'.
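
The visual impression could also be checked numerically. The sketch below, on made-up numbers, computes Pearson's coefficient alongside a rank-based (Spearman) coefficient, which is less sensitive to a few extreme points. `demo` is a hypothetical frame whose column names are assumed to match `dataset`.

```python
import pandas as pd

# Made-up values for illustration; the last point plays the role of an outlier.
demo = pd.DataFrame({
    "Pollution Index": [20.0, 35.0, 50.0, 65.0, 80.0, 30.0],
    "CasesperCap": [0.10, 0.12, 0.09, 0.11, 0.10, 1.50],
})

# Pearson correlation measures linear association and is sensitive to outliers.
pearson = demo["Pollution Index"].corr(demo["CasesperCap"])

# The Pearson correlation of the ranks is Spearman's rank correlation.
spearman = demo["Pollution Index"].rank().corr(demo["CasesperCap"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the real data the same two calls would be applied to `dataset['Pollution Index']` and `dataset['CasesperCap']`.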

**Linear Regression Analysis**: Let's see whether 'Pollution Index' (independent variable/x) has any effect on the Covid Cases per 1000 people (dependent variable/y).

In [None]:
import seaborn as seabornInstance 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline


Show the distribution of Cases per Capita (per 1000 people)

In [None]:
plt.figure(figsize=(15,10))
plt.tight_layout()
seabornInstance.distplot(dataset['CasesperCap'])

The distribution looks to be centred at roughly 0.1 cases per 1000 people.

Define the X and y variables, split the data so that 80% goes to the training set and 20% to the test set, and then train the model using the code below.

The test_size variable is where we actually specify the proportion of the test set.

In [None]:
X = dataset['Pollution Index'].values.reshape(-1,1)
y = dataset['CasesperCap'].values.reshape(-1,1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()  
regressor.fit(X_train, y_train) #training the algorithm
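
To see what `test_size=0.2` does in isolation, here is a small sketch on ten made-up observations. `X_demo` and `y_demo` are hypothetical arrays, not part of the analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up observations, shaped (n_samples, 1) like X and y above.
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = (2.0 * np.arange(10)).reshape(-1, 1)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle for reproducibility.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(Xd_train.shape, Xd_test.shape)
```

Eight rows end up in the training set and two in the test set, and each `y` value stays paired with its `X` value after the shuffle.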

In [None]:
#To retrieve the intercept:
print(regressor.intercept_)

#For retrieving the slope:
print(regressor.coef_)

The coefficient is negative, which means that as the pollution index increases, the Cases per Capita decreases. This runs against the hypothesis.

Now compare the actual output values for X_test with the predicted values by executing the following script:

In [None]:
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df

In [None]:
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

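The Mean Absolute Error (MAE) is the average of the absolute prediction errors. The Mean Squared Error (MSE) is the average of the squared errors, so it tells you how close the regression line is to the points while penalising large errors more heavily. The Root Mean Squared Error (RMSE) is the square root of the MSE; it is in the same units as y and equals the standard deviation of the residuals (prediction errors) when their mean is zero.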
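
The three metrics can be reproduced by hand with NumPy, which makes the definitions concrete. The values below are made up for illustration, and `y_true` and `y_hat` are hypothetical names.

```python
import numpy as np

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat          # residuals (prediction errors)
mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # square root of MSE, back in the units of y

print(mae, mse, rmse)
```

For these values the MAE is 0.5 and the MSE is 0.375; `metrics.mean_absolute_error` and `metrics.mean_squared_error` should return the same numbers for the same inputs.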

***Linear Regression*** with Deaths per 1000 people as the dependent variable

In [None]:
plt.figure(figsize=(15,10))
plt.tight_layout()
seabornInstance.distplot(dataset['DeathsperCap'])

In [None]:
X = dataset['Pollution Index'].values.reshape(-1,1)
y = dataset['DeathsperCap'].values.reshape(-1,1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()  
regressor.fit(X_train, y_train) #training the algorithm

In [None]:
#To retrieve the intercept:
print(regressor.intercept_)

#For retrieving the slope:
print(regressor.coef_)

In [None]:
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df

In [None]:
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

Visually, this model appears to fit the test data better than the model for cases per capita.

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
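
Since cases and deaths per capita are on different scales, the error values printed for the two models are not directly comparable. A scale-free measure such as the coefficient of determination (R²) is one possible way to compare them. The sketch below computes it by hand on made-up numbers; `r_squared` is a hypothetical helper, and `metrics.r2_score(y_test, y_pred)` in scikit-learn computes the same quantity.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_hat = np.asarray(y_hat, dtype=float).ravel()
    ss_res = np.sum((y_true - y_hat) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up actual and predicted values for illustration.
r2_demo = r_squared([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(round(r2_demo, 3))
```

A value close to 1 means the model explains most of the variation in y, while a value near zero or below means it does no better than always predicting the mean.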