# Checking Correlation between variables - testing theories

## Estimating the correlation between two variables with a contingency table and a chi-squared test

We estimate the correlation between the total number of Covid-19 deaths and smoking prevalence (female and male smoker rates combined).
Note that the exact same analysis could be done for other variables, for example cardiovascular death rate and diabetes prevalence.

In [9]:
import os
import pandas as pd
from IPython.display import display

pd.set_option('display.max_columns', None)

df_indexed = pd.read_csv(r"../../data/owid-covid-data.csv", index_col='continent')

display(df_indexed.head(1))

continent,iso_code,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
Asia,AFG,Afghanistan,2020-02-24,1.0,1.0,,,,,0.026,0.026,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511


In [10]:
# Select latest available data
df_indexed = df_indexed.loc[df_indexed['date'] == '2021-04-24']

countries = df_indexed['location']
total_deaths = df_indexed['total_deaths']

# Combined smoking indicator: sum of the female and male smoker columns
n_smokers = df_indexed['female_smokers'] + df_indexed['male_smokers']

In [11]:
df_bis_smoking = pd.DataFrame({'country':countries,
                       'total_deaths': total_deaths,
                       'n_smokers': n_smokers}).dropna()

df_bis_smoking = df_bis_smoking.reset_index(drop=True)

df_bis_smoking.set_index(["country"], inplace = True,
                    append = True, drop = True)

df_bis_smoking.head()

,country,total_deaths,n_smokers
0,Albania,2372.0,58.3
1,Algeria,3198.0,31.1
2,Andorra,124.0,66.8
3,Argentina,61474.0,43.9
4,Armenia,4001.0,53.6


## Pearson's correlation coefficient


In [12]:
# Pearson's r measures the linear correlation between two variables:
# values close to +1 or -1 indicate a strong linear relationship, values close to 0 a weak one.
df_bis_smoking.corr()

,total_deaths,n_smokers
total_deaths,1.0,-0.005351
n_smokers,-0.005351,1.0


The coefficient is about -0.005, which is essentially zero: there is practically no linear correlation between the combined smoking rate and the total number of Covid-19 deaths.
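As a reminder of what `.corr()` computes by default, here is a minimal sketch of Pearson's r written out from its definition. The helper `pearson_r` and the toy values are hypothetical and only serve as illustration; they are not taken from the OWID dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Hypothetical toy values (not from the dataset)
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 6.0, 8.0, 10.0]  # exactly linear in toy_x

r_perfect = pearson_r(toy_x, toy_y)        # perfectly increasing relationship
r_inverse = pearson_r(toy_x, toy_y[::-1])  # perfectly decreasing relationship
print(r_perfect, r_inverse)
```

A value near zero, as in the notebook, only rules out a linear relationship; it says nothing about other kinds of dependence.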

## Contingency table of frequencies

In [13]:
df_bis_smoking['total_deaths_binarized'] = (df_bis_smoking['total_deaths'] > df_bis_smoking['total_deaths'].median())
df_bis_smoking['n_smokers_binarized'] = (df_bis_smoking['n_smokers'] > df_bis_smoking['n_smokers'].median())

pd.crosstab(df_bis_smoking['total_deaths_binarized'], df_bis_smoking['n_smokers_binarized'])

n_smokers_binarized,False,True
total_deaths_binarized,,
False,48,22
True,23,46
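The binarization step can be easier to follow on a handful of rows. The sketch below repeats the same median split and cross-tabulation on a made-up frame; the column names `deaths` and `smokers` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (not from the dataset)
toy = pd.DataFrame({
    'deaths':  [10, 200, 35, 500, 80, 900],
    'smokers': [20.0, 55.0, 50.0, 60.0, 25.0, 30.0],
})

# True when the value lies above the column median, False otherwise
toy['deaths_high'] = toy['deaths'] > toy['deaths'].median()
toy['smokers_high'] = toy['smokers'] > toy['smokers'].median()

# Count how many rows fall into each of the four True/False combinations
table = pd.crosstab(toy['deaths_high'], toy['smokers_high'])
print(table)
```

Rows on the diagonal (both low or both high) support an association; rows off the diagonal count against it.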


## Chi-squared test: is the association between the variables statistically significant?

In [14]:
import scipy.stats as st

contingency_table = pd.crosstab(df_bis_smoking['total_deaths_binarized'], df_bis_smoking['n_smokers_binarized'])
st.chi2_contingency(contingency_table)

(15.885715775900074,
 6.728463101347897e-05,
 1,
 array([[35.75539568, 34.24460432],
        [35.24460432, 33.75539568]]))

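## Conclusion
The second value returned by the chi-squared test is the p-value: the probability of observing an association at least as strong as this one if the null hypothesis (in this case, that smoking and the number of deaths are independent) were true. Since it is much lower than 0.05, we can reject the null hypothesis: the data show a statistically significant association between total deaths and smoking once both are binarized at their medians. Note that this is an association in this dataset as indicated by this test, not proof of causation, and that `total_deaths` is an absolute count rather than a proportion of the population.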
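To make the output of `chi2_contingency` less of a black box, the sketch below recomputes the statistic and p-value by hand from the four counts in the contingency table above. It assumes Yates' continuity correction, which is scipy's default for tables with one degree of freedom, and uses the fact that for one degree of freedom the chi-squared tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
import math

# Observed counts copied from the contingency table above
observed = [[48, 22],
            [23, 46]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts if the two variables were independent:
# row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic with Yates' continuity correction
# (assumed here because it is scipy's default for 2x2 tables)
chi2 = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Tail probability of a chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

print(chi2, p_value)
print(expected)
```

The expected counts of roughly 35 per cell are what independence would predict; the observed 48 and 46 on the diagonal are what drive the large statistic.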

# Multivariate Linear Regression
Modelling the relationship between a dependent variable and one or more independent variables.

In [15]:
# Same relative path as in the first cell, so the notebook does not depend on one machine's home directory
df_indexed = pd.read_csv(r"../../data/owid-covid-data.csv", index_col='continent')

# Select latest available data
df_indexed = df_indexed.loc[df_indexed['date'] == '2021-04-24']

countries = df_indexed['location']
total_deaths = df_indexed['total_deaths']
diabetes = df_indexed['diabetes_prevalence']
cvd_rate = df_indexed['cardiovasc_death_rate']
n_smokers = df_indexed['female_smokers'] + df_indexed['male_smokers']

# Keep only the countries with a value for every variable
df_bis = pd.DataFrame({'country': countries,
                       'total_deaths': total_deaths,
                       'diabetes_prevalence': diabetes,
                       'cvd_rate': cvd_rate,
                       'n_smokers': n_smokers}).dropna()

df_bis = df_bis.reset_index(drop=True)

df_bis.set_index(["country"], inplace=True,
                 append=True, drop=True)

display(df_bis.head())

In [None]:
display(df_bis.describe())

# Short summary of the data: count, mean, standard deviation, min, quartiles and max for each metric.

## (Multivariate) Linear regression: checking the normality assumption

In [None]:
# Check normality with a q-q plot, which compares the quantiles of our data
# with those of a Gaussian (normal) distribution.
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

# One q-q plot per predictor
for column in ['diabetes_prevalence', 'cvd_rate', 'n_smokers']:
    qqplot(df_bis[column], line='s')
    pyplot.title(column)
    pyplot.show()

# Judgement made by visual inspection of the plots, not by a formal test
print('The data can be considered normally distributed.')
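For readers unfamiliar with q-q plots, the following sketch shows the idea behind them using only the standard library: sort the sample, then pair each value with the quantile a normal distribution with the same mean and standard deviation would have at that rank. The helper `qq_points`, the toy sample and the plotting positions `(i - 0.5) / n` are assumptions made for illustration; statsmodels may use a different convention internally.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal q-q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Reference normal distribution fitted to the sample
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n are one common convention (assumed here)
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical toy sample (not from the dataset)
toy_sample = [5.0, 1.0, 3.0, 2.0, 4.0]
points = qq_points(toy_sample)

for theoretical_q, observed_q in points:
    print(f"{theoretical_q:6.3f}  {observed_q:6.3f}")
```

If the sample is close to normal, the pairs fall near a straight line; systematic curvature at the ends suggests skew or heavy tails.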

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

y = np.log(df_bis['total_deaths'])
x = df_bis[['diabetes_prevalence', 'cvd_rate', 'n_smokers']]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# define model
linear_regression = LinearRegression()

# fitting the model
linear_regression.fit(x_train,y_train)

print(linear_regression.intercept_)
print(linear_regression.coef_)

# predict with the data
y_pred = linear_regression.predict(x_test)

print()
print('Actual values:')
display(y_test.head(8))
print()
print('Predicted values:')
display(y_pred[0:8])
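Because the target is `np.log(total_deaths)`, both the predictions and the coefficients live on the log scale. The sketch below illustrates what that means on hypothetical, noise-free toy data with a single predictor: a fitted slope `b` corresponds to multiplying the untransformed target by `exp(b)` for each one-unit increase, and predictions are brought back to the original scale with `np.exp`.

```python
import numpy as np

# Hypothetical, noise-free toy data: y is multiplied by exp(0.5) per unit of x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.exp(1.0 + 0.5 * x)

# Fit a straight line to log(y), mirroring the log-transformed target above
slope, intercept = np.polyfit(x, np.log(y), deg=1)

# Interpretation on the original scale
multiplier = np.exp(slope)                   # factor applied to y per unit of x
predicted_y = np.exp(intercept + slope * x)  # back-transformed predictions

print(slope, intercept, multiplier)
```

In the notebook, the same back-transformation would be `np.exp(y_pred)` if predicted death counts rather than log counts are wanted.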

In [None]:
import seaborn as sb

sb.regplot(x=y_test, y=y_pred, ci=None, color="b")

The result is anecdotal: the fitted line shows underfitting. Still, based on this model and the data, one can tentatively claim that the three variables (cardiovascular disease, diabetes and smoking) have a positive influence on the total number of Covid-19 deaths worldwide.
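Underfitting can be quantified with the coefficient of determination R². The sketch below writes it out from its definition; `r_squared` is a hypothetical helper and the toy values are made up. In the notebook itself, `sklearn.metrics.r2_score(y_test, y_pred)` should give the same quantity for the fitted model.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values (not from the dataset)
toy_actual = [2.0, 4.0, 6.0, 8.0]

perfect = r_squared(toy_actual, toy_actual)  # predictions match exactly
baseline = r_squared(toy_actual, [5.0] * 4)  # always predicting the mean
print(perfect, baseline)
```

An R² near 0 on the test set means the model does little better than always predicting the mean, which is what an underfitted model tends to look like.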