Problem Statement -
A common perception about COVID-19 is that warm climates are more resistant to the corona outbreak.

Approach - 
We will verify this using hypothesis testing.
Null Hypothesis  :  Temperature doesn't affect the COVID-19 outbreak
Alternate Hypothesis : Temperature does affect the COVID-19 outbreak


Note  - 
We add the "Temperature" and "Humidity" features for each row's Latitude and Longitude using the Python weather API client Pyweatherbit.
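The enrichment step itself is not shown in this notebook. As a rough sketch of how a per-row lookup could be attached to the dataframe, the example below takes the lookup as a parameter. `add_weather`, `fetch_weather` and `fake_fetch` are hypothetical names introduced here for illustration; the real Pyweatherbit call and the fields it returns are not reproduced, and the stand-in values are made up.

```python
import pandas as pd

def add_weather(df, fetch_weather):
    """Attach Temperature and Humidity columns looked up from each row's coordinates.

    `fetch_weather` is a hypothetical callable (lat, lon) -> (temperature, humidity);
    in practice it would wrap the weather API client.
    """
    readings = [fetch_weather(lat, lon)
                for lat, lon in zip(df["Latitude"], df["Longitude"])]
    out = df.copy()
    out["Temperature"] = [t for t, _ in readings]
    out["Humidity"] = [h for _, h in readings]
    return out

# Stand-in fetcher with made-up values, used only to show the call shape.
def fake_fetch(lat, lon):
    return 20.0 - abs(lat) / 5.0, 50

sample = pd.DataFrame({"Latitude": [0.0, 50.0], "Longitude": [10.0, 20.0]})
enriched = add_weather(sample, fake_fetch)
print(enriched)
```

Passing the fetcher in as an argument keeps the dataframe logic testable without network access or an API key.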

In [1]:
import pandas as pd
import numpy as np

corona = pd.read_csv('Corona_Updated.csv')

In [2]:
corona.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude,Temprature,Humidity,Temp_Cat,Humid_Cat
0,Hubei,Mainland China,2020-03-10T15:13:05,67760,3024,47743,30.9756,112.2707,12.5,86,1,1
1,,Italy,2020-03-10T17:53:02,10149,631,724,43.0,12.0,12.9,64,1,1
2,,Iran (Islamic Republic of),2020-03-10T19:13:20,8042,291,2731,32.0,53.0,11.9,9,0,0
3,,Republic of Korea,2020-03-10T19:13:20,7513,54,247,36.0,128.0,4.9,41,0,0
4,,France,2020-03-10T18:53:02,1784,33,12,47.0,2.0,11.9,93,0,0


In [3]:
corona.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Confirmed,206.0,575.640777,4822.697784,0.0,3.0,12.0,75.75,67760.0
Deaths,206.0,20.68932,215.794845,0.0,0.0,0.0,1.0,3024.0
Recovered,206.0,312.640777,3332.764713,0.0,0.0,0.0,4.0,47743.0
Latitude,206.0,31.184989,21.305149,-41.4545,25.0692,36.03055,43.87025,64.9631
Longitude,206.0,11.75203,84.576291,-157.4983,-74.841325,15.23425,101.363375,174.886
Temprature,206.0,12.161165,10.229763,-21.9,6.1,11.75,20.375,33.1
Humidity,206.0,67.728155,21.780588,6.0,55.0,73.0,84.0,98.0
Temp_Cat,206.0,0.470874,0.500367,0.0,0.0,0.0,1.0,1.0
Humid_Cat,206.0,0.470874,0.500367,0.0,0.0,0.0,1.0,1.0


In [4]:
# Correcting the misspelled column name "Temprature" to "Temperature"
corona.rename(columns ={'Temprature' : 'Temperature'},inplace = True)

In [5]:
# We treat a Temperature below 24 as Cold Climate (0) and 24 or above as Hot Climate (1)
corona['Temp_Cat'] = corona['Temperature'].apply(lambda x: 0 if x <24 else 1)
corona.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude,Temperature,Humidity,Temp_Cat,Humid_Cat
0,Hubei,Mainland China,2020-03-10T15:13:05,67760,3024,47743,30.9756,112.2707,12.5,86,0,1
1,,Italy,2020-03-10T17:53:02,10149,631,724,43.0,12.0,12.9,64,0,1
2,,Iran (Islamic Republic of),2020-03-10T19:13:20,8042,291,2731,32.0,53.0,11.9,9,0,0
3,,Republic of Korea,2020-03-10T19:13:20,7513,54,247,36.0,128.0,4.9,41,0,0
4,,France,2020-03-10T18:53:02,1784,33,12,47.0,2.0,11.9,93,0,0
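The same binning can also be written without a row-wise lambda. The sketch below uses a small made-up temperature series (not rows from the dataset) to show that `np.where` produces the same categories as the `apply` call, including at the boundary value 24.

```python
import numpy as np
import pandas as pd

# Illustrative values only; 24.0 sits exactly on the threshold.
temps = pd.Series([12.5, 23.9, 24.0, 33.1], name="Temperature")

# Row-wise version, as used in the notebook.
via_apply = temps.apply(lambda x: 0 if x < 24 else 1)

# Vectorized version: 0 where the temperature is below 24, otherwise 1.
via_where = pd.Series(np.where(temps < 24, 0, 1), index=temps.index)

print(via_where.tolist())  # [0, 0, 1, 1]
```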


In [6]:
# Checking whether any NaN values are present in the dataframe
corona.isna().sum()
# The output below shows 107 missing Province/State values. This is a categorical column and is not relevant
# to our problem statement, because we are focusing on the Temperature category.
# If we were analysing the data by Province/State, it would be too incomplete to support a conclusion.
# Since there are no null values in the other columns, we are good to go.

Province/State    107
Country/Region      0
Last Update         0
Confirmed           0
Deaths              0
Recovered           0
Latitude            0
Longitude           0
Temperature         0
Humidity            0
Temp_Cat            0
Humid_Cat           0
dtype: int64

In [7]:
# Only picking out the relevant Columns 
corona_t = corona[['Confirmed','Temp_Cat']]
corona_t

Unnamed: 0,Confirmed,Temp_Cat
0,67760,0
1,10149,0
2,8042,0
3,7513,0
4,1784,0
...,...,...
201,0,0
202,0,0
203,0,0
204,0,0


In [8]:
# To apply hypothesis testing,
# we will use a two-sample Z-test on this data set,
# hence a function to calculate the z-statistic and the p-value.

def TwoSampleZtest(X1,X2,sigma1,sigma2,N1,N2):
    # X1, X2: sample means; sigma1, sigma2: standard deviations; N1, N2: sample sizes
    from numpy import sqrt
    from scipy.stats import norm
    
    ovr_sigma = sqrt(sigma1**2/N1 + sigma2**2/N2) # standard error of the difference in means
    z = (X1-X2)/ovr_sigma
    pval = 2*norm.sf(abs(z)) # Two-tailed p-value (sf = 1 - cdf)
    
    return z,pval
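A quick sanity check of the formula on hand-picked numbers can catch sign or scaling mistakes before the test is run on real data. The sketch below restates the calculation under the hypothetical name `two_sample_z` so that it runs on its own; the inputs are illustrative and are not taken from the dataset.

```python
from math import sqrt
from scipy.stats import norm

def two_sample_z(x1, x2, sigma1, sigma2, n1, n2):
    """Two-sample Z statistic and two-tailed p-value from summary statistics."""
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # standard error of the mean difference
    z = (x1 - x2) / se
    return z, 2 * norm.sf(abs(z))

# Equal means: the statistic is 0 and the two-tailed p-value is 1.
z_same, p_same = two_sample_z(5.0, 5.0, 2.0, 2.0, 30, 30)

# Means one standard error apart: sqrt(1/2 + 1/2) = 1, so z = 1.
z_one, p_one = two_sample_z(1.0, 0.0, 1.0, 1.0, 2, 2)

print(z_same, p_same)
print(z_one, round(p_one, 4))
```

Swapping the two groups flips the sign of z but leaves the two-tailed p-value unchanged.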

In [9]:
# Data preparation
d1 = corona_t[(corona_t['Temp_Cat']==1)]['Confirmed']
d2 = corona_t[(corona_t['Temp_Cat']==0)]['Confirmed']

In [10]:
# Means and standard deviations of the two samples
m1,m2 = d1.mean(),d2.mean()
sd1,sd2 = d1.std(),d2.std()
n1,n2 = d1.shape[0] , d2.shape[0]
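The per-group mean, standard deviation and size can also be collected in a single table, which makes it easier to eyeball the inputs to the test. This is a sketch on a small made-up frame (`toy` is a hypothetical name, not the notebook's data); `groupby(...).agg` uses the sample standard deviation, the same default as `Series.std()`.

```python
import pandas as pd

# Illustrative data only: three "cold" rows (0) and two "hot" rows (1).
toy = pd.DataFrame({
    "Confirmed": [10, 20, 30, 5, 15],
    "Temp_Cat": [0, 0, 0, 1, 1],
})

# One row per category, with the three inputs the Z-test needs.
summary = toy.groupby("Temp_Cat")["Confirmed"].agg(["mean", "std", "count"])
print(summary)
```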

In [12]:
# Running the Z-test
z,p = TwoSampleZtest(m1,m2,sd1,sd2,n1,n2)
z_score = np.round(z,8)
p_val = np.round(p,6)

# Hypothesis testing output
if p_val < 0.05:
    Hypothesis_status = "Reject Null Hypothesis : Significant"
else:
    Hypothesis_status = "Do not reject Null Hypothesis : Not Significant"

# print results
print(f"Z-statistic : {z_score}")
print(f"p-value : {p_val}")
print(f"Hypothesis Status : {Hypothesis_status}")

Z-statistic : -1.63497531
p-value : 0.102054
Hypothesis Status : Do not reject Null Hypothesis : Not Significant


In [13]:
# The p-value (0.102054) is above the 0.05 significance level, so we do not have enough evidence to reject our Null Hypothesis that temperature doesn't affect the COVID-19 outbreak.
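The describe() output shows that Confirmed is strongly right-skewed (mean about 575.6, median 12, maximum 67760), so the group means are dominated by a handful of rows. One possible cross-check, which is not part of the original analysis, is to compare the groups with Welch's t-test and the rank-based Mann-Whitney U test from SciPy. The sketch below runs both on synthetic skewed samples; the data, seed and variable names are assumptions made for illustration, and no result here describes the COVID-19 dataset.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

# Synthetic, right-skewed samples standing in for the two climate groups.
rng = np.random.default_rng(0)
hot = rng.lognormal(mean=2.0, sigma=1.5, size=40)
cold = rng.lognormal(mean=2.5, sigma=1.5, size=160)

# Welch's t-test: compares means without assuming equal variances.
t_stat, t_p = ttest_ind(hot, cold, equal_var=False)

# Mann-Whitney U: compares ranks, so extreme values carry less weight.
u_stat, u_p = mannwhitneyu(hot, cold, alternative="two-sided")

print(f"Welch t-test   : statistic={t_stat:.4f}, p-value={t_p:.4f}")
print(f"Mann-Whitney U : statistic={u_stat:.1f}, p-value={u_p:.4f}")
```

If the two tests disagree markedly on the real data, that would suggest the mean-based result is being driven by a few extreme rows.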