In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px # plotly express
import plotly.graph_objects as go
%matplotlib inline
import os
from IPython.display import HTML

# Input data files are available in the '/kaggle/input' or '../../../datasets/extracts/' directory.
file_input=['/kaggle/input','../../../datasets/']
files={}
for dirname, _, filenames in os.walk(file_input[0]):
    for filename in filenames:
        if 'csv' in filename:
            files[filename.replace('.csv','')]=os.path.join(dirname, filename)

# Effect of Weather on Coronavirus rate of spread

**Brief:** Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.
The virus that causes COVID-19 is mainly transmitted through droplets generated when an infected person coughs, sneezes, or exhales. These droplets are too heavy to hang in the air, and quickly fall on floors or surfaces.

In this notebook, we will look for correlations between weather attributes and the spread of COVID-19 (coronavirus). The notebook is divided into the following sections:
1. [Data Collection](#Data-Collection)
2. [Data Cleanup](#Data-Cleanup)
3. [Visualizing Cases vs Date for different Provinces](#Total-Cases-vs-Date-in-different-provinces)
4. [Visualizing Rolling 7-day average of Cases vs Date for different Provinces](#Rolling-7-Day-average-for-cases-in-different-provinces)
5. [Correlation Matrix](#Correlation-Matrix)
6. Deep dive into the Total Cases correlation matrix
 - [Confirmed Cases](#Confirmed-Cases)
 - [Recovered Cases](#Recovered-Cases)
 - [Death Cases](#Death-Cases)
7. [Conclusion](#Conclusion)


**NOTE:** This notebook deals with the correlation between weather attributes and COVID-19, not with causation; the two may or may not coincide. Please keep that in mind.

**For ease of use, we will use `province` as a generic term for Regions/States/Provinces.**

Assumptions made during data collection and processing:
1. For weather, I took the following weather stations as the reference point for each province.
 - **Reunion(France)** - Sainte-Marie, Réunion (ROLAND GARROS AIRPORT STATION)
 - **Henan** - Zhengzhou, Henan (ZHENGZHOU XINZHENG INTERNATIONAL AIRPORT STATION)
 - **New York** - New York City, NY (LAGUARDIA AIRPORT STATION)
 - **Virginia** - Lynchburg, VA (LYNCHBURG REGIONAL AIRPORT STATION)
 - **Bermuda** - Castle Harbour, St. George's Parish (L.F. WADE INTERNATIONAL AIRPORT STATION)
 - **Maharashtra** - Mumbai, Maharashtra (CHHATRAPATI SHIVAJI INTERNATIONAL AIRPORT STATION)
 - **Lombardia** - Peschiera Borromeo, Province of Milan (LINATE AIRPORT STATION)
 - **Tasmania** - Hobart, Tasmania (HOBART INTERNATIONAL AIRPORT STATION)
2. If the weather has any effect on coronavirus, it will take approximately 14 days for a patient to get diagnosed, so a weather offset of 14 days was added.
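
To make the second assumption concrete, here is a minimal sketch of one way to align case counts with the weather observed 14 days earlier. The toy frames, the `LAG_DAYS` name and all values are hypothetical and are not taken from the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: two days of weather and two days of case counts.
weather = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02']),
    'temp': [10.0, 12.0],
})
cases = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-15', '2020-03-16']),
    'Confirmed': [100, 130],
})

LAG_DAYS = 14  # assumed delay between exposure and diagnosis

# Relabel each weather row with the date on which its effect is assumed to show up.
lagged_weather = weather.assign(Date=weather['Date'] + pd.Timedelta(days=LAG_DAYS))

# Cases on day D are now paired with the weather of day D - 14.
aligned = cases.merge(lagged_weather, on='Date', how='left')
print(aligned)
```

Setting `LAG_DAYS = 0` in this sketch pairs each case count with same-day weather instead.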

# Data Collection

In [None]:
ProvinceDF=pd.DataFrame()

# provinces to consider
provinces= ['Reunion','Quebec','Henan','New York','Virginia','Bermuda','Maharashtra','Lombardia','Tasmania']

df= pd.read_csv(files['covid_19_data'])
df= df[df['Province/State'].isin(provinces)][['ObservationDate','Province/State','Confirmed','Deaths','Recovered']]\
        .rename({'ObservationDate':'Date','Province/State':'Province'},axis=1)
df['Date']= pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.strftime('%Y-%m-%d')
ProvinceDF= pd.concat([ProvinceDF,df])

# Indian states that need to be considered
df= pd.read_csv(files['covid_19_india'])
df= df[df['State/UnionTerritory']=='Maharashtra'][['Date','State/UnionTerritory','Confirmed','Deaths','Cured']]\
        .rename({'State/UnionTerritory':'Province','Cured':'Recovered'},axis=1)
df['Date']= pd.to_datetime(df['Date'],format='%d/%m/%y').dt.strftime('%Y-%m-%d')
ProvinceDF= pd.concat([ProvinceDF,df])

# Italian regions that need to be considered
df= pd.read_csv(files['covid19_italy_region'])
df= df[df['RegionName']=='Lombardia'][['Date','RegionName','TotalPositiveCases','Deaths','Recovered']]\
    .rename({'RegionName':'Province','TotalPositiveCases':'Confirmed'},axis=1)
df['Date']= pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d')
ProvinceDF= pd.concat([ProvinceDF,df])

ProvinceDF.head()

**Weather data is collected from [WUNDERGROUND](https://www.wunderground.com/)**

In [None]:
# Weather data for the above Provinces/States, pd.DateOffset for shifting dates
total_weather_df=pd.DataFrame()
for key in provinces:
    weather_df= pd.read_csv(files['Weather '+key])
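    # shift weather dates forward by 14 days so that cases on day D are matched with the weather of day D-14
    weather_df['Date']= (pd.to_datetime(weather_df['valid_time_gmt'],unit='s') + pd.DateOffset(14)).dt.strftime('%Y-%m-%d')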
    weather_df= weather_df[['Date','temp','dewPt','wspd','pressure','heat_index','rh','vis','wc','wdir','feels_like','uv_index']].groupby(['Date']).agg(['min','mean','max']).reset_index()
    weather_df.columns= weather_df.columns.map('| '.join).str.strip('| ')
    weather_df['Province']= key
    weather_df.drop('uv_index| min',axis=1,inplace=True)        
    total_weather_df=pd.concat([total_weather_df,weather_df])
    
# merging weather data with province data
ProvinceWeatherDF= pd.merge(ProvinceDF,total_weather_df,left_on=['Date','Province'],right_on=['Date','Province'],how='left')

#### Weather abbreviations and definitions from https://www.worldcommunitygrid.org/lt/images/climate/The_Weather_Company_APIs.pdf:
- **temp**: The forecasted temperature for midpoint day (1pm) or midpoint night (1am) for a 12 hour daypart.
- **dewPt**: The temperature to which air must be cooled at constant pressure to reach saturation
- **wspd**: The maximum forecasted hourly wind speed
- **pressure**: Mean Sea Level Pressure, the equivalent pressure reading at sea level recorded at this station
- **heat_index**: An apparent temperature. It represents what the air temperature “feels like” on exposed human skin due to the combined effect of warm temperatures and high humidity
- **rh**: (%) The relative humidity of the air, which is defined as the ratio of the amount of water vapor in the air to the amount of vapor required to bring the air to saturation at a constant temperature.
- **vis**: Prevailing hourly visibility
- **wc**: Wind Chill - Minimum wind chill.
- **wdir**: Daytime average wind direction in magnetic notation. 
- **feels_like**: Hourly feels like temperature. 
- **uv_index**: Maximum UV index for the 12 hour forecast period. 
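
Column names such as `temp| min` that appear later in the notebook come from aggregating the raw readings per day and flattening the resulting two-level column index. The following is a small sketch of that step on made-up readings (the values are hypothetical; only the `temp` and `rh` names come from the list above).

```python
import pandas as pd

# Hypothetical readings: two on the first day, one on the second.
readings = pd.DataFrame({
    'Date': ['2020-03-01', '2020-03-01', '2020-03-02'],
    'temp': [8.0, 12.0, 15.0],
    'rh': [60.0, 80.0, 70.0],
})

# One row per day, with min/mean/max for every attribute.
daily = readings.groupby('Date').agg(['min', 'mean', 'max']).reset_index()

# Flatten ('temp', 'min') into 'temp| min'; ('Date', '') becomes 'Date'.
daily.columns = daily.columns.map('| '.join).str.strip('| ')
print(daily)
```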

# Data Cleanup
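
The cleanup cell below derives per-day changes from the cumulative counts and removes negative changes, which can appear when a reported total is revised downwards. As a rough illustration of that idea only, here is a sketch on hypothetical toy data; it is not the notebook's exact procedure.

```python
import pandas as pd

# Hypothetical cumulative counts for two provinces; province A has a downward revision.
toy = pd.DataFrame({
    'Province': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'Confirmed': [1, 3, 2, 5, 5, 9],
}).sort_values(['Province', 'Date'])

# Day-over-day change within each province; negative changes are clipped to 0.
toy['Delta Confirmed'] = toy.groupby('Province')['Confirmed'].diff().clip(lower=0)
print(toy)
```

The first row of each province has no previous day, so its delta is `NaN`.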

In [None]:
# Removing rows for which we don't have weather data (temp| min is used as the indicator column)
ProvinceWeatherDF= ProvinceWeatherDF[~ProvinceWeatherDF['temp| min'].isna()]

# adding Delta Changes(per day shifts) as different columns data and merging with the province data
ProvinceWeatherDF= pd.concat([ProvinceWeatherDF.sort_values('Date'),
                            ProvinceWeatherDF.sort_values('Date')[['Province','Confirmed','Deaths','Recovered']]\
                              .groupby('Province').diff().rename({'Confirmed':'Delta Confirmed','Deaths':'Delta Deaths','Recovered':'Delta Recovered'},axis=1)],axis=1)

# Cleaning data, for negative per day changes, fill the previous value
ProvinceWeatherDF['Date']= pd.to_datetime(ProvinceWeatherDF['Date'])
ProvinceWeatherDF.sort_values('Date',inplace=True)

for feature in ['Confirmed','Deaths','Recovered']:
    ProvinceWeatherDF.loc[ProvinceWeatherDF['Delta '+feature]<0,feature]=\
        ProvinceWeatherDF[ProvinceWeatherDF['Delta '+feature]<0][[feature,'Delta '+feature]]\
            .apply(lambda row:row[feature]-row['Delta '+feature],axis=1)

    # After the cases are shifted to the previous value, clip the negative 'Delta' values to 0
    ProvinceWeatherDF['Delta '+feature]= ProvinceWeatherDF['Delta '+feature].clip(lower=0)

    # fill zero values apart from first value of Province with the previous value
    for province in ProvinceWeatherDF['Province'].unique():
        ProvinceWeatherDF.loc[ProvinceWeatherDF['Province']==province,feature]=\
            ProvinceWeatherDF.loc[ProvinceWeatherDF['Province']==province,feature].mask((ProvinceWeatherDF['Province']==province)&(ProvinceWeatherDF[feature] == 0)).ffill()
    
    # fill NaN with 0. NaN originates from the unfilled mask above
    ProvinceWeatherDF[feature]= ProvinceWeatherDF[feature].fillna(0)
    
    ProvinceWeatherDF.loc[ProvinceWeatherDF[feature]>0,feature+' Days']= ProvinceWeatherDF[ProvinceWeatherDF[feature]>0].groupby('Province')['Date'].rank(ascending=True)
    ProvinceWeatherDF= pd.merge(ProvinceWeatherDF,
                            ProvinceWeatherDF.groupby('Province').rolling('7D',on='Date')[feature].mean().reset_index().rename({feature:'Rolling '+feature},axis=1),
                            left_on=['Date','Province'],right_on=['Date','Province'],how='right')
ProvinceWeatherDF.head()

# Total Cases vs Date in different provinces

In [None]:
for feature in ['Confirmed','Deaths','Recovered']:
    fig= px.line(ProvinceWeatherDF,
            x='Date',
            y=feature,
            color='Province',
            title=feature+' cases in different provinces',
            template='plotly_dark')

    fig.update_layout(yaxis=dict(type='log'))
    fig.show()    

# Rolling 7-Day average for cases in different provinces
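
For readers unfamiliar with time-based windows, this is a minimal sketch of a 7-day rolling mean computed per province. The data is hypothetical. With a `'7D'` window pandas averages the rows whose date falls in the 7 days ending at the current row, so a gap in the dates shrinks the number of rows in the window.

```python
import pandas as pd

# Hypothetical counts; province A has a one-week gap before its last row.
toy = pd.DataFrame({
    'Province': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-10',
                            '2020-03-01', '2020-03-02']),
    'Confirmed': [1.0, 3.0, 5.0, 9.0, 10.0, 20.0],
})

# Time-based windows need a datetime index that is sorted within each group.
rolling = (toy.set_index('Date')
              .groupby('Province')['Confirmed']
              .rolling('7D')
              .mean())
print(rolling)
```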

In [None]:
for feature in ['Confirmed','Deaths','Recovered']:
    fig= px.line(ProvinceWeatherDF,
            x='Confirmed Days',
            y='Rolling '+feature,
            hover_name=feature,
            color='Province',
            title='Rolling 7-Day average for '+feature+' cases in different provinces',
            template='plotly_dark')
    fig.update_layout(yaxis=dict(type='log'),
        annotations = [dict(xref='paper',
                                        yref='paper',
                                        x=-0.1, y=-0.2,
                                        showarrow=False,
                                        text ='Number of days since 1st non-zero case was recorded')]
    )
    fig.show()   

# Correlation Matrix
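
A correlation matrix is symmetric, so the heatmap only needs one triangle. The sketch below shows the masking idea on a small hypothetical frame (columns `a`, `b`, `c` are made up): the diagonal and the upper triangle are replaced with `NaN`, leaving one value per pair of features.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b rises with a, c falls with a.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 2, 1]})
corr = df.corr()

# True on the diagonal and above; those cells are hidden.
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower = corr.mask(mask)
print(lower)
```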

In [None]:
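# keep numeric columns only; recent pandas versions raise on non-numeric columns in corr()
corr=ProvinceWeatherDF[ProvinceWeatherDF.columns.sort_values()].select_dtypes(include='number').corr()
# np.bool was removed from NumPy; the builtin bool is equivalent here
mask= np.zeros_like(corr, dtype=bool)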
mask[np.triu_indices_from(mask)] = True

fig= go.Figure(data=go.Heatmap(z=corr.mask(mask),
                                x=corr.columns.values,
                                y=corr.columns.values,
                                xgap=1, ygap=1,
                                colorscale="Rainbow",
                                colorbar_thickness=20,
                                colorbar_ticklen=3,
                                zmid=0),
                layout= go.Layout(title_text='Correlation Matrix', template='plotly_dark',
                height=900,
                xaxis_showgrid=False,
                yaxis_showgrid=False,
                yaxis_autorange='reversed'))
fig.show()

In [None]:
ReqColumns= ProvinceWeatherDF.columns[(ProvinceWeatherDF.columns.str.contains('Confirmed|Recovered|Deaths'))&(~ProvinceWeatherDF.columns.str.contains('Days'))]

# Correlation Matrix
CorrMatrix= ProvinceWeatherDF.select_dtypes(include='number').corr()
CorrMatrix[ReqColumns].style.background_gradient(cmap='Blues')

# Deep Dive in Total Cases feature's correlations

> Now we will go through them one by one and look at the major correlations.

### Confirmed Cases

In [None]:
opColumns= ProvinceWeatherDF.columns[ProvinceWeatherDF.columns.str.contains('Confirmed|Recovered|Deaths|Days')]
ConfMatrix= CorrMatrix.loc[~CorrMatrix.index.isin(opColumns),'Confirmed']
ConfMatrix= pd.concat([ConfMatrix.abs().rename('Abs Confirmed'),ConfMatrix],axis=1).sort_values('Abs Confirmed',ascending=False)[:10]
ConfMatrix

#### Logic behind creating the graph below:
We cannot show all the features in one chart, so we choose which ones to show, as follows:
- Get the 10 features with the highest `absolute` correlation with Confirmed Cases
- From those, take the first 3 that are not `days` features, are not categorical (value_counts()<4), and are not similar to each other (pressure|mean and pressure|max are correlated), and build the chart below
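
As a small illustration of the first step, ranking features by absolute correlation while keeping the sign can be sketched as follows. The correlation values here are invented for the example and are not results from this notebook.

```python
import pandas as pd

# Invented correlations of a few weather features with a target column.
corr_with_target = pd.Series({
    'temp| mean': -0.42,
    'rh| min': -0.55,
    'vis| max': 0.31,
    'wspd| mean': 0.05,
})

# Order by magnitude, then look the signed values up again in that order.
order = corr_with_target.abs().sort_values(ascending=False).index
top = corr_with_target.reindex(order).head(3)
print(top)
```

The remaining filters (dropping near-duplicate and categorical features) are judgment calls and are not automated in this sketch.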

In [None]:
fig= px.scatter(ProvinceWeatherDF,
               y='dewPt| min',
               x='Confirmed',
               color='rh| min',
               size='vis| min',
               hover_name='Province',
               template='plotly_dark',
               color_continuous_scale="Rainbow",
               opacity=1,
              )
fig.update_layout(xaxis={'type':'log'},yaxis={'type':'linear'})
fig.show()

**The above chart shows that lower `dewPt| min`, lower `rh| min` and higher `vis| min` are associated with higher `Confirmed` cases.**

In [None]:
PositiveConfirmedCorr= set(CorrMatrix.index[~CorrMatrix.index.isin(opColumns)])
NegativeConfirmedCorr= set(CorrMatrix.index[~CorrMatrix.index.isin(opColumns)])
for feature in ['Confirmed','Delta Confirmed','Rolling Confirmed']:
    Correlations= CorrMatrix.loc[~CorrMatrix.index.isin(opColumns),feature]
    PositiveConfirmedCorr= PositiveConfirmedCorr & set(Correlations[Correlations>0].index)
    NegativeConfirmedCorr= NegativeConfirmedCorr & set(Correlations[Correlations<0].index)
    
    print('\033[1mPositive Correlation with '+feature+'(Descending Order): \033[0m\n\t'+ ', '.join(Correlations[Correlations>0].sort_values(ascending=False).index))
    print('\n\033[1mNegative Correlation with '+feature+'(Descending Order): \033[0m\n\t'+ ', '.join(Correlations[Correlations<0].sort_values().index))
    print(''.join(['_' for _ in range(80)]),'\n')

> As the correlations above show, all variations have the same positive/negative correlations, with slight differences in ordering. Thus we are not building separate graphs for each variation of Confirmed cases.

### Recovered Cases

In [None]:
ConfMatrix= CorrMatrix.loc[~CorrMatrix.index.isin(opColumns),'Recovered']
ConfMatrix= pd.concat([ConfMatrix.abs().rename('Abs Recovered'),ConfMatrix],axis=1).sort_values('Abs Recovered',ascending=False)[:10]
ConfMatrix

In [None]:
fig= px.scatter(ProvinceWeatherDF,
               y='wspd| mean',
               x='Recovered',
               size='rh| min',
               color='wdir| max',
               hover_name='Province',
               template='plotly_dark',
               color_continuous_scale="Rainbow",
               opacity=1,
              )
fig.update_layout(xaxis={'type':'log'},yaxis={'type':'log'})
fig.show()

**The above chart shows that lower `wspd| mean`, higher `wdir| max` and lower `rh| min` are associated with higher Recovered cases.**

In [None]:
PositiveRecoveredCorr= set(CorrMatrix.index[~CorrMatrix.index.isin(opColumns)])
NegativeRecoveredCorr= set(CorrMatrix.index[~CorrMatrix.index.isin(opColumns)])

for feature in ['Recovered','Delta Recovered','Rolling Recovered']:
    Correlations= CorrMatrix.loc[~CorrMatrix.index.isin(opColumns),feature]
    PositiveRecoveredCorr= PositiveRecoveredCorr & set(Correlations[Correlations>0].index)
    NegativeRecoveredCorr= NegativeRecoveredCorr & set(Correlations[Correlations<0].index)
    
    print('\033[1mPositive Correlation with '+feature+'(Descending Order): \033[0m\n\t'+ ', '.join(Correlations[Correlations>0].sort_values(ascending=False).index))
    print('\n\033[1mNegative Correlation with '+feature+'(Descending Order): \033[0m\n\t'+ ', '.join(Correlations[Correlations<0].sort_values().index))
    print(''.join(['_' for _ in range(80)]),'\n')

> As the correlations above show, almost all variations have the same positive/negative correlations, with slight differences in ordering. Thus we are not building separate graphs for each variation of Recovered cases.
---
### Death Cases

In [None]:
ConfMatrix= CorrMatrix.loc[~CorrMatrix.index.isin(opColumns),'Deaths']
ConfMatrix= pd.concat([ConfMatrix.abs().rename('Abs Deaths'),ConfMatrix],axis=1).sort_values('Abs Deaths',ascending=False)[:10]
ConfMatrix

In [None]:
fig= px.scatter(ProvinceWeatherDF,
               y='rh| min',
               x='Deaths',
               size='vis| min',
               color='dewPt| min',
               hover_name='Province',
               template='plotly_dark',
               color_continuous_scale="Rainbow",
               opacity=1,
              )
fig.update_layout(xaxis={'type':'log'},yaxis={'type':'log'})
fig.show()

**The above chart shows that lower `rh| min`, lower `dewPt| min` and higher `vis| min` are associated with higher Death cases.**

In [None]:
PositiveDeathsCorr= set(CorrMatrix.index[~CorrMatrix.index.isin(opColumns)])
NegativeDeathsCorr= set(CorrMatrix.index[~CorrMatrix.index.isin(opColumns)])

for feature in ['Deaths','Delta Deaths','Rolling Deaths']:
    Correlations= CorrMatrix.loc[~CorrMatrix.index.isin(opColumns),feature]
    PositiveDeathsCorr= PositiveDeathsCorr & set(Correlations[Correlations>0].index)
    NegativeDeathsCorr= NegativeDeathsCorr & set(Correlations[Correlations<0].index)
    
    print('\033[1mPositive Correlation with '+feature+'(Descending Order): \033[0m\n\t'+ ', '.join(Correlations[Correlations>0].sort_values(ascending=False).index))
    print('\n\033[1mNegative Correlation with '+feature+'(Descending Order): \033[0m\n\t'+ ', '.join(Correlations[Correlations<0].sort_values().index))
    print(''.join(['_' for _ in range(80)]),'\n')

> As the correlations above show, the features for which the graph is plotted have the same positive/negative correlations, with slight differences in ordering. Thus we are not building separate graphs for each variation of Death cases. However, *Delta Deaths* has more negative correlations than the others, so it is worth a deeper look on its own, perhaps in a separate notebook.


# Conclusion
   **We saw how weather conditions correlate with the spread of COVID-19. The most interesting of them is `Temperature`, which is negatively correlated with the rate of spread of Confirmed and Death cases, but positively correlated with that of Recovered cases.**
   
Taking the intersection of the correlations across all variations (Total/Rolling/Delta) for each output:
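
The intersection logic keeps only the features whose correlation has the same sign for every variation of an output. A minimal sketch with invented correlation values (not results from this notebook):

```python
import pandas as pd

# Invented correlations of three weather features with three variations of one output.
corr = pd.DataFrame(
    {
        'Confirmed': [0.30, -0.20, 0.10],
        'Delta Confirmed': [0.20, -0.10, -0.05],
        'Rolling Confirmed': [0.25, -0.30, 0.02],
    },
    index=['vis| max', 'rh| min', 'temp| mean'],
)

# Start from all features and keep those with a consistent sign in every column.
positive = set(corr.index)
negative = set(corr.index)
for column in corr.columns:
    positive &= set(corr.index[corr[column] > 0])
    negative &= set(corr.index[corr[column] < 0])

print(positive, negative)
```

In this toy case `temp| mean` changes sign between variations, so it ends up in neither set.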

In [None]:
HTML('<h3>Absolute Positive Correlation with Confirmed Cases(intersections of all variations):</h3>'+
      ', '.join(PositiveConfirmedCorr)+
     '<br><br><h3>Absolute Negative Correlation with Confirmed Cases(intersections of all variations):</h3>'+
     ', '.join(NegativeConfirmedCorr))

In [None]:
HTML('<h3>Absolute Positive Correlation with Recovered Cases(intersections of all variations):</h3>'+
      ', '.join(PositiveRecoveredCorr)+
     '<br><br><h3>Absolute Negative Correlation with Recovered Cases(intersections of all variations):</h3>'+
     ', '.join(NegativeRecoveredCorr))

In [None]:
HTML('<h3>Absolute Positive Correlation with Death Cases(intersections of all variations):</h3>'+
      ', '.join(PositiveDeathsCorr)+
     '<br><br><h3>Absolute Negative Correlation with Death Cases(intersections of all variations):</h3>'+
     ', '.join(NegativeDeathsCorr))

#### Additional Notes:

There are a few things I tried that gave almost the same results, so I haven't added them to this notebook (let me know if I should add them in a separate notebook):
- F_regression to find the top 10 important variables
- Linear Regression, for the same reason (the RMSE score was bad, hence useless)
- DL model, for the same reason (better than LR, but still a high RMSE score)
- The 14-day offset to account for the diagnosis delay: removing it gives almost the same results, so it is not included as a separate case. You can fork and change `pd.DateOffset(14)` to `pd.DateOffset(0)` to try this.