<h1> India Covid Jan-Mar 2020: EDA </h1>
<br>

In this notebook I'll perform some exploratory data analysis.

<h4 style="background-color:#e6f7ff;" align = 'center'><i>Table of Contents</i></h4>

- [files available and some info](#files)
- [Covid19 India (Jan 20 - Mar 20).csv](#df)
- [india-polygon.shp (**WORK IN PROGRESS**)](#geo)

<a id = "files"></a>


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import geopandas as gpd
import gc
import fiona
import folium
import contextily as ctx
import matplotlib.pyplot as plt
import plotly.express as px
plt.style.use('ggplot')
import seaborn as sns
import os
root_path = '/kaggle/input/a-small-covid19-dataset'
# List every file shipped with the dataset.
for name in os.listdir(root_path):
    print(os.path.join(root_path, name))

<a id = "df"></a>

In [None]:
df = pd.read_csv(os.path.join(root_path, 'Covid19 India (Jan 20 - Mar 20).csv'))

df.info()  # prints its summary directly and returns None, so it is not passed to display()
display(df.sample(3))

The `Sno` column appears to be just `df.index` $+ 1$. Let's check:

In [None]:
# Compare element-wise: a zero sum alone could hide offsets that cancel out.
assert (df['Sno'].to_numpy() - 1 == df.index.to_numpy()).all()
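A side note on this kind of check: a zero sum of differences is necessary but not sufficient for element-wise equality, because positive and negative offsets can cancel. A minimal sketch on made-up data (the `toy` frame is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame: Sno is a shuffled 1..4, so it does NOT equal index + 1.
toy = pd.DataFrame({'Sno': [2, 1, 4, 3]})

diff = toy['Sno'].to_numpy() - 1 - toy.index.to_numpy()

sum_is_zero = bool(diff.sum() == 0)    # the offsets +1, -1, +1, -1 cancel out
all_equal = bool((diff == 0).all())    # the element-wise test catches the mismatch

print(sum_is_zero, all_equal)
```

The element-wise comparison is the stricter of the two tests, so it is the safer one to rely on before dropping a column.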

It is, so let's drop it and continue the analysis.

In [None]:
df.drop('Sno', axis = 1, inplace = True, errors = 'ignore')  # errors='ignore' keeps the cell safe to re-run once Sno is gone

Let's convert `Date` to a datetime, passing the day-month-year format explicitly.

In [None]:
df['Date'] = pd.to_datetime(df['Date'], errors = 'raise', format = "%d-%m-%Y")
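To illustrate why the explicit `format` together with `errors = 'raise'` is useful, here is a small sketch on hypothetical strings (not rows from the dataset): day-first values are parsed as intended, and a value in a different layout fails loudly instead of being misread.

```python
import pandas as pd

# Hypothetical sample strings in day-month-year order.
raw = pd.Series(['30-01-2020', '02-02-2020', '04-03-2020'])
parsed = pd.to_datetime(raw, format='%d-%m-%Y', errors='raise')

# A string in year-month-day order does not match the format and raises.
try:
    pd.to_datetime(pd.Series(['2020-01-30']), format='%d-%m-%Y', errors='raise')
    raised = False
except ValueError:
    raised = True

print(parsed.dt.strftime('%d %b %Y').tolist(), raised)
```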

In [None]:
distinct_values_table = df.nunique().rename('distinct_values').to_frame().T
nan_values_table = df.isna().sum().rename('nan_values').to_frame().T

plot_df = (distinct_values_table.T.reset_index().rename(columns = {'index': 'variable'})
          .merge(nan_values_table.T.reset_index().rename(columns = {'index': 'variable'}),
                 on = 'variable'))

fig, ax = plt.subplots(1, 2, figsize = (16, 8), gridspec_kw={'width_ratios': [2.3, 1]})

sns.set_context(rc = {'patch.linewidth': 2.0})
sns.barplot(x = 'variable', 
            y = 'distinct_values',
            data = plot_df,
            ax = ax[0])

# Annotate each bar with its count.
for index, row in plot_df.iterrows():
    value = row.distinct_values
    ax[0].text(index, value + 1, str(value), color='black', ha="center", fontsize = 15)

# No legend here: the bars carry no labels, so ax.legend() would only emit a warning.
ax[0].set_title('Distinct values for each variable', fontsize = 20)
ax[0].tick_params(axis='both', which='major', labelsize=14)
ax[0].tick_params(axis='both', which='minor', labelsize=14)
ax[0].set_xlabel('')
ax[0].set_ylim(0, 55)
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation = 35, fontsize = 13, color = 'black')
ax[0].set_ylabel('distinct_values', fontsize = 18, color ='black')
plt.subplots_adjust(hspace = 0.3)

bbox=[-0.2, 0, 1.2, 0.9]
ax[1].axis('off')
ax[1].title.set_text('')
ax[1].title.set_size(12)
ccolors = plt.cm.BuPu(np.full(len(plot_df.columns), 0.1))
mpl_table = ax[1].table(cellText = plot_df.values, bbox=bbox, colLabels=plot_df.columns, colColours=ccolors)
mpl_table.auto_set_font_size(True)
mpl_table.auto_set_column_width(col=list(range(len(plot_df.columns))))
#mpl_table.set_fontsize(14)

No NaN values, which is always good.

In [None]:
(df['State/UnionTerritory'].value_counts().rename('distinct_values').to_frame()
 .reset_index().rename({'index': 'State/UnionTerritory'}, axis=1)
 .style.set_properties(subset=['State/UnionTerritory'], **{'font-weight': 'bold'})
 .background_gradient(subset='distinct_values'))

It seems that the row count per `State/UnionTerritory` equals the number of distinct `Date` values for that state, i.e. each (`Date`, `State/UnionTerritory`) pair occurs only once.

In [None]:
df_no_dup = df.drop_duplicates(['Date', 'State/UnionTerritory'])

assert (df['State/UnionTerritory'].value_counts().to_frame()
       .equals(df_no_dup['State/UnionTerritory'].value_counts().to_frame()))

del df_no_dup
gc.collect()  # optional: del already drops the reference; collect() only forces an immediate collection pass

The assertion passes, so no (`Date`, `State/UnionTerritory`) pair is repeated.
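The same question can be asked more directly with `DataFrame.duplicated`. The sketch below uses a hypothetical toy frame (not the dataset) in which one pair is repeated on purpose, to show what a failing case would look like.

```python
import pandas as pd

# Hypothetical toy frame: the last two rows share the same (Date, State) pair.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'State/UnionTerritory': ['Kerala', 'Delhi', 'Kerala', 'Kerala'],
})

# keep=False flags every member of a duplicated group, not just the later ones.
dup_mask = toy.duplicated(['Date', 'State/UnionTerritory'], keep=False)
has_duplicates = bool(dup_mask.any())

print(toy[dup_mask])
```

On the real `df`, `dup_mask.any()` being `False` would carry the same information as the `value_counts` comparison.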

In [None]:
import matplotlib.dates as mdates

fig, ax = plt.subplots(1, 1, figsize = (16, 6))

# Select several columns with a list; tuple-style selection on a groupby fails in recent pandas.
plot_df = df.groupby('Date')[['ConfirmedForeignNational', 'ConfirmedIndianNational', 'Cured']].sum()

# x_compat=True keeps matplotlib's own date handling, so the DateFormatter below applies.
plot_df.plot(ax = ax, linewidth = 3, alpha = 0.5, x_compat = True)

# Style the axes after plotting so the legend picks up the three lines.
ax.legend(fontsize = 14)
fig.suptitle('Number of Cases and Cured after aggregating by Date', fontsize = 20)
ax.set_title('Counts summed over all States/UnionTerritories, i.e. the whole of India', fontsize = 12)
ax.tick_params(axis='both', which='major', labelsize=14)
ax.tick_params(axis='both', which='minor', labelsize=14)
ax.set_xlabel('Date', fontsize = 18, color ='black')
ax.set_ylabel('Count', fontsize = 18, color ='black')

# Show day and month: the data spans January to March, so the day alone is ambiguous.
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d %b'))



In [None]:
fig = px.line(data_frame = plot_df.reset_index(drop = False), x="Date",
              y = ['ConfirmedForeignNational', 'ConfirmedIndianNational', 'Cured'],
              title = 'Number of Cases and Cured after aggregating by Date')

fig.update_layout(
    title_font_size = 20,
    xaxis_title="Date",
    yaxis_title="Count"
)
    
fig.show()

All three series look monotonically increasing, which suggests the columns hold cumulative counts rather than daily new cases. Let's check:

In [None]:
print(plot_df.Cured.is_monotonic_increasing, 
      plot_df.ConfirmedForeignNational.is_monotonic_increasing, 
      plot_df.ConfirmedIndianNational.is_monotonic_increasing)
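If the columns really are running totals per state (an assumption that the monotonicity check supports but does not prove), then summing them across dates counts the same cases several times. The sketch below, on a hypothetical toy frame, shows two alternatives: taking the latest value per state, and recovering daily new cases with a within-state difference.

```python
import pandas as pd

# Hypothetical toy frame with running totals per state (not the real data).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'State/UnionTerritory': ['Kerala'] * 3 + ['Delhi'] * 3,
    'ConfirmedIndianNational': [1, 3, 3, 0, 1, 4],
})
col = 'ConfirmedIndianNational'

# Summing running totals over dates over-counts: Kerala gives 1 + 3 + 3 = 7.
naive_sum = toy.groupby('State/UnionTerritory')[col].sum()

# Latest running total per state: the value on the last available date.
latest = toy.sort_values('Date').groupby('State/UnionTerritory')[col].last()

# Daily new cases: difference within each state; the first day keeps its own value.
daily_new = (toy.sort_values(['State/UnionTerritory', 'Date'])
                .groupby('State/UnionTerritory')[col]
                .diff()
                .fillna(toy[col]))

# Summing the daily increments rebuilds the latest running total.
rebuilt = daily_new.groupby(toy['State/UnionTerritory']).sum()

print(naive_sum.to_dict(), latest.to_dict(), rebuilt.to_dict())
```

Which of the two views is appropriate depends on the question: `latest` answers "how many cases so far", while `daily_new` answers "how many cases on a given day".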

According to [Wikipedia](https://en.wikipedia.org/wiki/COVID-19_pandemic_in_India#2020), the first case in India was reported on 30 January 2020.

In [None]:
fig = px.treemap(df.groupby('State/UnionTerritory').ConfirmedIndianNational.sum().reset_index(), 
                 path=['State/UnionTerritory'], values='ConfirmedIndianNational',
                  color='ConfirmedIndianNational', hover_data=['State/UnionTerritory'],
                  title="Confirmed Indian National per State/UnionTerritory",
                  color_continuous_scale='RdBu')
fig.show()

In [None]:
df.groupby('State/UnionTerritory').ConfirmedForeignNational.sum().fillna(0).reset_index()

In [None]:
fig = px.treemap(df.groupby('State/UnionTerritory').ConfirmedForeignNational.sum()
                 .reset_index().query('ConfirmedForeignNational>0'), 
                 path=['State/UnionTerritory'], values='ConfirmedForeignNational',
                  color='ConfirmedForeignNational', hover_data=['State/UnionTerritory'],
                  title="Confirmed Foreign National per State/UnionTerritory (just the ones with cases)",
                  color_continuous_scale='RdBu')
fig.show()

In [None]:
fig = px.treemap(df.assign(allConfirmed=lambda x: x.ConfirmedIndianNational+ x.ConfirmedForeignNational)
                 .groupby('State/UnionTerritory').allConfirmed.sum()
                 .reset_index().query('allConfirmed>0'), 
                 path=['State/UnionTerritory'], values='allConfirmed',
                  color='allConfirmed', hover_data=['State/UnionTerritory'],
                  title="allConfirmed per State/UnionTerritory",
                  color_continuous_scale='RdBu')
fig.show()

<a id = "geo"> </a>

In [None]:
# Read the .shp file; its .shx/.dbf sidecar files are picked up automatically.
geo_df = gpd.read_file(os.path.join(root_path, 'india-polygon.shp')).drop("id", axis = 1)

print(geo_df.shape)

display(geo_df.sample(2))

# State names present in only one of the two tables: (shapefile only, case data only).
set(geo_df.st_nm) - set(df['State/UnionTerritory']), set(df['State/UnionTerritory']) - set(geo_df.st_nm)

In [None]:
geo_df_plus_cases = geo_df.merge(df.assign(allConfirmed=lambda x: x.ConfirmedIndianNational+ x.ConfirmedForeignNational)
                 .groupby('State/UnionTerritory').allConfirmed.sum()
                 .reset_index().query('allConfirmed>0'), left_on = 'st_nm',
            right_on = 'State/UnionTerritory')
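Because the merge above is an inner join, any state whose name is spelled differently in the two tables is dropped silently. A possible way to audit this is an outer merge with `indicator=True`. The sketch uses plain pandas frames with hypothetical state names in place of the GeoDataFrame and the real case table.

```python
import pandas as pd

# Hypothetical stand-ins for the shapefile attributes and the per-state totals.
shapes = pd.DataFrame({'st_nm': ['State A', 'State B', 'State C']})
cases = pd.DataFrame({'State/UnionTerritory': ['State A', 'State B', 'State-C'],
                      'allConfirmed': [10, 7, 2]})

# how='outer' keeps unmatched rows from both sides; _merge records where each row came from.
audit = shapes.merge(cases, left_on='st_nm', right_on='State/UnionTerritory',
                     how='outer', indicator=True)

unmatched = audit[audit['_merge'] != 'both']
unmatched_sources = sorted(unmatched['_merge'].astype(str))

print(unmatched[['st_nm', 'State/UnionTerritory', '_merge']])
```

Rows tagged `left_only` are shapes without case data, and rows tagged `right_only` are case records that found no shape, so both lists are worth a look before mapping.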

The choropleth in the next cell adapts the approach from [this folium tutorial](https://jingwen-z.github.io/how-to-draw-a-variety-of-maps-with-folium-in-python/#paint-areas-with-different-colors).

In [None]:
# props to https://jingwen-z.github.io/how-to-draw-a-variety-of-maps-with-folium-in-python/#paint-areas-with-different-colors
from branca.colormap import linear
nbh_count_colormap = linear.YlGnBu_09.scale(min(geo_df_plus_cases['allConfirmed']),
                                            max(geo_df_plus_cases['allConfirmed']))

nbh_locs_map = folium.Map(location=[29, 75],
                          zoom_start = 4, tiles='cartodbpositron')

style_function = lambda x: {
    'fillColor': nbh_count_colormap(x['properties']['allConfirmed']),
    'color': 'black',
    'weight': 1.5,
    'fillOpacity': 0.7
}

folium.GeoJson(
    geo_df_plus_cases,
    style_function=style_function,
    tooltip=folium.GeoJsonTooltip(
        fields=['State/UnionTerritory', 'allConfirmed'],
        localize=True
    )
).add_to(nbh_locs_map)

# Set the caption first, then add the colormap legend to the map once.
nbh_count_colormap.caption = 'Cases per State/UnionTerritory'
nbh_count_colormap.add_to(nbh_locs_map)

nbh_locs_map