# India's COVID-19 Exploratory analysis
---

![](https://e3.365dm.com/21/03/1600x900/skynews-india-vaccine-graphic_5325213.jpg?bypass-service-worker&20210331165132)

### **About**

This document contains a basic exploratory data analysis of COVID-19 in India. The notebook analyzes and visualizes the progress of the pandemic from various perspectives.

The data used in this notebook is compiled from https://api.covid19india.org/

**Feel free to point out mistakes and give feedback, since I am a novice.
Any suggestions are welcome.
If you like this work, please upvote and share.**

### **Introduction**

The first cases of **COVID-19 in India** were reported in towns in Kerala, among three Indian medical students who had returned from Wuhan. The Government of India subsequently imposed a nationwide lockdown on **25 March 2020**. India faced its **first wave** from May 2020 to January 2021, peaking at around **90,000** new infections a day. At the time of writing, India is going through a second wave, which has proved deadlier than the first.

## 1. Cases, Deaths and Recovery

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import dates as mpl_dates

In [None]:
ind_covid_df = pd.read_csv('../input/indias-covid19-cases/case_time_series.csv')

In [None]:
ind_covid_df

In [None]:
ind_covid_df.info()

In [None]:
ind_covid_df.isnull().sum()

We can see that the dataset doesn't contain any null or missing values, so no imputation or row dropping is needed.

In [None]:
ind_covid_df['Date_YMD'] = pd.to_datetime(ind_covid_df['Date_YMD'])

In [None]:
ind_covid_df.tail(1)

In [None]:
total_cases = ind_covid_df['Total Confirmed']
dates = ind_covid_df['Date_YMD']

In [None]:
curr_date = dates.max()
curr_total_cases = int(total_cases.iloc[-1])  # last row holds the latest cumulative count

In [None]:
dates.max()

In [None]:
filt = ind_covid_df.Date_YMD==dates.max()
# .iloc[0] extracts the single matching value; calling int() directly on a
# one-element Series is deprecated in recent pandas versions
today_cases = int(ind_covid_df.loc[filt, 'Daily Confirmed'].iloc[0])

today_deaths = int(ind_covid_df.loc[filt, 'Daily Deceased'].iloc[0])

today_recovered = int(ind_covid_df.loc[filt, 'Daily Recovered'].iloc[0])

curr_total_deaths = int(ind_covid_df.loc[filt, 'Total Deceased'].iloc[0])

curr_total_recovered = int(ind_covid_df.loc[filt, 'Total Recovered'].iloc[0])

In [None]:
plt.style.use('fivethirtyeight')
# total_cases.plot(figsize=(10,6))
plt.figure(figsize=(10,6))
plt.plot(dates, total_cases.values/10**6, color='#0000a0')
# plt.plot(total_deaths.index, total_deaths.values, linewidth=1)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.xlabel('')
plt.ylabel('Count (in Millions)',fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=13)
plt.suptitle('Total Cases by Time', fontsize=20)

plt.annotate(text=str(curr_total_cases), xy=(curr_date,curr_total_cases/10**6),
             xycoords='data', xytext=(-80,1), textcoords='offset points', fontsize=14)

A steep rise in total cases was observed from the end of July 2020 until December 2020, and the curve appeared to level off in January 2021. From April 2021 onwards, however, cases started to increase at a much higher rate than before.

In [None]:
daily_cases = ind_covid_df['Daily Confirmed']

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, daily_cases,'-', linewidth=2)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Count')
plt.suptitle('Daily New Cases by time',fontsize=22)

plt.annotate(text=str(today_cases), xy=(curr_date, today_cases),
             xycoords='data', xytext=(-56,1), textcoords='offset points', fontsize=14)



From July 2020 onwards, the infection rate started to increase and reached its first peak in September 2020, with over 90,000 cases reported per day.\
Cases began to decline from October 2020 and had dropped below 15,000 per day by January 2021, which was a good sign.

A second wave beginning in March 2021 was much larger than the first, with shortages of vaccines, hospital beds, oxygen cylinders and medicines such as remdesivir in parts of the country. By the end of April, the daily infection count had reached over 400,000, a new record.
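Daily counts are noisy, so peaks like the ones described above are often easier to read off a smoothed curve. The following is a minimal sketch, not part of the original analysis: the numbers in `toy_daily` are made up for illustration, and the 7-day window is an assumption chosen to average out weekly reporting patterns.

```python
import pandas as pd

# Hypothetical daily counts, for illustration only (not real data).
toy_daily = pd.Series(
    [100, 140, 90, 160, 200, 150, 260, 300, 240, 380],
    index=pd.date_range("2021-03-01", periods=10, freq="D"),
)

# 7-day rolling mean; min_periods=1 keeps the first days instead of NaN.
smoothed = toy_daily.rolling(window=7, min_periods=1).mean()

# Date and value of the highest single-day count.
peak_date = toy_daily.idxmax()
peak_value = toy_daily.max()
print(peak_date.date(), peak_value)
```

On the notebook's data, the same idea could be applied as `daily_cases.rolling(7).mean()` and plotted against `dates` on top of the raw line.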

In [None]:
total_deaths = ind_covid_df['Total Deceased']

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, total_deaths, color='red')

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Count')
plt.suptitle('Total Deaths by Time', fontsize=22)

plt.annotate(text=str(curr_total_deaths), xy=(curr_date, curr_total_deaths),
             xycoords='data', xytext=(-56,1), textcoords='offset points', fontsize=14)

In [None]:
daily_deaths = ind_covid_df['Daily Deceased']

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, daily_deaths,'-r', linewidth=1)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Count')
plt.suptitle('Daily New Deaths by time',fontsize=22)

plt.annotate(text=str(today_deaths), xy=(curr_date, today_deaths),
             xycoords='data', xytext=(-40,1), textcoords='offset points', fontsize=14)

The plot above shows that there was a large number of deaths in August, September and October 2020.
A sudden spike in deaths was seen in mid-June.
In the second wave, daily deaths are 4 to 5 times higher than in the previous wave.

### Let us see if there is any correlation between new cases and new deaths on a daily basis

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(daily_cases, daily_deaths, edgecolor='black', alpha=.3)
plt.xlabel('New Cases')
plt.ylabel('New Deaths')

The scatterplot shows that daily new deaths are linearly correlated with daily new cases.
There is a strong positive relationship between the two, as most points fall close to a line.

That is, the number of deaths each day rises with the number of fresh COVID-19 cases: the more cases that are found, the more deaths occur.

From the above plots we can conclude that if we could stop or suppress fresh COVID-19 cases, there would be fewer deaths.

`if we could prevent new cases from happening, deaths would reduce`
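The visual impression from the scatterplot can be backed by a number. Below is a minimal sketch that computes the Pearson correlation coefficient with pandas, plus a lagged variant, since deaths may follow reported cases by some days. The values in `toy` are made up, and the 2-day `lag` is an arbitrary assumption for illustration, not an estimate from the data.

```python
import pandas as pd

# Hypothetical toy values, for illustration only (not real data).
toy = pd.DataFrame({
    "Daily Confirmed": [100, 200, 300, 400, 500, 600],
    "Daily Deceased": [2, 3, 7, 8, 11, 12],
})

# Pearson correlation coefficient, a value between -1 and 1.
r = toy["Daily Confirmed"].corr(toy["Daily Deceased"])

# Correlation of each day's deaths with the cases reported `lag` days earlier.
lag = 2  # assumed lag, chosen only for the example
r_lagged = toy["Daily Confirmed"].shift(lag).corr(toy["Daily Deceased"])

print(round(r, 3), round(r_lagged, 3))
```

With the notebook's data, the equivalent call would be `daily_cases.corr(daily_deaths)`. A high coefficient describes association only; on its own it does not establish that one series causes the other.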

In [None]:
total_recovered = ind_covid_df['Total Recovered']

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, total_recovered/10**6, color='green')

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Count (in Million)')
plt.suptitle('Total Recovery by Time', fontsize=22)

plt.annotate(text=str(curr_total_recovered), xy=(curr_date, curr_total_recovered/10**6),
             xycoords='data', xytext=(-70,1), textcoords='offset points', fontsize=14)



In [None]:
daily_recovered = ind_covid_df['Daily Recovered']

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, daily_recovered,'-g', linewidth=1)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Count')
plt.suptitle('Daily Recovered by time',fontsize=22)

plt.annotate(text=str(today_recovered), xy=(curr_date, today_recovered),
             xycoords='data', xytext=(-60,1), textcoords='offset points', fontsize=14)

In [None]:
active_cases = total_cases-total_deaths-total_recovered

In [None]:
curr_active_cases = curr_total_cases - curr_total_deaths - curr_total_recovered

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, active_cases/10**6, color='#483096', linewidth=2)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Count (in Millions)')
plt.suptitle('Active Cases over Time',fontsize=22)

plt.annotate(text=str(curr_active_cases), xy=(curr_date, curr_active_cases/10**6),
             xycoords='data', xytext=(-66,1), textcoords='offset points', fontsize=14)

## Summary

### 1. Case Fatality Ratio (CFR)

Case fatality ratio (CFR) measures the risk of death for a person diagnosed with a disease. The actual probability of death for an infected person is generally lower than the CFR, since not everybody is tested, so there are people who have the disease but are never diagnosed.
CFR can increase or decrease over time, and can vary by location and by the characteristics of the infected person.

CFR gives a rough estimate of the chance of death for a person diagnosed with COVID-19.

$$CFR=\frac{\text{Number of deaths from disease}}{\text{Number of diagnosed cases of disease}}\times 100$$
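As a quick worked example of the formula, here is a small sketch. The helper name `case_fatality_ratio` and the input numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
def case_fatality_ratio(deaths: int, diagnosed_cases: int) -> float:
    """Return the CFR as a percentage (hypothetical helper for illustration)."""
    if diagnosed_cases <= 0:
        raise ValueError("diagnosed_cases must be positive")
    return deaths / diagnosed_cases * 100


# Made-up numbers: 1,200 deaths among 100,000 diagnosed cases.
toy_cfr = case_fatality_ratio(1_200, 100_000)
print(round(toy_cfr, 2))
```

Because the denominator only counts diagnosed cases, undiagnosed infections would push the true risk of death below this figure.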

In [None]:
inf_fatality_ratio = (total_deaths/total_cases)*100

In [None]:
curr_fat_ratio = (curr_total_deaths/curr_total_cases)*100

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, inf_fatality_ratio,'-m', linewidth=1)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Percent')
plt.suptitle('Case Fatality Ratio over Time',fontsize=22)
# plt.title('(Chances of Death)')

plt.annotate(text=str(round(curr_fat_ratio, 3)), xy=(curr_date, curr_fat_ratio),
             xycoords='data', xytext=(-48,-10), textcoords='offset points', fontsize=14)

### 2. Rate of Recovery

$$\text{Recovery Rate} = \frac{\text{Number of recoveries from disease}}{\text{Number of diagnosed cases of disease}}\times 100$$

During the rise of the second wave, the recovery rate started falling in March 2021 and settled at about 80%, after which it started to rise again.
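Since this notebook defines active cases as confirmed minus deceased minus recovered, the fatality ratio, recovery rate and active share always add up to 100%. A fall in the recovery rate during a wave therefore mostly reflects a growing share of active cases. The sketch below checks this identity on made-up numbers; all `toy_` names are hypothetical.

```python
# Made-up cumulative totals, for illustration only (not real data).
toy_total_cases = 1_000_000
toy_total_deaths = 12_000
toy_total_recovered = 850_000

# Same definition of active cases as used in this notebook.
toy_active_cases = toy_total_cases - toy_total_deaths - toy_total_recovered

toy_cfr = toy_total_deaths / toy_total_cases * 100
toy_recovery_rate = toy_total_recovered / toy_total_cases * 100
toy_active_share = toy_active_cases / toy_total_cases * 100

# The three shares partition the confirmed cases, so they sum to 100.
print(round(toy_cfr + toy_recovery_rate + toy_active_share, 6))
```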

In [None]:
recovery_rate = (total_recovered/total_cases)*100

In [None]:
curr_rec_ratio = (curr_total_recovered/curr_total_cases)*100

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, recovery_rate, color='#3b7d24', linewidth=1)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Percent')
plt.suptitle('Recovery rate over Time',fontsize=22)

plt.annotate(text=str(round(curr_rec_ratio, 2)), xy=(curr_date, curr_rec_ratio),
             xycoords='data', xytext=(-50,-5), textcoords='offset points', fontsize=14)

In [None]:
per_act_cases = (active_cases/total_cases)*100

In [None]:
curr_per_act_cases = (curr_active_cases/curr_total_cases)*100

In [None]:
plt.figure(figsize=(10,6))
plt.plot(dates, per_act_cases, color='#8c2730', linewidth=1)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

plt.ylabel('Percent')
plt.suptitle('Percent of Active Cases over Time',fontsize=22)

plt.annotate(text=str(round(curr_per_act_cases, 2)), xy=(curr_date, curr_per_act_cases),
             xycoords='data', xytext=(-40,1), textcoords='offset points', fontsize=14)

In [None]:
fig, ax = plt.subplots()
fig.set_figheight(8)
fig.set_figwidth(10)

labels = ['Deaths','Recovered','Active']
ax.stackplot(dates, total_deaths/10**6,total_recovered/10**6,active_cases/10**6, alpha=.8, labels=labels)
ax.set_ylabel('Count (in Million)')
ax.legend(loc='upper left')

fig.autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')

ax.xaxis.set_major_locator(plt.MaxNLocator(18))
ax.xaxis.set_major_formatter(date_format)

plt.annotate(text=str(curr_total_deaths), xy=(curr_date, curr_total_deaths/10**6),
             xycoords='data', xytext=(-55,5), textcoords='offset points', fontsize=14)

plt.annotate(text=str(curr_total_recovered), xy=(curr_date, curr_total_recovered/10**6),
             xycoords='data', xytext=(-70,1), textcoords='offset points', fontsize=14)

plt.annotate(text=str(curr_active_cases), xy=(curr_date, curr_active_cases/10**6 + curr_total_recovered/10**6),
             xycoords='data', xytext=(-60,-10), textcoords='offset points', fontsize=14)

In [None]:
total_deaths.max()

In [None]:
fig, ax = plt.subplots(figsize=(10,6))

labels = ['Deceased', 'Active', 'Recovered', 'Confirmed']
values = [total_deaths.max(), active_cases.max(), total_recovered.max(), total_cases.max()]

ax.bar(labels, values, color=['#a83232','#3267a8','#67a832','#5d32a8'])
ax.set_ylabel('Count')

# create a list to collect the plt.patches data
totals = []

# find the values and append to list
for i in ax.patches:
    totals.append(i.get_height())

# set individual bar labels using above list
for i in ax.patches:
    # get_x pulls left or right; get_height pushes up or down
    ax.text(i.get_x()+.18, i.get_height()+500000, \
            str(round(i.get_height())), fontsize=15,
                color='dimgrey')




## **2. Vaccination**

India began its **vaccination program** on **16 January 2021**. India initially approved two vaccines for emergency use: the Oxford-AstraZeneca vaccine, also known as **Covishield**, manufactured by the Serum Institute of India, and **Covaxin**, developed by Bharat Biotech. In April 2021, Sputnik V was approved as a third vaccine.

India began by vaccinating health care workers, who were the first to receive the vaccine. On **1 April 2021**, vaccination opened to people above **age 45**, followed by the **18-44 age group** from **1 May** onwards.

In [None]:
vac_df = pd.read_csv('../input/indias-vaccine-progress/cowin_vaccine_data_statewise.csv')
pd.set_option('display.max_rows', 10)

In [None]:
filt = vac_df.State=='India'
ind_vac_df = vac_df.loc[filt].copy()

In [None]:
ind_vac_df

In [None]:
total_population = 1380004385

In [None]:
ind_vac_df.columns

In [None]:
ind_vac_df.drop(['State','Total Sessions Conducted','Total Sites ','Male(Individuals Vaccinated)','Female(Individuals Vaccinated)','Transgender(Individuals Vaccinated)','AEFI'], axis=1, inplace=True)

In [None]:
ind_vac_df

In [None]:
ind_vac_df['Total Doses Administered']=ind_vac_df['First Dose Administered']+ind_vac_df['Second Dose Administered']

In [None]:
ind_vac_df.info()

In [None]:
ind_vac_df.rename(columns={'Updated On':'Date'}, inplace=True)

In [None]:
ind_vac_df.dropna(thresh=5,inplace=True)

In [None]:
# columns that hold counts and should be stored as integers
int_cols = ['First Dose Administered', 'Second Dose Administered',
            'Total Covaxin Administered', 'Total CoviShield Administered',
            'Total Individuals Vaccinated', 'Total Doses Administered']
ind_vac_df[int_cols] = ind_vac_df[int_cols].astype('int64')

In [None]:
ind_vac_df['Date'] = pd.to_datetime(ind_vac_df['Date'],format='%d/%m/%Y')

In [None]:
ind_vac_df

###  Percentage share of population vaccinated

After the first quarter of the vaccination drive, India has vaccinated **10.13%** of its total population, which is about **139 million** individuals, out of which **2.9%** are fully vaccinated. At the end of the first quarter, about **0.93 million** individuals were being immunized per day on average.
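The percentages above can be cross-checked against head counts with simple arithmetic. The sketch below uses the population figure assumed elsewhere in this notebook (1,380,004,385); the helper names are hypothetical, and the 2.9% is treated here as a share of the total population, which is how the notebook computes its "completely vaccinated" column.

```python
TOTAL_POPULATION = 1_380_004_385  # population figure used in this notebook


def percent_of_population(count, population=TOTAL_POPULATION):
    """Share of the population, in percent (hypothetical helper)."""
    return count / population * 100


def people_from_percent(percent, population=TOTAL_POPULATION):
    """Head count corresponding to a percentage of the population."""
    return percent / 100 * population


# 10.13% of the population, expressed in millions of people.
vaccinated_millions = people_from_percent(10.13) / 10**6
print(round(vaccinated_millions, 1))  # 139.8

# 2.9% of the population, expressed in millions of people.
fully_vaccinated_millions = people_from_percent(2.9) / 10**6
print(round(fully_vaccinated_millions, 1))  # 40.0
```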

In [None]:
ind_vac_df.set_index('Date',inplace=True)

In [None]:
ind_vac_df['Percentage Population Vaccinated'] = (ind_vac_df['Total Individuals Vaccinated']/total_population)*100

In [None]:
ind_vac_df['Percentage Population Completely Vaccinated'] = (ind_vac_df['Second Dose Administered']/total_population)*100

In [None]:
ind_vac_df

In [None]:
curr_date = ind_vac_df.index.max()
curr_fst_dose = ind_vac_df.loc[curr_date, 'First Dose Administered']
curr_snd_dose = ind_vac_df.loc[curr_date, 'Second Dose Administered']

In [None]:
per_pop_vac = ind_vac_df['Percentage Population Vaccinated'].max()
per_pop_com_vac = ind_vac_df['Percentage Population Completely Vaccinated'].max()

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_figheight(4)
fig.set_figwidth(10)

slices = [per_pop_vac, 100-per_pop_vac]
colors = ['#00aaff', '#f0b73c']
labels = ['Vaccinated','Unvaccinated']
ax1.pie(slices, labels=labels,colors=colors, wedgeprops={'edgecolor':'black'}, shadow=True, explode=(0.2,0),  autopct='%.2f%%')

slices = [per_pop_com_vac, per_pop_vac-per_pop_com_vac, 100-per_pop_vac]
labels = ['Both dose','Only One dose','Unvaccinated']
ax2.pie(slices, labels=labels,startangle=30,
        wedgeprops={'edgecolor':'black'}, shadow=True, explode=(0.2,0.2,0),  autopct='%.2f%%')


plt.show()

### Cumulative doses administered across the country

In [None]:
fig, (ax,ax2) = plt.subplots(2,1)
fig.set_figheight(14)
fig.set_figwidth(10)

ax1 = ax.twiny()
ax.stackplot(ind_vac_df.index, ind_vac_df['Percentage Population Completely Vaccinated'],
             ind_vac_df['Percentage Population Vaccinated']-ind_vac_df['Percentage Population Completely Vaccinated'],
             labels=['Two dose','One dose'],colors=['#31a354','#addd8e'], alpha=.6)
ax.set_ylabel('Percent (%) of Population')
ax.legend(loc='upper left')

ax.set_xticks(ind_vac_df.index)
ax.xaxis.set_major_locator(plt.MaxNLocator(18))
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
    

date_format = mpl_dates.DateFormatter('%d %b, %Y')
ax.xaxis.set_major_formatter(date_format)

ax1.plot(ind_vac_df.index, ind_vac_df['Percentage Population Vaccinated'],
         label='Total Percent of Population Vaccinated', color='#007580')
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)
ax1.legend(loc='center left')

ax1.set_title('Vaccination over Time', fontsize=23)


ax2.plot(ind_vac_df.index, ind_vac_df['First Dose Administered']/10**6, color='#28abb9', label='First dose')
ax2.plot(ind_vac_df.index, ind_vac_df['Second Dose Administered']/10**6, color='#2d6187', label='Second dose')
ax2.plot(ind_vac_df.index, ind_vac_df['First Dose Administered']/10**6+ind_vac_df['Second Dose Administered']/10**6, color='#ab4b9c', label='Total doses')
ax2.legend()
ax2.set_ylabel('Count (in Millions)')
ax2.set_xticks(ind_vac_df.index)
ax2.xaxis.set_major_locator(plt.MaxNLocator(18))
for tick in ax2.get_xticklabels():
    tick.set_rotation(90)
ax2.xaxis.set_major_formatter(date_format)
ax2.annotate(text=str(curr_fst_dose), xy=(curr_date,curr_fst_dose/10**6), xycoords='data',
            xytext=(-98, -20),
            textcoords='offset points')
ax2.annotate(text=str(curr_snd_dose), xy=(curr_date,curr_snd_dose/10**6), xycoords='data',
            xytext=(-80, 0),
            textcoords='offset points')

plt.annotate(text=str(round(per_pop_com_vac,2)), xy=(curr_date, per_pop_com_vac),
             xycoords='data', xytext=(-38,-35), textcoords='offset points', fontsize=14)

plt.annotate(text=str(round(per_pop_vac-per_pop_com_vac, 2)), xy=(curr_date, per_pop_vac-per_pop_com_vac),
             xycoords='data', xytext=(-35,-85), textcoords='offset points', fontsize=14)

plt.annotate(text=str(round(per_pop_vac, 2)), xy=(curr_date, per_pop_vac),
             xycoords='data', xytext=(-35,0), textcoords='offset points', fontsize=14)

### Individuals immunized on daily basis
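The dataset reports cumulative totals, so daily figures have to be derived by taking the difference between consecutive days. Here is a minimal sketch of that step on made-up numbers; `toy_cumulative` is hypothetical, and the first day is assumed to start from zero, so it keeps its own cumulative value.

```python
import pandas as pd

# Hypothetical cumulative counts, for illustration only (not real data).
toy_cumulative = pd.Series(
    [1000, 2500, 2500, 4200],
    index=pd.date_range("2021-01-16", periods=4, freq="D"),
)

# diff() gives day-over-day change; the first entry has no predecessor,
# so it is filled with the first cumulative value (i.e. a start from 0).
toy_daily = (
    toy_cumulative.diff()
    .fillna(toy_cumulative.iloc[0])
    .astype("int64")
)
print(toy_daily.tolist())  # [1000, 1500, 0, 1700]
```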

In [None]:
ind_vac_df

In [None]:
total_vac = ind_vac_df['Total Individuals Vaccinated']

# Day-over-day difference of the cumulative count. The first day has no
# predecessor, so it keeps its own cumulative value (same as starting from 0).
daily_vac = total_vac.diff().fillna(total_vac.iloc[0]).astype('int64')

ind_vac_df['Daily Individuals Vaccinated'] = daily_vac

In [None]:
ind_vac_df

In [None]:
daily_vac = ind_vac_df['Daily Individuals Vaccinated']

In [None]:
daily_vac

In [None]:
todays_vac = daily_vac.get(curr_date)

In [None]:
avg_daily_vac = daily_vac.median()

In [None]:
avg_daily_vac

In [None]:
plt.figure(figsize=(10,6))
plt.bar(daily_vac.index, daily_vac.values/10**6,color='#de9d23', alpha=.6)
plt.plot(daily_vac/10**6, color='#6a2c70', linewidth=2)
plt.axhline(y=avg_daily_vac/10**6, linewidth=1, alpha=.9)

plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%d %b, %Y')
plt.gca().xaxis.set_major_formatter(date_format)

# plt.gca().xaxis.set_major_locator(plt.MaxNLocator(len(daily_vac)/15))

plt.ylabel('Count (in Millions)')
plt.suptitle('Daily Individuals Vaccinated over time', fontsize=22)

plt.annotate(text=str(todays_vac), xy=(curr_date, todays_vac/10**6),
             xycoords='data', xytext=(-5,10), textcoords='offset points', fontsize=14)

