# Overview

We have two CSV files as input for our analysis:
1. India Covid Vaccination
2. State wise Vaccination
<br><br>
We start with the country-level analysis: data cleaning, EDA, and a summary of our findings. We then look at the state-wise vaccination report and plot a geo map of India, which gives a clear visual representation of each state's vaccination progress and lets us identify the top 5 and bottom 5 states by vaccination status.
<br><br>
**Note**: This notebook is up to date as of `25-03-2021`

<img src="https://akm-img-a-in.tosshub.com/indiatoday/images/story/202101/imagecovax_1200x768.jpeg?v.f6Jh3p4OuZaT1rt0Gb3uzs7paJi1F2&size=770:433" >

# 1. Importing Libraries and Reading Data

In [None]:
# Importing Libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import missingno as msno
import geopandas as gpd

import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Reading the data from CSV file

country_df = pd.read_csv('/kaggle/input/india-covid-vaccination-data/India_Covid_Vaccination.csv', parse_dates=['date'])
state_df = pd.read_csv('/kaggle/input/india-covid-vaccination-data/State_Wise_Vaccination.csv')

# 2. Quick Look at the data

In [None]:
country_df.head()

In [None]:
state_df.head()

In [None]:
country_df.info()
print('-'*50)
state_df.info()

In [None]:
country_df.describe()

In [None]:
state_df.describe()

In [None]:
print('Shape of Country Dataset {0}'.format(country_df.shape))
print('Shape of State Dataset {0}'.format(state_df.shape))

## Analysis on Country Vaccination

# 3. Data Cleaning (Country)

In [None]:
country_df.head()

In [None]:
# Value count for categorical column

for col in country_df.select_dtypes('object').columns:
    print(country_df[col].value_counts())
    print('-'*50)

In [None]:
# Dist plot for numerical columns

plt.figure(figsize = (15,15))

for idx, col in enumerate(country_df.select_dtypes(['float64', 'int64']).columns):
    
    plt.subplot(4,3,idx+1)
    plt.title('Distribution of {0}'.format(col))
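    # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
    sns.histplot(country_df[col], kde = True, color = 'green')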
    
plt.tight_layout(h_pad = 2,rect=[0, 0.03, 1, 0.97])                                

Based on the value counts and distribution plots, the attributes `iso_code`, `continent`, `location`, `population`, `population_density`, `median_age` and `aged_65_older` each hold a single value and explain no variance in the data, so it is better to drop them from further analysis.

In [None]:
# Dropping columns

to_drop = ['iso_code', 'continent', 'location',
            'population','population_density', 'median_age', 'aged_65_older']

country_df.drop(to_drop, axis = 1, inplace = True)
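
The single-valued columns above were picked by reading the value counts and plots. As a rough sketch, the same check can be done programmatically with `nunique`. The helper name `constant_columns` and the toy frame are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return names of columns holding at most one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Toy stand-in for country_df; the values are illustrative only
toy = pd.DataFrame({
    "iso_code": ["IND", "IND", "IND"],
    "population": [1.0, 1.0, 1.0],
    "new_cases": [10.0, 25.0, 17.0],
})

print(constant_columns(toy))
```

On the real data, `country_df.drop(columns=constant_columns(country_df))` would be one way to apply it, assuming every single-valued column is safe to discard.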

In [None]:
# Calculating missing value percentage

for col in country_df.columns:
    null_rate = (country_df[col].isnull().sum()/len(country_df))*100
    if null_rate>0:
        print("{}'s null rate :{}%".format(col,round(null_rate,2)))

In [None]:
# Visualizing missing values

msno.bar(country_df)
plt.show()

We can clearly see that there are a lot of missing values in the data.

In [None]:
# Dropping columns with more than 80% missing values

for col in country_df.columns:
    if (country_df[col].isnull().sum()/len(country_df))*100 > 80:
        country_df.drop(col, axis = 1, inplace = True)
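
The loop above can also be written without dropping columns while iterating. The sketch below computes the missing percentage of every column in one step and keeps the columns at or below the threshold. The function name `drop_sparse_columns` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_null_pct: float = 80.0) -> pd.DataFrame:
    """Keep only columns whose share of missing values is <= max_null_pct."""
    null_pct = df.isnull().mean() * 100  # per-column missing percentage
    return df.loc[:, null_pct <= max_null_pct]

# Toy stand-in for country_df; values and column names are illustrative only
toy = pd.DataFrame({
    "new_cases": [1.0, 2.0, 3.0, 4.0, 5.0],
    "new_tests": [np.nan, 2.0, np.nan, 4.0, 5.0],  # 40% missing, kept
    "mostly_missing": [np.nan] * 5,                # 100% missing, dropped
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

This returns a new frame instead of modifying in place, which makes the step easier to rerun in a notebook.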

In [None]:
# Creating a new month column for visualization purposes

# Truncating each date to the first day of its month (stays a datetime)
country_df['Month_Year'] = country_df['date'].dt.to_period('M').dt.to_timestamp()

In [None]:
# Looking into our final data frame

country_df.tail(3)

- We leave the remaining missing values as they are, because imputing with the mean/median would not make sense for `new_tests` and `new_deaths`.

# 4. Analysis & Visualizations (Country)

In [None]:
# Creating a separate dataframe to see the trends month wise

grouped = country_df[['new_cases', 'new_deaths', 'new_tests']].groupby(country_df['Month_Year']).sum()
grouped.head()

In [None]:
# Plotting trends for the attributes

plt.figure(figsize = (20,16))

kappa = ['new_cases', 'new_deaths', 'new_tests']

for idx, col in enumerate(kappa):
    
    plt.subplot(3,1, idx+1)
    plt.title('Trend of {0} per month'.format(col), fontsize = 20, fontweight='bold')
    sns.lineplot(x = grouped.index, y = col, data = grouped, color = 'red')
    
    plt.grid()
    
    plt.ylabel(col, fontsize = 15)
    plt.xlabel('Months', fontsize = 15)
    
    plt.xticks(fontsize = 15)
    plt.yticks(fontsize = 15)


plt.tight_layout(h_pad = 5)
plt.show()

### Inference:
- New cases and reported deaths peaked in **September 2020**.
- As testing increased, reported new cases and deaths rose as well.
- September 2020 had the highest monthly totals: **1899543** new cases and **24973** deaths.
- After the peak, both new cases and deaths declined; this may be related to a drop in testing or to the start of vaccination.
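
The peak month and its totals can be read off the monthly aggregate directly instead of from the plot. The following is a minimal sketch on a toy frame; the numbers are made up for illustration and are not the dataset's values.

```python
import pandas as pd

# Toy daily data standing in for country_df (illustrative values only)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-08-30", "2020-09-01", "2020-09-15", "2020-10-02"]),
    "new_cases": [100, 300, 250, 120],
    "new_deaths": [2, 6, 5, 3],
})

# Sum the daily counts within each calendar month
monthly = toy.groupby(toy["date"].dt.to_period("M"))[["new_cases", "new_deaths"]].sum()

# idxmax returns the index label (the month) with the largest total
peak_month = monthly["new_cases"].idxmax()
print(peak_month, monthly.loc[peak_month, "new_cases"], monthly.loc[peak_month, "new_deaths"])
```

With the notebook's own `grouped` frame, `grouped['new_cases'].idxmax()` should give the same kind of answer.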

In [None]:
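# Plotting a heat map to see the correlation between variables

# Restricting to numeric columns so the datetime columns are excluded from corr()
sns.heatmap(country_df.select_dtypes('number').corr(), annot = True, cmap = 'YlGnBu')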
plt.show()

In [None]:
# Plotting regplots to see the trend of the correlations

plt.figure(figsize = (10,6))

scatter_lst = ['new_deaths','new_tests']
for idx, column in enumerate(scatter_lst):
    
    plt.subplot(1,2, idx+1)
    plt.title('New Cases Vs {0}'.format(column), fontsize = 20, fontweight = 'bold')
    sns.regplot(x = 'new_cases', y = column, data = country_df, fit_reg = True, ci = None, 
                scatter_kws = {'color':'purple'}, line_kws = {'color':'red'})
    
    plt.ylabel(column, fontsize = 15)
    plt.xlabel('New Cases', fontsize = 15)
    
    plt.xticks(fontsize = 15)
    plt.yticks(fontsize = 15)
    
plt.tight_layout(w_pad = 5, rect=[0,0.03,2,0.97])

### Inference:
- New cases have a strong positive correlation of **0.91** with new deaths and a positive correlation of **0.67** with new tests; the regplots show the same, with best-fit lines trending upward.
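
The coefficients quoted above come from the heat map. As a sketch, they can also be pulled out as plain numbers by selecting one column of the correlation matrix. The toy values below are assumptions chosen so the result is easy to check by eye.

```python
import pandas as pd

# Toy stand-in for the numeric part of country_df (illustrative values only)
toy = pd.DataFrame({
    "new_cases": [10, 20, 30, 40, 50],
    "new_deaths": [1, 2, 3, 4, 5],        # exactly proportional to new_cases
    "new_tests": [100, 90, 120, 150, 130],
})

# Pearson correlation of every other column with new_cases
corr_with_cases = toy.corr()["new_cases"].drop("new_cases")
print(corr_with_cases.round(2))
```

Correlation only measures linear association; it does not by itself say that more testing causes more reported cases or the reverse.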

## Analysis on State wise Vaccination

# 5. Data Preprocessing (State)

In [None]:
state_df.head()

**Understanding meaning of some columns** 

- ``Atleast_1dose_Percentage`` means the percentage of people given at least one dose
- ``Percentage_Fully_Vaccinated`` means the percentage of people fully vaccinated

In [None]:
# Renaming some columns

state_df.rename({'State_or_UT': 'st_nm',
                 'Atleast_1dose_Percentage': 'Atleast_1_Dose',
                'Percentage_Fully_Vaccinated': 'Fully_Vaccinated',
                }, axis = 1, inplace = True)

In [None]:
# Seeing if there are any empty rows

state_df[state_df.isnull().any(axis = 1)]

In [None]:
# Dropping the null row; there is no state in India named Miscellaneous

state_df.dropna(axis = 0, how = 'any', inplace = True)

In [None]:
# Sorting states by name so that they are in a consistent order

state_df = state_df.sort_values(by = 'st_nm').reset_index(drop = True)

In [None]:
# Top 5 states by percentage fully vaccinated

top_5_vac = state_df.sort_values(by = 'Fully_Vaccinated', ascending = False)[['st_nm', 'Fully_Vaccinated']]
top_5_vac.reset_index(drop = True, inplace = True)
top_5_vac.head()

In [None]:
# Bottom 5 states by percentage fully vaccinated

bot_5_vac = state_df.sort_values(by = 'Fully_Vaccinated')[['st_nm', 'Fully_Vaccinated']]
bot_5_vac.reset_index(drop = True, inplace = True)
bot_5_vac.head()

***Note***: In the shapefile we read with GeoPandas, Ladakh has not been updated as a separate UT, which would make it difficult to use later, so we drop it.

In [None]:
# Dropping Ladakh UT

index_drop = state_df[state_df['st_nm'] == 'Ladakh'].index
state_df.drop(index_drop, axis = 0, inplace = True)
state_df.reset_index(drop = True, inplace = True)

In [None]:
# Reading the shape file

fp = r'/kaggle/input/india-shape-files/Indian_States.shp'
map_df = gpd.read_file(fp)
map_df.head()

In [None]:
# Sorting states by name so that they are in a consistent order

map_df = map_df.sort_values(by = 'st_nm').reset_index(drop = True)

In [None]:
# Creating merged dataset for plotting

sub_df = state_df['Fully_Vaccinated']
merged = pd.concat([map_df, sub_df], axis = 1)
merged.head()
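
The `pd.concat` above pairs rows purely by position, so it relies on both frames having the same states in the same sorted order with identical spellings. A more defensive sketch is to join on the `st_nm` key and check which regions found no match. The frames below are plain pandas stand-ins with placeholder geometry strings; the assumption is that `merge` on a GeoDataFrame behaves the same way for this purpose.

```python
import pandas as pd

# Stand-in for map_df: one row per region in the shapefile (placeholder geometry)
shapes = pd.DataFrame({
    "st_nm": ["Assam", "Bihar", "Goa"],
    "geometry": ["poly_assam", "poly_bihar", "poly_goa"],
})

# Stand-in for state_df: different order, one name missing, one extra (illustrative values)
rates = pd.DataFrame({
    "st_nm": ["Goa", "Assam", "Ladakh"],
    "Fully_Vaccinated": [2.0, 0.5, 3.0],
})

# A left join keeps every shape and attaches the rate where the names match
merged_by_key = shapes.merge(rates[["st_nm", "Fully_Vaccinated"]], on="st_nm", how="left")

# Regions with no matching rate usually point at a spelling mismatch
unmatched = merged_by_key.loc[merged_by_key["Fully_Vaccinated"].isna(), "st_nm"].tolist()
print(merged_by_key)
print("No vaccination data for:", unmatched)
```

With a key-based join, a state such as Ladakh that is absent from the shapefile simply does not appear in the result, so row order no longer matters.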

In [None]:
# Plotting the India map to see vaccination levels by region

merged.plot(column='Fully_Vaccinated',cmap='YlOrRd', linewidth=0.8,edgecolor='0',legend=True, figsize = (15,8))
plt.tight_layout(w_pad = 3)
plt.show()

The top 5 states/UTs by percentage of people fully vaccinated are:
1. Lakshadweep
2. Ladakh
3. Tripura
4. Sikkim
5. Andaman and Nicobar Islands

The bottom 5 states, which need to improve their vaccination progress, are:
1. Tamil Nadu
2. Punjab 
3. Assam 
4. Bihar 
5. Uttar Pradesh
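
Rankings like the two lists above can be produced in one call each with `nlargest` and `nsmallest`, as an alternative to sorting the whole frame and taking `head()`. This is a sketch on a toy frame; the single-letter names and percentages are placeholders, not real figures.

```python
import pandas as pd

# Toy stand-in for state_df (placeholder names and values)
toy = pd.DataFrame({
    "st_nm": ["A", "B", "C", "D", "E", "F", "G"],
    "Fully_Vaccinated": [0.5, 3.1, 1.2, 0.1, 2.4, 0.9, 1.8],
})

cols = ["st_nm", "Fully_Vaccinated"]
top_5 = toy.nlargest(5, "Fully_Vaccinated")[cols].reset_index(drop=True)
bottom_5 = toy.nsmallest(5, "Fully_Vaccinated")[cols].reset_index(drop=True)

print(top_5)
print(bottom_5)
```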

# 6. Summing up


- New cases and reported deaths peaked in **September 2020**.
- As testing increased, reported new cases and deaths rose as well.
- September 2020 had the highest monthly totals: **1899543** new cases and **24973** deaths.
- New cases have a strong positive correlation of **0.91** with new deaths and a positive correlation of **0.67** with new tests; the regplots show the same, with best-fit lines trending upward.
- After the peak, both new cases and deaths declined; this may be related to a drop in testing or to the start of vaccination.


<h2>Upvote if you like my work ❤️<br>
If you have any queries, doubts, or suggestions, feel free to drop them in the comments section.</h2>

**My Other Works**<br>
Internet Usage: EDA and Cluster Analysis:https://www.kaggle.com/vishalraibagi/internet-usage-eda-and-cluster-analysis

Price Class Classification: SweetViz & 5 Models:https://www.kaggle.com/vishalraibagi/price-class-classification-sweetviz-5-models

Domestic Violence against Women: EDA: https://www.kaggle.com/vishalraibagi/domestic-violence-against-women-eda