# COVID19 Cases in India

## Task

**To find insights from the data**

## Attribute

- Sno - Serial number
- Date - Date of diagnosis
- Time
- State/UnionTerritory
- ConfirmedIndianNational
- ConfirmedForeignNational
- Cured
- Deaths
- Confirmed - Cumulative number of confirmed positive cases


## Loading necessary packages

In [None]:
!pip freeze | grep pandas

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import pandas_profiling
from IPython.display import display, HTML, IFrame

In [None]:
# This is our main dataset
df = pd.read_csv('../input/covid19-in-india/covid_19_india.csv')
df.head()

In [None]:
df.shape

In [None]:
df1 = df.copy()

In [None]:
df = df1.copy()

# Data Cleaning

In [None]:
df.columns

In [None]:
#@title Changing column names

In [None]:
df.columns = ['sr', 'date', 'time', 'state', 'is_indian', 'is_foreigner', 'cured', 'deaths', 'positive']
df.columns

In [None]:
df.state.value_counts()

In [None]:
# Above we see a few values that are not states, and some states that are repeated under different names
# 'Unassigned' & 'Cases being reassigned to states' are not states, so let's replace them with NaN and drop them later
# Let's merge 'Dadar Nagar Haveli' & 'Daman & Diu' into 'Dadra and Nagar Haveli and Daman and Diu'

df['state'] = df.state.replace({'Dadar Nagar Haveli': 'Dadra and Nagar Haveli and Daman and Diu'})
df['state'] = df.state.replace({'Daman & Diu': 'Dadra and Nagar Haveli and Daman and Diu'})

df['state'] = df.state.replace({'Cases being reassigned to states': np.nan})
df['state'] = df.state.replace({'Unassigned': np.nan})

df['state'].value_counts()
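
The four `replace` calls above can also be expressed as a single mapping, which keeps all the label fixes in one place. The following is a minimal sketch on a toy Series (the sample values and the names `states`, `state_fixes` and `cleaned` are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'state' column; the values are illustrative only.
states = pd.Series([
    'Kerala', 'Dadar Nagar Haveli', 'Daman & Diu',
    'Unassigned', 'Cases being reassigned to states', 'Kerala',
])

# One mapping covers both the merges and the non-state labels.
state_fixes = {
    'Dadar Nagar Haveli': 'Dadra and Nagar Haveli and Daman and Diu',
    'Daman & Diu': 'Dadra and Nagar Haveli and Daman and Diu',
    'Unassigned': np.nan,
    'Cases being reassigned to states': np.nan,
}
cleaned = states.replace(state_fixes)

# dropna=False also shows how many rows were turned into NaN.
print(cleaned.value_counts(dropna=False))
```

On the real data the same dictionary could be passed to `df['state'].replace(...)`.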

In [None]:
#@title Missing values
df.isna().sum()

In [None]:
df.is_indian.unique()

In [None]:
# The feature is_indian contains a '-' value, which stands for a missing value

In [None]:
df.is_indian.value_counts()

In [None]:
# We see that the '-' placeholder accounts for more than 90% of the values

In [None]:
df.is_foreigner.unique()

In [None]:
# The feature is_foreigner also contains a '-' value, which stands for a missing value

In [None]:
df.is_foreigner.value_counts()

In [None]:
# We see that the '-' placeholder accounts for more than 90% of the values

In [None]:
# is_indian and is_foreigner, which record how many Indians or foreigners are affected, have
# more than 90% missing values, so let's drop both features

df = df.drop(['is_indian', 'is_foreigner'], axis=1)
df.shape
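
The share of `'-'` placeholders can be computed directly instead of being read off `value_counts()`. This is a small sketch on made-up data; the toy frame and the `threshold` value are assumptions chosen only for illustration:

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'is_indian': ['1', '-', '-', '-'],
    'is_foreigner': ['-', '-', '2', '-'],
    'cured': [0, 1, 2, 3],
})

# Share of rows holding the '-' placeholder, per column, as a percentage.
placeholder_share = toy[['is_indian', 'is_foreigner']].eq('-').mean() * 100
print(placeholder_share)

# Drop every column whose placeholder share exceeds a hypothetical cut-off.
threshold = 70
to_drop = placeholder_share[placeholder_share > threshold].index.tolist()
toy = toy.drop(columns=to_drop)
print(toy.columns.tolist())
```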

In [None]:
# Let's drop the rows with NaN values
df = df.dropna()
df.shape

In [None]:
df2 = df.copy()

In [None]:
df = df2.copy()

## Feature Engineering

In [None]:
# Let's work on the date feature
# Some values use a 2-digit day and month (e.g. 01) and some use a single digit (e.g. 1)
# The year appears as 4 digits (YYYY) in some rows and as 2 digits (YY) in others
# Hence directly applying pd.to_datetime can produce dates that are not actually in the data (bug)
# so we will split the date string into day, month & year and then convert them to int

# Splitting the string data (n must be passed by keyword in recent pandas versions)
new_date = df['date'].str.split('/', n=2, expand=True)
df['day'] = new_date[0]
df['month']= new_date[1]
df['year'] = new_date[2]

# Converting data type
df['day'] = df['day'].astype(int)
df['month'] = df['month'].astype(int)
df['year'] = df['year'].astype(int)

df.head()

In [None]:
# Let's equalize the year column: a 2-digit year such as 20 becomes 2020

df['year'] = df['year'].apply(lambda x: x + 2000 if x < 100 else x)
df['year'].values

In [None]:
# Now all the values in the year column are 2020
# To get the month name, let's merge day, month and year into a datetime

df['date'] = pd.to_datetime(df[['day', 'month', 'year']])
df['Month'] = df['date'].dt.month_name()

# Let's drop day, month and year, since we now have a corrected date column
df = df.drop(['day', 'month', 'year'], axis=1)

df.head()
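
The split-and-rebuild approach can also be wrapped in a small helper, which makes the year rule explicit and easy to check on a few strings. The sketch below uses only the standard library; `parse_mixed_date` is a hypothetical name, and treating every 2-digit year as belonging to the 2000s is an assumption that happens to fit this dataset:

```python
from datetime import date

def parse_mixed_date(text):
    """Parse a day-first 'D/M/YY' or 'DD/MM/YYYY' string into a date."""
    day, month, year = (int(part) for part in text.split('/'))
    if year < 100:  # 2-digit year: assume the 2000s
        year += 2000
    return date(year, month, day)

# A few sample strings in the mixed formats described above.
for raw in ['30/01/20', '1/2/2020', '22/06/20']:
    print(raw, '->', parse_mixed_date(raw).isoformat())
```

On the real column this could be applied with `df['date'].map(parse_mixed_date)` followed by `pd.to_datetime`.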

### Converting Data Type

In [None]:
# Let's start by converting each column to its correct data type

# state & time are converted to category

df['state'] = df['state'].astype('category')
df['time'] = df['time'].astype('category')

df.head()

In [None]:
# Now let's convert Month to category for data visualization purposes

df['Month'] = df['Month'].astype('category')

df.info()

# Observation: 

In the data above, the feature 'positive' holds the total number of positive cases up to a given date. The same is true for 'cured' and 'deaths', so the rows for the last day of a month already carry the totals up to the end of that month.

For further analysis, we therefore keep only the data for the last day of each month.

In [None]:
jan = df[df['date']=='2020-01-31']
feb = df[df['date']=='2020-02-29']
mar = df[df['date']=='2020-03-31']
apr = df[df['date']=='2020-04-30']
may = df[df['date']=='2020-05-31']
# June uses 22 June as its snapshot date instead of the last day of the month
jun = df[df['date']=='2020-06-22']

frame = [jan, feb, mar, apr, may, jun]

df = pd.concat(frame)

df.date.unique()
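
The snapshot dates above are hard-coded. A possible alternative is to let pandas pick the latest date present in each calendar month, which also handles an incomplete final month. This is a sketch on toy data; the frame and the names `last_dates` and `month_end` are assumptions:

```python
import pandas as pd

# Toy cumulative data; dates and counts are made up for illustration.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-05-30', '2020-05-31',
                            '2020-06-21', '2020-06-22', '2020-06-22']),
    'state': ['Kerala', 'Kerala', 'Kerala', 'Kerala', 'Goa'],
    'positive': [10, 12, 20, 22, 5],
})

# Latest date observed within each calendar month.
last_dates = toy.groupby(toy['date'].dt.to_period('M'))['date'].max()

# Keep only the rows recorded on those latest-available dates.
month_end = toy[toy['date'].isin(last_dates)]
print(month_end)
```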


# Data Visualization

## Number of Cases each month

In [None]:
# Let's check the number of positive cases at the end of each month

# 'positive' is cumulative, so summing over the states per snapshot date gives the month-end total
monthly_cases = df.groupby('date')['positive'].sum()

f, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=monthly_cases.index.month_name(), y=monthly_cases.values, color='blue', ax=ax)
ax.set_title('Number of Cases each month', fontsize=20)
ax.set_xlabel('Months', fontsize=16)
ax.set_ylabel('Number of Positive cases', fontsize=16)
plt.show()

**Observation** As per the data, hardly 1 or 2 cases had been tested or diagnosed by February 2020. Testing started in earnest in March 2020, and the number of positive cases jumps in April 2020.

## Number of Cases Statewise in the month April 2020

In [None]:
# Let's check the number of cases statewise in the month of April 2020

# Let's create a new dataframe only for the month of April 2020
df_April = df[df['Month']=='April']

# Creating a new dataframe with groupby
df_April_State = df_April.groupby(['state'], as_index=False)['positive'].agg('sum').sort_values('positive', ascending=False)

# Lets plot the new dataframe
f, ax = plt.subplots(figsize=(10, 10))
sns.barplot(x='positive', y='state',
            order=df_April_State['state'], data=df_April_State, color= 'blue', ax=ax)
ax.set_title('Number of cases statewise in the month of April 2020', fontsize= 20)
ax.set_ylabel('States/UT', fontsize=16)
ax.set_xlabel('Number of Positive cases', fontsize=16)
plt.close(2)
plt.show()

## Number of cases cured in each state



In [None]:
# 'cured' is a running total, so use only the latest snapshot instead of summing every month-end
latest = df[df['date'] == df['date'].max()]
df_cured = latest.groupby('state', as_index=False, observed=True)['cured'].agg('sum').sort_values('cured', ascending=False)

plt.figure(figsize=(10, 10))

sns.barplot(x='cured', y='state', order=df_cured['state'], data=df_cured, color='blue')
plt.title('Number of cases cured in each state', fontsize=20)
plt.ylabel('States/UT', fontsize=16)
plt.xlabel('Number of cured cases', fontsize=16)
plt.show()

## Current status of Covid 19 in India

In [None]:
# Current status, taken from the latest snapshot because the columns are running totals
latest = df[df['date'] == df['date'].max()]
positive = latest['positive'].sum()
cured = latest['cured'].sum()
deaths = latest['deaths'].sum()
data_pie = [positive, cured, deaths]
# The values are passed as a plain list, so no data_frame argument is needed
fig = px.pie(values=data_pie, names=['Positive Cases', 'Cured People', 'Deaths'],
             title='Current status of Covid19 India')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()


## Most number of cases/cure/deaths on Monthly basis

In [None]:
# Monthwise graph of the number of positive cases found in the last 6 months
# Several columns must be selected with a list (double brackets) after groupby
df_month = df.groupby('Month', as_index=False, observed=True)[['positive', 'cured', 'deaths']].agg('sum').sort_values(by='positive', ascending=True)

px.bar(df_month, x='Month', y='positive', color='Month',
          labels={'positive': 'Positive cases', 'cured': 'People cured', 'deaths': 'Deaths occurred'},
          hover_data=['positive', 'cured', 'deaths'], title='Most number of cases/cure/deaths on Monthly basis', log_y=True)
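
Because `positive`, `cured` and `deaths` are running totals, the month-end figures show the level reached rather than the cases added during each month. One way to recover monthly increments is to difference the cumulative series. The sketch below uses made-up numbers, and `cumulative` and `new_cases` are illustrative names:

```python
import pandas as pd

# Made-up cumulative month-end totals, in chronological order.
cumulative = pd.Series([1, 3, 10, 25],
                       index=['January', 'February', 'March', 'April'])

# diff() gives the change from the previous month; the first month has no
# predecessor, so its increment is its own cumulative value.
new_cases = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)
print(new_cases)
```

The same idea would apply to the real data only after sorting the month-end totals by date.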

# Conclusion

With the data above we can see the increase in the number of cases over the last 6 months, broken down by state and by month. We can also see how the number of tests increased in the later months. However, the data does not hold enough information for deeper analysis or for prediction.
