<a href="https://colab.research.google.com/github/rrevuru/content-aws-mls-c01/blob/master/dataviz/Covid_19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Covid-19 Data Analysis
This notebook analyzes the Covid-19 data published by the CDC. The data is publicly available [here](https://healthdata.gov/dataset/provisional-covid-19-death-counts-sex-age-and-state).


## This analysis tries to answer the following questions. More questions will be added over time.


1.   How many mortalities in each state, and what are the mortality types? 
2.   How does age affect mortality?






### Approach

We use Python data libraries and Python visualization tools to answer these questions.

#### Importing the necessary Python Libraries

In [44]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import cufflinks as cf
cf.go_offline()

#### Loading the data into Pandas dataframe

In [45]:
df = pd.read_csv('./Provisional_COVID-19_Death.csv')

#### Information on the dataframe

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1416 entries, 0 to 1415
Data columns (total 13 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Data as of                                1416 non-null   object 
 1   Start week                                1416 non-null   object 
 2   End Week                                  1416 non-null   object 
 3   State                                     1416 non-null   object 
 4   Sex                                       1416 non-null   object 
 5   Age group                                 1416 non-null   object 
 6   COVID-19 Deaths                           1141 non-null   float64
 7   Total Deaths                              1285 non-null   float64
 8   Pneumonia Deaths                          1089 non-null   float64
 9   Pneumonia and COVID-19 Deaths             1119 non-null   float64
 10  Influenza Deaths                    

#### Data Cleansing

1.   Convert the following columns to datetime<br>
      Data as of<br>
      Start week<br>
      End Week

2.   Rename the following columns so that the names contain no spaces (`Total Deaths` keeps its original name)<br>
      Age group<br>
      COVID-19 Deaths<br>
      Pneumonia Deaths<br>
      Pneumonia and COVID-19 Deaths<br>
      Influenza Deaths<br>
      Pneumonia, Influenza, or COVID-19 Deaths<br>

3. Drop the `Footnote` column




In [47]:
# Map the original column names to shorter names without spaces
df = df.rename(columns={
    "Age group": "Age_group",
    "COVID-19 Deaths": "COVID19",
    "Pneumonia Deaths": "Pneumonia",
    "Pneumonia and COVID-19 Deaths": "Pneumonia_COVID",
    "Influenza Deaths": "Influenza",
    "Pneumonia, Influenza, or COVID-19 Deaths": "Pneumonia_Influenza_COVID19",
})

In [48]:
df = df.drop(['Footnote'], axis=1)

In [49]:
df['Data as of'] = pd.to_datetime(df['Data as of'])

In [50]:
df['Start week'] = pd.to_datetime(df['Start week'])

In [51]:
df['End Week'] = pd.to_datetime(df['End Week'])
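The cleansing cells above could also be folded into one reusable helper. The following is a minimal sketch under stated assumptions: the function name `clean_deaths_frame` is hypothetical, only two of the renames are shown for brevity, and the toy rows (including their `MM/DD/YYYY` date format) are made up rather than taken from the CDC file.

```python
import pandas as pd


def clean_deaths_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, drop the footnote, and parse the date columns."""
    # Hypothetical subset of the renames used in this notebook.
    renames = {
        "Age group": "Age_group",
        "COVID-19 Deaths": "COVID19",
    }
    cleaned = raw.rename(columns=renames)
    # errors="ignore" keeps the helper usable if Footnote was already dropped.
    cleaned = cleaned.drop(columns=["Footnote"], errors="ignore")
    for col in ("Data as of", "Start week", "End Week"):
        cleaned[col] = pd.to_datetime(cleaned[col])
    return cleaned


# Toy row standing in for the CDC file (values are made up).
toy = pd.DataFrame({
    "Data as of": ["05/13/2020"],
    "Start week": ["02/01/2020"],
    "End Week": ["05/09/2020"],
    "Age group": ["All Ages"],
    "COVID-19 Deaths": [10.0],
    "Footnote": [None],
})
cleaned_toy = clean_deaths_frame(toy)
print(cleaned_toy.dtypes)
```

Because the helper returns a new dataframe, the raw data stays available for comparison after cleansing.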

#### Let's revisit the dataframe after the data cleansing

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1416 entries, 0 to 1415
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Data as of                   1416 non-null   datetime64[ns]
 1   Start week                   1416 non-null   datetime64[ns]
 2   End Week                     1416 non-null   datetime64[ns]
 3   State                        1416 non-null   object        
 4   Sex                          1416 non-null   object        
 5   Age_group                    1416 non-null   object        
 6   COVID19                      1141 non-null   float64       
 7   Total Deaths                 1285 non-null   float64       
 8   Pneumonia                    1089 non-null   float64       
 9   Pneumonia_COVID              1119 non-null   float64       
 10  Influenza                    879 non-null    float64       
 11  Pneumonia_Influenza_COVID19  1063 non-null 
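The non-null counts in the output above are lower than the 1416 rows for every death column, so some values are missing. A quick way to see how many is to count the nulls per column. The sketch below uses a small made-up dataframe rather than the real file; the same `isna().sum()` call should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up).
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "COVID19": [12.0, np.nan, 30.0],
    "Influenza": [np.nan, np.nan, 4.0],
})

# isna() marks missing cells; sum() counts them column by column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```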

#### Data Review

At first glance, the dataframe contains aggregate rows for two columns: age group (`All Ages`) and state (`United States`). Keeping them would double-count deaths and skew the analysis, so we remove the rows that hold those aggregates.

In [53]:
df_Ages =  df[df['Age_group'] != 'All Ages']

In [54]:
df = df_Ages[df_Ages.State != 'United States']
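The two filters above can also be written as a single boolean mask, which makes the rule "drop any row that is an aggregate" explicit. This is a sketch on made-up rows; the age-group label other than `All Ages` is only a placeholder.

```python
import pandas as pd

# Toy rows: two aggregate rows and two detail rows (values are made up).
toy = pd.DataFrame({
    "State": ["United States", "Alabama", "Alabama", "Alaska"],
    "Age_group": ["All Ages", "All Ages", "85 years and over", "85 years and over"],
    "Total Deaths": [100.0, 40.0, 25.0, 5.0],
})

# A row is an aggregate if either column holds its roll-up label.
is_aggregate = (toy["Age_group"] == "All Ages") | (toy["State"] == "United States")

# .copy() avoids chained-assignment warnings if the result is modified later.
detail_rows = toy[~is_aggregate].copy()
print(detail_rows)
```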

### Question-1
**How many mortalities in each state, and what are the mortality types?**

The following visualization shows total deaths by state. Hovering over a bar segment displays the death counts by cause for that row.


In [55]:
fig = px.bar(df, x='State', y='Total Deaths', hover_data=['COVID19','Pneumonia', 'Influenza','Pneumonia_COVID','Pneumonia_Influenza_COVID19'])
fig.update_layout(barmode='stack')
fig.show()
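The chart above stacks one bar segment per row, so the per-state totals are only visible as bar heights. To get the numbers themselves, the rows can be summed per state first. The sketch below does this with `groupby` on made-up rows; note that `sum()` skips missing values, so a state with only missing counts shows 0 rather than NaN.

```python
import pandas as pd

# Toy detail rows (values are made up).
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "Total Deaths": [25.0, 15.0, 5.0],
    "COVID19": [3.0, 1.0, float("nan")],
})

# One row per state, with each death column summed over that state's rows.
by_state = toy.groupby("State", as_index=False)[["Total Deaths", "COVID19"]].sum()
print(by_state)
```

The resulting `by_state` frame could be passed to `px.bar` in place of `df` to draw one bar per state.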

### Question-2
**How does age affect mortality?**

The chart below shows total deaths in each state, with each bar segment colored by age group.

In [56]:
fig = px.bar(df, x='State', y='Total Deaths', color='Age_group', title='Total deaths by state, grouped by age group')
fig.show()
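The chart shows death counts, not rates. One way to compare age groups directly is to compute each group's share of all deaths. The sketch below uses made-up rows and placeholder age-group labels. It gives a share of deaths, not a population-based mortality rate, since the columns shown in this notebook do not include population figures.

```python
import pandas as pd

# Toy detail rows (values and age-group labels are made up).
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Alabama"],
    "Age_group": ["85 years and over", "85 years and over", "65-74 years"],
    "Total Deaths": [25.0, 5.0, 10.0],
})

# Sum deaths per age group across all states.
by_age = toy.groupby("Age_group")["Total Deaths"].sum()

# Divide by the grand total to get each group's share of deaths.
share_by_age = by_age / by_age.sum()
print(share_by_age)
```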