This notebook explores the COVID-19 dataset from:
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

Plotly will be used to produce interactive and dynamic data visualisations.

In [65]:
import numpy as np 
import pandas as pd 
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

In [66]:
df = pd.read_csv(r'\Users\vinhe\Code\Projects\covid-19_analysis\covid_19_data.csv')

In [67]:
# Shorten the column names in a single rename call
df = df.rename(columns={
    'Country/Region': 'Country',
    'ObservationDate': 'Date',
    'Province/State': 'State',
})

df.head()

Unnamed: 0,SNo,Date,State,Country,Last Update,Confirmed,Deaths,Recovered
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


Below is an explanation of the features:
- SNo - Serial number
- Date - Date of the observation in MM/DD/YYYY format
- State - Province or state of the observation (may be empty)
- Country - Country of the observation
- Last Update - Time in UTC at which the row was last updated for the given province or country (the format is not standardised, so clean it before use)
- Confirmed - Cumulative number of confirmed cases till that date
- Deaths - Cumulative number of deaths till that date
- Recovered - Cumulative number of recovered cases till that date
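
As a minimal sketch of the cleaning step mentioned for `Last Update`, the snippet below parses both date columns on a toy frame. The three `Last Update` layouts are assumptions chosen for illustration, not a survey of what the real file contains, and `sample` is a hypothetical name.

```python
import pandas as pd

# Toy rows mimicking the two date columns (values are illustrative only).
sample = pd.DataFrame({
    "Date": ["01/22/2020", "09/13/2020", "03/09/2020"],
    "Last Update": ["1/22/2020 17:00", "2020-09-14 04:54:57", "2020-03-09T09:03:03"],
})

# 'Date' has one fixed layout, so an explicit format string is safe.
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

# 'Last Update' mixes layouts (assumed here), so parse each value on its own;
# errors="coerce" turns anything unparseable into NaT instead of raising.
sample["Last Update"] = sample["Last Update"].apply(
    lambda s: pd.to_datetime(s, errors="coerce")
)

# With real datetimes, sorting is chronological rather than alphabetical.
sorted_dates = sample.sort_values("Date")["Date"].dt.strftime("%Y-%m-%d").tolist()
```

Any value that ends up as `NaT` after this step would need a closer look before it is used.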

#### Create DataFrames that combine cases by country instead of by state.

In [68]:
df_country_date = df[df['Confirmed']>0]
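# numeric_only=True sums only the numeric columns; pandas 2.x no longer drops text columns silently
df_country_date = df_country_date.groupby(['Date','Country']).sum(numeric_only=True).reset_index()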

df_country_date.tail()

Unnamed: 0,Date,Country,SNo,Confirmed,Deaths,Recovered
36878,09/13/2020,West Bank and Gaza,108804,30574.0,221.0,20082.0
36879,09/13/2020,Western Sahara,108805,10.0,1.0,8.0
36880,09/13/2020,Yemen,108806,2011.0,583.0,1212.0
36881,09/13/2020,Zambia,108807,13539.0,312.0,12260.0
36882,09/13/2020,Zimbabwe,108808,7526.0,224.0,5678.0


In [69]:
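# Total per country and date, newest date first (numeric columns only)
df_countries = df.groupby(['Country','Date']).sum(numeric_only=True).reset_index().sort_values('Date', ascending=False)
# Keep only the most recent row for each country
df_countries = df_countries.drop_duplicates(subset=['Country'])
df_countries = df_countries[df_countries['Confirmed']>0]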

df_countries

Unnamed: 0,Country,Date,SNo,Confirmed,Deaths,Recovered
2362,Bahamas,09/13/2020,108651,2928.0,67.0,1319.0
9181,Djibouti,09/13/2020,108683,5395.0,61.0,5330.0
13856,Guyana,09/13/2020,108706,1853.0,56.0,1215.0
3910,Bhutan,09/13/2020,108659,245.0,0.0,161.0
14839,Hungary,09/13/2020,108710,12309.0,637.0,4069.0
...,...,...,...,...,...,...
35655,Vatican City,03/09/2020,4507,1.0,0.0,0.0
25615,Palestine,03/09/2020,4322,22.0,0.0,0.0
27187,Republic of Ireland,03/08/2020,4067,21.0,0.0,0.0
0,Azerbaijan,02/28/2020,2664,1.0,0.0,0.0
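
The cell above picks the latest row per country by sorting `Date` as text. That works here because every date falls in 2020, but MM/DD/YYYY strings stop sorting chronologically once the data crosses a year boundary. The sketch below shows the same idea on parsed dates; the frame `totals` and its numbers are made up for illustration.

```python
import pandas as pd

# Toy per-country, per-date totals (made-up numbers spanning a year boundary).
totals = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "Date": ["12/30/2020", "01/02/2021", "12/30/2020", "01/02/2021", "12/31/2020"],
    "Confirmed": [10, 15, 3, 4, 7],
})

# Parse to real datetimes so ordering is chronological, not alphabetical.
totals["Date"] = pd.to_datetime(totals["Date"], format="%m/%d/%Y")

# Sort oldest to newest, then keep the last (most recent) row for each country.
latest = (
    totals.sort_values("Date")
          .drop_duplicates(subset="Country", keep="last")
          .sort_values("Country")
          .reset_index(drop=True)
)
```

On this toy data a plain string sort would pick the December rows for A and B, since "12/30/2020" sorts after "01/02/2021" as text.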


## Choropleth Maps
Animated visualisation of the global spread of COVID-19 over time, up to September 2020.

We can see that the USA, Brazil and India have seen a sharp rise in confirmed COVID-19 cases in recent months.

In [45]:
fig = px.choropleth(df_country_date,
                   locations='Country',
                   locationmode='country names',
                   color='Confirmed',
                   hover_name='Country',
                   animation_frame='Date'
                   )

fig.update_layout(
    title_text = 'Global Spread of COVID-19',
    title_x = 0.5,
    geo=dict(
        showframe=False,
        showcoastlines=False
    ))

fig.show()

## Heatmap of Global Confirmed Cases

In [22]:
fig = go.Figure(data=go.Choropleth(
                locations = df_countries['Country'],
                locationmode = 'country names',
                z = df_countries['Confirmed'],
                colorscale = 'Reds',
                marker_line_color = 'black',
                marker_line_width = 0.5,
                ))

fig.update_layout(
    title_text = 'Confirmed Cases as of 13 September 2020',
    title_x = 0.5,
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    )
)

fig.show()

## Pie Chart
### Proportion of Confirmed COVID-19 Cases by Country
The pie chart below shows each country's share of confirmed COVID-19 cases.

In [48]:
fig = px.pie(df_countries, values='Confirmed', names='Country', height=500)
fig.update_traces(textposition='inside', textinfo='percent+label')

# Centre the title; geo settings are omitted because they have no effect on a pie chart
fig.update_layout(
    title_text='Proportion of Confirmed Cases by Country',
    title_x=0.5
)

fig.show()

### Proportion of COVID-19 Deaths by Country

In [49]:
fig = px.pie(df_countries, values='Deaths', names='Country', height=500)
fig.update_traces(textposition='inside', textinfo='percent+label')

# Centre the title; geo settings are omitted because they have no effect on a pie chart
fig.update_layout(
    title_text='Proportion of Deaths by Country',
    title_x=0.5
)

fig.show()
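
With one slice per country, most slices in the pie charts above are too thin to label. One possible way to tidy this up, sketched here on made-up numbers, is to keep the largest countries and fold the rest into a single "Other" slice. The cut-off `top_n` and the frame names are hypothetical.

```python
import pandas as pd

# Toy latest-per-country totals (made-up numbers).
latest = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Confirmed": [500, 300, 150, 30, 20],
})

top_n = 3  # hypothetical cut-off; adjust to taste
top = latest.nlargest(top_n, "Confirmed")

# Everything outside the top N is summed into one "Other" row.
other_total = latest["Confirmed"].sum() - top["Confirmed"].sum()
pie_data = pd.concat(
    [top, pd.DataFrame({"Country": ["Other"], "Confirmed": [other_total]})],
    ignore_index=True,
)
```

A frame shaped like `pie_data` could then be passed to `px.pie` in place of `df_countries`.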

## Treemaps
### Proportion of COVID-19 Deaths by Country
Treemaps are an alternative to pie charts for representing proportions.

In [47]:
fig = px.treemap(df_countries,
                path=['Country'],
                values='Deaths',
                names='Country',
                height=700,
                title='Proportion of Deaths'
                )

fig.show()

## Bar Charts
### Confirmed COVID-19 Cases by Country Over Time

In [71]:
bar_data = df.groupby(['Country', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index().sort_values('Date', ascending=True)

In [60]:
fig = px.bar(bar_data, x='Date', y='Confirmed', color='Country', height=700, title='Confirmed Cases by Country Over Time')
fig.show()

In [61]:
fig = px.bar(bar_data, x='Date', y='Deaths', color='Country', height=700, title='Deaths by Country Over Time')
fig.show()

## Line Graph
### Total Number of Confirmed COVID-19 Cases, Deaths, and Recoveries Over Time
We can see that the number of COVID-19 cases has been increasing non-linearly. This is likely due to a combination of a rising R value in certain countries and increased testing around the world.

In [46]:
# Global totals per date; numeric_only=True avoids summing the text columns
line_data = df.groupby('Date').sum(numeric_only=True).reset_index()

# Reshape to long format: one row per (Date, Metric) pair
line_data = line_data.melt(id_vars='Date', 
                 value_vars=['Confirmed', 
                             'Recovered', 
                             'Deaths'], 
                 var_name='Metric', 
                 value_name='Value')

fig = px.line(line_data, x="Date", y="Value", color='Metric', 
              title='Confirmed Cases, Recovered Cases, and Deaths Over Time')
fig.show()
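
Because the columns are cumulative, the curve can only rise or stay flat, so the non-linearity is easier to judge from daily increases. A minimal sketch on made-up numbers follows; the frame `cumulative` and the `New` column are hypothetical names.

```python
import pandas as pd

# Toy cumulative global totals (made-up numbers).
cumulative = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"]),
    "Confirmed": [100, 150, 230, 350],
})

cumulative = cumulative.sort_values("Date")

# Daily new cases are the difference between consecutive cumulative totals.
# The first row has no previous day, so its value is NaN.
cumulative["New"] = cumulative["Confirmed"].diff()
```

Daily increases that keep growing, as in this toy series, are what a faster-than-linear cumulative curve looks like.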