![Covid-19-Webpage-banner-1170x240-opt2.jpg](attachment:Covid-19-Webpage-banner-1170x240-opt2.jpg)

# COVID-19 Data Analysis

Many of the visualisations use Plotly's Python graphing library, which makes the graphs and data interactive.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

import math

import warnings
warnings.filterwarnings('ignore')

# Importing of Data Sets

In [1]:
df = pd.read_csv('../input/covid19-analysis-data-set/country_daywise.csv')
df3 = pd.read_csv('../input/covid19-analysis-data-set/countrywise.csv')

In [1]:
df.head()

In [1]:
df3.head()

# Data Checking and Cleaning

### Checking for Missing or Duplicate Data

In [1]:
print(df.isna().sum())
print(df3.isna().sum())

In [1]:
print(df.duplicated().sum())
print(df3.duplicated().sum())

### Changing Date Consistency and Formatting 

In [1]:
df.dtypes
# df3 not needed for date checking

In [1]:
df.Date
# There is inconsistency in year formatting

In [1]:
df[0] = df.Date.str.split('/',expand=True)[0]
df[1] = df.Date.str.split('/',expand=True)[1]
df[2] = df.Date.str.split('/',expand=True)[2]

df[0] = df[0].astype(str).str.zfill(2)
df[1] = df[1].astype(str).str.zfill(2)
df.loc[df[2]=='2020',2] = '20'

# No separate drop is needed: the assignment below overwrites the old 'Date' column
df["Date"] = df[0].astype(str) + df[1].astype(str) + df[2].astype(str)
df.drop([0,1,2],axis=1,inplace=True)

df['Date'] = pd.to_datetime(df['Date'], format='%m%d%y')

In [1]:
df.dtypes
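
The split, pad and rejoin steps above can also be written as one small function. This is only a sketch on made-up sample dates, assuming the only inconsistency is a four-digit year (`2020`) mixed with a two-digit one (`20`). `normalise_dates` is a hypothetical helper and not part of the original notebook.

```python
import pandas as pd

def normalise_dates(dates: pd.Series) -> pd.Series:
    """Parse M/D/YY and M/D/YYYY strings into datetimes (hypothetical helper)."""
    # Assumption: every value is month/day/year separated by '/'.
    parts = dates.str.split('/', expand=True)
    # Keep only the last two digits of the year so '2020' and '20' agree.
    year = parts[2].str[-2:]
    # Zero-pad month and day, then rebuild an unambiguous MM/DD/YY string.
    cleaned = parts[0].str.zfill(2) + '/' + parts[1].str.zfill(2) + '/' + year
    return pd.to_datetime(cleaned, format='%m/%d/%y')

sample = pd.Series(['1/22/20', '10/5/2020', '03/01/20'])  # made-up values
parsed = normalise_dates(sample)
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

Keeping the separators in the rebuilt string makes the format easier to read than the concatenated `%m%d%y` form, while giving the same result for these inputs.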

### Changing Column Names
For ease of parsing data

In [1]:
df.columns = df.columns.str.replace(' ','_').str.lower()
df3.columns = df3.columns.str.replace('/','per').str.replace(' ','_').str.replace('%','percentage').str.lower()
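
To make the effect of the chained replacements concrete, here is a small sketch applied to a few hypothetical headers. The raw column names of the CSV files are not shown in this notebook, so the names below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw headers (assumed, not taken from the actual CSV files).
raw = pd.Index(['Country', 'Deaths / 100 Cases', 'Confirmed last week', '1 week % increase'])

clean = (
    raw.str.replace('/', 'per', regex=False)        # '/' becomes the word 'per'
       .str.replace(' ', '_', regex=False)          # spaces become underscores
       .str.replace('%', 'percentage', regex=False) # '%' becomes 'percentage'
       .str.lower()                                 # everything lower case
)
print(list(clean))
```

With names in this form, every column can be accessed as a plain identifier such as `df3['deaths_per_100_cases']`.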

### Sort Values by Date

In [1]:
df.sort_values('date',inplace=True)

### Checking for Negative Numbers

In [1]:
df[(df['confirmed']<0) | (df['deaths']<0) | (df['recovered']<0) | (df['active']<0) | (df['new_cases']<0)]

In [1]:
df.loc[38747,'active'] = 111           #According to worldometer
df.loc[38747,'recovered'] = 958

The active and recovered counts in row 38747 are wrong, so they are corrected manually by cross-referencing [worldometers](https://www.worldometers.info/)
- As of 1st Nov 2020
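
The negative-value check above lists each column by hand. A more general version could look like the sketch below, which scans every numeric column. `rows_with_negatives` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd

def rows_with_negatives(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the rows in which any numeric column is below zero."""
    numeric = frame.select_dtypes(include='number')
    return frame[(numeric < 0).any(axis=1)]

# Made-up data: only country 'B' has an impossible (negative) count.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'confirmed': [10, 20, 30],
    'active': [5, -3, 12],
})
bad = rows_with_negatives(toy)
print(bad)
```

This only flags suspicious rows. Deciding on the corrected value still requires an external source, as was done here.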

In [1]:
df3[(df3['confirmed']<0) | (df3['deaths']<0) | (df3['recovered']<0) | (df3['active']<0) | (df3['deaths_per_100_cases']<0) | (df3['population']<0) | (df3['cases_per_million_people']<0)]
# India and China have wrong population numbers 

In [1]:
df3.loc[36,'population'] = 1441256866            # China's population according to worldometer
df3.loc[36,'cases_per_million_people'] = 3.28    # 3.28 = 4733 / (1441256866 / 1000000)

df3.loc[79,'population'] = 1384764473            # India's population according to worldometer
df3.loc[79,'cases_per_million_people'] = 54.21   # 54.21 = 75062 / (1384764473 / 1000000)

The rows for China and India contain errors in the population and cases per million figures. The correct population figures are added manually, based on cross-references to [worldometers](https://www.worldometers.info/) as of 1st Nov 2020.

The cases per million people are manually calculated based on the population figures.
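
The arithmetic behind the two corrected values can be reproduced with a short sketch. The case and population figures are copied from the correction cell above, and `cases_per_million` is a hypothetical helper name.

```python
def cases_per_million(cases: int, population: int) -> float:
    """Cases per one million inhabitants, rounded to two decimals."""
    return round(cases / (population / 1_000_000), 2)

# Figures copied from the comments in the correction cell above.
china = cases_per_million(4733, 1441256866)
india = cases_per_million(75062, 1384764473)
print(china, india)
```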

### Dropping of Cruise Ships Included in the Data Sets

In [1]:
print(df[df['country']=='MS Zaandam'].index)
print(df[df['country']=='Diamond Princess'].index)
df = df[df['country']!='MS Zaandam']
df = df[df['country']!='Diamond Princess']

In [1]:
print(df3[df3['country']=='MS Zaandam'].index)
print(df3[df3['country']=='Diamond Princess'].index)
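# Filter by name rather than by hard-coded index, as is done for df above
df3 = df3[~df3['country'].isin(['MS Zaandam', 'Diamond Princess'])]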

# Data Analysis on COVID-19 Infection and Death Rates
As of 10th Oct 2020.

In [1]:
ld = df[df['date']==max(df['date'])]
print(max(df['date']))

## Overview

### COVID-19 vs Other Epidemics


Retrieved from the following websites (1st Nov 2020): 

[SARS](https://www.who.int/csr/sars/country/2003_07_11/en/) | [EBOLA](https://en.wikipedia.org/wiki/List_of_epidemics) | [MERS](https://www.who.int/csr/don/24-february-2020-mers-saudi-arabia/en/#:~:text=From%202012%20until%2031%20January,(IHR%202005)%20to%20date.) | [H1N1](https://en.wikipedia.org/wiki/List_of_epidemics)


In [1]:
pandemics = pd.DataFrame({
    'pandemics' : ['COVID-19', 'SARS', 'EBOLA', 'MERS', "H1N1"],
    'start_year' : [2019, 2002, 2013, 2012, 2009],
    'end_year' : [2020, 2004, 2016, 2020, 2010],
    'confirmed' : [ld['confirmed'].sum(), 8437, 28646, 2519, 6724149],
    'deaths' : [ld['deaths'].sum(), 813, 11323, 866, 19654]
})

pandemics['mortality'] = round((pandemics['deaths']/pandemics['confirmed'])*100,2)

In [1]:
temp = pandemics.melt(id_vars='pandemics', value_vars=['confirmed', 'deaths', 'mortality'], var_name='Case', value_name='Value')

fig = px.bar(temp, x='pandemics', y='Value', color='pandemics', text='Value', facet_col='Case', color_discrete_sequence=px.colors.qualitative.Bold)

fig.update_traces(textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',title='COVID-19 vs Other Pandemics')
fig.update_yaxes(showticklabels = False)
fig.layout.yaxis2.update(matches = None)
fig.layout.yaxis3.update(matches = None)
fig.show()

In [1]:
temp = ld.sort_values('confirmed', ascending=False)

headerColor = 'grey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'

fig = go.Figure(data=[go.Table(
  header=dict(
    values=['Country','Confirmed Cases'],
    line_color='darkslategray',
    fill_color=headerColor,
    align=['left','center'],
      
    font=dict(color='white', size=12)
  ),
  cells=dict(
    values=[
      temp['country'],
      temp['confirmed'],
      ],
    line_color='darkslategray',
    # 2-D list of colors for alternating rows
    fill_color = [[rowOddColor,rowEvenColor,rowOddColor, rowEvenColor]*len(temp)],
    align = ['left', 'center'],
    font = dict(color = 'darkslategray', size = 11)
    ))
])
fig.update_layout(
    title='Confirmed Cases In Each Country',
)
fig.show()

In [1]:
plt.style.use('ggplot')

deaths = ld['deaths'].sum()
recovered = ld['recovered'].sum()
active = ld['active'].sum()
sizes = [deaths,recovered,active]
colors = ['#a5a5a5', '#66b3ff', '#ff9999']

fig1, ax1 = plt.subplots()
ax1.pie(sizes, colors = colors, autopct='%1.1f%%', startangle=90)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.set_size_inches(5.5,5.5)
fig.gca().add_artist(centre_circle)
ax1.axis('equal')  
plt.tight_layout()
plt.title(f"Total COVID-19 Cases: {int(ld['confirmed'].sum()):,}")
plt.legend(['Total Deaths','Total Recovered', 'Total Active'])

### World Map Analysis

In [1]:
# numeric_only=True skips the datetime 'date' column, which cannot be summed
ld_country = ld.groupby('country').sum(numeric_only=True)
temp = ld_country.reset_index()
fig = px.choropleth(temp, locations=temp['country'],
                    color=temp['deaths'],locationmode='country names', 
                    hover_name=temp['country'], 
                    color_continuous_scale=px.colors.sequential.Tealgrn,template='plotly_dark', )
fig.update_layout(
    title='Total Deaths by Country',
)
fig.show()

## Correlation between Total Cases and Total Deaths

In [1]:
ld_country = ld.groupby('country').sum(numeric_only=True)
#Grouping of countries with latest data available

In [1]:
temp = ld_country.reset_index()
fig = px.scatter(temp,
                 x='confirmed', y='deaths',color='country', height = 700,
                 log_x=True, log_y=True, title="World: Death vs Confirmed Cases (log10 scale)")

fig.update_traces(textposition = 'top center')
fig.update_layout(showlegend = False)
fig.update_layout(xaxis_rangeslider_visible = True)
fig.show()

In [1]:
temp = ld_country.reset_index()

temp2 = temp[(temp['country']=='Singapore') | (temp['country']=='Burundi') | (temp['country']=='Yemen')]
temp3 = temp[(temp['country']!='Singapore') & (temp['country']!='Burundi') & (temp['country']!='Yemen')]

fig = px.scatter(temp2,
                 x='confirmed', y='deaths',color='country', height = 700,
                 log_x=True, log_y=True, title="Outliers: Death vs Confirmed Cases (log10 scale)",text='country')

fig.add_trace(go.Scatter(x=temp3['confirmed'], y=temp3['deaths'],marker_line_color="midnightblue", marker_color="lightskyblue",mode='markers',text=temp3['country']))
fig.update_traces(textposition = 'top center')

fig.update_layout(showlegend = False,xaxis_title="confirmed",
    yaxis_title="deaths")
fig.update_layout(xaxis_rangeslider_visible = True)

fig.show()

In [1]:
top = 15
fig = px.scatter(temp.sort_values('confirmed',ascending=False).head(top),
                 x='confirmed', y='deaths',color='country',size='active', height = 700,
                 text='country',log_x=True, log_y=True, title="Top 15: Death vs Confirmed Cases (log10 scale)")

fig.update_traces(textposition = 'top center')
fig.update_layout(showlegend = False)
fig.update_layout(xaxis_rangeslider_visible = True)
fig.show()

In [1]:
temp = df3[df3['deaths_per_100_cases']!=0].sort_values('deaths_per_100_cases').head(10)
x = temp['country']
y = temp['deaths_per_100_cases']

trace = go.Bar(x = x, y=y,name='countries')
layout=go.Layout(title= 'Lowest Deaths per 100 Cases')
figure = go.Figure(data=trace,layout=layout)

figure.show()

## 4 Quadrants Analysis

A country can combine a high or low number of COVID-19 cases with a high or low death rate, which gives 4 possible combinations. Hence, a 4 quadrant analysis is carried out to get a comprehensive understanding of the relationship between cases and deaths.

The 4 quadrants are as follows:
- High cases, Low deaths
- High cases, High deaths
- Low cases, High deaths
- Low cases, Low deaths

In [1]:
a = {'x':[0,70],'y':[1250,1250]}
line1 = pd.DataFrame(a)  
b = {'x':[2.1,2.1],'y':[0,200000]}
line2 = pd.DataFrame(b)

fig = px.scatter(df3, x='deaths_per_100_cases', y='cases_per_million_people',color='country', height = 700, log_x=True, log_y=True, title="4 Quadrant Analysis")

fig.add_trace(go.Scatter(x=line1['x'], y=line1['y'],marker_line_color="midnightblue", marker_color="black",text='Confirmed Cases per 1 Million People Axis'))
fig.add_trace(go.Scatter(x=line2['x'], y=line2['y'],marker_line_color="midnightblue", marker_color="black",text='Deaths per 100 Cases Axis'))
fig.update_traces(textposition = 'top center')

fig.update_layout(showlegend = False,xaxis_title="Deaths per 100 Cases",
    yaxis_title="Confirmed Cases per 1 Million People")
fig.update_layout(xaxis_rangeslider_visible = True)
fig.show()

In [1]:
a = {'x':[0,70],'y':[1250,1250]}
line1 = pd.DataFrame(a)  
b = {'x':[2.1,2.1],'y':[0,200000]}
line2 = pd.DataFrame(b)

temp = df3[(df3['country']=='Qatar') | (df3['country']=='Singapore') | (df3['country']=='Burundi') | (df3['country']=='San Marino') | (df3['country']=='China') | (df3['country']=='Yemen')]
temp2 = df3[(df3['country']!='Qatar') & (df3['country']!='Singapore') & (df3['country']!='Burundi') & (df3['country']!='San Marino') & (df3['country']!='China') & (df3['country']!='Yemen')]

fig = px.scatter(temp, x='deaths_per_100_cases', y='cases_per_million_people',color='country', height = 700, log_x=True, log_y=True, title="4 Quadrant Analysis (Outliers)", text='country')

fig.add_trace(go.Scatter(x=line1['x'], y=line1['y'],marker_line_color="midnightblue", marker_color="black",text='Confirmed Cases per 1 Million People Axis'))
fig.add_trace(go.Scatter(x=line2['x'], y=line2['y'],marker_line_color="midnightblue", marker_color="black",text='Deaths per 100 Cases Axis'))
fig.add_trace(go.Scatter(x=temp2['deaths_per_100_cases'], y=temp2['cases_per_million_people'],marker_line_color="midnightblue", marker_color="lightskyblue",mode='markers',text=temp2['country']))
fig.update_traces(textposition = 'top center')

fig.update_layout(showlegend = False,xaxis_title="Deaths per 100 Cases",
    yaxis_title="Confirmed Cases per 1 Million People")
fig.update_layout(xaxis_rangeslider_visible = True)

fig.show()

## Quadrant centred around the average of the cluster

The graph above plots confirmed cases per 1 million people against deaths per 100 cases for every country in the dataset. The dividing lines (2.1 deaths per 100 cases and 1,250 confirmed cases per 1 million people) are placed around the average of the main cluster of countries.

- The top left quadrant contains the countries with a higher than average number of confirmed cases per million people but lower than average deaths per 100 cases.
- The bottom left quadrant contains the countries with lower than average confirmed cases per million as well as lower than average deaths per 100 cases.
- The top right quadrant contains the countries with both a higher than average number of confirmed cases per million and higher than average deaths per 100 cases.
- The bottom right quadrant contains the countries with a lower than average number of confirmed cases per million but higher than average deaths per 100 cases.

Essentially, going clockwise from the top left quadrant, the four quadrants represent:
- High cases, Low deaths
- High cases, High deaths
- Low cases, High deaths
- Low cases, Low deaths
 

As we can see, Singapore sits in the quadrant with a high number of cases per million people but very few deaths per 100 cases. In fact, it is the leftmost country on the entire graph, which means that out of all the countries in our dataset, Singapore has the lowest deaths per 100 cases.
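
The quadrant assignment described above can be sketched as a small function. The two thresholds are the dividing lines used in the plots (2.1 deaths per 100 cases and 1,250 cases per million people). The function name, the treatment of points lying exactly on a line as "low", and the example inputs are assumptions made for illustration.

```python
DEATHS_PER_100_SPLIT = 2.1      # vertical dividing line used in the plots
CASES_PER_MILLION_SPLIT = 1250  # horizontal dividing line used in the plots

def quadrant(deaths_per_100_cases: float, cases_per_million: float) -> str:
    """Label a country with its quadrant (hypothetical helper)."""
    # Assumption: a value exactly on a dividing line counts as 'Low'.
    cases = 'High cases' if cases_per_million > CASES_PER_MILLION_SPLIT else 'Low cases'
    deaths = 'High deaths' if deaths_per_100_cases > DEATHS_PER_100_SPLIT else 'Low deaths'
    return f'{cases}, {deaths}'

# Made-up inputs purely for illustration (not values from the dataset).
print(quadrant(0.05, 9000))
print(quadrant(5.0, 60))
```

Applied row by row to `df3`, a function like this would give each country a label that could be used to colour or filter the scatter plot.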

## Time Series Analysis

Next, we move on to the time series analysis of the world and of the outliers in the 4 different quadrants.

The outliers are:
- Singapore and Qatar (high cases, low deaths)
- China and Yemen (low cases, high deaths)
- Burundi (low cases, low deaths)
- San Marino (high cases, high deaths)
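
The per-country cells that follow all build the same four traces. As a sketch of how that repetition could be factored out, the hypothetical helpers below select one country's rows and draw the chart. The toy frame is made up, and the column names are assumed to match the renamed columns of `df`.

```python
import pandas as pd

METRICS = ['confirmed', 'deaths', 'recovered', 'active']

def country_series(frame: pd.DataFrame, country: str) -> pd.DataFrame:
    """Select the date and case columns for one country, ordered by date."""
    subset = frame.loc[frame['country'] == country, ['date'] + METRICS]
    return subset.sort_values('date')

def plot_country(frame: pd.DataFrame, country: str):
    """Build the four-line chart used for each country in this section."""
    # Imported here so that country_series can be used without Plotly installed.
    import plotly.graph_objects as go

    data = country_series(frame, country)
    fig = go.Figure()
    for metric in METRICS:
        fig.add_trace(go.Scatter(x=data['date'], y=data[metric], name=metric.capitalize()))
    fig.update_layout(title=country, xaxis_tickfont_size=14,
                      yaxis=dict(title='Number of Cases'))
    return fig

# Made-up rows purely for illustration.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-02', '2020-03-01', '2020-03-01']),
    'country': ['Singapore', 'Singapore', 'Qatar'],
    'confirmed': [12, 10, 3],
    'deaths': [0, 0, 0],
    'recovered': [4, 2, 1],
    'active': [8, 8, 2],
})
sg = country_series(toy, 'Singapore')
print(sg)
```

With these in place, each outlier chart reduces to a call such as `plot_country(df, 'Singapore').show()`.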

In [1]:
df_date = df.groupby('date').sum(numeric_only=True)   # numeric_only=True skips the 'country' strings
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_date.index, y=df_date['confirmed'],name='Confirmed'))
fig.add_trace(go.Scatter(x=df_date.index, y=df_date['deaths'],name='Deaths'))
fig.add_trace(go.Scatter(x=df_date.index, y=df_date['recovered'],name='Recovered'))
fig.add_trace(go.Scatter(x=df_date.index, y=df_date['active'],name='Active'))
fig.update_layout(title = 'World', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))

fig.show()

### High Cases, Low Deaths

In [1]:
singapore = df.loc[df['country']=='Singapore',['date','confirmed','deaths','recovered','active']]

fig = go.Figure()
fig.add_trace(go.Scatter(x=singapore['date'], y=singapore['confirmed'],name='Confirmed'))
fig.add_trace(go.Scatter(x=singapore['date'], y=singapore['deaths'],name='Deaths'))
fig.add_trace(go.Scatter(x=singapore['date'], y=singapore['recovered'],name='Recovered'))
fig.add_trace(go.Scatter(x=singapore['date'], y=singapore['active'],name='Active'))
fig.update_layout(title = 'Singapore', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))

fig.show()

In [1]:
Qatar = df.loc[df['country']=='Qatar',['date','confirmed','deaths','recovered','active']]

fig = go.Figure()
fig.add_trace(go.Scatter(x=Qatar['date'], y=Qatar['confirmed'],name='Confirmed'))
fig.add_trace(go.Scatter(x=Qatar['date'], y=Qatar['deaths'],name='Deaths'))
fig.add_trace(go.Scatter(x=Qatar['date'], y=Qatar['recovered'],name='Recovered'))
fig.add_trace(go.Scatter(x=Qatar['date'], y=Qatar['active'],name='Active'))
fig.update_layout(title = 'Qatar', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))

fig.show()

### Low Cases, High Deaths

In [1]:
china = df.loc[df['country']=='China',['date','confirmed','deaths','recovered','active']]

fig = go.Figure()
fig.add_trace(go.Scatter(x=china['date'], y=china['confirmed'],name='Confirmed'))
fig.add_trace(go.Scatter(x=china['date'], y=china['deaths'],name='Deaths'))
fig.add_trace(go.Scatter(x=china['date'], y=china['recovered'],name='Recovered'))
fig.add_trace(go.Scatter(x=china['date'], y=china['active'],name='Active'))
fig.update_layout(title = 'China', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))

fig.show()

In [1]:
yemen = df.loc[df['country']=='Yemen',['date','confirmed','deaths','recovered','active']]

fig = go.Figure()
fig.add_trace(go.Scatter(x=yemen['date'], y=yemen['confirmed'],name='Confirmed'))
fig.add_trace(go.Scatter(x=yemen['date'], y=yemen['deaths'],name='Deaths'))
fig.add_trace(go.Scatter(x=yemen['date'], y=yemen['recovered'],name='Recovered'))
fig.add_trace(go.Scatter(x=yemen['date'], y=yemen['active'],name='Active'))
fig.update_layout(title = 'Yemen', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))

fig.show()

### Low Cases, Low Deaths

In [1]:
Burundi = df.loc[df['country']=='Burundi',['date','confirmed','deaths','recovered','active']]

fig = go.Figure()
fig.add_trace(go.Scatter(x=Burundi['date'], y=Burundi['confirmed'],name='Confirmed'))
fig.add_trace(go.Scatter(x=Burundi['date'], y=Burundi['deaths'],name='Deaths'))
fig.add_trace(go.Scatter(x=Burundi['date'], y=Burundi['recovered'],name='Recovered'))
fig.add_trace(go.Scatter(x=Burundi['date'], y=Burundi['active'],name='Active'))
fig.update_layout(title = 'Burundi', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))

fig.show()

### High Cases, High Deaths

In [1]:
San_Marino = df.loc[df['country']=='San Marino',['date','confirmed','deaths','recovered','active']]

fig = go.Figure()
fig.add_trace(go.Scatter(x=San_Marino['date'], y=San_Marino['confirmed'],name='Confirmed'))
fig.add_trace(go.Scatter(x=San_Marino['date'], y=San_Marino['deaths'],name='Deaths'))
fig.add_trace(go.Scatter(x=San_Marino['date'], y=San_Marino['recovered'],name='Recovered'))
fig.add_trace(go.Scatter(x=San_Marino['date'], y=San_Marino['active'],name='Active'))
fig.update_layout(title = 'San Marino', xaxis_tickfont_size = 14, yaxis = dict(title = 'Number of Cases'))

fig.show()

## Top 10 Cases Analysis

In [1]:
f = plt.figure(figsize=(10,5))
f.add_subplot(111)

plt.axes(axisbelow=True)
plt.barh(ld_country.sort_values('confirmed')["confirmed"].index[-10:],ld_country.sort_values('confirmed')["confirmed"].values[-10:],color="darkcyan")
plt.tick_params(size=5,labelsize = 13)
plt.xlabel("Confirmed Cases",fontsize=18)
plt.title("Top 10 Countries (Confirmed Cases)",fontsize=20)
plt.grid(alpha=0.3)
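
The top 10 selection above sorts the whole frame twice and slices the last ten entries. A possible alternative is `Series.nlargest`, sketched below on made-up totals that stand in for `ld_country`. The variable names are hypothetical.

```python
import pandas as pd

# Made-up totals indexed by country, standing in for ld_country.
totals = pd.DataFrame(
    {'confirmed': [500, 120, 900, 40]},
    index=['A', 'B', 'C', 'D'],
)

top = totals['confirmed'].nlargest(3)  # largest value first
top_for_barh = top.iloc[::-1]          # reversed so the largest bar is drawn at the top
print(top_for_barh)
```

`plt.barh` draws the first entry at the bottom, which is why the order is reversed before plotting with `plt.barh(top_for_barh.index, top_for_barh.values)`.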

In [1]:
sg_confirmed = ld_country.loc['Singapore','confirmed']
print(f"In comparison, Singapore's total number of confirmed cases is {sg_confirmed}")

In [1]:
f = plt.figure(figsize=(10,5))
f.add_subplot(111)

plt.axes(axisbelow=True)
plt.barh(ld_country.sort_values('deaths')["deaths"].index[-10:],ld_country.sort_values('deaths')["deaths"].values[-10:],color="#a5a5a5")
plt.tick_params(size=5,labelsize = 13)
plt.xlabel("Death Cases",fontsize=18)
plt.title("Top 10 Countries (Death Cases)",fontsize=20)
plt.grid(alpha=0.3)

In [1]:
sg_death = ld_country.loc['Singapore','deaths']
print(f"In comparison, Singapore's total number of death cases is {sg_death}")

In [1]:
f = plt.figure(figsize=(10,5))
f.add_subplot(111)

plt.axes(axisbelow=True)
plt.barh(ld_country.sort_values('recovered')["recovered"].index[-10:],ld_country.sort_values('recovered')["recovered"].values[-10:],color="#66b3ff")
plt.tick_params(size=5,labelsize = 13)
plt.xlabel("Recovered Cases",fontsize=18)
plt.title("Top 10 Countries (Recovered Cases)",fontsize=20)
plt.grid(alpha=0.3)

In [1]:
sg_recovered = ld_country.loc['Singapore','recovered']
print(f"In comparison, Singapore's total number of recovered cases is {sg_recovered}")

In [1]:
f = plt.figure(figsize=(10,5))
f.add_subplot(111)

plt.axes(axisbelow=True)
plt.barh(ld_country.sort_values('active')["active"].index[-10:],ld_country.sort_values('active')["active"].values[-10:],color="#ff9999")
plt.tick_params(size=5,labelsize = 13)
plt.xlabel("Active Cases",fontsize=18)
plt.title("Top 10 Countries (Active Cases)",fontsize=20)
plt.grid(alpha=0.3)

In [1]:
sg_active = ld_country.loc['Singapore','active']
print(f"In comparison, Singapore's total number of active cases is {sg_active}")