## Project: Assessing the spread and impact of covid-19 across the world.
### Proponent: The World Health Organization (WHO)
### Task: Exploratory Data Analysis
#### Date: 2021-10-22

## Approach
Before we begin the data analysis process, we must ensure that our analysis is anchored on the project objectives. To do this, we will formulate questions based on the objectives to guide the exploratory analysis.

### Project Objectives:
1. Track the trend of covid-19 infections/cases and deaths since the outbreak.
2. Identify countries and continents hit hardest by the pandemic.
3. Evaluate effectiveness of vaccines in controlling the spread of covid-19.

### Questions
**Objective 1**
1. What is the trend of new covid-19 cases and deaths in the world since the beginning of the outbreak? 
2. Are the numbers of new cases and new deaths decreasing or increasing?
3. What is the trend continent-wise?

**Objective 2**
1. Which country has the highest infection rate?
2. Which country has the highest death rate?
3. Which countries have above average infection and death rates?
4. Are there any countries with unusually high infection/death rates (outliers)?
5. Which continents have the highest infection/death rates?
6. What is the influence of a continent's/country's socio-economic position on infection and death rates?
7. Does the age of a country's/continent's population have an effect on the number of infections and deaths?

**Objective 3**
#### Assumption:
We assume that the vaccine rollout had gained sufficient momentum by March 2021.

With this assumption in mind:

1. What was the trend of infections before and after vaccine rollout?
2. Which countries saw the highest reduction in average covid infections/deaths after vaccine rollout?

These questions are enough to get our analysis started. The next step is to understand how to evaluate the infection rate and mortality rate of a disease.
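One common way to make countries comparable is to normalize cumulative counts by population. The sketch below assumes that the infection rate is total cases per 100,000 people, the death rate is total deaths per 100,000 people, and the case fatality ratio is deaths divided by confirmed cases. These definitions, the helper name `add_rates` and the toy numbers are assumptions for illustration only; they are not definitions fixed by this project.

```python
import pandas as pd


def add_rates(latest: pd.DataFrame, per: int = 100_000) -> pd.DataFrame:
    """Hypothetical helper: add per-capita rates to one-row-per-country data."""
    out = latest.copy()
    # cumulative cases and deaths per `per` inhabitants
    out['infection_rate'] = out['total_cases'] / out['population'] * per
    out['death_rate'] = out['total_deaths'] / out['population'] * per
    # share of confirmed cases that ended in death
    out['case_fatality'] = out['total_deaths'] / out['total_cases']
    return out


# made-up numbers, only to show the arithmetic
toy = pd.DataFrame({
    'location': ['A', 'B'],
    'total_cases': [1_000, 500],
    'total_deaths': [10, 25],
    'population': [1_000_000, 100_000],
})
rates = add_rates(toy)
print(rates[['location', 'infection_rate', 'death_rate', 'case_fatality']])
```

Country B has fewer cases than country A in absolute terms, but a higher infection rate once population is taken into account, which is why the questions under Objective 2 are phrased in terms of rates.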

In [70]:
# importing requisite libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mysql.connector as mysql
from mysql.connector import errorcode
from datetime import datetime


sns.set_theme(style='darkgrid')
%matplotlib notebook

In [2]:
# creating database connection
config = {
    'user': 'korir',
    'password': 'Wayne1966!',
    'host': 'localhost',
    'database': 'who_covid_19',
    'raise_on_warnings': True
}

conn = mysql.connect(**config)

In [3]:
# loading data from database

query = '''
        SELECT *
        FROM covid_19
        
        '''
covid_data = pd.read_sql(query, conn)

In [4]:
# checking that our data was loaded correctly
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118567 entries, 0 to 118566
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   entry_id                 118567 non-null  int64  
 1   iso_code                 118567 non-null  object 
 2   continent                118567 non-null  object 
 3   location                 118567 non-null  object 
 4   date_                    118567 non-null  object 
 5   total_cases              118567 non-null  int64  
 6   new_cases                118567 non-null  int64  
 7   total_deaths             118567 non-null  int64  
 8   new_deaths               118567 non-null  int64  
 9   population               118567 non-null  int64  
 10  median_age               118567 non-null  float64
 11  aged_65_older            118567 non-null  float64
 12  aged_70_older            118567 non-null  float64
 13  gdp_per_capita           118567 non-null  float64
 14  life

#### Note
The date column in our dataset was loaded with an object dtype; we need to change this to datetime.
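A minimal sketch of such a conversion on a toy column, assuming the dates are ISO-formatted strings (the real column may instead hold `datetime.date` objects, which `pd.to_datetime` also accepts). The frame `raw` and the bad value are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame({'date_': ['2020-02-24', '2020-02-25', 'not-a-date']})

# errors='coerce' turns unparseable values into NaT instead of raising
raw['date_'] = pd.to_datetime(raw['date_'], format='%Y-%m-%d', errors='coerce')

# count the values that could not be parsed, so they can be inspected
n_bad = raw['date_'].isna().sum()
print(raw['date_'].dtype, n_bad)
```

Checking the number of `NaT` values after the conversion is a cheap way to confirm that no dates were silently lost.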

In [5]:
# releasing database resources 
conn.close()

In [12]:
# changing the dtype of the date column from object to datetime
covid_data['date_'] = pd.to_datetime(covid_data['date_'])

In [13]:
# inspecting the first few rows of the dataset
covid_data.head()

| | entry_id | iso_code | continent | location | date_ | total_cases | new_cases | total_deaths | new_deaths | population | median_age | aged_65_older | aged_70_older | gdp_per_capita | life_expectancy | human_development_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | AFG | Asia | Afghanistan | 2020-02-24 | 5 | 5 | 0 | 0 | 39835428 | 18.6 | 2.581 | 1.337 | 1803.99 | 64.83 | 0.511 |
| 1 | 1 | AFG | Asia | Afghanistan | 2020-02-25 | 5 | 0 | 0 | 0 | 39835428 | 18.6 | 2.581 | 1.337 | 1803.99 | 64.83 | 0.511 |
| 2 | 2 | AFG | Asia | Afghanistan | 2020-02-26 | 5 | 0 | 0 | 0 | 39835428 | 18.6 | 2.581 | 1.337 | 1803.99 | 64.83 | 0.511 |
| 3 | 3 | AFG | Asia | Afghanistan | 2020-02-27 | 5 | 0 | 0 | 0 | 39835428 | 18.6 | 2.581 | 1.337 | 1803.99 | 64.83 | 0.511 |
| 4 | 4 | AFG | Asia | Afghanistan | 2020-02-28 | 5 | 0 | 0 | 0 | 39835428 | 18.6 | 2.581 | 1.337 | 1803.99 | 64.83 | 0.511 |


In [14]:
# dropping redundant columns (they are useful within the database context but not here)
covid_data.drop(columns=['entry_id', 'iso_code'], inplace=True)

## Analysis
### Objective 1
#### Answering Question 1 & 2: 
The objective is to understand the general trend of new covid-19 cases and deaths in the world since the start of the pandemic. Let's explore how covid-19 cases and deaths change on a monthly basis across the world by looking at the average number of new cases and new deaths each month.
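It helps to be explicit about what "average per month" means here. Grouping the raw rows by month and taking the mean averages over every country-day row, so the result is the typical daily count for a single country, not a world total. The sketch below uses made-up numbers to contrast that with one possible alternative, assumed here for illustration: summing all countries per day first and then averaging the daily world totals.

```python
import pandas as pd

# made-up data: two countries, three days spanning two months
toy = pd.DataFrame({
    'location': ['A', 'B', 'A', 'B', 'A', 'B'],
    'date_': pd.to_datetime(['2020-01-30', '2020-01-30', '2020-01-31',
                             '2020-01-31', '2020-02-01', '2020-02-01']),
    'new_cases': [10, 30, 20, 40, 50, 70],
})
month = toy['date_'].dt.to_period('M')

# (a) mean over every country-day row in the month
per_row_mean = toy.groupby(month)['new_cases'].mean()

# (b) alternative: world total per day, then the monthly mean of those totals
daily_world = toy.groupby('date_')['new_cases'].sum()
world_daily_mean = daily_world.groupby(daily_world.index.to_period('M')).mean()

print(per_row_mean)
print(world_daily_mean)
```

Both series show the same direction of change over time when the set of reporting countries is stable, but their magnitudes differ, so axis labels and conclusions should state which of the two is being plotted.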

In [59]:
# grouping by month and getting average

month_avg_covid = covid_data.groupby(covid_data['date_'].dt.to_period('M'))[['new_cases', 'new_deaths']].mean()

# display first few rows

month_avg_covid.head()

| date_ | new_cases | new_deaths |
|---|---|---|
| 2020-01 | 32.422145 | 0.678201 |
| 2020-02 | 61.593137 | 2.22549 |
| 2020-03 | 184.804724 | 9.760758 |
| 2020-04 | 439.510506 | 35.874292 |
| 2020-05 | 505.26625 | 26.065867 |


In [60]:
# making the index more readable 
month_avg_covid.set_index(month_avg_covid.index.strftime('%b-%y'), inplace=True)

In [84]:
# visualizing the trend of new covid_19 cases and deaths
fig, axes = plt.subplots(2,1, figsize=(8, 10), tight_layout=True, sharey=False)

# x axis limits
xlims = (month_avg_covid.index[0], month_avg_covid.index[-1])
# chart titles font properties 
title_font = {
    'color': 'navy',
    'fontfamily': 'Monospace',
    'fontstyle': 'normal',
    'fontsize': 12,
    'fontweight': 'bold'
}

# chart axis label font properties 

axis_label_font = {
    'fontfamily': 'Monospace',
    'fontstyle': 'italic',
    'fontsize': 9,
    'fontweight': 'bold'
}

# plot avg new_deaths vs time
sns.lineplot(x='date_', y='new_deaths', ax=axes[0], data=month_avg_covid)
axes[0].set_xlabel('Time in Months', **axis_label_font) # setting name of  xlabel
axes[0].set_ylabel('Average new deaths', **axis_label_font) # setting name of  ylabel
axes[0].set_title('Average new deaths in the world', **title_font) # setting Title
plt.setp(axes[0].get_xticklabels(), fontsize=10, rotation='vertical') # adjusting fontsize and inclination of xtick labels 
axes[0].set_xlim(xlims)  # setting x-lim
axes[0].set_ylim(bottom=0) # setting y-lim

# plot avg new_cases vs time
sns.lineplot(x='date_', y='new_cases', ax=axes[1], data=month_avg_covid) # plotting a line graph 
axes[1].set_xlabel('Time in Months', **axis_label_font) # setting name of  xlabel 
axes[1].set_ylabel('Average new cases', **axis_label_font) # setting name of  ylabel
axes[1].set_title('Average new cases in the world', **title_font)
plt.setp(axes[1].get_xticklabels(), fontsize=10, rotation='vertical') # adjusting fontsize and inclination of xtick labels 
axes[1].set_xlim(xlims) # setting x-lim
axes[1].set_ylim(bottom=0) # setting y-lim


[Figure: two line charts over time in months, "Average new deaths in the world" (top) and "Average new cases in the world" (bottom)]

(0.0, 3688.507320278848)

#### Average new deaths are on the decline
As expected, the number of new deaths initially increased over time, with a sharp spike between October 2020 and January 2021. After January 2021, we observe a decline in the number of new deaths, with the exception of April and July 2021, which saw small spikes.

#### Average new cases have periodic fluctuations
There was a steep, steady increase in the number of new cases reported from the beginning of the pandemic until December 2020, when the number of cases began to drop off. Between January and October 2021, the number of new cases fluctuated periodically. The spike in new cases in April 2021 is likely due to the Delta variant, which many researchers have concluded is more infectious than the Beta variant.
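The "increasing or decreasing" question can also be answered numerically rather than by reading the chart. The sketch below computes the month-over-month relative change on a made-up monthly series and labels each month; the series values and the labels are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

# made-up monthly averages, indexed by month
monthly = pd.Series(
    [100.0, 150.0, 120.0, 120.0],
    index=pd.period_range('2021-01', periods=4, freq='M'),
    name='new_cases',
)

# relative change from the previous month (NaN for the first month)
change = monthly.pct_change()

# turn the sign of the change into a readable label
direction = change.apply(
    lambda c: 'n/a' if pd.isna(c) else 'up' if c > 0 else 'down' if c < 0 else 'flat'
)
print(pd.DataFrame({'new_cases': monthly, 'change': change, 'direction': direction}))
```

Applied to the real monthly averages, the same two steps would give a table that backs up statements such as "new deaths declined after January 2021" with concrete percentages.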

### Objective 1
#### Answering Question 3.
Previously we examined the worldwide trend of new cases and new deaths. Here we look at the trend of cases and deaths on each continent. This might inform the WHO's covid-relief disbursement by prioritizing continents where the number of deaths or cases is not declining.
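As a rough sketch of how "not declining" could be checked once monthly averages per continent are available, the example below pivots a made-up long-form table to one column per continent and compares the last two months. The column name `month`, the numbers, and the rule "latest month is not lower than the previous one" are assumptions for illustration; a real prioritization would likely look at a longer window.

```python
import pandas as pd

# made-up monthly averages in long form
toy = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Africa', 'Europe', 'Europe', 'Europe'],
    'month': pd.PeriodIndex(['2021-08', '2021-09', '2021-10'] * 2, freq='M'),
    'new_deaths': [5.0, 6.0, 7.0, 30.0, 20.0, 10.0],
})

# one row per month, one column per continent
wide = toy.pivot(index='month', columns='continent', values='new_deaths')

# change between the last two months; a value >= 0 means deaths are not declining
last_change = wide.iloc[-1] - wide.iloc[-2]
not_declining = last_change[last_change >= 0].index.tolist()
print(not_declining)
```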

In [155]:
# grouping data by continent and month and getting the average

cont_month_avg_covid = covid_data.groupby(['continent', covid_data['date_'].dt.to_period('M')])[['new_cases', 'new_deaths']].mean() 

In [156]:
# inspecting our resulting dataframe
cont_month_avg_covid.head()

| continent | date_ | new_cases | new_deaths |
|---|---|---|---|
| Africa | 2020-02 | 0.042254 | 0.0 |
| Africa | 2020-03 | 6.669746 | 0.230947 |
| Africa | 2020-04 | 21.246134 | 0.923969 |
| Africa | 2020-05 | 64.860409 | 1.55716 |
| Africa | 2020-06 | 159.469136 | 3.65679 |


In [157]:
# transforming data into long-form for easy visualization
cont_month_avg_covid.reset_index(inplace=True)


In [158]:
# inspecting resulting dataframe 
cont_month_avg_covid.head()

| | continent | date_ | new_cases | new_deaths |
|---|---|---|---|---|
| 0 | Africa | 2020-02 | 0.042254 | 0.0 |
| 1 | Africa | 2020-03 | 6.669746 | 0.230947 |
| 2 | Africa | 2020-04 | 21.246134 | 0.923969 |
| 3 | Africa | 2020-05 | 64.860409 | 1.55716 |
| 4 | Africa | 2020-06 | 159.469136 | 3.65679 |


In [174]:
# formatting the monthly periods as date strings so they can be parsed as datetimes
cont_month_avg_covid['date_'] = cont_month_avg_covid['date_'].dt.strftime('%Y-%m-%d')


In [178]:
# converting the date column to datetime so it plots on a continuous time axis
cont_month_avg_covid['date_'] = pd.to_datetime(cont_month_avg_covid['date_'])

In [179]:
cont_month_avg_covid.head()

| | continent | date_ | new_cases | new_deaths |
|---|---|---|---|---|
| 0 | Africa | 2020-02-29 | 0.042254 | 0.0 |
| 1 | Africa | 2020-03-31 | 6.669746 | 0.230947 |
| 2 | Africa | 2020-04-30 | 21.246134 | 0.923969 |
| 3 | Africa | 2020-05-31 | 64.860409 | 1.55716 |
| 4 | Africa | 2020-06-30 | 159.469136 | 3.65679 |
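The round trip through strings is not the only way to get from monthly periods to datetimes. As an alternative sketch (not the method used in the cells above), a period-typed column can be converted directly with `Series.dt.to_timestamp`. The series `periods` below is made up for illustration.

```python
import pandas as pd

# made-up monthly periods, like the ones produced by dt.to_period('M')
periods = pd.Series(pd.period_range('2020-02', periods=3, freq='M'))

# how='end' gives the last instant of each month; normalize() resets the time to midnight
month_end = periods.dt.to_timestamp(how='end').dt.normalize()

# how='start' gives the first day of each month instead
month_start = periods.dt.to_timestamp(how='start')

print(month_end.tolist())
print(month_start.tolist())
```

Either anchor works for plotting; what matters is using the same one for every series drawn on a shared time axis.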


In [188]:
# visualizing the trend of average new cases continent-wise
fig, axs = plt.subplots(figsize=(8, 5), tight_layout=True, sharey=False)

# left x-axis limit (start of 2020)
xlims = pd.to_datetime('2020-01-01')
# chart titles font properties 
title_font = {
    'color': 'navy',
    'fontfamily': 'Monospace',
    'fontstyle': 'normal',
    'fontsize': 12,
    'fontweight': 'bold'
}

# chart axis label font properties 

axis_label_font = {
    'fontfamily': 'Monospace',
    'fontstyle': 'italic',
    'fontsize': 9,
    'fontweight': 'bold'
}

sns.lineplot(x='date_', y='new_cases', hue='continent', ax=axs, data=cont_month_avg_covid)
axs.set_xlabel('Time in Months', **axis_label_font) # setting name of  xlabel
axs.set_ylabel('Average new cases', **axis_label_font) # setting name of  ylabel
axs.set_title('Average new cases continent-wise', **title_font) # setting Title
plt.setp(axs.get_xticklabels(), fontsize=10, rotation='vertical') # adjusting fontsize and inclination of xtick labels 
axs.set_ylim(bottom=0) # setting lower y-lim
axs.set_xlim(left=xlims) # setting left x-lim


[Figure: line chart of average monthly new cases over time in months, one line per continent]

(18262.0, 18962.95)