In this project, I intend to investigate the relationship between the number of Covid-19 cases and related deaths at different stages of vaccination. The notebook is divided into two parts. In this part, I collect the data from different sources and merge the useful information into a single dataframe.

In [None]:
# Note: this analysis uses data that were updated 23-08-2021
# import modules
import numpy as np
import pandas as pd
import datetime


There are three separate dataframes used:

The first one comes from the World Health Organisation. It gives the new and cumulative numbers of cases and deaths reported in each country.

The second one gives the number of vaccine doses administered per 100 people.

The third one gives the population count in 2020. For simplicity, it will be used to calculate cases and deaths per 100 people.

Source of datasets:

WHO COVID-19 Dashboard. Geneva: World Health Organization, 2020. Available online: https://covid19.who.int/ (last cited: 23 Aug 2021)

Our World in Data. COVID-19 vaccine doses administered per 100 people. Available online: https://ourworldindata.org/grapher/covid-vaccination-doses-per-capita. (last cited: 23 Aug 2021)

Tanu N Prabhu. Population by Country - 2020. Available online: https://www.kaggle.com/tanuprabhu/population-by-country-2020. (last cited: 23 Aug 2021)

In [None]:
# read the csv files: cases and death, vaccine rate, populations
cases_and_deaths = pd.read_csv("../input/covid-cases-and-deaths-updated-23082021/country_vaccinations.csv")
vacc_data = pd.read_csv("../input/covid-vaccination-updated-2382021/covid-vaccination-doses-per-capita.csv")
population_data = pd.read_csv("../input/population-by-country-2020/population_by_country_2020.csv")



Some countries in the first dataset are named differently than in the other two. The following code renames three of them to match.

In [None]:
# clean some country naming issues
cases_and_deaths.replace("Viet Nam", "Vietnam", inplace=True)
cases_and_deaths.replace("The United Kingdom", "United Kingdom", inplace=True)
cases_and_deaths.replace("United States of America", "United States", inplace=True)
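
A set difference is a quick way to see which names fail to match between two datasets. The following is a minimal sketch on toy frames: the values are made up for illustration, and `cases_toy`, `vacc_toy` and `unmatched` are hypothetical names. Only the column names `Country` and `Entity` come from the real datasets.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are illustrative only)
cases_toy = pd.DataFrame({"Country": ["Viet Nam", "France", "The United Kingdom"]})
vacc_toy = pd.DataFrame({"Entity": ["Vietnam", "France", "United Kingdom"]})

# Names in the WHO-style frame that have no exact match in the vaccination frame
unmatched = sorted(set(cases_toy["Country"]) - set(vacc_toy["Entity"]))
print(unmatched)
```

Any name that shows up here would be silently dropped by an exact-match filter unless it is renamed first.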

Different datasets include different countries. For simplicity, I will only include countries that are present in all three datasets.

In [None]:
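# keep only the countries that appear in all three datasets (for simplicity)
# build the lookup sets once instead of calling unique() on every iteration
vacc_countries = set(vacc_data.Entity.unique())
population_countries = set(population_data["Country (or dependency)"].unique())
Country_list = [
    c for c in cases_and_deaths.Country.unique()
    if c in vacc_countries and c in population_countries
]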

In [None]:
# remove rows that have countries not included in analysis
cases_and_deaths = cases_and_deaths.loc[cases_and_deaths["Country"].isin(Country_list)]
vacc_data = vacc_data.loc[vacc_data["Entity"].isin(Country_list)]
population_data = population_data.loc[population_data["Country (or dependency)"].isin(Country_list)]

In [None]:
# Set index to the country column
cases_and_deaths.set_index(["Country"], inplace=True)
vacc_data.set_index(["Entity"], inplace=True)
population_data.set_index(["Country (or dependency)"], inplace=True)

Time-series analysis requires the date to be parsed.

In [None]:
# parse the date
cases_and_deaths["date_parsed"] = pd.to_datetime(cases_and_deaths["Date_reported"], format="%d/%m/%Y")
vacc_data["date_parsed"] = pd.to_datetime(vacc_data["Day"], format="%d/%m/%Y")
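
The explicit `format` argument matters because day-first date strings are ambiguous. A small illustration with made-up dates, assuming the source files use the day/month/year layout implied by the format string above (`dates` and `parsed` are hypothetical names):

```python
import pandas as pd

# Made-up date strings in day/month/year order
dates = pd.Series(["03/04/2021", "23/08/2021"])

# With an explicit format, "03/04/2021" is read as 3 April 2021, not 4 March
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.month.tolist())
```

If a file actually stores dates in another layout, `pd.to_datetime` raises an error with this format string rather than guessing, which makes the mismatch easy to spot.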

Now it is time to extract the useful columns from each dataframe. Once that is done, they will be merged into one dataset.

In [None]:
# extract and modify columns used for analysis
cases_and_deaths.drop(["Country_code","WHO_region", "Date_reported"], inplace=True, axis=1)
vacc_data.drop(["Code", "Day"], inplace=True, axis=1)
population_data = population_data["Population (2020)"]

vacc_data.rename_axis("Country", inplace=True, axis=0)
population_data.rename_axis("Country", inplace=True, axis=0)
population_data.sort_index(inplace=True)

In [None]:
# combine pop_data and case death data, and calculate per 100 values
df = cases_and_deaths.join(population_data)
df["New_cases_per_100"] = 100 * df["New_cases"] / df["Population (2020)"]
df["Cumulative_cases_per_100"] = 100 * df["Cumulative_cases"] / df["Population (2020)"]
df["New_deaths_per_100"] = 100 * df["New_deaths"] / df["Population (2020)"]
df["Cumulative_deaths_per_100"] = 100 * df["Cumulative_deaths"] / df["Population (2020)"]
df.drop(["New_cases","Cumulative_cases","New_deaths","Cumulative_deaths", "Population (2020)"], inplace=True, axis=1)
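
The per-100 calculation relies on `join` aligning the population Series with the country index, so every daily row of a country receives the same population value. A minimal sketch with made-up numbers (`cases_toy`, `pop_toy` and `joined` are hypothetical names; the column names mirror the ones used above):

```python
import pandas as pd

# Toy daily counts indexed by country (numbers are made up)
cases_toy = pd.DataFrame(
    {"New_cases": [500, 250, 30]},
    index=pd.Index(["A", "A", "B"], name="Country"),
)

# Toy population Series; DataFrame.join needs the Series to have a name
pop_toy = pd.Series({"A": 1_000_000, "B": 60_000}, name="Population (2020)")
pop_toy.index.name = "Country"

# join() aligns on the index, repeating each population for every row of that country
joined = cases_toy.join(pop_toy)
joined["New_cases_per_100"] = 100 * joined["New_cases"] / joined["Population (2020)"]
print(joined)
```

For example, 500 new cases in a population of 1,000,000 corresponds to 0.05 cases per 100 people.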

In [None]:
# Create the final dataframe by joining cases-deaths and vacc data
left = df.set_index([df.index,"date_parsed"])
right = vacc_data.set_index([vacc_data.index, "date_parsed"])
df2 = left.join(right)
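
Joining on a (country, date) index keeps every row of the left dataframe and leaves NaN wherever the right dataframe has no entry for that country and date. A minimal sketch with toy data, where `doses_per_100` is a hypothetical column name and all values are made up:

```python
import pandas as pd

# Left: three consecutive days of case data for one toy country
left_toy = pd.DataFrame({
    "Country": ["A", "A", "A"],
    "date_parsed": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "New_cases_per_100": [0.01, 0.02, 0.03],
}).set_index(["Country", "date_parsed"])

# Right: vaccination data reported on only one of those days
right_toy = pd.DataFrame({
    "Country": ["A"],
    "date_parsed": pd.to_datetime(["2021-01-02"]),
    "doses_per_100": [1.5],  # hypothetical column name
}).set_index(["Country", "date_parsed"])

# DataFrame.join defaults to how="left": all left rows survive
merged = left_toy.join(right_toy)
print(merged["doses_per_100"].isna().tolist())
```

The NaN entries produced this way are the gaps that still need to be filled in the vaccination column.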

The final step is to fill in missing data, which in this case occur only in the vaccination column. For each country, I carry forward the most recent preceding value. If there is no preceding value (because the vaccine rollout had not yet started), the missing values are filled with 0s.

In [None]:
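# fill the dates with missing values in total vacc column, using preceding values
# (forward-fill within each country, so values never carry over between countries;
# fillna(method="ffill") is deprecated in recent pandas versions)
df2.reset_index(inplace=True)
value_cols = [col for col in df2.columns if col != "Country"]
df2[value_cols] = df2.groupby("Country")[value_cols].ffill()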


In [None]:
# the remaining NAs are filled with 0s
df2.fillna(0, inplace=True)

# the final dataframe is now completed
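
The two-stage fill can be illustrated on a toy frame. This sketch uses `groupby(...).ffill()` as a compact way to forward-fill within each country; `toy` and `doses_per_100` are hypothetical names and the values are made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "doses_per_100": [np.nan, 2.0, np.nan, np.nan, np.nan],
})

# Stage 1: carry the last known value forward, separately for each country
toy["doses_per_100"] = toy.groupby("Country")["doses_per_100"].ffill()

# Stage 2: whatever is still missing precedes the first report, so use 0
toy["doses_per_100"] = toy["doses_per_100"].fillna(0)
print(toy["doses_per_100"].tolist())
```

Country A's value of 2.0 is carried forward to its own later row but does not spill into country B, whose rows all end up as 0.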

The data is now cleaned. I will export the dataframe and use it for analysis later.

In [None]:
# export the dataframe for EDAs (in later parts)
df2.to_csv("./covid_data_cleaned.csv")
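
When the file is read back in a later part, the dates arrive as plain strings unless they are parsed again, and the default `to_csv` call also writes the row index as an unnamed first column. A minimal round-trip sketch, using an in-memory buffer instead of a file and a made-up one-row frame (`toy`, `buffer` and `restored` are hypothetical names):

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Country": ["A"],
    "date_parsed": pd.to_datetime(["2021-08-23"]),
})

# Write to an in-memory buffer; the row index becomes the first CSV column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read it back, treating the first column as the index and re-parsing the dates
restored = pd.read_csv(buffer, index_col=0, parse_dates=["date_parsed"])
print(restored.dtypes)
```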