# A Beginner's Guide to EDA: COVID-19 World Vaccination Report

## Understanding the Features of the Dataset That Will be Used 

**Country** - this is the country for which the vaccination information is provided;

**Country ISO Code** - ISO code for the country;

**Date** - date for the data entry; for some of the dates we have only the daily vaccinations, for others, only the (cumulative) total;

**Total number of vaccinations** - this is the absolute number of total immunizations in the country;

**Total number of people vaccinated** - a person, depending on the immunization scheme, will receive one or more (typically 2) doses; at a certain moment, the number of vaccinations might be larger than the number of people;

**Total number of people fully vaccinated** - this is the number of people that received the entire set of doses according to the immunization scheme (typically 2); at a certain moment in time, there might be a certain number of people that received one dose and another (smaller) number of people that received all doses in the scheme;

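**Daily vaccinations (raw)** - for a certain data entry, the number of vaccinations for that date/country, as reported (the raw daily change in the total);

**Daily vaccinations** - for a certain data entry, the number of vaccinations for that date/country; unlike the raw column, this value is smoothed (a 7-day average in the upstream Our World in Data source);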

**Total vaccinations per hundred** - ratio (in percent) between vaccination number and total population up to the date in the country;

**Total number of people vaccinated per hundred** - ratio (in percent) between population immunized and total population up to the date in the country;

**Total number of people fully vaccinated per hundred** - ratio (in percent) between population fully immunized and total population up to the date in the country;

**Number of vaccinations per day** - number of daily vaccinations for that day and country;

**Daily vaccinations per million** - ratio (in ppm) between vaccination number and total population for the current date in the country;

**Vaccines used in the country** - the names of the vaccines used in the country (up to date);

**Source name** - source of the information (national authority, international organization, local organization etc.);

**Source website** - website of the source of information;
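
To make the 'per hundred' and 'per million' columns concrete, here is a minimal sketch of how such ratios can be computed. The country, its population, and the dose counts are all made up for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical country: the population and dose counts below are made up.
population = 1_000_000

toy = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "total_vaccinations": [10_000, 25_000, 45_000],  # cumulative doses
    "daily_vaccinations": [10_000, 15_000, 20_000],  # doses given on that day
})

# Doses per 100 inhabitants (a percentage of the population).
toy["total_vaccinations_per_hundred"] = 100 * toy["total_vaccinations"] / population

# Daily doses per 1,000,000 inhabitants.
toy["daily_vaccinations_per_million"] = 1_000_000 * toy["daily_vaccinations"] / population

print(toy)
```

Because the toy population is exactly one million, the 'per million' column equals the daily count here; for any other population the two would differ.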

***

## Import the Necessary Libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

import matplotlib.ticker as mticker

## Import the dataset

In [None]:
data = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv')
data

## Explore the Dataset

In [None]:
# Obtain a description of the dataset

data.describe()

In [None]:
# Obtain a description of the string columns.

data.describe(include='object')

In [None]:
# Find the number of null values in each column

data.isnull().sum()

It is seen that 'People Fully Vaccinated' and 'People Fully Vaccinated Per Hundred' have the highest number of missing values. 

Second is 'Daily Vaccinations Raw'.

Third is 'People Vaccinated' and 'People Vaccinated Per Hundred'.
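
Raw null counts are easier to compare when expressed as a share of all rows. The following is a small sketch on a made-up dataframe (the values are hypothetical); the same two lines could be applied to 'data'.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, only to illustrate the technique.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "people_vaccinated": [100.0, np.nan, 50.0, 80.0],
    "people_fully_vaccinated": [np.nan, np.nan, np.nan, 10.0],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True values.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```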

In [None]:
# Finding the number of unique values in each column.

data.nunique()

From this, we find that there are 23 unique vaccine combinations that are used in 124 countries. 

In [None]:
# Examine the names and the count of each country. 

data['country'].value_counts()

We find that there are a few changes that can be made to the data in order to prevent any potential errors in the analyses. For instance, data is provided for the United Kingdom, but also for England, Scotland, Wales, and Northern Ireland, all of which make up the UK. We can thus remove the rows for these four entries from our data and retain only 'United Kingdom'.
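
To see why keeping both the UK and its constituent nations could distort an analysis, here is a minimal sketch with made-up numbers. Any aggregate over all entries would count the UK's doses twice.

```python
import pandas as pd

# Made-up totals: the four nations add up to the 'United Kingdom' figure.
toy = pd.DataFrame({
    "country": ["United Kingdom", "England", "Scotland", "Wales",
                "Northern Ireland", "India"],
    "total_vaccinations": [100, 80, 10, 6, 4, 200],
})

# Summing every entry counts the UK's 100 doses twice.
naive_total = toy["total_vaccinations"].sum()

# Excluding the four nations leaves each dose counted once.
uk_nations = ["England", "Scotland", "Wales", "Northern Ireland"]
deduplicated_total = toy.loc[~toy["country"].isin(uk_nations), "total_vaccinations"].sum()

print(naive_total, deduplicated_total)
```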


## Clean the Data

In [None]:
# Drop the rows that contain data about England, Scotland, Wales, or Northern Ireland.
# These are already counted within 'United Kingdom'.

uk_nations = ['England', 'Scotland', 'Wales', 'Northern Ireland']
data = data[~data['country'].isin(uk_nations)].copy()

We can also replace the date format given in the dataset with the date format of the 'Pandas' library. This will make our analyses easier later on.

In [None]:
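# Replace the date given with the 'pandas' date format for ease of analyses.
# pd.to_datetime returns a new Series, so the result must be assigned back to the column.

data = data.assign(date=pd.to_datetime(data['date']))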
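
A point worth checking with date conversion is that 'pd.to_datetime' does not modify the dataframe in place. The sketch below, on a two-row toy dataframe with made-up values, shows that the column only changes once the result is stored.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({"date": ["2021-01-16", "2020-12-21"],
                    "daily_vaccinations": [5, 7]})

# The converted Series is discarded here, so 'toy' is left untouched.
pd.to_datetime(toy["date"])
unchanged = not pd.api.types.is_datetime64_any_dtype(toy["date"])

# Storing the result actually converts the column.
toy["date"] = pd.to_datetime(toy["date"])

print(unchanged)
print(toy["date"].dtype)
print(toy["date"].min())  # dates now sort chronologically, not alphabetically
```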

***

## Find the Distribution of Vaccine Combinations in Different Countries

In [None]:
# Find the quantity of each vaccine combination used.

data['vaccines'].value_counts()

We find that the 'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech' combination appears in the highest number of rows (data entries), while 'Moderna' appears in the fewest. Note that 'value_counts()' counts rows here, not doses or countries.

Does this imply that the 'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech' combination is used in the largest number of countries?

Let us do further analyses to check if this is true.
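
Before working through the real data, a small made-up example shows why row counts and country counts can disagree: a single country with many dated entries can dominate 'value_counts()'. The country and vaccine names below are placeholders.

```python
import pandas as pd

# Country 'A' has four dated entries; 'B', 'C' and 'D' have one each.
toy = pd.DataFrame({
    "country":  ["A", "A", "A", "A", "B", "C", "D"],
    "vaccines": ["X, Y", "X, Y", "X, Y", "X, Y", "Z", "Z", "Z"],
})

# Number of rows per combination (what value_counts() reports).
rows_per_combo = toy["vaccines"].value_counts()

# Number of distinct countries per combination.
countries_per_combo = (
    toy.groupby("vaccines")["country"].nunique().sort_values(ascending=False)
)

print(rows_per_combo)
print(countries_per_combo)
```

Here 'X, Y' has the most rows but is used by one country, while 'Z' is used by three.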

In [None]:
# An extra step to display the full data frame, when required.

pd.set_option("display.max_rows", None, "display.max_columns", None)

In [None]:
# Slice the original dataframe to display the 'vaccines' and 'country' columns only.

df1 = data[["vaccines", "country"]]
df1.head(10)

In [None]:
# Create a dictionary of each vaccine combination and its countries of usage.
# Each country appears once per date, so we store the countries as a set to remove repeated values.

res = {}
for vaccine, countries in df1.groupby("vaccines", sort=False)["country"]:
    res[vaccine] = set(countries)

res

In [None]:
# Find the number of values for each key in the dictionary.
# This allows us to count the number of countries using each vaccine combination.

for key, value in res.items():
    print(key, len(value))

We can already see that the 'Pfizer/BioNTech' combination seems to be the most widely used combination, with 25 countries providing it. 

Let's visualise this data to better understand the distribution of each vaccine combination throughout each country.

In [None]:
# Convert this into a dataframe

vacc_coun = pd.DataFrame.from_dict(res,orient='index')


# There will be a lot of null values in this dataframe.
# We can add a highlight to these null values.
# This makes it easier to identify the most and least used vaccines. 

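# The colour is passed positionally because the keyword was renamed from 'null_color' to 'color' in newer pandas versions.
vacc_coun.style.highlight_null('purple')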
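
The null values arise because each combination has a different number of countries, and 'from_dict' with 'orient="index"' pads shorter rows. A minimal sketch with a made-up dictionary follows (lists are used instead of sets so that the column order is predictable).

```python
import pandas as pd

# Made-up mapping: one combination with 1 country, another with 3.
toy_res = {"X, Y": ["A"], "Z": ["B", "C", "D"]}

# Each key becomes a row; rows shorter than the longest one are padded with nulls.
wide = pd.DataFrame.from_dict(toy_res, orient="index")
print(wide)

# Counting the non-null cells per row recovers the number of countries.
countries_per_row = wide.notnull().sum(axis=1)
print(countries_per_row)
```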

In [None]:
# Plotting each country's vaccine combination usage on a world map.
# We use the ISO Code instead of the country name to avoid any errors in plotting. 

import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)

vaccine_map = px.choropleth(data, locations = 'iso_code', color = 'vaccines')
vaccine_map.update_layout(height=300, margin={"r":0,"t":0,"l":0,"b":0})
vaccine_map.show()

Thus, we can see how vaccines have been favored and administered throughout the world.

***

## Find the Countries with the Highest Number of Vaccinated People

We will now find the countries that have the largest number of vaccinated people. We will start by observing the absolute number of total vaccinations in each country.

In [None]:
# Find the maximum number of total vaccinations for each country and display them in descending order.

g1 = data.groupby(['country'])['total_vaccinations'].max().reset_index()
df2 = g1.sort_values(by='total_vaccinations', ascending = False, ignore_index = True)
df2
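
Taking the maximum works here because 'total_vaccinations' is cumulative: it should not decrease over time, so its maximum is the latest reported total, even when the most recent rows are null. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up cumulative totals; country 'A' has no value on its last date.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"],
    "total_vaccinations": [100.0, 250.0, np.nan, 40.0, 90.0],
})

# max() skips nulls, so it returns the last reported cumulative value per country.
latest_total = toy.groupby("country")["total_vaccinations"].max()
print(latest_total)
```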

We will now visualise the top 10 countries with the highest number of vaccinations. 

In [None]:
# Visualise the top ten countries with the highest number of administered doses.

fig, ax = plt.subplots(figsize=(14, 7))
tot_vacc = sns.barplot(ax=ax, data=df2.head(10), y="total_vaccinations", x = "country")

# Extra code to annotate the graph.

for p in tot_vacc.patches:
    tot_vacc.annotate('{:.2f}'.format(p.get_height()), (p.get_x(), p.get_height()+1))

We see that the top 10 countries with the highest number of vaccinations carried out are the US, China, the UK, India, Brazil, Turkey, Israel, Germany, Russia, and the UAE.

Let us now have a more in-depth understanding of the number of vaccines administered in each country. We will do this by studying the number of people vaccinated, the number of people fully vaccinated, the percentage of people vaccinated, and the percentage of people fully vaccinated.

In [None]:
# View the number of people who have received the vaccine, either completely or partially.
# The data is sorted in descending order based on the number of people vaccinated.

g1 = data.groupby(['country'])['people_vaccinated'].max().reset_index()
df2 = g1.sort_values(by='people_vaccinated', ascending = False, ignore_index = True).style.background_gradient(cmap = 'Blues')
df2

Here we find that the US, the UK, and India have the largest number of people who have been vaccinated, either partially or completely.

In [None]:
# View the number of people who have received the vaccine completely.
# The data is sorted in descending order based on the number of people who are fully vaccinated.

g2 = data.groupby(['country'])['people_fully_vaccinated'].max().reset_index()
df3 = g2.sort_values(by='people_fully_vaccinated', ascending = False, ignore_index = True).style.background_gradient(cmap = 'Blues')
df3

Here we find that the US, Israel, and India have the largest number of people who have been vaccinated completely.

In [None]:
# View the percentage of people who have received the vaccine, either completely or partially.
# The data is sorted in descending order based on the percentage of people vaccinated.

g3 = data.groupby(['country'])['people_vaccinated_per_hundred'].max().reset_index()
df4 = g3.sort_values(by='people_vaccinated_per_hundred', ascending = False, ignore_index = True).style.background_gradient(cmap = 'Greens')
df4

Here we find that Gibraltar, Seychelles, and Israel have the highest percentage of people who have been vaccinated, either partially or completely.

In [None]:
# View the percentage of people who have received the vaccine completely.
# The data is sorted in descending order based on the percentage of people who are fully vaccinated.

g4 = data.groupby(['country'])['people_fully_vaccinated_per_hundred'].max().reset_index()
df5 = g4.sort_values(by='people_fully_vaccinated_per_hundred', ascending = False, ignore_index = True).style.background_gradient(cmap = 'Greens')
df5

Here we find that Gibraltar, Israel, and Seychelles have the highest percentage of people who have been vaccinated completely.

Although the US has the overall highest **number** of vaccinations done, Gibraltar has the overall highest **percentage** of vaccinations administered.

This can be explained by understanding the total population of each country:

Gibraltar has a population of 33,684 (as of 2021). 

The United States has a population of 331.42 million (as of 2021). 

Since Gibraltar has a comparatively small population, the same number of vaccinated people makes up a far larger share of its total population than it would in a country with a large population, such as the US.
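
The effect of population size can be checked with simple arithmetic. The sketch below uses the two populations quoted above; the figure of 20,000 vaccinated people is hypothetical and chosen only for illustration.

```python
# Populations as quoted above (as of 2021).
populations = {"Gibraltar": 33_684, "United States": 331_420_000}

# Hypothetical number of vaccinated people, identical for both places.
people_vaccinated = 20_000

# Share of the population that this number represents, in percent.
shares = {
    name: 100 * people_vaccinated / population
    for name, population in populations.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.4f}% of the population")
```

The same 20,000 people amount to more than half of Gibraltar's population but well under a hundredth of a percent of the US population.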

**What can we infer from this?**

Assuming that the vaccine is effective, we can say that since Gibraltar has vaccinated a larger share of its population than any other country, there would most likely be a lower number of Covid cases there. There is also a lower risk of catching and/or spreading the disease. However, we must keep in mind that the percentage of vaccinated people is still below half the population, i.e., below 50%. Therefore, the danger of Covid still remains. 

The US, on the other hand, has vaccinated a large number of people, but this is still less than 10% of its population. It is easy to conclude that the risk of Covid still prevails.

The US would also require more funding and efficient arrangements to provide the vaccine to more people at an increased pace.

It must be noted that many people either cannot or will not have the vaccine due to various reasons. This data would also need to be taken into consideration.

***

## Daily Vaccinations 

We will now perform an analysis of daily vaccinations. 

In [None]:
# Find the maximum number of daily vaccinations.

g5 = data.groupby(['country'])['daily_vaccinations'].max().reset_index()
df6 = g5.sort_values(by='daily_vaccinations', ascending = False, ignore_index = True)
df6.style.background_gradient(cmap = 'Oranges')

In [None]:
# Visualise this data

fig, ax = plt.subplots(figsize=(10, 7))
sns.barplot(ax=ax, data=df6.head(10), y="country", x = "daily_vaccinations")

The US, China, India, the UK, and Turkey have the highest peak daily vaccinations. 

In [None]:
# Find the minimum number of daily vaccinations.

g6 = data.groupby(['country'])['daily_vaccinations'].min().reset_index()
df7 = g6.sort_values(by='daily_vaccinations', ascending = True, ignore_index = True)
df7.style.background_gradient(cmap = 'Oranges')

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
sns.barplot(ax=ax, data=df7.head(10), y="country", x = "daily_vaccinations")

Belgium, Panama, Australia, Albania, and Bolivia have the lowest minimum daily vaccinations. 

In [None]:
# Find the average number of daily vaccinations.

g7 = data.groupby(['country'])['daily_vaccinations'].mean().reset_index()
df8 = g7.sort_values(by='daily_vaccinations', ascending = False, ignore_index = True)
df8.style.background_gradient(cmap = 'Oranges')

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
sns.barplot(ax=ax, data=df8.head(10), y="country", x = "daily_vaccinations")

The US, China, India, the UK, and Turkey have the highest average daily vaccinations. 

**Let us now visualise the daily rate of vaccinations in the top 3 countries - US, China, and India.**

In [None]:
# Slice the original dataset to get the 'date', 'country', and 'daily_vaccinations' in a single dataframe.

df9 = data[["date", "country", "daily_vaccinations"]]
df9.head(10)

In [None]:
# Obtain a dataframe of these data values for the 'United States'.

df_us = df9.loc[df9['country'] == "United States"]
df_us

In [None]:
# Visualise this data.

fig, ax = plt.subplots(figsize=(10, 7))
sns.lineplot(ax=ax, data=df_us, x='date', y='daily_vaccinations', marker="o")


# Set the ticks to show every 7 days to avoid overlapping of dates.

myLocator = mticker.MultipleLocator(7)
ax.xaxis.set_major_locator(myLocator)

# Autoformat the layout of the x-axis ticks.

fig.autofmt_xdate()


In [None]:
# Find the dates of the first and last vaccine dosage for the US as recorded in the dataset.
# Here, we start with 1 instead of 0 because at 0, the daily vaccinations data is null.

df_us.iloc[[1, -1]]

In [None]:
# Obtain a dataframe of these data values for 'China'.

df_china = df9.loc[df9['country'] == "China"]
df_china

In [None]:
# Visualise this data.

fig, ax = plt.subplots(figsize=(10, 7))
sns.lineplot(ax=ax, data=df_china, x='date', y='daily_vaccinations', marker='o')

# Set the ticks to show every 7 days to avoid overlapping of dates.

myLocator = mticker.MultipleLocator(7)
ax.xaxis.set_major_locator(myLocator)

# Autoformat the layout of the x-axis ticks.

fig.autofmt_xdate()

In [None]:
# Find the dates of the first and last vaccine dosage for China as recorded in the dataset.
# Here, we start with 1 instead of 0 because at 0, the daily vaccinations data is null.

df_china.iloc[[1, -1]]

In [None]:
# Obtain a dataframe of these data values for 'India'.

df_india = df9.loc[df9['country'] == "India"]
df_india

In [None]:
# Visualise this data.

fig, ax = plt.subplots(figsize=(10, 7))
sns.lineplot(ax=ax, data=df_india, x='date', y='daily_vaccinations', marker='o')

# Set the ticks to show every 7 days to avoid overlapping of dates.

myLocator = mticker.MultipleLocator(7)
ax.xaxis.set_major_locator(myLocator)

# Autoformat the layout of the x-axis ticks.

fig.autofmt_xdate()

In [None]:
# Find the dates of the first and last vaccine dosage for India as recorded in the dataset.
# Here, we start with 1 instead of 0 because at 0, the daily vaccinations data is null.

df_india.iloc[[1, -1]]
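
Selecting position 1 assumes that only the very first row is null. A slightly more general sketch, assuming nothing about how many leading rows are empty, drops the null rows first and then takes the first and last remaining entries. The dataframe below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data for a placeholder country 'X'; the first row has no daily value.
toy = pd.DataFrame({
    "date": ["2021-01-15", "2021-01-16", "2021-01-17", "2021-01-18"],
    "country": ["X"] * 4,
    "daily_vaccinations": [np.nan, 120.0, 150.0, 170.0],
})

# Keep only rows with a recorded daily value, then take the first and last of them.
valid = toy.dropna(subset=["daily_vaccinations"])
first_and_last = valid.iloc[[0, -1]]
print(first_and_last)
```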

We can make the following observations:

For the United States, the daily vaccinations grew almost at a steady pace. Towards the end of February 2021, the numbers dropped, but they rose again by the beginning of March 2021.

For China, the daily vaccinations remained constant throughout December 2020. By January 2021, the numbers began to increase, following which there were several days of increasing and decreasing numbers. Towards the end of February 2021, the numbers once again remained constant. 

For India, the daily vaccinations drop for a few days, and then proceed to increase at an unsteady rate. By the beginning of March, 2021, there is a sharp increase in the numbers. 

We also see that, according to the first non-null records in the dataset, the 3 countries began their vaccination drives on different dates - 

The United States began on the 21st of December, 2020.

China began on the 16th of December, 2020.

India began on the 16th of January, 2021.

***

## Additional Step : Correlations Among the Features

On studying the data, we can expect the features to have a positive correlation with one another. A heatmap can help us visualise this. 

In [None]:
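# Find the correlation between the numeric columns.
# Non-numeric columns (such as 'country' and 'vaccines') are excluded first, since newer pandas versions raise an error otherwise.

corr = data.select_dtypes(include='number').corr()
corr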

In [None]:
# Visualize this correlation using a heatmap.

plt.figure(figsize=(10,10))
sns.heatmap(corr, cbar=True, square= True, fmt='.1f', annot=True, annot_kws={'size':15}, cmap='Reds')

From the heatmap, we can confirm the following - 

* There is no negative correlation between any of the features, which means that they tend to increase together.
* The number of people vaccinated and the number of daily vaccinations show a correlation of 1.0 on the heatmap (values are rounded to one decimal place), i.e. a very strong positive correlation, as can be expected.
* The number of people fully vaccinated and the number of daily vaccinations have a very high positive correlation, as can be expected.
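
Since the heatmap prints coefficients with one decimal place, a displayed value of 1.0 need not be an exact 1. The sketch below, on made-up numbers, restricts the correlation to numeric columns and shows a coefficient of about 0.96 that would still be displayed as 1.0.

```python
import pandas as pd

# Made-up values; 'country' is a text column and is excluded from the correlation.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A"],
    "people_vaccinated": [10, 20, 30, 40],
    "daily_vaccinations": [1, 2, 3, 4],
    "people_fully_vaccinated": [0, 1, 5, 6],
})

# Pearson correlation between the numeric columns only.
corr = toy.select_dtypes(include="number").corr()

exact = corr.loc["people_vaccinated", "daily_vaccinations"]
near = corr.loc["people_vaccinated", "people_fully_vaccinated"]

print(corr)
print(round(near, 1))  # what a heatmap with fmt='.1f' would display
```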

***