**Si Nguyen Pham**

**Senior at VNU HCM International University**

**August 7th, 2021**

**Last updated: August 29th, 2021**

# **AN OVERVIEW OF GLOBAL COVID-19 VACCINATION PROGRESS**

**Introduction:**

By now, almost everyone in the world has heard of COVID-19. It has been nearly two years since the world recorded its first case, and fortunately several types of vaccines have since been produced. If you follow the daily news regularly, you have seen and heard plenty of COVID-19 statistics. In case you are wondering how people work with such data, this notebook is for you. Here, I mainly focus on analyzing information related to vaccination progress, and we will go step by step to reach the final result. The notebook is suitable for beginner and intermediate readers: I explain everything in detail, so if you follow along carefully you should be able to understand every step.

This project uses the Python language.

Many thanks to Gabriel Preda and Rishav Sharma for providing the datasets.

**I. Setting up the environment:**

**1.1. Import libraries:**

In [None]:
# Import libraries:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from scipy.stats import pearsonr 

**1.2. Reading files:**

In [None]:
# Read the country_vaccinations dataset
vcn = pd.read_csv("../input/covid-world-vaccination-progress/country_vaccinations.csv")
# Read the country_vaccinations_by_manufacturer dataset
vbm = pd.read_csv("../input/covid-world-vaccination-progress/country_vaccinations_by_manufacturer.csv")
# Read the 2021_population dataset
wp = pd.read_csv("../input/world-population/2021_population.csv")

**II. Data Checking and Cleaning:**

**2.1. Country_vaccinations dataset:**

In [None]:
# Call out the first 5 rows of the dataset
vcn.head()

In [None]:
# Dataset information
vcn.info()

The null values do not affect our calculations, because the values in the total_vaccinations column are cumulative: the latest reported figure already includes all earlier doses. Therefore, we do not need to clean this dataset.
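To illustrate why the gaps are harmless here, below is a minimal sketch on made-up numbers (the toy frame and its values are assumptions, not rows from the dataset). pandas skips NaN when taking a maximum, so the latest reported cumulative total is still recovered.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): cumulative totals with missing reports
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "total_vaccinations": [100.0, np.nan, 250.0, 10.0, 40.0, np.nan],
})

# Count missing values per column
missing = toy.isna().sum()

# max() skips NaN by default, so each country keeps its latest cumulative total
latest = toy.groupby("country")["total_vaccinations"].max()
print(missing)
print(latest)
```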

In [None]:
# Count the number of countries in the dataset
vcn['country'].nunique()

There are 219 countries listed in the country_vaccinations dataset.

**2.2. Country_vaccinations_by_manufacturer dataset:**

In [None]:
# Call out the first 5 rows of the dataset
vbm.head()

In [None]:
# Dataset information
vbm.info()

This dataset looks perfect!

In [None]:
vbm['location'].nunique()

Meanwhile, the country_vaccinations_by_manufacturer dataset lists only 35 countries.

**2.3. World_population dataset:**

In [None]:
# Call out the first 5 rows of the dataset
wp.head()

In [None]:
# Dataset information
wp.info()

This dataset is perfect, too!

In [None]:
wp['country'].nunique()

There are 228 countries listed in this dataset.

**2.4. Concatenate two dataframes:**

To estimate the share of each country's population that is fully vaccinated, I needed to combine the two dataframes. As we can see, the world population dataset lists 228 countries, while the country vaccinations dataset lists only 219. That means we need to remove the countries that appear in only one of the two datasets.

As a beginner, I do not know many advanced techniques. However, I am proud to have produced the dataset below. The idea is to select the necessary columns, concatenate the chosen dataframes, keep only the countries that appear more than once in the combined dataframe (a country that appears only once exists in just one of the two sources, so it is redundant), drop duplicated rows (keeping the last or the first occurrence, depending on which source we want to keep), and reset the index. You could also clean the data in Excel, but I wanted to challenge myself by using Python, so if you want to do the same, feel free to reuse the idea.

In [None]:
# Population table: keep the country name and the latest population figure
wp_sort = wp[['country', '2021_last_updated']]
# Vaccination table: keep the last reported row per country
vcn_drop = vcn.drop_duplicates('country', keep="last")
vcn_sort = vcn_drop[['country', 'people_fully_vaccinated']]
# Stack both tables and keep only countries present in both (count > 1)
df_same = pd.concat([wp_sort, vcn_sort])
df_same = df_same[df_same.groupby('country').country.transform(len) > 1]
# keep="last" retains the vaccination rows
df_same = df_same.drop_duplicates('country', keep="last")
df_same_sort = df_same[['country', 'people_fully_vaccinated']]
df_same_sort = df_same_sort.rename(columns={'country': 'country_vaccinations'})
# Sort by country so the rows line up with the population table below
df_same_sort = df_same_sort.sort_values('country_vaccinations')
df_same_sort.reset_index(drop=True, inplace=True)
# Apply the same filter on the population side; keep="first" retains the population rows
wp_clean = pd.concat([wp, df_same])
wp_clean = wp_clean[wp_clean.groupby('country').country.transform(len) > 1]
wp_clean = wp_clean.drop_duplicates('country', keep="first")
wp_clean_sort = wp_clean[['country', '2021_last_updated']]
wp_clean_sort = wp_clean_sort.sort_values('country')
wp_clean_sort.reset_index(drop=True, inplace=True)
# Place the two aligned tables side by side
cbn = pd.concat([wp_clean_sort, df_same_sort], axis=1)
cbn
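For comparison, the same idea of keeping only the countries present in both tables can also be expressed as an inner join. This is a sketch on toy data, not the method used in this notebook: the frames `pop` and `vac` and their values are made up, and only the column names follow the datasets above.

```python
import pandas as pd

# Toy stand-ins for the population and vaccination tables (hypothetical values)
pop = pd.DataFrame({
    "country": ["A", "B", "C"],
    "2021_last_updated": ["1,000", "2,000", "3,000"],
})
vac = pd.DataFrame({
    "country": ["A", "A", "B", "D"],
    "people_fully_vaccinated": [50.0, 120.0, 300.0, 10.0],
})

# Keep the last reported row per country
vac_last = vac.drop_duplicates("country", keep="last")

# An inner join keeps only the countries that appear in both tables
merged = pd.merge(
    pop[["country", "2021_last_updated"]],
    vac_last[["country", "people_fully_vaccinated"]],
    on="country",
    how="inner",
)
print(merged)
```

Because the join matches rows by country name, it does not depend on both tables being sorted in the same order.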


Great, so basically we now have the dataframe we wanted: the combination of the country_vaccinations and world_population dataframes, with no unwanted countries. Let's check once more whether the new dataframe is really clean.

In [None]:
cbn.isna().sum()

As we can see, there are 14 null values in the people_fully_vaccinated column, so I dropped those rows.

In [None]:
# Drop rows with missing values and rebuild a clean 0..n-1 index
cbn = cbn.dropna()
cbn = cbn.reset_index(drop=True)
cbn

Another important step when cleaning data is to check whether the target columns have the right data type. In the later calculations, I work with the values from 2021_last_updated and people_fully_vaccinated, so I had to make sure that both were numeric.

In [None]:
# Check data types
cbn.info()

The data in the 2021_last_updated column was not in the type I wanted (text containing commas rather than numbers), so I reformatted it.

In [None]:
# Reformat data type: strip thousands separators, then convert to float
cbn.replace(',','', regex=True, inplace=True)
cbn['2021_last_updated'] = cbn['2021_last_updated'].astype(float)
cbn['people_fully_vaccinated'] = cbn['people_fully_vaccinated'].astype(float)
cbn
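The conversion step can be seen in isolation on a single column. The sketch below uses made-up strings (an assumption for illustration) and limits the comma removal to one text column instead of the whole dataframe.

```python
import pandas as pd

# Toy column (hypothetical values) with thousands separators stored as text
s = pd.Series(["1,439,323,776", "33,691", "7,200"])

# Remove the commas as plain text, then cast to float
cleaned = s.str.replace(",", "", regex=False).astype(float)
print(cleaned.dtype)
print(cleaned.tolist())
```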

So far so good. Next, I checked the data once more for implausible values. In this case, I looked for any country whose number of fully vaccinated people was higher than its last updated population.

In [None]:
# Check weird data
cbn[cbn['2021_last_updated'] < cbn['people_fully_vaccinated']]

Because I was using two independent datasets, the figures may not match for some countries, Gibraltar for example. All I had to do was remove this country.

In [None]:
# Remove rows where the fully vaccinated count exceeds the population
cbn = cbn[cbn['2021_last_updated'] >= cbn['people_fully_vaccinated']]


Well, we are pretty much done with our dataframes!

Let's get started with our EDA!

**III. Exploratory Data Analysis:**

First, let's analyze the progress of vaccinations across countries. Here, we have the following questions to answer:
1. Which 5 countries have the highest and the lowest vaccination totals?
2. Which countries stand out in their vaccination progress?
3. What is the global average of vaccinations by month?
4. Which vaccines are the most and the least used?
5. Which 5 countries have the highest and the lowest share of fully vaccinated people in their population?


**3.1.1. What are the top 5 countries by vaccination progress?**

Here, I tried to find the 5 countries with the highest number of vaccinations. Since the values in the total_vaccinations column are cumulative, I used the max function instead of sum to take the latest value. You can also use the sum function, but in that case the column to work with is daily_vaccinations.

In [None]:
#Five Highest Countries
total = vcn.groupby('country')['total_vaccinations'].max().reset_index()
fhc = total.sort_values('total_vaccinations', ascending=False).head(5)
pd.set_option('display.float_format', lambda x: '%.0f'% x)
fhc
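As a quick check of the max-versus-sum remark above, here is a small sketch on made-up numbers (the toy frame is an assumption). When the cumulative column is an exact running sum of the daily column, both approaches give the same total. In real data the two columns may be reported or smoothed differently, so the results are not guaranteed to match exactly.

```python
import pandas as pd

# Toy data (hypothetical values): daily doses for a single country
toy = pd.DataFrame({
    "country": ["A", "A", "A"],
    "daily_vaccinations": [100.0, 150.0, 50.0],
})

# Build the cumulative column as a running sum of the daily values
toy["total_vaccinations"] = toy.groupby("country")["daily_vaccinations"].cumsum()

# Latest cumulative value versus the sum of the daily values
latest_total = toy.groupby("country")["total_vaccinations"].max()
summed_daily = toy.groupby("country")["daily_vaccinations"].sum()
print(latest_total["A"], summed_daily["A"])
```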

Here, you can see that China, where the first COVID-19 cases were recorded, has the highest number of total vaccinations: nearly 1.76 billion doses have been administered. This is reasonable because China has faced the pandemic for nearly 2 years and its population is the largest in the world. The next 3 positions go to India, the United States and Brazil. This is expected because these 3 countries have the highest numbers of cases.

In [None]:
fig = px.bar(fhc, 
             x='country', 
             y='total_vaccinations',
             labels = {'country' : 'Country', 'total_vaccinations' : 'Total Vaccinations'},
             title = "Top 5 Countries With Biggest Vaccinations Progress"
            )
fig.show()

Thanks to the graph, it is obvious that China dominates in the number of vaccine doses used.

**3.1.2. What are the bottom 5 countries by vaccination progress?**

Similarly, I looked for the 5 countries with the least vaccination progress. Here I keep the same descending sort and take the last 5 rows with tail instead of the first 5 with head.


In [None]:
# Top 5 smallest countries with vaccinations progress
flc = total.sort_values('total_vaccinations', ascending=False).tail(5)
flc

It is surprising to see a country with only 83 total vaccinations. Searching for it, I found that Pitcairn is a group of islands whose sovereign state is the United Kingdom, and its 2021 population is estimated at only about 50 residents.


In [None]:
fig = px.bar(flc, 
             x='country', 
             y='total_vaccinations',
             labels = {'country' : 'Country', 'total_vaccinations' : 'Total Vaccinations'},
             title = "Top 5 Countries With Lowest Vaccinations Progress"
            )
fig.show()

**3.2. Which countries have outstanding vaccination progress?**

The gap between the top and the bottom countries raises the question of how many countries are outliers relative to the rest. Let's answer this question by looking for both types of outliers: unusually high and unusually low totals.

In [None]:
# Compute the IQR fences used to flag outliers in total vaccinations
pd.set_option('display.float_format', lambda x: '%.0f'% x)
tvc = total.sort_values('total_vaccinations', ascending=False)
Q1 = tvc['total_vaccinations'].quantile(0.25)
Q3 = tvc['total_vaccinations'].quantile(0.75)
IQR = Q3 - Q1
Upper = Q3 + 1.5*IQR
Lower = Q1 - 1.5*IQR

This code helps us find the interquartile range (IQR) and the fences derived from it. Now let's look for countries whose vaccination totals fall outside those fences.

In [None]:
tvc[(Lower > tvc['total_vaccinations']) | (tvc['total_vaccinations'] > Upper)].reset_index()
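The IQR rule used above can be wrapped in a small helper to make the logic easier to follow. This is a sketch on made-up numbers: the function name `iqr_bounds` and the toy values are assumptions for illustration.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k*IQR rule for a pandas Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy totals (hypothetical values) with one very large entry
toy = pd.Series([10, 12, 14, 16, 18, 20, 500])
lower, upper = iqr_bounds(toy)

# Values outside the fences are flagged as outliers
outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper)
print(outliers.tolist())
```

With non-negative counts such as dose totals, the lower fence can fall below zero, in which case only high values can be flagged.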

This table shows that there are 29 outstanding countries. A possible explanation is that almost all of the countries in this table have faced a severe outbreak and many of them are developed countries, so they lead in vaccination progress. However, there is also a huge difference between the top country (China) and the bottom country (Thailand): while China has used more than 1.6 billion doses, Thailand has used only 15.9 million.

A violin plot provides a clearer view of the distribution.

In [None]:
plt.figure(figsize=(20,10))
plt.suptitle("Distribution of Global Vaccinations Progress")
# inner='box' draws a small box plot inside the violin, marking the median and quartiles
sns.violinplot(data = total,
               x = 'total_vaccinations',
               inner = 'box')
plt.show()

This violin plot reveals that the distribution is uneven and that the differences in vaccination progress among countries are huge.

**3.3. What is the global average of vaccinations by month?**

Now let's check the progress of vaccinations over the months to see whether it is increasing. Here, I calculated the mean of the global daily vaccinations for each month to get an overview.

In [None]:
# Find the global average of daily vaccinations by month
vcn['date'] = pd.to_datetime(vcn['date'])
# Group by month name, then sort by the mean value (ascending)
avg = vcn.groupby(vcn['date'].dt.strftime('%B'))['daily_vaccinations'].mean().sort_values().reset_index()
avg


If you look closely, you will see that July comes before June, because the rows are sorted by value rather than by date. Therefore, I swapped the positions of these two months.

In [None]:
# Swap rows 6 and 7 so that June precedes July (positions are specific to this data snapshot)
avg = avg.reindex([0, 1, 2, 3, 4, 5, 7, 6, 8])
avg

Good!
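The reordering above relies on hard-coded row positions. A sketch of an alternative that keeps the months in chronological order without manual reindexing is to group by a year-month period instead of the month name. The toy frame below uses made-up values, and the `month` label column is a hypothetical addition.

```python
import pandas as pd

# Toy daily records (hypothetical values) spanning a year boundary
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-12-30", "2021-01-15", "2021-06-10", "2021-07-05", "2021-07-20"]
    ),
    "daily_vaccinations": [10.0, 30.0, 90.0, 60.0, 80.0],
})

# Group by year-month period; the groups come out in chronological order
monthly = (
    toy.groupby(toy["date"].dt.to_period("M"))["daily_vaccinations"]
    .mean()
    .reset_index()
)

# Convert the period to a readable label only after the order is fixed
monthly["month"] = monthly["date"].dt.strftime("%B %Y")
print(monthly)
```

Keeping the year in the key also avoids mixing months from different years once the data covers more than twelve months.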

In [None]:
# Line plot to see the full progress
fig = px.line(avg, 
             x='date', 
             y='daily_vaccinations',
             labels = {'daily_vaccinations' : 'Average Daily Vaccinations', 'date' : 'Month'},
             title = "Global Average Daily Vaccinations by Month"
            )
fig.show()

It is obvious that the world is doing well, since the line rises steadily apart from a slight drop between June and July. This may be due to the spread of the Delta variant at that time, which could have slowed the progress. Note that there is a sudden climb from July to August. One explanation is that the dataset does not yet contain enough data for August, so the August value is not reliable.

That is enough for a basic view of the current status. Now, let's take a quick look at the popularity of the available vaccines.

**3.4. Which vaccines are the most and the least commonly used?**

Let's explore the popularity of the vaccines by summing up their totals. Like the total_vaccinations column in the country_vaccinations dataset, the values in total_vaccinations in the country_vaccinations_by_manufacturer dataset are cumulative. Therefore, I used the max function to take the latest value of each vaccine in each country.

In [None]:
# Call out the number of total vaccines used by country and vaccine types
vpc = vbm.groupby(['vaccine', 'location'])['total_vaccinations'].max().reset_index()
vpc


In [None]:
# Rank the popularity of vaccines
vr = vpc.groupby('vaccine')['total_vaccinations'].sum().reset_index()
vr = vr.sort_values('total_vaccinations', ascending=False)
vr

In [None]:
# Pieplot
fig = px.pie(vr, values='total_vaccinations', names='vaccine', title='Vaccines Occupancy' )
fig.show()

This pie chart illustrates the real domination of Pfizer/BioNTech. What a vaccine!

In [None]:
# Barplot
fig = px.bar(vr, 
             y='total_vaccinations', 
             x='vaccine',
             labels = {'vaccine' : 'Vaccines', 'total_vaccinations' : 'Total Vaccinations'},
             title = "Overview of Vaccines"
            )
fig.update_yaxes(type="log")
fig.show()

The dataset covers only a few dozen countries. However, that is enough to see that Pfizer/BioNTech and Moderna are the leaders in this race, while the two Chinese-made vaccines (CanSino, Sinopharm/Beijing) look unpopular in this sample.

Since one country may use several types of vaccines, I decided to make a table showing the total number of doses of each vaccine that each country has used so far.

In [None]:
# Create pivot table
pvt = pd.pivot_table(data = vbm, index = ['location'], columns = ['vaccine'], values = 'total_vaccinations', aggfunc = 'max')
pvt.fillna(0) # Fill null values by 0 to have a better view of the table

It can be seen that Pfizer/BioNTech is used by all of the listed countries, and its biggest consumer is undoubtedly the United States, where Pfizer is based, with more than 193.5 million doses used. Meanwhile, CanSino and Sinopharm/Beijing are each used by only one country, Chile and Hungary respectively.

**3.5. World vaccinations current state:**


To do this, we first need to calculate the ratio of fully vaccinated people to population. As I have already cleaned the data at the beginning of the notebook, we now just need to divide the two columns.

In [None]:
# Find percentage
pd.reset_option('display.float_format') # Restore the default float display
pd.set_option('display.float_format', lambda x: '%.2f' % x) # Show two digits after the decimal point
cbn['percentage'] = ((cbn['people_fully_vaccinated'])/(cbn['2021_last_updated']))*100 # Share of fully vaccinated people, in %
cbn

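The reason I used reset_option is that earlier cells set the display format to show floats without decimals, which would print the percentages rounded to whole numbers. Resetting the option and then choosing a two-decimal format makes the values readable. So when you do something similar, remember to check the display format. Note that this setting changes only how numbers are printed, not the stored values.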
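To see what the display option does and does not change, here is a small sketch on made-up percentages (the toy frame is an assumption). It uses pd.option_context so that the setting applies only inside each block.

```python
import pandas as pd

# Toy percentages (hypothetical values)
toy = pd.DataFrame({"percentage": [12.3456, 67.891]})

# With a zero-decimal display format, the printed text is rounded
with pd.option_context("display.float_format", lambda x: "%.0f" % x):
    rounded_text = toy.to_string()

# With a two-decimal format, more digits are shown
with pd.option_context("display.float_format", lambda x: "%.2f" % x):
    detailed_text = toy.to_string()

print(rounded_text)
print(detailed_text)
# The stored value is the same in both cases
print(toy["percentage"].iloc[0])
```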

It is totally good now. Let's analyze!

**3.5.1. Rate of world population that is fully vaccinated:**

In [None]:
(cbn['people_fully_vaccinated'].sum()/cbn['2021_last_updated'].sum())*100

As we can see, although vaccination has been progressing continuously, the percentage of fully vaccinated people in the world is only about 20%. This means the world still has a lot of work to do, and reaching community (herd) immunity will take more time.
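The overall rate above divides total vaccinated people by total population, which is not the same as averaging the per-country percentages. The sketch below shows the difference on two made-up countries (the values and the `population` column name are assumptions).

```python
import pandas as pd

# Toy countries (hypothetical values): one large, one small
toy = pd.DataFrame({
    "population": [1_000_000.0, 10_000.0],
    "people_fully_vaccinated": [100_000.0, 8_000.0],
})
toy["percentage"] = toy["people_fully_vaccinated"] / toy["population"] * 100

# Overall rate: total vaccinated people over total population
overall = toy["people_fully_vaccinated"].sum() / toy["population"].sum() * 100

# Unweighted mean of the per-country percentages treats both countries equally
mean_of_rates = toy["percentage"].mean()
print(round(overall, 2), round(mean_of_rates, 2))
```

The sum-over-sum form weights each country by its population, which is the appropriate choice when the question is about the share of people rather than the share of countries.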

**3.5.2. Top 5 countries with highest vaccinations rate per population:**

In [None]:
tfh = cbn.sort_values('percentage', ascending=False).head(5)
tfh

**3.5.3. Top 5 countries with lowest vaccinations rate per population:**

In [None]:
cbn.sort_values('percentage').head(5)

From these two tables, it seems that countries with small populations tend to have high vaccination rates, while countries with large populations show the opposite. To test this hypothesis, I used a scatter plot of the two variables to see whether they are correlated.

In [None]:
# Scatterplot
fig = px.scatter(cbn, 
                y="2021_last_updated", 
                x="percentage",
                labels = {'2021_last_updated' : 'Population', 'percentage' : 'Percentage'},
                title = "Scatterplot"
               )
fig.show()

Here is the scatter plot we are looking for, but it is hard to read: because of the outliers, the range of the population axis is too wide. Therefore, I dropped the outliers to get a clearer view.

In [None]:
# Find the interquartile range and select the non-outlier values
pd.set_option('display.float_format', lambda x: '%.0f'% x)
cbn = cbn.sort_values('2021_last_updated', ascending=False)
qt1 = cbn['2021_last_updated'].quantile(0.25) # First quartile
qt3 = cbn['2021_last_updated'].quantile(0.75) # Third quartile
IQR_cbn = qt3 - qt1 # Interquartile range
Upper_cbn = qt3 + 1.5*IQR_cbn # Upper fence
Lower_cbn = qt1 - 1.5*IQR_cbn # Lower fence
cbn_sort = cbn[(Lower_cbn < cbn['2021_last_updated']) & (cbn['2021_last_updated'] < Upper_cbn)].reset_index()

In [None]:
# Scatter plot for normal values
fig = px.scatter(cbn_sort, 
                y="2021_last_updated", 
                x="percentage",
                labels = {'2021_last_updated' : 'Population', 'percentage' : 'Percentage'},
                title = "Scatterplot excluding outliers"
               )
fig.show()

In [None]:
# Pearson correlation test for correlation coefficient
pop = np.array(cbn_sort['2021_last_updated'])
per = np.array(cbn_sort['percentage'])
pearsonr(pop, per)

From the scatter plot, there is no clear relationship between population and vaccination rate. Based on the test, the correlation coefficient is negative, and its absolute value is only 0.208. We can conclude that the linear relationship between these two variables is weak, which means that a small population does not by itself explain a high percentage of vaccinations.

**If you are reading this sentence, I hope you now have a clearer view of how a vaccination analysis is made. Thank you for spending your time with me!**
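When reading the result of pearsonr, it helps to look at both returned values: the coefficient and the p-value. The sketch below uses synthetic data (an assumption for illustration, unrelated to the notebook's datasets) in which y depends only weakly on x.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data (assumption): y has a weak negative dependence on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = -0.2 * x + rng.normal(size=200)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```

The coefficient measures only linear association, so a small absolute value indicates a weak linear relationship and does not rule out other kinds of dependence.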