# COVID-19 Global Vaccination Data Analysis
*an Exploratory Data Analysis by Tariq Hussain*

# Introduction

The coronavirus (COVID-19) pandemic shocked the world in early 2020 and brought much of it to a halt. Lives changed as governments, advised by scientists, ordered people to stay at home and alter the way they lived in order to slow the spread of the disease. Caused by the virus *SARS-CoV-2*, COVID-19 had, as of 30/03/2021, taken the lives of more than 2.7 million people and infected roughly 127.8 million worldwide.

This Exploratory Data Analysis highlights the battle between the virus and humanity, a battle led by science and the arrival of vaccines. With many vaccination programmes having started in late 2020, the world is slowly becoming more protected day by day.

This EDA draws on three existing datasets from two different providers: one containing data on vaccination progress; one containing data on COVID-19 related deaths, cases, recoveries and so forth; and one containing population data, which will be used to put statistics such as mortality, recovery and vaccination rates into perspective.

(On the subject of datasets, I would like to take this opportunity to personally thank Gabriel Preda and Juan Carlos Santiago Culebras for their excellent datasets. This EDA would not exist without their work.)

# Preparatory work

In [None]:
# Preparing the notebook and loading the relevant libraries

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt # library for plotting data
%matplotlib inline
import seaborn as sns # library for data visualisation

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Loading the relevant datasets

In [None]:
# Loading and reading the relevant datasets using pandas.

covid_vaccine_prog_file_path = '../input/covid-world-vaccination-progress/country_vaccinations.csv'
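# Note: parse_dates=True only attempts to parse the index, so it is omitted here;
# 'date' stays a string, like the 'Date' column of the other dataset, which is used as a join key later on.
covid_vaccine_data = pd.read_csv(covid_vaccine_prog_file_path)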

covid_19_by_country_file_path = '../input/covid19-by-country-with-government-response/covid19_by_country.csv'
covid_19_by_country_data = pd.read_csv(covid_19_by_country_file_path)

country_population_file_path = '../input/covid19-by-country-with-government-response/covid19_country_population.csv'
covid_country_pop_data = pd.read_csv(country_population_file_path)

As part of data preparation and pre-processing, we need to inspect the columns of each dataset for discrepancies or irregularities. Here, pandas retrieves the column names of each dataset as a list, and Python prints them.

In [None]:
print('Columns in covid_vaccine_data:')
print(covid_vaccine_data.columns.to_list(), '\n')
print('Columns in covid_19_by_country_data:')
print(covid_19_by_country_data.columns.to_list(), '\n')
print('Columns in covid_country_pop_data:')
print(covid_country_pop_data.columns.to_list(),'\n')
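
The column lists can also be compared programmatically rather than by eye. The following is a minimal sketch on two toy frames; the frame names and columns are made up for illustration and only stand in for the real datasets.

```python
import pandas as pd

# Toy frames with made-up columns, standing in for the real datasets
vaccines = pd.DataFrame(columns=['country', 'iso_code', 'date', 'total_vaccinations'])
cases = pd.DataFrame(columns=['Country', 'CountryAlpha3Code', 'Date', 'deaths'])

# Compare names case-insensitively to surface near-duplicates such as 'Country' vs 'country'
vaccine_cols = {c.lower() for c in vaccines.columns}
case_cols = {c.lower() for c in cases.columns}

shared_cols = sorted(vaccine_cols & case_cols)      # same name apart from capitalisation
only_in_cases = sorted(case_cols - vaccine_cols)    # candidates for renaming or dropping

print('Shared (ignoring case):', shared_cols)
print('Only in cases:', only_in_cases)
```

A check like this will not catch columns that hold the same data under entirely different names (such as the ISO code columns), so reading the lists is still worthwhile.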

*covid_country_pop_data* only has two columns, which suggests a small dataset, so let's take a closer look by displaying it in full before moving on:

In [None]:
covid_country_pop_data

# Reading, cleaning & pre-processing the data

In the lists of column names above, we can see that the same kind of data is often labelled differently across datasets. For example, every dataset has a column for each country's ISO code, but *covid_country_pop_data* and *covid_19_by_country_data* call it 'CountryAlpha3Code', whilst *covid_vaccine_data* calls it 'iso_code'.

Let's rectify this by renaming the columns with pandas, so that matching columns are easy to identify later in the process:

In [None]:
covid_19_by_country_data_renamed = covid_19_by_country_data.rename(columns={'CountryAlpha3Code': 'iso_code', 'Country': 'country', 'Date': 'date'})

In [None]:
covid_country_pop_data_renamed = covid_country_pop_data.rename(columns={'CountryAlpha3Code': 'iso_code'})

Since these two datasets came from the same source (and most likely, the same author(s)), let's merge them using the *DataFrame.join* method, with the newly renamed 'iso_code' column as the key (set as the index of the population dataset), and then confirm that the join was successful.

In [None]:
covid_country_data = covid_19_by_country_data_renamed.join(covid_country_pop_data_renamed.set_index('iso_code'), on='iso_code', how='left')

# Displaying the new dataset's columns to ensure that the join was successful and that there was no overlapping.
covid_country_data.columns

In [None]:
covid_country_data.head()
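
Beyond looking at the columns, a left join can also be checked for rows that found no match. The sketch below repeats the same kind of join on toy data; the values and the 'XXX' code are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values); 'XXX' has no entry in the population table
cases = pd.DataFrame({
    'iso_code': ['GBR', 'FRA', 'XXX'],
    'deaths': [100, 80, 5],
})
population = pd.DataFrame({
    'iso_code': ['GBR', 'FRA'],
    'Population': [67_000_000, 65_000_000],
})

# Left join: look up each row's iso_code in the index of the population table
joined = cases.join(population.set_index('iso_code'), on='iso_code', how='left')

# Rows without a population match end up with NaN in 'Population'
unmatched = joined.loc[joined['Population'].isna(), 'iso_code'].tolist()
print(unmatched)
```

Because the join is a left join, every row of the left-hand frame is kept, so unmatched rows show up as missing values rather than disappearing.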

Now, let's turn our attention to the 'country' column of each dataset. This is another part of data pre-processing, where we check for discrepancies or other anomalies that could hinder the analysis.

In [None]:
print(covid_country_data.country.unique().tolist())

In [None]:
print(covid_vaccine_data.country.unique().tolist())

A quick CTRL+F (and a scan by eye) shows that there are some mismatches between the two columns, primarily due to differences in string formatting or naming conventions, which cause some countries' names to appear differently across datasets.

In [None]:
# The code below locates and replaces the mismatches with the correct (or more common) names of countries

covid_country_data.country = covid_country_data.country.replace({
    'Czechia': 'Czech Republic',
    'Korea, South': 'South Korea',
    'US': 'United States',
})

Now the mismatches have been renamed, let's check for any countries that appear in *covid_country_data* but not in *covid_vaccine_data*. It would be unnecessary to keep country data for countries that don't have (or haven't recorded yet) any vaccine data.

In [None]:
country_differences = [x for x in covid_country_data.country.unique() if x not in covid_vaccine_data.country.unique()]
print(country_differences)
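
The same comparison can be written with Python sets, which also makes it easy to check the opposite direction (countries with vaccine data but no country data). This is a small sketch on made-up name lists, not the real columns.

```python
import pandas as pd

# Made-up country names standing in for the two 'country' columns
country_data_names = pd.Series(['Czechia', 'US', 'France', 'Fiji'])
vaccine_data_names = pd.Series(['Czech Republic', 'United States', 'France'])

# Harmonise spellings first, using a dictionary of replacements
name_fixes = {'Czechia': 'Czech Republic', 'US': 'United States'}
country_data_names = country_data_names.replace(name_fixes)

# Set differences in both directions
missing_from_vaccines = sorted(set(country_data_names) - set(vaccine_data_names))
missing_from_countries = sorted(set(vaccine_data_names) - set(country_data_names))

print(missing_from_vaccines)
print(missing_from_countries)
```

An empty second list would indicate that every country in the vaccine data has a counterpart in the country data after renaming.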

A large number of countries have been identified. This output shows the countries for which we have population, death, infection and recovery data, but no vaccine data. To avoid processing unnecessary data, and to limit the null values present during further processing, let's drop them.

In [None]:
# This code tells pandas to find the rows whose country appears in country_differences and exclude (drop) them from covid_country_data,
# as we won't need them if they don't have any recorded vaccination data.

covid_country_data = covid_country_data[~covid_country_data.country.isin(country_differences)]
print(covid_country_data.country.unique().tolist())

Once we have verified that all of the countries previously identified have been removed, we can merge the two datasets ('covid_vaccine_data' and 'covid_country_data') using DataFrame.join.

In [None]:
# Combining the two datasets using DataFrame.join, with the three common columns set as the index of both.

df_1 = covid_vaccine_data.set_index(['country', 'iso_code', 'date'])
df_2 = covid_country_data.set_index(['country', 'iso_code', 'date'])
covid_vaccine_data = df_1.join(df_2)
covid_vaccine_data.head()

(There appears to be a substantial number of missing values (i.e. NaN) in this dataset, most likely from dates when vaccinations either hadn't started yet or weren't recorded, with the former being more probable. We will deal with missing values as they arise in each analysis.)

In [None]:
# Ensuring no columns were dropped during the merge

print(covid_vaccine_data.columns)

In [None]:
# Double checking the bottom end of the dataset to check the data.

covid_vaccine_data.tail()

From the output above, we can see that the dataset has a separate entry every time a country's vaccination data is updated. Whilst it's good to see progress over time, it wouldn't be desirable to process and analyse what are essentially repeated entries for each country. So, in order to focus only on the most up-to-date vaccination data from each country, let's filter out the 'older' entries.

In [None]:
# This code resets the index (making it easier to work with), groups the data by country and then takes the maximum of each column within each group.
# For cumulative columns (e.g. total_vaccinations) the maximum should coincide with the most recent entry; for non-cumulative columns
# (e.g. daily_vaccinations) it is the peak value on record rather than the latest one.
# The output is then displayed to verify that each country has only one entry, and that there are no duplicates.

covid_vaccine_data_f = covid_vaccine_data.reset_index().groupby('country').max()
with pd.option_context('display.max_columns', 5):
    display(covid_vaccine_data_f.head())
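
To illustrate the difference between taking the per-column maximum and keeping the most recent row, here is a small sketch on made-up data. It is an alternative shown for comparison, not the method used in the rest of this notebook, and the numbers are invented.

```python
import pandas as pd

# Made-up history for two countries; dates are ISO strings, so they sort chronologically
history = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'date': ['2021-03-01', '2021-03-02', '2021-03-03', '2021-03-01', '2021-03-02'],
    'total_vaccinations': [100, 250, 380, 50, 90],   # cumulative
    'daily_vaccinations': [100, 150, 130, 50, 40],   # not cumulative
})

# Per-column maximum within each country
column_max = history.groupby('country').max()

# Most recent row per country: sort by date, then keep the last row of each group
latest = (
    history.sort_values('date')
           .groupby('country')
           .tail(1)
           .set_index('country')
)

print(column_max)
print(latest)
```

For the cumulative column both approaches agree, whereas for the daily column the maximum reports the peak day and the latest row reports the most recent day.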

# Start of Exploratory Data Analysis

# 1. What vaccines are used in which countries?

There are many pharmaceutical companies worldwide, and a number of them are responsible for the vaccines now distributed across the globe. Some vaccines are still being developed whilst others are being rolled out, and others are awaiting approval. Many countries, such as the United Kingdom, have opted to use several vaccines from multiple providers to secure enough doses for the population.

But which vaccines are used where? 

In [None]:
df = covid_vaccine_data_f.reset_index()
v_df = df[['country', 'vaccines']]
vacc_df = v_df.groupby(['vaccines', 'country']).max().sort_values(by='vaccines', ascending=False)

with pd.option_context(
    'display.max_rows', None,
    'display.expand_frame_repr', True,
):
    display(vacc_df)

This table illustrates the vast array of vaccines being used today, with many being distributed worldwide and others only being used in certain territories or on certain continents. The main takeaways from this table are:
* The **Oxford/AstraZeneca** and **Pfizer/BioNTech** vaccines are the most widely used across the globe.
* The use of many vaccines appears to be proximity-based, in that certain vaccines seem to cover clusters of neighbouring countries (e.g. the **Oxford/AstraZeneca** vaccine, by itself, covers most of the African continent).
* The **Oxford/AstraZeneca** vaccine is more widely used in lower-income countries. This makes sense, as the vaccine is sold at cost (i.e. without profit), which makes it more affordable for countries with a lower GDP.
* The **Covaxin** vaccine is currently only being rolled out in India, which suggests that it hasn't been approved in any other territory yet.
* According to this table, *China* (**Sinopharm/Beijing, Sinopharm/Wuhan, Sinovac**) and *Russia* (**EpiVacCorona, Sputnik V**) are the only countries that aren't currently utilising vaccines developed overseas.
* According to this table, the *United Arab Emirates* and *Hungary* currently have the most vaccines being rolled out, each originating from a total of five different providers.

# 2. Which countries are the most vaccinated?

Vaccines for COVID-19 were in late-stage clinical trials by mid-to-late 2020, were officially rolled out from December 2020, and have been introduced in more and more countries since.

The question is, regardless of population, which countries have inoculated the most people?

In [None]:
# Grouping the data by country and then sorting the data by the max value from highest to lowest

top_total_vaccinations = covid_vaccine_data_f.groupby('country').people_vaccinated.max().sort_values(ascending=False)
total_vaccinations_f = top_total_vaccinations.dropna(axis=0).reset_index() # Dropping NaN values

display(total_vaccinations_f[0:5])

# Plotting a barplot for data viz using matplotlib and seaborn

top_5 = total_vaccinations_f.head(5)  # slice x and y together so they stay aligned

plt.figure(figsize=(15, 9))
plt.title('Top five most vaccinated countries in the world (by number of people vaccinated)')
sns.barplot(data=top_5, x='country', y='people_vaccinated')
plt.xlabel(' ')
plt.ylabel('Number of people vaccinated');

Those numbers look promising, but note that the top five countries also happen to be the most populous, and the ratio of those vaccinated to those that are not really matters when considering the rate of transmission, emergence of new variants, and, consequently, a third wave of COVID-19 cases. This leads nicely into the next question...

# 3. Most vaccinated countries per percentage of the population?

So, which countries are the most vaccinated once the population of each country is taken into account (i.e. the percentage that are vaccinated)?

In [None]:
# Creating a new column called 'percentage_vaccinated', which stores total_vaccinations relative to the population as a float.
# Note: total_vaccinations counts doses administered, so this is doses per 100 people and can exceed 100.

covid_vaccine_data_f['percentage_vaccinated'] = covid_vaccine_data_f.total_vaccinations / covid_vaccine_data_f.Population * 100
vaccine_percentage = covid_vaccine_data_f.percentage_vaccinated.dropna().sort_values(ascending=False).reset_index()

display(vaccine_percentage)

# Plotting a barplot using matplotlib and seaborn
top_10_pct = vaccine_percentage.head(10)  # slice x and y together so they stay aligned

with sns.plotting_context('notebook', font_scale = 1.1):
    plt.figure(figsize=(20, 10))
    plt.title('The top 10 most vaccinated countries relative to population\n (total doses administered per 100 people)')
    sns.barplot(data=top_10_pct, x='country', y='percentage_vaccinated')
    plt.xlabel(' ')
    plt.ylabel('Doses administered per 100 people');

That's strange: how have Israel and the Seychelles supposedly vaccinated *over 100%* of their populations? The explanation lies in the data: *total_vaccinations* counts doses administered rather than people, so anyone who has had a second dose is counted twice. The figures above are therefore better read as doses per 100 people, and a value above 100 does not mean that the entire population has been vaccinated; the *people_vaccinated* column is the better basis for that claim.
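
One way to see how a figure above 100% can arise is to compare doses with people directly. The sketch below uses made-up numbers and hypothetical country names; it assumes, as in the vaccination dataset, that *total_vaccinations* counts doses and *people_vaccinated* counts people with at least one dose.

```python
import pandas as pd

# Made-up figures for two hypothetical countries
toy = pd.DataFrame({
    'country': ['Smallland', 'Bigland'],
    'Population': [1_000, 1_000_000],
    'total_vaccinations': [1_100, 300_000],   # doses administered
    'people_vaccinated': [600, 250_000],      # people with at least one dose
}).set_index('country')

# Doses per 100 people: can exceed 100 once second doses are given
toy['doses_per_100'] = toy['total_vaccinations'] / toy['Population'] * 100

# Share of the population with at least one dose: cannot exceed 100
toy['pct_at_least_one_dose'] = toy['people_vaccinated'] / toy['Population'] * 100

print(toy[['doses_per_100', 'pct_at_least_one_dose']])
```

In this toy example Smallland has administered more doses than it has inhabitants, yet only part of its population has received a first dose.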

# 3. Which countries have the most effective vaccination programme?

Many countries are in the midst of their vaccination programmes at this moment in time. Whilst some countries have not started their roll-out yet, others are well on their way to inoculating their populations. But which countries have the best and most effective vaccination roll-out, as measured by daily vaccinations?

In [None]:
top_daily = covid_vaccine_data_f.groupby('country').daily_vaccinations.max().sort_values(ascending=False)
top_daily_f = top_daily.dropna(axis='rows').reset_index()

display(top_daily_f)

# Plotting the data
with sns.plotting_context('notebook', font_scale = 1.25):
    plt.figure(figsize=(20, 10))
    plt.title('The top 10 best performing countries in daily vaccinations')
    sns.barplot(x=top_daily_f.country[0:10], y=top_daily_f.daily_vaccinations[0:10])
    plt.xlabel(' ')
    plt.ylabel('Number of daily vaccinations (in millions)');

We can see a pattern emerging among the best-performing countries: those with a bigger population, more infrastructure and resources, and a bigger vaccine supply appear to be vaccinating at speed, and thus becoming more vaccinated as the days go by. However, GDP does not appear to be the deciding factor here, as many of the countries in the top ten are developing countries, which is also interesting.

**Plotting this data to see the progression in daily vaccinations over time**
*(code is hidden below)*

In [None]:
# Establishing which columns are relevant and preparing the dataset
cols = ['country', 'date', 'daily_vaccinations']
covid_df = covid_vaccine_data.reset_index()

# Creating a list of the aforementioned best performing countries in terms of daily vaccinations

top_10_countries = [
    'China',
    'United States',
    'India',
    'United Kingdom',
    'Brazil',
    'England',
    'Indonesia',
    'Turkey',
    'Russia',
    'Mexico',
]

# Filtering and sorting the data (using the list of relevant columns defined above)

daily_vacc_df = covid_df[cols].sort_values(by='daily_vaccinations', ascending=True)

top_10_dv_df = daily_vacc_df[daily_vacc_df.country.isin(top_10_countries)].dropna()

display(top_10_dv_df) # verifying that the filtering executed correctly

# Plotting the data in a multi faceted lineplot

g = sns.relplot(data=top_10_dv_df, x='date', y='daily_vaccinations', hue='country', col='country', kind='line', col_wrap=2)
g.set(xticks=[]);  # hiding the crowded date ticks on every facet


China has increased its vaccination rate sharply in recent days, particularly in comparison to other countries in the top ten, and the United States isn't far behind. India also saw big increases in its daily vaccinations, whilst the UK/England is vaccinating its population slowly but steadily, similarly to Russia.

We can also see that the lines for India, Turkey, Indonesia and Brazil end much earlier than those of any other country. Why is that? Are there discrepancies in the data (i.e. missing data)? Did they not vaccinate anyone over a number of days? Have their supplies run out? With the spread of the virus possibly slowing vaccine production, the blocking of the Suez Canal and other obstacles, one plausible explanation is that they ran out of supply or had shipments delayed, as countries that manufacture their vaccines domestically (e.g. the US, Russia and China) don't appear to have slowed down or stopped vaccinating their populations. Missing data cannot be ruled out from this chart alone, however.

# 4. Which vaccine is the most used around the globe?

Many pharmaceutical companies across the world have contributed to the pool of vaccines available today worldwide, but which vaccines (or combination of vaccines) are used the most by each country?

In [None]:
data_reset_index = covid_vaccine_data_f.reset_index()
most_common_vaccine = data_reset_index.groupby(['vaccines'])['country'].count().sort_values(ascending=False).reset_index()
display(most_common_vaccine)

# Plotting a barplot
with sns.plotting_context('notebook', font_scale = 2):
    plt.figure(figsize=(20, 25))
    plt.title('Most commonly used vaccine (combination) worldwide')
    sns.barplot(y=most_common_vaccine.vaccines, x=most_common_vaccine.country)
    plt.ylabel(' ')
    plt.xlabel('Number of countries');

It appears that the Oxford/AstraZeneca vaccine is the most popular, being used on its own in a substantial number of countries, and alongside vaccines from other providers in many more. This is most likely because: a) the Oxford/AstraZeneca vaccine is sold at cost (i.e. without profit), making it cheaper for those who order it, and b) it is manufactured at many different sites across the globe (India and the Netherlands both have production sites, for example), making distribution across the world much easier.

However, these findings only tell us which combination is the most common; they don't quantify which single vaccine is the most common across the world. The graph could also be misleading, as an uncommon vaccine may be interpreted as common due to its place on the graph, despite being used in very few countries. It isn't very helpful for comparing the popularity of individual vaccines, so let's find out what the most common vaccine is (in isolation).

The hidden cell and output below show all of the vaccine values (unique values) found in the dataset. As you can see, many vaccines have their own value (e.g. *'Pfizer/BioNTech'*) whilst also being part of another value that lists several vaccines (e.g. *'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech'*).

In [None]:
most_common_vaccine.vaccines.unique().tolist()

**difflib** will be used for this part of the EDA. **difflib** is a Python standard-library module for comparing sequences, such as strings. Here, I will be using its *SequenceMatcher* class to find the name of each individual vaccine within the values in the dataset. You may notice a threshold in each comparison; it tells the code how lenient to be when selecting values, allowing me to find not just values that consist of the vaccine's name alone (e.g. 'Oxford/AstraZeneca'), but also values that contain that name among others (e.g. 'Oxford/AstraZeneca, Pfizer/BioNTech, Sinovac, Sputnik V').
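
To get a feel for the similarity ratio, here is a minimal sketch that scores a vaccine name against a few strings. The helper name `similarity` is made up for this illustration; `SequenceMatcher(...).ratio()` returns a float between 0 and 1, where 1 means the two strings are identical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return difflib's similarity ratio (0 to 1) between two strings."""
    return SequenceMatcher(None, a, b).ratio()

# Identical strings score 1.0
exact = similarity('Moderna', 'Moderna')

# The name appears inside a longer, multi-vaccine value: the score drops below 1
partial = similarity('Moderna', 'Moderna, Pfizer/BioNTech')

# A value that does not contain the name at all still shares some characters
unrelated = similarity('Moderna', 'Oxford/AstraZeneca')

print(exact, partial, unrelated)
```

Since unrelated values can still share enough characters to clear a low threshold, a threshold-based selection is likely to need some manual correction afterwards.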

In [None]:
import difflib 

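# Here, each vaccine is handled in turn. difflib (with help from pandas) is
# told to keep values whose similarity ratio to the vaccine's name is above a low threshold (e.g. 0.3). The low threshold lets me find
# the name of the vaccine when it isn't the sole vaccine in a country's vaccination programme.
# Because a low threshold also lets through some unrelated values, those rows are dropped by index below (rows_to_drop_*);
# the hard-coded indices depend on the row order of most_common_vaccine and need updating if the data changes.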

ox_az_df = most_common_vaccine.loc[most_common_vaccine.apply(lambda x: difflib.SequenceMatcher(None, 'Oxford/AstraZeneca', x.vaccines).ratio() > 0.3, axis=1)]
ox_az_df_sum = ox_az_df.country.sum()

pf_bnt_df = most_common_vaccine.loc[most_common_vaccine.apply(lambda x: difflib.SequenceMatcher(None, 'Pfizer/BioNTech', x.vaccines).ratio() > 0.31, axis=1)]
pf_bnt_df_sum = pf_bnt_df.country.sum()

moderna_df = most_common_vaccine.loc[most_common_vaccine.apply(lambda x: difflib.SequenceMatcher(None, 'Moderna', x.vaccines).ratio() > 0.15, axis=1)]
rows_to_drop_m = [0, 2, 3, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 27, 28]
moderna_df_dropped = moderna_df.drop(rows_to_drop_m)
moderna_sum = moderna_df_dropped.country.sum()

sputnik_df = most_common_vaccine.loc[most_common_vaccine.apply(lambda x: difflib.SequenceMatcher(None, 'Sputnik V', x.vaccines).ratio() > 0.15, axis=1)]
rows_to_drop_s = [5, 16, 17]
sputnik_df_dropped = sputnik_df.drop(rows_to_drop_s)
sputnik_sum = sputnik_df_dropped.country.sum()

sinobeijing_df = most_common_vaccine.loc[most_common_vaccine.apply(lambda x: difflib.SequenceMatcher(None, 'Beijing', x.vaccines).ratio() > 0.125, axis=1)]
rows_to_drop_sinobeijing = [0, 2, 6, 7, 8, 15, 17, 18, 19, 22, 23]
sinobeijing_df_dropped = sinobeijing_df.drop(rows_to_drop_sinobeijing)
sinobeijing_sum = sinobeijing_df_dropped.country.sum()

sinovac_df = most_common_vaccine.loc[most_common_vaccine.apply(lambda x: difflib.SequenceMatcher(None, 'Sinovac', x.vaccines).ratio() > 0.22, axis=1)]
rows_to_drop_sinovac = [2, 4, 5, 9, 10, 13, 14, 20]
sinovac_df_dropped = sinovac_df.drop(rows_to_drop_sinovac)
sinovac_sum = sinovac_df_dropped.country.sum()

wuhan_df = most_common_vaccine.loc[most_common_vaccine.apply(lambda x: difflib.SequenceMatcher(None, 'Sinopharm/Wuhan', x.vaccines).ratio() > 0.3, axis=1)]
rows_to_drop_wuhan = [5, 9, 10, 12, 14, 17, 21]
wuhan_df_dropped = wuhan_df.drop(rows_to_drop_wuhan)
wuhan_sum = wuhan_df_dropped.country.sum()

# Now all of the values have been retrieved, stored in a variable and counted up using sum(), it is time to create a dataset for these values
# Because the remaining vaccines are used in so few countries, we'll add those to the dataset manually.

new_df = pd.DataFrame({
    'Name of vaccine': ['Oxford/AstraZeneca', 
                        'Moderna', 'Pfizer/BioNTech', 
                        'Sinovac', 'Sinopharm/Beijing', 
                        'Sputnik V', 'Covaxin', 'EpiVacCorona', 
                        'Sinopharm/Wuhan', 'Johnson&Johnson'],
    'Number of countries': [ox_az_df_sum, moderna_sum, 
                            pf_bnt_df_sum, sinovac_sum,
                           sinobeijing_sum, sputnik_sum,
                           1, 1, wuhan_sum, 2]
    
})

most_used_vaccine = new_df.sort_values(by='Number of countries', ascending=False).set_index('Name of vaccine')

most_used_vaccine
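
As a cross-check on counts obtained this way, the combined values could also be split into individual vaccine names and counted directly. The following is a sketch of that alternative on made-up data, not the approach used above; it assumes the names within a value are separated by a comma and a space, as in the values listed earlier.

```python
import pandas as pd

# Made-up data: one row per country, with a comma-separated list of vaccines
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'vaccines': [
        'Oxford/AstraZeneca',
        'Moderna, Pfizer/BioNTech',
        'Oxford/AstraZeneca, Pfizer/BioNTech',
        'Sputnik V',
    ],
})

# Split each value into a list of names, then give every name its own row
per_vaccine = (
    toy.assign(vaccine=toy['vaccines'].str.split(', '))
       .explode('vaccine')
)

# Count the number of distinct countries using each individual vaccine
countries_per_vaccine = (
    per_vaccine.groupby('vaccine')['country']
               .nunique()
               .sort_values(ascending=False)
)

print(countries_per_vaccine)
```

This avoids similarity thresholds and hard-coded row indices, at the cost of relying on a consistent separator in the data.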

Looks good! For the most part, the results look pretty similar to those from the previous analysis. However, we can now analyse each vaccine on a country-by-country basis, without the counts being split across combinations with other vaccines. We can see that Oxford/AstraZeneca is still the most common vaccine, with use in 118 countries, followed by Pfizer/BioNTech (92 countries) and, after a large gap, Moderna in third place (36 countries). These three were among the earliest vaccines reported to be in development.

Let's plot this dataframe on a bar graph.

In [None]:
with sns.plotting_context('notebook', font_scale = 1.5):
    plt.figure(figsize=(20, 10))
    plt.title('Most commonly used vaccine worldwide')
    sns.barplot(y=most_used_vaccine.index, x=most_used_vaccine['Number of countries'])
    plt.ylabel(' ')
    plt.xlabel('Number of countries');

# Bonus questions and analysis

# Apart from the United Kingdom, who else uses the Oxford/AstraZeneca vaccine?

As a UK resident, it is the norm to hear about the Oxford/AstraZeneca vaccine, as it was one of the first vaccines to enter clinical trials, was developed in the UK, and was regularly mentioned by the UK Government's chief scientific and medical advisers, as well as cabinet ministers and the Prime Minister himself. Recently, the vaccine has come under fire in the European Union and its member states after claims that it causes blood clots (venous thromboembolism). Those claims were rebutted by many independent scientists, the Chief Executive of the Medicines and Healthcare products Regulatory Agency (MHRA) and the European Medicines Agency, who said there was no evidence to suggest that the vaccine increases the chance of blood clots.

Germany has since, however, suspended the vaccine's use in under-60s. On 07/04/21, following reports of blood clots in women under 60 and 13 related deaths, other European countries, such as France, Spain and Italy, responded in a similar way by only giving the Oxford/AstraZeneca vaccine to particular age groups. UK guidance issued the same day states that 18-29 year olds will be offered an alternative vaccine (e.g. Moderna or Pfizer).

Despite this, how many countries (that aren't part of the United Kingdom, including its devolved administrations) still use the Oxford/AstraZeneca vaccine?

In [None]:
# Creating a list of strings which contain the countries of the United Kingdom, as well as the UK itself.

united_kingdom = ['United Kingdom', 'England', 'Scotland', 'Wales', 'Northern Ireland']

# Excluding all values that match the aforementioned list, then keeping every country whose vaccine list
# includes Oxford/AstraZeneca (on its own or alongside other vaccines), and displaying the results

uk_filter = covid_vaccine_data_f[~covid_vaccine_data_f.index.isin(united_kingdom)]
vaccines_minus_united_kingdom = uk_filter.loc[uk_filter.vaccines.str.contains('Oxford/AstraZeneca', regex=False, na=False)]
with pd.option_context('display.width', 500, 'display.max_columns', 5, 'display.max_colwidth', 15):
    display(vaccines_minus_united_kingdom)

# Which countries are making the best recoveries (i.e. have the highest death tolls, but also the highest numbers of daily vaccinations)?

In [None]:
# Filtering out the appropriate columns from the dataset, which are 'deaths' and 'daily_vaccinations' respectively,
# sorting them, and then preparing them for joining.

covid_deaths = covid_vaccine_data_f.deaths.sort_values(ascending=False).dropna().reset_index()
covid_deaths = covid_deaths.set_index('country')

covid_daily_vacc = covid_vaccine_data_f.daily_vaccinations.sort_values(ascending=False).dropna().reset_index()
covid_daily_vacc = covid_daily_vacc.set_index('country')
covid_best_recovery = covid_deaths.join(covid_daily_vacc)

# Narrowing the dataset to the top 10 highest values, and verifying the data after the joining process.

top_10 = covid_best_recovery[0:10]
display(top_10)

# Plotting the data
with sns.plotting_context('notebook', font_scale = 1.25):
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot()
    ax2 = ax.twinx()

    sns.barplot(data=top_10, x=top_10.index, y='deaths', ax=ax, label='deaths', alpha=0.5)
    sns.barplot(data=top_10, x=top_10.index, y='daily_vaccinations', ax=ax2, label='daily_vaccinations', alpha=0.5)
    ax.set_ylim(0, 750000)
    plt.title('Top 10 fastest recovering countries\n (via daily vaccinations versus death toll)')
    plt.ylabel('Total number of daily vaccinations (in millions)');

The chart above paints a fairly positive picture of how vaccination programmes are progressing around the globe. Not only does every country's daily vaccination figure exceed its death toll, but the margins some countries have achieved between the two are substantial. The best performing in the top ten appear to be India (with a margin of 173.4%\*), the United States (138.8%\*), Brazil (56%\*) and Mexico (45.2%\*). Bear in mind that this compares each country's peak daily vaccinations with its cumulative death toll, so it is a rough indicator rather than a like-for-like rate.

\* as of 06/04/2021

# Is the recovery rate of a country related to their vaccination numbers?

We've seen the vaccination numbers for each country, and they tend to grow with population size. But what if we analyse from a different perspective, irrespective of population numbers? Are recovery rates related to the success of a vaccination programme, and does that depend on what kind of country it is economically (e.g. developed or developing)?

Let's find out:

In [None]:
recoveries = covid_vaccine_data_f.recoveries.sort_values(ascending=False).dropna().reset_index()
recoveries = recoveries.set_index('country')

total_vaccinations = covid_vaccine_data_f.total_vaccinations.sort_values(ascending=False).dropna().reset_index()
total_vaccinations = total_vaccinations.set_index('country')

recov_totvac_df = recoveries.join(total_vaccinations)

top_10_rec = recov_totvac_df[0:10]
display(top_10_rec)

# Plotting the data
with sns.plotting_context('notebook', font_scale = 1.25):
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot()
    ax2 = ax.twinx()

    sns.barplot(data=top_10_rec, x=top_10_rec.index, y='recoveries', ax=ax, label='recoveries', alpha=0.5)
    sns.barplot(data=top_10_rec, x=top_10_rec.index, y='total_vaccinations', ax=ax2, label='total_vaccinations', alpha=0.5)
    ax.set_ylim(0, 16000000)
    plt.title('Is the recovery rate of COVID-19 cases directly related to the success of a country\'s vaccination programme?\n (Top 10)')
    plt.ylabel('Total number of vaccinations (in (tens of) millions)');

We're seeing some new entries in this top ten. There is some overlap with the previous chart, as India, Brazil, Italy and Mexico all appear on both, which hints at a link between recoveries and the vaccination programme. The most interesting thing about this chart is that some of the more affluent countries (US, UK etc.) are absent from this top ten, which may mean those countries were hit hardest and had lower recovery rates (their death tolls were higher), although it could also reflect gaps in how recoveries were recorded. Another interesting thing is the data for Turkey and Germany, as their margins between recoveries and total vaccinations are small. Did their populations have a higher compliance rate for lockdown restrictions, meaning fewer infections? Some other political intervention? Or just small vaccination totals?

# Final words

The outcome of this EDA is very encouraging; most (if not all) countries around the globe are well on their way to vaccinating their populations, some much faster than others, which helps to instil hope that there will be better days and that the world will be open again very soon. That's all thanks to the incredible effort of scientists across the globe, who managed to engineer vaccines for public use within twelve months of identifying a novel coronavirus. New vaccines are being engineered and put up for approval all the time, and vaccines against COVID-19 will only become more available as time goes on.

This is also my first full EDA on Kaggle, so if you liked it, please show your appreciation and leave any criticisms and suggestions. Thank you for reading.