# Visualising the Current Status of COVID-19 Vaccination

**This notebook performs exploratory data analysis on multiple datasets and visualises the consolidated data.**
The main objective is to visualise the COVID-19 vaccination process globally and monitor its progress.

Plotly is used to visualise the data.

**The datasets used are:**

- COVID-19 World Vaccination Progress
- Population by Country - 2020

**Visualisations answer three simple questions:**

- What vaccines are used and in which countries?
- Which country has vaccinated the most people?
- Which country has vaccinated the largest percentage of its population?

## 1. Import libraries

In [1]:
import pandas as pd
import numpy as np
import math
import cufflinks as cf
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
%matplotlib inline

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
pio.renderers.default = 'iframe'

## 2. EDA on Vaccine Data Frame

In [2]:
df_vac = pd.read_csv('./data/country_vaccinations.csv',
                     usecols=['country', 'iso_code', 'date', 'people_vaccinated', 'vaccines'],
                     parse_dates=['date'])
df_vac.tail()

Unnamed: 0,country,iso_code,date,people_vaccinated,vaccines
6512,Zimbabwe,ZWE,2021-03-11,36019.0,Sinopharm/Beijing
6513,Zimbabwe,ZWE,2021-03-12,36283.0,Sinopharm/Beijing
6514,Zimbabwe,ZWE,2021-03-13,36359.0,Sinopharm/Beijing
6515,Zimbabwe,ZWE,2021-03-14,36359.0,Sinopharm/Beijing
6516,Zimbabwe,ZWE,2021-03-15,37660.0,Sinopharm/Beijing


In [3]:
df_vac.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6517 entries, 0 to 6516
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   country            6517 non-null   object        
 1   iso_code           6517 non-null   object        
 2   date               6517 non-null   datetime64[ns]
 3   people_vaccinated  3700 non-null   float64       
 4   vaccines           6517 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 254.7+ KB


In [4]:
df_vac = df_vac.dropna()
df_vac.isnull().sum()

country              0
iso_code             0
date                 0
people_vaccinated    0
vaccines             0
dtype: int64

Create a dataframe that groups by country and takes the maximum value of people_vaccinated for each one.

In [5]:
df_vac = pd.DataFrame(df_vac.groupby(["country","iso_code",'vaccines'])["people_vaccinated"].max())

df_vac.reset_index(level=0, inplace=True)
df_vac.reset_index(level=0, inplace=True)
df_vac.reset_index(level=0, inplace=True)

df_vac.tail()

Unnamed: 0,vaccines,iso_code,country,people_vaccinated
117,"Pfizer/BioNTech, Sinovac",URY,Uruguay,212220.0
118,Sputnik V,VEN,Venezuela,12194.0
119,Oxford/AstraZeneca,VNM,Vietnam,15865.0
120,"Oxford/AstraZeneca, Pfizer/BioNTech",OWID_WLS,Wales,1122931.0
121,Sinopharm/Beijing,ZWE,Zimbabwe,37660.0
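
The grouping step can be sanity-checked on a toy frame. The sketch below is not the notebook's code: it uses `as_index=False` as an alternative to the repeated `reset_index` calls (so the columns come out in key order rather than the order shown above), it assumes people_vaccinated is a cumulative count, and the first Wales figure is an invented placeholder.

```python
import pandas as pd

# Toy stand-in for df_vac after dropna(); the 1_100_000.0 value is made up.
toy = pd.DataFrame({
    'country': ['Zimbabwe', 'Zimbabwe', 'Wales', 'Wales'],
    'iso_code': ['ZWE', 'ZWE', 'OWID_WLS', 'OWID_WLS'],
    'vaccines': ['Sinopharm/Beijing', 'Sinopharm/Beijing',
                 'Oxford/AstraZeneca, Pfizer/BioNTech',
                 'Oxford/AstraZeneca, Pfizer/BioNTech'],
    'people_vaccinated': [36359.0, 37660.0, 1_100_000.0, 1122931.0],
})

# Assuming the count is cumulative, the per-country maximum is the latest total.
latest = toy.groupby(['country', 'iso_code', 'vaccines'],
                     as_index=False)['people_vaccinated'].max()
print(latest)
```

With `as_index=False` the group keys stay as ordinary columns, so no index reset is needed afterwards.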


## 3. EDA on Population Data

In [6]:
df_pop = pd.read_csv('./data/population_by_country_2020.csv')
df_pop.rename(columns={'Country (or dependency)': 'country'}, inplace=True)
df_pop.tail()

Unnamed: 0,country,Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
230,Montserrat,4993,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
231,Falkland Islands,3497,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
232,Niue,1628,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
233,Tokelau,1360,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %
234,Holy See,801,0.25 %,2,2003,0,,N.A.,N.A.,N.A.,0.00 %


Merge the vaccine and population dataframes.

In [7]:
df = pd.merge(df_vac, df_pop, how='left', on='country')
df.tail()

Unnamed: 0,vaccines,iso_code,country,people_vaccinated,Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
117,"Pfizer/BioNTech, Sinovac",URY,Uruguay,212220.0,3475842.0,0.35 %,11996.0,20.0,175020.0,-3000.0,2.0,36.0,96 %,0.04 %
118,Sputnik V,VEN,Venezuela,12194.0,28421581.0,-0.28 %,-79889.0,32.0,882050.0,-653249.0,2.3,30.0,N.A.,0.36 %
119,Oxford/AstraZeneca,VNM,Vietnam,15865.0,97490013.0,0.91 %,876473.0,314.0,310070.0,-80000.0,2.1,32.0,38 %,1.25 %
120,"Oxford/AstraZeneca, Pfizer/BioNTech",OWID_WLS,Wales,1122931.0,,,,,,,,,,
121,Sinopharm/Beijing,ZWE,Zimbabwe,37660.0,14899771.0,1.48 %,217456.0,38.0,386850.0,-116858.0,3.6,19.0,38 %,0.19 %


In [8]:
df[df['Population (2020)'].isnull()]['country'].unique()

array(["Cote d'Ivoire", 'Czechia', 'England', 'Guernsey', 'Jersey',
       'Northern Ireland', 'Scotland', 'Turks and Caicos Islands',
       'Wales'], dtype=object)

In [9]:
df_pop['country'].unique()

array(['China', 'India', 'United States', 'Indonesia', 'Pakistan',
       'Brazil', 'Nigeria', 'Bangladesh', 'Russia', 'Mexico', 'Japan',
       'Ethiopia', 'Philippines', 'Egypt', 'Vietnam', 'DR Congo',
       'Turkey', 'Iran', 'Germany', 'Thailand', 'United Kingdom',
       'France', 'Italy', 'Tanzania', 'South Africa', 'Myanmar', 'Kenya',
       'South Korea', 'Colombia', 'Spain', 'Uganda', 'Argentina',
       'Algeria', 'Sudan', 'Ukraine', 'Iraq', 'Afghanistan', 'Poland',
       'Canada', 'Morocco', 'Saudi Arabia', 'Uzbekistan', 'Peru',
       'Angola', 'Malaysia', 'Mozambique', 'Ghana', 'Yemen', 'Nepal',
       'Venezuela', 'Madagascar', 'Cameroon', "Côte d'Ivoire",
       'North Korea', 'Australia', 'Niger', 'Taiwan', 'Sri Lanka',
       'Burkina Faso', 'Mali', 'Romania', 'Malawi', 'Chile', 'Kazakhstan',
       'Zambia', 'Guatemala', 'Ecuador', 'Syria', 'Netherlands',
       'Senegal', 'Cambodia', 'Chad', 'Somalia', 'Zimbabwe', 'Guinea',
       'Rwanda', 'Benin', 'Burundi', 'Tuni

In [10]:
df_pop.country = df_pop.country.replace({
    "Côte d'Ivoire": "Cote d'Ivoire",
    "Czech Republic (Czechia)": "Czechia",
    "Turks and Caicos": "Turks and Caicos Islands"
})

# Merge dataframes
df = pd.merge(df_vac, df_pop, how='left', on='country')

# Identify rows with missing population data
df_na = df[df['Population (2020)'].isnull()]

# Print unique countries with missing population
print(df_na['country'].unique())

['England' 'Guernsey' 'Jersey' 'Northern Ireland' 'Scotland' 'Wales']
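
One way to spot country names that fail to match is the `indicator=True` option of `merge`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below illustrates this on toy frames; the helper name `unmatched_countries` is hypothetical and the population figure for the Czech row is a placeholder, not a value from the dataset.

```python
import pandas as pd

# Toy frames; the 10_000_000 population value is a placeholder.
vac = pd.DataFrame({'country': ['Czechia', 'Vietnam', 'Wales']})
pop = pd.DataFrame({
    'country': ['Czech Republic (Czechia)', 'Vietnam'],
    'Population (2020)': [10_000_000, 97490013],
})


def unmatched_countries(left, right):
    """Return left-side country names with no match in the right frame."""
    merged = left.merge(right, how='left', on='country', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only', 'country'].tolist()


before = unmatched_countries(vac, pop)

# Harmonise the spelling in the population frame, then check again.
pop['country'] = pop['country'].replace({'Czech Republic (Czechia)': 'Czechia'})
after = unmatched_countries(vac, pop)

print(before, after)
```

Names that remain unmatched after harmonising (Wales in this toy case) are simply absent from the population table and cannot be fixed by renaming.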


In [11]:
# Drop the Migrants (net) column before dropping rows: it has many missing
# values, so keeping it would make dropna() discard otherwise complete rows

del df['Migrants (net)']

# Delete the Yearly Change, Fert. Rate, Med. Age, Urban Pop % and World Share columns since they are not relevant

del df["Yearly Change"]
del df["Fert. Rate"]
del df["Med. Age"]
del df["Urban Pop %"]
del df["World Share"]

# Drop rows with NaN values to remove the countries without population data
# (England, Guernsey, Jersey, Northern Ireland, Scotland, Wales)

df_f = df.dropna()
df_f.tail()

Unnamed: 0,vaccines,iso_code,country,people_vaccinated,Population (2020),Net Change,Density (P/Km²),Land Area (Km²)
116,"Johnson&Johnson, Moderna, Pfizer/BioNTech",USA,United States,71054445.0,331341050.0,1937734.0,36.0,9147420.0
117,"Pfizer/BioNTech, Sinovac",URY,Uruguay,212220.0,3475842.0,11996.0,20.0,175020.0
118,Sputnik V,VEN,Venezuela,12194.0,28421581.0,-79889.0,32.0,882050.0
119,Oxford/AstraZeneca,VNM,Vietnam,15865.0,97490013.0,876473.0,314.0,310070.0
121,Sinopharm/Beijing,ZWE,Zimbabwe,37660.0,14899771.0,217456.0,38.0,386850.0


In [12]:
df_f = df_f.reset_index(drop=True)
df_f.tail()

Unnamed: 0,vaccines,iso_code,country,people_vaccinated,Population (2020),Net Change,Density (P/Km²),Land Area (Km²)
111,"Johnson&Johnson, Moderna, Pfizer/BioNTech",USA,United States,71054445.0,331341050.0,1937734.0,36.0,9147420.0
112,"Pfizer/BioNTech, Sinovac",URY,Uruguay,212220.0,3475842.0,11996.0,20.0,175020.0
113,Sputnik V,VEN,Venezuela,12194.0,28421581.0,-79889.0,32.0,882050.0
114,Oxford/AstraZeneca,VNM,Vietnam,15865.0,97490013.0,876473.0,314.0,310070.0
115,Sinopharm/Beijing,ZWE,Zimbabwe,37660.0,14899771.0,217456.0,38.0,386850.0


## 4. Creating a Detailed Vaccine Data Frame

In [13]:
df_f['vaccines'].unique()

array(['Pfizer/BioNTech', 'Oxford/AstraZeneca',
       'Oxford/AstraZeneca, Sinopharm/Beijing, Sputnik V',
       'Oxford/AstraZeneca, Pfizer/BioNTech',
       'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech', 'Sinovac',
       'Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing, Sputnik V',
       'Sputnik V', 'Oxford/AstraZeneca, Sinovac', 'Sinopharm/Beijing',
       'Moderna, Pfizer/BioNTech', 'Pfizer/BioNTech, Sinovac', 'Moderna',
       'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing, Sputnik V',
       'Covaxin, Oxford/AstraZeneca',
       'Pfizer/BioNTech, Sinopharm/Beijing',
       'Pfizer/ BioNTech, Sinopharm/Beijing',
       'Oxford/AstraZeneca, Pfizer/BioNTech, Sputnik V',
       'Sinopharm/Beijing, Sputnik V',
       'Oxford/AstraZeneca, Sinopharm/Beijing', 'EpiVacCorona, Sputnik V',
       'Johnson&Johnson',
       'Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing, Sinopharm/Wuhan, Sputnik V',
       'Johnson&Johnson, Moderna, Pfizer/BioNTech'], dt

Create boolean columns for all vaccines and concatenate them into one data frame.

In [14]:
boo1 = df_f['vaccines'].str.contains('Pfizer/BioNTech')
df_boo1 = pd.DataFrame(boo1).reset_index(drop=True)
df_boo1.rename(columns={'vaccines': 'Pfizer/BioNTech'}, inplace=True)

boo2 = df_f['vaccines'].str.contains('Sputnik V')
df_boo2 = pd.DataFrame(boo2).reset_index(drop=True)
df_boo2.rename(columns={'vaccines': 'Sputnik V'}, inplace=True)

boo3 = df_f['vaccines'].str.contains('Oxford/AstraZeneca')
df_boo3 = pd.DataFrame(boo3).reset_index(drop=True)
df_boo3.rename(columns={'vaccines': 'Oxford/AstraZeneca'}, inplace=True)

boo4 = df_f['vaccines'].str.contains('Sinopharm/Beijing')
df_boo4 = pd.DataFrame(boo4).reset_index(drop=True)
df_boo4.rename(columns={'vaccines': 'Sinopharm/Beijing'}, inplace=True)

boo5 = df_f['vaccines'].str.contains('Moderna')
df_boo5 = pd.DataFrame(boo5).reset_index(drop=True)
df_boo5.rename(columns={'vaccines': 'Moderna'}, inplace=True)

boo6 = df_f['vaccines'].str.contains('Sinovac')
df_boo6 = pd.DataFrame(boo6).reset_index(drop=True)
df_boo6.rename(columns={'vaccines': 'Sinovac'}, inplace=True)

boo7 = df_f['vaccines'].str.contains('Covaxin')
df_boo7 = pd.DataFrame(boo7).reset_index(drop=True)
df_boo7.rename(columns={'vaccines': 'Covaxin'}, inplace=True)

boo8 = df_f['vaccines'].str.contains('EpiVacCorona')
df_boo8 = pd.DataFrame(boo8).reset_index(drop=True)
df_boo8.rename(columns={'vaccines': 'EpiVacCorona'}, inplace=True)

boo9 = df_f['vaccines'].str.contains('Sinopharm/Wuhan')
df_boo9 = pd.DataFrame(boo9).reset_index(drop=True)
df_boo9.rename(columns={'vaccines': 'Sinopharm/Wuhan'}, inplace=True)

result = pd.concat([df_boo1, df_boo2, df_boo3, df_boo4, df_boo5, df_boo6, df_boo7, df_boo8, df_boo9], axis=1)

result = result.astype(int)

result.head()

Unnamed: 0,Pfizer/BioNTech,Sputnik V,Oxford/AstraZeneca,Sinopharm/Beijing,Moderna,Sinovac,Covaxin,EpiVacCorona,Sinopharm/Wuhan
0,1,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0
4,0,1,1,1,0,0,0,0,0
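
A more compact way to build the same kind of 0/1 table is `Series.str.get_dummies`, sketched below on a toy series rather than on df_f. This is an alternative technique, not the notebook's method. It also normalises the `'Pfizer/ BioNTech'` spelling that appears in the unique values above, which a plain `str.contains('Pfizer/BioNTech')` would not match; whether that variant should count as the same vaccine is an assumption here.

```python
import pandas as pd

# Toy stand-in for df_f['vaccines']; the values mimic the notebook's format.
vaccines = pd.Series([
    'Pfizer/BioNTech',
    'Oxford/AstraZeneca, Sinopharm/Beijing, Sputnik V',
    'Pfizer/ BioNTech, Sinopharm/Beijing',  # note the stray space after '/'
])

# Normalise the stray space so both spellings map to the same vaccine name.
normalised = vaccines.str.replace('/ ', '/', regex=False)

# One 0/1 column per vaccine name, splitting each row on the ', ' separator.
indicators = normalised.str.get_dummies(sep=', ')

# Number of distinct vaccines per row.
indicators['Different Vaccines'] = indicators.sum(axis=1)
print(indicators)
```

Because the column names are derived from the data, a vaccine that appears in a later version of the dataset would get its own column without any code change.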


Merge the detailed vaccine data frame into the main data frame, then add a column for the total number of different vaccines.

In [15]:
df_f = pd.concat([df_f, result], axis=1)

df_f['Different Vaccines'] = df_f['Pfizer/BioNTech'] + \
                             df_f['Sputnik V'] + \
                             df_f['Oxford/AstraZeneca'] + \
                             df_f['Sinopharm/Beijing'] + \
                             df_f['Moderna'] + \
                             df_f['Sinovac'] + \
                             df_f['Covaxin'] + \
                             df_f['EpiVacCorona'] + \
                             df_f['Sinopharm/Wuhan']

df_f.head()

Unnamed: 0,vaccines,iso_code,country,people_vaccinated,Population (2020),Net Change,Density (P/Km²),Land Area (Km²),Pfizer/BioNTech,Sputnik V,Oxford/AstraZeneca,Sinopharm/Beijing,Moderna,Sinovac,Covaxin,EpiVacCorona,Sinopharm/Wuhan,Different Vaccines
0,Pfizer/BioNTech,ALB,Albania,6073.0,2877239.0,-3120.0,105.0,27400.0,1,0,0,0,0,0,0,0,0,1
1,Pfizer/BioNTech,AND,Andorra,3650.0,77287.0,123.0,164.0,470.0,1,0,0,0,0,0,0,0,0,1
2,Oxford/AstraZeneca,AGO,Angola,6169.0,33032075.0,1040977.0,26.0,1246700.0,0,0,1,0,0,0,0,0,0,1
3,Oxford/AstraZeneca,AIA,Anguilla,3929.0,15026.0,134.0,167.0,90.0,0,0,1,0,0,0,0,0,0,1
4,"Oxford/AstraZeneca, Sinopharm/Beijing, Sputnik V",ARG,Argentina,1952883.0,45267449.0,415097.0,17.0,2736690.0,0,1,1,1,0,0,0,0,0,3


## 5. Data Visualisation

In [16]:
summ = df_f.iloc[:,8:17].sum()
df_vacc = pd.DataFrame(summ, columns=['Usage'])
df_vacc = df_vacc.sort_values('Usage', ascending=False)
df_vacc.reset_index(level=0, inplace=True)
df_vacc.tail()

Unnamed: 0,index,Usage
4,Sinopharm/Beijing,15
5,Sinovac,9
6,Covaxin,1
7,EpiVacCorona,1
8,Sinopharm/Wuhan,1


In [126]:
fig = px.bar(
    data_frame=df_vacc,
    x='index',
    y='Usage',
    color='Usage',
    hover_name='index',
    hover_data=['Usage'],
    color_continuous_scale='Magenta',
    labels={'index': 'Vaccine', 'Usage': '# of Countries Using'},
    height=750,  # Increase height for more space
    width=768,
    text='Usage'
)

fig.update_layout(
    uniformtext_minsize=12,
    xaxis_tickangle=-45,
    title='Usage of Vaccines by Countries',
    title_x=0.5,
    showlegend=False,
    plot_bgcolor='rgba(0, 0, 0, 0)',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    coloraxis_showscale=False
)

fig.update_traces(texttemplate='%{text:.2s}', textposition='auto')

footnote = """
<b>Vaccination Race</b><br>
It appears that the Pfizer/BioNTech vaccine benefits from being first to market. <br>
Although the vaccine developed by Oxford/AstraZeneca was released some time after the one developed by Pfizer/BioNTech, <br>
it is a strong competitor in the vaccination race because it is easier to handle and transport. <br>
Currently, these two are the most common COVID-19 vaccines around the world, being used in more than 60 countries.<br>
"""

fig.update_layout(margin={"b": 300})  # Increase the bottom margin

fig.add_annotation(
    text=footnote,
    showarrow=False,
    x=0.45,
    y=-0.45,  # Adjust this value
    xref='paper',
    yref='paper',
    xanchor='center',
    yanchor='top',
    font=dict(size=11, color='grey'),
    align="left",
)

fig.update_xaxes(type="category")

# Turn off the y-axis
# fig.update_yaxes(visible=False)

# Show a line for the x-axis
fig.update_xaxes(showline=True, linewidth=2, linecolor="grey")

fig.update_layout(title=dict(x=0, xanchor='left', yanchor='top'))

fig.show()


What vaccines are used and in which countries?

In [101]:
df_f['Log Scale'] = df_f['people_vaccinated'].apply(lambda x : math.log2(x+1))
df_f_sorted = df_f.sort_values('Different Vaccines', ascending = False)
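
The Log Scale column is presumably there so that the colour scale is not dominated by the few countries with very large counts. The sketch below uses three people_vaccinated values from the tables above to show that the vectorised `np.log2(x + 1)` gives the same result as the `apply` version, and how much the transform compresses the range.

```python
import math

import numpy as np
import pandas as pd

# Values for Andorra, Zimbabwe and the United States, taken from the tables above.
counts = pd.Series([3650.0, 37660.0, 71054445.0], name='people_vaccinated')

# Element-wise version, as in the cell above.
log_apply = counts.apply(lambda x: math.log2(x + 1))

# Vectorised equivalent; the +1 keeps a count of zero finite (log2(1) == 0).
log_vec = np.log2(counts + 1)

# Raw values span several orders of magnitude; the log values span far less.
print(counts.max() / counts.min())
print(log_vec.max() - log_vec.min())
```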

In [133]:
fig = px.bar(df_f_sorted,
              x = 'country',
              y = 'Different Vaccines',
              color='Different Vaccines',
              hover_name = 'country',
              hover_data = ['vaccines'],
              color_continuous_scale = 'Blues',
              labels = {'country':'Country', 'vaccines':'Used Vaccines'},  # rename the labels shown in the tooltip
              height=500, width=768,)

fig.update_layout(uniformtext_minsize = 15,
                   xaxis_tickangle = -45,
                   title = 'Total Number of Different Vaccines Used by Countries',
                   title_x = 0.5)

fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)', 'paper_bgcolor': 'rgba(0, 0, 0, 0)',})

fig.update(layout_coloraxis_showscale=False)
fig.update_layout(title=dict(x=0, xanchor='left', yanchor='top'))

footnote = """
<b>Does more money mean more types of vaccine?</b><br>
The UAE and Hungary each use 5 of the 9 vaccines. <br>
They are followed by Bahrain and Serbia, which each use 4. Let's look at 
the distribution on the map to understand it better.
"""

fig.update_layout(margin={'b': 200})

fig.add_annotation(
    text=footnote,
    showarrow=False,  # no arrow, matching the footnotes on the other charts
    xref='paper',
    yref='paper',
    xanchor='center',
    yanchor='top',
    x=0.62,
    y=-0.7,
    align='left'
)

fig.show()


In [156]:
fig = px.choropleth(df_f,
                    locations="country", 
                    locationmode='country names',
                    color="Different Vaccines", 
                    hover_name="country", 
                    hover_data=['Different Vaccines','vaccines'],
                    color_continuous_scale="Blues",
                    labels={'country':'Country','vaccines':'Used Vaccines'})

fig.update_layout(title="Total Number of Different Vaccines Used by Countries")
fig.update_layout(title=dict(x=0, xanchor='left', yanchor='top'))
fig.update(layout_coloraxis_showscale=False)

fig.update_layout(
    {
        "plot_bgcolor": "rgba(255, 255, 255, 1)",  # alpha ranges from 0 to 1
        "paper_bgcolor": "rgba(255, 255, 255, 1)",
        "xaxis_title": None,
    }
)

fig.update_geos(
    visible=False,  # Hide default geos
)

fig.update_geos(
    showframe=False,  # Remove the border around the entire figure
    showcoastlines=False,  # Hide the coastlines
)

fig.show()

Which country has vaccinated the most people?

In [157]:
df_f_sorted = df_f.sort_values('people_vaccinated', ascending = False)
df_f_sorted = df_f_sorted.iloc[0:20,:]  # keep 20 rows to match the "Top 20" chart title
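
As an alternative sketch, `DataFrame.nlargest` selects the top rows in one step. The toy frame below reuses five people_vaccinated values from the tables above. Note that a positional slice `iloc[0:n]` returns exactly n rows, so a top-20 selection needs `iloc[0:20]`.

```python
import pandas as pd

# Five rows taken from the tail of df_f shown earlier.
toy = pd.DataFrame({
    'country': ['United States', 'Uruguay', 'Venezuela', 'Vietnam', 'Zimbabwe'],
    'people_vaccinated': [71054445.0, 212220.0, 12194.0, 15865.0, 37660.0],
})

# Top 3 by people_vaccinated, already sorted in descending order.
top3 = toy.nlargest(3, 'people_vaccinated')

# Equivalent selection with sort_values and a positional slice.
top3_slice = toy.sort_values('people_vaccinated', ascending=False).iloc[0:3, :]

print(top3)
```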

In [176]:
fig = px.bar(df_f_sorted,
              x = 'country',
              y = 'people_vaccinated',
              color='Log Scale',
              hover_name = 'country',
              hover_data = ['people_vaccinated'],
              color_continuous_scale = 'mint',
              labels = {'country':'Country','people_vaccinated':'People Vaccinated'},
              height=500,
              width=768,
              text='people_vaccinated')

fig.update_layout(uniformtext_minsize = 15,
                   xaxis_tickangle = -45,
                   title = 'Total Vaccinated People by Top 20 Countries',
                   title_x = 0)

fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)', 'paper_bgcolor': 'rgba(0, 0, 0, 0)',})

fig.update(layout_coloraxis_showscale=False)

footnote="""
<b>More population, more people to vaccinate</b><br>
The United States is far ahead of most countries, with about 71M people vaccinated. <br>
It is followed by India, the UK, Brazil and Turkey. Given that most of these countries are among the most populous <br>
in the world, it is not surprising that they have vaccinated more people than the rest.
"""

fig.update_layout(margin={"b": 200})  # Increase the bottom margin

fig.add_annotation(
    text=footnote,
    showarrow=False,
    xref='paper',
    yref='paper',
    xanchor='center',
    yanchor='top',
    font=dict(size=11, color='grey'),
    align='left',
    x=0.35 ,
    y=-0.55
)

fig.show()

In [177]:
fig = px.choropleth(df_f,
                    locations="country", 
                    locationmode='country names',
                    color="Log Scale", 
                    hover_name="country", 
                    hover_data=['people_vaccinated'],
                    color_continuous_scale="mint",
                    labels={'country':'Country','people_vaccinated':'People Vaccinated'})

fig.update_layout(title="Total Vaccinated People by Country")
fig.update_layout(title=dict(x=0, xanchor='left', yanchor='top'))
fig.update(layout_coloraxis_showscale=False)

fig.update_layout(
    {
        "plot_bgcolor": "rgba(255, 255, 255, 1)",  # alpha ranges from 0 to 1
        "paper_bgcolor": "rgba(255, 255, 255, 1)",
        "xaxis_title": None,
    }
)

fig.update_geos(
    visible=False,  # Hide default geos
)

fig.update_geos(
    showframe=False,  # Remove the border around the entire figure
    showcoastlines=False,  # Hide the coastlines
)

fig.show()

**Numbers are definitely not wrong, but might be misleading**

It is clear that countries with large populations also have high numbers of people vaccinated. To better understand each country's success in the vaccination process, we should look at the percentage of the population vaccinated.
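
A minimal sketch of that calculation, assuming the column names used in df_f and a hypothetical new column called Percent Vaccinated. The four rows are taken from the tables above.

```python
import pandas as pd

# Rows copied from the tail of df_f shown earlier.
toy = pd.DataFrame({
    'country': ['Uruguay', 'Venezuela', 'Vietnam', 'Zimbabwe'],
    'people_vaccinated': [212220.0, 12194.0, 15865.0, 37660.0],
    'Population (2020)': [3475842.0, 28421581.0, 97490013.0, 14899771.0],
})

# Share of the population vaccinated, in percent.
toy['Percent Vaccinated'] = (
    toy['people_vaccinated'] / toy['Population (2020)'] * 100
)

# Rank countries by percentage rather than by absolute count.
ranked = toy.sort_values('Percent Vaccinated', ascending=False)
print(ranked[['country', 'Percent Vaccinated']].round(2))
```

In this toy sample Vietnam has vaccinated more people than Venezuela in absolute terms but ranks below it by percentage, which is the kind of reversal the absolute numbers hide.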