# COVID19 Vaccination Progress Exploratory Data Analysis
- Some preliminary EDA on the vaccination progress in the world.
- Factors such as geography (continent), economics (gdp), population size and politics are explored briefly. 
- A variety of plots are used for visualisation and unsupervised learning is used to glean additional insights into how vaccines may be selected by countries. 

# Data Initialisation

In [None]:
!pip install pycountry_convert
!pip install pyvis

In [None]:
from pycountry_convert import country_alpha2_to_continent_code, country_name_to_country_alpha2

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE

from pyvis.network import Network
import networkx as nx

In [None]:
df = pd.read_csv("/kaggle/input/world-vaccine-progress/World_Vaccination_Progress.csv")
df.head(5)

# Creating one-hot encoded columns for all the vaccines
- Split the vaccine column into individual vaccines.
- Keep track of each unique vaccine (for indexing) in the `unique_vac` dictionary.
- Generate a one-hot encoding for each row i.
- The first row (world) defaults to all vaccines.

* As the number of vaccines is not known at the start, we initialise each encoding with `[0] * temp_encode_len`, where `temp_encode_len` is an upper bound on the number of distinct vaccines. Alternatively, you can loop through the entire column once to find out how many vaccines there are since the dataset is small, but I prefer to skip the extra loop.
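
For comparison, the encoding can also be sketched without a pre-sized list by letting pandas discover the vaccine names. This is a minimal sketch on toy data, not the approach used in this notebook: `one_hot_vaccines` and the `toy` series are hypothetical names, and it assumes the vaccine names are comma-separated with inconsistent spacing around the commas.

```python
import pandas as pd

def one_hot_vaccines(vaccine_lists: pd.Series) -> pd.DataFrame:
    """Turn comma-separated vaccine strings into one 0/1 column per vaccine."""
    # Remove the whitespace around commas so that " Moderna" and "Moderna" match.
    cleaned = vaccine_lists.str.replace(r"\s*,\s*", ",", regex=True).str.strip()
    return cleaned.str.get_dummies(sep=",")

# Toy input (made up for illustration)
toy = pd.Series(["Pfizer/BioNTech, Moderna", "Sinopharm", "Moderna ,Sinopharm"])
encoded = one_hot_vaccines(toy)
print(encoded)
```

The columns come back in alphabetical order, so they would need reordering if later cells rely on positional slices of the dataframe columns.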

In [None]:
unique_vac = {}
temp_encode_len = 30
encodes = []
continents = []

for i in range(df.shape[0]):
    if i == 0:
        encodes.append([1] * temp_encode_len)
        continents.append("Null")
    else:
        vaccines = df['Vaccine being used in a country'][i]
        encode = [0] * temp_encode_len
        for vaccine in [x.strip() for x in vaccines.split(",")]: 
            if vaccine not in unique_vac.keys():
                unique_vac[vaccine] = len(unique_vac.keys())
                
            vac_index = unique_vac[vaccine]
            encode[vac_index] = 1
        encodes.append(encode)
        
        # Retrieving continent
        try:
            continents.append(country_alpha2_to_continent_code(country_name_to_country_alpha2(df['Country'][i])))
        except Exception:  # e.g. a country name that pycountry_convert cannot resolve
            continents.append("Null")

for vaccine in unique_vac.keys():
    vac_index = unique_vac[vaccine]
    df[vaccine] = [x[vac_index] for x in encodes]
    
df.head(5)

# Cleaning Vaccine columns
Seems like the vaccine names would have to be cleaned. 

Let's combine these sets together:

**Oxford/AstraZeneca**
- Oxford/AstraZeneca
- Oxford / AstraZeneca

**Sinopharm**
- Sinopharm
- Sinopharm/Beijing
- Sinopharm/Wuhan
- Sinopharm/HayatVax

**RBD-Dimer** 
- Dimer
- RBD


In [None]:
def merge_columns(df, target, sources):
    """Combine the one-hot columns in `sources` into `target`, then drop the leftovers.

    Columns missing from the dataframe are skipped rather than aborting the whole merge.
    """
    present = [col for col in sources if col in df.columns]
    if not present:
        return
    df[target] = df[present].max(axis=1)
    df.drop(columns=[col for col in present if col != target], inplace=True)

merge_columns(df, 'Oxford/AstraZeneca', ['Oxford/AstraZeneca', 'Oxford / AstraZeneca'])
merge_columns(df, 'Sinopharm', ['Sinopharm', 'Sinopharm/Beijing', 'Sinopharm/HayatVax', 'Sinopharm/Wuhan'])
merge_columns(df, 'RBD-Dimer', ['Dimer', 'RBD'])

df.head(5)

# Creating/Initialising a few more columns for analysis

In [None]:
# Estimated population size: total doses divided by doses per person
df['Size'] = (df['Doses Administered'] / df['Doses per 1000'] * 1000).apply(np.floor)
df['Continent'] = continents
df['Cluster'] = [0] * len(df['Country'])
df.head(5)

# Fully vaccinated % vs Doses per 1000
Now that data cleaning is complete, let's do some basic analysis to gain some insights into the dataset.

First, I'll go through a scatter plot of doses per 1000 against the fully vaccinated rate for a quick overview, as a sanity check, and to see if there are any peculiarities.

In [None]:
countries_df = df[1:]  # skip the first row, which is the world aggregate
x = countries_df['Doses per 1000'].to_numpy()
y = countries_df['Fully Vaccinated Population (%)'].to_numpy()
fig = px.scatter(countries_df, x="Doses per 1000", 
                 y="Fully Vaccinated Population (%)", 
                 hover_name="Country",
                 size=countries_df["Size"] ** 0.3,  # population size raised to the power of 0.3 for visualisation purposes
                 color="Continent")

# Fit the trendline only on countries where both values are present
idx = np.isfinite(x) & np.isfinite(y)
m, c = np.polyfit(x[idx], y[idx], 1)
x1 = np.linspace(0, 2400, 2400)
y1 = m*x1 + c
fig.add_traces(go.Scatter(x=x1, y=y1,
                          mode = 'lines',
                          marker_color='grey',
                          name='Trendline')
                          )
fig.show()

As expected, there is a clear linear trend between the fully vaccinated rate and the total doses administered per capita.
- This is in spite of certain vaccines such as Johnson & Johnson requiring only a single dose, compared to the two doses required by other vaccines.

Although most countries follow the trend, there are a few outliers:
- First, some countries (UAE, Kuwait, Guernsey, Ireland...) have particularly low fully vaccinated rates relative to their doses per capita. This could be due to these countries postponing second doses to ensure that a larger portion of their population receives a first dose.
- There is also one major outlier (Gibraltar) with a >100% vaccination rate. Gibraltar is a small but wealthy British overseas territory, and it's possible that it provides vaccinations to visitors and includes them in its tally. Although this would explain the outlier, it points to another potential limitation: countries may tabulate vaccination records differently (e.g. some may include vaccinations of foreigners, while others may not). This would lead to some inevitable noise in the data.

Overall, although both metrics are reasonable proxies for one another, it seems that doses per 1000 would be a more meaningful reference for analysing vaccination progress due to potential differences in national vaccination strategies.
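
Outliers like these can also be flagged numerically instead of by eye. The following is a minimal sketch on made-up data, not part of the original analysis: `flag_trend_outliers` and the threshold of 2 standard deviations are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def flag_trend_outliers(frame, x_col, y_col, z_threshold=2.0):
    """Return rows whose residual from a straight-line fit is unusually large."""
    valid = frame.dropna(subset=[x_col, y_col])
    slope, intercept = np.polyfit(valid[x_col], valid[y_col], 1)
    residuals = valid[y_col] - (slope * valid[x_col] + intercept)
    # Residuals of a least-squares line average to zero, so this is a z-score.
    z_scores = residuals / residuals.std()
    return valid[z_scores.abs() > z_threshold]

# Toy data (made up): country F sits far above the line y = 0.04 * x
toy = pd.DataFrame({
    "Country": list("ABCDEFGHIJ"),
    "Doses per 1000": [0, 100, 200, 300, 400, 500, 600, 700, 800, 900],
    "Fully Vaccinated Population (%)": [0, 4, 8, 12, 16, 60, 24, 28, 32, 36],
})
outliers = flag_trend_outliers(toy, "Doses per 1000", "Fully Vaccinated Population (%)")
print(outliers["Country"].tolist())
```

A single extreme point also pulls the fitted line towards itself, so a threshold like this is only a rough screen.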


Other clear insights we can draw from this plot are:
- Several African countries have low vaccination rates.
- Meanwhile, European countries seem to be performing the best, with most having full vaccination rates of 20-40%.


# Geography
- A boxplot and geoplot will be generated to visualise how geography (continent/region) could influence vaccination progress.

In [None]:
fig = px.box(df, x="Continent", 
             y="Doses per 1000", 
             color="Continent", 
             hover_name="Country", 
             points="all")
fig.update_xaxes(
        tickangle = 90,
        title_text = "Continent",
        title_standoff = 25,
        categoryorder ="median descending")
fig.show()

In [None]:
fig = px.scatter_geo(df, locations="Country", locationmode="country names",
                     color="Doses per 1000",
                     size="Size",
                     size_max=100
                     )
fig.show()

- The box plot and geoplot reaffirm that Africa seems to be far behind other continents in terms of vaccinations, while Europe seems to be in the lead.
- However, there are clear exceptions to this pattern, as some African nations such as Seychelles and Morocco have made decent progress. Some European countries also seem to be performing poorly.
- Perhaps factors such as politics and economics, which are closely linked to geography, could better explain vaccination progress. Population size could also explain some of the outliers.



# Population Size
- Population size could potentially influence vaccination progress by complicating logistics. However, a large population size could also provide a country with a larger talent pool to research its own vaccine. 
- A simple scatter plot of doses per 1000 against size will be generated to evaluate whether size plays a role in vaccination progress.

In [None]:
fig = px.scatter(df[1:], x="Size", 
                 y="Doses per 1000", 
                 hover_name="Country",
                 color="Continent")
fig.show()

- Although at first glance vaccination progress appears to decrease with size, it is unlikely that size has a massive impact on vaccination progress.
- While some very large countries like India are lagging behind in progress (skewing the results), several small countries also have low vaccination progress.
- It is also noticeable that there are large countries like the United States that still have decent vaccination progress per capita.
- Overall, it seems like there isn't much of a meaningful relationship between size (at least on its own) and vaccination progress.
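
To put a number on a relationship like this, a rank correlation is less sensitive to a few very large countries than a straight-line fit. The sketch below uses made-up values and a hypothetical helper, `rank_correlation`, which computes Spearman's coefficient as the Pearson correlation of the ranks.

```python
import pandas as pd

def rank_correlation(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    paired = pd.concat([x, y], axis=1).dropna()
    return paired.iloc[:, 0].rank().corr(paired.iloc[:, 1].rank())

# Toy values (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({
    "Size": [1e5, 5e6, 6e7, 3e8, 1.4e9],
    "Doses per 1000": [900, 300, 650, 700, 120],
})
rho = rank_correlation(toy["Size"], toy["Doses per 1000"])
print(round(rho, 3))
```

Values near 0 would support the reading that size alone explains little, while values near -1 or 1 would suggest a monotonic relationship.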

# Economics
- Economics is a logical factor to examine, since more money = more choice and the ability to reserve/invest in vaccines early.
- A GDP per capita column will be joined to the dataframe using the country column.
- A scatter plot will be generated to determine if there is indeed a trend between GDP and vaccination progress.
- We will also look at how the wealth of a country may influence its vaccine choices with a box plot.

In [None]:
"""
Joining GDP data with dataframe

https://drive.google.com/file/d/16Im_qB1EhYZVyWZhhMOTFBAkuCWh6EhW/view?usp=sharing
ref: https://www.worldometers.info/gdp/gdp-per-capita/
"""

df2 = pd.read_csv("/kaggle/input/gdp-per-capita/GDP_country.csv", encoding= "unicode_escape")
df = pd.merge(df, df2, on="Country", how="left")


df.head(5)

In [None]:
x = df["GDP"]
y = df["Doses per 1000"]
idx = np.isfinite(x) & np.isfinite(y)
fig = px.scatter(df, x=x, y=y, hover_name="Country", color="Continent")
m, c = np.polyfit(x[idx], y[idx], 1)
x1 = np.linspace(0,120000,1000)
y1 = m*x1 + c
fig.add_traces(go.Scatter(x=x1, y=y1,
                          mode = 'lines',
                          marker_color='grey',
                          name='Trendline')
                          )
fig.show()

There is a clear positive correlation between GDP and vaccination doses.
- This makes sense as richer nations would generally be able to place orders and invest in vaccines while they are still in development, reserving stockpiles.
- Poorer nations may also be unable to afford certain vaccines and would thus have fewer options to choose from. In some cases, nations may even be reliant on donations.

Although a correlation seems to exist, a linear fit is not adequate in explaining the spread of the data, and it is likely that other factors also influence vaccination progress.
- Notably, there are some exceptions where poorer countries have a lot more doses (could be due to political alliances and aid or choosing cheaper alternatives).
- Individual fluctuations are also likely to be due to countries balancing between 'cost', 'efficacy' and 'safety'.

It could be interesting to explore how GDP influences a country's vaccine choice.
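
Exploring this needs one row per (country, vaccine) pair instead of one row per country. As a minimal sketch of that reshaping on toy data, `DataFrame.melt` can do it in a few lines; the values and the `vaccine_cols` list are made up for illustration.

```python
import pandas as pd

# Toy wide-format table: one row per country, one 0/1 column per vaccine
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "GDP": [60000, 8000],
    "Pfizer/BioNTech": [1, 0],
    "Sinopharm": [0, 1],
    "Oxford/AstraZeneca": [1, 1],
})
vaccine_cols = ["Pfizer/BioNTech", "Sinopharm", "Oxford/AstraZeneca"]

# Stack the vaccine columns, then keep only the vaccines each country uses
long_form = toy.melt(id_vars=["Country", "GDP"], value_vars=vaccine_cols,
                     var_name="Vaccine", value_name="Used")
long_form = long_form[long_form["Used"] == 1].drop(columns="Used").reset_index(drop=True)
print(long_form)
```

The result has the same shape of data that a box plot of GDP per vaccine needs: a country using three vaccines contributes three rows.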

In [None]:
new_df_country = []
new_df_vaccine = []
new_df_gdp = []
new_df_doses = []
new_df_continent = []
for i in range(len(df['Country'])):
    if i == 0:
        pass
    else:
        for vaccine in list(df.columns[5:-4]):
            if df[vaccine][i] == 1:
                new_df_country.append(df["Country"][i])
                new_df_vaccine.append(vaccine)
                new_df_gdp.append(df["GDP"][i])
                new_df_doses.append(df["Doses per 1000"][i])
                new_df_continent.append(df["Continent"][i])

df_split = pd.DataFrame()
df_split["Country"] = new_df_country
df_split["Vaccine"] = new_df_vaccine
df_split["GDP"] = new_df_gdp
df_split["Doses per 1000"] = new_df_doses 
df_split["Continent"] = new_df_continent

fig = px.box(df_split, x="Vaccine", y="GDP", hover_name="Country", points="all")
fig.update_xaxes(
        tickangle = 90,
        title_text = "Vaccine",
        title_standoff = 25,
        categoryorder ="mean descending")

fig.show()


These results are certainly interesting, and it seems that economics plays a role in determining which vaccines a country chooses.

- Moderna is a particularly expensive vaccine, costing 32 to 37 dollars a dose, and the high median GDP of countries that use it reflects this cost.
- Countries with a GDP > 50000 almost exclusively use *Pfizer*, *Oxford/AstraZeneca*, *Moderna* and *Johnson & Johnson*, which are widely regarded as the 'safest'/most well-documented vaccines.
- Lesser-known vaccines are used by many countries with a GDP < 20000. Some of these countries, such as Cuba, also opt to use locally produced vaccines.


- Only one territory with a GDP > 50000, Macao, uses Sinopharm, which makes sense given its close relationship with China (the producer), of which it is a special administrative region.
- This could point to politics also having a major influence on vaccine choice.

# Politics(?)
- Evaluating the impact of politics on vaccination progress is tricky without any proper measurement of political closeness.
- However, we can try grouping countries by the vaccines they use, using dimensionality reduction for visualisation and unsupervised learning methods.
- By examining the members of the resulting clusters, we can then glean some insight into whether politics could influence which vaccines a country chooses and its vaccination progress.
- However, especially since there isn't an objective way to evaluate our inferences, caution must be exercised to avoid confirmation bias.

In [None]:
hot_encode = df[df.columns[5:-4]][1:]
tsne = TSNE(random_state=69)
X_embedded = tsne.fit_transform(hot_encode)
fig = px.scatter(df[1:], x=X_embedded[:,0], y=X_embedded[:,1], hover_name="Country", color="Continent", opacity=0.7)
fig.show()

Examining the t-SNE visualisation plot, it seems plausible that politics has an influence on vaccine choice.
- We see that many Western European countries, South Korea and the United States appear relatively close to one another.
- Countries that aren't closely associated with NATO generally appear on the other side of the t-SNE plot.

To gain a better understanding of the association between countries and vaccine choice, we can do some unsupervised learning on the one-hot encoded vaccine vectors with hierarchical agglomerative clustering.
- This will be achieved by calculating the correlation matrix between countries based on their one-hot encoded vectors of vaccines used.
- The correlation matrix will then be used for hierarchical clustering.
- We can visualise the clusterings with a dendrogram.

Note: Other forms of clustering such as DBSCAN/K-means were tried and did not perform well on the one-hot encoded vectors, but may perform better on the t-SNE dimensions.
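
As a minimal sketch of the correlation step on toy data (three made-up countries, four vaccines), `np.corrcoef` treats each row as one variable and returns the full country-by-country matrix in one call. Because Pearson correlation is unaffected by positive scaling, dividing each row by its number of vaccines leaves the coefficients unchanged.

```python
import numpy as np

# Toy one-hot matrix: rows are countries, columns are vaccines (made up)
one_hot = np.array([
    [1, 1, 0, 0],  # country A
    [1, 1, 0, 0],  # country B: same vaccines as A
    [0, 0, 1, 1],  # country C: no overlap with A or B
])

corr = np.corrcoef(one_hot)  # pairwise correlation between rows

# Normalising each row by its vaccine count gives the same correlations
normalised = one_hot / one_hot.sum(axis=1, keepdims=True)
corr_normalised = np.corrcoef(normalised)

print(np.round(corr, 2))
```

One caveat: a row with no variation (a country using all or none of the vaccines) has zero variance, so its correlations come out as NaN and would need handling before clustering.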

In [None]:
hot_encode = df[1:][df.columns[5:-4]].to_numpy()
# cols = len(df.columns[5:-3])
countries = list(df['Country'])[1:]

encode_dic = {}
for i in range(len(countries)):
    to_encode = [x/np.sum(hot_encode[i]) for x in hot_encode[i]]
    encode_dic[countries[i]] = to_encode

corr_plot = []
for i in encode_dic.keys():
    corr_row = []
    for j in encode_dic.keys():
        corr_row.append(np.corrcoef(encode_dic[i], encode_dic[j])[0][1])
    corr_plot.append(corr_row)
    
fig = ff.create_dendrogram(np.array(corr_plot), orientation='bottom', labels=countries)
fig.update_layout({'width':1800, 'height':600,
                         'showlegend':False, 'hovermode': 'closest',
                         })
fig.show()

In [None]:
from IPython.display import display
# Ward linkage always uses Euclidean distance; the old `affinity` argument was removed in newer scikit-learn releases
cluster = AgglomerativeClustering(n_clusters=4, linkage='ward')
clusters = cluster.fit_predict(np.array(corr_plot))

df['Cluster'] = np.concatenate((np.array([100]), clusters), axis=0)
for c in range(4):
    print("Cluster", c)
    display(df[df['Cluster'] == c].head(5))
    

From the dendrogram and sample tables of the clusters, it seems that political affiliation has some influence on the 'vaccine selection' cluster a country belongs to.
- Cluster 0 seems to consist of countries that use a mix of Oxford/AstraZeneca (one of the cheapest vaccines and highly regarded, although it has a slightly lower efficacy rate than Pfizer and Moderna) and other high-efficacy but expensive vaccines such as Moderna. We'll call this the 'balanced' group.
- Cluster 1 consists of countries that use a mix of Oxford/AstraZeneca (associated with Western Europe) and vaccines produced by the 'East' (China or Russia) such as Sinopharm, Sinovac and Sputnik or, in India's case, its own vaccine, Covaxin. Using cheaper vaccines produced by both political sides, these countries can generally be seen as 'budget, politically unbiased'.
- Cluster 2 consists of several countries that are traditionally viewed as having been part of the 'communist' bloc during the Cold War era. These countries do not use the vaccines produced by the 'West' and rely on a mix of vaccines from the East or, in the case of Cuba, domestically produced vaccines. For the purposes of this notebook, I'll label them as 'budget, politically biased'.
- At first glance, Cluster 3 seems to consist of countries that predominantly use American vaccines such as Pfizer/BioNTech. However, some countries and territories that also use Sinovac/Sinopharm, such as Turkey, Hong Kong and Macao, appear in this cluster. It is interesting that Hong Kong and Macao use Pfizer even though they are autonomous territories of China, which is not on the best diplomatic terms with the United States. These territories may use Pfizer because their citizens likely trust the US vaccines even though their governments aren't on good diplomatic terms. This suggests that citizen trust is perhaps just as important as diplomatic ties when it comes to vaccine selection. Overall, this is quite a hodgepodge group, and splitting it up could lead to more meaningful clusters. However, based on the current cluster split, the common trait amongst members is that they use Pfizer, but not Oxford/AstraZeneca.


Overall, it seems that politics could have some influence on how countries decide which vaccines to use.

Note: These generalisations may not apply perfectly to all countries within a group. Clusterings are almost never perfect, and it would be virtually impossible to describe the countries within each group with a short label.


# Further Plots of Clusters

In [None]:
fig = px.scatter_geo(df[1:], locations="Country", locationmode="country names",
                     color="Cluster",
                     size="Size",
                     size_max=50
                     )
fig.show()

In [None]:
fig = px.box(df[1:], x="Cluster", y="GDP", hover_name="Country", color="Cluster", points="all")
fig.update_xaxes(
        tickangle = 90,
        title_text = "Cluster",
        title_standoff = 25,
        categoryorder ="median descending")

fig.show()

The GDP per capita of Clusters 0 and 3 is much higher than that of the 'budget' Clusters 1 and 2.

Let's see how cluster membership is reflected in vaccination progress.

In [None]:
fig = px.box(df[1:], x="Cluster", y="Doses per 1000", hover_name="Country", color="Cluster", points="all")
fig.update_xaxes(
        tickangle = 90,
        title_text = "Cluster",
        title_standoff = 25,
        categoryorder ="median descending")

fig.show()

Vaccination progress for each cluster seems to follow a similar trend as GDP.

However, interestingly, although the median GDP per capita of cluster 0 is lower than that of cluster 3, cluster 0 has a similar median vaccination progress.
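
Comparisons like this can be read off a small summary table instead of the box plots. The sketch below shows the idea with `groupby` on made-up numbers; the toy values are purely illustrative and are not results from this dataset.

```python
import pandas as pd

# Toy data (made up): two clusters with different GDP but similar doses
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 3, 3, 3],
    "GDP": [20000, 30000, 40000, 40000, 50000, 60000],
    "Doses per 1000": [500, 600, 700, 550, 600, 650],
})

# Median GDP and median doses per 1000 for each cluster
summary = toy.groupby("Cluster")[["GDP", "Doses per 1000"]].median()
print(summary)
```

Applied to the real dataframe, the same call would use the `Cluster`, `GDP` and `Doses per 1000` columns created earlier, with the world row excluded.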


# Conclusion
Overall, it seems that the following factors could influence vaccination progress:
- Economics (GDP), which is a strong predictor of vaccination progress.
- Vaccine choice (some collinearity with GDP).
- Potentially politics (potentially influences vaccine choice: can consider both citizen trust and diplomatic relationships). There is not enough analysis or data to conclude on the degree of impact. It would be good if an external dataset of political closeness between countries could be used to validate this.

Factors that do not contribute meaningfully to vaccination progress:
- Size (on its own)
- Geography (although it is a good indicator on its own, it is collinear with Economics and Politics, which seem to be more logical explanatory factors)

Other factors to consider:
- Logistics (Country Land Size, Land-locked)
- Health policies
- COVID cases (if a country has many COVID cases and/or deaths, perhaps it would be more inclined to vaccinate quickly)
- Economics (Industry). Countries and territories that are dependent on tourism may be more inclined to vaccinate quickly.

# Limitations with data and analysis
- No breakdown of number of vaccine doses for each specific vaccine brand
- Limited statistics on each individual vaccine (e.g. cost, country of origin, trust)
- Time series data for vaccination progress is too limited for meaningful insight
- Would be meaningful to have vaccination and death time series data to develop better insight on how vaccination rates are helping against covid

# Future work
- Feature engineering using economic and political features.
- The clustering from the vaccine types could be used as a basic proxy feature for political alignment, but there are likely to be more robust datasets that can be used for engineering this feature.
- Time series to track deaths and vaccination progress.
- Could use more traits related to countries, such as proxies of their healthcare/political policies.
- Could represent each vaccine as a vector of traits, e.g. trust (academic trust and political trust), safety (associated with academic trust), cost, availability.

This is just a preliminary exploratory data analysis and there is a lot more to explore. 

Hopefully this kernel has been of help to you.

# Thank you!