In [None]:
# Wrangling
import pandas as pd
import numpy as np

# Viz
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Cluster
from sklearn.cluster import KMeans

# np.warnings was removed in NumPy 1.24; use the standard library module instead
import warnings
warnings.filterwarnings('ignore')

# Table of Contents

1. [Introduction](#1.-Introduction)

    1.1. [Objectives](#1.1.-Objectives)
    
2. [Data Loading, Data Cleaning and Descriptive Analysis](#2.-Data-Loading,-Data-Cleaning-and-Descriptive-Analysis)

    2.1. [Data Cleaning](#2.1.-Data-Cleaning)
    
    2.2. [Merge](#2.2.-Merge)
    
    2.3. [Descriptive Analysis](#2.3.-Descriptive-Analysis)
    
3. [Data Analysis](#3.-Data-Analysis)

    3.1. [What regions have most of its population vaccinated?](#3.1.-What-regions-have-most-of-its-population-vaccinated?)
    
    3.2. [Does a higher GDP mean a higher vaccinated population?](#3.2.-Does-a-higher-GDP-mean-a-higher-vaccinated-population?)
    
    3.3. [What makes Uruguay and Saudi Arabia special?](#3.3-What-makes-Urugay-and-Saudi-Arabia-special?)
    
    3.4. [Does Social support and Healthy life expectancy influence % of vaccinated people?](#3.4.-Does-Social-support-and-Healthy-life-expectany-influence-%-of-vaccinated-people?)
    
    3.5. [Is a vaccinated country a happy one?](#3.5.-Is-a-vaccinated-country-a-happy-one?)

4. [Conclusions](#4.-Conclusions)

5. [Recommendations](#5.-Recommendations)

# 1. Introduction

Following the outbreak of Covid-19 and the subsequent development of vaccines, the world now faces the challenge of distributing them everywhere in the hope of returning to the long-awaited 'normality'. At first glance the progress looks positive. But with a limited supply of vaccines, the problem becomes distributing them fairly, so that less wealthy countries also have access.

This notebook analyses the latest Covid-19 vaccination status of all countries in the world as of 30 June 2021 (https://www.kaggle.com/anandhuh/latest-worldwide-vaccine-data) and compares it with the data of the World Happiness Report 2021 (https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021).

What story is behind the data? Is the world succeeding in distributing the vaccine fairly? Is a happy country also a vaccinated one? We will find out.

## 1.1. Objectives

The notebook focuses mainly on the factors that make one country more vaccinated than another.

- Find key aspects of a leading vaccinated country
- Develop strategies and recommendations based on the findings
- Build a model that predicts how vaccinated a country is as of 30 June 2021

# 2. Data Loading, Data Cleaning and Descriptive Analysis

In [None]:
covid = pd.read_csv('../input/latest-worldwide-vaccine-data/Worldwide Vaccine Data.csv')
happiness = pd.read_csv('../input/world-happiness-report-2021/world-happiness-report-2021.csv')

## 2.1. Data Cleaning
Let's start by cleaning both datasets. We will check whether they have missing values or duplicates and remove them if that is the case.

It is important to keep in mind that the two datasets are going to be merged. They must therefore share one key column with the same values, so we will also check that those values are consistent across both datasets.

### 2.1.1. Covid

In [None]:
# Duplicates check

if not covid.duplicated().any():
    print("There aren't any duplicates")
else:
    print('There are duplicates')
    
sns.heatmap(covid.isnull())
plt.show()

There are some null values, most of them in '% of population fully vaccinated'. Let's check those nulls.

In [None]:
covid[covid['% of population fully vaccinated'].isnull()]

The fact that some countries have no fully vaccinated people on record is questionable, but plausible. At first glance, we can see that most of those countries are in Africa.

We are going to replace those nulls with 0s.
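
Filling with 0 makes a missing report indistinguishable from a genuine zero. A minimal sketch, on a made-up frame (the country labels and values are illustrative, and `fully_vaccinated_missing` is a hypothetical column name), of recording which rows were imputed before filling:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the covid frame; values are illustrative only.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    '% of population fully vaccinated': [12.5, np.nan, 0.0],
})

col = '% of population fully vaccinated'

# Remember which rows had no reported value (hypothetical helper column).
toy['fully_vaccinated_missing'] = toy[col].isnull()

# Fill only the column of interest instead of the whole frame.
toy[col] = toy[col].fillna(0)

print(toy)
```

Keeping the flag makes it possible to exclude imputed rows later if a zero turns out to be a reporting gap rather than a real value.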

In [None]:
covid.fillna(0, inplace = True)

sns.heatmap(covid.isnull())
plt.show()

### 2.1.2. Happiness

In [None]:
# Duplicates check
if not happiness.duplicated().any():
    print("There aren't any duplicates")
else:
    print('There are duplicates')

# Nulls check
sns.heatmap(happiness.isnull())
plt.show()

All good with the happiness dataset. Let's proceed and merge both datasets.

## 2.2. Merge
The country name will be our key column, so let's start by checking the names of the countries. There are a lot of countries, but it is still possible to check at a glance whether the names match between the two datasets, so that's what we are going to do.

In [None]:
print(covid.sort_values(by=["Country"])["Country"].values, '\n', len(covid.sort_values(by=["Country"])["Country"]))

print(happiness.sort_values(by=['Country name'])['Country name'].values, '\n', len(happiness.sort_values(by=['Country name'])['Country name']))



Most of the values are consistent, but the following points should be mentioned.
- The happiness dataset has fewer countries than the covid dataset. An inner join will be used, so some information will be lost. The loss will be around 30 rows, but it shouldn't be critical for the analysis.
- The United Kingdom has different names in the two datasets. It will be changed manually since it is an important country.
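
The comparison above can also be done programmatically. The following is a sketch on made-up frames (the `vaccinated` column and all values are placeholders, not data from the notebook) that uses an outer join with `indicator=True` to list the names found in only one of the two tables:

```python
import pandas as pd

# Toy frames standing in for the two datasets; values are placeholders.
left = pd.DataFrame({'Country': ['Chile', 'U.K.', 'Tuvalu'],
                     'vaccinated': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Country name': ['Chile', 'United Kingdom'],
                      'Ladder score': [1.0, 2.0]})

# An outer join with indicator=True labels each row by where it was found.
check = left.merge(right, left_on='Country', right_on='Country name',
                   how='outer', indicator=True)

only_left = check.loc[check['_merge'] == 'left_only', 'Country'].tolist()
only_right = check.loc[check['_merge'] == 'right_only', 'Country name'].tolist()

print('Only in the vaccine data:', only_left)
print('Only in the happiness data:', only_right)
```

Names that appear in both lists under slightly different spellings, such as `U.K.` and `United Kingdom` here, are the ones worth renaming before the inner join.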

In [None]:
display(covid[covid['Country']=='U.K.'])
# Select the row by value rather than by a hard-coded index label
covid.loc[covid['Country']=='U.K.', 'Country'] = 'United Kingdom'

Let's proceed with the merge

In [None]:
df = covid.merge(happiness, left_on='Country', right_on='Country name', how='inner')
del df['Country name']

df = df.sort_values(by=['Country'])
df

We ended up with 136 countries and lots of useful data for our analysis. Let's now proceed with the descriptive analysis.

## 2.3. Descriptive Analysis

Since we have 24 features, we first select the variables we are going to use. The correlation matrix will help us find the most relevant features. I will also select some variables that I find interesting for the analysis. Note that the selection criterion is the relationship each variable has with the feature '% of population vaccinated', since our analysis is based on it.
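
As a complement to reading the heatmap, the features can be ranked by the strength of their correlation with the target. This is a sketch on a made-up frame (column names mirror the notebook, the values are invented for illustration):

```python
import pandas as pd

# Toy numeric frame; the values are made up.
toy = pd.DataFrame({
    '% of population vaccinated': [5.0, 20.0, 40.0, 60.0],
    'Logged GDP per capita': [7.0, 8.5, 10.0, 11.0],
    'Generosity': [0.3, -0.1, 0.2, 0.0],
})

target = '% of population vaccinated'

# Absolute correlation of every numeric column with the target, strongest first.
ranking = (toy.select_dtypes('number')
              .corr()[target]
              .drop(target)
              .abs()
              .sort_values(ascending=False))

print(ranking)
```

Taking the absolute value keeps strongly negative correlations near the top of the list, which a plain sort would push to the bottom.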

In [None]:
fig, ax = plt.subplots(figsize=(15,15))

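# Restrict to numeric columns: recent pandas versions raise on text columns in corr()
sns.heatmap(df.select_dtypes('number').corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True), annot=True)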

plt.show()

In [None]:
# the chosen features are the following
features = ['Country','Regional indicator', '% of population vaccinated', '% of population fully vaccinated', 
            'Ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy']

The features that start with "Explained by:" won't be selected, since they are derived from other features of the same dataset. In other words, they are redundant data.

In [None]:
# Copy so that columns added later do not trigger SettingWithCopyWarning
df = df[features].copy()
df

In [None]:
# info() prints its summary and returns None, so there is nothing to display
df.info()

With 2 categorical variables and 6 floats, we can start doing the descriptive analysis.

### 2.3.1. Category features

We know that each country is unique, so we only have one categorical variable to examine.

In [None]:
df.select_dtypes('object')

In [None]:
fig = px.histogram(df, x='Regional indicator', template="plotly_white", color_discrete_sequence=["rgb(127,232,186)"]).update_xaxes(categoryorder="total descending")
fig.show()

print(df['Regional indicator'].value_counts())

### 2.3.2. Number Features

In [None]:
numbers = list(df.select_dtypes('float64').columns)

df[numbers].hist(figsize=(20,10), color='#aaf0d1', edgecolor='white')

plt.show()

df[numbers].describe()

All the numerical variables have an almost normal shape, except for '% of population vaccinated' and '% of population fully vaccinated', which have most of their values in the lower bins. This is a first insight: most countries are still in the first phase, with around 14% (median) of the population vaccinated.

Let's start analysing the data.

# 3. Data Analysis

## 3.1. What regions have most of its population vaccinated?

In [None]:
fig, ax = plt.subplots(figsize=(15,5))

order=list(df.groupby('Regional indicator')['% of population vaccinated'].mean().sort_values(ascending=False).index)
sns.barplot(x='Regional indicator', y='% of population vaccinated', data=df, order=order, palette="Blues_d")

ax.tick_params(labelrotation=90)

plt.show()

print(df.groupby('Regional indicator')['% of population vaccinated'].mean().sort_values(ascending=False))

The lines on top of the bars are error bars: by default seaborn draws a 95% confidence interval around the mean, and the mean itself is the height of the bar. We can see that Western Europe, North America and ANZ, and East Asia lead the vaccination process. South Asia and Sub-Saharan Africa sit at the bottom.

Note the high variance in every region except Western Europe. This means that some countries in the sample have far more vaccinated people than others. Could the low variance be a sign of regional cooperation between countries? My intuition says yes. The European Union and the European Commission presumably developed a plan for the whole continent, which would explain the low variance.

North America and ANZ, on the other hand, does not seem to be cooperating. The United States and Canada have more than 50% of their population vaccinated, whereas Australia and New Zealand have 24% and 14% respectively, as shown in the table below.
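
The within-region spread can also be put into numbers. A sketch on made-up data (the region labels `R1` and `R2` and all values are invented) that reports the mean, standard deviation, and coefficient of variation per region:

```python
import pandas as pd

# Toy data: R1 is homogeneous, R2 is not. Values are made up.
toy = pd.DataFrame({
    'Regional indicator': ['R1', 'R1', 'R1', 'R2', 'R2', 'R2'],
    '% of population vaccinated': [50.0, 52.0, 54.0, 10.0, 30.0, 50.0],
})

summary = (toy.groupby('Regional indicator')['% of population vaccinated']
              .agg(['mean', 'std', 'count']))

# Coefficient of variation: spread relative to the regional mean.
summary['cv'] = summary['std'] / summary['mean']

print(summary.sort_values('cv'))
```

The `count` column matters when reading the result: a region with only a handful of countries will show wide error bars in the bar plot even if its members are not that different.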

In [None]:
df[df['Regional indicator']=='North America and ANZ']

## 3.2. Does a higher GDP mean a higher vaccinated population?

Here we draw a scatter plot of the % of population vaccinated against the logged GDP per capita. We also use the KMeans algorithm to form 3 clusters representing the GDP level each country belongs to, which makes the interpretation easier.
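
To see what "level of GDP" each cluster actually stands for, it can help to inspect the range of values it covers. A sketch on made-up logged GDP values (the numbers and the `cluster` column name are assumptions for illustration):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up logged GDP values forming three loose groups.
toy = pd.DataFrame({'Logged GDP per capita':
                    [6.8, 7.1, 7.4, 9.0, 9.2, 9.5, 10.9, 11.1, 11.4]})

X = toy[['Logged GDP per capita']].to_numpy()
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# The label numbers themselves carry no order; they only identify groups.
toy['cluster'] = km.labels_

# Range of values covered by each cluster, sorted from lowest to highest.
ranges = (toy.groupby('cluster')['Logged GDP per capita']
             .agg(['min', 'max', 'count'])
             .sort_values('min'))

print(ranges)
```

Sorting by `min` shows which numeric label corresponds to the low, medium, and high group, which is worth checking before attaching names to the labels.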

In [None]:
# Cluster
X = df.loc[:,'Logged GDP per capita'].values.reshape(-1,1)

kmeans = KMeans(n_clusters=3, n_init = 3, init = "random", random_state = 42)
kmeans.fit(X)

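# KMeans label numbers are arbitrary, so rank the clusters by their centre
# before naming them (lowest centre -> 'Low GDP').
ranked = np.argsort(kmeans.cluster_centers_.ravel()).tolist()
Mmap = dict(zip(ranked, ['Low GDP', 'Medium GDP', 'High GDP']))
df['GDP_Cluster'] = pd.Series(kmeans.labels_, index=df.index).map(Mmap)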


# Viz
fig = px.scatter(data_frame=df,
    x='Logged GDP per capita',
    y='% of population vaccinated',
    color='GDP_Cluster',
    template="plotly_white",
    hover_name='Country',
    hover_data=['Regional indicator', '% of population vaccinated', 'Logged GDP per capita', 'Social support'])

fig.show()

gdp = df.groupby('GDP_Cluster')['% of population vaccinated'].mean().sort_values()
gdp = pd.DataFrame(gdp)
gdp = gdp.rename(columns={'% of population vaccinated': 'average % of population vaccinated'})
display(gdp)

We can see from the graph that GDP is related to the percentage of population vaccinated: the higher the GDP, the higher the share of people vaccinated.

If we explore the graph, we notice that there is one high GDP country with a low vaccinated population (Saudi Arabia), and medium GDP countries with a high vaccinated population (Mongolia, Maldives and Uruguay). This phenomenon could have two explanations:

1. Outstanding or poor management of agreements with the labs
2. A low/high level of cooperation between countries of a region (as seen in section 3.1)

## 3.3 What makes Urugay and Saudi Arabia special?

From the graph above, we notice that Saudi Arabia has a low percentage of people vaccinated and a high GDP. On the other hand, Uruguay is a medium GDP country but has a high % of vaccinated people. So what makes these countries special?

To answer this question, we are going to calculate the data of high income countries and compare it with Uruguay and Saudi Arabia. We are also going to compare the data of these two countries with the data of the countries in their region.

In [None]:
europe = df['Regional indicator'] == "Western Europe"
europe2 = df[europe]
europe2 = europe2.groupby('Regional indicator')[list(europe2.keys()[2:-1])].mean().reset_index()

latin = df['Regional indicator'] == "Latin America and Caribbean"
latin2 = df[latin]
latin2 = latin2.groupby('Regional indicator')[list(latin2.keys()[2:-1])].mean().reset_index()

uruguay = df['Country'] == "Uruguay"
uruguay2 = df[uruguay]
del uruguay2['Regional indicator']
uruguay2 = uruguay2.rename(columns={'Country':'Regional indicator'})
uruguay2 = uruguay2.iloc[:,:-1]

saudi = df['Country'] == "Saudi Arabia"
saudi2 = df[saudi]
del saudi2['Regional indicator']
saudi2 = saudi2.rename(columns={'Country':'Regional indicator'})
saudi2 = saudi2.iloc[:,:-1]

special = pd.concat([europe2, uruguay2, saudi2, latin2], axis= 0)

fig, ax = plt.subplots(3,2, figsize=(15,15))

sns.barplot(x='Regional indicator', y='% of population vaccinated', data=special, ax=ax[0,0], palette="vlag")
sns.barplot(x='Regional indicator', y='% of population fully vaccinated', data=special, ax=ax[0,1], palette="vlag")
sns.barplot(x='Regional indicator', y='Ladder score', data=special, ax=ax[1,0], palette="vlag")
ax[1,0].set_yscale("log")
sns.barplot(x='Regional indicator', y='Logged GDP per capita', data=special, ax=ax[1,1], palette="vlag")
ax[1,1].set_yscale("log")
sns.barplot(x='Regional indicator', y='Social support', data=special, ax=ax[2,0], palette="vlag")
ax[2,0].set_yscale("log")
sns.barplot(x='Regional indicator', y='Healthy life expectancy', data=special, ax=ax[2,1], palette="vlag")
ax[2,1].set_yscale("log")


plt.show()
display(special)

**Uruguay**
We can tell from the graphs and the table that Uruguay is doing well in the vaccination process, even better than the Western Europe average and far better than its own region, Latin America and the Caribbean. Note that Uruguay does not have a higher GDP than the Western Europe average. But if we keep examining the graphs, we see that Uruguay has outstanding social support. A couple of hypotheses then come to mind.

- Is Uruguay doing better than Europe and Latin America because of the importance it places on social support?
- Is Western Europe doing worse because of its level of cooperation in distribution? This could also explain why Uruguay is so far from the Latin American average, since the level of cooperation in that region is quite low.

**Saudi Arabia**
Saudi Arabia is more of a special case than Uruguay. The country has decent numbers in all of its indicators. Yet the dataset still shows no vaccinated people.

To answer this question I had to do further research online and found that Saudi Arabia has over 53% of its population vaccinated. Hence, the data for Saudi Arabia in the dataset is not accurate and should be discarded.
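
Discarding a row judged unreliable can be done with a simple filter. A sketch on a made-up frame (the values are placeholders, and `unreliable` is a hypothetical list name):

```python
import pandas as pd

# Toy frame; the percentages are placeholders, not real figures.
toy = pd.DataFrame({
    'Country': ['Saudi Arabia', 'Uruguay', 'Chile'],
    '% of population vaccinated': [0.0, 1.0, 2.0],
})

# Countries whose figures are judged unreliable (hypothetical list).
unreliable = ['Saudi Arabia']

# Keep every row whose country is not in the list.
cleaned = toy[~toy['Country'].isin(unreliable)].reset_index(drop=True)

print(cleaned)
```

Keeping the excluded names in a list documents the decision and makes it easy to extend if other rows turn out to be inaccurate.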

## 3.4. Does Social support and Healthy life expectany influence % of vaccinated people?

In this section we are going to use a scatter plot again. But, since we have three variables, we are not going to cluster them but use the parameters "size" and "hue" to graph the three of them.

In [None]:
fig, ax = plt.subplots(figsize=(15,7))

sns.scatterplot(x='Healthy life expectancy', y='% of population vaccinated', data=df, 
                hue='Social support',palette="Blues_d", 
                size='Social support', sizes=(0.3, 300))

plt.show()

The graph suggests that there is indeed a positive relationship between the % of vaccinated people, healthy life expectancy and social support.

## 3.5. Is a vaccinated country a happy one?

To answer this question we are going to use the ladder score. The higher the ladder score, the happier a country is.

In [None]:
fig, ax = plt.subplots(figsize=(15,7))
sns.regplot(x="Ladder score", y="% of population vaccinated", data=df)

plt.show()

There is indeed a trend showing that the higher a country sits on the happiness report ladder, the more of its people are vaccinated. Note that countries reach this level of happiness through a combination of economic and social factors, some analysed earlier and others dropped to keep the analysis focused. This is a clear example of correlation not implying causation. But that does not take away the fact that a happy country tends to be a vaccinated one.
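
The strength of such a trend can be summarised with a correlation coefficient and the slope of the fitted line. A sketch on made-up values standing in for the ladder score and the % of population vaccinated:

```python
import numpy as np

# Made-up values: x stands in for the ladder score, y for % vaccinated.
x = np.array([4.0, 5.0, 6.0, 7.0])
y = np.array([10.0, 18.0, 35.0, 50.0])

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(x, y)[0, 1]

# Least-squares line y = slope * x + intercept, the same kind of fit a regression plot draws.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"Pearson r = {r:.2f}, slope = {slope:.1f} percentage points per ladder unit")
```

A high `r` only describes how well a straight line fits; it says nothing about which variable drives the other, which is the caveat raised above.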

# 4. Conclusions
- Countries within the same region are at different vaccination stages, which suggests there is little evidence of cooperation between them; this shows in the spread around the mean in section 3.1. That is not the case for Western Europe, where countries have a similar % of vaccinated population.

- The trend in section 3.2. shows that the % of vaccinated population is directly related to the GDP of a country: the higher the GDP, the higher the % of vaccinated population. This is not the case for Uruguay, a medium GDP country with a high % of vaccinated population. This could be explained by the emphasis the country places on social support, a factor that is higher there than in almost all of the high GDP countries.

- A happy country is a vaccinated one. But do not rush to a direct causation. A happy country is the result of a combination of economic and social factors that usually characterise a developed country. I would therefore rather say that a developed country is a vaccinated one.

# 5. Recommendations

- What is seen as a competition over which country has the highest vaccinated population should instead be seen as a question of which region distributes its resources best, even helping poorer countries. In the analysis we see that Uruguay is far ahead of the other countries in its region, which is a sign of poor regional coordination. Uruguay's work has been outstanding, but the cooperation problem in the region is evident and extends to other areas. It is therefore recommended to improve relationships within each region for better distribution and further cooperation in other areas.

- The role of social support in Uruguay is also evident, and the data backs it up with the country's level of vaccinated population. A country that cares about its people will also provide for them. The recommendation is to invest in entities and programs that promote social support.