In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![](https://msfaccess.org/sites/default/files/styles/msf_mobile/public/2020-10/covid-19-vaccine-equity.png?itok=VyK1Qxnb)

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

1.  [Introduction](#0)<br>
2.  [Problem Statement & Data Content](#2)<br>
3.  [Imports](#2)<br>
4.  [Downloading & Prepping Data](#2)<br>
5.  [Geographic Analysis](#4) <br>
6.  [Demographic Analysis](#6) <br>
7.  [Political Analysis](#8) <br>
8.  [Other Factors](#10) <br>
9.  [Discussion & Conclusion](#10) <br>
    </div>
    <hr>

## 1.  Introduction 

COVID-19 is an infectious disease caused by a newly discovered coronavirus. The first case was identified at the end of 2019 in Wuhan, a city in the Hubei Province of China. The disease spread rapidly, causing an epidemic throughout China, followed by a global pandemic.

To strengthen the global effort to prevent and slow the transmission of COVID-19, scientists from different countries achieved an astounding feat: they developed several safe and effective vaccines less than a year after the virus was isolated and sequenced.

By the end of 2020, more than 30 million vaccine doses had already been administered (WHO, 2021). However, the global vaccine rollout has exposed glaring inequalities in access to this life-saving tool.

## 2. Problem Statement & Data Content

We will combine the vaccination dataset with the UN dataset to capture what influences vaccination programmes and their success. We will also use statistical data analysis to answer the question:
Politics, economy, demography: which of these factors influence vaccination?

#### Data
1. country_profile_variables.csv: contains the indicator variables of all the countries present in UNData.
2. kiva_country_profile_variables.csv: contains the indicator variables of the countries present in the Kiva Crowdfunding dataset.
3. country_vaccinations.csv: contains the vaccination data of the countries.
4. CPI2020_GlobalTablesTS_210125.xlsx: contains the 2020 CPI scores, which rank countries/territories by their perceived levels of public-sector corruption (Transparency International, 2020).
5. VaccinsOrigin_Country.xlsx: contains the approved vaccines and their origin countries (CFRA, 2020).

#### Content
- The UN country-profile dataset contains key statistical indicators for each country. It covers 4 major sections:
1. General Information
2. Economic Indicators
3. Social Indicators
4. Environmental & Infrastructure Indicators




## 3. Imports   

In [None]:
import os
# to interact with the operating system 

import numpy as np

import pandas as pd
# data structure tool for data manipulation and analysis

# datetime ships with the Python standard library, so no pip install is needed
from datetime import date

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib.pyplot import figure
!conda install -c conda-forge folium=0.5.0 --yes
import folium
# for data visualization

! pip install openpyxl
# to read excel files

import warnings
warnings.filterwarnings('ignore')

## 4. Downloading & Prepping Data

In [None]:
# read data
df_vaccinations = pd.read_csv("../input/covid-world-vaccination-progress/country_vaccinations.csv")
df_country_variables = pd.read_csv("../input/country-variables/country_profile_variables.csv")
df_kiva = pd.read_csv("../input/kiva-profile/kiva_country_profile_variables.csv")
CPI2020 = pd.read_excel("../input/cpi-2020/CPI2020.xlsx")
Vccines_Origin_Country = pd.read_excel("../input/vaccines-origin-countries/VaccinsOrigin_Country.xlsx")

In [None]:
# data exploration
df_vaccinations.head()

In [None]:
# list the data types for each column
print(df_vaccinations.dtypes)

In [None]:
# explore the date range in which data were logged
# first convert the date column from object to datetime
df_vaccinations['date'] = pd.to_datetime(df_vaccinations['date'])
# get the date range
print("The dataset contains vaccination information from:", df_vaccinations['date'].min(), "to:", df_vaccinations['date'].max())

In [None]:
# filter the country vaccinations dataset by dropping unnecessary columns
coutry_vacc_fltrd = df_vaccinations.drop(["iso_code", "date","daily_vaccinations_raw",
                                          "people_vaccinated_per_hundred","people_fully_vaccinated_per_hundred",
                                          "source_name" , "source_website"] , axis=1)

In [None]:
# grouping by country
country_vac_grpd = coutry_vacc_fltrd.groupby(['country'],as_index=False).sum()
print ( "There are " ,country_vac_grpd["country"].nunique() , "different countries in the dataset") 
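
A caveat on the aggregation above: if `total_vaccinations` is reported as a running (cumulative) total per date, as its name suggests, then summing it across dates counts earlier doses repeatedly. The sketch below uses a small made-up frame (the values are illustrative assumptions, not taken from the dataset) to contrast `sum` with taking the latest cumulative value via `max`.

```python
import pandas as pd

# Hypothetical toy data: two countries, cumulative totals reported per date.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                            "2021-01-01", "2021-01-02"]),
    "total_vaccinations": [100, 250, 400, 10, 30],   # running totals
    "daily_vaccinations": [100, 150, 150, 10, 20],   # per-day increments
})

# Summing a running total over-counts the earlier days.
summed = toy.groupby("country", as_index=False)["total_vaccinations"].sum()

# The maximum of a running total is the latest cumulative value;
# per-day increments, by contrast, are the columns that should be summed.
latest = toy.groupby("country", as_index=False).agg(
    total_vaccinations=("total_vaccinations", "max"),
    doses_from_daily=("daily_vaccinations", "sum"),
)

print(summed)
print(latest)
```

In this toy example the `max` of the running total agrees with the sum of the daily increments, while the summed running total is noticeably larger.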

In [None]:
# let's display the whole countries list to remove any duplication 
pd.set_option("display.max_rows", None, "display.max_columns", None)
country_vac_grpd

In [None]:
# check the missing data
country_vac_grpd.isnull().sum()

In [None]:
# explore countries' variables
df_country_variables

In [None]:
# merge country vaccinations dataset into UN dataset
vacc_region= pd.merge(country_vac_grpd[['country','total_vaccinations']], df_country_variables[['country','Region']], left_on='country',right_on='country', how='outer').dropna()
print ("Number of countries in vaccination dataset is: " , country_vac_grpd["country"].nunique(), "while it is:" ,
       vacc_region["country"].nunique() , "in the merged dataset")

In [None]:
# find the countries that are in the vaccination dataset but missing from the merged dataset
country_vac = pd.Index(country_vac_grpd.country)
country_UN = pd.Index(vacc_region.country)
country_vac.difference(country_UN).values

* Naming mismatches: Faeroe Islands (Faroe Islands), United States (United States of America), and Russia (Russian Federation).
* The United Kingdom (UK) is made up of England, Scotland, Wales and Northern Ireland.
* The United Nations recognises Northern Cyprus as a territory of Cyprus.
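
As a complementary check, name mismatches like these can also be surfaced with the `indicator=True` option of `merge`. The sketch below is a minimal illustration on made-up stand-in tables (the frames, values, and `name_map` are assumptions for demonstration, not the notebook's data).

```python
import pandas as pd

# Hypothetical stand-ins for the vaccination and UN tables.
vacc = pd.DataFrame({"country": ["Russia", "Germany", "United States"],
                     "total_vaccinations": [5, 7, 9]})
un = pd.DataFrame({"country": ["Russian Federation", "Germany",
                               "United States of America"],
                   "Region": ["EasternEurope", "WesternEurope", "NorthernAmerica"]})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'.
checked = vacc.merge(un, on="country", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "country"].tolist()
print(unmatched)

# Map the vaccination-side names onto the UN spelling, then merge again.
name_map = {"Russia": "Russian Federation",
            "United States": "United States of America"}
vacc["country"] = vacc["country"].replace(name_map)
rechecked = vacc.merge(un, on="country", how="left", indicator=True)
print(rechecked["_merge"].value_counts())
```

After the renaming step every row should be flagged `both`, which confirms that no country is silently dropped by the merge.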

In [None]:
# in the country vaccinations dataset, align country names with the UN dataset
# (the result is assigned back; replace(..., inplace=False) returns a copy that would otherwise be discarded)
country_name_map = {
    'Faeroe Islands': 'Faroe Islands',
    'United States': 'United States of America',
    'Russia': 'Russian Federation',
    # merge England, Scotland, Wales and Northern Ireland into the United Kingdom (UK)
    # (if the data also has its own 'United Kingdom' row, drop these four instead to avoid double counting)
    'England': 'United Kingdom',
    'Wales': 'United Kingdom',
    'Scotland': 'United Kingdom',
    'Northern Ireland': 'United Kingdom',
    # merge Northern Cyprus into Cyprus
    'Northern Cyprus': 'Cyprus',
}
country_vac_grpd['country'] = country_vac_grpd['country'].replace(country_name_map)

#re-grouping
country_vac_re_grpd = country_vac_grpd.groupby(['country'],as_index=False).sum()

#re_merging
vacc_region_remerge= pd.merge(country_vac_re_grpd[['country','total_vaccinations']], df_country_variables[['country','Region']],
                              left_on='country',right_on='country', how='outer').dropna()
vacc_region_remerge

## 5. Geographic Analysis 

In [None]:
# let's have a new group by regions
region_grpd = vacc_region_remerge.groupby(['Region'],as_index=False).sum()
region_grpd_sorted = region_grpd.sort_values( by = 'total_vaccinations', ascending = False)
region_grpd_sorted

In [None]:
plt.rcParams['figure.figsize'] = (12, 10)
fig, ax = plt.subplots()
bars = ax.bar(x=region_grpd_sorted['Region'] , height=region_grpd_sorted['total_vaccinations'],
              color=['blue', 'red', 'green', 'purple', 'cyan', 'yellow','pink', 'grey'],edgecolor='black', width=0.8, tick_label= None)


# Style the tick labels, axis labels, and title.
plt.yticks(fontsize = 15)
plt.xticks(fontsize = 15 , rotation = 90)
plt.xlabel('Regions' ,labelpad= 20, loc = 'center' ,fontsize = 20)
plt.ylabel('Total Vaccinations' ,labelpad= 20, loc = 'center', fontsize = 20)
plt.title('Total Vaccinations by Regions', fontsize = 30 )

ax.spines['bottom'].set_color('#DDDDDD')
ax.tick_params(bottom=False, left=False)
ax.set_axisbelow(True)
ax.yaxis.grid(True, color='#EEEEEE')
ax.xaxis.grid(False)
# Grab the color of the bars so we can make the
# text the same color.
bar_color = bars[0].get_facecolor()

# Make the chart fill out the figure better.
fig.tight_layout()

### The bar chart above shows that Northern America and Northern Europe are the regions where the most vaccinations have been administered, whereas Africa and the Caribbean have the fewest.

In [None]:
# let's generate a choropleth map of total vaccinations
# download the countries geojson file
!wget --quiet https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/world_countries.json
world_geo = r'world_countries.json' # geojson file
# create a plain world map
world_map = folium.Map(location=[0, 0], zoom_start=2)  
# generate choropleth map using the total Vaccinations
world_map.choropleth(
    geo_data=world_geo,
    data=vacc_region_remerge,
    columns=['country', 'total_vaccinations'],
    key_on='feature.properties.name',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Total Vaccinations'
)

# display map
world_map

### Per the choropleth map legend, the darker a country's color and the closer it is to red, the higher the number of vaccinations administered in that country.

## 6. Demographic Analysis 

In [None]:
# merge the vaccination totals with the demographic and economic indicators
demographics = pd.merge(vacc_region_remerge[['country','total_vaccinations']], df_country_variables[
    ['country','Region', 'Population in thousands (2017)', 'Sex ratio (m per 100 f, 2017)',
     'GDP per capita (current US$)', 'Health: Total expenditure (% of GDP)'  ]]
                        , left_on='country',right_on='country', how='outer').dropna()
demographics

In [None]:
# get the missing data in our new merged df
demographics.replace(['-99', -99 ], np.nan, inplace=True)
demographics.isnull().sum()

### Since the total vaccinations field has no missing values, and assuming that a person typically receives 2 doses to be fully vaccinated, we can estimate the number of fully vaccinated people in a given country as half of its total vaccinations.

In [None]:
# let's prepare the dataset accordingly
demographics ['estimated_fully_vaccinated_people'] = (demographics ['total_vaccinations']/2)

In [None]:
## Generate/estimate more fields that may help us in the statistical analysis.
# add percentage of vaccination of the total population
demographics['percent of vaccitaion to population (%)'] = (demographics['estimated_fully_vaccinated_people'] /
                                           demographics['Population in thousands (2017)'])/10

# add male percentage of the total population
demographics['male percentage of population (%)'] = (demographics['Sex ratio (m per 100 f, 2017)'] /
                                          (demographics['Sex ratio (m per 100 f, 2017)'] + 100))*100

# add female percentage of the total population
demographics['female percentage of population (%)'] = (100- (demographics['male percentage of population (%)']))

# estimate vaccinated males (the share is a percentage, so divide by 100)
demographics['vaccinated_males'] = ((demographics['estimated_fully_vaccinated_people'])
                                    *(demographics['male percentage of population (%)']) / 100)

# estimate vaccinated females (the share is a percentage, so divide by 100)
demographics['vaccinated_females'] = ((demographics['estimated_fully_vaccinated_people'])
                                    *(demographics['female percentage of population (%)']) / 100)

pd.set_option("display.max_rows", None, "display.max_columns", None)
demographics
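
The unit handling in the derived fields is easy to get wrong, so here is a minimal worked sketch of the same arithmetic. The function names and input numbers are hypothetical, and it assumes, as the column name states, that population is given in thousands. Note that a percentage has to be divided by 100 before it is used as a multiplier.

```python
def percent_fully_vaccinated(total_doses, population_thousands, doses_per_person=2):
    """Estimated fully vaccinated people as a percentage of the population."""
    people = total_doses / doses_per_person
    # population is in thousands, so scale it to persons before dividing
    return people / (population_thousands * 1000) * 100


def male_share_percent(males_per_100_females):
    """Convert a sex ratio (males per 100 females) into the male share in percent."""
    return males_per_100_females / (males_per_100_females + 100) * 100


# Hypothetical country: 2,000,000 doses given, population of 10,000 thousand.
pct = percent_fully_vaccinated(total_doses=2_000_000, population_thousands=10_000)
male_pct = male_share_percent(100)             # equal numbers of males and females
vaccinated_males = (2_000_000 / 2) * male_pct / 100

print(pct, male_pct, vaccinated_males)
```

Dividing by `population_thousands * 1000` and multiplying by 100 is equivalent to the notebook's `/ population_in_thousands / 10`.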

### 6.1 Demographic Analysis by Country

In [None]:
# explore the correlations between the country variables and the estimated fully vaccinated people
fig = plt.figure(figsize=(20,10))
numeric_cols = demographics.select_dtypes(include='number')  # corr() is only defined for numeric columns
ax = sns.heatmap(numeric_cols.interpolate(limit_area='inside').dropna(axis=0).corr(), cmap="RdBu_r", annot=True, fmt=".2f")
ax.set_title('Correlation Coefficient Heatmap of Country Variables and Estimated Fully_vaccinated People')
plt.show()

### The Impact of GDP Per Capita
Before we examine the correlation between the percentage of fully vaccinated people and GDP per capita, we need to separate out the countries whose estimated fully vaccinated people exceed their population, so that we can analyse them separately.

In [None]:
# split our data (each country falls into exactly one group)
countries_over_100 = demographics[demographics['percent of vaccitaion to population (%)'] >= 100]
countries_under_100 = demographics[demographics['percent of vaccitaion to population (%)'] < 100].copy()

In [None]:
# countries whose estimated fully vaccinated people exceed their population
countries_over_100['country'].to_frame()

### This overestimate could have several causes, such as the number of vaccination phases or the share of non-citizen residents. These countries are therefore excluded from the scope of our analysis.

In [None]:
# detect outlier for countries in our scope
countries_under_100.boxplot(column =['GDP per capita (current US$)'], grid = False)

In [None]:
# remove outliers
# use 80,000 US$ as the threshold value for GDP per capita
gdp_col = 'GDP per capita (current US$)'
countries_under_100.loc[countries_under_100[gdp_col] > 80000, gdp_col] = np.nan

# drop every row that contains a NaN
countries_under_100 = countries_under_100.dropna(axis=0)
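
The 80,000 US$ cut-off above is read off the boxplot by eye. As a hedged alternative, the same boxplot convention (the upper whisker at Q3 + 1.5 × IQR) can be computed directly. The helper name `iqr_upper_fence` and the sample values below are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Upper outlier fence: Q3 + k * IQR (the usual boxplot whisker rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Hypothetical GDP-per-capita values with one extreme point.
gdp = pd.Series([1_000, 2_000, 3_000, 4_000, 5_000, 100_000])

fence = iqr_upper_fence(gdp)
kept = gdp[gdp <= fence]      # rows at or below the fence are retained

print(fence)
print(kept.tolist())
```

A data-driven fence adapts to the distribution, whereas a fixed threshold is simpler to explain. Either choice should be stated alongside the results.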

In [None]:
# We can examine the correlation between the percentage of vaccinated people to population and GDP per capita
countries_under_100[["GDP per capita (current US$)", "percent of vaccitaion to population (%)"]].corr()

### The two variables are positively correlated.

In [None]:
# let's look more closely at the correlation with a regression plot
sns.regplot(x="GDP per capita (current US$)", y="percent of vaccitaion to population (%)", data= countries_under_100)
plt.ylim(0,)

### This correlation suggests that people in countries with a higher GDP per capita are more likely to be vaccinated.
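
Because GDP per capita is heavily skewed, a rank-based (Spearman) correlation can serve as a robustness check on the Pearson coefficient used above. The sketch below computes it as the Pearson correlation of the ranks. The toy values and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data (not from the notebook's datasets).
toy = pd.DataFrame({
    "gdp_per_capita": [1_000, 5_000, 10_000, 20_000, 60_000],
    "percent_vaccinated": [0.1, 0.5, 0.4, 2.0, 30.0],
})

# Pearson measures linear association and is sensitive to extreme values.
pearson = toy["gdp_per_capita"].corr(toy["percent_vaccinated"])

# Spearman is the Pearson correlation of the ranks, so it only uses ordering.
spearman = toy["gdp_per_capita"].rank().corr(toy["percent_vaccinated"].rank())

print(pearson, spearman)
```

If the two coefficients disagree strongly, the linear fit in the regression plot is probably being driven by a few extreme countries.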

### The Impact of Health Total expenditure (% of GDP)

In [None]:
# We can examine the correlation between the percentage of fully-vaccinated people and total health expenditure (% of GDP)
countries_under_100[["Health: Total expenditure (% of GDP)", "percent of vaccitaion to population (%)"]].corr()

In [None]:
sns.regplot(x="Health: Total expenditure (% of GDP)", y="percent of vaccitaion to population (%)", data= countries_under_100)
plt.ylim(0,)

### Countries that spend more on their health sector have been able to secure more vaccinations for their people.

### 6.2 Demographic Analysis by Region

In [None]:
# grouping by region
demographics_regs = demographics.groupby(['Region'],as_index=False).mean(numeric_only=True)
# drop outliers
regs_under_100 = demographics_regs[demographics_regs['percent of vaccitaion to population (%)'] <= 100]
# sorting
regs_under_100_sorted = regs_under_100.sort_values( by = 'percent of vaccitaion to population (%)', ascending = False)
regs_under_100_sorted

In [None]:
plt.rcParams['figure.figsize'] = (12, 10)
fig, ax = plt.subplots()
bars = ax.bar(x=regs_under_100_sorted['Region'] , height=regs_under_100_sorted['percent of vaccitaion to population (%)'],
              color=['blue', 'red', 'green', 'purple', 'cyan', 'yellow','pink', 'grey'],edgecolor='black', width=0.8, tick_label= None)


# Style the tick labels, axis labels, and title.
plt.yticks(fontsize = 15)
plt.xticks(fontsize = 15 , rotation = 90)
plt.xlabel('Regions' ,labelpad= 20, loc = 'center' ,fontsize = 20)
plt.ylabel('Percentage of fully-vaccinated people' ,labelpad= 20, loc = 'center', fontsize = 20)
plt.title('The Percentage of Fully-Vaccinated People to Population by Regions', fontsize = 20 )

ax.spines['bottom'].set_color('#DDDDDD')
ax.tick_params(bottom=False, left=False)
ax.set_axisbelow(True)
ax.yaxis.grid(True, color='#EEEEEE')
ax.xaxis.grid(False)
# Grab the color of the bars so we can make the
# text the same color.
bar_color = bars[0].get_facecolor()

# Make the chart fill out the figure better.
fig.tight_layout()

### This graph reveals a shocking picture of vaccine distribution: more than 50% of people in Southern Europe have been vaccinated, compared to less than 1% of people in Northern Africa.

## 7. Political Analysis

#### This political analysis focuses on corruption, a factor that has been strongly related to how countries respond to crises. As Transparency International has noted (Corruption continues to contribute to democratic backsliding during the COVID-19 pandemic. Countries with higher levels of corruption rely on less democratic responses to the crisis).

#### The latest CPI report (2020) was added to the dataset. It contains the 2020 CPI scores, which rank countries/territories by their perceived levels of public-sector corruption (Transparency International, 2020).

In [None]:
# let's have a look at the filtered version
CPI2020.head()

In [None]:
# merge the percentage of vaccination to population with the CPI score
CPI_vs_vac = pd.merge(countries_under_100[['country','percent of vaccitaion to population (%)']], CPI2020[
    ['Country', 'CPI score 2020']]
    , left_on='country',right_on='Country', how='outer').dropna()
CPI_vs_vac

In [None]:
# correlation detection
CPI_vs_vac[["CPI score 2020", "percent of vaccitaion to population (%)"]].corr()

### The percentage of vaccinated people and the CPI score are positively correlated.

In [None]:
# let's get a clearer picture with a regression plot
sns.regplot(x="CPI score 2020", y="percent of vaccitaion to population (%)", data= CPI_vs_vac)
plt.ylim(0,)

### Unsurprisingly, countries that promote transparency have responded better to the COVID-19 crisis in terms of the number of vaccinations administered.

## 8. Other Factors 

#### In this section we investigate whether the countries that developed the currently approved vaccines have prioritised particular countries/regions in vaccine distribution.
#### We add a dataset that contains the approved vaccines and their origin countries (CFRA, 2020).

In [None]:
Vccines_Origin_Country

### The vaccines' origin countries are: 
* United States of America
* Germany
* United Kingdom
* Sweden
* China
* Russian Federation
* India


### Let's look more closely at the origin countries of the approved vaccines in terms of the percentage of fully-vaccinated people in their population.

In [None]:
# filter our data set by vaccines origin countries 
cls =['United States of America' ,'Germany' , 'United Kingdom' , 'Sweden' ,'China' ,'Russian Federation' ,'India']
Vac_Orgn_Ctrs =demographics [demographics['country'].isin(cls)]

# the mean value
print ("The average percentage of fully-vaccinated people in the total population of the vaccine-developing countries is:"
       , int(Vac_Orgn_Ctrs["percent of vaccitaion to population (%)"].mean()) ,"%")

### Unsurprisingly, people in the countries where the approved vaccines were developed have been given priority in the vaccination queue.
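
A single mean is hard to interpret without a baseline, so one way to support this reading is to compare the origin countries with all other countries. The sketch below does that on made-up values. The frame, the `origin_countries` list, and the column names are hypothetical stand-ins for `demographics` and `cls`.

```python
import pandas as pd

# Hypothetical toy data standing in for the demographics table.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "percent_vaccinated": [12.0, 8.0, 1.0, 2.0, 3.0],
})
origin_countries = ["A", "B"]   # hypothetical vaccine-developer countries

# Flag each row, then summarise the two groups side by side.
is_origin = toy["country"].isin(origin_countries).rename("vaccine_origin")
comparison = toy.groupby(is_origin)["percent_vaccinated"].agg(["mean", "median", "count"])

print(comparison)
```

Reporting the median and the group sizes alongside the mean shows whether one or two countries dominate the comparison.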

## 9. Discussion & Conclusion

#### With the outbreak of the global COVID-19 pandemic and its devastating effects, the COVID-19 vaccine has become a life-saving tool. Despite the clinical success, there has been increasing concern about inequalities in vaccine distribution. The aim of this project has therefore been to assess the key factors that affect the global distribution of COVID-19 vaccines. In particular, this work seeks to address whether politics, economy, and/or demography have an impact on vaccine access. By applying standard data-analysis techniques, with an in-depth analysis of the provided and added data, this study offers important insights into the factors that influence the global distribution of COVID-19 vaccines.

#### Comparing the total vaccinations administered across world regions shows that the allocation of vaccines has so far been unfair. The most interesting finding is the size of the gap: for example, the whole of Africa has administered **less than 1%**, while Northern America accounts for nearly **40% of the total world vaccinations.**

#### Geographically, our analysis found that **people in Europe and Northern America are the most likely to be fully vaccinated, with percentages ranging from 15 to 51%, whereas African people collectively are the least likely, with a percentage of 1%.** In other words, assuming the same daily vaccination rate and the typical two-dose regimen, we estimate that people in Europe and Northern America will be fully vaccinated by March 2022, whereas people in Africa would not be fully vaccinated until 2037.
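
A projection of this kind follows from simple arithmetic: remaining doses divided by the daily rate. The sketch below shows one way to compute it, under the assumptions stated above (constant daily rate, two doses per person). The function name and all input numbers are hypothetical and are not the figures behind the dates quoted above.

```python
import math
from datetime import date, timedelta


def projected_completion(population, doses_given, daily_doses, start, doses_per_person=2):
    """Date on which everyone is fully vaccinated at a constant daily dose rate."""
    if daily_doses <= 0:
        raise ValueError("daily_doses must be positive")
    remaining = max(population * doses_per_person - doses_given, 0)
    days = math.ceil(remaining / daily_doses)
    return start + timedelta(days=days)


# Hypothetical region: 1,000,000 people, 200,000 doses given, 10,000 doses per day.
eta = projected_completion(population=1_000_000, doses_given=200_000,
                           daily_doses=10_000, start=date(2021, 3, 1))
print(eta)
```

The estimate is very sensitive to the daily rate, so a modest change in rollout speed shifts the projected date by months or years.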

#### There are, however, several possible explanations for these unanticipated findings. Our demographic, economic, and political analysis suggests that the significant inequalities in vaccine access can be attributed to many factors. In conclusion, **people are more likely to be vaccinated if they come from countries that:**

### * Have a higher CPI score.
### * Have a higher total health expenditure (% of GDP).
### * Have a higher GDP per capita.
### * Developed an approved vaccine.

#### The main finding of our correlation analysis is the positive correlation between a country's CPI score and the share of its population that has been vaccinated. This finding further supports Transparency International's statement that **(corruption undermines the global health response to COVID-19)**.



