This is my first ever submission, so I would really appreciate it if you could please leave an upvote if you find my analysis useful, and please give me feedback as to how I can make my workbook and analysis better. Thank you in advance.

I will be exploring the following in my analysis:

**1) Vaccines used by each country and the overall usage of different vaccines worldwide**

**2) Countries that have the most advanced vaccination programmes and are making the most progress in getting their population vaccinated**

**3) Correlation between GDP/Capita of a country and vaccination rates** 

**4) Trends of vaccinations and overall worldwide vaccination rate**

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns

In [None]:
pwd

In [None]:
df = pd.read_csv("/kaggle/input/covid-world-vaccination-progress/country_vaccinations.csv")

df.head(10)

In [None]:
df.columns

The data contains the following information:

**Country** - this is the country for which the vaccination information is provided;

**Country ISO Code** - ISO code for the country;

**Date** - date for the data entry; for some of the dates we have only the daily vaccinations, for others, only the (cumulative) total;

**Total number of vaccinations** - this is the absolute number of total immunizations in the country;

**Total number of people vaccinated** - a person, depending on the immunization scheme, will receive one or more (typically 2) doses; at a certain moment, the number of vaccinations might be larger than the number of people;

**Total number of people fully vaccinated** - this is the number of people that received the entire set of immunization according to the immunization scheme (typically 2); at a certain moment in time, there might be a certain number of people that received one vaccine and another number (smaller) of people that received all vaccines in the scheme;

**Daily vaccinations (raw)** - for a certain data entry, the number of vaccinations for that date/country;

**Daily vaccinations** - for a certain data entry, the number of vaccinations for that date/country;

**Total vaccinations per hundred** - ratio (in percent) between vaccination number and total population up to the date in the country;

**Total number of people vaccinated per hundred** - ratio (in percent) between population immunized and total population up to the date in the country;

**Total number of people fully vaccinated per hundred** - ratio (in percent) between population fully immunized and total population up to the date in the country;

**Daily vaccinations per million** - ratio (in ppm) between vaccination number and total population for the current date in the country;

**Vaccines used in the country** - the vaccine or combination of vaccines used in the country (up to date);

**Source name** - source of the information (national authority, international organization, local organization etc.);

**Source website** - website of the source of information;

*Changing the column names of the dataset*

In [None]:
headers = ["Country","ISO Code","Date","Total Vaccinations","People Vaccinated","People Fully Vaccinated","Daily Vaccinations (Raw)","Daily Vaccinations","Total Vaccinations (per hundred)","People Vaccinated (per hundred)","People Fully Vaccinated (per hundred)","Daily Vaccinations (per million)","Vaccine","Source Name","Source Website"]

df.columns = headers

df.columns

*Checking the data types of the variables in our dataset to see if they are correct*

In [None]:
df.dtypes
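One thing a data-type check commonly turns up is that date columns read from a CSV are stored as plain text rather than as dates. The sketch below shows one way to convert them, using a small made-up frame (`toy`, with invented values) instead of the real dataset; it assumes the dates are written in a format that `pd.to_datetime` can infer, such as `YYYY-MM-DD`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Date": ["2021-01-03", "2021-01-01", "2021-01-02"],
    "Daily Vaccinations": [120.0, 100.0, 50.0],
})

# Dates read from a CSV arrive as text, not as datetime64 values.
print(toy["Date"].dtype)

# Convert to datetime64 so that sorting and date arithmetic behave as expected.
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.sort_values(["Country", "Date"]).reset_index(drop=True)
print(toy.dtypes)
```

With a true datetime column, grouping by date and plotting time series later on should order the entries chronologically rather than alphabetically.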

*Dropping columns that are not needed for the analysis*

In [None]:
df.drop(labels=["Source Name","Source Website","Daily Vaccinations (Raw)"], axis=1, inplace=True)

df.columns

In [None]:
df.head()

*Checking the null values in the dataset*

In [None]:
df.isnull().sum()

Most of the variables in the dataset have a large number of null values. Since most of these variables are important to the analysis, it would not make sense to remove the columns, so it is best to leave them as they are.
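To put a number on how much is missing, the null counts can be expressed as a percentage of rows. This is a minimal sketch on a made-up frame (`toy`, invented values), not output from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (values are made up).
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Total Vaccinations": [10.0, np.nan, np.nan, 40.0],
    "Daily Vaccinations": [np.nan, 5.0, 6.0, 7.0],
})

# isnull().mean() gives the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Applying the same expression to `df` would show which columns are sparse enough that a per-country maximum or mean should be interpreted with care.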

**Finding out which country is using what vaccine or combination of vaccines**

In [None]:
vaccine = df.groupby("Country", as_index=False)["Vaccine"].max()

vaccine

**Finding out which vaccine or combination of vaccines are being used the most in the world currently**

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(200,10))

sns.countplot(x="Vaccine", data=vaccine, order=vaccine["Vaccine"].value_counts().index)

plt.tight_layout()
plt.ylabel("# of Countries", fontsize=12)
plt.xlabel("Vaccine/Combination of Vaccines", fontsize=12)

As shown in the diagram above, Oxford/AstraZeneca is currently used by the largest number of countries in the world. Pfizer/BioNTech and the combination of Moderna, Oxford/AstraZeneca & Pfizer/BioNTech are the second-most used vaccine and vaccine combination, respectively.
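The count above treats each combination string as its own category, so a vaccine that appears in many combinations is split across several bars. A possible complement is to count each individual vaccine once per country. The sketch below uses a made-up frame (`vaccine_toy`) and assumes that combinations are written as names separated by a comma and a space; that separator is an assumption and should be checked against the real column.

```python
import pandas as pd

# Toy stand-in for the per-country "Vaccine" table (values are made up).
vaccine_toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Vaccine": [
        "Oxford/AstraZeneca",
        "Moderna, Oxford/AstraZeneca, Pfizer/BioNTech",
        "Pfizer/BioNTech",
    ],
})

# Split each combination into a list, give every vaccine its own row,
# then count the distinct countries using each vaccine.
per_vaccine = (
    vaccine_toy.assign(Vaccine=vaccine_toy["Vaccine"].str.split(", "))
    .explode("Vaccine")
    .groupby("Vaccine")["Country"]
    .nunique()
    .sort_values(ascending=False)
)
print(per_vaccine)
```

This answers a slightly different question (how many countries use a vaccine at all, alone or in combination), so the ranking may differ from the combination-based count.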

In [None]:
daily_vaccine = df.groupby("Vaccine", as_index=False)["Daily Vaccinations"].sum().sort_values("Daily Vaccinations",ascending=False).reset_index(drop=True)

daily_vaccine

In [None]:
daily_vaccine2 = daily_vaccine.head(10)

sns.set_style('darkgrid')
plt.figure(figsize=(50,30))

sns.barplot(x="Daily Vaccinations", y="Vaccine", data=daily_vaccine2, palette='autumn')

plt.tight_layout()
plt.xlabel("# of Vaccinations", fontsize=25)
plt.ylabel("Vaccine/Combination of Vaccines", fontsize=25)
plt.xticks(fontsize=25)
plt.yticks(fontsize=25)

daily_vaccine2

Going purely by the total number of daily vaccinations administered, the combination of CanSino and Sinopharm (Chinese vaccines) is the most used, followed by the combination of Johnson&Johnson, Moderna & Pfizer/BioNTech and the combination of Oxford/AstraZeneca & Pfizer/BioNTech.

**Finding out which countries have the most advanced vaccination programmes and are making the most progress in getting their populations fully vaccinated**

In [None]:
vaccinations_country = df.groupby("Country", as_index=False)["Daily Vaccinations"].sum().sort_values("Daily Vaccinations",ascending=False).reset_index(drop=True)

vaccinations_country

In [None]:
vaccinations_country3 = vaccinations_country.head(10)

sns.set_style('darkgrid')
plt.figure(figsize=(50,30))

sns.barplot(x="Daily Vaccinations", y="Country", data=vaccinations_country3, palette='spring')

plt.tight_layout()
plt.xlabel("# of Vaccinations", fontsize=35)
plt.ylabel("Country", fontsize=35)
plt.xticks(fontsize=25)
plt.yticks(fontsize=25)

vaccinations_country3

As shown above, China leads the way with the highest number of vaccinations, followed by the USA, India and the United Kingdom. This result makes sense since the three most used vaccines (on the basis of daily vaccinations) are produced by three of the four countries mentioned above. Since these countries produced these vaccines, they would be among the first to start vaccinating their populations, and hence have a higher number of overall vaccinations.

But this result does not take the population of each country into account. A country with an advanced vaccination programme that is making the most progress towards getting its population fully vaccinated would have high vaccination rates relative to its population. Our dataset has four such measures, namely:

1) *Total Vaccinations (per hundred)*

2) *People Vaccinated (per hundred)*         

3) *People Fully Vaccinated (per hundred)*  

4) *Daily Vaccinations (per million)*       

But since the first three measures have a high number of Null values (as shown at the start of the analysis), it would be best to use the fourth measure.

In [None]:
vaccinations_mil = df.groupby("Country", as_index=False)["Daily Vaccinations (per million)"].mean().sort_values("Daily Vaccinations (per million)",ascending=False).reset_index(drop=True)

vaccinations_mil

In [None]:
vaccinations_mil2 = vaccinations_mil.head(10)

sns.set_style('darkgrid')
plt.figure(figsize=(50,30))

sns.barplot(x="Daily Vaccinations (per million)", y="Country", data=vaccinations_mil2, palette='viridis')

plt.tight_layout()
plt.xlabel("# of Vaccinations/million", fontsize=35)
plt.ylabel("Country", fontsize=35)
plt.xticks(fontsize=25)
plt.yticks(fontsize=25)

vaccinations_mil2

These results differ considerably from those in the previous bar graph. Only one country, Israel, appears in both. This shows that Israel is not only administering a high number of vaccine doses daily, but that its average daily vaccination rate relative to its population is also higher than that of most other countries.

Even though the following three measures have a high number of Null Values, we could still use one of them as part of the analysis since they are cumulative:

1) *Total Vaccinations (per hundred)*

2) *People Vaccinated (per hundred)*         

3) *People Fully Vaccinated (per hundred)* 

Since they are cumulative, we just need the maximum value of these measures for each country. This may not give us the most accurate information, but it may help complement our previous analysis.

I believe the third measure would be the best to use for our analysis since people being fully vaccinated would be a good measure of an advanced vaccine programme. 
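The reason the maximum works despite the gaps is that pandas skips missing values when aggregating, and for a cumulative series the largest reported value is also the most recent one. A minimal sketch on a made-up frame (`toy`, invented values):

```python
import numpy as np
import pandas as pd

# Toy cumulative series with gaps (values are made up).
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "People Fully Vaccinated (per hundred)": [np.nan, 1.2, np.nan, 0.4, 0.9],
})

# max() ignores NaN by default, so each country keeps its highest reported value.
latest = toy.groupby("Country", as_index=False)[
    "People Fully Vaccinated (per hundred)"
].max()
print(latest)
```

This assumes the reported cumulative figures never decrease; if a source revises a value downwards, the maximum would overstate the current level.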

In [None]:
fully_vac = df.groupby("Country", as_index=False)["People Fully Vaccinated (per hundred)"].max().sort_values("People Fully Vaccinated (per hundred)",ascending=False).reset_index(drop=True)

fully_vac

In [None]:
fully_vac2 = fully_vac.head(10)

sns.set_style('darkgrid')
plt.figure(figsize=(50,30))

sns.barplot(x="People Fully Vaccinated (per hundred)", y="Country", data=fully_vac2, palette='icefire')

plt.tight_layout()
plt.xlabel("People Fully Vaccinated (per hundred)", fontsize=35)
plt.ylabel("Country", fontsize=35)
plt.xticks(fontsize=25)
plt.yticks(fontsize=25)

fully_vac2

This result, coupled with the previous one, shows that the following 5 countries have the most advanced vaccination programmes, with high daily vaccination and fully vaccinated rates relative to their populations:

1) Gibraltar

2) Israel

3) Saint Helena

4) San Marino

5) Aruba

**Exploring the relationship between GDP/capita of a country and daily vaccinations (per million)**

In [None]:
GDP = vaccinations_mil

GDP

*Selecting a random sample of countries to be used in our correlation study*

In [None]:
GDP = GDP.sample(n=20, random_state=10).reset_index(drop=True)

GDP

*Creating a dataframe with the 2020 GDP/Capita values for the countries in our sample set*

In [None]:
values = [25946,11099,5353,6175,2032,15731,8717,33228,64269,17112,6359,10094,1407,34824,32123,1370,10339,14896,3562,5414]

values2 = pd.DataFrame(values)

values2.columns = ["GDP/Capita($)"]

values2

In [None]:
GDP2 = pd.concat([GDP,values2], axis=1, ignore_index=False)

GDP2

In [None]:
from scipy.stats import pearsonr

sns.jointplot(x="GDP/Capita($)", y="Daily Vaccinations (per million)", data=GDP2, kind='reg')

The graph above shows a few outliers, which may be skewing the results. Therefore, these outliers should be removed.

*Removing the outliers (Taiwan and Saint Helena)*

In [None]:
GDP3 = GDP2.drop(index=[14,16])

GDP3
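The outliers above were picked by eye and dropped by row index. A rule-based alternative, shown here only as a sketch and not as the method used in this notebook, is the common 1.5 × IQR rule: flag any value that lies more than 1.5 interquartile ranges below the first quartile or above the third. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy sample with one obviously extreme value (values are made up).
toy = pd.DataFrame({
    "Country": list("ABCDEFGH"),
    "Daily Vaccinations (per million)": [900, 1100, 1000, 1200, 950, 1050, 1150, 9000],
})

col = "Daily Vaccinations (per million)"

# Quartiles and interquartile range of the column.
q1, q3 = toy[col].quantile([0.25, 0.75])
iqr = q3 - q1

# Keep rows inside [q1 - 1.5*IQR, q3 + 1.5*IQR]; everything else is flagged.
mask = toy[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outliers = toy.loc[~mask, "Country"].tolist()
trimmed = toy[mask]
print(outliers)
```

A rule like this makes the removal reproducible if the random sample changes, although with only 20 countries any automatic rule should still be checked against the plot.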

In [None]:
sns.jointplot(x="GDP/Capita($)", y="Daily Vaccinations (per million)", data=GDP3, kind='reg')

*Calculating the correlation coefficient (r) for the two variables*

In [None]:
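# Restrict to the two numeric columns; the text column "Country" cannot be correlated
GDP3[["GDP/Capita($)", "Daily Vaccinations (per million)"]].corr()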

The results above show that there is a relatively strong positive relationship between the GDP/Capita and Daily Vaccinations (per million) of a country.

This relationship could be explained by the fact that countries with high GDP/Capita values have strong economies. A strong economy, in turn, should allow a country to invest in a good health-care system and facilities. This would provide the right platform to administer vaccinations efficiently and effectively to the country's population, which may explain the positive correlation.
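With a sample this small, it may also help to report a p-value alongside r, which `scipy.stats.pearsonr` returns together with the coefficient. The sketch below uses made-up numbers (`toy`), not the sampled countries, so the printed values say nothing about the real data.

```python
import pandas as pd
from scipy.stats import pearsonr

# Toy data standing in for the sampled countries (values are made up).
toy = pd.DataFrame({
    "GDP/Capita($)": [1500, 3000, 6000, 12000, 25000, 40000],
    "Daily Vaccinations (per million)": [200, 900, 700, 2500, 2100, 4800],
})

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis that the two variables are uncorrelated.
r, p_value = pearsonr(toy["GDP/Capita($)"], toy["Daily Vaccinations (per million)"])
print(f"r = {r:.2f}, p = {p_value:.4f}")
```

A small p-value would suggest the correlation is unlikely to be a sampling accident, but it would still not show that a higher GDP/Capita causes higher vaccination rates.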

**Exploring trends in vaccination and vaccination rates in the world**

*Exploring trends of overall world-wide vaccinations per day*

In [None]:
trend = df.groupby("Date", as_index=False)["Daily Vaccinations"].sum()

trend

In [None]:
import plotly.express as px

fig = px.line(trend, x="Date", y="Daily Vaccinations")

fig.show()


As shown in the graph above, the overall number of daily vaccinations grew quickly until the second week of February, when worldwide vaccinations levelled off at approximately 6M per day. After Feb 23rd, 2021, the number started growing rapidly again, with a couple of dips during April and May. After May 7th, there was a steep increase in daily vaccinations until June 4th, followed by a sharp decline.

*Exploring average world-wide vaccinations per day*

In [None]:
trend2 = df.groupby("Date", as_index=False)["Daily Vaccinations"].mean()

trend2

In [None]:
fig = px.line(trend2, x="Date", y="Daily Vaccinations")

fig.show()


As shown in the graph above, despite the peaks and valleys, the general trend of average daily vaccinations is upward. Since the start of March 2021, there has been a large increase in the worldwide average daily vaccinations. This could be because of the approval of several different vaccines and the increase in availability of these vaccines to countries over the first few months of 2021. A sharp increase is observed after June 8th.

Though these graphs reveal important information about the trends of daily vaccinations worldwide, they do not show the trend of the overall vaccination rate relative to the world population. To explore this, we use "Daily Vaccinations (per million)" as a measure. This should be a good indicator of the overall worldwide vaccination rate and its trend.

In [None]:
trend3 = df.groupby("Date", as_index=False)["Daily Vaccinations (per million)"].mean()

trend3

In [None]:
fig = px.line(trend3, x="Date", y="Daily Vaccinations (per million)")

fig.show()

As shown in the graph above, the overall trend of average daily vaccinations (per million) is upward, but the current average is only approximately 4,000. This means that, averaged across countries, about 4,000 doses are administered per million people each day. This value may be low for a number of reasons. One reason could be that people are generally hesitant to get vaccinated, possibly due to a lack of information about and awareness of vaccines, how they work and why they are needed (especially during this pandemic). Another reason could be that in a lot of countries, vaccines are only available to front-line health workers and people over the age of 60 or 65. Once vaccines become available to the general population, these numbers should increase.
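One caveat when reading this figure: taking the mean of a per-million column across countries gives every country the same weight, whatever its population. A worldwide rate would instead divide total doses by total population. The sketch below contrasts the two on a made-up frame (`toy`); the `Population` column is hypothetical and is not part of the dataset used here.

```python
import pandas as pd

# Toy data: one small and one large country (all values are made up).
toy = pd.DataFrame({
    "Country": ["Small", "Large"],
    "Daily Vaccinations": [10_000, 500_000],
    "Population": [1_000_000, 1_000_000_000],  # hypothetical column
})
toy["Daily Vaccinations (per million)"] = (
    toy["Daily Vaccinations"] / toy["Population"] * 1e6
)

# Unweighted: simple average of the per-country rates.
unweighted = toy["Daily Vaccinations (per million)"].mean()

# Population-weighted: total doses divided by total population.
weighted = toy["Daily Vaccinations"].sum() / toy["Population"].sum() * 1e6

print(round(unweighted, 1), round(weighted, 1))
```

In this toy case the small country pulls the unweighted average well above the population-weighted rate, so small territories with very high per-million values can have a similar effect on the worldwide curve.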

**THANK YOU FOR TAKING THE TIME TO READ THIS NOTEBOOK. PLEASE LEAVE AN UPVOTE IF YOU FIND MY NOTEBOOK AND ANALYSIS USEFUL. PLEASE LEAVE A COMMENT AS TO HOW I CAN IMPROVE MY ANALYSIS OR ANY OTHER ASPECTS OF MY WORK. THANK YOU!**