# Welcome to my First Notebook!
In this notebook, we will look at some statistics & predictions on Covid-19 Vaccinations in the following categories:

1. Adoption of Covid-19 Vaccines
    * Number of countries who have started vaccinations
    * Early adopters of vaccination
    * Most recent countries to begin vaccinations
    * Trendline of vaccination adoption over time
    * Visualization of this data on a geoplot
2. Scale of Vaccination across Countries
    * Total number of vaccines to date
    * Countries with the most total vaccinations
    * Countries with the highest number of vaccinations per 100 population
3. Vaccine Popularity
    * Most popular vaccines across countries (used in most countries)
    * Most popular vaccines (absolute quantity of vaccines used)
4. Predictions
    * When can herd immunity be achieved in the US?
5. Final Words & Considerations


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import geopandas as gpd
import statsmodels.formula.api as smf 
import datetime

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import data as dataframe vacc
vacc = pd.read_csv("/kaggle/input/covid-world-vaccination-progress/country_vaccinations.csv")

# Dropping the columns source name & source website as they are not relevant to the study
vacc.drop(['source_name','source_website'], inplace=True, axis=1)

Since the main focus of this study is the total number of vaccinations administered, we will drop the rows where this value is not available
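As a minimal sketch of this filtering step, the snippet below uses a made-up three-row frame (the values are illustrative, not taken from the dataset). It assumes `dropna(subset=...)` as one standard pandas way to drop rows that lack a value in a given column:

```python
import numpy as np
import pandas as pd

# Toy data (assumed values): the second row has no total_vaccinations
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "total_vaccinations": [100.0, np.nan, 50.0],
})

# Keep only the rows where total_vaccinations is present
toy_clean = toy.dropna(subset=["total_vaccinations"])
print(len(toy), "rows before,", len(toy_clean), "rows after")
```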

In [None]:
# Dropping rows with no "total vaccinations" value
vacc = vacc.drop(vacc[vacc.total_vaccinations.isna()].index)

# Adoption of Covid-19 Vaccines

We first look at the number of countries that have started the vaccination program

In [None]:
# Find out how many unique countries have started vaccinations
print(f"{vacc[vacc.total_vaccinations > 0].country.nunique()} countries have started vaccinations")

Which countries started vaccinations first?
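One way to answer this is to pick, for each country, the row with the earliest date. The sketch below shows the idea on a small invented frame (the countries, dates, and counts are assumptions for illustration only):

```python
import pandas as pd

# Toy data (assumed values): two records per country, not in date order
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2021-01-03", "2021-01-01", "2020-12-20", "2020-12-25"]),
    "total_vaccinations": [30, 10, 5, 9],
})

# idxmin() gives, per country, the row label of its earliest date;
# .loc then pulls those rows, and sort_values orders countries by start date
first_rows = toy.loc[toy.groupby("country")["date"].idxmin()].sort_values("date")
print(first_rows)
```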

In [None]:
# Find out which countries started vaccinations earliest
vacc['date'] = pd.to_datetime(vacc['date'], utc=True)
vacc_start = vacc.loc[vacc[vacc.total_vaccinations > 0].groupby('country')['date'].idxmin()].sort_values('date')
vacc_start.head(5)

Which countries are the most recent to start vaccinations?

In [None]:
# Find out which countries have most recently started vaccinations
vacc_start.tail(5)

How has the cumulative number of countries adopting Covid-19 vaccinations evolved over time?
What does this trend look like?
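A cumulative count of this kind can be built by counting the start dates per day, sorting by date, and taking a running sum. The following is a small sketch with invented start dates:

```python
import pandas as pd

# Toy start dates (assumed values), one entry per country
start_dates = pd.Series(pd.to_datetime(
    ["2020-12-14", "2020-12-14", "2020-12-20", "2021-01-05"]
))

# Number of countries starting on each day, in chronological order
per_day = start_dates.value_counts().sort_index()

# Running total of countries that have started by each day
cumulative = per_day.cumsum()
print(cumulative)
```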

In [None]:
# Cumulative distribution of vaccination start dates
events = pd.Series(vacc_start.date.value_counts())
events.index = pd.to_datetime(events.index)
events.sort_index(inplace=True)

plt.plot(events.cumsum())
plt.xticks(rotation=90)
plt.title('Cumulative Frequency of Countries that have Started Vaccinations')
plt.xlabel('Date')
plt.ylabel('Number of Countries')
plt.show()

Geoplot of countries that have started vaccinations

In [None]:
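# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0, so this call needs an older GeoPandas
countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# Flag countries that appear in the vaccination start table (1 = started, 0 = not started)
countries['score'] = np.where(countries.iso_a3.isin(vacc_start.iso_code), 1, 0)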

countries.plot(column='score',legend=True, legend_kwds={'label': "Covid-19 Vaccine Started",'orientation': "horizontal"})
plt.show()

# Scale of Vaccination across Countries

What is the total number of vaccinations done to date?

In [None]:
vacc_total = vacc.loc[vacc.groupby('country')['total_vaccinations'].idxmax()].sort_values('total_vaccinations',ascending=False)
print(f"To date, there have been a total of {vacc_total.total_vaccinations.sum()/1000000:.1f} million vaccinations administered")

Which country has administered the most vaccines?

In [None]:
print(vacc_total.iloc[0].country, " has administered the most vaccines, with ", vacc_total.iloc[0].total_vaccinations/1000000, " Million vaccinations to date")

What are the top 20 countries in terms of total number of vaccines administered?

In [None]:
# Plot out which countries have performed most vaccinations in descending order
plt.figure(figsize=(10, 4)) 
plt.bar(vacc_total.country[0:20], vacc_total.total_vaccinations[0:20])
plt.xticks(rotation=90)
plt.title('Top 20 Countries by Total Vaccinations')
plt.xlabel('Country')
plt.ylabel('Total Vaccinations (x 10Million)')
plt.show()


Geomap of countries that have started covid-19 vaccinations, in terms of total number of vaccinations administered

In [None]:
countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

countries = countries.merge(vacc_total[['iso_code', 'total_vaccinations']], how = 'left',
                left_on = 'iso_a3', right_on = 'iso_code').drop('iso_code', axis=1)

fig, ax = plt.subplots(1, 1,figsize=(10,10))

countries.plot(column='total_vaccinations',legend=True,ax=ax, 
               legend_kwds={'label': "Number of Covid-19 Vaccinations Administered",'orientation': "horizontal"}, 
               missing_kwds={"color": "lightgrey"})
plt.show()

The above study shows that the United States has administered the most vaccines thus far. However, its population is also large, and it will take a while before most of the population is vaccinated. To study vaccinations from a different angle, we also ask which countries have vaccinated the largest share of their population.

Thus, we will study which countries have administered the most vaccines per capita.

Note that since each person requires 2 doses on average, we expect a fully vaccinated country to have 200 vaccinations per 100 population.
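To make the 200-per-100 figure concrete, here is a small sketch that converts a per-hundred dose count into an upper bound on the share of people who could be fully vaccinated. The helper name `approx_fully_vaccinated_share` is hypothetical, and the 2-dose assumption comes from the note above:

```python
def approx_fully_vaccinated_share(doses_per_hundred, doses_per_person=2):
    """Upper bound on the fully vaccinated share of the population.

    It is only an upper bound: some doses are first doses given to
    people who have not yet received their second one.
    """
    return min(doses_per_hundred / (100 * doses_per_person), 1.0)

print(approx_fully_vaccinated_share(50))   # 50 doses per 100 people
print(approx_fully_vaccinated_share(200))  # 200 doses per 100 people
```

Under this assumption, 50 doses per 100 people corresponds to at most a quarter of the population being fully vaccinated.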


In [None]:
vacc_total = vacc.loc[vacc.groupby('country')['total_vaccinations_per_hundred'].idxmax()].sort_values('total_vaccinations_per_hundred',ascending=False)
print(vacc_total.iloc[0].country, " with ", vacc_total.iloc[0].total_vaccinations_per_hundred, " vaccines administered per 100 population, to date")

What are the top 20 countries in terms of number of vaccines administered per 100 population?

In [None]:
# Find out which countries have performed most vaccinations with respect to their populations 

plt.figure(figsize=(10, 4)) 
plt.bar(vacc_total.country[0:20], vacc_total.total_vaccinations_per_hundred[0:20])
plt.xticks(rotation=90)
plt.title('Top 20 Countries by Total Vaccinations per 100 Population')
plt.xlabel('Country')
plt.ylabel('Total Vaccinations per 100 Population')
plt.show()

Geomap of countries that have started covid-19 vaccinations, in terms of number of vaccinations administered per 100 population

In [None]:
countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

countries = countries.merge(vacc_total[['iso_code', 'total_vaccinations_per_hundred']], how = 'left',
                left_on = 'iso_a3', right_on = 'iso_code').drop('iso_code', axis=1)

fig, ax = plt.subplots(1, 1,figsize=(10,10))

countries.plot(column='total_vaccinations_per_hundred',legend=True,ax=ax, legend_kwds={'label': "Covid-19 Vaccinations Administered per 100 Population",'orientation': "horizontal"},missing_kwds={"color": "lightgrey"})
plt.show()

# Vaccine Popularity

Which is the most popular vaccine around the world? Popularity is quantified here as the number of countries in which a vaccine is being used

In [None]:
vacc_total = vacc.loc[vacc.groupby('country')['total_vaccinations'].idxmax()].sort_values('total_vaccinations',ascending=False)
vacc_total = pd.concat((vacc_total,vacc_total["vaccines"].str.split(", ", expand = True)),axis=1)

vacc_types = vacc_total.iloc[:,13:].apply(pd.Series.value_counts).sum(axis=1).sort_values(ascending=False)

print("The most popular vaccine is ", vacc_types.index[0], " with ", int(vacc_types.iloc[0]), " countries using it")

How do the other vaccines fare in terms of popularity around the world?

In [None]:
plt.bar(vacc_types.index, vacc_types)
plt.xticks(rotation=90)
plt.title('Most Popular Vaccines')
plt.xlabel('Vaccine')
plt.ylabel('Number of Countries')
plt.show()


From the chart above, it is obvious that the number of countries using each vaccine does not add up to the number of countries that have started vaccinations. This is because some countries are using more than 1 type of vaccine. 

This leads us to wonder, which countries are using more than 1 type of vaccine, and how many are they using?

In [None]:
# How many vaccines is each country using

vacc_total['vacc_brands'] = vacc_total.iloc[:,-5:].apply(lambda x: (5 - x.isnull().sum()), axis='columns')
vacc_total = vacc_total.sort_values('vacc_brands',ascending=False)
plt.figure(figsize=(20, 4)) 
plt.bar(vacc_total.country, vacc_total.vacc_brands)
plt.xticks(rotation=90)
plt.title('Total Number of Vaccine Brands Used')
plt.xlabel('Country')
plt.ylabel('Number of Brands')
plt.show()


In terms of vaccine sales and absolute number of vaccines being administered, which brands are most popular?

Note that for countries that purchase more than 1 type of vaccine, the total purchases are assumed to be split evenly among the types of vaccines used
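The even-split assumption can be illustrated on a tiny invented frame. This sketch uses `str.split` and `explode` rather than the column-by-column approach in the cell below; the brand names and totals are assumptions chosen for illustration:

```python
import pandas as pd

# Toy data (assumed values): country B uses two brands
toy = pd.DataFrame({
    "country": ["A", "B"],
    "vaccines": ["Pfizer/BioNTech", "Moderna, Pfizer/BioNTech"],
    "total_vaccinations": [100.0, 60.0],
})

# List of brands per country, and each brand's equal share of that country's total
brands = toy["vaccines"].str.split(", ")
toy["share"] = toy["total_vaccinations"] / brands.str.len()

# One row per (country, brand), then sum the shares per brand
per_brand = (
    toy.assign(brand=brands)
       .explode("brand")
       .groupby("brand")["share"]
       .sum()
)
print(per_brand)
```

Here country B's 60 doses are split into 30 for each of its two brands, so the per-brand totals still add up to the overall total of 160.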


In [None]:
vacc_total = vacc.loc[vacc.groupby('country')['total_vaccinations'].idxmax()].sort_values('total_vaccinations',ascending=False)
vacc_total = pd.concat((vacc_total,vacc_total["vaccines"].str.split(", ", expand = True)),axis=1)

vacc_types = vacc_total.iloc[:,13:].apply(pd.Series.value_counts).sum(axis=1).sort_values(ascending=False)

frame = { 'number_of_countries': vacc_types } 
result = pd.DataFrame(frame) 
result['number_sold'] = 0.0  # float, since fractional shares are added below


# Split each country's total evenly among the vaccine brands it uses
for index, row in vacc_total.iterrows():
    brands = row.iloc[-5:].dropna()  # the last 5 columns hold the split brand names
    for brand in brands:
        result.loc[brand,'number_sold'] = result.loc[brand,'number_sold'] + row.total_vaccinations/len(brands)
        
result = result.sort_values('number_sold',ascending=False)
plt.figure(figsize=(10, 4)) 
plt.bar(result.index, result.number_sold)
plt.xticks(rotation=90)
plt.title('Approximate Quantity of Each Vaccine Used')
plt.xlabel('Vaccine Brand')
plt.ylabel('Quantity Vaccinated (x 10M)')
plt.show()



# Predictions

This section will focus on the United States, as it has the most complete data in the dataset. However, it may be replicated for any other country as desired

In this section, our main goal is to predict when we can expect the United States to administer enough vaccines to achieve herd immunity

Based on studies being done, WHO is unable to determine the vaccination rate required for herd immunity of Covid-19. [WHO Article](https://www.who.int/news-room/q-a-detail/herd-immunity-lockdowns-and-covid-19#:~:text=The%20percentage%20of%20people%20who,among%20those%20who%20are%20vaccinated.)

On the other hand, the Johns Hopkins Bloomberg School of Public Health indicates that the vaccination rate required is approximately 70%. Thus, we will use this as the benchmark. [JHSPH Study](https://www.jhsph.edu/covid-19/articles/achieving-herd-immunity-with-covid19.html)

First, we will look at the vaccination trend in the United States

Daily Vaccinations in the US

In [None]:
us_stats = vacc[vacc.country == 'United States']

plt.figure(figsize=(10, 4)) 
plt.plot(us_stats.date, us_stats.daily_vaccinations)
plt.xticks(rotation=90)
plt.title('Daily number of vaccinations administered in USA')
plt.xlabel('Date')
plt.ylabel('Number of Vaccinations (x 1M)')
plt.show()

Cumulative Vaccinations in the US

In [None]:
us_stats = vacc[vacc.country == 'United States']

plt.figure(figsize=(10, 4)) 
plt.plot(us_stats.date, us_stats.total_vaccinations)
plt.xticks(rotation=90)
plt.title('Total number of vaccinations administered in USA')
plt.xlabel('Date')
plt.ylabel('Total Vaccinations (x 10M)')
plt.show()


From this, it can be observed that the growth of vaccinations in the US is accelerating: the cumulative curve rises faster than a straight line

For the sake of simplicity, we will create a new dataframe containing only the 'date' and 'total_vaccinations' columns

In [None]:
# .copy() so that columns added later do not trigger a SettingWithCopyWarning
us_stats_sliced = us_stats.loc[:,['total_vaccinations','date']].copy()
us_stats_sliced.head(5)

We will start with data preparation for this problem. We will convert each date into a value 't', representing the number of days since vaccinations started in the US (the first day is t = 1)
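As a minimal sketch of this conversion, the snippet below maps three invented dates to day numbers, with the first date counted as day 1:

```python
import pandas as pd

# Toy dates (assumed values); note the one-day gap before the last entry
dates = pd.Series(pd.to_datetime(["2020-12-20", "2020-12-21", "2020-12-23"]))

# Days elapsed since the first date, shifted so the first day is t = 1
t = (dates - dates.iloc[0]).dt.days + 1
print(t.tolist())
```

Because `t` is derived from the dates themselves, a missing day in the data shows up as a gap in `t` rather than shifting later values.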

In [None]:
# t: days since the first US record, starting at 1
us_stats_sliced["t"] = (us_stats_sliced['date'] - us_stats_sliced.date.iloc[0]).dt.days + 1
us_stats_sliced["t_squared"] = us_stats_sliced["t"]*us_stats_sliced["t"]
us_stats_sliced["log"] = np.log(us_stats_sliced["t"])
# "exp" holds log(total_vaccinations): regressing it on t fits an exponential trend
us_stats_sliced["exp"] = np.log(us_stats_sliced["total_vaccinations"])
us_stats_sliced["sqrt"] = np.sqrt(us_stats_sliced["t"])

We partition the data into train & test in 80:20 ratio
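For time-ordered data, the split is taken by position so that the test set contains only the most recent observations. A small sketch on an invented ten-row series:

```python
import pandas as pd

# Toy time series (assumed values): 10 ordered observations
series = pd.DataFrame({"t": range(1, 11), "y": range(10, 110, 10)})

# First 80% of rows for training, the remaining (latest) 20% for testing
cut = int(0.8 * len(series))
train, test = series.iloc[:cut], series.iloc[cut:]
print(len(train), "train rows,", len(test), "test rows")
```

A random shuffle is deliberately avoided here, since it would let the model see observations from later dates than those it is tested on.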

In [None]:
Train = us_stats_sliced.iloc[0:int(np.floor(0.8*len(us_stats_sliced))),:]
Test = us_stats_sliced.iloc[int(np.floor(0.8*len(us_stats_sliced))):,:]

Various models are built & trained to fit the data. The RMSE score of each model is recorded for tabulation later.

The models tested here are:
* Linear
* Exponential
* Quadratic
* Log
* Sqrt
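The comparison can be sketched on synthetic data. Note that this example swaps in `numpy.polyfit` in place of the statsmodels OLS calls used in this notebook, and the data is generated from an exactly quadratic curve chosen for illustration, so the outcome is known in advance:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic data (assumed): an exactly quadratic trend over 20 days
t = np.arange(1, 21, dtype=float)
y = 3.0 * t**2 + 2.0 * t + 5.0

# Chronological 80:20 split
t_train, t_test = t[:16], t[16:]
y_train, y_test = y[:16], y[16:]

# Fit a linear and a quadratic trend on the training part only
linear_coefs = np.polyfit(t_train, y_train, deg=1)
quad_coefs = np.polyfit(t_train, y_train, deg=2)

# Score both on the held-out part
rmse_linear = rmse(y_test, np.polyval(linear_coefs, t_test))
rmse_quad = rmse(y_test, np.polyval(quad_coefs, t_test))
print("linear:", rmse_linear, "quadratic:", rmse_quad)
```

On this synthetic curve the quadratic fit extrapolates almost exactly, while the linear fit falls behind on the later days.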

In [None]:
# Testing trend on linear model
linear_model = smf.ols('total_vaccinations ~ t', data=Train).fit()
pred_linear =  pd.Series(linear_model.predict(pd.DataFrame(Test['t'])))
rmse_linear = np.sqrt(np.mean((np.array(Test['total_vaccinations'])-np.array(pred_linear))**2))

# Testing trend on exponential model
Exp = smf.ols('exp ~ t', data=Train).fit()
pred_Exp = pd.Series(Exp.predict(pd.DataFrame(Test['t'])))
rmse_Exp = np.sqrt(np.mean((np.array(Test['total_vaccinations'])-np.array(np.exp(pred_Exp)))**2))

# Testing trend on quadratic model
Quad = smf.ols('total_vaccinations ~ t + t_squared',data=Train).fit()
pred_Quad = pd.Series(Quad.predict(Test[["t","t_squared"]]))
rmse_Quad = np.sqrt(np.mean((np.array(Test['total_vaccinations'])-np.array(pred_Quad))**2))

# Testing trend on log model
Log = smf.ols('total_vaccinations ~ log',data=Train).fit()
pred_Log = pd.Series(Log.predict(pd.DataFrame(Test[["log"]])))
rmse_Log = np.sqrt(np.mean((np.array(Test['total_vaccinations'])-np.array(pred_Log))**2))

# Testing trend on sqrt model
Sqrt = smf.ols('total_vaccinations ~ sqrt',data=Train).fit()
pred_Sqrt = pd.Series(Sqrt.predict(pd.DataFrame(Test[["sqrt"]])))
rmse_Sqrt = np.sqrt(np.mean((np.array(Test['total_vaccinations'])-np.array(pred_Sqrt))**2))

The RMSE scores are tabulated below:

In [None]:
data = {"MODEL":pd.Series(["rmse_linear","rmse_Exp","rmse_Quad","rmse_Log","rmse_Sqrt"]),"RMSE_Values":pd.Series([rmse_linear,rmse_Exp,rmse_Quad,rmse_Log,rmse_Sqrt])}
table_rmse=pd.DataFrame(data)
table_rmse

Based on the RMSE scores, the best model to predict the cumulative vaccinations in the US is the quadratic model. The details of the fitted quadratic model are summarized below:

In [None]:
Quad.summary()

The adjusted R-squared value for the model is very high, indicating that it fits the training data well. In addition, the p-value of t_squared is below 0.005, which shows that it is a statistically significant predictor of total vaccinations

Using this model, we will now attempt to predict when the US will be able to achieve herd immunity

Current population in the US is 332m. 

To achieve 70% vaccination rate for herd immunity, 70% * 332m = 232m of the population would need to be vaccinated

Since each person requires 2 doses, the total number of vaccinations would need to be **> 464m**
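The arithmetic above, and the search for the first day on which a quadratic trend crosses the target, can be sketched as follows. The coefficients passed to `first_day_exceeding` are hypothetical placeholders, not the fitted values from `Quad`, and the helper itself is an illustration rather than the method used in the cells below. The unrounded target comes out slightly above the 464m quoted above because the text rounds 232.4m down to 232m:

```python
# Figures taken from the text above
population = 332_000_000
coverage = 0.70          # herd immunity benchmark
doses_per_person = 2

target_doses = population * coverage * doses_per_person
print(f"Target: {target_doses / 1e6:.1f} million doses")

def first_day_exceeding(a, b, c, target, t_start, horizon=1000):
    """Return the first day t >= t_start where a*t^2 + b*t + c > target.

    Returns None if the target is not reached within `horizon` days.
    """
    for t in range(t_start, t_start + horizon):
        if a * t * t + b * t + c > target:
            return t
    return None

# Hypothetical coefficients, for illustration only
day = first_day_exceeding(a=20_000, b=500_000, c=0,
                          target=target_doses, t_start=60)
print("Target first exceeded on day", day)
```

Returning `None` when the target is never reached makes the "no crossing within the window" case explicit instead of silently reporting the first day.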

In [None]:
# Preparation of input data for model prediction

t_data = pd.Series(list(range(us_stats_sliced.t.iloc[-1] + 1, us_stats_sliced.t.iloc[-1] + 150)))

t_squared_data = t_data*t_data

pred_x = pd.DataFrame({'t':t_data, 't_squared':t_squared_data})

In [None]:
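# Prediction of when the US will achieve herd immunity, where the first occurrence of total vaccinations > 464m is identified

pred_y = Quad.predict(pred_x)
above_target = pred_y > 464000000
# idxmax() returns 0 when no value is True, so check that the target is reached within the forecast window
if not above_target.any():
    raise ValueError("Target of 464m vaccinations is not reached within the forecast window")
days_after = above_target.idxmax() + 1 # to offset the first index of 0
herd_date = us_stats.iloc[-1].date + datetime.timedelta(days=int(days_after))

print("The model predicts that the US will reach the herd immunity target on", herd_date)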

We then plot the predictions in order to visualize it on top of the initial data

In [None]:
pred_y.index = pd.to_datetime(pred_y.index + 1, unit='D',origin=pd.Timestamp(us_stats.iloc[-1].date.tz_localize(None)))

In [None]:
plt.figure(figsize=(10, 4)) 
plt.plot(us_stats_sliced.date, us_stats_sliced.total_vaccinations, label="Current Vaccinations")
plt.plot(pred_y.index, pred_y,'--', label="Predicted Vaccinations")
plt.plot(herd_date,pred_y[herd_date.tz_localize(None)], 'go', label="Point of Herd Immunity")
plt.legend(loc="upper left")
plt.xticks(rotation=90)
plt.title('Total number of vaccinations administered in USA')
plt.xlabel('Date')
plt.ylabel('Total Vaccinations (x 100M)')
plt.show()

# Final Words & Considerations
Note that this prediction is based on the vaccination trend in the US to date. It does not take into consideration external factors, such as the vaccine supply chain and vaccination capacity (manpower, facilities, etc.), which may reach peak capacity in the future.