<h1><center> COVID-19 World Vaccination Progress Notebook</center></h1>

<h2><center>Welcome to the Jupyter notebook with COVID-19 vaccination visualizations and forecasts</center></h2>

### Tracking the progress of COVID-19 vaccination:

1. Which vaccines are used, and in which countries?
2. Which country has vaccinated the most people?
3. Which country has vaccinated the largest percentage of its population?
4. Is there a correlation between vaccinations and new COVID-19 cases?
5. What is the forecast for vaccinations over the coming months?

### Description of the data fields
1. country - Country for which the vaccination data is reported.
2. iso_code - ISO country code.
3. date - Date of the data entry.
4. total_vaccinations - Total number of vaccine doses administered.
5. people_vaccinated - Total number of people who have received at least one dose.
6. people_fully_vaccinated - Number of people who have completed vaccination.
7. daily_vaccinations - Number of vaccinations administered on that day.
8. total_vaccinations_per_hundred - Doses administered per 100 people: (total vaccinations / population) * 100.
9. people_vaccinated_per_hundred - Percentage of the population vaccinated: (people vaccinated / population) * 100.
10. people_fully_vaccinated_per_hundred - Percentage of the population fully vaccinated: (people fully vaccinated / population) * 100.
11. daily_vaccinations_per_million - Daily vaccinations per million people: (daily vaccinations / population) * 1,000,000.
12. vaccines - Vaccines used in the country.
13. source_name - Source of information about vaccination.
14. source_website - Website of the source.
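
The per-hundred and per-million fields above are simple ratios. A minimal sketch of the arithmetic, using a hypothetical country whose numbers are made up for illustration (the helper names `per_hundred` and `per_million` are not part of the dataset):

```python
def per_hundred(count, population):
    """Return `count` expressed per 100 inhabitants."""
    return count * 100 / population


def per_million(count, population):
    """Return `count` expressed per 1,000,000 inhabitants."""
    return count * 1_000_000 / population


# Hypothetical country (illustrative numbers only, not taken from the dataset).
population = 2_000_000
total_vaccinations = 500_000       # doses administered
people_vaccinated = 400_000        # at least one dose
people_fully_vaccinated = 100_000  # completed vaccination
daily_vaccinations = 10_000

total_ph = per_hundred(total_vaccinations, population)
people_ph = per_hundred(people_vaccinated, population)
fully_ph = per_hundred(people_fully_vaccinated, population)
daily_pm = per_million(daily_vaccinations, population)

print(total_ph, people_ph, fully_ph, daily_pm)
```

Because `total_vaccinations` counts doses rather than people, `total_vaccinations_per_hundred` can exceed 100 once many people have received more than one dose.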

In [None]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [None]:
import numpy as np # numpy arrays / linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px # map plot
import warnings
warnings.filterwarnings('ignore') 

import matplotlib as mpl
import matplotlib.pyplot as plt 
import matplotlib.dates as md
%matplotlib inline
from scipy.optimize import curve_fit # predictive modeling
import seaborn as sns
import statsmodels.api as sm

import datetime
from datetime import date

### Data Source of COVID-19 World Vaccination Progress <font color='red'>(last update 03.15.2021)</font>: 
##### https://www.kaggle.com/gpreda/covid-world-vaccination-progress/notebooks

In [None]:
df = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv', header=0)
df.head()

In [None]:
# forward-fill missing cumulative values within each country up to the latest update
cumulative_cols = ['total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated',
                   'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
                   'people_fully_vaccinated_per_hundred']
for col in cumulative_cols:
    df[col] = df.groupby('country')[col].ffill()
df = df.fillna(0)  # set all remaining missing values to 0
df.head()
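
The cell above forward-fills gaps within each country before replacing the remaining gaps with 0. A minimal sketch of that per-group forward fill in plain Python, so the logic is explicit; the rows and the helper name `ffill_by_group` are hypothetical, and the sketch assumes rows are already ordered by date within each country:

```python
def ffill_by_group(rows, group_key, value_key):
    """Forward-fill missing (None) values within each group, in row order."""
    last_seen = {}  # most recent non-missing value per group
    filled = []
    for row in rows:
        group = row[group_key]
        value = row[value_key]
        if value is None:
            # Reuse the last known value; stays None if nothing was seen yet.
            value = last_seen.get(group)
        else:
            last_seen[group] = value
        filled.append({**row, value_key: value})
    return filled


# Hypothetical rows, ordered by date within each country.
rows = [
    {"country": "A", "date": "2021-03-01", "total_vaccinations": 100},
    {"country": "A", "date": "2021-03-02", "total_vaccinations": None},
    {"country": "B", "date": "2021-03-01", "total_vaccinations": None},
    {"country": "B", "date": "2021-03-02", "total_vaccinations": 50},
    {"country": "B", "date": "2021-03-03", "total_vaccinations": None},
]

filled = ffill_by_group(rows, "country", "total_vaccinations")
values = [r["total_vaccinations"] for r in filled]
print(values)
```

The first row of country B stays missing because there is no earlier value to carry forward; gaps of this kind are the ones later set to 0.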

## Total vaccinations and people vaccinated number by the latest update date

In [None]:
#country list
country = df['country'].unique()
total_vacc = []
people_vaccinated = []
people_fully_vaccinated = []
total_vacc_per_hundred = []
people_vacc_per_hundred = []
people_fully_vacc_per_hundred = []
vacc_type = []
dates = []

for name in country:
    # most recent row for this country (rows are ordered by date)
    latest = df[df['country'] == name].iloc[-1]
    # latest cumulative counts
    total_vacc.append(latest['total_vaccinations'])
    people_vaccinated.append(latest['people_vaccinated'])
    people_fully_vaccinated.append(latest['people_fully_vaccinated'])
    # latest data per hundred
    total_vacc_per_hundred.append(latest['total_vaccinations_per_hundred'])
    people_vacc_per_hundred.append(latest['people_vaccinated_per_hundred'])
    people_fully_vacc_per_hundred.append(latest['people_fully_vaccinated_per_hundred'])
    # vaccine types
    vacc_type.append(latest['vaccines'])
    # date of the latest update
    dates.append(latest['date'])
        
df_actual = pd.DataFrame({'total_vaccinations': total_vacc, 
                            'people_vaccinated': people_vaccinated, 
                            'people_fully_vaccinated': people_fully_vaccinated,
                            'total_vacc_per_hundred': total_vacc_per_hundred,
                            'people_vacc_per_hundred': people_vacc_per_hundred,
                            'people_fully_vacc_per_hundred': people_fully_vacc_per_hundred,
                            'vacc_type': vacc_type,
                            'update_date': dates}, 
                             index = country)
df_actual['total_vaccinations'] = df_actual['total_vaccinations'].astype(int)
df_actual['people_vaccinated'] = df_actual['people_vaccinated'].astype(int)
df_actual['people_fully_vaccinated'] = df_actual['people_fully_vaccinated'].astype(int)
df_actual.head(10)

### The Map of Total Vaccinations

In [None]:
dat1 = df_actual.iloc[:,0].values

plot_map = px.choropleth(df_actual, locations=df_actual.index,
                    color_continuous_scale="Peach",
                    locationmode='country names',
                    color=dat1,
                    width = 950,
                    height= 600,
                    labels = {'color':'Total Vaccinations'})
plot_map.update_layout(title="Total Vaccinations Map", title_x=0.5)
plot_map.show()

#### Conclusion: the map above shows that the US and China have administered the largest number of vaccine doses.

### Top 10 Countries With The Highest Vaccinations

In [None]:
df_top1 = df_actual.nlargest(10, 'total_vaccinations')
plt.rcParams["figure.figsize"] = (15,5)
ax = df_top1.plot.bar(y='total_vaccinations', color='red', rot=0,  width = 0.7)
for p in ax.patches:
        ax.annotate(str(round(p.get_height()/1000000, 2)) + " M", (p.get_x() * 1.005, p.get_height() * 1.005))

### The Map of People Vaccinated

In [None]:
dat2 = df_actual.iloc[:,1].values

plot_map = px.choropleth(df_actual, locations=df_actual.index,
                    locationmode='country names',
                    color=dat2,
                    width = 950,
                    height= 600,
                    labels = {'color':'People Vaccinated'})
plot_map.update_layout(title="People Vaccinated Map", title_x=0.5)
plot_map.show()

### Top 10 Countries With The Highest Number of People Vaccinated

In [None]:
df_top2 = df_actual.nlargest(10, 'people_vaccinated')
plt.rcParams["figure.figsize"] = (15,5)
ax = df_top2.plot.bar(y='people_vaccinated', color='darkblue', rot=0,  width = 0.7)
for p in ax.patches:
        ax.annotate(str(round(p.get_height()/1000000, 2)) + " M", (p.get_x() * 1.005, p.get_height() * 1.005)) 

#### Conclusion: the US and the UK have the largest numbers of people who have received at least one dose of a vaccine. (Data for China is not available in the source.)

### The Map of People Fully Vaccinated

In [None]:
dat3 = df_actual.iloc[:,2].values

plot_map = px.choropleth(df_actual, locations=df_actual.index,
                    color_continuous_scale="Viridis",
                    locationmode='country names',
                    color=dat3,
                    width = 950,
                    height= 600,
                    labels = {'color':'People Fully Vaccinated'})
plot_map.update_layout(title="People Fully Vaccinated Map", title_x=0.5)
plot_map.show()

### Top 10 Countries With The Highest Number of People Fully Vaccinated

In [None]:
df_top3 = df_actual.nlargest(10, 'people_fully_vaccinated')
plt.rcParams["figure.figsize"] = (15,5)
ax = df_top3.plot.bar(y='people_fully_vaccinated', color='purple', rot=0,  width = 0.7)
for p in ax.patches:
        ax.annotate(str(round(p.get_height()/1000000, 2)) + " M", (p.get_x() * 1.005, p.get_height() * 1.005)) 

## The distribution of vaccinations

In [None]:
mpl.rcParams['figure.figsize'] = (20,5)
df_actual.plot(kind = 'box', title = 'Boxplot - All Countries')

In [None]:
df_top1 = df_actual.nlargest(10, 'total_vaccinations')
mpl.rcParams['figure.figsize'] = (20,5)
df_top1.plot(kind = 'box', title = 'Boxplot - Top 10 Countries')

#### Conclusion: the distribution across all countries shows a median of roughly 160k doses administered, which means that most countries are only starting the vaccination process. The median for the top 10 countries is around 16M doses.
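
The conclusion above relies on the median because a few very large countries dominate the totals. A minimal sketch of why the median is the more representative summary for such skewed data, using made-up dose counts (not values from the dataset):

```python
import statistics

# Hypothetical dose counts for five countries; one very large outlier.
doses = [20_000, 90_000, 160_000, 450_000, 100_000_000]

median = statistics.median(doses)             # middle value, robust to the outlier
mean = statistics.mean(doses)                 # pulled far upward by the outlier
q1, q2, q3 = statistics.quantiles(doses, n=4)  # quartiles, as drawn in a boxplot

print("median:", median)
print("mean:", mean)
print("quartiles:", q1, q2, q3)
```

In this toy sample the mean is more than a hundred times the median, which is the same pattern the boxplots show for the real data.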

### Heatmap Correlations Between Data Fields

In [None]:
correlation_matrix = df_actual.select_dtypes(include='number').corr()  # numeric columns only
plt.subplots(figsize=(8,5))
sns.heatmap(correlation_matrix, annot=True, cmap="YlGnBu")
plt.title('Correlation Plot (Heatmap)', size=18)
plt.show()

#### Conclusion: there is no correlation between the volume of vaccines administered and the percentage of the population vaccinated. The reason is the large difference in population between countries: small countries need fewer doses.

### % of population fully vaccinated per country

In [None]:
df_top4 = df_actual.nlargest(20, 'people_fully_vacc_per_hundred')
df_top4.sort_values(by=['people_fully_vacc_per_hundred'], inplace=True)
plt.rcParams["figure.figsize"] = (15,5)
ax = df_top4.plot.barh(y='people_fully_vacc_per_hundred', color='darkblue', rot=0, width = 0.8)
plt.title("Bar plot - % of population fully vaccinated", size=18)

#### Conclusion: Israel, with a population of about 9M, is leading the world in vaccinations: 43% of the population is already fully vaccinated. The country has the health infrastructure and logistics to deliver the vaccines.

## The US Predictive Modelling of Vaccinations - Curve Fit

In [None]:
# Exponential model; curve_fit uses the Levenberg-Marquardt (LM) algorithm when no bounds are given
def func(x, a, b):
    return a * np.exp(b * x)

# select the US data
df_us = df[df['country'] == 'United States'].copy()
df_us['people_fully_vaccinated'] = df_us['people_fully_vaccinated'].astype(int)
y = df_us['people_fully_vaccinated'].to_numpy()  # dependent variable
x = np.arange(len(y))  # independent variable

#Data and Predictive Modelling in Pandemics https://www.youtube.com/watch?v=zk2ptM4H2Uk&feature=youtu.be
plt.figure(figsize=(10, 5))
plt.plot(x, y, 'ko', label='Actual number of people fully vaccinated in the US')
popt, pcov = curve_fit(func, x, y, p0=(1, 0.1))
plt.plot(x, func(x, *popt), 'r-', label='Predicted curve, fit: a = %5.3f, b = %5.3f' % tuple(popt))
plt.xlabel('Days Since 12/20/2020')
plt.ylabel('People Fully Vaccinated')
plt.legend()
plt.show()

In [None]:
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

y_t1 = y.copy()  # copy so the original data is not modified
y_t1[y_t1 == 0] = 1  # change 0 to 1 to avoid dividing by 0
y_t1 = y_t1[25:]  # exclude the first 25 days

y_p1 = func(x, *popt)
y_p1 = y_p1.astype(int)
y_p1[y_p1 < 0] = 1
y_p1 = y_p1[25:]

MAPE1 = mean_absolute_percentage_error(y_t1, y_p1)
print ("MAPE (Mean Absolute Percentage Error) = " + str(round(MAPE1,2)) + "%")

#### The average difference between the predicted and actual values is ~25% (excluding the first 25 days of data), which means the predictive model is not a perfect fit.
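
The MAPE figure above is the mean of the absolute relative errors, expressed in percent. A minimal worked sketch in plain Python with made-up numbers; note one deliberate difference from the notebook, which replaces zero actual values with 1, whereas this hypothetical `mape` helper skips them:

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, skipping zero actual values."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return sum(abs((t - p) / t) for t, p in pairs) / len(pairs) * 100


# Illustrative values only: relative errors are 10%, 10% and 0%.
actual = [100, 200, 400]
predicted = [110, 180, 400]

error = mape(actual, predicted)
print(round(error, 2))
```

Since each error is divided by the actual value, days with very small actual counts can dominate the average, which is one reason to exclude the earliest days of the series.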

### The US Predictive Model for the next 50 days

In [None]:
x2 = np.arange(len(y)+50)
y2_predicted = []
       
for i in range(0, len(x2)):
    y2_predicted.append(popt[0] * np.exp((popt[1]) * x2[i]))
plt.figure(figsize=(10, 5))
plt.plot(x, y, 'ko', label='Actual number of people fully vaccinated in the US')    
plt.plot(x2, y2_predicted, 'r-', label='Predictive exponential model for the next 50 days')
plt.xlabel('Days Since 12/20/2020')
plt.ylabel('People Vaccinated Prediction')
plt.legend()
plt.show()

#### Conclusion: if vaccination continues along the exponential trend, the population of the US (~330 million people) might be fully vaccinated before June 2021.
##### https://www.nytimes.com/2021/03/02/us/politics/merck-johnson-johnson-vaccine.html
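
The date estimate above comes from extending the fitted exponential curve. A minimal sketch of the underlying arithmetic, assuming the model `a * exp(b * x)` used in the curve fit: solving `a * exp(b * x) = N` for `x` gives `x = ln(N / a) / b`. The values of `a`, `b` and `target` below are hypothetical placeholders, not the fitted parameters, and `days_to_reach` is an illustrative helper name.

```python
import math


def days_to_reach(target, a, b):
    """Day index x at which a * exp(b * x) equals `target`."""
    if target <= 0 or a <= 0 or b <= 0:
        raise ValueError("target, a and b must all be positive")
    return math.log(target / a) / b


# Hypothetical parameters for illustration (not the values returned by curve_fit).
a, b = 1_000_000.0, 0.05
target = 300_000_000  # hypothetical population to reach

day = days_to_reach(target, a, b)
print(f"target reached around day {day:.1f}")
```

An exponential curve has no upper limit, so an estimate of this kind only holds while the growth rate stays constant.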

## Israel Predictive Modelling of Vaccinations - Linear Regression

In [None]:
# select Israel data
df_israel = df[df['country'] == 'Israel'].copy()
df_israel['people_fully_vaccinated'] = df_israel['people_fully_vaccinated'].astype(int)
y = df_israel['people_fully_vaccinated'].to_numpy()  # dependent variable
x = np.arange(len(y))  # independent variable

#Linear Regression https://towardsdatascience.com/introduction-to-linear-regression-in-python-c12a072bedf0
xmean = np.mean(x)
ymean = np.mean(y)
df_israel = df_israel.reset_index(drop=True)
df_israel.reset_index(level=0, inplace=True)
df_israel.head()

# Calculate the terms needed for the numerator and denominator of beta
df_israel['xycov'] = (df_israel['index'] - xmean) * (df_israel['people_fully_vaccinated'] - ymean)
df_israel['xvar'] = (df_israel['index'] - xmean)**2

# Calculate beta and alpha
beta = df_israel['xycov'].sum() / df_israel['xvar'].sum()
alpha = ymean - (beta * xmean)

ypred = alpha + beta * x

# Plot regression against actual data
plt.figure(figsize=(10, 5))
plt.plot(x, y, 'ko', label='Actual number of people fully vaccinated in Israel')     # scatter plot showing actual data
plt.plot(x, ypred, 'r-', label='Predicted Linear Regression')   # regression line
plt.title('Actual data vs Predicted')
plt.xlabel('Days Since 12/19/2020')
plt.ylabel('People Fully Vaccinated')
plt.legend()
plt.show()


In [None]:
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

y_t2 = y.copy()  # copy so the original data is not modified
y_t2[y_t2 == 0] = 1  # change 0 to 1 to avoid dividing by 0
y_t2 = y_t2[40:]  # exclude the first 40 days

y_p2 = ypred
y_p2 = y_p2.astype(int)
y_p2[y_p2 < 0] = 1
y_p2 = y_p2[40:]

MAPE = mean_absolute_percentage_error(y_t2, y_p2)
print ("MAPE (Mean Absolute Percentage Error) = " + str(round(MAPE,2)) + "%")

#### The average difference between the predicted and actual values is ~3% (excluding the first 40 days of data), which means the predictive model is accurate.

### Israel Predictive Model for the next 30, 60 and 90 days

In [None]:
x1 = np.arange(len(y)+30)  # observed days plus the next 30

ypred1 = alpha + beta * x1  # extend the regression line
plt.figure(figsize=(10, 5))
plt.plot(x, y, 'ko', label='Actual number of people fully vaccinated in Israel')     # scatter plot showing actual data
plt.plot(x1, ypred1, 'r-', label='Predictive model for the next 30 days')   # regression line
plt.title('Actual data vs Predicted')
plt.xlabel('Days Since 12/19/2020')
plt.ylabel('People Fully Vaccinated')
plt.legend()
plt.show()

In [None]:
x2 = np.arange(len(y)+60)  # observed days plus the next 60

ypred2 = alpha + beta * x2  # extend the regression line
plt.figure(figsize=(10, 5))
plt.plot(x, y, 'ko', label='Actual number of people fully vaccinated in Israel')     # scatter plot showing actual data
plt.plot(x2, ypred2, 'r-', label='Predictive model for the next 60 days')   # regression line
plt.title('Actual data vs Predicted')
plt.xlabel('Days Since 12/19/2020')
plt.ylabel('People Fully Vaccinated')
plt.legend()
plt.show()

In [None]:
x2 = np.arange(len(y)+90)  # observed days plus the next 90

ypred2 = alpha + beta * x2  # extend the regression line
plt.figure(figsize=(10, 5))
plt.plot(x, y, 'ko', label='Actual number of people fully vaccinated in Israel')     # scatter plot showing actual data
plt.plot(x2, ypred2, 'r-', label='Predictive model for the next 90 days')   # regression line
plt.title('Actual data vs Predicted')
plt.xlabel('Days Since 12/19/2020')
plt.ylabel('People Fully Vaccinated')
plt.legend()
plt.show()

#### Conclusion: if vaccination continues at the same pace, the population of Israel (~9 million people) will be fully vaccinated before July 2021.
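
The forecast above extends a least-squares line, `y = alpha + beta * x`, beyond the observed days. A minimal sketch of the same closed-form fit in plain Python, followed by the day on which the line reaches a target value; the data points, the target and the helper names `fit_line` and `day_reaching` are hypothetical:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = alpha + beta * x (closed form)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    xy_cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    beta = xy_cov / x_var
    alpha = y_mean - beta * x_mean
    return alpha, beta


def day_reaching(target, alpha, beta):
    """Day index at which the fitted line equals `target`."""
    return (target - alpha) / beta


# Illustrative data lying exactly on y = 10 + 20 * x.
xs = [0, 1, 2, 3, 4]
ys = [10, 30, 50, 70, 90]

alpha, beta = fit_line(xs, ys)
day = day_reaching(210, alpha, beta)
print(alpha, beta, day)
```

Applied to the notebook's data, the target would be the population figure and the day index would be converted back to a calendar date from the first day of the series.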

## Vaccine Types and Combinations

In [None]:
dat4 = df_actual.iloc[:,6].values

plot_map = px.choropleth(df_actual, locations=df_actual.index,
                    locationmode='country names',
                    color=dat4,
                    width = 1200,
                    height= 550,
                    labels = {'color':'Vaccine Types / Combinations'})
plot_map.update_layout(title="People Vaccinated Map per Vaccines Combinations", title_x=0.5)
plot_map.show()

In [None]:
# count the countries that use each vaccine combination, smallest first
vacc_type_sorted = df_actual['vacc_type'].value_counts().sort_values(ascending=True)

plt.rcParams["figure.figsize"] = (20,15)
ax = vacc_type_sorted.plot.barh(color='darkred', rot=0, width = 0.8)
plt.xlabel('Countries Count')
plt.ylabel('Vaccine Types')
plt.title('Vaccine Combinations', size=18)

#### Conclusion: as of March 2021, the most widely used vaccines worldwide are Pfizer/BioNTech, Moderna, Oxford/AstraZeneca and Sputnik V.

## COVID-19 cases (European Centre for Disease Prevention and Control)
##### https://www.ecdc.europa.eu/en/publications-data/data-national-14-day-notification-rate-covid-19

The data file contains the 14-day notification rate of newly reported COVID-19 cases per 100,000 population and the 14-day notification rate of reported deaths per million population, by week and country. Each row contains the data for a given indicator, week and country. The file is updated weekly.
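
A notification rate of this kind is a count over a 14-day window scaled to a reference population. A minimal sketch of the arithmetic, assuming the window can be approximated by the two most recent weekly counts; the numbers and the helper name `notification_rate_14d` are hypothetical and this is not the ECDC's own code:

```python
def notification_rate_14d(weekly_counts, population, per=100_000):
    """Sum of the two latest weekly counts, scaled to `per` inhabitants."""
    if len(weekly_counts) < 2:
        raise ValueError("at least two weekly counts are needed")
    return sum(weekly_counts[-2:]) * per / population


# Hypothetical weekly counts for a country of one million inhabitants.
population = 1_000_000
weekly_cases = [1_200, 1_500, 1_800]
weekly_deaths = [30, 25, 20]

case_rate = notification_rate_14d(weekly_cases, population)                    # per 100,000
death_rate = notification_rate_14d(weekly_deaths, population, per=1_000_000)  # per million
print(case_rate, death_rate)
```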

In [None]:
df_cases = pd.read_csv('https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv', header=0)
df_cases = df_cases.fillna(0)
df_cases.head(10)

### The US COVID-19 Statistics 

In [None]:
#select only cases for the US
df_cases_us = df_cases[df_cases['country'] == 'United States']
df_cases_us = df_cases_us[df_cases_us['indicator'] == 'cases'].copy()
df_cases_us.head(15)

In [None]:
plt.rcParams["figure.figsize"] = (15,5)
df_cases_us.sort_values(by=['year_week'], inplace=True)
df_cases_us.plot(x="year_week", y=["weekly_count", "cumulative_count"])
plt.title("The US COVID-19 weekly cases", size=18)
plt.show()

In [None]:
#select only deaths for the US
df_deaths_us = df_cases[df_cases['country'] == 'United States']
df_deaths_us = df_deaths_us[df_deaths_us['indicator'] == 'deaths']

plt.rcParams["figure.figsize"] = (15,5)
df_deaths_us.sort_values(by=['year_week'], inplace=True)
df_deaths_us.plot(x="year_week", y=["weekly_count", "cumulative_count"])
plt.title("The US COVID-19 weekly deaths", size=18)
plt.show()

## Correlation between the US weekly COVID-19 cases and vaccinations process

In [None]:
#select the US vaccination data 
df_vacc_us = df[df['country'] == 'United States'].copy()
df_vacc_us['total_vaccinations'] = df_vacc_us['total_vaccinations'].astype(int)
df_vacc_us['people_vaccinated'] = df_vacc_us['people_vaccinated'].astype(int)
df_vacc_us['people_fully_vaccinated'] = df_vacc_us['people_fully_vaccinated'].astype(int)
df_vacc_us['date'] = pd.to_datetime(df_vacc_us['date'], errors='coerce') #converting date
df_vacc_us['week_num'] = df_vacc_us['date'].dt.isocalendar().week #adding ISO week number
df_vacc_us.head()

In [None]:
# getting the last row per every week
week_num = df_vacc_us['week_num'].unique()
vacc_weekly = []
for i in range(0,len(week_num)):
        #last cumulative people_vaccinated value reported in that week
        vacc_weekly.append(df_vacc_us["people_vaccinated"][df_vacc_us['week_num'] == week_num[i]].iloc[-1])
#print(vacc_weekly)

In [None]:
#getting last weeks cumulative_count from the US cases data
week_num_cases = df_cases_us.iloc[-len(week_num):,6].values
#print(week_num_cases)
cases_latest = df_cases_us.iloc[-len(week_num):,5].values
#print(cases_latest)
deaths_latest = df_deaths_us.iloc[-len(week_num):,5].values
#print(deaths_latest)

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(week_num_cases, vacc_weekly, 'r-', label='People Vaccinated (Cumulative Count)')
plt.plot(week_num_cases, cases_latest, 'b-', label='Weekly Count - Cases')
plt.plot(week_num_cases, deaths_latest, 'ko', label='Weekly Count - Deaths')
plt.xlabel('Week Number')
plt.ylabel('People Vaccinated vs Weekly Cases / Deaths')
plt.title("The US COVID-19 weekly cases/ deaths vs Total number of people vaccinated", size=12)
plt.legend()
plt.show()

#### Correlation matrix of COVID-19 cases and vaccinations progress

In [None]:
# correlation matrix between cases and vaccinations
r = np.corrcoef(cases_latest, vacc_weekly)
print(r)

In [None]:
#combining arrays to the dataframe
df_cases_vacc = pd.DataFrame({'people_vaccinated': vacc_weekly, 
                            'weekly_cases': cases_latest,
                             'weekly_deaths': deaths_latest},
                                index = week_num_cases)
df_cases_vacc.head()

In [None]:
correlation_matrix = df_cases_vacc.corr()
plt.subplots(figsize=(8,5))
sns.heatmap(correlation_matrix, annot=True, cmap= 'coolwarm', linewidths=3, linecolor='black')
plt.title('Correlation Plot (Heatmap)', size=18)
plt.show()

#### Conclusion: the correlation matrix in the heatmap above shows a strong negative correlation (r = -0.93) between the total number of people vaccinated and new COVID-19 cases. This suggests that vaccinations are reducing the number of new cases, and that worldwide vaccination might help to stop the pandemic.
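
The r value quoted above is a Pearson correlation coefficient. A minimal sketch of how it is computed, written in plain Python so that each term is visible; the two series are made up for illustration and `pearson_r` is a hypothetical helper, not the function used in the notebook:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("both series need the same length, at least 2")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = math.sqrt(sum((x - x_mean) ** 2 for x in xs))
    y_spread = math.sqrt(sum((y - y_mean) ** 2 for y in ys))
    return cov / (x_spread * y_spread)


# Illustrative series: one rises steadily while the other falls steadily.
vaccinated = [1, 2, 3, 4, 5]
weekly_cases = [10, 8, 6, 4, 2]

r = pearson_r(vaccinated, weekly_cases)
print(round(r, 3))
```

A value near -1 means the two series move in opposite directions along an almost straight line; r measures only this linear association.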

<h1><center>Thank you! Stay safe!</center></h1>