I will answer the following questions in this analysis:

+ Does the weekday have an impact on how many people are vaccinated?
+ Did the vaccination increase since the beginning of the year?
+ Which countries have the most advanced vaccination campaigns?
+ Which vaccines are used how often?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(rc={'figure.figsize':(10,10)})

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/covid-world-vaccination-progress/country_vaccinations.csv')

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
# Creating a var for day of the week
df['date'] = pd.to_datetime(df['date'])
df['day'] = df['date'].dt.weekday

# another one for the month 

df['month'] = df['date'].dt.month

In [None]:
# I pick out some countries I am especially interested in
df_us = df.loc[df['country'] == 'United States']
df_de = df.loc[df['country'] == 'Germany']
df_fr = df.loc[df['country'] == 'France']
df_gb = df.loc[df['country'] == 'United Kingdom']

# Analysis

Q1: Does the weekday have an impact on how many people are vaccinated?

In [None]:
sns.barplot(x='day', y='daily_vaccinations', data = df)
plt.title('Vaccinations by day Global')

In [None]:
f, axes = plt.subplots(2, 2, figsize=(18, 12))
f.suptitle('Daily vaccinations by country')

sns.barplot(x='day', y='daily_vaccinations', data = df_de, ax=axes[0,0])
axes[0,0].set_title('Germany')
sns.barplot(x='day', y='daily_vaccinations', data = df_fr, ax=axes[0,1])
axes[0,1].set_title('France')
sns.barplot(x='day', y='daily_vaccinations', data = df_gb, ax=axes[1,0])
axes[1,0].set_title('UK')
sns.barplot(x='day', y='daily_vaccinations', data = df_us, ax=axes[1,1])
axes[1,1].set_title('USA')

Interestingly, Germany, France and Great Britain show fewer vaccinations at the weekend. Considering the confidence intervals, this may just be a random effect. However, it is also possible that fewer medical personnel work at the weekend, especially given the current shortage of vaccine.
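
The weekday/weekend comparison can also be expressed as a number rather than read off a bar chart. The following is a minimal sketch on made-up toy data (the values and the `day_type` column are assumptions for illustration, not part of the dataset); it relies on `dt.weekday` numbering the days from 0 (Monday) to 6 (Sunday).

```python
import pandas as pd

# Toy data standing in for one country's rows (values are made up for illustration).
toy = pd.DataFrame({
    'date': pd.date_range('2021-02-01', periods=14, freq='D'),  # 2021-02-01 is a Monday
    'daily_vaccinations': [100, 110, 120, 115, 105, 60, 50,
                           130, 125, 135, 140, 120, 70, 65],
})

# dt.weekday runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend.
toy['day_type'] = toy['date'].dt.weekday.map(
    lambda d: 'weekend' if d >= 5 else 'weekday'
)

# Mean daily vaccinations for working days versus weekend days.
means = toy.groupby('day_type')['daily_vaccinations'].mean()
print(means)
```

Applied to `df_de`, `df_fr`, `df_gb` or `df_us`, the same grouping would give the size of the weekend dip directly.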

Q2: Did the vaccination increase since the beginning of the year?

In [None]:
sns.lineplot(x = 'date', y ='daily_vaccinations', data=df)
plt.title('Daily vaccinations Global')

In [None]:
sns.barplot(x='month', y='daily_vaccinations', data=df)
plt.title('Average daily vaccinations by month Global')

Globally, daily vaccinations increased. This also holds for the monthly average, which rose from December to January and again from January to February.

In [None]:
sns.lineplot(x = 'date', y ='daily_vaccinations_per_million', hue='country', data=df[df['country'].isin(['Germany', 'France', 'United States', 'United Kingdom'])])
plt.title('Daily vaccinations per million')

In [None]:
sns.barplot(x = 'month', y ='daily_vaccinations_per_million', hue='country', data=df[df['country'].isin(['Germany', 'France', 'United States', 'United Kingdom'])])
plt.title('Average daily vaccinations per million by month')

This observation also holds for the four countries shown. However, the UK and the USA vaccinated significantly more people per million than Germany and France. In addition, daily vaccinations in the UK and the US have dropped since 15 February.

Q3: Which countries have the most advanced vaccination campaigns?

In [None]:
# Keep the most recent row for each country
latest_date = df.groupby('country')['date'].transform('max')
df_date = df[df['date'] == latest_date].reset_index(drop=True)

In [None]:
df_date.shape

In [None]:
# These entries are UK constituent countries, Crown Dependencies or British Overseas Territories
df_date = df_date[~df_date.country.isin(['Wales', 'England', 'Scotland', 'Gibraltar', 'Cayman Islands', 'Falkland Islands', 'Northern Ireland', 'Guernsey', 'Isle of Man', 'Jersey', 'Anguilla', 'Turks and Caicos Islands'])]

In [None]:
f, axes =plt.subplots(1,2,  figsize=(15, 10), sharey=True)

sns.barplot(x='country', y='people_vaccinated_per_hundred', data=df_date.sort_values(by=['people_vaccinated_per_hundred'], ascending=False)[:10], ax=axes[0])
axes[0].tick_params('x', labelrotation=45)
axes[0].set_title('People vaccinated per 100')

sns.barplot(x='country', y='people_fully_vaccinated_per_hundred', data=df_date.sort_values(by=['people_fully_vaccinated_per_hundred'], ascending=False)[:10], ax=axes[1])
axes[1].tick_params('x', labelrotation=45)
axes[1].set_title('People fully vaccinated per 100')

As expected, Israel has the highest vaccination rate in both graphs. The graphs also clearly show the effect of the UK's strategy of postponing the second dose. Furthermore, most of the countries in the top ten are relatively small.

Q4: Which vaccines are used how often?

In [None]:
df_date.head()

In [None]:
df_date.vaccines.value_counts()

The Pfizer/BioNTech vaccine is the most widely used vaccine globally (unsurprisingly, as it was the first one available). However, 20 countries already use 3 different vaccines, and one country even uses 5. These numbers will likely increase as more vaccines become available.
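
`value_counts()` on the raw `vaccines` column counts combinations of vaccines, not individual vaccines. A minimal sketch of how the per-vaccine usage and the number of vaccines per country could be derived follows. It assumes the column holds comma-separated names (separator `', '`); the toy table and the `n_vaccines` column are hypothetical.

```python
import pandas as pd

# Hypothetical latest-row-per-country table; the strings mimic the
# comma-separated format assumed for the `vaccines` column.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'vaccines': ['Pfizer/BioNTech',
                 'Moderna, Pfizer/BioNTech',
                 'Moderna, Oxford/AstraZeneca, Pfizer/BioNTech',
                 'Pfizer/BioNTech'],
})

# Split each entry into a list of vaccine names.
vaccine_lists = toy['vaccines'].str.split(', ')

# Number of different vaccines used by each country.
toy['n_vaccines'] = vaccine_lists.str.len()

# How many countries use 1, 2, 3, ... vaccines.
countries_per_count = toy['n_vaccines'].value_counts().sort_index()

# How many countries use each individual vaccine.
usage = vaccine_lists.explode().value_counts()

print(countries_per_count)
print(usage)
```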

# Forecasting for Germany

In [None]:
df_ger = df_de[['date', 'daily_vaccinations']].groupby('date').sum()[4:-1]

In [None]:
sns.lineplot(x='date', y='daily_vaccinations', data=df_ger)

In [None]:
# Augmented Dickey-Fuller test to check whether the series is stationary
from statsmodels.tsa.stattools import adfuller

result = adfuller(df_ger['daily_vaccinations'])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

In [None]:
# Using differencing to stabilize the mean
df_ger_diff = df_ger.copy()
df_ger_diff['daily_vaccinations'] = df_ger_diff.daily_vaccinations.diff()
df_ger_diff = df_ger_diff.dropna()

In [None]:
result = adfuller(df_ger_diff['daily_vaccinations'])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
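
To illustrate why differencing stabilizes the mean, here is a minimal sketch on a made-up series with a perfectly linear trend (the values are an assumption for illustration). First differencing turns the trend into a constant, and a cumulative sum plus the first value restores the original levels.

```python
import pandas as pd

# Toy series with a constant upward trend of 3 per step.
trend = pd.Series([10, 13, 16, 19, 22, 25], name='daily_vaccinations')

# First difference: value at t minus value at t-1. The first entry is NaN and is dropped.
diffed = trend.diff().dropna()

# Differencing is reversible: cumulative sum plus the first observed value.
restored = diffed.cumsum() + trend.iloc[0]

print(diffed.tolist())
print(restored.tolist())
```

Real data will not become exactly constant, but a trend in the level shows up as a roughly stable mean in the differenced series, which is what the second Dickey-Fuller test checks.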

In [None]:
from sklearn.metrics import mean_squared_error
import math

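try:
    from prophet import Prophet  # package name since Prophet 1.0
except ImportError:
    from fbprophet import Prophet  # older releases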
from statsmodels.tsa.arima.model import ARIMA

In [None]:
# train test data split
df_ger.reset_index(inplace=True)

train_len = int(0.75 * len(df_ger))
test_len = len(df_ger) - train_len

#generate train data
train = df_ger.iloc[:train_len, :].copy()
train.columns = ['ds', 'y']  # Prophet expects the columns to be named 'ds' and 'y'

#generate validation data
x_train, y_train = pd.DataFrame(df_ger.iloc[:train_len, 0]), pd.DataFrame(df_ger.iloc[:train_len, 1])
x_valid, y_valid = pd.DataFrame(df_ger.iloc[train_len:, 0]), pd.DataFrame(df_ger.iloc[train_len:, 1])

x_train.columns = ['ds']  # same Prophet naming convention
y_train.columns = ['y']
x_valid.columns = ['ds']
y_valid.columns = ['y']
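
The split above is chronological: the first 75% of the days are used for training and the remaining days for validation, without shuffling. As a minimal sketch, the helper below (`chronological_split` is a hypothetical name, not part of any library) reproduces the arithmetic; with 61 rows it yields the 45 training days and 16 validation days used in this notebook.

```python
def chronological_split(n_rows, train_fraction=0.75):
    """Return (train_len, test_len) for a time-ordered split without shuffling."""
    if not 0 < train_fraction < 1:
        raise ValueError('train_fraction must be between 0 and 1')
    # int() truncates, so any fractional row goes to the validation part.
    train_len = int(train_fraction * n_rows)
    return train_len, n_rows - train_len


train_len_example, test_len_example = chronological_split(61)
print(train_len_example, test_len_example)
```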

In [None]:
# Train the model
model = Prophet()
model.fit(train)


# Predict on valid set
y_pred = model.predict(x_valid)

# Calculate metrics
# According to the Prophet quick start, yhat is the forecast
rmse = math.sqrt(mean_squared_error(y_valid, y_pred.tail(test_len)['yhat']))

print('RMSE: {}'.format(rmse))
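
For reference, the RMSE is the square root of the mean of the squared differences between observed and predicted values. A minimal sketch in plain Python follows; the function name `rmse_by_hand` and the example numbers are assumptions for illustration.

```python
import math


def rmse_by_hand(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError('sequences must have the same length')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 10, 10 and 30 give squared errors of 100, 100 and 900.
example = rmse_by_hand([100, 200, 300], [110, 190, 330])
print(example)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones. The result is in the same unit as the data, here vaccinations per day.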

In [None]:
model.plot_components(y_pred)

In [None]:
model.plot(y_pred)
sns.lineplot(x=x_valid['ds'], y=y_valid['y'], color='red', label='Validation Data')

plt.xlabel(xlabel='Date', fontsize=14)
plt.ylabel(ylabel='Daily Vaccinations', fontsize=14)

In [None]:
# Fit model

model = ARIMA(y_train, order=(1,1,1))
model = model.fit()

# Prediction with ARIMA over the validation period
y_pred = model.forecast(test_len)

# Calculate metrics
rmse = math.sqrt(mean_squared_error(y_valid, y_pred))

print('RMSE: {}'.format(rmse))

In [None]:
print(model.summary())

In [None]:

sns.lineplot(x=x_valid['ds'], y=model.forecast(test_len), color='green', label='ARIMA forecast')
sns.lineplot(x=df_ger.date, y=df_ger.daily_vaccinations, color='purple', label='Observed data')

plt.xlabel(xlabel='Date', fontsize=14)
plt.ylabel(ylabel='Daily Vaccinations', fontsize=14)

plt.show()

I fitted two very simple models, of which the ARIMA model performed better. The situation in Germany currently depends on several factors for which we have no data (such as the introduction of the Johnson & Johnson vaccine or changes in the logistics surrounding the vaccination), so drastic changes are likely. Given this, I believe these simple models are adequate for the purpose.
I will use the ARIMA model to forecast the next 100 days.

In [None]:
# The training data covers 45 days, so the forecast starts at day 46. The validation data covers another 16 days, which the forecast horizon has to account for.
model.forecast(117)

The model predicts that 100 days after 2 March (that is, on 10 June), 341,981 people will be vaccinated per day. Note that the model will likely (and hopefully) underestimate the true number, as Germany expects a massive increase in both the amount of vaccine available and the number of people who can be vaccinated per day.
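
The date arithmetic behind the 100-day horizon can be checked with the standard library. This is a minimal sketch; the year 2021 is an assumption, as the text only names the day and month.

```python
from datetime import date, timedelta

# Assumption: the reference date is 2 March 2021.
reference_day = date(2021, 3, 2)
horizon_days = 100

# Calendar date that lies 100 days after the reference date.
target_day = reference_day + timedelta(days=horizon_days)
print(target_day.isoformat())
```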