## SG data updated to 19th July. The vaccination forecast is built using the first 28 data points

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/covid-vaccination-forecast/Vaccines_Singapore.csv')
df.info()

In [None]:
df.head()

My country, Singapore, has only 40 data entries so far. The first reported date is 11th Jan.

In [None]:
df.tail()

* The last reported date is 19 July, so we have only ~6 months of actual data so far.

Let's see the graph of people vaccinated to date!

In [None]:
import plotly.express as px  # plotly_express is now bundled with plotly as plotly.express
fig = px.scatter(df, x='date', y='people_vaccinated', title="People vaccinated in SG",
                 labels={"people_vaccinated": "People Vaccinated (Million)"})
fig.show()

## FBProphet Model

The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric, and represents the measurement we wish to forecast.

In [None]:
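# Take the two columns we need and rename them to the names Prophet expects.
# rename() returns a new dataframe, so later assignments do not touch df.
df1 = df[['date', 'people_vaccinated']].rename(columns={'date': 'ds', 'people_vaccinated': 'y'})
df1.head()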

We need to convert the `ds` column explicitly to datetime format.

In [None]:
df1['ds']= pd.to_datetime(df1['ds'])
df1.info()

I'm going to use the first 28 data points to train the model and the remaining points to validate it later.


In [None]:
train_size = 28
test_size = df1.shape[0] - train_size
df_train = df1.head(train_size)
df_test = df1.tail(test_size)
print(df_train.shape, df_test.shape)



Create a Prophet object and fit it to the training data.

In [None]:
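# The package was renamed from fbprophet to prophet in v1.0; try the new name first
try:
    from prophet import Prophet
except ImportError:
    from fbprophet import Prophet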
m = Prophet()
m.fit(df_train)

Predictions are then made on a dataframe with a column ds containing the dates for which a prediction is to be made. You can get a suitable dataframe that extends into the future a specified number of days using the helper method Prophet.make_future_dataframe. By default it will also include the dates from the history, so we will see the model fit as well.

The predict method will assign each row in future a predicted value which it names yhat. If you pass in historical dates, it will provide an in-sample fit. The forecast object here is a new dataframe that includes a column yhat with the forecast, as well as columns for components and uncertainty intervals.

Let's look at weekly predictions for the next 12 weeks (3 months) after the last date in the training data, i.e. July to Sep.

In [None]:
future = m.make_future_dataframe(periods=12, freq='W')
future.tail()

In [None]:
forecast = m.predict(future)
pd.options.display.float_format = '{:20,.0f}'.format
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

The last predicted value is ~5 million! Let's find the indexes of the first predictions above 1, 2, 3 and 4 million.

In [None]:
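# For each threshold (1 to 4 million), store the index of the first forecast row above it
mil_list = [0] * 4
for i in range(len(mil_list)):
    threshold = (i + 1) * 1000000
    result = np.where(forecast['yhat'] > threshold)
    mil_list[i] = result[0][0]  # first matching row; raises IndexError if there is none
print(mil_list)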
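
The lookup above fails with an `IndexError` if the forecast never goes above one of the thresholds, for example with a shorter horizon. Here is a minimal sketch of a more defensive lookup in plain Python. `first_crossing` is a hypothetical helper of my own (not part of Prophet or NumPy), and the numbers are made up for illustration.

```python
def first_crossing(values, threshold):
    """Return the index of the first value strictly above threshold, or None."""
    for index, value in enumerate(values):
        if value > threshold:
            return index
    return None


# Toy cumulative series standing in for forecast['yhat'] (made-up numbers)
toy_yhat = [250_000, 900_000, 1_400_000, 2_100_000, 2_600_000]

# Same thresholds as above: 1, 2, 3 and 4 million
crossings = [first_crossing(toy_yhat, (i + 1) * 1_000_000) for i in range(4)]
print(crossings)  # [2, 3, None, None]
```

A `None` entry means that milestone is not reached within the forecast horizon, so any code that annotates the plot would need to skip it.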

Use the trained model to get predictions on the test data


In [None]:
test_pred = df_test[['ds']].copy()
test_pred = m.predict(test_pred)
test_pred[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
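
Besides `yhat`, the prediction also carries `yhat_lower` and `yhat_upper`. One quick sanity check is to count how many actual values fall inside that band. The sketch below uses made-up numbers, and `interval_coverage` is a hypothetical helper, not a Prophet function. As far as I know Prophet's default `interval_width` is 0.8, so a well-calibrated model would cover roughly 80% of the points.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values lying inside [lower, upper], point by point."""
    inside = [lo <= a <= hi for a, lo, hi in zip(actual, lower, upper)]
    return sum(inside) / len(inside)


# Made-up values, not taken from the dataset
toy_actual = [10, 20, 30, 40]
toy_lower = [8, 21, 25, 35]
toy_upper = [12, 25, 35, 39]

coverage = interval_coverage(toy_actual, toy_lower, toy_upper)
print(coverage)  # 0.5, since only the 1st and 3rd points are inside the band
```

In this notebook it could be called as `interval_coverage(df_test['y'], test_pred['yhat_lower'], test_pred['yhat_upper'])`.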

Let's check the Mean Absolute Percentage Error (MAPE) of the model on the validation data

In [None]:
def mean_absolute_percentage_error(y_true, y_pred): 
    """Calculates MAPE given y_true and y_pred"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mape = round(mean_absolute_percentage_error(y_true=df_test['y'], y_pred=test_pred['yhat']),1)
print("Mape: ", mape)
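
To make the MAPE arithmetic concrete, here is a small worked example in plain Python with made-up numbers (not from the dataset). `mape_percent` is a hypothetical stand-in for the function above and assumes there are no zeros in the actual values.

```python
def mape_percent(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes no zero in y_true."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors) * 100


# Made-up values for illustration
actual = [100, 200, 400]
predicted = [110, 180, 400]

# Per-point relative errors: 10/100 = 0.10, 20/200 = 0.10, 0/400 = 0.00
# Mean = 0.20 / 3, so MAPE is about 6.67%
toy_mape = mape_percent(actual, predicted)
print(round(toy_mape, 2))  # 6.67
```

Each error is measured relative to the actual value, so a miss of 10 on a value of 100 counts the same as a miss of 20 on a value of 200.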

## Let's draw the predicted forecast and all the historical + actual data

In [None]:
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
def millions(x, pos):
    'The two args are the value and tick position'
    return '%1.1f Million' % (x * 1e-6)
formatter = FuncFormatter(millions)

import matplotlib.dates as mdates
monthyearFmt = mdates.DateFormatter('%B %Y')
locator = mdates.AutoDateLocator(minticks=3, maxticks=7)

fig = m.plot(forecast, xlabel='Date', ylabel='People Vaccinated')
ax = fig.gca()
ax.yaxis.set_major_formatter(formatter)
ax.xaxis.set_major_formatter(monthyearFmt)
ax.xaxis.set_major_locator(locator)
ax.scatter(df_test['ds'], df_test['y'], color='r', label='Actual data', marker='x')
ax.annotate('Model trained to this date',(df_train.iloc[-1]['ds'],df_train.iloc[-1]['y']), xytext=(0.4, 0.3), textcoords='axes fraction', arrowprops = dict(facecolor='green',color='green'))
for i in range(len(mil_list)):
    arrow_label = str(i+1) + ' million vaccinated'
    arrow_xpos = 0.1 + (i * 0.1)
    arrow_ypos = 0.25 + (i * 0.2)
    ax.annotate(arrow_label, (forecast.iloc[mil_list[i]]['ds'],forecast.iloc[mil_list[i]]['yhat']), xytext=(arrow_xpos, arrow_ypos), textcoords='axes fraction', arrowprops = dict(facecolor='blue',color='blue'))
ax.legend(loc='lower right', ncol=4)
mape_string = "Mean Absolute Percentage Error of actual vs predicted data: " + str(mape) + "%"
ax.text(1, 0.15, mape_string, horizontalalignment='right',transform=ax.transAxes)
ax.title.set_text('Vaccination Forecast for SG')

You can see a surge in new vaccinations during the first half of July, but the number of people vaccinated is levelling off at ~4 million.