## Dataset updated to 26th May. Vaccination forecast model is built using the first 80 data points

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/covid-vaccination-forecast/vaccinations_us.csv')
df.info()

In [None]:
df.head()

The first reported date is 20th December.

In [None]:
df.tail()

The US has 144 data entries so far. The last reported date is 26th May, so we have about 5 months of actual data.

Let's see the graph of people vaccinated to date!

In [None]:
import plotly.express as px  # plotly_express is the older standalone package name
fig = px.scatter(df, x='date', y='people_vaccinated', title="People vaccinated in USA",
                 labels={"people_vaccinated": "People Vaccinated (Million)"})
fig.show()

## FBProphet Model

The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric, and represents the measurement we wish to forecast.

In [None]:
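# Copy the two columns so the assignment below does not trigger SettingWithCopyWarning
df1 = df[['date', 'people_vaccinated']].copy()
df1['date'] = pd.to_datetime(df1['date'])
df1.columns = ['ds', 'y']  # Prophet expects exactly these column names
df1.head()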

In [None]:
df1.info()

I'm going to use the first 80 data points to train the model, and the remaining points to validate it later.

In [None]:
train_size = 80
test_size = df1.shape[0] - train_size
df_train = df1.head(train_size)
df_test = df1.tail(test_size)
print(df_train.shape, df_test.shape)

Create a Prophet object and fit it to the training data.

In [None]:
from fbprophet import Prophet
m = Prophet()
m.fit(df_train)

Predictions are then made on a dataframe with a column ds containing the dates for which a prediction is to be made. You can get a suitable dataframe that extends into the future a specified number of days using the helper method Prophet.make_future_dataframe. By default it will also include the dates from the history, so we will see the model fit as well.
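To get a feel for which dates a weekly future frame contains, here is a minimal sketch that imitates the idea with plain pandas instead of calling Prophet. The last training date is a hypothetical value, and the filtering logic is my assumption about how the helper behaves, not Prophet's actual code. The point to notice is that the new dates start after the last date the model was trained on, not after the last date in the full dataset.

```python
import pandas as pd

# Hypothetical last training date; it is not read from the vaccination data.
last_date = pd.Timestamp("2021-03-20")
periods = 12

# Assumed logic: build one extra candidate, drop anything that is not strictly
# after last_date, then keep the first `periods` dates.
candidates = pd.date_range(start=last_date, periods=periods + 1, freq="W")
future_dates = candidates[candidates > last_date][:periods]

# With freq="W", pandas anchors each date on a Sunday.
print(future_dates[0].date(), future_dates[-1].date())
```

Because weekly dates are anchored on Sundays, they will generally not line up with the daily dates in the history.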

Let's see the predictions for the 12 weeks after the last training date.

In [None]:
future = m.make_future_dataframe(periods=12, freq='W')
future.tail()

The predict method will assign each row in future a predicted value which it names yhat. If you pass in historical dates, it will provide an in-sample fit. The forecast object here is a new dataframe that includes a column yhat with the forecast, as well as columns for components and uncertainty intervals.

In [None]:
forecast = m.predict(future)
pd.options.display.float_format = '{:20,.0f}'.format
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

Let's plot the forecast!

In [None]:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def millions(x, pos):
    """Format a tick value in millions; the two args are the value and tick position."""
    return '%1.1f Million' % (x * 1e-6)

formatter = FuncFormatter(millions)
monthyearFmt = mdates.DateFormatter('%B %Y')
locator = mdates.AutoDateLocator(minticks=3, maxticks=7)

fig = m.plot(forecast, xlabel='Date', ylabel='People Vaccinated')
ax = fig.gca()
ax.yaxis.set_major_formatter(formatter)
ax.xaxis.set_major_formatter(monthyearFmt)
ax.xaxis.set_major_locator(locator)
ax.scatter(df_test['ds'], df_test['y'], color='r', label='Actual data', marker='x')
ax.annotate('Model trained to this date',(df_train.iloc[-1]['ds'],df_train.iloc[-1]['y']), xytext=(0.2, 0.4), textcoords='axes fraction', arrowprops = dict(facecolor='green',color='green'))
ax.legend(loc='lower right', ncol=4)
ax.title.set_text('Vaccination Forecast for USA')
plt.show()

Overall we can see a clear uptrend, but the actual data from May is starting to diverge from the forecast, showing slowing growth.
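One way to put a number on that divergence is to line up actuals and predictions by date and look at the residuals. The sketch below uses made-up toy values (not the vaccination data), and the frame names `actual` and `predicted` are hypothetical. Residuals that grow more negative over time mean the forecast is increasingly overshooting.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the vaccination dataset.
dates = pd.to_datetime(["2021-04-30", "2021-05-10", "2021-05-20"])
actual = pd.DataFrame({"ds": dates, "y": [100.0, 108.0, 112.0]})
predicted = pd.DataFrame({"ds": dates, "yhat": [100.0, 110.0, 120.0]})

# Align on the date column rather than relying on row position.
gap = actual.merge(predicted, on="ds", how="inner")
gap["residual"] = gap["y"] - gap["yhat"]            # negative: forecast too high
gap["pct_gap"] = 100 * gap["residual"] / gap["y"]   # residual as a % of the actual

print(gap)
```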

Use the trained model to get predictions for the dates in the test data.

In [None]:
test_pred = df_test[['ds']].copy()
test_pred = m.predict(test_pred)
test_pred[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

Let's check the Mean Absolute Percentage Error (MAPE) of the model on the validation data.

In [None]:
def mean_absolute_percentage_error(y_true, y_pred): 
    """Calculates MAPE given y_true and y_pred"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mean_absolute_percentage_error(y_true=df_test['y'], y_pred=test_pred['yhat'])

The MAPE is still low, but it is starting to get worse; my earlier MAPE was 0.8%.
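For anyone who wants to check the MAPE arithmetic by hand, here is a small worked example using the same formula as the function above: the mean of |y_true - y_pred| / |y_true|, times 100. The numbers are made up so the steps are easy to follow; they are not from the dataset.

```python
import numpy as np

# Made-up values chosen so the arithmetic is easy to follow.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

# Per-point absolute percentage errors: 10/100, 10/200, 0/400
abs_pct_errors = np.abs((y_true - y_pred) / y_true)

# Average the three errors and express the result as a percentage.
mape = abs_pct_errors.mean() * 100
print(round(mape, 2))
```

Note that the formula divides by `y_true`, so it is undefined for any point where the actual value is zero.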