# Forecasting Monthly Armed Robberies in Boston

**Contents**

1. Splitting dataset into data and validation
2. Transforming date column
3. Data analysis
4. Dickey-Fuller test for stationarity
5. ACF PACF plots
6. Differencing to make the series stationary
7. Train-test split
8. Box-Cox transformation
9. Building ARIMA model
10. Hyperparameter tuning
11. Rolling forecasting to capture random variation
12. Exponential Smoothing
13. Validation
14. Forecasting for next 1 year

In [None]:
import itertools
from datetime import datetime, timedelta

import scipy as sp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import boxcox
from sklearn.metrics import mean_squared_error
from statsmodels.graphics.gofplots import qqplot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.api import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
# Legacy ARIMA interface: its forecast() returns (forecast, stderr, conf_int),
# which the cells below rely on. It was removed in statsmodels 0.13.
from statsmodels.tsa.arima_model import ARIMA


from  pylab import rcParams
rcParams['figure.figsize'] = 25,8

import warnings
warnings.filterwarnings("ignore")

# Reading data 

## Splitting data into dataset and validation

In [None]:
series = pd.read_csv('../input/monthly-armed-robberies-in-boston/Robberies.csv')
split_point = len(series) - 12
# copy the slices so that adding columns later does not modify views of `series`
dataset, validation = series[0:split_point].copy(), series[split_point:].copy()
print('Dataset %d, Validation %d' % (len(dataset), len(validation)))

In [None]:
dataset.head()

In [None]:
validation.head()

# Preprocessing

In [None]:
dataset.tail()

## Transforming date column

In [None]:
# creating date range according to the data
# data has dates from 1966-01-31 to 1974-10-31 with monthly frequency
date = pd.date_range(start='1/1/1966', end='11/1/1974', freq='M')
date

In [None]:
# replacing original date column with newly created
dataset['Months'] = date

# setting date column as index of the dataframe
dataset.set_index('Months', inplace=True)
dataset.head()

# Data analysis

In [None]:
# stats
dataset.describe().T

**Inference**
1. 106 records in total
2. The mean number of robberies per month is 173.10
3. The standard deviation is large relative to the mean, which is consistent with robberies increasing over the years

In [None]:
dataset.plot()
plt.show()

**Inference:** Linear up-trend but no seasonality

In [None]:
sns.boxplot(x = dataset.index.month, y = dataset['Robberies'])
plt.show()

In [None]:
monthly_robberies_across_years = pd.pivot_table(dataset,
                                                values = 'Robberies',
                                                columns = dataset.index.year,
                                                index = dataset.index.month_name())
monthly_robberies_across_years.plot()
plt.grid()
plt.legend(loc='best');

**Inference:** Both graphs show no strong seasonality

# Stationarity test
**A series is stationary when its mean and variance stay constant over time. ARIMA-type forecasting models assume a stationary series. When the data is not stationary, we use differencing to transform the non-stationary series into a stationary one.**
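
As a rough illustration of that idea, the sketch below compares the mean of the first and second halves of a series, before and after differencing. The data is a hypothetical fixed-seed trending series, not the Boston data, and `half_means` is a helper introduced only for this example.

```python
import numpy as np

# Hypothetical series: a linear up-trend plus fixed-seed noise (not the Boston data).
rng = np.random.default_rng(42)
t = np.arange(120)
trending = 2.0 * t + rng.normal(scale=5.0, size=t.size)

def half_means(x):
    """Return the means of the first and second halves of a 1-D array."""
    mid = len(x) // 2
    return float(np.mean(x[:mid])), float(np.mean(x[mid:]))

first, second = half_means(trending)
diff_first, diff_second = half_means(np.diff(trending))

# The halves of the trending series have very different means;
# after differencing, the two halves have nearly the same mean.
print(f"trending series:    {first:.1f} vs {second:.1f}")
print(f"differenced series: {diff_first:.1f} vs {diff_second:.1f}")
```

A large gap between the two halves is only an informal hint of non-stationarity. The Dickey-Fuller test used later in this notebook is the formal check.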

In [None]:
# year wise box plot
sns.boxplot(x = dataset.index.year, y = dataset['Robberies'])
plt.show()

**Inference:** The graph shows a strong up-trend, which indicates non-stationary data. Let's check with the **A**uto**c**orrelation **F**unction plot and the **P**artial **A**utocorrelation **F**unction plot.

## ACF and PACF plots

In [None]:
# ACF plot
plot_acf(dataset, lags=100);

In [None]:
# PACF plot
plot_pacf(dataset);

**Inference:** The slow decay in the ACF and the irregular pattern in the PACF suggest a trend in the data. Let's confirm with a statistical test.
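
To make the "slow decay" reading concrete, here is a minimal sketch that computes the sample autocorrelation by hand. It assumes a hypothetical random walk with drift (fixed seed) as the trending series, and `sample_acf` is an illustrative helper, not the routine `plot_acf` uses internally.

```python
import numpy as np

def sample_acf(x, lag):
    """Sample autocorrelation of a 1-D array at a single positive lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

# Hypothetical random walk with drift (not the Boston data).
rng = np.random.default_rng(0)
trending = np.cumsum(0.5 + rng.normal(size=200))
differenced = np.diff(trending)

# Columns: lag, ACF of the trending series, ACF of the differenced series.
for lag in (1, 5, 10):
    print(lag, round(sample_acf(trending, lag), 3), round(sample_acf(differenced, lag), 3))
```

For the trending series the autocorrelation stays close to 1 across lags, while for the differenced series it stays close to 0.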

## Dickey-Fuller test

In [None]:
'''
Null hypothesis: the series is not stationary (it has a unit root)
Alternate hypothesis: the series is stationary
'''

test_result = adfuller(dataset.values)
print('ADF Statistic: %f' % test_result[0])
print('p-value: %f' % test_result[1])
print('Critical Values:')
for key, value in test_result[4].items():
    print('\t%s: %.5f' % (key, value))

**Inference:** The p-value is greater than 0.05 (a commonly used significance threshold). Hence, we fail to reject the null hypothesis, so the series is treated as non-stationary.<br>
**Differencing is required**
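
The decision rule applied above can be written as a small helper. This is only a sketch: `interpret_adf` is a hypothetical function name, and the p-values passed to it are made up to exercise both branches.

```python
def interpret_adf(p_value, alpha=0.05):
    """Map an ADF p-value to a plain-language decision at significance level alpha."""
    if p_value < alpha:
        return "reject H0: series looks stationary"
    return "fail to reject H0: series looks non-stationary"

# Hypothetical p-values, chosen only to exercise both branches.
print(interpret_adf(0.99))
print(interpret_adf(0.001))
```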

# Differencing by 1 lag
(`y` at time `t`) - (`y` at time `t-1`)
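
A small worked example of this formula, using hypothetical counts rather than the Boston data. It also shows that the original series can be recovered from the differences and the first observation.

```python
import numpy as np

# Hypothetical monthly counts (not taken from the Boston data).
y = np.array([41, 39, 50, 40, 43, 38], dtype=float)

# First-order difference: d[t] = y[t] - y[t-1]; the result is one element shorter.
d = np.diff(y)

# Undo the differencing: start from the first observation and accumulate the changes.
restored = np.concatenate(([y[0]], y[0] + np.cumsum(d)))

print(d)
print(restored)
```

In pandas, `diff(periods=1)` computes the same differences but keeps a leading `NaN`, which is why the next cell calls `dropna()`.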

In [None]:
df_diff1 = dataset.diff(periods=1).dropna()

test_result = adfuller(df_diff1.values)
print('ADF Statistic: %f' % test_result[0])
print('p-value: %f' % test_result[1])
print('Critical Values:')
for key, value in test_result[4].items():
    print('\t%s: %.5f' % (key, value))

**Inference:** The p-value is less than 0.05 (a commonly used significance threshold). Hence, we reject the null hypothesis, so the series is stationary after first-order differencing.<br>

In [None]:
plot_acf(df_diff1, lags=100);

In [None]:
plot_pacf(df_diff1);

# Train-test split
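
For a time series the split must follow time order, so the test set is the most recent block of observations. The sketch below shows label-based date slicing on a hypothetical stand-in series. It uses a month-start index (`freq="MS"`) for portability, whereas this notebook's index uses month-end dates.

```python
import pandas as pd

# Hypothetical stand-in: 106 monthly points starting January 1966.
index = pd.date_range(start="1966-01-01", periods=106, freq="MS")
toy = pd.Series(range(106), index=index)

# Label-based slicing on a DatetimeIndex includes both endpoints.
toy_train = toy.loc[:"1972-12-31"]
toy_test = toy.loc["1973-01-01":]

print(len(toy_train), len(toy_test))
# Every training date precedes every test date, so there is no look-ahead.
print(toy_train.index.max() < toy_test.index.min())
```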

In [None]:
dataset.index

In [None]:
# hold out the last 22 months (January 1973 to October 1974) as test data; the rest is training data
train_end = datetime(1972, 12, 31)
test_end = datetime(1974, 10, 31)

train = dataset[ : train_end]
test = dataset[train_end+timedelta(days=1) : test_end]

In [None]:
train.shape

In [None]:
test.shape

# ARIMA model
**Since there is a trend to capture, an ARIMA model is a suitable choice**

## Building the model

In [None]:
# order = (p, d, q) = (1, 1, 1): AR order p=1 (read from the PACF), differencing d=1, MA order q=1 (read from the ACF)
arima_model = ARIMA(train, order = (1, 1, 1))
model_fit = arima_model.fit()
print(model_fit.summary())

## Forecasting

In [None]:
arima_forecast = model_fit.forecast(steps = len(test))

In [None]:
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, arima_forecast[0], label='Forecast')
plt.legend(loc='best')
plt.grid()

In [None]:
rmse = np.sqrt(mean_squared_error(test.Robberies, arima_forecast[0]))
print(rmse)

## Mean Absolute Percentage Error

In [None]:
def MAPE(y_true, y_pred):
    return np.mean((np.abs(y_true-y_pred))/(y_true))*100

mape = MAPE(test['Robberies'].values, arima_forecast[0])
mape

**Inference:** The error is almost 14%
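
To see how RMSE and MAPE behave, here is a small worked example with hypothetical numbers chosen so the arithmetic is easy to follow. The values are not results from this notebook.

```python
import numpy as np

# Hypothetical actuals and forecasts, chosen so the arithmetic is easy to follow.
actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 180.0, 400.0])

errors = actual - forecast                               # [-10, 20, 0]
rmse_example = np.sqrt(np.mean(errors ** 2))             # sqrt((100 + 400 + 0) / 3)
mape_example = np.mean(np.abs(errors) / actual) * 100    # mean of 10%, 10%, 0%

print(round(rmse_example, 3), round(mape_example, 3))
```

RMSE is expressed in the units of the data and penalises large misses more heavily, while MAPE is a scale-free percentage. MAPE is undefined when an actual value is zero.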

In [None]:
results_df = pd.DataFrame({'Test RMSE': rmse,'Test MAPE':mape}
                           ,index=['ARIMA(1,1,1)'])

results_df

## Hyperparameter tuning
**We will look for the parameters with the lowest AIC value**
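
AIC trades goodness of fit against model size: AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximised likelihood. The sketch below applies that formula to hypothetical log-likelihoods. The numbers are made up, and the parameter count (p + q plus one for the noise variance) is an assumed convention that may differ from what a given library reports.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods for a few (p, d, q) orders; real values come from fitting.
log_likelihoods = {(0, 1, 1): -410.0, (1, 1, 1): -408.5, (2, 1, 2): -407.9}

# Count p + q coefficients plus one for the noise variance (an assumed convention).
scores = {order: aic(ll, order[0] + order[2] + 1) for order, ll in log_likelihoods.items()}
best_order = min(scores, key=scores.get)

print(scores)
print(best_order)
```

In this toy example the largest model has the best likelihood but not the best AIC, because its extra parameters do not improve the fit enough.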

In [None]:
# parameters for grid search
p = q = range(0, 4)
d = range(1, 2)
pdq = list(itertools.product(p, d, q))
print('Parameter combinations for the model')
for param in pdq:
    print('Model: {}'.format(param))

In [None]:
# Grid search: fit every (p, d, q) combination and record its AIC
rows = []

for param in pdq:
    try:
        model = ARIMA(train, order=param)
        model_fit = model.fit()
        print('ARIMA params {} - AIC {}'.format(param, model_fit.aic))
        rows.append({'param': param, 'AIC': model_fit.aic})
    except Exception:
        # skip combinations that fail to fit
        continue

print('==============================================')
arima_df = pd.DataFrame(rows, columns=['param', 'AIC']).sort_values('AIC')
print('Best params for ARIMA')
print(arima_df.head(1))

In [None]:
arima_model = ARIMA(train, order = (0, 1, 2))
model_fit = arima_model.fit()

arima_forecast = model_fit.forecast(steps = len(test))

In [None]:
rmse = np.sqrt(mean_squared_error(test.Robberies, arima_forecast[0]))
print(rmse)

In [None]:
mape = MAPE(test['Robberies'].values, arima_forecast[0])
print(mape)

In [None]:
results_df_temp = pd.DataFrame({'Test RMSE': rmse,'Test MAPE': mape}
                           ,index=['Tuned ARIMA(0, 1, 2)'])

results_df = pd.concat([results_df, results_df_temp])
results_df

**Inference:** Not much improvement in model accuracy

In [None]:
residuals = test['Robberies'] - arima_forecast[0]
qqplot(residuals,line="s");

# Transformation
**Let's check if transforming the data with Box-Cox method will improve our model** 
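
For reference, the Box-Cox transform maps a positive value x to (x^λ - 1) / λ, or to log(x) when λ = 0. The sketch below writes the transform and its inverse by hand and checks the round trip. The function names, sample values, and λ values are illustrative assumptions. In the notebook itself, `scipy.stats.boxcox` estimates λ from the data.

```python
import numpy as np

def boxcox_forward(x, lam):
    """Box-Cox transform for positive x: log(x) if lam == 0, else (x**lam - 1) / lam."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def boxcox_backward(z, lam):
    """Inverse of boxcox_forward: exp(z) if lam == 0, else (lam*z + 1)**(1/lam)."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

values = np.array([29.0, 74.0, 144.0, 271.0, 487.0])  # hypothetical positive counts
for lam in (0, 0.5, -0.3):  # arbitrary example lambdas
    round_trip = boxcox_backward(boxcox_forward(values, lam), lam)
    print(lam, np.allclose(round_trip, values))
```

The inverse matters because the model is fitted on the transformed scale, so forecasts must be mapped back before computing errors against the original counts.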

In [None]:
data = [x[0] for x in train.values]
transformed, lam = boxcox(data)

# the forecast will be in Box-Cox transformed units.
# Hence, we need to invert the values back to the original scale.
def boxcox_inverse(value, lam):
    if lam == 0:
        return np.exp(value)
    return np.exp(np.log(lam * value + 1) / lam)

In [None]:
# Fit the model with transformed data
arima_model = ARIMA(transformed, order=(0, 1, 2))
model_fit = arima_model.fit()

# Forecast for test
arima_forecast = model_fit.forecast(steps = len(test))

# Invert the transformation
arima_forecast = boxcox_inverse(arima_forecast[0], lam)

# Check RMSE
rmse = np.sqrt(mean_squared_error(test.Robberies, arima_forecast))

# Check MAPE (arima_forecast is already the back-transformed array, so do not index it)
mape = MAPE(test['Robberies'].values, arima_forecast)

results_df_temp = pd.DataFrame({'Test RMSE': rmse,'Test MAPE': mape}
                           ,index=['Transformed ARIMA(0, 1, 2)'])

results_df = pd.concat([results_df, results_df_temp])
results_df

**Inference:** The error increased, so the transformation does not help here

**Conclusion:** We will stick to the parameters found by hyperparameter tuning, that is, order (0, 1, 2): AR order p=0, differencing d=1 and MA order q=2, without transformation

## Rolling forecasting to capture random variation
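
The idea is walk-forward evaluation: forecast one step ahead, then add the true observation to the history before forecasting the next step. The sketch below shows only the loop structure. It uses a naive "repeat the last value" forecaster as a hypothetical stand-in for the ARIMA refit, and `walk_forward` is an illustrative helper name.

```python
def walk_forward(history, test_values, forecast_one):
    """One-step-ahead walk-forward evaluation.

    forecast_one(history) must return a single number; after each forecast
    the true observation is appended to history before the next step.
    """
    history = list(history)
    preds = []
    for obs in test_values:
        preds.append(forecast_one(history))
        history.append(obs)
    return preds

# Hypothetical stand-in forecaster: repeat the last observed value (naive forecast).
def naive(hist):
    return hist[-1]

print(walk_forward([10, 12, 15], [14, 18, 17], naive))
```

Because each forecast only looks one step ahead with up-to-date history, errors from a walk-forward run are not directly comparable with errors from a single multi-step forecast over the whole test period.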

In [None]:
predictions = []
data = [x[0] for x in train.values]


for i in range(0, len(test)):
        
    # predict
    model = ARIMA(data, order=(0, 1, 2))
    model_fit = model.fit()
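    # keep the point forecast as a plain float so that predictions stays one-dimensional
    yhat = float(np.ravel(model_fit.forecast()[0])[0])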
    predictions.append(yhat)
    
    # observation
    obs = test.iloc[i].values[0]
    data.append(obs)
    print('>Predicted=%.3f, Expected=%.3f' % (yhat, obs))

In [None]:
rmse = np.sqrt(mean_squared_error(test.Robberies, predictions))
rmse

In [None]:
mape = MAPE(test['Robberies'].values, predictions)
mape

In [None]:
results_df_temp = pd.DataFrame({'Test RMSE': rmse,'Test MAPE': mape}
                           ,index=['Rolling ARIMA(0, 1, 2)'])

results_df = pd.concat([results_df, results_df_temp])
results_df

In [None]:
plt.plot(train, label='Train')
plt.plot(test.index,test, label='Test')
plt.plot(test.index, predictions, label='Forecast')
plt.legend(loc='best')
plt.grid()

**Inference:** Our goal is to reduce MAPE. Let's see if Exponential Smoothing will do a better job.

# Exponential Smoothing model
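
As background, the sketch below implements Holt's additive-trend smoothing by hand. It is a simplified stand-in, not the model fitted below: it has no seasonal component, and the smoothing parameters `alpha` and `beta` are fixed by assumption rather than optimised as `ExponentialSmoothing(...).fit(optimized=True)` does.

```python
import numpy as np

def holt_linear_forecast(y, alpha, beta, steps):
    """Holt's additive-trend smoothing with fixed alpha and beta (no seasonal term)."""
    level, trend = y[0], y[1] - y[0]          # simple initial values
    for obs in y[1:]:
        prev_level = level
        # the new level blends the observation with the previous one-step forecast
        level = alpha * obs + (1 - alpha) * (level + trend)
        # the new trend blends the latest level change with the previous trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return np.array([level + h * trend for h in range(1, steps + 1)])

# Hypothetical, perfectly linear series: the forecast should continue the line.
y = [10.0, 12.0, 14.0, 16.0, 18.0]
print(holt_linear_forecast(y, alpha=0.5, beta=0.5, steps=3))
```

The forecast extends the last level by the last trend estimate, which is why this family of models suits series with a persistent up-trend.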

In [None]:
model_TES_add = ExponentialSmoothing(train, trend='additive', seasonal='additive', initialization_method='estimated')
model_TES_add = model_TES_add.fit(optimized=True)
model_TES_add.summary()

In [None]:
TES_add_predict =  model_TES_add.forecast(len(test))

rmse = np.sqrt(mean_squared_error(test.Robberies, TES_add_predict))
mape = MAPE(test['Robberies'], TES_add_predict)

In [None]:
results_df_temp = pd.DataFrame({'Test RMSE': rmse,'Test MAPE': mape}
                           ,index=['Exponential Smoothing'])

results_df = pd.concat([results_df, results_df_temp])
results_df

**Inference:** Final model is Exponential Smoothing

In [None]:
plt.plot(train, label='Train')
plt.plot(test.index,test, label='Test')
plt.plot(test.index, TES_add_predict, label='Forecast')
plt.legend(loc='best')
plt.grid()

# Validation

In [None]:
validation.shape

In [None]:
validation.head()

In [None]:
validation.tail()

## Preprocessing validation data

In [None]:
date = pd.date_range('11/1/1974', '11/1/1975', freq='M')
date

In [None]:
validation['Months'] = date
validation.set_index('Months', inplace=True)
validation.head()

## Model fitting and forecasting

In [None]:
model_TES_add = ExponentialSmoothing(dataset, trend='additive', seasonal='additive', initialization_method='estimated')
model_TES_add = model_TES_add.fit(optimized=True)
model_TES_add.summary()

In [None]:
TES_add_predict =  model_TES_add.forecast(len(validation))
plt.plot(validation, label='Validation')
plt.plot(validation.index, TES_add_predict, label='Forecast')
plt.legend()
plt.show()

# Forecasting for next 1 year

In [None]:
series.head()

In [None]:
series.tail()

## Preprocessing original data

In [None]:
date = pd.date_range('1/1/1966', '11/1/1975', freq='M')
date

In [None]:
series['Months'] = date
series.set_index('Months', inplace=True)
series.head()

## Manually creating next 1 year date fields

In [None]:
forecast_date = pd.date_range('11/1/1975', '11/1/1976', freq='M')
forecast_date

## Model building and forecasting the trend

In [None]:
model_TES_add = ExponentialSmoothing(series, trend='additive', seasonal='additive', initialization_method='estimated')
model_TES_add = model_TES_add.fit(optimized=True)
TES_add_predict =  model_TES_add.forecast(12)

In [None]:
# show the 12-month forecast (yhat was a leftover from the rolling ARIMA loop)
TES_add_predict

In [None]:
plt.plot(series, label='Data')
plt.plot(forecast_date, TES_add_predict, label='Forecast')
plt.legend()
plt.show()

# Conclusion
**Armed robberies are forecast to increase over the next year. The government and police should respond accordingly, for example by imposing stricter measures and deploying more personnel for patrolling.**