# Time Series Forecasting

In this notebook, we develop a time series forecasting model, specifically an [ARIMA model](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average), to predict customer/user growth, measured here as monthly revenue per account, which enterprises can use to help improve their businesses. We use a publicly available monthly revenue dataset from [Kaggle](https://www.kaggle.com/datasets/census/business-and-industry-reports?select=notes.txt).

In [3]:
# Import required packages
import os
import pandas as pd
import numpy as np
from math import ceil
from dotenv import load_dotenv
from sklearn.preprocessing import OneHotEncoder
import mlflow
from ipynb.fs.defs.helper_functions import (
     add_future_rows,
     forecast,
     symmetric_mean_absolute_percentage_error,
)
import warnings
# Suppress all warnings (this includes pandas' SettingWithCopyWarning)
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', 500)
from pmdarima.arima import auto_arima
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_percentage_error

## Set the notebook parameters

We set the following parameters, which control the date ranges the notebook uses for training, validation, and forecasting:


* `train_start_date` - The training data start date
* `train_end_date` - The training data end date
* `test_start_date` - The testing/validation data start date
* `test_end_date` - The testing/validation data end date
* `forecast_start_date` - The start date of the period we want to forecast
* `forecast_end_date` - The forecasting end date

In [4]:
train_start_date = '2012-01-01'
train_end_date = '2014-12-01'

test_start_date = '2015-01-01'
test_end_date = '2015-03-01'

forecast_start_date = '2015-04-01'
forecast_end_date = '2015-09-01'

# Number of months being predicted
prediction_window = 6
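
As a quick sanity check, `prediction_window` should equal the number of monthly steps between `forecast_start_date` and `forecast_end_date`. The sketch below assumes the data uses month-start dates (as the sample rows in this notebook do) and simply counts those steps with `pd.date_range`.

```python
import pandas as pd

# Values copied from the parameter cell
forecast_start_date = '2015-04-01'
forecast_end_date = '2015-09-01'
prediction_window = 6

# Month-start ('MS') timestamps from the forecast start to the forecast end, inclusive
forecast_months = pd.date_range(forecast_start_date, forecast_end_date, freq='MS')

# The window length and the date range should describe the same number of months
assert len(forecast_months) == prediction_window
print(len(forecast_months))
```

If the two disagree, either the dates or the window length needs adjusting before the future rows are generated.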

## Data Collection

We use the public revenue dataset from [Kaggle](https://www.kaggle.com/datasets/census/business-and-industry-reports?select=notes.txt). The data covers 50 different accounts, with one row per account per month giving the corresponding monthly revenue.

In [5]:
df = pd.read_csv('../data/processed/final_revenue_data.csv')

In [8]:
df.head()

Unnamed: 0,account_name,date,value
0,00XX_E_MPCP_US,2012-01-01,3.2
1,00XX_E_MPCP_US,2012-02-01,3.8
2,00XX_E_MPCP_US,2012-03-01,2.5
3,00XX_E_MPCP_US,2012-04-01,3.1
4,00XX_E_MPCP_US,2012-05-01,4.6


In [9]:
df.account_name.nunique()

50

### Using `tenure` as an input feature
We define **tenure** as the cumulative number of months in which an account has had non-zero revenue, counted up to and including the current month. We calculate the tenure for all accounts and use it as an input feature for model training.

In [7]:
# Flag months with non-zero revenue, then take the running count of those months within each account
df['tenure'] = (df['value'] > 0).groupby(df['account_name']).transform(lambda x: x.cumsum())
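
To make the definition concrete, here is the same expression applied to a small made-up frame (the accounts `A` and `B` and their values are illustrative only, not taken from the dataset). Months with zero revenue do not increase the count.

```python
import pandas as pd

# Toy data: two hypothetical accounts, with some zero-revenue months
toy = pd.DataFrame({
    'account_name': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'value': [3.2, 0.0, 2.5, 4.1, 0.0, 1.0, 0.0],
})

# Same expression as in the notebook: running count of non-zero months per account
toy['tenure'] = (toy['value'] > 0).groupby(toy['account_name']).transform(lambda x: x.cumsum())

print(toy['tenure'].tolist())
```

Account `A` stays at 1 during its zero-revenue month, and account `B` starts at 0 because its first month has no revenue.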

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1950 entries, 0 to 1949
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   account_name  1950 non-null   object 
 1   date          1950 non-null   object 
 2   value         1950 non-null   float64
 3   tenure        1950 non-null   int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 61.1+ KB


In [9]:
len(df)

1950

In [10]:
df_train = df[['account_name','date','tenure', 'value']]

In [11]:
df_train.head(10)

Unnamed: 0,account_name,date,tenure,value
0,00XX_E_MPCP_US,2012-01-01,1,3.2
1,00XX_E_MPCP_US,2012-02-01,2,3.8
2,00XX_E_MPCP_US,2012-03-01,3,2.5
3,00XX_E_MPCP_US,2012-04-01,4,3.1
4,00XX_E_MPCP_US,2012-05-01,5,4.6
5,00XX_E_MPCP_US,2012-06-01,6,2.5
6,00XX_E_MPCP_US,2012-07-01,7,2.5
7,00XX_E_MPCP_US,2012-08-01,8,3.2
8,00XX_E_MPCP_US,2012-09-01,9,3.0
9,00XX_E_MPCP_US,2012-10-01,10,4.2


## Add future rows

In [12]:
# We are forecasting out-of-sample dates, so append empty rows for each account
# covering the prediction window, which is missing from the dataset
new_grouped_df = add_future_rows(df_train, prediction_window)
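
The implementation of `add_future_rows` lives in `helper_functions` and is not shown in this notebook. The sketch below is a hypothetical stand-in (`add_future_rows_sketch` is not the real helper) that illustrates one plausible behavior: for each account, append `n_months` month-start rows with an empty `value`. It assumes `tenure` keeps incrementing in the appended rows, which matches the tenure values visible for the forecast months in the final output.

```python
import pandas as pd

def add_future_rows_sketch(df, n_months):
    """Hypothetical stand-in: append n_months empty monthly rows per account."""
    pieces = []
    for account, group in df.groupby('account_name', sort=False):
        group = group.assign(date=pd.to_datetime(group['date'])).sort_values('date')
        last_date = group['date'].iloc[-1]
        last_tenure = int(group['tenure'].iloc[-1])
        # Month-start dates immediately after the last observed month
        future_dates = pd.date_range(last_date + pd.DateOffset(months=1),
                                     periods=n_months, freq='MS')
        future = pd.DataFrame({
            'account_name': account,
            'date': future_dates,
            # Assumption: tenure keeps counting up so it can be used as a regressor
            'tenure': range(last_tenure + 1, last_tenure + 1 + n_months),
            # Revenue is unknown for future months
            'value': float('nan'),
        })
        pieces.append(pd.concat([group, future], ignore_index=True))
    return pd.concat(pieces, ignore_index=True)

# Toy input: two hypothetical accounts with two observed months each
toy = pd.DataFrame({
    'account_name': ['A', 'A', 'B', 'B'],
    'date': ['2015-02-01', '2015-03-01', '2015-02-01', '2015-03-01'],
    'tenure': [1, 2, 1, 1],
    'value': [3.0, 4.0, 5.0, 0.0],
})
extended = add_future_rows_sketch(toy, 2)
print(extended)
```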

## Model Forecasting

We will train an ARIMA time series forecasting model to predict the revenue for each account, using the historical **value** series as the target and **tenure** as an additional regressor.

In [13]:
df_train = new_grouped_df

In [14]:
# Parse the dates and index the data by account and month
df_train['date'] = pd.to_datetime(df_train['date'])
df_train.set_index(['account_name','date'], inplace=True)
# list of columns to be used as regressors for ARIMA model
regressor_cols_list = ['tenure']

result = df_train.groupby(['account_name']).apply(lambda x: forecast(x,train_start_date, train_end_date, 
                                                                     test_start_date, test_end_date, 
                                                                     forecast_start_date, forecast_end_date,
                                                                     regressor_cols_list))

result = result.droplevel(level=1)
result.reset_index(inplace=True)
result.head()

Unnamed: 0,account_name,date,tenure,value,validated_value,MAPE_monthly,SMAPE_monthly
0,00XX_E_MPCP_US,2012-01-01,1,3.2,,,
1,00XX_E_MPCP_US,2012-02-01,2,3.8,,,
2,00XX_E_MPCP_US,2012-03-01,3,2.5,,,
3,00XX_E_MPCP_US,2012-04-01,4,3.1,,,
4,00XX_E_MPCP_US,2012-05-01,5,4.6,,,
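
The `forecast` helper is not shown here, but it receives the three date windows defined at the top of the notebook. The sketch below only illustrates how those windows partition one account's rows; it uses a toy frame with a plain `DatetimeIndex` and placeholder values (in the notebook each group carries an `account_name`/`date` MultiIndex), and it omits model fitting entirely.

```python
import pandas as pd

train_start_date, train_end_date = '2012-01-01', '2014-12-01'
test_start_date, test_end_date = '2015-01-01', '2015-03-01'
forecast_start_date, forecast_end_date = '2015-04-01', '2015-09-01'

# Toy single-account frame spanning all three windows; values are placeholders
dates = pd.date_range(train_start_date, forecast_end_date, freq='MS')
one_account = pd.DataFrame({'tenure': range(1, len(dates) + 1), 'value': 1.0}, index=dates)
# Future months have no observed revenue
one_account.loc[forecast_start_date:, 'value'] = float('nan')

# Label-based date slicing is inclusive on both ends
train = one_account.loc[train_start_date:train_end_date]
test = one_account.loc[test_start_date:test_end_date]
future = one_account.loc[forecast_start_date:forecast_end_date]

print(len(train), len(test), len(future))
```

With the parameters used in this notebook, that gives 36 training months, 3 validation months, and 6 forecast months per account.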


In [15]:
result.rename(columns={'date':'month', 'value':'actual_mnth_value$', 'validated_value':'predicted_mnth_value$', 'MAPE_monthly':'MAPE', 'SMAPE_monthly':'sMAPE'}, inplace=True)

In [16]:
result = result[['account_name', 'month','tenure','actual_mnth_value$','predicted_mnth_value$', 'MAPE', 'sMAPE']]

In [17]:
result

Unnamed: 0,account_name,month,tenure,actual_mnth_value$,predicted_mnth_value$,MAPE,sMAPE
0,00XX_E_MPCP_US,2012-01-01,1,3.2,,,
1,00XX_E_MPCP_US,2012-02-01,2,3.8,,,
2,00XX_E_MPCP_US,2012-03-01,3,2.5,,,
3,00XX_E_MPCP_US,2012-04-01,4,3.1,,,
4,00XX_E_MPCP_US,2012-05-01,5,4.6,,,
...,...,...,...,...,...,...,...
2245,05XX_T_US,2015-05-01,41,,7310.528373,0.0,0.0
2246,05XX_T_US,2015-06-01,42,,7469.996223,0.0,0.0
2247,05XX_T_US,2015-07-01,43,,7402.641361,0.0,0.0
2248,05XX_T_US,2015-08-01,44,,7157.836641,0.0,0.0


In [18]:
# save the result to a CSV file
result.to_csv('../data/processed/forecast_result_Q2-Q3_test.csv')

## Conclusion

We used the auto ARIMA method to generate a forecast for each account's time series. With the aim of improving prediction accuracy, we included "tenure" as an additional regressor in the model.

To evaluate the reliability of our forecasts, we calculated the Mean Absolute Percentage Error (MAPE) and symmetric Mean Absolute Percentage Error (sMAPE) for each month of the validation data. These metrics quantify how far the predicted values deviate from the actual values. Rows in the forecast window have no actual values to compare against, so the 0.0 reported there is not a measure of accuracy.
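
For reference, the two metrics can be sketched in a few lines of NumPy. The `mape` function below follows the standard definition and, like `sklearn.metrics.mean_absolute_percentage_error`, returns a fraction rather than a percentage. sMAPE has several variants in the literature; the `smape` function here uses one common form, which is an assumption and may differ from the `symmetric_mean_absolute_percentage_error` helper imported in this notebook. The sample numbers are made up.

```python
import numpy as np

def mape(actual, predicted):
    """Mean of |actual - predicted| / |actual|, as a fraction."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual))

def smape(actual, predicted):
    """One common sMAPE variant: mean of 2|pred - actual| / (|actual| + |pred|)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(2.0 * np.abs(predicted - actual) / (np.abs(actual) + np.abs(predicted)))

# Made-up example: two months of actual vs. predicted revenue
actual = [100.0, 200.0]
predicted = [110.0, 180.0]

print(mape(actual, predicted))
print(smape(actual, predicted))
```

Note that MAPE is undefined when an actual value is zero, which is one reason to report sMAPE alongside it for accounts with zero-revenue months.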

The resulting forecasted data, taking into account the impact of tenure, has been stored for future analysis and decision-making.