## Contents
* [1. Improved Model](#1.-Improved-Model)
* [2. Read & Clean Data](#2.-Read-&-Clean-Data)
* [3. Modelling](#3.-Modelling)

---
## 1. Improved Model
---
- Develop an improved ARIMA model that trains on a shorter window and forecasts a shorter, 12-week horizon.


| Models        | Improved 5                                         |
|---------------|----------------------------------------------------|
| Model         | SARIMA                                             |
| Training data | - 2021-2022 Sep weekly dengue data<br>- Season = 3 |
| Testing data  | Oct - Dec 2022 (12 weeks)                          |

---
## 2. Read & Clean Data
---

In [13]:
import numpy as np 
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


# Visualisation:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

from tqdm import tqdm
tqdm.pandas()

# Summarise shape, duplicates, dtypes and missing values
def df_summary(df):
    print(f"Shape (rows, cols): {df.shape}")
    print(f"Number of duplicates: {df.duplicated().sum()}")
    print('---'*20)
    print(f'Number of each unique datatype:\n{df.dtypes.value_counts()}')
    print('---'*20)
    print("Columns with missing values:")
    isnull_df = pd.DataFrame(df.isnull().sum()).reset_index()
    isnull_df.columns = ['col','num_nulls']
    isnull_df['perc_null'] = ((isnull_df['num_nulls'])/(len(df))).round(2)
    print(isnull_df[isnull_df['num_nulls']>0])

# Modelling metrics:
from sklearn.metrics import mean_absolute_percentage_error

In [2]:
df = pd.read_csv("data/input/Dengue_weekly.csv")

In [3]:
df_summary(df)

Shape (rows, cols): (574, 6)
Number of duplicates: 0
------------------------------------------------------------
Number of each unique datatype:
object     3
int64      2
float64    1
dtype: int64
------------------------------------------------------------
Columns with missing values:
Empty DataFrame
Columns: [col, num_nulls, perc_null]
Index: []


In [4]:
df.dtypes

Epidemiology Wk    float64
Start               object
End                 object
Dengue               int64
DHF                  int64
month_year          object
dtype: object

In [5]:
df[['Start','End']] = df[['Start','End']].apply(pd.to_datetime, yearfirst=True)

In [6]:
df.dtypes

Epidemiology Wk           float64
Start              datetime64[ns]
End                datetime64[ns]
Dengue                      int64
DHF                         int64
month_year                 object
dtype: object

In [7]:
_df = df.copy()
_df.rename(columns={'Start':'Week'},inplace=True)
_df.set_index('Week',inplace=True)
_df.sort_index(inplace=True)

In [8]:
dengue_df = _df['Dengue'].resample('W').sum()
dengue_df = pd.DataFrame(dengue_df)
dengue_df

| Week       | Dengue |
|------------|--------|
| 2012-01-01 | 74     |
| 2012-01-08 | 64     |
| 2012-01-15 | 60     |
| 2012-01-22 | 50     |
| 2012-01-29 | 84     |
| ...        | ...    |
| 2022-11-27 | 242    |
| 2022-12-04 | 326    |
| 2022-12-11 | 289    |
| 2022-12-18 | 272    |
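
The weekly index above comes from `resample('W')`. As a minimal sketch of how that binning behaves, the toy series below (hypothetical values, not taken from the dengue file) shows that `'W'` closes each bin on a Sunday and labels it with that Sunday, so Sunday-dated rows keep their own date while a mid-week row is rolled forward into the next Sunday's total.

```python
import pandas as pd

# Hypothetical toy data: three Sunday dates plus one mid-week date (a Wednesday).
toy = pd.Series(
    [10, 20, 5, 30],
    index=pd.to_datetime(["2012-01-01", "2012-01-08", "2012-01-11", "2012-01-15"]),
    name="Dengue",
)

# 'W' is an alias for 'W-SUN': each bin ends on, and is labelled by, a Sunday.
weekly = toy.resample("W").sum()
print(weekly)
```

Here the Wednesday value is added to the bin labelled 2012-01-15. If the `Start` dates in the real data already fall on Sundays, resampling leaves one row per week and `sum()` only matters for gaps, which it fills with 0.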


---
## 3. Modelling
---
- SARIMA model using only 2021 and 2022 data

In [14]:
import pmdarima as pm # to do Auto ARIMA

In [33]:
_df = dengue_df.copy()

_df = _df[-52*2:]  # last 104 weeks, i.e. roughly 2021 and 2022

train = _df[:-12]  # everything in that window except the last 12 weeks
test = _df[-12:]   # last 12 weeks (Oct - Dec 2022)
len(test)/len(_df) # test is ~12% of the 2021-2022 window

0.11538461538461539
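
The split can be checked on a stand-in series. The sketch below is illustrative only: `holdout_split` is a hypothetical helper, and the series is synthetic, assumed to be weekly and to end on 2022-12-18 like the table above. It confirms that a 104-week window with a 12-week hold-out puts the first test week at the start of October 2022.

```python
import pandas as pd

def holdout_split(series, window, horizon):
    """Keep the last `window` points, then hold out the final `horizon` as test."""
    recent = series.iloc[-window:]
    return recent.iloc[:-horizon], recent.iloc[-horizon:]

# Synthetic weekly series ending on 2022-12-18 (values are placeholders).
idx = pd.date_range(end="2022-12-18", periods=200, freq="W")
series = pd.Series(range(200), index=idx)

train, test = holdout_split(series, window=104, horizon=12)
print(len(train), len(test))
print(train.index[0].date(), test.index[0].date())
```

Because the time order is preserved and the test block is strictly later than the training block, no future observations leak into the fit.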

In [34]:
arima_model = pm.AutoARIMA(start_p=0, max_p=10,
                           d=None,           # let auto-ARIMA pick the differencing order d
                           start_q=0, max_q=10,
                           trace=True,       # print each candidate fit
                           random=True,      # random search over the grid instead of an exhaustive one
                           stepwise=False,   # stepwise must be off for random=True to take effect
                           random_state=20,  # reproducible sampling order
                           n_fits=50         # cap the random search at 50 candidate fits
                          )
# Note: m is left at its default of 1, so the candidates in the trace below are non-seasonal, (0,0,0)[1].

arima_model.fit(train)  # fit on the single 'Dengue' column in train
pred = arima_model.predict(n_periods=len(test))
mean_absolute_percentage_error(test, pred)  # returned as a fraction, not a percentage

 ARIMA(3,1,1)(0,0,0)[1] intercept   : AIC=1065.455, Time=0.33 sec
 ARIMA(4,1,1)(0,0,0)[1] intercept   : AIC=1067.412, Time=0.36 sec
 ARIMA(4,1,0)(0,0,0)[1] intercept   : AIC=1066.068, Time=0.14 sec
 ARIMA(0,1,4)(0,0,0)[1] intercept   : AIC=1068.322, Time=0.26 sec
 ARIMA(0,1,1)(0,0,0)[1] intercept   : AIC=1074.808, Time=0.06 sec
 ARIMA(2,1,1)(0,0,0)[1] intercept   : AIC=1063.464, Time=0.16 sec
 ARIMA(2,1,2)(0,0,0)[1] intercept   : AIC=1065.453, Time=0.26 sec
 ARIMA(3,1,2)(0,0,0)[1] intercept   : AIC=1067.445, Time=0.27 sec
 ARIMA(2,1,3)(0,0,0)[1] intercept   : AIC=1067.412, Time=0.33 sec
 ARIMA(5,1,0)(0,0,0)[1] intercept   : AIC=1067.623, Time=0.16 sec
 ARIMA(1,1,4)(0,0,0)[1] intercept   : AIC=1066.900, Time=0.31 sec
 ARIMA(0,1,5)(0,0,0)[1] intercept   : AIC=1069.698, Time=0.17 sec
 ARIMA(1,1,2)(0,0,0)[1] intercept   : AIC=1065.397, Time=0.20 sec
 ARIMA(1,1,0)(0,0,0)[1] intercept   : AIC=1074.359, Time=0.07 sec
 ARIMA(0,1,0)(0,0,0)[1] intercept   : AIC=1073.446, Time=0.02 sec
 ARIMA(0,1

0.07903703602785125
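
The score above is a fraction, so 0.079 corresponds to an average absolute error of about 7.9% of the actual weekly count. As a rough sketch of what the metric computes, the hand-rolled `mape` function below (a hypothetical helper using made-up numbers, not the dengue data) averages `|actual - predicted| / |actual|` over the test points. Unlike the scikit-learn function, it does not guard against zero actual values.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.08 means 8%)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values for illustration only.
actual = [200, 250, 400]
predicted = [180, 275, 400]
print(mape(actual, predicted))
```

In this toy case the first two forecasts are each off by 10% and the third is exact, so the mean is 0.2 / 3.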

### Analysis
Improved Model 5 performed best of all the models developed, with a MAPE of about 0.08 on the 12-week test set:

| Models        | Improved 5                                         |
|---------------|----------------------------------------------------|
| Model         | SARIMA                                             |
| Training data | - 2021-2022 Sep weekly dengue data<br>- Season = 3 |
| Testing data  | Oct - Dec 2022 (12 weeks)                          |
| MAPE score    | 0.08                                               |