# Capstone Two - 5 Extended Modeling: AutoML with mljar-supervised<a id='5'></a>

## 5.1 Contents<a id='5.1'></a>
* [5 Extended Modeling](#5)
  * [5.1 Contents](#5.1)
  * [5.2 Imports](#5.2)
  * [5.3 Load The Data](#5.3)
  * [5.4 Data Processing and Modeling](#5.4)


## 5.2 Imports<a id='5.2'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

In [2]:
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

In [3]:
import warnings
warnings.filterwarnings('ignore')

## 5.3 Load The Data<a id='5.3'></a>

In [4]:
data_dir = '../data/'

data_ori = pd.read_csv(data_dir+'train_all_groups.csv')
data_ori.head(3)

Unnamed: 0,date,store_nbr,family,sales,onpromotion,city,state,store_type,cluster,dcoilwtico,transactions,holiday_type,transferred,year,month
0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,93.14,0.0,Holiday,False,2013,1
1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,93.14,0.0,Holiday,False,2013,1
2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13,93.14,0.0,Holiday,False,2013,1


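The pre-processing notebook showed that the daily oil price (`dcoilwtico`) and the holiday/non-holiday indicator contribute little to the prediction, so we drop those columns. For modeling we keep only `date`, `store_nbr`, `family`, `sales`, `onpromotion`, and `transactions`.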

In [5]:
# Copy the selected columns so that set_index does not operate on a slice of data_ori
data = data_ori[['date', 'store_nbr', 'family', 'sales', 'onpromotion', 'transactions']].copy()
data.set_index('date', inplace=True)
data.head(3)

date,store_nbr,family,sales,onpromotion,transactions
2013-01-01,1,AUTOMOTIVE,0.0,0,0.0
2013-01-01,1,BABY CARE,0.0,0,0.0
2013-01-01,1,BEAUTY,0.0,0,0.0


## 5.4 Data Processing and Modeling<a id='5.4'></a>

In [6]:
stores = data['store_nbr'].unique()
families = data['family'].unique()

In [7]:
grp = data.groupby(['store_nbr', 'family'])
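
To make the grouping step concrete, the sketch below shows how `groupby(...).get_group(...)` pulls out the series for one store/family pair. The frame `toy` and its values are invented for illustration and are not taken from the Capstone data.

```python
import pandas as pd

# Toy data (hypothetical values) with the same key columns as the notebook.
toy = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"],
        "store_nbr": [1, 2, 1, 2],
        "family": ["DAIRY", "DAIRY", "DAIRY", "DAIRY"],
        "sales": [10.0, 20.0, 12.0, 18.0],
    }
).set_index("date")

toy_grp = toy.groupby(["store_nbr", "family"])

# When grouping by several columns, the group key is passed as a tuple.
one_series = toy_grp.get_group((1, "DAIRY"))
print(one_series["sales"].tolist())
```

Each `(store_nbr, family)` pair becomes its own time series, which is why the modeling loop below fits one AutoML run per pair.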

In [8]:
stores1 = list(range(1, 7))  # stores No. 1-6
families1 = ['BREAD/BAKERY', 'DAIRY', 'GROCERY I']

In [9]:
# df = pd.DataFrame()
mAPE = []
groups = []

for s in stores1:
    for f in families1:

        ### Data Processing ###
        df =  grp.get_group((s, f))
        group = str(s)+","+f
        df = df.drop(columns=['store_nbr', 'family'])

        df.index = pd.DatetimeIndex(df.index).to_period('D')
        df = df.resample('1D').mean().ffill() # fill time gaps
        
        # Stores are closed on New Year's Day (zero sales), so reuse the previous day's values
        df.loc[['2014-01-01']] = df.loc[['2013-12-31']].values
        df.loc[['2015-01-01']] = df.loc[['2014-12-31']].values
        df.loc[['2016-01-01']] = df.loc[['2015-12-31']].values
        df.loc[['2017-01-01']] = df.loc[['2016-12-31']].values

        df = df.to_timestamp(freq='D') # from period to datetime

        # Weekday or weekend - important feature
        df = df.reset_index()
        df['day_of_week'] = df['date'].dt.day_name()
        df.loc[df['day_of_week'].isin(['Monday','Tuesday','Wednesday','Thursday','Friday']), 'day_of_week'] = 'weekday'
        df.loc[df['day_of_week'].isin(['Saturday','Sunday']), 'day_of_week'] = 'weekend'
        dummy_dow = pd.get_dummies(df['day_of_week'], dtype=float)
        df = pd.concat([df, dummy_dow], axis=1)
        
        # lags
        # For sales
        df['sales_lag1'] = df['sales'].shift(1)
        df['sales_lag7'] = df['sales'].shift(7)
        df['sales_7days_avg'] = df['sales_lag1'].rolling(7).mean().round(1)
        # For transactions
        df['trans_lag1'] = df['transactions'].shift(1)
        df['trans_lag7'] = df['transactions'].shift(7)
        df['trans_7days_avg'] = df['trans_lag1'].rolling(7).mean().round(1)

        df = df.drop(['day_of_week', 'transactions'], axis=1)
        df = df.set_index('date')
        
        df = df[7:] #start from day 8
        X_train, X_test, y_train, y_test = train_test_split(df[df.columns[1:]], df['sales'], test_size=16, shuffle=False)
        
        
        ### Modeling ###
        automl = AutoML()
        automl.fit(X_train, y_train)

        y_pred = automl.predict(X_test)
        
        APE_y = abs( (y_test - y_pred) / y_test)
        mAPE_y = np.mean(APE_y)
        print(group+' MAPE: '+str(mAPE_y))
    
        mAPE.append(mAPE_y)
        groups.append(group)

AutoML directory: AutoML_3
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline rmse 108.161968 trained in 0.31 seconds
2_DecisionTree rmse 65.523902 trained in 11.34 seconds
3_Linear rmse 69.84858 trained in 2.03 seconds
* Step default_algorithms will try to check up to 3 models
4_Default_Xgboost rmse 63.696606 trained in 2.52 seconds
5_Default_NeuralNetwork rmse 61.029549 trained in 0.49 seconds
6_Default_RandomForest rmse 63.775601 trained in 3.07 seconds
* Step ensemble will try to check up to 1 model
Ensemble rmse 60.546087 trained in 0.24 seconds
AutoML fit time: 26.15 seconds
AutoML best model: Ensemble
1,BREAD/BAKERY MAPE: 0.2079879962427437
AutoML directory: AutoML_4
The task is regression wit
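
The feature engineering and hold-out split inside the loop can be sketched in isolation on a toy series. This is a minimal illustration, not the notebook's exact code: the helper name `make_features` and the synthetic data are assumptions, and the weekday/weekend dummies are reduced to a single `weekend` flag for brevity.

```python
import numpy as np
import pandas as pd


def make_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Add a weekend flag and lagged sales features to a daily frame.

    `daily` is assumed to have a DatetimeIndex and a `sales` column.
    """
    out = daily.copy()
    out["weekend"] = (out.index.dayofweek >= 5).astype(float)  # Sat=5, Sun=6
    out["sales_lag1"] = out["sales"].shift(1)
    out["sales_lag7"] = out["sales"].shift(7)
    # Rolling over the lagged series keeps the current day's sales
    # out of its own 7-day average (no target leakage).
    out["sales_7days_avg"] = out["sales_lag1"].rolling(7).mean()
    # The first 7 rows have incomplete lags and are dropped.
    return out.dropna()


# Synthetic series (hypothetical): 30 days of steadily increasing sales.
idx = pd.date_range("2017-01-01", periods=30, freq="D")
toy = pd.DataFrame({"sales": np.arange(30, dtype=float)}, index=idx)
feats = make_features(toy)

# Chronological hold-out: the last 16 days form the test set, no shuffling.
X, y = feats.drop(columns="sales"), feats["sales"]
X_train, X_test = X.iloc[:-16], X.iloc[-16:]
y_train, y_test = y.iloc[:-16], y.iloc[-16:]
print(len(X_train), len(X_test))
```

Splitting by position rather than at random matters for time series: every training row precedes every test row, so the model is evaluated only on days it could not have seen.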

In [10]:
df_mape = pd.DataFrame()
df_mape['Group'] = groups
df_mape['MAPE'] = mAPE
df_mape.sort_values(by=['MAPE'])

Unnamed: 0,Group,MAPE
13,"5,DAIRY",0.06989
3,"2,BREAD/BAKERY",0.088599
17,"6,GROCERY I",0.089693
12,"5,BREAD/BAKERY",0.093863
7,"3,DAIRY",0.101299
14,"5,GROCERY I",0.103343
16,"6,DAIRY",0.104938
6,"3,BREAD/BAKERY",0.111173
8,"3,GROCERY I",0.113798
10,"4,DAIRY",0.115199
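
The MAPE values above are the mean of |actual - predicted| / |actual| over the 16 test days. That ratio is undefined on a day with zero actual sales, so a guarded version may be useful for other store/family pairs. The sketch below is one possible variant; the function name `mape` and the choice to skip zero-sales days are assumptions, not part of the notebook.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, skipping days with zero actual sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # assumption: zero-sales days are excluded, not imputed
    if not mask.any():
        return float("nan")  # no valid days to score
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])))


# Hypothetical values: two 10% errors and one zero-sales day that is skipped.
print(mape([100, 200, 0], [90, 220, 5]))
```

Skipping zero days changes what the metric measures, so results from this variant are only comparable with the table above when the test window contains no zero-sales days.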


In [11]:
# Lowest MAPE first: the three best-forecast groups
print("\033[1m" + "Top 3 MAPE:" + "\033[0m")
df_mape.nsmallest(3, 'MAPE')

Top 3 MAPE:


Unnamed: 0,Group,MAPE
13,"5,DAIRY",0.06989
3,"2,BREAD/BAKERY",0.088599
17,"6,GROCERY I",0.089693


In [12]:
# Highest MAPE first: the three worst-forecast groups
print("\033[1m" + "Bottom 3 MAPE:" + "\033[0m")
df_mape.nlargest(3, 'MAPE')

Bottom 3 MAPE:


Unnamed: 0,Group,MAPE
9,"4,BREAD/BAKERY",0.345387
0,"1,BREAD/BAKERY",0.207988
1,"1,DAIRY",0.195389
