# Practicalities of AutoML forecasting

In this notebook, we will walk through the process of actually pushing a real-world dataset through AutoML. AutoML expects clean data, which real-world data rarely is. We will show various techniques for detecting problems before they crash AutoML, and workarounds that will get you through the day.

Eventually, these tips and tricks will become part of the AutoML product, prettied up and made more robust.

Many of the cells in this notebook show a first attempt at doing something, which is later abandoned or cleaned up. They are marked with `DEAD END`. Don't run them; they are there so that you can learn from my mistakes, not yours.

To begin, let's encounter our first dead end. Let's read in some data, which lives in a database, using pyodbc.

In [1]:
# establish connection to the server where the data lives
import pyodbc

server = '<hidden>'
database= '<hidden>'
user='<hidden>'
table='grocery_sales'
passwd='<hidden>' 

conndrv = 'DRIVER={SQL Server Native Client 11.0};'
connpar = 'SERVER={{{0}}};DATABASE={{{1}}};UID={{{2}}};PWD={{{3}}}'
connstr = conndrv + connpar.format(server, database, user, passwd)

conn = pyodbc.connect(connstr)
crsr = conn.cursor()

In [None]:
# DEAD END
query = "select g.* from grocery_sales g"

# crsr.execute(query)
# rows = crsr.fetchall()

Why a dead end? A dozen cells below, this data proves to be problematic because it contains many short series for products that were sold only once or twice, or were seasonal specials. So we came back up here and re-pulled only the data that is machine-learnable. This gives us a problem: we will not be able to make forecasts for the items we left behind! We will fix that problem later, but let's make a note.

*Practical problem 1*: we left some grains behind. How are we going to forecast them?
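
To keep track of which grains get left behind, one option is to tag every row with the size of its series and split on a threshold. This is a minimal pandas sketch on toy data, not the notebook's actual query; the names `sales`, `min_obs`, `learnable` and `left_behind` are made up for illustration.

```python
import pandas as pd

# Toy data: item A has three observations, item B only one.
sales = pd.DataFrame({
    "SalesDate": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-01"]),
    "Item": ["A", "A", "A", "B"],
    "Site": ["s1"] * 4,
    "Channel": ["web"] * 4,
    "Quantity": [5, 3, 4, 1],
})
grain_cols = ["Item", "Site", "Channel"]
min_obs = 2  # hypothetical threshold, chosen only for this toy example

# Number of dated rows in each grain, broadcast back onto every row of that grain.
obs_per_row = sales.groupby(grain_cols)["SalesDate"].transform("count")

learnable = sales[obs_per_row >= min_obs]    # long enough to train on
left_behind = sales[obs_per_row < min_obs]   # needs some fallback forecast later

print(sorted(learnable["Item"].unique()))
print(sorted(left_behind["Item"].unique()))
```

Keeping `left_behind` around means the short grains are at least enumerated, so a fallback can be applied to them later.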

In [None]:
# SKIP this cell. We will load a smaller subset for the tutorial that will run faster.
# But this is what actually happens, so the code is here for you.

# Pull series whose first and last sale are more than 100 days apart
# (the sale days need not be consecutive).
# This is a good way to filter the data because it matches the requirement
# of cross-validation: N periods available (it's OK if the value is 0).

# Do the filtering in a small DB rather than in a big pandas job
# to avoid out-of-memory experiences.
query = "select g.* \
        from grocery_sales g join \
            ( \
            select Item, Site, Channel, count(1) as count \
            from grocery_sales \
            group by Item, Site, Channel \
            having DATEDIFF(day, min(SalesDate), max(SalesDate)) > 100 \
            ) long_grains \
            on (long_grains.Item = g.Item \
            and long_grains.Site = g.Site \
            and long_grains.Channel = g.Channel )"

crsr.execute(query)
rows = crsr.fetchall()

# parse the data coming from a database
import pandas as pd
SQLdata = pd.DataFrame(columns=["SalesDate", "Item", "Site", "Channel", "Quantity"]) 
SQLdata["SalesDate"] = [i[0] for i in rows]
SQLdata["Item"] = [i[1] for i in rows]
SQLdata["Site"] = [i[2] for i in rows]
SQLdata["Channel"] = [i[3] for i in rows]
SQLdata["Quantity"] = [i[4] for i in rows]

SQLdata.head()

Now we'll establish the time series metadata. We want to forecast the quantity sold by Item, Site, and Channel.

In [None]:
grain_colnames = ['Item', 'Site', 'Channel']
time_colname = 'SalesDate'
target_colname = 'Quantity'

# How many series do we have?
guppy = SQLdata.groupby(grain_colnames)

In [None]:
print("Rows pulled : " + str(len(SQLdata)))
print("Distinct time series : " + str(len(guppy)))

In [None]:
# make a nice function that fills out the data frame with:
# * zeros for the target column
# * NaNs for the rest of the data

import numpy as np

def fill_out_with_zeros(df, time_colname, grain_colnames, target_colname, freq = 'D', default_value=0):
    """
    For each series, fill every missing period between its first and last observation with default_value.
    
    Expects all data to be present in the columns, with no index.
    * df                : pd.DataFrame - the input data frame
    * time_colname      : String       - name of the time column
    * grain_colnames    : List[string] - names of the grain columns
    * target_colname    : String       - name of the target (ts value) column
    * freq              : String       - frequency of rows to fill (pandas offset alias like 'D', 'W')
    * default_value     : Number       - value written to the target where no row existed (default 0)
    
    # TODO: needs test with Xs!
    
    """
    
    if not np.issubdtype(df[time_colname].dtype, np.datetime64):
        print("WARNING: Your time column is not a datetime type. I'll convert it.")        
        df[time_colname] = pd.to_datetime(df[time_colname])
    
    # get the list of grains, with occurrence count as a bonus
    grps = df.groupby(grain_colnames)
    unique_grains = grps.size().to_frame().rename(columns = {0 : "count"})    
    date_ranges = grps.agg({time_colname : ['min', 'max'] })
    date_ranges.columns = ['min','max']
    
    # make a dataframe of all dates for all grains with zero in target
    
    default_colname = '__automl_DefaultTarget'
    
    # don't do this with a big join/filter because it might blow up too much
    indexes = []
    for index, row in unique_grains.iterrows():    
        grain_dates = pd.date_range(date_ranges.loc[index]["min"], date_ranges.loc[index]["max"], freq=freq).to_frame()
        for i, col in enumerate(grain_colnames):
            grain_dates[col] = index[i]        
        grain_dates.drop(columns=[0], inplace=True)
        grain_dates[default_colname] = default_value
        indexes.append(grain_dates)
        
    # put all the grain-level data frames together in one big df
    expected_values = pd.concat(indexes)    
    # the index is time, rename it to time_colname
    expected_values = expected_values.reset_index().rename(columns = {'index': time_colname})    
    # set same index for merge for the expected values and the indexed original df
    expected_values.set_index(grain_colnames + [time_colname], inplace=True)        
    indexed_df = df.set_index(grain_colnames + [time_colname])
    
    # merge on indexes
    merged_df = expected_values.merge(indexed_df, how='left', left_index=True, right_index=True)
    # replace those variables that are still null
    
    # most elegant way to do a coalesce in pandas?
    # alternatively
    # merged_df[target_colname] = np.where(merged_df[target_colname].isnull(), df[default_colname], df[target_colname] )
    # merged_df.loc[merged_df[target_colname].isnull(), target_colname] = merged_df.loc[merged_df[target_colname].isnull(), default_colname]
    coalesced = merged_df[target_colname].combine_first(merged_df[default_colname])
    merged_df[target_colname] = coalesced
    
    return merged_df.drop(columns=default_colname)
        
    

In [None]:
complete = fill_out_with_zeros(SQLdata, time_colname, grain_colnames, target_colname, 'D')
complete.shape
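
To see what this kind of gap filling does, here is a small stand-alone sketch on toy data. It assumes a simpler technique than the function above (a per-grain `reindex` instead of a merge against expected dates) and a single grain column; `toy`, `fill_gaps` and `filled` are illustrative names only.

```python
import pandas as pd

toy = pd.DataFrame({
    "SalesDate": pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-02", "2017-01-03"]),
    "Item": ["A", "A", "B", "B"],
    "Quantity": [5.0, 2.0, 7.0, 1.0],
})

def fill_gaps(group):
    # Reindex one grain onto every day between its own first and last date.
    g = group.set_index("SalesDate").sort_index()
    full_range = pd.date_range(g.index.min(), g.index.max(), freq="D")
    g = g.reindex(full_range)
    # Days with no sales record get a zero target; the grain label is restored.
    g["Quantity"] = g["Quantity"].fillna(0)
    g["Item"] = group["Item"].iloc[0]
    return g.rename_axis("SalesDate").reset_index()

filled = pd.concat([fill_gaps(g) for _, g in toy.groupby("Item")], ignore_index=True)
print(filled)
```

Item A gains two zero rows (January 2 and 3), while item B already had consecutive days and is unchanged.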

In [None]:
flat_complete = complete.reset_index()
flat_complete.to_csv("grocery_flat.csv", index=False)

In [None]:
# how do I split the data? It's about 5 years of data, 2012-01-01 to 2017-08-31.
# It probably does not make sense to do more than one month ahead of daily predictions
n_test_periods = 31 # days in August 2017

import numpy as np

def is_column_sorted_ascending(df, time_colname):
    a = df[time_colname].values
    return np.all(a[:-1] <= a[1:])

def split_last_n_by_grain(df, n, time_column_name, grain_column_names, min_grain_length=1):
    """
    Group df by grain and split on last n rows for test, remaining first as train for each group.
    """
    
    paranoid = False
    
    gcols = grain_column_names + [time_column_name]
    if all([c in df.columns for c in gcols]):
        method = 'columns'
    elif set(df.index.names) == set(gcols):
        method = 'index'
    else:
        print('Your index levels are ' + ', '.join(df.index.names))
        print('Your dataframe key is ' + ', '.join(gcols))
        
        raise ValueError('time_column_name, grain_column_names must either both be in columns of df, ' + 
                          'or the index levels must be precisely time_column_name + grain_column_names');
    
    if method == 'columns' and not np.issubdtype(df[time_column_name].dtype, np.datetime64):
        print('WARNING: Your time column is not a datetime type, this function might split lexicographically')
    
    if method == 'index' and not np.issubdtype(df.index.get_level_values(time_column_name).dtype, np.datetime64):
        print('WARNING: Your time index is not a datetime type, this function might split lexicographically')
    
    if method == 'columns':        
        # group, then apply filter, which makes it a df again
        long_grains = df.groupby(grain_column_names, group_keys=False).filter(lambda g : len(g) >= min_grain_length)
        # now you have a flat df, group again        
        df_grouped = (long_grains.sort_values(time_column_name).groupby(grain_column_names, group_keys=False))         
        # flipping the order of group and sort would be natural but is hard in pandas. 
        # So sort first, group second, check third
        if paranoid:
            # this relies on stability of grouping, so assert occasionally
            assert(all([is_column_sorted_ascending(dfs, time_column_name) for name, dfs in df_grouped]))
    elif method == 'index':
        long_grains = df.groupby(level=grain_column_names, group_keys=False).filter(lambda g : len(g) >= min_grain_length)
        df_grouped = (long_grains.sort_index().groupby(level=grain_column_names, group_keys=False))
        if paranoid:
            assert(all([dfs.index.is_monotonic_increasing for name, dfs in df_grouped]))
            
    df_train = df_grouped.apply(lambda dfg: dfg.iloc[:-n])    
    df_test = df_grouped.apply(lambda dfg: dfg.iloc[-n:])
    return df_train, df_test
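
The core idea of the split, stripped of the index handling and length filter, can be sketched with `groupby(...).tail(n)`. This is an illustrative alternative on toy data with a single grain column, not a replacement for the function above.

```python
import pandas as pd

toy = pd.DataFrame({
    "SalesDate": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-01-03",
         "2017-01-01", "2017-01-02", "2017-01-03"]),
    "Item": ["A", "A", "A", "B", "B", "B"],
    "Quantity": [1, 2, 3, 10, 20, 30],
})
n = 1  # number of trailing periods per grain held out for testing

# Sort by time within each grain, then take the last n rows of every grain as test.
ordered = toy.sort_values(["Item", "SalesDate"])
test = ordered.groupby("Item").tail(n)
# Everything not selected for test is training data (relies on a unique row index).
train = ordered.drop(test.index)

print(train)
print(test)
```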

# Restarting the processing from file if something bonks

In [None]:
# continue from saved data with a reset kernel
grain_colnames = ['Item', 'Site', 'Channel']
time_colname = 'SalesDate'
target_colname = 'Quantity'

import pandas as pd
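flat_complete = pd.read_csv('grocery_flat.csv', parse_dates=[time_colname])
# the file was written without an index, so rebuild the grain + time index used below
complete = flat_complete.set_index(grain_colnames + [time_colname])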

In [None]:
df_train, df_test = split_last_n_by_grain(complete, n_test_periods, time_colname, grain_colnames, min_grain_length=100)

In [None]:
print(df_train.shape)
print(df_test.shape)

In [None]:
# smaller dataframe for test and debug
smaller = complete.loc[(['599ADC', '933860'], slice(None), slice(None), slice(None)), :]
small_train, small_test = split_last_n_by_grain(smaller, n_test_periods, time_colname, grain_colnames, min_grain_length=100)
print(small_train.shape)
print(small_test.shape)

In [None]:
# this is for checking the time extent of the grains
ranges_train = small_train.reset_index().groupby(grain_colnames).agg({'SalesDate': ['min', 'max']})
ranges_train.columns = ['begin', 'end']
ranges_train_flat = ranges_train.reset_index()

ranges_test = small_test.reset_index().groupby(grain_colnames).agg({'SalesDate': ['min', 'max']})
ranges_test.columns = ['begin', 'end']
ranges_test_flat = ranges_test.reset_index()

# we should be able to plot that
%matplotlib notebook
import matplotlib
matplotlib.rcParams['figure.figsize'] = [10, 12]
import matplotlib.pyplot as plt

plt.hlines(range(len(ranges_train_flat)),xmin=ranges_train_flat['begin'].values,xmax=ranges_train_flat['end'].values)
plt.hlines(range(len(ranges_test_flat)),xmin=ranges_test_flat['begin'].values,xmax=ranges_test_flat['end'].values, colors='r')
ytick_tuples = list(zip(ranges_train_flat['Item'],ranges_train_flat['Site'], ranges_train_flat['Channel']))
# convert to str first so that numeric grain values don't break the join
ytick_labels = list(map(lambda t : '-'.join(map(str, t)), ytick_tuples))
plt.yticks(range(len(ranges_train_flat)), ytick_labels)
plt.show()

# Todo make that into a function
# maybe plot little series instead of ranges, or color-separate zero values, ....


In [None]:
# flatten for writing and save
flat_test = df_test.reset_index()
flat_test.to_csv("grocery_flat_test.csv", index=False)

In [None]:
flat_train = df_train.reset_index()
flat_train.to_csv("grocery_flat_trainvalid.csv", index=False)

### Save and switch kernel point - switch to automl env

In [1]:
# Restarting the processing from file after changing the kernel
grain_colnames = ['Item', 'Site', 'Channel']
time_colname = 'SalesDate'
target_colname = 'Quantity'

import pandas as pd
flat_train = pd.read_csv("grocery_flat_trainvalid.csv", parse_dates=[time_colname])
flat_test = pd.read_csv("grocery_flat_test.csv", parse_dates=[time_colname])


If the data is too large, we need to split the series into multiple series to be handled separately.

It is best to split by the value of a specified column. It should be one of the grain columns - that way grains are guaranteed not to split across bins. We will first group by column value, then approximately solve the bin packing problem with maximum volume equal to the max number of rows we are willing to have for one model.
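
As an illustration of what "approximately solve the bin packing problem" means here, the following is a minimal first-fit-decreasing sketch. It is independent of the `binpacking` package and is not claimed to be the algorithm that package uses; `first_fit_decreasing` and the toy `sizes` are made up for the example.

```python
def first_fit_decreasing(sizes, capacity):
    """Greedy bin packing: place each group, largest first, into the first bin with room."""
    bins = []   # list of dicts: {group_value: row_count}
    loads = []  # running total of rows per bin
    for key, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        for i, load in enumerate(loads):
            if load + size <= capacity:
                bins[i][key] = size
                loads[i] += size
                break
        else:
            # No existing bin has room (or the group alone exceeds capacity).
            bins.append({key: size})
            loads.append(size)
    return bins

# Row counts per value of the splitting column, e.g. rows per Item.
sizes = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2}
bins = first_fit_decreasing(sizes, capacity=10)
print(bins)
```

Because whole groups are placed into bins, every value of the splitting column lands in exactly one bin, which is what keeps grains from being split across models.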

In [23]:
def split_into_chunks_by_size(df, column, max_number_of_rows):
    """
    Split a dataframe into multiple data frames, each with up to max_number_of_rows rows.
    
    Takes a dataframe, the column on whose value to split, and maximum 
    size of the resulting dataframes.
    
    Returns two aligned lists
    * list of data frames which partition the original dataframe
    * list of sets of `column` values
    """
    
    if not column in df.columns:
        raise ValueError('Splitting column must be in dataframe')
    
    sizes = df.groupby(column).size()
    
    # There may be a more efficient bin packing solver,
    # or if there are sufficiently many groups, one could simply
    # sample N/average_grain_size times without replacement
    # and hope to end up with fairly even groups by virtue of central limit
    from binpacking import to_constant_volume
    bins = to_constant_volume(sizes.to_dict(), max_number_of_rows)
    
    allframes = []
    allindices = []
    for idx, modelbin in enumerate(bins):
        minidf = df[ df[column].isin(set(modelbin.keys())) ]
        allframes.append(minidf)
        allindices.append(set(modelbin.keys()))
        
    return allframes, allindices

def split_into_chunks_by_groups(df, column, valuesets):
    """
    Split a dataframe into multiple dataframes.
    
    Take a list of sets of values of the `column`.
    
    For each valueset in the list, a dataframe will be made consisting of those
    rows which have the value of the `column` in the set.
    
    """
    
    if not column in df.columns:
        raise ValueError('Splitting column must be in dataframe')
        
    allframes = []
    for idx, valueset in enumerate(valuesets):
        minidf = df[ df[column].isin(valueset) ]
        allframes.append(minidf)
        
    return allframes
    
max_number_of_rows = 1 * 1000 * 1000;
# train_frames, indices = split_into_chunks_by_size(flat_train, 'Item', max_number_of_rows)
# test_frames = split_into_chunks_by_groups(flat_test, 'Item', indices)

In [None]:
flat_train.groupby('Item').size()

In [None]:
# pickle the split datasets
import pickle
pickle.dump( (train_frames, test_frames, indices), open('split_datasets.pkl', 'wb'))

In [1]:
# and load them back after the lengthy procedure
import pickle
(train_frames, test_frames, indices) = pickle.load(open('split_datasets.pkl', 'rb'))

grain_colnames = ['Item', 'Site', 'Channel']
time_colname = 'SalesDate'
target_colname = 'Quantity'

In [3]:
idx = 0
model_valid_for = indices[idx]
X_train = train_frames[idx].copy()
X_test = test_frames[idx].copy()

y_train = X_train.pop(target_colname).values
y_test = X_test.pop(target_colname).values

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(999966, 4)
(18538, 4)
(999966,)
(18538,)


# Benchmark it!

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import logging
import warnings
import os
# Squash warning messages for cleaner output in the notebook
warnings.showwarning = lambda *args, **kwargs: None

import azureml.core
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [3]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-grocery'
# project folder
project_folder = './sample_projects/automl-grocery'
experiment = Experiment(ws, experiment_name)

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


In [4]:
output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Run History Name'] = experiment_name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

Location            westus2
Project Directory   ./sample_projects/automl-grocery
Resource Group      automl
Run History Name    automl-grocery
SDK version         0.1.0.3119914
Subscription ID     938fa533-eeb9-4121-b97f-05b31c6eb088
Workspace           automl-customers


In [5]:
time_series_settings = {
    'time_column_name': time_colname,
    'grain_column_names': grain_colnames,
    'drop_column_names': [],
    'max_horizon': 31
}

In [7]:
# does it work on one slice?
idx = 0
model_valid_for = indices[idx]
X_train = train_frames[idx].copy()
X_test = test_frames[idx].copy()

y_train = X_train.pop(target_colname).values
y_test = X_test.pop(target_colname).values

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(999966, 4)
(18538, 4)
(999966,)
(18538,)


In [8]:
automl_config = AutoMLConfig(task='forecasting',
                             debug_log='automl-grocery.log',
                             primary_metric='normalized_root_mean_squared_error',
                             iterations=5,
                             X=X_train,
                             y=y_train,                             
                             n_cross_validations=2,
                             enable_ensembling=False,
                             path=project_folder,
                             verbosity=logging.INFO,    
                             **time_series_settings)



local_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_c752ec5f-082f-46d1-af1d-e983f667beaa
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:00:17       0.0076  

In [9]:
local_run.id

'AutoML_c752ec5f-082f-46d1-af1d-e983f667beaa'

In [15]:
best_run, fitted_pipeline = local_run.get_output()
fitted_pipeline.steps
y_pred = fitted_pipeline.predict(X_test)

In [11]:
# check for empty levels

def empty_grains(df, grain_colnames):
    """
    Return the rows of df that belong to empty (zero-length) grains.
    """
    empties = df.groupby(grain_colnames).filter(lambda g: len(g) == 0)
    return empties

frames_with_empties = [empty_grains(df, grain_colnames).shape[0] > 0 for df in train_frames]

In [12]:
any(frames_with_empties)

False
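
The `False` here is not very informative: as far as I can tell, grouping on ordinary (non-categorical) columns only creates groups for values that actually occur, so no group can have zero rows. A check that can actually fail is whether the test set contains grains the training set never saw. A minimal sketch on toy frames, with illustrative names throughout:

```python
import pandas as pd

train = pd.DataFrame({"Item": ["A", "A", "B"], "Site": ["s1", "s1", "s1"], "Quantity": [1, 2, 3]})
test = pd.DataFrame({"Item": ["A", "C"], "Site": ["s1", "s2"], "Quantity": [4, 5]})
grain_cols = ["Item", "Site"]

# Collect each frame's distinct grains as plain tuples so they can be compared as sets.
train_grains = set(train[grain_cols].itertuples(index=False, name=None))
test_grains = set(test[grain_cols].itertuples(index=False, name=None))

# Grains that appear in test but were never seen in training.
unseen = test_grains - train_grains
print(unseen)
```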

In [16]:
def MAPE(actual, pred):
    """
    Calculate mean absolute percentage error.
    Remove NA and values where actual is close to zero
    """
    not_na = ~(np.isnan(actual) | np.isnan(pred))
    not_zero = ~np.isclose(actual, 0.0)
    actual_safe = actual[not_na & not_zero]
    pred_safe = pred[not_na & not_zero]
    APE = 100*np.abs((actual_safe - pred_safe)/actual_safe)
    return np.mean(APE)

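def SMAPE(actual, pred):
    """
    Calculate symmetric mean absolute percentage error.
    Remove NA and values where actual is close to zero.
    Note: the denominator is the signed sum actual + pred, so the result
    is not bounded by 200 when actual and pred have opposite signs.
    """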
    not_na = ~(np.isnan(actual) | np.isnan(pred))
    not_zero = ~np.isclose(actual, 0.0)
    actual_safe = actual[not_na & not_zero]
    pred_safe = pred[not_na & not_zero]
    SAPE = 100*np.abs(2 * (actual_safe - pred_safe)/(actual_safe + pred_safe))
    return np.mean(SAPE)

print("[Test Data] \nRoot Mean squared error: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))
print('mean_absolute_error score: %.2f' % mean_absolute_error(y_test, y_pred))
print('MAPE: %.2f' % MAPE(y_test, y_pred))
print('SMAPE: %.2f' % SMAPE(y_test, y_pred))

[Test Data] 
Root Mean squared error: 24.55
mean_absolute_error score: 13.32
MAPE: 109.63
SMAPE: 496.36
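
A SMAPE of 496.36 looks odd, because the usual definition tops out at 200. With the formula used above, a term can only exceed 200 when the actual and predicted values have opposite signs, since the denominator is a signed sum. A bounded variant, offered as a sketch rather than as the metric AutoML itself reports, takes absolute values in the denominator; `smape_bounded` is a made-up name.

```python
import numpy as np

def smape_bounded(actual, pred):
    """SMAPE with absolute values in the denominator; the result lies in [0, 200]."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    denom = np.abs(actual) + np.abs(pred)
    # Drop NaNs and pairs where both values are (close to) zero.
    keep = ~(np.isnan(actual) | np.isnan(pred)) & ~np.isclose(denom, 0.0)
    return np.mean(200.0 * np.abs(actual[keep] - pred[keep]) / denom[keep])

actual = np.array([10.0, 4.0])
pred = np.array([-5.0, 4.0])   # one prediction has the wrong sign

bounded = smape_bounded(actual, pred)
# The signed-denominator formula from the cell above, for comparison.
signed = np.mean(100 * np.abs(2 * (actual - pred) / (actual + pred)))
print(bounded, signed)
```

On this toy input the bounded version gives 100, while the signed-denominator version gives 300.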


In [None]:
# check for empty dataframes in train_frames
[idx for idx, df in enumerate(train_frames) if len(df)==0]

In [19]:
# the serial version because I suck at parallel
class CompositeModel:    
    
    def __init__(self, split_column, target_column):
        self._split_column = split_column
        self._target_column = target_column
        self._item_run_map = dict()
        self._item_model_map = dict()
        self._model_impls = dict()
        self._indices = dict()
    

    # todo: redo this so that this class splits its own training data
    # on the splitcolumn, ideally in sequence, rather than materializing
    # the whole dataset again in memory as a sequence of chunks. 
    
    def fit(self, train_frames, indices):    
        
        self._indices = indices
            
        # TODO: this is hard-wired and overwrites outer scope        
        time_series_settings = {
            'time_column_name': time_colname,
            'grain_column_names': grain_colnames,
            'drop_column_names': [],
            'max_horizon': 31
        }
        
        for idx, Xy in enumerate(train_frames):
            
            if len(Xy) == 0:
                print('Warning: found a zero-length frame at index ' + str(idx))
                continue
            
            X_train = Xy.copy()            
            y_train = X_train.pop(self._target_column).values                        
          
            automl_config = AutoMLConfig(task='forecasting',
                                debug_log='automl-grocery.log',
                                primary_metric='normalized_root_mean_squared_error',
                                iterations=5,
                                X=X_train,
                                y=y_train,                             
                                n_cross_validations=3,
                                enable_ensembling=False,
                                path=project_folder,
                                verbosity=logging.INFO,    
                                **time_series_settings)
        
            # get the model and metadata
            local_run = experiment.submit(automl_config, show_output=True) # Parent run 
            best_run, fitted_pipeline = local_run.get_output()             # Favorite child
            model_id = best_run.id
            print('Learned model ' + str(model_id))  # this is not working - needs a different ID
        
            # record the model for item
            self._model_impls[model_id] = fitted_pipeline
            for item in self._indices[idx]:
                self._item_model_map[item] = model_id
                self._item_run_map[item] = best_run.id
                
                
    def forecast(self, X_test, y_test):
        
        # split X and y together by splitcolumn
        X_copy = X_test.copy()
        X_copy['__automl_target_column'] = y_test
        chunks = split_into_chunks_by_groups(X_copy, self._split_column, self._indices)
        
        ys = []
        X_transes = []
        for chunk in chunks:
            
            # skip potentially empty splits
            if len(chunk) == 0:
                continue
            
            # Look up the right model. It should be the same model 
            # for the whole chunk by construction
            item = chunk.loc[chunk.index[0], self._split_column]
            modelid = self._item_model_map[item]
            print('Using model ' + str(modelid))
            model = self._model_impls[modelid]
            
            paranoid = True
            if paranoid:
                for item2 in pd.unique(chunk[self._split_column]):
                    # no parentheses around (condition, message): asserting a tuple is always true
                    assert (
                           (item2 in self._item_model_map.keys()) and 
                           (self._item_model_map[item2] == modelid)
                    ), 'Item ' + str(item2) + ' is not mapped to the same model ' + str(modelid) + ' as ' + str(item)
                           
            y_chunk = chunk.pop('__automl_target_column').values
            y_pred, X_trans = model.forecast(chunk, y_chunk)
            # collect the predictions, not the actuals that were passed in
            ys.append(y_pred)
            X_transes.append(X_trans)
            
        return np.concatenate(ys), pd.concat(X_transes)                    
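
The routing idea in `forecast` (look up the model for each item, then predict chunk by chunk) can be shown in isolation. The sketch below uses a hypothetical `ConstantModel` stand-in rather than a fitted AutoML pipeline, and writes predictions back by row index so that they stay aligned with the input order.

```python
import pandas as pd

class ConstantModel:
    """Hypothetical stand-in for a fitted pipeline; always predicts one value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value] * len(X)

# Which model is responsible for which item, and the models themselves.
item_model_map = {"A": "m1", "B": "m1", "C": "m2"}
models = {"m1": ConstantModel(1.0), "m2": ConstantModel(2.0)}

X = pd.DataFrame({"Item": ["A", "C", "B", "C"]})
model_ids = X["Item"].map(item_model_map)

# Predict one chunk per model and place the results at the original row positions.
preds = pd.Series(index=X.index, dtype=float)
for model_id, rows in X.groupby(model_ids):
    preds.loc[rows.index] = models[model_id].predict(rows)

print(preds.tolist())
```

Concatenating per-chunk results, as the class above does, returns rows in chunk order instead, so the caller has to realign them with the original test frame.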

In [20]:
cm = CompositeModel(split_column = 'Item', target_column = 'Quantity')
cm.fit(train_frames, indices)

Running on local machine
Parent Run ID: AutoML_0982cbf9-2ef2-4be5-bb6b-aa12952d6f2c
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:01:07       0.0076  

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:00:20       0.0039    0.0039
         1   StandardScalerWrapper ElasticNet               0:00:16       0.0039    0.0039
         2   StandardScalerWrapper ElasticNet               0:01:27       0.0036    0.0036
         3   VotingEnsemble                                 0:00:14       0.0036    0.0036
         4   StackEnsemble                                  0:00:12       0.0036    0.0036
Learned model AutoML_bbe61b86-5a2c-4b05-a4a8-3462d897fcb6_3
Running on local machine
Parent Run ID: AutoML_c456037c-d921-4b31-9dd2-11962a9b98be
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

***************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:00:20       0.0055    0.0055
         1   StandardScalerWrapper ElasticNet               0:00:15       0.0055    0.0055
         2   StandardScalerWrapper ElasticNet               0:00:24       0.0051    0.0051
         3   VotingEnsemble                                 0:00:17       0.0051    0.0051
         4   StackEnsemble                                  0:00:13       0.0051    0.0051
Learned model AutoML_542712c0-a327-423b-b384-b2e57c84bacc_4
Running on local machine
Parent Run ID: AutoML_cf596133-567a-4134-83e6-a015ecedb09c
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

***************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:00:22       0.0092    0.0092
         1   StandardScalerWrapper ElasticNet               0:00:18       0.0092    0.0092
         2   StandardScalerWrapper ElasticNet               0:00:23       0.0088    0.0088
         3   VotingEnsemble                                 0:00:12       0.0088    0.0088
         4   StackEnsemble                                  0:00:14       0.0088    0.0088
Learned model AutoML_030b3043-41a6-44fa-ad3f-64d8a1337acf_2
Running on local machine
Parent Run ID: AutoML_efa343c3-9368-4b2e-a779-894df84dcc8e
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

***************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:00:23       0.0052    0.0052
         1   StandardScalerWrapper ElasticNet               0:00:17       0.0052    0.0052
         2   StandardScalerWrapper ElasticNet               0:00:23       0.0050    0.0050
         3   VotingEnsemble                                 0:00:35       0.0050    0.0050
         4   StackEnsemble                                  0:00:16       0.0050    0.0050
Learned model AutoML_2f06c6f4-b978-4377-b24c-0fb8b6b24c3c_2
Running on local machine
Parent Run ID: AutoML_c3fea07c-97e3-4a0c-8d98-8626a514aafb
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

***************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:00:25       0.0043    0.0043
         1   StandardScalerWrapper ElasticNet               0:00:18       0.0043    0.0043
         2   StandardScalerWrapper ElasticNet               0:00:29       0.0041    0.0041
         3   VotingEnsemble                                 0:00:16       0.0041    0.0041
         4   StackEnsemble                                  0:00:15       0.0041    0.0041
Learned model AutoML_5766f42a-fb7e-40b6-8146-4dc45fd9fa1a_2
Running on local machine
Parent Run ID: AutoML_6edcd3ba-490f-47a3-81c4-bfdfacdac5b0
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

***************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:00:30       0.0077    0.0077
         1   StandardScalerWrapper ElasticNet               0:00:21       0.0076    0.0076
         2   StandardScalerWrapper ElasticNet               0:00:29       0.0074    0.0074
         3   VotingEnsemble                                 0:00:21       0.0074    0.0074
         4   StackEnsemble                                  0:00:18       0.0076    0.0074
Learned model AutoML_5f7dbb75-06fa-4c66-8834-23de37b78f48_2
Running on local machine
Parent Run ID: AutoML_8c2741c0-eb71-4f40-a750-5fd50e208627
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

***************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:00:43       0.0108    0.0108
         1   StandardScalerWrapper ElasticNet               0:00:39       0.0108    0.0108
         2   StandardScalerWrapper ElasticNet               0:00:42       0.0102    0.0102
         3   VotingEnsemble                                 0:00:28       0.0102    0.0102
         4   StackEnsemble                                  0:00:24       0.0102    0.0102
Learned model AutoML_86d5ebdc-15da-406d-a76e-ab868ae735c0_3
Running on local machine
Parent Run ID: AutoML_06a3fb5b-7a5f-4c0a-ba0a-eeb0535a5bd8
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

***************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:09:57       0.0101    0.0101
         1   StandardScalerWrapper ElasticNet               0:06:55       0.0101    0.0101
         2   StandardScalerWrapper ElasticNet               0:06:57       0.0091    0.0091
         3   VotingEnsemble                                 0:02:20       0.0091    0.0091
         4   StackEnsemble                                  0:02:06       0.0091    0.0091
Learned model AutoML_70e5f24c-f159-4481-a5f0-1393d973b5de_3


In [24]:
# test the forecasts
X_test0 = test_frames[0].copy()
# pop the target column and blank it out: NaN marks the values to be forecast
y_test0 = X_test0.pop(target_colname).values
y_test0.fill(np.nan)
r0 = cm.forecast(X_test0, y_test0)

Using model AutoML_0982cbf9-2ef2-4be5-bb6b-aa12952d6f2c_4


In [25]:
X_test1 = test_frames[1].copy()
y_test1 = X_test1.pop(target_colname).values
y_test1.fill(np.nan)
r1 = cm.forecast(X_test1, y_test1)

Using model AutoML_680e0e70-1c4d-4241-bf55-9a7c4793fb0d_4


In [26]:
# test running on all groups
rall = cm.forecast(pd.concat([X_test0, X_test1]), np.concatenate([y_test0, y_test1]))

Using model AutoML_0982cbf9-2ef2-4be5-bb6b-aa12952d6f2c_4
Using model AutoML_680e0e70-1c4d-4241-bf55-9a7c4793fb0d_4


In [None]:
def MAPE(actual, pred):
    """
    Calculate mean absolute percentage error.
    Remove NA and values where actual is close to zero
    """
    not_na = ~(np.isnan(actual) | np.isnan(pred))
    not_zero = ~np.isclose(actual, 0.0)
    actual_safe = actual[not_na & not_zero]
    pred_safe = pred[not_na & not_zero]
    APE = 100*np.abs((actual_safe - pred_safe)/actual_safe)
    return np.mean(APE)

def SMAPE(actual, pred):
    """
    Calculate symmetric mean absolute percentage error.
    Remove NA and values where actual is close to zero
    """
    not_na = ~(np.isnan(actual) | np.isnan(pred))
    not_zero = ~np.isclose(actual, 0.0)
    actual_safe = actual[not_na & not_zero]
    pred_safe = pred[not_na & not_zero]
    # absolute values in the denominator keep it positive even when pred is negative
    SAPE = 100*2*np.abs(actual_safe - pred_safe)/(np.abs(actual_safe) + np.abs(pred_safe))
    return np.mean(SAPE)

from sklearn.metrics import mean_squared_error, mean_absolute_error

print("[Test Data] \nRoot mean squared error: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))
print('Mean absolute error: %.2f' % mean_absolute_error(y_test, y_pred))
print('MAPE: %.2f' % MAPE(y_test, y_pred))
print('SMAPE: %.2f' % SMAPE(y_test, y_pred))
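
To see how the two percentage metrics treat missing and zero values, here is a minimal, self-contained sketch on made-up numbers. The lowercase helpers `mape` and `smape` and the toy arrays are illustrative assumptions, not part of the notebook's pipeline; they follow the same filtering idea as the functions above (drop NaNs and rows where the actual is close to zero).

```python
import numpy as np

def mape(actual, pred):
    """Mean absolute percentage error, skipping NaNs and near-zero actuals."""
    keep = ~(np.isnan(actual) | np.isnan(pred)) & ~np.isclose(actual, 0.0)
    a, p = actual[keep], pred[keep]
    return np.mean(100 * np.abs((a - p) / a))

def smape(actual, pred):
    """Symmetric MAPE: error is scaled by |actual| + |pred| instead of actual."""
    keep = ~(np.isnan(actual) | np.isnan(pred)) & ~np.isclose(actual, 0.0)
    a, p = actual[keep], pred[keep]
    return np.mean(100 * 2 * np.abs(a - p) / (np.abs(a) + np.abs(p)))

# Made-up data: the third row has a zero actual, the fourth a missing actual.
# Both rows are dropped before averaging.
actual = np.array([100.0, 200.0, 0.0, np.nan])
pred = np.array([110.0, 180.0, 5.0, 50.0])

toy_mape = mape(actual, pred)
toy_smape = smape(actual, pred)
print('MAPE: %.2f' % toy_mape)
print('SMAPE: %.2f' % toy_smape)
```

Both surviving rows are off by exactly 10% of the actual, so MAPE is 10. SMAPE comes out slightly different (about 10.03) because it divides by the sum of actual and forecast: the over-forecast on the first row is penalized a little less than the under-forecast on the second.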