# Automated Machine Learning
**BikeShare Demand Forecasting**

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Evaluate](#Evaluate)

<img src="https://cdn.thenewstack.io/media/2018/10/2e4f0988-az-ml-0.png">

> Documentation : https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-auto-train-forecast 

## Introduction
In this example, we show how AutoML can be used for bike share forecasting.

The purpose is to demonstrate how to take advantage of the built-in holiday featurization, access the feature names, and further demonstrate how to work with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

In this notebook you will see:
1. Creating an Experiment in an existing Workspace
2. Instantiating `AutoMLConfig` with the task type "forecasting" for time-series training, along with other time-series settings; for this dataset we use the basic one, `time_column_name`
3. Training the Model using local compute
4. Exploring the results
5. Viewing the engineered names for featurized data and featurization summary for all raw features
6. Testing the fitted model

## Setup


In [29]:
import azureml.core
import pandas as pd
import numpy as np
import logging
import warnings
# Squash warning messages for cleaner output in the notebook
warnings.showwarning = lambda *args, **kwargs: None

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from matplotlib import pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [30]:
import azureml.core
print("Version Azure ML service :", azureml.core.VERSION)

import sys
sys.version

Version Azure ML service : 1.0.48


'3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 18:50:55) [MSC v.1915 64 bit (AMD64)]'

As part of the setup you have already created a <b>Workspace</b>. For AutoML you also need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b> that is used to run experiments.

In [31]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-bikeshareforecasting'
# project folder
project_folder = './sample_projects/automl-local-bikeshareforecasting'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Run History Name'] = experiment_name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

| | |
|-|-|
|SDK version|1.0.48|
|Workspace|azuremlservice|
|Resource Group|azuremlserviceresourcegroup|
|Location|westeurope|
|Project Directory|./sample_projects/automl-local-bikeshareforecasting|
|Run History Name|automl-bikeshareforecasting|


## Data
Read bike share demand data from file, and preview data.

In [32]:
data = pd.read_csv('bike-no.csv', parse_dates=['date'])

Let's set up what we know about the dataset.

**Target column** is what we want to forecast.

**Time column** is the time axis along which to predict.

**Grain** is another word for an individual time series in your dataset. Grains are identified by the values of the columns listed in `grain_column_names`, for example "store" and "item" if your data has multiple time series of sales, one series for each combination of store and item sold.

This dataset has only one time series. Please see the [orange juice notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales) for an example of a multi-time series dataset.
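To make the idea of a grain concrete, here is a minimal sketch in plain Python. The sales records, the store and item values, and the helper `split_into_grains` are all hypothetical and exist only for illustration; the bike share data itself has a single series and needs none of this.

```python
from collections import defaultdict

# Hypothetical sales records: (store, item, date, units_sold).
rows = [
    ("A", "juice", "2019-01-01", 10),
    ("A", "juice", "2019-01-02", 12),
    ("A", "milk", "2019-01-01", 7),
    ("B", "juice", "2019-01-01", 4),
    ("B", "juice", "2019-01-02", 5),
    ("B", "juice", "2019-01-03", 6),
]


def split_into_grains(records):
    """Group records by (store, item); each group is one time series (grain)."""
    series = defaultdict(list)
    for store, item, date, units in records:
        series[(store, item)].append((date, units))
    return dict(series)


grains = split_into_grains(rows)

# Number of observations in each grain, and the length of the longest one.
series_lengths = {key: len(points) for key, points in grains.items()}
longest = max(series_lengths.values())

for key, length in sorted(series_lengths.items()):
    print(key, length)
```

Here the grain columns would be "store" and "item", giving three separate series of different lengths.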

In [33]:
display(data)

Unnamed: 0,instant,date,season,yr,mnth,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,6,2,0.34,0.36,0.81,0.16,331,654,985
1,2,2011-01-02,1,0,1,0,2,0.36,0.35,0.70,0.25,131,670,801
2,3,2011-01-03,1,0,1,1,1,0.20,0.19,0.44,0.25,120,1229,1349
3,4,2011-01-04,1,0,1,2,1,0.20,0.21,0.59,0.16,108,1454,1562
4,5,2011-01-05,1,0,1,3,1,0.23,0.23,0.44,0.19,82,1518,1600
5,6,2011-01-06,1,0,1,4,1,0.20,0.23,0.52,0.09,88,1518,1606
6,7,2011-01-07,1,0,1,5,2,0.20,0.21,0.50,0.17,148,1362,1510
7,8,2011-01-08,1,0,1,6,2,0.17,0.16,0.54,0.27,68,891,959
8,9,2011-01-09,1,0,1,0,1,0.14,0.12,0.43,0.36,54,768,822
9,10,2011-01-10,1,0,1,1,1,0.15,0.15,0.48,0.22,41,1280,1321


In [34]:
data.describe()

Unnamed: 0,instant,season,yr,mnth,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.5,0.5,6.52,3.0,1.4,0.5,0.47,0.63,0.19,848.18,3656.17,4504.35
std,211.17,1.11,0.5,3.45,2.0,0.54,0.18,0.16,0.14,0.08,686.62,1560.26,1937.21
min,1.0,1.0,0.0,1.0,0.0,1.0,0.06,0.08,0.0,0.02,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,1.0,1.0,0.34,0.34,0.52,0.13,315.5,2497.0,3152.0
50%,366.0,3.0,1.0,7.0,3.0,1.0,0.5,0.49,0.63,0.18,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,10.0,5.0,2.0,0.66,0.61,0.73,0.23,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,6.0,3.0,0.86,0.84,0.97,0.51,3410.0,6946.0,8714.0


In [35]:
target_column_name = 'cnt'
time_column_name = 'date'
grain_column_names = []

## Split the data

The first split we make is into train and test sets. Note we are splitting on time.

In [36]:
train = data[data[time_column_name] < '2012-09-01']
test = data[data[time_column_name] >= '2012-09-01']

X_train = train.copy()
y_train = X_train.pop(target_column_name).values

X_test = test.copy()
y_test = X_test.pop(target_column_name).values


print("X Train : ", X_train.shape)
print("Y Train : ", y_train.shape)
print("X Test : ", X_test.shape)
print("Y Test : ", y_test.shape)

X Train :  (609, 13)
Y Train :  (609,)
X Test :  (122, 13)
Y Test :  (122,)


### Setting forecaster maximum horizon 

Assuming your test data forms a full and regular time series (regular time intervals and no holes),
the maximum horizon you will need to forecast is the length of the longest grain in your test set.

In [37]:
if len(grain_column_names) == 0:
    max_horizon = len(X_test)
else:
    max_horizon = X_test.groupby(grain_column_names)[time_column_name].count().max()

## Train

Instantiate an `AutoMLConfig` object. This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|The metric that you want to optimize.<br> Forecasting supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iterations**|Number of iterations. In each iteration, AutoML trains a specific pipeline on the given data.|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], target values.|
|**n_cross_validations**|Number of cross validation splits.|
|**country_or_region**|The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes (e.g. 'US', 'GB').|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|

In [38]:
time_column_name = 'date'
automl_settings = {
    "time_column_name": time_column_name,
    # these columns are a breakdown of the total and therefore a leak
    "drop_column_names": ['casual', 'registered'],
    # knowing the country/region allows Automated ML to bring in holidays
    "country_or_region" : 'FR',
    "max_horizon" : max_horizon,
    "target_lags": 1    
}

automl_config = AutoMLConfig(task = 'forecasting',                             
                             primary_metric='normalized_root_mean_squared_error',
                             iterations = 10,
                             iteration_timeout_minutes = 10,
                             X = X_train,
                             y = y_train,
                             n_cross_validations = 3,                             
                             path=project_folder,
                             verbosity = logging.INFO,
                            **automl_settings)
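The `target_lags` setting above asks for the target from one period earlier to be used as a feature. As a rough illustration of what a lag of 1 means, the sketch below shifts a short target series by one step. The helper `lag_feature` is hypothetical and is not AutoML's implementation; the values are the first four `cnt` values from the data preview.

```python
def lag_feature(values, lag=1):
    """Return `values` shifted forward by `lag` steps.

    The first `lag` positions have no history and are filled with None.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    n = len(values)
    keep = max(n - lag, 0)  # how many original values survive the shift
    return [None] * (n - keep) + list(values[:keep])


cnt = [985, 801, 1349, 1562]
cnt_lag1 = lag_feature(cnt, lag=1)

for today, yesterday in zip(cnt, cnt_lag1):
    print(today, yesterday)
```

Each row now carries the previous day's count alongside the current one, which is the kind of information a lagged target feature gives the model.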

We will now run the experiment, starting with 10 iterations of model search. The experiment can be continued for more iterations if the results are not yet good enough. The currently running iterations are printed to the console.

In [39]:
%%time
local_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_6395a64e-d94d-41ba-8f95-4c3f068d8846
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   RobustScaler ElasticNet                        0:01:18       0.1288  

In [41]:
from azureml.widgets import RunDetails
RunDetails(local_run).show() 


Displaying the run objects gives you links to the visual tools in the Azure Portal. Go try them!

In [42]:
local_run

|Experiment|Id|Type|Status|Details Page|Docs Page|
|-|-|-|-|-|-|
|automl-bikeshareforecasting|AutoML_6395a64e-d94d-41ba-8f95-4c3f068d8846|automl|Completed|Link to Azure Portal|Link to Documentation|


### Retrieve the Best Model
Below we select the best pipeline from our iterations. The `get_output` method on the run object (`local_run`) returns the best run and the fitted model for the last fit invocation. There are overloads of `get_output` that allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration.

In [15]:
best_run, fitted_model = local_run.get_output()
fitted_model.steps

[('timeseriestransformer', TimeSeriesTransformer(logger=None)),
 ('stackensembleregressor',
  StackEnsembleRegressor(base_learners=[('3', Pipeline(memory=None,
       steps=[('standardscalerwrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x00000219F4AC3E48>), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
             max_features='log2', max_leaf_nodes=None,
             min_impuri...timators=25, n_jobs=1,
             oob_score=False, random_state=None, verbose=0, warm_start=False))]))],
              meta_learner=ElasticNetCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
         l1_ratio=0.5, max_iter=1000, n_alphas=100, n_jobs=1,
         normalize=False, positive=False, precompute='auto',
         random_state=None, selection='cyclic', tol=0.0001, verbose=0),
              training_cv_folds=5))]

### View the engineered names for featurized data

You can access the engineered feature names generated during time-series featurization. Note that a number of named holiday periods are represented. We recommend that you have at least one year of data when using this feature, to ensure that all yearly holidays are captured in the training featurization.
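To give a feel for what holiday features of this kind might look like, here is a small sketch in plain Python. It is not how AutoML builds its features: the three fixed-date French holidays are hard-coded, the function `holiday_features` is hypothetical, and the feature names only loosely echo the "1 day before" and "1 day after" names in the output of the next cell.

```python
from datetime import date, timedelta

# Three fixed-date French public holidays, hard-coded for illustration only.
FIXED_HOLIDAYS = {
    (1, 1): "Jour de l'an",
    (7, 14): "Fête nationale",
    (12, 25): "Noël",
}

# (shift in days, name template): shift=1 means "the holiday was yesterday".
TEMPLATES = (
    (0, "{}"),
    (1, "1 day after {}"),
    (-1, "1 day before {}"),
)


def holiday_features(day):
    """Return indicator features for `day`: a holiday, or the day before or after one."""
    features = {}
    for shift, template in TEMPLATES:
        anchor = day - timedelta(days=shift)
        name = FIXED_HOLIDAYS.get((anchor.month, anchor.day))
        if name is not None:
            features[template.format(name)] = 1
    return features


for d in (date(2012, 7, 14), date(2012, 7, 15), date(2012, 12, 24), date(2012, 9, 1)):
    print(d.isoformat(), holiday_features(d))
```

A year of training data that never includes 14 July could not teach a model anything about that indicator, which is the reason for the one-year recommendation above.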

In [16]:
fitted_model.named_steps['timeseriestransformer'].get_engineered_feature_names()

['atemp',
 'atemp_WASNULL',
 'horizon_origin',
 'hum',
 'hum_WASNULL',
 'instant',
 'instant_WASNULL',
 'mnth',
 'mnth_WASNULL',
 'season',
 'season_WASNULL',
 'temp',
 'temp_WASNULL',
 'weathersit',
 'weathersit_WASNULL',
 'weekday',
 'weekday_WASNULL',
 'windspeed',
 'windspeed_WASNULL',
 'yr',
 'yr_WASNULL',
 '_automl_target_col_lag1D',
 'year',
 'year_iso',
 'half',
 'quarter',
 'month',
 'day',
 'wday',
 'qday',
 'week',
 '_IsPaidTimeOff',
 '_Holiday_1 day after Armistice 1918',
 '_Holiday_1 day after Armistice 1945',
 '_Holiday_1 day after Ascension',
 '_Holiday_1 day after Assomption',
 '_Holiday_1 day after Fête du Travail',
 '_Holiday_1 day after Fête nationale',
 "_Holiday_1 day after Jour de l'an",
 '_Holiday_1 day after Lundi de Pentecôte',
 '_Holiday_1 day after Lundi de Pâques',
 '_Holiday_1 day after Noël',
 '_Holiday_1 day after Toussaint',
 '_Holiday_1 day before Armistice 1918',
 '_Holiday_1 day before Armistice 1945',
 '_Holiday_1 day before Ascension',
 '_Holiday_1 

### View the featurization summary

You can also see what featurization steps were performed on different raw features in the user data. For each raw feature in the user data, the following information is displayed:

- Raw feature name
- Number of engineered features formed out of this raw feature
- Type detected
- If feature was dropped
- List of feature transformations for the raw feature

In [17]:
fitted_model.named_steps['timeseriestransformer'].get_featurization_summary()

[{'RawFeatureName': 'atemp',
  'TypeDetected': 'Numeric',
  'Dropped': 'No',
  'EngineeredFeatureCount': 2,
  'Tranformations': ['MeanImputer', 'ImputationMarker']},
 {'RawFeatureName': 'date',
  'TypeDetected': 'DateTime',
  'Dropped': 'No',
  'EngineeredFeatureCount': 93,
  'Tranformations': ['MaxHorizonFeaturizer',
   'DateTime',
   'DateTime',
   'DateTime',
   'DateTime',
   'DateTime',
   'DateTime',
   'DateTime',
   'DateTime',
   'DateTime',
   'DateTime',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'DateTime-OneHotEncoder',
   'Date
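The summary above is long because the same transformation name is repeated many times for the `date` column. One way to make it easier to read is to count the transformations per raw feature. The sketch below does this on two sample records: the `atemp` record is copied from the output above, while the `date` record is a shortened, hypothetical stand-in. The helper `condense` is also hypothetical, and the key `'Tranformations'` is spelled as it appears in the output.

```python
from collections import Counter

# Records shaped like the summary entries; the 'date' entry is shortened and hypothetical.
summary = [
    {"RawFeatureName": "atemp", "TypeDetected": "Numeric", "Dropped": "No",
     "EngineeredFeatureCount": 2,
     "Tranformations": ["MeanImputer", "ImputationMarker"]},
    {"RawFeatureName": "date", "TypeDetected": "DateTime", "Dropped": "No",
     "EngineeredFeatureCount": 3,
     "Tranformations": ["DateTime", "DateTime", "DateTime-OneHotEncoder"]},
]


def condense(records):
    """Return one (name, type, feature count, transformation counts) tuple per raw feature."""
    rows = []
    for rec in records:
        counts = Counter(rec["Tranformations"])  # key spelled as in the output above
        rows.append((rec["RawFeatureName"], rec["TypeDetected"],
                     rec["EngineeredFeatureCount"], dict(counts)))
    return rows


condensed = condense(summary)
for name, kind, n_features, counts in condensed:
    print(name, kind, n_features, counts)
```

In the notebook, the list returned by `get_featurization_summary()` could be passed to `condense` in place of the sample records, assuming it has the same keys as shown above.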

### Test the Best Fitted Model

Predict on training and test set, and calculate residual values.

We always score on the original dataset, whose schema matches the schema of the training dataset.

In [18]:
X_test.head(10)

Unnamed: 0,instant,date,season,yr,mnth,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered
609,610,2012-09-01,3,1,9,6,2,0.75,0.7,0.64,0.11,2352,3788
610,611,2012-09-02,3,1,9,0,2,0.7,0.65,0.81,0.06,2613,3197
611,612,2012-09-03,3,1,9,1,1,0.71,0.66,0.79,0.15,1965,4069
612,613,2012-09-04,3,1,9,2,1,0.73,0.69,0.76,0.24,867,5997
613,614,2012-09-05,3,1,9,3,1,0.74,0.71,0.74,0.19,832,6280
614,615,2012-09-06,3,1,9,4,2,0.7,0.66,0.81,0.14,611,5592
615,616,2012-09-07,3,1,9,5,1,0.7,0.66,0.74,0.17,1045,6459
616,617,2012-09-08,3,1,9,6,2,0.66,0.61,0.8,0.28,1557,4419
617,618,2012-09-09,3,1,9,0,1,0.61,0.58,0.55,0.22,2570,5657
618,619,2012-09-10,3,1,9,1,1,0.58,0.57,0.5,0.26,1118,6407


In [19]:
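# Build a target vector of the same length as the test set, filled with NaN,
# and pass it to forecast() together with the test features
y_query = y_test.copy().astype(float)  # np.float has been removed from recent NumPy releases
y_query.fill(np.nan)                   # np.NaN has been removed in NumPy 2.0
y_fcst, X_trans = fitted_model.forecast(X_test, y_query)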

In [25]:
y_fcst

array([8252.2950373 , 7826.89469791, 7828.84730945, 7735.77856671,
       7732.1117878 , 7800.76910656, 8010.71037202, 7567.56891329,
       7814.18672436, 7798.10833924, 8059.62823558, 8118.30588951,
       8133.43857661, 8133.43857661, 7975.21588583, 7904.31812131,
       7867.62181175, 7464.05862703, 7811.11119901, 8128.4307181 ,
       8099.83461182, 7862.31627741, 7691.85388466, 7763.83327498,
       7630.10070539, 7701.8374249 , 8096.30594088, 8048.13849055,
       7850.80942462, 7794.39127969, 7692.14096691, 7580.0077118 ,
       7798.32801007, 7898.83377511, 8039.8271956 , 7926.44275419,
       7675.05558539, 7291.74285447, 7636.72241037, 7747.08011371,
       7958.50740293, 7859.27145919, 7940.71709526, 7775.94733577,
       7643.91358234, 7978.94824085, 7996.57527029, 7636.47733398,
       7883.02866128, 8026.96591105, 7796.0908472 , 7648.20999123,
       8036.04841805, 8052.2397125 , 7784.05526496, 7814.44590572,
       7710.40327131, 7490.25852421, 6989.38899487, 7192.91894
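The forecasts can then be compared with the held-out actuals. The notebook does not show this step, so what follows is only a sketch in plain Python on made-up numbers, with hypothetical helper names. It assumes that a normalized RMSE divides the RMSE by the range of the actual values, which may differ from the exact definition AutoML uses. In the notebook, `mean_absolute_error` and `mean_squared_error` from `sklearn.metrics` are already imported and could be applied to `y_test` and the forecast values instead.

```python
import math


def residuals(actual, predicted):
    """Return actual minus predicted for each pair of values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]


def mae(actual, predicted):
    """Mean absolute error."""
    errs = residuals(actual, predicted)
    return sum(abs(e) for e in errs) / len(errs)


def rmse(actual, predicted):
    """Root mean squared error."""
    errs = residuals(actual, predicted)
    return math.sqrt(sum(e * e for e in errs) / len(errs))


def normalized_rmse(actual, predicted):
    """RMSE divided by the range of the actuals (an assumed normalization)."""
    spread = max(actual) - min(actual)
    return rmse(actual, predicted) / spread


# Made-up numbers, not results from this notebook.
actual = [100.0, 120.0, 90.0, 110.0]
predicted = [110.0, 115.0, 95.0, 100.0]

print("Residuals:", residuals(actual, predicted))
print("MAE:", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("Normalized RMSE:", normalized_rmse(actual, predicted))
```

Looking at the residuals themselves, and not only the summary metrics, helps to spot systematic over- or under-forecasting.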

In [26]:
y_fcst=pd.DataFrame(y_fcst)

### Export the forecasts to a CSV file

In [27]:
# Export CSV
y_fcst.to_csv(r'Forecast_Bike.csv')

In [28]:
%ls Forecast_Bike.csv -l

 Volume in drive C is Windows
 Volume Serial Number is F8A0-81F9

 Directory of C:\Users\seretkow\notebooks\autoMLForecast


11-Jul-19  11:32             2,696 Forecast_Bike.csv
               1 File(s)          2,696 bytes
               0 Dir(s)  42,128,687,104 bytes free
