# Vertex Forecast - probabilistic inference with Quantiles

## Dealing with Uncertainty

**Retail sales are random by nature**; consider the following scenario:

> On a given day, 1000 customers walk into a large retail store (on average), and 1% of customers decide to buy a pair of jeans (on average).
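The scenario can be made concrete with a small simulation. The sketch below is illustrative only and uses the standard library: it assumes exactly 1000 customers per day (the scenario says "on average") who each buy independently with probability 1%, so daily sales follow a Binomial(1000, 0.01) distribution with mean 10 and standard deviation of about 3.15. The function name and the number of simulated days are hypothetical choices.

```python
import random
import statistics


def simulate_daily_sales(n_customers=1000, buy_prob=0.01, n_days=2000, seed=0):
    """Simulate daily unit sales where each customer buys independently."""
    rng = random.Random(seed)
    return [
        sum(1 for _ in range(n_customers) if rng.random() < buy_prob)
        for _ in range(n_days)
    ]


sales = simulate_daily_sales()
mean_sales = statistics.mean(sales)
std_sales = statistics.pstdev(sales)

# Empirical 10th and 90th percentiles of the simulated daily sales.
ordered = sorted(sales)
p10 = ordered[int(0.1 * len(ordered))]
p90 = ordered[int(0.9 * len(ordered))]

print(f"mean: {mean_sales:.2f}, std: {std_sales:.2f}, p10: {p10}, p90: {p90}")
```

Even with a fixed purchase rate, the daily total varies noticeably around its mean, which is why a single point forecast hides useful information and a range of quantiles is worth predicting.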


In [2]:
import os

GCP_PROJECTS = !gcloud config get-value project
PROJECT_ID = GCP_PROJECTS[0]
PROJECT_NUM = !gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)"
PROJECT_NUM = PROJECT_NUM[0]
LOCATION = 'us-central1'
BQ_LOCATION='US'

# Service account used by Vertex AI: the default Compute Engine service account for this project
VERTEX_SA = f'{PROJECT_NUM}-compute@developer.gserviceaccount.com'

print(f"PROJECT_ID: {PROJECT_ID}")
print(f"PROJECT_NUM: {PROJECT_NUM}")
print(f"LOCATION: {LOCATION}")

PROJECT_ID: hybrid-vertex
PROJECT_NUM: 934903580331
LOCATION: us-central1


In [3]:
import google.cloud.aiplatform as vertex_ai
from google.cloud import bigquery
from google.cloud import storage

import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime, timedelta

### Setup - init SDK clients and define vars

In [4]:
bq_client = bigquery.Client(
    project=PROJECT_ID, 
    location=BQ_LOCATION
)

storage_client = storage.Client(project=PROJECT_ID)

vertex_ai.init(
    project=PROJECT_ID,
    location=LOCATION
)

In [5]:
# previously defined
BQ_DATASET="m5_us"
BQ_TABLE="sdk_train"
BQ_TABLE_PLAN="sdk_plan"

# new vars
EXPERIMENT_tag="nb5"
VERSION='v1'

EXPERIMENT_NAME = f"m5_{EXPERIMENT_tag}_{VERSION}"
print(f'EXPERIMENT_NAME: {EXPERIMENT_NAME}')

EXPERIMENT_NAME: m5_nb5_v1


### get dataset

In [6]:
dataset = vertex_ai.TimeSeriesDataset('projects/934903580331/locations/us-central1/datasets/462153324456574976')

In [12]:
# dataset.column_names

### define column specs

In [13]:
import pickle as pkl
from pprint import pprint

LOCAL_COL_FILE = 'column_specs.pkl'

# load the column specs saved in a previous notebook
with open(LOCAL_COL_FILE, 'rb') as filehandler:
    COL_DICT_TEST = pkl.load(filehandler)

COL_TRANSFORMS = COL_DICT_TEST['column_specs']
UNAVAILABLE_AT_FORECAST_COLS = COL_DICT_TEST['unavailable_at_forecast_columns']
AVAILABLE_AT_FORECAST_COLS = COL_DICT_TEST['available_at_forecast_columns']
SERIES_COLUMN = COL_DICT_TEST['time_series_identifier_column']
TIME_COLUMN = COL_DICT_TEST['time_column']
TARGET_COLUMN = COL_DICT_TEST['target_column']
PREDEFINED_SPLIT_COL = COL_DICT_TEST['predefined_split_column_name']

## Model config

Vertex Forecast now supports probabilistic inference, which lets you optimize a point-forecast objective (such as `minimize-rmse`) and return quantiles from the same model. The `minimize-quantile-loss` objective does not support this combination. Learning a full predictive distribution may also improve performance on the optimization objective itself. Compared with `minimize-quantile-loss` (i.e., pinball loss), probabilistic inference provides fine-grained quantiles with an ordering guarantee, and it combines predictions from ensemble members using mixtures to improve calibration.
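For reference, the pinball (quantile) loss mentioned above can be written in a few lines. This is a minimal sketch of the standard textbook definition, not the code Vertex Forecast runs internally; the function names are hypothetical.

```python
def pinball_loss(actual, predicted, quantile):
    """Pinball loss for a single observation at the given quantile level."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be strictly between 0 and 1")
    error = actual - predicted
    # Under-prediction (error > 0) is weighted by quantile,
    # over-prediction (error < 0) is weighted by (1 - quantile).
    return max(quantile * error, (quantile - 1.0) * error)


def mean_pinball_loss(actuals, predictions, quantile):
    """Average pinball loss over paired actuals and predictions."""
    if len(actuals) != len(predictions) or not actuals:
        raise ValueError("actuals and predictions must be non-empty and equal length")
    total = sum(pinball_loss(a, p, quantile) for a, p in zip(actuals, predictions))
    return total / len(actuals)


# At the 0.9 quantile, under-predicting by 2 units costs far more
# than over-predicting by 2 units.
under = pinball_loss(actual=12.0, predicted=10.0, quantile=0.9)
over = pinball_loss(actual=8.0, predicted=10.0, quantile=0.9)
print(f"under-prediction: {under:.2f}, over-prediction: {over:.2f}")
```

The asymmetry is what pushes a model trained on this loss toward the requested quantile rather than toward the mean.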

In [24]:
# forecast spec
FORECAST_GRANULARITY = 'DAY'
DATA_GRANULARITY_COUNT=1
FORECAST_HORIZON = 14
CONTEXT_WINDOW = 14
forecast_test_length = 14
forecast_val_length = 14

# model config
# OPTIMIZATION_OBJECTIVE='minimize-quantile-loss'
OPTIMIZATION_OBJECTIVE = 'minimize-rmse'
QUANTILES = [0.1, 0.3, 0.5, 0.7, 0.9]
ENABLE_PROBABILISTIC_INFER="enable_probabilistic_inference"

# job spec
MILLI_NODE_HRS=1000
HOLIDAY_REGIONS=['GLOBAL', 'NA', 'US']

# export eval set BQ destination
BQ_EVAL_DESTINATION = f"bq://{PROJECT_ID}:{BQ_DATASET}:{BQ_TABLE}_automl_qs"

print(f"EXPERIMENT_NAME: {EXPERIMENT_NAME}")
print(f"VERSION: {VERSION}")
print(f"OPTIMIZATION_OBJECTIVE: {OPTIMIZATION_OBJECTIVE}")
print(f"TARGET_COLUMN: {TARGET_COLUMN}")
print(f"TIME_COLUMN: {TIME_COLUMN}")
print(f"SERIES_COLUMN: {SERIES_COLUMN}")
print(f"AVAILABLE_AT_FORECAST_COLS: {AVAILABLE_AT_FORECAST_COLS}")
print(f"PREDEFINED_SPLIT_COL: {PREDEFINED_SPLIT_COL}")
print(f"FORECAST_HORIZON: {FORECAST_HORIZON}")
print(f"FORECAST_GRANULARITY: {FORECAST_GRANULARITY.lower()}")
print(f"CONTEXT_WINDOW: {CONTEXT_WINDOW}")
print(f"TARGET_COLUMN: {TARGET_COLUMN}")
print(f"QUANTILES: {QUANTILES}")
print(f"ENABLE_PROBABILISTIC_INFER: {ENABLE_PROBABILISTIC_INFER}")
print(f"BQ_EVAL_DESTINATION: {BQ_EVAL_DESTINATION}")

EXPERIMENT_NAME: m5_nb5_v1
VERSION: v1
OPTIMIZATION_OBJECTIVE: minimize-rmse
TARGET_COLUMN: gross_quantity
TIME_COLUMN: date
SERIES_COLUMN: timeseries_id
AVAILABLE_AT_FORECAST_COLS: ['event_name_1', 'year', 'event_type_1', 'month', 'dept_id', 'event_type_2', 'wday', 'state_id', 'snap_WI', 'snap_CA', 'snap_TX', 'product_id', 'date', 'event_name_2', 'location_id', 'weekday', 'cat_id']
PREDEFINED_SPLIT_COL: splits
FORECAST_HORIZON: 14
FORECAST_GRANULARITY: day
CONTEXT_WINDOW: 14
TARGET_COLUMN: gross_quantity
QUANTILES: [0.1, 0.3, 0.5, 0.7, 0.9]
ENABLE_PROBABILISTIC_INFER: enable_probabilistic_inference
BQ_EVAL_DESTINATION: bq://hybrid-vertex:m5_us:sdk_train_automl_qs


## Create and submit job

In [25]:
forecast_job = vertex_ai.AutoMLForecastingTrainingJob(
    display_name = f'{EXPERIMENT_NAME}_training',
    optimization_objective=OPTIMIZATION_OBJECTIVE,
    column_specs = COL_TRANSFORMS,
    labels = {'experiment':f'{EXPERIMENT_NAME}'}
)

In [26]:
forecast=forecast_job.run(
    dataset=dataset,
    target_column=TARGET_COLUMN,
    time_column=TIME_COLUMN,
    time_series_identifier_column=SERIES_COLUMN,
    unavailable_at_forecast_columns=UNAVAILABLE_AT_FORECAST_COLS,
    available_at_forecast_columns=AVAILABLE_AT_FORECAST_COLS,
    forecast_horizon=FORECAST_HORIZON,
    data_granularity_unit=FORECAST_GRANULARITY.lower(),
    data_granularity_count=DATA_GRANULARITY_COUNT,
    predefined_split_column_name=PREDEFINED_SPLIT_COL,
    context_window = CONTEXT_WINDOW,
    export_evaluated_data_items=True,
    export_evaluated_data_items_bigquery_destination_uri=BQ_EVAL_DESTINATION,
    validation_options="fail-pipeline",
    budget_milli_node_hours = MILLI_NODE_HRS,
    model_display_name=f"{EXPERIMENT_NAME}_{BQ_TABLE}",
    model_labels={'experiment':f'{EXPERIMENT_NAME}'},
    holiday_regions=HOLIDAY_REGIONS,
    quantiles=QUANTILES,
    additional_experiments=[ENABLE_PROBABILISTIC_INFER],
    sync=True
)

View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/8325361720348377088?project=934903580331
AutoMLForecastingTrainingJob projects/934903580331/locations/us-central1/trainingPipelines/8325361720348377088 current state:
PipelineState.PIPELINE_STATE_RUNNING

In [None]:
FORECAST_MODEL_RSC_NAME = forecast.resource_name
print(f"FORECAST_MODEL_RSC_NAME: {FORECAST_MODEL_RSC_NAME}")

## default Model Evaluation

In [None]:
forecast_EVALS = forecast.list_model_evaluations()

for model_evaluation in forecast_EVALS:
    print(model_evaluation.to_dict())

In [None]:
# Collect the first evaluation's metrics as a dict of single-item lists (one row per metric set).
model_evaluation = list(forecast.list_model_evaluations())[0]
metrics_dict = {k: [v] for k, v in dict(model_evaluation.metrics).items()}
metrics_dict

## Evaluate quantile predictions

**TODO**

* see [Probabilistic Inference User Guide | Vertex Forecast](https://docs.google.com/document/d/1kegOsor8j7HO2qttMKK6mtfl5GzoxDf8LhsXH8oXsyo/edit#)

### understanding quantile predictions

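The evaluation dataset exported by Vertex Forecast contains the following fields. They are written here with a target column called `sales`; the prefix presumably follows the target column name, so with this notebook's target it may appear as `predicted_gross_quantity`.
* `predicted_sales.quantile_values` holds the quantile levels, i.e. [0.1, 0.3, 0.5, 0.7, 0.9]
* `predicted_sales.quantile_predictions` is an array of the same length, holding the prediction for each quantile level
* `predicted_sales.value` is the prediction for the 0.5 quantile (median)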
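As a rough illustration of how these fields fit together, the sketch below pairs each quantile level with its prediction for a single exported row. The row is a hand-written, hypothetical example (the numbers are made up), and the helper name `quantile_forecast` is an assumption for illustration; only the three field names come from the list above.

```python
# Hypothetical row shaped like the exported evaluation data; values are made up.
row = {
    "predicted_sales": {
        "value": 10.0,
        "quantile_values": [0.1, 0.3, 0.5, 0.7, 0.9],
        "quantile_predictions": [6.0, 8.0, 10.0, 12.0, 14.0],
    }
}


def quantile_forecast(row, column="predicted_sales"):
    """Return a {quantile level: prediction} mapping for one row."""
    prediction = row[column]
    levels = prediction["quantile_values"]
    values = prediction["quantile_predictions"]
    if len(levels) != len(values):
        raise ValueError("quantile_values and quantile_predictions differ in length")
    return dict(zip(levels, values))


by_level = quantile_forecast(row)

# The 0.1 and 0.9 quantiles bound a central 80% prediction interval.
low, high = by_level[0.1], by_level[0.9]
print(f"median: {by_level[0.5]}, 80% interval: [{low}, {high}]")
```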

Different point estimates can be derived from the quantiles, each minimizing a different error metric:
* RMSE: the mean, estimated as a weighted mean of the quantile predictions
* MAPE: the median weighted by 1/value
* MAE: the median
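Two of these estimates can be sketched directly from a handful of quantile predictions. The example below is an approximation under stated assumptions, not the method Vertex Forecast uses: the median is found by linear interpolation between neighboring quantile levels, and the mean is approximated by trapezoidal integration of the quantile function over the covered range of levels (here 0.1 to 0.9), which ignores both tails. The MAPE-oriented estimate is omitted. Function names and the sample predictions are hypothetical.

```python
def median_from_quantiles(levels, predictions):
    """Linearly interpolate the prediction at the 0.5 quantile level."""
    pairs = sorted(zip(levels, predictions))
    for (q_lo, p_lo), (q_hi, p_hi) in zip(pairs, pairs[1:]):
        if q_lo <= 0.5 <= q_hi:
            if q_hi == q_lo:
                return p_lo
            weight = (0.5 - q_lo) / (q_hi - q_lo)
            return p_lo + weight * (p_hi - p_lo)
    raise ValueError("0.5 is outside the range of the supplied quantile levels")


def approximate_mean(levels, predictions):
    """Approximate the mean as a weighted mean of quantile predictions.

    Uses the trapezoidal rule on the quantile function and normalizes by the
    covered probability mass, so the tails beyond the outermost levels are ignored.
    """
    pairs = sorted(zip(levels, predictions))
    covered = pairs[-1][0] - pairs[0][0]
    if len(pairs) < 2 or covered <= 0:
        raise ValueError("need at least two distinct quantile levels")
    area = sum(
        (q_hi - q_lo) * (p_lo + p_hi) / 2.0
        for (q_lo, p_lo), (q_hi, p_hi) in zip(pairs, pairs[1:])
    )
    return area / covered


# Hypothetical right-skewed forecast: the upper quantiles stretch further out.
levels = [0.1, 0.3, 0.5, 0.7, 0.9]
predictions = [4.0, 7.0, 9.0, 13.0, 22.0]

median_estimate = median_from_quantiles(levels, predictions)
mean_estimate = approximate_mean(levels, predictions)
print(f"median (MAE-oriented): {median_estimate:.2f}")
print(f"approx. mean (RMSE-oriented): {mean_estimate:.2f}")
```

For a skewed forecast like this one, the mean estimate sits above the median, so the best point forecast depends on which error metric matters downstream.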