# Vertex AI SDK for Python: Vertex AI Forecasting Model Training 

This notebook demonstrates how to create an AutoML model using the SDK based on a time series dataset. 

To use this notebook, run each step, or cell, in sequence and review its results. To run a cell, press Shift+Enter. The notebook automatically displays the return value of the last line in each cell.

# Install Vertex AI SDK, create a storage bucket and define environment variables

After the SDK installation, the kernel restarts automatically. You may see the error message `Your session crashed for an unknown reason`, which is expected.

## Install the Vertex AI SDK and restart the kernel

In [None]:
%%capture
!pip3 uninstall -y google-cloud-aiplatform
!pip3 install google-cloud-aiplatform
 
import IPython
 
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Import Python system modules

In [None]:
import sys
import os
from datetime import datetime

## Define Lab Environment variables

In [None]:
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID = shell_output[0]
print("Project ID: ", PROJECT_ID)
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET_NAME = "gs://"+PROJECT_ID
REGION = "us-central1"  # Change this if you need to use a different region for Vertex AI

## Create a Regional Storage Bucket

This cell creates a new regional Cloud Storage bucket for the lab. The bucket is named after the lab project ID so that its name is unique across all Cloud Storage buckets.

You cannot use a Multi-Regional Cloud Storage bucket for training with Vertex AI; you must use a Regional Cloud Storage bucket.

In [None]:
!gsutil mb -l {REGION} {BUCKET_NAME}
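
As a rough guard against choosing a multi-region location, a small check can run before the bucket is created. This is only a sketch: `is_multi_region` is a hypothetical helper, it recognizes just the three Cloud Storage multi-region names (`US`, `EU`, `ASIA`), and it does not detect dual-region locations.

```python
# Cloud Storage multi-region location names.
MULTI_REGIONS = {"US", "EU", "ASIA"}


def is_multi_region(location):
    """Return True if `location` is a multi-region name (hypothetical helper)."""
    return location.strip().upper() in MULTI_REGIONS


region = "us-central1"  # same value the notebook assigns to REGION
if is_multi_region(region):
    raise ValueError("Choose a single region instead of " + region)
print(region, "is not a multi-region name")
```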

## Define Training Dataset source

The datasets you are using are samples from the [Iowa Liquor Retail Sales](https://console.cloud.google.com/marketplace/product/iowa-department-of-commerce/iowa-liquor-sales) dataset. The training sample contains the sales from 2020, and the prediction sample (used in the batch prediction step) contains the sales from January through April 2021.

In [None]:
TRAINING_DATASET_BQ_PATH = 'bq://bigquery-public-data:iowa_liquor_sales_forecasting.2020_sales_train'
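
The training path above separates the project from the dataset with a colon, while the batch prediction path later in this notebook uses dots throughout. The sketch below shows one way to break either form into its parts. `split_bq_uri` is a hypothetical helper, not part of the SDK, and it assumes the URI has exactly a project, a dataset, and a table.

```python
def split_bq_uri(bq_uri):
    """Split a BigQuery URI into (project, dataset, table). Hypothetical helper."""
    prefix = "bq://"
    path = bq_uri[len(prefix):] if bq_uri.startswith(prefix) else bq_uri
    # Treat the legacy "project:dataset.table" form like "project.dataset.table".
    path = path.replace(":", ".")
    # Raises ValueError if the path does not have three dot-separated parts.
    project, dataset, table = path.rsplit(".", 2)
    return project, dataset, table


print(split_bq_uri(
    "bq://bigquery-public-data:iowa_liquor_sales_forecasting.2020_sales_train"))
```

The normalization step mirrors what the `_sanitize_bq_uri` function in the visualization cell does before the table names are placed in a query.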

# Initialize Vertex AI SDK

Initialize the *client* for Vertex AI.

In [None]:
from google.cloud import aiplatform

# Pass REGION explicitly so that changing it above takes effect here
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

# Create a Managed Time Series Dataset from BigQuery

This section will create a dataset from a BigQuery table.

In [None]:
ds = aiplatform.datasets.TimeSeriesDataset.create(
    display_name='iowa_liquor_sales_train',
    bq_source=[TRAINING_DATASET_BQ_PATH])

ds.resource_name

# Create a Training Job to train a Vertex AI Forecasting Model

Create the Vertex AI AutoML Forecasting job definition, specifying the optimization objective and any column transformations required.

In [None]:
time_column = "date"
time_series_identifier_column="store_name"
target_column="sale_dollars"

job = aiplatform.AutoMLForecastingTrainingJob(
    display_name='train-iowa-liquor-sales-automl_1',
    optimization_objective='minimize-rmse',    
    column_transformations=[
        {"timestamp": {"column_name": time_column}},
        {"numeric": {"column_name": target_column}},
        {"categorical": {"column_name": "city"}},
        {"categorical": {"column_name": "zip_code"}},
        {"categorical": {"column_name": "county"}},
    ]
)
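
The `column_transformations` argument is a list of single-key dictionaries, one per column, where the key names the transformation type. If the column types are kept in a mapping, the list can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper named `build_column_transformations`:

```python
def build_column_transformations(column_types):
    """Turn {column_name: transformation_type} into the list-of-dicts layout
    used by the training job definition above. Hypothetical helper."""
    return [{kind: {"column_name": name}} for name, kind in column_types.items()]


# Same columns and types as the job definition in this notebook.
column_types = {
    "date": "timestamp",
    "sale_dollars": "numeric",
    "city": "categorical",
    "zip_code": "categorical",
    "county": "categorical",
}

for transformation in build_column_transformations(column_types):
    print(transformation)
```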

## Train the Vertex AI AutoML Forecasting model

You now run the training job to train the Vertex AI AutoML Forecasting model. 

**Note:** This will take approximately an hour and 15 minutes to run.

While you are waiting for this to complete, you can return to the lab guide for instructions on how to download and explore sample batch prediction output data.

In [None]:
model = job.run(
    dataset=ds,
    target_column=target_column,
    time_column=time_column,
    time_series_identifier_column=time_series_identifier_column,
    available_at_forecast_columns=[time_column],
    unavailable_at_forecast_columns=[target_column],
    time_series_attribute_columns=["city", "zip_code", "county"],
    forecast_horizon=30,
    context_window=30,
    data_granularity_unit="day",
    data_granularity_count=1,
    weight_column=None,
    budget_milli_node_hours=1000,
    model_display_name="iowa-liquor-sales-forecast-model", 
    predefined_split_column_name=None,
)
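
To make the window settings concrete, the sketch below lists which days fall in the context window and which fall in the forecast horizon for daily data with `context_window=30` and `forecast_horizon=30`. The forecast start date and the `forecast_windows` helper are assumptions chosen for illustration; they are not part of the SDK. The last lines convert the training budget, which is expressed in thousandths of a node hour.

```python
from datetime import date, timedelta


def forecast_windows(first_forecast_day, context_window, forecast_horizon):
    """Return (context_days, horizon_days) for daily granularity. Hypothetical helper."""
    one_day = timedelta(days=1)
    # Days the model looks back on, oldest first, ending the day before the forecast.
    context_days = [first_forecast_day - one_day * i
                    for i in range(context_window, 0, -1)]
    # Days the model predicts, starting on the first forecast day.
    horizon_days = [first_forecast_day + one_day * i
                    for i in range(forecast_horizon)]
    return context_days, horizon_days


# Assumed start date, picked only to make the example concrete.
context_days, horizon_days = forecast_windows(date(2021, 1, 1), 30, 30)
print("context:", context_days[0], "to", context_days[-1])
print("horizon:", horizon_days[0], "to", horizon_days[-1])

# budget_milli_node_hours=1000 corresponds to one node hour.
budget_node_hours = 1000 / 1000
print("budget in node hours:", budget_node_hours)
```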

# Fetch Model Evaluation Metrics

Fetch the model evaluation metrics calculated during training on the test set.

In [None]:
import pandas as pd

list_evaluation_pager = model.api_client.list_model_evaluations(parent=model.resource_name)
for model_evaluation in list_evaluation_pager:
  metrics_dict = {m[0]: m[1] for m in model_evaluation.metrics.items()}
  df = pd.DataFrame(metrics_dict.items(), columns=["Metric", "Value"])
  print(df.to_string(index=False))
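
The training job in this notebook was configured with `optimization_objective='minimize-rmse'`. As a reminder of what that quantity measures, here is a minimal sketch of root mean squared error in plain Python. The `rmse` function and the numbers passed to it are illustrative placeholders, not output from the trained model.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder values, not real sales figures or model predictions.
print(rmse([100.0, 200.0, 300.0], [110.0, 190.0, 330.0]))
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones.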

# Run Batch Prediction

## Create Output BigQuery Dataset
First, create a new BigQuery dataset for the batch prediction output in the same region as the batch prediction input dataset. 

In [None]:
import os
from google.cloud import bigquery

os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

batch_predict_bq_input_uri = "bq://bigquery-public-data.iowa_liquor_sales_forecasting.2021_sales_predict"
batch_predict_bq_output_dataset_name = "iowa_liquor_sales_predictions"
batch_predict_bq_output_dataset_path = "{}.{}".format(PROJECT_ID, batch_predict_bq_output_dataset_name)
batch_predict_bq_output_uri_prefix = "bq://{}.{}".format(PROJECT_ID, batch_predict_bq_output_dataset_name)
client = bigquery.Client()
dataset = bigquery.Dataset(batch_predict_bq_output_dataset_path)
# Must be the same region as batch_predict_bq_input_uri
dataset_region = "US" # @param {type : "string"}
dataset.location = dataset_region
dataset = client.create_dataset(dataset)
print("Created bigquery dataset {} in {}".format(batch_predict_bq_output_dataset_path, dataset_region))

Run a batch prediction job to generate liquor sales forecasts for stores in Iowa from an input dataset containing historical sales.

In [None]:
model.batch_predict(
   bigquery_source=batch_predict_bq_input_uri,
   instances_format="bigquery",
   bigquery_destination_prefix=batch_predict_bq_output_uri_prefix,
   predictions_format="bigquery",
   job_display_name="predict-iowa-liquor-sales-automl_1")
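
The visualization cell in the next section locates the job's output by scanning the dataset for table IDs that start with `predictions_` and keeping the largest one in string order. The sketch below isolates that selection logic. The table IDs are made-up placeholders, and the approach only finds the newest table if the IDs sort chronologically as strings.

```python
def latest_prediction_table(table_ids, prefix="predictions_"):
    """Return the lexicographically largest ID with the given prefix, or ""."""
    candidates = [table_id for table_id in table_ids if table_id.startswith(prefix)]
    return max(candidates, default="")


# Placeholder IDs for illustration; real table names may look different.
example_ids = [
    "errors_2021_06_01T10_00_00",
    "predictions_2021_06_01T10_00_00",
    "predictions_2021_06_02T09_30_00",
]
print(latest_prediction_table(example_ids))
```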

# Visualize the Forecasts
Run the cell below and follow the link it prints to visualize the generated forecasts in [Data Studio](https://support.google.com/datastudio/answer/6283323?hl=en).

In [None]:
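# Import the submodule explicitly; a bare "import urllib" does not guarantee urllib.parse is loaded
import urllib.parse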

tables = client.list_tables(batch_predict_bq_output_dataset_path)

prediction_table_id = ""
for table in tables:
  if table.table_id.startswith(
      "predictions_") and table.table_id > prediction_table_id:
    prediction_table_id = table.table_id
batch_predict_bq_output_uri = "{}.{}".format(
    batch_predict_bq_output_dataset_path, prediction_table_id)


def _sanitize_bq_uri(bq_uri):
  if bq_uri.startswith("bq://"):
    bq_uri = bq_uri[5:]
  return bq_uri.replace(":", ".")


def get_data_studio_link(batch_prediction_bq_input_uri,
                         batch_prediction_bq_output_uri, time_column,
                         time_series_identifier_column, target_column):
  batch_prediction_bq_input_uri = _sanitize_bq_uri(
      batch_prediction_bq_input_uri)
  batch_prediction_bq_output_uri = _sanitize_bq_uri(
      batch_prediction_bq_output_uri)
  base_url = "https://datastudio.google.com/c/u/0/reporting"
  query = "SELECT \\n" \
  " CAST(input.{} as DATETIME) timestamp_col,\\n" \
  " CAST(input.{} as STRING) time_series_identifier_col,\\n" \
  " CAST(input.{} as NUMERIC) historical_values,\\n" \
  " CAST(predicted_{}.value as NUMERIC) predicted_values,\\n" \
  " * \\n" \
  "FROM `{}` input\\n" \
  "LEFT JOIN `{}` output\\n" \
  "ON\\n" \
  "CAST(input.{} as DATETIME) = CAST(output.{} as DATETIME)\\n" \
  "AND CAST(input.{} as STRING) = CAST(output.{} as STRING)"
  query = query.format(time_column, time_series_identifier_column,
                       target_column, target_column,
                       batch_prediction_bq_input_uri,
                       batch_prediction_bq_output_uri, time_column, time_column,
                       time_series_identifier_column,
                       time_series_identifier_column)
  params = {
      "templateId": "067f70d2-8cd6-4a4c-a099-292acd1053e8",
      "ds0.connector": "BIG_QUERY",
      "ds0.projectId": PROJECT_ID,
      "ds0.billingProjectId": PROJECT_ID,
      "ds0.type": "CUSTOM_QUERY",
      "ds0.sql": query
  }
  params_str_parts = []
  for k, v in params.items():
    params_str_parts.append("\"{}\":\"{}\"".format(k, v))
  params_str = "".join(["{", ",".join(params_str_parts), "}"])
  return "{}?{}".format(base_url,
                        urllib.parse.urlencode({"params": params_str}))


print(
    get_data_studio_link(batch_predict_bq_input_uri,
                         batch_predict_bq_output_uri, time_column,
                         time_series_identifier_column, target_column))
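
The link builder above assembles a JSON object by hand and URL-encodes it into a single `params` query parameter. That is why the query text is written with `\\n`: the Python string holds a backslash followed by `n`, which a JSON parser then reads as a newline. The sketch below reproduces the encoding on a shortened parameter set and decodes it again to show the round trip. The SQL text, table name, and `build_report_url` helper are placeholders for illustration.

```python
import json
import urllib.parse


def build_report_url(base_url, params):
    """Encode params as a hand-built JSON object in one query parameter."""
    parts = ['"{}":"{}"'.format(key, value) for key, value in params.items()]
    params_str = "{" + ",".join(parts) + "}"
    return "{}?{}".format(base_url, urllib.parse.urlencode({"params": params_str}))


# Placeholder query; "\\n" is a literal backslash plus "n" in the Python string.
sql = "SELECT 1 \\n" "FROM `my-project.my_dataset.my_table`"
url = build_report_url(
    "https://datastudio.google.com/c/u/0/reporting",
    {"ds0.connector": "BIG_QUERY", "ds0.sql": sql},
)
print(url)

# Decode the URL again to confirm the JSON parses and the newline is restored.
query_string = urllib.parse.urlsplit(url).query
decoded = json.loads(urllib.parse.parse_qs(query_string)["params"][0])
print(decoded["ds0.sql"])
```

Hand-built JSON like this breaks if a value contains a double quote. Building the string with `json.dumps` would escape such characters, at the cost of diverging from the notebook's original code.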

# Cleaning up

Qwiklabs cleans up the lab resources automatically. If you are reusing this notebook in your own environment, you can use this cell to remove the Vertex AI model and the Cloud Storage bucket created for this exercise.

In [None]:
# Delete model resource
model.delete(sync=True)

# Delete Cloud Storage objects that were created
! gsutil -m rm -r $BUCKET_NAME