# In Progress (Coming Soon)
## 04g - Vertex AI > Pipelines - Forecasting Tournament with Kubeflow (KFP) and BQML + AutoML + Prophet


### Prerequisites:
-  04 - Time Series Forecasting - Data Review in BigQuery

### Overview:
-  

### Resources:
-  

---
## Vertex AI - Conceptual Flow

<img src="architectures/slides/04g_arch.png">

---
## Vertex AI - Workflow

<img src="architectures/slides/04g_console.png">

---
# DEV NOTES

KFP
- prepare data: raw > forecast (Transform, add splits)
- Launch Forecasting
    - ARIMA+
        - Output to BigQuery
        - Post-Process Predictions
    - AutoML
        - Parallel CW scenarios [32, 16, 8, 4, 2, 1, 0]
            - Launch Vertex AI AutoML Forecasting Training Job
                - Output to BigQuery
            - Post-Process Predictions
    - Prophet
        - Parallel Scenarios (yearly flag):
            - Launch Vertex AI Training Jobs - use 04f container
                - Output to BigQuery
            - Post-Process Predictions
- Custom Metrics - Use combined predictions table from the individual post-processing
- Prepare Forecast Table (Best by Series)


The AutoML component needs to run 7 concurrent jobs, one per context window scenario.
The default limit is 5; see: https://cloud.google.com/vertex-ai/docs/quotas#model_quotas
I raised this to 10 through IAM > Quotas using these instructions: https://cloud.google.com/docs/quota#requesting_higher_quota
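
The tournament plan in these notes can be enumerated as a plain Python list to check how many jobs each method launches. This is a minimal sketch, not part of the pipeline: the scenario labels (`cw_32`, `yearly_True`), the `Vertex AI` platform label, and the assumption that the Prophet yearly flag is a simple on/off switch are all hypothetical. The context windows and the limits of 5 and 10 come from the notes.

```python
# Context-window scenarios for AutoML, taken from the dev notes.
AUTOML_CONTEXT_WINDOWS = [32, 16, 8, 4, 2, 1, 0]
# Assumption: the Prophet "yearly flag" is an on/off switch.
PROPHET_YEARLY_FLAGS = [True, False]

DEFAULT_LIMIT = 5   # default concurrent limit cited in the notes
RAISED_LIMIT = 10   # limit after the quota increase


def tournament_scenarios():
    """Return (platform, method, scenario) tuples for every planned run."""
    runs = [("BigQuery", "ARIMA_PLUS", "automatic")]
    # Hypothetical labels for the Vertex AI runs.
    runs += [("Vertex AI", "AutoML", f"cw_{cw}") for cw in AUTOML_CONTEXT_WINDOWS]
    runs += [("Vertex AI", "Prophet", f"yearly_{flag}") for flag in PROPHET_YEARLY_FLAGS]
    return runs


scenarios = tournament_scenarios()
automl_jobs = sum(1 for _, method, _ in scenarios if method == "AutoML")

print("planned runs:", len(scenarios))
print("concurrent AutoML jobs:", automl_jobs)
print("fits default limit:", automl_jobs <= DEFAULT_LIMIT)
print("fits raised limit:", automl_jobs <= RAISED_LIMIT)
```

Under these assumptions the AutoML branch alone exceeds the default limit, which is why the quota increase is needed before the parallel scenarios can run together.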



---
## Setup

inputs:

In [None]:
PROJECT_ID='statmike-mlops'
REGION = 'us-central1'
DATANAME = 'citibikes'
NOTEBOOK = '04g'

# Used for Prophet Custom Forecasting Jobs
BASE_IMAGE = 'gcr.io/deeplearning-platform-release/base-cpu'
TRAIN_COMPUTE = 'n1-standard-8'

packages:

In [None]:
from google.cloud import bigquery
from google.cloud import aiplatform

from typing import NamedTuple
import kfp # used for dsl.pipeline
import kfp.v2.dsl as dsl # used for dsl.component, dsl.Output, dsl.Input, dsl.Artifact, dsl.Model, ...

import matplotlib.pyplot as plt
from datetime import datetime
import json

clients:

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION)
bigquery = bigquery.Client()

parameters:

In [None]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{DATANAME}/models/{NOTEBOOK}"
DIR = f"temp/{NOTEBOOK}"

In [None]:
# Give service account roles/storage.objectAdmin permissions
# Console > IAM > Select Account <projectnumber>-compute@developer.gserviceaccount.com > edit - give role
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

environment:

In [None]:
!rm -rf {DIR}
!mkdir -p {DIR}

---
## Custom Components (KFP)

Vertex AI Pipelines are made up of components that run independently, with inputs and outputs that connect to form a graph: the pipeline.  For this notebook workflow, the following custom components orchestrate different forecasting approaches (BigQuery ML ARIMA+, Prophet, and Vertex AI AutoML Forecasting) and the different scenarios for each of them.
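
The components in this notebook return their outputs as named tuples, and downstream steps read those outputs by name. The sketch below shows that wiring with plain Python functions so it runs without KFP installed. It deliberately omits the `@dsl.component` decorator, and the function names `prep_step` and `model_step`, the project `my-project`, and the `_forecast` suffix are hypothetical.

```python
from collections import namedtuple
from typing import NamedTuple


def prep_step(
    project: str, notebook: str
) -> NamedTuple("Source", [("Dataset", str), ("Table", str)]):
    """Stand-in for a data preparation step: returns named outputs."""
    outputs = namedtuple("Source", ["Dataset", "Table"])
    return outputs(f"{notebook}_tournament", "source")


def model_step(
    project: str, bqdataset: str, bqtable: str
) -> NamedTuple("Result", [("Output", str)]):
    """Stand-in for a modeling step that consumes the upstream outputs."""
    outputs = namedtuple("Result", ["Output"])
    return outputs(f"{project}.{bqdataset}.{bqtable}_forecast")


# Wire the steps together: the second step reads the first step's named outputs.
prep = prep_step("my-project", "04g")
fit = model_step("my-project", prep.Dataset, prep.Table)
print(fit.Output)
```

Inside a KFP pipeline the same connection is expressed through task outputs, for example `prep.outputs['Dataset']`, and the pipeline infers the graph edges from those references.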

### Data Preparation
This component prepares the data for forecasting and adds the split into Train/Validate/Test sets.  This follows the logic used in notebook `04`.

In [None]:
@dsl.component(
    base_image = "python:3.9",
    packages_to_install = ['pandas','pyarrow','google-cloud-bigquery']
)
def forecast_prep(
    project: str,
    notebook: str,
    bqsource: str
) -> NamedTuple('Source', [('Dataset', str), ('Table', str), ('TournamentTable', str), ('SourceQuery', str), ('TournamentQuery', str)]):

    # Setup
    from google.cloud import bigquery
    from collections import namedtuple
    
    bigquery = bigquery.Client(project = project)
    
    # parameters
    bqdataset = f"{notebook}_tournament"
    bqtable = 'source'
    bqmain = 'tournament'
    sources = namedtuple('Source', ['Dataset', 'Table', 'TournamentTable', 'SourceQuery', 'TournamentQuery'])
    
    # Create Schema
    query = f"""
        CREATE SCHEMA IF NOT EXISTS `{project}.{bqdataset}`
        OPTIONS(location = 'US', labels = [('notebook','{notebook}')])
    """
    job = bigquery.query(query = query)
    job.result()
    
    # Plan Cutoff dates
    query = f"""
        WITH
            ALLDATES AS(
                SELECT EXTRACT(DATE from starttime) as date
                FROM `{bqsource}`
                WHERE start_station_name LIKE '%Central Park%'
            ),
            KEYS AS(
                SELECT 
                    MIN(date) as start_date,
                    MAX(date) - CAST(0.025 * DATE_DIFF(MAX(date), MIN(date), DAY) AS INT64) as val_start,
                    MAX(date) - CAST(0.0125 * DATE_DIFF(MAX(date), MIN(date), DAY) AS INT64) test_start,
                    MAX(date) as end_date
                FROM ALLDATES  
            )
        SELECT *, DATE_DIFF(end_date, test_start, DAY)+1 as forecast_horizon
        FROM KEYS    
    """
    keyDates = bigquery.query(query = query).to_dataframe()
    
    # Prepare Source
    querySource = f"""
        CREATE OR REPLACE TABLE `{project}.{bqdataset}.{bqtable}` AS
        WITH
            DAYS AS(
                SELECT
                   start_station_name,
                   EXTRACT(DATE from starttime) AS date,
                   COUNT(*) AS num_trips
                FROM `{bqsource}`
                WHERE start_station_name LIKE '%Central Park%'
                GROUP BY start_station_name, date
            )
        SELECT *,
           CASE
               WHEN date < DATE({keyDates['val_start'][0].strftime('%Y, %m, %d')}) THEN "TRAIN"
               WHEN date < DATE({keyDates['test_start'][0].strftime('%Y, %m, %d')}) THEN "VALIDATE"
               ELSE "TEST"
           END AS splits
        FROM DAYS
    """
    job = bigquery.query(query = querySource)
    job.result()
    
    # Prepare Common Output Table
    queryOutput = f"""
        CREATE OR REPLACE TABLE `{project}.{bqdataset}.{bqmain}`
        (platform STRING, method STRING, scenario STRING, start_station_name STRING, date DATE, num_trips INT64, yhat FLOAT64, yhat_lower FLOAT64, yhat_upper FLOAT64)
    """
    job = bigquery.query(query = queryOutput)
    job.result()
    
    return sources(bqdataset, bqtable, bqmain, querySource, queryOutput)
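
The cutoff logic in the data preparation component can be checked locally without BigQuery. The sketch below mirrors the `KEYS` query in plain Python: the validation set starts 2.5% of the date span before the last date, the test set starts 1.25% before it, and the forecast horizon is the number of test days. The function names `plan_cutoffs` and `assign_split` and the example dates are hypothetical.

```python
import math
from datetime import date, timedelta


def plan_cutoffs(start_date, end_date, val_frac=0.025, test_frac=0.0125):
    """Return (val_start, test_start, forecast_horizon) for a date range."""
    span_days = (end_date - start_date).days
    # BigQuery's CAST(FLOAT64 AS INT64) rounds halfway cases away from zero;
    # floor(x + 0.5) matches that for the non-negative values used here.
    val_offset = math.floor(val_frac * span_days + 0.5)
    test_offset = math.floor(test_frac * span_days + 0.5)
    val_start = end_date - timedelta(days=val_offset)
    test_start = end_date - timedelta(days=test_offset)
    forecast_horizon = (end_date - test_start).days + 1
    return val_start, test_start, forecast_horizon


def assign_split(day, val_start, test_start):
    """Label a date the same way the CASE expression in the query does."""
    if day < val_start:
        return "TRAIN"
    if day < test_start:
        return "VALIDATE"
    return "TEST"


# Hypothetical date range spanning 800 days.
start = date(2020, 1, 1)
end = start + timedelta(days=800)
val_start, test_start, horizon = plan_cutoffs(start, end)
print(val_start, test_start, horizon)
print(assign_split(start, val_start, test_start))
print(assign_split(end, val_start, test_start))
```

Because the horizon is derived from the test window, the downstream forecasting components can recover it from the `splits` column alone.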

### BigQuery ML ARIMA+
This component fits a forecasting model using BigQuery ML model type ARIMA+.  This follows the logic used in notebook `04a`.

In [None]:
@dsl.component(
    base_image = "python:3.9",
    packages_to_install = ['pandas','pyarrow','google-cloud-bigquery']
)
def forecast_bqarima(
    project: str,
    notebook: str,
    bqdataset: str,
    bqtable: str,
    bqmain: str
) -> NamedTuple('Source', [('Output', str), ('Model', str), ('ModelQuery', str)]):

    # Setup
    from google.cloud import bigquery
    from collections import namedtuple
    
    bigquery = bigquery.Client(project = project)
    
    # parameters
    bqsource = f"{project}.{bqdataset}.{bqtable}"
    bqmodel = f"{project}.{bqdataset}.{notebook}_arimaplus"
    bqoutput = f"{project}.{bqdataset}.{notebook}_forecast_arimaplus"
    sources = namedtuple('Source', ['Output', 'Model', 'ModelQuery'])
    
    # Retrieve Key Dates from Source
    query = f"""
        WITH
            SPLIT AS (
                SELECT splits, min(date) as mindate, max(date) as maxdate
                FROM `{bqsource}`
                GROUP BY splits
            ),
            TRAIN AS (
                SELECT mindate as start_date
                FROM SPLIT
                WHERE splits ='TRAIN'
            ),
            VAL AS (
                SELECT mindate as val_start
                FROM SPLIT
                WHERE splits = 'VALIDATE'
            ),
            TEST AS (
                SELECT mindate as test_start, maxdate as end_date, DATE_DIFF(maxdate, mindate, DAY)+1 as forecast_horizon
                FROM SPLIT
                WHERE splits = 'TEST'
            )
        SELECT * EXCEPT(pos) FROM
        (SELECT *, ROW_NUMBER() OVER() pos FROM TRAIN)
        JOIN (SELECT *, ROW_NUMBER() OVER() pos FROM VAL)
        USING (pos)
        JOIN (SELECT *, ROW_NUMBER() OVER() pos FROM TEST)
        USING (pos)
    """
    keyDates = bigquery.query(query = query).to_dataframe()
    
    # Create Model: ARIMA_PLUS
    queryARIMA = f"""
        CREATE OR REPLACE MODEL `{bqmodel}`
        OPTIONS
          (model_type = 'ARIMA_PLUS',
           time_series_timestamp_col = 'date',
           time_series_data_col = 'num_trips',
           time_series_id_col = 'start_station_name',
           auto_arima_max_order = 5,
           holiday_region = 'US',
           horizon = {keyDates['forecast_horizon'][0]}
          ) AS
        SELECT start_station_name, date, num_trips
        FROM `{bqsource}`
        WHERE splits in ('TRAIN','VALIDATE')
    """
    job = bigquery.query(query = queryARIMA)
    job.result()
    
    # Create Raw Output
    query = f"""
        CREATE OR REPLACE TABLE `{bqoutput}` AS
        WITH
            FORECAST AS (
                SELECT
                    start_station_name, 
                    EXTRACT(DATE from time_series_timestamp) as date,
                    time_series_adjusted_data as forecast_value,
                    time_series_type,
                    prediction_interval_lower_bound,
                    prediction_interval_upper_bound
                FROM ML.EXPLAIN_FORECAST(MODEL `{bqmodel}`, STRUCT({keyDates['forecast_horizon'][0]} AS horizon, 0.95 AS confidence_level))
                WHERE time_series_type = 'forecast'
            ),
            ACTUAL AS (
                SELECT start_station_name, date, sum(num_trips) as actual_value
                FROM `{bqsource}`
                WHERE splits = 'TEST'
                GROUP BY start_station_name, date
            )
        SELECT *
        FROM FORECAST
        INNER JOIN ACTUAL
        USING (start_station_name, date)
        ORDER BY start_station_name, time_series_type  
    """
    job = bigquery.query(query = query)
    job.result()
    
    # Insert Output for Tournament
    query = f"""
        INSERT INTO `{project}.{bqdataset}.{bqmain}`
        SELECT
            'BigQuery' as platform,
            'ARIMA_PLUS' as method,
            'automatic' as scenario,
            start_station_name,
            date,
            actual_value as num_trips,
            forecast_value as yhat,
            prediction_interval_lower_bound as yhat_lower,
            prediction_interval_upper_bound as yhat_upper
        FROM `{bqoutput}`
        ORDER by start_station_name, date
    """
    job = bigquery.query(query = query)
    job.result()
    
    return sources(bqoutput, bqmodel, queryARIMA)

### Vertex AI AutoML Forecasting
This component fits a forecasting model using Vertex AI AutoML Forecasting.  This follows the logic used in notebooks `04c` and `04d`.

### Vertex AI Training Custom Jobs for Forecasting with Prophet
This component fits forecasting models by running a Prophet script in a custom container (built in `04f`) as Vertex AI Training custom jobs.  This follows the logic used in notebook `04f`.

### Custom Metrics and Champion Selection
This component calculates custom metrics for all time series and all methods, selects the best method per series, and prepares a champion prediction table in BigQuery.
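
The selection step can be sketched in plain Python before it is written as a component. This is a minimal illustration under stated assumptions: the metric is mean absolute error as a stand-in (the notebook does not yet say which custom metrics it uses), the rows and station names are made-up values, and the function names `mae_by_candidate` and `pick_champions` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (method, scenario, series, num_trips, yhat).
rows = [
    ("ARIMA_PLUS", "automatic", "Station A", 100, 90.0),
    ("ARIMA_PLUS", "automatic", "Station A", 120, 115.0),
    ("Prophet", "yearly_True", "Station A", 100, 98.0),
    ("Prophet", "yearly_True", "Station A", 120, 121.0),
    ("ARIMA_PLUS", "automatic", "Station B", 50, 51.0),
    ("ARIMA_PLUS", "automatic", "Station B", 60, 58.0),
    ("Prophet", "yearly_True", "Station B", 50, 40.0),
    ("Prophet", "yearly_True", "Station B", 60, 75.0),
]


def mae_by_candidate(rows):
    """Mean absolute error per (series, method, scenario)."""
    errors = defaultdict(list)
    for method, scenario, series, actual, yhat in rows:
        errors[(series, method, scenario)].append(abs(actual - yhat))
    return {key: sum(vals) / len(vals) for key, vals in errors.items()}


def pick_champions(rows):
    """Return {series: (method, scenario, mae)} with the lowest error per series."""
    champions = {}
    for (series, method, scenario), mae in mae_by_candidate(rows).items():
        best = champions.get(series)
        if best is None or mae < best[2]:
            champions[series] = (method, scenario, mae)
    return champions


champions = pick_champions(rows)
for series, (method, scenario, mae) in sorted(champions.items()):
    print(series, method, scenario, mae)
```

Selecting per series rather than overall lets different stations end up with different winning methods, which is the point of the tournament.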

---
## Vertex AI Pipeline

### Pipeline (KFP) Definition

In [None]:
# Parameters
BQ_SOURCE = 'bigquery-public-data.new_york.citibike_trips'

### Compile Pipeline

### Create Vertex AI Pipeline Job

### Review Pipeline Job

## Results