# Forecast Flow

<h2 id="tocheading">Table of Contents</h2>
<div id="toc"></div>

In [None]:
%%javascript
// JavaScript to generate a Table of Contents from notebook headers. Re-execute it at the very beginning and
// whenever the document structure changes
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

## Parameters

In [None]:
# Parameter: URL of the REST API
%env MLFACTORY_REST_API_URL=None

In [None]:
# Parameter
# Problem is created in UI, here we use its ID as parameter
problem_id = 200110

In [None]:
# Parameter
# Experiment is pre-created in UI, here we use its ID as parameter to load and reuse it
experiment_id = 622

## Import Libraries

In [None]:
# Install MLFactory SDK
!pip install mlfactory_sdk --upgrade --extra-index https://pypi.swarm.devfactory.com > /dev/null

In [None]:
# WARNING: This cell may take 15-20 minutes to finish. If you don't need the Pandas Profiler data analysis, just skip this cell,
# as well as the `a. Automatic Data Exploration` section

# System cell
# Workaround to import Pandas Profiler into the notebook. Works with kernel = 'conda_amazonei_tensorflow_p36'
try:
    import pandas_profiling
except ImportError:
    !sudo /home/ec2-user/anaconda3/bin/conda update -n amazonei_tensorflow_p36 --all -y
    !sudo /home/ec2-user/anaconda3/bin/conda install -c conda-forge -n amazonei_tensorflow_p36 pandas-profiling imagehash -y
    !sudo /home/ec2-user/anaconda3/bin/conda update -n amazonei_tensorflow_p36 ipywidgets -y
finally:
    import pandas_profiling

from pandas_profiling import ProfileReport

In [None]:
# System cell
# Import required and common libs
import json
import time
import boto3
import os
import pandas as pd
import numpy as np
import datetime
import pyarrow
from itables import show


In [None]:
# System cell
# Import all required MLFactory classes, and init MLFactory API
from mlfactory.sdk.restclient.mlfactory_api import MLFactoryApi
from mlfactory.sdk.common import Util
from mlfactory.sdk.problem.base import Problem
from mlfactory.sdk.tf.transformations import Tfs
from mlfactory.sdk.common import ExperimentLoader
from mlfactory.sdk.tf.tf_execution import TfExecution


mlf_api = MLFactoryApi()

In [None]:
from mlfactory.sdk.forecast.forecast_config import ForecastConfiguration


## Define Problem

In [None]:
#todo Output JSON as table
problem = Problem.load(problem_id)
problem

## Data Exploration

Here is the place to explore & visualize your data

In [None]:
# Table names under your problem
problem.table_names()

#### a. Automatic Data Exploration

Visualize and analyze your data automatically with Pandas Profiler

In [None]:
# Set to the table name you want to explore
# Generally, this is your main (target time series) dataset
explore_table_name = "raw_data_csv"

In [None]:
# DataFrame loaded from the exploration table
df_explore = problem.read_dataframe_from_table(explore_table_name)
df_explore.head()

In [None]:
# System Cell
# Automatic data exploration
profile = ProfileReport(df_explore, title=f"Exploration report for {explore_table_name}", explorative=True)
profile.to_widgets()

#### b. Automatic Anomaly Detection

Detect outliers in your data using MLFactory AnomalyDetection transformation

[Read details](https://docs.google.com/document/d/1xyV_paZdy3vW9S954korzOqmZVDX56Yavu2OA0zKT_Y/edit?usp=sharing)

In [None]:
# Run this cell if you want to apply automatic anomaly detection

anomaly_detector = Tfs.AnomalyDetection()
anomaly_detector.problem_id = problem.id()
anomaly_detector.table = explore_table_name
anomaly_detector.run()

In [None]:
# Run this cell if you applied automatic anomaly detection above.

status = anomaly_detector.tf_execution.refresh_status()
if status.is_done():
    df_explore = problem.read_dataframe_from_table(anomaly_detector.name)
    # An expression inside an `if` block is not rendered automatically, so display it explicitly
    display(df_explore.head())
else:
    print(f"Anomaly detection is still in progress, please wait. Current status is {status}")

#### c. Custom Exploration

In [None]:
# Feel free to explore more data if you want!

### Data Transformations

### In-Memory and Server-Side Transformations Guide

You have two options to transform your data:
 1. Use `problem.read_dataframe_from_table(<table_name>)` to load all your data into memory as a pandas DataFrame.
  Apply all the transformations you need, and save the data using `problem.write_dataframe_into_table(<df>, <table_name>)`.
  This suits small to medium-sized datasets that fit in RAM (up to several GB).
 2. Use server-side MLFactory SDK transformations. They run on our backend and take more time (usually minutes),
  but they are intended for huge datasets.
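
One way to choose between the two options is to measure how much memory a sample of the table takes once it is loaded. The sketch below relies on pandas' `DataFrame.memory_usage(deep=True)`; the helper name `fits_in_memory` and the 2 GB default budget are illustrative assumptions, not part of the MLFactory SDK.

```python
import pandas as pd


def fits_in_memory(df: pd.DataFrame, budget_bytes: int = 2 * 1024**3) -> bool:
    """Return True if the DataFrame's in-memory size is within the budget.

    `budget_bytes` is a hypothetical threshold; pick one that matches your kernel's RAM.
    """
    # deep=True also counts the payload of object (string) columns
    used_bytes = int(df.memory_usage(deep=True).sum())
    return used_bytes <= budget_bytes


# Tiny illustrative frame; in the notebook this would be the result of
# problem.read_dataframe_from_table(...)
sample = pd.DataFrame({"item_id": ["a", "b", "c"], "target_value": [1.0, 2.0, 3.0]})
small_enough = fits_in_memory(sample)
too_big = not fits_in_memory(sample, budget_bytes=1)
```

If a representative sample already approaches the budget, the server-side route is probably the safer choice.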

Please find examples below

#### Server-side transformation example

#### 1. Choose and configure

In [None]:
# a. To list all available server-side transformations, invoke "all()"

# Tfs.all()

In [None]:
# b. Choose one, e.g. `SelectColumns`

# tf = Tfs.SelectColumns()

In [None]:
# c. To get help on a transformation (or almost any SDK class or function), append `?`,
# or place the cursor on a variable or function name and press Shift+Tab

# tf?

In [None]:
# d. Configure the transformation
# The transformation name is also used as the name of the output table that will hold the transformed data
# By default, the name is the name of the transformation itself, in snake case


# tf.name = "select_best_columns"

# tf.table = "raw_data_csv"

# tf.keys = ["item_id"]

# tf.problem_id = problem.id()

#### 2. Save and execute

In [None]:
# tf_execution = tf.run()

#### 3. Check status

In [None]:
# If the notebook went offline while you were waiting, you can always load the execution object from the DB
# tf_execution = TfExecution.instance(problem_id, transformation_id)

# Check the transformation execution status
# tf_execution.refresh_status()

Now wait, usually for several minutes. Use these flags to check whether the execution is complete and whether it was successful

In [None]:
# tf_execution.refresh_status().is_done()
# tf_execution.refresh_status().is_successful()

#### Classic (Pandas) transformation example

#### 1. Read Data

In [None]:
# All table names for your problem
problem.table_names()

In [None]:
# Read data into pandas DataFrame
df = problem.read_dataframe_from_table("please specify table name")
df.head()

#### 2. Change Data

In [None]:
# Do changes using pandas
# ...

#### 3. Save data back to the table

In [None]:
# You can check available parameters running the line below
# problem.write_dataframe_into_table?

In [None]:
# Write data
# problem.write_dataframe_into_table(df, table_name=<new table name>)

### Transform Data to Required Format

Below is the proper place to transform your data to fit the AWS Forecast requirements

https://docs.aws.amazon.com/forecast/latest/dg/howitworks-datasets-groups.html


#### a. Target Time Series Dataset

The target time series is a mandatory dataset. It must be reduced to the strict format required by AWS
* https://docs.aws.amazon.com/forecast/latest/dg/howitworks-datasets-groups.html
* https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html#target-time-series-type-custom-domain (we are using the CUSTOM domain, so please follow this short instruction)
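
For the CUSTOM domain, the target time series needs the columns `item_id`, `timestamp` and `target_value`. The sketch below shows one way to get there with pandas. The raw column names (`sku`, `sold_at`, `quantity`) and the daily granularity are assumptions made up for illustration; adapt them to your own table.

```python
import pandas as pd

# Hypothetical raw data: one row per sale event
raw = pd.DataFrame(
    {
        "sku": ["A", "A", "B", "A"],
        "sold_at": [
            "2021-01-01 09:15:00",
            "2021-01-01 17:40:00",
            "2021-01-01 11:00:00",
            "2021-01-02 10:05:00",
        ],
        "quantity": [2, 3, 1, 4],
    }
)

# Rename to the column names AWS Forecast expects
target = raw.rename(columns={"sku": "item_id", "sold_at": "timestamp", "quantity": "target_value"})

# Truncate timestamps to the chosen granularity (daily here) and aggregate per item and day
target["timestamp"] = pd.to_datetime(target["timestamp"]).dt.floor("D")
target = (
    target.groupby(["item_id", "timestamp"], as_index=False)["target_value"]
    .sum()
    .astype({"item_id": str, "target_value": float})
)

# Daily data uses the yyyy-MM-dd timestamp format
target["timestamp"] = target["timestamp"].dt.strftime("%Y-%m-%d")
target = target[["item_id", "timestamp", "target_value"]]
```

The resulting frame can then be saved with `problem.write_dataframe_into_table(...)` as described earlier.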

In [None]:
# Add your transformations here (if any)

#### b. Related Time Series Dataset

Optional yet useful dataset which can significantly improve the forecast quality
* https://docs.aws.amazon.com/forecast/latest/dg/related-time-series-datasets.html
* https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html#related-time-series-type-custom-domain (we are using the CUSTOM domain, so please follow this short instruction)
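
A related time series typically carries one value per item per timestamp, at the same granularity as the target. The sketch below fills gaps in a feature by forward-filling within each item. The `price` feature, the dates, and the choice of forward fill are assumptions for illustration only; check the AWS pages above for the exact requirements that apply to your data.

```python
import pandas as pd

# Hypothetical price history with a missing day for item "A"
prices = pd.DataFrame(
    {
        "item_id": ["A", "A", "B", "B", "B"],
        "timestamp": pd.to_datetime(
            ["2021-01-01", "2021-01-03", "2021-01-01", "2021-01-02", "2021-01-03"]
        ),
        "price": [9.99, 10.49, 4.00, 4.00, 4.25],
    }
)

# Build the full (item, day) grid so every item has a row for every day
full_range = pd.date_range("2021-01-01", "2021-01-03", freq="D")
full_index = pd.MultiIndex.from_product(
    [prices["item_id"].unique(), full_range], names=["item_id", "timestamp"]
)

related = (
    prices.set_index(["item_id", "timestamp"])
    .reindex(full_index)          # missing days appear as NaN
    .groupby(level="item_id")
    .ffill()                      # carry the last known price forward, per item
    .reset_index()
)
```

Forward fill suits slowly changing features such as price; for event-like features (promotions, outages) filling with zero may be more appropriate.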

In [None]:
# Add your transformations here (if any)

#### c. Item Metadata Dataset

Another optional dataset (categorisation)
* https://docs.aws.amazon.com/forecast/latest/dg/item-metadata-datasets.html
* https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html#item-metadata-type-custom-domain
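
Item metadata is a static table with one row per `item_id` plus categorical attributes. A minimal sketch of preparing it with pandas follows; the source columns (`sku`, `category`, `brand`) and the `"unknown"` placeholder are hypothetical.

```python
import pandas as pd

# Hypothetical catalogue extract with a repeated row for item "A"
catalogue = pd.DataFrame(
    {
        "sku": ["A", "B", "A", "C"],
        "category": ["shoes", "hats", "shoes", None],
        "brand": ["acme", "acme", "acme", "zenith"],
    }
)

metadata = (
    catalogue.rename(columns={"sku": "item_id"})
    .drop_duplicates(subset="item_id", keep="first")  # exactly one row per item
    .fillna({"category": "unknown"})                  # avoid empty categorical values
    .reset_index(drop=True)
)
```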

In [None]:
# Add your transformations here (if any)

### The Last Transformation

#### 1. If you use server-side transformations

In [None]:
# Select transformation class from Tfs.all() list, and replace <SELECT CLASS> with class name
tf_final = Tfs.<SELECT CLASS>()
# Last transformation in the chain should have the predefined name = "target_time_series"
# That's required by our backend: we don't specify time series table name explicitly but rather expect data to be
# present in a table named "target_time_series"
tf_final.name = "target_time_series"
tf_final.partitions = 1
tf_final.problem_id = problem.id()

In [None]:
# System cell
# Long running

# Run transformation
tf_execution = tf_final.run()

In [None]:
# If the notebook went offline while you were waiting, you can always load the execution object from the DB
# tf_execution = TfExecution.instance(problem_id, transformation_id)

# Wait until `is_done` is True
tf_execution.refresh_status()

#### 2. Or, If you use in-memory pandas transformations

In [None]:
# Read data
df = problem.read_dataframe_from_table('t1_problem_120')
df.head()
# Do changes using pandas
# ...
# Write data
# problem.write_dataframe_into_table(df, "target_time_series")

## Create Experiment

In [None]:
# System cell
experiment = ForecastConfiguration.load(problem_id=problem_id, experiment_id=experiment_id)

#### Required Fields (*)

These fields are mandatory; you must set their values

In [None]:
# Please give your experiment a meaningful name
# It should be unique in the scope of the problem
experiment.name = "EEE1"

# number of data points the model should be able to predict
experiment.forecast_horizon = None

# Y|M|W|D|H|30min|15min|10min|5min|1min
experiment.granularity = "None"
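
Because the template ships with placeholder values (`None` and `"None"`), it can help to check the required fields before starting a long training job. The sketch below is a plain-Python check; `check_required_fields` is a hypothetical helper, and the allowed granularities are copied from the comment in the cell above.

```python
# Values copied from the comment in the cell above
ALLOWED_GRANULARITIES = {"Y", "M", "W", "D", "H", "30min", "15min", "10min", "5min", "1min"}


def check_required_fields(name, forecast_horizon, granularity):
    """Return a list of human-readable problems; an empty list means the fields look usable."""
    problems = []
    if not name or not name.strip():
        problems.append("name must be a non-empty string")
    # bool is a subclass of int, so reject it explicitly
    if (
        not isinstance(forecast_horizon, int)
        or isinstance(forecast_horizon, bool)
        or forecast_horizon <= 0
    ):
        problems.append("forecast_horizon must be a positive integer")
    if granularity not in ALLOWED_GRANULARITIES:
        problems.append(f"granularity must be one of {sorted(ALLOWED_GRANULARITIES)}")
    return problems


ok = check_required_fields("EEE1", 14, "D")
template_defaults = check_required_fields("EEE1", None, "None")
```

With the untouched template values, both the horizon and the granularity are reported, so `template_defaults` holds two messages.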

#### Optional Fields

You can leave the values as is, or adjust them if you want

In [None]:
experiment.description = ""

# https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-choosing-recipes.html
# "automl" works well in most cases
experiment.algorithm = "automl"

# number of data points to test on (forecast_horizon <= test_horizon < 1/2 * TARGET_TIME_SERIES dataset length)
experiment.test_horizon = experiment.forecast_horizon

# Related Time series
experiment.related_timeseries_location = None

# Item Metadata
experiment.metadata_location = None

# https://docs.aws.amazon.com/forecast/latest/dg/metrics.html
experiment.number_of_test_windows = 1

# 'mean' or a quantile in the 0-1 range; up to 5 values
DEFAULT_FORECAST_METRICS_LIST = ["0.1", "0.5", "0.9", "mean"]   # Will reuse that var during deploy
experiment.metrics = DEFAULT_FORECAST_METRICS_LIST

# https://docs.aws.amazon.com/forecast/latest/dg/API_SupplementaryFeature.html
experiment.holiday_country_code = "US"

# The current assumption is that we don't need to pick a specific dataset domain (retail, web traffic, etc.),
# so we go ahead with the 'custom domain'
# (https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html)
experiment.domain = "CUSTOM"

experiment.featurization = {
    "AttributeName": "target_value",
    "FeaturizationPipeline": [{
      "FeaturizationMethodName": "filling",
      "FeaturizationMethodParameters": {
        "frontfill": "none",
        "middlefill": "zero",
        "backfill": "zero"
      }
    }]
}
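
Two of the optional fields carry constraints stated in the comments above: `forecast_horizon <= test_horizon < 1/2 * dataset length`, and at most five metrics, each either `"mean"` or a quantile. A minimal sketch of checking both before training follows. The helper names are hypothetical, `series_length` is assumed to be the number of data points in the target time series, and treating quantiles as strictly between 0 and 1 is an assumption.

```python
def check_test_horizon(forecast_horizon, test_horizon, series_length):
    """Check forecast_horizon <= test_horizon < series_length / 2."""
    return forecast_horizon <= test_horizon < series_length / 2


def check_metrics(metrics):
    """Check for 1 to 5 entries, each 'mean' or a number strictly between 0 and 1."""
    if not 1 <= len(metrics) <= 5:
        return False
    for value in metrics:
        if value == "mean":
            continue
        try:
            quantile = float(value)
        except (TypeError, ValueError):
            return False
        if not 0 < quantile < 1:
            return False
    return True


horizon_ok = check_test_horizon(forecast_horizon=14, test_horizon=14, series_length=100)
horizon_too_long = check_test_horizon(forecast_horizon=14, test_horizon=60, series_length=100)
metrics_ok = check_metrics(["0.1", "0.5", "0.9", "mean"])
```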

## Train Model

In [None]:
# System cell
# Long-running job
experiment.save_and_run_training()

That's it! Now all you need is to wait until training is over.

In [None]:
# When the status = 'Training complete', we can proceed further. Otherwise, we need to wait.
# Training could take several hours, or even more, on huge datasets
experiment = ExperimentLoader.load(problem_id, experiment_id)
experiment.status()

### Create Forecast

In [None]:
# Create the actual forecast
# Instead of DEFAULT_FORECAST_METRICS_LIST, you can use a subset of the metrics you set for training as `experiment.metrics`
forecast_arn = experiment.create_forecast(DEFAULT_FORECAST_METRICS_LIST)
forecast_arn

In [None]:
# And now we should wait until status becomes 'ACTIVE'
experiment = ExperimentLoader.load(problem_id, experiment_id)
experiment._forecast_arn = forecast_arn

experiment.get_forecast_status()

In [None]:
# todo Add Filter API and provide access to Forecast API from here
# forecast_result = experiment.get_forecast_result()

## Export All Predictions

### Run Export

This creates an export job that saves all predicted results to an S3 location

FYR: https://docs.aws.amazon.com/forecast/latest/dg/API_CreateForecastExportJob.html

In [None]:
# You can set your custom s3_path to export data to
s3_path = None
export_job_arn, s3_path = experiment.export_forecast_results(s3_path)
export_job_arn


In [None]:
# Wait until the status becomes 'ACTIVE' (it's usually pretty fast, several minutes)
experiment = ExperimentLoader.load(problem_id, experiment_id)
experiment._forecast_arn = forecast_arn
experiment._export_job_arn = export_job_arn

experiment.get_forecast_export_job_status()


### Load All Predictions into Pandas DataFrame

In [None]:
# Load all predictions into a single dataframe
# Please note that there should be no other files there
df_predictions = Util.load_dataframe_from_s3_partitions(s3_path)
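
To illustrate why stray files in the export location matter, here is a local analogue of loading partitioned CSV output. This is not the SDK's implementation: `load_csv_partitions` is a hypothetical helper that reads from a local directory rather than S3, and the part file names are made up.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_partitions(directory):
    """Concatenate every *.csv file in `directory` into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"No CSV partitions found in {directory}")
    # Every file is read with the same parser, so an unrelated CSV would pollute the result
    return pd.concat((pd.read_csv(path) for path in paths), ignore_index=True)


# Demonstrate on two throwaway part files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"item_id": [1], "date": ["2021-01-01"], "mean": [10.0]}).to_csv(
        os.path.join(tmp, "part-0.csv"), index=False
    )
    pd.DataFrame({"item_id": [2], "date": ["2021-01-01"], "mean": [20.0]}).to_csv(
        os.path.join(tmp, "part-1.csv"), index=False
    )
    combined = load_csv_partitions(tmp)
```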

In [None]:
# Show several lines
df_predictions.head()


In [None]:
# Example of how you can plot the predictions for a single item
# Substitute the `item_id` and `metrics` you are interested in
item_id = 1
metrics = "mean"
df_predictions[df_predictions.item_id == item_id].plot(x="date", y=metrics, figsize=(50, 5))