<div>
<img src="https://user-images.githubusercontent.com/3348134/223112746-345126ff-a0e8-479f-8ac0-670d78f71712.png" width="500"/>
</div>

In [None]:
import warnings
warnings.filterwarnings('ignore')
%load_ext autoreload
%autoreload 2
from absl import logging as absl_logging
absl_logging.set_verbosity(-10000)

# Why ZenML

![Sam](_assets/sam.png)

Let's get into this. But first things first: we need to initialize our ZenML repository.

In [None]:
!zenml init
# Create a local stack to run these pipelines
!zenml stack register local_stack -a default -o default
!zenml stack set local_stack

# Overview

A couple weeks ago, we were looking for a fun project to work on for the next chapter of our ZenHacks. During our initial discussions, we realized that it would be really great to work with an NBA dataset, as we could quickly get close to a real-life application like a "3-Pointer Predictor" while simultaneously entertaining ourselves with one of the trending topics within our team.

As we were building the dataset around a "3-Pointer Predictor", we realized that there is one factor that we need to take into consideration first: Stephen Curry, The Baby Faced Assassin. In our opinion, there is no denying that he changed the way that the games are played in the NBA and we wanted to actually prove that this was the case first. 

That's why our story in this ZenHack starts with a pipeline dedicated to drift detection. As the breakpoint of this drift, we will use the famous "Double Bang" game that the Golden State Warriors played against the Oklahoma City Thunder back in 2016. Following that, we will build a training pipeline that generates a model predicting the number of three-pointers made by a team in a single game. Finally, we will use the trained model in an inference pipeline for upcoming NBA matches.

# Chapter 1 - Exploring NBA Data
## Did Steph Curry Change the Game?

https://www.youtube.com/watch?v=GEMVGHoenXM

![Steph Curry Drains the Game Winner vs Oklahoma City](https://i.makeagif.com/media/3-20-2016/7N5RWB.gif)

In [None]:
# We'll use this date in our pipelines as the division between old and new
CURRYS_THREE_POINTER = '2016-02-27'

![PipelineStructure](_assets/DriftDetectionPipeline.png "PipelineStructure")

## Creating our first step

Naturally, our first step is the data import. For this we query `nba_api` for the game logs of a set of seasons.

<div class="alert alert-block alert-warning">
Best practice would be to disable cache for steps that fetch external data. ZenML has no way of knowing if this data has changed and would always cache the step if caching is not explicitly disabled.
</div>

During development it is useful to keep caching enabled, because it speeds up iteration on downstream steps.
Use `@step(enable_cache=False)` to disable the cache for a step.
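
The risk is easiest to see with a toy cache. The sketch below is plain Python, not ZenML's caching code: `cached_fetch`, `EXTERNAL_SOURCE` and the in-memory dictionary are made up for illustration. It assumes that the cache only sees the step's arguments, so a change in the external source goes unnoticed.

```python
# Toy illustration of input-keyed caching (hypothetical names, not ZenML internals).
EXTERNAL_SOURCE = {"2015-16": 100}   # stands in for a remote API
_cache = {}

def fetch(season):
    """Pretend API call that returns the current number of game rows for a season."""
    return EXTERNAL_SOURCE[season]

def cached_fetch(season, enable_cache=True):
    """Return a cached result when the arguments match, unless caching is disabled."""
    if enable_cache and season in _cache:
        return _cache[season]
    result = fetch(season)
    _cache[season] = result
    return result

first = cached_fetch("2015-16")
EXTERNAL_SOURCE["2015-16"] = 120          # the remote data changes
stale = cached_fetch("2015-16")           # same arguments, so the old value comes back
fresh = cached_fetch("2015-16", enable_cache=False)
print(first, stale, fresh)
```

The second call returns the old value even though the source changed, which is the situation the warning above describes for steps that fetch external data.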

In [None]:
import time
from zenml.steps import step
from steps.importer import ImporterConfig
import pandas as pd
from nba_api.stats.endpoints import leaguegamelog

@step
def game_data_importer(config: ImporterConfig) -> pd.DataFrame:
    """Downloads season data from NBA API and returns a pd.DataFrame"""
    dataframes = []
    for season in config.seasons:
        print(f"Fetching data for season: {season}")
        dataframes.append(leaguegamelog.LeagueGameLog(season=season, timeout=180).get_data_frames()[0])
        # sleep so as not to bomb api server :-)
        time.sleep(2)
    return pd.concat(dataframes)

## Creating an exploratory pipeline

This is where we configure the steps of our pipeline and how data will flow from one step into the other. 

For this we use the `@pipeline` decorator. To define a pipeline we first define all steps of the pipeline in the function signature. Then within the function we configure how the outputs of steps get passed into steps downstream.

In [None]:
from zenml.pipelines import pipeline

@pipeline
def data_analysis_pipeline(
        importer,          # Import NBA game data
        drift_splitter,    # Split data at relevant date
        drift_detector,    # Compare data distributions
):
    """Links all the steps together in a pipeline"""
    raw_data = importer()
    reference_dataset, comparison_dataset = drift_splitter(raw_data)
    drift_report, _ = drift_detector(reference_dataset, comparison_dataset)
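
To make the data flow concrete, the same wiring can be exercised with plain Python functions standing in for the steps. This is only a sketch of the pattern: the three stand-in functions and their toy data are hypothetical, and ZenML's `@pipeline` decorator adds tracking, caching and orchestration on top of it.

```python
def wire_data_analysis(importer, drift_splitter, drift_detector):
    """Mirror the data flow of data_analysis_pipeline with plain callables."""
    raw_data = importer()
    reference, comparison = drift_splitter(raw_data)
    drift_report, _ = drift_detector(reference, comparison)
    return drift_report

# Hypothetical stand-ins for the real steps
def fake_importer():
    return [("2016-02-20", 7), ("2016-02-26", 9), ("2016-03-01", 12), ("2016-03-05", 14)]

def fake_splitter(rows, split_date="2016-02-27"):
    before = [made for day, made in rows if day < split_date]
    after = [made for day, made in rows if day >= split_date]
    return before, after

def fake_detector(reference, comparison):
    mean_shift = sum(comparison) / len(comparison) - sum(reference) / len(reference)
    report = {"mean_shift": mean_shift}
    return report, None  # second output is unused, as in the pipeline definition

report = wire_data_analysis(fake_importer, fake_splitter, fake_detector)
print(report)
```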

## Integrating Evidently

Evidently is an open-source tool that allows you to easily compute drift on your data. [Here](https://blog.zenml.io/zenml-loves-evidently/) is a short blog post of ours that explains the Evidently integration in a bit more detail. 

At its core, Evidently’s drift detection calculation functions take in a reference data set and compare it with a separate comparison dataset. These are both passed in as Pandas dataframes, though CSV inputs are also possible. ZenML implements this functionality in the form of several standardized steps along with an easy way to use the visualization tools also provided along with Evidently as ‘Dashboards’.


If you’re working on any kind of machine learning problem that has an ongoing training loop that takes in new data, you’ll want to guard against drift. Machine learning pipelines are built on top of data inputs, so it is worth checking for drift if you have a model that was trained on a certain distribution of data. The incoming data is something you have less control over and since things often change out in the real world, you should have a plan for knowing when things have shifted. Evidently offers a [growing set of features](https://github.com/evidentlyai/evidently) that help you monitor not only data drift but other key aspects like target drift and so on.
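
As a rough intuition for what a drift check computes, the sketch below implements a two-sample Kolmogorov-Smirnov statistic in plain Python: the largest gap between the empirical distribution functions of the reference and comparison samples. This is a stand-in chosen for illustration, not Evidently's implementation; Evidently selects its own statistical tests, and the sample values here are made up.

```python
from bisect import bisect_right

def ks_statistic(reference, comparison):
    """Largest absolute gap between the two empirical CDFs."""
    ref = sorted(reference)
    cmp_ = sorted(comparison)
    gap = 0.0
    for value in sorted(set(ref) | set(cmp_)):
        ref_cdf = bisect_right(ref, value) / len(ref)
        cmp_cdf = bisect_right(cmp_, value) / len(cmp_)
        gap = max(gap, abs(ref_cdf - cmp_cdf))
    return gap

# Made-up three-pointers-per-game samples
before = [5, 6, 7, 8, 9, 7, 6, 8]
after = [10, 12, 11, 13, 9, 12, 14, 11]

same = ks_statistic(before, before)
shifted = ks_statistic(before, after)
print(same, shifted)
```

A statistic near 0 means the two samples look alike, while a value near 1 means they barely overlap. A real drift test would also turn the statistic into a p-value before deciding.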

![Evidently](_assets/zenml+evidently.png "Evidently")

In [None]:
# First we need to install evidently to our python environment
!zenml integration install evidently -f

In [None]:
# Then we need to add evidently data validator to stack
!zenml data-validator register local_evidently --flavor=evidently
!zenml stack set local_stack
!zenml stack update local_stack -dv local_evidently    

Here we choose the [DatasetDriftMetric](https://docs.evidentlyai.com/reference/all-metrics#data-drift) for our [Report](https://docs.evidentlyai.com/readme/core-concepts#what-is-a-report).

In [None]:
# ZenML provides some standard steps for the Evidently integration
from zenml.integrations.evidently.steps import (
    EvidentlyReportParameters,
    EvidentlyReportStep,
)
from zenml.integrations.evidently.metrics import EvidentlyMetricConfig

# We create a config object for our Evidently step -
#  here we choose the DatasetDriftMetric
evidently_drift_detector_config = EvidentlyReportParameters(
    # column_mapping=None,
    metrics=[EvidentlyMetricConfig.metric("DatasetDriftMetric")],
)

### Add step implementations to the pipeline and run

In [None]:
from steps.splitter import date_based_splitter, SplitConfig

# We also configure our data splitter - in this case we want to compare data from before
#  Steph Curry's famous three-pointer with data from after it
data_split_config = SplitConfig(date_split=CURRYS_THREE_POINTER, columns=['FG3M'])

# Instantiate the pipeline
#  For this we need to pass all our step implementations. 
#  At this stage the step configurations are passed to the corresponding steps
eda_pipeline = data_analysis_pipeline(
    importer=game_data_importer(),
    drift_splitter=date_based_splitter(data_split_config),
    drift_detector=EvidentlyReportStep(evidently_drift_detector_config),
)

eda_pipeline.run()
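
The splitter itself lives in `steps/splitter.py` and is not shown here. A minimal sketch of what a date-based split might look like is given below, assuming each row carries an ISO-formatted `GAME_DATE` and that the split date belongs to the comparison set; `split_by_date` and the sample rows are hypothetical.

```python
from datetime import date

def split_by_date(rows, date_split, columns):
    """Split rows into (before, on-or-after) date_split, keeping only `columns`."""
    cutoff = date.fromisoformat(date_split)
    reference, comparison = [], []
    for row in rows:
        kept = {col: row[col] for col in columns}
        if date.fromisoformat(row["GAME_DATE"]) < cutoff:
            reference.append(kept)
        else:
            comparison.append(kept)
    return reference, comparison

games = [
    {"GAME_DATE": "2016-02-25", "FG3M": 8},
    {"GAME_DATE": "2016-02-27", "FG3M": 12},
    {"GAME_DATE": "2016-03-01", "FG3M": 11},
]
reference, comparison = split_by_date(games, "2016-02-27", ["FG3M"])
print(reference, comparison)
```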

## Post-execution: Fetching pipelines and reviewing results

Once our pipeline has run we now want to inspect and visualize the results.

In [None]:
from zenml.integrations.evidently.visualizers import EvidentlyVisualizer
from zenml.post_execution import get_pipeline
import json

p = get_pipeline(pipeline_name='data_analysis_pipeline')

In [None]:
# Our pipeline can have multiple runs associated with it
p.runs

In [None]:
# For this we want to look at the last run, the runs are sorted chronologically
last_run = p.runs[0]
last_run

In [None]:
drift_detection_step = last_run.get_step(
    name="drift_detector"
)
drift_detection_step

In [None]:
EvidentlyVisualizer().visualize(drift_detection_step)

__Conclusion__: Ever since Steph Curry's game in 2016, there has been a drift in how many three-pointers are scored in the NBA. Is this all thanks to Curry? We won't claim any causation here. But we can say for sure that the number of three-pointers in the NBA has increased in the last few years.

# Chapter 2 - Training Pipeline 

Let's move on to our machine learning task. The diagram below is the result of an internal brainstorming session about which use case we wanted to demonstrate. Here we have a continuous training pipeline that does two things. 

On the one hand, we import data from the NBA API and check whether there has been any significant drift in the number of three-pointers within the last week compared with all past data from 2016 onwards. 

On the other hand, we have a training pipeline that takes in the raw data from the NBA, does some basic feature engineering, encodes the data and feeds it into the trainer and tester steps. The trained model predicts, from very little input data (the two teams facing each other and the season ID), how many three-pointers the home team will score.
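
The two time windows of the drift check can be written down explicitly. The helper below is a sketch with hypothetical names (`drift_windows` is not part of the repo); it assumes the reference window runs from the Curry game up to one week before the run date, and the comparison window covers the last seven days.

```python
from datetime import date, timedelta

CURRYS_THREE_POINTER = "2016-02-27"

def drift_windows(today, days=7):
    """Return (reference_start, split_date, comparison_end) as ISO date strings."""
    split = today - timedelta(days=days)
    return CURRYS_THREE_POINTER, split.isoformat(), today.isoformat()

reference_start, split_date, comparison_end = drift_windows(date(2022, 3, 14))
print(reference_start, split_date, comparison_end)
```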

<div class="alert alert-block alert-info">
    <b>Note:</b> The purpose of this notebook is <b>not</b> to train the best, most state-of-the-art model for the task. The purpose is to show you how to quickly set up a scalable, deployable and extensible machine learning pipeline that can go from ideation to production in no time.
</div>

![Training Pipeline](_assets/TrainingPipeline.png "Planned Architecture")

For this pipeline we want to take you a step further by showing you some more integrations. We will be using MLflow Tracking for visualizing and comparing multiple pipeline runs. 

![Mlflow](_assets/zenml+evidently+mlflow.png "Mlflow")

In [None]:
# We start off by installing the required packages
!zenml integration install mlflow -f

# Then we register an experiment tracker with mlflow flavor
!zenml experiment-tracker register local_mlflow_tracker --flavor=mlflow
!zenml stack set local_stack
!zenml stack update local_stack -e local_mlflow_tracker    

After showing you a local pipeline run with MLflow tracking, we will switch our orchestrator to Kubeflow Pipelines. 

As an additional little nugget, we have also implemented a Discord step, which posts to our company-internal Discord channel whenever the drift is analyzed. 

![All](_assets/evidently+mlflow+discord+kubeflow.png "All")

In [None]:
!zenml integration install kubeflow -f

### Build the pipeline definition

Just like above we start off by defining the steps of our pipeline and the flow of inputs and outputs through the pipeline.

In [None]:
from datetime import date, timedelta
from zenml.pipelines import pipeline


@pipeline
def training_pipeline(
        importer,
        feature_engineerer,
        encoder,
        ml_splitter,
        trainer,
        tester,
        drift_splitter,
        drift_detector,
        drift_alert
):
    """Links all the steps together in a pipeline"""
    # Data Preprocessing
    raw_data = importer()
    transformed_data = feature_engineerer(raw_data)
    encoded_data, le_seasons, ohe_teams = encoder(transformed_data)
    train_df_x, train_df_y, test_df_x, test_df_y, eval_df_x, eval_df_y = ml_splitter(encoded_data)
    
    # Model training
    model = trainer(train_df_x, train_df_y, eval_df_x, eval_df_y)
    test_results = tester(model, test_df_x, test_df_y)

    # drift detection branch
    reference_dataset, comparison_dataset = drift_splitter(raw_data)
    drift_report, _ = drift_detector(reference_dataset, comparison_dataset)
    drift_alert(drift_report)
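
The names `le_seasons` and `ohe_teams` suggest a label encoder for seasons and a one-hot encoder for teams. The encoder step is defined in `steps/encoder.py` and not shown, so the following is only a sketch of those two encodings in plain Python, with made-up values.

```python
def fit_label_encoder(values):
    """Map each distinct value to an integer, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]

seasons = ["2016-17", "2015-16", "2016-17"]
teams = ["GSW", "OKC", "GSW"]

season_codes = fit_label_encoder(seasons)
team_categories = sorted(set(teams))
encoded = [
    [season_codes[season]] + one_hot(team, team_categories)
    for season, team in zip(seasons, teams)
]
print(encoded)
```

Keeping the fitted mappings around matters because the inference pipeline has to encode new data in exactly the same way as the training data.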

### Configure the steps

Now that we have mlflow enabled we need to choose what we want to log into mlflow. For now, we have chosen to use the [mlflow autolog](https://www.mlflow.org/docs/latest/tracking.html#scikit-learn) functionality to automatically log the model and training parameters within the training step.

In [None]:
import pandas as pd
import numpy as np
from sklearn.base import RegressorMixin
from sklearn.ensemble import RandomForestRegressor
import mlflow
from zenml.client import Client
from zenml.steps import step
from zenml.steps import BaseParameters


# This is how step configurations are defined
class RandomForestTrainerConfig(BaseParameters):
    """Config class for the sklearn trainer.   
    """

    max_depth: int = 10000
    target_col: str = "FG3M"


experiment_tracker = Client().active_stack.experiment_tracker

@step(enable_cache=False, experiment_tracker=experiment_tracker.name)
def random_forest_trainer(train_df_x: pd.DataFrame, train_df_y: pd.DataFrame,
                          eval_df_x: pd.DataFrame, eval_df_y: pd.DataFrame,
                          config: RandomForestTrainerConfig) -> RegressorMixin:

    mlflow.sklearn.autolog()
    clf = RandomForestRegressor(max_depth=config.max_depth)
    clf.fit(train_df_x, np.squeeze(train_df_y.values.T))
    eval_score = clf.score(eval_df_x, np.squeeze(eval_df_y.values.T))
    print(f"Eval score is: {eval_score}")
    return clf
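
For scikit-learn regressors, `score` returns the coefficient of determination (R squared), so the printed eval score is 1.0 for a perfect fit and 0.0 for a model that does no better than always predicting the mean. A small sketch of that computation in plain Python, with made-up numbers:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [8, 10, 12, 14]
perfect = r_squared(actual, actual)
baseline = r_squared(actual, [11, 11, 11, 11])  # always predict the mean
print(perfect, baseline)
```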

Multiple of our steps have configurations that we want to set ahead of our pipeline run.

In [None]:
# ZenML provides some standard steps for the Evidently integration
from zenml.integrations.evidently.steps import (
    EvidentlyReportParameters,
    EvidentlyReportStep,
)
from zenml.integrations.evidently.metrics import EvidentlyMetricConfig
from steps.splitter import SklearnSplitterConfig, TrainingSplitConfig

# Here we simply choose how we will split our data
train_data_split_config = SklearnSplitterConfig(
    ratios={'train': 0.6, 'test': 0.2, 'validation': 0.2})

# We have chosen to run the pipeline on a weekly schedule. As such we always want to look one week into the past
#  and decide if the last week was anomalous in comparison to the last few years
one_week_ago = (date.today() - timedelta(days=7)).strftime("%Y-%m-%d")

drift_data_split_config = TrainingSplitConfig(
    new_data_split_date=one_week_ago,
    start_reference_time_frame=CURRYS_THREE_POINTER,
    end_reference_time_frame=one_week_ago,
    columns=["FG3M"])

# Just like in the previous pipeline we choose the DatasetDriftMetric
evidently_report_config = EvidentlyReportParameters(
        #column_mapping=None,
        metrics=[EvidentlyMetricConfig.metric("DatasetDriftMetric")],
    )
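
The ratios describe a 60/20/20 partition of the rows. `SklearnSplitterConfig` and the step that consumes it live in `steps/splitter.py`; the sketch below only illustrates the idea with a hypothetical `split_by_ratios` helper that shuffles indices with a fixed seed and slices them.

```python
import random

def split_by_ratios(n_rows, ratios, seed=42):
    """Shuffle row indices and cut them into consecutive chunks per ratio."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    splits, start = {}, 0
    names = list(ratios)
    for position, name in enumerate(names):
        if position == len(names) - 1:
            end = n_rows  # last split takes the remainder to avoid rounding gaps
        else:
            end = start + round(n_rows * ratios[name])
        splits[name] = indices[start:end]
        start = end
    return splits

splits = split_by_ratios(10, {"train": 0.6, "test": 0.2, "validation": 0.2})
print({name: len(rows) for name, rows in splits.items()})
```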

### Ready to run

The pipeline is defined and our steps have been written. Let's instantiate our pipeline and start our training.

In [None]:
!zenml stack list

In [None]:
from steps.analyzer import analyze_drift
from steps.encoder import data_encoder
from steps.evaluator import tester
from steps.feature_engineer import feature_engineer
from steps.importer import game_data_importer
from steps.splitter import sklearn_splitter, SklearnSplitterConfig, reference_data_splitter, TrainingSplitConfig
from steps.discord_bot import discord_alert


# Initialize the pipeline
train_pipeline = training_pipeline(
    # Data Wrangling
    importer=game_data_importer(),
    feature_engineerer=feature_engineer(),
    encoder=data_encoder(),
    ml_splitter=sklearn_splitter(train_data_split_config),
    
    # Model training
    trainer=random_forest_trainer(),
    tester=tester(),
    
    # Drift detection
    drift_splitter=reference_data_splitter(drift_data_split_config),
    drift_detector=EvidentlyReportStep(evidently_report_config),
    
    # Alert Discord
    drift_alert=discord_alert(),
)

train_pipeline.run()

### Let's have a look at mlflow

Training is done. Let's open the MLflow UI and check that our training run, including the model, has been logged.

In [None]:
!mlflow ui --backend-store-uri <SPECIFIC_MLRUNS_PATH_GOES_HERE>

Check the terminal output of the pipeline run to see the exact path appropriate in your specific case. This will start mlflow at `localhost:5000`. If this port is already in use on your machine you may have to specify another port:

In [None]:
!mlflow ui --backend-store-uri <SPECIFIC_MLRUNS_PATH_GOES_HERE> -p 5001

### Let's check out our Drift for this pipeline as well

In [2]:
from zenml.integrations.evidently.visualizers import EvidentlyVisualizer
from zenml.client import Client

from datetime import date, timedelta

last_week = date.today() - timedelta(days=7)
ONE_WEEK_AGO = last_week.strftime("%Y-%m-%d")
CURRY_FROM_DOWNTOWN = '2016-02-27'

client = Client()
p = client.get_pipeline(name_id_or_prefix='training_pipeline')
last_run = p.runs[0]
drift_analysis_step = last_run.get_step(
    step="drift_alert"
)
print(f'Data drift detected: {drift_analysis_step.output.read()}')

drift_detection_step = last_run.get_step(
    step="drift_detector"
)

evidently_outputs = drift_detection_step

EvidentlyVisualizer().visualize(evidently_outputs)

Data drift detected: True


## The ZenML stack

A ZenML stack is the set of components used for all pipeline runs: at minimum an artifact store and an orchestrator, plus optional components such as the experiment tracker, data validator and container registry used in this notebook. When you get started with ZenML, you start off with a default local stack.

In [None]:
!zenml stack list

### The Local Stack

You can imagine the local stack to look like this. Within the diagram we show how a generic pipeline interacts with the local stack.

![LocalStack](_assets/localstack.png "LocalStack")

### The Kubeflow Pipeline stack

Now we want to transition to a kubeflow stack that will look a little bit like this. Note that for kubeflow pipelines we also need a registry where the docker images for each step are registered. 

![KubeflowStack](_assets/localstack-with-kubeflow-orchestrator.png "KubeflowStack")

But we have good news! You barely have to do anything to transition.

In [None]:
# You register a container registry with zenml
!zenml container-registry register local_registry  --flavor=default --uri=localhost:5000
    
# You register an orchestrator with zenml
!zenml orchestrator register kubeflow_orchestrator  --flavor=kubeflow

# Now it all is combined into the local_kubeflow_stack
!zenml stack register local_kubeflow_stack \
    -a default \
    -o kubeflow_orchestrator \
    -c local_registry \
    -e local_mlflow_tracker \
    -dv local_evidently

# And we activate the new stack, now all pipelines will be run within this stack
!zenml stack set local_kubeflow_stack

# Check it out, your new stack is registered
!zenml stack list

### Starting up your new kubeflow pipelines stack

All that is left to do is power up your stack. This is just one more line away. The stack up process might take some time for you. In the background k3d will be creating and starting up a cluster of docker containers to host kubeflow pipelines locally. 

In [None]:
!zenml stack up

If you scroll down all the way on the previous output you should see a link to your running kubeflow pipelines UI. Most probably this will be at [http://localhost:8080/](http://localhost:8080/).

<div class="alert alert-block alert-info">
    <b>Note:</b> Running pipelines that are defined within a Jupyter notebook cell is
    currently not supported on this stack. To get around this, you can run the training pipeline from the script in this repo. 
</div>

In [None]:
!zenml stack set local_kubeflow_stack
# Let's train within Kubeflow Pipelines - this will deploy the training pipeline on a schedule
!python run_pipeline.py train

# Chapter 3 - The Prediction Pipeline

The model is trained, so it is time to build the prediction pipeline.

In [None]:
# Let's return to our local stack first so we can continue within the jupyter notebook
!zenml stack set local_stack

This is the initial inference pipeline coupled with the training pipeline as described above.

![Training And Inference Pipeline](_assets/Training%20and%20Inference%20Pipeline.png "Planned Architecture Full")

In [None]:
from zenml.pipelines import pipeline


@pipeline(enable_cache=False)
def inference_pipeline(
        importer,           # Import the schedule for upcoming games
        preprocessor,       # Preprocess data and use same encoder as the training data
        extract_next_week,  # Extract only the next week of data
        model_picker,       # Pick the best model
        predictor,          # Predict three pointers for home team
        post_processor,     # Decode Encoded data and make human readable
        prediction_poster   # Post Prediction to Discord
):
    """Links all the steps together in a pipeline"""
    season_schedule = importer()
    processed_season_schedule, le_seasons = preprocessor(season_schedule)
    upcoming_week = extract_next_week(processed_season_schedule)
    model, run_id = model_picker()
    predictions = predictor(model, upcoming_week, le_seasons)
    readable_predictions = post_processor(predictions)
    prediction_poster(readable_predictions)
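
`model_picker` returns a model together with a run id, which suggests it chooses among earlier training runs. The real step is in `steps/model_picker.py` and is not shown, so this sketch assumes the simplest rule, picking the run with the highest evaluation score, and uses made-up run records.

```python
def pick_best_run(runs, metric="eval_score"):
    """Return (model, run_id) for the run with the highest value of `metric`."""
    best = max(runs, key=lambda run: run[metric])
    return best["model"], best["run_id"]

# Hypothetical training runs; "model" would be a fitted estimator in practice
runs = [
    {"run_id": "run_a", "eval_score": 0.12, "model": "model_a"},
    {"run_id": "run_b", "eval_score": 0.31, "model": "model_b"},
    {"run_id": "run_c", "eval_score": 0.27, "model": "model_c"},
]
model, run_id = pick_best_run(runs)
print(model, run_id)
```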
    

In [None]:
from steps.encoder import encode_columns_and_clean
from steps.importer import import_season_schedule, SeasonScheduleConfig
from steps.model_picker import model_picker
from steps.predictor import predictor
from steps.splitter import get_coming_week_data, TimeWindowConfig
from steps.post_processor import data_post_processor
from steps.discord_bot import discord_post_prediction

# Initialize the pipeline
inference_pipe = inference_pipeline(
    importer=import_season_schedule(
        SeasonScheduleConfig(current_season='2021-22')),
    preprocessor=encode_columns_and_clean(),
    extract_next_week=get_coming_week_data(TimeWindowConfig(time_window=7)),
    model_picker=model_picker(),
    predictor=predictor(),
    post_processor=data_post_processor(),
    prediction_poster=discord_post_prediction()
)

inference_pipe.run()
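
`TimeWindowConfig(time_window=7)` limits predictions to games in the coming week. The step is implemented in `steps/splitter.py`; the sketch below is a hypothetical plain-Python version that assumes ISO-formatted game dates and a window that starts today and excludes the day `time_window` days ahead.

```python
from datetime import date, timedelta

def coming_games(schedule, today, time_window=7):
    """Keep games dated in [today, today + time_window days)."""
    end = today + timedelta(days=time_window)
    return [
        game for game in schedule
        if today <= date.fromisoformat(game["GAME_DATE"]) < end
    ]

schedule = [
    {"GAME_DATE": "2022-03-13", "HOME": "GSW", "AWAY": "OKC"},
    {"GAME_DATE": "2022-03-16", "HOME": "BOS", "AWAY": "GSW"},
    {"GAME_DATE": "2022-03-21", "HOME": "GSW", "AWAY": "SAS"},
]
upcoming = coming_games(schedule, today=date(2022, 3, 14))
print(upcoming)
```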

In [None]:
# Let's have a look at some of our predictions
from zenml.repository import Repository

r = Repository()
df = r.get_pipeline(pipeline_name='inference_pipeline').runs[0].steps[-2].output.read()
df.head(20)