# TruEra Monitoring Demo

## Demonstrate Production Monitoring Ingestion via Python SDK
### Modeling Scenario: Orange Juice Forecasting (Regression)

Part 1: Using TruEra for ML Explainability **when model & data are available for use**, including
- Project creation/setup
- Data Preparation
- Using TruEra's SDK to ingest data (model inputs & outputs)
- Using TruEra's SDK to ingest models
- Using TruEra's SDK to generate predictions & feature influences

Part 2: Using TruEra for ML Explainability **when model file is not available** / **virtual model project setup**
- [TO DO]

In [None]:
!pip list | grep truera

In [None]:
import os
import glob

In [None]:
import pandas as pd
import numpy as np
import pickle
from datetime import date, datetime

In [None]:
import sklearn
from sklearn.ensemble import RandomForestRegressor

In [None]:
from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication
from truera.client.ingestion import ColumnSpec, ModelOutputContext
from truera.client.ingestion.util import merge_dataframes_and_create_column_spec

The following is a custom Python script containing several convenience functions.

These functions are use-case specific: they are not generally required, nor fully generalizable.

However, snippets of these utility functions may prove useful when implementing use cases with your own models and data.

In [None]:
import ingestion_utils

In [None]:
# importlib replaces the deprecated imp module (removed in Python 3.12)
import importlib
importlib.reload(ingestion_utils)

In [None]:
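# connection details (placeholders: fill in your own values)
TRUERA_URL = "<your TruEra deployment URL>"
AUTH_TOKEN = "<your authentication token>"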

Recommendation: store the URL and auth token in environment variables. This step is optional, but it improves security and keeps credentials out of the code.
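
A minimal sketch of reading those values back, assuming the variable names `URL` and `AUTH_TOKEN` used in the cell below. The helper `read_connection_settings` is hypothetical and not part of the TruEra SDK; it only fails early with a clear message when a value is missing.

```python
import os


def read_connection_settings(env=os.environ):
    """Return (url, token) from environment variables, failing early if either is unset."""
    # Treat an empty string the same as a missing variable.
    missing = [name for name in ("URL", "AUTH_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["URL"], env["AUTH_TOKEN"]
```

In a later session this could be called as `TRUERA_URL, AUTH_TOKEN = read_connection_settings()`, so that neither value appears in the notebook itself.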

In [None]:
os.environ['URL'] = TRUERA_URL
os.environ['AUTH_TOKEN'] = AUTH_TOKEN

In [None]:
# Python SDK - Create TruEra workspace
auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth, ignore_version_mismatch=True)

# Pre-production: Create TruEra Project and load baseline data

## Create Project

In [None]:
projectName = "Forecasting Monitoring Quickstart"
print(projectName)

In [None]:
scoreFormat = "regression"

In [None]:
tru.add_project(projectName, score_type=scoreFormat)

## Create Data Collection

In [None]:
dcName = "OJ Sales Data"

In [None]:
tru.add_data_collection(dcName)

## Add data to data collection

In [None]:
train_data_df = pd.read_csv('./split_sim_v1_mon/train_df.csv')

In [None]:
train_data_df.head()

## Create column_spec 
- Tell TruEra about the columns in your dataframe. Which columns correspond to:
    - unique ID
    - timestamp
    - pre-transform features (optional, if using feature map; else, post-transform features loaded as "pre_data")
    - post-transform features (i.e., model readable)
    - labels (optional; almost always provided for development data)
    - predictions (optional if model object is available for use)
    - extra data (for use in segmentation or fairness workflows)
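
The roles listed above can be pictured as a simple mapping from role to column names. The sketch below uses a hypothetical `ColumnRoles` dataclass as a stand-in; it is not TruEra's `ColumnSpec` and makes no claim about that class's fields. It only illustrates that each column should be assigned to exactly one role.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ColumnRoles:
    """Hypothetical stand-in recording which role each dataframe column plays."""
    id_col: str
    timestamp_col: Optional[str] = None
    feature_cols: List[str] = field(default_factory=list)
    label_cols: List[str] = field(default_factory=list)
    prediction_cols: List[str] = field(default_factory=list)
    extra_cols: List[str] = field(default_factory=list)

    def all_columns(self) -> List[str]:
        """List every assigned column, ID and timestamp first."""
        cols = [self.id_col]
        if self.timestamp_col:
            cols.append(self.timestamp_col)
        return cols + self.feature_cols + self.label_cols + self.prediction_cols + self.extra_cols

    def duplicated_columns(self) -> List[str]:
        """Return columns assigned to more than one role."""
        seen, dupes = set(), []
        for col in self.all_columns():
            if col in seen and col not in dupes:
                dupes.append(col)
            seen.add(col)
        return dupes


# Column names taken from this notebook's training data; only a few features shown.
roles = ColumnRoles(
    id_col="index",
    timestamp_col="datetime",
    feature_cols=["store", "feat", "price"],
    label_cols=["logmove"],
)
```

Calling `roles.duplicated_columns()` before ingestion is a cheap way to catch a column that was accidentally listed as both a feature and a label.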

In [None]:
# load the pre-trained model; only unpickle files from a trusted source
with open("./split_sim_v1_mon/rf.pkl", 'rb') as f:
    random_forest = pickle.load(f)

In [None]:
# Prepare data: TruEra SDK convenience function that merges dataframes and creates the column specification.
# Include the ID column ('index') in every dataframe being merged. Here, all inputs come from the same original dataframe, for demo purposes.
data_df, column_spec = merge_dataframes_and_create_column_spec(id_col_name='index',
                                                               timestamp_col_name='datetime', #optional for pre-prod data
                                                               pre_data=train_data_df[['index','datetime','store','feat','price','AGE60','EDUC','ETHNIC','INCOME','HHLARGE','WORKWOM','HVAL150','SSTRDIST','SSTRVOL','CPDIST5','CPWVOL5','brand_dominicks','brand_minute.maid','brand_tropicana','weekday_Friday','weekday_Monday','weekday_Saturday','weekday_Sunday','weekday_Thursday','weekday_Tuesday','weekday_Wednesday']],
                                                               labels=train_data_df[['index','logmove']])

In [None]:
?ColumnSpec

In [None]:
column_spec

In [None]:
#save column spec as pickle file, for future use
with open('./split_sim_v1_mon/column_spec.pkl', 'wb') as f:
    pickle.dump(column_spec, f)

## Add model object to project

The arguments passed to the add_python_model function are the (user-specified) model name and the model object itself.

This step is where the "data & model" and "data only" (aka "virtual model") approaches to generating TruEra ML observability metrics begin to differ.

In the virtual model scenario, the add_model function is used instead. There, we specify **only** the model name and never interact with the model object directly. The virtual model scenario implies that all model I/Os required to generate observability metrics are already persisted in a source location (e.g., in memory, flat file, object storage). Those model I/Os are, at a minimum, model input data, and typically also include model scores, labels, and feature influences.
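
The difference between the two paths can be sketched as a small dispatch helper. `register_model` is a hypothetical wrapper, and it assumes, as described above, that `add_model` takes only the model name; check the SDK documentation for your version before relying on that signature.

```python
def register_model(workspace, model_name, model_object=None):
    """Register a model by name only (virtual) or together with its object."""
    if model_object is None:
        # Virtual model: only the name is registered; model I/Os are ingested separately.
        workspace.add_model(model_name)
        return "virtual"
    # Data & model: the model object itself is uploaded.
    workspace.add_python_model(model_name, model_object)
    return "python"
```

With the workspace from this notebook, `register_model(tru, modelName, random_forest)` would take the second path, and `register_model(tru, modelName)` the first.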

In [None]:
modelName = 'Random Forest Regressor'

In [None]:
tru.add_python_model(modelName, random_forest)

In [None]:
tru.add_data(
        data_split_name='baseline data',
        data=data_df,
        column_spec=column_spec)

## Scoring model, and generating feature influences

When a model object is available for use, TruEra provides a simplified means of generating predictions, feature influences, and error influences.

Whenever possible, use truera-qii for this purpose. Otherwise, omit the following setting, and TruEra will use the OSS SHAP library that corresponds to your model and prediction type. Be aware that this may substantially increase the computation time needed to generate Shapley value estimates.

In [None]:
tru.set_influence_type('truera-qii')

By default, the following function syncs the ingested artifacts to your local machine and computes predictions, feature influences, and error influences for all model-split pairs.

Parameters exist to restrict the run to specific calculations, models, or data splits.

In [None]:
?tru.compute_all

In [None]:
tru.compute_all()

# Production: Prepare and load production data into TruEra Monitoring
1. Simulate/generate production data
2. Generate predictions using model
3. Load data into production monitoring services

In [None]:
prod_data_df = pd.read_csv('./split_sim_v1_mon/prod_df.csv')

In [None]:
prod_data_df.head()

In many production scenarios, predictions will already have been generated before production data is ingested into TruEra.

In other words, scoring happens separately and independently of TruEra, in some other production system.

Here, we simulate that independent process by generating predictions on the simulated production dataset and including them in our production column specification.

Note that we reuse the previously created column specification to select the correct columns for scoring the model.
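
The column selection used for scoring can be sketched on plain lists. `select_feature_columns` is a hypothetical helper, not an SDK function: it keeps the feature columns in the dataframe's own order, which is what the `drop`/`difference` expression in the next cell does, and additionally raises if a feature is missing.

```python
def select_feature_columns(frame_columns, feature_names):
    """Return the feature columns in the frame's own order; raise if any are absent."""
    wanted = set(feature_names)
    missing = sorted(wanted - set(frame_columns))
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    # Iterate over the frame's columns so the original column order is preserved.
    return [col for col in frame_columns if col in wanted]
```

In this notebook it could be used as `prod_data_df[select_feature_columns(list(prod_data_df.columns), column_spec.pre_data_col_names)]`. Column order matters because the model expects features in the order it was trained on.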

In [None]:
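# score only the feature columns named in the column spec, keeping the dataframe's column order
feature_df = prod_data_df.drop(columns=prod_data_df.columns.difference(column_spec.pre_data_col_names))
preds = random_forest.predict(feature_df)
# pair each prediction with its row ID so it can be merged on the ID column
preds_df = pd.DataFrame({'index': prod_data_df['index'].to_numpy(), 'preds': preds})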
preds_df.head()

Here, we use the convenience function to merge our predictions with the prod data

In [None]:
prod_df, prod_column_spec = merge_dataframes_and_create_column_spec(id_col_name=column_spec.id_col_name,
                                                               timestamp_col_name=column_spec.timestamp_col_name,
                                                               pre_data=prod_data_df[column_spec.pre_data_col_names+[column_spec.id_col_name]+[column_spec.timestamp_col_name]],
                                                               labels=prod_data_df[column_spec.label_col_names+[column_spec.id_col_name]],
                                                               predictions=preds_df)

In [None]:
prod_column_spec

In [None]:
#save column spec as pickle file, for future use
with open('./split_sim_v1_mon/prod_column_spec.pkl', 'wb') as f:
    pickle.dump(prod_column_spec, f)

In [None]:
projectName, dcName, random_forest, modelName

### Add production data
- Use merged prod_df and prod_column_spec
- Specify model output context -- tell TruEra the format of the predictions being ingested

In [None]:
?ModelOutputContext

In [None]:
tru.add_production_data(data=prod_df,
                        column_spec=prod_column_spec,
                        model_output_context=ModelOutputContext(
                            model_name=modelName,
                            score_type='regression'))

# Generate Feature Influences for a time range split

Check the current workspace context; set it to the desired project/model/split if not already done.

In [None]:
tru

A time range split has been cut from the production data, from a time period of interest
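
Conceptually, a time range split is the subset of production rows whose timestamps fall inside a chosen window. The sketch below illustrates that idea with plain Python. The helper `rows_in_time_range`, the half-open window convention, and the sample dates are all assumptions made for illustration; how TruEra treats the window boundaries is not covered here.

```python
from datetime import datetime


def rows_in_time_range(rows, start, end, timestamp_key="datetime"):
    """Keep rows whose timestamp falls in the half-open window [start, end)."""
    return [row for row in rows if start <= row[timestamp_key] < end]


# Made-up records, keyed like this notebook's 'index' and 'datetime' columns.
rows = [
    {"index": 1, "datetime": datetime(2022, 1, 1)},
    {"index": 2, "datetime": datetime(2022, 1, 8)},
    {"index": 3, "datetime": datetime(2022, 1, 15)},
]
window = rows_in_time_range(rows, datetime(2022, 1, 5), datetime(2022, 1, 15))
```

Here `window` holds only the record with `index` 2, since the first record precedes the window and the third falls on the excluded end boundary.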

In [None]:
tru.get_data_splits()

Let's compute feature influences for that time range split so we can use TruEra Diagnostics to debug performance issues.

In [None]:
?tru.compute_all

Note: the name of your time range split may differ from the one below. The example is left in for demo purposes.

In [None]:
tru.compute_all(data_splits=['tr_split_prod_1'])

# Various explainer / programmatic examples

[TO DO. References here:](https://docs.truera.com/1.41/public/sdk/explainers/)
- Performance
- Explainability
- Drift analysis
- Fairness
- Testing

### Demo: Create segments programmatically

In [None]:
weekday_names = train_data_df.weekday.unique().tolist()

In [None]:
weekday_names

In [None]:
defs = ["weekday == '{}'".format(s) for s in weekday_names]

In [None]:
defs

In [None]:
segment_defs = dict(zip(weekday_names, defs))

In [None]:
segment_defs
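
The cells above build one filter expression per weekday. A compact equivalent, shown with two sample values so the resulting structure is visible; `build_segment_definitions` is a hypothetical helper, not an SDK function.

```python
def build_segment_definitions(column, values):
    """Map each value to a filter expression of the form "<column> == '<value>'"."""
    return {str(value): f"{column} == '{value}'" for value in values}


segment_defs_example = build_segment_definitions("weekday", ["Monday", "Tuesday"])
# {'Monday': "weekday == 'Monday'", 'Tuesday': "weekday == 'Tuesday'"}
```

Each key becomes a segment name and each value the expression that selects its rows, which is the shape passed to `tru.add_segment_group` below.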

In [None]:
tru

In [None]:
tru.get_data_splits()

In [None]:
tru.set_data_split("baseline data")

In [None]:
tru.add_segment_group('Day of Week', segment_defs)
