# Guided Exercise: Performance

### Goals 🎯
In this tutorial, you will train a model, ingest it into TruEra, and then improve its performance in a structured and methodical way!

In this tutorial, you will:
1. Set up and view the results of performance and feature importance tests.
2. Find actionable issues with the model.
3. Mitigate these issues and re-upload your model to TruEra.
4. Retest the new model and confirm the effectiveness of the mitigation strategy.

### First, set the credentials for your TruEra deployment.

If you don't have credentials yet, get them instantly by signing up for free at: https://www.truera.com

In [1]:
# connection details
TRUERA_URL = "https://app.truera.net/"
AUTH_TOKEN = "..."

### Install required packages

In [2]:
! pip install --upgrade xgboost==1.6.2
#! pip install truera



In [3]:
# delete this cell
# uncomment out install of truera
# uncomment out workspace/auth
# clear cell output

import os
import sys

os.chdir("/Users/davidkurokawa/Work/code/truera/truera/python")
if "/Users/davidkurokawa/Work/code/truera/truera/python" not in sys.path:
    sys.path.append("/Users/davidkurokawa/Work/code/truera/truera/python")

from truera.client.truera_authentication import BasicAuthentication, TokenAuthentication
from truera.client.truera_workspace import TrueraWorkspace

TRUERA_URL = "http://localhost:8000"
AUTH_TOKEN = ""
tru = TrueraWorkspace(TRUERA_URL, BasicAuthentication("ailens", "ailens123"))
if "Starter Example Companion - Performance" in tru.get_projects():
    tru.delete_project("Starter Example Companion - Performance")

INFO:truera.client.remote_truera_workspace:Connecting to 'http://localhost:8000'
INFO:truera.client.remote_truera_workspace:remaining items: []
INFO:truera.client.remote_truera_workspace:Delete resource succeeded. Project_id: Starter Example Companion - Performance intra_artifact_path: 


### From here, run the rest of the notebook and follow the analysis.

### Import packages, set up your TruEra Workspace

In [4]:
import logging

import pandas as pd
import xgboost as xgb
from truera.client.ingestion import ColumnSpec, ModelOutputContext
from truera.client.truera_authentication import TokenAuthentication
from truera.client.truera_workspace import TrueraWorkspace

#auth = TokenAuthentication(AUTH_TOKEN)
#tru = TrueraWorkspace(TRUERA_URL, auth)

### Load the data and train an xgboost model
A bit about the model and data... 

In this example, we will use real data on Airbnb listings 🏠 in San Francisco and Seattle to predict the listing price. The data was scraped by Inside Airbnb and is hosted by OpenDataSoft. Pricing a rental property is a challenging task for Airbnb owners: they need to understand the market, the features of their property, and how those features contribute to the listing price.

You can find more information about the data here:
https://data.opendatasoft.com/explore/dataset/airbnb-listings%40public/

In [5]:
# load data
san_francisco = pd.read_csv(
    'https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-performance/San_Francisco.csv')
seattle = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-performance/Seattle.csv')

# make all float
san_francisco = san_francisco.astype(float).reset_index(names="id")
seattle = seattle.astype(float).reset_index(names="id")
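
To make the two chained calls above concrete, here is a small sketch on a hypothetical two-row frame (the column names and values are made up, not taken from the Airbnb data). `astype(float)` casts every column to float, and `reset_index(names="id")` moves the row index into a new `id` column. Note that the `names` argument requires pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical two-listing frame; the real CSVs have many more columns.
listings = pd.DataFrame({"bedrooms": [1, 2], "price": [100, 250]})

# Cast every column to float, then turn the row index into an "id" column.
listings = listings.astype(float).reset_index(names="id")

print(listings.dtypes)
print(listings)
```

The resulting `id` column is what later lets feature data, labels, and predictions be matched row by row.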

### Now we can create our first project

In [6]:
# create the first project
project_name = "Starter Example Companion - Performance"

tru.add_project(project_name, score_type='regression')

# set up influence algorithm
tru.set_influence_type("shap")

# reduce settings for speed
tru.set_num_internal_qii_samples(100)
tru.set_num_default_influences(100)

# add data collection
tru.add_data_collection("Data Collection v1")

### Next, add the data

To do so, we'll introduce the `add_data` method along with `ColumnSpec`. `add_data` is a general-purpose method for adding data to TruEra. It can ingest a variety of data types, including feature data, predictions, influences, labels, and extra data.

`ColumnSpec` is a helper class we imported from `truera.client.ingestion` that specifies which columns in the dataframe serve which purpose. If you prefer, you can skip `ColumnSpec` and just pass a dict instead, e.g. `{'id_col_name': 'id', ...}`.
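
As a minimal sketch of the dict form, the snippet below builds the same three entries that the `ColumnSpec` calls in this tutorial use. The key names mirror the `ColumnSpec` keyword arguments; the column names are hypothetical placeholders, not the real Airbnb columns.

```python
# Hypothetical column names standing in for the real dataframe columns.
columns = ["id", "bedrooms", "bathrooms", "price"]

# Every column that is neither the id nor the label is treated as a feature.
feature_columns = [c for c in columns if c not in ("id", "price")]

# Dict counterpart of
# ColumnSpec(id_col_name=..., pre_data_col_names=..., label_col_names=...).
column_spec = {
    "id_col_name": "id",
    "pre_data_col_names": feature_columns,
    "label_col_names": "price",
}

print(column_spec)
```

A dict like this would then be passed as the `column_spec` argument of `add_data` in place of the `ColumnSpec` object.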

In [7]:
# add data to the collection we just created
tru.add_data(data=san_francisco,
             data_split_name="San Francisco",
             column_spec=ColumnSpec(id_col_name="id",
                                    pre_data_col_names=list(san_francisco.columns.drop(["id", "price"])),
                                    label_col_names="price"))
tru.add_data(data=seattle,
             data_split_name="Seattle",
             column_spec=ColumnSpec(id_col_name="id",
                                    pre_data_col_names=list(seattle.columns.drop(["id", "price"])),
                                    label_col_names="price"))

Uploading tmp4ssq3o_5.parquet (336.5KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmpwqr3bqol.parquet (228.9KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


After adding our data, it's a good time to set the background data split used for calculating feature influences. It's generally good practice to use the training set as the background set, so we'll do so here.

In [8]:
tru.set_influences_background_data_split("San Francisco")

### Train and register the model

In [9]:
# train first model
xgb_reg = xgb.XGBRegressor(eta=0.2, max_depth=4)
xgb_reg.fit(san_francisco.drop(['id', 'price'], axis=1), san_francisco.price)

# register the model
tru.add_python_model("model_1",
                     xgb_reg,
                     train_split_name="San Francisco",
                     train_parameters={
                         "model_type": "xgb.XGBRegressor",
                         "eta": 0.2,
                         "max_depth": 4
                     })

INFO:truera.client.remote_truera_workspace:Uploading xgboost model: XGBRegressor
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!


Uploading tmpholjptj4.json (172.9KiB) -- ### -- file upload complete.
Uploading MLmodel (218.0B) -- ### -- file upload complete.
Uploading conda.yaml (208.0B) -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.py (459.0B) -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.cpython-310.pyc (1.1KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "model_1" is added and associated with remote data collection "Data Collection v1". "model_1" is set as the model for the workspace context.


Model uploaded to: http://localhost:8000/home/p/Starter%20Example%20Companion%20-%20Performance/m/model_1/


INFO:truera.client.remote_truera_workspace:Triggering computations for model predictions on split San Francisco.
INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v1". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_1".


### Calculate and ingest predictions

To compute predictions, we'll use the built-in `tru.get_ys_pred()`.

In [10]:
tru.set_data_split("San Francisco")
sf_preds = tru.get_ys_pred().reset_index(names="id")
tru.set_data_split("Seattle")
se_preds = tru.get_ys_pred().reset_index(names="id")

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmp374awbcd
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". 
INFO:truera.client.truera_workspace:Syncing data split "San Francisco" to local.
INFO:truera.client.local.local_truera_workspace:Data split "San Francisco" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing model model_1 to local.
INFO:truera.client.local.local_truera_workspace:Model "model_1" is added and associated with local data collection "Data Collection v1". "model_1" is set as the model for the workspace context.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:The previous d

Once we've computed our predictions, we can add them using `add_data` again. We'll also pass a `column_spec` here, but this time we need to supply `prediction_col_names`.

In [11]:
tru.add_data(data=sf_preds,
             data_split_name="San Francisco",
             column_spec=ColumnSpec(id_col_name="id", prediction_col_names="__truera_prediction__"))

tru.add_data(data=se_preds,
             data_split_name="Seattle",
             column_spec=ColumnSpec(id_col_name="id", prediction_col_names="__truera_prediction__"))

INFO:truera.client.remote_truera_workspace:`model_output_context` will be inferred as it was not provided.
INFO:truera.client.remote_truera_workspace:Inferred ModelOutputContext: ModelOutputContext(model_name='model_1', score_type='regression', background_split_name='', influence_type='')


Uploading tmpstdc98d4.parquet (70.7KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:`model_output_context` will be inferred as it was not provided.
INFO:truera.client.remote_truera_workspace:Inferred ModelOutputContext: ModelOutputContext(model_name='model_1', score_type='regression', background_split_name='', influence_type='')


Uploading tmp_jh22plt.parquet (44.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


### Computing and Ingesting Feature Influences

Feature influences are a core component of the TruEra platform that enables model explainability.

To compute them, we'll use `compute_feature_influences`. We'll use open-source SHAP as our influence calculation method here.

Note: TruEra-QII is available for paid users to provide faster and more accurate computation.
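
To give some intuition for what a SHAP-style influence measures and why a background set matters, here is a sketch that computes exact Shapley values for a toy model by brute force. It is not the `shap` library and not TruEra's implementation: the model, the feature values, and the single background point are all hypothetical, and real tools approximate this computation over many background rows.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values of model(x) relative to one background point."""
    n = len(x)

    def value(subset):
        # Features in `subset` come from x; the rest come from the background.
        mixed = [x[j] if j in subset else background[j] for j in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy pricing model: 50 per bedroom, 30 per bathroom, 10 base.
def toy_model(row):
    bedrooms, bathrooms = row
    return 50.0 * bedrooms + 30.0 * bathrooms + 10.0


x = [3.0, 2.0]           # listing being explained
background = [1.0, 1.0]  # single hypothetical background listing
phi = shapley_values(toy_model, x, background)
print(phi)
```

For this linear toy model the influences come out as `[100.0, 30.0]`, and they sum to `toy_model(x) - toy_model(background)`. Changing the background point changes the influences, which is why the choice of background split matters.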

In [14]:
tru.set_influence_type("shap")

for data_split_name in ["Seattle", "San Francisco"]:
    tru.set_data_split(data_split_name)
    tru.compute_feature_influences()

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmp374awbcd
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.


Uploading tmpq1q7ev3b.parquet (121.9KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmp374awbcd
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.


Uploading tmpvtlb34jt.parquet (131.8KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


### Computing and Ingesting Error Influences

Error influences complement feature influences: instead of explaining the model's predictions, they attribute the model's error to individual features.

To compute them, we'll again use `compute_feature_influences`, except with a different score type: `"mean_absolute_error_for_regression"`.

In [17]:
for data_split_name in ["Seattle", "San Francisco"]:
    tru.set_data_split(data_split_name)
    tru.compute_feature_influences(score_type='mean_absolute_error_for_regression')

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmp374awbcd
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.


Uploading tmprkz3vio3.parquet (114.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmp374awbcd
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.


Uploading tmpvsghn9dd.parquet (113.7KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


### Issue: Overfitting

We observe a large discrepancy between our train and test error!

In [18]:
# set default performance metric
tru.set_default_performance_metrics(["MAE"])

train_split_name = "San Francisco"
test_split_name = "Seattle"

# generate the explainer and compute performance
explainer = tru.get_explainer(test_split_name, comparison_data_splits=[train_split_name])
explainer.compute_performance(metric_type="MAE")

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmp374awbcd
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.


| | Split | MAE |
|---|---|---|
| 0 | Seattle | 110.269441 |
| 1 | San Francisco | 47.540053 |
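
For reference, MAE (the metric reported above) and RMSE (another common regression metric) can be sketched in a few lines of plain Python. The prices below are made up for illustration; this is the textbook definition of each metric, not TruEra's internal code.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE does."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical listing prices and predictions.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 360.0]

print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE squares each error before averaging, a single badly mispriced listing raises it more than it raises MAE, so RMSE is never smaller than MAE on the same data.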


### Create Tests

To help us keep track of this issue, let's add a test for it!

Additionally, having too many unimportant features is a common cause of overfitting. We should test for that as well.

Note that we could also set up tests for fairness and stability.

In [19]:
# add performance tests
tru.tester.add_performance_test(test_name='Relative MAE Test',
                                all_data_collections=True,
                                data_split_name_regex='Seattle',
                                metric="MAE",
                                reference_split_name=train_split_name,
                                fail_if_greater_than=0.80,
                                fail_threshold_type="RELATIVE")

tru.tester.add_performance_test(test_name='RMSE Test',
                                all_data_collections=True,
                                data_split_name_regex='.*',
                                metric="RMSE",
                                fail_if_greater_than=115,
                                fail_threshold_type="ABSOLUTE")

### View Test Results

In [20]:
# get model results
tru.set_model("model_1")
tru.tester.get_model_test_results(test_types=["performance"])

| | Name | Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|
| ❌ | Relative MAE Test | Seattle | ALL POINTS | MAE | 110.2694 | Explore in UI |
| ❌ | RMSE Test | Seattle | ALL POINTS | RMSE | 154.0107 | Explore in UI |
| ✅ | RMSE Test | San Francisco | ALL POINTS | RMSE | 75.6847 | Explore in UI |


Both tests fail on the Seattle split. The RMSE test passes only on San Francisco, the training split.
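
The arithmetic behind these outcomes can be sketched by hand using the scores reported in this notebook. The formula for the relative threshold is an assumption (the relative increase of the score over the reference split's score), so treat this as an illustration rather than a description of how TruEra evaluates the test.

```python
# Scores copied from the results reported in this tutorial.
mae_seattle = 110.269441
mae_san_francisco = 47.540053  # reference (training) split
rmse_by_split = {"Seattle": 154.0107, "San Francisco": 75.6847}

# ASSUMPTION: a RELATIVE threshold compares (score - reference) / reference.
relative_increase = (mae_seattle - mae_san_francisco) / mae_san_francisco
relative_test_fails = relative_increase > 0.80

# ABSOLUTE threshold: compare each split's score directly with the limit.
rmse_failures = {split: score > 115 for split, score in rmse_by_split.items()}

print(f"relative MAE increase: {relative_increase:.3f}")
print(relative_test_fails, rmse_failures)
```

Under that assumption, the Seattle MAE is roughly 1.32 times higher than the reference in relative terms, well past the 0.80 limit, and only the Seattle RMSE exceeds 115. Both results agree with the table above.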

### From here, navigate to the TruEra Web App for analysis or continue on to Part 2!     [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/16DexGCY1i4A5fLJZXC7xHPpqCSrQhVab)