# Monitoring: Computing and Ingesting Feature Influences

This notebook shows how to compute feature influences and ingest them for monitoring. It assumes you have worked through the other quickstarts and have the [TruEra Python SDK](https://pypi.org/project/truera/) installed.

### Step 1: Connect to TruEra

In [1]:
TRUERA_URL = "https://app.truera.net/"
AUTH_TOKEN = "<replace with your token>"
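
A token pasted into a notebook is easy to leak when the notebook is shared. As a minimal sketch, the token could instead be read from an environment variable. The variable name `TRUERA_AUTH_TOKEN` and the helper `load_auth_token` are hypothetical names chosen for this example; they are not part of the SDK.

```python
import os


def load_auth_token(env_var: str = "TRUERA_AUTH_TOKEN") -> str:
    """Return the token stored in env_var, or raise if it is unset or empty.

    The default variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before connecting.")
    return token
```

With the variable set in the shell that launches Jupyter, `AUTH_TOKEN = load_auth_token()` could replace the literal above.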

In [3]:
from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth, ignore_version_mismatch=True)

INFO:truera.client.remote_truera_workspace:Connecting to 'https://app-stg.truera.net'


### Step 2: Create a project and data collection

In [5]:
# Add a project.
project_name = "PROJECT_NAME"  # Replace this with a project name of your choice.
if project_name not in tru.get_projects():
    tru.add_project(project=project_name, score_type="regression")
else:
    tru.set_project(project_name)

In [6]:
# Add data collection.
data_collection_name = "dc1"
if data_collection_name not in tru.get_data_collections():
    tru.add_data_collection(data_collection_name=data_collection_name)
else:
    tru.set_data_collection(data_collection_name)

### Step 3: Train Model (Optional)

To compute feature influences, the model object must be ingested into TruEra. If you already have a trained model, skip this step. For this notebook, we will train an XGBoost model.


In [7]:
from xgboost import XGBRegressor
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

data_bunch = fetch_california_housing()

XS_TRAIN = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_TRAIN = pd.DataFrame(data_bunch["target"], columns=["label"])

model = XGBRegressor().fit(XS_TRAIN, YS_TRAIN)

### Step 4: Ingest Model

In [8]:
tru.add_python_model("xgb_model", model)

INFO:truera.client.remote_truera_workspace:Uploading xgboost model: XGBRegressor
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:❔ Skipping test model check, as no data splits exist in this data collection.


Uploading conda.yaml (207.0B) -- ### -- file upload complete.
Uploading MLmodel (218.0B) -- ### -- file upload complete.
Uploading tmp3f04wojk.json (599.0KiB) -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.py (459.0B) -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.cpython-310.pyc (1.1KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "xgb_model" is added and associated with data collection "dc1". "xgb_model" is set as the model for the workspace context.


Model uploaded to: https://app-stg.truera.net/home/p/PROJECT_NAME/m/xgb_model/




### Step 5: Create and set a background data split

Feature influences are computed relative to a background data split. In this notebook, we use the data the model was trained on as the background.

In [9]:
background_data = XS_TRAIN.merge(YS_TRAIN, left_index=True, right_index=True).reset_index(names="id")
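
To make the one-liner above easier to follow, here is a small sketch of the same merge-then-reset pattern on a toy frame. The two rows of values are made up for illustration, and the `names` argument of `reset_index` requires pandas 1.5 or later.

```python
import pandas as pd

# Toy stand-ins for XS_TRAIN and YS_TRAIN (illustrative values only).
xs = pd.DataFrame({"MedInc": [2.65, 4.02], "HouseAge": [25.0, 10.0]})
ys = pd.DataFrame({"label": [3.1, 1.2]})

# Join features and labels row by row on the shared positional index.
merged = xs.merge(ys, left_index=True, right_index=True)

# Move the index into a regular column named "id" (pandas >= 1.5).
background = merged.reset_index(names="id")

# Each row should carry a unique id before the frame is ingested.
assert background["id"].is_unique
print(background.columns.to_list())  # ['id', 'MedInc', 'HouseAge', 'label']
```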

In [10]:
from truera.client.ingestion import ColumnSpec

# Ingest the background data prepared in the previous cell
tru.add_data(
    data=background_data,
    data_split_name="background_data_split",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=XS_TRAIN.columns.to_list(),
        label_col_names=YS_TRAIN.columns.to_list()
    )
)

Uploading tmp28ox9jyn.parquet (919.4KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 6e81b52e-bb3f-4815-b915-53fffe6be77c finished with status: SUCCEEDED.


In [11]:
# Set a background data split for influences
tru.set_influences_background_data_split("background_data_split")

### Step 6: Ingest production data

In [12]:
# For illustration, reuse the California housing data as production data.
XS_PROD = pd.DataFrame(data=data_bunch["data"], columns=data_bunch["feature_names"])
YS_PROD = pd.DataFrame(data_bunch["target"], columns=["label"])

In [13]:
from datetime import datetime
timestamp = datetime.utcnow()

prod_data = XS_PROD.merge(YS_PROD, left_index=True, right_index=True)
prod_data["predictions"] = model.predict(XS_PROD)
prod_data["timestamp"] = timestamp
prod_data.reset_index(names="id", inplace=True)
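
Note that `datetime.utcnow()` is deprecated as of Python 3.12. A possible replacement is sketched below. This notebook does not confirm whether ingestion accepts timezone-aware timestamps, so the sketch also shows how to drop the timezone and recover the same kind of naive UTC value used above.

```python
from datetime import datetime, timezone

# Timezone-aware current time in UTC.
aware_now = datetime.now(timezone.utc)

# Dropping tzinfo yields a naive datetime holding the UTC wall-clock time,
# which is the kind of value datetime.utcnow() returns.
naive_now = aware_now.replace(tzinfo=None)
```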

In [14]:
tru.add_production_data(
    prod_data,
    column_spec=ColumnSpec(
        id_col_name="id",
        timestamp_col_name="timestamp",
        pre_data_col_names=XS_PROD.columns.to_list(),
        label_col_names=YS_PROD.columns.to_list(),
        prediction_col_names="predictions"
    )
)

INFO:truera.client.remote_truera_workspace:`model_output_context` will be inferred as it was not provided.
INFO:truera.client.remote_truera_workspace:Inferred ModelOutputContext: ModelOutputContext(model_name='xgb_model', score_type='regression', background_split_name='', influence_type='')


Uploading tmpc4o7r_aq.parquet (1.0MiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 46ec3bb6-f3be-46fa-be0a-456a7604973b finished with status: SUCCEEDED.


In [15]:
# Verify ingestion
tru.get_xs(0, 10, system_data=True)

id,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,__truera_timestamp__
5329,2.6509,25.0,3.74295,1.159436,3136.0,1.700651,34.049999,-118.459999,2023-09-22 19:54:11
1970,4.0197,10.0,6.84582,1.248643,2597.0,2.819761,38.709999,-120.620003,2023-09-22 19:54:11
20074,4.0583,17.0,6.404255,2.0,122.0,2.595745,38.130001,-120.260002,2023-09-22 19:54:11
18829,2.7222,25.0,5.528302,1.254717,1117.0,2.634434,41.860001,-123.260002,2023-09-22 19:54:11
9700,2.5329,31.0,4.055639,1.037594,2220.0,3.338346,36.669998,-121.620003,2023-09-22 19:54:11
4612,1.6406,22.0,1.885057,1.030651,634.0,2.429119,34.07,-118.290001,2023-09-22 19:54:11
6095,4.0417,34.0,6.08,1.1,522.0,3.48,34.119999,-117.879997,2023-09-22 19:54:11
11610,6.6264,25.0,7.226994,1.01227,937.0,2.874233,33.779999,-118.050003,2023-09-22 19:54:11
16774,3.4375,26.0,3.910909,1.022727,3320.0,3.018182,37.689999,-122.459999,2023-09-22 19:54:11
4897,2.1466,31.0,3.450928,1.068966,1952.0,5.177719,34.009998,-118.25,2023-09-22 19:54:11


### Step 7: Compute feature influences on production data

In [16]:
# Retrieve a subset of production data
prod_xs = tru.get_xs(0, 1000)

In [17]:
tru.set_data_split("background_data_split")

INFO:truera.client.truera_workspace:Download temp_dir: /tmp/tmpdttynqwj
INFO:truera.client.truera_workspace:Syncing data collection "dc1" to local.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "dc1". 
INFO:truera.client.truera_workspace:Syncing data split "background_data_split" to local.
INFO:truera.client.local.local_truera_workspace:Data split "background_data_split" is added to local data collection "dc1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Downloading model xgb_model...
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.local.local_truera_workspace:The previous data collection ("dc1") and its associated data splits and/or models have been cleared from the local environment workspace context.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "dc1". 


In [18]:
INFLUENCES_PROD = tru.compute_feature_influences_for_data(prod_xs)


In [19]:
INFLUENCES_PROD

id,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
5329,-0.308769,0.023874,-0.198069,0.028406,-0.042587,0.425318,1.054309,0.476623
1970,-0.109728,0.023297,0.248250,-0.118971,-0.017163,-0.029630,-0.879800,0.203722
20074,-0.075011,0.041535,0.047804,-0.111241,-0.061263,-0.006103,-1.178181,0.300355
18829,-0.336528,-0.009431,-0.045018,-0.041373,-0.010003,-0.007644,-1.778821,0.960855
9700,-0.283829,0.018434,-0.165362,-0.012819,0.011688,-0.175731,-0.635661,0.460185
...,...,...,...,...,...,...,...,...
14746,-0.022791,0.012517,0.028641,0.013745,-0.009560,-0.376938,1.149124,-1.537508
18075,1.484543,0.082001,0.397349,-0.014108,0.026615,0.079297,-0.281407,1.081445
13463,-0.069285,0.005235,0.025997,0.001135,-0.005201,-0.235176,0.362228,-1.118061
15101,-0.235256,-0.073188,-0.091429,-0.006136,-0.014288,0.066581,1.151131,-1.369775
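
Each row above holds per-feature influences for one prediction. One common way to summarize such a table is the mean absolute influence per feature, sketched below on three rows and three columns copied from the output above. This is an illustrative summary only, not necessarily how TruEra ranks feature importance.

```python
import pandas as pd

# A few influence values copied from the output above (ids 5329, 1970, 20074).
influences = pd.DataFrame(
    {
        "MedInc": [-0.308769, -0.109728, -0.075011],
        "Latitude": [1.054309, -0.879800, -1.178181],
        "Longitude": [0.476623, 0.203722, 0.300355],
    },
    index=[5329, 1970, 20074],
)

# Average the magnitude of each feature's influence across rows,
# then order features from most to least influential.
ranking = influences.abs().mean().sort_values(ascending=False)
print(ranking)
```

In the notebook, the same expression could be applied to `INFLUENCES_PROD` directly.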


### Step 8: Ingest feature influences into production data

In [20]:
influence_data = INFLUENCES_PROD.reset_index(names="id")
influence_data["timestamp"] = timestamp

tru.add_production_data(
    data=influence_data,
    column_spec=ColumnSpec(
        id_col_name="id",
        timestamp_col_name="timestamp",
        feature_influence_col_names=INFLUENCES_PROD.columns.to_list()
    )
)
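
Before ingesting influences, it can help to confirm that they line up with the production data already ingested. The helper below is a hypothetical sketch, not an SDK function. It assumes influence columns should be named after model features and that every influence row should share an id with a production row; the notebook does not state whether the platform enforces either condition.

```python
def check_influence_frame(feature_names, influence_columns, prod_ids, influence_ids):
    """Hypothetical pre-ingestion check; raises ValueError on a mismatch."""
    # Every influence column is assumed to correspond to a model feature.
    unexpected = sorted(set(influence_columns) - set(feature_names))
    if unexpected:
        raise ValueError(f"Influence columns without a matching feature: {unexpected}")

    # Every influence row is assumed to refer to an ingested production row.
    unknown = sorted(set(influence_ids) - set(prod_ids))
    if unknown:
        raise ValueError(f"Influence ids not found in production data: {unknown[:5]}")
    return True
```

In this notebook, it could be called as `check_influence_frame(XS_PROD.columns.to_list(), INFLUENCES_PROD.columns.to_list(), prod_data["id"], influence_data["id"])`.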

INFO:truera.client.remote_truera_workspace:`model_output_context` will be inferred as it was not provided.
INFO:truera.client.remote_truera_workspace:Inferred ModelOutputContext: ModelOutputContext(model_name='xgb_model', score_type='regression', background_split_name='background_data_split', influence_type='truera-qii')


Uploading tmpk4x7pul4.parquet (85.7KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Materialize operation id: 4bacf048-0880-4617-a2b8-75788ea00446 finished with status: SUCCEEDED.
