# Onboard Test Data & Evaluate A Classification Model Notebook

In this notebook we will use the ObservationSpec to onboard both Training and Test data and then use the test data to evaluate model performance.

First we will split the dataset into a train segment and a test segment. Then we will induce a model using the train data, and finally we will test the performance of the model using the test data and the Classification Evaluation View.

# Check Environment Variables
Before installing Hybrid Intelligence in the notebook, set these environment variables externally, as described in the User Guide: https://docs.umnai.com/set-up-your-environment.
This section checks that the environment variables have been set and raises an error if any are missing.

In [1]:
import os

umnai_env_vars = {
    'UMNAI_CLIENT_ID',
    'UMNAI_CLIENT_SECRET',
    'PIP_EXTRA_INDEX_URL',
}

if not umnai_env_vars.issubset(os.environ.keys()):
    raise ValueError(
        'UMNAI environment variables not set correctly. They need to be set before using the Umnai library.'
    )
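
When the check above fails, it can help to know which variables are absent. The following is a minimal sketch of that idea using only the standard library; the helper name `missing_env_vars` is hypothetical and is not part of the UMNAI library.

```python
import os

# Same variable names as in the check above.
REQUIRED_VARS = {"UMNAI_CLIENT_ID", "UMNAI_CLIENT_SECRET", "PIP_EXTRA_INDEX_URL"}


def missing_env_vars(required, environ):
    """Return the required variable names that are absent from environ, sorted."""
    return sorted(set(required) - set(environ))


# Hypothetical helper: report the missing names instead of a generic message.
missing = missing_env_vars(REQUIRED_VARS, os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing the environment in as an argument keeps the helper easy to exercise with a plain dictionary.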


# Install Hybrid Intelligence
Next we install the UMNAI Platform.

In [2]:
%pip install umnai-platform --quiet



Note: you may need to restart the kernel to use updated packages.


# Set Workspace Paths
Now we will set the experiment name and workspace paths.

In [3]:
EXP_NAME = 'testandevaluate_adult_income'
WS_PATH = 'resources/workspaces/'+EXP_NAME
EXP_PATH = EXP_NAME

# Import and Prepare Dataset
Import the dataset into a Pandas DataFrame, then clean the data in preparation for onboarding into Hybrid Intelligence.

In [4]:
import pandas as pd
import numpy as np

# Import Adult Income Dataset to pandas dataframe: 
# This dataset can be downloaded from https://archive.ics.uci.edu/dataset/2/adult 
column_names = ["Age", "WorkClass", "fnlwgt", "Education", "EducationNum", "MaritalStatus", "Occupation", "Relationship", "Race", "Gender", "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"]
dataset_df = pd.read_csv('https://raw.githubusercontent.com/umnaibase/umnai-examples/main/data/adult.data', names = column_names)

# Data Preparation:
dataset_df = dataset_df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)    # Remove whitespaces
dataset_df["Income"] = np.where((dataset_df["Income"] == '<=50K'), 0, 1)                # Replace Target values with [0,1]
dataset_df.tail(5)

| | Age | WorkClass | fnlwgt | Education | EducationNum | MaritalStatus | Occupation | Relationship | Race | Gender | CapitalGain | CapitalLoss | HoursPerWeek | NativeCountry | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | 0 |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | 1 |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | 0 |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | 0 |
| 32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | 1 |


# Split and Specify The Dataset
Split the imported dataset into train and test data segments and specify the train and test data. 

To specify the train/test data add a new column called `observation_type` and specify the data as `train` or `test`.

In [5]:
from sklearn.model_selection import train_test_split

X = dataset_df.drop(['Income'], axis=1)
y = dataset_df['Income']

test_fraction = 0.2  # Set a fraction between 0 and 1 to decide the size of the test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_fraction, random_state=42)

train = pd.concat([X_train, y_train], axis=1)
train["observation_type"] = 'train'

test = pd.concat([X_test, y_test], axis=1)
test["observation_type"] = 'test'

dataset_df = pd.concat([train, test], axis=0)

dataset_df.groupby('observation_type').nunique()  # Counts the unique values per column in each data segment.

| observation_type | Age | WorkClass | fnlwgt | Education | EducationNum | MaritalStatus | Occupation | Relationship | Race | Gender | CapitalGain | CapitalLoss | HoursPerWeek | NativeCountry | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test | 72 | 9 | 5837 | 16 | 16 | 7 | 15 | 6 | 5 | 2 | 94 | 63 | 81 | 41 | 2 |
| train | 71 | 9 | 18437 | 16 | 16 | 7 | 15 | 6 | 5 | 2 | 115 | 88 | 94 | 42 | 2 |
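
Because the split is stratified on the target, both segments should hold roughly the same share of positive labels. The sketch below illustrates this on a hypothetical toy frame (the `feature` column and the 24% positive rate are made up for illustration), using the same `train_test_split` call pattern as the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 24% of them in the positive class.
toy_df = pd.DataFrame({
    "feature": range(1000),
    "Income": [1] * 240 + [0] * 760,
})

# Stratify on the target so both segments keep the same class balance.
train_part, test_part = train_test_split(
    toy_df, stratify=toy_df["Income"], test_size=0.2, random_state=42
)

# Label each segment the same way the notebook does.
labelled = pd.concat([
    train_part.assign(observation_type="train"),
    test_part.assign(observation_type="test"),
])

# Row count and share of positive labels per segment.
summary = labelled.groupby("observation_type")["Income"].agg(
    rows="size", positive_rate="mean"
)
print(summary)
```

Running the same `groupby` summary on the notebook's `dataset_df` is a quick way to confirm the segment sizes and class balance before onboarding.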


# Create or Open a Hybrid Intelligence Workspace
Workspaces are used by the Hybrid Intelligence framework to organize your data and models together in one place.

In [6]:
from umnai.workspaces.context import Workspace

# Open a workspace
ws = Workspace.open(
    path=WS_PATH,
    experiment=EXP_PATH
)

ws # Prints workspace details to confirm created/opened

WorkspaceContext(path=/opt/atlassian/pipelines/agent/build/demo-notebooks/resources/workspaces/testandevaluate_adult_income, experiment=testandevaluate_adult_income, parallel_backend=loky, parallel_jobs=1)

# Onboard Hybrid Intelligence Dataset

Onboard the Pandas DataFrame into a Hybrid Intelligence dataset.

In [7]:
from umnai.data.datasets import Dataset
from umnai.data.enums import PredictionType

features = list(
    dataset_df.drop(['Income', 'observation_type'], axis=1).columns
)  # All columns except 'Income' and 'observation_type' are features.

categorical_features = [
    column for column 
    in dataset_df.select_dtypes(object).columns 
    if column != 'observation_type'
]  # All 'object' columns except for 'observation_type' are categorical.


dataset = Dataset.from_pandas(
    dataset_df,
    prediction_type=PredictionType.CLASSIFICATION,
    features=features,  
    targets=['Income'],
    categorical_features=categorical_features,
)

dataset  # Prints dataset details to confirm created/opened

MLFLOW Run ID: 8b075d626c8444d2a54d8285e2ee9565:   0%|          | 0/36 [00:00<?, ?it/s]

[Analysis]: Processing Tasks:   0%|          | 0/15 [00:00<?, ?it/s]

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


[Transformation]: Processing Tasks:   0%|          | 0/1 [00:00<?, ?it/s]



INFO:tensorflow:Assets written to: /opt/atlassian/pipelines/agent/build/demo-notebooks/resources/workspaces/testandevaluate_adult_income/preprocessing/dataset_name=Dataset_78208322/assets


Dataset(id=ff6e6eaa-f3b1-4a10-b5b0-0bb8db9f2dc6; name=Dataset_78208322; is_named=False; workspace_id=None)

# Induce a Hybrid Intelligence Model

Pre-induced models are available in the notebook workspace on GitHub and may be downloaded and saved locally. Using a pre-induced model speeds up the execution of the notebook.

If `LOAD_PREINDUCED_MODEL` is set to `1` (default), the notebook will look for and load the pre-induced model with `ESM_ID`. Otherwise, if set to `0` or the pre-induced model is not found, a new model will be induced and saved to the workspace.

In [8]:
# Set this variable to '1' to load a pre-induced model, otherwise set to '0' to re-induce a new model from the dataset
LOAD_PREINDUCED_MODEL = 1

# Model ID
ESM_ID='Dataset_2f336666_2f873c0be3614725ab29b5213140d671'

#### Load or Induce the Model

In [9]:
from umnai.esm.model import ESM
from umnai.induction.inducer import ModelInducer

# Check if a saved model with the ESM_ID exists. If it exists load it, otherwise induce a new model, save it and print the model and run IDs
if (LOAD_PREINDUCED_MODEL == 1):
    try:    
        esm = ESM.from_workspace(id = ESM_ID)
        print('Pre-induced ESM loaded from workspace: ' + esm.id)
    except OSError:
        print("No model found in workspace.")
        LOAD_PREINDUCED_MODEL = 0

if (LOAD_PREINDUCED_MODEL == 0):
    print("Inducing a new model - this may take some time.")
    # Induce a simple model quickly using fast execution parameters
    model_inducer = ModelInducer(
        max_interactions=3,
        max_interaction_degree=2,
        max_polynomial_degree=2,
        trials=2,
        estimators=2,
        batch_size=512,
        iterations=2,
    )

    # # Induce a more realistic model using default Induction parameters:
    # model_inducer = ModelInducer()

    # Create an ESM using Induction
    esm = model_inducer.induce(dataset)

    # Save the ESM to your workspace
    esm.save_to_workspace()

    # Note ESM ID and MLflow Run ID
    print("ESM ID: ", esm.id)
    print("MLflow Run ID: ", esm.producer_run_id)




Pre-induced ESM loaded from workspace: Dataset_2f336666_2f873c0be3614725ab29b5213140d671


## ClassificationEvaluationView
The Classification Evaluation View calculates the performance metrics of a Classification model.
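
To make the metrics concrete, the sketch below recomputes several of them by hand from the confusion-matrix counts reported in this notebook's evaluation output (`[[4673, 272], [686, 882]]`). It assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` and uses the standard textbook formulas; it is not taken from the view's implementation. Under these assumptions, the reported `precision` and `recall` values are consistent with macro averaging over the two classes.

```python
import math

# Counts copied from the evaluation output in this notebook.
# Assumption: the matrix layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 4673, 272, 686, 882

rates = {
    "tpr": tp / (tp + fn),        # true positive rate (recall of class 1)
    "tnr": tn / (tn + fp),        # true negative rate (recall of class 0)
    "fpr": fp / (fp + tn),        # false positive rate
    "fnr": fn / (fn + tp),        # false negative rate
    "fdr": fp / (fp + tp),        # false discovery rate
    "for": fn / (fn + tn),        # false omission rate
    "csi": tp / (tp + fn + fp),   # critical success index
    "mcc": (tp * tn - fp * fn)
    / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
}

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Macro averages: unweighted mean of the per-class scores.
precision_macro = (tp / (tp + fp) + tn / (tn + fn)) / 2
recall_macro = (rates["tpr"] + rates["tnr"]) / 2

for name, value in rates.items():
    print(f"{name}: {value:.4f}")
print(f"accuracy: {accuracy:.4f}")
print(f"macro precision: {precision_macro:.4f}")
print(f"macro recall: {recall_macro:.4f}")
```

The low `tpr` next to the high `tnr` shows that the model misses a large share of the positive class even though overall accuracy looks strong, which is why the per-class rates are worth reading alongside accuracy.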

### Instantiate a Local Explainer
Create a LocalExplainer to define the ESM you want to query. The local explainer lets you extract query explanations and predictions in real-time.

In [10]:
from umnai.explanations.local import LocalExplainer

# Instantiate a LocalExplainer:
local_explainer = LocalExplainer(esm)



### Load Test Data and Submit the Query
Next, load the test data, create a Query from it, and pass the Query object to the local explainer instance to generate the Query Result.

In [11]:
from umnai.explanations.local import Query

# Load test data:
test_data = esm.dataset.get_data(transformed=False, filters=[('observation_type', '==', 'test')]).to_pandas()

# Query ESM with test data:
query_eval = Query(dict(test_data.drop(columns=['Income'])))
query_eval_result = local_explainer(query_eval)
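
The `dict(...)` call in the cell above turns the DataFrame into a mapping from column name to pandas Series, with the target column dropped first. The sketch below shows that pandas behavior on a hypothetical two-row frame; it illustrates only the conversion and makes no claim about what else `Query` accepts.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the test data.
toy_test = pd.DataFrame({"Age": [27, 40], "HoursPerWeek": [38, 40], "Income": [0, 1]})

# dict() over a DataFrame yields {column name: pandas Series}.
payload = dict(toy_test.drop(columns=["Income"]))

print(list(payload.keys()))
print(type(payload["Age"]).__name__)
```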

### Create Classification Evaluation View
Now pass the predictions from the Query Result together with the test data actual values to the Classification Evaluation View.

In [12]:
from umnai.views.classification_evaluation import ClassificationEvaluationView

# Create view and inspect data:
classification_evaluation_view = ClassificationEvaluationView(
    true_data=test_data['Income'], 
    predicted_result=query_eval_result.data['predicted_output']
)
classification_evaluation_view.data

{'accuracy': 0.8529095654844158,
 'f1': 0.7775396498966325,
 'precision': 0.8181445683462893,
 'recall': 0.7537474721941355,
 'roc_auc': 0.9066476135449124,
 'pr_auc': 0.7715159249800158,
 'log_loss': 0.32400386252217994,
 'evaluation_duration_s': 0.021739959716796875,
 'confusion_matrix': array([[4673,  272],
        [ 686,  882]]),
 'confusion_matrix_rates': {'tpr': 0.5625,
  'tnr': 0.944994944388271,
  'fdr': 0.23570190641247835,
  'for': 0.1280089568949431,
  'fpr': 0.05500505561172902,
  'fnr': 0.4375,
  'mcc': 0.5682548020392446,
  'fm': 0.6923494970655764,
  'csi': 0.47934782608695653},
 'precision_recall_curve': {'precision': array([0.24074927, 0.24078624, 0.24082322, ..., 1.        , 1.        ,
         1.        ]),
  'recall': array([1.        , 1.        , 1.        , ..., 0.00829082, 0.00127551,
         0.        ]),
  'thresholds': array([0.00190409, 0.00191836, 0.00197626, ..., 0.9999998 , 0.9999999 ,
         0.99999994], dtype=float32)},
 'roc_curve': {'fpr': array([