# Feature Groups Notebook

When you onboard a dataset into Hybrid Intelligence, you can group related *numeric* features together, and these will be treated as a feature group within the ESM.

In this example, we will group the input features `CapitalGain` and `CapitalLoss` into a feature group called `Capital`. You can define feature groups when onboarding a dataset with the `feature_groups` parameter.

# Check Environment Variables
Before installing Hybrid Intelligence in the notebook, you need to set the following environment variables externally, as described in the User Guide: https://docs.umnai.com/set-up-your-environment.
This section checks that the environment variables have been set and raises an error if any of them is missing.

In [1]:
import os

umnai_env_vars = {
    'UMNAI_CLIENT_ID',
    'UMNAI_CLIENT_SECRET',
    'PIP_EXTRA_INDEX_URL',
}

if not umnai_env_vars.issubset(os.environ.keys()):
    raise ValueError(
        'UMNAI environment variables not set correctly. They need to be set before using the Umnai library.'
    )
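
When the check fails, it usually helps to know which variables are missing. The following is a minimal sketch using only the standard library. The helper name `missing_env_vars` is hypothetical and not part of the UMNAI library.

```python
import os

REQUIRED_VARS = {'UMNAI_CLIENT_ID', 'UMNAI_CLIENT_SECRET', 'PIP_EXTRA_INDEX_URL'}

def missing_env_vars(required, environ=None):
    """Return the sorted names in `required` that are absent from `environ`."""
    # Fall back to the real process environment when no mapping is given
    environ = os.environ if environ is None else environ
    return sorted(name for name in required if name not in environ)

# Example with an explicit mapping instead of the real environment
print(missing_env_vars(REQUIRED_VARS, {'UMNAI_CLIENT_ID': 'abc'}))
# In the notebook you could call missing_env_vars(REQUIRED_VARS) and raise if the list is not empty
```

Like the cell above, this only checks that the variables are present. It does not verify that their values are valid.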

# Install Hybrid Intelligence
Next we install the UMNAI Platform. 

In [2]:
%pip install umnai-platform

Looking in indexes: https://pypi.org/simple, https://info%40umnai.com:****@umnai.jfrog.io/artifactory/api/pypi/umnai-dev-pypi/simple

Note: you may need to restart the kernel to use updated packages.


# Set Workspace Paths According to Your Environment
Now we will set the workspace path and the experiment path automatically. They will be set to a local path if you are using a local machine environment or to a Databricks path if you are using a Databricks environment. 

## Install Databricks SDK

This checks whether you are running on Databricks and, if so, installs the Databricks SDK.

In [3]:
import os
if os.environ.get('DATABRICKS_RUNTIME_VERSION') is not None:
    %pip install databricks-sdk

If you are on Databricks, you can select whether you would like the workspace to be created in the shared area (available to all users in your account) or in your personal user account area. You can ignore this if you are running on a local environment.

In [4]:
# Set to 1 if you want to use shared or 0 to use personal user account area.
USE_SHARED_WORKSPACE = 1 

## Set Paths
Next the workspace and experiment paths are set automatically.

In [5]:
import os

EXP_NAME = 'featuregroups_adult_income'
if os.environ.get('DATABRICKS_RUNTIME_VERSION') is not None:
    from databricks.sdk import WorkspaceClient
    w = WorkspaceClient()

    # For a Databricks Environment
    WS_PATH = '/dbfs/FileStore/workspaces/' + EXP_NAME
    if USE_SHARED_WORKSPACE:
        EXP_PREFIX = '/Shared/experiments/'
    else:
        USERNAME = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
        EXP_PREFIX = f'/Users/{USERNAME}/experiments/'
    w.workspace.mkdirs(EXP_PREFIX)
    EXP_PATH = EXP_PREFIX + EXP_NAME
else:
    # For a Local Machine Environment
    WS_PATH = 'resources/workspaces/'+EXP_NAME
    EXP_PATH = EXP_NAME

# Import and Prepare Dataset
Import the dataset into a Pandas DataFrame and then clean the data in preparation for onboarding into Hybrid Intelligence.

In [6]:
import pandas as pd
import numpy as np

# Import Adult Income Dataset to pandas dataframe: 
# This dataset can be downloaded from https://archive.ics.uci.edu/dataset/2/adult 
column_names = ["Age", "WorkClass", "fnlwgt", "Education", "EducationNum", "MaritalStatus", "Occupation", "Relationship", "Race", "Gender", "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"]
dataset_df = pd.read_csv('https://raw.githubusercontent.com/umnaibase/umnai-examples/main/data/adult.data', names = column_names)

# Data Preparation:
dataset_df = dataset_df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)    # Remove whitespaces
dataset_df["Income"] = np.where((dataset_df["Income"] == '<=50K'), 0, 1)                # Replace Target values with [0,1]
dataset_df.tail(5)

| | Age | WorkClass | fnlwgt | Education | EducationNum | MaritalStatus | Occupation | Relationship | Race | Gender | CapitalGain | CapitalLoss | HoursPerWeek | NativeCountry | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | 0 |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | 1 |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | 0 |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | 0 |
| 32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | 1 |
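
To see what the two preparation steps do, here is a small sketch on a toy DataFrame. The values are illustrative, and the columns are selected with `select_dtypes` rather than the dtype comparison used in the cell above. Note that `np.where` maps every label other than `'<=50K'` to 1, including any missing or unexpected value.

```python
import numpy as np
import pandas as pd

# Toy data with the kind of leading whitespace the raw file contains
toy = pd.DataFrame({
    "Age": [39, 50],
    "WorkClass": [" State-gov", " Private"],
    "Income": [" <=50K", " >50K"],
})

# Strip whitespace from the non-numeric (text) columns only
text_cols = toy.select_dtypes(exclude="number").columns
toy[text_cols] = toy[text_cols].apply(lambda s: s.str.strip())

# Encode the target: '<=50K' becomes 0, everything else becomes 1
toy["Income"] = np.where(toy["Income"] == "<=50K", 0, 1)

print(toy)
```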


# Create or Open a Hybrid Intelligence Workspace
Workspaces are used by the Hybrid Intelligence framework to organize your data and models together in one place.

In [7]:
from umnai.workspaces.context import Workspace

# Open a workspace
ws = Workspace.open(
    path=WS_PATH,
    experiment=EXP_PATH
)

ws # Prints workspace details to confirm created/opened

2023-08-03 09:42:03.185890: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-08-03 09:42:03.185947: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-08-03 09:42:07.768792: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-08-03 09:42:07.768858: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-08-03 09:42:07.768916: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (UMNAI-LP): /proc/driver/nvidia/version does not exist


<umnai.workspaces.context.WorkspaceContext at 0x7f24f06756d0>

# Onboard Hybrid Intelligence Dataset

Onboard the Pandas DataFrame into a Hybrid Intelligence dataset. 

To specify feature groups, pass a dictionary to the `feature_groups` parameter. Each key is the name of a group, and each value is the list of features to group together.
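
Before onboarding, it can be worth sanity-checking the dictionary against the DataFrame. The sketch below is not part of the UMNAI library: `check_feature_groups` is a hypothetical helper. It checks that every listed feature exists and is numeric, since feature groups are meant for numeric features. It also flags a feature listed in more than one place, which is an assumption of this sketch rather than a documented rule.

```python
import pandas as pd

def check_feature_groups(df, feature_groups):
    """Return a list of problems found in `feature_groups` (empty if none)."""
    problems = []
    seen = {}  # feature name -> first group it was listed in
    for group, members in feature_groups.items():
        for name in members:
            if name not in df.columns:
                problems.append(f"{group}: '{name}' is not a column")
            elif not pd.api.types.is_numeric_dtype(df[name]):
                problems.append(f"{group}: '{name}' is not numeric")
            # Assumption of this sketch: a feature should be listed only once
            if name in seen:
                problems.append(f"'{name}' is listed in {seen[name]} and again in {group}")
            seen.setdefault(name, group)
    return problems

toy = pd.DataFrame({
    'CapitalGain': [2174, 0],
    'CapitalLoss': [0, 0],
    'WorkClass': ['State-gov', 'Private'],
})
print(check_feature_groups(toy, {'Capital': ['CapitalGain', 'CapitalLoss']}))
print(check_feature_groups(toy, {'Mixed': ['CapitalGain', 'WorkClass']}))
```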

In [8]:
from umnai.data.datasets import Dataset
from umnai.data.enums import PredictionType

dataset = Dataset.from_pandas(
    dataset_df,
    prediction_type=PredictionType.CLASSIFICATION,
    features=list(dataset_df.drop(['Income'], axis=1).columns),    # All columns except 'Income' are features
    targets=['Income'],
    feature_groups={'Capital': ['CapitalGain', 'CapitalLoss']}
)

dataset # Prints dataset details to confirm created/opened

23/08/03 09:42:10 WARN Utils: Your hostname, UMNAI-LP resolves to a loopback address: 127.0.1.1; using 172.20.128.1 instead (on interface eth3)
23/08/03 09:42:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/03 09:42:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/08/03 09:42:21 WARN TaskSetManager: Stage 16 contains a task of very large size (7652 KiB). The maximum recommended task size is 1000 KiB.


[ObservationSpec] - MLFLOW Run ID: 774efc758be84fffa9d3d160f86a3cdf:   0%|          | 0/60 [00:00<?, ?it/s]

23/08/03 09:42:23 WARN TaskSetManager: Stage 17 contains a task of very large size (7652 KiB). The maximum recommended task size is 1000 KiB.
23/08/03 09:42:27 WARN TaskSetManager: Stage 19 contains a task of very large size (7652 KiB). The maximum recommended task size is 1000 KiB.
23/08/03 09:42:28 WARN TaskSetManager: Stage 20 contains a task of very large size (7652 KiB). The maximum recommended task size is 1000 KiB.
23/08/03 09:42:28 WARN TaskSetManager: Stage 21 contains a task of very large size (7652 KiB). The maximum recommended task size is 1000 KiB.
23/08/03 09:42:28 WARN TaskSetManager: Stage 22 contains a task of very large size (7652 KiB). The maximum recommended task size is 1000 KiB.
23/08/03 09:42:28 WARN TaskSetManager: Stage 23 contains a task of very large size (7652 KiB). The maximum recommended task size is 1000 KiB.
23/08/03 09:42:38 WARN TaskSetManager: Stage 24 contains a task of very large size (7652 KiB). The maximum recommended task size is 1000 KiB.
23/08/



2023-08-03 09:43:06.556443: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: /mnt/d/codebase/python/umnai-tests/demo-notebooks/resources/workspaces/featuregroups_adult_income/preprocessing/dataset_name=Dataset_28ab8595/assets


Dataset(id=43b1df74-9717-4672-af7c-c003283d32cd; name=Dataset_28ab8595; is_named=False; workspace_id=None)

# Confirm Feature Group Onboarding
When you onboard a dataset, all user-defined feature groups are included in the dataset metadata.

In [9]:
dataset.feature_groups

{'Capital': ['CapitalGain', 'CapitalLoss']}
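
Because the metadata maps each group to its members, a reverse lookup from feature to group is easy to derive. This is handy later, when attributions are reported per group. The sketch below uses a plain dictionary shaped like the output above. `feature_to_group` is a hypothetical helper name.

```python
feature_groups = {'Capital': ['CapitalGain', 'CapitalLoss']}

def feature_to_group(groups):
    """Invert a {group: [features]} mapping into {feature: group}."""
    return {feature: group for group, members in groups.items() for feature in members}

lookup = feature_to_group(feature_groups)
print(lookup)

# Ungrouped features can fall back to their own name
print(lookup.get('Age', 'Age'))
```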

# Statistical Data
The statistical data for a feature group is still shown in terms of the individual input features.

In [10]:
pd.DataFrame(dataset.stats)

| | Age | WorkClass | fnlwgt | Education | EducationNum | MaritalStatus | Occupation | Relationship | Race | Gender | CapitalGain | CapitalLoss | HoursPerWeek | NativeCountry | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| minimum | 17.0 | | 12285.0 | | 1.0 | | | | | | 0.0 | 0.0 | 1.0 | | 0.0 |
| maximum | 90.0 | | 1484705.0 | | 16.0 | | | | | | 99999.0 | 4356.0 | 99.0 | | 1.0 |
| mean | 38.581647 | | 189778.4 | | 10.080679 | | | | | | 1077.648844 | 87.30383 | 40.437456 | | 0.24081 |
| stddev | 13.640433 | | 105550.0 | | 2.57272 | | | | | | 7385.292085 | 402.960219 | 12.347429 | | 0.427581 |
| null_count | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| unique_count | | 9.0 | | 16.0 | | 7.0 | 15.0 | 6.0 | 5.0 | 2.0 | | | | 42.0 | 2.0 |
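
Since the statistics are reported per input feature, you can use the group definition to pull out the columns that belong to one group. The sketch below rebuilds a small part of the table by hand, with values copied from the output above. In the notebook you would start from `pd.DataFrame(dataset.stats)` instead, assuming it has the layout shown.

```python
import pandas as pd

# A hand-built excerpt of the statistics table shown above
stats_df = pd.DataFrame(
    {
        'Age': [17.0, 90.0, 38.581647],
        'CapitalGain': [0.0, 99999.0, 1077.648844],
        'CapitalLoss': [0.0, 4356.0, 87.30383],
    },
    index=['minimum', 'maximum', 'mean'],
)

feature_groups = {'Capital': ['CapitalGain', 'CapitalLoss']}

# Select only the columns that make up the 'Capital' group
capital_stats = stats_df[feature_groups['Capital']]
print(capital_stats)
```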


# Induce a Hybrid Intelligence Model

You do not need to make any alterations to the standard procedure to induce a model from a dataset that includes feature groups.

To induce a model from a dataset, you must first create a ModelInducer that sets up the induction parameters and settings. Then you simply use the ModelInducer to induce an Explanation Structure Model (ESM) from the onboarded dataset.

In [11]:
from umnai.induction.inducer import ModelInducer
from umnai.esm.model import ESM

# Induce a simple model quickly using fast execution parameters
model_inducer = ModelInducer(
    max_interactions=3,
    max_interaction_degree=2,
    max_polynomial_degree=2,
    trials=2,
    estimators=2,
    batch_size=512,
    iterations=2,
)

# Induce a more realistic model using default Induction parameters:
# model_inducer = ModelInducer()

# Create an ESM using Induction
esm = model_inducer.induce(dataset)

[Modules] - MLFLOW Run ID: 6b1110ff2913499090139f5e4b3b344a:   0%|          | 0/24 [00:00<?, ?it/s]

INFO:tensorflow:Assets written to: /tmp/tmpysx4ouph/model/data/model/assets




# Save the Model
Save the model to your workspace and note the ESM ID (and MLflow run ID) so that you can load the model again later.

In [12]:
# Note ESM ID and MLflow Run ID
print("MLflow Run ID: ", esm.producer_run_id)
print("ESM ID: ", esm.id)

# Save the ESM to your workspace
esm.save_to_workspace()

MLflow Run ID:  6b1110ff2913499090139f5e4b3b344a
ESM ID:  Dataset_28ab8595_6b1110ff2913499090139f5e4b3b344a
INFO:tensorflow:Assets written to: /mnt/d/codebase/python/umnai-tests/demo-notebooks/resources/workspaces/featuregroups_adult_income/models/Dataset_28ab8595_6b1110ff2913499090139f5e4b3b344a/assets
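
Rather than copying the IDs by hand, you could write them to a small JSON file. The following is a minimal sketch using only the standard library. The helper names and the file name `model_ids.json` are hypothetical, and the IDs are the ones printed above. In the notebook you would pass `esm.id` and `esm.producer_run_id`.

```python
import json
import tempfile
from pathlib import Path

def save_model_ids(path, esm_id, run_id):
    """Write the ESM ID and MLflow run ID to a JSON file and return the record."""
    record = {'esm_id': esm_id, 'mlflow_run_id': run_id}
    Path(path).write_text(json.dumps(record, indent=2))
    return record

def load_model_ids(path):
    """Read the record written by save_model_ids."""
    return json.loads(Path(path).read_text())

# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    ids_file = Path(tmp) / 'model_ids.json'
    save_model_ids(
        ids_file,
        esm_id='Dataset_28ab8595_6b1110ff2913499090139f5e4b3b344a',
        run_id='6b1110ff2913499090139f5e4b3b344a',
    )
    print(load_model_ids(ids_file))
```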


# Inference: Query a Model 
When you query a Hybrid Intelligence model, you get predictions together with explanations in real time.

## Create a Query with Feature Groups

**No alterations are necessary** to create a query for a model with feature groups: you still supply the inputs under the original feature names.

In [13]:
from umnai.explanations.local import Query
import pandas as pd

query = Query({
    'Age': [39],
    'WorkClass': ['State-gov'],
    'fnlwgt': [77516],
    'Education': ['Bachelors'],
    'EducationNum': [13],
    'MaritalStatus': ['Never-married'],
    'Occupation': ['Adm-clerical'],
    'Relationship': ['Not-in-family'],
    'Race': ['White'],
    'Gender': ['Male'],
    'CapitalGain': [2174],
    'CapitalLoss': [0],
    'HoursPerWeek': [40],
    'NativeCountry': ['United-States']
})
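
If you would rather build the query inputs from an existing row than type them out, the dictionary can be derived from a DataFrame. The sketch below uses a toy DataFrame and a hypothetical helper, `row_to_query_inputs`. It assumes that `Query` accepts a dictionary of single-element lists keyed by the original feature names, as in the cell above.

```python
import pandas as pd

def row_to_query_inputs(df, row_position, targets):
    """Return {feature: [value]} for one row, leaving out the target columns."""
    features = df.drop(columns=targets)
    # iloc with a list keeps a one-row DataFrame, so each value stays in a list
    return features.iloc[[row_position]].to_dict(orient='list')

toy = pd.DataFrame({
    'Age': [39, 50],
    'CapitalGain': [2174, 0],
    'CapitalLoss': [0, 0],
    'Income': [0, 0],
})

inputs = row_to_query_inputs(toy, 0, targets=['Income'])
print(inputs)
# In the notebook, the result would then be passed on as Query(inputs)
```

The grouped features `CapitalGain` and `CapitalLoss` appear under their own names. The group name `Capital` is never used as an input key.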

## Query Result with Feature Groups

When you are viewing a query result, the attributions will be in terms of the **feature group**.

In [14]:
from umnai.explanations.local import LocalExplainer

# Instantiate a LocalExplainer:
local_explainer = LocalExplainer(esm)

# Query the model:
query_result = local_explainer(query)

# Display the Query Result together with the explanation
query_result.data

{'query_input': {'Age': array([39]),
  'WorkClass': array(['State-gov'], dtype=object),
  'fnlwgt': array([77516]),
  'Education': array(['Bachelors'], dtype=object),
  'EducationNum': array([13]),
  'MaritalStatus': array(['Never-married'], dtype=object),
  'Occupation': array(['Adm-clerical'], dtype=object),
  'Relationship': array(['Not-in-family'], dtype=object),
  'Race': array(['White'], dtype=object),
  'Gender': array(['Male'], dtype=object),
  'CapitalGain': array([2174]),
  'CapitalLoss': array([0]),
  'HoursPerWeek': array([40]),
  'NativeCountry': array(['United-States'], dtype=object)},
 'scenario_id': None,
 'context_id': None,
 'query_row_hash': array([223154430221671965198347690320637288902], dtype=object),
 'query_created_time': datetime.datetime(2023, 8, 3, 7, 45, 20, tzinfo=<UTC>),
 'model_id': 'Dataset_28ab8595_6b1110ff2913499090139f5e4b3b344a',
 'model_intercept': -0.992887,
 'dataset_id': '43b1df7497174672af7cc003283d32cd',
 'run_id': '93ce3cc9cbc747e886218039105a

# Explore and Explain a Model with Feature Groups

When you are exploring a model, the feature module and any interaction modules will be in terms of the **feature group**, while module rules will be in terms of the individual features in the group.
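
One visible consequence is the module count. The sketch below counts the feature modules you would expect if each ungrouped feature and each feature group gets exactly one module. That rule is an assumption of this sketch, not a documented one. For this dataset it gives 13, which matches the `n_modules` value reported by the Model Summary View for this run.

```python
features = [
    'Age', 'WorkClass', 'fnlwgt', 'Education', 'EducationNum', 'MaritalStatus',
    'Occupation', 'Relationship', 'Race', 'Gender', 'CapitalGain', 'CapitalLoss',
    'HoursPerWeek', 'NativeCountry',
]
feature_groups = {'Capital': ['CapitalGain', 'CapitalLoss']}

def expected_feature_modules(features, feature_groups):
    """Count one module per ungrouped feature plus one per feature group."""
    grouped = {name for members in feature_groups.values() for name in members}
    ungrouped = [name for name in features if name not in grouped]
    return len(ungrouped) + len(feature_groups)

print(expected_feature_modules(features, feature_groups))
```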

## ModelSummaryView
The Model Summary View gives you an overview of the key parameters, inputs and outputs of the model, and of each module within it.

In [15]:
from umnai.views.model_summary import ModelSummaryView

model_summary_view = ModelSummaryView(esm=esm)
model_summary_view.data

{'model_id': 'Dataset_28ab8595_6b1110ff2913499090139f5e4b3b344a',
 'model_name': 'esm',
 'model_title': None,
 'model_created': datetime.datetime(2023, 8, 3, 7, 44, 44, tzinfo=<UTC>),
 'model_last_trained': datetime.datetime(2023, 8, 3, 7, 44, 44, tzinfo=<UTC>),
 'model_uvc': 'b994072a7fe05cb79405809ac34fc750f4390506053afbae1bb7cac7ca1a5d92',
 'model_intercept': -0.992887020111084,
 'has_personal_individual_data': False,
 'has_reuse_restrictions': False,
 'model_doi': '',
 'model_copyright': '',
 'n_input_features': 14,
 'n_transformed_features': 108,
 'n_output_targets': 1,
 'features': ['Age',
  'WorkClass',
  'fnlwgt',
  'Education',
  'EducationNum',
  'MaritalStatus',
  'Occupation',
  'Relationship',
  'Race',
  'Gender',
  'CapitalGain',
  'CapitalLoss',
  'HoursPerWeek',
  'NativeCountry'],
 'targets': ['Income'],
 'n_modules': 13,
 'n_partitions': 21,
 'max_interaction_degree': 1,
 'model_interaction_count': 21,
 'max_width': 1,
 'max_depth': 4,
 'n_categorical_features': 8,
 

## PartialDependencyView
The Partial Dependency View for the feature group module shows the transfer function from the feature group's component input features to the module attribution.

In [16]:
from umnai.views.partial_dependency import PartialDependencyView

# Select Feature Group module
selected_module = 'Capital'

# Generate the view
partial_dependency_view = PartialDependencyView(esm=esm, module=selected_module)

# Display the results
partial_dependency_view.data

| | input_feature.CapitalGain | input_feature.CapitalLoss | attribution.Income | attribution_normalized.Income | module_partition_index | rule_id | condition_expr_friendly | attribution_delta.Income |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | -0.079082 | -0.001242 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | |
| 1 | 0 | 155 | -0.038105 | -0.000598 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | 0.040977 |
| 2 | 0 | 213 | -0.022771 | -0.000357 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | 0.015333 |
| 3 | 0 | 323 | 0.006309 | 0.000099 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | 0.029081 |
| 4 | 0 | 419 | 0.031689 | 0.000497 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | 0.025379 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205 | 25236 | 0 | 0.951035 | 0.014930 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | 0.004572 |
| 206 | 27828 | 0 | 1.056839 | 0.016591 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | 0.105804 |
| 207 | 34095 | 0 | 1.312653 | 0.020607 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | 0.255815 |
| 208 | 41310 | 0 | 1.607165 | 0.025231 | 0 | a9fa4935c60946588fad337b38de1c62 | FOR ALL | 0.294512 |
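
In the displayed rows, `attribution_delta.Income` appears to be the change in `attribution.Income` from the previous row. This reading is inferred from the numbers above and is not a documented definition. The sketch below checks it on the first five rows, allowing for the rounding of the displayed values.

```python
# attribution.Income for rows 0 to 4, copied from the output above
attributions = [-0.079082, -0.038105, -0.022771, 0.006309, 0.031689]

# attribution_delta.Income for rows 1 to 4 (row 0 has no previous row)
reported_deltas = [0.040977, 0.015333, 0.029081, 0.025379]

# Difference between each row and the one before it
computed_deltas = [b - a for a, b in zip(attributions, attributions[1:])]

for computed, reported in zip(computed_deltas, reported_deltas):
    # Displayed values are rounded to 6 decimals, so allow a small tolerance
    print(f"computed={computed:.6f} reported={reported:.6f} "
          f"match={abs(computed - reported) < 5e-6}")
```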


# Explain a Query with Feature Groups

When you create a Feature Attribution View or an Interaction Attribution View for a model with a feature group, the attributions appear in terms of the feature group. The `grouped_features` column lists the group's component features, and the `feature_input` column shows their input values (`grouped_features_0` and `feature_input_0` in the Interaction Attribution View). The expressions in the Interaction Attribution View use the individual input features rather than the feature group.

## FeatureAttributionView

In [17]:
from umnai.views.feature_attribution import FeatureAttributionView

# Create the view and display the data
feature_attribution_view = FeatureAttributionView(query_result)
feature_attribution_view.data

| | input_feature | feature_attribution | feature_attribution_absolute | feature_attribution_normalized | grouped_features | feature_input |
|---|---|---|---|---|---|---|
| 0 | MaritalStatus | -0.665036 | 0.665036 | 0.38047 | | Never-married |
| 1 | Relationship | -0.372956 | 0.372956 | 0.21337 | | Not-in-family |
| 2 | Education | 0.228187 | 0.228187 | 0.130547 | | Bachelors |
| 3 | EducationNum | 0.193435 | 0.193435 | 0.110665 | | 13 |
| 4 | Occupation | -0.187952 | 0.187952 | 0.107528 | | Adm-clerical |
| 5 | Age | 0.068587 | 0.068587 | 0.039239 | | 39 |
| 6 | fnlwgt | -0.012399 | 0.012399 | 0.007093 | | 77516 |
| 7 | Capital | 0.009659 | 0.009659 | 0.005526 | [CapitalGain, CapitalLoss] | {'CapitalGain': 2174, 'CapitalLoss': 0} |
| 8 | Race | -0.006435 | 0.006435 | 0.003681 | | White |
| 9 | Gender | -0.001568 | 0.001568 | 0.000897 | | Male |
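
The row for a feature group holds a mapping of input values, while other rows hold a single value. If you need one entry per input feature, you can expand the grouped rows. The sketch below works on plain dictionaries shaped like two rows of the output above. `expand_grouped_inputs` is a hypothetical helper, and the sketch assumes that ungrouped rows carry `None` in `grouped_features`. Missing values in a real DataFrame may need converting first.

```python
rows = [
    {'input_feature': 'Age', 'feature_attribution': 0.068587,
     'grouped_features': None, 'feature_input': 39},
    {'input_feature': 'Capital', 'feature_attribution': 0.009659,
     'grouped_features': ['CapitalGain', 'CapitalLoss'],
     'feature_input': {'CapitalGain': 2174, 'CapitalLoss': 0}},
]

def expand_grouped_inputs(rows):
    """Return (module, input feature, input value) tuples, one per input feature."""
    expanded = []
    for row in rows:
        if row['grouped_features']:
            # Feature group: one tuple per component feature
            for name in row['grouped_features']:
                expanded.append((row['input_feature'], name, row['feature_input'][name]))
        else:
            # Ungrouped feature: the module and the input feature share a name
            expanded.append((row['input_feature'], row['input_feature'], row['feature_input']))
    return expanded

for item in expand_grouped_inputs(rows):
    print(item)
```

The attribution itself stays at group level. The view reports a single value for `Capital` and does not split it between `CapitalGain` and `CapitalLoss`.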


## InteractionAttributionView

In [18]:
from umnai.views.interaction_attribution import InteractionAttributionView

# Create the view and display the data
interaction_attribution_view = InteractionAttributionView(query_result)
interaction_attribution_view.data

| | module_id | module_index | module_name | module_partition_index | global_partition_index | rule_id | output_target_index | total_attribution | total_attribution_normalized | input_feature_0 | grouped_features_0 | feature_attribution_0 | feature_input_0 | condition_expr_friendly | summarized_then_expr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 7 | MaritalStatus | 0 | 13 | f27e861d10874a6a808de4f1d8e541e2 | 0 | -0.665036 | 0.38047 | MaritalStatus | | -0.665036 | Never-married | MaritalStatus ≠ "Married-civ-spouse" | `-0.661878705024719 - 0.00315683637745678*(Mar...` |
| 1 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 11 | Relationship | 0 | 18 | 1860e66f536044608f2eecfb15c7879c | 0 | -0.372956 | 0.21337 | Relationship | | -0.372956 | Not-in-family | Relationship ≠ "Husband" | `-0.373656421899796 + 0.000700621982105076*(Re...` |
| 2 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 5 | Education | 0 | 11 | 05aa6391703a4316905cd7314480d094 | 0 | 0.228187 | 0.130547 | Education | | 0.228187 | Bachelors | FOR ALL | `-0.0373556688427925 + 0.265542834997177*(Educ...` |
| 3 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 2 | EducationNum | 1 | 8 | 7ec46e1a131f43bcb35282201c390475 | 0 | 0.193435 | 0.110665 | EducationNum | | 0.193435 | 13 | EducationNum > 9.5 | `0.190801963210106 + 8.37128202395103e-6*Educa...` |
| 4 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 9 | Occupation | 0 | 16 | 0bedcbcf2427468989da0cb651853672 | 0 | -0.187952 | 0.107528 | Occupation | | -0.187952 | Adm-clerical | FOR ALL | `-0.140927150845528 - 0.0470249280333519*(Occu...` |
| 5 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 0 | Age | 1 | 1 | 0bb54381df8e4bbe981ba53b1382d703 | 0 | 0.068587 | 0.039239 | Age | | 0.068587 | 39 | (Age > 31.5) and (Age ≤ 39.5) | `0.0685801729559898 + 2.75818776112183e-9*Age*...` |
| 6 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 1 | fnlwgt | 0 | 4 | 6d3fd387d5964ddaa61b52ce10964e9f | 0 | -0.012399 | 0.007093 | fnlwgt | | -0.012399 | 77516 | fnlwgt ≤ 156804.5 | `-0.0123987291008234 + 1.67272570929364e-12*fn...` |
| 7 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 4 | Capital | 0 | 10 | a9fa4935c60946588fad337b38de1c62 | 0 | 0.009659 | 0.005526 | Capital | [CapitalGain, CapitalLoss] | 0.009659 | {'CapitalGain': 2174, 'CapitalLoss': 0} | FOR ALL | `-0.0790820717811584 + 4.08193417800904e-5*Cap...` |
| 8 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 10 | Race | 0 | 17 | c79266a7527f4b098b1c9bc043d772bd | 0 | -0.006435 | 0.003681 | Race | | -0.006435 | White | FOR ALL | `-0.00649698171764612 + 6.23680243734270e-5*(R...` |
| 9 | Dataset_28ab8595_6b1110ff2913499090139f5e4b3b3... | 6 | Gender | 0 | 12 | f6d790c88d5847d7b06fd1360d2df779 | 0 | -0.001568 | 0.000897 | Gender | | -0.001568 | Male | FOR ALL | `-0.00442887609824538 + 0.00286063877865672*(G...` |
