# Example: A Dive Into Term Deposit Prediction

To showcase xplainable, we’ll walk through an example using a Portuguese banking institution's open-source bank marketing dataset. The dataset can be found in the UCI Machine Learning Repository.

You can read more about the dataset on its repository page, but at a high level, it consists of phone marketing campaign records used to predict whether a person will purchase a term deposit following a campaign.

Typically, the goal is a model that predicts which customers will purchase a term deposit. Taking the approach I outlined earlier, we can instead frame the objective as follows:

*Identify customers who are more likely to purchase a term deposit and learn the underlying factors that drive purchasing decisions for improved campaign design.*

## Package Imports

In [1]:
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split

# Fetch the bank marketing dataset (UCI repository id 222)
bank_marketing = fetch_ucirepo(id=222)

# Features and target as pandas objects
X = bank_marketing.data.features
y = bank_marketing.data.targets['y']

# Create a training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [2]:
X_train.head()

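| | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19738 | 32 | management | married | tertiary | no | 183 | yes | no | cellular | 7 | aug | 112 | 3 | -1 | 0 | |
| 31173 | 30 | management | single | tertiary | no | 5561 | yes | no | cellular | 27 | feb | 195 | 1 | 100 | 1 | success |
| 20792 | 49 | blue-collar | married | primary | no | 67 | no | yes | cellular | 13 | aug | 269 | 4 | -1 | 0 | |
| 38400 | 29 | blue-collar | single | primary | no | 722 | yes | no | cellular | 15 | may | 78 | 1 | 359 | 1 | failure |
| 35007 | 50 | admin. | married | primary | no | 458 | yes | no | cellular | 6 | may | 116 | 2 | 363 | 1 | failure |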


## 1. Data Preprocessing

In [3]:
from xplainable.preprocessing.pipeline import XPipeline
import xplainable.preprocessing.transformers as xtf

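# NOTE: 'city' is a placeholder column used to illustrate Condense. The bank
# marketing data has no such column, so running this cell as-is raises the
# KeyError shown at the end. Swap in a high-cardinality column of your own.
X_train['city'].nunique()
# Result: 1127 (illustrative, for a dataset that has a 'city' column)

pipeline = XPipeline()

stages = [
    # <-- your other stages here,
    {"feature": "city", "transformer": xtf.Condense(pct=0.5)}
]

pipeline.add_stages(stages)

X_train_transformed = pipeline.fit_transform(X_train)

X_train_transformed['city'].nunique()
# Result: 256 (illustrative, for a dataset that has a 'city' column)

KeyError: 'city'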
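
To make the idea of condensing a high-cardinality column concrete, here is a minimal pure-Python sketch. It assumes one plausible reading of the `pct` argument: keep the most frequent categories until they cover that share of rows, and group everything else together. The `condense` helper, the `"other"` label and the toy city values are all hypothetical; this is not xplainable's implementation of `xtf.Condense`.

```python
from collections import Counter


def condense(values, pct=0.5, other="other"):
    """Keep the most frequent categories until they cover `pct` of rows.

    Every remaining category is mapped to the `other` label.
    (Hypothetical helper, not part of xplainable.)
    """
    counts = Counter(values)
    total = len(values)
    kept, covered = set(), 0
    for category, count in counts.most_common():
        if covered / total >= pct:
            break  # the kept categories already cover the requested share
        kept.add(category)
        covered += count
    return [v if v in kept else other for v in values]


# Toy column: 4 distinct cities across 10 rows
cities = ["Lisbon"] * 5 + ["Porto"] * 3 + ["Braga", "Faro"]
condensed = condense(cities, pct=0.5)

print(sorted(set(condensed)))  # ['Lisbon', 'other']
```

With `pct=0.5`, the single most frequent city already covers half of the rows, so the other three cities collapse into one group and the cardinality drops from 4 to 2.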

## 2. Feature Selection

Feature selection is a critical process in model development, impacting both model performance and interpretability.

### Issues:

- **Multicollinearity**: High inter-feature correlation can degrade the interpretability of a model. It's important to identify and address multicollinearity to improve the utility of individual features.

- **Feature Overload**: Excessive features can make a model complex and difficult to interpret without any additional increase in accuracy. Pruning less important features can aid in maintaining model performance while improving clarity.

### Implementation:

#### GraphSelector Class

`GraphSelector` mitigates multicollinearity by pruning correlated features. It employs a correlation matrix and network graphs to evaluate the relationships between features.

- **Process**: Analyses feature correlations using a matrix and `NetworkX` graph to determine redundancy.
- **Result**: Isolates and removes features with high multicollinearity.
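
The pruning idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not `GraphSelector`'s actual algorithm: features are linked by an edge when their absolute Pearson correlation reaches a threshold, and for each edge the feature less correlated with the target is dropped. The helper names `pearson` and `prune_correlated` and the toy values are hypothetical; the parameter name `min_feature_corr` simply mirrors the `GraphSelector` argument used in this walkthrough.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def prune_correlated(features, target, min_feature_corr=0.5):
    """Drop one feature from every highly correlated pair (hypothetical helper)."""
    target_corr = {name: abs(pearson(col, target)) for name, col in features.items()}

    # Graph edges: pairs of features whose |correlation| reaches the threshold
    edges = [
        (a, b)
        for a, b in combinations(features, 2)
        if abs(pearson(features[a], features[b])) >= min_feature_corr
    ]

    dropped = {}  # dropped feature -> the correlated feature that was kept
    for a, b in edges:
        if a in dropped or b in dropped:
            continue  # this edge was already resolved by an earlier drop
        weaker, stronger = (a, b) if target_corr[a] < target_corr[b] else (b, a)
        dropped[weaker] = stronger

    selected = [name for name in features if name not in dropped]
    return selected, dropped


# Toy values only, not real rows from the dataset
features = {
    "age": [32, 30, 49, 29, 50, 41],
    "pdays": [-1, 100, -1, 359, 363, -1],
    "previous": [0, 1, 0, 1, 1, 0],
}
target = [0, 1, 0, 0, 1, 0]

selected, dropped = prune_correlated(features, target)
print(selected)  # ['age', 'previous']
print(dropped)   # {'pdays': 'previous'}
```

In this toy data, `pdays` and `previous` are strongly correlated with each other, and `previous` tracks the target more closely, so `pdays` is the one removed.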



In [4]:
from xplainable.feature_selection.graph import GraphSelector

# Fit the network
graph_selector = GraphSelector(min_feature_corr=0.5, min_target_corr=0.01)
graph_selector.fit(X_train, y_train, start_threshold=0.75)

# Access the remaining features
graph_selector.selected


['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day_of_week',
 'month',
 'duration',
 'campaign',
 'previous',
 'poutcome']

To visualise how `GraphSelector` chose our features, we can run the following line:

In [9]:
graph_selector.plot_graph()

In [10]:
for feature in graph_selector.dropped:
    print(f"Dropped {feature['feature']} because of {feature['reason']}")

Dropped pdays because of high chance of multicollinearity with ['previous']

The feature selector's decision process can be visualised to identify features contributing to multicollinearity. By examining these visual cues, we can adjust the selector's parameters to preserve essential features. For demonstration purposes, the identified features will remain in the dataset.




#### XClfFeatureSelector Class

`XClfFeatureSelector` optimises feature sets for classification models (`XClassifier`). It trains a model on a different combination of features in each iteration and scores every feature by combining its importance score with a chosen performance metric. Once all iterations have run, the scores are aggregated into a ranking of features by overall significance, which gives an empirical basis for feature elimination.
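
The train-and-score loop can be illustrated with a small, self-contained sketch. It rests on several assumptions that are not taken from xplainable: a toy nearest-centroid classifier stands in for `XClassifier`, "combining" importance with the metric is read as multiplying them, every feature combination is enumerated rather than sampled, and all names and data are hypothetical.

```python
from itertools import combinations


def centroid_fit(rows, y, subset):
    """Per-feature class means (negative, positive) for the chosen subset."""
    means = {}
    for f in subset:
        neg = [r[f] for r, label in zip(rows, y) if label == 0]
        pos = [r[f] for r, label in zip(rows, y) if label == 1]
        means[f] = (sum(neg) / len(neg), sum(pos) / len(pos))
    return means


def centroid_accuracy(rows, y, means):
    """Training accuracy of a nearest-centroid classifier."""
    correct = 0
    for r, label in zip(rows, y):
        d0 = sum((r[f] - m0) ** 2 for f, (m0, m1) in means.items())
        d1 = sum((r[f] - m1) ** 2 for f, (m0, m1) in means.items())
        prediction = 1 if d1 < d0 else 0
        correct += int(prediction == label)
    return correct / len(y)


def rank_features(rows, y, features):
    """Aggregate importance x metric over every feature combination."""
    scores = {f: 0.0 for f in features}
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            means = centroid_fit(rows, y, subset)
            metric = centroid_accuracy(rows, y, means)
            # Toy importance: gap between class means, normalised per subset
            gaps = {f: abs(m1 - m0) for f, (m0, m1) in means.items()}
            total = sum(gaps.values()) or 1.0
            for f in subset:
                scores[f] += (gaps[f] / total) * metric
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Toy data: "signal" separates the classes, "weak" barely does, "noise" is constant
rows = [
    {"signal": 0.9, "weak": 0.6, "noise": 0.5},
    {"signal": 0.8, "weak": 0.4, "noise": 0.5},
    {"signal": 0.1, "weak": 0.5, "noise": 0.5},
    {"signal": 0.2, "weak": 0.3, "noise": 0.5},
]
y = [1, 1, 0, 0]

ranking = rank_features(rows, y, ["signal", "weak", "noise"])
print(list(ranking))  # ['signal', 'weak', 'noise']
```

A feature only accumulates a high score when it is both important within a subset and part of a subset that performs well, which is why the constant `noise` column ends up last.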

Output example after executing `XClfFeatureSelector` on the Term Deposit dataset (prior to `GraphSelector` application):


In [16]:
from xplainable.feature_selection.classification import XClfFeatureSelector

# Instantiate the feature selector
fs = XClfFeatureSelector(n_samples=100)
feature_info = fs.fit(X_train, y_train)


In [17]:
feature_info

{'duration': 37.61680550938488,
 'month': 22.255739583751215,
 'day_of_week': 16.90108510634383,
 'previous': 16.532536675821518,
 'pdays': 15.558511863314113,
 'age': 15.006034002008477,
 'balance': 14.122991891964837,
 'job': 12.918155039018012,
 'housing': 6.479637358859719,
 'campaign': 6.274581901520477,
 'education': 4.200964945450758,
 'marital': 3.4861488914254233,
 'loan': 2.8278915787843144,
 'poutcome': 2.5384148144188505,
 'contact': 1.1456148289911092,
 'default': 0.502623281801677}

## 3. Modelling

### Model Implementation in xplainable

Modelling with xplainable follows the conventional `fit`/`predict` pattern familiar from scikit-learn. Where it differs is parameter optimisation, which is streamlined and quick to run, most noticeably during rapid refitting.

If you have used other open-source Python machine learning packages, the xplainable API should feel familiar, which keeps the learning curve low and makes it easy to integrate into existing workflows.


In [18]:
from xplainable.core.models import XClassifier
from xplainable.core.optimisation.bayesian import XParamOptimiser

# Optimise hyperparameters
opt = XParamOptimiser(metric='roc-auc')
params = opt.optimise(X_train, y_train)

# Train model
model = XClassifier(**params)
model.fit(X_train, y_train)

100%|████████| 30/30 [00:05<00:00,  5.69trial/s, best loss: -0.8835904325047984]


<xplainable.core.ml.classification.XClassifier at 0x28c55caf0>

In [24]:
model.evaluate(X_test, y_test, threshold=0.5)

{'confusion_matrix': [[7979, 53], [954, 57]],
 'classification_report': {'0': {'precision': 0.893204970334714,
   'recall': 0.9934013944223108,
   'f1-score': 0.9406424992631889,
   'support': 8032.0},
  '1': {'precision': 0.5181818181818182,
   'recall': 0.05637982195845697,
   'f1-score': 0.1016949152542373,
   'support': 1011.0},
  'accuracy': 0.8886431493973239,
  'macro avg': {'precision': 0.7056933942582662,
   'recall': 0.5248906081903839,
   'f1-score': 0.5211687072587131,
   'support': 9043.0},
  'weighted avg': {'precision': 0.851277688810156,
   'recall': 0.8886431493973239,
   'f1-score': 0.8468488458922887,
   'support': 9043.0}},
 'roc_auc': 0.8862701395210454,
 'neg_brier_loss': 0.9175745435164223,
 'log_loss': 4.013707725626559,
 'cohen_kappa': 0.08154308571352509}
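
The headline numbers in this report can be recomputed by hand from the confusion matrix, which helps when reading the result. The sketch below assumes the matrix is laid out as `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes); that layout is consistent with the `support` values in the report.

```python
# Confusion matrix reported by model.evaluate(...) at threshold=0.5,
# read as [[TN, FP], [FN, TP]] (assumed layout, see the note above)
(tn, fp), (fn, tp) = [[7979, 53], [954, 57]]

precision = tp / (tp + fp)                  # share of predicted buyers who bought
recall = tp / (tp + fn)                     # share of actual buyers that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# precision=0.5182 recall=0.0564 f1=0.1017 accuracy=0.8886
```

Accuracy looks healthy at about 0.89, but at this threshold the model finds only 57 of the 1,011 customers who actually purchased. On an imbalanced target like this one, recall and ROC AUC are more informative than accuracy, and a lower threshold would trade some precision for more recall.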

## Model Explainability and Data Leakage Detection

The feature importance and contribution charts generated by the `model.explain()` method show how much each feature contributes to the model's predictions. They can also surface issues such as data leakage.

For instance, `duration` has the most prominent importance score, and its contributions show that longer durations are associated with a higher likelihood of a positive outcome. A feature this dominant may indicate data leakage, especially if `duration` measures the length of a call that is only known once the call, and therefore the outcome, has already happened.

In practice, `duration` would not be available at the time of prediction, and its presence in the dataset can lead to overly optimistic validation performance that is unlikely to generalise to new, unseen data.

The immediate visibility of `duration` as a dominant feature in the explainability charts flags this risk and warrants further investigation, to ensure the model is not learning from features that would be unavailable in a real-world scenario.

In [20]:
model.explain()

The charts above show that `duration` strongly influences model predictions. However, the fact that an increase in `duration` consistently raises the predicted probability of conversion is a warning sign for data leakage. If `duration` is measured post-outcome, it should not be included as a model feature, because it inflates measured performance in a way that will not carry over to production. Insights like this are crucial for ensuring the model is robust and reliable once deployed.

*Note that if you want to render the interactive chart above, you’ll need to install the plotting dependencies with `pip install xplainable[plotting]`.*

### Dropping the `duration` feature and re-training

In [26]:
# Optimise hyperparameters on the data without `duration`
X_train_dd = X_train.drop(columns=["duration"])
opt_dd = XParamOptimiser(metric='roc-auc')
params_dd = opt_dd.optimise(X_train_dd, y_train)

# Train model with the newly optimised parameters
model_dd = XClassifier(**params_dd)
model_dd.fit(X_train_dd, y_train)

100%|████████| 30/30 [00:05<00:00,  5.75trial/s, best loss: -0.7293642573275253]


<xplainable.core.ml.classification.XClassifier at 0x28e257a90>
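
To put a number on the effect of removing `duration`, the best losses from the two optimisation runs can be compared directly. This small sketch assumes the optimiser reports the negative of the chosen metric (ROC AUC here), which is why the sign is flipped; the values are copied from the two progress bars in this walkthrough.

```python
# Best losses copied from the two optimiser progress bars
best_loss_with_duration = -0.8835904325047984
best_loss_without_duration = -0.7293642573275253

# Assumption: the loss is the negative ROC AUC, so flip the sign
auc_with = -best_loss_with_duration
auc_without = -best_loss_without_duration
drop = auc_with - auc_without

print(f"ROC AUC {auc_with:.3f} -> {auc_without:.3f} (drop of {drop:.3f})")
# ROC AUC 0.884 -> 0.729 (drop of 0.154)
```

A drop of roughly 0.15 in ROC AUC suggests that much of the original model's apparent skill came from `duration`, so the lower figure is likely the more realistic estimate of performance at prediction time.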

In [27]:
model_dd.explain()

## Instantiate Xplainable Cloud
Initialise the xplainable cloud using an API key from: https://beta.xplainable.io/

This allows you to save and collaborate on models, create deployments, and create shareable reports.

In [None]:
import xplainable as xp

xp.initialise(
    api_key="",  # <- Add your own API key here
)