# Predicting prognosis for breast cancer

This notebook implements the experiments in Section 5.2 in our paper *"Demystifying Black-box Models with Symbolic Metamodels"* submitted to **NeurIPS 2019** by *Ahmed M. Alaa and Mihaela van der Schaar*. The experiments are based on the PREDICT dataset.

In this experiment, we demonstrate the utility of symbolic metamodeling in a real-world setup for which model interpretability and transparency are of immense importance. In particular, we consider the problem of predicting the risk of mortality for breast cancer patients based on clinical features. For this setup, the American Joint Committee on Cancer (AJCC) guidelines require prognostic models to be formulated as transparent equations [1]. Symbolic metamodeling can enable machine learning models to meet these requirements by converting black-box prognostic models into risk equations that can be written on a piece of paper.

Before starting our experiments, we first import the required libraries...

In [None]:
from pysymbolic.utilities.performance import *
from pysymbolic.algorithms.symbolic_metamodeling import *

from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd   # used below to read the PREDICT data

from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

from data.PREDICT_score import *

Using data for 2,000 breast cancer patients extracted from the UK cancer registry, we fit an XGBoost model $f(x)$ to predict the patients' 5-year mortality risk based on 7 features: age, screening status, number of nodes, tumor size, tumor grade, ER and HER2 status. Tumor grade is represented by four indicator columns, so the feature matrix used below has 10 columns. We first read the PREDICT dataset:

In [None]:
PREDICT_data     = pd.read_csv('data/PREDICT_data_subset.csv').drop(['Unnamed: 0'], axis=1)

predict_features = ['AGE', 'ScreeningvsClinical', 'TUMOURSIZE','GRADEG1', 'GRADEG2', 'GRADEG3', 'GRADEG4', 
                    'NODESINVOLVED', 'ER_STATUSN', 'HER2_STATUSP']

Using 5-fold cross-validation, we compare the area under the receiver operating characteristic curve (AUC-ROC) of the XGBoost model with that of the PREDICT risk calculator ([https://breast.predict.nhs.uk/](https://breast.predict.nhs.uk/)), which is the risk equation most commonly used in current practice.
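
For readers less familiar with the metric, AUC-ROC can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The following is a minimal, dependency-free sketch of that pairwise definition. The function `auc_roc` and the toy labels and scores are illustrative assumptions; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie: half credit
    return wins / (len(positives) * len(negatives))


# Toy data (made up for illustration, not taken from the PREDICT dataset).
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc_roc(example_labels, example_scores)
```

This quadratic-time version is only meant to make the definition concrete; library implementations compute the same quantity far more efficiently.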

In [None]:
def cross_validation(model_mode, PREDICT_data, num_folds):

    PREDICT_pref  = []
    num_samples   = len(PREDICT_data)
    
    train_size    = int(np.floor(num_samples * (1 - 1/num_folds)))
    test_size     = num_samples - train_size
    
    test_indexes  = partition_(scrambled(list(range(num_samples))), num_folds)

    minmaxscaler  = MinMaxScaler()
    
    minmaxscaler.fit(PREDICT_data[predict_features])

    for u in range(num_folds):
    
        train_indexes = list(set(range(num_samples)) - set(test_indexes[u]))
    
        train_data    = PREDICT_data.loc[PREDICT_data.index[train_indexes]]
        test_data     = PREDICT_data.loc[PREDICT_data.index[test_indexes[u]]]
        
        if model_mode == 'XGBoost':
            
            model = XGBClassifier(n_estimators=300) 
            
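        elif model_mode == 'PREDICT':    
    
            model = PREDICT_model(minmaxscaler) 
    
        else:
    
            # Fail early instead of hitting an undefined `model` further down
            raise ValueError("model_mode must be 'XGBoost' or 'PREDICT'")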
    
    
        x_train       = train_data[predict_features]
        x_test        = test_data[predict_features]
        y_train       = train_data['Label'].astype(int)
        y_test        = test_data['Label'].astype(int)
    
        x_train   = minmaxscaler.transform(x_train)
        x_test    = minmaxscaler.transform(x_test)
        
        if model_mode == 'XGBoost':
            
            model.fit(x_train, y_train)
        
        y_pred        = model.predict_proba(x_test)[:, 1]
    
        PREDICT_pref.append(roc_auc_score(y_test, y_pred)) 
        
    return mean_confidence_interval(PREDICT_pref)    
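
The function above relies on the helpers `scrambled` and `partition_`, whose implementations are not shown in this notebook. As a rough sketch of the splitting logic, assuming they shuffle the sample indexes and deal them into `num_folds` disjoint test folds, one could write the following. The names `scramble`, `partition` and `train_test_indexes` are hypothetical stand-ins, not the pysymbolic functions.

```python
import random


def scramble(indexes, seed=None):
    """Return a shuffled copy of the indexes (hypothetical stand-in for `scrambled`)."""
    rng = random.Random(seed)
    shuffled = list(indexes)
    rng.shuffle(shuffled)
    return shuffled


def partition(indexes, num_folds):
    """Deal the indexes into num_folds disjoint folds (stand-in for `partition_`)."""
    return [indexes[i::num_folds] for i in range(num_folds)]


def train_test_indexes(num_samples, num_folds, seed=0):
    """Build one (train, test) index pair per fold."""
    folds = partition(scramble(range(num_samples), seed), num_folds)
    splits = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(num_samples) if i not in held_out]
        splits.append((train_idx, test_idx))
    return splits


splits = train_test_indexes(num_samples=10, num_folds=5)
```

Each sample appears in exactly one test fold, and the training indexes for a fold are simply everything outside it, which mirrors the set difference used in `cross_validation`.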

Now let us start by validating the XGBoost model on the PREDICT dataset:

In [None]:
cross_validation("XGBoost", PREDICT_data, num_folds=5)

The output above is the mean AUC-ROC of the model across the 5 folds, together with its 95% confidence interval. Now let us examine the performance of the PREDICT risk score:

In [None]:
cross_validation("PREDICT", PREDICT_data, num_folds=5)

As we can see, the XGBoost model achieves a higher AUC-ROC than the PREDICT model (the figure legends later in this notebook report 0.833 versus 0.762). In the next section, we show how metamodeling can help us understand the sources of the performance gains achieved by XGBoost.
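
The mean and interval returned by each `cross_validation` call summarize the per-fold AUC-ROC values. The implementation of `mean_confidence_interval` is not shown here, so the sketch below is only an assumption about how such a summary could be computed. It uses a normal approximation; with only 5 folds, a Student's t interval would be somewhat wider. The function name `mean_with_ci` and the fold scores are made up for illustration.

```python
from statistics import NormalDist, mean, stdev


def mean_with_ci(values, confidence=0.95):
    """Return (mean, half-width) of a normal-approximation confidence interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    center = mean(values)
    standard_error = stdev(values) / n ** 0.5
    # Two-sided critical value, e.g. about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return center, z * standard_error


# Hypothetical per-fold AUC-ROC values (not results from this notebook).
fold_scores = [0.80, 0.82, 0.84, 0.86, 0.83]
center, half_width = mean_with_ci(fold_scores)
```

Comparing whether the two models' intervals overlap gives a quick, informal sense of how robust the performance gap is across folds.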

## Symbolic metamodeling for the XGBoost and PREDICT models

We start by obtaining the symbolic metamodel for the XGBoost model. We do so by first fitting the XGBoost model to the entire dataset as follows:

In [None]:
XGBmodel      = XGBClassifier(n_estimators=300)  
x             = PREDICT_data[predict_features]
minmaxscaler  = MinMaxScaler()

x             = minmaxscaler.fit_transform(x)
y             = PREDICT_data['Label']    

XGBmodel.fit(x, y)

In the cell above, we use sklearn's MinMaxScaler to put all features in the range $[0,1]$. Now to obtain the metamodel for XGBoost, we first create an instance of the **symbolic_metamodel** class (defined in **pysymbolic.algorithms.symbolic_metamodeling**) with default settings as follows:

In [None]:
XGBoost_metamodel = symbolic_metamodel(XGBmodel, x)

The arguments of the metamodel can be summarized as follows: 

- **n_dim**: the number of features in the model.
- **batch_size**, **num_iter** and **learning_rate**: the parameters of the gradient descent learning algorithm.
- **feature_types**: a list of data types for each feature. "b" corresponds to binary features whereas "c" corresponds to continuous ones. 
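
To illustrate the **feature_types** convention described above, here is a small sketch that labels a column "b" when it only takes the values 0 and 1, and "c" otherwise. The helper `infer_feature_types` is hypothetical and not part of pysymbolic, and whether the library infers these types itself is not shown in this notebook.

```python
def infer_feature_types(rows):
    """Label each column 'b' (binary) or 'c' (continuous) from its observed values."""
    num_features = len(rows[0])
    types = []
    for j in range(num_features):
        column_values = {row[j] for row in rows}
        # A column is treated as binary if it never leaves the set {0, 1}
        types.append("b" if column_values <= {0, 1} else "c")
    return types


# Toy rows after min-max scaling (made up for illustration).
toy_rows = [
    [0.25, 1, 0.10],
    [0.80, 0, 0.55],
    [0.40, 1, 1.00],
]
toy_types = infer_feature_types(toy_rows)
```

Note that this heuristic would mislabel a continuous feature that happens to take only the values 0 and 1 in the sample, so in practice the types are better specified by hand.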

Now let us fit the metamodel by calling the **fit** method. The trained model and the training features were already supplied when the object was created above:

In [None]:
XGBoost_metamodel.fit()

Finally, we are done fitting the metamodel! Now to see the symbolic expression it learned, we inspect the **approx_expression** attribute of the **symbolic_metamodel** object as follows:

In [None]:
from sympy import init_printing   # explicit import for pretty-printing

init_printing()

XGBoost_metamodel.approx_expression

Now let us check the AUC-ROC of the metamodel. This can be done by evaluating the metamodel's predictions using the **evaluate** method as follows:

In [None]:
y_metamodel = XGBoost_metamodel.evaluate(x)
roc_auc_score(y, y_metamodel)

Now let us repeat the same procedure for the PREDICT score:

In [None]:
x              = PREDICT_data[predict_features]
minmaxscaler   = MinMaxScaler()

x              = minmaxscaler.fit_transform(x)
PREDICT_model_ = PREDICT_model(minmaxscaler)     

In [None]:
PREDICT_metamodel = symbolic_metamodel(PREDICT_model_, x)

In [None]:
PREDICT_metamodel.fit()

In [None]:
PREDICT_metamodel.approx_expression

How well does the PREDICT metamodel perform?

In [None]:
y_predict_metamodel = PREDICT_metamodel.evaluate(x)
roc_auc_score(y, y_predict_metamodel)

Now to explain the differences between the XGBoost and PREDICT models, let us compare the average instancewise feature importance for every feature as assigned by the two models:

In [None]:
XGB_instancewise_scores      = XGBoost_metamodel.get_instancewise_scores(x)
XGB_instancewise_scores_mean = np.mean(np.array(XGB_instancewise_scores).reshape((-1, x.shape[1])), axis=0)
XGB_instancewise_scores_mean = XGB_instancewise_scores_mean /np.sum(XGB_instancewise_scores_mean)

In [None]:
PREDICT_instancewise_scores       = PREDICT_metamodel.get_instancewise_scores(x)
PREDICT_instancewise_scores_mean  = np.mean(np.array(PREDICT_instancewise_scores).reshape((-1, x.shape[1])), axis=0)
PREDICT_instancewise_scores_mean  = PREDICT_instancewise_scores_mean/np.sum(PREDICT_instancewise_scores_mean)
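
The two cells above average the per-patient scores over all patients and then rescale the averages so that they sum to one. The following dependency-free sketch shows that aggregation on a toy score matrix. The function name `mean_normalized_importance` and the scores are illustrative assumptions, and the sketch presumes non-negative scores.

```python
def mean_normalized_importance(instance_scores):
    """Average per-instance scores per feature, then rescale to sum to one."""
    num_instances = len(instance_scores)
    num_features = len(instance_scores[0])
    means = [
        sum(row[j] for row in instance_scores) / num_instances
        for j in range(num_features)
    ]
    total = sum(means)
    if total == 0:
        raise ValueError("cannot normalize: mean scores sum to zero")
    return [m / total for m in means]


# Two instances, three features (made-up scores).
toy_scores = [
    [0.2, 0.6, 0.2],
    [0.4, 0.4, 0.2],
]
toy_importance = mean_normalized_importance(toy_scores)
```

Normalizing to a unit sum makes the two models' importance profiles directly comparable in the bar chart, even if their raw score scales differ.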

In [None]:
from matplotlib import pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(6, 2.5))

ax1=plt.subplot(1,1,1)

ind   = np.arange(len(PREDICT_instancewise_scores_mean))  
width = 0.15                                  

rects1  = ax1.bar(ind, PREDICT_instancewise_scores_mean, width, label='PREDICT: AUC-ROC = 0.762 +/- 0.02', color='b')
rects2  = ax1.bar(ind + width, XGB_instancewise_scores_mean, width, label='XGBoost: AUC-ROC = 0.833 +/- 0.02', color='r')

ax1.set_ylabel('Feature importance')
ax1.set_xticks(ind)
ax1.set_xticklabels(tuple(predict_features), rotation=40)
ax1.legend(loc='upper left', fontsize=12, frameon=True, fancybox=True)

fig.savefig('feature_importances.pdf', dpi=200,  bbox_inches='tight')

Now we look at the differences between the metamodels with respect to the median ranks of individual features:

In [None]:
from pysymbolic.utilities.instancewise_metrics import *

XGB_ranks     = create_rank(np.array(XGB_instancewise_scores).reshape((-1, 10)), k=10)
PREDICT_ranks = create_rank(np.array(PREDICT_instancewise_scores).reshape((-1, 10)), k=10) 

XGB_median_ranks     = [np.median(XGB_ranks[:, k]) for k in range(XGB_ranks.shape[1])]
PREDICT_median_ranks = [np.median(PREDICT_ranks[:, k]) for k in range(PREDICT_ranks.shape[1])]
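
The implementation of `create_rank` is not shown in this notebook. As a sketch of the general idea, assuming rank 1 is given to the feature with the highest score for each patient, per-instance ranks and their per-feature medians could be computed as follows. The helper names and the toy scores are hypothetical, and the library's actual ranking convention may differ.

```python
from statistics import median


def rank_features(scores):
    """Rank features within one instance: 1 = highest score."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    ranks = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def median_ranks(instance_scores):
    """Median rank of each feature across all instances."""
    all_ranks = [rank_features(row) for row in instance_scores]
    num_features = len(instance_scores[0])
    return [median(r[j] for r in all_ranks) for j in range(num_features)]


# Three instances, three features (made-up scores).
toy_rank_scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.1, 0.4],
    [0.6, 0.3, 0.1],
]
toy_median_ranks = median_ranks(toy_rank_scores)
```

Median ranks are less sensitive to a few patients with extreme scores than mean importances, which is why the two views can complement each other.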

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(6, 2.5))

ax1=plt.subplot(1,1,1)

ind   = np.arange(len(XGB_median_ranks))  
width = 0.15                                  

rects1  = ax1.bar(ind, PREDICT_median_ranks, width, label='PREDICT: AUC-ROC = 0.762 +/- 0.02', color='b')
rects2  = ax1.bar(ind + width, XGB_median_ranks, width, label='XGBoost: AUC-ROC = 0.833 +/- 0.02', color='r')

ax1.set_ylabel('Median Feature Rank')
ax1.set_xticks(ind)
ax1.set_xticklabels(tuple(predict_features), rotation=40)
ax1.legend(loc='upper left', fontsize=12, frameon=True, fancybox=True)
ax1.set_ylim([0, 20])

# Use a separate file name so this figure does not overwrite the previous one
fig.savefig('median_feature_ranks.pdf', dpi=200,  bbox_inches='tight')

# References

[1] Michael W Kattan, Kenneth R Hess, Mahul B Amin, Ying Lu, Karl GM Moons, Jeffrey E Gershenwald, Phyllis A Gimotty, Justin H Guinney, Susan Halabi, Alexander J Lazar, et al. American Joint Committee on Cancer acceptance criteria for inclusion of risk models for individualized prognosis in the practice of precision medicine. CA: A Cancer Journal for Clinicians, 66(5):370–374, 2016.
