# Breast cancer prediction model

Using the Breast Cancer Wisconsin (Diagnostic) dataset, we can build a classifier that helps diagnose patients by predicting whether a tumor is malignant or benign. In this exercise, we build a simple predictive model with gradient boosting (scikit-learn's `GradientBoostingClassifier`) on the breast cancer data bundled with `scikit-learn`, and then look at evaluation metrics and model explanations.

In [None]:
# Load the required libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Import the modeling utilities and load the required dataset.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# load data
cancer = load_breast_cancer()

In [None]:
print(cancer.DESCR)
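Besides the text description, it can help to look at the data as a table. The following is a minimal sketch that wraps the arrays in a pandas `DataFrame`; the column name `diagnosis` is an illustrative choice and is not part of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Put the feature matrix into a DataFrame with named columns.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Map the integer target (0 or 1) to its class name.
# "diagnosis" is a hypothetical column name chosen for this sketch.
label_names = dict(enumerate(cancer.target_names))
df["diagnosis"] = pd.Series(cancer.target).map(label_names)

print(df.shape)
print(df["diagnosis"].value_counts())
```

In this dataset, label 0 is `malignant` and label 1 is `benign`, so the class counts show how imbalanced the two classes are before any modeling starts.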

In [None]:
# Split into training and test sets (default: 75% train, 25% test)

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
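The split above is purely random. As an optional variation (not what the rest of this notebook uses), `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both splits. The variable names with the `_s` suffix are hypothetical and only serve to avoid overwriting the split above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# Stratify on the target so both splits keep a similar class balance.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# The mean of the 0/1 labels is the fraction of benign samples (label 1).
print(f"benign fraction, full data: {cancer.target.mean():.3f}")
print(f"benign fraction, train:     {y_train_s.mean():.3f}")
print(f"benign fraction, test:      {y_test_s.mean():.3f}")
```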

In [None]:
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)

print(f"Accuracy on training set: {gbrt.score(X_train, y_train)}")
print(f"Accuracy on test set: {gbrt.score(X_test, y_test)}")

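It is always a good idea to look at evaluation metrics beyond plain accuracy. The metrics below are regression metrics, applied here to the 0/1 class labels purely as an example; for a classifier, metrics such as precision, recall, and ROC AUC are usually more informative.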

In [None]:
from sklearn.metrics import r2_score, explained_variance_score, mean_absolute_error

In [None]:
predictions = gbrt.predict(X_test)

print(f'R^2 score: {r2_score(y_true=y_test, y_pred=predictions):.2f}')
print(f'MAE score: {mean_absolute_error(y_true=y_test, y_pred=predictions):.2f}')
print(f'EVS score: {explained_variance_score(y_true=y_test, y_pred=predictions):.2f}')
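Because `gbrt` is a classifier, classification metrics are a natural complement to the scores above. The following is a minimal sketch using scikit-learn's `confusion_matrix`, `classification_report`, and `roc_auc_score`; it retrains the same model so that the block runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
predictions = gbrt.predict(X_test)

# Rows are true classes, columns are predicted classes
# (0 = malignant, 1 = benign).
cm = confusion_matrix(y_test, predictions)
print(cm)

# Per-class precision, recall, and F1 score.
print(classification_report(y_test, predictions,
                            target_names=cancer.target_names))

# ROC AUC needs a score, so use the predicted probability of label 1.
auc = roc_auc_score(y_test, gbrt.predict_proba(X_test)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```

For a diagnostic setting, the confusion matrix is worth reading directly, since a malignant case predicted as benign is usually a costlier mistake than the reverse.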

Now we are going to serialize our trained model so that it can be served through an API.
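One common way to serialize a scikit-learn model is `joblib`. The sketch below assumes that choice (the notebook does not say which serializer is used), and the file name `gbrt_breast_cancer.joblib` is hypothetical. The model is retrained here so that the block runs on its own.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Hypothetical file name; a temporary directory keeps the sketch tidy.
model_path = os.path.join(tempfile.mkdtemp(), "gbrt_breast_cancer.joblib")
joblib.dump(gbrt, model_path)

# Reload the model, as a serving process would, and check that the
# restored copy makes the same predictions as the original.
restored = joblib.load(model_path)
same = np.array_equal(restored.predict(X_test), gbrt.predict(X_test))
print(f"Restored model matches original: {same}")
```

A file saved this way should only be loaded from a trusted source and with a compatible scikit-learn version, since `joblib` relies on pickling.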

## Let's use the explainable ML package

In [None]:
from azureml.explain.model.tabular_explainer import TabularExplainer

Since the data is tabular, we can use the tabular explainer.

In [None]:
tabular_explainer = TabularExplainer(gbrt, X_train,
                                     features=cancer.feature_names)

Explain overall model predictions (global explanations)

In [None]:
global_explanation = tabular_explainer.explain_global(X_test)

In [None]:
# Sorted SHAP values 
print('ranked global importance values: {} \n\n'.format(global_explanation.get_ranked_global_values()))
# Corresponding feature names
print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))
# feature ranks (based on original order of features)
print('global importance rank: {}'.format(global_explanation.global_importance_rank))

In [None]:
dict(zip(global_explanation.get_ranked_global_names(), global_explanation.get_ranked_global_values()))
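As a sanity check on a global ranking like the one above, it can be useful to compare it with a different technique. The sketch below uses scikit-learn's `permutation_importance`, which is not SHAP: it measures how much the test accuracy drops when one feature's values are shuffled. The choice of `n_repeats=10` and of showing the top five features is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature several times and record the drop in accuracy.
result = permutation_importance(
    gbrt, X_test, y_test, n_repeats=10, random_state=0)

# Rank features from largest to smallest mean accuracy drop.
order = np.argsort(result.importances_mean)[::-1]
for i in order[:5]:
    print(f"{cancer.feature_names[i]}: "
          f"{result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

The two rankings need not agree exactly, since they measure different things, but large disagreements among the top features are worth investigating.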

Explain overall model predictions as a collection of local (instance-level) explanations

In [None]:
# Feature SHAP values for all features and all data points passed to explain_global (X_test)
print('local importance values: {}'.format(global_explanation.local_importance_values))