# About Problem
Preventing heart disease is important. Good data-driven systems for predicting heart disease can improve the entire research and prevention process, making sure that more people can live healthy lives.

In the United States, the Centers for Disease Control and Prevention is a good resource for information about heart disease. According to their [website](https://www.cdc.gov/heartdisease/facts.htm):

* About 610,000 people die of heart disease in the United States every year–that’s 1 in every 4 deaths.
* Heart disease is the leading cause of death for both men and women. More than half of the deaths due to heart disease in 2009 were in men.
* Coronary heart disease (CHD) is the most common type of heart disease, killing over 370,000 people annually.
* Every year about 735,000 Americans have a heart attack. Of these, 525,000 are a first heart attack and 210,000 happen in people who have already had a heart attack.
* Heart disease is the leading cause of death for people of most ethnicities in the United States, including African Americans, Hispanics, and whites. For American Indians or Alaska Natives and Asians or Pacific Islanders, heart disease is second only to cancer.

For more information, you can look at the [website](https://www.cdc.gov/heartdisease/prevention.htm) of the Centers for Disease Control and Prevention: 

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('BLw62AhW_Kc', width=700, height=400)

![](https://i.ibb.co/ypG4Cq3/CVD-infographic-1.jpg)

# About dataset

The dataset contains the following features:
1. **age:** age in years
2. **sex:** (1 = male; 0 = female)
3. **cp:** chest pain type
4. **trestbps:** resting blood pressure (in mm Hg on admission to the hospital)
5. **chol:** serum cholesterol in mg/dl
6. **fbs:** (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. **restecg:** resting electrocardiographic results
8. **thalach:** maximum heart rate achieved
9. **exang:** exercise induced angina (1 = yes; 0 = no)
10. **oldpeak:** ST depression induced by exercise relative to rest
11. **slope:** the slope of the peak exercise ST segment
12. **ca:** number of major vessels (0-3) colored by fluoroscopy
13. **thal:** 0 = normal; 1 = fixed defect; 2 = reversible defect
14. **target:** 1 = heart disease present; 0 = no heart disease

# Problem Description
Our goal is to predict the binary class **target**, which represents whether or not a patient has heart disease:

* **0** represents no heart disease present
* **1** represents heart disease present

# Importing Essential Libraries

In [None]:
!pip install lofo-importance

In [None]:
from lofo import LOFOImportance, Dataset, plot_importance
import shap 
import warnings  
warnings.filterwarnings('ignore')
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots
from sklearn.metrics import accuracy_score,recall_score,precision_score,roc_auc_score,f1_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import os
import seaborn as sns
print(os.listdir("../input"))
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.model_selection import KFold
# Any results you write to the current directory are saved as output.

In [None]:
#Reading dataset 
dt=pd.read_csv('../input/heart.csv')

Rename the columns so that the features are easier to understand.

In [None]:
dt.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
       'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']

Changing features into corresponding categories for better interpretation

In [None]:


# Map the integer codes to readable labels; codes not listed are left unchanged.
# Assigning the whole column avoids pandas' chained-assignment pitfalls.
dt['chest_pain_type'] = dt['chest_pain_type'].replace(
    {1: 'typical angina', 2: 'atypical angina', 3: 'non-anginal pain', 4: 'asymptomatic'})

dt['rest_ecg'] = dt['rest_ecg'].replace(
    {0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'})

dt['st_slope'] = dt['st_slope'].replace({1: 'upsloping', 2: 'flat', 3: 'downsloping'})

dt['thalassemia'] = dt['thalassemia'].replace(
    {1: 'normal', 2: 'fixed defect', 3: 'reversable defect'})

In [None]:
# Checking the data types of all features
dt.dtypes

# EDA

In [None]:
dt.head()

In [None]:
dt.describe()

In [None]:
dt.info()

The data types of all the features look reasonable. Next, check for missing entries.

# Check for Missing Values

In [None]:
## null count analysis
import missingno as msno
p=msno.bar(dt)

Great! There are no missing values in the dataset.

# Countplot
In this step we will check the class distribution.

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
dt['target'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('target')
ax[0].set_ylabel('')
sns.countplot(x='target',data=dt,ax=ax[1])  # pass the column by keyword; recent seaborn versions reject positional data arguments
ax[1].set_title('target')
plt.show()

Patients without heart disease make up about 45% of the data and patients with heart disease about 55%, so the dataset is reasonably balanced.

# Histogram of numerical features

In [None]:
dataset2=dt.drop(['target'],axis=1)
p = dataset2.hist(figsize = (12,8))

There are some outliers in cholesterol, resting blood pressure and maximum heart rate achieved. We can examine them further with a **boxplot**.

# Boxplot (Outlier Detection)

In [None]:
sns.boxplot(data=dt,x="target", y="cholesterol");

In [None]:
sns.boxplot(data=dt,x="target", y="max_heart_rate_achieved");

In [None]:
sns.boxplot(data=dt,x="target", y="resting_blood_pressure");

The boxplots above clearly show the outliers.

In [None]:
# Numeric columns inspected for outliers in the boxplots above
num_cols = ['cholesterol', 'resting_blood_pressure', 'max_heart_rate_achieved']
dt_o = dt[num_cols]

Q1 = dt_o.quantile(0.25)
Q3 = dt_o.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
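# Keep only rows that lie within 1.5 * IQR of the quartiles in all three columns
num_cols = ['cholesterol', 'resting_blood_pressure', 'max_heart_rate_achieved']
lower = Q1[num_cols] - 1.5 * IQR[num_cols]
upper = Q3[num_cols] + 1.5 * IQR[num_cols]
within = ((dt[num_cols] >= lower) & (dt[num_cols] <= upper)).all(axis=1)
dt_clean = dt.loc[within]

dt_clean.shape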


In [None]:
sns.boxplot(data=dt_clean,x="target", y="resting_blood_pressure");

# Distplot

In [None]:
sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(dt['age']);

In [None]:
sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(dt['cholesterol']);

In [None]:
sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(dt['max_heart_rate_achieved']);

# Swarm plot

In [None]:
sns.swarmplot(data=dt,x="target", y="age");

# Correlation plot

In [None]:
plt.figure(figsize=(12,10))  # 12 x 10 inch figure
# Correlation is only defined for the numeric columns, so select them first
p=sns.heatmap(dataset2.select_dtypes('number').corr(), annot=True,cmap ='RdYlGn')

It seems independent variables are not much correlated with one another.

# One hot encoding

In [None]:
dt1=pd.get_dummies(dt,drop_first=True)

Here we convert all the categorical columns to numerical columns and set the `drop_first` parameter to `True` to avoid the dummy variable trap. To read more about the dummy variable trap, see this [blog](https://medium.com/@saurav9786/dummy-variable-trap-c6d4a387f10a). There is also a very good explanation in this Quora question [here](https://www.quora.com/When-do-I-fall-in-the-dummy-variable-trap).
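To see why one dummy column is dropped, here is a minimal sketch on a made-up column (the values below are illustrative and not read from `heart.csv`). With every category kept, the dummy columns of each row always sum to 1, so any one of them can be reconstructed from the others.

```python
import pandas as pd

# Toy column (hypothetical values) with three categories
toy = pd.DataFrame({"st_slope": ["upsloping", "flat", "downsloping", "flat"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category dropped

# With all dummies kept, every row sums to 1, so each column is a linear
# combination of the others (perfect multicollinearity)
print(full.sum(axis=1).tolist())
print(list(full.columns))
print(list(reduced.columns))
```

The dropped category becomes the implicit baseline: a row with zeros in all remaining dummy columns belongs to it.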

In [None]:
dt1.head()

# Prepare Features & Targets
First, separate the data into the independent variables (features) and the dependent variable (target).

1. X==>>Feature
2. y==>>Target

# Normalizing the data

I have used Z-score normalization.
Z-scores are linearly transformed data values having a mean of zero and a standard deviation of 1.
Z-scores are also known as standardized scores; they are scores (or data values) that have been given a common standard.

If the population mean and population standard deviation are known, the standard score of a raw score x is calculated as

![zscore](https://i.ibb.co/6wGCbbQ/z-score-formula.jpg)
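As a quick check of the formula, the sketch below standardizes a small made-up column by hand and compares the result with `StandardScaler`. The sample values are illustrative only; note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only (not taken from heart.csv)
ages = np.array([[29.0], [45.0], [54.0], [61.0], [77.0]])

# Manual z-score: subtract the mean, divide by the population standard deviation
z_manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# StandardScaler applies the same transformation column by column
z_sklearn = StandardScaler().fit_transform(ages)

print(np.allclose(z_manual, z_sklearn))
print(round(float(z_manual.mean()), 6), round(float(z_manual.std()), 6))
```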

In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Take the column names from the encoded frame instead of hard-coding them,
# so the names always match the columns that get_dummies actually produced
feature_df = dt1.drop(["target"], axis=1)
X = pd.DataFrame(sc_X.fit_transform(feature_df), columns=feature_df.columns)

In [None]:
y=dt['target']

# Train test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20,stratify=y, random_state=5)

**Stratify property in train test split**
This stratify parameter makes a split so that the proportion of values in the sample produced will be the same as the proportion of values provided to parameter stratify.
For example, if variable y is a binary categorical variable with values 0 and 1 and there are 25% of zeros and 75% of ones, stratify=y will make sure that your random split has 25% of 0's and 75% of 1's.
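The 25%/75% example above can be reproduced with a small sketch. The labels below are made up for illustration; only `train_test_split` and its `stratify` parameter come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 25% zeros and 75% ones, as in the example above
y_demo = np.array([0] * 25 + [1] * 75)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=5
)

# The share of ones in each part matches the share in the full label vector
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```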

# Model Building
Now comes the most interesting part: model building. In this step we will build different machine learning models, starting from our base model, logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(random_state = 0)
logit.fit(X_train, y_train)

# Predicting Test Set
y_pred = logit.predict(X_test)

In [None]:
from sklearn.model_selection import cross_val_score
# ROC AUC is computed from the predicted probability of the positive class
roc = roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1])
# Accuracy on the held-out test set (cross-validating on X_test would refit on test data)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

results = pd.DataFrame([['Base - Logistic Regression', acc,prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])

results

# RandomForest
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
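To illustrate what "drawn with replacement" means, here is a minimal NumPy sketch of a single bootstrap sample. The training-set size is a made-up number; the point is that a sample of the same size as the original contains only about 1 - 1/e ≈ 63% of the distinct rows, because some rows are drawn more than once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000  # hypothetical training-set size

# A bootstrap sample has as many rows as the original, drawn with replacement
boot_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(boot_idx).size / n_rows

# Rows that were never drawn are "out of bag" for that tree
print(f"distinct rows in the sample: {unique_fraction:.1%}")
```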

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,accuracy_score
random_forest = RandomForestClassifier(n_estimators=500,criterion='entropy',max_depth=5).fit(X_train, y_train)
y_pred_random = random_forest.predict(X_test)

In [None]:
# Score the random forest with its own predictions (y_pred belongs to the logistic regression)
roc = roc_auc_score(y_test, random_forest.predict_proba(X_test)[:, 1])
acc = accuracy_score(y_test, y_pred_random)
prec = precision_score(y_test, y_pred_random)
rec = recall_score(y_test, y_pred_random)
f1 = f1_score(y_test, y_pred_random)

model_results = pd.DataFrame([['Random Forest', acc,prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
results = pd.concat([results, model_results], ignore_index=True)
results

# Explaining Model
In this step we will try to explain the model by applying different techniques and algorithms such as 

1. [Permutation Importance (Eli5)](https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html)<br>
2. [Partial dependency plotting (pdpbox)](https://www.kaggle.com/dansbecker/partial-dependence-plots) <br>
3. [SHapley Additive exPlanations (SHAP values)](https://shap.readthedocs.io/en/latest/) <br>
4. [LOFO Importance](https://github.com/aerdem4/lofo-importance) <br>
5. [Alibi](https://github.com/SeldonIO/alibi) <br>
6. [LIME](https://github.com/marcotcr/lime)<br>
7. [pyBreakdown](https://github.com/MI2DataLab/pyBreakDown)

## 1. Permutation Importance

eli5 provides a way to compute feature importances for any black-box estimator by measuring how score decreases when a feature is not available; the method is also known as **“permutation importance” or “Mean Decrease Accuracy (MDA)”**.
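The idea can be sketched by hand in a few lines. This is not eli5's implementation, only an illustration of the technique on synthetic data: the data set, the model settings and the helper name `permutation_importance_sketch` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; with shuffle=False the informative features are columns 0 and 1
X_syn, y_syn = make_classification(n_samples=400, n_features=5, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_a, y_a)

def permutation_importance_sketch(model, X, y, n_repeats=5, seed=0):
    """Mean and std of the accuracy drop when a single column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature/target link
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)

means, stds = permutation_importance_sketch(model, X_b, y_b)
for j in np.argsort(means)[::-1]:
    print(f"feature {j}: {means[j]:.3f} ± {stds[j]:.3f}")
```

The mean plays the role of the first number in eli5's table and the standard deviation the role of the number after the ±.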


In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(random_forest, random_state=123).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X.columns.tolist(),top=24)

## Interpretations
The features towards the top, shaded darker, are the most important, and those towards the bottom, shaded lighter, matter least. The first number in each row shows how much model performance decreased after randomly shuffling that feature (in this case, using accuracy as the performance metric). Because shuffling is random, the process is repeated with multiple shuffles, and the number after the ± measures how performance varied from one reshuffling to the next.

Here, thalassemia_reversable defect, resting_blood_pressure, thalassemia_fixed defect, chest_pain_type_atypical angina and st_slope_flat are the top 5 most important features.



Here is how to explain an individual prediction with the [eli5](https://eli5.readthedocs.io/en/latest/) library:

In [None]:
from eli5 import explain_prediction


eli5.show_prediction(random_forest, X_test.iloc[50], 
                     feature_names=X_test.columns.tolist(), show_feature_values=True)

## Interpretations

To make random forest predictions more interpretable, every prediction of the model can be presented as a sum of feature contributions (plus the bias), showing how the features lead to a particular prediction. In the plot above, ELI5 does this by showing the weight of each feature together with its actual value, indicating how influential it was in the final prediction across all trees. For this individual prediction, the three most influential features after the bias appear to be chest_pain_type_non-anginal pain, chest_pain_type_atypical angina and sex_male.

## 2. Partial Dependence Plots (PDP)
While permutation importance shows which variables most affect predictions, partial dependence plots show how a feature affects predictions. [Credit](https://www.kaggle.com/dansbecker/partial-plots).

### 2.a PDP Isolation plot

In [None]:
features = [c for c in X_test.columns]

In [None]:
from pdpbox import pdp, get_dataset, info_plots

pdp_thal = pdp.pdp_isolate(model=random_forest, dataset=X_test, model_features=features, feature='thalassemia_reversable defect')

# plot it
pdp.pdp_plot(pdp_thal, 'thalassemia_reversable defect')

plt.show()

In [None]:
pdp_resting_bp = pdp.pdp_isolate(model=random_forest, dataset=X_test, model_features=features, feature='resting_blood_pressure')

# plot it
pdp.pdp_plot(pdp_resting_bp, 'resting_blood_pressure')

plt.show()

## 2.b Univariate ICE plot
ICE (Individual Conditional Expectation) plots are similar to PD plots but offer a more detailed view of the behavior of clusters of similar observations around the PD plot's average curve. The ICE algorithm gives the user insight into the several variants of conditional relationships estimated by the black box.
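The relationship between the two plot types can be sketched without pdpbox: each ICE curve follows one observation while a single feature is swept over a grid, and the PD curve is the average of those curves. The data, the model and the helper name `ice_curves` below are assumptions made for illustration, not pdpbox internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_syn, y_syn = make_classification(n_samples=300, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_syn, y_syn)

def ice_curves(model, X, feature_idx, grid):
    """Return one row per observation and one column per grid value."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value                  # force the feature to the grid value
        curves[:, k] = model.predict_proba(X_mod)[:, 1]
    return curves

# Sweep feature 0 over ten of its own quantiles
grid = np.quantile(X_syn[:, 0], np.linspace(0.05, 0.95, 10))
ice = ice_curves(model, X_syn, feature_idx=0, grid=grid)
pdp_curve = ice.mean(axis=0)   # the PD curve is the average of the ICE curves

print(ice.shape)
print(pdp_curve.round(3))
```

Plotting the rows of `ice` as thin lines and `pdp_curve` as a thick line gives the same kind of picture as the clustered plots in the cells that follow.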

In [None]:
def plot_pdp(model, df, feature, cluster_flag=False, nb_clusters=None, lines_flag=False):
    
    # Create the data that we will plot
    pdp_goals = pdp.pdp_isolate(model=model, dataset=df, model_features=df.columns.tolist(), feature=feature)

    # plot it
    pdp.pdp_plot(pdp_goals, feature, cluster=cluster_flag, n_cluster_centers=nb_clusters, plot_lines=lines_flag)
    plt.show()

In [None]:
plot_pdp(random_forest, X_train, 'thalassemia_reversable defect', cluster_flag=True, nb_clusters=24, lines_flag=True)

In [None]:
plot_pdp(random_forest, X_train, 'resting_blood_pressure', cluster_flag=True, nb_clusters=24, lines_flag=True)

In [None]:
plot_pdp(random_forest, X_train, 'age', cluster_flag=True, nb_clusters=24, lines_flag=True)

In [None]:
plot_pdp(random_forest, X_train, 'st_slope_flat', cluster_flag=True, nb_clusters=24, lines_flag=True)



## 2.c. PDP Interact Plot

In [None]:
inter1  =  pdp.pdp_interact(model=random_forest, dataset=X_test, model_features=features, features=['thalassemia_reversable defect', 'resting_blood_pressure'])

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['thalassemia_reversable defect', 'resting_blood_pressure'], plot_type='contour')
plt.show()



In [None]:
inter1  =  pdp.pdp_interact(model=random_forest, dataset=X_test, model_features=features, features=['age', 'resting_blood_pressure'])

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['age', 'resting_blood_pressure'], plot_type='contour')
plt.show()

In [None]:
inter1  =  pdp.pdp_interact(model=random_forest, dataset=X_test, model_features=features, features=['age', 'st_slope_flat'])

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['age', 'st_slope_flat'], plot_type='contour')
plt.show()

## 2.d Actual Prediction Plot

In [None]:
fig, axes, summary_df = info_plots.actual_plot_interact(
    model=random_forest, X=X_test, features=['age', 'thalassemia_reversable defect'], feature_names=['age', 'thalassemia_reversable defect']
)

In [None]:
fig, axes, summary_df = info_plots.actual_plot_interact(
    model=random_forest, X=X_test, features=['age', 'resting_blood_pressure'], feature_names=['age', 'resting_blood_pressure']
)

## Interpretations
The plots above are actual prediction plots from the pdpbox library. The bubble size is of less importance here, since it only reflects the number of observations in each cell. The most important insight comes from the color of the bubble: darker bubbles mean a higher predicted probability of heart disease, while lighter bubbles indicate healthier predictions. This is a powerful tool because it shows how much two variables of our choice jointly affect the dependent variable.

## 3. SHapley Additive exPlanations (SHAP Values)

SHAP values can explain the output of any machine learning model, but for complex ensemble models the computation can be slow. SHAP has C++ implementations supporting XGBoost, LightGBM, CatBoost, and scikit-learn tree models.

SHAP (SHapley Additive exPlanations) assigns each feature an importance value for a particular prediction. Its novel components include: the identification of a new class of additive feature importance measures, and theoretical results showing there is a unique solution in this class with a set of desirable properties. Typically, SHAP values try to explain the output of a model (function) as a sum of the effects of each feature being introduced into a conditional expectation. Importantly, for non-linear functions the order in which features are introduced matters. The SHAP values result from averaging over all possible orderings. Proofs from game theory show this is the only possible consistent approach.

An intuitive way to understand the Shapley value is the following: The feature values enter a room in random order. All feature values in the room participate in the game (= contribute to the prediction). The Shapley value $\phi_{ij}$ is the average marginal contribution of feature value $x_{ij}$ by joining whatever features already entered the room before, i.e.

![](https://i.ibb.co/m0dSM81/shap1.png)

The following figure from the paper, [Consistent Individualized Feature Attribution for Tree Ensembles](https://arxiv.org/pdf/1802.03888.pdf) summarizes this in a nice way!

![](https://i.ibb.co/YZsYsjd/shap2.png)
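The "average over all possible orderings" idea can be made concrete with a brute-force sketch. This is not how the shap library computes values for tree models (it uses a much faster algorithm); the toy value function and the helper name `shapley_values` are assumptions made for illustration.

```python
from itertools import permutations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = [0.0] * n_features
    for order in permutations(range(n_features)):
        present = set()
        for j in order:
            before = value_fn(present)
            present.add(j)                          # feature j enters the room
            phi[j] += value_fn(present) - before    # its marginal contribution
    n_orderings = factorial(n_features)
    return [p / n_orderings for p in phi]

# Hypothetical "model": features 0 and 1 only matter together, feature 2 adds a fixed amount
def toy_value(present):
    value = 0.0
    if 0 in present and 1 in present:
        value += 4.0
    if 2 in present:
        value += 1.0
    return value

phi = shapley_values(toy_value, 3)
print(phi)
# The contributions add up to the difference between the full and the empty prediction
print(sum(phi), toy_value({0, 1, 2}) - toy_value(set()))
```

The interacting features 0 and 1 split their joint effect equally, and the values sum to the total effect. This brute-force version enumerates n! orderings, so it is only practical for a handful of features.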

In [None]:
row_to_show = 17
data_for_prediction = X_test.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)


random_forest.predict_proba(data_for_prediction_array)

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(random_forest)

# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.initjs()
shap.force_plot(explainer.expected_value[0], shap_values[0], data_for_prediction)

In [None]:
row_to_show = 9
data_for_prediction = X_test.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)


random_forest.predict_proba(data_for_prediction_array)

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(random_forest)

# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.initjs()
shap.force_plot(explainer.expected_value[0], shap_values[0], data_for_prediction)

## Interpretations
The graphs above were generated by applying SHAP to instances 17 and 9 of our test set. In the plot, the prediction is 0.80, whereas the base value is 0.4517. Feature values that increase the prediction are shown in pink, and their visual size shows the magnitude of the feature's effect. Feature values that decrease the prediction are shown in blue. The biggest impact comes from sex_male being 1.435, while the chest_pain_type_non-anginal pain value has the effect of decreasing the prediction.

## SHAP Feature Importance Plot
This plot applies the global mean(|Tree SHAP|) method to the heart disease prediction model. The x-axis is essentially the average magnitude of the change in model output when a feature is “hidden” from the model (for this scikit-learn random forest, the output is the predicted class probability). See the [github repo](https://github.com/slundberg/shap) for details, but “hidden” means integrating the variable out of the model. Since the impact of hiding a feature changes depending on what other features are also hidden, Shapley values are used to enforce consistency and accuracy.

In [None]:
explainer = shap.TreeExplainer(random_forest)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values[1], X_test, plot_type="bar")

## Summary Plot

In [None]:
shap.summary_plot(shap_values[1], X_test)

## Interpretations
SHAP summary plot of the 19-feature random forest heart disease model. The higher the SHAP value of a feature, the higher the predicted probability of heart disease in this model. Every patient in the test set is run through the model and a dot is created for each feature attribution value, so one patient gets one dot on each feature’s line. Dots are colored by the feature’s value for that patient and pile up vertically to show density.
In the plot above, **chest pain type non-anginal pain** is the most important risk factor for heart disease patients. Lower values of **chest pain type non-anginal pain** push the prediction towards heart disease, whereas for patients without heart disease its contribution is a mixture of higher and lower values. Higher values of **thalassemia_fixed defect** increase the risk of heart disease, whereas lower values decrease it.


## SHAP dependence plot

In [None]:
explainer = shap.TreeExplainer(random_forest)

# calculate shap values. This is what we will plot.
shap_values = explainer.shap_values(X_test)

# make plot.
shap.dependence_plot('chest_pain_type_non-anginal pain', shap_values[0], X_test, interaction_index="age")

In [None]:
shap.dependence_plot('chest_pain_type_non-anginal pain', shap_values[0], X_test, interaction_index="resting_blood_pressure")

In [None]:
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1],plot_cmap="DrDb")

## 4. LOFO Importance

LOFO (Leave One Feature Out) Importance calculates the importances of a set of features based on a metric of choice, for a model of choice, by iteratively removing each feature from the set, and evaluating the performance of the model, with a validation scheme of choice, based on the chosen metric.

LOFO first evaluates the performance of the model with all the input features included, then iteratively removes one feature at a time, retrains the model, and evaluates its performance on a validation set. The mean and standard deviation (across the folds) of the importance of each feature is then reported.

If a model is not passed as an argument to LOFO Importance, it will run LightGBM as a default model.

### Advantages of LOFO Importance
LOFO has several advantages compared to other importance types:

1. It does not favor granular features<br>
2. It generalises well to unseen test sets<br>
3. It is model agnostic<br>
4. It gives negative importance to features that hurt performance upon inclusion<br>
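The procedure described above can be sketched with plain scikit-learn. This is an illustration of the idea, not the lofo-importance implementation: the synthetic data, the logistic regression estimator and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_syn, y_syn = make_classification(n_samples=400, n_features=5, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
cv_demo = KFold(n_splits=4, shuffle=True, random_state=0)  # same folds on every call
estimator = LogisticRegression(max_iter=1000)

# Per-fold score with every feature included
base_scores = cross_val_score(estimator, X_syn, y_syn, cv=cv_demo, scoring="roc_auc")

lofo = {}
for j in range(X_syn.shape[1]):
    X_without = np.delete(X_syn, j, axis=1)      # leave feature j out and retrain
    scores = cross_val_score(estimator, X_without, y_syn, cv=cv_demo, scoring="roc_auc")
    diff = base_scores - scores                  # positive: the feature helped
    lofo[j] = (diff.mean(), diff.std())

for j, (mean, std) in sorted(lofo.items(), key=lambda kv: -kv[1][0]):
    print(f"feature {j}: {mean:+.4f} ± {std:.4f}")
```

A negative mean indicates that the model scored better without the feature, which is the "negative importance" mentioned in point 4.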

In [None]:
# extract a sample of the data
sample_df = dt1.sample(frac=0.5, random_state=0)

In [None]:
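# random_state only has an effect when shuffle=True; recent scikit-learn versions raise an error otherwise
cv = KFold(n_splits=4, shuffle=True, random_state=0)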

In [None]:
# define the binary target and the features
dataset = Dataset(df=sample_df, target="target", features=[col for col in dt1.columns if col != 'target'])

In [None]:
# define the validation scheme and scorer. The default model is LightGBM
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="roc_auc")


In [None]:
# get the mean and standard deviation of the importances in pandas format
importance_df = lofo_imp.get_importance()

In [None]:
# plot the means and standard deviations of the importances
plot_importance(importance_df, figsize=(12, 20))

In [None]:
!pip install alibi

## 5. Alibi
Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The initial focus of the library is on black-box, instance-based model explanations.

## Method Used : Anchors
The anchor algorithm is based on the paper [Anchors: High-Precision Model-Agnostic Explanations](https://homes.cs.washington.edu/~marcotcr/aaai18.pdf) by Ribeiro et al. and builds on the [open source code](https://github.com/marcotcr/anchor) from the paper’s first author.

The algorithm provides model-agnostic (black box) and human interpretable explanations suitable for classification models applied to images, text and tabular data. The idea behind anchors is to explain the behaviour of complex models with high-precision rules called anchors. These anchors are locally sufficient conditions to ensure a certain prediction with a high degree of confidence.

### Goals

1. Provide high quality reference implementations of black-box ML model explanation algorithms

2. Define a consistent API for interpretable ML methods

3. Support multiple use cases (e.g. tabular, text and image data classification, regression)

4. Implement the latest model explanation, concept drift, algorithmic bias detection and other ML model monitoring and interpretation methods

In [None]:
from alibi.explainers import AnchorTabular

In [None]:
predict_fn = lambda x: random_forest.predict_proba(x)

In [None]:
explainer = AnchorTabular(predict_fn, features)

In [None]:
explainer.fit(X_train.values, disc_perc=[25, 50, 75])

In [None]:
class_names=['Healthy','Disease']

idx = 3
explanation = explainer.explain(X_test.values[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

In [None]:
idx = 19
explanation = explainer.explain(X_test.values[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

## 6. LIME

**Local Interpretable Model-agnostic Explanations (LIME)**. The overall goal of LIME is to identify an interpretable model over the interpretable representation that is locally faithful to the classifier.

The local explanation method LIME interprets an individual prediction by learning an interpretable model locally. The intuition behind LIME is that it samples instances both in the vicinity of and far away from the interpretable representation of the original input. LIME then takes the interpretable representation of these sample points, determines their predictions and builds a weighted linear model by minimizing the loss and complexity. The samples are weighted by their distance from the original point: the weights decrease as the points get farther away. The explanation is locally faithful, which means it represents the model's predictions for instances in the vicinity. This is illustrated in the figure below.

![](https://i.ibb.co/9Hxnb5g/lime1.png)
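The sampling, weighting and fitting steps can be sketched in plain NumPy and scikit-learn. This is a simplified illustration, not the lime library's implementation (the library samples and discretizes features differently): the Gaussian perturbations, the kernel width, the Ridge surrogate and the helper name `lime_sketch` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X_syn, y_syn = make_classification(n_samples=400, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_syn, y_syn)

def lime_sketch(predict_proba, x, scale, n_samples=2000, kernel_width=1.0, seed=0):
    """Fit a distance-weighted linear surrogate around the instance x."""
    rng = np.random.default_rng(seed)
    # 1. Sample perturbed points around x
    Z = x + rng.normal(size=(n_samples, x.size)) * scale
    # 2. Ask the black box for its predictions on those points
    target = predict_proba(Z)[:, 1]
    # 3. Weight samples so that points closer to x count more
    dist = np.linalg.norm((Z - x) / scale, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Fit a simple, interpretable model on the weighted samples
    surrogate = Ridge(alpha=1.0).fit(Z, target, sample_weight=weights)
    return surrogate.coef_, surrogate.intercept_

coefs, intercept = lime_sketch(black_box.predict_proba, X_syn[0], X_syn.std(axis=0))
print(coefs.round(3), round(float(intercept), 3))
```

The sign and size of each coefficient indicate how the corresponding feature moves the prediction in the neighborhood of the explained instance.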

By "explaining a prediction", we mean presenting textual or visual artifacts that provide qualitative understanding of the relationship between the instance’s components (e.g. words in text, patches in an image) and the model’s prediction. We argue that explaining predictions is an important aspect in getting humans to trust and use machine learning effectively, if the explanations are faithful and intelligible.

The process of explaining individual predictions is illustrated in the figure below. It is clear that a doctor is much better positioned to make a decision with the help of a model if intelligible explanations are provided. In this case, an explanation is a small list of symptoms with relative weights: symptoms that either contribute to the prediction (in green) or are evidence against it (in red). Humans usually have prior knowledge about the application domain, which they can use to accept (trust) or reject a prediction if they understand the reasoning behind it.

![](https://i.ibb.co/Yc5jQhc/LIME.png)

In [None]:
import lime
import lime.lime_tabular

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=features, class_names=class_names, discretize_continuous=True)

In [None]:
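i = 12

# Positional lookup: y_test keeps the original row labels after the split
print('Actual Label:', y_test.iloc[i])
# Prediction of the model being explained (the random forest)
print('Predicted Label:', y_pred_random[i])

exp = explainer.explain_instance(X_test.iloc[i].values, random_forest.predict_proba)
exp.show_in_notebook()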

## 7. pyBreakdown

In [None]:
!pip install git+https://github.com/bondyra/pyBreakDown.git

In [None]:
from pyBreakDown.explainer import Explainer
from pyBreakDown.explanation import Explanation

In [None]:
#make explainer object
exp = Explainer(clf=random_forest, data=X_train, colnames=features)

In [None]:
#make explanation object that contains all information
explanation = exp.explain(observation=X.iloc[302,:],direction="up")

In [None]:
#get information in text form
explanation.text()

In [None]:
#customized text form
explanation.text(fwidth=40, contwidth=40, cumulwidth = 40, digits=4)

In [None]:
explanation.visualize()

In [None]:
#customize height, width and dpi of plot
explanation.visualize(figsize=(8,5),dpi=100)

In [None]:
explanation = exp.explain(observation=X.iloc[302,:],direction="up",useIntercept=True)  # baseline==intercept
explanation.visualize(figsize=(8,5),dpi=100)

# Conclusion
I have tried to implement and demonstrate some really awesome machine learning explanation libraries. Some other great libraries that I was not able to run on Kaggle due to dependency issues are:
1. Microsoft's Interpret-ml
2. Skater
3. fairml
4. Contrastive Explanation
5. Skope Rules

Apart from these, there are other intuitive libraries that can help interpret machine learning models and help medical practitioners trust black-box models.

# References
1. [List of AWESOME Machine learning Libraries](https://github.com/jphall663/awesome-machine-learning-interpretability)

2. [The Importance of Machine Learning Interpretability by Dipanjan Sarkar](https://towardsdatascience.com/human-interpretable-machine-learning-part-1-the-need-and-importance-of-model-interpretation-2ed758f5f476)

3. [Interpretable Machine Learning by Christoph Molnar](https://christophm.github.io/interpretable-ml-book/)