<h1>Let's Start</h1>

You trained a model and it performs well. The next question is why it performs that way: which features does it depend on? Treating machine learning models as black boxes is not the way to go, so we need methods that give us insight into our models. In this kernel we will look at a few models and several methods for finding out what their predictions depend on.      
Why would anyone want these insights? They help build trust in the model, they show us what to tweak to improve results, and above all it is fun to know how the model is doing such a great job :).


<h1>What we are going to discuss in this kernel :- </h1>

* **Permutation Importance**   
* **Partial Dependence Plots**
* **SHAP (SHapley Additive exPlanations)**
* **LIME** 
* **CNN**
* **References**

In [None]:
import warnings
warnings.filterwarnings('ignore')

<h1>Why this dataset ?</h1>   

I selected this [**Heart Disease UCI**](https://www.kaggle.com/ronitf/heart-disease-uci) dataset because it makes it easy to see why we need to look inside a model and explain how it works. If our model predicts that a person has heart disease, we should be able to explain why. To answer that "why", we need to look inside the model and find out which factors drive the result. Now that it is clear why model explainability is important, let's dive in.

In [None]:
#importing libraries
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report
from sklearn import tree
import graphviz
import random 
random.seed(3)

In [None]:
heart_data = pd.read_csv('../input/heart-disease-uci/heart.csv')
heart_data.head()

<h1>About our dataset :- </h1>

* **age:** The person's age in years    
* **sex:** The person's sex (1 = male, 0 = female)
* **cp:** The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
* **trestbps:** The person's resting blood pressure (mm Hg on admission to the hospital)
* **chol:** The person's cholesterol measurement in mg/dl
* **fbs:** The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
* **restecg:** Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
*  **thalach:** The person's maximum heart rate achieved
*  **exang:** Exercise induced angina (1 = yes; 0 = no)
*  **oldpeak:** ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot.)
*  **slope:** the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
*  **ca:** The number of major vessels (0-3)
*  **thal:** A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
*  **target:** Heart disease (0 = no, 1 = yes)

[Source](https://www.kaggle.com/tentotheminus9/what-causes-heart-disease-explaining-the-model)


In [None]:
X_train = heart_data.drop('target',axis = 1)
y_train = heart_data['target']
X_train,X_test,y_train,y_test = train_test_split(X_train,y_train,random_state = 3,test_size = 0.2)
clf_randomForest = RandomForestClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(X_train,y_train)

In [None]:
print(accuracy_score(y_test,clf_randomForest.predict(X_test)))
print(classification_report(y_test,clf_randomForest.predict(X_test)))

Our model's performance seems good, but improving its results is not our focus here. We want to know why the model makes a certain decision, which features contribute to that decision, and how much each of them contributes.

In [None]:
from IPython.display import Image
from subprocess import call
tree_graph = tree.export_graphviz(clf_randomForest.estimators_[0], out_file='tree.dot', feature_names=X_train.columns.tolist(),proportion = True,rounded = True,filled = True,precision = 2)
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
Image(filename = 'tree.png')

<h1>Permutation Importance</h1>   

Keeping it short :-
* Used to check how important a feature is to the model when predicting the target value.
* It is computed after the model has been trained, and validation data should be used.       

<h2>Steps</h2>
1. Check the accuracy of the model on the validation data.
2. Randomly shuffle the values of one feature and put the shuffled column back.
3. Check the accuracy of the model again on the shuffled data.
4. Compare it with the accuracy before shuffling.
    * The decrease in accuracy shows the importance of that feature.
5. Undo the shuffling so that the feature is back in its original order.
6. Repeat the process above for every feature.
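
The steps above can be sketched without any library. This is a minimal, dependency-free illustration and not what `eli5` does internally; `toy_model`, `toy_rows` and `toy_labels` are hypothetical stand-ins for a trained model and its validation data.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop caused by shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)  # step 1
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # step 2
        # Build a shuffled copy; `rows` itself is never modified,
        # so there is nothing to undo afterwards (step 5).
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        # Steps 3 and 4: accuracy on shuffled data compared with the baseline.
        importances.append(baseline - accuracy(model, shuffled, labels))
    return baseline, importances


# Hypothetical toy model: predicts 1 when feature 0 exceeds 5, ignores feature 1.
def toy_model(row):
    return int(row[0] > 5)


toy_rows = [[i, (i * 7) % 10] for i in range(10)]
toy_labels = [int(i > 5) for i in range(10)]

baseline, importances = permutation_importance(
    toy_model, toy_rows, toy_labels, n_features=2
)
print(baseline, importances)
```

Because the toy model never reads feature 1, shuffling that column cannot change any prediction, so its importance is exactly zero.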

<h2>Advantages</h2>   
* It does not require retraining  the model.
* It automatically takes into account all interactions with other features. By permuting the feature you also destroy the interaction effects with other features. This means that the permutation feature importance takes into account both the main feature effect and the interaction effects on model performance.
* It provides highly compressed, global insight into the model’s behavior.

<h2>Disadvantages</h2>   
* It is linked to the error of the model. It is therefore not useful when you want to check how robust your model's output is when you manipulate a feature, because in that case you are not interested in how much model performance decreases when you permute a feature.
* You need access to the true outcome. If someone only provides you with the model and unlabeled data but not the true outcome you cannot compute the permutation feature importance.

* The permutation of features produces unlikely data instances when two or more features are correlated. When they are positively correlated (like the height and weight of a person) and we shuffle one of the features, we create new instances that are unlikely or even physically impossible (a 2 meter person weighing 30 kg, for example), yet we use these new instances to measure the importance. Therefore, if features are correlated, the permutation feature importance can be biased by unrealistic data instances.

* It depends on shuffling the feature, which adds randomness to the measurement. When the permutation is repeated, the results might vary greatly. Repeating the permutation and averaging the importance measures over repetitions stabilizes the measure, but increases the time of computation.







In [None]:
#permutation importance
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(clf_randomForest, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

Values towards the top are the most important, and here the table shows that **thalach** (person's maximum heart rate achieved) and **ca** (number of major vessels) are the two most important features. Go through [this](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4468223/); they have explained it well, and our result seems reasonable in light of it.    

<h2>How to interpret the values :-</h2>  
The first number in each row shows how much model performance decreased after a random shuffle. There is some randomness in the exact performance change from shuffling a column, so we measure it by repeating the process with multiple shuffles. The number after the ± measures how performance varied from one reshuffling to the next. Some values are negative: this happens when the predictions on the shuffled data happen to be more accurate than on the real data. That is usually a matter of chance and is more common with small datasets like ours.
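
To make the "value ± spread" reading concrete, here is a small sketch that repeats the shuffle several times and summarises the accuracy drops with a mean and a standard deviation. It is an assumption-laden illustration: `summarize`, `accuracy_drop` and the toy model are hypothetical, and the exact statistic that `eli5` prints after the ± may be scaled differently.

```python
import random
import statistics


def accuracy_drop(model, rows, labels, feature, rng):
    """Baseline accuracy minus accuracy after shuffling one feature column."""
    def acc(data):
        return sum(model(r) == y for r, y in zip(data, labels)) / len(data)

    column = [r[feature] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return acc(rows) - acc(shuffled)


def summarize(model, rows, labels, feature, n_repeats=5, seed=1):
    """Mean and standard deviation of the accuracy drop over repeated shuffles."""
    rng = random.Random(seed)
    drops = [accuracy_drop(model, rows, labels, feature, rng) for _ in range(n_repeats)]
    return statistics.mean(drops), statistics.stdev(drops)


# Hypothetical toy model: uses feature 0 only.
def toy_model(row):
    return int(row[0] > 5)


toy_rows = [[i, (i * 7) % 10] for i in range(10)]
toy_labels = [int(i > 5) for i in range(10)]

mean_used, spread_used = summarize(toy_model, toy_rows, toy_labels, feature=0)
mean_ignored, spread_ignored = summarize(toy_model, toy_rows, toy_labels, feature=1)
print(f"feature 0: {mean_used:.3f} ± {spread_used:.3f}")
print(f"feature 1: {mean_ignored:.3f} ± {spread_ignored:.3f}")
```

The ignored feature always reports zero with zero spread. A model that is not perfect on its validation data can also produce small negative means, purely by chance.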


<h1>Partial Dependence Plot</h1>   

In short :-    
* Used to check *how* a feature affects the model.
* It is also done after training the model and validation data should be used.

<h2>Process</h2> 
1. We look at one feature at a time.
1. Select a single row of data.
1. Set that feature to a lower value, keep every other feature unchanged, and note the model's prediction for the row.
1. Then set the feature to a higher value and note the prediction again, repeating this over a range of values.
1. Repeat the steps above for multiple rows and average the predictions at each feature value.
1. Plot the average predicted outcome on the vertical axis against the feature value on the horizontal axis.
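
The averaging idea behind a partial dependence curve can be sketched in a few lines. This is a simplified illustration, not the `pdpbox` implementation; `toy_predict` and `toy_rows` are hypothetical, and the grid values are chosen arbitrarily.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in every row."""
    curve = []
    for value in grid:
        # Overwrite the chosen feature in every row, leaving the others untouched.
        modified = [row[:feature] + [value] + row[feature + 1:] for row in rows]
        curve.append(sum(predict(r) for r in modified) / len(modified))
    return curve


# Hypothetical model: the score rises with feature 0 and falls with feature 1.
def toy_predict(row):
    return 0.1 * row[0] - 0.05 * row[1]


toy_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pdp_curve = partial_dependence(toy_predict, toy_rows, feature=0, grid=[0.0, 10.0, 20.0])
print(pdp_curve)
```

Each point on the curve is an average over all rows, so the curve shows how the average prediction moves as feature 0 changes while the other features keep their observed values.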

<h2>Advantages</h2>    

* The computation of partial dependence plots is intuitive: The partial dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value. In my experience, lay people usually understand the idea of PDPs quickly.

* In the uncorrelated case, the interpretation is clear: The partial dependence plot shows how the average prediction in your dataset changes when the j-th feature is changed.

* Partial dependence plots are easy to implement.

<h2>Disadvantages</h2>

* The realistic maximum number of features in a partial dependence function is two. This is not the fault of PDPs, but of the 2-dimensional representation (paper or screen) and also of our inability to imagine more than 3 dimensions.    
* The assumption of independence is the biggest issue with PD plots. It is assumed that the feature(s) for which the partial dependence is computed are not correlated with other features.

* Some PD plots do not show the feature distribution. Omitting the distribution can be misleading, because you might overinterpret regions with almost no data. 





In [None]:
#partial dependence plot

from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots
# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=clf_randomForest, dataset=X_test, model_features=X_test.columns.tolist(), feature='thalach')
# plot it
pdp.pdp_plot(pdp_goals, 'thalach')
plt.show()

In the graph above, the y-axis shows the change in prediction and the x-axis shows the values of **thalach**. The blue shaded area indicates the level of confidence. A maximum heart rate up to about 140 has very little effect on a person's chance of having heart disease. Around 160 the chance of having heart disease increases, and after that it stays about the same.

In [None]:
#partial dependence plot

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=clf_randomForest, dataset=X_test, model_features=X_test.columns.tolist(), feature='ca')
# plot it
pdp.pdp_plot(pdp_goals, 'ca')
plt.show()

This time, as the number of major vessels (**ca**) increases, the chance of having heart disease decreases. Also note that having one, two, or three major vessels has a similar effect. So something is better than nothing :)  

Let's draw 2D Partial dependence plot and check interactions between **ca** and **thalach**.

In [None]:
#partial dependence plot
features_to_plot = ['ca', 'thalach']
inter1  =  pdp.pdp_interact(model=clf_randomForest, dataset=X_test, model_features=X_test.columns.tolist(), features=features_to_plot)
pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()

We can see that having no major vessels (ca = 0) together with a maximum heart rate (thalach) above 150 greatly increases a person's chance of having heart disease, whereas having 2 major vessels and a maximum heart rate below 110 is the safest combination.

<h1>SHAP</h1>   

Used to see the impact of each feature on a prediction: what made the model predict a certain value, and how much each feature contributed to that decision.

<h2>Advantages</h2>   
* The difference between the prediction and the average prediction is fairly distributed among the feature values of the instance
* The Shapley value allows contrastive explanations. Instead of comparing a prediction to the average prediction of the entire dataset, you could compare it to a subset or even to a single data point.
* The Shapley value is the only explanation method with a solid theory. The axioms – efficiency, symmetry, dummy, additivity – give the explanation a reasonable foundation.

<h2>Disadvantages</h2>       
* The Shapley value requires a lot of computing time. In 99.9% of real-world problems, only the approximate solution is feasible. An exact computation of the Shapley value is computationally expensive because there are 2^k possible coalitions of the feature values, and the “absence” of a feature has to be simulated by drawing random instances, which increases the variance of the Shapley value estimate.
* The Shapley value returns a simple value per feature, but no prediction model like LIME. This means it cannot be used to make statements about changes in prediction for changes in the input, such as: “If I were to earn €300 more a year, my credit score would increase by 5 points.”  
* The Shapley value method suffers from inclusion of unrealistic data instances when features are correlated.
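
To see where the 2^k cost comes from, and what "fairly distributed" means, here is a minimal sketch that computes exact Shapley values by enumerating every coalition. It rests on simplifying assumptions: `toy_model`, `instance` and `baseline` are hypothetical, and an absent feature is replaced by a fixed baseline value instead of being simulated by drawing random instances. The `shap` package uses much faster, model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (
                factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            )
            for coalition in combinations(others, size):
                members = set(coalition)
                gain = value_fn(members | {i}) - value_fn(members)
                phi[i] += weight * gain
    return phi


# Hypothetical instance to explain and a hypothetical all-zero baseline.
instance = [2.0, 3.0, 1.0]
baseline = [0.0, 0.0, 0.0]


def toy_model(x):
    # Feature 0 acts alone; features 1 and 2 only matter together.
    return 2 * x[0] + x[1] * x[2]


def value_fn(coalition):
    """Model output when only the features in `coalition` take the instance's values."""
    x = [instance[j] if j in coalition else baseline[j] for j in range(len(instance))]
    return toy_model(x)


phi = shapley_values(value_fn, n_features=3)
print(phi)
print(sum(phi), toy_model(instance) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction (the efficiency axiom), and the interaction between features 1 and 2 is split equally between them (symmetry).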




In [None]:
#shap
row_to_show = 4
data_for_prediction = X_test.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
clf_randomForest.predict_proba(data_for_prediction_array)

In [None]:
import shap  # package used to calculate Shap values
# Create object that can calculate shap values
explainer = shap.TreeExplainer(clf_randomForest)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)

In [None]:
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)

The output value is 0.65, which is higher than the base value of 0.5062, so this person has a high chance of heart disease. We can see that major vessels (**ca**) = 0, **oldpeak** (ST depression) = 0, maximum heart rate (**thalach**) = 186 (which is > 120), **age** = 51 (seems reasonable), **thal** (blood disorder) = 2 (normal) and others are increasing the chance of heart disease. On the other side, **exang** (exercise induced angina) = 1 (true), **cp** (chest pain) = 0 (normal) and **restecg** (resting electrocardiographic measurement) = 0 (normal) are decreasing the chance of heart disease, which seems reasonable. [Check this.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4468223/)

<h2>Dependence Contribution Plots</h2>

In [None]:
#Dependence Contribution Plots
shap_values = explainer.shap_values(X_train)
shap.dependence_plot('thalach',shap_values[1], X_train, interaction_index="ca")

In [None]:
#summary plot
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values[1], X_train,auto_size_plot=False)

In the plot above, the x-axis shows the SHAP value, which is the feature's impact on the model output. The color of each dot shows the value of that feature for one row. The plot is easy to interpret: high values of **ca** and **oldpeak** decrease the chance of having heart disease, while high values of **thalach** increase it.

<h2>Aggregated force_plot</h2>

In [None]:

shap.force_plot(explainer.expected_value[0],shap_values[0] , X_train)

<h1>LIME</h1>  
LIME is model-agnostic, meaning that it can be applied to any machine learning model. The technique attempts to understand the model by perturbing the input of data samples and understanding how the predictions change. LIME provides local model interpretability. LIME modifies a single data sample by tweaking the feature values and observes the resulting impact on the output. Often, this is also related to what humans are interested in when observing the output of a model. You can read more [here](https://towardsdatascience.com/understanding-model-predictions-with-lime-a582fdff3a3b).
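
The core idea (perturb one sample, weight the perturbed points by proximity, fit a simple linear model locally) can be sketched as follows. This is a simplified illustration and not the `lime` package's implementation: it perturbs on a fixed grid instead of sampling randomly, uses a plain exponential kernel, and solves ordinary weighted least squares. `black_box`, `local_surrogate` and `solve_linear_system` are hypothetical names.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def local_surrogate(predict, instance, offsets, kernel_width=0.75):
    """Fit a proximity-weighted linear model around `instance`.

    Returns [intercept, coefficient for feature 0, coefficient for feature 1, ...].
    """
    k = len(instance)
    xtwx = [[0.0] * (k + 1) for _ in range(k + 1)]
    xtwy = [0.0] * (k + 1)
    for delta in offsets:
        sample = [v + d for v, d in zip(instance, delta)]
        # Closer perturbations get a larger weight.
        weight = math.exp(-sum(d * d for d in delta) / kernel_width ** 2)
        design = [1.0] + sample
        target = predict(sample)
        for i in range(k + 1):
            xtwy[i] += weight * design[i] * target
            for j in range(k + 1):
                xtwx[i][j] += weight * design[i] * design[j]
    return solve_linear_system(xtwx, xtwy)


# Hypothetical black-box model: non-linear in feature 0, linear in feature 1.
def black_box(x):
    return x[0] ** 2 + 3 * x[1]


steps = [-0.5, -0.25, 0.0, 0.25, 0.5]
grid_offsets = [(d0, d1) for d0 in steps for d1 in steps]
intercept, coef_0, coef_1 = local_surrogate(black_box, [2.0, 1.0], grid_offsets)
print(coef_0, coef_1)
```

Around the point (2, 1) the surrogate recovers a slope of about 4 for feature 0 (the local slope of x squared at 2) and about 3 for feature 1. The explanation is only valid near that point, which is what "local" interpretability means.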

<h2>Advantages</h2>
* Even if you replace the underlying machine learning model, you can still use the same local, interpretable model for explanation.   
* LIME is one of the few methods that works for tabular data, text and images.    
* LIME is implemented in Python (lime and Skater) and R (lime package and iml package) and is very easy to use.   
* The explanations created with local surrogate models can use other features than the original model. This can be a big advantage over other methods, especially if the original features cannot be interpreted. 

<h2>Disadvantages</h2>  
* The explanations can be unstable. Instability means that it is difficult to trust the explanations, and you should be very critical.
* For each application you have to try different kernel settings and see for yourself if the explanations make sense.   
* The complexity of the explanation model has to be defined in advance. 


In [None]:
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.astype(int).values,mode='classification',training_labels=y_train,feature_names=X_train.columns.tolist(),class_names=['true','false'])
# Explain a single training row, selected by position (iloc) so the lookup
# does not depend on which index labels ended up in the training split
i = 1
exp = explainer.explain_instance(X_train.iloc[i].astype(int).values, clf_randomForest.predict_proba, num_features=13)

In [None]:
exp.show_in_notebook(show_table=True)

In [None]:
from lime import submodular_pick
# SP-LIME returns explanations on a sample set to provide a non-redundant global view of the original model's decision boundary
sp_obj = submodular_pick.SubmodularPick(explainer, X_train.values, clf_randomForest.predict_proba, num_features=13,num_exps_desired=3)
[exp.show_in_notebook() for exp in sp_obj.sp_explanations]

The plots above are self-explanatory: we can easily see which feature values push the prediction towards which class.

<h1>Convolutional Neural Networks</h1>  

Deep learning models are often considered black boxes because it seems difficult to know why they make a certain decision, which features they depend on most, and how they use those features. But this is not the case: we can also look inside a deep learning model. Here we will look into a convolutional neural network and check which features it relies on and which factors affect it.  

There are various techniques for interpretation, but we will discuss the following :- 

* <h3>Visualizing intermediate convnet outputs (intermediate activations)</h3>Useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters.
* <h3>Visualizing convnet filters</h3>Useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to.
* <h3>Visualizing heatmaps of class activation in an image</h3>Useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in images.     
You can read more in this [book](https://www.oreilly.com/library/view/deep-learning-with/9781617294433/).

<h2>Visualizing intermediate convnet outputs (intermediate activations)</h2>   

Visualizing intermediate activations consists of displaying the feature maps that are output by various convolution and pooling layers in a network, given a certain input. This gives a view into how an input is decomposed into the different filters learned by the network.
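
For readers new to the term, a feature map is what you get by sliding one filter over the input. Here is a tiny, library-free sketch of that operation; `toy_image` and `vertical_edge` are hypothetical, and a real convolution layer applies many learned filters over several channels at once.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and apply ReLU to each response."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(max(total, 0))  # ReLU keeps only positive responses
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 image: dark left half, bright right half.
toy_image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 1], [-1, 1]]

toy_feature_map = conv2d_valid(toy_image, vertical_edge)
for row in toy_feature_map:
    print(row)
```

The feature map lights up only in the middle column, where the dark-to-bright edge sits. Plotting the activations of a trained network shows the same kind of response for each learned filter.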

In [None]:
import os
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense, Activation, BatchNormalization
from keras.models import load_model
import numpy as np
import matplotlib.pyplot as plt
from keras import models
from keras.applications import VGG16
from keras import backend as K
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input, decode_predictions
import cv2

In [None]:

model = Sequential()

model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

<h3>Our Image</h3>

In [None]:
#Visualize Intermediate activation 

model.load_weights('../input/catndog/model.h5')
img_path = '../input/cnn-image/dog.jpeg'

#Preprocesses the image into a 4D tensor
img = image.load_img(img_path, target_size=(128, 128))
img_tensor = image.img_to_array(img)
img_tensor = np.expand_dims(img_tensor, axis=0)
img_tensor /= 255.


plt.imshow(img_tensor[0])
plt.show()

<h2>Visualizations of intermediate activations</h2>

In [None]:
#Visualize Intermediate activation 

layer_outputs = [layer.output for layer in model.layers[:8]]
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)

activations = activation_model.predict(img_tensor)

first_layer_activation = activations[0]
print(first_layer_activation.shape)

In [None]:
#Visualize Intermediate activation
plt.matshow(first_layer_activation[0, :, :, 4], cmap='viridis')

In [None]:
#Visualize Intermediate activation
plt.matshow(first_layer_activation[0, :, :, 7], cmap='viridis')

In [None]:
#Visualize Intermediate activation

layer_names = []
for layer in model.layers[:8]:
    layer_names.append(layer.name)

images_per_row = 16

# Now let's display our feature maps
for layer_name, layer_activation in zip(layer_names, activations):
    # This is the number of features in the feature map
    n_features = layer_activation.shape[-1]

    # The feature map has shape (1, size, size, n_features)
    size = layer_activation.shape[1]

    # We will tile the activation channels in this matrix
    n_cols = n_features // images_per_row
    display_grid = np.zeros((size * n_cols, images_per_row * size))

    # We'll tile each filter into this big horizontal grid
    for col in range(n_cols):
        for row in range(images_per_row):
            channel_image = layer_activation[0,:, :,col * images_per_row + row]
            # Post-process the feature to make it visually palatable
            channel_image -= channel_image.mean()
            # Small epsilon avoids dividing by zero when a channel is constant
            channel_image /= (channel_image.std() + 1e-5)
            channel_image *= 64
            channel_image += 128
            channel_image = np.clip(channel_image, 0, 255).astype('uint8')
            display_grid[col * size : (col + 1) * size,row * size : (row + 1) * size] = channel_image

    # Display the grid
    scale = 1. / size
    plt.figure(figsize=(scale * display_grid.shape[1],
                        scale * display_grid.shape[0]))
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect='auto', cmap='viridis')
    
plt.show()


<h2>Visualizing convnet filters</h2>        

Another easy way to inspect the filters learned by convnets is to display the visual pattern that each filter is meant to respond to. This can be done with gradient ascent in input space: applying gradient ascent to the value of the input image of a convnet so as to maximize the response of a specific filter, starting from a blank input image. The resulting input image is one that the chosen filter is maximally responsive to.
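
The update rule is easier to see on a toy case first. The sketch below assumes a purely linear "filter" whose response is a weighted sum of the pixels, so the gradient with respect to the input is just the weight vector. `toy_filter` and `gradient_ascent` are hypothetical, and a real convnet needs automatic differentiation because its filters are not linear in the input.

```python
import math


def filter_response(weights, pixels):
    """Response of a linear filter: weighted sum of the input pixels."""
    return sum(w * p for w, p in zip(weights, pixels))


def gradient_ascent(weights, pixels, steps=40, step_size=1.0):
    """Repeatedly move the input in the direction that increases the response."""
    pixels = list(pixels)
    history = [filter_response(weights, pixels)]
    for _ in range(steps):
        # For a linear filter, d(response)/d(pixel_i) is simply weights[i].
        grads = list(weights)
        # Normalize by the RMS of the gradient so every step has a similar size.
        rms = math.sqrt(sum(g * g for g in grads) / len(grads)) + 1e-5
        pixels = [p + step_size * g / rms for p, g in zip(pixels, grads)]
        history.append(filter_response(weights, pixels))
    return pixels, history


toy_filter = [1.0, -1.0, 0.5, 0.0]
blank_input = [0.0, 0.0, 0.0, 0.0]
final_pixels, responses = gradient_ascent(toy_filter, blank_input)
print(final_pixels)
print(responses[0], responses[-1])
```

The input drifts towards the pattern the filter "likes": pixels with positive weights grow, pixels with negative weights shrink, and the pixel with zero weight stays unchanged.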

In [None]:
model = VGG16(weights='imagenet',include_top=False)
layer_name = 'block3_conv1'
filter_index = 0
layer_output = model.get_layer(layer_name).output
loss = K.mean(layer_output[:, :, :, filter_index])

In [None]:
# The call to `gradients` returns a list of tensors (of size 1 in this case)
# hence we only keep the first element -- which is a tensor.
grads = K.gradients(loss, model.input)[0]

In [None]:
# We add 1e-5 before dividing so as to avoid accidentally dividing by 0.
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)

In [None]:
iterate = K.function([model.input], [loss, grads])
loss_value, grads_value = iterate([np.zeros((1, 150, 150, 3))])

In [None]:
# We start from a gray image with some noise
input_img_data = np.random.random((1, 150, 150, 3)) * 20 + 128.
# Run gradient ascent for 40 steps
step = 1.  # this is the magnitude of each gradient update
for i in range(40):
    # Compute the loss value and gradient value
    loss_value, grads_value = iterate([input_img_data])
    # Here we adjust the input image in the direction that maximizes the loss
    input_img_data += grads_value * step

In [None]:
def deprocess_image(x):
    # normalize tensor: center on 0., ensure std is 0.1
    x -= x.mean()
    x /= (x.std() + 1e-5)
    x *= 0.1

    # clip to [0, 1]
    x += 0.5
    x = np.clip(x, 0, 1)

    # convert to RGB array
    x *= 255
    x = np.clip(x, 0, 255).astype('uint8')
    return x

In [None]:
def generate_pattern(layer_name, filter_index, size=150):
    # Build a loss function that maximizes the activation
    # of the nth filter of the layer considered.
    layer_output = model.get_layer(layer_name).output
    loss = K.mean(layer_output[:, :, :, filter_index])
    # Compute the gradient of the input picture wrt this loss
    grads = K.gradients(loss, model.input)[0]
    # Normalization trick: we normalize the gradient
    grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)
    # This function returns the loss and grads given the input picture
    iterate = K.function([model.input], [loss, grads])
    # We start from a gray image with some noise
    input_img_data = np.random.random((1, size, size, 3)) * 20 + 128.
    # Run gradient ascent for 40 steps
    step = 1.
    for i in range(40):
        loss_value, grads_value = iterate([input_img_data])
        input_img_data += grads_value * step
        
    img = input_img_data[0]
    return deprocess_image(img)

<h2>Different Filters learned by Model</h2>

In [None]:
plt.imshow(generate_pattern('block3_conv1', 1))
plt.show()

In [None]:
for layer_name in ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1']:
    size = 64
    margin = 5

    # This is an empty (black) image where we will store our results.
    results = np.zeros((8 * size + 7 * margin, 8 * size + 7 * margin, 3))

    for i in range(8):  # iterate over the rows of our results grid
        for j in range(8):  # iterate over the columns of our results grid
            # Generate the pattern for filter `i + (j * 8)` in `layer_name`
            filter_img = generate_pattern(layer_name, i + (j * 8), size=size)

            # Put the result in the square `(i, j)` of the results grid
            horizontal_start = i * size + i * margin
            horizontal_end = horizontal_start + size
            vertical_start = j * size + j * margin
            vertical_end = vertical_start + size
            results[horizontal_start: horizontal_end, vertical_start: vertical_end, :] = filter_img

    # Display the results grid
    plt.figure(figsize=(20, 20))
    plt.imshow(np.array(results,np.int32))
    plt.show()

<h2>Visualizing heatmaps of class activation in an image</h2>

A "class activation" heatmap is a 2D grid of scores associated with a specific output class, computed for every location in an input image, indicating how important each location is with respect to the class considered.
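
The arithmetic that turns feature maps into a heatmap is short enough to show on a toy example. In this sketch the feature maps and channel weights are made-up numbers, and the data is laid out as `[channel][row][column]` for readability. In a real network the weights are the pooled gradients of the class score, and Keras stores feature maps as (height, width, channels).

```python
def class_activation_heatmap(feature_maps, channel_weights):
    """Weight each channel, average over channels, keep positives, scale to [0, 1]."""
    n_channels = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, channel_weights):
        for r in range(height):
            for c in range(width):
                heatmap[r][c] += weight * fmap[r][c] / n_channels
    # Keep only locations with a positive influence on the class.
    heatmap = [[max(v, 0.0) for v in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Hypothetical 2x2 feature maps for two channels.
toy_feature_maps = [
    [[1.0, 0.0], [0.0, 2.0]],  # channel 0
    [[0.0, 3.0], [1.0, 0.0]],  # channel 1
]
# Hypothetical importance of each channel for the class of interest.
toy_channel_weights = [0.5, -1.0]

toy_heatmap = class_activation_heatmap(toy_feature_maps, toy_channel_weights)
print(toy_heatmap)
```

Locations that are strong in a positively weighted channel end up bright, while locations dominated by a negatively weighted channel are clipped to zero.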

In [None]:
#heatmaps of class activation
K.clear_session()
model = VGG16(weights='imagenet')

In [None]:
#heatmaps of class activation
img_path = '../input/cnn-image/elephant.jpeg'
# `img` is a PIL image of size 224x224
img = image.load_img(img_path, target_size=(224, 224))
# `x` is a float32 Numpy array of shape (224, 224, 3)
x = image.img_to_array(img)
# We add a dimension to transform our array into a "batch"
# of size (1, 224, 224, 3)
x = np.expand_dims(x, axis=0)
# Finally we preprocess the batch
# (this does channel-wise color normalization)
x = preprocess_input(x)
preds = model.predict(x)
print('Predicted:', decode_predictions(preds, top=3)[0])
np.argmax(preds[0])
# This is the "african elephant" entry in the prediction vector
african_elephant_output = model.output[:, 386]
# This is the output feature map of the `block5_conv3` layer,
# the last convolutional layer in VGG16
last_conv_layer = model.get_layer('block5_conv3')
# This is the gradient of the "african elephant" class with regard to
# the output feature map of `block5_conv3`
grads = K.gradients(african_elephant_output, last_conv_layer.output)[0]
# This is a vector of shape (512,), where each entry
# is the mean intensity of the gradient over a specific feature map channel
pooled_grads = K.mean(grads, axis=(0, 1, 2))
# This function allows us to access the values of the quantities we just defined:
# `pooled_grads` and the output feature map of `block5_conv3`,
# given a sample image
iterate = K.function([model.input], [pooled_grads, last_conv_layer.output[0]])
# These are the values of these two quantities, as Numpy arrays,
# given our sample image of two elephants
pooled_grads_value, conv_layer_output_value = iterate([x])
# We multiply each channel in the feature map array
# by "how important this channel is" with regard to the elephant class
for i in range(512):
    conv_layer_output_value[:, :, i] *= pooled_grads_value[i]
# The channel-wise mean of the resulting feature map
# is our heatmap of class activation
heatmap = np.mean(conv_layer_output_value, axis=-1)
heatmap = np.maximum(heatmap, 0)
heatmap /= np.max(heatmap)
plt.matshow(heatmap)
plt.show()

In [None]:
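# We use cv2 to load the original image (cv2 reads images in BGR channel order)
img = cv2.imread(img_path)
# Convert to RGB so that matplotlib displays the true colors
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img)
plt.show()
# We resize the heatmap to have the same size as the original image
heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))
# We convert the heatmap to 8-bit values
heatmap = np.uint8(255 * heatmap)
# We apply the color map (the result is BGR, so convert it to RGB as well)
heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)
# 0.6 here is a heatmap intensity factor
superimposed_img = heatmap * 0.6 + img
# Clip to the valid 0-255 range before displaying
plt.imshow(np.clip(superimposed_img, 0, 255).astype('uint8'))
plt.show()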





This visualisation technique answers two important questions:

* Why did the network think this image contained an African elephant?
* Where is the African elephant located in the picture?

We can see that the model identifies the shape of the elephant and focuses most on its ears; maybe it thinks the ears are the main feature of an African elephant.



<h2>Content that helped me to put together this kernel aka References :)</h2>

https://www.oreilly.com/library/view/deep-learning-with/9781617294433/     
https://christophm.github.io/interpretable-ml-book/     
https://www.kaggle.com/learn/machine-learning-explainability       
And Our dear friend [Google](https://www.google.com/) :)


<h2>Thanks</h2> [Dan Becker](https://www.kaggle.com/dansbecker) for creating such a good course on [Machine learning explainability](https://www.kaggle.com/learn/machine-learning-explainability).   
<h2>Thanks</h2> [Kaggle](https://www.kaggle.com/) for this challenge, because I learned a lot in the process of creating this kernel. The Kaggle community is a great place to learn data science through a practical approach. 

<h2>Please let me know your views on this kernel. All kinds of suggestions are welcome. If you like this kernel, I would appreciate if you could upvote it.  </h2> 