![](https://www.incimages.com/uploaded_files/image/970x450/getty_584203352_200013282000928014_380350.jpg)

**Note:**  
Kindly upvote the kernel if you find it useful. Suggestions are always welcome. Let me know your thoughts in the comments.

**Reference:**  
[Analyzing Machine Learning Models with Yellowbrick by 
Parul Pandey](https://heartbeat.fritz.ai/analyzing-machine-learning-models-with-yellowbrick-37795733f3ee)

**Context**  
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.  

**Attribute Information:**  
* Age
* Sex
* Chest pain type (4 values)
* Resting blood pressure
* Serum cholesterol in mg/dl
* Fasting blood sugar > 120 mg/dl
* Resting electrocardiographic results (values 0,1,2)
* Maximum heart rate achieved
* Exercise induced angina
* Oldpeak = ST depression induced by exercise relative to rest
* The slope of the peak exercise ST segment
* Number of major vessels (0-3) colored by fluoroscopy
* Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values. One file has been "processed", that one containing the Cleveland database. All four unprocessed files also exist in this directory.  

**Acknowledgements - Creators:**  
* Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.  
* University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.  
* University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.  
* V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.  
* Donor: David W. Aha (aha '@' ics.uci.edu) (714) 856-8779  

**Inspiration**  
Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
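
As a minimal sketch of that binarization, the snippet below collapses a five-level goal field into a presence/absence label. The `goal` values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical "goal" values: 0 = no disease, 1-4 = disease present.
goal = pd.Series([0, 2, 1, 0, 4, 3, 0], name="goal")

# Collapse the five levels into a binary label:
# 1 if any disease is present (values 1-4), 0 otherwise.
target = (goal > 0).astype(int)

print(target.tolist())
```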

**About Yellowbrick**  
The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process and assists in diagnosing problems throughout the machine learning workflow. In short, model selection is the search for a model, described by a triple of features, an algorithm, and hyperparameters, that best fits the data.  

Yellowbrick is an open-source Python project that extends the scikit-learn API with visual analysis and diagnostic tools. The Yellowbrick API also wraps matplotlib to create interactive data explorations.  

It extends the scikit-learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the scikit-learn pipeline process, providing visuals throughout the transformation of high-dimensional data.

**Advantages**  
Yellowbrick isn’t a replacement for other data visualization libraries but helps to achieve the following:  
* Model Visualization  
* Data visualization for machine learning  
* Visual Diagnostics  
* Visual Steering  

For more information on Yellowbrick, see the link below:  
[Yellowbrick](https://www.scikit-yb.org/en/latest/)

**Global Options**

In [None]:
import warnings
warnings.filterwarnings('ignore')

**Listing the files**

In [None]:
!ls ../input/

**Reading the Dataset**

In [None]:
import pandas as pd
ht_dt = pd.read_csv("../input/heart.csv", header = 'infer')

**Viewing the shape of the dataset**

In [None]:
print("The heart dataset has {0} rows and {1} columns".format(ht_dt.shape[0], ht_dt.shape[1]))

**Sample of the dataset**

In [None]:
ht_dt.head()  

**Specifying the feature and target column**

In [None]:
feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
                 'exang', 'oldpeak', 'slope', 'ca', 'thal']

target_name = 'target'

X = ht_dt[feature_names]
y = ht_dt[target_name]

print("Features of the dataset are {0}".format(X.columns.values))

**Feature Analysis in Yellowbrick**  
The Yellowbrick feature analysis visualizers focus on aggregation, optimization, and other techniques to give overviews of the data.  

The feature analysis visualizers include:
* Rank Features
* Manifold visualization
* Radviz Visualizer
* Feature Importance
* Parallel coordinates
* Recursive feature elimination
* PCA Projection
* Joint Plots

 

**Rank Features**  
Rank Features ranks single features or pairs of features to detect relationships such as covariance. Ranking can be 1D or 2D depending on the number of features considered at a time.

**Rank 1D**  
Rank 1D utilizes a ranking algorithm that takes into account only a single feature at a time. By default, the Shapiro-Wilk algorithm is used to assess the normality of the distribution of instances with respect to the feature.
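
To give a feel for the score being ranked, here is a small sketch that computes the Shapiro-Wilk statistic directly with `scipy.stats.shapiro` on two synthetic features. The data and variable names are assumptions made for illustration; values of the statistic closer to 1 indicate a distribution closer to normal.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic features (not from the heart dataset): one roughly normal, one skewed.
rng = np.random.default_rng(0)
normal_feature = rng.normal(loc=50, scale=10, size=300)
skewed_feature = rng.exponential(scale=10, size=300)

# shapiro returns (statistic, p-value); only the statistic is kept here.
w_normal, _ = shapiro(normal_feature)
w_skewed, _ = shapiro(skewed_feature)

print("normal feature W:", round(w_normal, 3))
print("skewed feature W:", round(w_skewed, 3))
```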

In [None]:
from yellowbrick.features import Rank1D
# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(features=feature_names, algorithm='shapiro')

# Fit the data to the visualizer
visualizer.fit(X, y)  

# Transform the data
visualizer.transform(X) 

# visualise
visualizer.poof()                   

**Rank 2D**  
Rank 2D, on the other hand, performs pairwise feature analysis and displays the result as a heatmap. Both covariance and the Pearson correlation score are available as ranking algorithms; in current Yellowbrick releases Pearson is the default.
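
The difference between the two scores can be seen on a toy example. The sketch below uses plain NumPy with made-up values, not the Yellowbrick internals: covariance depends on the units of each feature, while Pearson correlation is scale-free.

```python
import numpy as np

# Made-up values for two features that increase together.
age = np.array([40.0, 50.0, 60.0, 70.0])
chol = np.array([180.0, 220.0, 240.0, 300.0])

data = np.column_stack([age, chol])

# Pairwise matrices, with columns treated as variables.
cov = np.cov(data, rowvar=False)
pearson = np.corrcoef(data, rowvar=False)

# Re-express chol in different units: covariance changes, Pearson does not.
data_scaled = np.column_stack([age, chol / 100.0])
cov_scaled = np.cov(data_scaled, rowvar=False)
pearson_scaled = np.corrcoef(data_scaled, rowvar=False)

print("covariance:", cov[0, 1], "->", cov_scaled[0, 1])
print("pearson:   ", pearson[0, 1], "->", pearson_scaled[0, 1])
```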

In [None]:
from yellowbrick.features import Rank2D
# covariance
visualizer = Rank2D(features=feature_names, algorithm='covariance') 
visualizer.fit(X, y)                
visualizer.transform(X)             
visualizer.poof()

In [None]:
#pearson
visualizer = Rank2D(features=feature_names, algorithm='pearson')
visualizer.fit(X, y)                
visualizer.transform(X)             
visualizer.poof()

**RadViz**  
RadViz is a multivariate data visualization algorithm that plots each feature dimension uniformly around the circumference of a circle and then plots data points on the interior of the circle. This allows many dimensions to easily fit on a circle, greatly expanding the dimensionality of the visualization.  
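
The projection itself is simple enough to sketch by hand. The function below is an illustrative reimplementation under stated assumptions (min-max scaling per feature, anchors evenly spaced on the unit circle); `radviz_points` is a hypothetical helper, not part of Yellowbrick.

```python
import numpy as np

def radviz_points(X):
    """Project rows of X into the unit circle, RadViz style."""
    X = np.asarray(X, dtype=float)

    # Scale every feature to [0, 1] so each one pulls with comparable strength.
    mins = X.min(axis=0)
    spans = X.max(axis=0) - mins
    spans[spans == 0] = 1.0  # constant columns: avoid division by zero
    Xn = (X - mins) / spans

    # One anchor per feature, evenly spaced around the circle.
    n_features = X.shape[1]
    angles = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Each point is the average of the anchors, weighted by its feature values.
    weights = Xn.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0  # all-zero rows stay at the center
    return (Xn @ anchors) / weights, anchors

points, anchors = radviz_points([[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]])
print(points.round(3))
```

A row dominated by one feature lands near that feature's anchor, while a row with equal values on all features lands at the center.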

In [None]:
#Feature set
feat_1 = ['age', 'trestbps', 'chol', 'thalach']    
feat_2 = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

from yellowbrick.features import RadViz
# Specify the features of interest and the classes of the target 
features = feat_1
classes = [0, 1]

# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=features,size = (800,300))
visualizer.fit(X, y)      
visualizer.transform(X)  
visualizer.poof()

In [None]:
# Specify the features of interest and the classes of the target 
features = feat_2
classes = [0, 1]

# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=features,size = (800,300))
visualizer.fit(X, y)      
visualizer.transform(X)  
visualizer.poof()

**Parallel Coordinates**  
This technique is useful when we need to detect clusters of instances that have similar classes, and to note features that have high variance or different distributions. Points that tend to cluster will appear closer together.

In [None]:
from yellowbrick.features import ParallelCoordinates
classes = [0, 1]
# Instantiate the visualizer for all features
visualizer = ParallelCoordinates(
    classes=classes, features=feature_names,
    normalize='standard', size = (1200,500))

visualizer.fit(X, y)     
visualizer.transform(X)   
visualizer.poof()

Parallel coordinates is a visualization technique used to plot individual data elements across many dimensions. Each of the dimensions corresponds to a vertical axis, and each data element is displayed as a series of connected points along the dimensions/axes.  

The groups of similar instances are called ‘braids’, and when there are distinct braids of different classes, it suggests there’s enough separability that a classification algorithm might be able to discern between each class.  
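
The cell above passes `normalize='standard'` so that features measured in different units share a comparable vertical scale. As a rough sketch of what standardization does, assuming it subtracts the column mean and divides by the column standard deviation, the made-up rows below are rescaled with NumPy.

```python
import numpy as np

# Made-up rows with three features on very different scales.
X_toy = np.array([
    [63, 233, 150],
    [37, 250, 187],
    [41, 204, 172],
    [56, 236, 178],
], dtype=float)

# Standardize each column: zero mean and unit (population) standard deviation.
means = X_toy.mean(axis=0)
stds = X_toy.std(axis=0)
X_std = (X_toy - means) / stds

print(X_std.round(2))
```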

**Model Evaluation Visualizers**  
Model evaluation signifies how well the values predicted by the model match the actual labeled ones. Yellowbrick has visualizers for classification, regression, and clustering algorithms.  

**Evaluating Classifiers**  
Classification models try to assign each instance to one or more categories. The sklearn.metrics module implements several functions to measure classification performance.  

![Classifiers Metrics](https://cdn-images-1.medium.com/max/1600/1*U35S7hZqKSZ8DZlxcTl1Bg.png)  

Yellowbrick implements 7 classifier evaluation visualizers:
* ROCAUC
* Class Prediction Error
* Discrimination Threshold
* Class Balance
* Confusion Matrix
* Classification Report
* Precision - Recall Curves

In [None]:
# Classifier Evaluation Imports
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#Yellowbrick
from yellowbrick.classifier import ClassificationReport,ConfusionMatrix

#Training & Test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

**Classification Report**  
The classification report visualizer displays the precision, recall, and F1 scores for the model.  

* precision = true positives / (true positives + false positives)
* recall = true positives / (false negatives + true positives)
* F1 score = 2 * ((precision * recall) / (precision + recall))  
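
The three formulas above translate directly into code. This is a minimal sketch with made-up counts; `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention chosen here for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(p, r, round(f1, 3))
```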

Let's visualize the classification report for two models and decide which is better.

**Classification report using Gaussian NB**

In [None]:
# Instantiate the classification model and visualizer 
bayes = GaussianNB()
visualizer = ClassificationReport(bayes, classes=classes)
visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test)  
g = visualizer.poof()

**Classification report using Logistic Regression**

In [None]:
# Instantiate the logistic regression model and visualizer
logreg = LogisticRegression()
visualizer = ClassificationReport(logreg, classes=classes)
visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test)  
g = visualizer.poof()

Visual classification reports are used to compare classification models to select models that are **“redder”**, e.g. have stronger classification metrics or that are more balanced.

**Confusion Matrix**  
The ConfusionMatrix visualizer shows how each predicted class compares with the actual class: every cell counts the test instances with a given actual class that received a given predicted class, so correct predictions fall on the diagonal. Let’s check out the confusion matrix for the Logistic Regression Model.

In [None]:
logReg = LogisticRegression()
visualizer = ConfusionMatrix(logReg)
visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test)
g = visualizer.poof()
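
To make the layout of the matrix concrete, the sketch below counts a confusion matrix by hand from made-up labels, with rows as actual classes and columns as predicted classes. `confusion_counts` is a hypothetical helper written for illustration only.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, labels):
    """Return a matrix where rows are actual classes and columns are predictions."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

# Made-up labels, not model output.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

matrix = confusion_counts(y_true, y_pred, labels=[0, 1])

# Accuracy is the share of instances on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(y_true)

print(matrix)
print(accuracy)
```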

**I will cover evaluating regressors in a separate kernel**  

**Stay connected**  

**Happy Learning**