# Exploratory Data Analysis and Machine Learning Classification on Heart Disease Prediction

In this notebook, I classify the Heart Disease dataset with scikit-learn's `DecisionTreeClassifier`. The purpose is to share what I know about decision trees and evaluation metrics. I used the Seaborn library in the Basic Data Analysis, Correlation, and Data Visualization sections, and I leave the inferences that can be drawn from the plots to the reader. The focus is on explaining and applying decision trees and evaluation metrics. I hope it will be useful to you.

### If you have questions, please ask them in the comment section.

### I will be glad if you can give feedback.

Content:

1. [Importing the Necessary Libraries](#1)
1. [Reading the Data & Explanation of Features & Information About the Dataset](#2)
   1. [Variable Descriptions](#3)
1. [Basic Data Analysis](#4)
   1. [age](#5)
   1. [sex](#6)
   1. [cp](#7)
   1. [trestbps](#8)
   1. [chol](#9)
   1. [fbs](#10)
   1. [restecg](#11)
   1. [thalach](#12)
   1. [exang](#13)
   1. [oldpeak](#14)
   1. [slope](#15)
   1. [ca](#16)
   1. [thal](#17)
   1. [target](#18)
1. [Correlation](#19)
1. [Pandas Profiling](#20)
1. [Data Visualization](#21)
1. [Train-Test Split](#22)
1. [Decision Tree Classifier](#23)
   1. [Evaluation Metrics](#24)
      1. [Confusion Matrix](#25)
      1. [Classification Report](#26)
      1. [ROC Curve](#27)
   1. [Decision Tree Visualization](#28)
      1. [Visualize Decision Tree with graphviz](#29)
      1. [Print Text Representation](#30)
   1. [k-Fold Cross Validation](#31)
   1. [Hyper-Parameter Optimization](#32)
      1. [GridSearchCV](#33)
      1. [RandomizedSearchCV](#34)
   1. [Feature Importance](#35)
1. [Conclusion](#36)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="1"></a> 
# Importing the Necessary Libraries

In [None]:
# Standard library
import os
import warnings
from collections import Counter

# Data handling and plotting
import numpy as np
import pandas as pd
import pandas  # kept because a later cell calls pandas.read_csv
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

import graphviz
import plotly as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot, plot
init_notebook_mode(connected=True)
from wordcloud import WordCloud
from pandas_profiling import ProfileReport

# scikit-learn and XGBoost (each name imported once)
from sklearn import preprocessing, tree
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2, f_classif
from sklearn.linear_model import LinearRegression, LogisticRegression, SGDClassifier
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_predict,
                                     cross_val_score, train_test_split)
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, normalize
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier, XGBRFClassifier, plot_importance, plot_tree

warnings.filterwarnings("ignore")

<a id="2"></a> 
# Reading the Data & Explanation of Features & Information About the Dataset

In [None]:
dataset = pandas.read_csv('/kaggle/input/heart-disease-uci/heart.csv')
dataset.head(10)

<a id="3"></a> 
## Variable Descriptions

* age (`age`)
* sex (`sex`)
* chest pain type, 4 values (`cp`)
* resting blood pressure (`trestbps`)
* serum cholesterol in mg/dl (`chol`)
* fasting blood sugar > 120 mg/dl (`fbs`)
* resting electrocardiographic results, values 0, 1, 2 (`restecg`)
* maximum heart rate achieved (`thalach`)
* exercise-induced angina (`exang`)
* ST depression induced by exercise relative to rest (`oldpeak`)
* the slope of the peak exercise ST segment (`slope`)
* number of major vessels (0-3) colored by fluoroscopy (`ca`)
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect (`thal`)

Source: https://www.kaggle.com/ronitf/heart-disease-uci

In [None]:
dataset.info()

In [None]:
dataset.describe()

In [None]:
dataset.isnull().sum().sum()

<a id="4"></a> 
# Basic Data Analysis

At this stage, the aim is to obtain basic statistical information about the features. Subjects are also grouped by the value of each feature so that their distribution, and the mean of `target` within each group, can be seen. I used pie charts and histograms for this.

In [None]:
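# Select the columns stored as int64; this includes both categorical codes
# and integer-valued measurements, so the label says "int64", not "categorical".
numerical_int64 = (dataset.dtypes == "int64")
numerical_int64_list = list(numerical_int64[numerical_int64].index)

print("int64 variables:")
print(numerical_int64_list)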

In [None]:
def plot_hist(variable):
    sns.set_style('darkgrid')
    plt.figure(figsize = (9,3))
    plt.hist(dataset[variable], bins = 50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
for n in numerical_int64_list:
    plot_hist(n)

<a id="5"></a> 
## Age

In [None]:
sns.set_theme(style="ticks")


sns.histplot(
    dataset,
    x="age", hue="target",
    multiple="stack",
    palette="light:m_r",
    edgecolor=".3",
    linewidth=.9,
    #log_scale=True,
)

<a id="6"></a> 
## Sex

In [None]:
dataset[["sex","target"]].groupby(["sex"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
# This code snippet taken from Bharti Prasad's notebook via https://www.kaggle.com/bhartiprasad17/student-academic-performance-analysis (with her permission)

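plt.figure(figsize=(14, 7))
# value_counts() orders by frequency, so the labels are derived from its index
# instead of being hard-coded (assumes the coding 1 = male, 0 = female).
sex_counts = dataset['sex'].value_counts()
labels = sex_counts.index.map({1: 'Male', 0: 'Female'})
plt.pie(sex_counts, labels=labels, explode=[0.1, 0.1],
        autopct='%1.2f%%', colors=['#E37383', '#FFC0CB'], startangle=90)
plt.title('Gender')
plt.axis('equal')
plt.show()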

<a id="7"></a> 
## Cp

In [None]:
dataset[["cp","target"]].groupby(["cp"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
sns.set_theme(style="ticks")

sns.set_style('darkgrid')
sns.histplot(
    dataset,
    x="cp", hue="target",
    multiple="stack",
    palette="gist_rainbow",
    edgecolor=".3",
    linewidth=.9,
    #log_scale=True,
)

<a id="8"></a> 
## Trestbps

In [None]:
sns.set_theme(style="ticks")

sns.set_style('darkgrid')
sns.histplot(
    dataset,
    x="trestbps", hue="target",
    multiple="stack",
    palette="prism",
    edgecolor=".3",
    linewidth=.9,
    #log_scale=True,
)

<a id="9"></a> 
## Chol

In [None]:
sns.set_theme(style="darkgrid", palette="pastel")
sns.boxplot(x="target", y="chol", data=dataset)

<a id="10"></a> 
## Fbs

In [None]:
dataset[["fbs","target"]].groupby(["fbs"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
dataset[["fbs","target"]].groupby(["fbs"], as_index = False).count().sort_values(by="target",ascending = False)

In [None]:
sns.set_theme(style="ticks")

sns.set_style('darkgrid')
sns.histplot(
    dataset,
    x="fbs", hue="target",
    multiple="stack",
    palette="OrRd",
    edgecolor=".3",
    linewidth=.9,
    #log_scale=True,
)

<a id="11"></a> 
## Restecg

In [None]:
dataset[["restecg","target"]].groupby(["restecg"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
labels = dataset['restecg'].value_counts().index
sizes = dataset['restecg'].value_counts().values

plt.figure(figsize = (8,8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Distribution of Samples by 'restecg'",color = 'black',fontsize = 15)

<a id="12"></a> 
## Thalach

In [None]:
dataset[["thalach","target"]].groupby(["thalach"], as_index = False).mean().sort_values(by="target",ascending = False)

<a id="13"></a> 
## Exang

In [None]:
dataset[["exang","target"]].groupby(["exang"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
labels = dataset['exang'].value_counts().index
sizes = dataset['exang'].value_counts().values
sns.set_theme(style="darkgrid", palette="pastel")
plt.figure(figsize = (8,8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Distribution of Samples by 'exang'",color = 'black',fontsize = 15)

<a id="14"></a> 
## Oldpeak

In [None]:
sns.set_theme(style="ticks")

sns.set_style('darkgrid')
sns.histplot(
    dataset,
    x="oldpeak", hue="target",
    multiple="stack",
    palette="OrRd",
    edgecolor=".3",
    linewidth=.9,
    #log_scale=True,
)

<a id="15"></a> 
## Slope

In [None]:
dataset[["slope","target"]].groupby(["slope"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
labels = dataset['slope'].value_counts().index
sizes = dataset['slope'].value_counts().values
sns.set_theme(style="darkgrid", palette="pastel")
plt.figure(figsize = (8,8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Distribution of Samples by 'slope'",color = 'black',fontsize = 15)

<a id="16"></a> 
## Ca

In [None]:
dataset[["ca","target"]].groupby(["ca"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
labels = dataset['ca'].value_counts().index
sizes = dataset['ca'].value_counts().values
sns.set_theme(style="darkgrid", palette="pastel")
plt.figure(figsize = (8,8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Distribution of Samples by 'ca'",color = 'black',fontsize = 15)

<a id="17"></a> 
## Thal

In [None]:
dataset[["thal","target"]].groupby(["thal"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
labels = dataset['thal'].value_counts().index
sizes = dataset['thal'].value_counts().values
sns.set_theme(style="darkgrid", palette="pastel")
plt.figure(figsize = (8,8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Distribution of Samples by 'thal'",color = 'black',fontsize = 15)

<a id="18"></a> 
## Target

In [None]:
labels = dataset['target'].value_counts().index
sizes = dataset['target'].value_counts().values

plt.figure(figsize = (8,8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title("Distribution of Samples by 'target'",color = 'black',fontsize = 15)

<a id="19"></a> 
# Correlation

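Correlation measures the direction and strength of the linear relationship between two random variables. The Pearson coefficient, which `dataset.corr()` computes by default, ranges from -1 to 1: values near -1 or 1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship. Informally, it shows how far two variables are from being independent, although a correlation of 0 does not by itself imply independence.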
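As a minimal sketch of what the Pearson coefficient is, the value can be reproduced by hand with NumPy and compared with `np.corrcoef`. The two arrays below are made-up illustration values, not columns from the heart disease dataset, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical toy values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Pearson r: sum of centered products divided by the product of centered norms."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r_manual, r_numpy)
```

Both numbers should agree up to floating-point error, which is a quick way to check that the formula is understood correctly.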

In [None]:
features = dataset.columns

In [None]:
dataset.corr()

Heat maps visualize data through color. In table form, the variables are placed in the rows and columns, and each cell is colored according to its value, which is useful for examining multivariate cross-tabulated data. Heat maps are good for showing many variables at once, revealing patterns, and spotting which variables are correlated with each other.

In [None]:
plt.figure(figsize=(20, 10))
sns.set_style('white')
mask = np.triu(np.ones_like(dataset.corr(), dtype=bool))  # np.bool was removed in NumPy 1.24; the built-in bool works everywhere
heatmap = sns.heatmap(dataset.corr(), mask=mask,annot=True, cmap='BrBG', linewidths = 2)
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':30}, pad=16);

In [None]:
plt.figure(figsize=(20, 10))
heatmap = sns.heatmap(dataset.corr(),  annot=True, cmap='Blues_r', linewidths = 2)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':30}, pad=16);

<a id="20"></a> 
# Pandas Profiling

Pandas Profiling is a library that generates interactive reports about a dataset. With it, we can see the data types, the distribution of each feature, and various summary statistics. It also includes plots for individual features and correlation maps, which makes it handy during data preparation. Newer releases of the package are published under the name `ydata-profiling`. You can find more details about this tool at the following URL: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/

In [None]:
import pandas_profiling as pp
pp.ProfileReport(dataset)

<a id="21"></a> 
# Data Visualization

In [None]:
sns.set_theme(style="whitegrid")  # set the theme before creating the figure so it applies
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.despine(f, left=True, bottom=True)
sns.scatterplot(x="trestbps", y="chol",
                hue="slope",
                size="sex",
                palette="tab20",
                hue_order=sorted(dataset['slope'].unique()),  # one entry per category, not the whole column
                sizes=(2,16), 
                linewidth=0,
                data=dataset, ax=ax)

In [None]:
g = sns.JointGrid(data=dataset, x="trestbps", y="chol", space=0, ratio=17)
g.plot_joint(sns.scatterplot, color="r", alpha=.6, legend=False)
g.plot_marginals(sns.rugplot, height=1, color="r", alpha=.6)

In [None]:
g = sns.JointGrid(data=dataset, x="trestbps", y="thalach", space=0, ratio=17)
g.plot_joint(sns.scatterplot, color="g", alpha=.6, legend=False)
g.plot_marginals(sns.rugplot, height=1, color="g", alpha=.6)

In [None]:
plt.figure(figsize=(25,15))
sns.set_theme(style="darkgrid")

plt.subplot(2,2,1)
sns.histplot(dataset['age'], color = 'red', kde = True).set_title('age Interval and Counts')

plt.subplot(2,2,2)
sns.histplot(dataset['trestbps'], color = 'green', kde = True).set_title('trestbps Interval and Counts')

plt.subplot(2,2,3)
sns.histplot(dataset['chol'], kde = True, color = 'blue').set_title('chol Interval and Counts')

plt.subplot(2,2,4)
sns.histplot(dataset['thalach'], kde = True, color = 'black').set_title('thalach Interval and Counts')

In [None]:
sns.set_theme(style="darkgrid")
sns.pairplot(dataset, hue = 'target')

<a id="22"></a> 
# Train - Test Split

In [None]:
X = dataset.iloc[:, 0:13].values  # the 13 feature columns
y = dataset.iloc[:, 13].values    # 'target' as a 1-D array, the shape scikit-learn expects

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101) 
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

print(f'Total # of sample in whole dataset: {len(X)}')
print(f'Total # of sample in train dataset: {len(X_train)}')
print(f'Total # of sample in validation dataset: {len(X_valid)}')
print(f'Total # of sample in test dataset: {len(X_test)}')

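Standardization rescales a feature so that its mean is 0 and its standard deviation is 1; it does not change the shape of the distribution. The formula is $z = (x - \mu) / \sigma$: we subtract the mean from each value and then divide by the standard deviation (not the variance). Decision trees split on thresholds and are not sensitive to feature scale, so the features are left unscaled in this notebook.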
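A minimal sketch of z-score standardization with scikit-learn's `StandardScaler`, checked against the formula computed by hand. The numbers are made up for illustration (`X_train_toy` and `X_test_toy` are hypothetical names), and the scaler is fitted on the training part only so that no information from the test part leaks into it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix with 2 features, not the heart disease data.
X_train_toy = np.array([[120.0, 200.0],
                        [130.0, 240.0],
                        [140.0, 280.0],
                        [150.0, 320.0]])
X_test_toy = np.array([[135.0, 260.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std on train, then scale
X_test_scaled = scaler.transform(X_test_toy)        # reuse the train statistics

# The same result by hand: subtract the column mean, divide by the column standard deviation.
manual = (X_train_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)
print(np.allclose(manual, X_train_scaled))
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```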

<a id="23"></a> 
# Decision Tree

Tree-based learning algorithms are among the most widely used supervised learning algorithms. They can be applied to both classification and regression problems.

The decision tree is a classic data mining classification algorithm. It requires a predefined target variable, and because of its structure it works top-down: starting at the root, each node applies one rule that sends a record to a child node, until a leaf is reached.

**A decision tree is a structure that divides a data set containing a large number of records into smaller sets by applying a sequence of simple decision rules. In other words, each decision step splits a large group of records into smaller, more homogeneous groups.**
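To make the idea of a decision rule concrete, here is a minimal sketch of how one split on a single numeric feature could be chosen with Gini impurity, which is the default criterion of `DecisionTreeClassifier`. This illustrates the principle only and is not scikit-learn's implementation; the helper names `gini` and `best_threshold` and the toy values are assumptions made for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum over classes of p_k squared."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Try midpoints between sorted unique values; return (threshold, weighted impurity)."""
    values = np.unique(feature)
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2
        left, right = labels[feature <= t], labels[feature > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: six ages and a binary label.
age_toy = np.array([35, 42, 48, 55, 61, 67])
target_toy = np.array([0, 0, 0, 1, 1, 1])

threshold, impurity = best_threshold(age_toy, target_toy)
print(threshold, impurity)
```

On this toy data the rule `age <= 51.5` separates the two classes perfectly, so the weighted impurity after the split is 0. A real tree repeats this search over all features at every node.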



In [None]:
decTree_model = DecisionTreeClassifier()
decTree_model.fit(X_train, y_train)

train_score2 = decTree_model.score(X_train, y_train)
print(f'Train score of trained model: {train_score2}')

validation_score2 = decTree_model.score(X_valid, y_valid)
print(f'Validation score of trained model: {validation_score2}')

test_score2 = decTree_model.score(X_test, y_test)
print(f'Test score of trained model: {test_score2}')

**Advantages of Decision Trees:**

* It is easy to understand and interpret.

* It requires little data preparation.

* The cost of using the tree (predicting) is logarithmic in the number of data points used to train the tree.

* It can process both numerical and categorical data.

* They can handle multi-output problems.

* It is possible to validate a model using statistical tests.

**Disadvantages of Decision Trees:**

* They are not very good at predicting continuous attribute values.

* Modeling is not very successful when the number of classes is high and the number of training examples is low.

* Time and space complexity depend on the number of training instances, the number of attributes, and the structure of the resulting tree.

* Both tree-building complexity and tree-pruning complexity are high.

<a id="24"></a> 
## Evaluation Metrics

Evaluation metrics are used to measure the quality of the statistical or machine learning model. Evaluating machine learning models or algorithms is essential for any project. There are many different types of evaluation metrics available to test a model. These include classification accuracy, logarithmic loss, confusion matrix, and others. 

***Why is this Useful?***

It is very important to use multiple evaluation metrics to evaluate your model, because a model may perform well according to one metric but poorly according to another. Using evaluation metrics is critical to ensuring that your model is operating correctly and optimally. 

***Applications of Evaluation Metrics***
* Statistical Analysis

* Machine Learning

Source: https://deepai.org/machine-learning-glossary-and-terms/evaluation-metrics

<a id="25"></a> 
### Confusion Matrix

A confusion matrix is a table that summarizes how the predictions of a classifier compare with the true labels.
The logic behind it is simple, and it is used very often with classification algorithms because it gives easy-to-understand information about which classes are predicted correctly and which are confused with each other.

In [None]:
y_predictions = decTree_model.predict(X_test)

# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predicted classes.
conf_matrix = confusion_matrix(y_test, y_predictions)


print(f'Accuracy: {accuracy_score(y_test, y_predictions)*100}')
print()
print(f'Confusion matrix: \n{conf_matrix}\n')

sns.heatmap(conf_matrix, annot=True)

**TP - True Positive:** The model predicted the positive class, and the true class is positive.

**FP - False Positive:** The model predicted the positive class, but the true class is negative.

**FN - False Negative:** The model predicted the negative class, but the true class is positive.

**TN - True Negative:** The model predicted the negative class, and the true class is negative.
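A minimal sketch of how the four counts can be read out of scikit-learn's confusion matrix, using made-up labels (the `_toy` names are hypothetical). For binary labels 0 and 1, `confusion_matrix(y_true, y_pred)` puts the true classes in the rows and the predicted classes in the columns, so flattening it gives the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

# Argument order matters: true labels first, predictions second.
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
tn_toy, fp_toy, fn_toy, tp_toy = cm_toy.ravel()

print(cm_toy)
print(f"TN={tn_toy}, FP={fp_toy}, FN={fn_toy}, TP={tp_toy}")
```

Swapping the two arguments transposes the matrix, which silently exchanges the FP and FN counts.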

In [None]:
tn = conf_matrix[0,0]
fp = conf_matrix[0,1]
tp = conf_matrix[1,1]
fn = conf_matrix[1,0]

total = tn + fp + tp + fn
real_positive = tp + fn
real_negative = tn + fp

**Accuracy Rate:** A measure of how often the classifier predicts correctly.

**Precision:** The proportion of the samples predicted as positive that are actually positive.

**Recall:** The proportion of the actual positives that the classifier predicts correctly. Also known as Sensitivity, Hit Rate, or True Positive Rate. It should be as high as possible.

**F1 Score:** The harmonic mean of Precision and Recall. It is a harmonic mean rather than a simple average so that extreme cases are not hidden: with a simple average, a model with a Precision of 1 and a Recall of 0 would get a score of 0.5, which would mislead us, whereas its F1 Score is 0.

**Specificity:** The proportion of the actual negatives that the classifier predicts correctly. Also known as the True Negative Rate.

**Misclassification Rate (Error Rate):** A measure of how often the classifier predicts incorrectly.

**Prevalence:** How often the positive class (a value of 1) actually occurs in the evaluated data.

**Miss Rate:** The proportion of samples predicted as 0 although the real value is 1. Also known as the False Negative Rate.

**Fall-out:** The proportion of samples predicted as 1 although the real value is 0. Also known as the False Positive Rate.
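The extreme case mentioned for the F1 Score can be checked with a few lines of plain Python. This is a small sketch; `f1_from` is a hypothetical helper, and returning 0 when both inputs are 0 is a convention chosen here to avoid dividing by zero.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Extreme case: perfect precision, zero recall.
p_toy, r_toy = 1.0, 0.0
arithmetic_mean = (p_toy + r_toy) / 2
harmonic_mean = f1_from(p_toy, r_toy)

print(f"Arithmetic mean: {arithmetic_mean}")
print(f"F1 (harmonic mean): {harmonic_mean}")
```

The arithmetic mean rewards the model for one good number, while the harmonic mean stays at 0 as long as either precision or recall is 0.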

In [None]:
accuracy  = (tp + tn) / total # Accuracy Rate
precision = tp / (tp + fp) # Positive Predictive Value
recall    = tp / (tp + fn) # True Positive Rate
f1score  = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp) # True Negative Rate
error_rate = (fp + fn) / total # Misclassification Rate
prevalence = real_positive / total
miss_rate = fn / real_positive # False Negative Rate
fall_out = fp / real_negative # False Positive Rate

print(f'Accuracy    : {accuracy*100}')
print(f'Precision   : {precision*100}')
print(f'Recall      : {recall*100}')
print(f'F1 score    : {f1score*100}')
print(f'Specificity : {specificity*100}')
print(f'Error Rate  : {error_rate*100}')
print(f'Prevalence  : {prevalence*100}')
print(f'Miss Rate   : {miss_rate*100}')
print(f'Fall Out    : {fall_out*100}')

<a id="26"></a> 
### Classification Report

In [None]:
predictions = decTree_model.predict(X_test)

# classification_report expects (y_true, y_pred); swapping them exchanges precision and recall.
print(classification_report(y_test, predictions))

<a id="27"></a> 
### ROC Curve

The ROC curve and the area under it (AUC) are used to evaluate the performance of classification models. They are among the most widely used evaluation criteria, especially when the data set is imbalanced. The curve plots the true positive rate against the false positive rate as the decision threshold varies, and so shows how well the model separates the classes.

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC' )
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

In [None]:
probs = decTree_model.predict_proba(X_test)
probs = probs[:, 1]

One of the most commonly used metrics is the AUC, which stands for "Area Under the ROC Curve". The larger the area, the better the model is at discriminating between the given classes. The ideal value for AUC is 1, and a value of 0.5 corresponds to random guessing.
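As a small sketch of what the AUC measures, the value returned by `roc_auc_score` can be compared with the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The labels and scores below are made up for illustration, and the `_toy` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores.
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true_toy, scores_toy)
auc_toy = roc_auc_score(y_true_toy, scores_toy)

# Pairwise view: count positive/negative pairs ranked correctly (ties count as half).
pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]
pairwise = np.mean([float(p > n) + 0.5 * float(p == n) for p in pos for n in neg])

print(auc_toy, pairwise)
```

Here three of the four positive/negative pairs are ordered correctly, so both computations give 0.75. Note that a fully grown decision tree usually outputs probabilities of exactly 0 or 1, so its ROC curve tends to have very few distinct points.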

In [None]:
auc = roc_auc_score(y_test, probs)
print('AUC: ', auc*100)

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, probs)
plot_roc_curve(fpr, tpr)  # the helper already calls plt.legend() after plotting

<a id="28"></a> 
## Decision Tree Visualization

<a id="29"></a> 
### Visualize Decision Tree with graphviz

In [None]:
exported_tree = tree.export_graphviz(decTree_model,  
                                     filled = True, rounded = True,  
                                     special_characters = True) 
tree_plot = graphviz.Source(exported_tree) 
tree_plot

<a id="30"></a> 
### Print Text Representation

In [None]:
text_representation = tree.export_text(decTree_model)
print(text_representation)

<a id="31"></a> 
## k-Fold Cross Validation

Cross-validation lets us check whether the model is overfitting and gives a more reliable estimate of its quality. The training data is split into k folds; the model is trained on k-1 of them and scored on the remaining one, and this is repeated so that every fold serves once as the validation set. This way we can estimate the performance of the model before meeting high error rates on a test set we have not seen yet. It is used frequently because it is easy to apply.
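A minimal sketch of how k-fold splitting partitions the sample indices, using scikit-learn's `KFold` on ten made-up samples (`X_toy` is a hypothetical name). Note that when `cross_val_score` receives an integer `cv` and a classifier, scikit-learn uses the stratified variant, which additionally keeps the class proportions similar in every fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples with one feature

kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kf.split(X_toy))

for i, (train_idx, valid_idx) in enumerate(folds):
    # Each sample index appears in exactly one validation fold.
    print(f"Fold {i}: train={train_idx}, validation={valid_idx}")
```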

In [None]:
print(cross_val_score(decTree_model, X = X_train, y = y_train, cv = 15))

In [None]:
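# 10 folds keep a reasonable number of samples in each validation fold;
# with cv=100 every fold would contain only a couple of samples from this small training set.
accuracies = cross_val_score(estimator = decTree_model, X = X_train, y = y_train, cv = 10)
print("Accuracy (mean):", accuracies.mean()*100, "%")
print("std: ", accuracies.std()*100)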

<a id="32"></a> 
## Hyper-Parameter Optimization

Unlike model parameters, hyperparameters are not learned during training. They are chosen by the data scientist before the modeling phase. For example, the KNN algorithm, a non-parametric classification algorithm, classifies a sample by looking at its k nearest neighbors. Here the number of neighbors (`n_neighbors`) and the distance metric (`metric`) are hyperparameters that must be specified before training, and choosing them well improves the performance of the model.

Hyperparameter optimization is the process of finding the most suitable hyperparameter combination according to the success metric specified for a machine learning algorithm.

Given that a machine learning algorithm can have dozens of hyperparameters, each with dozens of possible values, it is clear how difficult it would be to try all combinations one by one and pick the best. For this reason, different methods have been developed for hyperparameter optimization. GridSearchCV and RandomizedSearchCV are among these methods.

<a id="33"></a> 
### GridSearchCV

For the hyperparameters and candidate values to be tested, a separate model is fitted and cross-validated for every combination, and the most successful hyperparameter set is selected according to the specified metric.
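To get a feeling for the cost of an exhaustive search, the number of combinations can be counted with scikit-learn's `ParameterGrid` before fitting anything. The sketch below uses a grid of the same shape as the one in this section (the name `grid_toy` is hypothetical, and `None` stands in for the third `max_features` option); the number of folds is an assumption matching `cv=10`.

```python
from sklearn.model_selection import ParameterGrid

grid_toy = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': list(range(1, 14)),
    'min_samples_split': list(range(2, 8)),
    'min_samples_leaf': list(range(1, 3)),
    'max_features': [None, 'sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(grid_toy))  # 2 * 2 * 13 * 6 * 2 * 3
n_folds = 10
print(f"Candidates: {n_candidates}")
print(f"Model fits with {n_folds}-fold CV: {n_candidates * n_folds}")
```

With 1872 candidates and 10 folds, the search trains 18720 trees. That is manageable for a small dataset and a fast model, but the count grows multiplicatively with every added hyperparameter.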

In [None]:
parameters = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_depth': range(1,14), 
              'min_samples_split': range(2,8), 
              'min_samples_leaf': range(1,3),
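              # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3;
              # None (use all features) takes its place here.
              'max_features': [None, 'sqrt', 'log2'],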
             }

gcv = GridSearchCV(decTree_model, parameters, cv=10).fit(X_train, y_train)

In [None]:
print(f"Best Estimator: {gcv.best_estimator_}")
print(f"Best Parameter: {gcv.best_params_}")
print(f"Best Score: {gcv.best_score_}")

<a id="34"></a> 
### RandomizedSearchCV

A hyperparameter combination is drawn at random from the given search space, and the model built with it is evaluated by cross-validation. These steps are repeated until the specified number of iterations (`n_iter`) is reached.
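The random drawing step can be looked at in isolation with scikit-learn's `ParameterSampler`, which is a convenient way to see what a randomized search would try. This is a small sketch on a reduced, hypothetical search space (`space_toy`); when every entry is a list, the combinations are sampled without replacement.

```python
from sklearn.model_selection import ParameterSampler

space_toy = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(1, 14)),
    'min_samples_split': list(range(2, 8)),
}

# Draw 5 of the 2 * 13 * 6 = 156 possible combinations.
samples = list(ParameterSampler(space_toy, n_iter=5, random_state=0))
for s in samples:
    print(s)
```

Each printed dictionary is one candidate that would be cross-validated; raising `n_iter` trades more computation for better coverage of the space.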

In [None]:
from sklearn.model_selection import RandomizedSearchCV
params = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_depth': range(1,14), 
              'min_samples_split': range(2,8), 
              'min_samples_leaf': range(1,3),
              # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3;
              # None (use all features) takes its place here.
              'max_features': [None, 'sqrt', 'log2'],
             }

randomizedcv = RandomizedSearchCV(decTree_model, params, n_iter=1000, cv=5, scoring='accuracy', n_jobs=-1, verbose=2).fit(X_train,y_train)

print(f'RandomizedSearchCV Best Score: {randomizedcv.best_score_*100}')
print(f'RandomizedSearchCV Best Estimator: {randomizedcv.best_estimator_}')
print(f'RandomizedSearchCV Best Params: {randomizedcv.best_params_}')

<a id="35"></a> 
## Feature Importance

In [None]:
imp_feature = pd.DataFrame({'Feature': ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal'], 'Importance': decTree_model.feature_importances_})
plt.figure(figsize=(10,4))
plt.title("Feature Importance for DecisionTreeClassifier")
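plt.xlabel("Importance")
plt.ylabel("Features")
# Pass the colors as a list (cycled over the bars); recent matplotlib rejects a bare string like 'rgbkymc'.
plt.barh(imp_feature['Feature'],imp_feature['Importance'],color = list('rgbkymc'))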
plt.show()

In [None]:
best_features = SelectFromModel(decTree_model)
best_features.fit(X_train, y_train)

transformedX = best_features.transform(X_train)
print(f"Old Shape: {X_train.shape} New shape: {transformedX.shape}")
print("\n")

<a id="36"></a> 
# Conclusion

In this notebook, I examined the Heart Disease dataset. First, I carried out exploratory data analysis and visualization, then I applied a Decision Tree Classifier to the dataset. I visualized the decision tree and gave explanations and examples of evaluation metrics. I performed k-fold cross-validation, GridSearchCV, and RandomizedSearchCV. Lastly, I showed a feature importance plot.

* If you have questions, please leave them in the comments. I will try to explain anything that is unclear.
* If you liked this notebook, I would be glad to hear about it :)

Thank you for your time.