# Using EDA and Machine Learning to Predict Heart Disease

Task:
    
    Given various parameters about a patient, can we predict whether or not they have heart disease?

**In this notebook we are going to perform Exploratory Data Analysis and use various Machine Learning Models to predict whether the patient has heart disease or not depending on the values of various features. I will be using Bokeh and a little bit of Seaborn to plot the graphs.**


#### Loading the libraries

In [None]:
import numpy as np 
import pandas as pd

#### Loading the dataset

In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')

In [None]:
df.head(5)

In [None]:
df.columns

## Features

Let's have a look at what each of these columns means:

1. **age** -> Age of the person.
2. **sex** -> Sex of the person (1 = male; 0 = female).
3. **cp** -> Chest pain type. It can take the values 0, 1, 2 or 3.
4. **trestbps** -> Resting blood pressure (measured in mm Hg on admission to the hospital). It takes continuous values from 94 to 200.
5. **chol** -> Serum cholesterol in mg/dl. It also takes continuous values.
6. **fbs** -> Fasting blood sugar. It can take the value 1 or 0.
7. **restecg** -> Resting electrocardiographic results. It can take the values 0, 1 or 2.
8. **thalach** -> Maximum heart rate achieved. It takes continuous values from 71 to 202.
9. **exang** -> Exercise-induced angina. It can take the value 0 or 1.
10. **oldpeak** -> ST depression induced by exercise relative to rest. It takes continuous decimal values.
11. **slope** -> Slope of the peak exercise ST segment. It can take the values 0, 1 or 2.
12. **ca** -> Number of major vessels colored by fluoroscopy. It can take the values 0, 1, 2, 3 or 4.
13. **thal** -> 3 = normal; 6 = fixed defect; 7 = reversible defect.
14. **target** -> Presence or absence of heart disease (the attribute to predict).

For performing EDA, I will be using [Bokeh](https://bokeh.org).

In [None]:
from bokeh.io import output_notebook
from bokeh.io import show
from bokeh.plotting import figure
from bokeh.transform import cumsum
from bokeh.palettes import Spectral6
from bokeh.models import ColumnDataSource
from bokeh.layouts import gridplot
from math import pi

In [None]:
output_notebook()

# 1. Is the Dataset Balanced?

Before we start EDA, preprocessing, and model building, we first check whether the variable to predict, 'target', is balanced. This tells us which evaluation metrics suit this dataset: with balanced classes, accuracy is a reasonable metric, while imbalanced classes call for metrics such as precision, recall, or F1.

In [None]:
df['target'].value_counts()

In [None]:
df['target'].value_counts()[0]

In [None]:
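unique = ['0', '1']
counts = df['target'].value_counts()
top = [counts[0], counts[1]]
# Every column of a ColumnDataSource must have the same length,
# so take only as many palette colors as there are bars.
source = ColumnDataSource(data = dict(Target = unique, counts = top, color = Spectral6[:2]))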

In [None]:
p = figure(
    x_range = unique,
    plot_height = 500,
    plot_width = 500,
    x_axis_label = 'Target',
    y_axis_label = 'Count(Target)',
    title = 'Count of People Having Heart Disease and Not Having Heart Disease',
    tools = "hover", tooltips="@Target: @counts"
)

p.vbar(
    x = 'Target',
    top = 'counts',
    bottom = 0,
    width = 0.9,
    source = source,
    color = 'color'
)

In [None]:
target = {
            'No Heart Disease' : df['target'].value_counts()[0], 
          'Have Heart Disease' : df['target'].value_counts()[1]
         }

data = pd.Series(target).reset_index(name = 'value').rename(columns = {'index':'target'})
data['angle'] = data['value']/data['value'].sum() * 2 * pi
data['color'] = ['skyblue', 'salmon']

In [None]:
p1 = figure(
            plot_height = 500, 
            plot_width = 500, 
            title = "Proportion of People Having Heart Disease and not Having Heart Disease", 
            toolbar_location = None,
            tools = "hover", 
            tooltips = "@target: @value", 
            x_range = (-0.5, 1.0)
            )

p1.wedge(
        x = 0, y = 1, radius = 0.4,
        start_angle = cumsum('angle', include_zero=True), 
        end_angle = cumsum('angle'),
        line_color = "white", 
        fill_color = 'color', 
        legend_field = 'target', 
        source = data
        )

p1.legend.location = "top_right"

p1.legend.label_text_font_size = '5pt'

In [None]:
show(gridplot([[p], [p1]]))

In [None]:
# normalize = True returns class shares directly, which avoids floating point noise from round(...) * 100.
target_share = df['target'].value_counts(normalize = True) * 100
print(f"Percentage of people having Heart Disease: {target_share[1]:.0f}%")
print(f"Percentage of people not having Heart Disease: {target_share[0]:.0f}%")

We can see that the dataset is balanced as there is no major difference between the proportion of people having heart disease and those not having heart disease.
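The balance check can also be expressed as a single number. The following is a minimal sketch on a hypothetical toy series (the labels below are made up and stand in for `df['target']`); the `balance_ratio` name is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['target'] (1 = disease, 0 = no disease).
target = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 0, 1], name="target")

# normalize=True returns class shares instead of raw counts.
shares = target.value_counts(normalize=True)

# Ratio of the smaller class to the larger one; values near 1 mean balanced classes.
balance_ratio = shares.min() / shares.max()

print(shares)
print(f"minority/majority ratio: {balance_ratio:.2f}")
```

A ratio close to 1 suggests accuracy is a usable metric; a ratio far below 1 would be a hint to lean more on precision, recall, or F1.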

Next we need to check whether there are null values present in the dataset.

In [None]:
df.isnull().sum()

So we don't have any null values present which saves us a lot of time :)

# 2. Exploratory Data Analysis (EDA)

First let's classify these columns as categorical or continuous. For categorical variables we will print out the unique categories of that particular column.

In [None]:
categorical_var = []
continuous_var = []

for column in df.columns:
    if len(df[column].unique()) <= 10:
        print(f"{column} : {df[column].unique()}")
        categorical_var.append(column)
        print()
    else:
        continuous_var.append(column)
        
print("Categorical Variables are: ", categorical_var)
print("Continuous Variables are: ", continuous_var)

**Now we will explore the relation of these categorical variables with the target.**

In [None]:
def count_of_each_category(column_name):
    """
    A function which will plot the count of each category for a particular column using bokeh.
    """
    counts = df[column_name].value_counts()
    column = list(counts.index)
    top = counts.tolist()
    # Slice the palette so the color column has one entry per category;
    # all columns of a ColumnDataSource must have the same length.
    source = ColumnDataSource(data = dict(Classes = column, counts = top, color = Spectral6[:len(column)]))

    p2 = figure(
        plot_height = 400,
        plot_width = 400,
        x_axis_label = column_name,
        y_axis_label = 'Count(Classes)',
        tools="hover", tooltips="@Classes: @counts"
    )

    p2.vbar(
        x = 'Classes',
        top = 'counts',
        bottom = 0,
        width = 0.9,
        source = source,
        color = 'color'
    )
    
    return p2
    

### Sex vs Target

In [None]:
p2 = count_of_each_category('sex')
show(p2)

In [None]:
# Compare how many men and women do and do not have heart disease.

sex_vs_target = df.groupby(['sex', 'target'])['sex'].count().to_list()

unique = [0, 1]
condition = ['Have Heart Disease', 'No Heart Disease']
colors = ["#e84d60", "#718dbf"]
data = {
        'Classes' : unique,
        'Have Heart Disease' : [sex_vs_target[1], sex_vs_target[3]],
        'No Heart Disease'   : [sex_vs_target[0], sex_vs_target[2]]
        }

p3 = figure(plot_height = 400, plot_width = 400, title = "Sex vs Target",
           )

p3.vbar_stack(condition, x ='Classes', width = 0.9, color = colors, source = data,
             legend_label = condition)

p3.legend.location = "top_left"

p3.legend.label_text_font_size = '7pt'
show(p3)

At first glance it may look as if men are more affected, but a closer look shows that a larger proportion of the women in this dataset have heart disease than of the men.
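Counts and proportions can tell different stories, so it helps to compute both. Below is a minimal sketch using `pd.crosstab` on hypothetical toy data (the values are made up; in the notebook the columns would come from `df`). Unlike positional indexing into a flattened groupby list, the crosstab keeps row and column labels, and combinations that never occur appear as 0.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be df[['sex', 'target']].
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Raw counts per (sex, target) pair.
counts = pd.crosstab(toy["sex"], toy["target"])

# normalize="index" divides each row by its total, giving per-sex proportions.
proportions = pd.crosstab(toy["sex"], toy["target"], normalize="index")

print(counts)
print(proportions.round(2))
```

The same two calls should work for any of the categorical columns explored in this section, which makes the comparison less dependent on reading bar heights by eye.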

### Chest Pain vs Target

Different Chest Pain Types:

0: Typical angina: chest pain related to decreased blood supply to the heart

1: Atypical angina: chest pain not related to heart

2: Non-anginal pain: typically esophageal spasms (non heart related)

3: Asymptomatic: chest pain not showing signs of disease


In [None]:
p4 = count_of_each_category('cp')
show(p4)

In [None]:
# Compare how many patients with each chest pain type do and do not have heart disease.

cp_vs_target = df.groupby(['cp', 'target'])['cp'].count().to_list()

unique = [0, 1, 2, 3]
condition = ['Have Heart Disease', 'No Heart Disease']
colors = ["#e84d60", "#718dbf"]
data = {
        'Classes' : unique,
        'Have Heart Disease' : [cp_vs_target[1], cp_vs_target[3], cp_vs_target[5],cp_vs_target[7]],
        'No Heart Disease'   : [cp_vs_target[0], cp_vs_target[2], cp_vs_target[4], cp_vs_target[6]]
        }

p5 = figure(plot_height = 400, plot_width = 400, title = "Chest Pain vs Target")

p5.vbar_stack(condition, x ='Classes', width = 0.9, color = colors, source = data, legend_label = condition)

p5.legend.location = "top_right"

p5.legend.label_text_font_size = '7pt'
show(p5)

Surprisingly, the majority of the asymptomatic (type 3) and non-anginal pain (type 2) patients had heart disease.

### Fasting Blood Sugar vs Target 

FBS > 120 mg/dl (1 = true; 0 = false). 

A fasting blood sugar above 120 mg/dl suggests that the patient may be diabetic.

In [None]:
p6 = count_of_each_category('fbs')
show(p6)

In [None]:
# Compare how many diabetic and non-diabetic patients do and do not have heart disease.

fbs_vs_target = df.groupby(['fbs', 'target'])['fbs'].count().to_list()

unique = [0, 1]
condition = ['Have Heart Disease', 'No Heart Disease']
colors = ["#e84d60", "#718dbf"]
data = {
        'Classes' : unique,
        'Have Heart Disease' : [fbs_vs_target[1], fbs_vs_target[3]],
        'No Heart Disease'   : [fbs_vs_target[0], fbs_vs_target[2]]
        }

p7 = figure(plot_height = 400, plot_width = 400, title = "Fasting Blood Sugar vs Target")

p7.vbar_stack(condition, x ='Classes', width = 0.9, color = colors, source = data, legend_label = condition)

p7.legend.location = "top_right"

p7.legend.label_text_font_size = '7pt'
show(p7)

### Restecg vs Target

0: Nothing to note

1: ST-T wave abnormality: can range from mild symptoms to severe problems; signals an abnormal heartbeat

2: Possible or definite left ventricular hypertrophy: enlarged main pumping chamber of the heart


In [None]:
p8 = count_of_each_category('restecg')
show(p8)

In [None]:
restecg_vs_target = df.groupby(['restecg', 'target'])['restecg'].count().to_list()

unique = [0, 1, 2]
condition = ['Have Heart Disease', 'No Heart Disease']
colors = ["#e84d60", "#718dbf"]
data = {
        'Classes' : unique,
        'Have Heart Disease' : [restecg_vs_target[1], restecg_vs_target[3], restecg_vs_target[5]],
        'No Heart Disease'   : [restecg_vs_target[0], restecg_vs_target[2], restecg_vs_target[4]]
        }

p9 = figure(plot_height = 400, plot_width = 400, title = "Restecg vs Target")

p9.vbar_stack(condition, x ='Classes', width = 0.9, color = colors, source = data, legend_label = condition)

p9.legend.location = "top_right"

p9.legend.label_text_font_size = '7pt'
show(p9)

A large proportion of people having restecg of type 1 actually have heart disease. We must take care of ST-T Wave abnormality as it can range from mild symptoms to severe problems.

### Exercise Induced Angina vs Target

exang means exercise induced angina (1 = yes; 0 = no). Angina is a type of chest pain caused by reduced blood flow to the heart

In [None]:
p10 = count_of_each_category('exang')
show(p10)

In [None]:
exang_vs_target = df.groupby(['exang', 'target'])['exang'].count().to_list()

unique = [0, 1]
condition = ['Have Heart Disease', 'No Heart Disease']
colors = ["#e84d60", "#718dbf"]
data = {
        'Classes' : unique,
        'Have Heart Disease' : [exang_vs_target[1], exang_vs_target[3]],
        'No Heart Disease'   : [exang_vs_target[0], exang_vs_target[2]]
        }

p11 = figure(plot_height = 400, plot_width = 400, title = "Exang vs Target")

p11.vbar_stack(condition, x ='Classes', width = 0.9, color = colors, source = data, legend_label = condition)

p11.legend.location = "top_right"

p11.legend.label_text_font_size = '7pt'
show(p11)

### Slope vs Target

slope - the slope of the peak exercise ST segment

0: Upsloping: better heart rate with exercise (uncommon)

1: Flatsloping: minimal change (typical healthy heart)

2: Downsloping: signs of unhealthy heart

In [None]:
p12 = count_of_each_category('slope')
show(p12)

In [None]:
slope_vs_target = df.groupby(['slope', 'target'])['slope'].count().to_list()

unique = [0, 1, 2]
condition = ['Have Heart Disease', 'No Heart Disease']
colors = ["#e84d60", "#718dbf"]
data = {
        'Classes' : unique,
        'Have Heart Disease' : [slope_vs_target[1], slope_vs_target[3], slope_vs_target[5]],
        'No Heart Disease'   : [slope_vs_target[0], slope_vs_target[2], slope_vs_target[4]]
        }

p13 = figure(plot_height = 400, plot_width = 400, title = "Slope vs Target")

p13.vbar_stack(condition, x ='Classes', width = 0.9, color = colors, source = data, legend_label = condition)

p13.legend.location = "top_left"

p13.legend.label_text_font_size = '5pt'
show(p13)

Type 2 means a downsloping ST segment, which is a sign of an unhealthy heart, and most patients with a type 2 slope had heart disease.

### Ca vs Target

ca - number of major vessels (0-3) colored by fluoroscopy; in this dataset the column also contains the value 4

colored vessel means the doctor can see the blood passing through

the more blood movement the better (no clots)

In [None]:
p14 = count_of_each_category('ca')
show(p14)

In [None]:
ca_vs_target = df.groupby(['ca', 'target'])['ca'].count().to_list()

unique = [0, 1, 2, 3, 4]
condition = ['Have Heart Disease', 'No Heart Disease']
colors = ["#e84d60", "#718dbf"]
data = {
        'Classes' : unique,
        'Have Heart Disease' : [ca_vs_target[1], ca_vs_target[3], ca_vs_target[5], ca_vs_target[7], ca_vs_target[9]],
        'No Heart Disease'   : [ca_vs_target[0], ca_vs_target[2], ca_vs_target[4], ca_vs_target[6], ca_vs_target[8]]
        }

p15 = figure(plot_height = 400, plot_width = 400, title = "Ca vs Target")

p15.vbar_stack(condition, x ='Classes', width = 0.9, color = colors, source = data, legend_label = condition)
p15.legend.location = "top_right"

p15.legend.label_text_font_size = '7pt'
show(p15)

We can see that a large proportion of the patients with a 'ca' value of 0 or 4 had heart disease.

**Now we will see the relation of the Continuous Variables with the target.**

In [None]:
def plot_cont_var(column_name):
    """
    A function which makes histogram for continuous variables.
    """
    hist1, edges1 = np.histogram(df[df["target"] == 0][column_name], density = True, bins = 40)
    hist2, edges2 = np.histogram(df[df["target"] == 1][column_name], density = True, bins = 40)

    p = figure(
        plot_height = 500,
        plot_width = 500,
        x_axis_label = column_name,
        title = column_name.capitalize() + ' vs Target'
    )

    p.quad(
        bottom = 0,
        top = hist1,
        left = edges1[:-1],
        right = edges1[1:],
        line_color = 'white',
        color = 'blue', # Blue represents patients not having heart disease.
        alpha = 0.6
    )

    p.quad(
        bottom = 0,
        top = hist2,
        left = edges2[:-1],
        right = edges2[1:],
        line_color = 'white',
        color = 'red', # Red represents patients having heart disease.
        alpha = 0.6
    )



    return p



### Age vs Target

In [None]:
p16 = plot_cont_var('age')
show(p16)

No particular age stands out in this plot as being clearly more prone to heart disease.

In [None]:
continuous_var

### Resting Blood Pressure vs Target

Resting blood pressure (in mm Hg on admission to the hospital). Anything above 130-140 is typically cause for concern.

In [None]:
p17 = plot_cont_var('trestbps')
show(p17)

In this plot, most patients with heart disease have a resting blood pressure in the range of 120 to 160 mm Hg.

### Cholesterol vs Target

In [None]:
p18 = plot_cont_var('chol')
show(p18)

We can see that patients with a cholesterol level greater than 200 mg/dl tended to have heart disease.

### Thalach vs Target

maximum heart rate achieved

In [None]:
p19 = plot_cont_var('thalach')
show(p19)

The patients having maximum heart rate greater than 150 are at a greater risk of having heart disease.

### Correlation Matrix

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn is used for the correlation matrix because it is
# much quicker and easier here than Bokeh.

In [None]:
corr_matrix = df.corr()
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(corr_matrix,
                 annot = True,
                 linewidths = 0.5,
                 fmt = ".2f",
                 cmap = "YlGnBu");
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

The correlation of each feature with the target is hard to read from the full correlation matrix because there are so many features, so let's visualize it in another way.

In [None]:
df.drop('target', axis=1).corrwith(df.target).plot(kind = 'bar', grid = True, 
                                                   figsize = (12, 8), 
                                                   title = "Correlation with Target")

We can see that 'fbs' and 'chol' are the least correlated with 'target', whereas most of the other features show a noticeably stronger correlation with it.
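Reading the weakest and strongest features off a bar chart can be error-prone, so a ranked table is a useful complement. This is a minimal sketch on hypothetical synthetic data (the `strong` and `noise` columns are invented for illustration); it ranks features by the absolute value of their Pearson correlation with the label.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: 'strong' is tied to the label, 'noise' is not.
rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "strong": label + rng.normal(0, 0.3, size=200),
    "noise":  rng.normal(0, 1.0, size=200),
    "target": label,
})

# Pearson correlation of every feature with the target, ranked by magnitude.
corr = toy.drop(columns="target").corrwith(toy["target"])
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked.round(2))
```

Sorting by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.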

# 3. Data Pre-processing

Since the continuous variables are measured on very different scales, we need to scale the data so that features with large numeric ranges do not dominate the model and become the deciding factor in predicting whether the patient has heart disease. We also need to convert the categorical variables into dummy variables.
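To make the dummy-variable step concrete, here is a minimal sketch on a hypothetical four-row frame (the values are made up); it shows how `pd.get_dummies` expands one categorical column while leaving a continuous one alone.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one continuous column.
toy = pd.DataFrame({"cp": [0, 2, 1, 2], "age": [63, 37, 41, 56]})

# Each category of 'cp' becomes its own indicator column; 'age' is left untouched.
encoded = pd.get_dummies(toy, columns=["cp"])

print(encoded.columns.tolist())  # ['age', 'cp_0', 'cp_1', 'cp_2']
```

Each row has exactly one of the `cp_*` indicators set, so the model no longer treats the category codes as ordered numbers.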

In [None]:
# 'target' is the label, so it must not be one-hot encoded.
categorical_var.remove('target')
dataframe = pd.get_dummies(df, columns = categorical_var)

In [None]:
dataframe.head()

In [None]:
dataframe.columns

Before we scale the data, we need to split it into train and test sets. We cannot scale before splitting because the test set stands in for real-world data that the trained model has never seen. Therefore, we fit the scaler on the training data only and use it to transform the test data.

In [None]:
X = dataframe.drop('target', axis = 1)
y = dataframe['target']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [None]:
X_train_std = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X_train_std

In [None]:
X_test

We can see that both the training data (stored as X_train_std) and X_test have been scaled. Now we can apply Machine Learning Algorithms.
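The effect of fitting the scaler on the training split only can be checked numerically. This is a minimal sketch on hypothetical synthetic data (a single made-up feature with a large scale); the variable names are illustrative and separate from the notebook's.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic feature with a large scale (think cholesterol in mg/dl).
rng = np.random.default_rng(0)
X = rng.normal(loc=240, scale=50, size=(200, 1))
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)   # mean and std are learned from the training split only
X_te_std = scaler.transform(X_te)       # the test split reuses those training statistics

print("train mean/std:", X_tr_std.mean().round(3), X_tr_std.std().round(3))
print("test  mean/std:", X_te_std.mean().round(3), X_te_std.std().round(3))
```

The training split ends up with mean 0 and standard deviation 1 by construction, while the test split is only approximately standardized, which is expected because it never influenced the scaler.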

# 4. Training Machine Learning Algorithms

Before we train any model, I will create a function which will help to evaluate the model.

In [None]:
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

In [None]:
def evaluation(model, x_train_std, y_train, x_test, y_test, train = True):
    if train == True:
        pred = model.predict(x_train_std)
        classifier_report = pd.DataFrame(classification_report(y_train, pred, output_dict = True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"F1 Score: {round(f1_score(y_train, pred), 2)}")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{classifier_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    if train == False:
        pred = model.predict(x_test)
        classifier_report = pd.DataFrame(classification_report(y_test, pred, output_dict = True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"F1 Score: {round(f1_score(y_test, pred), 2)}")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{classifier_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver = 'liblinear')
lr.fit(X_train_std, y_train)

evaluation(lr, X_train_std, y_train, X_test, y_test, True)
evaluation(lr, X_train_std, y_train, X_test, y_test, False)

Through Logistic Regression we were able to achieve Training Accuracy of 88.55 % and Testing Accuracy of 86.84 %.
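The accuracy and F1 score in the report both derive from the confusion matrix. The following is a minimal sketch with hypothetical labels and predictions (made up for illustration) that recomputes both metrics by hand and can be compared with scikit-learn's `accuracy_score` and `f1_score`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true labels and predictions (1 = heart disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)              # share of predicted positives that are correct
recall = tp / (tp + fn)                 # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

In a medical setting the `fn` cell (patients with disease predicted as healthy) is usually the costliest error, which is one reason to look at recall alongside accuracy.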

In [None]:
train_score_lr = round(accuracy_score(y_train, lr.predict(X_train_std)) * 100, 2)
test_score_lr = round(accuracy_score(y_test, lr.predict(X_test)) * 100, 2)

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators = 400)
rfc.fit(X_train_std, y_train)

evaluation(rfc, X_train_std, y_train, X_test, y_test, True)
evaluation(rfc, X_train_std, y_train, X_test, y_test, False)

Through Random Forest Classifier we were able to achieve Training Accuracy of 100 % and Testing Accuracy of 84.21 %.

In [None]:
train_score_rfc = round(accuracy_score(y_train, rfc.predict(X_train_std)) * 100, 2)
test_score_rfc = round(accuracy_score(y_test, rfc.predict(X_test)) * 100, 2)

Now we will determine the right number of n_estimators to be used:

In [None]:
accuracy_scores = []
for i in range(1, 1000, 100):
    rfc = RandomForestClassifier(n_estimators = i)
    rfc.fit(X_train_std, y_train)
    accuracy_scores.append(accuracy_score(y_test, rfc.predict(X_test)))
print(accuracy_scores)

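We pick the number of trees from the value with the highest accuracy in this list. Note that the loop tries 1, 101, 201, ..., 901 trees and the forest is not seeded, so the best value can change between runs; the model above uses 400 trees.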
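Choosing `n_estimators` by test accuracy lets the test set influence the model. One alternative is cross-validation on the training data. This is a minimal sketch, assuming synthetic data from `make_classification` in place of `X_train_std` and `y_train`, and a small illustrative grid rather than the values tried above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for X_train_std / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

cv_scores = {}
for n in (10, 50, 100):  # small illustrative grid, not the notebook's values
    # random_state makes each forest reproducible between runs.
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    # Mean accuracy over 5 folds of the training data; the test set is never used.
    cv_scores[n] = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy").mean()

best_n = max(cv_scores, key=cv_scores.get)
print(cv_scores, "best:", best_n)
```

With a fixed `random_state` the comparison is repeatable, and the test set stays untouched until the final evaluation.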

### K Nearest Neighbor

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Deciding the right number of Neighbors.

In [None]:
accuracy_scores = []

for i in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train_std, y_train)
    accuracy_scores.append(accuracy_score(y_test, knn.predict(X_test)))
    
print(accuracy_scores)

For now we will take the number of neighbors to be 9. 

In [None]:
knn = KNeighborsClassifier(n_neighbors = 9)
knn.fit(X_train_std, y_train)

evaluation(knn, X_train_std, y_train, X_test, y_test, True)
evaluation(knn, X_train_std, y_train, X_test, y_test, False)

In [None]:
train_score_knn = round(accuracy_score(y_train, knn.predict(X_train_std)) * 100, 2)
test_score_knn = round(accuracy_score(y_test, knn.predict(X_test)) * 100, 2)

### Support Vector Machine

In [None]:
from sklearn.svm import SVC

svm = SVC(kernel='rbf', gamma=0.1, C=1.0)
svm.fit(X_train_std, y_train)

evaluation(svm, X_train_std, y_train, X_test, y_test, True)
evaluation(svm, X_train_std, y_train, X_test, y_test, False)

In [None]:
train_score_svm = round(accuracy_score(y_train, svm.predict(X_train_std)) * 100, 2)
test_score_svm = round(accuracy_score(y_test, svm.predict(X_test)) * 100, 2)

### Summary

In [None]:
models = {
           'Train Accuracy': [train_score_lr, train_score_rfc, train_score_knn, train_score_svm],
          'Test Accuracy' : [test_score_lr, test_score_rfc, test_score_knn, test_score_svm]
         }

models = pd.DataFrame(models, index = ['Logistic Regression', 'Random Forest Classifier', 'K-Nearest Neighbor', 'Support Vector Machine'])

In [None]:
models.head()

# 5. Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

### Logistic Regression with Hyperparameter Tuning

In [None]:
params = {
        "C": np.logspace(-4, 4, 20), # For Regularization
          "solver": ["liblinear"]}

lr = LogisticRegression()

lr_cv = GridSearchCV(lr, params, scoring = "accuracy", n_jobs = -1, verbose = 1, cv = 5)

In [None]:
lr_cv.fit(X_train_std, y_train)

In [None]:
best_params = lr_cv.best_params_
print(f"Best parameters: {best_params}")

In [None]:
lr = LogisticRegression(**best_params)

lr.fit(X_train_std, y_train)

evaluation(lr, X_train_std, y_train, X_test, y_test, True)
evaluation(lr, X_train_std, y_train, X_test, y_test, False)

In [None]:
train_score_lr = round(accuracy_score(y_train, lr.predict(X_train_std)) * 100, 2)
test_score_lr = round(accuracy_score(y_test, lr.predict(X_test)) * 100, 2)

### K-nearest neighbors with Hyperparameter Tuning
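One option for choosing k is a cross-validated grid search on the training data, which keeps the test set out of model selection. This is a minimal sketch, assuming synthetic data from `make_classification` in place of the unscaled training split; the pipeline step names `scale` and `knn` are arbitrary labels chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the unscaled training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaling lives inside the pipeline, so each fold is scaled with its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 30, 2))},  # odd k avoids ties in binary voting
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline is a design choice: it prevents the validation folds from leaking into the scaling statistics, mirroring the train/test reasoning used earlier for `StandardScaler`.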

In [None]:
train_score = []
test_score = []
neighbors = range(1, 30)

for k in neighbors:
    model = KNeighborsClassifier(n_neighbors = k)
    model.fit(X_train_std, y_train)
    train_score.append(accuracy_score(y_train, model.predict(X_train_std)))
    # Also record the test accuracy; the train score alone typically peaks at k = 1.
    test_score.append(accuracy_score(y_test, model.predict(X_test)))

In [None]:
plt.figure(figsize=(12, 8))

plt.plot(neighbors, train_score, label="Train score")
plt.plot(neighbors, test_score, label="Test score")
plt.xticks(np.arange(1, 30, 1))
plt.xlabel("Number of Neighbors")
plt.ylabel("Model Score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(test_score)*100:.2f}%")

In [None]:
knn = KNeighborsClassifier(n_neighbors = 27)
knn.fit(X_train_std, y_train)

evaluation(knn, X_train_std, y_train, X_test, y_test, True)
evaluation(knn, X_train_std, y_train, X_test, y_test, False)