# Data Exploration


    Age: age of the patient [years]
    Sex: sex of the patient [M: Male, F: Female]
    ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
    RestingBP: resting blood pressure [mm Hg]
    Cholesterol: serum cholesterol [mg/dl]
    FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
    RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
    MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
    ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
    Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
    ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
    HeartDisease: output class [1: heart disease, 0: Normal]
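
The categorical codes listed above can be turned into a quick sanity check on the loaded table. The following is a minimal sketch, not part of the original notebook: the `ALLOWED` dictionary and the `unexpected_values` helper are hypothetical names, and the small `sample` frame (including its deliberately invalid `"X"` entry) is made up for illustration.

```python
import pandas as pd

# Allowed codes, copied from the data dictionary above.
ALLOWED = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
    "HeartDisease": {0, 1},
}


def unexpected_values(df, allowed=ALLOWED):
    """Return {column: set of values that the data dictionary does not list}."""
    problems = {}
    for col, valid in allowed.items():
        if col in df.columns:
            extra = set(df[col].dropna().unique()) - valid
            if extra:
                problems[col] = extra
    return problems


# Illustrative frame; the "X" entry is invalid on purpose.
sample = pd.DataFrame({
    "Sex": ["M", "F", "X"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "HeartDisease": [0, 1, 1],
})
print(unexpected_values(sample))
```

Once `heart` has been loaded, `unexpected_values(heart)` should return an empty dictionary if every column follows the coding scheme.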


In [16]:
import pandas as pd
import numpy as np

In [17]:
heart = pd.read_csv('heart.csv')

In [18]:
heart

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1


In [19]:
heart.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [20]:
heart.tail()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1
917,38,M,NAP,138,175,0,Normal,173,N,0.0,Up,0


In [21]:
heart.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0
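
The summary above reports a minimum of 0 for both `RestingBP` and `Cholesterol`. A reading of 0 is not physiologically plausible, so these entries may be placeholders for missing measurements; that interpretation is an assumption, since the source does not say. The sketch below counts such zeros and converts them to `NaN`. The helper names and the `demo` frame are hypothetical and only for illustration.

```python
import pandas as pd


def zero_counts(df, columns=("RestingBP", "Cholesterol")):
    """Count entries equal to 0 in each of the given columns."""
    return {col: int((df[col] == 0).sum()) for col in columns}


def zeros_to_nan(df, columns=("RestingBP", "Cholesterol")):
    """Return a copy in which zeros in the given columns become NaN."""
    out = df.copy()
    for col in columns:
        # mask() replaces values with NaN wherever the condition is True.
        out[col] = out[col].mask(out[col] == 0)
    return out


# Made-up values, not taken from heart.csv.
demo = pd.DataFrame({"RestingBP": [140, 0, 130], "Cholesterol": [289, 0, 0]})
print(zero_counts(demo))

cleaned = zeros_to_nan(demo)
print(cleaned.isna().sum())
```

Running `zero_counts(heart)` on the real table would show how many rows are affected before deciding whether to impute or drop them.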


In [22]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [23]:
heart['Age'].isnull().sum()

0

In [24]:
print(heart['Sex'].value_counts())
# notna() counts the non-missing entries directly and avoids positional
# indexing into the label-indexed result of value_counts().
print('Non-missing values:', heart['Sex'].notna().sum())

Sex
M    725
F    193
Name: count, dtype: int64
Non-missing values: 918


In [25]:
heart.duplicated().sum()

0

In [26]:
heart.isna().sum()

Unnamed: 0,0
Age,0
Sex,0
ChestPainType,0
RestingBP,0
Cholesterol,0
FastingBS,0
RestingECG,0
MaxHR,0
ExerciseAngina,0
Oldpeak,0
ST_Slope,0
HeartDisease,0


# Data Preprocessing

In [27]:
for col in heart.columns:
    if heart[col].dtype in ['object', 'category']:
        print(col, heart[col].unique())


Sex ['M' 'F']
ChestPainType ['ATA' 'NAP' 'ASY' 'TA']
RestingECG ['Normal' 'ST' 'LVH']
ExerciseAngina ['N' 'Y']
ST_Slope ['Up' 'Flat' 'Down']


In [28]:
# Work on a copy so the encoded columns are not added to `heart` as well
new_ = heart.copy()

In [29]:
from sklearn.preprocessing import LabelEncoder

sex_encoder = LabelEncoder()

chest_pain_encoder = LabelEncoder()

resting_ecg_encoder = LabelEncoder()

exercise_angina_encoder = LabelEncoder()

st_slope_encoder = LabelEncoder()

new_["Encoder_sex"] = sex_encoder.fit_transform(new_["Sex"])
new_["Encoder_ChestPainType"] = chest_pain_encoder.fit_transform(new_["ChestPainType"])
new_["Encoder_RestingECG"] = resting_ecg_encoder.fit_transform(new_["RestingECG"])
new_["Encoder_ExerciseAngina"] = exercise_angina_encoder.fit_transform(new_["ExerciseAngina"])
new_["Encoder_ST_Slope"] = st_slope_encoder.fit_transform(new_["ST_Slope"])


In [31]:
new_

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Encoder_sex,Encoder_ChestPainType,Encoder_RestingECG,Encoder_ExerciseAngina,Encoder_ST_Slope
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0,1,1,1,0,2
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1,0,2,1,0,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0,1,1,2,0,2
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1,0,0,1,1,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0,1,2,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1,1,3,1,0,1
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1,1,0,1,0,1
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1,1,0,1,1,1
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1,0,1,0,0,1


In [32]:
new_.shape

(918, 17)
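
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `ASY` becomes 0 and `TA` becomes 3 in the table above. Because chest pain type is a nominal category, those integers imply an ordering that distance-based or linear models may pick up on. The sketch below shows how to inspect the mapping and, as an alternative that is not used in this notebook, how to one-hot encode the same column with `pd.get_dummies`. The `chest_pain` series is a small stand-in for the real column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for heart["ChestPainType"], using the four codes in the data.
chest_pain = pd.Series(["ATA", "NAP", "ASY", "TA"], name="ChestPainType")

# classes_ holds the labels in sorted order; the position is the integer code.
encoder = LabelEncoder().fit(chest_pain)
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'ASY': 0, 'ATA': 1, 'NAP': 2, 'TA': 3}

# One-hot alternative: one 0/1 column per category, no implied order.
one_hot = pd.get_dummies(chest_pain, prefix="ChestPainType", dtype=int)
print(one_hot.columns.tolist())
```

The same inspection works on the fitted encoders above, for example `chest_pain_encoder.classes_`.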

# Classification Model
# ---> Algorithms

In [54]:
#Importing the classifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import  MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, confusion_matrix, classification_report, accuracy_score

In [55]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

attributes = ["Age", "Encoder_sex", "Encoder_ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
              "Encoder_RestingECG", "MaxHR", "Encoder_ExerciseAngina", "Oldpeak", "Encoder_ST_Slope"]

X = new_[attributes]
y = new_['HeartDisease']

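# Apply MinMax scaling only to features.
# Note: the scaler is fit on all rows before the split, so the test rows
# contribute to the minima and maxima used for scaling.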
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

# Save the scaled features to CSV
pd.DataFrame(X_scaled, columns=attributes).to_csv("features.csv", index=False)


In [56]:
X_scaled

array([[0.24489796, 1.        , 0.33333333, ..., 0.        , 0.29545455,
        1.        ],
       [0.42857143, 0.        , 0.66666667, ..., 0.        , 0.40909091,
        0.5       ],
       [0.18367347, 1.        , 0.33333333, ..., 0.        , 0.29545455,
        1.        ],
       ...,
       [0.59183673, 1.        , 0.        , ..., 1.        , 0.43181818,
        0.5       ],
       [0.59183673, 0.        , 0.33333333, ..., 0.        , 0.29545455,
        0.5       ],
       [0.20408163, 1.        , 0.66666667, ..., 0.        , 0.29545455,
        1.        ]])
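
In cell `In [55]` the `MinMaxScaler` is fit on every row before the train/test split, so the test rows influence the scaling. A common way to avoid this is to split first and wrap the scaler and the classifier in a pipeline, so the scaler only ever sees the training rows. The following is a minimal sketch of that pattern, not the notebook's own method: it uses synthetic data from `make_classification` in place of `heart.csv`, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in with 11 features, matching the length of `attributes`.
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

# Split first, so the test rows stay unseen during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() learns the scaling from X_tr only; predict() and score() reuse it.
model = make_pipeline(MinMaxScaler(), SVC(random_state=0))
model.fit(X_tr, y_tr)

print("Test accuracy on synthetic data:", model.score(X_te, y_te))
```

With this layout, test values can fall slightly outside the [0, 1] range after scaling, which is expected because the minima and maxima come from the training rows alone.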

In [57]:
#Creating and fitting the model, as well as generating predictions
svm_model = SVC(random_state=0)
svm_model.fit(X_train, y_train)
preds = svm_model.predict(X_test)

#Model evaluation
#Mean absolute error (for 0/1 labels this equals the misclassification rate, 1 - accuracy)

print("The mean absolute error:\n{}\n".format(mean_absolute_error(y_test, preds)))

#Accuracy score

svm_model_accuracy = accuracy_score(y_test, preds)
print("Accuracy score:\n{}\n".format(svm_model_accuracy))
svm_model_accuracy = round(accuracy_score(y_test, preds)*100,2)
print('Accuracy = ' , svm_model_accuracy ,' %')
accuracies = pd.DataFrame({"Algorithm":["SVM"], "Score":[svm_model_accuracy]})


#Classification report

print("Classification report: \n{}\n".format(classification_report(y_test, preds)))

#Confusion matrix

print("Confusion matrix: \n{}\n".format(confusion_matrix(y_test, preds)))


The mean absolute error:
0.15217391304347827

Accuracy score:
0.8478260869565217

Accuracy =  84.78  %
Classification report: 
              precision    recall  f1-score   support

           0       0.85      0.78      0.81        77
           1       0.85      0.90      0.87       107

    accuracy                           0.85       184
   macro avg       0.85      0.84      0.84       184
weighted avg       0.85      0.85      0.85       184


Confusion matrix: 
[[60 17]
 [11 96]]



In [58]:
#Creating and fitting the model, as well as generating predictions
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
preds = lr_model.predict(X_test)

#Model evaluation
#Mean absolute error

print("The mean absolute error:\n{}\n".format(mean_absolute_error(y_test, preds)))

#Accuracy score

lr_model_accuracy = accuracy_score(y_test, preds)
print("Accuracy score:\n{}\n".format(lr_model_accuracy))
lr_model_accuracy = round(accuracy_score(y_test, preds)*100,2)
print('Accuracy = ' , lr_model_accuracy ,' %')
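# Append to the results table instead of overwriting the earlier entry
accuracies = pd.concat([accuracies, pd.DataFrame({"Algorithm":["lr"], "Score":[lr_model_accuracy]})], ignore_index=True)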


#Classification report

print("Classification report: \n{}\n".format(classification_report(y_test, preds)))

#Confusion matrix

print("Confusion matrix: \n{}\n".format(confusion_matrix(y_test, preds)))

The mean absolute error:
0.15760869565217392

Accuracy score:
0.842391304347826

Accuracy =  84.24  %
Classification report: 
              precision    recall  f1-score   support

           0       0.83      0.78      0.81        77
           1       0.85      0.89      0.87       107

    accuracy                           0.84       184
   macro avg       0.84      0.83      0.84       184
weighted avg       0.84      0.84      0.84       184


Confusion matrix: 
[[60 17]
 [12 95]]



In [59]:
#Creating and fitting the model, as well as generating predictions
clf_model = RandomForestClassifier(n_estimators = 100)
clf_model.fit(X_train, y_train)
preds = clf_model.predict(X_test)

#Model evaluation
#Mean absolute error

print("The mean absolute error:\n{}\n".format(mean_absolute_error(y_test, preds)))

#Accuracy score

clf_model_accuracy = accuracy_score(y_test, preds)
print("Accuracy score:\n{}\n".format(clf_model_accuracy))
clf_model_accuracy = round(accuracy_score(y_test, preds)*100,2)
print('Accuracy = ' , clf_model_accuracy ,' %')
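# Append to the results table instead of overwriting the earlier entries
accuracies = pd.concat([accuracies, pd.DataFrame({"Algorithm":["clf"], "Score":[clf_model_accuracy]})], ignore_index=True)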


#Classification report

print("Classification report: \n{}\n".format(classification_report(y_test, preds)))

#Confusion matrix

print("Confusion matrix: \n{}\n".format(confusion_matrix(y_test, preds)))

The mean absolute error:
0.14130434782608695

Accuracy score:
0.8586956521739131

Accuracy =  85.87  %
Classification report: 
              precision    recall  f1-score   support

           0       0.85      0.81      0.83        77
           1       0.86      0.90      0.88       107

    accuracy                           0.86       184
   macro avg       0.86      0.85      0.85       184
weighted avg       0.86      0.86      0.86       184


Confusion matrix: 
[[62 15]
 [11 96]]
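
The three confusion matrices printed above are enough to compare the models side by side. The sketch below recomputes accuracy, sensitivity (recall for class 1) and specificity (recall for class 0) from the reported counts, assuming scikit-learn's layout of `[[TN, FP], [FN, TP]]`. The `reported` dictionary and the `summary` table are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Confusion matrices as printed above, laid out as [[TN, FP], [FN, TP]].
reported = {
    "SVM": [[60, 17], [11, 96]],
    "lr": [[60, 17], [12, 95]],
    "clf": [[62, 15], [11, 96]],
}

rows = []
for name, matrix in reported.items():
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    rows.append({
        "Algorithm": name,
        "Accuracy": (tp + tn) / total,   # share of correct predictions
        "Sensitivity": tp / (tp + fn),   # recall for class 1 (heart disease)
        "Specificity": tn / (tn + fp),   # recall for class 0 (normal)
    })

summary = pd.DataFrame(rows).set_index("Algorithm").round(4)
print(summary)
```

The accuracies reproduce the printed scores (0.8478, 0.8424, 0.8587). The random forest differs from the SVM only in specificity (62 vs. 60 true negatives out of 77), and because it was created without a `random_state` its numbers can shift between runs.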

