## Let's build a classifier that predicts whether a person has heart disease.
## I will try logistic regression, random forest, and gradient-boosted trees (GBT).

In [3]:
import pandas as pd
df = pd.read_csv("/home/yair/Documents/Bar-Ilan/third-year/semester2/Statistical-Theory/heart_disease_dataset.csv")
df.head()

Unnamed: 0,Age,Gender,Cholesterol,Blood Pressure,Heart Rate,Smoking,Alcohol Intake,Exercise Hours,Family History,Diabetes,Obesity,Stress Level,Blood Sugar,Exercise Induced Angina,Chest Pain Type,Heart Disease
0,75,Female,228,119,66,Current,Heavy,1,No,No,Yes,8,119,Yes,Atypical Angina,1
1,48,Male,204,165,62,Current,Nothing,5,No,No,No,9,70,Yes,Typical Angina,0
2,53,Male,234,91,67,Never,Heavy,3,Yes,No,Yes,5,196,Yes,Atypical Angina,1
3,69,Female,192,90,72,Current,Nothing,4,No,Yes,No,7,107,Yes,Non-anginal Pain,0
4,62,Female,172,163,93,Never,Nothing,6,No,Yes,No,2,183,Yes,Asymptomatic,0
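The `Unnamed: 0` column in the output above usually appears when a CSV was saved together with its row index. Unless it is dropped, it stays in the DataFrame and is later passed to the models as a feature. A minimal sketch of one way to avoid that, using a small in-memory CSV (a hypothetical excerpt, not the real file) and the `index_col` parameter of `pd.read_csv`:

```python
import io
import pandas as pd

# Hypothetical three-row excerpt with an unnamed first column, as in the output above
csv_text = """,Age,Gender,Heart Disease
0,75,Female,1
1,48,Male,0
2,53,Male,1
"""

# Default read: the saved index becomes a regular column called "Unnamed: 0"
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead of a feature
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

Whether the column should be removed in this notebook depends on how the file was created, so treat this as an option to consider rather than a required step.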


#### Let's normalize and encode categorical variables:

In [6]:
from sklearn.preprocessing import LabelEncoder

# Encoding binary categorical variables into numerical values for better analysis
binary_columns = ['Gender', 'Family History', 'Diabetes', 'Obesity', 'Exercise Induced Angina']
for col in binary_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])


# One-hot encoding for categorical variables with more than two categories
df = pd.get_dummies(df, columns=['Smoking', 'Chest Pain Type', 'Alcohol Intake'], drop_first=True)
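To make the two encoding steps concrete, here is a small sketch on toy data. The rows and the `Former` smoking level are hypothetical and only serve as an illustration; `LabelEncoder` and `pd.get_dummies` are used the same way as in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical rows, not taken from the heart disease dataset)
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Smoking": ["Current", "Never", "Former"],
})

# LabelEncoder sorts the classes, so Female -> 0 and Male -> 1
le = LabelEncoder()
toy["Gender"] = le.fit_transform(toy["Gender"])
gender_mapping = {str(label): code for code, label in enumerate(le.classes_)}

# drop_first=True drops the first level ("Current"), which becomes
# the implicit baseline category for the remaining dummy columns
toy = pd.get_dummies(toy, columns=["Smoking"], drop_first=True)

print(gender_mapping)
print(list(toy.columns))
```

Dropping the first level avoids perfectly collinear dummy columns, which matters mostly for the logistic regression model.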



In [8]:
# Standardizing the numerical features (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler
numerical_columns = ['Age', 'Cholesterol', 'Blood Pressure', 'Heart Rate', 'Exercise Hours', 'Stress Level', 'Blood Sugar']
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
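The cell above fits the scaler on the full dataset, so the test rows also contribute to the means and standard deviations. A common alternative is to fit the scaler on the training rows only and reuse those statistics for the test rows. The sketch below shows the idea on synthetic data; the arrays `X` and `y` are stand-ins, not the heart disease features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 numerical features, binary target
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=40
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # apply the training statistics to the test rows

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

With this ordering the test set plays no part in the preprocessing, which gives a slightly more honest estimate of performance on unseen data.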



#### Split into train and test sets:

In [13]:
# Splitting the data into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    df.drop('Heart Disease', axis=1),
    df['Heart Disease'],
    train_size=0.8,
    random_state=40,
)
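If the two classes are not equally frequent, a random split can leave the test set with a different class ratio than the training set. `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both parts. A minimal sketch on toy labels (a hypothetical 60/40 class balance, not measured from the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 60 negatives and 40 positives (hypothetical class balance)
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=40, stratify=y
)

# Both parts keep the original 40% share of positives
print(y_train.mean(), y_test.mean())
```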

In [14]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(x_train,y_train)

In [15]:
from sklearn.ensemble import GradientBoostingClassifier
GBT_model = GradientBoostingClassifier()
GBT_model.fit(x_train,y_train)


In [16]:
from sklearn.ensemble import RandomForestClassifier
RF_model = RandomForestClassifier()
RF_model.fit(x_train,y_train)


#### And now let's compare the models:

In [19]:
y_pred_log = log_reg.predict(x_test)

y_pred_rf = RF_model.predict(x_test)

y_pred_GBT = GBT_model.predict(x_test)



from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred_all = {"GBT":y_pred_GBT,"Random forest":y_pred_rf,"Logistic regression":y_pred_log}

for model_name, y_pred in y_pred_all.items():
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"accuracy of {model_name} is: {accuracy}")
    print(f"precision of {model_name} is: {precision}")
    print(f"recall of {model_name} is: {recall}")
    print(f"f1 of {model_name} is: {f1}")
    print("")

accuracy of GBT is: 1.0
precision of GBT is: 1.0
recall of GBT is: 1.0
f1 of GBT is: 1.0

accuracy of Random forest is: 1.0
precision of Random forest is: 1.0
recall of Random forest is: 1.0
f1 of Random forest is: 1.0

accuracy of Logistic regression is: 0.875
precision of Logistic regression is: 0.863013698630137
recall of Logistic regression is: 0.8076923076923077
f1 of Logistic regression is: 0.8344370860927153
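To see how the four metrics relate to each other, they can be recomputed by hand from confusion-matrix counts. The counts below are an assumption: they are one set of values that reproduces the logistic regression scores printed above if the test set has 200 rows. The notebook does not show the test set size or the confusion matrix.

```python
# Assumed confusion-matrix counts (inferred, not printed by the notebook):
# true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 63, 10, 15, 112

accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of correct predictions
precision = tp / (tp + fp)                           # predicted positives that are real
recall = tp / (tp + fn)                              # real positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Under these assumed counts the model misses 15 of 78 patients with heart disease, which is the number that recall captures and that often matters most in a medical setting.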



### We actually got 100% accuracy with both GBT and Random forest, which means that both of them got a perfect score on the test set. So I will take GBT as my model.
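A perfect score on a single train/test split is worth confirming on more than one split. One possible check is k-fold cross-validation with `cross_val_score`. The sketch below runs it on synthetic data from `make_classification`, which is only a stand-in for the heart disease features; the fold count and the `f1` scoring choice are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the heart disease dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=40)

model = GradientBoostingClassifier(random_state=40)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat
scores = cross_val_score(model, X, y, cv=5, scoring="f1")

print(f"mean f1: {scores.mean():.3f}, std: {scores.std():.3f}")
```

If the scores on the real data stay at 1.0 across all folds, the result is stable across splits; a large spread between folds would suggest the single split was a lucky one.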
