# Heart Disease Prediction

Heart disease prediction using machine learning involves creating a model that can identify the likelihood of a person having heart disease based on their health data. The model is trained using historical data, including factors like age, blood pressure, cholesterol levels, and more. By analyzing these patterns, the model can predict the risk of heart disease in new patients. This helps doctors make better-informed decisions and potentially catch heart disease early. The goal is to improve patient outcomes through accurate, data-driven predictions.

# Information about data

*Age: How old a person is in years.
    
*Sex: Whether the person is male (1) or female (0).
    
*CP (Chest Pain Type): Describes the type of chest pain the person experiences.
Values:

    0: Typical angina (chest pain related to heart).

    1: Atypical angina (chest pain not related to heart).

    2: Non-anginal pain (typically musculoskeletal).

    3: Asymptomatic (no chest pain).    
    
*Trestbps (Resting Blood Pressure):This is the measurement of how forcefully blood is being pumped through the arteries of the body when a person is relaxed and not doing any physical activity.

Unit: It is measured in millimeters of mercury (mmHg), which is a standard unit for measuring blood pressure.
    
*Chol (Cholesterol):The person's cholesterol level.

Unit: Milligrams per deciliter (mg/dL)

*FBS (Fasting Blood Sugar):

This refers to the amount of sugar (glucose) present in the blood after a person has not eaten (fasted) for at least 8 hours. It helps doctors assess how well the body is processing sugar, which is important for understanding conditions like diabetes and overall health.
Values:

    1: Higher than 120 mg/dL.

    0: Lower than 120 mg/dL.

*Restecg (Resting Electrocardiographic Results):

Meaning: This describes the results of a person's resting electrocardiogram (ECG or EKG), which is a test that records the electrical activity of the heart.

Values:

    0: The ECG shows a normal result, indicating that the heart's electrical activity is within normal ranges.

    1:The ECG shows abnormalities in the ST-T wave, which can include T wave inversion or ST elevation or depression of more          than 0.05 millivolts (mV).These abnormalities can indicate potential issues with the heart's function.

    2: The ECG suggests that the person may have left ventricular hypertrophy, which means the muscle of the heart's left              pumping chamber (ventricle) has thickened. This can be a sign of heart disease or high blood pressure affecting the              heart.

These values help doctors understand how the heart is functioning and whether there are any potential problems that need further investigation or treatment.

*Thalach (Maximum Heart Rate Achieved):

Meaning: The person's highest heart rate achieved during exercise.

Unit: Beats per minute (bpm).

*Exang (Exercise Induced Angina):

Meaning: This indicates whether a person experiences chest pain (angina) when they exercise.

Values:

1: Yes, exercise induces chest pain (angina). This means that when the person engages in physical activity, such as walking or running, they feel discomfort or pain in their chest.

0: No, exercise does not induce chest pain (angina). This means that the person does not experience chest pain during physical activity.

Exercise-induced angina is a symptom that can indicate underlying heart problems, and its presence or absence helps doctors assess cardiovascular health and recommend appropriate treatments or lifestyle changes.

*Oldpeak (ST Depression Induced by Exercise Relative to Rest):

Meaning: This measures how much the ST segment of a person's electrocardiogram (ECG or EKG) decreases during exercise compared to when they are at rest. The ST segment is a part of the ECG that reflects the heart's electrical activity.

Unit: It is measured in millimeters (mm), which indicates the amount of change observed in the ST segment on the ECG.

Explanation:

During an ECG, doctors look at the ST segment to assess how the heart is functioning. If the ST segment decreases (moves lower) during exercise compared to rest, it can indicate that the heart is not receiving enough oxygen. This condition is known as ST depression and can be a sign of heart disease or other cardiovascular issues. The measurement in millimeters helps quantify this change for diagnostic purposes.

*Slope (Slope of the Peak Exercise ST Segment):

Meaning: This describes the shape or angle of the ST segment on a person's electrocardiogram (ECG or EKG) during peak exercise, which is typically a stress test where the person exercises vigorously.

Values:

    0: Upsloping. The ST segment slopes upward during peak exercise. This is considered a normal finding in many cases.

    1: Flat. The ST segment remains relatively unchanged during peak exercise, maintaining a horizontal or flat position.

    2: Downsloping. The ST segment slopes downward during peak exercise. This can indicate potential heart problems, such as            coronary artery disease or ischemia (reduced blood flow to the heart).

Explanation:
During a stress test, doctors monitor the ECG to see how the heart responds to exercise. The slope of the ST segment helps them assess whether there are signs of stress or strain on the heart muscle. Different slopes can provide valuable information about the heart's health and how it reacts to physical exertion.

*Thal (Thalassemia):

Meaning: Thalassemia is a genetic blood disorder that affects how the body makes hemoglobin, the protein in red blood cells that carries oxygen.

Values:

    3:Normal. This means there are no abnormalities related to thalassemia.

    6:Fixed defect. Indicates there's no normal blood flow in some parts of the heart, possibly due to scarring or blockage.

    7:Reversible defect. Shows there is blood flow in the heart, but it's not normal. This could be due to temporary issues          affecting blood flow.

Explanation:

Thalassemia affects hemoglobin production, leading to anemia and other health problems. In heart disease assessment, the Thal values help doctors see if there are additional heart-related concerns. For instance, values 6 or 7 might indicate issues like poor blood flow or structural heart abnormalities that need attention and further evaluation





# Importing the Dependencies

In [20]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score,accuracy_score,precision_score,recall_score

# Data Collection

In [3]:
heart_dataset=pd.read_csv("heart_disease_data.csv")
heart_dataset

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [4]:
heart_dataset.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [5]:
heart_dataset.shape

(303, 14)

In [6]:
#checking any missing values
heart_dataset.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

There is no missing values occur

In [7]:
#summary of the dataset
heart_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [8]:
# statistical measures of the dataset
heart_dataset.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [9]:
# checking distribution of target variable
heart_dataset['target'].value_counts()

1    165
0    138
Name: target, dtype: int64

# Splitting the Features and Target

In [10]:
X=heart_dataset.drop(columns='target',axis=1)
Y=heart_dataset['target']

In [11]:
print(X)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
298   57    0   0       140   241    0        1      123      1      0.2   
299   45    1   3       110   264    0        1      132      0      1.2   
300   68    1   0       144   193    1        1      141      0      3.4   
301   57    1   0       130   131    0        1      115      1      1.2   
302   57    0   1       130   236    0        0      174      0      0.0   

     slope  ca  thal  
0        0   0     1  
1        0   0     2  
2        2   0    

In [12]:
print(Y)

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64


# Splitting Training and Testing Data

In [18]:
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,random_state=32)

In [14]:
print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

(242, 13) (242,) (61, 13) (61,)


# Model Training

In [21]:
models = {
   'lr':LogisticRegression(),
   'rfc':RandomForestClassifier(),
   'dtc':DecisionTreeClassifier(),
   'knn':KNeighborsClassifier()
}

for name, mod in models.items():
    mod.fit(X_train,y_train)
    y_pred=mod.predict(X_test)
    print(f"{name} accuracy score: {accuracy_score(y_test,y_pred)} Precisionscore {precision_score(y_test,y_pred)}  recallscore {precision_score(y_test,y_pred)} F1score {f1_score(y_test,y_pred)}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


lr accuracy score: 0.8360655737704918 Precisionscore 0.8  recallscore 0.8 F1score 0.8484848484848486
rfc accuracy score: 0.8524590163934426 Precisionscore 0.8055555555555556  recallscore 0.8055555555555556 F1score 0.8656716417910448
dtc accuracy score: 0.7704918032786885 Precisionscore 0.7428571428571429  recallscore 0.7428571428571429 F1score 0.787878787878788
knn accuracy score: 0.7049180327868853 Precisionscore 0.6857142857142857  recallscore 0.6857142857142857 F1score 0.7272727272727272


Given that the accuracy score, precision score, recall score, and F1 score for both logistic regression and the random forest classifier are the same, I will choose one model to proceed with the next step.

# Select Model-RandomForestClassifier

In [16]:
rfc= RandomForestClassifier()
rfc.fit(X_train,y_train)
y_pred=rfc.predict(X_test)

In [17]:
print(y_pred)

[1 0 1 1 1 1 0 0 1 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1
 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1]


# Making a Predictive System

In [43]:
input_data = (63,1,3,145,233,1,0,150,0,2.3,0,0,1)
#changing the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the data as we are predicting the label for only one instance
input_data_reshaped =input_data_as_numpy_array.reshape(1,-1)
prediction =rfc.predict(input_data_reshaped)
print(prediction) 

[1]




In [44]:
print(y_train[0])

1


# Conclusion

In my project, I developed three different machine learning models: Logistic Regression, Random Forest classifier, and Decision Tree classifier. To evaluate the performance of these models, I used four key metrics: accuracy_score,Precision_score,recall_score,F1_score 

Model Evaluation Results:

Logistic Regression:

Accuracy_Score: 0.8360655737704918

Precision_Score:0.8

Recall_Score:0.8

F1_score: 0.8484848484848486
    
Random Forest Classifier:

Accuracy_Score: 0.8360655737704918

Precision_Score:0.8

Recall_Score:0.8

F1_score: 0.8484848484848486
    
Decision Tree Classifier:

Accuracy_Score:0.7704918032786885

Precision_Score:0.7428571428571429

Recall_Score: 0.7428571428571429

F1_score: 0.787878787878788
    
Conclusion:

Based on the evaluation metrics, Logistic Regression and Random Forest Classifier are the best performing models for this project. The accuracy score (0.836), precision score (0.8), recall score (0.8), and F1 score (0.848) are equal for both models.These indicates both the model give the accurate prediction