# HEART DISEASE PREDICTION

In this project, I've built a machine learning model that predicts whether a person's heart is healthy or whether they have heart disease, based on test results and other characteristics of the person. I've used logistic regression, a supervised ML technique, for that purpose. Logistic regression is a statistical method used for binary classification, and it can be extended to multiclass classification problems. Despite its name, it's used as a classification algorithm rather than a regression algorithm.
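To make the idea of logistic regression as a classifier concrete, here is a minimal sketch of the decision rule: a weighted sum of the features is passed through the sigmoid function to get a probability, and the probability is compared against a threshold. The weights, bias, and two-feature record below are made up for illustration; they are not the values learned later in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted label) for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical two-feature example; these numbers are assumptions.
weights = [0.8, -1.2]
bias = 0.1
probability, label = predict_label(weights, bias, [1.0, 0.5])
print(round(probability, 3), label)
```

The output of the sigmoid is a continuous probability, which is where the "regression" in the name comes from; the classification happens only at the thresholding step.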

For my project, I've used the heart dataset available on Kaggle. You can either download the dataset directly into the Colab session storage using the opendatasets module, or you can download it externally and then upload it to the session storage. I took the second route. You can download the dataset from [here](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset/download?datasetVersionNumber=2).

However, if you want to use the opendatasets module, you can do so with the following code:
```
!pip install opendatasets
import opendatasets as od
url='https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset/download?datasetVersionNumber=2'
od.download(url)


```
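Depending on which route you take, heart.csv may end up in a different place. The sketch below checks a couple of candidate locations and reports the first one that exists. The second path is an assumption about the folder opendatasets creates from the dataset name, and `first_existing` is a hypothetical helper, so adjust both to match your session.

```python
from pathlib import Path

def first_existing(candidates):
    """Return the first path in `candidates` that is an existing file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None

# Assumed locations: a manual upload to the Colab session root, or the
# folder that opendatasets is assumed to create from the dataset name.
CANDIDATE_PATHS = [
    "/content/heart.csv",
    "heart-disease-dataset/heart.csv",
]

csv_path = first_existing(CANDIDATE_PATHS)
if csv_path is None:
    print("heart.csv not found; check where the download was saved")
else:
    print(f"Found dataset at {csv_path}")
```

The resulting `csv_path` can then be passed to `pd.read_csv` in place of a hard-coded path.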

In [77]:
#Importing the libraries
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [78]:
#Loading the data
df=pd.read_csv('/content/heart.csv')

In [79]:
df

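| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |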


In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


In [81]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [82]:
df.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.434146 | 0.69561 | 0.942439 | 131.611707 | 246.0 | 0.149268 | 0.529756 | 149.114146 | 0.336585 | 1.071512 | 1.385366 | 0.754146 | 2.323902 | 0.513171 |
| std | 9.07229 | 0.460373 | 1.029641 | 17.516718 | 51.59251 | 0.356527 | 0.527878 | 23.005724 | 0.472772 | 1.175053 | 0.617755 | 1.030798 | 0.62066 | 0.50007 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


Let's see how many records we have for a healthy heart (represented by 0) and a diseased heart (represented by 1).

In [83]:
df['target'].value_counts()

1    526
0    499
Name: target, dtype: int64

Since we have an almost equal number of records in each category, we don't need to rebalance or filter the data any further.
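As a quick check on the "almost equal" claim, the class shares can be worked out from the counts printed above. This is a small sketch in plain Python; only the two counts come from the notebook, and the variable names are mine.

```python
# Class counts as reported by df['target'].value_counts() above.
class_counts = {1: 526, 0: 499}

total = sum(class_counts.values())
class_shares = {label: count / total for label, count in class_counts.items()}

# Ratio of the smaller class to the larger one; 1.0 means perfectly balanced.
balance_ratio = min(class_counts.values()) / max(class_counts.values())

for label, share in class_shares.items():
    print(f"class {label}: {share:.1%}")
print(f"balance ratio: {balance_ratio:.2f}")
```

The share of class 1 agrees with the mean of the target column in the `df.describe()` output, which is a useful sanity check for a 0/1 label.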

Now, we make two separate objects, X and Y, where X (a DataFrame) will contain the features used to check whether the person has a healthy heart or not, and Y (a Series) will contain the outcome of whether the heart is healthy or not.

In [84]:
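#Features: every column except the target (columns= already implies axis=1)
X = df.drop(columns='target')
#Labels: 0 = healthy heart, 1 = heart disease
Y = df['target']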

In [85]:
X

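| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 |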


In [86]:
Y

0       0
1       0
2       0
3       0
4       0
       ..
1020    1
1021    0
1022    0
1023    1
1024    0
Name: target, Length: 1025, dtype: int64

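Now, we can split the data into training and test sets by using the train_test_split function from the sklearn library. Here, 20% of the rows are held out for testing, and stratify=Y keeps the class proportions roughly the same in both sets.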

In [87]:
#Splitting the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=1)
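To see what the stratified split does, the sketch below repeats the same call on a synthetic stand-in. Only the label counts (526 ones and 499 zeros) and the split arguments come from this notebook; the single dummy feature column and the shortened variable names are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: the labels mirror the class counts
# from the notebook, and the one feature column is just a row index.
labels = np.array([1] * 526 + [0] * 499)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=1
)

print("train rows:", len(y_tr), "test rows:", len(y_te))
print("share of class 1 overall:", round(labels.mean(), 3))
print("share of class 1 in train:", round(y_tr.mean(), 3))
print("share of class 1 in test:", round(y_te.mean(), 3))
```

With stratification, the share of class 1 in each subset should stay close to its share in the full dataset, so neither subset is skewed toward one outcome.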

# Model Training

In [95]:
model= LogisticRegression(max_iter=904)

In [105]:
#Training the logistic regression model with training data
model.fit(X_train.values,Y_train.values)

# Model Evaluation

Now that our model is trained, let's check how accurately it predicts.

In [115]:
#accuracy on training data
X_train_prediction = model.predict(X_train.values)
#accuracy_score expects the true labels first, then the predictions
train_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [107]:
train_data_accuracy

0.8463414634146341

We can conclude that our model predicts correctly about 84.6% of the time on the data it was trained on. Now let's see its performance on test data.
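For reference, accuracy is simply the fraction of records whose predicted label matches the true label. The sketch below computes it by hand in plain Python; the `accuracy` helper and the toy labels are made up for illustration and are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels, invented for this example.
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(true_labels, predicted_labels))
# Swapping the arguments gives the same value, since equality is symmetric.
print(accuracy(predicted_labels, true_labels))
```

Because the comparison is symmetric, the order of the two arguments does not change the accuracy, although other metrics such as precision and recall do depend on it.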



In [117]:
#accuracy on test data
X_test_prediction = model.predict(X_test.values)
#accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [109]:
test_data_accuracy

0.8390243902439024

We can conclude that our model predicts correctly about 83.9% of the time on unseen data.
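The two accuracy figures can be turned back into record counts with a little arithmetic, which makes the train/test comparison easier to read. This is a sketch based on the numbers reported above (1025 rows, test_size=0.2, and the two accuracy values); the rounding of the test size follows my understanding of how scikit-learn sizes the split.

```python
import math

total_rows = 1025      # from df.info()
test_fraction = 0.2    # test_size passed to train_test_split

# Assumption: the test size is rounded up and the rest goes to training.
test_rows = math.ceil(total_rows * test_fraction)
train_rows = total_rows - test_rows

train_accuracy = 0.8463414634146341  # reported above
test_accuracy = 0.8390243902439024   # reported above

train_correct = round(train_accuracy * train_rows)
test_correct = round(test_accuracy * test_rows)

print(f"train: {train_correct} of {train_rows} correct")
print(f"test:  {test_correct} of {test_rows} correct")
print(f"gap:   {train_accuracy - test_accuracy:.4f}")
```

A gap of less than one percentage point between training and test accuracy suggests the model is not overfitting badly, though a single split is only a rough indication.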

Now that our model is trained and evaluated, let's use it to predict the outcome for individual records based on their feature values.

# Making Predictions

In [118]:
#Prediction

input_data = [71, 0, 0, 112, 149, 0, 1, 125, 0, 1.6, 1, 0, 2]

#Change input data to a numpy array
input_data = np.asarray(input_data)

#Reshape the array to a single row, as we're predicting for only one data point
input_data = input_data.reshape(1, -1)

prediction = model.predict(input_data)
print(prediction[0])

if prediction[0] == 1:
  print("The person has a heart disease")
else:
  print("The person's heart is healthy")

1
The person has a heart disease


In [119]:
#Prediction

input_data = [43, 0, 0, 132, 341, 1, 0, 136, 1, 3, 1, 0, 3]

#Change input data to a numpy array
input_data = np.asarray(input_data)

#Reshape the array to a single row, as we're predicting for only one data point
input_data = input_data.reshape(1, -1)

prediction = model.predict(input_data)
print(prediction[0])

if prediction[0] == 1:
  print("The person has a heart disease")
else:
  print("The person's heart is healthy")

0
The person's heart is healthy
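Since the two prediction cells repeat the same steps, they could be wrapped in a small helper. The sketch below is one possible way to do that; `describe_prediction` is a hypothetical function name, and `AlwaysHealthy` is a made-up stand-in that always predicts 0 so the example runs without the trained model. In the notebook you would pass the fitted `model` instead.

```python
import numpy as np

def describe_prediction(model, features):
    """Run `model.predict` on one record and return a readable message."""
    # A 1-D list of n features becomes a 2-D array of shape (1, n),
    # because predict expects one row per sample.
    row = np.asarray(features, dtype=float).reshape(1, -1)
    label = int(model.predict(row)[0])
    if label == 1:
        return "The person has a heart disease"
    return "The person's heart is healthy"

class AlwaysHealthy:
    """Hypothetical stand-in for a trained model; it always predicts 0."""
    def predict(self, rows):
        return np.zeros(len(rows), dtype=int)

record = [43, 0, 0, 132, 341, 1, 0, 136, 1, 3, 1, 0, 3]
print(describe_prediction(AlwaysHealthy(), record))
```

The helper only assumes that the object passed in has a `predict` method that accepts a 2-D array, which is the case for scikit-learn estimators.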


## CONCLUSION


In this project, I used logistic regression, a supervised learning technique, to build a model that predicts whether a person has a healthy heart or not, based on test results and characteristics of the person such as age and sex. The model reached an accuracy of about 84.6% on the training data and about 83.9% on the test data.

## References and Future Work

Below are links to the resources that I found useful during this project, where you can learn more about the tools and libraries used in it.


*   Kaggle Dataset: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset/download?datasetVersionNumber=2
*   Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
*   Numpy user guide: https://numpy.org/doc/stable/user/absolute_beginners.html
*   opendatasets Python library: https://github.com/JovianML/opendatasets
*   Logistic Regression user guide: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
