<a href="https://colab.research.google.com/github/saumilhj/projects/blob/main/Heart_Logistic_Reg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**CORONARY HEART DISEASE PREDICTION**

Dataset from Kaggle: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

##Import data

In [None]:
df = pd.read_csv('HeartDisease.csv')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB


In [None]:
df.drop(columns=['education'], inplace=True)

##Check NaN

In [None]:
df.isna().sum()

male                 0
age                  0
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64
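The raw counts are easier to judge as a share of the 4238 rows; glucose, for example, is missing in 388 / 4238 ≈ 9.2% of rows. A minimal sketch of that calculation, using a small made-up frame (`toy` and its values are illustrative assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical values, real column names)
toy = pd.DataFrame({
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "heartRate": [80.0, 95.0, 75.0, 65.0],
})

# isna().mean() is the fraction of missing rows per column
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```

Running the same expression on df would show which columns lose enough rows to make the choice of fill value matter.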

###Fill NaN of cigsPerDay

In [None]:
cig_mean = df['cigsPerDay'].mean()
cig_median = df['cigsPerDay'].median()
print(f'Mean of cigs:{cig_mean}')
print(f'Median of cigs:{cig_median}')

Mean of cigs:9.003088619624615
Median of cigs:0.0


The mean (9.0) is far above the median (0), so cigsPerDay is right-skewed. Fill the missing values with the median.

In [None]:
# assign the result back; chained inplace fillna is deprecated in recent pandas
df['cigsPerDay'] = df['cigsPerDay'].fillna(cig_median)
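The reasoning behind choosing the median can be checked on a small example. The sketch below uses a made-up series (`cigs_toy` is a hypothetical name with illustrative values) in which most people smoke nothing and a few smoke heavily, which is roughly the shape suggested by a mean of 9 and a median of 0:

```python
import numpy as np
import pandas as pd

# Hypothetical values: many non-smokers, a few heavy smokers, one missing entry
cigs_toy = pd.Series([0, 0, 0, 0, 0, 0, 20, 20, 30, np.nan])

toy_mean = cigs_toy.mean()      # pulled upward by the heavy smokers
toy_median = cigs_toy.median()  # stays at the typical value
toy_skew = cigs_toy.skew()      # positive for a right-skewed distribution

# Median imputation keeps the filled value at the most typical level
cigs_filled = cigs_toy.fillna(toy_median)
print(toy_mean, toy_median, toy_skew)
```

With the mean as the fill value, the missing rows would be assigned a cigarette count that almost nobody in the toy series actually has.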

###Fill NaN of BPMeds

In [None]:
df['BPMeds'].sum()

124.0

Only 124 of the 4185 people with a recorded value take blood pressure medication, so fill the missing values with 0, the most common value.

In [None]:
df['BPMeds'] = df['BPMeds'].fillna(0)

###Fill NaN of totChol

In [None]:
tc_mean = df['totChol'].mean()
tc_median = df['totChol'].median()
print(f'Mean of cholesterol:{tc_mean}')
print(f'Median of cholesterol:{tc_median}')

Mean of cholesterol:236.72158548233045
Median of cholesterol:234.0


The mean and median of total cholesterol are very close, which suggests the distribution is roughly symmetric. Fill the missing values with the mean.

In [None]:
df['totChol'] = df['totChol'].fillna(tc_mean)

###Fill NaN of BMI

In [None]:
bmi_mean = df['BMI'].mean()
bmi_median = df['BMI'].median()
print(f'Mean of BMI:{bmi_mean}')
print(f'Median of BMI:{bmi_median}')

Mean of BMI:25.80200758473572
Median of BMI:25.4


The mean and median of BMI are very close, which suggests the distribution is roughly symmetric. Fill the missing values with the mean.

In [None]:
df['BMI'] = df['BMI'].fillna(bmi_mean)

###Fill NaN of heartRate

Only one heartRate value is missing. Fill it with 72 BPM, a commonly cited typical resting heart rate.

In [None]:
df['heartRate'] = df['heartRate'].fillna(72)

###Fill NaN of glucose

In [None]:
gl_mean = df['glucose'].mean()
gl_median = df['glucose'].median()
print(f'Mean of Glucose:{gl_mean}')
print(f'Median of Glucose:{gl_median}')

Mean of Glucose:81.96675324675324
Median of Glucose:78.0


The mean (82.0) is slightly above the median (78.0), which suggests a mild right skew, but the two are close enough that the mean is used to fill the missing values.

In [None]:
df['glucose'] = df['glucose'].fillna(gl_mean)
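The column-by-column fills above all follow the same pattern, so they could also be written as one call by passing a dictionary of fill values to DataFrame.fillna. A minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions; only three of the columns are shown):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical values, real column names)
toy = pd.DataFrame({
    "cigsPerDay": [0.0, 0.0, 20.0, np.nan],
    "BPMeds": [0.0, np.nan, 0.0, 1.0],
    "totChol": [195.0, 250.0, np.nan, 225.0],
})

# One fill value per column: median for the skewed column,
# a constant for the binary flag, mean for the roughly symmetric column
fill_values = {
    "cigsPerDay": toy["cigsPerDay"].median(),
    "BPMeds": 0,
    "totChol": toy["totChol"].mean(),
}

# fillna with a dict fills each named column with its own value
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```

Keeping the fill values in one dictionary also leaves a single place to review which strategy was chosen for each column.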

##Check duplicates

In [None]:
df.duplicated().sum()

0

##Logistic Regression

In [None]:
df.head()

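|   | male | age | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |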


In [None]:
# features
X = df.drop(columns=['TenYearCHD'])
y = df['TenYearCHD']

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=150)

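Standardize the features so they share a common scale and no variable dominates because of its units. The scaler is fit on the training set only, and the same transformation is then applied to the test set.

In [None]:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
# reuse the training mean and standard deviation; do not refit on the test set
X_test = scale.transform(X_test)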
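One way to make sure the test set never influences the scaling statistics is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below is an alternative arrangement, not the notebook's own code, and it runs on synthetic data from make_classification (all `*_demo` names and `Xtr`, `Xte`, `ytr`, `yte` are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=150)

# fit() learns the scaling statistics from the training data only;
# predict() reuses those statistics to transform the test data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(Xtr, ytr)
demo_pred = pipe.predict(Xte)

# The fitted scaler holds the training means, not the test means
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, Xtr.mean(axis=0)))
```

The pipeline object can then be passed around as a single estimator, which removes the need to remember which transform call belongs to which split.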

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)

LogisticRegression()
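Because the features were standardized before fitting, the magnitudes of the fitted coefficients can be compared with each other as a rough indication of which variables the model leans on most. A minimal sketch of that inspection on synthetic data (`feature_names`, `clf` and the other names here are hypothetical, and the numbers say nothing about the heart disease data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with four placeholder feature names
feature_names = ["f0", "f1", "f2", "f3"]
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

# Standardize, then fit, mirroring the order used in the notebook
X_scaled = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression().fit(X_scaled, y_demo)

# coef_ has shape (1, n_features) for a binary problem;
# sort by absolute value so the largest effects come first
coefs = pd.Series(clf.coef_[0], index=feature_names)
coefs = coefs.sort_values(key=abs, ascending=False)
print(coefs)
```

For the notebook's model, the same idea would use model.coef_[0] together with the column names of X, captured before scaling turns the frame into a NumPy array.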

In [None]:
predict = model.predict(X_test)
print(f"Classification Report:\n{classification_report(y_test, predict)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, predict)}")

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.99      0.92       716
           1       0.75      0.09      0.16       132

    accuracy                           0.85       848
   macro avg       0.80      0.54      0.54       848
weighted avg       0.84      0.85      0.80       848

Confusion Matrix:
[[712   4]
 [120  12]]
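The class 1 figures in the report follow directly from the confusion matrix: the model finds only 12 of the 132 people who developed CHD, and always predicting class 0 would already score 716 / 848 ≈ 0.84 accuracy. A short worked calculation using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = np.array([[712, 4],
               [120, 12]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # 724 / 848
precision = tp / (tp + fp)           # 12 / 16
recall = tp / (tp + fn)              # 12 / 132
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that always predicts class 0
baseline = (tn + fp) / cm.sum()      # 716 / 848

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} baseline={baseline:.2f}")
# prints: accuracy=0.85 precision=0.75 recall=0.09 f1=0.16 baseline=0.84
```

The 0.85 accuracy is therefore barely above the majority-class baseline, and the recall of 0.09 is the number that better reflects how the model handles the positive class.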
