# ❤️ Early Detection of Heart Disease using Logistic Regression and Random Forest
Heart disease is one of the leading causes of death worldwide.  
Early detection and preventive measures can save lives.  
This project uses Logistic Regression and Random Forest to predict heart disease risk based on patient health metrics.


## 🎯 Project Objective
- Predict the presence of heart disease at an early stage
- Assist healthcare professionals in decision-making
- Demonstrate practical use of Machine Learning in healthcare


## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import joblib

## 📊 Load Dataset
Load the heart disease dataset using pandas.


In [None]:
df = pd.read_csv("/content/NACC_APOE_CVD_filtered (2).csv")

In [None]:
df.head()

Unnamed: 0,NACCID,SEX,BIRTHYR,NACCAPOE,DEMENTED,CVHATT,HATTMULT,CVAFIB,CVANGIO,CVBYPASS,...,STROKE,STROKIF,STROKDEC,STKIMAG,CVD,CVDIF,VASC,VASCIF,VASCPS,VASCPSIF
0,NACC000011,2,1944,1.0,0,0.0,,0.0,0.0,0.0,...,0.0,7.0,,,,,0.0,7.0,,
1,NACC000034,2,1935,4.0,0,0.0,8.0,0.0,0.0,0.0,...,,,8.0,8.0,0.0,7.0,,,,
2,NACC000067,1,1952,1.0,0,0.0,,0.0,0.0,0.0,...,0.0,7.0,,,,,0.0,7.0,0.0,7.0
3,NACC000095,1,1926,2.0,1,0.0,,0.0,0.0,0.0,...,0.0,7.0,,,,,0.0,7.0,0.0,7.0
4,NACC000144,1,1930,1.0,0,0.0,,1.0,0.0,0.0,...,0.0,8.0,,,,,8.0,8.0,8.0,8.0


## ℹ️ Dataset Information
Check data types, missing values, and number of rows/columns.


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40686 entries, 0 to 40685
Data columns (total 43 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   NACCID    40686 non-null  object 
 1   SEX       40686 non-null  int64  
 2   BIRTHYR   40686 non-null  int64  
 3   NACCAPOE  40686 non-null  float64
 4   DEMENTED  40686 non-null  int64  
 5   CVHATT    29582 non-null  float64
 6   HATTMULT  7713 non-null   float64
 7   CVAFIB    29536 non-null  float64
 8   CVANGIO   29624 non-null  float64
 9   CVBYPASS  29633 non-null  float64
 10  CVPACDEF  7742 non-null   float64
 11  CVPACE    21901 non-null  float64
 12  CVCHF     29598 non-null  float64
 13  CVANGINA  7738 non-null   float64
 14  CVHVALVE  7738 non-null   float64
 15  CVOTHR    29536 non-null  float64
 16  CVOTHRX   3347 non-null   object 
 17  MYOINF    18764 non-null  float64
 18  CONGHRT   18764 non-null  float64
 19  AFIBRILL  18764 non-null  float64
 20  ANGINA    18764 non-null  fl

In [None]:
df.shape

(40686, 43)

In [None]:
df.describe()

Unnamed: 0,SEX,BIRTHYR,NACCAPOE,DEMENTED,CVHATT,HATTMULT,CVAFIB,CVANGIO,CVBYPASS,CVPACDEF,...,STROKE,STROKIF,STROKDEC,STKIMAG,CVD,CVDIF,VASC,VASCIF,VASCPS,VASCPSIF
count,40686.0,40686.0,40686.0,40686.0,29582.0,7713.0,29536.0,29624.0,29633.0,7742.0,...,21922.0,21922.0,18764.0,18674.0,18764.0,18764.0,21922.0,21922.0,15339.0,15296.0
mean,1.565969,1940.892912,1.819471,0.359878,0.110743,7.732141,0.100623,0.118823,0.079101,0.020408,...,0.041602,7.191406,7.85259,7.895041,0.076903,7.128011,2.801387,7.253353,2.945564,7.207178
std,0.495635,12.652103,1.061846,0.47997,0.446419,1.422104,0.365841,0.445972,0.379632,0.158624,...,0.199683,1.032187,1.050306,0.861236,0.266444,1.498858,3.800684,0.903614,3.832178,1.091093
min,1.0,1896.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
25%,1.0,1932.0,1.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,...,0.0,7.0,8.0,8.0,0.0,7.0,0.0,7.0,0.0,7.0
50%,2.0,1941.0,2.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,...,0.0,7.0,8.0,8.0,0.0,7.0,0.0,7.0,0.0,7.0
75%,2.0,1949.0,2.0,1.0,0.0,8.0,0.0,0.0,0.0,0.0,...,0.0,8.0,8.0,8.0,0.0,8.0,8.0,8.0,8.0,8.0
max,2.0,2003.0,6.0,1.0,2.0,8.0,2.0,2.0,2.0,2.0,...,1.0,8.0,8.0,8.0,1.0,8.0,8.0,8.0,8.0,8.0


In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SEX,40686.0,1.565969,0.495635,1.0,1.0,2.0,2.0,2.0
BIRTHYR,40686.0,1940.892912,12.652103,1896.0,1932.0,1941.0,1949.0,2003.0
NACCAPOE,40686.0,1.819471,1.061846,1.0,1.0,2.0,2.0,6.0
DEMENTED,40686.0,0.359878,0.47997,0.0,0.0,0.0,1.0,1.0
CVHATT,29582.0,0.110743,0.446419,0.0,0.0,0.0,0.0,2.0
HATTMULT,7713.0,7.732141,1.422104,0.0,8.0,8.0,8.0,8.0
CVAFIB,29536.0,0.100623,0.365841,0.0,0.0,0.0,0.0,2.0
CVANGIO,29624.0,0.118823,0.445972,0.0,0.0,0.0,0.0,2.0
CVBYPASS,29633.0,0.079101,0.379632,0.0,0.0,0.0,0.0,2.0
CVPACDEF,7742.0,0.020408,0.158624,0.0,0.0,0.0,0.0,2.0


## Handle Missing / Categorical Data

In [None]:
df.isnull().sum()

Unnamed: 0,0
NACCID,0
SEX,0
BIRTHYR,0
NACCAPOE,0
DEMENTED,0
CVHATT,11104
HATTMULT,32973
CVAFIB,11150
CVANGIO,11062
CVBYPASS,11053
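The raw counts above are easier to compare as fractions of the row count; for example, 32973 missing values out of 40686 rows means HATTMULT is roughly 81% empty. The sketch below shows one way to rank columns by missing fraction. It runs on a small made-up frame (`toy`), and the 0.5 cutoff is an arbitrary assumption, not something taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: the column names are borrowed
# from the notebook, the values are made up for illustration.
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "CVHATT": [0.0, np.nan, 0.0, 1.0],
    "HATTMULT": [np.nan, np.nan, np.nan, 8.0],
})

# Fraction of missing values per column, largest first.
missing_frac = toy.isnull().mean().sort_values(ascending=False)
print(missing_frac)

# Columns that are more than half empty may be better dropped than imputed
# (the 0.5 threshold is an assumption chosen for this example).
mostly_empty = missing_frac[missing_frac > 0.5].index.tolist()
print(mostly_empty)
```

On the real data the same two lines would be applied to `df` instead of `toy`.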


In [None]:
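# Fill missing numeric values with each column's median.
# Note: this also fills the CVD column that is later used as the target.
df.fillna(df.median(numeric_only=True), inplace=True)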

In [None]:
df.isnull().sum()

Unnamed: 0,0
NACCID,0
SEX,0
BIRTHYR,0
NACCAPOE,0
DEMENTED,0
CVHATT,0
HATTMULT,0
CVAFIB,0
CVANGIO,0
CVBYPASS,0


## ⚙️ Preprocessing
- Encode categorical features
- Handle missing values
- Calculate age from birth year

### Encoding + Missing Values

In [None]:
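# Encode SEX only if it was read as strings (df.info() shows it is already int64 here)
if df['SEX'].dtype == 'object':
    le = LabelEncoder()
    df['SEX'] = le.fit_transform(df['SEX'])

# Age in years, relative to a fixed reference year so the result is reproducible
current_year = 2023
df['age_years'] = current_year - df['BIRTHYR']

# Fill any remaining missing numeric values with the column mean.
# The earlier median fill already removed the numeric NaNs, so this acts as a safeguard.
df.fillna(df.mean(numeric_only=True), inplace=True)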

## ✂️ Feature & Target Selection
- Drop unnecessary columns (e.g., IDs)
- Target = Heart Disease (CVD column, binary: 0=No, 1=Yes)


In [None]:
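# Features: everything except the target, the participant ID, and the non-numeric CVOTHRX column
X = df.drop(['CVD', 'NACCID', 'CVOTHRX'], axis=1)
# Target: 1 where CVD is greater than 0, otherwise 0
y = (df['CVD'] > 0).astype(int)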
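Because the median fill runs before the target is built, rows where CVD was never recorded are filled with the column median (0.0 in the `describe()` output) and therefore counted as negatives. One possible alternative, sketched below on made-up data, is to drop unlabelled rows first and impute only the feature columns. The frame `toy` and the names `labelled`, `X_toy`, and `y_toy` are hypothetical; this is not the procedure the notebook actually used.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the six rows have no recorded label.
toy = pd.DataFrame({
    "CVD": [0.0, 1.0, np.nan, 0.0, np.nan, 1.0],
    "SEX": [1, 2, 2, 1, 1, 2],
    "CVHATT": [0.0, np.nan, 0.0, 1.0, 0.0, np.nan],
})

# Keep only rows where the label was actually recorded.
labelled = toy.dropna(subset=["CVD"]).copy()

# Impute the features only; the target column is left out of the fill.
feature_cols = [c for c in labelled.columns if c != "CVD"]
labelled[feature_cols] = labelled[feature_cols].fillna(
    labelled[feature_cols].median()
)

X_toy = labelled[feature_cols]
y_toy = (labelled["CVD"] > 0).astype(int)

# Class balance among the rows that really have a label.
print(y_toy.value_counts())
```

Whether dropping unlabelled rows is appropriate depends on why the values are missing, so treat this as a starting point rather than a fix.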

## 🔀 Train-Test Split
Split dataset into 80% training and 20% testing.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
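The evaluation output further down reports 295 positives among 8138 test rows, so the positive class is small (about 3.6%). With a rare class, a stratified split keeps the class ratio the same in both partitions. The sketch below illustrates the `stratify` argument of `train_test_split` on synthetic data; `X_toy` and `y_toy` are made up, and the notebook's own split does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with a 4% positive class (values are made up).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.zeros(1000, dtype=int)
y_toy[:40] = 1

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Positive rate in each partition; both should sit close to 0.04.
print(y_tr.mean(), y_te.mean())
```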

## 📏 Feature Scaling
Standardize features with StandardScaler: fit on the training set only, then apply the same transformation to the test set.


In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
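The fit-on-train, transform-on-test pattern above can also be packaged with scikit-learn's `make_pipeline`, which keeps the scaler and the model together as one object. The sketch below is an illustration on synthetic data, not a replacement for the cells in this notebook; `X_toy`, `y_toy`, and `pipe` are hypothetical names, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features (made up for the example).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = (X_toy[:, 0] > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# fit() fits the scaler on the training data only, then fits the model
# on the scaled training data; predict() reuses the fitted scaler.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)

# The scaler inside the pipeline learned its means from the training rows.
scaler_step = pipe.named_steps["standardscaler"]
print(scaler_step.mean_)
```

A single pipeline object also means there is only one file to save and load at deployment time.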

## 🤖 Model Building
- Logistic Regression
- Random Forest Classifier


In [None]:
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
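A single train/test split gives one estimate of performance. As a complementary check, the sketch below runs stratified 5-fold cross-validation scored on recall, with `class_weight="balanced"` as one common way to account for a rare positive class. The data is synthetic, and both the fold count and the class weighting are assumptions made for this example; the models above were trained without them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data where roughly one row in ten is positive (made up).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 4))
y_toy = (X_toy[:, 0] > 1.2).astype(int)

model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)

# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Recall on the positive class, one score per fold.
recalls = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="recall")
print(recalls.mean(), recalls.std())
```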

## 📊 Model Evaluation
Check accuracy, confusion matrix, and classification report for both models.


In [None]:
def evaluate_model(y_test, y_pred, model_name):
    print(f"=== {model_name} ===")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("\n")

evaluate_model(y_test, y_pred_log, "Logistic Regression")
evaluate_model(y_test, y_pred_rf, "Random Forest")

=== Logistic Regression ===
Accuracy: 0.9965593511919391
Confusion Matrix:
 [[7843    0]
 [  28  267]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7843
           1       1.00      0.91      0.95       295

    accuracy                           1.00      8138
   macro avg       1.00      0.95      0.97      8138
weighted avg       1.00      1.00      1.00      8138



=== Random Forest ===
Accuracy: 0.9961907102482183
Confusion Matrix:
 [[7840    3]
 [  28  267]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7843
           1       0.99      0.91      0.95       295

    accuracy                           1.00      8138
   macro avg       0.99      0.95      0.97      8138
weighted avg       1.00      1.00      1.00      8138
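The headline accuracy can be recomputed from the confusion matrices printed above, which makes the role of class imbalance easier to see. The short sketch below copies the two matrices from the output and derives accuracy, precision, and recall for class 1, plus the accuracy of a trivial baseline that always predicts class 0. The helper name `summarize` is hypothetical.

```python
import numpy as np

# Confusion matrices copied from the output above.
# Rows are the true class (0, 1); columns are the predicted class (0, 1).
cm_log = np.array([[7843, 0], [28, 267]])
cm_rf = np.array([[7840, 3], [28, 267]])

def summarize(cm):
    """Accuracy plus precision and recall for the positive class."""
    tn, fp, fn, tp = cm.ravel()
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

log_stats = summarize(cm_log)
rf_stats = summarize(cm_rf)

# Accuracy of always predicting "no disease": true negatives over all rows.
baseline_acc = cm_log[0].sum() / cm_log.sum()

print(log_stats)
print(rf_stats)
print(baseline_acc)
```

The baseline works out to 7843 / 8138, about 96.4%, so accuracy alone says little here; both models miss 28 of the 295 positive cases, which is the 0.91 recall shown in the reports.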





## 💾 Save Models
Save trained models and scaler for deployment.


In [None]:
joblib.dump(log_model, "heart_logistic_model.pkl")
joblib.dump(rf_model, "heart_rf_model.pkl")
joblib.dump(sc, "heart_scaler.pkl")

['heart_scaler.pkl']
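Before relying on saved files, it can be worth confirming that the reloaded objects reproduce the in-memory predictions. The sketch below does a round trip through `joblib.dump` and `joblib.load` in a temporary directory, using a small synthetic model; the file names `toy_model.pkl` and `toy_scaler.pkl` are hypothetical and unrelated to the files saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Small synthetic model and scaler (made up for the example).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

scaler = StandardScaler().fit(X_toy)
model = LogisticRegression().fit(scaler.transform(X_toy), y_toy)

# Save both objects, then load them back from disk.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "toy_model.pkl")
    scaler_path = os.path.join(tmp, "toy_scaler.pkl")
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# The reloaded pair should give the same predictions as the originals.
same = np.array_equal(
    model.predict(scaler.transform(X_toy)),
    loaded_model.predict(loaded_scaler.transform(X_toy)),
)
print(same)
```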

## 🌐 Interactive Web App
Use Gradio to accept patient data and predict heart disease risk online.


In [55]:
# Gradio interactive app
import gradio as gr
import pandas as pd
import joblib

# Load models
model = joblib.load("heart_rf_model.pkl")
scaler = joblib.load("heart_scaler.pkl")

def predict_heart(age, gender, height, weight, ap_hi, ap_lo,
                  cholesterol, gluc, smoke, alco, active):
    bmi = weight / ((height / 100) ** 2)
    data = pd.DataFrame([{
        'gender': gender,
        'height': height,
        'weight': weight,
        'ap_hi': ap_hi,
        'ap_lo': ap_lo,
        'cholesterol': cholesterol,
        'gluc': gluc,
        'smoke': smoke,
        'alco': alco,
        'active': active,
        'age_years': age,
        'BMI': bmi
    }])
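    # NOTE: the column names and their order must match the features the scaler
    # and model were fitted on; otherwise transform() raises a ValueError.
    data_scaled = scaler.transform(data)
    prediction = model.predict(data_scaled)[0]
    label = "⚠️ HIGH Risk of Heart Disease" if prediction == 1 else "✅ LOW Risk of Heart Disease"
    return f"{label}\nBMI: {bmi:.2f}"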

# Gradio interface
app = gr.Interface(
    fn=predict_heart,
    inputs=[
        gr.Number(label="Age (years)"),
        gr.Radio([1,2], label="Gender (1=Male, 2=Female)"),
        gr.Number(label="Height (cm)"),
        gr.Number(label="Weight (kg)"),
        gr.Number(label="Systolic BP"),
        gr.Number(label="Diastolic BP"),
        gr.Radio([1,2,3], label="Cholesterol (1=Normal,2=Above,3=High)"),
        gr.Radio([1,2,3], label="Glucose (1=Normal,2=Above,3=High)"),
        gr.Radio([0,1], label="Smoking (0=No,1=Yes)"),
        gr.Radio([0,1], label="Alcohol (0=No,1=Yes)"),
        gr.Radio([0,1], label="Physically Active (0=No,1=Yes)")
    ],
    outputs="text",
    title="❤️ Heart Disease Prediction App",
    description="Enter patient data to predict heart disease risk"
)

app.launch(share=True)




