# Classification Project: Heart Disease Prediction (Cleveland Dataset)

**Author:** Saratachandra Golla     
**Date:** 11/09/2025    
**Project Goal:** Predict the presence of heart disease (Class 1) versus no disease (Class 0) using clinical data from the Cleveland dataset. This project follows a structured approach: data cleaning, feature engineering, exploratory analysis, and comparative model evaluation using Logistic Regression and a Decision Tree Classifier.

## 1. Import and Inspect the Data

In [14]:
# All imports at the top
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Set plot style
sns.set_style("whitegrid")

### 1.1 Load the dataset and display the first 10 rows.

In [15]:

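# Load a local copy of the Cleveland dataset (originally from the UCI archive).
# "?" marks missing entries in the raw file; forward slashes keep the path portable.
df = pd.read_csv("data/heart_disease.data", header=0, na_values="?")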

print("First 10 rows of the Heart Disease dataset:")
print(df.head(10))

First 10 rows of the Heart Disease dataset:
    age   sex   cp   trestbps   chol   fbs   restecg   thalach   exang  \
0  63.0   1.0  1.0      145.0  233.0   1.0       2.0     150.0     0.0   
1  67.0   1.0  4.0      160.0  286.0   0.0       2.0     108.0     1.0   
2  67.0   1.0  4.0      120.0  229.0   0.0       2.0     129.0     1.0   
3  37.0   1.0  3.0      130.0  250.0   0.0       0.0     187.0     0.0   
4  41.0   0.0  2.0      130.0  204.0   0.0       2.0     172.0     0.0   
5  56.0   1.0  2.0      120.0  236.0   0.0       0.0     178.0     0.0   
6  62.0   0.0  4.0      140.0  268.0   0.0       2.0     160.0     0.0   
7  57.0   0.0  4.0      120.0  354.0   0.0       0.0     163.0     1.0   
8  63.0   1.0  4.0      130.0  254.0   0.0       2.0     147.0     0.0   
9  53.0   1.0  4.0      140.0  203.0   1.0       2.0     155.0     1.0   

    oldpeak   slope   ca   thal   target  
0       2.3     3.0  0.0    6.0        0  
1       1.5     2.0  3.0    3.0        2  
2       2.6 
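
The header row printed above seems to carry a leading space on every column name after `age`, which usually means the file separates fields with a comma followed by a space. As a minimal sketch (the inline CSV text is a made-up stand-in, not the real file), `skipinitialspace=True` asks pandas to drop that space from both the header and the values:

```python
import io

import pandas as pd

# Made-up two-row sample imitating a "comma + space" separated file
sample = io.StringIO(
    "age, sex, ca, target\n"
    "63.0, 1.0, 0.0, 0\n"
    "67.0, 1.0, ?, 2\n"
)

# skipinitialspace=True drops the space that follows each delimiter,
# so column names come out as "sex" rather than " sex"
clean = pd.read_csv(sample, header=0, na_values="?", skipinitialspace=True)

print(list(clean.columns))
print(clean["ca"].isnull().sum())
```

With clean names, later cells can refer to `df["ca"]` rather than `df[" ca"]`.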

### 1.2 Check for missing values and display summary statistics.

In [16]:
# Check for missing values
print("\nMissing values count:")
print(df.isnull().sum())

# Display summary statistics
print("\nSummary Statistics:")
print(df.describe().T)


Missing values count:
age          0
 sex         0
 cp          0
 trestbps    0
 chol        0
 fbs         0
 restecg     0
 thalach     0
 exang       0
 oldpeak     0
 slope       0
 ca          4
 thal        2
 target      0
dtype: int64

Summary Statistics:
           count        mean        std    min    25%    50%    75%    max
age        303.0   54.438944   9.038662   29.0   48.0   56.0   61.0   77.0
 sex       303.0    0.679868   0.467299    0.0    0.0    1.0    1.0    1.0
 cp        303.0    3.158416   0.960126    1.0    3.0    3.0    4.0    4.0
 trestbps  303.0  131.689769  17.599748   94.0  120.0  130.0  140.0  200.0
 chol      303.0  246.693069  51.776918  126.0  211.0  241.0  275.0  564.0
 fbs       303.0    0.148515   0.356198    0.0    0.0    0.0    0.0    1.0
 restecg   303.0    0.990099   0.994971    0.0    0.0    1.0    2.0    2.0
 thalach   303.0  149.607261  22.875003   71.0  133.5  153.0  166.0  202.0
 exang     303.0    0.326733   0.469794    0.0    0.0    0

### Reflection 1: What do you notice about the dataset? Are there any data issues?

The dataset is a typical real-world clinical dataset with a few issues:

**1. Missing Values:** The features ca (number of major vessels) and thal (thallium stress test result) have 4 and 2 missing values, respectively. These entries must be imputed, or the affected rows dropped, before modeling.    
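**2. Data Types/Encoding:** Several features (sex, cp, fbs, restecg, exang, slope, ca, thal) are categorical or ordinal but are stored as float64 (for example, cp appears as 1.0 and 4.0). That is fine for storage, but multi-level nominal features such as cp, restecg, and thal should be One-Hot Encoded for Logistic Regression so that their codes are not treated as magnitudes. Binary features are already 0/1, and a Decision Tree can split on the raw codes.    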
**3. Target Variable:** The target variable has 5 levels (0, 1, 2, 3, 4). For this binary classification project, we must convert it to $0$ (no disease) and $1$ (disease, i.e., target $\ge 1$).   
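
The three issues above could be handled along the following lines. This is only a sketch under stated assumptions: the `toy` frame and its values are invented for illustration, the split into `numeric` and `categorical` columns is a hypothetical choice, and median / most-frequent imputation is one reasonable option rather than the approach this project necessarily takes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the loaded frame (values are made up, not real records)
toy = pd.DataFrame({
    "age": [63.0, 67.0, 37.0, 41.0, 56.0, 62.0],
    "cp": [1.0, 4.0, 3.0, 2.0, 2.0, 4.0],
    "ca": [0.0, 3.0, np.nan, 0.0, 0.0, 2.0],
    "thal": [6.0, 3.0, 3.0, np.nan, 3.0, 7.0],
    "target": [0, 2, 0, 0, 1, 3],
})

# Issue 3: collapse the 0-4 target into 0 (no disease) vs 1 (any disease)
toy["target"] = (toy["target"] >= 1).astype(int)

X = toy.drop(columns="target")
y = toy["target"]

# Hypothetical column grouping for this sketch
numeric = ["age", "ca"]
categorical = ["cp", "thal"]

# Issue 1 (missing values) and issue 2 (encoding) in one transformer
preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), numeric),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
print(y.tolist())
```

Wrapping the imputers and encoder in a `ColumnTransformer` means they can later be fitted on the training split only, which avoids leaking information from the test split.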