# `Title` : Heart Disease Prediction

# `Author` : [Shah Ahmad Noorani](https://www.linkedin.com/in/shah-ahmad-noorani-5804b0278/)

# `Date` : 22.Dec.2025

## 🫀 Heart Disease Dataset: Meta-Data (About Dataset)

### 📌 Context

This is a **multivariate dataset**, meaning it contains multiple numerical and categorical variables used for statistical and machine learning analysis.

The dataset consists of **14 main attributes**, including:

* Age
* Sex
* Chest pain type
* Resting blood pressure
* Serum cholesterol
* Fasting blood sugar
* Resting electrocardiographic results
* Maximum heart rate achieved
* Exercise-induced angina
* ST depression (oldpeak)
* Slope of ST segment
* Number of major vessels
* Thalassemia

📌 Although the original database contains **76 attributes**, most published research uses only these **14 features**.

👉 The **Cleveland database** is the most widely used version by Machine Learning researchers.

### 🎯 Objective

The primary goal of this dataset is:

* To **predict whether a patient has heart disease or not**
* To perform **exploratory data analysis (EDA)** and extract medical insights

---

## 📂 Content

### 🧾 Column Descriptions

* **id** → Unique identifier for each patient

* **age** → Age of the patient (in years)

* **dataset** (listed as `origin` in the dataset description) → Place of study

* **sex** → Male / Female

* **cp (Chest Pain Type)**

  * Typical angina
  * Atypical angina
  * Non-anginal pain
  * Asymptomatic

* **trestbps** → Resting blood pressure (mm Hg)

* **chol** → Serum cholesterol (mg/dl)

* **fbs** → Fasting blood sugar > 120 mg/dl (True / False)

* **restecg** → Resting electrocardiographic results

  * Normal
  * ST-T wave abnormality
  * Left ventricular hypertrophy

* **thalach** → Maximum heart rate achieved

* **exang** → Exercise-induced angina (True / False)

* **oldpeak** → ST depression induced by exercise relative to rest

* **slope** → Slope of the peak exercise ST segment

* **ca** → Number of major vessels (0–3) colored by fluoroscopy

* **thal** →

  * Normal
  * Fixed defect
  * Reversible defect

* **num** → Target variable (Heart disease presence)
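
In the UCI heart-disease data, `num` is usually coded from 0 (no disease) up to 4, while the stated objective is a yes/no prediction. A minimal sketch of how a binary target could be derived is shown here; the toy frame and the column name `has_disease` are illustrative assumptions and do not appear in the original notebook.

```python
import pandas as pd

# Toy stand-in for the real data; only the `num` column matters here.
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Assumption: 0 means no disease, any value above 0 means disease is present.
toy["has_disease"] = (toy["num"] > 0).astype(int)

print(toy["has_disease"].value_counts().sort_index())
```

Collapsing the classes this way loses the severity information, so it only suits the "disease or not" framing.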

---

## 🙏 Acknowledgements

### 👨‍⚕️ Dataset Creators

* **Hungarian Institute of Cardiology, Budapest**: Andras Janosi, M.D.
* **University Hospital, Zurich, Switzerland**: William Steinbrunn, M.D.
* **University Hospital, Basel, Switzerland**: Matthias Pfisterer, M.D.
* **V.A. Medical Center & Cleveland Clinic Foundation**: Robert Detrano, M.D., Ph.D.

---

## 📚 Relevant Research Papers

* Detrano et al. (1989). *International application of a new probability algorithm for the diagnosis of coronary artery disease.*
* Aha & Kibler. *Instance-based prediction of heart disease presence.*
* Gennari et al. (1989). *Models of incremental concept formation.*

---

## 📖 Citation Request

Any publication using this dataset should cite the principal investigators from the respective institutions mentioned above.

In [None]:
# import libraries

# to handle the data
import pandas as pd
import numpy as np

# to visualize the dataset
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# To preprocess the data
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer, KNNImputer
# import iterative imputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# machine learning
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
# for classification tasks
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestRegressor
from xgboost import XGBClassifier
# metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_absolute_error

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# **Load the dataset**

In [None]:
from google.colab import files
files.upload()

In [None]:
df = pd.read_csv("/content/heart_disease_uci.csv")

# **Exploratory Data Analysis (EDA)**

In [None]:
# exploring the datatype of each column
df.info()

In [None]:
# data shape
df.shape

In [None]:
# id column
df['id'].min(), df['id'].max()

In [None]:
# age column
df['age'].min(), df['age'].max()

In [None]:
# let's summarize the data (all numeric columns)
df.describe().T

In [None]:
# let's summarize the age column
df['age'].describe()

In [None]:
# draw a histogram to see the distribution of age column
sns.histplot(df['age'], kde=True)

In [None]:
# plot the mean, median and mode of the age column using sns
sns.histplot(df['age'], kde=True)
plt.axvline(df['age'].mean(), color='green', label='mean')
plt.axvline(df['age'].mode()[0], color='red', label='mode')
plt.axvline(df['age'].median(), color='blue', label='median')
plt.legend()

Mean: 53.51086956521739

Median: 54.0

Mode: 54
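
The three values above can be obtained along the following lines. This is a small sketch on made-up ages; the helper name `central_tendency` is hypothetical, and in the notebook `df['age']` would be passed in place of the toy Series.

```python
import pandas as pd


def central_tendency(values: pd.Series) -> dict:
    """Return the mean, median and first mode of a numeric Series."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        # mode() may return several values; take the first (smallest) one
        "mode": values.mode().iloc[0],
    }


# Toy ages for illustration only; pass df['age'] in the notebook.
toy_ages = pd.Series([41, 54, 54, 58, 63])
stats = central_tendency(toy_ages)
for name, value in stats.items():
    print(f"{name.capitalize()}: {value}")
```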


Let's explore the gender-based distribution of the age column.

In [None]:
# plot the histogram of age column using plotly and coloring this by sex

fig = px.histogram(data_frame=df, x='age', color='sex')
fig.show()

In [None]:
# find the values of sex column
df['sex'].value_counts()

In [None]:
# calculate the percentages of male and female patients in the data
# (counts are hard-coded from the value_counts() output above)
male_count = 726
female_count = 194
total_count = male_count + female_count

# calculate percentages
male_percentage = (male_count / total_count) * 100
female_percentage = (female_count / total_count) * 100

# display the results
print(f"Male percentage in the data: {male_percentage:.2f}%")
print(f"Female percentage in the data: {female_percentage:.2f}%")

# difference
difference_percentage = ((male_count - female_count) / female_count) * 100
print(f"Males are {difference_percentage:.2f}% more than females in the data.")
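
The counts above are typed in by hand, so they go stale if the data changes. One way to derive the same shares directly from the column is sketched here on a toy Series; it assumes the labels are `Male` and `Female`, and in the notebook `df['sex']` would replace `toy_sex`.

```python
import pandas as pd

# Toy stand-in for df['sex']; assumes the labels are 'Male' and 'Female'.
toy_sex = pd.Series(["Male", "Male", "Male", "Female"])

# absolute counts per label
counts = toy_sex.value_counts()
# share of each label, in percent
percentages = toy_sex.value_counts(normalize=True) * 100

for label in counts.index:
    print(f"{label}: {counts[label]} rows ({percentages[label]:.2f}%)")
```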

In [None]:
# find the value counts of the age column grouped by the sex column
# (value_counts() already sorts by count within each group)

df.groupby('sex')['age'].value_counts()

# let's deal with the dataset column
# find the unique values in dataset

In [None]:
df.head()

In [None]:
df['dataset'].value_counts()

In [None]:
# count plot of the dataset column, split by sex
# sns.countplot(data=df, x='dataset', hue='sex')

# make the same count plot with plotly
fig = px.bar( df, x='dataset', color = 'sex')
fig.show()


In [None]:
# make a plot of age column using plotly and coloring this by dataset column
fig = px.histogram(data_frame=df, x='age', color='dataset')
fig.show()

# print the mean median and mode of age column grouped by dataset column
print(f"Mean of Data Set: {df.groupby('dataset')['age'].mean()}")
print("-------------------------------------")
print(f"Median of Data Set: {df.groupby('dataset')['age'].median()}")
print("-------------------------------------")
print(f"Mode of Data Set: {df.groupby('dataset')['age'].agg(pd.Series.mode)}")
print("-------------------------------------")

Let's explore the cp (chest pain) column:

In [None]:
# values count of cp
df['cp'].value_counts()

In [None]:
# draw the histogram of age colored by cp
fig = px.histogram(data_frame=df, x = 'age', color='cp')
fig.show()

In [None]:
# count plot of cp by sex
sns.countplot(data=df, x='cp', hue='sex')

In [None]:
# draw the plot of the age column grouped by cp
fig = px.histogram(data_frame=df, x='age', color='cp')
fig.show()

# Let's explore the trestbps (resting blood pressure) column:
The normal resting blood pressure is 120/80 mm Hg.

TODO: describe what happens when blood pressure is too high or too low, then bin the data based on those values.
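
Binning of this kind could be done with `pd.cut`. The sketch below uses made-up readings, and the cut-offs and labels are illustrative assumptions only, not clinical guidance and not values from the original notebook; `trestbps` is assumed to hold a single (systolic) reading in mm Hg.

```python
import pandas as pd

# Toy resting pressures in mm Hg; in the notebook, use df['trestbps'].
toy_bp = pd.Series([95, 118, 125, 134, 150, 182])

# Illustrative cut-offs only (an assumption, not clinical guidance).
bin_edges = [0, 90, 120, 140, float("inf")]
bin_labels = ["low", "normal", "elevated", "high"]

# right=False makes each bin include its left edge: [0, 90), [90, 120), ...
bp_category = pd.cut(toy_bp, bins=bin_edges, labels=bin_labels, right=False)

print(bp_category.value_counts().sort_index())
```

Missing readings stay missing after `pd.cut`, so binning can be done before or after imputation.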

In [None]:
# summary statistics of trestbps (resting blood pressure)
df['trestbps'].describe()

In [None]:
# create a histplot of trestbps
sns.histplot(df, x='trestbps', kde=True)

# Dealing with missing values

We are going to make a function to deal with missing values
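
One possible shape for such a function is sketched here. It swaps in a simpler technique than the notebook's `IterativeImputer`: numeric columns get their median and all other columns get their most frequent value. The name `fill_missing` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric NaNs set to the median and other NaNs to the mode."""
    filled = frame.copy()
    for column in filled.columns:
        if not filled[column].isnull().any():
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    return filled


# Toy frame with one numeric and one categorical gap.
toy = pd.DataFrame({
    "trestbps": [120.0, np.nan, 140.0, 130.0],
    "thal": ["normal", None, "normal", "fixed defect"],
})
result = fill_missing(toy)
print(result)
```

The sketch does not handle a column that is entirely missing, since such a column has no mode to fall back on.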

In [None]:
# percentage of missing values in trestbps
df['trestbps'].isnull().mean() * 100

In [None]:
# impute the missing values of trestbps using IterativeImputer
# (with a single input column the imputer has no other features to learn from)
imputer = IterativeImputer(max_iter=10, random_state=42)

# fit the imputer
imputer.fit(df[['trestbps']])

# transform the data (transform returns a 2-D array, so flatten it to one column)
df['trestbps'] = imputer.transform(df[['trestbps']]).ravel()

# check the missing values
print(f"Missing values in trestbps column: {df['trestbps'].isnull().mean() * 100:.2f}%")
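
`IterativeImputer` models each column from the others, so it is most useful when several related numeric columns are passed together. The following is a rough sketch on made-up values; the column names mirror the dataset, but the choice of which columns to combine is an assumption rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Toy numeric columns; the names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "age": [45, 52, 60, 38, 67, 55],
    "trestbps": [120.0, np.nan, 145.0, 110.0, np.nan, 135.0],
    "chol": [210.0, 240.0, np.nan, 190.0, 280.0, 250.0],
})

imputer = IterativeImputer(max_iter=10, random_state=42)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(toy_imputed.isnull().sum())
```

Observed values are left unchanged; only the gaps are filled with model-based estimates.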
