## Predicting Heart Disease Using Machine Learning

We attempt to create a machine learning model that detects heart disease based on the medical records of patients.  
This notebook uses various Python libraries for data science and machine learning.

## Approach

1. Problem definition  
2. Data  
3. Evaluation  
4. Features  
5. Modeling  
6. Experimentation

## Problem definition

Given certain medical records of a patient, is it possible to detect the presence of heart disease?  
The machine learning problem is **supervised learning / binary classification**.

## Data

The data we use is the Cleveland Heart Disease Dataset, which is publicly available:  
[UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease)  
[Kaggle](https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data)

## Evaluation

The proof of concept succeeds if the model reaches 95% accuracy at predicting whether a patient has heart disease. If it does, we pursue the project further.
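
The 95% goal can be checked with a few lines of code. The following is a minimal sketch, not the evaluation used later in this notebook: the names `ACCURACY_GOAL`, `accuracy`, `toy_true`, and `toy_pred` are hypothetical, and the labels are made-up toy values rather than results from the dataset.

```python
# Hypothetical threshold taken from the 95% goal stated above
ACCURACY_GOAL = 0.95

def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels for illustration only, not results from this dataset
toy_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
goal_met = toy_accuracy >= ACCURACY_GOAL
print(f"accuracy={toy_accuracy:.2f}, goal met: {goal_met}")
```

Scikit-Learn's `accuracy_score` in `sklearn.metrics` computes the same quantity and would normally be used on real predictions.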

## Features

**Data Dictionary (information about each data feature)**  
* age (age of patient in years)  
* sex (0=female, 1=male)  
* cp (chest pain type: 1=typical angina, 2=atypical angina, 3=non-anginal, 4=asymptomatic)  
* trestbps (resting blood pressure in mmHg on admission to hospital)  
* chol (serum cholesterol in mg/dl)  
* fbs (whether fasting blood sugar is > 120 mg/dl, 0=false, 1=true)  
* restecg (resting electrocardiographic results: 0=normal, 1=ST/T abnormality, 2=left ventricular hypertrophy)  
* thalach (maximum heart rate achieved)  
* exang (exercise-induced angina: 0=no, 1=yes)  
* oldpeak (ST segment depression induced by exercise relative to resting)  
* slope (slope of the peak exercise ST segment: 1=upsloping, 2=flat, 3=downsloping)  
* ca (number of major vessels (0-3) colored by fluoroscopy)  
* thal (3=normal, 6=fixed defect, 7=reversible defect)  
* target (the predicted attribute: 0=no heart disease, 1=heart disease)
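
Coded columns are easier to read during exploration when the codes are replaced by labels. The sketch below assumes the codes listed in the data dictionary above; the CSV file you load may encode these columns differently, so check the actual values first. The names `CODE_LABELS`, `decode_codes`, and `toy` are hypothetical, and the rows are made up.

```python
import pandas

# Code-to-label lookups copied from the data dictionary above
CODE_LABELS = {
    "sex": {0: "female", 1: "male"},
    "cp": {1: "typical angina", 2: "atypical angina",
           3: "non-anginal", 4: "asymptomatic"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversible defect"},
}

def decode_codes(frame, code_labels=CODE_LABELS):
    """Return a copy of frame with coded columns replaced by readable labels."""
    decoded = frame.copy()
    for column, labels in code_labels.items():
        if column in decoded.columns:
            # Codes missing from the lookup become NaN
            decoded[column] = decoded[column].map(labels)
    return decoded

# Toy rows for illustration only, not taken from the dataset
toy = pandas.DataFrame({"sex": [0, 1], "cp": [4, 2], "thal": [3, 7]})
decoded_toy = decode_codes(toy)
print(decoded_toy)
```

Because the function works on a copy, the numeric codes stay available for modeling.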

## Preparing the Tools

The Python libraries NumPy, pandas, Matplotlib, and seaborn are used for data analysis, manipulation, and visualization.  
The Python library Scikit-Learn is used for machine learning.

In [None]:
### importing exploratory data analysis (EDA) tools
import numpy
import pandas
import seaborn
from matplotlib import pyplot

### rendering plots inside this notebook
%matplotlib inline

### importing sklearn model selection tools
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV

### importing sklearn machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

### importing sklearn model evaluation tools
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay

## Loading Data

In [None]:
### importing from csv file
heart_disease = pandas.read_csv(filepath_or_buffer="data-heart-disease.csv")

In [None]:
### displaying dimensions of the dataframe
heart_disease.shape

In [None]:
### displaying the first 5 rows
heart_disease.head()

In [None]:
### displaying the last 5 rows
heart_disease.tail()

## Exploratory Data Analysis (EDA)

The goal is to become a subject matter expert on the dataset.  
1. Data basics: data types, numeric/categorical features, summary statistics, class balance, etc.
2. Cleaning data: handling missing values and outliers
3. Transforming data: common units, standardization, encoding, etc.
4. Data engineering: creating new features from existing ones
5. Reducing data: removing non-relevant features
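
Steps 2 to 5 can be sketched on a small frame. This is an illustrative sketch under stated assumptions, not the processing applied to this dataset: the frame `toy` and its values are made up, the `over_50` feature is hypothetical, and median imputation and z-score standardization are just one common choice for each step.

```python
import pandas

# Toy frame for illustration only; values are made up
toy = pandas.DataFrame({
    "age": [63, 41, None, 57],
    "chol": [233, 204, 250, 354],
    "cp": [1, 2, 4, 3],
})

# Step 2 (cleaning): fill the missing age with the column median
toy["age"] = toy["age"].fillna(toy["age"].median())

# Step 3 (transforming): standardize chol using the sample standard deviation
toy["chol_scaled"] = (toy["chol"] - toy["chol"].mean()) / toy["chol"].std()

# Step 3 (encoding): one-hot encode the categorical chest pain type
toy = pandas.get_dummies(toy, columns=["cp"], prefix="cp")

# Step 4 (engineering): derive a hypothetical feature from an existing one
toy["over_50"] = toy["age"] > 50

# Step 5 (reducing): drop the raw column once the scaled version exists
toy = toy.drop(columns=["chol"])
print(toy.columns.tolist())
```

Which of these steps a dataset actually needs depends on what the data basics in step 1 reveal.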

In [None]:
### displaying dataframe basic information
heart_disease.info()

In [None]:
### displaying basic statistics
heart_disease.describe()

In [None]:
### counting category instances of the target variable
heart_disease["target"].value_counts()

In [None]:
### visualizing category instances of the target variable
heart_disease["target"].value_counts().plot(kind="bar", color=["salmon", "lightblue"])
pyplot.title(label="Category Instances of Target Variable")
pyplot.ylabel(ylabel="Counts")
pyplot.xlabel(xlabel="Target Variable (Heart Disease): 0=No 1=Yes")
pyplot.xticks(rotation=0);
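
Raw counts show class sizes, but proportions make balance easier to judge. The sketch below uses a made-up `toy_target` series rather than the dataset, and `balance_ratio` is a hypothetical helper quantity, not a standard metric.

```python
import pandas

# Toy target column for illustration only; counts are made up
toy_target = pandas.Series([1, 1, 1, 0, 0, 1, 0, 1], name="target")

# normalize=True returns proportions instead of raw counts
proportions = toy_target.value_counts(normalize=True)
print(proportions)

# Ratio of the rarest class to the most common class (1.0 = perfectly balanced)
balance_ratio = proportions.min() / proportions.max()
print(f"balance ratio: {balance_ratio:.2f}")
```

A ratio far below 1.0 suggests that accuracy alone may be misleading and that precision, recall, and F1 deserve attention as well.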

In [None]:
### checking for missing values
heart_disease.isna().sum()