# Heart Disease Dataset UCI

## The Fine Details Provided  

Sex  
Male: 1  
Female: 0  

Chest pain type  
-- Value 1: typical angina  
-- Value 2: atypical angina  
-- Value 3: non-anginal pain  
-- Value 4: asymptomatic  
  
> Angina is chest pain or discomfort caused when your heart muscle doesn't get enough oxygen-rich blood.
It may feel like pressure or squeezing in your chest.  

> oldpeak = ST depression induced by exercise relative to rest  

> serum cholesterol in mg/dl

> resting blood pressure (in mm Hg on admission to the hospital)  

> vessels colored by fluoroscopy: number of major vessels (0-3) colored by fluoroscopy.  

> A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)


Target  
No Heart Disease: 0  
Heart Disease: 1



## Include Libs For EDA

In [None]:
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

import pandas as pd
import numpy as np

## EDA

In [None]:
df_raw = pd.read_csv("../input/heart-disease-dataset-uci/HeartDiseaseTrain-Test.csv")
df_raw.head()

In [None]:
# Fix minor typo from Cholestoral to Cholesterol
df_raw.rename(columns={"cholestoral": "cholesterol"}, inplace=True)

In [None]:
# Work on a copy so that the primary dataframe (df_raw) stays pristine
df = df_raw.copy()

In [None]:
columns = list(df.columns)
# Look at what info we can expect from the dataset
columns

In [None]:
df.info()

In [None]:
df.isnull().any()

#### Lovely! We won't have to worry about missing values

### EDA on numerical data

In [None]:
# Describe numerics
df.describe(include=[np.int64, np.float64])

#### Just like that, analyzing numerical data, we notice
* The recorded max age = 77 and min age = 29.  
* The recorded lowest cholesterol = 126 and highest cholesterol = 564. (Accepted healthy cholesterol levels < 200 mg/dL)
* The recorded min Max_heart_rate = 71 and max Max_heart_rate = 202
* The recorded min oldpeak = 0 and max oldpeak = 6.2

#### Correlations  
* Correlations tending towards 1 or -1 indicate a strong linear relation in the given dataset
* Correlations tending towards 0 indicate a weak linear relation in the given dataset
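
To make the two bullets above concrete, here is a minimal sketch of what pandas' `Series.corr` computes by default (the Pearson coefficient). The toy values and the `pearson` helper are made up for illustration and are not part of the dataset or the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the heart disease dataset
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson(a: pd.Series, b: pd.Series) -> float:
    """Pearson r: co-variation of a and b divided by the product of their spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)

r_manual = pearson(x, y)
r_pandas = x.corr(y)  # pandas defaults to method="pearson"
print(f"manual: {r_manual:.3f}, pandas: {r_pandas:.3f}")
```

Both numbers should agree, which is a quick way to confirm that the sign and magnitude reported in the cells below are plain Pearson coefficients.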

In [None]:
# Filter out numerical cols, all but the target variable
numerical_cols = list(df.select_dtypes(include=[np.int64, np.float64]))[:-1]
numerical_corr = {}
for col in numerical_cols:
    corr = df["target"].corr(df[col])
    print(f"{col}: {corr:.3f}\n")
    numerical_corr[col] = corr

Let's sort them based on magnitude alone

In [None]:
sorted_corr = sorted(numerical_corr.items(), key=lambda ele:abs(ele[1]))

for name, corr in sorted_corr:
    print(f"{name}: {abs(corr):.3f}")

#### Notice that, by magnitude, oldpeak has the strongest correlation with heart disease whereas cholesterol has the weakest. The latter might be quite confounding, but such is the numerical correlation on this particular dataset/distribution. Good cholesterol? Bad cholesterol? Perhaps a combination of both in a skewed ratio? Speculations remain open.

### Numerical Data Viz

In [None]:
# Plotly Theming
template = "plotly_white"
# color_scale = "Bluered"
# color_discrete_sequence=["lightblue", "orangered"] # For discrete map visualization
# Note: num_data is an alias of df, not a copy, so the cast below also changes df["target"]
num_data = df
# Convert target to string for discrete color representation
num_data["target"] = num_data["target"].astype(str)

In [None]:
fig1 = px.box(data_frame=num_data, x="cholesterol", y="target", template=template, title="Cholesterol vs Heart Disease")
fig2 = px.density_contour(data_frame=num_data, x="cholesterol", y="target", template=template, title="Cholesterol vs Heart Disease")
fig1.show(); fig2.show()

In [None]:
fig1 = px.box(data_frame=df, x="Max_heart_rate", y="target", template=template, title="Max_heart_rate vs heart disease")
fig2 = px.density_contour(data_frame=df, x="Max_heart_rate", y="target", template=template, title="Max_heart_rate vs heart disease")
fig1.show(); fig2.show()

In [None]:
fig1 = px.box(data_frame=df, x="age", y="target", template=template, title="Age vs heart disease")
fig2 = px.density_contour(data_frame=df, x="age", y="target", template=template, title="Age vs heart disease")
fig1.show(); fig2.show()

In [None]:
fig1 = px.box(data_frame=df, x="resting_blood_pressure", y="target", template=template, title="Resting blood pressure vs heart disease")
fig2 = px.density_contour(data_frame=df, x="resting_blood_pressure", y="target", template=template, title="Resting blood pressure vs heart disease")
fig1.show(); fig2.show()

In [None]:
fig1 = px.box(data_frame=df, x="oldpeak", y="target", template=template, title="Oldpeak vs heart disease")
fig2 = px.density_contour(data_frame=df, x="oldpeak", y="target", template=template, title="Oldpeak vs heart disease")
fig1.show(); fig2.show()

### Categorical Data Analysis

In [None]:
# np.object was removed in NumPy 1.24; the string "object" selects the same columns
df.select_dtypes(include="object")

#### Some quick data cleaning! Although not strictly necessary, the mapping follows the details laid out by the uploader on Kaggle to standardize the process.  
Note: The documented mapping has a few quirks, for example the codes 3, 6 and 7 for the thalassemia column.
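
Because codes such as 3, 6 and 7 are labels rather than magnitudes, it helps to verify that every label in a column is actually covered by its dictionary. The following is a minimal sketch of that check using `Series.map`; the sample values, including the `"Unknown"` label, are hypothetical and only serve to show what an uncovered label looks like.

```python
import pandas as pd

# Hypothetical sample of the thalassemia column, including one unexpected label
thal = pd.Series(["Normal", "Fixed Defect", "Reversable Defect", "No", "Unknown"])

thalassemia_map = {"No": 0, "Normal": 3, "Fixed Defect": 6, "Reversable Defect": 7}

encoded = thal.map(thalassemia_map)        # labels missing from the dict become NaN
unmapped = thal[encoded.isna()].unique()   # surface anything the dict did not cover
print(encoded.tolist())
print("unmapped labels:", list(unmapped))
```

An empty `unmapped` list means the dictionary covers the column completely and the later integer cast is safe.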

In [None]:
# Layout dictionaries to categorize columns
sex_map = {
    "Male": 1,
    "Female": 0
}
chest_pain_map = {
    "Typical angina": 1,
    "Atypical angina": 2,
    "Non-anginal pain": 3,
    "Asymptomatic": 4
}
blood_sugar_map = {
    "Lower than 120 mg/ml": 0,
    "Greater than 120 mg/ml": 1
}
rest_ecg_map = {
    "Normal": 0,
    "ST-T wave abnormality": 1,
    "Left ventricular hypertrophy": 2,
}
exercise_angina_map = {
    "Yes": 1,
    "No": 0
}
slope_map = {
    "Upsloping": 1,
    "Flat": 2,
    "Downsloping": 3
}
fluoroscopy_map = {
    "Zero": 0,
    "One": 1,
    "Two": 2,
    "Three": 3,
    "Four": 4
}
thalassemia_map = {
    "No": 0,
    "Normal": 3,
    "Fixed Defect": 6,
    "Reversable Defect": 7
}

In [None]:
# Create a temporary copy for the categorical encoding
df_cat = df.copy()

In [None]:
# Encode each categorical column with its dictionary.
# Assigning the result back avoids chained in-place replace, which newer pandas versions warn about.
category_maps = {
    "sex": sex_map,
    "chest_pain_type": chest_pain_map,
    "fasting_blood_sugar": blood_sugar_map,
    "rest_ecg": rest_ecg_map,
    "exercise_induced_angina": exercise_angina_map,
    "slope": slope_map,
    "vessels_colored_by_flourosopy": fluoroscopy_map,
    "thalassemia": thalassemia_map,
}
for col, mapping in category_maps.items():
    df_cat[col] = df_cat[col].map(mapping)

df_cat["target"] = df_cat["target"].astype(np.int64)

In [None]:
# Reset df_raw
df_raw = pd.read_csv("../input/heart-disease-dataset-uci/HeartDiseaseTrain-Test.csv")
# Fix minor typo from Cholestoral to Cholesterol
df_raw.rename(columns={"cholestoral": "cholesterol"}, inplace=True)

### Correlations

In [None]:
cat_cols = list(df_raw.select_dtypes(include="object"))
cat_corr = {}
for col in cat_cols:
    corr = df_cat["target"].corr(df_cat[col])
    cat_corr[col] = corr

In [None]:
# Sort by absolute impact
cat_corr_sorted = sorted(cat_corr.items(), key=lambda ele:abs(ele[1]))
cat_corr_sorted

#### Inference:  
Fasting blood sugar has the weakest correlation with heart disease in this distribution, whereas exercise-induced angina has the strongest.

### Categorical Data Viz

In [None]:
# Cast the categorical columns and the target to strings so the histograms treat them as discrete. Not completely necessary but clean.
df_cat_str = df_raw.copy()
for col in cat_cols + ["target"]:
    df_cat_str[col] = df_cat_str[col].astype(str)

In [None]:
px.histogram(data_frame=df_cat_str, x="fasting_blood_sugar", color="target", template=template, title="Fasting Blood Sugar vs Heart Disease")

#### Inference
Inconclusive?

In [None]:
px.histogram(data_frame=df_cat_str, x="sex", color="target", template=template, title="Sex & Heart Disease")

#### Inference
A larger proportion of females than males in this distribution have heart disease or are at risk.
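
The histogram shows raw counts, so proportions have to be read off by eye. A row-normalised cross-tabulation gives them directly. This is a minimal sketch on made-up rows; in the notebook the same call would presumably be applied to `df_cat_str`.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the call
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Male"],
    "target": [0, 1, 0, 0, 1, 1, 0, 1],
})

# normalize="index" makes each row sum to 1:
# the share of target 0 and target 1 within each sex
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)
```

The same pattern works for any of the categorical columns plotted in this section.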

In [None]:
px.histogram(data_frame=df_cat_str, x="thalassemia", color="target", template=template, title="Thalassemia vs Heart Disease")

In [None]:
px.histogram(data_frame=df_cat_str, x="slope", color="target", template=template, title="Slope of peak exercise ST vs Heart Disease")

In [None]:
px.histogram(data_frame=df_cat_str, x="vessels_colored_by_flourosopy", color="target", template=template, title="Vessels colored by fluoroscopy vs Heart Disease")

#### Inference  
I refrain from drawing inferences here, given the lack of procedural data and my limited/zero medical expertise.

In [None]:
px.histogram(data_frame=df_cat_str, x="chest_pain_type", color="target", template=template, title="Chest Pain Type vs Heart Disease")

#### Inference
Perhaps anything but Typical Angina could be indicative of a higher probability of being at risk of heart disease in the given distribution.

In [None]:
px.histogram(data_frame=df_cat_str, x="exercise_induced_angina", color="target", template=template, title="Exercise induced angina vs Heart Disease")

#### Inference
There's a good chance (50% or greater) that a patient without exercise-induced angina has heart disease in this distribution.

## Prediction by Machine Learning

With data and inferences at hand, we can now try and predict risk of heart disease from the given variables

In [None]:
# Convert strings / categorical data back into int64 for modelling purposes
for col in cat_cols:
    df_cat[col] = df_cat[col].astype(int)

In [None]:
# Clone dataframe for ML ops (copy, so later shuffling does not touch df_cat)
df = df_cat.copy()

In [None]:
# Initialize dictionary to record scores
Results = {}

## Approach 1: Basic Logistic Regression using Scikit Learn (Naive Approach)

### Import Libraries

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

In [None]:
# Set seed for consistent testing
np.random.seed(1)

In [None]:
# Shuffle Dataframe
df = shuffle(df)

In [None]:
# Segregate inputs from the target variable. Naively select every single variable other than the target
input_cols = df.columns[:-1]
target_col = df.columns[-1]
list(input_cols)

In [None]:
inputs = df[input_cols]
targets = df[target_col]
inputs.shape, targets.shape

In [None]:
# Split the input target pair into train and test datasets
# Train-test ratio: 60-40 (test_size=0.40)
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.40, random_state=3)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape
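
As a quick illustration of how `test_size` determines the shapes printed above, here is a minimal sketch on toy arrays. The `stratify=y` argument is an addition that the notebook does not use; it keeps the class balance similar in both parts and is shown only as an option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for inputs/targets: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.40 holds out 40% of the rows for testing.
# stratify=y (not used in the notebook) preserves the 50/50 class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.40, random_state=3, stratify=y
)
print(X_tr.shape, X_te.shape)
```

With 10 rows this leaves 6 for training and 4 for testing, i.e. a 60-40 split.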

In [None]:
model_logistic = LogisticRegression(solver="liblinear")
model_logistic.fit(X_train, y_train)
score_naive = model_logistic.score(X_test, y_test)
score_naive

In [None]:
Results["NaiveLogisticModel"] = score_naive

#### 85% Accuracy on the naive logistic solver from Scikit-learn. Before we test further, can we improve it? Let's spice it up.

## Approach 2: Non-naive Logistic Regression (Correlation Filter)

In [None]:
low_corr = ["cholesterol", "resting_blood_pressure", "fasting_blood_sugar"]
# Drop low correlation columns
high_corr_df = df.drop(low_corr, axis=1)
high_corr_df.head()
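
The list of low-correlation columns above is written out by hand. As a sketch of how the same filter could be derived from the correlations themselves, the snippet below uses `DataFrame.corrwith` on a toy frame. The values and the `threshold` cut-off are hypothetical and were not used in this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror the notebook's
toy = pd.DataFrame({
    "cholesterol": [210, 250, 190, 300, 230, 205],
    "oldpeak": [0.0, 0.5, 1.2, 2.3, 3.1, 0.2],
    "target": [1, 1, 0, 0, 0, 1],
})

threshold = 0.2  # hypothetical cut-off, chosen only for illustration

# Absolute Pearson correlation of every feature with the target
corr_with_target = toy.drop(columns="target").corrwith(toy["target"]).abs()
low_corr_cols = corr_with_target[corr_with_target < threshold].index.tolist()

# Drop whatever falls below the cut-off; the target column is always kept
filtered = toy.drop(columns=low_corr_cols)
print(low_corr_cols, list(filtered.columns))
```

Deriving the list this way keeps the filter in sync with the data if the dataset or the encoding changes.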

In [None]:
inputs_non_naive = high_corr_df.iloc[:, :-1]
targets = high_corr_df[target_col]
inputs_non_naive.shape, targets.shape

In [None]:
X_train_non_naive, X_test_non_naive, y_train_non_naive, y_test_non_naive = train_test_split(inputs_non_naive, targets, test_size=0.40, random_state=3)

In [None]:
model_logistic = LogisticRegression(solver="liblinear")
model_logistic.fit(X_train_non_naive, y_train_non_naive)
score_non_naive = model_logistic.score(X_test_non_naive, y_test_non_naive)
score_non_naive

In [None]:
Results["ModelLogistic"] = score_non_naive

### No improvement, unfortunately. Quite odd: filtering out features with low correlation has not improved performance on this distribution.

### Thus far, using just the logistic regression model provided by scikit-learn, we obtained an accuracy of 85%. Performance stays constant even after feature selection by correlation, which is alright. But can we take it further?

## Approach 3: Scikit-learn Decision Tree (Naive)

In [None]:
from sklearn import tree

In [None]:
tree_naive = tree.DecisionTreeClassifier()

In [None]:
tree_naive.fit(X_train, y_train)
score_tree_naive = tree_naive.score(X_test, y_test)
score_tree_naive

In [None]:
# Let's visualize the tree
!pip install dtreeviz --quiet
from dtreeviz.trees import dtreeviz
dtreeviz(tree_naive, x_data=X_train, y_data=y_train, target_name="Heart Disease", feature_names=list(X_train.columns), class_names=["No", "Yes"], title="Naive Tree")

In [None]:
Results["DTree_naive"] = score_tree_naive

### A whopping 99.2% test accuracy. At this point, we could pretty much eliminate the need for gradient boosting or a deep neural network. That said, a score this high deserves some caution, since overfitting on a small and easily learnable dataset is commonplace.
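
One cheap sanity check before trusting a near-perfect score is to look for duplicated rows, because identical rows landing in both the training and test sets would inflate test accuracy. Whether this dataset actually contains duplicates is not established here; the sketch below only shows the check on made-up rows.

```python
import pandas as pd

# Hypothetical rows; in the notebook the frame to inspect would be df
toy = pd.DataFrame({
    "age": [52, 52, 61, 45, 61],
    "oldpeak": [1.0, 1.0, 2.3, 0.0, 2.3],
    "target": [0, 0, 0, 1, 0],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(f"{n_duplicates} duplicated rows, {len(deduplicated)} unique rows")
```

If the count is non-zero on the real data, re-running the split on the deduplicated frame would give a more trustworthy estimate.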

## Conclusions

In [None]:
for name, score in Results.items():
    print(f"{name}: {score:.3f}")

### Thus, we have peaked at 99.2% test accuracy (98%+ even when varying the random split state) using a basic decision tree classifier from scikit-learn. For the sake of completeness, I also tried a decision tree on the correlation-filtered dataset, only to see overall accuracy on the test set drop by 2%. We can thus infer that filtering features by correlation adds little, given the size of the dataset and this particular distribution of patients.
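
Trying several random split states by hand can be replaced with k-fold cross-validation, which scores the model on every fold in turn. This is a minimal sketch on synthetic data from `make_classification`; in the notebook the arrays would presumably be `inputs` and `targets`, and the scores printed here say nothing about the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the heart disease dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold cross-validation: every row is used for testing exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean together with the spread across folds makes it easier to tell a stable result from a lucky split.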

### I believe this is a good starter dataset for anyone looking to learn or refresh their EDA and data prediction basics. I would also like to thank the author for having provided this dataset. Whether the distribution is skewed or at fault is left to medical expertise. This notebook is focused purely on the data. Furthermore, I encourage comments that can help me improve on the existing model / EDA approach. Cheers!