## Introduction

Goal: Visualize the UCI Heart Disease Data in a meaningful way

By: Angga Bayu Prakhosha

In this notebook I try to extract meaningful information from the UCI Heart Disease dataset. The analysis is divided into three categories: univariate, bivariate, and multivariate analysis.

Only 14 attributes are used:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    1. Value 0: typical angina
    2. Value 1: atypical angina
    3. Value 2: non-anginal pain
    4. Value 3: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    1. Value 0: normal
    2. Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    3. Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    1. Value 0: upsloping
    2. Value 1: flat
    3. Value 2: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target: diagnosis of heart disease (angiographic disease status)
    1. Value 0: < 50% diameter narrowing
    2. Value 1: > 50% diameter narrowing
    
The complete dataset can be seen here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

In [None]:
import pandas as pd
import numpy as np

First, load the data and show its first 5 rows.

In [None]:
path_to_data = '/kaggle/input/heart-disease-uci/heart.csv'
data = pd.read_csv(path_to_data)
data.head()

In [None]:
features = data.columns
categorical_data = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
numerical_data = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
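
As a quick sanity check, the two lists should together cover every column exactly once. The sketch below assumes the 14 column names listed in the introduction and uses an empty stand-in frame (`toy`, a hypothetical name) instead of reading the real CSV.

```python
import pandas as pd

# Hypothetical stand-in: an empty frame with the 14 columns listed in the introduction
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
toy = pd.DataFrame(columns=columns)

categorical_data = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
numerical_data = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

# Columns that appear in neither list, and columns that appear in both
missing = set(toy.columns) - set(categorical_data) - set(numerical_data)
overlap = set(categorical_data) & set(numerical_data)

print("missing:", missing)
print("overlap:", overlap)
```

Both sets are empty for these lists, so every column is treated as either categorical or numerical.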

Before we proceed, we replace the categorical codes with descriptive labels.

In [None]:
# Map each integer code to a readable label. Assigning the result back avoids
# chained in-place replacement, which recent pandas versions warn about.
data['sex'] = data['sex'].replace({1: 'Male', 0: 'Female'})

data['cp'] = data['cp'].replace({
    0: 'Typical Angina',
    1: 'Atypical Angina',
    2: 'Non-anginal Pain',
    3: 'Asymptomatic',
})

data['fbs'] = data['fbs'].replace({
    1: 'fasting blood sugar > 120 mg/dl',
    0: 'fasting blood sugar <= 120 mg/dl',
})

data['restecg'] = data['restecg'].replace({
    0: 'Normal',
    1: 'having ST-T wave abnormality',
    2: 'left ventricular hypertrophy',
})

data['exang'] = data['exang'].replace({0: 'No', 1: 'Yes'})

data['slope'] = data['slope'].replace({0: 'Upsloping', 1: 'Flat', 2: 'Downsloping'})

data['target'] = data['target'].replace({
    0: '< 50% diameter narrowing',
    1: '> 50% diameter narrowing',
})
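
After relabelling, it can be worth confirming that no unexpected codes slipped through, since `replace` leaves unmapped values untouched. The following is a minimal sketch on a small made-up frame (`toy` and `cp_labels` are hypothetical names, and the values are not taken from the real dataset).

```python
import pandas as pd

# Made-up example column; the real data comes from heart.csv
toy = pd.DataFrame({'cp': [0, 1, 2, 3, 0]})

# Code-to-label mapping for chest pain type, as described in the introduction
cp_labels = {
    0: 'Typical Angina',
    1: 'Atypical Angina',
    2: 'Non-anginal Pain',
    3: 'Asymptomatic',
}
toy['cp'] = toy['cp'].replace(cp_labels)

# Any value that is not one of the expected labels would show up here
unexpected = set(toy['cp']) - set(cp_labels.values())
print(unexpected)
```

An empty set means every code was mapped to a label.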

The resulting dataset looks like this:

In [None]:
data.head()

## UNIVARIATE ANALYSIS

Univariate analysis examines a single feature at a time.

The cell below shows summary statistics for the numeric columns of the data.

In [None]:
data.describe().T

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")

The cell below visualizes every categorical feature of the data. From this, we can see that:

* There are more male patients than female patients
* The most common type of chest pain is typical angina
* Most patients do not have fasting blood sugar above 120 mg/dl
* Only a small number of patients show probable or definite left ventricular hypertrophy by Estes' criteria on their resting electrocardiographic results
* More patients do not have exercise-induced angina than do
* Most patients have either a flat or a downsloping ST slope
* Many patients had no major vessels colored by fluoroscopy

In [None]:
for num, feature in enumerate(categorical_data):
    plt.figure(num)
    plot = sns.countplot(x=feature, data=data, palette='Blues')
    plot.set_xticklabels(plot.get_xticklabels(), rotation=20)
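
The bar heights in a count plot are simply category frequencies, so the same observations can be read off as numbers with `value_counts`. The sketch below uses a small made-up frame (`toy` is a hypothetical name, and the values are illustrative, not from the real dataset).

```python
import pandas as pd

# Made-up categorical data for illustration only
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Male'],
    'exang': ['No', 'Yes', 'No', 'No'],
})

# Absolute counts per category, one Series per feature
counts = {col: toy[col].value_counts() for col in ['sex', 'exang']}

# Relative frequencies for a single feature
shares = toy['sex'].value_counts(normalize=True)

print(counts['sex'])
print(shares)
```

Passing `normalize=True` turns the counts into proportions, which is convenient when comparing features with different numbers of categories.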

The cell below visualizes every numerical feature of the data. From this, we can see that:

* The average age of the patients is 54 with a standard deviation of 9 years, and the data are approximately normally distributed
* Resting blood pressure is approximately normally distributed, with a mean of about 131 mm Hg and a standard deviation of about 17.5 mm Hg
* Patients' cholesterol is approximately normally distributed around 246 mg/dl
* Patients' maximum heart rate is skewed to the left, with a mean of 149.64
* Patients' oldpeak is skewed to the right, with a mean of 0.72

In [None]:
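# numerical data
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
ax = axes.ravel()

for num, feature in enumerate(numerical_data):
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(data[feature], kde=True, ax=ax[num])

# Hide the unused sixth panel of the 2x3 grid
ax[-1].set_visible(False)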
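
Whether a distribution is skewed to the left or to the right can also be checked numerically with the sample skewness: negative values indicate a longer left tail, positive values a longer right tail. The sketch below uses made-up numbers (`toy` is a hypothetical frame, not the real dataset).

```python
import pandas as pd

# Made-up values: 'thalach' has one very low outlier, 'oldpeak' one very high outlier
toy = pd.DataFrame({
    'thalach': [71, 150, 155, 160, 165, 170],
    'oldpeak': [0.0, 0.0, 0.2, 0.8, 1.6, 6.2],
})

# Sample skewness per column
skew = toy.skew()
print(skew)
```

Here `thalach` comes out negative (left-skewed) and `oldpeak` positive (right-skewed).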

## BIVARIATE ANALYSIS

Bivariate analysis compares two features.

Bivariate analysis can be built from side-by-side versions of the univariate plots, one per group. The plot below shows how it is done. From this cell, we can see that:

1. Patients with > 50% diameter narrowing are younger than those with < 50% diameter narrowing
2. Patients with < 50% diameter narrowing have higher oldpeak values than those with > 50% diameter narrowing

In [None]:
for num, feature in enumerate(numerical_data):
    plt.figure(num)
    plot = sns.boxplot(x='target', y =feature, data=data)
    plot.set_xticklabels(plot.get_xticklabels(), rotation=20)
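
The center line of each box is the group median, so the comparison can be reproduced as a small table with `groupby`. This is a minimal sketch on made-up values (`toy` is a hypothetical frame, and the ages are illustrative only).

```python
import pandas as pd

# Made-up ages for two diagnosis groups
toy = pd.DataFrame({
    'target': ['< 50% diameter narrowing'] * 3 + ['> 50% diameter narrowing'] * 3,
    'age': [60, 62, 58, 50, 52, 54],
})

# Median age per diagnosis group, i.e. the center line of each box
medians = toy.groupby('target')['age'].median()
print(medians)
```

Replacing `median()` with `describe()` would also give the quartiles that define the box edges.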

Here is a similar analysis, visualized with bar plots instead of box plots and additionally split by sex.

In [None]:
for num, feature in enumerate(numerical_data):
    plt.figure(num)
    ax = sns.barplot(x="target", y=feature, hue="sex", data=data)
    # Rotate this plot's own tick labels (not those of the earlier boxplot)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=20)
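
By default, `sns.barplot` draws the mean of each group as the bar height, so the same numbers can be tabulated by grouping on both `target` and `sex`. The sketch below uses made-up cholesterol values (`toy` is a hypothetical frame, not the real dataset).

```python
import pandas as pd

# Made-up cholesterol values for illustration only
toy = pd.DataFrame({
    'target': ['< 50% diameter narrowing'] * 3 + ['> 50% diameter narrowing'] * 3,
    'sex': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female'],
    'chol': [250, 270, 240, 230, 260, 280],
})

# Mean per (target, sex) pair, with sex spread across the columns
means = toy.groupby(['target', 'sex'])['chol'].mean().unstack('sex')
print(means)
```

Each cell of `means` corresponds to the height of one bar in the grouped bar plot.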

## MULTIVARIATE ANALYSIS

Multivariate analysis can be used to see relationships among many features. The cell below shows a pair plot of the numerical data.

In [None]:
sns.pairplot(data[numerical_data])
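
A pair plot shows relationships visually; a correlation matrix gives a numeric summary of the linear part of those relationships. The sketch below computes Pearson correlations on a made-up frame (`toy` is a hypothetical name, and the values are chosen so that `age` and `thalach` are perfectly negatively correlated).

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'age': [40, 50, 60, 70],
    'thalach': [180, 170, 160, 150],
    'chol': [200, 260, 220, 240],
})

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()
print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear association.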