## **Cleveland Heart Disease dataset (UCI Repository) — Exploratory Data Analysis**

**About Dataset:**

The data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and VA Long Beach. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The “target” field indicates the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease, which matches the plot labels used in the code below.

---

**Creators:**

- Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

**Donor:**

David W. Aha (aha ‘@’ ics.uci.edu) (714) 856–8779

---

See the UCI Machine Learning Repository for more heart disease datasets.

**Importing some necessary libraries for Data Analysis:**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline
sns.set_style(style="darkgrid")
plt.style.use("bmh")

**Loading Dataset into Dataframe:**

In [None]:
dataframe = pd.read_csv("../input/heart-disease-cleveland-uci/heart_cleveland_upload.csv")
dataframe.rename(columns={'condition':'target'}, inplace = True)
dataframe.head()

**Attribute Information:**

---



1. age: age in years

2. sex: 1 = male; 0 = female

3. cp: chest pain type
- Value 1: typical angina
- Value 2: atypical angina
- Value 3: non-anginal pain
- Value 4: asymptomatic

4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)

5. chol: serum cholesterol in mg/dl

6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

7. restecg: resting electrocardiographic results
- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria

8. thalach: maximum heart rate achieved

9. exang: exercise induced angina (1 = yes; 0 = no)

10. oldpeak: ST depression induced by exercise relative to rest

11. slope: the slope of the peak exercise ST segment
- Value 1: upsloping
- Value 2: flat
- Value 3: downsloping

12. ca: number of major vessels (0-3) colored by fluoroscopy

13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

14. target (named `condition` in the CSV): 1 = heart disease; 0 = no heart disease
---

*Note that it’s quite an old dataset (1988)*
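The codes listed above follow the original UCI documentation, while the observations later in this notebook refer to zero-based codes (for example cp 0 and slope 0 to 2). One way to see which encoding a given CSV actually uses is to list the distinct values of each coded column. The sketch below is only an illustration: `coded_values` is a hypothetical helper, and `demo` is a small made-up frame standing in for `dataframe`.

```python
import pandas as pd

def coded_values(df, columns):
    """Return {column: sorted list of distinct values} for the given columns."""
    return {col: sorted(df[col].dropna().unique().tolist()) for col in columns}

# Hypothetical stand-in for `dataframe`; the values are invented for illustration.
demo = pd.DataFrame({
    "cp": [0, 3, 2, 1, 3],
    "slope": [1, 0, 2, 1, 1],
    "thal": [0, 2, 2, 1, 0],
})

encoding = coded_values(demo, ["cp", "slope", "thal"])
for col, values in encoding.items():
    print(f"{col}: {values}")
```

On the real data, calling `coded_values(dataframe, ["cp", "slope", "thal"])` would show whether the codes start at 0 or at 1.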

### **Exploratory Data Analysis:**

1. **Checking missing and null values**

In [None]:
dataframe.isna().sum()

In [None]:
dataframe.isnull().sum()

*`isnull()` is an alias of `isna()`, so both cells report the same thing: no column has missing values, and the dataset can be used as it is.*
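When a dataset does have gaps, it helps to see the share of missing values per column as well as the raw count. The following is a minimal sketch under that assumption; `missing_summary` is a hypothetical helper and `demo` is invented data, not part of the Cleveland file.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column."""
    counts = df.isna().sum()
    return pd.DataFrame({"missing": counts, "percent": 100 * counts / len(df)})

# Invented example with one missing age value.
demo = pd.DataFrame({"age": [63, 45, np.nan, 58], "chol": [233, 250, 204, 199]})
summary = missing_summary(demo)
print(summary)
```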

In [None]:
pd.set_option("display.float_format", "{:.2f}".format)  # full option name; show floats with 2 decimals
dfx = dataframe.drop(columns=["sex","target"])
dfx.describe()

2. **No. of people with heart disease vs. no. of people without heart disease**

In [None]:
# Bar chart of class counts; the x-axis shows the target value (1 = disease, 0 = no disease)
dataframe.target.value_counts().plot(kind="bar", width=0.1, color=["salmon", "lightgreen"], figsize=(8, 5))
plt.xlabel("target", fontsize=12)
plt.ylabel("No. of People", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

- *We have 165 people with heart disease and 138 people without heart disease (about 54% vs. 46%), so the two classes are roughly balanced.*
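The balance can also be expressed as proportions, which is easier to compare across datasets than raw counts. This sketch rebuilds a target column from the two counts quoted above (an assumption for illustration); on the real data the same two `value_counts` calls would be applied to `dataframe.target`.

```python
import pandas as pd

# Target column reconstructed from the counts quoted above (165 vs. 138).
target = pd.Series([1] * 165 + [0] * 138, name="target")

counts = target.value_counts()
shares = target.value_counts(normalize=True).round(3)

balance = pd.DataFrame({"count": counts, "share": shares})
print(balance)
```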

3. **CORRELATION MATRIX HEATMAP**

In [None]:
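corr_matrix = dataframe.corr()
fig, ax = plt.subplots(figsize=(22, 10))
sns.heatmap(corr_matrix, annot=True, linewidths=0.5, fmt=".2f", cmap="YlGn", ax=ax)
# Workaround for a matplotlib 3.1.x bug that clipped the top and bottom heatmap rows;
# it is likely unnecessary on later versions and can then be dropped.
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5);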

4. **CORRELATION WITH TARGET**

In [None]:
dataframe.drop('target', axis=1).corrwith(dataframe.target).plot(kind='bar', grid=True, figsize=(20, 8), title="Correlation with target",color="lightgreen");

---
***Observations from correlation:***
- *fbs and chol have the weakest correlation with the target variable.*
- *The other variables show a visibly stronger correlation with the target. This is a reading of the bar chart, not a test of statistical significance.*
---
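Reading the weakest and strongest features off a bar chart is easier when the correlations are sorted by absolute value. The sketch below shows one way to do that with `corrwith`; the `demo` frame is invented for illustration and its numbers say nothing about the real data.

```python
import pandas as pd

# Invented data: oldpeak rises with target, thalach falls, fbs is unrelated.
demo = pd.DataFrame({
    "oldpeak": [0.0, 0.2, 0.5, 1.5, 2.0, 2.5],
    "thalach": [170, 165, 160, 130, 140, 120],
    "fbs":     [0, 1, 0, 1, 0, 0],
    "target":  [0, 0, 0, 1, 1, 1],
})

corr = demo.drop(columns="target").corrwith(demo["target"])
# Order features from strongest to weakest, ignoring the sign.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom of the list.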



**Categorical and Continuous Values:**

In [None]:
categorical_val = []
continous_val = []
for column in dataframe.columns:
    print('-------------------------------')
    print(f"{column} : {dataframe[column].unique()}")
    # Heuristic: treat a column with at most 10 distinct values as categorical
    if len(dataframe[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)

5. **Categorical Values Histogram**

In [None]:
plt.figure(figsize=(20, 16))
for i, column in enumerate(categorical_val, 1):
    plt.subplot(3, 3, i)
    dataframe[dataframe["target"] == 1][column].hist(bins=10, color='red', label='People with heart disease',alpha=0.8)
    dataframe[dataframe["target"] == 0][column].hist(bins=10, color='green', label='People without heart disease',alpha=0.5)
    plt.legend(fontsize=12)
    plt.xlabel(column)
    plt.ylabel("No. of People")

---
***Observations from the above plot:***

- **cp {chest pain}**: *People with cp 1, 2, or 3 are more likely to have heart disease than people with cp 0.*
- **restecg {resting EKG results}**: *People with a value of 1 (ST-T wave abnormality, which can range from mild symptoms to severe problems) are more likely to have heart disease.*
- **exang {exercise-induced angina}**: *People with a value of 0 (no exercise-induced angina) have heart disease more often than people with a value of 1 (exercise-induced angina).*
- **slope {the slope of the peak exercise ST segment}**: *People with a slope value of 2 (downsloping: a sign of an unhealthy heart) are more likely to have heart disease than people with a slope value of 0 (upsloping: better heart rate with exercise) or 1 (flat: minimal change, typical of a healthy heart).*
- **ca {number of major vessels (0-3) colored by fluoroscopy}**: *More blood movement is better; in the plot, people with ca equal to 0 are more likely to have heart disease.*
- **thal {thallium stress result}**: *People with a thal value of 2 (defect corrected: once a defect but fine now) are more likely to have heart disease.*
---
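Overlaid count histograms can be misleading when the levels of a feature have very different sizes, so statements such as "more likely to have heart disease" are easier to check as a rate within each level. The following is a minimal sketch, assuming `target == 1` marks heart disease as in the plot labels; `disease_rate_by_level` is a hypothetical helper and `demo` is invented data.

```python
import pandas as pd

def disease_rate_by_level(df, column, target="target"):
    """Share of rows with target == 1 within each level of `column`."""
    return df.groupby(column)[target].mean()

# Invented example: 1 of 4 people with exang == 0 and 3 of 4 with exang == 1 have target == 1.
demo = pd.DataFrame({
    "exang":  [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [0, 0, 0, 1, 1, 1, 1, 0],
})

rates = disease_rate_by_level(demo, "exang")
print(rates)
```

Applied to the real data, for example `disease_rate_by_level(dataframe, "cp")`, this gives a number per level that can be compared directly with each observation.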

6. **Continuous Values Histogram:**

In [None]:
plt.figure(figsize=(20, 16))
for i, column in enumerate(continous_val, 1):
    plt.subplot(3, 2, i)
    dataframe[dataframe["target"] == 1][column].hist(bins=35, color='red', label='People with heart disease', alpha=0.8)
    dataframe[dataframe["target"] == 0][column].hist(bins=35, color='green', label='People without heart disease',alpha=0.5)
    plt.legend(fontsize=12)
    plt.xlabel(column)
    plt.ylabel("No. of People")

---
***Observations from the above plot:***
- **trestbps**: *A resting blood pressure above 120-140 mm Hg is generally a cause for concern.*
- **chol**: *A value greater than 200 mg/dl is of concern.*
- **thalach**: *People with a maximum heart rate above 140 are more likely to have heart disease.*
- **oldpeak**: *The exercise-induced ST depression relative to rest reflects how much the heart is stressed during exercise; an unhealthy heart will stress more.*
---
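A per-class summary gives a numeric counterpart to the overlaid histograms. The sketch below compares medians of two continuous features by class; the `demo` values are invented for illustration, and on the real data the same `groupby` would be run on `dataframe` with the continuous columns.

```python
import pandas as pd

# Invented values; target 1 = heart disease, 0 = no heart disease.
demo = pd.DataFrame({
    "thalach": [172, 160, 150, 125, 130, 142],
    "oldpeak": [0.0, 0.4, 0.2, 2.1, 1.4, 1.0],
    "target":  [0, 0, 0, 1, 1, 1],
})

# Median of each continuous feature within each class.
by_class = demo.groupby("target")[["thalach", "oldpeak"]].median()
print(by_class)
```

Medians are used here rather than means because they are less sensitive to the outliers mentioned later in the notebook.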



7. **SCATTER PLOT (Age Vs Max Heart Rate)**

In [None]:
plt.figure(figsize=(15, 10))
plt.scatter(dataframe.age[dataframe.target==1],dataframe.thalach[dataframe.target==1],c="red",s=75)
plt.scatter(dataframe.age[dataframe.target==0],dataframe.thalach[dataframe.target==0],c="green",alpha=0.5)
plt.title("Heart Disease as a function of Age and Max Heart Rate",fontsize=14)
plt.xlabel("Age", fontsize=14)
plt.ylabel("Max Heart Rate", fontsize=14)
plt.legend(["Disease", "No Disease"],fontsize=18);

**Observation from above plot:**

The plot shows no clear pattern that separates the two classes, so on its own it is of little use for telling them apart.

8. **SCATTER PLOT (Age Vs Serum Cholesterol (mg/dl))**

In [None]:
plt.figure(figsize=(15, 10))
plt.scatter(dataframe.age[dataframe.target==1],dataframe.chol[dataframe.target==1],c="red",s=75)
plt.scatter(dataframe.age[dataframe.target==0],dataframe.chol[dataframe.target==0],c="green",alpha=0.5)
plt.title("Heart Disease as a function of Age and Serum Cholesterol (mg/dl)",fontsize=14)
plt.xlabel("Age", fontsize=14)
plt.ylabel("Serum Cholesterol (mg/dl)", fontsize=14)
plt.legend(["Disease", "No Disease"],fontsize=18);

9. **SCATTER PLOT (Age Vs Resting Blood Pressure)**

In [None]:
plt.figure(figsize=(15, 10))
plt.scatter(dataframe.age[dataframe.target==1],dataframe.trestbps[dataframe.target==1],c="red",s=75)
plt.scatter(dataframe.age[dataframe.target==0],dataframe.trestbps[dataframe.target==0],c="green",alpha=0.5)
plt.title("Heart Disease as a function of Age and Resting Blood Pressure",fontsize=14)
plt.xlabel("Age", fontsize=14)
plt.ylabel("Resting Blood Pressure", fontsize=14)
plt.legend(["Disease", "No Disease"],fontsize=18);

**Observation from above plots (8 & 9):**

Both resting blood pressure and serum cholesterol show a weak positive correlation with age. Both also have a few outliers, which we will remove in the preprocessing steps.

---


*Note: the dataset also contains outliers.*

*For further use of this dataset, the outliers must be removed before modelling to get better results.*


---
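The notebook does not say how outliers will be identified. One common choice is the 1.5 × IQR rule, sketched below as an assumption rather than as the method used here; `iqr_outliers` is a hypothetical helper and the `chol` values are invented.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Invented cholesterol values with one extreme reading.
chol = pd.Series([204, 212, 230, 239, 250, 261, 564], name="chol")

mask = iqr_outliers(chol)
print(chol[mask])   # flagged values
print(chol[~mask])  # values kept
```

Whether flagged rows are dropped or capped is a modelling decision; with a dataset this small, dropping rows removes a noticeable share of the data.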