If you like my work, an <b>UPVOTE</b> would be motivating.

Suggestions and feedback are welcome, so feel free to comment or contact me.
Thanks.

In [None]:
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
df = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df

#### Data Dictionary

The dataset has 303 rows and 14 columns.
The columns are described below:

<br>age : Age of the patient
<br>sex : Sex of the patient (1 = male; 0 = female)
<br>exng : exercise induced angina (1 = yes; 0 = no), i.e. whether there is chest pain after exercise
<br>caa : number of major vessels (0-3)
<br>
<br>cp : chest pain type
<br>Value 1: typical angina
<br>Value 2: atypical angina
<br>Value 3: non-anginal pain
<br>Value 4: asymptomatic
<br>(The plots further down show categories 0 to 3 for this column, so the file appears to code these values starting from 0 rather than 1.)
<br>
<br>trtbps : resting blood pressure (in mm Hg)
<br>chol : cholesterol in mg/dl fetched via BMI sensor
<br>fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
<br>
<br>oldpeak : ST depression induced by exercise relative to rest
<br>restecg : resting electrocardiographic results
<br>Value 0: normal
<br>Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
<br>Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
<br>slp : the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)
<br>thalachh : maximum heart rate achieved
<br>thall : 2 = normal; 1 = fixed defect; 3 = reversible defect
<br>output : 0 = less chance of heart attack; 1 = more chance of heart attack


I will rename the features to make them easier to understand during EDA.

In [None]:
col_dict = {
        'age':'Age',
        'sex':'Sex',
        'cp':'ChestPainType',
        'trtbps':'RestBloodPressure',
        'chol':'Cholesterol', 
        'fbs':'FastingBloodSugar',
        'restecg':'RestECG',
        'thalachh':'MaxHeartRate',
        'exng':'ExerciseAngina',
        'oldpeak':'STDepbyExercise',
        'slp':'Slop',
        'caa':'MajorVessels',
        'thall':'DefectType',
        'output':'output'
    }

In [None]:
old_col_names = list(df.columns)
new_col_names = [col_dict[i] for i in old_col_names]
df.columns = new_col_names

In [None]:
df

In [None]:
col_dict

In [None]:
df.info()

Every column is fully populated (no null values). All columns are int64 except 'STDepbyExercise', which is float64.

So let's check what the values in 'STDepbyExercise' look like.

In [None]:
df.STDepbyExercise.sample(10)

In [None]:
df.describe().transpose()

We want to focus on these summary statistics only for the numerical columns, not for the categorical ones.

First, let's check how many unique non-null values each feature has.

In [None]:
unique_values = [(i, df[i].nunique()) for i in df.columns]
plt.figure(figsize=(15,5))
plt.bar(*zip(*unique_values))
plt.xticks(rotation=45)
plt.show()

Let's separate the numerical and categorical columns

In [None]:
numerical_columns = ('Age RestBloodPressure Cholesterol MaxHeartRate STDepbyExercise').split()
categorical_columns = ('Sex ChestPainType FastingBloodSugar RestECG ExerciseAngina \
                         Slop MajorVessels DefectType').split()
target_column = 'output'
print(numerical_columns)
print(categorical_columns)
print(target_column)

Now let's look at descriptions of numerical columns

In [None]:
col_details = pd.DataFrame(columns=['ColName', 'Mean', 'Median', 'Mode', 'Min', 'Max'])
for i in numerical_columns:
    # mode() returns a Series (ties are possible), so keep the first mode as a scalar
    col_details.loc[len(col_details)] = [i, df[i].mean(), df[i].median(), df[i].mode().iloc[0], df[i].min(), df[i].max()]
col_details

The patients range in age from 29 to 77, with a mean age of 54.
People in the survey have an average resting blood pressure of 131, an average cholesterol of 246, and an average maximum heart rate of 149.

In [None]:
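rows, cols = 3, 2
plt.figure(figsize=(15,15))
# One histogram per numerical column, split by the target
for counter, i in enumerate(numerical_columns, start=1):
    plt.subplot(rows, cols, counter)
    sns.histplot(data = df, x= i, hue='output')
    plt.title(i)
# tight_layout only has an effect once the subplots exist
plt.tight_layout()
plt.show()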

The distributions overlap. However,
<br>people with a max heart rate above 150 were mostly positive, and those below 150 were mostly negative.
<br>most people with STDepbyExercise 0 were positive.
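
A reading like "above 150 mostly positive" can be checked with numbers instead of by eye. The sketch below is only an illustration: `toy` is a made-up stand-in for `df`, and `positive_share_by_threshold` is a hypothetical helper, not something from the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's df (made-up values, NOT the real dataset).
toy = pd.DataFrame({
    'MaxHeartRate': [172, 160, 155, 151, 148, 140, 132, 120],
    'output':       [1,   1,   1,   0,   1,   0,   0,   0],
})

def positive_share_by_threshold(frame, column, threshold):
    """Return the share of positive rows above, and at or below, a threshold."""
    above = frame[frame[column] > threshold]
    below = frame[frame[column] <= threshold]
    return above['output'].mean(), below['output'].mean()

share_above, share_below = positive_share_by_threshold(toy, 'MaxHeartRate', 150)
print(f"above 150: {share_above:.2f}, at or below 150: {share_below:.2f}")
```

On the real data the same call would be `positive_share_by_threshold(df, 'MaxHeartRate', 150)`.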


In [None]:
plt.figure(figsize=(15,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    sns.countplot(x=df[i], hue=df.output)
    plt.title(i)
    counter+=1
# tight_layout only has an effect once the subplots exist
plt.tight_layout()
plt.show()

If one category of a feature shows a disproportionate split between the two hues, that feature is likely to be a strong predictor of the final result, in my opinion. For example, for DefectType category 2 the chance of heart attack is disproportionately high.

This trend is present in:
<br>ExerciseAngina category 0.
<br>Slop category 2.
<br>MajorVessels category 0.
<br>DefectType category 2.
<br>Sex category 0.
<br>ChestPainType category 1,2,3.

So let's get the numbers.
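
Besides a plot, the per-category split can also be read as a table. A minimal sketch using `pd.crosstab` with `normalize='index'`, which turns each row into shares that sum to 1; `toy` is made-up illustrative data, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the notebook's df (made-up values, NOT the real dataset).
toy = pd.DataFrame({
    'DefectType': [2, 2, 2, 2, 3, 3, 3, 1],
    'output':     [1, 1, 1, 0, 0, 0, 1, 0],
})

# Rows: category, columns: output (0 or 1), values: row-wise percentages
pct = pd.crosstab(toy['DefectType'], toy['output'], normalize='index') * 100
print(pct.round(1))
```

On the real data, `pd.crosstab(df[col], df['output'], normalize='index')` would give the same kind of table for any column in `categorical_columns`.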

In [None]:
def percent_cat(df, feature_name):
    """Plot the percentage of positive and negative outputs for each category of a feature."""
    # Sort so the categories appear in a stable order on the x-axis
    unique_values = sorted(df[feature_name].unique())

    l_pos = []
    l_neg = []

    for i in unique_values:
        outputs = df[df[feature_name]==i].output
        total = len(outputs)
        # Keep fractional percentages; int() would truncate and the pair might not sum to 100
        l_pos.append((outputs == 1).sum()*100/total)
        l_neg.append((outputs == 0).sum()*100/total)

    plt.suptitle('No of datapoints: '+str(len(df)))
    # Stack negative on top of positive so that neither bar hides the other
    plt.bar(unique_values, l_pos, label='Positive', color='Blue')
    plt.bar(unique_values, l_neg, bottom=l_pos, label='Negative', color='Orange')
    plt.xlabel('Category No')
    plt.ylabel('Percent')
    plt.legend()
    plt.title(feature_name)

In [None]:
plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(df,i)
    counter += 1
plt.show()

<html>
So, as we can see, 

<br>Sex category 0
<br>ChestPainType 1,2,3
<br>RestECG 1
<br>ExerciseAngina 0
<br>Slop 2
<br>MajorVessels 0, 4
<br>DefectType 2

are major determining factors.
<br>At the same time, in some other categories the negatives outweigh the positives, which makes those categories important too:
<br>    ChestPainType 0
<br>    RestECG 2
<br>    ExerciseAngina 1
<br>    Slop 0, 1
<br>    MajorVessels 1,2,3
<br>    DefectType 1,3
</html>

Let's look at the data for women only, as they show a higher share of positive cases in this dataset.

#### Women

In [None]:
wdf = df[df.Sex==0]

In [None]:
len(wdf)

We have only 96 data points for women
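
The group size and the positive share per sex can be put side by side with `value_counts` and a grouped mean. This is a sketch on made-up data (`toy` is not the real dataset); on the notebook's `df` the same two lines would apply to the `Sex` and `output` columns.

```python
import pandas as pd

# Toy stand-in for the notebook's df (made-up values, NOT the real dataset).
toy = pd.DataFrame({
    'Sex':    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'output': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Number of rows per sex (0 = female, 1 = male)
n_per_sex = toy['Sex'].value_counts().sort_index()
# The mean of a 0/1 column is the share of rows equal to 1
positive_share = toy.groupby('Sex')['output'].mean()

print(n_per_sex)
print(positive_share.round(2))
```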

Let's plot the numerical and categorical columns for women only.

In [None]:
rows, cols = 3, 2
plt.figure(figsize=(15,15))
# One histogram per numerical column, women only
for counter, i in enumerate(numerical_columns, start=1):
    plt.subplot(rows, cols, counter)
    sns.histplot(data = wdf, x= i, hue='output')
    plt.title(i)
# tight_layout only has an effect once the subplots exist
plt.tight_layout()
plt.show()

Who has higher risk?
<br>Those with resting blood pressure below 140
<br>those with max heart rate above 140
<br>those with STDepbyExercise below 2
<br>A normal cholesterol level is below 200, and our data starts at 200, so almost everyone in the survey is at risk.

In [None]:
plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(wdf,i)
    counter += 1
plt.show()

A couple of things to notice here:
<br>ChestPainType 3 is 100% Positive
<br>DefectType 0 is 100% Positive
<br>MajorVessels 3 is 100% Negative
<br>DefectType 1 is 100% Negative

Any woman with ChestPainType 1, 2 or 3 has a very high chance of heart attack.
<br>The same is true for FastingBloodSugar 0.

Now we will check variations in data by age groups for women.
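
As an alternative to writing one filter per age group, `pd.cut` can assign every row to exactly one group, including boundary ages such as 40 or 50. The sketch below uses made-up data, and the bin edges and labels are assumptions chosen to match the groups used in this notebook.

```python
import pandas as pd

# Toy stand-in for wdf (made-up values, NOT the real dataset).
toy = pd.DataFrame({
    'Age':    [35, 40, 45, 50, 58, 60, 69, 74],
    'output': [1,  1,  1,  0,  1,  0,  0,  1],
})

# right=False makes each bin include its left edge: [40, 50), [50, 60), ...
bins = [0, 40, 50, 60, 70, 200]
labels = ['<40', '40-49', '50-59', '60-69', '70+']
toy['AgeGroup'] = pd.cut(toy['Age'], bins=bins, labels=labels, right=False)

# Size and positive share of each age group
by_group = toy.groupby('AgeGroup', observed=False)['output'].agg(
    n='count', positive_share='mean'
)
print(by_group)
```

With an `AgeGroup` column in place, a loop over the labels could call `percent_cat` on each slice.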

Age Group: Less than 40

In [None]:

plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(wdf[(wdf.Age<40)],i)
    counter += 1
plt.show()

There is very little data for this age group.

Age Group: 40 to 50

In [None]:

plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(wdf[(wdf.Age>=40) & (wdf.Age<50)],i)  # >= so that age 40 is not dropped
    counter += 1
plt.show()

Almost everyone in this age group is positive in this dataset.


Age Group: 50 to 60

In [None]:

plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(wdf[(wdf.Age>=50) & (wdf.Age<60)],i)  # >= so that age 50 is not dropped
    counter += 1
plt.show()

In this age group, around 70 percent of people are positive.

Age Group: 60 to 70

In [None]:

plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(wdf[(wdf.Age>=60) & (wdf.Age<70)],i)  # >= so that age 60 is not dropped
    counter += 1
plt.show()

Chest pain types 1, 2 and 3 are major indicators.

#### Men

In [None]:
mdf = df[df.Sex==1]

Let's plot the numerical and categorical columns for men only.

In [None]:
rows, cols = 3, 2
plt.figure(figsize=(15,15))
# One histogram per numerical column, men only
for counter, i in enumerate(numerical_columns, start=1):
    plt.subplot(rows, cols, counter)
    sns.histplot(data = mdf, x= i, hue='output')
    plt.title(i)
# tight_layout only has an effect once the subplots exist
plt.tight_layout()
plt.show()

The distributions of the numerical features overlap heavily, so the two classes are not easily separable.

In [None]:
plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(mdf,i)
    counter += 1
plt.show()

A couple of things to notice here:
<br>ChestPainType 3 is 100% Positive
<br>DefectType 0 is 100% Positive
<br>MajorVessels 3 is 100% Negative
<br>DefectType 1 is 100% Negative

Any man with ChestPainType 1, 2 or 3 has a very high chance of heart attack.
<br>The same is true for FastingBloodSugar 0.

Again, let's look at the data by age group.

Age Group: Less than 40

In [None]:

plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(mdf[(mdf.Age<40)],i)
    counter += 1
plt.show()

Age Group: 40 to 50

In [None]:

plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(mdf[(mdf.Age>=40) & (mdf.Age<50)],i)  # >= so that age 40 is not dropped
    counter += 1
plt.show()

Chest pain types 1 and 2 are good indicators.

Age Group: 50 to 60

In [None]:

plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(mdf[(mdf.Age>=50) & (mdf.Age<60)],i)  # >= so that age 50 is not dropped
    counter += 1
plt.show()

The split between positive and negative is roughly 50/50, and with 80+ data points this group is reasonably well represented.

For certain categories, some features are 100% negative and others 100% positive, which makes them strong indicators.
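
A 100% share is more convincing when the category has many rows behind it, so it can help to show the row count next to the percentage. A minimal sketch on made-up data: `toy` is not the real dataset, and the cut-off `MIN_N` is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for one age/sex slice (made-up values, NOT the real dataset).
toy = pd.DataFrame({
    'ChestPainType': [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3],
    'output':        [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
})

MIN_N = 3  # hypothetical cut-off below which a percentage is treated as unreliable

# Row count and positive share per category
stats = toy.groupby('ChestPainType')['output'].agg(n='count', positive_share='mean')
stats['positive_pct'] = stats['positive_share'] * 100
# Flag categories whose percentage rests on very few rows
stats['small_sample'] = stats['n'] < MIN_N
print(stats[['n', 'positive_pct', 'small_sample']])
```

In this toy example category 3 is 100% positive but has a single row, so it is flagged.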

Age Group: 60 to 70

In [None]:

plt.figure(figsize=(17,15))
counter = 1
for i in categorical_columns:
    plt.subplot(3,3,counter)
    percent_cat(mdf[(mdf.Age>=60) & (mdf.Age<70)],i)  # >= so that age 60 is not dropped
    counter += 1
plt.show()