> ### Data Description

<img src="https://cdn.kqed.org/wp-content/uploads/sites/35/2014/01/heartattack.png">

**Credits** - Image from Internet

* This database contains `76` attributes, but all published experiments use a subset of `14` of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date.

* The `target` field refers to the presence of heart disease in the patient. It is integer-valued:
    - `0` = no/less chance of heart attack
    - `1` = more chance of heart attack


#### Columns

* age
* sex
* chest pain type (4 values)
* resting blood pressure
* serum cholesterol in mg/dl
* fasting blood sugar > 120 mg/dl
* resting electrocardiographic results (values 0,1,2)
* maximum heart rate achieved
* exercise induced angina
* oldpeak = ST depression induced by exercise relative to rest
* the slope of the peak exercise ST segment
* number of major vessels (0-3) colored by fluoroscopy
* thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
* target: 0 = less chance of heart attack; 1 = more chance of heart attack

**Source** - [Kaggle website](https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility)

> ### Necessary `import`s

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt
from matplotlib import style

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

> ### Read Data

In [None]:
df = pd.read_csv('/kaggle/input/health-care-data-set-on-heart-attack-possibility/heart.csv')

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.tail()

> ### `NaN` Check

In [None]:
df.isnull().any()

> ### `target` → How many people have a higher chance of heart attack

In [None]:
ha_df = df['target'].value_counts().to_frame()
ha_df

In [None]:
ha_df.plot(kind='pie', figsize=(10, 6), subplots=True)
plt.show()

* Risky → 1
* Safe → 0

The `Risky` group (`target` = 1) is larger than the `Safe` group (`target` = 0).
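The size of each class can also be expressed as a fraction rather than a raw count. The sketch below uses a small made-up `toy_target` series (an assumption for illustration, not the real `heart.csv` values); on the real data the same call would be applied to `df['target']`.

```python
import pandas as pd

# Made-up stand-in for the `target` column; NOT the real dataset values.
toy_target = pd.Series([1, 1, 1, 0, 0, 1, 0, 1], name='target')

# normalize=True returns each class's share instead of its raw count.
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)
```

Reporting shares next to the pie chart makes the class balance easier to compare across datasets of different sizes.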

> ### Sex Ratio

In [None]:
sdf = df['sex'].value_counts().to_frame()
sdf

In [None]:
sdf.plot(kind='pie', figsize=(10, 6), subplots=True)
plt.show()

* Males → 1
* Females → 0

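Men make up almost twice as many records as women in this dataset. Note that this chart shows the sex ratio of the sample; it does not by itself show how common heart attack risk is within each sex.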
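How many records fall into each sex is a separate question from how often `target` is 1 within each sex. One way to check the latter is a row-normalised cross-tabulation. The sketch below uses a made-up `toy` frame (an assumption for illustration only); on the real data the same call would take `df['sex']` and `df['target']`.

```python
import pandas as pd

# Made-up rows for illustration; NOT the real dataset values.
toy = pd.DataFrame({
    'sex':    [1, 1, 1, 1, 0, 0],
    'target': [1, 0, 0, 1, 1, 1],
})

# normalize='index' divides each row by its total, so every row sums to 1
# and shows the share of target 0 / target 1 within that sex.
rate_by_sex = pd.crosstab(toy['sex'], toy['target'], normalize='index')
print(rate_by_sex)
```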

> ### Correlation

In [None]:
cor_df = df.corr()

In [None]:
plt.figure(figsize=(10, 6))

sns.heatmap(
    data=cor_df,
    vmin=-1,
    vmax=1,
    center=0,
    cmap='seismic',
    annot=True
)

plt.show()
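The heatmap shows every pairwise correlation at once. To read off which features move most strongly with `target`, the `target` column of the correlation matrix can be sorted by absolute value. This is a minimal sketch on a made-up `toy` frame (an assumption; the numbers are not from `heart.csv`); on the real data `cor_df['target']` would take the place of `toy.corr()['target']`.

```python
import pandas as pd

# Made-up rows for illustration; NOT the real dataset values.
toy = pd.DataFrame({
    'age':     [63, 37, 41, 56, 57, 67],
    'thalach': [150, 187, 172, 178, 163, 108],
    'target':  [1, 1, 1, 1, 0, 0],
})

# Take the correlations with `target`, drop the trivial self-correlation,
# and order the rest by strength regardless of sign.
target_corr = (
    toy.corr()['target']
       .drop('target')
       .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```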

> ### Divide the Data set based on `target`

In [None]:
df_1 = df[df['target'] == 1]
df_0 = df[df['target'] == 0]

> ### Age group by

In [None]:
ag_df_1 = df_1.groupby(by=['age'])[['chol', 'thalach']].sum()
ag_df_1.plot(kind='bar', figsize=(15, 6), title='Risky')
plt.show()

In [None]:
ag_df_0 = df_0.groupby(by=['age'])[['chol', 'thalach']].sum()
ag_df_0.plot(kind='bar', figsize=(15, 6), title='Safe')
plt.show()
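Because `.sum()` adds up every patient at a given age, ages with more patients get taller bars even if the individual values are ordinary. A per-age `.mean()` removes that effect. The sketch below contrasts the two on a made-up `toy` frame (an assumption for illustration; not the real data).

```python
import pandas as pd

# Made-up rows: three patients aged 50, one aged 60. NOT the real data.
toy = pd.DataFrame({
    'age':     [50, 50, 50, 60],
    'chol':    [200, 210, 190, 300],
    'thalach': [160, 150, 170, 120],
})

grouped = toy.groupby('age')[['chol', 'thalach']]
sums = grouped.sum()    # grows with the number of patients per age
means = grouped.mean()  # typical value per patient at each age

print(sums)
print(means)
```

Here age 50 has the larger `chol` sum only because it has more patients; the mean shows that the single 60-year-old has the higher value.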

> ### Chest Pain - group by

In [None]:
cp_df_1 = df_1.groupby(by=['cp'])[['chol', 'thalach']].sum()
cp_df_1.plot(kind='pie', figsize=(15, 6), subplots=True, title='Risky')
plt.show()

In [None]:
cp_df_0 = df_0.groupby(by=['cp'])[['chol', 'thalach']].sum()
cp_df_0.plot(kind='pie', figsize=(15, 6), subplots=True, title='Safe')
plt.show()

> ### Role of cholesterol in heart attack risk
 
chol - min and max

In [None]:
df_1['chol'].min()

In [None]:
df_1[df_1['chol'] == 126]

* From the above table, we can see that `chol` is 126 (the minimum for the Risky group), yet the person is still in the Risky group.
* The person is a male whose `age` is 57.
* The `cp` is 2.

In [None]:
df_0['chol'].min()

In [None]:
df_0[df_0['chol'] == 131]

* From the above table, we can see that `chol` is 131 and the person is safe.
* The person happens to be a male whose `age` is 57.
* The `cp` is 0.

In [None]:
df_1['chol'].max()

In [None]:
df_1[df_1['chol'] == 564]

* From the above table, we can see that `chol` is 564 (the max) and the person is in the Risky group.
* The person is a female whose `age` is 67.
* The `cp` is 2.

In [None]:
df_0['chol'].max()

In [None]:
df_0[df_0['chol'] == 409]

* From the above table, we can see that `chol` is 409 (the max) and the person is safe.
* The person is a female whose `age` is 56.
* The `cp` is 0.

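In these four extreme cases, `cp` separates the groups better than `chol` does: both Risky rows have `cp` = 2 and both Safe rows have `cp` = 0. This suggests that `cp` is one of the important features.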
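The extreme rows can also be pulled out without hard-coding the `chol` values, by asking each `target` group for the index of its minimum and maximum. The sketch below is a minimal illustration on a `toy` frame that holds only the four rows quoted in the bullets above; the frame itself is an assumption, and on the real data `df` would take its place.

```python
import pandas as pd

# Only the four rows described in the bullets above, not the full dataset.
toy = pd.DataFrame({
    'age':    [57, 57, 67, 56],
    'sex':    [1, 1, 0, 0],
    'cp':     [2, 0, 2, 0],
    'chol':   [126, 131, 564, 409],
    'target': [1, 0, 1, 0],
})

by_target = toy.groupby('target')['chol']

# idxmin / idxmax return the row label of the extreme value in each group,
# which .loc then uses to fetch the full rows.
lowest = toy.loc[by_target.idxmin()]
highest = toy.loc[by_target.idxmax()]

print(lowest)
print(highest)
```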

> ### Scatter plot - `age` and `chol`

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df_1['age'], df_1['chol'], label='Risky')
plt.scatter(df_0['age'], df_0['chol'], label='Safe')
plt.xlabel('age')
plt.ylabel('cholesterol')
plt.legend()
plt.show()

> ### Chest Pain - `cp` and Heart Rate - `thalach`

In [None]:
mxh_rate_1 = df_1['thalach'].max()
mih_rate_1 = df_1['thalach'].min()
mxh_rate_0 = df_0['thalach'].max()
mih_rate_0 = df_0['thalach'].min()

In [None]:
print("For 1, the max is → {}".format(mxh_rate_1))
print("For 0, the max is → {}".format(mxh_rate_0))
print('------------')
print("For 1, the min is → {}".format(mih_rate_1))
print("For 0, the min is → {}".format(mih_rate_0))

In [None]:
df_1[df_1['thalach'] == mxh_rate_1]

* From the above table, we can see that `thalach` is 202, the highest value in the Risky group.
* The person is young (`age` is 29).
* The `cp` is 1.

In [None]:
df_1[df_1['thalach'] == mih_rate_1]

* From the above table, we can see that `thalach` is 96, the lowest value in the Risky group, yet the person is still in the Risky group.
* The `age` of the person is 60.
* The `cp` is 2.

In [None]:
df_0[df_0['thalach'] == mxh_rate_0]

* From the above table, we can see that `thalach` is 195, the highest value in the Safe group.
* The `age` of the person is 54.
* The `cp` is 1.

In [None]:
df_0[df_0['thalach'] == mih_rate_0]

* From the above table, we can see that `thalach` is 71, the lowest value in the Safe group, for a person whose `age` is 67.
* The `cp` is 0.

> ### Scatter plot of `age` and `cp`

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df_1['age'], df_1['cp'], label='Risky')
plt.scatter(df_0['age'], df_0['cp'], label='Safe')
plt.xlabel('age')
plt.ylabel('chest pain')
plt.legend()
plt.show()

Irrespective of age, `cp` (chest pain type) appears to be one of the important features for separating the two groups.

> ### Scatter plot of `age` and `thalach`

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df_1['age'], df_1['thalach'], label='Risky')
plt.scatter(df_0['age'], df_0['thalach'], label='Safe')
plt.xlabel('age')
plt.ylabel('thalach')
plt.legend()
plt.show()

* Comparing the two classes, the heart rate in the `Safe` category is generally lower than in the `Risky` category.
* The graph also shows that the heart rate decreases as age increases.
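The visual impression of a downward trend can be backed by a number, for example the Pearson correlation or the slope of a fitted line. The sketch below uses made-up `age` and `thalach` arrays (an assumption, chosen only to show the calls); on the real data `df['age']` and `df['thalach']` would be passed instead.

```python
import numpy as np

# Made-up values for illustration; NOT the real dataset values.
age = np.array([29, 40, 50, 60, 70])
thalach = np.array([202, 180, 165, 150, 130])

# Pearson correlation: the off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(age, thalach)[0, 1]

# Degree-1 polynomial fit returns [slope, intercept];
# the slope is the change in thalach per year of age.
slope, intercept = np.polyfit(age, thalach, 1)

print("correlation:", round(r, 3))
print("slope (bpm per year):", round(slope, 3))
```

A negative correlation and a negative slope would both support the statement that heart rate falls with age.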

> ### Histogram

In [None]:
def plot_histogram(data1, data2, col_name, target_col):
    plt.figure(figsize=(10, 6))
    
    for i in [data1, data2]:
        label = 'Risky' if i[target_col].iloc[0] == 1 else 'Safe'
        plt.hist(i[col_name], label=label, alpha=0.5)
    plt.legend()
    plt.show()
    
    return None

> ### Density and Histogram plots

In [None]:
def plot_hist_density(data1, data2, col_name, target_col):
    plt.figure(figsize=(10, 6))
    
    for i in [data1, data2]:
        label = 'Risky' if i[target_col].iloc[0] == 1 else 'Safe'
        # sns.distplot is deprecated; histplot with kde=True draws the same histogram + density curve
        sns.histplot(i[col_name], kde=True, stat='density', label=label)
    plt.legend()
    plt.show()
    
    return None

> ### PDF and CDF

In [None]:
def plot_pdf_cdf(data1, data2, col_name, target_col):
    plt.figure(figsize=(10, 6))

    for i in [data1, data2]:
        counts, bin_edges = np.histogram(a=i[col_name], bins=10, density=True)
        pdf = counts/sum(counts)
        cdf = np.cumsum(pdf)
        
        label = 'Risky' if i[target_col].iloc[0] == 1 else 'Safe'

        plt.plot(bin_edges[1:], pdf, label='Status {} - pdf'.format(label))
        plt.plot(bin_edges[1:], cdf, label='Status {} - cdf'.format(label))

    plt.xlabel(col_name)
    plt.legend()
    plt.show()
    
    return None
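To see what `plot_pdf_cdf` computes, it helps to run the same steps on a handful of numbers. The `values` array below is made up for illustration (an assumption). Note that the "pdf" here is the share of observations per bin, and the "cdf" is the running total of those shares.

```python
import numpy as np

# Made-up values for illustration.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Four equal-width bins over [1, 9]: edges are 1, 3, 5, 7, 9.
counts, bin_edges = np.histogram(values, bins=4)

pdf = counts / counts.sum()  # share of observations falling in each bin
cdf = np.cumsum(pdf)         # share of observations up to each bin's right edge

print("edges :", bin_edges)
print("counts:", counts)
print("pdf   :", pdf)
print("cdf   :", cdf)
```

The last `cdf` entry is always 1, because every observation lies at or below the final bin edge.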

Histogram - `thalach`

In [None]:
plot_histogram(data1=df_1, data2=df_0, col_name='thalach', target_col='target')

Density and Histogram - `thalach`

In [None]:
plot_hist_density(data1=df_1, data2=df_0, col_name='thalach', target_col='target')

PDF and CDF - `thalach`

In [None]:
plot_pdf_cdf(data1=df_1, data2=df_0, col_name='thalach', target_col='target')

> ### Statistical Measurements

**Mean** (changes when an outlier is introduced)

In [None]:
thalach_risky_mean = np.mean(df_1['thalach'])
thalach_safe_mean = np.mean(df_0['thalach'])

print("Risky →", thalach_risky_mean)
print("Safe →", thalach_safe_mean)

**Median** (barely changes when an outlier is introduced)

In [None]:
thalach_risky_med = np.median(df_1['thalach'])
thalach_safe_med = np.median(df_0['thalach'])

print("Risky →", thalach_risky_med)
print("Safe →", thalach_safe_med)

**Standard deviation** (changes when an outlier is introduced)

In [None]:
thalach_risky_std = np.std(df_1['thalach'])
thalach_safe_std = np.std(df_0['thalach'])

print("Risky →", thalach_risky_std)
print("Safe →", thalach_safe_std)
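The notes in parentheses about outliers can be checked on a small example. The arrays below are made up for illustration (an assumption): `clean` holds five ordinary values and `with_outlier` adds one extreme value.

```python
import numpy as np

# Made-up values for illustration.
clean = np.array([150, 155, 160, 165, 170])
with_outlier = np.append(clean, 400)  # one extreme value added

for name, arr in [('clean', clean), ('with outlier', with_outlier)]:
    print(name)
    print('\tmean   →', np.mean(arr))
    print('\tmedian →', np.median(arr))
    print('\tstd    →', round(float(np.std(arr)), 2))
```

The mean and the standard deviation shift noticeably once the outlier is added, while the median moves only slightly.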

**MAD** - Median Absolute Deviation

In [None]:
def compute_mad(data, c=0.6745):
    # Median of the absolute deviations from the median, divided by the constant c.
    med = np.median(data)
    abs_dev = [abs(i - med) for i in data]
    mad = np.median(abs_dev) / c
    return round(mad, 2)

In [None]:
print('Risky →', compute_mad(data=df_1['thalach']))
print('Safe →', compute_mad(data=df_0['thalach']))
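The same quantity can be written without the Python loop by using array operations. The sketch below is a vectorised variant under the name `mad_numpy` (a hypothetical helper, not part of the notebook) and is run on made-up values to show how little one extreme value moves it.

```python
import numpy as np

def mad_numpy(values, c=0.6745):
    """Median of |x - median(x)|, divided by the constant c (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    return np.median(np.abs(values - med)) / c

# Made-up values for illustration.
clean = [150, 155, 160, 165, 170]
with_outlier = clean + [400]

print('clean        →', round(mad_numpy(clean), 2))
print('with outlier →', round(mad_numpy(with_outlier), 2))
```

Unlike `compute_mad`, this version leaves rounding to the caller.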

**Percentile**

In [None]:
def compute_percentile(p, data):
    data = sorted(data)
    
    if (p == 100):
        return data[-1]
    
    l_p = (len(data) - 1) * (p / 100) + 1
    
    int_l_p = int(l_p)
    fl_l_p = l_p - int_l_p
    
    val1 = data[int_l_p - 1]
    val2 = data[int_l_p]    
    pval = val1 + (fl_l_p * (val2 - val1))
    
    return round(pval, 2)

In [None]:
print('--------------')
# Iterate over (label, frame) pairs directly instead of eval-ing variable names.
for name, frame in [('Risky', df_1), ('Safe', df_0)]:
    print(name)
    data = frame['thalach'].to_list()
    for i in [0, 25, 75, 90, 95, 100]:
        perc = compute_percentile(p=i, data=data)
        print('\t{} \t→ {}'.format(i, perc))
    print('--------------')
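`compute_percentile` places the percentile at a fractional position in the sorted data and interpolates linearly between the two neighbouring values. That is also the default behaviour of `np.percentile`, so the two should agree up to the rounding that `compute_percentile` applies. The sketch below restates the interpolation as `linear_percentile` (a hypothetical helper with zero-based indexing, not the notebook's function) and compares it with NumPy on made-up values.

```python
import numpy as np

def linear_percentile(sorted_vals, p):
    """Linear-interpolation percentile on already-sorted data (hypothetical helper)."""
    pos = (len(sorted_vals) - 1) * p / 100  # zero-based fractional position
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)  # clamp so p = 100 stays in range
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

# Made-up values for illustration.
data = sorted([150, 96, 202, 160, 120])

for p in [0, 25, 75, 90, 95, 100]:
    manual = linear_percentile(data, p)
    builtin = np.percentile(data, p)
    print('{}\tmanual → {}\tnumpy → {}'.format(p, round(manual, 2), round(float(builtin), 2)))
```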

**IQR**

In [None]:
def get_iqr(data):
    p75 = compute_percentile(p=75, data=data)
    p25 = compute_percentile(p=25, data=data)
    return p75 - p25

In [None]:
print('Risky →', get_iqr(data=df_1['thalach']))
print('Safe →', get_iqr(data=df_0['thalach']))

> ### Data Visualization

**Box plot**

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='target',y='thalach', data=df)
plt.show()

**Violin plots**

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='target',y='thalach', data=df)
plt.show()

**Contour plots**

In [None]:
# jointplot creates its own figure, so size it with `height` instead of plt.figure
sns.jointplot(x='age', y='thalach', data=df_1, kind='kde', height=6)
plt.show()

In [None]:
# jointplot creates its own figure, so size it with `height` instead of plt.figure
sns.jointplot(x='age', y='thalach', data=df_0, kind='kde', height=6)
plt.show()

### Observations

* Whatever the person's age, a low `cp` value goes with a very small chance of heart attack in this dataset.
* For a healthy heart at rest, the heart rate stays in the normal range.
* A normal resting heart rate for adults ranges from 60 to 100 beats per minute.

We should also consider other important factors that influence heart rate, such as:

* Fitness and activity levels
* Being a smoker
* Having cardiovascular disease, high cholesterol or diabetes
* Air temperature
* Body position (standing up or lying down, for example)
* Emotions
* Body size
* Medications

### End