In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Heart Attack Analysis using Python: A Case Study

When a heart attack occurs, the heart muscle that has lost blood supply begins to suffer injury. The amount of damage to the heart muscle depends on the size of the area supplied by the blocked artery and the time between injury and treatment. Heart muscle damaged by a heart attack heals by forming scar tissue.
    
In this session, we take a popular dataset from **Kaggle**, `Heart_Attack_Analysis`, and analyze it based on the information provided. By the end of this **Exploratory Data Analysis**, we will draw several conclusions about which factors are associated with heart attacks.

<a href="https://ibb.co/7kBRNR4"><img src="https://i.ibb.co/jJqWZWR/3-D-illustration-of-Heart-Part-of-Human-Organic.jpg" alt="3-D-illustration-of-Heart-Part-of-Human-Organic" border="0"></a><br />

## IMPORTING LIBRARIES

List of the Python libraries required:

* `pandas` is required to work with data in tabular form.
* `numpy` is required to round the data in the correlation matrix.
* `warnings` is required to suppress `FutureWarning` messages.
* `matplotlib`, `seaborn`, and `plotly` are required for data visualization.

In [None]:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

In [None]:
df = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
## understanding data

df.columns

Our dataset contains a total of 14 columns, with abbreviated column names.

Let's look at what information each column contains.

#### Columns Description:

1. `'age'`= Age of the person (in years)
2. `'sex'` =  Gender of the person (1=Male ; 0=Female)
3. `'cp'` = Chest Pain type
      * Value 0: typical angina
      * Value 1: atypical angina
      * Value 2: non-anginal pain
      * Value 3: asymptomatic
4. `'trtbps'` = resting blood pressure (in mm Hg on admission to the hospital)
5. `'chol'` = cholesterol in mg/dl (fetched via BMI sensor)
6. `'fbs'` = (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. `'restecg'` = resting electrocardiographic results
      * Value 0: normal
      * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
      * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. `'thalachh'` = maximum heart rate achieved
9. `'exng'` = exercise induced angina (1 = yes; 0 = no)
10. `'oldpeak'` = Previous peak or ST depression induced by exercise relative to rest
11. `'slp'` = The slope of the peak exercise ST segment
     * Value 0: upsloping
     * Value 1: flat
     * Value 2: downsloping
12. `'caa'` = number of major vessels
13. `'thall'` = Thal rate
14. `'output'` = Target variable OR diagnosis of heart disease (angiographic disease status)
     * Value 0: < 50% diameter narrowing
     * Value 1: > 50% diameter narrowing
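The coded values described above can be turned into readable labels before plotting or tabulating. The snippet below is only a minimal sketch: the lookup dictionaries `CP_LABELS` and `SEX_LABELS` and the tiny `sample` frame are hypothetical stand-ins written from the column descriptions, not part of the dataset itself.

```python
import pandas as pd

# Hypothetical lookup tables written from the column descriptions above.
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}
SEX_LABELS = {1: "Male", 0: "Female"}

# Tiny made-up sample standing in for the real `df`.
sample = pd.DataFrame({"sex": [1, 0, 1, 1], "cp": [0, 2, 3, 1]})

# Add readable label columns next to the original integer codes.
decoded = sample.assign(
    sex_label=sample["sex"].map(SEX_LABELS),
    cp_label=sample["cp"].map(CP_LABELS),
)
print(decoded)
```

The same `.map(...)` call should work on the real `df` as long as its columns use the codes listed above.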

In [None]:
# Now that the columns are described, look at the first 10 rows of the dataset.
df.head(10)

In [None]:
print('Shape of the dataframe is:', df.shape)

In [None]:
#checking summary of dataset

df.describe()

In [None]:
# checking NaN values

df.isna().sum()

In [None]:
# Count the distinct values in each column (avoids shadowing the built-in `dict`)
unique_counts = {col: df[col].nunique() for col in df.columns}

pd.DataFrame(unique_counts, index=["Unique Counts"]).transpose()

## Exploratory Analysis and Visualization

Now we will analyze our data graphically.

Let's begin by importing `matplotlib.pyplot` and `seaborn` and setting the style to `darkgrid`.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

## 1. Univariate Analysis of Categorical Variables

In [None]:
fig, axes = plt.subplots(4,2, figsize=(18,18))

#use the axis for plotting
axes[0, 0].set_title('(Plot.1.1) SEX')
sns.countplot(x=df.sex,
              palette = 'Set3',
              edgecolor=sns.color_palette("Set3", 4),
              linewidth=2,
             ax=axes[0,0]);


#use the axis for plotting
axes[0, 1].set_title('(Plot.1.2)CP')
sns.countplot(x=df.cp,
              palette = 'Set1',
              edgecolor=sns.color_palette("Set1", 2),
              linewidth=2,
             ax=axes[0,1]);


#use the axis for plotting
axes[1, 0].set_title('(Plot.1.3)FBS')
sns.countplot(x=df.fbs,
              palette = 'Set2',
              edgecolor=sns.color_palette("Set2", 2),
              linewidth=2,
             ax=axes[1,0]);


#use the axis for plotting
axes[1, 1].set_title('(Plot.1.4)REST-ECG')
sns.countplot(x=df.restecg,
              palette = 'Blues_r',
              edgecolor=sns.color_palette('Blues_r', 4),
              linewidth=2,
             ax=axes[1,1]);


#use the axis for plotting
axes[2, 0].set_title('(Plot.1.5)EXNG')
sns.countplot(x=df.exng,
              palette = 'Oranges_r',
              edgecolor=sns.color_palette('Oranges_r', 4),
              linewidth=2,
             ax=axes[2,0]);


#use the axis for plotting
axes[2, 1].set_title('(Plot.1.6)SLOPE')
sns.countplot(x=df.slp,
              palette = 'autumn_r',
              edgecolor=sns.color_palette('autumn_r', 2),
              linewidth=2,
             ax=axes[2,1]);


#use the axis for plotting
axes[3, 0].set_title('(Plot.1.7)CAA')
sns.countplot(x=df.caa,
              palette = 'icefire_r',
              edgecolor=sns.color_palette('icefire_r', 2),
              linewidth=2,
             ax=axes[3,0]);


axes[3, 1].set_title('(Plot.1.8)THALL')
sns.countplot(x=df.thall,
              palette = 'summer',
              edgecolor=sns.color_palette('summer', 4),
              linewidth=2,
             ax=axes[3,1]);

plt.tight_layout(pad=3);

### Conclusions:
- Male patients outnumber female patients.
- Most patients have typical anginal chest pain (cp=0).
- Most patients have a fasting blood sugar below 120 mg/dl.
- Most patients have either a normal resting ECG or an ST-T wave abnormality.
- Fewer patients have exercise-induced angina than do not.
- Almost 125 patients have 0 major vessels.
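Statements such as "most patients" can be checked numerically instead of being read off the bars. The sketch below assumes a small made-up `sample` frame in place of the real `df`; the helper name `category_shares` is hypothetical and introduced only for illustration.

```python
import pandas as pd

# Made-up sample standing in for the real `df`.
sample = pd.DataFrame({
    "sex": [1, 1, 1, 0, 1, 0],
    "fbs": [0, 0, 0, 1, 0, 0],
    "exng": [0, 1, 0, 0, 0, 1],
})

def category_shares(frame, columns):
    """Return {column: Series of category proportions, most frequent first}."""
    return {col: frame[col].value_counts(normalize=True) for col in columns}

shares = category_shares(sample, ["sex", "fbs", "exng"])
for col, share in shares.items():
    # Each Series maps a category code to its fraction of rows.
    print(col, share.round(2).to_dict())
```

Running the same helper on `df` gives the exact share behind each count plot.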

## 2. Univariate Analysis of Continuous and Target Variables

In [None]:
fig, axes = plt.subplots(2,3, figsize=(15,12))

#use the axis for plotting
axes[0, 0].set_title('(Plot.2.1)AGE')
sns.boxenplot(y=df.age,
            palette='Greens', 
            color='red',
           linewidth=3,
           ax=axes[0,0]);


#use the axis for plotting
axes[0,1].set_title('(Plot.2.2)TRTBPS')
sns.boxenplot(y=df.trtbps,
            palette='prism', 
            color='red',
           linewidth=2,
           ax=axes[0,1]);


#use the axis for plotting
axes[0, 2].set_title('(Plot.2.3)CHOL')
sns.boxenplot(y=df.chol,
            palette='viridis',
           linewidth=1,
           ax=axes[0,2]);


#use the axis for plotting
axes[1, 0].set_title('(Plot.2.4)THALACHH')
sns.boxenplot(y=df.thalachh,
            palette='Blues_r', 
            color='red',
           linewidth=3,
           ax=axes[1,0]);


#use the axis for plotting
axes[1, 1].set_title('(Plot.2.5)OLDPEAK')
sns.boxenplot(y=df.oldpeak,
            palette='RdPu', 
            color='red',
           linewidth=3,
           ax=axes[1,1]);


#use the axis for plotting
axes[1, 2].set_title('(Plot.2.6)OUTPUT')
sns.countplot(x=df.output,
             palette = 'vlag_r',
             saturation=0.50,
             ax=axes[1,2]);

plt.tight_layout(pad=3);

### Conclusions:
- Most patients are between 48 and 61 years old.
- Most patients have a resting blood pressure between 120 and 140.
- Most patients have a cholesterol level between 220 and 260.
- Most patients have a maximum heart rate between 135 and 165.
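The ranges above are read from the plots. One way to obtain comparable numbers directly is to take the 25th and 75th percentiles of each column, which bound the middle half of the values. This is a sketch on a made-up `sample` frame, and reading "most patients" as "the middle 50%" is an assumption of this example.

```python
import pandas as pd

# Made-up sample standing in for the real `df`.
sample = pd.DataFrame({
    "age": [40, 48, 52, 55, 58, 61, 70],
    "chol": [180, 220, 230, 240, 250, 260, 320],
})

# Rows become the column names, columns become the two quantiles.
middle_half = sample.quantile([0.25, 0.75]).T
middle_half.columns = ["q25", "q75"]
print(middle_half)
```

On the real data, `df[['age', 'trtbps', 'chol', 'thalachh']].quantile([0.25, 0.75])` would give the corresponding bounds.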

## 3. Bivariate Analysis

In this block, we analyze each of the following variables against the `output` variable:
 * age and output
 * blood pressure and output
 * cholesterol and output
 * heart rate and output
 * chest pain and output
 * thall and output
 * electrocardiograph and output
 * angina and output

In [None]:
fig, axes = plt.subplots(4,2, figsize=(20,18))

#use the axis for plotting
axes[0, 0].set_title('(Plot.3.1)AGE and OUTPUT')
sns.kdeplot(x=df.age,
            hue=df.output,
            fill=True,
            palette= 'Set2',
            ax=axes[0,0])

#use the axis for plotting
axes[0, 1].set_title('(Plot.3.2) BLOOD PRESSURE DISTRIBUTION')
sns.kdeplot(x=df.trtbps,
            hue=df.output,
            fill=True,
            palette= 'Set2',
            ax=axes[0,1])


#use the axis for plotting
axes[1, 0].set_title('(Plot.3.3) CHOLESTEROL DISTRIBUTION')
sns.kdeplot(x=df.chol,
            hue=df.output,
            fill=True,
            palette= 'Set2',
            ax=axes[1,0])


#use the axis for plotting
axes[1, 1].set_title('(Plot.3.4) HEART RATE DISTRIBUTION')
sns.kdeplot(x=df.thalachh,
            hue=df.output,
            fill=True,
            palette= 'Set2',
            ax=axes[1,1])


#use the axis for plotting
axes[2, 0].set_title('(Plot.3.5) CHEST PAIN DISTRIBUTION')
sns.kdeplot(x=df.cp,
           hue=df.output,
           fill= True,
           palette = 'Set2',
           ax= axes[2,0])


#use the axis for plotting
axes[2, 1].set_title('(Plot.3.6) THALL DISTRIBUTION')
sns.kdeplot(x=df.thall,
           hue=df.output,
           fill= True,
           palette = 'Set2',
           ax= axes[2,1])


#use the axis for plotting
axes[3, 0].set_title('(Plot.3.7) RESTECG DISTRIBUTION')
sns.kdeplot(x=df.restecg,
           hue=df.output,
           fill= True,
           palette = 'Set2',
           ax= axes[3,0])


#use the axis for plotting
axes[3, 1].set_title('(Plot.3.8) EXNG DISTRIBUTION')
sns.kdeplot(x=df.exng,
           hue=df.output,
           fill= True,
           palette = 'Set2',
           ax= axes[3,1])


plt.tight_layout(pad=3);

### Conclusions:
- There is no significant relationship between age and heart attack in this dataset, so the intuition that older age means higher risk is not supported here.
- Blood pressure and cholesterol do not appear to contribute to heart attack risk.
- Patients with a higher heart rate are more prone to heart attack.
- Patients with non-anginal chest pain are more prone.
- Patients with an abnormal ST-T electrocardiographic wave are more prone.
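The density plots can be complemented with simple group summaries: the mean of a continuous variable per `output` class, and the share of each `output` class within every category. The snippet below is a sketch on a made-up `sample` frame, not on the real data.

```python
import pandas as pd

# Made-up sample standing in for the real `df`.
sample = pd.DataFrame({
    "thalachh": [170, 165, 150, 120, 110, 130],
    "cp": [2, 2, 1, 0, 0, 0],
    "output": [1, 1, 1, 0, 0, 0],
})

# Continuous variable: compare the mean per output class.
mean_rate = sample.groupby("output")["thalachh"].mean()

# Categorical variable: share of each output class within each cp value.
cp_rate = pd.crosstab(sample["cp"], sample["output"], normalize="index")

print(mean_rate)
print(cp_rate)
```

With `normalize="index"`, each row of the cross-tabulation sums to 1, so the column for `output = 1` reads as the fraction of that category with the positive diagnosis.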

## 4. Correlation Between Variables

In [None]:
plt.figure(figsize = (12,10))
plt.title('(Plot.4.1) Correlation between variables')
sns.heatmap(df.corr(), fmt='.1f', annot=True, cmap= "bone_r");

### Conclusions:
No pair of variables is strongly correlated, but a few variables show a correlation of up to about 0.4 (positive or negative) with the target. These can be considered more closely associated with the target variable than the others.

Variables with an association of about 0.4 (positive or negative) are:
   * chest pain
   * heart rate
   * exercise induced angina
   * old peak
   * caa
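Rather than reading the heatmap by eye, the variables whose correlation with `output` reaches a chosen magnitude can be listed programmatically. This is a minimal sketch on a made-up `sample` frame; the 0.4 threshold mirrors the figure quoted above and is otherwise an arbitrary choice.

```python
import pandas as pd

# Made-up sample standing in for the real `df`.
sample = pd.DataFrame({
    "cp": [0, 0, 1, 2, 2, 3],
    "chol": [240, 250, 245, 235, 255, 245],
    "exng": [1, 1, 0, 0, 0, 0],
    "output": [0, 0, 1, 1, 1, 1],
})

THRESHOLD = 0.4  # assumed cut-off, taken from the text above

# Pearson correlation of every feature with the target.
corr_with_output = sample.corr()["output"].drop("output")

# Keep features at or above the threshold in absolute value, strongest first.
strong = corr_with_output[corr_with_output.abs() >= THRESHOLD]
strong = strong.sort_values(key=lambda s: s.abs(), ascending=False)
print(strong)
```

Filtering on the absolute value keeps negative associations (such as `exng` in this toy sample) alongside positive ones.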

## 5. Multivariate Analysis

In [None]:
print('(Plot.5.1)')
sns.pairplot(df, hue='output', palette="tab10");


## Final Conclusions:

1. There are no **NaN** values in our dataset. (shown by `df.isna().sum()`)
2. The dataset contains more **male** than **female** patients. (shown in `Plot.1.1`)
3. There are outliers in almost every continuous variable. (shown in `Plot.2.1` to `Plot.2.5`)
4. There is no strong relationship between **age** and heart attack. (shown in `Plot.3.1`)
5. **Blood pressure** and **cholesterol** do not contribute much to heart attack. (shown in `Plot.3.2` and `Plot.3.3`)
6. People with a **higher heart rate** are more prone to heart attack. (shown in `Plot.3.4`)
7. People with **non-anginal chest pain**, i.e. cp=2, are more prone. (shown in `Plot.3.5`)
8. People with **thall rate = 2** are more prone. (shown in `Plot.3.6`)
9. People with **restecg = 1**, i.e. an abnormal ST-T wave, are more prone. (shown in `Plot.3.7`)
10. People with **exng = 0**, i.e. no exercise induced angina, are more prone. (shown in `Plot.3.8`)
11. There is no apparent linear correlation between the continuous variables according to the heatmap. (shown in `Plot.4.1`)
12. Also, according to `Plot.5.1`:

       * **Males** are more prone to heart attacks than **females** (though the difference is small)
       * Heart attack is less likely when **fasting blood sugar** is below 120 mg/dl
       * **Oldpeak** at 0 is more prone
       * **Slope** at 2, i.e. downsloping, is more prone
       * **Caa**, the number of major vessels, at 1 is more prone
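The observation about outliers in the continuous variables can be quantified with the common 1.5 × IQR rule. This is a sketch under stated assumptions: the rule is one convention among several and is not necessarily what the boxen plots draw, the helper `iqr_outlier_counts` is hypothetical, and `sample` is made-up data.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (frame[col] < q1 - k * iqr) | (frame[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="outliers")

# Made-up sample standing in for the real `df`.
sample = pd.DataFrame({
    "chol": [200, 210, 220, 230, 240, 250, 564],
    "age": [45, 50, 52, 55, 58, 60, 63],
})

outliers = iqr_outlier_counts(sample, ["chol", "age"])
print(outliers)
```

On the real data the call would be `iqr_outlier_counts(df, ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak'])`.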

## Inferences and Conclusion

In summary, the factors that do not appear to affect heart attack much are:
 * age
 * sex
 * blood pressure
 * cholesterol

And the factors that do appear to affect heart attack are:
 * chest pain (non-anginal)
 * fasting blood sugar (when more than 120 mg/dl)
 * electrocardiograph (with abnormal ST-T wave)
 * high heart rate
 * angina induced by exercise
 * old peak
 * downsloping peak exercise ST segment
 * major vessels (when it is 1)
 * thall rate

#### Here's another exploratory data analysis [notebook](https://www.kaggle.com/tanushagupta/superstore-eda) of mine. Do check it out and review it.