## Heart Patient risk analysis using machine learning

### 1. Importing the dataset

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')
df.head()

#### Dataset feature details

0. id - Patient id

1. age - Age in years

2. sex - Sex (1 = male; 0 = female)

3. cp - Chest pain type (0 = asymptomatic; 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain)

4. trtbps - Resting blood pressure (in mm Hg on admission to the hospital)

5. chol - Serum cholesterol in mg/dl

6. fbs - Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

7. restecg - Resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = hypertrophy)

8. thalachh - Maximum heart rate achieved

9. exng - Exercise induced angina (1 = yes; 0 = no)

10. oldpeak - ST depression induced by exercise relative to rest

11. slp - Slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)

12. caa - Number of major vessels (0-3) colored by fluoroscopy

13. thall - Thalassemia blood disorder (1 = fixed defect; 2 = normal; 3 = reversible defect)

14. output - The target variable and predicted attribute (0 = less chance of heart attack; 1 = more chance of heart attack)
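The integer codes listed above can be easier to read during exploration if they are mapped to their labels. The following is a minimal sketch on a small hand-made frame; `label_maps`, `sample`, and `labelled` are hypothetical names that do not appear in the notebook, and the labels are copied from the feature list above.

```python
import pandas as pd

# Hypothetical lookup tables built from the feature descriptions above.
label_maps = {
    "sex": {0: "female", 1: "male"},
    "cp": {0: "asymptomatic", 1: "typical angina",
           2: "atypical angina", 3: "non-anginal pain"},
    "exng": {0: "no", 1: "yes"},
}

# Tiny illustrative frame (made-up rows, not taken from heart.csv).
sample = pd.DataFrame({"sex": [1, 0, 1], "cp": [0, 2, 3], "exng": [1, 0, 0]})

# Replace each code with its label, leaving the original frame untouched.
labelled = sample.copy()
for col, mapping in label_maps.items():
    labelled[col] = sample[col].map(mapping)

print(labelled)
```

Keeping the numeric frame and the labelled copy separate means the numeric columns stay available for correlation and modelling.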

### 2. Basic data analysis and cleaning

* Checking the dimensions of the dataset

In [None]:
print("No of rows and columns:", df.shape)

* Checking the column names and data type of each column

In [None]:
print(df.dtypes)

* Changing the data type of categorical columns

In [None]:
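# astype() returns a new object instead of modifying df in place,
# so the converted result has to be stored to take effect.
# It is kept in a separate frame (df_cat) so that df stays numeric
# for the corr() and describe() calls that follow.
cat_cols = ['sex', 'fbs', 'restecg', 'exng', 'slp', 'thall']
df_cat = df.astype({col: 'category' for col in cat_cols})
print(df_cat.dtypes[cat_cols])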

* Checking if there are any null values

In [None]:
print(df.isna().sum())

* Finding the correlation between variables using a correlation table

In [None]:
df.corr()

* Extracting the correlation of all the features to the target variable 'output'

In [None]:
print(df.corr().loc['output'])

*A notable Pearson correlation (positive or negative) is observed between the following features and the target variable:*
1. Age
2. Sex
3. Chest pain type
4. Max heart rate
5. Exercise induced angina
6. ST depression caused by exercise
7. ST segment slope
8. Number of major vessels
9. Thalassemia blood disorder
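Reading the strongest relationships off a printed column is easier when the values are ranked by absolute size, since a large negative correlation is as informative as a large positive one. The sketch below shows one way to do that on synthetic data; the frame `toy` and its column names are assumptions made for illustration, and the same two lines at the end could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: two informative columns and one noise column.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "pos_feature": target + rng.normal(0, 0.5, size=n),   # rises with the target
    "neg_feature": -target + rng.normal(0, 0.5, size=n),  # falls with the target
    "noise": rng.normal(0, 1, size=n),                    # unrelated to the target
    "output": target,
})

# Correlation of every feature with the target, ordered by absolute value.
corr_with_target = toy.corr()["output"].drop("output")
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```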

* Statistical summary of each column

In [None]:
df.describe()

* We observe that there are values outside the specified range for the columns 'caa' and 'thall'. Hence, we will have to impute the data of these columns.

##### 1. 'caa'

  The values of this column should be in the range of 0-3 as specified above. We first check the count of each of these values.

In [None]:
df.caa.value_counts()

It is observed that there are five entries with the value 4, which is outside our range. Since it exceeds the maximum of the range, we impute these entries by replacing them with the maximum value, i.e. 3.

In [None]:
df.loc[df.caa == 4, 'caa'] = 3
df.caa.value_counts()

It can now be observed that those five entries have been replaced with value 3.
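The same capping can be written without naming the offending value. As a minimal sketch on made-up values, `Series.clip` limits everything above the documented maximum; the series `caa` and the name `caa_capped` here are illustrative only.

```python
import pandas as pd

# Made-up values for illustration; 4 is outside the documented 0-3 range.
caa = pd.Series([0, 1, 4, 3, 2, 4], name="caa")

# clip(upper=3) caps anything above 3 and leaves valid values unchanged.
caa_capped = caa.clip(upper=3)
print(caa_capped.value_counts().sort_index())
```

This variant would also cap any other out-of-range value above 3, which may or may not be desirable depending on how such values should be treated.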

##### 2. 'thall'
  The values of this column should be in the range 1-3.

In [None]:
df.thall.value_counts()

There are two entries with the value 0. Since this is a categorical variable and we have no idea what 0 represents, we replace these two entries with the mode of this feature.

In [None]:
from statistics import mode
md = mode(df.thall)
df.loc[df.thall == 0, 'thall'] = md
df.thall.value_counts()

The entries with value '0' have been replaced by the mode, i.e. '2'.
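A pandas-only variant of this step is sketched below on made-up values. It computes the mode over the valid entries only, so the invalid code can never be selected as the replacement; `valid_mode` and `thall_clean` are hypothetical names.

```python
import pandas as pd

# Made-up values for illustration; 0 is not a documented category.
thall = pd.Series([2, 3, 0, 2, 1, 2, 0, 3], name="thall")

# Mode of the valid entries only; mode() returns a Series, so take its first value.
valid_mode = thall[thall != 0].mode().iloc[0]

# Replace the invalid code with that mode.
thall_clean = thall.replace(0, valid_mode)
print(valid_mode, thall_clean.tolist())
```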

*The dataset is now clean and ready for further exploratory data analysis of the various features.*

### 3. Exploratory data analysis

* We first explore the target variable and see the count of patients with low risk and high risk of heart attack

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

df_lc = df[df['output'] == 0]
df_hc = df[df['output'] == 1]

output_label = ['low chance','high chance']
sns.set_context("notebook")
sns.set_style("darkgrid")

ax = sns.countplot(x='output', data = df)
ax.set_title("Chance of people having heart conditions")
ax.set(xlabel = 'Chance', ylabel = 'No of patients')
plt.xticks(ticks = [0,1], labels = output_label)
plt.show()

*It is observed that there is a healthy ratio of both types of patients. We can now explore each feature in detail.*
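The class ratio can also be expressed as numbers rather than read from the bars. The sketch below uses made-up labels to show the idea; on the real data the same call would be applied to `df['output']`. The share of the larger class doubles as the accuracy of a model that always predicts that class, which is a simple baseline for any classifier to beat.

```python
import pandas as pd

# Made-up labels for illustration (0 = low chance, 1 = high chance).
output = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 0, 1], name="output")

# normalize=True returns fractions instead of raw counts.
class_share = output.value_counts(normalize=True).sort_index()
print(class_share)

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = class_share.max()
print("Majority-class baseline:", round(baseline_accuracy * 100, 2), "%")
```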

#### 1. Age
   
   Age distribution

In [None]:
ag = sns.histplot(data = df, x = 'age', binwidth = 5)
ag.set_title("Age distribution")
plt.show()

In [None]:
ax = sns.kdeplot(data = df, hue = 'output', bw_adjust = 1, x = 'age', palette = ['tab:cyan', 'tab:green'])
ax.set(xlabel = 'age', ylabel = 'age distribution')
plt.legend(["high risk", "low risk"])
plt.show()

*The graph indicates that:* 
* More people younger than 55 have a higher risk
* More people in the 55-70 age group have a lower risk
* More people older than 70 have a higher risk

*There might be a nonlinear correlation between age group and heart attack risk for patients.*
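One way to check an age-band reading like this numerically is to bin the ages and compute the share of high-risk patients per bin. The sketch below does this on made-up rows, so its numbers say nothing about the real dataset; the bin edges mirror the bands discussed above, and `toy`, `age_group`, and `share_high_risk` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age":    [42, 48, 51, 58, 61, 66, 69, 72, 75, 44],
    "output": [1,  1,  0,  0,  1,  0,  0,  1,  1,  1],
})

# With right=False the bins are [0, 55), [55, 70) and [70, 120).
bins = [0, 55, 70, 120]
labels = ["under 55", "55-69", "70+"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the share of high-risk patients in each group.
share_high_risk = toy.groupby("age_group", observed=True)["output"].mean()
print(share_high_risk)
```

A share that falls and then rises again across the bins would be consistent with a nonlinear relationship, which a single Pearson coefficient cannot capture.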

#### 2. Sex

In [None]:
hue_color = {0:'black', 1:'red'}
sex = ['female', 'male']
g = sns.countplot(data = df, x = 'sex', hue = 'output', palette = hue_color)
plt.xticks(ticks = [0,1], labels = sex)
plt.legend(['less chance', 'high chance'])
plt.show()

*This shows us that* 
* Slightly more men are at low risk than at high risk, but the two counts are comparable.
* Considerably more women are at high risk than at low risk.

*There seems to be a correlation between sex and heart attack risk even if it isn't a causation.*

#### 3. Chest pain

In [None]:
sns.countplot(data = df, x = 'cp', hue = 'output', palette = ['skyblue', 'darkgray'])
plt.xlabel("chest pain type")
plt.legend(['less chance', 'high chance'])
plt.show()

*Inference:* 
* Patients with chest pain type 1, 2, or 3 have a high chance 
* Patients with chest pain type 0 i.e asymptomatic have a low chance

*Linear correlation exists between the two.*

#### 4. Resting blood pressure

In [None]:
ax = sns.kdeplot(data = df, x = 'trtbps', bw_adjust = 0.9, hue = 'output', palette = ['mediumslateblue', 'yellowgreen'])
ax.set(xlabel = 'resting blood pressure')
plt.legend(['high chance', 'low chance'])
plt.show()

*Inference:*
* For resting blood pressure between 80 and 150, the high-risk group shows the higher density

*Difficult to establish a correlation even though there might be one.*

#### 5. Cholesterol

In [None]:
sns.kdeplot(x = 'chol', data = df, hue = 'output')
plt.legend(['high chance', 'low chance'])
plt.xlabel("Cholesterol")
plt.show()

*Inference:*
* For cholesterol roughly above 250 or below 150, the density of patients is more or less the same for the high-chance and low-chance groups.
* Patients in the approximate range of 150-250 are observed to have a higher chance of having a heart attack.

*Hard to predict, but there is a chance of a nonlinear correlation.*

#### 6. Fasting blood sugar

In [None]:
x = pd.crosstab(df['fbs'], df['output'])
x.plot(kind = 'bar', color = ['k', 'grey'])
plt.xticks(ticks = [0,1], labels = ['False', 'True'], rotation = 0)
plt.legend(['less chance', 'high chance'])
plt.show()

*It is observed that there isn't much of a correlation between one's fasting blood sugar and heart condition risk*
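Raw counts can hide or exaggerate a relationship when the groups differ in size, so comparing row-wise proportions is a useful cross-check for this and the other bar charts built from `pd.crosstab`. The sketch below uses made-up rows; `toy` and `row_share` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "fbs":    [0, 0, 0, 0, 1, 1, 1, 1],
    "output": [1, 0, 1, 0, 1, 0, 0, 1],
})

# normalize='index' makes each row sum to 1, so groups of different sizes are comparable.
row_share = pd.crosstab(toy["fbs"], toy["output"], normalize="index")
print(row_share)
```

Rows with nearly identical proportions suggest the feature carries little information about the target, while clearly different rows suggest the opposite.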

#### 7. Resting electrocardiographic results

In [None]:
x = pd.crosstab(df['restecg'], df['output'])
x.plot(kind = 'bar', color = ['tab:blue', 'tab:red'])
plt.legend(['less chance', 'high chance'])
plt.xticks(rotation = 0)
plt.show()

*There seems to be a significant difference between the counts of low-risk and high-risk patients only for value 1. This might just be a correlation and not causation.*

#### 8. Max heart rate achieved

In [None]:
sns.kdeplot(data = df, hue = 'output', x = 'thalachh', bw_adjust = 0.75, palette = ['cornflowerblue', 'turquoise'])
plt.xlabel("Max heart rate achieved")
plt.legend(['high risk', 'low risk'])
plt.show()

*Inference:*
* Patients with a max heart rate below 140 are more likely to have a lower risk.
* Patients with a max heart rate above 140 are more likely to have a higher risk.

*There is a strong correlation between the max heart rate and risk of having a heart attack*

#### 9. Exercise induced angina

In [None]:
x = pd.crosstab(df['exng'], df['output'])
x.plot(kind = 'bar', color = ['plum', 'mediumpurple'])
plt.legend(['less chance', 'high chance'])
plt.xticks(ticks = [0,1], labels = ['no', 'yes'], rotation = 0)
plt.show()

*The stark difference suggests that there is a strong correlation between this feature and the target variable.*

#### 10. Exercise induced ST depression

In [None]:
sns.histplot(data = df, x = 'oldpeak', bins = 10, hue = 'output', stat = 'probability', palette = {0:'c',1:'m'})
plt.legend(['high risk', 'low risk'])
plt.show()

*Inference - There is a good correlation between these two variables because:*
* Patients with lower values are more often at high risk than at low risk
* Patients with higher values are more often at low risk than at high risk

#### 11. ST segment slope

In [None]:
x = pd.crosstab(df['slp'], df['output'])
x.plot(kind = 'bar', color = ['tab:purple', 'tab:pink'])
plt.legend(['less chance', 'high chance'])
plt.xticks(rotation = 0)
plt.show()

*Inference*
- For value 0: No significant difference
- For value 1: Patients with low risk are higher in number
- For value 2: Patients with high risk are higher in number

*There is a correlation between these two variables.*

#### 12. Number of major vessels

In [None]:
x = pd.crosstab(df['caa'], df['output'])
x.plot(kind = 'bar', color = ['lightcoral', 'skyblue'])
plt.legend(['less chance', 'high chance'])
plt.xticks(rotation = 0)
plt.show()

*As the value increases, there seems to be a higher probability of patients having a low risk compared to high risk.*

*There is a linear correlation between the target variable and the number of major vessels.*

#### 13. Thalassemia blood disorder

In [None]:
x = pd.crosstab(df['thall'], df['output'])
x.plot(kind = 'barh', color = ['bisque', 'darkseagreen'])
plt.legend(['less chance', 'high chance'])
plt.xticks(rotation = 0)
plt.show()

*Inference:*
* Value 1 and 3: Patients with low risk are more in number
* Value 2: Patients with high risk are more in number

*There seems to be some correlation between these two variables.*

### 4. Data modelling using ML

*Since this is a classification problem involving multiple categorical and numerical variables, a suitable machine learning algorithm for preparing our model is the **Random Forest algorithm**.*

To learn more about the random forest algorithm, you can refer to this link: [Random Forest Classifier](https://medium.com/machine-learning-101/chapter-5-random-forest-classifier-56dc7425c3e1)

* Importing the required libraries and splitting our dataset into the features and the target

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df.drop(['output'], axis = 1)
y = df["output"]
X.head()

* Defining our data model

In [None]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
my_model = RandomForestClassifier(n_estimators = 100, criterion = 'gini', min_samples_split = 4, 
                               max_depth = 6, random_state = 0)
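When neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows for validation. The sketch below makes that explicit on synthetic data and additionally passes `stratify`, which is not used in the notebook; it is shown here only as an option that keeps the class ratio identical in both parts. All variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 60/40 class split.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 60 + [0] * 40)

# test_size=0.25 matches the library default; stratify keeps the class ratio
# the same in the training and validation parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
print(len(y_tr), len(y_va), y_tr.mean(), y_va.mean())
```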

* Training our data model

In [None]:
my_model.fit(train_X, train_y)
prediction = my_model.predict(val_X)

* Testing and validating our data model

In [None]:
accuracy = accuracy_score(val_y, prediction)
print("Model accuracy:", round(accuracy*100, 2), "%")
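Accuracy alone does not show which class the errors fall on. A confusion matrix breaks the predictions down by true and predicted class, and accuracy is then the diagonal divided by the total. The sketch below illustrates this with made-up labels; on the real data, `val_y` and `prediction` would take the place of `actual` and `predicted`.

```python
import pandas as pd

# Made-up labels and predictions for illustration only.
actual    = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="actual")
predicted = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="predicted")

# Rows are true classes, columns are predicted classes.
confusion = pd.crosstab(actual, predicted)
print(confusion)

# Accuracy is the number of correct predictions (the diagonal) over the total.
correct = confusion.loc[0, 0] + confusion.loc[1, 1]
accuracy = correct / len(actual)
print("Accuracy:", accuracy)
```

In a medical setting the off-diagonal cell for high-risk patients predicted as low risk is usually the more costly error, so it is worth inspecting separately.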

* We now see the importance of each feature in our ML data model using a feature importance table

In [None]:
importance = my_model.feature_importances_
print("Feature importance table\n")
# summarize feature importance
for i,v in enumerate(importance):
    print(X.columns[i], ':', round((v*100),2), '%')

*From this we can infer that, barring fasting blood sugar and resting electrocardiographic results, all other features contributed to the ensemble of decision trees that was used to make our predictions.*
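The importance table is easier to interpret when sorted. As a minimal sketch, the values can be paired with their column names in a pandas Series and ordered from most to least important. The numbers and the column subset below are hypothetical and chosen only for illustration; on the real model, `my_model.feature_importances_` and `X.columns` would be used instead.

```python
import pandas as pd

# Hypothetical importance values and column names, for illustration only.
columns = ["cp", "fbs", "thalachh", "restecg"]
importance = [0.45, 0.05, 0.40, 0.10]

# Pair each value with its column name and sort from most to least important.
ranked = pd.Series(importance, index=columns).sort_values(ascending=False)
print((ranked * 100).round(2))
```

Impurity-based importances from scikit-learn tree ensembles sum to 1, so each value can be read as a share of the total.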

#### Results and conclusion:
* We analysed the dataset and understood the correlation between each feature and the target variable.
* We prepared a data model using the random forest machine learning algorithm for this classification problem.
* We inspected the feature importances of our data model and found that almost all features except two played a role in making our predictions.
* The most significant factors that contribute towards predicting the heart attack risk of a patient are:
  1. Chest pain type
  2. Maximum heart rate achieved
  3. ST depression induced by exercise
  4. Number of major vessels
  5. Thalassemia blood disorder
* We tested and validated our data model. **It achieved an accuracy of 88.16%.**
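An accuracy figure from a single train/validation split depends on which rows happen to land in the validation set. Cross-validation averages over several splits and gives a sense of that variation. The sketch below runs 5-fold cross-validation with the same hyperparameters as the model above, but on synthetic data generated with `make_classification`, so its scores say nothing about the heart dataset; in the notebook, `X` and `y` would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the notebook would pass X and y instead.
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=0)

# Same hyperparameters as the model defined earlier.
model = RandomForestClassifier(n_estimators=100, criterion="gini",
                               min_samples_split=4, max_depth=6, random_state=0)

# 5-fold cross-validation: five train/validation splits, one accuracy score each.
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean() * 100, 2), "%")
```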