# Problem Statement : Heart Failure Predicton

## Problem Description
Cardiovascular diseases (CVDs) are the leading cause of mortality worldwide, accounting for approximately 31% of all global deaths annually. Heart failure, a significant consequence of CVDs, demands early detection and intervention, particularly in individuals at high risk due to factors like hypertension, diabetes, hyperlipidaemia, or existing heart conditions. The Heart Failure [Prediction Dataset from Kaggle](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction) provides crucial insights into this issue by featuring 11 attributes that help in predicting potential heart disease in individuals.

## Task
The task is to develop a machine learning model using the provided dataset to predict the likelihood of heart disease in patients. This involves analyzing various patient characteristics and health indicators such as age, sex, type of chest pain, blood pressure, cholesterol levels, fasting blood sugar, resting electrocardiogram results, maximum heart rate, presence of exercise-induced angina, and ST depression during peak exercise. The goal is to effectively use these features to discern patterns that indicate the presence of heart disease, which is marked in the dataset as either 1 (heart disease) or 0 (normal). The successful implementation of this model could aid medical professionals in diagnosing heart disease more accurately and promptly.

#  Load the Data Set



In [30]:
import pandas as pd

# Load the dataset using pandas
df = pd.read_csv("heart.csv")

df.head()


Index(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
       'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope',
       'HeartDisease'],
      dtype='object')


# One hot Encoding using Pandas

In [27]:
cat_variables = ['Sex',
'ChestPainType',
'RestingECG',
'ExerciseAngina',
'ST_Slope'
]

In [28]:
# This will replace the columns with the one-hot encoded ones and keep the columns outside 'columns' argument as it is.
df = pd.get_dummies(data = df,
                         prefix = cat_variables,
                         columns = cat_variables)
df.head()

KeyError: "None of [Index(['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'], dtype='object')] are in the [columns]"

In [None]:
features = [x for x in df.columns if x not in 'HeartDisease'] ## Removing our target variable

In [None]:
print(len(features))

In [None]:
RANDOM_STATE = 55

X_train, X_val, y_train, y_val = train_test_split(df[features], df['HeartDisease'], train_size = 0.8, random_state = RANDOM_STATE)



# Scikit-learn Implementation
[Scikit learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

Hyperparameters:

 - min_samples_split: The minimum number of samples required to split an internal node.
   - Choosing a higher min_samples_split can reduce the number of splits and may help to reduce overfitting.
 - max_depth: The maximum depth of the tree.
   - Choosing a lower max_depth can reduce the number of splits and may help to reduce overfitting.
 - max_features: The maximum depth of the tree.
   - The number of features to consider when looking for the best split
 - n_jobs: The maximum depth of the tree.
   - Since the fitting of each tree is independent of each other, it is possible fit more than one tree in parallel. The number of jobs to run in parallel.

In [18]:
min_samples_split_list = [2,10, 30, 50, 100, 200, 300, 700]  ## If the number is an integer, then it is the actual quantity of samples,
                                             ## If it is a float, then it is the percentage of the dataset
max_depth_list = [2, 4, 8, 16, 32, 64, None]
n_estimators_list = [10,50,100,500]

In [19]:
from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 50
accuracy_list_train = []
accuracy_list_val = []
for min_samples_split in min_samples_split_list:
    # You can fit the model at the same time you define it, because the fit function returns the fitted estimator.
    model = RandomForestClassifier(min_samples_split = min_samples_split,
                                   random_state = RANDOM_STATE).fit(X_train,y_train)
    predictions_train = model.predict(X_train) ## The predicted values for the train dataset
    predictions_val = model.predict(X_val) ## The predicted values for the test dataset
    accuracy_train = accuracy_score(predictions_train,y_train)
    accuracy_val = accuracy_score(predictions_val,y_val)
    accuracy_list_train.append(accuracy_train)
    accuracy_list_val.append(accuracy_val)

plt.title('Train x Validation metrics')
plt.xlabel('min_samples_split')
plt.ylabel('accuracy')
plt.xticks(ticks = range(len(min_samples_split_list )),labels=min_samples_split_list)
plt.plot(accuracy_list_train)
plt.plot(accuracy_list_val)
plt.legend(['Train','Validation'])

NameError: name 'X_val' is not defined

In [20]:
accuracy_list_train = []
accuracy_list_val = []
for max_depth in max_depth_list:
    # You can fit the model at the same time you define it, because the fit function returns the fitted estimator.
    model = RandomForestClassifier(max_depth = max_depth,
                                   random_state = RANDOM_STATE).fit(X_train,y_train)
    predictions_train = model.predict(X_train) ## The predicted values for the train dataset
    predictions_val = model.predict(X_val) ## The predicted values for the test dataset
    accuracy_train = accuracy_score(predictions_train,y_train)
    accuracy_val = accuracy_score(predictions_val,y_val)
    accuracy_list_train.append(accuracy_train)
    accuracy_list_val.append(accuracy_val)

plt.title('Train x Validation metrics')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.xticks(ticks = range(len(max_depth_list )),labels=max_depth_list)
plt.plot(accuracy_list_train)
plt.plot(accuracy_list_val)
plt.legend(['Train','Validation'])

NameError: name 'X_val' is not defined

In [21]:
random_forest_model = RandomForestClassifier(n_estimators = 100,
                                             max_depth = 16,
                                             min_samples_split = 10).fit(X_train,y_train)

In [22]:
print(f"Metrics train:\n\tAccuracy score: {accuracy_score(random_forest_model.predict(X_train),y_train):.4f}\nMetrics test:\n\tAccuracy score: {accuracy_score(random_forest_model.predict(X_val),y_val):.4f}")

NameError: name 'accuracy_score' is not defined