![](https://i.ibb.co/yW9HZS8/random-forest.png)

- **ML Part 1** - Logistic Regression
- **ML Part 2** - K-Nearest Neighbors (KNN)
- **ML Part 3** - Support Vector Machine (SVM)
- **ML Part 4** - Artificial Neural Network (NN)
- **ML Part 5** - Classification and Regression Tree (CART)
- **ML Part 6 - Random Forests**
- **ML Part 7** - Gradient Boosting Machines (GBM)
- **ML Part 8** - XGBoost
- **ML Part 9** - LightGBM
- **ML Part 10** - CatBoost

Random Forest is an ensemble model in which multiple decision trees are combined to build a stronger model. The resulting model is more robust and accurate, and handles overfitting better than its constituent trees.

## Basic Theory
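Random Forest combines a series of decision trees using the "bagging method" (bootstrap aggregating): each tree is trained on a random sample of the training data drawn with replacement. The trees' outputs are then combined to produce classification or regression results. In classification, the output is decided by majority voting, while in regression it is the average of the trees' predictions.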
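
The bagging and voting steps described above can be sketched by hand. This is only an illustration on synthetic data, not how the rest of this notebook trains its model: the dataset, the number of trees (`n_trees`) and the variable names are assumptions made for the example. Note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes, so the sketch shows the idea, not the library's exact procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X[idx], y[idx]))

# Every tree votes for a class; the class with most votes wins
votes = np.stack([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)

print("Ensemble accuracy on the training data:", (majority == y).mean())
```

For regression, the last step would average the trees' numeric predictions instead of taking a vote.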

Random Forest creates a robust and accurate model that can process a wide variety of input data with binary, categorical, and continuous features.


![](https://miro.medium.com/max/592/1*i0o8mjFfCn-uD79-F1Cqkw.png)


## Loss Function
We use entropy or the Gini score to measure the impurity (loss) of the data at each node. Each tree chooses the split that reduces this impurity the most.
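
As a rough sketch of what these two scores measure, the functions below compute Gini impurity and entropy for a list of class labels. The function names `gini` and `entropy` and the sample label lists are made up for this example; scikit-learn computes the same quantities internally when it evaluates candidate splits.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: -sum(p_k * log2(p_k)) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = [0, 0, 0, 0]     # a node containing a single class
mixed = [0, 0, 1, 1]    # a node split evenly between two classes

print("pure  node -> gini:", gini(pure), " entropy:", abs(entropy(pure)))
print("mixed node -> gini:", gini(mixed), " entropy:", entropy(mixed))
```

Both scores are zero for a pure node and reach their maximum when the classes are evenly mixed, which is why a good split is one that lowers them.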


## Advantages
- It is an accurate and powerful model.
- It handles overfitting well.
- It performs implicit feature selection and provides feature importance scores.


## Disadvantages
- As the forest grows, training and prediction become computationally expensive and slower.
- Its predictions are hard to interpret compared with a single decision tree.


## Hyperparameters
- **n_estimators:**
    - Default: 100
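    - The number of trees in the forest. More trees generally improve accuracy, with diminishing returns, but increase computational cost.
- **max_features:**
    - Default: 'auto' (equivalent to 'sqrt' for classifiers; recent scikit-learn versions use 'sqrt' as the default and no longer accept 'auto')
    - The number of features considered when looking for the best split at each node.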
- **min_samples_split:**
    - Default: 2
    - The minimum number of samples required to split an internal node.
- **min_samples_leaf:**
    - Default: 1
    - The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
- **criterion:**
    - Default: 'gini'
    - The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.
- **max_depth:**
    - Default: None
    - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
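
To make the list above concrete, here is a small sketch that passes each hyperparameter explicitly and then checks its effect on the fitted forest. The synthetic dataset and the variable names `clf` and `shallow` are assumptions for this example; `max_features="sqrt"` is used because it matches the old `'auto'` behaviour for classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have something to fit
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The values below are the defaults listed above, written out explicitly
clf = RandomForestClassifier(
    n_estimators=100,       # number of trees
    max_features="sqrt",    # features considered at each split
    min_samples_split=2,    # minimum samples needed to split a node
    min_samples_leaf=1,     # minimum samples required in each leaf
    criterion="gini",       # split quality measure
    max_depth=None,         # grow trees until the leaves are pure
    random_state=0,
).fit(X, y)

# Limiting max_depth gives shallower, simpler trees
shallow = RandomForestClassifier(max_depth=3, random_state=0).fit(X, y)

print("Trees in the forest      :", len(clf.estimators_))
print("Deepest tree (no limit)  :", max(t.get_depth() for t in clf.estimators_))
print("Deepest tree (max_depth=3):", max(t.get_depth() for t in shallow.estimators_))
```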
    
    
## Comparison with Other Models
The Random Forest comparison is quite similar to Decision tree comparisons.

### Random Forest vs Naive Bayes
- Random Forest is a complex and large model, while Naive Bayes is a relatively small model.
- While Naive Bayes performs better with small training sets, Random Forest needs a larger set of training data.

### Random Forest vs Artificial Neural Networks (NN)
- Both are very powerful and highly accurate algorithms.
- Both model feature interactions internally and are less explainable.
- Random Forest does not require feature scaling, whereas NN does.
- Ensemble versions of both models tend to be strong.
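
The point about scaling can be checked with a small experiment. The sketch below trains the same forest once on raw features and once on standardized features and compares the predictions. The synthetic data, the per-feature scale factors and the variable names are assumptions for this example. Because trees split on thresholds, the two models should agree on nearly every sample, although tiny floating-point differences are possible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
# Put the features on very different scales on purpose
X = X * np.array([1, 10, 100, 1000, 5, 50])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Forest trained on the raw features
raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Same forest trained on standardized features (scaler fitted on the training set only)
scaler = StandardScaler().fit(X_tr)
scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X_tr), y_tr)

agreement = np.mean(raw.predict(X_te) == scaled.predict(scaler.transform(X_te)))
print(f"Test predictions that agree: {agreement:.2%}")
```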

In [None]:
# Import the necessary packages
import numpy as np
import pandas as pd

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import and read dataset
input_ = "../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv"
df = pd.read_csv(input_)

df.head(10)

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(),annot=True)
plt.show()

In [None]:
df.describe()

In [None]:
x = df.drop(columns='DEATH_EVENT')
y = df['DEATH_EVENT']

model = RandomForestClassifier()
model.fit(x,y)
print(model.feature_importances_)
feat_importances = pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nlargest(12).plot(kind='barh')
plt.show()

When we examine the graph above, we can expect the time, serum_creatinine, ejection_fraction, and age features to contribute the most to accuracy during training.

In [None]:
# Delete outlier
df=df[df['ejection_fraction']<70]

In [None]:
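# Keep only the four features highlighted by the importance plot (columns 0, 4, 7, 11)
#inp_data = df.drop(df[['DEATH_EVENT']], axis=1)
inp_data = df.iloc[:,[0,4,7,11]]
out_data = df[['DEATH_EVENT']]

X_train, X_test, y_train, y_test = train_test_split(inp_data, out_data, test_size=0.2, random_state=0)

## Applying Transformer
# Fit the scaler on the training set only, then reuse it on the test set to avoid data leakage
sc=StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)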

In [None]:
## X_train, X_test, y_train, y_test Shape

print("X_train Shape : ", X_train.shape)
print("X_test Shape  : ", X_test.shape)
print("y_train Shape : ", y_train.shape)
print("y_test Shape  : ", y_test.shape)

In [None]:
## I coded this method for convenience and to avoid writing the same code over and over again

def result(clf):
    # Fit on the training split and predict on the held-out test split
    clf.fit(X_train, np.ravel(y_train))
    y_pred = clf.predict(X_test)
    # ROC AUC is computed from class-1 probabilities rather than hard labels
    y_proba = clf.predict_proba(X_test)[:, 1]
    
    print('Accuracy Score: {:.4f}'.format(accuracy_score(y_test, y_pred)))
    print('Random Forest Classifier f1-score      : {:.4f}'.format(f1_score(y_test, y_pred)))
    print('Random Forest Classifier precision     : {:.4f}'.format(precision_score(y_test, y_pred)))
    print('Random Forest Classifier recall        : {:.4f}'.format(recall_score(y_test, y_pred)))
    print("Random Forest Classifier roc auc score : {:.4f}".format(roc_auc_score(y_test, y_proba)))
    # classification_report expects the true labels first, then the predictions
    print("\n", classification_report(y_test, y_pred))
    
    # Confusion matrix as a percentage of all test samples
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6,6))
    sns.heatmap((cm / np.sum(cm)*100), annot=True, fmt=".2f", cmap="Blues")
    plt.title("RandomForestClassifier Confusion Matrix (Rate)")
    plt.show()
    
    # Confusion matrix as raw counts
    plt.figure(figsize=(6,6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=["FALSE","TRUE"],
                yticklabels=["FALSE","TRUE"],
                cbar=False)
    plt.title("RandomForestClassifier Confusion Matrix (Number)")
    plt.show()
    
def sample_result(
    n_estimators=100,
    max_features='sqrt',  # same behaviour as the old 'auto' setting for classifiers
    max_depth=None,
    min_samples_split=2):
    
    scores = []
    for i in range(100): # 100 random train/test splits
        X_train, X_test, y_train, y_test = train_test_split(inp_data, out_data, test_size=0.2)
        clf = RandomForestClassifier(n_estimators=n_estimators,
                                     max_features=max_features,
                                     max_depth=max_depth,
                                     min_samples_split=min_samples_split)
        # Fit the scaler on the training split only, then apply it to the test split
        sc = StandardScaler()
        X_train = sc.fit_transform(X_train)
        X_test = sc.transform(X_test)
        clf.fit(X_train, np.ravel(y_train))
        scores.append(accuracy_score(y_test, clf.predict(X_test)))
    
    # Distribution of test accuracy across the random splits
    plt.hist(scores)
    plt.show()
    print("Best Score: {}\nMean Score: {}".format(np.max(scores), np.mean(scores)))

### Simple Method
I applied Random Forest directly with its default hyperparameters, and the result is as follows:

In [None]:
clf = RandomForestClassifier(random_state=0)
result(clf)
sample_result()

### Advanced Method

In [None]:
param_grid = {
    "n_estimators": [100, 500, 1000],
    "max_features": [0.5,1,'sqrt'],  # 'sqrt' replaces 'auto', which newer scikit-learn versions no longer accept
    "max_depth": [1,2,3,4,None],
    "min_samples_split": [2,5,8]
}

clf = RandomForestClassifier()
grid = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=2, cv=10)
grid.fit(X_train, y_train)
grid.best_params_

In [None]:
clf = RandomForestClassifier(
    n_estimators=1000,
    max_features=0.5,
    max_depth=3,
    min_samples_split=5,
    random_state=0
)

result(clf)
sample_result(1000,0.5,3,5)

In [None]:
Importance = pd.DataFrame({'Importance':clf.feature_importances_*100},index=df.iloc[:,[0,4,7,11]].columns)
Importance.sort_values(by='Importance',axis=0,ascending=True).plot(kind='barh',color='lightblue')
plt.xlabel('Importance for variable');

## Reporting
I evaluated the results with a confusion matrix on the 60-sample test set. The results are as follows:

**Correctly predicted -> 95.00% (57 of 60 predictions are correct)**
- True Negative -> 68.33% (41 people) -> Those who were predicted not to die and who did not die
- True Positive -> 26.67% (16 people) -> Those who were predicted to die and who did die

**Wrongly predicted -> 5.00% (3 of 60 predictions are wrong)**
- False Positive -> 3.33% (2 people) -> Those who were predicted to die but who did not die
- False Negative -> 1.67% (1 person) -> Those who were predicted not to die but who did die
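
As a quick arithmetic check, the usual metrics can be recomputed from the four confusion matrix counts listed above (41, 16, 2 and 1, which sum to 60). This is plain arithmetic on those counts, not a rerun of the model, and the variable names are chosen for this example only.

```python
# Counts taken from the confusion matrix reported above
tn, tp, fp, fn = 41, 16, 2, 1
total = tn + tp + fp + fn

accuracy = (tp + tn) / total            # share of all predictions that are correct
precision = tp / (tp + fp)              # of those predicted to die, how many did
recall = tp / (tp + fn)                 # of those who died, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"total samples: {total}")        # total samples: 60
print(f"accuracy : {accuracy:.4f}")     # accuracy : 0.9500
print(f"precision: {precision:.4f}")    # precision: 0.8889
print(f"recall   : {recall:.4f}")       # recall   : 0.9412
print(f"f1-score : {f1:.4f}")           # f1-score : 0.9143
```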