# **Random Forest**

Random forest is an ensemble learning method that combines the predictions of multiple decision trees into a final prediction. It is widely used in practice because it is simple to apply and performs well on many tabular problems.

**For example**, we can use random forest to predict the class labels of a dataset; the final prediction is the class that receives the most votes from the decision trees. Averaging many decorrelated trees mainly reduces the variance of the final prediction while leaving the bias roughly unchanged, which improves the generalization performance of the model.

- Belongs to the ensemble learning family.
- The two main ensemble strategies are bagging and boosting; random forest is a bagging method.

### **Bootstrap Aggregating (Bagging)**

Random forest builds multiple decision trees using bagging. For each tree, it draws a bootstrap sample (rows selected at random from the dataset with replacement) and fits a decision tree on that sample.
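The bootstrap step can be sketched with NumPy alone. This is a minimal illustration, not what scikit-learn does internally; the names `n_samples`, `bootstrap_idx`, and `oob_idx` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend the dataset has 10 rows, identified by their index
n_samples = 10
data_indices = np.arange(n_samples)

# One bootstrap sample: same size as the dataset, drawn WITH replacement,
# so some rows appear several times and others not at all
bootstrap_idx = rng.choice(data_indices, size=n_samples, replace=True)

# Rows that were never drawn are "out-of-bag" for the tree fitted on this sample
oob_idx = np.setdiff1d(data_indices, bootstrap_idx)

print("bootstrap indices: ", bootstrap_idx)
print("out-of-bag indices:", oob_idx)
```

Each tree in the forest would get its own independent draw, so every tree sees a slightly different version of the training data.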

#### Random Feature Selection

At each split during tree construction, only a random subset of the features is considered. This decorrelates the trees, which makes the ensemble more robust and less prone to overfitting.
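In scikit-learn, the size of this random subset is controlled by the `max_features` parameter. The sketch below uses a synthetic dataset from `make_classification` (an assumption for illustration; it is not the data used later in this notebook) and compares a forest that considers the square root of the feature count at each split with one that considers all features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 9 features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=4, random_state=0
)

# "sqrt": each split considers sqrt(n_features) randomly chosen features
forest_sqrt = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0
).fit(X_demo, y_demo)

# None: each split considers all features (bagged trees without feature sampling)
forest_all = RandomForestClassifier(
    n_estimators=50, max_features=None, random_state=0
).fit(X_demo, y_demo)

# max_features_ on a fitted tree is the resolved number of features per split
print("features per split (sqrt):", forest_sqrt.estimators_[0].max_features_)
print("features per split (None):", forest_all.estimators_[0].max_features_)
```

Smaller values of `max_features` make the trees less similar to each other, at the cost of each individual tree being somewhat weaker.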

#### Aggregation of Results

- For classification, the class with the most votes becomes the model's prediction.
- For regression, the prediction is the average of the outputs of the individual trees.
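The two aggregation rules above can be sketched with hand-written, hypothetical tree outputs (the arrays below are made up for illustration and do not come from a fitted model). Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this shows the idea, not the library's exact procedure.

```python
import numpy as np

# Hypothetical class predictions: 5 trees (rows) for 4 samples (columns)
tree_class_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

# Classification: majority vote over the trees, one sample (column) at a time
majority_vote = np.array(
    [np.bincount(sample_votes).argmax() for sample_votes in tree_class_preds.T]
)
print("majority vote:", majority_vote)

# Hypothetical regression outputs: 3 trees (rows) for 2 samples (columns)
tree_reg_preds = np.array([
    [2.0, 3.0],
    [2.5, 3.5],
    [3.0, 2.5],
])

# Regression: average the tree outputs for each sample
mean_prediction = tree_reg_preds.mean(axis=0)
print("mean prediction:", mean_prediction)
```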

#### Key Features

- Robustness to overfitting.
- Handles non-linear relationships.
- Provides feature importance scores.
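As a sketch of the feature importance point, a fitted scikit-learn forest exposes impurity-based scores through its `feature_importances_` attribute. The synthetic dataset and the `feature_0` ... `feature_4` names below are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, only 2 of which carry signal (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, normalized so that they sum to 1
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best treated as a rough ranking rather than a precise measurement.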

![Single decision tree](./images/decision_treee_single.png)

![Random Forest](./images/random_forest.png)

![Random Forest](./images/random_forest_1.png)

____

In [7]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [8]:
# load the data
df = sns.load_dataset('tips')
df.head()

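|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |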


In [9]:
# label-encode every column whose dtype is object or category
le = LabelEncoder()
for i in df.columns:
    if df[i].dtype == 'object' or df[i].dtype == 'category':
        df[i] = le.fit_transform(df[i])
df.head()

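|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.5 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |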


In [10]:
# split the data into X and y for classification
X = df.drop('sex', axis = 1)
y = df['sex']
# train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
# create the model, train it, and predict on the test set
model_cl = RandomForestClassifier(n_estimators=200, random_state=42)
model_cl.fit(X_train, y_train)
y_pred = model_cl.predict(X_test)

# evaluate the model
print('accuracy score: ', accuracy_score(y_test, y_pred))
print('confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('classification report:\n', classification_report(y_test, y_pred))

accuracy score:  0.6122448979591837
confusion matrix:
 [[ 7 12]
 [ 7 23]]
classification report:
               precision    recall  f1-score   support

           0       0.50      0.37      0.42        19
           1       0.66      0.77      0.71        30

    accuracy                           0.61        49
   macro avg       0.58      0.57      0.57        49
weighted avg       0.60      0.61      0.60        49



In [11]:
# use random forest for a regression task (predict the tip amount)
X = df.drop('tip', axis = 1)
y = df['tip']

# train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

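# create the model, train it, and predict on the test set
# note: no random_state is set here, so the metrics below vary slightly between runs
model_reg = RandomForestRegressor()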
model_reg.fit(X_train, y_train)
y_pred = model_reg.predict(X_test)

# evaluate the model
print('mean squared error: ', mean_squared_error(y_test, y_pred))
print('mean absolute error: ', mean_absolute_error(y_test, y_pred))
print('r2 score: ', r2_score(y_test, y_pred))
print('root mean squared error: ', np.sqrt(mean_squared_error(y_test, y_pred)))

mean squared error:  0.9559056226530626
mean absolute error:  0.7809285714285716
r2 score:  0.2352579710981786
root mean squared error:  0.9777042613454554
