# Random Forest Algorithm

Machine learning, a fascinating blend of computer science and statistics, has witnessed incredible progress, with one standout algorithm being the Random Forest. A Random Forest (also called a random decision forest) is a collaborative team of decision trees that work together to provide a single output. Introduced by Leo Breiman in 2001, Random Forest has become a cornerstone for machine learning enthusiasts. In this article, we will explore the fundamentals and implementation of the Random Forest algorithm.

## What is the Random Forest Algorithm?

The Random Forest algorithm is a powerful tree-based learning technique in Machine Learning. It works by creating a number of Decision Trees during the training phase. Each tree is built on a random sample of the data set, and each split within a tree considers only a random subset of the features. This randomness introduces variability among individual trees, reducing the risk of overfitting and improving overall prediction performance.

In prediction, the algorithm aggregates the results of all trees, either by voting (for classification tasks) or by averaging (for regression tasks). This collaborative decision-making process, supported by multiple trees with their insights, provides stable and precise results. Random forests are widely used for classification and regression tasks and are known for their ability to handle complex data, reduce overfitting, and provide reliable forecasts in different environments.

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20240701170624/Random-Forest-Algorithm.webp">

## What are Ensemble Learning models?
Ensemble learning models work just like a group of diverse experts teaming up to make decisions. Picture a group of friends with different skills working on a project: each friend excels in a particular area, and by combining their strengths, they create a more robust solution than any individual could achieve alone.

Similarly, in ensemble learning, several models, either of the same type or of different types, team up to enhance predictive performance. It's all about leveraging the collective wisdom of the group to overcome individual limitations and make more informed decisions in various machine learning tasks. Some popular ensemble methods include XGBoost, AdaBoost, LightGBM, Random Forest, Bagging, and Voting.

## What is Bagging and Boosting?
Bagging (bootstrap aggregating) is an ensemble learning technique in which multiple weak models are trained on different subsets of the training data. Each subset is sampled with replacement, and the final prediction is made by averaging the weak models' predictions for regression problems or by taking a majority vote for classification problems.

Boosting trains multiple base models sequentially. Each model tries to correct the errors made by the previous models: it is trained on a modified version of the dataset in which the instances misclassified by the previous models are given more weight. The final prediction is made by weighted voting.
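
To make the contrast concrete, here is a minimal sketch that fits one bagging ensemble and one boosting ensemble with scikit-learn. The synthetic dataset, the seeds, and the choice of 50 estimators are assumptions made only for illustration; they do not come from this article's experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration only)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Bagging: each base model (a decision tree by default) is trained
# independently on its own bootstrap sample of the training data.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(Xt_train, yt_train)

# Boosting: base models are trained one after another, and each round
# gives more weight to the instances the previous rounds got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(Xt_train, yt_train)

bagging_acc = bagging.score(Xt_test, yt_test)
boosting_acc = boosting.score(Xt_test, yt_test)
print(f"Bagging accuracy:  {bagging_acc:.3f}")
print(f"Boosting accuracy: {boosting_acc:.3f}")
```

Which of the two scores higher depends on the data, so the printed numbers should not be read as a general ranking of the two techniques.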

How the Random Forest algorithm works:

* Step 1: Choose the number N of decision trees that you want to build.
* Step 2: Select K random data points (with replacement) from the training set.
* Step 3: Build a decision tree on the selected data points (the subset).
* Step 4: Repeat Steps 2 and 3 until N trees have been built.
* Step 5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes (for regression, average the predictions instead).
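
The steps listed here can be sketched from scratch in a few lines. This is a simplified illustration, not how any library implements Random Forest internally: the synthetic data, the value of `n_trees`, and the helper names `trees`, `all_votes`, and `majority` are all assumptions introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
n_trees = 25  # the number N of trees to build

trees = []
for _ in range(n_trees):
    # Draw a bootstrap sample: K = len(X_toy) row indices, with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    # Build a tree on the sample; max_features="sqrt" makes each split
    # consider only a random subset of the features
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X_toy[idx], y_toy[idx]))

# Collect one vote per tree for the first five points ...
all_votes = np.stack([t.predict(X_toy[:5]) for t in trees])  # shape (n_trees, 5)
# ... and assign each point to the class with the most votes
majority = np.array([np.bincount(col).argmax() for col in all_votes.T])
print("Majority-vote predictions:", majority)
```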

## How Does Random Forest Work?
The Random Forest algorithm works in several steps, which are discussed below:

* Ensemble of Decision Trees: Random Forest leverages the power of ensemble learning by constructing an army of Decision Trees. These trees are like individual experts, each specializing in a particular aspect of the data. Importantly, they operate independently, minimizing the risk of the model being overly influenced by the nuances of a single tree.
* Random Feature Selection: To ensure that each decision tree in the ensemble brings a unique perspective, Random Forest employs random feature selection. While a tree is being trained, only a random subset of the features is considered as candidates at each split. This randomness ensures that the trees focus on different aspects of the data, fostering a diverse set of predictors within the ensemble.
* Bootstrap Aggregating or Bagging: The technique of bagging is a cornerstone of Random Forest's training strategy which involves creating multiple bootstrap samples from the original dataset, allowing instances to be sampled with replacement. This results in different subsets of data for each decision tree, introducing variability in the training process and making the model more robust.
* Decision Making and Voting: When it comes to making predictions, each decision tree in the Random Forest casts its vote. For classification tasks, the final prediction is determined by the mode (most frequent prediction) across all the trees. In regression tasks, the average of the individual tree predictions is taken. This internal voting mechanism ensures a balanced and collective decision-making process.
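
The aggregation step can be checked directly in scikit-learn, because a fitted forest exposes its individual trees through `estimators_`. The sketch below uses synthetic data (an assumption for illustration). Note that scikit-learn's classifier averages the per-tree class probabilities and picks the most probable class, which is a soft variant of the majority vote described here.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Regression: the forest prediction is the mean of the per-tree predictions
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(Xr, yr)
manual_reg = np.mean([tree.predict(Xr) for tree in reg.estimators_], axis=0)
reg_matches = np.allclose(manual_reg, reg.predict(Xr))

# Classification: average the per-tree class probabilities, then take
# the class with the highest averaged probability
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(Xc, yc)
manual_proba = np.mean([tree.predict_proba(Xc) for tree in clf.estimators_], axis=0)
manual_clf = clf.classes_[manual_proba.argmax(axis=1)]
clf_matches = np.array_equal(manual_clf, clf.predict(Xc))

print("Regression aggregate matches forest:    ", reg_matches)
print("Classification aggregate matches forest:", clf_matches)
```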

## Key Features of Random Forest
Some of the key features of Random Forest are discussed below:

* High Predictive Accuracy: Imagine Random Forest as a team of decision-making wizards. Each wizard (decision tree) looks at a part of the problem, and together, they weave their insights into a powerful prediction tapestry. This teamwork often results in a more accurate model than what a single wizard could achieve.
* Resistance to Overfitting: Random Forest is like a cool-headed mentor guiding its apprentices (decision trees). Instead of letting each apprentice memorize every detail of their training, it encourages a more well-rounded understanding. This approach helps prevent getting too caught up with the training data which makes the model less prone to overfitting.
* Large Datasets Handling: Dealing with a mountain of data? Random Forest tackles it like a seasoned explorer with a team of helpers (decision trees). Each helper takes on a part of the dataset, ensuring that the expedition is not only thorough but also surprisingly quick.
* Variable Importance Assessment: Think of Random Forest as a detective at a crime scene, figuring out which clues (features) matter the most. It assesses the importance of each clue in solving the case, helping you focus on the key elements that drive predictions.
* Built-in Validation (Out-of-Bag): Random Forest is like having a personal coach that keeps you in check. Because each decision tree is trained on a bootstrap sample, a group of cases is left out of that tree's training data (the out-of-bag cases) and can be used to test it. This built-in check works much like cross-validation and helps ensure your model doesn't just ace the training but also performs well on new challenges.
* Handling Missing Values: Life is full of uncertainties, just like datasets with missing values. Depending on the implementation, Random Forest can be the friend who adapts to the situation, making predictions using the information available. Some libraries, however, expect missing values to be imputed first, so check what yours supports.
* Parallelization for Speed: Random Forest is your time-saving buddy. Picture each decision tree as a worker tackling a piece of a puzzle simultaneously. This parallel approach taps into the power of modern tech, making the whole process faster and more efficient for handling large-scale projects.
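
Three of these features map onto scikit-learn options and attributes: `oob_score=True` for the out-of-bag estimate, `feature_importances_` for variable importance, and `n_jobs` for parallel training. The following is a minimal sketch on synthetic data; the dataset and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration only)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=3, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,  # score each sample using only trees that did not train on it
    n_jobs=-1,       # build the trees in parallel on all available cores
    random_state=0,
).fit(X_toy, y_toy)

print("Out-of-bag accuracy estimate:", round(forest.oob_score_, 3))

# Impurity-based importances, one value per feature, summing to 1
ranked = sorted(
    enumerate(forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for feature_index, importance in ranked:
    print(f"feature {feature_index}: {importance:.3f}")
```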

## Random Forest vs. Other Machine Learning Algorithms
Some of the key differences are summarized in the table below.

| **Feature**                | **Random Forest**                                                                                                                                              | **Other ML Algorithms**                                                                                                                                           |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Ensemble Approach**       | Utilizes an ensemble of decision trees, combining their outputs for predictions, fostering robustness and accuracy.                                             | Typically relies on a single model (e.g., linear regression, support vector machine) without the ensemble approach, potentially leading to less resilience against noise. |
| **Overfitting Resistance**  | Resistant to overfitting due to the aggregation of diverse decision trees, preventing memorization of training data.                                            | Some algorithms may be prone to overfitting, especially when dealing with complex datasets, as they may excessively adapt to training noise.                      |
| **Handling of Missing Data**| Exhibits resilience in handling missing values by leveraging available features for predictions, contributing to practicality in real-world scenarios.         | Other algorithms may require imputation or elimination of missing data, potentially impacting model training and performance.                                     |
| **Variable Importance**     | Provides a built-in mechanism for assessing variable importance, aiding in feature selection and interpretation of influential factors.                        | Many algorithms may lack an explicit feature importance assessment, making it challenging to identify crucial variables for predictions.                         |
| **Parallelization Potential**| Capitalizes on parallelization, enabling the simultaneous training of decision trees, resulting in faster computation for large datasets.                     | Some algorithms may have limited parallelization capabilities, potentially leading to longer training times for extensive datasets.                               |


## Applications of Random Forest in Real-World Scenarios
Some widely used real-world applications of Random Forest are discussed below:

* Finance Wizard: Imagine Random Forest as our financial superhero, diving into the world of credit scoring. Its mission? To determine if you're a credit superhero or, well, not so much. With a knack for handling financial data and sidestepping overfitting issues, it's like having a guardian angel for robust risk assessments.
* Health Detective: In healthcare, Random Forest turns into a medical Sherlock Holmes. Armed with the ability to decode medical jargon, patient records, and test results, it's not just predicting outcomes; it's practically assisting doctors in solving the mysteries of patient health.
* Environmental Guardian: Out in nature, Random Forest transforms into an environmental superhero. With the power to decipher satellite images and brave noisy data, it becomes the go-to hero for tasks like tracking land cover changes and safeguarding against potential deforestation, standing as the protector of our green spaces.
* Digital Bodyguard: In the digital realm, Random Forest becomes our vigilant guardian against online trickery. It's like a cyber-sleuth, analyzing our digital footsteps for any hint of suspicious activity. Its ensemble approach is akin to having a team of cyber-detectives, spotting subtle deviations that scream "fraud alert!" It's not just protecting our online transactions; it's our digital bodyguard.

## Preparing Data for Random Forest Modeling
For Random Forest modeling, some key steps of data preparation are discussed below:

* Handling Missing Values: Begin by addressing any missing values in the dataset. Techniques like imputation or removal of instances with missing values ensure a complete and reliable input for Random Forest.
* Encoding Categorical Variables: Most Random Forest implementations, including scikit-learn's, require numerical inputs, so categorical variables need to be encoded. Techniques like one-hot encoding or label encoding transform categorical features into a format suitable for the algorithm.
* Scaling and Normalization: Random Forest is not sensitive to feature scaling, because each split compares a single feature against a threshold. Scaling or normalizing numerical features is therefore usually unnecessary, although it does no harm if the same data is also fed to scale-sensitive models.
* Feature Selection: Assess the importance of features within the dataset. Random Forest inherently provides a feature importance score, aiding in the selection of relevant features for model training.
* Addressing Imbalanced Data: If dealing with imbalanced classes, implement techniques like adjusting class weights or employing resampling methods to ensure a balanced representation during training.
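
Several of these preparation steps can be bundled into a single scikit-learn pipeline. The sketch below is one possible arrangement under stated assumptions: the tiny `toy` DataFrame and its column names are hypothetical, and median imputation, one-hot encoding, and `class_weight="balanced"` are just one reasonable choice for each step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values and a categorical column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, 41.0, np.nan, 19.0, 63.0],
    "fare": [7.25, 71.3, 8.05, 26.6, 13.0, 30.1, 7.9, 80.0],
    "sex": ["male", "female", "male", "female", "male", "female", "male", "female"],
    "label": [0, 1, 0, 1, 0, 1, 0, 0],
})
features = toy.drop(columns="label")
target = toy["label"]

preprocess = ColumnTransformer([
    # Handling missing values: fill numeric gaps with the column median
    ("num", SimpleImputer(strategy="median"), ["age", "fare"]),
    # Encoding categorical variables: one-hot encode the text column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

model = Pipeline([
    ("prep", preprocess),
    # Addressing imbalanced data: weight classes inversely to their frequency
    ("rf", RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    )),
])

model.fit(features, target)
preds = model.predict(features)
print(preds)
```

Keeping the preprocessing inside the pipeline means the same imputation and encoding are applied automatically whenever the model is used on new data.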

# Implement Random Forest for Classification

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)

In [6]:
titanic_data.head()

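|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |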


In [7]:
# Drop rows with missing target values
titanic_data = titanic_data.dropna(subset=['Survived'])

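# Select relevant features and target variable
# (.copy() gives an independent DataFrame, so the edits below are safe)
X = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
y = titanic_data['Survived']

In [8]:
# Encode the categorical 'Sex' column as integers
X['Sex'] = X['Sex'].map({'female': 0, 'male': 1})

In [9]:
# Fill missing ages with the median age (assign back instead of using inplace)
X['Age'] = X['Age'].fillna(X['Age'].median())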

In [10]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Key Points About n_estimators:
* Ensemble Size: The random forest is an ensemble method that aggregates the predictions of multiple decision trees. The n_estimators parameter defines how many individual decision trees will be trained and included in this ensemble. A larger number of trees typically increases the model's robustness and reduces variance, but it also increases computational cost.
* Impact on Performance: More trees generally lead to better performance up to a certain point, as the model can better average out noise in the data. However, beyond a certain threshold, adding more trees may yield diminishing returns in terms of accuracy improvement.
* Computational Trade-off: Increasing n_estimators leads to higher computation and memory usage since more trees need to be trained and stored. For large datasets or real-time applications, choosing an optimal number of trees is critical to balance performance and efficiency.
* Deterministic Behavior: The random_state parameter ensures reproducibility, so the same set of trees will be generated for the same data if the same seed is provided.
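
A quick way to see these points is to fit forests of increasing size and compare their test accuracy. The sketch below deliberately uses a synthetic dataset and separate variable names (`Xd_train`, `yd_train`, and so on) so that it does not interfere with the Titanic variables; both the data and the list of tree counts are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Accuracy as a function of the number of trees
scores = {}
for n_trees in [1, 10, 50, 100, 200]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(Xd_train, yd_train)
    scores[n_trees] = model.score(Xd_test, yd_test)
    print(f"n_estimators={n_trees:>3}  test accuracy={scores[n_trees]:.3f}")

# Same data and same random_state give the same forest,
# and therefore identical predictions
first = RandomForestClassifier(n_estimators=10, random_state=42).fit(Xd_train, yd_train)
second = RandomForestClassifier(n_estimators=10, random_state=42).fit(Xd_train, yd_train)
same_predictions = bool((first.predict(Xd_test) == second.predict(Xd_test)).all())
print("Reproducible with a fixed seed:", same_predictions)
```

On most datasets the accuracy rises quickly over the first few dozen trees and then levels off, while training time keeps growing roughly in proportion to the number of trees.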

In [11]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

* Here, 100 decision trees will be trained as part of the random forest.
* Predictions from these 100 trees will be aggregated (e.g., using majority voting for classification or averaging for regression) to make the final prediction.

In [12]:
rf_classifier.fit(X_train, y_train)

In [13]:
y_pred = rf_classifier.predict(X_test)

In [14]:
accuracy = accuracy_score(y_test, y_pred)

In [15]:
classification_rep = classification_report(y_test, y_pred)

In [16]:
print(classification_rep)

              precision    recall  f1-score   support

           0       0.82      0.85      0.83       105
           1       0.77      0.73      0.75        74

    accuracy                           0.80       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179



### Classification Report Explanation
Per-Class Metrics:
* Precision: The percentage of correctly predicted instances for a specific class out of all instances predicted as that class. Example: For class 0, a precision of 0.82 means that 82% of the predictions for class 0 are correct.
* Recall: The percentage of actual instances of a specific class that were correctly predicted. Example: For class 0, a recall of 0.85 means that 85% of the actual class 0 instances were identified correctly.
* F1-Score: The harmonic mean of precision and recall, providing a single measure of the model's performance for each class. Example: For class 0, the F1-score is 0.83, which reflects its similar precision (0.82) and recall (0.85).
* Support: The number of actual instances of each class in the test set. Example: There are 105 instances of class 0 and 74 instances of class 1.
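
To see where these numbers come from, the metrics can be recomputed by hand from a confusion matrix. The labels below are a small made-up example (not the Titanic predictions), and the variable names are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up true and predicted labels (illustration only)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Metrics for class 1, computed from their definitions
precision_1 = tp / (tp + fp)   # correct class-1 predictions / all class-1 predictions
recall_1 = tp / (tp + fn)      # correct class-1 predictions / all actual class-1 cases
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)  # harmonic mean
support_1 = tp + fn            # number of actual class-1 cases

# The same quantities as reported by scikit-learn
sk_p, sk_r, sk_f, sk_s = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, labels=[1]
)
print("manual :", precision_1, recall_1, f1_1, support_1)
print("sklearn:", sk_p[0], sk_r[0], sk_f[0], sk_s[0])
```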

In [17]:
accuracy

0.7988826815642458

In [35]:
# Example: New passenger data (make sure it matches the features used during training)
new_passenger_data = pd.DataFrame({
    'Pclass': [3, 1],
    'Sex': [1, 0],  # 1 for male, 0 for female
    'Age': [25, 38],
    'SibSp': [0, 1],
    'Parch': [0, 0],
    'Fare': [7.25, 71.2833]
})

# Make predictions for the new passengers
new_predictions = rf_classifier.predict(new_passenger_data)

# Display the predictions
new_passenger_data['Predicted_Survived'] = new_predictions
print("\nPredictions for New Passengers:")
new_passenger_data



Predictions for New Passengers:


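|   | Pclass | Sex | Age | SibSp | Parch | Fare | Predicted_Survived |
|---|--------|-----|-----|-------|-------|------|--------------------|
| 0 | 3 | 1 | 25 | 0 | 0 | 7.25 | 0 |
| 1 | 1 | 0 | 38 | 1 | 0 | 71.2833 | 1 |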


# Implement Random Forest for Regression

In [18]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [19]:
california_housing = fetch_california_housing()

In [20]:
california_data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)

In [21]:
# Add the target (median house value, in units of $100,000) as the 'MEDV' column
california_data['MEDV'] = california_housing.target

In [22]:
X = california_data.drop('MEDV', axis=1)
y = california_data['MEDV']

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
rf_regressor = RandomForestRegressor(n_estimators=100,random_state=42)

In [25]:
rf_regressor.fit(X_train, y_train)

In [26]:
y_pred = rf_regressor.predict(X_test)

In [27]:
mse = mean_squared_error(y_test, y_pred)

In [28]:
r2 = r2_score(y_test, y_pred)

In [29]:
mse

0.2553684927247781

In [30]:
r2

0.8051230593157366
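
The two metrics reported here follow from simple formulas: MSE is the mean of the squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes both on a few made-up values (assumptions for illustration, not the California housing predictions) and compares them with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values (illustration only)
actual = np.array([0.5, 1.5, 2.0, 3.0])
predicted = np.array([0.7, 1.2, 2.1, 2.6])

# MSE: average squared difference between actual and predicted values
mse_manual = np.mean((actual - predicted) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE manual vs sklearn:", mse_manual, mean_squared_error(actual, predicted))
print("R2  manual vs sklearn:", r2_manual, r2_score(actual, predicted))
```

Because the California housing target is expressed in units of $100,000, taking the square root of the MSE gives an error in the same units, which is often easier to interpret than the MSE itself.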

In [31]:
# Example: Making predictions for the test set
predictions = rf_regressor.predict(X_test)

# Display a few predictions alongside actual values
prediction_results = pd.DataFrame({'Actual': y_test.values, 'Predicted': predictions})
print(prediction_results.head())


    Actual  Predicted
0  0.47700   0.509500
1  0.45800   0.741610
2  5.00001   4.923257
3  2.18600   2.529610
4  2.78000   2.273690


In [32]:
# Example: New data (ensure it has the same feature structure as the training data)
new_data = pd.DataFrame({
    'MedInc': [8.3],
    'HouseAge': [15],
    'AveRooms': [6.0],
    'AveBedrms': [1.2],
    'Population': [1000],
    'AveOccup': [3.5],
    'Latitude': [34.0],
    'Longitude': [-118.0]
})

# Make predictions for the new data
new_predictions = rf_regressor.predict(new_data)

# Display the predictions
print("Predicted MEDV for new data:", new_predictions)


Predicted MEDV for new data: [4.4338958]


## Assumptions for Random Forest:
Since the random forest combines multiple trees to predict the class of the dataset, some decision trees may predict the correct output while others may not. Taken together, however, the trees are more likely to predict the correct output than any single tree. Therefore, below are two assumptions for a better Random Forest classifier:

* The feature variables of the dataset should contain actual, informative values, so that the classifier can predict accurate results rather than guessed results.
* The predictions from the individual trees should have very low correlations with one another.
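
The reason low correlation matters can be stated with a standard result. Assume, as a simplification, that the forest averages N tree predictions T_1, ..., T_N that each have variance σ² and pairwise correlation ρ (these symbols are introduced here for illustration). The variance of the averaged prediction is then:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} T_i\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{N}\,\sigma^{2}
```

Adding more trees shrinks only the second term. The first term, ρσ², remains no matter how many trees are built, so the less correlated the trees are, the more the ensemble gains from averaging.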

### Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:

* It often takes less training time than other algorithms of comparable accuracy, since the trees can be built in parallel.
* It predicts output with high accuracy and runs efficiently even on large datasets.
* It can often maintain accuracy even when a large proportion of the data is missing.




### Overcoming Challenges in Random Forest Modeling
To use the Random Forest algorithm efficiently in real-world applications, we need to overcome some potential challenges, which are discussed below:

1. Addressing Overfitting: Taming the tendency of individual decision trees to overfit remains a challenge. Strategies like tuning hyperparameters, adjusting tree depth and implementing feature selection techniques are crucial for striking the right balance between complexity and generalization.
2. Optimizing Computational Resources: Random Forest's efficiency in handling large datasets can sometimes be a double-edged sword, demanding substantial computational resources. Implementing parallelization techniques and exploring optimized algorithms are key steps in overcoming computational challenges and ensuring scalability.
3. Dealing with imbalanced data: When imbalanced data sets are encountered, where one class is significantly more frequent than the other, the random forest may skew toward the majority class. Reducing this bias includes strategies like adjusting class weights, oversampling the minority class, or using algorithms designed for imbalanced data.
4. Interpreting complex models: Although random forests provide strong predictions, interpreting the model's decision process can be complicated by its ensemble nature. Methods such as feature importance analysis, partial dependence plots, and model-agnostic interpretability methods are used to improve model interpretation.
5. Handling noisy data: The resilience of random forests to noisy data is a strength, but can still be a challenge in high-noise situations. Careful data preprocessing techniques, outlier identification, and feature engineering are required to ensure model accuracy and reliability.
6. Managing Memory Usage: As Random Forest constructs numerous decision trees during training, managing memory usage becomes critical. Fine-tuning parameters like the number of trees, tree depth, and the size of feature subsets can help strike a balance between model performance and memory efficiency.
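
As a rough illustration of how hyperparameters affect model size, the sketch below counts the total number of tree nodes in an unconstrained forest and in one with limited depth and larger leaves. The synthetic data, the helper `total_nodes`, and the specific values of `max_depth` and `min_samples_leaf` are assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration only)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

def total_nodes(forest):
    """Sum of node counts over all trees, a rough proxy for memory use."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)

# Default settings: trees grow until their leaves are pure
unconstrained = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Shallower trees with larger leaves: smaller model, less prone to overfitting
constrained = RandomForestClassifier(
    n_estimators=50,
    max_depth=5,
    min_samples_leaf=5,
    n_jobs=-1,  # train trees in parallel to reduce wall-clock time
    random_state=0,
).fit(X_toy, y_toy)

nodes_unconstrained = total_nodes(unconstrained)
nodes_constrained = total_nodes(constrained)
print("Nodes without limits:", nodes_unconstrained)
print("Nodes with limits:   ", nodes_constrained)
```

Whether the smaller forest is also more accurate depends on the data, so settings like these are best chosen with cross-validation or the out-of-bag score.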