1. Model Selection:
Choose Algorithm: Select an appropriate machine learning algorithm based on the nature of the problem (classification, regression, clustering, etc.), the size of the dataset, and other factors.

2. Model Building:
Instantiate Model: Create an instance of the chosen machine learning algorithm.

Fit Model: Train the model on the training data by calling the fit() method. During training, the model learns the patterns and relationships present in the data.
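The two steps above can be sketched on a small synthetic dataset. This is only an illustration: `X_demo`, `y_demo` and `demo_model` are hypothetical names, and the data from `make_classification` stands in for the project's real training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; the notebook loads its own X_train / y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Step 1: instantiate the chosen algorithm
demo_model = RandomForestClassifier(n_estimators=50, random_state=0)

# Step 2: fit the model on the training data
demo_model.fit(X_demo, y_demo)

# After fitting, the estimator exposes learned attributes
print(demo_model.classes_)           # class labels seen during fit
print(len(demo_model.estimators_))   # number of fitted trees
```

Attributes ending in an underscore, such as `classes_` and `estimators_`, only exist after `fit()` has been called.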

Random Forest:

A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control overfitting. Trees in the forest use the best-split strategy, which is equivalent to passing splitter="best" to the underlying DecisionTreeClassifier. When bootstrap=True (the default), the max_samples parameter controls the sub-sample size; otherwise the whole dataset is used to build each tree.
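As a rough illustration of how `bootstrap` and `max_samples` interact, the sketch below fits two small forests on synthetic data. The dataset and the names `subsampled` and `full_data` are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the project's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# bootstrap=True (the default): each tree is built on a bootstrap sample.
# max_samples=0.5 draws half as many rows as the dataset has, with replacement.
subsampled = RandomForestClassifier(
    n_estimators=20, bootstrap=True, max_samples=0.5, random_state=0
).fit(X_demo, y_demo)

# bootstrap=False: every tree is built on the whole dataset,
# so max_samples is left at its default of None.
full_data = RandomForestClassifier(
    n_estimators=20, bootstrap=False, random_state=0
).fit(X_demo, y_demo)

print(subsampled.max_samples, full_data.bootstrap)
```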

Overview:
Ensemble Learning: Random Forest is an ensemble learning method, meaning it combines the predictions of multiple individual models (decision trees) to produce a more accurate and robust final prediction.

Decision Trees: Random Forest is built upon a collection of decision trees. Each tree is trained independently on a random sample of the training rows and considers only a random subset of the features at each split.

Bagging (Bootstrap Aggregating): Random Forest employs bagging, a technique that involves training each decision tree on a random bootstrap sample (sampling with replacement) from the original dataset. This helps in reducing variance and overfitting.
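A bootstrap sample can be sketched with the standard library alone. The row count and variable names below are hypothetical; the point is only to show what "sampling with replacement" does to the set of rows a single tree sees.

```python
import random

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_rows = 1000                   # hypothetical dataset size
row_indices = list(range(n_rows))

# Sampling with replacement: same size as the original, duplicates allowed
bootstrap_indices = rng.choices(row_indices, k=n_rows)

# Rows that were never drawn are "out-of-bag" for this tree
in_bag = set(bootstrap_indices)
out_of_bag = [i for i in row_indices if i not in in_bag]

unique_fraction = len(in_bag) / n_rows
print(f"distinct rows in the bootstrap sample: {unique_fraction:.2f}")
```

In expectation, a bootstrap sample of the same size as the dataset contains about 63% of the distinct rows (1 - 1/e), so each tree is trained on a noticeably different view of the data.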

Random Feature Selection: At each node of the decision tree, a random subset of features is considered for splitting, rather than using all features. This further adds randomness to the model and helps in decorrelating the trees.
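The per-node feature draw can be sketched as follows. This is a simplified illustration, not scikit-learn's internal code: the feature names, the helper `candidate_features`, and the square-root rule for the subset size are assumptions chosen for the example.

```python
import math
import random

rng = random.Random(42)
feature_names = [f"f{i}" for i in range(16)]   # hypothetical feature names
k = int(math.sqrt(len(feature_names)))         # square-root rule: 4 of 16 features

def candidate_features(rng, features, k):
    """Draw the features one node may consider for its split (without replacement)."""
    return rng.sample(features, k)

# Two different nodes usually get two different candidate sets
node_a = candidate_features(rng, feature_names, k)
node_b = candidate_features(rng, feature_names, k)
print(node_a)
print(node_b)
```

Because each node only sees a few candidates, a single dominant feature cannot be chosen at the top of every tree, which is what decorrelates the trees.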

How it works:
Training Phase:

Random Forest builds a specified number of decision trees (controlled by the n_estimators parameter) using the bootstrapped samples of the training data.
At each node of each tree, a random subset of features (controlled by the max_features parameter) is considered for splitting.
Each tree keeps splitting its nodes to reduce impurity (e.g., Gini impurity for classification) or, equivalently, to maximize information gain, until a stopping criterion is met (e.g., the maximum depth of the tree or the minimum number of samples required to split a node).
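The training-phase parameters described above map onto constructor arguments roughly as shown below. The synthetic data and the specific values are assumptions for illustration, not the settings used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=1
)

forest = RandomForestClassifier(
    n_estimators=25,        # number of trees to build
    max_features="sqrt",    # features considered at each split
    criterion="gini",       # impurity measure to reduce
    max_depth=4,            # stopping criterion: maximum tree depth
    min_samples_split=10,   # stopping criterion: minimum samples to split a node
    random_state=1,
)
forest.fit(X_demo, y_demo)

# Every fitted tree respects the depth limit
tree_depths = [tree.get_depth() for tree in forest.estimators_]
print(len(forest.estimators_), max(tree_depths))
```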
Prediction Phase:

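During prediction, each decision tree in the forest independently produces a prediction for the input data point.
For classification tasks, the final prediction is typically the majority vote (mode) of the individual trees' predictions; scikit-learn's RandomForestClassifier instead averages the trees' predicted class probabilities and picks the class with the highest mean. For regression tasks, the final prediction is the average of the trees' predictions.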
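To make the aggregation step concrete, the sketch below combines the per-tree outputs by hand and compares the result with the forest's own prediction. It assumes scikit-learn's RandomForestClassifier, which averages per-tree class probabilities; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=2)
forest = RandomForestClassifier(n_estimators=30, random_state=2).fit(X_demo, y_demo)

sample = X_demo[:5]

# Per-tree class probabilities, shape (n_trees, n_samples, n_classes)
per_tree_proba = np.stack([tree.predict_proba(sample) for tree in forest.estimators_])

# Average over the trees, then pick the class with the highest mean probability
averaged_proba = per_tree_proba.mean(axis=0)
manual_labels = forest.classes_[averaged_proba.argmax(axis=1)]

print(manual_labels)
print(forest.predict(sample))
```

When every tree is fully grown and outputs a hard 0/1 probability, averaging and majority voting select the same class; the two can differ when trees output fractional probabilities.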
Key Advantages:
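Robust to Overfitting: Random Forest is less prone to overfitting than an individual decision tree, because averaging many decorrelated trees reduces variance. Adding more trees generally does not make the forest overfit more; it mainly increases training and prediction time.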

Handles High-Dimensional Data: It performs well even with a large number of input features.

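Implicit Feature Selection: Because each split picks the best feature from a random candidate subset, informative features tend to be used far more often than irrelevant or redundant ones. In this sense Random Forest performs a form of implicit feature selection and tolerates noisy features reasonably well.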
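One way to inspect which features a fitted forest actually relied on is the `feature_importances_` attribute. The sketch below uses synthetic data in which only the first three columns carry signal; that layout and all variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative features in columns 0-2; the rest are noise
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]   # most important first

for idx in ranking[:5]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Impurity-based importances are a quick diagnostic rather than a definitive ranking; they can be biased toward features with many distinct values.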

Scalability: It can efficiently handle large datasets and is highly parallelizable, making it suitable for distributed computing environments.
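Because the trees are independent, scikit-learn can build them in parallel on one machine via the `n_jobs` parameter. The following is a minimal sketch on synthetic data (an assumption); it also checks that the parallel and serial forests agree when given the same `random_state`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# n_jobs=1 builds the trees one after another
serial = RandomForestClassifier(n_estimators=40, n_jobs=1, random_state=0).fit(X_demo, y_demo)

# n_jobs=-1 uses all available CPU cores
parallel = RandomForestClassifier(n_estimators=40, n_jobs=-1, random_state=0).fit(X_demo, y_demo)

same_output = np.array_equal(serial.predict(X_demo), parallel.predict(X_demo))
print(same_output)
```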

Works Well Out-of-the-Box: Random Forest typically requires minimal hyperparameter tuning and is known for producing good results with default settings.


In [1]:
# Import required libraries [details are available in the README.md file]
from sklearn.ensemble import RandomForestClassifier
import os
import matplotlib.pyplot as plt
import random
import cv2
import pickle


In [2]:
# Get the current directory
current_dir = os.getcwd()

# Move one level up from the working directory
current_dir = os.path.dirname(current_dir)

# Move one more level up to reach the project root
parent_dir = os.path.dirname(current_dir)

# Print the project root directory
print("Parent Directory:", parent_dir)

Parent Directory: E:\upgrade_capston_project-main


In [3]:
preprocessed_data_dir = os.path.join(parent_dir, 'datasets', 'processed_dataset')

In [4]:
# Load the preprocessed training features (X_train) from file
with open(os.path.join(preprocessed_data_dir,'X_train.pkl'), 'rb') as f:
    X_train = pickle.load(f)

# Load y_train from file
with open(os.path.join(preprocessed_data_dir,'y_train.pkl'), 'rb') as f:
    y_train = pickle.load(f)


In [5]:
# Define a function to train a Random Forest classifier
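def train_random_forest(X_train, y_train):
    """Fit a RandomForestClassifier on the training data and return it."""
    # 1000 shallow trees (max depth 5), split quality measured by Gini impurity
    rf_classifier = RandomForestClassifier(n_estimators=1000, criterion='gini', max_depth=5)
    rf_classifier.fit(X_train, y_train)
    return rf_classifier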

In [6]:
# Train the Random Forest classifier
rf_classifier = train_random_forest(X_train, y_train)

In [7]:
model_dir = os.path.join(parent_dir, 'models')
os.makedirs(model_dir, exist_ok=True)  # make sure the target directory exists

In [8]:
# Save the trained model to a file
with open(os.path.join(model_dir,'randomForestClassifier.pkl'), 'wb') as f:
    pickle.dump(rf_classifier, f)
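A saved model can later be restored with `pickle.load`. The sketch below shows a full save-and-load round trip using a temporary directory and a small forest trained on synthetic data; both are assumptions so that the example runs without the project's files.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model (an assumption; the notebook's model is rf_classifier)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=3)
original = RandomForestClassifier(n_estimators=10, random_state=3).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "randomForestClassifier.pkl")

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(original, f)

    # Load it back into a new object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly like the original
same_predictions = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled models should be loaded with the same scikit-learn version that created them, and only from files you trust, since unpickling can execute arbitrary code.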