# Random Forests

In [None]:
# Import libraries
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Create toy data
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_redundant=0,
                           random_state=123, shuffle=False)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

The code above generates synthetic data with `make_classification`: 1,000 samples and 10 features, of which 5 are informative and the remaining 5 are random noise (`n_redundant=0`, and `n_repeated` defaults to 0). The `random_state` parameter makes the generated data reproducible. With `shuffle=False`, neither the samples nor the features are shuffled, so the informative features occupy the first five columns. `train_test_split` then divides the data into training (70%) and testing (30%) sets. No `random_state` is passed to `train_test_split`, so the split, and therefore the accuracies reported below, will vary from run to run.
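To make the split itself repeatable, a seed can be passed to `train_test_split` as well. The sketch below is a minimal illustration, not the notebook's own cell: the value `random_state=42` is an arbitrary assumption, and using it will produce accuracies that differ from the outputs shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same toy data as above. With shuffle=False the informative
# features are the first n_informative columns of X.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_redundant=0,
                           random_state=123, shuffle=False)

# random_state=42 is an assumed, arbitrary seed; it only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)  # (700, 10) (300, 10)
```

Rerunning this cell always yields the same 700 training rows and 300 test rows, which makes later model comparisons easier to interpret.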

In [None]:
# Create base model
base = DecisionTreeClassifier(max_depth=5)

# Create an ensemble model
ensemble = BaggingClassifier(estimator=base, n_estimators=100, random_state=7)

# Fit the base model on the training data
base.fit(X_train,y_train)

# Fit the ensemble model on the training data
ensemble.fit(X_train,y_train)

print("Accuracy base:",base.score(X_test, y_test))
print("Accuracy ensemble:",ensemble.score(X_test, y_test))

Accuracy base: 0.9166666666666666
Accuracy ensemble: 0.9566666666666667


In this code, a decision tree classifier with a maximum depth of 5 serves as the base model; capping the depth limits overfitting. The `BaggingClassifier` then builds an ensemble of 100 such trees, each trained on a bootstrap sample of the training data, and combines their predictions. Its `random_state` fixes the bootstrap sampling, so the ensemble can be reproduced for a given training set. `BaggingClassifier` fits its own copies of the base estimator, which is why `base` is fitted separately to serve as the single-tree baseline. Both models are then scored on the test data. In this run the ensemble reaches an accuracy of about 0.957 against about 0.917 for the single tree, which illustrates how bagging can outperform a single classifier by reducing variance.
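To make the bagging idea concrete, the following is a hand-rolled sketch of bootstrap aggregation. It is a simplification under stated assumptions, not scikit-learn's implementation: it uses a hard majority vote, whereas `BaggingClassifier` averages predicted probabilities when the base estimator provides them, and the seeds (`random_state=0` for the split, `7` for the sampler) are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_redundant=0,
                           random_state=123, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # assumed seed

rng = np.random.default_rng(7)  # assumed seed for the bootstrap sampling
n_trees = 100
votes = np.zeros((n_trees, len(X_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=5, random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    votes[i] = tree.predict(X_test)

# Labels are 0/1, so the column mean is the share of trees voting for class 1.
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
manual_accuracy = (bagged_pred == y_test).mean()
print("Accuracy manual bagging:", manual_accuracy)
```

Each tree sees a slightly different resampling of the training set, so individual errors tend to differ and partly cancel in the vote.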

### Feature importance scores

A fitted `RandomForestClassifier` in sklearn exposes a `feature_importances_` attribute, which holds an impurity-based importance score for each feature in the dataset.

In [None]:
forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(X_train, y_train)

feature_imp = pd.Series(forest.feature_importances_).sort_values(ascending=False)
feature_imp

0    0.367378
1    0.240483
2    0.117017
3    0.108648
4    0.064458
7    0.023878
8    0.022136
9    0.020229
6    0.019116
5    0.016657
dtype: float64


At the start of this notebook, we specified that this dataset has 10 features, of which 5 are informative. Because the data was generated with `shuffle=False`, the informative features are the first five columns. The importance scores reflect this: the classifier relied far more on features 0 to 4 (each scoring above 0.06) than on the noise features 5 to 9 (each scoring below 0.03). An advantage of investigating feature importance is that irrelevant features can be removed. Removing noise features reduces training time and can improve performance.
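Rather than hardcoding a column range, the columns to keep can be read off the importance scores. The sketch below is one possible way to do this, under assumptions that are not in the original notebook: a cut-off of `k = 5` (chosen to match `n_informative`) and a seeded split with `random_state=0`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_redundant=0,
                           random_state=123, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # assumed seed

# Fit on the training rows only, so the test set plays no part in feature selection.
forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(X_train, y_train)

importances = pd.Series(forest.feature_importances_)

k = 5  # assumed cut-off, matching n_informative
top_k = np.sort(importances.nlargest(k).index.to_numpy())

# Apply the same column selection to both splits.
X_train_sel = X_train[:, top_k]
X_test_sel = X_test[:, top_k]
print("Selected columns:", top_k)
print("Reduced shapes:", X_train_sel.shape, X_test_sel.shape)
```

Selecting by score keeps the code correct even if the feature columns are shuffled, at the cost of having to choose `k` or an importance threshold.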

In [None]:
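# Keep only the first five columns, the features with the highest importance scores
X = X[:, :5]

# Split again (a new random split, since no random_state is set) and retrain
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # 70% training and 30% test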
base.fit(X_train,y_train)
ensemble.fit(X_train,y_train)

print("Accuracy base:",base.score(X_test, y_test))
print("Accuracy ensemble:",ensemble.score(X_test, y_test))

Accuracy base: 0.91
Accuracy ensemble: 0.9366666666666666
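
In this run both accuracies are slightly lower than with all 10 features, but the two runs use different random train/test splits, so a single pair of numbers is a noisy comparison. One way to get a steadier estimate is cross-validation. The sketch below is an illustration only: the 5-fold `StratifiedKFold` with `random_state=0` is an assumption, not part of the original notebook, and no particular outcome is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_redundant=0,
                           random_state=123, shuffle=False)

# The base tree is passed positionally as the estimator to bag.
ensemble = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                             n_estimators=100, random_state=7)

# Assumed evaluation protocol: the same 5 shuffled, stratified folds for both feature sets.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores_all = cross_val_score(ensemble, X, y, cv=cv)
scores_top5 = cross_val_score(ensemble, X[:, :5], y, cv=cv)

print("All 10 features: %.3f +/- %.3f" % (scores_all.mean(), scores_all.std()))
print("First 5 features: %.3f +/- %.3f" % (scores_top5.mean(), scores_top5.std()))
```

Because both feature sets are scored on identical folds, any remaining difference between the two means is easier to attribute to the features rather than to the split.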
