In [105]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.utils import shuffle
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn import tree
from sklearn.metrics import r2_score

In [106]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [107]:
df_train.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

In [108]:
df_test.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

In [109]:
df_train = shuffle(df_train, random_state = 0)
df_test = shuffle(df_test, random_state = 0)

In [110]:
x_train = df_train.loc[:, 'battery_power' : "wifi"]
y_train = df_train.loc[:, 'price_range']

x_test = df_test.loc[:, 'battery_power' : "wifi"]
y_test = df_test.loc[:, 'price_range']

## BaggingClassifier

In [111]:
classifier = BaggingClassifier(n_estimators = 10, random_state = 0).fit(x_train,y_train)

In [112]:
y_predict = classifier.predict(x_test)

In [113]:
r2_score(y_test, y_predict)

0.9325204609988308

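The R² score of BaggingClassifier is 0.9325. Note that r2_score is a regression metric, so this figure is not the classification accuracy (the fraction of correctly predicted labels), which sklearn.metrics.accuracy_score would report.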
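To illustrate the difference between the two metrics, here is a minimal sketch that computes both for a bagged classifier. The data is a synthetic four-class problem from make_classification standing in for the price_range data (an assumption, since train.csv and test.csv are not reproduced here), so the numbers it produces are not comparable to the score above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class stand-in for the phone price data (assumption, not the real CSVs).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

bagging = BaggingClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)
pred = bagging.predict(X_te)

# Accuracy: fraction of test samples whose predicted label matches the true label.
acc = accuracy_score(y_te, pred)
manual_acc = np.mean(pred == y_te)

# R²: treats the class labels as numbers and measures squared error,
# so it generally differs from accuracy and can even be negative.
r2 = r2_score(y_te, pred)

print(f"accuracy = {acc:.4f}, R² = {r2:.4f}")
```

Because price_range is an ordered label (0 to 3), R² does carry some information here, but it should be reported as R² rather than as accuracy.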

We do not analyze feature importance here because BaggingClassifier does not expose a feature_importances_ attribute.
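Although the ensemble itself has no such attribute, each fitted tree inside it does. One possible workaround, sketched here on synthetic data (an assumption, not the notebook's CSVs), is to average the per-tree importances, using estimators_features_ to map each tree's columns back to the original feature indices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

bagging = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)

# The ensemble does not define feature_importances_ ...
has_attr = hasattr(bagging, "feature_importances_")

# ... but each fitted DecisionTreeClassifier in estimators_ does.
# estimators_features_ records which original columns each tree was given.
importances = np.zeros(X.shape[1])
for fitted_tree, cols in zip(bagging.estimators_, bagging.estimators_features_):
    np.add.at(importances, cols, fitted_tree.feature_importances_)
importances /= len(bagging.estimators_)

# Feature indices sorted from most to least important.
ranking = np.argsort(importances)[::-1]
print(ranking, importances[ranking])
```

This simple average weights every tree equally. It is an illustration of one approach rather than an official scikit-learn feature.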

## Random Forest Classifier

In [114]:
classifier = RandomForestClassifier(max_depth = 2, random_state = 0).fit(x_train, y_train)



In [115]:
y_predict = classifier.predict(x_test)

In [116]:
r2_score(y_test, y_predict)

0.7638216134959078

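The R² score of RandomForestClassifier is 0.7638 (again, this is not classification accuracy). Note that this forest is restricted to max_depth = 2, while the bagged trees above are grown without a depth limit, so the two scores are not a like-for-like comparison.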

In sklearn, if we bag decision trees with the default settings, every tree can still use all features at every split. In random forests, however, each split considers only a random subset of features. The RandomForestClassifier introduces randomness externally (relative to the individual tree fitting) via bagging, just as BaggingClassifier does.

However, it also injects randomness deep inside the tree construction procedure by sub-sampling the list of features that are candidates for splitting: a new random set of features is considered at each new split. This randomness is controlled via the max_features parameter of RandomForestClassifier. BaggingClassifier(base_estimator=DecisionTreeClassifier()) has no per-split equivalent: its own max_features parameter draws one feature subset per estimator rather than per split. (In scikit-learn 1.2 and later, the base_estimator argument is named estimator.)
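The per-split behaviour can be inspected on fitted trees through their max_features_ attribute. The sketch below, on synthetic data (an assumption), compares plain bagging, a random forest, and a bagging ensemble whose base tree is itself given max_features="sqrt", which is one way to approximate forest-style splitting inside BaggingClassifier. The base tree is passed positionally so the example does not depend on whether the keyword is called base_estimator or estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20 features (assumption).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Plain bagging: bootstrap samples, but every split may look at all features.
plain_bagging = BaggingClassifier(DecisionTreeClassifier(),
                                  n_estimators=10, random_state=0).fit(X, y)

# Bagging over trees that subsample features at each split.
forest_like = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                                n_estimators=10, random_state=0).fit(X, y)

# Random forest: bootstrap samples plus per-split feature subsampling.
forest = RandomForestClassifier(n_estimators=10, max_features="sqrt",
                                random_state=0).fit(X, y)

# max_features_ is the number of features a fitted tree examines per split.
per_split = {
    "plain_bagging": plain_bagging.estimators_[0].max_features_,
    "forest_like_bagging": forest_like.estimators_[0].max_features_,
    "random_forest": forest.estimators_[0].max_features_,
}
print(per_split)
```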

The official sklearn documentation on ensemble methods (https://scikit-learn.org/stable/modules/ensemble.html) could be clearer about the difference:

"When samples are drawn with replacement, then the method is known as Bagging"

"In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set."

From these two statements alone, it would appear that bagging decision trees and building a random forest are the same thing. However, the documentation also states:

"Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size max_features."

So this is one more way of introducing randomness: limiting the number of features considered at each split. In practice, it is often worth tuning max_features to get a good fit.
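As a rough sketch of what such tuning could look like, the cell below runs a small cross-validated grid search over max_features. The data is synthetic, and the candidate values, n_estimators, and cv setting are illustrative assumptions rather than recommendations for the phone price data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 20 features (assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Candidate numbers of features to consider at each split (illustrative values).
param_grid = {"max_features": [2, 4, 8, 20]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="accuracy",   # classification accuracy, not R²
)
search.fit(X, y)

best = search.best_params_["max_features"]
print(best, search.best_score_)
```

Setting max_features equal to the total number of features (20 here) makes each split consider every feature, which corresponds to plain bagging of trees.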