## Ensembles and Random Forests
The dataset is available for download at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The version used here has already been preprocessed (standardised) and is ready to be passed to a machine learning algorithm.
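
For readers starting from the raw data, the standardisation mentioned above can be sketched as follows. This is an assumption about how `data_bcw.csv` may have been prepared, not the actual preprocessing behind that file: it uses scikit-learn's bundled copy of the dataset (`load_breast_cancer`) together with `StandardScaler`. scikit-learn names the columns differently (for example `mean radius`) and encodes malignant as 0, so the labels are flipped here to match the convention used in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy stands in for the raw UCI file.
data = load_breast_cancer()
X_raw = data.data  # 569 samples, 30 features

# scikit-learn uses 0 = malignant, 1 = benign; flip so that 1 = malignant.
y = 1 - data.target

# Standardise every feature to zero mean and unit variance.
scaler = StandardScaler()
X_std = scaler.fit_transform(X_raw)

print('Shape:', X_std.shape)
print('Largest absolute column mean:', np.abs(X_std.mean(axis=0)).max())
print('Column standard deviations (first 3):', X_std.std(axis=0)[:3])
```

In a stricter workflow the scaler would be fitted on the training split only, so that test-set statistics do not leak into training.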

### Simple Example of an Ensemble

In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

In [9]:
# Load the dataset.
# Column 'diagnosis' contains the labels:
#   0.0 denotes 'benign'
#   1.0 denotes 'malignant'
df = pd.read_csv('data_bcw.csv')
df.head()

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 | 1.0 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.24389 | 0.28119 | 1.0 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.94221 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955 | 1.152255 | 0.201391 | 1.0 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.93501 | 1.0 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.53934 | 1.371011 | 1.428493 | -0.00956 | -0.56245 | ... | -1.46677 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.3971 | 1.0 |


In [10]:
# Separate features and labels
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

# Converting to NumPy arrays
X = X.to_numpy()
y = y.to_numpy()

# Splitting into train, test set
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
for train_index, test_index in sss.split(X,y):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index] 

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# different classifiers
log_clf = LogisticRegression()
tree_clf = DecisionTreeClassifier(max_depth=3)
svm_clf = SVC()
knn_clf = KNeighborsClassifier(n_neighbors=10)
rf_clf = RandomForestClassifier()

# 'hard voting' classifier
voting_clf = VotingClassifier(estimators=[('logistic_regression', log_clf),
                                          ('decision_tree', tree_clf),
                                          ('support_vector_machine', svm_clf),
                                          ('kNN', knn_clf),
                                          ('random_forest', rf_clf)],
                              voting='hard')

# Fit models and look at the accuracy scores
for clf in (log_clf, tree_clf, svm_clf, knn_clf, rf_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9824561403508771
DecisionTreeClassifier 0.9298245614035088
SVC 0.9824561403508771
KNeighborsClassifier 0.9649122807017544
RandomForestClassifier 1.0
VotingClassifier 0.9824561403508771
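
The hard-voting ensemble above takes a majority vote over predicted labels. A soft-voting variant instead averages the predicted class probabilities, which requires every estimator to expose `predict_proba`. The following is a minimal, self-contained sketch under stated assumptions: scikit-learn's bundled copy of the dataset stands in for `data_bcw.csv`, `train_test_split` with `stratify` replaces the `StratifiedShuffleSplit` loop, and only three of the five classifiers are used to keep it short. Accuracy will vary with the split and random seeds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for data_bcw.csv.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# Soft voting averages predict_proba outputs across the estimators.
# SVC only provides predict_proba when probability=True.
soft_voting_clf = VotingClassifier(
    estimators=[('logistic_regression', LogisticRegression(max_iter=1000)),
                ('support_vector_machine', SVC(probability=True, random_state=42)),
                ('random_forest', RandomForestClassifier(random_state=42))],
    voting='soft')
soft_voting_clf.fit(X_train, y_train)

y_pred = soft_voting_clf.predict(X_test)
soft_acc = accuracy_score(y_test, y_pred)
print('Accuracy of soft voting classifier:', soft_acc)
```

Soft voting gives more weight to confident predictions, so it can help when the individual classifiers produce reasonably calibrated probabilities.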


### Bagging and Pasting

In [32]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Single Decision Tree Classifier
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)
y_pred = tree_clf.predict(X_test)
print('Accuracy of Decision Tree Classifier: ', accuracy_score(y_test, y_pred))

# Bagging: bootstrap=True (sampling with replacement)
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=200,
                            max_samples=100,
                            bootstrap=True,
                            n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print('Accuracy of Bagging Classifier: ', accuracy_score(y_test, y_pred))

Accuracy of Decision Tree Classifier:  0.8947368421052632
Accuracy of Bagging Classifier:  1.0
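
The cell above only shows bagging. Pasting uses the same `BaggingClassifier` with `bootstrap=False`, so each tree is trained on samples drawn without replacement. With bagging, the samples a tree never saw can also serve as a built-in validation set via `oob_score=True`. The sketch below illustrates both; it assumes scikit-learn's bundled copy of the dataset in place of `data_bcw.csv` and a `train_test_split` stand-in for the split used earlier, so the numbers will not match the outputs above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for data_bcw.csv.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# Pasting: each tree sees 100 training samples drawn WITHOUT replacement.
paste_clf = BaggingClassifier(DecisionTreeClassifier(),
                              n_estimators=200,
                              max_samples=100,
                              bootstrap=False,
                              random_state=42)
paste_clf.fit(X_train, y_train)
paste_acc = accuracy_score(y_test, paste_clf.predict(X_test))
print('Accuracy of Pasting Classifier: ', paste_acc)

# Bagging with out-of-bag evaluation: each tree is scored on the training
# samples it did not see, so no separate validation set is needed.
oob_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=200,
                            max_samples=100,
                            bootstrap=True,
                            oob_score=True,
                            random_state=42)
oob_clf.fit(X_train, y_train)
oob_acc = oob_clf.oob_score_
print('Out-of-bag accuracy estimate: ', oob_acc)
```

Out-of-bag evaluation is only available with `bootstrap=True`, which is why it is attached to the bagging model rather than the pasting one.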


### Random Forests

In [33]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=200, 
                                max_leaf_nodes=16, 
                                n_jobs=-1)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)
print('Accuracy of a Random Forest: ', accuracy_score(y_test, y_pred))

Accuracy of a Random Forest:  1.0
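
A random forest is essentially a bagging ensemble of decision trees in which each split considers only a random subset of the features. The sketch below builds a roughly comparable model from `BaggingClassifier` and checks how often the two agree on the test set. It is an illustration under assumptions, not an exact reproduction of the cell above: scikit-learn's bundled copy of the dataset stands in for `data_bcw.csv`, and `max_features='sqrt'` is set explicitly to mimic the forest's per-split feature sampling.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for data_bcw.csv.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# Random forest with the same settings as in the cell above.
rf_clf = RandomForestClassifier(n_estimators=200,
                                max_leaf_nodes=16,
                                random_state=42)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)

# Bagging of trees that sample a random feature subset at every split.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16),
    n_estimators=200,
    random_state=42)
bag_clf.fit(X_train, y_train)
bag_pred = bag_clf.predict(X_test)

rf_acc = accuracy_score(y_test, rf_pred)
agreement = np.mean(rf_pred == bag_pred)
print('Accuracy of a Random Forest: ', rf_acc)
print('Fraction of identical predictions: ', agreement)
```

`RandomForestClassifier` is generally the more convenient choice, since it is optimised for trees and exposes extras such as `feature_importances_`.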


### Feature Importance

In [35]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=200, 
                                max_leaf_nodes=16, 
                                n_jobs=-1)
rf_clf.fit(X_train, y_train)

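# Feature importances, paired with the feature columns (label column excluded)
feature_names = df.columns.drop('diagnosis')
for name, score in zip(feature_names, rf_clf.feature_importances_):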
    print(name,':  ', score)

radius_mean :   0.032510671036917314
texture_mean :   0.011437265285317666
perimeter_mean :   0.048328830660957355
area_mean :   0.04946280291125219
smoothness_mean :   0.00649347987461469
compactness_mean :   0.013532379715857622
concavity_mean :   0.03739312617927278
concave points_mean :   0.10055422113957928
symmetry_mean :   0.003750617446667143
fractal_dimension_mean :   0.003941124510846824
radius_se :   0.016119093153723948
texture_se :   0.004140526345688042
perimeter_se :   0.012428268562690348
area_se :   0.024994579470688033
smoothness_se :   0.003146883050068023
compactness_se :   0.0039214507826431915
concavity_se :   0.0057466339384760635
concave points_se :   0.004088335634291357
symmetry_se :   0.003976920650656093
fractal_dimension_se :   0.003623212821246567
radius_worst :   0.12044094414307663
texture_worst :   0.017553876533889
perimeter_worst :   0.1608275609094665
area_worst :   0.11915614529091925
smoothness_worst :   0.014120217092386766
compactness_worst :   0