<h1>gcForest Algorithm</h1>

<p>The gcForest algorithm was proposed by Zhou and Feng (2017, https://arxiv.org/abs/1702.08835 ; refer to this paper for technical details). I provide here a Python 3 implementation of the algorithm.<br>
I chose to adopt the scikit-learn syntax for ease of use, and below I show how it can be used.</p>

In [1]:
import numpy as np  # needed later for np.mean / np.argmax
from GCForest import gcForest
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

<h2>Iris example</h2>

<p>The iris data set is actually not a very good example, as the gcForest algorithm is better suited to time series and images, where information can be found at different scales within one sample.<br>
Nonetheless, it is still an easy way to test the method.</p>

In [2]:
# loading the data
iris = load_iris()
X = iris.data
y = iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33)

<p>First, calling and training the algorithm.
One particularity here is the 'shape_1X' keyword, which specifies the shape of a single sample.
I added it because pictures fed to the machinery might not be square.<br>
It is not very relevant for the iris data set, but it still has to be defined.</p>

In [3]:
gcf = gcForest(shape_1X=[4,1], window=[2], tolerance=0.0)
gcf.fit(X_tr, y_tr)

Slicing Sequence...
Training MGS Random Forests...
Adding/Training Layer, n_layer=1
Layer validation accuracy = 1.0
Adding/Training Layer, n_layer=2
Layer validation accuracy = 1.0
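
<p>To make the 'shape_1X' and 'window' arguments more concrete, here is a minimal NumPy sketch of what slicing a single iris sample with a window of length 2 could look like. It is only an illustration: the stride of 1 and the use of <code>sliding_window_view</code> are my assumptions, not the internals of gcForest.</p>

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# One iris sample: 4 features, treated as a sequence (shape_1X=[4,1]).
sample = np.array([5.1, 3.5, 1.4, 0.2])

# Slide a window of length 2 along the sequence (assumed stride of 1).
slices = sliding_window_view(sample, window_shape=2)

print(slices.shape)  # (3, 2): three overlapping slices of two features each
print(slices)
```

<p>Under these assumptions, each sample yields three short sub-sequences, which would then be fed to the multi-grain scanning forests.</p>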


<p>Now checking the prediction for the test set:</p>

In [4]:
pred_X = gcf.predict(X_te)
print(pred_X)

Slicing Sequence...
[1 0 0 1 1 0 2 2 1 0 2 1 0 0 2 1 0 2 0 2 0 1 1 2 0 1 0 1 1 1 2 1 2 2 0 2 0
 1 0 1 2 0 0 2 0 1 2 2 2 2]


In [5]:
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))

gcForest accuracy : 0.94


<h2>Digits Example</h2>
<p>A much better example is the digits data set, which contains images of handwritten digits.
The scikit-learn data set can be viewed as a mini-MNIST for training purposes.</p>

In [6]:
# loading the data
digits = load_digits()
X = digits.data
y = digits.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4)

<p> ... training gcForest ... (can take some time...) </p>

In [7]:
gcf = gcForest(shape_1X=[8,8], window=[4,6], tolerance=0.0, min_samples=7)
gcf.fit(X_tr, y_tr)

Slicing Images...
Training MGS Random Forests...
Slicing Images...
Training MGS Random Forests...
Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9675925925925926
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9722222222222222
Adding/Training Layer, n_layer=3
Layer validation accuracy = 0.9722222222222222
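
<p>The two "Slicing Images..." messages correspond to the two window sizes given above. As a rough illustration of why larger windows produce fewer patches, the sketch below extracts square patches from a stand-in 8x8 image. The helper <code>extract_patches</code> is hypothetical and assumes a stride of 1; it is not the function used inside gcForest.</p>

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(image, window):
    """Return every window x window patch of a 2D image (assumed stride of 1)."""
    patches = sliding_window_view(image, (window, window))
    # Merge the two grid axes into a single axis that indexes the patches.
    return patches.reshape(-1, window, window)

image = np.arange(64).reshape(8, 8)  # stand-in for one 8x8 digit image

for window in (4, 6):
    patches = extract_patches(image, window)
    print(window, patches.shape)  # 4 -> (25, 4, 4), 6 -> (9, 6, 6)
```

<p>With these assumptions, a window of 4 gives 5 x 5 = 25 patches per image and a window of 6 gives 3 x 3 = 9, so the smaller window captures finer detail at a higher cost.</p>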


<p> ... and predicting classes ... </p>

In [8]:
pred_X = gcf.predict(X_te)
print(pred_X)

Slicing Images...
Slicing Images...
[9 1 5 2 5 7 6 8 9 5 6 7 6 0 7 6 0 7 5 5 7 0 0 1 1 0 4 7 9 5 5 3 8 2 7 4 1
 9 9 4 3 9 1 6 4 2 1 5 8 0 8 3 7 2 3 7 4 9 6 4 7 6 0 9 0 3 9 9 9 5 1 9 0 0
 7 4 2 1 8 8 7 8 6 0 0 8 6 2 9 3 7 4 3 0 5 9 5 1 4 3 6 8 5 3 2 9 3 6 8 0 3
 5 4 6 1 2 2 5 9 1 5 1 9 9 5 6 2 1 3 1 3 4 3 8 0 8 5 7 0 6 0 0 7 7 7 4 4 3
 6 3 7 3 2 5 8 1 2 6 4 2 6 2 3 4 9 3 3 3 7 1 5 1 4 6 9 9 2 3 6 2 0 2 4 7 5
 1 7 6 8 0 3 0 3 2 7 2 9 2 6 6 3 7 7 5 9 3 8 0 7 9 2 1 5 2 4 7 6 2 0 3 8 4
 4 4 7 1 8 3 0 8 0 7 2 6 3 0 8 6 7 8 6 1 4 1 2 5 4 8 6 4 3 0 9 9 0 6 4 0 3
 6 5 6 6 9 4 3 9 8 4 8 7 1 1 7 2 4 3 6 7 5 9 9 9 1 8 5 4 1 2 7 6 1 7 6 0 2
 0 6 1 0 7 9 3 6 3 0 9 8 9 4 3 3 7 2 3 3 7 6 8 1 1 4 6 5 5 8 3 5 6 6 9 0 2
 2 5 2 7 3 4 3 3 4 8 9 8 1 8 7 6 0 1 5 2 3 9 0 3 4 8 4 9 8 5 8 7 3 4 7 9 6
 1 1 3 8 5 7 7 0 6 4 0 0 6 2 4 8 9 0 3 2 2 2 4 5 8 6 8 7 6 1 7 3 9 6 5 7 3
 9 6 8 5 4 1 3 0 7 0 6 7 5 7 5 4 0 3 7 6 1 2 6 0 5 0 6 8 1 5 0 1 2 0 0 1 8
 3 3 0 7 9 6 9 4 3 9 2 5 4 5 2 3 9 2 1 2 2 2 5 0 5 7 4 5 1 9 6 0

In [9]:
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))

gcForest accuracy : 0.9847009735744089


<h2>Using mg-scanning and cascade_forest Separately</h2>
<p>As the Multi-Grain scanning and the cascade forest modules are quite independent, it is possible to use them separately.<br>
If a target `y` is given, the code automatically uses it for training; otherwise it recalls the last trained Random Forests to slice the data.</p>

In [10]:
gcf = gcForest(shape_1X=[8,8], window=[5], min_samples=7)
X_tr_mgs = gcf.mg_scanning(X_tr, y_tr)

Slicing Images...
Training MGS Random Forests...


In [11]:
X_te_mgs = gcf.mg_scanning(X_te)

Slicing Images...


<p>It is now possible to use the mg_scanning output as input for cascade forests using different parameters. Note that the cascade forest module does not directly return class predictions but the probability predictions from each Random Forest in the last layer of the cascade. Hence the need to first average the output over the forests and then take, for each sample, the class with the highest mean probability.</p>

In [12]:
gcf = gcForest(tolerance=0.0, min_samples=7)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)

Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9768518518518519
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9722222222222222


In [13]:
pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)

0.98470097357440889
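
<p>The averaging step in the cell above can be checked on a small hand-made array. The values below are invented for illustration, and the layout (forests, samples, classes) is an assumption inferred from how the cell reduces the output with <code>axis=0</code> and then <code>axis=1</code>.</p>

```python
import numpy as np

# Hypothetical probabilities from 2 forests for 3 samples and 3 classes,
# laid out as (n_forests, n_samples, n_classes).
pred_proba = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]],
    [[0.5, 0.4, 0.1], [0.2, 0.3, 0.5], [0.1, 0.2, 0.7]],
])

# Average over the forests, then pick the most probable class per sample.
mean_proba = np.mean(pred_proba, axis=0)  # shape (n_samples, n_classes)
preds = np.argmax(mean_proba, axis=1)

print(preds)  # [0 1 2]
```

<p>For the second sample the two forests disagree (class 1 versus class 2), and the averaged probabilities settle on class 1.</p>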

In [14]:
gcf = gcForest(tolerance=0.0, min_samples=20)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)

Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9722222222222222
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9814814814814815
Adding/Training Layer, n_layer=3
Layer validation accuracy = 0.9814814814814815


In [15]:
pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)

0.98470097357440889

<h3>Skipping mg_scanning</h3>
<p>It is also possible to use the cascade forest directly and skip the multi-grain scanning step.</p>

In [16]:
gcf = gcForest(tolerance=0.0, min_samples=20)
_ = gcf.cascade_forest(X_tr, y_tr)

Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9351851851851852
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9537037037037037
Adding/Training Layer, n_layer=3
Layer validation accuracy = 0.9629629629629629
Adding/Training Layer, n_layer=4
Layer validation accuracy = 0.9629629629629629


In [17]:
pred_proba = gcf.cascade_forest(X_te)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)

0.95132127955493739