<b>Foreword:</b><br>
This notebook shows how the Decision Tree and Random Forest models, both built from scratch, are used on a sample dataset.

The sample dataset (~31,000 entries) is taken from the <b>UCI Machine Learning Repository</b>. 
For the purposes of this example, the task is to predict whether an individual earns above or below 50,000 annually. 

In [8]:
# import required libraries
%load_ext autoreload
%autoreload 2

import numpy as np
from sklearn import metrics
from src.datapipeline import transform
from src.decision_tree import DecisionTree
from src.random_forest import RandomForest

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### Data Cleaning

First, we need to clean the raw data. For illustration, I have written a data pipeline that does the cleaning and returns a train/test split.  

In [3]:
# load data 
data_path = './data/data.csv'
X_train, X_test, y_train, y_test = transform(data_path)

  df['education'] = df['education'].replace(value_to_index)


Now, let's inspect the shape of the cleaned data. 
We have 20,734 entries in our train set and 10,213 entries in our test set.
We also have 294 columns, each of which represents a feature. 

In [4]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:',y_train.shape)
print('y_test shape:', y_test.shape)

X_train shape: (20734, 294)
X_test shape: (10213, 294)
y_train shape: (20734, 1)
y_test shape: (10213, 1)


Note that our data transformation pipeline returns DataFrames, 
so we need to convert them into NumPy arrays first. 

In [5]:
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

#### <b>Part 1: Decision Tree</b>
Now we fit the decision tree on our train dataset.

A decision tree is a flexible, non-parametric model used for both classification and regression. My implementation builds a tree-like structure by repeatedly splitting the data into smaller groups based on feature values. At each step, the tree chooses the feature and threshold that give the best split, that is, the one that reduces Gini impurity the most. This process continues until a stopping condition is met, such as reaching a maximum depth or having too few samples to split further.
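To make the splitting criterion concrete, here is a minimal sketch of Gini impurity and a threshold search on a single feature. It is not the code inside `src.decision_tree`; the function names `gini_impurity` and `best_threshold` and the toy arrays are hypothetical and only illustrate the idea described above.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a 1-D array of class labels: 1 - sum(p_k ** 2)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / labels.size
    return float(1.0 - np.sum(proportions ** 2))

def best_threshold(feature, labels):
    """Scan candidate thresholds on one feature.

    Returns (threshold, weighted impurity of the two children).
    The threshold is None if no split beats the unsplit impurity.
    """
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    best = (None, gini_impurity(labels))
    # Skip the largest value: it would leave the right child empty.
    for threshold in np.unique(feature)[:-1]:
        left = labels[feature <= threshold]
        right = labels[feature > threshold]
        weighted = (left.size * gini_impurity(left)
                    + right.size * gini_impurity(right)) / labels.size
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: the classes separate cleanly between 3 and 4.
toy_feature = np.array([1, 2, 3, 4, 5, 6])
toy_labels = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_threshold(toy_feature, toy_labels)
print(threshold, impurity)
```

A full tree would repeat this search over every feature, split on the winner, and recurse on each child until a stopping condition such as `max_depth` is reached.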

##### Model Training

In [29]:
dt = DecisionTree(max_depth=3)
fitted_tree = dt.fit(X_train, y_train)

##### Model Inference and Evaluation

We have finished fitting the decision tree! Now let's evaluate it. 

In [30]:
pred_dt = dt.predict(X_test, fitted_tree)

The predicted y is a 1-D array of 0s and 1s, indicating the class prediction, 
whereas the true y has the shape (10213, 1). We therefore apply np.squeeze to remove the extra dimension, so that both arrays have the same shape before we pass them to sklearn's classification report to evaluate the decision tree.
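As a small illustration of why the shapes should match, the sketch below uses hypothetical toy arrays (not the notebook's data) to show what `np.squeeze` does and what an element-wise comparison produces when the extra dimension is left in place.

```python
import numpy as np

# Hypothetical stand-ins for y_test (column vector) and pred_dt (flat array).
y_true_2d = np.array([[0], [1], [1], [0]])   # shape (4, 1)
y_pred_1d = np.array([0, 1, 0, 0])           # shape (4,)

# np.squeeze drops axes of length 1, giving shape (4,).
y_true_1d = np.squeeze(y_true_2d)
accuracy = np.mean(y_true_1d == y_pred_1d)

# Without squeezing, NumPy broadcasts (4, 1) against (4,) into a (4, 4)
# matrix of pairwise comparisons, which is not what we want.
broadcast_shape = (y_true_2d == y_pred_1d).shape

print(y_true_1d.shape, accuracy, broadcast_shape)
```

Squeezing first keeps any hand-written metric, such as the accuracy above, a comparison of matching entries rather than of all pairs.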

In [31]:
print("predicted y shape:", pred_dt.shape)
print("true y shape:", y_test.shape)

predicted y shape: (10213,)
true y shape: (10213, 1)


In [34]:
report = metrics.classification_report(np.squeeze(y_test), pred_dt)
print('Performance metrics for Decision Tree from scratch:\n-----------------------------------------------------\n', report)

Performance metrics for Decision Tree from scratch:
-----------------------------------------------------
               precision    recall  f1-score   support

           0       0.89      0.79      0.84      6128
           1       0.73      0.86      0.79      4085

    accuracy                           0.82     10213
   macro avg       0.81      0.83      0.82     10213
weighted avg       0.83      0.82      0.82     10213



#### <b>Part 2: Random Forest</b>

Random Forests are an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training.

The Random Forest here, which I built from scratch using only NumPy, handles classification tasks only. <br>

To create a Random Forest, we first define the number of Decision Trees in the forest. For each tree, I draw a sample of rows from the X_train set with replacement (a bootstrap sample) and also sample a subset of the features. How many times do I draw? As many times as there are trees in the forest. Training one tree per bootstrap sample and then combining their predictions is known as bootstrap aggregation, or bagging. <br>

Model training fits one Decision Tree to each sample. <br>

For inference (now using the X_test data), we use a simple averaging mechanism: the "votes" of the trees are averaged for each data point (an entry of X_test).
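The sampling and voting steps can be sketched as follows. This is not the code in `src.random_forest`: the helper names `draw_bootstrap` and `average_votes` are hypothetical, `subsample_size` and `feature_proportion` are assumed to be fractions of rows and columns, features are assumed to be drawn without replacement, and ties in the vote are assumed to go to class 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_bootstrap(X, y, subsample_size=0.5, feature_proportion=0.5):
    """Draw one training sample for a single tree.

    Rows are sampled with replacement (bootstrap); columns are sampled
    without replacement (an assumption of this sketch).
    """
    n_rows, n_cols = X.shape
    n_sample = max(1, int(n_rows * subsample_size))
    n_feat = max(1, int(n_cols * feature_proportion))
    row_idx = rng.integers(0, n_rows, size=n_sample)
    col_idx = rng.choice(n_cols, size=n_feat, replace=False)
    return X[row_idx][:, col_idx], y[row_idx], col_idx

def average_votes(per_tree_preds):
    """Average 0/1 votes of shape (n_trees, n_samples) into one class each."""
    mean_vote = np.mean(per_tree_preds, axis=0)
    return (mean_vote >= 0.5).astype(int)

# Hypothetical toy data: 10 rows, 4 features.
X_toy = np.arange(40).reshape(10, 4)
y_toy = np.array([0, 1] * 5)
X_boot, y_boot, cols = draw_bootstrap(X_toy, y_toy)

# Votes from three hypothetical trees on three test points.
votes = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 0, 1]])
final = average_votes(votes)
print(X_boot.shape, final)
```

Each tree must remember which columns it was trained on (`cols` here) so that the same columns can be selected from X_test at prediction time.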

##### Model Training

In [10]:
rf = RandomForest(n_trees=100, subsample_size=0.5, feature_proportion=0.5)
rf.fit(X_train, y_train)

##### Model Inference and Evaluation

Remember that we use the X_test set for inference!
We can simply call the predict method here.

In [12]:
preds_rf = rf.predict(X_test)

As you can see from the classification report below, the random forest that I built from scratch scores an accuracy of 88%. As expected, this improves on the single decision tree (82% accuracy) by 6 percentage points, a relative gain of about 7.3%. 

Accuracy is a fair metric to use here because the test set is only moderately imbalanced (6,128 entries in class 0 versus 4,085 in class 1, roughly 60/40); you can also observe that the precision and recall scores are not too far from each other. 
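The quoted gain and class balance can be checked with a few lines of arithmetic. The accuracies and support counts are copied from the classification reports in this notebook; the variable names are only illustrative.

```python
# Accuracies as reported for the two models.
acc_tree, acc_forest = 0.82, 0.88

# Absolute gain in percentage points and relative gain in percent.
absolute_gain_pp = (acc_forest - acc_tree) * 100
relative_gain_pct = (acc_forest - acc_tree) / acc_tree * 100

# Test-set support per class, as listed in the reports.
support = {0: 6128, 1: 4085}
positive_share = support[1] / sum(support.values())

print(round(absolute_gain_pp, 1), round(relative_gain_pct, 1), round(positive_share, 2))
```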

In [35]:
report = metrics.classification_report(np.squeeze(y_test), np.squeeze(preds_rf))

print('Performance metrics for Random Forest from scratch:\n-----------------------------------------------------\n', report)

Performance metrics for Random Forest from scratch:
-----------------------------------------------------
               precision    recall  f1-score   support

           0       0.91      0.88      0.90      6128
           1       0.83      0.87      0.85      4085

    accuracy                           0.88     10213
   macro avg       0.87      0.88      0.87     10213
weighted avg       0.88      0.88      0.88     10213

