# Random Forest Classification

Random forest is the final classification algorithm tried. ***NOTE***: Naive Bayes is not tried because the data analysis showed that the features of interest are not independent of each other.

## Setup

In [1]:
import pandas as pd
from sklearn import tree

In [2]:
# Load the files
existing_customers = pd.read_excel('data/existing-customers.xlsx')
potential_customers = pd.read_excel('data/potential-customers.xlsx')

# Define the score metric
def ROI(precision, amount):
    return amount * (88*precision - 25.5*(1-precision))

  warn("Workbook contains no default style, apply openpyxl's default")
  warn("Workbook contains no default style, apply openpyxl's default")
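
As a rough illustration of how the score metric behaves, the sketch below computes the precision at which `ROI` crosses zero and evaluates it at two precision levels. It reads the formula as 88 gained per correct positive prediction and 25.5 lost per incorrect one, which is an interpretation of the coefficients and not something stated here. The precision levels and the amount of 100 are hypothetical.

```python
def ROI(precision, amount):
    # Same formula as the score metric defined in the setup cell
    return amount * (88 * precision - 25.5 * (1 - precision))

# Break-even precision: 88*p - 25.5*(1 - p) = 0  =>  p = 25.5 / (88 + 25.5)
break_even = 25.5 / (88 + 25.5)
print(f"Break-even precision: {break_even:.4f}")

# 100 predicted positives at two hypothetical precision levels
for p in (0.2, 0.7):
    print(f"precision={p}: ROI={ROI(p, 100):.1f}")
```

Below the break-even precision (roughly 0.22) every additional predicted positive lowers the score, so a model that predicts more positives is only better if its precision stays above that level.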


## Premise
Check whether the data can be classified with a random forest classifier. The decision tree already gave a surprisingly strong result, so the question is whether a random forest can do better.

In [3]:
from sklearn.model_selection import train_test_split

def preprocessing_and_feature_selection(
    train_ratio = 0.70,
    validation_ratio = 0.15,
    test_ratio = 0.15,
):
    # Do the feature selection
    data_x = existing_customers[["age", "education", "education-num", "marital-status", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week"]]
    data_y = existing_customers[["class"]]

    # Deal with the NaN entries
    # - By ignoring the variables that contain the NaN entries.

    # One-hot encode the categorical columns
    data_x = pd.get_dummies(data_x)
    data_y = pd.get_dummies(data_y, drop_first=True)

    # Split the data: first hold out everything that is not training data,
    # then divide the held-out part into validation and test sets
    x_train, x_rest, y_train, y_rest = train_test_split(data_x, data_y, test_size=1 - train_ratio)
    x_val, x_test, y_val, y_test = train_test_split(x_rest, y_rest, test_size=test_ratio/(test_ratio + validation_ratio))

    return x_train, x_val, x_test, y_train, y_val, y_test

x_train, x_val, x_test, y_train, y_val, y_test = preprocessing_and_feature_selection()
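
The two-stage split above is unseeded, so each run produces different partitions and different scores. A minimal sketch of a reproducible, class-stratified variant follows. It uses synthetic stand-in data, not the customer file, and the `random_state` and `stratify` settings are suggestions, not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, roughly 25% positives (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.25).astype(int)

train_ratio, validation_ratio, test_ratio = 0.70, 0.15, 0.15

# First split: hold out everything that is not training data
x_train, x_rest, y_train, y_rest = train_test_split(
    X, y, test_size=1 - train_ratio, stratify=y, random_state=42
)

# Second split: divide the held-out part into validation and test sets
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest,
    test_size=test_ratio / (test_ratio + validation_ratio),
    stratify=y_rest, random_state=42,
)

print(len(x_train), len(x_val), len(x_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

With `stratify`, the share of positives stays close to the same in all three partitions, and fixing `random_state` makes the depth search comparable between runs.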

In [12]:
# Train the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

# Try different max_depth values and keep the model with the best validation ROI
best_roi = 0
best_model = None

for depth in range(1,40):
    # Train the model
    model = RandomForestClassifier(n_estimators=100, max_depth=depth)
    model.fit(x_train, y_train.values.ravel())

    # Validate the model
    y_pred = model.predict(x_val)
    precision = precision_score(y_val, y_pred)
    roi = ROI(precision,  y_pred.sum())
    print(f"Recall= {precision}\tAmount= {y_pred.sum()}\tROI= {roi}")

    if roi > best_roi:
        best_roi = roi
        best_model = model


Recall= 0.0	Amount= 0	ROI= -0.0
Recall= 0.08583690987124463	Amount= 101	ROI= -1591.5085836909873
Recall= 0.2369098712446352	Amount= 283	ROI= 393.16351931330473
Recall= 0.4918454935622318	Amount= 714	ROI= 21651.666952789703
Recall= 0.5339055793991416	Amount= 791	ROI= 27762.742060085835
Recall= 0.5579399141630901	Amount= 841	ROI= 31811.817596566518
Recall= 0.5613733905579399	Amount= 827	ROI= 31604.53261802575
Recall= 0.5708154506437768	Amount= 849	ROI= 33355.1330472103
Recall= 0.5708154506437768	Amount= 842	ROI= 33080.120171673814
Recall= 0.5768240343347639	Amount= 853	ROI= 34094.00729613734
Recall= 0.5793991416309013	Amount= 856	ROI= 34464.10300429184
Recall= 0.5811158798283261	Amount= 860	ROI= 34792.72103004292
Recall= 0.5879828326180258	Amount= 870	ROI= 35875.36480686696
Recall= 0.5914163090128756	Amount= 876	ROI= 36464.15793991416
Recall= 0.6103004291845494	Amount= 928	ROI= 40617.72360515022
Recall= 0.6145922746781116	Amount= 938	ROI= 41512.337339055804
Recall= 0.6197424892703862	Amo
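
The `ROI` metric takes a precision, while the printed label reads "Recall", so it may help to see how far the two quantities can diverge. The sketch below uses hypothetical counts chosen only for illustration, and `precision_and_recall` is a helper introduced here, not part of the notebook.

```python
def precision_and_recall(true_pos, predicted_pos, actual_pos):
    """Return (precision, recall) from raw counts, guarding against empty denominators."""
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

def ROI(precision, amount):
    # Same formula as the score metric defined in the setup cell
    return amount * (88 * precision - 25.5 * (1 - precision))

# Hypothetical counts, chosen only for illustration
true_pos, predicted_pos, actual_pos = 80, 100, 400
precision, recall = precision_and_recall(true_pos, predicted_pos, actual_pos)

print(f"precision={precision:.2f}  ROI={ROI(precision, predicted_pos):.1f}")
print(f"recall={recall:.2f}  ROI if recall is passed by mistake={ROI(recall, predicted_pos):.1f}")
```

Precision divides the true positives by the number of predicted positives, recall divides them by the number of actual positives, so feeding recall into `ROI` can flip the sign of the score for a cautious model.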

In [13]:
# Test the best model
y_pred = best_model.predict(x_test)
precision = precision_score(y_test, y_pred)
roi = ROI(precision,  y_pred.sum())
print(f"Recall= {precision}\tAmount= {y_pred.sum()}\tROI= {roi}")

Recall= 0.6367414796342478	Amount= 1082	ROI= 50605.31088944306


In [14]:
best_model.get_params

<bound method BaseEstimator.get_params of RandomForestClassifier(max_depth=23)>

In [15]:
deploy_x = potential_customers[["age", "education", "education-num", "marital-status", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week"]]
deploy_x = pd.get_dummies(deploy_x)

y_pred = best_model.predict(deploy_x)
amount = y_pred.sum()

# Estimate the ROI by reusing the score measured on the test set
print(f"Recall= {precision}\tAmount= {amount}\tROI={ROI(precision, amount)}")

Recall= 0.6367414796342478	Amount= 3342	ROI=156305.86783042396
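
Calling `pd.get_dummies` separately on the training and deployment data only yields matching columns when both files contain exactly the same categories. A minimal sketch of one way to guard against a mismatch is shown below. The two small frames are hypothetical stand-ins, and aligning with `reindex` is a suggested safeguard, not something the notebook does.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the training and deployment data
train_raw = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "Other"],
})
deploy_raw = pd.DataFrame({
    "age": [52, 19],
    "sex": ["Female", "Female"],
    "race": ["White", "Asian"],
})

train_x = pd.get_dummies(train_raw)
deploy_x = pd.get_dummies(deploy_raw)

# Align to the training columns: categories absent from the deployment data
# become all-zero columns, categories unseen during training are dropped
deploy_x = deploy_x.reindex(columns=train_x.columns, fill_value=0)

print(list(deploy_x.columns))
```

After alignment the deployment frame has the same columns in the same order as the frame the model was fitted on, which is what `predict` expects.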


### Conclusion
The random forest does not score better than the decision tree.

One caveat: the logged values labelled `Recall` are consistent with recall, not with the precision that `precision_score` returns. For example, 0.0858 equals 100/1165, while 0.0858 × 101 predictions is not a whole number of true positives. The ROI figures above therefore appear to come from passing recall into `ROI`, possibly from an earlier version of the cell, and may change once precision is used.