# Customized Random Forest Classifier Model 

- TODO: use `tqdm.contrib.concurrent` to build trees in parallel.

## Fundamentals:

- Based on the principles of the Random Forest method: an ensemble of models (decision trees).

- In the bootstrap process:

    - the sampled data is balanced between classes, for both training and validation;

    - the list of features is randomly sampled (random number of features, in random order).
    
- For each tree:

    - following the order of a given list of features, the data is split in half at the median value of each feature;

    - the splitting process ends when all remaining samples belong to a single class;

    - a validation process based on a dynamic threshold can discard the tree.
    
- To use the forest:

    - the predictions of all trees are combined as a vote;

    - either soft or hard voting can be used.
    
- Positive side effects:

    - potentially better generalization from combining overfitted trees, since each tree is highly specialized in a small and different set of features;

    - robustness to unbalanced and missing data: when data is missing, the feature can be skipped without degrading the optimization process;

    - in the prediction process, a missing value can be handled by replicating the tree and following both possible paths;

    - the surviving trees carry potential information about feature importance.
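
The tree-building rule described above can be sketched in a few lines. This is a minimal illustration, not the `random-forest-mc` implementation: the names `build_tree` and `predict_tree`, the nested-dict tree layout, the choice to skip a feature whose median split leaves one side empty, and the toy data are all assumptions made for the example.

```python
from collections import Counter
from statistics import median


def build_tree(rows, labels, features):
    """Split on each feature in order, at the median, until a node is pure."""
    counts = Counter(labels)
    # Stop when the node is pure or there are no features left to split on.
    if len(counts) == 1 or not features:
        total = sum(counts.values())
        return {"leaf": {cls: n / total for cls, n in counts.items()}}
    feature, rest = features[0], features[1:]
    threshold = median(row[feature] for row in rows)
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    # Assumption: a degenerate split (one empty side) is skipped.
    if not left or not right:
        return build_tree(rows, labels, rest)
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], rest),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], rest),
    }


def predict_tree(tree, row):
    """Walk the tree and return the class probabilities stored at the leaf."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Toy data: V1 alone separates the two classes.
rows = [
    {"V1": 0.1, "V2": 5.0},
    {"V1": 0.2, "V2": 1.0},
    {"V1": 0.8, "V2": 4.0},
    {"V1": 0.9, "V2": 2.0},
]
labels = [0, 0, 1, 1]
tree = build_tree(rows, labels, ["V1", "V2"])
print(predict_tree(tree, {"V1": 0.85, "V2": 3.0}))
```

Because the split point is always the median, a tree needs no impurity search; the only randomness comes from which features are chosen and in which order.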

### Premises: 
- All features must be numeric.

- In case of categorical data, the split is done per categorical value, creating one branch for each value.

- Author: Israel Oliveira [\[e-mail\]](mailto:'Israel%20Oliveira%20'<prof.israel@gmail.com>)

In [1]:
!pip3 install random-forest-mc

Collecting random-forest-mc
  Downloading random_forest_mc-0.2.1-py3-none-any.whl (9.0 kB)
Collecting numpy<2.0.0,>=1.21.2
  Downloading numpy-1.21.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.8 MB)
Collecting poetry-version<0.2.0,>=0.1.5
  Downloading poetry_version-0.1.5-py2.py3-none-any.whl (13 kB)
Collecting tomlkit<0.6.0,>=0.4.6
  Downloading tomlkit-0.5.11-py2.py3-none-any.whl (31 kB)
Installing collected packages: tomlkit, numpy, poetry-version, random-forest-mc
  Attempting uninstall: numpy
    Found existing installation: numpy 1.20.3
    Uninstalling numpy-1.20.3:
      Successfully uninstalled numpy-1.20.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.54.0 requires numpy<1.21,>=1.17

In [1]:
%load_ext watermark

In [10]:
import pandas as pd
import numpy as np
from random_forest_mc.model import RandomForestMC
from random_forest_mc.utils import load_file_json
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score
from collections import Counter
import hashlib
import joblib

In [3]:
from tqdm import tqdm

# from glob import glob

# import matplotlib.pyplot as plt
# %matplotlib inline
# from matplotlib import rcParams
# from cycler import cycler

# rcParams['figure.figsize'] = 12, 8 # 18, 5
# rcParams['axes.spines.top'] = False
# rcParams['axes.spines.right'] = False
# rcParams['axes.grid'] = True
# rcParams['axes.prop_cycle'] = cycler(color=['#365977'])
# rcParams['lines.linewidth'] = 2.5

# import seaborn as sns
# sns.set_theme()

# pd.set_option("max_columns", None)
# pd.set_option("max_rows", None)
# pd.set_option('display.max_colwidth', None)

from IPython.display import Markdown, display
def md(arg):
    display(Markdown(arg))

# from pandas_profiling import ProfileReport
# #report = ProfileReport(#DataFrame here#, minimal=True)
# #report.to

# import pyarrow.parquet as pq
# #df = pq.ParquetDataset(path_to_folder_with_parquets, filesystem=None).read_pandas().to_pandas()

# import json
# def open_file_json(path,mode='r',var=None):
#     if mode == 'w':
#         with open(path,'w') as f:
#             json.dump(var, f)
#     if mode == 'r':
#         with open(path,'r') as f:
#             return json.load(f)

# import functools
# import operator
# def flat(a):
#     return functools.reduce(operator.iconcat, a, [])

# import json
# from glob import glob
# from typing import NewType


# DictsPathType = NewType("DictsPath", str)


# def open_file_json(path):
#     with open(path, "r") as f:
#         return json.load(f)

# class LoadDicts:
#     def __init__(self, dict_path: DictsPathType = "./data"):
#         Dicts_glob = glob(f"{dict_path}/*.json")
#         self.List = []
#         self.Dict = {}
#         for path_json in Dicts_glob:
#             name = path_json.split("/")[-1].replace(".json", "")
#             self.List.append(name)
#             self.Dict[name] = open_file_json(path_json)
#             setattr(self, name, self.Dict[name])


In [4]:
# Run this cell before closing.
%watermark -d --iversion -b -r -g -m -v
!cat /proc/cpuinfo |grep 'model name'|head -n 1 |sed -e 's/model\ name/CPU/'
!free -h |cut -d'i' -f1  |grep -v total

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 7.27.0

Compiler    : GCC 10.2.1 20210110
OS          : Linux
Release     : 5.11.0-7633-generic
Machine     : x86_64
Processor   : 
CPU cores   : 4
Architecture: 64bit

Git hash: 5962e2dd43def4f66834036ad82aba1cac443e17

Git repo: https://github.com/ysraell/random-forest-mc.git

Git branch: main

numpy : 1.20.3
joblib: 1.0.1
pandas: 1.3.2

CPU	: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
Mem:            15G
Swap:          4.0G


In [5]:
df = pd.read_csv('/work/data/creditcard_trans_int.csv')
target_col = 'Class'
ds_cols = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']

## Number of samples per class:

In [6]:
df[target_col].value_counts()

0    284315
1       492
Name: Class, dtype: int64
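
The counts above imply a fraud share of roughly 0.17% of all transactions. The short calculation below makes the imbalance explicit; the variable names are illustrative and the numbers are copied from the `value_counts()` output.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 284315, 1: 492}

total = sum(counts.values())
fraud_share = counts[1] / total
imbalance_ratio = counts[0] / counts[1]

print(f"total rows: {total}")
print(f"fraud share: {fraud_share:.4%}")
print(f"legitimate transactions per fraud: {imbalance_ratio:.0f}")
```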

In [7]:
# Number of samples held out for the test step:
N_fraud_test = 200
N_truth_test = int(2e4)

# Number of samples for the training step:
N_truth_train = int(2e5)
# All remaining fraud samples are used.

# Number of (surviving) trees:
T = 64

# Parameters for the bootstrap

# Samples per class for each tree:
N_T = int((df[target_col].value_counts()[1] - N_fraud_test)/2) # half of the remaining fraud samples.
N_V = N_T # keep the same amount for each class.

# Number of features:
n_F_max = len(ds_cols)
n_F_min = n_F_max//2
# From EDA: decision trees with few features may not be very effective.

# Dropped-trees limit:
D = 64
# Each time D trees have been dropped, the validation threshold decreases by delta_th and the count restarts.

# Threshold step and starting value
delta_th = 0.001
th_start = 0.96
# The validation threshold decreases dynamically as needed, so that more specialized trees can be selected.

# Validation metric
M = 'Accuracy'
# This parameter has no effect; it only records which metric is used in the validation process.

# Seeds
#split_seeds = [43, 47, 53, 59]
split_seeds = [43]
# One for each experiment.
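
The interplay of `D`, `delta_th`, and `th_start` described in the comments above can be sketched as a loop. This is an illustration under assumptions, not the library's code: `grow_forest` and `score_tree` are hypothetical names, a tree is assumed to survive when its validation score is at least the current threshold, and random numbers stand in for real validation accuracies.

```python
import random


def grow_forest(n_trees, max_discard, th_start, delta_th, score_tree):
    """Collect n_trees trees, relaxing the threshold after every max_discard rejections."""
    threshold = th_start
    survived_scores = []
    dropped = 0
    while len(survived_scores) < n_trees:
        score = score_tree()
        if score >= threshold:  # assumption: ties survive
            survived_scores.append(score)
        else:
            dropped += 1
            if dropped >= max_discard:
                # Too many rejections: lower the bar and restart the count.
                threshold -= delta_th
                dropped = 0
    return survived_scores, threshold


rng = random.Random(43)
# Stand-in for "build one tree and validate it": a random validation accuracy.
scores, final_th = grow_forest(
    n_trees=8, max_discard=4, th_start=0.96, delta_th=0.001,
    score_tree=lambda: rng.uniform(0.85, 0.97),
)
print(len(scores), round(final_th, 3))
```

Since the threshold only ever decreases, every surviving score is at least as high as the final threshold, which is why the surviving scores cluster just above it.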

In [8]:
N_T

146
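
With 492 fraud rows in total and 200 held out for testing, 292 remain, so `N_T = N_V = 146` splits them evenly between training and validation for each tree. The sketch below shows one way such a class-balanced draw with a random feature list could look. It is not taken from the library: `sample_bootstrap` is a hypothetical helper, sampling without replacement is an assumption, and the index pools are toy data.

```python
import random


def sample_bootstrap(indices_by_class, n_train, n_val, features, n_min, rng):
    """Draw disjoint, class-balanced train/validation indices and a random feature list."""
    train, val = [], []
    for idx in indices_by_class.values():
        picked = rng.sample(idx, n_train + n_val)  # assumption: without replacement
        train += picked[:n_train]
        val += picked[n_train:]
    n_features = rng.randint(n_min, len(features))  # inclusive bounds
    chosen = rng.sample(features, n_features)  # random subset in random order
    return train, val, chosen


rng = random.Random(43)
# Toy index pools: 292 fraud rows and 1000 legitimate rows.
pools = {1: list(range(292)), 0: list(range(1000, 2000))}
features = [f"V{i}" for i in range(1, 29)] + ["Amount"]
train, val, chosen = sample_bootstrap(pools, 146, 146, features, len(features) // 2, rng)
print(len(train), len(val), len(chosen))
```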

In [9]:
# Splitting by unique values of the classes.
df = df[ds_cols+[target_col]]
df_fraud = df.query('Class == 1').reset_index(drop=True).copy()
df_truth = df.query('Class == 0').reset_index(drop=True).copy()

Results = {}
    
# Each seed is an experiment.
for seed in split_seeds:

    # Start the experiment
    df_fraud_train, df_fraud_test = train_test_split(df_fraud, test_size=N_fraud_test, random_state=seed)
    df_truth_train, df_truth_test = train_test_split(df_truth, train_size=N_truth_train, test_size=N_truth_test, random_state=seed)
    df_train = pd.concat([df_fraud_train, df_truth_train]).reset_index(drop=True)
    df_test = pd.concat([df_fraud_test, df_truth_test]).reset_index(drop=True)
    del df_fraud_test, df_truth_test, df_fraud_train, df_truth_train
    
    # Training step
    cls = RandomForestMC(
        n_trees=T, target_col='Class', max_discard_trees=D,
        delta_th=delta_th, th_start=th_start,
        batch_train_pclass=N_T, batch_val_pclass=N_V,
        th_decease_verbose = False
    )
    cls.process_dataset(df_train)
    cls.fitParallel(max_workers=16, thread_parallel_method=False)
    #cls.fit()
    
    # Test step
    probs = cls.testForestProbs(df_test)

    # Save results
    target_test = df_test.Class.to_list()
    Results[seed] = {
        'probs': probs,
        'target_test': target_test
    }

Planting the forest:   0%|          | 0/64 [00:00<?, ?it/s]

In [10]:
cls.fitParallel(max_workers=16, thread_parallel_method=False)
#cls.fit()

# Test step
probs = cls.testForestProbs(df_test)

# Save results
target_test = df_test.Class.to_list()
Results[seed] = {
    'probs': probs,
    'target_test': target_test
}

Planting the forest:   0%|          | 0/64 [00:00<?, ?it/s]

In [11]:
cls = RandomForestMC(
    n_trees=T, target_col='Class', max_discard_trees=D,
    delta_th=delta_th, th_start=th_start,
    batch_train_pclass=N_T, batch_val_pclass=N_V,
    th_decease_verbose = False
)

In [13]:
ModelDict = load_file_json('/work/data/cls_rfmc.json')
cls.dict2model(ModelDict)

In [14]:
# Splitting by unique values of the classes.
df = df[ds_cols+[target_col]]
df_fraud = df.query('Class == 1').reset_index(drop=True).copy()
df_truth = df.query('Class == 0').reset_index(drop=True).copy()

Results = {}

# Each seed is an experiment.
for seed in tqdm(split_seeds):

    # Start the experiment
    df_fraud_train, df_fraud_test = train_test_split(df_fraud, test_size=N_fraud_test, random_state=seed)
    df_truth_train, df_truth_test = train_test_split(df_truth, train_size=N_truth_train, test_size=N_truth_test, random_state=seed)
    df_train = pd.concat([df_fraud_train, df_truth_train]).reset_index(drop=True)
    df_test = pd.concat([df_fraud_test, df_truth_test]).reset_index(drop=True)
    del df_fraud_test, df_truth_test, df_fraud_train, df_truth_train

100%|██████████| 1/1 [00:00<00:00,  8.18it/s]


In [24]:
# Test step
probs = cls.testForestProbs(df_test, soft_voting=True, weighted_tree=True)

# Save results
target_test = df_test.Class.to_list()
Results[seed] = {
    'probs': probs,
    'target_test': target_test
}
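
The difference between hard and soft voting is easiest to see on a small example. The sketch below is an illustration, not the behavior of `testForestProbs`: `hard_vote` and `soft_vote` are hypothetical helpers, and weighting each tree by a per-tree score is only an assumption about what a weighted vote could mean.

```python
from collections import Counter


def hard_vote(tree_probs):
    """Each tree votes for its most likely class; return the vote shares."""
    votes = Counter(max(p, key=p.get) for p in tree_probs)
    return {cls: votes.get(cls, 0) / len(tree_probs) for cls in tree_probs[0]}


def soft_vote(tree_probs, weights=None):
    """Average the per-tree class probabilities, optionally weighted per tree."""
    weights = weights or [1.0] * len(tree_probs)
    total = sum(weights)
    return {
        cls: sum(w * p[cls] for w, p in zip(weights, tree_probs)) / total
        for cls in tree_probs[0]
    }


# Three hypothetical trees scoring one transaction ('0'/'1' keys as in the notebook).
tree_probs = [{"0": 0.9, "1": 0.1}, {"0": 0.4, "1": 0.6}, {"0": 0.45, "1": 0.55}]
print(hard_vote(tree_probs))
print(soft_vote(tree_probs))
print(soft_vote(tree_probs, weights=[0.95, 0.90, 0.90]))
```

In this toy case two of three trees lean slightly towards class `'1'`, so the hard vote picks `'1'`, while the soft vote picks `'0'` because the first tree is much more confident.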

In [25]:

# Generates the metrics
data = []
for seed,exp in Results.items():
    target_test = exp['target_test']
    rf_classes = [0, 1]
    df_exp = pd.DataFrame([(d['0'], d['1']) for d in exp['probs']], columns=rf_classes)
    # Predicted class = class with the highest probability.
    df_exp['pred'] = df_exp[[0, 1]].apply(lambda x: rf_classes[np.argmax(x)], axis=1)
    df_exp['target'] = target_test
    # Summed probabilities, normalized by the size of the true class:
    # detected frauds, missed frauds, and false alarms.
    Fraud_True_Sum = df_exp.loc[(df_exp.pred == 1) & (df_exp.target == 1)][1].sum()/sum(df_exp.target == 1)
    Truth_False_Sum = df_exp.loc[(df_exp.pred == 0) & (df_exp.target == 1)][0].sum()/sum(df_exp.target == 1)
    Fraud_False_Sum = df_exp.loc[(df_exp.pred == 1) & (df_exp.target == 0)][1].sum()/sum(df_exp.target == 0)
    F1_M = f1_score(target_test, df_exp['pred'].to_numpy(), average='macro')
    # AUC is computed from hard labels here, so it equals the mean of TP_0 and TP_1.
    AUC_ROC_M = roc_auc_score(target_test, df_exp['pred'].to_numpy(), average='macro')
    # Per-class recall: fraction of each true class that is predicted correctly.
    TP_0 = df_exp.loc[(df_exp.pred == 0) & (df_exp.target == 0)].shape[0]/sum(df_exp.target == 0)
    TP_1 = df_exp.loc[(df_exp.pred == 1) & (df_exp.target == 1)].shape[0]/sum(df_exp.target == 1)

    data.append([
        seed, Fraud_True_Sum, Truth_False_Sum, Fraud_False_Sum, F1_M, AUC_ROC_M, TP_0, TP_1
    ])

In [23]:
columns = ['seed', 'Fraud_True_Sum','Truth_False_Sum', 'Fraud_False_Sum', 'F1_M', 'AUC_ROC_M', 'TP_0', 'TP_1']
df_Results = pd.DataFrame(data, columns=columns)
df_Results.to_csv('/work/data/Results_creditcard_RFMC_6.csv', index=False)
df_Results

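|   | seed | Fraud_True_Sum | Truth_False_Sum | Fraud_False_Sum | F1_M | AUC_ROC_M | TP_0 | TP_1 |
|---|------|----------------|-----------------|-----------------|------|-----------|------|------|
| 0 | 43 | 0.834806 | 0.102332 | 0.007831 | 0.775837 | 0.928775 | 0.98755 | 0.87 |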


In [26]:
columns = ['seed', 'Fraud_True_Sum','Truth_False_Sum', 'Fraud_False_Sum', 'F1_M', 'AUC_ROC_M', 'TP_0', 'TP_1']
df_Results = pd.DataFrame(data, columns=columns)
df_Results.to_csv('/work/data/Results_creditcard_RFMC_7.csv', index=False)
df_Results

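|   | seed | Fraud_True_Sum | Truth_False_Sum | Fraud_False_Sum | F1_M | AUC_ROC_M | TP_0 | TP_1 |
|---|------|----------------|-----------------|-----------------|------|-----------|------|------|
| 0 | 43 | 0.834806 | 0.102332 | 0.007831 | 0.775837 | 0.928775 | 0.98755 | 0.87 |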
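
`TP_0` and `TP_1` in the row above are per-class recalls. Because `roc_auc_score` is fed hard 0/1 predictions rather than probabilities, `AUC_ROC_M` reduces to the mean of the two recalls (balanced accuracy): (0.98755 + 0.87) / 2 = 0.928775, which matches the reported value. The sketch below reproduces this relation on toy labels; `per_class_recall` is a hypothetical helper written for the example.

```python
def per_class_recall(targets, preds, cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    hits = sum(1 for t, p in zip(targets, preds) if t == cls and p == cls)
    return hits / sum(1 for t in targets if t == cls)


# Toy labels: 8 legitimate (0) and 2 fraud (1) transactions.
targets = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tp_0 = per_class_recall(targets, preds, 0)  # 7 of 8 legitimate rows recovered
tp_1 = per_class_recall(targets, preds, 1)  # 1 of 2 fraud rows recovered
balanced_accuracy = (tp_0 + tp_1) / 2
print(tp_0, tp_1, balanced_accuracy)

# Same relation applied to the values copied from the results row above.
reported = (0.98755 + 0.87) / 2
print(round(reported, 6))
```

Passing the fraud probabilities instead of the hard labels would give a threshold-independent AUC, at the cost of no longer being comparable with the numbers already recorded here.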


# Results

In [None]:
md('# T = 128 and D = 512 (delta_th = 0.002)')
#md("With 16 threads: {:.2f}s/tree".format((6*3600+16*60+13)/512))
df_Results

In [None]:
md('# T = 512 and D = 32')
md("With 16 threads: {:.2f}s/tree".format((6*3600+16*60+13)/512))
df_Results

In [None]:
md('# T = 128 and D = 16')
md("With 16 threads: {:.2f}s/tree".format((50*60+28)/128))
df_Results = pd.read_csv('/work/data/Results_creditcard_RFMC_3.csv')
df_Results

### Basic statistics for all experiments

In [None]:
df_Results[df_Results.columns[1:]].describe().loc[['mean', 'std', 'min', 'max']]

# Baseline model metrics

In [None]:
df_Baseline = pd.read_csv('Results_creditcard_Baseline.csv')

In [None]:
columns = ['seed', 'n_estimators', 'Fraud_True_Sum','Truth_False_Sum', 'Fraud_False_Sum', 'F1_M', 'AUC_ROC_M', 'TP_0', 'TP_1']
df_Baseline[columns].groupby(['seed','n_estimators']).mean()

In [21]:
Counter([round(x,8) for x in cls.survived_scores])

Counter({0.924: 6,
         0.909: 5,
         0.906: 6,
         0.913: 5,
         0.919: 4,
         0.902: 2,
         0.905: 8,
         0.908: 2,
         0.916: 3,
         0.911: 2,
         0.922: 7,
         0.929: 3,
         0.923: 4,
         0.946: 2,
         0.917: 4,
         0.931: 7,
         0.928: 5,
         0.93: 4,
         0.934: 3,
         0.933: 2,
         0.9: 1,
         0.912: 4,
         0.914: 3,
         0.915: 1,
         0.901: 3,
         0.896: 2,
         0.903: 3,
         0.907: 5,
         0.94: 2,
         0.921: 4,
         0.937: 1,
         0.927: 1,
         0.91: 2,
         0.92: 5,
         0.932: 2,
         0.925: 2,
         0.935: 1,
         0.899: 1,
         0.904: 1})

In [51]:
Forest = cls.Forest+cls.Forest
print(len(Forest))
conds = pd.DataFrame([hashlib.md5(str(Tree).encode('utf-8')).hexdigest() for Tree in Forest]).duplicated().to_list()
Forest = [Tree for Tree, cond in zip(Forest,conds) if cond]
print(len(Forest))
# Filter the scores with the same mask; the score list is doubled to stay aligned with the doubled Forest.
scores = [score for score, cond in zip(list(cls.survived_scores) * 2, conds) if cond]
print(len(scores))

256
128

In [35]:
i

0

In [36]:
hashlib.md5(str(Tree).encode('utf-8')).hexdigest()

'37f4348ff86b2fc5486ef88873d9363e'
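
The hashing trick used in the last cells can be turned into a small deduplication helper. This is a sketch under assumptions rather than part of `random-forest-mc`: `tree_hash` and `deduplicate` are hypothetical names, trees are assumed to have a deterministic `str()` representation, and the toy forest uses plain dicts.

```python
import hashlib


def tree_hash(tree):
    """Fingerprint a tree by hashing its string representation."""
    return hashlib.md5(str(tree).encode("utf-8")).hexdigest()


def deduplicate(forest, scores):
    """Keep the first occurrence of each distinct tree, together with its score."""
    seen = set()
    kept_trees, kept_scores = [], []
    for tree, score in zip(forest, scores):
        digest = tree_hash(tree)
        if digest not in seen:
            seen.add(digest)
            kept_trees.append(tree)
            kept_scores.append(score)
    return kept_trees, kept_scores


# Toy forest of dict-based trees; the third entry repeats the first.
forest = [{"V1": 0.5}, {"V2": 1.0}, {"V1": 0.5}]
scores = [0.92, 0.91, 0.92]
unique_trees, unique_scores = deduplicate(forest, scores)
print(len(unique_trees), unique_scores)
```

Unlike the cell above, which keeps the rows flagged by `duplicated()`, this helper keeps the first occurrence of each tree, which is the usual choice when merging forests.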