# Customized Random Forest Classifier Model 

- TODO: use tqdm.contrib.concurrent to build trees in parallel.

## Fundamentals:

- Based on Random Forest method principles: ensemble of models (decision trees).

- In the bootstrap process:

    - the sampled data keeps the classes balanced, for both training and validation;

    - the list of features is randomly sampled (random number of features, random order).
    
- For each tree:

    - following the sequence of the given feature list, the data is split half/half at the median value;

    - the splitting process ends when the samples belong to a single class;

    - a validation process based on a dynamic threshold can discard the tree.
    
- To use the forest:

    - the predictions of all trees are combined as a vote;

    - either soft or hard voting can be used.
    
- Positive side-effects:

    - possibly better generalization from combining overfitted trees, each one highly specialized in a small and different set of features;

    - robustness to unbalanced and missing data: a feature with missing data can be skipped without degrading the optimization process;

    - in the prediction process, a missing value can be handled by replicating the tree to consider both possible paths;

    - the surviving trees carry potential information about feature importance.
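
The per-tree procedure described above can be sketched in plain Python. This is a minimal illustration, not the `random-forest-mc` implementation: the function names `build_tree` and `predict` are hypothetical, the fallback leaf used when the feature list runs out and the rule of skipping a feature whose median does not separate the samples are assumptions, and the nested-dictionary layout only loosely mirrors what this notebook prints for `cls.Forest[0]`.

```python
from statistics import median


def build_tree(rows, labels, features):
    """Grow a tree by splitting at the median of each feature, in the given order.

    rows: list of dicts (feature name -> numeric value)
    labels: list of class labels aligned with rows
    features: ordered list of feature names to split on
    """
    classes = set(labels)
    # Stop when the node is pure. Stopping when no feature is left is an assumption.
    if len(classes) == 1 or not features:
        total = len(labels)
        return {"leaf": {str(c): labels.count(c) / total for c in sorted(classes)}}

    feat, rest = features[0], features[1:]
    split_val = median(r[feat] for r in rows)
    upper = [i for i, r in enumerate(rows) if r[feat] >= split_val]
    lower = [i for i, r in enumerate(rows) if r[feat] < split_val]
    # Assumption: if the median does not separate the samples, skip this feature.
    if not upper or not lower:
        return build_tree(rows, labels, rest)

    def subset(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return {feat: {"split": {
        "feat_type": "numeric",
        "split_val": split_val,
        ">=": build_tree(*subset(upper), rest),
        "<": build_tree(*subset(lower), rest),
    }}}


def predict(tree, row):
    """Follow the splits until a leaf is reached and return its class probabilities."""
    while "leaf" not in tree:
        feat, node = next(iter(tree.items()))
        split = node["split"]
        tree = split[">="] if row[feat] >= split["split_val"] else split["<"]
    return tree["leaf"]


rows = [{"a": 1, "b": 5}, {"a": 2, "b": 1}, {"a": 3, "b": 7}, {"a": 4, "b": 3}]
labels = [0, 0, 1, 1]
tree = build_tree(rows, labels, ["a", "b"])
print(predict(tree, {"a": 3.5, "b": 0}))  # {'1': 1.0}
```

In this toy data the median of `a` (2.5) already separates the two classes, so the tree stops after one split and never uses `b`.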

### Premises: 
- all features must be numeric.
   
- In case of categorical data, the split is done per categorical value, creating one branch for each value.

- Author: Israel Oliveira [\[e-mail\]](mailto:'Israel%20Oliveira%20'<prof.israel@gmail.com>)

In [1]:
!pip3 install random-forest-mc

Collecting random-forest-mc
  Downloading random_forest_mc-0.2.0-py3-none-any.whl (8.9 kB)
Collecting numpy<2.0.0,>=1.21.2
  Downloading numpy-1.21.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.8 MB)
Collecting poetry-version<0.2.0,>=0.1.5
  Downloading poetry_version-0.1.5-py2.py3-none-any.whl (13 kB)
Collecting pandas<2.0.0,>=1.3.2
  Downloading pandas-1.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Collecting tqdm<5.0.0,>=4.62.1
  Downloading tqdm-4.62.2-py2.py3-none-any.whl (76 kB)
Collecting tomlkit<0.6.0,>=0.4.6
  Downloading tomlkit-0.5.11-py2.py3-none-any.whl (31 kB)
Installing collected packages: tomlkit, numpy, tqdm, poetry-version, pandas, random-forest-mc
  Attempting uninstall: numpy
    Found existing install

In [2]:
%load_ext watermark

In [3]:
import pandas as pd
import numpy as np
from random_forest_mc.model import RandomForestMC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

In [4]:
# from tqdm import tqdm

# from glob import glob

# import matplotlib.pyplot as plt
# %matplotlib inline
# from matplotlib import rcParams
# from cycler import cycler

# rcParams['figure.figsize'] = 12, 8 # 18, 5
# rcParams['axes.spines.top'] = False
# rcParams['axes.spines.right'] = False
# rcParams['axes.grid'] = True
# rcParams['axes.prop_cycle'] = cycler(color=['#365977'])
# rcParams['lines.linewidth'] = 2.5

# import seaborn as sns
# sns.set_theme()

# pd.set_option("max_columns", None)
# pd.set_option("max_rows", None)
# pd.set_option('display.max_colwidth', None)

# from IPython.display import Markdown, display
# def md(arg):
#     display(Markdown(arg))

# from pandas_profiling import ProfileReport
# #report = ProfileReport(#DataFrame here#, minimal=True)
# #report.to

# import pyarrow.parquet as pq
# #df = pq.ParquetDataset(path_to_folder_with_parquets, filesystem=None).read_pandas().to_pandas()

# import json
# def open_file_json(path,mode='r',var=None):
#     if mode == 'w':
#         with open(path,'w') as f:
#             json.dump(var, f)
#     if mode == 'r':
#         with open(path,'r') as f:
#             return json.load(f)

# import functools
# import operator
# def flat(a):
#     return functools.reduce(operator.iconcat, a, [])

# import json
# from glob import glob
# from typing import NewType


# DictsPathType = NewType("DictsPath", str)


# def open_file_json(path):
#     with open(path, "r") as f:
#         return json.load(f)

# class LoadDicts:
#     def __init__(self, dict_path: DictsPathType = "./data"):
#         Dicts_glob = glob(f"{dict_path}/*.json")
#         self.List = []
#         self.Dict = {}
#         for path_json in Dicts_glob:
#             name = path_json.split("/")[-1].replace(".json", "")
#             self.List.append(name)
#             self.Dict[name] = open_file_json(path_json)
#             setattr(self, name, self.Dict[name])


In [5]:
# Run this cell before closing the notebook.
%watermark -d --iversion -b -r -g -m -v
!cat /proc/cpuinfo |grep 'model name'|head -n 1 |sed -e 's/model\ name/CPU/'
!free -h |cut -d'i' -f1  |grep -v total

Python implementation: CPython
Python version       : 3.9.6
IPython version      : 7.26.0

Compiler    : GCC 8.3.0
OS          : Linux
Release     : 5.11.0-7620-generic
Machine     : x86_64
Processor   : 
CPU cores   : 8
Architecture: 64bit

Git hash: 9b38fea0fc70036091e4c8e101933f73ae0f88a3

Git repo: https://github.com/ysraell/random-forest-mc.git

Git branch: main

numpy : 1.21.2
pandas: 1.3.2

CPU	: Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz
Mem:           31G
Swap:         4.0G


In [6]:
df = pd.read_csv('/work/data/creditcard_trans_int.csv')
target_col = 'Class'
ds_cols = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']

## Number of samples per class:

In [7]:
df[target_col].value_counts()

0    284315
1       492
Name: Class, dtype: int64
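
With roughly 578 non-fraud rows for every fraud row, the bootstrap has to draw the same number of rows from each class. The sketch below shows one way this could be done, together with the random feature list. It is an illustration only: `sample_bootstrap` and its parameters are hypothetical names, and drawing without replacement with disjoint training and validation rows is an assumption, not a documented behaviour of `random-forest-mc`.

```python
import random


def sample_bootstrap(labels, n_train, n_val, features, n_f_min, n_f_max, rng):
    """Draw balanced train/validation row indices and a shuffled feature list."""
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Same number of rows per class; train and validation rows do not overlap.
        picked = rng.sample(idx, n_train + n_val)
        train_idx += picked[:n_train]
        val_idx += picked[n_train:]
    # Random number of features, in random order.
    n_feat = rng.randint(n_f_min, n_f_max)
    feat_list = rng.sample(features, n_feat)
    return train_idx, val_idx, feat_list


rng = random.Random(43)
toy_labels = [0] * 1000 + [1] * 20  # heavily unbalanced toy target
train_idx, val_idx, feats = sample_bootstrap(
    toy_labels, n_train=5, n_val=5,
    features=["V1", "V2", "V3", "V4"], n_f_min=2, n_f_max=4, rng=rng,
)
print(len(train_idx), len(val_idx), feats)
```

Each tree then sees 5 rows of each class for training and 5 for validation, regardless of how rare the minority class is in the full data.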

In [20]:
# Number of samples held out for the test step:
N_fraud_test = 200
N_truth_test = int(2e4)

# Number of non-fraud samples for the training step:
N_truth_train = int(2e5)
# All remaining fraud samples are used for training.

# Number of (surviving) trees:
T = 128

# Parameters for bootstrap

# Samples per class for each tree:
N_T = int((df[target_col].value_counts()[1] - N_fraud_test)/2) # half of the remaining fraud samples.
N_V = N_T # keep a balanced amount for each class.

# Number of features:
n_F_max = len(ds_cols)
n_F_min = n_F_max//2
# From EDA: decision trees with few features may not be very effective.

# Dropped trees limit:
D = 16
# When a total of D trees are dropped, the validation threshold decreases by delta_th and the count restarts.

# Delta threshold
delta_th = 0.01
th_start = 0.98
# The validation threshold decreases dynamically as needed to select more specialized trees.

# Metric in validation
M = 'Accuracy'
# This parameter has no effect; it only records which metric is used in the validation process.

#Seeds
#split_seeds = [43, 47, 53, 59]
split_seeds = [43]
# One for each experiment.
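
The interplay between `th_start`, `delta_th` and `D` described in the comments above can be sketched as a simple acceptance loop. This is a hedged illustration rather than the library's code: `grow_forest` and `make_tree_score` are hypothetical names, and accepting a tree when its score is greater than or equal to the threshold is an assumption.

```python
def grow_forest(make_tree_score, n_trees, th_start, delta_th, max_discard):
    """Keep trees whose validation score reaches a threshold that relaxes over time.

    make_tree_score: hypothetical callable returning (tree, validation_score).
    """
    forest, scores = [], []
    threshold, dropped = th_start, 0
    while len(forest) < n_trees:
        tree, score = make_tree_score()
        if score >= threshold:
            forest.append(tree)
            scores.append(score)
        else:
            dropped += 1
            if dropped >= max_discard:
                # Too many rejections: lower the bar and restart the count.
                threshold -= delta_th
                dropped = 0
    return forest, scores, threshold


# Toy run: every candidate tree scores 0.955, so the threshold has to drop
# from 0.98 to about 0.95 (three steps of 0.01) before any tree survives.
forest, scores, final_th = grow_forest(
    lambda: ("tree", 0.955), n_trees=3, th_start=0.98, delta_th=0.01, max_discard=2
)
print(len(forest), round(final_th, 2))  # 3 0.95
```

The trade-off is visible in the toy run: a high starting threshold with a small `delta_th` costs many discarded trees before the forest starts to fill.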

In [21]:
N_T

146

In [22]:
# Splitting by unique values of the classes.
df = df[ds_cols+[target_col]]
df_fraud = df.query('Class == 1').reset_index(drop=True).copy()
df_truth = df.query('Class == 0').reset_index(drop=True).copy()

Results = {}
    
# Each seed is an experiment.
for seed in split_seeds:

    # Start the experiment
    df_fraud_train, df_fraud_test = train_test_split(df_fraud, test_size=N_fraud_test, random_state=seed)
    df_truth_train, df_truth_test = train_test_split(df_truth, train_size=N_truth_train, test_size=N_truth_test, random_state=seed)
    df_train = pd.concat([df_fraud_train, df_truth_train]).reset_index(drop=True)
    df_test = pd.concat([df_fraud_test, df_truth_test]).reset_index(drop=True)
    del df_fraud_test, df_truth_test, df_fraud_train, df_truth_train
    
    # Training step
    cls = RandomForestMC(
        n_trees=T, target_col='Class', max_discard_trees=D,
        delta_th=delta_th, th_start=th_start,
        batch_train_pclass=N_T, batch_val_pclass=N_V
    )
    cls.process_dataset(df_train)
    #cls.fitParallel(max_workers=16, thread_parallel_method=False)
    cls.fit()
    
    # Test step
    probs = cls.testForestProbs(df_test)

    # Save results
    target_test = df_test.Class.to_list()
    Results[seed] = {
        'probs': probs,
        'target_test': target_test
    }

Planting the forest:   1%|          | 1/128 [09:59<21:08:15, 599.18s/it]


KeyboardInterrupt: 
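
The metrics cell that follows reads each entry of `probs` as a dictionary with the keys `'0'` and `'1'`. As a rough illustration of how soft and hard voting could combine per-tree leaf probabilities into such a dictionary, consider the sketch below. The functions `soft_vote` and `hard_vote` are hypothetical and are not taken from `random-forest-mc`.

```python
def soft_vote(tree_probs):
    """Average the per-class probabilities returned by every tree."""
    classes = sorted({c for p in tree_probs for c in p})
    n = len(tree_probs)
    return {c: sum(p.get(c, 0.0) for p in tree_probs) / n for c in classes}


def hard_vote(tree_probs):
    """Let each tree vote for its most probable class, then report vote shares."""
    classes = sorted({c for p in tree_probs for c in p})
    votes = {c: 0 for c in classes}
    for p in tree_probs:
        votes[max(p, key=p.get)] += 1
    n = len(tree_probs)
    return {c: votes[c] / n for c in classes}


# Leaf outputs of three toy trees for the same sample.
tree_probs = [{"0": 1.0}, {"0": 0.4, "1": 0.6}, {"0": 0.45, "1": 0.55}]
soft = soft_vote(tree_probs)
hard = hard_vote(tree_probs)
print(max(soft, key=soft.get), max(hard, key=hard.get))  # 0 1
```

The two schemes can disagree: one very confident tree outweighs two weakly confident ones under soft voting, while hard voting counts each tree once.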

In [None]:

# Generate the metrics for each experiment
data = []
for seed,exp in Results.items():
    target_test = exp['target_test']
    rf_classes = [0, 1]
    # One row per test sample: forest probability for class 0 and for class 1.
    df_exp = pd.DataFrame([(d['0'], d['1']) for d in exp['probs']], columns=rf_classes)
    # Predicted class = class with the highest probability.
    df_exp['pred'] = df_exp[[0, 1]].apply(lambda x: rf_classes[np.argmax(x)], axis=1)
    df_exp['target'] = target_test
    # Probability mass assigned to each outcome, normalized by the size of the true class.
    Fraud_True_Sum = df_exp.loc[(df_exp.pred == 1) & (df_exp.target == 1)][1].sum()/sum(df_exp.target == 1)
    Truth_False_Sum = df_exp.loc[(df_exp.pred == 0) & (df_exp.target == 1)][0].sum()/sum(df_exp.target == 1)
    Fraud_False_Sum = df_exp.loc[(df_exp.pred == 1) & (df_exp.target == 0)][1].sum()/sum(df_exp.target == 0)
    # Note: F1 and ROC AUC are both computed from the hard predictions, not from the probabilities.
    F1_M = f1_score(target_test, df_exp['pred'].to_numpy(), average='macro')
    AUC_ROC_M = roc_auc_score(target_test, df_exp['pred'].to_numpy(), average='macro')
    # Per-class recall: fraction of class 0 (TP_0) and class 1 (TP_1) rows predicted correctly.
    TP_0 = df_exp.loc[(df_exp.pred == 0) & (df_exp.target == 0)].shape[0]/sum(df_exp.target == 0)
    TP_1 = df_exp.loc[(df_exp.pred == 1) & (df_exp.target == 1)].shape[0]/sum(df_exp.target == 1)


    data.append([
        seed, Fraud_True_Sum, Truth_False_Sum, Fraud_False_Sum, F1_M, AUC_ROC_M, TP_0, TP_1
    ])

In [None]:
columns = ['seed', 'Fraud_True_Sum','Truth_False_Sum', 'Fraud_False_Sum', 'F1_M', 'AUC_ROC_M', 'TP_0', 'TP_1']
df_Results = pd.DataFrame(data, columns=columns)
df_Results.to_csv('/work/data/Results_creditcard_RFMC_2.csv', index=False)

# Results

In [None]:
df_Results

### Basic statistics for all experiments

In [14]:
df_Results[df_Results.columns[1:]].describe().loc[['mean', 'std', 'min', 'max']]

      Fraud_True_Sum  Truth_False_Sum  Fraud_False_Sum      F1_M  AUC_ROC_M  TP_0  TP_1
mean        0.712227              0.0         0.692839  0.009804        0.5   0.0   1.0
std         0.033056              0.0         0.037334       0.0        0.0   0.0   0.0
min         0.685703              0.0         0.657598  0.009804        0.5   0.0   1.0
max         0.760508              0.0         0.745611  0.009804        0.5   0.0   1.0
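
The constant `F1_M`, `AUC_ROC_M`, `TP_0` and `TP_1` values above are consistent with a forest that labels every test row as fraud. This is an inference from the numbers, not something the notebook states. The check below reproduces the arithmetic under that assumption, using the test-set sizes from the parameter cell (200 fraud and 20,000 non-fraud rows). The helpers `macro_f1`, `roc_auc` and `recall` are small stand-ins written for this illustration, not the scikit-learn functions used in the notebook.

```python
def macro_f1(targets, preds, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(targets, preds))
        fp = sum(t != c and p == c for t, p in zip(targets, preds))
        fn = sum(t == c and p != c for t, p in zip(targets, preds))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def roc_auc(targets, scores):
    """ROC AUC via the rank-sum statistic; tied scores share their average rank."""
    pairs = sorted(zip(scores, targets))
    n_pos = sum(targets)
    n_neg = len(targets) - n_pos
    rank_sum, i = 0.0, 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average 1-based rank of the tied group
        rank_sum += avg_rank * sum(t for _, t in pairs[i:j])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def recall(targets, preds, c):
    """Fraction of rows of class c that were predicted as c."""
    hits = sum(t == c and p == c for t, p in zip(targets, preds))
    return hits / sum(t == c for t in targets)


targets = [1] * 200 + [0] * 20000
preds = [1] * len(targets)  # assumption: every row is predicted as fraud

print(round(macro_f1(targets, preds), 6))  # 0.009804
print(roc_auc(targets, preds))             # 0.5
print(recall(targets, preds, 0), recall(targets, preds, 1))  # 0.0 1.0
```

Because the AUC here is computed from hard labels, a constant prediction always yields 0.5. Passing the class-1 probabilities instead would measure how well the forest ranks fraud above non-fraud.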


# Baseline model metrics

In [15]:
df_Baseline = pd.read_csv('Results_creditcard_Baseline.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'Results_creditcard_Baseline.csv'

In [None]:
columns = ['seed', 'n_estimators', 'Fraud_True_Sum','Truth_False_Sum', 'Fraud_False_Sum', 'F1_M', 'AUC_ROC_M', 'TP_0', 'TP_1']
df_Baseline[columns].groupby(['seed','n_estimators']).mean()

In [23]:
cls.survived_scores

[]

In [19]:
cls.Forest[0]

{'V12': {'split': {'feat_type': 'numeric',
   'split_val': 1783555388,
   '>=': {'V22': {'split': {'feat_type': 'numeric',
      'split_val': 1086423201,
      '>=': {'V4': {'split': {'feat_type': 'numeric',
         'split_val': 605622932,
         '>=': {'V19': {'split': {'feat_type': 'numeric',
            'split_val': 738258662,
            '>=': {'V16': {'split': {'feat_type': 'numeric',
               'split_val': 1398400753,
               '>=': {'V18': {'split': {'feat_type': 'numeric',
                  'split_val': 1014049836,
                  '>=': {'V27': {'split': {'feat_type': 'numeric',
                     'split_val': 223298180,
                     '>=': {'leaf': {'0': 1.0}},
                     '<': {'V23': {'split': {'feat_type': 'numeric',
                        'split_val': 462561420,
                        '>=': {'leaf': {'0': 1.0}},
                        '<': {'leaf': {'1': 1.0}}}}}}}},
                  '<': {'leaf': {'0': 1.0}}}}},
               '<': {'