# Random Forest Approach

This Jupyter notebook provides a full example of the Random Forest Approach.

## Imports and setup:

The following cell handles all imports and the initial setup. It fixes the seed of the NumPy random number generator so that results are reproducible, and it initializes the Ray Tune framework.

In [1]:
from typing import List

import numpy as np
import ray
from ray import tune
from ray.tune.suggest.bohb import TuneBOHB
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

from ml4pdm.data import Dataset
from ml4pdm.evaluation import Evaluator
from ml4pdm.evaluation.metrics import loss_asymmetric, loss_false_negative_rate, loss_false_positive_rate, score_performance
from ml4pdm.parsing import DatasetParser
from ml4pdm.prediction import EnsembleApproach, WindowedPredictor
from ml4pdm.transformation import WindowingApproach, AttributeFilter, SklearnWrapper

np.random.seed(2)
ray.init(include_dashboard=False)

{'node_ip_address': '192.168.31.31',
 'raylet_ip_address': '192.168.31.31',
 'redis_address': '192.168.31.31:6379',
 'object_store_address': 'tcp://127.0.0.1:65527',
 'raylet_socket_name': 'tcp://127.0.0.1:65250',
 'webui_url': None,
 'session_dir': 'C:\\Users\\CHRIST~1\\AppData\\Local\\Temp\\ray\\session_2021-08-25_23-43-38_371426_23832',
 'metrics_export_port': 65286,
 'node_id': 'c67c697361d914c0634d398a8305b2bebc7084974bcaf7316cf45cfe'}

## Prepare dataset:

Both datasets are based on CMAPSS FD001. The train and test sets are prepared by removing the constant features and the settings features. Min-max scaling is then applied per feature. Windowing with a window size of 1 turns every timestep into its own sample, so that single timesteps can be passed to the Decision Tree. The labels are computed per timestep by adding the number of timesteps remaining until the end of the instance to the target given for the entire instance.
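The scaling step described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the ml4pdm implementation: the helper name `scale_concatenated` and the toy data are hypothetical, and it is assumed that the "concatenated" extraction stacks the timesteps of all instances so that each feature is scaled with one global minimum and maximum.

```python
import numpy as np


def scale_concatenated(series):
    """Min-max scale each feature over all timesteps of all instances.

    `series` is a list of arrays with shape (timesteps, features).
    Hypothetical helper for illustration only.
    """
    lengths = [len(s) for s in series]
    stacked = np.vstack(series)  # all timesteps of all instances
    lo = stacked.min(axis=0)
    span = stacked.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)  # avoid division by zero for constant features
    scaled = (stacked - lo) / span
    # Split back into the original instances.
    return np.split(scaled, np.cumsum(lengths)[:-1])


# Two toy instances of different length with two features each.
toy = [
    np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]),
    np.array([[0.0, 40.0], [5.0, 0.0]]),
]
rebuilt = scale_concatenated(toy)

# With a window size of 1, every timestep becomes one sample.
samples = np.vstack(rebuilt)
print(samples.shape)  # (5, 2)
```

Scaling over the stacked timesteps keeps the value ranges comparable across instances, which matters once the timesteps are treated as independent samples.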

In [2]:
train_dataset, test_dataset = DatasetParser.get_cmapss_data(test=True)
train_dataset = AttributeFilter.remove_features(train_dataset, [1, 2, 3, 4, 8, 13, 19, 22])
test_dataset = AttributeFilter.remove_features(test_dataset, [1, 2, 3, 4, 8, 13, 19, 22])

preprocessing = make_pipeline(SklearnWrapper(MinMaxScaler(), SklearnWrapper.extract_timeseries_concatenated, SklearnWrapper.rebuild_timeseries_concatenated), WindowingApproach(1), "passthrough")

train_instance_lengths = WindowedPredictor.extract_instance_lengths(train_dataset)
test_instance_lengths = WindowedPredictor.extract_instance_lengths(test_dataset)

train_dataset = preprocessing.fit_transform(train_dataset)
test_dataset = preprocessing.transform(test_dataset)

def annotate_single_timesteps(data: Dataset, instance_lengths: List[int]):
    """Replace the per-instance targets with one target per timestep."""
    new_targets = []
    for i, instance_len in enumerate(instance_lengths):
        for j in range(instance_len):
            # Timesteps remaining until the end of the instance, plus the
            # instance-level target if one is given.
            if len(data.target) > i:
                new_targets.append(data.target[i] + instance_len - j)
            else:
                new_targets.append(instance_len - j)
    data.target = new_targets


annotate_single_timesteps(train_dataset, train_instance_lengths)
annotate_single_timesteps(test_dataset, test_instance_lengths)
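To make the labelling rule concrete, the following standalone sketch applies the same arithmetic to plain lists instead of a Dataset object. The function name `per_timestep_targets` and the toy numbers are hypothetical and only serve as an illustration.

```python
from typing import List, Optional, Sequence


def per_timestep_targets(instance_lengths: Sequence[int],
                         final_targets: Optional[Sequence[float]] = None) -> List[float]:
    """Return one target per timestep: remaining timesteps plus the instance target."""
    targets = []
    for i, length in enumerate(instance_lengths):
        has_target = final_targets is not None and i < len(final_targets)
        offset = final_targets[i] if has_target else 0
        for j in range(length):
            targets.append(offset + length - j)
    return targets


# Instances without a given target: labels count down to the end of each instance.
print(per_timestep_targets([3, 2]))           # [3, 2, 1, 2, 1]
# Instances with a given target: the target is added to every timestep.
print(per_timestep_targets([3, 2], [10, 5]))  # [13, 12, 11, 7, 6]
```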

## Hyperparameter optimization:

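The following code cell performs hyperparameter optimization using the ray-tune framework with the BOHB search algorithm. The parameters "n_elements", "num_samples", "max_features", "splitter" and "criterion" are optimized by minimizing the asymmetric loss function within a time budget of 90 minutes. The mean squared error is reported for each trial as well.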

In [None]:
def pipeline_training(config, data=None):
    rfa = EnsembleApproach(config["n_elements"], DecisionTreeRegressor, fit_preprocessing=EnsembleApproach.random_sampling(config["num_samples"]),
                           max_features=config["max_features"], splitter=config["splitter"], criterion=config["criterion"])
    evaluator = Evaluator(None, [rfa], None, [loss_asymmetric, mean_squared_error])
    results = evaluator.evaluate_train_test_split(data["train_dataset"], data["test_dataset"])[0]
    tune.report(loss=results[0], mse=results[1])


data_dict = {
    "train_dataset": train_dataset,
    "test_dataset": test_dataset,
}

algo = TuneBOHB(seed=2)

analysis = tune.run(
    tune.with_parameters(pipeline_training, data=data_dict),
    search_alg=algo,
    metric="loss",
    mode="min",
    num_samples=-1,
    time_budget_s=int(90*60),
    resources_per_trial={"cpu": 2},
    config={
        "n_elements": tune.choice(range(1, 50)),
        "num_samples": tune.choice(range(1000, int(0.75*len(train_dataset.data)), 500)),
        "max_features": tune.choice(range(1, 16)),
        "splitter": tune.choice(["best", "random"]),
        "criterion": tune.choice(["mse", "mae"]),
    }
)

best_config = analysis.get_best_config(metric="loss", mode="min")
print("Best config: ", best_config)
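For readers unfamiliar with the idea behind the tuned parameters, here is a rough sketch of an ensemble of decision trees that are each fitted on a random subsample. It is not the EnsembleApproach implementation: the function names are hypothetical, and sampling without replacement as well as averaging the tree predictions are assumptions made for this illustration. It does show what "n_elements" (number of trees) and "num_samples" (samples per tree) plausibly control.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_subsampled_trees(X, y, n_elements, num_samples, seed=0, **tree_params):
    """Fit `n_elements` trees, each on `num_samples` randomly drawn rows (assumption)."""
    rng = np.random.default_rng(seed)
    size = min(num_samples, len(X))
    trees = []
    for _ in range(n_elements):
        idx = rng.choice(len(X), size=size, replace=False)
        tree = DecisionTreeRegressor(random_state=int(rng.integers(2**31 - 1)), **tree_params)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_mean(trees, X):
    """Aggregate the ensemble by averaging the individual predictions (assumption)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Toy data: one feature, target decreasing linearly like a remaining-life label.
X_toy = np.arange(200, dtype=float).reshape(-1, 1)
y_toy = 200.0 - X_toy[:, 0]

trees = fit_subsampled_trees(X_toy, y_toy, n_elements=5, num_samples=50,
                             max_features=1, splitter="best")
predictions = predict_mean(trees, X_toy)
print(predictions.shape)  # (200,)
```

Fitting each tree on a different subsample decorrelates the trees, which is the usual motivation for this kind of ensemble.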

## Evaluate Approach:

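The configuration below is evaluated using various metrics, which allows for comparing it with other approaches. Note that this cell builds the ensemble from DecisionTreeClassifier with the "gini" criterion, whereas the search above tunes DecisionTreeRegressor with "mse" or "mae", so the reported values belong to the classifier configuration shown here.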

In [3]:
rfa = EnsembleApproach(43, DecisionTreeClassifier, fit_preprocessing=EnsembleApproach.random_sampling(14000),
                       max_features=15, splitter="best", criterion="gini")
evaluator = Evaluator(None, [rfa], None, [loss_asymmetric, mean_squared_error, score_performance, mean_absolute_error,
                                          mean_absolute_percentage_error, loss_false_positive_rate, loss_false_negative_rate])

results = evaluator.evaluate_train_test_split(train_dataset, test_dataset)[0]
for i in [2, 4, 5, 6]:
    results[i] *= 100  # convert fractions to percentages
print("S:\t{:.2f}\nMSE:\t{:.2f}\nA(%):\t{:.2f}\nMAE:\t{:.2f}\nMAPE:\t{:.2f}\nFPR(%):\t{:.2f}\nFNR(%):\t{:.2f}".format(*results))

S:	190435757706.53
MSE:	7854.98
A(%):	6.72
MAE:	75.95
MAPE:	42.76
FPR(%):	10.16
FNR(%):	83.12
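The sketch below shows how metrics of this kind are commonly defined for CMAPSS. Whether the ml4pdm functions use exactly these definitions is an assumption: the function name `prognostic_metrics` is hypothetical, and the thresholds of 13 (early) and 10 (late) timesteps are the values usually quoted for this benchmark. With these definitions the performance score, FPR and FNR partition all predictions, which is consistent with the values above adding up to 100 % (6.72 + 10.16 + 83.12).

```python
import numpy as np


def prognostic_metrics(y_true, y_pred, early=13.0, late=10.0):
    """Asymmetric score and band-based rates; d > 0 means a late prediction."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    # Late predictions are penalized more strongly than early ones.
    score = np.sum(np.where(d < 0, np.exp(-d / early) - 1, np.exp(d / late) - 1))
    in_band = (d >= -early) & (d <= late)
    return {
        "S": float(score),
        "A": float(np.mean(in_band)),      # share of predictions inside the band
        "FPR": float(np.mean(d < -early)), # share of too-early predictions
        "FNR": float(np.mean(d > late)),   # share of too-late predictions
    }


# Toy example: errors are 0, -20, +15 and +5 timesteps.
metrics = prognostic_metrics([100, 100, 100, 100], [100, 80, 115, 105])
print(metrics["A"], metrics["FPR"], metrics["FNR"])  # 0.5 0.25 0.25
```

Because the score grows exponentially with the error, a few very late predictions can dominate it, which would explain a very large S alongside a moderate MAE.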
