# Testing the Tree

## Importing the Basics

In [1]:
import numpy as np
from matplotlib import pyplot as plt
from machineLearning.metric import ConfusionMatrix, RegressionScores
from machineLearning.utility import ModelIO
from machineLearning.data import DataSet
from machineLearning.rf import (
    DecisionTree,
    Gini, Entropy, MSE, MAE, ODD,
    Mode, Mean, Confidence, Probabilities, AnomalyDetection,
    CART, ID3, C45, RSA,
    ReducedError, CostComplexity, PessimisticError
)

## Generating Test Data

Here I generate random test data. It's two Gaussian blocks, the second of which is shifted in up to three randomly chosen dimensions (by 5, 1.5 and 2.5). For classification tasks each block gets a label; for regression tasks each sample gets the sum of its coordinates plus a small random value as a target. It's a very simple dummy data set meant for testing the code.

Here one can change the dimensionality and the amount of data.

In [2]:
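def dataShift(dims):
    """Return a shuffled offset vector of length dims for the second block."""
    offSet = [5, 1.5, 2.5]
    # Pad with zeros so that at least dims entries exist
    diffLen = abs(len(offSet) - dims)
    offSet.extend([0] * diffLen)
    # Shuffle so the shifted dimensions are chosen at random
    np.random.shuffle(offSet)
    return offSet[:dims]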

# Initialize some parameters
totalAmount = 64000
dims = 7
evalAmount = totalAmount // 4
trainAmount = totalAmount - evalAmount
offSet = dataShift(dims)

# Create covariance matrix
cov = np.eye(dims)  # This creates a covariance matrix with variances 1 and covariances 0

# Generate random multivariate data
oneData = np.random.multivariate_normal(np.zeros(dims), cov, totalAmount)
twoData = np.random.multivariate_normal(offSet, cov, totalAmount)

# Split the data into training and evaluation sets
trainData = np.vstack((oneData[:trainAmount], twoData[:trainAmount]))
validData = np.vstack((oneData[trainAmount:], twoData[trainAmount:]))

# Labels for classification tasks
trainLabels = np.hstack((np.zeros(trainAmount), np.ones(trainAmount)))
validLabels = np.hstack((np.zeros(evalAmount), np.ones(evalAmount)))

# Targets for regression tasks
trainTargets = np.sum(trainData, axis=1) + np.random.normal(0, 0.1, 2*trainAmount)
validTargets = np.sum(validData, axis=1) + np.random.normal(0, 0.1, 2*evalAmount)

# Shuffle the training data
trainIndex = np.random.permutation(len(trainData))
trainData = trainData[trainIndex]
trainLabels = trainLabels[trainIndex]
trainTargets = trainTargets[trainIndex]
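
A quick sanity check for data like this is to compare the per-class means: their difference should land close to the offset vector. The sketch below is a smaller stand-alone version, not the notebook's own cell. It uses a seeded generator and fixed offset positions (both are my assumptions, the cell above shuffles the offsets).

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, only for reproducibility of this sketch
dims, perClass = 7, 1000         # much smaller than the notebook's settings

offSet = np.zeros(dims)
offSet[:3] = [5, 1.5, 2.5]       # fixed positions here; the notebook shuffles them

# Two blocks with identity covariance, the second one shifted by offSet
oneData = rng.multivariate_normal(np.zeros(dims), np.eye(dims), perClass)
twoData = rng.multivariate_normal(offSet, np.eye(dims), perClass)

data = np.vstack((oneData, twoData))
labels = np.hstack((np.zeros(perClass), np.ones(perClass)))

# The difference of the class means should be close to the offset vector
meanDiff = data[labels == 1].mean(axis=0) - data[labels == 0].mean(axis=0)
print(np.round(meanDiff, 2))
```

Dimensions with a zero offset carry no class information, so a tree should rarely split on them.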

In [3]:
def scatterPairwise(data, labels, size: float = 10):
    num_dims = data.shape[1]
    fig, axes = plt.subplots(num_dims, num_dims, figsize=(12, 12))

    if len(labels.shape) > 1:
        labels = np.argmax(labels, axis=1)
    
    colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']
    point_colors = [colors[label] for label in labels]

    for i in range(num_dims):
        for j in range(num_dims):
            if i == j:
                axes[i][j].axis('off')
            else:
                axes[i][j].scatter(data[:, i], data[:, j], c=point_colors, s=size, alpha=0.5, label='data')
                axes[i][j].set_xlabel(f"Dim {i}")
                axes[i][j].set_ylabel(f"Dim {j}")
    plt.tight_layout()
    plt.show()

In [4]:
#scatterPairwise(trainData, trainLabels.astype('int'))

## Creating the Tree

Here the tree is created. One can set the maximum depth of the tree. Depending on the task, we add a different impurity function and a different leaf function. Finally we add the split algorithm and set the feature percentile. Higher numbers consider more possible splits but slow the algorithm down. Lower numbers consider fewer possible splits, which speeds it up. Depending on the data set this can have a strong impact on performance.
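
How `featurePercentile` is used inside the split algorithms isn't shown in this notebook. As a rough illustration of the trade-off, the sketch below assumes that split candidates are taken at evenly spaced percentiles of a feature column. `candidateThresholds` and `numPercentiles` are hypothetical names, not part of `machineLearning.rf`.

```python
import numpy as np

def candidateThresholds(feature, numPercentiles):
    """Hypothetical helper: evenly spaced percentiles of one feature as split candidates."""
    # Skip the 0th and 100th percentile, they would leave one side empty
    qs = np.linspace(0, 100, numPercentiles + 2)[1:-1]
    return np.unique(np.percentile(feature, qs))

rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)

few = candidateThresholds(feature, 10)
many = candidateThresholds(feature, 90)

# Both are far fewer candidates than testing a split between every pair of sorted values
print(len(few), len(many), len(feature) - 1)
```

Under this assumption the work per node grows with the number of candidates, which is why fewer percentiles train faster but may miss the best threshold.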

In [5]:
task = 'classifier' # 'classifier'/'regressor'/'outlier'
tree = DecisionTree(maxDepth=5, minSamplesSplit=2)
if task == 'regressor':
    tree.setComponent(MSE())
    tree.setComponent(Mean())
    tree.setComponent(C45(featurePercentile=90))
elif task == 'classifier':
    tree.setComponent(Entropy())
    tree.setComponent(Mode())
    tree.setComponent(CART(featurePercentile=90))
elif task == 'outlier':
    tree.setComponent(RSA())
    tree.setComponent(ODD())
    tree.setComponent(AnomalyDetection())


if task == 'classifier':
    trainSet = DataSet(trainData, labels=trainLabels, targets=trainLabels)
    validSet = DataSet(validData, labels=validLabels, targets=validLabels)
elif task == 'regressor':
    trainSet = DataSet(trainData, labels=trainTargets, targets=trainTargets)
    validSet = DataSet(validData, labels=validTargets, targets=validTargets)
else:
    trainSet = DataSet(trainData)
    validSet = DataSet(validData)
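
For reference, the two classification impurities named in the import list can be written down in a few lines. This is a minimal sketch using the textbook definitions of entropy and Gini impurity. It is not the code behind the library's `Entropy` and `Gini` classes, which may differ in details.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of the class distribution in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini impurity: probability that two random draws have different classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

mixed = np.array([0, 0, 1, 1])   # evenly mixed node, maximal impurity for two classes
pure = np.array([1, 1, 1, 1])    # pure node, zero impurity

print(entropy(mixed), gini(mixed))
print(entropy(pure), gini(pure))
```

A split is good when the sample-weighted impurity of the two child nodes is clearly lower than the impurity of the parent.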

## Training the Tree

The tree is trained on the data set built above, which carries labels or targets depending on the task. Then we print the tree and make a prediction.

In [6]:
tree.train(trainSet)
print(tree)

tree 1 |⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿| done ✔                  | 47%
—————————————————————— tree: 1/1 ———————————————————————
split: CART, impurity: Entropy, leaf: Mode, nodes: 31
maxDepth: 5, reached depth: 5, minSamplesSplit: 2
························································
╴feat: 6 <= 2.79, samples: 96000
     ├─feat: 6 <= 1.97, samples: 48527
     │   ├─feat: 6 <= 1.50, samples: 46927
     │   │   ├─feat: 0 <= 2.30, samples: 44864
     │   │   │   └─╴value: 0.0
     │   │   │   └─╴value: 0.0
     │   │   └─╴feat: 0 <= 1.82, samples: 2063
     │   │       └─╴value: 0.0
     │   │       └─╴value: 0.0
     │   └─╴feat: 0 <= 1.36, samples: 1600
     │       ├─feat: 0 <= 0.21, samples: 1002
     │       │   └─╴value: 0.0
     │       │   └─╴value: 0.0
     │       └─╴feat: 3 <= 0.54, samples: 598
     │           └─╴value: 1.0
     │           └─╴value: 1.0
     └─╴feat: 6 <= 3.58, samples: 47473
         ├─feat: 0 <= 0.78, samples: 3131
         │   ├─fea

In [7]:
tree.bake()

In [8]:
prediction = tree.eval(validSet)
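
The printed tree reads as nested `feat <= threshold` tests that end in a leaf value. A minimal sketch of how such a structure yields a prediction is shown below. The nested-dict layout, the key names and the toy thresholds are assumptions for illustration, not the library's internal representation.

```python
import numpy as np

# Hypothetical toy tree: inner nodes hold a feature index and a threshold,
# leaves hold only a value
toyTree = {
    "feat": 0, "threshold": 2.5,
    "left": {"value": 0.0},
    "right": {
        "feat": 1, "threshold": 1.0,
        "left": {"value": 0.0},
        "right": {"value": 1.0},
    },
}

def predictOne(node, sample):
    """Walk from the root to a leaf and return the leaf value."""
    while "value" not in node:
        goLeft = sample[node["feat"]] <= node["threshold"]
        node = node["left"] if goLeft else node["right"]
    return node["value"]

samples = np.array([[0.1, 3.0], [4.0, 0.2], [4.0, 2.0]])
predictions = np.array([predictOne(toyTree, s) for s in samples])
print(predictions)
```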

## Evaluating Predictions

Depending on the task at hand we create a confusion matrix (classification) or simple metrics (regression). Since the number of classes is fixed to two, we don't need to change anything here.
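
What the "simple metrics" for regression contain isn't listed here. As an assumption, the sketch below computes three common ones (mean squared error, mean absolute error and R²) with plain NumPy. `regressionScores` is a hypothetical helper and not the library's `RegressionScores` class.

```python
import numpy as np

def regressionScores(prediction, target):
    """Hypothetical helper returning MSE, MAE and R^2 for 1D arrays."""
    residual = target - prediction
    mse = float(np.mean(residual ** 2))
    mae = float(np.mean(np.abs(residual)))
    # R^2 compares the residual sum of squares to the variance around the mean
    r2 = float(1.0 - np.sum(residual ** 2) / np.sum((target - target.mean()) ** 2))
    return {"mse": mse, "mae": mae, "r2": r2}

target = np.array([1.0, 2.0, 3.0, 4.0])
prediction = np.array([1.5, 2.0, 2.5, 4.0])
scores = regressionScores(prediction, target)
print(scores)
```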

In [9]:
if task == 'regressor':
    metrics = RegressionScores(numClasses=2)
    metrics.calcScores(prediction, validTargets, validLabels)
    print(metrics)
elif task == 'classifier':
    confusion = ConfusionMatrix(numClasses=2)
    confusion.update(prediction, validLabels)
    confusion.percentages()
    confusion.calcScores()
    print(confusion)

━━━━━━━━━━━━ evaluation ━━━━━━━━━━━━
————————— confusion matrix —————————
              Class 0     Class 1   
····································
     Class 0   15953         47     
                49%          0%     
····································
     Class 1     50        15950    
                 0%         49%     

———————————————————————————————— scores ———————————————————————————————
                accuracy       precision      sensitivity      miss rate    
·······································································
     Class 0     0.997           0.997           0.997           0.003      
     Class 1     0.997           0.997           0.997           0.003      
·······································································
       total     0.997           0.997           0.997           0.003      
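
The scores in the table can be recomputed from the four counts by hand. The sketch below copies the counts from the output above and assumes that rows are the true classes and columns the predicted ones. That assumption fits the data: each row sums to the 16000 validation samples per class.

```python
import numpy as np

# Counts copied from the confusion matrix above (rows assumed to be true classes)
matrix = np.array([[15953, 47],
                   [50, 15950]])

truePositive = np.diag(matrix)
sensitivity = truePositive / matrix.sum(axis=1)   # correct share of each true class
precision = truePositive / matrix.sum(axis=0)     # correct share of each predicted class
missRate = 1.0 - sensitivity
accuracy = truePositive.sum() / matrix.sum()

print(np.round(sensitivity, 3))
print(np.round(precision, 3))
print(np.round(missRate, 3))
print(round(float(accuracy), 3))
```

Rounded to three digits, these values agree with the printed table.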


## Saving and Loading a Tree

Trees can be converted to dictionaries and then saved as a JSON file. This allows us to load and re-use them. JSON is also a plain-text format, which is neat.
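
The idea behind the dictionary round trip can be sketched with the standard `json` module alone. The nested-dict layout and the file name are made up for illustration. What `ModelIO` actually writes may look different.

```python
import json
import os
import tempfile

# Hypothetical dictionary form of a tiny tree (not the library's actual schema)
toyTree = {
    "feat": 6, "threshold": 2.79,
    "left": {"value": 0.0},
    "right": {"value": 1.0},
}

# Write to a temporary directory so the sketch leaves no files behind in the project
path = os.path.join(tempfile.mkdtemp(), "toy-tree.json")
with open(path, "w") as f:
    json.dump(toyTree, f, indent=2)

with open(path) as f:
    restored = json.load(f)

# The restored dictionary should be identical to the original
print(restored == toyTree)
```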

In [10]:
ModelIO.save(tree, 'tree-test')
newTree = ModelIO.load('tree-test')
print(newTree)

—————————————————————— tree: 1/2 ———————————————————————
split: CART, impurity: Entropy, leaf: Mode, nodes: 31
maxDepth: 5, reached depth: 5, minSamplesSplit: 2
························································
╴feat: 6 <= 2.79, samples: 96000
     ├─feat: 6 <= 1.97, samples: 48527
     │   ├─feat: 6 <= 1.50, samples: 46927
     │   │   ├─feat: 0 <= 2.30, samples: 44864
     │   │   │   └─╴value: 0.0
     │   │   │   └─╴value: 0.0
     │   │   └─╴feat: 0 <= 1.82, samples: 2063
     │   │       └─╴value: 0.0
     │   │       └─╴value: 0.0
     │   └─╴feat: 0 <= 1.36, samples: 1600
     │       ├─feat: 0 <= 0.21, samples: 1002
     │       │   └─╴value: 0.0
     │       │   └─╴value: 0.0
     │       └─╴feat: 3 <= 0.54, samples: 598
     │           └─╴value: 1.0
     │           └─╴value: 1.0
     └─╴feat: 6 <= 3.58, samples: 47473
         ├─feat: 0 <= 0.78, samples: 3131
         │   ├─feat: 3 <= 0.59, samples: 207
         │   │   └─╴value: 0.0
         │   │   └─╴value: 1.0
 

In [11]:
prediction = newTree.eval(validData)

if task == 'regressor':
    newMetrics = RegressionScores(numClasses=2)
    newMetrics.calcScores(prediction, validTargets, validLabels)
    print(newMetrics)
elif task == 'classifier':
    newConfusion = ConfusionMatrix(numClasses=2)
    newConfusion.update(prediction, validLabels)
    newConfusion.percentages()
    newConfusion.calcScores()
    print(newConfusion)

━━━━━━━━━━━━ evaluation ━━━━━━━━━━━━
————————— confusion matrix —————————
              Class 0     Class 1   
····································
     Class 0   15953         47     
                49%          0%     
····································
     Class 1     50        15950    
                 0%         49%     

———————————————————————————————— scores ———————————————————————————————
                accuracy       precision      sensitivity      miss rate    
·······································································
     Class 0     0.997           0.997           0.997           0.003      
     Class 1     0.997           0.997           0.997           0.003      
·······································································
       total     0.997           0.997           0.997           0.003      


## Comment

The tree works pretty well with both regression and classification tasks. Labels shouldn't be one-hot encoded. That works, but it's still rather iffy. Targets should be 1D. I haven't tested with 2D targets, but they might work. Training can be really fast with a percentile set in the split algorithm; otherwise it can be rather slow. Making predictions is fast and works well enough.
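
If labels arrive one-hot encoded or targets come as a column vector, both can be brought into the recommended 1D form with plain NumPy before building the data set. A minimal sketch with made-up arrays:

```python
import numpy as np

# One-hot labels -> 1D integer class labels
oneHot = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
labels = np.argmax(oneHot, axis=1)

# Column-vector targets of shape (n, 1) -> 1D targets of shape (n,)
targets2D = np.array([[0.5], [1.5], [2.5]])
targets = targets2D.ravel()

print(labels, targets.shape)
```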