In [1]:
import numpy as np
from simforest.cluster import SimilarityForestCluster

Let's say we want to find the distance matrix for 4 vectors. We'll fit a single unsupervised tree and then use it to compare the instances.

First, we sample a pair of vectors that define the split. The first one is sampled at random; the second one is then sampled from the vectors that differ from the first.

In [7]:
# Input
X = np.array([[1.0, 0.0, 2.0], [2.0, 0.0, 2.0], [0.0, 3.0, -1.0], [-1.0, -2.5, 3.6]])

# Number of instances in the current tree node
n = X.shape[0]

# Sample index of first vector randomly
first = np.random.randint(0, n)

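# Prepare a pool of row indexes of vectors different from the first one
# (np.any over axis=1 lists each differing row once, so the choice below is unbiased)
others = np.where(np.any(X != X[first], axis=1))[0]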

# Choose the second one from the others pool
second = np.random.choice(others)

Then, we calculate the projection of the vectors in the current node. The projection is defined as dot(X, second) - dot(X, first), where first and second stand for the two sampled vectors. Note that, by linearity of the dot product, this is equivalent to dot(X, second - first).
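The equivalence can be checked numerically. This is a minimal sketch: the indexes `first = 0` and `second = 2` are fixed here only to make the check reproducible (an assumption; the notebook samples them at random).

```python
import numpy as np

X = np.array([[1.0, 0.0, 2.0], [2.0, 0.0, 2.0], [0.0, 3.0, -1.0], [-1.0, -2.5, 3.6]])

# Fixed indexes for illustration only; the notebook samples them at random
first, second = 0, 2

# Projection computed as a difference of two dot products
two_dots = np.dot(X, X[second]) - np.dot(X, X[first])

# Projection computed as a single dot product with the difference vector
one_dot = np.dot(X, X[second] - X[first])

# Both forms agree up to floating point error
print(np.allclose(two_dots, one_dot))
```

The single dot product form is the one used in the cells that follow, since it needs only one pass over X.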

In the next step, the split point is found in an unsupervised way: a random threshold is drawn on the projection line, and the vectors are partitioned according to it.

In [27]:
# Calculate the projection
similarities = np.dot(X, X[second]-X[first])

# Figure out the range of similarity values
similarities_min = np.min(similarities)
similarities_max = np.max(similarities)

# Find a random threshold in this range
split_point = np.random.uniform(similarities_min, similarities_max, 1)

# Find indexes of vectors that should go to the left and right children of current node
left_indexes = similarities <= split_point
right_indexes = np.invert(left_indexes)

In [28]:
# Check which vectors will be used to build the left child node
X[left_indexes]

array([[ 1. ,  0. ,  2. ],
       [ 2. ,  0. ,  2. ],
       [-1. , -2.5,  3.6]])

In [29]:
# Check which vectors will be used to build the right child node
X[right_indexes]

array([[ 0.,  3., -1.]])

The process of partitioning the dataset continues recursively until the depth limit is reached or all examples in a tree node are the same.
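The recursion can be sketched as follows. This is an illustration of the steps described above, not the code used inside simforest: the helper names `grow_tree` and `leaves`, the nested dict representation, and the choice of `max_depth=8` are all assumptions made for this example.

```python
import numpy as np

def grow_tree(X, indexes, depth, max_depth, rng):
    """Recursively partition the rows of X listed in `indexes` (hypothetical helper)."""
    node = {"indexes": indexes, "depth": depth}
    points = X[indexes]

    # Stop at the depth limit or when all vectors in the node are the same
    if depth >= max_depth or np.all(points == points[0]):
        return node

    # Sample the pair of distinct vectors that defines the split
    first = rng.integers(0, len(points))
    others = np.where(np.any(points != points[first], axis=1))[0]
    second = rng.choice(others)
    direction = points[second] - points[first]

    # Project the vectors and draw a random threshold on the projection line
    similarities = points @ direction
    split_point = rng.uniform(similarities.min(), similarities.max())
    left = similarities <= split_point

    # Guard against a degenerate split that leaves the right child empty
    if left.all():
        return node

    node["direction"] = direction
    node["split_point"] = split_point
    node["left"] = grow_tree(X, indexes[left], depth + 1, max_depth, rng)
    node["right"] = grow_tree(X, indexes[~left], depth + 1, max_depth, rng)
    return node

def leaves(node):
    """Collect the row indexes stored in each leaf (hypothetical helper)."""
    if "left" not in node:
        return [node["indexes"].tolist()]
    return leaves(node["left"]) + leaves(node["right"])

X = np.array([[1.0, 0.0, 2.0], [2.0, 0.0, 2.0], [0.0, 3.0, -1.0], [-1.0, -2.5, 3.6]])
rng = np.random.default_rng(0)
tree = grow_tree(X, np.arange(len(X)), depth=1, max_depth=8, rng=rng)

# With four distinct vectors and a generous depth limit, each leaf holds one vector
print(leaves(tree))
```

Because the second vector always differs from the first, the projections of the two sampled vectors differ, so a threshold drawn between the minimum and maximum projection sends at least one vector to each side.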

After the tree is grown, each pair of instances is passed through the tree, and the depth at which the pair splits determines the dissimilarity of the instances: the dissimilarity is 1 / depth at which the pair splits. This way, the distance is maximal if the pair splits in the first node of the tree, and small if the pair splits deep down the tree.

In [None]:
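# At each node of the tree, check if the pair goes in the same direction
xi, xj = X[0], X[2]         # example pair of instances to compare
p, q = X[first], X[second]  # vectors that define the split at this node
path_i = np.dot(xi, q - p) <= split_point
path_j = np.dot(xj, q - p) <= split_point
same_direction = path_i == path_j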
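Putting the per-node check together with the 1 / depth rule gives the full distance matrix. The sketch below uses a small hand-built tree so that the result is deterministic. The nested dict layout, the helper name `split_depth`, the chosen thresholds, and the convention that the root has depth 1 and the diagonal is 0 are assumptions made for illustration, not the library's internals.

```python
import numpy as np

X = np.array([[1.0, 0.0, 2.0], [2.0, 0.0, 2.0], [0.0, 3.0, -1.0], [-1.0, -2.5, 3.6]])

# Hand-built tree (illustrative): each node stores its split vectors p, q and a threshold
tree = {
    "p": X[0], "q": X[2], "split_point": 0.0,
    "right": None,  # only X[2] ends up here
    "left": {
        "p": X[0], "q": X[1], "split_point": 1.5,
        "right": None,  # only X[1] ends up here
        "left": {
            "p": X[3], "q": X[0], "split_point": -5.0,
            "left": None, "right": None,
        },
    },
}

def split_depth(node, xi, xj, depth=1):
    """Depth of the node at which xi and xj go to different children, or None."""
    if node is None:
        return None  # the pair reached a leaf together and never split
    direction = node["q"] - node["p"]
    go_left_i = np.dot(xi, direction) <= node["split_point"]
    go_left_j = np.dot(xj, direction) <= node["split_point"]
    if go_left_i != go_left_j:
        return depth
    child = node["left"] if go_left_i else node["right"]
    return split_depth(child, xi, xj, depth + 1)

n = len(X)
D = np.zeros((n, n))  # diagonal stays 0 (assumed convention)
for i in range(n):
    for j in range(i + 1, n):
        depth = split_depth(tree, X[i], X[j])
        if depth is not None:
            D[i, j] = D[j, i] = 1.0 / depth

print(D)
```

In this tree the third vector is separated from all others at the root, so every pair involving it gets the maximal distance of 1, while the first and fourth vectors only separate at the third level and get distance 1/3.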

