# Week 4: Subspace Clustering and Projected Clustering

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Exercise 1: Theoretical questions

1. Why is traditional clustering ill-suited for high dimensional data?
1. What is the goal of subspace clustering and projected clustering?
1. Why is exhaustive subspace cluster search infeasible in practice?
1. Define what a subspace cluster is according to the models proposed by SUBCLU, CLIQUE, and PROCLUS. How are the models similar, and where do they differ?


## Exercise 2: CLIQUE
1. What is a subspace cluster in CLIQUE?
1. How is monotonicity used in CLIQUE?
1. Use the CLIQUE algorithm to compute the hidden subspace clusters in the following 6-dimensional data set.
    Compute the first step of the CLIQUE algorithm to detect dense cells. Use 3 equal-width intervals
    per dimension on the domain $[0, 100]$ (number of intervals $\xi = 3$) and consider a cell dense
    if it contains at least 5 of the 23 objects (density threshold $\tau = 21\%$).
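
The first CLIQUE step described above can be sketched on a small toy data set. This is a minimal illustration, not the reference implementation: the function name `dense_units_1d`, the toy values, and the choice of half-open intervals (with the upper domain bound assigned to the last interval) are assumptions made here.

```python
import numpy as np

def dense_units_1d(X, xi, tau, lo=0.0, hi=100.0):
    """Return, per dimension, the indices of the dense 1-d intervals.

    Assumption: intervals are half-open, [lo + k*w, lo + (k+1)*w),
    and a value equal to `hi` is placed in the last interval.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    width = (hi - lo) / xi
    # Interval index of every value in every dimension.
    cells = np.clip(((X - lo) // width).astype(int), 0, xi - 1)
    # Smallest object count that reaches the density threshold tau.
    min_count = int(np.ceil(tau * n))
    dense = {}
    for dim in range(d):
        counts = np.bincount(cells[:, dim], minlength=xi)
        dense[dim] = [int(c) for c in np.flatnonzero(counts >= min_count)]
    return dense

# Hypothetical toy data: 6 objects, 2 dimensions, domain 0...100.
toy = np.array([[5, 50], [10, 90], [20, 10], [30, 60], [80, 40], [90, 95]])
dense = dense_units_1d(toy, xi=2, tau=0.5)
print(dense)
```

Dense cells in higher-dimensional subspaces would then be generated only from combinations of these dense 1-d intervals, which is where monotonicity enters.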
   

In [None]:
# Dim1 Dim2 Dim3 Dim4 Dim5 Dim6
X = np.array([
    [6, 23, 22, 21, 31, 49],
    [7, 22, 21, 20, 51, 76],
    [26, 85, 75, 52, 53, 50],
    [28, 94, 76, 63, 76, 87],
    [29, 45, 93, 51, 54, 51],
    [35, 73, 76, 51, 52, 50],
    [38, 23, 22, 21, 33, 61],
    [41, 22, 21, 21, 32, 99],
    [56, 15, 66, 39, 36, 66],
    [58, 1, 14, 53, 52, 51],
    [66, 1, 40, 19, 86, 13],
    [70, 90, 25, 32, 70, 65],
    [71, 23, 21, 20, 3, 81],
    [80, 19, 42, 23, 57, 1],
    [82, 80, 6, 54, 81, 81],
    [82, 81, 38, 35, 81, 82],
    [82, 81, 77, 57, 81, 82],
    [82, 83, 44, 59, 81, 83],
    [82, 81, 35, 86, 81, 81],
    [84, 80, 66, 10, 81, 81],
    [86, 33, 59, 51, 54, 50],
    [89, 34, 36, 53, 54, 51],
    [92, 25, 27, 40, 14, 22]
])

## Exercise 3: PROCLUS

Consider the following four-dimensional data set:

In [None]:
X = np.array([
    ( 15 , 12 , 16 ,  9 ),  # A
    ( 14 , 13 , 18 ,  3 ),  # B
    ( 12 , 14 , 14 , 15 ),  # C 
    ( 16 , 13 , 19 , 19 ),  # D 
    (  5 ,  6 ,  9 ,  4 ),  # E 
    (  4 , 11 , 10 , 18 ),  # F 
    (  6 , 17 ,  8 , 13 ),  # G 
    (  6 ,  9 , 14 , 16 ),  # H 
    ( 14 , 19 , 13 , 15 ),  # I 
    ( 19 ,  3 , 15 , 14 ),  # J 
])

Calculate the following steps of a PROCLUS clustering with $k=3$ clusters. 
Use the complete data set in the algorithm (no sampling; $A=\frac{10}{3}$).

1. Compute a set of four medoids M.
1. Use the first three medoids and compute the locality and $Z_{ij}$ values for each medoid.
1. Determine the optimal dimension set $D_i$ for each medoid $m_i$ (use $l=3$).
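
The locality and $Z_{ij}$ computation can be sketched as follows on toy data. Several choices here are assumptions rather than something the exercise fixes: Manhattan distance is used, a locality contains every point whose distance to the medoid is at most $\delta_i$ (the distance to the nearest other medoid), and the names `proclus_z_scores` and `toy` are hypothetical. Check the conventions against your lecture notes before relying on it.

```python
import numpy as np

def proclus_z_scores(X, medoid_idx):
    """Return (Z, localities) for the given medoid indices.

    Assumptions: Manhattan distance, and a locality that contains
    every point p with dist(p, m_i) <= delta_i.
    """
    X = np.asarray(X, dtype=float)
    M = X[medoid_idx]                       # (k, d) medoid coordinates
    k, d = M.shape

    # delta_i: distance from medoid i to its nearest other medoid.
    med_dist = np.abs(M[:, None, :] - M[None, :, :]).sum(axis=2)
    np.fill_diagonal(med_dist, np.inf)
    delta = med_dist.min(axis=1)

    Z = np.zeros((k, d))
    localities = []
    for i in range(k):
        per_dim = np.abs(X - M[i])          # per-dimension distance to medoid i
        in_locality = per_dim.sum(axis=1) <= delta[i]
        localities.append(np.flatnonzero(in_locality))

        X_i = per_dim[in_locality].mean(axis=0)   # X_ij: average distance along dim j
        Y_i = X_i.mean()                          # mean of X_ij over all dimensions
        sigma_i = np.sqrt(((X_i - Y_i) ** 2).sum() / (d - 1))
        if sigma_i > 0:                           # leave Z[i] = 0 if all X_ij are equal
            Z[i] = (X_i - Y_i) / sigma_i
    return Z, localities

# Hypothetical toy data with three medoids (rows 0, 3 and 5).
toy = np.array([[0, 0, 0], [1, 0, 5], [0, 1, 4],
                [10, 10, 10], [10, 11, 3], [20, 0, 0]])
Z, localities = proclus_z_scores(toy, medoid_idx=[0, 3, 5])
print(np.round(Z, 3))
```

Strongly negative $Z_{ij}$ values mark dimensions along which the locality of $m_i$ is unusually tight, which is what the dimension selection in the next step looks for.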

## Exercise 4: OPTICS 
Draw the OPTICS plot for the following 2-d data set using Manhattan distance, $minpts=6$, $\epsilon = 2$. 
Start with $o = (0,4)$, then, once the ControlList is empty, restart with $p = (2,0)$.
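
As a reminder of the two quantities that the OPTICS plot is built from, here is a minimal sketch of the core distance and the reachability distance under Manhattan distance. It assumes the common convention that a point counts as its own neighbour; the function names and the toy points are hypothetical and unrelated to the exercise data.

```python
import numpy as np

def core_distance(X, o_idx, eps, minpts):
    """Distance from X[o_idx] to its minpts-th nearest neighbour,
    counting the point itself (assumption); inf if that exceeds eps."""
    dists = np.sort(np.abs(X - X[o_idx]).sum(axis=1).astype(float))
    if len(dists) < minpts or dists[minpts - 1] > eps:
        return np.inf
    return float(dists[minpts - 1])

def reachability_distance(X, p_idx, o_idx, eps, minpts):
    """max(core-dist(o), dist(o, p)); inf if o is not a core object."""
    dist_op = float(np.abs(X[p_idx] - X[o_idx]).sum())
    return max(core_distance(X, o_idx, eps, minpts), dist_op)

# Hypothetical toy points: a small square plus one far-away point.
pts = np.array([(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)])
cd0 = core_distance(pts, 0, eps=2, minpts=3)
rd30 = reachability_distance(pts, 3, 0, eps=2, minpts=3)
cd4 = core_distance(pts, 4, eps=2, minpts=3)
print(cd0, rd30, cd4)
```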

In [None]:
X = np.array([ 
    (2,0),(2,0),(3,0),(3,0),(3,0),(3,0),(4,0),(4,0),(3,1),(3,1),(3,1),
    (4,1),(4,1),(4,1),(0,4),(0,4),(0,5),(0,5),(1,4),(1,4),(1,5),(1,5),
    (2,4),(3,4),(3,4),(3,5),(3,5),(3,5),(4,4),(4,4),(4,5),(4,5),(4,5)
])


![](graphics/W4.Q5.png)

Note: you do not need to do the actual computation; you may refer to the figure to read off the reachability and core distances.

Given the resulting OPTICS plot, which two settings $\epsilon_1, \epsilon_2$ correspond to a DBSCAN that outputs
two and three clusters, respectively?

## Exercise 5: BIRCH/CF-Tree
Insert the following points into an empty CF-Tree and compute the micro clusters and associated cluster features (use the diameter D = 2R).

1. $P_1=(5,5)$ 
1. $P_2=(2,2)$
1. $P_3=(4,5)$
1. $P_4=(1,4)$
1. $P_5=(2,1)$

The tree parameters are: $B=2$, $L=2$, $T=2$
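
The bookkeeping behind a CF-tree insertion can be sketched as follows. The sketch assumes the standard BIRCH radius $R = \sqrt{SS/N - \lVert LS/N \rVert^2}$ with a scalar $SS$ and takes $D = 2R$ as stated in the exercise; the formula sheet for this course may define things slightly differently. The helper names and toy points are hypothetical.

```python
import numpy as np

def cf_of(points):
    """Clustering feature (N, LS, SS) of a set of points, with scalar SS."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), float((P ** 2).sum())

def cf_merge(cf_a, cf_b):
    """CFs are additive: merging two clusters adds their components."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

def cf_radius(cf):
    """Radius R computed from the CF alone (assumed standard BIRCH definition)."""
    n, ls, ss = cf
    centroid = ls / n
    # max(..., 0) guards against tiny negative values caused by rounding.
    return float(np.sqrt(max(ss / n - centroid @ centroid, 0.0)))

# Hypothetical toy example: would (2, 0) fit into the micro cluster {(0, 0)}?
cf_a = cf_of([(0, 0)])
cf_b = cf_of([(2, 0)])
merged = cf_merge(cf_a, cf_b)
radius = cf_radius(merged)
diameter = 2 * radius   # D = 2R, as in the exercise
print(merged, radius, diameter)
```

Comparing `diameter` of the tentatively merged CF with the threshold $T$ decides whether the point is absorbed into the existing entry or starts a new one.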


<img src="graphics/formulas.png" width="1000"/>

# Optional Exercises

## Exercise 6: SUBCLU
1. What is a subspace cluster in SUBCLU?
1. How is the monotonicity used in SUBCLU?

## Exercise 7: Proving Property of BIRCH 
The BIRCH paper [1] claims that the _average intra-cluster distance_ $D3$ can be computed efficiently and exactly from the clustering features (CFs) of two clusters. 
We aim to prove that claim here.

The average intra-cluster distance is defined as follows. 
Given $N_1$ $d$-dimensional data points in a cluster $C_1 = \{ X_i \}$, where $i = 1, \dots, N_1$, and $N_2$ data points in another cluster $C_2 = \{ X_j \}$, where $j = N_1 + 1, \dots, N_1 + N_2$, 

$$
D3(C_1, C_2) =\left(\frac{\sum_{i=1}^{N_{1}+N_{2}} \sum_{j=1}^{N_{1}+N_{2}}\left(X_{i}-X_{j}\right)^{2}}{\left(N_{1}+N_{2}\right)\left(N_{1}+N_{2}-1\right)}\right)^{\frac{1}{2}} \qquad\quad\quad (1)
$$

and the CF for cluster $i$ is defined as a triple $CF_i = (N_i, LS_i, SS_i)$, where $LS_i = \sum_{j=1}^{N_i} X_j$ and $SS_i = \sum_{j=1}^{N_i} X_j^2$.

1. Given two CFs, $CF_1$ and $CF_2$, for clusters $C_1$ and $C_2$, respectively, show that $D3(C_1, C_2)$ can be computed using only the information in $CF_1$ and $CF_2$.
2. Compare the running times of Equation (1) and your derived algorithm. Which one is faster?

If you want, you could test your derived formula here. Look for the _TODO_ below.

In [None]:
# D3 the slow way
import time

fast = True  # True: vectorised variant of D3_slow; False: explicit double loop
# Cluster statistics
def D3_slow(C1, C2): # Slow algorithm
    C = np.concatenate([C1, C2], axis=0)
    s = 0.
    
    N1, d = C1.shape
    N2, _ = C2.shape
    N,  _ = C.shape
    
    if fast: # Fast version of the slow algorithm
        C_ = C.reshape(N, 1, d)
        C  = C.reshape(1, N, d)
        D  = (C_ - C).reshape(N*N, 1, d)
        s  = (D @ D.reshape(N*N, d, 1)).sum()
    else: # Slow version of the slow algorithm
        for i in range(N):
            for j in range(N):
                s += np.dot((C[i] - C[j]), (C[i] - C[j]))

    s = s / ((N1 + N2)*(N1 + N2 -1))
    return np.sqrt(s)

# Statistics for fast implementation
LS = lambda C: np.sum(C, axis=0)
SS = lambda C: np.sum(C ** 2)

# TODO implement your fast algorithm here.
def D3_fast(C1, C2):
    N1, _ = C1.shape
    N2, _ = C2.shape
    LS1, SS1 = LS(C1), SS(C1)
    LS2, SS2 = LS(C2), SS(C2)

    return 0  # TODO: return D3 computed from N1, LS1, SS1, N2, LS2, SS2

## FROM HERE ON IS JUST TESTING AND PLOTTING. YOU DO NOT NEED TO CODE ANYTHING ## 

# Generate random samples in two different clusters.
# Check that the two algorithms give the same result.
size = 4
C1 = np.random.randn(size, 2) * 0.5 
C2 = np.random.randn(size, 2) * 0.5 + 2
assert np.allclose(D3_slow(C1, C2), D3_fast(C1, C2))

## TEST running time for the two algorithms
repeats     = 20    # Average the running time over `repeats` runs.
size_from   = 10    # Smallest data set size
size_to     = 1010  # Largest data set size
size_step   = 100   # Step size
data_sizes  = range(size_from, size_to + 1, size_step) # Test sizes

def test(fn):
    """Time `fn` on two random clusters per size; return (mean times, last results)."""
    times = []
    results = []

    for size in data_sizes:
        C1 = np.random.randn(size, 2) * 0.5
        C2 = np.random.randn(size, 2) * 0.5 + 2

        t0 = time.perf_counter()
        for r in range(repeats):
            print(f'\rSize {size}: {r + 1}/{repeats}', end="")
            t = fn(C1, C2)
        td = time.perf_counter() - t0
        print(f"\rSize {size}: \t{td / repeats:.6f} secs.")
        results.append(t)
        times.append(td / repeats)
    return times, results

# test() returns (times, results); only the times are needed for the table and plot.
print("Testing slow algorithm")
slows, _ = test(D3_slow)
print("Testing fast algorithm")
fasts, _ = test(D3_fast)

print("\n|  i  | %-9s | %-9s |" % ('Fast', 'Slow'))
print("-"*31)
for i, st, ft in zip(data_sizes, slows, fasts):
    print("| %3i | %9.5f | %9.5f |" % (i, ft, st))

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
ax.plot(data_sizes, fasts, 'r-o', label="Fast")
ax.plot(data_sizes, slows, 'b--o', label="Slow")
ax.set_ylabel('Seconds')
ax.set_xlabel('# Data rows')
ax.legend()

#### References:
[1] Zhang, T., Ramakrishnan, R. and Livny, M., 1996. BIRCH: an efficient data clustering method for very large databases. ACM Sigmod Record, 25(2), pp.103-114.