In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Dropout Spherical K-Means

A major goal in machine learning, and in neural networks in particular, is to learn deep hierarchies of features that can be reused for other tasks. Many algorithms are available to learn such hierarchies from unlabeled data, especially text documents and images. It has been found [17, 18, 21] that using K-means clustering as the unsupervised learning module in these "feature learning" pipelines can lead to excellent results, often rivaling state-of-the-art systems.

The classic K-means clustering algorithm finds cluster centroids that minimize the distance between data points and the nearest centroid. K-means is also called "vector quantization" (VQ) in the sense that K-means can be viewed as a way of constructing a "dictionary" \(D \in \mathbb{R}^{n \times K}\) of \(K\) column vectors \(d^{(j)}\), \(j = 1, \ldots, K\), so that a data vector \(x^{(i)} \in \mathbb{R}^n\), \(i = 1, \ldots, m\), can be mapped to a code vector \(s^{(i)}\) that minimizes the reconstruction error. A popular modified version of K-means is known as "gain shape vector quantization" [169] or "spherical K-means" [21].

Given \(m\) data vectors \(x^{(i)} \in \mathbb{R}^n\), \(i = 1, \ldots, m\), the VQ of the data vector \(x^{(i)}\) can be described by the model:

$$
x^{(i)} = D s^{(i)}, \quad i = 1, \ldots, m,
$$

where \(s^{(i)}\) is a quantization coefficient vector or "code vector" associated with the input \(x^{(i)}\), and \(D \in \mathbb{R}^{n \times K}\) denotes the codebook matrix or "dictionary" whose \(j\)-th column is \(d^{(j)}\). Equivalently, the goal of spherical K-means is to find \(D\) and \(s^{(i)}\) from the inputs \(x^{(i)}\) according to [18]:

$$
\min_{D, s} \sum_{i=1}^m \| D s^{(i)} - x^{(i)} \|_2^2,
$$

subject to \(\|s^{(i)}\|_0 \leq 1, \, \forall i = 1, \ldots, m\) and \(\|d^{(j)}\|_2 = 1, \, \forall j = 1, \ldots, K\).

In the above equation, the code vector \(s^{(i)}\) can be thought of as a "feature representation" of each example \(x^{(i)}\) that satisfies several criteria:
- Given \(s^{(i)}\) and \(D\), the original example \(x^{(i)}\) should be able to be reconstructed well.
- The first constraint \(\|s^{(i)}\|_0 \leq 1\) means that each \(s^{(i)}\) should have at most one nonzero entry (associated with vector quantization). Thus, a new representation of \(x^{(i)}\) should preserve it as well as possible, but also be a very simple or parsimonious representation.
- The second constraint \(\|d^{(j)}\|_2 = 1\) requires that each dictionary column have unit length, preventing them from becoming arbitrarily large or small.
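
As a minimal sketch of these criteria, the following NumPy snippet encodes one vector with a single nonzero coefficient and reconstructs it. The toy dictionary, the input values, and the helper name `encode_one_hot` are illustrative assumptions, not part of the source.

```python
import numpy as np

def encode_one_hot(D, x):
    """Return a code vector s with at most one nonzero entry.

    D is an (n, K) dictionary with unit-norm columns; x is an (n,) vector.
    The active entry is the column with the largest absolute inner product.
    """
    proj = D.T @ x                      # inner products with every column
    j = int(np.argmax(np.abs(proj)))    # best-matching dictionary column
    s = np.zeros(D.shape[1])
    s[j] = proj[j]
    return s

# Toy dictionary: two unit-length columns in R^2 (hypothetical values)
D = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x = np.array([0.2, -3.0])

s = encode_one_hot(D, x)
x_hat = D @ s                           # reconstruction from the code vector
err = np.sum((x_hat - x) ** 2)          # squared reconstruction error
```

Here the second column wins, so only the second coefficient is nonzero, and the reconstruction error is the squared part of `x` that the chosen column cannot represent.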

The full K-means algorithm for learning feature representation can be summarized below [18]:

1. **Normalize inputs**:

$$
x^{(i)} \leftarrow \frac{x^{(i)} - \text{mean}(x^{(i)})}{\sqrt{\text{var}(x^{(i)}) + \epsilon_{\text{norm}}}}, \quad \forall i = 1, \ldots, m. \tag{7.9.21}
$$

2. **Eigenvalue decomposition and estimate inputs**:

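$$
\text{cov}(x^{(i)}) = V D V^T, \quad i = 1, \ldots, m, \tag{7.9.22}
$$

$$
x^{(i)} \leftarrow V (D + \epsilon_{\text{norm}} I)^{-1/2} V^T x^{(i)}, \quad i = 1, \ldots, m. \tag{7.9.23}
$$

In this step only, \(V\) is the matrix of eigenvectors of the covariance matrix and \(D\) is the diagonal matrix of its eigenvalues (not the dictionary).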

3. **Loop until convergence (typically 10 iterations is enough)**:

$$
s_j^{(i)} = 
\begin{cases} 
(d^{(j)})^T x^{(i)}, & \text{if } j = \arg \max_l |(d^{(l)})^T x^{(i)}|, \\
0, & \text{otherwise},
\end{cases} \tag{7.9.24}
$$

$$
d^{(j)} \leftarrow X s_j + d^{(j)},
$$

$$
d^{(j)} \leftarrow \frac{d^{(j)}}{\|d^{(j)}\|_2},
$$

where \(j = 1, \ldots, K\) and \(i = 1, \ldots, m\). Here \(X = [x^{(1)}, \ldots, x^{(m)}] \in \mathbb{R}^{n \times m}\) is the data (or example) matrix, and \(s_j = [s_j^{(1)}, \ldots, s_j^{(m)}]^T \in \mathbb{R}^{m \times 1}\) collects the \(j\)-th code coefficient of every example.
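
The loop in step 3 can be sketched in the same orientation as the text (examples as columns of \(X\), centroids as columns of \(D\)). The function name `spherical_kmeans_step`, the random data, and the sizes are assumptions chosen for illustration, and the winning column is taken to be the one with the largest absolute inner product.

```python
import numpy as np

def spherical_kmeans_step(X, D):
    """One encode/update pass. X is (n, m); D is (n, K) with unit-norm columns."""
    m = X.shape[1]
    proj = D.T @ X                                  # (K, m) inner products
    idx = np.argmax(np.abs(proj), axis=0)           # winning column per example
    S = np.zeros_like(proj)
    S[idx, np.arange(m)] = proj[idx, np.arange(m)]  # one nonzero per column of S
    D_new = X @ S.T + D                             # damped update d_j <- X s_j + d_j
    D_new /= np.linalg.norm(D_new, axis=0, keepdims=True)
    return D_new, S

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))      # n = 5 features, m = 40 examples
D = rng.standard_normal((5, 3))       # K = 3 centroids
D /= np.linalg.norm(D, axis=0, keepdims=True)

for _ in range(10):
    D, S = spherical_kmeans_step(X, D)
```

Adding the previous \(d^{(j)}\) in the update keeps a centroid from collapsing to the zero vector when no example is assigned to it, so the final normalization is always well defined.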

When dropout is applied to the outputs of a dictionary, the spherical K-means is extended to the dropout spherical K-means [173]. The optimization problem of the dropout spherical K-means can be described as:

$$
\min_{D, s} \sum_{i=1}^m \| (M \odot D) s^{(i)} - x^{(i)} \|_2^2,
$$

subject to \(\|s^{(i)}\|_0 \leq 1, \, \forall i = 1, \ldots, m\) and \(\|d^{(j)}\|_2 = 1, \, \forall j = 1, \ldots, K\),

where \(M\) is a binary mask matrix of the same size as \(D\), and each column \(M_j\) is drawn independently from \(\text{Bernoulli}(p)\). Given a dictionary \(D\) and a dropout mask matrix \(M\), the product \(M \odot D\) can be viewed as a "thinned dictionary" \(D_{\text{thin}} = M \odot D\), that is, a copy of \(D\) in which the masked entries are set to zero. Hence, Eq. (7.9.24) becomes:

$$
s_j^{(i)} = 
\begin{cases} 
(d_{\text{thin}}^{(j)})^T x^{(i)}, & \text{if } j = \arg \max_l |(d_{\text{thin}}^{(l)})^T x^{(i)}|, \\
0, & \text{otherwise},
\end{cases}
$$

for all \(j = 1, \ldots, K; i = 1, \ldots, m\).
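
A short sketch of the thinned dictionary and the resulting code vectors follows. It assumes an elementwise mask with \(p\) interpreted as the probability of keeping an entry; the sizes and random data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, m, p = 6, 4, 10, 0.5             # p is treated as the keep probability (assumption)

D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0, keepdims=True)   # unit-norm columns
X = rng.standard_normal((n, m))

M = rng.binomial(1, p, size=D.shape)   # binary mask with the same shape as D
D_thin = M * D                         # elementwise (Hadamard) product

# Code vectors computed against the thinned dictionary
proj = D_thin.T @ X                    # (K, m)
idx = np.argmax(np.abs(proj), axis=0)
S = np.zeros((K, m))
S[idx, np.arange(m)] = proj[idx, np.arange(m)]
```

Because a fresh mask is drawn on every iteration, each pass assigns examples using a different random subset of the dictionary entries.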

After the code vector \(s^{(i)}\) is obtained, new centroids can be computed according to:

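$$
D_{\text{new}} = \arg \min_{D_{\text{thin}}} \|D_{\text{thin}} S - X\|_F^2 + \alpha \|D_{\text{thin}} - D_{\text{old}}\|_F^2,
$$

$$
D_{\text{new}} = \left(X S^T + \alpha D_{\text{old}}\right) \left(S S^T + \alpha I\right)^{-1},
$$

where \(S = [s^{(1)}, \ldots, s^{(m)}] \in \mathbb{R}^{K \times m}\) stacks the code vectors and \(\alpha > 0\) weights the proximity to the previous dictionary \(D_{\text{old}}\).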
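
The centroid update can be checked numerically. The sketch below reads the objective above with Frobenius norms and a weight \(\alpha\) on the proximity term, which is an assumption about the intended form. Setting the gradient to zero then gives \(D (S S^T + \alpha I) = X S^T + \alpha D_{\text{old}}\), and the code verifies that the solution of this linear system is a stationary point. All data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, m, alpha = 5, 3, 30, 1.0

X = rng.standard_normal((n, m))
S = rng.standard_normal((K, m))
D_old = rng.standard_normal((n, K))

def objective(D):
    """Reconstruction error plus weighted squared distance to the old dictionary."""
    return np.sum((D @ S - X) ** 2) + alpha * np.sum((D - D_old) ** 2)

# Stationarity condition: D (S S^T + alpha I) = X S^T + alpha D_old
A = S @ S.T + alpha * np.eye(K)        # (K, K), symmetric positive definite
B = X @ S.T + alpha * D_old            # (n, K)
D_new = np.linalg.solve(A, B.T).T      # equals B A^{-1} because A is symmetric

# Gradient of the objective at D_new; it should vanish
grad = 2 * (D_new @ S - X) @ S.T + 2 * alpha * (D_new - D_old)
```

Using `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually the more stable choice.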

The full dropout K-means algorithm [173] for learning feature representation can be summarized below:

1. **Normalize inputs**: Use Eq. (7.9.21) to compute the normalized input vectors \(x^{(i)}\).
2. **Eigenvalue decomposition and estimate inputs**: Use Eq. (7.9.22) to compute the EVD of the covariance matrix \(\text{cov}(x^{(i)})\), then estimate the inputs \(x^{(i)}\) via Eq. (7.9.23).
3. **Loop until convergence**: Calculate the code vectors \(s^{(i)}\) using Eq. (7.9.29), update the dictionary as \(D_{\text{new}} = X S^T + D_{\text{thin}}\), and finally normalize each column: \(d^{(j)} \leftarrow \frac{d^{(j)}}{\|d^{(j)}\|_2}\).


In [1]:
import numpy as np

def normalize_inputs(X, eps_norm=1e-8):
    """
    Normalize each example (row of X) to zero mean and unit variance.
    """
    X_mean = np.mean(X, axis=1, keepdims=True)
    X_var = np.var(X, axis=1, keepdims=True)
    X_norm = (X - X_mean) / np.sqrt(X_var + eps_norm)
    return X_norm

def eigenvalue_decomposition(X):
    """
    Perform eigenvalue decomposition on the covariance matrix of X.
    """
    cov_matrix = np.cov(X, rowvar=False)
    eig_values, eig_vectors = np.linalg.eigh(cov_matrix)
    return eig_values, eig_vectors

def estimate_inputs(X, eig_values, eig_vectors):
    """
    Estimate inputs using eigenvalue decomposition results.
    """
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eig_values + 1e-8))
    X_estimated = eig_vectors @ D_inv_sqrt @ eig_vectors.T @ X.T
    return X_estimated.T

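def spherical_kmeans(X, K, max_iters=10):
    """
    Spherical K-means clustering.

    X has shape (n_samples, n_features). The dictionary D is stored with one
    centroid per row, i.e. as the transpose of the n x K matrix in the text.
    """
    n_samples, n_features = X.shape
    rows = np.arange(n_samples)
    D = np.random.randn(K, n_features)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)

    for _ in range(max_iters):
        # Compute code vectors: one nonzero entry per sample, at the centroid
        # with the largest absolute inner product
        proj = X @ D.T
        idx = np.argmax(np.abs(proj), axis=1)
        S = np.zeros((n_samples, K))
        S[rows, idx] = proj[rows, idx]

        # Damped dictionary update D <- X S^T + D (here in row form)
        D_new = S.T @ X + D
        norms = np.linalg.norm(D_new, axis=1, keepdims=True)
        # Keep the previous centroid if a row collapses to zero (avoids NaN)
        D = np.where(norms > 0, D_new / np.maximum(norms, 1e-12), D)

    return D, S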

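def dropout_spherical_kmeans(X, K, dropout_rate=0.5, max_iters=10):
    """
    Dropout Spherical K-means clustering.

    X has shape (n_samples, n_features); D holds one centroid per row.
    """
    n_samples, n_features = X.shape
    rows = np.arange(n_samples)
    D = np.random.randn(K, n_features)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)

    for _ in range(max_iters):
        # Apply dropout: each dictionary entry is kept with probability 1 - dropout_rate
        dropout_mask = np.random.binomial(1, 1 - dropout_rate, D.shape)
        D_thin = D * dropout_mask

        # Compute code vectors against the thinned dictionary
        proj = X @ D_thin.T
        idx = np.argmax(np.abs(proj), axis=1)
        S = np.zeros((n_samples, K))
        S[rows, idx] = proj[rows, idx]

        # Update dictionary: D_new = X S^T + D_thin (here in row form)
        D_new = S.T @ X + D_thin
        norms = np.linalg.norm(D_new, axis=1, keepdims=True)
        # A centroid with no assigned samples and a fully masked row would have
        # zero norm; keep its previous value instead of dividing by zero
        D = np.where(norms > 0, D_new / np.maximum(norms, 1e-12), D)

    return D, S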

def main():
    # Example usage
    np.random.seed(0)
    X = np.random.randn(100, 20)  # 100 samples, 20 features

    # Normalize inputs
    X_norm = normalize_inputs(X)

    # Eigenvalue decomposition
    eig_values, eig_vectors = eigenvalue_decomposition(X_norm)

    # Estimate inputs
    X_estimated = estimate_inputs(X_norm, eig_values, eig_vectors)

    # Perform Dropout Spherical K-means
    K = 10  # Number of clusters
    D, S = dropout_spherical_kmeans(X_estimated, K, dropout_rate=0.5, max_iters=10)

    print("Dictionary D:\n", D)
    print("Code vectors S:\n", S)

if __name__ == "__main__":
    main()





In [None]:
## Exercise - Rectify and modify it

## Definition 7.5 (DropConnect Network [153])
Let \(\{x_1, \ldots, x_l\}\) with labels \(\{y_1, \ldots, y_l\}\) be the data set \(S\) of \(l\) entries. A DropConnect network is defined as a mixture model:

$$
o = \sum_{M} p(M) f(x; \theta, M) = \mathbb{E}_M \{f(x; \theta, M)\},
$$

where \(M\) is the DropConnect layer mask and \(\theta = \{W_s, W, W_g\}\) are the network parameters: \(W_s\) are the softmax layer parameters, \(W\) are the DropConnect layer parameters, and \(W_g\) are the feature extractor parameters. Each sub-network \(f(x; \theta, M)\) has weight \(p(M)\), with \(M_{ij} \sim \text{Bernoulli}(p)\).

When each element of \(M\) has equal probability of being on or off (\(p = 0.5\)), the mixture model assigns equal weight to every sub-model \(f(x; \theta, M)\). Otherwise, some sub-models receive larger weights than others.
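
The effect of \(p\) on the mixture weights can be illustrated by enumerating every mask of a tiny layer. This is a sketch under the assumption that mask entries are independent; the helper name `mask_weights` and the three-entry mask are hypothetical.

```python
import itertools

def mask_weights(num_entries, p):
    """Probability p(M) of every binary mask with independent Bernoulli(p) entries."""
    weights = {}
    for bits in itertools.product([0, 1], repeat=num_entries):
        on = sum(bits)                               # number of active connections
        weights[bits] = p ** on * (1 - p) ** (num_entries - on)
    return weights

w_equal = mask_weights(3, 0.5)    # 2^3 = 8 sub-models, all weighted equally
w_skewed = mask_weights(3, 0.8)   # masks with more active entries weigh more
```

With \(p = 0.5\) every one of the eight masks has weight \(1/8\), while with \(p = 0.8\) the all-ones mask has weight \(0.8^3\) and the all-zeros mask only \(0.2^3\).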

A standard DropConnect model architecture comprises the following basic steps [152]:

1. **Feature Extractor**:

$$
v = g(x; W_g),
$$

where \(v \in \mathbb{R}^{n \times 1}\) is the output feature vector, \(x \in \mathbb{R}^{I \times 1}\) is the input data vector to the overall model, and \(W_g \in \mathbb{R}^{n \times I}\) is the parameter matrix for the feature extractor. \(g(\cdot)\) is chosen to be a multilayered convolutional neural network (CNN), with \(W_g\) being the convolutional filters (and biases) of the CNN.

2. **DropConnect Layer**:

$$
r = f(u) = f((M \odot W)v) \in \mathbb{R}^{d \times 1},
$$

where \(v\) is the output of the feature extractor, \(W \in \mathbb{R}^{d \times n}\) is a fully connected weight matrix, \(f\) is a nonlinear activation function, and \(M \in \mathbb{R}^{d \times n}\) is the binary mask matrix.

3. **Softmax Classification Layer**:

$$
o_i = \text{softmax}(r; W_s) = \frac{\exp(w_{s,i}^T r)}{\sum_{j=1}^{k} \exp(w_{s,j}^T r)}, \quad i = 1, \ldots, k,
$$

where \(W_s \in \mathbb{R}^{k \times d}\) is the weight matrix of the softmax classification layer which takes \(r\) as its input and uses \(W_s\) to map this input to a \(k\)-dimensional output vector \(o\) (\(k\) is the number of classes).
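
The three steps can be traced with a small NumPy forward pass. This is only a sketch: a single linear map with ReLU stands in for the CNN feature extractor, \(f\) is taken to be ReLU, the softmax uses the standard normalization so that the outputs sum to one, and all sizes and weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
I, n, d, k, p = 8, 6, 5, 3, 0.5       # illustrative sizes; p is the keep probability

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

W_g = rng.standard_normal((n, I))     # stand-in for the CNN feature extractor
W = rng.standard_normal((d, n))       # DropConnect layer weights
W_s = rng.standard_normal((k, d))     # softmax layer weights

x = rng.standard_normal(I)            # input vector
v = relu(W_g @ x)                     # 1. feature extractor: v = g(x; W_g)
M = rng.binomial(1, p, size=W.shape)  # mask with M_ij ~ Bernoulli(p)
r = relu((M * W) @ v)                 # 2. DropConnect layer: r = f((M ⊙ W) v)
o = softmax(W_s @ r)                  # 3. class probabilities
```

Unlike dropout, which zeroes entries of the activation vector, the mask here removes individual weights, so each output unit sees its own random subset of the inputs.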


In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, Conv2D, Flatten, Softmax
from tensorflow.keras.models import Model
import numpy as np

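class DropConnect(Layer):
    """Fully connected layer whose weights are randomly masked during training."""

    def __init__(self, units, drop_prob=0.5, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.drop_prob = drop_prob  # probability of dropping each weight

    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel',
                                      shape=(input_shape[-1], self.units),
                                      initializer='glorot_uniform',
                                      trainable=True)
        self.bias = self.add_weight(name='bias',
                                    shape=(self.units,),
                                    initializer='zeros',
                                    trainable=True)

    def call(self, inputs, training=None):
        if training:
            # Sample a binary mask M; each weight is kept with probability 1 - drop_prob
            binary_tensor = tf.random.uniform(self.kernel.shape) > self.drop_prob
            kernel = self.kernel * tf.cast(binary_tensor, dtype=tf.float32)
        else:
            # At inference, scale the weights by the keep probability (the mean of the mask)
            kernel = self.kernel * (1 - self.drop_prob)
        # Linear output u = (M ⊙ W) v + b; the activation f from the text is not applied here
        output = tf.matmul(inputs, kernel) + self.bias
        return output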

class DropConnectNetwork(Model):
    def __init__(self, input_shape, num_classes, drop_prob=0.5):
        super(DropConnectNetwork, self).__init__()
        self.conv1 = Conv2D(32, (3, 3), activation='relu', input_shape=input_shape)
        self.flatten = Flatten()
        self.drop_connect = DropConnect(128, drop_prob)
        self.dense = Dense(num_classes)
        self.softmax = Softmax()

    def call(self, inputs, training=None):
        x = self.conv1(inputs)
        x = self.flatten(x)
        x = self.drop_connect(x, training=training)
        x = self.dense(x)
        return self.softmax(x)

# Define constants
input_shape = (28, 28, 1)  # Example for MNIST dataset
num_classes = 10

# Instantiate the model
model = DropConnectNetwork(input_shape, num_classes)

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Load and preprocess data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# Train the model
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test accuracy: {accuracy}')



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.9812999963760376
