# K-means

**Considering the problem of identifying groups, or clusters, of data points in a multidimensional space.**

Suppose we have a data set {$x_1, x_2, x_3,..., x_n$} consisting of $N$ observations of a random $D$-dimension
Euclidean variable $x$, our goal is to partition these points into $K$ clusters

A cluster k contains a cluster center $\mu_k$($D$-dimension). Our goal is then to find an assignment of data points to clusters,
as well as a set of vectors {$\mu_k$}, such that the sum of the squares of the distances of each data point to its
closest vector $\mu_k$, is a minimum.

Also, for each points $x_n$, we have a binary indicator variable $r_{nk} \in (0, 1)$ indicates whether the point
$x_{in}$ is assigned to cluster k, this implies that if $x_n$ is in k, then $r_{nk} = 1$ and $r_{nj} = 0$
$\forall j \in K, j \neq k$ This is known as the 1-of-K coding scheme.

Now we have a loss function(distortion measure in this case):

$J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk} ||x_n - \mu_k||^2$

which represents the sum of the squares of the distances of each data point to its assigned cluster point $\mu_k$

**Ultimate goal**:

Find {$r_{nk}$}, {$\mu_k$} that minimize J


### Algorithm

1. We first initialize $\mu_k$
2. we minimize J with respect to $r_{nk}$ fixing $\mu_k$ (E step)
3. we minimize J with respect to $\mu_k$ fixing $r_{nk}$ (M step)

repeat 2, 3 until convergence


**For phase 2**:

  \begin{equation}
    r_{nk}=
    \begin{cases}
      1, & k = argmin_j ||x_n - \mu_j||^2 \\
      0, & \text{otherwise}
    \end{cases}
  \end{equation}

**For phase 3**:

since J is convex with fixed $r_{nk}$, by taking the derivative of J and set it to 0, we have:

$2 \sum_{n=1}^{N}r_{nk}(x_n - \mu_n) = 0$

by solving above equation, we have:

$\mu_k = \frac{\sum_n r_{nk}x_n}{\sum_n r_{nk}}$

*and so this result has a simple interpretation, namely set $\mu_k$ equal to the
mean of all of the data points $x_n$ assigned to cluster k*

#### Convergence

The algorithm converges simply after:
1. max iteration reached
2. no change in the assignment of cluster points

In [1]:
import numpy as np

class KMeans:

    def __init__(self, k=2, max_iteration=10, init_method='random'):

        self.k = k
        self.max_iteration = max_iteration
        self.init_method = init_method

    def fit_transform(self, x: np.array):

        n, d = x.shape
        prev_mu = self._init_mu(x)
        curr_mu = prev_mu.copy()
        r_matrix = np.zeros((n, self.k))
        curr_iter = 0

        while curr_iter <= self.max_iteration:

            # step one, map each instance to cluster center muk
            for sample in range(n):

                distance = []

                for j in range(self.k):

                    distance.append(np.linalg.norm(x[sample] - prev_mu[j]))

                mu_assigned = np.argmin(distance)
                r_matrix[sample][mu_assigned] = 1

            # step two calculate new mu_k

            for j in range(self.k):

                total_points = np.sum(r_matrix[:j])
                cluster_total = np.zeros((1, d))

                for sample in range(n):

                    cluster_total = cluster_total + r_matrix[sample][j] * x[sample]

                if total_points != 0:
                    curr_mu[j] = cluster_total / total_points

            if (prev_mu == curr_mu).all():

                break

            curr_iter += 1

        print(f'finished k-means algorithm, with iteration {curr_iter}')

        return r_matrix, curr_mu

    def _init_mu(self, x):

        if self.init_method == 'random':
            col_max = x.max(axis=0)
            col_min = x.min(axis=0)

            return (col_max - col_min) * np.random.random_sample((self.k, x.shape[1])) + col_min

        elif self.init_method == 'random_points':

            random_int = np.random.randint(0, x.shape[0], size=self.k)

            return x[random_int]





In [2]:
x = np.array([[1, 2], [0, 0], [8, 8], [10, 10]])
kmeans = KMeans()
kmeans.fit_transform(x)

finished k-means algorithm, with iteration 11


(array([[0., 1.],
        [0., 1.],
        [1., 0.],
        [1., 0.]]),
 array([[6.22522539, 3.72560899],
        [1.        , 2.        ]]))