# K-Means

- K-means is a simple and elegant approach for partitioning a data set into K distinct, non-overlapping clusters.
- To perform K-means we must first specify the desired number of clusters K.
- The K-means algorithm then assigns each observation to exactly one of the K clusters.
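
One way to picture "exactly one cluster per observation" is to store a single label per observation. The sketch below is only an illustration: the `labels` data and the helper names `index_sets` and `is_partition` are hypothetical, and indices are 0-based as is usual in Python. It groups the indices by label and checks that every index appears in exactly one set.

```python
# Hypothetical data: 6 observations assigned to K = 3 clusters.
labels = [0, 2, 1, 0, 2, 2]   # labels[i] is the cluster of observation i
K = 3

def index_sets(labels, K):
    """Group observation indices by cluster label."""
    clusters = [set() for _ in range(K)]
    for i, label in enumerate(labels):
        clusters[label].add(i)
    return clusters

def is_partition(clusters, n):
    """True if the sets cover 0..n-1 and no index appears in two sets."""
    covered = set().union(*clusters)
    total = sum(len(c) for c in clusters)
    return covered == set(range(n)) and total == n

clusters = index_sets(labels, K)
print(clusters)
print(is_partition(clusters, len(labels)))
```

Because each observation carries one label, the resulting sets cannot overlap; `is_partition` is useful mainly when the sets come from somewhere else.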

- Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster.
- These sets satisfy two properties:
    1. $C_1 \cup C_2 \cup \ldots \cup C_K = \{1, \ldots, n\}$. Each observation belongs to at least one of the K clusters.
    2. $C_k \cap C_{k'} = \emptyset$ for all $k \ne k'$. Clusters are non-overlapping: no observation belongs to more than one cluster.
- The idea behind K-means is that a good clustering is one for which the within-cluster variation is as small as possible.
- The within-cluster variation for cluster $C_k$ is a measure $W(C_k)$ of the amount by which the observations within a cluster differ from each other.
- Hence we want to solve the problem:

    $\underset{C_1, \ldots, C_K}{\text{minimize}} \sum_{k=1}^{K} W(C_k)$
- The most common way to define the within-cluster variation $W(C_k)$ uses squared Euclidean distance:

    $W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$

- Here $|C_k|$ denotes the number of observations in the kth cluster.
- In other words, the within-cluster variation for the kth cluster is the sum of all pairwise squared Euclidean distances between the observations in the kth cluster, divided by the number of observations in the kth cluster.
- Combining the two equations above gives the optimization problem:

    $\underset{C_1, \ldots, C_K}{\text{minimize}} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$
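
To make $W(C_k)$ concrete, here is a minimal sketch that evaluates it for one small, made-up cluster. It assumes the double sum runs over all ordered pairs $(i, i')$, so each pair is counted twice. Under that reading, a standard identity (not stated above) says the result equals twice the sum of squared distances from each point to the cluster mean, which is why recomputing means tends to reduce the objective. The function names are illustrative only.

```python
def squared_distance(v, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((v_j - w_j) ** 2 for v_j, w_j in zip(v, w))

def within_cluster_variation(points):
    """W(C_k): squared distances summed over all ordered pairs, divided by |C_k|."""
    total = sum(squared_distance(x, y) for x in points for y in points)
    return total / len(points)

def centroid_form(points):
    """Twice the sum of squared distances from each point to the cluster mean."""
    n = len(points)
    mean = [sum(column) / n for column in zip(*points)]
    return 2 * sum(squared_distance(x, mean) for x in points)

# Hypothetical cluster of three 2-D observations.
cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)]
print(within_cluster_variation(cluster))  # 8.0
print(centroid_form(cluster))             # 8.0
```

The pairwise form costs on the order of $|C_k|^2$ distance evaluations, while the centroid form needs only $|C_k|$, so the latter is the usual choice when monitoring the objective.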

Algorithm:

1. Start with a set of K means, which are points in p-dimensional space.
2. Assign each point to the mean to which it is closest.
3. If no point's assignment has changed, stop and keep the clusters.
4. If some point's assignment has changed, recompute the means and return to step 2.
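
The loop above can be traced by hand on a tiny data set. The following is a sketch under stated assumptions: the four 1-D points, the deliberately poor starting means, and the helper names `assign` and `update` are all made up for illustration. It prints the assignments and means after each pass.

```python
def squared_distance(v, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((v_j - w_j) ** 2 for v_j, w_j in zip(v, w))

def assign(points, means):
    """Assignment step: index of the closest mean for each point."""
    return [min(range(len(means)), key=lambda k: squared_distance(p, means[k]))
            for p in points]

def update(points, assignments, means):
    """Update step: recompute each mean from its points (keep it if it has none)."""
    new_means = []
    for k, old_mean in enumerate(means):
        members = [p for p, a in zip(points, assignments) if a == k]
        if members:
            new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_means.append(old_mean)
    return new_means

points = [(1.0,), (2.0,), (9.0,), (10.0,)]
means = [(1.0,), (2.0,)]   # hypothetical, deliberately poor starting means

assignments = None
while True:
    new_assignments = assign(points, means)
    if new_assignments == assignments:   # nothing changed: stop
        break
    assignments = new_assignments
    means = update(points, assignments, means)
    print(assignments, means)
```

With this data the first pass puts three points in the second cluster, the second pass moves the point at 2.0 to the first cluster, and the third pass changes nothing, so the loop stops with means 1.5 and 9.5.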

In [1]:
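import random


# Vector helpers used by KMeans.
def squared_distance(v, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w))


def vector_mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


class KMeans:
    """Performs K-means clustering."""

    def __init__(self, k):
        self.k = k          # number of clusters
        self.means = None   # cluster means (centroids), set by train()

    def classify(self, point):
        """Return the index of the cluster whose mean is closest to point."""
        return min(range(self.k),
                   key=lambda i: squared_distance(point, self.means[i]))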
    
    # squared_distance(v, w) is the squared Euclidean distance between
    # vectors v and w, i.e. the sum of (v_i - w_i) ** 2 over all components.
   
   
    def train(self, inputs):
        """Fit the cluster means to inputs, a list of equal-length vectors."""
        # Choose k random points as the initial means
        self.means = random.sample(inputs, self.k)
        assignments = None
        
        while True:
            # Find new assignments (as a list, so they can be compared and reused)
            new_assignments = list(map(self.classify, inputs))
            
            # If no assignments have changed, we're done
            if assignments == new_assignments:
                return
            
            # Otherwise keep the new assignments
            assignments = new_assignments
            
            # And compute the new means based on new assignments
            for i in range(self.k):
                # find all the points assigned to cluster i
                i_points = [p for p,a in zip(inputs, assignments) if a == i]
                
                # make sure i_points is not empty so don't divide by 0
                if i_points:
                    self.means[i] = vector_mean(i_points)
                    
                    

                    
# Example usage (assumes `inputs` is a list of equal-length numeric vectors):
# random.seed(0)
# clusterer = KMeans(3)
# clusterer.train(inputs)
# print(clusterer.means)