**Supervised Learning and Unsupervised Learning Table**

![alt text](https://s-media-cache-ak0.pinimg.com/736x/8b/23/3e/8b233e2d7f26b00d0c594894917a127b--supervised-learning-variables.jpg "ml")

`Clustering` - A typical and well-known type of unsupervised learning. Clustering algorithms try to find natural groupings in data: data points that are similar (according to some notion of similarity) are placed in the same group. We call these groups clusters.

**`K-Means clustering`** is a simple and widely used clustering algorithm for finding groups that have not been explicitly labeled in the data. Given a value of k, it tries to build k clusters from the samples in the dataset.

The K-Means algorithm iterates between two steps until convergence:
- Data assignment. Each data point is assigned to its closest centroid, with ties broken arbitrarily. This results in a partitioning of the data.
- Relocation of 'means'. Each cluster representative is relocated to the center (mean) of all data points assigned to it. If the data points come with a probability measure (weights), then the relocation is to the expectations (weighted mean) of the data partitions.
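
The two steps above can be sketched in vectorized NumPy as follows. This is only a minimal illustration, not the implementation used later in this notebook: the helper names `assign` and `relocate`, the `weights` argument, and the toy points are all assumptions made for the example.

```python
import numpy as np

def assign(points, centroids):
    """Assignment step: index of the closest centroid for every point."""
    # pairwise distances, shape (num_points, num_centroids)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # np.argmin breaks ties by taking the lowest centroid index
    return np.argmin(distances, axis=1)

def relocate(points, labels, centroids, weights=None):
    """Relocation step: move each centroid to the (weighted) mean of its points."""
    if weights is None:
        weights = np.ones(len(points))
    new_centroids = centroids.astype(float)  # astype returns a copy
    for j in range(len(centroids)):
        mask = labels == j
        if mask.any():  # a centroid with no assigned points stays where it is
            new_centroids[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return new_centroids

# toy data (made up for illustration): two obvious groups
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

labels = assign(points, centroids)
plain = relocate(points, labels, centroids)
weighted = relocate(points, labels, centroids, weights=np.array([1.0, 3.0, 1.0, 1.0]))
```

With these toy points the unweighted relocation moves the first centroid to (0, 1), while giving the second point three times the weight pulls it to (0, 1.5).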

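When samples are processed one at a time (the online variant), the result is sensitive to the order in which the data samples are explored. To reduce this effect, run the algorithm several times with different orders, average the matching cluster centers from the runs, and use those averaged centers as the initial centers for one final run.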

The algorithm is also sensitive to the initial conditions and to the presence of outliers, so pre-processing (removing outliers) and post-processing (eliminating small clusters and merging close clusters into one larger cluster) are good ideas.
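
As one possible pre-processing step, the sketch below drops points that lie unusually far from the overall mean. It assumes that "outlier" means "distance to the mean above a z-score threshold"; the function name `remove_outliers`, the `max_z` parameter, and the default of 3 are illustrative choices, not something prescribed by K-Means.

```python
import numpy as np

def remove_outliers(points, max_z=3.0):
    """Keep points whose distance to the overall mean is within
    max_z standard deviations of the average distance."""
    distances = np.linalg.norm(points - points.mean(axis=0), axis=1)
    threshold = distances.mean() + max_z * distances.std()
    return points[distances <= threshold]

# toy data (made up): a 5x5 grid in the unit square plus one far-away point
grid = np.array([[x, y] for x in np.linspace(0, 1, 5) for y in np.linspace(0, 1, 5)])
points = np.vstack([grid, [[50.0, 50.0]]])

cleaned = remove_outliers(points)
```

A single global threshold like this is crude when the clusters themselves are far apart, so treat it as a starting point rather than a general recipe.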

`Elbow Method` to find 'k'. Plot the mean distance to the centroid as a function of k; the 'elbow point', where the rate of decrease shifts sharply, gives a rough estimate of 'k' (the distance keeps decreasing as k increases, so look for the bend rather than the minimum).
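
A rough sketch of how such an elbow curve could be computed is shown here. It uses its own compact K-Means (`simple_kmeans`, a hypothetical helper written for this example, with a fixed number of iterations) and synthetic data with three blobs; none of these names or values come from the notebook's dataset.

```python
import numpy as np

def simple_kmeans(points, k, num_iterations=50, seed=0):
    """Compact K-Means used only to drive the elbow curve (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(num_iterations):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centroids[j] = points[labels == j].mean(axis=0)
    # final assignment against the final centroids
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, np.argmin(distances, axis=1)

def mean_distance_to_centroid(points, centroids, labels):
    """Average distance from each point to the centroid it is assigned to."""
    return np.linalg.norm(points - centroids[labels], axis=1).mean()

def plot_elbow(k_values, curve):
    """Plot the curve; matplotlib is imported here so the rest runs without it."""
    import matplotlib.pyplot as plt
    plt.plot(k_values, curve, 'o-')
    plt.xlabel('k')
    plt.ylabel('mean distance to centroid')
    plt.show()

# synthetic data (assumption): three well-separated blobs
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(loc=center, scale=0.3, size=(40, 2))
                    for center in [(0, 0), (5, 5), (0, 5)]])

k_values = list(range(1, 7))
curve = [mean_distance_to_centroid(points, *simple_kmeans(points, k)) for k in k_values]
```

Calling `plot_elbow(k_values, curve)` draws the curve. For data like this the bend would be expected around k = 3, although a poor random initialization can blur it.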

Euclidean distance is the usual choice for measuring distance; other measures, such as KL divergence, are also possible.
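
To make the two options concrete, here is a small sketch of both measures. The `kl_divergence` helper, including its `eps` smoothing and its renormalization of the inputs, is an assumption made for this example; it treats each vector as a discrete probability distribution.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability vectors; eps avoids log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

d = euclidean_distance([0, 0], [3, 4])            # 5.0
kl_pq = kl_divergence([0.5, 0.5], [0.9, 0.1])     # about 0.511
kl_qp = kl_divergence([0.9, 0.1], [0.5, 0.5])     # about 0.368
```

Unlike Euclidean distance, KL divergence is not symmetric (the two values above differ), so the argument order matters when it is used to compare points with centroids.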

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def load_dataset(filename):
    return np.loadtxt(filename)


def euclidean_distance(a,b):
    return np.linalg.norm(a-b)

def kmeans(dataset,k,epsilon=0,distance='euclidean'):
    # list past centroids
    history_centroids = []
    
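    if distance=='euclidean':
        distance_method = euclidean_distance
    else:
        # fail early instead of hitting an undefined distance_method later
        raise ValueError('unsupported distance: {}'.format(distance))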
        
    num_examples, num_features = dataset.shape
    
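    # pick k distinct examples at random as the initial centroids
    # (randint excluded the last example and could pick the same example twice)
    prototypes = dataset[np.random.choice(num_examples,size=k,replace=False)]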
    # set those centroids into history list
    history_centroids.append(prototypes)
    # previous centroids, used to measure how far the centroids moved
    prototypes_old = np.zeros(prototypes.shape)
    # cluster index assigned to each example
    clusters = np.zeros((num_examples,1))
    norm = distance_method(prototypes,prototypes_old)
    iteration = 0
    while norm>epsilon:
        iteration += 1
        # assignment step: give each example to its closest centroid
        for index,example in enumerate(dataset):
            dist_vector = np.zeros((k,1))
            # distance from this example to each centroid
            for idx,prototype in enumerate(prototypes):
                dist_vector[idx] = distance_method(prototype,example)
            # find the smallest distance
            clusters[index,0] = np.argmin(dist_vector)
        
        # remember the current centroids before relocating them
        prototypes_old = prototypes
        temp_prototypes = np.zeros((k,num_features))
        
        # relocation step: move each centroid to the mean of its points
        for index in range(len(prototypes)):
            # get all points assigned to a cluster
            examples_close = [i for i in range(len(clusters)) if clusters[i]==index]
            if examples_close:
                temp_prototypes[index,:] = np.mean(dataset[examples_close],axis=0)
            else:
                # empty cluster: keep the old centroid instead of a NaN mean
                temp_prototypes[index,:] = prototypes_old[index]
            
        # set the new centroids as the current ones and record them
        prototypes = temp_prototypes
        history_centroids.append(temp_prototypes)
        # stop once the centroids have moved no more than epsilon
        norm = distance_method(prototypes,prototypes_old)
        
        
    return prototypes,history_centroids,clusters

def plot(dataset,history_centroids,clusters):
    colors = ['r','g']
    fig,ax = plt.subplots()
    
    # draw each cluster's points in its own color (colors repeat if k > 2)
    for index in range(len(history_centroids[0])):
        examples_close = [i for i in range(len(clusters)) if clusters[i]==index]
        for idx in examples_close:
            ax.plot(dataset[idx][0],dataset[idx][1],colors[index % len(colors)]+'o')
            
    # draw the centroids, moving each marker to its latest position
    history_points = []
    for index,centroids in enumerate(history_centroids):
        for idx,centroid in enumerate(centroids):
            if index==0:
                history_points.append(ax.plot(centroid[0],centroid[1],'bo')[0])
            else:
                # set_data expects sequences, so wrap the coordinates in lists
                history_points[idx].set_data([centroid[0]],[centroid[1]])
                print('Centroids {} {}'.format(index,centroid))
    plt.show()

In [None]:
dataset = load_dataset('kmeansdataset.txt')
centroids,history_centroids,clusters = kmeans(k=2,dataset=dataset)
plot(dataset,history_centroids,clusters)