## Clustering via $k$-means
We previously studied the classification problem using the logistic regression algorithm. Since we had labels for each data point, we may regard the problem as one of <i>supervised learning</i>. However, in many applications, the data have no labels but we wish to discover possible labels (or other hidden patterns or structures). This problem is one of <i>unsupervised learning</i>. How can we approach such problems?

<b>Clustering</b> is one class of unsupervised learning methods. In this lab, we'll consider the following form of the clustering task. Suppose you are given

a set of observations, $X \equiv \{\hat{x}_i \,|\, 0 \leq i < n\}$, and<br>
a target number of clusters, $k$.

Your goal is to partition the points into $k$ subsets, $C_{0},…,C_{k−1} ⊆X$, which are 

- disjoint, i.e., $i \neq j \implies C_i \cap C_j = \emptyset$;
- but also complete, i.e., $C_0 \cup C_1 \cup \cdots \cup C_{k-1} = X$.
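
As a small illustration of the two conditions above, the following sketch checks whether a list of index sets forms a partition of $n$ points. The helper name `is_partition` and the example labels are hypothetical and are not part of the lab's code.

```python
import numpy as np

def is_partition(clusters, n):
    """Return True if `clusters` (a list of index sets) partitions {0, ..., n-1}."""
    seen = set()
    for c in clusters:
        if seen & c:  # overlaps an earlier cluster, so not disjoint
            return False
        seen |= c
    return seen == set(range(n))  # complete: every point is covered

# Hypothetical cluster labels for n = 6 points and k = 3 clusters
labels = np.array([0, 2, 1, 0, 2, 1])
k = 3
clusters = [set(np.flatnonzero(labels == j).tolist()) for j in range(k)]
print(is_partition(clusters, len(labels)))
```

Storing one integer label per point, as in `labels` here, guarantees both conditions by construction, since each point receives exactly one label.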

Intuitively, each cluster should reflect some "sensible" grouping. Thus, we need to specify what constitutes such a grouping.

## Setup: Dataset
The following cell will download the data you'll need for this lab. Run it now.

In [1]:
import requests
import os
import hashlib
import io

def on_vocareum():
    return os.path.exists('.voc')

def download(file, local_dir="", url_base=None, checksum=None):
    local_file = "{}{}".format(local_dir, file)
    if not os.path.exists(local_file):
        if url_base is None:
            url_base = "https://cse6040.gatech.edu/datasets/"
        url = "{}{}".format(url_base, file)
        print("Downloading: {} ...".format(url))
        r = requests.get(url)
        with open(local_file, 'wb') as f:
            f.write(r.content)
            
    if checksum is not None:
        with io.open(local_file, 'rb') as f:
            body = f.read()
            body_checksum = hashlib.md5(body).hexdigest()
            assert body_checksum == checksum, \
                "Downloaded file '{}' has incorrect checksum: '{}' instead of '{}'".format(local_file,
                                                                                           body_checksum,
                                                                                           checksum)
    print("'{}' is ready!".format(file))
    
if on_vocareum():
    URL_BASE = "https://cse6040.gatech.edu/datasets/kmeans/"
    DATA_PATH = "../resource/lib/publicdata/kmeans/"
else:
    URL_BASE = "https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/"
    DATA_PATH = ""

datasets = {'logreg_points_train.csv': '9d1e42f49a719da43113678732491c6d',
            'centers_initial_testing.npy': '8884b4af540c1d5119e6e8980da43f04',
            'compute_d2_soln.npy': '980fe348b6cba23cb81ddf703494fb4c',
            'y_test3.npy': 'df322037ea9c523564a5018ea0a70fbf',
            'centers_test3_soln.npy': '0c594b28e512a532a2ef4201535868b5',
            'assign_cluster_labels_S.npy': '37e464f2b79dc1d59f5ec31eaefe4161',
            'assign_cluster_labels_soln.npy': 'fc0e084ac000f30948946d097ed85ebc'}

for filename, checksum in datasets.items():
    download(filename, local_dir=DATA_PATH, url_base=URL_BASE, checksum=checksum)
    
print("\n(All data appears to be ready.)")

Downloading: https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/logreg_points_train.csv ...
'logreg_points_train.csv' is ready!
Downloading: https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/centers_initial_testing.npy ...
'centers_initial_testing.npy' is ready!
Downloading: https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/compute_d2_soln.npy ...
'compute_d2_soln.npy' is ready!
Downloading: https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/y_test3.npy ...
'y_test3.npy' is ready!
Downloading: https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/centers_test3_soln.npy ...
'centers_test3_soln.npy' is ready!
Downloading: https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/assign_cluster_labels_S.npy ...
'assign_cluster_labels_S.npy' is ready!
Downloading: https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/assign_cluster_labels_soln.npy ...
'assign_cluster_labels_soln.npy' is ready!

(All data appears to be ready.)

## The $k$-means clustering criterion
Here is one way to measure the quality of a set of clusters. For each cluster $C$, consider its center $\mu$ and measure the squared distance $\|x - \mu\|^2$ of each observation $x \in C$ to the center. Add these up for all points in the cluster; call this sum the <i>within-cluster sum-of-squares (WCSS)</i>. Then, set as our goal to choose clusters that minimize the total WCSS over <i>all</i> clusters.

More formally, given a clustering $C=\{C_{0},C_{1},…,C_{k−1}\}$, let<br>
$\mathrm{WCSS}(C) \equiv \sum_{i=0}^{k-1} \sum_{x \in C_i} \|x - \mu_i\|^2$,<br>
where $μ_{i}$ is the center of $C_i$. This center may be computed simply as the mean of all points in $C_i$, i.e.,

$\mu_i \equiv \frac{1}{|C_i|} \sum_{x \in C_i} x$.
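
The center formula above can be sketched in NumPy as follows. This is a minimal illustration, not the lab's reference solution: the function name `compute_centers`, the toy points, and the assumption that every cluster is non-empty are all made up for the example.

```python
import numpy as np

def compute_centers(X, labels, k):
    """Return a (k, d) array whose row i is the mean of the points labeled i."""
    # Assumption: every cluster contains at least one point.
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# Hypothetical data: four 2-D points split into two clusters
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 14.0]])
labels = np.array([0, 0, 1, 1])
centers = compute_centers(X, labels, k=2)
print(centers)
```

The boolean mask `labels == i` selects the rows of `X` that belong to cluster $C_i$, and `mean(axis=0)` averages them coordinate by coordinate.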

Then, our objective is to find the "best" clustering, $C_*$, which is the one with minimum WCSS:

$C_* = \arg\min_C \mathrm{WCSS}(C)$.
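
To make the objective concrete, here is a sketch that evaluates the WCSS of a labeling and finds the minimizer by exhaustive search on a tiny example. The names `wcss` and `brute_force_best` and the four toy points are hypothetical. The search also visits labelings that leave a cluster empty (an empty cluster contributes zero), and it is meant only to illustrate the definition, not as a practical method.

```python
import itertools
import numpy as np

def wcss(X, labels, k):
    """Total within-cluster sum of squared distances to each cluster's mean."""
    total = 0.0
    for i in range(k):
        members = X[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        mu = members.mean(axis=0)
        total += ((members - mu) ** 2).sum()
    return total

def brute_force_best(X, k):
    """Try every assignment of n points to k labels; feasible only for tiny n."""
    best_labels, best_cost = None, np.inf
    for assignment in itertools.product(range(k), repeat=len(X)):
        labels = np.array(assignment)
        cost = wcss(X, labels, k)
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost

# Hypothetical data: two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
best_labels, best_cost = brute_force_best(X, k=2)
print(best_labels, best_cost)
```

Because there are $k^n$ possible labelings, exhaustive search quickly becomes infeasible as $n$ grows.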

In [1]:
from IPython.display import HTML
HTML(filename='14-main.html')

Unnamed: 0,x_1,x_2,label
0,-0.234443,-1.07596,1
1,0.730359,-0.918093,0
2,1.43227,-0.439449,0
3,0.026733,1.0503,0
4,1.87965,0.207743,0
