# Clustering 

Our visual system is extremely efficient at identifying groups of objects in space. Achieving the same result algorithmically is nontrivial. **Clustering** is an example of **unsupervised learning**, i.e. it works on unlabeled data (data without a target).

Consider the following toy datasets in two dimensions (for easy visualization):

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn import datasets

In [None]:
# reproducibility
np.random.seed(0)
# sample size
n_samples = 1500
# dot size for scatter plots. Choose a smaller dot size for larger sample sizes.
dotsize = 10

In [None]:
X1, y1 = datasets.make_circles(n_samples=n_samples, factor=0.5,
                               noise=0.07)

In [None]:
#help(datasets.make_circles)
print(X1.shape)
print(y1.shape)
#print(X1);print(y1)

In [None]:
plt.scatter(X1[:, 0], X1[:, 1],s=dotsize)

In [None]:
X2, y2 = datasets.make_moons(n_samples=n_samples, noise=.07)

In [None]:
plt.scatter(X2[:, 0], X2[:, 1],s=dotsize)

In [None]:
X3, y3 = datasets.make_blobs(n_samples=n_samples, random_state=15)

In [None]:
plt.scatter(X3[:, 0], X3[:, 1],s=dotsize)

In [None]:
X_raw, y4 = datasets.make_blobs(n_samples=n_samples, random_state=170)
transformation = [[0.2, -0.9], [-0.4, 0.8]]
X4 = np.dot(X_raw, transformation)

In [None]:
plt.scatter(X4[:, 0], X4[:, 1],s=dotsize)

In [None]:
X5, y5 = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[1.0, 2.5, 0.7],
                             random_state=170)

In [None]:
plt.scatter(X5[:, 0], X5[:, 1],s=dotsize)

In [None]:
X6 = np.random.rand(n_samples, 2)
y6 = np.zeros(n_samples,dtype=np.int8)

In [None]:
plt.scatter(X6[:, 0], X6[:, 1],s=dotsize)

The target vector *y* contains the grouping used in generating the data.
It can be used to measure the accuracy of the assignments made by the clustering algorithms,
but it is not used as input.
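
A minimal sketch of how *y* might be used for scoring. The notebook does not prescribe a measure; `adjusted_rand_score` from `sklearn.metrics` is assumed here because cluster label names are arbitrary, so a plain comparison of labels can be misleading. The names `X_demo`, `y_demo` and `y_swapped` are hypothetical, and `y_swapped` only stands in for the output of a clustering algorithm.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import adjusted_rand_score

# small example data set with known grouping
X_demo, y_demo = datasets.make_moons(n_samples=200, noise=0.07, random_state=0)

# hypothetical clustering output: the same two groups, but with the label names swapped
y_swapped = 1 - y_demo

# fraction of points whose label name matches the target
accuracy = np.mean(y_swapped == y_demo)
# the adjusted Rand index compares the groupings and ignores the label names
ari = adjusted_rand_score(y_demo, y_swapped)
print(accuracy, ari)
```

Here the plain accuracy is 0 although the grouping is perfect, while the adjusted Rand index is 1.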

In [None]:
colors = np.array(['r','b','g'])
plt.scatter(X2[:, 0], X2[:, 1], s=dotsize, color=colors[y2])

Colors in matplotlib can be specified in several ways, see

https://matplotlib.org/3.1.1/api/colors_api.html

In [None]:
# predefined colors
colors = np.array(['r','b','g'])

In [None]:
# colors as red, green, blue values in hex
colors = np.array(['#377eb8', '#ff7f00', '#4daf4a',
                   '#f781bf', '#a65628', '#984ea3',
                   '#999999', '#e41a1c', '#dede00'])

In [None]:
# colors as red, green, blue values as tuples
colors = np.array([(0.9, 0.0, 0.0), (0.0, 0.9, 0.9), (0.5, 0.5, 0.1)])

In [None]:
plt.scatter(X1[:, 0], X1[:, 1], s=dotsize, color=colors[y1])

# Clustering algorithms

Many clustering algorithms are available in sklearn, see https://scikit-learn.org/stable/modules/clustering.html
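
Most of these algorithms follow the same pattern: create an estimator with its parameters, then call `fit_predict` on the data to get one integer label per sample. The sketch below assumes K-Means on a small blobs data set as an illustration only; the names `X_blobs`, `kmeans` and `labels` are hypothetical, and `n_clusters=3` is chosen because `make_blobs` generates 3 centers by default.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

# small blobs data set; the target vector is deliberately ignored
X_blobs, _ = datasets.make_blobs(n_samples=300, random_state=15)

# the number of clusters has to be supplied by the user
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
# one integer cluster label per sample
labels = kmeans.fit_predict(X_blobs)

# number of points assigned to each cluster
print(np.bincount(labels))
# coordinates of the fitted cluster centers, one row per cluster
print(kmeans.cluster_centers_)
```

The predicted labels can be plotted the same way as the target vector earlier in this notebook, e.g. `plt.scatter(X_blobs[:, 0], X_blobs[:, 1], s=dotsize, color=colors[labels])`.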


# Project for Friday 10/23 (due date TBD, work on 1-3 for now)

Try as many of the clustering algorithms available in sklearn as you like (with a minimum of 3, see below) and apply them to the 6 data sets in this notebook. Perform the following tasks and/or answer the following questions:

1. Read the description of the algorithm and estimate for which datasets the method will succeed or fail.
1. What parameters does the clustering algorithm need and what do they do? Demonstrate by varying the parameters.
1. Visualize the results.
1. You must be able to explain your code to the instructor.
1. Email your project (due Monday before class)

A minimum set of methods includes 

* one out of (K-Means, Gaussian mixtures, Birch, Ward hierarchical clustering)
* one out of (Affinity propagation, Mean-shift, Spectral clustering, Agglomerative clustering)
* one out of (DBSCAN, OPTICS)
* one out of (DBSCAN, OPTICS)
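
As a possible starting point for the parameter question, here is a rough sketch that varies one parameter and records how the result changes. It assumes DBSCAN on a moons data set; the `eps` values, `min_samples=5` and the names `X_moons` and `summary` are arbitrary illustration choices, not recommended settings.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

X_moons, _ = datasets.make_moons(n_samples=1500, noise=0.07, random_state=0)

# eps -> (number of clusters, number of noise points)
summary = {}
for eps in [0.05, 0.1, 0.2, 0.5]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_moons)
    # DBSCAN marks noise points with the label -1; they do not count as a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    summary[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

The same loop can be reused for other algorithms and data sets by swapping the estimator and the parameter being varied.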
