## Unsupervised learning

**The problem**: we have some data, and we aren't given a label that neatly categorizes it. But we want to separate the data in some meaningful way (in clusters, using some measure of nearness).

**With the algorithm used here (KMeans), we need to specify ahead of time how many clusters we want the algorithm to find.**

We don't know what the clusters represent, just that we are hoping that there will be a division in the data that will help us understand it.

---

In the following example, we'll explore unsupervised clustering with an algorithm called KMeans.

First, let's write a function that creates a mock dataset for us.

The function will sample one thousand points in the x-y plane (`blobs`) from 3 different probability distributions using the Scikit-learn function `make_blobs`. We will keep track of which distribution each point is sampled from (`cluster_labels`).

These will be returned from our function and stored in the variables `xy_points` and `labels` (**note: the KMeans algorithm won't know about the labels here, but we can use them in this contrived example to examine the output**).

In [None]:
import numpy
from sklearn.datasets import make_blobs

numpy.random.seed(1337)

centers = [[-10, -10], [-10, 13], [8, -1]]

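def get_points_and_labels():
    # Sample 1000 points in the x-y plane (two features, matching `centers`)
    # from three blobs, and keep track of which blob each point came from.
    blobs, cluster_labels = make_blobs(n_samples=1000, n_features=2,
                                       centers=centers, cluster_std=5.0)
    return blobs, cluster_labels

xy_points, labels = get_points_and_labels()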

Next, let's make a function that will visualize our x-y points, optionally coloring the points if the labels are also included.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

def plot_clusters(title, xy_points, labels=None):
    plt.figure()
    plt.title(title)
    xy_points_df = pd.DataFrame(xy_points, columns=['x', 'y'])

    if labels is None:
        plt.scatter(xy_points_df.x, xy_points_df.y, c="grey")
    else:
        xy_points_df['labels'] = pd.Series(labels)
        colours = list(mcolors.TABLEAU_COLORS.keys())
        clusters = range(len(set(labels)))
        for cluster_id in clusters:
            cluster_data = \
                xy_points_df.loc[xy_points_df["labels"] == cluster_id,
                                 ["x", "y"]]
            # Cycle through the palette so every cluster id maps to a colour.
            plt.scatter(cluster_data.x, cluster_data.y,
                        c=colours[cluster_id % len(colours)])

    plt.show()

Let's look at our x-y points both with and without the labels.

In [None]:
print('First 10 xy_points: \n',xy_points[:10])
plot_clusters('Unlabeled clusters', xy_points)
print('First 10 labels: \n', labels[:10])
plot_clusters('Labeled clusters', xy_points, labels)

### KMeans

From [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering) ...

---

1. k initial "means" (or "seeds"; in this case k=3) are randomly generated within the data domain (shown in color).
![](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5e/K_Means_Example_Step_1.svg/200px-K_Means_Example_Step_1.svg.png)

---

2. k clusters are created by associating every observation with the nearest mean.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/K_Means_Example_Step_2.svg/200px-K_Means_Example_Step_2.svg.png)

---

3. The centroid of each of the k clusters becomes the new mean.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/K_Means_Example_Step_3.svg/200px-K_Means_Example_Step_3.svg.png)

---

4. Steps 2 and 3 are repeated until convergence has been reached (the result is not guaranteed to be the best possible clustering).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/K_Means_Example_Step_4.svg/200px-K_Means_Example_Step_4.svg.png)
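
The four steps above can be sketched in a few lines of NumPy. This is only a toy illustration, not how Scikit-learn implements KMeans: the function name `kmeans_sketch`, its parameters, and the demo data are all made up for this example, and as a simplifying assumption the initial means are picked from the observations themselves rather than generated anywhere in the data domain.

```python
import numpy as np

def nearest_mean(points, means):
    """Index of the closest mean for every point (Euclidean distance)."""
    distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, n_iterations=100, seed=0):
    """Toy k-means (hypothetical helper): returns (means, assignments)."""
    rng = np.random.default_rng(seed)
    # Step 1 (assumption): use k distinct observations as the initial means.
    means = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iterations):
        # Step 2: associate every observation with the nearest mean.
        assignments = nearest_mean(points, means)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster simply keeps its previous mean).
        new_means = np.array([
            points[assignments == j].mean(axis=0) if np.any(assignments == j)
            else means[j]
            for j in range(k)
        ])
        # Step 4: stop once the means have stopped moving.
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, nearest_mean(points, means)

# Small made-up dataset: three tight blobs of 50 points each.
rng = np.random.default_rng(1)
demo_points = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(50, 2))
    for center in [(-10, -10), (-10, 13), (8, -1)]
])
means, assignments = kmeans_sketch(demo_points, k=3)
print(means)
```

Depending on the initial means, a single run like this can settle on a poor clustering, which is one reason to repeat it from several starting points.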


---

In Scikit-learn, KMeans is provided by the [`sklearn.cluster.KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) class.

Notice that we specify the number of clusters we want the algorithm to find with `n_clusters`.
The setting `n_init=1000` means that we will try 1000 times with different initial means, then choose the best result.

In [None]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, n_init=1000)
model.fit(xy_points)
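
To get a feel for what `n_init` does, the sketch below fits the same data once with a single initialization and once with several, then compares the `inertia_` attribute (the sum of squared distances from each point to its nearest cluster center; lower is better). The dataset, the variable names, and the choice of `n_init=20` and `random_state=0` are assumptions made for this illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up demo data with the same centers as the example above.
demo_points, _ = make_blobs(n_samples=300,
                            centers=[[-10, -10], [-10, 13], [8, -1]],
                            cluster_std=5.0, random_state=0)

# One initialization versus the best of twenty.
single_run = KMeans(n_clusters=3, n_init=1, random_state=0).fit(demo_points)
best_of_many = KMeans(n_clusters=3, n_init=20, random_state=0).fit(demo_points)

print('inertia with n_init=1: ', single_run.inertia_)
print('inertia with n_init=20:', best_of_many.inertia_)
```

On easy data like this the two numbers are often nearly identical; extra initializations matter more when a single run can get stuck in a poor clustering.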

In [None]:
print('\nActual cluster means')
for x_y in centers:
    print('%f,%f' % (x_y[0], x_y[1]))
    
print('\nPredicted cluster means')
for x_y in model.cluster_centers_:
    print('%f,%f' % (x_y[0], x_y[1]))

In [None]:
kmeans_labels = model.predict(xy_points)

print('First 10 actual labels: ', labels[:10])
print('First 10 computed labels: ', kmeans_labels[:10])

plot_clusters('Re-plot of original clusters', xy_points, labels)
plot_clusters('Calculated clusters', xy_points, kmeans_labels)

### Huh?

Notice that many of the predicted labels may not match the actual labels!

KMeans finds clusters, but it has no way of knowing what the actual labels mean, so the cluster numbers it assigns are arbitrary.

You will notice in the above plots that the shapes of the clusters are pretty close, but the colors of the individual clusters might be wrong.
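
Because the cluster numbers are arbitrary, comparing them element by element with the known labels can be misleading. The sketch below shows two ways around this, on a made-up dataset with hypothetical variable names: the adjusted Rand index from `sklearn.metrics`, which ignores how clusters are numbered, and a one-to-one renaming of cluster ids found with `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Made-up, well-separated demo data (assumption: cluster_std=2.0).
demo_points, demo_labels = make_blobs(n_samples=300,
                                      centers=[[-10, -10], [-10, 13], [8, -1]],
                                      cluster_std=2.0, random_state=0)
demo_clusters = KMeans(n_clusters=3, n_init=10,
                       random_state=0).fit_predict(demo_points)

# Naive comparison: depends on how the cluster ids happen to be numbered.
raw_agreement = np.mean(demo_clusters == demo_labels)

# The adjusted Rand index does not care about the numbering.
ari = adjusted_rand_score(demo_labels, demo_clusters)

# Rows are known labels, columns are cluster ids; pick the one-to-one
# pairing of rows and columns with the largest total overlap.
counts = confusion_matrix(demo_labels, demo_clusters)
label_ids, cluster_ids = linear_sum_assignment(counts, maximize=True)
renaming = dict(zip(cluster_ids, label_ids))
renamed_clusters = np.array([renaming[c] for c in demo_clusters])
matched_agreement = np.mean(renamed_clusters == demo_labels)

print('raw agreement:    ', raw_agreement)
print('adjusted Rand:    ', ari)
print('matched agreement:', matched_agreement)
```

This kind of check is only possible in a contrived example like ours, where the true labels are known.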

---

### Trying our model out on some new data

We can sample a new dataset from the same probability distribution and use our trained model to predict clusters.

In [None]:
xy_points2, labels2 = get_points_and_labels()
kmeans_labels2 = model.predict(xy_points2)

plot_clusters('New clusters', xy_points2, labels2)
plot_clusters('New predicted clusters', xy_points2, kmeans_labels2)
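
For KMeans, predicting a cluster for a new point amounts to finding the nearest of the learned cluster centers. The sketch below checks this by hand against `model.predict`-style output, using freshly generated demo data and hypothetical variable names so that it can run on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

blob_centers = [[-10, -10], [-10, 13], [8, -1]]
# Two made-up samples from the same blobs: one to fit on, one that is "new".
train_points, _ = make_blobs(n_samples=300, centers=blob_centers,
                             cluster_std=5.0, random_state=0)
new_points, _ = make_blobs(n_samples=300, centers=blob_centers,
                           cluster_std=5.0, random_state=1)

demo_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train_points)
predicted = demo_model.predict(new_points)

# Distance from every new point to every learned center, shape (300, 3).
distances = np.linalg.norm(
    new_points[:, None, :] - demo_model.cluster_centers_[None, :, :], axis=2)
nearest_center = distances.argmin(axis=1)

print('fraction of points where both agree:',
      np.mean(nearest_center == predicted))
```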

### That was all nice and easy, but ...

What if our data looks like this instead?

In [None]:
from sklearn.datasets import make_moons

features, labels = make_moons(n_samples=1000, noise=0.1)

plot_clusters('Uh oh ...', features, labels)

Or how about this one?

In [None]:
from sklearn.datasets import make_circles

features, labels = make_circles(n_samples=1000, noise=0.1, factor=0.5)

plot_clusters('Uh oh 2.0...', features, labels)

---

In these cases, KMeans may not be the best algorithm to try for clustering.
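
One way to put a number on this is to cluster the moons with KMeans and score the result against the known labels. The sketch below uses the adjusted Rand index, where 1.0 means a perfect match and values near 0 mean the clustering is no better than random; the variable names and the `random_state=0` settings are assumptions for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

moon_points, moon_labels = make_moons(n_samples=1000, noise=0.1,
                                      random_state=0)

# Ask KMeans for two clusters, one per moon.
moon_clusters = KMeans(n_clusters=2, n_init=10,
                       random_state=0).fit_predict(moon_points)

moon_score = adjusted_rand_score(moon_labels, moon_clusters)
print('Adjusted Rand index for KMeans on the moons:', moon_score)
```

KMeans assigns points to the nearest center, so it tends to split the plane with a straight boundary, which cannot follow the curved shape of the moons.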

You can see a number of different clustering algorithms, and instances where some algorithms might work better than others:

https://scikit-learn.org/stable/modules/clustering.html

### Exercise

Try one of the algorithms on the Scikit-learn webpage to create a pipeline to cluster one of the above datasets.

In [None]:
### Your code here ...

## Exercise

Consider the following dataset ...

In [None]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris()
iris_df = pd.DataFrame(iris['data'], columns= iris['feature_names'])

labels = np.array(iris['target'])

# Note: label == 0 ==> 'setosa'
#       label == 1 ==> 'versicolor'
#       label == 2 ==> 'virginica'

print(iris['DESCR'])

There are four features in the `iris_df` dataframe -- cluster them!
Plot any two of the features, both with the original labels and with the cluster colors (try different combinations).

In [None]:
### Your code here ...

# Hint, to get two features for a plot ...
# xy_points_iris = iris_df[['some feature', 'some other feature']].to_numpy()

## Homework

The next section is on neural networks. It's good to understand some of the theory behind neural networks, before diving in. I am not an expert at this, and I certainly don't think I can do a better job explaining some of the concepts behind neural networks than [Grant Sanderson](https://twitter.com/3blue1brown) (a.k.a. [3blue1brown](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw)).

He has three short introductory videos on how neural networks work, and how they are trained.

* [But what is a Neural Network?](https://www.youtube.com/watch?v=aircAruvnKk) (19:13)
* [Gradient descent, how neural networks learn](https://www.youtube.com/watch?v=IHZwWFHWa-w) (21:00)
* [What is back propagation really doing?](https://www.youtube.com/watch?v=Ilg3gGewQ5U) (13:53)

(He has a fourth video in the series; feel free to watch it if you want to get deep into the calculus involved -- it's not necessary for what we are doing.)

---

And on to the next section, on [neural networks](04-neural-networks.ipynb)...