In [None]:
%reload_ext nb_black

# Self Organizing Maps (SOMs) with `minisom`

In [None]:
import pandas as pd
import numpy as np

# !pip install minisom
from minisom import MiniSom

# We'll make some 3d plots, which is relatively easy with plotly
import plotly.express as px

## The idea behind SOMs

### The process

One liner: SOMs work like k-means, but the centroids are linked to one another as nodes of a lower-dimensional grid, so neighbouring nodes move together.

More detail below.

#### Step 1:

Throw a lower-d net into a higher-d representation of the data.

For example, in the gif below we have 3d data, and we've placed a 2d grid within this higher-dimensional space.  Note that we'll talk about this net as if it's a graph: the black dots in the grid will be referred to as nodes.

<img src='https://i.imgur.com/WladupI.gif' width=30%>

#### Step 2: 

Update the positions of the nodes in our 2d net.

We're allowed to move these nodes in the higher dimensional space. For example, we can move our nodes in the gif along the x, y, and z axis.  

Our goal when moving these nodes is to have them represent where our data sits.  Think of the observations as metal and the nodes as magnets: the nodes will gravitate towards denser areas of points.

Our nodes end up acting like centroids in k-means: each one settles at a location that represents the average features of the points in its vicinity.  This node movement happens over many iterations; at the end we might end up with something like the gif below.

Note that this is just an example of what might happen; these nodes were dragged by hand rather than by a SOM.

<img src='https://i.imgur.com/bkQi46c.gif' width=30% caption='s'>
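
The node movement described above can be sketched in plain NumPy.  This is a simplified illustration, not `minisom`'s actual implementation: the 3x3 grid, the linear decay schedule, the Gaussian neighbourhood, and every variable name here are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))  # toy 3d observations in the unit cube

grid_x, grid_y = 3, 3  # assumed grid size
# grid[i, j] holds the 2d grid coordinate (i, j) of each node
grid = np.stack(
    np.meshgrid(np.arange(grid_x), np.arange(grid_y), indexing="ij"), axis=-1
)
# weights[i, j] holds the position of node (i, j) in the 3d data space
weights = rng.random((grid_x, grid_y, data.shape[1]))

n_iter = 500
for t in range(n_iter):
    decay = 1 - t / n_iter
    learning_rate = 0.5 * decay  # step size shrinks over time
    sigma = 1.0 * decay + 1e-3  # neighbourhood radius shrinks over time

    # Pick a random observation and find the closest node (the "winner")
    x = data[rng.integers(len(data))]
    dists = np.linalg.norm(weights - x, axis=-1)
    winner = np.unravel_index(dists.argmin(), dists.shape)

    # Nodes close to the winner *on the grid* are pulled harder towards x
    grid_dist_sq = ((grid - np.array(winner)) ** 2).sum(axis=-1)
    influence = np.exp(-grid_dist_sq / (2 * sigma**2))
    weights += learning_rate * influence[..., None] * (x - weights)

print(weights.shape)
```

The key difference from k-means is the `influence` term: the winner moves the most, but its grid neighbours are dragged along too, which is what keeps the net from tearing apart.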

### Output

So what do we get out of this?  We get both dimension reduction and clusters.

Dimension reduction is achieved by mapping higher-d observations into lower-d based on which node of the grid they are closest to.  Each node in our example grid can be thought of as a 2d coordinate, and we can assign this 2d location to our 3d observations.  We can adjust the number of nodes to get different levels of granularity in our reduced dimensions.

Clusters are achieved very similarly to k-means: cluster labels can be assigned to points based on which node they are closest to.
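
As a rough sketch of both outputs, the snippet below assigns each observation to its nearest node and reads off a 2d grid coordinate and a cluster label.  The `weights` array here is random and only stands in for trained node positions; the 3x3 grid and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((5, 3))  # five 3d observations
weights = rng.random((3, 3, 3))  # stand-in for node positions on a 3x3 grid

# One row per node, in row-major grid order
flat_weights = weights.reshape(-1, weights.shape[-1])

# Distance from every observation to every node, shape (5, 9)
dists = np.linalg.norm(data[:, None, :] - flat_weights[None, :, :], axis=-1)

# Cluster label: index of the closest node
labels = dists.argmin(axis=1)

# Dimension reduction: the (row, column) grid position of that node
grid_coords = np.column_stack(np.unravel_index(labels, weights.shape[:2]))

print(labels)
print(grid_coords)
```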

## Example

Build a completely random dataset.

* Set a seed to make the dataset reproducible
* Use np.random.random() to generate a dataset with 15 rows and 3 columns
* Put this data into a dataframe.  Name the columns `['x0', 'x1', 'x2']`
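
One possible way to build the dataset described in the list above.  The seed value `42` is an arbitrary choice; any fixed seed makes the data reproducible.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # arbitrary seed, assumed for illustration

# 15 rows, 3 columns, values drawn uniformly from [0, 1)
data = np.random.random((15, 3))
df = pd.DataFrame(data, columns=["x0", "x1", "x2"])

print(df.head())
```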

Plot all three columns of the dataframe at once using `px.scatter_3d`

Create a `MiniSom` instance
* Set a `random_seed` when creating the instance
* Train it using `train_random` with 100 iterations

In [None]:
n_cols = df.shape[1]

Inspect the 'weights' using `.get_weights()`.  What do these represent?

Our weights are arranged in a 2d grid, so each weight vector is addressed by its grid position: the first one sits at position (0, 0), another might sit at position (2, 3), and so on.

Find the closest weight vector to the first row in the dataframe (this closest weight vector is called the 'winner').  What position is this weight at?

What are the values of the weight vector at that winning position?  What do these values represent?

Print out the data of the first row of the dataframe.  Based on these values, does the winner seem to make sense?

Iterate over the rows in the dataframe, and find the winner for each row.
* Store the x and y locations for each winner
* Create columns in your dataframe for `['winner_x', 'winner_y', 'winner_id']`

Aggregate the dataframe by all of the winning columns, and count how many observations belong to each winner.

Who was the biggest winner?
Did any weight vectors never win?

Make a 2d scatter plot using the aggregated dataframe.  Use the `winner_x` & `winner_y` as the axes.  Size by the number of observations belonging to each winner.

Make a 3d scatter plot of the original data and color by the `winner_id`.  Do the groupings make sense?

We can plot the weight vectors in our higher dimensional space as well, but this gets pretty busy.  It could be prettied up with more complex pre-processing.

In [None]:
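# `weights` is the array returned by `.get_weights()`.
# Flatten the grid to one row per node; -1 infers the node count from the grid size.
weight_df = pd.DataFrame(weights.reshape(-1, n_cols))
weight_df.columns = [f"x{i}" for i in range(n_cols)]
weight_df["winner_id"] = "weight_vector"

# Stack observations and weight vectors so both appear in one plot
plot_df = pd.concat((df, weight_df), sort=False, ignore_index=True)

# Label each row as an observation or a weight vector, and size the markers accordingly
plot_df["type"] = plot_df["winner_id"]
plot_df.loc[plot_df["type"] != "weight_vector", "type"] = "observation"
plot_df["size"] = plot_df["type"].map({"weight_vector": 5, "observation": 15})

px.scatter_3d(
    plot_df, x="x0", y="x1", z="x2", color="winner_id", symbol="type", size="size"
)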