# Visualizations large SOM - Wine quality dataset

In this notebook we prepare demonstrations of different PySOMVis visualisations.

Dataset: Wine Quality (see https://archive.ics.uci.edu/dataset/186/wine+quality), with 11 physicochemical features and 4898 instances, on which we train a 200x150 SOM with MiniSom. For performance reasons the SOM is trained outside of the notebook, and the visualisations are gathered from a running pysom instance (see wine_l_pysom.py).

GitHub project: https://github.com/stephan-klein/PySOMVisJ5



## Hit histogram
The hit histogram visualizes how often each neuron is hit, i.e. how many input vectors are mapped to it as their best matching unit (BMU).
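
As a rough illustration of the counting behind a hit histogram, the following sketch uses plain NumPy. The function name `hit_histogram`, the array layout `(rows, cols, dim)` for the weights and the toy values are assumptions made for this example; this is not the PySOMVis implementation.

```python
import numpy as np

def hit_histogram(weights, data):
    """Count, for every SOM unit, how many samples have it as their BMU."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)                  # one row per unit
    # Flat index of the closest unit (Euclidean distance) for each sample.
    bmu = np.array([np.linalg.norm(flat - x, axis=1).argmin() for x in data])
    return np.bincount(bmu, minlength=rows * cols).reshape(rows, cols)

# Toy 2x2 map with 2-dimensional weights (made-up values).
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
data = np.array([[0.1, 0.1], [0.9, 0.8], [0.2, 0.0], [1.0, 1.0]])
hits = hit_histogram(weights, data)   # two hits each on units (0, 0) and (1, 1)
```

The counts always sum to the number of samples, so on a large map many units stay at zero and the hit units stand out.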

Compared to the small SOM, there are now visible clusters. We assume this is because the SOM size is now more adequate.

![](img/winel_hit_magma.png)
Hit histogram



## Smoothed data histogram

The SDH is an extension of hit histograms that maps input vectors onto n-best matching units and achieves a smoothing effect.
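
The n-best-matching-unit idea can be sketched as follows. We assume the commonly described voting scheme in which the closest unit receives `s` points, the second closest `s - 1`, and so on; the name `smoothed_data_histogram` and the toy map are hypothetical, and the PySOMVis variants (e.g. the weighted one) may differ.

```python
import numpy as np

def smoothed_data_histogram(weights, data, s):
    """Each sample votes for its s closest units with s, s-1, ..., 1 points."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    sdh = np.zeros(rows * cols)
    votes = np.arange(s, 0, -1)               # s points for the BMU, 1 for the s-th best
    for x in data:
        dist = np.linalg.norm(flat - x, axis=1)
        best = np.argsort(dist)[:s]           # indices of the s best-matching units
        sdh[best] += votes
    return sdh.reshape(rows, cols)

# Toy 1x3 map with 1-dimensional weights 0, 1 and 2 (made-up values).
weights = np.array([[[0.0], [1.0], [2.0]]])
data = np.array([[0.1], [0.2]])
sdh = smoothed_data_histogram(weights, data, s=2)
plain = smoothed_data_histogram(weights, data, s=1)   # s=1 reduces to a hit histogram
```

With `s=1` only the BMU receives a vote, which is why the SDH can be seen as a generalisation of the hit histogram.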

The visualization is very similar to the hit histogram, but some very small clusters disappear and the clusters appear smoother overall.

![](img/winel_sdh_10000.png)
SDH with smoothing factor 10000

![](img/winel_sdh_weighted_20000.png)
SDH with weighted factor 20000


## Neighbourhood graph

Neighbourhood graphs visualize which areas of the SOM are in proximity based on the input space.
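
A minimal sketch of the k-nearest-neighbour variant, under the assumption that an edge is drawn between two units whenever a sample and one of its k nearest neighbours in input space have different BMUs. The function name and toy data are made up for illustration.

```python
import numpy as np

def knn_neighbourhood_edges(weights, data, k):
    """Connect the BMUs of samples that are k-nearest neighbours in input space."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    bmu = np.array([np.linalg.norm(flat - x, axis=1).argmin() for x in data])
    edges = set()
    for i, x in enumerate(data):
        dist = np.linalg.norm(data - x, axis=1)
        dist[i] = np.inf                       # a sample is not its own neighbour
        for j in np.argsort(dist)[:k]:
            a, b = int(bmu[i]), int(bmu[j])
            if a != b:                         # no line inside a single unit
                edges.add((min(a, b), max(a, b)))
    # Convert flat unit indices to (row, col) grid positions.
    return sorted((divmod(a, cols), divmod(b, cols)) for a, b in edges)

# Toy 1x3 map with 1-dimensional weights 0, 1 and 2 (made-up values).
weights = np.array([[[0.0], [1.0], [2.0]]])
data = np.array([[0.0], [0.1], [1.0], [2.0]])
edges = knn_neighbourhood_edges(weights, data, k=1)
```

Edges between units that are far apart on the grid are the ones that hint at topology violations.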

We plotted the neighbourhood graph for 2 nearest neighbours (since 5 caused too many lines) over a hit histogram. 
Compared to the small SOM, the clusters are now clearly identifiable, but still with a high number of topology violations.


![](img/winel_nbh_hithist_knn2.png)
Neighbourhood graph 2 nearest neighbours


![](img/winel_nbh_hithist_knn2_zoom.png)
Neighbourhood graph 2 nearest neighbours zoomed in


## Sky Metaphor

Sky Metaphor is another density visualization, but maps data items on the exact position within a unit and therefore helps identify similarity between inputs within the same unit or across neighbouring units more accurately.

The visualization is more "irregular" than other density visualizations since data items are not centered within units anymore.


![](img/winel_starmap.png)
Sky Metaphor - pull factor 0.25

![](img/winel_starmap_zoom.png)
Sky Metaphor zoomed in

## Activity Histogram

The Activity Histogram per data point visualizes the distance between input vector and all weight vectors.
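
For a single input vector this amounts to one distance per unit. A small sketch, assuming Euclidean distance and a weight array of shape `(rows, cols, dim)` (both assumptions for this example):

```python
import numpy as np

def activity_histogram(weights, x):
    """Distance between one input vector x and the weight vector of every unit."""
    return np.linalg.norm(weights - x, axis=2)   # shape (rows, cols)

# Toy 2x2 map with 2-dimensional weights (made-up values).
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
activity = activity_histogram(weights, np.array([0.0, 0.0]))
```

The unit with the smallest value is the BMU of the chosen sample.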

We chose two input vectors: 253 and 2886. We notice one neuron at coordinates ~(-0.02,-0.5) that is consistently red for almost all input vectors. This indicates that its weight vector lies far from nearly all samples.

![](img/winel_achisto_253.png)
Activity histogram - Sample 253

![](img/winel_achisto_2886.png)
Activity histogram - Sample 2886

## Minimum spanning tree

The Minimum Spanning Tree visualizes related nodes on the map by connecting similar nodes with each other.  The weights of the edges are computed by a distance metric between the vectors of the vertices and subsequently minimized.[0]
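
To make the idea concrete, here is a minimal sketch of Prim's algorithm on the complete graph of SOM units, with the Euclidean distance between weight vectors as edge weight. Both the choice of Prim's algorithm and the function name are assumptions for illustration; the cited paper and PySOMVis may construct the tree differently.

```python
import numpy as np

def minimum_spanning_tree(weights):
    """Prim's algorithm over all units; returns edges as pairs of flat unit indices."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    n = len(flat)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = np.linalg.norm(flat - flat[0], axis=1)   # cheapest link into the tree
    best_from = np.zeros(n, dtype=int)                   # tree unit providing that link
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best_dist)
        j = int(candidates.argmin())                     # closest unit outside the tree
        edges.append((int(best_from[j]), j))
        in_tree[j] = True
        dist_j = np.linalg.norm(flat - flat[j], axis=1)
        closer = (dist_j < best_dist) & ~in_tree
        best_dist[closer] = dist_j[closer]
        best_from[closer] = j
    return edges

# Toy 1x4 map with 1-dimensional weights 0, 1, 3 and 7 (made-up values).
weights = np.array([[[0.0], [1.0], [3.0], [7.0]]])
mst_edges = minimum_spanning_tree(weights)
```

This sketch needs on the order of n² distance evaluations for n units, so its cost grows quickly with the map size.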

Unfortunately, for the large SOM PySOMVis was not able to produce the visualisation in a reasonable amount of time (on a high-performance cloud server with 4 CPUs and 32GB RAM, run directly with Python and not via Jupyter).

 [0]: https://www.ifs.tuwien.ac.at/~mayer/publications/pdf/may_icann10.pdf ,accessed 02.02.2024

## Cluster Connections

In this visualization technique, connecting lines are drawn between units based on threshold values. The intensity of the connections between nodes indicates the similarity of underlying data points.

We observe some issues with this visualisation on the large SOM. The connections are not rendered in the same way as for the small SOM, and we can produce some artifacts which only show up when zooming or panning.

![](img/winel_clustercomp.png)
(Potentially Faulty) Cluster Connection Visualization

![](img/winel_clustercomp_onscroll.png)
Artifacts on scrolling

## U-Matrix
The U-Matrix visualization displays the distances between neurons on the SOM grid. Low values correspond to small distances between neighbouring neurons, whereas high values indicate large distances and can be used to identify cluster boundaries. 

In contrast to the U-Matrix of the small SOM, multiple smaller clusters, some with clear boundaries, can be observed here.

![](img/winel_umatrix.png)
U-Matrix

## D-Matrix
The D-Matrix is similar to the U-Matrix, but averages the distance instead of using interpolation.
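
The averaging step can be sketched as follows, assuming a rectangular grid with at least two units and a 4-neighbourhood (up, down, left, right). These are assumptions for this example; PySOMVis may use a different neighbourhood.

```python
import numpy as np

def d_matrix(weights):
    """Average distance from each unit's weight vector to its 4-connected grid neighbours."""
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    # Distances between vertically and horizontally adjacent units.
    down = np.linalg.norm(weights[1:] - weights[:-1], axis=2)         # (rows-1, cols)
    right = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)  # (rows, cols-1)
    # Every distance contributes to both units it connects.
    total[1:] += down
    count[1:] += 1
    total[:-1] += down
    count[:-1] += 1
    total[:, 1:] += right
    count[:, 1:] += 1
    total[:, :-1] += right
    count[:, :-1] += 1
    return total / count

# Toy 1x3 map with 1-dimensional weights 0, 1 and 3 (made-up values).
weights = np.array([[[0.0], [1.0], [3.0]]])
dm = d_matrix(weights)
```

Because each unit gets a single averaged value, the result has the same shape as the map itself.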

This results in a similar visualisation, but with smoother transitions between "mountains" and "valleys". The boundaries are therefore not as clear.

![](img/winel_dmatrix.png)
D-Matrix


## P-Matrix & U*-Matrix

Unlike the U-Matrix, P-Matrix is a density and not a distance based metric. It involves estimating the empirical density at each neuron's weight vector in the feature space.
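
One simple way to estimate such a density is to count the samples inside a hypersphere around each weight vector. The sketch below takes the `radius` as a free, hypothetical parameter; how the radius is chosen (for example via the Pareto radius proposed by Ultsch) is left out here.

```python
import numpy as np

def p_matrix(weights, data, radius):
    """Number of samples within the given radius of each unit's weight vector."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    # Distance from every unit to every sample, shape (units, samples).
    dist = np.linalg.norm(data[None, :, :] - flat[:, None, :], axis=2)
    return (dist <= radius).sum(axis=1).reshape(rows, cols)

# Toy 1x2 map with 1-dimensional weights 0 and 10 (made-up values).
weights = np.array([[[0.0], [10.0]]])
data = np.array([[0.5], [1.0], [9.0]])
pm = p_matrix(weights, data, radius=1.5)
```

Note that this builds the full unit-by-sample distance array at once, which would have to be computed in batches for a large map.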

The U*-Matrix combines both distance and density information, enhancing cluster visualization by adjusting the U-Matrix with density-derived scale factors.
  
An interesting observation is the high density of most regions of the SOM. According to A. Ultsch, "Maps for the visualization of high-dimensional data spaces", *"neurons with large P-heights are situated in dense regions of the dataspace"* and *"„plateaus“ on a P-Matrix point to cluster centers"*.


![](img/winel_pmatrix_optimal.png)
P-Matrix with optimal values

![](img/winel_ustarmatrix_optimal.png)
U*-Matrix (=P-Matrix + U-Matrix) with optimal values


## Pie chart

We plotted the pie chart over a hit histogram and over a U-Matrix.
Each quality score in the dataset (target) is mapped to one color as shown in the legend, and the pie charts show which classes are present in each unit and how they are distributed.
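
The per-unit class distribution behind each pie can be sketched as a count of class labels per BMU. The function name and the toy labels are assumptions for this example.

```python
import numpy as np

def class_counts_per_unit(weights, data, labels):
    """For every unit, count how many samples of each class are mapped to it."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    classes = np.unique(labels)                    # sorted distinct class labels
    counts = np.zeros((rows * cols, len(classes)), dtype=int)
    for x, label in zip(data, labels):
        bmu = np.linalg.norm(flat - x, axis=1).argmin()
        counts[bmu, np.searchsorted(classes, label)] += 1
    return classes, counts.reshape(rows, cols, len(classes))

# Toy 1x2 map with 1-dimensional weights and made-up quality scores 5 and 6.
weights = np.array([[[0.0], [10.0]]])
data = np.array([[0.0], [1.0], [9.0], [10.0]])
labels = np.array([5, 6, 6, 6])
classes, counts = class_counts_per_unit(weights, data, labels)
```

Each slice `counts[r, c]` holds the pie segments of one unit, in the order given by `classes`.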

It seems that the cluster at the center right position has predominantly wines with higher quality ratings. 

![](img/winel_piechart.png)
Pie Chart over Hit histogram


## Chessboard

Chessboard visualization is a type of class coloring visualization, combining Voronoi Tessellation and chessboard-style pixel coloring according to dominant classes.

We identify an issue with the generation of the Voronoi Tessellation, as the borders do not seem to be generated properly. However, the Voronoi cells themselves seem to be generated and colored according to the algorithm. Unfortunately, no legend for the class instances is generated (despite providing class labels), so we cannot interpret the result. We still obtain a good overview of the overall class frequency distribution.

![](img/winel_chessboard_voronoi.png)
Chessboard visualization



## Component planes
The component planes visualization shows the distribution of the weights for the selected attributes across the SOM units. 
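
In array terms a component plane is a single slice of the weight array. A tiny sketch, assuming the weights are stored with shape `(rows, cols, dim)` and using random values in place of a trained map:

```python
import numpy as np

def component_plane(weights, k):
    """Return the k-th weight component of every unit as a (rows, cols) array."""
    return weights[:, :, k]

rng = np.random.default_rng(0)
weights = rng.random((3, 4, 11))      # made-up 3x4 map with 11 features
plane = component_plane(weights, 0)   # plane of the first feature
```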

There seems to be more overlap between the components here compared to the component planes visualization of the small SOM. Some of the clusters can be identified at least partially in each of the component visualizations.

![](img/winel_comp_0_fixedacidity.png)
Component 0 - Fixed acidity

![](img/winel_comp_1_volatile_acidity.png)
Component 1 - Volatile acidity

![](img/winel_comp_2_citric_acidity.png)
Component 2 - Citric acid

![](img/winel_comp_3_residual_sugar.png)
Component 3 - Residual sugar

![](img/winel_comp_4_chlorides.png)
Component 4 - Chlorides

![](img/winel_comp_5_free_sulfur_dioxide.png)
Component 5 - Free sulfur dioxide

![](img/winel_comp_6_total_sulfur_dioxide.png)
Component 6 - Total sulfur dioxide

![](img/winel_comp_7_density.png)
Component 7 - Density

![](img/winel_comp_8_pH.png)
Component 8 - pH

![](img/winel_comp_9_sulphates.png)
Component 9 - Sulphates

![](img/winel_comp_10_sugar.png)
Component 10 - Alcohol

## Metro Map

MetroMap is similar to component planes, but groups weights of the selected attribute into bins.
Component lines connect the centers of gravity of each bin.
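
The binning and the centers of gravity can be sketched as below. We assume equal-width bins over the value range of one component and an unweighted centroid of the grid positions in each bin; the actual PySOMVis binning may differ, and the function name is hypothetical.

```python
import numpy as np

def metro_line(weights, k, n_bins):
    """Bin the k-th component plane and return the grid centroid of each non-empty bin."""
    plane = weights[:, :, k]
    edges = np.linspace(plane.min(), plane.max(), n_bins + 1)
    bins = np.digitize(plane, edges[1:-1])     # bin index 0 .. n_bins-1 per unit
    rr, cc = np.indices(plane.shape)           # row and column index of every unit
    centres = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            centres.append((rr[mask].mean(), cc[mask].mean()))
    return centres                             # consecutive centres form the component line

# Toy 1x4 map whose single component increases from left to right (made-up values).
weights = np.array([[[0.0], [1.0], [2.0], [3.0]]])
centres = metro_line(weights, k=0, n_bins=2)
```

Connecting the centres from the lowest to the highest bin gives a line that follows the direction in which the attribute increases across the map.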

When one attribute is selected with the option of 4 bins, we see how the fixed acidity values are distributed into the bins.

When multiple attributes are selected, we observe that the component lines, and thus the gradient directions of the weight bins, point in different directions. This is another indication of the inhomogeneity of the different attributes of the dataset.


![](img/winel_metro_comp10_4bins.png)
Metro Map - 1 attribute, 4 bins

![](img/winel_metro_allcomp_level0.4.png)
Metro Map - all attributes

## Clustering

Clustering is a non-deterministic division of the map into regions based on the weights. There are two approaches: k-means and agglomerative clustering.

With 4 clusters we see that, while the SOM as a whole is rather noisy, there are well-defined clusters of larger size. With agglomerative clustering the noise is reduced, which matches our experience with the small SOM.

![](img/winel_cluster_kmeans4.png)
KMeans clustering with 4 clusters

![](img/winel_cluster_agglo_complete_10clusters.png)
Agglomerative clustering with 10 clusters



## Quantization error

The visualization shows the average distance between the input vectors and their best matching unit and serves as an indication of how well the map is trained.
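
A per-unit version of this measure can be sketched as follows, assuming Euclidean distance and assigning an error of 0 to units without any hits (both are choices made for this example, not necessarily those of PySOMVis).

```python
import numpy as np

def quantization_error_map(weights, data):
    """Mean distance between each unit's weight vector and the samples mapped to it."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    total = np.zeros(rows * cols)
    hits = np.zeros(rows * cols)
    for x in data:
        dist = np.linalg.norm(flat - x, axis=1)
        bmu = dist.argmin()
        total[bmu] += dist[bmu]
        hits[bmu] += 1
    # Units without hits keep an error of 0 instead of dividing by zero.
    mean = np.divide(total, hits, out=np.zeros_like(total), where=hits > 0)
    return mean.reshape(rows, cols)

# Toy 1x2 map with 1-dimensional weights 0 and 10 (made-up values).
weights = np.array([[[0.0], [10.0]]])
data = np.array([[1.0], [3.0], [10.0]])
qe = quantization_error_map(weights, data)
```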

We notice some quantization errors, but compared to the small SOM it seems the big map is better trained.


![](img/winel_quanterror.png)
Quantization error

## Topographic Error

The topographic error visualizes how well the SOM preserves the topology of the input space by calculating the percentage of data samples for which the first and second BMU are not placed in adjacent units in the SOM.
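
As a global number, this can be sketched as the fraction of samples whose two closest units are not grid neighbours. The sketch assumes a rectangular grid with at least two units and a 4-unit neighbourhood; the function name and the toy map are made up.

```python
import numpy as np

def topographic_error(weights, data):
    """Fraction of samples whose first and second BMU are not adjacent on the grid."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    errors = 0
    for x in data:
        first, second = np.argsort(np.linalg.norm(flat - x, axis=1))[:2]
        r1, c1 = divmod(int(first), cols)
        r2, c2 = divmod(int(second), cols)
        # Adjacent means exactly one step apart in the 4-neighbourhood.
        if abs(r1 - r2) + abs(c1 - c2) > 1:
            errors += 1
    return errors / len(data)

# Toy 1x3 map that is "folded": weights 0, 5 and 1 along the row (made-up values).
weights = np.array([[[0.0], [5.0], [1.0]]])
data = np.array([[0.2], [4.0]])
te = topographic_error(weights, data)
```

In the toy map the sample 0.2 has its two closest units at opposite ends of the row, so it counts as an error, while the sample 4.0 does not.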
 
We notice topographic errors in the clusters, however these might be false positive errors since the clusters are rather dense. 

![](img/winel_topoerror_4unitnbh.png)
Topographic error


## SOMStreamVis

The SOMStreamVis plots the BMU indexes over time (natural order of samples) and can provide additional information to a SOM visualization.

As Wine Quality is not a time series, the visualisation does not apply to this dataset.

## Intrinsic distance

Intrinsic distance visualization combines topographic and quantization error visualizations.

Due to performance issues, we weren't able to generate the visualization in a reasonable amount of time (on a high-performance cloud server with 4 CPUs and 32GB RAM, run directly with Python and not via Jupyter).

## Mnemonic SOM

We used mnemonic SOM visualization to plot the SOM as a stick figure.
Mnemonic SOM visualization eases identification of clusters in a SOM - in this case we can notice a cluster in the left hip area of the stick figure.
 

![](img/winel_mnemonic.png)
Mnemonic SOM - stick figure