# Visualizations small SOM - Wine Quality

## 1. Hit histogram
The hit histogram visualizes the frequency with which neurons get hit during the training of a SOM.
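
As a minimal sketch of the counting step (this is not PySOMVis code; the function names and toy values are made up for illustration), a hit histogram can be computed by assigning every input vector to its best matching unit (BMU) and counting the hits per unit:

```python
import math

def best_matching_unit(x, weights):
    """Return the index of the weight vector closest to x (Euclidean distance)."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

def hit_histogram(data, weights):
    """Count how often each unit is the BMU of an input vector."""
    hits = [0] * len(weights)
    for x in data:
        hits[best_matching_unit(x, weights)] += 1
    return hits

# Toy example: 3 units and 4 two-dimensional inputs (hypothetical values).
weights = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
data = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (2.2, 1.9)]
hits = hit_histogram(data, weights)
print(hits)  # [1, 2, 1]
```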

We observe that the BMU distribution is spread out over large areas of the SOM (compared to room occupancy), with two peaks: one in the center (-0.2, -0.1) and one at the top left (-0.45, 0.25).
We compare three color schemes for the same area: rainbow, a mono sequential scheme (from white to dark red) and a uniform sequential scheme (inferno).

We conclude that the mono sequential scheme offers the best visibility for quickly identifying high-value clusters.

![](img/wine_hit_rainbow.png)

![](img/wine_hit_reds.png)

![](img/wine_hit_inferno.png)

## 2. Smoothed data histogram

The SDH is an extension of hit histograms that maps input vectors onto n-best matching units and achieves a smoothing effect.
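
The voting step can be sketched as follows. This is only an illustration under one common weighting assumption (the closest unit receives `s` points, the second closest `s - 1`, and so on, normalised so that each input contributes 1 in total); whether PySOMVis uses exactly this weighting is not confirmed here, and all names and values are hypothetical:

```python
import math

def smoothed_data_histogram(data, weights, s):
    """Spread each input over its s closest units with linearly decreasing votes.

    Assumes s <= len(weights).
    """
    total = s * (s + 1) / 2  # sum of s, s-1, ..., 1 (normalisation constant)
    sdh = [0.0] * len(weights)
    for x in data:
        # Units ordered from closest to farthest for this input.
        ranked = sorted(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
        for rank, unit in enumerate(ranked[:s]):
            sdh[unit] += (s - rank) / total
    return sdh

# Toy example: 4 one-dimensional units, 3 inputs (hypothetical values).
weights = [(0.0,), (1.0,), (2.0,), (3.0,)]
data = [(0.1,), (0.2,), (2.9,)]
sdh_1 = smoothed_data_histogram(data, weights, s=1)  # identical to the hit histogram
sdh_2 = smoothed_data_histogram(data, weights, s=2)  # neighbouring units also receive votes
```

With `s=1` every input votes only for its BMU, so the result equals the hit histogram; with larger `s` the votes spill over to nearby units, which produces the smoothing effect.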

To increase the smoothing we also activated interpolation.

We experimented with different smoothing factors. As expected, a smoothing factor of 1 gives the same visualization as the hit histogram. Bigger and smoother clusters become visible with higher factors such as 50.
With the weighted SDH approach in particular, additional high density clusters emerge more clearly, for example at the bottom (0.2, -0.4).

![SDH](img/wine_sdh_1_interpol.png)
SDH with smoothing factor 1

![SDH weighted factor](img/wine_sdh_50_interpol.png)
SDH with smoothing factor 50


## 3. Neighbourhood graph

Neighbourhood graphs visualize which areas of the SOM are in proximity based on the input space.
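
A minimal sketch of the k-nearest-neighbour variant, assuming Euclidean distance and that an edge is drawn between the BMUs of two inputs whenever one input is among the k nearest neighbours of the other (function names and toy values are hypothetical, not taken from PySOMVis):

```python
import math

def bmu(x, weights):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

def knn_neighbourhood_edges(data, weights, k):
    """Connect the BMUs of inputs that are k-nearest neighbours in input space."""
    bmus = [bmu(x, weights) for x in data]
    edges = set()
    for i, x in enumerate(data):
        # All other inputs, ordered by distance to x in the input space.
        others = sorted((j for j in range(len(data)) if j != i),
                        key=lambda j: math.dist(x, data[j]))
        for j in others[:k]:
            if bmus[i] != bmus[j]:  # no edge inside a single unit
                edges.add((min(bmus[i], bmus[j]), max(bmus[i], bmus[j])))
    return edges

# Toy example: 4 units in a row, 3 one-dimensional inputs (hypothetical values).
weights = [(0.0,), (1.0,), (2.0,), (3.0,)]
data = [(0.4,), (0.6,), (2.9,)]
edges = knn_neighbourhood_edges(data, weights, k=1)
print(sorted(edges))  # [(0, 1), (1, 3)]
```

Edges between units that are far apart on the map, such as `(1, 3)` in this toy example, are the kind of long connections that show up as clutter in the plots.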

We plotted the neighbourhood graph for the 4 nearest neighbours over a hit histogram and for the neighbours within radius 1 in the input space. In both cases we see many topology violations, but also some local connections when zooming in on a cluster.

We double-checked the SOM training and could not find any issues. We even retrained with multiple epochs, but the result did not fundamentally change. Inspecting the individual attribute weights (with the component planes), we see poor correlation across the different attributes. We suspect that, when the BMUs are calculated, the heterogeneous attributes cancel each other out, causing the noise in the SOM.

![](img/wine_neighbours_knn4.png)
KNN 4

![](img/wine_neighbours_radius1.png)
Radius 1

![](img/wine_neighbours_radius1_zoom.png)
Radius 1 - Zoomed to a cluster

## 4. Sky Metaphor

Sky Metaphor is another density visualization, but maps data items on the exact position within a unit and therefore helps identify similarity between inputs within the same unit or across neighbouring units more accurately.

The visualization is more "irregular" than other density visualizations since data items are not centered within units anymore. We notice that the image becomes less cloudy with a higher pull factor.

![](img/wine_sky_pf_0.25.png)
Sky Metaphor - pull factor 0.25

![](img/wine_sky_pf_0.5.png)
Sky Metaphor - pull factor 0.5


## 5. Activity Histogram

The Activity Histogram per data point visualizes the distance between input vector and all weight vectors.

We chose two input vectors: 253 and 2886. We notice one neuron at coordinates ~(0.4, -0.2) that is consistently red for almost all input vectors. This indicates a high distance between this neuron's weight vector and almost all samples.

![](img/wine_activityhist_253.png)
Activity histogram - Sample 253

![](img/wine_activityhist_2886.png)
Activity histogram - Sample 2886

## 6. Minimum spanning tree

The Minimum Spanning Tree visualizes related nodes on the map by connecting similar nodes with each other. The weights of the edges are computed by a distance metric between the vectors of the vertices and subsequently minimized (see https://www.ifs.tuwien.ac.at/~mayer/publications/pdf/may_icann10.pdf).
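
To illustrate the idea, the following sketch builds a minimum spanning tree over the weight vectors of all units with Prim's algorithm, using Euclidean distance as edge weight. It is an assumption that this resembles what the tool computes; the function name and toy values are hypothetical:

```python
import math

def minimum_spanning_tree(weights):
    """Prim's algorithm on the complete graph of units (edge cost = Euclidean distance)."""
    n = len(weights)
    in_tree = {0}  # start from an arbitrary unit
    edges = []
    while len(in_tree) < n:
        # Cheapest edge that leaves the current tree.
        cost, u, v = min(
            (math.dist(weights[u], weights[v]), u, v)
            for u in in_tree
            for v in range(n)
            if v not in in_tree
        )
        in_tree.add(v)
        edges.append((u, v))
    return edges

# Toy example: 4 units with two-dimensional weight vectors (hypothetical values).
weights = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (1.0, 1.0)]
mst_edges = minimum_spanning_tree(weights)
print(mst_edges)  # [(0, 1), (1, 3), (1, 2)]
```

This naive version compares every tree unit with every remaining unit in each step, which is fine for a toy example but grows quickly with the number of units.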

There are four available settings for the distance in PySOMVis: all, diagonal, direct, and MST input data. We could only test the "all" option, due to performance problems with the other options.


![](img/wine_mspt_all.png)


## 7. Cluster Connections

In this visualization technique, connecting lines are drawn between units based on threshold values. The intensity of the connections between nodes indicates the similarity of underlying data points.

We notice for this SOM that there are few clear cluster boundaries, even for higher thresholds such as 0.50. Only the center-right area stands out with few connections.

![](img/wine_clustercon_0.27.png)
Cluster connections - low threshold

![](img/wine_clustercon_0.50.png)
Cluster connections - high threshold

## 8. U-Matrix
The U-Matrix visualization displays the distances between neurons on the SOM grid. Low values correspond to small distances between neighbouring neurons, whereas high values indicate large distances and can be used to identify cluster boundaries. 
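
The building block of such a visualization is the distance between the weight vectors of units that are adjacent on the grid. The sketch below computes these distances for a rectangular grid; the grid layout, the function name and the toy values are assumptions for illustration, and how the tool turns the distances into the final image is not covered:

```python
import math

def neighbour_distances(grid):
    """grid[r][c] is the weight vector of the unit in row r, column c.

    Returns the distances between horizontally and vertically adjacent units.
    """
    rows, cols = len(grid), len(grid[0])
    horizontal = [[math.dist(grid[r][c], grid[r][c + 1]) for c in range(cols - 1)]
                  for r in range(rows)]
    vertical = [[math.dist(grid[r][c], grid[r + 1][c]) for c in range(cols)]
                for r in range(rows - 1)]
    return horizontal, vertical

# Toy 2 x 3 map: the third column is far away from the first two (hypothetical values).
grid = [
    [(0.0,), (0.1,), (1.0,)],
    [(0.0,), (0.1,), (1.0,)],
]
horizontal, vertical = neighbour_distances(grid)
```

In this toy map the distances between the second and third column are much larger than all others, which is the pattern that would be read as a cluster boundary.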

The U-Matrix visualization leads to similar findings as the previously discussed visualizations: the data is noisy, with a mix of low and high values across the SOM. There is one identifiable cluster boundary in the center-right region.


![](img/wine_umatrix.png)
U-Matrix

## 9. D-Matrix
The D-Matrix is similar to the U-Matrix, but averages the distance instead of using interpolation.

This results in a similar visualisation, but with smoother transitions between "mountains" and "valleys". The boundaries are therefore not as clear.


![](img/wine_dmatrix.png)



## 10. P-Matrix & U*-Matrix

Unlike the U-Matrix, the P-Matrix is a density-based rather than a distance-based visualization. It involves estimating the empirical density at each neuron's weight vector in the feature space.
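
One simple way to estimate such a density, sketched here under the assumption that it is the number of data points within a fixed radius around each weight vector (the radius and all values below are hypothetical):

```python
import math

def p_matrix(data, weights, radius):
    """Number of input vectors within `radius` of each unit's weight vector."""
    return [sum(1 for x in data if math.dist(x, w) <= radius) for w in weights]

# Toy example: 2 units, 4 one-dimensional inputs (hypothetical values).
weights = [(0.0,), (5.0,)]
data = [(0.1,), (0.2,), (-0.3,), (4.9,)]
density = p_matrix(data, weights, radius=0.5)
print(density)  # [3, 1]
```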

The U*-Matrix combines both distance and density information, enhancing cluster visualization by adjusting the U-Matrix with density-derived scale factors.
  
We see that the low density regions (0.5, -0.1) and (-0.4, 0.1) shown in the P-Matrix coincide with high distance regions (U-Matrix), resulting in an overall uniform distribution in the combined visualization (U*-Matrix).

![](img/wine_pmatrix_p1_r1.8.png)
P-Matrix

![](img/wine_umatrix_p1_r1.8.png)
U*-Matrix

![](img/wine_umatrix_p14_r2.8.png)
U*-Matrix with higher percentile and radius

## 11. Pie chart

We plotted the pie chart over a hit histogram and over a U-Matrix.
Each quality score in the dataset (target) is mapped to one color, as shown in the legend, and the pie charts show which classes are present in each unit and how they are distributed. We notice that the classes are distributed over the entire map and that no clear trends are visible, a further indication of the general noisiness of the SOM.


![](img/wine_pie_hhist_full.png)
Pie Chart over Hit histogram

![](img/wine_pie_umatrix_full.png)
Pie Chart over U-Matrix

## 12. Chessboard

Chessboard visualization is a type of class coloring visualization, combining Voronoi tessellation and chessboard-style pixel coloring according to dominant classes.


![](img/wine_chessboardandvoronoi.png)
Chessboard visualization


## 13. Component planes
The component planes visualization shows the distribution of the weights for the selected attributes across the SOM units. 

Analysing the component planes shows no apparent correlation across all components.
Some component pairs do unsurprisingly correlate, such as residual sugar and sugar, free sulfur dioxide and total sulfur dioxide, or the inversely correlated citric acidity and volatile acidity.

![](img/wine_comp_0_fixedacidity.png)
Component 0 - Fixed acidity

![](img/wine_comp_1_volatile_acidity.png)
Component 1 - Volatile acidity

![](img/wine_comp_2_citric_acidity.png)
Component 2 - Citric acidity

![](img/wine_comp_3_residual_sugar.png)
Component 3 - Residual sugar

![](img/wine_comp_4_chlorides.png)
Component 4 - Chlorides

![](img/wine_comp_5_free_sulfur_dioxide.png)
Component 5 - Free sulfur dioxide

![](img/wine_comp_6_total_sulfur_dioxide.png)
Component 6 - Total sulfur dioxide

![](img/wine_comp_7_density.png)
Component 7 - Density

![](img/wine_comp_8_pH.png)
Component 8 - pH

![](img/wine_comp_9_sulphates.png)
Component 9 - Sulphates

![](img/wine_comp_10_sugar.png)
Component 10 - Sugar




## 14. Metro Map

The Metro Map is similar to component planes, but groups the weights of the selected attribute into bins.
Component lines connect the centers of gravity of each bin.

When one attribute is selected with the option of 4 bins, we see how the fixed acidity values are distributed into the bins.

If multiple attributes are selected, only one bin is visualized.



![](img/wine_metro_class0_4bins.png)
Metro Map - 1 attribute, 4 bins

![](img/wine_metro_allclasses_level_0.6.png)
Metro Map - all attributes

## 15. Clustering

Clustering is a non-deterministic division of the map into regions based on the weights.
There are two approaches: k-means and agglomerative clustering.

With 4 clusters, we observe that neurons belonging to the same cluster are in relative proximity, with only a couple of outliers. Our main cluster with low sensor data readings is not visualized; however, this can change since clustering is non-deterministic.
When increasing the number of clusters, we again see topology violations in the neurons belonging to the cluster with low sensor data readings.

With agglomerative clustering we see clusters get aggregated, especially the low sensor data cluster.


![](img/wine_cluster_kmeans3.png)
![](img/wine_cluster_kmeans15.png)
![](img/wine_cluster_agglo_ward10.png)
![](img/wine_cluster_agglo_complete10.png)




## 16. Quantization error

The visualization shows the average distance between the input vectors and their best matching unit, and serves as an indication of how well the map is trained.
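
A minimal sketch of a per-unit quantization error, assuming Euclidean distance and assigning an error of 0 to units without any hits (both are assumptions; names and values are hypothetical):

```python
import math

def quantization_error_per_unit(data, weights):
    """Mean distance between each unit's weight vector and the inputs mapped to it."""
    sums = [0.0] * len(weights)
    counts = [0] * len(weights)
    for x in data:
        # Distance to the BMU and the BMU's index.
        d, i = min((math.dist(x, w), i) for i, w in enumerate(weights))
        sums[i] += d
        counts[i] += 1
    # Units that never win are reported with an error of 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Toy example: 2 units, 3 one-dimensional inputs (hypothetical values).
weights = [(0.0,), (1.0,)]
data = [(0.1,), (0.3,), (1.0,)]
qe = quantization_error_per_unit(data, weights)  # roughly [0.2, 0.0]
```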

We observe only one neuron with a high quantization error.

![](img/wine_quant_error.png)




## 17. Topographic Error

The topographic error visualizes how well the SOM preserves the topography of the input
space by calculating the percentage of data samples for which the first and second BMU are not placed in adjacent units in the SOM.
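
The definition above can be sketched as follows, assuming a rectangular grid and a 4-unit neighbourhood (two units count as adjacent if they differ by one step horizontally or vertically). The function name, the grid positions and the toy values are hypothetical:

```python
import math

def topographic_error(data, weights, positions):
    """Fraction of inputs whose first and second BMU are not adjacent on the grid.

    positions[i] is the (row, column) of unit i on the map.
    """
    errors = 0
    for x in data:
        first, second = sorted(range(len(weights)),
                               key=lambda i: math.dist(x, weights[i]))[:2]
        (r1, c1), (r2, c2) = positions[first], positions[second]
        if abs(r1 - r2) + abs(c1 - c2) > 1:  # not direct neighbours
            errors += 1
    return errors / len(data)

# Toy 1 x 3 map: units 0 and 2 are similar in input space but not adjacent on the map.
positions = [(0, 0), (0, 1), (0, 2)]
weights = [(0.0,), (5.0,), (1.0,)]
data = [(0.4,), (4.0,)]
te = topographic_error(data, weights, positions)
print(te)  # 0.5
```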

We notice that the cluster with low sensor readings has a high number of topographic errors, but the cluster is dense and, according to the lecture, this might be misleading.

![](img/wine_topo_error_4unitnbh.png)


## 18. SOMStreamVis

The SOMStreamVis plots BMU indexes over time (natural order of samples) and can provide additional information to a SOM visualization.

As Wine Quality is not a time series, the visualization does not apply to this dataset.

## 19. Intrinsic distance

Intrinsic distance visualization combines topographic and quantization error visualizations.

Due to performance issues, we were not able to generate the visualization in a reasonable amount of time (on a high-performance cloud server with 4 CPUs and 32 GB RAM, run directly with Python, not via Jupyter).

## 20. Mnemonic SOM

Due to an error we were unable to render this visualization:

      File "/home/sklei/miniconda3/envs/sorg2/lib/python3.12/site-packages/param/depends.py", line 41, in _depends
        return func(*args, **kw)
               ^^^^^^^^^^^^^^^^^
      File "/home/sklei/PySOMVisJ5/PySOMVis/controls/controllers.py", line 418, in trainMnemonicSOM
        self._calculate()
      File "/home/sklei/PySOMVisJ5/PySOMVis/mnemonics/mnemonicSOM.py", line 54, in _calculate
        self._main._display(UMatrix.calculate_UMatrix(self._weights, self._controls.M, self._controls.N, self._main._dim))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/sklei/PySOMVisJ5/PySOMVis/visualizations/umatrix.py", line 35, in calculate_UMatrix
        U = weights.reshape(m, n, dim)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
    ValueError: cannot reshape array of size 4400 into shape (10,10,11)
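
The numbers in the error message already narrow down the mismatch. The following arithmetic uses only values from the message; the variable names are ours and the interpretation is a hypothesis, not a confirmed diagnosis:

```python
size = 4400             # number of elements in the weight array (from the message)
m, n, dim = 10, 10, 11  # target shape passed to reshape (from the message)

units_in_array = size // dim  # units actually stored, given 11 components per unit
units_expected = m * n        # units implied by the 10 x 10 target grid

print(size % dim)       # 0, so the array divides evenly into 11-dimensional vectors
print(units_in_array)   # 400
print(units_expected)   # 100
```

The weight array therefore seems to hold 400 units (which would match, for instance, a 20 x 20 grid, although that is an assumption), while the map dimensions handed to `calculate_UMatrix` describe only 100 units. The map size set in the controls may not match the trained weights.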
 