# Visualizations of a large SOM - Room occupancy dataset

In this notebook we demonstrate the different PySOMVis visualisations.

Dataset: Room Occupancy (see https://archive.ics.uci.edu/dataset/864/room+occupancy+estimation), with 16 sensor features and 10129 instances, on which we train a 300x200 SOM with MiniSom. The instances also include day and time information, which we add as normalized attributes. For performance reasons the SOM is trained outside of the notebook, and the visualisations are gathered from a running pysom instance (see occupancy_l_pysom.py).

Github project: https://github.com/stephan-klein/PySOMVisJ5


## Hit histogram
The hit histogram visualizes the frequency with which neurons get hit during the training of a SOM.
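
As a rough illustration of the idea, the sketch below counts how often each unit is the best matching unit (BMU) for a set of samples. It is not the PySOMVis implementation: the function names, the `(rows, cols, dim)` weight layout, the Euclidean distance and the toy data are all assumptions made for this example.

```python
import numpy as np

def bmu_indices(weights, data):
    """Return the flat index of the best matching unit for every sample."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(rows * cols, dim)
    # Squared Euclidean distance between every sample and every unit.
    d2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def hit_histogram(weights, data):
    """Count how many samples map to each unit of the grid."""
    rows, cols, _ = weights.shape
    hits = np.bincount(bmu_indices(weights, data), minlength=rows * cols)
    return hits.reshape(rows, cols)

# Toy example (hypothetical data): a 2x2 map with 2-dimensional weights.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
data = np.array([[0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.0, 0.1]])
hits = hit_histogram(weights, data)
print(hits)  # two hits each on units (0, 0) and (1, 1)
```

The full sample-by-unit distance matrix is convenient for a small example; for a map of the size used here it would probably have to be computed in chunks.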

We identify 3 clusters: a small one at the top, a bigger, yet sparse one in the middle and a dense one at the bottom. The neurons most frequently hit belong to the bottom cluster.


![](img/occl_hithist.png)
Hit histogram

## Smoothed data histogram

The SDH is an extension of hit histograms that maps input vectors onto n-best matching units and achieves a smoothing effect.
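
A minimal sketch of the smoothing idea, assuming the rank-based voting scheme in which each sample distributes votes `s, s-1, ..., 1` (normalized to sum to one) over its `s` closest units. The function name, the weight layout and the toy data are hypothetical, and the weighted variant used for the second plot is not covered here.

```python
import numpy as np

def smoothed_data_histogram(weights, data, s):
    """Spread each sample's vote over its s best matching units."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    d2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    # Indices of the s closest units per sample, closest first.
    nearest = np.argsort(d2, axis=1)[:, :s]
    # Assumed voting scheme: s points for the closest unit, down to 1.
    votes = np.arange(s, 0, -1, dtype=float)
    votes /= votes.sum()
    sdh = np.zeros(rows * cols)
    for rank in range(s):
        # np.add.at accumulates correctly when a unit occurs several times.
        np.add.at(sdh, nearest[:, rank], votes[rank])
    return sdh.reshape(rows, cols)

# Toy example (hypothetical data): a 1x3 map with 1-dimensional weights.
weights = np.array([[[0.0], [1.0], [3.0]]])
data = np.array([[0.2]])
sdh = smoothed_data_histogram(weights, data, s=2)
sdh_single = smoothed_data_histogram(weights, data, s=1)
```

With `s=1` the sketch reduces to a plain hit histogram, which is one way to see the SDH as an extension of it.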

Compared to the visualization of the smaller SOM, the smoothing factor has less of a visible effect on this one. With the weighted SDH additional neurons become visible.

![](img/occl_smhisto_sdh49.png)
SDH with smoothing factor 50

![](img/occl_smhisto_weightedsdh49.png)
Weighted SDH with smoothing factor 50

## Neighbourhood graph

Neighbourhood graphs visualize which areas of the SOM are in proximity based on the input space.
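
The two approaches compared in this section can be sketched as follows, under the assumption that two units are connected whenever a sample mapped to one of them has an input-space neighbour (within a radius, or among its k nearest neighbours) mapped to the other. The names `neighbourhood_edges` and `bmu_indices` and the toy data are hypothetical and do not reflect the PySOMVis API.

```python
import numpy as np

def bmu_indices(weights, data):
    """Flat index of the best matching unit for every sample."""
    flat = weights.reshape(-1, weights.shape[2])
    d2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def neighbourhood_edges(weights, data, k=None, radius=None):
    """Connect the BMUs of samples that are neighbours in input space."""
    bmus = bmu_indices(weights, data)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    edges = set()
    for i in range(len(data)):
        if k is not None:
            neighbours = np.argsort(dist[i])[:k]
        else:
            neighbours = np.flatnonzero(dist[i] <= radius)
        for j in neighbours:
            a, b = int(bmus[i]), int(bmus[j])
            if a != b:  # samples on the same unit need no connection
                edges.add((min(a, b), max(a, b)))
    return edges

# Toy example (hypothetical data): a 1x3 map, three samples.
weights = np.array([[[0.0], [1.0], [5.0]]])
data = np.array([[0.0], [0.9], [5.0]])
radius_edges = neighbourhood_edges(weights, data, radius=1.0)
knn_edges = neighbourhood_edges(weights, data, k=1)
```

In the toy example the isolated third sample gets no connection with the radius approach but is always connected with KNN, which illustrates why the two approaches can produce different graphs.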

Using the radius approach, we set the radius to 2.0. As with the smaller SOM, the resulting connections differ from those of the KNN approach. With the radius setting, most connections are displayed between the middle and the bottom clusters. With KNN, connections are predominantly formed within a cluster, and some topology violations are shown between the middle and top clusters.

![](img/occl_neighbourhood_radius2.png)
Neighborhood Connections Radius 2

![](img/occl_neighbourhood_knn8.png)
Neighborhood Connections KNN 8

![](img/occl_neighbourhood_knn8_zoomcentercluster.png)
Neighborhood Connections KNN 8 - Zoomed to middle cluster

![](img/occl_neighbourhood_knn8_zoombottomcluster.png)
Neighborhood Connections KNN 8 - Zoomed to bottom cluster


## Sky Metaphor

Sky Metaphor is another density visualization, but maps data items on the exact position within a unit and therefore helps identify similarity between inputs within the same unit or across neighbouring units more accurately.


We observe that the sky metaphor has one of the worst runtime performances on a large SOM. We were still able to capture a plot, shown below, and zooming in lets us inspect fine-grained density structures within the clusters.


![](img/occl_skymetaphor_sf2.png)
Smoothing Factor 2 - Full SOM

![](img/occl_skymetaphor_sf2_zoom.png)
Smoothing Factor 2 - Zoomed in



## Activity Histogram

The Activity Histogram per data point visualizes the distance between input vector and all weight vectors.
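
For a single sample this amounts to one distance computation per unit. The sketch below assumes Euclidean distance and a `(rows, cols, dim)` weight array; the function name and toy data are hypothetical.

```python
import numpy as np

def activity_histogram(weights, sample):
    """Distance between one input vector and every unit's weight vector."""
    # Broadcasting subtracts the sample from each weight vector on the grid.
    return np.linalg.norm(weights - sample, axis=2)

# Toy example (hypothetical data): a 2x2 map with 2-dimensional weights.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
activity = activity_histogram(weights, np.array([0.0, 0.0]))
```

Low values mark units close to the sample, so a homogeneous cluster shows up as a smooth gradient around the sample's BMU.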

We chose two input vectors: 0 and 816. Sample 0 has low sensor readings and was recorded in the early morning, whereas sample 816, recorded at 18:16 on the same day, has high sensor readings.
Both samples show cluster homogeneity (distances change gradually, rather than high and low values appearing as close neighbours), with one exception: the bottom part of the middle cluster.

![](img/occl_act_0.png)
Activity histogram - sample 0

![](img/occl_act_816.png)
Activity histogram - sample 816

 

## Minimum spanning tree

The Minimum Spanning Tree visualizes related nodes on the map by connecting similar nodes with each other. Edge weights are computed with a distance metric between the vectors of the connected vertices, and the tree minimizes the total edge weight [0].
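
To make the construction concrete, here is a small sketch of Prim's algorithm on the complete graph over a set of vectors, with Euclidean distance as edge weight. This is an illustration only: it is not the PySOMVis implementation, and the function name and toy vectors are hypothetical.

```python
import numpy as np

def minimum_spanning_tree(vectors):
    """Prim's algorithm on the complete graph of Euclidean distances."""
    n = len(vectors)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    # Cheapest known connection from the tree to every other vertex.
    best_dist = np.linalg.norm(vectors - vectors[0], axis=1)
    best_from = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        # Pick the closest vertex that is not yet part of the tree.
        candidates = np.where(in_tree, np.inf, best_dist)
        j = int(candidates.argmin())
        edges.append((int(best_from[j]), j, float(best_dist[j])))
        in_tree[j] = True
        # The new vertex may offer cheaper connections to the rest.
        d = np.linalg.norm(vectors - vectors[j], axis=1)
        closer = (~in_tree) & (d < best_dist)
        best_dist[closer] = d[closer]
        best_from[closer] = j
    return edges

# Toy example (hypothetical data): four 1-dimensional weight vectors.
vectors = np.array([[0.0], [1.0], [3.0], [7.0]])
mst = minimum_spanning_tree(vectors)
total_weight = sum(w for _, _, w in mst)
```

Each iteration touches every vertex once, so the cost grows quadratically with the number of vertices in this formulation.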

Unfortunately, for the large SOM, PySOMVis was not able to produce the visualisation in a reasonable amount of time (on a high-performance cloud server with 4 CPUs and 32 GB RAM, run directly with Python rather than via Jupyter).

[0]: https://www.ifs.tuwien.ac.at/~mayer/publications/pdf/may_icann10.pdf, accessed 02.02.2024

## Cluster Connections

In this visualization technique, connecting lines are drawn between units based on threshold values. The intensity of the connections between nodes indicates the similarity of underlying data points.

We observe some issues with this visualisation on the large SOM. The connections are not rendered in the same way as for the small SOM, and we can produce artifacts that only appear when zooming or panning.

![](img/occl_clustercon_0.18.png)
(Potentially Faulty) Cluster Connection Visualization

![](img/occl_clustercon_0.18_onzoom.png)
Artifacts on scrolling


## U-Matrix
The U-Matrix visualization displays the distances between neurons on the SOM grid. Low values correspond to small distances between neighbouring neurons, whereas high values indicate large distances and can be used to identify cluster boundaries. 

The visualization helps to discern individual cluster structures that appeared unclear in earlier visualizations.
In particular, the top (-0.15, -0.45) and bottom (0.2, -0.3) clusters with low sensor readings form coherent regions (valleys) with visible cluster boundaries in the U-Matrix. The middle cluster (0.15, -0.1) with high sensor readings forms noisy rather than coherent regions, with unclear boundaries.

![](img/occl_umatrix.png)
U-Matrix

## D-Matrix
The D-Matrix is similar to the U-Matrix, but averages the distance instead of using interpolation.
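
The averaging step can be sketched as below, assuming a rectangular grid, Euclidean distance and a 4-neighbourhood (up, down, left, right). Whether PySOMVis uses exactly this neighbourhood is not confirmed here; the function name and toy weights are hypothetical.

```python
import numpy as np

def d_matrix(weights):
    """Average distance of each unit to its direct grid neighbours."""
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    # Distances between vertically adjacent units.
    dv = np.linalg.norm(weights[1:] - weights[:-1], axis=2)
    total[1:] += dv
    total[:-1] += dv
    count[1:] += 1
    count[:-1] += 1
    # Distances between horizontally adjacent units.
    dh = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)
    total[:, 1:] += dh
    total[:, :-1] += dh
    count[:, 1:] += 1
    count[:, :-1] += 1
    # Guard against a 1x1 map, which has no neighbours at all.
    return total / np.maximum(count, 1)

# Toy example (hypothetical data): a 1x3 map with 1-dimensional weights.
weights = np.array([[[0.0], [1.0], [3.0]]])
dmat = d_matrix(weights)
```

Border units have fewer neighbours, which is why the sketch tracks a per-unit neighbour count instead of dividing by a constant.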

This results in a similar visualisation, but with smoother transitions between "mountains" and "valleys". The boundaries are therefore not as clearly visible.


![](img/occl_dmatrix.png)
D-Matrix

## P-Matrix & U*-Matrix

Unlike the U-Matrix, the P-Matrix is a density-based rather than a distance-based metric. It involves estimating the empirical density at each neuron's weight vector in the feature space.

The U*-Matrix combines both distance and density information, enhancing cluster visualization by adjusting the U-Matrix with density-derived scale factors.
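
The following sketch shows one way to read these two definitions. The density estimate counts the samples inside a hypersphere of a given radius around each weight vector, and the U*-Matrix rescales U-Matrix heights so that units of average density stay unchanged and the densest unit is pulled down to zero. The scale factor follows the formulation we understand from the U*-Matrix literature and is an assumption here, as are the function names, the toy data and the radius, which is simply passed in rather than derived from a percentile.

```python
import numpy as np

def p_matrix(weights, data, radius):
    """Number of samples within `radius` of each unit's weight vector."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    dist = np.linalg.norm(flat[:, None, :] - data[None, :, :], axis=2)
    return (dist <= radius).sum(axis=1).reshape(rows, cols)

def u_star_matrix(u, p):
    """Scale U-Matrix heights by density (assumed scale factor)."""
    p = p.astype(float)
    mean_p, max_p = p.mean(), p.max()
    if max_p == mean_p:
        return u.copy()  # uniform density carries no extra information
    # 1 at average density, 0 at maximum density, above 1 in sparse areas.
    scale = (p - mean_p) / (mean_p - max_p) + 1.0
    return u * scale

# Toy example (hypothetical data): a 1x3 map and four samples.
weights = np.array([[[0.0], [1.0], [5.0]]])
data = np.array([[0.0], [0.2], [0.9], [5.0]])
p = p_matrix(weights, data, radius=0.5)
u = np.array([[1.0, 2.5, 4.0]])  # hypothetical U-Matrix heights
u_star = u_star_matrix(u, p)
```

In dense regions the distances are damped and in sparse regions they are emphasised, which is the intended effect of combining both matrices.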
  
For the P-Matrix we calculate the optimal percentile and radius, which results in a percentile of 43 and a radius of 3.6. Increasing the percentile to 60 (and thus the radius to 4.9) reduces the noise in the low-density regions. In the P-Matrices we observe that the supposedly 'empty' regions of the SOM (they appear empty in the hit histogram) are shown in bright red, indicating high density. We cannot observe this on the smaller SOM and have no explanation other than a fault in the implementation for large SOMs.

In the U*-Matrix this behaviour vanishes, and we observe no large structural differences compared to the U-Matrix, indicating that the density information does not fundamentally contradict the distance-based metrics of the U-Matrix.


![](img/occl_pmatrix_optimal_43_3.6.png)
P-Matrix with optimal values

![](img/occl_ustarmatrix_optimal_43_3.6.png)
U*-Matrix (=P-Matrix + U-Matrix) with optimal values

![](img/occl_pmatrix_higher_60_4.9.png)
P-Matrix with higher percentile and radius

![](img/occl_ustarmatrix_higher_60_4.9.png)
U*-Matrix (=P-Matrix + U-Matrix) higher percentile and radius


## Pie chart

This visualization is for classification type datasets. The room occupancy dataset provides the occupancy count as an integer target, which is not suitable for this classification visualization.


## Chessboard

Chessboard visualization is a type of class coloring visualization, combining Voronoi tessellation and chessboard-style pixel coloring according to dominant classes.

Since the dataset is not suitable for classification type visualizations, we didn't use this visualization on the room occupancy data.

## Component planes
The component planes visualization shows the distribution of the weights for the selected attributes (=components) across the SOM units. 

We observe that the middle cluster (0.15, -0.1) represents the instances with high sensor readings for temperature, light, sound, CO2 and motion, consistently throughout those components. We can also see that the middle cluster has a smaller cluster with opposing readings (valleys) attached to its bottom side, indicating low values for light, sound, CO2 and motion.

The bottom cluster (0.2, -0.3) represents instances with low sensor readings, consistently across the components (with just a small violation within this cluster for the light component).

The last visualization shows the time of day component. It reveals that the high sensor readings occur in the afternoon and evening (middle cluster) and the low sensor readings at night and in the morning (bottom cluster).

![](img/occl_component0_temp.png)
Component 0 - Temperature

![](img/occl_component4_light.png)
Component 4 - Light

![](img/occl_component8_sound.png)
Component 8 - Sound

![](img/occl_component12_co2.png)
Component 12 - CO2

![](img/occl_component14_PIR.png)
Component 14 - PIR

![](img/occl_component17_timeofday.png)
Component 17 - Time of day


## Metro Map

MetroMap is similar to component planes, but groups weights of the selected attribute into bins.
Component lines connect the centers of gravity of each bin.
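
A minimal sketch of the binning step for a single component plane, assuming equal-width bins over the value range and an unweighted centre of gravity (the mean grid position of the units in a bin). Both choices, the function name and the toy plane are assumptions for illustration.

```python
import numpy as np

def bin_centres_of_gravity(plane, n_bins):
    """Assign units to value bins and return each bin's mean grid position."""
    edges = np.linspace(plane.min(), plane.max(), n_bins + 1)
    # Digitizing against the inner edges yields bin ids 0 .. n_bins - 1.
    bins = np.digitize(plane, edges[1:-1])
    rr, cc = np.indices(plane.shape)
    centres = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            centres.append((float(rr[mask].mean()), float(cc[mask].mean())))
        else:
            centres.append(None)  # empty bin, no centre of gravity
    return centres

# Toy example (hypothetical data): one component plane of a 2x3 map.
plane = np.array([[0.0, 0.1, 1.0],
                  [0.0, 0.9, 1.0]])
centres = bin_centres_of_gravity(plane, n_bins=2)
```

Connecting the returned positions from the lowest to the highest bin gives the component line for that attribute.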

When one attribute is selected with 5 bins, we see how the temperature readings from one sensor are distributed into the bins, with the component lines indicating the gradients between the centers of gravity. The lines for the lower bins do not form a clear direction. This makes sense, as the low-temperature clusters are spread across the top and bottom of the SOM.

If multiple attributes are selected, only one bin is visualized. The centers of gravity for two temperature readings S3 and S4, and time of day lie in the bottom cluster, so we observe a clear direction for the metro lines towards this cluster. For all other attributes this is not the case, as all centers of gravity, and thus the metro lines, are located in proximity of the central cluster.

![](img/occl_metro_1comp.png)
Metro Map - Attribute 0

![](img/occl_metro_c-2-3-17.png)
Metro Map - Attributes: 2 (Temp S3), 3 (Temp S4), 17 (Time of Day)

![](img/occl_metro_allexcept-2-3-17.png)
Metro Map - All attributes except 2, 3, 17


## Clustering

Clustering is a non-deterministic division of the map into regions based on the weights.
There are two approaches: k-means and agglomerative clustering.

With the k-means approach we receive a noisy result. This stands in contrast to the smaller SOM, where we do not receive such noise. The noise can be explained by neurons that are seldom chosen as BMU (they keep their randomly initialized weights). The noise increases with the number of clusters.

We can remove the noise with agglomerative clustering and can observe the inner structure of the middle (high sensor data) cluster. The bottom cluster (low sensor data) only emerges when the number of clusters is increased. This indicates lower distances to the background data of the SOM in this cluster.


![](img/occl_clustering_kmeans4.png)
KMeans Clustering with 4 clusters

![](img/occl_clustering_kmeans8.png)
KMeans Clustering with 8 clusters

![](img/occl_clustering_agglo.png)
Agglomerative Clustering with 20 clusters

![](img/occl_clustering_agglo_50.png)
Agglomerative Clustering with 50 clusters



## Quantization error

The visualization shows the average distance between the input vectors and their best matching unit and serves as an indication of how well the map is trained.
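
Per unit, this can be sketched as the mean distance of all samples mapped to that unit. The function name, the Euclidean distance and the choice to report 0 for units without hits are assumptions made for this example.

```python
import numpy as np

def quantization_error_map(weights, data):
    """Mean distance between each unit and the samples mapped to it."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    dist = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    bmu = dist.argmin(axis=1)
    bmu_dist = dist[np.arange(len(data)), bmu]
    # Sum of distances and number of hits per unit.
    total = np.bincount(bmu, weights=bmu_dist, minlength=rows * cols)
    hits = np.bincount(bmu, minlength=rows * cols)
    # Units without hits keep an error of 0 instead of dividing by zero.
    mean = np.divide(total, hits, out=np.zeros_like(total), where=hits > 0)
    return mean.reshape(rows, cols)

# Toy example (hypothetical data): a 1x2 map and three samples.
weights = np.array([[[0.0], [10.0]]])
data = np.array([[1.0], [3.0], [10.0]])
qe = quantization_error_map(weights, data)
print(qe)  # unit 0 averages distances 1 and 3, unit 1 has distance 0
```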

We observe only a few individual neurons with a high quantization error in the top and bottom clusters, so we conclude that our map is trained fairly well.

![](img/occl_quanterror.png)
Quantization error


## Topographic Error

The topographic error visualizes how well the SOM preserves the topography of the input
space by calculating the percentage of data samples for which the first and second BMU are not placed in adjacent units in the SOM.
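
The definition above can be sketched as follows, assuming a rectangular grid on which two units count as adjacent when they differ by at most one row and one column (8-neighbourhood). The adjacency rule, the function name and the toy data are assumptions and may differ from what PySOMVis uses.

```python
import numpy as np

def topographic_error(weights, data):
    """Share of samples whose first and second BMU are not adjacent."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    d2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    # Flat indices of the two closest units per sample.
    best_two = np.argsort(d2, axis=1)[:, :2]
    r, c = np.unravel_index(best_two, (rows, cols))
    # Assumed adjacency: at most one step apart in both grid directions.
    not_adjacent = (np.abs(r[:, 0] - r[:, 1]) > 1) | (np.abs(c[:, 0] - c[:, 1]) > 1)
    return not_adjacent.mean()

# Toy example (hypothetical data): the last unit of a 1x3 map has a weight
# close to the first one, which violates the topology.
weights = np.array([[[0.0], [5.0], [1.0]]])
data = np.array([[0.2], [4.8]])
te = topographic_error(weights, data)
print(te)  # 0.5
```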

Consistent with the small SOM visualisation of the topographic error, we notice that the cluster with low sensor readings has a high number of topographic errors. However, the cluster is dense, and according to the lecture this might be misleading.

![](img/occl_topoerror.png)
Topographic error

## SOMStreamVis

The SOMStreamVis plots the indexes of the best matching units (BMUs) over time (natural order of samples) and can provide additional information to a SOM visualization.
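
The underlying series can be sketched as the sequence of BMU indexes in sample order, optionally translated into cluster labels so that the colors of a clustering can be reused. The function name, the per-unit label array and the toy data are hypothetical.

```python
import numpy as np

def bmu_stream(weights, data):
    """Flat BMU index for each sample, in the natural order of the samples."""
    flat = weights.reshape(-1, weights.shape[2])
    d2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Toy example (hypothetical data): a 1x3 map and five consecutive samples.
weights = np.array([[[0.0], [1.0], [5.0]]])
data = np.array([[0.1], [0.9], [5.2], [4.9], [0.0]])
stream = bmu_stream(weights, data)

# Hypothetical cluster label per unit, e.g. from a clustering of the map.
unit_labels = np.array([[0, 0, 1]])
label_stream = unit_labels.ravel()[stream]
print(label_stream)  # [0 0 1 1 0]
```

Plotting `stream` against the sample index and coloring each point by `label_stream` gives a time series in which cluster changes become visible.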

We used the SOMStreamVis together with the Agglomerative Clustering visualization. Colors are therefore matched between cluster and the matching BMUs over time.
The dataset contains information from the 22.12. starting at 11 am until the 26.12. at 9 am, followed by a gap until the 10.01. 15:30 (at sample number 8086) and ending on the 11.01. at 9am.
SOMStreamVis reflects patterns in the readings: data belonging to the cluster with low sensor readings matches samples from the nights, and the high sensor readings are found in samples from afternoons and evenings. The 25.12. is an outlier in the sense that there is no match with the high sensor reading cluster in the afternoon/evening, indicating low or no occupancy on this day.

![](img/occl_somstreamviz_cluster.png)
Agglomerative clusters

![](img/occl_somstreamviz_time.png)
Timeseries visualization

## Intrinsic distance

Intrinsic distance visualization combines topographic and quantization error visualizations.

Due to performance issues or a bug, we weren't able to display the visualization. According to the logging we have built into it, the calculate function never terminated.

## Mnemonic SOM

Due to performance issues on this dataset, we were unable to render this visualization. The logging we additionally implemented indicated that the calculate function never terminated.