# Visualizations small SOM - Room occupancy dataset

In this notebook we prepare demonstrations of different PySOMVis visualisations.

Dataset: Room Occupancy (see https://archive.ics.uci.edu/dataset/864/room+occupancy+estimation), with 16 sensor features and 10129 instances, on which we train a 60x40 SOM with MiniSom. For performance reasons, the SOM is trained outside of the notebook and the visualisations are gathered from a running pysom instance (see occupancy_s_pyson.py).

## 1. Hit histogram
The hit histogram visualizes how often each neuron is selected as the best matching unit (BMU) when the input vectors are mapped onto the SOM.

We use a gradient from black to white to visualize the hits, white representing the nodes with the most hits.
There is a similar pattern of clusters as in the component planes visualisation, where the low occupancy region shows the most hits.


![](img/occs_hithist_gistgray.png)

Hit histogram 
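
As a rough illustration of what the hit histogram counts, the following NumPy sketch maps every sample to its best matching unit and tallies the hits per unit. The function name `hit_histogram`, the `(rows, cols, dim)` weight layout and the toy data are assumptions made for this example; this is not the PySOMVis implementation.

```python
import numpy as np

def hit_histogram(weights, data):
    """Count, per unit, how many input vectors have that unit as BMU."""
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)                  # (rows*cols, dim)
    # Euclidean distance from every sample to every unit.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)                           # flat BMU index per sample
    hits = np.bincount(bmu, minlength=rows * cols)
    return hits.reshape(rows, cols)

# Toy example (hypothetical values): a 2x2 map in a 2-D input space.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
data = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 1.0]])
hits = hit_histogram(weights, data)
print(hits)
```

The full distance matrix is convenient for small examples but memory hungry; for a 60x40 map and roughly ten thousand samples it would probably be better to process the data in chunks.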


## 2. Smoothed data histogram

The SDH is an extension of hit histograms that maps input vectors onto n-best matching units and achieves a smoothing effect.

We experimented with different smoothing factors. As expected, a smoothing factor of 1 gives us the same visualization as the hit histogram. Bigger and smoother clusters become visible with higher factors such as 50.
With the weighted SDH approach in particular, three clusters become clear.

![SDH](img/occs_hithistsdh_factor49_gistgray.png)
SDH with factor 49

![SDH weighted factor](img/occs_hithistsdh_weighted_factor49_gistgray.png)
SDH with weighted factor 49
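
The smoothing effect can be sketched with a rank-based voting scheme: each sample gives `s` points to its closest unit, `s - 1` to the second closest, and so on down to 1 point for the `s`-th closest. This is our understanding of the basic SDH idea, written as a minimal NumPy sketch; the function name `sdh` and the toy data are assumptions, and the weighted variant used above is not covered here.

```python
import numpy as np

def sdh(weights, data, s):
    """Rank-based smoothed data histogram with smoothing factor s."""
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    # Indices of the s closest units per sample, closest first.
    ranked = np.argsort(dists, axis=1)[:, :s]
    votes = np.zeros(rows * cols)
    for rank in range(s):
        # Rank 0 (the BMU) gets s points, rank 1 gets s - 1, ..., the last gets 1.
        np.add.at(votes, ranked[:, rank], s - rank)
    return votes.reshape(rows, cols)

# Toy example (hypothetical values): a 2x2 map in a 2-D input space.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
data = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 1.0]])
sdh_1 = sdh(weights, data, 1)   # s = 1 reduces to the plain hit histogram
sdh_2 = sdh(weights, data, 2)   # s = 2 spreads votes to the second-best unit
```

With `s = 1` only the BMU receives a vote, which is consistent with the observation that factor 1 reproduces the hit histogram.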

## 3. Neighbourhood graph

Neighbourhood graphs visualize which areas of the SOM are in proximity based on the input space.

We plotted the neighbourhood graph for the 8 nearest neighbours over a hit histogram. There are only a few long lines, and most connections stay within clusters, especially in the cluster with high sensor readings, which indicates that the topology is mostly well preserved.

Using the radius approach, graph lines start to appear at a radius of 0.9.
The connections differ from those of the KNN approach in two ways.
First, the cluster with high sensor readings shows no neighbourhood connections with the radius approach, whereas the KNN approach shows multiple edges there. This indicates a low-density cluster. Second, additional connections emerge between the clusters with lower sensor readings.

![](img/occs_hithist_neigborhood_knn8.png)
Neighbourhood graph 8 nearest neighbours

![](img/occs_hithist_neigborhood_radius_0.9.png)
Neighbourhood graph radius of 0.9
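
To make the difference between the two approaches concrete, here is a minimal sketch of how such a graph could be derived: two units are connected if a sample mapped to one of them has a neighbour in input space (by KNN or by radius) that is mapped to the other. The function name `neighbourhood_edges` and the toy data are assumptions for illustration; the actual PySOMVis code may differ.

```python
import numpy as np

def neighbourhood_edges(weights, data, k=None, radius=None):
    """Return undirected edges between flat unit indices (KNN or radius based)."""
    if (k is None) == (radius is None):
        raise ValueError("pass exactly one of k or radius")
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)
    bmu = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2).argmin(axis=1)
    # Pairwise distances between samples in input space.
    pair = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(pair, np.inf)            # a sample is not its own neighbour
    edges = set()
    for i in range(len(data)):
        if k is not None:
            neighbours = np.argsort(pair[i])[:k]
        else:
            neighbours = np.flatnonzero(pair[i] <= radius)
        for j in neighbours:
            a, b = int(bmu[i]), int(bmu[j])
            if a != b:                        # only edges between different units
                edges.add((min(a, b), max(a, b)))
    return edges

# Toy example (hypothetical values): a 2x2 map in a 2-D input space.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
data = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 1.0]])
knn_edges = neighbourhood_edges(weights, data, k=1)
radius_edges = neighbourhood_edges(weights, data, radius=0.5)
```

In the toy data the isolated third sample still gets a KNN edge, because every sample has a nearest neighbour, but no radius edge. This mirrors the behaviour described above for a low-density cluster. The pairwise distance matrix only scales to small samples.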


## 4. Sky Metaphor

Sky Metaphor is another density visualization, but it maps data items onto their exact position within a unit and therefore helps identify similarity between inputs within the same unit or across neighbouring units more accurately.

The visualization is more "irregular" than other density visualizations since data items are not centered within units anymore.


![](img/occs_skymetaphor.png)
Sky metaphor 

## 5. Activity Histogram

The Activity Histogram per data point visualizes the distance between input vector and all weight vectors.

We chose two input vectors: 0 and 816. Sample 0 represents a sample with low sensor readings as opposed to sample 816 which has high sensor readings.
Sample 0 shows cluster homogeneity, while sample 816 reveals some topology violations in the high sensor readings cluster indicating cluster substructures.

![](img/occs_acthistogram_0.png)
Activity histogram sample 0

![](img/occs_acthistogram_816.png)
Activity histogram sample 816
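
The quantity behind this visualization is simple to sketch: for one chosen sample, compute its distance to every weight vector and display the result on the map grid. The function name `activity_histogram` and the toy values are assumptions for this example.

```python
import numpy as np

def activity_histogram(weights, sample):
    """Euclidean distance between one input vector and every unit's weight vector."""
    # Broadcasting subtracts the sample from each (row, col) weight vector.
    return np.linalg.norm(weights - sample, axis=2)

# Toy example (hypothetical values): a 2x2 map in a 2-D input space.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
activity = activity_histogram(weights, np.array([0.0, 0.0]))
print(activity)
```

Low values mark the region of the map that is close to the sample; several separated low-value regions would hint at the kind of topology violation mentioned above.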

## 6. Minimum spanning tree

The Minimum Spanning Tree visualizes related nodes on the map by connecting similar nodes with each other. The weights of the edges are computed by a distance metric between the vectors of the vertices and subsequently minimized (see https://www.ifs.tuwien.ac.at/~mayer/publications/pdf/may_icann10.pdf).

There are four available settings in PySOMVis: all, diagonal, direct, MST input data. We could only test it with the All option, due to performance problems.


![](img/occs_mspt_all.png)
MSPT - all
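
For reference, a minimum spanning tree over a set of vectors can be sketched with Prim's algorithm on the complete distance graph. The function name `mst_edges` and the toy vectors are assumptions; we do not know exactly how this maps onto the four PySOMVis settings, so the sketch only illustrates the general idea.

```python
import numpy as np

def mst_edges(vectors):
    """Prim's algorithm on the complete Euclidean graph; returns (from, to, length)."""
    n = len(vectors)
    dist = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                     # start the tree at vertex 0
    best_dist = dist[0].copy()            # cheapest known link from the tree to each vertex
    best_from = np.zeros(n, dtype=int)    # tree vertex that provides that link
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best_dist)
        j = int(candidates.argmin())      # closest vertex not yet in the tree
        edges.append((int(best_from[j]), j, float(best_dist[j])))
        in_tree[j] = True
        closer = (dist[j] < best_dist) & ~in_tree
        best_from[closer] = j
        best_dist[closer] = dist[j][closer]
    return edges

# Toy example (hypothetical values): three vectors on a line.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
edges = mst_edges(vectors)
print(edges)
```

The dense distance matrix grows quadratically with the number of vertices, which may be one reason why such a computation becomes slow on large maps or on the input data itself.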

## 7. Cluster Connections

In this visualization technique, connecting lines are drawn between units based on threshold values.

We observe that with a certain threshold combination we can identify the cluster boundaries. If the thresholds are too low, the cluster boundaries are not as distinctly visible in the visualization.
The area with sparse connections in the bottom-right region shows that the underlying data items are not similar in that region.


![](img/occs_cluster_connections-0.22.png)
Cluster connections low threshold

![](img/occs_cluster_connections-0.33.png)
Cluster connections high threshold

## 8. U-Matrix
The U-Matrix visualization displays the distances between neurons on the SOM grid. Low values correspond to small distances between neighbouring neurons, whereas high values indicate large distances and can be used to identify cluster boundaries. 

The visualization helps to discern individual cluster structures that appeared unclear in earlier visualizations.
Especially the clusters with low sensor reading values form coherent regions (valleys) with visible cluster boundaries in the U-Matrix. The regions with high sensor readings (assuming high occupancy) form noisy rather than coherent regions.

![](img/occs_umatrix.png)
U-Matrix 

## 9. D-Matrix
The D-Matrix is similar to the U-Matrix, but averages the distance instead of using interpolation.

This results in a similar visualisation, but with smoother transitions between "mountains" and "valleys". The boundaries are therefore not as clear.


![](img/occs_dmatrix.png)
D-Matrix
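
The averaging idea can be sketched as follows: for each unit, take the mean distance between its weight vector and those of its direct grid neighbours. The choice of a 4-neighbourhood on a rectangular grid, the function name `d_matrix` and the toy map are assumptions for this illustration and may not match the PySOMVis definition in detail.

```python
import numpy as np

def d_matrix(weights):
    """Mean distance from each unit to its up/down/left/right grid neighbours."""
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    # Distances between vertically adjacent units, credited to both units.
    vert = np.linalg.norm(weights[1:] - weights[:-1], axis=2)
    total[1:] += vert
    count[1:] += 1
    total[:-1] += vert
    count[:-1] += 1
    # Distances between horizontally adjacent units, credited to both units.
    horiz = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)
    total[:, 1:] += horiz
    count[:, 1:] += 1
    total[:, :-1] += horiz
    count[:, :-1] += 1
    return total / count

# Toy example (hypothetical values): a 1x3 map in a 1-D input space.
weights = np.array([[[0.0], [1.0], [3.0]]])
dm = d_matrix(weights)
print(dm)
```

Because every value is an average over several neighbours, sharp jumps between two units are spread out, which fits the smoother transitions seen in the figure.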


## 10. P-Matrix & U*-Matrix

Unlike the U-Matrix, the P-Matrix is a density-based and not a distance-based metric. It involves estimating the empirical density at each neuron's weight vector in the feature space.

The U*-Matrix combines both distance and density information, enhancing cluster visualization by adjusting the U-Matrix with density-derived scale factors.
  
We experimented with higher percentile values, and thus higher radius.


![](img/occs_pmatrix.png)
P-matrix

![](img/occs_ustarmatrix.png)
U*-matrix 

![](img/occs_ustarmatrix_doublepercentile.png)
U*-Matrix with higher percentile and radius
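
A simple way to picture the density estimate is to count, for every unit, how many samples lie within a fixed radius of its weight vector. In the sketch below the radius is taken as a percentile of the pairwise sample distances, which is our assumption about how percentile and radius relate; the function names and toy data are also hypothetical.

```python
import numpy as np

def radius_from_percentile(data, percentile):
    """Radius chosen as a percentile of all pairwise sample distances (assumption)."""
    pair = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    upper = pair[np.triu_indices(len(data), k=1)]   # each pair once, no self-distances
    return float(np.percentile(upper, percentile))

def p_matrix(weights, data, radius):
    """Number of samples within `radius` of each unit's weight vector."""
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)
    dists = np.linalg.norm(codebook[:, None, :] - data[None, :, :], axis=2)
    return (dists <= radius).sum(axis=1).reshape(rows, cols)

# Toy example (hypothetical values): a 2x2 map in a 2-D input space.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
data = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 1.0]])
p_small = p_matrix(weights, data, radius=0.25)
p_big = p_matrix(weights, data, radius=1.0)
r_low = radius_from_percentile(data, 20)
r_high = radius_from_percentile(data, 80)
```

A higher percentile yields a larger radius, so each unit counts more samples and the density picture becomes coarser.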


## 11. Pie chart

This visualization is for classification-type datasets. The room occupancy dataset provides the occupancy count as an integer target, which is not suitable for this classification visualization.


## 12. Chessboard

Chessboard visualization is a type of class coloring visualization, combining Voronoi tessellation and chessboard-style pixel coloring according to dominant classes.

Since the dataset is not suitable for classification type visualizations, we didn't use this visualization on the room occupancy data.

## 13. Component planes
The component planes visualization shows the distribution of the weights for the selected attributes across the SOM units. 

The component planes show two clusters for every light attribute coming from the sensors s1-s4. We can observe a positive correlation with the planes of the other attributes such as temperature, sound and PIR. Analysing the time-of-day plane together with the high sensor readings (temperature, CO2, motion) points to a higher occupancy during evenings.

![](img/occs_comp_temp.png)
Component 0 - Temperature

![](img/occs_comp_light.png)
Component 4 - Light

![](img/occs_comp_sound.png)
Component 8 - Sound

![](img/occs_comp_co2.png)
Component 12 - CO2

![](img/occs_comp_pir.png)
Component 14 - PIR

![](img/occs_comp_timeofday.png)
Component 17 - Time of day
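
A component plane is just one slice of the weight array, and the visual similarity between two planes can be quantified with a correlation coefficient. The sketch below assumes a `(rows, cols, dim)` weight array; the function names and the synthetic weights are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def component_plane(weights, index):
    """Weights of a single attribute across all units of the map."""
    return weights[:, :, index]

def plane_correlation(weights, i, j):
    """Pearson correlation between two component planes."""
    a = component_plane(weights, i).ravel()
    b = component_plane(weights, j).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic example: component 1 rises with component 0, component 2 falls.
rng = np.random.default_rng(0)
base = rng.random((3, 4))
weights = np.stack([base, 2.0 * base + 1.0, -base], axis=2)
corr_pos = plane_correlation(weights, 0, 1)
corr_neg = plane_correlation(weights, 0, 2)
```

Values near 1 correspond to planes that look alike, which is what we describe above as a positive correlation between attributes.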

## 14. Metro Map

MetroMap is similar to component planes, but groups weights of the selected attribute into bins.
Component lines connect the centers of gravity of each bin.

When one attribute is selected with the option of 5 bins, we see how the temperature readings from the one sensor are distributed into the bins.

If multiple attributes are selected, only one bin is visualized and the gradients of most attributes are very similar going from the cluster with high sensor readings to the one with low sensor readings.


![](img/occs_metromap_attr1.png)
Metro Map - 1 attribute, 5 bins

![](img/occs_metromap_allattr.png)
Metro Map - all attributes, 1 bin
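
The binning step can be sketched as follows: split the value range of one component plane into bins, then compute the centre of gravity of the unit coordinates that fall into each bin. Equal-width bins, the function name `bin_centres` and the toy plane are assumptions for this example and may differ from what PySOMVis does.

```python
import numpy as np

def bin_centres(plane, n_bins):
    """Centre of gravity (row, col) of the units in each equal-width value bin."""
    edges = np.linspace(plane.min(), plane.max(), n_bins + 1)
    # Using only the inner edges yields bin labels 0 .. n_bins - 1.
    labels = np.digitize(plane, edges[1:-1])
    rr, cc = np.indices(plane.shape)
    centres = []
    for b in range(n_bins):
        mask = labels == b
        if mask.any():
            centres.append((float(rr[mask].mean()), float(cc[mask].mean())))
        else:
            centres.append(None)          # no unit falls into this bin
    return centres

# Toy example (hypothetical values): a 1x4 component plane with two value levels.
plane = np.array([[0.0, 0.0, 1.0, 1.0]])
centres = bin_centres(plane, 2)
print(centres)
```

Connecting the centres in bin order gives a line that follows the gradient of the attribute across the map.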

## 15. Clustering

Clustering is a non-deterministic division of the map into regions based on the weights.
There are two approaches: k-means and agglomerative clustering.

With 4 clusters we observe that neurons belonging to the same cluster are in relative proximity, with only a couple of outliers. Our main cluster with low sensor data readings is not visualized; however, this can change since clustering is non-deterministic.
With a higher number of clusters, we again see topology violations in the neurons belonging to the cluster with low sensor data readings.

With agglomerative clustering we see clusters get aggregated, especially the low sensor data cluster.


![](img/occs_clusters_4.png)
KMeans Clustering with 4 clusters

![](img/occs_clusters_8.png)
KMeans Clustering with 8 clusters

![](img/occs_clusters_agglo_ward8.png)
Agglomerative clustering with 8 clusters
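
The non-determinism of k-means comes from the random initialisation of the cluster centres. The following minimal NumPy sketch clusters the weight vectors of the units and exposes the seed explicitly; the function name `kmeans_units`, the fixed iteration count and the toy map are assumptions, and a library implementation would normally be preferable.

```python
import numpy as np

def kmeans_units(weights, k, seed, n_iter=50):
    """Plain k-means on the unit weight vectors; returns a label per unit."""
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)
    rng = np.random.default_rng(seed)
    # Random initialisation: pick k distinct units as starting centres.
    centres = codebook[rng.choice(len(codebook), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(codebook[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = codebook[labels == c]
            if len(members):              # keep the old centre if a cluster runs empty
                centres[c] = members.mean(axis=0)
    return labels.reshape(rows, cols)

# Toy example (hypothetical values): left column near 0, right column near 10.
weights = np.array([[[0.0, 0.0], [10.0, 10.0]],
                    [[0.0, 1.0], [10.0, 11.0]]])
labels = kmeans_units(weights, k=2, seed=0)
print(labels)
```

On real maps, different seeds can lead to different partitions and different label numbering, which is why a cluster may appear in one run and not in another.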


## 16. Quantization error

The visualization shows the average distance between the input vectors and their best matching unit and serves as an indication of how well the map is trained.

We observe only one neuron with a high quantization error.

![](img/occs_quant_error.png)
Quantization error 
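
As a sketch of the per-unit quantity, the code below averages, for every unit, the distance between the unit's weight vector and the samples mapped to it. Units without any hits get `NaN`, which is a choice made for this example; the function name and toy data are likewise assumptions.

```python
import numpy as np

def quantization_error_map(weights, data):
    """Mean distance between each unit and the samples that select it as BMU."""
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)
    bmu_dist = dists[np.arange(len(data)), bmu]      # distance of each sample to its BMU
    total = np.bincount(bmu, weights=bmu_dist, minlength=rows * cols)
    hits = np.bincount(bmu, minlength=rows * cols)
    qe = np.full(rows * cols, np.nan)                # units without hits stay undefined
    qe[hits > 0] = total[hits > 0] / hits[hits > 0]
    return qe.reshape(rows, cols)

# Toy example (hypothetical values): a 2x2 map in a 2-D input space.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
data = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 1.0]])
qe = quantization_error_map(weights, data)
print(qe)
```

A single unit with a high value, as observed above, would mean that the samples mapped to it are comparatively far from its weight vector.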

## 17. Topographic Error

The topographic error visualizes how well the SOM preserves the topology of the input space by calculating the percentage of data samples for which the first and second BMU are not placed in adjacent units in the SOM.

We notice that the cluster with low sensor readings has a high number of topographic errors, but the cluster is dense and according to the lecture this might be misleading.

![](img/occs_toperror.png)
Topographic error
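
The definition above translates almost directly into code: find the first and second BMU of every sample and check whether they are neighbours on the grid. The sketch treats the 8 surrounding units of a rectangular grid as adjacent, which is an assumption, as are the function name and the toy maps.

```python
import numpy as np

def topographic_error(weights, data):
    """Fraction of samples whose first and second BMU are not adjacent on the grid."""
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    first, second = order[:, 0], order[:, 1]
    r1, c1 = np.divmod(first, cols)       # flat index -> (row, col)
    r2, c2 = np.divmod(second, cols)
    # Adjacent means a Chebyshev grid distance of at most 1 (8-neighbourhood).
    adjacent = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2)) <= 1
    return float(np.mean(~adjacent))

# Toy example (hypothetical values): 1x3 maps in a 1-D input space.
ordered = np.array([[[0.0], [1.0], [2.0]]])    # neighbouring units have similar weights
folded = np.array([[[0.0], [10.0], [1.0]]])    # units 0 and 2 are similar but not adjacent
data = np.array([[0.4], [9.0]])
te_ordered = topographic_error(ordered, data)
te_folded = topographic_error(folded, data)
```

In the folded map, the sample at 0.4 has its two closest units at opposite ends of the grid, so it counts as a topographic error.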

## 18. SOMStreamVis

The SOMStreamVis plots the BMU indices over time (natural order of samples) and can provide additional information to a SOM visualization.

We used the SOMStreamVis together with the Clustering visualization. Colors are therefore matched between cluster and the matching BMUs over time.
The dataset covers the period from 22.12. at 11 am until 26.12. at 9 am, followed by a gap until 10.01. at 15:30 (at sample number 8086), and ends on 11.01. at 9 am.
SOMStreamVis reflects patterns in the readings: data belonging to the cluster with low sensor readings matches samples from the nights, while the high sensor readings are found in samples from afternoons and evenings. The 25.12. is an outlier in the sense that there is no match with the high sensor reading cluster in the afternoon/evening, indicating low occupancy.



![](img/occs_somviz_clusters.png)
Agglomerative clusters

![](img/occs_somviz_timeline.png)
Timeseries visualization
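
The combination of BMU sequence and cluster colors can be sketched as a lookup: compute the BMU of every sample in its natural order, then read the cluster label of that unit. The function name `cluster_timeline`, the per-unit label array and the toy data are assumptions made for this example.

```python
import numpy as np

def cluster_timeline(weights, data, unit_labels):
    """BMU index and cluster label of that BMU for every sample, in sample order."""
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)                       # one flat BMU index per time step
    return bmu, unit_labels.reshape(-1)[bmu]

# Toy example (hypothetical values): a 2x2 map with one cluster per row.
weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [1.0, 1.0]]])
unit_labels = np.array([[0, 0],
                        [1, 1]])
data = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 1.0]])   # rows are in time order
bmu, timeline = cluster_timeline(weights, data, unit_labels)
print(bmu, timeline)
```

Plotting `timeline` against the sample index (or the timestamps, where available) gives a picture in which day and night patterns show up as alternating cluster labels.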

## 19. Intrinsic distance

Intrinsic distance visualization combines topographic and quantization error visualizations.

Due to performance issues or a bug, we weren't able to display the visualization. According to the logging we have built into it, the calculate function never terminated.

## 20. Mnemonic SOM

Due to performance issues on this dataset, we were unable to render this visualization. The logging we additionally implemented indicated that the calculate function never terminated.