Image clustering based on InceptionV3 activations on an NVIDIA Jetson Nano
The Jetson TX2 has been the Tumbleweed’s main computer for the past year, and we have had a great experience working with it so far. With the release of the new Jetson Nano, NVIDIA provided us with a sample unit to try out. The Jetson Nano is a slightly less powerful but far cheaper SoC (system on a chip) that is meant to run neural networks locally, which reduces latency for applications that need such features “on the edge”, for example in robotics. While it is neither radiation hardened nor radiation tolerant, it is certainly a useful test platform for parts of the Tumbleweed. We took a closer look and came up with a sample application that showcases the Jetson Nano’s strengths: helping to select which images to send back to Earth, a use case that is extremely important for us.
The goal of this project is to cluster images of various Martian surface features (rock formations, sand dunes, ravines, etc.) at a fine granularity. This may allow the Tumbleweed operations team to decide more selectively which images to transmit from Mars back to Earth, and so make better use of the available bandwidth.
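The underlying clustering technique is the basic (but fast) k-means algorithm. K-means starts with some initial (random or heuristically selected) cluster centers and then adjusts them iteratively: each data point is assigned to its nearest center, and each center is then moved to the mean of the points assigned to it, until the centers stop changing.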
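To make the iteration concrete, here is a minimal NumPy sketch of k-means: assign every point to its nearest center, move every center to the mean of its points, and repeat. It is an illustration rather than the code used in this project; the function names, the random initialization, and the stopping rule are our own assumptions.

```python
import numpy as np


def nearest_center(points, centers):
    """Index of the closest center for every row of `points`."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means on an (n, d) array; returns (centers, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct points drawn at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = nearest_center(points, centers)
        # Update step: each center moves to the mean of its points.
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, nearest_center(points, centers)
```

For a dataset of realistic size, an optimized library implementation would likely be a better choice than this loop.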
Unfortunately, k-means performs badly in high-dimensional spaces, such as the pixel space of most real-world images, where data points lie on complex submanifolds. Applying k-means directly to images therefore usually doesn't work very well, especially when the clusters can't easily be distinguished by color or image composition.
To remedy this, we first pass the images through all but the last layer of a pretrained InceptionV3 neural network. This acts as a dimensionality reduction step that greatly simplifies the clustering task, and the network produces semantically meaningful outputs without needing to be trained on our dataset specifically. We then cluster the images based on the neural activations at that layer.
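One way to obtain such activations is through the Keras implementation of InceptionV3, sketched below. This assumes TensorFlow is installed and is not necessarily how our pipeline is set up; the helper names `build_feature_extractor` and `extract_features` are made up for illustration. With `include_top=False` and `pooling="avg"`, the model stops before its classification layer and returns one 2048-dimensional vector per image.

```python
import numpy as np


def build_feature_extractor(weights="imagenet"):
    """InceptionV3 without its final classification layer."""
    # Imported lazily so this module loads even where TensorFlow is absent.
    from tensorflow.keras.applications import InceptionV3

    # include_top=False drops the last dense layer; pooling="avg" averages the
    # final convolutional feature map into a single 2048-dimensional vector.
    return InceptionV3(include_top=False, pooling="avg", weights=weights)


def extract_features(model, images):
    """Map uint8 RGB images of shape (n, h, w, 3) to activations (n, 2048)."""
    import tensorflow as tf
    from tensorflow.keras.applications.inception_v3 import preprocess_input

    # InceptionV3 was trained on 299x299 inputs scaled to the range [-1, 1].
    x = tf.image.resize(tf.cast(images, tf.float32), (299, 299))
    x = preprocess_input(x)
    return model.predict(x, verbose=0)
```

The resulting array can be handed directly to a clustering routine. Passing `weights=None` builds the same architecture with random weights, which is handy for checking shapes without downloading the pretrained parameters.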
Sidenotes: The InceptionV3 network was pretrained on the ImageNet dataset. It can be replaced by other architectures, larger or possibly smaller, such as ResNets, while maintaining similar performance. K-means could be replaced by other fast clustering algorithms. For the initial training step (where we determined the cluster centers) we clustered about 30,000 images, each with 2048 activation dimensions. Any number of clusters between 32 and 512 yielded useful results.
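To give a feel for these numbers, the sketch below estimates the memory needed for the activation matrix and shows how a new image's activations could be matched to the nearest stored cluster center once training is done. Storage as 32-bit floats is an assumption, and both function names are hypothetical.

```python
import numpy as np

N_IMAGES, N_DIMS = 30_000, 2048  # figures quoted in the text


def feature_matrix_bytes(n_rows=N_IMAGES, n_dims=N_DIMS, dtype=np.float32):
    """Memory needed to hold n_rows activation vectors (float32 assumed)."""
    return n_rows * n_dims * np.dtype(dtype).itemsize


def assign_to_clusters(features, centers):
    """Index of the nearest cluster center for each row of `features`."""
    features = np.asarray(features, dtype=np.float64)
    centers = np.asarray(centers, dtype=np.float64)
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, which avoids building
    # an (n, k, d) intermediate array.
    d2 = (
        (features ** 2).sum(axis=1)[:, None]
        - 2.0 * features @ centers.T
        + (centers ** 2).sum(axis=1)[None, :]
    )
    return d2.argmin(axis=1)
```

Under the float32 assumption, the 30,000 x 2048 training matrix takes 245,760,000 bytes (roughly 246 MB), while 128 cluster centers take exactly 1 MiB, so only the centers need to be kept around to label new images.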
Example clustering with 128 clusters and using this dataset: