# 3D colour space 
As we've seen in previous notebooks, we can explore an image by the position of its pixels in an n-dimensional colour space. The 3D RGB space can be chopped up into sections (or bins) as follows:

![RGB space](https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/RGB_Cube_Show_lowgamma_cutout_a.png/1280px-RGB_Cube_Show_lowgamma_cutout_a.png)

By counting the number of pixels that fall into each bin, we leave the underlying pixel values unchanged, but the view we obtain from the reduced space is more computationally manageable (a $(16 \times 16 \times 16)$ binning of the space produces 4096 degrees of freedom, while the original $(256 \times 256 \times 256)$ gives us 16777216 to deal with). It also gives a more intuitive, blurred view of how similar neighbouring colours are to one another. If we can find a way of computing the similarity of the binned spaces for two images, we should be well on our way.
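
To make the bin arithmetic concrete, here is a minimal sketch that computes the size of both spaces and bins a handful of pixels on a $16 \times 16 \times 16$ grid with `np.histogramdd`. The random `pixels` array is a hypothetical stand-in for real image data, not part of the pipeline in this notebook.

```python
import numpy as np

# number of cells in the full and the reduced colour spaces
full_space = 256 ** 3      # 16777216
reduced_space = 16 ** 3    # 4096

# hypothetical stand-in for an image: 1000 random RGB pixels
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(1000, 3))

# count the pixels falling into each cell of a 16 x 16 x 16 grid
histogram, edges = np.histogramdd(pixels,
                                  bins=(16, 16, 16),
                                  range=[(0, 256)] * 3)
```

Every pixel lands in exactly one cell, so `histogram.sum()` equals the number of pixels and `histogram.shape` is `(16, 16, 16)`.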

Let's start by loading in some images.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
plt.rcParams['figure.figsize'] = (20, 20)

import os
import itertools
import numpy as np
import pandas as pd
from PIL import Image
from skimage.color import rgb2lab
from sklearn.metrics.pairwise import pairwise_distances
from umap import UMAP

# tqdm_notebook is deprecated at the top level; tqdm.notebook.tqdm is the same widget
from tqdm.notebook import tqdm

In [None]:
n_images = 5000
path_to_images = '../data/small_images/'

random_ids = np.random.choice(os.listdir(path_to_images), 
                              n_images, 
                              replace=False)

image_dict = {}
for image_id in tqdm(random_ids):
    try:
        # convert to RGB so greyscale, palette and RGBA images all end up with 3 channels
        image_dict[image_id] = Image.open(path_to_images + image_id).convert('RGB')
    except OSError:
        # skip files that can't be read as images
        pass

image_ids = list(image_dict.keys())
images = list(image_dict.values())

We'll now resize and reshape the data for each image. First we shrink each image to a small size (just $75 \times 75$ pixels!), then flatten it into a long array of pixel 3-vectors. For simplicity later on, we'll also pair each pixel array with its corresponding `image_id` in a dictionary.

In [None]:
small_size = 75

pixel_lists = [(np.array(image.resize((small_size, small_size),
                                      resample=Image.BILINEAR))
                .reshape(-1, 3)) for image in images]

pixel_dict = dict(zip(image_ids, pixel_lists))

It's worth emphasising how important the resizing is at this stage. The original images (already shrunk down to a max width/height of 500px) can contain hundreds of thousands of pixels, and processing that much data for each of our thousands of images makes the next stage of the process _painfully_ slow. $75 \times 75$ might sound small, but it seems to provide enough detail to get an impression of an image's dominant colours while keeping the subsequent processing speedy.
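
As a rough back-of-the-envelope check on that saving, the sketch below compares pixel counts before and after resizing. It assumes the worst case of a full $500 \times 500$ original, which is an upper bound rather than a measured figure.

```python
n_images = 5000

# worst case: an original image that fills the full 500px limit in both directions
original_pixels = 500 * 500    # 250000
small_pixels = 75 * 75         # 5625

# how many times fewer pixels we process per image after resizing
reduction = original_pixels / small_pixels

# total pixels to bin across the whole sample
total_small = n_images * small_pixels
```

Under that assumption each image shrinks by a factor of about 44, and the whole sample comes to roughly 28 million pixels.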

### Binning
In the next step, we split our colour-space into an even grid of bins and count the number of pixels appearing in each. Note that the value of `n_bins` (the number of bins along each colour channel) seems to have a large effect on the 'goodness' of the results, and a higher granularity does not necessarily lead to better results. I've currently settled on 16 as a reasonable number, but I'm sure this could be more intelligently optimised.  
Binning an image's pixels provides us with a rough, grainy distribution of the area that it occupies in colour-space. The counts are then flattened into a 1D array which we'll use to compare the image to its counterparts.

In [None]:
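n_bins = 16  # number of bins along each colour channel

bin_count_dict = {}
for image_id, pixels in tqdm(pixel_dict.items()):
    # map each channel value (0-255) to a bin index (0 to n_bins - 1);
    # cast up first so the multiplication can't overflow uint8
    binned_pixels = (pixels.astype(np.uint16) * n_bins // 256).tolist()
    bin_strings = list(map(str, binned_pixels))
    unique, counts = np.unique(bin_strings, return_counts=True)
    bin_count_dict[image_id] = pd.Series(dict(zip(unique, counts)))

# building the dataframe in one go keeps the union of all bins seen;
# assigning columns one at a time would drop bins missing from the first image
bin_counts = pd.DataFrame(bin_count_dict).fillna(0)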
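
The loop above labels each bin with a string. As an alternative, here is a minimal sketch that turns the three per-channel bin indices into a single flat index and counts with `np.bincount`, giving every image a vector of the same fixed length. The helper `count_bins` and the `example_pixels` array are hypothetical names for illustration, not part of the notebook.

```python
import numpy as np

def count_bins(pixels, n_bins=16):
    """Return a flat array of length n_bins ** 3 holding the pixel count per bin."""
    # map each channel value (0-255) to a bin index (0 to n_bins - 1)
    binned = pixels.astype(np.int64) * n_bins // 256
    # combine the three per-channel indices into a single flat bin index
    flat_index = (binned[:, 0] * n_bins + binned[:, 1]) * n_bins + binned[:, 2]
    return np.bincount(flat_index, minlength=n_bins ** 3)

# hypothetical example: two near-black pixels and one white pixel
example_pixels = np.array([[0, 0, 0],
                           [5, 5, 5],
                           [255, 255, 255]], dtype=np.uint8)
example_counts = count_bins(example_pixels)
```

Here the two dark pixels share the first bin and the white pixel falls in the last one. Because empty bins are stored explicitly as zeros, no `fillna` step is needed, at the cost of a longer vector per image.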

### Dimensionality reduction
It's always nice to visualise the separation of vectors within a newly defined feature space - let's do that here with UMAP.

In [None]:
embedding = UMAP().fit_transform(bin_counts.T.values)

plt.scatter(x=embedding[:, 0], 
            y=embedding[:, 1]);

### Similarity
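Now that we have a colour vector for each image, we'll compare them all to one another and store the results in a great big dataframe. Note that `pairwise_distances` returns cosine _distances_ rather than similarities, so despite the dataframe's name, smaller values mean more similar images.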

In [None]:
similarity = pd.DataFrame(data=pairwise_distances(bin_counts.T, 
                                                  metric='cosine'),
                          index=bin_counts.columns,
                          columns=bin_counts.columns)
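
For intuition about what the `'cosine'` metric measures, here is a small worked sketch of cosine distance (one minus the cosine of the angle between two vectors) on toy bin-count vectors. The function `cosine_distance` and the toy vectors are illustrative assumptions, not the code path used above.

```python
import numpy as np

def cosine_distance(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# same proportions of colour, different total counts: distance close to 0
same_shape = cosine_distance([10, 0, 5], [20, 0, 10])

# no bins in common: distance of 1
disjoint = cosine_distance([10, 0, 0], [0, 7, 0])
```

Because the measure depends only on the direction of the vectors, two images with the same mix of colours are treated as identical regardless of how many pixels were counted.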

In [None]:
sns.heatmap(similarity);

### Search 
We'll only really know the goodness of the results by running a search with a few randomly chosen query images.

In [None]:
query_id = np.random.choice(bin_counts.columns)
image_dict[query_id]

In [None]:
resolution = 200
n_similar = 25
size = int(n_similar ** 0.5)
height = int(resolution * size)
width = int(resolution * size)

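# start from a black canvas so any unfilled blocks don't show uninitialised memory
big_image = np.zeros((height, width, 3), dtype=np.uint8)
grid = np.array(list(itertools.product(range(size), range(size))))

# smallest distance first; position 0 is the query image itself, so skip it
most_similar_ids = similarity[query_id].sort_values().index.values[1 : n_similar + 1]
similar_images = [image_dict[similar_id].resize((resolution, resolution),
                                                resample=Image.BILINEAR)
                  for similar_id in most_similar_ids]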

for pos, image in zip(grid, similar_images):
    block_t, block_l = pos * resolution
    block_b, block_r = (pos + 1) * resolution
    
    big_image[block_t : block_b, block_l : block_r] = np.array(image)

Image.fromarray(big_image)
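
The lookup step at the heart of the search can be sketched on its own with a toy distance matrix. The helper `nearest_indices` and the `toy` matrix below are hypothetical. Unlike the slice used above, this version removes the query explicitly rather than assuming it sorts into first place, which may matter if another image sits at a distance of exactly 0.

```python
import numpy as np

def nearest_indices(distances, query_index, n_similar):
    """Indices of the n_similar closest items to query_index, excluding itself."""
    # sort all items by their distance to the query, smallest first
    order = np.argsort(distances[query_index])
    # drop the query explicitly instead of relying on it being first
    order = order[order != query_index]
    return order[:n_similar]

# toy symmetric distance matrix for three items
toy = np.array([[0.0, 0.2, 0.9],
                [0.2, 0.0, 0.5],
                [0.9, 0.5, 0.0]])
closest = nearest_indices(toy, query_index=0, n_similar=2)
```

For item 0, the nearest neighbour is item 1 (distance 0.2), followed by item 2 (distance 0.9).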

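The results here are definitely better than those produced by previous approaches, but the hard boundary between bins is a bit ugly and frustrating: two almost identical colours that sit either side of a bin edge are counted as if they had nothing in common. I'd like to carry on in search of the perfect technique.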