# 3D colour space 
As we've seen in previous notebooks, we can explore an image by the position of its pixels in an n-dimensional colour space. The 3D RGB space can be binned into sections as follows:

![RGB space](https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/RGB_Cube_Show_lowgamma_cutout_a.png/1280px-RGB_Cube_Show_lowgamma_cutout_a.png)

The underlying pixel values are effectively continuous (256 levels per channel), but a binned view of the space is more computationally manageable: a $(16 \times 16 \times 16)$ binning gives 4,096 bins, while the original $(256 \times 256 \times 256)$ space has 16,777,216 possible colours to deal with. Binning also gives a more intuitive, blurred view in which neighbouring colours are treated as similar to one another.
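As a quick check of the arithmetic above, the sketch below counts the cells at both resolutions and maps a single pixel to its bin. The names `levels`, `n_bins`, `bin_width` and `pixel` are illustrative choices, and the pixel value is arbitrary.

```python
import numpy as np

levels = 256                    # values per 8-bit channel
n_bins = 16                     # bins per channel (illustrative, matches the text)
bin_width = levels // n_bins    # 16 intensity levels fall into each bin

total_bins = n_bins ** 3        # cells in the binned space
total_colours = levels ** 3     # cells in the original space

pixel = np.array([200, 30, 127], dtype=np.uint8)   # an arbitrary RGB value
bin_index = pixel // bin_width                     # integer division picks the bin per channel

print(total_bins, total_colours, bin_index.tolist())
```

Integer division by the bin width is all that is needed to find a pixel's bin, because the bins are evenly spaced along each channel.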

Let's start by loading in some images:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
plt.rcParams['figure.figsize'] = (20, 20)

import os
import itertools
import numpy as np
import pandas as pd
from PIL import Image

from umap import UMAP
from sklearn.metrics.pairwise import pairwise_distances

from tqdm.notebook import tqdm  # replaces the deprecated tqdm_notebook alias

In [None]:
n_images = 5000
path_to_images = '../data/small_images/'

random_ids = np.random.choice(os.listdir(path_to_images), 
                              n_images, 
                              replace=False)

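image_dict = {}
for image_id in tqdm(random_ids):
    try:
        # convert('RGB') handles greyscale, palette and RGBA files alike,
        # so every image ends up with exactly three channels
        image = Image.open(os.path.join(path_to_images, image_id)).convert('RGB')
        image_dict[image_id] = image
    except Exception:
        # skip unreadable files; unlike a bare except, this still lets
        # KeyboardInterrupt stop the loop
        pass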
    
image_ids = list(image_dict.keys())
images = list(image_dict.values())

In [None]:
len(images)

In [None]:
images[9]

Each pixel in an image is treated as a point in 3D space. We bin that space evenly and count the pixels that fall into each bin.
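To make the idea concrete, here is a minimal sketch on a synthetic image, using NumPy's `histogramdd` rather than the string-based approach used in this notebook. The 4 x 4 random image and the `toy_` names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# A synthetic 4x4 RGB image stands in for a real one (hypothetical data).
toy_image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Flatten to one row per pixel: each row is a point (R, G, B) in colour space.
toy_pixels = toy_image.reshape(-1, 3).astype(float)

# 16 evenly spaced bins per channel over the range [0, 256].
n_bins = 16
toy_counts, _ = np.histogramdd(toy_pixels,
                               bins=(n_bins,) * 3,
                               range=[(0, 256)] * 3)

print(toy_counts.shape)        # one cell per bin
print(int(toy_counts.sum()))   # every pixel is counted exactly once
```

The result is a 3D array of counts; flattening it gives the kind of 4,096-long vector per image that the rest of the notebook works with.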

In [None]:
small_size = 75

pixel_lists = [(np.array(image.resize((small_size, small_size)))
                .reshape(-1, 3)) for image in images]

pixel_dict = dict(zip(image_ids, pixel_lists))

Resizing matters here: if we don't do it, the binning process that follows is _painfully_ slow. The value of `small_size` is a first guess and still needs some optimisation.
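A rough way to reason about the cost is that the binning work grows with the total number of pixels, which is quadratic in `small_size`. The sketch below only counts pixels; it does not measure run time, and the candidate sizes are hypothetical values chosen for comparison.

```python
n_images = 5000   # matches the sample size used above

def pixels_after_resize(size):
    """Number of pixels in one image resized to size x size."""
    return size * size

for candidate_size in (25, 50, 75, 150, 300):   # hypothetical values to compare
    total = n_images * pixels_after_resize(candidate_size)
    print(f"small_size={candidate_size:>3}: {total:>12,} pixels to bin")
```

Doubling `small_size` quadruples the number of pixels to process, so a timing experiment over a few sizes would be a sensible way to pick the value.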

In [None]:
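n_bins = 16                 # bins per colour channel
bin_width = 256 // n_bins   # intensity levels covered by each bin
r = range(n_bins)

# one row per bin, labelled like '[0, 0, 0]' to match str() of a binned pixel
bins = [str(list(b)) for b in itertools.product(r, r, r)]
bin_counts = pd.DataFrame(index=bins)

for image_id, pixels in tqdm(pixel_dict.items()):
    # integer division maps each 0-255 channel value to a bin number 0-15
    binned_pixels = (pixels // bin_width).tolist()
    bin_strings = list(map(str, binned_pixels))
    unique, counts = np.unique(bin_strings, return_counts=True)
    bin_counts[image_id] = pd.Series(dict(zip(unique, counts)))

# bins that an image never hits come back as NaN, so fill them with 0
bin_counts = bin_counts.fillna(0)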
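The string-based counting above is easy to read but slow. One possible alternative, sketched here under the assumption that 256 is divisible by the number of bins per channel, is to collapse each pixel's three bin coordinates into a single integer and count with `np.bincount`. The function `bin_histogram` and the random test pixels are hypothetical and not part of the notebook.

```python
import numpy as np

def bin_histogram(pixels, n_bins=16):
    """Count the pixels of one image in each cell of an n_bins^3 grid.

    `pixels` is an (N, 3) uint8 array, like the values of `pixel_dict`.
    Assumes 256 is divisible by n_bins.
    """
    bin_width = 256 // n_bins
    binned = pixels.astype(np.int64) // bin_width    # (N, 3), values 0..n_bins-1
    # Collapse (r, g, b) bin coordinates into a single integer per pixel.
    flat = (binned[:, 0] * n_bins + binned[:, 1]) * n_bins + binned[:, 2]
    return np.bincount(flat, minlength=n_bins ** 3)

rng = np.random.default_rng(0)
toy_pixels = rng.integers(0, 256, size=(75 * 75, 3), dtype=np.uint8)
toy_hist = bin_histogram(toy_pixels)
print(toy_hist.shape, toy_hist.sum())
```

The flat index runs through the bins with the blue coordinate changing fastest, which is the same order that `itertools.product(r, r, r)` produces, so the resulting vectors line up with the rows of `bin_counts`.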

In [None]:
embedding = UMAP().fit_transform(bin_counts.T.values)

In [None]:
plt.scatter(x=embedding[:, 0], 
            y=embedding[:, 1]);

In [None]:
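# NB: despite the name, this holds cosine *distances*:
# 0 means identical colour distributions, larger values mean less alike
similarity = pd.DataFrame(data=pairwise_distances(bin_counts.T, 
                                                  metric='cosine'),
                          index=bin_counts.columns,
                          columns=bin_counts.columns)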
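For intuition about what the cosine metric measures, the sketch below computes it by hand on three toy histograms. The helper `cosine_distance` and the counts are made up for illustration; the notebook itself relies on `pairwise_distances`.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means the vectors point the same way."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

hist_a = np.array([10, 0, 5, 0])   # toy histograms (hypothetical counts)
hist_b = hist_a * 4                # same colour mix, four times as many pixels
hist_c = np.array([0, 7, 0, 3])    # shares no bins with hist_a

print(cosine_distance(hist_a, hist_b))   # close to 0: same proportions
print(cosine_distance(hist_a, hist_c))   # 1: no overlap at all
```

Because the metric depends only on the direction of the count vector, two images with the same mix of colours are treated as identical regardless of how many pixels each contributes.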

In [None]:
sns.heatmap(similarity);

In [None]:
id = np.random.choice(bin_counts.columns)
image_dict[id]

In [None]:
resolution = 200
n_similar = 10

# smallest distances first; position 0 is normally the query image itself, so skip it
most_similar_ids = similarity[id].sort_values().index.values[1 : n_similar + 1]
similar_images = [image_dict[similar_id].convert('RGB').resize((resolution, resolution))
                  for similar_id in most_similar_ids]
# hstack already yields shape (resolution, n_similar * resolution, 3)
Image.fromarray(np.hstack([np.array(image) for image in similar_images]))

The results here are definitely better than those produced by previous approaches, but the hard boundary between bins is a bit ugly and frustrating: two almost identical colours that sit either side of a bin edge are counted as entirely different.
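The boundary problem is easy to reproduce with three greys, assuming a bin width of 16 levels per channel. The helper `bin_of` and the colour values are illustrative only.

```python
bin_width = 16   # assumed bin width, matching a 16-bins-per-channel grid

def bin_of(colour):
    """Return the (r, g, b) bin coordinates of an 8-bit RGB colour."""
    return tuple(channel // bin_width for channel in colour)

grey_a = (127, 127, 127)
grey_b = (128, 128, 128)   # one level away from grey_a
grey_c = (112, 112, 112)   # fifteen levels darker than grey_a

print(bin_of(grey_a), bin_of(grey_b), bin_of(grey_c))
```

Here `grey_a` and `grey_b` differ by a single level yet land in different bins, while `grey_a` and the visibly darker `grey_c` share one.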

# Quick transformation from RGB to LAB