# recreating the paper with tiny imagenet
First we're going to take a stab at the most basic version of DeViSE: learning a mapping between image feature vectors and their corresponding labels' word vectors for imagenet classes. Doing this with the entirety of imagenet feels like overkill, so we'll start with tiny imagenet.

## tiny imagenet
Tiny imagenet is a subset of imagenet which has been preprocessed for the Stanford computer vision course CS231N. It's freely available to download and ideal for putting together quick and easy tests and proof-of-concept work in computer vision. From [their website](https://tiny-imagenet.herokuapp.com/):
> Tiny Imagenet has 200 classes. Each class has 500 training images, 50 validation images, and 50 test images.

Images are also resized to 64x64px, making the whole dataset small and fast to load. 

We'll use it to demo the DeViSE idea here. Let's load in a few of the packages we'll use in the project - plotting libraries, numpy, pandas etc, and pytorch, which we'll use to construct our deep learning models.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
plt.rcParams['figure.figsize'] = (20, 20)

import os
import io
import numpy as np
import pandas as pd
from PIL import Image
from scipy.spatial.distance import cdist

import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms

from tqdm.notebook import tqdm  # tqdm._tqdm_notebook is a deprecated import path
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
base_path = '/mnt/efs/images/tiny-imagenet-200/'

# wordvectors
We're going to use the [fasttext](https://fasttext.cc/docs/en/english-vectors.html) word vectors trained on [common crawl](http://commoncrawl.org) as the target word vectors throughout this work. Let's load them into memory.

In [None]:
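wv_path = '/mnt/efs/nlp/word_vectors/fasttext/crawl-300d-2M.vec'

fasttext = {}
with io.open(wv_path, 'r', encoding='utf-8', newline='\n', errors='ignore') as wv_file:
    next(wv_file)  # skip the header line, which holds the word count and dimension
    for line in tqdm(wv_file):
        word, *values = line.rstrip().split(' ')
        # np.float has been removed from numpy; the builtin float is equivalent
        fasttext[word] = np.array(values).astype(float)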
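
The `.vec` format is plain text: a header line giving the number of words and the vector dimension, followed by one word per line with its space-separated values. Here's a minimal sketch of that parsing on a made-up, three-dimensional sample held in memory. The sample contents and the `load_vectors` helper are illustrative assumptions, not part of the original notebook.

```python
import io
import numpy as np

# a made-up, three-dimensional stand-in for a real .vec file
sample_vec = io.StringIO(
    "2 3\n"
    "cat 0.1 0.2 0.3\n"
    "dog 0.4 0.5 0.6\n"
)

def load_vectors(handle):
    """Parse an open .vec file into a {word: vector} dictionary."""
    # the header line holds the vocabulary size and the vector dimension
    n_words, n_dims = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        word, *values = line.rstrip().split(' ')
        vectors[word] = np.array(values, dtype=float)
    return vectors, n_words, n_dims

toy_vectors, n_words, n_dims = load_vectors(sample_vec)
print(n_words, n_dims, toy_vectors['dog'])
```

Reading the header separately keeps it out of the dictionary, where it would otherwise show up as a one-dimensional "word vector".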

In [None]:
vocabulary = set(fasttext.keys())

# wordnet
We're also going to need to load the wordnet classes and ids from tiny-imagenet.

In [None]:
clean = lambda x: x.lower().strip().replace(' ', '-').split(',-')
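
To see what that helper does, here's a small sketch running the same chain of string operations on two label strings written in the style of `words.txt`. The example strings are made up for illustration, and `clean_label` is just a named copy of the lambda.

```python
def clean_label(raw_words):
    """Lowercase, trim, hyphenate multi-word names, and split on ', '."""
    return raw_words.lower().strip().replace(' ', '-').split(',-')

# a single multi-word label becomes one hyphenated token
single = clean_label('Egyptian cat\n')
# a comma-separated list of synonyms becomes one token per synonym
synonyms = clean_label('tabby, tabby cat\n')

print(single)
print(synonyms)
```

Multi-word names end up as hyphenated tokens, so they only survive the vocabulary filter if fasttext happens to contain that hyphenated form.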

In [None]:
with open(base_path + 'wnids.txt') as f:
    wnids = np.array([id.strip() for id in f.readlines()])

wordnet = {}
with open(base_path + 'words.txt') as f:
    for line in f.readlines():
        wnid, raw_words = line.split('\t')
        words = [word for word in clean(raw_words)
                 if word in vocabulary]
        
        if wnid in wnids and len(words) > 0:
            wordnet[wnid] = words

In [None]:
wnid_to_wordvector = {wnid: (np.array([fasttext[word] for word in words])
                             .mean(axis=0))
                      for wnid, words in wordnet.items()}

wnids = list(wnid_to_wordvector.keys())
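
When a class has several in-vocabulary names, its target is the element-wise mean of their word vectors. A minimal sketch of that averaging, using made-up three-dimensional vectors and a hypothetical wnid in place of the real fasttext entries:

```python
import numpy as np

# made-up 3-d vectors standing in for fasttext entries
toy_fasttext = {'tabby': np.array([1.0, 0.0, 2.0]),
                'tabby-cat': np.array([3.0, 2.0, 0.0])}
toy_wordnet = {'n00000001': ['tabby', 'tabby-cat']}  # hypothetical wnid

# same comprehension as above: stack the synonyms' vectors, then average them
toy_wnid_to_wordvector = {
    wnid: np.array([toy_fasttext[word] for word in words]).mean(axis=0)
    for wnid, words in toy_wordnet.items()
}
print(toy_wnid_to_wordvector['n00000001'])
```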

# example data
Here's an example of what we've got inside tiny-imagenet: one tiny image and its corresponding class.

In [None]:
wnid = np.random.choice(wnids)
image_path = base_path + 'train/' + wnid + '/images/' + wnid + '_{}.JPEG'
print(' '.join(wordnet[wnid]))
Image.open(image_path.format(np.random.choice(500)))

# datasets and dataloaders
Pytorch allows you to explicitly write out how batches of data are assembled and fed to a network. Especially when dealing with images, I've found it's best to use a pandas dataframe of simple paths and pointers as the base structure for assembling data. Instead of loading all of the images and corresponding word vectors into memory at once, we can just store the paths to the images with their wordnet ids. Using pandas also gives us the opportunity to do all sorts of work to the structure of the data without having to use much memory.  
Here's how that dataframe is put together:

In [None]:
df = {}

for wnid in wnids:
    wnid_path = base_path + 'train/' + wnid + '/images/'
    image_paths = [wnid_path + file_name for file_name in os.listdir(wnid_path)]
    for path in image_paths:
        df[path] = wnid

df = pd.Series(df).to_frame().reset_index()
df.columns = ['path', 'wnid']

Pandas is great for working with this kind of structured data - we can quickly shuffle the dataframe:

In [None]:
df = df.sample(frac=1).reset_index(drop=True) 

and split it into 80:20 train:test portions. 

In [None]:
split_ratio = 0.8
train_size = int(split_ratio * len(df))

# .iloc slices by position and excludes the end point, so no row lands in both portions
train_df = df.iloc[:train_size]
test_df  = df.iloc[train_size:]
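
One detail worth checking with this kind of split is whether the two portions overlap. Label-based `.loc` slices include their end label, while position-based `.iloc` slices exclude their end position. A quick sketch on a made-up ten-row frame (the `toy_` names are illustrative only):

```python
import pandas as pd

toy_df = pd.DataFrame({'path': ['img_{}.JPEG'.format(i) for i in range(10)]})
toy_train_size = int(0.8 * len(toy_df))

# label-based slicing includes the end label, so row 8 lands in both portions
loc_train, loc_test = toy_df.loc[:toy_train_size], toy_df.loc[toy_train_size:]

# position-based slicing excludes the end position, so the portions are disjoint
iloc_train, iloc_test = toy_df.iloc[:toy_train_size], toy_df.iloc[toy_train_size:]

print(len(loc_train), len(loc_test))
print(len(iloc_train), len(iloc_test))
```

With `.iloc`, the two lengths add up to the length of the original frame and no image appears on both sides of the split.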

n.b. tiny-imagenet already has `train/`, `test/`, and `val/` directories set up which we could have used here instead. However, we're just illustrating the principle in this notebook so the data itself isn't important, and we'll use this kind of split later on when incorporating non-toy data.

Now we can define how our `Dataset` object will transform the initial, simple data when it's called on to produce a batch. Images are generated by giving a path to `PIL`, and word vectors are looked up in our `wnid_to_wordvector` dictionary. Both objects are then transformed into pytorch tensors and handed over to the network.

In [None]:
class ImageDataset(Dataset):
    def __init__(self, dataframe, wnid_to_wordvector,
                 transform=transforms.ToTensor()):
        self.image_paths = dataframe['path'].values
        self.wnids = dataframe['wnid'].values
        self.wnid_to_wordvector = wnid_to_wordvector
        self.transform = transform

    def __getitem__(self, index):
        image = Image.open(self.image_paths[index]).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)

        # look up the vector on the instance rather than relying on a global variable
        target = torch.Tensor(self.wnid_to_wordvector[self.wnids[index]])
        return image, target

    def __len__(self):
        return len(self.wnids)

We can also apply transformations to the images as they move through the pipeline (see the `if` statement above in `__getitem__()`). The torchvision package provides lots of fast, intuitive utilities for this kind of thing which can be strung together as follows. Note that we're not applying any flips or grayscale to the test dataset - the test data should generally be left as raw as possible, with distortions applied at train time to increase the generality of the network's knowledge.

In [None]:
train_transform = transforms.Compose([transforms.Resize(224),
                                      transforms.RandomHorizontalFlip(),
                                      transforms.RandomRotation(15),
                                      transforms.RandomGrayscale(0.25),
                                      transforms.ToTensor()])

test_transform = transforms.Compose([transforms.Resize(224),
                                     transforms.ToTensor()])

Now all we need to do is pass our dataframe, dictionary of word vectors, and the desired image transforms to the `ImageDataset` object to define our data pipeline for training and testing.

In [None]:
train_dataset = ImageDataset(train_df, wnid_to_wordvector, train_transform)
test_dataset = ImageDataset(test_df, wnid_to_wordvector, test_transform)

Pytorch then requires that you pass the `Dataset` through a `DataLoader` to handle the batching etc. The `DataLoader` manages the pace and order of the work, while the `Dataset` does the work itself. The structure of these things is very predictable, and we don't have to write anything custom at this point.

In [None]:
batch_size = 128

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          num_workers=5,
                          shuffle=True)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=batch_size,
                         num_workers=5)
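
A `DataLoader` left at its default `drop_last=False` yields a smaller final batch whenever the dataset size isn't a multiple of `batch_size`, which matters for any code that assumes every batch has exactly `batch_size` rows. Here's a rough sketch of the arithmetic, assuming a test set of 20,000 images (20% of 200 classes x 500 images, if no class was dropped while matching labels to word vectors). `batch_sizes` is a hypothetical helper.

```python
import math

def batch_sizes(n_samples, batch_size):
    """Sizes of the batches a loader with drop_last=False would yield."""
    n_batches = math.ceil(n_samples / batch_size)
    return [min(batch_size, n_samples - i * batch_size)
            for i in range(n_batches)]

sizes = batch_sizes(n_samples=20000, batch_size=128)
print(len(sizes), sizes[-1])
```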

# building the model
Our model uses a pre-trained backbone to extract feature vectors from the images. This biases our network to perform well on imagenet-style images and worse on others, but hey, we're searching on imagenet in this example! Later on, when working with some less imagenet-y images, we'll make some attempts to compensate for the backbone's biases.

In [None]:
backbone = models.vgg16_bn(pretrained=True).features

We don't want this backbone to be trainable, so we switch off the gradients for its weight and bias tensors.

In [None]:
for param in backbone.parameters():
    param.requires_grad = False

Now we can put together the DeViSE network itself, which embeds image features into word vector space. The output of our backbone network is a $[512 \times 7 \times 7]$ tensor, which we then flatten into a 25088 dimensional vector. That vector is then fed through a few fully connected layers and ReLUs, while compressing the dimensionality down to our target size (300, to match the fasttext word vectors).

In [None]:
class DeViSE(nn.Module):
    def __init__(self, backbone, target_size=300):
        super(DeViSE, self).__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(in_features=(25088), out_features=target_size*2),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(in_features=target_size*2, out_features=target_size),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(in_features=target_size, out_features=target_size),
        )

    def forward(self, x):
        x = self.backbone(x)
        x = x.view(x.size(0), -1)
        x = self.head(x)
        x = x / x.max()
        return x

In [None]:
devise_model = DeViSE(backbone, target_size=300).to(device)
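
The 25088 in the first linear layer follows from the input size. VGG16's feature extractor contains five 2x2 max-pooling stages, each halving the height and width, and its final convolutional block has 512 channels. A small sketch of that arithmetic, assuming the 224px inputs produced by `transforms.Resize(224)`:

```python
input_size = 224      # side length after transforms.Resize(224)
n_pooling_layers = 5  # 2x2 max-pooling stages in VGG16's feature extractor
n_channels = 512      # channels in the final convolutional block

# each pooling stage halves the spatial resolution
feature_map_size = input_size // 2 ** n_pooling_layers
flattened_size = n_channels * feature_map_size * feature_map_size
print(feature_map_size, flattened_size)
```

Feeding images of a different size through the backbone would change `flattened_size`, and the first `nn.Linear` layer would need to change with it.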

# train loop
Pytorch requires that we write our own training loops - this is the rough skeleton structure that I've got used to. For each batch, the inputs and target tensors are first passed to the GPU. The inputs are then passed through the network to generate a set of predictions, which are compared to the target using some appropriate loss function. Those losses are used to inform the backpropagation of tweaks to the network's weights and biases, before repeating the whole process with a new batch. We also display the network's current loss in the progress bar, which tracks the speed and progress of the training. The number of epochs is passed as a parameter to the train function.

In [None]:
losses = []

def train(model, train_loader, loss_function, optimiser, n_epochs):
    for epoch in range(n_epochs):
        model.train()
        loop = tqdm(train_loader)
        for images, targets in loop:
            images = images.to(device, non_blocking=True)
            targets = targets.to(device, non_blocking=True)
            # a flag of 1 per sample tells the loss that each pair should match;
            # sized per batch because the final batch can be smaller than batch_size
            flags = torch.ones(images.size(0), device=device)

            optimiser.zero_grad()
            predictions = model(images)

            loss = loss_function(predictions, targets, flags)
            loss.backward()
            optimiser.step()

            loop.set_description('Epoch {}/{}'.format(epoch + 1, n_epochs))
            loop.set_postfix(loss=loss.item())
            losses.append(loss.item())

Here we define the optimiser, loss function and learning rate which we'll use.

In [None]:
trainable_parameters = filter(lambda p: p.requires_grad, devise_model.parameters())

loss_function = nn.CosineEmbeddingLoss()
optimiser = optim.Adam(trainable_parameters, lr=0.001)
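
With every flag set to 1, the cosine embedding loss for a pair is one minus the cosine similarity between prediction and target, averaged over the batch. Here's a numpy sketch of that special case, leaving out the small epsilon pytorch adds for numerical stability. The function and the toy vectors are illustrative, not the library's implementation.

```python
import numpy as np

def cosine_embedding_loss(predictions, targets):
    """Mean of 1 - cos(prediction, target) over a batch (all flags equal to 1)."""
    dot = (predictions * targets).sum(axis=1)
    norms = (np.linalg.norm(predictions, axis=1)
             * np.linalg.norm(targets, axis=1))
    return float(np.mean(1.0 - dot / norms))

toy_targets = np.array([[1.0, 0.0], [0.0, 2.0]])

# same direction as the targets but three times as long
aligned = cosine_embedding_loss(3.0 * toy_targets, toy_targets)
# each prediction perpendicular to its target
orthogonal = cosine_embedding_loss(toy_targets[::-1], toy_targets)
print(aligned, orthogonal)
```

The loss only depends on direction, so rescaling the network's output (as the `x / x.max()` step in `forward` does) leaves it unchanged.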

Let's do some training!

In [None]:
train(model=devise_model,
      n_epochs=3,
      train_loader=train_loader,
      loss_function=loss_function,
      optimiser=optimiser)

When that's done, we can take a look at how the losses are doing.

In [None]:
loss_data = pd.Series(losses).rolling(window=15).mean()
ax = loss_data.plot();

ax.set_xlim(0,);
ax.set_ylim(0, 1);

# evaluate on test set
The loop below is very similar to the training one above, but evaluates the network's loss against the test set and stores the predictions. Obviously we're only going to loop over the dataset once here as we're not training anything. The network only has to see an image once to process it.

In [None]:
preds = []
test_loss = []

devise_model.eval()
with torch.no_grad():
    test_loop = tqdm(test_loader)
    for images, targets in test_loop:
        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        # one flag per sample, sized per batch because the final batch can be smaller
        flags = torch.ones(images.size(0), device=device)

        predictions = devise_model(images)
        loss = loss_function(predictions, targets, flags)

        preds.append(predictions.cpu().numpy())
        test_loss.append(loss.item())

        test_loop.set_description('Test set')
        test_loop.set_postfix(loss=np.mean(test_loss[-5:]))

In [None]:
preds = np.concatenate(preds).reshape(-1, 300)
np.mean(test_loss)

# run a search on the predictions
Now we're ready to use our network to perform image searches! Each of the test set's images has been assigned a position in word vector space which the network believes is a reasonable numeric description of its features. We can use the complete fasttext dictionary to find the position of new, unseen words, and then return the nearest images to our query.

In [None]:
def search(query, n=5):
    image_paths = test_df['path'].values
    distances = cdist(fasttext[query].reshape(1, -1), preds)
    closest_n_paths = image_paths[np.argsort(distances)].squeeze()[:n]
    close_images = [np.array(Image.open(image_path).convert('RGB')) 
                    for image_path in closest_n_paths]
    return Image.fromarray(np.concatenate(close_images, axis=1))

In [None]:
search('bridge')
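
Since the network was trained with a cosine loss, which ignores vector length, one variation worth trying is to rank images by cosine similarity instead of `cdist`'s default Euclidean distance (passing `metric='cosine'` to `cdist` should have a similar effect). Here's a minimal sketch with made-up two-dimensional embeddings. `rank_by_cosine` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def rank_by_cosine(query_vector, embeddings, n=5):
    """Indices of the n embeddings pointing most nearly the same way as the query."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vector)
    similarities = embeddings @ query_vector / norms
    # negate so that the most similar embeddings come first
    return np.argsort(-similarities)[:n]

toy_preds = np.array([[10.0, 0.0], [0.1, 0.1], [1.0, 5.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 1.0])

cosine_order = rank_by_cosine(toy_query, toy_preds, n=2)
euclidean_order = np.argsort(np.linalg.norm(toy_preds - toy_query, axis=1))[:2]
print(cosine_order, euclidean_order)
```

In this toy case the Euclidean ranking puts the last embedding second simply because it is short and sits near the origin, even though it points away from the query. The cosine ranking prefers the third embedding, which is further away but points in a similar direction.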

It works! The network has never seen the word 'bridge', has never been told what a bridge might look like, and has never seen any of the test set's images. But thanks to the combined subtlety of the word vector space which we're embedding our images in and the dexterity with which a neural network can manipulate manifolds like these, the machine has enough knowledge to make a very good guess at what a bridge might be. The model was trained on a tiny, terribly grainy set of data, but that's enough to get startlingly good results.