# Final Project

## Overview
This project is oriented toward building an effective system for generating color descriptions that are pragmatic in the sense that they would help a reader/listener figure out which color was being referred to in a shared context consisting of a target color (whose identity is known only to the describer/speaker) and a set of distractors.

Code for the set-up and initial analysis is courtesy of the notebooks [hw_colors.ipynb](hw_colors.ipynb) and [colors_overview.ipynb](colors_overview.ipynb). Everything up to the original-system section is reused from the homework assignment [hw_colors.ipynb](hw_colors.ipynb).

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Read dev corpus](#Read-dev-corpus)
1. [Dev dataset](#Dev-dataset)
1. [Random train–test split for development](#Random-train–test-split-for-development)
1. [Improve the tokenizer](#Improve-the-tokenizer)
1. [Use the tokenizer](#Use-the-tokenizer)
1. [Improving the color representations](#Improving-the-color-representations)
1. [Use the color representer](#Use-the-color-representer)
1. [Initial model](#Initial-model)
1. [GloVe embeddings](#GloVe-embeddings)
1. [Try the GloVe representations](#Try-the-GloVe-representations)
1. [Color context](#Color-context)
1. [First system](#First-system)

## Set-up

See [colors_overview.ipynb](colors_overview.ipynb) for set-up instructions and other background details.

In [1]:
from colors import ColorsCorpusReader
from nltk.translate.bleu_score import corpus_bleu
import numpy as np
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from torch_color_describer import ContextualColorDescriber
from torch_color_describer import create_example_dataset

import utils
from utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL

In [2]:
utils.fix_random_seeds()

In [3]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv")

## Read dev corpus

Working only with the two-word-only subset of the corpus.

In [4]:
dev_corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME,
    word_count=2,
    normalize_colors=True)

In [5]:
dev_examples = list(dev_corpus.read())

This subset has about one-third the examples of the full corpus:

In [6]:
len(dev_examples)

13890

We __should__ worry that it's not a fully representative sample. Most of the descriptions in the full corpus are shorter, and a large proportion are longer. So this dataset is mainly for debugging, development, and general hill-climbing. All findings should be validated on the full dataset at some point.

## Dev dataset

The first step is to extract the raw color and raw texts from the corpus:

In [7]:
dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])

The raw color representations are suitable inputs to a model, but the texts are just strings, so they can't be processed as-is; they first need to be tokenized.

## Random train–test split for development

For the sake of development runs, we create a random train–test split:

In [8]:
dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test = \
    train_test_split(dev_rawcols, dev_texts)

## Improve the tokenizer

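The function `tokenize_example` tokenizes its string with a regular-expression tokenizer, lowercases the tokens, splits off the endings `er`, `est`, and `ish`, and adds the required start and end symbols. Punctuation is kept as separate tokens: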

In [9]:
def tokenize_example(s):

    # Adapted from the tokenizers that accompany Monroe et al. 2017:
    # https://github.com/futurulus/colors-in-context/blob/master/tokenizers.py
    import re
    
    WORD_RE_STR = r"""
        (?:[a-z][a-z'\-_]+[a-z])       # Words with apostrophes or dashes.
        |
        (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
        |
        (?:[\w_]+)                     # Words without apostrophes or dashes.
        |
        (?:\.(?:\s*\.){1,})            # Ellipsis dots.
        |
        (?:\*{1,})                     # Asterisk runs.
        |
        (?:\!{1,})                     # Exclamation runs.
        |
        (?:\S)                         # Everything else that isn't whitespace. 
        """

    WORD_RE = re.compile(r"(%s)" % WORD_RE_STR, re.VERBOSE | re.I | re.UNICODE)


    def basic_unigram_tokenizer(s, lower=True):
        words = WORD_RE.findall(s)
        if lower:
            words = [w.lower() for w in words]
        return words
    
    ENDINGS = ['er', 'est', 'ish']

    def heuristic_segmenter(word):
        for ending in ENDINGS:
            if word.endswith(ending):
                return [word[:-len(ending)], ending]
        return [word]

    def heuristic_ending_tokenizer(s, lower=True):
        words = basic_unigram_tokenizer(s, lower=lower)
        return [seg for w in words for seg in heuristic_segmenter(w)]

    
    return [START_SYMBOL] + heuristic_ending_tokenizer(s) + [END_SYMBOL]

In [10]:
tokenize_example(dev_texts_train[376])

['<s>', 'aqua', ',', 'teal', '</s>']

In [11]:
tokenize_example("Darker bluish!, darkest.")

['<s>', 'dark', 'er', 'blu', 'ish', '!', ',', 'dark', 'est', '.', '</s>']
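
The suffix heuristic is purely string-based, so it also splits words that merely happen to end in one of the three endings. The sketch below copies `heuristic_segmenter` from the cell above and adds `guarded_segmenter`, a hypothetical variant that is not part of the original tokenizer: it only splits when the remaining stem has at least `min_stem` characters. The threshold of 3 is an arbitrary assumption.

```python
ENDINGS = ['er', 'est', 'ish']


def heuristic_segmenter(word):
    # Same logic as in `tokenize_example`: split off the first matching ending.
    for ending in ENDINGS:
        if word.endswith(ending):
            return [word[:-len(ending)], ending]
    return [word]


def guarded_segmenter(word, min_stem=3):
    # Hypothetical variant: keep the word whole when the stem would be very short.
    for ending in ENDINGS:
        stem = word[:-len(ending)]
        if word.endswith(ending) and len(stem) >= min_stem:
            return [stem, ending]
    return [word]


for word in ["darker", "bluish", "fish", "er", "forest"]:
    print(word, heuristic_segmenter(word), guarded_segmenter(word))
```

A length guard keeps `fish` and `er` intact but still splits `forest`, so it reduces over-segmentation rather than eliminating it.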

## Use the tokenizer

Tokenize inputs:

In [12]:
dev_seqs_train = [tokenize_example(s) for s in dev_texts_train]

dev_seqs_test = [tokenize_example(s) for s in dev_texts_test]

We use only the train set to derive a vocabulary for the model:

In [13]:
dev_vocab = sorted({w for toks in dev_seqs_train for w in toks})

dev_vocab += [UNK_SYMBOL]

It's important that the `UNK_SYMBOL` is included somewhere in this list. In test examples, words not seen in training will be mapped to `UNK_SYMBOL`. 

Conceptual note: If the model's vocab is the same as the train vocab, then `UNK_SYMBOL` will never be encountered during training, so its embedding will still be a random vector at test time.
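
One common way to give the unknown-word symbol a trained representation is to replace rare training tokens with it before fitting. The following is a minimal sketch of that idea, not something this notebook does. The helper names, the `min_count` threshold, and the literal value `"$UNK"` are assumptions for illustration; in the notebook the symbol comes from `utils.UNK_SYMBOL`.

```python
from collections import Counter

UNK_SYMBOL = "$UNK"  # assumed placeholder value; the notebook imports this from utils


def replace_rare_tokens(seqs, min_count=2, unk=UNK_SYMBOL,
                        protected=("<s>", "</s>")):
    # Count every token in the training sequences.
    counts = Counter(tok for seq in seqs for tok in seq)
    # Replace tokens seen fewer than `min_count` times, except boundary symbols.
    return [
        [tok if counts[tok] >= min_count or tok in protected else unk
         for tok in seq]
        for seq in seqs
    ]


def map_unseen_tokens(seqs, vocab, unk=UNK_SYMBOL):
    # At test time, anything outside the vocabulary becomes the unknown symbol.
    vocab = set(vocab)
    return [[tok if tok in vocab else unk for tok in seq] for seq in seqs]


train = [["<s>", "dark", "blue", "</s>"], ["<s>", "blue", "ish", "</s>"]]
print(replace_rare_tokens(train))
```

With this preprocessing the vocabulary would be derived from the replaced sequences, so the unknown symbol appears in training examples.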

In [14]:
len(dev_vocab)

959

## Improving the color representations

The following functions convert the raw HLS colors from the corpus to HSV and then apply the Fourier-transform featurization of Monroe et al., giving a 54-dimensional vector per color.

In [15]:
def represent_color_context(colors):
    return [represent_color(color) for color in colors]

from itertools import product

def represent_color(color):
    
    # translate HLS format into HSV format
    H = color[0]
    L = color[1]
    S = color[2]
    
    h = H
    v = L + 1.0 * S * min(L, 1-L)
    s = 0
    if v != 0:
        s = 2.0 * (1 - 1.0 * L/v)
        
    # fourier transform
    # h,s,v are already all normalized to be in [0,1] range 
    # due to colors.py
    f_hat_real = []
    f_hat_imag = []
    for j, k, l in product((0, 1, 2), repeat=3):
        f_jkl = np.exp(-2j * np.pi * (j*h + k*s + l*v))
        f_hat_real.append(f_jkl.real)
        f_hat_imag.append(f_jkl.imag)
        
    return f_hat_real + f_hat_imag
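
As a sanity check on the HLS-to-HSV step, the closed-form conversion used in `represent_color` can be compared against a round trip through RGB with Python's standard `colorsys` module. This is a sketch under the same assumption as the cell above, namely that all three channels are already scaled to `[0, 1]`; the sample colors are arbitrary.

```python
import colorsys


def hls_to_hsv(h, l, s):
    # Closed-form conversion, as in `represent_color`.
    v = l + s * min(l, 1 - l)
    s_hsv = 0.0 if v == 0 else 2 * (1 - l / v)
    return h, s_hsv, v


def hls_to_hsv_via_rgb(h, l, s):
    # Reference conversion: HLS -> RGB -> HSV using the standard library.
    r, g, b = colorsys.hls_to_rgb(h, l, s)
    return colorsys.rgb_to_hsv(r, g, b)


samples = [(0.3, 0.4, 0.6), (0.75, 0.8, 0.5), (0.1, 0.25, 1.0)]
max_error = max(
    abs(a - b)
    for hls in samples
    for a, b in zip(hls_to_hsv(*hls), hls_to_hsv_via_rgb(*hls))
)
print("largest disagreement:", max_error)
```

The samples deliberately avoid pure grays, where hue is undefined and the two routes may report different hue values.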

__Your task__: Modify `represent_color_context` and/or `represent_color` to represent colors in a new way.
    
__Notes__:

* You are not required to keep `represent_color`. This might be unnatural if you want to perform an operation on each color trio all at once.
* For that matter, if you want to process all of the color contexts in the entire data set all at once, that is fine too, as long as you can also perform the operation at test time with an unknown number of examples being tested.

* The Fourier-transform method of [Monroe et al. 2016](https://www.aclweb.org/anthology/D16-1243/) and [Monroe et al. 2017](https://transacl.org/ojs/index.php/tacl/article/view/1142) is a proven choice for our task. __It is not required that you implement this.__ However, if you decide to, you might find that the overly terse presentation in the paper is an obstacle. The key thing to see is that the notation $\hat{f}_{jkl}$ is meant to specify a full coordinate system. Thus, you might do something like

  ```
from itertools import product
for j, k, l in product((0, 1, 2), repeat=3):    
    f_jkl = ...
```

  and collect these `f_jkl` values in a list of 27 values. Additionally, in Python, [`2j` produces a value with `real` and `imag` attributes](https://docs.python.org/3.7/library/cmath.html). Each element `f_jkl` should have these components. If you concatenate the `real` and `imag` parts of all the `f_jkl`, you will have a 54-dimensional representation, as in the paper. Remember to start with an HSV representation, and with $h$ in $[0, 360]$, $s$ in $[0, 200]$, and $v$ in $[0, 200]$ (or else do the scaling differently). Note that the values in our corpus are in HLS format, [which are easily converted to HSV](https://en.wikipedia.org/wiki/HSL_and_HSV#HSL_to_HSV).
  
* It's natural to ask why this Fourier transform is useful in the current context. This is a challenging question, and I don't have a complete answer, but here is an intuitive observation: if you consider the raw color representations to be embeddings, then you can see very quickly that our standard geometric notions are totally out of line with our intuitions about the colors themselves. For example, here is a plot where we simply vary the hue dimension while keeping the other dimensions constant:

  <img src="fig/colors-hue-hls.png" alt="A series of very different colors with cosine distances from orange ranging from 0 to 0.19" />

  I've printed the cosine distances from the leftmost color above each patch. They all look pretty similar. Now, you might say, well at least the distances are sort of proportional to how different the colors are from the first. However, that argument seems to crumble when we do the same experiment but now varying the saturation dimension:

  <img src="fig/colors-saturation-hls.png" alt="A series of very similar purple-ish colors with cosine distances from gray-purple ranging from 0 to 0.19" />

  These colors are all quite similar intuitively. Notice, though, that the cosine distances are identical to my first plot. Of course! Cosine distance doesn't care about the nature of these dimensions! The underlying color space is a cylinder, not a regular Euclidean 3d space!
  
  The Fourier transformation that we apply is remapping the colors into approximately the cylindrical space that we want. It is at least capturing some of the circular/radial relationships that are inherent in the space. Thus, here are plots corresponding to the above, but now where the colors have been transformed for the cosine comparisons.
  
  First, the hue variation:
  
  <img src="fig/colors-hue-fourier.png" alt="A series of very different colors with cosine distances from orange that are generally large (near 1.0)" />

  And then saturation:
  
  <img src="fig/colors-saturation-fourier.png" alt="A series of very similar purple-ish colors with cosine distances from gray purple that seem aligned with visual color similarity" />
  
  These distances seem much better aligned with intuitions to me, and I think that's quite general. Thus, even if our networks can in principle learn this remapping, it's very helpful to at least start them closer to where we want them to be.
  
  If you want to go one layer deeper, then the [Zhang and Lu 2002](https://www.sciencedirect.com/science/article/pii/S092359650200084X) paper that Monroe et al. 2016 cite is pretty intuitive. It's for the 2d case, but that actually makes the ideas somewhat more accessible, since they can easily plot the original and remapped feature spaces.
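
The circularity argument above can be checked numerically. The sketch below uses only the standard library, with HSV channels assumed to lie in `[0, 1]` as in `represent_color`; the function names are illustrative. It compares two colors whose hues sit at opposite ends of the numeric range (0 and 1), which are the same point on the hue circle.

```python
import cmath
import math
from itertools import product


def fourier_features(h, s, v):
    # 27 complex coefficients, stored as 27 real parts followed by 27 imaginary parts.
    reals, imags = [], []
    for j, k, l in product((0, 1, 2), repeat=3):
        f_jkl = cmath.exp(-2j * math.pi * (j * h + k * s + l * v))
        reals.append(f_jkl.real)
        imags.append(f_jkl.imag)
    return reals + imags


def cosine_distance(u, w):
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / (norm_u * norm_w)


low_hue = (0.0, 0.5, 0.5)
high_hue = (1.0, 0.5, 0.5)

raw_dist = cosine_distance(low_hue, high_hue)
fourier_dist = cosine_distance(fourier_features(*low_hue),
                               fourier_features(*high_hue))
print("raw:", raw_dist, "fourier:", fourier_dist)
```

In the raw space these two colors are far apart, while the Fourier features place them at essentially the same point, because each coefficient is periodic in the hue. Every feature vector also has the same norm, since each of the 27 coefficients has unit modulus.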

## Use the color representer

The following cell just runs `represent_color_context` on the train and test sets:

In [16]:
dev_cols_train = [represent_color_context(colors) for colors in dev_rawcols_train]

dev_cols_test = [represent_color_context(colors) for colors in dev_rawcols_test]

At this point, our preprocessing steps are complete.

In [17]:
dev_mod = ContextualColorDescriber(
    dev_vocab,
    early_stopping=True)
dev_mod.fit(dev_cols_train, dev_seqs_train)

  color_seqs = torch.FloatTensor(color_seqs)
Stopping after epoch 121. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 33.336477279663086

ContextualColorDescriber(
	batch_size=1028,
	max_iter=1000,
	eta=0.001,
	optimizer_class=<class 'torch.optim.adam.Adam'>,
	l2_strength=0,
	gradient_accumulation_steps=1,
	max_grad_norm=None,
	validation_fraction=0.1,
	early_stopping=True,
	n_iter_no_change=10,
	warm_start=False,
	tol=1e-05,
	hidden_dim=50,
	embed_dim=50,
	embedding=None,
	freeze_embedding=False)

In [18]:
evaluation = dev_mod.evaluate(dev_cols_test, dev_seqs_test)
print('listener_accuracy:', evaluation['listener_accuracy'])
print('bleu:', evaluation['corpus_bleu'])

listener_accuracy: 0.7912467607255975
bleu: 0.6586667273916903


## GloVe embeddings

The above model uses a random initial embedding, as configured by the decoder used by `ContextualColorDescriber`.  

Below, we create a GloVe embedding based on the model vocabulary.

In [19]:
GLOVE_HOME = os.path.join('data', 'glove.6B')

In [20]:
def create_glove_embedding(vocab, glove_base_filename='glove.6B.50d.txt'): 
    glove_dict = utils.glove2dict(
        os.path.join(GLOVE_HOME, glove_base_filename))
    return utils.create_pretrained_embedding(glove_dict, vocab)


## Try the GloVe representations

Let's see if GloVe can help for our development data:

In [21]:
dev_glove_embedding, dev_glove_vocab = create_glove_embedding(dev_vocab)

In [22]:
dev_mod_glove = ContextualColorDescriber(
    dev_glove_vocab,
    embedding=dev_glove_embedding,
    early_stopping=True)

In [23]:
_ = dev_mod_glove.fit(dev_cols_train, dev_seqs_train)

Stopping after epoch 111. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 37.68464803695679

In [24]:
evaluation = dev_mod_glove.evaluate(dev_cols_test, dev_seqs_test)
print('listener_accuracy:', evaluation['listener_accuracy'])
print('bleu:', evaluation['corpus_bleu'])

listener_accuracy: 0.7952778577598618
bleu: 0.6600234778700497


We saw a small boost, possibly because the tokenization scheme leads to good overlap with the GloVe vocabulary.
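
The overlap hypothesis is easy to test directly. The sketch below defines an illustrative helper, `glove_coverage` (not part of the course code), that reports what fraction of a vocabulary has a pretrained vector. It runs on a toy lookup here; in the notebook one would presumably pass `dev_vocab` and the dictionary returned by `utils.glove2dict`.

```python
def glove_coverage(vocab, glove_lookup):
    # Split the vocabulary into tokens with and without a pretrained vector.
    missing = [w for w in vocab if w not in glove_lookup]
    coverage = 1.0 - len(missing) / len(vocab)
    return coverage, missing


# Toy stand-ins for the real vocabulary and GloVe dictionary.
toy_vocab = ["<s>", "</s>", "dark", "er", "blu", "ish", "teal"]
toy_glove = {"dark": [0.1], "er": [0.2], "ish": [0.3], "teal": [0.4]}

coverage, missing = glove_coverage(toy_vocab, toy_glove)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

Inspecting the missing tokens is informative for this tokenizer in particular, since stems such as `blu` are artifacts of the suffix splitting and may not exist as GloVe entries.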

## Color context

In the next model, we modify various components of `torch_color_describer.py`.

We redesign the model so that the target color (the final one in the context) is appended to each input token that gets processed by the decoder. We subclass the `Decoder` and `EncoderDecoder` from `torch_color_describer.py` so that we can build models that do this.

__Step 1__: Modify the `Decoder` so that the input vector to the model at each timestep is not just a token representation `x` but the concatenation of `x` with the representation of the target color.

__Notes__:

* We might notice at this point that the original `Decoder.forward` method has an optional keyword argument `target_colors` that is passed to `Decoder.get_embeddings`. Because this is already in place, all you have to do is modify the `get_embeddings` method to use this argument.

* The change affects the configuration of `self.rnn`, so you need to subclass the `__init__` method as well, so that its `input_size` argument accommodates the embedding as well as the color representations.

* We can do the relevant operations efficiently in pure PyTorch using `repeat_interleave` and `cat`, but the important thing is to get a working implementation – you can always optimize the code later if the ideas prove useful to you. 
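
The shape bookkeeping behind this step can be checked independently of the model. The sketch below is a NumPy analogue rather than the PyTorch implementation: `np.repeat` and `np.concatenate` stand in for `repeat_interleave` and `cat`, and the function name and toy dimensions are assumptions for illustration.

```python
import numpy as np


def attach_target_colors(token_embs, target_colors):
    """Append each example's color vector to every token embedding.

    token_embs: array of shape (m, seq_len, embed_dim)
    target_colors: array of shape (m, color_dim)
    returns: array of shape (m, seq_len, embed_dim + color_dim)
    """
    seq_len = token_embs.shape[1]
    # Add a time axis, then copy the color along it once per token.
    colors = np.repeat(target_colors[:, np.newaxis, :], seq_len, axis=1)
    return np.concatenate([token_embs, colors], axis=2)


rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 3, 5))    # 4 examples, 3 tokens, 5-dim embeddings
colors = rng.normal(size=(4, 2))     # one 2-dim target color per example
out = attach_target_colors(embs, colors)
print(out.shape)
```

The last dimension of the result is `embed_dim + color_dim`, which is why the decoder's RNN needs a larger `input_size`.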

In [169]:
from torch_color_describer import Decoder
import torch
import torch.nn as nn


class ColorContextDecoder(Decoder):
    def __init__(self, color_dim, *args, **kwargs):
        self.color_dim = color_dim
        super().__init__(*args, **kwargs)

        self.rnn = nn.GRU(
            input_size=self.embed_dim + self.color_dim,
            hidden_size=self.hidden_dim,
            batch_first=True)  


    def get_embeddings(self, word_seqs, target_colors=None):
        """
        You can assume that `target_colors` is a tensor of shape
        (m, n), where m is the length of the batch (same as
        `word_seqs.shape[0]`) and n is the dimensionality of the
        color representations the model is using. The goal is
        to attach each color vector i to each of the tokens in
        the ith sequence of (the embedded version of) `word_seqs`.

        """
        num_repeats = word_seqs.shape[1]
        colors_repeated = torch.repeat_interleave(target_colors, num_repeats, dim=0)
        colors_repeated_shaped = torch.reshape(colors_repeated, (target_colors.shape[0], num_repeats, target_colors.shape[1]))
        return torch.cat((self.embedding(word_seqs), colors_repeated_shaped),2)
        

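__Step 2__: Modify the `EncoderDecoder` so that it extracts the target color (the final color in each context) and passes it to the decoder.

In [172]: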
from torch_color_describer import EncoderDecoder

class ColorizedEncoderDecoder(EncoderDecoder):

    def forward(self,
            color_seqs,
            word_seqs,
            seq_lengths=None,
            hidden=None,
            targets=None):
        if hidden is None:
            hidden = self.encoder(color_seqs)

        num_contexts = color_seqs.shape[1]
        target_colors = color_seqs[:,num_contexts-1,:]
        output, hidden = self.decoder(word_seqs, seq_lengths=seq_lengths, hidden=hidden, target_colors=target_colors)


        # Your decoder will return `output, hidden` pairs; the
        # following will handle the two return situations that
        # the code needs to consider -- training and prediction.
        if self.training:
            return output
        else:
            return output, hidden

__Step 3__: Finally, as in the examples in [Modifying the core model](colors_overview.ipynb#Modifying-the-core-model), you need to modify the `build_graph` method of `ContextualColorDescriber` so that it uses your new `ColorContextDecoder` and `ColorizedEncoderDecoder`. Here's starter code:

In [173]:
from torch_color_describer import Encoder

class ColorizedInputDescriber(ContextualColorDescriber):

    def build_graph(self):

        encoder = Encoder(
            color_dim=self.color_dim,
            hidden_dim=self.hidden_dim)

        # Use your `ColorContextDecoder`, making sure
        # to pass in all the keyword arguments coming
        # from `ColorizedInputDescriber`:

        decoder = ColorContextDecoder(
            color_dim=self.color_dim,
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            hidden_dim=self.hidden_dim,
            embedding=self.embedding)


        # Return a `ColorizedEncoderDecoder` that uses
        # the encoder and decoder defined above:
        return ColorizedEncoderDecoder(encoder, decoder)
    

That's it! Since these modifications are pretty intricate, you might want to use [a toy dataset](colors_overview.ipynb#Toy-problems-for-development-work) to debug them.

If that worked, then you can now try this model on SCC problems!

## First system

There are many options for your original system, which consists of the full pipeline – all preprocessing and modeling steps. You are free to use any model you like, as long as you subclass `ContextualColorDescriber` in a way that allows its `evaluate` method to behave in the expected way.

So that we can evaluate models in a uniform way for the bake-off, we ask that you modify the function `evaluate_original_system` below so that it accepts a trained instance of your model and does any preprocessing steps required by your model.

If we seek to reproduce your results, we will rerun this entire notebook. Thus, it is fine if your `evaluate_original_system` makes use of functions you wrote or modified above this cell.

In [176]:
def evaluate_original_system(trained_model, color_seqs_test, texts_test):
    """
    Feel free to modify this code to accommodate the needs of
    your system. Just keep in mind that it will get raw corpus
    examples as inputs for the bake-off.

    """
    # `word_seqs_test` is a list of strings, so tokenize each of
    # its elements:
    tok_seqs = [tokenize_example(s) for s in texts_test]

    col_seqs = [represent_color_context(colors)
                for colors in color_seqs_test]


    # Optionally include other preprocessing steps here. Note:
    # DO NOT RETRAIN YOUR MODEL AS PART OF THIS EVALUATION!
    # It's a tempting step, but it's a mistake and will get
    # you disqualified!

    # The following core score calculations are required:
    evaluation = trained_model.evaluate(col_seqs, tok_seqs)

    return evaluation

If `evaluate_original_system` works on test sets you create from the corpus distribution, then it will work for the bake-off, so consider checking that. For example, this would check that `dev_mod` above passes muster:

In [177]:
my_evaluation = evaluate_original_system(dev_mod, dev_rawcols_test, dev_texts_test)

In [178]:
print('listener_accuracy:', my_evaluation['listener_accuracy'])
print('bleu:', my_evaluation['corpus_bleu'])

listener_accuracy: 0.7912467607255975
bleu: 0.6586667273916903


In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies. We also ask that you report the best **listener_accuracy** score your system got during development, just to help us understand how systems performed overall.

<font color='red'>Please review the descriptions in the following comment and follow the instructions.</font>

## Baseline Model for GRU with multi layer and multi dimension

In [100]:

# This system has a few main components.
# 1. Preprocessing: 

# 2. Encoder: 
# - GRU cell
# - multi-layer [2, 3, 4, 5]
# - hidden dim: [50, 100, 200, 300]

# 3. Decoder: 
# - GRU cell
# - multi-layer
# - attention mechanisms
# - hidden dim : [50, 100, 200, 300]
# - word sequences are passed as GloVe embeddings and have color contexts appended

import torch.nn as nn
from torch_color_describer import Encoder, Decoder

class DeepGRUEncoder(Encoder):
    def __init__(self, *args, num_layers=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_layers = num_layers
        self.rnn = nn.GRU(
            input_size=self.color_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True)


class DeepGRUDecoder(Decoder):
    def __init__(self, *args, num_layers=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_layers = num_layers
        self.rnn = nn.GRU(
            input_size=self.embed_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True)
from torch_color_describer import EncoderDecoder

class DeepGRUContextualColorDescriber(ContextualColorDescriber):
    def __init__(self, *args, num_layers=2, **kwargs):
        self.num_layers = num_layers
        super().__init__(*args, **kwargs)

    def build_graph(self):
        encoder = DeepGRUEncoder(
            color_dim=self.color_dim,
            hidden_dim=self.hidden_dim,
            num_layers=self.num_layers)  

        decoder = DeepGRUDecoder(
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            embedding=self.embedding,
            hidden_dim=self.hidden_dim,
            num_layers=self.num_layers)  

        return EncoderDecoder(encoder, decoder)

mod_GRU_deep = DeepGRUContextualColorDescriber(
    dev_glove_vocab,
    embedding=dev_glove_embedding,
    num_layers=4,
    hidden_dim=300,
    early_stopping=True)
_ = mod_GRU_deep.fit(dev_cols_train, dev_seqs_train)

evaluation = mod_GRU_deep.evaluate(dev_cols_test, dev_seqs_test)
print('listener_accuracy:', evaluation['listener_accuracy'])
print('bleu:', evaluation['corpus_bleu'])

  perp = [np.prod(s)**(-1/len(s)) for s in scores]
Stopping after epoch 67. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 42.684147357940674

listener_accuracy: 0.7903829542182551
bleu: 0.6538640145474843
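
The `perp = [np.prod(s)**(-1/len(s)) for s in scores]` line in the output above looks like the source line quoted by a runtime warning. The warning text itself is not shown, so the cause is an assumption, but one plausible trigger is the product of token probabilities underflowing to zero. The sketch below uses only the standard library and illustrative function names. It compares the direct formula with an algebraically equivalent log-space version that avoids the underflow.

```python
import math


def perplexity_product(probs):
    # Direct form: (p_1 * ... * p_n) ** (-1 / n).
    return math.prod(probs) ** (-1 / len(probs))


def perplexity_log(probs):
    # Same quantity in log space: exp(-mean(log p_i)).
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


moderate = [0.5, 0.25, 0.125]
print(perplexity_product(moderate), perplexity_log(moderate))

tiny = [1e-200, 1e-200, 1e-200]
try:
    direct = perplexity_product(tiny)
except ZeroDivisionError:
    # The product underflowed to 0.0, which cannot be raised to a negative power.
    direct = float("inf")
print(direct, perplexity_log(tiny))
```

For ordinary probabilities the two forms agree. For very small ones only the log-space form returns a finite value.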


## GRU based bidirectional Encoder and Decoder

In [295]:
#Try bidirectional for both encoder and decoder

# This system has a few main components.
# 1. Preprocessing: 

# 2. Encoder: 
# - GRU cell
# - multi-layer [2, 3, 4, 5]
# - hidden dim: [50, 100, 200, 300]
# - bidirectional

# 3. Decoder: 
# - GRU cell
# - multi-layer [2, 3, 4, 5, 8]
# - attention mechanisms
# - hidden dim : [50, 100, 200, 300]
# - bidirectional
# - word sequences are passed as GloVe embeddings and have color contexts appended

import torch.nn as nn
from torch_color_describer import Encoder, Decoder

class DeepGRUEncoder(Encoder):
    def __init__(self, *args, num_layers=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_layers = num_layers
        self.rnn = nn.GRU(
            input_size=self.color_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True)


class DeepGRUDecoder(Decoder):
    def __init__(self, *args, num_layers=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_layers = num_layers
        self.rnn = nn.GRU(
            input_size=self.embed_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True)       
        self.output_layer = nn.Linear(2*self.hidden_dim, self.vocab_size)

        
from torch_color_describer import EncoderDecoder

class DeepGRUContextualColorDescriber(ContextualColorDescriber):
    def __init__(self, *args, num_layers=2, **kwargs):
        self.num_layers = num_layers
        super().__init__(*args, **kwargs)

    def build_graph(self):
        encoder = DeepGRUEncoder(
            color_dim=self.color_dim,
            hidden_dim=self.hidden_dim,
            num_layers=self.num_layers)  

        decoder = DeepGRUDecoder(
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            embedding=self.embedding,
            hidden_dim=self.hidden_dim,
            num_layers=self.num_layers)  

        return EncoderDecoder(encoder, decoder)

mod_GRU_deep = DeepGRUContextualColorDescriber(
    dev_glove_vocab,
    embedding=dev_glove_embedding,
    num_layers=3,
    hidden_dim=300,
    early_stopping=True)
_ = mod_GRU_deep.fit(dev_cols_train, dev_seqs_train)

evaluation = mod_GRU_deep.evaluate(dev_cols_test, dev_seqs_test)
print('listener_accuracy:', evaluation['listener_accuracy'])
print('bleu:', evaluation['corpus_bleu'])

Stopping after epoch 31. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 44.33427286148071

listener_accuracy: 0.7474805643535848
bleu: 0.3631851893157865
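
The `2*self.hidden_dim` in the decoder's output layer follows from how PyTorch sizes recurrent outputs. With `batch_first=True`, the per-timestep output has `num_directions * hidden_dim` features, and the final hidden state has `num_layers * num_directions` as its leading dimension. The helper below is a plain-Python sketch of that arithmetic; `rnn_shapes` is an illustrative name, not a PyTorch function.

```python
def rnn_shapes(batch_size, seq_len, hidden_dim, num_layers=1, bidirectional=False):
    # Shapes produced by nn.GRU / nn.LSTM when batch_first=True.
    num_directions = 2 if bidirectional else 1
    return {
        "output": (batch_size, seq_len, num_directions * hidden_dim),
        "h_n": (num_layers * num_directions, batch_size, hidden_dim),
    }


# Configuration of the bidirectional GRU model above (batch and length are arbitrary).
shapes = rnn_shapes(batch_size=32, seq_len=4, hidden_dim=300,
                    num_layers=3, bidirectional=True)
print(shapes)
```

Because the encoder and decoder above share `num_layers` and are both bidirectional, the encoder's final state has the shape the decoder expects as its initial state. Note also that with teacher forcing, the backward direction of a bidirectional decoder reads tokens that come after the one being predicted, which may be related to the drop in BLEU reported above.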


## LSTM based Encoder decoder baseline -  Bidirectional Encoder and multi layer


In [180]:

# Our system has a few main components.
# 1. Preprocessing: 

# 2. Encoder: 
# - LSTM cell
# - multi-layer [1, 2, 3]
# - uni/bidirectional LSTM
# - hidden dim [50, 100, 200, 300]

# 3. Decoder: 
# - LSTM cell
# - multi-layer [1, 4, 6]
# - attention mechanisms
# - hidden dim [50, 100, 200, 300]
# - word sequences are passed as GloVe embeddings and have color contexts appended


class DeepLSTMEncoder(Encoder):
    def __init__(self, *args, num_layers=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_layers = num_layers
        self.rnn = nn.LSTM(
            input_size=self.color_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True)

    def forward(self, color_seqs):
        """
        Parameters
        ----------
        color_seqs : torch.FloatTensor
            The shape is `(m, n, p)` where `m` is the batch_size,
            `n` is the number of colors in each context, and `p` is
            the color dimensionality.

        Returns
        -------
        hidden : tuple of torch.FloatTensor
            The final hidden and cell states of the LSTM, each
            averaged over layers and directions so that its shape
            is `(1, m, hidden_dim)`.
        """
        output, hidden = self.rnn(color_seqs)

        # Average over the (num_layers * num_directions) axis so the
        # states fit a single-layer, unidirectional decoder.
        newhidden = torch.mean(hidden[0], dim=0, keepdim=True)
        newcell = torch.mean(hidden[1], dim=0, keepdim=True)

        return (newhidden, newcell)


class DeepLSTMDecoder(Decoder):
    def __init__(self, color_dim, *args, num_layers=1, **kwargs):
        self.color_dim = color_dim
        self.num_layers = num_layers
        super().__init__(*args, **kwargs)

        # Alter input_size of self.rnn
        self.rnn = nn.LSTM(
            input_size=self.embed_dim + self.color_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True)

    def get_embeddings(self, word_seqs, target_colors=None):
        """
        You can assume that `target_colors` is a tensor of shape
        (m, n), where m is the length of the batch (same as
        `word_seqs.shape[0]`) and n is the dimensionality of the
        color representations the model is using. The goal is
        to attach each color vector i to each of the tokens in
        the ith sequence of (the embedded version of) `word_seqs`.
        """
        num_repeats = word_seqs.shape[1]
        colors_repeated = torch.repeat_interleave(target_colors, num_repeats, dim=0)
        colors_repeated_shaped = torch.reshape(colors_repeated, (target_colors.shape[0], num_repeats, target_colors.shape[1]))
        return torch.cat((self.embedding(word_seqs), colors_repeated_shaped), 2)


# Build an EncoderDecoder with the custom encoder and decoder
class DeepLSTMContextualColorDescriber(ContextualColorDescriber):

    def build_graph(self):

        encoder = DeepLSTMEncoder(
            color_dim=self.color_dim,
            num_layers=3,
            hidden_dim=300)

        # Note: `embedding=self.embedding` is not passed here, so the
        # decoder appears to train its own embedding rather than
        # starting from the GloVe vectors.
        decoder = DeepLSTMDecoder(
            color_dim=self.color_dim,
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            hidden_dim=300)

        # Return a `ColorizedEncoderDecoder` using Encoder and Decoder
        return ColorizedEncoderDecoder(encoder, decoder)


mod_LSTM_deep = DeepLSTMContextualColorDescriber(
    dev_glove_vocab,
    embedding=dev_glove_embedding,
    early_stopping=True)

_ = mod_LSTM_deep.fit(dev_cols_train, dev_seqs_train)
evaluation = mod_LSTM_deep.evaluate(dev_cols_test, dev_seqs_test)
print('listener_accuracy:', evaluation['listener_accuracy'])
print('bleu:', evaluation['corpus_bleu'])
        

Stopping after epoch 66. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 41.66188979148865

listener_accuracy: 0.8139936654189461
bleu: 0.6664774948006528
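
The encoder in this model hands a single `(1, m, hidden_dim)` state to the decoder by averaging over the leading axis of the LSTM's final states. The NumPy sketch below mirrors that pooling on random stand-in data and, for comparison, shows a hypothetical alternative that averages only the top layer's two directions. It relies on PyTorch's documented layout, in which the leading axis of `h_n` can be viewed as `(num_layers, num_directions)`; the alternative is not something the notebook evaluates.

```python
import numpy as np

num_layers, num_directions, batch_size, hidden_dim = 3, 2, 4, 300

# Stand-in for the `h_n` returned by a 3-layer bidirectional LSTM.
rng = np.random.default_rng(0)
h_n = rng.normal(size=(num_layers * num_directions, batch_size, hidden_dim))

# What the encoder above does: average over all layers and directions.
pooled = h_n.mean(axis=0, keepdims=True)

# Hypothetical alternative: average only the top layer's two directions.
top_layer = h_n.reshape(num_layers, num_directions, batch_size, hidden_dim)[-1]
top_pooled = top_layer.mean(axis=0, keepdims=True)

print(h_n.shape, pooled.shape, top_pooled.shape)
```

Both variants produce the shape a single-layer, unidirectional decoder expects. They differ in whether the lower layers contribute to the initial decoder state.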


## LSTM based Encoder decoder - Bidirectional and multi layer

In [255]:

# This system has a few main components.
# 1. Preprocessing: 

# 2. Encoder: 
# - LSTM cell
# - multi-layer [2, 3, 4, 5, 6]
# - bidirectional LSTM
# - hidden dim: [50, 100, 200, 300]

# 3. Decoder: 
# - LSTM cell
# - multi-layer [2, 3, 4, 5, 6]
# - bidirectional LSTM
# - attention mechanisms
# - hidden dim : [50, 100, 200, 300]
# - word sequences are passed as GloVe embeddings and have color contexts appended
class DeepLSTMEncoder(Encoder):
    def __init__(self, *args, num_layers=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_layers = num_layers
        self.rnn = nn.LSTM(
            input_size=self.color_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True)


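    def forward(self, color_seqs):
        # For an LSTM, `hidden` is an (h_n, c_n) tuple. Each tensor has shape
        # (num_layers * 2, batch_size, hidden_dim) because the LSTM is bidirectional.
        output, hidden = self.rnn(color_seqs)
        newhidden = hidden[0]
        newcell = hidden[1]

        return (newhidden, newcell)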

class DeepLSTMDecoder(Decoder):
    def __init__(self, color_dim, *args, num_layers=1, **kwargs):
        self.color_dim = color_dim
        self.num_layers = num_layers
        super().__init__(*args, **kwargs)

        # Widen input_size so that the target color can be appended to each word embedding
        self.rnn = nn.LSTM(
            input_size=self.embed_dim + self.color_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True)
        # A bidirectional LSTM emits 2 * hidden_dim features per step, so the output
        # layer is redefined to match. For a unidirectional decoder, drop the factor of 2.
        self.output_layer = nn.Linear(2*self.hidden_dim, self.vocab_size)


    def get_embeddings(self, word_seqs, target_colors=None):
        # Repeat the target color once per token position, giving shape
        # (batch_size, seq_len, color_dim), then append it to every word embedding.
        num_repeats = word_seqs.shape[1]
        colors_repeated = torch.repeat_interleave(target_colors, num_repeats, dim=0)
        colors_repeated_shaped = torch.reshape(
            colors_repeated,
            (target_colors.shape[0], num_repeats, target_colors.shape[1]))
        return torch.cat((self.embedding(word_seqs), colors_repeated_shaped), 2)
    
    def forward(self, word_seqs, seq_lengths=None, hidden=None, target_colors=None):
        
        embs = self.get_embeddings(word_seqs, target_colors=target_colors)
        #print("decoder enter", len(embs))
        if self.training:
            # Packed sequence for performance:
            embs = torch.nn.utils.rnn.pack_padded_sequence(
                embs,
                batch_first=True,
                lengths=seq_lengths.cpu(),
                enforce_sorted=False)
            # RNN forward:
            output, hidden = self.rnn(embs, hidden)
            # Unpack:
            output, seq_lengths = torch.nn.utils.rnn.pad_packed_sequence(
                output, batch_first=True)

            # Output dense layer to get logits:
            output = self.output_layer(output)
            # Drop the final element:
            output = output[:, : -1, :]
            # Reshape for the sake of the loss function:
            output = output.transpose(1, 2)
            return output, hidden
        else:
            output, hidden = self.rnn(embs, hidden)
            output = self.output_layer(output)
            return output, hidden


class ColorizedLSTMEncoderDecoder(EncoderDecoder):

    def forward(self,
            color_seqs,
            word_seqs,
            seq_lengths=None,
            hidden=None,
            targets=None):
        if hidden is None:
            hidden = self.encoder(color_seqs)

        num_contexts = color_seqs.shape[1]
        target_colors = color_seqs[:,num_contexts-1,:]
        output, hidden = self.decoder(word_seqs, seq_lengths=seq_lengths, hidden=hidden, target_colors=target_colors)


        # The decoder returns `output, hidden` pairs; the following
        # handles the two return situations that the code needs to
        # consider: training and prediction.
        if self.training:
            return output
        else:
            return output, hidden

# build an EncoderDecoder with custom encoder and decoders
class DeepLSTMContextualColorDescriber(ContextualColorDescriber):

    def build_graph(self):

        # Bidirectional, multi-layer encoder
        encoder = DeepLSTMEncoder(
            color_dim=self.color_dim,
            num_layers=6,
            hidden_dim=300)

        decoder = DeepLSTMDecoder(
            color_dim=self.color_dim,
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            num_layers=6,
            hidden_dim=300)

        # Return a `ColorizedLSTMEncoderDecoder` using the encoder and decoder
        return ColorizedLSTMEncoderDecoder(encoder, decoder)

mod_LSTM_deep_bidirectional = DeepLSTMContextualColorDescriber(
    dev_glove_vocab,
    embedding=dev_glove_embedding,
    early_stopping=True)

_ = mod_LSTM_deep_bidirectional.fit(dev_cols_train, dev_seqs_train)
evaluation = mod_LSTM_deep_bidirectional.evaluate(dev_cols_test, dev_seqs_test)
print('listener_accuracy:', evaluation['listener_accuracy'])
print('bleu:', evaluation['corpus_bleu'])

Stopping after epoch 50. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 43.846012592315674

listener_accuracy: 0.6472790095018716
bleu: 0.38923600929408014
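
The factor of 2 in the decoder's output layer follows from how PyTorch sizes the tensors of a bidirectional LSTM. The sketch below is a plain-Python shape calculator, not part of the system: `lstm_shapes` is a hypothetical helper that only restates the shape rules from the `nn.LSTM` documentation, assuming `batch_first=True` and no projection. The batch size of 32 is arbitrary.

```python
def lstm_shapes(batch_size, seq_len, hidden_dim, num_layers, bidirectional):
    """Return the tensor shapes nn.LSTM produces (batch_first=True, proj_size=0)."""
    num_directions = 2 if bidirectional else 1
    return {
        # Per-step features: forward and backward states are concatenated.
        "output": (batch_size, seq_len, num_directions * hidden_dim),
        # Final hidden and cell states: one slice per layer and per direction.
        "h_n": (num_layers * num_directions, batch_size, hidden_dim),
        "c_n": (num_layers * num_directions, batch_size, hidden_dim),
    }


# Settings from build_graph above: 6 layers, hidden_dim=300, bidirectional.
# seq_len=3 stands for the three colors in a context; batch_size is made up.
shapes = lstm_shapes(batch_size=32, seq_len=3, hidden_dim=300,
                     num_layers=6, bidirectional=True)
print(shapes["output"])  # (32, 3, 600)
print(shapes["h_n"])     # (12, 32, 300)
```

Because the encoder and decoder here both use 6 bidirectional layers, the encoder's final `(h_n, c_n)` already has the leading dimension of 12 that the decoder expects for its initial state.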


## BERT-based encoder-decoder seq2seq model

In [25]:
# Try an encoder-decoder Transformer, warm-starting both sides from BERT
import torch
from transformers import EncoderDecoderModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
# Settings the combined model needs for training and generation
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size
# Smoke test on a toy input/target pair
input_ids = tokenizer("This is a really long text", return_tensors="pt").input_ids
labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
# Pass the tokenized target as labels so the loss is computed against it
outputs = model(input_ids=input_ids, labels=labels)
loss, logits = outputs.loss, outputs.logits
# Round-trip through disk, then generate from the reloaded model
model.save_pretrained("bert2bert")
model = EncoderDecoderModel.from_pretrained("bert2bert")
generated = model.generate(input_ids)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.bias', 'cls.seq_relations



In [26]:
!pip install git-python==1.0.3
!pip install rouge_score
!pip install sacrebleu



In [27]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
batch_size=1

In [28]:
import datasets
rouge = datasets.load_metric("rouge")
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
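
To make the metric above concrete, here is a minimal sketch of what ROUGE-2 measures: overlap between the bigrams of a prediction and those of a reference. It is a simplification, not the `rouge_score` implementation; `rouge2` is a hypothetical helper that tokenizes on whitespace only and scores a single pair rather than aggregating over a corpus.

```python
from collections import Counter


def bigrams(tokens):
    """Count the adjacent token pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, f-measure) of bigram overlap for one pair."""
    pred = bigrams(prediction.lower().split())
    ref = bigrams(reference.lower().split())
    # Clipped overlap: each bigram counts at most as often as it occurs in both.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fmeasure


print(rouge2("medium pink", "Medium pink"))  # (1.0, 1.0, 1.0)
print(rouge2("light pink", "Medium pink"))   # (0.0, 0.0, 0.0)
```

A two-token reference contributes a single bigram, so under this simplified scoring a two-word description either matches the reference bigram exactly or scores zero.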

In [78]:
from colors import ColorsCorpusReader
import os
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv")
dev_corpus_t = ColorsCorpusReader(
    COLORS_SRC_FILENAME,
    word_count=2,
    normalize_colors=False)
dev_examples_t = list(dev_corpus_t.read())
dev_rawcols_t, dev_texts_t = zip(*[[ex.colors, ex.contents] for ex in dev_examples_t])

In [79]:
# Prepare the dataset as input for BERT

def map_text_color_to_label():
    # Flatten each example into three rows, one per color in its context.
    # The last color (index 2) is the target and gets label 1;
    # the two distractors get label 0.
    text_t, color_seq_t, label_t, color_h_t, color_l_t, color_s_t = [], [], [], [], [], []
    for index in range(len(dev_texts_t)):
        for seq in range(3):
            text_t.append(dev_texts_t[index])
            color_seq_t.append(dev_rawcols_t[index][seq])
            # Store each color channel separately, truncated to an integer
            color_h_t.append(int(dev_rawcols_t[index][seq][0]))
            color_l_t.append(int(dev_rawcols_t[index][seq][1]))
            color_s_t.append(int(dev_rawcols_t[index][seq][2]))
            if seq == 2:
                label_t.append(1)
            else:
                label_t.append(0)

    return text_t, color_seq_t, label_t, color_h_t, color_l_t, color_s_t

data_text_t, data_color_t, data_label_t, color_h_t, color_l_t, color_s_t = map_text_color_to_label()

In [80]:
import pandas as pd
df = pd.DataFrame(
    list(zip(data_text_t, data_color_t, color_h_t, color_l_t, color_s_t, data_label_t)),
    columns=['text', 'color_seq', 'color_h', 'color_l', 'color_s', 'label'])
print(df)

                   text            color_seq  color_h  color_l  color_s  label
0           Medium pink  [302.0, 50.0, 86.0]      302       50       86      0
1           Medium pink  [291.0, 50.0, 59.0]      291       50       59      0
2           Medium pink  [301.0, 50.0, 57.0]      301       50       57      1
3           Mint green.   [57.0, 50.0, 64.0]       57       50       64      0
4           Mint green.  [239.0, 50.0, 87.0]      239       50       87      0
...                 ...                  ...      ...      ...      ...    ...
41665      bright green   [284.0, 50.0, 7.0]      284       50        7      0
41666      bright green  [126.0, 50.0, 47.0]      126       50       47      1
41667  green..no yellow   [74.0, 50.0, 90.0]       74       50       90      0
41668  green..no yellow   [159.0, 50.0, 1.0]      159       50        1      0
41669  green..no yellow   [80.0, 50.0, 78.0]       80       50       78      1

[41670 rows x 6 columns]
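
The flattening that produces this table can be seen on a single example. The sketch below is a simplified restatement of the loop above, not a replacement for it; `flatten_example` is a hypothetical helper, and the input is the first example from the printed table.

```python
def flatten_example(text, colors):
    """Return one (text, h, l, s, label) row per color; the last color is the target."""
    target_index = len(colors) - 1
    rows = []
    for i, (h, l, s) in enumerate(colors):
        label = 1 if i == target_index else 0
        rows.append((text, int(h), int(l), int(s), label))
    return rows


rows = flatten_example(
    "Medium pink",
    [[302.0, 50.0, 86.0], [291.0, 50.0, 59.0], [301.0, 50.0, 57.0]])
for row in rows:
    print(row)
# ('Medium pink', 302, 50, 86, 0)
# ('Medium pink', 291, 50, 59, 0)
# ('Medium pink', 301, 50, 57, 1)
```

Each example yields three rows, so the 41670 rows in the table correspond to 13890 two-word examples.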


In [62]:
from pathlib import Path  
filepath = Path('./out.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
df.to_csv(filepath)  

In [76]:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenized_text_t = []
for index in range(len(data_text_t)):
    tokenized_text_t.append(tokenizer(data_text_t[index]))


loading file https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt from cache at /home/vivek/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json from cache at /home/vivek/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4
loading file https://huggingface.co/bert-base-uncased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-uncased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-uncased/resolve/main/tokenizer_config.json from cache at /home/vivek/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc

In [91]:
df2 = (df['text'].copy())#.to_frame()

In [127]:
# `train_split` is not defined in any cell shown here. A 90/10 split is assumed,
# which matches the 37503 training examples reported in the training log.
train_split = len(df2) * 9 // 10
train_dataset = df2[:train_split]
test_dataset = df2[train_split:]
filepath = Path('./out_train.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
train_dataset.to_csv(filepath)
filepath = Path('./out_test.csv')
test_dataset.to_csv(filepath)

In [29]:
train_scc_data = datasets.load_dataset('csv', data_files='out_train.csv')
test_scc_data = datasets.load_dataset('csv', data_files='out_test.csv')

Using custom data configuration default-11beba966d7d7eb2
Reusing dataset csv (/home/vivek/.cache/huggingface/datasets/csv/default-11beba966d7d7eb2/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/1 [00:00<?, ?it/s]

Using custom data configuration default-e3d04b6a4c8a71b6
Reusing dataset csv (/home/vivek/.cache/huggingface/datasets/csv/default-e3d04b6a4c8a71b6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/1 [00:00<?, ?it/s]

In [31]:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")


In [32]:
encoder_max_length=512
decoder_max_length=128

def process_data_to_model_inputs(batch):
  # tokenize the inputs and labels
  inputs = tokenizer(batch["text"], padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["text"], padding="max_length", truncation=True, max_length=decoder_max_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  # Because the labels are shifted automatically inside the model, they correspond exactly to `decoder_input_ids`.
  # PAD tokens are replaced with -100 so that the loss ignores them.
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch
#train_data = train_scc_data[:32]
# batch_size = 16
batch_size=1

train_data = train_scc_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    #remove_columns=["color_seq", "color_h", "color_l", "color_s", "label"]
    remove_columns=["index"],
)
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

val_data = test_scc_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size,
    remove_columns=["index"],
    #remove_columns=["color_seq", "color_h", "color_l", "color_s", "label"]
)
val_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

Loading cached processed dataset at /home/vivek/.cache/huggingface/datasets/csv/default-11beba966d7d7eb2/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-d324060cde868422.arrow


  0%|          | 0/4167 [00:00<?, ?ba/s]
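
The label-masking step in `process_data_to_model_inputs` is easiest to see on toy data. This is a minimal sketch under stated assumptions: `mask_padding` is a hypothetical helper, and the word ids 5396, 5061 and 2665 are made up for illustration. The ids 101, 102 and 0 are `[CLS]`, `[SEP]` and `[PAD]` in `bert-base-uncased`, and -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id):
    """Replace every padding id with IGNORE_INDEX, leaving real tokens untouched."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Two padded label sequences (made-up word ids between [CLS]=101 and [SEP]=102).
batch = [[101, 5396, 5061, 102, 0, 0],
         [101, 2665, 102, 0, 0, 0]]
masked = mask_padding(batch, pad_token_id=0)
print(masked)
# [[101, 5396, 5061, 102, -100, -100], [101, 2665, 102, -100, -100, -100]]
```

Without this step, padding a short description to `decoder_max_length=128` would leave most label positions as `[PAD]`, and the loss would be dominated by predicting padding.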

In [35]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    output_dir="./",
    logging_steps=5,
    save_steps=10000,
    eval_steps=10000,
    num_train_epochs=2
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [36]:
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data['train'],
    eval_dataset=val_data['train'],
)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: text. If text are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 37503
  Num Epochs = 2
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 75006


Step,Training Loss,Validation Loss


KeyboardInterrupt: 
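
The step count in the log above follows directly from the dataset size, batch size and number of epochs. The sketch below is simple arithmetic, not the `Trainer`'s own code: `total_optimization_steps` is a hypothetical helper that assumes a single device, no gradient accumulation, and a final partial batch that is kept, as in the logged run.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer updates for a single device without gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs


# Values from the training log: 37503 examples, batch size 1, 2 epochs.
print(total_optimization_steps(37503, batch_size=1, num_epochs=2))   # 75006
# The batch size of 16 left commented out in the preprocessing cell would give:
print(total_optimization_steps(37503, batch_size=16, num_epochs=2))  # 4688

# The 41670 rows of the dataframe split into 37503 training rows and 4167 others,
# which is consistent with a 90/10 train/test split.
num_rows = 41670
num_train = num_rows * 9 // 10
print(num_train, num_rows - num_train)  # 37503 4167
```

With a batch size of 1, every example is its own optimizer update, which is why two epochs require 75006 steps; a larger batch size would shorten the run considerably if memory allows.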