## let's build some vocab ([link](https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/03-advanced/image_captioning/build_vocab.py#L4))

In [1]:
import nltk
import pickle
import argparse
from collections import Counter
from pycocotools.coco import COCO

### main idea behind `class Vocabulary` (use case)

- `self.idx` initialized at 0.
- loop over words. call `add_word` for each `word` ...
    - assign `self.idx2word[self.idx] = word`
    - and vice versa: `self.word2idx[word] = self.idx`
    - `self.idx += 1`
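- oh, then `__call__` is used **LATER** as a lookup table: it maps a word to its index, falling back to the index of `<unk>` for words that aren't in the vocab
- `__len__` is the obvious: the number of words in the vocab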

In [2]:
class Vocabulary(object):
    """Simple vocabulary wrapper."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __call__(self, word):
        if word not in self.word2idx:
            return self.word2idx['<unk>']
        return self.word2idx[word]

    def __len__(self):
        return len(self.word2idx)
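
quick sanity check of the ideas above, on a toy word list. the class is repeated only so the snippet runs on its own; `toy_vocab` and the words are made up for illustration. note that `__call__` raises a `KeyError` if `<unk>` was never added, so the toy list includes it.

```python
class Vocabulary(object):
    """Same wrapper as above, repeated so this snippet is self-contained."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __call__(self, word):
        # unknown words fall back to the '<unk>' index
        if word not in self.word2idx:
            return self.word2idx['<unk>']
        return self.word2idx[word]

    def __len__(self):
        return len(self.word2idx)


toy_vocab = Vocabulary()
for w in ['<unk>', 'a', 'dinner', 'plate', 'a']:  # the duplicate 'a' is ignored
    toy_vocab.add_word(w)

known = toy_vocab('plate')          # index assigned when 'plate' was added
unknown = toy_vocab('garnishment')  # never added, so it maps to '<unk>'
size = len(toy_vocab)
print(known, unknown, size)
```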

### understanding the code that builds the vocab

In [3]:
# named ann_file rather than json, so it can't be confused with the stdlib json module
ann_file = 'annotations/captions_train2014.json'
coco = COCO(ann_file)

loading annotations into memory...
Done (t=0.53s)
creating index...
index created!


`coco.anns` is a dict of dicts, keyed by annotation id.

```
coco.anns = {..., 829717: {'image_id': 133071, 'id': 829717, 'caption': 'A dinner plate has a lemon wedge garnishment.'}, 829719: {'image_id': 238117, 'id': 829719, 'caption': 'A blue camouflage airplane is on a runway.'}}
```

thus, 

```
coco.anns.keys() = dict_keys([..., 829717, 829719])
```

In [4]:
coco.anns[829717]

{'caption': 'A dinner plate has a lemon wedge garnishment.',
 'id': 829717,
 'image_id': 133071}

In [5]:
coco.anns[829717]['caption']

'A dinner plate has a lemon wedge garnishment.'

In [6]:
tokens = nltk.tokenize.word_tokenize(coco.anns[829717]['caption'].lower())
print(tokens)

['a', 'dinner', 'plate', 'has', 'a', 'lemon', 'wedge', 'garnishment', '.']


### NOTE TO SELF: the period ... do we want it?

In [7]:
counter_test = Counter()
counter_test.update(tokens)
print(counter_test)

Counter({'a': 2, 'plate': 1, '.': 1, 'wedge': 1, 'lemon': 1, 'garnishment': 1, 'dinner': 1, 'has': 1})
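
re: the period question. the `'.'` shows up in the counter like any other token. if we decided we didn't want it, one option is to drop tokens made up only of punctuation. a minimal sketch, with the token list hard-coded from the output above; `drop_punct` is a hypothetical helper, not something from the original script (the cells below keep the period).

```python
import string

def drop_punct(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]

tokens_demo = ['a', 'dinner', 'plate', 'has', 'a', 'lemon', 'wedge', 'garnishment', '.']
filtered = drop_punct(tokens_demo)
print(filtered)
```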


### ok we run now ...

this code populates a `Counter` object, `counter`, where
- `counter[word]` returns the number of times that `word` appeared in the set of captions
- `counter.items()` yields `(word, count)` pairs
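
a small sketch of those two behaviours on a toy token list (the tokens and the name `toy_counter` are made up for illustration):

```python
from collections import Counter

toy_counter = Counter()
toy_counter.update(['a', 'man', 'on', 'a', 'bike'])
toy_counter.update(['a', 'bike', '.'])  # update() adds to the existing counts

count_a = toy_counter['a']
count_missing = toy_counter['zebra']   # missing keys count as 0, no KeyError
pairs = sorted(toy_counter.items())    # (word, count) pairs, sorted for display
print(count_a, count_missing, pairs)
```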

In [10]:
counter = Counter()
ids = coco.anns.keys()
for i, ann_id in enumerate(ids):  # ann_id avoids shadowing the builtin id()
    caption = str(coco.anns[ann_id]['caption'])
    tokens = nltk.tokenize.word_tokenize(caption.lower())
    counter.update(tokens)

    if i % 100000 == 0:
        print("[%d/%d] Tokenized the captions." %(i, len(ids)))

[0/414113] Tokenized the captions.
[100000/414113] Tokenized the captions.
[200000/414113] Tokenized the captions.
[300000/414113] Tokenized the captions.
[400000/414113] Tokenized the captions.


### NOTE TO SELF: _interesting_ that not every caption ends with a period ._.

In [13]:
print(counter['.'])
print(len(ids))

310919
414113
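
dividing those two numbers gives roughly 0.75, but `counter['.']` counts period tokens, not captions, so it only approximates how many captions end with a period (one caption can contain several). a sketch of a more direct check, run on made-up captions rather than on `coco.anns`:

```python
toy_captions = [
    'A man rides a bike.',
    'Two dogs play in the snow',
    'A plate of food. It has a lemon wedge.',
    'A cat sits on a couch. ',
]

# number of period characters across all captions (roughly what counter['.'] tracks)
period_tokens = sum(c.count('.') for c in toy_captions)
# number of captions whose last non-space character is a period
ends_with_period = sum(c.rstrip().endswith('.') for c in toy_captions)
print(period_tokens, ends_with_period, len(toy_captions))
```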


In [15]:
# you know what they say about curiosity
print(counter.most_common(10))

[('a', 684576), ('.', 310919), ('on', 150675), ('of', 142759), ('the', 137981), ('in', 128909), ('with', 107703), ('and', 98754), ('is', 68686), ('man', 51530)]


`words` contains the list of "words" (tokens) that we'll add to the vocab. if `threshold` > 1, this list is shorter than the number of distinct tokens in the training set, because rare tokens are dropped.

In [22]:
# minimum word count threshold
threshold = 4
# If the word frequency is less than 'threshold', then the word is discarded.
words = [word for word, cnt in counter.items() if cnt >= threshold]
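
to see the cutoff in action, here is the same filter on a toy counter (counts are made up; `toy_threshold`, `kept` and `dropped` are illustration-only names). a word whose count equals the threshold is kept.

```python
from collections import Counter

toy_counter = Counter({'a': 10, 'man': 4, 'bike': 3, 'garnishment': 1})
toy_threshold = 4

# keep words seen at least `toy_threshold` times
kept = [word for word, cnt in toy_counter.items() if cnt >= toy_threshold]
# everything else never enters the vocab, so it is later looked up as '<unk>'
dropped = [word for word, cnt in toy_counter.items() if cnt < toy_threshold]
print(kept, dropped)
```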

get started with making that vocabulary ... by adding the special tokens first, so they take indices 0 to 3

In [30]:
vocab = Vocabulary()
vocab.add_word('<pad>')
vocab.add_word('<start>')
vocab.add_word('<end>')
vocab.add_word('<unk>')
print(vocab.idx2word)
print(len(vocab))

{0: '<pad>', 1: '<start>', 2: '<end>', 3: '<unk>'}
4


In [31]:
# Adds the words to the vocabulary.
for word in words:
    vocab.add_word(word)

In [34]:
print(len(words))
print(len(vocab))

9952
9956


nice, behaves as expected (9952 words + 4 special tokens = 9956) ... now to define where to save it ...

In [33]:
vocab_path = 'data/vocab.pkl'

In [35]:
with open(vocab_path, 'wb') as f:
    pickle.dump(vocab, f)
print("Total vocabulary size: %d" %len(vocab))
print("Saved the vocabulary wrapper to '%s'" %vocab_path)

Total vocabulary size: 9956
Saved the vocabulary wrapper to 'data/vocab.pkl'
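
for later: reading the file back is the mirror image, `pickle.load` on a file opened with `'rb'`. pickle stores instances by reference to their class, so `Vocabulary` has to be defined or importable wherever the file is loaded. a minimal round-trip sketch, using a plain dict and a temp file as stand-ins so it doesn't touch `data/vocab.pkl`:

```python
import os
import pickle
import tempfile

# stand-in object; in the notebook this would be the `vocab` instance
toy_word2idx = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3}

# hypothetical temp path, used only for this demo
tmp_path = os.path.join(tempfile.mkdtemp(), 'vocab_demo.pkl')
with open(tmp_path, 'wb') as f:
    pickle.dump(toy_word2idx, f)

with open(tmp_path, 'rb') as f:  # 'rb' matches the 'wb' used when writing
    restored = pickle.load(f)
print(restored == toy_word2idx)
```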


# sauce, and we're done here