### 03. Dataset Information
Before moving on, go to `pycocoImageEmbedding`, find out how many corrupt images are present, and remove their captions from the above pickle.
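
A minimal sketch of that clean-up step, assuming the descriptions pickle is a dict keyed by image ID and that the corrupt image IDs have already been collected in a set. The names `drop_corrupt_captions`, `descriptions`, and `corrupt_ids` are hypothetical and do not come from `pycocoImageEmbedding`.

```python
def drop_corrupt_captions(descriptions, corrupt_ids):
    """Return a copy of `descriptions` without entries for corrupt images."""
    return {
        image_id: captions
        for image_id, captions in descriptions.items()
        if image_id not in corrupt_ids
    }


# Toy data standing in for the real pickle contents (assumed structure).
descriptions = {
    "img_001": ["startseq a dog on a sofa endseq"],
    "img_002": ["startseq a red bus endseq"],
    "img_003": ["startseq two cats endseq"],
}
corrupt_ids = {"img_002"}

clean_descriptions = drop_corrupt_captions(descriptions, corrupt_ids)
print(len(clean_descriptions))
```

The filtered dict could then be written out with `pickle.dump` under the name `corruption_free_coco_descriptions.pkl`.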

Here, we use `corruption_free_coco_descriptions.pkl` to generate statistics about our dataset.

We compute these values:
1. `VOCAB_SIZE`
2. `UNIQUE WORDS` (this is used in place of `VOCAB_SIZE` in `final_model.ipynb`)
3. `Total Captions`
4. `MAX_LENGTH` (maximum caption length, in words)

We also load the GloVe vectors and generate **`embedding_matrix.pkl`**.

In [None]:
import time
import pickle

from nltk import FreqDist
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np

In [4]:
def load_descriptions(file_path):
    with open(file_path, "rb") as f:
        return pickle.load(f)

In [5]:
new_desc = load_descriptions("./corruption_free_coco_descriptions.pkl")

In [6]:
print(type(new_desc))
print(len(new_desc))

<class 'dict'>
82782


In [7]:
for k, v in new_desc.items():
    print(v)
    break

['startseq a restaurant has modern wooden tables and chairs  endseq', 'startseq a long restaurant table with rattan rounded back chairs  endseq', 'startseq a long table with a plant on top of it surrounded with wooden chairs  endseq', 'startseq a long table with a flower arrangement in the middle for meetings endseq', 'startseq a table is adorned with wooden chairs with blue accents  endseq']


In [8]:
corpus = ""
start = time.time()
for ec in new_desc.values():
    for el in ec:
        corpus += " "+el
print("Generated Corpus in {:.2f}s".format(time.time() - start))

total_words = corpus.split()
vocabulary = set(total_words)
print("The size of vocabulary is {}".format(len(vocabulary)))

Generated Corpus in 0.30s
The size of vocabulary is 23124


In [9]:
# creating frequency distribution of words
freq_dist = FreqDist(total_words)
freq_dist.most_common(5)

[('a', 684603),
 ('startseq', 414108),
 ('endseq', 414108),
 ('on', 150689),
 ('of', 142762)]

In [10]:
# remove words that occur fewer than 10 times from the vocabulary
for ew in list(vocabulary):
    if freq_dist[ew] < 10:
        vocabulary.remove(ew)

In [11]:
VOCAB_SIZE = len(vocabulary)+1
print("Total unique words after removing less frequent word from our corpus = {}".format(VOCAB_SIZE))

Total unique words after removing less frequent word from our corpus = 6321
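
The thresholding above can be reproduced on a toy corpus. This sketch uses `collections.Counter` in place of `nltk.FreqDist` (a swap made only to keep the example dependency-free) and a hypothetical `MIN_COUNT` of 2 instead of 10. The `+ 1` is assumed to account for index 0, which the Keras `Tokenizer` never assigns to a word, so it is free to serve as padding.

```python
from collections import Counter

captions = [
    "startseq a dog runs endseq",
    "startseq a dog sleeps endseq",
    "startseq a cat sleeps endseq",
]
MIN_COUNT = 2  # hypothetical threshold; the notebook uses 10

words = " ".join(captions).split()
counts = Counter(words)

# Keep only words that occur at least MIN_COUNT times.
kept = {w for w, c in counts.items() if c >= MIN_COUNT}

# One extra slot for index 0, which no word is mapped to.
vocab_size = len(kept) + 1
print(sorted(kept), vocab_size)
```

Here `runs` and `cat` appear only once and are dropped, which leaves five words and a `vocab_size` of 6.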


In [12]:
caption_list = []
for el in new_desc.values():
    for ec in el:
        caption_list.append(ec)
print("The total caption present = {}".format(len(caption_list)))

The total caption present = 414108


In [13]:
token = Tokenizer(num_words=VOCAB_SIZE)
token.fit_on_texts(caption_list)

In [14]:
# indices are assigned by frequency, i.e. the most frequent word gets index 1
ix_to_word = token.index_word

In [15]:
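# num_words only limits texts_to_sequences; index_word still holds every word,
# so drop the indices that fall outside the vocabulary
for k in list(ix_to_word):
    if k >= VOCAB_SIZE:
        ix_to_word.pop(k, None)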

In [16]:
word_to_ix = dict()
for k,v in ix_to_word.items():
    word_to_ix[v] = k

In [17]:
print(len(word_to_ix))
print(len(ix_to_word))

6320
6320


In [18]:
# find the length (in words) of the longest caption
MAX_LENGTH = max(len(ec.split()) for ec in caption_list)

print("Maximum caption has length of {}".format(MAX_LENGTH))

Maximum caption has length of 52


### Generating Glove Vectors file
We use the 300-dimensional GloVe file because the article we're following uses an embedding size of 300.

Download the pre-trained GloVe vectors from [this link](https://nlp.stanford.edu/projects/glove/).

We'll load the text file (`glove.6B.300d.txt`) and save it as a pickle (`glove_vectors.pkl`) to save us time.
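
Each line of the GloVe text file holds a token followed by its space-separated vector components. A minimal sketch of parsing that layout, using an in-memory sample with invented 3-dimensional values and plain Python lists instead of NumPy arrays:

```python
import io

# Two made-up 3-dimensional entries in the same "word v1 v2 ..." layout
# (the real file has 300 values per line).
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -1.5\n")

embeddings_index = {}
for line in sample:
    values = line.split()
    word, coefs = values[0], [float(v) for v in values[1:]]
    embeddings_index[word] = coefs

print(embeddings_index["cat"])
```

Parsing the full text file this way takes a while, which is why it is done once and the result is cached as a pickle.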

In [19]:
def save_glove_pickle(file_path):
    embeddings_index = {}
    start = time.time()
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print("Created embeddings_index in {:.2f}s".format(time.time() - start))
    
    print("Saving embeddings_index as glove_vectors.pkl")
    start = time.time()
    with open("glove_vectors.pkl", "wb") as f:
        pickle.dump(embeddings_index, f)
    print("Saved glove_vectors in {:.2f}s".format(time.time() - start))

In [20]:
save_glove_pickle("glove/glove.6B.300d.txt")

Created embeddings_index in 41.93s
Saving embeddings_index as glove_vectors.pkl
Saved glove_vectors in 45.51s


In [21]:
def load_glove_vectors(file_path):
    start = time.time()
    with open(file_path, "rb") as f:
        glove = pickle.load(f)
        glove_words = set(glove.keys())
    print("Loaded {} in {:.2f}s".format(file_path, time.time() - start))
    return glove, glove_words

In [22]:
glove, glove_words = load_glove_vectors("glove_vectors.pkl")

Loaded glove_vectors.pkl in 2.45s


In [23]:
# len(glove_words) # 400k
# type(glove_words) # set

In [24]:
EMBEDDING_SIZE = 300

# Get the 300-dim dense vector for each word in the vocabulary.
# Rows start as zeros, so words not found in GloVe stay all-zero.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_SIZE))

start = time.time()
for word, i in word_to_ix.items():
    if word in glove_words:
        embedding_matrix[i] = glove[word]
print("Generated embedding_matrix in {:.2f}".format(time.time() - start))

Generated embedding_matrix in 0.03
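
Words that are missing from GloVe keep an all-zero row, so it may be worth checking how many vocabulary words fall into that group. The sketch below uses a hypothetical helper, `glove_coverage`, on toy stand-ins for `word_to_ix` and `glove_words`; artificial markers such as `startseq` and `endseq` are likely candidates for missing entries, though that is an assumption and not something verified here.

```python
def glove_coverage(word_to_ix, glove_words):
    """Return (fraction of vocabulary found in GloVe, sorted missing words)."""
    missing = sorted(w for w in word_to_ix if w not in glove_words)
    covered = 1 - len(missing) / len(word_to_ix)
    return covered, missing


# Toy stand-ins for the notebook's `word_to_ix` and `glove_words`.
toy_word_to_ix = {"a": 1, "startseq": 2, "endseq": 3, "dog": 4}
toy_glove_words = {"a", "dog", "the"}

covered, missing = glove_coverage(toy_word_to_ix, toy_glove_words)
print(covered, missing)
```

Calling `glove_coverage(word_to_ix, glove_words)` on the real objects would give the same two values for the full vocabulary.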


In [25]:
# save the embedding matrix to file
with open("embedding_matrix.pkl","wb") as f:
    pickle.dump(embedding_matrix,f)

In [27]:
# save the tokenizer to file
with open("token.pkl","wb") as f:
    pickle.dump(token,f)