# Goal

In this notebook, we create word embeddings from raw text extracted from IMDB movie reviews.

## Creating In-Memory Text and Label Data

In [1]:
import os


imdb_dir = './imdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

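# Read every review file; 'neg' reviews get label 0 and 'pos' reviews get label 1.
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)

    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # The context manager closes the file even if reading fails.
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())

            # Append the label only when a text was read, so both lists stay aligned.
            labels.append(0 if label_type == 'neg' else 1)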

In [2]:
texts[0]



In [3]:
labels[0]

0

In [4]:
len(texts)

25000

## Tokenizing the Text Data

In [12]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import numpy as np


maxlen = 100  # pad or truncate every review to 100 tokens
training_samples = 200  # number of reviews to train on
validation_samples = 10000  # number of reviews to validate on
max_words = 10000  # keep only the 10,000 most frequent words


tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens.')

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)

data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
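
To make the effect of `maxlen` concrete, the following is a minimal sketch of what `pad_sequences` does with its default settings (zeros added at the front, extra tokens dropped from the front). The helper `pad_or_truncate` and the two toy reviews are hypothetical and only mimic that behavior for a single sequence; they are not part of Keras.

```python
import numpy as np


def pad_or_truncate(sequence, maxlen):
    """Mimic the default pad_sequences behavior for one sequence (maxlen > 0)."""
    trimmed = sequence[-maxlen:]             # keep only the last maxlen tokens
    padding = [0] * (maxlen - len(trimmed))  # zeros are placed in front
    return np.array(padding + trimmed, dtype='int32')


# Hypothetical token-index sequences, one shorter and one longer than maxlen.
short_review = [5, 8, 2]
long_review = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_or_truncate(short_review, maxlen=5)
padded_long = pad_or_truncate(long_review, maxlen=5)

print(padded_short)  # [0 0 5 8 2]
print(padded_long)   # [3 4 5 6 7]
```

This is why the data tensor has exactly `maxlen` columns regardless of the original review lengths.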


## GloVe Word Embeddings

The GloVe word embeddings can be downloaded from https://nlp.stanford.edu/projects/glove. The file used here, glove.6B.100d.txt, contains precomputed 100-dimensional embedding vectors for 400,000 words (or non-word tokens), trained on the 2014 English Wikipedia.

### Parsing GloVe Word Embeddings

In [14]:
glove_dir = './glove6B'

# Map each token to its embedding vector.
embeddings_index = {}

with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        # Each line holds a token followed by its space-separated vector components.
        values = line.split()
        word = values[0]
        vect = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = vect

print(f'Found {len(embeddings_index)} word vectors.')

Found 400000 word vectors.
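
To illustrate the file layout that the parsing loop relies on, here is a small sketch that runs the same logic on an in-memory stand-in. The two lines in `fake_glove_file` are invented and shortened to 3 dimensions for readability; the real file has 100 values per token.

```python
import io

import numpy as np

# Two made-up lines in the same layout as the GloVe text file:
# a token, then its space-separated vector components.
fake_glove_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "movie 0.4 0.5 -0.6\n"
)

toy_index = {}
for line in fake_glove_file:
    values = line.split()
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(toy_index['movie'].shape)  # (3,)
```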


### Creating an Embedding Matrix

The embedding matrix must have shape (vocabulary size, embedding vector size), which is (10000, 100) with the settings used here. Row $i$ contains the vector for the word with index $i$ in the reference word index (built during tokenization). Note that index $0$ does not stand for any word or token; it is a placeholder, and it is also the value that pad_sequences uses for padding.
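
One way to build such a matrix is sketched below. The helper name `build_embedding_matrix` and the toy dictionaries are assumptions made for illustration; in the notebook, the real `word_index`, `embeddings_index`, and `max_words` would be passed in, with an embedding dimension of 100. Words that have no GloVe vector are left as all-zero rows.

```python
import numpy as np


def build_embedding_matrix(word_index, embeddings_index, max_words, embedding_dim):
    """Return a (max_words, embedding_dim) matrix of pretrained vectors."""
    matrix = np.zeros((max_words, embedding_dim), dtype='float32')
    for word, i in word_index.items():
        if i >= max_words:
            continue  # index falls outside the vocabulary size we keep
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector  # words missing from the embeddings stay all zeros
    return matrix


# Toy stand-ins for the real word_index and embeddings_index (made-up values).
toy_word_index = {'the': 1, 'movie': 2, 'zzzunknown': 3, 'rare': 4}
toy_embeddings = {
    'the': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'movie': np.array([0.4, 0.5, 0.6], dtype='float32'),
    'rare': np.array([0.7, 0.8, 0.9], dtype='float32'),
}

toy_matrix = build_embedding_matrix(
    toy_word_index, toy_embeddings, max_words=4, embedding_dim=3
)
print(toy_matrix.shape)  # (4, 3)
```

In the toy run, row 0 stays zero because it is the placeholder index, row 3 stays zero because 'zzzunknown' has no vector, and 'rare' is skipped because its index is not below `max_words`.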