# Word Embedding
Word embeddings provide a dense representation of words and their relative meanings.

They improve on the sparse representations used in simpler bag-of-words models.
Those representations are sparse because vocabularies are vast, so a given word or document is represented by a large vector composed mostly of zeros.

In an embedding, by contrast, each word is represented by a dense vector: the projection of the word into a continuous vector space.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

The position of a word in the learned vector space is referred to as its embedding.
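
To make the contrast between sparse and dense representations concrete, here is a minimal sketch. The five-word vocabulary is hypothetical, and the embedding values are random stand-ins for weights that would normally be learned from text.

```python
import numpy as np

# Hypothetical toy vocabulary (not taken from the Reuters corpus).
vocab = ["oil", "price", "rise", "bank", "rate"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Sparse one-hot representation: one dimension per vocabulary word,
# all zeros except a single 1 at the word's index.
one_hot_oil = np.zeros(len(vocab))
one_hot_oil[word_to_index["oil"]] = 1.0

# Dense representation: each word is a row of a small real-valued matrix.
# Random values are used here only for illustration.
rng = np.random.default_rng(0)
embedding_dim = 3
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))
dense_oil = embedding_matrix[word_to_index["oil"]]

print(one_hot_oil)      # length grows with the vocabulary
print(dense_oil.shape)  # length is fixed by embedding_dim
```

With a real vocabulary of tens of thousands of words, the one-hot vector grows accordingly, while the dense vector keeps whatever size is chosen for the embedding.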


## Keras Embedding Layer

Keras offers an Embedding layer that can be used for neural networks on text data.
It requires that the input data be integer encoded, so that each word is represented by a unique integer. 
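
As a rough illustration of integer encoding, the sketch below ranks words by frequency and assigns each one an integer starting at 1, leaving 0 free for padding. This mirrors the general idea behind the Keras Tokenizer but is not its implementation; the three toy documents and the alphabetical tie-breaking rule are assumptions.

```python
from collections import Counter

# Hypothetical toy documents.
docs = ["oil prices rise", "oil exports fall", "prices fall"]

# Count how often each word occurs across all documents.
counts = Counter(word for doc in docs for word in doc.lower().split())

# Rank by descending frequency; ties are broken alphabetically
# (an assumption made here so the result is deterministic).
ranked = sorted(counts, key=lambda w: (-counts[w], w))

# Index 0 is left unused so it can serve as the padding value.
word_index = {word: i + 1 for i, word in enumerate(ranked)}

# Replace every word with its integer code.
encoded = [[word_index[w] for w in doc.lower().split()] for doc in docs]
print(word_index)
print(encoded)
```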

In this project, we'll show how a word embedding is learnt while fitting a neural network.

In [30]:
import numpy as np
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

# Define Train and Test Files

In [31]:
# Define documents
from nltk.corpus import reuters
from nltk.corpus import stopwords

train_docs = []
test_docs = []
train_id = []
test_id = []
 
# Reuters file ids are prefixed with "training/" or "test/"
for file_id in reuters.fileids():
    if file_id.startswith("train"):
        train_docs.append(reuters.raw(file_id))
        train_id.append(file_id)
    else:
        test_docs.append(reuters.raw(file_id))
        test_id.append(file_id)

# Document Cleaning and Representation

Using the Keras Tokenizer:

* Remove punctuation, convert words to lower case, and tokenize.
* The sequences have different lengths, but Keras expects vectorized inputs that all have the same length, so we pad every input sequence to a length of 7769. (This number is chosen not because we have 7769 documents, but because the longest document in the corpus has around 7000 words.)
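
The padding step can be sketched in plain NumPy as follows. The helper `pad_post` is a hypothetical stand-in, not the Keras `pad_sequences` function: it appends zeros at the end of short sequences and, as a simplification, keeps the first `maxlen` items of sequences that are too long.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end, up to `maxlen` items."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]  # simplification: keep the first maxlen items
        padded[row, :len(trimmed)] = trimmed
    return padded

# Hypothetical integer-encoded documents of different lengths.
toy_docs = [[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]]
toy_padded = pad_post(toy_docs, maxlen=5)
print(toy_padded)
```

Every row now has the same length, so the documents can be stacked into a single 2D array of shape (number of documents, maxlen).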

In [158]:
NUM_WORDS = 25000
tokenizer = Tokenizer(num_words=NUM_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n\'',
                      lower=True)
# build the vocabulary from the training documents only
tokenizer.fit_on_texts(train_docs)
# map each document to a list of integer word indices
encoded_train_docs = tokenizer.texts_to_sequences(train_docs)
encoded_test_docs = tokenizer.texts_to_sequences(test_docs)
#word_index = tokenizer.word_index

# pad documents to a max length of 7769 words
max_length = 7769
padded_train_docs = pad_sequences(encoded_train_docs, maxlen=max_length, padding='post')
padded_test_docs = pad_sequences(encoded_test_docs, maxlen=max_length, padding='post')
print(padded_train_docs.shape)
print(padded_train_docs[0])

(7769, 7769)
[5963  526  904 ...    0    0    0]


In [157]:
# Transform multilabel labels
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform([reuters.categories(id)
                                  for id in train_id])
test_labels = mlb.transform([reuters.categories(id)
                             for id in test_id])

print(train_labels.shape)
print(reuters.categories())
print(reuters.categories()[21])
print(test_labels[60])

(7769, 90)
[u'acq', u'alum', u'barley', u'bop', u'carcass', u'castor-oil', u'cocoa', u'coconut', u'coconut-oil', u'coffee', u'copper', u'copra-cake', u'corn', u'cotton', u'cotton-oil', u'cpi', u'cpu', u'crude', u'dfl', u'dlr', u'dmk', u'earn', u'fuel', u'gas', u'gnp', u'gold', u'grain', u'groundnut', u'groundnut-oil', u'heat', u'hog', u'housing', u'income', u'instal-debt', u'interest', u'ipi', u'iron-steel', u'jet', u'jobs', u'l-cattle', u'lead', u'lei', u'lin-oil', u'livestock', u'lumber', u'meal-feed', u'money-fx', u'money-supply', u'naphtha', u'nat-gas', u'nickel', u'nkr', u'nzdlr', u'oat', u'oilseed', u'orange', u'palladium', u'palm-oil', u'palmkernel', u'pet-chem', u'platinum', u'potato', u'propane', u'rand', u'rape-oil', u'rapeseed', u'reserves', u'retail', u'rice', u'rubber', u'rye', u'ship', u'silver', u'sorghum', u'soy-meal', u'soy-oil', u'soybean', u'strategic-metal', u'sugar', u'sun-meal', u'sun-oil', u'sunseed', u'tea', u'tin', u'trade', u'veg-oil', u'wheat', u'wpi', u'yen'

The Embedding layer is defined by specifying three arguments:

__input_dim:__ This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0 and 10 inclusive, then the size of the vocabulary would be 11 words.

__output_dim:__ This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word.

__input_length:__ This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents consist of 1000 words, this would be 1000.
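
One way to picture how the three arguments fit together is to treat the layer as a lookup into a weight matrix of shape (input_dim, output_dim). The NumPy sketch below is only an illustration of that idea, not the Keras implementation; the sizes and the random weights are assumptions.

```python
import numpy as np

input_dim = 11     # vocabulary size: integer codes 0..10
output_dim = 4     # size of each word vector
input_length = 6   # number of word positions per document

# Random weights stand in for the values the network would learn.
rng = np.random.default_rng(0)
weights = rng.normal(size=(input_dim, output_dim))

# One integer-encoded document, padded with zeros to input_length.
doc = np.array([3, 7, 7, 10, 0, 0])

# Row lookup: each integer selects the matching row of the weight matrix.
embedded = weights[doc]
print(embedded.shape)  # (input_length, output_dim)
```

Repeated words (the two 7s here) map to identical rows, because the vector depends only on the word's integer code.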

The output from the Embedding layer will be 7769 vectors of 30 dimensions each, one per word position. We flatten this into a single vector of 7769 * 30 = 233,070 elements to pass on to the Dense output layer.

(The output of the Embedding layer is a 2D matrix with one embedding for each word in the input sequence, i.e. the input document. To connect a Dense layer directly to an Embedding layer, we first need to flatten this 2D matrix into a 1D vector using the Flatten layer.)
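
The layer sizes follow from simple arithmetic, sketched here as a sanity check. It assumes the vocabulary size equals the `NUM_WORDS` value of 25000 set in the tokenizer cell; the remaining numbers come from the text above.

```python
vocab_size = 25000    # assumed equal to NUM_WORDS
embedding_dim = 30    # output_dim of the Embedding layer
max_length = 7769     # padded document length
num_labels = 90       # number of Reuters categories

# One 30-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Flatten turns the (7769, 30) matrix into one long vector.
flattened_size = max_length * embedding_dim

# Dense layer: one weight per (input, label) pair plus one bias per label.
dense_params = flattened_size * num_labels + num_labels

total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
```

These figures can be compared against the parameter counts that `model.summary()` reports.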

In [77]:
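# define the model
# the Tokenizer only emits word indices below NUM_WORDS, so that is the vocabulary size
vocab_size = NUM_WORDS
model = Sequential()
model.add(Embedding(vocab_size, 30, input_length=max_length))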
model.add(Flatten())
model.add(Dense(90, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_train_docs, train_labels, epochs=8)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_17 (Embedding)     (None, 7769, 30)          750000    
_________________________________________________________________
flatten_17 (Flatten)         (None, 233070)            0         
_________________________________________________________________
dense_17 (Dense)             (None, 90)                20976390  
Total params: 21,726,390
Trainable params: 21,726,390
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x162124c10>

In [109]:
# evaluate the model on the training data
loss, accuracy = model.evaluate(padded_train_docs, train_labels, verbose=0)
# to evaluate on the held-out test set instead, use:
#loss, accuracy = model.evaluate(padded_test_docs, test_labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 99.982980


In [155]:
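preds = model.predict(padded_test_docs)

# threshold the sigmoid outputs at 0.5 to get binary label predictions
preds[preds>=0.5] = 1
preds[preds<0.5] = 0

# show the predicted labels for test document 60
pred_new = preds[60]
print(pred_new)
idx = np.where(pred_new>=1)
for val in idx[0]:
    # mlb.classes_ holds the label name for each output column
    print(mlb.classes_[int(val)])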


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
earn
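
The thresholding and decoding step above can be tried on toy data without a trained model. In this sketch the three label names and the probability values are made up for illustration; only the 0.5 threshold comes from the cell above.

```python
import numpy as np

# Hypothetical label names and sigmoid outputs for two documents.
label_names = np.array(["acq", "earn", "grain"])
probs = np.array([[0.10, 0.92, 0.03],
                  [0.70, 0.20, 0.55]])

# A label is predicted when its probability reaches 0.5.
binary = (probs >= 0.5).astype(int)

# Map each row of 0/1 flags back to label names.
decoded = [label_names[row == 1].tolist() for row in binary]
print(binary)
print(decoded)
```

Because each label is thresholded independently, a document can receive several labels, one label, or none.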
