The real goal of training a **skip-gram** model is not to predict the context words for a given target word; it is to learn the weights of the **embedding matrix**, which then serve as the word vectors.

The same holds for the other models that are used to compute **word embeddings**.
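The sketch below illustrates why the embedding matrix is the useful part: multiplying a one-hot input by the weight matrix simply selects one row, and that row is the word's vector. The three-word vocabulary, the matrix values, and the helper names `one_hot` and `matvec` are made up for illustration; nothing here is trained.

```python
# Toy vocabulary and a made-up (untrained) embedding matrix:
# one row per word, three dimensions per row.
vocab = ["movie", "was", "awesome"]
W = [
    [0.1, 0.2, 0.3],   # movie
    [0.4, 0.5, 0.6],   # was
    [0.7, 0.8, 0.9],   # awesome
]

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

def matvec(x, matrix):
    """Multiply a row vector x (length n) by an n x d matrix."""
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]

idx = vocab.index("awesome")
hidden = matvec(one_hot(idx, len(vocab)), W)

# The product equals row `idx` of W, so the hidden layer is just a row lookup.
print(hidden == W[idx])
```

Because the product is only a row selection, libraries store the matrix and index into it instead of building one-hot vectors.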

**Gensim**

Gensim is an open-source library for natural language processing.

With only a few lines of code, we can generate word vectors for our own corpus.

**Word2Vec Parameters**

size: (default 100) The number of dimensions of the embedding, i.e., the length of the dense vector that represents each token (word). In Gensim 4.0 and later this parameter is named vector_size.

window: (default 5) The maximum distance between a target word and words around the target word.

min_count: (default 5) The minimum number of occurrences a word needs to be included in training; words that occur fewer times than this are ignored.

workers: (default 3) The number of threads to use while training.

sg: (default 0, i.e., CBOW) The training algorithm, either CBOW (0) or skip-gram (1).
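To make `window` and `min_count` more tangible, here is a minimal pure-Python sketch of what they control. It is not Gensim's implementation: the helper names `skipgram_pairs` and `apply_min_count` are hypothetical, and the sketch always uses the full window, which is a simplification.

```python
from collections import Counter

def skipgram_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def apply_min_count(sentences, min_count):
    """Drop every token that occurs fewer than `min_count` times in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count]
            for sent in sentences]

sentence = ["gensim", "is", "a", "package"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
# [('gensim', 'is'), ('is', 'gensim'), ('is', 'a'),
#  ('a', 'is'), ('a', 'package'), ('package', 'a')]

filtered = apply_min_count([["a", "b", "a"], ["a", "c"]], min_count=2)
print(filtered)  # [['a', 'a'], ['a']]
```

A larger window yields more pairs per target word, and a higher `min_count` shrinks the vocabulary by removing rare words.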

In [4]:
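from gensim.models import Word2Vec

# define training data: a list of tokenized sentences
# (the last five strings of the third sentence have no commas between them,
# so Python joins them into the single token 'workingwithwordvectormodels')
sentences = [['gensim', 'is', 'billed', 'as', 'a', 'natural', 'language', 'processing', 'package'],
             ['but', 'it', 'is', 'practically', 'much', 'more', 'than', 'that'],
             ['It', 'is', 'a', 'leading', 'and', 'a', 'state', 'of', 'the', 'art', 'package', 'for', 'processing',
              'texts', 'working' 'with' 'word' 'vector' 'models']]

# train model (Gensim 3.x API; Gensim 4.0+ renames `size` to `vector_size`)
model = Word2Vec(sentences, min_count=1, size=10)

# summarize the trained model
print(model)

# summarize vocabulary (Gensim 4.0+: list(model.wv.key_to_index))
words = list(model.wv.vocab)
print(words)

# access the vector for one word through the KeyedVectors object
print(model.wv['gensim'])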

Word2Vec(vocab=26, size=10, alpha=0.025)
['gensim', 'is', 'billed', 'as', 'a', 'natural', 'language', 'processing', 'package', 'but', 'it', 'practically', 'much', 'more', 'than', 'that', 'It', 'leading', 'and', 'state', 'of', 'the', 'art', 'for', 'texts', 'workingwithwordvectormodels']
[-0.03448543 -0.02987814 -0.03394822 -0.0402063   0.01721926  0.00844841
 -0.01756344  0.01706514 -0.03758939 -0.02017645]
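The last vocabulary entry in the output above, 'workingwithwordvectormodels', comes from the missing commas at the end of the third sentence: Python joins adjacent string literals into one string. A small check, independent of Gensim (the variable names are illustrative):

```python
# Adjacent string literals with no comma between them are concatenated.
without_commas = ['texts', 'working' 'with' 'word' 'vector' 'models']
with_commas = ['texts', 'working', 'with', 'word', 'vector', 'models']

print(without_commas)       # ['texts', 'workingwithwordvectormodels']
print(len(without_commas))  # 2
print(len(with_commas))     # 6
```

With the commas in place, the five words would be separate tokens and the vocabulary would contain 30 entries instead of 26.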




### Sentiment Classification

#### Data Preprocessing

When working with text data, we first remove stop words, if necessary, and then convert each character or word into a numeric form such as an integer index or a one-hot encoding.
The Keras framework has a built-in class called Tokenizer, which tokenizes each document and assigns a unique integer index to each word.

It also lowercases the text and strips punctuation and other special characters by default.

In [8]:
from keras.preprocessing.text import Tokenizer

# collection of texts (the corpus)
docs = ["not good", "climax was awesome !", "really liked the movie", "too lengthy", "awesome!"]

t = Tokenizer()

# build the vocabulary from the corpus
t.fit_on_texts(docs)

# number of documents in the corpus
print(t.document_count)

# number of occurrences of each word across the corpus
print(t.word_counts)

# dictionary mapping each word to its unique index
print(t.word_index)

# dictionary mapping each word to the number of documents it appears in
print(t.word_docs)

5
OrderedDict([('not', 1), ('good', 1), ('climax', 1), ('was', 1), ('awesome', 2), ('really', 1), ('liked', 1), ('the', 1), ('movie', 1), ('too', 1), ('lengthy', 1)])
{'awesome': 1, 'not': 2, 'good': 3, 'climax': 4, 'was': 5, 'really': 6, 'liked': 7, 'the': 8, 'movie': 9, 'too': 10, 'lengthy': 11}
defaultdict(<class 'int'>, {'not': 1, 'good': 1, 'awesome': 2, 'climax': 1, 'was': 1, 'movie': 1, 'the': 1, 'really': 1, 'liked': 1, 'too': 1, 'lengthy': 1})
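The indexing shown above can be approximated in a few lines of plain Python, which helps to see where the numbers come from. This is a rough sketch, not the Keras implementation: the `tokenize` helper is hypothetical, and it filters `string.punctuation`, which is close to but not identical to the Tokenizer's default filter.

```python
import string
from collections import Counter

docs = ["not good", "climax was awesome !", "really liked the movie",
        "too lengthy", "awesome!"]

def tokenize(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

# Count every word across the corpus.
word_counts = Counter()
for doc in docs:
    word_counts.update(tokenize(doc))

# Most frequent word gets index 1; ties keep first-seen order.
# Index 0 is left unused so it can serve as the padding value.
ranked = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {word: i for i, word in enumerate(ranked, start=1)}
print(word_index)

# Replace each word with its index.
sequences = [[word_index[w] for w in tokenize(doc)] for doc in docs]
print(sequences)  # [[2, 3], [4, 5, 1], [6, 7, 8, 9], [10, 11], [1]]
```

The resulting `word_index` matches the dictionary printed by `t.word_index` above, with 'awesome' first because it is the only word that occurs twice.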


#### Lookup Table

Each word in the text data is now replaced by its index.

To train the LSTM model, we do not feed the word indices directly into the LSTM network.

We first initialize a lookup table of shape (vocab_size, vector_length), in which row i holds the vector for word index i.

We do this with the Keras Embedding class as follows.

In [None]:
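from keras.layers import Embedding

vocab_size = len(t.word_index) + 1  # +1 because index 0 is reserved for padding
vector_length = 32                  # embedding dimension (illustrative value)
embedding_layer = Embedding(vocab_size, vector_length)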
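Conceptually, the lookup table is a matrix with one row per word index, and the layer replaces each index in a sequence with its row. The sketch below mimics that with plain lists; the sizes, the random initialization range, and the `embed` helper are assumptions for illustration, not the Keras internals.

```python
import random

vocab_size = 12      # 11 words from the Tokenizer example + index 0 for padding
vector_length = 4    # illustrative embedding size

rng = random.Random(0)
# One row of small random values per word index (untrained).
lookup_table = [[rng.uniform(-0.05, 0.05) for _ in range(vector_length)]
                for _ in range(vocab_size)]

def embed(sequence):
    """Replace each word index with its row from the lookup table."""
    return [lookup_table[i] for i in sequence]

vectors = embed([4, 5, 1])             # "climax was awesome"
print(len(vectors), len(vectors[0]))   # 3 4
```

In a real model the rows start out random like this and are updated during training together with the rest of the network.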

**Transform Data**

Once you have a unique index for each word in the corpus, each document has to be represented as an array of indices in place of words, as shown below.

In [12]:
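word_to_id = {"the": 0, "awesome": 1, "movie": 2, "good": 3, "was": 4}

data = ["the movie was awesome"]
# replace every word with its index; the result is [[0, 2, 4, 1]]
transformed_data = [[word_to_id[word] for word in doc.split()] for doc in data]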

#### Sequence Padding

- Movie reviews vary in length: some are very short and some are very long.
- The model may take a very long time to train if the text sequences are too long.
- For every review we may therefore consider only the first few words, say 500.
- If a review has fewer than 500 words, we add zeros at the beginning to bring its length up to 500.

In [14]:
from keras.preprocessing import sequence

max_review_length = 100
sequence.pad_sequences(transformed_data, maxlen=max_review_length)

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 1]], dtype=int32)
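The behaviour of the output above can be reproduced with a short pure-Python sketch. By default `pad_sequences` pads at the front and, for sequences longer than `maxlen`, also drops items from the front (`padding='pre'`, `truncating='pre'`), so keeping the first words of a long review would require `truncating='post'`. The helper `pad_pre` below is hypothetical and assumes a positive `maxlen`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[0, 2, 4, 1]], maxlen=6))      # [[0, 0, 0, 2, 4, 1]]
print(pad_pre([[3, 1, 4, 1, 5]], maxlen=3))   # [[4, 1, 5]]
```

Note that in the toy `word_to_id` mapping the word "the" has index 0, so after padding it cannot be told apart from the padding zeros. This is one reason the Keras Tokenizer starts its word indices at 1.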