<a href="https://colab.research.google.com/github/ramapriyakp/Portfolio/blob/master/NLP/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word Embedding 

Word embeddings are N-dimensional vectors that try to capture a word's meaning and context in their values. They can be generated using various methods, such as neural networks, co-occurrence matrices and probabilistic models.

A useful set of word embeddings has a few key characteristics:

*  Every word has a unique word embedding (or “vector”), which is just a list of numbers for each word.
*  The word embeddings are multidimensional; for a good model, they typically have between 50 and 500 dimensions.
*  For each word, the embedding captures the “meaning” of the word.
*  Similar words end up with similar embedding values.
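
The last point can be made concrete with cosine similarity, a common way to compare two embeddings. The sketch below uses hand-picked 4-dimensional vectors that are purely hypothetical (real embeddings are learned, not written by hand), and the name `toy_embeddings` is an assumption made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "king":  [0.8, 0.6, 0.1, 0.0],
    "queen": [0.7, 0.7, 0.1, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.6],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

# The related pair scores higher than the unrelated pair.
print(sim_king_queen > sim_king_apple)
```

With these toy values the related pair ("king", "queen") scores close to 1, while the unrelated pair ("king", "apple") scores much lower.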


### Word2Vec
Word2Vec consists of models for generating word embeddings. These models are shallow two-layer neural networks with one input layer, one hidden layer and one output layer. Word2Vec uses two architectures:

### CBOW (Continuous Bag of Words)
The CBOW model predicts the current word given the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The size of the hidden layer is the number of dimensions in which we want to represent the current word at the output layer.
![alt text](https://media.geeksforgeeks.org/wp-content/uploads/cbow-1.png)
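
As a rough illustration of what CBOW trains on, the sketch below builds (context words, current word) pairs from a tokenized sentence. The helper name `cbow_pairs` and the `window` parameter are assumptions for this example; it only prepares the training pairs and does not train a network.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # words before the target
        right = tokens[i + 1:i + 1 + window]     # words after the target
        context = left + right
        if context:                              # skip one-word sentences
            pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps".split()
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

Each pair lists the inputs (context) on the left and the word to predict on the right, which is the direction CBOW works in.
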
### Skip Gram
The skip-gram model predicts the surrounding context words within a specific window given the current word. The input layer contains the current word and the output layer contains the context words. The size of the hidden layer is the number of dimensions in which we want to represent the current word at the input layer.
 ![alt text](https://media.geeksforgeeks.org/wp-content/uploads/skip_gram.png)
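
Skip-gram reverses the direction of prediction: each training example pairs the current word with one of its neighbours. The sketch below generates such pairs; the helper name `skipgram_pairs` and the `window` parameter are assumptions for this example, and no model is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (current_word, context_word) pairs for a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                       # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```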

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Word Embeddings
There are two common ways to obtain word embeddings:
*  Learning them with an embedding layer in Keras, as part of the network being trained
*  Pre-training word vectors in another network (e.g., word2vec, GloVe, fastText)

In [0]:
cd /content/drive/My Drive/NLP

/content/drive/My Drive/NLP


## Word vectors
*  In short, word embedding is the process of converting each word to a fixed-dimensional "(word) vector".
*  The dimensionality of the embedding space (i.e., vector space) is a hyperparameter; it can be set to any positive integer.

Unfortunately, this approach to word representation does not address polysemy, the coexistence of many possible meanings for a given word or phrase.
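
The polysemy limitation follows directly from how the lookup works: each word owns exactly one vector, whatever its context. The sketch below uses a toy table with random values standing in for learned embeddings; `vocab`, `embedding_table` and `embed` are hypothetical names introduced for illustration.

```python
import random

random.seed(0)  # fixed seed so the toy vectors are reproducible

embedding_dim = 4  # hyperparameter: any positive integer works
vocab = ["river", "bank", "money", "the", "by", "in"]

# One fixed vector per vocabulary entry (random here; learned in practice).
embedding_table = {
    word: [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for word in vocab
}

def embed(tokens):
    """Map each token to its vector by plain table lookup."""
    return [embedding_table[token] for token in tokens]

river_sense = embed("by the river bank".split())
money_sense = embed("money in the bank".split())

# "bank" is the last token of both phrases and gets the identical vector.
print(river_sense[-1] == money_sense[-1])
```

Both occurrences of "bank" map to the same vector, even though one refers to a riverside and the other to a financial institution.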

In [0]:
from keras.models import Sequential
from keras.layers import Embedding, CuDNNGRU, Dense, Activation   # only the layers used below
from keras.datasets import reuters
from keras.preprocessing import sequence
from keras.utils import to_categorical

Using TensorFlow backend.


## Embedding layer
The embedding layer outputs a 3-D tensor:
*  Output shape = (batch_size, input_length, output_dim)
*  input_dim: dimensionality of input space (number of unique tokens of interest)
*  output_dim: dimensionality of embedding space
*  input_length: length of input sequence (if None, it can vary)

In [0]:
# when input length is constant
model = Sequential()
model.add(Embedding(input_dim = 10, output_dim = 5, input_length = 3))






In [0]:
model.output_shape

(None, 3, 5)

## Using embedding layer in network
Usually, an embedding layer is used as the first layer of a network that models text data.

In [0]:
# parameters to import dataset
num_words = 3000
maxlen = 50

In [0]:
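# keep only the num_words most frequent words; newswires over the maxlen limit are dropped, not truncated
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words = num_words, maxlen = maxlen)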

Downloading data from https://s3.amazonaws.com/text-datasets/reuters.npz


In [0]:
X_train = sequence.pad_sequences(X_train, maxlen = maxlen, padding = 'post')
X_test = sequence.pad_sequences(X_test, maxlen = maxlen, padding = 'post')
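
To show what `padding = 'post'` does, here is a dependency-free sketch of the same idea. The helper `pad_post` is a hypothetical stand-in written for illustration, not the Keras implementation; it assumes integer token sequences and mimics the default `truncating = 'pre'` behaviour for sequences that are too long.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]        # drop items from the front
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[5, 8], [3, 1, 4, 1, 5]], maxlen=4))
# [[5, 8, 0, 0], [1, 4, 1, 5]]
```

After padding, every row has the same length, which is what lets the data be stored as a single 2-D array such as the `(1595, 50)` training matrix.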

In [0]:
y_train = to_categorical(y_train, num_classes = 46)
y_test = to_categorical(y_test, num_classes = 46)
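
The conversion performed by `to_categorical` is a one-hot encoding: each integer class label becomes a vector with a single 1 at the label's index. A minimal sketch of that idea, using a hypothetical helper `one_hot` and assuming labels are integers in `[0, num_classes)`:

```python
def one_hot(labels, num_classes):
    """Return one row per label, with 1.0 at the label's index and 0.0 elsewhere."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

print(one_hot([0, 2], num_classes=3))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

With 46 classes, each label turns into a row of 46 values, which is why the label arrays end up with a second dimension of 46.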

In [0]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1595, 50)
(399, 50)
(1595, 46)
(399, 46)


In [0]:
input_dim = num_words
output_dim = 100     # we set dimensionality of embedding space as 100
input_length = maxlen

In [0]:
def reuters_model():
    model = Sequential()
    # map each word index to a trainable vector of size output_dim
    model.add(Embedding(input_dim = input_dim, output_dim = output_dim, input_length = input_length))
    # CuDNNGRU needs a GPU runtime; only the final hidden state is returned
    model.add(CuDNNGRU(50, return_sequences = False))
    model.add(Dense(100))
    model.add(Activation('relu'))
    # one output unit per Reuters topic class
    model.add(Dense(46, activation = 'softmax'))
    
    model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
    return model

In [0]:
model = reuters_model()





In [0]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 50, 100)           300000    
_________________________________________________________________
cu_dnngru_1 (CuDNNGRU)       (None, 50)                22800     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               5100      
_________________________________________________________________
activation_1 (Activation)    (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 46)                4646      
=================================================================
Total params: 332,546
Trainable params: 332,546
Non-trainable params: 0
_________________________________________________________________
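
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic; the GRU formula assumes the cuDNN layout of three gates, each with input weights, recurrent weights and two bias vectors, which reproduces the 22,800 reported above. Variable names are chosen for illustration.

```python
num_words, embed_dim = 3000, 100    # vocabulary size and embedding size
gru_units = 50
hidden_units, num_classes = 100, 46

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = num_words * embed_dim

# GRU (assumed cuDNN layout): 3 gates x (input weights + recurrent weights + 2 biases).
gru_params = 3 * (gru_units * (embed_dim + gru_units) + 2 * gru_units)

# Dense layers: weight matrix plus one bias per output unit.
dense_1_params = gru_units * hidden_units + hidden_units
dense_2_params = hidden_units * num_classes + num_classes

total = embedding_params + gru_params + dense_1_params + dense_2_params
print(embedding_params, gru_params, dense_1_params, dense_2_params, total)
# 300000 22800 5100 4646 332546
```

The embedding layer accounts for roughly 90% of all parameters, which is typical when the vocabulary is much larger than the hidden layers.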


In [0]:
model.fit(X_train, y_train, epochs = 100, batch_size = 100, verbose = 0)




<keras.callbacks.History at 0x7f6df2b0a470>

In [0]:

result = model.evaluate(X_test, y_test)



In [0]:
print('Test Accuracy: ', result[1])

Test Accuracy:  0.8571428575910124
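
For reference, the accuracy metric compares the most probable predicted class with the true class for each sample. The sketch below shows that computation on made-up probabilities; the helper `accuracy` and the toy data are assumptions for illustration and are not taken from the trained model.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy(probabilities, one_hot_labels):
    """Fraction of samples whose most probable class matches the true class."""
    correct = 0
    for probs, target in zip(probabilities, one_hot_labels):
        if argmax(probs) == argmax(target):
            correct += 1
    return correct / len(probabilities)

# Made-up softmax outputs for three samples and three classes.
toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]
toy_labels = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]

toy_accuracy = accuracy(toy_probs, toy_labels)
print(toy_accuracy)  # two of three predictions are correct
```

By the same arithmetic, 342 correct predictions out of 399 test samples would give 342 / 399 ≈ 0.857, which is consistent with the test accuracy printed above.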
