<br>
<h1 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;">Understanding Embedding with Classification</h1>
<br>
    
<center><img src="https://mlwhiz.com/images/word2vec.png"></center>

### <h3 style="color:#fe346e">Word2Vec</h3>
What exactly are word embeddings? Loosely speaking, they are vector representations of words. That raises two questions: how do we generate them, and, more importantly, how do they capture context?
Word2Vec is one of the most popular techniques for learning word embeddings with a shallow neural network. It was developed by Tomas Mikolov and colleagues at Google in 2013.

### <h3 style="color:#fe346e">Why do we need them?</h3>
Consider two similar sentences: Have a good day and Have a great day. Their meanings hardly differ. If we construct an exhaustive vocabulary (let’s call it V), we get V = {Have, a, good, great, day}.

Now, let us create a one-hot encoded vector for each word in V. The length of each vector equals the size of V (= 5). Every element is zero except the one at the index of the corresponding word in the vocabulary, which is one. The encodings below make this concrete.
Have = `[1,0,0,0,0]`; a = `[0,1,0,0,0]`; good = `[0,0,1,0,0]`; great = `[0,0,0,1,0]`; day = `[0,0,0,0,1]` (written as rows here; each is the transpose of a column vector)

If we try to visualize these encodings, we can think of a 5-dimensional space in which each word occupies one dimension and has nothing to do with the rest (no projection along the other dimensions). This means ‘good’ and ‘great’ are as different as ‘day’ and ‘have’, which is not true.
Our objective is to have words with similar context occupy close spatial positions. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. angle close to 0.
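
To make the problem concrete, here is a minimal sketch (using NumPy and the five-word vocabulary above) that computes the cosine similarity between one-hot vectors. The helper name `cosine_similarity` is just an illustrative choice.

```python
import numpy as np

vocab = ["have", "a", "good", "great", "day"]

# Row i of the identity matrix is the one-hot vector for word i
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_good_great = cosine_similarity(one_hot["good"], one_hot["great"])
sim_day_have = cosine_similarity(one_hot["day"], one_hot["have"])
print(sim_good_great, sim_day_have)  # 0.0 0.0
```

Every pair of distinct one-hot vectors is orthogonal, so the similarity is 0 regardless of meaning. A learned embedding should instead push the value for ‘good’ and ‘great’ towards 1.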

<center><img src="https://miro.medium.com/max/1394/0*XMW5mf81LSHodnTi.png"></center>

### <h3 style="color:#fe346e">How does Word2Vec work?</h3>
Word2Vec is a method for constructing such an embedding. It comes in two variants, both based on neural networks: Skip-Gram and Continuous Bag of Words (CBOW).

### CBOW Model: 

This method takes the context of each word as the input and tries to predict the word corresponding to that context. Consider our example: Have a great day.
Let the input to the neural network be the word great. Here we are trying to predict a target word (day) from a single context word (great). More specifically, we feed in the one-hot encoding of the input word and measure the output error against the one-hot encoding of the target word (day). In the process of predicting the target word, we learn the vector representation of the target word.

The architecture is below in Figure 1:
<img src="https://miro.medium.com/max/1400/0*3DFDpaXoglalyB4c.png">

The input or the context word is a one hot encoded vector of size V. The hidden layer contains N neurons and the output is again a V length vector with the elements being the softmax values.
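
The forward pass of this single-context architecture can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the hidden size `N = 3`, the random initial weights and the names `W_in`, `W_out` and `forward` are made up for the example, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["have", "a", "good", "great", "day"]
V, N = len(vocab), 3  # V = vocabulary size, N = hidden units (assumed value)

W_in = rng.normal(scale=0.1, size=(V, N))   # input -> hidden weights
W_out = rng.normal(scale=0.1, size=(N, V))  # hidden -> output weights

def forward(context_index):
    x = np.zeros(V)
    x[context_index] = 1.0                       # one-hot context word
    hidden = x @ W_in                            # picks out one row of W_in
    scores = hidden @ W_out                      # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())   # numerically stable softmax
    return hidden, exp_scores / exp_scores.sum()

hidden, probs = forward(vocab.index("great"))
# Cross-entropy against the one-hot target "day"
loss = -np.log(probs[vocab.index("day")])
print(probs.shape, float(probs.sum()))
```

Because the input is one-hot, the hidden activation is simply a row of `W_in`. Training adjusts these weights to reduce the loss, and the rows of the weight matrix are what we keep as word vectors.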

### Skip-Gram Model:

<img src="https://miro.medium.com/max/1400/0*Ta3qx5CQsrJloyCA.png">

This looks like the multiple-context CBOW model flipped around. To some extent, that is true.

We input the target word into the network. The model outputs C probability distributions. What does this mean?
For each of the C context positions, we get one probability distribution over the vocabulary, i.e. V probabilities, one for each word.
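
As a rough illustration of what the skip-gram model is trained on, the sketch below lists the (target, context) pairs for our example sentence. The function name `skipgram_pairs` and the window size of 1 are assumptions made for the example.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["have", "a", "great", "day"], window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each pair is one training example: the target word goes in, and the network is asked to assign high probability to the context word.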

### <h3 style="color:#fe346e">Stanford’s competing Approach — GloVe (2014)</h3>

GloVe is a "count-based" model. Count-based models learn their vectors by essentially doing dimensionality reduction on the co-occurrence counts matrix. They first construct a large matrix of (words x context) co-occurrence information, i.e. for each "word" (the rows), you count how frequently this word appears in some "context" (the columns) in a large corpus. The number of "contexts" is of course large, since it is essentially combinatorial in size. This matrix is then factorized to yield a lower-dimensional (word x features) matrix, where each row is the vector representation of one word. In general, this is done by minimizing a "reconstruction loss", which tries to find the lower-dimensional representations that explain most of the variance in the high-dimensional data. In the specific case of GloVe, the counts matrix is preprocessed by normalizing the counts and log-smoothing them. This turns out to be A Good Thing in terms of the quality of the learned representations.
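
The general count-based recipe can be sketched as follows. Note the assumptions: this uses a plain truncated SVD on log-smoothed counts rather than GloVe's own training objective, and the tiny corpus, the window size of 1 and `k = 2` are made up for the illustration.

```python
import numpy as np

corpus = [["have", "a", "good", "day"], ["have", "a", "great", "day"]]
vocab = sorted({word for sentence in corpus for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Step 1: word x context co-occurrence counts within a +/-1 word window
window = 1
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                counts[index[word], index[sentence[j]]] += 1

# Step 2: log-smooth the counts, then keep the top-k singular directions
log_counts = np.log1p(counts)
U, S, Vt = np.linalg.svd(log_counts)
k = 2
word_vectors = U[:, :k] * S[:k]  # one k-dimensional row per word

print(word_vectors[index["good"]])
print(word_vectors[index["great"]])
```

Since "good" and "great" occur in exactly the same contexts in this toy corpus, their rows in the counts matrix are identical and they receive the same vector, which is the behaviour one-hot encodings could not give us.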

However, as has been pointed out, when we control for all the training hyper-parameters, the embeddings generated by Word2Vec and GloVe tend to perform very similarly in downstream NLP tasks. The additional benefit of GloVe over Word2Vec is that its implementation is easier to parallelize, which makes it easier to train on more data, which, with these models, is always A Good Thing.

### <h3 style="color:#fe346e">fastText</h3>

fastText is a library for efficient learning of word representations and sentence classification. It is written in C++ and supports multiprocessing during training. fastText allows you to train supervised and unsupervised representations of words and sentences. These representations (embeddings) can be used for numerous applications: data compression, features for additional models, candidate selection, or initializers for transfer learning.
fastText supports training continuous bag of words (CBOW) or Skip-gram models using negative sampling, softmax or hierarchical softmax loss functions. I have primarily used fastText for training semantic embeddings on a corpus with a size on the order of tens of millions, and am happy with how it has performed and scaled for this task. I had a hard time finding documentation beyond the getting-started material, so in this post I am going to walk you through the internals of fastText and how it works. An understanding of how the word2vec models work is expected.

fastText achieves really good performance for word representations and sentence classification, especially in the case of rare words, by making use of character-level information.
Each word is represented as a bag of character n-grams in addition to the word itself. For example, for the word matter with n = 3, the fastText character n-grams are <ma, mat, att, tte, ter, er>. The symbols < and > are added as boundary markers to distinguish an n-gram from a whole word: if the word mat is part of the vocabulary, it is represented as <mat>. This helps preserve the meaning of shorter words that may show up as n-grams of other words. Inherently, it also allows you to capture the meaning of suffixes and prefixes.
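
The n-gram extraction described above can be sketched in plain Python. The helper `char_ngrams` is a hypothetical name, not part of the fastText API, and it handles a single n, whereas the library itself works with a range of n-gram lengths.

```python
def char_ngrams(word, n=3):
    """Character n-grams of `word` after adding the < and > boundary symbols."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("matter"))  # ['<ma', 'mat', 'att', 'tte', 'ter', 'er>']
print(char_ngrams("mat"))     # ['<ma', 'mat', 'at>']
```

The n-gram `mat` inside matter is distinct from the whole word, which fastText stores as `<mat>` with both boundary symbols.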

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Import Libraries&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
# Import everything from tensorflow.keras so these layers match the tf.keras layers used below
# (the unused np_utils import was dropped; it is missing from recent Keras releases)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import BatchNormalization
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from tensorflow.keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as ex
import plotly.figure_factory as ff

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Read the data&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
test = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip')
train = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
print(train.shape)
print(test.shape)

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Do some cleaning on the data&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
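import re, string

# Raw string avoids invalid-escape warnings; the pattern itself is unchanged
LINK_REGEX = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)

def strip_links(text):
    # findall returns one tuple per match; element 0 is the full URL
    links = re.findall(LINK_REGEX, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text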

In [None]:
train['comment_text']=train['comment_text'].apply(lambda x:strip_links(x))
test['comment_text']=test['comment_text'].apply(lambda x:strip_links(x))

In [None]:
# Replace newline characters with spaces
train['comment_text']=train['comment_text'].str.replace("\n",' ')

In [None]:
# Replace newline characters with spaces
test['comment_text']=test['comment_text'].str.replace("\n",' ')

In [None]:
# Define the function to remove the punctuation
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
# Apply to the DF series
train['comment_text'] = train['comment_text'].apply(remove_punctuations) 

In [None]:
# Apply to the DF series
test['comment_text'] = test['comment_text'].apply(remove_punctuations) 

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Train and Val split&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
Y = train[list_classes].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.comment_text.values, Y,  
                                                  random_state=42, 
                                                  test_size=0.2)

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Define Vocab size and input string size&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
## Check the maximum length (in words) of the comments
train['comment_text'].apply(lambda x:len(str(x).split())).max()

In [None]:
max_features = 5000
maxlen = 500

In [None]:
token=tf.keras.preprocessing.text.Tokenizer(num_words=max_features)
token.fit_on_texts(train.comment_text)

In [None]:
X_train_seq=token.texts_to_sequences(X_train)
X_test_seq=token.texts_to_sequences(X_test)

In [None]:
#zero pad the sequences
X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=maxlen)
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=maxlen)

In [None]:
word_index = token.word_index

In [None]:
len(token.word_index)##251102

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Word2Vec embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!gzip -d GoogleNews-vectors-negative300.bin.gz
!ls -l

In [None]:
from gensim.models import Word2Vec, KeyedVectors
# Load the pretrained Google News word2vec model (binary word2vec format)
word2vec_model = KeyedVectors.load_word2vec_format("./GoogleNews-vectors-negative300.bin", binary=True)

In [None]:
#Embedding length based on selected model - the Google News vectors are 300d.
embedding_vector_length = 300

In [None]:
#Initialize embedding matrix
embedding_matrix = np.zeros((max_features + 1, embedding_vector_length))
print(embedding_matrix.shape)

In [None]:
for word, i in sorted(token.word_index.items(), key=lambda x: x[1]):
    if i > max_features:  # the matrix only has rows 0..max_features
        break
    if word in word2vec_model:  # skip words missing from the pretrained vocabulary
        embedding_matrix[i] = word2vec_model[word]  # copy the pretrained word2vec vector

In [None]:
embedding_matrix

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Model Building Using Word2vec&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
#Initialize model
import tensorflow as tf
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [None]:
# A SimpleRNN on top of frozen pretrained embeddings, followed by one dense layer
model = Sequential()
model.add(tf.keras.layers.Embedding(max_features + 1, #Vocabulary size
                                    embedding_vector_length, #Embedding size
                                    weights=[embedding_matrix], #Embeddings taken from pre-trained model
                                    trainable=False, #As embeddings are already available, we will not train this layer. It will act as lookup layer.
                                    input_length=maxlen) #Number of tokens in each padded comment
         )
model.add(SimpleRNN(100))
model.add(Dense(6, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

In [None]:
history = model.fit(X_train_pad,
                    y_train,
                    epochs=3,
                    batch_size=32,          
                    validation_data=(X_test_pad, y_test))

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Glove Embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
#unzip the file, we get multiple embedding files. We can use either one of them
!unzip glove.6B.zip
!ls -l

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec

#Glove file - we are using model with 50 embedding size
glove_input_file = 'glove.6B.50d.txt'

#Name for word2vec file
word2vec_output_file = 'glove.6B.50d.txt.word2vec'

#Convert Glove embeddings to Word2Vec embeddings
glove2word2vec(glove_input_file, word2vec_output_file)

In [None]:
### We extract embeddings only for the words we are interested in; the pretrained model has 400k words, each with a 50-dimensional vector.
from gensim.models import Word2Vec, KeyedVectors

# Load pretrained Glove model (in word2vec form)
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

#Embedding length based on selected model - we are using 50d here.
embedding_vector_length = 50

In [None]:
#Initialize embedding matrix
embedding_matrix = np.zeros((max_features + 1, embedding_vector_length))
print(embedding_matrix.shape)

In [None]:
for word, i in sorted(token.word_index.items(), key=lambda x: x[1]):
    if i > max_features:  # the matrix only has rows 0..max_features
        break
    if word in glove_model:  # skip words missing from the pretrained vocabulary
        embedding_matrix[i] = glove_model[word]  # copy the pretrained GloVe vector

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Model Building Using GloVe&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
#Initialize model
import tensorflow as tf
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [None]:
# A SimpleRNN on top of frozen pretrained embeddings, followed by one dense layer
model = Sequential()
model.add(tf.keras.layers.Embedding(max_features + 1, #Vocabulary size
                                    embedding_vector_length, #Embedding size
                                    weights=[embedding_matrix], #Embeddings taken from pre-trained model
                                    trainable=False, #As embeddings are already available, we will not train this layer. It will act as lookup layer.
                                    input_length=maxlen) #Number of tokens in each padded comment
         )
model.add(SimpleRNN(100))
model.add(Dense(6, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

In [None]:
history_glove=model.fit(X_train_pad,
                        y_train,
                        epochs=3,
                        batch_size=32,          
                        validation_data=(X_test_pad, y_test))

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Fasttext Embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
import fasttext.util
### uncomment this when you need to download the pretrained fastText model
# fasttext.util.download_model('en', if_exists='ignore')  # English

In [None]:
### remove unnecessary files
# !rm -rf  ./cc.en.300.bin.gz
# !rm -rf ./GoogleNews-vectors-negative300.bin.gz
# !rm -rf ./glove.6B.300d.txt
# !rm -rf ./glove.6B.200d.txt
# !rm -rf ./glove.6B.100d.txt
# !rm -rf ./glove.6B.zip

In [None]:
ft = fasttext.load_model('cc.en.300.bin')

In [None]:
### reduce the vector dimension to 50
fasttext.util.reduce_model(ft, 50)

In [None]:
#Initialize embedding matrix
embedding_matrix_fasttext = np.zeros((max_features + 1, embedding_vector_length))
print(embedding_matrix_fasttext.shape)

In [None]:
for word, i in sorted(token.word_index.items(), key=lambda x: x[1]):
    if i > max_features:  # the matrix only has rows 0..max_features
        break
    # fastText builds vectors for unseen words from their character n-grams,
    # so no missing-word handling is needed here
    embedding_matrix_fasttext[i] = ft.get_word_vector(word)

In [None]:
# A SimpleRNN on top of frozen pretrained embeddings, followed by one dense layer
model = Sequential()
model.add(tf.keras.layers.Embedding(max_features + 1, #Vocabulary size
                                    embedding_vector_length, #Embedding size
                                    weights=[embedding_matrix_fasttext], #Embeddings taken from pre-trained model
                                    trainable=False, #As embeddings are already available, we will not train this layer. It will act as lookup layer.
                                    input_length=maxlen) #Number of tokens in each padded comment
         )
model.add(SimpleRNN(100))
model.add(Dense(6, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

In [None]:
history_fasttext=model.fit(X_train_pad,y_train,
                           epochs=3,
                           batch_size=32,          
                           validation_data=(X_test_pad, y_test))

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Loss curve for 3 embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [None]:
history.history

In [None]:
history_glove.history

In [None]:
history_fasttext.history

In [None]:
loss_list=[history.history,history_glove.history,history_fasttext.history]

In [None]:
loss_list

In [None]:
loss_dict={'w2v_loss':loss_list[0]['loss'],'w2v_val_loss':loss_list[0]['val_loss'],
           'glove_loss':loss_list[1]['loss'],'glove_val_loss':loss_list[1]['val_loss'],
           'fasttext_loss':loss_list[2]['loss'],'fasttext_val_loss':loss_list[2]['val_loss']}
acc_dict={'w2v_acc':loss_list[0]['accuracy'],'w2v_val_acc':loss_list[0]['val_accuracy'],
           'glove_acc':loss_list[1]['accuracy'],'glove_val_acc':loss_list[1]['val_accuracy'],
           'fasttext_acc':loss_list[2]['accuracy'],'fasttext_val_acc':loss_list[2]['val_accuracy']}

In [None]:
loss_dict['w2v_loss']

In [None]:
np.arange(1,3,1)

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Training Loss curve for 3 embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
epochRange = np.arange(1,4,1)
plt.plot(epochRange,loss_dict['w2v_loss'])
plt.plot(epochRange,loss_dict['glove_loss'])
plt.plot(epochRange,loss_dict['fasttext_loss'])
plt.title('Training loss for different embeddings')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['Word2Vec', 'GLOVE','FastText'], loc='upper left')
plt.show()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Validation Loss curve for 3 embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
epochRange = np.arange(1,4,1)
plt.plot(epochRange,loss_dict['w2v_val_loss'])
plt.plot(epochRange,loss_dict['glove_val_loss'])
plt.plot(epochRange,loss_dict['fasttext_val_loss'])
plt.title('Validation loss for different embeddings')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['Word2Vec', 'GLOVE','FastText'], loc='upper left')
plt.show()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Validation Accuracy curve for 3 embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
epochRange = np.arange(1,4,1)
plt.plot(epochRange,acc_dict['w2v_val_acc'])
plt.plot(epochRange,acc_dict['glove_val_acc'])
plt.plot(epochRange,acc_dict['fasttext_val_acc'])
plt.title('Validation accuracy for different embeddings')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['Word2Vec', 'GLOVE','FastText'], loc='upper left')
plt.show()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Training Accuracy curve for 3 embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
epochRange = np.arange(1,4,1)
plt.plot(epochRange,acc_dict['w2v_acc'])
plt.plot(epochRange,acc_dict['glove_acc'])
plt.plot(epochRange,acc_dict['fasttext_acc'])
plt.title('Training accuracy for different embeddings')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['Word2Vec', 'GLOVE','FastText'], loc='upper left')
plt.show()