<div style="text-align: center; display:block">
    <div style="display: inline-block">
        <h1  style="text-align: center">Word Embedding Module</h1>
        <div style="width:80%; text-align: center"><i>Author:</i> <strong>Soham Mullick</strong> </div>
    </div>
</div>

The purpose of this module is to generate custom word embedding vectors from the cleaned and processed text data, for use by the Final Prediction Model code.

<b>Input</b> - Cleaned Data output from Text_Cleaner module

<b>Output</b> - Trained word2vec embedding object

### Introduction to Word2Vec

The word2vec algorithm learns a dense vector for each word, which is a far more compact representation than a sparse one-hot or count-based encoding. These vectors also carry semantic information: words that show up in similar contexts, such as "black", "white", and "red", end up with vectors near each other. There are two architectures for implementing word2vec: CBOW (Continuous Bag-Of-Words) and skip-gram.

A word embedding represents each word in the vocabulary as a real-valued vector in a high-dimensional space. The vectors are learned so that words with similar meanings have similar representations, that is, they lie close together in the vector space. This is more expressive than classical methods such as bag-of-words, where relationships between words or tokens are either ignored or captured only crudely through bigram and trigram features.
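
"Close in the vector space" is usually measured with cosine similarity. The sketch below illustrates the idea with hand-made three-dimensional vectors; the values in `toy_vectors` and the helper `rank_neighbours` are hypothetical and are not taken from any trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, chosen by hand for illustration only.
toy_vectors = {
    "black": [0.9, 0.1, 0.0],
    "white": [0.8, 0.2, 0.1],
    "cable": [0.0, 0.3, 0.9],
}

def rank_neighbours(word, vectors):
    """Return the other words sorted by decreasing cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

neighbours = rank_neighbours("black", toy_vectors)
```

With these toy values, "white" ranks above "cable" as a neighbour of "black", because its vector points in nearly the same direction.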
  
In this implementation we use the skip-gram architecture, which is generally reported to learn better representations for infrequent words than CBOW, at the cost of slower training. Here, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.
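
As a rough illustration of what "predict the words surrounding it" means, the sketch below lists the (centre word, context word) training pairs that a fixed window produces. The function `skipgram_pairs` and the example sentence are hypothetical; real implementations add refinements such as negative sampling, so this shows only the pairing idea.

```python
def skipgram_pairs(tokens, window):
    """Yield (centre, context) pairs for every token and each neighbour
    at most `window` positions away."""
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield centre, tokens[j]

# Hypothetical tokenised sentence, for illustration only.
tokens = "switch stack lost power".split()
pairs = list(skipgram_pairs(tokens, window=1))
```

With `window=1`, each word is paired only with its immediate neighbours, so the four-word sentence yields six pairs. A larger window pairs each word with more distant context.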

We use the gensim Python library to create and store the embeddings. The Python code for the word2vec implementation is part of Word2VecClass(), which is instantiated inside Models.py.


### Importing important modules

In [14]:
#Core modules
import pandas as pd
import configparser
import logging
import time

#Gensim modules
from gensim.models import word2vec
from gensim.models.word2vec import Word2Vec

#Visualisation modules
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import HoverTool
import matplotlib as mpl
import matplotlib.cm as cm

### Word Embedding Control Box

The following parameters can be used to tune the training of the word embedding vectors.

In [15]:
embedding_size = 300     # Dimensionality of the word vectors
window_size = 10         # Maximum distance between the current and predicted word within a sentence
learning_rate = 0.025    # Initial learning rate
use_sg = 1               # 1 = skip-gram, 0 = CBOW
negative_sampling = 5    # Number of "noise words" drawn for negative sampling
iter_no = 150            # Number of iterations (epochs) over the corpus
min_count = 20           # Ignore all words with total frequency lower than this
workers = 4              # Number of worker threads used for training (helps only on multi-core machines)
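
To give a feel for what `min_count` does, the sketch below counts tokens over a tiny corpus and keeps only those that reach the threshold. The function `build_vocab` and the toy sentences are hypothetical. This mirrors the idea of frequency-based vocabulary pruning and is not gensim's actual implementation.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Count tokens over all sentences and keep those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

# Hypothetical tokenised sentences, for illustration only.
toy_sentences = [
    ["switch", "stack", "issue"],
    ["switch", "power", "issue"],
    ["switch", "cable"],
]
vocab = build_vocab(toy_sentences, min_count=2)
```

Here only "switch" and "issue" survive. On a real corpus, a higher `min_count` gives a smaller vocabulary and drops rare words whose vectors would be trained on very few examples.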

### Defining the functions

The following helper function and iterable class are used by the module.

In [16]:
# Load the cleaned data set from a CSV file

def getFile(fileName):
    try:
        raw_data = pd.read_csv(fileName, encoding='latin-1')  # Change the file name to use a different data set
    except FileNotFoundError:
        print('\n File name not correct. Please try again')
        # Prompt for a corrected path and retry
        raw_data = getFile(input('File name: '))
    return raw_data


# Iterable class that feeds word2vec one tokenised document at a time
class Sentence(object):
    def __init__(self, doc_list):
        self.doc_list = list(doc_list)
    def __iter__(self):
        for doc in self.doc_list:
            yield str(doc).split()
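
The reason for wrapping the documents in a class rather than a plain generator is that Word2Vec reads the corpus more than once (one pass to build the vocabulary, then one pass per epoch), so the input has to be restartable. The sketch below repeats the class so that it runs on its own; the documents in `docs` are hypothetical.

```python
class Sentence(object):
    """Same iterable as above, repeated so the snippet runs on its own."""
    def __init__(self, doc_list):
        self.doc_list = list(doc_list)

    def __iter__(self):
        for doc in self.doc_list:
            yield str(doc).split()

docs = ["switch stack issue", "power cable"]   # hypothetical documents

restartable = Sentence(docs)
first_pass = list(restartable)
second_pass = list(restartable)    # __iter__ runs again, so this is a full pass

one_shot = (str(doc).split() for doc in docs)
gen_first = list(one_shot)
gen_second = list(one_shot)        # the generator is already exhausted
```

The class yields the full corpus on every pass, whereas the generator yields nothing the second time it is consumed.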


### Read Config and Create Logger

In [24]:
# Loading config file
config = configparser.ConfigParser()
config.read('./config.ini')

# Read the column name and file paths from the [Word_Embedding] section
textData = str(config['Word_Embedding']['Data_col'])
input_file = str(config['Word_Embedding']['Input_file'])
output_file = str(config['Word_Embedding']['Output_file'])

# Create a time-stamped log file
log_name = "Word_Embedding_{}.log".format(time.strftime('%b-%d-%Y_%H%M', time.localtime()))
logging.basicConfig(filename=log_name, level=logging.DEBUG)
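
For reference, the sketch below shows the shape of config file that the cell above expects. Only the section name and the three key names come from this module. The values (column name and paths) are hypothetical placeholders, and the file content is parsed from a string so that the example runs without touching the real `config.ini`.

```python
import configparser

# Hypothetical file contents; only the section and key names come from this module.
example_ini = """
[Word_Embedding]
Data_col = clean_text
Input_file = ./cleaned_data.csv
Output_file = ./w2v_model.bin
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

section = example_config["Word_Embedding"]
text_column = section["Data_col"]
input_path = section["Input_file"]
output_path = section["Output_file"]
```

Note that `ConfigParser.read` silently skips files it cannot find, so a wrong path to `config.ini` typically surfaces later as a `KeyError` on `'Word_Embedding'` rather than as a file error.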

### Create Word Embedding

#### Get File

In [18]:
CleanRawData=getFile(input_file)

#### Using Whole corpus to train the word2vec

In [19]:
# Decide on Corpus to be used for training
data=CleanRawData[textData]

In [20]:
#Create an instance of the iterable class
sentences=Sentence(data)

#### Create Word2Vec Object

The Word2Vec model is trained on the tokenised sentences produced by the `Sentence` iterable, which reads the whole column named in the config file and yields one tokenised document at a time.

The vocabulary is built on the fly from the same corpus. There are other variants that train the word embedding on a pre-defined vocabulary.

The parameters from the control box above determine how the model is trained.

In [21]:
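# Note: size= and iter= are the gensim 3.x argument names (gensim 4.x renamed them to vector_size= and epochs=)
w2v_model = Word2Vec(
    sentences,
    size=embedding_size,
    window=window_size,
    alpha=learning_rate,
    sg=use_sg,
    negative=negative_sampling,
    iter=iter_no,
    workers=workers,
    min_count=min_count,
)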

#### Check Vocabulary

Log the size of the final vocabulary.

In [22]:
logging.debug("The total number of words in the current Word2Vec Model: "+str(len(w2v_model.wv.vocab)))

### Check for most similar word

In [23]:
w2v_model.wv.most_similar(positive=["switch"], topn=10)


[('stack', 0.584490180015564),
 ('power_supply', 0.41783031821250916),
 ('tried', 0.3626179099082947),
 ('one', 0.3552532196044922),
 ('two', 0.35395902395248413),
 ('stacked', 0.3537466526031494),
 ('showing', 0.34573429822921753),
 ('powered', 0.3440735936164856),
 ('cable', 0.3430076539516449),
 ('issue', 0.3364446759223938)]

### Visualising Word Embeddings

We use t-SNE for dimensionality reduction to visualise the whole word embedding vector space in 2D.

In [None]:
%matplotlib inline
output_notebook()

In [None]:
# t-SNE model
# Can take a long time to run. Comment this out once the t-SNE values have been computed.
def tsne_cal(model):
    """Project every word vector in the model's vocabulary to 2D with t-SNE."""
    tokens = [model.wv[word] for word in model.wv.vocab]

    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    values = tsne_model.fit_transform(tokens)
    return values

def tsne_plot_bokeh(tsne_values, model):
    """Scatter-plot the 2D t-SNE values with a hover tooltip showing each word."""
    source_train = ColumnDataSource(
        data=dict(
            x=[value[0] for value in tsne_values],
            y=[value[1] for value in tsne_values],
            desc=list(model.wv.vocab),
        )
    )

    hover_tsne = HoverTool(names=['train'], tooltips=[("Word", "@desc")])
    tools_tsne = [hover_tsne, 'pan', 'wheel_zoom', 'reset']
    plot_tsne = figure(plot_width=600, plot_height=600, tools=tools_tsne, title='Embeddings')
    plot_tsne.circle('x', 'y', size=10, alpha=0.5, line_width=0,
                     source=source_train, name="train")

    show(plot_tsne)

#### Projecting Word Vectors in 2D

In [None]:
tsne_values=tsne_cal(w2v_model)

#### Plotting Word Embeddings

In [None]:
tsne_plot_bokeh(tsne_values,w2v_model)

### Saving The Word Embedding Model

In [25]:
#Saving the word2vec model to be used later
w2v_model.save(output_file)