## Name: Vaibhav Bichave

## Implement the Continuous Bag of Words (CBOW) Model for the given (textual document) using the below steps:
    a. Data preparation
    b. Generate training data
    c. Train model
    d. Output

In [1]:
data ="""But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain 
was born and I will give you a complete account of the system, and expound the actual teachings 
of the great explorer of the truth, the master-builder of human happiness. No one rejects, 
dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know 
how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again 
is there anyone who loves or pursues or desires to obtain pain of itself, because it is pain, 
but because occasionally circumstances occur in which toil and pain can procure him some great 
pleasure. To take a trivial example, which of us ever undertakes laborious physical exercise, 
except to obtain some advantage from it? But who has any right to find fault with a man who 
chooses to enjoy a pleasure that has no annoying consequences, or one who avoids a pain that 
produces no resultant pleasure?"""

data = data.split()

In [2]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)

word2id = tokenizer.word_index
word2id['PAD'] = 0

id2word = {v:k for k,v in word2id.items()}
wids = tokenizer.texts_to_sequences(data)

emb_size = 100
window_size = 2
vocab_size = len(word2id)

In [3]:
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

In [4]:
def cbow_model(corpus,vocab_size, window_size):
    context_length = window_size*2
    for words in corpus:
        sequences_size = len(words)
        for index,word in enumerate(words):
            context_word = []
            label_word = []
            start = index - window_size
            end = index + window_size + 1
            context_word.append([words[i]
                               for i in range(start,end)
                               if 0<=i <sequences_size
                               and i!=index])
            label_word.append(word)
            
            x = pad_sequences(context_word,context_length)
            y = to_categorical(label_word,vocab_size)
            yield(x,y)
            
 

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Embedding,Lambda
import keras.backend as K

In [6]:
cbow = Sequential([
    Embedding(vocab_size,emb_size,input_length = window_size*2),
    Lambda(lambda x:K.mean(x,axis=1)),
    Dense(vocab_size,activation = 'softmax')
])

cbow.compile(loss='categorical_crossentropy', optimizer='adam')
cbow.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 4, 100)            10200     
                                                                 
 lambda (Lambda)             (None, 100)               0         
                                                                 
 dense (Dense)               (None, 102)               10302     
                                                                 
Total params: 20,502
Trainable params: 20,502
Non-trainable params: 0
_________________________________________________________________


In [7]:
for epochs in range(5):
    loss  = 0
    for x,y in cbow_model(corpus=wids,vocab_size = vocab_size,window_size=window_size):
        loss += cbow.train_on_batch(x,y)
    print("Epochs {} - Loss -> {}".format(epochs,loss))

Epochs 0 - Loss -> 775.3729386329651
Epochs 1 - Loss -> 762.764808177948
Epochs 2 - Loss -> 752.3622903823853
Epochs 3 - Loss -> 743.9978125095367
Epochs 4 - Loss -> 739.9645841121674


In [8]:
import pandas as pd
weights = cbow.get_weights()[0][:]
# pd.DataFrame(weights,index=word2id.keys())

In [9]:
### from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import euclidean_distances

distance_matrix = euclidean_distances(weights)
data = pd.DataFrame(distance_matrix,index=word2id.keys())
data.columns = word2id.keys()

data

Unnamed: 0,to,of,pleasure,pain,a,the,who,but,and,or,...,find,fault,with,man,chooses,enjoy,annoying,produces,resultant,PAD
to,0.000000,0.787391,0.810214,0.827404,0.797251,0.814888,0.786212,0.853085,0.804344,0.820375,...,0.778528,0.810712,0.812098,0.789316,0.770183,0.826989,0.780721,0.832445,0.785573,0.789727
of,0.787391,0.000000,0.384531,0.416828,0.325038,0.406923,0.388624,0.438588,0.389750,0.426883,...,0.425372,0.401584,0.444521,0.409172,0.373554,0.398741,0.408482,0.411565,0.440255,0.405608
pleasure,0.810214,0.384531,0.000000,0.409522,0.375739,0.403045,0.407210,0.402263,0.393062,0.427232,...,0.429522,0.404031,0.399282,0.409312,0.405998,0.394687,0.416837,0.403753,0.441222,0.411379
pain,0.827404,0.416828,0.409522,0.000000,0.391088,0.427218,0.427060,0.412869,0.367085,0.401413,...,0.407849,0.381329,0.419598,0.357400,0.424888,0.378397,0.351677,0.332776,0.430496,0.420382
a,0.797251,0.325038,0.375739,0.391088,0.000000,0.377490,0.405434,0.418239,0.381655,0.385599,...,0.419062,0.340630,0.443746,0.377340,0.369278,0.441894,0.406800,0.377152,0.433487,0.366327
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
enjoy,0.826989,0.398741,0.394687,0.378397,0.441894,0.410658,0.391083,0.406363,0.407552,0.428812,...,0.429945,0.439026,0.420391,0.400164,0.428240,0.000000,0.424172,0.416142,0.399358,0.440132
annoying,0.780721,0.408482,0.416837,0.351677,0.406800,0.438933,0.426726,0.446179,0.385946,0.389236,...,0.415694,0.452983,0.443795,0.378387,0.435288,0.424172,0.000000,0.379482,0.453137,0.388455
produces,0.832445,0.411565,0.403753,0.332776,0.377152,0.422260,0.434561,0.405213,0.405003,0.406478,...,0.415976,0.410710,0.432731,0.432116,0.419045,0.416142,0.379482,0.000000,0.425729,0.378775
resultant,0.785573,0.440255,0.441222,0.430496,0.433487,0.439111,0.429622,0.434378,0.363705,0.383129,...,0.381278,0.406512,0.433371,0.401329,0.417583,0.399358,0.453137,0.425729,0.000000,0.450683


In [10]:
def SearchWord(WordList):
    similar_words ={}
    for search_term in WordList:
        if(search_term in word2id.keys()):
            similar_words[search_term]=[id2word[idx] for idx in 
                                        distance_matrix[word2id[search_term]-1].argsort()[0:5]+1] 
    return similar_words



In [11]:
SearchWord(['enjoy'])

{'enjoy': ['enjoy', 'desires', 'one', 'again', 'know']}