# ConEc (Context Encoders), an extension of word2vec

The code implements the following procedure:
- Use the CBOW word2vec model with the negative sampling objective
- Train the model on the "text8", "OneBilCorpus" or "conll2003" datasets
- Multiply the trained word2vec embeddings with each word's average context vectors (CVs)
- Each word has a global CV and a local CV
- The choice of alpha in the equation given in the paper determines the emphasis placed on the word's local context
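
The last three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `conec.py`: the function name `conec_embedding`, the toy matrices, and in particular the convention that alpha weights the global CV while (1 - alpha) weights the local CV are assumptions made here; check the equation in the paper for the exact form.

```python
import numpy as np

def conec_embedding(global_cv, local_cv, W, alpha=0.5):
    """Blend a word's global and local context vectors, then project
    the result through the trained word2vec embedding matrix.

    global_cv, local_cv: 1-D arrays of length vocab_size
    W: (vocab_size, embed_dim) trained word2vec embeddings
    alpha: weight on the global CV (assumed convention); the local CV
           receives weight (1 - alpha)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    combined_cv = alpha * global_cv + (1.0 - alpha) * local_cv
    # Multiplying the context vector with W yields the final embedding.
    return combined_cv @ W

# Toy data: a vocabulary of 3 words and 2-dimensional embeddings.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
global_cv = np.array([0.5, 0.5, 0.0])  # averaged over all occurrences
local_cv = np.array([0.0, 0.0, 1.0])   # from the current occurrence only

emb = conec_embedding(global_cv, local_cv, W, alpha=0.5)
print(emb)
```

Under this assumed convention, alpha = 1 ignores the local context entirely and alpha = 0 relies on it alone.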

Download the conec folder from this repository and import the conec class into your script.

Follow the instructions in README.md for details about the prerequisites for running this notebook.

Due to space constraints, this notebook trains the context2vec model on limited data: 162000 words and 162 sentences from the "text8" dataset. To train it on the full "text8" dataset, change the following line

`while count <= 118:` 

to 

`while True:` 

in `conec.py`

The model gives very low accuracy on the Google Analogy Task because it is trained on a very small chunk of the "text8" vocabulary.

See the results/ folder for the actual accuracies attained by `conec` on the Google Analogy Task and the CoNLL 2003 NER Task.

In [1]:
from conec import conec

Define a conec class object, context2vec
----------------------------------------

In [2]:
context2vec = conec()

Conec object created ... 
Use this object to get text embeddings or text similarity...


---
Loading Dataset
---
Ensure the text8/OneBillionCorpus/CoNLL 2003 data is downloaded and present in the data/ folder

In [3]:
context2vec.read_Dataset('text8', 'data/text8')

Reading dataset ....
Dataset loaded into conec object ....


---
Training
---
Train the conec object with 5 iterations of the word2vec model with negative sampling, 
then introduce the context vectors into this pre-trained model.
Modify the iterations, modelType ("cbow" or "sg") and embedding dimension (embed_dim) as required.
Here, we train the model for only 1 iteration to save time.

In [4]:
context2vec.train(iterations=1, saveInterm=False, modelType="cbow", embed_dim=200)

2019-05-01 03:50:36,078 : INFO : collecting all words and their counts
2019-05-01 03:50:36,085 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 unique words
2019-05-01 03:50:36,254 : INFO : collected 16438 unique words from a corpus of 162000 words and 162 sentences
2019-05-01 03:50:36,263 : INFO : total of 3878 unique words after removing those with count < 5
2019-05-01 03:50:36,265 : INFO : constructing a table with noise distribution from 3878 words


conec object being trained using text8 dataset present in the directory ../../../../examples/data/text8 ....
Creating word2vec model ....
<conec.Text8Corpus object at 0x7f67d72c1cf8>


2019-05-01 03:51:42,827 : INFO : training model on 3878 vocabulary and 200 features
2019-05-01 03:52:03,033 : INFO : PROGRESS: at 47.22% words, alpha 0.00500, 3310 words/s
2019-05-01 03:52:23,063 : INFO : PROGRESS: at 97.58% words, alpha 0.00500, 3435 words/s
2019-05-01 03:52:24,031 : INFO : training on 141635 words took 41.2s, 3437 words/s


Training the word2vec model on multiple iterations ....
PROGRESS: at sentence #0, processed 0 words and 0 unique words
collected 16438 unique words from a corpus of 162000 words and 162 sentences
PROGRESS: at sentence #0
PROGRESS: through with all the sentences
Saving context2vec model trained on text8 dataset in the following file - data/text8_cbow_200_hs0_neg13_seed3_it1.model
Model for ../../../../examples/data/text8 dataset saved as data/text8_cbow_200_hs0_neg13_seed3_it1.model in the data/ folder


Predicting the embedding of a word
---
Use the model trained above to get the embeddings for any word

In [5]:
context2vec.predict_embedding("student")

array([ 0.16531803, -0.10039496,  0.0074003 , -0.09941176,  0.07194029,
       -0.05663242,  0.04278717,  0.0274874 , -0.05531411, -0.10412723,
       -0.03567853,  0.12549633,  0.01547575, -0.01862101,  0.01271429,
        0.02227433, -0.03902028,  0.09761431, -0.02327678, -0.01857079,
        0.06608818,  0.03408823,  0.00421464,  0.06498237,  0.00841683,
       -0.00333358,  0.00111326,  0.06494229, -0.03110866, -0.12265436,
        0.01439275, -0.005578  , -0.02623828, -0.00900803,  0.1049694 ,
        0.10527257,  0.08490574,  0.05287833,  0.08795646,  0.06210937,
       -0.0152501 ,  0.23809971, -0.02945294,  0.03858227,  0.03797628,
        0.00498829,  0.04977058, -0.07361304, -0.12650932,  0.03004827,
        0.05253399, -0.1063354 , -0.02225152,  0.04144626, -0.0713333 ,
       -0.04629956,  0.06722289,  0.08018234, -0.06080223,  0.01653811,
       -0.02749066, -0.0229704 , -0.13580119,  0.04224365, -0.09832591,
       -0.02508853, -0.06136944,  0.06106324,  0.06571213, -0.07

Predicting the average embedding of a sentence
---
Use the model trained above to predict the embedding of a sentence

In [6]:
context2vec.predict_sent_embedding("This is a great class")

array([ 0.15852345, -0.12578668, -0.01444203, -0.10126745,  0.1023228 ,
       -0.03415585,  0.03266247,  0.00875758, -0.05765506, -0.0905843 ,
       -0.0126076 ,  0.14165781, -0.00206613, -0.02467626,  0.01630537,
        0.01781283, -0.05047987,  0.11270444, -0.00788195, -0.03575283,
        0.07065628,  0.03803026,  0.00316418,  0.0641703 ,  0.01651335,
        0.01152376, -0.03238369,  0.06772912, -0.03998572, -0.11205994,
        0.02399351,  0.0089479 , -0.02309657, -0.02279564,  0.090577  ,
        0.09138416,  0.08767302,  0.07333575,  0.11559842,  0.0603927 ,
       -0.01774923,  0.21608242, -0.0326739 ,  0.03193534,  0.04408792,
        0.01029744,  0.05784287, -0.08015395, -0.1304327 ,  0.02889983,
        0.04627755, -0.09095402, -0.02769028,  0.02879271, -0.05120329,
       -0.06675243,  0.06470329,  0.03837215, -0.04942747,  0.02869307,
       -0.05720734, -0.01412808, -0.15331699,  0.03162723, -0.10662205,
       -0.03688817, -0.05528343,  0.03763501,  0.06446529, -0.06
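
The idea of an average sentence embedding can be sketched as follows. This is an illustration under stated assumptions, not the body of `predict_sent_embedding`: the helper name `average_sentence_embedding`, the lowercase whitespace tokenization, and the choice to skip out-of-vocabulary words are all hypothetical.

```python
import numpy as np

def average_sentence_embedding(sentence, word_vectors, embed_dim):
    """Average the vectors of all in-vocabulary words in a sentence.

    word_vectors: dict mapping a word to a 1-D array of length embed_dim
    Returns a zero vector when no word of the sentence is in the vocabulary.
    """
    tokens = sentence.lower().split()  # assumed tokenization
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(embed_dim)
    return np.mean(vectors, axis=0)

# Toy vocabulary with 2-dimensional vectors.
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "class": np.array([0.0, 1.0]),
}
sent_emb = average_sentence_embedding("This is a great class", toy_vectors, 2)
print(sent_emb)
```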

Similarity between words using their embeddings
---
Obtain the similarity between two words

In [7]:
context2vec.predict_similarity("student", "students")

0.9441206497249871
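
A score like the one above is commonly computed as the cosine similarity of the two embeddings. The sketch below assumes that metric; the notebook does not state which measure `predict_similarity` uses, and the function name `cosine_similarity` is chosen here for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0
    return float(np.dot(u, v) / denom)

same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same_direction, orthogonal)
```

Cosine similarity depends only on direction, not on vector length, so values near 1 indicate embeddings that point the same way.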

Similarity between sentences using their average embeddings
---
Obtain the similarity between two sentences

In [8]:
context2vec.predict_sent_similarity("I am a girl", "I am a woman")

0.9941471168338919

Evaluation on Google Analogy Dataset
---
Evaluate the trained model on the Google Analogy Dataset

Ensure questions-words.txt is present in the data/ folder

In [9]:
context2vec.evaluate_analogy(input_path="data/questions-words.txt")

capital-common-countries: 0.0% (0/20)
capital-world: 0.0% (0/15)
currency: 0.0% (0/4)
city-in-state: 4.0% (1/25)
family: 4.2% (3/72)
gram1-adjective-to-adverb: 0.0% (0/42)
gram2-opposite: 0.0% (0/2)
gram3-comparative: 0.0% (0/110)
gram4-superlative: 0.0% (0/56)
gram5-present-participle: 0.0% (0/132)
gram6-nationality-adjective: 0.0% (0/177)
gram7-past-tense: 0.0% (0/132)
gram8-plural: 0.0% (0/90)
gram9-plural-verbs: 0.0% (0/132)
total: 0.4% (4/1009)


[{'section': 'capital-common-countries', 'correct': 0, 'incorrect': 20},
 {'section': 'capital-world', 'correct': 0, 'incorrect': 15},
 {'section': 'currency', 'correct': 0, 'incorrect': 4},
 {'section': 'city-in-state', 'correct': 1, 'incorrect': 24},
 {'section': 'family', 'correct': 3, 'incorrect': 69},
 {'section': 'gram1-adjective-to-adverb', 'correct': 0, 'incorrect': 42},
 {'section': 'gram2-opposite', 'correct': 0, 'incorrect': 2},
 {'section': 'gram3-comparative', 'correct': 0, 'incorrect': 110},
 {'section': 'gram4-superlative', 'correct': 0, 'incorrect': 56},
 {'section': 'gram5-present-participle', 'correct': 0, 'incorrect': 132},
 {'section': 'gram6-nationality-adjective', 'correct': 0, 'incorrect': 177},
 {'section': 'gram7-past-tense', 'correct': 0, 'incorrect': 132},
 {'section': 'gram8-plural', 'correct': 0, 'incorrect': 90},
 {'section': 'gram9-plural-verbs', 'correct': 0, 'incorrect': 132},
 {'section': 'total', 'correct': 4, 'incorrect': 1005}]
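
The returned list can be turned back into per-section accuracies. The sketch below assumes only the dictionary layout shown in the output above (`section`, `correct`, `incorrect`); the helper name `summarize` is hypothetical, and the sample rows are copied from that output.

```python
def summarize(sections):
    """Return (section, correct, total, accuracy in %) for each entry."""
    rows = []
    for s in sections:
        total = s["correct"] + s["incorrect"]
        accuracy = 100.0 * s["correct"] / total if total else 0.0
        rows.append((s["section"], s["correct"], total, round(accuracy, 1)))
    return rows

# Three rows taken from the evaluation output above.
results = [
    {'section': 'city-in-state', 'correct': 1, 'incorrect': 24},
    {'section': 'family', 'correct': 3, 'incorrect': 69},
    {'section': 'total', 'correct': 4, 'incorrect': 1005},
]

for name, correct, total, accuracy in summarize(results):
    print(f"{name}: {accuracy}% ({correct}/{total})")
```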