# Using Word Embeddings - Word2Vec, Doc2Vec

The intuition behind __Word2Vec__ is that the meaning of a word can be inferred from its neighbors. If we train a (shallow) neural network on pairs of words that occur close together, we end up with a network that can predict whether a pair of words "belongs" together or not. However, we are not going to use the neural network after training! Instead, the goal is to learn the weights of the hidden layer, which are the word vectors (aka word embeddings) we are after. One way to think of these embeddings is as features that describe the target word.
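
To make the idea of "word pairs which are close together" concrete, here is a minimal sketch of how (target, context) pairs could be generated from a token list with a fixed window. The function name `skipgram_pairs` and the toy sentence are made up for illustration. This is the skip-gram style of pairing; gensim's real training loop differs (it uses CBOW unless `sg=1` is passed, and adds details such as negative sampling), so treat this as an illustration of the intuition only.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours
    that lie at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# toy sentence (hypothetical), tokens in the same style as the preprocessed text
tokens = ["recurrent", "neural_network", "speech", "recognition"]
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

With `window=1` each inner word contributes two pairs and each edge word contributes one, so the four-token sentence yields six pairs.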

We will use the preprocessed text that we built for the [16-topic-modeling notebook](http://localhost:8888/notebooks/16-topic-modeling.ipynb) to train a Gensim word2vec model and then use it to create document vectors similar to what we did there. We then use these document vectors in the same way we used the doc-topic vectors in the topic modeling notebook.

Code in this notebook is adapted from the blog post [Word2Vec tutorial](https://rare-technologies.com/word2vec-tutorial/) by Radim Řehůřek, creator of gensim, and from the post [Gensim word2vec tutorial - full working example](http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/) by Kavita Ganesan.

Yet another option, instead of creating document vectors manually, is to build a __Doc2Vec__ model, which allows us to find similar documents directly. More details are in the blog post [Doc2Vec Tutorial](https://rare-technologies.com/doc2vec-tutorial/), also by gensim creator Radim Řehůřek.

In [1]:
import pickle
import gensim
import logging
import numpy as np
import os

In [2]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [3]:
DATA_DIR = "../data"
MODEL_DIR = "../models"

PREPROC_TEXTS_FILE = os.path.join(DATA_DIR, "textfiles_preproc.txt")

WORD2VEC_MODEL_FILE = os.path.join(MODEL_DIR, "word2vec_model.gensim")
DOC2VEC_MODEL_FILE = os.path.join(MODEL_DIR, "doc2vec_model.gensim")

DOC_SIMS = os.path.join(DATA_DIR, "w2v_sims.npy")
DOC_LOOKUP = os.path.join(DATA_DIR, "w2v_docid2corpus.pkl")

PAPERS_METADATA = os.path.join(DATA_DIR, "papers_metadata.tsv")

In [4]:
class MyDocIterator(object):
    def __init__(self, pp_filename):
        self.fpp = open(pp_filename, "r")

    def __iter__(self):
        for line in self.fpp:
            try:
                filename, text = line.strip().split("\t")
                yield text.split(" ")
            except ValueError:
                continue

                
class MyRestartableDocIterator(object):
    def __init__(self, pp_filename):
        self.pp_filename = pp_filename

    def __iter__(self):
        return iter(MyDocIterator(self.pp_filename))


if os.path.exists(WORD2VEC_MODEL_FILE):
    print("word2vec model already generated, loading")
    w2v_model = gensim.models.Word2Vec.load(WORD2VEC_MODEL_FILE)
else:
    docs = MyRestartableDocIterator(PREPROC_TEXTS_FILE)
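    # size= and iter= are the gensim < 4.0 argument names;
    # gensim >= 4.0 calls them vector_size= and epochs=
    w2v_model = gensim.models.Word2Vec(docs, size=150, window=5, min_count=2,
                                   workers=4, iter=10)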
    w2v_model.save(WORD2VEC_MODEL_FILE)

2018-08-25 12:03:49,163 : INFO : loading Word2Vec object from ../models/word2vec_model.gensim


word2vec model already generated, loading


2018-08-25 12:03:49,442 : INFO : loading wv recursively from ../models/word2vec_model.gensim.wv.* with mmap=None
2018-08-25 12:03:49,443 : INFO : loading vectors from ../models/word2vec_model.gensim.wv.vectors.npy with mmap=None
2018-08-25 12:03:49,486 : INFO : setting ignored attribute vectors_norm to None
2018-08-25 12:03:49,487 : INFO : loading vocabulary recursively from ../models/word2vec_model.gensim.vocabulary.* with mmap=None
2018-08-25 12:03:49,488 : INFO : loading trainables recursively from ../models/word2vec_model.gensim.trainables.* with mmap=None
2018-08-25 12:03:49,489 : INFO : loading syn1neg from ../models/word2vec_model.gensim.trainables.syn1neg.npy with mmap=None
2018-08-25 12:03:49,534 : INFO : setting ignored attribute cum_table to None
2018-08-25 12:03:49,535 : INFO : loaded ../models/word2vec_model.gensim


## Query Expansion

We can use the `most_similar` method of the learned model to find terms that are close to our query terms, which gives us possible synonyms for query expansion. Note that we also treated our keywords as single multi-word tokens, so we can use them as-is, as shown in the second cell below.

In [5]:
w2v_model.wv.most_similar("convolution")

2018-08-25 12:03:49,712 : INFO : precomputing L2-norms of word weight vectors


[('convolve', 0.6773777008056641),
 ('upsampling', 0.6635823249816895),
 ('convolutional_layer', 0.6612037420272827),
 ('dilate', 0.6306341886520386),
 ('stride', 0.6231109499931335),
 ('convolutional', 0.6216310858726501),
 ('feature_map', 0.6172328591346741),
 ('deeper', 0.612465500831604),
 ('pool', 0.5934486985206604),
 ('max_pool', 0.586410641670227)]

In [6]:
w2v_model.wv.most_similar("neural_network")

[('neural_net', 0.7221011519432068),
 ('artificial_neural_network', 0.6340944766998291),
 ('deep_neural_network', 0.626783013343811),
 ('feedforward_network', 0.6249794363975525),
 ('recurrent_network', 0.5822201371192932),
 ('network_architecture', 0.5816317200660706),
 ('layer_perceptron', 0.5749163627624512),
 ('recurrent_neural_network', 0.5740536451339722),
 ('neuralnetworks_us', 0.5685520172119141),
 ('layer_network', 0.5418899059295654)]
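
Neighbor lists like the ones above could be turned into an expanded query along the following lines. This is a sketch under assumptions: `expand_query`, the `min_sim` cutoff of 0.6, and the `ToyVectors` stand-in are all hypothetical, and the toy scores are rounded from the `convolution` output above. In the notebook, `w2v_model.wv` would be passed in place of `ToyVectors()`.

```python
def expand_query(terms, keyed_vectors, topn=5, min_sim=0.6):
    """Return the query terms plus any neighbours whose similarity
    score is at least `min_sim`."""
    expanded = list(terms)
    for term in terms:
        try:
            neighbours = keyed_vectors.most_similar(term, topn=topn)
        except KeyError:
            # term is not in the vocabulary: keep it, but add no neighbours
            continue
        for word, score in neighbours:
            if score >= min_sim and word not in expanded:
                expanded.append(word)
    return expanded


class ToyVectors:
    """Hypothetical stand-in for `w2v_model.wv`, so the sketch runs
    without a trained model."""
    _table = {
        "convolution": [("convolve", 0.68), ("upsampling", 0.66), ("pool", 0.59)],
    }

    def most_similar(self, word, topn=10):
        # raises KeyError for unknown words, as gensim does
        return self._table[word][:topn]


expanded = expand_query(["convolution", "unseen_term"], ToyVectors())
print(expanded)
```

The cutoff matters: without it, loosely related neighbors such as `deeper` in the output above would be added to the query as if they were synonyms.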

## Word Embeddings

The weights of the learned model are the word embeddings we are interested in. They can be accessed as shown below.

In [7]:
E = w2v_model.wv.vectors
print(E.shape)

(99617, 150)


## Document Embeddings (BoW)

We can build document vectors from these word vectors using the bag-of-words approach, that is, each document vector is just the average of the vectors of all its words. Words that are missing from the model vocabulary are skipped.

In [8]:
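def compute_bow_docvec(text, w2v_model):
    # start from a zero vector with the same dimensionality as the word vectors
    doc_vec = np.zeros(w2v_model.wv.vectors.shape[1])
    num_words = 0
    for word in text.split(" "):
        try:
            doc_vec += w2v_model.wv[word]
            num_words += 1
        except KeyError:
            # word is not in the model vocabulary, skip it
            continue
    # average over the words found; keep the zero vector if none were found
    if num_words > 0:
        doc_vec /= num_words
    return doc_vec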


test_text = "organization associative database application hisashi suzuki suguru arimoto"
print(compute_bow_docvec(test_text, w2v_model))

[ 0.19770882  0.18564166  1.0081561  -0.18493582 -0.71190245 -0.23386441
  0.44864107  0.42183316  0.32042122  0.03524575  0.39742947  0.2186656
 -0.33583622 -0.92449786 -0.27372528 -0.09199027 -0.45183092  1.26830638
 -0.23831061  0.45956633 -0.28053251 -0.09720943  1.03525012 -0.75988733
  0.31549739  0.07462404 -0.20503044 -1.0408117  -0.53094931 -0.63574317
  0.54676002  0.85588561 -0.18846756 -0.25591124  0.03638115  0.23548628
 -0.13348883 -0.80841208  0.34865554  0.08029212 -0.37310332 -1.12760907
 -0.18952324  0.09589486  0.3396261   0.35413956 -0.52332797 -0.10075037
 -0.06295226 -0.37133758  0.45845741 -0.43592981 -0.17267235 -0.75191418
 -0.02310579  0.22621982  0.57415657 -0.00700087  0.21481432  0.57625283
  0.05762654 -0.8824854  -0.04926448 -0.04621982  0.26995589  0.07962886
 -0.44957382  0.32853732 -0.34244629  0.06719177 -0.63850286 -0.44873272
 -0.08850067  0.04439371 -0.43870203 -0.4641712   0.27890945 -0.06352812
  0.47025011 -0.13218641  0.15521489 -0.55059321 -0.

In [9]:
doc_vectors = []
docid2corpus = {}
row_id = 0
with open(PREPROC_TEXTS_FILE, "r") as fppt:
    for line in fppt:
        try:
            filename, text = line.strip().split("\t")
        except ValueError:
            # skip malformed lines instead of reusing the previous line's values
            continue
        if row_id % 1000 == 0:
            print("{:d} preprocessed docs read".format(row_id))
        doc_vectors.append(compute_bow_docvec(text, w2v_model))
        doc_id = int(filename.split(".")[0])
        docid2corpus[doc_id] = row_id
        row_id += 1
print("{:d} preprocessed docs read, COMPLETE".format(row_id))

0 preprocessed docs read
1000 preprocessed docs read
2000 preprocessed docs read
3000 preprocessed docs read
4000 preprocessed docs read
5000 preprocessed docs read
6000 preprocessed docs read
7000 preprocessed docs read
7238 preprocessed docs read, COMPLETE


In [10]:
D = np.array(doc_vectors)
print(D.shape)

(7238, 150)


## Document Similarity (BoW)

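Finally we compute a similarity score between all pairs of documents by multiplying the document matrix by its transpose. Note that `np.linalg.norm(D)` without an `axis` argument is the Frobenius norm of the whole matrix, a single scalar, so the values below are dot products scaled by a constant and not true cosine similarities, which would require scaling each row of `D` to unit length first. This is why the scores are small and why a document does not score 1.0 against itself.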

In [11]:
sim = np.matmul(D, np.transpose(D)) / np.linalg.norm(D)
print(sim.shape)

(7238, 7238)
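
The cell above divides every dot product by one scalar, so its scores are not bounded cosine values. If true cosine similarity is wanted, each document vector can be scaled to unit length before the multiplication. The sketch below shows one way to do that; the function name `cosine_similarity_matrix`, the `eps` guard against all-zero rows, and the toy matrix are assumptions for illustration, not part of the original notebook, and the scores reported later in this notebook come from the original cell, not from this variant.

```python
import numpy as np


def cosine_similarity_matrix(D, eps=1e-12):
    """Pairwise cosine similarity between the rows of D."""
    # one L2 norm per row, kept as a column so it broadcasts across each row
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    # eps avoids division by zero for all-zero document vectors
    D_unit = D / np.maximum(norms, eps)
    return np.matmul(D_unit, D_unit.T)


# toy matrix: rows 0 and 1 point in the same direction, row 2 is orthogonal
D_toy = np.array([[1.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 3.0]])
S = cosine_similarity_matrix(D_toy)
print(np.round(S, 3))
```

With this normalization every document has similarity 1.0 with itself, and vector length (which for averaged word vectors says little about content) no longer influences the ranking.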


In [12]:
np.save(DOC_SIMS, sim)
# use a context manager so the pickle file is flushed and closed
with open(DOC_LOOKUP, "wb") as f:
    pickle.dump(docid2corpus, f)

In [13]:
def similar_docs(filename, sim, topn, docid2corpus, corpusid2doc):
    doc_id = int(filename.split(".")[0])
    corpus_id = docid2corpus[doc_id]
    row = sim[corpus_id, :]
    target_docs = np.argsort(-row)[0:topn].tolist()
    scores = row[target_docs].tolist()
    target_filenames = ["{:d}.txt".format(corpusid2doc[x]) for x in target_docs]
    return target_filenames, scores
    

filename2title = {}
with open(PAPERS_METADATA, "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        cols = line.strip().split("\t")
        filename2title["{:s}.txt".format(cols[0])] = cols[2]

source_filename = "1032.txt"
top_n = 10
corpusid2doc = {v:k for k, v in docid2corpus.items()}
target_filenames, scores = similar_docs(source_filename, sim, top_n, 
                                        docid2corpus, corpusid2doc)
print("Source: {:s}".format(filename2title[source_filename]))
print("--- top {:d} similar docs ---".format(top_n))
for target_filename, score in zip(target_filenames, scores):
    print("({:.5f}) {:s}".format(score, filename2title[target_filename]))

Source: Forward-backward retraining of recurrent neural networks
--- top 10 similar docs ---
(0.06981) Forward-backward retraining of recurrent neural networks
(0.06857) Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks
(0.06831) A Recurrent Neural Network for Word Identification from Continuous Phoneme Strings
(0.06652) Learning Sequential Structure in Simple Recurrent Networks
(0.06582) Speech Recognition Using Demi-Syllable Neural Prediction Model
(0.06512) Searching for Character Models
(0.06498) Speech Production Using A Neural Network with a Cooperative Learning Mechanism
(0.06488) An Integrated Architecture of Adaptive Neural Network Control for Dynamic Systems
(0.06454) HMM Speech Recognition with Neural Net Discrimination
(0.06425) A Segment-Based Automatic Language Identification System


## Document Similarity with Doc2Vec

Instead of building our own similarity matrix, we can use a model such as Doc2Vec, which learns a vector for each document and can be queried for similar documents directly.

In [14]:
class TaggedDocIterator(object):
    def __init__(self, pp_filename):
        self.fpp = open(pp_filename, "r")

    def __iter__(self):
        for line in self.fpp:
            try:
                filename, text = line.strip().split("\t")
                doc_id = int(filename.split(".")[0])
                yield gensim.models.doc2vec.TaggedDocument(
                    words=text.split(" "), tags=[doc_id])
            except ValueError:
                pass


class RestartableTaggedDocIterator(object):
    def __init__(self, pp_filename):
        self.pp_filename = pp_filename
        
    def __iter__(self):
        return iter(TaggedDocIterator(self.pp_filename))


if os.path.exists(DOC2VEC_MODEL_FILE):
    print("doc2vec model file exists, loading")
    d2v_model = gensim.models.doc2vec.Doc2Vec.load(DOC2VEC_MODEL_FILE)
else:
    docs = RestartableTaggedDocIterator(PREPROC_TEXTS_FILE)            
    # iter= is the gensim < 4.0 argument name; gensim >= 4.0 calls it epochs=
    d2v_model = gensim.models.doc2vec.Doc2Vec(docs, vector_size=150, window=5,
                                              min_count=1, workers=4, iter=10)
    d2v_model.save(DOC2VEC_MODEL_FILE)


2018-08-25 12:04:27,149 : INFO : loading Doc2Vec object from ../models/doc2vec_model.gensim


doc2vec model file exists, loading


2018-08-25 12:04:28,002 : INFO : loading vocabulary recursively from ../models/doc2vec_model.gensim.vocabulary.* with mmap=None
2018-08-25 12:04:28,003 : INFO : loading trainables recursively from ../models/doc2vec_model.gensim.trainables.* with mmap=None
2018-08-25 12:04:28,004 : INFO : loading syn1neg from ../models/doc2vec_model.gensim.trainables.syn1neg.npy with mmap=None
2018-08-25 12:04:28,129 : INFO : loading wv recursively from ../models/doc2vec_model.gensim.wv.* with mmap=None
2018-08-25 12:04:28,130 : INFO : loading vectors from ../models/doc2vec_model.gensim.wv.vectors.npy with mmap=None
2018-08-25 12:04:28,261 : INFO : loading docvecs recursively from ../models/doc2vec_model.gensim.docvecs.* with mmap=None
2018-08-25 12:04:28,262 : INFO : loaded ../models/doc2vec_model.gensim


In [15]:
d2v_model.docvecs.most_similar(positive=[1], negative=[], topn=10)

2018-08-25 12:04:28,781 : INFO : precomputing L2-norms of doc weight vectors


[(427, 0.6177938580513),
 (1347, 0.59282386302948),
 (38, 0.5886051654815674),
 (269, 0.5721673965454102),
 (88, 0.5622702240943909),
 (125, 0.5453654527664185),
 (1569, 0.5206254720687866),
 (839, 0.5154223442077637),
 (5872, 0.5119458436965942),
 (211, 0.5094888210296631)]

In [16]:
source_filename = "1032.txt"
top_n = 10

print("Source: {:s}".format(filename2title[source_filename]))
print("--- top {:d} similar docs ---".format(top_n))

source_docid = int(source_filename.split(".")[0])
targetid_scores = d2v_model.docvecs.most_similar(positive=[source_docid], 
                                                 negative=[], topn=top_n)

for target_id, score in targetid_scores:
    target_filename = "{:d}.txt".format(target_id)
    print("({:.5f}) {:s}".format(score, filename2title[target_filename]))

Source: Forward-backward retraining of recurrent neural networks
--- top 10 similar docs ---
(0.67334) Phonetic Classification and Recognition Using the Multi-Layer Perceptron
(0.65860) A Continuous Speech Recognition System Embedding MLP into HMM
(0.64828) Connectionist Approaches to the Use of Markov Models for Speech Recognition
(0.63736) REMAP: Recursive Estimation and Maximization of A Posteriori Probabilities - Application to Transition-Based Connectionist Speech Recognition
(0.63566) Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System
(0.62960) Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
(0.61769) Modeling Consistency in a Speaker Independent Continuous Speech Recognition System
(0.61722) Improved Hidden Markov Model Speech Recognition Using Radial Basis Function Networks
(0.61304) Handwritten Word Recognition using Contextual Hybrid Radial Basis Function Network/Hidden Markov Models
(0.61246) English Alphabet Re