## Building paragraph vectors using Doc2Vec


### Import the common text corpus, the Doc2Vec algorithm and the TaggedDocument class from Gensim


In [1]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

## Corpus on which training will happen


In [2]:
common_texts


[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

## Building Tagged Documents from the corpus, since the Doc2Vec model expects them as input


In [3]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]


In [4]:
documents


[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]),
 TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]),
 TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]),
 TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]),
 TaggedDocument(words=['user', 'response', 'time'], tags=[4]),
 TaggedDocument(words=['trees'], tags=[5]),
 TaggedDocument(words=['graph', 'trees'], tags=[6]),
 TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]),
 TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]
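
As a rough illustration of what the model receives, a tagged document is just a token list paired with a list of tags. The sketch below uses a hypothetical stand-in namedtuple called `Tagged` instead of Gensim's own class, so it runs without Gensim installed. The string tags are an assumption for illustration; the cells above only use integer tags.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the (words, tags) shape shown above
Tagged = namedtuple("Tagged", ["words", "tags"])

texts = [["human", "interface", "computer"], ["graph", "trees"]]

# Integer tags, as in the cell above
by_index = [Tagged(doc, [i]) for i, doc in enumerate(texts)]

# Descriptive string tags (an assumption for illustration)
by_name = [Tagged(doc, [f"doc_{i}"]) for i, doc in enumerate(texts)]

print(by_index[1])
print(by_name[0].tags)
```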

## Building a basic Doc2Vec model


In [5]:
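# Passing `documents` to the constructor already builds the vocabulary and trains the model
model = Doc2Vec(documents, vector_size=5, min_count=1, workers=4, epochs=40)
# This explicit call therefore runs a further 40 epochs on top of that initial training
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)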

## What's the vector size? (Should be 5, as specified above)


In [6]:
model.vector_size


5

## How many document vectors did we train?


In [7]:
len(model.dv)


9

## Let's check out the vocabulary information for the model we built


In [8]:
len(model.wv)


12

In [9]:
model.wv.key_to_index

{'system': 0,
 'graph': 1,
 'trees': 2,
 'user': 3,
 'minors': 4,
 'eps': 5,
 'time': 6,
 'response': 7,
 'survey': 8,
 'computer': 9,
 'interface': 10,
 'human': 11}

## Let's infer a vector for a new document using the trained Doc2Vec model


In [10]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[-0.07132846 -0.07293598  0.08171386 -0.01781811  0.08268694]
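
Note that the query contains the token `'for'`, which does not appear in the vocabulary listed earlier. As far as I understand, Gensim skips tokens it did not see during training when inferring a vector, so only the known tokens contribute. The plain-Python sketch below separates the two groups; it does not call Gensim, and the `vocab` set is copied by hand from the `key_to_index` output above.

```python
# Vocabulary as listed by model.wv.key_to_index above
vocab = {"system", "graph", "trees", "user", "minors", "eps", "time",
         "response", "survey", "computer", "interface", "human"}

query = ["user", "interface", "for", "computer"]

# Split the query into tokens the model has seen and tokens it has not
known = [w for w in query if w in vocab]
unknown = [w for w in query if w not in vocab]

print(known)    # ['user', 'interface', 'computer']
print(unknown)  # ['for']
```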


## Building a new model with a different vector size and minimum word count (min_count)


In [11]:
model = Doc2Vec(documents, vector_size=50, min_count=3, epochs=40)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [12]:
len(model.wv.key_to_index)

4

In [13]:
model.wv.key_to_index

{'system': 0, 'graph': 1, 'trees': 2, 'user': 3}
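
The drop from 12 words to 4 can be checked by hand. The sketch below counts word frequencies over the same corpus with `collections.Counter` and keeps only the words that meet a threshold. The helper name `surviving_words` is made up for this illustration, and the filter is a simplified reading of what `min_count` does, not Gensim's actual implementation.

```python
from collections import Counter

corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]

# Total occurrences of each word across all documents
counts = Counter(word for doc in corpus for word in doc)

def surviving_words(min_count):
    """Return the words that occur at least `min_count` times."""
    return sorted(w for w, c in counts.items() if c >= min_count)

print(surviving_words(3))       # ['graph', 'system', 'trees', 'user']
print(len(surviving_words(1)))  # 12
```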

In [14]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[-6.9008539e-03 -7.0250016e-03  8.7858057e-03 -1.2729048e-03
  8.1649534e-03  8.9995340e-03  4.6824482e-03  9.7706914e-03
  6.0688327e-03 -4.4537876e-05  9.1232909e-03 -5.0985296e-03
  6.6826018e-03 -9.4934461e-05 -3.7608044e-03  5.0448021e-04
  1.6480876e-03  1.8674725e-03 -7.7308193e-03  2.3817269e-03
  7.8316070e-03  2.1804986e-03 -7.0076920e-03 -4.8691514e-03
  3.1949650e-03  6.4207390e-03 -5.5920205e-04  7.0257029e-03
  8.7348716e-03  4.3393727e-03  3.6249030e-04  4.8182695e-03
  2.6362271e-03 -7.7965776e-03  5.8811666e-03  5.6095268e-03
 -2.7122928e-03 -1.9781210e-03  5.1145451e-03 -1.1415991e-03
 -8.7102605e-03 -5.3913710e-03 -9.2915287e-03 -7.6096877e-03
  2.2767698e-03  7.8267222e-03  9.1777360e-03  9.2491861e-03
  6.1940067e-03  8.1247650e-03]


## Next, a Doc2Vec model based on the distributed memory approach (dm=1)


In [15]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [16]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[-7.5380853e-03 -7.1503585e-03  8.3977208e-03 -1.2937490e-03
  7.9445299e-03  9.1809472e-03  4.4372561e-03  1.0230061e-02
  5.4555195e-03  2.6790897e-04  9.4502801e-03 -4.5529213e-03
  6.9142780e-03 -4.2968336e-04 -3.4684092e-03 -2.4290899e-05
  1.6607591e-03  2.1289815e-03 -8.3805127e-03  2.2666617e-03
  7.4522351e-03  2.2275292e-03 -7.2887558e-03 -4.6946849e-03
  3.0986574e-03  6.4599975e-03 -9.3866157e-04  6.3878996e-03
  8.8160178e-03  4.0722988e-03  9.2795381e-04  5.2868393e-03
  1.9848894e-03 -7.9225171e-03  5.8491481e-03  6.0383268e-03
 -2.8408356e-03 -2.5716196e-03  5.3137038e-03 -1.2146889e-03
 -9.1439392e-03 -4.9629640e-03 -9.4759008e-03 -8.2118576e-03
  1.9457844e-03  8.1711570e-03  9.4537018e-03  9.2346901e-03
  6.6101081e-03  7.9657305e-03]


## Next, a Doc2Vec model based on the distributed bag of words approach (dm=0)


In [17]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=0)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [18]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[-0.00885652 -0.00737746  0.00779405 -0.00163616  0.00748496  0.00959941
  0.00393247  0.0112919   0.00479682  0.00106699  0.00985868 -0.00359571
  0.00739341 -0.0008992  -0.00273496 -0.00077877  0.00098837  0.00230627
 -0.0090192   0.00248175  0.00682581  0.00227657 -0.00804985 -0.00464084
  0.00282642  0.00668496 -0.00139884  0.00522772  0.00881779  0.00385717
  0.00198561  0.00627777  0.0007406  -0.00755929  0.00562679  0.00711951
 -0.00333192 -0.00325486  0.00564324 -0.00143247 -0.00990125 -0.00412399
 -0.00981668 -0.00903562  0.0008811   0.00901486  0.01003559  0.00939979
  0.00729895  0.00765798]
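
To make the difference between the two approaches more concrete, the sketch below lists the kind of prediction tasks each one sets up for a single document. It is a simplified illustration in plain Python, with hypothetical helper names `dbow_examples` and `dm_examples`; it leaves out sampling, the output layer and everything else Gensim does during real training.

```python
def dbow_examples(tag, words):
    """Distributed bag of words: the document tag alone predicts each word."""
    return [(tag, target) for target in words]

def dm_examples(tag, words, window):
    """Distributed memory: the tag plus nearby words predict the centre word."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        examples.append(((tag, left + right), target))
    return examples

doc = ["eps", "user", "interface", "system"]

print(dbow_examples(2, doc))
for inputs, target in dm_examples(2, doc, window=1):
    print(inputs, "->", target)
```

In the distributed memory case the input depends on word order and neighbourhood, while the bag of words case ignores both.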


## Adding the window size, which controls the maximum distance between the current and predicted word (this cell also sets alpha and min_alpha)

In [19]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, alpha=0.3, min_alpha=0.05)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [20]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[-0.32081848 -0.07034425 -0.19386359 -0.0073753  -0.12657149  0.12652932
 -0.16216983  0.19491796 -0.2673311   0.11826707  0.24494976  0.26303568
  0.12428475 -0.24816737  0.16992673 -0.3462673  -0.07880609  0.1366772
 -0.34724015 -0.0381703  -0.06368154  0.07221834 -0.17794438  0.16144235
 -0.04233294  0.00254776 -0.21198331 -0.2794619   0.06668678 -0.18528673
  0.32997644  0.22898008 -0.34520456 -0.08478537 -0.08134042  0.18606313
 -0.02795762 -0.18381299 -0.01800981 -0.01356451 -0.13271166  0.13744706
 -0.00462981 -0.3213552  -0.06995157  0.1747461   0.19557606 -0.00258067
  0.24483562 -0.16757362]
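
The effect of the window size on which neighbours count as context can be sketched in plain Python. The helper `context` below is hypothetical and treats `window` as a fixed distance; in practice the value is a maximum, and a library may use a smaller effective window for individual words.

```python
def context(words, position, window):
    """Words within `window` positions of words[position], excluding it."""
    start = max(0, position - window)
    return words[start:position] + words[position + 1:position + 1 + window]

doc = ["survey", "user", "computer", "system", "response", "time"]

# Context of 'computer' (position 2) under two window sizes
print(context(doc, 2, window=2))  # ['survey', 'user', 'system', 'response']
print(context(doc, 2, window=5))  # ['survey', 'user', 'system', 'response', 'time']
```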


## Setting the initial learning rate (alpha) and the value it decays to linearly over training (min_alpha)

In [21]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, alpha=0.3, min_alpha=0.05)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [22]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[-0.32081848 -0.07034425 -0.19386359 -0.0073753  -0.12657149  0.12652932
 -0.16216983  0.19491796 -0.2673311   0.11826707  0.24494976  0.26303568
  0.12428475 -0.24816737  0.16992673 -0.3462673  -0.07880609  0.1366772
 -0.34724015 -0.0381703  -0.06368154  0.07221834 -0.17794438  0.16144235
 -0.04233294  0.00254776 -0.21198331 -0.2794619   0.06668678 -0.18528673
  0.32997644  0.22898008 -0.34520456 -0.08478537 -0.08134042  0.18606313
 -0.02795762 -0.18381299 -0.01800981 -0.01356451 -0.13271166  0.13744706
 -0.00462981 -0.3213552  -0.06995157  0.1747461   0.19557606 -0.00258067
  0.24483562 -0.16757362]
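
The linear drop from `alpha=0.3` to `min_alpha=0.05` can be written as a simple interpolation. The sketch below is an approximation under the assumption that the rate is a straight line over training progress; the function name `learning_rate` is made up here, and Gensim adjusts the rate in finer steps than once per epoch.

```python
def learning_rate(progress, alpha=0.3, min_alpha=0.05):
    """Linearly interpolate from alpha (progress=0) to min_alpha (progress=1)."""
    return alpha - (alpha - min_alpha) * progress

epochs = 40

# Approximate rate at a few points during the 40 epochs
for epoch in (0, 10, 20, 30, 40):
    print(epoch, round(learning_rate(epoch / epochs), 4))
```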


## Adding the dm_concat parameter to concatenate the context vectors instead of summing or averaging them


In [23]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, alpha=0.3, min_alpha=0.05, dm_concat=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [24]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[ 0.03773112 -0.20826522 -0.10378091  0.00898944  0.1619937   0.07913642
  0.03239574 -0.25054485  0.01461297  0.16573314 -0.09829282 -0.01521427
 -0.17119169 -0.16681662 -0.01669092 -0.15423249  0.04396598  0.08041818
 -0.05130988 -0.19174676  0.01460071  0.02210508 -0.09894468  0.02346781
 -0.23472956 -0.07796176 -0.27106303 -0.00614524  0.08469789 -0.16458744
  0.17992994 -0.02304962  0.06590129 -0.21521696  0.10241684 -0.14612527
  0.05690332 -0.11589447 -0.00139536  0.03630562  0.19709547 -0.02264826
  0.21506989 -0.22678553  0.05019264 -0.06729796  0.05885966  0.00277757
  0.01487127 -0.14159141]
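
One practical consequence of concatenation is a much larger input to the prediction layer. The toy sketch below compares the two on small hand-written vectors, assuming one document vector plus `2 * window` context word vectors per prediction; all values are invented for illustration.

```python
vector_size = 3   # small size so the vectors are easy to read
window = 2

doc_vec = [0.1, 0.2, 0.3]
context_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]  # 2 * window words

vectors = [doc_vec] + context_vecs

# Sum mode: element-wise combination keeps the input at vector_size
summed = [sum(col) for col in zip(*vectors)]

# Concatenation: vectors laid end to end, so the input grows with the window
concatenated = [x for vec in vectors for x in vec]

print(len(summed))        # 3
print(len(concatenated))  # 15, i.e. (1 + 2 * window) * vector_size
```

Under the same assumption, the settings in the cell above (`vector_size=50`, `window=2`) would give a concatenated input of 250 values instead of 50.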


## Adding the dm_mean parameter to use the mean of the context word vectors (dm_mean=1)


In [25]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, dm_concat=0, dm_mean=1, alpha=0.3, min_alpha=0.05)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [26]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[-0.32081848 -0.07034425 -0.19386359 -0.0073753  -0.12657149  0.12652932
 -0.16216983  0.19491796 -0.2673311   0.11826707  0.24494976  0.26303568
  0.12428475 -0.24816737  0.16992673 -0.3462673  -0.07880609  0.1366772
 -0.34724015 -0.0381703  -0.06368154  0.07221834 -0.17794438  0.16144235
 -0.04233294  0.00254776 -0.21198331 -0.2794619   0.06668678 -0.18528673
  0.32997644  0.22898008 -0.34520456 -0.08478537 -0.08134042  0.18606313
 -0.02795762 -0.18381299 -0.01800981 -0.01356451 -0.13271166  0.13744706
 -0.00462981 -0.3213552  -0.06995157  0.1747461   0.19557606 -0.00258067
  0.24483562 -0.16757362]


## Setting dm_mean=0 to use the sum of the context word vectors


In [27]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, dm_concat=0, dm_mean=0, alpha=0.3, min_alpha=0.05)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [28]:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

[-0.32081848 -0.07034425 -0.19386359 -0.0073753  -0.12657149  0.12652932
 -0.16216983  0.19491796 -0.2673311   0.11826707  0.24494976  0.26303568
  0.12428475 -0.24816737  0.16992673 -0.3462673  -0.07880609  0.1366772
 -0.34724015 -0.0381703  -0.06368154  0.07221834 -0.17794438  0.16144235
 -0.04233294  0.00254776 -0.21198331 -0.2794619   0.06668678 -0.18528673
  0.32997644  0.22898008 -0.34520456 -0.08478537 -0.08134042  0.18606313
 -0.02795762 -0.18381299 -0.01800981 -0.01356451 -0.13271166  0.13744706
 -0.00462981 -0.3213552  -0.06995157  0.1747461   0.19557606 -0.00258067
  0.24483562 -0.16757362]
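
The two combination modes that `dm_mean` switches between differ only by a division by the number of context vectors. The plain-Python sketch below shows both on toy values; the `combine` helper and its `use_mean` flag are hypothetical names for this illustration, not Gensim parameters.

```python
def combine(vectors, use_mean):
    """Combine vectors element-wise by mean (use_mean=True) or by sum."""
    totals = [sum(col) for col in zip(*vectors)]
    if use_mean:
        return [t / len(vectors) for t in totals]
    return totals

context_vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(combine(context_vecs, use_mean=False))  # [9.0, 12.0]
print(combine(context_vecs, use_mean=True))   # [3.0, 4.0]
```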
