In [3]:
%matplotlib inline


# Doc2Vec Model


In [4]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


## Prepare the Training and Test Data

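- The training corpus (the Lee Background Corpus bundled with gensim) is described as containing 314 documents selected from the Australian Broadcasting Corporation's news mail service, which provides text emails of headline stories and covers a number of broad topics. The copy loaded below yields 300 documents.
- We'll test our model by eye using the much shorter Lee Corpus, which contains 50 documents.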

In [5]:
import os
import gensim
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

## Define a Function to Read and Preprocess Text

- To train the model, we'll need to associate a tag/number with each document
  of the training corpus. 
- In our case, the tag is simply the zero-based line
  number.




In [10]:
import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            # Tokenize the line and apply basic text cleanup
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags.
                # TaggedDocument bundles the token list with a list of tags,
                # which is how a tagged document is represented when training Doc2Vec.
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])     # [i] is the document's tag, here a unique identifier

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

In [57]:
print(len(train_corpus))
print(train_corpus[:2])

300
[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which

In [56]:
print(len(test_corpus))
print(test_corpus[:2])

50
[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', '

## Training the Model

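- Instantiate a Doc2Vec model with a vector size of 50 dimensions that iterates over the training corpus 40 times. We set the minimum word count to 2 in order to discard words with very few occurrences (here, words that appear only once).
- In practical use (typically hundreds of thousands to millions of documents), a typical iteration count is 10-20.
- This, however, is a very small dataset (300 documents) with fairly short documents (a few hundred words each). Adding training passes can sometimes help with such small datasets.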
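
The effect of the minimum word count can be sketched without gensim. The snippet below is a simplified standard-library illustration, not gensim's actual implementation: it counts tokens across a tiny made-up corpus and keeps only the words that occur at least `min_count` times. The names `prune_vocab` and `toy_corpus` are hypothetical.

```python
from collections import Counter

def prune_vocab(docs, min_count=2):
    """Return {word: count} for words seen at least `min_count` times across all docs."""
    counts = Counter(token for doc in docs for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

# Tiny made-up corpus, for illustration only
toy_corpus = [
    ["fire", "crews", "battle", "bushfire"],
    ["bushfire", "threatens", "homes"],
    ["fire", "service", "warns", "residents"],
]

kept = prune_vocab(toy_corpus, min_count=2)
print(sorted(kept))  # only the words that occur at least twice survive
```

With so few examples, a word seen only once gives the model almost nothing to learn from, which is the motivation for dropping it.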

In [15]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

2024-01-10 10:18:55,200 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/m,d50,n5,w5,mc2,s0.001,t3>', 'datetime': '2024-01-10T10:18:55.200962', 'gensim': '4.3.2', 'python': '3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:07:22) [Clang 14.0.6 ]', 'platform': 'macOS-14.1.2-x86_64-i386-64bit', 'event': 'created'}


In [18]:
print(model)
model

Doc2Vec<dm/m,d50,n5,w5,mc2,s0.001,t3>


<gensim.models.doc2vec.Doc2Vec at 0x1bdc81d50>

Build a vocabulary



In [19]:
model.build_vocab(train_corpus)

2024-01-10 10:19:35,442 : INFO : collecting all words and their counts
2024-01-10 10:19:35,443 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2024-01-10 10:19:35,455 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
2024-01-10 10:19:35,456 : INFO : Creating a fresh vocabulary
2024-01-10 10:19:35,472 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 retains 3955 unique words (56.65% of original 6981, drops 3026)', 'datetime': '2024-01-10T10:19:35.472465', 'gensim': '4.3.2', 'python': '3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:07:22) [Clang 14.0.6 ]', 'platform': 'macOS-14.1.2-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2024-01-10 10:19:35,473 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 55126 word corpus (94.80% of original 58152, drops 3026)', 'datetime': '2024-01-10T10:19:35.473319', 'gensim': '4.3.2', 'python': '3.10.11 | packaged by con

- The list of all unique words is available via ``model.wv.index_to_key``.
- Additional attributes for each word can be retrieved with the ``model.wv.get_vecattr()`` method (see the example below).

In [23]:
print(model.corpus_count)
print(model.epochs)
# Check how many times 'penalty' appears in the training corpus
model.wv.get_vecattr('penalty', 'count')

300
40


4

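- In the usual case, where a BLAS library for optimized bulk vector operations is installed, training on this small corpus of 300 documents (about 60k words) should take only a few seconds.
- Without a BLAS library, training takes 60 to 120 times longer.
- Note: see the relevant documentation for installing and using a BLAS library.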

In [24]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

2024-01-10 10:20:29,736 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 3 workers on 3955 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2024-01-10T10:20:29.736749', 'gensim': '4.3.2', 'python': '3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:07:22) [Clang 14.0.6 ]', 'platform': 'macOS-14.1.2-x86_64-i386-64bit', 'event': 'train'}
2024-01-10 10:20:29,789 : INFO : EPOCH 0: training on 58152 raw words (42747 effective words) took 0.0s, 861707 effective words/s
2024-01-10 10:20:29,838 : INFO : EPOCH 1: training on 58152 raw words (42667 effective words) took 0.0s, 930217 effective words/s
2024-01-10 10:20:29,885 : INFO : EPOCH 2: training on 58152 raw words (42682 effective words) took 0.0s, 939360 effective words/s
2024-01-10 10:20:29,935 : INFO : EPOCH 3: training on 58152 raw words (42637 effective words) took 0.0s, 897420 effective words/s
2024-01-10 10:20:29,998 : INFO : EPOCH 4: training on 581

In [25]:
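# Use the model to infer a vector for a piece of text
# The vector can then be compared with other vectors via cosine similarity
vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)
# Note: the underlying training/inference algorithm is an iterative approximation that uses internal randomization, so repeated inference of the same text returns slightly different vectors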

[-0.19389434 -0.37575087 -0.11150538  0.17023939 -0.19585437 -0.08501209
 -0.02834594  0.06310242 -0.25008348 -0.17818134  0.16944358 -0.08258659
  0.00545225  0.02568397 -0.0194872  -0.05995872  0.15472971  0.16811125
  0.06364274 -0.10676058  0.02367179  0.05606707  0.14834112 -0.03242365
 -0.04256508  0.08342563 -0.23089176  0.07234342 -0.12182193 -0.07851215
  0.45746467  0.14566699  0.21811247  0.13564608  0.12847733  0.20201543
 -0.01480675 -0.2522898  -0.12386173  0.02103999  0.0154345  -0.02884163
 -0.12639856 -0.12402748  0.02785637  0.0171064  -0.03167505 -0.10304504
  0.1724661   0.03696358]
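
The cosine-similarity comparison mentioned in the comments above can be sketched in plain Python. This is a minimal illustration that assumes non-zero vectors of equal length; the three short vectors are made up and merely stand in for inferred document vectors, and `cosine_similarity` is a hypothetical helper rather than a gensim function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for inferred document vectors
v1 = [0.2, -0.1, 0.4]
v2 = [0.4, -0.2, 0.8]    # same direction as v1, twice the length
v3 = [-0.4, 0.2, -0.8]   # opposite direction to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # close to -1: opposite direction
```

Because the measure depends only on direction, two vectors that differ in length but point the same way still score close to 1.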


## Assessing the Model


In [61]:
# Test run: check the first three training documents
for doc_id in range(3):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    print(sims)
    rank = [docid for docid, sim in sims].index(doc_id)
    print(rank)
    print("========")

[(0, 0.9700483679771423), (48, 0.8897426724433899), (255, 0.8708517551422119), (40, 0.8607314825057983), (33, 0.8395884037017822), (8, 0.836052656173706), (272, 0.8317200541496277), (264, 0.7745107412338257), (105, 0.7607932686805725), (198, 0.7241007685661316), (19, 0.7184740304946899), (9, 0.7178652286529541), (113, 0.6899672150611877), (25, 0.6892133355140686), (4, 0.6846137046813965), (10, 0.6803380846977234), (84, 0.6456793546676636), (62, 0.6398455500602722), (189, 0.6387141942977905), (46, 0.6315411329269409), (42, 0.5984219908714294), (144, 0.5978928804397583), (188, 0.5962110161781311), (109, 0.5857176184654236), (172, 0.5791205167770386), (2, 0.5784093141555786), (219, 0.5635756850242615), (77, 0.5616114139556885), (43, 0.5585625767707825), (52, 0.5576309561729431), (180, 0.5564634203910828), (21, 0.5552763342857361), (78, 0.5451232194900513), (89, 0.5450719594955444), (126, 0.543415904045105), (256, 0.5416069626808167), (5, 0.5386147499084473), (15, 0.5374650359153748), (222

In [64]:
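# Infer a new vector for each document of the training corpus, compare it with the trained vectors, and record the document's rank by self-similarity
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)     # position at which this document's own tag first appears
    ranks.append(rank)

    # Keep track of the second-ranked result, to compare against less similar documents
    second_ranks.append(sims[1])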

In [58]:
print(len(model.dv))
print(len(train_corpus))
train_corpus[:2]

300
300


[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', '

In [35]:
second_ranks[:5]

[(48, 0.8941465020179749),
 (143, 0.7156313061714172),
 (21, 0.8416993618011475),
 (57, 0.71437668800354),
 (33, 0.7626050114631653)]

In [37]:
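# Normally each document should be most similar to itself, but this is not always the case
# Example: how many documents have a top match that is not themselves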
[rank for rank in ranks if rank!=0]

[1, 1, 1, 1, 1, 1, 1]

In [38]:
import collections

# Rank of each document within the training corpus
counter = collections.Counter(ranks)
print(counter)

Counter({0: 293, 1: 7})


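- Here 293 of the 300 inferred documents (about 97.7%) are most similar to themselves; the remaining 7 (about 2.3%) are ranked second, behind another document.
- Checking an inferred vector against the training vectors is a "sanity check" that the model behaves in a usefully consistent manner, though it is not a real "accuracy" value.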
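
The self-similarity rank used in this check can be illustrated on hand-made data. The sketch below assumes results shaped like the `(tag, similarity)` pairs printed earlier, sorted best-first; the values in `toy_sims` are invented for illustration and `self_rank` is a hypothetical helper.

```python
from collections import Counter

def self_rank(doc_id, sims):
    """Position of `doc_id` in a list of (tag, similarity) pairs sorted best-first."""
    return [tag for tag, _ in sims].index(doc_id)

# Made-up most_similar-style results for three documents (tags 0, 1, 2)
toy_sims = {
    0: [(0, 0.97), (2, 0.61), (1, 0.12)],
    1: [(2, 0.88), (1, 0.85), (0, 0.10)],  # document 1 is outranked by document 2
    2: [(2, 0.95), (1, 0.80), (0, 0.15)],
}

ranks = [self_rank(doc_id, sims) for doc_id, sims in toy_sims.items()]
counter = Counter(ranks)
self_match_rate = counter[0] / len(ranks)
print(ranks, counter, f"{self_match_rate:.1%}")
```

A rank of 0 means the document retrieved itself first; any other value means at least one other document scored higher.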

In [47]:
# Example 1
# doc_id: 299
# sims: the most_similar list for the last document
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)

for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not v

- Notice above that the most similar document (usually the same text) has a similarity score approaching 1.0.
- However, the similarity score for the second-ranked document should be significantly lower (assuming the documents are in fact different), and the reason becomes obvious when we examine the text itself.
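
The way the MOST, SECOND-MOST, MEDIAN and LEAST entries are picked out of a sorted result list can be shown on a small made-up example. The tags and scores below are invented; the only assumption is that `sims` is already sorted from most to least similar, as in the output above.

```python
# Made-up (tag, similarity) pairs, already sorted from most to least similar
sims = [(7, 0.98), (3, 0.81), (5, 0.40), (1, 0.05), (9, -0.22)]

picks = {
    "MOST": sims[0],
    "SECOND-MOST": sims[1],
    "MEDIAN": sims[len(sims) // 2],
    "LEAST": sims[len(sims) - 1],
}

for label, (tag, score) in picks.items():
    print(f"{label}: tag={tag}, similarity={score}")

# Gap between the top match and the runner-up
gap = picks["MOST"][1] - picks["SECOND-MOST"][1]
print(f"gap between first and second: {gap:.2f}")
```

A large gap between the first and second entries suggests the top match is distinctive; a small gap suggests the two documents cover closely related material.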

In [55]:
# Example 2
# Pick a random document from the corpus and infer a vector from the model
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (76): «the death toll in argentina food riots has risen to local media reports say four more people died this morning in clashes between police and protesters near the presidential palace in the capital buenos aires president fernando de la rua has called on the opposition to take part in government of national unity and apparently will resign if it does not looting and rioting has generally given way to more peaceful demonstrations against the faltering government blamed for month recession heavily armed police using powers under day state of siege decree are attempting to prevent large public gatherings but union leaders say workers and the unemployed will not stop until the government is removed and living standards restored with argentina discredited economy minister now gone the government hopes to approve new budget acceptable to the international monetary fund imf to avoid default on the billion foreign debt the presidents of neighbouring brazil and chile say they

## Testing the Model

Using the same approach as above, we'll infer the vector for a randomly chosen
test document and compare it with the training documents by eye.




In [49]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Test Document (39): «the real level of world inequality and environmental degradation may be far worse than official estimates according to leaked document prepared for the world richest countries and seen by the guardian it includes new estimates that the world lost almost of its forests in the past years that carbon dioxide emissions leading to global warming are expected to rise by in rich countries and in the rest of the world in the next years and that more than more fresh water will be needed by»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/m,d50,n5,w5,mc2,s0.001,t3>:

MOST (171, 0.6015363931655884): «drug education campaigns appear to be paying dividends with new figures showing per cent drop in drug related deaths last year according to the australian bureau of statistics people died from drug related causes in the year that figure is substantial drop from when australians died of drug related causes across the states and territories new south wales recorded the biggest decrea