Exploring Gensim Doc2Vec
Doc2Vec https://radimrehurek.com/gensim/models/doc2vec.html

In [1]:
# Cell 1

# Mounting Google drive
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


In [5]:
# Cell 2

# set project_folder to the path where the documents are located.
from pathlib import Path
project_folder = Path("/content/drive")/"MyDrive"/"562 Project"
articles = project_folder/"Raw Articles"/"no_linebreaks"

In [6]:
# Cell 3

# Make a list of document strings (strings that each contain a whole document)
text = []
for file in articles.glob("*"):
  with open(file,'r') as f:
    text.append(f.readline().replace("- ",""))
    #text.append(list(gensim.utils.tokenize(f.readline(), lowercase=True, deacc=True)))
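
Cell 3 deletes every "- " sequence to rejoin words that were hyphenated at line breaks, which also removes the hyphen from spaced dashes such as "tested - twice". A narrower alternative is sketched below. It assumes a hyphenation artefact always looks like a letter, a hyphen, one space, then a lowercase letter; `join_hyphenated` is a hypothetical helper, not part of the notebook.

```python
import re

# Assumption: an artefact is <letter>- <lowercase letter>, e.g. "implemen- tation".
_HYPHEN_BREAK = re.compile(r"(?<=[A-Za-z])- (?=[a-z])")

def join_hyphenated(line):
    """Rejoin words split as 'implemen- tation' and leave ' - ' dashes alone."""
    return _HYPHEN_BREAK.sub("", line)

sample = "The implemen- tation was tested - twice."
print(join_hyphenated(sample))  # The implementation was tested - twice.
```

This pattern still merges suspended hyphens such as "pre- and post-", so it is a heuristic rather than a complete fix.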

In [8]:
# Cell 4

# the shape of our data

print(f"There are {len(text)} documents.")
print(text[0])

There are 85 documents.
Investigating User Risk Attitudes in Navigation Systems to Support People with Mobility Impairments Sadia Azmin Anisha School of Information Technology, Monash University Malaysia saani2@student.monash.edu ABSTRACT This paper investigates the impact of visualizing the risk of encountering potential accessibility barriers on the route planning behaviour of pedestrians with mobility impairments. Using a prototype system, we explored the relationship between the risk of facing possible accessibility barriers and the navigation planning behaviour of the mobility impaired users. We found that mobility impaired users had a very strong inclination towards longer but accessible barrier-free routes instead of shorter potentially inaccessible routes (being willing to travel over 900 metres to avoid barriers), suggesting a degree of risk aversion that goes beyond the literature. However, we have also observed users' varying risk attitudes towards obstacles based on the typ

Training a Doc2Vec model by following the gensim tutorial notebook run_doc2vec_lee.ipynb.


In [10]:
# Cell 5

%matplotlib inline

In [11]:
# Cell 6

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Preparing the training and test data:

In [12]:
# Cell 7

import os
import gensim
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

Define a function to read and preprocess text

In [13]:
# Cell 8

import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

In [14]:
# Cell 9

print(train_corpus[:2])

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', '

In [15]:
# Cell 10

print(test_corpus[:2])


[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to'

Training the model

In [16]:
# Cell 11

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_corpus)

# model.wv.index_to_key holds the list of unique words in the vocabulary.
print(len(model.wv.index_to_key))
print(model.wv.index_to_key)

3955


The number of times a word occurs in the training corpus can be read with get_vecattr.

In [17]:
# Cell 12

print(f"Word 'penalty' appeared {model.wv.get_vecattr('penalty', 'count')} times in the training corpus.")

Word 'penalty' appeared 4 times in the training corpus.
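
The vocabulary size and the per-word counts both follow from the `min_count=2` setting in Cell 11, which ignores words whose total frequency is below that threshold. The pure-Python sketch below imitates that pruning on a made-up toy corpus. `build_vocab_counts` is a hypothetical helper and is not how gensim implements it internally.

```python
from collections import Counter

def build_vocab_counts(tokenized_docs, min_count=2):
    """Count tokens across all documents and keep those seen at least min_count times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

# Made-up toy corpus, for illustration only
toy_corpus = [
    ["death", "penalty", "debate"],
    ["penalty", "shootout"],
    ["debate", "continues"],
]
vocab = build_vocab_counts(toy_corpus, min_count=2)
print(sorted(vocab.items()))  # [('debate', 2), ('penalty', 2)]
```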


Train the model on the corpus

In [18]:
# Cell 13

model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Now infer a vector with the trained model

In [19]:
# Cell 14

vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)

[-0.14963728 -0.3321976  -0.15831946  0.10868703  0.05946086 -0.08006481
  0.00757076  0.01149584 -0.25280604 -0.2082983   0.19445609 -0.00560647
 -0.04450019 -0.04812901 -0.19132477 -0.12211814  0.08125138  0.28469113
  0.12103054 -0.00515204 -0.09060451 -0.01991914 -0.01806591 -0.03998234
  0.00629603 -0.04313989 -0.29113021  0.00386415 -0.07145585 -0.03804312
  0.35477683 -0.05390288  0.06938466  0.15971725  0.16949688  0.18493302
 -0.00109425 -0.3536889  -0.19943851  0.05203699  0.03347838  0.05299618
 -0.02132636 -0.01931833  0.16076025  0.10556723 -0.06722038 -0.07335476
  0.06606483 -0.02849324]
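
An inferred vector like this one is compared against the stored document vectors by cosine similarity, which is the score `most_similar` reports. A minimal pure-Python sketch of that measure follows. The function name and the short example vectors are illustrative assumptions, not values from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because only direction matters, scaling a vector does not change its similarity to another vector.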


# Here is the secret sauce

* Feed the model a list of string tokens for each document

In [20]:
# Cell 15

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])
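
The `ranks` list built in this loop records where each training document lands in its own similarity ranking, so a rank of 0 means the document found itself as most similar. One way to summarize such a list is sketched below. `summarize_ranks` is a hypothetical helper and the rank values are made up, not results from this model.

```python
from collections import Counter

def summarize_ranks(ranks):
    """Return a Counter of rank values and the fraction of documents ranked 0."""
    counts = Counter(ranks)
    self_top = counts[0] / len(ranks) if ranks else 0.0
    return counts, self_top

# Made-up ranks for six documents, for illustration only
example_ranks = [0, 0, 1, 0, 2, 0]
counts, self_top = summarize_ranks(example_ranks)
print(sorted(counts.items()))  # [(0, 4), (1, 1), (2, 1)]
print(f"{self_top:.0%} of documents are most similar to themselves")
```

A fraction close to 1 suggests the model is at least self-consistent on its training data; it says nothing about quality on unseen documents.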

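Running the model on a training example and printing its most similar, second-most similar, median, and least similar documents. The example should find itself as the most similar. The next cell reuses doc_id and sims from the final iteration of the loop above.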

In [21]:
# Cell 16

print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not v

Selecting a random document from the training set and printing its second-most-similar document

In [22]:
# Cell 17

# Pick a random document from the corpus and infer a vector from the model
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))


Similar Document (134, 0.6026414036750793): «israel has reacted with caution to promise from palestinian leader yasser arafat to hunt down suicide bombers and end armed attacks against israeli targets mr arafat made the commitments during speech broadcast on palestinian television the palestinian leader said israel was using suicide attacks as pretext for waging war on palestinians and that such operations were therefore against palestinian national interests israel will be looking to see whether mr arafat is offering anything more than words he has promised to round up suicide bombers before but very few in fact have been arrested mr arafat said peace was the only way of resolving the conflict and that the changed world situation since the attacks in the united states on september had to be taken into account the united states government says it is keenly watching to see whether mr arafat actions match his words the white house says it will continue to engage in the peace process des

# Test the model

Pick a document from the test set and find which training document is closest.
* Modified to compare one of Christine's documents to the training set
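
Cell 3 keeps only the first line of each file, so a file that starts with a blank line ends up as an empty string, and a vector inferred from zero tokens says nothing about the document. A quick check for such entries is sketched below. `rough_tokens` is a stand-in tokenizer assumed for illustration and is not equivalent to `gensim.utils.tokenize`.

```python
import re

def rough_tokens(doc):
    """Stand-in tokenizer (assumption): lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", doc.lower())

def empty_doc_ids(docs):
    """Indices of documents that produce no tokens at all."""
    return [i for i, doc in enumerate(docs) if not rough_tokens(doc)]

# Made-up examples: a normal title, an empty string, whitespace, digits only
sample_docs = ["Investigating User Risk Attitudes", "", "   \n", "42"]
print(empty_doc_ids(sample_docs))  # [1, 2, 3]
```

Running a check like this over `text` before choosing `doc_id` would flag documents that need to be re-read or skipped.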


In [23]:
# Cell 18

# Pick a document and infer a vector from the model
#doc_id = random.randint(0, len(test_corpus) - 1)
#doc_id = random.randint(0, len(text) - 1)  # pick one of Christine's documents
doc_id = 32
#inferred_vector = model.infer_vector(test_corpus[doc_id])
# infer_vector expects a list of string tokens, so materialize the tokenize() generator once
tokens = list(gensim.utils.tokenize(text[doc_id], lowercase=True, deacc=True))  # use Christine's document
inferred_vector = model.infer_vector(tokens)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
#print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(text[doc_id])  # print Christine's document
print(tokens)  # print the tokenized version of Christine's document
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))



[]
SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/m,d50,n5,w5,mc2,s0.001,t3>:

MOST (87, 0.23104780912399292): «the australian transport safety bureau has called for pilots to be better trained on the risks of air turbulence it is response to helicopter crash last august which claimed the life of media personality shirley strachan mr strachan was on solo navigation training flight on august when he crashed into mt archer on queensland sunshine coast witnesses told of seeing mr strachan apparently struggling to control his aircraft just prior to the crash safety bureau director alan stray says the helicopter was struck by severe air turbulence phenomena known as mountain wave it caused one of the helicopter rotors to flap and strike the tail boom while reluctant to attribute blame mr stray says mountain waves are not uncommon and mr strachan could have been better advised of local weather conditions prior to the flight he says the accident is wake up call to flight trainers to ensure stu