# Vector Similarity

Use `doc2vec` to find the Wikipedia articles most similar to each chapter of an ETD (electronic thesis or dissertation).

In [1]:
from gensim.models.doc2vec import Doc2Vec
from nltk import sent_tokenize, pprint, RegexpTokenizer

from experiments.sentence_tokenization import extract_ch_json

[nltk_data] Downloading package punkt to /Users/waingram/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Load an ETD that has been processed by Grobid into TEI XML, and extract the chapters.

In [2]:
docPath = '/Users/waingram/Desktop/gorbid_fulltext/theses/17274/Granstedt_JL_T_2017.tei.xml'
ch_json = extract_ch_json(docPath, 'thesis', 17292)
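
The implementation of `extract_ch_json` is not shown in this notebook. As a rough illustration of the idea only, the sketch below groups the paragraphs of a TEI document by section heading and returns the `chapters` / `title` / `paragraphs` shape that the later cells read. The function name `extract_chapters_sketch`, the assumption that every chapter is a top-level `<div>` with an optional `<head>`, and the inline sample document are all hypothetical; the sketch also ignores the document type and ID arguments that the real function takes.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents, including Grobid output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_chapters_sketch(tei_xml: str) -> dict:
    """Hypothetical stand-in for extract_ch_json: group body paragraphs by section head."""
    root = ET.fromstring(tei_xml)
    chapters = []
    # Assumption: each chapter is a <div> directly under <text>/<body>.
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        # itertext() flattens inline markup such as <ref> into plain text.
        paragraphs = [
            "".join(p.itertext()).strip() for p in div.findall("tei:p", TEI_NS)
        ]
        chapters.append({"title": title, "paragraphs": paragraphs})
    return {"chapters": chapters}


# Tiny made-up TEI document, for illustration only.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>First paragraph.</p><p>Second <ref>one</ref>.</p></div>
    <div><head>Chapter 2 Literature Review</head><p>Prior work.</p></div>
  </body></text>
</TEI>"""

result = extract_chapters_sketch(sample)
for chapter in result["chapters"]:
    print(chapter["title"], len(chapter["paragraphs"]))
```

Real Grobid output often nests sections and includes figures, formulas, and notes, so a production extractor would likely need more cases than this sketch handles.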

Load the saved doc2vec model, which was trained on a Wikipedia dump. If you don't already have a pretrained model, use `doc2vec_train_wiki_model.py` to create one. NOTE: this model took several days to train.

In [4]:
model = Doc2Vec.load('experiments/wikipedia-vectors.d2v')

For each chapter in the ETD, use the model to infer a vector, then use that vector to find the most similar Wikipedia articles.

In [5]:
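# Keep only runs of word characters, which drops punctuation.
tokenizer = RegexpTokenizer(r'\w+')

for chapter in ch_json['chapters']:
    print()
    print(chapter['title'])

    # Split each paragraph into sentences and pool them for the whole chapter.
    sentences = []
    for paragraph in chapter['paragraphs']:
        sentences += sent_tokenize(paragraph)

    # Lowercase the chapter text and tokenize it into a flat list of words.
    doc_words = tokenizer.tokenize(' '.join(sentences).lower())

    # Infer a document vector for the unseen chapter.
    # gensim 3.x API: in gensim 4+ the `steps` argument is named `epochs`.
    vector = model.infer_vector(doc_words, steps=200)

    # Retrieve the ten nearest Wikipedia articles (the default `topn`).
    # gensim 3.x API: in gensim 4+ `model.docvecs` is `model.dv`.
    sims = model.docvecs.most_similar([vector])
    pprint(sims)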


Data Augmentation with Seq2Seq Models



[('Corpora in Translation Studies', 0.578441858291626),
 ('Automatic acquisition of sense-tagged corpora', 0.5154329538345337),
 ('Arbitrary-precision arithmetic', 0.5109789967536926),
 ('Moses for Mere Mortals', 0.5108679533004761),
 ('Artificial imagination', 0.5086458921432495),
 ('Dictionary-based machine translation', 0.5022629499435425),
 ('Paraphrasing (computational linguistics)', 0.5017775297164917),
 ('Time delay neural network', 0.5000360012054443),
 ('Statistical machine translation', 0.49955862760543823),
 ('Natural language understanding', 0.49809378385543823)]

Introduction


[('Paraphrasing (computational linguistics)', 0.5342533588409424),
 ('Statistical machine translation', 0.5028438568115234),
 ('Quantum machine learning', 0.49393802881240845),
 ('Bayesian Program Synthesis', 0.4881432056427002),
 ('Message passing in computer clusters', 0.4806269705295563),
 ('Multi-label classification', 0.47103315591812134),
 ('Domain adaptation', 0.45933797955513),
 ('Conditional random field', 0.45746147632598877),
 ('Europarl Corpus', 0.4558519721031189),
 ('Autoencoder', 0.45458176732063293)]

Chapter 2 Literature Review


[('Paraphrasing (computational linguistics)', 0.5603368282318115),
 ('Statistical machine translation', 0.5517240166664124),
 ('Comparison of different machine translation approaches', 0.5505878329277039),
 ('Automatic acquisition of sense-tagged corpora', 0.5398809313774109),
 ('Automatic summarization', 0.5201674699783325),
 ('Dictionary-based machine translation', 0.5041261315345764),
 ('Message passing in computer clusters', 0.5030425190925598),
 ('Moses (machine translation)', 0.5030009746551514),
 ('Neural machine translation', 0.4965655505657196),
 ('Pivot language', 0.4962734282016754)]

Chapter 3 Approach and Discussion


[('Paraphrasing (computational linguistics)', 0.5562536716461182),
 ('BLEU', 0.5131765604019165),
 ('Plagiarism detection', 0.5048375129699707),
 ('Decompiler', 0.5029244422912598),
 ('Object categorization from image search', 0.5022076368331909),
 ('Weighted Micro Function Points', 0.5009472370147705),
 ('Classic monolingual word-sense disambiguation', 0.4828462302684784),
 ('Automatic acquisition of sense-tagged corpora', 0.48140478134155273),
 ('Latent semantic analysis', 0.47652843594551086),
 ('Variance-based sensitivity analysis', 0.47284334897994995)]

Chapter 4 Conclusions


[('Paraphrasing (computational linguistics)', 0.5084825754165649),
 ('Lazy learning', 0.47415319085121155),
 ('Bias–variance tradeoff', 0.47086840867996216),
 ('PAQ', 0.46527379751205444),
 ('Nati Linial', 0.4645746052265167),
 ('Word2vec', 0.4550856053829193),
 ('BrownBoost', 0.4487757980823517),
 ('Plagiarism detection', 0.44783473014831543),
 ('Reference implementation', 0.44597771763801575),
 ('Kernel perceptron', 0.4430201053619385)]
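
The number paired with each article title above is a similarity score. In gensim, `most_similar` ranks stored document vectors by cosine similarity to the query vector, so scores near 1 mean the two vectors point in nearly the same direction. The following is a minimal pure-Python sketch of that ranking, not gensim's actual implementation; the helper names `cosine_similarity` and `most_similar_sketch` and the toy three-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_sketch(query, doc_vectors, topn=10):
    """Rank tagged vectors by cosine similarity to the query, highest first."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors with hypothetical tags; real doc2vec vectors have many more dimensions.
toy_vectors = {
    "article_a": [1.0, 0.0, 0.0],
    "article_b": [0.0, 1.0, 0.0],
    "article_c": [1.0, 1.0, 0.0],
}

ranking = most_similar_sketch([1.0, 0.0, 0.0], toy_vectors, topn=2)
print(ranking)
```

Because cosine similarity ignores vector length, a short chapter and a long article can still score highly if their inferred vectors point the same way.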
