### Project Kojak

**The problem**

We attempt to identify ambiguously defined words, specifically homographs (words spelled the same but with multiple meanings), and to determine the intended meaning of each occurrence from its context window.

We do this in four stages:
1. Train a word embedding on a training corpus using skip-gram (here we use 1000 scholarly research papers).
2. Identify common homographs and extract the context windows in which they occur.
3. Interpret the context windows as vectors in the embedding space and apply a clustering algorithm (DBSCAN). Each cluster is interpreted as a distinct definition of the homograph and is summarized by a representative vector.
4. Apply to a test corpus: match the context of a given homograph to the most similar cluster.
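
Stages 2 and 4 can be illustrated with a toy sketch in plain Python. Everything in it is an assumption made for illustration: the helper names (`context_windows`, `mean_vector`, `cosine`, `closest_sense`), the two-dimensional toy embedding, and the hand-picked sense vectors that stand in for the cluster representatives DBSCAN would produce in stage 3.

```python
import math

def context_windows(tokens, target, window=2):
    """Return the words within `window` positions of each occurrence of `target`."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            windows.append(left + right)
    return windows

def mean_vector(words, embedding):
    """Average the vectors of the words that are present in the embedding."""
    vecs = [embedding[w] for w in words if w in embedding]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_sense(context_vec, sense_vectors):
    """Name of the sense whose representative vector is most similar to the context."""
    return max(sense_vectors, key=lambda name: cosine(context_vec, sense_vectors[name]))

# Toy data (hypothetical): a 2-d embedding and one representative vector per sense
toy_embedding = {
    "river": [1.0, 0.0], "water": [0.9, 0.1],
    "money": [0.0, 1.0], "loan": [0.1, 0.9],
}
sense_vectors = {"bank/river": [1.0, 0.0], "bank/finance": [0.0, 1.0]}

tokens = "the river bank held water".split()
windows = context_windows(tokens, "bank", window=2)
context_vec = mean_vector(windows[0], toy_embedding)
best = closest_sense(context_vec, sense_vectors)
print(windows[0], best)
```

Averaging the word vectors in a window is only one possible way to turn a context into a single vector; the notebook text does not say which aggregation the project uses.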


### This notebook

This notebook takes a set of training data, trains a word embedding model, and saves it for further use.

In [1]:
import gensim
import json
import os
import re
import time
from nltk.corpus import stopwords
from nltk import tokenize
from nltk import pos_tag
from pprint import pprint



Using Theano backend.


In [2]:
# Declare stopwords and define the corpus reader.
# MyPapers takes the path to a JSON file, loads the text of each paper,
# and preprocesses it into a stream of tokenized "sentences".

stop = stopwords.words('english')
stop += ['?', '!', ':', ';', '[', ']', '[]', '“', '&']
stop += ['.', ',', '(', ')', "'", '"', "''", '""', "``", '”', '“', '?', '!', '’', 'et', 'al', 'al.']
stop = set(stop)

class MyPapers(object):
    # A re-iterable corpus: sentences are yielded one at a time, so the
    # tokenized corpus is never held in memory (the JSON file itself is
    # loaded in full on each pass).
    def __init__(self, dirname):
        # despite the name, dirname is the path to a single JSON file
        self.dirname = dirname

    def __iter__(self):
        with open(self.dirname) as data_file:
            data = json.load(data_file)
        # iterate through all papers in the JSON file
        for paper in data:
            sentences = tokenize.sent_tokenize(paper['full_text'])
            for sentence in sentences:
                try:
                    line = re.sub(r'[?\.,!:;\(\)“\[\]]', ' ', sentence)
                    line = [word for word in line.lower().split() if word not in stop]
                    yield line
                # Exception rather than a bare except: a bare except around
                # a yield would also swallow GeneratorExit
                except Exception:
                    print("Empty line found")
                    continue
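
To see what the regular expression and stop-word filter in `MyPapers.__iter__` do to a single sentence, here is a small standalone sketch. It reuses the same pattern, but `clean_sentence` and the tiny `toy_stop` set are hypothetical stand-ins (the notebook uses the full NLTK English stop list), so it runs without NLTK.

```python
import re

# Hypothetical, much smaller stop set than the NLTK list used in the notebook
toy_stop = {"the", "of", "is", "et", "al"}

def clean_sentence(sentence, stop_words):
    """Replace the listed punctuation with spaces, lowercase, split, drop stop words."""
    line = re.sub(r'[?\.,!:;\(\)“\[\]]', ' ', sentence)
    return [word for word in line.lower().split() if word not in stop_words]

tokens = clean_sentence("The bank of the river (see Smith et al., 2010) is eroding.", toy_stop)
print(tokens)

# Hyphens are not in the pattern, so hyphenated words survive as single tokens
print(clean_sentence("State-of-the-art results", toy_stop))
```

Because hyphens and symbols such as `%`, `<` and `>` are not in the pattern, tokens like `anti-social`, `%` and `>` can reach the phrase detector, which is consistent with the bigram output further down.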
                

## Word embeddings

In [3]:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.cluster import AgglomerativeClustering
from functools import reduce

In [4]:
#Instantiate iterable on the data

# papers is an iterable of scholarly papers, tokenized for processing
papers = MyPapers('data/train_data.json') 



In [5]:
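# Initialize and train the model: skip-gram (sg=1), 200-dimensional vectors,
# context window of 6, keeping every token (min_count=1).
# Note: `size` is the gensim < 4.0 parameter name; gensim 4.x renamed it to `vector_size`.

model = gensim.models.word2vec.Word2Vec(sentences=papers, size=200, window=6, min_count=1, workers=4, sg=1)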
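
With `sg=1` the model is trained with skip-gram: each word is used to predict the words that appear within `window` positions of it. The sketch below enumerates those (center, context) pairs for a toy sentence. The function name `skipgram_pairs` is hypothetical, and the sketch deliberately ignores details of the real training procedure such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """List every (center, context) pair within `window` positions in one sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["fixed", "point", "theorem"], window=1)
for center, context in pairs:
    print(center, "->", context)
```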

In [6]:
model.corpus_count

182533

## Identifying n-grams

In [7]:
phrases = gensim.models.phrases.Phrases(sentences = papers, min_count = 5, threshold = 150)
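
The scores that `threshold` is compared against come, by default, from gensim's `original_scorer`, which to my understanding computes `(count(a b) - min_count) / count(a) / count(b) * vocab_size`. The sketch below reproduces that arithmetic in plain Python; the function name `phrase_score` and all counts are made up for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default-style phrase score: high when a and b co-occur more than their
    individual frequencies would suggest."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# Two fairly rare words that almost always appear together: high score
rare_pair = phrase_score(count_a=256, count_b=512, count_ab=69,
                         vocab_size=327680, min_count=5)

# Two very common words with the same number of co-occurrences: low score
common_pair = phrase_score(count_a=20000, count_b=30000, count_ab=69,
                           vocab_size=327680, min_count=5)

threshold = 150
print(rare_pair, rare_pair > threshold)
print(common_pair, common_pair > threshold)
```

A pair is accepted as a phrase only if its score exceeds `threshold`, which is why the listing that follows is used to eyeball where a sensible cut-off lies.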

In [8]:
# Print candidate phrase pairs together with their scores.
# Used to choose an appropriate threshold for defining bigrams.
for phrase, score in phrases.export_phrases(papers):
    if score < 200:
        print(u'{0}   {1}'.format(phrase, score))

b'written consent'   188.30541263634493
b'written consent'   188.30541263634493
b'ministry education'   175.08865556367698
b'answer questions'   170.4226168901776
b'take place'   180.11889362873262
b'take place'   180.11889362873262
b'academic excellence'   188.5479367369018
b'resonance frequency'   153.04258064516128
b'numerical simulations'   176.30489326099084
b'resonance frequency'   153.04258064516128
b'dating back'   163.62231759656652
b'developing countries'   156.92637696121488
b'point view'   167.9818430638322
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'point view'   167.9818430638322
b'point view'   167.9818430638322
b'allows us'   152.11897949074066
b'cell membrane'   160.14281963770017
b'stochastic differential'   158.7716587210258
b'stochastic differential'   158.7716587210258
b'molecular biology'   171.41462930820728
b'limit cycle'   162.60236220472441
b'frequency bands'   160.52210526315787
b'limit cycle'   162.60236220472441
b'limit cycle' 

b'mutually beneficial'   183.28846153846152
b'epistemic communities'   165.87672226250908
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'section concludes'   196.8254696687222
b'early 1990s'   185.7215929850201
b'north american'   170.2280930232558
b'early 1990s'   185.7215929850201
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'human capital'   174.07865898206023
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'human capital'   174.07865898206023
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'developing countries'   156.92637696121488
b'financial crises'   164.12125560538115
b'human capita

b'unclear whether'   150.95759533916512
b'machine learning'   150.03541912632824
b'social media'   188.84761866092816
b'social media'   188.84761866092816
b'changed dramatically'   181.2192513368984
b'take advantage'   159.8067555122736
b'social media'   188.84761866092816
b'social media'   188.84761866092816
b'going forward'   163.76288659793815
b'tell us'   198.1039260969977
b'written english'   151.97321215020332
b'written english'   151.97321215020332
b'tell us'   198.1039260969977
b'jihadist language'   183.45383458646614
b'jihadist language'   183.45383458646614
b'explanatory power'   175.90093338990243
b'allows us'   152.11897949074066
b'jihadist language'   183.45383458646614
b'allows us'   152.11897949074066
b'point view'   167.9818430638322
b'jihadist language'   183.45383458646614
b'jihadist language'   183.45383458646614
b'forum posts'   190.52077043206663
b'jihadist language'   183.45383458646614
b'jihadist language'   183.45383458646614
b'allows us'   152.11897949074066
b

b'class ii'   191.53678283031803
b'orthodontic treatment'   191.7541556768693
b'class ii'   191.53678283031803
b'class ii'   191.53678283031803
b'class ii'   191.53678283031803
b'treated surgically'   176.5
b'treated surgically'   176.5
b'put forth'   163.242818911686
b'class ii'   191.53678283031803
b'class ii'   191.53678283031803
b'class ii'   191.53678283031803
b'class ii'   191.53678283031803
b'treated surgically'   176.5
b'class ii'   191.53678283031803
b'missing teeth'   169.90747028862478
b'orthodontic treatment'   191.7541556768693
b'orthodontic treatment'   191.7541556768693
b'missing teeth'   169.90747028862478
b'missing teeth'   169.90747028862478
b'orthodontic treatment'   191.7541556768693
b'orthodontic treatment'   191.7541556768693
b'orthodontic treatment'   191.7541556768693
b'orthodontic treatment'   191.7541556768693
b'orthodontic treatment'   191.7541556768693
b'orthodontic treatment'   191.7541556768693
b'class ii'   191.53678283031803
b'orthodontic treatment'   19

b'active ingredients'   177.16642462968343
b'per year'   170.26470350468037
b'charged particles'   179.6889238020424
b'easily accessible'   190.24875070175563
b'electric field'   182.72720010651045
b'transmission electron'   193.68084913362966
b'0 02'   190.1989866181629
b'0 03'   153.34793296089384
b'electric field'   182.72720010651045
b'spectra recorded'   156.23256211047553
b'time consuming'   187.2188002473717
b'grain size'   170.8400691152906
b'transmission electron'   193.68084913362966
b'< <'   198.38016528925618
b'intermediate products'   156.71969245213597
b'drug delivery'   192.64723643303202
b'blood circulation'   159.2370344587539
b'drug delivery'   192.64723643303202
b'drug delivery'   192.64723643303202
b'250 \xc2\xb0c'   164.17033492822966
b'figure shows'   183.8472613458529
b'250 \xc2\xb0c'   164.17033492822966
b'north island'   186.18697674418604
b'north island'   186.18697674418604
b'north island'   186.18697674418604
b'north island'   186.18697674418604
b'north isla

b'tell us'   198.1039260969977
b'partial metric'   169.69663998021701
b'past decades'   183.83512224742378
b'necessarily imply'   188.2989538672612
b'social media'   188.84761866092816
b'social media'   188.84761866092816
b'social media'   188.84761866092816
b'70 %'   155.423135722779
b'social media'   188.84761866092816
b'social media'   188.84761866092816
b'social media'   188.84761866092816
b'attitudes toward'   179.99661994500704
b'cil score'   197.42712266695435
b'cil score'   197.42712266695435
b'cil score'   197.42712266695435
b'cil score'   197.42712266695435
b'cil score'   197.42712266695435
b'cil score'   197.42712266695435
b'social media'   188.84761866092816
b'social media'   188.84761866092816
b'point view'   167.9818430638322
b'takes account'   196.82107379000263
b'past decades'   183.83512224742378
b'person clustering'   171.13207975685515
b'person clustering'   171.13207975685515
b'person clustering'   171.13207975685515
b'person clustering'   171.13207975685515
b'perso

b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'challenges facing'   192.72242647058823
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitarian workers'   150.90795696450147
b'humanitaria

b'anti-social behavior'   158.51975051975052
b'take place'   180.11889362873262
b'anti-social behavior'   158.51975051975052
b'youth unemployment'   159.2120969566201
b'unemployment rate'   185.93339844303253
b'developing countries'   156.92637696121488
b'anti-social behavior'   158.51975051975052
b'anti-social behavior'   158.51975051975052
b'anti-social behavior'   158.51975051975052
b'allows us'   152.11897949074066
b'allows us'   152.11897949074066
b'developing countries'   156.92637696121488
b'positively negatively'   192.02590393955748
b'section concludes'   196.8254696687222
b'remittances sent'   165.03896103896105
b'allows us'   152.11897949074066
b'labor markets'   154.962486239309
b'wage elasticity'   198.14968814968813
b'labor supply'   154.7255522282896
b'wage elasticity'   198.14968814968813
b'wage elasticity'   198.14968814968813
b'labor supply'   154.7255522282896
b'labor supply'   154.7255522282896
b'wage elasticity'   198.14968814968813
b'wage elasticity'   198.1496881

b'employment websites'   180.96081941957777
b'unemployment rate'   185.93339844303253
b'labor markets'   154.962486239309
b'labor markets'   154.962486239309
b'labor markets'   154.962486239309
b'human capital'   174.07865898206023
b'human capital'   174.07865898206023
b'human capital'   174.07865898206023
b'ministry education'   175.08865556367698
b'educational attainments'   184.47096774193548
b'high-income countries'   185.58221609232703
b'high-income countries'   185.58221609232703
b'unemployed workers'   166.33507853403142
b'decide whether'   158.83763835429477
b'human capital'   174.07865898206023
b'high-income countries'   185.58221609232703
b'point view'   167.9818430638322
b'high-income countries'   185.58221609232703
b'section concludes'   196.8254696687222
b'notoriously difficult'   182.03043867502237
b'educational attainments'   184.47096774193548
b'questions asked'   165.6882095088232
b'educational attainments'   184.47096774193548
b'human capital'   174.07865898206023
b'h

b'regularization term'   162.0487753041815
b'disparity maps'   195.34768530050172
b'decide whether'   158.83763835429477
b'time consuming'   187.2188002473717
b'machine learning'   150.03541912632824
b'decide whether'   158.83763835429477
b'classifier trained'   198.86459465333624
b'without sacrificing'   150.69084920226456
b'pip joint'   176.7557229788467
b'pip joint'   176.7557229788467
b'pip joint'   176.7557229788467
b'pip joint'   176.7557229788467
b'pip joint'   176.7557229788467
b'pip joint'   176.7557229788467
b'section concludes'   196.8254696687222
b'without sacrificing'   150.69084920226456
b'half 20th'   164.10653753026634
b'per year'   170.26470350468037
b'closer inspection'   158.54721885288512
b'questions asked'   165.6882095088232
b'moving away'   192.1713835652402
b'point clouds'   196.57392166584037
b'point clouds'   196.57392166584037
b'point clouds'   196.57392166584037
b'time consuming'   187.2188002473717
b'point clouds'   196.57392166584037
b'heat transfer'   188

b'fog computing'   151.88844621513942
b'fog computing'   151.88844621513942
b'fog computing'   151.88844621513942
b'internal external'   171.9420964574089
b'internal external'   171.9420964574089
b'fog computing'   151.88844621513942
b'fog computing'   151.88844621513942
b'802 15'   169.86466165413532
b'802 15'   169.86466165413532
b'virtual machine'   169.06430155210643
b'relational database'   198.26991418913065
b'answer questions'   170.4226168901776
b'attitude towards'   168.44000216579133
b'mobile devices'   191.35352183369585
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'centers disease'   153.15541916585298
b'comorbid substance' 

b'fish farms'   163.68085867620752
b'temperate regions'   177.88976377952756
b'agency epa'   159.05710560625815
b'cost savings'   182.1249594329703
b'past decades'   183.83512224742378
b'developing countries'   156.92637696121488
b'challenges facing'   192.72242647058823
b'play pivotal'   153.48941761959347
b'90 %'   158.48350689031614
b'take advantage'   159.8067555122736
b'90 %'   158.48350689031614
b'fuel consumption'   171.68979991433642
b'cost savings'   182.1249594329703
b'per hour'   167.31244666585394
b'per year'   170.26470350468037
b'developing countries'   156.92637696121488
b'water scarcity'   157.6331518227965
b'developing countries'   156.92637696121488
b'per day'   153.90155403647375
b'fuel consumption'   171.68979991433642
b'fuel consumption'   171.68979991433642
b'fuel consumption'   171.68979991433642
b'fuel consumption'   171.68979991433642
b'fuel consumption'   171.68979991433642
b'fuel consumption'   171.68979991433642
b'fuel consumption'   171.68979991433642
b'per

b'bitcoin price'   155.34397283531408
b'bitcoin price'   155.34397283531408
b'bitcoin price'   155.34397283531408
b'point view'   167.9818430638322
b'explanatory power'   175.90093338990243
b'ranked top'   153.3939940065802
b'machine learning'   150.03541912632824
b'machine learning'   150.03541912632824
b'machine learning'   150.03541912632824
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'fixed point'   191.92128446664887
b'financial crises'   164.12125560538115
b'financial crises'   164.12125560538115
b'stock markets'   153.569565014686
b'stock markets'   153.569565014686
b'financial crises'   164.12125560538115
b'financial crises'   164.12125560538115
b'financial crises'   164.12125560538115
b'financial crises'   164.12125560538115
b'stock mar

b'time consuming'   187.2188002473717
b'voice stream'   188.73267326732673
b'section concludes'   196.8254696687222
b'voice stream'   188.73267326732673
b'voice stream'   188.73267326732673
b'voice stream'   188.73267326732673
b'point view'   167.9818430638322
b'voice stream'   188.73267326732673
b'voice stream'   188.73267326732673
b'voice stream'   188.73267326732673
b'voice stream'   188.73267326732673
b'voice stream'   188.73267326732673
b'best suited'   185.3702564102564
b'artificial landmarks'   182.41148325358853
b'artificial landmarks'   182.41148325358853
b'computational complexity'   155.39951729686243
b'high-performance computing'   190.60510897586127
b'computational complexity'   155.39951729686243
b'artificial landmarks'   182.41148325358853
b'artificial landmarks'   182.41148325358853
b'artificial landmarks'   182.41148325358853
b'artificial landmarks'   182.41148325358853
b'outdoor environments'   158.73976405274115
b'remain unchanged'   156.16589861751152
b'microsoft of

b'lipid profile'   193.49361046494062
b'lipid profile'   193.49361046494062
b'0 06'   181.37927554514326
b'unclear whether'   150.95759533916512
b'glucose concentration'   158.92640692640694
b'relative abundance'   157.60626370184778
b'machine learning'   150.03541912632824
b'chest wall'   162.20821699242123
b'healthy subjects'   150.98115573743442
b'healthy subjects'   150.98115573743442
b'machine learning'   150.03541912632824
b'time consuming'   187.2188002473717
b'machine learning'   150.03541912632824
b'machine learning'   150.03541912632824
b'machine learning'   150.03541912632824
b'society iri'   160.28108108108108
b'society iri'   160.28108108108108
b'rapidly evolving'   178.20157756354075
b'humanitarian operations'   160.074528456618
b'mitigation adaptation'   196.45217391304348
b'best suited'   185.3702564102564
b'software engineering'   150.65951249425112
b'society iri'   160.28108108108108
b'guiding principles'   189.20099255583128
b'guiding principles'   189.20099255583128

b'career choices'   151.883403605458
b'career choices'   151.883403605458
b'total phenolic'   178.52319399053704
b'5 mg/ml'   152.11571072319202
b'total phenolic'   178.52319399053704
b'flavonoid content'   192.79564530289727
b'flavonoid content'   192.79564530289727
b'body weight'   164.2060628188983
b'body weight'   164.2060628188983
b'iii iv'   157.4423126559408
b'body weight'   164.2060628188983
b'40 mg'   155.70349193383703
b'flavonoid content'   192.79564530289727
b'acute toxicity'   158.97880227035793
b'body weight'   164.2060628188983
b'acute toxicity'   158.97880227035793
b'body weight'   164.2060628188983
b'oxygen species'   155.37035150280184
b'antioxidant activity'   178.8184480985384
b'antioxidant activity'   178.8184480985384
b'body weight'   164.2060628188983
b'acute toxicity'   158.97880227035793
b'body weight'   164.2060628188983
b'acute toxicity'   158.97880227035793
b'body weight'   164.2060628188983
b'middle school'   177.19338239362585
b'foreign language'   172.869

b'foreign language'   172.86995951417003
b'foreign language'   172.86995951417003
b'take place'   180.11889362873262
b'play pivotal'   153.48941761959347
b'play pivotal'   153.48941761959347
b'foreign language'   172.86995951417003
b'inside outside'   180.40192236598892
b'internal external'   171.9420964574089
b'foreign language'   172.86995951417003
b'foreign language'   172.86995951417003
b'foreign language'   172.86995951417003
b'language proficiency'   189.56896240601503
b'foreign language'   172.86995951417003
b'agreed participate'   157.18986399427345
b'foreign language'   172.86995951417003
b'> 0'   158.55418994413407
b'> 0'   158.55418994413407
b'mechanical properties'   183.76418242491658
b'molecular weight'   166.32814179164328
b'contact angle'   180.29281108714926
b'outdoor environments'   158.73976405274115
b'70 %'   155.423135722779
b'outdoor environments'   158.73976405274115
b'modulus elasticity'   187.34152334152336
b'250 \xc2\xb0c'   164.17033492822966
b'mechanical pro

b'allows us'   152.11897949074066
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'takes account'   196.82107379000263
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'lniso method'   157.86335403726707
b'class ii'   191.53678283031803
b'molecular weight'   166.32814179164328
b'liquid dosage'   168.54266635965922
b'liquid dosage'   168.54266635965922
b'american academy'   179.4070588235294
b'unclear whether'   150.95759533916512
b'liquid dosage'   168.54266635965922
b'poorly soluble'   197.02325581395348
b'drug delivery'   192.64723643303202
b'drug delivery'   192.64723643303202
b'liquid dosage'   168.5426

In [9]:
bigram = gensim.models.phrases.Phraser(phrases)
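
Indexing the corpus with the phraser, as in `bigram[papers]`, rewrites each sentence so that accepted pairs become single tokens joined by `_` (for example `squamous_cell` in the later output). The following is a minimal sketch of that idea, not gensim's implementation; `merge_pairs` and the sets of accepted pairs are hypothetical.

```python
def merge_pairs(tokens, accepted, delimiter="_"):
    """Greedily join adjacent tokens whose pair is in `accepted`."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in accepted:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical accepted pairs for the first and second pass
bigram_pairs = {("new", "york"), ("machine", "learning")}
trigram_pairs = {("new_york", "city")}

sentence = ["machine", "learning", "groups", "new", "york", "city"]
first_pass = merge_pairs(sentence, bigram_pairs)
second_pass = merge_pairs(first_pass, trigram_pairs)
print(first_pass)
print(second_pass)
```

Running the merge a second time over already-merged sentences is what allows three-word (and occasionally four-word) phrases to form, mirroring the `trigram[bigram[papers]]` pattern used later in the notebook.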

In [10]:
model.iter

5

In [11]:
# Add the new bigram tokens to the vocabulary, then continue training
# on the bigram-transformed corpus.
model.build_vocab(sentences=bigram[papers], update=True)
model.train(sentences=bigram[papers], total_examples=model.corpus_count, epochs=model.iter)

2763724

In [12]:
phrases2 = gensim.models.phrases.Phrases(sentences = bigram[papers], min_count = 5, threshold = 300)

In [13]:
trigram = gensim.models.phrases.Phraser(phrases2)

In [14]:
# Print candidate phrase pairs together with their scores.
# Used to choose an appropriate threshold for defining trigrams.

for phrase, score in phrases2.export_phrases(bigram[papers]):
    if score >= 200:
        print(u'{0}   {1}'.format(phrase, score))

b'squamous_cell carcinoma'   4444.444976076555
b'partial_differential equations'   1914.517331834364
b'rio_de janeiro'   60579.717391304344
b'carbon emissions'   409.82749009986225
b'gross_domestic product'   1298.7562169180067
b'rio_de janeiro'   60579.717391304344
b'carbon emissions'   409.82749009986225
b'remainder_paper organized_follows'   11449.774764679516
b'stochastic_differential equations'   1044.282181000562
b'transition consciousness_unconsciousness'   569.1721813725491
b'transition consciousness_unconsciousness'   569.1721813725491
b'transition consciousness_unconsciousness'   569.1721813725491
b'transition consciousness_unconsciousness'   569.1721813725491
b'transition consciousness_unconsciousness'   569.1721813725491
b'inhibitory_neurons receive_inhibitory'   16888.89090909091
b'inhibitory_neurons receive_inhibitory'   16888.89090909091
b'receive_inhibitory inputs'   1930.158961038961
b'inhibitory_neurons receive_inhibitory'   16888.89090909091
b'receive_inhibitory inpu

b'probable_pathological gamblers'   2025.9302071973825
b'american_psychiatric association'   1761.7619724988144
b'american_psychiatric association'   1761.7619724988144
b'american_psychiatric association'   1761.7619724988144
b'gambling-related family'   312.494196804037
b'gambling-related family'   312.494196804037
b'gambling-related family'   312.494196804037
b'gambling-related family'   312.494196804037
b'gambling-related family'   312.494196804037
b'gambling-related family'   312.494196804037
b'gambling-related family'   312.494196804037
b'gambling-related family'   312.494196804037
b'gambling-related family'   312.494196804037
b'problem gambling_behaviours'   354.8387521222411
b'mother\xe2\x80\x99s_alcohol consumption'   414.4975457385096
b'problem gambling_behaviours'   354.8387521222411
b'maternal_alcohol consumption'   1695.6717780211754
b'problem gambling_behaviours'   354.8387521222411
b'problem gambling_behaviours'   354.8387521222411
b'problem gambling_behaviours'   354.838

b'reactive_oxygen species_ros'   39028.94957983193
b'hybrid rice_breeding'   1555.5557416267943
b'mesocotyl_coleoptile elongation'   4739.2295918367345
b'quantitative_trait loci'   8077.295652173913
b'mesocotyl_coleoptile elongation'   4739.2295918367345
b'mesocotyl_coleoptile elongation'   4739.2295918367345
b'cm_seeding depth'   605.9288975864318
b'cm_seeding depth'   605.9288975864318
b'mesocotyl_coleoptile elongation'   4739.2295918367345
b'cm_seeding depth'   605.9288975864318
b'mesocotyl_coleoptile elongation'   4739.2295918367345
b'cm_seeding depth'   605.9288975864318
b'mesocotyl_coleoptile elongation'   4739.2295918367345
b'cm_seeding depth'   605.9288975864318
b'long arm_chromosome'   313.390350877193
b'mesocotyl_coleoptile elongation'   4739.2295918367345
b'cm_seeding depth'   605.9288975864318
b'quantitative_trait loci'   8077.295652173913
b'plays_crucial role'   608.8223782771535
b'quantitative_trait loci'   8077.295652173913
b'biotic_abiotic stresses'   1258.6571815718157

b'self-normalized_central limit'   973.9334207077326
b'self-normalized_central limit'   973.9334207077326
b'self-normalized_central limit'   973.9334207077326
b'self-normalized_central limit'   973.9334207077326
b'paper_organized follows'   2005.5733266899185
b'latin_america caribbean'   904.4683544303798
b'gross_domestic product'   1298.7562169180067
b'hull white'   303.9558246073299
b'hull white'   303.9558246073299
b'hull white'   303.9558246073299
b'hull white'   303.9558246073299
b'hull white'   303.9558246073299
b'paper_organized follows'   2005.5733266899185
b'mean_reversion momentum'   3200.3066322136087
b'mean_reversion momentum'   3200.3066322136087
b'mean_reversion momentum'   3200.3066322136087
b'mean_reversion momentum'   3200.3066322136087
b'mean_reversion momentum'   3200.3066322136087
b'mean_reversion momentum'   3200.3066322136087
b'mean_reversion momentum'   3200.3066322136087
b'mean_reversion momentum'   3200.3066322136087
b'mean_reversion momentum'   3200.3066322136

b'latin_american countries'   306.5570488892935
b'liner_shipping connectivity'   769.7975138121546
b'liner_shipping connectivity'   769.7975138121546
b'liner_shipping connectivity'   769.7975138121546
b'rest paper_organized'   2077.716252110419
b'liner_shipping connectivity'   769.7975138121546
b'liner_shipping connectivity'   769.7975138121546
b'liner_shipping connectivity'   769.7975138121546
b'remainder_paper organised_follows'   617.8177585633522
b'value decomposition_svd'   368.1684502576298
b'discrete_wavelet transform_dwt'   75061.73737373739
b'artificial_intelligence ai'   6984.127819548872
b'paper_organized follows'   2005.5733266899185
b'cargo_transport demand'   792.1618625277162
b'cargo_transport demand'   792.1618625277162
b'gross_profit per_day'   4751.350383631714
b'gross_profit per_day'   4751.350383631714
b'cargo_transport demand'   792.1618625277162
b'gross_profit per_day'   4751.350383631714
b'gross_profit per_day'   4751.350383631714
b'\xe2\x80\x98optimisation_ship 

b'place-based_job creation'   1527.7779605263156
b'place-based_job creation'   1527.7779605263156
b'place-based_job creation'   1527.7779605263156
b'remainder_paper organized_follows'   11449.774764679516
b'place-based_job creation'   1527.7779605263156
b'place-based_job creation'   1527.7779605263156
b'place-based_job creation'   1527.7779605263156
b'place-based_job creation'   1527.7779605263156
b'eu_member states'   1353.2025858144405
b'eu_member states'   1353.2025858144405
b'labor_force participation'   598.7349565434357
b'labor_force participation'   598.7349565434357
b'average duration_unemployment'   403.9525983909545
b'average duration_unemployment'   403.9525983909545
b'average duration_unemployment'   403.9525983909545
b'labor_force participation'   598.7349565434357
b'average duration_unemployment'   403.9525983909545
b'average duration_unemployment'   403.9525983909545
b'labor_force participation'   598.7349565434357
b'labor_force participation'   598.7349565434357
b'avera

b'east west'   448.04292850990527
b'emergency_department visits'   1673.6738738738738
b'new_york city'   338.8044984802431
b'emergency_department visits'   1673.6738738738738
b'emergency_department visits'   1673.6738738738738
b'emergency_department visits'   1673.6738738738738
b'per 100_000'   998.4945828972305
b'south asia'   517.2665421907757
b'acute pesticide_poisoning'   615.1582781456954
b'acute pesticide_poisoning'   615.1582781456954
b'acute pesticide_poisoning'   615.1582781456954
b'acute pesticide_poisoning'   615.1582781456954
b'acute pesticide_poisoning'   615.1582781456954
b'acute pesticide_poisoning'   615.1582781456954
b'south asian'   322.1729328523863
b'south asia'   517.2665421907757
b'acute pesticide_poisoning'   615.1582781456954
b'emergency_department ed'   783.8725738396624
b'intensive_care unit_icu'   20641.977777777774
b'95%_confidence interval'   1304.6193820224719
b'chronic_obstructive pulmonary'   7705.07004608295
b'ho_chi minh_city'   72854.03921568628
b'rel

b'intensive_care units'   579.5207664884136
b'critically_ill patients'   395.9153135873779
b'acute lung_injury'   473.19867549668874
b'reactive_oxygen species_ros'   39028.94957983193
b'microvascular_endothelial cells'   526.1634756995583
b'vascular smooth_muscle'   1659.7182846932697
b'microvascular_endothelial cells'   526.1634756995583
b'microvascular_endothelial cells'   526.1634756995583
b'microvascular_endothelial cells'   526.1634756995583
b'critically_ill patients'   395.9153135873779
b'microvascular_endothelial cells'   526.1634756995583
b'microvascular_endothelial cells'   526.1634756995583
b'microvascular_endothelial cells'   526.1634756995583
b'vascular smooth_muscle'   1659.7182846932697
b'smooth_muscle cells'   356.87609656143945
b'vascular smooth_muscle'   1659.7182846932697
b'microvascular_endothelial cells'   526.1634756995583
b'vascular smooth_muscle'   1659.7182846932697
b'microvascular_endothelial cells'   526.1634756995583
b'critically_ill patients'   395.915313587

b'enhanced_haptic feedback'   1949.9844054580894
b'enhanced_haptic feedback'   1949.9844054580894
b'laparoscopic_psychomotor skills'   618.1603815439219
b'intra-corporeal_knot tying'   40980.39705882353
b'laparoscopic_psychomotor skills'   618.1603815439219
b'laparoscopic_psychomotor skills'   618.1603815439219
b'intra-corporeal_knot tying'   40980.39705882353
b'intra-corporeal_knot tying'   40980.39705882353
b'laparoscopic_psychomotor skills'   618.1603815439219
b'laparoscopic_psychomotor skills'   618.1603815439219
b'intra-corporeal_knot tying'   40980.39705882353
b'intra-corporeal_knot tying'   40980.39705882353
b'intra-corporeal_knot tying'   40980.39705882353
b'intra-corporeal_knot tying'   40980.39705882353
b'laparoscopic_psychomotor skills'   618.1603815439219
b'oil gas'   316.017739544604
b'nw_sabalan geothermal'   2278.0846106683016
b'nw_sabalan geothermal'   2278.0846106683016
b'nw_sabalan geothermal'   2278.0846106683016
b'nw_sabalan geothermal'   2278.0846106683016
b'nw_sab

b'vocational_education training'   686.3486420807199
b'vocational_education training'   686.3486420807199
b'adolescents young_adults'   1243.0766142522582
b'vocational_education training'   686.3486420807199
b'upper_secondary school'   333.27108420226483
b'upper_secondary school'   333.27108420226483
b'upper_secondary school'   333.27108420226483
b'upper_secondary school'   333.27108420226483
b'upper_secondary school'   333.27108420226483
b'upper_secondary school'   333.27108420226483
b'vocational_education training'   686.3486420807199
b'vocational_education training'   686.3486420807199
b'vocational_education training'   686.3486420807199
b'inclusion_exclusion criteria'   1132.1011578305909
b'vocational_education training'   686.3486420807199
b'vocational_education training'   686.3486420807199
b'vocational_education training'   686.3486420807199
b'raises_question whether'   340.0655317591067
b'vocational_education training'   686.3486420807199
b'vocational_education training'   686.

b'partial_differential equations'   1914.517331834364
b'adaptive mesh_refinement'   1190.8833333333334
b'discontinuous_galerkin predictor'   537.862767805443
b'discontinuous_galerkin predictor'   537.862767805443
b'discontinuous_galerkin predictor'   537.862767805443
b'discontinuous_galerkin predictor'   537.862767805443
b'adaptive mesh_refinement'   1190.8833333333334
b'adaptive mesh_refinement'   1190.8833333333334
b'discontinuous_galerkin predictor'   537.862767805443
b'discontinuous_galerkin predictor'   537.862767805443
b'remainder_paper organized_follows'   11449.774764679516
b'remainder_paper organized_follows'   11449.774764679516
b'partial_differential equations'   1914.517331834364
b'eu_member states'   1353.2025858144405
b'remainder_paper organized_follows'   11449.774764679516
b'rest paper_organized'   2077.716252110419
b'partial_differential equations'   1914.517331834364
b'remainder_paper organized_follows'   11449.774764679516
b'paper_organized follows'   2005.5733266899

b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'l2_listening proficiency'   2679.3264827834864
b'teachers\xe2\x80\x99_personal styles'   10703.247736625515
b'teachers\xe2\x80\x99_personal styles'   10703.247736625515
b'teachers\xe2\x80\x99_personal styles'   10703.247736625515
b'teachers\xe2\x80\x99_personal styles'   10703.247736625515
b'teachers\xe2\x80\x99_personal styles'   10703.247736625515
b'teachers\xe2\x80\x99_personal styles'   10703.247736625515
b'teachers\xe2\x80\x99_personal styles'   10703.247736625515
b'teachers\xe2\x80\x99_personal styles'   10703.247736625515
b'teachers\xe2\x80\x99

In [15]:
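# Add the new trigram tokens to the vocabulary, then continue training
# on the trigram-transformed corpus.
model.build_vocab(sentences=trigram[bigram[papers]], update=True)
model.train(sentences=trigram[bigram[papers]], total_examples=model.corpus_count, epochs=model.iter)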

2753751

In [16]:
model.save("data/journal.txt")
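
For the later stages, the saved model can be read back with `Word2Vec.load`. Note that `save()` writes gensim's own pickle-based format rather than plain text, despite the `.txt` extension used here. The helper below is a hypothetical convenience wrapper, not something defined elsewhere in the project; it imports gensim lazily so that defining it has no side effects.

```python
def load_saved_vectors(path="data/journal.txt"):
    """Load a model written by model.save() and return its word vectors."""
    import gensim  # imported lazily; only needed when the helper is called
    model = gensim.models.Word2Vec.load(path)
    return model.wv
```

A downstream notebook could then call `vectors = load_saved_vectors()` and index it by token, in the same way `vectors['new_york']` is used below.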

In [22]:
vectors = model.wv
len(vectors.vocab)

127777

In [23]:
vectors['new_york']

array([  2.55582184e-01,  -1.73009504e-02,   1.06402971e-01,
        -2.66092151e-01,  -3.89313735e-02,  -1.58489108e-01,
        -2.47355536e-01,   3.59850712e-02,  -2.71874428e-01,
        -9.52778831e-02,   6.57308549e-02,  -6.77294610e-03,
         7.59963840e-02,   5.57670854e-02,   2.68546976e-02,
        -2.24529151e-02,  -2.34565362e-02,   2.37930462e-01,
        -8.54160339e-02,  -6.16243631e-02,   2.17030589e-02,
        -7.39934370e-02,   1.01153865e-01,  -3.17370109e-02,
        -1.66211769e-01,  -8.52872059e-02,   9.57673788e-02,
         1.57554727e-02,   1.75273240e-01,  -1.24914669e-01,
        -9.29379463e-03,  -3.38235438e-01,   1.88794211e-01,
         1.40671790e-01,   1.19395629e-01,   2.57506818e-02,
        -1.58740982e-01,  -3.20343375e-01,  -9.87890661e-02,
        -8.08268338e-02,  -3.72908004e-02,   5.11135794e-02,
        -1.00134656e-01,   1.62097186e-01,  -1.41944751e-01,
         2.79612392e-01,  -1.74228311e-01,   3.05699348e-01,
        -1.98937412e-02,