<h2>Word2Vec on Research Papers</h2>
<p>In this notebook, we look at how Word2Vec performs on research papers.</p>
<p>Word2Vec learns a vector for each word, with the goal that two vectors lie close to each other when their words have similar semantics.</p>
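<p>To make "close to each other" concrete, closeness between word vectors is commonly measured with cosine similarity. The sketch below is only an illustration: the three-dimensional vectors and the helper name <code>cosine_similarity</code> are invented for this example and do not come from the trained model.</p>

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "resistor": np.array([0.9, 0.1, 0.0]),
    "capacitor": np.array([0.8, 0.2, 0.1]),
    "corpus": np.array([0.0, 0.1, 0.9]),
}

sim_related = cosine_similarity(toy_vectors["resistor"], toy_vectors["capacitor"])
sim_unrelated = cosine_similarity(toy_vectors["resistor"], toy_vectors["corpus"])

# Vectors pointing in similar directions score higher.
print(sim_related > sim_unrelated)
```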

In [1]:
# -*- coding: utf-8 -*-
#Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import timeit
import codecs
import re
import os
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
from wordcloud import WordCloud
%matplotlib inline

In [2]:
cleaned_paper_df = pd.read_csv('../dataset/cleaned_papers_pdf.csv',encoding='utf-8')

In [3]:
print(cleaned_paper_df.shape)
cleaned_paper_df.head(10)

(784, 5)


Unnamed: 0,name,content,directory,isValid,faculty
0,Banlue Srisuchinwong,"Thammasat Int. J. Sc. Tech., Vol.6, No.l, Janu...",../papers/ICT_professor/Banlue Srisuchinwong/1...,True,ICT_professor
1,Banlue Srisuchinwong,___________________________________________0-7...,../papers/ICT_professor/Banlue Srisuchinwong/1...,True,ICT_professor
2,Banlue Srisuchinwong,This paper is a postprint of a paper submitted...,../papers/ICT_professor/Banlue Srisuchinwong/2...,True,ICT_professor
3,Banlue Srisuchinwong,Electronic version of an article published as ...,../papers/ICT_professor/Banlue Srisuchinwong/2...,True,ICT_professor
4,Banlue Srisuchinwong,PhysicsLettersA373(2009)4038–4043 Contentslist...,../papers/ICT_professor/Banlue Srisuchinwong/2...,True,ICT_professor
5,Banlue Srisuchinwong,41UTCC Engineering Research Papers 2008A Low-P...,../papers/ICT_professor/Banlue Srisuchinwong/4...,True,ICT_professor
6,Banlue Srisuchinwong,1Improved Implementation of Sprott™s Chaotic O...,../papers/ICT_professor/Banlue Srisuchinwong/4...,True,ICT_professor
7,Banlue Srisuchinwong,Compound Structures of Six New Chaotic Attract...,../papers/ICT_professor/Banlue Srisuchinwong/5...,True,ICT_professor
8,Banlue Srisuchinwong,Int.J.Electron.Commun.(AEÜ)61(2007)307–313 www...,../papers/ICT_professor/Banlue Srisuchinwong/8...,True,ICT_professor
9,Banlue Srisuchinwong,International Journal of Engineering Research ...,../papers/ICT_professor/Banlue Srisuchinwong/a...,True,ICT_professor


In [4]:
# Import the stop word list used for string cleaning
from nltk.corpus import stopwords

def review_to_wordlist( paper_content ):
    # Function to convert raw paper text to a string of words.
    # The input is a single string (raw paper text), and
    # the output is a single string (the preprocessed text)
    #
    # 1. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", paper_content)
    #
    # 2. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 3. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    # 'cid' is presumably residue such as "(cid:12)" left by PDF text extraction
    stops.add('cid')
    #
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 5. Drop one-character words, except 'a', 'i' and 'u'
    meaningful_words = [w for w in meaningful_words if len(w) != 1 or w in ['a', 'i', 'u']]
    #
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)
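<p>The cleaning steps can be tried on a single sentence. The sketch below repeats the same steps with a tiny, hypothetical stop list (<code>TOY_STOPS</code>) so that it runs without the NLTK data; the helper name <code>clean_text</code> and the sample sentence are also made up for illustration.</p>

```python
import re

# Hypothetical miniature stop list so the example runs without NLTK data;
# the notebook itself uses stopwords.words("english") plus 'cid'.
TOY_STOPS = {"the", "of", "is", "and", "in", "at", "cid"}

def clean_text(text, stops=TOY_STOPS):
    # Replace every non-letter with a space, then lower-case and split.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    # Remove stop words.
    words = [w for w in words if w not in stops]
    # Drop one-character words, except 'a', 'i' and 'u'.
    words = [w for w in words if len(w) != 1 or w in ("a", "i", "u")]
    return " ".join(words)

sample = "The gain of the filter is 20 dB (cid:3) at f = 10 MHz."
cleaned = clean_text(sample)
print(cleaned)
```

<p>Digits, punctuation and the stray <code>f</code> disappear, which is also why numeric values never reach the model.</p>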

In [5]:
# Load the punkt tokenizer for sentence splitting
# (the punkt data must already be installed, e.g. via nltk.download('punkt'))
import nltk
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define a function to split a paper into cleaned sentences
def paper_to_sentences( review, tokenizer, remove_stopwords=False ):
    # Function to split a paper into cleaned sentences. Returns a
    # list of sentences, where each sentence is a single string of
    # space-separated words. (remove_stopwords is accepted but unused:
    # review_to_wordlist always removes stop words.)
    #
    # 1. Use the NLTK tokenizer to split the text into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    #
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get the cleaned string
            sentences.append( review_to_wordlist( raw_sentence))
    #
    # Return the list of cleaned sentence strings
    return sentences

In [6]:
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for paper in cleaned_paper_df["content"]:
    sentences += paper_to_sentences(paper, tokenizer)
print("Finished parsing")

Parsing sentences from training set
Finished parsing


In [7]:
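# Split each cleaned sentence string into a list of tokens. The filter checks
# the string length, so it drops empty and one-character sentences only.
split_sentences = [sentence.split(" ") for sentence in sentences if len(sentence)>1]
split_sentences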

[['thammasat', 'int'],
 ['sc'],
 ['tech',
  'vol',
  'january',
  'april',
  'cmos',
  'capacitorlesscurrent',
  'tunable',
  'pass',
  'filter',
  'usingcurrent',
  'mirrorsbanlue',
  'srisuchinwong',
  'adisorn',
  'leelasantithamelectrical',
  'engineering',
  'program',
  'sirindhorn',
  'international',
  'institute',
  'technology',
  'thammasat',
  'university',
  'box',
  'thammasat',
  'rangsit',
  'post',
  'office',
  'patumthani',
  'thailand',
  'tel',
  'fax',
  'banlue',
  'siit',
  'tu',
  'ac',
  'thabstracta',
  'cmos',
  'capacitorless',
  'current',
  'tunable',
  'pass',
  'filter',
  'using',
  'current',
  'mirrors',
  'presentedthrough',
  'use',
  'mos',
  'internal',
  'capacitances'],
 ['frequency',
  'fo',
  'magnitude',
  'phaseshift',
  'transfer',
  'function',
  'approximately',
  'db',
  'respectively',
  'tunable',
  'thebias',
  'current'],
 ['maximum',
  'useful',
  'fo',
  'excess',
  'mhz',
  'depending',
  'internal',
  'parameters'],
 ['introduct
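<p>Note that <code>len(sentence)>1</code> counts characters, not words, which is why a one-word sentence such as <code>['sc']</code> survives in the output above. The sketch below contrasts that filter with a token-count filter on a few hypothetical sentences; whether one-word sentences should be dropped is a judgment call, since they give Word2Vec no context to learn from.</p>

```python
# Hypothetical cleaned sentences, in the same space-joined form that
# review_to_wordlist returns.
toy_sentences = ["thammasat int", "sc", "a", "", "frequency fo magnitude"]

# Filter used in the notebook: keeps strings longer than one character.
by_chars = [s.split(" ") for s in toy_sentences if len(s) > 1]

# Alternative: keep only sentences that contain at least two tokens.
by_tokens = [tokens for tokens in (s.split() for s in toy_sentences)
             if len(tokens) > 1]

print(by_chars)
print(by_tokens)
```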

In [8]:
# Import the built-in logging module and configure it so that Word2Vec
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)

# Set values for various parameters
num_features = 500    # Word vector dimensionality
min_word_count = 0    # Minimum word count (0 keeps every token)
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
# Note: gensim >= 4.0 renames the `size` argument to `vector_size`
from gensim.models import word2vec
print("Training model...")
model = word2vec.Word2Vec(split_sentences, workers=num_workers,
            size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)

2017-05-22 18:18:19,900 : INFO : 'pattern' package not found; tag filters are not available for English
2017-05-22 18:18:19,961 : INFO : collecting all words and their counts
2017-05-22 18:18:19,962 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-05-22 18:18:20,005 : INFO : PROGRESS: at sentence #10000, processed 130042 words, keeping 40812 word types
2017-05-22 18:18:20,048 : INFO : PROGRESS: at sentence #20000, processed 251062 words, keeping 75681 word types
2017-05-22 18:18:20,104 : INFO : PROGRESS: at sentence #30000, processed 374115 words, keeping 116445 word types
2017-05-22 18:18:20,150 : INFO : PROGRESS: at sentence #40000, processed 481567 words, keeping 149626 word types


Training model...


2017-05-22 18:18:20,196 : INFO : PROGRESS: at sentence #50000, processed 607326 words, keeping 174323 word types
2017-05-22 18:18:20,240 : INFO : PROGRESS: at sentence #60000, processed 698940 words, keeping 192433 word types
2017-05-22 18:18:20,279 : INFO : PROGRESS: at sentence #70000, processed 799841 words, keeping 218492 word types
2017-05-22 18:18:20,326 : INFO : PROGRESS: at sentence #80000, processed 937921 words, keeping 255256 word types
2017-05-22 18:18:20,374 : INFO : PROGRESS: at sentence #90000, processed 1076299 words, keeping 286888 word types
2017-05-22 18:18:20,417 : INFO : PROGRESS: at sentence #100000, processed 1196341 words, keeping 309466 word types
2017-05-22 18:18:20,457 : INFO : PROGRESS: at sentence #110000, processed 1322981 words, keeping 317892 word types
2017-05-22 18:18:20,502 : INFO : PROGRESS: at sentence #120000, processed 1456107 words, keeping 347669 word types
2017-05-22 18:18:20,571 : INFO : PROGRESS: at sentence #130000, processed 1574363 words, 
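<p>The training call above uses the argument name <code>size</code>, which matches the gensim release that produced this log; gensim 4.0 and later call the same setting <code>vector_size</code>. One possible way to keep the cell portable is to build the keyword arguments from the installed version. The helper name <code>word2vec_kwargs</code> below is hypothetical and the version strings are examples only.</p>

```python
def word2vec_kwargs(gensim_version, num_features, min_word_count,
                    num_workers, context, downsampling):
    """Build Word2Vec keyword arguments for a given gensim version string."""
    major = int(gensim_version.split(".")[0])
    # gensim >= 4.0 expects `vector_size`; earlier releases expect `size`.
    dim_key = "vector_size" if major >= 4 else "size"
    return {
        dim_key: num_features,
        "min_count": min_word_count,
        "workers": num_workers,
        "window": context,
        "sample": downsampling,
    }

# Example version strings, chosen only to exercise both branches.
old_style = word2vec_kwargs("2.1.0", 500, 0, 4, 10, 1e-3)
new_style = word2vec_kwargs("4.3.2", 500, 0, 4, 10, 1e-3)
print(sorted(old_style))
print(sorted(new_style))
```

<p>With gensim installed, the result could be passed as <code>word2vec.Word2Vec(split_sentences, **word2vec_kwargs(gensim.__version__, ...))</code>.</p>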

In [9]:
# Vocabulary size (gensim >= 4.0: use len(model.wv.key_to_index) instead)
len(model.wv.vocab)

440471
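<p>The vocabulary is this large partly because <code>min_word_count = 0</code> keeps every token, including fused ones such as <code>capacitorlesscurrent</code> seen earlier. gensim's <code>min_count</code> ignores words whose total frequency is below the threshold, and the effect can be estimated before training by counting tokens. The sketch below uses a few hypothetical sentences rather than the real corpus, and the helper <code>vocab_size</code> is made up for illustration.</p>

```python
from collections import Counter

# Hypothetical tokenized sentences standing in for split_sentences.
toy_split_sentences = [
    ["cmos", "current", "tunable", "filter"],
    ["current", "mirrors", "usingcurrent"],
    ["tunable", "filter", "current", "thebias"],
]

# Count how often each token occurs across all sentences.
counts = Counter(token for sentence in toy_split_sentences for token in sentence)

def vocab_size(counts, min_count):
    """Number of distinct tokens occurring at least min_count times."""
    return sum(1 for c in counts.values() if c >= min_count)

for threshold in (0, 2, 3):
    print(threshold, vocab_size(counts, threshold))
```

<p>Running the same count on <code>split_sentences</code> would show how many of the word types occur only once or twice.</p>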

In [10]:
model.wv.most_similar('resistance')

2017-05-22 18:19:26,709 : INFO : precomputing L2-norms of word weight vectors


[('hydrogel', 0.9876585006713867),
 ('triggered', 0.9848117232322693),
 ('strengths', 0.9845622181892395),
 ('molecular', 0.9845357537269592),
 ('variations', 0.9835673570632935),
 ('phenomenon', 0.9830471277236938),
 ('sensitive', 0.982763946056366),
 ('strength', 0.9811860918998718),
 ('gains', 0.9810322523117065),
 ('causes', 0.9809278845787048)]
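<p>The scores returned by <code>most_similar</code> are cosine similarities, and the log line about precomputing L2-norms reflects the normalization that makes them cheap to compute. The sketch below is a brute-force illustration of that ranking on a hypothetical four-word vocabulary; it is not gensim's implementation, and the words and vectors are invented.</p>

```python
import numpy as np

def most_similar(word, words, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    # Scale every row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[words.index(word)]
    scores = unit @ query
    order = np.argsort(-scores)  # highest similarity first
    ranked = [(words[i], float(scores[i])) for i in order if words[i] != word]
    return ranked[:topn]

# Hypothetical vocabulary and vectors, invented for illustration only.
toy_words = ["resistance", "strength", "database", "server"]
toy_matrix = np.array([
    [1.0, 0.2, 0.0],
    [0.9, 0.3, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])

neighbours = most_similar("resistance", toy_words, toy_matrix, topn=2)
print(neighbours)
```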

In [11]:
model.wv.most_similar('database')

[('directory', 0.9709730744361877),
 ('server', 0.9668992161750793),
 ('wikipedia', 0.9631386995315552),
 ('files', 0.9602453112602234),
 ('unl', 0.9600975513458252),
 ('orchid', 0.9591746926307678),
 ('statistics', 0.9544699192047119),
 ('gateway', 0.9543536901473999),
 ('wordnet', 0.9538884162902832),
 ('client', 0.9532955288887024)]

In [12]:
model.wv.most_similar('corpus')

[('statistics', 0.9816146492958069),
 ('grammars', 0.9611673951148987),
 ('texts', 0.9579599499702454),
 ('annotating', 0.9571319222450256),
 ('extracting', 0.9557010531425476),
 ('transcription', 0.9544445276260376),
 ('designexperimental', 0.9539710879325867),
 ('text', 0.9529205560684204),
 ('dictionary', 0.9528775215148926),
 ('lexical', 0.9521612524986267)]

In [13]:
model.wv.most_similar('create')

[('allows', 0.9830431938171387),
 ('put', 0.9830320477485657),
 ('build', 0.9803174734115601),
 ('acquire', 0.9784332513809204),
 ('created', 0.9778603315353394),
 ('generate', 0.9776138067245483),
 ('allow', 0.9774624705314636),
 ('execute', 0.9767727851867676),
 ('core', 0.9726788401603699),
 ('appropriate', 0.9724059104919434)]

In [15]:
model.wv.most_similar('electric')

[('muscle', 0.9701358079910278),
 ('field', 0.9695734977722168),
 ('hpci', 0.9680004119873047),
 ('loads', 0.9668121337890625),
 ('reducing', 0.9661319255828857),
 ('failure', 0.9651815295219421),
 ('cooling', 0.9617865085601807),
 ('calories', 0.9594910144805908),
 ('change', 0.9591827988624573),
 ('losses', 0.9588176608085632)]

In [17]:
model.wv.most_similar('semantic')

[('named', 0.9711902141571045),
 ('keywords', 0.9691453576087952),
 ('extracting', 0.967287003993988),
 ('linguistic', 0.9652711749076843),
 ('srqvru', 0.9630720615386963),
 ('lexical', 0.9619459509849548),
 ('text', 0.9588766694068909),
 ('thesaurus', 0.9582123756408691),
 ('infor', 0.9573590159416199),
 ('sentiment', 0.954674482345581)]

In [19]:
model.wv.most_similar('key')

[('includes', 0.9771648049354553),
 ('captures', 0.9771592020988464),
 ('another', 0.9768005609512329),
 ('accommodate', 0.9767913818359375),
 ('ask', 0.9767223596572876),
 ('already', 0.976711630821228),
 ('handled', 0.9750445485115051),
 ('particular', 0.9750428795814514),
 ('serve', 0.9750289916992188),
 ('involved', 0.9748645424842834)]

In [23]:
model.wv.most_similar('test')

[('analyzed', 0.9652294516563416),
 ('tests', 0.9629669189453125),
 ('performed', 0.9562002420425415),
 ('comparison', 0.9543420672416687),
 ('comparisons', 0.9533629417419434),
 ('validation', 0.9505937099456787),
 ('obtained', 0.9471129775047302),
 ('experiments', 0.9451783299446106),
 ('scores', 0.9451711177825928),
 ('rotary', 0.9447579979896545)]