<h2>Word2Vec on Research Papers</h2>
<p>In this notebook, we will see how Word2Vec performs on research papers.</p>
<p>Word2Vec learns a vector for each word. We want two vectors to be close to each other if the corresponding words have similar semantics.</p>
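<p>As a minimal sketch of what "close to each other" can mean, the snippet below computes the cosine similarity between toy vectors. The three-dimensional vectors and the helper name <code>cosine_similarity</code> are invented for illustration; they are not taken from the trained model.</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional word vectors (made up, not model output).
toy_vectors = {
    "test": [0.9, 0.1, 0.2],
    "experiment": [0.8, 0.2, 0.3],
    "solar": [0.1, 0.9, 0.1],
}

sim_related = cosine_similarity(toy_vectors["test"], toy_vectors["experiment"])
sim_unrelated = cosine_similarity(toy_vectors["test"], toy_vectors["solar"])

# Vectors pointing in similar directions score closer to 1.
print(sim_related > sim_unrelated)
```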

In [1]:
# -*- coding: utf-8 -*-
#Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import timeit
import codecs
import re
import os
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
from wordcloud import WordCloud
%matplotlib inline

In [2]:
cleaned_paper_df = pd.read_csv('../dataset/cleaned_papers_pdf.csv',encoding='utf-8')

In [3]:
print(cleaned_paper_df.shape)
cleaned_paper_df.head()

(1042, 5)


Unnamed: 0,name,content,directory,isValid,faculty
0,Alice Sharp,Discarded appendicularian houses as sources of...,../papers/BIO/Alice Sharp/0014.pdf,True,BIO
1,Alice Sharp,AbstractThailandhassufferedfromseveredefor- es...,../papers/BIO/Alice Sharp/00463529693fc4921900...,True,BIO
2,Alice Sharp,"Cell, Vol. 11, 263-271, June 1977. Copyright 0...",../papers/BIO/Alice Sharp/0deec5230a91912b3800...,True,BIO
3,Alice Sharp,ImprovingthesolidwastemanagementinPhnomPenhcit...,../papers/BIO/Alice Sharp/1-s2.0-S0956053X0400...,True,BIO
4,Alice Sharp,arXiv:1001.2574v2 [astro-ph.CO] 26 Jan 2010Pro...,../papers/BIO/Alice Sharp/1001.2574,True,BIO


In [4]:
# Import various modules for string cleaning
from nltk.corpus import stopwords

def review_to_wordlist( paper_content ):
    # Function to convert raw paper text to a string of cleaned words.
    # The input is a single string (raw text extracted from a paper), and
    # the output is a single string (the preprocessed text).
    # Note: despite its name, this function returns a string, not a list.
    #
    # 1. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", paper_content)
    #
    # 2. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 3. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    # 'cid' is treated as an extra stop word (likely residue of
    # "(cid:N)" markers left by PDF text extraction)
    stops.add('cid')
    #
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 5. Drop single-character words except a, i, u
    meaningful_words = [w for w in meaningful_words if len(w) != 1 or w in ['a', 'i', 'u']]
    #
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)
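<p>To make the cleaning steps above concrete, here is a self-contained sketch that applies the same steps to one sample sentence. It assumes a tiny hand-written stop-word set (<code>TOY_STOPS</code>) in place of NLTK's full English list, so it runs without downloading any corpus data. The function name <code>clean_text</code> and the sample sentence are hypothetical.</p>

```python
import re

# Tiny stand-in stop-word set (assumption); the notebook uses NLTK's English list.
TOY_STOPS = {"the", "of", "in", "was", "a", "i", "cid"}

def clean_text(text, stops=TOY_STOPS):
    # Replace every non-letter character with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # Lower-case and split on whitespace.
    words = letters_only.lower().split()
    # Drop stop words.
    words = [w for w in words if w not in stops]
    # Drop single-character tokens except a, i, u.
    words = [w for w in words if len(w) != 1 or w in ("a", "i", "u")]
    return " ".join(words)

# Made-up sentence with digits, punctuation and a "(cid:3)" marker.
sample = "The pH of sample (cid:3) was 7.4 in Table 2b."
cleaned = clean_text(sample)
print(cleaned)  # prints "ph sample table"
```

<p>Digits and punctuation vanish entirely, so "7.4" leaves nothing behind and "2b" leaves only the single letter "b", which the length filter then removes.</p>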

In [5]:
# Requires the punkt tokenizer data for sentence splitting
# (run nltk.download('punkt') once if it is not installed yet)
import nltk
# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define a function to split a paper into cleaned sentences
def paper_to_sentences( paper, tokenizer, remove_stopwords=False ):
    # Function to split a paper's text into cleaned sentences. Returns a
    # list of sentences, where each sentence is a string of
    # space-separated words. (remove_stopwords is currently unused.)
    #
    # 1. Use the NLTK tokenizer to split the text into sentences
    raw_sentences = tokenizer.tokenize(paper.strip())
    #
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get the cleaned string
            sentences.append( review_to_wordlist( raw_sentence))
    #
    # Return the list of sentences (each sentence is a cleaned string;
    # it is split into a list of words in a later cell)
    return sentences

In [6]:
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for paper in cleaned_paper_df["content"]:
    sentences += paper_to_sentences(paper, tokenizer)
print("Finished parsing")

Parsing sentences from training set
Finished parsing


In [7]:
split_sentences = [sentence.split(" ") for sentence in sentences if len(sentence)>1]
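<p>Because each sentence is still a string at this point, <code>len(sentence)&gt;1</code> counts characters, not words. The sketch below contrasts that filter with a token-count filter on made-up sentences. The token-count variant is an assumption offered for comparison, not what the notebook runs.</p>

```python
# Hypothetical cleaned sentences, as space-separated strings.
toy_sentences = ["ph sample table", "", "u", "results", "solid waste management"]

# The notebook's filter: keep any string longer than one character.
by_chars = [s.split(" ") for s in toy_sentences if len(s) > 1]

# Alternative (assumption): keep only sentences with at least two tokens,
# so that every remaining word has at least one context word.
by_tokens = [tokens for tokens in (s.split() for s in toy_sentences) if len(tokens) > 1]

print(len(by_chars), len(by_tokens))
```

<p>The character-based filter keeps one-word sentences such as "results", which give the model no context window to learn from.</p>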

In [8]:
# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# Set values for various parameters
num_features = 500    # Word vector dimensionality
min_word_count = 0    # Minimum word count (0 keeps every token, however rare)
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
# Note: `size` is the parameter name in gensim < 4.0; gensim 4.x calls it `vector_size`
from gensim.models import word2vec
print("Training model...")
model = word2vec.Word2Vec(split_sentences, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

2017-06-24 15:12:03,360 : INFO : 'pattern' package not found; tag filters are not available for English
2017-06-24 15:12:03,463 : INFO : collecting all words and their counts
2017-06-24 15:12:03,465 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-24 15:12:03,518 : INFO : PROGRESS: at sentence #10000, processed 115196 words, keeping 36812 word types
2017-06-24 15:12:03,571 : INFO : PROGRESS: at sentence #20000, processed 232029 words, keeping 44138 word types
2017-06-24 15:12:03,621 : INFO : PROGRESS: at sentence #30000, processed 351003 words, keeping 61914 word types


Training model...


2017-06-24 15:12:03,691 : INFO : PROGRESS: at sentence #40000, processed 480732 words, keeping 95649 word types
2017-06-24 15:12:03,755 : INFO : PROGRESS: at sentence #50000, processed 600294 words, keeping 128127 word types
2017-06-24 15:12:03,817 : INFO : PROGRESS: at sentence #60000, processed 725490 words, keeping 169099 word types
2017-06-24 15:12:03,879 : INFO : PROGRESS: at sentence #70000, processed 832531 words, keeping 199099 word types
2017-06-24 15:12:03,921 : INFO : PROGRESS: at sentence #80000, processed 957250 words, keeping 222723 word types
2017-06-24 15:12:03,972 : INFO : PROGRESS: at sentence #90000, processed 1025047 words, keeping 246354 word types
2017-06-24 15:12:04,056 : INFO : PROGRESS: at sentence #100000, processed 1147692 words, keeping 273990 word types
2017-06-24 15:12:04,150 : INFO : PROGRESS: at sentence #110000, processed 1284256 words, keeping 311518 word types
2017-06-24 15:12:04,242 : INFO : PROGRESS: at sentence #120000, processed 1416723 words, kee

In [9]:
len(model.wv.vocab)

539862
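<p>With <code>min_word_count = 0</code>, every distinct token enters the vocabulary, including fused or truncated tokens produced by PDF extraction. The sketch below uses a toy corpus and <code>collections.Counter</code> to show how a minimum count changes the number of word types kept. The corpus and the helper <code>vocab_size</code> are invented for illustration and do not reproduce gensim's internal vocabulary building.</p>

```python
from collections import Counter

# Hypothetical tokenized sentences, including two extraction artifacts.
toy_corpus = [
    ["solid", "waste", "management"],
    ["waste", "management", "improvingthesolidwaste"],
    ["solid", "waste", "nnovation"],
]

# Count how often each word type occurs across the corpus.
counts = Counter(word for sentence in toy_corpus for word in sentence)

def vocab_size(counts, min_count):
    """Number of word types occurring at least min_count times."""
    return sum(1 for c in counts.values() if c >= min_count)

all_types = vocab_size(counts, 0)       # every type is kept
frequent_types = vocab_size(counts, 2)  # one-off tokens are dropped
print(all_types, frequent_types)  # prints "5 3"
```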

In [10]:
model.wv.most_similar('resistance')

2017-06-24 15:13:46,788 : INFO : precomputing L2-norms of word weight vectors


[('metabolic', 0.9642347097396851),
 ('strength', 0.9590809941291809),
 ('variation', 0.9545866847038269),
 ('swelling', 0.952375590801239),
 ('degradation', 0.9511069059371948),
 ('absorption', 0.9497958421707153),
 ('contrast', 0.9449074268341064),
 ('effect', 0.9436883926391602),
 ('induced', 0.9416981339454651),
 ('muscle', 0.9411588907241821)]
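<p>The scores above are cosine similarities, and the "precomputing L2-norms" log line reflects that the vectors are normalized to unit length before comparison. As a rough sketch of that ranking, assuming a toy dictionary of made-up vectors, the snippet below normalizes each vector and sorts the other words by dot product with the query. The function <code>rank_neighbours</code> is hypothetical and only illustrates the idea; it is not gensim's implementation.</p>

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_neighbours(word, vectors, topn=2):
    # With unit-length vectors, the dot product equals the cosine similarity.
    unit = {w: normalize(v) for w, v in vectors.items()}
    query = unit[word]
    scores = [
        (other, sum(a * b for a, b in zip(query, vec)))
        for other, vec in unit.items()
        if other != word
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-dimensional vectors (made up, not model output).
toy_vectors = {
    "resistance": [0.9, 0.2, 0.1],
    "strength": [0.8, 0.3, 0.1],
    "database": [0.1, 0.9, 0.3],
    "server": [0.2, 0.8, 0.4],
}

neighbours = rank_neighbours("resistance", toy_vectors)
print([w for w, _ in neighbours])  # prints "['strength', 'server']"
```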

In [11]:
model.wv.most_similar('database')

[('svg', 0.9527514576911926),
 ('server', 0.9471654891967773),
 ('documents', 0.9339300394058228),
 ('files', 0.9294983148574829),
 ('xml', 0.9168329238891602),
 ('experiencedphysical', 0.9134659171104431),
 ('sql', 0.9124547243118286),
 ('textual', 0.9110561609268188),
 ('wikipedia', 0.9107706546783447),
 ('code', 0.9089975953102112)]

In [12]:
model.wv.most_similar('corpus')

[('transcription', 0.9303451180458069),
 ('texts', 0.9281502366065979),
 ('reference', 0.9207703471183777),
 ('vocabularies', 0.9181904196739197),
 ('annotated', 0.9172554612159729),
 ('emsim', 0.9154324531555176),
 ('terminology', 0.9154263734817505),
 ('analyzed', 0.9152184724807739),
 ('unicode', 0.9140470027923584),
 ('manually', 0.9102402925491333)]

In [13]:
model.wv.most_similar('create')

[('build', 0.9631944298744202),
 ('allow', 0.9628151059150696),
 ('able', 0.9508962035179138),
 ('fully', 0.9496178030967712),
 ('realize', 0.9479504227638245),
 ('acquire', 0.9470875263214111),
 ('communicate', 0.9468263387680054),
 ('creating', 0.9456331729888916),
 ('allowing', 0.943172812461853),
 ('offers', 0.9429171681404114)]

In [14]:
model.wv.most_similar('electric')

[('solar', 0.9141300320625305),
 ('distortion', 0.8976895213127136),
 ('fluid', 0.8933480381965637),
 ('amplifier', 0.8919662833213806),
 ('generator', 0.8908932209014893),
 ('shock', 0.8896439671516418),
 ('solid', 0.8887043595314026),
 ('field', 0.8871155977249146),
 ('conversion', 0.8853572010993958),
 ('load', 0.8853031992912292)]

In [15]:
model.wv.most_similar('semantic')

[('ontology', 0.9500008225440979),
 ('lexical', 0.9480582475662231),
 ('linguistic', 0.9404067397117615),
 ('identification', 0.9396563172340393),
 ('specification', 0.9362311363220215),
 ('representation', 0.9292150139808655),
 ('metadata', 0.9251999258995056),
 ('generic', 0.9235600829124451),
 ('concepts', 0.9214292168617249),
 ('spatial', 0.920572817325592)]

In [19]:
model.wv.most_similar('test')

[('tests', 0.9566447138786316),
 ('measurements', 0.9369489550590515),
 ('experiments', 0.9170331954956055),
 ('withclinical', 0.9133439064025879),
 ('performed', 0.9131519198417664),
 ('tested', 0.9116257429122925),
 ('experimental', 0.9044852256774902),
 ('experiment', 0.9030770063400269),
 ('suitability', 0.900475800037384),
 ('indcol', 0.8983445167541504)]

In [21]:
model.wv.most_similar('innovation')

[('technological', 0.9409782886505127),
 ('upgrading', 0.9409516453742981),
 ('entrepreneurial', 0.9324290752410889),
 ('innovations', 0.9310087561607361),
 ('capability', 0.9307618737220764),
 ('nnovation', 0.9198551177978516),
 ('stakeholder', 0.9146895408630371),
 ('achievement', 0.9092878699302673),
 ('cultural', 0.9010015726089478),
 ('mpwt', 0.9005976319313049)]