**Idea**
The dataset contains 1401 research papers. Each paper has its own structure, and the collection includes papers written in English and in French. Most scientific research papers follow a common structure: abstract, introduction, ..., related work, results, conclusion.

Use the DataFrame (or the underlying .csv file) to train a word2vec model on all section names, so that word similarity can be used for further tests.


In [None]:
# !pip install gensim
# !pip install python-Levenshtein

In [1]:
# Currently not necessary
#from google.colab import drive
#drive.mount('/content/drive')

In [2]:
import gensim
import pandas as pd



### Reading and Exploring the Dataset
The training dataset used here consists of 1401 research papers. The papers are stored as LaTeX files, and their section titles are read into a pandas DataFrame. See here for data [preparation](Data Preparation Code.ipynb)

Link to the Dataset: https://github.com/jd-coderepos/sota/tree/master/dataset/train

In [3]:
import os

# Current working dir
print(os.getcwd())

pathToDatasetFiles = "/Users/christophzweifel/Downloads/Word2Vec/section_titles.csv"
df = pd.read_csv(pathToDatasetFiles)
df

/Users/christophzweifel/Downloads/Word2Vec


Unnamed: 0,file,section_title
0,1905.00526v2.tex,
1,1905.00526v2.tex,Introduction
2,1905.00526v2.tex,Related Work
3,1905.00526v2.tex,Radar Region Proposal Network
4,1905.00526v2.tex,Perspective Transformation
...,...,...
205630,1209.0359.tex,Communicating Processes
205631,1209.0359.tex,Recursive Communicating Processes
205632,1209.0359.tex,Topologies with Decidable State Reachability
205633,1209.0359.tex,Eager \qcp and the Mutex Restriction


The 1401 research papers contain the following number of section-title rows (rows, columns):

In [10]:
df.shape

(205635, 2)

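Number of unique values in the `file` column (the cell below counts files, not section names; `df['section_title'].nunique()` would count unique section names instead):

In [11]:
# Get the count of unique values in the 'file' column
unique_count = df['file'].nunique()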

print(unique_count)

12056
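
Counting distinct section names can also be sketched with the standard library alone. In the example below, the first four titles appear in the DataFrame shown earlier and the last two are variants added for illustration; normalizing case and surrounding whitespace before counting is an assumption, not something this notebook does.

```python
from collections import Counter

# Sample titles: the first four appear in the DataFrame above,
# the last two are illustrative variants.
titles = [
    "Introduction",
    "Related Work",
    "Radar Region Proposal Network",
    "Perspective Transformation",
    "introduction ",
    "Related work",
]

# Normalize so that 'Introduction' and 'introduction ' count as one name.
counts = Counter(title.strip().lower() for title in titles)

print(len(counts))            # number of distinct normalized names
print(counts.most_common(2))  # the two most frequent names
```

Without the normalization step, the two variants would be counted as separate section names.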


### Simple Preprocessing & Tokenization


1.   We apply basic normalization: converting all words to lower case, trimming whitespace, and removing punctuation. *TODO* Add reference to lab session in data science or NLP

2.   Additionally, we can remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and reduce words to their root forms, e.g. 'running' to 'run'. Note that `gensim.utils.simple_preprocess`, used below, only lowercases and tokenizes (by default dropping tokens shorter than 2 or longer than 15 characters); it does not remove stop words or stem. *TODO* Add reference to lecture and lab session about tokenization

3.   LaTeX formatting can be stripped with a regular expression, latex2text, or Pandoc. For our use case, a simple regular expression seemed most effective for extracting section names.
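
The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in this notebook: the regular expressions, the `STOP_WORDS` set, and the helper name `clean_section_title` are assumptions chosen for the example, and no stemming is performed.

```python
import re

# Hypothetical, deliberately small stop-word list (the words named in the text).
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

# Matches a LaTeX command name such as \qcp or \textbf (backslash plus letters).
LATEX_COMMAND = re.compile(r"\\[a-zA-Z]+\*?")


def clean_section_title(title: str) -> list:
    """Strip LaTeX command names, lowercase, drop punctuation and stop words."""
    without_commands = LATEX_COMMAND.sub(" ", title)
    # Keep runs of letters only; the pattern is Unicode-aware,
    # so accented letters such as 'é' survive.
    tokens = re.findall(r"[^\W\d_]+", without_commands.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(clean_section_title(r"Eager \qcp and the Mutex Restriction"))
print(clean_section_title(r"\textbf{Related} Work."))
```

The sketch removes only the command name and keeps the text inside braces, so `\textbf{Related}` still contributes the token `related`.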



In [20]:
# First, ensure all section titles are treated as strings (this also converts NaNs to the string 'nan')
df['section_title'] = df['section_title'].astype(str)

# Apply gensim's simple_preprocess to each section title
text = df['section_title'].apply(gensim.utils.simple_preprocess)

# review_text = df.section_title.apply(gensim.utils.simple_preprocess)

In [21]:
text.loc[2]

['related', 'work']

In [22]:
df.section_title.loc[4]

'Perspective Transformation'

In [25]:
# Check how often "Représentation" appears in the section titles (case-sensitive substring match)
word_occurrences = df['section_title'].apply(lambda x: 'Représentation' in x).sum()
print(f"Occurrences of 'représentation': {word_occurrences}")

# Check how often "work" appears in the section titles (case-sensitive substring match)
word_occurrences = df['section_title'].apply(lambda x: 'work' in x).sum()
print(f"Occurrences of 'work': {word_occurrences}")

Occurrences of 'représentation': 1
Occurrences of 'work': 6181
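
The `in` check above is case-sensitive, so 'Représentation' and 'représentation' are counted separately. A minimal sketch of a case-insensitive count, assuming a plain list of titles (the sample values and the helper name `count_titles_containing` are hypothetical):

```python
def count_titles_containing(titles, needle):
    """Count titles that contain `needle`, ignoring case."""
    folded_needle = needle.casefold()
    return sum(folded_needle in title.casefold() for title in titles)


# Hypothetical sample, not taken from the dataset.
sample_titles = ["Related Work", "Future work", "Représentation", "Framework"]

print(count_titles_containing(sample_titles, "work"))
print(count_titles_containing(sample_titles, "représentation"))
```

Because this is a substring match, a title such as "Framework" also counts as containing "work"; matching on whole tokens would avoid that.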


#### Initialize the model

In [26]:
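model = gensim.models.Word2Vec(
    window=10,    # maximum distance between the current and the predicted word
    min_count=1,  # keep every word, even those that occur only once
    workers=1,    # number of worker threads used for training
)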


#### Build Vocabulary

In [27]:
model.build_vocab(text, progress_per=1000)

#### Train the Word2Vec Model

In [28]:
model.train(text, total_examples=model.corpus_count, epochs=model.epochs)

(2270426, 2854570)

### Save the Model

Save the model so that it can be reused in other applications.

In [29]:
model.save("/Users/christophzweifel/Downloads/Word2Vec/word2vec-similarSectionNames.model")

### Finding Similar Words and Similarity Between Words
Reference: https://radimrehurek.com/gensim/models/word2vec.html

In [30]:
model.wv.most_similar("abstract")

[('operational', 0.9686930775642395),
 ('triangle', 0.9658939242362976),
 ('rounding', 0.9642549157142639),
 ('witness', 0.9615827202796936),
 ('closure', 0.9609056115150452),
 ('observational', 0.9600898623466492),
 ('prefix', 0.9591124057769775),
 ('rijndael', 0.9586595892906189),
 ('minimal', 0.9584744572639465),
 ('multiway', 0.9582595229148865)]

In [31]:
model.wv.most_similar("experiments")

[('cifar', 0.9203596711158752),
 ('imagenet', 0.9065940976142883),
 ('benchmarks', 0.8723424673080444),
 ('lt', 0.8722990155220032),
 ('experiment', 0.8678033351898193),
 ('svhn', 0.8663187623023987),
 ('hotpotqa', 0.857439398765564),
 ('cityscapes', 0.854036271572113),
 ('evaluations', 0.8525301814079285),
 ('cub', 0.8517221212387085)]

In [32]:
model.wv.most_similar("results")

[('evaluations', 0.834568440914154),
 ('cifar', 0.8315017223358154),
 ('examples', 0.830646812915802),
 ('experiments', 0.8299868106842041),
 ('voc', 0.826363205909729),
 ('imagenet', 0.8127244114875793),
 ('wikiann', 0.8075611591339111),
 ('benchmarks', 0.8020544648170471),
 ('hotpotqa', 0.7987909913063049),
 ('crowdhuman', 0.7971668839454651)]

In [35]:
text.loc[7]

['experiments', 'and', 'results']

In [38]:
model.wv.most_similar("représentation")

[('toolkit', 0.9210864305496216),
 ('experimentieren', 0.9210857152938843),
 ('deepmind', 0.9186710119247437),
 ('logiciels', 0.9178512692451477),
 ('battery', 0.9174641966819763),
 ('psc', 0.9170849919319153),
 ('workload', 0.9167493581771851),
 ('imagined', 0.9166383147239685),
 ('sectioning', 0.9164373874664307),
 ('dd', 0.9161875247955322)]

In [39]:
model.wv.similarity(w1="abstract", w2="introduction")

0.78969204

In [40]:
model.wv.similarity(w1="abstract", w2="mémoire")

0.84020174

In [41]:
model.wv.similarity(w1="conclusion", w2="summary")

0.8085111
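
`similarity` returns the cosine similarity between the two word vectors. As a minimal illustration of that measure, the sketch below computes it for made-up three-dimensional vectors using only the standard library; the vectors and the helper name `cosine_similarity` are assumptions, and real Word2Vec vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for word embeddings.
vec_conclusion = [1.0, 2.0, 0.0]
vec_summary = [2.0, 4.0, 0.0]
vec_other = [0.0, 0.0, 3.0]

print(cosine_similarity(vec_conclusion, vec_summary))  # parallel vectors
print(cosine_similarity(vec_conclusion, vec_other))    # orthogonal vectors
```

Values close to 1 mean the two vectors point in nearly the same direction, and values close to 0 mean they are nearly orthogonal.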