# References

Original paper (Mikolov et al., "Efficient Estimation of Word Representations in Vector Space"): https://arxiv.org/abs/1301.3781

gensim package documentation: https://radimrehurek.com/gensim/


## Step 1: Load Data 

Load the sci.med and sci.space categories of the 20 Newsgroups dataset with fetch_20newsgroups from sklearn.datasets.

In [1]:
import logging
from sklearn.datasets import fetch_20newsgroups
# gensim reports training progress at INFO level; the default level (WARNING) would hide it.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



In [2]:
categories = ['sci.med', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True, categories=categories)

print(list(newsgroups_train.target_names))

print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)
print(newsgroups_test.filenames.shape, newsgroups_test.target.shape)



['sci.med', 'sci.space']
(1187,) (1187,)
(790,) (790,)


## Step 2: Data Pre-processing  

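Tokenize each document with gensim's simple_preprocess. It lowercases the text, splits it into alphabetic tokens, and by default drops tokens shorter than 2 or longer than 15 characters, which is why the one-letter word "a" disappears in the example below.

E.g. sentence: "This is a sentence for illustration of the data pre-processing."

    After tokenizing: ['this', 'is', 'sentence', 'for', 'illustration', 'of', 'the', 'data', 'pre', 'processing']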
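
To make the tokenization rules concrete, here is a minimal pure-Python sketch that approximates the default behavior of simple_preprocess. The function name approx_simple_preprocess and its regular expression are assumptions made for illustration; this is not gensim's actual implementation, and the real function has additional options (such as accent removal) that are ignored here.

```python
import re

# Runs of word characters that are not digits (letters and underscores).
_TOKEN_PATTERN = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim's simple_preprocess (illustration only)."""
    tokens = _TOKEN_PATTERN.findall(text.lower())
    # Drop tokens that are too short or too long, or that start with "_".
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]


sentence = "This is a sentence for illustration of the data pre-processing."
tokens = approx_simple_preprocess(sentence)
print(tokens)
```

On the example sentence this sketch yields the same token list as shown above: punctuation and the hyphen act as separators, and "a" is removed by the minimum-length filter.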


In [3]:
# Load the tokenizer from gensim
from gensim.utils import simple_preprocess

In [4]:
# Tokenize a single document; the next cell applies this to the entire dataset.
def preprocess(text):
    # simple_preprocess already returns a list of tokens
    return simple_preprocess(text)


In [5]:
# One token list per training document
processed_docs = [preprocess(doc) for doc in newsgroups_train.data]

## Step 3: Build a Word2Vec Model 

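Train a Word2Vec model, which learns a dense vector (embedding) for each word in the vocabulary. In the call below, each vector has 200 dimensions, the context window reaches up to 5 words on either side of the current word, words that occur fewer than 5 times are ignored (min_count), and the initial learning rate (alpha) is 0.02.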

In [6]:
from gensim.models import Word2Vec

In [7]:
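# gensim >= 4.0 names the embedding dimensionality vector_size; older releases call it size.
model = Word2Vec(processed_docs,
                 vector_size=200, window=5, min_count=5,
                 alpha=0.02, workers=4)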
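
The roles of min_count and window can be illustrated with a small pure-Python sketch. The helpers build_vocab and context_pairs and the toy documents are hypothetical and exist only for this illustration. gensim's training loop is more involved (for example, it randomly shrinks the effective window and can subsample frequent words), so treat this as a simplified picture rather than the library's algorithm.

```python
from collections import Counter


def build_vocab(docs, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(token for doc in docs for token in doc)
    return {word for word, n in counts.items() if n >= min_count}


def context_pairs(doc, vocab, window):
    """Yield (center, context) pairs at most `window` positions apart."""
    kept = [t for t in doc if t in vocab]  # rare words are removed first
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kept[j]


# Toy corpus (made up for this example)
toy_docs = [
    ["space", "shuttle", "launch", "orbit"],
    ["space", "launch", "orbit", "nasa"],
]
vocab = build_vocab(toy_docs, min_count=2)
pairs = list(context_pairs(toy_docs[0], vocab, window=1))
print(sorted(vocab))
print(pairs)
```

With min_count=2, "shuttle" and "nasa" fall out of the vocabulary, so "space" and "launch" become neighbors in the first document. A larger window produces more (center, context) pairs per word, which is the training signal the embeddings are fitted to.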

## Step 4: Find Similar Words 

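Retrieve the words most similar to a query word with most_similar. It ranks the other words in the vocabulary by the cosine similarity between their vectors and the query word's vector (200 dimensions in this model) and returns the topn highest-scoring words.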
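
As a rough illustration of the ranking idea, the sketch below computes cosine similarity by hand and sorts a tiny vocabulary by it. The three-dimensional toy_vectors and the helper names are invented for this example; they are not taken from the trained model, and gensim's own implementation is vectorized rather than written like this.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn):
    """Return the topn words closest to `query`, most similar first."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical 3-dimensional embeddings, for illustration only
toy_vectors = {
    "physics": [1.0, 2.0, 0.0],
    "chemistry": [2.0, 4.0, 0.0],  # same direction as "physics"
    "optical": [1.0, 0.0, 1.0],    # partly aligned with "physics"
    "staff": [0.0, 0.0, 3.0],      # orthogonal to "physics"
}
ranking = rank_by_similarity("physics", toy_vectors, topn=3)
for word, score in ranking:
    print(word, score)
```

Cosine similarity depends only on the direction of the vectors, not their length, so "chemistry" scores highest here even though its vector is twice as long as the one for "physics".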

In [8]:
interest_word = "physics"
try:
    similar_words = model.wv.most_similar(interest_word, topn=25)
except KeyError:
    # Raised when the word is not in the vocabulary (e.g. it occurs fewer than min_count times)
    print("The word: {} is not in the vocabulary.".format(interest_word))
else:
    for word, similarity in similar_words:
        print(word, similarity)

isbn 0.9995366334915161
increases 0.9995232820510864
south 0.9995055198669434
databases 0.9994800090789795
promote 0.9994581937789917
ad 0.9994315505027771
chemistry 0.9994108080863953
pm 0.9994007349014282
array 0.9993683099746704
standards 0.999338686466217
astronomical 0.99931401014328
aiaa 0.9992887377738953
west 0.9992662668228149
interstellar 0.9992375373840332
review 0.999211311340332
optical 0.9992091059684753
pp 0.9992085099220276
ref 0.9991738200187683
aerobee 0.9991406202316284
cdc 0.9990933537483215
outbreak 0.9990758895874023
staff 0.9990637302398682
newsgroups 0.9990483522415161
aerospace 0.9990434646606445
cell 0.9990332126617432
