In [1]:
import datetime
start = datetime.datetime.now()

In [2]:
import numpy as np

## Collect Data

In [3]:
import pandas as pd
df = pd.read_csv("Data/spinger_collected_data.csv", sep="|")
title = df['title'].copy()
abstract = df["abstract"].copy()
ids = df.index.values.copy()

In [4]:
print(df.iloc[0])

title       Algal blooms in a shallow lake with a special ...
abstract    Algal blooms are frequently observed in eutrop...
writer                                              M. Mansor
journal                                            GeoJournal
keywords    Water Temperature,Environmental Management,Hig...
url         http://link.springer.com/article/10.1007/BF002...
Name: 0, dtype: object


In [5]:
# random order
shuffled_idx = np.random.permutation(len(abstract))
%time abstract = np.array(abstract)[shuffled_idx]
%time title = np.array(title)[shuffled_idx]    
%time ids = np.array(ids)[shuffled_idx]

CPU times: user 381 µs, sys: 39 µs, total: 420 µs
Wall time: 354 µs
CPU times: user 295 µs, sys: 30 µs, total: 325 µs
Wall time: 258 µs
CPU times: user 72 µs, sys: 7 µs, total: 79 µs
Wall time: 75.1 µs


In [6]:
print( abstract[np.where(ids == 0)[0][0]] )

Algal blooms are frequently observed in eutrophic lakes; a typical example of which is Lake Lindores, where two species of algal blooms were observed during the studied period. The first bloom of Asterionella formosa occured in spring, with a second occurence of the bloom observed in late winter and autumn when the water temperature was fairly low. At a higher water temperature, of more than 15°C, an unwanted blue-green algal bloom of Anabaena flosaquae occured. The blue-green algal bloom normally occured in summer and early autumn.
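
The lookup `np.where(ids == 0)[0][0]` scans the whole shuffled `ids` array every time it is called. If many lookups are needed, one option is to build an id-to-position dictionary once. The following is a minimal sketch on toy data, assuming the ids are unique; the names `shuffled_ids`, `shuffled_texts`, and `position_of` are illustrative and do not appear in the notebook.

```python
import random

# Toy stand-ins for the notebook's `ids` and `abstract` arrays (hypothetical data).
original_ids = list(range(5))
texts = ["doc-{}".format(i) for i in original_ids]

# Shuffle both sequences with the same permutation so they stay aligned.
perm = original_ids[:]
random.Random(42).shuffle(perm)
shuffled_ids = [original_ids[p] for p in perm]
shuffled_texts = [texts[p] for p in perm]

# Build the id -> position map once; each later lookup is a dictionary access.
position_of = {doc_id: pos for pos, doc_id in enumerate(shuffled_ids)}

# Counterpart of abstract[np.where(ids == 0)[0][0]] in the notebook.
first_doc = shuffled_texts[position_of[0]]
print(first_doc)
```

The dictionary trades a small amount of memory for constant-time lookups, which matters mostly when the similarity loops further down resolve many labels.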


## Prepare text

In [7]:
import text_prep

In [8]:
%time abstract_prep = text_prep.TextPrep(abstract, memoryopt=True)
%time abstract_prep.merge(title)
%time abstract_prep.run()

CPU times: user 322 µs, sys: 35 µs, total: 357 µs
Wall time: 359 µs
CPU times: user 3.24 ms, sys: 7.56 ms, total: 10.8 ms
Wall time: 10.6 ms
CPU times: user 1min 10s, sys: 743 ms, total: 1min 11s
Wall time: 1min 10s


In [9]:
print( abstract_prep.text[np.where(ids == 0)[0][0]] )

algal bloom shallow lake special reference lake loch lindores Scotland algal bloom frequent observed eutrophic lake typical example lake lindores species algal bloom observed studied period first bloom asterionella formosa occured spring second occurence bloom observed late winter autumn water temperature fair low high water temperature wanted blue-green algal bloom anabaena flosaquae occured blue-green algal bloom normal occured summer early autumn


## Gensim model

https://rare-technologies.com/word2vec-tutorial/

In [10]:
%time train_corpus = list(abstract_prep.read_corpus(ids))

CPU times: user 997 ms, sys: 4.01 ms, total: 1 s
Wall time: 1 s
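
`text_prep` is a local module whose source is not shown here, so the exact behaviour of `read_corpus(ids)` is unknown. Doc2Vec training expects an iterable of documents that each carry a token list and a list of tags. The sketch below shows one plausible shape for such a generator, assuming the cleaned text is already space-separated and each document is tagged with its id. `TaggedDoc` is a stand-in namedtuple with the same two fields (`words`, `tags`) as gensim's `TaggedDocument`, used only so the snippet runs without gensim installed.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim.models.doc2vec.TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def read_corpus(texts, tags):
    """Yield one tagged document per cleaned text (hypothetical helper)."""
    for text, tag in zip(texts, tags):
        # Assumption: preprocessing already produced space-separated tokens.
        yield TaggedDoc(words=text.split(), tags=[tag])

cleaned = ["algal bloom shallow lake", "grassland soil organic carbon"]
corpus = list(read_corpus(cleaned, tags=[7, 3]))
print(corpus[0])
```

Tagging each document with its original id is what would allow a later `most_similar` call to return labels that can be mapped back to titles.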


In [11]:
# model
import gensim
# alpha -- initial learning rate
# min_alpha -- minimum learning rate; the initial alpha decreases linearly to this value during training
# vector_size -- number of neurons in the hidden layer (the dimensionality of the vectors)
# negative -- number of negative samples drawn
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=400, alpha=0.025, min_alpha=0.001, negative=10 )
%time model.build_vocab(train_corpus)
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 490 ms, sys: 16 ms, total: 506 ms
Wall time: 504 ms
CPU times: user 16min 12s, sys: 10.4 s, total: 16min 23s
Wall time: 5min 47s
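
The comments in the cell above describe the learning rate falling linearly from `alpha` to `min_alpha`. As a rough illustration, the sketch below computes such a schedule once per epoch. This is a simplification: gensim adjusts the rate in finer steps according to training progress, and the function name `linear_alpha` is hypothetical.

```python
def linear_alpha(epoch, epochs, alpha=0.025, min_alpha=0.001):
    """Learning rate at the start of `epoch` (0-based), decayed linearly."""
    progress = epoch / epochs  # fraction of training already completed
    return alpha - (alpha - min_alpha) * progress

# Settings from the cell above: 400 epochs, alpha=0.025, min_alpha=0.001.
schedule = [linear_alpha(e, 400) for e in range(400)]
print(schedule[0], schedule[200], schedule[-1])
```

With these settings the rate starts at 0.025, is about 0.013 halfway through, and approaches 0.001 in the final epoch.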


## Run some tests

In [12]:


model.wv.most_similar([model.docvecs[1219]])



[('efgp', 0.5769450068473816),
 ('pars', 0.5085806846618652),
 ('allometric', 0.49856048822402954),
 ('lr', 0.49125656485557556),
 ('tianshan', 0.481795996427536),
 ('efgc', 0.47992241382598877),
 ('stn', 0.47163310647010803),
 ('mabs', 0.46663936972618103),
 ('uas', 0.46639305353164673),
 ('ecadi', 0.46302029490470886)]
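
The scores in the list above are cosine similarities, here between the vector of document 1219 and the word vectors. A minimal sketch of that ranking in plain Python follows; the vectors and the names `cosine`, `doc_vec`, and `vocab` are made up for illustration and are unrelated to the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (invented values, not taken from the model).
doc_vec = [1.0, 2.0, 0.0]
vocab = {
    "lake": [2.0, 4.0, 0.0],   # same direction as doc_vec
    "urban": [0.0, 0.0, 5.0],  # orthogonal to doc_vec
    "bloom": [1.0, 0.0, 0.0],  # partially aligned
}

# Rank words by similarity to the document vector, highest first.
ranked = sorted(vocab, key=lambda w: cosine(doc_vec, vocab[w]), reverse=True)
print(ranked)
```

Because cosine similarity ignores vector length, `"lake"` ranks first even though its vector is twice as long as `doc_vec`.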

### Grassland article

In [13]:
import re
import random
doc_id = random.randint(0, len(train_corpus) - 1)  # fallback: a random document if no title matches
for x in range(len(title)):
    if re.search("SOC storage and potential of grasslands", title[x]):
        doc_id = x
        break

shuf_id = ids[doc_id]
print('Original article ({}): \n\t{}\n\t{}\n'.format(doc_id, title[doc_id], abstract[doc_id]))

# get the similar docs
sim_ids = model.docvecs.most_similar([shuf_id], topn=10)
for label, distance in sim_ids:
    matchid = np.where(ids == label)[0][0]
    print('Similar article ({}, {}): \n\t{}\n'.format(matchid, 
                                                      distance, 
                                                      title[matchid]))

Original article (2368): 
	SOC storage and potential of grasslands from 2000 to 2012 in central and eastern Inner Mongolia, China
	Grassland ecosystem is an important component of the terrestrial carbon cycle system. Clear comprehension of soil organic carbon (SOC) storage and potential of grasslands is very important for the effective management of grassland ecosystems. Grasslands in Inner Mongolia have undergone evident impacts from human activities and natural factors in recent decades. To explore the changes of carbon sequestration capacity of grasslands from 2000 to 2012, we carried out studies on the estimation of SOC storage and potential of grasslands in central and eastern Inner Mongolia, China based on field investigations and MODIS image data. First, we calculated vegetation cover using the dimidiate pixel model based on MODIS-EVI images. Following field investigations of aboveground biomass and plant height, we used a grassland quality evaluation model to get the grassland eval

### Algal bloom article

In [14]:
shuf_id = 0
doc_id = np.where(ids == shuf_id)[0][0]
print('Train Document ({}): «{}»\n'.format(shuf_id, abstract[doc_id]))

# get the similar docs
sim_ids = model.docvecs.most_similar([shuf_id], topn=10)
for label, distance in sim_ids:
    matchid = np.where(ids == label)[0][0]
    print('Similar Document ({}, {}): \n\t{}\n'.format(matchid, 
                                                 distance, 
                                                 title[matchid]))

Train Document (0): «Algal blooms are frequently observed in eutrophic lakes; a typical example of which is Lake Lindores, where two species of algal blooms were observed during the studied period. The first bloom of Asterionella formosa occured in spring, with a second occurence of the bloom observed in late winter and autumn when the water temperature was fairly low. At a higher water temperature, of more than 15°C, an unwanted blue-green algal bloom of Anabaena flosaquae occured. The blue-green algal bloom normally occured in summer and early autumn.»

Similar Document (1758, 0.5879272818565369): 
	Chemical composition of waters and the phytoplankton of the lakes within the delta of the Selenga river

Similar Document (5482, 0.5406219959259033): 
	Hydrochemical and microbiological characteristics of bog ecosystems on the isthmus of Svyatoi Nos Peninsula (Lake Baikal)

Similar Document (3075, 0.5372505784034729): 
	The extension of Ebenezer Howard's ideas on urbanization outside the 

## Save model

In [15]:
model.save('gensim_doc2vec.pkl')


In [16]:
end = datetime.datetime.now()
print("Duration: ", end-start)

Duration:  0:07:00.324153
