# RAMAZAN ARSLAN - CMPE 492 PROJECT

### Word2Vec with NeuroBoun Corpus


The [Neuroboun](https://www.neuroboun.com) corpus used here consists of 37298 articles about the amygdala, a region of the brain, retrieved through [PUBMED](https://www.ncbi.nlm.nih.gov/pubmed/).
In Neuroboun, you can enter any keyword and see the rate of articles that use it.





***

### Import

First, we import gensim (for Word2Vec), logging, and the Django framework.


In [37]:
import gensim 
import logging
import os,django
os.environ.setdefault("DJANGO_SETTINGS_MODULE","neuroboun.settings")
django.setup()

***
### Adding Logging Rule

In [38]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


---
### Article Model
We use the `Article` model, which has the following fields:
- abstract = models.TextField(null=True)
- year = models.IntegerField()
- pubmed_id = models.IntegerField(primary_key=True)
- doi = models.TextField(null=True)
- title = models.TextField()
- keywords = models.TextField()

Import the model and fetch all of its objects:


In [39]:
from neuroextractor.models import Article
articles=Article.objects.all()

---
### Read articles into a list
Now that we've had a sneak peek at our dataset, we can read it into a list and pass that list on to the Word2Vec model.

We tokenize each article title with `gensim.utils.simple_preprocess(line.title)`.

This does some basic pre-processing such as **tokenization** and **lowercasing**, and returns a list of tokens (words).

- read the tokenized article titles into a list
- each title becomes a series of words
- so the result is a list of lists
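
To make the "list of lists" concrete, here is a minimal sketch of the same idea without gensim. `simple_tokenize` is a hypothetical stand-in for `simple_preprocess`, not gensim's actual code, and the two titles are made up for illustration. The length bounds (2 to 15 characters) are meant to mirror gensim's defaults, which is why a one-letter word like "a" would be dropped.

```python
import re

# Hypothetical stand-in for gensim.utils.simple_preprocess; not gensim's code.
# Matches runs of letters only (no digits, no underscores).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def simple_tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within the length bounds."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


# Made-up titles, standing in for Article.title values.
titles = [
    "The Amygdala and Fear Conditioning",
    "Role of the basolateral amygdala in memory",
]

# One token list per title, so the result is a list of lists.
documents = [simple_tokenize(title) for title in titles]
print(documents)
```

Each inner list is what Word2Vec treats as one "sentence".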


In [40]:
def read_input(articles):
       
    logging.info("reading NeuroBoun ...this may take a while")    
    
    for i, line in enumerate (articles): 
        if (i%1000==0):
            logging.info ("read {0} reviews".format (i))

        # do some pre-processing and return a list of words for each article title
        yield gensim.utils.simple_preprocess (line.title)
        
documents = list (read_input (articles))
logging.info ("Done reading data file")    


2018-11-02 14:26:41,811 : INFO : reading NeuroBoun ...this may take a while
2018-11-02 14:26:42,356 : INFO : read 0 reviews
2018-11-02 14:26:42,383 : INFO : read 1000 reviews
2018-11-02 14:26:42,411 : INFO : read 2000 reviews
2018-11-02 14:26:42,440 : INFO : read 3000 reviews
2018-11-02 14:26:42,466 : INFO : read 4000 reviews
2018-11-02 14:26:42,491 : INFO : read 5000 reviews
2018-11-02 14:26:42,523 : INFO : read 6000 reviews
2018-11-02 14:26:42,548 : INFO : read 7000 reviews
2018-11-02 14:26:42,575 : INFO : read 8000 reviews
2018-11-02 14:26:42,604 : INFO : read 9000 reviews
2018-11-02 14:26:42,633 : INFO : read 10000 reviews
2018-11-02 14:26:42,662 : INFO : read 11000 reviews
2018-11-02 14:26:42,690 : INFO : read 12000 reviews
2018-11-02 14:26:42,719 : INFO : read 13000 reviews
2018-11-02 14:26:42,755 : INFO : read 14000 reviews
2018-11-02 14:26:42,856 : INFO : read 15000 reviews
2018-11-02 14:26:42,883 : INFO : read 16000 reviews
2018-11-02 14:26:42,912 : INFO : read 17000 reviews

---
## Training the Word2Vec model



Training the model is fairly straightforward. You instantiate Word2Vec and pass it the article titles read in the previous step (the `documents`). In other words, we pass a list of lists, where each inner list contains the tokens of one article title. Word2Vec uses all these tokens to build its vocabulary internally, that is, the set of unique words.

When `documents` is passed to the constructor, gensim builds the vocabulary and already runs a first round of training; the explicit call to `train(...)` then continues training for 10 more epochs. Training on a large dataset such as [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) takes about 10 minutes. The NeuroBoun titles form a much smaller corpus (520241 raw words in the log below), and an epoch takes about 0.4 s.

Behind the scenes we are actually training a simple neural network with a single hidden layer. But, we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. 
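
Why are the hidden-layer weights the word vectors? A rough sketch, assuming the usual picture of a one-hot input multiplied by the hidden weight matrix: the product simply selects one row of the matrix. The vocabulary and all weight values below are made up for illustration, and this is not how gensim stores or computes anything internally.

```python
# Toy vocabulary and weight matrix; all values are made up for illustration.
vocab = ["amygdala", "fear", "memory"]
hidden_weights = [
    [0.2, -0.1, 0.7],   # row for "amygdala"
    [0.5, 0.3, -0.4],   # row for "fear"
    [-0.6, 0.8, 0.1],   # row for "memory"
]


def one_hot(word):
    """Return a vector with 1.0 at the word's index and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


def hidden_activation(word):
    """Multiply the one-hot input by the weight matrix (vector times matrix)."""
    x = one_hot(word)
    dims = len(hidden_weights[0])
    return [
        sum(x[i] * hidden_weights[i][d] for i in range(len(vocab)))
        for d in range(dims)
    ]


# The hidden activation for a word is exactly that word's row of weights.
print(hidden_activation("fear"))
```

So after training, looking up a word's vector amounts to reading its row from the weight matrix.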

In [41]:
model = gensim.models.Word2Vec (documents, size=150, window=5, min_count=2, workers=10)
model.train(documents,total_examples=len(documents),epochs=10)

w1 = "amygdala"
model.wv.most_similar (positive=w1)

2018-11-02 14:26:50,351 : INFO : collecting all words and their counts
2018-11-02 14:26:50,352 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-02 14:26:50,402 : INFO : PROGRESS: at sentence #10000, processed 139655 words, keeping 8950 word types
2018-11-02 14:26:50,453 : INFO : PROGRESS: at sentence #20000, processed 279895 words, keeping 12488 word types
2018-11-02 14:26:50,487 : INFO : PROGRESS: at sentence #30000, processed 419232 words, keeping 14864 word types
2018-11-02 14:26:50,509 : INFO : collected 16148 word types from a corpus of 520241 raw words and 37298 sentences
2018-11-02 14:26:50,510 : INFO : Loading a fresh vocabulary
2018-11-02 14:26:50,537 : INFO : effective_min_count=2 retains 9572 unique words (59% of original 16148, drops 6576)
2018-11-02 14:26:50,538 : INFO : effective_min_count=2 leaves 513665 word corpus (98% of original 520241, drops 6576)
2018-11-02 14:26:50,567 : INFO : deleting the raw counts dictionary of 16148 items

2018-11-02 14:26:53,473 : INFO : worker thread finished; awaiting finish of 9 more threads
2018-11-02 14:26:53,487 : INFO : worker thread finished; awaiting finish of 8 more threads
2018-11-02 14:26:53,523 : INFO : worker thread finished; awaiting finish of 7 more threads
2018-11-02 14:26:53,525 : INFO : worker thread finished; awaiting finish of 6 more threads
2018-11-02 14:26:53,532 : INFO : worker thread finished; awaiting finish of 5 more threads
2018-11-02 14:26:53,536 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-11-02 14:26:53,539 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-11-02 14:26:53,543 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-11-02 14:26:53,551 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-11-02 14:26:53,558 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-11-02 14:26:53,559 : INFO : EPOCH - 2 : training on 520241 raw words (383093 effectiv

2018-11-02 14:26:56,174 : INFO : EPOCH - 9 : training on 520241 raw words (382976 effective words) took 0.4s, 1025205 effective words/s
2018-11-02 14:26:56,503 : INFO : worker thread finished; awaiting finish of 9 more threads
2018-11-02 14:26:56,505 : INFO : worker thread finished; awaiting finish of 8 more threads
2018-11-02 14:26:56,506 : INFO : worker thread finished; awaiting finish of 7 more threads
2018-11-02 14:26:56,522 : INFO : worker thread finished; awaiting finish of 6 more threads
2018-11-02 14:26:56,523 : INFO : worker thread finished; awaiting finish of 5 more threads
2018-11-02 14:26:56,531 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-11-02 14:26:56,539 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-11-02 14:26:56,543 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-11-02 14:26:56,546 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-11-02 14:26:56,550 : INFO : worker threa

[('amygdalar', 0.6904447078704834),
 ('amygdaloid', 0.6506998538970947),
 ('dorsal', 0.5091315507888794),
 ('septum', 0.4866825044155121),
 ('ventral', 0.45731252431869507),
 ('bla', 0.45720943808555603),
 ('geniculate', 0.4503151476383209),
 ('habenular', 0.44185322523117065),
 ('accumbens', 0.4305647313594818),
 ('amygdale', 0.4303969144821167)]

## Now, let's look at some output 
This first example shows a simple case of looking up words similar to the word `amygdala`. All we need to do here is to call the `most_similar` function and provide the word `amygdala` as the positive example. This returns the top 10 similar words. 

In [30]:

w1 = "amygdala"
model.wv.most_similar (positive=w1)


[('amygdalar', 0.6844335794448853),
 ('amygdaloid', 0.6229093074798584),
 ('ventral', 0.5053435564041138),
 ('septum', 0.4974673390388489),
 ('bla', 0.4948784112930298),
 ('geniculate', 0.48149192333221436),
 ('dorsal', 0.480157732963562),
 ('amygdale', 0.4364165663719177),
 ('prelimbic', 0.4177556037902832),
 ('habenula', 0.4051814079284668)]

In [42]:
# similarity between two different words
model.wv.similarity(w1="amygdala",w2="amygdaloid")

0.65069985

In [36]:
# similarity between two unrelated words
model.wv.similarity(w1="amygdala",w2="brain")

0.10858928
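
The scores returned by `similarity` and `most_similar` are cosine similarities between word vectors. Here is a minimal pure-Python sketch of that computation. The three-dimensional vectors are made up (the real model uses 150 dimensions), and `most_similar` below is a simplified, hypothetical re-implementation for a single query word, not gensim's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "amygdala": [0.9, 0.1, 0.3],
    "amygdaloid": [0.8, 0.2, 0.4],
    "brain": [0.1, 0.9, 0.2],
}

print(most_similar("amygdala", toy_vectors))
```

A score close to 1 means the two vectors point in nearly the same direction; a score near 0 means they are close to orthogonal.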

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [43]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["male","female","animal"])

'animal'
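
How can a model pick the odd one out? One way to think about it, and as far as I understand roughly what `doesnt_match` does, is to average the unit-length vectors of all the words and return the word whose vector agrees least with that average. The sketch below follows that idea with made-up two-dimensional vectors; `odd_one_out` is a hypothetical helper, not gensim's implementation.

```python
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(words, vectors):
    """Return the word whose unit vector has the lowest dot product with the mean."""
    normalized = {w: unit(vectors[w]) for w in words}
    dims = len(normalized[words[0]])
    mean = [sum(normalized[w][d] for w in words) / len(words) for d in range(dims)]

    def score(word):
        return sum(a * b for a, b in zip(normalized[word], mean))

    return min(words, key=score)


# Made-up toy vectors: "male" and "female" point in similar directions.
toy = {
    "male": [1.0, 0.1],
    "female": [0.9, 0.2],
    "animal": [0.1, 1.0],
}

print(odd_one_out(["male", "female", "animal"], toy))
```

With only three words the average is dominated by the two that resemble each other, so the third one stands out.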

## Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

```
model = gensim.models.Word2Vec (documents, size=150, window=5, min_count=2, workers=10)
```

### `size`
The dimensionality of the dense vector that represents each token or word. If you have very limited data, then `size` should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me. In gensim 4.0 and later, this parameter is named `vector_size`.

### `window`
The maximum distance between the target word and a neighboring word. Neighbors that lie farther from the target than the window width, to the left or to the right, are not considered related to the target word. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, the window size should not matter too much, as long as it's a decent-sized window.
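
To see what the window does, here is a small sketch that lists the (target, neighbor) pairs a given window would produce for one tokenized title. `context_pairs` is a hypothetical helper and the tokens are made up. The sketch uses the full window for every target, whereas real implementations may sample a smaller effective window per target word.

```python
def context_pairs(tokens, window):
    """List (target, neighbor) pairs for neighbors at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((target, tokens[j]))
    return pairs


# Made-up tokenized title.
tokens = ["basolateral", "amygdala", "fear", "conditioning"]

for pair in context_pairs(tokens, window=1):
    print(pair)
```

With `window=1`, "amygdala" is paired only with "basolateral" and "fear". With `window=5`, which is wider than this short title, every word is paired with every other word.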

### `min_count`
Minimum frequency count of words. The model ignores words that occur fewer than `min_count` times. Extremely infrequent words are usually unimportant, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.
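
The effect of `min_count` is the same kind of filtering reported in the training log above, where `effective_min_count=2` retains 9572 of the 16148 word types. A minimal sketch of that filtering step, with made-up documents; `build_vocab` here is a hypothetical helper, not the gensim method of the same name.

```python
from collections import Counter


def build_vocab(documents, min_count=2):
    """Count tokens across all documents and keep those seen at least min_count times."""
    counts = Counter(token for doc in documents for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


# Made-up tokenized titles.
toy_documents = [
    ["amygdala", "fear", "conditioning"],
    ["amygdala", "memory"],
    ["fear", "extinction"],
]

vocab = build_vocab(toy_documents, min_count=2)
print(vocab)
```

Words that appear only once ("conditioning", "memory", "extinction") are dropped, so the model never learns a vector for them.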

### `workers`
The number of worker threads used for training behind the scenes.
