# <center>Other NLP Packages: spaCy and Gensim</center>

References: 
- https://nlpforhackers.io/complete-guide-to-spacy/
- https://radimrehurek.com/gensim/models/phrases.html

## 1. spaCy
- spaCy is a relatively new framework in the Python Natural Language Processing ecosystem, but it is quickly gaining popularity
- Provides models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
- Supports 8 languages out of the box
- Provides easy and beautiful visualizations
- Provides pretrained word vectors
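- Installation:
  1. pip install spacy
  2. python -m spacy download en_core_web_sm

In [None]:
# Exercise 1.1. Load package and language library

import spacy

# Load the small English model by its full package name. The 'en' shortcut
# no longer loads in spaCy 3; the full name also works in spaCy 2.
nlp = spacy.load('en_core_web_sm')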


In [None]:
# Exercise 1.2. Get POS tags, lemmas, and other token attributes in one pass

doc = nlp("Next week I'll be in Madrid.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}".format(
        token.text,         # original text
        token.lemma_,       # lemma
        token.is_punct,     # is it punctuation?
        token.is_space,     # is it whitespace?
        token.pos_,         # The simple part-of-speech tag.
        token.tag_          # The detailed part-of-speech tag
    ))

In [None]:
# Exercise 1.3. Segment by sentences

doc = nlp("These are apples. These are oranges.")
 
for sent in doc.sents:
    print(sent)

In [None]:
# Exercise 1.4. Entity Recognition

doc = nlp("Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
# Exercise 1.5. Visualize named entities

from spacy import displacy
 
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)


In [None]:
# Exercise 1.6. Visualize the dependency graph

from spacy import displacy
 
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
 

## 2. gensim
- Gensim is an open source Python library for NLP, with a focus on topic modeling.
- It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling, including 
  - Word2Vec word embedding 
  - Topic modeling
  - Text preprocessing like **phrase extraction**
  
- Gensim Phrase Model: 
    - **gensim.models.phrases.Phrases(sentences, min_count, threshold, max_vocab_size, delimiter, scoring, ...)**
        - *sentences*: list of sentences or iterables, each of which can be a document
        - *min_count*: Ignore all words and bigrams with total collected count lower than this value.
        - *threshold*: Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words $a$ followed by $b$ is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function.
        - *max_vocab_size*: Maximum size (number of tokens) of the vocabulary. 
        - *delimiter*: Glue character used to join collocation tokens. It is a byte string (e.g. `b'_'`) in Gensim 3.x and a plain string (e.g. `'_'`) in Gensim 4 and later.
        - *scoring*: Specifies how potential phrases are scored.
           - **default** - original_scorer(), by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf)
           - **npmi** - npmi_scorer().
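
To make the two scoring options concrete, here is a minimal pure-Python sketch of the formulas that Gensim documents for `original_scorer` and `npmi_scorer`. The function names `original_score` and `npmi_score` and all of the counts are hypothetical, chosen only for illustration; Gensim's own implementations may differ in details such as how counts below `min_count` are handled.

```python
import math

def original_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Mikolov et al. (2013): (count(ab) - min_count) * N / (count(a) * count(b)),
    where N is the vocabulary size."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

def npmi_score(count_a, count_b, count_ab, corpus_word_count):
    """Normalized pointwise mutual information: ln(p(ab) / (p(a) p(b))) / -ln(p(ab))."""
    p_a = count_a / corpus_word_count
    p_b = count_b / corpus_word_count
    p_ab = count_ab / corpus_word_count
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Hypothetical counts for a word pair (a, b); not taken from any real corpus
count_a, count_b, count_ab = 20, 25, 15
vocab_size, corpus_word_count = 1000, 10000

score = original_score(count_a, count_b, count_ab, vocab_size, min_count=5)
npmi = npmi_score(count_a, count_b, count_ab, corpus_word_count)

# A pair is kept as a phrase only if its score exceeds the chosen threshold
print("original score:", score, "accepted at threshold=10:", score > 10)
print("npmi score:", round(npmi, 3), "accepted at threshold=0.4:", npmi > 0.4)
```

The two scores live on different scales: the original score is unbounded above, while NPMI ranges from -1 to 1. This is why the threshold must be chosen to match the scoring function.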

In [None]:
# Exercise 2.1. Find bigrams using gensim

import nltk
from nltk.collocations import *

from gensim.models.phrases import Phrases, Phraser

# Load a built-in NLTK corpus as a list of words
# (requires a one-time nltk.download('gutenberg'))
words = nltk.corpus.gutenberg.words('austen-sense.txt')

# Train phrase model to find phrases using original_scorer
phrases = Phrases([words], min_count=5, threshold=10)

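# Gensim 3.x API: export_phrases(sentences) yields (phrase, score) pairs.
# In Gensim 4 and later, export_phrases() takes no arguments and returns a dict.
for phrase, score in phrases.export_phrases([words]):
    print(phrase, score)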


In [None]:
# Exercise 2.2. Find bigrams by NPMI

# Train the phrase model with NPMI scoring; NPMI scores lie in [-1, 1],
# so the threshold is chosen on that scale

phrases = Phrases([words], min_count=5, threshold=0.4, scoring='npmi')

for phrase, score in phrases.export_phrases([words]):
    print(phrase, score)



In [None]:
# Exercise 2.3. Tokenize by unigrams and bigrams

# Initialize phrase tokenizer
bigram = Phraser(phrases)

sent="As dinner was not to be ready in less than two hours from their arrival,"
print(bigram[nltk.word_tokenize(sent.lower())])