# <center>Other NLP Packages: spaCy, Gensim, and Stanza (Stanford NLP)</center>

References: 
- https://nlpforhackers.io/complete-guide-to-spacy/
- https://radimrehurek.com/gensim/models/phrases.html
- https://stanfordnlp.github.io/stanza/

## 1. spaCy
- spaCy is a relatively new framework in the Python natural language processing ecosystem, and it is quickly gaining popularity
- Provides models for part-of-speech tagging, named entity recognition, and dependency parsing
<img src='https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg' width = "70%">
- Supports 8 languages out of the box
- Provides easy and beautiful visualizations
- Provides pretrained word vectors
- Installation:
  1. `pip install spacy`
  2. `python -m spacy download en_core_web_sm` (the shortcut name `en` is deprecated as of spaCy 3.0, so prefer the full package name)
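
spaCy's pretrained pipelines are installed as ordinary Python packages, so one way to check whether `en_core_web_sm` is available before calling `spacy.load` is to ask the import system. The following is a minimal sketch using only the standard library; the helper name `model_installed` is hypothetical and not part of spaCy.

```python
from importlib.util import find_spec


def model_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    return find_spec(name) is not None


# Prints True only if the model package has been downloaded into this environment
print(model_installed("en_core_web_sm"))
```

If this prints `False`, run the download command from step 2 and restart the kernel.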

In [4]:
# Installation

#! pip install spacy
#! python -m spacy download en_core_web_sm

In [1]:
# Exercise 1.1. Load package and language library

import spacy
nlp = spacy.load('en_core_web_sm')

# Alternatively, import the downloaded model package directly:
#import en_core_web_sm 
#nlp = en_core_web_sm.load()

In [2]:
# Exercise 1.2. Get POS tags, lemmas, and other token attributes in a single pass

doc = nlp("Next week I'll be in Madrid.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}".format(
        token.text,         # original text
        token.lemma_,       # lemma
        token.is_punct,     # is the token punctuation?
        token.is_space,     # is the token whitespace?
        token.pos_,         # coarse-grained (universal) part-of-speech tag
        token.tag_          # fine-grained part-of-speech tag
    ))

Next	next	False	False	ADJ	JJ
week	week	False	False	NOUN	NN
I	I	False	False	PRON	PRP
'll	will	False	False	AUX	MD
be	be	False	False	AUX	VB
in	in	False	False	ADP	IN
Madrid	Madrid	False	False	PROPN	NNP
.	.	True	False	PUNCT	.


In [3]:
# Exercise 1.3. Segment by sentences

doc = nlp("These are apples. These are oranges.")
 
for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.


In [4]:
# Exercise 1.4. Entity Recognition

doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, "\t\t", ent.label_)

2 		 CARDINAL
9 a.m. 		 TIME
30% 		 PERCENT
just 2 days 		 DATE
WSJ 		 ORG


In [5]:
# Exercise 1.5. Visualize named entities

from spacy import displacy
 
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)


In [6]:
# Exercise 1.6. Visualize the dependency graph

from spacy import displacy
 
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
 

## 2. Stanza

Stanza is a Python natural language analysis package. 
- Full neural network pipeline for robust text analytics, including:
    - Tokenization and multi-word token (MWT) expansion
    - Lemmatization
    - Part-of-speech (POS) tagging
    - Dependency parsing
    - Named entity recognition
    - Sentiment analysis
- Pretrained neural models supporting 66 (human) languages
- A stable, officially maintained Python interface to Stanford CoreNLP.

<img src='https://stanfordnlp.github.io/stanza/assets/images/pipeline.png' width="50%" >

In [15]:
# Installation
#! pip install stanza

# import stanza; stanza.download('en')  # one-time download of the English models

In [7]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner,pos,lemma')
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
#doc = nlp("Next week I'll be in Madrid.")
for sentence in doc.sentences:  # segment into sentences
    for word in sentence.words: # tokenize into words
        print("{0}\t{1}\t{2}".format(
            word.text,       # original text
            word.lemma,      # lemma
            word.upos        # universal part-of-speech tag.
        ))
    
    print("\n")
    print("Entities:")
    for ent in sentence.ents: # Get entities
        print("{0}\t{1}".format(
            ent.text,        # original text
            ent.type         # entity type
        ))

2023-02-22 13:19:25 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| lemma     | combined  |
| ner       | ontonotes |

2023-02-22 13:19:25 INFO: Use device: cpu
2023-02-22 13:19:25 INFO: Loading: tokenize
2023-02-22 13:19:26 INFO: Loading: pos
2023-02-22 13:19:26 INFO: Loading: lemma
2023-02-22 13:19:26 INFO: Loading: ner
2023-02-22 13:19:27 INFO: Done loading processors!


I	I	PRON
just	just	ADV
bought	buy	VERB
2	2	NUM
shares	share	NOUN
at	at	ADP
9	9	NUM
a.m.	a.m.	NOUN
because	because	SCONJ
the	the	DET
stock	stock	NOUN
went	go	VERB
up	up	ADP
30	30	NUM
%	%	SYM
in	in	ADP
just	just	ADV
2	2	NUM
days	day	NOUN
according	accord	VERB
to	to	ADP
the	the	DET
WSJ	WSJ	PROPN


Entities:
2	CARDINAL
9 a.m.	TIME
30%	PERCENT
just 2 days	DATE
WSJ	ORG


## 3. Gensim
- Gensim is an open source Python library for NLP, with a focus on topic modeling.
- It is not an everything-including-the-kitchen-sink NLP research library (like NLTK). Instead, Gensim is a mature, focused, and efficient suite of NLP tools, including
  - Word2Vec word embedding 
  - Topic modeling
  - Text preprocessing like **phrase extraction**
  
- Gensim Phrase Model: 
    - `gensim.models.phrases.Phrases(sentences, min_count, threshold, max_vocab_size, delimiter, scoring, ...)`
        - `sentences`: An iterable of tokenized sentences, each a list of tokens. A whole tokenized document can also be passed as a single "sentence".
        - `min_count`: Ignore all words and bigrams with total collected count lower than this value.
        - `threshold`: Score threshold for forming phrases (higher means fewer phrases). A phrase of words $a$ followed by $b$ is accepted if its score is greater than `threshold`. The appropriate value depends heavily on the scoring function.
        - `max_vocab_size`: Maximum size (number of tokens) of the vocabulary.
        - `delimiter`: Glue string used to join collocation tokens (e.g. `'_'`). In gensim versions before 4.0 this had to be a byte string.
        - `scoring`: Specify how potential phrases are scored. 
           - `default` - original_scorer(), by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf)
           - `npmi` - npmi_scorer().
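
To make the two scoring options concrete, here is a minimal pure-Python sketch of the formulas as described in the gensim documentation. The function names `original_score` and `npmi_score` and the toy counts are illustrative assumptions, not gensim's API; gensim collects the counts itself from `sentences`.

```python
from math import log


def original_score(count_a, count_b, count_ab, len_vocab, min_count):
    """Mikolov-style score: (count(ab) - min_count) * len_vocab / (count(a) * count(b))."""
    denom = count_a * count_b
    if denom == 0:
        return float("-inf")  # a word that never occurs cannot form a phrase
    return (count_ab - min_count) / denom * len_vocab


def npmi_score(count_a, count_b, count_ab, corpus_word_count, min_count):
    """Normalized pointwise mutual information: ln(p(ab) / (p(a) p(b))) / -ln(p(ab))."""
    if count_ab == 0 or count_ab < min_count:
        return float("-inf")  # too rare to be considered a phrase
    p_a = count_a / corpus_word_count
    p_b = count_b / corpus_word_count
    p_ab = count_ab / corpus_word_count
    return log(p_ab / (p_a * p_b)) / -log(p_ab)


# Toy counts (hypothetical): words a and b each occur 10 times, always together
print(original_score(10, 10, 10, len_vocab=100, min_count=5))
print(round(npmi_score(10, 10, 10, corpus_word_count=1000, min_count=5), 2))
```

The original score is unbounded and grows with the vocabulary size, whereas NPMI is at most 1. This is why the exercises in this section use `threshold=50` with the default scorer but `threshold=0.5` with `scoring='npmi'`.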

In [16]:
#!pip install gensim

In [8]:
# Read the sample 10-K filing into a single string; the file is closed automatically
with open("file.txt", "r") as f:
    text = f.read()
print(text)

<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>a2032880z10-k.txt
<DESCRIPTION>FORM 10-K
<TEXT>

<PAGE>
                                 UNITED STATES
                       SECURITIES AND EXCHANGE COMMISSION
                             WASHINGTON, D.C. 20549

                            ------------------------

                                   FORM 10-K

(MARK ONE)

<TABLE>
<C>        <S>
   /X/     ANNUAL REPORT PURSUANT TO SECTION 13 OR 15( ) OF THE
           SECURITIES EXCHANGE ACT OF 1934
</TABLE>

                  FOR THE FISCAL YEAR ENDED SEPTEMBER 30, 2000

                                       OR

<TABLE>
<C>        <S>
   / /     TRANSITION REPORT PURSUANT TO SECTION 13 OR 15( ) OF THE
           SECURITIES EXCHANGE ACT OF 1934
</TABLE>

        FOR THE TRANSITION PERIOD FROM ______________ TO ______________

                         COMMISSION FILE NUMBER 0-10030

                            ------------------------

                              APPLE COMPUTER, INC.

   

In [31]:
# Exercise 2.1. Find bigrams using gensim
import nltk
from gensim.models.phrases import Phrases, Phraser


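# Tokenize the lowercased text. A token starts and ends with a word character and
# may contain apostrophes, commas, or hyphens in between (e.g. "10-k", "6,134"),
# so single-character tokens are dropped.
pattern = r'\w[\w\',-]*\w'
words = nltk.regexp_tokenize(text.lower(), pattern)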

# Train phrase model to find phrases using original_scorer
phrases = Phrases([words], min_count=5, threshold=50)

# Get the unique phrases, sorted by score in descending order
# (export_phrases() returns a dict of phrase -> score in gensim 4.x)
items = sorted(phrases.export_phrases().items(), key=lambda item: -item[1])


# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

firmly_committed:	802.83
legal_proceedings:	714.61
fred_anderson:	677.00
lawrence_ellison:	635.21
property_plant:	625.28
gareth_chang:	625.28
probable_but:	588.50
united_states:	584.88
nasdaq_national:	577.18
jerome_york:	577.18
asia_pacific:	574.42
valuation_allowance:	559.69
6,134_5,941:	555.81
japanese_yen:	476.40
g4_cube:	470.80
set_forth:	461.75
millard_drexler:	416.85
arthur_levinson:	416.85
sufficient_quantities:	389.79
public_offering:	370.54
vice_president:	367.51
pro_forma:	357.30
senior_vice:	337.81
accounts_receivable:	336.29
in-process_research:	333.48
professionally_oriented:	333.48
jonathan_rubinstein:	333.48
part_ii:	332.79
mac_os:	330.34
entered_into:	329.37
steven_jobs:	322.73
adversely_affected:	317.60
gross_margin:	303.17
william_campbell:	303.17
agreement_dated:	277.90
hereby_incorporated:	268.44
obtain_sufficient:	267.98
restructuring_actions:	246.82
quarterly_report:	244.68
ii_item:	241.44
multiple_sources:	238.20
committed_transactions:	238.20
forward_contracts:

In [33]:
# Exercise 2.2. Find bigrams by NPMI

# find phrases using NPMI

phrases = Phrases([words], min_count=5, threshold=0.5, scoring='npmi')

# Get the unique phrases, sorted by score in descending order
items = sorted(phrases.export_phrases().items(), key=lambda item: -item[1])

# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

6,134_5,941:	1.00
firmly_committed:	1.00
british_pound:	1.00
pound_sterling:	1.00
gilbert_amelio:	1.00
legal_proceedings:	0.98
united_states:	0.98
japanese_yen:	0.98
final_assembly:	0.98
matching_contributions:	0.98
lawrence_ellison:	0.97
millard_drexler:	0.97
arthur_levinson:	0.97
fred_anderson:	0.96
mac_os:	0.96
gareth_chang:	0.95
pro_forma:	0.95
vice_president:	0.95
property_plant:	0.94
nasdaq_national:	0.94
jerome_york:	0.94
professionally_oriented:	0.94
jonathan_rubinstein:	0.94
probable_but:	0.94
asia_pacific:	0.93
valuation_allowance:	0.93
william_campbell:	0.93
mitchell_mandich:	0.92
set_forth:	0.92
ronald_johnson:	0.91
g4_cube:	0.91
public_offering:	0.91
form_10-k:	0.90
multiple_sources:	0.90
accounts_receivable:	0.90
sufficient_quantities:	0.90
part_ii:	0.90
601_309:	0.89
sets_forth:	0.89
balance_sheets:	0.89
agreement_dated:	0.89
senior_vice:	0.89
fair_value:	0.88
in-process_research:	0.88
intellectual_property:	0.87
entered_into:	0.87
steven_jobs:	0.86
adversely_affected:	0

In [34]:
# Exercise 2.3. Tokenize by unigrams and bigrams

# Freeze the trained NPMI phrase model into a lightweight phrase tokenizer
bigram = Phraser(phrases)

sent="Improved profitability was driven by the 30% increase in net sales, stable overall gross margins in 2000 as compared to 1999, and a relatively modest increase in operating expenses before special charges of 18%."
print(bigram[nltk.word_tokenize(sent.lower())])

['improved', 'profitability', 'was', 'driven', 'by', 'the', '30', '%', 'increase', 'in', 'net_sales', ',', 'stable', 'overall', 'gross_margins', 'in', '2000', 'as_compared', 'to', '1999', ',', 'and', 'a', 'relatively', 'modest', 'increase', 'in', 'operating_expenses', 'before', 'special_charges', 'of', '18', '%', '.']
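
The merging step in Exercise 2.3 can be pictured as a greedy left-to-right scan that joins two adjacent tokens whenever they form an accepted bigram. The sketch below is a simplified, pure-Python illustration under that assumption. The function `merge_bigrams` and the `accepted` set are hypothetical, and the sketch ignores details of gensim's implementation such as connector words.

```python
def merge_bigrams(tokens, accepted, delimiter="_"):
    """Greedily merge adjacent token pairs that appear in the `accepted` set."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in accepted:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # both tokens are consumed and cannot start another phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


accepted = {("net", "sales"), ("gross", "margins")}  # hypothetical phrase set
tokens = ["increase", "in", "net", "sales", ",", "stable", "overall", "gross", "margins"]
print(merge_bigrams(tokens, accepted))
# ['increase', 'in', 'net_sales', ',', 'stable', 'overall', 'gross_margins']
```

Because a merged token is consumed immediately, overlapping candidates such as `a b c` with both `a_b` and `b_c` accepted yield `a_b` followed by `c`.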
