<a href="https://colab.research.google.com/github/chu-ise/411A-2022/blob/main/notebooks/05/01_text_parsing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part-of-Speech Tagging, Dependency and Constituency Parsing

In [None]:
%%capture
# INSTALLATION (the %%capture cell magic must be the first line of the cell)
%pip install -U spacy stanza spacy-stanza
!python -m spacy download en_core_web_sm

In [1]:
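import gdown

# Google Drive file ID of the FOMC speech dataset
# (named file_id to avoid shadowing the built-in id function)
file_id = "14t4uoStGbRLMTu0GaNDt44UsbYs3Y4Ku"

data_file = "fomc_speech.csv"
gdown.download(id=file_id, output=data_file, quiet=False, fuzzy=True)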

In [2]:
import pandas as pd
df = pd.read_csv(data_file)
df.dropna(inplace=True)
df.head()

Unnamed: 0,date,speaker,title,text,content_type
0,1996-06-13,Chairman Alan Greenspan,Bank supervision in a world economy,Remarks by Chairman Alan Greenspan Bank superv...,fomc_speech
1,1996-06-18,"Governor Edward W. Kelley, Jr.",Developments in electronic money and banking,"Remarks by Governor Edward W. Kelley, Jr. Deve...",fomc_speech
2,1996-09-08,Governor Laurence H. Meyer,Monetary policy objectives and strategy,Monetary Policy Objectives and Strategy\n\nI w...,fomc_speech
3,1996-09-19,Chairman Alan Greenspan,Regulation and electronic payment systems,Remarks by Chairman Alan Greenspan Regulation ...,fomc_speech
4,1996-10-02,Governor Lawrence B. Lindsey,Small business is big business,Remarks by Governor Lawrence B. Lindsey At the...,fomc_speech


In [3]:
import re
# example text
text = df.text[30]
text = re.sub(r'\n+', '\n', text)
text

'Financial Reform and the Importance of a Decentralized Banking Structure\nAs always, it is a pleasure to address this convention of the Independent Bankers Association of America. This is the sixth year I have addressed this convention, and during that time four separate Congresses have debated how best to reform the financial system. I last spoke to you about financial reform in 1994, in Orlando, and it is clear that the real world occurrences of the past three years have not diminished the relevance of those words. Therefore, I shall reemphasize some of those thoughts today in the context of legislative proposals that are now before the current Congress.\nLet me begin by reiterating the essential thrust of the Federal Reserve\x92s position regarding financial reform. We believe that any changes, either in regulation or legislation, should be consistent with four basic objectives: (1) continuing the safety and soundness of the banking system; (2) limiting systemic risk; (3) contribut

### SpaCy

In [4]:
import spacy
nlp_spacy = spacy.load('en_core_web_sm')
 
sent='I shall reemphasize some of those thoughts today in the context of legislative proposals that are now before the current Congress.'

for token in nlp_spacy(sent):
  print(token.text, '=>',token.pos_,'=>',token.tag_)

I => PRON => PRP
shall => AUX => MD
reemphasize => VERB => VB
some => PRON => DT
of => ADP => IN
those => DET => DT
thoughts => NOUN => NNS
today => NOUN => NN
in => ADP => IN
the => DET => DT
context => NOUN => NN
of => ADP => IN
legislative => ADJ => JJ
proposals => NOUN => NNS
that => PRON => WDT
are => AUX => VBP
now => ADV => RB
before => ADP => IN
the => DET => DT
current => ADJ => JJ
Congress => PROPN => NNP
. => PUNCT => .


In [5]:
from spacy import displacy
# svg = displacy.render(nlp_spacy(sent), jupyter=False)
# fig_file = 'dep.svg'
# open(fig_file, "w", encoding="utf-8").write(svg)
displacy.render(nlp_spacy(sent), jupyter=True)

### Stanza

In [56]:
import stanza

# Download English language model and initialize the NLP pipeline.
stanza.download('en')
nlp_stanza = stanza.Pipeline('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-06 00:51:21 INFO: Downloading default packages for language: en (English)...
2022-04-06 00:51:22 INFO: File exists: /root/stanza_resources/en/default.zip.
2022-04-06 00:51:31 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-06 00:51:31 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-04-06 00:51:31 INFO: Use device: cpu
2022-04-06 00:51:31 INFO: Loading: tokenize
2022-04-06 00:51:32 INFO: Loading: pos
2022-04-06 00:51:32 INFO: Loading: lemma
2022-04-06 00:51:32 INFO: Loading: depparse
2022-04-06 00:51:34 INFO: Loading: sentiment
2022-04-06 00:51:34 INFO: Loading: constituency
2022-04-06 00:51:35 INFO: Loading: ner
2022-04-06 00:51:36 INFO: Done loading processors!


In [58]:
doc = nlp_stanza(sent)
# sentence is used as the loop variable so the input string sent is not shadowed
print(
    *[
        f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sentence.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}'
        for sentence in doc.sentences
        for word in sentence.words
    ],
    sep="\n",
)

id: 1	word: Therefore	head id: 5	head: reemphasize	deprel: advmod
id: 2	word: ,	head id: 5	head: reemphasize	deprel: punct
id: 3	word: I	head id: 5	head: reemphasize	deprel: nsubj
id: 4	word: shall	head id: 5	head: reemphasize	deprel: aux
id: 5	word: reemphasize	head id: 0	head: root	deprel: root
id: 6	word: some	head id: 5	head: reemphasize	deprel: obj
id: 7	word: of	head id: 9	head: thoughts	deprel: case
id: 8	word: those	head id: 9	head: thoughts	deprel: det
id: 9	word: thoughts	head id: 6	head: some	deprel: nmod
id: 10	word: today	head id: 5	head: reemphasize	deprel: obl:tmod
id: 11	word: in	head id: 13	head: context	deprel: case
id: 12	word: the	head id: 13	head: context	deprel: det
id: 13	word: context	head id: 5	head: reemphasize	deprel: obl
id: 14	word: of	head id: 16	head: proposals	deprel: case
id: 15	word: legislative	head id: 16	head: proposals	deprel: amod
id: 16	word: proposals	head id: 13	head: context	deprel: nmod
id: 17	word: that	head id: 23	head: Congress	deprel: nsu
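
The `head id` column above is a 1-based index into the same sentence, with 0 reserved for the root. As a minimal sketch of how that table encodes a tree, the snippet below copies the first nine rows of the output by hand (no Stanza call is made) and walks from a word up to the root. The helper name `path_to_root` is hypothetical and not part of Stanza.

```python
# (id, word, head id, deprel) rows copied from the Stanza output above
rows = [
    (1, "Therefore", 5, "advmod"),
    (2, ",", 5, "punct"),
    (3, "I", 5, "nsubj"),
    (4, "shall", 5, "aux"),
    (5, "reemphasize", 0, "root"),
    (6, "some", 5, "obj"),
    (7, "of", 9, "case"),
    (8, "those", 9, "det"),
    (9, "thoughts", 6, "nmod"),
]

words = {wid: text for wid, text, _, _ in rows}
heads = {wid: head for wid, _, head, _ in rows}


def path_to_root(word_id):
    """Follow head ids upward until the root (head id 0) is reached."""
    path = [words[word_id]]
    while heads[word_id] != 0:
        word_id = heads[word_id]
        path.append(words[word_id])
    return path


# Invert the head map to list each word's direct dependents
children = {}
for wid, _, head, _ in rows:
    if head != 0:
        children.setdefault(words[head], []).append(words[wid])

print(path_to_root(8))       # ['those', 'thoughts', 'some', 'reemphasize']
print(children["thoughts"])  # ['of', 'those']
```

The same bookkeeping is what `sentence.words[word.head-1]` does in the cell above: subtracting one converts the 1-based head id into a list index.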

In [59]:
doc = nlp_stanza(sent)
doc.sentences[0].constituency

(ROOT (S (ADVP (RB Therefore)) (, ,) (NP (PRP I)) (VP (MD shall) (VP (VB reemphasize) (NP (NP (DT some)) (PP (IN of) (NP (DT those) (NNS thoughts)))) (NP (NN today)) (PP (IN in) (NP (NP (DT the) (NN context)) (PP (IN of) (NP (NP (JJ legislative) (NNS proposals)) (SBAR (WHNP (WDT that)) (S (VP (VBP are) (ADVP (RB now)) (PP (IN before) (NP (DT the) (JJ current) (NNP Congress)))))))))))) (. .)))
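
The bracketed string above is a nested tree: each `(LABEL ...)` group is a constituent and the innermost groups hold the words. The sketch below is one possible way to read such a string in plain Python and list every noun phrase. It works only on the printed text, not on Stanza's own tree object, and the helper names `parse_brackets`, `leaves`, and `phrases` are hypothetical. The fragment is copied from the output above.

```python
def parse_brackets(s):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    _, children = node
    out = []
    for child in children:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out


def phrases(node, wanted):
    """Yield the text of every subtree whose label equals `wanted`."""
    label, children = node
    if label == wanted:
        yield " ".join(leaves(node))
    for child in children:
        if isinstance(child, tuple):
            yield from phrases(child, wanted)


fragment = "(NP (NP (DT some)) (PP (IN of) (NP (DT those) (NNS thoughts))))"
tree = parse_brackets(fragment)
print(list(phrases(tree, "NP")))
# ['some of those thoughts', 'some', 'those thoughts']
```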

### Tokenization

spaCy first tokenizes the text, that is, it segments the text into words, punctuation marks, and so on. It does this by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off, whereas "U.K." should remain one token.
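
A minimal sketch of this behaviour, assuming spaCy v3 is installed: a blank English pipeline contains only the rule-based tokenizer, so no trained model is needed. The example sentence and the variable name `nlp_blank` are made up for illustration.

```python
import spacy

# A blank English pipeline holds just the tokenizer and its language rules
nlp_blank = spacy.blank("en")

tokens = [t.text for t in nlp_blank("She moved to the U.K. last year.")]
print(tokens)

# The abbreviation stays whole, while the sentence-final period is split off
print("U.K." in tokens, tokens[-1] == ".")
```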

##### Lexeme - entries in the vocabulary

In [60]:
# import a list of stop words from SpaCy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load('en_core_web_sm')

print('Example stop words: {}'.format(list(STOP_WORDS)[0:10]))

Example stop words: ['itself', 'whereas', 'therein', 'put', 'under', 'latter', 'name', 'take', 'whom', 'same']


In [61]:
nlp.vocab['have']

<spacy.lexeme.Lexeme at 0x7f069e6a9f00>

In [62]:
print(dir(nlp.vocab['have']))

['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'check_flag', 'cluster', 'flags', 'from_bytes', 'has_vector', 'is_alpha', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'norm', 'norm_', 'orth', 'orth_', 'prefix', 'prefix_', 'prob', 'rank', 'sentiment', 'set_attrs', 'set_flag', 'shape', 'shape_', 'similarity', 'suffix', 'suffix_', 'text', 'to_bytes', 'vector', 'vector_norm', 'vocab']


In [63]:
nlp.vocab['have'].is_stop

True

In [64]:
# search for word in the SpaCy vocabulary and
# change the is_stop attribute to True

for word in STOP_WORDS:
    nlp.vocab[word].is_stop = True

### Part-of-speech (POS) Tagging

In [80]:
# review document
doc = nlp(text)
doc[:50]

Financial Reform and the Importance of a Decentralized Banking Structure
As always, it is a pleasure to address this convention of the Independent Bankers Association of America. This is the sixth year I have addressed this convention, and during that time four separate Congresses have debated

In [67]:
# check if POS tags were added to the doc in the NLP pipeline
# (spaCy v3 replaces the deprecated doc.is_tagged with has_annotation)
doc.has_annotation("TAG")

True

In [81]:
# print column headers
print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} | '.format(
    'TEXT','LEMMA_','POS_','TAG_','DEP_','SHAPE_','IS_ALPHA','IS_STOP'))

# print various SpaCy POS attributes
for token in doc[:50]:
    print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} |'.format(
          token.text, token.lemma_, token.pos_, token.tag_, token.dep_
        , token.shape_, token.is_alpha, token.is_stop))

TEXT            | LEMMA_          | POS_     | TAG_     | DEP_        | SHAPE_   | IS_ALPHA | IS_STOP  | 
Financial       | Financial       | PROPN    | NNP      | compound    | Xxxxx    |        1 |        0 |
Reform          | Reform          | PROPN    | NNP      | ROOT        | Xxxxx    |        1 |        0 |
and             | and             | CCONJ    | CC       | cc          | xxx      |        1 |        1 |
the             | the             | DET      | DT       | det         | xxx      |        1 |        1 |
Importance      | importance      | NOUN     | NN       | conj        | Xxxxx    |        1 |        0 |
of              | of              | ADP      | IN       | prep        | xx       |        1 |        1 |
a               | a               | DET      | DT       | det         | x        |        1 |        1 |
Decentralized   | decentralized   | ADJ      | JJ       | amod        | Xxxxx    |        1 |        0 |
Banking         | Banking         | PROPN    | NNP    

##### create (adjective --> noun) phrases from parts of speech

In [85]:
previous_token = doc[0]  # set first token

for token in doc[1:]:    
    # identify adjective noun pairs
    if previous_token.pos_ == 'ADJ' and token.pos_ == 'NOUN':
        print(f'{previous_token.text}_{token.text}')
    
    previous_token = token

sixth_year
financial_system
financial_reform
real_world
legislative_proposals
essential_thrust
financial_reform
basic_objectives
systemic_risk
macroeconomic_stability
moral_hazard
financial_reform
such_reform
critical_role
large_numbers
small_banks
separate_banking
industrialized_nations
domestic_banking
diverse_banking
direct_result
first_edition
natural_liberty
economic_choice
late_1980s
highest_levels
half_century
ongoing_pace
total_number
large_number
smaller_community
average_size
nonfinancial_businesses
diverse_nature
strong_connection
Smaller_banks
small_businesses
new_businesses
new_technology
new_firms
old_firms
new_firms
perennial_gale
creative_destruction"--is
general_safety
many_times
optimal_degree
financial_system
other_hand
essential_risk
risky_credit
Optimal_risk
optimal_risk
economic_luck
such_failure
natural_process
competitive_system
financial_system
regulatory_roadblocks
smaller_banks
national_counterparts
large_bank
new_market
smaller_institution
small_businesses
l
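
The pairing logic can also be tried without loading a model. The sketch below applies the same adjacent ADJ + NOUN rule to a short list of (text, POS) pairs copied from the tagger output in this notebook, and lower-cases the result so that variants such as `Smaller_banks` and `smaller_banks` would be counted together. The function name `adj_noun_pairs` is hypothetical.

```python
from collections import Counter

# (text, coarse POS) pairs copied from the tagger output in this notebook
tagged = [
    ("This", "DET"), ("is", "AUX"), ("the", "DET"), ("sixth", "ADJ"),
    ("year", "NOUN"), ("I", "PRON"), ("have", "AUX"), ("addressed", "VERB"),
    ("this", "DET"), ("convention", "NOUN"),
]


def adj_noun_pairs(tagged_tokens):
    """Return lower-cased 'adjective_noun' strings for adjacent ADJ + NOUN tokens."""
    return [
        f"{prev_text}_{text}".lower()
        for (prev_text, prev_pos), (text, pos) in zip(tagged_tokens, tagged_tokens[1:])
        if prev_pos == "ADJ" and pos == "NOUN"
    ]


print(adj_noun_pairs(tagged))  # ['sixth_year']

# Counting a few of the pairs printed above shows which phrases recur
sample = ["financial_system", "financial_reform", "real_world", "financial_reform"]
print(Counter(sample).most_common(1))  # [('financial_reform', 2)]
```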

##### word sense disambiguation via part of speech tags

In [86]:
for token in doc[0:50]:
    print(f'{token.text}_{token.pos_}')

Financial_PROPN
Reform_PROPN
and_CCONJ
the_DET
Importance_NOUN
of_ADP
a_DET
Decentralized_ADJ
Banking_PROPN
Structure_PROPN

_SPACE
As_SCONJ
always_ADV
,_PUNCT
it_PRON
is_AUX
a_DET
pleasure_NOUN
to_PART
address_VERB
this_DET
convention_NOUN
of_ADP
the_DET
Independent_PROPN
Bankers_PROPN
Association_PROPN
of_ADP
America_PROPN
._PUNCT
This_DET
is_AUX
the_DET
sixth_ADJ
year_NOUN
I_PRON
have_AUX
addressed_VERB
this_DET
convention_NOUN
,_PUNCT
and_CCONJ
during_ADP
that_DET
time_NOUN
four_NUM
separate_ADJ
Congresses_PROPN
have_AUX
debated_VERB
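
Appending the tag to the token matters most for words that occur with more than one part of speech. As a small illustration, the sketch below collects (text, POS) pairs that appear in the spaCy outputs of this notebook and reports the words seen with several tags. The pairs are copied by hand, so this is not a general disambiguation method.

```python
from collections import defaultdict

# (text, POS) pairs taken from the spaCy outputs earlier in this notebook
observed = [
    ("that", "PRON"),    # "proposals that are now before ..."
    ("that", "DET"),     # "during that time"
    ("some", "PRON"),    # "some of those thoughts"
    ("those", "DET"),
    ("address", "VERB"),
]

tags_by_word = defaultdict(set)
for text, pos in observed:
    tags_by_word[text.lower()].add(pos)

# Words seen with more than one tag are where "word_POS" keys add information
ambiguous = {w: sorted(tags) for w, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)  # {'that': ['DET', 'PRON']}
```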


### Text Dependency Parsing

spaCy features a fast and accurate syntactic dependency parser and has a rich API for navigating the tree. The parser also powers sentence boundary detection and lets you iterate over base noun phrases, or "chunks". You can check whether a Doc object has been parsed with doc.has_annotation("DEP"), which returns a boolean value (in spaCy v3 this replaces the deprecated doc.is_parsed attribute). If it returns False, the default sentence iterator will raise an exception.
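
To make the navigation API concrete without depending on a trained model, the sketch below builds a tiny Doc by hand and then walks its tree. It assumes spaCy v3, where the Doc constructor accepts `heads` (absolute token indices) and `deps`. The annotations are written by hand for illustration and are not model output.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-written annotations (an assumption for illustration, not a parser result)
words = ["I", "shall", "reemphasize", "those", "thoughts"]
heads = [2, 2, 2, 4, 2]  # index of each token's head; the root points to itself
deps = ["nsubj", "aux", "ROOT", "det", "dobj"]

toy_doc = Doc(vocab, words=words, heads=heads, deps=deps)

# In spaCy v3 this check replaces the older doc.is_parsed attribute
print(toy_doc.has_annotation("DEP"))

root = toy_doc[2]
print([t.text for t in root.children])         # direct dependents of the root
print([t.text for t in toy_doc[4].subtree])    # "thoughts" and everything below it
print([t.text for t in toy_doc[3].ancestors])  # path from "those" up to the root
```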

In [71]:
# check if the document has been parsed (dependency parsing)
# (spaCy v3 replaces the deprecated doc.is_parsed with has_annotation)
doc.has_annotation("DEP")

True

In [87]:
print('{:15} | {:10} | {:15} | {:10} | {:25} | {:25}'.format(
    'TEXT','DEP','HEAD TEXT','HEAD POS','CHILDREN','LEFTS'))

for token in doc[:50]:
    print('{:15} | {:10} | {:15} | {:10} | {:25} | {:25}'.format(
        token.text, token.dep_, token.head.text, token.head.pos_,
        str([child for child in token.children]), str([t.text for t in token.lefts])))

TEXT            | DEP        | HEAD TEXT       | HEAD POS   | CHILDREN                  | LEFTS                    
Financial       | compound   | Reform          | PROPN      | []                        | []                       
Reform          | ROOT       | Reform          | PROPN      | [Financial, and, Importance] | ['Financial']            
and             | cc         | Reform          | PROPN      | []                        | []                       
the             | det        | Importance      | NOUN       | []                        | []                       
Importance      | conj       | Reform          | PROPN      | [the, of]                 | ['the']                  
of              | prep       | Importance      | NOUN       | [Structure]               | []                       
a               | det        | Structure       | PROPN      | []                        | []                       
Decentralized   | amod       | Structure       | PROPN      | []     

#### NOUN CHUNKS:

| **TERM** | Definition |
|:---|:---:|
| **Text** | The original noun chunk text |
| **Root text** | The original text of the word connecting the noun chunk to the rest of the parse |
| **Root dependency** | Dependency relation connecting the root to its head |
| **Root head text** | The text of the root token's head |

In [88]:
# column headers follow the terms in the table above
print('{:15} | {:10} | {:15} | {:40}'.format('ROOT_TEXT','ROOT_DEP','ROOT_HEAD_TEXT','TEXT'))

for chunk in doc.noun_chunks:
    print('{:15} | {:10} | {:15} | {:40}'.format(
        chunk.root.text, chunk.root.dep_, chunk.root.head.text, chunk.text))

ROOT_TEXT       | ROOT_DEP   | ROOT_HEAD_TEXT  | TEXT                                    
Reform          | ROOT       | Reform          | Financial Reform                        
Importance      | conj       | Reform          | the Importance                          
Structure       | pobj       | of              | a Decentralized Banking Structure       
it              | nsubj      | is              | it                                      
pleasure        | attr       | is              | a pleasure                              
convention      | dobj       | address         | this convention                         
Association     | pobj       | of              | the Independent Bankers Association     
America         | pobj       | of              | America                                 
year            | attr       | is              | the sixth year                          
I               | nsubj      | addressed       | I                                       
convention

### Named Entity Recognition (NER)

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product, or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. 

In [90]:
ner_text = "When I told John that I wanted to move to Alaska, he warned me that I'd have trouble finding a Starbucks there."
ner_doc = nlp(ner_text)

In [91]:
print('{:10} | {:15}'.format('LABEL','ENTITY'))

for ent in ner_doc.ents[0:20]:
    print('{:10} | {:50}'.format(ent.label_, ent.text))

LABEL      | ENTITY         
PERSON     | John                                              
GPE        | Alaska                                            
ORG        | Starbucks                                         


In [92]:
# ent methods and attributes
# (index the entity directly instead of relying on the leftover loop variable)
print(dir(ner_doc.ents[0]))

['_', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_fix_dep_copy', '_recalculate_indices', '_vector', '_vector_norm', 'as_doc', 'char_span', 'conjuncts', 'doc', 'end', 'end_char', 'ent_id', 'ent_id_', 'ents', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'kb_id', 'kb_id_', 'label', 'label_', 'lefts', 'lemma_', 'lower_', 'merge', 'n_lefts', 'n_rights', 'noun_chunks', 'orth_', 'remove_extension', 'rights', 'root', 'sent', 'sentiment', 'set_extension', 'similarity', 'start', 'start_char', 'string', 'subtree', 'tensor', 'text', 'text_with_ws', 'to_array', 'upper_', 'vector', 'vector_norm', 'vocab']


In [93]:
# entity visualization
displacy.render(docs=ner_doc, style='ent', jupyter=True)

### Pipeline

If you have a sequence of documents to process, you should use the Language.pipe() method. The method takes an iterable of texts, buffers them into batches, and processes each batch together, which is usually more efficient than calling nlp on each text separately. It then yields the Doc objects in order, one by one.

- batch_size: the number of texts to buffer per batch.
- disable: names of pipeline components to disable to speed up text processing.
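
As a rough sketch of the call pattern, the snippet below streams three made-up sentences through a blank English pipeline, which stands in for a full model so that nothing has to be downloaded. The texts and the name `nlp_blank` are assumptions for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only; stands in for a full model here

texts = [
    "Financial reform is before the current Congress.",
    "Small banks lend to small businesses.",
    "Optimal risk is not zero risk.",
]

# batch_size sets how many texts are buffered together;
# the Doc objects come back in the same order as the inputs
docs = list(nlp_blank.pipe(texts, batch_size=2))

for source, piped_doc in zip(texts, docs):
    print(len(piped_doc), "tokens:", source)
```

With a trained pipeline the same call accepts `disable=[...]` to skip components that a task does not need.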
                                    

In [94]:
from spacy.pipeline import Pipe

In [99]:
# create a dataframe with a subset of the data, mentioning the word reform
reform_df = df[df.text.str.contains('reform')].text

# print the count of matches
print('Lines with the term reform: {}\n'.format(len(reform_df)))

# view the first five matching speeches
reform_df.head()

Lines with the term reform: 372



5     Remarks by Chairman Alan Greenspan Bank superv...
13    The Transformation of the U.S. Banking Industr...
16    The Challenge of Central Banking in a Democrat...
17    I am privileged to accept the Union League of ...
18    I discovered when I joined the Board of Govern...
Name: text, dtype: object

In [100]:
%%time

for doc in nlp.pipe(reform_df.head(10)):  # includes ['parser','tagger','ner']
    if 'reform' in doc.text:
        print(doc, '\n')

Remarks by Chairman Alan Greenspan Bank supervision, regulation, and risk At the Annual Convention of the American Bankers Association, Honolulu, Hawaii October 5, 1996

You may well wonder why a regulator is the first speaker at a conference in which a major theme is maximizing shareholder value. I hope that by the end of my remarks this morning it will be clear that we, the regulators, share with you ultimately the same objective of a strong and profitable banking system. Such a banking system knows how to take and manage risk for profit. The problem is what, if anything, regulators should do to constrain the amount of risk bankers take in trying to meet their corporate objectives. I have given considerable thought to this issue over the years, and today I would like to address this theme once again.

I. The Changing Nature of Bank Supervision and Regulation At the outset, it is critical to understand some key unintended implications of the safety net--our system of deposit insurance

### SpaCy - Tips for faster processing

You can substantially reduce the time spaCy takes to process a document by disabling components of the NLP pipeline that are not necessary for a given task.

- Disable options: **parser, tagger, ner**
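
Besides passing `disable` to `nlp.pipe`, spaCy v3 also offers `nlp.select_pipes` for switching components off temporarily. The sketch below shows the mechanism on a blank pipeline with a rule-based `sentencizer` added, chosen only because it needs no trained model. The name `nlp_demo` is hypothetical.

```python
import spacy

nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")  # a cheap rule-based component to switch off

print(nlp_demo.pipe_names)        # ['sentencizer']

# Temporarily disable the component; it is restored when the block exits
with nlp_demo.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp_demo.pipe_names)
    print(names_inside)           # []

print(nlp_demo.pipe_names)        # ['sentencizer']
```

With a trained pipeline such as en_core_web_sm, the same pattern would apply to names like parser, tagger, and ner.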

In [101]:
%%time

# processing occurs ~75x faster by disabling pipeline components
for doc in nlp.pipe(reform_df.head(10), disable=['parser','tagger','ner']):
    if 'immune' in doc.text:
        print(doc, '\n')

The Challenge of Central Banking in a Democratic Society

Good evening ladies and gentlemen. I am especially pleased to accept AEI's Francis Boyer Award for 1996 and be listed with so many of my friends and former associates. In my lecture this evening I want to give some personal perspectives on central banking and, consequently, I shall be speaking only for myself.

William Jennings Bryan reportedly mesmerized the Democratic Convention of 1896 with his memorable ". . . you shall not crucify mankind upon a cross of gold." His utterances underscored the profoundly divisive role of money in his time--a divisiveness that remains apparent today. Bryan was arguing for monetizing silver at an above-market price in order to expand the money supply. The presumed consequences would have been an increase in product prices and an accompanying shift in the value of net claims on future wealth from the "monied interests" of the East to the indebted farmers of the West who would arguably be able to

##### Determine which NLP components can be disabled

In [102]:
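def view_pos(doc, n_tokens=5):
    """Print spaCy POS and dependency information for the first n_tokens tokens.

    Note: the POS column shows token.head.pos_, the tag of each token's head.
    Without the parser every token is its own head, so the column then shows
    the token's own tag.
    """
    print('{:15} | {:10} | {:10} | {:30}'.format('TOKEN','POS','DEP_','LEFTS'))
    for token in doc[0:n_tokens]:
        print('{:15} | {:10} | {:10} | {:30}'.format(
            token.text, token.head.pos_,token.dep_, str([t.text for t in token.lefts])))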

In [103]:
# observe results from the default pipeline
pos_doc = nlp(text)
view_pos(pos_doc)

TOKEN           | POS        | DEP_       | LEFTS                         
Financial       | PROPN      | compound   | []                            
Reform          | PROPN      | ROOT       | ['Financial']                 
and             | PROPN      | cc         | []                            
the             | NOUN       | det        | []                            
Importance      | PROPN      | conj       | ['the']                       


In [104]:
# observe which part of speech (pos) attributes are disabled by parser
pos_doc = nlp(text, disable=['ner','parser'])
view_pos(pos_doc)

TOKEN           | POS        | DEP_       | LEFTS                         
Financial       | PROPN      |            | []                            
Reform          | PROPN      |            | []                            
and             | CCONJ      |            | []                            
the             | DET        |            | []                            
Importance      | NOUN       |            | []                            


In [105]:
# observe which part of speech (pos) attributes are disabled by tagger
pos_doc = nlp(text, disable=['ner','tagger'])
view_pos(pos_doc, n_tokens=5)

TOKEN           | POS        | DEP_       | LEFTS                         
Financial       |            | compound   | []                            
Reform          |            | ROOT       | ['Financial']                 
and             |            | cc         | []                            
the             |            | det        | []                            
Importance      |            | conj       | ['the']                       


### Stanza Pipeline

In [5]:
import stanza
import pandas as pd
from spacy import displacy

# Download English language model and initialize the NLP pipeline.
stanza.download('en')
nlp = stanza.Pipeline('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-06 01:10:49 INFO: Downloading default packages for language: en (English)...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2022-04-06 01:11:28 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-06 01:11:28 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-04-06 01:11:28 INFO: Use device: gpu
2022-04-06 01:11:28 INFO: Loading: tokenize
2022-04-06 01:13:01 INFO: Loading: pos
2022-04-06 01:13:01 INFO: Loading: lemma
2022-04-06 01:13:02 INFO: Loading: depparse
2022-04-06 01:13:03 INFO: Loading: sentiment
2022-04-06 01:13:04 INFO: Loading: constituency
2022-04-06 01:13:04 INFO: Loading: ner
2022-04-06 01:13:06 INFO: Done loading processors!


In [7]:
doc = nlp(text) # return a Document object

In [8]:
def print_doc_info(doc):
    print(f"Num sentences:\t{len(doc.sentences)}")
    print(f"Num tokens:\t{doc.num_tokens}")
    print(f"Num words:\t{doc.num_words}")
    print(f"Num entities:\t{len(doc.entities)}")

print_doc_info(doc)

Num sentences:	118
Num tokens:	3414
Num words:	3414
Num entities:	76


In [9]:
def print_sentence_info(sentence):
    print(f"Text: {sentence.text}")
    print(f"Num tokens:\t{len(sentence.tokens)}")
    print(f"Num words:\t{len(sentence.words)}")
    print(f"Num entities:\t{len(sentence.entities)}")

print_sentence_info(doc.sentences[0])

Text: Financial Reform and the Importance of a Decentralized Banking Structure
Num tokens:	10
Num words:	10
Num entities:	0


In [12]:
def print_token_info(token):
    print(f"Text:\t{token.text}")
    print(f"Start:\t{token.start_char}")
    print(f"End:\t{token.end_char}")

print_token_info(doc.sentences[1].tokens[10])

Text:	convention
Start:	117
End:	127


In [16]:
def print_word_info(word):
    print(f"Text:\t{word.text}")
    print(f"Lemma: \t{word.lemma}")
    print(f"UPOS: \t{word.upos}")
    print(f"XPOS: \t{word.xpos}")

print_word_info(doc.sentences[1].words[10])

Text:	convention
Lemma: 	convention
UPOS: 	NOUN
XPOS: 	NN


In [17]:
def word_info_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame object with one row for each word in
      doc, and columns for text, lemma, upos, and xpos.
    """
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            row = {
                "text": word.text,
                "lemma": word.lemma,
                "upos": word.upos,
                "xpos": word.xpos,
            }
            rows.append(row)
    return pd.DataFrame(rows)

word_info_df(doc)

Unnamed: 0,text,lemma,upos,xpos
0,Financial,Financial,ADJ,JJ
1,Reform,Reform,NOUN,NN
2,and,and,CCONJ,CC
3,the,the,DET,DT
4,Importance,importance,NOUN,NN
...,...,...,...,...
3409,24,24,NUM,CD
3410,",",",",PUNCT,","
3411,1997,1997,NUM,CD
3412,9:00,9:00,NUM,CD


In [18]:
def print_entity_info(entity):
    print(f"Text:\t{entity.text}")
    print(f"Type:\t{entity.type}")
    print(f"Start:\t{entity.start_char}")
    print(f"End:\t{entity.end_char}")

print_entity_info(doc.entities[0])

Text:	the Independent Bankers Association of America
Type:	ORG
Start:	131
End:	177
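
The Start and End values reported for tokens and entities are character offsets into the original text, with the end offset exclusive, so they can be used directly as a Python slice. The sketch below checks this on the first two lines of the example speech, copied from the output near the top of the notebook. The variable name `snippet` is made up for illustration.

```python
# First two lines of the example speech, as printed earlier in the notebook
snippet = (
    "Financial Reform and the Importance of a Decentralized Banking Structure\n"
    "As always, it is a pleasure to address this convention of the "
    "Independent Bankers Association of America."
)

# Offsets reported above: token (117, 127) and entity (131, 177)
print(snippet[117:127])  # convention
print(snippet[131:177])  # the Independent Bankers Association of America
```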


### spacy_stanza

In [6]:
import stanza
import spacy_stanza

stanza.download("en")
nlp = spacy_stanza.load_pipeline("en")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-06 01:35:15 INFO: Downloading default packages for language: en (English)...
2022-04-06 01:35:18 INFO: File exists: /root/stanza_resources/en/default.zip.
2022-04-06 01:35:27 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-06 01:35:27 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-04-06 01:35:27 INFO: Use device: gpu
2022-04-06 01:35:27 INFO: Loading: tokenize
2022-04-06 01:35:30 INFO: Loading: pos
2022-04-06 01:35:31 INFO: Loading: lemma
2022-04-06 01:35:31 INFO: Loading: depparse
2022-04-06 01:35:31 INFO: Loading: sentiment
2022-04-06 01:35:32 INFO: Loading: constituency
2022-04-06 01:35:32 INFO: Loading: ner
2022-04-06 01:35:33 INFO: Done loading processors!


In [7]:
doc = nlp(text)

In [22]:
from itertools import islice

n = 11
sent = next(islice(doc.sents, n, n+1))

displacy.render(sent, style='ent', jupyter=True, options={'distance': 100})

In [21]:
displacy.render(sent, style='dep', jupyter=True, options={'distance': 100})

In [19]:
print(sent.ents[1])
print(sent.ents[1].lemma_)

England
England


In [20]:
print([(ent.label_, ent.text) for ent in sent.ents])

[('CARDINAL', 'less than 500'), ('GPE', 'England'), ('GPE', 'Germany'), ('GPE', 'Canada')]
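
A list of (label, text) pairs like the one above is often easier to read once grouped by label. The short sketch below does that with the pairs copied from the output. It uses only the standard library and makes no call to Stanza or spaCy.

```python
from collections import defaultdict

# (label, text) pairs copied from the output above
pairs = [
    ("CARDINAL", "less than 500"),
    ("GPE", "England"),
    ("GPE", "Germany"),
    ("GPE", "Canada"),
]

by_label = defaultdict(list)
for label, ent_text in pairs:
    by_label[label].append(ent_text)

print(dict(by_label))
# {'CARDINAL': ['less than 500'], 'GPE': ['England', 'Germany', 'Canada']}
```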
