### **References**
- [https://spacy.pythonhumanities.com/intro.html](https://spacy.pythonhumanities.com/intro.html)  
- [https://spacy.io/usage/spacy-101](https://spacy.io/usage/spacy-101)
- [https://www.youtube.com/watch?v=dIUTsFT2MeQ&t=14s](https://www.youtube.com/watch?v=dIUTsFT2MeQ&t=14s)

### **Linguistic Annotations**

[https://spacy.pythonhumanities.com/01_02_linguistic_annotations.html](https://spacy.pythonhumanities.com/01_02_linguistic_annotations.html)

In [364]:
import spacy

In [365]:
nlp = spacy.load('en_core_web_sm')

with open ('wiki_us.txt') as f:
    text = f.read()

print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [366]:
doc = nlp(text)
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [367]:
print(len(text))
print(len(doc))

6559
1208


The two printouts look identical, so why do the lengths differ? Let's see:

In [368]:
for token in text[0:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


Iterating over the raw string yields one character at a time, so `len(text)` counts every single character.

In [369]:
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


Iterating over the `Doc` yields individual tokens (words and punctuation marks), so `len(doc)` counts tokens.

In [370]:
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


`(U.S.A.` looks wrong here: `split()` leaves the punctuation attached to the word. The punctuation should be a separate token so that it can be removed later.
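The contrast between whitespace splitting and spaCy's tokenizer can be reproduced on a short string. This is a minimal sketch, assuming spaCy v3 is installed; `spacy.blank('en')` is used only because the rule-based tokenizer needs no trained model, and the sample string and variable names are illustrative.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer
nlp_tok = spacy.blank('en')
sample = 'America (U.S.A. or USA), commonly known'

words = sample.split()                       # whitespace only
tokens = [t.text for t in nlp_tok(sample)]   # punctuation becomes separate tokens

print(words)
print(tokens)

# Punctuation tokens can then be filtered out with the is_punct flag
no_punct = [t.text for t in nlp_tok(sample) if not t.is_punct]
print(no_punct)
```

Because brackets and commas are tokens of their own, dropping them is a simple filter rather than string surgery.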

In [371]:
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [372]:
# sent1 = doc.sents[0]  # fails: doc.sents is a generator and cannot be indexed

sent1 = list(doc.sents)[0]
print(sent1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
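Since `doc.sents` is a generator, there are two usual ways to reach a single sentence. The sketch below assumes spaCy v3 and uses a blank pipeline with the rule-based `sentencizer` so that it runs without a trained model; the two-sentence text is made up.

```python
import spacy

nlp_s = spacy.blank('en')
nlp_s.add_pipe('sentencizer')    # rule-based sentence splitter, no model needed
doc_s = nlp_s('This is one. This is two.')

try:
    doc_s.sents[0]               # generators do not support indexing
except TypeError as err:
    print('TypeError:', err)

first = next(iter(doc_s.sents))  # take only the first sentence
all_sents = list(doc_s.sents)    # or materialise every sentence

print(first.text)
print(len(all_sents))
```

`next(iter(...))` avoids building the whole list when only the first sentence is needed, which matters for long documents.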


In [373]:
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [374]:
token2 = sent1[2]
print(token2)

States


In [375]:
token2.text

'States'

[https://spacy.io/api/token](https://spacy.io/api/token)

In [376]:
token2.left_edge

The

In [377]:
token2.right_edge

America

[https://spacy.io/api/entityrecognizer](https://spacy.io/api/entityrecognizer)

In [378]:
token2.ent_type

384

In [379]:
token2.ent_type_

'GPE'
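`ent_type` and `ent_type_` follow a general spaCy convention: the attribute without the underscore returns an integer ID, and the one with a trailing underscore returns the string. A small sketch of the mapping, assuming spaCy v3 and a blank pipeline (the variable names are illustrative):

```python
import spacy

nlp_h = spacy.blank('en')
tok = nlp_h('States')[0]

print(tok.orth, tok.orth_)            # integer hash, then the string

# The StringStore maps in both directions
strings = nlp_h.vocab.strings
print(strings[tok.orth])              # hash -> 'States'
print(strings['States'] == tok.orth)  # string -> hash
```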

In [380]:
token2.ent_iob_

'I'

In [381]:
token2.lemma_

'States'

In [382]:
print(sent1[12])

known


In [383]:
sent1[12].lemma_

'know'

In [384]:
token2.morph

Number=Sing

In [385]:
sent1[12].morph

Aspect=Perf|Tense=Past|VerbForm=Part

In [386]:
token2.pos_ # proper noun

'PROPN'

In [387]:
token2.dep_ # nominal subject

'nsubj'

In [388]:
token2.lang_

'en'

In [389]:
text = 'Mike enjoys playing football.'
doc2 = nlp(text)
print(doc2)

Mike enjoys playing football.


In [390]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
. PUNCT punct


In [391]:
from spacy import displacy
displacy.render(doc2, style='dep')

### **Named Entity Recognition**

In [392]:
for ent in doc.ents:
    print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
1775–1783 CARDINAL
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
The Spanish–American War and World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean

In [393]:
displacy.render(doc, style='ent')

In [394]:
spacy.explain('GPE')

'Countries, cities, states'

### **Word Vectors**

[https://spacy.pythonhumanities.com/01_03_word_vectors.html](https://spacy.pythonhumanities.com/01_03_word_vectors.html)

In [395]:
# !python -m spacy download en_core_web_md

In [396]:
import numpy as np

In [397]:
nlp1 = spacy.load('en_core_web_md')

In [398]:
print(sent1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


[https://spacy.io/api/vocab](https://spacy.io/api/vocab)  
[https://numpy.org/doc/stable/reference/generated/numpy.asarray.html](https://numpy.org/doc/stable/reference/generated/numpy.asarray.html)

In [399]:
your_word = 'country'

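# most_similar returns a tuple: (keys, best_rows, scores)
ms = nlp1.vocab.vectors.most_similar(
    np.asarray([nlp1.vocab.vectors[nlp1.vocab.strings[your_word]]]), n=10)
words = [nlp1.vocab.strings[w] for w in ms[0][0]]  # hash keys -> strings
distances = ms[2]  # similarity scores, highest first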

print(nlp1.vocab.strings[your_word])
print(ms)
print(words)

12290671265767728302
(array([[12389239844680878404,  1435501296278296988,  3205366385982613224,
        10101261077591962824, 10067128433980916117, 13467190378500458811,
         7523086094447079607,  4411440909759659592,  3830018849180425586,
          769100778973147158]], dtype=uint64), array([[  351,  1831,   919,  8453,  4341, 10117,  1955, 14035,   984,
        17926]], dtype=int32), array([[1.    , 0.8009, 0.7833, 0.773 , 0.7341, 0.712 , 0.6996, 0.6951,
        0.6934, 0.6924]], dtype=float32))
['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


[https://spacy.io/usage/linguistic-features](https://spacy.io/usage/linguistic-features)

In [400]:
doc1 = nlp1('I like salty fries and hamburgers.')
doc2 = nlp1('Fast food tastes very good.')

In [401]:
print(doc1, '<->', doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.6916492984766077


In [402]:
doc3 = nlp1('The Empire State Building is in New York.')

In [403]:
print (doc1, '<->', doc3, doc1.similarity(doc3))

I like salty fries and hamburgers. <-> The Empire State Building is in New York. 0.1766669125394067


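Similarity can also be used to detect sentences that overlap. Note how sharing most of the words keeps the score high even when the last word changes: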

In [404]:
doc4 = nlp1('I enjoy oranges.')
doc5 = nlp1('I enjoy apples.')

In [405]:
print(doc4, '<->', doc5, doc4.similarity(doc5))

I enjoy oranges. <-> I enjoy apples. 0.9775700747747101


In [406]:
doc6 = nlp1('I enjoy burgers.')

In [407]:
print(doc4, '<->', doc6, doc4.similarity(doc6))

I enjoy oranges. <-> I enjoy burgers. 0.9628306772893752


In [408]:
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, '<->', burgers, french_fries.similarity(burgers))

salty fries <-> hamburgers 0.6938489079475403
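By default, spaCy's `similarity` is the cosine similarity between vectors, and a `Doc` or `Span` vector defaults to the average of its token vectors. The sketch below reproduces that arithmetic in plain Python. The four-dimensional word vectors are invented for illustration and are not taken from `en_core_web_md`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical word vectors, invented for illustration only
toy = {
    'I':       [1.0, 0.0, 0.0, 0.0],
    'enjoy':   [0.0, 1.0, 0.0, 0.0],
    'oranges': [0.0, 0.0, 1.0, 0.0],
    'burgers': [0.0, 0.0, 0.0, 1.0],
}

sent_a = average([toy[w] for w in ['I', 'enjoy', 'oranges']])
sent_b = average([toy[w] for w in ['I', 'enjoy', 'burgers']])

word_sim = cosine(toy['oranges'], toy['burgers'])
sent_sim = cosine(sent_a, sent_b)
print(round(word_sim, 3), round(sent_sim, 3))
```

Even with two unrelated final words, the shared words pull the sentence vectors together. This is consistent with the high scores for the "I enjoy ..." pairs above.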


### **Pipelines**

[https://spacy.pythonhumanities.com/01_04_pipelines.html](https://spacy.pythonhumanities.com/01_04_pipelines.html)  
[https://spacy.io/usage/processing-pipelines](https://spacy.io/usage/processing-pipelines)  
[https://spacy.io/models](https://spacy.io/models)

[https://spacy.io/api/top-level](https://spacy.io/api/top-level)

In [409]:
nlp2 = spacy.blank('en')

In [410]:
nlp2.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x1d743743b00>

[https://pypi.org/project/requests/](https://pypi.org/project/requests/)  
[https://pypi.org/project/beautifulsoup4/](https://pypi.org/project/beautifulsoup4/)

In [411]:
# !py -m pip install bs4

In [412]:
import requests
from bs4 import BeautifulSoup
s = requests.get('https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt')
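# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(s.content, 'html.parser').text.replace('-\n', '').replace('\n', ' ')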
nlp2.max_length = 5278439

In [413]:
# %%time
# doc_test = nlp2(soup)
# print(len(list(doc_test.sents)))

# Wall time: 19.9s

In [414]:
# %%time
# nlp.max_length = 5278439
# doc_test = nlp(soup)
# print(len(list(doc_test.sents)))

# Wall time: 3min 50s

In [415]:
nlp2.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}
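The timing gap between the two commented-out cells comes from how many components each pipeline runs. A minimal sketch, assuming spaCy v3, of how to inspect the active components and switch one off temporarily; the blank pipeline and sample text are illustrative.

```python
import spacy

nlp_p = spacy.blank('en')
nlp_p.add_pipe('sentencizer')
print(nlp_p.pipe_names)          # active components, in order

doc_p = nlp_p('First sentence. Second sentence.')
n_sents = len(list(doc_p.sents))
print(n_sents)

# Temporarily switch a component off; it is restored when the block exits
with nlp_p.select_pipes(disable=['sentencizer']):
    inside = list(nlp_p.pipe_names)
after = list(nlp_p.pipe_names)
print(inside, after)
```

The same `select_pipes` call works on a trained pipeline, so unneeded components such as `ner` can be skipped when only sentence boundaries are required.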

In [416]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

### **EntityRuler**

[https://spacy.pythonhumanities.com/02_01_entityruler.html](https://spacy.pythonhumanities.com/02_01_entityruler.html)  
[https://spacy.io/api/entityruler](https://spacy.io/api/entityruler)

In [417]:
text = 'West Chestertenfieldville was referenced in Mr. Deeds.'

In [418]:
doc_e = nlp(text)

In [419]:
for ent in doc_e.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


In [420]:
spacy.explain('GPE')

'Countries, cities, states'

Labelling West Chestertenfieldville as GPE looks correct. Now we want Mr. Deeds to be recognised as a film.

In [421]:
ruler = nlp.add_pipe('entity_ruler')

In [422]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

Example: changing the entity type of a specific entity

In [423]:
patterns = [
    { 'label': 'PERSON', 'pattern': 'West Chestertenfieldville' }
]

In [424]:
ruler.add_patterns(patterns)

# Re-run the pipeline so that the new pattern gets a chance to apply
doc_e = nlp(text)
for ent in doc_e.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


The result is unchanged because the **ner** step runs before the **entity_ruler**. Once **ner** has labelled "West Chestertenfieldville", the **entity_ruler** does not overwrite it by default. To apply our rule, we need to place the **entity_ruler** before the **ner** step.
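Besides reordering, the EntityRuler has an `overwrite_ents` setting that lets a later ruler replace labels that are already set. The sketch below assumes spaCy v3 and needs no trained model: a first ruler stands in for the **ner** step, and the helper `build` and the component names are hypothetical.

```python
import spacy

def build(overwrite):
    nlp_o = spacy.blank('en')
    # Stand-in for the statistical ner step: labels the phrase as GPE first
    first = nlp_o.add_pipe('entity_ruler', name='first_ruler')
    first.add_patterns([{'label': 'GPE', 'pattern': 'West Chestertenfieldville'}])
    # A later ruler that wants to relabel the same span
    second = nlp_o.add_pipe('entity_ruler', name='second_ruler',
                            config={'overwrite_ents': overwrite})
    second.add_patterns([{'label': 'PERSON', 'pattern': 'West Chestertenfieldville'}])
    return nlp_o

sample = 'West Chestertenfieldville was referenced in Mr. Deeds.'
kept = [(e.text, e.label_) for e in build(False)(sample).ents]
replaced = [(e.text, e.label_) for e in build(True)(sample).ents]
print(kept)
print(replaced)
```

Which option fits depends on intent: placing the ruler before **ner** lets the model work around the rule-based entities, while `overwrite_ents` corrects the model after the fact.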

In [425]:
nlp_e = spacy.load('en_core_web_sm')

In [426]:
ruler = nlp_e.add_pipe('entity_ruler', before='ner')

In [427]:
ruler.add_patterns(patterns)

In [428]:
doc_e = nlp_e(text)

for ent in doc_e.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville PERSON
Deeds PERSON


In [429]:
nlp_e.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [430]:
nlp_e1 = spacy.load('en_core_web_sm')

In [431]:
ruler = nlp_e1.add_pipe('entity_ruler', before='ner')

In [432]:
patterns = [
    { 'label': 'PERSON', 'pattern': 'West Chestertenfieldville' },
    { 'label': 'FILM', 'pattern': 'Mr. Deeds' }
]

In [433]:
ruler.add_patterns(patterns)

In [434]:
doc_e = nlp_e1(text)

for ent in doc_e.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville PERSON
Mr. Deeds FILM


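Mr. Deeds is now detected as a film, but it could just as well be a person. This ambiguity, where the same string can take more than one label, is a common problem in NLP. These notes call it **toponym resolution**, although that term strictly refers to ambiguous place names.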

### **Matcher**

[https://spacy.pythonhumanities.com/02_02_matcher.html](https://spacy.pythonhumanities.com/02_02_matcher.html)  
[https://spacy.io/api/matcher](https://spacy.io/api/matcher)

In [435]:
from spacy.matcher import Matcher

In [436]:
matcher = Matcher(nlp.vocab)
pattern = [
    { 'LIKE_EMAIL': True }
]
matcher.add('EMAIL_ADDRESS', [pattern])

In [437]:
doc_m = nlp('This is an email address: nghoangtan96@gmail.com')
matchers = matcher(doc_m)
print(matchers)

[(16571425990740197027, 6, 7)]


In [438]:
print(nlp.vocab[matchers[0][0]].text)

EMAIL_ADDRESS
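Each match is a `(match_id, start, end)` tuple: the ID is the hash of the rule name, and `start`/`end` are token indices. A compact sketch that decodes all three, assuming spaCy v3 and a blank pipeline; the address `someone@example.com` is a placeholder.

```python
import spacy
from spacy.matcher import Matcher

nlp_x = spacy.blank('en')    # LIKE_EMAIL is a lexical flag, no model needed
matcher_x = Matcher(nlp_x.vocab)
matcher_x.add('EMAIL_ADDRESS', [[{'LIKE_EMAIL': True}]])

doc_x = nlp_x('This is an email address: someone@example.com')

results = []
for match_id, start, end in matcher_x(doc_x):
    label = nlp_x.vocab.strings[match_id]   # hash -> rule name
    results.append((label, start, end, doc_x[start:end].text))
print(results)
```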


In [439]:
with open ('wiki_mlk.txt') as f:
    text = f.read()

print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.

King participated in and led marches for blacks' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his 

In [440]:
nlp_m = spacy.load('en_core_web_sm')

In [441]:
# POS Part-of-Speech
print(spacy.explain('PROPN'))

proper noun


In [442]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN' }]
matcher.add('PROPER_NOUN', [pattern])
doc_m = nlp(text)
matches = matcher(doc_m)
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

102
(451313080118390996, 0, 1) Martin
(451313080118390996, 1, 2) Luther
(451313080118390996, 2, 3) King
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 7, 8) King
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist


This does not look right: "Martin Luther King Jr." should be matched as a single proper noun, not as four separate ones.

In [443]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN', 'OP': '+' }]
matcher.add('PROPER_NOUN', [pattern])
doc_m = nlp(text)
matches = matcher(doc_m)
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

175
(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.


In [444]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN', 'OP': '+' }]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

61
(451313080118390996, 83, 88) Martin Luther King Sr.
(451313080118390996, 468, 473) Martin Luther King Jr. Day
(451313080118390996, 535, 540) Martin Luther King Jr. Memorial
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 128, 132) Southern Christian Leadership Conference
(451313080118390996, 246, 250) Director J. Edgar Hoover
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 324, 327) Nobel Peace Prize
(451313080118390996, 421, 424) James Earl Ray
(451313080118390996, 462, 465) Congressional Gold Medal


The matches look correct now, but the order is mixed up (longest matches come first). Let's check the bottom of the list:

In [445]:
for match in matches[-10:]:
    print(match, doc_m[match[1]:match[2]])

(451313080118390996, 401, 402) April
(451313080118390996, 404, 405) Memphis
(451313080118390996, 406, 407) Tennessee
(451313080118390996, 416, 417) U.S.
(451313080118390996, 430, 431) King
(451313080118390996, 449, 450) King
(451313080118390996, 457, 458) Freedom
(451313080118390996, 513, 514) U.S.
(451313080118390996, 545, 546) Washington
(451313080118390996, 547, 548) D.C.


In [446]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN', 'OP': '+' }]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
# (id, start_index, end_index)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

61
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist
(451313080118390996, 49, 50) King
(451313080118390996, 69, 71) Mahatma Gandhi
(451313080118390996, 83, 88) Martin Luther King Sr.
(451313080118390996, 89, 90) King
(451313080118390996, 113, 114) King


Now we want to find every place where a proper noun is followed by a verb.

In [447]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN', 'OP': '+' }, { 'POS': 'VERB' }]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
# (id, start_index, end_index)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

7
(451313080118390996, 49, 51) King advanced
(451313080118390996, 89, 91) King participated
(451313080118390996, 113, 115) King led
(451313080118390996, 167, 169) King helped
(451313080118390996, 246, 251) Director J. Edgar Hoover considered
(451313080118390996, 321, 323) King won
(451313080118390996, 484, 487) United States beginning


In [448]:
import json
with open('alice.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
# with open('alice.txt') as f:
#     data = f.read()

# print(data)

In [449]:
data = data[0][2]
print(data)

["Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'", 'So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.', "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT- POCKET, and looked at it

In [450]:
text = data[0].replace('`', "'")
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


We want to find every piece of quoted speech in the text.

In [451]:
matcher = Matcher(nlp.vocab)
pattern = [
    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # one or more alphabetic tokens inside the quote
    { 'IS_PUNCT': True, 'OP': '*' }, # optional punctuation before the closing quote
    { 'ORTH': "'" }
]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
# (id, start_index, end_index)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

2
(451313080118390996, 47, 58) 'and what is the use of a book,'
(451313080118390996, 60, 67) 'without pictures or conversation?'


Now we can extract the speech, but we also want to know who the speaker is.

In [452]:
speak_lemmas = ['think', 'say']
matcher = Matcher(nlp.vocab)
pattern = [
    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # one or more alphabetic tokens inside the quote
    { 'IS_PUNCT': True, 'OP': '*' }, # optional punctuation before the closing quote
    { 'ORTH': "'" },

    { 'POS': 'VERB', 'LEMMA': { 'IN': speak_lemmas }}, # a verb whose lemma is in speak_lemmas (e.g. thought -> think)
    { 'POS': 'PROPN', 'OP': '+' }, # the speaker: one or more proper nouns

    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # one or more alphabetic tokens inside the quote
    { 'IS_PUNCT': True, 'OP': '*' }, # optional punctuation before the closing quote
    { 'ORTH': "'" }
]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
# (id, start_index, end_index)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

1
(451313080118390996, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [453]:
for text in data:
    text = text.replace('`', "'")
    doc_m = nlp(text)
    matches = matcher(doc_m)
    print(len(matches))
    matches.sort(key = lambda x: x[1])
    for match in matches[:10]:
        print (match, doc_m[match[1]:match[2]])

1
(451313080118390996, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


Our pattern finds nothing in the other passages because it only matches one specific shape, `<quoted string> <verb + proper noun> <quoted string>`, and only one passage has that shape.

In [454]:
matcher1 = Matcher(nlp_m.vocab)
# pattern1: 'quote' + verb + speaker + 'quote'
pattern1 = [
    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # one or more alphabetic tokens inside the quote
    { 'IS_PUNCT': True, 'OP': '*' }, # optional punctuation before the closing quote
    { 'ORTH': "'" },

    { 'POS': 'VERB', 'LEMMA': { 'IN': speak_lemmas }}, # a verb whose lemma is in speak_lemmas (e.g. thought -> think)
    { 'POS': 'PROPN', 'OP': '+' }, # the speaker: one or more proper nouns

    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # one or more alphabetic tokens inside the quote
    { 'IS_PUNCT': True, 'OP': '*' }, # optional punctuation before the closing quote
    { 'ORTH': "'" }
]
# pattern2: 'quote' + verb + speaker
pattern2 = [
    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # one or more alphabetic tokens inside the quote
    { 'IS_PUNCT': True, 'OP': '*' }, # optional punctuation before the closing quote
    { 'ORTH': "'" },

    { 'POS': 'VERB', 'LEMMA': { 'IN': speak_lemmas }}, # a verb whose lemma is in speak_lemmas
    { 'POS': 'PROPN', 'OP': '+' } # the speaker: one or more proper nouns
]
# pattern3: verb + speaker + 'quote'
pattern3 = [
    { 'POS': 'VERB', 'LEMMA': { 'IN': speak_lemmas }}, # a verb whose lemma is in speak_lemmas
    { 'POS': 'PROPN', 'OP': '+' }, # the speaker: one or more proper nouns

    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # one or more alphabetic tokens inside the quote
    { 'IS_PUNCT': True, 'OP': '*' }, # optional punctuation before the closing quote
    { 'ORTH': "'" }
]
# Note: the patterns are added to the existing `matcher`, which still holds the
# earlier 'PROPER_NOUN' rule, so the first passage is reported twice below.
# `matcher1` is created above but never used.
matcher.add('PROPER_NOUNS', [pattern1, pattern2, pattern3], greedy='LONGEST')

for text in data:
    text = text.replace('`', "'")
    doc_m = nlp(text)
    matches = matcher(doc_m)
    print(len(matches))
    matches.sort(key = lambda x: x[1])
    for match in matches[:10]:
        print (match, doc_m[match[1]:match[2]])

2
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
(451313080118390996, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0


### **Custom Components**

[https://spacy.pythonhumanities.com/02_04_custom_component.html](https://spacy.pythonhumanities.com/02_04_custom_component.html)  
[https://spacy.io/usage/processing-pipelines#custom-components](https://spacy.io/usage/processing-pipelines#custom-components)

In [455]:
nlp_cc = spacy.load('en_core_web_sm')
doc_cc = nlp_cc('Britain is a place. Mary is a doctor')

In [456]:
for ent in doc_cc.ents:
    print(ent.text, ent.label_)

Britain GPE
Mary PERSON


In [457]:
from spacy.language import Language

In [458]:
@Language.component('remove_gpe')
def remove_gpe(doc):
    # Keep every entity except those labelled GPE
    doc.ents = [ent for ent in doc.ents if ent.label_ != 'GPE']
    return doc

In [459]:
nlp_cc.add_pipe('remove_gpe')

<function __main__.remove_gpe(doc)>

In [460]:
nlp_cc.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'remove_gpe': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [461]:
doc_cc = nlp_cc('Britain is a place. Mary is a doctor')
for ent in doc_cc.ents:
    print(ent.text, ent.label_)

Mary PERSON


In [462]:
nlp_cc.to_disk('new_en_core_web_sm')
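A saved pipeline stores only the name of a custom component, not its code, so the component has to be registered again before the pipeline is loaded in a new session. The round trip below is a sketch under stated assumptions: spaCy v3, a blank pipeline with an `entity_ruler` standing in for the trained **ner** step, a temporary directory instead of `new_en_core_web_sm`, and a hypothetical component name `drop_gpe_sketch`.

```python
import tempfile
import spacy
from spacy.language import Language

@Language.component('drop_gpe_sketch')    # hypothetical name for this sketch
def drop_gpe_sketch(doc):
    doc.ents = [ent for ent in doc.ents if ent.label_ != 'GPE']
    return doc

nlp_d = spacy.blank('en')
ruler_d = nlp_d.add_pipe('entity_ruler')  # stand-in for a trained ner step
ruler_d.add_patterns([
    {'label': 'GPE', 'pattern': 'Britain'},
    {'label': 'PERSON', 'pattern': 'Mary'},
])
nlp_d.add_pipe('drop_gpe_sketch')

with tempfile.TemporaryDirectory() as tmp:
    nlp_d.to_disk(tmp)
    # Loading works here because 'drop_gpe_sketch' is registered in this session
    reloaded = spacy.load(tmp)

doc_d = reloaded('Britain is a place. Mary is a doctor')
ents_d = [(e.text, e.label_) for e in doc_d.ents]
print(reloaded.pipe_names, ents_d)
```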

### **RegEx**

[https://spacy.pythonhumanities.com/02_05_simple_regex.html](https://spacy.pythonhumanities.com/02_05_simple_regex.html)

In [463]:
#Sample text
text = "This is a sample number (555) 555-5555."

#Start from a blank English pipeline (tokenizer only)
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    "pattern": [
                        {"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}
                    ]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

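Nothing is printed. A token-level `REGEX` is tested against one token at a time, so it cannot match across tokens. The dash in the phone number throws off the EntityRuler: the number is split at the dash, and no single token contains the whole pattern.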
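One way around this limitation is to describe the number token by token instead of with a single regular expression. The sketch below assumes spaCy v3 and its default English tokenizer rules; it uses the `SHAPE` attribute, where `d` stands for a digit, and the variable names are illustrative.

```python
import spacy

nlp_ph = spacy.blank('en')
ruler_ph = nlp_ph.add_pipe('entity_ruler')
ruler_ph.add_patterns([{
    'label': 'PHONE_NUMBER',
    # one dict per token: three digits, a hyphen, four digits
    'pattern': [{'SHAPE': 'ddd'}, {'ORTH': '-'}, {'SHAPE': 'dddd'}],
}])

doc_ph = nlp_ph('This is a sample number (555) 555-5555.')
tokens_ph = [t.text for t in doc_ph]
ents_ph = [(e.text, e.label_) for e in doc_ph.ents]
print(tokens_ph)
print(ents_ph)
```

Printing the tokens first makes it clear why the single-token regex had nothing to match against.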

In [464]:
text = "This is a sample number 5555555."

patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    "pattern": [
                        {"TEXT": {"REGEX": r"((\d){5})"}}
                    ]
                }
            ]

#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

5555555 PHONE_NUMBER


### **RegEx (Multi-Word Tokens)**

In [465]:
import re

In [466]:
text = 'Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common.'

In [467]:
pattern = r'Paul [A-Z]\w+'

[https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)

In [468]:
matches = re.finditer(pattern, text)

for match in matches:
    print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>
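Each `re.Match` object carries the character offsets and the matched text, which the later cells use to build spans. A short sketch of pulling those values out with `span()` and `group()`:

```python
import re

text = ('Paul Newman was an American actor, but Paul Hollywood is a '
        'British TV Host. The name Paul is quite common.')
pattern = r'Paul [A-Z]\w+'

# Collect ((start_char, end_char), matched_text) for every match.
found = [(m.span(), m.group()) for m in re.finditer(pattern, text)]
for (start, end), name in found:
    print(start, end, name)
```

The last "Paul" is not matched because the word after it ("is") does not start with a capital letter.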


In [469]:
from spacy.tokens import Span

In [470]:
nlp_rg = spacy.blank('en')
doc_rg = nlp_rg(text)
doc_rg.ents

()

In [471]:
original_ents = list(doc_rg.ents)
mwt_ents = []

for match in re.finditer(pattern, doc_rg.text):
    start, end = match.span()
    span = doc_rg.char_span(start, end)
    print(span)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

Paul Newman
Paul Hollywood


In [472]:
print(mwt_ents)

[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]
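The tuples above hold token indices, not character offsets: `char_span` converts one into the other and returns `None` when the character offsets do not line up with token boundaries, which is why the loop checks `span is not None`. A rough sketch of that mapping, assuming a toy regex tokenizer (`\w+|[^\w\s]`) in place of spaCy's; the helper name `char_to_token_span` is hypothetical.

```python
import re

text = ('Paul Newman was an American actor, but Paul Hollywood is a '
        'British TV Host. The name Paul is quite common.')

# Toy tokenizer (assumption): words and single punctuation marks are tokens.
tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_to_token_span(start_char, end_char):
    """Return (start_token, end_token), or None if an offset cuts a token."""
    starts = {s: i for i, (s, _) in enumerate(tokens)}
    ends = {e: i + 1 for i, (_, e) in enumerate(tokens)}  # end is exclusive
    if start_char in starts and end_char in ends:
        return starts[start_char], ends[end_char]
    return None

print(char_to_token_span(0, 11))   # "Paul Newman"
print(char_to_token_span(39, 53))  # "Paul Hollywood"
print(char_to_token_span(0, 3))    # "Pau" ends inside a token
```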


In [473]:
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc_rg, start, end, label='PERSON')
    original_ents.append(per_ent)
doc_rg.ents = original_ents
print(doc_rg.ents)

(Paul Newman, Paul Hollywood)


In [474]:
from spacy.language import Language

In [475]:
@Language.component('paul_ner')
def paul_ner(doc):
    pattern = r'Paul [A-Z]\w+'
    original_ents = list(doc.ents)
    mwt_ents = []

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        print(span)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label='PERSON')
        original_ents.append(per_ent)
    doc.ents = original_ents
    return (doc)

In [476]:
nlp_rg1 = spacy.blank('en')
nlp_rg1.add_pipe('paul_ner')

<function __main__.paul_ner(doc)>

In [477]:
doc_rg1 = nlp_rg1(text)
print(doc_rg1.ents)

Paul Newman
Paul Hollywood
(Paul Newman, Paul Hollywood)


Let's create a new component that labels "Hollywood" as CINEMA.

In [478]:
@Language.component('cinema_ner')
def cinema_ner(doc):
    pattern = r'Hollywood'
    original_ents = list(doc.ents)
    mwt_ents = []

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        print(span)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        cinema_ent = Span(doc, start, end, label='CINEMA')
        original_ents.append(cinema_ent)
    doc.ents = original_ents
    return (doc)

In [479]:
nlp_rg2 = spacy.load('en_core_web_sm')
nlp_rg2.add_pipe('cinema_ner')

<function __main__.cinema_ner(doc)>

In [480]:
# doc_rg2 = nlp_rg2(text)
# [E1010] Unable to set entity information for token 9 which is 
# included in more than one span in entities, blocked, missing or outside.

# The error occurs because the CINEMA span for "Hollywood" (token 9) overlaps
# the "Paul Hollywood" PERSON entity already set by the model's ner component.

To resolve the overlap, pass the combined spans through `filter_spans` before assigning them to `doc.ents`.

In [481]:
from spacy.util import filter_spans

@Language.component('cinema_ner1')
def cinema_ner1(doc):
    pattern = r'Hollywood'
    original_ents = list(doc.ents)
    mwt_ents = []

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        print(span)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        cinema_ent = Span(doc, start, end, label='CINEMA')
        original_ents.append(cinema_ent)
    #remove overlapping spans, since doc.ents cannot contain overlaps
    filtered = filter_spans(original_ents)
    doc.ents = filtered
    return (doc)

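When spans overlap, `filter_spans` gives priority to the longest span, so "Paul Hollywood" (PERSON) is kept and the shorter "Hollywood" (CINEMA) span is dropped.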

In [482]:
nlp_rg3 = spacy.load('en_core_web_sm')
nlp_rg3.add_pipe('cinema_ner1')
doc_rg3 = nlp_rg3(text)

for ent in doc_rg3.ents:
    print(ent.text, ent.label_)

Hollywood
Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
Paul PERSON
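No CINEMA entity appears in the output above because the shorter span lost to the longer PERSON span. A rough pure-Python sketch of that "longest span wins" rule, using `(start_token, end_token, label)` tuples instead of spaCy `Span` objects; `filter_overlaps` is a hypothetical helper and only approximates what `filter_spans` does.

```python
def filter_overlaps(spans):
    """Keep the longest spans first; drop any span that overlaps a kept one.

    Each span is a (start_token, end_token, label) tuple, end exclusive.
    """
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end, label in ordered:
        if not seen.intersection(range(start, end)):
            kept.append((start, end, label))
            seen.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)

# Token indices taken from the earlier mwt_ents output; labels as in the notebook.
spans = [(0, 2, "PERSON"), (8, 10, "PERSON"), (9, 10, "CINEMA")]
result = filter_overlaps(spans)
print(result)
```

If the CINEMA label should win instead, one option is to remove the conflicting PERSON span before filtering rather than relying on span length.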


### **Applied SpaCy Financial**

[https://spacy.pythonhumanities.com/03_01_stock_analysis.html](https://spacy.pythonhumanities.com/03_01_stock_analysis.html)

In [483]:
import spacy
import pandas as pd

In [484]:
df = pd.read_csv('stocks.tsv', sep='\t')
df.head()

Unnamed: 0,Symbol,CompanyName,Industry,MarketCap
0,A,Agilent Technologies,Life Sciences Tools & Services,53.65B
1,AA,Alcoa,Metals & Mining,9.25B
2,AAC,Ares Acquisition,Shell Companies,1.22B
3,AACG,ATA Creativity Global,Diversified Consumer Services,90.35M
4,AADI,Aadi Bioscience,Pharmaceuticals,104.85M


In [485]:
df.tail()

Unnamed: 0,Symbol,CompanyName,Industry,MarketCap
5874,ZWRK,Z-Work Acquisition,Shell Companies,278.88M
5875,ZY,Zymergen,Chemicals,1.31B
5876,ZYME,Zymeworks,Biotechnology,1.50B
5877,ZYNE,Zynerba Pharmaceuticals,Pharmaceuticals,184.39M
5878,ZYXI,Zynex,Health Care Equipment & Supplies,438.33M


In [486]:
symbols = df.Symbol.tolist()
companies = df.CompanyName.tolist()
print(symbols[:10])

['A', 'AA', 'AAC', 'AACG', 'AADI', 'AAIC', 'AAL', 'AAMC', 'AAME', 'AAN']


In [487]:
stops = ['two']
nlp_asf = spacy.blank('en')
ruler = nlp_asf.add_pipe('entity_ruler')
patterns = []

for symbol in symbols:
    patterns.append({ 'label': 'STOCK', 'pattern': symbol })

for company in companies:
    if company not in stops:
        patterns.append({ 'label': 'COMPANY', 'pattern': company })

print(patterns[:10])
print(patterns[-10:])

[{'label': 'STOCK', 'pattern': 'A'}, {'label': 'STOCK', 'pattern': 'AA'}, {'label': 'STOCK', 'pattern': 'AAC'}, {'label': 'STOCK', 'pattern': 'AACG'}, {'label': 'STOCK', 'pattern': 'AADI'}, {'label': 'STOCK', 'pattern': 'AAIC'}, {'label': 'STOCK', 'pattern': 'AAL'}, {'label': 'STOCK', 'pattern': 'AAMC'}, {'label': 'STOCK', 'pattern': 'AAME'}, {'label': 'STOCK', 'pattern': 'AAN'}]
[{'label': 'COMPANY', 'pattern': 'Zoetis'}, {'label': 'COMPANY', 'pattern': 'Zumiez'}, {'label': 'COMPANY', 'pattern': 'Zuora'}, {'label': 'COMPANY', 'pattern': 'Zevia'}, {'label': 'COMPANY', 'pattern': 'Zovio'}, {'label': 'COMPANY', 'pattern': 'Z-Work Acquisition'}, {'label': 'COMPANY', 'pattern': 'Zymergen'}, {'label': 'COMPANY', 'pattern': 'Zymeworks'}, {'label': 'COMPANY', 'pattern': 'Zynerba Pharmaceuticals'}, {'label': 'COMPANY', 'pattern': 'Zynex'}]


In [488]:
text = '''
Sept 10 (Reuters) - Wall Street's main indexes were subdued on Friday as signs of higher inflation and a drop in Apple shares following an unfavorable court ruling offset expectations of an easing in U.S.-China tensions.

Data earlier in the day showed U.S. producer prices rose solidly in August, leading to the biggest annual gain in nearly 11 years and indicating that high inflation was likely to persist as the pandemic pressures supply chains. read more .

"Today's data on wholesale prices should be eye-opening for the Federal Reserve, as inflation pressures still don't appear to be easing and will likely continue to be felt by the consumer in the coming months," said Charlie Ripley, senior investment strategist for Allianz Investment Management.

Apple Inc (AAPL.O) fell 2.7% following a U.S. court ruling in "Fortnite" creator Epic Games' antitrust lawsuit that stroke down some of the iPhone maker's restrictions on how developers can collect payments in apps.


Sponsored by Advertising Partner
Sponsored Video
Watch to learn more
Report ad
Apple shares were set for their worst single-day fall since May this year, weighing on the Nasdaq (.IXIC) and the S&P 500 technology sub-index (.SPLRCT), which fell 0.1%.

Sentiment also took a hit from Cleveland Federal Reserve Bank President Loretta Mester's comments that she would still like the central bank to begin tapering asset purchases this year despite the weak August jobs report. read more

Investors have paid keen attention to the labor market and data hinting towards higher inflation recently for hints on a timeline for the Federal Reserve to begin tapering its massive bond-buying program.

The S&P 500 has risen around 19% so far this year on support from dovish central bank policies and re-opening optimism, but concerns over rising coronavirus infections and accelerating inflation have lately stalled its advance.


Report ad
The three main U.S. indexes got some support on Friday from news of a phone call between U.S. President Joe Biden and Chinese leader Xi Jinping that was taken as a positive sign which could bring a thaw in ties between the world's two most important trading partners.

At 1:01 p.m. ET, the Dow Jones Industrial Average (.DJI) was up 12.24 points, or 0.04%, at 34,891.62, the S&P 500 (.SPX) was up 2.83 points, or 0.06%, at 4,496.11, and the Nasdaq Composite (.IXIC) was up 12.85 points, or 0.08%, at 15,261.11.

Six of the eleven S&P 500 sub-indexes gained, with energy (.SPNY), materials (.SPLRCM) and consumer discretionary stocks (.SPLRCD) rising the most.

U.S.-listed Chinese e-commerce companies Alibaba and JD.com , music streaming company Tencent Music (TME.N) and electric car maker Nio Inc (NIO.N) all gained between 0.7% and 1.4%


Report ad
Grocer Kroger Co (KR.N) dropped 7.1% after it said global supply chain disruptions, freight costs, discounts and wastage would hit its profit margins.

Advancing issues outnumbered decliners by a 1.12-to-1 ratio on the NYSE and by a 1.02-to-1 ratio on the Nasdaq.

The S&P index recorded 14 new 52-week highs and three new lows, while the Nasdaq recorded 49 new highs and 38 new lows.
'''

In [489]:
ruler.add_patterns(patterns)

In [490]:
doc_asf = nlp_asf(text)
for ent in doc_asf.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
Apple COMPANY
Nasdaq COMPANY
ET STOCK
Nasdaq COMPANY
JD.com COMPANY
Kroger COMPANY
Nasdaq COMPANY
Nasdaq COMPANY


In [491]:
from spacy import displacy

In [492]:
displacy.render(doc_asf, style='ent')

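We are missing the ticker `AAPL.O` after the second Apple mention: the article writes tickers with an exchange suffix, so the bare symbol pattern `AAPL` does not match.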

In [493]:
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for symbol in symbols:
    for letter in letters:
        #add SYMBOL.A ... SYMBOL.Z to cover tickers with an exchange suffix
        patterns.append({"label": "STOCK", "pattern": f"{symbol}.{letter}"})
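The loop above adds 26 suffixed patterns for every symbol. Below is a sketch of the same expansion on a three-symbol sample, followed by a single-regex alternative. The sample list and the `suffixed` regex are illustrative assumptions, not taken from `stocks.tsv` or from the original tutorial.

```python
import re
from string import ascii_uppercase

sample_symbols = ["AAPL", "TME", "KR"]  # hypothetical subset of the symbol list

# Same expansion as the notebook loop: SYMBOL.A ... SYMBOL.Z
expanded = [f"{sym}.{letter}"
            for sym in sample_symbols
            for letter in ascii_uppercase]
print(len(expanded))  # 3 symbols * 26 letters

# Alternative: one regex accepting any known symbol plus a one-letter suffix.
alternatives = "|".join(map(re.escape, sample_symbols))
suffixed = re.compile(r"^(?:%s)\.[A-Z]$" % alternatives)

candidates = ["AAPL.O", "KR.N", "AAPL", "XYZ.N"]
hits = [tok for tok in candidates if suffixed.search(tok)]
print(hits)
```

The expansion keeps every pattern a plain string, at the cost of a much longer pattern list; the regex form is more compact but has to be checked per token.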

In [494]:
ruler.add_patterns(patterns)
doc_asf = nlp_asf(text)
displacy.render(doc_asf, style='ent')

In [495]:
for ent in doc_asf.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
AAPL.O STOCK
Apple COMPANY
Nasdaq COMPANY
ET STOCK
Nasdaq COMPANY
JD.com COMPANY
TME.N STOCK
NIO.N STOCK
Kroger COMPANY
KR.N STOCK
Nasdaq COMPANY
Nasdaq COMPANY


In [496]:
df2 = pd.read_csv('indexes.tsv', sep='\t')
df2.head()

Unnamed: 0,IndexName,IndexSymbol
0,Dow Jones Industrial Average,DJIA
1,Dow Jones Transportation Average,DJT
2,Dow Jones Utility Average Index,DJU
3,NASDAQ 100 Index (NASDAQ Calculation),NDX
4,NASDAQ Composite Index,COMP


In [497]:
indexes = df2.IndexName.tolist()
index_symbols = df2.IndexSymbol.tolist()

In [498]:
for index in indexes:
    patterns.append({ 'label': 'INDEX', 'pattern': index })
    words = index.split()
    patterns.append({ 'label': 'INDEX', 'pattern': ' '.join(words[:2]) })
for index in index_symbols:
    patterns.append({ 'label': 'INDEX', 'pattern': index })
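Besides the full index name, the loop registers a short form made of the first two words, so that abbreviated mentions in running text can match. A quick sketch of what that short form looks like for the index names shown in `df2.head()`:

```python
index_names = [
    "Dow Jones Industrial Average",
    "NASDAQ 100 Index (NASDAQ Calculation)",
    "NASDAQ Composite Index",
]

# Same expression as in the loop: keep only the first two words.
short_names = [" ".join(name.split()[:2]) for name in index_names]
print(short_names)
```

These string patterns appear to be matched case-sensitively, which would explain why "Nasdaq Composite" in the article is not picked up by the "NASDAQ Composite" short form.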

In [499]:
ruler.add_patterns(patterns)

In [500]:
doc_asf = nlp_asf(text)
for ent in doc_asf.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
AAPL.O STOCK
Apple COMPANY
Nasdaq COMPANY
S&P 500 INDEX
S&P 500 INDEX
ET STOCK
Dow Jones Industrial Average INDEX
S&P 500 INDEX
Nasdaq COMPANY
S&P 500 INDEX
JD.com COMPANY
TME.N STOCK
NIO.N STOCK
Kroger COMPANY
KR.N STOCK
Nasdaq COMPANY
Nasdaq COMPANY


In [501]:
df3 = pd.read_csv('stock_exchanges.tsv', sep='\t')
df3.head()

Unnamed: 0,BloombergExchangeCode,BloombergCompositeCode,Country,Description,ISOMIC,Google Prefix,EODcode,NumStocks
0,AF,AR,Argentina,Bolsa de Comercio de Buenos Aires,XBUE,,BA,12
1,AO,AU,Australia,National Stock Exchange of Australia,XNEC,,,1
2,AT,AU,Australia,Asx - All Markets,XASX,ASX,AU,875
3,AV,,Austria,Wiener Boerse Ag,XWBO,VIE,VI,38
4,BI,,Bahrain,Bahrain Bourse,XBAH,,,4


In [502]:
exchanges = df3.ISOMIC.tolist() + df3['Google Prefix'].tolist() + df3.Description.tolist()
print(exchanges[:10])

['XBUE', 'XNEC', 'XASX', 'XWBO', 'XBAH', 'XDHA', 'XBRU', 'BVMF', 'XCNQ', 'XTSE']


In [503]:
for e in exchanges:
    #skip empty cells, which pandas reads as NaN (a float, not a string)
    if isinstance(e, str):
        patterns.append({ 'label': 'STOCK_EXCHANGE', 'pattern': e })

In [504]:
ruler.add_patterns(patterns)
doc_asf = nlp_asf(text)
for ent in doc_asf.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
AAPL.O STOCK
Apple COMPANY
Nasdaq COMPANY
S&P 500 INDEX
S&P 500 INDEX
ET STOCK
Dow Jones Industrial Average INDEX
S&P 500 INDEX
Nasdaq COMPANY
S&P 500 INDEX
JD.com COMPANY
TME.N STOCK
NIO.N STOCK
Kroger COMPANY
KR.N STOCK
NYSE STOCK_EXCHANGE
Nasdaq COMPANY
Nasdaq COMPANY
