### **References**
- [https://spacy.pythonhumanities.com/intro.html](https://spacy.pythonhumanities.com/intro.html)  
- [https://spacy.io/usage/spacy-101](https://spacy.io/usage/spacy-101)
- [https://www.youtube.com/watch?v=dIUTsFT2MeQ&t=14s](https://www.youtube.com/watch?v=dIUTsFT2MeQ&t=14s)

### **Linguistic Annotations**

[https://spacy.pythonhumanities.com/01_02_linguistic_annotations.html](https://spacy.pythonhumanities.com/01_02_linguistic_annotations.html)

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

with open('wiki_us.txt', encoding='utf-8') as f:
    text = f.read()

print(text)

In [None]:
doc = nlp(text)
print(doc)

In [None]:
print(len(text))
print(len(doc))

The two printouts look identical, so why do the lengths differ? Let's see:

In [None]:
for token in text[0:10]:
    print(token)

Iterating over the raw string yields one character at a time, so `len(text)` counts characters.

In [None]:
for token in doc[:10]:
    print(token)

Iterating over the `Doc` yields tokens (words and punctuation marks), so `len(doc)` counts tokens.

In [None]:
for token in text.split()[:10]:
    print(token)

Splitting on whitespace leaves punctuation attached, as in `(U.S.A.`; spaCy's tokenizer separates the punctuation so that it can be filtered out later.
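
The three views of the same text can be compared side by side. The sketch below is only an illustration: it uses a blank English pipeline (tokenizer only, no trained model) and a short sample sentence that stands in for `wiki_us.txt`; the names `sample` and `nlp_blank` are assumptions, not part of the notebook.

```python
import spacy

# Assumption: a short sample sentence stands in for the contents of wiki_us.txt.
sample = "The United States of America (U.S.A. or USA) is a country."

# A blank pipeline contains only the rule-based tokenizer, so no model download is needed.
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank(sample)

chars = list(sample)                    # one item per character
chunks = sample.split()                 # whitespace-separated chunks
tokens = [t.text for t in sample_doc]   # spaCy tokens

print(len(chars), len(chunks), len(tokens))
print(chunks[5])   # '(U.S.A.' keeps the opening bracket attached
print(tokens)      # the bracket appears as a token of its own
```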

In [None]:
for sent in doc.sents:
    print(sent)

In [None]:
# sent1 = doc.sents[0]  # fails: doc.sents is a generator and cannot be indexed

sent1 = list(doc.sents)[0]
print(sent1)
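
Because `doc.sents` is a generator, it has to be consumed before a sentence can be picked by position. The following sketch shows two ways to do that. It assumes a blank pipeline with the rule-based `sentencizer` and a two-sentence sample text instead of the Wikipedia article; `nlp_sent` and `doc_sent` are illustrative names.

```python
import spacy

nlp_sent = spacy.blank("en")
nlp_sent.add_pipe("sentencizer")   # rule-based sentence splitter, no model needed

# Assumption: a two-sentence sample text instead of the Wikipedia article.
doc_sent = nlp_sent("This is the first sentence. This is the second one.")

# Indexing the generator directly raises a TypeError.
try:
    doc_sent.sents[0]
    subscriptable = True
except TypeError:
    subscriptable = False

first = next(iter(doc_sent.sents))   # consumes only the first sentence
all_sents = list(doc_sent.sents)     # materialises every sentence

print(subscriptable, first, len(all_sents))
```

`next(iter(...))` avoids building the full list when only the first sentence is needed, which matters on long documents.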

In [None]:
for token in doc[:10]:
    print(token)

In [None]:
token2 = sent1[2]
print(token2)

In [None]:
token2.text

[https://spacy.io/api/token](https://spacy.io/api/token)

In [None]:
token2.left_edge

In [None]:
token2.right_edge

[https://spacy.io/api/entityrecognizer](https://spacy.io/api/entityrecognizer)

In [None]:
token2.ent_type

In [None]:
token2.ent_type_

In [None]:
token2.ent_iob_

In [None]:
token2.lemma_

In [None]:
print(sent1[12])

In [None]:
sent1[12].lemma_

In [None]:
token2.morph

In [None]:
sent1[12].morph

In [None]:
token2.pos_ # proper noun

In [None]:
token2.dep_ # nominal subject (nsubj)

In [None]:
token2.lang_

In [None]:
text = 'Mike enjoys playing football.'
doc2 = nlp(text)
print(doc2)

In [None]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

In [None]:
from spacy import displacy
displacy.render(doc2, style='dep')

### **Named Entity Recognition**

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
displacy.render(doc, style='ent')

In [None]:
spacy.explain('GPE')

### **Word Vectors**

[https://spacy.pythonhumanities.com/01_03_word_vectors.html](https://spacy.pythonhumanities.com/01_03_word_vectors.html)

In [None]:
# !python -m spacy download en_core_web_md

In [None]:
import numpy as np

In [None]:
nlp1 = spacy.load('en_core_web_md')

In [None]:
print(sent1)

[https://spacy.io/api/vocab](https://spacy.io/api/vocab)  
[https://numpy.org/doc/stable/reference/generated/numpy.asarray.html](https://numpy.org/doc/stable/reference/generated/numpy.asarray.html)

In [None]:
your_word = 'country'

ms = nlp1.vocab.vectors.most_similar(
    np.asarray([nlp1.vocab.vectors[nlp1.vocab.strings[your_word]]]), n=10)
words = [nlp1.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]

print(nlp1.vocab.strings[your_word])
print(ms)
print(words)

[https://spacy.io/usage/linguistic-features](https://spacy.io/usage/linguistic-features)

In [None]:
doc1 = nlp1('I like salty fries and hamburgers.')
doc2 = nlp1('Fast food tastes very good.')

In [None]:
print(doc1, '<->', doc2, doc1.similarity(doc2))

In [None]:
doc3 = nlp1('The Empire State Building is in New York.')

In [None]:
print(doc1, '<->', doc3, doc1.similarity(doc3))

Similarity scores like these can be used to detect sentences or documents whose content overlaps.

In [None]:
doc4 = nlp1('I enjoy oranges.')
doc5 = nlp1('I enjoy apples.')

In [None]:
print(doc4, '<->', doc5, doc4.similarity(doc5))

In [None]:
doc6 = nlp1('I enjoy burgers.')

In [None]:
print(doc4, '<->', doc6, doc4.similarity(doc6))

In [None]:
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, '<->', burgers, french_fries.similarity(burgers))
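
By default, `similarity()` is a cosine similarity between vectors, and the vector of a `Doc` or `Span` is the average of its token vectors. The sketch below reproduces that calculation in plain Python. The three-dimensional vectors are made up for illustration (they are not values from `en_core_web_md`), and `cosine`, `average` and `toy_vectors` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average(vectors):
    """Element-wise mean, as used for a Doc or Span vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical toy word vectors, invented for this example.
toy_vectors = {
    "enjoy":   [0.0, 0.3, 0.9],
    "oranges": [0.9, 0.1, 0.0],
    "apples":  [0.8, 0.2, 0.0],
    "burgers": [0.1, 0.9, 0.1],
}

def doc_vector(words):
    return average([toy_vectors[w] for w in words])

sim_fruit = cosine(doc_vector(["enjoy", "oranges"]), doc_vector(["enjoy", "apples"]))
sim_mixed = cosine(doc_vector(["enjoy", "oranges"]), doc_vector(["enjoy", "burgers"]))
print(round(sim_fruit, 3), round(sim_mixed, 3))
```

With these toy values the two fruit sentences score higher than the fruit and burger pair, which mirrors the comparison made with the real model above.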

### **Pipelines**

[https://spacy.pythonhumanities.com/01_04_pipelines.html](https://spacy.pythonhumanities.com/01_04_pipelines.html)  
[https://spacy.io/usage/processing-pipelines](https://spacy.io/usage/processing-pipelines)  
[https://spacy.io/models](https://spacy.io/models)

[https://spacy.io/api/top-level](https://spacy.io/api/top-level)

In [None]:
nlp2 = spacy.blank('en')

In [None]:
nlp2.add_pipe('sentencizer')

[https://pypi.org/project/requests/](https://pypi.org/project/requests/)  
[https://pypi.org/project/beautifulsoup4/](https://pypi.org/project/beautifulsoup4/)

In [None]:
# !py -m pip install bs4

In [None]:
import requests
from bs4 import BeautifulSoup
s = requests.get('https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt')
soup = BeautifulSoup(s.content, 'html.parser').text.replace('-\n', '').replace('\n', ' ')
nlp2.max_length = 5278439

In [None]:
# %%time
# doc_test = nlp2(soup)
# print(len(list(doc_test.sents)))

# Wall time: 19.9s

In [None]:
# %%time
# nlp.max_length = 5278439
# doc_test = nlp(soup)
# print(len(list(doc_test.sents)))

# Wall time: 3min 50s

In [None]:
nlp2.analyze_pipes()

In [None]:
nlp.analyze_pipes()

### **EntityRuler**

[https://spacy.pythonhumanities.com/02_01_entityruler.html](https://spacy.pythonhumanities.com/02_01_entityruler.html)  
[https://spacy.io/api/entityruler](https://spacy.io/api/entityruler)

In [None]:
text = 'West Chestertenfieldville was referenced in Mr. Deeds.'

In [None]:
doc_e = nlp(text)

In [None]:
for ent in doc_e.ents:
    print(ent.text, ent.label_)

In [None]:
spacy.explain('GPE')

Labeling West Chestertenfieldville as a GPE looks correct. Now we also want Mr. Deeds to be recognized as a film.

In [None]:
ruler = nlp.add_pipe('entity_ruler')

In [None]:
nlp.analyze_pipes()

Example: changing the entity type of a specific entity

In [None]:
patterns = [
    { 'label': 'PERSON', 'pattern': 'West Chestertenfieldville' }
]

In [None]:
ruler.add_patterns(patterns)

# Re-run the pipeline so that the new pattern has a chance to apply
doc_e = nlp(text)
for ent in doc_e.ents:
    print(ent.text, ent.label_)

The label is not affected because the **ner** component runs before the **entity_ruler**: once **ner** has labeled "West Chestertenfieldville", the **entity_ruler** does not overwrite it by default. To apply our rule, the **entity_ruler** must be placed before the **ner** component.

In [None]:
nlp_e = spacy.load('en_core_web_sm')

In [None]:
ruler = nlp_e.add_pipe('entity_ruler', before='ner')

In [None]:
ruler.add_patterns(patterns)

In [None]:
doc_e = nlp_e(text)

for ent in doc_e.ents:
    print(ent.text, ent.label_)

In [None]:
nlp_e.analyze_pipes()
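
The ordering effect can be reproduced without a trained model. In this sketch, two `entity_ruler` components are chained on a blank pipeline, and the first one plays the role of the upstream **ner** component. That stand-in is an assumption made for illustration, as are the names `labels_for`, `first_ruler` and `second_ruler`.

```python
import spacy

text_demo = "West Chestertenfieldville was referenced in Mr. Deeds."

def labels_for(first_label, second_label):
    """Run two entity rulers in sequence and return the entities that survive."""
    nlp_demo = spacy.blank("en")
    # Assumption: the first ruler stands in for the upstream `ner` component.
    first = nlp_demo.add_pipe("entity_ruler", name="first_ruler")
    second = nlp_demo.add_pipe("entity_ruler", name="second_ruler")
    first.add_patterns([{"label": first_label, "pattern": "West Chestertenfieldville"}])
    second.add_patterns([{"label": second_label, "pattern": "West Chestertenfieldville"}])
    return [(ent.text, ent.label_) for ent in nlp_demo(text_demo).ents]

# Whichever component runs first decides the label; the later one does not overwrite it.
print(labels_for("GPE", "PERSON"))
print(labels_for("PERSON", "GPE"))
```

The `entity_ruler` factory also accepts an `overwrite_ents` setting for cases where the later component should win instead; moving the ruler with `before='ner'` keeps the default behaviour.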

In [None]:
nlp_e1 = spacy.load('en_core_web_sm')

In [None]:
ruler = nlp_e1.add_pipe('entity_ruler', before='ner')

In [None]:
patterns = [
    { 'label': 'PERSON', 'pattern': 'West Chestertenfieldville' },
    { 'label': 'FILM', 'pattern': 'Mr. Deeds' }
]

In [None]:
ruler.add_patterns(patterns)

In [None]:
doc_e = nlp_e1(text)

for ent in doc_e.ents:
    print(ent.text, ent.label_)

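Mr. Deeds is now detected as a film, but in another context the same string could refer to a person. Resolving this kind of ambiguity, where one string can take several labels, is a common NLP problem; for place names it is known as **toponym resolution**.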

### **Matcher**

[https://spacy.pythonhumanities.com/02_02_matcher.html](https://spacy.pythonhumanities.com/02_02_matcher.html)  
[https://spacy.io/api/matcher](https://spacy.io/api/matcher)

In [None]:
from spacy.matcher import Matcher

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [
    { 'LIKE_EMAIL': True }
]
matcher.add('EMAIL_ADDRESS', [pattern])

In [None]:
doc_m = nlp('This is an email address: nghoangtan96@gmail.com')
matchers = matcher(doc_m)
print(matchers)

In [None]:
print(nlp.vocab[matchers[0][0]].text)

In [None]:
with open('wiki_mlk.txt', encoding='utf-8') as f:
    text = f.read()

print(text)

In [None]:
nlp_m = spacy.load('en_core_web_sm')

In [None]:
# POS Part-of-Speech
print(spacy.explain('PROPN'))

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN' }]
matcher.add('PROPER_NOUN', [pattern])
doc_m = nlp(text)
matches = matcher(doc_m)
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

This is not quite right: "Martin Luther King Jr." should be matched as a single multi-token proper noun rather than token by token.

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN', 'OP': '+' }]
matcher.add('PROPER_NOUN', [pattern])
doc_m = nlp(text)
matches = matcher(doc_m)
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN', 'OP': '+' }]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

This looks correct, but the matches are no longer in document order. Let's check the end of the list:

In [None]:
for match in matches[-10:]:
    print(match, doc_m[match[1]:match[2]])

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN', 'OP': '+' }]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
# (id, start_index, end_index)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])
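
The effect of `'OP': '+'`, `greedy='LONGEST'` and the final sort can be seen on a single short sentence. This is a sketch under one loud assumption: a blank pipeline has no part-of-speech tagger, so `IS_TITLE` (capitalised token) is used as a stand-in for `'POS': 'PROPN'`. The sentence and the names `nlp_t`, `plain` and `longest` are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp_t = spacy.blank("en")
# Assumption: IS_TITLE stands in for POS == 'PROPN' because no tagger is loaded.
pattern_t = [{"IS_TITLE": True, "OP": "+"}]
doc_t = nlp_t("yesterday Martin Luther King spoke in Atlanta")

plain = Matcher(nlp_t.vocab)
plain.add("TITLE_RUN", [pattern_t])
all_matches = plain(doc_t)            # every sub-span of each capitalised run

longest = Matcher(nlp_t.vocab)
longest.add("TITLE_RUN", [pattern_t], greedy="LONGEST")
best = longest(doc_t)                 # overlapping matches reduced to the longest one
best.sort(key=lambda m: m[1])         # restore document order by start index

print(len(all_matches), len(best))
for match_id, start, end in best:
    print(nlp_t.vocab.strings[match_id], start, end, doc_t[start:end].text)
```

Each match is a `(match_id, start, end)` tuple of token indices, so sorting on the second element is enough to recover reading order.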

Now we want to find every proper noun that is followed by a verb.

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{ 'POS': 'PROPN', 'OP': '+' }, { 'POS': 'VERB' }]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
# (id, start_index, end_index)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

In [None]:
import json
with open('alice.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
# with open('alice.txt') as f:
#     data = f.read()

print(data)

In [None]:
data = data['data'][0]['contents']
print(data)

In [None]:
text = data[0].replace('`', "'")
print(text)

We want to find every piece of quoted speech in the text.

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [
    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # to select all text inside quote
    { 'IS_PUNCT': True, 'OP': '*' }, # to select any or non punc inside quote
    { 'ORTH': "'" }
]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
# (id, start_index, end_index)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])
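
The quote pattern relies only on lexical attributes (`ORTH`, `IS_ALPHA`, `IS_PUNCT`), so it can be tried on a blank pipeline. The sentence below is invented for illustration and is not taken from `alice.json`; `nlp_q` and `quote_matcher` are hypothetical names.

```python
import spacy
from spacy.matcher import Matcher

nlp_q = spacy.blank("en")
# Assumption: a made-up sentence with one quoted passage.
doc_q = nlp_q("Alice looked up. 'Hello there!' she whispered.")

quote_pattern = [
    {"ORTH": "'"},                    # opening quote
    {"IS_ALPHA": True, "OP": "+"},    # one or more words
    {"IS_PUNCT": True, "OP": "*"},    # optional trailing punctuation
    {"ORTH": "'"},                    # closing quote
]

quote_matcher = Matcher(nlp_q.vocab)
quote_matcher.add("QUOTE", [quote_pattern], greedy="LONGEST")

quotes = [doc_q[start:end].text for _, start, end in quote_matcher(doc_q)]
print(quotes)
```

Note that the pattern only allows punctuation at the end of the quote, so a quotation with a comma in the middle would not match.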

Now we can extract the speech, but we also want to know who the speaker is.

In [None]:
speak_lemmas = ['think', 'say']
matcher = Matcher(nlp.vocab)
pattern = [
    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # to select all text inside quote
    { 'IS_PUNCT': True, 'OP': '*' }, # to select any or non punc inside quote
    { 'ORTH': "'" },

    { 'POS': 'VERB', 'LEMMA': { 'IN': speak_lemmas }}, # a verb whose lemma is in speak_lemmas (e.g. 'thought' -> 'think')
    { 'POS': 'PROPN', 'OP': '+' },

    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # to select all text inside quote
    { 'IS_PUNCT': True, 'OP': '*' }, # to select any or non punc inside quote
    { 'ORTH': "'" }
]
matcher.add('PROPER_NOUN', [pattern], greedy='LONGEST')
doc_m = nlp(text)
matches = matcher(doc_m)
# (id, start_index, end_index)
matches.sort(key = lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc_m[match[1]:match[2]])

In [None]:
for text in data:
    text = text.replace('`', "'")
    doc_m = nlp(text)
    matches = matcher(doc_m)
    print(len(matches))
    matches.sort(key = lambda x: x[1])
    for match in matches[:10]:
        print(match, doc_m[match[1]:match[2]])

The pattern finds no matches in the other passages because it only captures the specific sequence `quoted string` + `verb + proper noun` + `quoted string`, which occurs only once. To cover the other cases, several alternative patterns can be registered under one key.

In [None]:
matcher1 = Matcher(nlp_m.vocab)
pattern1 = [
    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # to select all text inside quote
    { 'IS_PUNCT': True, 'OP': '*' }, # to select any or non punc inside quote
    { 'ORTH': "'" },

    { 'POS': 'VERB', 'LEMMA': { 'IN': speak_lemmas }}, # a verb whose lemma is in speak_lemmas (e.g. 'thought' -> 'think')
    { 'POS': 'PROPN', 'OP': '+' },

    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # to select all text inside quote
    { 'IS_PUNCT': True, 'OP': '*' }, # to select any or non punc inside quote
    { 'ORTH': "'" }
]
pattern2 = [
    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # to select all text inside quote
    { 'IS_PUNCT': True, 'OP': '*' }, # to select any or non punc inside quote
    { 'ORTH': "'" },

    { 'POS': 'VERB', 'LEMMA': { 'IN': speak_lemmas }}, # a verb whose lemma is in speak_lemmas (e.g. 'thought' -> 'think')
    { 'POS': 'PROPN', 'OP': '+' }
]
pattern3 = [
    { 'POS': 'VERB', 'LEMMA': { 'IN': speak_lemmas }}, # a verb whose lemma is in speak_lemmas (e.g. 'thought' -> 'think')
    { 'POS': 'PROPN', 'OP': '+' },

    { 'ORTH': "'" },
    { 'IS_ALPHA': True, 'OP': '+' }, # to select all text inside quote
    { 'IS_PUNCT': True, 'OP': '*' }, # to select any or non punc inside quote
    { 'ORTH': "'" }
]
matcher1.add('PROPER_NOUNS', [pattern1, pattern2, pattern3], greedy='LONGEST')

for text in data:
    text = text.replace('`', "'")
    doc_m = nlp_m(text)
    matches = matcher1(doc_m)
    print(len(matches))
    matches.sort(key = lambda x: x[1])
    for match in matches[:10]:
        print(match, doc_m[match[1]:match[2]])
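
Registering several alternative token sequences under one key can be sketched on a blank pipeline as well. The assumptions here are significant: `LOWER` replaces `LEMMA` and `IS_TITLE` replaces `'POS': 'PROPN'`, because a blank pipeline has neither a lemmatizer nor a tagger, and the two sample sentences are invented. `quote`, `speaker` and `speech` are hypothetical helper names.

```python
import spacy
from spacy.matcher import Matcher

nlp_s = spacy.blank("en")

def quote():
    """Token pattern for a single-quoted passage (fresh list on every call)."""
    return [
        {"ORTH": "'"},
        {"IS_ALPHA": True, "OP": "+"},
        {"IS_PUNCT": True, "OP": "*"},
        {"ORTH": "'"},
    ]

def speaker():
    """Assumption: LOWER stands in for LEMMA, IS_TITLE for POS == 'PROPN'."""
    return [
        {"LOWER": {"IN": ["said", "thought"]}},
        {"IS_TITLE": True, "OP": "+"},
    ]

speech = Matcher(nlp_s.vocab)
# One key, three alternative sequences; any of them may produce a match.
speech.add(
    "SPEECH",
    [quote() + speaker() + quote(), quote() + speaker(), speaker() + quote()],
    greedy="LONGEST",
)

# Assumption: invented sentences, one per word order.
samples = ["'Oh dear!' said Alice.", "So thought Alice 'How odd!'"]
found = []
for sample in samples:
    doc_s = nlp_s(sample)
    for _, start, end in sorted(speech(doc_s), key=lambda m: m[1]):
        found.append(doc_s[start:end].text)
print(found)
```

Passing several patterns to one `add` call behaves like a logical OR, so each word order only needs to be described once.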

### **Custom Components**

[https://spacy.pythonhumanities.com/02_04_custom_component.html](https://spacy.pythonhumanities.com/02_04_custom_component.html)  
[https://spacy.io/usage/processing-pipelines#custom-components](https://spacy.io/usage/processing-pipelines#custom-components)

In [None]:
nlp_cc = spacy.load('en_core_web_sm')
doc_cc = nlp_cc('Britain is a place. Mary is a doctor')

In [None]:
for ent in doc_cc.ents:
    print(ent.text, ent.label_)

In [None]:
from spacy.language import Language

In [None]:
@Language.component('remove_gpe')
def remove_gpe(doc):
    original_ents = list(doc.ents)
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            original_ents.remove(ent)
    doc.ents = original_ents
    return (doc)

In [None]:
nlp_cc.add_pipe('remove_gpe')

In [None]:
nlp_cc.analyze_pipes()

In [None]:
doc_cc = nlp_cc('Britain is a place. Mary is a doctor')
for ent in doc_cc.ents:
    print(ent.text, ent.label_)

In [None]:
nlp_cc.to_disk('new_en_core_web_sm')

### **RegEx**

[https://spacy.pythonhumanities.com/02_05_simple_regex.html](https://spacy.pythonhumanities.com/02_05_simple_regex.html)

In [None]:
#Sample text
text = "This is a sample number (555) 555-5555."

#Start from a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    "pattern": [
                        {"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}
                    ]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

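A `REGEX` condition in the EntityRuler is tested against one token at a time, so it cannot match text that spans several tokens. The full number `(555) 555-5555` is split into multiple tokens, which throws off the EntityRuler.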

In [None]:
text = "This is a sample number 5555555."

patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    "pattern": [
                        {"TEXT": {"REGEX": r"((\d){5})"}}
                    ]
                }
            ]

#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

### **RegEx (Multi-Word Tokens)**

In [None]:
import re

In [None]:
text = 'Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common.'

In [None]:
pattern = r'Paul [A-Z]\w+'

[https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)

In [None]:
matches = re.finditer(pattern, text)

for match in matches:
    print(match)
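
Each match object carries both the matched text and its character offsets, which is what later allows the hit to be mapped back onto tokens. A minimal standard-library sketch, reusing the sentence and pattern from above under the illustrative names `text_re`, `pattern_re` and `hits`:

```python
import re

text_re = ('Paul Newman was an American actor, but Paul Hollywood is a British TV Host. '
           'The name Paul is quite common.')
pattern_re = r'Paul [A-Z]\w+'   # 'Paul' followed by a capitalised word

# Collect (start offset, end offset, matched text) for every hit.
hits = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern_re, text_re)]

for start, end, name in hits:
    # Slicing the original string with the offsets returns the same text.
    print(start, end, name, text_re[start:end] == name)
```

The bare "Paul" in the last sentence is skipped because the word that follows it is not capitalised.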

In [None]:
from spacy.tokens import Span

In [None]:
nlp_rg = spacy.blank('en')
doc_rg = nlp_rg(text)
doc_rg.ents

In [None]:
original_ents = list(doc_rg.ents)
mwt_ents = []

for match in re.finditer(pattern, doc_rg.text):
    start, end = match.span()
    span = doc_rg.char_span(start, end)
    print(span)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

In [None]:
print(mwt_ents)

In [None]:
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc_rg, start, end, label='PERSON')
    original_ents.append(per_ent)
doc_rg.ents = original_ents
print(doc_rg.ents)

In [None]:
from spacy.language import Language

In [None]:
@Language.component('paul_ner')
def paul_ner(doc):
    pattern = r'Paul [A-Z]\w+'
    original_ents = list(doc.ents)
    mwt_ents = []

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        print(span)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label='PERSON')
        original_ents.append(per_ent)
    doc.ents = original_ents
    return (doc)

In [None]:
nlp_rg1 = spacy.blank('en')
nlp_rg1.add_pipe('paul_ner')

In [None]:
doc_rg1 = nlp_rg1(text)
print(doc_rg1.ents)

Let's create a new Span that labels Hollywood as CINEMA.

In [None]:
@Language.component('cinema_ner')
def cinema_ner(doc):
    pattern = r'Hollywood'
    original_ents = list(doc.ents)
    mwt_ents = []

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        print(span)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label='CINEMA')
        original_ents.append(per_ent)
    doc.ents = original_ents
    return (doc)

In [None]:
nlp_rg2 = spacy.load('en_core_web_sm')
nlp_rg2.add_pipe('cinema_ner')

In [None]:
# doc_rg2 = nlp_rg2(text)
# [E1010] Unable to set entity information for token 9 which is 
# included in more than one span in entities, blocked, missing or outside.

# The error occurs because the CINEMA span for 'Hollywood' (token 9) overlaps an
# entity that the model's ner component has already assigned to that token.

To solve the problem, filter the overlapping spans with `filter_spans` before assigning `doc.ents`:

In [None]:
from spacy.util import filter_spans
@Language.component('cinema_ner1')
def cinema_ner1(doc):
    pattern = r'Hollywood'
    original_ents = list(doc.ents)
    mwt_ents = []

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        print(span)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label='CINEMA')
        original_ents.append(per_ent)
    filtered = filter_spans(original_ents)
    doc.ents = filtered
    return (doc)

When spans overlap, `filter_spans` gives priority to the longest span.

In [None]:
nlp_rg3 = spacy.load('en_core_web_sm')
nlp_rg3.add_pipe('cinema_ner1')
doc_rg3 = nlp_rg3(text)

for ent in doc_rg3.ents:
    print(ent.text, ent.label_)
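
The longest-span rule can be checked in isolation. The sketch below builds two overlapping spans by hand on a blank pipeline; the sentence is shortened and the names `nlp_f`, `person` and `cinema` are assumptions made for the example.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp_f = spacy.blank("en")
doc_f = nlp_f("Paul Hollywood is a British TV host.")

person = Span(doc_f, 0, 2, label="PERSON")   # 'Paul Hollywood'
cinema = Span(doc_f, 1, 2, label="CINEMA")   # 'Hollywood', overlaps the span above

# Assigning both spans to doc.ents would fail because token 1 belongs to two entities,
# so the overlap is resolved first.
kept = filter_spans([cinema, person])
doc_f.ents = kept

print([(ent.text, ent.label_) for ent in doc_f.ents])
```

The order of the input list does not matter here; the two-token PERSON span is kept because it is longer than the one-token CINEMA span.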

### **Applied spaCy: Financial Analysis**

[https://spacy.pythonhumanities.com/03_01_stock_analysis.html](https://spacy.pythonhumanities.com/03_01_stock_analysis.html)

In [None]:
import spacy
import pandas as pd

In [None]:
df = pd.read_csv('stocks.tsv', sep='\t')
df.head()

In [None]:
df.tail()

In [None]:
symbols = df.Symbol.tolist()
companies = df.CompanyName.tolist()
print(symbols[:10])

In [64]:
stops = ['two']
nlp_asf = spacy.blank('en')
ruler = nlp_asf.add_pipe('entity_ruler')
patterns = []

for symbol in symbols:
    patterns.append({ 'label': 'STOCK', 'pattern': symbol })

for company in companies:
    if company not in stops:
        patterns.append({ 'label': 'COMPANY', 'pattern': company })

print(patterns[:10])
print(patterns[-10:])

[{'label': 'STOCK', 'pattern': 'A'}, {'label': 'STOCK', 'pattern': 'AA'}, {'label': 'STOCK', 'pattern': 'AAC'}, {'label': 'STOCK', 'pattern': 'AACG'}, {'label': 'STOCK', 'pattern': 'AADI'}, {'label': 'STOCK', 'pattern': 'AAIC'}, {'label': 'STOCK', 'pattern': 'AAL'}, {'label': 'STOCK', 'pattern': 'AAMC'}, {'label': 'STOCK', 'pattern': 'AAME'}, {'label': 'STOCK', 'pattern': 'AAN'}]
[{'label': 'COMPANY', 'pattern': 'Zoetis'}, {'label': 'COMPANY', 'pattern': 'Zumiez'}, {'label': 'COMPANY', 'pattern': 'Zuora'}, {'label': 'COMPANY', 'pattern': 'Zevia'}, {'label': 'COMPANY', 'pattern': 'Zovio'}, {'label': 'COMPANY', 'pattern': 'Z-Work Acquisition'}, {'label': 'COMPANY', 'pattern': 'Zymergen'}, {'label': 'COMPANY', 'pattern': 'Zymeworks'}, {'label': 'COMPANY', 'pattern': 'Zynerba Pharmaceuticals'}, {'label': 'COMPANY', 'pattern': 'Zynex'}]


In [66]:
text = '''
Sept 10 (Reuters) - Wall Street's main indexes were subdued on Friday as signs of higher inflation and a drop in Apple shares following an unfavorable court ruling offset expectations of an easing in U.S.-China tensions.

Data earlier in the day showed U.S. producer prices rose solidly in August, leading to the biggest annual gain in nearly 11 years and indicating that high inflation was likely to persist as the pandemic pressures supply chains. read more .

"Today's data on wholesale prices should be eye-opening for the Federal Reserve, as inflation pressures still don't appear to be easing and will likely continue to be felt by the consumer in the coming months," said Charlie Ripley, senior investment strategist for Allianz Investment Management.

Apple Inc (AAPL.O) fell 2.7% following a U.S. court ruling in "Fortnite" creator Epic Games' antitrust lawsuit that stroke down some of the iPhone maker's restrictions on how developers can collect payments in apps.


Sponsored by Advertising Partner
Sponsored Video
Watch to learn more
Report ad
Apple shares were set for their worst single-day fall since May this year, weighing on the Nasdaq (.IXIC) and the S&P 500 technology sub-index (.SPLRCT), which fell 0.1%.

Sentiment also took a hit from Cleveland Federal Reserve Bank President Loretta Mester's comments that she would still like the central bank to begin tapering asset purchases this year despite the weak August jobs report. read more

Investors have paid keen attention to the labor market and data hinting towards higher inflation recently for hints on a timeline for the Federal Reserve to begin tapering its massive bond-buying program.

The S&P 500 has risen around 19% so far this year on support from dovish central bank policies and re-opening optimism, but concerns over rising coronavirus infections and accelerating inflation have lately stalled its advance.


Report ad
The three main U.S. indexes got some support on Friday from news of a phone call between U.S. President Joe Biden and Chinese leader Xi Jinping that was taken as a positive sign which could bring a thaw in ties between the world's two most important trading partners.

At 1:01 p.m. ET, the Dow Jones Industrial Average (.DJI) was up 12.24 points, or 0.04%, at 34,891.62, the S&P 500 (.SPX) was up 2.83 points, or 0.06%, at 4,496.11, and the Nasdaq Composite (.IXIC) was up 12.85 points, or 0.08%, at 15,261.11.

Six of the eleven S&P 500 sub-indexes gained, with energy (.SPNY), materials (.SPLRCM) and consumer discretionary stocks (.SPLRCD) rising the most.

U.S.-listed Chinese e-commerce companies Alibaba and JD.com , music streaming company Tencent Music (TME.N) and electric car maker Nio Inc (NIO.N) all gained between 0.7% and 1.4%


Report ad
Grocer Kroger Co (KR.N) dropped 7.1% after it said global supply chain disruptions, freight costs, discounts and wastage would hit its profit margins.

Advancing issues outnumbered decliners by a 1.12-to-1 ratio on the NYSE and by a 1.02-to-1 ratio on the Nasdaq.

The S&P index recorded 14 new 52-week highs and three new lows, while the Nasdaq recorded 49 new highs and 38 new lows.
'''

In [67]:
ruler.add_patterns(patterns)

In [68]:
doc_asf = nlp_asf(text)
for ent in doc_asf.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
Apple COMPANY
Nasdaq COMPANY
ET STOCK
Nasdaq COMPANY
JD.com COMPANY
Kroger COMPANY
Nasdaq COMPANY
Nasdaq COMPANY


In [None]:
from spacy import displacy

In [None]:
displacy.render(doc_asf, style='ent')

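The `AAPL.O` ticker is still missing: symbols written with an exchange suffix such as `.O` are not matched by the plain symbol patterns.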

In [None]:
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for symbol in symbols:
    for l in letters:
        patterns.append({"label": "STOCK", "pattern": symbol+f".{l}"})
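
The suffix expansion is plain string building, so its effect on the pattern list can be inspected without spaCy. This sketch uses a hypothetical two-symbol list (`demo_symbols`) in place of the full `symbols` column.

```python
import string

# Hypothetical two-symbol list; the notebook uses the full `symbols` column.
demo_symbols = ["AAPL", "KR"]

# One pattern per symbol and per possible single-letter suffix (A to Z).
suffixed = [
    {"label": "STOCK", "pattern": f"{symbol}.{letter}"}
    for symbol in demo_symbols
    for letter in string.ascii_uppercase
]

print(len(suffixed))   # 26 patterns per symbol
print(suffixed[14])    # the 'O' suffix for the first symbol
```

The list grows by a factor of 26, so with thousands of symbols this approach adds a large number of patterns to the ruler; it trades memory for simplicity.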

In [None]:
ruler.add_patterns(patterns)
doc_asf = nlp_asf(text)
displacy.render(doc_asf, style='ent')

In [63]:
for ent in doc_asf.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
AAPL.O STOCK
Apple COMPANY
Nasdaq COMPANY
two COMPANY
ET STOCK
Nasdaq COMPANY
JD.com COMPANY
TME.N STOCK
NIO.N STOCK
Kroger COMPANY
KR.N STOCK
Nasdaq COMPANY
Nasdaq COMPANY


In [71]:
df2 = pd.read_csv('indexes.tsv', sep='\t')
df2.head()

| Unnamed: 0 | IndexName | IndexSymbol |
|---|---|---|
| 0 | Dow Jones Industrial Average | DJIA |
| 1 | Dow Jones Transportation Average | DJT |
| 2 | Dow Jones Utility Average Index | DJU |
| 3 | NASDAQ 100 Index (NASDAQ Calculation) | NDX |
| 4 | NASDAQ Composite Index | COMP |


In [74]:
indexes = df2.IndexName.tolist()
index_symbols = df2.IndexSymbol.tolist()

In [78]:
for index in indexes:
    patterns.append({ 'label': 'INDEX', 'pattern': index })
    words = index.split()
    patterns.append({ 'label': 'INDEX', 'pattern': ' '.join(words[:2]) })
for index in index_symbols:
    patterns.append({ 'label': 'INDEX', 'pattern': index })

In [79]:
ruler.add_patterns(patterns)

In [80]:
doc_asf = nlp_asf(text)
for ent in doc_asf.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
Apple COMPANY
Nasdaq COMPANY
S&P 500 INDEX
S&P 500 INDEX
ET STOCK
Dow Jones Industrial Average INDEX
S&P 500 INDEX
Nasdaq COMPANY
S&P 500 INDEX
JD.com COMPANY
Kroger COMPANY
Nasdaq COMPANY
Nasdaq COMPANY


In [81]:
df3 = pd.read_csv('stock_exchanges.tsv', sep='\t')
df3.head()

| Unnamed: 0 | BloombergExchangeCode | BloombergCompositeCode | Country | Description | ISOMIC | Google Prefix | EODcode | NumStocks |
|---|---|---|---|---|---|---|---|---|
| 0 | AF | AR | Argentina | Bolsa de Comercio de Buenos Aires | XBUE | | BA | 12 |
| 1 | AO | AU | Australia | National Stock Exchange of Australia | XNEC | | | 1 |
| 2 | AT | AU | Australia | Asx - All Markets | XASX | ASX | AU | 875 |
| 3 | AV | | Austria | Wiener Boerse Ag | XWBO | VIE | VI | 38 |
| 4 | BI | | Bahrain | Bahrain Bourse | XBAH | | | 4 |


In [84]:
exchanges = df3.ISOMIC.tolist() + df3['Google Prefix'].tolist() + df3.Description.tolist()
print(exchanges[:10])

['XBUE', 'XNEC', 'XASX', 'XWBO', 'XBAH', 'XDHA', 'XBRU', 'BVMF', 'XCNQ', 'XTSE']


In [87]:
for e in exchanges:
    # 'Google Prefix' has missing values (NaN); keep only real strings
    if isinstance(e, str):
        patterns.append({ 'label': 'STOCK_EXCHANGE', 'pattern': e })

In [88]:
ruler.add_patterns(patterns)
doc_asf = nlp_asf(text)
for ent in doc_asf.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
Apple COMPANY
Nasdaq COMPANY
S&P 500 INDEX
S&P 500 INDEX
ET STOCK
Dow Jones Industrial Average INDEX
S&P 500 INDEX
Nasdaq COMPANY
S&P 500 INDEX
JD.com COMPANY
Kroger COMPANY
NYSE STOCK_EXCHANGE
Nasdaq COMPANY
Nasdaq COMPANY
