# Named Entity Recognition (NER) with spaCy

We will be performing NER on threads from the **Investing** subreddit, but first let's test spaCy's NER on a single example post from */r/investing*.

In [1]:
import spacy
from spacy import displacy

In [2]:
nlp = spacy.load('en_core_web_sm')

# en_core_web_sm Model

`en_core_web_sm` is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities.

Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.

More details:
https://spacy.io/models/en

In [3]:
txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

In [4]:
doc = nlp(txt)


In [5]:
displacy.render(doc, jupyter=True, style='ent')
# displacy.serve(doc, style='ent') if not running in a notebook

We immediately get NER output that is not perfect, but pretty good. We are using the [`en_core_web_sm`](https://spacy.io/models/en) model, where `en` refers to English and `sm` to small.

The model accurately identifies ARK as an organization. It also classifies ETF (exchange traded fund) as an organization, which is incorrect (an ETF is a grouping of securities on the markets), but it's easy to see why it is classified as one. The other tag we can see is `WORK_OF_ART`. It isn't immediately clear what this means, so we can get more information using `spacy.explain`:

In [6]:
spacy.explain('WORK_OF_ART')

'Titles of books, songs, etc.'

This description fits the tagged item reasonably well: it refers to an article, although not quite a book.

We have a visual output from our tagged text, but this won't be particularly useful programmatically. What we need is a way to extract the relevant tags (the organizations) from our text. To do that we can use `doc.ents`, which returns a tuple of all identified entities.

Each entity in `doc.ents` has two attributes that we are interested in, `label_` and `text`:

In [7]:
ner = [(ent.text, ent.label_) for ent in doc.ents]
ner

[('ARK', 'ORG'),
 ('The Bear Cave](https://thebearcave.substack.com/p/special-edition', 'ORG'),
 ('ARK', 'ORG'),
 ('ARK', 'ORG')]
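The second entity above has picked up markdown link syntax from the original post. One possible way to reduce this kind of noise is to strip inline markdown links before passing the text to `nlp`. The following is a minimal sketch under stated assumptions: `MD_LINK` and `strip_markdown_links` are hypothetical names (not part of spaCy), the pattern only handles simple `[label](url)` links whose URLs contain no spaces or parentheses, and it makes no claim about how the model will tag the cleaned text.

```python
import re

# Matches inline markdown links of the form [label](url) and captures the label.
# Assumption: the URL contains no whitespace or parentheses.
MD_LINK = re.compile(r"\[([^\]]+)\]\((?:[^()\s]+)\)")


def strip_markdown_links(text: str) -> str:
    """Replace each [label](url) with just its label."""
    return MD_LINK.sub(r"\1", text)


sample = (
    "written up very nicely by "
    "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). "
    "The risks"
)
cleaned = strip_markdown_links(sample)
print(cleaned)  # written up very nicely by The Bear Cave. The risks
```

The cleaned string could then be passed to `nlp` in place of the raw post.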

We're almost there. Now, we need to filter out any entities that are not `ORG` entities, and append those remaining `ORG`s to an organization list:

In [8]:
# initialize our list
org_list = []

for entity in doc.ents:
    # if label_ is ORG, we append text, otherwise ignore
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

org_list

['ARK',
 'The Bear Cave](https://thebearcave.substack.com/p/special-edition',
 'ARK',
 'ARK']

In [9]:
# we don't need to see 'ARK' three times, so we use set() to remove duplicates, and then convert back to list
org_list = list(set(org_list))

org_list

['ARK', 'The Bear Cave](https://thebearcave.substack.com/p/special-edition']
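Note that `set()` does not guarantee any particular order, so the deduplicated list may come back in a different order from run to run. If first-seen order matters, one alternative is `dict.fromkeys`. This is a small sketch using a hand-copied version of the list above; `orgs` and `unique_orgs` are illustrative names.

```python
# Entity texts as they appear in the org_list output above, duplicates included.
orgs = [
    "ARK",
    "The Bear Cave](https://thebearcave.substack.com/p/special-edition",
    "ARK",
    "ARK",
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping first-seen order.
unique_orgs = list(dict.fromkeys(orgs))
print(unique_orgs)
# ['ARK', 'The Bear Cave](https://thebearcave.substack.com/p/special-edition']
```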

Another example:

Nintendo Co Ltd 7974.T said on Thursday third-quarter operating profit rose 6%,driven by Switch console sales in the year-end shopping season, but the earnings fell below market expectations.Profit for the October-December quarter was 168.7 billion yen ($1.54 billion) versus 158.6 billion yen a year earlier.That compared with an average forecast of 175 billion yen from 10 analyst estimates compiled by Refinitiv.

In [10]:
txt = ('Nintendo Co Ltd 7974.T said on Thursday third-quarter operating profit rose 6%,driven by Switch console sales in the year-end shopping season, but the earnings fell below market expectations.Profit for the October-December quarter was 168.7 billion yen ($1.54 billion) versus 158.6 billion yen a year earlier.That compared with an average forecast of 175 billion yen from 10 analyst estimates compiled by Refinitiv.')

In [11]:
doc = nlp(txt)
ner = [(ent.text, ent.label_) for ent in doc.ents]
ner


[('Nintendo Co Ltd 7974.T', 'ORG'),
 ('Thursday third-quarter', 'DATE'),
 ('6%,driven', 'CARDINAL'),
 ('Switch', 'NORP'),
 ('year-end shopping season', 'DATE'),
 ('October-December quarter', 'DATE'),
 ('168.7 billion yen', 'MONEY'),
 ('$1.54 billion', 'MONEY'),
 ('158.6 billion yen', 'MONEY'),
 ('a year earlier', 'DATE'),
 ('175 billion yen', 'MONEY'),
 ('10', 'CARDINAL')]
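With several entity types in play, it can be handy to group the `(text, label)` pairs by label. The sketch below does this with plain Python on a few pairs copied by hand from the output above; `pairs` and `by_label` are illustrative names, and in the notebook the `ner` list itself could be used instead.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the output above.
pairs = [
    ("Nintendo Co Ltd 7974.T", "ORG"),
    ("168.7 billion yen", "MONEY"),
    ("$1.54 billion", "MONEY"),
    ("a year earlier", "DATE"),
]

# Map each label to the list of entity texts carrying that label.
by_label = defaultdict(list)
for text, label in pairs:
    by_label[label].append(text)

print(dict(by_label))
```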

Each token also exposes two entity attributes. `ent_type_` is the named entity type. `ent_iob_` is the IOB code of the named entity tag: “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.
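To make the IOB scheme concrete, here is a rough sketch that rebuilds entity spans from per-token tags. It works on plain `(token text, IOB code, entity type)` triples written by hand in the shape this notebook produces for the Nintendo sentence, so it does not need a loaded model. `iob_to_spans` is a hypothetical helper, not a spaCy function, and because it joins tokens with single spaces the original whitespace is not recovered (for example, `third-quarter` comes back as `third - quarter`).

```python
def iob_to_spans(tagged):
    """Group (token, iob, ent_type) triples into (entity_text, ent_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, iob, ent_type in tagged:
        if iob == "B":
            # A new entity starts; close the one in progress, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], ent_type
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" or "" (or a stray "I"): not part of an entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    # Close an entity that runs to the end of the input.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tagged = [
    ("Nintendo", "B", "ORG"),
    ("Co", "I", "ORG"),
    ("Ltd", "I", "ORG"),
    ("7974.T", "I", "ORG"),
    ("said", "O", ""),
    ("on", "O", ""),
    ("Thursday", "B", "DATE"),
    ("third", "I", "DATE"),
    ("-", "I", "DATE"),
    ("quarter", "I", "DATE"),
    ("operating", "O", ""),
]
spans = iob_to_spans(tagged)
print(spans)
# [('Nintendo Co Ltd 7974.T', 'ORG'), ('Thursday third - quarter', 'DATE')]
```

In practice `doc.ents` already provides these spans with the original spacing; the sketch is only meant to show how "B", "I" and "O" delimit an entity.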

In [None]:
tag = [(token, token.ent_iob_, token.ent_type_) for token in doc]
tag

[(Nintendo, 'B', 'ORG'),
 (Co, 'I', 'ORG'),
 (Ltd, 'I', 'ORG'),
 (7974.T, 'I', 'ORG'),
 (said, 'O', ''),
 (on, 'O', ''),
 (Thursday, 'B', 'DATE'),
 (third, 'I', 'DATE'),
 (-, 'I', 'DATE'),
 (quarter, 'I', 'DATE'),
 (operating, 'O', ''),
 (profit, 'O', ''),
 (rose, 'O', ''),
 (6%,driven, 'B', 'CARDINAL'),
 (by, 'O', ''),
 (Switch, 'B', 'NORP'),
 (console, 'O', ''),
 (sales, 'O', ''),
 (in, 'O', ''),
 (the, 'O', ''),
 (year, 'B', 'DATE'),
 (-, 'I', 'DATE'),
 (end, 'I', 'DATE'),
 (shopping, 'I', 'DATE'),
 (season, 'I', 'DATE'),
 (,, 'O', ''),
 (but, 'O', ''),
 (the, 'O', ''),
 (earnings, 'O', ''),
 (fell, 'O', ''),
 (below, 'O', ''),
 (market, 'O', ''),
 (expectations, 'O', ''),
 (., 'O', ''),
 (Profit, 'O', ''),
 (for, 'O', ''),
 (the, 'O', ''),
 (October, 'B', 'DATE'),
 (-, 'I', 'DATE'),
 (December, 'I', 'DATE'),
 (quarter, 'I', 'DATE'),
 (was, 'O', ''),
 (168.7, 'B', 'MONEY'),
 (billion, 'I', 'MONEY'),
 (yen, 'I', 'MONEY'),
 ((, 'O', ''),
 ($, 'B', 'MONEY'),
 (1.54, 'I', 'MONEY'),
 (bi