<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/getting-started-with-nlp/11-named-entity-recognition/02_named_entity_practical_applications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Practical Applications of NER

Let’s remind ourselves of the scenario: it is widely known that certain events influence stock price movements. Specifically, you can extract relevant facts from the news and then use these facts to predict company stock prices. 

Suppose you have access to a large collection of news; now your task is to extract the relevant events and facts
that can be linked to the stock market in the downstream (stock market price prediction)
application. 

How will you do that?

One way is to apply NER, among other preprocessing steps, to your collection of news texts.
Then you can focus only on the texts and sentences
that are relevant for your task: for instance, if you are interested in the recent events in
which a particular company (e.g., “Apple”) participated, you can easily identify such texts,
sentences, and contexts.

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/getting-started-with-nlp/11-named-entity-recognition/images/ner1.png?raw=1' width='600'/>

##Setup

In [1]:
!pip -q install spacy

In [None]:
!python -m spacy download en_core_web_md 

After the installation finishes, restart the Colab runtime.

In [2]:
import spacy
from spacy import displacy

import pandas as pd

Let's download the dataset from Kaggle.

In [None]:
from google.colab import files
files.upload() # upload kaggle.json file

In [4]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 /root/.kaggle/kaggle.json

# Download the dataset from Kaggle. URL: https://www.kaggle.com/datasets/snapcrack/all-the-news?select=articles1.csv
kaggle datasets download -d snapcrack/all-the-news
unzip -qq all-the-news.zip
rm -rf all-the-news.zip

kaggle.json
Downloading all-the-news.zip to /content
 99% 241M/244M [00:05<00:00, 52.4MB/s]
100% 244M/244M [00:05<00:00, 50.2MB/s]




##Data Loading and Exploration

We are going to use the data that has already been extracted from a range of news portals: the
dataset called “All the news” is hosted on the Kaggle website. 

The dataset consists of
143,000 articles scraped from 15 news websites, including The New York Times, CNN,
Business Insider, The Washington Post, etc.

In [5]:
news_df = pd.read_csv("articles1.csv")
news_df.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [6]:
news_df.shape

(50000, 10)

Since the data from the 15 news sources is split across several .csv files, let’s find out which news sources this file covers.

In [7]:
source = news_df["publication"].unique()
print(source)

['New York Times' 'Breitbart' 'CNN' 'Business Insider' 'Atlantic']


Let's extract the content of articles from a specific source.

In [8]:
# Define a condition for the publication source to be “New York Times”
condition = news_df["publication"].isin(["New York Times"])
# Select the content from all articles that satisfy this condition and only extract the first 1000 of them
content_df = news_df.loc[condition, :]["content"][:1000]
content_df.shape

(1000,)

In [9]:
# check the contents of these articles
content_df.head()

0    WASHINGTON  —   Congressional Republicans have...
1    After the bullet shells get counted, the blood...
2    When Walt Disney’s “Bambi” opened in 1942, cri...
3    Death may be the great equalizer, but it isn’t...
4    SEOUL, South Korea  —   North Korea’s leader, ...
Name: content, dtype: object

##Named Entity Types Exploration

Let’s start by iterating through the news articles, collecting all named entities identified in
texts, and storing the number of occurrences in a Python dictionary.

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/getting-started-with-nlp/11-named-entity-recognition/images/ner2.png?raw=1' width='600'/>

Let's populate a dictionary with NEs extracted from news articles.

In [10]:
nlp = spacy.load("en_core_web_md")

In [11]:
def collect_entites(data_frame):
  named_entities = {}
  processed_docs = []

  for item in data_frame:
    # Process each news article with spaCy’s NLP pipeline
    doc = nlp(item)
    processed_docs.append(doc)

    for ent in doc.ents:
      # For each entity, extract the text
      entity_text = ent.text
      # Identify the type of the entity with ent.label_
      entity_type = str(ent.label_)
      # For each entity type, extract the list of currently stored entities with their counts
      current_ents = {}
      if entity_type in named_entities.keys():
        current_ents = named_entities.get(entity_type)
      current_ents[entity_text] = current_ents.get(entity_text, 0) + 1
      named_entities[entity_type] = current_ents
  return named_entities, processed_docs
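
The same nested counting can be sketched more compactly with the standard library. This is a minimal illustration, assuming the entities have already been extracted as `(text, label)` pairs; the helper name `count_entities` and the sample pairs are hypothetical and are not spaCy output.

```python
from collections import Counter, defaultdict


def count_entities(pairs):
    """Group (entity_text, entity_type) pairs into {type: Counter({text: count})}."""
    counts = defaultdict(Counter)
    for text, label in pairs:
        counts[label][text] += 1
    return counts


# Hypothetical, hand-written pairs standing in for (ent.text, ent.label_)
sample_pairs = [
    ("Apple", "ORG"),
    ("Google", "ORG"),
    ("Apple", "ORG"),
    ("China", "GPE"),
]

entity_counts = count_entities(sample_pairs)
# most_common(n) returns the n most frequent entries, highest count first
print(entity_counts["ORG"].most_common(1))
```

A `Counter` per entity type removes the need to check whether a type or an entity has been seen before, and `most_common` gives the frequency-sorted view directly.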

In [12]:
named_entities, processed_docs = collect_entites(content_df)

Now, let's print out the named entities dictionary.

In [13]:
def print_out(named_entities):
  for key in named_entities.keys():
    print(key)
    # Extract all entities of a particular type from the dictionary
    entities = named_entities.get(key)
    # Sort the entries by their frequency in descending order
    sorted_keys = sorted(entities, key=entities.get, reverse=True)
    # Print out the 10 most frequent ones
    for item in sorted_keys[:10]:
      # It is most informative to look only at entities that occur more than once
      if entities.get(item) > 1:
        print(f"   {item}: {entities.get(item)}")

In [14]:
print_out(named_entities)

GPE
   the United States: 1141
   Russia: 526
   China: 514
   Washington: 503
   New York: 385
   America: 356
   Iran: 294
   Mexico: 266
   Britain: 237
   California: 206
NORP
   American: 980
   Republicans: 523
   Republican: 473
   Democrats: 398
   Russian: 337
   Chinese: 288
   Americans: 267
   British: 180
   Democrat: 166
   Muslim: 164
PERSON
   Trump: 3634
   Obama: 839
   Clinton: 186
   Spicer: 134
   Donald J. Trump: 128
   Hillary Clinton: 123
   Sessions: 123
   Gorsuch: 116
   Barack Obama: 115
   Kushner: 110
ORG
   Trump: 768
   Senate: 373
   Congress: 344
   Twitter: 310
   White House: 235
   The New York Times: 230
   the White House: 223
   Times: 211
   House: 207
   Google: 134
MONEY
   1: 66
   2: 23
   10: 19
   millions of dollars: 19
   100: 18
   3: 18
   billions of dollars: 17
   5: 16
   4: 15
   $1 billion: 14
CARDINAL
   one: 1382
   two: 910
   000: 591
   three: 349
   One: 338
   four: 172
   seven: 170
   1: 155
   five: 131
   2: 118
DATE
  

Another way in which you can explore the statistics on various NE types is to aggregate
the counts on the types and print out the number of unique entries.

To do that, you extract and aggregate the statistics for each NE type, and in the end, you print out the results in a tabulated format, with each row
storing the statistics on a separate NE type.

Let's aggregate the counts on all named entity types.

In [15]:
rows = []
rows.append(["Type:", "Entries:", "Total:"])

for ent_type in named_entities.keys():
  rows.append([ent_type, str(len(named_entities.get(ent_type))), str(sum(named_entities.get(ent_type).values()))])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]

for row in rows:
  print("".join(" {:{width}} ".format(row[i], width=column_widths[i]) for i in range(0, len(row))))

 Type:        Entries:  Total: 
 GPE          1760      15100  
 NORP         541       7525   
 PERSON       10000     30268  
 ORG          4893      15215  
 MONEY        681       1239   
 CARDINAL     1216      9097   
 DATE         3107      15117  
 LAW          129       412    
 LOC          455       1462   
 ORDINAL      69        1736   
 TIME         587       1614   
 FAC          548       1060   
 QUANTITY     308       358    
 PERCENT      268       658    
 EVENT        230       562    
 PRODUCT      294       537    
 WORK_OF_ART  1322      1951   
 LANGUAGE     17        94     


As this table shows, the most frequently used named entities in the news articles are
of the types PERSON, GPE, ORG, and DATE. This is, perhaps, not very surprising:
after all, news most often reports on events related to people (PERSON),
companies (ORG), and countries (GPE), and news articles usually include references to specific
dates (DATE).

At the same time, the least frequently used entities are those of the type
LANGUAGE: there are only 17 unique languages mentioned in this news dataset, and
in total they are mentioned 94 times.

You may also note that the ORDINAL type has only 69
unique entries: it is, naturally, a very compact list of items such as first, second,
third, and so on.
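
These observations can be checked directly against the table. The sketch below copies a subset of the rows from the output above into a plain dictionary (the variable names are illustrative) and ranks the entity types by total mentions and by unique entries.

```python
# (unique entries, total mentions) for a subset of rows copied from the table above
stats = {
    "GPE": (1760, 15100),
    "NORP": (541, 7525),
    "PERSON": (10000, 30268),
    "ORG": (4893, 15215),
    "DATE": (3107, 15117),
    "CARDINAL": (1216, 9097),
    "ORDINAL": (69, 1736),
    "QUANTITY": (308, 358),
    "LANGUAGE": (17, 94),
}

# Rank entity types by total number of mentions, highest first
by_total = sorted(stats, key=lambda t: stats[t][1], reverse=True)
# Rank entity types by number of unique entries, lowest first
by_unique = sorted(stats, key=lambda t: stats[t][0])

print("Most mentioned:", by_total[:4])
print("Fewest unique entries:", by_unique[:2])
```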

##Information Extraction

Consider the scenario again: your task is to
build an information extraction application focused on companies and the news that reports
on these companies. The dataset at hand contains information on
as many as 4,893 organizations (the unique ORG entries in the table above). Of course, not all of them might be of interest to you, so it
would make sense to select a few and extract information on them.

Recall that `spaCy`’s NLP pipeline processes sentences (or full documents) and returns a
data structure, which contains all sorts of information on the words in the sentence (text),
including the information about the word’s type (part-of-speech, e.g., verb, noun, etc.), its
named entity type, its role in the sentence (e.g., main verb or ROOT, main action’s participant
or nsubj, and so on).

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/getting-started-with-nlp/11-named-entity-recognition/images/ner3.png?raw=1' width='600'/>

In addition, each word has a unique index that is linked to its position in the sentence. If
a named entity consists of multiple words, some of them may be marked with the `nsubj` or
`dobj` relations (i.e., relevant relations in your application), but your goal is to extract not
only the word marked as `nsubj` or `dobj` but the whole named entity, which plays this role. 

To
do that, the best way is to match the named entities to their roles in the sentence via the
indexes assigned to the named entities in the sentence.

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/getting-started-with-nlp/11-named-entity-recognition/images/ner4.png?raw=1' width='600'/>

Your goal is to identify whether The New York Times is one
of the participants of the main action (wrote) in this sentence – the subject (the entity that
performs the action) or an object (an entity to which the action applies). Indeed, The New
York Times as a whole is the subject – it is the entity that performed the action of writing.

However, since linguistic analysis applies to individual words rather than whole expressions,
technically only the word Times is directly dependent on the main verb wrote – this is shown
through the chain of relations.

How can you extract the whole expression The
New York Times?

To do that, you first identify the indexes of the words covered by this expression in the
sentence: for The New York Times these are `[0, 1, 2, 3]`.

Next, you check whether a word with any of these indexes plays the role of the subject or an
object in the sentence. Indeed, the word that is the subject in the sentence has the index 3.

Therefore, you can return the whole
named entity The New York Times as the subject of the main action in the sentence.
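
The index-matching idea can be sketched without spaCy. The token list below is a hand-written stand-in for a parsed sentence: only the span `[0, 1, 2, 3]` and the `nsubj` label on Times (index 3) come from the text above, while the remaining dependency labels and the helper name `role_of_entity` are assumptions made for illustration.

```python
# Hand-written stand-in for a parsed sentence: (index, word, dependency label).
# Only "Times" -> nsubj at index 3 comes from the text; the other labels are assumed.
tokens = [
    (0, "The", "det"),
    (1, "New", "compound"),
    (2, "York", "compound"),
    (3, "Times", "nsubj"),
    (4, "wrote", "ROOT"),
    (5, "about", "prep"),
    (6, "Apple", "pobj"),
]
entity_text = "The New York Times"
entity_indexes = [0, 1, 2, 3]


def role_of_entity(tokens, entity_indexes, roles=("nsubj", "dobj")):
    """Return the first relevant role played by any word inside the entity span, or None."""
    for index, _word, dep in tokens:
        if index in entity_indexes and dep in roles:
            return dep
    return None


role = role_of_entity(tokens, entity_indexes)
if role is not None:
    # Report the whole entity, not just the single word that carries the relation
    print(f"{entity_text} -> {role}")
```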

Let's extract the indexes of the words covered by the NE.


In [16]:
def extract_span(sent, entity):
  indexes = []
  for ent in sent.ents:
    if ent.text == entity:
      for i in range(int(ent.start), int(ent.end)):
        indexes.append(i)
  return indexes

Now, let's extract information about the main participants of the action.

In [17]:
def extract_information(sent, entity, indexes):
  actions = []
  action = ""
  participant1 = ""
  participant2 = ""

  for token in sent:
    # Identify the main verb expressing the main action in the sentence
    if token.pos_=="VERB" and token.dep_=="ROOT":  
        subj_ind = -1
        obj_ind = -1
        action = token.text
        children = [child for child in token.children]   
        for child1 in children:
            if child1.dep_=="nsubj":
                participant1 = child1.text
                # Find the subject via the nsubj relation and store it as participant1 and its index as subj_ind
                subj_ind = int(child1.i)
            if child1.dep_=="prep":
                participant2 = ""
                child1_children = [child for child in child1.children]
                for child2 in child1_children:
                    if child2.pos_ == "NOUN" or child2.pos_ == "PROPN":
                        participant2 = child2.text
                        # Search for the object of the preposition as the second participant and store it as participant2 and its index as obj_ind
                        obj_ind = int(child2.i)
                if not participant2=="":
                    if subj_ind in indexes:
                        actions.append(entity + " " + action + " " + child1.text + " " + participant2)
                    elif obj_ind in indexes:
                        actions.append(participant1 + " " + action + " " + child1.text + " " + entity)

            if child1.dep_=="dobj" and (child1.pos_ == "NOUN" or child1.pos_ == "PROPN"):
                participant2 = child1.text
                obj_ind = int(child1.i)
                if subj_ind in indexes:
                    actions.append(entity + " " + action + " " + participant2)
                elif obj_ind in indexes:
                    actions.append(participant1 + " " + action + " " + entity)
                
  if not len(actions)==0:
      print (f"\nSentence = {sent}")
      for item in actions:
          print(item)

Now let’s apply this code to your texts extracted from the news articles.

So, let's extract information on a specific entity.

In [18]:
def entity_detector(processed_docs, entity, ent_type):
  output_sentences = []
  for doc in processed_docs:
    for sent in doc.sents:
      if entity in [ent.text for ent in sent.ents if ent.label_ == ent_type]:
        output_sentences.append(sent)
  return output_sentences

In [19]:
entity = "Apple"
ent_sentences = entity_detector(processed_docs, entity, "ORG")
print(len(ent_sentences))

for sent in ent_sentences:
  indexes = extract_span(sent, entity)
  extract_information(sent, entity, indexes)

61

Sentence = Apple, complying with what it said was a request from Chinese authorities, removed news apps created by The New York Times from its app store in China late last month.
Apple removed apps

Sentence = Apple removed both the   and   apps from the app store in China on Dec. 23.
Apple removed apps
Apple removed from store
Apple removed on Dec.

Sentence = Apple has previously removed other, less prominent media apps from its China store.
Apple removed apps

Sentence = It puts Apple and Google in a difficult position.
It puts Apple

Sentence = Russia required Apple and Google to remove the LinkedIn app from their local stores.
Russia required Apple

Sentence = On Friday, Apple, its longtime partner, sued Qualcomm over what it said was $1 billion in withheld rebates.
Apple sued Qualcomm

Sentence = Apple sued three days after the  Federal Trade Commission accused Qualcomm of using anticompetitive practices to guarantee its high royalty payments for advanced wireless technology.

The tuples consisting of the main action and its two participants concisely summarize
the main content of the sentences. This is useful if you are interested in extracting only the
sentences that have such informative content and that directly answer questions such as “What did Apple do to X?” or “What did Y do to Apple?”
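
As a rough sketch of how such tuples could be queried, the snippet below stores a few of the facts printed above as named tuples and filters them by the role the entity plays. The `Fact` type and the helper names `done_by` and `done_to` are hypothetical and are not part of the notebook's code.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    action: str
    obj: str


# Facts transcribed from the output above
facts = [
    Fact("Apple", "removed", "apps"),
    Fact("It", "puts", "Apple"),
    Fact("Russia", "required", "Apple"),
    Fact("Apple", "sued", "Qualcomm"),
]


def done_by(facts, entity):
    """Facts that answer 'What did <entity> do to X?'."""
    return [fact for fact in facts if fact.subject == entity]


def done_to(facts, entity):
    """Facts that answer 'What did Y do to <entity>?'."""
    return [fact for fact in facts if fact.obj == entity]


for fact in done_by(facts, "Apple"):
    print("Apple as subject:", " ".join(fact))
for fact in done_to(facts, "Apple"):
    print("Apple as object:", " ".join(fact))
```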

Now, let's extract information on named entities consisting of multiple words.

In [20]:
entity = "The New York Times"
sentences = ["The New York Times wrote about Apple"]

for sent in sentences:
  doc = nlp(sent)
  indexes = extract_span(doc, entity)
  print(indexes)
  extract_information(doc, entity, indexes)

[0, 1, 2, 3]

Sentence = The New York Times wrote about Apple
The New York Times wrote about Apple


##Named Entities Visualization

One of the most useful ways to explore named entities contained in text and to extract
relevant information is to visualize the results of NER.

To that end, let’s revisit extraction of sentences containing the entity in question
and explore visualization to highlight the use of the entity alongside other relevant entities.

We use `spaCy`’s visualization tool, `displaCy`, which allows you to
highlight entities of different types in the selected set of sentences using distinct colors for each type.

In [29]:
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

In [30]:
# let's visualize named entities of various types in their contexts of use
def visualize(processed_docs, entity, ent_type):
  for doc in processed_docs:
    for sent in doc.sents:
      # Identify the sentences that contain the entity in question and visualize the context
      if entity in [ent.text for ent in sent.ents if ent.label_ == ent_type]:
        displacy.render(sent, style="ent", jupyter=True)

In [31]:
visualize(processed_docs, "Apple", "ORG")

Finally, you might be interested specifically in the contexts in which the company Apple is
mentioned alongside other companies. 

Let’s filter out all other information and only highlight
named entities of the same type as the entity in question – i.e., all ORG NEs in this case.

In [32]:
# Here, you define the count_ents function that counts the number of entities of a certain type in a sentence
def count_ents(sent, ent_type):
  return len([ent.text for ent in sent.ents if ent.label_ == ent_type])

# Here, you extract only the sentences that mention the input entity of a specified type as well as at least one other entity of the same type
def entity_detector_custom(processed_docs, entity, ent_type):
  output_sentences = []
  for doc in processed_docs:
    for sent in doc.sents:
      # Keep the sentence if it contains the entity in question and more than one entity of this type
      if count_ents(sent, ent_type) > 1 and entity in [ent.text for ent in sent.ents if ent.label_ == ent_type]:
        output_sentences.append(sent)
  return output_sentences

In [33]:
output_sentences = entity_detector_custom(processed_docs, "Apple", "ORG")
print(output_sentences)

[American tech giants like Google, Apple and Facebook are on a collision course with European regulators over issues including privacy and taxes., Nearly a year ago, I argued that we were witnessing a new era in the tech business, one that is typified less by the storied   in a garage than by a posse I like to call the Frightful Five: Amazon, Apple, Facebook, Microsoft and Alphabet, Google’s parent company., The precise nature of the fights varies by company and region, including the tax and antitrust investigations of Apple and Google in Europe and Donald J. Trump’s broad and often incoherent criticism of the Five for various alleged misdeeds., Apple’s sales were flat last year, and after a monster 2016, Alphabet’s stock price hit a plateau., When Apple took on the Federal Bureau of Investigation last year over access to a terrorist’s iPhone, many in tech sided with the company, but a majority of Americans thought Apple should give in., Apple, complying with what it said was a request
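
Once such sentences are collected, one possible next step is to count which organizations appear alongside the entity. The sketch below is an assumption-laden illustration: the ORG lists are annotated by hand from three of the sentences above (spaCy's actual annotations may differ), and the helper name `co_occurring` is hypothetical.

```python
from collections import Counter

# Hand-annotated ORG mentions per sentence (an assumption for illustration;
# spaCy's actual annotations for these sentences may differ)
org_mentions = [
    ["Google", "Apple", "Facebook"],
    ["Apple", "Alphabet"],
    ["Apple", "Google"],
]


def co_occurring(mentions_per_sentence, entity):
    """Count, per sentence, the other entities mentioned together with `entity`."""
    counts = Counter()
    for mentions in mentions_per_sentence:
        if entity in mentions:
            # Count each co-occurring entity at most once per sentence
            counts.update(m for m in set(mentions) if m != entity)
    return counts


partners = co_occurring(org_mentions, "Apple")
print(partners.most_common())
```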

Now visualize the results, that is, the named entities of the specified type (you can change the colors by selecting the codes from https://htmlcolorcodes.com/color-chart/):

In [35]:
def visualize_type(sents, entity, ent_type):
  # Customize the color for the selected entity type and apply a gradient
  colors = {ent_type: "linear-gradient(90deg, #64B5F6, #E0F7FA)"}
  # Highlight only the entities of the selected type
  options = {"ents": [ent_type], "colors": colors}

  for sent in sents:
    displacy.render(sent, style="ent", options=options, jupyter=True)

In [36]:
visualize_type(output_sentences, "Apple", "ORG")

Congratulations – you can now extract from a collection
of news articles all relevant events and facts summarizing the actions undertaken by the
participants of interest, for example, specific companies. 

These events can be further used in
downstream tasks: for example, if you also harvest data on stock price movements, you can
link the events extracted from the news to the changes in the stock prices immediately
following such events, which will help you to predict how the stock price may change in view
of similar events in the future.
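
To make the linking step concrete, here is a minimal sketch under stated assumptions: the event, the company names, the dates, and the closing prices are all made up for illustration, and `next_day_change` is a hypothetical helper rather than part of this notebook. It computes the relative price change between the event day and the next day for which a price is available.

```python
from datetime import date

# Hypothetical extracted event and closing prices; all values are made up
events = [(date(2020, 1, 2), "ACME sued Widgets Inc")]
closing_prices = {
    date(2020, 1, 2): 100.0,
    date(2020, 1, 3): 95.0,
}


def next_day_change(event_date, prices):
    """Relative price change from the event day to the next day with a price, or None."""
    later_days = sorted(day for day in prices if day > event_date)
    if event_date not in prices or not later_days:
        return None
    return (prices[later_days[0]] - prices[event_date]) / prices[event_date]


for event_date, event in events:
    change = next_day_change(event_date, closing_prices)
    print(event_date, event, change)
```

Pairs of (event, price change) built this way could then serve as training examples for a downstream prediction model.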