# Named Entity Recognition (NER) using spaCy - Extracting Subject Verb Action


## First, what is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. The name itself gives us several clues to what BERT is all about.

The BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder contains two sub-layers: a self-attention layer and a feed-forward layer.

### There are two different BERT models:

- BERT base, which consists of 12 Transformer encoder layers, 12 attention heads, a hidden size of 768, and 110M parameters.

- BERT large, which consists of 24 Transformer encoder layers, 16 attention heads, a hidden size of 1024, and 340M parameters.



### BERT Input and Output

BERT expects a sequence of tokens as input (WordPiece subword units, so a token is not always a whole word). In each sequence of tokens, there are two special tokens that BERT expects:

- [CLS]: This is the first token of every sequence, which stands for classification token.
- [SEP]: This token tells BERT which tokens belong to which sequence. It matters mainly for next sentence prediction and question-answering tasks, where the input holds two sequences. If we only have one sequence, this token is appended to the end of the sequence.


It is also important to note that the maximum number of tokens that can be fed into a BERT model is 512. If a sequence has fewer than 512 tokens, we can pad the unused token slots with the [PAD] token. If a sequence has more than 512 tokens, we need to truncate it.

And that’s all that BERT expects as input.
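The input rules above can be sketched in plain Python. This is a toy illustration, not a real BERT tokenizer: the function name `prepare_input`, the tiny `MAX_LEN` of 8 (instead of 512), and the attention mask that marks real tokens versus padding are all assumptions made for readability.

```python
# Toy illustration of BERT-style input preparation (not a real tokenizer).
MAX_LEN = 8  # assumption for readability; BERT's actual limit is 512


def prepare_input(tokens, max_len=MAX_LEN):
    """Add [CLS]/[SEP], truncate to max_len, then pad with [PAD]."""
    # Reserve two slots for the special tokens before truncating
    body = tokens[: max_len - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # 1 marks a real token, 0 marks padding (hypothetical helper output)
    attention_mask = [1] * len(sequence)
    padding = max_len - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding
    return sequence, attention_mask


# A short sequence gets padded, a long one gets truncated
short_seq, short_mask = prepare_input(["he", "would", "not", "tell"])
long_seq, long_mask = prepare_input(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
print(short_seq)
print(short_mask)
print(long_seq)
```

Both sequences come out with exactly `MAX_LEN` entries, which is what lets them be batched together.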

BERT then outputs an embedding vector for each token (of size 768 for BERT base). We can use these vectors as input for different kinds of NLP applications, whether it is text classification, next sentence prediction, Named Entity Recognition (NER), or question-answering.


------------

**For a text classification task**, we focus our attention on the embedding vector output from the special [CLS] token. This means that we’re going to use the embedding vector of size 768 from the [CLS] token as input for our classifier, which then outputs a vector whose size equals the number of classes in our classification task.
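A minimal sketch of such a classifier head, in plain Python, may make the shapes concrete. Everything here is an assumption for illustration: the [CLS] embedding is random stand-in data rather than real BERT output, the linear layer is untrained, `NUM_CLASSES = 3` is hypothetical, and the final softmax is one common choice rather than something the text above prescribes.

```python
import math
import random

HIDDEN_SIZE = 768  # size of the [CLS] embedding in BERT base
NUM_CLASSES = 3    # hypothetical number of classes

random.seed(0)
# Stand-in for the [CLS] embedding that BERT would produce
cls_embedding = [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
# Untrained linear layer: one weight row of length HIDDEN_SIZE per class
weights = [
    [random.uniform(-0.05, 0.05) for _ in range(HIDDEN_SIZE)]
    for _ in range(NUM_CLASSES)
]
biases = [0.0] * NUM_CLASSES


def classify(embedding):
    """Map one embedding vector to a probability per class."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    # Softmax, shifted by the largest logit for numerical stability
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = classify(cls_embedding)
print(len(probabilities))
```

The point is only the shape: a vector of length 768 goes in, and a vector with one entry per class comes out.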

-----------------------

![Imgur](https://imgur.com/NpeB9vb.png)

-------------------------

In [1]:
import pandas as pd


In [2]:
ROOT_DIR = '../../input/' # Local Machine

# ROOT_DIR = '../input/all-the-news/' # Kaggle


# print(os.listdir('../input/all-the-news/'))

In [3]:

# path = "all-the-news/"
df = pd.read_csv(ROOT_DIR + "articles1.csv")

In [4]:
df.shape

(50000, 10)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [6]:
sources = df["publication"].unique()
print(sources)

['New York Times' 'Breitbart' 'CNN' 'Business Insider' 'Atlantic']


In [7]:
condition = df["publication"].isin(["New York Times"])

content_df = df.loc[condition, :]["content"][:100]

content_df.shape

(100,)

In [8]:
content_df.head()

0    WASHINGTON  —   Congressional Republicans have...
1    After the bullet shells get counted, the blood...
2    When Walt Disney’s “Bambi” opened in 1942, cri...
3    Death may be the great equalizer, but it isn’t...
4    SEOUL, South Korea  —   North Korea’s leader, ...
Name: content, dtype: object

In [None]:
for article in content_df[:2]:
    print(article)

In [10]:
import spacy

nlp = spacy.load('en_core_web_sm')

text = "He would not tell the police what he had learned"

doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop  )



He he PRON PRP nsubj Xx True True
would would AUX MD aux xxxx True True
not not PART RB neg xxx True True
tell tell VERB VB ROOT xxxx True False
the the DET DT det xxx True True
police police NOUN NNS dobj xxxx True False
what what PRON WP dobj xxxx True True
he he PRON PRP nsubj xx True True
had have AUX VBD aux xxx True True
learned learn VERB VBN ccomp xxxx True False


### The named_entities variable below should hold a nested dict structure like this

```
{
'GPE': {'WASHINGTON': 1,
    'Obama': 1,
    'the District of Columbia Circuit': 1,
    'Manhattan': 13,
    'New York City': 3,
    'Bronx': 19 },

'NORP': {'Republicans': 15,
    'Americans': 1,
    'Republican': 2,
    'Hispanic': 1}
}
```
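One compact way to build a structure of this shape is a `defaultdict` of `Counter` objects. The sketch below uses hand-written `(text, label)` pairs as a stand-in for spaCy's `doc.ents`; the names `toy_entities`, `counts_by_type`, and `nested` are illustrative only.

```python
from collections import Counter, defaultdict

# Hand-written (entity text, entity label) pairs standing in for doc.ents
toy_entities = [
    ("WASHINGTON", "GPE"),
    ("Republicans", "NORP"),
    ("Manhattan", "GPE"),
    ("Republicans", "NORP"),
    ("Manhattan", "GPE"),
    ("Americans", "NORP"),
]

# Outer key: entity type; inner Counter: entity text -> frequency
counts_by_type = defaultdict(Counter)
for text, label in toy_entities:
    counts_by_type[label][text] += 1

# Convert to plain dicts so the result prints like the structure above
nested = {label: dict(counter) for label, counter in counts_by_type.items()}
print(nested)
```

With this layout there is no need to check whether an entity type has been seen before, because the `defaultdict` creates the inner `Counter` on first access.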

In [11]:
def return_entities_and_processed_docs(data_frame):
    """
    Extracts named entities from a collection of texts and returns them along with the processed docs.

    Args:
        data_frame (pandas.Series): The texts to process, e.g. the "content" column of the DataFrame.

    Returns:
        dict: A dictionary mapping entity types to a dictionary of entity names to counts.
        list: A list of processed docs.
    """
    # Initialize the dictionary to hold named entities
    named_entities = {}
    # Initialize the list to hold processed docs
    processed_docs = []
    
    # Process each item in the DataFrame
    for item in data_frame:
        # Process the item with a language model
        doc = nlp(item)
        # Add the processed doc to the list
        processed_docs.append(doc)
        
        # For each named entity in the doc...
        for ent in doc.ents:
            # Extract the entity text (e.g., 'WASHINGTON')
            entity_text = ent.text
            # Extract the entity type (e.g., 'GPE')
            entity_type = str(ent.label_)
            # Initialize a dictionary to hold the current entities
            current_ents = {}
            
            # If the entity type is already in the named entities dictionary...
            if entity_type in named_entities.keys():
                # Get the dictionary of entity names to counts
                current_ents = named_entities.get(entity_type)
                
            # Increment the count for the entity name
            # This will add 1 to the count in the inner dictionary
            current_ents[entity_text] = current_ents.get(entity_text, 0) + 1
            
            # Update the inner dictionary for the entity type
            named_entities[entity_type] = current_ents

    # Return the named entities and the processed docs
    return named_entities, processed_docs


named_entities, processed_docs = return_entities_and_processed_docs(content_df)


In [12]:
named_entities   

{'GPE': {'WASHINGTON': 18,
  'Obama': 2,
  'the District of Columbia Circuit': 1,
  'Manhattan': 30,
  'New York City': 13,
  'Bronx': 26,
  'Upper Manhattan': 2,
  'Lower Manhattan': 2,
  'the United States': 81,
  'St. Mary’s Park': 1,
  'New York': 38,
  'Mott Haven': 1,
  'Brooklyn': 18,
  'Queens': 10,
  'Staten Island': 4,
  'Washington Heights': 1,
  'Gramercy Park': 1,
  'the City of New York': 1,
  'America': 31,
  'Hollywood': 7,
  'California': 21,
  'San Francisco': 7,
  'San Francisco Bay': 1,
  'Angel Island': 2,
  'Guangdong Province': 1,
  'United States': 20,
  'Attuned': 1,
  'Sacramento': 2,
  'Los Angeles': 10,
  'Pasadena': 1,
  'Sunland': 2,
  'Calif.': 11,
  'Minneapolis': 3,
  'Eagles': 1,
  'Saskatchewan': 1,
  'Tennessee': 4,
  'New York Giant': 1,
  'Texas': 6,
  'Waco': 1,
  'Tex.': 4,
  'Rwanda': 1,
  'Bosnia': 1,
  'Israel': 16,
  'Poland': 1,
  'Vera Rubin': 1,
  'SEOUL': 1,
  'South Korea': 7,
  'North Korea’s': 2,
  'North Korea': 13,
  'Cheong': 1,
  '

In [13]:
len(processed_docs)

100

### Better way to print the Entity Types and Values

The function below prints each entity type (e.g., ORG) followed by the entities assigned to that type, sorted by frequency in descending order.

It prints at most the 10 most frequent entities per type and skips entities that occur only once.

In [14]:
def print_top_10(named_entities):
    for key in named_entities.keys():        
        print(key)
        entities = named_entities.get(key)
                
        # Sort the entries by their frequency in descending 
        # order and print out the most frequent n ones
        sorted_keys = sorted(entities, key=entities.get, reverse=True )
        for item in sorted_keys[:10]:
            if (entities.get(item) > 1 ):
                print(" " + item + ": " + str(entities.get(item)))
                
print_top_10(named_entities)
        

GPE
 the United States: 81
 China: 61
 Russia: 50
 Washington: 48
 Afghanistan: 42
 Turkey: 40
 New York: 38
 America: 31
 Manhattan: 30
 Iraq: 28
NORP
 Republicans: 138
 American: 136
 Republican: 67
 Democrats: 60
 Russian: 59
 Chinese: 48
 Democratic: 31
 Americans: 26
 Russians: 24
 Democrat: 23
PERSON
 Trump: 255
 Obama: 114
 Roof: 41
 Kelly: 39
 Donald J. Trump: 37
 Wong: 31
 Clinton: 30
 Kerr: 27
 Netanyahu: 25
 Leahy: 24
ORG
 House: 84
 Senate: 69
 Congress: 68
 Hacking Team: 35
 the White House: 31
 Times: 27
 Apple: 25
 Hacking Team’s: 24
 Trump: 23
 NBC: 20
MONEY
 1: 9
 billions of dollars: 5
 2: 5
 3: 4
 more than $1 billion: 3
 the billions of dollars: 3
 100: 3
 millions of dollars: 3
 tens of millions: 3
 $5 million: 3
CARDINAL
 one: 122
 two: 91
 000: 53
 three: 40
 One: 34
 Five: 24
 2: 23
 four: 21
 1: 19
 10: 14
DATE
 Wednesday: 50
 Tuesday: 42
 2015: 36
 last year: 32
 2016: 32
 Sunday: 32
 Thursday: 32
 years: 30
 this year: 29
 Monday: 23
LAW
 the Affordable Care 

In [15]:
named_entities.keys()

dict_keys(['GPE', 'NORP', 'PERSON', 'ORG', 'MONEY', 'CARDINAL', 'DATE', 'LAW', 'LOC', 'ORDINAL', 'PRODUCT', 'TIME', 'FAC', 'QUANTITY', 'PERCENT', 'WORK_OF_ART', 'EVENT', 'LANGUAGE'])

#### Interpretation of the above

As the output above shows, entities of the types PERSON, GPE, ORG, and DATE are among the most frequently used in the news articles. This is, perhaps, not very surprising: news most often reports on events related to people (PERSON), companies and institutions (ORG), and countries (GPE), and news articles usually include references to specific dates (DATE).
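To compare entity types rather than individual entities, the counts can be summed per type. The sketch below is illustrative only: `sample_counts` holds just the three most frequent entries for four of the types printed above, so the totals are not the real totals for the 100 articles.

```python
# Top-3 entries per type, copied from the printed output above (partial data)
sample_counts = {
    "GPE": {"the United States": 81, "China": 61, "Russia": 50},
    "PERSON": {"Trump": 255, "Obama": 114, "Roof": 41},
    "ORG": {"House": 84, "Senate": 69, "Congress": 68},
    "MONEY": {"1": 9, "billions of dollars": 5, "2": 5},
}

# Sum the mention counts inside each entity type
totals = {label: sum(entities.values()) for label, entities in sample_counts.items()}

# Rank entity types by their total number of mentions
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
for label, total in ranked:
    print(label, total)
```

Running the same two lines on the full `named_entities` dict would give the actual ranking of entity types.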

------------------------------

# Extracting the subject (the entity that performs the action) or an object (an entity to which the action applies)


----------

Suppose your named entity of interest is a multi-word expression **The New York Times** 

And the full sentence is **“The New York Times wrote about Apple”**. 

Here, The New York Times as a whole is the subject: it is the entity that performed the action of writing.

### So the problem is: how can you extract the whole expression "The New York Times"?

----------

### Overall Solution Approach

1. First, identify the indexes of the words covered by the expression in the sentence: for "The New York Times" these are [0, 1, 2, 3].

2. Next, check whether the word at any of these indexes plays the role of the subject or an object in the sentence.

3. In this sentence, the word that is the subject ("Times") has the index 3, which falls inside that span.

4. Therefore, you can return the whole named entity "The New York Times" as the subject of the main action in the sentence.
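The steps above can be sketched without spaCy, using hand-written tokens and dependency labels for the example sentence. The helper `find_span` and the variable names are hypothetical, introduced only for this illustration.

```python
# Hand-written (text, dependency label) pairs for the example sentence
tokens = [
    ("The", "det"), ("New", "compound"), ("York", "compound"), ("Times", "nsubj"),
    ("wrote", "ROOT"), ("about", "prep"), ("Apple", "pobj"),
]
entity_words = ["The", "New", "York", "Times"]


def find_span(tokens, entity_words):
    """Return the token indexes covered by the first occurrence of entity_words."""
    texts = [text for text, _ in tokens]
    width = len(entity_words)
    for start in range(len(texts) - width + 1):
        if texts[start:start + width] == entity_words:
            return list(range(start, start + width))
    return []


# Step 1: indexes covered by the multi-word entity
span = find_span(tokens, entity_words)
# Steps 2 and 3: does any subject token fall inside that span?
subject_indexes = [i for i, (_, dep) in enumerate(tokens) if dep == "nsubj"]
entity_is_subject = any(i in span for i in subject_indexes)
# Step 4: report the whole entity, not just the head word "Times"
print(span, subject_indexes, entity_is_subject)
```

Only the head word "Times" carries the `nsubj` label, which is why the check is made against the whole span instead of comparing the subject's text with the entity's text.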



In [16]:
def calculate_entity_span(document, entity):
    """
    Calculate the span of a named entity in a document.

    Args:
        document (spacy.tokens.Doc): The document to search for the entity.
        entity (str): The entity to find in the document.

    Returns:
        list: A list of indices representing the span of the entity in the document.
    """
    # Initialize a list to hold the indices
    indexes = []
    
    # Iterate over the entities in the document
    for ent in document.ents:
        # If the entity text matches the entity we're looking for...
        if ent.text == entity:
            # Iterate over the range from the entity's start to its end
            for i in range(int(ent.start), int(ent.end)):
                # Append the index to the list
                indexes.append(i)
                
    # Return the list of indices
    return indexes


In [17]:
entity = "The New York Times"

sentences = "The New York Times wrote about Apple"

doc = nlp(sentences)

calculate_entity_span(doc, entity)

[0, 1, 2, 3]

In [18]:
sentences = ["The New York Times wrote about Apple"]

for sentence in sentences:
    doc = nlp(sentence)
    for token in doc:
        print(token.dep_)

det
compound
compound
nsubj
ROOT
prep
pobj


In [24]:
def calc_entity_subject_object(document, entity, indexes):
    """
    Analyze a document to find actions related to a specific entity.

    Args:
        document (spacy.tokens.Doc): The document to analyze.
        entity (str): The entity to find actions for.
        indexes (list): A list of indices in the document where the entity appears.

    Returns:
        None: Prints out the sentence and all actions involving the entity.
    """
    actions = []
    action = ''
    participant1 = ''
    participant2 = ''
    
    for token in document:
        # Next, you identify the main verb expressing the main action in the sentence
        # To extract the relation, we have to find the ROOT of the sentence (which is also the verb of the sentence)
        if token.pos_ == "VERB" and token.dep_ == 'ROOT':
            # Initialize the indexes for the subject and the object related to the main verb
            subj_ind = -1
            obj_ind = -1
            # Store the main verb itself (token.text) in the action variable
            action = token.text
            children = [child for child in token.children ]
            for child1 in children:
                # Find the subject via the nsubj relation and store it as participant1 
                # and its index as subj_ind
                if child1.dep_ == 'nsubj':
                    participant1 = child1.text
                    subj_ind = int(child1.i)
                # If there is a preposition attached to the verb (e.g., “write about”), then
                # you need to search for the object of that preposition as the second participant. 
                if child1.dep_ == 'prep':
                    participant2 = ''
                    child1_children = [child for child in child1.children]
                    for child2 in child1_children:
                        # If such an object is a noun or a proper noun, 
                        # you store it as participant2 and its index as obj_ind
                        if child2.pos_ == 'NOUN' or child2.pos_ == 'PROPN':
                            participant2 = child2.text
                            obj_ind = int(child2.i)
                    
                    # If at this point both participants of the main action have been identified and
                    # their indexes are included in the indexes of the words covered by the entity, 
                    # you add the action with two participants to the list of actions.
                    if not participant2=="":
                        if subj_ind in indexes:
                            actions.append(entity + " " + action + " " + child1.text + " " + participant2)
                        elif obj_ind in indexes:
                            actions.append(participant1 + " " + action + " " + child1.text + " " + entity)
                            
                # Otherwise, if there is no preposition attached to the verb,
                # participant2 is a direct object of the main verb, 
                # which can be identified via the dobj relation
                if child1.dep_ == 'dobj' and (child1.pos_ == 'NOUN' or child1.pos_ == 'PROPN' ):
                    participant2 = child1.text
                    obj_ind = int(child1.i)
                    # In this case, you apply the same strategy as above, 
                    # adding the action with two participants to the list of actions. 
                    if subj_ind in indexes:
                        actions.append(entity + " " + action + " " + participant2)
                    elif obj_ind in indexes:
                        actions.append(participant1 + " " + action + " " + entity)
    # Finally if the final list of actions is not empty, 
    # Print out the sentence and all actions together with the participants.
    if not len(actions) == 0:
        print(f"\nSentence = {document}")
        for item in actions:
            print(item)            

In [25]:
for sentence in sentences:
    doc = nlp(sentence)
    indexes = calculate_entity_span(doc, entity)
    calc_entity_subject_object(doc, entity, indexes)


Sentence = The New York Times wrote about Apple
The New York Times wrote about Apple


In [26]:
def return_docs_of_given_ent_type(processed_docs, entity, ent_type):
    """
    Extracts sentences from a list of processed documents that contain a given named entity of a specific type.

    Args:
        processed_docs (list): A list of processed documents. Each document is a spacy.tokens.Doc object.
        entity (str): The named entity to search for.
        ent_type (str): The type of the named entity (e.g. 'ORG', 'PERSON', 'GPE')

    Returns:
        output_sentences (list): A list of sentences containing the named entity of the specified type.
    """
    output_sentences = []
    for doc in processed_docs:
        for sentence in doc.sents:
            # Only consider sentences that contain the input entity 
            # of the specified type among its named entities
            if entity in [ent.text for ent in sentence.ents if ent.label_ == ent_type ]:
                output_sentences.append(sentence)
    return output_sentences

entity = "Apple"

ent_sentences = return_docs_of_given_ent_type(processed_docs, entity, 'ORG' )
print(ent_sentences)        
        

[American tech giants like Google, Apple and Facebook are on a collision course with European regulators over issues including privacy and taxes., Nearly a year ago, I argued that we were witnessing a new era in the tech business, one that is typified less by the storied   in a garage than by a posse I like to call the Frightful Five: Amazon, Apple, Facebook, Microsoft and Alphabet, Google’s parent company., The precise nature of the fights varies by company and region, including the tax and antitrust investigations of Apple and Google in Europe and Donald J. Trump’s broad and often incoherent criticism of the Five for various alleged misdeeds., Apple’s sales were flat last year, and after a monster 2016, Alphabet’s stock price hit a plateau., When Apple took on the Federal Bureau of Investigation last year over access to a terrorist’s iPhone, many in tech sided with the company, but a majority of Americans thought Apple should give in., He promised to force Apple to make iPhones in Am

In [27]:
for sentence in ent_sentences:
    indexes = calculate_entity_span(sentence, entity)
    calc_entity_subject_object(sentence, entity, indexes)


Sentence = Apple, complying with what it said was a request from Chinese authorities, removed news apps created by The New York Times from its app store in China late last month.
Apple removed apps

Sentence = Apple removed both the   and   apps from the app store in China on Dec. 23.
Apple removed from store
Apple removed on Dec.

Sentence = Apple has previously removed other, less prominent media apps from its China store.
Apple removed apps
Apple removed from store


In [28]:
entity = "The New York Times"

ent_sentences = return_docs_of_given_ent_type(processed_docs, entity, "ORG")
print(len(ent_sentences))

for sentence in ent_sentences:
    indexes = calculate_entity_span(sentence, entity)
    calc_entity_subject_object(sentence, entity, indexes)

13

Sentence = During negotiations with Mr. Corzine last year, the commission also strengthened aspects of the deal after some of the agency’s commissioners questioned it, The New York Times reported at the time.
The New York Times reported at time


## Apply visualization with displaCy

In [29]:
from spacy import displacy

text = "Last week, Democratic lawmakers from both parties said they had the Senate votes needed to pass legislation that would prevent tech platforms, including Apple, GM and Facebook, from favoring their own businesses."

doc = nlp(text)

displacy.render(doc, style="ent")

## Visualize entity types in sentences containing the specified entity:

In [31]:
def visualize_given_ent_type(processed_docs, entity, ent_type):
    """
    Visualizes named entity annotations in sentences containing a given named entity of a specific type.

    Args:
        processed_docs (list): A list of processed documents. Each document is a spacy.tokens.Doc object.
        entity (str): The named entity to search for.
        ent_type (str): The type of the named entity (e.g. 'ORG', 'PERSON', 'GPE')

    Returns:
        None. Displays inline the named entity annotations for sentences containing the specified entity.
    """
    for doc in processed_docs:
        for sentence in doc.sents:
            if entity in [ent.text for ent in sentence.ents if ent.label_ == ent_type ]:
                displacy.render(sentence, style='ent' )

visualize_given_ent_type(processed_docs, 'Apple', 'ORG' )

## Now find sentences where the named entity appears alongside other entities of the same type, i.e. the sentence contains more than one entity of that type

In [32]:
def return_count_of_ent_type(sentence, ent_type):
    return len([ent.text for ent in sentence.ents if ent.label_ == ent_type ])

txt = "Last week, Democratic lawmakers from both parties said they had the Senate votes needed to pass legislation that would prevent tech platforms, including Apple, GM and Facebook, from favoring their own businesses."

doc = nlp(txt)

return_count_of_ent_type(doc, 'ORG')

4

In [35]:
def return_docs_of_given_ent_type_custom(processed_docs, entity, ent_type):
    """
    Returns sentences containing a given named entity of a specific type, 
    but only if there is more than one occurrence of the entity type in the sentence.

    Args:
        processed_docs (list): A list of processed documents. Each document is a spacy.tokens.Doc object.
        entity (str): The named entity to search for.
        ent_type (str): The type of the named entity (e.g. 'ORG', 'PERSON', 'GPE')

    Returns:
        output_sentences (list): A list of sentences that contain the specified entity 
                                 and have more than one occurrence of the specified entity type.
    """
    output_sentences = []
    for doc in processed_docs:
        for sentence in doc.sents:
            if entity in [ent.text for ent in sentence.ents if ent.label_ == ent_type and
                          return_count_of_ent_type(sentence, ent_type) > 1 ]:
                output_sentences.append(sentence)
    return output_sentences

output_sentences = return_docs_of_given_ent_type_custom(processed_docs, "Apple", "ORG")

print(len(output_sentences))

16
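The filtering rule used above can be checked on toy data. In this sketch each sentence is a hand-written list of `(text, label)` pairs rather than a spaCy span, and `keeps_sentence` is a hypothetical helper that mirrors the condition: the sentence must mention the entity with the given type and contain more than one entity of that type.

```python
# Each toy sentence is a list of (entity text, entity label) pairs
toy_sentences = [
    [("Apple", "ORG"), ("Google", "ORG")],        # Apple plus another ORG
    [("Apple", "ORG"), ("China", "GPE")],         # Apple is the only ORG
    [("Facebook", "ORG"), ("Microsoft", "ORG")],  # two ORGs, but no Apple
    [("Apple", "ORG"), ("Apple", "ORG")],         # Apple mentioned twice
]


def keeps_sentence(entities, entity, ent_type):
    """True if the entity occurs with ent_type and the type occurs more than once."""
    same_type = [text for text, label in entities if label == ent_type]
    return entity in same_type and len(same_type) > 1


kept = [
    index
    for index, entities in enumerate(toy_sentences)
    if keeps_sentence(entities, "Apple", "ORG")
]
print(kept)
```

Note that the last toy sentence is kept as well: the rule counts entity mentions of the type, so two mentions of the same entity satisfy it even when no other organization appears.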


In [47]:
def visualize_conditional_sentences(sentences):
    """
    This function visualizes the named entities in the given sentences using the displaCy visualizer from spaCy.
    It specifically highlights organizations (ORG) in the sentences.

    Parameters:
    sentences (list): A list of sentences. Each sentence is expected to be a spaCy Span or Doc object with named entities.

    Returns:
    None. The function directly renders the visualization in the Jupyter notebook or other environment.
    """
    colors = {"ORG": "linear-gradient(90deg, #64B5F6, #E0F7FA)"}
    options = {"ents": ["ORG"], "colors": colors}
    
    for sentence in sentences:
        displacy.render(sentence, style="ent", options=options)
        


visualize_conditional_sentences(output_sentences )
    