# Information Extraction using spaCy

With the huge amount of text data being generated every hour, extracting useful information from it by hand becomes a daunting task. Information extraction in natural language processing aims to develop algorithms that can automate the process of extracting key information from large numbers of documents.

Information extraction remains a non-trivial task and an active area of research because it requires a deep understanding of the meaning of the text. Example use cases include extracting details (such as location) about crises and incidents from social media, and extracting medications and their associated side effects from medical reports.

In this notebook, we will demonstrate how to build a simple information extraction framework based on rule-based association and dependency parsing. The method follows an approach originally used to extract information about armed conflicts, described [here](https://andrewhalterman.com/post/event-data-in-30-lines-of-python/).

Dependency parsing is the task of identifying the grammatical relationships between the words in a text. For example, in the sentence "Jane is making a salad.", "Jane" is the nominal subject of the verb "making" and "salad" is its direct object. By traversing these grammatical dependencies, we can trace the entities involved in a certain action. Therefore, if we can detect a trigger action/event, we can extract involved parties.
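
The traversal described above can be sketched without any parser by hand-annotating the example sentence. The tuples below are an illustrative assumption (a hand-written parse using common dependency labels), not actual parser output, and the helper names `children_of` and `find_parties` are hypothetical. Each token records its label and the index of its head.

```python
# Hand-annotated parse of "Jane is making a salad." (an assumption for
# illustration, not real parser output): (text, dependency label, head index).
TOKENS = [
    ("Jane", "nsubj", 2),
    ("is", "aux", 2),
    ("making", "ROOT", 2),
    ("a", "det", 4),
    ("salad", "dobj", 2),
    (".", "punct", 2),
]


def children_of(tokens, head_index):
    """Return (label, text) pairs for every token whose head is head_index."""
    return [
        (dep, text)
        for i, (text, dep, head) in enumerate(tokens)
        if head == head_index and i != head_index
    ]


def find_parties(tokens, trigger):
    """Locate the trigger word and read its subject and object off the tree."""
    for i, (text, _dep, _head) in enumerate(tokens):
        if text == trigger:
            roles = dict(children_of(tokens, i))
            return roles.get("nsubj"), roles.get("dobj")
    return None, None


subject, obj = find_parties(TOKENS, "making")
print(f"Subject: {subject}, Trigger: making, Object: {obj}")
```

Once the trigger is known, the involved parties are simply the children of the trigger that carry the subject and object labels.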

In this notebook, we will demonstrate this on the [JUSTICE](https://arxiv.org/abs/2112.03414) dataset, which provides a collection of documents detailing the results of court proceedings. We will try to extract information about the legal action taken (conviction, appeal, etc.) and the entities involved in or related to that action.

For dependency parsing, we will rely on the parser in the spaCy processing pipeline. We begin by downloading the pretrained spaCy model "en_core_web_lg" and loading it into a processing pipeline.

In [None]:
%%capture
#!pip uninstall -q neuralcoref -y
#!pip install -q neuralcoref --no-binary neuralcoref

#!pip uninstall -q spacy -y
#!pip install -q -U spacy==2.1.0

import spacy.cli
spacy.cli.download('en_core_web_lg')

In [None]:
import spacy

import en_core_web_lg
nlp = en_core_web_lg.load()

#import neuralcoref
#neuralcoref.add_to_pipe(nlp)
#print(nlp.pipe_names)

We can now try spaCy's parser on a slightly extended version of the example sentence we used above.

In [None]:
doc = nlp('Jane is making a Greek salad.')

for token in doc:
    print(f'{token.text}, {token.dep_}')

Calling the processing pipeline on a text returns a spaCy Doc object, which contains the tokenized text along with the other NLP attributes computed by the pipeline. Since we are interested in grammatical dependencies, we iterate over the tokens and print the dependency label of each one. We see some familiar grammatical tags such as nominal subject (nsubj), auxiliary (aux) and determiner (det).

However, this does not show the actual relationships between the tokens but only the grammatical functions. For instance, we see that "is" is an auxiliary but it is not explicitly stated that it is the auxiliary of "making". Luckily, spaCy has a useful visualization tool that allows us to see these dependencies.

In [None]:
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True)

If a dependency label is unfamiliar, we can ask spaCy for a short description using the following function.

In [None]:
spacy.explain('amod')

'adjectival modifier'

Another interesting property that can be visualized is the named entity tag of each token. We can similarly ask for an explanation of unclear entity tags.

In [None]:
displacy.render(doc, style='ent', jupyter=True)

In [None]:
spacy.explain('NORP')

## Integrating NeuralCoref

In [None]:
'''
def get_coref(doc):
    if doc._.has_coref:
        return nlp(doc._.coref_resolved)
    else:
        return doc

from spacy.tokens import Doc

Doc.set_extension('coref', getter=get_coref)
'''

## Dataset

Before we can understand how these dependencies can help us in our task, let's first load the dataset. We are only interested in the documents under the "facts" column. Also, we remove the \<p>\</p> tags indicating the beginning and end of a paragraph.

In [None]:
import pandas as pd

texts = pd.read_csv('https://raw.githubusercontent.com/smitp415/CSCI_544_Final_Project/main/clean_data.csv')
texts['facts'] = texts['facts'].apply(lambda x: x.replace('<p>', '').replace('</p>','').replace('\n',''))
texts = texts['facts'].values
print('\n**************\n'.join(texts[:5]))
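
Note that the chained `replace` calls above delete newlines outright, which can fuse the last word of one line with the first word of the next. A slightly more defensive variant is sketched below on a made-up string. The helper name `clean_facts` is hypothetical and is not part of the notebook's pipeline, and the results reported later were not produced with it.

```python
import re


# Hypothetical helper (not used elsewhere in this notebook): strip <p> tags
# and normalise whitespace so words on adjacent lines are not fused together.
def clean_facts(raw: str) -> str:
    without_tags = re.sub(r"</?p>", " ", raw)
    return re.sub(r"\s+", " ", without_tags).strip()


# Made-up sample text, used only to illustrate the behaviour.
sample = "<p>The court convicted\nthe defendant.</p>\n<p>He appealed.</p>"
print(clean_facts(sample))
```

If desired, it could be applied with `texts['facts'].apply(clean_facts)` in place of the lambda.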

Let us examine the dependencies of an example from the dataset. We select only the first sentence of the paragraph to keep the visualization small and focused on the part we want to highlight.

In [None]:
displacy.render(nlp(texts[9])[:13], style='dep', jupyter=True)

In the context of our application, this text tells us that a certain Johnny Wilson was convicted by the court. Following the reasoning described in our approach, "convicted" is the trigger token signaling the event of interest. If we follow the dependencies related to this token, we find "Wilson" as the object and "court" as the subject. This is important information because it tells us who carried out the action and who was affected by it.

Let us see how we can write code to extract this information.

In [None]:
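for token in nlp(texts[9])[:13]:
    #first we need to find our trigger word
    if token.text == 'convicted':
        #default to None so a missing subject or object does not raise a NameError
        subj, obj = None, None
        #iterate over all tokens related by dependency (children)
        for child in token.children:
            if child.dep_ == 'nsubj':
                subj = child
            elif child.dep_ == 'dobj':
                obj = child
        subj_text = subj.text if subj is not None else 'not found'
        obj_text = obj.text if obj is not None else 'not found'
        print(f'Subj: {subj_text}, Trigger: {token.text}, Obj: {obj_text}')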

As expected, the code above extracts the needed information. However, in this example, we knew exactly the trigger word we wanted. Luckily, all of our documents have a legal context. Therefore, we define a list of potential trigger words that are likely to appear in our dataset.

In [None]:
trigger_list = ['file', 'sue', 'appeal', 'reject', 'convict', 'refuse', 'deny',
                'rule', 'plead', 'affirm', 'charge', 'accuse', 'challenge',
                'sentence', 'prosecute']

When matching tokens against this list, we will match the word lemmas instead of the actual words. A lemma is the base (dictionary) form of a word, so matching on lemmas ensures that we do not miss a trigger because of a difference in word form (for example, "convicted" versus "convict"). Below is an example of lemmatization, which is part of spaCy's default processing pipeline.

In [None]:
print(nlp('pleaded')[0].lemma_)
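
The effect of matching on lemmas rather than surface forms can be illustrated with a toy example. The small lemma table below is hand-written for illustration and stands in for spaCy's lemmatizer; it is an assumption, not the model's actual output, and the function names are hypothetical.

```python
# Toy lemma table, hand-written for illustration only; in the notebook the
# lemmas come from spaCy's pipeline via token.lemma_.
TOY_LEMMAS = {
    "pleaded": "plead",
    "convicted": "convict",
    "appeals": "appeal",
    "sentenced": "sentence",
}

TRIGGERS = {"plead", "convict", "appeal", "sentence"}


def match_surface(words):
    """Match the raw word forms against the trigger vocabulary."""
    return [w for w in words if w in TRIGGERS]


def match_lemmas(words):
    """Match each word's lemma (falling back to the word itself)."""
    return [w for w in words if TOY_LEMMAS.get(w, w) in TRIGGERS]


words = "he pleaded guilty and was convicted".split()
print(match_surface(words))  # inflected forms are missed
print(match_lemmas(words))   # inflected forms are caught
```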

Next, we extend the code shown above to look for multiple events in a document. Instead of extracting each detected component as a single word, we extract its full subtree (the token together with all of its descendants in the dependency tree) to add more context to the extracted information. We also account for more grammatical dependencies, such as passive subjects occurring in passive clauses, which we report as the object of the action. We then visualize a number of samples from the dataset.
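
To make the notion of a token subtree concrete, the sketch below collects a token and all of its descendants from a hand-annotated parse and joins them in sentence order. The sentence, its parse and the helper names are assumptions made for illustration; in spaCy the equivalent information is exposed by `token.subtree`.

```python
# Hand-annotated parse of the made-up sentence
# "The court convicted Johnny Wilson of murder" (an illustrative assumption,
# not real parser output): (text, dependency label, head index).
TOKENS = [
    ("The", "det", 1),
    ("court", "nsubj", 2),
    ("convicted", "ROOT", 2),
    ("Johnny", "compound", 4),
    ("Wilson", "dobj", 2),
    ("of", "prep", 2),
    ("murder", "pobj", 5),
]


def subtree_indices(tokens, root):
    """Return the indices of root and all of its descendants, in order."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for i, (_text, _dep, head) in enumerate(tokens):
            if head in found and i not in found:
                found.add(i)
                changed = True
    return sorted(found)


def subtree_text(tokens, root):
    """Join the words of the subtree rooted at index root."""
    return " ".join(tokens[i][0] for i in subtree_indices(tokens, root))


print(subtree_text(TOKENS, 1))  # subject phrase
print(subtree_text(TOKENS, 4))  # object phrase
```

Extracting the subtree rather than the single head word is what turns "Wilson" into "Johnny Wilson" in the reported output.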

In [None]:
def detect_event(doc):
    #match word lemmas against trigger words
    events = []
    for token in doc:
        if token.lemma_ in trigger_list:
            events.append(token)
    return events
                    
def actor_extractor(event):
    #follow the trigger's dependency to find obj and subj tokens
    nsubj = None
    dobj = None
    for child in event.children:
        if child.dep_ in ['nsubj', 'csubj']:
            nsubj = child
        elif child.dep_ in ['dobj', 'nsubjpass', 'csubjpass']:
            dobj = child
    return nsubj, dobj

def extract_info(doc, use_coref=False, display_results=True):
    if use_coref:
        doc = doc._.coref
    if display_results:
        displacy.render(doc, style='ent', jupyter=True)
        print()
    events = detect_event(doc)
    #flag set to True once at least one complete event is found in the document
    full_event = False
    if events:
        for event in events:
            nsubj, dobj = actor_extractor(event)
            #check if all three components of the event were detected
            if nsubj and dobj:
                full_event=True
            if display_results:
                if nsubj:
                    print(f"Subject: {''.join(w.text_with_ws for w in nsubj.subtree).strip()}")
                else:
                    print('Subject: not found')
                print(f"Event: {event.text}")
                if dobj:
                    print(f"Object: {''.join(w.text_with_ws for w in dobj.subtree).strip()}")
                else:
                    print('Object: not found')
                print('***************************')
    else:
        if display_results:
            print('No event detected')
    if display_results:
        print('***************************\n'*3)
    else:
        return full_event

for doc in nlp.pipe(texts[5:10]):
    extract_info(doc)

Inspecting the results shows that in many cases, our algorithm was able to extract meaningful information from the text.

For instance, if we look at the sample at index 5, we find two complete events telling us that the case is about someone filing a second visa application after the court had refused his first application. This already delivers the information from a relatively long paragraph in a compact form.

To get a more quantitative assessment of our approach, we added a flag "full_event" to the code above. This flag is raised for a document whenever we detect at least one event with all three components (subject, trigger and object). We run the algorithm on a large number of samples. We tried multiple sample sizes (the nlp pipeline becomes slow on very large sets) and found that the share of documents with a complete event was consistently above 60%.

In [None]:
count = 0
data = texts[:500]
for doc in nlp.pipe(data):
    full_event = extract_info(doc, use_coref=False, display_results=False)
    if full_event:
        count += 1
print(f'Detected {count}/{len(data)} ({count/len(data)*100}%) complete events.')

Detected 309/500 (61.8%) complete events.
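
To spell out what this figure measures, the sketch below recomputes such a rate from hand-written per-document results. Each document is a list of (subject, object) pairs, one per detected event, with `None` marking a missing component. The data and the helper names are made up for illustration and are not taken from the dataset.

```python
def has_full_event(events):
    """True if at least one event has both a subject and an object."""
    return any(subj is not None and obj is not None for subj, obj in events)


def detection_rate(documents):
    """Fraction of documents containing at least one complete event."""
    if not documents:
        return 0.0
    return sum(has_full_event(events) for events in documents) / len(documents)


# Made-up results for four documents (not taken from the dataset).
documents = [
    [("court", "Wilson")],                    # one complete event
    [("he", None), (None, "application")],    # two partial events
    [],                                       # no trigger found
    [(None, "appeal"), ("state", "ruling")],  # one partial, one complete
]
print(f"{detection_rate(documents):.1%}")
```

Because a single complete event is enough to count a document, the score says nothing about how many other events in that document were missed or only partially recovered.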


## Detection rate with Coref

In [None]:
'''
count = 0
data = texts[:100]
for doc in nlp.pipe(data):
    full_event = extract_info(doc, use_coref=True, display_results=False)
    if full_event:
        count += 1
print(f'Detected {count}/{len(data)} ({count/len(data)*100}%) complete events.')
'''

## Weaknesses of this approach

The most glaring weakness of our algorithm is that it relies on pattern matching and depends largely on a manually defined trigger vocabulary and on grammatical dependency matching. Naturally, performance can be improved by making the trigger list more comprehensive, but this is not something that scales efficiently.

Additionally, there are cases where the dependency parser fails to detect some relationships, which results in missing event components. For instance, in the example below, "charged" is detected as an event but without "Jon Argersinger" as the person charged. This is due to the structure of the sentence: the first part, "Jon Argersinger was an indigent", forms a complete, meaningful clause, so "was" is not detected as the auxiliary of "charged" and no connection is made between "Jon" and "charged".

In the same example, "convicted" is detected with the passive subject "he", but "sentenced", which is joined to it by conjunction, is not assigned the same subject.

In [None]:
extract_info(nlp(texts[7]))

In [None]:
displacy.render(nlp(texts[7]), style='dep', jupyter=True)
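
One possible mitigation for the conjunction case described above, offered as an untested idea rather than as part of this notebook's method, is to let a verb attached by `conj` inherit the subject of the verb it is coordinated with. The sketch uses a hand-annotated parse of a simplified clause and hypothetical helper names. Note that `actor_extractor` reports passive subjects as the object, so the inherited token would fill that slot.

```python
# Hand-annotated parse of "he was convicted and sentenced" (an illustrative
# assumption, not real parser output): (text, dependency label, head index).
TOKENS = [
    ("he", "nsubjpass", 2),
    ("was", "auxpass", 2),
    ("convicted", "ROOT", 2),
    ("and", "cc", 2),
    ("sentenced", "conj", 2),
]

SUBJECT_LABELS = {"nsubj", "nsubjpass", "csubj", "csubjpass"}


def own_subject(tokens, verb):
    """Return the index of a subject attached directly to verb, if any."""
    for i, (_text, dep, head) in enumerate(tokens):
        if head == verb and i != verb and dep in SUBJECT_LABELS:
            return i
    return None


def inherited_subject(tokens, verb):
    """Walk up through 'conj' links until some verb has its own subject."""
    current = verb
    seen = set()
    while current not in seen:
        seen.add(current)
        subject = own_subject(tokens, current)
        if subject is not None:
            return subject
        _text, dep, head = tokens[current]
        if dep != "conj":
            break
        current = head
    return None


subject = inherited_subject(TOKENS, 4)
print(TOKENS[subject][0] if subject is not None else "not found")
```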