# Information Extraction

## (Named Entity Recognition, Relation and Event Extraction)

We are gradually moving along the simple NLP pipeline: from raw text to tokenised sentences to POS-tagged sentences. Now we can proceed to extract information from unstructured text.

<img src="https://www.nltk.org/images/ie-architecture.png" width="600">

Source: NLTK book (https://www.nltk.org/book/ch07.html)

## Named Entity Recognition (NER)

The goal of a **named entity recognition (NER)** system is to identify all textual mentions of named entities. This involves two sub-tasks:
- identifying the boundaries of the named entity (NE)
- identifying its type

Why? 
- classifying content for news providers (automatically scan an article and reveal what and who is discussed in it to help categorise it for easy content discovery)
- search engines 
- content recommendation systems
- customer support (e.g. to categorise a social media complaint using the product model name and redirect the question to the relevant department)

What constitutes a **named entity** type is task-specific:
- Common named entities: people, places, organizations
- Task-specific: gene or protein names, financial asset classes, commercial products, works of art

Some examples:

- **Person (PER)** (people, characters): "**Emerson Brookings**, a resident fellow"
- **Organisation (ORG)** (companies, sports teams): "a historical geographer at the **University of Cambridge**"
- **Location (LOC)** (regions, mountains, seas): "a storm front swept across the **Great Plains** of the United States"
- **Geo-Political Entity (GPE)** (countries, states, provinces): "much of the western **United States** is on the brink of a prolonged megadrought"
- **Facility (FAC)** (bridges, buildings, airports): "Security agents at **Pittsburgh International Airport** caught a man with a gun"
- **Vehicle (VEH)** (planes, trains, automobiles): "American Airlines plans customer tours of **Boeing 737 Max**"


Preparing spaCy:

In [None]:
!pip install spacy

import spacy
from spacy import displacy

!python -m spacy download en_core_web_sm #downloading the English model

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

doc_ner = nlp("In 2019, I went to Luxembourg and bought an apartment for 35 million euro near Red River, where the Obama family used to stay for summer in their first year as retirees")

 
for ent in doc_ner.ents:
    print(ent.text, ent.label_)


Named entity labels in spaCy: https://spacy.io/api/annotation#named-entities

### Boundary detection in NER (NER as a Sequence labeling task)

Named entity recognition is considered a sequence labeling task because a named entity can span more than one token, for instance: Jean-Claude Juncker (PER), Grand-Duché de Luxembourg (GPE).

In sequence labeling tasks, spans of text are identified as a unit.

IOB tagging is one way to represent the tags of the sequence: each token is tagged as the beginning (B) of an entity, inside (I) an entity, or outside (O) any entity.

![Picture title](img/image-20201026-102814.png)


For instance: 

Jean-Claude(B-PER) Juncker(I-PER) was(O) the(O) former(O) prime(O) minister(O) of(O) Grand-Duchy(B-GPE) of(I-GPE) Luxembourg(I-GPE).
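
To make the scheme concrete, here is a minimal sketch that turns entity spans into IOB tags for the sentence above. The helper `spans_to_iob` and the hand-written token indices are assumptions made for illustration; they are not part of spaCy or any other library.

```python
def spans_to_iob(tokens, spans):
    """Convert entity spans to IOB tags.

    tokens: list of token strings.
    spans:  list of (start, end, label) in token indices, `end` exclusive.
    """
    tags = ["O"] * len(tokens)           # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = "B-" + label       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label       # remaining tokens of the same entity
    return tags


tokens = ["Jean-Claude", "Juncker", "was", "the", "former", "prime", "minister",
          "of", "Grand-Duchy", "of", "Luxembourg", "."]
# Entity spans written by hand for this example
spans = [(0, 2, "PER"), (8, 11, "GPE")]

iob_tags = spans_to_iob(tokens, spans)
for token, tag in zip(tokens, iob_tags):
    print(f"{token}({tag})")
```

The sketch assumes that entity spans do not overlap, which is also what a flat IOB tagging can represent.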

The following code shows how to get the IOB tags of named entities using the spaCy library.

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

doc_ner = nlp("In 2019, I went to Luxembourg and bought an apartment for 35 million euro near Red River.")

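for token in doc_ner:
    if token.ent_iob_ != "O":
        # Token inside an entity: IOB prefix plus the first three letters of the entity type
        print(token.text, f"{token.ent_iob_}-{token.ent_type_[:3]}")
    else:
        # Token outside any entity
        print(token.text, token.ent_iob_)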

**1- Research and Reply** IOB is not the only sequence labeling scheme. Look for other sequence labeling schemes (at least two) and describe them using the above example.

### Write your answer here

We can visualise NEs with displaCy:

In [None]:
displacy.render(doc_ner, style="ent",jupyter=True)


### Named entity extraction in Stanza:


Prepare stanza:

In [None]:
!pip install stanza #install stanza

import stanza
stanza.download('en') # download English model

In [None]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("In 2019, I went to Luxembourg and bought an apartment for 35 million euro near Red River, where the Obama family used to stay for summer in their first year as retirees.")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')


### Type Ambiguity in NER

One complication in named entity recognition is that some entities are ambiguous in type.
For instance:
**JFK**

    1- Airport (Facility) 
    2- 35th President of the US (Person)


In [None]:
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

sentence_1 = "JFK served in both the U.S. House of Representatives and U.S. Senate before becoming the 35th president in 1961."
sentence_2 = "Flight KL-3949 from Abu-Dhabi arrived at JFK airport at 12:05 PM."
doc1=nlp(sentence_1)
doc2=nlp(sentence_2)
displacy.render(doc1, style="ent",jupyter=True)
displacy.render(doc2, style="ent",jupyter=True)

**2- CODE IT:** Think of a named entity that can belong to two or more different entity types in a language you know well. Give an example for each type, then use a library of your choice to extract the named entities from the examples and print them.


In [None]:
#insert your code here

### NER algorithm families: 

    1. Feature-based sequence labeling machine learning algorithms such as CRF
    2. Neural sequence labeling machine learning algorithms such as bi-LSTM
    3. Rule-based algorithms


**Feature-based sequence labeling machine learning algorithms** transform the data into features and, based on these features, decide on the label of each token according to a labeling scheme such as IOB.

The figure below, from your textbook, shows the features used: each token is represented by a feature giving its POS tag (IN, NNP, ...) and a feature indicating whether the token is capitalized or not (x, X).
![Named entity recognition as a sequence labeling task](https://pic4.zhimg.com/80/v2-de055cf1dc659adf7b8177c8bc92dac3_720w.jpg)




These features are then fed into a classifier such as a CRF, which is trained on labeled data to decide whether each token is the beginning of a named entity (for instance B-PER, the beginning of a PERSON entity), the continuation of one (for instance I-PER, the continuation of a PERSON entity), or does not belong to a named entity (O).
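
As an illustration of the "transform data into features" step, the sketch below builds one feature dictionary per token from a POS-tagged sentence. The function names (`word_shape`, `token_features`), the feature names and the hand-tagged toy sentence are assumptions chosen for this example; an actual CRF toolkit may expect a different input format.

```python
def word_shape(token):
    """Map a token to a coarse shape: 'X' = upper case, 'x' = lower case, 'd' = digit."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation such as '-' unchanged
    return "".join(shape)


def token_features(tagged_sentence, i):
    """Build a feature dictionary for the i-th token of a POS-tagged sentence."""
    word, pos = tagged_sentence[i]
    features = {
        "word.lower": word.lower(),
        "pos": pos,
        "shape": word_shape(word),
        "is_capitalized": word[:1].isupper(),
    }
    # Context features: POS tags of the neighbouring tokens, if there are any
    if i > 0:
        features["prev.pos"] = tagged_sentence[i - 1][1]
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tagged_sentence) - 1:
        features["next.pos"] = tagged_sentence[i + 1][1]
    else:
        features["EOS"] = True  # end of sentence
    return features


# Toy sentence; the POS tags are written by hand, not predicted by a tagger
sentence = [("Juncker", "NNP"), ("lives", "VBZ"), ("in", "IN"), ("Luxembourg", "NNP")]
feature_dicts = [token_features(sentence, i) for i in range(len(sentence))]
for (word, _), feats in zip(sentence, feature_dicts):
    print(word, feats)
```

A sequence classifier would be trained on such dictionaries paired with the gold IOB tag of each token.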




**If you fancy:** Watch [this](https://www.youtube.com/watch?v=wxyZTSc2tM0) video from Stanford University's NLP course, in which Chris Manning introduces machine learning algorithms for sequence labeling.

---


**Rule based Algorithms** use a set of rules to decide whether a span of text is a named entity or not.

**3- OBSERVE AND REFLECT**: Consider a language you know well. Using examples, define some rules (at least two, for different entity types such as LOCATION and FACILITY) that can help extract named entities.

For instance: if a capitalized word occurs after the word "Mr.", classify it as a PERSON.
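
The "Mr." rule above could be written as a regular expression, as in the following sketch. The pattern, the helper `find_persons` and the test sentence are illustrative assumptions; the character class `[A-Z][a-z]+` only covers unaccented Latin letters, so a real rule would need to be broader.

```python
import re

# Rule: a capitalized word directly after "Mr." is tagged as a PERSON.
MR_RULE = re.compile(r"\bMr\.\s+([A-Z][a-z]+)")


def find_persons(text):
    """Return (name, 'PERSON') for every capitalized word that follows 'Mr.'."""
    return [(match.group(1), "PERSON") for match in MR_RULE.finditer(text)]


# "smith" is lower-case and "Mrs." is a different title, so only one name matches
text = "Yesterday Mr. Juncker met Mr. smith and Mrs. Jones in Brussels."
persons = find_persons(text)
print(persons)
```

The example also shows how brittle a single rule is: "Jones" is missed because the rule knows only one title.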


### Write your answer here

The following code implements a virtual assistant for a travel agency.

To extract the travelers' names, the destination and the date of travel, it uses spaCy's NER functionality.

In [None]:

import en_core_web_sm

nlp = en_core_web_sm.load()

travelers = []
destination = ""
date = ""
missing = True  # True while any required piece of information is unknown

input_text = input("Hello, I am your travel agency virtual assistant. Please specify the date of travel, the destination and the names of the travelers (press q to quit, r to reset): ")
print(input_text)

while input_text != "q":
    if input_text == "r":
        # Reset: forget everything collected so far and ask for a new request
        travelers = []
        destination = ""
        date = ""
        missing = True
        input_text = input("Request reset. Enter a new request (press q to quit): ")
        print(input_text)
        continue

    if input_text == "Y":
        # The user wants to register the request
        if missing:
            print("Your request was not registered due to missing information")
        else:
            print("Your travel request has been registered in the system:\nDestination: " + destination + "\nDate: " + date + "\nTravelers' names: " + str(travelers))
        break

    # Otherwise treat the input as free text and extract the entities from it
    doc_ner = nlp(input_text)
    for ent in doc_ner.ents:
        print(ent.label_, ent.text)
        if ent.label_ == "PERSON":
            travelers.append(ent.text)
        elif ent.label_ in ("LOC", "GPE"):
            destination = ent.text
        elif ent.label_ == "DATE":
            date = ent.text

    # Check which pieces of information are still missing
    missing = False
    if not date:
        print("You didn't specify the date of your travel.")
        missing = True
    if not destination:
        print("You didn't specify the destination for your travel.")
        missing = True
    if not travelers:
        print("You didn't specify the travelers' names.")
        missing = True

    if not missing:
        print("You are planning to travel to " + destination + " on " + date + ". Travelers' names are: " + str(travelers))
        input_text = input("Press Y to confirm (press q to quit, r to reset): ")
    else:
        input_text = input("Modify request (press q to quit, r to reset): ")
    print(input_text)

            


**4- CODE IT** Modify the code so that the virtual assistant also registers:

1. the maximum amount of money the traveler is willing to spend 
2. the number of travelers

---

## Relation Extraction

We have now identified the named entities. Wouldn't it be nice to also find the relations between them?

One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types and α is the string of words that intervenes between X and Y. We can then use regular expressions to pull out just those instances of α that express the relation that we are looking for.

The relations are often binary, such as child-of, employment, part-whole, and geospatial relations. Relation extraction has close links to populating a relational database.
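
The (X, α, Y) idea can be sketched in plain Python as follows. The helper `candidate_triples`, the tokenised sentence and the hand-written entity spans are assumptions for illustration; in practice the spans would come from an NER system, and only adjacent entity pairs are considered here.

```python
import re


def candidate_triples(tokens, entities, x_type, y_type):
    """Return (X, alpha, Y) for adjacent entity pairs of the required types.

    entities: list of (start, end, label) in token indices, `end` exclusive,
              sorted by position in the sentence.
    """
    triples = []
    for (s1, e1, l1), (s2, e2, l2) in zip(entities, entities[1:]):
        if l1 == x_type and l2 == y_type:
            x = " ".join(tokens[s1:e1])
            alpha = " ".join(tokens[e1:s2])  # the words between X and Y
            y = " ".join(tokens[s2:e2])
            triples.append((x, alpha, y))
    return triples


tokens = ["The", "University", "of", "Cambridge", "is", "located", "in",
          "England", "and", "Luxembourg", "borders", "Belgium", "."]
# Entity spans written by hand for this example
entities = [(1, 4, "ORG"), (7, 8, "GPE"), (9, 10, "GPE"), (11, 12, "GPE")]

# Keep only the ORG-GPE triples whose intervening words contain "in"
IN = re.compile(r"\bin\b")
org_in_place = [t for t in candidate_triples(tokens, entities, "ORG", "GPE")
                if IN.search(t[1])]
gpe_pairs = candidate_triples(tokens, entities, "GPE", "GPE")

print(org_in_place)
print(gpe_pairs)
```

Note that the GPE pairs include (England, and, Luxembourg), which expresses no relation at all; this is why the filtering step on α is needed.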

Watch [this video](https://www.youtube.com/watch?v=gTFMULX7vU0) from the Stanford Natural Language Processing video series by Dan Jurafsky for an introduction to relation extraction.

One application of relation extraction is **question answering**. Question answering systems retrieve answers to queries from **knowledge bases**.

Example: 

**Question:** Which Luxembourgish athlete won a gold medal at the 1952 Summer Olympics?

Relations: (X, is, Luxembourgish), (X, won, gold medal) and (X, participated-in, 1952 Summer Olympics)


**Answer:** X is Josy Barthel

![Josy Barthel](https://2.bp.blogspot.com/-Co3IuoMvwlc/V_vz1YVlSSI/AAAAAAAAAzA/NlMLF6S9WJE2LLcQtK0KhkBgExD2XdBUwCLcB/s1600/448px-Josy-Victoire1.gif)
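
A toy version of this lookup is sketched below: the knowledge base is a set of triples, and the answer is the subject that satisfies all three relations. The data structure, the helper `subjects_matching` and the second athlete (added only as a distractor) are assumptions for illustration, not how a production question answering system stores its facts.

```python
# Toy knowledge base of (subject, predicate, object) triples
knowledge_base = {
    ("Josy Barthel", "is", "Luxembourgish"),
    ("Josy Barthel", "won", "gold medal"),
    ("Josy Barthel", "participated-in", "1952 Summer Olympics"),
    ("Emil Zátopek", "is", "Czechoslovak"),
    ("Emil Zátopek", "won", "gold medal"),
    ("Emil Zátopek", "participated-in", "1952 Summer Olympics"),
}


def subjects_matching(kb, predicate, obj):
    """Return every subject X for which (X, predicate, obj) is in the knowledge base."""
    return {s for (s, p, o) in kb if p == predicate and o == obj}


# The question, rewritten as the three relations from the text
query = [
    ("is", "Luxembourgish"),
    ("won", "gold medal"),
    ("participated-in", "1952 Summer Olympics"),
]

# X must satisfy all relations, so intersect the candidate sets
candidates = None
for predicate, obj in query:
    matches = subjects_matching(knowledge_base, predicate, obj)
    candidates = matches if candidates is None else candidates & matches

print(candidates)
```

As in the text, the three relations are independent of each other; the sketch does not check that the gold medal was won at those particular games.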


The following example searches for strings that contain the word *in*.

The special regular expression `(?!\b.+ing)` is a negative lookahead assertion that allows us to disregard strings such as "success **in** supervising the transition of", where **in** is followed by a gerund.



In [None]:
!pip install nltk
import nltk

In [None]:
nltk.download('ieer')
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern = IN):
        print(nltk.sem.rtuple(rel))
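
To see the effect of the negative lookahead in isolation, the same pattern can be tried on a few filler strings without NLTK. The first and third strings below are made up for illustration.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    ", a company based in",                      # "in" is the last word: accepted
    "success in supervising the transition of",  # "in" followed by a gerund: rejected
    "offices in Washington",                     # rejected too, see the note below
]

results = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, matched in results.items():
    print(f"{matched!s:5}  {filler!r}")
```

The third case shows a limitation of the pattern: the lookahead rejects any string in which the letters "ing" appear anywhere after *in*, so "Washington" is treated like a gerund.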

### Relation extraction algorithms:

    1. handwritten patterns
    2. supervised machine learning
    3. semi-supervised (via bootstrapping and via distant supervision)
    4. unsupervised

**If you fancy:** Watch [this lecture](https://www.youtube.com/watch?v=pO3Jsr31s_Q&list=PLoROMvodv4rObpMCir6rNNUlFAn56Js20&index=7) from Stanford University's Natural Language Understanding course, in which Bill MacCartney explains different relation extraction algorithms.

### Representing Relations:
### RDF (Resource Description Framework)

RDF is a meta-language for representing relations between entities as triples of (subject, predicate, object).


<img src="https://cdn1.marklogic.com/wp-content/uploads/2015/11/1-sop-1.png" width ="600">
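
As a rough sketch of what such triples can look like when written out, the code below stores two relations as named tuples and prints them in an N-Triples-like form, where each part is an IRI in angle brackets and the statement ends with a dot. The namespace `http://example.org/` and the helper names are hypothetical, chosen only for this illustration; a real project would rather use a dedicated RDF library and an agreed vocabulary.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical namespace, used only for illustration
BASE = "http://example.org/"


def to_iri(name):
    """Turn a plain name into an IRI under the illustrative namespace."""
    return "<" + BASE + name.replace(" ", "_") + ">"


def to_ntriples_line(triple):
    """Write one triple in an N-Triples-like form: subject predicate object ."""
    return f"{to_iri(triple.subject)} {to_iri(triple.predicate)} {to_iri(triple.object)} ."


triples = [
    Triple("Josy Barthel", "won", "gold medal"),
    Triple("Josy Barthel", "participated-in", "1952 Summer Olympics"),
]

lines = [to_ntriples_line(t) for t in triples]
print("\n".join(lines))
```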



Hyponymy and Hypernymy Relations in Linguistics:



<img src="https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-02934-0_20/MediaObjects/475331_1_En_20_Fig2_HTML.png" width ="600">

The following example applies a handwritten pattern to identify hyponymy:

If a text span follows the pattern "X **such as** Y and/or Z", then Y and Z are hyponyms of X.

For example: I like tropical fruits **such as** banana and mango. The rule then identifies "banana" and "mango" as hyponyms of "fruit".
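
A minimal regex sketch of the **such as** rule is shown below. It is an assumption-laden simplification: X, Y and Z are each a single word, at most two hyponyms are captured, and the hypernym is returned as it appears in the text ("fruits"), without lemmatisation. The helper name `find_hyponyms` is made up for this example.

```python
import re

# X such as Y [and|or Z], with single-word X, Y and Z
SUCH_AS = re.compile(r"(\w+)\s+such as\s+(\w+)(?:\s+(?:and|or)\s+(\w+))?")


def find_hyponyms(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for hyponym in match.groups()[1:]:
            if hyponym:  # the second hyponym (Z) is optional
                pairs.append((hyponym, hypernym))
    return pairs


text = "I like tropical fruits such as banana and mango."
pairs = find_hyponyms(text)
print(pairs)
```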



**5- OBSERVE and REFLECT**

Give an example of a pattern for detecting hypernymy in English other than the **such as** pattern.

### Write your answer here

**6- CODE IT** Write a piece of code that uses a regex to identify the pattern for the hypernymy rule you defined in the exercise above and outputs an RDF tuple <Subject,Predicate,Object>

For instance: <Fruit,Hypernym,Banana>

In [None]:
# insert your code here

## Event Extraction

Event extraction is the task of identifying an event that occurs at a particular point in time or over a time interval.

In English, events are usually (but not always) expressed with verbs.

For example: World War II **started** on 1 September 1939


The start of the war is an event which happened at a specific time.



One application of event extraction is detecting the temporal ordering of events, for instance whether two events happen simultaneously or one after another.
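
Once each event is anchored to a date, temporal ordering reduces to comparing dates. The sketch below assumes that every sentence contains a date written as "1 September 1939", finds it with a regex and sorts the events; the second sentence and the helper `extract_dated_event` are added for illustration only. Real event extraction also has to identify the event trigger and handle relative expressions such as "two days later".

```python
import re
from datetime import datetime

# Dates of the form "1 September 1939"
DATE = re.compile(r"\b(\d{1,2} [A-Z][a-z]+ \d{4})\b")


def extract_dated_event(sentence):
    """Return (sentence, date) if the sentence contains a recognisable date, else None."""
    match = DATE.search(sentence)
    if match is None:
        return None
    # %B parses the full English month name under the default locale
    when = datetime.strptime(match.group(1), "%d %B %Y").date()
    return (sentence, when)


sentences = [
    "World War II ended on 2 September 1945.",
    "World War II started on 1 September 1939.",
]

events = [event for event in map(extract_dated_event, sentences) if event is not None]
events.sort(key=lambda event: event[1])  # order the events by their date

for sentence, when in events:
    print(when.isoformat(), sentence)
```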

---

**7- Homework Exercise** Acronym expansion is the task of associating an acronym with the phrase it stands for, such as **JFK** with **John F. Kennedy**.

Search for some common **Three Letter Acronyms (TLAs)** and their full phrases in a language you know.

Then write a piece of code that extracts from a text the three-letter acronyms that serve as named entities and replaces them with the full phrase.

Use any library introduced in this session for extracting NEs.

Apply your code to a piece of text longer than 500 words that includes the TLAs you considered.


In [None]:
# insert your code here

---

## References:
1. https://www.nltk.org/book/ch07.html
2. https://spacy.io/
3. https://web.stanford.edu/~jurafsky/slp3/