# Exercise Sheet 8 - Information Extraction - Solution

## Learning Objectives 

In this lab we are going to:

* Review preprocessing techniques for Information Extraction.
* Learn how to extract useful information from text.
* Play around with a corpus to extract Named-Entities.



In [None]:
# setting the stage, as usual with colab ;)
import nltk
nltk.download('all')

## 1. Preprocessing ##

To process a document, we will follow five main steps:

- Sentence segmentation.
- Tokenization. 
- Tag words with their part-of-speech tags. 
- Identify interesting chunks and entities. 
- Identify relations between different entities in the text.

The first three steps perform the linguistic preprocessing that the higher-level steps (chunking, entity recognition and relation extraction) build on.


This lab is partially based on the Information Extraction chapter of the [NLTK book](http://www.nltk.org/book/).


In [2]:
document = 'The fourth Wells account moving to another agency is the packaged paper-products division of \
Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, \
it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, \
which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel \
Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.'


## Preprocessing

# Step 1: Sentence segmentation.
sentences = nltk.sent_tokenize(document)

# Step 2: Tokenize sentences into words.
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Step 3: POS tagging.
tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]


# print the first sentence
tagged_sentences[0]

[('The', 'DT'),
 ('fourth', 'JJ'),
 ('Wells', 'NNP'),
 ('account', 'NN'),
 ('moving', 'VBG'),
 ('to', 'TO'),
 ('another', 'DT'),
 ('agency', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('packaged', 'VBN'),
 ('paper-products', 'NNS'),
 ('division', 'NN'),
 ('of', 'IN'),
 ('Georgia-Pacific', 'NNP'),
 ('Corp.', 'NNP'),
 (',', ','),
 ('which', 'WDT'),
 ('arrived', 'VBD'),
 ('at', 'IN'),
 ('Wells', 'NNP'),
 ('only', 'RB'),
 ('last', 'JJ'),
 ('fall', 'NN'),
 ('.', '.')]

## 2. Named Entity Recognition ##

Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. 

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function `nltk.ne_chunk()`. 
If we set the parameter `binary=True`, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE. 

In [3]:
sentence = "I will meet John Smith to visit Oracle headquarters."

tokens = nltk.word_tokenize(sentence)   # tokenization
pos_tags = nltk.pos_tag(tokens)         # pos-tagging

# named entity chunking
print(nltk.ne_chunk(pos_tags, binary=True))

(S
  I/PRP
  will/MD
  meet/VB
  (NE John/NNP Smith/NNP)
  to/TO
  visit/VB
  (NE Oracle/NNP)
  headquarters/NNS
  ./.)


In [4]:
tree = nltk.ne_chunk(pos_tags, binary=True)

# find named entities
named_entities = []

for subtree in tree.subtrees():
  if subtree.label() == 'NE':
    entity = ""
    for leaf in subtree.leaves():
      entity = entity + leaf[0] + " "
    named_entities.append(entity.strip())
             
print(named_entities)

['John Smith', 'Oracle']
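
With `binary=False`, the chunks carry category labels instead of the generic `NE` label, so the loop above has to collect every labelled subtree rather than only `NE` ones. The following is a minimal sketch of that adaptation. It uses a hand-built `Tree` so that it runs without the trained classifier; the `PERSON` and `ORGANIZATION` labels in it are illustrative assumptions, not actual `nltk.ne_chunk()` output, and `extract_labelled_entities` is a hypothetical helper name.

```python
from nltk.tree import Tree

# Hand-built stand-in for nltk.ne_chunk(pos_tags, binary=False).
# The chunk labels are assumed for illustration, not produced by the classifier.
example_tree = Tree('S', [
    ('I', 'PRP'), ('will', 'MD'), ('meet', 'VB'),
    Tree('PERSON', [('John', 'NNP'), ('Smith', 'NNP')]),
    ('to', 'TO'), ('visit', 'VB'),
    Tree('ORGANIZATION', [('Oracle', 'NNP')]),
    ('headquarters', 'NNS'), ('.', '.'),
])

def extract_labelled_entities(tree):
    """Return (label, entity text) pairs for every chunk below the root 'S' node."""
    entities = []
    for subtree in tree.subtrees(filter=lambda t: t.label() != 'S'):
        words = [word for word, tag in subtree.leaves()]
        entities.append((subtree.label(), ' '.join(words)))
    return entities

labelled_entities = extract_labelled_entities(example_tree)
print(labelled_entities)
```

Keeping the label next to the entity text is what later makes it possible to ask for specific pairs of types, such as organisations and locations.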


### Exercise 1 ###

Extract all the named entities from the first 20 sentences of the Brown Corpus that are in the category '<i>news</i>'.

In [5]:
# your code goes here

def extract_entity_names(tree):
  named_entities = []
  for subtree in tree.subtrees():
    if subtree.label() == 'NE':
      entity = ""
      for leaf in subtree.leaves():
        entity = entity + leaf[0] + " "
      named_entities.append(entity.strip())
  return named_entities

sentences = nltk.corpus.brown.sents(categories=['news'])[:20]

tagged_sentences = nltk.pos_tag_sents(sentences)
chunks = nltk.ne_chunk_sents(tagged_sentences, binary=True)

for chunk in chunks:
  print(extract_entity_names(chunk))

['Fulton County Grand Jury']
['City Executive Committee', 'Atlanta']
['Fulton Superior Court']
[]
[]
['Fulton']
['Atlanta', 'Fulton County']
['Merger']
[]
['City Purchasing Department']
[]
[]
[]
[]
['Fulton County', 'Fulton County']
[]
['Fulton County']
['Fulton']
['Fulton']
[]


## 3. Relation Extraction ##
We now want to extract the relations that exist between the specific types of named entities. One way of approaching this task is to use regular expressions to look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y.  
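
To make the (X, α, Y) idea concrete before turning to NLTK's own helper, here is a minimal pure-Python sketch. It assumes a toy, hand-annotated sentence in which entities are `(type, text)` tuples and ordinary words are plain strings; the representation and the function name `extract_triples` are hypothetical and are not how NLTK stores its data.

```python
import re

# Toy input: tuples are (type, text) entities, plain strings are ordinary words.
chunks = [('ORG', 'BBDO South'), 'in', ('LOC', 'Atlanta'), ',', 'which', 'handles',
          'advertising', 'for', ('ORG', 'Georgia-Pacific')]

def extract_triples(chunks, subj_type, obj_type, pattern):
    """Collect (X, alpha, Y) for adjacent entity pairs whose filler matches pattern."""
    triples = []
    entity_positions = [i for i, c in enumerate(chunks) if isinstance(c, tuple)]
    # Pair each entity with the next entity in the sentence.
    for left, right in zip(entity_positions, entity_positions[1:]):
        (x_type, x_text), (y_type, y_text) = chunks[left], chunks[right]
        filler = ' '.join(chunks[left + 1:right])  # alpha: the words in between
        if x_type == subj_type and y_type == obj_type and pattern.match(filler):
            triples.append((x_text, filler, y_text))
    return triples

IN_PATTERN = re.compile(r'.*\bin\b')
triples = extract_triples(chunks, 'ORG', 'LOC', IN_PATTERN)
print(triples)
```

Only the first pair survives: the second pair has the types in the wrong order (LOC followed by ORG), so it is skipped before the pattern is even tried.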

In [6]:
import re, nltk

# Search for strings that contain the word "in".

# \b matches the empty string, but only at the beginning or end of a word (b = boundary).
# (?!...) is a negative lookahead assertion: it matches only if ... does NOT match next.
# Here it disregards strings such as "success in supervising", where "in" is followed by a gerund.

IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')

# Using the documents from the IEER corpus - New York Times, 15 March 1998.
# (see details here: http://www.nltk.org/_modules/nltk/corpus/reader/ieer.html)
docs = nltk.corpus.ieer.parsed_docs('NYT_19980315')

for doc in docs:
  for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print(nltk.sem.relextract.rtuple(rel))
    print(nltk.sem.clause(rel, relsym = "IN"))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
IN('whyy', 'philadelphia')
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
IN('mcglashan_&_sarrail', 'san_mateo')
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
IN('freedom_forum', 'arlington')
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
IN('brookings_institution', 'washington')
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
IN('idealab', 'los_angeles')
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
IN('open_text', 'waterloo')
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
IN('wgbh', 'boston')
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
IN('bastille_opera', 'paris')
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
IN('omnicom', 'new_york')
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
IN('ddb_needham', 'new_york')
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
IN('kaplan_thaler_group', 'new_york')
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
IN('bbdo_south', 'atlanta')
[ORG: 'G
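
The effect of the negative lookahead in `IN` is easier to see on isolated filler strings. The sketch below applies the same pattern to a few hand-written fillers; the list `fillers` is made up for illustration and is not taken from the corpus.

```python
import re

# Same pattern as in the cell above.
IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')

# Hypothetical filler strings, i.e. the text between two entities.
fillers = ['in', ', based in', 'success in supervising', 'in the building', 'of']

accepted = [filler for filler in fillers if IN.match(filler)]
rejected = [filler for filler in fillers if not IN.match(filler)]

print('accepted:', accepted)
print('rejected:', rejected)
```

Note that `'in the building'` is rejected as well: the lookahead fires whenever any word ending in "ing" appears after "in", not only when the gerund follows immediately. This is a reasonable trade-off for a demonstration pattern, but it can cost recall on real text.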

### Exercise 2 ###

Extract places of birth of people from the IEER corpus, using the 'X born in Y' pattern, where X is a person and Y is a location.

In [7]:
# your code goes here
# example of the output 
# [PER: 'McCarthy'] 'was born in' [LOC: 'Belle Plaine']
# Birthplace('mccarthy', 'belle_plaine')

from nltk.corpus import ieer

BORN_IN = re.compile(r'.*\bborn in\b')

for fileId in ieer.fileids():
  for doc in nltk.corpus.ieer.parsed_docs(fileId):
    for rel in nltk.sem.relextract.extract_rels('PER', 'LOC', doc, corpus='ieer', pattern=BORN_IN):
      print(nltk.sem.relextract.rtuple(rel))
      print (nltk.sem.clause(rel, relsym = "Birthplace"))

[PER: 'McCarthy'] 'was born in' [LOC: 'Belle Plaine']
Birthplace('mccarthy', 'belle_plaine')


In [8]:
# Another example
sentences = ["Barack Hussein Obama II (born August 4, 1961) is an American politician and attorney.",
             "Obama was born in Honolulu, Hawaii.",
             "After graduating from Columbia University in 1983, he worked as a community organizer in Chicago."]

BORN_IN = re.compile(r'.*\bborn\b .*')

for sent in sentences:
  print(re.findall(BORN_IN, sent))

['Barack Hussein Obama II (born August 4, 1961) is an American politician and attorney.']
['Obama was born in Honolulu, Hawaii.']
[]
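
The pattern in the cell above only tests whether a sentence mentions "born". As a possible next step, named groups can pull out the person and the place directly. This is a minimal sketch that assumes the simple surface form "X was born in Y."; the name `BORN_IN_PLACE` is hypothetical, and the pattern will miss other phrasings.

```python
import re

sentences = ["Barack Hussein Obama II (born August 4, 1961) is an American politician and attorney.",
             "Obama was born in Honolulu, Hawaii.",
             "After graduating from Columbia University in 1983, he worked as a community organizer in Chicago."]

# Lazy person group, the literal " was born in ", then the place up to the next period.
BORN_IN_PLACE = re.compile(r'^(?P<person>.+?) was born in (?P<place>[^.]+)\.')

birthplaces = []
for sent in sentences:
    match = BORN_IN_PLACE.match(sent)
    if match:
        birthplaces.append((match.group('person'), match.group('place')))

print(birthplaces)
```

The first sentence is skipped because "(born August 4, 1961)" gives a date rather than a place, which shows why relation patterns usually need to be more specific than a single keyword.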


### Exercise 3 ###

Extract people and their role in an organisation by using the 'X ROLE at the Y' or 'X ROLE of the Y' patterns, where X is a person and Y is an organisation. 

In [9]:
# your code goes here
from nltk.corpus import ieer

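# Raw string, so that \s reaches the regex engine unchanged and Python does not warn about an invalid escape.
ROLES = re.compile(r',.*(\sat\sthe?|\sof\sthe?)')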

for file in ieer.fileids():
  for doc in nltk.corpus.ieer.parsed_docs(file):
    for r in nltk.sem.relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
      print (nltk.sem.relextract.clause(r, relsym="ROLES"))

ROLES('kivutha_kibwana', 'national_convention_assembly')
ROLES('boban_boskovic', 'plastika')
ROLES('robert_mergess', 'berkeley_center_for_law_and_technology')
ROLES('jack_balkin', 'yale')
ROLES('david_post', 'cyberspace_law_institute')
ROLES('william_gale', 'brookings_institution')
ROLES('joel_slemrod', 'university_of_michigan')
ROLES('kaufman', 'tv_books_llc')
ROLES('sherry_lansing', 'paramount_motion_picture_group')
ROLES('rick_yorn', 'addis-wechsler_&_associates')
ROLES('ken_kaess', 'ddb_needham')
ROLES('norio_ohga', 'sony_corporation')
ROLES('raymond_rosen', 'robert_wood_johnson_medical_school')
ROLES('pepper_schwartz', 'university_of_washington')
ROLES('irwin_goldstein', 'boston_university_school_of_medicine')
ROLES('jennifer_berman', 'university_of_maryland')
ROLES('anthony_chan', 'banc_one_investment_advisors_corp')
ROLES('kevin_ashby', 'the_sun_advocate')
ROLES('paul_volcker', 'us_federal_reserve')
ROLES('israel_singer', 'world_jewish_congress')
ROLES('katherine_abraham', 'bure


## 4. Inter-Annotator agreement ##

As discussed in the lecture, Inter-Annotator Agreement (IAA) between two annotators can be calculated using Cohen's Kappa $(\kappa)$ as follows:

$$\kappa = \frac{P_o - P_e}{1-P_e}$$

where $P_o$ is the observed agreement (the proportion of items on which both annotators chose the same label) and $P_e$ is the agreement expected by chance.
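
The formula can be computed by hand in a few lines. The sketch below is a minimal from-scratch version, assuming two equally long label sequences and estimating $P_e$ from each annotator's own label frequencies; the function name `cohen_kappa` and the toy label lists are made up for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b):
        raise ValueError('both annotators must label the same number of items')
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: for each label, the product of the two annotators' label proportions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    # Undefined (division by zero) if p_e == 1, i.e. both annotators used a single identical label.
    return (p_o - p_e) / (1 - p_e)

# Toy annotations for ten tokens (not the exercise data).
toy_a = ['O', 'O', 'B_ORG', 'I_ORG', 'O', 'B_LOC', 'O', 'O', 'B_PER', 'O']
toy_b = ['O', 'B_ORG', 'B_ORG', 'I_ORG', 'O', 'B_LOC', 'O', 'O', 'O', 'O']

kappa = cohen_kappa(toy_a, toy_b)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 tokens ($P_o = 0.8$) and chance agreement is $P_e = 0.4$, so $\kappa = (0.8 - 0.4) / (1 - 0.4) \approx 0.667$. The large share of `O` tokens inflates $P_e$, which is why raw agreement alone overstates how well the annotators agree on the entities.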

### Exercise 4.1 ###

For the given document:

1. Annotate the given document with named entities using the IOB tagging scheme (annotate all of the named entities: PERson, LOCation, ORGanisation, TIME).
2. Calculate the Inter-Annotator Agreement with one other student.
3. Interpret the obtained value according to the Landis and Koch scale.
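
As a reminder of how the IOB scheme in step 1 works: `B_X` marks the first token of an entity of type X, `I_X` marks a token that continues it, and `O` marks a token outside any entity. The sketch below groups IOB tags back into entity spans. It assumes the underscore tag format used on this sheet (`B_ORG`, `I_ORG`), and `iob_to_spans` is a hypothetical helper name.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            spans.append((current_type, ' '.join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition('_')
        if prefix == 'B' or (prefix == 'I' and entity_type != current_type):
            # A B_ tag, or an I_ tag with no matching open entity, starts a new span.
            flush()
            current_type, current_tokens = entity_type, [token]
        elif prefix == 'I':
            current_tokens.append(token)
        else:
            # 'O' closes any open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans

tokens = ['BBDO', 'South', 'in', 'Atlanta', ',', 'said', 'Ken', 'Haldin']
tags = ['B_ORG', 'I_ORG', 'O', 'B_LOC', 'O', 'O', 'B_PER', 'I_PER']

spans = iob_to_spans(tokens, tags)
print(spans)
```

Because every token receives exactly one tag, two annotations of the same document can be compared token by token, which is what the agreement calculation in step 2 relies on.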


**Document 1**: *The fourth Wells account moving to another agency is the packaged paper-products division of 
Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, 
it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, 
which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel
Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.*


Annotator 1:

**O**: The **O**: fourth **O**: Wells **O**: account **O**: moving **O**: to **O**: another **O**: agency **O**: is **O**: the **O**: packaged **O**: paper-products **O**: division **O**: of **B_ORG**: Georgia-Pacific **I_ORG**: Corp. **O**: , **O**: which **O**: arrived **O**: at **O**: Wells **O**: only **O**: last **TIME**: fall **O**: . **O**: Like **O**: Hertz **O**: and **O**: the **O**: History **O**: Channel **O**: , **O**: it **O**: is **O**: also **O**: leaving **O**: for **O**: an **B_ORG**: Omnicom-owned **O**: agency **O**: , **O**: the **B_ORG**: BBDO **I_ORG**: South **O**: unit **O**: of **B_ORG**: BBDO **I_ORG**: Worldwide **O**: . **B_ORG**: BBDO **I_ORG**: South **O**: in **B_LOC**: Atlanta **O**: , **O**: which **O**: handles **O**: corporate **O**: advertising **O**: for **B_ORG**: Georgia-Pacific **O**: , **O**: will **O**: assume **O**: additional **O**: duties **O**: for **O**: brands **O**: like **B_ORG**: Angel **I_ORG**: Soft **O**: toilet **O**: tissue **O**: and **B_ORG**: Sparkle **O**: paper **O**: towels **O**: , **O**: said **B-PER**: Ken **I-PER**: Haldin **O**: , **O**: a **O**: spokesman **O**: for **B_ORG**: Georgia-Pacific **O**: in **B_LOC**: Atlanta **O**: .


Annotator 2:

**O**: The **O**: fourth **B_ORG**: Wells **O**: account **O**: moving **O**: to **O**: another **O**: agency **O**: is **O**: the **O**: packaged **O**: paper-products **O**: division **O**: of **B_ORG**: Georgia-Pacific **I_ORG**: Corp. **O**: , **O**: which **O**: arrived **O**: at **B_ORG**: Wells **O**: only **O**: last **O**: fall **O**: . **O**: Like **B_ORG**: Hertz **O**: and **O**: the **B_ORG**: History **I_ORG**: Channel **O**: , **O**: it **O**: is **O**: also **O**: leaving **O**: for **O**: an **B_ORG**: Omnicom-owned **O**: agency **O**: , **O**: the **B_ORG**: BBDO **I_ORG**: South **I_ORG**: unit **O**: of **B_ORG**: BBDO **I_ORG**: Worldwide **O**: . **B_ORG**: BBDO **I_ORG**: South **O**: in **B_LOC**: Atlanta **O**: , **O**: which **O**: handles **O**: corporate **O**: advertising **O**: for **B_ORG**: Georgia-Pacific **O**: , **O**: will **O**: assume **O**: additional **O**: duties **O**: for **O**: brands **O**: like **B_ORG**: Angel **I_ORG**: Soft **O**: toilet **O**: tissue **O**: and **B_ORG**: Sparkle **O**: paper **O**: towels **O**: , **O**: said **B_PER**: Ken **I_PER**: Haldin **O**: , **O**: a **O**: spokesman **O**: for **B_ORG**: Georgia-Pacific **O**: in **B_LOC**: Atlanta **O**: .

In [10]:
sentence = "The fourth Wells account moving to another agency is the packaged paper-products division of \
Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, \
it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, \
which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel \
Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta."

tokens = nltk.word_tokenize(sentence)

In [11]:
# Download annotations.
!curl https://pastebin.com/raw/Ka19SRJp > annotation-1.tsv   # Omnia's annotation
!curl https://pastebin.com/raw/yZ5Ker55 > annotation-2.tsv   # Nivranshu's annotation



In [12]:
# Load annotations.
with open('annotation-1.tsv') as f:
  annotation_1 = [line.strip().split('\t')[-1] for line in f.readlines()]
with open('annotation-2.tsv') as f:
  annotation_2 = [line.strip().split('\t')[-1] for line in f.readlines()]

In [13]:
# Calculate Cohen's Kappa.

# Using sklearn's implementation. See details on the link below.
# (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)

from sklearn.metrics import cohen_kappa_score

cohen_kappa_score(annotation_1, annotation_2)

0.7503152585119799

In [14]:
# nltk also has a function to calculate Cohen's Kappa but it requires 
# the annotations to be structured as a list containing (coder, item, label).
# More information here: https://www.nltk.org/_modules/nltk/metrics/agreement.html

from nltk.metrics.agreement import AnnotationTask

# create a single list of annotations in the format (coder, item, label).
annotations = [(1, idx, label) for (idx, label) in enumerate(annotation_1)] + \
              [(2, idx, label) for (idx, label) in enumerate(annotation_2)]

task = AnnotationTask(data=annotations)
task.kappa()

0.7503152585119798
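
To interpret the value, the Landis and Koch scale maps kappa ranges to verbal labels: below 0 is poor, 0.00 to 0.20 slight, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 1.00 almost perfect. A small lookup sketch (the function name `interpret_kappa` is hypothetical):

```python
def interpret_kappa(kappa):
    """Map a Cohen's kappa value to its Landis and Koch agreement label."""
    if kappa > 1:
        raise ValueError('kappa cannot exceed 1')
    if kappa < 0:
        return 'poor'
    # Upper bound of each band, in ascending order.
    bands = [(0.20, 'slight'), (0.40, 'fair'), (0.60, 'moderate'),
             (0.80, 'substantial'), (1.00, 'almost perfect')]
    for upper, label in bands:
        if kappa <= upper:
            return label

label = interpret_kappa(0.7503152585119799)
print(label)
```

For the two annotations above, a kappa of about 0.75 therefore falls in the "substantial" band.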

### Exercise 4.2

As discussed in the lecture, to improve the obtained Kappa value we can:
- adjust annotation guidelines for re-annotation
- discuss disagreements between annotators


In this exercise, design a set of annotation guidelines with other student(s) and annotate the document below as per those guidelines. Also, calculate IAA after annotation.

**Document 2**: *The maker of farm equipment said the three-year labor agreement with the International Association of Machinists and Aerospace Workers at John Deere Horicon Works, Deere's primary facility for producing lawn and grounds-care equipment, takes effect immediately and extends through Oct. 1 , 1992.*

In [15]:
# Left for the students

---
#### Resources for Regular Expressions:

1. [RegexOne - Learn Regular Expressions - Lesson 1: An Introduction, and the ABCs](https://regexone.com/)
1. [Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript](https://regex101.com/)
1. [re — Regular expression operations — Python 3.9.0 documentation](https://docs.python.org/3/library/re.html)