# Assignment 2: Information Extraction

In [1]:
import nltk
import re

# nltk.download('all')

## Task 1: Named Entity Annotation (10 Marks)

Using the IOB tagging scheme, annotate all of the named entities (PERson, LOCation, ORGanisation, TIME) in the following sentence:

*Wayne Rooney is a professional footballer from England who last played for Major League Soccer club D.C. United and will join Derby County in January 2020.*

Edit this cell and write your annotation below the line. (Note that no code is needed for this task; annotate the sentence manually.)

---
B_PER:Wayne I_PER:Rooney O:is O:a O:professional O:footballer O:from B_LOC:England O:who O:last O:played O:for B_ORG:Major I_ORG:League I_ORG:Soccer O:club B_ORG:D.C. I_ORG:United O:and O:will O:join B_ORG:Derby I_ORG:County O:in B_TIME:January I_TIME:2020


---
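
Although this task is annotated by hand, an IOB sequence can be sanity-checked by collapsing it back into entity spans. The sketch below assumes the `TAG:token` layout used in the annotation above; `iob_to_spans` is a hypothetical helper written for illustration, not part of NLTK.

```python
def iob_to_spans(annotation):
    """Collapse a whitespace-separated 'TAG:token' string into (label, text) spans."""
    spans = []
    current_label, current_tokens = None, []
    for item in annotation.split():
        tag, token = item.split(":", 1)
        if tag.startswith("B_"):
            # A B_ tag always opens a new span, closing any span still open.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I_") and current_label == tag[2:]:
            # An I_ tag continues the open span of the same type.
            current_tokens.append(token)
        else:
            # "O" (or an I_ tag with no matching open span) closes any open span.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


example = "B_PER:Wayne I_PER:Rooney O:is O:from B_LOC:England"
print(iob_to_spans(example))
# [('PER', 'Wayne Rooney'), ('LOC', 'England')]
```

Two adjacent entities of the same type stay separate only if the second one starts with its own `B_` tag, which is why the choice between `B_` and `I_` matters in the annotation.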

### For subsequent tasks in this assignment, you will work with the documents in `football_players.txt` to perform various information extraction tasks.

In [3]:
# Download the text file (uncomment the line below in this cell, if not already downloaded from Blackboard)
!curl "https://ideone.com/plain/OvwDXZ" > football_players.txt


Read all the documents from `football_players.txt` into a list called `docs`.

In [4]:
docs = []
# your code goes here
# Use a context manager so the file handle is closed after reading.
with open('football_players.txt', 'r') as f:
    docs = f.read().split('\n')  # one document per line

## Task 2 (10 Marks)
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

Please keep in mind that the expected output is a list within a list: one inner list of `(token, tag)` tuples per sentence.


Hint: For this task you need to perform three steps:
1. Sentence Segmentation
1. Word Tokenization
1. Part-of-Speech Tagging

In [2]:
def ie_preprocess(document):
    # your code goes here

    # Step 1: Sentence segmentation.
    sentences = nltk.sent_tokenize(document)

    # Step 2: Tokenize sentences into words.
    tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

    # Step 3: POS tagging.
    tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]
    return tagged_sentences

Run the cell below to verify your result for the second sentence in the first document.
Expected output: 
`[('He', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('forward', 'NN'), ('and', 'CC'), ('serves', 'NNS'), ('as', 'IN'), ('captain', 'NN'), ('for', 'IN'), ('Portugal', 'NNP'), ('.', '.')]`

In [5]:
first_doc = docs[0]
tagged_sentences = ie_preprocess(first_doc)
print(tagged_sentences[1])

[('He', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('forward', 'NN'), ('and', 'CC'), ('serves', 'NNS'), ('as', 'IN'), ('captain', 'NN'), ('for', 'IN'), ('Portugal', 'NNP'), ('.', '.')]


## Task 3 (20 Marks)
Write a function that takes a list of tokens with POS tags for a sentence and returns a list of named entities (NE). 

Hint: Use `binary = True` while calling NE chunk function

In [571]:
def find_named_entities(sent):
    named_entities = []
    # your code goes here
    # binary=True labels every chunk simply as 'NE' rather than by entity type.
    tree = nltk.ne_chunk(sent, binary=True)
    for subtree in tree.subtrees():
        if subtree.label() == 'NE':
            # Join the tokens of the chunk, dropping their POS tags.
            entity = " ".join(token for token, _ in subtree.leaves())
            named_entities.append(entity)
    return named_entities

Run the cell below to verify your result for the first sentence in the first document.
Expected output: `['Cristiano Ronaldo', 'Santos Aveiro', 'ComM', 'GOIH', 'Portuguese', 'Portuguese', 'Spanish', 'Real Madrid', 'Portugal']`

In [647]:
tagged_sentences = ie_preprocess(docs[0])

find_named_entities(tagged_sentences[0])

['Cristiano Ronaldo',
 'Santos Aveiro',
 'ComM',
 'GOIH',
 'Portuguese',
 'Portuguese',
 'Spanish',
 'Real Madrid',
 'Portugal']

## Task 4 (5 Marks)

Implement the `find_all_named_entities` function below to find **all** NEs in a given document.

Hint: Use `find_named_entities` implemented above for this task.

In [39]:
import itertools

def find_all_named_entities(doc):
    named_entities = []
    # your code goes here
    tagged_sentences = ie_preprocess(doc)
    # One list of entities per sentence.
    per_sentence = [find_named_entities(sent) for sent in tagged_sentences]
    named_entities = list(itertools.chain.from_iterable(per_sentence))
    return named_entities   # return a flat list and not a list of lists

How many named entities did you find in the first document?

In [575]:
# your code goes here
len(find_all_named_entities(docs[0]))

56

## Task 5 (5 Marks)

Find named entities across **all** documents in `football_players.txt`, and save the result into a single flat list.
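
A flat list can be built either by extending one accumulator per document or by chaining the per-document lists together. The sketch below uses hand-written per-document results (names taken from this notebook) in place of real `find_all_named_entities` output, so the variable `per_doc_entities` is purely illustrative.

```python
import itertools

# Hypothetical per-document results, standing in for find_all_named_entities(doc).
per_doc_entities = [
    ["Cristiano Ronaldo", "Real Madrid", "Portugal"],
    [],  # a document without entities contributes nothing
    ["Wayne Rooney", "England"],
]

# Option 1: extend a single accumulator, one document at a time.
flat_extend = []
for entities in per_doc_entities:
    flat_extend.extend(entities)

# Option 2: chain the inner lists together in one step.
flat_chain = list(itertools.chain.from_iterable(per_doc_entities))

print(flat_chain)
print(len(flat_chain))
```

Both options give the same flat list; using `append` instead of `extend` in the first one would produce a list of lists.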

In [576]:
all_named_entities = []
# your code goes here
for doc in docs:
    # extend() keeps the result flat instead of nesting one list per document.
    all_named_entities.extend(find_all_named_entities(doc))

How many named entities did you find across all documents?

In [577]:
# your code goes here
len(all_named_entities)

380

## Task 6 (40 Marks)

Write functions to extract the name of the player, country of origin and date of birth as well as the following relations: team(s) of the player and position(s) of the player.

Hint: Use the `re.compile()` function to create the extraction patterns.

Reference: https://docs.python.org/3/howto/regex.html
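
As a minimal sketch of the hint, the pattern below is compiled once and can then be reused for every document; named groups make the captured pieces easy to read back. The sample sentence and the name `BORN_PATTERN` are illustrative assumptions and are not taken from `football_players.txt`.

```python
import re

# Compile once, reuse for every document.
# Matches text such as "born 5 February 1985" and names each part of the date.
BORN_PATTERN = re.compile(
    r"\bborn\s+(?P<day>\d{1,2})\s+(?P<month>[A-Z][a-z]+)\s+(?P<year>\d{4})"
)

sample = "Jane Doe (born 5 February 1985) is a forward."  # made-up sentence

match = BORN_PATTERN.search(sample)
if match:
    print(match.group("day"), match.group("month"), match.group("year"))
else:
    print("no date of birth found")
```

Checking for `None` before calling `.group()` avoids an `AttributeError` on documents where the pattern does not match.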

In [9]:
import re

"""
Isolate the text up to the first 'is a' / 'is an' with a regex, then use a chunk
grammar to pick the first noun phrase in that text as the player's name.
"""
def name_of_the_player(doc):
    # your code goes here
    pattern = r'.*?(\bis a| is an \b).+?'
    match = re.search(pattern, doc)
    text = match.group(0)
    grammar = r"""
    NBAR:
        {<NN.*>*<``|''|NNP|FW>*<NN.*|NNP.*>}  # starting with noun only, having NNP or FW in the middle and ending with either NN or NNP only.
      """
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(ie_preprocess(text)[0])
    # subtrees() yields the whole sentence first, so index 1 is the first NBAR chunk.
    chunks = [subtree.leaves() for subtree in tree.subtrees()]
    # Every token is prefixed with a space, so the result keeps a leading space.
    return "".join(" " + token for token, _ in chunks[1])

"""
Isolate the text up to the first 'national team' with a regex. The country is
assumed to be the word immediately before that phrase.
"""

def country_of_origin(doc):
    # your code goes here
    pattern = r'.*?(\bnational team\b).+?'
    match = re.search(pattern, doc)
    text = match.group(0)
    words = text.strip().split(" ")
    # The match ends with "<country> national team", so the country is third from the end.
    return words[-3]


"""
Isolate the text from 'born' onwards with a regex, then use a chunk grammar to
pick the first day/month/year sequence in that text.
"""

def date_of_birth(doc):
    # your code goes here
    pattern = r'.?(\bborn \b).*\.'
    match = re.search(pattern, doc)
    text = match.group(0)
    grammar = r"""
    NBAR:
        {<CD><NNP><CD>}  # CD for Day, NNP for month and CD for Year
   """
    chunker = nltk.RegexpParser(grammar)
    tagged_sentence = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged_sentence)
    # subtrees() yields the whole sentence first, so index 1 is the first NBAR chunk.
    chunks = [subtree.leaves() for subtree in tree.subtrees()]
    # Join day, month and year with hyphens.
    return "-".join(token for token, _ in chunks[1])


"""
Isolate the text from the first 'for' onwards with a regex, then use a chunk
grammar to pick the phrase naming the team(s) and split it into individual teams.
"""
def team_of_the_player(doc):
    # your code goes here
    pattern = r'.?(\bfor \b).*\.'
    match = re.search(pattern, doc)
    text = match.group(0)
    grammar = r"""
    NBAR:
        {<JJ|NNP><NN|NNP|CC|DT|JJ|,>*<NN|NNP>}  # starting JJ or NNP, after that it could be NN or NNP or CC or DT or JJ or a comma, and ending with either NN or NNP
   """
    chunker = nltk.RegexpParser(grammar)
    tagged_sentence = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged_sentence)
    # subtrees() yields the whole sentence first, so index 1 is the first NBAR chunk.
    chunks = [subtree.leaves() for subtree in tree.subtrees()]
    team = " ".join(token for token, _ in chunks[1])
    # Split the phrase into separate teams on "and" or a comma.
    return [part.strip() for part in re.split(r' and |,', team)]

"""
As there is a finite number of positions in football, the document is searched for a fixed list of position names.
After finding a position, a chunk grammar extracts the position phrase from the matched text,
and the phrase is split into individual positions.
"""

def position_of_the_player(doc):
    # your code goes here
    pattern = r'.?(\b winger.*|central midfielder.*|attacking midfielder.*|forward.*|striker.*|right winger.*|left back.*\b).*\.'
    match = re.search(pattern, doc)
    text = match.group(0)
    grammar = r"""
    NBAR:
        {<RB|VBG|NN|JJ>*<NN|CC|RB>*<RB|NN>}  # starting with RB or VBG or NN or JJ and multiple occurrences and ending with maybe RB or NN
       """
    chunker = nltk.RegexpParser(grammar)
    tagged_sentence = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged_sentence)
    # subtrees() yields the whole sentence first, so index 1 is the first NBAR chunk.
    chunks = [subtree.leaves() for subtree in tree.subtrees()]
    position = " ".join(token for token, _ in chunks[1])
    # Split the phrase into separate positions on "or" / "and".
    return re.split(r' or | and ', position)


print(name_of_the_player(docs[0]))
print(country_of_origin(docs[0]))
print(team_of_the_player(docs[0]))
print(date_of_birth(docs[0]))
print(position_of_the_player(docs[0]))

 Cristiano Ronaldo dos Santos Aveiro
Portugal
['Spanish club Real Madrid', 'the Portugal national team']
5-February-1985
['forward']


Execute the cell below to verify the `date_of_birth` function for the third player. Expected output `5 February 1992`


In [631]:
date_of_birth(docs[2])

'5-February-1992'
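
The value returned above is hyphen-separated, while the expected output uses spaces. One possible way to reconcile the two, sketched here with the standard library only, is to parse the string and re-format it. `normalise_date` is a hypothetical helper, and the `%d-%B-%Y` layout is an assumption based on the output shown; `%B` also assumes English month names, which is the default locale behaviour.

```python
from datetime import datetime


def normalise_date(hyphenated):
    """Turn '5-February-1992' into '5 February 1992', validating it on the way."""
    # strptime raises ValueError if the string is not a real day-month-year date.
    parsed = datetime.strptime(hyphenated, "%d-%B-%Y")
    # Build the string by hand to avoid platform-specific flags for an unpadded day.
    return f"{parsed.day} {parsed.strftime('%B')} {parsed.year}"


print(normalise_date("5-February-1992"))
# 5 February 1992
```

A plain `str.replace('-', ' ')` would also work for this input; parsing has the side benefit of rejecting chunks that are not valid dates.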

## Task 7 (10 Marks)
Identify one other relation (besides team and position) and write a function to extract it.

In [650]:
# your code goes here
""" 
For this task I am looking for all the awards the player has won throughout the document.
Every stretch of text that starts with 'won' is collected with a regex, and a chunk grammar then
picks out the awards, which mostly start with an adjective (JJ).
The result is printed with the name of the player for quick understanding. A player might not have
won any awards; in that case the function returns None.
"""

def awards_of_the_player(doc):
    # your code goes here
    pattern = re.compile(r'(?:\bwon)[\w+\s+,-]+[A-Z]+[^\.]+')
    matches = pattern.findall(doc)
    grammar = r"""
    NBAR:
        {<JJ|NNP|CD>.*<.*>*}  # starting with JJ or NNP or CD with multiple occurrences and could end with anything.
       """
    chunker = nltk.RegexpParser(grammar)
    # POS-tag each match and merge the results into one token list.
    tagged_sent = []
    for sub_text in matches:
        tagged_sent.extend(nltk.pos_tag(nltk.word_tokenize(sub_text)))
    if not tagged_sent:
        return None  # no 'won' passage was found in the document
    tree = chunker.parse(tagged_sent)
    # subtrees() yields the whole sentence first, so index 1 is the first NBAR chunk.
    chunks = [subtree.leaves() for subtree in tree.subtrees()]
    award = " ".join(token for token, _ in chunks[1] if token != 'won')
    # Split the phrase into separate awards on "and" or a comma.
    return [part.strip() for part in re.split(r' and |,', award)]

print(name_of_the_player(docs[0]), " - ", awards_of_the_player(docs[0]))


 Cristiano Ronaldo dos Santos Aveiro  -  ["first Ballon d'Or", "FIFA World Player of the Year awards the FIFA Ballon d'Or in 2013", '2014 one La Liga title', 'two Copas del Rey', 'two Champions League titles', '', 'a Club World Cup']
