# Pattern Based Relation Extraction

---
## Information Extraction and Relation Extraction
The zip archive contains 100 files: 50 are plaintext documents and the other 50 contain data structured as JSON.
Each plaintext document contains a text description of a movie taken from the English version of Wikipedia, while each JSON document contains *gold-standard* labels (also called *reference* labels) stored as key-value pairs for the entities and relations for each document.

---

Download and extract `movies.zip` from Blackboard and place the result in the same location as this notebook, or uncomment the code cell below to fetch the data into a directory called `movies` that is created automatically next to this notebook.
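
If the archive has been downloaded manually, it can also be unpacked from Python. The following is a minimal sketch, assuming `movies.zip` sits next to the notebook and already contains a top-level `movies/` directory; the helper name `extract_archive` is hypothetical and not part of the assignment.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path="movies.zip", target_dir="."):
    """Unpack zip_path into target_dir and return the sorted member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


# Only unpack when the archive is actually present (assumed file name).
if Path("movies.zip").exists():
    members = extract_archive()
    print(f"extracted {len(members)} entries")
```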

---

## Reading Data

Place the unzipped `movies` directory in the same location as this notebook and run the following code cell to read the plaintext and JSON documents.

In [2]:
import os
import json

documents = []   # store the text documents as a list of strings
labels = []      # store the gold-standard labels as a list of dictionaries

for idx in range(50):
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.doc.txt'), encoding = "utf8") as f:
    doc = f.read().strip()
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.info.json'), encoding = "utf8") as f:
    label = json.load(f)

  documents.append(doc)
  labels.append(label)

assert len(documents) == 50
assert len(labels) == 50

In [3]:
# Load the libraries which might be useful

import re
import nltk
#nltk.download('all', quiet=True)

---

## Task 1: Document Pre-processing

In [1]:
def ie_preprocess(document):
  tagged_sentences = []
  # your code goes here
  # tokenize the document into sentences, then tokenize each sentence into words and POS-tag it
  for sentence in nltk.sent_tokenize(document):
    tagged_sentences.append(nltk.pos_tag(nltk.word_tokenize(sentence)))
  return tagged_sentences

In [5]:
# check output for Task 1
ie_preprocess(documents[0])[-10]

[('It', 'PRP'),
 ('received', 'VBD'),
 ('ten', 'JJ'),
 ('Oscar', 'NNP'),
 ('nominations', 'NNS'),
 ('(', '('),
 ('including', 'VBG'),
 ('Best', 'NNP'),
 ('Picture', 'NN'),
 (')', ')'),
 (',', ','),
 ('winning', 'VBG'),
 ('seven', 'CD'),
 ('.', '.')]

## Task 2: Named Entity Recognition

In [6]:
def find_named_entities(tagged_document):
  named_entities = []
  for sent in tagged_document:
    for chunk in nltk.ne_chunk(sent, binary = True):
      # named entities are subtrees labelled "NE"; plain tokens are (word, tag) tuples
      if isinstance(chunk, nltk.tree.Tree) and chunk.label() == "NE":
        named_entities.append(' '.join(i[0] for i in chunk))
  return named_entities

In [7]:
# check output for Task 2
tagged_document = ie_preprocess(documents[0]) # pre-process the first document
find_named_entities(tagged_document)[:10]     # display the first 10 named entities

['Star Wars',
 'Star Wars',
 'New Hope',
 'American',
 'George Lucas',
 'Lucasfilm',
 'Century Fox',
 'Mark Hamill',
 'Harrison Ford',
 'Carrie Fisher']

## Task 3: Information / Relation Extraction (I)

In [8]:
# shared helper: extract NE-NE relations whose filler matches the given pattern (e.g. *directed* or *produced*)
def get_rel(document, pattern):
  relations = []
  tagged_document = ie_preprocess(document)
  # Define X, Y, and \alpha
  subjclass = 'NE'
  objclass = 'NE'
  # Filter relevant relations by matching the regexp pattern.
  relfilter = lambda x: (x['subjclass'] == subjclass and
                           pattern.match(x['filler']) and
                           x['objclass'] == objclass)
  for sent in tagged_document:
    chunked_sent = nltk.ne_chunk(sent, binary = True)
    #print(chunked_sent)

    # Group a chunk structure into a list of 'semi-relations'.
    pairs = nltk.sem.relextract.tree2semi_rel(chunked_sent)

    # Convert 'semi-relations' into a dictionary which stores information
    # about the subject and object NEs plus the filler between them.
    reldicts = nltk.sem.relextract.semi_rel2reldict(pairs + [[[]]])
    rels = list(filter(relfilter, reldicts))

    # accumulate the relations extracted from every sentence
    relations += rels
  return relations

In [9]:
# relation 1
def directed_by(document):
  pattern_1 = re.compile(r".*(directed|director).*")

  # fallback pattern for sentences where "directed" does not sit between two named entities,
  # e.g. "... movie made in 2015. Directed by Christopher Nolan"
  pattern_2 = re.compile(r'.*\b(directed|Directed)\b.*')

  # directors found with pattern 1: take the object NE of each relation and strip its POS tags
  directors = {"".join(re.split(r"/[A-Z]+", re.findall(r"\[NE: '(.+?)'\]", nltk.sem.relextract.rtuple(rel))[-1])) 
                    for rel in get_rel(document, pattern_1)}
  
  # extraction using pattern 2
  for sent in nltk.sent_tokenize(document):
    if len(re.findall(pattern_2, sent)) > 0: # checking if the pattern is present
      # finding the named entities 
      tagged_sent = nltk.pos_tag(nltk.word_tokenize(sent))
      chunked_sent = nltk.ne_chunk(tagged_sent, binary = True)
      flag = 0
      for chunk in chunked_sent:
        if flag == 0 and type(chunk) == tuple and chunk[0] in ["directed", "Directed"]:
          flag = 1
          continue
        # the first named entity after setting the flag is taken as the value (e.g. first name after "Directed by ..")
        if flag == 1 and type(chunk) == nltk.tree.Tree:
          directors.add(" ".join([i[0] for i in chunk.leaves()]))
          break
  # returning the list
  return list(directors)
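
The set comprehension above packs two regular-expression steps into one line: it pulls the last `[NE: '...']` group out of the relation string and then strips the `/TAG` suffixes. The sketch below walks through those steps on a hand-written string in the layout the expressions expect; the string and the filler examples are illustrative assumptions, not actual NLTK output, and `object_entity` is a hypothetical helper name.

```python
import re

# Hand-written example in the "[NE: 'subject'] 'filler' [NE: 'object']" layout
# that the notebook's regular expressions expect; not real NLTK output.
rtuple_text = "[NE: 'Star/NNP Wars/NNP'] 'was/VBD directed/VBN by/IN' [NE: 'George/NNP Lucas/NNP']"


def object_entity(text):
    """Return the last NE in the string with its /TAG suffixes removed."""
    tagged_entities = re.findall(r"\[NE: '(.+?)'\]", text)
    return "".join(re.split(r"/[A-Z]+", tagged_entities[-1]))


# The same pattern as pattern_1 in directed_by, applied to candidate fillers.
pattern_1 = re.compile(r".*(directed|director).*")
fillers = ["was/VBD directed/VBN by/IN", "stars/VBZ", "and/CC director/NN"]
matching = [f for f in fillers if pattern_1.match(f)]

print(object_entity(rtuple_text))  # George Lucas
print(matching)  # ['was/VBD directed/VBN by/IN', 'and/CC director/NN']
```

Because only the last named entity is kept, the subject of the relation (here the film title) is discarded and the object is treated as the director.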

In [10]:
# relation 2
def produced_by(document):
  pattern_1 = re.compile(r".*(produced|producer).*")

  # fallback pattern for sentences where "produced" does not sit between two named entities,
  # e.g. "... movie made in 2015. Produced by Christopher Nolan"
  pattern_2 = re.compile(r'.*\b(Produced|produced|producer|Producer)\b.*')
  # producers found with pattern 1
  producers = {"".join(re.split(r"/[A-Z]+", re.findall(r"\[NE: '(.+?)'\]", nltk.sem.relextract.rtuple(rel))[-1])) 
                    for rel in get_rel(document, pattern_1)}
  # extraction using pattern 2
  for sent in nltk.sent_tokenize(document):
    if len(re.findall(pattern_2, sent)) > 0:    # checking if the pattern is present
      # finding the named entities 
      tagged_sent = nltk.pos_tag(nltk.word_tokenize(sent))
      chunked_sent = nltk.ne_chunk(tagged_sent, binary = True)
      flag = 0
      for chunk in chunked_sent:
        if flag == 0 and type(chunk) == tuple and chunk[0] in ["Produced", "produced", "producer", "Producer"]:
          flag = 1
          continue
        # the first named entity after setting the flag is taken as the value (e.g. first name after "Produced by ..")
        if flag == 1 and type(chunk) == nltk.tree.Tree:
          producers.add(" ".join([i[0] for i in chunk.leaves()]))
          break
  # returning the list
  return list(producers)

In [11]:
# relation 3
def written_by(document):
  pattern_1 = re.compile(r".*(written|writer|writes).*")
  # fallback pattern for sentences where "written" does not sit between two named entities,
  # e.g. "... movie made in 2015. Written by Christopher Nolan"
  pattern_2 = re.compile(r'.*\b(written|Written|writer|Writer|writes|Writes)\b.*')
  # writers found with pattern 1
  writers = {"".join(re.split(r"/[A-Z]+", re.findall(r"\[NE: '(.+?)'\]", nltk.sem.relextract.rtuple(rel))[-1])) 
                    for rel in get_rel(document, pattern_1)}
  # extraction using pattern 2
  for sent in nltk.sent_tokenize(document):
    if len(re.findall(pattern_2, sent)) > 0:    # checking if the pattern is present
      # finding the named entities
      tagged_sent = nltk.pos_tag(nltk.word_tokenize(sent))
      chunked_sent = nltk.ne_chunk(tagged_sent, binary = True)
      flag = 0
      for chunk in chunked_sent:
        if flag == 0 and type(chunk) == tuple and chunk[0] in ["written", "Written", "writer", "Writer", "writes", "Writes"]:
          flag = 1
          continue
        # the first named entity after setting the flag is taken as the value (e.g. first name after "Written by ..")
        if flag == 1 and type(chunk) == nltk.tree.Tree:
          writers.add(" ".join([i[0] for i in chunk.leaves()]))
          break
  # returning the list
  return list(writers)

---
## Task 4: Information / Relation Extraction (II)

In [2]:
def part_of(document):
  # pattern to check whether the sentence mentions an installment of a movie series
  pattern = re.compile(r'.*\b(installment)\b.*')
  part = set()
  for sent in nltk.sent_tokenize(document):
    if len(re.findall(pattern, sent)) > 0: # checking if the pattern is present in the sentence
      # the word right before "installment" (e.g. "first") names the part
      part.add(sent.split("installment")[0].split()[-1] + " part")
  return list(part)
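
The string slicing in `part_of` can be followed on a single sentence. This is a minimal sketch with an invented example sentence; `installment_label` is a hypothetical helper that uses `str.partition` instead of `split` and returns `None` when no word precedes "installment", which the notebook version does not guard against. Unlike the notebook's regular expression, it does not enforce word boundaries.

```python
def installment_label(sentence):
    """Return '<word before installment> part', or None if there is no such word."""
    before, found, _ = sentence.partition("installment")
    words = before.split()
    if not found or not words:
        return None
    return words[-1] + " part"


# Invented sentence for illustration; not taken from the data set.
sentence = "It is the first installment of the original Star Wars trilogy."
print(installment_label(sentence))  # first part
```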

---

## Task 5: Combining information in the output

In [13]:
def extract_info(document):
  output = {
    ##### EDIT BELOW THIS LINE #####
    # For the relations you extract in Task 3,
    # save the output in the appropriate key and delete rest of the keys.
    "Directed by": directed_by(document),
    "Written by": written_by(document),
    "Produced by": produced_by(document),
    # save the output from Task 4 here
    "Task 4": part_of(document),

    ##### EDIT ABOVE THIS LINE #####
  }

  return output


# check output for the first document
extract_info(documents[0])

{'Directed by': ['George Lucas'],
 'Written by': ['George Lucas'],
 'Produced by': ['Lucasfilm'],
 'Task 4': ['first part']}

---
## Task 6: Evaluation (I)

In [14]:
def evaluate(labels, predictions):
  assert len(predictions) == len(labels)

  scores = {
      'precision': 0.0, 'recall': 0.0, 'f1': 0.0
  }

  # calculate the precision, recall and f1 score over the information fields 
  # corresponding to Task 3 and store the result in the `scores` dict.
  tp, fp, fn = 0, 0, 0
  for i in range(len(predictions)):
    label, pred = labels[i], predictions[i]
    for key, pred_vals in pred.items():
        try:
            true_vals = label[key]  # calculation is done for the keys present in labels
        except KeyError:
            continue
        for v in pred_vals:
            if v in true_vals:
                tp += 1   # true positive if predicted value is present in true labels.
            else:
                fp += 1   # false positive if predicted value is NOT present in true labels.
        for v in true_vals:
            if v not in pred_vals:
                fn += 1   # false negative if true value is NOT present in predicted values.
  # calculating and storing the values
  scores["precision"] = round(tp / (tp + fp), 2)
  scores["recall"] = round(tp / (tp + fn), 2)
  scores["f1"] = round(2 * scores["precision"] * scores["recall"] / (scores["precision"] + scores["recall"]), 2)
  return scores
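
To see how the counts turn into scores, the sketch below applies the same micro-averaged counting to one hand-written document. It is a variant under stated assumptions, not the graded function: `micro_prf` is a hypothetical name, it guards against division by zero, it computes F1 from the unrounded precision and recall, and the toy label is invented for illustration rather than copied from a JSON file.

```python
def micro_prf(labels, predictions):
    """Micro-averaged precision, recall and F1 over keys present in the labels."""
    tp = fp = fn = 0
    for label, pred in zip(labels, predictions):
        for key, pred_vals in pred.items():
            if key not in label:
                continue  # keys without gold values are skipped, as in evaluate()
            true_vals = label[key]
            tp += sum(v in true_vals for v in pred_vals)
            fp += sum(v not in true_vals for v in pred_vals)
            fn += sum(v not in pred_vals for v in true_vals)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": round(precision, 2), "recall": round(recall, 2), "f1": round(f1, 2)}


# Toy data: one correct director, one producer mismatch, one key without a label.
toy_labels = [{"Directed by": ["George Lucas"], "Produced by": ["Gary Kurtz"]}]
toy_predictions = [{"Directed by": ["George Lucas"],
                    "Produced by": ["Lucasfilm"],
                    "Task 4": ["first part"]}]
print(micro_prf(toy_labels, toy_predictions))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Here tp = 1 (the director), fp = 1 ("Lucasfilm") and fn = 1 ("Gary Kurtz"), while the "Task 4" key is ignored because it has no gold value.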

In [15]:
import pandas as pd

# calculate evaluation score across all the 50 documents
extracted_infos = []
for document in documents:
  extracted_infos.append(extract_info(document))

scores = evaluate(labels, extracted_infos)

pd.DataFrame([scores])

|   | precision | recall | f1 |
|---|-----------|--------|------|
| 0 | 0.55 | 0.34 | 0.42 |


---
## Task 7: Challenges and Issues with the above Evaluation (II)

---
The above evaluation faces a number of challenges. Some of them are as follows:
> 1. Gold-standard values can differ from what the text states. For instance, the labels for document 1 give the producer as "Gary Kurtz", whereas `01.doc.txt` mentions "Lucasfilm"; the extractor returns "Lucasfilm", which is faithful to the text but does not match the gold value. The same happens in a few other documents, such as `02.doc.txt` and `03.doc.txt`, where the gold values do not appear in the movie text.
> 2. Not every key/relation is present in the labels. For instance, a writer is extracted for `04.doc.txt`, but the corresponding JSON file has no "Written by" field.

With these issues, the evaluation doesn't give a proper idea of how well the extraction performs: the metric only credits a prediction when it exactly matches a gold value, so such mismatches negatively affect the scores.
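
Both issues can be surfaced with a small diagnostic. The sketch below is only an illustration: `label_diagnostics` is a hypothetical helper, the document text is invented, and it assumes that every label value is a list of strings.

```python
def label_diagnostics(document, label, prediction):
    """Report gold values absent from the text and predicted keys absent from the labels."""
    not_in_text = {}
    for key, values in label.items():
        missing = [v for v in values if v not in document]
        if missing:
            not_in_text[key] = missing
    unlabelled_keys = sorted(k for k in prediction if k not in label)
    return not_in_text, unlabelled_keys


# Invented text and labels that mirror the two issues described above.
document = "Star Wars is a 1977 film written and directed by George Lucas, produced by Lucasfilm."
label = {"Directed by": ["George Lucas"], "Produced by": ["Gary Kurtz"]}
prediction = {"Directed by": ["George Lucas"],
              "Written by": ["George Lucas"],
              "Produced by": ["Lucasfilm"]}

print(label_diagnostics(document, label, prediction))
# ({'Produced by': ['Gary Kurtz']}, ['Written by'])
```

Running such a check over all 50 documents would show how many gold values an exact-match extractor could never recover from the text alone.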