# Exercise 11: Entity and Relation Extraction

## Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern-based heuristics that combine *Part-of-Speech Tagging*, *Named Entity Recognition*, and *Regular Expressions*.

#### Required Library: SpaCy
- ```conda install -y spacy```
- ```python -m spacy download en```

In [1]:
import urllib.request, json, csv, re
import spacy
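# 'en' is a shortcut name from spaCy 2.x; spaCy 3 and later removed it,
# so there the full model name 'en_core_web_sm' has to be downloaded and loaded instead
nlp = spacy.load('en')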

In [2]:
# read the TSV file with the input movies (first column: movie, second column: director)
def read_tsv():
    with open('movies.tsv', 'r') as file:
        tsv = csv.reader(file, delimiter='\t')
        next(tsv)  # skip the header row
        movies = [{'movie': line[0], 'director': line[1]} for line in tsv]
    return movies

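# fetch the plain-text intro of a Wikipedia page; returns '' if anything goes wrong
def parse_wikipedia(movie):
    txt = ''
    try:
        with urllib.request.urlopen('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles='+movie) as url:
            data = json.loads(url.read().decode())
            # the API keys the result by page id, so take the first (only) page
            txt = next(iter(data['query']['pages'].values()))['extract']
    except Exception:
        # network error, malformed response, or a page without an extract
        pass
    return txt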
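
The request URL above is assembled by string concatenation, so a title containing spaces, `&`, or non-ASCII characters reaches the query string unescaped. A minimal sketch of one way to build the same query with `urllib.parse.urlencode`, which percent-encodes each parameter. The helper name `build_extract_url` is hypothetical and not part of the exercise:

```python
from urllib.parse import urlencode

API = 'https://en.wikipedia.org/w/api.php'

def build_extract_url(title):
    """Build the extract query for one page title (hypothetical helper)."""
    params = {
        'format': 'json',
        'action': 'query',
        'prop': 'extracts',
        'exintro': '',
        'explaintext': '',
        'titles': title,  # urlencode escapes special characters in the title
    }
    return API + '?' + urlencode(params)

print(build_extract_url('The_A-Team_(film)'))
```

The returned string could be passed to `urllib.request.urlopen` in place of the concatenated URL.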

#### 1) Parse the raw text of a Wikipedia movie page and extract named (PER) entities.

In [5]:
movies[1]

{'director': 'Daniel Lee', 'movie': '14_Blades'}

In [9]:
example = parse_wikipedia(movies[1]['movie'])
example

'14 Blades is a 2010 wuxia film directed by Daniel Lee, starring Donnie Yen, Zhao Wei, Sammo Hung, Wu Chun, Kate Tsui, Qi Yuwu and Damian Lau. The film was released on 4 February 2010 in China and on 11 February 2010 in Hong Kong.\n\n'

In [22]:
def find_PER_entities(txt):
    persons = []
    nlp_txt = nlp(txt)
    
    for e in nlp_txt.ents:
        if e.label_ == "PERSON":
            persons.append(e.text)
        
    return persons

example_person = find_PER_entities(example)
example_person

['Daniel Lee',
 'Donnie Yen',
 'Zhao Wei',
 'Sammo Hung',
 'Wu Chun',
 'Kate Tsui',
 'Qi Yuwu',
 'Damian Lau']

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.

In [30]:
def find_director(txt, persons):
    # remove the punctuation marks ! ? , . and split the text on whitespace
    split = re.sub('[!?,.]', '', txt).split()
    for index, token in enumerate(split):
        if token == "directed":
            # return the first person whose name starts with a token at or after "directed"
            for candidate in split[index:]:
                for person in persons:
                    if person.startswith(candidate):
                        return person
    return ''

find_director(example, example_person) 

'Daniel Lee'
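
The token loop above accepts any person whose name starts with a token found at or after "directed". In a sentence such as "starring Johnny Depp and directed by John Smith", the token `John` would therefore also match `Johnny Depp` if that entity comes first in the list. A hedged alternative sketch, not the exercise's reference solution: use a regular expression to cut the text at the first "directed by" and return the person mentioned earliest in the remainder. The function name `find_director_regex` and the sample sentence are made up for illustration.

```python
import re

def find_director_regex(txt, persons):
    """Return the person named first after 'directed by', or '' if none."""
    match = re.search(r'\bdirected\s+by\s+(.*)', txt, flags=re.DOTALL)
    if not match:
        return ''
    tail = match.group(1)
    # (position, name) for every known person that occurs in the tail
    hits = [(tail.find(p), p) for p in persons if p in tail]
    return min(hits)[1] if hits else ''

sample = 'A film starring Johnny Depp and directed by John Smith.'
print(find_director_regex(sample, ['Johnny Depp', 'John Smith']))  # John Smith
```

Matching whole names against the text after the pattern avoids the prefix ambiguity, at the cost of missing directors whose credit is phrased differently (for example "a film by ...").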

In [31]:
movies = read_tsv()

statements = []
fp = 0  # extracted statements whose director differs from the ground truth
for m in movies:
    txt = parse_wikipedia(m['movie'])
    persons = find_PER_entities(txt)
    director = find_director(txt, persons)

    if director != '':
        statements.append(m['movie'] + ' is directed by ' + director + '.')
        if director != m['director']:
            fp += 1

#### 3) Compute the precision and recall based on the given ground truth (column Director from tsv file) and show examples of statements that are extracted.

In [32]:
# compute precision and recall
fn = len(movies) - len(statements)  # movies for which no director was extracted
tp = len(statements) - fp           # extracted statements that match the ground truth

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print('Precision:', precision)
print('Recall:', recall)

print()
print('***Sample Statements***')
for s in statements[:5]:
    print(s)

Precision: 0.8081632653061225
Recall: 0.825

***Sample Statements***
13_Assassins_(2010_film) is directed by Takashi Miike.
14_Blades is directed by Daniel Lee.
22_Bullets is directed by Richard Berry.
The_A-Team_(film) is directed by Joe Carnahan.
Alien_vs_Ninja is directed by Seiji Chiba.
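
In the evaluation above, a movie with a wrong director adds to `fp` but not to `fn`, so recall is measured only against movies that received either a correct statement or none at all. A small sketch of the same formulas as a reusable function, with a guard for empty denominators. The function name and the counts in the usage line are hypothetical; they are not the counts behind the reported numbers:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts; 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# hypothetical counts: 8 correct statements, 2 wrong ones, 2 movies without a statement
print(precision_recall(8, 2, 2))  # (0.8, 0.8)
```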


## Task 2: Named Entity Recognition using a Hidden Markov Model

Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
In particular, your model must be able to recognize pairs of the form (*firstname lastname*) as *PER* entities.
Use the given sentences as training and test sets:

In [2]:
training_set=['The best blues singer was Bobby Bland while Ray Charles pioneered soul music .', \
              'Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .' \
              'None of them lived in Chicago .']

test_set=['Ray Charles was born in 1930 .', \
          'Bobby Bland was born the same year as Ray Charles .', \
          'Muddy Waters is the father of Chicago Blues .']

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non-PER entities).

*Hint*: Represent the sentences as sequences of bigrams, and label each bigram.
Only bigrams that contain pairs of the form (*firstname lastname*) are considered *PER* entities.

In [3]:
#Bigram Representation

# l.split(' ')[:-1] drops the last token and l.split(' ')[1:] drops the first;
# zipping the two lists pairs every token with its successor

def getBigrams(sents):
    return [b[0]+' '+b[1] for l in sents for b in zip(l.split(' ')[:-1], l.split(' ')[1:])]

bigrams = getBigrams(training_set)

#Annotation
PER = ['Bobby Bland', 'Ray Charles']
annotations = []
for b in bigrams:
    if b in PER:
        annotations.append([b,"I"])
    else:
        annotations.append([b,"O"])
print('Annotation\n', annotations,'\n')

Annotation
 [['The best', 'O'], ['best blues', 'O'], ['blues singer', 'O'], ['singer was', 'O'], ['was Bobby', 'O'], ['Bobby Bland', 'I'], ['Bland while', 'O'], ['while Ray', 'O'], ['Ray Charles', 'I'], ['Charles pioneered', 'O'], ['pioneered soul', 'O'], ['soul music', 'O'], ['music .', 'O'], ['Bobby Bland', 'I'], ['Bland was', 'O'], ['was just', 'O'], ['just a', 'O'], ['a singer', 'O'], ['singer whereas', 'O'], ['whereas Ray', 'O'], ['Ray Charles', 'I'], ['Charles was', 'O'], ['was a', 'O'], ['a pianist', 'O'], ['pianist ,', 'O'], [', songwriter', 'O'], ['songwriter and', 'O'], ['and singer', 'O'], ['singer .None', 'O'], ['.None of', 'O'], ['of them', 'O'], ['them lived', 'O'], ['lived in', 'O'], ['in Chicago', 'O'], ['Chicago .', 'O']] 
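
The bigrams `'singer .None'` and `'.None of'` in the output above arise because the second and third training sentences are not separated by a comma, so Python joins the two adjacent string literals into one sentence. A short sketch of the effect, using shortened stand-in sentences rather than the actual training set (`get_bigrams` mirrors `getBigrams`):

```python
def get_bigrams(sents):
    # same construction as getBigrams: pair every token with its successor
    return [a + ' ' + b for s in sents
            for a, b in zip(s.split(' ')[:-1], s.split(' ')[1:])]

# no comma between the literals: Python concatenates them into a single string
joined = ['a singer .'
          'None of them .']
# with the comma, the list holds two separate sentences
separate = ['a singer .',
            'None of them .']

print(len(joined), get_bigrams(joined))
print(len(separate), get_bigrams(separate))
```

With the comma restored, `None` would start its own sentence and the bigram `None of` would count as having one uppercase initial, so the emission counts for label `O` would shift slightly.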



#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda = 0.5$).

*Hint*: For the emission probabilities you can use the morphology of the words that make up a bigram (e.g., count how many of them start with an uppercase character).

In [4]:
lambda_ = 0.5

#Transition Probabilities
transition_prob={}

nI = 0
nO = 0

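# count the labels of all bigrams except the first one
for annotation in annotations[1:]:
    if annotation[1] == "I":
        nI += 1
    elif annotation[1] == "O":
        nO += 1

#Prior: the overall label frequencies are used as start probabilities
transition_prob['P(I|start)'] = nI / (nI + nO)
transition_prob['P(O|start)'] = 1 - transition_prob['P(I|start)']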

O_O_count = 0
O_I_count = 0
I_I_count = 0
I_O_count = 0

for index in range(0,len(annotations)-1):
    if annotations[index][1] == "O":
        if annotations[index+1][1] == "O":
            O_O_count += 1
        elif annotations[index+1][1] == "I":
            O_I_count += 1
    elif annotations[index][1] == "I":
        if annotations[index+1][1] == "O":
            I_O_count += 1
        elif annotations[index+1][1] == "I":
            I_I_count += 1

transition_prob['P(O|O)'] = O_O_count / nO
transition_prob['P(O|I)'] = I_O_count / nI
transition_prob['P(I|O)'] = O_I_count / nO
transition_prob['P(I|I)'] = I_I_count / nI

print('Transition Probabilities\n',transition_prob, '\n')

#Emission Probabilities
emission_prob={}

def count_upper_first_char(bigram):
    bigram = bigram.split(" ")
    count = 0
    if bigram[0][0].isupper():
        count += 1
    if bigram[1][0].isupper():
        count += 1
    return count
        
default_emission = (1 - lambda_) / len(annotations) # len(annotations) == len(bigrams)

two_upper_O = 0
two_upper_I = 0
one_upper_O = 0
one_upper_I = 0
zero_upper_O = 0
zero_upper_I = 0

for annotation in annotations:
    if count_upper_first_char(annotation[0]) == 2 and annotation[1] == "O":
        two_upper_O += 1
    elif count_upper_first_char(annotation[0]) == 2 and annotation[1] == "I":
        two_upper_I += 1
    elif count_upper_first_char(annotation[0]) == 1 and annotation[1] == "O":
        one_upper_O += 1
    elif count_upper_first_char(annotation[0]) == 1 and annotation[1] == "I":
        one_upper_I += 1
    elif count_upper_first_char(annotation[0]) == 0 and annotation[1] == "O":
        zero_upper_O += 1
    elif count_upper_first_char(annotation[0]) == 0 and annotation[1] == "I":
        zero_upper_I += 1

# smoothed emission: lambda_ * relative frequency + (1 - lambda_) / number of training bigrams
emission_prob['P(2_upper|O)'] = (two_upper_O / nO) * lambda_ + default_emission
emission_prob['P(2_upper|I)'] = (two_upper_I / nI) * lambda_ + default_emission
emission_prob['P(1_upper|O)'] = (one_upper_O / nO) * lambda_ + default_emission
emission_prob['P(1_upper|I)'] = (one_upper_I / nI) * lambda_ + default_emission
emission_prob['P(0_upper|O)'] = (zero_upper_O / nO) * lambda_ + default_emission
emission_prob['P(0_upper|I)'] = (zero_upper_I / nI) * lambda_ + default_emission

print('Emission Probabilities\n', emission_prob, '\n')

Transition Probabilities
 {'P(I|start)': 0.11764705882352941, 'P(O|start)': 0.8823529411764706, 'P(O|O)': 0.8666666666666667, 'P(O|I)': 1.0, 'P(I|O)': 0.13333333333333333, 'P(I|I)': 0.0} 

Emission Probabilities
 {'P(2_upper|O)': 0.014285714285714285, 'P(2_upper|I)': 0.5142857142857142, 'P(1_upper|O)': 0.18095238095238095, 'P(1_upper|I)': 0.014285714285714285, 'P(0_upper|O)': 0.36428571428571427, 'P(0_upper|I)': 0.014285714285714285} 
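
The emission values above follow the interpolation used in the code, $\lambda \cdot \mathrm{count}(o, s) / n_s + (1 - \lambda) / N$, where $N = 35$ is the number of training bigrams. As a quick check, the sketch below recomputes three of them from counts read off the annotation output: 4 `I` bigrams, all with two uppercase initials, and 10 `O` bigrams with exactly one uppercase initial, which leaves 21 `O` bigrams with none. The code divides by $n_O = 30$ because its label counts skip the first bigram. The helper name `smoothed` is illustrative:

```python
lambda_ = 0.5
N = 35            # number of training bigrams
n_I, n_O = 4, 30  # label counts as used by the code (first bigram skipped)

def smoothed(count, n_state):
    """lambda_ * relative frequency + (1 - lambda_) / N."""
    return lambda_ * count / n_state + (1 - lambda_) / N

print(round(smoothed(4, n_I), 4))   # P(2_upper|I)
print(round(smoothed(10, n_O), 4))  # P(1_upper|O)
print(round(smoothed(21, n_O), 4))  # P(0_upper|O)
```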



#### 3) Predict the labels of the test set and compute the precision and the recall of your model.

In [5]:
#Prediction (greedy: for each bigram pick the more probable label given the previous label)
bigrams = getBigrams(test_set)
entities=[]
prev_state='start'
for b in bigrams:
    n_upper = str(count_upper_first_char(b))
    I_prob = transition_prob["P(I|" + prev_state + ")"] * \
             emission_prob["P(" + n_upper + "_upper|I)"]
    O_prob = transition_prob["P(O|" + prev_state + ")"] * \
             emission_prob["P(" + n_upper + "_upper|O)"]

    # a tie is resolved in favour of I
    if O_prob > I_prob:
        prev_state = 'O'
    else:
        entities.append(b)
        prev_state = 'I'

print('Predicted Entities\n', entities, '\n')

Predicted Entities
 ['Ray Charles', 'Bobby Bland', 'Ray Charles', 'Muddy Waters', 'Chicago Blues'] 



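Counting each distinct entity once, precision is *75%* (3 of the 4 predicted entities are persons) while recall is *100%*.
<br>The false positive is Chicago Blues, which is not a PER entity.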
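
A small sketch that reproduces the reported precision and recall by counting each distinct predicted entity once. The gold set is an assumption read off the test sentences (the three person names they mention), and the variable names are illustrative:

```python
predicted = ['Ray Charles', 'Bobby Bland', 'Ray Charles', 'Muddy Waters', 'Chicago Blues']
gold = {'Ray Charles', 'Bobby Bland', 'Muddy Waters'}  # assumed gold PER entities

distinct = set(predicted)
tp = len(distinct & gold)
precision = tp / len(distinct)
recall = tp / len(gold)
print('Precision:', precision)  # 0.75
print('Recall:', recall)        # 1.0

# counting every predicted mention instead: 4 of the 5 predictions are persons
mention_precision = sum(p in gold for p in predicted) / len(predicted)
print('Mention-level precision:', mention_precision)  # 0.8
```

Which of the two conventions applies should be stated alongside the numbers, since the repeated `Ray Charles` mention changes the result.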

#### 4) Comment on how you can further improve this model.

...