## 📚 Exercise 13: Entity & Relation Extraction

### Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern-based heuristics that use *part-of-speech tagging*, *named entity recognition*, and *regular expressions*.

#### Required Library: SpaCy
- ```conda install -y spacy```
- ```python -m spacy download en_core_web_sm```

In [126]:
import urllib.request, json, csv, re
import spacy
nlp = spacy.load('en_core_web_sm')

In [127]:
#read tsv with input movies
def read_tsv():
    movies=[]
    with open('movies.tsv','r') as file:
        tsv = csv.reader(file, delimiter='\t')
        next(tsv) #remove header
        movies = [{'movie':line[0], 'director':line[1]} for line in tsv]
    return movies

#parse wikipedia page
def parse_wikipedia(movie):
    txt = ''
    try:
        with urllib.request.urlopen('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles='+movie) as url:
            data = json.loads(url.read().decode())
            txt = next(iter(data['query']['pages'].values()))['extract']
    except Exception:
        # network error or page without an extract: fall back to the empty string
        pass
    return txt
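
Titles such as `The_A-Team_(film)` contain characters that are safer to percent-encode before they are placed in a URL. A minimal sketch, assuming the same endpoint and query parameters as `parse_wikipedia`; the helper name `build_extract_url` is hypothetical and not part of the exercise:

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'  # same endpoint as parse_wikipedia

def build_extract_url(title):
    """Build the intro-extract query URL with a percent-encoded title (hypothetical helper)."""
    params = {
        'format': 'json',
        'action': 'query',
        'prop': 'extracts',
        'exintro': '',
        'explaintext': '',
        'titles': title,
    }
    # urlencode escapes reserved characters such as '(' , ')' and '&'
    return API_ENDPOINT + '?' + urlencode(params)

print(build_extract_url('The_A-Team_(film)'))
```

The resulting string could be passed to `urllib.request.urlopen` in place of the concatenated URL.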

#### 1) Parse the raw text of a Wikipedia movie page and extract named (PER) entities.

In [128]:
def find_PER_entities(txt):
    persons = []
    doc = nlp(txt)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            persons.append(ent.text)
    return persons

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.

In [129]:
def find_director(txt, persons):
    txt = re.sub("[!?,.]", "", txt).split()
    director = ''
    if 'directed' in txt:
        idx_directed = txt.index('directed')
        short_sentence = ' '.join(txt[idx_directed+1:idx_directed+4])

        for person in persons:
            if person in short_sentence:
                director = person 
                break

    return director
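
The window heuristic above can also be phrased as a regular expression. The sketch below is one possible variant under stated assumptions, not the reference solution: it looks for `directed by` followed by one to three capitalized tokens and accepts the match only if it starts with one of the PER entities. The sample sentence and the `persons` list are hand-written for illustration; in the exercise they would come from `parse_wikipedia` and `find_PER_entities`. The name `find_director_regex` is hypothetical.

```python
import re

# "directed by" followed by 1-3 capitalized tokens (hyphens allowed, e.g. "Joon-ik")
DIRECTED_BY = re.compile(r"directed by ((?:[A-Z][\w-]*)(?: [A-Z][\w-]*){0,2})")

def find_director_regex(txt, persons):
    """Return the first PER entity that directly follows 'directed by', else ''."""
    for match in DIRECTED_BY.finditer(txt):
        candidate = match.group(1)
        for person in persons:
            if candidate.startswith(person):
                return person
    return ''

sample = "The film was directed by Daniel Lee and released in 2010."
print(find_director_regex(sample, ['Daniel Lee']))  # Daniel Lee
```

Unlike the token window, this pattern requires the word `by` and stops at the first lowercase token, so it is stricter about what may follow `directed`.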

In [130]:
movies = read_tsv()[:10]
movies[:10]

[{'movie': '13_Assassins_(2010_film)', 'director': 'Takashi Miike'},
 {'movie': '14_Blades', 'director': 'Daniel Lee'},
 {'movie': '22_Bullets', 'director': 'Richard Berry'},
 {'movie': 'The_A-Team_(film)', 'director': 'Joe Carnahan'},
 {'movie': 'Alien_vs_Ninja', 'director': 'Seiji Chiba'},
 {'movie': 'Bad_Blood_(2010_film)', 'director': 'Dennis Law'},
 {'movie': 'Bangkok_Knockout', 'director': 'Panna Rittikrai'},
 {'movie': 'Blades_of_Blood', 'director': 'Lee Joon-ik'},
 {'movie': 'The_Book_of_Eli', 'director': 'Allen Hughes'},
 {'movie': 'The_Bounty_Hunter_(2010_film)', 'director': 'Andy Tennant'}]

In [131]:
statements=[]
tp = 0
fp = 0
for m in movies:
    txt = parse_wikipedia(m['movie'])
    persons = find_PER_entities(txt)
    director = find_director(txt, persons)

    if director != '':
        statements.append(m['movie'] + ' is directed by ' + director + '.')

        if director == m['director']:
            tp += 1
        else:
            fp += 1

#### 3) Compute the precision and recall based on the given ground truth (column Director from tsv file) and show examples of statements that are extracted.

In [132]:
# compute precision and recall
fn = len(movies) - tp  # movies for which no correct director was extracted
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print('Precision:', precision)
print('Recall:', recall)
print('\n***Sample Statements***')
for s in statements[:5]:
    print(s)

Precision: 1.0
Recall: 0.8

***Sample Statements***
13_Assassins_(2010_film) is directed by Takashi Miike.
14_Blades is directed by Daniel Lee.
22_Bullets is directed by Richard Berry.
Alien_vs_Ninja is directed by Seiji Chiba.
Bad_Blood_(2010_film) is directed by Dennis Law.
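
For reference, precision is the share of extracted statements that are correct, $tp / (tp + fp)$, and recall is the share of ground-truth relations that were recovered, $tp / (tp + fn)$. A small sketch with a hypothetical helper `precision_recall`; the counts in the usage line are illustrative and not taken from a run:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio with an empty denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# illustrative counts: 6 correct statements, 2 wrong ones, 4 relations missed
print(precision_recall(tp=6, fp=2, fn=4))  # (0.75, 0.6)
```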


### Task 2: Named Entity Recognition using a Hidden Markov Model


Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
In particular, your model must be able to recognize pairs of the form (*firstname lastname*) as *PER* entities.
Use the given sentences as training and test sets:

In [153]:
training_set = [
    "The best blues singer was Bobby Bland while Ray Charles pioneered soul music .",
    "Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .",
    "None of them lived in Chicago .",
]

test_set = [
    "Ray Charles was born in 1930 .",
    "Bobby Bland was born the same year as Ray Charles .",
    "Muddy Waters is the father of Chicago Blues .",
]

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non-PER entities).

*Hint*: Represent the sentences as sequences of bigrams and label each bigram. Only bigrams that contain pairs of the form (*firstname lastname*) are considered *PER* entities.

In [154]:
#Bigram Representation
def getBigrams(sents):
    return [
        [b[0] + " " + b[1] for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
        for l in sents
    ]

bigrams = getBigrams(training_set)

#Annotation
PER = ['Bobby Bland', 'Ray Charles']
annotations = []
for sentence in bigrams:
    sentence_annotation = []
    for bigram in sentence:
        if bigram in PER:
            sentence_annotation.append('I')
        else: 
            sentence_annotation.append('O')
    annotations.append(sentence_annotation)
print('Annotation\n', annotations,'\n')

Annotation
 [['O', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'I', 'O', 'O', 'O', 'O'], ['I', 'O', 'O', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O']] 
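
To make the annotation easier to inspect, each bigram can be printed next to its label. The sketch below re-implements the bigram split for a single sentence so that it runs on its own; the snake-case name `get_bigrams` is used only here and is not part of the exercise code.

```python
def get_bigrams(sentence):
    """Split on single spaces and join neighbouring tokens, as getBigrams does."""
    tokens = sentence.split(" ")
    return [first + " " + second for first, second in zip(tokens[:-1], tokens[1:])]

PER = ['Bobby Bland', 'Ray Charles']
sentence = "The best blues singer was Bobby Bland while Ray Charles pioneered soul music ."

# pair every bigram with its I/O label
labelled = [(bigram, 'I' if bigram in PER else 'O') for bigram in get_bigrams(sentence)]
for bigram, label in labelled:
    print(label, bigram)
```

Overlapping bigrams such as `was Bobby` and `Bland while` stay `O`; only the exact (*firstname lastname*) pair is labelled `I`.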



#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda = 0.5$).

*Hint*: For the emission probabilities, you can use the morphology of the words that constitute a bigram (e.g., you can count their uppercase first characters).

In [148]:
import re
lambda_ = 0.5

I_Start = 0
O_Start = 0
O_O = 0
O_I = 0
I_O = 0
I_I = 0

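# Counter names follow the conditional notation: X_Y counts label X preceded by label Y
# (e.g. I_O counts the sequence "O I" and feeds P(I|O)).
for sentence in annotations:

    for idx in range(len(sentence) - 1):
        if idx == 0:
            # the first label only updates the start counts; the transition
            # from the first to the second label is not counted here
            if sentence[idx] == 'O':
                O_Start += 1
            else:
                I_Start += 1
        else:
            pair = sentence[idx] + sentence[idx + 1]
            if pair == 'OO':
                O_O += 1
            elif pair == 'OI':
                I_O += 1
            elif pair == 'IO':
                O_I += 1
            elif pair == 'II':
                I_I += 1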

#Transition Probabilities
transition_prob={}


#Prior
transition_prob['P(I|start)'] = I_Start / (I_Start + O_Start)
transition_prob['P(O|start)'] = O_Start / (I_Start + O_Start)

transition_prob['P(O|O)'] = O_O / (O_O + I_O)
transition_prob['P(O|I)'] = O_I / (O_I + I_I)
transition_prob['P(I|O)'] = I_O / (O_O + I_O)
transition_prob['P(I|I)'] = I_I / (O_I + I_I)


        
                
print('Transition Probabilities\n',transition_prob, '\n')

#Emission Probabilities
emission_prob={}

        
default_emission = (1-lambda_) * 1 / len(sum(bigrams, []))

upper2_O = 0
upper2_I = 0
upper1_O = 0
upper1_I = 0
upper0_O = 0
upper0_I = 0

for i, sentence in enumerate(bigrams):
    for j, bigram in enumerate(sentence):
        nb_capital = len(re.findall(r'[A-Z]',bigram))

        if nb_capital == 0:
            if annotations[i][j] == 'O':
                upper0_O += 1
            else:
                upper0_I += 1
        
        elif nb_capital == 1:
            if annotations[i][j] == 'O':
                upper1_O += 1
            else:
                upper1_I += 1

        elif nb_capital == 2:
            if annotations[i][j] == 'O':
                upper2_O += 1
            else:
                upper2_I += 1


# number of bigrams carrying each label in the training annotations
all_labels = sum(annotations, [])
count_O = all_labels.count('O')
count_I = all_labels.count('I')

emission_prob['P(2_upper|O)'] = lambda_ * upper2_O / count_O + default_emission
emission_prob['P(2_upper|I)'] = lambda_ * upper2_I / count_I + default_emission
emission_prob['P(1_upper|O)'] = lambda_ * upper1_O / count_O + default_emission
emission_prob['P(1_upper|I)'] = lambda_ * upper1_I / count_I + default_emission
emission_prob['P(0_upper|O)'] = lambda_ * upper0_O / count_O + default_emission
emission_prob['P(0_upper|I)'] = lambda_ * upper0_I / count_I + default_emission

print('Emission Probabilities\n')
for em, value in emission_prob.items():
    print(em, value)

Transition Probabilities
 {'P(I|start)': 0.3333333333333333, 'P(O|start)': 0.6666666666666666, 'P(O|O)': 0.8846153846153846, 'P(O|I)': 1.0, 'P(I|O)': 0.11538461538461539, 'P(I|I)': 0.0} 

Emission Probabilities

P(2_upper|O) 0.014285714285714285
P(2_upper|I) 0.5142857142857142
P(1_upper|O) 0.19170506912442398
P(1_upper|I) 0.014285714285714285
P(0_upper|O) 0.3368663594470046
P(0_upper|I) 0.014285714285714285
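
The emission values above follow the interpolation used in the code: $P(f \mid s) = \lambda \cdot \frac{count(f, s)}{count(s)} + (1 - \lambda) \cdot \frac{1}{N}$, where $N$ is the number of training bigrams. The sketch below recomputes two of the printed values. The counts (4 `I` bigrams, all with two capitals, 31 `O` bigrams, 35 training bigrams in total) are read off the annotation output, and `smoothed_emission` is a hypothetical helper name.

```python
def smoothed_emission(count_feature_state, count_state, n_bigrams, lam=0.5):
    """Interpolate the relative frequency with a uniform 1/N term."""
    return lam * count_feature_state / count_state + (1 - lam) / n_bigrams

p_2upper_I = smoothed_emission(4, 4, 35)   # all 4 I bigrams have two capitals
p_2upper_O = smoothed_emission(0, 31, 35)  # no O bigram has two capitals
print(round(p_2upper_I, 4), round(p_2upper_O, 4))  # 0.5143 0.0143
```

A feature that was never observed with a state therefore still receives the floor value $(1 - \lambda) / N$ instead of zero.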


#### 3) Predict the labels of the test set and compute the precision and the recall of your model.

In [172]:
# Prediction
bigrams = getBigrams(test_set)

entities = []
for sentence in bigrams:
    prev_state = "start"
    for b in sentence:
        I_prob = (
            transition_prob["P(I|" + prev_state + ")"]
            * emission_prob["P(" + str((len(re.findall(r'[A-Z]',b)))) + "_upper|I)"]
        )
        O_prob = (
            transition_prob["P(O|" + prev_state + ")"]
            * emission_prob["P(" + str(len(re.findall(r'[A-Z]',b))) + "_upper|O)"]
        )

        if I_prob > O_prob:
            entities.append(b)
            prev_state = "I"
        else:
            prev_state = "O"

print("Predicted Entities\n", entities, "\n")

Predicted Entities
 ['Ray Charles', 'Bobby Bland', 'Ray Charles', 'Muddy Waters', 'Chicago Blues'] 
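
The decoding loop above is greedy: for each bigram it compares the one-step products of transition and emission probability and commits to the larger one. The sketch below traces that comparison for three bigrams of the third test sentence. The probabilities are copied (rounded to four decimals) from the printed tables, and the helper `greedy_step` is hypothetical.

```python
import re

# rounded values from the printed tables; keys are (label, previous state)
transition = {('I', 'start'): 0.3333, ('O', 'start'): 0.6667,
              ('I', 'O'): 0.1154, ('O', 'O'): 0.8846,
              ('I', 'I'): 0.0, ('O', 'I'): 1.0}
# keys are (number of capital letters, label)
emission = {(2, 'I'): 0.5143, (2, 'O'): 0.0143,
            (1, 'I'): 0.0143, (1, 'O'): 0.1917,
            (0, 'I'): 0.0143, (0, 'O'): 0.3369}

def greedy_step(bigram, prev_state):
    """Pick the label with the larger transition * emission product."""
    n_upper = len(re.findall(r'[A-Z]', bigram))
    score_I = transition[('I', prev_state)] * emission[(n_upper, 'I')]
    score_O = transition[('O', prev_state)] * emission[(n_upper, 'O')]
    return 'I' if score_I > score_O else 'O'

print(greedy_step('Muddy Waters', 'start'))  # I
print(greedy_step('of Chicago', 'O'))        # O
print(greedy_step('Chicago Blues', 'O'))     # I
```

Because the only emission feature is the number of capital letters, `Chicago Blues` scores exactly like a two-word name and is labelled `I`.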



In [180]:
for name in entities:
    print(name in PER)

True
True
True
False
False


In [181]:
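# PER holds only the names annotated in the training set,
# so any other predicted bigram is counted as a false positive
tp = sum(name in PER for name in entities)
fp = len(entities) - tp
fn = 0  # hard-coded: no PER mention of the test set is missing from the predictions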

precision = tp / (tp+fp)
recall = tp / (tp+fn)

print(f"precision is {precision} while recall is {recall}")

precision is 0.6 while recall is 1.0


Precision is *...%* while recall is *...%*. 

#### 4) Comment on how you can further improve this model.

...