<div class="alert alert-info">
    
➡️ Make sure that you have read the **[rules for hand-in assignments](https://www.ida.liu.se/~TDDE16/exam.en.shtml#handins)** and the **[policy on cheating and plagiarism](https://www.ida.liu.se/~TDDE16/exam.en.shtml#cheating)** before starting with this lab.

➡️ Make sure you fill in any cells (and _only_ those cells) that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**, and do _not_ modify any of the other cells.

➡️ **Before you submit your lab, make sure everything runs as expected.** For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).

</div>

# L4: Information extraction

Information extraction (IE) is the task of identifying named entities and semantic relations between these entities in text data. In this lab we will focus on two sub-tasks in IE, **named entity recognition** (identifying mentions of entities) and **entity linking** (matching these mentions to entities in a knowledge base).

In [28]:
# Define some helper functions that are used in this notebook

from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Solution appears correct!</strong></div>'))

## Data set

We start by loading spaCy. However, the data that we will be using has been tokenized following the conventions of the [Penn Treebank](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html), and we need to prevent spaCy from using its own tokenizer on top of this. We therefore override spaCy&rsquo;s tokenizer with the default one that simply splits on whitespace.

In [29]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = Tokenizer(nlp.vocab)

The main data set for this lab is a collection of news wire articles in which mentions of named entities have been annotated with page names from the [English Wikipedia](https://en.wikipedia.org/wiki/). The next code cell loads the training and the development parts of the data into Pandas data frames.

In [30]:
import bz2
import csv
import pandas as pd
import numpy as np

with bz2.open('ner-train.tsv.bz2', 'rt') as source:
    df_train = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

with bz2.open('ner-dev.tsv.bz2', 'rt') as source:
    df_dev = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

Each row in these two data frames corresponds to one mention of a named entity and has five columns:

1. a unique identifier for the sentence containing the entity mention
2. the pre-tokenized sentence, with tokens separated by spaces
3. the start position of the token span containing the entity mention
4. the end position of the token span (exclusive, as in Python list indexing)
5. the entity label; either a Wikipedia page name or the generic label `--NME--`

The following cell prints the first five samples from the training data:

In [31]:
df_train.head()

| | sentence_id | sentence | beg | end | label |
|---|---|---|---|---|---|
| 0 | 0000-000 | EU rejects German call to boycott British lamb . | 0 | 1 | --NME-- |
| 1 | 0000-000 | EU rejects German call to boycott British lamb . | 2 | 3 | Germany |
| 2 | 0000-000 | EU rejects German call to boycott British lamb . | 6 | 7 | United_Kingdom |
| 3 | 0000-001 | Peter Blackburn | 0 | 2 | --NME-- |
| 4 | 0000-002 | BRUSSELS 1996-08-22 | 0 | 1 | Brussels |


In this sample, we see that the first sentence is annotated with three entity mentions:

* the span 0–1 &lsquo;EU&rsquo; is annotated as a mention but only labelled with the generic `--NME--`
* the span 2–3 &lsquo;German&rsquo; is annotated with the page [Germany](http://en.wikipedia.org/wiki/Germany)
* the span 6–7 &lsquo;British&rsquo; is annotated with the page [United_Kingdom](http://en.wikipedia.org/wiki/United_Kingdom)
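
Because the end position is exclusive, the text of a mention can be recovered with an ordinary Python slice over the whitespace-separated tokens. The following is a minimal sketch; the helper name `mention_text` is introduced here for illustration and is not part of the lab code.

```python
def mention_text(sentence, beg, end):
    """Return the tokens in the span [beg, end), joined by spaces."""
    tokens = sentence.split()
    return ' '.join(tokens[beg:end])


sentence = "EU rejects German call to boycott British lamb ."
print(mention_text(sentence, 0, 1))  # EU
print(mention_text(sentence, 2, 3))  # German
print(mention_text(sentence, 6, 7))  # British
```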

## Problem 1: Evaluation measures

To warm up, we ask you to write code to print the three measures that you will be using for evaluation:

In [32]:
import numpy as np

def evaluation_scores(gold, pred):
    """Compute precision, recall, and F1 score.
    
    Arguments:
        gold: The set with the gold-standard values.
        pred: The set with the predicted values.
    
    Returns:
        A tuple or list containing the precision, recall, and F1 values
        (in that order), computed based on the specified sets.
    """
    # YOUR CODE HERE
    # True positives: predicted items that also occur in the gold standard
    tp = len(set(gold) & set(pred))

    # Precision: share of the predictions that are correct (0 if nothing was predicted)
    precision = tp / len(pred) if pred else 0.0
    # Recall: share of the gold-standard items that were found (0 if the gold set is empty)
    recall = tp / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall (0 if there are no true positives)
    F1 = 2 * precision * recall / (precision + recall) if tp else 0.0

    return [precision, recall, F1]

Let's also define a convenience function that prints the scores nicely:

In [33]:
def print_evaluation_scores(scores):
    p, r, f = scores
    print(f"Precision: {p:.3f}, Recall: {r:.3f}, F1: {f:.3f}")

### 🤞 Test your code

To test your code, you can run the following cell. This should give you a precision of 60%, a recall of 100%, and an F1-value of 75%.

In [34]:
# Check if the results match what is expected
result = evaluation_scores(set(range(3)), set(range(5)))
assert len(result) == 3, "Should return exactly three scores"
print_evaluation_scores(result)
assert np.isclose(result, (.6, 1.0, .75)).all(), "Should be close to the expected values"
success()

Precision: 0.600, Recall: 1.000, F1: 0.750
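
The expected values in this test can also be checked by hand. The sketch below spells out the arithmetic for the same two sets using plain set operations. It only illustrates the definitions and is not meant as the reference solution.

```python
gold = set(range(3))  # {0, 1, 2}
pred = set(range(5))  # {0, 1, 2, 3, 4}

tp = len(gold & pred)        # 3 items are both predicted and in the gold standard
precision = tp / len(pred)   # 3 / 5
recall = tp / len(gold)      # 3 / 3
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 1.0 0.75
```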


## Problem 2: Span recognition

One of the first tasks that an information extraction system has to solve is to locate and classify (mentions of) named entities, such as persons and organizations. Here we will tackle the simpler task of recognizing **spans** of tokens that contain an entity mention, without the actual entity label.

The English language model in spaCy features a full-fledged [named entity recognizer](https://spacy.io/usage/linguistic-features#named-entities) that identifies a variety of entities, and can be updated with new entity types by the user. Your task in this problem is to evaluate the performance of this component when predicting entity spans in the development data.

### Task 2.1

Start by implementing a generator function that yields the gold-standard spans in a given data frame.  (If you're not familiar with the `yield` keyword in Python, check out [this brief explanation](https://www.nbshare.io/notebook/851988260/Python-Yield/).)

**Hint:** The Pandas method [`itertuples()`](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.itertuples.html) is useful when iterating over the rows in a DataFrame.
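
As an illustration of both hints, here is a small sketch of a generator that reads row fields by name. To keep it independent of the data files, it uses a hand-built `namedtuple` (the hypothetical `Row` type below) as a stand-in for the named tuples that `itertuples()` produces; the two rows are copied from the training sample shown earlier.

```python
from collections import namedtuple

# Hypothetical stand-in for the named tuples produced by DataFrame.itertuples()
Row = namedtuple('Row', ['Index', 'sentence_id', 'sentence', 'beg', 'end', 'label'])

rows = [
    Row(0, '0000-000', 'EU rejects German call to boycott British lamb .', 0, 1, '--NME--'),
    Row(1, '0000-000', 'EU rejects German call to boycott British lamb .', 2, 3, 'Germany'),
]


def spans(rows):
    """Yield one (sentence id, start, end) triple per row."""
    for row in rows:
        # Named access is easier to read than row[1], row[3], row[4]
        yield (row.sentence_id, row.beg, row.end)


gen = spans(rows)
print(next(gen))  # ('0000-000', 0, 1)
print(list(gen))  # [('0000-000', 2, 3)]
```

A generator produces its values lazily: nothing is computed until `next()` is called or the generator is consumed, for example by `set()` or `list()`.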

In [35]:
import pandas as pd

In [36]:
#test
df = df_dev.head()
def testf(df):
    for row in df.head().itertuples():
        idx = [1,3,4]
        print(row)
        yield 1,2
        # yield 2
        # yield (row[1],row[3],row[4])
        # doc = nlp(row)
        # for ent in doc.ents:
        #     print(ent.text, ent.start_char, ent.end_char, ent.label_)
next(testf(df))

Pandas(Index=0, sentence_id='0946-000', sentence='CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .', beg=2, end=3, label='Leicestershire_County_Cricket_Club')


(1, 2)

In [37]:
def gold_spans(df):
    """Yield the gold-standard mention spans in a data frame.

    Arguments:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    # YOUR CODE HERE
    for row in df.itertuples():
        # Read the fields by name rather than by position
        yield (row.sentence_id, row.beg, row.end)

#### 🤞 Test your code

To test your code, you can run the following cell, which counts the spans yielded by your function when called on the development data (there should be 5,917 _unique_ triples), and checks if the first and last yielded triples are included in the results.

In [38]:
spans_dev_gold = set(gold_spans(df_dev))
assert len(spans_dev_gold) == 5917, "The number of unique returned triples is not correct."
assert ('0946-000', 2, 3) in spans_dev_gold, "The first expected triple is not included in the results."
assert ('1161-010', 1, 3) in spans_dev_gold, "The last expected triple is not included in the results."
success()

### Task 2.2

Your next task is to write code that calls spaCy to predict the named entities in the development data. You should do this in the form of a generator function that works the same way as `gold_spans()`, but yields the spans predicted by spaCy instead.

In [39]:
def pred_spans(df):
    """Yield the mention spans predicted by spaCy's NER.

    Arguments:
        df: A data frame.

    Yields:
        The predicted mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    # YOUR CODE HERE
    # A sentence occurs once per gold mention, so parse each sentence only once
    for row in df.drop_duplicates('sentence_id').itertuples():
        doc = nlp(row.sentence)
        for ent in doc.ents:
            # ent.start and ent.end are token offsets; ent.end is exclusive
            yield (row.sentence_id, ent.start, ent.end)

In [40]:
next(pred_spans(df.head()))

('0946-001', 0, 1)

#### 🤞 Test your code

The following cell runs the prediction and reports the evaluation measures. The expected precision is above 50%, with a recall above 70% and an F1-score around 60%.


In [293]:
spans_dev_pred = set(pred_spans(df_dev))
spans_dev_gold = set(gold_spans(df_dev))
scores = evaluation_scores(spans_dev_gold, spans_dev_pred)
print_evaluation_scores(scores)
assert scores[0] > .50, "Precision should be above 50%."
assert scores[1] > .70, "Recall should be above 70%."
success()

Precision: 0.519, Recall: 0.717, F1: 0.602


In [205]:
#test
tg = set(range(3))
tp = set(range(5))
df = df_dev.head()
spans_dev_pred = set(pred_spans(df_dev.head()))
spans_dev_gold = set(gold_spans(df_dev.head()))
[i for i in spans_dev_pred if i in spans_dev_gold]
print(df_dev.head())
# print(spans_dev_gold)
# print(spans_dev_pred)
print('-----')
for row in df.itertuples():
        senId = row[1]
        sen = row[2]
        doc = nlp(sen)
        print("###row",row)
        print((doc,row[3],row[4]))
        for ent in doc.ents:
            print("###ent",ent)
            print(type(ent),type(ent.text))
            text = ent[ent.start:ent.end]
            print("###ent.text",text)
            print ((senId, ent.start, ent.end))

  sentence_id                                           sentence  beg  end  \
0    0946-000  CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTE...    2    3   
1    0946-001                                  LONDON 1996-08-30    0    1   
2    0946-002  West Indian all-rounder Phil Simmons took four...    0    2   
3    0946-002  West Indian all-rounder Phil Simmons took four...    3    5   
4    0946-002  West Indian all-rounder Phil Simmons took four...   12   13   

                                label  
0  Leicestershire_County_Cricket_Club  
1                              London  
2            West_Indies_cricket_team  
3                        Phil_Simmons  
4  Leicestershire_County_Cricket_Club  
-----
###row Pandas(Index=0, sentence_id='0946-000', sentence='CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .', beg=2, end=3, label='Leicestershire_County_Cricket_Club')
(CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY ., 2, 3)
###row Pandas(Index=1, sente

## Problem 3: Error analysis

As you were able to see in Problem&nbsp;2, the span accuracy of the named entity recognizer is far from perfect. In particular, only slightly more than half of the predicted spans are correct according to the gold standard. Your next task is to analyse this result in more detail.

Here is a function that prints the false positive and false negative spans for a data frame, given a reference set of gold-standard spans and a candidate set of predicted spans.

In [44]:
from collections import defaultdict

def error_report(df, spans_gold, spans_pred):
    false_pos = defaultdict(list)
    for s, b, e in spans_pred - spans_gold:
        false_pos[s].append((b, e))
    false_neg = defaultdict(list)
    for s, b, e in spans_gold - spans_pred:
        false_neg[s].append((b, e))
    for row in df.drop_duplicates('sentence_id').itertuples():
        if row.sentence_id in false_pos or row.sentence_id in false_neg:
            print('Sentence:', row.sentence)
            for b, e in false_pos[row.sentence_id]:
                print('  FP:', ' '.join(row.sentence.split()[b:e]))
            for b, e in false_neg[row.sentence_id]:
                print('  FN:', ' '.join(row.sentence.split()[b:e]))
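
The core of `error_report()` is two set differences. The sketch below shows them on a small, made-up pair of span sets (the sentence ids `s1` and `s2` are hypothetical) and groups the results by sentence in the same way.

```python
from collections import defaultdict

# Hypothetical gold-standard and predicted spans: (sentence id, start, end)
gold = {('s1', 1, 3), ('s1', 5, 6), ('s2', 0, 1)}
pred = {('s1', 0, 3), ('s1', 5, 6), ('s2', 0, 1), ('s2', 3, 4)}

false_pos = pred - gold  # predicted, but not in the gold standard
false_neg = gold - pred  # in the gold standard, but not predicted


def group_by_sentence(spans):
    """Collect the (start, end) pairs of the spans under their sentence id."""
    grouped = defaultdict(list)
    for sid, beg, end in sorted(spans):
        grouped[sid].append((beg, end))
    return dict(grouped)


print(group_by_sentence(false_pos))  # {'s1': [(0, 3)], 's2': [(3, 4)]}
print(group_by_sentence(false_neg))  # {'s1': [(1, 3)]}
```

Note that the predicted span `(0, 3)` in `s1` overlaps the gold span `(1, 3)` but does not match it exactly, so this one boundary error shows up as a false positive and as a false negative.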

### Task 3.1

1. Use the `error_report()` function above to inspect and analyse the errors that the automated prediction makes. Base your analysis on the first 500 rows of the training data. Can you see any patterns?
2. Summarize your observations in a short text.

In [91]:
df = df_train[:500]
spans_train_pred = set(pred_spans(df))
spans_train_gold = set(gold_spans(df))

error_report(df,spans_train_gold,spans_train_pred)

# YOUR CODE HERE
# raise NotImplementedError()

Sentence: BRUSSELS 1996-08-22
  FP: 1996-08-22
Sentence: The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .
  FP: The European Commission
  FP: Thursday
  FN: European Commission
Sentence: Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .
  FP: Wednesday
  FP: the European Union 's
  FN: European Union
Sentence: " We do n't support any such recommendation because we do n't see any grounds for it , " the Commission 's chief spokesman Nikolaus van der Pas told a news briefing .
  FP: Nikolaus van der
  FP: Pas
  FN: Nikolaus van der Pas
Sentence: He said further scientific study was required and if it was found that action was needed it should be taken by the European Union .
  FP: the Europe

YOUR ANSWER HERE

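The predicted spans often contain more tokens than the gold-standard spans, and sometimes fewer. For example, spaCy predicts &lsquo;The European Commission&rsquo; where the gold standard has &lsquo;European Commission&rsquo;, and it splits &lsquo;Nikolaus van der Pas&rsquo; into two spans. Such predictions look close to correct to a human reader, but each of them counts as both a false positive and a false negative, which lowers precision and recall at the same time. The recognizer also does not handle numbers and locations well; for instance, it marks dates and weekdays such as &lsquo;1996-08-22&rsquo; and &lsquo;Thursday&rsquo; as entities, which the gold standard does not annotate.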

### Task 3.2

Now, use the insights from your error analysis to improve the automated prediction that you implemented in Problem&nbsp;2. While the best way to do this would be to [update spaCy&rsquo;s NER model](https://spacy.io/usage/linguistic-features#updating) using domain-specific training data, for this lab it suffices to write code to post-process the output produced by spaCy. To filter out specific labels it is useful to know the named entity label scheme, which can be found in the [model's documentation](https://spacy.io/models/en#en_core_web_sm). You should be able to improve the F1 score from Problem&nbsp;2 by at least 15 percentage points.
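
One kind of post-processing is to adjust span boundaries. The sketch below trims a leading article and a trailing possessive marker from a token span. The helper `trim_span` is a hypothetical name, it works on plain token lists rather than on spaCy objects, and it is only one possible rule, not a prescribed solution.

```python
def trim_span(tokens, beg, end):
    """Drop a leading 'the' and a trailing "'s" from the span [beg, end)."""
    # Only trim if at least one token remains in the span afterwards
    if end - beg > 1 and tokens[beg].lower() == 'the':
        beg += 1
    if end - beg > 1 and tokens[end - 1] == "'s":
        end -= 1
    return beg, end


tokens = "Germany 's representative to the European Union 's veterinary committee".split()
print(tokens[4:8])               # ['the', 'European', 'Union', "'s"]
print(trim_span(tokens, 4, 8))   # (5, 7)
print(tokens[5:7])               # ['European', 'Union']
```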

In [62]:
#test
df = df_dev.head()
for row in df.itertuples():
    senId = row[1]
    sen = row[2]
    doc = nlp(sen)
    
    for ent in doc.ents:
        # if 
        print(ent.text)
        # yield (senId, ent.start, ent.end) 
t = "123"


LONDON
1996-08-30
West Indian
Phil Simmons
four
38
Friday
Leicestershire
39
two days
West Indian
Phil Simmons
four
38
Friday
Leicestershire
39
two days
West Indian
Phil Simmons
four
38
Friday
Leicestershire
39
two days


True

In [303]:
import re
import calendar

def pred_spans_improved(df):
    """Yield the mention spans predicted by spaCy's NER, with post-processing.

    Arguments:
        df: A data frame.

    Yields:
        The predicted mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    # YOUR CODE HERE
    # Lower-cased substrings that indicate dates, durations, or quantities
    days = [day.lower() for day in calendar.day_name]
    months = [mon.lower() for mon in list(calendar.month_name)[1:]]
    words = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
             'nine', 'ten', 'year', 'month', 'week']

    # A sentence occurs once per gold mention, so parse each sentence only once
    for row in df.drop_duplicates('sentence_id').itertuples():
        doc = nlp(row.sentence)

        for ent in doc.ents:
            text = ent.text.lower()
            # Skip entities that contain a digit or a hyphen
            if re.search(r"\d", text) or "-" in text:
                continue
            # Skip weekdays, months, spelled-out numbers, and time units
            if any(word in text for word in days + months + words):
                continue
            start = ent.start
            end = ent.end
            # Trim a trailing possessive marker and a leading article
            if ent.text.endswith("'s"):
                end -= 1
            if text.startswith("the "):
                start += 1
            yield (row.sentence_id, start, end)

In [291]:
#test
t = "European 123Union may 's"
tre =   re.search("\d",t)

tre is not None
tre
t.isdigit()
bool(tre)

"january" in "qwe january qwe"

# import calendar
# any([i.lower() in t for i in list(calendar.month_name)[1:]])
numbers = ['one','two','three','four','five','six','seven','eight','nine','ten']
any([num in ent.text.lower() for num in numbers])
# set(pred_spans_improved(df_dev.head()))
# evaluation_scores(spans_dev_gold, set(pred_spans_improved(df_dev)))
# list(calendar.day_name)


True

In [304]:
#test
df = df_train[:500]
spans_t_pred = set(pred_spans_improved(df))
spans_t_gold = set(gold_spans(df))
print(evaluation_scores(spans_t_gold, spans_t_pred))
error_report(df,spans_t_gold,spans_t_pred)

[0.9049676025917927, 0.838, 0.8701973001038422]
Sentence: " We do n't support any such recommendation because we do n't see any grounds for it , " the Commission 's chief spokesman Nikolaus van der Pas told a news briefing .
  FP: Nikolaus van der
  FP: Pas
  FN: Nikolaus van der Pas
Sentence: He said a proposal last month by EU Farm Commissioner Franz Fischler to ban sheep brains , spleens and spinal cords from the human and animal food chains was a highly specific and precautionary move to protect human health .
  FP: EU Farm
  FN: EU
Sentence: Fischler proposed EU-wide measures after reports from Britain and France that under laboratory conditions sheep could contract Bovine Spongiform Encephalopathy ( BSE ) -- mad cow disease .
  FN: EU-wide
Sentence: " What we have to be extremely careful of is how other countries are going to take Germany 's lead , " Welsh National Farmers ' Union ( NFU ) chairman John Lloyd Jones said on BBC radio .
  FP: BBC
  FN: BBC radio
  FN: NFU
  FN: Wels

#### 🤞 Test your code

The following cell reports the evaluation measures from the new function and tests if you achieve the performance goal.

In [305]:
scores_old = evaluation_scores(spans_dev_gold, spans_dev_pred)
scores_new = evaluation_scores(spans_dev_gold, set(pred_spans_improved(df_dev)))
print_evaluation_scores(scores_new)
assert scores_new[-1] - scores_old[-1] > .15, "F1-score should improve by at least 15 percentage points."
success()

Precision: 0.879, Recall: 0.729, F1: 0.797


### Task 3.3

Before moving on, we ask you to store the outputs of the improved named entity recognizer on the development data in a new data frame. This new frame should have the same layout as the original data frame for the development data that you loaded above, but should contain the *predicted* start and end positions for each token span, rather than the gold positions. As the `label` of each span, you can use the special value `--NME--`.

In [191]:
#test
# t = df_dev.loc[df_dev['sentence_id'] == '0946-000'].iloc[:1,]
# t2 = ('1072-007', 13, 19)
# t.iloc[:1,2:] = t2[1:]+('test',)
# tt = t.copy()
# tt
# tdf = pd.DataFrame()
# tdf = pd.concat([tdf,t])
# pd.concat([tdf,t])
pred = spans_dev_pred
res_df = pd.DataFrame()
# for t in pred:
#     df_old = df.loc[df_dev['sentence_id'] == t[0]].iloc[:1,]
#     df_new = df_old.copy()
#     df_new[:1,2:] = t[1:]+('--NME--',)
#     res_df = pd.concat([res_df, df_new])
df = df_dev
list(pred)[0]
df_old = df.loc[df_dev['sentence_id'] == t[0]].iloc[:1,]
df_new = df_old.copy()
df_new.iloc[:1,2:] = t[1:]+('--NME--',)


Unnamed: 0,beg,end,label
3752,1,3,André_Joubert


In [192]:
def df_with_pred_spans(df):
    """Make a new DataFrame with *predicted* NER spans.

    Arguments:
        df: A data frame.

    Returns:
        A *new* data frame with the same layout as `df`, but containing
        the predicted start and end positions for each token span.
    """
    # YOUR CODE HERE
    frames = []
    for sid, beg, end in set(pred_spans_improved(df)):
        # Copy the first row of the sentence and overwrite its span and label
        first_row = df.loc[df['sentence_id'] == sid].iloc[:1]
        frames.append(first_row.assign(beg=beg, end=end, label='--NME--'))
    # Concatenate once at the end; calling pd.concat in the loop copies the frame each time
    return pd.concat(frames) if frames else df.iloc[:0].copy()

#### 🤞 Test your code

Run the following cell to run your function and display the first few lines of the new data frame:

In [193]:
df_dev_pred = df_with_pred_spans(df_dev)
display(df_dev_pred.head())

| | sentence_id | sentence | beg | end | label |
|---|---|---|---|---|---|
| 4475 | 1099-018 | Scotland : Andrew Goram , Craig Burley , Thoma... | 34 | 36 | --NME-- |
| 852 | 0966-095 | 6. Phylis Smith ( Britain ) 52.05 | 4 | 5 | --NME-- |
| 2850 | 1051-011 | 292 Iain Pyman 71 75 75 71 , David Gilford 69 ... | 18 | 19 | --NME-- |
| 1203 | 0978-000 | Jones Medical completes acquisition . | 0 | 2 | --NME-- |
| 4738 | 1112-008 | Coach Berti Vogts has called up a virtually id... | 1 | 3 | --NME-- |


## Problem 4: Entity linking

Now that we have a method for predicting mention spans, we turn to the task of **entity linking**, which amounts to predicting the knowledge base entity that is referenced by a given mention. In our case, for each span we want to predict the Wikipedia page that this mention references.

### Task 4.1

Start by extending the generator function that you implemented in Task&nbsp;2.1 to labelled spans.

In [194]:
#test
for row in df.head().itertuples():
    print(row)

Pandas(Index=0, sentence_id='0946-000', sentence='CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .', beg=2, end=3, label='Leicestershire_County_Cricket_Club')
Pandas(Index=1, sentence_id='0946-001', sentence='LONDON 1996-08-30', beg=0, end=1, label='London')
Pandas(Index=2, sentence_id='0946-002', sentence='West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .', beg=0, end=2, label='West_Indies_cricket_team')
Pandas(Index=3, sentence_id='0946-002', sentence='West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .', beg=3, end=5, label='Phil_Simmons')
Pandas(Index=4, sentence_id='0946-002', sentence='West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings 

In [195]:
def gold_mentions(df):
    """Yield the gold-standard mentions in a data frame.

    Args:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and entity label of each span.
    """
    # YOUR CODE HERE
    for row in df.itertuples():
        # Same as gold_spans(), with the entity label as a fourth component
        yield (row.sentence_id, row.beg, row.end, row.label)

#### 🤞 Test your code

To test your code, you can run the following cell, which counts the spans yielded by your function when called on the development data (there should still be 5,917 unique tuples, just as in Task 2.1), and checks if some expected tuples are included in the results.

In [196]:
mentions_dev_gold = set(gold_mentions(df_dev))
assert len(mentions_dev_gold) == 5917, "The number of unique returned quadruples should be the same as before."
assert ('0966-159', 1, 3, '--NME--') in mentions_dev_gold, "An expected tuple is not included in the results."
assert ('1094-020', 0, 1, 'Seattle_Mariners') in mentions_dev_gold, "An expected tuple is not included in the results."
success()

### Task 4.2

A naive baseline for entity linking on our data set is to link each mention span to the Wikipedia page name that we get when we join the tokens in the span by underscores, as is standard in Wikipedia page names. Suppose, for example, that a span contains the two tokens

    Jimi Hendrix

The baseline Wikipedia page name for this span would be

    Jimi_Hendrix

Implement this naive baseline and evaluate its performance.

**Here and in the remainder of this lab, you should base your entity predictions on the predicted spans that you computed in Problem&nbsp;3.**
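
The label construction itself is a single `join` over the tokens of the span. The following sketch uses a hypothetical helper `baseline_label` and two short inputs to show it, including one case where the result does not match the gold label.

```python
def baseline_label(sentence, beg, end):
    """Join the tokens in the span [beg, end) with underscores."""
    return '_'.join(sentence.split()[beg:end])


print(baseline_label('Jimi Hendrix', 0, 2))       # Jimi_Hendrix
print(baseline_label('LONDON 1996-08-30', 0, 1))  # LONDON
```

The second call illustrates one weakness of the baseline: the gold label for this mention in the development data is `London`, which the all-caps token does not match.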

In [284]:
def baseline(df):
    """A naive baseline for entity linking that "predicts" Wikipedia
       page names from the tokens in the mention span.

    Arguments:
        df: A data frame.

    Yields:
        The predicted mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and the predicted entity label of each span.
    """
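    # YOUR CODE HERE
    # Map each sentence id to its text once, instead of filtering the frame per span
    sentences = dict(zip(df['sentence_id'], df['sentence']))
    for sid, beg, end in set(pred_spans_improved(df)):
        # Join the tokens of the span with underscores, as in Wikipedia page names
        label = '_'.join(sentences[sid].split()[beg:end])
        yield (sid, beg, end, label)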

#### 🤞 Test your code

Again, we can turn to the evaluation measures that we implemented in Problem&nbsp;1.  The expected precision should be around 32%, with an F1-score around 29%.

In [286]:
#test
df= df_dev.head()
# print(df)
set(baseline(df))
# rowIdx = (df['sentence_id']== '0946-002')

# t = df.loc[df['sentence_id']== '0946-002','sentence'].iloc[0]
# t.split()[0:2]
# for row in df.itertuples():
#     print(((row[1],row[3],row[4],row[5])))
#     for ent in doc.ents:
#         print('_'.join(str(ent).split()))

{('0946-001', 0, 1, 'LONDON'),
 ('0946-002', 0, 2, 'West_Indian'),
 ('0946-002', 3, 5, 'Phil_Simmons'),
 ('0946-002', 6, 7, 'four'),
 ('0946-002', 12, 13, 'Leicestershire'),
 ('0946-002', 22, 24, 'two_days')}

In [306]:
scores = evaluation_scores(mentions_dev_gold, set(baseline(df_dev_pred)))
print_evaluation_scores(scores)
assert .31 < scores[0] < .32, "Precision should be between 31% and 32%."
assert .28 < scores[-1] < .29, "F1-score should be between 28% and 29%."
# maybe too strict, implementation of pred_spans_improved() is different
success()

Precision: 0.326, Recall: 0.270, F1: 0.296


AssertionError: Precision should be between 31% and 32%.

## Problem 5: Extending the training data using the knowledge base

State-of-the-art approaches to entity linking exploit information in knowledge bases. In our case, where Wikipedia is the knowledge base, one particularly useful type of information is the links to other Wikipedia pages. In particular, we can interpret the anchor texts (the highlighted texts that you click on) as mentions of the entities (pages) that they link to. This allows us to harvest long lists of mention–entity pairings.

The following cell loads a data frame summarizing anchor texts and page references harvested from the first paragraphs of the English Wikipedia. The data frame also contains all entity mentions in the training data (but not the development or the test data).

In [307]:
with bz2.open('kb.tsv.bz2', 'rt') as source:
    df_kb = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

To understand what information is available in this data, the following cell shows the entry for the anchor text `Sweden`.

In [308]:
df_kb.loc[df_kb.mention == 'Sweden']

| | mention | entity | prob |
|---|---|---|---|
| 17436 | Sweden | Sweden | 0.985768 |
| 17437 | Sweden | Sweden_national_football_team | 0.014173 |
| 17438 | Sweden | Sweden_men's_national_ice_hockey_team | 5.9e-05 |


As you can see, each row of the data frame contains a pair $(m, e)$ of a mention $m$ and an entity $e$, as well as the conditional probability $P(e|m)$ for mention $m$ referring to entity $e$. These probabilities were estimated based on the frequencies of mention–entity pairs in the knowledge base. The example shows that the anchor text &lsquo;Sweden&rsquo; is most often used to refer to the entity [Sweden](http://en.wikipedia.org/wiki/Sweden), but in a few cases also to refer to Sweden&rsquo;s national football and ice hockey teams. Note that references are sorted in decreasing order of probability, so that the most probable pairing comes first.

Implement an entity linking method that resolves each mention to the most probable entity in the data frame. If the mention is not included in the data frame, you can predict the generic label `--NME--`.
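
To make the probability estimates concrete, the sketch below computes $P(e|m)$ as a relative frequency from a handful of made-up anchor-text counts and then picks the most probable entity for a mention. The counts and the helper `most_probable` are hypothetical; the real knowledge base already stores the probabilities, so no counting is needed in the lab itself.

```python
from collections import Counter

# Hypothetical counts of (mention, entity) pairs harvested from anchor texts
pair_counts = Counter({
    ('Sweden', 'Sweden'): 90,
    ('Sweden', 'Sweden_national_football_team'): 10,
    ('Lincoln', 'Lincoln,_Nebraska'): 3,
})

# Total number of occurrences of each mention
mention_totals = Counter()
for (mention, entity), n in pair_counts.items():
    mention_totals[mention] += n

# Relative frequency estimate of P(e|m)
prob = {(m, e): n / mention_totals[m] for (m, e), n in pair_counts.items()}


def most_probable(mention):
    """Return the entity with the highest P(e|m), or '--NME--' for unknown mentions."""
    candidates = [(p, e) for (m, e), p in prob.items() if m == mention]
    if not candidates:
        return '--NME--'
    return max(candidates)[1]


print(prob[('Sweden', 'Sweden')])  # 0.9
print(most_probable('Sweden'))     # Sweden
print(most_probable('Uppsala'))    # --NME--
```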

In [332]:
#test
df = df_dev.head()
pred = set(pred_spans_improved(df))
for t in pred:
    id,beg,end = t
    sentence = df.loc[df['sentence_id']== id,'sentence'].iloc[0]
    mention = ' '.join(sentence.split()[beg:end])
    try:
        mention = 'qweqweqwe'
        label = df_kb.loc[df_kb.mention == mention, 'entity'].iloc[0]
        print(t+(label,))
    except:
        print('oops')
    
    # label = '_'.join(sentence.split()[beg:end])
    # yield t+(label,)


oops
oops
oops
oops


In [333]:
def most_probable_method(df, df_kb):
    """An entity linker that resolves each mention to the most probable entity in a knowledge base.

    Arguments:
        df: A data frame containing the mention spans.
        df_kb: A data frame containing the knowledge base.

    Yields:
        The predicted mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and the predicted entity label of each span.
    """
    # YOUR CODE HERE
    # The knowledge base is sorted by decreasing probability, so the first
    # row for each mention holds its most probable entity
    best = df_kb.drop_duplicates('mention').set_index('mention')['entity'].to_dict()
    # Map each sentence id to its text once, instead of filtering the frame per span
    sentences = dict(zip(df['sentence_id'], df['sentence']))
    for sid, beg, end in set(pred_spans_improved(df)):
        mention = ' '.join(sentences[sid].split()[beg:end])
        # Fall back to the generic label for mentions missing from the knowledge base
        yield (sid, beg, end, best.get(mention, '--NME--'))

### 🤞 Test your code

We run the same evaluation as before. The expected precision should now be above 65%, with an F1-score just around 60%.

In [335]:
scores = evaluation_scores(mentions_dev_gold, set(most_probable_method(df_dev_pred, df_kb)))
print_evaluation_scores(scores)
assert scores[0] > .65, "Precision should be above 65%."
assert .59 < scores[-1] < .61, "F1-score should be around 60%."
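# As in Problem 4, this range may be too strict: the scores depend on the implementation of pred_spans_improved()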
success()

Precision: 0.675, Recall: 0.560, F1: 0.612


AssertionError: F1-score should be around 60%.

## Problem 6: Context-sensitive disambiguation

Consider the entity mention &lsquo;Lincoln&rsquo;. The most probable entity for this mention turns out to be [Lincoln, Nebraska](http://en.wikipedia.org/Lincoln,_Nebraska); but in pages about American history, we would be better off predicting [Abraham Lincoln](http://en.wikipedia.org/Abraham_Lincoln). This suggests that we should try to disambiguate between different entity references based on the textual context on the page from which the mention was taken. Your task in this last problem is to implement this idea.

Set up a dictionary that contains, for each mention $m$ that can refer to more than one entity $e$, a separate Naive Bayes classifier that is trained to predict the correct entity $e$, given the textual context of the mention. As the prior probabilities of the classifier, choose the probabilities $P(e|m)$ that you used in Problem&nbsp;5. To let you estimate the context-specific probabilities, we have compiled a data set with mention contexts:

In [339]:
with bz2.open('contexts.tsv.bz2') as source:
    df_contexts = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

This data frame contains, for each ambiguous mention $m$ and each knowledge base entity $e$ to which this mention can refer, up to 100 randomly selected contexts in which $m$ is used to refer to $e$. For this data, a **context** is defined as the 5 tokens to the left and the 5 tokens to the right of the mention. Here are a few examples:

In [337]:
df_contexts.head()

| | mention | entity | context |
|---|---|---|---|
| 0 | 1970 | UEFA_Champions_League | Cup twice the first in @ and the second in 1983 |
| 1 | 1970 | FIFA_World_Cup | America 1975 and during the @ and 1978 World C... |
| 2 | 1990 World Cup | 1990_FIFA_World_Cup | Manolo represented Spain at the @ |
| 3 | 1990 World Cup | 1990_FIFA_World_Cup | Hašek represented Czechoslovakia at the @ and ... |
| 4 | 1990 World Cup | 1990_FIFA_World_Cup | renovations in 1989 for the @ The present capa... |


Note that, in each context, the position of the mention is indicated by the `@` symbol.
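
To make the context format concrete, here is a minimal sketch of how such a context string could be built from a tokenized sentence and a mention span. The helper name `make_context`, its `width` parameter, and the example sentence are assumptions made for illustration; they are not part of the lab code.

```python
def make_context(tokens, beg, end, width=5):
    """Return up to `width` tokens on each side of tokens[beg:end],
    with the mention itself replaced by the '@' symbol."""
    left = tokens[max(0, beg - width):beg]
    right = tokens[end:end + width]
    return ' '.join(left + ['@'] + right)


# Made-up example sentence; the mention '1990 World Cup' spans tokens 5 to 8
tokens = 'Manolo represented Spain at the 1990 World Cup in Italy'.split()
print(make_context(tokens, 5, 8))
```

The span uses the same half-open `[beg:end]` convention as the mention spans elsewhere in this notebook, so a context for a predicted span can be built directly from its start and end positions.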

From this data frame, it is easy to select the data that you need to train the classifiers – the contexts and corresponding entities for all mentions. To illustrate this, the following cell shows how to select all contexts that belong to the mention &lsquo;Lincoln&rsquo;:

In [338]:
df_contexts.context[df_contexts.mention == 'Lincoln']

41465    Nebraska Concealed Handgun Permit In @ municip...
41466    Lazlo restaurants are located in @ and Omaha C...
41467    California Washington Overland Park Kansas @ N...
41468    City Missouri Omaha Nebraska and @ Nebraska It...
41469    by Sandhills Publishing Company in @ Nebraska USA
                               ...                        
41609                                      @ Leyton Orient
41610                    English division three Swansea @ 
41611    league membership narrowly edging out @ on goa...
41612                                          @ Cambridge
41613                                                   @ 
Name: context, Length: 149, dtype: object

Implement the context-sensitive disambiguation method and evaluate its performance. Do this in two parts: first implement a function that builds the classifiers _(refer to the text above for a detailed description)_, then implement a prediction function that uses these classifiers to perform the entity prediction.

Here are some more **hints** that may help you along the way:

1. The prior probabilities for a Naive Bayes classifier can be specified using the `class_prior` option. You will have to provide the probabilities in the same order as the alphabetically sorted class (entity) names.

2. Not all mentions in the knowledge base are ambiguous, and therefore not all mentions have context data. If a mention has only one possible entity, pick that one. If a mention has no entity at all, predict the `--NME--` label.
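
The first hint can be illustrated with a small sketch for a single ambiguous mention. The contexts and the probabilities in `p_e_given_m` below are made up for illustration; in the lab, the priors would come from the knowledge base. The point is only that the prior vector has to follow the sorted order of the class names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training contexts for one mention (made-up data)
contexts = [
    'the capital city @ Nebraska',
    'located in @ Nebraska USA',
    'president @ signed the proclamation',
    'speech by president @',
]
entities = [
    'Lincoln,_Nebraska',
    'Lincoln,_Nebraska',
    'Abraham_Lincoln',
    'Abraham_Lincoln',
]

# Hypothetical values of P(e|m) for this mention
p_e_given_m = {'Lincoln,_Nebraska': 0.7, 'Abraham_Lincoln': 0.3}

# scikit-learn stores the classes in sorted order, so the priors
# must be listed in that same order
classes = sorted(set(entities))
prior = np.array([p_e_given_m[e] for e in classes])

clf = make_pipeline(CountVectorizer(), MultinomialNB(class_prior=prior))
clf.fit(contexts, entities)

nb = clf.named_steps['multinomialnb']
print(list(nb.classes_))
print(clf.predict(['the president @ gave a speech']))
```

Comparing `nb.classes_` with `classes` is a cheap sanity check that each prior ended up attached to the intended entity.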

In [356]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline

# Scratch experiment: a single global classifier over all contexts, with class
# priors taken from the entity frequencies (groupby sorts the entities alphabetically)
count = df_contexts.groupby('entity')['context'].count()
prior = np.asarray(count / count.sum())
clfMNB = Pipeline([
    ('Vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB(class_prior=prior)),
])
clfMNB.fit(df_contexts['context'], df_contexts['entity'])
# print(classification_report(df_contexts["entity"], clfMNB.predict(df_contexts["context"])))

In [359]:
clfMNB.predict(df_kb["context"])

0        UEFA_Champions_League
1               FIFA_World_Cup
2          1990_FIFA_World_Cup
3          1990_FIFA_World_Cup
4          1990_FIFA_World_Cup
                 ...          
86029            Western_world
86030            Western_world
86031            North_America
86032            West_Virginia
86033        Western_Australia
Name: entity, Length: 86034, dtype: object

In [None]:

def build_entity_classifiers(df_kb, df_contexts):
    """Build Naive Bayes classifiers for entity prediction.

    Arguments:
        df_kb: A data frame with the knowledge base.
        df_contexts: A data frame with contexts for each mention.

    Returns:
        A dictionary where the keys are mentions and the values are Naive Bayes
        classifiers trained to predict the correct entity, given the textual
        context of the mention (as described in detail above).
    """
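    # NOTE: assumes that `df_kb` has a `prob` column holding P(e|m) from Problem 5.
    kb_prob = dict(zip(zip(df_kb['mention'], df_kb['entity']), df_kb['prob']))
    classifiers = {}
    for mention, group in df_contexts.groupby('mention'):
        # scikit-learn orders the classes alphabetically, so sort the entities
        entities = sorted(group['entity'].unique())
        if len(entities) < 2:
            continue  # only one entity in the context data; use the KB instead
        prior = np.array([kb_prob.get((mention, e), 0.0) for e in entities])
        # If any prior is missing, let the classifier estimate priors from the data
        prior = prior / prior.sum() if prior.all() else None
        clf = make_pipeline(CountVectorizer(), MultinomialNB(class_prior=prior))
        try:
            classifiers[mention] = clf.fit(group['context'], group['entity'])
        except ValueError:
            pass  # e.g. contexts without usable tokens; use the KB instead
    return classifiers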

    # YOUR CODE HERE
    # raise NotImplementedError()

In [None]:
def extended_dictionary_method(df, classifiers, df_kb):
    """An entity linker that resolves each mention to the most probable entity in a knowledge base.

    Arguments:
        df: A data frame containing the mention spans.
        classifiers: A dictionary of classifiers as produced by the
            `build_entity_classifiers` function.
        df_kb: A data frame with the knowledge base. (Should be used
            to look up a mention if it doesn't have a classifier.)

    Yields:
        The predicted mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and the predicted entity label of each span.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
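
The per-mention decision behind this function can be sketched as follows: use the classifier if the mention has one, otherwise fall back to the knowledge base, and predict `--NME--` if the mention is unknown. The helper `link_mention`, the `most_probable` dictionary, and the toy data are assumptions made for illustration; they are not part of the lab code, and building the context string for each span is left out.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def link_mention(mention, context, classifiers, most_probable):
    """Pick an entity for a single mention (hypothetical helper)."""
    if mention in classifiers:
        # Ambiguous mention: let the context decide
        return classifiers[mention].predict([context])[0]
    # Unambiguous or unknown mention: fall back to the knowledge base
    return most_probable.get(mention, '--NME--')


# Toy classifier for one ambiguous mention (made-up training data)
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(
    ['city in @ Nebraska', 'president @ signed'],
    ['Lincoln,_Nebraska', 'Abraham_Lincoln'],
)
classifiers = {'Lincoln': clf}

# Toy stand-in for the most probable entity per mention in the knowledge base
most_probable = {'Omaha': 'Omaha,_Nebraska'}

print(link_mention('Lincoln', 'the president @ spoke', classifiers, most_probable))
print(link_mention('Omaha', 'flew to @ yesterday', classifiers, most_probable))
print(link_mention('Gotham', 'arrived in @ today', classifiers, most_probable))
```

Keeping the decision for a single mention in its own helper makes the fallback order easy to test separately from the span handling.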

### 🤞 Test your code

The cell below shows how your functions should all come together.

In [None]:
classifiers = build_entity_classifiers(df_kb, df_contexts)
mentions_dev_pred_dictionary = set(extended_dictionary_method(df_dev_pred, classifiers, df_kb))

Finally, the cell below evaluates the results as before. You should expect to see a small increase (around 1&nbsp;percentage point) in each of precision, recall, and F1.

In [None]:
scores = evaluation_scores(mentions_dev_gold, mentions_dev_pred_dictionary)
print_evaluation_scores(scores)
assert scores[0] > .67, "Precision should be above 67%."
assert scores[-1] > .61, "F1-score should be above 61%."
success()

**Congratulations on finishing this lab! 👍**

<div class="alert alert-info">
    
➡️ Don't forget to **test that everything runs as expected** before you submit!

</div>