In [1]:
from IPython.core.display import HTML

HTML("<style>" + open("style.css").read() + "</style>")

<div class="headline">
Language Technology / Sprachtechnologie
<br><br>
Winter semester 2019/2020
</div>
<br>
<div class="description">
    Exercise on the topic <i id="topic">"Named Entity / Coreference"</i>
    <br><br>
    Submission deadline: <i #id="submission">Thursday, 14.11.2019 (23:55)</i>
</div>

# In-class exercise

In [None]:
import nltk
from nltk.probability import FreqDist
from nltk.probability import ConditionalFreqDist
from nltk.corpus import *
from nltk.book import *
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *
from nltk.tag import pos_tag
from nltk.corpus import gazetteers, names

from sklearn import datasets, svm, tree, metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd 

### Warm Up

<div class="task_description">
    <i class="task">Task 4.1:</i> Named Entity Recognition: <br>
</div>

Which of the following statements are true?

1. The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities.
2. Named entity recognition is a method to extract person names from text.
3. Named entities are language independent.
4. In named entity recognition we need to be able to identify the beginning and the end of multi-token sequences.

### Using a Named Entities Classifier

A type of noun phrase that is of particular interest is the named entity. This might be a person, such as Albert Einstein, a place, such as Duisburg, or a business, such as Irish Pub. <br>
Recognizing named entities in general is a hard problem: words can have multiple uses, and there is an unbounded number of possible names. Within a single domain, the task becomes more tractable. NLTK provides a classifier that has already been trained to recognize named entities; it is accessed with the function nltk.ne_chunk(). <br>
The table below lists the commonly used named entity types as provided by NLTK:

| NE Type | Examples |
|------|------|
| ORGANIZATION | Georgia-Pacific Corp., WHO |
| PERSON | Eddy Bonte, President Obama |
| LOCATION | Murray River, Mount Everest |
| DATE | June, 2008-06-29 |
| TIME | two fifty a m, 1:30 p.m. |
| MONEY | 175 million Canadian Dollars, GBP 10.40 |
| PERCENT | twenty pct, 18.75% |
| FACILITY | Washington Monument, Stonehenge |
| GPE (geo-political entities) | South East Asia, Midlothian |


<div class="task_description">
    <i class="task">Task 4.2:</i> <br>
</div>

Use the sentence below: <br><br>
The capital of the United States of America is named after the first US president George Washington.

<div class="task_description">
   <i class="subtask">4.2.1</i> <i class="l2">L2</i> <br>
</div>

Use word_tokenize to tokenize the sample.

<div class="task_description">
   <i class="subtask">4.2.2</i> <i class="l2">L2</i> <br>
</div>

Use nltk.pos_tag to tag the sentence.

<div class="task_description">
   <i class="subtask">4.2.3</i> <i class="l2">L2</i> <br>
</div>

Use nltk.ne_chunk to chunk the tagged sentence. Experiment with the argument "binary". What is the difference?

<div class="task_description">
   <i class="subtask">4.2.4</i> <i class="l2">L2</i> <br>
</div>

Draw the resulting tree with .draw() and analyze its structure.

<div class="task_description">
   <i class="subtask">4.2.5</i> <i class="l3">L3</i> <br>
</div>

Write a function extract_entity_names(tree) that extracts all named entities identified in the given tree and returns them as a list of words.<br>
Since 'tree' is a nested structure, implement this function using recursion, which is the standard way to traverse a tree. The listing below defines an algorithm that traverses a tree. You may change it to fit your purpose.

In [None]:
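def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t has no label() method, so it is a leaf (a token), not a subtree
        print(t, end=" ")
    else:
        # Now we know that t.label() is defined, so t is a subtree
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

# "tree" is expected to be the chunk tree produced in 4.2.3
traverse(tree)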
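
The same recursive pattern can collect values instead of printing them. The sketch below is only an illustration under simplifying assumptions: it does not operate on an NLTK tree but on a hypothetical toy structure in which a subtree is a (label, children) tuple and a leaf is a plain string. The names collect_leaves and toy_tree are made up for this example.

```python
def collect_leaves(node):
    """Return all leaves below node as a flat list (depth-first, left to right)."""
    # Base case: in this toy representation a leaf is a plain string.
    if isinstance(node, str):
        return [node]
    # Recursive case: a subtree is a (label, children) tuple.
    _label, children = node
    leaves = []
    for child in children:
        # Each recursive call returns a list, which is merged into the result.
        leaves.extend(collect_leaves(child))
    return leaves


# Hypothetical toy tree, NOT the output of nltk.ne_chunk
toy_tree = ("S", ["the", ("NE", ["Irish", "Pub"]), "in", ("NE", ["Duisburg"])])
print(collect_leaves(toy_tree))
```

The key design point is that every call returns a list, so the caller can combine the results of its children; a function that only prints cannot be reused this way.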

### Precision, Recall, F-Score

<div class="task_description">
    <i class="task">Task 4.3:</i> <br>
</div>

The following confusion matrix shows the evaluation result of a named entities classifier. The columns contain the gold standard and the rows the system output. The target class is NE.

System output (rows) vs. gold standard (columns) | NE | no NE |
-|-|-|
NE | 50 | 30 |
no NE | 20 | 200 |

<div class="task_description">
   <i class="subtask">4.3.1</i> <i class="l1">L1</i> <br>
</div>
How many true positives, true negatives, false positives and false negatives are there? How do you interpret them?

<div class="task_description">
   <i class="subtask">4.3.2</i> <i class="l2">L2</i> <br>
</div>
Compute precision, recall and F-score given the confusion matrix above.
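
As a reminder of how the three measures relate to each other, here is a minimal sketch that computes them from raw counts. It assumes that "F-score" means the balanced F1 score (harmonic mean of precision and recall); the function name precision_recall_f1 and the example counts are hypothetical and deliberately differ from the table above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the target class from raw counts."""
    # Precision: share of the system's positive predictions that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of the gold positives that the system found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall (defined as 0 if both are 0).
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 3), round(r, 3), round(f, 3))
```

Note that true negatives do not appear in any of the three formulas.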

### Building your own Named Entities Classifier

<div class="task_description">
    <i class="task">Task 4.4:</i> <br>
</div>

<div class="task_description">
   <i class="subtask">4.4.1</i> <i class="l1">L1</i> <br>
</div>
What does the following code do?

In [None]:
df = pd.read_csv("NER_clean.csv", delimiter = "\t", encoding="utf-8", names=["WORD", "NE"], quoting=3)
df["WORD"] = df["WORD"].apply(str)
print(df[:30])

<div class="task_description">
   <i class="subtask">4.4.2</i> <i class="l1">L1</i> <br>
</div>
What does the following code do?

In [None]:
words =list(df.loc[:, "WORD"])
df["WORDLENGTH"] = [len(word) for word in words]
print(df[:30])

<div class="task_description">
   <i class="subtask">4.4.3</i> <i class="l2">L2</i> <br>
</div>

Add 4 columns to the data frame which contain

- whether a word is capitalized (True/False)
- whether a word is fully written in uppercase (True/False)
- whether the word is a noun (True/False)
- whether the word appears in the corpus "names" or "gazetteers" from NLTK (True/False)
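
The two surface-form features can be derived with Python's built-in string methods. The sketch below is one possible reading of "capitalized" (first character is an uppercase letter) and is not tied to the data frame; the function name shape_features and the keys CAPITALIZED and UPPERCASE are hypothetical.

```python
def shape_features(word):
    """Return simple surface-form features for a single token."""
    return {
        # First character is an uppercase letter (empty strings yield False).
        "CAPITALIZED": word[:1].isupper(),
        # All cased characters are uppercase and at least one cased character exists.
        "UPPERCASE": word.isupper(),
    }


for token in ["Duisburg", "WHO", "pub", "2008"]:
    print(token, shape_features(token))
```

Applied to a whole column, such a function yields one True/False value per token, which is the form the classifier further below expects.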

<div class="task_description">
   <i class="subtask">4.4.4</i> <i class="l3">L3</i> <br>
</div>

The following code creates a Decision Tree Classifier which classifies whether a token is a named entity or not, based on the features you provided above. The data are split into a training set and a test set. The variable "predicted" contains the predicted labels (NE = "True", no NE = "False"), while "gold" contains the corresponding gold labels.

In [None]:
x = df.iloc[:, 2:len(df.columns)]
y = df.iloc[:, [1]]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

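# Use the directly imported class: the name "tree" may have been reassigned to the chunk tree in Task 4.2
c_tree = DecisionTreeClassifier(max_depth=4)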
c_tree.fit(x_train, y_train)

predicted = list(c_tree.predict(x_test))
gold = list(y_test.loc[:, "NE"])

Based on the predicted and the gold labels, compute precision, recall and F-score for the classifier. 
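
A possible first step is to count the four outcome types by walking through both label lists in parallel. The sketch below uses hypothetical toy lists rather than the classifier output; the function name count_outcomes and the parameter positive (the value that marks the NE class) are assumptions made for this example.

```python
def count_outcomes(gold, predicted, positive=True):
    """Count (tp, fp, fn, tn) for the class marked by the value of positive."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1  # predicted NE, gold NE
        elif p == positive:
            fp += 1  # predicted NE, gold no NE
        elif g == positive:
            fn += 1  # predicted no NE, gold NE
        else:
            tn += 1  # predicted no NE, gold no NE
    return tp, fp, fn, tn


# Hypothetical toy labels, NOT the output of the classifier above
toy_gold = [True, True, False, False, True]
toy_pred = [True, False, True, False, True]
print(count_outcomes(toy_gold, toy_pred))
```

If the labels in your data frame are stored as strings, pass the matching value, for example positive="True".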

You can compare your results to the built-in classification report:

In [None]:
print(classification_report(gold,predicted))

# Homework

### Submission Guidelines
* The submission has to be done by a team of two people. **Individual submissions will not be graded.**
* Please state **the name and matriculation number of all team members in every submission** clearly.
* Only **one team member** should submit the homework. If more than one version of the same homework is submitted by accident (submitted by more than one group member), please reach out to a tutor **as soon as possible**. Otherwise, the first submitted homework will be graded.
* The submission must be in a Jupyter Notebook format (.ipynb). Submissions in other formats will **not be graded**.
* It is not necessary to also submit the part of the exercise discussed by the tutor; please submit only the homework part.
* If pictures need to be submitted, they may be handed in as a zip archive together with the notebook. They should be embedded in the notebook like this: <br> *!\[example1](examplepicture1.PNG)* (without apostrophes, in a Markdown cell).

<div class="task_description">
    <i class="task">Task 4.1:</i> 
</div>

<i class="subtask">4.1.1</i> 
Annotate all named entities in the file "Langtech_NER.txt"

* The file contains 100 German sentences (note that the sentences do not form a coherent text). Each sentence may contain one or more named entities, but a sentence may also contain none.
<br><br>
* The 4 named entity types to annotate are PERSON (PER), ORGANIZATION (ORG), LOCATION (LOC), OTHER (OTH). For further information about which named entity belongs to which type, please refer to the "NoSta-D-TagSet" on page 6 in the file "Clarin_NoSta-D_NER-AnnotationGuidelines.pdf" that you can download from Moodle. Important: you are not asked to follow these annotation guidelines completely. In particular, note the following:
     * Anything that is tagged with "deriv" or "part" tags according to these guidelines is ignored (e.g. LOCderiv, ORGpart)
     * In our annotation, there are no nested named entities. For example "Bayern München" is labeled as ORG and the individual parts "Bayern" and "München" are not labeled as LOC. As a general rule, the longest possible span gets the label.
 <br>	
<br>
* Upload the annotated file (ending with ".ann", see below) to Moodle. Make sure that the filename contains your name!

<i class="subtask">4.1.2</i> 

Write down at least 5 different cases that you found difficult to annotate. For each, write down 1-2 sentences explaining why it was difficult (e.g. by saying which other label could have applied and why or why you were unsure whether something is a named entity or not). Upload your descriptions to Moodle as a PDF file.



### Technical instructions


- Download the annotation tool YEDDA from https://github.com/jiesutd/YEDDA

- Attention: YEDDA requires Python 2.7, so make sure you have this version installed!

- To start the annotation, open a console in the YEDDA-master folder and type python YEDDA.py. Make sure you start it with Python 2, not Python 3; you may have to type something like /path/to/python2 YEDDA.py.

- Download the file "Langtech_NER.config" from Moodle and place it in the folder "YEDDA-master/configs/".

- To open the sentences to annotate, click on "open" and select the file "Langtech_NER.txt" (or "Langtech_NER.ann" if you have already saved an annotated version and want to continue)

- Select the correct set of labels: In the drop down menu under "Map Templates" on the right hand side, select the file "Langtech_NER.config"

- To annotate a named entity, mark the whole named entity and press the key on the keyboard that is associated with the right label (A: PERSON, B: ORGANIZATION, C: LOCATION, D: OTHER)

- To change a label, click within the entity span and press the key for the new label

- To remove a label, click within the entity span and press 'q'. Important: do not mark the whole entity when you want to remove only the label. If you mark it and then press 'q', this will remove the whole entity, not just the label!

- Clicking on "Export" will save the annotated text. Note: two files will be saved, one ending with *.ann and one with *.anns. Upload the one ending with *.ann to Moodle (change the filename so that it contains your name!)