
# TxMM Assignment 3: Information Extraction
## Learning goals of this assignment:
1. Become familiar with the intricacies and challenges of time expression labelling
2. Practice with manual annotation using BIO tags
3. Learn how to compute inter-annotator agreement scores with Cohen’s kappa, and understand how to interpret these kappa scores
4. Define a set of patterns for extracting time expressions from text
5. Calculate precision and recall for the output of your code on an unseen text
6. Gain insight into the importance of pattern generalizability
7. Reflect on the gap between labeling time expressions and the actual IE task of extracting and matching events from texts.

## Group Assignment

This assignment is a group assignment. For this assignment, ensure you are enrolled in an "Assignment 3 - Information Extraction" group on Brightspace with a group of three people.


## Practical information

Note that the assignment will be graded with a Pass/Fail system.

Whenever you have any questions, feel free to ask us in the open lunch hours on Mondays! You can also contact the TAs, Mats Robben, Nityaa Kalra and Raul Mihalca, through Discord or by sending an email to nityaa.kalra@ru.nl, raul.mihalca@ru.nl and/or mats.robben@ru.nl.

We would appreciate it if you do not contact us via WhatsApp for non-urgent matters, so we can keep our TA work and private life (somewhat) separate.

We only support the use of [Google Colab](https://colab.research.google.com/), as all assignments have been implemented and tested with it. In case of (strange) bugs on other platforms, please consider switching to Colab so that we can provide you with all the help you may need.

## Handing in the assignment:
Please hand in the following files:
- This notebook containing your answers in a .ipynb format.
- One .csv file containing all three of your BIO-tag annotations.

Please run the notebook before you hand in the assignment.

Only one of the group members has to hand in the assignment files on Brightspace.

# The Assignment


This assignment consists of two parts: the first part focuses on manual annotation of time expressions in text, and the second part focuses on automating time expression extraction with regular expressions and normalizing the extracted expressions to a unified format.

The goal of this assignment is to let you practice in annotation labeling and reflect on the challenges when doing manual annotation of time expressions in text.

Recognizing time expressions is a crucial part of many NLP and IE applications. In this assignment we focus on date expressions in biography texts from Wikipedia.

# Part 1 - Manual Annotation & Inter-Annotator agreement

### Uploading files to Colab

As in a previous assignment, you need to upload the files required for this assignment to the working directory. For this assignment you need to upload the files *Yoshua_Bengio_bio_trainset.txt*, *Yann_LeCunn_bio_devset.txt*,
*Fei-Fei_Li_bio_testset.txt* and *utils.py*.

Instead of using the upload functionality, you can also download the files directly in the notebook. To do so, upload the files to https://transfer.sh/ and then run the following commands in a code cell, using the URLs you created:

```
!wget <transfer.sh url>/Yoshua_Bengio_bio_trainset.txt
!wget <transfer.sh url>/Yann_LeCunn_bio_devset.txt
!wget <transfer.sh url>/Fei-Fei_Li_bio_testset.txt
!wget <transfer.sh url>/utils.py
```

Pay attention: The links expire after two weeks and you have to create new ones.

Note that it is not at all important that you achieve and report a high Cohen's kappa for this task. In a real text mining application, researchers often go through multiple cycles of annotation rounds to come up with consistent and clear annotation guidelines. Here you are starting with an initial round of annotations.


### Task 1: Perform Manual Annotation
Each of you individually annotates the date expressions in the biography text of Yoshua Bengio taken from Wikipedia, without discussing the task together. This results in 3 versions (A, B, C) of the text with annotations. *Note: if you work in a group with fewer than three people, it is alright to ask another group for one of their annotations.*

A **date expression** is a sequence within the text (can contain letters, numbers, and/or punctuation) that expresses a point in time or a period of time.

As annotation labels we use the B (begin), I (inside) and O (outside) labels to indicate if a token is part of a time expression or not.
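To make the tagging scheme concrete, here is a minimal sketch of how BIO tags line up with tokens. The helper `spans_to_bio`, the example sentence and the chosen span are all hypothetical illustrations, not part of the provided files. Whether a word such as "on" belongs to the date expression is exactly the kind of decision you have to make as an annotator.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end) token spans (end exclusive) into one BIO tag per token."""
    tags = ["O"] * len(tokens)            # every token starts outside any expression
    for start, end in spans:
        tags[start] = "B"                 # first token of the expression
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens of the expression
    return tags

# Made-up example sentence, not taken from the assignment texts.
tokens = ["She", "moved", "on", "5", "March", "1999", "."]
bio_tags = spans_to_bio(tokens, [(3, 6)])  # "5 March 1999" is the date expression

for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

In the CSV template you fill in one such tag per token by hand; the code only illustrates the expected shape of the result.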

We provide a text with one token per line as a starting point: you can find `Yoshua_Bengio_bio_TPL.csv` in the provided zip file. We advise you to use Excel or another editor that allows you to save the resulting annotations as a CSV file.

*After you individually created your annotated versions, combine them into one annotation file. Hand in the file together with this notebook on Brightspace.*

### Task 2: Compute Inter-Annotator Agreement
Upload the csv file and compute Cohen’s kappa between each pair of annotations (AB,BC,AC) and report the 3 scores. Use sklearn.metrics.cohen_kappa_score for this computation.
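To see what the score measures, the sketch below computes unweighted Cohen's kappa by hand as $(p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The function `cohen_kappa` and the two tag lists are hypothetical toy data. For the assignment itself you should still use `sklearn.metrics.cohen_kappa_score`, which should give the same value on the same input.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    # Observed agreement: fraction of tokens where both annotators chose the same tag.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    # Note: this divides by zero if p_e == 1 (both annotators use one identical label only).
    return (p_o - p_e) / (1 - p_e)

# Toy annotations (made up): the annotators disagree on a single token.
annotator_a = ["O", "B", "I", "O", "O", "B", "O", "O"]
annotator_b = ["O", "B", "I", "O", "O", "O", "O", "O"]

kappa_ab = cohen_kappa(annotator_a, annotator_b)
print(round(kappa_ab, 3))
```

Because most tokens are tagged `O`, chance agreement is already high, which is worth keeping in mind when you interpret your own scores.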

In [None]:
from sklearn.metrics import cohen_kappa_score
### BEGIN SOLUTION
# import the libraries you need
### END SOLUTION


### BEGIN SOLUTION
# your code goes here
### END SOLUTION

*Report on the three scores here*

### Task 3: Reflect on the Cohen's Kappa score:
#### 3.1) How do you interpret the kappa scores? Are all 3 scores similar?  What does a high score indicate?
#### 3.2) If you place the three columns of BIO tags next to each other, what are the cases where you disagreed? Can you explain why?
#### 3.3) Was the Kappa score metric suitable to evaluate the inter-annotator agreement on this task of time expression labeling, also taking into account the number of annotators?


*Your answer here*



---


# Part 2 - Regular Expressions and Normalization *(Extracting timelines and matching events from biographies)*

## 2.1 - Loading & inspecting the Yann_LeCunn_bio_devset.txt

In this second part you implement an IE program that automatically extracts time expressions from texts. We focus here on events described in two texts (the biographies of Yann LeCun and Fei-Fei Li), and you will find the matching events (i.e. overlapping dates) of the two timelines.

Here, your goal is to identify and extract date expressions from sentences.

First, you want to get an impression of the date expressions in the biography of Yann LeCun.
Run the next cell to see Yann_LeCunn_bio_devset.txt. *Don't look at Fei-Fei_Li_bio_testset.txt yet: it contains an extract of the biography of Fei-Fei Li and serves as the unseen test text!*

In [None]:
import os
# Make sure Yann_LeCunn_bio_devset.txt and Fei-Fei_Li_bio_testset.txt are in the working directory.

def read_file(file_name):
  with open(file_name, "r") as f:
    return f.read()

working_dir = os.getcwd()  # get our working directory
train_file_path = os.path.join(working_dir, 'Yann_LeCunn_bio_devset.txt')
test_file_path = os.path.join(working_dir, 'Fei-Fei_Li_bio_testset.txt')

text_lecunn = read_file(train_file_path)
print(text_lecunn)


### Task 1: Examine the biography of Yann LeCun, and notice the patterns in which the date expressions occur. List and describe your observations in the next cell.

*Your observations go here*

### Task 2: Implement the function *sentence_tokenize_text* so that it splits a text into a list of sentences.

*Note: we split the assignment in two parts as the first part used a word-per-line format, while in this second part we look at sentence-per-line.*
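As an illustration of why sentence splitting is less trivial than it looks, here is a deliberately naive sketch that splits after sentence-final punctuation. The function `naive_sentence_split` and the example text are hypothetical, and this is not meant as the intended solution: note how the abbreviation in the made-up example causes a wrong split.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (naive on purpose)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Made-up example text containing an abbreviation.
example_text = "She was born in 1960. She joined Example Univ. in 1987."
naive_sentences = naive_sentence_split(example_text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The second sentence is cut in two after "Univ.", so a date could end up separated from its context. Your own implementation should handle such cases more carefully.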

In [None]:
### BEGIN SOLUTION
# import the libraries you need
### END SOLUTION

def sentence_tokenize_text(text):
  """
  :param text: An input text, i.e. a string
  :return: A list of strings, where each string is one sentence
  """
  ### BEGIN SOLUTION
  # your code goes here
  ### END SOLUTION

# sentence_tokenize_text(text_lecunn)  # You can display the result for testing.

### Task 3: Implement the function *extract_date_expressions* so that it extracts date expressions from sentences.
**To ensure unbiased annotation of the test data in task 10, only one person of your group will annotate the test data. This person cannot look at the regex that is being developed in this task. The other two can develop the regex together.**

The function should take in our list of sentences and return a pandas DataFrame. This DataFrame has two columns:
- Date: The extracted date expressions
- Sentence: The sentence from which a date expression was extracted

**Write your own patterns** and do not rely on libraries that automatically extract date expressions as learning about regular expressions is one of the learning objectives of the exercise.

Use the Yann_LeCunn_bio_devset.txt and the manual annotation that you did on the biography of Yoshua Bengio as inspiration for your regular expressions.

*Hint: Check out https://regexr.com/ for testing and refining the regular expressions you use to capture date expressions. It also has a handy cheat sheet you can use.*

*For coding newbies: You can contact the TAs to get a regex example function in Python.*
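As a starting point for working with regular expressions in Python, the sketch below finds four-digit years with the standard `re` module. The pattern `YEAR_PATTERN`, the helper `find_years` and the example sentence are hypothetical. The pattern is intentionally too simple for the task: it returns only the year and ignores the month that precedes it.

```python
import re

# Hypothetical starter pattern: a stand-alone four-digit year from 1800 to 2099.
# \b marks a word boundary, so digits inside longer numbers are not matched.
YEAR_PATTERN = re.compile(r"\b(?:18|19|20)\d{2}\b")

def find_years(sentence):
    """Return all year-like matches in the order they appear in the sentence."""
    return [match.group(0) for match in YEAR_PATTERN.finditer(sentence)]

# Made-up example sentence.
sentence = "She joined the lab in 1998 and left in March 2005, room 12345."
years = find_years(sentence)
print(years)
```

From here you would extend the pattern step by step (months, days, ranges) and check each change against the development text.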

In [None]:
### BEGIN SOLUTION
# import the libraries you need
### END SOLUTION

def extract_date_expressions(sentences):
  """
  :param sentences: A list of strings, where each string is one sentence
  :return: A pandas DataFrame with the columns
                "Date" (extracted date expressions as a string)
                "Sentence" (sentences from which a date expression was extracted)
  """
  ### BEGIN SOLUTION
  # your code goes here
  ### END SOLUTION

# Apply the function to the tokenized text:
df_dates_lecunn =  extract_date_expressions(sentence_tokenize_text(text_lecunn))
df_dates_lecunn # use this for testing

### Task 4: Normalization: Implement the function *date_expression_to_iso8601* so that it converts a date expression string to the ISO 8601 date standard.
Then, add the ISO 8601 converted dates as a column ("ISO") to our dataframe.

You can find more info and examples on https://www.iso.org/iso-8601-date-and-time-format.html.

To implement the function, check out Python’s built-in `datetime` module. You’ll find functions in there that can make this task a lot easier.
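For instance, `datetime.strptime` parses a string according to a format you specify, and `date.isoformat` renders the result as `YYYY-MM-DD`. The sketch below covers only one hypothetical surface form ("day month-name year", with English month names); the helper name `full_date_to_iso` is an assumption for illustration. Any other form raises a `ValueError` and needs its own handling in your function.

```python
from datetime import datetime

def full_date_to_iso(date_string):
    """Convert a 'day month-name year' string such as '31 October 2024' to ISO 8601."""
    # %d = day of month, %B = full month name, %Y = four-digit year
    parsed = datetime.strptime(date_string, "%d %B %Y")
    return parsed.date().isoformat()

iso_date = full_date_to_iso("31 October 2024")
print(iso_date)

# A string in a different form does not match the format and raises ValueError.
try:
    full_date_to_iso("October 2024")
    other_form_parsed = True
except ValueError:
    other_form_parsed = False
print(other_form_parsed)
```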

In [None]:
from datetime import datetime

def date_expression_to_iso8601(date_string):
  """
  :param date_string: A string containing a date expression
  :return: A string containing the date in ISO 8601 format
  """
  ### BEGIN SOLUTION
  # your code goes here
  ### END SOLUTION

# Now, add a column "ISO" to your DataFrame
### BEGIN SOLUTION
df_dates_lecunn["ISO"] = # YOUR CODE HERE
### END SOLUTION
df_dates_lecunn  # use this for testing

### Task 5: In which respect does the ISO 8601 format differ from the date information present in the text? (If you noticed an additional issue caused by a discrepancy between the standard and the Python implementation, please mention this as well.)

*Your answer goes here*

### Task 6: Combine the previous steps in the function *get_sorted_df_from_file_name* so that it runs the whole date extraction pipeline and returns a DataFrame.
**Make sure to order the DataFrame rows chronologically according to the ISO dates!**

Consider the following example text:

> This is an example text about interesting upcoming dates. Halloween takes place on 31 October 2024. Our Christmas holiday is from Friday 21 December 2024 - Friday 5 January 2025. We will celebrate Sinterklaas on 5 December 2024.

Here's an illustration of what the example text's DataFrame should look like:

|ISO |Date | Sentence |
|----:|----:|:----|
|2024-10-31 |31 October 2024| Halloween takes place on 31 October 2024.|
|2024-12-05 |5 December 2024| We will celebrate Sinterklaas on 5 December 2024.|
|2024-12-21 |21 December 2024| Our Christmas holiday is from Friday 21 December 2024 - Friday 5 January 2025.|
|2025-01-05 |5 January 2025| Our Christmas holiday is from Friday 21 December 2024 - Friday 5 January 2025.|
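One reason ISO 8601 is convenient here: full dates written as `YYYY-MM-DD` sort chronologically when sorted as plain strings. The sketch below shows this on the example rows using plain Python dictionaries; the variable names are illustrative. In your pipeline the equivalent step on a DataFrame would be pandas' `sort_values` on the "ISO" column.

```python
# Rows from the example above, deliberately out of order.
rows = [
    {"ISO": "2024-12-21", "Date": "21 December 2024"},
    {"ISO": "2025-01-05", "Date": "5 January 2025"},
    {"ISO": "2024-10-31", "Date": "31 October 2024"},
    {"ISO": "2024-12-05", "Date": "5 December 2024"},
]

# Sorting the ISO strings alphabetically also sorts them chronologically,
# because year, month and day are zero-padded and ordered from large to small.
sorted_rows = sorted(rows, key=lambda row: row["ISO"])
sorted_iso = [row["ISO"] for row in sorted_rows]
print(sorted_iso)
```

This only holds as long as every value uses the same zero-padded format.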

In [None]:
def get_sorted_df_from_file_name(file_name):
  """
  :param file_name: A string containing the full path to a file
  :return: A pandas DataFrame with the columns "Date", "Sentence" and "ISO"
          (see above), where rows are sorted according to "ISO"
  """
  ### BEGIN SOLUTION
  # Extract dates, then sort by ISO
  ### END SOLUTION

get_sorted_df_from_file_name(train_file_path)

## 2.2 - Manual labeling: Yann_LeCunn_bio_devset.txt
To evaluate your date expression pipeline, you first need to have gold labels. For Part 2 we use another labeling format to simplify the regex matching. Let's familiarize ourselves with this format.

Look at our example text again:

> This is an example text about interesting upcoming dates. Halloween takes place on 31 October 2024. Our Christmas holiday is from Friday 21 December 2024 - Friday 5 January 2025. We will celebrate Sinterklaas on 5 December 2024.


This is how we store the gold labels for this example text:
~~~python
example_manual_labels = [
  {"Dates": [],
    "Sentence": "This is an example text about interesting upcoming dates."},
  {"Dates": ["2024-10-31"],
    "Sentence": "Halloween takes place on 31 October 2024."},
  {"Dates": ["2024-12-21", "2025-01-05"],
    "Sentence": "Our Christmas holiday is from Friday 21 December 2024 - Friday 5 January 2025."},
  {"Dates": ["2024-12-05"],
    "Sentence": "We will celebrate Sinterklaas on 5 December 2024."}
  ]
~~~

As you can see, we use a list of dictionaries. We have one dictionary for each sentence. This dictionary has two keys:
- Key "Sentence": The corresponding value is the sentence (i.e. a string).
- Key "Dates": The corresponding value is a list of all date expressions (strings; correctly converted to the ISO 8601 date standard) that appear in that sentence. If a sentence does not contain any date expressions, this list is empty.


We've created a helper function that automatically provides a template list based on your *sentence_tokenize_text* function. Run the cell below to get this template for *text_lecunn*:

In [None]:
def get_manual_labeling_list(text):
  """
  A helper function that returns a manual labeling template in which you only
  have to fill in the dates for each sentence.
  :param text: An input text, i.e. a string
  :return: A list of dictionaries, one per sentence, each with an empty "Dates" list
  """
  sentences = sentence_tokenize_text(text)
  return [{"Dates": [], "Sentence": sentence} for sentence in sentences]

get_manual_labeling_list(text_lecunn)

### Task 7: Manually label all sentences from Yann_LeCunn_bio_devset.txt.   
Copy the template list output by our helper function for Yann_LeCunn_bio_devset.txt into the cell below. Then manually fill the list *lecunn_manual_labels* with the dates (in correct ISO 8601 format) from each sentence.

In [None]:
lecunn_manual_labels = [
  # Add one dictionary for each sentence here.
]


Now that we have labels for Yann_LeCunn_bio_devset.txt, we can plot a confusion matrix to get an impression of your extraction procedure's performance:

In [None]:
from utils import plot_confusion_matrix

plot_confusion_matrix(manual_labels = lecunn_manual_labels,
                      sorted_date_df = get_sorted_df_from_file_name(train_file_path),
                      normalize    = False,
                      title_names = ['Positive','Negative'])

### Task 8: Manually calculate the precision and recall of your date expression procedure on Yann_LeCunn_bio_devset.txt.
Write your calculation into the cell below. Write your *full* calculation including the formula you are using.

Hint: You can use $\LaTeX$ here.
- To display a formula inline, surround it by the '\$' sign.
    - For example, '\$ 4=2^2 \$' will be displayed like this: $4=2^2$
- To display a formula in a block, surround it by '\$\$'.
    - For example, '\$\$ 16=4^2 \$\$' will be displayed like this: $$16=4^2$$
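If you want to double-check your manual calculation, the sketch below computes precision as $TP / (TP + FP)$ and recall as $TP / (TP + FN)$ from a list of gold dates and a list of extracted dates. The function `precision_recall` and the two lists are hypothetical, and the sketch pools all dates of a text. This is an assumption that may differ from how `plot_confusion_matrix` in `utils.py` counts, so treat the confusion matrix as the reference for your numbers.

```python
from collections import Counter

def precision_recall(gold_dates, predicted_dates):
    """Compute precision and recall over two lists of ISO date strings."""
    gold, predicted = Counter(gold_dates), Counter(predicted_dates)
    tp = sum((gold & predicted).values())   # dates found in both lists
    fp = sum(predicted.values()) - tp       # extracted, but not in the gold labels
    fn = sum(gold.values()) - tp            # in the gold labels, but not extracted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: three of five extracted dates are correct, one gold date is missed.
gold_dates = ["2024-10-31", "2024-12-05", "2024-12-21", "2025-01-05"]
predicted_dates = ["2024-10-31", "2024-12-05", "2024-12-21", "2024-01-01", "2023-05-05"]

precision, recall = precision_recall(gold_dates, predicted_dates)
print(precision, recall)
```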

*Insert your calculation here.*

**Now, repeat the following steps until you are satisfied with the performance:**
1. Run the date expression procedure on Yann_LeCunn_bio_devset.txt.
2. Make adaptations to your code if necessary.
3. Go through the output manually and calculate precision and recall. **Make sure the cell above contains the latest calculation.**

Once you are satisfied with your performance, proceed to the next part of this assignment.

### Task 9: Discuss the difficulties you encountered during each repeat of the above steps to develop the time patterns.

*Describe the difficulties*

## 2.3 - Applying the extraction procedure to the unseen Fei-Fei_Li_bio_testset.txt
Next, we will test your date extraction procedure and see how it performs on the unseen file Fei-Fei_Li_bio_testset.txt. First, let's have a look at the text inside this file:

In [None]:
text_feifei = read_file(test_file_path)
print(text_feifei)

### Task 10: Manually label all sentences from Fei-Fei_Li_bio_testset.txt
**The group member who did not work on the regex must be the one who annotates the test file.**

The following cell gives you the template list.
Fill the list *feifei_manual_labels* in the second cell below with the dates from the text, just like you previously did for Yann_LeCunn_bio_devset.txt.

In [None]:
get_manual_labeling_list(text_feifei)

In [None]:
feifei_manual_labels = [
  # Add one dictionary for each sentence here.

]

**Now, let's run your date expression procedure on the unseen text *Fei-Fei_Li_bio_testset.txt* and look at the resulting DataFrame.**


In [None]:
get_sorted_df_from_file_name(test_file_path)

### Task 11: Make adaptations to your date extraction code if necessary. Do not change the functions from the previous part of this assignment; instead, make your adjustments by changing the four functions below.

Currently, each of these "adapted" functions just calls the corresponding function from the previous parts. If you want to change one of them, replace its return statement with your changes.

In [None]:
def sentence_tokenize_text_adapted(text):
  """
  :param text: An input text, i.e. a string
  :return: A list of strings, where each string is one sentence
  """
  # change this if you want to adapt your original function
  return sentence_tokenize_text(text)


def extract_date_expressions_adapted(sentences):
  """
  :param sentences: A list of strings, where each string is one sentence
  :return: A pandas DataFrame with the columns
                "Date" (extracted date expressions as a string)
                "Sentence" (sentences from which a date expression was extracted)
  """
  # change this if you want to adapt your original function
  return extract_date_expressions(sentences)


def date_expression_to_iso8601_adapted(date_string):
  """
  :param date_string: A string containing a date expression
  :return: A string containing the date in ISO 8601 format
  """
  # change this if you want to adapt your original function
  return date_expression_to_iso8601(date_string)


def get_sorted_df_from_file_name_adapted(file_name):
  """
  :param file_name: A string containing the full path to a file
  :return: A pandas DataFrame with the columns "Date", "Sentence" and "ISO"
          (see above), where rows are sorted according to "ISO"
  """
  # change this if you adapted any of the functions above: the original
  # pipeline below most likely still calls the original (non-adapted) functions
  return get_sorted_df_from_file_name(file_name)

### Task 12: Discuss the difficulties you encountered extracting the new timeline. Also address the adaptations that you needed to make for processing the unseen biography Fei-Fei_Li_bio_testset.txt.

Write your discussion into the cell below.

*Your discussion goes here.*

**Now, we can evaluate your adapted date expression procedure. Let's plot one confusion matrix for each of the text files.**

In [None]:
print('Confusion matrix for Yann Le Cunn:')
plot_confusion_matrix(manual_labels = lecunn_manual_labels,
                      sorted_date_df = get_sorted_df_from_file_name_adapted(train_file_path),
                      normalize    = False,
                      title_names = ['Positive','Negative'])

print('Confusion matrix for Fei-Fei Li:')
plot_confusion_matrix(manual_labels = feifei_manual_labels,
                      sorted_date_df = get_sorted_df_from_file_name_adapted(test_file_path),
                      normalize    = False,
                      title_names = ['Positive','Negative'])

### Task 13: Calculate the precision and recall of your adapted date expression procedure on both Yann_LeCunn_bio_devset.txt and Fei-Fei_Li_bio_testset.txt.

Write your calculations into the cell below.

*Your precision and recall calculations go here*

### Task 14: Compare the precision and recall scores on Yann_LeCunn_bio_devset.txt and Fei-Fei_Li_bio_testset.txt. Address the following points:
- Did the values for Yann_LeCunn_bio_devset.txt change after your adaptations?
- Are there any differences between the two texts?
- If so, where does this difference come from?
- What does this difference mean in terms of generalizability?

*Your answer goes here*

### Task 15: Take a closer look at the errors your automatic extraction method makes. What is happening there?

*Your answer goes here*

### Task 16: Reflect on the relation between date mentions in the texts and the events that they denote. Also reflect on the duration and overlap of events. (max. 4 sentences)

*Your answer goes here*

## 2.4 - Finding matching events

### Task 17: Find the matching events (i.e. overlapping dates in the two biographical timelines) between the two timelines.

You could do this in one of the following ways:
- **Command line** We highly encourage you to use the `comm` command on your ISO 8601 dates in your command line (more info and examples: https://www.computerhope.com/unix/ucomm.htm). Note that this command also works on macOS/Windows.
- **Python** It is also allowed to find the matching events programmatically in Python.
- **Online tool** You can use an online tool (e.g. https://text-compare.com/).

**Important: Do not search for matching events manually!**
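For the Python route, a set intersection is one possible approach: it keeps exactly the ISO strings that occur in both timelines. The two lists below are made-up placeholders, not dates from the biographies. In your notebook you would take them from the "ISO" columns of your two DataFrames. If you prefer the command line, keep in mind that `comm` expects both input files to be sorted, and that `comm -12` prints only the lines common to both.

```python
# Made-up placeholder timelines (ISO 8601 strings), not taken from the biographies.
timeline_a = ["1987-01-01", "2003-06-15", "2013-12-01", "2018-03-27"]
timeline_b = ["2003-06-15", "2009-01-01", "2018-03-27"]

# Dates present in both timelines, in chronological order.
matching_dates = sorted(set(timeline_a) & set(timeline_b))
print(matching_dates)
```

Note that this finds only exact string matches, so two dates that refer to the same period at different granularity will not be reported as overlapping.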

Please put the command, the code or the link to the website you used into the cell below or briefly describe how you found the matching events.  

Then, also paste the list of matching events (in ISO-8601 format) into that cell.


*Your description goes here*

*Your list goes here*


### Task 18: Discussion of matching events
1. Discuss the list of matching events that you found.
2. When going through the texts manually, do you find matching events that you did not find programmatically/automatically?
3. If so, what could be the reason(s) for this? Discuss your answers to these questions and any other difficulties you encountered during the extraction of matching events.

*Your answer goes here*

In [None]:
print('Congratulations! You finished this assignment!')
print('We would like to thank Theo Kent for letting us use his notebook as a basis for this assignment!')

Please look at the ***Handing in the assignment*** section for instructions on what to hand in.