# Week 6 | Assignment 2: Training & Evaluating ML Algorithms | Part 1

### Overview
* You start from the gold annotations that you produced
* This week: pre-process the data and run a simple ML algorithm over it
* Next week: in-depth evaluation

### Preparation
* Download your gold annotations from Doccano
* For this assignment, there is **no** strict separation between annotators and researchers. It's up to you how you divide the tasks within your group. There are three types of tasks: 1) writing pre-processing code, 2) writing experiment code, 3) report writing

In [None]:
# run this cell to install & import necessary packages
import json
import nltk
import random
from nltk.tokenize import word_tokenize
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Pre-processing the data
#### SEQUENCE LABELING TASKS
(if you have a document classification task, please scroll down to the next section)


**STEP 1.1**

* Write a function that reads files exported from Doccano and gets all the annotated spans. For each span, you need to extract 1) the text of the span itself; 2) the text before and after the span; 3) the label.

* The function should take as input a JSONL file(name) and output a list of dictionaries

* For now, simply get *all* text before and after; later, you will cut this down to smaller pieces

* Your output should look like this:
```python
[{'label': 'Car',
  'text_after': ' over de kop bij knooppunt Valburg, bestuurder gewond\n\nVALBURG - Op\xa0de verbindingsweg tussen de A15 en de A50 is vrijdagmorgen een bestuurder gewond geraakt doordat hij met zijn auto van de weg raakte en over de kop sloeg. Het ongeval gebeurde even na 11.30 uur ter hoogte van knooppunt Valburg.\n\n',
  'text_prev': '',
  'text_span': 'Auto'},
 {'label': 'Scooter',
  'text_after': ' remt en wordt aangereden\n\nKESTEREN - Een bestuurder van een motorscooter is gewond geraakt bij een ongeluk in Kesteren. De man is met een ambulance naar het ziekenhuis gebracht.\n\nKESTEREN - Een bestuurder van een motorscooter is gewond geraakt bij een ongeluk in Kesteren. De man is met een ambulance naar het ziekenhuis gebracht.\n\nDe motorscooter reed over de Spoorstraat in Kesteren toen een meisje op een fiets vanuit de Industrieweg plotseling de straat over stak. De bestuurder ging vol in de remmen. De automobilist die achter hem reed had dit te laat door en botste achterop de motorscooter. Die belandde in de sloot.\n\nDe motorscooter is door een berger uit de sloot gehaald. Een opgeroepen traumahelikopter is geannuleerd. Het meisje dat de weg overstak is weggefietst.',
  'text_prev': 'Meisje op fiets steekt plots over, bestuurder ',
  'text_span': 'motorscooter'},
 {'label': 'Car',
  'text_after': ' die achter hem reed had dit te laat door en botste achterop de motorscooter. Die belandde in de sloot.\n\nDe motorscooter is door een berger uit de sloot gehaald. Een opgeroepen traumahelikopter is geannuleerd. Het meisje dat de weg overstak is weggefietst.',
  'text_prev': 'Meisje op fiets steekt plots over, bestuurder motorscooter remt en wordt aangereden\n\nKESTEREN - Een bestuurder van een motorscooter is gewond geraakt bij een ongeluk in Kesteren. De man is met een ambulance naar het ziekenhuis gebracht.\n\nKESTEREN - Een bestuurder van een motorscooter is gewond geraakt bij een ongeluk in Kesteren. De man is met een ambulance naar het ziekenhuis gebracht.\n\nDe motorscooter reed over de Spoorstraat in Kesteren toen een meisje op een fiets vanuit de Industrieweg plotseling de straat over stak. De bestuurder ging vol in de remmen. De ',
  'text_span': 'automobilist'}]
```

In [None]:
# sequence labeling tasks: load the data
def load_span_data(file_name):
    """Read a Doccano JSONL export and return one dictionary per annotated span."""
    label_list = []

    # Every line of a JSONL file is a separate JSON object (here: one document)
    with open(file_name, 'r', encoding='utf-8') as json_file:
        json_list = list(json_file)

    for json_str in json_list:
        result = json.loads(json_str)
        text = result['data']
        # Each annotation is stored as [begin offset, end offset, label name]
        for begin, end, label in result['label']:
            label_list.append({
                'label': label,
                'text_prev': text[:begin],     # everything before the span
                'text_after': text[end:],      # everything after the span
                'text_span': text[begin:end],
            })
    return label_list

data_in = load_span_data("cl-event-1.jsonl")
print(data_in)

[{'label': 'goal', 'text_prev': 'Borussia Dortmund 1-0 Ajax: Lewandowski ', 'text_after': " to seal Group D victory\nThe Bundesliga outfit dominated proceedings in the second half after an even opening 45 minutes, and the Poland international's late strike helped them to a vital win\nBorussia Dortmund recorded a 1-0 victory over Ajax at Signal Iduna Park on Tuesday in their Champions League Group D opener.\nThe hosts should have opened the scoring in the second half via Mats Hummels, but the centre-back missed his spot kick. However, Robert Lewandowski's late goal helped Dortmund to the full three points and spared his team-mate's blushes.\nBVB started the match with attacking intentions and threatened early on via Kuba and Lewandowski. However, the first big chance of the game was for Ajax following some sloppy play from Ilkay Gundogan. Ryan Babel stole the ball off the Dortmund midfielder's feet before setting up Christian Eriksen inside the area, but the Denmark international's shot

**STEP 1.2** 

* Write a function that:
  - _lowercases_ the text
  - *tokenizes* text (split into words) 
  - reduces the _context window_ (= number of words before and after)
* For tokenization, you can use the `nltk.word_tokenize(text, language="language_name")` function from the Python Natural Language Toolkit (NLTK). See the documentation here: https://www.nltk.org/api/nltk.tokenize.html
  * `language_name` can be either `"english"` or `"dutch"`
  * if you work with football data that is in both English and Dutch, you can either use the `langdetect` package (https://pypi.org/project/langdetect/) to detect which language an article is in, or use the English tokenization for both languages.
* You should tokenize the span text itself as well as the context before and after
* Your function should have a parameter called `context_window` that decides how many words before and after the span should be considered
* Output a list of labels and a list of feature dictionaries. This should look like this:
```python
labels = ["Car", "Scooter", "Car"]
features = [
    {'text_after': ['over', 'de', 'kop'],
    'text_prev': [],
    'text_span': ['auto']},
    {'text_after': ['remt', 'en', 'wordt'],
    'text_prev': ['over', ',', 'bestuurder'],
    'text_span': ['motorscooter']},
    {'text_after': ['die', 'achter', 'hem'],
    'text_prev': ['remmen', '.', 'de'],
    'text_span': ['automobilist']}
  ]
``` 
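
The context-window step can be sketched as follows. To stay self-contained, this sketch uses `str.split()` as a stand-in for NLTK's `word_tokenize`, and the helper name `cut_context` is hypothetical. The point it illustrates: the window *before* the span should keep the last tokens of the preceding text (the ones closest to the span), while the window *after* the span keeps the first tokens.

```python
def cut_context(text_prev, text_span, text_after, context_window):
    """Lowercase, tokenize (whitespace split as a stand-in) and trim the context."""
    prev_tokens = text_prev.lower().split()
    span_tokens = text_span.lower().split()
    after_tokens = text_after.lower().split()
    return {
        # the tokens closest to the span are at the END of the preceding text
        "text_prev": prev_tokens[-context_window:] if context_window > 0 else [],
        "text_span": span_tokens,
        # ... and at the START of the following text
        "text_after": after_tokens[:context_window],
    }


example = cut_context(
    "Meisje op fiets steekt plots over, bestuurder ",
    "motorscooter",
    " remt en wordt aangereden",
    context_window=3,
)
print(example)
```

With a plain whitespace split, `over,` stays a single token. NLTK's `word_tokenize` separates the comma, which is why the expected output above shows `'over', ',', 'bestuurder'`.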

In [None]:
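def prepare_span_data(data_in, context_window):
    """Lowercase and tokenize each span and its context, keeping `context_window` tokens per side.

    The dictionaries in `data_in` are updated in place and keep their 'label' key.
    """
    labels = []
    for data in data_in:
        labels.append(data['label'])
        text_prev = word_tokenize(data['text_prev'].lower())
        text_after = word_tokenize(data['text_after'].lower())
        # The tokens closest to the span are the LAST ones before it and the FIRST ones after it
        data['text_prev'] = text_prev[-context_window:] if context_window > 0 else []
        data['text_after'] = text_after[:context_window]
        # The span itself is tokenized as well
        data['text_span'] = word_tokenize(data['text_span'].lower())

    return labels, data_in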

labels, new_data = prepare_span_data(data_in, 4)
print('labels =', labels)
print('features =', new_data)

labels = ['goal', 'goal', 'goal', 'pass', 'save', 'save', 'save', 'pass', 'save', 'pass', 'pass', 'save', 'save', 'pass', 'goal', 'foul', 'goal', 'goal', 'goal', 'pass', 'save', 'foul', 'goal', 'save', 'pass', 'pass', 'goal', 'save', 'pass', 'goal', 'pass', 'foul', 'goal', 'goal', 'pass', 'save', 'save', 'goal', 'goal', 'goal', 'goal', 'pass', 'goal', 'pass', 'goal', 'goal', 'save', 'pass', 'goal', 'goal', 'goal', 'goal', 'goal', 'goal', 'goal', 'pass', 'goal', 'goal', 'goal', 'goal', 'goal', 'foul', 'goal', 'goal', 'goal', 'save', 'save', 'pass', 'save', 'save', 'goal', 'pass', 'goal', 'goal', 'pass', 'save', 'pass', 'pass', 'save', 'pass', 'pass', 'save', 'pass', 'goal', 'pass', 'goal', 'pass', 'pass', 'goal', 'pass', 'save', 'foul', 'foul', 'save', 'goal', 'foul', 'goal', 'goal', 'foul', 'foul', 'goal', 'goal', 'goal']
features = [{'label': 'goal', 'text_prev': ['borussia', 'dortmund', '1-0', 'ajax'], 'text_after': ['to', 'seal', 'group', 'd'], 'text_span': 'nets late'}, {'label': '

**STEP 1.3**
* Use the `DictVectorizer` from scikit-learn to transform your feature dictionaries into vectors (see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)
* Call the `dv.fit_transform()` method and look at the output. What is the meaning of the `.shape` attribute? Can you make sense of the rows and columns? (N.B.: by default, you will get a "sparse matrix" as output. If you call `.toarray()`, you will get a regular NumPy array instead, which is easier to inspect.)
* Call the `dv.get_feature_names_out()` method. Does the output make sense to you?


In [None]:
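from sklearn.feature_extraction import DictVectorizer

# Shuffled item indices; these are split into training and test items in Step 2.1
indices = list(range(len(data_in)))
random.shuffle(indices)
print(indices)

# Leave the gold label out of the features, otherwise the classifier gets to see the answer
feature_dicts = [{key: value for key, value in item.items() if key != 'label'}
                 for item in data_in]
dv = DictVectorizer(sparse=False)
vectors = dv.fit_transform(feature_dicts)
print(vectors)
print(dv.get_feature_names_out())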

[58, 93, 13, 94, 99, 3, 75, 11, 54, 82, 90, 35, 24, 27, 44, 67, 50, 43, 29, 10, 100, 88, 23, 65, 51, 102, 55, 73, 7, 46, 52, 31, 33, 19, 101, 12, 83, 64, 40, 42, 78, 74, 91, 39, 98, 16, 20, 34, 95, 84, 69, 25, 77, 97, 49, 38, 62, 36, 5, 96, 45, 41, 70, 14, 30, 2, 1, 57, 92, 86, 71, 87, 47, 63, 72, 28, 21, 59, 85, 53, 61, 18, 48, 76, 6, 66, 22, 0, 17, 26, 15, 79, 81, 60, 37, 68, 89, 8, 56, 9, 4, 32, 80]
[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
['label=foul' 'label=goal' 'label=pass' 'label=save' 'text_after=!'
 "text_after='" "text_after='s" 'text_after=(' 'text_after=)'
 'text_after=,' 'text_after=.' 'text_after=0-1' 'text_after=1-1.'
 'text_after=17th' 'text_after=2-0.' 'text_after=20' 'text_after=38th'
 'text_after=39e' 'text_after=42' 'text_after=84' 'text_after=:'
 'text_after=a' 'text_after=aan' 'text_after=acheampong'
 'text_after=after' 'text_after=against' 'text_af

#### DOCUMENT-LEVEL TASKS


**STEP 1.1**

* Write a function that reads files exported from Doccano and gets all the annotated texts. For each document, you need to extract 1) the text of the document; 2) the label.

* The function should take as input a JSONL file(name) and output a list of dictionaries

* For now, simply get all the raw text (we will tokenize and prepare it later)

* Your output should look like this:
```python
[{'label': ['meerdere voertuigen'],
  'text': "A28 bij Assen tot eind van middag..."},
 {'label': ['meerdere voertuigen'],
  'text': "Drie auto's botsen..."},
 {'label': ['eenzijdig ongeval'],
  'text': 'Automobiliste botst tegen boom op de...'}]
```

* N.B.: if you have *multiple labels* per document, then the "label" attribute should have a list as value, otherwise (*single label*) it should be a string 
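
One possible shape for such a loader is sketched below. The function name `load_doc_data_sketch` is hypothetical, and the key names `"data"` and `"label"` are assumptions based on the span export used earlier in this notebook; check a line of your own export and adjust `text_key` and `label_key` if needed. The small temporary file only serves to make the example runnable.

```python
import json
import os
import tempfile


def load_doc_data_sketch(file_name, text_key="data", label_key="label"):
    """Read a JSONL export: one JSON object per line, one document per object."""
    documents = []
    with open(file_name, "r", encoding="utf-8") as jsonl_file:
        for line in jsonl_file:
            if not line.strip():
                continue  # skip empty lines
            record = json.loads(line)
            documents.append({"text": record[text_key], "label": record[label_key]})
    return documents


# Write a one-line toy export to a temporary file and read it back
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write(json.dumps({"data": "Drie auto's botsen...", "label": ["meerdere voertuigen"]}) + "\n")
    tmp_name = tmp.name

docs = load_doc_data_sketch(tmp_name)
os.remove(tmp_name)
print(docs)
```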


In [None]:
# document-level tasks: load the data
def load_doc_data(file_name):
  pass

**STEP 1.2** 

* Write a function that:
  - _lowercases_ the text
  - *tokenizes* text (split into words) 
  - removes stopwords
* For tokenization, you can use the `nltk.word_tokenize(text, language="language_name")` function from the Python Natural Language Toolkit (NLTK). See the documentation here: https://www.nltk.org/api/nltk.tokenize.html
  * `language_name` can be either `"english"` or `"dutch"`
  * if you work with football data that is in both English and Dutch, you can either use the `langdetect` package (https://pypi.org/project/langdetect/) to detect which language an article is in, or use the English tokenization for both languages.
* Removing stopwords can also be done with NLTK. See https://pythonspot.com/nltk-stop-words/ for details (again, remember to specify the right language!)
* Output a list of labels and a list of features (token lists). This should look like this:
```python
features = [['a28', 'bij', 'assen', 'tot', ...], ['drie', 'auto', "'s", 'botsen', 'op', 'elkaar', ...], ['automobiliste', 'botst', 'tegen', 'boom', 'op', 'de', 'wesselseweg', 'in', 'barneveld', 'de', 'bestuurster', 'is', 'onderzocht', 'in', 'de', 'ambulance', ',', 'maar', 'hoefde', 'niet', ...]]
labels = [['meerdere voertuigen'], ['meerdere voertuigen'], ['eenzijdig ongeval']]

``` 
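
The three pre-processing steps can be sketched as follows. To keep the example runnable without downloading NLTK data, it uses a simple regular expression as a stand-in for `word_tokenize` and a tiny hand-written stopword set as a stand-in for `nltk.corpus.stopwords.words("dutch")`; both stand-ins, and the name `prepare_doc_sketch`, are assumptions made for illustration only.

```python
import re

# Tiny hand-written stand-in; in the assignment, use the NLTK stopword list for your language
TOY_STOPWORDS = {"de", "op", "in", "is", "tegen"}


def prepare_doc_sketch(documents, stopword_set=TOY_STOPWORDS):
    """Lowercase, tokenize and remove stopwords; return (features, labels)."""
    features, labels = [], []
    for doc in documents:
        # words or single punctuation marks (rough stand-in for word_tokenize)
        tokens = re.findall(r"\w+|[^\w\s]", doc["text"].lower())
        features.append([token for token in tokens if token not in stopword_set])
        labels.append(doc["label"])
    return features, labels


toy_docs = [
    {"text": "Automobiliste botst tegen boom op de Wesselseweg", "label": ["eenzijdig ongeval"]},
]
toy_features, toy_labels = prepare_doc_sketch(toy_docs)
print(toy_features)
print(toy_labels)
```

Lowercasing happens before the stopword check, because stopword lists are typically all lowercase.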

In [None]:
def prepare_doc_data(data_in, context_window):
  pass


**STEP 1.3**
* Use the `TfidfVectorizer` from scikit-learn to transform your token lists into vectors (see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). The vectorizer will first count word frequencies in your data and then weight them (see [the explanation here](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) --- basically, words that occur in almost all documents [e.g. "car" in the traffic dataset] will get a low weight because they don't carry much information related to specific documents; words that occur frequently in a particular document but rarely occur in other documents are seen as more informative and get a higher weight) 
* N.B.: by default, `TfidfVectorizer` will expect raw (untokenized) text and do the tokenization and other pre-processing for you. In this assignment, we use NLTK for tokenization instead (among other reasons because it allows us to do language-specific tokenization). For this reason, you need to modify the `tokenizer` parameter of the constructor so that it _doesn't do anything_ (e.g. pass a function that takes lists of tokens and returns them as-is)
* Call the `tv.fit_transform()` method and inspect the output (tip: use the `.toarray()` method to make the data easier to look at)
* If you have a multilabel setup: you also need to vectorize the outputs (e.g. `[[label1, label2], [label1, label3]] --> [[1, 1, 0], [1, 0, 1]]`). See this page in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html

In [None]:
from collections import Counter
def count_labels(labels):
  counted_labels = Counter(labels)
  return counted_labels

print(count_labels(labels))

Counter({'goal': 46, 'pass': 27, 'save': 21, 'foul': 9})


### Running the ML experiment
(this part is the same for sequence-labeling and document-level tasks)

---

**STEP 2.0**

* Inspect your dataset: 
  * How many samples do you have? 
  * How frequent are the different labels?
  * Do you foresee any problems? (Which labels could be easier or harder to learn? Do you expect a bias towards a specific class?) 
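
One way to make the class-bias question concrete is to compute relative label frequencies and a majority-class baseline, i.e. the accuracy of a "model" that always predicts the most frequent label. The sketch below uses the label counts of the football data in this notebook; the variable names are made up for illustration.

```python
from collections import Counter

# Label counts of the football data used in this notebook
label_counts = Counter({"goal": 46, "pass": 27, "save": 21, "foul": 9})
total = sum(label_counts.values())

for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / total:.1%})")

# Accuracy of always predicting the most frequent label
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / total
print(f"Always predicting '{majority_label}' gives {baseline_accuracy:.1%} accuracy")
```

A trained model is only useful if its accuracy clearly exceeds this baseline.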

In [None]:
print("Amount of samples:", len(labels))
print("Amount of goals:", labels.count("goal"))
print("Amount of pass:", labels.count("pass"))
print("Amount of save:", labels.count("save"))
print("Amount of foul:", labels.count("foul"))


Amount of samples: 103
Amount of goals: 46
Amount of pass: 27
Amount of save: 21
Amount of foul: 9


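With only 103 samples, the dataset is small, so the accuracy scores may be unreliable. The labels are also imbalanced: `goal` makes up 46 of the samples, while `foul` occurs only 9 times. We therefore expect a bias towards `goal`, and `foul` will probably be the hardest label to learn.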

**STEP 2.1**
* Write a function that splits your dataset (vectors and labels) as follows:
  * 70% training data
  * 10% validation data
  * 20% test data
  * The data should be shuffled (put in random order) before making the split --- for this you could use the `random` package in the Python standard library. In this way, you can avoid "weird" effects related to the order of the data (maybe articles about the same topic are close to each other, or maybe you got tired during the annotation and did a less good job on the last annotations compared to the first ones)
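
As a rough sketch of the shuffle-then-split idea using only the `random` package: shuffle a list of item indices and cut it into three consecutive slices. The function name `split_indices` and the fixed `seed` are assumptions for illustration; a fixed seed simply makes the split reproducible.

```python
import random


def split_indices(n_items, train_ratio=0.70, validation_ratio=0.10, seed=42):
    """Shuffle the indices 0..n_items-1 and cut them into train/validation/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    n_train = int(n_items * train_ratio)
    n_val = int(n_items * validation_ratio)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything that is left (about 20%)
    return train, val, test


train_idx, val_idx, test_idx = split_indices(103)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting indices rather than the vectors themselves makes it easy to select the matching rows of the vectors, the labels and the original texts with the same index lists.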

In [None]:
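from sklearn.model_selection import train_test_split

def randomize_and_split(labels, data):
    """Shuffle the data and split it into 70% training, 10% validation and 20% test."""
    train_ratio = 0.70
    validation_ratio = 0.10
    test_ratio = 0.20

    # First split off the training portion (train_test_split shuffles by default)
    x_train, x_rest, y_train, y_rest = train_test_split(data, labels, test_size=1-train_ratio)

    # Then divide the remaining 30% into validation (10%) and test (20%)
    x_val, x_test, y_val, y_test = train_test_split(x_rest, y_rest, test_size=test_ratio/(test_ratio + validation_ratio))

    return x_train, y_train, x_val, y_val, x_test, y_test


# `indices` was shuffled earlier, so put the labels in the same order before splitting
shuffled_labels = [labels[i] for i in indices]
x_train, y_train, x_val, y_val, x_test, y_test = randomize_and_split(shuffled_labels, indices)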

**STEP 2.2**

* Finally, time to train a machine learning algorithm! We'll use two models from scikit-learn: `MultinomialNB` (a variant of Naïve Bayes) and `RandomForestClassifier` (an ensemble of decision trees)
* There are two steps:
  * 1) **model selection** --> here, you try out different variants of the models, and select the best one based on the score that you get on the validation set. You should experiment with different settings for the following *hyperparameters* (= parameters that the model doesn't learn by itself but are set by the experimenter):
    - `MultinomialNB`: try different values for the `alpha` parameter (this "smoothing" parameter will try to correct for features that never occur with a particular label -- think of the 'pizza problem' from the very first tutorial). You can try as many values as you like, but at least try 0.0 (no smoothing), 1.0 (default), and several values between 0 and 1 or above 1. 
    - `RandomForestClassifier`: try different values for `n_estimators` (this influences the number of decision tree models that are trained behind-the-scenes; the random forest then aggregates the different models to make a final prediction), below and above the default value (100). 
  * 2) **final model** --> using the best settings that you determined, train the model again and report results on the test set
* N.B.: for now, it's sufficient to work with just accuracy scores, you'll do a more in-depth evaluation next week!
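
The two steps can be sketched as below. This is a minimal illustration on synthetic count data, not on the assignment data: all `demo_` names and the candidate values are assumptions. `alpha=0.0` is left out of the sketch only because it can trigger numerical warnings; the assignment still asks you to try it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# Synthetic data: non-negative count features and labels that depend on two of them
rng = np.random.default_rng(0)
demo_X = rng.integers(0, 3, size=(200, 20))
demo_y = (demo_X[:, 0] + demo_X[:, 1] > 2).astype(int)

# 70% training, 10% validation, 20% test
demo_X_train, demo_y_train = demo_X[:140], demo_y[:140]
demo_X_val, demo_y_val = demo_X[140:160], demo_y[140:160]
demo_X_test, demo_y_test = demo_X[160:], demo_y[160:]

# 1) Model selection: fit every variant on the training set, score it on the validation set
candidates = {}
for alpha in [0.01, 0.1, 0.5, 1.0, 2.0]:
    candidates[f"MultinomialNB(alpha={alpha})"] = MultinomialNB(alpha=alpha)
for n_estimators in [10, 50, 100, 200]:
    candidates[f"RandomForest(n_estimators={n_estimators})"] = RandomForestClassifier(
        n_estimators=n_estimators, random_state=0)

val_scores = {}
for name, model in candidates.items():
    model.fit(demo_X_train, demo_y_train)
    val_scores[name] = model.score(demo_X_val, demo_y_val)  # accuracy
    print(f"{name}: validation accuracy = {val_scores[name]:.2f}")

# 2) Final model: retrain the best variant and report accuracy on the test set
best_name = max(val_scores, key=val_scores.get)
final_model = candidates[best_name].fit(demo_X_train, demo_y_train)
test_accuracy = final_model.score(demo_X_test, demo_y_test)
print(f"Best variant: {best_name}, test accuracy = {test_accuracy:.2f}")
```

The test set is touched only once, after the best variant has been chosen, so the reported test accuracy is not influenced by the selection itself.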

In [None]:
import numpy as np
import sklearn.metrics  # makes sklearn.metrics.* available for the evaluation below
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

#multi = MultinomialNB()
#multi.fit(x_train, y_train)
#print(multi.predict(x_test))

#forest = RandomForestClassifier(max_depth=2, random_state=0)
#forest.fit(x_train, y_train)
#print(forest.predict(x_test))






In [None]:
## COPY & COMPLETE

test_set_text = []
test_set_context = []  # only for span-level tasks
test_set_labels = []

# --- below here, add code that:
# --- 1) goes through all items in the `features` list that you created in Step 1.2
# --- 2) filters so that you only look at items that are in the TEST SET
# --- 3a) adds the text of each document or span to `test_set_text` as a string (tip: use `" ".join(feature_list)` to convert from list to string)
# --- 3b) for span-based tasks, adds the full sentence context (i.e. text_prev + text_span + text_after) as a string to `test_set_context`
# --- 3c) adds the correct label for each document or span to `test_set_labels`
def as_text(tokens):
    # Token lists are joined with spaces; values that are already strings are kept as-is
    return tokens if isinstance(tokens, str) else " ".join(tokens)

for i in x_test:  # `x_test` holds indices into `data_in`
    item = data_in[i]
    test_set_text.append(as_text(item['text_span']))
    test_set_context.append(" ".join(as_text(item[key]) for key in ('text_prev', 'text_span', 'text_after')))
    test_set_labels.append(item['label'])

# --- below here, add code that
# --- 4) takes the best model and settings that you determined last week, trains it again, and produces predicted labels (with `model.predict()`) for the test set
train_vectors = vectors[x_train]  # rows of the feature matrix that belong to the training items
train_labels = [data_in[i]['label'] for i in x_train]
test_vectors = vectors[x_test]

forest = RandomForestClassifier(max_depth=2, random_state=0)
forest.fit(train_vectors, train_labels)
predicted_labels = forest.predict(test_vectors)

#########

import pandas as pd

# --- run one of the two lines below as appropriate (delete or comment out the other one)
df = pd.DataFrame({"text": test_set_text, "context": test_set_context, "gold_labels": test_set_labels, "predicted_labels": predicted_labels})
#df = pd.DataFrame({"text": test_set_text, "gold_labels": test_set_labels, "predicted_labels": predicted_labels})

# Calculate precision, recall, and F-scores
print(sklearn.metrics.precision_recall_fscore_support(test_set_labels, predicted_labels, average=None))

# df.to_csv("test_set_predictions.csv")

(array([0.38095238, 0.        , 0.        ]), array([1., 0., 0.]), array([0.55172414, 0.        , 0.        ]), array([8, 6, 7]))




## Report writing

* Write a report (using Google Docs or similar) of 1-2 pages in which you summarize your findings. It should be structured like this:
  
  * 1) _Introduction_ --> summarize your problem (the topic and the type of annotations that you use)
  * 2) _Methods_:
    * 2.1 _Dataset_: summarize the dataset (number of annotators, number of annotations, number of labels, frequencies of the labels)
    * 2.2 _Models_: describe the models and the variants that you tried
  * 3) _Results_:
    * Accuracy scores for each of the variants that you tried


Next week, you'll update and expand the report with more detailed evaluation, error analysis and a conclusion. 