## Annotation for Machine Learning | Mock Exam | Practical Session

* This notebook contains questions that are similar to the ones that you will get during the practical part of the exam
* You should be able to complete the questions in ~1 hour (for the real exam, you'll have two hours for this part plus the practical part)
* In the real exam, you will have full internet access, so you can look up anything you need
* Each assignment carries a number of points; the maximum number of points is 100
* If you get stuck on one of the assignments, it is very important to WRITE DOWN what you tried and which problems you ran into (create extra text cells for this if necessary). For an incomplete assignment with an explanation of what went wrong, you can still get part of the points for that assignment.

### Data
* Use the command below to download the data (tip: if you completed all the lab assignments, this dataset should look familiar!)

In [None]:
!wget https://gitlab.com/gosseminnema/annotation4ml-2022/-/raw/main/labs/week_01/trainset2.txt

--2022-06-19 11:22:14--  https://gitlab.com/gosseminnema/annotation4ml-2022/-/raw/main/labs/week_01/trainset2.txt
Resolving gitlab.com (gitlab.com)... 172.65.251.78, 2606:4700:90:0:f22e:fbec:5bed:a9b9
Connecting to gitlab.com (gitlab.com)|172.65.251.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 811861 (793K) [text/plain]
Saving to: ‘trainset2.txt’


2022-06-19 11:22:15 (7.20 MB/s) - ‘trainset2.txt’ saved [811861/811861]



## Assignment 1: Working with JSONL _[15 points]_ 
* Convert the dataset into JSONL format. Write **two** output files: `dataset.jsonl` (containing ALL reviews) and `dataset.music.jsonl` (containing ONLY reviews for music). Each line in this file should look like the following:
```json
{"review_id": 575, "product_category": "music", "review_sentiment": "neg", "review_text": "the cd came as promised ..."}
```
(N.B.: "review_id" corresponds to the name of the .txt file given in the 3rd column of the original data file, converted to an integer without the .txt extension.)
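
The JSONL convention described above (one JSON object per line) can be sketched as a small round trip. This is only an illustration: the two records are made up, and an in-memory `io.StringIO` buffer stands in for a real file.

```python
import io
import json

# Hypothetical records in the target format (values are made up for illustration).
records = [
    {"review_id": 1, "product_category": "music", "review_sentiment": "pos", "review_text": "great album"},
    {"review_id": 2, "product_category": "books", "review_sentiment": "neg", "review_text": "too long"},
]

# Write: one json.dumps() result per line, each terminated by a newline.
buffer = io.StringIO()  # stands in for a file opened with open(..., "w", encoding="utf-8")
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read back: every non-empty line is parsed independently of the others.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

# Filtering on a field gives the subset that would go into a second file.
music_only = [r for r in loaded if r["product_category"] == "music"]
print(len(loaded), len(music_only))
```

Because each line is a complete JSON document, a file in this format can be processed line by line without loading everything into memory.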

In [None]:
# WRITE YOUR CODE HERE
import json

dict_lines_all = []
dict_lines_music = []
with open("trainset2.txt", encoding="utf-8") as f:
  for line in f:
    columns = line.split()
    # take the part before the dot as the ID; str.rstrip(".txt") strips a set of characters, not a suffix
    line_dict = {"review_id": int(columns[2].split(".")[0]), "product_category": columns[0], "review_sentiment": columns[1], "review_text": " ".join(columns[3:])}
    dict_lines_all.append(line_dict)
    if columns[0] == "music":
      dict_lines_music.append(line_dict)

with open("dataset.jsonl", "w", encoding="utf-8") as f_out:
  for line in dict_lines_all:
    f_out.write(json.dumps(line) + "\n")

with open("dataset.music.jsonl", "w", encoding="utf-8") as f_out:
  for line in dict_lines_music:
    f_out.write(json.dumps(line) + "\n")

In [None]:
dict_lines_music[:10]

[{'product_category': 'music',
  'review_id': 575,
  'review_sentiment': 'neg',
  'review_text': "the cd came as promised and in the condition promised . i 'm very satisfied"},
 {'product_category': 'music',
  'review_id': 737,
  'review_sentiment': 'pos',
  'review_text': 'sometimes i like to look up and see what i can find on some of my favorite bands from the 70 \'s and starz being one of them . i was lucky enough to see the band play a few times around 77 and 78. another band that took time for their fans ! ! ! ! ! ! ! i read through the many reviews of posted here on amazon and not sure if there is much that i can add , also great to see that there were so many out there that enjoyed the band and their music . i would recommend reading about each of the starz cd \'s and listening to the sound files if available and deciding for yourself....... . i remember i liked this cd so much and the logo as well , i even hand - made a stencil and made my own starz shirt , i think i used some 

## Assignment 2: (Re-)defining labels and guidelines _[25 points]_

* Look at the first 10 lines of `dataset.music.jsonl`.
* In the current dataset, sentiment is annotated only as "positive" or "negative" overall. The scheme could be made more fine-grained by instead judging specific aspects of the product as "positive", "neutral", or "negative".
* An example of an aspect category could be "price/quality ratio". Based on this category, we could assign labels to each review such as "price/quality ratio: negative" or "price/quality ratio: positive".

**Exercises**: 
1. Based on the first 10 examples, identify three possible aspect categories. For each category, write:
  * a general description: what is this aspect about? Why is it relevant? (1-3 sentences)
  * a guideline for how to decide if a particular review text implies a positive, negative, or neutral sentiment for this aspect (3-5 sentences)
  * an example of an edge case (= difficult case) from one of the 10 reviews
  * _[5 points per aspect category]_
2. Annotate the first 10 reviews based on the three categories _[10 points]_. 

> WRITE YOUR ANSWERS IN THIS CELL

> Different answers possible; potential aspect categories include:
> - condition of physical medium ("the CD came as promised")
> - attitude towards the band ("another band that took time for their fans")
> - quality of the music
> - price / quality ratio ("quite a bargin") 
> 
> Examples of edge cases:
> - "came as promised and in the condition promised" --> is this positive or neutral with respect to the quality of the medium?
> - "all songs pretty much sound the same" --> negative or neutral with respect to music quality? 
>
> Five points per proposed category, minus one point for every requirement that is missing or incomplete
>
> One point per completely annotated review, provided that the annotated labels are more or less plausible
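
One possible way to store such aspect-level annotations is sketched below. The aspect names, the `aspect_labels` field, and the `make_annotation` helper are all hypothetical, and the example label is illustrative rather than a reference answer. The sketch assumes that an aspect a review does not mention is stored as `None`, so that "not mentioned" stays distinct from "neutral".

```python
import json

# Hypothetical aspect inventory and label set; adapt to your own guidelines.
ASPECTS = ("medium_condition", "band_attitude", "music_quality")
VALID_LABELS = {"positive", "neutral", "negative"}


def make_annotation(review_id, aspect_labels):
    """Build one annotation record, rejecting unknown aspects or labels."""
    unknown = set(aspect_labels) - set(ASPECTS)
    if unknown:
        raise ValueError(f"unknown aspects: {sorted(unknown)}")
    invalid = {a: v for a, v in aspect_labels.items() if v not in VALID_LABELS}
    if invalid:
        raise ValueError(f"invalid labels: {invalid}")
    # Aspects that were not annotated are stored as None (serialized as null).
    return {
        "review_id": review_id,
        "aspect_labels": {aspect: aspect_labels.get(aspect) for aspect in ASPECTS},
    }


# Illustrative only: the label chosen here is one defensible reading of an edge case.
annotation = make_annotation(575, {"medium_condition": "positive"})
print(json.dumps(annotation))
```

Keeping `review_id` in every record makes it straightforward to join the aspect annotations with the lines of `dataset.music.jsonl` later on.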



## Assignment 3: Calculating inter-annotator agreement _[20 points]_

* Download the extra file below. In this file, you'll find annotations (sentiment: positive/negative) for the first 20 music reviews from a second annotator

In [None]:
!wget https://gitlab.com/gosseminnema/annotation4ml-2022/-/raw/main/mock_exam/dataset.music.ann2.jsonl

--2022-06-19 11:34:32--  https://gitlab.com/gosseminnema/annotation4ml-2022/-/raw/main/mock_exam/dataset.music.ann2.jsonl
Resolving gitlab.com (gitlab.com)... 172.65.251.78, 2606:4700:90:0:f22e:fbec:5bed:a9b9
Connecting to gitlab.com (gitlab.com)|172.65.251.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15150 (15K) [text/plain]
Saving to: ‘dataset.music.ann2.jsonl’


2022-06-19 11:34:32 (107 MB/s) - ‘dataset.music.ann2.jsonl’ saved [15150/15150]



* In the code box below, write code that takes the first 20 music review annotations from both annotators (i.e. from the file you created in Assignment 1 and from the new file) and calculates Cohen's kappa using the `cohen_kappa_score` function from Scikit-Learn _[15 points]_

In [None]:
from sklearn.metrics import cohen_kappa_score

annotations_1 = []
annotations_2 = []

with open("dataset.music.jsonl", encoding="utf-8") as f:
  for i, line in enumerate(f):
    if i > 19:
      break
    data = json.loads(line)
    annotations_1.append(data["review_sentiment"])


with open("dataset.music.ann2.jsonl", encoding="utf-8") as f:
  for line in f:
    data = json.loads(line)
    annotations_2.append(data["review_sentiment"])

print(annotations_1)
print(annotations_2)

cohen_kappa_score(annotations_1, annotations_2)

['neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos']
['pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos']


0.6039603960396039

* Write down the agreement score. Which agreement level does this correspond to? (slight, fair, moderate, ...) _[5 points]_

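> We get a $\kappa$ score of 0.60. On the commonly used Landis and Koch scale (0.41-0.60 "moderate", 0.61-0.80 "substantial"), this sits right on the border between "moderate" and "substantial" agreement.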
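
The score can also be reproduced by hand, which makes the definition $\kappa = (p_o - p_e) / (1 - p_e)$ concrete. The sketch below copies the two label lists from the printed output above; the function names are my own, and the mapping to agreement levels assumes the Landis and Koch bands applied to the score rounded to two decimals.

```python
from collections import Counter

annotations_1 = ['neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos',
                 'neg', 'neg', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos']
annotations_2 = ['pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos', 'pos',
                 'neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos']


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two annotators' label proportions, summed over labels.
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def agreement_level(kappa):
    """Map kappa (rounded to two decimals) to a Landis and Koch agreement level."""
    k = round(kappa, 2)
    if k < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, name in bands:
        if k <= upper:
            return name
    raise ValueError("kappa cannot exceed 1")


kappa = cohen_kappa(annotations_1, annotations_2)
print(round(kappa, 4), agreement_level(kappa))
```

Here the annotators agree on 16 of 20 reviews ($p_o = 0.8$), and the label distributions (9 vs. 11 positives) give $p_e = 0.495$, so $\kappa = 0.305 / 0.505 \approx 0.604$.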

## Assignment 4: Training a simple model _[20 points]_

* In the cell below, complete the code for training a simple machine learning model for predicting sentiment (positive/negative) for the music reviews from `dataset.music.jsonl`.
* Use the first 100 examples as training data and the rest as testing data. You don't need to use a validation set for this assignment. It is also not necessary to split the data. 
* Run the experiment with different options for TfidfVectorizer. Write down the scores you get for the baseline version and each experiment.  
  * baseline: default settings
  * experiment 1: `stop_words="english"`
  * experiment 2: `ngram_range=(1,2)`
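
The three configurations listed above could be run in a single loop, as in the following sketch. It uses a tiny made-up corpus instead of the review data, and the variable names are hypothetical. The vectorizer is fitted on the training texts only and then applied to the test texts, so that the vocabulary and IDF weights do not depend on the test set; scores obtained this way can differ from those of a vectorizer fitted on the whole corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, only to make the sketch runnable.
train_texts = [
    "the cd is great",
    "i love the songs on this album",
    "the sound is terrible",
    "a boring album and a waste of money",
]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["great songs", "terrible and boring"]
test_labels = ["pos", "neg"]

# One entry per experiment: keyword arguments passed to TfidfVectorizer.
configs = {
    "baseline": {},
    "experiment 1": {"stop_words": "english"},
    "experiment 2": {"ngram_range": (1, 2)},
}

results = {}
for name, kwargs in configs.items():
    vectorizer = TfidfVectorizer(**kwargs)
    X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF on training data
    X_test = vectorizer.transform(test_texts)        # reuse them for the test data
    model = MultinomialNB().fit(X_train, train_labels)
    results[name] = {
        "n_features": X_train.shape[1],
        "accuracy": model.score(X_test, test_labels),
    }
    print(name, results[name])
```

Comparing `n_features` across the runs shows what each option does: removing stop words shrinks the vocabulary, while adding bigrams enlarges it.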

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


# lists of strings, each string is a music review
corpus = []
# corpus_train = []
# corpus_test = []


# lists of labels, each label is the string "pos" or "neg"
labels = []
# labels_train = []
# labels_test = []


# === DATA PREPARATION [10 points] ===
# --- write code to read the text and labels from dataset.music.jsonl ---
with open("dataset.music.jsonl", encoding="utf-8") as f:
  for line in f:
    data = json.loads(line)
    corpus.append(data["review_text"])
    labels.append(data["review_sentiment"])

labels_train = labels[:100]
labels_test = labels[100:]


# --- write code to vectorize the corpus ---
# vectorizer = TfidfVectorizer()  # baseline
# vectorizer = TfidfVectorizer(stop_words="english")  # exp1
vectorizer = TfidfVectorizer(ngram_range=(1,2))  # exp2
all_vectors = vectorizer.fit_transform(corpus) # vectorize first the entire corpus, then split into train/test
vectors_train = all_vectors[:100]
vectors_test = all_vectors[100:]

# --- train the model & make predictions ---
model = MultinomialNB()
model.fit(vectors_train, labels_train)
predictions = model.predict(vectors_test)
print(predictions)

# === BASELINE EVALUATION [5 points] ===
# --- calculate accuracy score
acc = model.score(vectors_test, labels_test)
print(acc)

# --- calculate F1 score (with option average="micro")
from sklearn.metrics import f1_score
f1 = f1_score(labels_test, predictions, average="micro")
print(f1)

# === ADDITIONAL EXPERIMENTS [5 points] ===

['pos' 'neg' 'neg' 'pos' 'pos' 'neg' 'pos' 'neg' 'neg' 'neg' 'neg' 'neg'
 'neg' 'neg' 'neg' 'neg' 'pos' 'neg' 'neg' 'neg' 'pos' 'neg' 'neg' 'pos'
 'pos' 'neg' 'neg' 'neg' 'pos' 'neg' 'pos' 'neg' 'pos' 'pos' 'neg' 'neg'
 'pos' 'pos' 'pos' 'pos' 'neg' 'neg' 'pos' 'neg' 'pos' 'pos' 'pos' 'pos'
 'pos' 'pos' 'neg' 'neg' 'neg' 'pos' 'pos' 'neg' 'neg' 'neg' 'pos' 'neg'
 'pos' 'neg' 'pos' 'neg' 'neg' 'pos' 'pos' 'neg' 'neg']
0.6231884057971014
0.6231884057971014


> WRITE DOWN YOUR SCORES HERE

> - Baseline: accuracy and micro-averaged F1 are both 0.61.
> - Experiment 1: ACC/F1 0.67 --> removing English stop words seems to help a bit in this case
> - Experiment 2: ACC/F1 0.62 --> very slight improvement over the baseline from including bigrams
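
Accuracy and micro-averaged F1 are identical in every run, and this is expected: when each item has exactly one gold label and one predicted label, every error counts once as a false positive (for the predicted class) and once as a false negative (for the gold class), so micro-averaged precision, recall, and F1 all reduce to accuracy. The sketch below checks this in plain Python on made-up labels; the function names are my own.

```python
import math


def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold, pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label:   # predicted this class, gold is another class
                fp += 1
            elif g == label:   # gold is this class, predicted another class
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 3 of 5 predictions are correct.
gold = ["pos", "neg", "neg", "pos", "neg"]
pred = ["pos", "neg", "pos", "pos", "pos"]

acc = accuracy(gold, pred)
f1 = micro_f1(gold, pred)
print(acc, f1)
```

Macro-averaged F1, which averages the per-class F1 scores, does not have this property and can be more informative when the classes are imbalanced.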