<div style="text-align: center;" >
<h1 style="margin-top: 0.2em; margin-bottom: 0.1em;">Assignment 3</h1>
</div>
<br>

### Install requirements. 

The following cell contains all the necessary dependencies needed for this task. If you run the cell everything will be installed. 


* [`transformers`](https://huggingface.co/) is a Python package for creating and working with transformers. [Here](https://huggingface.co/docs) is the documentation of `transformers`.
* [`pandas`](https://pandas.pydata.org/docs/index.html) is a Python package for creating and working with tabular data. [Here](https://pandas.pydata.org/docs/reference/index.html) is the documentation of `pandas`.

In [1]:
! pip install transformers
! pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple


You may need to restart the Kernel after installing the dependencies!

### Import requirements
The cell below imports all necessary dependancies. Make sure they are installed (see cell above).

In [2]:
import pandas as pd
from transformers import pipeline
import numpy as np
import matplotlib.pyplot as plt

### Exercise 1

In the following exercise you will use the emotion classification model [LEIA](https://huggingface.co/LEIA/LEIA-base) to classify the emotion of the sentences in the [enISEAR dataset](https://www.romanklinger.de/data-sets/). You read more about the `LEIA-base` in the [documentation](https://huggingface.co/LEIA/LEIA-base) and learn about the implementation details from this [paper](https://arxiv.org/abs/2304.10973).

#### 1.1 LEIA introduction
* Load the `LEIA-base` model and tokenize either as a [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines), or you can load the model and the tokenizer [directly](https://huggingface.co/docs/transformers/autoclass_tutorial) and implement the classification steps by yourself. LEIA only accepts sentences with up to 128 tokens. Make sure that your tokenizer [truncates](https://huggingface.co/docs/transformers/pad_truncation) longer sentences to this lenght to avoid errors.
* What are the possible labels the model can predict?
* Input the sentence `Today is a great day.` to the model, and predict the emotion of the sentence.

In [5]:
!pip install --upgrade ipywidgets

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple
Collecting ipywidgets
  Using cached ipywidgets-8.1.2-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.10 (from ipywidgets)
  Using cached widgetsnbextension-4.0.10-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.10 (from ipywidgets)
  Using cached jupyterlab_widgets-3.0.10-py3-none-any.whl.metadata (4.1 kB)
Using cached ipywidgets-8.1.2-py3-none-any.whl (139 kB)
Using cached jupyterlab_widgets-3.0.10-py3-none-any.whl (215 kB)
Using cached widgetsnbextension-4.0.10-py3-none-any.whl (2.3 MB)
Installing collected packages: widgetsnbextension, jupyterlab-widgets, ipywidgets
Successfully installed ipywidgets-8.1.2 jupyterlab-widgets-3.0.10 widgetsnbextension-4.0.10


In [4]:
# import pipeline
pipe_leia = pipeline('text-classification', model="LEIA/LEIA-base", tokenizer="LEIA/LEIA-base", truncation=True, max_length=128)

pytorch_model.bin:   0%|          | 0.00/540M [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


tokenizer_config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/843k [00:00<?, ?B/s]

bpe.codes:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/22.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/167 [00:00<?, ?B/s]

In [57]:
results = pipe_leia("Today is a great day")

In [63]:
results[0]['label']

'Happiness'

In [7]:
pipe_leia.model.config.id2label

{0: 'Sadness', 1: 'Affection', 2: 'Fear', 3: 'Happiness', 4: 'Anger'}

#### 1.2 enISEAR dataset
* Load the enISEAR dataset.
* What are the possible labels in the dataset? (the `Prior_Emotion` column stores the actual label)
* The last 7 columns store the labels given by the annotators. Create a new column `Annotator_Majority_Label`, which stores the emotion with the highest annotator score (i.e. the emotion the highest number of annotators chose for the given sentence).
* What percent of the sentences were correctly classified by the (majority vote of the) annotators?

In [109]:
data = pd.read_csv("enISEAR.tsv", sep='\t')

In [110]:
data.columns

Index(['Sentence_id', 'Prior_Emotion', 'Sentence', 'Temporal_Distance',
       'Intensity', 'Duration', 'Gender', 'City', 'Country', 'Worker_id',
       'Time', 'Anger', 'Disgust', 'Fear', 'Guilt', 'Joy', 'Sadness', 'Shame'],
      dtype='object')

In [111]:
data.Prior_Emotion.values

array(['Fear', 'Shame', 'Guilt', ..., 'Shame', 'Shame', 'Shame'],
      dtype=object)

In [112]:
data['Annotator_Majority_Label'] = data.iloc[:,-7:].idxmax(axis=1)

In [113]:
data['Annotator_Majority_Label'].head(2)

0    Sadness
1      Guilt
Name: Annotator_Majority_Label, dtype: object

#### 1.3 Classification
* Drop the rows from the enISEAR dataset, where the `Prior_Emotion` is not one of `Fear`, `Sadness`, `Anger` or `Joy`
* Use `Leia` to classify the emotion of each remaining sentence in the dataset, and add a column `Leia_Label` to store the predicted classes
* Now remove `I felt ... ` from the beginning of each sentence, and rerun the classfication. Store your results in a column named `Leia_Label_Clean`
* Where the model predicted `Happiness` or `Affection`, change the prediction to `Joy` to match the dataset's labels (for both columns -> `Leia_Label` and `Leia_Label_Clean`)

In [114]:
drop_emotions = ['Fear', 'Sadness', 'Anger', 'Joy']
data = data[~data.Prior_Emotion.isin(drop_emotions)]
len(data)

429

In [55]:
data.head()

Unnamed: 0,Sentence_id,Prior_Emotion,Sentence,Temporal_Distance,Intensity,Duration,Gender,City,Country,Worker_id,Time,Anger,Disgust,Fear,Guilt,Joy,Sadness,Shame,Annotator_Majority_Label
1,597,Shame,I felt ... one Christmas as one of our patient...,Y,I,Dom,Fl,Dulwich,GBR,86,11/26/2018 06:52:02,1,0,0,4,0,0,0,Guilt
2,282,Guilt,I felt ... because I could not help a friend w...,M,Mi,Dom,Fl,Linlithgow,GBR,83,11/21/2018 18:45:00,0,0,0,4,0,1,0,Guilt
3,171,Disgust,I felt ... when I read that hunters had killed...,Y,Mi,H,Ml,Bristol,GBR,87,11/28/2018 00:55:11,3,0,0,0,0,2,0,Anger
5,181,Disgust,I felt ... when I stepped in dog shit on the w...,M,I,H,Fl,Shepherds Bush,GBR,90,11/28/2018 21:42:00,0,5,0,0,0,0,0,Disgust
7,642,Shame,I felt ... when my daughter was rude to my wife.,D,N,Fm,Ml,Chelmsford,GBR,91,11/26/2018 23:35:24,3,0,0,0,0,2,0,Anger


In [126]:
#data = data.drop(['Leia_Label'], axis=1)

In [115]:
results = [pipe_leia(sent) for sent in data['Sentence']]

In [116]:
data['Leia_Label'] = [d["label"] for row in results for d in row]

In [124]:
data['Sentence'] = data['Sentence'].str.replace('^I felt ', '', regex=True)

In [127]:
data['Leia_Label_Clean'] = [d["label"] for row in [pipe_leia(sent) for sent in data['Sentence']] for d in row]

In [147]:
data[['Leia_Label', 'Leia_Label_Clean']] = data[['Leia_Label', 'Leia_Label_Clean']].replace({'Happiness': 'Joy', 'Affection': 'Joy'})

In [148]:
data[['Leia_Label', 'Leia_Label_Clean']].values[6:9]

array([['Fear', 'Anger'],
       ['Joy', 'Joy'],
       ['Joy', 'Joy']], dtype=object)

#### 1.4 Evaluate Performance

First, let's calculate the accuracy for the two classifiers, plot the results. Hint: You can do bar plots to compare the values

Next calculate the precision of the `"Joy"` class for the data.
This is calculated as follows:
$
\begin{align}
    precision = \frac{TP}{TP + FP}
\end{align}
$
*Note: Here the Positive samples are the one with the the class `"Joy"`*

Now calculate the recall score. This is done by:
$
\begin{align}
    recall = \frac{TP}{TP + FN}
\end{align}
$
*Note: Here the Positive samples are the one with the the class `"Joy"`*

Last, calculate the [F1 score](https://towardsdatascience.com/the-f1-score-bec2bbc38aa6) of the joy class. The F1 score is calculated as:

$
\begin{align}
    F_1 = 2 * \frac{precision * recall}{precision + recall}
\end{align}
$

This can also be done for the other classes `'Sadness', 'Guilt', 'Anger', 'Disgust', 'Fear', 'Joy', 'Shame'`.

Now, try to calculate the mean f1 score over all classes for each of the classes.

#### 1.5 Interpretation

* Discuss your results. 
* Are the models accurately predicting human emotions?
* Which approach seems to work better? Why?
* What kind of additional preprocessing could we perform to improve the model's predictions?

### Exercise 2

#### 2.1 Data annotation
* In the following exercise you will need to test emotion detection methods on data from [Vent](https://www.vent.co/), a website where users talk about their feelings. 
* On GitHub, in your `a03` folder you can find 3 files. First open `sample_for_labeling.csv`, and label each row according the emotion the sentence expresses. The possible classes are: 0 (Sadness), 1 (Affection), 2 (Fear), 3 (Happiness), 4 (Anger). ***Important: Make sure to upload the labeled data with your submission.***
* After you finished labeling the data load it as a pandas dataframe. Also load `sample_with_labels.csv` as a dataframe, which contains the actual labels of the data.
* Merge the two dataframes, and rename the column containing your labels as `label_human`.
* Rename the class ids (0, 1, 2, ...) stored in the `label`, and `label_human` columns to the class names (Sadness, Affection, ...).

#### 2.2 LEIA
* Use the [LEIA](https://huggingface.co/LEIA/LEIA-base) model introduced in the previous exercise to classify the sentences and store the results in a column named `label_leia`.

#### 2.3 Analysis
* Look at the performance of the the model, as well as the quality of your annotation using the metrics introduced in part 1 (accuracy, precision, recall) or other metrics you find interesting. Create informative visualizations to aid the comparison.
* Discuss your results. 
* Are the models accurately predicting human emotions?
* Which approach seems to work better? Why?