# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni



**Keywords**: Sexism Detection, Multi-class Classification, RNNs, Transformers, Huggingface




# Contact

For any doubts, questions, issues, or requests for help, you can always contact us at the following email addresses:



Teaching Assistants:



- Federico Ruggeri -> federico.ruggeri6@unibo.it

- Eleonora Mancini -> e.mancini@unibo.it



Professor:

- Paolo Torroni -> p.torroni@unibo.it

# Introduction

You are asked to address the [EXIST 2023 Task 1](https://clef2023.clef-initiative.eu/index.php?page=Pages/labs.html#EXIST) on sexism detection.

## Problem Definition

The systems have to decide whether or not a given tweet contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).



### Examples:



**Text**: *Can’t go a day without women womening*



**Label**: Sexist



**Text**: *''Society's set norms! Happy men's day though!#weareequal''*



**Label**: Not sexist

# [Task 1 - 1.0 points] Corpus



We have prepared a small version of the EXIST dataset in our dedicated [GitHub repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material/tree/main/2024-2025/Assignment%201/data).



Check the `A1/data` folder. It contains 3 `.json` files representing `training`, `validation` and `test` sets.



The three sets are slightly unbalanced, with a bias toward the `Non-sexist` class.




### Dataset Description

- The dataset contains tweets in both English and Spanish.

- There are labels for multiple tasks, but we are focusing on **Task 1**.

- For Task 1, soft labels are assigned by six annotators.

- The labels for Task 1 represent whether the tweet is sexist ("YES") or not ("NO").
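
Because six annotators vote on every tweet, a 3-3 tie is possible, so a hard label does not always exist. As a minimal sketch (the helper names `soft_label` and `hard_label` are illustrative and not part of the dataset tooling), the annotator votes can be turned into a soft label, taken here as the fraction of "YES" votes, and into a majority-vote hard label:

```python
from collections import Counter
from typing import List, Optional


def soft_label(votes: List[str]) -> float:
    """Fraction of annotators who answered "YES"."""
    return votes.count("YES") / len(votes)


def hard_label(votes: List[str]) -> Optional[str]:
    """Majority vote; returns None when "YES" and "NO" are tied."""
    counts = Counter(votes)
    if counts["YES"] == counts["NO"]:
        return None
    return "YES" if counts["YES"] > counts["NO"] else "NO"


# labels_task1 of tweet 203260 from the dataset
votes = ["YES", "YES", "YES", "NO", "YES", "YES"]
print(soft_label(votes))  # 5 of the 6 annotators said YES
print(hard_label(votes))  # YES
print(hard_label(["YES", "NO", "YES", "NO", "YES", "NO"]))  # None (3-3 tie)
```

Returning `None` for ties is only one possible convention; any sentinel value works as long as tied items are filtered out afterwards.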












### Example

    "203260": {
        "id_EXIST": "203260",
        "lang": "en",
        "tweet": "ik when mandy says “you look like a whore” i look cute as FUCK",
        "number_annotators": 6,
        "annotators": ["Annotator_473", "Annotator_474", "Annotator_475", "Annotator_476", "Annotator_477", "Annotator_27"],
        "gender_annotators": ["F", "F", "M", "M", "M", "F"],
        "age_annotators": ["18-22", "23-45", "18-22", "23-45", "46+", "46+"],
        "labels_task1": ["YES", "YES", "YES", "NO", "YES", "YES"],
        "labels_task2": ["DIRECT", "DIRECT", "REPORTED", "-", "JUDGEMENTAL", "REPORTED"],
        "labels_task3": [
            ["STEREOTYPING-DOMINANCE"],
            ["OBJECTIFICATION"],
            ["SEXUAL-VIOLENCE"],
            ["-"],
            ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],
            ["OBJECTIFICATION"]
        ],
        "split": "TRAIN_EN"
    }

### Instructions

1. **Download** the `A1/data` folder.

2. **Load** the three JSON files and encode them as pandas dataframes.

3. **Generate hard labels** for Task 1 using majority voting and store them in a new dataframe column called `hard_label_task1`. Items without a clear majority will be removed from the dataset.

4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.

5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `hard_label_task1`.

6. **Encode the `hard_label_task1` column**: Use 1 to represent "YES" and 0 to represent "NO".

In [None]:
import os
import requests
from pathlib import Path
import pandas as pd
import json
import zipfile
import numpy as np
from tqdm import tqdm  # for progress bar
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import precision_score, recall_score, f1_score
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
# Base URL of the raw data files in the course GitHub repository
base_url = "https://raw.githubusercontent.com/nlp-unibo/nlp-course-material/refs/heads/main/2024-2025/Assignment%201/data/"

# List of JSON file names in the A1/data directory
json_files = ["test.json", "training.json", "validation.json"]

# Local directory to save the files
local_dir = Path("Assignment%201/data")
local_dir.mkdir(parents=True, exist_ok=True)


# Download each file
for file_name in json_files:
    url = base_url + file_name
    response = requests.get(url)

    if response.status_code == 200:
        with open(local_dir / file_name, "wb") as file:
            file.write(response.content)
        print(f"Downloaded {file_name}")
    else:
        print(f"Failed to download {file_name}")


# Load JSON files into pandas DataFrames
dataframes = {}
for file_name in json_files:
    with open(local_dir / file_name, "r", encoding="utf-8") as file:
        data = json.load(file)
        dataframes[file_name] = pd.DataFrame(data)


original_train_df = dataframes['training.json']
original_validation_df = dataframes['validation.json']
original_test_df = dataframes['test.json']

Downloaded test.json
Downloaded training.json
Downloaded validation.json


In [None]:
def determine_majority(response_list):
    """Return 1 for a "YES" majority, 0 for a "NO" majority, and 2 for a tie."""
    yes_count = response_list.count("YES")
    no_count = response_list.count("NO")

    if yes_count > no_count:
        return 1
    elif no_count > yes_count:
        return 0
    else:
        return 2


def transform_df(df):
    # Swap rows and columns so that each tweet becomes a row
    df = df.T
    # Apply majority voting to task 1
    df['hard_label_task1'] = df['labels_task1'].apply(determine_majority)
    # Keep only English tweets whose vote did not end in a tie
    df = df[df['lang'] == 'en']
    df = df[df['hard_label_task1'] != 2]
    # Drop unnecessary columns
    df = df[['id_EXIST', 'lang', 'tweet', 'hard_label_task1']]

    return df

In [None]:
original_train_df = transform_df(original_train_df)
original_validation_df = transform_df(original_validation_df)
original_test_df = transform_df(original_test_df)
original_train_df

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
200002,200002,en,Writing a uni essay in my local pub with a cof...,1
200003,200003,en,@UniversalORL it is 2021 not 1921. I dont appr...,1
200006,200006,en,According to a customer I have plenty of time ...,1
200007,200007,en,"So only 'blokes' drink beer? Sorry, but if you...",1
200008,200008,en,New to the shelves this week - looking forward...,0
...,...,...,...,...
203256,203256,en,idk why y’all bitches think having half your a...,1
203257,203257,en,This has been a part of an experiment with @Wo...,1
203258,203258,en,"""Take me already"" ""Not yet. You gotta be ready...",1
203259,203259,en,@clintneedcoffee why do you look like a whore?...,1


# [Task 2 - 0.5 points] Data Cleaning

In the context of tweets, we have noisy and informal data that often includes unnecessary elements like emojis, hashtags, mentions, and URLs. These elements may interfere with the text analysis.




### Instructions

- **Remove emojis** from the tweets.

- **Remove hashtags** (e.g., `#example`).

- **Remove mentions** such as `@user`.

- **Remove URLs** from the tweets.

- **Remove special characters and symbols**.

- **Remove specific quote characters** (e.g., curly quotes).

- **Perform lemmatization** to reduce words to their base form.

In [None]:
train_df = original_train_df.copy()
validation_df = original_validation_df.copy()
test_df = original_test_df.copy()

In [None]:
# Download NLTK resources (only need to run this once)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

# Create the lemmatizer and load the stopword list once, not on every call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))


# Function to clean tweets
def clean_tweet(tweet):
    # Remove specific quote characters (curly quotes, etc.)
    tweet = tweet.replace('“', '').replace('”', '').replace('‘', '').replace('’', '')
    # Remove emojis; note that this regex strips every non-ASCII character
    tweet = re.sub(r'[^\x00-\x7F]+', '', tweet)
    # Remove hashtags (e.g., #example)
    tweet = re.sub(r'#\w+', '', tweet)
    # Remove mentions (e.g., @user)
    tweet = re.sub(r'@\w+', '', tweet)
    # Remove URLs
    tweet = re.sub(r'http\S+|www\S+', '', tweet)
    # Remove special characters and symbols
    tweet = re.sub(r'[^a-zA-Z0-9\s]', '', tweet)
    # Convert to lowercase
    tweet = tweet.lower()
    # Tokenize the tweet
    words = word_tokenize(tweet)
    # Drop English stopwords and lemmatize the remaining words
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    # Rejoin words into a cleaned-up tweet
    cleaned_tweet = ' '.join(words)

    return cleaned_tweet


# Apply the cleaning function to the 'tweet' column
train_df['tweet'] = train_df['tweet'].apply(clean_tweet)
validation_df['tweet'] = validation_df['tweet'].apply(clean_tweet)
test_df['tweet'] = test_df['tweet'].apply(clean_tweet)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
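
The cleaning function above removes emojis by deleting every non-ASCII character, which also drops accented letters. If a narrower filter is preferred, one hedged alternative is to target the main emoji code-point ranges explicitly. The ranges below are an assumption chosen for illustration and are not exhaustive, and `strip_emojis` is a hypothetical helper name:

```python
import re

# Character class covering common emoji blocks (illustrative, not exhaustive)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that follows some emojis
    "]+"
)


def strip_emojis(text: str) -> str:
    """Remove emoji characters while keeping other non-ASCII text."""
    return EMOJI_PATTERN.sub("", text)


print(repr(strip_emojis("Good night \U0001F60A")))
print(repr(strip_emojis("caf\u00e9 \u2600\uFE0F")))
```

With this variant, accented letters such as "é" survive, which matters mostly for non-English text.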


In [None]:
train_df

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
200002,200002,en,writing uni essay local pub coffee random old ...,1
200003,200003,en,2021 1921 dont appreciate two ride team member...,1
200006,200006,en,according customer plenty time go spent stirli...,1
200007,200007,en,bloke drink beer sorry arent bloke drink wine ...,1
200008,200008,en,new shelf week looking forward reading book,0
...,...,...,...,...
203256,203256,en,idk yall bitch think half as hanging cute look...,1
203257,203257,en,part experiment im learning though there littl...,1
203258,203258,en,take already yet got ta readyim dripping say d...,1
203259,203259,en,look like whore lh,1


# [Task 3 - 0.5 points] Text Encoding

To train a neural sexism classifier, you first need to encode text into numerical format.






### Instructions



* Embed words using **GloVe embeddings**.

* You are **free** to pick any embedding dimension.








### Note: What about OOV tokens?

   * All the tokens in the **training** set that are not in GloVe **must** be added to the vocabulary.

   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **special token** (e.g., [UNK]) and a **static** embedding.

   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)




### More about OOV



For a given token:



* **If in train set**: add to vocabulary and assign an embedding (use GloVe if token in GloVe, custom embedding otherwise).

* **If in val/test set**: assign special token if not in vocabulary and assign custom embedding.



Your vocabulary **should**:



* Contain all tokens in train set; or

* Union of tokens in train set and in GloVe $\rightarrow$ we make use of existing knowledge!
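
The rules above can be sketched on toy data as follows. This is a simplified illustration, not the required solution: the names `build_vocabulary` and `encode`, the reserved ids for `<PAD>` and `[UNK]`, the zero vector for padding, and the uniform random initialisation are all assumptions, and `toy_glove` stands in for the real GloVe table.

```python
import random
from typing import Dict, List, Tuple

PAD, UNK = "<PAD>", "[UNK]"


def build_vocabulary(
    train_tokens: List[List[str]],
    glove: Dict[str, List[float]],
    dim: int,
    seed: int = 0,
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Map every training token to an id and to an embedding vector."""
    rng = random.Random(seed)
    vocab = {PAD: 0, UNK: 1}
    # Padding gets a zero vector; [UNK] gets one fixed (static) random vector
    matrix = [[0.0] * dim, [rng.uniform(-0.1, 0.1) for _ in range(dim)]]
    for sentence in train_tokens:
        for token in sentence:
            if token in vocab:
                continue
            vocab[token] = len(vocab)
            if token in glove:
                # Token known to GloVe: reuse the pretrained vector
                matrix.append(list(glove[token]))
            else:
                # Training-set OOV: add it to the vocabulary with a custom vector
                matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return vocab, matrix


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Tokens unseen during training (validation/test OOV) fall back to [UNK]."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


toy_glove = {"women": [0.1, 0.2], "day": [0.3, 0.4]}
vocab, matrix = build_vocabulary([["women", "womening", "day"]], toy_glove, dim=2)
print(encode(["women", "unseen", "day"], vocab))  # [2, 1, 4]
```

Here "womening" is a training-set OOV and receives its own row, while "unseen" only appears at inference time and is mapped to `[UNK]`.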

In [None]:
# Function to download GloVe embeddings with a progress bar
def download_glove_embeddings(glove_url, save_path='glove.zip', extract_path='glove'):
    if not os.path.exists(extract_path):  # Only download if not already downloaded
        print("Downloading GloVe embeddings...")

        # Stream the download with a progress bar
        response = requests.get(glove_url, stream=True)
        total_size = int(response.headers.get('content-length', 0))

        with open(save_path, 'wb') as f, tqdm(
            desc="Downloading",
            total=total_size,
            unit='B',
            unit_scale=True,
            unit_divisor=1024,
        ) as bar:
            for data in response.iter_content(chunk_size=1024):
                f.write(data)
                bar.update(len(data))

        # Extract the zip file
        print("Extracting GloVe embeddings...")
        with zipfile.ZipFile(save_path, 'r') as zip_ref:
            zip_ref.extractall(extract_path)

        # Clean up by removing the zip file
        os.remove(save_path)
        print("Download and extraction complete.")
    else:
        print("GloVe embeddings already downloaded.")


# Load the GloVe embeddings from the extracted file
def load_glove_embeddings(filepath, embedding_dim=100):
    embeddings = {}
    with open(filepath, 'r', encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings


# Define the GloVe URL and download path
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"  # 6B is the 400K word vocab, various embedding dim
download_path = 'glove.6B.zip'
extract_path = 'glove'

# Download and extract GloVe embeddings
download_glove_embeddings(glove_url, download_path, extract_path)

# Specify the embedding dimension and file, choose dim from 50,100,200,300
embedding_dim = 100
glove_filepath = os.path.join(extract_path, f'glove.6B.{embedding_dim}d.txt')

# Load embeddings
glove_embeddings = load_glove_embeddings(glove_filepath, embedding_dim)
print("Loaded GloVe embeddings with dimension:", embedding_dim)

Downloading GloVe embeddings...


Downloading: 100%|██████████| 822M/822M [02:39<00:00, 5.42MB/s]


Extracting GloVe embeddings...
Download and extraction complete.
Loaded GloVe embeddings with dimension: 100


In [None]:
def find_max_len(df, column_name):
    # Calculate the maximum length of the text in the specified column
    max_len = 0
    for text in df[column_name]:
        max_len = max(max_len, len(text.split()))
    return max_len

def pad_text_column(df, column_name, max_len, pad_token="<PAD>"):
    # Apply padding
    df['padded_' + column_name] = df[column_name].apply(lambda x: x if isinstance(x, list) else x.split())  # Tokenize if not already
    df['padded_' + column_name] = df['padded_' + column_name].apply(
        lambda x: x[:max_len] + [pad_token] * (max_len - len(x)) if len(x) < max_len else x[:max_len]
    )

    # Return the modified DataFrame
    return df, max_len

MAX_LEN = max(find_max_len(train_df, 'tweet'), find_max_len(validation_df, 'tweet'), find_max_len(test_df, 'tweet'))
# Usage example:
train_df, MAX_LEN = pad_text_column(train_df, 'tweet', MAX_LEN)
validation_df, MAX_LEN = pad_text_column(validation_df, 'tweet', MAX_LEN)
test_df, MAX_LEN = pad_text_column(test_df, 'tweet', MAX_LEN)

print(train_df['padded_tweet'].iloc[0])
print(MAX_LEN)

['writing', 'uni', 'essay', 'local', 'pub', 'coffee', 'random', 'old', 'man', 'keep', 'asking', 'drunk', 'question', 'im', 'trying', 'concentrate', 'amp', 'end', 'good', 'luck', 'youll', 'end', 'getting', 'married', 'use', 'anyway', 'alive', 'well', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
36


In [None]:
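# Build the vocabulary and the embedding list from the already tokenized, padded training tweets
# (note that '<PAD>' is not in GloVe, so it is treated like any other training OOV token)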
def build_vocab_and_embeddings(data, embeddings_index, embedding_dim):
    vocab = {}
    word_embeddings = []
    for tweet in data['padded_tweet']:
        for token in tweet:
            if token not in vocab:
                if token in embeddings_index:
                    # Use GloVe embedding
                    vocab[token] = len(vocab)
                    word_embeddings.append(embeddings_index[token])
                else:
                    # Generate a random embedding for OOV tokens in the training set
                    vocab[token] = len(vocab)
                    word_embeddings.append(np.random.uniform(-0.1, 0.1, embedding_dim))
    return vocab, word_embeddings

# Build vocabulary and embeddings for the training set
vocab, word_embeddings = build_vocab_and_embeddings(train_df, glove_embeddings, embedding_dim)

In [None]:
unk_token = '[UNK]'
vocab[unk_token] = len(vocab)

# Custom embedding for '[UNK]': mean of all training-vocabulary embeddings (GloVe and random ones)
unk_embedding = np.mean(word_embeddings, axis=0)
word_embeddings = np.vstack([word_embeddings, unk_embedding])

In [None]:
def embed_tweet(df, vocab, word_embeddings):
    data = df.copy()
    new_tweet = []
    for tweet in data['padded_tweet']:
        embedded_tweet = []
        for token in tweet:
            if token not in vocab:
                # token is OOV so considered UNK
                embedded_tweet.append(word_embeddings[vocab['[UNK]']])
            else:
                embedded_tweet.append(word_embeddings[vocab[token]])
        new_tweet.append(embedded_tweet)
    data['padded_tweet'] = new_tweet
    return data

train_df_embedded = embed_tweet(train_df, vocab, word_embeddings)
validation_df_embedded = embed_tweet(validation_df, vocab, word_embeddings)
test_df_embedded = embed_tweet(test_df, vocab, word_embeddings)
train_df_embedded.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1,padded_tweet
200002,200002,en,writing uni essay local pub coffee random old ...,1,"[[0.17459000647068024, 0.2806999981403351, -0...."
200003,200003,en,2021 1921 dont appreciate two ride team member...,1,"[[0.4375700056552887, 0.5958300232887268, 0.52..."
200006,200006,en,according customer plenty time go spent stirli...,1,"[[-0.06825800240039825, -0.04764899984002113, ..."
200007,200007,en,bloke drink beer sorry arent bloke drink wine ...,1,"[[0.1125900000333786, 0.4171999990940094, 0.62..."
200008,200008,en,new shelf week looking forward reading book,0,"[[-0.04395899921655655, 0.18935999274253845, 0..."


# [Task 4 - 1.0 points] Model definition



You are now tasked to define your sexism classifier.






### Instructions



* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.

* You are **free** to experiment with hyper-parameters to define the baseline model.



* **Model 1**: add an additional LSTM layer to the Baseline model.

### Token to embedding mapping



You can follow two approaches for encoding tokens in your classifier.



### Work directly with embeddings



- Compute the embedding of each input token

- Feed the mini-batches of shape (batch_size, # tokens, embedding_dim) to your model



### Work with Embedding layer



- Encode input tokens to token ids

- Define a Embedding layer as the first layer of your model

- Compute the embedding matrix of all known tokens (i.e., tokens in your vocabulary)

- Initialize the Embedding layer with the computed embedding matrix

- You are **free** to set the Embedding layer trainable or not

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


class BiLSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, bidirectional=True):
        super(BiLSTMModel, self).__init__()

        # Define the bidirectional LSTM
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            bidirectional=bidirectional,
                            batch_first=True)

        # Define a fully connected layer on top
        self.fc = nn.Linear(hidden_size * 2 if bidirectional else hidden_size, output_size)

    def forward(self, x):
        # Pass input through LSTM
        lstm_out, (hidden, cell) = self.lstm(x)  # lstm_out has shape [batch, seq_len, hidden_size * 2] if bidirectional

        # Take the final hidden state of the last layer for each direction (forward and backward)
        if self.lstm.bidirectional:
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)  # Concatenate the last forward and backward hidden states
        else:
            hidden = hidden[-1,:,:]  # Take the last hidden state of the forward direction only

        # Pass through the fully connected layer
        out = self.fc(hidden)
        return out.view(-1)

In [None]:
input_size = 100
hidden_size = 128
num_layers = 1
output_size = 1

baseline_model = BiLSTMModel(input_size, hidden_size, num_layers, output_size)

In [None]:
num_layers = 2
model_1 = BiLSTMModel(input_size, hidden_size, num_layers, output_size)

### Padding



Pay attention to padding tokens!



Your model **should not** be penalized on those tokens.



#### How to?



There are two main ways.



However, their implementation depends on the neural library you are using.



- Embedding layer

- Custom loss to compute average cross-entropy on non-padding tokens only



**Note**: This is a **recommendation**, but we **do not penalize** for missing workarounds.
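
One way to realise the "Embedding layer" option in PyTorch is sketched here. It assumes token ids rather than pre-computed vectors as input, right-padded sequences, and a padding id of 0 whose embedding row is zero. The class name `MaskedBiLSTM` and these conventions are illustrative assumptions, not the required solution. `pack_padded_sequence` lets the LSTM stop at each tweet's true length, so padding positions do not influence the final hidden state.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class MaskedBiLSTM(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=128, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        # Frozen pretrained embeddings; the row at padding_idx is never updated
        self.embedding = nn.Embedding.from_pretrained(
            embedding_matrix, freeze=True, padding_idx=pad_id
        )
        self.lstm = nn.LSTM(embedding_matrix.size(1), hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, 1)

    def forward(self, token_ids):
        # True length of each sequence (at least 1, as packing requires)
        lengths = (token_ids != self.pad_id).sum(dim=1).clamp(min=1).cpu()
        embedded = self.embedding(token_ids)
        packed = pack_padded_sequence(embedded, lengths,
                                      batch_first=True, enforce_sorted=False)
        _, (hidden, _) = self.lstm(packed)
        # Final states of the forward and backward directions
        final = torch.cat((hidden[-2], hidden[-1]), dim=1)
        return self.fc(final).view(-1)


torch.manual_seed(0)
embedding_matrix = torch.randn(10, 8)  # toy matrix: 10 tokens, dimension 8
embedding_matrix[0] = 0.0              # zero vector for the padding id
model = MaskedBiLSTM(embedding_matrix, hidden_size=4).eval()

short = torch.tensor([[3, 5, 0, 0]])
longer = torch.tensor([[3, 5, 0, 0, 0, 0]])
with torch.no_grad():
    print(torch.allclose(model(short), model(longer), atol=1e-6))
```

The final check compares the logit for the same tweet with two different amounts of padding; with packing, the two values are expected to match.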

# [Task 5 - 1.0 points] Training and Evaluation



You are now tasked to train and evaluate the Baseline and Model 1.




### Instructions



* Train **all** models on the train set.

* Evaluate **all** models on the validation set.

* Compute metrics on the validation set.

* Pick **at least** three seeds for robust estimation.

* Pick the **best** performing model according to the observed validation set performance.

* Evaluate your models using macro F1-score.
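
Macro F1 averages the per-class F1-scores with equal weight, so the minority class counts as much as the majority class. The plain-Python sketch below only illustrates the definition (the function names are illustrative); in practice `f1_score(..., average='macro')` from scikit-learn serves the same purpose.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1-score of a single class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], classes=(0, 1)) -> float:
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(macro_f1(y_true, y_pred))  # 0.625 (class 0: 0.75, class 1: 0.5)
```

In this toy case accuracy is 4/6, roughly 0.67, while macro F1 is lower because the weaker minority-class score carries equal weight.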

In [None]:
import random

class Trainer:
    def __init__(self, model, output_size=1, learning_rate=0.001, num_epochs=20, batch_size=32, seed=42):
        self.set_seed(seed)

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}", end='\n\n')

        self.model = model.to(self.device)
        self.output_size = output_size
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.criterion = nn.BCEWithLogitsLoss()
        self.optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    def set_seed(self, seed):
        """Set the seed for reproducibility."""
        random.seed(seed)  # Python random module
        np.random.seed(seed)  # NumPy random module
        torch.manual_seed(seed)  # PyTorch CPU
        torch.cuda.manual_seed(seed)  # PyTorch GPU (if CUDA is available)
        torch.cuda.manual_seed_all(seed)  # PyTorch all GPUs (if using multiple GPUs)
        torch.backends.cudnn.deterministic = True  # Ensure deterministic behavior
        torch.backends.cudnn.benchmark = False  # Disable the auto-tuner to avoid randomness
        print(f"Random seed set to: {seed}")


    def train(self, train_df):
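        # Prepare training data: stack the per-tweet embeddings into one float32 array first,
        # which avoids building tensors from a pandas Series of nested lists
        x_array = np.array(train_df['padded_tweet'].tolist(), dtype=np.float32)
        y_array = train_df['hard_label_task1'].to_numpy(dtype=np.float32)
        self.x_train = torch.from_numpy(x_array).to(self.device)
        self.y_train = torch.from_numpy(y_array).to(self.device)
        self.train_dataset = TensorDataset(self.x_train, self.y_train)
        self.train_loader = DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)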

        # Training loop
        for epoch in range(self.num_epochs):
            self.model.train()  # Set the model to training mode
            running_loss = 0.0  # Track the total loss for the epoch

            for batch_idx, (inputs, labels) in enumerate(self.train_loader):
                # Zero the parameter gradients
                self.optimizer.zero_grad()

                # Forward pass
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)

                # Backward pass and optimization
                loss.backward()
                self.optimizer.step()

                # Accumulate the loss
                running_loss += loss.item()

            # Print the average loss for this epoch
            avg_loss = running_loss / len(self.train_loader)
            print(f"Epoch [{epoch + 1}/{self.num_epochs}], Loss: {avg_loss:.4f}")


    def test(self, test_df):
        # Prepare evaluation data in the same way as the training data
        x_array = np.array(test_df['padded_tweet'].tolist(), dtype=np.float32)
        y_array = test_df['hard_label_task1'].to_numpy(dtype=np.float32)
        x_test = torch.from_numpy(x_array).to(self.device)
        y_test = torch.from_numpy(y_array).to(self.device)
        test_dataset = TensorDataset(x_test, y_test)
        test_loader = DataLoader(test_dataset, batch_size=self.batch_size, shuffle=False)

        # Set the model to evaluation mode
        self.model.eval()

        all_preds = []
        all_labels = []

        with torch.no_grad():  # No need to compute gradients during evaluation
            for inputs, labels in test_loader:
                # Forward pass
                outputs = self.model(inputs)

                # The model returns raw logits (BCEWithLogitsLoss applies the sigmoid only inside the loss),
                # so apply the sigmoid here and round to obtain binary predictions
                predicted = torch.round(torch.sigmoid(outputs))

                all_preds.append(predicted.cpu().numpy())
                all_labels.append(labels.cpu().numpy())

        # Flatten the lists
        all_preds = np.concatenate(all_preds, axis=0)
        all_labels = np.concatenate(all_labels, axis=0)

        # Calculate precision, recall, and F1 score
        precision = precision_score(all_labels, all_preds, average='macro')
        recall = recall_score(all_labels, all_preds, average='macro')
        f1 = f1_score(all_labels, all_preds, average='macro')

        accuracy = (all_preds == all_labels).mean()

        return accuracy, f1

In [None]:
num_seeds = 3
# Train the baseline with several seeds
for seed in range(num_seeds):
    # Seed before building the model so that weight initialization is reproducible too
    torch.manual_seed(seed)
    # The baseline has a single BiLSTM layer; pass 1 explicitly because the global
    # `num_layers` was set to 2 when Model 1 was defined
    baseline_model = BiLSTMModel(input_size, hidden_size, 1, output_size)
    baseline_trainer = Trainer(model=baseline_model, num_epochs=30, batch_size=32, learning_rate=0.001, seed=seed)
    baseline_trainer.train(train_df_embedded)
    accuracy, f1 = baseline_trainer.test(validation_df_embedded)
    print("On Val:")
    print(f"Accuracy: {accuracy:.4f}, f1_score: {f1:.4f}")
    accuracy, f1 = baseline_trainer.test(test_df_embedded)
    print("On Test:")
    print(f"Accuracy: {accuracy:.4f}, f1_score: {f1:.4f}", end='\n\n')

Random seed set to: 0
Using device: cuda

Epoch [1/30], Loss: 0.6061
Epoch [2/30], Loss: 0.5246
Epoch [3/30], Loss: 0.4820
Epoch [4/30], Loss: 0.4511
Epoch [5/30], Loss: 0.4311
Epoch [6/30], Loss: 0.3932
Epoch [7/30], Loss: 0.3648
Epoch [8/30], Loss: 0.3060
Epoch [9/30], Loss: 0.2805
Epoch [10/30], Loss: 0.2273
Epoch [11/30], Loss: 0.1819
Epoch [12/30], Loss: 0.1594
Epoch [13/30], Loss: 0.1061
Epoch [14/30], Loss: 0.0893
Epoch [15/30], Loss: 0.0550
Epoch [16/30], Loss: 0.0471
Epoch [17/30], Loss: 0.0593
Epoch [18/30], Loss: 0.0501
Epoch [19/30], Loss: 0.0312
Epoch [20/30], Loss: 0.0272
Epoch [21/30], Loss: 0.0158
Epoch [22/30], Loss: 0.0124
Epoch [23/30], Loss: 0.0391
Epoch [24/30], Loss: 0.0434
Epoch [25/30], Loss: 0.0114
Epoch [26/30], Loss: 0.0081
Epoch [27/30], Loss: 0.0158
Epoch [28/30], Loss: 0.0059
Epoch [29/30], Loss: 0.0079
Epoch [30/30], Loss: 0.0021
On Val:
Accuracy: 0.7911, f1_score: 0.7804

On Test:
Accuracy: 0.7517, f1_score: 0.7492

Random seed set to: 1
Using device: cu

In [None]:
for seed in range(num_seeds):
    num_layers = 2
    # Seed before building the model so that weight initialization is reproducible too
    torch.manual_seed(seed)
    model_1 = BiLSTMModel(input_size, hidden_size, num_layers, output_size)
    model_1_trainer = Trainer(model=model_1, num_epochs=30, batch_size=32, learning_rate=0.001, seed=seed)
    model_1_trainer.train(train_df_embedded)
    accuracy, f1 = model_1_trainer.test(validation_df_embedded)
    print("On Val:")
    print(f"Accuracy: {accuracy:.4f}, f1_score: {f1:.4f}")
    accuracy, f1 = model_1_trainer.test(test_df_embedded)
    print("On Test:")
    print(f"Accuracy: {accuracy:.4f}, f1_score: {f1:.4f}", end='\n\n')

Random seed set to: 0
Using device: cuda

Epoch [1/30], Loss: 0.6125
Epoch [2/30], Loss: 0.5190
Epoch [3/30], Loss: 0.4836
Epoch [4/30], Loss: 0.4516
Epoch [5/30], Loss: 0.4357
Epoch [6/30], Loss: 0.3925
Epoch [7/30], Loss: 0.3575
Epoch [8/30], Loss: 0.3125
Epoch [9/30], Loss: 0.2790
Epoch [10/30], Loss: 0.2327
Epoch [11/30], Loss: 0.1894
Epoch [12/30], Loss: 0.1764
Epoch [13/30], Loss: 0.1324
Epoch [14/30], Loss: 0.1265
Epoch [15/30], Loss: 0.0917
Epoch [16/30], Loss: 0.0688
Epoch [17/30], Loss: 0.0822
Epoch [18/30], Loss: 0.0396
Epoch [19/30], Loss: 0.0348
Epoch [20/30], Loss: 0.0201
Epoch [21/30], Loss: 0.0276
Epoch [22/30], Loss: 0.0459
Epoch [23/30], Loss: 0.0363
Epoch [24/30], Loss: 0.0210
Epoch [25/30], Loss: 0.0265
Epoch [26/30], Loss: 0.0127
Epoch [27/30], Loss: 0.0030
Epoch [28/30], Loss: 0.0015
Epoch [29/30], Loss: 0.0013
Epoch [30/30], Loss: 0.0017
On Val:
Accuracy: 0.8165, f1_score: 0.8070
On Test:
Accuracy: 0.7517, f1_score: 0.7470

Random seed set to: 1
Using device: cud
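
To choose between the two models, the per-seed validation scores can be summarised by their mean and standard deviation. The sketch below assumes the macro F1 values have been collected in a dictionary; the numbers are placeholders for illustration only and are not results from this notebook.

```python
from statistics import mean, stdev

# Placeholder validation macro-F1 scores, one per seed (NOT results from this notebook)
val_f1 = {
    "baseline": [0.78, 0.77, 0.79],
    "model_1": [0.81, 0.79, 0.80],
}

# Mean and sample standard deviation across seeds for each model
summary = {name: (mean(scores), stdev(scores)) for name, scores in val_f1.items()}
for name, (avg, std) in summary.items():
    print(f"{name}: {avg:.4f} +/- {std:.4f}")

# Select the model with the highest mean validation score
best_model = max(summary, key=lambda name: summary[name][0])
print("Best on validation:", best_model)
```

Selecting on the validation mean, rather than on a single seed or on the test set, keeps the test set untouched until the final evaluation.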

In [None]:
print(train_df_embedded['tweet'].iloc[0])
print(train_df_embedded['tweet'].iloc[1])
print(train_df_embedded['tweet'].iloc[4])
# Use the device the models already live on instead of hard-coding "cuda"
device = next(baseline_model.parameters()).device
sample = torch.tensor(np.array(train_df_embedded['padded_tweet'].iloc[:5].tolist()), dtype=torch.float32).to(device)
print(torch.sigmoid(baseline_model(sample)))
print(torch.sigmoid(model_1(sample)))

writing uni essay local pub coffee random old man keep asking drunk question im trying concentrate amp end good luck youll end getting married use anyway alive well
2021 1921 dont appreciate two ride team member looked behind asked man behind many party impressed
new shelf week looking forward reading book
tensor([0.9971, 0.8939, 0.9909, 0.9990, 0.0053], device='cuda:0',
       grad_fn=<SigmoidBackward0>)
tensor([9.9732e-01, 9.2997e-01, 9.9919e-01, 9.9956e-01, 2.7825e-05],
       device='cuda:0', grad_fn=<SigmoidBackward0>)


# [Task 6 - 1.0 points] Transformers



In this section, you will use a transformer model specifically trained for hate speech detection, namely [Twitter-roBERTa-base for Hate Speech Detection](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate).






### Relevant Material

- Tutorial 3

### Instructions

1. **Load the Tokenizer and Model**



2. **Preprocess the Dataset**:

   You will need to preprocess your dataset to prepare it for input into the model. Tokenize your text data using the appropriate tokenizer and ensure it is formatted correctly.



   **Note**: Use the plain text of the dataset rather than the token sequences you built for the previous models: the model's own tokenizer has to be applied to the cleaned text obtained after the initial cleaning process.



3. **Train the Model**:

   Use the `Trainer` to train the model on your training data.



4. **Evaluate the Model on the Test Set** using F1-macro.
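
For the evaluation step, the Hugging Face `Trainer` accepts a `compute_metrics` callable that receives the model predictions together with the gold labels. A minimal sketch of such a function is given here, assuming two-logit outputs as in this binary task; the metric key `f1_macro` is an arbitrary name, and the toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score


def compute_metrics(eval_pred):
    """Turn (logits, labels) into a dictionary holding the macro F1-score."""
    logits, labels = eval_pred
    # Pick the class with the highest logit for each example
    predictions = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, predictions, average="macro")}


# Toy check with hand-made logits for four examples
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
print(compute_metrics((toy_logits, toy_labels)))
```

The function would then be passed as `compute_metrics=compute_metrics` when the `Trainer` is constructed.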

In [None]:
train_df = original_train_df.copy()
validation_df = original_validation_df.copy()
test_df = original_test_df.copy()

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Tasks:
# emoji, emotion, hate, irony, offensive, sentiment
# stance/abortion, stance/atheism, stance/climate, stance/feminist, stance/hillary

task='hate'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]


"""
text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")"""


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

In [None]:
test_df['tweet'] = test_df['tweet'].apply(preprocess)
train_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
200002,200002,en,Writing a uni essay in my local pub with a cof...,1
200003,200003,en,@UniversalORL it is 2021 not 1921. I dont appr...,1
200006,200006,en,According to a customer I have plenty of time ...,1
200007,200007,en,"So only 'blokes' drink beer? Sorry, but if you...",1
200008,200008,en,New to the shelves this week - looking forward...,0


In [None]:
tokenized = tokenizer(test_df['tweet'].to_list(), padding=True, truncation=True, max_length=256)
print(tokenized['input_ids'][0])
test_df['input_ids'] = tokenized['input_ids']
test_df['attention_mask'] = tokenized['attention_mask']
test_df.head()

[0, 134, 620, 183, 23, 5, 3716, 15, 10, 2721, 395, 11, 5300, 328, 12135, 1099, 367, 28335, 7, 213, 11, 528, 7, 5, 25805, 808, 514, 32196, 36, 1694, 9200, 2248, 5251, 8, 11464, 52, 4362, 84, 477, 43, 17841, 27819, 2054, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1,input_ids,attention_mask
400178,400178,en,1st day at the pool on a beautiful Sunday in N...,0,"[0, 134, 620, 183, 23, 5, 3716, 15, 10, 2721, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
400179,400179,en,“I like your outfit too except when i dress up...,1,"[0, 17, 48, 100, 101, 110, 7490, 350, 4682, 77...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
400180,400180,en,"@user 🥺💖 same, though!!! the angst just comes ...",0,"[0, 1039, 12105, 8103, 8210, 3070, 6569, 10659...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
400181,400181,en,@user @user Fuck that cunt. Tried to vote her ...,1,"[0, 1039, 12105, 787, 12105, 43774, 14, 48391,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
400182,400182,en,@user u gotta say some shit like “i’ll fuck th...,1,"[0, 1039, 12105, 1717, 16112, 224, 103, 15328,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


In [None]:
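import torch  # needed for the tensors below; it is not imported in the earlier cells
output = model(input_ids=torch.tensor(test_df['input_ids'].to_list()), attention_mask=torch.tensor(test_df['attention_mask'].to_list()))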

In [None]:
print(output)
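The printed object is a sequence-classification output whose `logits` field holds one row of unnormalized scores per tweet. Turning those rows into class predictions can be sketched as follows. This is a pure-Python illustration on made-up logits; the helper names are hypothetical. With the real model one would start from `output.logits.tolist()`, and which index means which class must be checked against the downloaded label mapping.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_predictions(logits):
    """Return, for each row, the index of the largest logit."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Made-up logits for three tweets and two classes
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]]
predictions = logits_to_predictions(toy_logits)
probabilities = [softmax(row) for row in toy_logits]
print(predictions)
```

The resulting list of predicted indices can then be compared with `hard_label_task1` to compute macro F1.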

# [Task 7 - 0.5 points] Error Analysis



### Instructions



After evaluating the model, perform a brief error analysis:



 - Review the results and identify common errors.



 - Summarize your findings regarding the errors and their impact on performance. Possible factors include, but are not limited to, Out-of-Vocabulary (OOV) words, data imbalance, and performance differences between the custom model and the transformer.

 - Suggest possible solutions to address the identified errors.
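One possible starting point for reviewing the results is to separate the misclassified tweets into false positives and false negatives, so that each group can be read on its own. The sketch below assumes binary labels with `1` as the positive (sexist) class; the function name `split_errors` and the placeholder texts are hypothetical.

```python
def split_errors(texts, y_true, y_pred, positive=1):
    """Group misclassified texts into false positives and false negatives."""
    false_positives, false_negatives = [], []
    for text, t, p in zip(texts, y_true, y_pred):
        if t == p:
            continue  # correctly classified, nothing to inspect
        if p == positive:
            false_positives.append(text)
        else:
            false_negatives.append(text)
    return false_positives, false_negatives


# Placeholder texts and labels, made up for illustration
texts = ["tweet A", "tweet B", "tweet C", "tweet D"]
y_true = [0, 1, 1, 0]
y_pred = [1, 1, 0, 0]
fp, fn = split_errors(texts, y_true, y_pred)
print("False positives:", fp)
print("False negatives:", fn)
```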




# [Task 8 - 0.5 points] Report



Wrap up your experiment in a short report (up to 2 pages).

### Instructions



* Use the NLP course report template.

* Summarize each task in the report following the provided template.

### Recommendations



The report is not a copy-paste of graphs, tables, and command outputs.



* Summarize classification performance in Table format.

* **Do not** report command outputs or screenshots.

* Report learning curves in Figure format.

* The error analysis section should summarize your findings.


# Submission



* **Submit** your report in PDF format.

* **Submit** your python notebook.

* Make sure your notebook is **well organized**, with no temporary code, commented-out sections, tests, etc.

* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ



Please check these frequently asked questions before contacting us.

### Execution Order



You are **free** to address tasks in any order (if multiple orderings are available).

### Trainable Embeddings



You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture



You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.


### Neural Libraries



You are **free** to use any library of your choice to implement the networks (e.g., Keras, TensorFlow, PyTorch, JAX, etc.).

### Keras TimeDistributed Dense layer



If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Robust Evaluation



Each model is trained with at least 3 random seeds.



Task 4 requires you to compute the average performance over the 3 seeds and its corresponding standard deviation.
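Aggregating over seeds can be sketched with the standard library. The seed values and scores below are placeholders invented for illustration, not real results. The sketch uses the sample standard deviation (`statistics.stdev`); whether sample or population deviation is expected is an assumption to confirm.

```python
from statistics import mean, stdev

# Placeholder macro-F1 scores per seed (invented numbers, not real results)
seed_scores = {42: 0.80, 1337: 0.78, 2024: 0.82}

scores = list(seed_scores.values())
avg = mean(scores)
std = stdev(scores)  # sample standard deviation (n - 1 in the denominator)
print(f"macro-F1: {avg:.3f} +/- {std:.3f} over {len(scores)} seeds")
```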

### Model Selection for Analysis



To carry out the error analysis you are **free** to either



* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)

* Perform ensembling via, for instance, majority voting to obtain a single model.
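Majority voting over the seed runs can be sketched as below, assuming each run yields one hard label per test tweet in the same order. The function name `majority_vote` and the toy predictions are illustrative. With an odd number of binary models no ties can occur; otherwise `Counter.most_common` resolves ties in favour of the label encountered first.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list via majority voting."""
    ensembled = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled


# Toy predictions from three seed runs on four tweets (made up)
run_a = [1, 0, 1, 0]
run_b = [1, 1, 0, 0]
run_c = [0, 1, 1, 0]
ensemble = majority_vote([run_a, run_b, run_c])
print(ensemble)
```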

### Error Analysis



Some topics for discussion include:

   * Precision/Recall curves.

   * Confusion matrices.

   * Specific misclassified samples.
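As an illustration of the confusion-matrix topic, the following pure-Python sketch builds a 2x2 matrix (rows are true labels, columns are predictions) and derives precision and recall for one class from it. The function names and toy labels are assumptions for this example; `sklearn.metrics.confusion_matrix` follows the same row/column convention.

```python
def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Rows index the true label, columns the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def precision_recall(matrix, class_index):
    """Precision and recall of one class, read off the confusion matrix."""
    tp = matrix[class_index][class_index]
    predicted = sum(row[class_index] for row in matrix)  # column sum
    actual = sum(matrix[class_index])                    # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy labels (made up): 1 = sexist, 0 = not sexist
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
matrix = confusion_matrix(y_true, y_pred)
precision, recall = precision_recall(matrix, class_index=1)
print(matrix, precision, recall)
```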

### Bonus Points

Bonus points are assigned at our discretion based on significant contributions such as:

- Outstanding error analysis

- Masterclass code organization

- Suitable extensions

Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).



**Possible Extensions/Explorations for Bonus Points:**

- **Try other preprocessing strategies**: for example, explore techniques tailored specifically to tweets or methods that are common for social media text.

- **Experiment with other custom architectures or models from HuggingFace**

- **Explore Spanish tweets**: for example, leverage multilingual models to process Spanish tweets and compare their performance with that of monolingual models.
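As one illustration of a tweet-specific preprocessing strategy, the sketch below normalizes URLs, user mentions, hashtags and elongated characters with regular expressions. The function name `normalize_tweet`, the patterns and the example string are assumptions made for this example; the `@user` and `http` placeholders mirror the ones used in the notebook's `preprocess` function.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # any character repeated 3+ times


def normalize_tweet(text):
    """Apply a few tweet-oriented normalizations (illustrative choices)."""
    text = URL_RE.sub("http", text)       # replace links with a placeholder
    text = MENTION_RE.sub("@user", text)  # anonymize user mentions
    text = HASHTAG_RE.sub(r"\1", text)    # keep the hashtag word, drop '#'
    text = REPEAT_RE.sub(r"\1\1", text)   # "sooooo" -> "soo"
    return " ".join(text.split())         # collapse whitespace


example = "@someone sooooo   true!!! #WeAreEqual https://t.co/abc"
normalized = normalize_tweet(example)
print(normalized)
```

Whether any of these steps helps is an empirical question: comparing macro F1 on the validation set with and without each step is one way to decide.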












# The End