# Project 1 - Text Classification Task
Developed by Group 05:
- Emanuel Maia - up202107486
- Rita Leite - up202105309
- Tiago Azevedo - up202108699

## 1 - Introduction

This project is a continuation of the first one, but now we are exploring the use of Hugging Face Transformers. That said, the theme of the project and the structure of our data remain the same, consisting of publications sourced from Reddit and Google, authored by individuals from England, Australia, and India.

The Reddit-sourced data is divided as follows:
- Reddit (England): Training data and validation data
- Reddit (Australia): Training data and validation data
- Reddit (India): Training data and validation data

The same division applies to the Google-sourced data:
- Google (England): Training data and validation data
- Google (Australia): Training data and validation data
- Google (India): Training data and validation data

All datasets share the same attributes: `id`, a unique identifier for each entry, `text`, the content of the publication, and `sentiment_label`, the target variable for our analysis. The `sentiment_label` is binary, where 0 indicates a negative sentiment and 1 indicates a positive sentiment.
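
To make the record layout concrete, here is a minimal sketch of parsing such data from JSON Lines text using only the standard library. The two records are invented for illustration and are not taken from the dataset; the rename to `label` mirrors what the notebook does later for Hugging Face datasets.

```python
import json

# Hypothetical records in JSON Lines layout (one JSON object per line);
# the ids and texts are invented for illustration.
raw = "\n".join([
    '{"id": 1, "text": "Loved the food here", "sentiment_label": 1}',
    '{"id": 2, "text": "Terrible service", "sentiment_label": 0}',
])

def parse_jsonl(payload):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]

records = parse_jsonl(raw)

# Drop the id and rename the target column to `label`
rows = [{"text": r["text"], "label": r["sentiment_label"]} for r in records]
print(rows)
```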

## 2 - Preparation

The process begins by importing the necessary libraries and defining a utility function for language detection and translation. This early step brings all input text to a common language, English, so that multilingual data can be processed and analyzed consistently.

Data from Reddit and Google, covering user content from the UK, India, and Australia, is first loaded and merged by region. Entries from India that are not detected as English are then translated into English. All datasets are standardized by renaming the `sentiment_label` column to `label`, and the regional datasets are finally combined into a single unified collection, ready for further analysis.

In [None]:
import pandas as pd
import numpy as np

import contractions
import evaluate
import torch
import nltk
import re

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification, RobertaForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding, pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from peft import get_peft_model, LoraConfig, TaskType
from datasets import Dataset, DatasetDict
from googletrans import Translator
from langdetect import detect

translator = Translator()

def translate_text(text):
    """Return `text` in English, trying Hindi, Urdu and Bengali as source languages."""
    try:
        if detect(text) == "en":
            return text
    except Exception:
        # langdetect raises on empty or non-linguistic input; keep the text unchanged
        return text
    for src in ("hi", "ur", "bn"):
        try:
            return translator.translate(text, src=src, dest="en").text
        except Exception:
            continue
    # Every translation attempt failed: fall back to the original text
    return text

In [None]:
# Read Reddit-sourced data 
reddit_uk_train = pd.read_json("data/reddit-uk-train.jsonl", lines=True).drop("id", axis=1)
reddit_in_train = pd.read_json("data/reddit-in-train.jsonl", lines=True).drop("id", axis=1)
reddit_au_train = pd.read_json("data/reddit-au-train.jsonl", lines=True).drop("id", axis=1)
reddit_uk_valid = pd.read_json("data/reddit-uk-valid.jsonl", lines=True).drop("id", axis=1)
reddit_in_valid = pd.read_json("data/reddit-in-valid.jsonl", lines=True).drop("id", axis=1)
reddit_au_valid = pd.read_json("data/reddit-au-valid.jsonl", lines=True).drop("id", axis=1)

# Read Google-sourced data 
google_uk_train = pd.read_json("data/google-uk-train.jsonl", lines=True).drop("id", axis=1)
google_in_train = pd.read_json("data/google-in-train.jsonl", lines=True).drop("id", axis=1)
google_au_train = pd.read_json("data/google-au-train.jsonl", lines=True).drop("id", axis=1)
google_uk_valid = pd.read_json("data/google-uk-valid.jsonl", lines=True).drop("id", axis=1)
google_in_valid = pd.read_json("data/google-in-valid.jsonl", lines=True).drop("id", axis=1)
google_au_valid = pd.read_json("data/google-au-valid.jsonl", lines=True).drop("id", axis=1)

# Merge and translate data by country
uk_union = pd.concat([reddit_uk_train, reddit_uk_valid, google_uk_train, google_uk_valid], ignore_index=True)
au_union = pd.concat([reddit_au_train, reddit_au_valid, google_au_train, google_au_valid], ignore_index=True)
in_union = pd.concat([reddit_in_train, reddit_in_valid, google_in_train, google_in_valid], ignore_index=True)
in_union["text"] = in_union["text"].apply(translate_text)

# Rename columns for huggingface datasets
uk_union.rename(columns={'sentiment_label':'label'}, inplace = True)
au_union.rename(columns={'sentiment_label':'label'}, inplace = True)
in_union.rename(columns={'sentiment_label':'label'}, inplace = True)

# Merge all data
gl_union = pd.concat([uk_union, au_union, in_union]).reset_index(drop=True)

## 3 - Preprocessing

The process begins by downloading the necessary resources from the NLTK library, such as stopwords, tokenizers, and lemmatizers. A set of English stopwords is loaded, and the negation words (`no`, `not`, `nor`, `t`) are taken out of that set so that they are kept in the text during preprocessing. Following that, a lemmatization function is defined, which tags each word with its part of speech and reduces it to its base form accordingly.

Next, a text preprocessing procedure is applied to the datasets. It expands contractions (e.g., "I'm" to "I am"), replaces every character that is not an ASCII letter or a space (including digits and punctuation), converts all text to lowercase for consistency, and removes extra spaces. The text is then tokenized into words, stopwords are eliminated to focus on more meaningful terms, and lemmatization is applied to reduce words to their base forms.

This comprehensive preprocessing ensures that the datasets from the UK, Australia, India, and a combined global set are standardized and ready for analysis, improving the quality and consistency of the text data.
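
The effect of these steps on a single sentence can be sketched with the standard library alone. This is a simplified illustration, not the notebook's pipeline: the stopword list is a small hypothetical one (the notebook uses NLTK's English list), and contraction expansion and lemmatization are left out.

```python
import re

# Small illustrative stopword list (an assumption; the notebook uses NLTK's list)
STOP_WORDS = {"i", "am", "the", "a", "is", "it", "this", "was", "no", "not", "nor"}
NEGATIONS = {"no", "not", "nor"}
# Negations are excluded from the words to drop, so they stay in the text
DROP_WORDS = STOP_WORDS - NEGATIONS

def clean_text(text):
    """Keep ASCII letters only, lowercase, collapse spaces, drop stopwords."""
    text = re.sub(r"[^a-zA-Z ]", " ", text)
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(word for word in text.split() if word not in DROP_WORDS)

example = clean_text("This is NOT   the best café, it was a 2/10!")
print(example)
```

Note how the negation survives while the accented character, the digits and the punctuation are removed, which is the behavior the stopword adjustment is meant to preserve.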

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')

# Prepare stopwords to not include the ones with negation value
stop_words = set(nltk.corpus.stopwords.words('english'))
stop_words_remove = {"no", "not", "nor", "t"}
stop_words.difference_update(stop_words_remove)

# Apply lemmatization (using pos tagging)
lemma = nltk.WordNetLemmatizer()
def lemmatize_with_pos(text):
    words = token(text)   
    words_tag = nltk.pos_tag(words)   
    words_lem = []
    for word, tag in words_tag:
        if tag.startswith('N'): 
            words_lem.append(lemma.lemmatize(word, pos='n'))
        elif tag.startswith('V'): 
            words_lem.append(lemma.lemmatize(word, pos='v'))
        elif tag.startswith('J'): 
            words_lem.append(lemma.lemmatize(word, pos='a'))
        elif tag.startswith('R'): 
            words_lem.append(lemma.lemmatize(word, pos='r'))
        else: 
            words_lem.append(lemma.lemmatize(word))
    return words_lem

# Apply preprocessing techniques
token = nltk.word_tokenize
def text_pre_processing(dataset):
    # Expand contractions, e.g. "I'm" -> "I am"
    dataset['text'] = dataset['text'].apply(contractions.fix)
    # Replace everything except ASCII letters and spaces (this also covers non-ASCII characters)
    dataset['text'] = dataset['text'].apply(lambda x: re.sub(r'[^a-zA-Z ]', ' ', x).strip())
    dataset['text'] = dataset['text'].apply(str.lower)
    # Collapse repeated whitespace
    dataset['text'] = dataset['text'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())
    # Tokenize, drop stopwords (negations are kept), then lemmatize
    dataset['text'] = dataset['text'].apply(token)
    dataset['text'] = dataset['text'].apply(lambda x: [word for word in x if word not in stop_words])
    dataset['text'] = dataset['text'].apply(lambda x: lemmatize_with_pos(" ".join(x)))
    # Join the lemmas back into a single string
    dataset['text'] = [" ".join(text) for text in dataset['text']]
    return dataset
       
uk_union = text_pre_processing(uk_union)
au_union = text_pre_processing(au_union)
in_union = text_pre_processing(in_union)
gl_union = text_pre_processing(gl_union)

## 4 - Transformers

In this section, we define several functions necessary for preprocessing the text data, partitioning it into training, validation, and test sets, and evaluating the model's performance during training.
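
As a rough illustration of the partitioning step, the sketch below reproduces the two-stage split (hold out 20%, then halve the held-out part) with the standard library. The function name and the fixed seed are assumptions made for this example; the notebook's own helper relies on the `datasets` library and does not pass a seed, so its splits may differ between runs.

```python
import random

def partition(items, seed=42):
    """Shuffle and split items into roughly 80% train, 10% validation, 10% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Stage 1: hold out 20% of the data
    n_holdout = round(len(shuffled) * 0.2)
    n_train = len(shuffled) - n_holdout
    holdout = shuffled[n_train:]

    # Stage 2: split the held-out part in half
    n_test = round(len(holdout) * 0.5)
    n_valid = len(holdout) - n_test
    return {
        "train": shuffled[:n_train],
        "validation": holdout[:n_valid],
        "test": holdout[n_valid:],
    }

splits = partition(range(1000))
sizes = {name: len(part) for name, part in splits.items()}
print(sizes)
```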

We also define the models that we will evaluate:

- **DistilBERT Base Uncased fine-tuned on SST-2** (learn more [here](https://dataloop.ai/library/model/distilbert_distilbert-base-uncased-finetuned-sst-2-english/))

    - Overview
        - Distilled version of BERT, meaning it has been compressed to be smaller and faster while retaining most of BERT’s performance.
        - It has been specifically fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset, which contains sentences labeled with either positive or negative sentiment.
        - Well-suited for binary sentiment classification tasks. 

    - Strengths
        - High accuracy, as it achieves around 91.3% on the SST-2 development set.
        - Its compact size makes it more efficient to use, particularly in environments with limited computational resources.
    
    - Weaknesses
        - It may exhibit biased behavior, especially toward underrepresented populations, as a result of biases in its training data.
        - It is not designed for factual reasoning or objective classification beyond the specific scope of sentiment analysis.
    
    - Reason
        - We selected this model because it offers a good balance of performance and efficiency. As a distilled version of BERT, it retains much of the accuracy while being smaller and faster, making it ideal for text classification tasks. Specifically fine-tuned on the SST-2 dataset for sentiment analysis, it is well-suited for binary sentiment classification, which aligns perfectly with our project’s goals.

- **TabularisAI Multilingual Sentiment Analysis** (learn more [here](https://dataloop.ai/library/model/tabularisai_multilingual-sentiment-analysis/))
    
    - Overview
        - A multilingual sentiment analysis model designed to handle text in multiple languages.
        - Trained to classify sentiment into five distinct categories: very negative, negative, neutral, positive, and very positive.
    
    - Strengths
        - Multilingual support enables sentiment analysis without the need for prior translation, useful in international and multicultural contexts.

    - Weaknesses
        - The five-level classification may be overly complex for tasks that require only a binary sentiment distinction (e.g., positive vs. negative).
        - Performance can vary depending on the language and input quality; lower-resourced languages may have reduced accuracy.
        - Multilingual models typically require more computational resources than single-language, task-specific models.

    - Reason
        - We selected this model because of its ability to handle multiple languages, including Hindi and Bengali. While we translate texts that are not entirely in English during preprocessing, some texts cannot be fully translated. In these cases, we believe this model’s multilingual capabilities will allow it to better handle such texts and provide more accurate sentiment analysis across different languages.

- **CardiffNLP Twitter RoBERTa Base Sentiment Latest** (learn more [here](https://dataloop.ai/library/model/cardiffnlp_twitter-roberta-base-sentiment-latest/))

    - Overview
        - A RoBERTa-based model fine-tuned specifically on a large collection of tweets for sentiment analysis.
        - It classifies sentiment into three categories: negative, neutral, and positive.
        - Designed to handle informal, short-form text typical of social media platforms like Twitter.

    - Strengths
        - Optimized for real-world, noisy text data such as tweets, making it robust to slang, abbreviations, and emojis.
        - Good performance in social media contexts where traditional models may struggle.
        - Fine-tuned using manually labeled tweets, improving accuracy in sentiment prediction for tweet-style text.

    - Weaknesses
        - May underperform when applied to formal or long-form text outside the social media domain.
    
    - Reason
        - We selected this model because it is specifically designed for analyzing sentiment in informal texts, like tweets. Since part of our dataset contains texts with similar characteristics, including abbreviations and less formal language, we believe this model is well-suited to handle these cases more effectively than models trained on formal datasets.

In [None]:
# Keep only the first 100 whitespace-separated words of a text
def apply_text_cut(text):
    return ' '.join(text.split()[:100])

# Split the data into train (80%), validation (10%) and test (10%) sets
def apply_partition(sample):
    sample_hf = Dataset.from_pandas(sample)
    train_test = sample_hf.train_test_split(test_size=0.2)
    valid_test = train_test['test'].train_test_split(test_size=0.5)
    return DatasetDict({
        'train': train_test['train'],
        'validation': valid_test['train'],
        'test': valid_test['test']
    })

# Summarize the text
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
def apply_summarizer(text):
    try:
        summary = summarizer(
            ' '.join(text.split()[:512]),
            max_length=100,
            min_length=10,
            do_sample=False
            )[0]["summary_text"]
        return summary
    except Exception as e:
        print(f"Summarization failed: {e}")
        return text

metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

### 4.1 - Without Fine-Tuning

We evaluate the performance of pretrained language models on our own sentiment classification dataset, without additional fine-tuning. The goal is to assess how well these models, trained on general-purpose sentiment data, generalize to our specific domain.

The steps involved are:
- Load the pretrained model and tokenizer using the Hugging Face Transformers library;
- Preprocess the dataset (text truncation, tokenization, padding);
- Use the model to predict sentiment labels on our test set;
- Evaluate performance using standard metrics: accuracy, precision, recall, F1-score, and the confusion matrix.
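
For reference, the metrics in the last step can be derived directly from the confusion matrix. The sketch below does so for binary labels with macro averaging, the averaging mode used in this section's evaluation code. The function names and the example labels are invented for illustration; the notebook itself relies on scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def precision_recall_f1(hits, false_alarms, misses):
    """Per-class precision, recall and F1, returning 0.0 on empty denominators."""
    precision = hits / (hits + false_alarms) if hits + false_alarms else 0.0
    recall = hits / (hits + misses) if hits + misses else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged (precision, recall, F1) over both classes."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    positive = precision_recall_f1(tp, fp, fn)
    negative = precision_recall_f1(tn, fn, fp)  # class 0 treated as the target class
    accuracy = (tp + tn) / len(y_true)
    macro = tuple((p + n) / 2 for p, n in zip(positive, negative))
    return accuracy, macro

# Invented labels, for illustration only
accuracy, macro = macro_scores([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
print(accuracy, macro)
```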

Because of our computational limitations, and to speed up classification, we limited each text to roughly 100 tokens. We tried two approaches: truncating the text to its first 100 words, and using a text summarization model to condense texts longer than 100 words into a summary of at most 100 tokens.

As the results show, the two methods differed little in precision or accuracy. As expected, simple truncation proved more advantageous, since it is significantly faster than summarization.

This serves as a baseline for comparison before applying more advanced techniques like fine-tuning or parameter-efficient adaptation (e.g., LoRA).
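
Two of the three pretrained models predict more than two classes, so their outputs have to be collapsed to binary labels before they can be compared with our targets. A minimal sketch of that mapping follows. It assumes that class indices are ordered from most negative to most positive and, like the evaluation code in this section, it counts the neutral class as positive; the function name is hypothetical.

```python
def to_binary(pred, num_classes):
    """Map a predicted class index to 0 (negative) or 1 (positive)."""
    if num_classes == 2:
        return pred
    # Middle index is the neutral class: 1 of 3 classes, 2 of 5 classes
    neutral = num_classes // 2
    return 1 if pred >= neutral else 0

three_way = [to_binary(p, 3) for p in range(3)]
five_way = [to_binary(p, 5) for p in range(5)]
print(three_way, five_way)
```

Counting neutral as positive is a design choice rather than a necessity; mapping it to negative instead would shift the balance between precision and recall.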

In [None]:
# Models used without training
models_pretrain = [
    {'name': "distilbert-base-uncased-finetuned-sst-2-english", "type": AutoModelForSequenceClassification, "label": "label", "num": 2},
    {'name': "tabularisai/multilingual-sentiment-analysis", "type": AutoModelForSequenceClassification, "label": "label", "num": 5},
    {'name': "cardiffnlp/twitter-roberta-base-sentiment-latest", "type": RobertaForSequenceClassification, "label": "Sentiment", "num": 3}
]

In [None]:
def apply_model_pretrain(model_dict, sample, use_summary=False):
    
    if(model_dict["label"] != "label"):
        sample = sample.rename(columns={'label':'Sentiment'})
    
    # initialize the model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_dict["name"])
    model = model_dict["type"].from_pretrained(model_dict["name"])
    trainer = Trainer(model=model, compute_metrics=compute_metrics)  
    
    # prepare the data
    if use_summary:
        sample["text"] = sample["text"].apply(lambda t: apply_summarizer(t) if len(t.split()) > 100 else t)
    sample['text'] = sample['text'].apply(apply_text_cut)
          
    MAX_LENGTH = 128  
    sample_hf = apply_partition(sample)
    sample_hf = sample_hf.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=MAX_LENGTH), batched=True)
    
    # predict the labels
    predictions = trainer.predict(test_dataset=sample_hf["test"])
    y_pred = np.argmax(predictions.predictions, axis=-1)
    y_test = sample_hf["test"][model_dict["label"]]
    
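    # Collapse multi-class outputs to binary labels; the neutral class is counted as positive
    # (assumes class indices are ordered from most negative to most positive)
    if model_dict["num"] == 3:
        y_pred = [1 if pred in [1, 2] else 0 for pred in y_pred]
    elif model_dict["num"] == 5:
        y_pred = [1 if pred in [2, 3, 4] else 0 for pred in y_pred]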
    
    # print confusion matrix and metrics
    print(confusion_matrix(y_test, y_pred))
    print('Accuracy:', round(accuracy_score(y_test, y_pred),2))
    print('Precision:', round(precision_score(y_test, y_pred, average='macro'),2))
    print('Recall:', round(recall_score(y_test, y_pred, average='macro'),2))
    print('F1:', round(f1_score(y_test, y_pred, average='macro'),2))

In [None]:
for model in models_pretrain:
    print("Results for model:", model["name"])
    apply_model_pretrain(model, gl_union.copy())

In [None]:
for model in models_pretrain:
    print("Results for model:", model["name"])
    apply_model_pretrain(model, gl_union.copy(), True)
    print("\n")

### 4.2 - With Fine-Tuning

Next, we fine-tuned one of the models, DistilBERT Base Uncased fine-tuned on SST-2, on our dataset for 3 epochs to improve classification performance.

The results achieved by this model improved significantly after fine-tuning on our own dataset. This highlights the importance of task-specific training: the model adapted better to the characteristics and nuances of our data, which led to higher classification performance than using the pretrained model as-is.
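
The fine-tuning function in this section also accepts an optional LoRA configuration (rank 8 on the `q_lin`, `k_lin` and `v_lin` attention projections), although the calls in this notebook leave `use_lora` at its default of `False`. As a back-of-the-envelope sketch of why that option is parameter-efficient, the snippet below counts the adapter weights it would add. It assumes the standard DistilBERT base shape (hidden size 768, 6 transformer layers) and counts only the low-rank matrices, not the classification head that may also be trained.

```python
def lora_params(in_features, out_features, rank):
    """Weights added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank

HIDDEN = 768   # DistilBERT base hidden size (assumption stated above)
LAYERS = 6     # DistilBERT base transformer layers (assumption stated above)
TARGETS = ["q_lin", "k_lin", "v_lin"]
RANK = 8

per_module = lora_params(HIDDEN, HIDDEN, RANK)
total = per_module * len(TARGETS) * LAYERS
print(per_module, total)
```

Under these assumptions the adapters add a few hundred thousand weights, a small fraction of the tens of millions of parameters in the full model.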

In [None]:
# Models used with training
models_tunning = [
    {'name': "distilbert-base-uncased-finetuned-sst-2-english"}
]

In [None]:
def apply_model_tunning(model_name, sample, target_modules, use_lora = False):
    
    # initialize the model and tokenizer
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    
    # prepare the data
    MAX_LENGTH = 128
    sample['text'] = sample['text'].apply(apply_text_cut)
    sample_hf = apply_partition(sample)
    sample_hf = sample_hf.map(
        lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=MAX_LENGTH), 
        batched=True
    )

    if (use_lora):
        lora_config = LoraConfig(
            task_type=TaskType.SEQ_CLS,
            r=8,
            lora_alpha=16,
            lora_dropout=0.1,
            bias="none",
            target_modules=target_modules,
        )
        model = get_peft_model(model, lora_config)
        
    # set the training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        eval_strategy="epoch",
        # save_strategy="epoch",
        save_strategy="no",
        load_best_model_at_end=False,
    )

    # initialize the trainer 
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=sample_hf["train"],
        eval_dataset=sample_hf["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    # train the model
    trainer.train()
    eval_results = trainer.evaluate()
    predictions = trainer.predict(test_dataset=sample_hf["test"])
    print("Evaluation results:", eval_results)
    print("Predictions:", predictions.predictions)
    print("Predicted labels:", np.argmax(predictions.predictions, axis=-1))

In [None]:
for model in models_tunning:
    print("Results for model:", model["name"])
    apply_model_tunning(model["name"], gl_union.copy(), ["q_lin", "k_lin", "v_lin"])

### 4.3 - With Prompting

We also explored the use of prompting for our classification task by leveraging the "google/flan-t5-small" model. This model is a fine-tuned version of Google's T5 architecture, adapted through instruction tuning on a wide range of tasks. Although it is relatively lightweight, it is capable of understanding and responding to task-specific prompts in natural language.

In our setup, we prompted the model to classify input text as either positive or negative, aligning with the goal of our task. We also included a note in the prompt highlighting that the input text could potentially be in Hindi, Urdu, or Bengali. This was an important consideration, as misclassification of texts in these languages was a major limitation encountered by our best-performing model in the first project.

Although prompting produced better results than the other models we tested without fine-tuning, the binary classifier fine-tuned on our dataset still achieved the best overall performance.
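
The two model-independent parts of this setup, building the prompt and turning the generated answer into a label, can be sketched as follows. The template is a shortened version of the prompt used in this section, and the helper names are hypothetical. Any answer that is not recognised falls back to the negative class, as in the notebook.

```python
PROMPT_TEMPLATE = "Classify the following text as positive or negative: '{text}'."

def build_prompt(text, max_words=100):
    """Truncate the text to `max_words` words and insert it into the template."""
    truncated = " ".join(text.split()[:max_words])
    return PROMPT_TEMPLATE.format(text=truncated)

def parse_label(response, default=0):
    """Map a generated answer to 1 (positive) or 0 (negative)."""
    answer = response.strip().lower()
    if "positive" in answer:
        return 1
    if "negative" in answer:
        return 0
    return default  # unrecognised answers use the fallback label

prompt = build_prompt("good food and friendly staff")
labels = [parse_label(r) for r in [" Positive ", "negative", "not sure"]]
print(prompt)
print(labels)
```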

In [None]:
models_prompting = [
    {'name': "google/flan-t5-small"}
]

In [None]:
def apply_model_prompting(model_name, sample, text_column="text", label_column="label"):
    sample['text'] = sample['text'].apply(apply_text_cut)
    sample_hf = apply_partition(sample)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

    def classify_batch_prompts(texts, batch_size=8):
        prompts = [
            f"Classify the following text as positive or negative: '{text}'. Pay attention because the text, although in principle in English, may also be or have parts in Hindi, Urdu or Bengali. Try to be as accurate as you can."
            for text in texts
        ]

        results = pipe(prompts, max_new_tokens=4, batch_size=batch_size, truncation=True)
        labels = []

        for res in results:
            response = res['generated_text'].strip().lower()
            if "positive" in response:
                labels.append(1)
            else:
                # "negative" and any unrecognised answer both map to 0
                labels.append(0)

        return labels

    texts = sample_hf["test"][text_column]
    true_labels = sample_hf["test"][label_column]
    pred_labels = classify_batch_prompts(texts)

    acc = accuracy_score(true_labels, pred_labels)
    prec = precision_score(true_labels, pred_labels)
    rec = recall_score(true_labels, pred_labels)
    f1 = f1_score(true_labels, pred_labels)

    print(f"Accuracy: {acc:.2f}")
    print(f"Precision: {prec:.2f}")
    print(f"Recall: {rec:.2f}")
    print(f"F1 Score: {f1:.2f}")

In [None]:
for model in models_prompting:
    apply_model_prompting(model["name"], gl_union.copy())

## 5 - Error Analysis

We will now analyze the errors made by our best-performing model. As previously observed, our best model was the DistilBERT Base Uncased fine-tuned on SST-2, which achieved an accuracy of 0.86 after being fine-tuned on our dataset for 3 epochs.

As in the previous project, we examine the model’s performance when it is fine-tuned separately on subsets of the dataset. Specifically, we divide our dataset by the country of origin of each sample, creating three separate datasets: one for Australia, one for the United Kingdom, and one for India.

We adopt this approach because, in the previous project, it allowed us to identify one of the main weaknesses of our best-performing model: its poor classification performance on the subset containing only data from India.
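
A complementary check, sketched here purely as an illustration, is to break down the accuracy of a single model by country. It assumes each test sample carries a country tag, which the merged dataset in this notebook does not store, and the tags, labels and predictions below are invented.

```python
from collections import defaultdict

def accuracy_by_group(groups, y_true, y_pred):
    """Return the accuracy of the predictions within each group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}

# Invented example data, for illustration only
groups = ["uk", "uk", "au", "au", "in", "in", "in", "in"]
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
per_country = accuracy_by_group(groups, y_true, y_pred)
print(per_country)
```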

In [None]:
apply_model_tunning(models_tunning[0]["name"], au_union.copy(), ["q_lin", "k_lin", "v_lin"])

In [None]:
apply_model_tunning(models_tunning[0]["name"], uk_union.copy(), ["q_lin", "k_lin", "v_lin"])

In [None]:
apply_model_tunning(models_tunning[0]["name"], in_union.copy(), ["q_lin", "k_lin", "v_lin"])

Even after fine-tuning on country-specific datasets, the model struggles to adapt to the data from India. This may be due to region-specific expressions that are uncommon in standard English, or to portions of the text being written in other languages such as Hindi, Urdu, or Bengali.