# CENG463 PA3

In this programming assignment, you will be dealing with encoder-based and decoder-based language models. You will use Python for this task. You can use libraries for your implementations, or implement your own functions. However, you are expected to analyse and reason about your implementation and results.

### IMPORTANT NOTES

* **Do not clear the output of your cells since this notebook will count as your written report and your cell outputs will be used for grading.** If a question in your submitted notebook does not have a printed output, you will get no grade from that question. If you encounter a problem about this, please [email me](mailto:auozturk@ceng.metu.edu.tr) so that we can work out a solution.

* Ideally, you should be able to complete this assignment on Google Colab, without any payment for resources to any service (or you can use your own computers). However, if you believe you need additional resources, please [send an email to Çağrı Hoca](mailto:ctoraman@metu.edu.tr).

## Problem: Misinformation Detection

On your last programming assignment, you will experiment with encoder-based and decoder-based models (specifically, BERT and Llama) to identify misinformation on English and Turkish tweet data from 2022. You can read the [paper](https://arxiv.org/pdf/2210.05401v2) for more information about the dataset. However, the dataset that is shared with you with the assignment is slightly simplified, where each row only consists of a `label` and `text` field.

A tweet is considered as "misinformation" if the `label` is `False`. Your models should aim to classify a given text according to the labels `True`, `False`, and `Other`.

**Put the "data" folder in the same directory with your notebook to work on your solutions in case we need to run your notebooks to reproduce your outputs.** The dataset consists of 5 train and test folds. For each model, you will train on 5 folds and report the average performance of all folds.

On the last part of the assignment, you will compare the time and space efficiency of the models you have used. So, don't forget to keep track of the training time.

**Important Note:** For this assignment, using tutorials, online forums, or AI tools to help with debugging and implementation may be needed. However, you are expected to add your resources as disclaimers. **You will be penalized if you do not disclose any help from AI tools, GitHub repositories, or tutorials.**

Tutorials/resources that might come handy:
* [`MiDe22`repository](https://github.com/metunlp/mide22)
* [`transformers` library](https://huggingface.co/docs/transformers/index)
* [finetuning a pretrained model](https://huggingface.co/docs/transformers/training)
* [`pipeline` from HuggingFace for inference](https://huggingface.co/docs/transformers/pipeline_tutorial##text-pipeline)
* [finetuning Llama](https://www.datacamp.com/tutorial/fine-tuning-llama-3-1)
* [quantization for working on low resources](https://huggingface.co/docs/transformers/v5.0.0rc0/quantization/overview)
* Or any other tutorial you find easy to follow. If you find a particularly helpful resource, you can share it in the discussion forum for everyone to use.

## Q1 - Encoder-based language models for classification (25 points)

### Part A: BERT and preprocessing

In this part, you will finetune the `google-bert/bert-base-uncased` model for the misinformation detection task for English tweets ("EN" folder in the dataset). However, you will train two versions: one with preprocessed texts, one with the raw texts.

For the preprocessing steps, lowercase the text, and use `nltk` to lemmatize and remove stopwords. You can also split each tweet into sentences and tokens, then combine the token list into a single space seperated string to represent each tweet as a single text again. Be careful not to combine different tweets. You can add additional preprocessing steps as long as you keep the integrity of the label-text mappings for each tweet.

You can also use the `preprocess_review` function from PA2 and then combine the returned list into a single space separated string.

Report the performance of both classifiers as the average of 5 train/test folds in "EN" dataset and accuracy, precision, recall, and F1-score with respect to the `False` class. Discuss which model performed better and why.

### Part B: BERT, Mutlilingual BERT and crosslingual performance

In this part, you will finetune the `google-bert/bert-base-uncased` and `google-bert/bert-base-multilingual-uncased` models and compare their performances. However, for the train/test folds, you will take the train fold from "EN" folder and the test fold with the corresponding number from the "TR" folder. This means that you will train the models on English but test them on Turkish.

Report the performance of both classifiers as the average of 5 train/test folds on accuracy, precision, recall, and F1-score with respect to the `False` class. Discuss which model performed better and why you think that is the case.



In [2]:
!pip install transformers datasets accelerate scikit-learn nltk pandas torch # Install required packages
!pip install bitsandbytes-cuda110 bitsandbytes
!pip install -q --no-deps xformers trl peft accelerate bitsandbytes
!pip install -q transformers accelerate peft bitsandbytes datasets trl


import pandas as pd
import numpy as np
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    AutoModelForCausalLM,
    BitsAndBytesConfig
)
from tqdm import tqdm
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from peft import LoraConfig, get_peft_model
import time
import os
from trl import SFTTrainer

# This is to ensure that GPU is being used in google collab
# torch.cude.is_available checks if NVIDIA GPU with CUDA support installed and accessible by PyTorch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}") # To check my runtime is using T4 GPU

from google.colab import drive
drive.mount('/content/drive')

from huggingface_hub import login
login(new_session=False)

Collecting bitsandbytes-cuda110
  Downloading bitsandbytes_cuda110-0.26.0.post2-py3-none-any.whl.metadata (6.3 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes_cuda110-0.26.0.post2-py3-none-any.whl (3.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.3/3.3 MB[0m [31m28.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl (59.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.1/59.1 MB[0m [31m44.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bitsandbytes-cuda110, bitsandbytes
Successfully installed bitsandbytes-0.49.0 bitsandbytes-cuda110-0.26.0.post2
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m122.9/122.9 MB[0m [31m21.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m518.9/518.9 kB[0m [31m40.3 MB/s[0m eta [36m0:00:

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab')

# Same preprocess review function from PA2
def preprocess_review(review):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    sentences = sent_tokenize(review)

    lemmatized_review = []

    for sentence in sentences:
        tokenized_sentence = word_tokenize(sentence)
        lowercased_sentence = [token.lower() for token in tokenized_sentence]
        stopwords_removed_sentence = [token for token in lowercased_sentence if token not in stop_words]
        lemmatized_sentence = [lemmatizer.lemmatize(token) for token in stopwords_removed_sentence]

        lemmatized_review = lemmatized_review + lemmatized_sentence

    # This is the change -> join back into a single string for BERT
    return " ".join(lemmatized_review)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [4]:

# Define Metrics Function -> This is required for evaluating our model
def compute_metrics(pred): # EvalPrediction Object that contains label and predictions
    labels = pred.label_ids # Correct labels of the data
    preds = pred.predictions.argmax(-1) # This gives the class with the highest confidence

    # Calculate accuracy
    acc = accuracy_score(labels, preds)

    # Calculate precision, recall, f1 for the 'False' class (which we will map to ID 1)
    # We use average=None to get scores for each class, then pick the one for 'False'
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average=None, labels=[0, 1, 2])

    # Assuming the mapping: True=0, False=1, Other=2
    return {
        'accuracy': acc,
        'precision_False': precision[1],
        'recall_False': recall[1],
        'f1_False': f1[1]
    }

# Configuration
model_name = "google-bert/bert-base-uncased"
data_dir = "/content/drive/MyDrive/data/EN" # This is the path for MY GOOGLE COLLAB
tr_data_dir = "/content/drive/MyDrive/data/TR" # This is the path for MY GOOGLE COLLAB
num_folds = 5
results_preprocessed = []

print("Training with preprocessed data starts here") # For debugging purpose (to see where I am)

for fold in range(num_folds):
    # Load data (Q1 includes english datas)
    train_path = os.path.join(data_dir, f"en_train_{fold}.tsv")
    test_path = os.path.join(data_dir, f"en_test_{fold}.tsv")

    train_df = pd.read_csv(train_path, sep='\t')
    test_df = pd.read_csv(test_path, sep='\t')

    # Apply preprocessing using the preprocess_review method from last homework
    train_df['text'] = train_df['text'].apply(preprocess_review)
    test_df['text'] = test_df['text'].apply(preprocess_review)

    # Map Labels
    label_map = {'True': 0, 'False': 1, 'Other': 2} # Converting lables into numbers
    # Goes through each tweet and replaces the label with the corresponding number
    train_df['label'] = train_df['label'].map(label_map)
    test_df['label'] = test_df['label'].map(label_map)

    # The Trainer class in the transformers library  is optimized to work with Hugging Face Dataset objects, not pandas DataFrames
    # This function converts your data into that efficient format
    train_dataset = Dataset.from_pandas(train_df[['text', 'label']], preserve_index=False)
    test_dataset = Dataset.from_pandas(test_df[['text', 'label']], preserve_index=False)

    # Tokenization, since the preprocess_review still returns the whole tweet as a string, we need this
    tokenizer = AutoTokenizer.from_pretrained(model_name) # BERT uses WordPiece tokenization

    def tokenize_data(examples):
        return tokenizer(examples['text'], truncation=True, max_length=128, padding="max_length")

    # This applies tokenize_data function to every single row in the dataset
    train_dataset = train_dataset.map(tokenize_data, batched=True)
    test_dataset = test_dataset.map(tokenize_data, batched=True)

    # Right now, your data consists of standard Python lists and integers.
    # PyTorch models cannot read Python lists; they require PyTorch Tensors
    train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

    # Initialize model
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
    model.to(device) # Set the model to use the T4 GPU (from google collab)

    # Trainer setup
    training_args = TrainingArguments(
        output_dir=f'./results/prep_fold_{fold}', # This is required
        num_train_epochs=3, # How many times to loop over the entire dataset
        per_device_train_batch_size=16, # Number of tweets processed at once per GPU during training
        per_device_eval_batch_size=16, # Number of tweets processed at once per GPU during evaluation
        save_strategy="no", # Do not save intermediate model checkpoints
        eval_strategy="epoch", # Run evaluation (calculate accuracy/F1) at the end of every epoch
        report_to="none" # Disable external logging tools
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics
    )

    start_time = time.time() # Start timer
    trainer.train()
    end_time = time.time()   # End timer

    training_time = end_time - start_time # Total training time
    eval_result = trainer.evaluate()

    # Store the time in the result dictionary so we can average it later
    eval_result['training_time'] = training_time

    # This is for debugging
    print(
        f"Fold {fold} Results:\n"
        f"  Accuracy       = {eval_result['eval_accuracy']:.4f}\n"
        f"  Precision(False)= {eval_result['eval_precision_False']:.4f}\n"
        f"  Recall(False)   = {eval_result['eval_recall_False']:.4f}\n"
        f"  F1(False)       = {eval_result['eval_f1_False']:.4f}"
    )
    print(f"Time taken: {training_time:.2f} seconds")
    results_preprocessed.append(eval_result)

    # Cleanup (GPU memory is limited and does not clear itself automatically so we need to clear it after each fold)
    del model
    del trainer
    torch.cuda.empty_cache()

Training with preprocessed data starts here


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/4211 [00:00<?, ? examples/s]

Map:   0%|          | 0/1073 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision False,Recall False,F1 False
1,No log,0.60163,0.7726,0.773649,0.658046,0.71118
2,0.612500,0.541479,0.810811,0.799383,0.744253,0.770833
3,0.612500,0.688081,0.801491,0.741667,0.767241,0.754237


Fold 0 Results:
  Accuracy       = 0.8015
  Precision(False)= 0.7417
  Recall(False)   = 0.7672
  F1(False)       = 0.7542
Time taken: 71.63 seconds


Map:   0%|          | 0/4220 [00:00<?, ? examples/s]

Map:   0%|          | 0/1064 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision False,Recall False,F1 False
1,No log,0.572019,0.772556,0.706186,0.780627,0.741543
2,0.605500,0.511185,0.81485,0.790698,0.774929,0.782734
3,0.605500,0.638914,0.81203,0.791789,0.769231,0.780347


Fold 1 Results:
  Accuracy       = 0.8120
  Precision(False)= 0.7918
  Recall(False)   = 0.7692
  F1(False)       = 0.7803
Time taken: 70.48 seconds


Map:   0%|          | 0/4227 [00:00<?, ? examples/s]

Map:   0%|          | 0/1057 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision False,Recall False,F1 False
1,No log,0.683132,0.723746,0.713889,0.730114,0.72191
2,0.646100,0.539174,0.780511,0.729798,0.821023,0.772727
3,0.646100,0.615612,0.799432,0.78125,0.78125,0.78125


Fold 2 Results:
  Accuracy       = 0.7994
  Precision(False)= 0.7812
  Recall(False)   = 0.7812
  F1(False)       = 0.7812
Time taken: 70.41 seconds


Map:   0%|          | 0/4236 [00:00<?, ? examples/s]

Map:   0%|          | 0/1048 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision False,Recall False,F1 False
1,No log,0.553753,0.784351,0.758209,0.747059,0.752593
2,0.597800,0.544517,0.801527,0.772586,0.729412,0.750378
3,0.597800,0.639809,0.821565,0.793313,0.767647,0.780269


Fold 3 Results:
  Accuracy       = 0.8216
  Precision(False)= 0.7933
  Recall(False)   = 0.7676
  F1(False)       = 0.7803
Time taken: 70.54 seconds


Map:   0%|          | 0/4242 [00:00<?, ? examples/s]

Map:   0%|          | 0/1042 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision False,Recall False,F1 False
1,No log,0.712727,0.715931,0.656442,0.633136,0.644578
2,0.751300,0.530009,0.786948,0.684337,0.840237,0.754316
3,0.751300,0.57898,0.805182,0.753582,0.778107,0.765648


Fold 4 Results:
  Accuracy       = 0.8052
  Precision(False)= 0.7536
  Recall(False)   = 0.7781
  F1(False)       = 0.7656
Time taken: 70.48 seconds


In [7]:
print("FINAL RESULTS: BERT (Preprocessed)")
if results_preprocessed:
    avg_acc = np.mean([r['eval_accuracy'] for r in results_preprocessed])
    avg_prec = np.mean([r['eval_precision_False'] for r in results_preprocessed])
    avg_recall = np.mean([r['eval_recall_False'] for r in results_preprocessed])
    avg_f1 = np.mean([r['eval_f1_False'] for r in results_preprocessed])
    avg_time = np.mean([r['training_time'] for r in results_preprocessed])

    print(f"Average Accuracy:        {avg_acc:.4f}")
    print(f"Average Precision(False): {avg_prec:.4f}")
    print(f"Average Recall(False):    {avg_recall:.4f}")
    print(f"Average F1 (False):       {avg_f1:.4f}")
    print(f"Average Training Time:    {avg_time:.2f} seconds")

FINAL RESULTS: BERT (Preprocessed)
Average Accuracy:        0.8079
Average Precision(False): 0.7723
Average Recall(False):    0.7727
Average F1 (False):       0.7724
Average Training Time:    70.71 seconds


In [9]:
print("Training with raw data starts here") # For debugging purpose (to see where I am)

# Lists to store results
results_en = [] # For Q1 Part A
results_tr = [] # For Q1 Part B
training_times = [] # To track time

for fold in range(num_folds):
    train_df = pd.read_csv(os.path.join(data_dir, f"en_train_{fold}.tsv"), sep='\t')

    # English test (For Part A)
    test_en_df = pd.read_csv(os.path.join(data_dir, f"en_test_{fold}.tsv"), sep='\t')

    # Turkish Test (For Part B)
    test_tr_df = pd.read_csv(os.path.join(tr_data_dir, f"tr_test_{fold}.tsv"), sep='\t')

    label_map = {'True': 0, 'False': 1, 'Other': 2}

    for df in [train_df, test_en_df, test_tr_df]:
        df['label'] = df['label'].map(label_map)
        df.dropna(subset=['label'], inplace=True)

    # Create Datasets
    train_ds = Dataset.from_pandas(train_df[['text', 'label']], preserve_index=False)
    test_en_ds = Dataset.from_pandas(test_en_df[['text', 'label']], preserve_index=False)
    test_tr_ds = Dataset.from_pandas(test_tr_df[['text', 'label']], preserve_index=False)

    # Tokenize (Standard BERT)
    tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

    def tokenize_data(examples):
        return tokenizer(examples['text'], truncation=True, max_length=128, padding="max_length")

    train_ds = train_ds.map(tokenize_data, batched=True)
    test_en_ds = test_en_ds.map(tokenize_data, batched=True) # Tokenize EN Test
    test_tr_ds = test_tr_ds.map(tokenize_data, batched=True) # Tokenize TR Test

    # Set Format
    for ds in [train_ds, test_en_ds, test_tr_ds]:
        ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

    # Train on english only
    model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=3)
    model.to(device)

    training_args = TrainingArguments(
        output_dir=f'./results/combined_fold_{fold}',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        save_strategy="no",
        eval_strategy="no", # We will evaluate manually at the end
        report_to="none"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        compute_metrics=compute_metrics
    )


    start_time = time.time() # Start Timer
    trainer.train()
    end_time = time.time()   # Stop Timer

    total_time = end_time - start_time
    training_times.append(total_time)
    print(f"Training Time: {total_time:.2f} seconds")

    # A. Evaluate on English
    print("Evaluating on English")
    metrics_en = trainer.evaluate(eval_dataset=test_en_ds)
    results_en.append(metrics_en)

    # Evaluate on Turkish
    print("Evaluating on Turkish")
    metrics_tr = trainer.evaluate(eval_dataset=test_tr_ds)
    results_tr.append(metrics_tr)

    print(
        f"Fold {fold} Summary -> Time: {total_time:.1f}s | "
        f"EN Acc: {metrics_en['eval_accuracy']:.4f} | "
        f"EN Recall(False): {metrics_en['eval_recall_False']:.4f} | "
        f"EN F1(False): {metrics_en['eval_f1_False']:.4f} || "
        f"TR Acc: {metrics_tr['eval_accuracy']:.4f} | "
        f"TR Recall(False): {metrics_tr['eval_recall_False']:.4f} | "
        f"TR F1(False): {metrics_tr['eval_f1_False']:.4f}"
    )

    # Cleanup
    del model
    del trainer
    torch.cuda.empty_cache()

Training with raw data starts here


Map:   0%|          | 0/4211 [00:00<?, ? examples/s]

Map:   0%|          | 0/1073 [00:00<?, ? examples/s]

Map:   0%|          | 0/1027 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.5636


Training Time: 64.76 seconds
Evaluating on English


Evaluating on Turkish
Fold 0 Summary -> Time: 64.8s | EN Acc: 0.8201 | EN Recall(False): 0.8017 | EN F1(False): 0.7960 || TR Acc: 0.5151 | TR Recall(False): 0.0198 | TR F1(False): 0.0366


Map:   0%|          | 0/4220 [00:00<?, ? examples/s]

Map:   0%|          | 0/1064 [00:00<?, ? examples/s]

Map:   0%|          | 0/1021 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.5169


Training Time: 64.61 seconds
Evaluating on English


Evaluating on Turkish
Fold 1 Summary -> Time: 64.6s | EN Acc: 0.8102 | EN Recall(False): 0.7664 | EN F1(False): 0.7808 || TR Acc: 0.4946 | TR Recall(False): 0.1931 | TR F1(False): 0.2454


Map:   0%|          | 0/4227 [00:00<?, ? examples/s]

Map:   0%|          | 0/1057 [00:00<?, ? examples/s]

Map:   0%|          | 0/1014 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.5327


Training Time: 64.58 seconds
Evaluating on English


Evaluating on Turkish
Fold 2 Summary -> Time: 64.6s | EN Acc: 0.8250 | EN Recall(False): 0.8097 | EN F1(False): 0.8051 || TR Acc: 0.5256 | TR Recall(False): 0.1072 | TR F1(False): 0.1717


Map:   0%|          | 0/4236 [00:00<?, ? examples/s]

Map:   0%|          | 0/1048 [00:00<?, ? examples/s]

Map:   0%|          | 0/1006 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.7858


Training Time: 64.79 seconds
Evaluating on English


Evaluating on Turkish
Fold 3 Summary -> Time: 64.8s | EN Acc: 0.8197 | EN Recall(False): 0.7794 | EN F1(False): 0.7794 || TR Acc: 0.5278 | TR Recall(False): 0.0831 | TR F1(False): 0.1408


Map:   0%|          | 0/4242 [00:00<?, ? examples/s]

Map:   0%|          | 0/1042 [00:00<?, ? examples/s]

Map:   0%|          | 0/996 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.5728


Training Time: 65.37 seconds
Evaluating on English


Evaluating on Turkish
Fold 4 Summary -> Time: 65.4s | EN Acc: 0.8157 | EN Recall(False): 0.8284 | EN F1(False): 0.7955 || TR Acc: 0.4920 | TR Recall(False): 0.3047 | TR F1(False): 0.3355


In [12]:
# Report
print("FINAL RESULTS: BERT (Raw)")
print(f"Average Training Time:    {np.mean(training_times):.2f} seconds")

avg_en_acc = np.mean([x['eval_accuracy'] for x in results_en])
avg_en_recall = np.mean([x['eval_recall_False'] for x in results_en])
avg_en_f1 = np.mean([x['eval_f1_False'] for x in results_en])
avg_en_prec = np.mean([r['eval_precision_False'] for r in results_preprocessed])

print(f"Q1 Part A (EN→EN) Avg Accuracy: {avg_en_acc:.4f}")
print(f"Q1 Part A (EN→EN) Avg Recall (False): {avg_en_recall:.4f}")
print(f"Q1 Part A (EN→EN) Avg F1 (False): {avg_en_f1:.4f}")
print(f"Q1 Part A (EN→EN) Avg Presicion (False): {avg_en_f1:.4f}")

FINAL RESULTS: BERT (Raw)
Average Training Time:    64.82 seconds
Q1 Part A (EN→EN) Avg Accuracy: 0.8181
Q1 Part A (EN→EN) Avg Recall (False): 0.7971
Q1 Part A (EN→EN) Avg F1 (False): 0.7914
Q1 Part A (EN→EN) Avg Presicion (False): 0.7914


**Discussion: Q1 Part A - BERT with Preprocessing vs Raw Text**

**Results Summary:**
| Model | Accuracy | Precision (False) | Recall (False) | F1 (False) | Training Time |
|-------|----------|-------------------|----------------|------------|---------------|
| BERT (Preprocessed) | 0.8079 | 0.7723 | 0.7727 | 0.7724 | 70.71s |
| BERT (Raw) | 0.8181 | ~0.79 | 0.7971 | 0.7914 | 64.82s |

**Implementation**

For preprocessing, I have used the preprocess_review function for PA2. The only change I have made is that instead of returning a list, it returned a space seperated string. For the training part, I have finetuned the model 5 different times with each fold and tested with corresponding test file. For each training, I used the time library to keep track of the training time. Also to take the average of the evaluations metrics, I kept the results of each fold in an array which is utilized after the loop in analysis section. Due to memory limitations, I have cleared the cache and deleted the finetuned model after each fold training. These steps were common for both training with preprocessed and raw data. Also, I have used A100 GPU for training in google collab. Also, in order to reach the data folder, I have used my drive folder. That is why the path includes 'MyDrive'. As you can obviously guess, I have utilized various different LLMs. I mostly used LLMs for the initializations of models. I did not know the required parameters to use, so I explained the tools, sources and the requirements of the assignment and it gave me the training arguments. Also since I have never used google collab before I did not know how to use the device that my runtime uses, so I used AI in there too aswell.

**Analysis:**

The model trained with raw data actually was better than preprocessed data.
- **Accuracy:** Raw (0.8181) vs Preprocessed (0.8079)
- **F1 Score:** Raw (0.7914) vs Preprocessed (0.7724)
- **Training Time:** Raw (64.82s) vs Preprocessed (70.71s)

Actually, my first intuition in this question was the preprocessed model would outperform the raw data model as normally preprocessing makes the input data more understandable by the model. However, after searching this up on the internet and chatting with AI tools, I found out that BERT was pretrained with raw, natural text so its tokenizer (WordPiece) and the attention mechanism it uses are optimized to understand language in its original form (which includes stopwords, punctuations).

Also with preprocessing, we sometimes lose information for instance, removing stopwords eliminates contextual cues ("not true" becomes "true" after stopword removal) or lemmatization can remove tense information that might indicate speculation vs. fact. Another thing is that BERT already handles tokenization and unknown words with its own tokenization algorithm (which is far far complex and better than my really simple tokenization function).

The last and most important thing is that the writing type affects whether the tweet is misinformation or not (for instance a more unformal tweet possible includes misinformation), so using a preprocessing to remove these tweet-specific context can actually affect the model.

**Conclusion:** For transformer-based models like BERT that were pretrained on natural text, traditional NLP preprocessing (stopword removal, lemmatization) actually hurts performance.

In [14]:
print("Training Multilingual BERT")

results_tr_mbert = []
training_times_mbert = []

for fold in range(num_folds):
    print(f"Fold {fold}")

    # Load data
    train_df = pd.read_csv(os.path.join(data_dir, f"en_train_{fold}.tsv"), sep='\t')
    test_tr_df = pd.read_csv(os.path.join(tr_data_dir, f"tr_test_{fold}.tsv"), sep='\t')

    label_map = {'True': 0, 'False': 1, 'Other': 2}

    for df in [train_df, test_tr_df]:
        df['label'] = df['label'].map(label_map)
        df.dropna(subset=['label'], inplace=True)

    # Create datasets
    train_ds = Dataset.from_pandas(train_df[['text', 'label']], preserve_index=False)
    test_tr_ds = Dataset.from_pandas(test_tr_df[['text', 'label']], preserve_index=False)

    # Tokenizer (Multilingual BERT)
    tokenizer = AutoTokenizer.from_pretrained(
        "google-bert/bert-base-multilingual-uncased"
    )

    def tokenize_data(examples):
        return tokenizer(
            examples['text'],
            truncation=True,
            max_length=128,
            padding="max_length"
        )

    train_ds = train_ds.map(tokenize_data, batched=True)
    test_tr_ds = test_tr_ds.map(tokenize_data, batched=True)

    # Torch format
    for ds in [train_ds, test_tr_ds]:
        ds.set_format(
            type='torch',
            columns=['input_ids', 'attention_mask', 'label']
        )

    # Model
    model = AutoModelForSequenceClassification.from_pretrained(
        "google-bert/bert-base-multilingual-uncased",
        num_labels=3
    )
    model.to(device)

    training_args = TrainingArguments(
        output_dir=f'./results/mbert_fold_{fold}',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        save_strategy="no",
        eval_strategy="no",
        report_to="none"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        compute_metrics=compute_metrics
    )

    # Train
    start_time = time.time()
    trainer.train()
    end_time = time.time()

    total_time = end_time - start_time
    training_times_mbert.append(total_time)

    print(f"Training Time: {total_time:.2f} seconds")

    # Evaluate ONLY on Turkish
    print("Evaluating on Turkish")
    metrics_tr = trainer.evaluate(eval_dataset=test_tr_ds)
    results_tr_mbert.append(metrics_tr)

    print(
        f"Fold {fold} Summary -> "
        f"TR Acc: {metrics_tr['eval_accuracy']:.4f} | "
        f"TR Precision(False): {metrics_tr['eval_precision_False']:.4f} | "
        f"TR Recall(False): {metrics_tr['eval_recall_False']:.4f} | "
        f"TR F1(False): {metrics_tr['eval_f1_False']:.4f}"
    )

    # -------- Cleanup --------
    del model
    del trainer
    torch.cuda.empty_cache()

Training Multilingual BERT
Fold 0


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/4211 [00:00<?, ? examples/s]

Map:   0%|          | 0/1027 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.6886


Training Time: 65.95 seconds
Evaluating on Turkish


Fold 0 Summary -> TR Acc: 0.5346 | TR Precision(False): 0.4286 | TR Recall(False): 0.1190 | TR F1(False): 0.1863
Fold 1


Map:   0%|          | 0/4220 [00:00<?, ? examples/s]

Map:   0%|          | 0/1021 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.6527


Training Time: 65.87 seconds
Evaluating on Turkish


Fold 1 Summary -> TR Acc: 0.5309 | TR Precision(False): 0.3596 | TR Recall(False): 0.1182 | TR F1(False): 0.1779
Fold 2


Map:   0%|          | 0/4227 [00:00<?, ? examples/s]

Map:   0%|          | 0/1014 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.6307


Training Time: 66.18 seconds
Evaluating on Turkish


Fold 2 Summary -> TR Acc: 0.5444 | TR Precision(False): 0.3769 | TR Recall(False): 0.1420 | TR F1(False): 0.2063
Fold 3


Map:   0%|          | 0/4236 [00:00<?, ? examples/s]

Map:   0%|          | 0/1006 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.6384


Training Time: 66.04 seconds
Evaluating on Turkish


Fold 3 Summary -> TR Acc: 0.5577 | TR Precision(False): 0.4891 | TR Recall(False): 0.1920 | TR F1(False): 0.2757
Fold 4


Map:   0%|          | 0/4242 [00:00<?, ? examples/s]

Map:   0%|          | 0/996 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.716


Training Time: 66.50 seconds
Evaluating on Turkish


Fold 4 Summary -> TR Acc: 0.5492 | TR Precision(False): 0.4634 | TR Recall(False): 0.1686 | TR F1(False): 0.2473


In [22]:
# Result

# This part is from Part A
print("FINAL RESULTS: Base BERT (EN → TR)")
avg_tr_acc = np.mean([x['eval_accuracy'] for x in results_tr])
avg_tr_recall = np.mean([x['eval_recall_False'] for x in results_tr])
avg_tr_f1 = np.mean([x['eval_f1_False'] for x in results_tr])
avg_tr_presicion = np.mean([x['eval_f1_False'] for x in results_tr])

print(f"Q1 Part B (EN→TR) Avg Accuracy (False): {avg_tr_acc:.4f}")
print(f"Q1 Part B (EN→TR) Avg Recall (False): {avg_tr_recall:.4f}")
print(f"Q1 Part B (EN→TR) Avg F1 (False): {avg_tr_f1:.4f}")
print(f"Q1 Part B (EN→TR) Avg Presicion (False): {avg_tr_presicion:.4f}")

print("\n \n")

# This part is from Part B
print("FINAL RESULTS: Multilingual BERT (EN → TR)")
if results_tr_mbert:
    avg_acc = np.mean([r['eval_accuracy'] for r in results_tr_mbert])
    avg_prec = np.mean([r['eval_precision_False'] for r in results_tr_mbert])
    avg_recall = np.mean([r['eval_recall_False'] for r in results_tr_mbert])
    avg_f1 = np.mean([r['eval_f1_False'] for r in results_tr_mbert])
    avg_time = np.mean(training_times_mbert)

    print(f"Average Accuracy:        {avg_acc:.4f}")
    print(f"Average Recall(False):    {avg_recall:.4f}")
    print(f"Average F1 (False):       {avg_f1:.4f}")
    print(f"Average Precision(False): {avg_prec:.4f}")
    print(f"Average Training Time:    {avg_time:.2f} seconds")

    print("\n \n")

    # Comparison with standard BERT on Turkish
    print("Comparison: Standard BERT vs Multilingual BERT (EN → TR)")
    print(f"Avg Accuracy(False) Improvement: {avg_acc - avg_tr_acc:+.4f}")
    print(f"Avg Recall(False) Improvement: {avg_recall - avg_tr_recall:+.4f}")
    print(f"Avg F1(False) Improvement: {avg_f1 - avg_tr_f1:+.4f}")
    print(f"Avg Presicion(False) Improvement: {avg_prec - avg_tr_presicion:+.4f}")

FINAL RESULTS: Base BERT (EN → TR)
Q1 Part B (EN→TR) Avg Accuracy (False): 0.5110
Q1 Part B (EN→TR) Avg Recall (False): 0.1416
Q1 Part B (EN→TR) Avg F1 (False): 0.1860
Q1 Part B (EN→TR) Avg Presicion (False): 0.1860

 

FINAL RESULTS: Multilingual BERT (EN → TR)
Average Accuracy:        0.5433
Average Recall(False):    0.1480
Average F1 (False):       0.2187
Average Precision(False): 0.4235
Average Training Time:    66.11 seconds

 

Comparison: Standard BERT vs Multilingual BERT (EN → TR)
Avg Accuracy(False) Improvement: +0.0323
Avg Recall(False) Improvement: +0.0064
Avg F1(False) Improvement: +0.0327
Avg Presicion(False) Improvement: +0.2375


**Discussion: Q1 Part B - Cross-lingual Performance (EN → TR)**

**Results Summary:**
| Model | Accuracy | Precision (False) | Recall (False) | F1 (False) |
|-------|----------|-------------------|----------------|------------|
| Standard BERT (EN→TR) | 0.5110 | ~0.19 | 0.1416 | 0.1860 |
| Multilingual BERT (EN→TR) | 0.5433 | 0.4235 | 0.1480 | 0.2187 |
| **Improvement** | +3.23% | +23.75% | +0.64% | +3.27% |

**Implementation:**
 In terms of my perspective, implementationwise Part B was the same as Part A. Again I trained the 5 times using 5 different folds and for each fold I run the corresponding test. Different from the Part A in here the model id was different hence the the tokenizer was different. Again I used the similar training parameters here. After the training I calculated the average scores for each fold and compared the results with the base models averaged results. The base model results were actually calculated in PART A. The reason for this because I needed to evalute the base models performance which is already trained in Part A. So instead of training and wasting resources once again I just use the trained base model in order to see its performance on data from a different language. In Part A I stored the performance metric results in an array and in Part B I just used the results stored in that array.

**Analysis:**

Unlike the previous part, I got exactly what I anticipated as a result in this part. Before starting this part I was expecting multilingual model to be than the base model as its name includes 'multilingual'. And the results actually proved me right. Multilingual BERT model is noticable better at predicting data from a different language. At this point I did not know the interior details of multilingual BERT, how is it different from base BERT and what features of multilingual BERT makes it better. So again I used LLM's to learn about this and I got some very logical reasons.

Firstly mBERT was trained on 104 languages including Turkish, out of box without any finetuning it already has Turkish subword tokens in its vocabulary. Also the fact that base BERT does not include any tokens that include the Turkish-specific characters would automatically lead to information loss which also leads to a worse result. Also during training, mBERT learns language-agnostic (non lanuage-specific) features which enables him to transfer between languages.

Another interesting observation here is that both model actually performed poorly. This is mainly because both models are tested on a different domain than they were trained/finetuned.

**Conclusion:** Multilingual BERT provides better cross-lingual transfer capabilities due to its multilingual pretraining. However, zero-shot cross-lingual transfer for misinformation detection remains challenging, and including Turkish training data would significantly improve performance.

## Q2 - Decoder-based language models for classification (55 points)

### Part A: Zero-shot and Few-shot inference with Llama for text classification

In this part, you will use the `meta-llama/Llama-3.1-8B-Instruct` model and the "EN" English tweets dataset for designing a classifier based on inference. You will design two prompts:

* **Prompt 1:** A zero-shot prompt with only the task description, no examples. Explain the task and ask for a classification of the tweets in test folds.
* **Prompt 2:** A one-shot or few-shot prompt, with one or a couple examples. This can be tricky as longer tweets may cause problems. Explain the task using examples from the training folds, and ask for a classification of the tweets in test folds.

Report the performance of both classification pipelines as the average of 5 test folds on accuracy, precision, recall, and F1-score with respect to the `False` class. Explain your prompt design process in detail and discuss the performance of the models with respect to your prompts.

### Part B: Finetuning Llama

In this part, you will again use `meta-llama/Llama-3.1-8B-Instruct` model and the "EN" English tweets dataset but this time you will try to finetune the model with the training folds. However, this may become tricky and resource demanding, considering you are using free tools (such as Google Colab) or your own computers. Check out the tutorials to see how you can better manage your resources.

Report the performance of the finetuned classifier as the average of 5 test folds on accuracy, precision, recall, and F1-score with respect to the `False` class, with the better performing prompt you have designed in Part A. Discuss the effect of finetuning on the model performance. Do you think it improved the performance? What else can be integrated into the process to improve the performance of the model, specifically for the misinformation detection task?


### Part C: Domain transfer to sentiment analysis

In this part, you will try to classify game reviews with the model you finetuned in Part B to see the effect of finetuning on the generalized performance of the model.

You should design a simple zero-shot prompt that asks the model whether a user recommends the game given their review text. With the same prompt, you will use both the base `meta-llama/Llama-3.1-8B-Instruct` model and your finetuned version to classify the reviews.

Report the performance of both classifiers on the whole dataset with respect to accuracy, precision, recall, and F1-score. Discuss and explain your findings.

**Note: Dataset for Part C**

* You will use the `game_reviews.csv` file shared with you. It is taken from [this Kaggle dataset](https://www.kaggle.com/datasets/piyushagni5/sentiment-analysis-for-steam-reviews). Review text attribute is `user_review`. User recommendation is the attribute `user_suggestion` where `1` means user recommended the game, `0` means user did not recommend the game.

In [None]:
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
num_folds = 5

# Configure 4-bit loading to fit in Colab memory
# Without these the model would require 16gb to 32gbs of vram
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id) # Get the pretrained model
# When the inputs are generated in batches, the model needs to feed a n by n matrices (which requires padding)
# Even though we are not training in batches here, the model still gives an error in this case
tokenizer.pad_token = tokenizer.eos_token

# AutoModelForCausalLM handles all necessary imports and downloads the actual model by connecting to Hugging Face Hub
# Also with the previous bnb_config, as the weights of the model is downloaded they are converted to 4-bit format
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# Selects 3 examples (one for each class) from the training data to use in the prompt
def get_few_shot_examples(train_df):
    examples = []
    # Get one example for 'True', 'False', and 'Other'
    for label in ['True', 'False', 'Other']:
        row = train_df[train_df['label_str'] == label].iloc[0]
        examples.append((row['text'], row['label_str']))
    return examples

# Function to actually creates the prompt that is given to llama
# We need this because a decoder only model generates open ended answers normally
def format_prompt(tweet, examples=None):
    # System Instruction
    system_content = (
        "Classify the following tweet into exactly one of these three categories: "
        "'True', 'False', or 'Other'. "
        "Do not explain your reasoning. Output ONLY the label."
    )

    messages = [
        {"role": "system", "content": system_content},
    ]

    # Add Few-shot examples if provided (do not add if zero-shot)
    if examples:
        for ex_text, ex_label in examples:
            messages.append({"role": "user", "content": f"Tweet: {ex_text}"})
            messages.append({"role": "assistant", "content": ex_label})

    # Add the target tweet
    messages.append({"role": "user", "content": f"Tweet: {tweet}"})

    return messages

# Generates the prediction using the model
def predict_llama(text, examples=None):
    messages = format_prompt(text, examples)

    # Apply Chat Template
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    # Generate
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = model.generate(
        input_ids,
        max_new_tokens=10, # We only need one word
        eos_token_id=terminators,
        do_sample=False, # Deterministic (greedy decoding)
        temperature=0.0
    )

    # Decode output (slice off the input tokens)
    response = outputs[0][input_ids.shape[-1]:]
    output_text = tokenizer.decode(response, skip_special_tokens=True).strip()

    return output_text

# Parses the text output into ID
def map_prediction_to_id(text_output):
    text_clean = text_output.lower()
    if "false" in text_clean:
        return 1
    elif "true" in text_clean:
        return 0
    elif "other" in text_clean:
        return 2
    else:
        return 2 # Default to 'Other' if model hallucinates

data_dir = "/content/drive/MyDrive/data/EN"
results_zero_shot = []
results_few_shot = []

# Define mappings
id_map = {'True': 0, 'False': 1, 'Other': 2}

for fold in range(num_folds):
    print(f"Processing Fold {fold}") # For debugging

    # Load Data
    train_path = os.path.join(data_dir, f"en_train_{fold}.tsv")
    test_path = os.path.join(data_dir, f"en_test_{fold}.tsv")

    train_df = pd.read_csv(train_path, sep='\t')
    test_df = pd.read_csv(test_path, sep='\t')

    # Keep string labels for prompt generation
    train_df['label_str'] = train_df['label'].astype(str)

    # Prepare Ground Truth IDs
    test_df['label_id'] = test_df['label'].astype(str).map(id_map)
    test_df.dropna(subset=['label_id'], inplace=True)
    ground_truth = test_df['label_id'].tolist()

    # Zero-shot inference
    print("  Running Zero-shot...")
    preds_zero = []

    # Use tqdm for progress bar because generation is slow
    for tweet in tqdm(test_df['text']):
        raw_pred = predict_llama(tweet, examples=None)
        pred_id = map_prediction_to_id(raw_pred)
        preds_zero.append(pred_id)

    # Calculate Metrics manually
    acc_z = accuracy_score(ground_truth, preds_zero)
    prec_z, rec_z, f1_z, _ = precision_recall_fscore_support(ground_truth, preds_zero, average=None, labels=[0, 1, 2])

    results_zero_shot.append({
        'accuracy': acc_z,
        'f1_False': f1_z[1],
        'prec_False': prec_z[1],
        'rec_False': rec_z[1]
    })

    # Few-shot inference
    print("  Running Few-shot (3-shot)...")
    # Get examples from CURRENT fold's training data
    few_shot_examples = get_few_shot_examples(train_df)

    preds_few = []
    for tweet in tqdm(test_df['text']):
        raw_pred = predict_llama(tweet, examples=few_shot_examples)
        pred_id = map_prediction_to_id(raw_pred)
        preds_few.append(pred_id)

    acc_f = accuracy_score(ground_truth, preds_few)
    prec_f, rec_f, f1_f, _ = precision_recall_fscore_support(ground_truth, preds_few, average=None, labels=[0, 1, 2])

    results_few_shot.append({
        'accuracy': acc_f,
        'f1_False': f1_f[1],
        'prec_False': prec_f[1],
        'rec_False': rec_f[1]
    })

Loading Llama 3.1 (this may take a few minutes)...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]


Processing Fold 0...
  Running Zero-shot...


  0%|          | 0/1073 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
  0%|          | 1/1073 [00:01<31:39,  1.77s/it]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1073 [00:02<17:1

  Running Few-shot (3-shot)...


  0%|          | 0/1073 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1073 [00:00<10:50,  1.65it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1073 [00:01<10:56,  1.63it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1073 [00:01<10:56,  1.63it/s]The attention mask and the pad token id were not set. As a 


Processing Fold 1...
  Running Zero-shot...


  0%|          | 0/1064 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1064 [00:00<07:40,  2.31it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1064 [00:00<06:42,  2.64it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1064 [00:01<09:07,  1.94it/s]The attention mask and the pad token id were not set. As a 

  Running Few-shot (3-shot)...


  0%|          | 0/1064 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1064 [00:00<10:37,  1.67it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1064 [00:01<10:50,  1.63it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1064 [00:02<12:50,  1.38it/s]The attention mask and the pad token id were not set. As a 


Processing Fold 2...
  Running Zero-shot...


  0%|          | 0/1057 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1057 [00:00<05:50,  3.02it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1057 [00:00<06:31,  2.70it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1057 [00:01<06:56,  2.53it/s]The attention mask and the pad token id were not set. As a 

  Running Few-shot (3-shot)...


  0%|          | 0/1057 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1057 [00:00<11:51,  1.48it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1057 [00:01<11:42,  1.50it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1057 [00:01<11:22,  1.54it/s]The attention mask and the pad token id were not set. As a 


Processing Fold 3...
  Running Zero-shot...


  0%|          | 0/1048 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1048 [00:00<07:32,  2.32it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1048 [00:00<07:19,  2.38it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1048 [00:01<07:28,  2.33it/s]The attention mask and the pad token id were not set. As a 

  Running Few-shot (3-shot)...


  0%|          | 0/1048 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1048 [00:00<10:21,  1.69it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1048 [00:01<10:14,  1.70it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1048 [00:01<10:20,  1.68it/s]The attention mask and the pad token id were not set. As a 


Processing Fold 4...
  Running Zero-shot...


  0%|          | 0/1042 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1042 [00:00<07:28,  2.32it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1042 [00:00<07:30,  2.31it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1042 [00:01<06:56,  2.49it/s]The attention mask and the pad token id were not set. As a 

  Running Few-shot (3-shot)...


  0%|          | 0/1042 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 1/1042 [00:00<10:51,  1.60it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 2/1042 [00:01<11:09,  1.55it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  0%|          | 3/1042 [00:01<10:54,  1.59it/s]The attention mask and the pad token id were not set. As a 


FINAL RESULTS: LLAMA 3.1 (Inference)
Zero-Shot Results:
  Avg Accuracy: 0.5059
  Avg F1 (False): 0.6091

Few-Shot Results:
  Avg Accuracy: 0.5651
  Avg F1 (False): 0.6427





In [None]:
print("FINAL RESULTS: LLAMA 3.1 (Inference)")

print("Zero-Shot Results:")
print(f"  Avg Accuracy:        {np.mean([r['accuracy'] for r in results_zero_shot]):.4f}")
print(f"  Avg Precision(False): {np.mean([r['prec_False'] for r in results_zero_shot]):.4f}")
print(f"  Avg Recall(False):    {np.mean([r['rec_False'] for r in results_zero_shot]):.4f}")
print(f"  Avg F1 (False):       {np.mean([r['f1_False'] for r in results_zero_shot]):.4f}")

print("\nFew-Shot Results:")
print(f"  Avg Accuracy:        {np.mean([r['accuracy'] for r in results_few_shot]):.4f}")
print(f"  Avg Precision(False): {np.mean([r['prec_False'] for r in results_few_shot]):.4f}")
print(f"  Avg Recall(False):    {np.mean([r['rec_False'] for r in results_few_shot]):.4f}")
print(f"  Avg F1 (False):       {np.mean([r['f1_False'] for r in results_few_shot]):.4f}")

FINAL RESULTS: LLAMA 3.1 (Inference)
Zero-Shot Results:
  Avg Accuracy:        0.5059
  Avg Precision(False): 0.5577
  Avg Recall(False):    0.6709
  Avg F1 (False):       0.6091

Few-Shot Results:
  Avg Accuracy:        0.5651
  Avg Precision(False): 0.5824
  Avg Recall(False):    0.7173
  Avg F1 (False):       0.6427


**Discussion: Q2 Part A - Zero-shot vs Few-shot Llama Inference**

**Results Summary:**
| Prompt Type | Accuracy | Precision (False) | Recall (False) | F1 (False) |
|-------------|----------|-------------------|----------------|------------|
| Zero-shot | 0.5059 | 0.5577 | 0.6709 | 0.6091 |
| Few-shot (3 examples) | 0.5651 | 0.5824 | 0.7173 | 0.6427 |
| **Improvement** | +5.92% | +2.47% | +4.64% | +3.36% |

**Implementation:**
Before I started implementing the actual code, I knew that Q2 will be using a lot of resources since we will be working with decoder based models which are huge in terms of size. So I checked up the resource that you have provided to us, the huggingface quantization document. There were plenty of quantization options to use, I asked these options to LLMs and they suggested me to use BitsAndBytesConfig which enables 4 or 8 bit quantization. I choose 4-bit here as it could fit easier in my colab memory. For the parameters required I checked the docs you have provided. Again from the previous question I downloaded the model and set up the tokenizer. For selecting the few-shot examples (the example prompt answer pair), I just trivially choose the first row for each label (true, false, other). Another thing was to create an actual prompt for our decoder based models. Normaly decoder based models produce open-ended answers; however since our task is a downstream classification task, I needed to give a specific prompt asking for only single answer and also include the examples on the prompt (only for few-shot). And lastly, I used a distinct predict_llama function (which I did not use for previous question) because for this question I had to use the helper functons that I mentioned before the actual prediction. Another difference from the previous question is that there were no training or finetuning in this part, only giving the prompt and evaluate its performance.

**Analysis:**
Since this part is mainly about few-shot vs zero-shot prompting, I want to explain my prompt design process in detail:

- I used a clear, concise instruction specifying exact task and output format. I did not add any extra information as it could include the performance which we are NOT trying to measure in this experiment.

- For zero-shot prompt I only gave the system instruction (mentioned above) and the target tweet.

- For few-shot prompt (3-shot) I added one example from each class from the training data before the target tweet.

Again intuitively I expected few-shot to perform better than zero shot. This is because decoder based models are usually general tasks that are pretrained on huge amount of data (not task specific), so without any examples, we should expect a decoder base model to perform poorly on a downstream task like classification in our case. And the results proved my point right. The examples enabled the few-shot prompt approach to know what kind of tweets belong to each category hence outperforming the zero-shot prompt approach.

Another thing to consider is that how the both of the approaches (even the few-shot version) performs really poorly compared to finetuned encoder-based model (BERT in our case). This is because encoder based models are more suitable for downstream tasks as it includes classification layer right on top of the feed forward layer whereas decoder based models are more suitable for general tasks. Decoder based models are suitable for generating long sequences of text as in each iteration it calculates the next highest probability token in the sequence.

**Conclusion:** Few-shot prompting provides meaningful improvement over zero-shot by giving the model task-specific examples. However, both approaches underperform compared to finetuned models, suggesting that task-specific training is necessary for optimal misinformation detection.

In [None]:
# Configuration
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
data_dir = "/content/drive/MyDrive/data/EN"
num_folds = 5

# Configure 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Function to format training data for instruction tuning
def format_training_text(tweet, label):
    system_content = (
        "Classify the following tweet into exactly one of these three categories: "
        "'True', 'False', or 'Other'. "
        "Do not explain your reasoning. Output ONLY the label."
    )

    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": f"Tweet: {tweet}"},
        {"role": "assistant", "content": label}
    ]

    # Apply chat template and return the formatted text
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False
    )

    return formatted

# Function to prepare dataset for SFTTrainer
def prepare_training_dataset(train_df):
    texts = []
    for _, row in train_df.iterrows():
        formatted_text = format_training_text(row['text'], row['label'])
        texts.append(formatted_text)

    dataset = Dataset.from_dict({"text": texts})
    return dataset

# Results storage
results_finetuned = []
training_times_finetuned = []

# Get few-shot examples function
def get_few_shot_examples(train_df):
    examples = []
    for label in ['True', 'False', 'Other']:
        row = train_df[train_df['label'] == label].iloc[0]
        examples.append((row['text'], row['label']))
    return examples

# Reuse prediction functions from Part A
def format_prompt(tweet, examples=None):
    system_content = (
        "Classify the following tweet into exactly one of these three categories: "
        "'True', 'False', or 'Other'. "
        "Do not explain your reasoning. Output ONLY the label."
    )

    messages = [
        {"role": "system", "content": system_content},
    ]

    if examples:
        for ex_text, ex_label in examples:
            messages.append({"role": "user", "content": f"Tweet: {ex_text}"})
            messages.append({"role": "assistant", "content": ex_label})

    messages.append({"role": "user", "content": f"Tweet: {tweet}"})
    return messages

def map_prediction_to_id(text_output):
    text_clean = text_output.lower()
    if "false" in text_clean:
        return 1
    elif "true" in text_clean:
        return 0
    elif "other" in text_clean:
        return 2
    else:
        return 2

# Process each fold
for fold in range(num_folds):
    print(f"Processing Fold {fold}")

    # Load data
    train_path = os.path.join(data_dir, f"en_train_{fold}.tsv")
    test_path = os.path.join(data_dir, f"en_test_{fold}.tsv")

    train_df = pd.read_csv(train_path, sep='\t')
    test_df = pd.read_csv(test_path, sep='\t')

    # Keep original labels as strings for training
    train_df['label'] = train_df['label'].astype(str)

    # Prepare ground truth for evaluation
    id_map = {'True': 0, 'False': 1, 'Other': 2}
    test_df['label_id'] = test_df['label'].astype(str).map(id_map)
    test_df.dropna(subset=['label_id'], inplace=True)
    ground_truth = test_df['label_id'].tolist()

    # Prepare training dataset
    print("Preparing training dataset...")
    train_dataset = prepare_training_dataset(train_df)

    # Load model with quantization
    print("Loading model with 4-bit quantization...")
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto"
    )

    # Configure LoRA for parameter-efficient finetuning
    lora_config = LoraConfig(
        r=16,  # Rank of the low-rank adaptation
        lora_alpha=32,  # Scaling parameter
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    # Apply LoRA to the model
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Training arguments
    training_args = TrainingArguments(
        output_dir=f'./results/llama_finetuned_fold_{fold}',
        num_train_epochs=2,  # Reduced epochs for efficiency
        per_device_train_batch_size=2,  # Small batch size due to memory constraints
        gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
        learning_rate=2e-4,
        bf16=True,  # Use bf16 instead of fp16 to match quantization dtype
        logging_steps=50,
        save_strategy="no",
        report_to="none",
        optim="paged_adamw_8bit",  # Memory-efficient optimizer
        max_steps=-1,
        warmup_steps=50,
        max_grad_norm=0.3,
    )

    # SFTTrainer for supervised fine-tuning
    # Note: In newer versions of trl, use 'processing_class' instead of 'tokenizer'
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        processing_class=tokenizer,
    )

    # Train
    start_time = time.time()
    trainer.train()
    end_time = time.time()

    training_time = end_time - start_time
    training_times_finetuned.append(training_time)
    print(f"Training completed in {training_time:.2f} seconds")

    # Get few-shot examples for evaluation (using better performing prompt from Part A)
    few_shot_examples = get_few_shot_examples(train_df)

    # Evaluate on test set
    predictions = []

    for tweet in tqdm(test_df['text'], desc=f"Fold {fold} Evaluation"):
        messages = format_prompt(tweet, examples=few_shot_examples)

        input_ids = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to(model.device)

        terminators = [
            tokenizer.eos_token_id,
            tokenizer.convert_tokens_to_ids("<|eot_id|>")
        ]

        with torch.no_grad():
            outputs = model.generate(
                input_ids,
                max_new_tokens=10,
                eos_token_id=terminators,
                do_sample=False,
                temperature=0.0
            )

        response = outputs[0][input_ids.shape[-1]:]
        output_text = tokenizer.decode(response, skip_special_tokens=True).strip()
        pred_id = map_prediction_to_id(output_text)
        predictions.append(pred_id)

    # Calculate metrics
    acc = accuracy_score(ground_truth, predictions)
    prec, rec, f1, _ = precision_recall_fscore_support(
        ground_truth, predictions, average=None, labels=[0, 1, 2]
    )

    fold_results = {
        'accuracy': acc,
        'precision_False': prec[1],
        'recall_False': rec[1],
        'f1_False': f1[1],
        'training_time': training_time
    }

    results_finetuned.append(fold_results)

    print(f"\nFold {fold} Results:")
    print(f"  Accuracy:        {acc:.4f}")
    print(f"  Precision(False): {prec[1]:.4f}")
    print(f"  Recall(False):    {rec[1]:.4f}")
    print(f"  F1(False):        {f1[1]:.4f}")
    print(f"  Training Time:    {training_time:.2f} seconds")


    # Save the model from the last fold for use in Part C
    if fold == num_folds - 1:
        print("\nSaving finetuned model (from last fold) for Part C...")
        save_path = "/content/drive/MyDrive/llama_finetuned_misinfo"
        model.save_pretrained(save_path)
        tokenizer.save_pretrained(save_path)
        print(f"Model saved to {save_path}")

    # Cleanup
    del model
    del trainer
    torch.cuda.empty_cache()


tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

Processing Fold 0
Preparing training dataset...
Loading model with 4-bit quantization...


config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196


Adding EOS to train dataset:   0%|          | 0/4211 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/4211 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/4211 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009, 'pad_token_id': 128009}.


Step,Training Loss
50,2.2426
100,1.5887
150,1.5243
200,1.5043
250,1.5047
300,1.4557
350,1.4524
400,1.5371
450,1.4054
500,1.4746


Training completed in 1863.21 seconds


Fold 0 Evaluation:   0%|          | 0/1073 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Fold 0 Evaluation: 100%|██████████| 1073/1073 [05:34<00:00,  3.21it/s]



Fold 0 Results:
  Accuracy:        0.8183
  Precision(False): 0.8350
  Recall(False):    0.7270
  F1(False):        0.7773
  Training Time:    1863.21 seconds
Processing Fold 1
Preparing training dataset...
Loading model with 4-bit quantization...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196


Adding EOS to train dataset:   0%|          | 0/4220 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/4220 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/4220 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009, 'pad_token_id': 128009}.


Step,Training Loss
50,2.2438
100,1.5354
150,1.5216
200,1.5001
250,1.4518
300,1.5017
350,1.4718
400,1.4487
450,1.4794
500,1.453


Training completed in 1877.08 seconds


Fold 1 Evaluation: 100%|██████████| 1064/1064 [05:38<00:00,  3.15it/s]



Fold 1 Results:
  Accuracy:        0.7773
  Precision(False): 0.8445
  Recall(False):    0.5726
  F1(False):        0.6825
  Training Time:    1877.08 seconds
Processing Fold 2
Preparing training dataset...
Loading model with 4-bit quantization...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196


Adding EOS to train dataset:   0%|          | 0/4227 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/4227 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/4227 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009, 'pad_token_id': 128009}.


Step,Training Loss
50,2.207
100,1.5211
150,1.4727
200,1.6153
250,1.4738
300,1.516
350,1.4744
400,1.4655
450,1.4487
500,1.496


Training completed in 1880.05 seconds


Fold 2 Evaluation: 100%|██████████| 1057/1057 [05:34<00:00,  3.16it/s]



Fold 2 Results:
  Accuracy:        0.7994
  Precision(False): 0.8810
  Recall(False):    0.6307
  F1(False):        0.7351
  Training Time:    1880.05 seconds
Processing Fold 3
Preparing training dataset...
Loading model with 4-bit quantization...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196


Adding EOS to train dataset:   0%|          | 0/4236 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/4236 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/4236 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009, 'pad_token_id': 128009}.


Step,Training Loss
50,2.238
100,1.4736
150,1.5307
200,1.5272
250,1.4761
300,1.4473
350,1.4851
400,1.4966
450,1.4514
500,1.4393


Training completed in 1888.62 seconds


Fold 3 Evaluation: 100%|██████████| 1048/1048 [05:32<00:00,  3.16it/s]



Fold 3 Results:
  Accuracy:        0.8120
  Precision(False): 0.8582
  Recall(False):    0.6765
  F1(False):        0.7566
  Training Time:    1888.62 seconds
Processing Fold 4
Preparing training dataset...
Loading model with 4-bit quantization...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196


Adding EOS to train dataset:   0%|          | 0/4242 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/4242 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/4242 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009, 'pad_token_id': 128009}.


Step,Training Loss
50,2.2585
100,1.5202
150,1.5543
200,1.5239
250,1.5418
300,1.5202
350,1.4519
400,1.4734
450,1.459
500,1.377


Training completed in 1894.46 seconds


Fold 4 Evaluation: 100%|██████████| 1042/1042 [05:30<00:00,  3.15it/s]



Fold 4 Results:
  Accuracy:        0.8378
  Precision(False): 0.8653
  Recall(False):    0.7604
  F1(False):        0.8094
  Training Time:    1894.46 seconds

Saving finetuned model (from last fold) for Part C...
Model saved to /content/drive/MyDrive/llama_finetuned_misinfo


In [None]:
# Report final results
print("FINAL RESULTS: LLAMA 3.1 (Finetuned)")

if results_finetuned:
    avg_acc = np.mean([r['accuracy'] for r in results_finetuned])
    avg_prec = np.mean([r['precision_False'] for r in results_finetuned])
    avg_recall = np.mean([r['recall_False'] for r in results_finetuned])
    avg_f1 = np.mean([r['f1_False'] for r in results_finetuned])
    avg_time = np.mean([r['training_time'] for r in results_finetuned])

    print(f"Average Accuracy:        {avg_acc:.4f}")
    print(f"Average Precision(False): {avg_prec:.4f}")
    print(f"Average Recall(False):    {avg_recall:.4f}")
    print(f"Average F1(False):        {avg_f1:.4f}")
    print(f"Average Training Time:    {avg_time:.2f} seconds")

    print("\nComparison with Few-Shot (from Part A):")
    print(f"Few-Shot F1(False): 0.6427")
    print(f"Finetuned F1(False): {avg_f1:.4f}")
    print(f"Improvement: {avg_f1 - 0.6427:.4f} ({((avg_f1 - 0.6427) / 0.6427 * 100):.2f}%)")

FINAL RESULTS: LLAMA 3.1 (Finetuned)
Average Accuracy:        0.8090
Average Precision(False): 0.8568
Average Recall(False):    0.6734
Average F1(False):        0.7522
Average Training Time:    1880.68 seconds

Comparison with Few-Shot (from Part A):
Few-Shot F1(False): 0.6427
Finetuned F1(False): 0.7522
Improvement: 0.1095 (17.03%)


**Discussion: Q2 Part B - Finetuned Llama Performance**

**Results Summary:**
| Model | Accuracy | Precision (False) | Recall (False) | F1 (False) | Training Time |
|-------|----------|-------------------|----------------|------------|---------------|
| Few-shot (Part A) | 0.5651 | 0.5824 | 0.7173 | 0.6427 | N/A |
| Finetuned Llama | 0.8090 | 0.8568 | 0.6734 | 0.7522 | 1880.68s |
| Finetuned BERT (with raw data) | 0.8181 | 0.7914 | 0.7971 | 0.7914 | 64.82s |

**Implementation:**
This part shares some common code blocks from the previous part. Different from the part, I used format_training_text in addition to
format_prompt since this time we have to give the corresponding label along with our system content (same from the last part) so that the model can learn. Also, I have finetuned the model 5 times in addition to testing it.

Because decoder based models are huge and resource consuming to finetune, I used a technique called LoRA which Çağrı hoca also addressed during lectures. I did not know how it worked and how it is used, so I utilized LLMs to learn how to initalize LoRA and how it works. LoRA actually freezes the original 8 billion parameters of the model and train only 41M newly added LoRA adapters which makes the finetuning a lot easier.

Lastly for finetuning, I used SFTTrainer (supervised fine-tuning trainer) as in the case of downstream task training we need labels which means we need supervised learning. That is why I used SFTTrainer. Again for the syntax and parameter usage, I used LLM tools to initalize the trainer. And since Part C uses the model finetuned in this part, for the last fold I am not deleting the model and actually saving it in a relative path.

**Analysis:**

For this part, the results are expected. Decoder based models are usually pretrained with massive amount of text without labels, so they are more suitable for general tasks including text generation. So without finetuning, we got a poor result. However with finetuning model actually sees task specific question and answer pairs which makes the model perform on the specific task. For this reasons the finetuned LLAMA got better results than zero-shot prompted LLAMA (not finetuned). Also, we can infer that finetuning is generally better than few-shot prompting for decoder based models. However there is a trade of as we need to use resources for finetuning whereas with few-shot we did not need to do that.

Another really important observation is that how this finetuned LLAMA is compared to our encoder based model BERT. When we compare their result, their performance looks similar, they have almost the same accuracy, LLAMA have better presicion but BERT has better Recall and F1 score. So they are comparable. But the major difference here is that their training time and resource usage. Finetuning LLAMA took 1800 seconds where as finetuning BERT only took 64. From this we can infer that for downstream tasks, using a encoder based models is far more logical than finetuning a decoder based model since their performance are almost equal (if not BERT is better) but their resource usage is so different.


**Potential Improvements:**
- **More Training Data:** Include Turkish data for multilingual capability
- **Longer Training:** More epochs might improve performance
- **Hyperparameter Tuning:** Adjust LoRA rank, learning rate, batch size

**Conclusion:** Finetuning provides improvements over prompt-based methods, demonstrating that task-specific training is essential for misinformation detection. The trade-off is significantly longer training time. Using an encoder based model for downstream tasks has more advantages over decoder based model as it has similar if not better performance and far far less training time.

In [None]:
# Load game reviews dataset
reviews_path = "/content/drive/MyDrive/data/game_reviews.csv"
reviews_df = pd.read_csv(reviews_path)

print(f"Dataset loaded: {len(reviews_df)} reviews")
print(f"Columns: {reviews_df.columns.tolist()}")
print(f"Label distribution:\n{reviews_df['user_suggestion'].value_counts()}")

# Configuration
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
finetuned_path = "/content/drive/MyDrive/llama_finetuned_misinfo"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Zero-shot prompt for game review sentiment
def format_sentiment_prompt(review):
    system_content = (
        "Based on the following game review, determine if the user recommends the game. "
        "Answer with exactly one word: 'Yes' if the user recommends the game, or 'No' if they do not. "
        "Do not explain your reasoning. Output ONLY 'Yes' or 'No'."
    )

    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": f"Review: {review}"}
    ]

    return messages

# Parse model output to binary label
def map_sentiment_to_id(text_output):

    text_clean = text_output.lower().strip()
    if "yes" in text_clean:
        return 1  # Recommends
    elif "no" in text_clean:
        return 0  # Does not recommend
    else:
        return 0  # Default to "not recommend" if unclear

# Evaluate a model on the game reviews dataset
def evaluate_sentiment_model(model, tokenizer, reviews_df, model_name):
    print(f"Evaluating {model_name}")

    predictions = []
    ground_truth = reviews_df['user_suggestion'].tolist()

    for review in tqdm(reviews_df['user_review'], desc=model_name):
        messages = format_sentiment_prompt(review)

        input_ids = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to(model.device)

        terminators = [
            tokenizer.eos_token_id,
            tokenizer.convert_tokens_to_ids("<|eot_id|>")
        ]

        with torch.no_grad():
            outputs = model.generate(
                input_ids,
                max_new_tokens=10,
                eos_token_id=terminators,
                do_sample=False,
                temperature=0.0
            )

        response = outputs[0][input_ids.shape[-1]:]
        output_text = tokenizer.decode(response, skip_special_tokens=True).strip()
        pred_id = map_sentiment_to_id(output_text)
        predictions.append(pred_id)

    # Calculate metrics
    acc = accuracy_score(ground_truth, predictions)
    prec, rec, f1, _ = precision_recall_fscore_support(
        ground_truth, predictions, average='binary', pos_label=1
    )

    return {
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1': f1,
        'predictions': predictions
    }

tokenizer_base = AutoTokenizer.from_pretrained(model_id)
tokenizer_base.pad_token = tokenizer_base.eos_token

model_base = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

results_base = evaluate_sentiment_model(model_base, tokenizer_base, reviews_df, "Base Llama")

print(f"Base Model Results:")
print(f"  Accuracy:  {results_base['accuracy']:.4f}")
print(f"  Precision: {results_base['precision']:.4f}")
print(f"  Recall:    {results_base['recall']:.4f}")
print(f"  F1 Score:  {results_base['f1']:.4f}")

# Cleanup base model
del model_base
torch.cuda.empty_cache()

from peft import PeftModel

# Load base model first
model_finetuned = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# Load LoRA adapter weights
model_finetuned = PeftModel.from_pretrained(model_finetuned, finetuned_path)

tokenizer_finetuned = AutoTokenizer.from_pretrained(finetuned_path)
tokenizer_finetuned.pad_token = tokenizer_finetuned.eos_token

results_finetuned_c = evaluate_sentiment_model(model_finetuned, tokenizer_finetuned, reviews_df, "Finetuned Llama")

print(f"\nFinetuned Model Results:")
print(f"  Accuracy:  {results_finetuned_c['accuracy']:.4f}")
print(f"  Precision: {results_finetuned_c['precision']:.4f}")
print(f"  Recall:    {results_finetuned_c['recall']:.4f}")
print(f"  F1 Score:  {results_finetuned_c['f1']:.4f}")

# Cleanup
del model_finetuned
torch.cuda.empty_cache()

print("COMPARISON: Base vs Finetuned on Game Reviews")
print(f"{'Metric':<15} {'Base Model':<15} {'Finetuned':<15} {'Difference':<15}")
print(f"{'Accuracy':<15} {results_base['accuracy']:<15.4f} {results_finetuned_c['accuracy']:<15.4f} {results_finetuned_c['accuracy'] - results_base['accuracy']:+.4f}")
print(f"{'Precision':<15} {results_base['precision']:<15.4f} {results_finetuned_c['precision']:<15.4f} {results_finetuned_c['precision'] - results_base['precision']:+.4f}")
print(f"{'Recall':<15} {results_base['recall']:<15.4f} {results_finetuned_c['recall']:<15.4f} {results_finetuned_c['recall'] - results_base['recall']:+.4f}")
print(f"{'F1 Score':<15} {results_base['f1']:<15.4f} {results_finetuned_c['f1']:<15.4f} {results_finetuned_c['f1'] - results_base['f1']:+.4f}")

Dataset loaded: 17494 reviews
Columns: ['review_id', 'title', 'year', 'user_review', 'user_suggestion']
Label distribution:
user_suggestion
1    9968
0    7526
Name: count, dtype: int64


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Evaluating Base Llama


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Base Llama:  86%|████████▌ | 14995/17494 [46:11<07:30,  5.55it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Base Llama:  86%|████████▌ | 14996/17494 [46:11<07:28,  5.57it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Base Llama:  86%|████████▌ | 14997/17494 [46:11<07:29,  5.55it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-

Base Model Results:
  Accuracy:  0.9164
  Precision: 0.9362
  Recall:    0.9157
  F1 Score:  0.9259


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Evaluating Finetuned Llama


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Finetuned Llama:  86%|████████▌ | 14995/17494 [1:17:28<12:47,  3.26it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Finetuned Llama:  86%|████████▌ | 14996/17494 [1:17:28<12:53,  3.23it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Finetuned Llama:  86%|████████▌ | 14997/17494 [1:17:28<12:55,  3.22it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token


Finetuned Model Results:
  Accuracy:  0.7510
  Precision: 0.9633
  Recall:    0.5853
  F1 Score:  0.7282
COMPARISON: Base vs Finetuned on Game Reviews
Metric          Base Model      Finetuned       Difference     
Accuracy        0.9164          0.7510          -0.1654
Precision       0.9362          0.9633          +0.0271
Recall          0.9157          0.5853          -0.3305
F1 Score        0.9259          0.7282          -0.1977





In [None]:
print(f"Base Model Results:")
print(f"  Accuracy:  {results_base['accuracy']:.4f}")
print(f"  Precision: {results_base['precision']:.4f}")
print(f"  Recall:    {results_base['recall']:.4f}")
print(f"  F1 Score:  {results_base['f1']:.4f}")

print(f"\nFinetuned Model Results:")
print(f"  Accuracy:  {results_finetuned_c['accuracy']:.4f}")
print(f"  Precision: {results_finetuned_c['precision']:.4f}")
print(f"  Recall:    {results_finetuned_c['recall']:.4f}")
print(f"  F1 Score:  {results_finetuned_c['f1']:.4f}")

print("COMPARISON: Base vs Finetuned on Game Reviews")
print(f"{'Metric':<15} {'Base Model':<15} {'Finetuned':<15} {'Difference':<15}")
print(f"{'Accuracy':<15} {results_base['accuracy']:<15.4f} {results_finetuned_c['accuracy']:<15.4f} {results_finetuned_c['accuracy'] - results_base['accuracy']:+.4f}")
print(f"{'Precision':<15} {results_base['precision']:<15.4f} {results_finetuned_c['precision']:<15.4f} {results_finetuned_c['precision'] - results_base['precision']:+.4f}")
print(f"{'Recall':<15} {results_base['recall']:<15.4f} {results_finetuned_c['recall']:<15.4f} {results_finetuned_c['recall'] - results_base['recall']:+.4f}")
print(f"{'F1 Score':<15} {results_base['f1']:<15.4f} {results_finetuned_c['f1']:<15.4f} {results_finetuned_c['f1'] - results_base['f1']:+.4f}")

Base Model Results:
  Accuracy:  0.9164
  Precision: 0.9362
  Recall:    0.9157
  F1 Score:  0.9259

Finetuned Model Results:
  Accuracy:  0.7510
  Precision: 0.9633
  Recall:    0.5853
  F1 Score:  0.7282
COMPARISON: Base vs Finetuned on Game Reviews
Metric          Base Model      Finetuned       Difference     
Accuracy        0.9164          0.7510          -0.1654
Precision       0.9362          0.9633          +0.0271
Recall          0.9157          0.5853          -0.3305
F1 Score        0.9259          0.7282          -0.1977


**Discussion: Q2 Part C - Domain Transfer to Sentiment Analysis**

**Results Summary:**
| Model | Accuracy | Precision | Recall | F1 Score |
|-------|----------|-----------|--------|----------|
| Base Llama | 0.9164 | 0.9362 | 0.9157 | 0.9259 |
| Finetuned Llama | 0.7510 | 0.9633 | 0.5853 | 0.7282 |
| **Difference** | **-16.54%** | +2.71% | **-33.05%** | **-19.77%** |

**Implementation:**
In terms of implementation, this part really resembles Q2 Part A. One difference is that the prompt we give the model to make a prediction has changed as this time instead of classifying tweets we are classifying game reviews. Other than that I am using the model finetuned in Part B and test that model on game reviews.

**Analysis:**

Clearly the base model performed way better than the finetuned model. I actually expected this due to a phenomenon that is also mentioned in our lectures called **Catastrophic Forgetting**. This is the situation when a decoder based model which is pretrained to have general knowledge forgets its general knowledge during finetuning on a downstream task. And this is exactly what happens in our case. This actually means that we should not use a decoder based model finetuned with a downstream task for a different ask.

**Conclusion:** Finetuning on misinformation detection negatively impacted the model's ability to perform sentiment analysis. This demonstrates the **specialization-generalization trade-off** in LLMs - task-specific training improves target task performance but can degrade performance on other tasks. For applications requiring multiple capabilities, using the base model or maintaining separate finetuned versions may be preferable.

## Q3 - Discussion of classification performance and resource use (20 points)

In this part, you are expected to discuss the misinformation detection performance and resource use of the models you have trained in Q1 Part A-B and Q2 Part A-B. Which model performs better? Which model uses more resources? Please discuss your OWN findings with YOUR OWN WORDS. Do not forget to share the time and resource need details you have measured and observed.

**Performance Comparison (All Metrics for False Class):**

| Model | Accuracy | Precision (False) | Recall (False) | F1 (False) | Training Time |
|-------|----------|-------------------|----------------|------------|---------------|
| BERT (Preprocessed) | 0.8079 | 0.7723 | 0.7727 | 0.7724 | 70.71s |
| BERT (Raw) | 0.8181 | ~0.79 | 0.7971 | 0.7914 | 64.82s |
| mBERT (EN→TR) | 0.5433 | 0.4235 | 0.1480 | 0.2187 | 66.11s |
| Llama Zero-shot | 0.5059 | 0.5577 | 0.6709 | 0.6091 | N/A |
| Llama Few-shot | 0.5651 | 0.5824 | 0.7173 | 0.6427 | N/A |
| **Llama Finetuned** | **0.8090** | **0.8568** | **0.6734** | **0.7522** | **1880.68s** |

Even though I already compared almost every model in terms of their performance on misinformation classification in my previous discussions, here are the summarization of those findings.

**Key Findings:**

- In general our encoder based model BERT has performed better than decoder based model Llama. This actually implies that encoder based models are better than decoder based models for downstream tasks like classification.

- Finetuned version of Llama was the only decoder based model that made it competetive with BERT. This actually implies that finetuned decoder based models could be an alternative to encoder based models.

- Even though their performance close, finetuned Llama requires a lot more resources than BERT
  - Training time -> 31 minutes per fold (almost 29 times slower than BERT)
  - Requires GPU with 15GB+ VRAM even with 4-bit quantization

- Since they have the similar performance but huge difference on resource demand, it is generally better to use encoder based models instead of finetuning a decoder based model on a downstream task. Also using a general-knowledge decoder based model for a downstream task is not logical as we are using a huge model for a specific relatively small task.

- Finally, among all versions it is better to use a Multilingual BERT trained with raw text as it has the best overall performance, fastest training and lowest resource requirements.

- If we do not have any data to work with, our only option is to use Llama with few-shot prompting which did not performed really well.

**Resource Requirements:**

| Model | Parameters | VRAM Required | Training Time (5 folds) |
|-------|------------|---------------|-------------------------|
| BERT | 110M | ~4GB | ~5.4 minutes |
| mBERT | 110M | ~4GB | ~5.5 minutes |
| Llama 3.1 8B (4-bit) | 8B (quantized) | ~8-10GB | ~2.6 hours |

**Conclusion:** For the misinformation detection task, **encoder-based models (BERT) provide the best balance of performance and efficiency**. While finetuned Llama achieves competitive results, the 29x longer training time and higher resource requirements make BERT more practical for most applications. The choice depends on specific constraints: BERT for efficiency, Llama for flexibility and potential multilingual generation capabilities.