# Lightweight Fine-Tuning Project

The task is to use a foundation model to detect fake news, framed as binary classification into `real` or `fake`. The dataset is `mohammadjavadpirhadi/fake-news-detection-dataset-english`, available from Hugging Face Datasets [here](https://huggingface.co/datasets/mohammadjavadpirhadi/fake-news-detection-dataset-english).

* PEFT technique: `QLoRA (4-bit quantization and LoRA)`
* Model: `microsoft/deberta-v3-base`
* Evaluation approach: `accuracy`
* Fine-tuning dataset: `mohammadjavadpirhadi/fake-news-detection-dataset-english`

## Loading and Evaluating a Foundation Model

In [2]:
from datasets import load_dataset

# Load the train and test splits
train_dataset = load_dataset("mohammadjavadpirhadi/fake-news-detection-dataset-english", split='train')
test_dataset = load_dataset("mohammadjavadpirhadi/fake-news-detection-dataset-english", split='test')

# Downsample the datasets due to memory and processing time constraints
train_dataset = train_dataset.shuffle(seed=99).select(range(5000))
test_dataset = test_dataset.shuffle(seed=99).select(range(500))

Generating train split: 35918 examples
Generating test split: 8980 examples

In [3]:
id2label = {0: "real", 1: "fake"}
label2id = {"real": 0, "fake": 1}

In [4]:
print(f"""
title: {train_dataset[0]['title']}
label: {id2label[train_dataset[0]['label']]}
text:
{train_dataset[0]['text']}
""")


title: Lebanon's finances can cope with PM resignation: finance minister
label: real
text:
BEIRUT (Reuters) - Lebanon and its financial institutions can cope with the impact of Prime Minister Saad Hariri s surprise resignation, Finance Minister Ali Hassan Khalil said on Monday.  We are confident in the stability of the financial and monetary situation in the country. There are no very big challenges ahead of us,  Khalil said in a televised statement after a meeting on the economy chaired by President Michel Aoun.  The state is able to finance itself,  he said.   



In [5]:
print(f"""
title: {train_dataset[1]['title']}
label: {id2label[train_dataset[1]['label']]}
text:
{train_dataset[1]['text']}
""")


title:  Sean Hannity Is Totally Butthurt Over This Onion Picture
label: fake
text:
Sean Hannity is giddy and offended that finally, just like his Great Orange Leader, there s a bloody Sean Hannity joke picture out there, and he finally gets to share in the victim outrage.All of Hannity s righteous indignation is over an Onion (yes, the satire site) article with the headline,  Hundreds Of Miniature Sean Hannitys Burst From Roger Ailes  Corpse.  The picture is what really got under Hannity s thin yet completely abrasive skin. It should several of him, like in the movie  Alien  bursting from what looks like a white shirt.Hannity, the man who bled advertisers over a false murder accusation toward Hillary Clinton, was just appalled that his 15 year old daughter would see such a horrible picture.What is wrong with the left that they think these sorts of things are funny? https://t.co/sAxON5xxmh  Sean Hannity (@seanhannity) June 1, 2017Personally, I don t think it s funny, but whatever. And 

In [6]:
model_path = 'microsoft/deberta-v3-base'

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=2048)

processed_train_dataset = train_dataset.map(
    lambda x: {'text': x["title"] + "\n" + x["text"]},
)

processed_test_dataset = test_dataset.map(
    lambda x: {'text': x["title"] + "\n" + x["text"]},
)
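
# Tokenize the combined title + text (the processed datasets), so that training
# sees the same input format that the pipeline receives at inference time
tokenized_train_dataset = processed_train_dataset.map(lambda x: tokenizer(x["text"], truncation=True),
                                                      remove_columns=["title", "text", "subject", "date"])
tokenized_test_dataset = processed_test_dataset.map(lambda x: tokenizer(x["text"], truncation=True),
                                                    remove_columns=["title", "text", "subject", "date"])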

In [8]:
import torch
import transformers
import numpy as np
from transformers import (
    AutoModelForSequenceClassification, TrainingArguments, BitsAndBytesConfig,
    DataCollatorWithPadding, Trainer, pipeline
)
from transformers.pipelines.pt_utils import KeyDataset
from peft import LoraConfig, AutoPeftModelForSequenceClassification, TaskType, get_peft_model

In [9]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_kwargs = dict(
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

model = AutoModelForSequenceClassification.from_pretrained(model_path, **model_kwargs)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
print(model)

DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear4bit(in_features=768, out_features=768, bias=True)
              (key_proj): Linear4bit(in_features=768, out_features=768, bias=True)
              (value_proj): Linear4bit(in_features=768, out_features=768, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear4bit(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-07, el

In [11]:
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [12]:
output = pipe(processed_train_dataset[0]['text'])
output

[{'label': 'fake', 'score': 0.571789026260376}]

In [13]:
output[0]

{'label': 'fake', 'score': 0.571789026260376}
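
A score close to 0.5 is what one would expect from a freshly initialized classification head. As a rough sketch of how a two-class `text-classification` output like the one above is derived, the snippet below applies a softmax to a pair of logits and reports the winning label with its probability. The logit values are hypothetical and chosen only for illustration; they are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "real", 1: "fake"}

# Hypothetical logits for one input (illustration only, not model output)
logits = [-0.1, 0.2]
probs = softmax(logits)

# Pick the class with the highest probability
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```

With nearly equal logits the score stays near 0.5, which is consistent with the untrained head behaving close to a coin flip.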

In [14]:
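def predict_test_set(pipe):
    # The parameter is named `pipe` to avoid shadowing the imported `pipeline` function
    preds = []
    # KeyDataset streams the 'text' column through the pipeline one example at a time
    for output in pipe(KeyDataset(processed_test_dataset, 'text')):
        preds.append(output['label'])
    return preds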

def evaluate_test_set(preds):
    num_correct = 0
    for pred, test_record in zip(preds, processed_test_dataset):
        if pred == id2label[test_record['label']]:
            num_correct += 1
    return num_correct / len(processed_test_dataset)

In [15]:
preds = predict_test_set(pipe)

Token indices sequence length is longer than the specified maximum sequence length for this model (2779 > 2048). Running this sequence through the model will result in indexing errors


In [16]:
evaluate_test_set(preds)

0.542

In [17]:
# Delete the pipeline and free up GPU memory
del pipe
torch.cuda.empty_cache()

## Performing Parameter-Efficient Fine-Tuning

In [18]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}
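
As a quick sanity check of `compute_metrics`, the sketch below runs the same function on a small set of toy logits and labels. The arrays are made up for illustration; in the notebook, the `Trainer` supplies the real logits and labels.

```python
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Take the index of the largest logit in each row as the predicted class
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# Toy logits for 4 examples and 2 classes (illustration only)
toy_logits = np.array([[2.0, -1.0],
                       [0.3, 0.9],
                       [1.5, 0.2],
                       [-0.4, 0.1]])
toy_labels = np.array([0, 1, 1, 1])

result = compute_metrics((toy_logits, toy_labels))
print(result)  # 3 of the 4 predictions match the labels
```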

In [19]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    inference_mode=False,
    task_type=TaskType.SEQ_CLS,
)

model = get_peft_model(model, peft_config=peft_config)
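
To make the roles of `r` and `lora_alpha` concrete, here is a minimal NumPy sketch of a LoRA-adapted linear layer. It is not the PEFT implementation: bias, dropout and quantization are omitted, the weights are random, and the zero initialization of `B` is an assumption that follows the usual LoRA convention.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 768, 768      # matches the projection sizes printed for the model
r, lora_alpha = 16, 32      # same values as in the LoraConfig above
scaling = lora_alpha / r    # the low-rank update is scaled by alpha / r

W = rng.normal(size=(d_out, d_in))           # frozen base weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable low-rank factor
B = np.zeros((d_out, r))                     # trainable, assumed zero at initialization

def lora_forward(x, W, A, B, scaling):
    """Base projection plus the scaled low-rank update: x W^T + scaling * x A^T B^T."""
    return x @ W.T + scaling * (x @ A.T @ B.T)

x = rng.normal(size=(4, d_in))               # a batch of 4 hypothetical hidden states
base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, scaling)

# With B at zero, the adapter starts as a no-op on top of the frozen layer
print(np.allclose(base_out, lora_out))
```

Only `A` and `B` would receive gradients during training, which is why the number of trainable parameters stays small relative to the full model.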

In [20]:
model.print_trainable_parameters()

trainable params: 591,362 || all params: 185,015,044 || trainable%: 0.3196
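
The trainable-parameter count can be reproduced with a short calculation. The sketch below assumes that LoRA adapts two projections per layer (`query_proj` and `value_proj`, which is an assumption about the default targets for this architecture) and that the classification head (768 inputs, 2 outputs, plus bias) is also trained. Under those assumptions the numbers match the printed output.

```python
hidden_size = 768        # in_features / out_features from the printed model
num_layers = 12          # "(0-11): 12 x DebertaV2Layer"
rank = 16                # r in the LoraConfig
num_labels = 2
adapted_per_layer = 2    # assumption: query_proj and value_proj

# Each adapted projection gets two matrices: A (rank x in) and B (out x rank)
lora_per_module = rank * hidden_size + hidden_size * rank
lora_params = num_layers * adapted_per_layer * lora_per_module

# Assumption: the classifier head (weight + bias) is trained alongside the adapters
classifier_params = hidden_size * num_labels + num_labels

trainable = lora_params + classifier_params
all_params = 185_015_044  # taken from the printed output above
trainable_pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,} || trainable%: {trainable_pct:.4f}")
```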


In [21]:
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./checkpoint_dir",
        overwrite_output_dir=True,
        # Set the learning rate
        learning_rate=1e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        # Evaluate and save the model after each epoch
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=2,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.1421        | 0.07143         | 1.0      |
| 2     | 0.0652        | 0.02866         | 1.0      |


TrainOutput(global_step=5000, training_loss=0.24774205932617188, metrics={'train_runtime': 2844.4732, 'train_samples_per_second': 3.516, 'train_steps_per_second': 1.758, 'total_flos': 3422276151887376.0, 'train_loss': 0.24774205932617188, 'epoch': 2.0})
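
The step count and throughput figures in `TrainOutput` follow from the training configuration. The following sketch redoes the arithmetic, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook).

```python
import math

num_train_examples = 5000          # size of the downsampled training set
per_device_train_batch_size = 2
num_train_epochs = 2
num_devices = 1                    # assumption: single GPU, no gradient accumulation

steps_per_epoch = math.ceil(num_train_examples / (per_device_train_batch_size * num_devices))
total_steps = steps_per_epoch * num_train_epochs

train_runtime = 2844.4732          # seconds, from the TrainOutput above
samples_per_second = num_train_examples * num_train_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```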

In [23]:
from huggingface_hub import notebook_login

notebook_login()

In [24]:
model.push_to_hub("zanelim/deberta-v3-fakenews")

CommitInfo(commit_url='https://huggingface.co/zanelim/deberta-v3-fakenews/commit/802755f480c080373ddf7be3675a144ac83831b7', commit_message='Upload model', commit_description='', oid='802755f480c080373ddf7be3675a144ac83831b7', pr_url=None, pr_revision=None, pr_num=None)

## Performing Inference with a PEFT Model

In [25]:
trainer.evaluate()

{'eval_loss': 0.028660252690315247,
 'eval_accuracy': 1.0,
 'eval_runtime': 46.349,
 'eval_samples_per_second': 10.788,
 'eval_steps_per_second': 5.394,
 'epoch': 2.0}

In [None]:
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [27]:
preds = predict_test_set(pipe)
evaluate_test_set(preds)

1.0

In [28]:
# Collect (predicted label, actual label, text) for every correct prediction
correct_preds = []
for pred, test_record in zip(preds, processed_test_dataset):
    if pred == id2label[test_record['label']]:
        correct_preds.append((pred, id2label[test_record['label']], test_record['text']))

In [29]:
# Look at the first 2 correctly predicted examples from the test set
for pred, label, text in correct_preds[:2]:
    print(f"Predicted label: {pred}")
    print(f"Actual label: {label}")
    print(f"News: {text}")
    print()

Predicted label: real
Actual label: real
News: Denmark set to become next European country to ban burqas
COPENHAGEN (Reuters) - Denmark looks set to become the next European country to restrict the burqa and the niqab, worn by some Muslim women, after most parties in the Danish parliament backed some sort of ban on facial coverings. Full and partial face veils such as burqas and niqabs divide opinion across Europe, setting advocates of religious freedom against secularists and those who argue that such garments are culturally alien or a symbol of the oppression of women. The niqab covers everything but the eyes, while the burqa also covers the eyes with a transparent veil. France, Belgium, the Netherlands, Bulgaria and the German state of Bavaria have all imposed some restrictions on the wearing of full-face veils in public places.  This is not a ban on religious clothing, this is a ban on masking,  Jacob Ellemann-Jensen, spokesman for the Liberal Party, told reporters on Friday after 