<a href="https://colab.research.google.com/github/setth123/BTL_DACN/blob/main/Longformer_Finetuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Summary
This project fine-tunes Longformer, a transformer architecture designed to process long text sequences efficiently, for fake news detection. Longformer combines sliding-window local attention with a small number of global attention tokens, so its attention cost grows linearly rather than quadratically with sequence length. That makes it a good fit for this problem: news articles are often long, and classifying them requires context from across the whole document.
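
To make the efficiency argument concrete, the sketch below compares how many query-key pairs full self-attention and sliding-window attention score for one sequence. The 4,096-token length and the 512-token window correspond to the `allenai/longformer-base-4096` checkpoint used in this notebook; the function names are illustrative, and the count is an upper bound that ignores global attention tokens and sequence edges.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Every token attends to every token: quadratic in sequence length."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Each token attends to at most `window` neighbours: linear in sequence length.

    Upper bound only: edge tokens see fewer neighbours and global tokens are ignored.
    """
    return seq_len * min(window, seq_len)


seq_len, window = 4096, 512
full = full_attention_pairs(seq_len)
local = sliding_window_pairs(seq_len, window)
print(f"full: {full:,}  sliding window: {local:,}  ratio: {full / local:.0f}x")
```

At this length the local pattern scores roughly one eighth of the pairs, and the gap widens as documents get longer.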

### Install necessary libraries



In [None]:
!pip install  datasets


### Load dataset

In [None]:
from google.colab import files
import zipfile
import pandas as pd
import os

uploaded = files.upload()

# Unpack the uploaded archive
with zipfile.ZipFile("News_dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("News_dataset")

# Label fake articles 0 and real articles 1
fakeData = pd.read_csv("News_dataset/New_dataset/Fake.csv")
fakeData['labels'] = 0

realData = pd.read_csv("News_dataset/New_dataset/True.csv")
realData['labels'] = 1

df = pd.concat([fakeData, realData])


print("Number of records: ", len(df))
print("Preview data:")
df.sample(5)


Number of records:  44898
Preview data:


| | title | text | subject | date | labels |
|---|---|---|---|---|---|
| 7352 | This Christian Mom Thought It Was A Good Idea... | Her name is M.H. Weibe and she s here to rap a... | News | March 22, 2016 | 0 |
| 16491 | RECKLESS: CLINTON PRESIDENCY Could Mean U.S. M... | Hillary Clinton always putting a radical ideol... | Government News | Jul 25, 2016 | 0 |
| 11657 | JAPANESE SCHOOLS DON’T EMPLOY JANITORS…Why Ame... | Watch NPR employee and Afghanistan refugee (wh... | politics | Feb 15, 2017 | 0 |
| 1409 | Trump’s FCC Will Decimate Internet Freedom (V... | Republicans on the Federal Communications Com... | News | May 19, 2017 | 0 |
| 4904 | Treasury's Mnuchin says Trump does not want tr... | BERLIN (Reuters) - U.S. Treasury Secretary Ste... | politicsNews | March 16, 2017 | 1 |


### Data preprocessing

In [None]:
import re

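# Keep only the article body and its label; copy so the assignment below does not touch a view
df = df[['text', 'labels']].copy()

def preprocess_text(text):
    # Strip every character except ASCII letters, digits, and spaces
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)
    return text

df['text'] = df['text'].apply(preprocess_text)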
df.sample(5)

| | text | labels |
|---|---|---|
| 9571 | WASHINGTON Reuters US lawmakers are making pr... | 1 |
| 7656 | Two people were injured during a shoesale shoo... | 0 |
| 21816 | | 0 |
| 4852 | Another day another foreign hack into America ... | 0 |
| 15077 | SEOUL Reuters The leaders of South Korea and ... | 1 |
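
The effect of this cleaning step is easy to check on a single string. The sketch below repeats `preprocess_text` and applies it to a made-up sentence (the sample text is illustrative, not taken from the dataset). The pattern also removes punctuation and apostrophes, and rows whose text is empty, such as index 21816 in the preview, stay in the dataset unless they are filtered out; `is_blank` is a hypothetical helper for spotting them.

```python
import re


def preprocess_text(text: str) -> str:
    # Same pattern as in the notebook: keep only ASCII letters, digits, and spaces.
    return re.sub(r"[^a-zA-Z0-9 ]", "", text)


def is_blank(text: str) -> bool:
    # Hypothetical helper: True when nothing usable is left after cleaning.
    return text.strip() == ""


sample = "BERLIN (Reuters) - It's 2017!"  # illustrative input
cleaned = preprocess_text(sample)
print(repr(cleaned))
print(is_blank(preprocess_text("...")), is_blank(cleaned))
```

Removing the dash leaves two consecutive spaces behind, so a whitespace-normalizing step may be worth adding if exact spacing matters.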


### Tokenize

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

dataset = Dataset.from_pandas(df)
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
def preprocess_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=4096
    )
tokenized_datasets = dataset.map(preprocess_function, batched=True)

tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print(tokenized_datasets[0])


{'labels': tensor(0), 'input_ids': tensor([    0, 19195,   140,    95,  1705,   326,  2813,    70,  1791,    10,
         9899,   188,  2041,     8,   989,    24,    23,    14,  2978,    37,
           56,     7,   492,    10, 18066,    66,     7,    39, 11058,  3988,
          268,     8,  1437,     5,   182, 27820,  4486,   340,   433,  1437,
           20,   320,  2015,   311,   999,    56,    95,    65,   633,     7,
          109,     8,    37,  1705,   326,   109,    24,   287,    84,  5093,
         6042, 11461,  3651,     8, 18369,    38,   236,     7,  2813,    70,
            9,   127,   964,  2732, 11058,  3988,   268,     8,   190,     5,
          182, 27820, 24530,   491,  2454,    10,  9899,     8, 21487,   188,
         2041,  1437,   270, 32420, 42616,  2858,  1437,   199,    40,    28,
           10,   372,    76,    13,   730,   287,    84,  5093,  6042, 11461,
         3651,     8, 18369,    38,   236,     7,  2813,    70,     9,   127,
          964,  2732, 11058, 
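
The tokenizer call above truncates but does not pad, so padding is left to the batch collator at training time. Longformer then pads each batch again so that its length is a multiple of the attention window (512 for this checkpoint, as the training log reports). A minimal sketch of that length arithmetic, with hypothetical function names and made-up token counts:

```python
def pad_to_longest(lengths: list[int]) -> int:
    """Dynamic padding: a batch is padded to its longest sequence."""
    return max(lengths)


def pad_to_window_multiple(seq_len: int, window: int = 512) -> int:
    """Round a length up to the next multiple of the attention window."""
    return -(-seq_len // window) * window  # ceiling division


batch_lengths = [310, 1200, 845, 97]  # illustrative token counts for one batch of 4
collated = pad_to_longest(batch_lengths)
model_input = pad_to_window_multiple(collated)
print(collated, model_input)
```

Padding per batch rather than to the full 4,096 tokens keeps short articles cheap, which matters given Longformer's memory footprint.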



### Load model

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig

# Check for GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load Longformer with a two-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2
)
model.to(device)

# Configure LoRA
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["attention.self.query", "attention.self.key", "attention.self.value"]
)

model = get_peft_model(model, peft_config)
# print_trainable_parameters() prints its summary itself and returns None
model.print_trainable_parameters()


Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

trainable params: 1,034,498 || all params: 149,695,492 || trainable%: 0.6911
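
The reported number of trainable parameters can be reproduced by hand. The sketch below assumes the standard base configuration (hidden size 768, 12 encoder layers), LoRA adapters of rank 8 on the query, key, and value projections of every layer, and a classification head (`dense` plus `out_proj`) that stays fully trainable under the `SEQ_CLS` task type. The hidden size and layer count are assumptions about the architecture, not values read from the checkpoint.

```python
hidden, layers, rank, num_labels = 768, 12, 8, 2

# Each adapted projection gains two low-rank matrices: A (rank x hidden) and B (hidden x rank).
lora_per_module = rank * hidden + hidden * rank
lora_total = lora_per_module * 3 * layers  # query, key, value in every layer

# Classification head: dense (hidden -> hidden) and out_proj (hidden -> num_labels), with biases.
head_total = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

trainable = lora_total + head_total
share = 100 * trainable / 149_695_492  # total taken from the output above
print(f"LoRA: {lora_total:,}  head: {head_total:,}  total: {trainable:,}")
print(f"share of all parameters: {share:.4f}%")
```

The total matches the 1,034,498 trainable parameters in the output, and more than half of it comes from the classification head rather than from the adapters.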


### Training

In [None]:

import os
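# Turn off Weights & Biases logging (optional)
os.environ["WANDB_DISABLED"] = "true"
# Debugging aids for CUDA errors (synchronous kernel launches, device-side assertions)
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['TORCH_USE_CUDA_DSA'] = '1'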


In [None]:
from transformers import DataCollatorWithPadding

# Metrics used by compute_metrics during evaluation
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

modelPath="/content/drive/MyDrive/AI Models/Longformer_Finetuned"
training_args = TrainingArguments(
    output_dir=modelPath,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    logging_dir=f"{modelPath}/logs",
    logging_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    gradient_checkpointing=True,
)


dataset_split=tokenized_datasets.train_test_split(test_size=0.2)

def compute_metrics(p):
    preds = p.predictions.argmax(axis=-1)
    labels = p.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

model.save_pretrained(modelPath)
tokenizer.save_pretrained(modelPath)

# Save the final model to a separate subfolder
trainer.save_model("/content/drive/MyDrive/AI Models/Longformer_Finetuned/Model")

Initializing global attention on CLS token...
Input ids are automatically padded to be a multiple of `config.attention_window`: 512


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.1855 | 0.156185 | 0.979065 | 0.979082 | 0.979065 | 0.979068 |
| 2 | 0.1006 | 0.083698 | 0.98363 | 0.983667 | 0.98363 | 0.983634 |
| 3 | 0.09 | 0.071888 | 0.984187 | 0.984226 | 0.984187 | 0.984191 |
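
In every epoch the recall value equals the accuracy value. That follows from `average='weighted'` in `compute_metrics`: each class's recall is weighted by its share of the true labels, so the weighted sum collapses to the overall fraction of correct predictions. The sketch below checks this on made-up labels, using plain Python rather than scikit-learn.

```python
from collections import Counter

labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # illustrative ground truth
preds = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]   # illustrative predictions

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

support = Counter(labels)  # number of true examples per class
correct = Counter(y for p, y in zip(preds, labels) if p == y)  # correct predictions per class

# Weighted recall: per-class recall, weighted by each class's share of the labels.
weighted_recall = sum(
    (support[c] / len(labels)) * (correct[c] / support[c]) for c in support
)
print(accuracy, weighted_recall)
```

Because of this identity, the recall column adds no information beyond accuracy here; per-class or macro-averaged recall would show whether fake and real articles are handled equally well.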




('/content/drive/MyDrive/AI Models/Longformer_Finetuned/tokenizer_config.json',
 '/content/drive/MyDrive/AI Models/Longformer_Finetuned/special_tokens_map.json',
 '/content/drive/MyDrive/AI Models/Longformer_Finetuned/vocab.json',
 '/content/drive/MyDrive/AI Models/Longformer_Finetuned/merges.txt',
 '/content/drive/MyDrive/AI Models/Longformer_Finetuned/added_tokens.json',
 '/content/drive/MyDrive/AI Models/Longformer_Finetuned/tokenizer.json')
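
For reference, the batch settings in `TrainingArguments` combine as follows. With `per_device_train_batch_size=4` and `gradient_accumulation_steps=4`, each optimizer step sees 16 examples per device. The sketch assumes a single GPU and that the 80/20 split rounds the test set up; the step counts are approximate, since the exact figure depends on how the installed `Trainer` version handles the final partial batch.

```python
import math

num_rows = 44_898                       # dataset size reported when loading the data
test_rows = math.ceil(0.2 * num_rows)   # assumption: the test split is rounded up
train_rows = num_rows - test_rows

per_device_batch, grad_accum, num_devices = 4, 4, 1  # num_devices=1 is an assumption
effective_batch = per_device_batch * grad_accum * num_devices

steps_per_epoch = math.ceil(train_rows / effective_batch)  # approximate
total_steps = 3 * steps_per_epoch                          # num_train_epochs=3
print(effective_batch, train_rows, steps_per_epoch, total_steps)
```

Gradient accumulation gives the optimizer the statistics of a 16-example batch while only four long sequences have to fit in GPU memory at once.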

## Conclusion
The fine-tuned Longformer model performs very well on this fake news detection dataset. Over three training epochs, training loss fell from 0.1855 to 0.0900 and validation loss from 0.1562 to 0.0719, while validation accuracy rose from 97.91% to 98.42%. Final validation metrics after epoch 3:
- Accuracy: 98.42%
- Precision: 98.42%
- Recall: 98.42%
- F1-score: 98.42%

The Longformer-based model is highly effective for binary fake news classification on this dataset. Its strong validation metrics make it a strong candidate for deployment in real-world applications that require reliable detection of misinformation.