<a href="https://colab.research.google.com/github/setth123/Longformer-Finetuned/blob/main/Longformer_Finetuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Summary
This project fine-tunes Longformer, a transformer architecture designed to process long text sequences efficiently, for fake news detection. Its attention mechanism combines a sliding local window with a small number of global tokens, so the cost grows roughly linearly with sequence length rather than quadratically. That makes it a good fit for this task: news articles are often long, and classifying them requires context from across the whole document.
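
To make the efficiency argument concrete, the sketch below counts query-key pairs for full self-attention and for a sliding window with one global token. It is an illustration only: `count_attention_pairs` is a hypothetical helper, the window of 512 and the single global token are assumptions taken from the training log further down, and the real Longformer implementation uses chunked matrix operations rather than an explicit loop.

```python
def count_attention_pairs(seq_len: int, window: int, n_global: int = 1) -> dict:
    """Count query-key pairs for full vs. sliding-window attention.

    Assumes each token attends to `window // 2` neighbours on each side
    (plus itself) and that each global token attends to, and is attended
    by, every position.
    """
    half = window // 2
    full = seq_len * seq_len

    local = 0
    for i in range(seq_len):
        lo = max(0, i - half)
        hi = min(seq_len - 1, i + half)
        local += hi - lo + 1

    # Each global token adds one full row and one full column.
    # This is an upper bound: some of those pairs already lie inside the window.
    global_extra = 2 * n_global * seq_len
    return {"full": full, "windowed": local + global_extra}


pairs = count_attention_pairs(seq_len=4096, window=512)
print(pairs["full"] / pairs["windowed"])  # ratio of full to windowed cost
```

Doubling `seq_len` roughly doubles the windowed count but quadruples the full count, which is the property that makes 4096-token inputs practical.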

### Install necessary libraries

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.wh

### Load dataset

In [None]:
from google.colab import drive
import pandas as pd

# Mount Google Drive to read the CSV files.
drive.mount('/content/drive', force_remount=True)

# Label convention: 0 = fake, 1 = real.
fakeData = pd.read_csv('/content/drive/MyDrive/Dataset/New_dataset/Fake.csv')
fakeData['labels'] = 0
realData = pd.read_csv('/content/drive/MyDrive/Dataset/New_dataset/True.csv')
realData['labels'] = 1

df = pd.concat([fakeData, realData])
print("Number of records: ", len(df))
print("Preview data")
df.sample(5)

Mounted at /content/drive
Number of records:  44898
Preview data


| index | title | text | subject | date | labels |
|---|---|---|---|---|---|
| 10383 | Reporter files criminal charge of battery agai... | (Reuters) - A reporter for the conservative we... | politicsNews | March 11, 2016 | 1 |
| 10044 | Lawyers for ex House Speaker Hastert ask judge... | CHICAGO (Reuters) - Lawyers for former U.S. Ho... | politicsNews | April 7, 2016 | 1 |
| 14292 | Rosneft's Sechin to miss hearing at ex-ministe... | MOSCOW (Reuters) - The head of Russian state o... | worldnews | November 21, 2017 | 1 |
| 16769 | ONLY HOURS AFTER DEATH Of Supreme Court Justic... | The US Supreme Court is set to decide the firs... | Government News | Feb 13, 2016 | 0 |
| 4305 | WATCH: GOP Senator Pleads With Trump To Drop ... | Mike Lee (R-Utah) has been a major figure in a... | News | October 8, 2016 | 0 |
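
The labelling step above can be checked for class balance before training. The sketch below mirrors it with the standard library only, so it runs without Drive access; the two CSV strings are hypothetical stand-ins for `Fake.csv` and `True.csv`, and `load_labeled` is an illustrative helper, not part of the notebook.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for Fake.csv and True.csv.
FAKE_CSV = "title,text\nA,fabricated story one\nB,fabricated story two\n"
TRUE_CSV = "title,text\nC,(Reuters) - verified report\n"


def load_labeled(csv_text: str, label: int) -> list[dict]:
    """Parse CSV text and attach a constant `labels` value to every row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["labels"] = label
    return rows


# Same convention as the notebook: 0 = fake, 1 = real.
records = load_labeled(FAKE_CSV, 0) + load_labeled(TRUE_CSV, 1)
balance = Counter(row["labels"] for row in records)
print(balance)
```

On the real DataFrame, `df['labels'].value_counts()` gives the same per-class counts.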


### Data preprocessing

In [None]:
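import re

# Keep only the article body and its label; copy to avoid chained-assignment warnings.
df = df[['text', 'labels']].copy()

def preprocess_text(text):
    # Delete every character that is not a letter, digit, or space.
    return re.sub(r"[^a-zA-Z0-9 ]", "", text)

df['text'] = df['text'].apply(preprocess_text)
df.sample(5)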

| index | text | labels |
|---|---|---|
| 3065 | Actress Gabrielle Union did not hold back afte... | 0 |
| 15097 | BRUSSELS Reuters The European Union told Brit... | 1 |
| 8191 | The Tonight Show s Jimmy Fallon debuted his im... | 0 |
| 18611 | Stories of governments removing citizens from ... | 0 |
| 11792 | BAGHDADERBIL Iraq Reuters Opposition groups q... | 1 |
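
Because the pattern deletes characters instead of replacing them, words joined by punctuation are fused, as in `BAGHDADERBIL` in the preview above. The sketch below reproduces that effect and shows one possible alternative. The sample string is an assumed reconstruction of the original text, and `replace_non_alnum` is a hypothetical variant, not what the notebook runs.

```python
import re


def strip_non_alnum(text: str) -> str:
    """Same pattern as the notebook: delete everything except letters, digits, spaces."""
    return re.sub(r"[^a-zA-Z0-9 ]", "", text)


def replace_non_alnum(text: str) -> str:
    """Hypothetical alternative: substitute a space, then collapse repeated spaces."""
    spaced = re.sub(r"[^a-zA-Z0-9 ]", " ", text)
    return re.sub(r" +", " ", spaced).strip()


sample = "BAGHDAD/ERBIL, Iraq (Reuters) - Opposition groups"
print(strip_non_alnum(sample))    # words around "/" are fused
print(replace_non_alnum(sample))  # word boundaries are preserved
```

Whether the difference matters for accuracy is not established here; it only changes which subword tokens the tokenizer sees.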


### Tokenize

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

dataset = Dataset.from_pandas(df)
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

def preprocess_function(examples):
    # Truncate to Longformer's 4096-token limit; padding is left to the data collator.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=4096
    )

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Expose only the columns the model needs, as PyTorch tensors.
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print(tokenized_datasets[0])


{'labels': tensor(0), 'input_ids': tensor([    0, 19195,   140,    95,  1705,   326,  2813,    70,  1791,    10,
         9899,   188,  2041,     8,   989,    24,    23,    14,  2978,    37,
           56,     7,   492,    10, 18066,    66,     7,    39, 11058,  3988,
          268,     8,  1437,     5,   182, 27820,  4486,   340,   433,  1437,
           20,   320,  2015,   311,   999,    56,    95,    65,   633,     7,
          109,     8,    37,  1705,   326,   109,    24,   287,    84,  5093,
         6042, 11461,  3651,     8, 18369,    38,   236,     7,  2813,    70,
            9,   127,   964,  2732, 11058,  3988,   268,     8,   190,     5,
          182, 27820, 24530,   491,  2454,    10,  9899,     8, 21487,   188,
         2041,  1437,   270, 32420, 42616,  2858,  1437,   199,    40,    28,
           10,   372,    76,    13,   730,   287,    84,  5093,  6042, 11461,
         3651,     8, 18369,    38,   236,     7,  2813,    70,     9,   127,
          964,  2732, 11058, 
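
The tokenizer call truncates but does not pad, so sequences keep their own lengths until `DataCollatorWithPadding` pads each batch in the training cell; the training log also reports that inputs are padded to a multiple of the attention window (512). The pure-Python sketch below illustrates both steps. `pad_batch` is a hypothetical helper, not the library's implementation, and the pad id of 1 assumes the RoBERTa-style vocabulary.

```python
PAD_TOKEN_ID = 1  # assumption: RoBERTa-style vocabulary, where <pad> has id 1


def pad_batch(batch: list[list[int]], multiple_of: int = 1):
    """Pad every sequence to the longest one in the batch, rounded up to `multiple_of`."""
    longest = max(len(ids) for ids in batch)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple

    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = target - len(ids)
        input_ids.append(ids + [PAD_TOKEN_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask


# Hypothetical token ids for two short sequences.
ids, mask = pad_batch([[0, 19195, 140, 2], [0, 95, 2]])
print(ids)
print(mask)

# Rounding up to the attention window, as the training log describes.
ids_512, _ = pad_batch([[0, 19195, 140, 2], [0, 95, 2]], multiple_of=512)
print(len(ids_512[0]))
```

Padding per batch instead of to 4096 up front keeps short articles cheap, since memory and compute follow the longest sequence in each batch.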



### Load model

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig

# Use the GPU when available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load Longformer with a two-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2
)
model.to(device)

# Configure LoRA
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["attention.self.query", "attention.self.key", "attention.self.value"]
)

model = get_peft_model(model, peft_config)
# print_trainable_parameters() prints its summary itself and returns None
model.print_trainable_parameters()


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,034,498 || all params: 149,695,492 || trainable%: 0.6911
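
The reported trainable-parameter count can be reproduced by hand, which is a useful sanity check that LoRA attached to the intended layers. The sketch below assumes a hidden size of 768 and 12 encoder layers for `longformer-base-4096`, and assumes the newly initialized classifier head (`classifier.dense`, `classifier.out_proj`) stays fully trainable under `task_type="SEQ_CLS"`.

```python
HIDDEN = 768     # assumption: hidden size of longformer-base
LAYERS = 12      # assumption: number of encoder layers
RANK = 8         # r in LoraConfig
TARGETS = 3      # query, key, value
NUM_LABELS = 2

# Each adapted Linear(HIDDEN, HIDDEN) gains A (RANK x HIDDEN) and B (HIDDEN x RANK).
lora_params = LAYERS * TARGETS * (RANK * HIDDEN + HIDDEN * RANK)

# Classifier head: dense (HIDDEN x HIDDEN + bias) and out_proj (HIDDEN x NUM_LABELS + bias).
head_params = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_LABELS + NUM_LABELS)

trainable = lora_params + head_params
print(f"{trainable:,}")                         # 1,034,498
print(f"{100 * trainable / 149_695_492:.4f}%")  # 0.6911%
```

Under these assumptions the LoRA matrices account for 442,368 parameters and the classifier head for 592,130, so more than half of the trainable weights sit in the head, not in the adapters.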


### Training

In [None]:

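import os
# Turn off wandb logging (optional)
os.environ["WANDB_DISABLED"] = "true"
# Debugging aids for CUDA errors; CUDA_LAUNCH_BLOCKING makes kernel launches synchronous and slows training
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['TORCH_USE_CUDA_DSA'] = '1'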


In [None]:
from transformers import DataCollatorWithPadding

# Metrics used for evaluation
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

modelPath = "/content/drive/MyDrive/AI Models/Longformer_Finetuned"
training_args = TrainingArguments(
    output_dir=modelPath,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    logging_dir=f"{modelPath}/logs",
    logging_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    gradient_checkpointing=True,
    report_to="none",        # replaces the deprecated WANDB_DISABLED variable
    label_names=["labels"],  # PEFT wrappers hide the base model's label argument
    )


dataset_split=tokenized_datasets.train_test_split(test_size=0.2)

def compute_metrics(p):
    preds = p.predictions.argmax(axis=-1)
    labels = p.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

model.save_pretrained(modelPath)
tokenizer.save_pretrained(modelPath)



Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
Initializing global attention on CLS token...



Input ids are automatically padded to be a multiple of `config.attention_window`: 512


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.1885 | 0.158036 | 0.978062 | 0.978135 | 0.978062 | 0.978067 |
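
The Accuracy and Recall columns are identical (0.978062) because `compute_metrics` uses `average='weighted'`: recall weighted by class support is mathematically the same quantity as accuracy. The sketch below re-implements the weighted scores in plain Python to show this. `weighted_scores` is a hypothetical helper and the labels are toy values, not data from the notebook.

```python
from collections import Counter


def weighted_scores(labels: list[int], preds: list[int]) -> dict:
    """Accuracy plus support-weighted precision, recall and F1."""
    support = Counter(labels)
    total = len(labels)
    precision = recall = f1 = 0.0

    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        p_cls = tp / n_pred if n_pred else 0.0
        r_cls = tp / n_true
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if p_cls + r_cls else 0.0

        weight = n_true / total  # each class counts in proportion to its support
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls

    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: one fake article (0) misclassified as real (1).
scores = weighted_scores([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
print(scores)
```

If per-class behaviour matters, for example how many fake articles slip through, `average=None` in `precision_recall_fscore_support` returns one value per class instead.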


### Evaluate model

In [None]:

# Save the fine-tuned model
trainer.save_model("/content/drive/MyDrive/AI Models/Longformer_Finetuned/Model")

# Evaluate on the held-out split
eval_results = trainer.evaluate()
print("Evaluation metrics after training:")
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")
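
Once the model is evaluated, its two output logits still have to be mapped back to a label. The sketch below shows that last step in plain Python, assuming the label convention from the loading cell (0 = fake, 1 = real). `classify` and `ID2LABEL` are hypothetical names, and the logits are made-up values, not model output.

```python
import math

ID2LABEL = {0: "fake", 1: "real"}  # follows the labels assigned when loading the CSVs


def classify(logits: list[float]) -> tuple[str, float]:
    """Apply softmax to the logits and return the top label with its probability."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


label, prob = classify([-1.2, 2.3])  # hypothetical logits for one article
print(label, round(prob, 4))
```

In practice the logits would come from running the saved model on a tokenized article; the probability is a softmax score, not a calibrated confidence.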

