# Fake News Detector using BERT and PyTorch

- BERT-based fake news detector (Hugging Face Transformers, PyTorch)
- Fine-tune a pretrained transformer (DistilBERT / BERT) on fake vs. real news
- Evaluate using accuracy / precision / recall / F1
- Save the tokenizer and model for inference

## Import Libraries

In [1]:
%pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


In [2]:
import os
import random
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from datasets import Dataset
import evaluate

## Set Hyperparameters

In [3]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

MODEL_NAME = "distilbert-base-uncased"   # use "bert-base-uncased" if you have GPU and more time
MAX_SAMPLES = None   # e.g., 20000 for subsampling on low-memory machines, or None to use all
MAX_LENGTH = 256     # truncation/padding length
BATCH_SIZE = 16      # reduce to 8 or 4 on low-memory CPUs
EPOCHS = 3           # a few epochs are enough when fine-tuning a large pretrained model
OUTPUT_DIR = "hf_fake_news_model"

device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


## Download Dataset from Kaggle

In [8]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("clmentbisaillon/fake-and-real-news-dataset")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'fake-and-real-news-dataset' dataset.
Path to dataset files: /kaggle/input/fake-and-real-news-dataset


## Load Dataset

In [9]:
root_dir = path   # dataset directory returned by kagglehub.dataset_download above
real_file = "True.csv"
fake_file = "Fake.csv"
real_path = ""
fake_path = ""
for dirpath, dirnames, filenames in os.walk(root_dir):
    for filename in filenames:
        if filename == real_file:
            real_path = os.path.join(dirpath, filename)
        elif filename == fake_file:
            fake_path = os.path.join(dirpath, filename)

fake = pd.read_csv(fake_path)
real = pd.read_csv(real_path)

In [10]:
fake

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016"
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016"
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016"
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016"


In [11]:
real

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"
...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017"
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017"
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017"
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017"


### Set Labels

In [12]:
# Label: fake=0, real=1
fake["label"] = 0
real["label"] = 1

## Preprocess Dataset
- Concatenate the labeled real and fake news and shuffle the rows
- Split into training and validation sets (85% and 15%)
- Convert the pandas DataFrames to Hugging Face Datasets

In [13]:
df = pd.concat([fake, real], axis=0).sample(frac=1, random_state=SEED).reset_index(drop=True)
# Keep only text fields to simplify
df["content"] = (df["title"].fillna("") + " " + df["text"].fillna("")).str.strip()
df = df[["content", "label"]]
df = df[df["content"].str.len() > 30].reset_index(drop=True)   # remove extremely short rows

# Optional downsample for low memory
if isinstance(MAX_SAMPLES, int) and MAX_SAMPLES > 0:
    df = df.sample(n=MAX_SAMPLES, random_state=SEED).reset_index(drop=True)

print("Dataset size:", df.shape)
df.head()


Dataset size: (44896, 2)


Unnamed: 0,content,label
0,Ben Stein Calls Out 9th Circuit Court: Committ...,0
1,Trump drops Steve Bannon from National Securit...,1
2,Puerto Rico expects U.S. to lift Jones Act shi...,1
3,OOPS: Trump Just Accidentally Confirmed He Lea...,0
4,Donald Trump heads for Scotland to reopen a go...,1


In [14]:
train_df, val_df = train_test_split(df, test_size=0.15, random_state=SEED, stratify=df["label"])
print("Train:", train_df.shape, "Val:", val_df.shape)

Train: (38161, 2) Val: (6735, 2)
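
The split sizes printed above follow from the 15% held-out fraction. A minimal sketch of the arithmetic, assuming the held-out count is rounded up and the remainder goes to training (which matches the printed shapes); `split_sizes` is a hypothetical helper, not part of the notebook:

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_val) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest is used for training.
    """
    n_val = math.ceil(n_samples * test_size)
    return n_samples - n_val, n_val

# 44896 rows remain after filtering out very short articles
n_train, n_val = split_sizes(44896, 0.15)
print("Train:", n_train, "Val:", n_val)
```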


In [15]:
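# Convert to Hugging Face Datasets. from_pandas keeps the pandas index as an
# extra "__index_level_0__" column, which is why the shapes below show 3 columns.
train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)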
print("Train Dataset Shape: ", train_ds.shape)
print("Validation Dataset Shape: ", val_ds.shape)

Train Dataset Shape:  (38161, 3)
Validation Dataset Shape:  (6735, 3)


In [16]:
print(f"Data: {train_ds[0]}")

Data: {'content': 'TWO “HIGH THREAT” EXPLOSIVE Experts Moved From GITMO To African Country With Over 90% Muslim Population [VIDEO] If someone would have told me in 2008 that we would be releasing Muslim explosive experts from GITMO to a country where over 90% of its citizens were Muslim, I m quite sure I would have thought they were out of their minds. Fast forward to 2016 and the idea that this is really happening is barely registering as a blip on the radar of most Americans. Have Obama s radical policies, that have largely gone unchecked, and his open disregard for our national security caused Americans to ignore the treason his is committing against our nation? Two of Al Qaeda s former explosives experts were just transferred out of Guantanamo Bay and sent to Senegal, the Defense Department confirmed Monday, marking the latest detainees to be shipped out of the prison camp despite the risk they could return to the battlefield.The two Libyan former detainees were separately listed a

## Download Tokenizer and Model from Hugging Face (distilbert-base-uncased)

In [17]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Tokenize Dataset (convert text to token IDs; padding is applied per batch)

In [18]:
def tokenize_fn(batch):
    return tokenizer(batch["content"], padding=False, truncation=True, max_length=MAX_LENGTH)

# Use map to tokenize datasets (batched)
train_ds = train_ds.map(tokenize_fn, batched=True, remove_columns=["content"])
val_ds = val_ds.map(tokenize_fn, batched=True, remove_columns=["content"])

# Data collator (dynamic padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [19]:
print(f"Data: {train_ds[0]}")

Data: {'label': 0, '__index_level_0__': 28020, 'input_ids': [101, 2048, 1523, 2152, 5081, 1524, 11355, 8519, 2333, 2013, 21025, 21246, 2080, 2000, 3060, 2406, 2007, 2058, 3938, 1003, 5152, 2313, 1031, 2678, 1033, 2065, 2619, 2052, 2031, 2409, 2033, 1999, 2263, 2008, 2057, 2052, 2022, 8287, 5152, 11355, 8519, 2013, 21025, 21246, 2080, 2000, 1037, 2406, 2073, 2058, 3938, 1003, 1997, 2049, 4480, 2020, 5152, 1010, 1045, 1049, 3243, 2469, 1045, 2052, 2031, 2245, 2027, 2020, 2041, 1997, 2037, 9273, 1012, 3435, 2830, 2000, 2355, 1998, 1996, 2801, 2008, 2023, 2003, 2428, 6230, 2003, 4510, 25719, 2004, 1037, 1038, 15000, 2006, 1996, 7217, 1997, 2087, 4841, 1012, 2031, 8112, 1055, 7490, 6043, 1010, 2008, 2031, 4321, 2908, 4895, 5403, 18141, 1010, 1998, 2010, 2330, 27770, 2005, 2256, 2120, 3036, 3303, 4841, 2000, 8568, 1996, 14712, 2010, 2003, 16873, 2114, 2256, 3842, 1029, 2048, 1997, 2632, 18659, 1055, 2280, 14792, 8519, 2020, 2074, 4015, 2041, 1997, 23094, 3016, 1998, 2741, 2000, 16028, 1010, 
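
The tokenizer call above uses `padding=False`, so each example keeps its own length (capped at `MAX_LENGTH`) and padding is left to the data collator at batch time. The toy sketch below illustrates that idea in plain Python. It is not the `DataCollatorWithPadding` implementation; `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption based on the `[PAD]` token of the uncased BERT vocabulary.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest sequence in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two short, made-up token ID sequences of different lengths
toy_batch = [[101, 2048, 102], [101, 2048, 1523, 2152, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding to the longest sequence in each batch, rather than always to `MAX_LENGTH`, keeps batches of short articles small.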

## Set Evaluation Metrics
- accuracy
- F1
- precision
- recall
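
As a quick reference for how these four numbers relate, here is a minimal sketch that derives them from confusion-matrix counts, treating label 1 (real) as the positive class. `binary_metrics` is a hypothetical helper and the counts are illustrative, chosen only to sum to the 6735 validation rows.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts: 2 fake articles predicted as real, no real article missed
m = binary_metrics(tp=3213, fp=2, fn=0, tn=3520)
print({name: round(value, 6) for name, value in m.items()})
```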

In [20]:
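# Loaded from the evaluate hub but not used below; compute_metrics relies on scikit-learn
metric_acc = evaluate.load("accuracy")
metric_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    # eval_pred unpacks into the raw logits and the true labels
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)   # predicted class = index of the largest logit
    acc = accuracy_score(labels, preds)
    # average="binary" scores the positive class only, i.e. label 1 (real)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}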

## Set Training Arguments

In [21]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),  # only if GPU supports it
)

## Set Trainer

In [22]:
trainer = Trainer(
    model=model, # distilbert-base-uncased
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


## Start Training

In [23]:
trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0021,7.4e-05,1.0,1.0,1.0,1.0
2,0.0,0.000685,0.999703,0.999378,1.0,0.999689
3,0.0,4e-06,1.0,1.0,1.0,1.0


TrainOutput(global_step=7158, training_loss=0.004251225704452651, metrics={'train_runtime': 882.7045, 'train_samples_per_second': 129.696, 'train_steps_per_second': 8.109, 'total_flos': 7582632600167424.0, 'train_loss': 0.004251225704452651, 'epoch': 3.0})
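
The `global_step=7158` in the output above can be reproduced from the training set size, batch size and epoch count. A minimal sketch, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_steps` is a hypothetical helper:

```python
import math

def total_steps(n_train: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a full run (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # the last partial batch still counts
    return steps_per_epoch * epochs

steps = total_steps(n_train=38161, batch_size=16, epochs=3)
print(steps)
```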

## Save Model and Tokenizer

In [24]:
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

('hf_fake_news_model/tokenizer_config.json',
 'hf_fake_news_model/special_tokens_map.json',
 'hf_fake_news_model/vocab.txt',
 'hf_fake_news_model/added_tokens.json',
 'hf_fake_news_model/tokenizer.json')

## Evaluate Model

In [25]:
eval_res = trainer.evaluate(eval_dataset=val_ds)
print("Eval results:", eval_res)

Eval results: {'eval_loss': 7.424170325975865e-05, 'eval_accuracy': 1.0, 'eval_precision': 1.0, 'eval_recall': 1.0, 'eval_f1': 1.0, 'eval_runtime': 14.7687, 'eval_samples_per_second': 456.033, 'eval_steps_per_second': 28.506, 'epoch': 3.0}


## Detailed Evaluation Report

In [26]:
val_preds = trainer.predict(val_ds)
val_logits = val_preds.predictions
val_labels = val_preds.label_ids
val_preds_arg = np.argmax(val_logits, axis=-1)

print("Classification Report (val):")
print(classification_report(val_labels, val_preds_arg, target_names=["FAKE","REAL"]))

Classification Report (val):
              precision    recall  f1-score   support

        FAKE       1.00      1.00      1.00      3522
        REAL       1.00      1.00      1.00      3213

    accuracy                           1.00      6735
   macro avg       1.00      1.00      1.00      6735
weighted avg       1.00      1.00      1.00      6735



## Example Inference

In [27]:
from transformers import pipeline
pipe = pipeline("text-classification", model=OUTPUT_DIR, tokenizer=OUTPUT_DIR, device=0 if torch.cuda.is_available() else -1)

samples = [
    "Local council approves new budget for schools and parks.",
    "Shocking: cure for common cold discovered by home remedy!"
]
print(pipe(samples))

Device set to use cuda:0


[{'label': 'LABEL_1', 'score': 0.9702268838882446}, {'label': 'LABEL_0', 'score': 0.999929666519165}]
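
The pipeline reports generic `LABEL_0` / `LABEL_1` names because no class names were stored with the model. A minimal sketch for turning them back into the notebook's labels (fake=0, real=1); `ID2LABEL` and `readable` are hypothetical names, and `raw` simply copies the output printed above:

```python
ID2LABEL = {0: "FAKE", 1: "REAL"}  # same convention as the "Set Labels" cell

def readable(predictions):
    """Replace 'LABEL_<id>' with the class name, keeping the score."""
    named = []
    for pred in predictions:
        class_id = int(pred["label"].rsplit("_", 1)[-1])
        named.append({"label": ID2LABEL[class_id], "score": pred["score"]})
    return named

raw = [
    {"label": "LABEL_1", "score": 0.9702268838882446},
    {"label": "LABEL_0", "score": 0.999929666519165},
]
named = readable(raw)
print(named)
```

An alternative, assuming the usual Transformers config behavior, is to pass `id2label` and `label2id` to `AutoModelForSequenceClassification.from_pretrained` before training so the names are saved with the model and returned by the pipeline directly.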
