<h1>Predicting Fake News with a BERT + LSTM Model</h1>

<h3>Based on my reading of this <a href="https://www.sciencedirect.com/science/article/pii/S2405844023075904">paper</a></h3>

In [1]:
import pandas as pd
import glob


csv_files = ['politifact_fake.csv', 'gossipcop_fake.csv', 'gossipcop_real.csv', 'politifact_real.csv']

dfs = []

# Loop through the list of files and read each file into a DataFrame
for file in csv_files:
    df = pd.read_csv(file)
    if "fake" in file: 
        df["verdict"] = "Fake"
    else:
        df["verdict"] = "True"
    dfs.append(df)

# Concatenate all DataFrames in the list into a single DataFrame
df = pd.concat(dfs, ignore_index=True)

<h3>The data comes from a dataset release called FakeNewsNet. It combines news items checked by PolitiFact, a fact-checking site, with items from GossipCop, a site that applies the same concept to celebrity gossip.</h3>

In [2]:
df.shape

(23196, 5)

In [31]:
df.sample(5)

Unnamed: 0,id,news_url,title,tweet_ids,verdict
7526,gossipcop-2895484840,www.magzter.com/article/Celebrity/OK/Taylors-L...,Taylor's Lonely Life,279594449154232321\t280553081064800256\t280797...,Fake
4351,gossipcop-884847,https://www.longroom.com/discussion/720445/wat...,"Watch ""Belligerent"" Scott Disick Freak Out at ...",,True
10822,gossipcop-3662901506,radaronline.com/videos/ellen-degeneres-talk-sh...,Boss From Hell! Ellen DeGeneres Treats Her Tal...,943461276277231616\t943471999036411904\t943499...,Fake
5118,gossipcop-925000,https://www.pinterest.co.uk/pin/29533768803792...,Fearless from Beyoncé and Jay Z's Vacation Pics,981654072489947136,True
4236,gossipcop-907444,https://www.usmagazine.com/celebrity-news/news...,James Franco to Attend SAG Awards 2018 Amid Mi...,954377483410960386\t954389751481675778\t954389...,True


In [4]:
df["verdict"].value_counts().reset_index()

Unnamed: 0,verdict,count
0,True,17441
1,Fake,5755


<h2>There is a class imbalance between real and fake, so we downsample True to the same number of rows as Fake</h2>

In [5]:
# Downsample the majority class (True) to the size of the minority class (Fake)
sampled_true = df[df['verdict'] == 'True'].sample(n=5755, random_state=42)
fake_df = df[df['verdict'] != 'True']
df = pd.concat([sampled_true, fake_df], ignore_index=True)

In [6]:
df["verdict"].value_counts().reset_index()

Unnamed: 0,verdict,count
0,True,5755
1,Fake,5755


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Convert labels to integers
label_encoder = LabelEncoder()

texts = df["title"].to_list()

labels = label_encoder.fit_transform(df["verdict"].to_list())

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, 
    test_size=0.2,  # 20% of the data for testing
    stratify=labels,  # This ensures the distribution of labels is similar in both sets
    random_state=42  # For reproducibility of results
)


In [8]:
from torch.utils.data import Dataset, DataLoader
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast
import numpy as np
from tqdm import tqdm

class PolitifactDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
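        # Encodings are already tensors (return_tensors="pt"), so copy the slice
        # instead of re-wrapping it with torch.tensor, which raises a UserWarning
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)  # Ensure label tensors are long type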
        return item

    def __len__(self):
        return len(self.labels)

class BertLSTM(nn.Module):
    def __init__(self, bert_model_name='bert-base-uncased', hidden_dim=256, lstm_layers=1, dropout=0.1, num_classes=2):
        super(BertLSTM, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_dim, num_layers=lstm_layers, batch_first=True, dropout=dropout if lstm_layers > 1 else 0)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim, num_classes)

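    def forward(self, input_ids, attention_mask):
        # BERT is frozen: no gradients flow into it, so only the LSTM and classifier are trained
        with torch.no_grad():
            outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
        lstm_output, (h_n, c_n) = self.lstm(sequence_output)
        # Use the LSTM output at the final time step; for padded titles this is a padding position
        lstm_output = self.dropout(lstm_output[:, -1, :])
        logits = self.classifier(lstm_output)
        return logits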

# Tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Tokenizing the texts
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=128, return_tensors="pt")
test_encodings = tokenizer(X_test, truncation=True, padding=True, max_length=128, return_tensors="pt")

# Creating the datasets
train_dataset = PolitifactDataset(train_encodings, y_train)
test_dataset = PolitifactDataset(test_encodings, y_test)

# Model
model = BertLSTM()

# Use MPS (Apple Silicon GPU) on my M1 MacBook when available, otherwise fall back to CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)

# DataLoader setup with a batch size of 16
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

# training loop:
for epoch in range(3):  # Number of epochs
    model.train()
    total_loss = 0.0 
    num_batches = 0  # Count the number of batches processed
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}")
    for batch in progress_bar:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids, attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()  # Accumulate loss
        num_batches += 1
        
        # Update progress bar with mean loss for the current epoch
        progress_bar.set_postfix({'mean_loss': total_loss / num_batches})


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Epoch 1: 100%|███████████████| 576/576 [01:29<00:00,  6.41it/s, mean_loss=0.576]
Epoch 2: 100%|███████████████| 576/576 [01:29<00:00,  6.41it/s, mean_loss=0.527]
Epoch 3: 100%|███████████████| 576/576 [01:30<00:00,  6.40it/s, mean_loss=0.491]
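
<p>One caveat about <code>lstm_output[:, -1, :]</code> in the forward pass: the titles are right-padded to a common length, so for shorter titles the last time step is a padding position rather than the final real token. A possible alternative is to read the LSTM output at the last real token of each row. The plain-Python sketch below only illustrates the indexing on toy lists. The helper <code>last_real_positions</code> and all values are hypothetical, and in PyTorch the same positions would come from <code>attention_mask.sum(dim=1) - 1</code>. Whether this changes the scores above has not been tested here.</p>

```python
# Hypothetical helper, not part of the notebook: index of the last real
# (non-padding) token in each row of an attention mask (1 = token, 0 = padding).
def last_real_positions(attention_mask):
    return [sum(row) - 1 for row in attention_mask]

# Toy batch (made-up values): two titles padded to length 6.
attention_mask = [
    [1, 1, 1, 1, 0, 0],  # 4 real tokens followed by 2 padding positions
    [1, 1, 1, 1, 1, 1],  # no padding
]

# Toy "LSTM outputs": one number per position instead of a hidden vector.
lstm_output = [
    [0.1, 0.2, 0.3, 0.4, 0.9, 0.9],
    [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
]

positions = last_real_positions(attention_mask)
selected = [row[i] for row, i in zip(lstm_output, positions)]

print(positions)  # [3, 5]
print(selected)   # [0.4, 1.0]
```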


<h2>Let's test how well the model predicts on the held-out test set</h2>

In [9]:
from torch import no_grad

model.eval()

predictions = []
true_labels = []

with no_grad():  # Inference mode, gradients not needed
    for batch in tqdm(test_loader, desc="Evaluating"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids, attention_mask)
        logits = outputs
        preds = torch.argmax(logits, dim=1).cpu().numpy()  # Move predictions to CPU and convert to numpy
        
        predictions.extend(preds)
        true_labels.extend(labels.cpu().numpy())  # Move true labels to CPU and convert to numpy


Evaluating: 100%|█████████████████████████████| 144/144 [00:18<00:00,  7.70it/s]
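
<p>As a sanity check, the 576 training batches and 144 evaluation batches shown in the progress bars follow from the balanced dataset size, the 80/20 split and the batch size of 16. The sketch below redoes that arithmetic, assuming the DataLoader keeps the final partial batch (its default behaviour).</p>

```python
import math

n_total = 5755 * 2                      # balanced dataset: 5,755 titles per class
n_test = math.ceil(n_total * 20 / 100)  # test_size=0.2
n_train = n_total - n_test
batch_size = 16

# The last batch may be smaller than 16, so round up.
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)

print(n_train, n_test)              # 9208 2302
print(train_batches, test_batches)  # 576 144
```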


In [10]:
from sklearn.metrics import f1_score, classification_report, accuracy_score

# Calculate F1 score
f1 = f1_score(true_labels, predictions, average='weighted')  # 'weighted' accounts for label imbalance

# Detailed classification report
report = classification_report(true_labels, predictions, target_names=label_encoder.classes_)

# Accuracy
accuracy = accuracy_score(true_labels, predictions)

print(f"F1 Score: {f1}")
print(f"Accuracy: {accuracy}\n")
print("Classification Report:\n", report)


F1 Score: 0.7346953742993346
Accuracy: 0.7371850564726324

Classification Report:
               precision    recall  f1-score   support

        Fake       0.79      0.64      0.71      1151
        True       0.70      0.83      0.76      1151

    accuracy                           0.74      2302
   macro avg       0.75      0.74      0.73      2302
weighted avg       0.75      0.74      0.73      2302



<p>An F1 score of about 0.73 (accuracy of about 74%) is not too bad for a quick go. The paper said they got between 73% and 83% with deep learning models.</p>
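
<p>To make the per-class columns of the classification report concrete, here is a minimal sketch of how precision, recall and F1 are computed for one class. The function <code>class_metrics</code> and the eight toy labels are made up for illustration and are not the notebook's predictions. They are chosen to show the same pattern as the report, where Fake has higher precision than recall and True the reverse.</p>

```python
# Hypothetical helper: precision, recall and F1 for one class treated as "positive".
def class_metrics(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up): one Fake title is misclassified as True.
y_true = ["Fake", "Fake", "Fake", "Fake", "True", "True", "True", "True"]
y_pred = ["Fake", "Fake", "Fake", "True", "True", "True", "True", "True"]

fake_p, fake_r, fake_f1 = class_metrics(y_true, y_pred, "Fake")
true_p, true_r, true_f1 = class_metrics(y_true, y_pred, "True")

print(round(fake_p, 2), round(fake_r, 2), round(fake_f1, 2))  # 1.0 0.75 0.86
print(round(true_p, 2), round(true_r, 2), round(true_f1, 2))  # 0.8 1.0 0.89
```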

<h2>Let's test it on an article I saw posted in the x-gov AI channel</h2>

In [30]:
model.eval()

# Tokenize the single headline the same way as the training titles
encoding = tokenizer(["Disillusioned Businesses Discovering That AI Kind of Sucks."], truncation=True, padding=True, max_length=128, return_tensors="pt")
input_ids_test = encoding['input_ids']
attention_mask_test = encoding['attention_mask']

with torch.no_grad():  # Inference only, gradients not needed
    outputs = model(input_ids_test.to(device), attention_mask_test.to(device))
logits = outputs
predictions = torch.argmax(logits, dim=1).cpu().numpy()  # Move predictions to CPU and convert to numpy

# LabelEncoder sorts class names alphabetically, so 0 = "Fake" and 1 = "True"
index_to_class = {0: "fake", 1: "true"}

predicted_label = index_to_class[predictions[0]]
print(f"Predicted label: {predicted_label}")


Predicted label: true
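
<p>The hard-coded <code>index_to_class</code> mapping works because <code>LabelEncoder</code> assigns integer codes in sorted order of the class names, which puts "Fake" at 0 and "True" at 1. The plain-Python sketch below mimics that ordering on a toy list of verdicts (the list is made up). In the notebook itself, the same mapping could be read from <code>label_encoder.classes_</code> instead of being typed by hand.</p>

```python
# Toy verdict column (made up), in arbitrary order.
verdicts = ["True", "Fake", "Fake", "True"]

# Sorted unique class names, mirroring the order LabelEncoder uses.
classes = sorted(set(verdicts))

index_to_class = dict(enumerate(classes))
class_to_index = {name: idx for idx, name in index_to_class.items()}
encoded = [class_to_index[v] for v in verdicts]

print(index_to_class)  # {0: 'Fake', 1: 'True'}
print(encoded)         # [1, 0, 0, 1]
```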


<h2>And one I have just made up</h2>

In [23]:
encoding = tokenizer(["Aliens have landed in manchester"], truncation=True, padding=True, max_length=128, return_tensors="pt")
input_ids_test = encoding['input_ids']
attention_mask_test = encoding['attention_mask']

with torch.no_grad():  # Inference only, gradients not needed
    outputs = model(input_ids_test.to(device), attention_mask_test.to(device))
logits = outputs
predictions = torch.argmax(logits, dim=1).cpu().numpy()

predicted_label = index_to_class[predictions[0]]
print(f"Predicted label: {predicted_label}")


Predicted label: fake
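
<p>The argmax only says which class scored highest, not by how much. Applying a softmax to the two logits gives a rough confidence for each class. The sketch below uses plain Python and made-up logits (the notebook does not print the real ones). In PyTorch the equivalent call would be <code>torch.softmax(logits, dim=1)</code>.</p>

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one headline, ordered as [fake, true].
logits = [1.2, -0.3]
probs = softmax(logits)

index_to_class = {0: "fake", 1: "true"}
best = max(range(len(probs)), key=lambda i: probs[i])
print(f"Predicted label: {index_to_class[best]} (p={probs[best]:.2f})")
```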
