# Fake News Classification using BERT

[Get Dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv)

In [1]:
#importing torch and checking GPU compatibility
import torch
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import numpy as np
device=torch.device('cpu')
if torch.cuda.is_available():
    print("GPU available:",torch.cuda.device_count())
    print("GPU Model:",torch.cuda.get_device_name(0))
    device=torch.device('cuda')

GPU available: 1
GPU Model: NVIDIA GeForce RTX 3050 Laptop GPU


In [5]:
true_df=pd.read_csv('data/fake_news/True.csv')
fake_df=pd.read_csv('data/fake_news/Fake.csv')

In [6]:
true_df.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [8]:
fake_df.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [43]:
#sample true text 
true_df.text.iloc[0]

'WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support educat

In [39]:
fake_df.text.iloc[1]

'House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. As it happens, the dossier is not what started the investigation, according to documents obtained by the New York Times.Former Trump campaign adviser George Papadopoulos was drunk in a wine bar when he revealed knowledge of Russian opposition research on Hillary Clinton.On top of that, Papadopoulos wasn t just a covfefe boy for Trump, as his administration has alleged. He had a much larger role, but none so damning as being a drunken fool in a wine bar. Coffee boys  don t help to arrange a New York meeting between Trump and President Abdel Fattah el-Sisi of Egypt two months before the election. It was known before that the former aide set up meetings with world leaders for Trump, but team Tr

### Text Preprocessing

In [45]:
import string
import contractions 

def contract(text):
    '''Contractions fix eg: we're = we are'''
    # creating an empty list
    expanded_words = []   
    for word in text.split():
      # using contractions.fix to expand the shortened words
      expanded_words.append(contractions.fix(word))
    return ' '.join(expanded_words)

def clean_text(text):
    '''Make text lowercase, remove text in square brackets,remove links,remove punctuation
    and remove words containing numbers.'''
    # text = text.lower()
    text=contract(text) #apply contractions
    text = re.sub('\[.*?\]', ' ', text)
    text = re.sub('https?://\S+|www\.\S+', ' ', text)
    text = re.sub('<.*?>+', ' ', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub('[^A-Za-z0-9]+', ' ', text)
    text = re.sub('\n', ' ', text)
    text = re.sub('\w*\d\w*', ' ', text)
    text = re.sub(' +', ' ', text) # remove extra spaces
    return text

In [48]:
%%time
true_df.text=true_df.text.apply(clean_text)
fake_df.text=fake_df.text.apply(clean_text)

CPU times: user 36.5 s, sys: 143 ms, total: 36.7 s
Wall time: 36.7 s


In [49]:
#combining dataset
true_df['type']='real'
fake_df['type']='fake'
true_df['label']=0
fake_df['label']=1

In [51]:
full_df=true_df.append(fake_df)
full_df.to_csv('data/fake_news_full_preprsd.csv')

In [52]:
full_df

Unnamed: 0,title,text,subject,date,type,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON Reuters The head of a conservative ...,politicsNews,"December 31, 2017",real,0
1,U.S. military to accept transgender recruits o...,WASHINGTON Reuters Transgender people will be ...,politicsNews,"December 29, 2017",real,0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON Reuters The special counsel investi...,politicsNews,"December 31, 2017",real,0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON Reuters Trump campaign adviser Geor...,politicsNews,"December 30, 2017",real,0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE WASHINGTON Reuters President Donald Tr...,politicsNews,"December 29, 2017",real,0
...,...,...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,Century Wire says As reported earlier this we...,Middle-east,"January 16, 2016",fake,1
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,Century Wire says It s a familiar theme Whene...,Middle-east,"January 16, 2016",fake,1
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen Century WireRemember when t...,Middle-east,"January 15, 2016",fake,1
23479,How to Blow $700 Million: Al Jazeera America F...,Century Wire says Al Jazeera America will go ...,Middle-east,"January 14, 2016",fake,1


In [122]:
full_df.to_csv('data/fake_news_cleaned.csv',index=False)

In [2]:
full_df=pd.read_csv('data/fake_news_cleaned.csv')

## Preparing for Bert

In [3]:
from transformers import BertModel,BertTokenizer,AdamW,get_linear_schedule_with_warmup,BertForSequenceClassification
from torch.utils.data import DataLoader,Dataset
import torch.nn.functional as f
from torch import nn,optim
from sklearn.model_selection import train_test_split
RANDOM_SEED = 42
# np.random.seed(RANDOM_SEED)
# torch.manual_seed(RANDOM_SEED)

In [5]:
model_name='bert-base-cased'
tokenizer=BertTokenizer.from_pretrained(model_name)
bert_model=BertModel.from_pretrained(model_name)

ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/bert-base-cased (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7efef5c793a0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

In [5]:
#EXPERIMENT
# for d in train_dataloader:
#     out=bert_model(input_ids=d["input_ids"],
#                   attention_mask=d["attention_mask"])
#     print(out)
#     break 

In [6]:
#EXP
#out['pooler_output']

In [7]:
bert_model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [8]:
def get_split(text1):
    '''To split text >250 into chuncks of 200 with overlapping 50 words'''
    l_total = []
    l_partial = []
    if len(text1.split()) //150 >0:
        n = len(text1.split())//150
    else:
        n = 1
    for w in range(n):
        if w == 0:
            l_partial = text1.split() [:200]
            l_total.append(" ".join(l_partial))
        else:
            l_partial = text1.split() [w*150:w*150 + 200]
            l_total.append(" ".join(l_partial))
    return l_total

In [9]:
##eg: 
get_split(full_df.text.iloc[0])

["['WASHINGTON Reuters The head of a conservative Republican faction in the YOU S Congress who voted this month for a huge expansion of the national debt to pay for tax cuts called himself a fiscal conservative on Sunday and urged budget restraint in In keeping with a sharp pivot under way among Republicans YOU S Representative Mark Meadows speaking on CBS Face the Nation drew a hard line on federal spending which lawmakers are bracing to do battle over in January When they return from the holidays on Wednesday lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues such as immigration policy even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress President Donald Trump and his Republicans want a big budget increase in military spending while Democrats also want proportional increases for non defense discretionary spending on programs that support education scientific research 

In [10]:
full_df.text=full_df.text.apply(get_split)

#### Creating Dataset Functions

In [11]:
### Splitting Data

df_train,df_val = train_test_split(full_df,test_size=0.2,random_state=RANDOM_SEED)

In [12]:
class NewsDataset(Dataset):
	def __init__(self, texts, labels, tokenizer, max_len):
		self.texts = texts
		self.labels = labels
		self.tokenizer = tokenizer
		self.max_len = max_len
        
	def __len__(self):
		return len(self.texts)
    
	def __getitem__(self, item):
		news=str(self.texts[item])
		target= self.labels[item]
		encoding = self.tokenizer.encode_plus(
		news,
		add_special_tokens=True,
		max_length=self.max_len,
		return_token_type_ids=False,
		pad_to_max_length=True,
		return_attention_mask=True,
		return_tensors='pt',
        )
		return {
		'doc_text': news,
		'input_ids': encoding['input_ids'].flatten(),
		'attention_mask': encoding['attention_mask'].flatten(), 'targets': torch.tensor(target, dtype=torch.long)
		}

In [13]:
# Data Loader

def create_data_loader(df, tokenizer, max_len, batch_size):
    ds=NewsDataset(texts=df.text.to_numpy(),
    labels=df.label.to_numpy(),
    tokenizer=tokenizer,
    max_len=max_len
    )
    return DataLoader(
    ds,
    batch_size=batch_size,
    num_workers=4
    )

#### Defining Model and functions

In [14]:
class NewsClassifier(nn.Module):
    def __init__(self, n_classes,dropout=0.3):
        super(NewsClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
    
    def forward(self, input_ids, attention_mask):
        output = self.bert(
          input_ids=input_ids,
          attention_mask=attention_mask
        )
        pooled_output=output["pooler_output"]
        output = self.drop(pooled_output)
        return self.out(output)

In [15]:
def train_epoch(
  model, 
  data_loader, 
  loss_fn, 
  optimizer, 
  device, 
  scheduler, 
  n_examples
    ):
    model = model.train()

    losses = []
    correct_predictions = 0

    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        
#         print(input_ids,attention_mask)

        outputs = model(
          input_ids=input_ids,
          attention_mask=attention_mask
        )

        _, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)

        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())

        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        torch.cuda.empty_cache()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    return correct_predictions.double() / n_examples, np.mean(losses)

In [16]:
def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()

    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)

            outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1)

            loss = loss_fn(outputs, targets)

            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())

    return correct_predictions.double() / n_examples, np.mean(losses)

In [17]:
def get_predictions(model, data_loader):
    model = model.eval()

    doc_texts = []
    predictions = []
    prediction_probs = []
    real_values = []

    with torch.no_grad():
        for d in data_loader:

            texts = d["doc_text"]
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)

            outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1)

            probs = F.softmax(outputs, dim=1)

            doc_texts.extend(texts)
            predictions.extend(preds)
            prediction_probs.extend(probs)
            real_values.extend(targets)

    predictions = torch.stack(predictions).cpu()
    prediction_probs = torch.stack(prediction_probs).cpu()
    real_values = torch.stack(real_values).cpu()
    return doc_texts, predictions, prediction_probs, real_values

## Finalizing Training Stuff

In [18]:
from collections import defaultdict
# Configs
BATCH_SIZE=2
MAX_LEN=128
EPOCHS=5
LR=[5e-5, 3e-5, 2e-5] #recommended LRs
classes=2

In [19]:
train_dataloader=create_data_loader(df_train,tokenizer,max_len=MAX_LEN,batch_size=BATCH_SIZE)
val_dataloader=create_data_loader(df_val,tokenizer,max_len=MAX_LEN,batch_size=BATCH_SIZE)

In [20]:
model=NewsClassifier(n_classes=classes)
model = model.to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
optimizer = AdamW(model.parameters(), lr=LR[-1], correct_bias=False)
total_steps = len(train_dataloader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=0,
num_training_steps=total_steps
)

In [22]:
loss_fn = nn.CrossEntropyLoss().to(device)
history = defaultdict(list)
best_accuracy = 0

In [23]:
#optional if memory error
# torch.cuda.empty_cache()

In [24]:
# GPU memory summary
# print(torch.cuda.memory_summary(device=None, abbreviated=False))

In [25]:
for epoch in range(EPOCHS):

    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    train_acc, train_loss = train_epoch(
    model,
    train_dataloader,    
    loss_fn, 
    optimizer, 
    device, 
    scheduler, 
    len(df_train)
    )

    print(f'Train loss {train_loss} accuracy {train_acc}')

    val_acc, val_loss = eval_model(
    model,
    val_dataloader,
    loss_fn, 
    device, 
    len(df_val)
    )
    
    print(f'Val   loss {val_loss} accuracy {val_acc}')
    print()

    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)

    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'saved_models/best_model_state.bin')
        best_accuracy = val_acc

Epoch 1/5
----------


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-s

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 1.62 GiB already allocated; 0 bytes free; 1.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF