<a href="https://colab.research.google.com/github/tomdyer10/fake_news/blob/master/BERT_Classifieir.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Notebook Summary**

Training BERT model on kaggle fake news dataset, using pytorch and hugging face transformer library.

Dataset can be found here - https://www.kaggle.com/c/fake-news

Top entries achieved 98% accuracy on test dataset.

Note - I am using google colab GPU which has fairly limited memory - as such I've had to limit the amount of data I train my model on quite substantially. I have not tried to train this model to completion, see my notebook on this topic with. fastAI + BERT for that. 

References:

- Much of the BERT implementation code inspired by this post https://towardsdatascience.com/bert-classifier-just-another-pytorch-model-881b3cf05784

- Hugging face transformer - https://huggingface.co/bert-base-uncased

- BERT paper - https://arxiv.org/abs/1706.03762

In [0]:
!pip install pytorch_pretrained_bert

Collecting pytorch_pretrained_bert
[?25l  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123kB)
[K     |██▋                             | 10kB 20.3MB/s eta 0:00:01[K     |█████▎                          | 20kB 4.2MB/s eta 0:00:01[K     |████████                        | 30kB 5.3MB/s eta 0:00:01[K     |██████████▋                     | 40kB 5.4MB/s eta 0:00:01[K     |█████████████▎                  | 51kB 4.9MB/s eta 0:00:01[K     |███████████████▉                | 61kB 5.0MB/s eta 0:00:01[K     |██████████████████▌             | 71kB 5.6MB/s eta 0:00:01[K     |█████████████████████▏          | 81kB 5.9MB/s eta 0:00:01[K     |███████████████████████▉        | 92kB 6.1MB/s eta 0:00:01[K     |██████████████████████████▌     | 102kB 6.1MB/s eta 0:00:01[K     |█████████████████████████████▏  | 112kB 6.1MB/s eta 0:00:01[K     |██████████████████████

In [0]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

INFO:pytorch_pretrained_bert.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache, downloading to /tmp/tmpu8r7pthq
100%|██████████| 231508/231508 [00:00<00:00, 934647.79B/s]
INFO:pytorch_pretrained_bert.file_utils:copying /tmp/tmpu8r7pthq to cache at /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:pytorch_pretrained_bert.file_utils:creating metadata file for /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:pytorch_pretrained_bert.file_utils:removing temp file /tmp/tmpu8r7pthq
INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068

In [0]:
from __future__ import print_function, division
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from random import randrange
import torch.nn.functional as F

In [0]:
text = 'testing the tokenizer'
zz = tokenizer.tokenize(text)

*Note - notice how it breaks tokenizer down to its root word - token*

In [0]:
class BertLayerNorm(nn.Module):
        def __init__(self, hidden_size, eps=1e-12):
            """Construct a layernorm module in the TF style (epsilon inside the square root).
            """
            super(BertLayerNorm, self).__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.bias = nn.Parameter(torch.zeros(hidden_size))
            self.variance_epsilon = eps

        def forward(self, x):
            u = x.mean(-1, keepdim=True)
            s = (x - u).pow(2).mean(-1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.variance_epsilon)
            return self.weight * x + self.bias
        

class BertForSequenceClassification(nn.Module):
    """BERT model for classification.
    This module is composed of the BERT model with a linear layer on top of
    the pooled output.
    Params:
        `config`: a BertConfig class instance with the configuration to build a new model.
        `num_labels`: the number of classes for the classifier. Default = 2.
    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary. Items in the batch should begin with the special "CLS" token. (see the tokens preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `labels`: labels for the classification output: torch.LongTensor of shape [batch_size]
            with indices selected in [0, ..., num_labels].
    Outputs:
        if `labels` is not `None`:
            Outputs the CrossEntropy classification loss of the output with the labels.
        if `labels` is `None`:
            Outputs the classification logits of shape [batch_size, num_labels].
    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
    num_labels = 2
    model = BertForSequenceClassification(config, num_labels)
    logits = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, num_labels=2):
        super(BertForSequenceClassification, self).__init__()
        self.num_labels = num_labels
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        nn.init.xavier_normal_(self.classifier.weight)
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        return logits
    def freeze_bert_encoder(self):
        for param in self.bert.parameters():
            param.requires_grad = False
    
    def unfreeze_bert_encoder(self):
        for param in self.bert.parameters():
            param.requires_grad = True



In [0]:
from pytorch_pretrained_bert import BertConfig

config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=3, num_attention_heads=12, intermediate_size=3072)

num_labels = 2
model = BertForSequenceClassification(num_labels)

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(zz)])

logits = model(tokens_tensor)

INFO:pytorch_pretrained_bert.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz not found in cache, downloading to /tmp/tmpc9lev7c7
100%|██████████| 407873900/407873900 [00:14<00:00, 27529386.94B/s]
INFO:pytorch_pretrained_bert.file_utils:copying /tmp/tmpc9lev7c7 to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
INFO:pytorch_pretrained_bert.file_utils:creating metadata file for /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
INFO:pytorch_pretrained_bert.file_utils:removing temp file /tmp/tmpc9lev7c7
INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb7

Load Dataset

In [0]:
import pandas as pd
path = 'drive/My Drive/fake_news_1/data/train.csv'
df = pd.read_csv(path)
df = df.dropna()

In [0]:
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [0]:
df.shape

(18285, 5)

Labels:

1 - unreliable

0 - reliable

Split data - only using a sample of a few thousand datapoints because of memory limitations.

In [0]:
df_test = df.sample(n=2000)
df_test.shape

(2000, 5)

In [0]:
from sklearn.model_selection import train_test_split
X = df_test['text']
y = df_test['label']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

In [0]:
X_train = X_train.values.tolist()
X_test = X_test.values.tolist()

y_train = pd.get_dummies(y_train).values.tolist()
y_test = pd.get_dummies(y_test).values.tolist()

Max sequence length set here at 256 - however this is probably not best applicable for whole news articles. Again, a limitation of memory at this point.

In [0]:
max_seq_length = 256
class text_dataset(Dataset):
    def __init__(self,x_y_list, transform=None):
        
        self.x_y_list = x_y_list
        self.transform = transform
        
    def __getitem__(self,index):
        
        tokenized_article = tokenizer.tokenize(self.x_y_list[0][index])
        
        if len(tokenized_article) > max_seq_length:
            tokenized_article = tokenized_article[:max_seq_length]
            
        ids_article  = tokenizer.convert_tokens_to_ids(tokenized_article)

        padding = [0] * (max_seq_length - len(ids_article))
        
        ids_article += padding
        
        assert len(ids_article) == max_seq_length
        
        ids_article = torch.tensor(ids_article)
        
        fake_label = self.x_y_list[1][index] # color        
        list_of_labels = [torch.from_numpy(np.array(fake_label))]
        
        
        return ids_article, list_of_labels[0]
    
    def __len__(self):
        return len(self.x_y_list[0])

In [0]:
batch_size = 24

train_lists = [X_train, y_train]
test_lists = [X_test, y_test]

training_dataset = text_dataset(x_y_list = train_lists )

test_dataset = text_dataset(x_y_list = test_lists )

dataloaders_dict = {'train': torch.utils.data.DataLoader(training_dataset, batch_size=batch_size, shuffle=True, num_workers=0),
                   'val':torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
                   }
dataset_sizes = {'train':len(train_lists[0]),
                'val':len(test_lists[0])}

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [0]:
print(dataset_sizes)

{'train': 1800, 'val': 200}


In [0]:
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()
    print('starting')
    best_model_wts = copy.deepcopy(model.state_dict())
    best_loss = 100

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch + 1, num_epochs))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                scheduler.step()
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            
            fake_corrects = 0
            
            # Iterate over data.
            for inputs, fake_label in dataloaders_dict[phase]:
                #inputs = inputs
                # print(len(inputs),type(inputs),inputs)
                #inputs = torch.from_numpy(np.array(inputs)).to(device) 
                inputs = inputs.to(device) 

                fake_label = fake_label.to(device)
                # print('data loaded')
                
                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    #print(inputs)
                    outputs = model(inputs)

                    outputs = F.softmax(outputs,dim=1)
                    loss = criterion(outputs, torch.max(fake_label.float(), 1)[1])
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)

                
                fake_corrects += torch.sum(torch.max(outputs, 1)[1] == torch.max(fake_label, 1)[1])

                
            epoch_loss = running_loss / dataset_sizes[phase]

            
            fake_acc = fake_corrects.double() / dataset_sizes[phase]

            print('{} total loss: {:.4f} '.format(phase,epoch_loss ))
            print('{} classification_accuracy: {:.4f}'.format(phase, fake_acc))

            if phase == 'val' and epoch_loss < best_loss:
                print('saving with loss of {}'.format(epoch_loss), 'improved over previous {}'.format(best_loss))
                best_loss = epoch_loss
                best_model_wts = copy.deepcopy(model.state_dict())
                torch.save(model.state_dict(), 'bert_model_test.pth')


    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val loss: {:4f}'.format(float(best_loss)))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

In [0]:
model.to(device)
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
   

In [0]:
lrlast = .001
lrmain = .00001
optim1 = optim.Adam(
    [
        {"params":model.bert.parameters(),"lr": lrmain}, #comment out to run without training bert layers at all to test
        {"params":model.classifier.parameters(), "lr": lrlast},
       
   ])

#optim1 = optim.Adam(model.parameters(), lr=0.001)#,momentum=.9)
# Observe that all parameters are being optimized
optimizer_ft = optim1
criterion = nn.CrossEntropyLoss()

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=3, gamma=0.1)

Note: 

For now I just want to test the BERT performance, therefore going to run it on a much smaller dataset to allow full run through in a reasonable time.

In [0]:
model_ft1 = train_model(model, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=3)

starting
Epoch 0/2
----------




train total loss: 0.5422 
train classification_accuracy: 0.7583
val total loss: 0.4327 
val classification_accuracy: 0.8700
saving with loss of 0.4327083146572113 improved over previous 100
Epoch 1/2
----------
train total loss: 0.3830 
train classification_accuracy: 0.9294
val total loss: 0.3747 
val classification_accuracy: 0.9350
saving with loss of 0.3747102463245392 improved over previous 0.4327083146572113
Epoch 2/2
----------
train total loss: 0.3479 
train classification_accuracy: 0.9639
val total loss: 0.3635 
val classification_accuracy: 0.9500
saving with loss of 0.36351649165153505 improved over previous 0.3747102463245392
Training complete in 4m 3s
Best val loss: 0.363516


Achieving 97% validation accuracy on only 1800 training examples is pretty amazing!

Note - no fine tuning of bert layers whatsoever (running with lrmain commented out) gives much poorer performance.

In [0]:
#save model progress
torch.save(model_ft1.state_dict(), 'drive/My Drive/fake_news_1/models/bert_1')

In [0]:
model_ft2 = train_model(model_ft1, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=1)

starting
Epoch 0/0
----------
train total loss: 0.3445 
train classification_accuracy: 0.9678
val total loss: 0.3615 
val classification_accuracy: 0.9500
saving with loss of 0.3614831244945526 improved over previous 100
Training complete in 1m 21s
Best val loss: 0.361483


Reducing learning rate to fine tune model further - may well be that we are nearing the limit of performance of only 1800 training examples.

In [0]:
lrlast = .0001
lrmain = .000001
optim1 = optim.Adam(
    [
        {"params":model.bert.parameters(),"lr": lrmain}, #comment out to run without training bert layers at all to test
        {"params":model.classifier.parameters(), "lr": lrlast},
       
   ])

#optim1 = optim.Adam(model.parameters(), lr=0.001)#,momentum=.9)
# Observe that all parameters are being optimized
optimizer_ft = optim1
criterion = nn.CrossEntropyLoss()

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=3, gamma=0.1)

In [0]:
model_ft3 = train_model(model_ft2, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=1)

starting
Epoch 0/0
----------




train total loss: 0.3401 
train classification_accuracy: 0.9750
val total loss: 0.3590 
val classification_accuracy: 0.9550
saving with loss of 0.3590094029903412 improved over previous 100
Training complete in 1m 21s
Best val loss: 0.359009


For further model training and fine tuning see my fastAI + Bert notebook in this repo