# COLX 585 Project 3.2
## Generated text detection: Project 2

## Introduction

Data: [TweepFake](https://www.kaggle.com/datasets/mtesconi/twitter-deep-fake-text)

...

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import required Python libraries

In [2]:
import tensorflow
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd as autograd
from tqdm import tqdm, trange
import pandas as pd
import numpy as np
import io
import os
import matplotlib.pyplot as plt
from keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report, confusion_matrix
import matplotlib

In [3]:
# Set the random seed and the working device
manual_seed = 77
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)
    # only query the GPU name when a GPU is actually available
    print(torch.cuda.get_device_name(0))

cuda
Tesla T4


Colab does not come with the `transformers` library preinstalled, so we install it first, together with `sentencepiece`.

In [4]:
! pip install transformers
! pip install sentencepiece

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    F

In [5]:
from transformers import BertModel, BertTokenizerFast, AdamW, get_linear_schedule_with_warmup

`Transformers` provides [10+ transformer-based deep learning architectures](https://huggingface.co/transformers/pretrained_models.html), with pretrained checkpoints for several languages (including English and French). You can instantiate these architectures and load their checkpoints through the `Transformers` APIs. For each architecture, the library provides several classes for tokenization, pre-training, and fine-tuning.
You can learn more [here](https://huggingface.co/transformers/index.html).

In [6]:
# # Transformers has a unified API
# # here we list models for 10 transformer architectures
# # for the full list of available pretrained-models: go to https://huggingface.co/transformers/pretrained_models.html
# #          Model          | Tokenizer          | Pretrained weights shortcut
# MODELS = [(BertModel,       BertTokenizerFast,   'bert-base-uncased'),
#           (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
#           (GPT2Model,       GPT2Tokenizer,       'gpt2'),
#           (CTRLModel,       CTRLTokenizer,       'ctrl'),
#           (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
#           (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
#           (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
#           (DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),
#           (RobertaModel,    RobertaTokenizer,    'roberta-base'),
#           (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
#          ]
         
# # Each architecture comes with several classes for fine-tuning on downstream tasks, e.g.
# BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
#                       BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering]

## Data preparation

The data is already split into training, validation, and test sets:

Split | # bot tweets | # human tweets | Total
----- | ------------ | -------------- | -----
Training set | 10354 | 10358 | 20712
Validation set | 1152 | 1150 | 2302
Test set | 1280 | 1278 | 2558

The bot tweets come from three generator types: GPT-2 (11 accounts, 3861 tweets), RNN (7 accounts, 4181 tweets), and Others (5 accounts, 4876 tweets).

First, we define a function to pre-process input data. 

In [7]:
# define a function for data preparation
def data_prepare(file_path, lab2ind, tokenizer, max_len = 32, mode = 'train'):
    '''
    Read a TweepFake CSV file and convert its tweets into BERT input tensors.

    file_path: path to the input CSV file. The first row must be a header, and the file
               is expected to contain the four TweepFake columns, which are renamed to
               account_name, content, account_type, and label.
               In train mode the label column is mapped to class indices via lab2ind;
               in predict mode the label column is ignored.
    lab2ind: dictionary mapping label names to class indices
    tokenizer: BERT tokenizer
    max_len: maximum number of word pieces kept per tweet (excluding [CLS] and [SEP])
    mode: 'train' or 'predict'
    '''
    column_names = ['account_name', 'content', 'account_type', 'label']
    # In train mode, we load the text together with its labels.
    if mode == 'train':
        # Use pandas to load the dataset
        df = pd.read_csv(file_path, header=0, names=column_names)
        print("Data size ", df.shape)
        labels = df.label.values

        # Map label names to class indices
        labels = [lab2ind[i] for i in labels]
        print("Label is ", labels[0])

        # Convert the labels into a torch tensor
        labels = torch.tensor(labels)

    # In predict mode, we only need the text.
    elif mode == 'predict':
        df = pd.read_csv(file_path, header=0, names=column_names)
        print("Data size ", df.shape)
        # create placeholder
        labels = []
    else:
        raise ValueError("mode should be either 'train' or 'predict'.")
        
    # Extract the tweet texts
    content = df.content.values

    #### REF START ####

    # We need to add a special token at the beginning for BERT to work properly.
    content = ["[CLS] " + text for text in content]

    # Use the BERT tokenizer to split each text into tokens from BERT's vocabulary.
    tokenized_texts = [tokenizer.tokenize(text) for text in content]
    
    # If a sequence is longer than the maximum length, truncate it to [CLS] plus max_len tokens.
    tokenized_texts = [text[:max_len+1] for text in tokenized_texts]

    # We also need to add a special token at the end.
    tokenized_texts = [ text+['[SEP]'] for text in tokenized_texts]
    print ("Tokenize the first sentence:\n",tokenized_texts[0])
    
    # Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
    input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
    print ("Index numbers of the first sentence:\n",input_ids[0])

    # Pad each input sequence to the fixed length (max_len + 2, counting [CLS] and [SEP]) with the index of the [PAD] token
    pad_ind = tokenizer.convert_tokens_to_ids(['[PAD]'])[0]
    input_ids = pad_sequences(input_ids, maxlen=max_len+2, dtype="long", truncating="post", padding="post", value=pad_ind)
    print ("Index numbers of the first sentence after padding:\n",input_ids[0])

    # Create attention masks
    attention_masks = []

    # Create a mask of 1s for each token followed by 0s for pad tokens
    for seq in input_ids:
        seq_mask = [float(i>0) for i in seq]
        attention_masks.append(seq_mask)

    # Convert all of our data into torch tensors, the required datatype for our model
    inputs = torch.tensor(input_ids)
    masks = torch.tensor(attention_masks)
    #### REF END ####

    return inputs, labels, masks

What the data should look like:
```
Tokenize the first sentence:
 ['[CLS]', 'it', 'was', 'my', 'birthday', ',', 'and', 'my', 'wife', 'and', 'daughter', 'surprised', 'me', 'with', 'some', 'surprise', 'guests', 'and', 'a', 'small', 'party', '.', '[SEP]']
Index numbers of the first sentence:
 [101, 2009, 2001, 2026, 5798, 1010, 1998, 2026, 2564, 1998, 2684, 4527, 2033, 2007, 2070, 4474, 6368, 1998, 1037, 2235, 2283, 1012, 102]
Index numbers of the first sentence after padding:
 [ 101 2009 2001 2026 5798 1010 1998 2026 2564 1998 2684 4527 2033 2007
 2070 4474 6368 1998 1037 2235 2283 1012  102    0    0    0    0    0
    0    0    0    0    0    0]
```
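
The length arithmetic in `data_prepare` can be checked on toy IDs without loading a tokenizer. The sketch below is a pure-Python stand-in, not the notebook's code: `pad_and_mask` is a hypothetical helper, and the IDs 101, 102, and 0 for `[CLS]`, `[SEP]`, and `[PAD]` are taken from the sample output above.

```python
def pad_and_mask(token_ids, max_len=32, cls_id=101, sep_id=102, pad_id=0):
    """Mimic data_prepare on one sequence of word-piece IDs (given without [CLS]/[SEP])."""
    # [CLS] plus at most max_len word pieces, as in text[:max_len + 1]
    ids = ([cls_id] + list(token_ids))[:max_len + 1]
    # [SEP] is appended after truncation, so it is never cut off
    ids = ids + [sep_id]
    # pad on the right up to the fixed length max_len + 2
    ids = ids + [pad_id] * (max_len + 2 - len(ids))
    # 1.0 for real tokens, 0.0 for padding (valid because pad_id is 0)
    mask = [float(i > 0) for i in ids]
    return ids, mask

short_ids, short_mask = pad_and_mask([2009, 2001, 2026])
long_ids, long_mask = pad_and_mask(range(1000, 1100))
print(len(short_ids), short_ids[:6], sum(short_mask))
print(len(long_ids), long_ids[-1], sum(long_mask))
```

Both sequences end up with 34 positions, which matches the `max_len + 2` passed to `pad_sequences`; the long one keeps its `[SEP]` because truncation happens before that token is appended.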

--- 


The `Transformers` library also provides several off-the-shelf functions, such as [`batch_encode_plus`](https://huggingface.co/transformers/master/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.batch_encode_plus) and [`encode_plus`](https://huggingface.co/transformers/master/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.encode_plus), to help you create BERT input tensors. The code between `REF START` and `REF END` can be replaced with `batch_encode_plus()`.

We use `BertTokenizerFast.from_pretrained()` to load the vocabulary of a pretrained model. The first argument should be either a string with the `shortcut name` of a pretrained model or a path to a directory containing the model's vocabulary file, `vocab.txt`. `Transformers` provides many pre-trained checkpoints with pre-defined shortcut names. If the argument is a valid model identifier listed [here](https://huggingface.co/models), the tokenizer downloads the vocabulary and loads it automatically. If it does not match any model identifier, the argument is treated as a path from which to load the vocabulary.

---

We use ["bert-base-uncased"](https://github.com/google-research/bert), which refers to the **12-layer, 768-hidden, 12-heads, 110M parameters** [variant of the BERT model](https://huggingface.co/bert-base-uncased). The vocabulary of "bert-base-uncased" was built with the WordPiece algorithm and contains 30,522 tokens.

In [8]:
model_path = "bert-base-uncased"
# define label to number dictionary
lab2ind = {'human': 0, 'gpt2': 1, 'rnn': 2, 'others': 3}

# tokenizer from pre-trained BERT model
tokenizer = BertTokenizerFast.from_pretrained(model_path,do_lower_case=True)


### Inspect the train data

In [9]:
data_folder_path = "/content/drive/MyDrive/Colab Notebooks/data/TweepFake/"
train_df = pd.read_csv(data_folder_path+"train.csv")
print('Info about the train dataset:')
train_df.info()
print('\nLabel counts for the train dataset:')
train_df['class_type'].value_counts()

Info about the train dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20712 entries, 0 to 20711
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   screen_name   20712 non-null  object
 1   text          20712 non-null  object
 2   account.type  20712 non-null  object
 3   class_type    20712 non-null  object
dtypes: object(4)
memory usage: 647.4+ KB

Label counts for the train dataset:


human     10358
others     3920
rnn        3325
gpt2       3109
Name: class_type, dtype: int64

### Extract train, validation, and test data

In [10]:
# Use the function defined above to extract the data
print('Train sample:')
train_inputs, train_labels, train_masks = data_prepare(data_folder_path+"train.csv", lab2ind,tokenizer)
print('\nValidation sample:')
validation_inputs, validation_labels, validation_masks = data_prepare(data_folder_path+"validation.csv", lab2ind,tokenizer)
print('\nTest sample:')
test_inputs, test_labels, test_masks = data_prepare(data_folder_path+"test.csv", lab2ind,tokenizer)

Train sample:
Data size  (20712, 4)
Label is  3


Token indices sequence length is longer than the specified maximum sequence length for this model (2194 > 512). Running this sequence through the model will result in indexing errors


Tokenize the first sentence:
 ['[CLS]', 'ye', '##a', 'now', 'that', 'note', 'good', '[SEP]']
Index numbers of the first sentence:
 [101, 6300, 2050, 2085, 2008, 3602, 2204, 102]
Index numbers of the first sentence after padding:
 [ 101 6300 2050 2085 2008 3602 2204  102    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0]

Validation sample:
Data size  (2302, 4)
Label is  0
Tokenize the first sentence:
 ['[CLS]', 'tight', ',', 'tight', ',', 'tight', ',', 'yeah', '!', '!', '!', 'https', ':', '/', '/', 't', '.', 'co', '/', 'w', '##j', '##3', '##n', '##v', '##ppa', '##sw', '[SEP]']
Index numbers of the first sentence:
 [101, 4389, 1010, 4389, 1010, 4389, 1010, 3398, 999, 999, 999, 16770, 1024, 1013, 1013, 1056, 1012, 2522, 1013, 1059, 3501, 2509, 2078, 2615, 13944, 26760, 102]
Index numbers of the first sentence after padding:
 [  101  4389  1010  4389  1010  4389  1010  3398   999   999   999 16770
  1024  10

In [11]:
train_inputs.shape

torch.Size([20712, 34])

### Create DataLoader

Create an iterator over our data with the `torch` `DataLoader`. It serves the data in batches, so only one batch at a time has to be moved to the GPU during training.

For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32. We use a smaller batch size of 8 here.



In [12]:
batch_size = 8

In [13]:
# We'll take training samples in random order in each epoch. 
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, 
                              sampler = RandomSampler(train_data), # Select batches randomly
                              batch_size=batch_size)

# We'll just read validation set sequentially.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_dataloader = DataLoader(validation_data, 
                                   sampler = SequentialSampler(validation_data), # Pull out batches sequentially.
                                   batch_size=batch_size)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_dataloader = DataLoader(test_data, 
                                   sampler = SequentialSampler(test_data), # Pull out batches sequentially.
                                   batch_size=batch_size)


## Loading pre-trained model example

In [14]:
bert_model = BertModel.from_pretrained(model_path, output_hidden_states=True, output_attentions=True).to(device)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's use the first batch as an example to explore BERT model.

In [15]:
dataiter = iter(train_dataloader)
batch = next(dataiter)  # iterator objects no longer expose .next() in recent PyTorch versions
# Add batch to GPU
batch = tuple(t.to(device) for t in batch)
# Unpack the inputs from our dataloader
input_ids, input_mask, labels = batch

We set `output_hidden_states=True, output_attentions=True`, so the output of `bert_model` contains four fields: `last_hidden_state`, `pooler_output`, `hidden_states`, and `attentions`.

In [16]:
outputs = bert_model(input_ids, attention_mask = input_mask)

In [17]:
print(outputs.keys())  # outputs is a dictionary-like object

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'attentions'])


In [18]:
last_hidden_state = outputs["last_hidden_state"]
pooler_output = outputs["pooler_output"]
hidden_states = outputs["hidden_states"]
attentions = outputs["attentions"]

1. `last_hidden_state`: sequence of hidden-states at the output of the last layer.

In [19]:
last_hidden_state.shape # [batch size, seq length, hidden size]

torch.Size([8, 34, 768])

2. `pooler_output`: last-layer hidden state of the first token of the sequence (the classification token, `[CLS]`), further processed by a linear layer and a tanh activation.

In [20]:
pooler_output.shape # [batch size, hidden size]

torch.Size([8, 768])

3. `hidden_states`: list of `Tensor`s of shape `[batch_size, sequence_length, hidden_size]`, one for the output of the embedding layer plus one for the output of each layer.

In [21]:
print(len(hidden_states))

13


Let's take a look at each item in this list.

In [22]:
for i, item in enumerate(hidden_states):
  print("layer " + str(i), item.shape) # [batch size, sequence length, hidden size]

layer 0 torch.Size([8, 34, 768])
layer 1 torch.Size([8, 34, 768])
layer 2 torch.Size([8, 34, 768])
layer 3 torch.Size([8, 34, 768])
layer 4 torch.Size([8, 34, 768])
layer 5 torch.Size([8, 34, 768])
layer 6 torch.Size([8, 34, 768])
layer 7 torch.Size([8, 34, 768])
layer 8 torch.Size([8, 34, 768])
layer 9 torch.Size([8, 34, 768])
layer 10 torch.Size([8, 34, 768])
layer 11 torch.Size([8, 34, 768])
layer 12 torch.Size([8, 34, 768])


4. `attentions`: list of `Tensor`s (one for each layer) of shape `[batch_size, num_heads, sequence_length, sequence_length]`: attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

In [23]:
print(len(attentions))

12


In [24]:
for i, item in enumerate(attentions):
  print("layer " + str(i), item.shape) # [batch size, num_heads, sequence length, sequence_length]

layer 0 torch.Size([8, 12, 34, 34])
layer 1 torch.Size([8, 12, 34, 34])
layer 2 torch.Size([8, 12, 34, 34])
layer 3 torch.Size([8, 12, 34, 34])
layer 4 torch.Size([8, 12, 34, 34])
layer 5 torch.Size([8, 12, 34, 34])
layer 6 torch.Size([8, 12, 34, 34])
layer 7 torch.Size([8, 12, 34, 34])
layer 8 torch.Size([8, 12, 34, 34])
layer 9 torch.Size([8, 12, 34, 34])
layer 10 torch.Size([8, 12, 34, 34])
layer 11 torch.Size([8, 12, 34, 34])
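
The role of these weights can be illustrated with a toy single-head example. The sketch below is a simplified scaled dot-product attention in plain Python; it assumes that queries, keys, and values are the raw input vectors, whereas BERT first applies learned linear projections, and `softmax` and `attention` are hypothetical helpers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Return the attention weights and the weighted averages of the values."""
    d = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        # one score per key, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # weighted average of the value vectors
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return weights, outputs

# three "tokens" with hidden size 2
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Each row of `weights` sums to 1, which is the property of the `[sequence_length, sequence_length]` slices in `attentions`: one distribution over all positions for every query position.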


## Creating `Bert_cls` class

Now, we put everything together. We build a `Bert_cls` class that feeds BERT's last hidden state through convolution and max-pooling layers to train a classifier end-to-end.

In [25]:
class Bert_cls(nn.Module):

    def __init__(self, lab2ind, model_path, hidden_size, dropout, kernel_num, region_sizes):
        super(Bert_cls, self).__init__()
        self.model_path = model_path
        self.hidden_size = hidden_size
        # self.bert_hidden_size = bert_hidden_size
        # self.rnn_hidden_size = rnn_hidden_size
        # self.rnn_layer_num = rnn_layer_num
        # self.bidirectional = bidirectional

        self.bert_model = BertModel.from_pretrained(model_path, output_hidden_states=True, output_attentions=True)
        
        self.label_num = len(lab2ind)
        
        self.convolution_layers = nn.ModuleList([nn.Conv2d(in_channels = 1, out_channels = kernel_num, kernel_size = (K, hidden_size)) for K in region_sizes])
        
#         self.dense = nn.Linear(self.hidden_size, self.hidden_size)
        self.dropout = nn.Dropout(dropout)
        # use the constructor argument, not the global kernel_sizes, to size the output layer
        self.fc = nn.Linear(len(region_sizes) * kernel_num, self.label_num)

    def forward(self, bert_ids, bert_mask):
        outputs = self.bert_model(input_ids=bert_ids, attention_mask = bert_mask)
        # pooler_output = outputs['pooler_output']
        last_hidden_state = outputs['last_hidden_state']
        bert_attentions = outputs['attentions']

#         x = self.dense(pooler_output)
        last_hidden_state = last_hidden_state.unsqueeze(1)
        # print(last_hidden_state.shape)
        convolute_outputs = [F.relu(conv(last_hidden_state)).squeeze(3) for conv in self.convolution_layers]
        max_pooling_outputs = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in convolute_outputs]
        concat_list = torch.cat(max_pooling_outputs, 1)

        x = self.dropout(concat_list)
        fc_output = self.fc(x)

        return fc_output, bert_attentions 
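
The tensor shapes inside `forward` follow from the convolution arithmetic. The following sketch only computes shapes in plain Python; it assumes the values used elsewhere in this notebook (sequence length 34, region sizes 2, 3, and 4, 32 kernels per region size, 4 labels), and `conv_output_len` is a hypothetical helper.

```python
def conv_output_len(seq_len, region_size):
    """Length left after a convolution with stride 1 and no padding."""
    return seq_len - region_size + 1

batch_size, seq_len, hidden_size = 8, 34, 768
kernel_num, region_sizes, label_num = 32, [2, 3, 4], 4

# after unsqueeze(1) the input is [batch, 1, seq_len, hidden];
# each kernel spans the full hidden dimension, so that axis collapses to 1 and is squeezed
conv_shapes = [(batch_size, kernel_num, conv_output_len(seq_len, k)) for k in region_sizes]
# max-pooling over the remaining sequence axis leaves one value per kernel
pooled_shapes = [(b, c) for (b, c, _) in conv_shapes]
# concatenating along dim 1 stacks the kernels of all region sizes
concat_shape = (batch_size, kernel_num * len(region_sizes))
logits_shape = (batch_size, label_num)

print(conv_shapes)
print(pooled_shapes)
print(concat_shape, logits_shape)
```

The concatenated width of 96 is why the final linear layer takes `len(region_sizes) * kernel_num` input features.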


### Hyperparameters

In [26]:
HIDDEN_SIZE = 768
# HIDDEN_SIZE = 1024

# BERT_HIDDEN_SIZE = 768
# LSTM_HIDDEN_SIZE = 512
# RNN_LAYER_NUM = 1
# BIDIRECTIONAL = True
DROPOUT = 0.1

# region sizes (kernel heights) of 2, 3, and 4 tokens
kernel_sizes =  [2,3,4]

# number of kernels for each region size
kernels_num = 32

# rnn_layer_num, bidirectional, dropout

Instantiate model.

In [27]:
bert_model = Bert_cls(lab2ind, 
                      model_path, 
                      HIDDEN_SIZE,
                      # BERT_HIDDEN_SIZE, 
                      # LSTM_HIDDEN_SIZE, 
                      # RNN_LAYER_NUM,
                      # BIDIRECTIONAL,
                      DROPOUT,
                      kernels_num,
                      kernel_sizes).to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


---
Not used

The `Transformers` library also provides a class, [`BertForSequenceClassification`](https://huggingface.co/transformers/master/model_doc/bert.html#bertforsequenceclassification), that builds a classifier automatically. Namely, `bert_model` could instead be instantiated with

```bert_model = BertForSequenceClassification.from_pretrained('bert-large-uncased', num_labels=len(lab2ind)).to(device)```

---

Count the number of parameters. 

In [28]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(bert_model):,} trainable parameters')

The model has 109,703,908 trainable parameters
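
The reported total can be reproduced by hand. The sketch below assumes that `BertModel` for `bert-base-uncased` (including its pooler) has 109,482,240 parameters, a figure not stated in this notebook, and adds the convolution and output layers defined in `Bert_cls`.

```python
hidden_size, kernel_num, label_num = 768, 32, 4
region_sizes = [2, 3, 4]

# assumption: parameter count of BertModel for bert-base-uncased
bert_params = 109_482_240

# each Conv2d has kernel_num kernels of size (K, hidden_size) over one input channel,
# plus one bias per kernel
conv_params = sum(kernel_num * k * hidden_size + kernel_num for k in region_sizes)

# final linear layer: weights plus biases
fc_params = len(region_sizes) * kernel_num * label_num + label_num

total = bert_params + conv_params + fc_params
print(f"{bert_params:,} + {conv_params:,} + {fc_params:,} = {total:,}")
```

Under that assumption the sum matches the count printed above, and it shows that the CNN head adds only about 0.2% to the size of the encoder.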


## Optimizer and Learning Rate Scheduler

For the purposes of fine-tuning, the authors recommend the following hyperparameter ranges (from Appendix A.3 of the [paper](https://arxiv.org/pdf/1810.04805.pdf)):

* Batch size: 16, 32
* Learning rate (Adam): 5e-5, 3e-5, 2e-5
* Number of epochs: 2, 3, 4

Hyperparameters for the BERT large model:

* Batch size: 32
* Learning rate (Adam): 2e-5
* Number of epochs: 3

Hyperparameters for the BERT base model:

* Batch size: 8
* Learning rate (Adam): 3e-5
* Number of epochs: 3

In [29]:
# Parameters:
lr = 3e-5
max_grad_norm = 1.0
epochs = 3
warmup_proportion = 0.1
num_training_steps  = len(train_dataloader) * epochs
num_warmup_steps = int(num_training_steps * warmup_proportion)  # step counts should be integers

### In Transformers, optimizer and schedules are instantiated like this:
# Note: AdamW is a class from the huggingface library
# the 'W' stands for 'Weight Decay'
optimizer = AdamW(bert_model.parameters(), lr=lr, correct_bias=False)
# schedules
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler

# We use nn.CrossEntropyLoss() as our loss function. 
criterion = nn.CrossEntropyLoss()
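
To see what the scheduler does with these settings, the multiplier applied to `lr` can be written out in plain Python. This is a sketch of a linear warmup followed by linear decay, which is the shape `get_linear_schedule_with_warmup` produces; `lr_factor` is a hypothetical helper, and the step counts assume 20,712 training examples and a batch size of 8.

```python
import math

def lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = math.ceil(20712 / 8)      # batches per epoch
num_training_steps = steps_per_epoch * 3    # 3 epochs
num_warmup_steps = int(num_training_steps * 0.1)

base_lr = 3e-5
for step in (0, num_warmup_steps // 2, num_warmup_steps, num_training_steps):
    print(step, base_lr * lr_factor(step, num_warmup_steps, num_training_steps))
```

The learning rate therefore starts at zero, reaches `3e-5` once the warmup steps are done, and returns to zero at the last training step.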



## Model training

We define a `train()` function. 

In [30]:
def train(model, iterator, optimizer, scheduler, criterion):
    
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        input_ids, input_mask, labels = batch

        outputs,_ = model(input_ids, input_mask)

        loss = criterion(outputs, labels)
        # delete used variables to free GPU memory
        del batch, input_ids, input_mask, labels
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore
        optimizer.step()
        scheduler.step()
        epoch_loss += loss.cpu().item()
        optimizer.zero_grad()
    
    # free GPU memory
    if device.type == 'cuda':  # device is a torch.device, so compare its type
        torch.cuda.empty_cache()

    return epoch_loss / len(iterator)
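
Gradient clipping by norm rescales all gradients together when their combined L2 norm exceeds `max_grad_norm`. The sketch below shows the idea on plain Python lists; `clip_by_global_norm` is a hypothetical helper and omits details of `torch.nn.utils.clip_grad_norm_`, such as the small epsilon added to the denominator.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient vectors so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # never scale up: the factor is capped at 1
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]    # combined norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after, clipped)

# gradients that are already small enough are left unchanged
small, _ = clip_by_global_norm([[0.1, 0.2]], max_norm=1.0)
print(small)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is limited.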

We define an `evaluate()` function.

In [31]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    all_pred=[]
    all_label = []
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            # Add batch to GPU
            batch = tuple(t.to(device) for t in batch)
            # Unpack the inputs from our dataloader
            input_ids, input_mask, labels = batch

            outputs,_ = model(input_ids, input_mask)
            
            loss = criterion(outputs, labels)

            # delete used variables to free GPU memory
            del batch, input_ids, input_mask
            epoch_loss += loss.cpu().item()

            # identify the predicted class (highest logit) for each example in the batch
            _, predicted = torch.max(outputs.cpu().data, 1)
            # collect the true labels and predictions as plain Python ints
            all_pred.extend(predicted.tolist())
            all_label.extend(labels.cpu().tolist())
    
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    return epoch_loss / len(iterator), accuracy, f1score
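
Macro-averaged F1 gives each of the four classes equal weight regardless of its size. A minimal pure-Python sketch of the computation follows; `macro_f1` is a hypothetical helper written for illustration and the labels are made up, while the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); set to 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# made-up labels: class 3 is never predicted, so its F1 is 0
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2]
print(macro_f1(y_true, y_pred, labels=[0, 1, 2, 3]))
```

In this toy case a single missed class pulls the macro score well below the accuracy of 0.75, which is why the two metrics are reported side by side.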

### Training, evaluating and testing model

In [32]:
# create checkpoint directory (no error if it already exists)
save_path = '/content/drive/MyDrive/Colab Notebooks/ckpt/TweepFake/'
os.makedirs(save_path, exist_ok=True)

In [33]:
# Train the model
loss_list = []
acc_list = []

for epoch in trange(epochs, desc="Epoch"):
    train_loss = train(bert_model, train_dataloader, optimizer, scheduler, criterion)  
    val_loss, val_acc, val_f1 = evaluate(bert_model, validation_dataloader, criterion)
    test_loss, test_acc, test_f1 = evaluate(bert_model, test_dataloader, criterion)

    # Create checkpoint at end of each epoch
    state = {
        'epoch': epoch,
        'state_dict': bert_model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict()
        }

    torch.save(state, os.path.join(save_path, "BERT_base_cnn_" + str(epoch+1) + ".pt"))

    print('\n Epoch [{}/{}], Train Loss: {:.4f}, Validation Loss: {:.4f}, Validation Accuracy: {:.4f}, Validation F1: {:.4f}'.format(epoch+1, epochs, train_loss, val_loss, val_acc, val_f1))
    print('Test Accuracy: {:.4f}, Test F1: {:.4f}'.format(test_acc, test_f1))
  

Epoch:  33%|███▎      | 1/3 [04:34<09:08, 274.23s/it]


 Epoch [1/3], Train Loss: 0.4956, Validation Loss: 0.3752, Validation Accuracy: 0.8562, Validation F1: 0.8477
Test Accuracy: 0.8604, Test F1: 0.8487


Epoch:  67%|██████▋   | 2/3 [09:15<04:38, 278.11s/it]


 Epoch [2/3], Train Loss: 0.2491, Validation Loss: 0.4197, Validation Accuracy: 0.8736, Validation F1: 0.8686
Test Accuracy: 0.8706, Test F1: 0.8599


Epoch: 100%|██████████| 3/3 [13:57<00:00, 279.19s/it]


 Epoch [3/3], Train Loss: 0.0987, Validation Loss: 0.6785, Validation Accuracy: 0.8640, Validation F1: 0.8621
Test Accuracy: 0.8776, Test F1: 0.8716
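
Since a checkpoint is saved after every epoch, one way to choose among them is by validation score. The sketch below, offered as one possible selection rule rather than the procedure used here, picks the epoch with the highest validation F1 from the numbers logged above.

```python
# (epoch, validation loss, validation F1) copied from the training log above
history = [
    (1, 0.3752, 0.8477),
    (2, 0.4197, 0.8686),
    (3, 0.6785, 0.8621),
]

best_epoch, best_loss, best_f1 = max(history, key=lambda row: row[2])
# file name follows the pattern used when saving: BERT_base_cnn_<epoch>.pt
best_checkpoint = f"BERT_base_cnn_{best_epoch}.pt"
print(best_epoch, best_f1, best_checkpoint)
```

Selecting on validation rather than test results keeps the test set as an unbiased final check.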



