# Using FakeBERT Architecture for Sequence Classification

[Original FakeBERT Paper](https://link.springer.com/content/pdf/10.1007/s11042-020-10183-2.pdf)

FakeBERT Architecture:

![FakeBERT](fakebert.PNG "FakeBERT Architecture")

Summary:




In [1]:
# Mount Google Drive

from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [2]:
# This is the location of the root directory (on github it would be cs7643-project)
# Yours might look different - Megan
root = '/content/drive/MyDrive/CS7643/cs7643-project'
#root = '/content/drive/MyDrive/sankarc/cs7643-project'

In [3]:
# Detect PY file updates and reload
%load_ext autoreload
%autoreload 2

In [4]:
#%cd drive/MyDrive/Colab\ Notebooks/dl_project
import os
os.chdir(root)
%pwd

'/content/drive/MyDrive/CS7643/cs7643-project'

In [5]:
%ls

[0m[01;34mdata[0m/  [01;34mnotebooks[0m/  [01;34mproposal[0m/  README.md  [01;34msource[0m/  Tutorials  [01;34mwandb[0m/


### Installations

In [None]:
!pip install transformers
!pip install datasets
!pip install adjustText
!pip install -r 'notebooks/requirement.txt'

In [7]:
import logging
import time
from platform import python_version

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from torch.autograd import Variable
from transformers import BertTokenizer, BertModel
from torch.optim import Adam, Adadelta
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

In [19]:
!pip install wandb -qqq



In [20]:
import wandb
wandb.login()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [8]:
print("python version==%s" % python_version())
print("pandas==%s" % pd.__version__)
print("numpy==%s" % np.__version__)
print("torch==%s" % torch.__version__)
print("sklearn==%s" % sklearn.__version__)
print("transformers==%s" % transformers.__version__)
print("matplotlib==%s" % matplotlib.__version__)

python version==3.7.13
pandas==1.3.5
numpy==1.21.6
torch==1.11.0+cu113
sklearn==1.0.2
transformers==4.18.0
matplotlib==3.2.2


In [21]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### Parameters (Pre-Model/Experiment)


In [15]:
# inputs
pretrained_fp = 'bert-base-uncased'
MAX_LENGTH = 100

### Get the dataset and BERT tokens from the preprocessed files in `data/`


In [None]:
# Load the preprocessed data from the data folder, then tokenize every available split.
from source.dataloader.get_data import bert_data

# Use a separate name for the instance so the bert_data class is not shadowed.
data = bert_data(
    data_folder='data',
    train_file='Train_Clean.csv',
    test_file='liar_test_id.csv',
    val_file=None,
    tokenizer_type=pretrained_fp,
    data_tokens=None)

# FYI - col_tokenize is the name of the column containing the sequence text (in case it changes for some reason)
data_tokenized = data.tokenize(
    col_tokenize='text',
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding='max_length')

In [None]:
data_tokenized
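As a rough illustration of what `truncation = True` together with `padding = 'max_length'` does to each sequence, here is a minimal sketch in plain Python. The helper `pad_or_truncate` is hypothetical and is not part of `bert_data` or of the `transformers` library; the ids 0, 101 and 102 are the `[PAD]`, `[CLS]` and `[SEP]` ids in the `bert-base-uncased` vocabulary, and the other token ids are arbitrary.

```python
# Special-token ids in the bert-base-uncased vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def pad_or_truncate(token_ids, max_length):
    """Hypothetical helper: mimic truncation=True with padding='max_length'."""
    # Reserve two positions for [CLS] and [SEP], then cut whatever does not fit.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# A short sequence is padded up to max_length.
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=8)
# A long sequence is cut down to max_length.
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)

print(short_ids)
print(short_mask)
print(long_ids)
```

Either way, every sequence ends up with exactly `max_length` ids, which is what lets a whole batch be stacked into one tensor.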

### Get embeddings using BERT

Each input sequence becomes a 2D tensor:
* Each row is a token or subtoken in the sequence
* Each column is a value in the embedding (vector) for that token
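The shapes described above can be checked with a small stand-in, sketched here under stated assumptions: a randomly initialised `nn.Embedding` plays the role of BERT's token embedding table so that nothing needs to be downloaded, and the vocabulary size of 1000 is arbitrary. Only the 768-wide embedding and the sequence length of 128 mirror values used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in sizes (assumptions); the embedding width matches emb_size = 768.
VOCAB_SIZE, MAX_LENGTH, EMB_SIZE = 1000, 128, 768

# Randomly initialised stand-in for BERT's token embedding table.
embedding = nn.Embedding(VOCAB_SIZE, EMB_SIZE)

# A batch of 2 sequences of token ids, each already padded to MAX_LENGTH.
input_ids = torch.randint(0, VOCAB_SIZE, (2, MAX_LENGTH))

embedded = embedding(input_ids)   # (batch, tokens, embedding)
one_sequence = embedded[0]        # (tokens, embedding): the 2D tensor for one sequence

print(embedded.shape)
print(one_sequence.shape)
```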

In [22]:
bert_model = BertModel.from_pretrained('bert-base-uncased').to(device)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Parameters (Model)

### Run training data through BERT and CNN
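To make the BERT-then-CNN idea concrete, here is a simplified sketch of a CNN head with parallel 1D convolutions over token embeddings. It is an assumption-laden illustration: `TinyTextCNN` is a hypothetical class, not the project's `FakeBERTCNN`, and it omits layers that the architecture in the figure at the top may include. The default filter sizes, filter counts and class count are taken from the values used in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyTextCNN(nn.Module):
    """Hypothetical CNN head over token embeddings (not the project's FakeBERTCNN)."""

    def __init__(self, emb_size=768, filter_sizes=(3, 4, 5),
                 num_filters=(128, 128, 128), num_classes=2, dropout=0.2):
        super().__init__()
        # One Conv1d per filter size, all applied in parallel to the same input.
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_size, n, kernel_size=k)
            for k, n in zip(filter_sizes, num_filters)
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(sum(num_filters), num_classes)

    def forward(self, embeddings):
        # embeddings: (batch, tokens, emb); Conv1d expects (batch, emb, tokens).
        x = embeddings.permute(0, 2, 1)
        # Convolve, apply ReLU, then max-pool over the token dimension.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(features))


model = TinyTextCNN().eval()
fake_embeddings = torch.randn(4, 128, 768)  # random stand-in for BERT embeddings
with torch.no_grad():
    logits = model(fake_embeddings)
print(logits.shape)
```

The output has one row per sequence and one column per class, which is the shape `nn.CrossEntropyLoss` expects for its logits.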



In [23]:
import random
import time

# Specify loss function
#loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
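A quick way to check that seeding behaves as intended is to draw from each random number generator twice, resetting the seed in between. The sketch below repeats `set_seed` so that it runs on its own; the `draw` helper is hypothetical and exists only for this check.

```python
import random

import numpy as np
import torch


def set_seed(seed_value=42):
    """Set seed for reproducibility (same body as the cell above)."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)  # silently ignored when CUDA is unavailable


def draw():
    """Hypothetical helper: take one sample from each seeded generator."""
    return random.random(), float(np.random.rand()), torch.rand(1).item()


set_seed(42)
first = draw()
set_seed(42)
second = draw()

print(first == second)
```

Note that this only covers the generators seeded here; some GPU kernels can still be non-deterministic even with fixed seeds.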

## Example Experiment

In [32]:
import pprint
myrun=wandb.init(project="Fakebert", entity="cs7643-teamscam")

[34m[1mwandb[0m: Currently logged in as: [33mcs7643-teamscam[0m (use `wandb login --relogin` to force relogin)


In [49]:
from datasets import load_dataset, Value
test = load_dataset('data', data_files = {'test':'liar_test_id.csv'})
test = test.cast_column("label", Value("int8"))
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Using custom data configuration data-db6351b9e4873cf6
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/data-db6351b9e4873cf6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/data-db6351b9e4873cf6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-e9f0b24ef7b1fc24.arrow


In [31]:
from source.dataloader.get_data import bert_data

BATCH_SIZE = 128
MAX_LENGTH = 128
TEST_SUBSET = False

# Define the dataset 
data = bert_data(
    data_folder='data',
    train_file='Train_Clean.csv',
    test_file='Test_Clean.csv',
    val_file='Validate_Clean.csv',
    tokenizer_type='bert-base-uncased',
    data_tokens=None)
# FYI - col_tokenize is the name of the column containing the sequence text (in case it changes for some reason)
data_tokenized = data.tokenize(
    col_tokenize='text',
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding='max_length')

# Get train and validation data; the third loader (test) is replaced below
train_dataloader, val_dataloader, _ = data.get_data_loader(batch_size=BATCH_SIZE, test_subset=TEST_SUBSET)

# Create DataLoader for the alternate (liar) test data
test_dataloader = bert_data.get_test_data_loader(
    data_folder='data',
    data_test='liar_test_id.csv',
    max_length=MAX_LENGTH,
    batch_size=BATCH_SIZE)

Using custom data configuration data-640c1151ef84ed15
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/data-640c1151ef84ed15/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/26938 [00:00<?, ?ex/s]

  0%|          | 0/8981 [00:00<?, ?ex/s]

  0%|          | 0/8979 [00:00<?, ?ex/s]

Using custom data configuration data-db6351b9e4873cf6
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/data-db6351b9e4873cf6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1590 [00:00<?, ?ex/s]

In [None]:
import pprint

import wandb
from source import train, evaluate
from source.models.FakeBERT import initialize_model, FakeBERTCNN
from source.dataloader.get_data import bert_data
from source.plot import experiment_plot

sweep_config = {
    'method': 'random', #grid, random
    'metric': {
      'name': 'val_accuracy',
      'goal': 'maximize'   
    },
    'parameters': {
        'learning_rate': {
            'values': [1e-1, 1e-2, 1e-3]
        },
        'batch_size': {
            #'values': [128] # paper value
            'values':[64, 128, 256]
        },
        'epochs':{
            'values':[5, 10, 15]
        },
        'max_length':{
            'values':[128] # paper value
        },
        'dropout':{
            'values':[0.0,0.2,0.5]
        },
        'opt':{ # 'adam' -> Adam, 'ada' -> Adadelta (see run() below)
            #'values':['ada'] # paper value
            'values':['adam','ada']
        },
        'rho':{ # Adadelta optimizer parameter (ignored when opt is 'adam')
            'values':[0.9,0.95,0.99]
        },
        'emb_size':{ # should not change
            'value':768
        },
        'num_filters':{
            'value':[128, 128, 128] # [128, 128, 128] in paper
        },
        'filter_sizes':{
            'value':[3, 4, 5] # in paper
        },
        'num_classes':{ # should not change (with standard dataset)
            'value':2
        },
        'test_subset':{
            'value': False
        },
        'alt_test':{
            'value': True
        }
    }
}

def run():
  with wandb.init(project="FakeBert-Megan", entity="cs7643-teamscam") as run:
    config = wandb.config
    # Parameters
    MAX_LENGTH = config['max_length']
    EPOCHS = config['epochs']
    BATCH_SIZE = config['batch_size'] # 128 in paper
    NUM_TOKENS, EMBEDDING_SIZE = (MAX_LENGTH, config['emb_size']) # The num_tokens comes from max_length in tokens
    NUM_FILTERS = config['num_filters'] # [128, 128, 128] in paper
    FILTER_SIZES = config['filter_sizes'] #[3,4,5] in paper
    NUM_CLASSES = config['num_classes']
    DROPOUT = config['dropout']
    LR = config['learning_rate']
    RHO = config['rho']
    ALT_TEST = config['alt_test']
    TEST_SUBSET = config['test_subset'] # Whether or not to get a reduced size dataset (700,200,100) in train, validation, test
    EMB_LAYER = 0 # For this model, we are grabbing the embeddings not the bert output

    # Define the dataset 
    data = bert_data(\
                                  data_folder = 'data',\
                                  train_file = 'Train_Clean.csv',\
                                  test_file = 'Test_Clean.csv',\
                                  val_file = 'Validate_Clean.csv',\
                                  tokenizer_type = 'bert-base-uncased',\
                                  data_tokens = None)
    # FYI - col_tokenize is name of column containing the sequence text (in case it changes for some reason)
    data_tokenized = data.tokenize(\
                                            col_tokenize = 'text',\
                                            add_special_tokens = True,\
                                            max_length = MAX_LENGTH,\
                                            truncation = True,\
                                            padding = 'max_length')

    # Get Train, Val data
    train_dataloader, val_dataloader, test_dataloader = data.get_data_loader(batch_size = BATCH_SIZE, test_subset = TEST_SUBSET)
        

    # Create DataLoader for alternate test data
    if ALT_TEST:
      print('Using liar dataset for test...')
      test_dataloader =  bert_data.get_test_data_loader(\
                                            data_folder = 'data',\
                                            data_test = 'liar_test_id.csv',\
                                            max_length = MAX_LENGTH,\
                                            batch_size = BATCH_SIZE)

    # Specify loss function
    loss_fn = nn.CrossEntropyLoss()

    set_seed(42)
    cnn_bert = initialize_model(device, \
                                max_len=MAX_LENGTH, \
                                filter_sizes=FILTER_SIZES, \
                                num_filters=NUM_FILTERS, \
                                num_classes=NUM_CLASSES,
                                learning_rate=LR, dropout=DROPOUT)

    # Specify the optimizer
    if (config['opt'] == 'adam'): 
      opt = Adam(cnn_bert.parameters(), lr=LR)
    else:
      opt = Adadelta(cnn_bert.parameters(), lr=LR, rho=RHO)


    results = train.train(emb_model = bert_model, \
                 model = cnn_bert,\
                 loss_fn = loss_fn,\
                 optimizer = opt,\
                 train_dataloader = train_dataloader,\
                 val_dataloader = val_dataloader,\
                 epochs=EPOCHS,\
                 bert_layer = EMB_LAYER)
     
    plot_runner = experiment_plot(results = results,\
                                  save_name = "LINEAR" + '-' \
                                            + 'SEQ' + str(MAX_LENGTH)\
                                            + '-LR' + str(LR)\
                                            + '-DO' + str(DROPOUT)\
                                            + '-BS' + str(BATCH_SIZE),
                                  save_path = "experiments/figures/",\
                                  append_time = True
                                  )
    plot_runner.plot(title_prepend = "fakeBERT",\
                    include_points = False)

    # Final evaluation on the test set, so name the results accordingly
    test_loss, test_accuracy = evaluate.evaluate(bert_model, cnn_bert, loss_fn, test_dataloader, EMB_LAYER, epoch_i = 'Test')
    print('Final Loss and Accuracy on Test: (%s,%s)' % (test_loss, test_accuracy))

count = 1   # number of runs to execute
sweep_id = wandb.sweep(sweep_config, project="FakeBert-Megan", entity="cs7643-teamscam")
wandb.agent(sweep_id, function=run, count=count)

Create sweep with ID: vvt4r1go
Sweep URL: https://wandb.ai/cs7643-teamscam/FakeBert-Megan/sweeps/vvt4r1go


[34m[1mwandb[0m: Agent Starting Run: nz9zqizy with config:
[34m[1mwandb[0m: 	alt_test: True
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	dropout: 0.2
[34m[1mwandb[0m: 	emb_size: 768
[34m[1mwandb[0m: 	epochs: 15
[34m[1mwandb[0m: 	filter_sizes: [3, 4, 5]
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	max_length: 128
[34m[1mwandb[0m: 	num_classes: 2
[34m[1mwandb[0m: 	num_filters: [128, 128, 128]
[34m[1mwandb[0m: 	opt: adam
[34m[1mwandb[0m: 	rho: 0.9
[34m[1mwandb[0m: 	test_subset: False


Using custom data configuration data-640c1151ef84ed15
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/data-640c1151ef84ed15/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/data-640c1151ef84ed15/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-a7315fecdc7e3baa.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/data-640c1151ef84ed15/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-598d55816dcc061c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/data-640c1151ef84ed15/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-20b353064897dfe4.arrow
Using custom data configuration data-db6351b9e4873cf6
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/data-db6351b9e4873cf6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


Using liar dataset for test...


  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/data-db6351b9e4873cf6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-e9f0b24ef7b1fc24.arrow


  0%|          | 0/1590 [00:00<?, ?ex/s]

Start training...

 Epoch  |  Train Loss  |  Train Acc  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------




  0%|          | 0/211 [00:00<?, ?it/s]
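The run configuration printed at the start of this output is one draw from `sweep_config`. As a rough mental model of a `'random'` sweep, the sketch below picks one entry from every `values` list and passes every fixed `value` through unchanged. The helper `sample_config` is hypothetical and only illustrates the idea; it is not how W&B implements sweeps, and the parameter dictionary is an abbreviated copy of the one defined above.

```python
import random

# Abbreviated copy of the 'parameters' section of sweep_config.
sweep_parameters = {
    'learning_rate': {'values': [1e-1, 1e-2, 1e-3]},
    'batch_size': {'values': [64, 128, 256]},
    'dropout': {'values': [0.0, 0.2, 0.5]},
    'opt': {'values': ['adam', 'ada']},
    'max_length': {'values': [128]},
    'emb_size': {'value': 768},
}


def sample_config(parameters, rng):
    """Hypothetical helper: choose one setting per parameter, uniformly at random."""
    config = {}
    for name, spec in parameters.items():
        if 'value' in spec:
            # A single fixed 'value' is never varied.
            config[name] = spec['value']
        else:
            # A 'values' list is sampled from.
            config[name] = rng.choice(spec['values'])
    return config


rng = random.Random(0)
config = sample_config(sweep_parameters, rng)
print(config)
```

With `count = 1`, as in the cell above, only one such configuration is tried per call to `wandb.agent`.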

In [37]:
myrun.finish()


### References

https://romanorac.github.io/machine/learning/2019/12/02/identifying-hate-speech-with-bert-and-cnn.html