# Using FakeBERT Architecture for Sequence Classification

[Original FakeBERT Paper](https://link.springer.com/content/pdf/10.1007/s11042-020-10183-2.pdf)

FakeBERT Architecture:

![FakeBERT](fakebert.PNG "FakeBERT Architecture")

Summary:

Questions:
* Do we need to split train/test when we load the CSV files? I think this should happen in the DataLoader.
  * I would say yes: we should have separate train/validate/test CSV files so we separate our concerns once, and the Hugging Face library has a useful way to load data by split from separate CSVs. - Megan
* Do we really need a main.py? This could probably live in the notebook.
  * I agree. - Megan

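One way to settle the split question is to do the split once, offline, and write separate CSV files that the loader then reads by name. The sketch below assumes a single combined CSV with a header row; the function names `split_rows` and `write_csv`, the 80/10/10 ratios, and the seed are illustrative choices, not something the project has fixed.

```python
import csv
import random


def split_rows(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle rows once and return (train, validation, test) lists.

    The fractions and the seed are hypothetical defaults for illustration.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def write_csv(path, header, rows):
    """Write one split to disk, repeating the original header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Calling `split_rows` on the data rows and then `write_csv` once per split would produce `train.csv`, `validation.csv`, and `test.csv`; the fixed seed keeps the split reproducible across runs.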

TODO:
- (SKIP) Create config.xml files in config folder and load them into the notebook via the method used in assignments

- Save the data preprocessed with the regex (cleaning out newspaper metadata and removing sequences below a certain size) into the data folder so it can be tokenized (test.csv, train.csv, validation.csv)

- (DONE) (for now; run again with the preprocessed data files) Run the tokenizer code ONCE and save the result to a pickle (to spare the 10-minute runtime every time this notebook loads)

- (DONE) Pull BertModel out of this notebook and use source/models/FakeBert.py

- Look at adding the BERT embedding model to the nn.Module so that it receives gradients? (This may be why accuracy is poor right now, since BERT isn't in the computation graph.)

- Create a new source/models/BertLinear.py with a simple nn.Module that feeds the embeddings into a linear layer

- Clean the commented out code

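The preprocessing item in the TODO list (strip newspaper metadata, drop short sequences) could look roughly like the following. The dateline pattern (`CITY (Reuters) - `) and the `min_words` threshold are assumptions for illustration; the real regex and cutoff depend on what the raw articles actually contain.

```python
import re

# Hypothetical pattern: a leading dateline such as "WASHINGTON (Reuters) - ".
DATELINE = re.compile(r"^\s*[A-Z][A-Za-z .,/]*\(Reuters\)\s*-\s*")


def clean_text(text):
    """Remove a leading dateline and collapse runs of whitespace."""
    text = DATELINE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(texts, min_words=5):
    """Clean each text and keep only those with at least min_words words."""
    cleaned = (clean_text(t) for t in texts)
    return [t for t in cleaned if len(t.split()) >= min_words]
```

The cleaned list can then be written back to the data folder so that the tokenizer only ever sees preprocessed text.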

In [1]:
# Mount into drive

from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [2]:
# This is the location of the root directory (on github it would be cs7643-project)
# Yours might look different - Megan
root = '/content/drive/MyDrive/CS7643/Project'

In [3]:
# Detect PY file updates and reload
%load_ext autoreload
%autoreload 2

In [4]:
#%cd drive/MyDrive/Colab\ Notebooks/dl_project
import os
os.chdir(root)
%pwd

'/content/drive/MyDrive/CS7643/Project'

In [5]:
%ls

[0m[01;34mdata[0m/       [01;34mexperiments[0m/  [01;34mproposal[0m/     README.md  Tutorials
dataset.py  [01;34mnotebooks[0m/    [01;34m__pycache__[0m/  [01;34msource[0m/    untitled


### Installations

In [None]:
!pip install transformers
!pip install datasets
!pip install adjustText

In [None]:
#!pip install -r 'requirements.txt'
!pip install -r 'notebooks/requirement.txt'

In [8]:
import logging
import time
from platform import python_version

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from torch.autograd import Variable
from transformers import BertTokenizer, BertModel
from torch.optim import Adam, Adadelta
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

In [None]:
print("python version==%s" % python_version())
print("pandas==%s" % pd.__version__)
print("numpy==%s" % np.__version__)
print("torch==%s" % torch.__version__)
print("sklearn==%s" % sklearn.__version__)
print("transformers==%s" % transformers.__version__)
print("matplotlib==%s" % matplotlib.__version__)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
# n_gpu = torch.cuda.device_count()
# torch.cuda.get_device_name(0)

### Parameters (Pre-Model/Experiment)


In [11]:
# inputs
train_fp = 'train.csv'
test_fp = 'test.csv'
pretrained_fp = 'bert-base-uncased'
MAX_LENGTH = 64
PICKLED_TOKENIZER = None

### Get dataset, BERT tokens from preprocessed files (data/test.csv, data/train.csv)
Question (4/16/22) - Once the preprocessing code is complete, can we save the FINAL test, train, and validation CSV files in the data folder so they can be consumed by the bert_data class in my get_data.py? - Megan

TODO: When the preprocess code is done, the tokenizer only has to be run once and the resulting torch tensors can be saved. Add an 'if exists' check on the pickled token file to restore it from save rather than go through the 10 minute process of tokenizing each time we run this.
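
The "if exists" check described in the TODO could be sketched as below. `load_or_compute` and its arguments are hypothetical names, not part of the project's `get_data.py`; `compute_fn` stands in for the slow tokenizer call.

```python
import os
import pickle


def load_or_compute(cache_path, compute_fn):
    """Return the pickled object at cache_path, or compute and cache it.

    compute_fn is any zero-argument callable (here, a stand-in for the
    slow tokenization step). Both names are hypothetical.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Because the cached tokens depend on `MAX_LENGTH`, it is safer to keep that value in the file name so that a stale cache is not reused after the parameter changes.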

In [12]:
'''
Load the preprocessed data from the data folder, tokenize it, and retrieve all splits (train, test, validation).
'''
import pickle

from source.dataloader.get_data import bert_data

# Skip the tokenizer step if a pickled token file is already set.
# The pickled tokens must have been created with the same MAX_LENGTH as the parameter above.
if PICKLED_TOKENIZER is not None:
    with open(PICKLED_TOKENIZER, 'rb') as f:
        data_tokens = pickle.load(f)
else:
    data_tokens = None

# Note: this rebinds the name bert_data from the imported class to an instance.
bert_data = bert_data(
    data_folder='data',
    train_file=train_fp,
    test_file=test_fp,
    val_file=None,
    tokenizer_type=pretrained_fp,
    data_tokens=data_tokens,
)

# FYI: col_tokenize is the name of the column containing the sequence text (in case it changes)
data_tokenized = bert_data.tokenize(
    col_tokenize='text',
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding='max_length',
)

# Pickle the tokenized results (if new) to a file
if PICKLED_TOKENIZER is None:
    timestr = time.strftime("%Y%m%d-%H%M%S")
    with open(f"source/dataloader/tokenized{MAX_LENGTH}_{timestr}.p", 'wb') as f:
        pickle.dump(data_tokenized, f)

Using custom data configuration data-cf9ebd5eebbe0a50


Downloading and preparing dataset csv/data to /root/.cache/huggingface/datasets/csv/data-cf9ebd5eebbe0a50/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/data-cf9ebd5eebbe0a50/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


In [13]:
data_tokenized

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 35918
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 8980
    })
})

### Get embeddings using BERT

Each sequence becomes a 2D tensor:
* Each row is a token or subtoken in the sequence
* Each column is a value in the embedding (vector) for that token

In [16]:
bert_model = BertModel.from_pretrained(pretrained_fp).to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Parameters (Model)

In [17]:
BATCH_SIZE = 64
N_SEQUENCES = data_tokenized['train'].shape[0]
NUM_TOKENS, EMBEDDING_SIZE = (MAX_LENGTH, 768) # NUM_TOKENS equals the tokenizer's max_length; 768 is the bert-base hidden size
# N_SEQUENCES = 500
# NUM_TOKENS = 50
# EMBEDDING_SIZE = 512
# FILTER_SIZES = [2, 3, 4]
# NUM_FILTERS = [3, 3, 3]
FILTER_SIZES = [MAX_LENGTH, MAX_LENGTH, MAX_LENGTH]
NUM_FILTERS = [3, 4, 5]
NUM_CLASSES = 2
DROPOUT = 0.2
LR = 0.1
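
As a sanity check on the convolution parameters, the output length of a 1D convolution slid along the token axis follows a simple formula. The sketch below assumes stride 1, no padding, and no dilation by default, and assumes `FILTER_SIZES` holds kernel widths; how `source/models/FakeBert.py` actually interprets these lists is not shown in this notebook. `conv1d_output_length` is a hypothetical helper.

```python
def conv1d_output_length(num_tokens, kernel_size, stride=1, padding=0):
    """Length of a 1D convolution output along the token axis (no dilation)."""
    if kernel_size > num_tokens + 2 * padding:
        raise ValueError("kernel is wider than the padded input")
    return (num_tokens + 2 * padding - kernel_size) // stride + 1


MAX_LENGTH = 64  # same value as the pre-model parameter cell

# Compare small kernels with a kernel as wide as the whole sequence.
for k in (2, 3, 4, MAX_LENGTH):
    print(k, conv1d_output_length(MAX_LENGTH, k))
```

Under these assumptions, a kernel as wide as the whole sequence (`kernel_size == MAX_LENGTH`) yields a single output position per filter, so a later pooling step would have nothing left to reduce.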

### Run training data through BERT and CNN



In [18]:
import random

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)


## Example Experiment

In [33]:
# Parameters
BATCH_SIZE = 256
MAX_LENGTH = 512
DROPOUT = 0.2
LR = 0.001
pretrained_fp = 'bert-base-uncased'

In [35]:
'''
Load the preprocessed data from the data folder, tokenize it, and retrieve all splits (train, test, validation).
'''
from source.dataloader.get_data import bert_data

# Define the dataset (this rebinds the name bert_data from the imported class to an instance)
bert_data = bert_data(
    data_folder='data',
    train_file=train_fp,
    test_file=test_fp,
    val_file=None,
    tokenizer_type=pretrained_fp,
    data_tokens=data_tokens,
)

# FYI: col_tokenize is the name of the column containing the sequence text (in case it changes)
data_tokenized = bert_data.tokenize(
    col_tokenize='text',
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding='max_length',
)

Using custom data configuration data-cf9ebd5eebbe0a50
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/data-cf9ebd5eebbe0a50/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


In [None]:
from source import train
from source.models.linearBERT import initialize_model, linearBERT
from source.plot import experiment_plot

# Get the DataLoaders
train_dataloader, val_dataloader = bert_data.get_data_loader(batch_size=BATCH_SIZE)

# Specify the loss function
loss_fn = nn.CrossEntropyLoss()

# Initialize the model
bert_linear = initialize_model(device, dropout=DROPOUT)

# Specify the optimizer
opt = Adam(bert_linear.parameters(), lr=LR)

results = train.train(
    emb_model=bert_model,
    model=bert_linear,
    loss_fn=loss_fn,
    optimizer=opt,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    epochs=10,
)

plot_runner = experiment_plot(
    results=results,
    save_name=f"LINEAR-SEQ{MAX_LENGTH}-LR{LR}-DO{DROPOUT}-BS{BATCH_SIZE}",
    save_path="experiments/figures/",
    append_time=True,
)
plot_runner.plot(title_prepend="linearBERT", include_points=False)

Start training...

 Epoch  |  Train Loss  |  Train Acc  |  Val Loss  |  Val Acc  |  Elapsed
------------------------------------------------------------
   1    |   0.560674   |   83.65   |  0.479329  |   92.02   |  788.33
   2    |   0.457702   |   92.27   |  0.428479  |   94.49   |  788.12

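The number of batches per epoch follows from the split sizes and the batch size. A minimal check, assuming the loaders keep the final partial batch (the default `drop_last=False` behavior of PyTorch's `DataLoader`) and that the validation loader is built from the 8,980-row split; `batches_per_epoch` is a hypothetical helper:

```python
import math


def batches_per_epoch(num_rows, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_rows / batch_size)


BATCH_SIZE = 256  # value used in the example experiment
train_batches = batches_per_epoch(35918, BATCH_SIZE)
val_batches = batches_per_epoch(8980, BATCH_SIZE)
print(train_batches, val_batches)  # 141 36
```

Doubling the batch size roughly halves the number of steps per epoch but raises memory use per step, which matters at `MAX_LENGTH = 512`.
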
### References

https://romanorac.github.io/machine/learning/2019/12/02/identifying-hate-speech-with-bert-and-cnn.html