
# DS6050 - Group 6
* Andrej Erkelens <wsw3fa@virginia.edu>
* Robert Knuuti <uqq5zz@virginia.edu>
* Khoi Tran <kt2np@virginia.edu>

## Abstract
English is a highly redundant language, with over 69% redundancy in its construction; as a result, readers need only identify the important details to comprehend an intended message.
Although much effort has gone into quantifying the elements of language, the average reader can still comprehend a written message that contains spelling or grammatical errors.
Emulating the effortless yet obscure tasks of reading, writing, and understanding language is a natural challenge for the biologically inspired methods of deep learning.
Most language and text problems rely on finding high-quality latent representations of the task at hand. Unfortunately, efforts to solve such problems are limited by the data and computational power available to individuals; data availability is often the largest obstacle, particularly for small, domain-specific tasks.
Currently, these tasks are often aided or overcome by pre-trained large language models (LLMs) designed by large corporations and laboratories.
Fine-tuning language models on domain-specific vocabulary with small datasets still presents a challenge to the language community, but the growing availability of LLMs to augment such models eases it.
This paper explores techniques that can be applied to existing language models (LMs), which are built on highly complex deep learning architectures, and investigates how to fine-tune them so that a pre-trained model enriches a more domain-specific model that may be limited in textual data.

## Project Objective

We aim to use several small, domain-specific language tasks, particularly classification tasks.
We plan to use at least two models, most likely BERT and DistilGPT2, as both are readily available on HuggingFace and TensorFlow's model hub.
We will iterate over different choices of which layers to fine-tune and compare the results with fully trained models, ideally against benchmarks already published in academic papers for each dataset.

We aim to optimize both compute efficiency and model effectiveness on the given dataset. Our goal is to find a high-performing, generalizable fine-tuning method and share it in our paper.
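
The layer variants described above can be enumerated before any training is run. The following is a minimal sketch, assuming a 12-block transformer backbone (the depth of both `bert-base-uncased` and `gpt2`) and assuming that only the blocks closest to the output are unfrozen. `freeze_plan` and `sweep` are hypothetical helper names, not part of any library, and the step values are illustrative.

```python
def freeze_plan(num_blocks, num_trainable):
    """Return one flag per transformer block; True means the block is fine-tuned.

    Only the last `num_trainable` blocks (those closest to the output) are unfrozen.
    """
    if not 0 <= num_trainable <= num_blocks:
        raise ValueError("num_trainable must be between 0 and num_blocks")
    return [i >= num_blocks - num_trainable for i in range(num_blocks)]


def sweep(num_blocks=12, steps=(0, 1, 2, 4, 12)):
    """Map each candidate number of trainable blocks to its freeze plan."""
    return {k: freeze_plan(num_blocks, k) for k in steps}


plans = sweep()
for k, plan in plans.items():
    # 'T' marks a trainable block, '.' a frozen one.
    print(k, "".join("T" if flag else "." for flag in plan))
```

Each plan could then be applied by setting `trainable` on the corresponding blocks of the backbone before compiling the model.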


In [1]:
%autosave 0

Autosave disabled


In [2]:
!pip install -q tensorflow-text tokenizers transformers


In [2]:
import tensorflow as tf
import tensorflow_text as tf_text

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
%cd /content/drive/MyDrive/ds6050/git/

/content/drive/MyDrive/ds6050/git


In [34]:
import os
from pathlib import Path

import numpy as np
import pandas as pd

import tokenizers
import torch
import transformers

from tensorflow import keras
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers

np.random.seed(42)
tf.random.set_seed(42)

df = pd.read_feather("data-extractor/data/dataset.feather")#.set_index('index')
df['topic'] = df['topic'].str.split('.').str[0]
df_train = df.sample(frac = 0.8)
df_test = df.drop(df_train.index)
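
`df.sample(frac=0.8)` draws rows uniformly at random, so less frequent topics may end up under-represented in `df_test`. A stratified split is one possible alternative. Below is a minimal sketch in plain Python; `stratified_split` is a hypothetical helper and the toy labels are invented for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, seed=42):
    """Split row indices so each label keeps roughly `train_frac` of its rows in train."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    train_idx, test_idx = [], []
    for lab in sorted(by_label):
        rows = by_label[lab]
        rng.shuffle(rows)
        cut = round(len(rows) * train_frac)
        train_idx.extend(rows[:cut])
        test_idx.extend(rows[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels, invented for illustration only.
toy_labels = ["cs"] * 10 + ["math"] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

With the real data, the returned positions could be passed to `df.iloc[...]` to select the training and test rows.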

In [6]:
features = 'content' # feature for the future - add all the datasets ['categories', 'summary', 'content']
label = 'topic'

In [None]:
# strategy = tf.distribute.MirroredStrategy()

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

# (45030, 7): 7 different topics
y_ = ohe.fit_transform(df['topic'].values.reshape(-1,1)).toarray()
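
To make the shape of `y_` concrete, the encoding can be sketched without scikit-learn. This is an illustrative re-implementation, assuming one column per category with categories ordered by sorted unique value; `one_hot` is a hypothetical helper and the labels are toy values.

```python
def one_hot(labels):
    """Return (categories, rows) with one column per sorted unique label."""
    categories = sorted(set(labels))
    column = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for lab in labels:
        row = [0.0] * len(categories)
        row[column[lab]] = 1.0
        rows.append(row)
    return categories, rows


# Toy labels, invented for illustration only.
categories, rows = one_hot(["math", "cs", "math", "stat"])
print(categories)
print(rows)
```

Each row sums to 1, which is the target format `categorical_crossentropy` expects.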

In [37]:
max_len = 512
checkpoint = 'gpt2'
hf_gpt2_tokenizer = transformers.GPT2Tokenizer.from_pretrained(checkpoint, add_prefix_space=True)
hf_gpt2_model = transformers.TFGPT2Model.from_pretrained(checkpoint)
# hf_gpt2_model = transformers.GPT2ForSequenceClassification.from_pretrained(checkpoint)

# add for gpt2 padding
if hf_gpt2_tokenizer.pad_token is None:
    hf_gpt2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
hf_gpt2_model.resize_token_embeddings(len(hf_gpt2_tokenizer))

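# batch encoding
# With padding='max_length' and max_length=None, sequences are padded/truncated
# to the tokenizer's model_max_length (1024 for gpt2); max_len is not applied here.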
encodings = hf_gpt2_tokenizer.batch_encode_plus(list(df.summary.values), 
                                                return_tensors='tf', 
                                                padding='max_length',
                                                #add_special_tokens=True,
                                                max_length=None,
                                                truncation=True)

All model checkpoint layers were used when initializing TFGPT2Model.

All the layers of TFGPT2Model were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2Model for predictions without further training.
Using pad_token, but it is not set yet.
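
What `padding='max_length'` together with `truncation=True` does to a single sequence can be sketched as follows. This is a simplified illustration, assuming right-side padding; `pad_or_truncate` is a hypothetical helper, the token ids are toy values, and the pad id 50257 is an assumption (the first id after GPT-2's 50,257-token vocabulary).

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = list(token_ids[:max_length])          # truncate on the right
    mask = [1] * len(ids)                       # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


PAD_ID = 50257  # assumed id of the added '[PAD]' token

short_ids, short_mask = pad_or_truncate([10, 11, 12], 5, PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], 4, PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions during attention.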


In [None]:
# train_encodings = hf_bert_tokenizer.batch_encode_plus(list(df_train.summary.values), 
#                                                 return_tensors='tf', 
#                                                 padding='max_length',
#                                                 max_length=None,
#                                                 truncation=True)

# test_encodings = hf_bert_tokenizer.batch_encode_plus(list(df_test.summary.values), 
#                                                 return_tensors='tf', 
#                                                 padding='max_length',
#                                                 max_length=None,
#                                                 truncation=True)

In [53]:
def model_top(pretr_model):
  """Attach a small classification head to a pre-trained transformer."""
  # Token ids and attention mask, padded to the 1024-token GPT-2 context length.
  input_ids = tf.keras.Input(shape=(1024,), dtype='int32')
  attention_mask = tf.keras.Input(shape=(1024,), dtype='int32')

  output = pretr_model(input_ids=input_ids, attention_mask=attention_mask)
  # output = pretr_model([input_ids, attention_mask])
  #pooler_output = output[1]
  # output[0] is the last hidden state; average it over all 1024 positions.
  pooler_output = tf.keras.layers.AveragePooling1D(pool_size=1024)(output[0])
  flattened_output = tf.keras.layers.Flatten()(pooler_output)
  
  # Classification head: one hidden layer with dropout, then 7 topic classes.
  output = tf.keras.layers.Dense(32, activation='tanh')(flattened_output)
  output = tf.keras.layers.Dropout(0.2)(output)

  output = tf.keras.layers.Dense(7, activation='softmax')(output)
  model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
  model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

  return model
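
Because `AveragePooling1D(pool_size=1024)` averages every position, the hidden states at padded positions contribute to the pooled vector. A masked mean is one possible alternative; it is not what the notebook does. The sketch below contrasts the two on toy vectors in plain Python; `mean_pool` and `masked_mean_pool` are hypothetical helpers.

```python
def mean_pool(hidden_states):
    """Average every position, padding included."""
    seq_len = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / seq_len for d in range(dim)]


def masked_mean_pool(hidden_states, attention_mask):
    """Average only the positions whose attention mask is 1."""
    kept = [tok for tok, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(tok[d] for tok in kept) / len(kept) for d in range(dim)]


# Toy sequence: two real tokens followed by two padded positions.
hidden = [[1.0, 2.0], [3.0, 4.0], [8.0, 0.0], [8.0, 0.0]]
mask = [1, 1, 0, 0]

plain = mean_pool(hidden)
masked = masked_mean_pool(hidden, mask)
print(plain, masked)
```

For short summaries padded to 1024 tokens, the difference between the two could be substantial, since most positions would be padding.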

In [54]:
model = model_top(hf_gpt2_model)

In [55]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_14 (InputLayer)          [(None, 1024)]       0           []                               
                                                                                                  
 input_15 (InputLayer)          [(None, 1024)]       0           []                               
                                                                                                  
 tfgpt2_model_4 (TFGPT2Model)   TFBaseModelOutputWi  124440576   ['input_14[0][0]',               
                                thPastAndCrossAtten               'input_15[0][0]']               
                                tions(last_hidden_s                                               
                                tate=(None, 1024, 7                                           

In [56]:
model.layers

[<keras.engine.input_layer.InputLayer at 0x7fe3d13f5a10>,
 <keras.engine.input_layer.InputLayer at 0x7fe3cf346150>,
 <transformers.models.gpt2.modeling_tf_gpt2.TFGPT2Model at 0x7fe41819efd0>,
 <keras.layers.pooling.average_pooling1d.AveragePooling1D at 0x7fe3d13fd350>,
 <keras.layers.reshaping.flatten.Flatten at 0x7fe59888ac90>,
 <keras.layers.core.dense.Dense at 0x7fe4c755b7d0>,
 <keras.layers.regularization.dropout.Dropout at 0x7fe41847a650>,
 <keras.layers.core.dense.Dense at 0x7fe41853d050>]

In [57]:
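model.layers[2].trainable = False  # freeze the GPT-2 backbone; only the head is trained
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # recompile after changing trainable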
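
With the backbone frozen, only the two `Dense` layers of the head hold trainable weights (pooling, flatten, and dropout layers have none). The arithmetic below is a small worked example, assuming a hidden size of 768 for the base `gpt2` checkpoint; the backbone count is the figure reported by `model.summary()`. `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


HIDDEN = 768               # assumed hidden size of the base gpt2 checkpoint
BACKBONE = 124_440_576     # TFGPT2Model parameter count from model.summary()

head = dense_params(HIDDEN, 32) + dense_params(32, 7)
total = BACKBONE + head
print(head, total, f"{head / total:.6%}")
```

The head is a tiny fraction of the full model, which is what makes this variant cheap to train.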

In [58]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_14 (InputLayer)          [(None, 1024)]       0           []                               
                                                                                                  
 input_15 (InputLayer)          [(None, 1024)]       0           []                               
                                                                                                  
 tfgpt2_model_4 (TFGPT2Model)   TFBaseModelOutputWi  124440576   ['input_14[0][0]',               
                                thPastAndCrossAtten               'input_15[0][0]']               
                                tions(last_hidden_s                                               
                                tate=(None, 1024, 7                                           

In [59]:
!nvidia-smi

Mon Aug  8 04:47:51 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   77C    P0    34W /  70W |   4628MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [60]:
checkpoint_filepath = './tmp/checkpoint'

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

early_stopping_callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    mode="auto",
)
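
Note that the two callbacks monitor different quantities: the checkpoint tracks `val_accuracy`, while early stopping tracks `val_loss`. The patience rule can be sketched in plain Python as below. This is a simplified model of the behaviour, assuming no minimum delta and no baseline; `early_stop_epoch` is a hypothetical helper and the loss values are invented.

```python
def early_stop_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Toy validation losses, invented for illustration only.
stalled = [1.0, 0.8, 0.9, 0.85, 0.81, 0.82, 0.83, 0.7]
improving = [1.0, 0.9, 0.8, 0.7]

stop_at = early_stop_epoch(stalled)
never = early_stop_epoch(improving)
print(stop_at, never)
```

With `epochs=10` and `patience=5`, early stopping can only trigger if the best validation loss is reached within the first five epochs.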

In [None]:
history = model.fit([encodings['input_ids'], 
                     encodings['attention_mask']], 
                    y_, 
                    validation_split=.2,
                    epochs=10,
                    batch_size=4,
                    callbacks=[model_checkpoint_callback, early_stopping_callback])

Epoch 1/10
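
Keras takes the `validation_split` fraction from the end of the input arrays, before any shuffling, so if the rows of `df` are ordered by topic the validation set may not be representative. The sketch below mirrors that arithmetic, assuming the split point is `floor(n * (1 - validation_split))`; `keras_style_split` is a hypothetical helper, and 45030 is the row count noted in the one-hot encoding cell.

```python
import math


def keras_style_split(n_samples, validation_split):
    """Return (train_indices, val_indices) as ranges; validation is the tail."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return range(0, split_at), range(split_at, n_samples)


toy_train, toy_val = keras_style_split(10, 0.2)
full_train, full_val = keras_style_split(45030, 0.2)
print(list(toy_val), len(full_train), len(full_val))
```

Shuffling the encodings and labels together before calling `fit`, or passing an explicit `validation_data`, are two ways to avoid an ordered tail.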


In [None]:
train_labels = df_train['topic']
test_labels = df_test['topic']

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings),
                                                         train_labels))

test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings),
                                                        test_labels))

In [None]:
training_args.strategy.scope()

<tensorflow.python.distribute.distribute_lib._CurrentDistributionContext at 0x7f6135b3cf50>

In [None]:
# hf_bert_model.fit(train_dataset)

ValueError: ignored

In [None]:
train_encodings['input_ids']

<tf.Tensor: shape=(27716, 512), dtype=int32, numpy=
array([[  101,  3078,  3864, ...,     0,     0,     0],
       [  101, 26033,  2271, ...,     0,     0,     0],
       [  101,  7145,  1010, ...,  3939,  7286,   102],
       ...,
       [  101,  8934,  3258, ...,     0,     0,     0],
       [  101,  3972,  8873, ...,     0,     0,     0],
       [  101,  1037,  8754, ...,     0,     0,     0]], dtype=int32)>

In [None]:
hf_bert_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['acc'])

hf_bert_model.fit(train_encodings['input_ids'])

AttributeError: ignored

In [None]:
hf_bert_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['acc'])

hf_bert_model.fit(train_dataset, epochs=10, validation_data=test_dataset)

Epoch 1/10


ValueError: ignored

In [None]:
with training_args.strategy.scope():
  model = hf_bert_model

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)



### Data Preview

In [None]:
for text, label in ds_train.take(5):
  print('Text')
  print(text)
  print('Label')
  print(label)

Text
tf.Tensor(b"Primary elections were first introduced in Italy by Lega Nord in 1995, but were seldom used until before the 2005 regional elections.\nIn January 2005 the centre-left The Union coalition held open primaries in order to select its candidate for President in Apulia. More importantly, in October 2005, The Union asked its voters to choose the candidate for Prime Minister in the 2006 general election: 4.3 million voters showed up and Romano Prodi won hands down. Two years later, in October 2007: 3.5 million voters of the Democratic Party were called to elect Walter Veltroni as their first leader, the party's constituent assembly and regional leaders.\nThe centre-right (see House of Freedoms, The People of Freedom, centre-right coalition and Forza Italia) has held primary elections only at the local level.\n\n\n== Regulatory rules ==\nThere are no laws at country level to govern the conduct of any primary election.\nIn 2004 Tuscany introduced a regional law regulating primar

In [None]:
## This is currently broken - still trying to get the TFBertModel to accept the token string as input.
max_len = 384
hf_bert_tokenizer_bootstrapper = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
hf_bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")

save_path = Path("data") / "models"
save_path.mkdir(parents=True, exist_ok=True)
hf_bert_tokenizer_bootstrapper.save_pretrained(save_path)
hf_bert_model.save_pretrained(save_path)

# Load the fast tokenizer from saved file
bert_tokenizer = tokenizers.BertWordPieceTokenizer(str(save_path/"vocab.txt"), lowercase=True)

def tf_hf_bertencode(features, label):
    x = bert_tokenizer.encode(tf.compat.as_str(features), add_special_tokens=True)
    y = bert_tokenizer.encode(tf.compat.as_str(label), add_special_tokens=True)
    return x, y

def tf_hf_bertencodeds(features, label):
    encode = tf.py_function(func=tf_hf_bertencode, inp=[features, label], Tout=[tf.int64, tf.int64])
    return encode

encoded_input = ds_train.batch(256).map(tf_hf_bertencodeds)
output = transformers.TFBertModel(config=transformers.PretrainedConfig.from_json_file(str(save_path/"config.json")))
hf_bert = output(encoded_input)

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/511M [00:00<?, ?B/s]

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


ValueError: ignored

In [None]:

files = [] # Need to explode train_ds to sep files

tokenizer = tokenizers.BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True,
)

tokenizer.train(
    files,
    vocab_size=10000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

# Save the files
tokenizer.save_model(args.out, args.name)
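
The `files` list above is still empty. One way to populate it is to write the training texts to plain-text files, one document per line, and pass the resulting paths to `tokenizer.train`. The following is a minimal sketch under that assumption; `write_corpus_files`, the file naming scheme, and the toy texts are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_corpus_files(texts, out_dir, docs_per_file=1000):
    """Write `texts` to numbered .txt files, one document per line; return the paths as strings."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(texts), docs_per_file):
        path = out_dir / f"corpus_{start // docs_per_file:05d}.txt"
        chunk = texts[start:start + docs_per_file]
        # Collapse internal whitespace so each document stays on one line.
        lines = [" ".join(doc.split()) for doc in chunk]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths.append(str(path))
    return paths


# Toy texts, invented for illustration only.
toy_texts = ["first document\nwith two lines", "second document", "third document"]
with tempfile.TemporaryDirectory() as tmp:
    files = write_corpus_files(toy_texts, tmp, docs_per_file=2)
    n_files = len(files)
    first_file_lines = Path(files[0]).read_text(encoding="utf-8").splitlines()
print(n_files, first_file_lines)
```

With the real data, `texts` could be something like `df_train['content'].tolist()`, written to a persistent directory rather than a temporary one.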
