
# DS6050 - Group 6
* Andrej Erkelens <wsw3fa@virginia.edu>
* Robert Knuuti <uqq5zz@virginia.edu>
* Khoi Tran <kt2np@virginia.edu>

## Abstract
English is a verbose language with over 69% redundancy in its construction, and as a result, individuals only need to identify important details to comprehend an intended message.
While there are strong efforts to quantify the various elements of language, the average individual can still comprehend a written message that has errors, either in spelling or in grammar.
The emulation of the effortless, yet obscure task of reading, writing, and understanding language is the perfect challenge for the biologically-inspired methods of deep learning.
Most language and text-related problems rely on finding high-quality latent representations to understand the task at hand. Unfortunately, efforts to overcome such problems are limited by the data and computational power available to individuals; data availability is often the largest obstacle, particularly for small, domain-specific tasks.
Currently, these tasks are often aided or overcome by pre-trained large language models (LLMs), designed by large corporations and laboratories.
Fine-tuning language models on domain-specific vocabulary with small data sizes still presents a challenge to the language community, but the growing availability of LLMs to augment such models alleviates the challenge.
This paper explores techniques for adapting existing language models (LMs), which are built on highly complex deep learning architectures, and investigates how to fine-tune them so that a pre-trained model enriches a more domain-specific model for which textual data may be limited.

## Project Objective

We aim to use several small, domain-specific language tasks, particularly classification tasks.
We plan to use at least two models, probably BERT and DistilGPT2, as both appear readily available on HuggingFace and TensorFlow's model hub.
We will iterate over different choices of which layers to fine-tune and compare the results with fully trained models, ideally against benchmarks already published in academic papers for each dataset.

We aim to optimize both compute efficiency and model effectiveness on the given dataset. Our goal is to find a high-performing, generalizable fine-tuning method and share it in our paper.


In [1]:
%autosave 0

Autosave disabled


In [2]:
!pip install -q tensorflow-text tokenizers transformers


In [3]:
import tensorflow as tf
import tensorflow_text as tf_text

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%cd /content/drive/MyDrive/ds6050/git/

/content/drive/MyDrive/ds6050/git


In [6]:
!ls

data  data-extractor  logs  model  tmp


In [7]:
import os
from pathlib import Path

import numpy as np
import pandas as pd

import tokenizers
import transformers

from tensorflow import keras


np.random.seed(42)
tf.random.set_seed(42)

df = pd.read_feather("data-extractor/data/dataset.feather")
df['topic'] = df['topic'].str.split('.').str[0]
df_train = df.sample(frac = 0.8)
df_test = df.drop(df_train.index)
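
The split above draws 80% of the rows at random, so a topic's share can differ between the training and test sets. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with `stratify` to keep topic proportions aligned. The frame `toy_df` and its contents are hypothetical stand-ins for the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the Wikipedia dataset.
toy_df = pd.DataFrame({
    "summary": [f"doc {i}" for i in range(20)],
    "topic": ["astronomy"] * 10 + ["biology"] * 6 + ["sports"] * 4,
})

# stratify keeps each topic's share roughly equal in both splits;
# random_state makes the split reproducible.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["topic"], random_state=42
)

print(toy_train["topic"].value_counts().to_dict())
print(toy_test["topic"].value_counts().to_dict())
```

Whether stratification matters here depends on how unbalanced the seven topics are, which the notebook does not report.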

In [54]:
df.topic.unique()

array(['astronomy', 'sports', 'state_and_war', 'biology',
       'political-science', 'plantlife', 'oceanography'], dtype=object)

In [53]:
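# Optional two-class subset. The column name is lowercase 'topic'; 'Topic' raises a KeyError.
# Left commented out because the later cells assume all seven topics.
# df = df[df['topic'].isin(['biology', 'political-science'])]
df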

|  | index | topic | uri | categories | summary | content |
|---|---|---|---|---|---|---|
| 0 | 0 | astronomy | https://en.wikipedia.org/wiki/Astronomical_object | [All articles to be expanded, Articles to be e... | An astronomical object or celestial object is ... | An astronomical object or celestial object is ... |
| 1 | 1 | astronomy | https://en.wikipedia.org/wiki/243_Ida | [Articles with J9U identifiers, Articles with ... | Ida, minor planet designation 243 Ida, is an a... | Ida, minor planet designation 243 Ida, is an a... |
| 2 | 2 | astronomy | https://en.wikipedia.org/wiki/433_Eros | [2012 in science, 433 Eros, Amor asteroids, Ar... | Eros (minor planet designation: (433) Eros), p... | Eros (minor planet designation: (433) Eros), p... |
| 3 | 3 | astronomy | https://en.wikipedia.org/wiki/Active_galactic_... | [Active galaxy types, Articles with GND identi... | An active galactic nucleus (AGN) is a compact ... | An active galactic nucleus (AGN) is a compact ... |
| 4 | 5 | astronomy | https://en.wikipedia.org/wiki/Algol_variable | [Algol variables, Articles with short descript... | Algol variables or Algol-type binaries are a c... | Algol variables or Algol-type binaries are a c... |
| ... | ... | ... | ... | ... | ... | ... |
| 45025 | 10565 | oceanography | https://en.wikipedia.org/wiki/Word_processor | [Articles with BNF identifiers, Articles with ... | A word processor (WP) is a device or computer ... | A word processor (WP) is a device or computer ... |
| 45026 | 10566 | oceanography | https://en.wikipedia.org/wiki/Working_animal | [All articles with unsourced statements, Anima... | A working animal is an animal, usually domesti... | A working animal is an animal, usually domesti... |
| 45027 | 10569 | oceanography | https://en.wikipedia.org/wiki/World_Wide_Web | [20th-century inventions, All accuracy dispute... | The World Wide Web (WWW), commonly known as th... | The World Wide Web (WWW), commonly known as th... |
| 45028 | 10570 | oceanography | https://en.wikipedia.org/wiki/Editing | [Articles with FAST identifiers, Articles with... | Editing is the process of selecting and prepar... | Editing is the process of selecting and prepar... |


In [57]:
pd.set_option('display.max_rows', None)

In [112]:
df = df.drop(columns=['index'])

In [113]:
df.head()

|  | topic | uri | categories | summary | content |
|---|---|---|---|---|---|
| 0 | astronomy | https://en.wikipedia.org/wiki/Astronomical_object | [All articles to be expanded, Articles to be e... | An astronomical object or celestial object is ... | An astronomical object or celestial object is ... |
| 1 | astronomy | https://en.wikipedia.org/wiki/243_Ida | [Articles with J9U identifiers, Articles with ... | Ida, minor planet designation 243 Ida, is an a... | Ida, minor planet designation 243 Ida, is an a... |
| 2 | astronomy | https://en.wikipedia.org/wiki/433_Eros | [2012 in science, 433 Eros, Amor asteroids, Ar... | Eros (minor planet designation: (433) Eros), p... | Eros (minor planet designation: (433) Eros), p... |
| 3 | astronomy | https://en.wikipedia.org/wiki/Active_galactic_... | [Active galaxy types, Articles with GND identi... | An active galactic nucleus (AGN) is a compact ... | An active galactic nucleus (AGN) is a compact ... |
| 4 | astronomy | https://en.wikipedia.org/wiki/Algol_variable | [Algol variables, Articles with short descript... | Algol variables or Algol-type binaries are a c... | Algol variables or Algol-type binaries are a c... |


In [9]:
features = 'content' # feature for the future - add all the datasets ['categories', 'summary', 'content']
label = 'topic'

In [10]:
# strategy = tf.distribute.MirroredStrategy()

In [109]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

y_ = ohe.fit_transform(df['topic'].values.reshape(-1,1)).toarray()
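
To make the one-hot step above concrete, here is a small sketch on made-up topic labels. It shows why `.toarray()` is needed, how columns are ordered, and how rows map back to topic names. The names `toy_topics`, `toy_ohe`, and `toy_y` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels; the notebook uses df['topic'] instead.
toy_topics = np.array(["biology", "sports", "astronomy", "sports"])

toy_ohe = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default, hence .toarray().
toy_y = toy_ohe.fit_transform(toy_topics.reshape(-1, 1)).toarray()

# Columns follow the sorted category order stored in categories_.
print(toy_ohe.categories_[0])
print(toy_y)

# inverse_transform maps one-hot rows back to the original topic names.
recovered = toy_ohe.inverse_transform(toy_y).ravel()
print(recovered)
```

Keeping the fitted encoder around is useful later, since the softmax output of a classifier can be decoded by taking the `argmax` and indexing into `categories_[0]`.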

In [172]:
max_len = 512
hf_bert_tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
hf_bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
# hf_bert_model = transformers.TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [None]:
# Pad or truncate every summary to max_len tokens (defined above as 512)
train_encodings = hf_bert_tokenizer.batch_encode_plus(list(df_train.summary.values),
                                                      return_tensors='tf',
                                                      padding='max_length',
                                                      max_length=max_len,
                                                      truncation=True)

test_encodings = hf_bert_tokenizer.batch_encode_plus(list(df_test.summary.values),
                                                     return_tensors='tf',
                                                     padding='max_length',
                                                     max_length=max_len,
                                                     truncation=True)

In [115]:
# Encode the full dataset; each row is padded or truncated to max_len tokens
encodings = hf_bert_tokenizer.batch_encode_plus(list(df.summary.values),
                                                return_tensors='tf',
                                                padding='max_length',
                                                max_length=max_len,
                                                truncation=True)
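
The tokenizer calls above use `padding='max_length'` with `truncation=True`, so every summary comes back as a fixed-length row of ids plus an attention mask. The plain-Python sketch below imitates that behaviour on tiny inputs. It is an illustration under simplifying assumptions, not the Hugging Face implementation: `pad_and_truncate` is a hypothetical helper, and the body token ids are arbitrary.

```python
# Special-token ids in the bert-base-uncased vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def pad_and_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Pad on the right; padded positions get mask 0 so attention ignores them.
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# A short sequence is padded, a long one is truncated.
short_ids, short_mask = pad_and_truncate([7592, 2088], max_length=8)
long_ids, long_mask = pad_and_truncate(list(range(1000, 1020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With 512 positions per row, short summaries are mostly padding, so a smaller `max_length` may reduce memory and compute if most summaries are short. That trade-off has not been measured in this notebook.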


## LightGBM Model Comparison

In [59]:
import re
import string

import inflect
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used by the helpers below.
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [60]:
def text_lowercase(text):
    return text.lower()

p = inflect.engine()

# convert digit-only tokens into words, e.g. "42" -> "forty-two"
def convert_number(text):
    # split string into list of words
    temp_str = text.split()
    # initialise empty list
    new_string = []

    for word in temp_str:
        # if the word consists only of digits, spell it out
        # and append the result to the new_string list
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)

        # otherwise append the word as it is
        else:
            new_string.append(word)

    # join the words of new_string to form a string
    temp_str = ' '.join(new_string)
    return temp_str

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def remove_whitespace(text):
    return  " ".join(text.split())

# build the stop-word set once rather than on every call
stop_words = set(stopwords.words("english"))

# remove stopwords; returns a list of tokens
def remove_stopwords(text):
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text

# stem words in the list of tokenized words
def stem_words(text):
    word_tokens = word_tokenize(text)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems

# lemmatize string
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # provide context i.e. part-of-speech
    lemmas = [lemmatizer.lemmatize(word, pos ='v') for word in word_tokens]
    return lemmas

def rejoin(text):
    return ' '.join(text)


In [87]:
# Clean each summary; the final step returns a list of lemmas per document.
values = (
    df.summary
    .apply(text_lowercase)
    .apply(convert_number)
    .apply(remove_punctuation)
    .apply(remove_whitespace)
    .apply(remove_stopwords)
    .apply(rejoin)
    .apply(lemmatize_word)
)

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [64]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(values.apply(rejoin))

In [65]:
X.shape

(45030, 195482)
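
The shape above means one row per document and one column per vocabulary term. A toy run of `TfidfVectorizer` shows how those columns and weights arise. The three documents below are invented for illustration, and the `toy_` names are not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; the notebook uses the cleaned summaries instead.
toy_docs = [
    "star orbit planet",
    "planet orbit moon",
    "cell protein gene",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index.
col = toy_vectorizer.vocabulary_
row0 = toy_X.toarray()[0]

print(toy_X.shape)
print({term: round(float(row0[idx]), 3) for term, idx in sorted(col.items())})
```

In the first document, "star" appears in no other document while "orbit" appears in two, so "star" receives the larger weight. The same mechanism down-weights terms that are common across all Wikipedia summaries.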

In [66]:
import lightgbm
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

In [67]:
le = preprocessing.LabelEncoder()
y = le.fit_transform(df['topic'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_data = lightgbm.Dataset(X_train, label=y_train)
test_data = lightgbm.Dataset(X_test, label=y_test)

In [70]:
params = {'num_leaves': 31, 'objective': 'multiclass', 'seed' : 42, 'num_class': 7} 

In [71]:
num_round = 10
bst = lightgbm.train(params, train_data, num_round, valid_sets=[test_data])

[1]	valid_0's multi_logloss: 1.80624
[2]	valid_0's multi_logloss: 1.7157
[3]	valid_0's multi_logloss: 1.6435
[4]	valid_0's multi_logloss: 1.58183
[5]	valid_0's multi_logloss: 1.52943
[6]	valid_0's multi_logloss: 1.484
[7]	valid_0's multi_logloss: 1.44451
[8]	valid_0's multi_logloss: 1.40931
[9]	valid_0's multi_logloss: 1.37862
[10]	valid_0's multi_logloss: 1.34977
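
To read the `multi_logloss` values above, it helps to compare them with the loss of a model that assigns equal probability to all seven topics. The sketch below computes multiclass log loss by hand as the mean negative log-probability of the true class. `multi_logloss` here is a small illustrative function, not LightGBM's own code.

```python
import math

def multi_logloss(probabilities, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / len(labels)

n_classes = 7
# Four hypothetical rows, each predicted with a uniform distribution.
uniform = [[1.0 / n_classes] * n_classes for _ in range(4)]
baseline = multi_logloss(uniform, [0, 3, 5, 6])

print(round(baseline, 4))
```

The uniform baseline equals ln 7, about 1.946. The first boosting round (1.806) is only slightly better than that, and the loss is still falling at round 10, which suggests that more rounds might help. That has not been tested here.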


In [72]:
y_pred = bst.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)

In [73]:
accuracy_score(y_test, y_pred)

0.6241394625805019
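
A single accuracy figure does not show which topics are confused with one another. As a sketch of a per-class breakdown, the example below uses scikit-learn's `confusion_matrix` and `classification_report` on invented labels. `toy_true` and `toy_pred` are hypothetical and unrelated to the results above.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions for three classes.
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 0, 0, 1, 0, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred)
print(cm)

# Precision, recall, and F1 per class; zero_division avoids warnings
# when a class is never predicted.
print(classification_report(toy_true, toy_pred, zero_division=0))
```

In the notebook, the same calls would presumably take `y_test` and `y_pred`, with `target_names=le.classes_` to label rows by topic name.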

---

In [183]:
def model_top(pretr_model):
  """Attach a small classification head to a pretrained encoder."""
  # Token ids and attention mask, both padded to 512 positions
  input_ids = tf.keras.Input(shape=(512,), dtype='int32')
  attention_masks = tf.keras.Input(shape=(512,), dtype='int32')

  output = pretr_model([input_ids, attention_masks])
  #pooler_output = output[1]
  #pooler_output = tf.keras.layers.AveragePooling1D(pool_size=512)(output[0])
  #flattened_output = tf.keras.layers.Flatten()(pooler_output)

  # output[1] is the pooled representation of the sequence
  output = tf.keras.layers.Dense(512, activation='tanh')(output[1])
  output = tf.keras.layers.Dropout(0.2)(output)

  # One softmax unit per topic (7 classes)
  output = tf.keras.layers.Dense(7, activation='softmax')(output)
  model = tf.keras.models.Model(inputs=[input_ids, attention_masks], outputs=output)
  model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

  return model

In [184]:
model = model_top(hf_bert_model)

In [185]:
model.layers

[<keras.engine.input_layer.InputLayer at 0x7fa89eb0ae90>,
 <keras.engine.input_layer.InputLayer at 0x7fa89ee77b50>,
 <transformers.models.bert.modeling_tf_bert.TFBertModel at 0x7faae619c890>,
 <keras.layers.core.dense.Dense at 0x7faacec8ff10>,
 <keras.layers.regularization.dropout.Dropout at 0x7fabbd8b0ad0>,
 <keras.layers.core.dense.Dense at 0x7faa8dac4790>]

In [186]:
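# Freeze the pretrained BERT encoder so only the new head is trained
model.layers[2].trainable = False
# Recompile so the change to `trainable` is taken into account
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])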

In [170]:
model.layers[3].trainable = True
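
The cells above toggle `trainable` on individual layers. The sketch below shows the same freezing pattern on a tiny stand-in model, so the effect on parameter counts is easy to verify. The `toy_encoder` is a hypothetical substitute for BERT, and all names are illustrative. The Keras documentation advises calling `compile()` again whenever `trainable` changes, so this sketch freezes first and compiles afterwards.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the pretrained encoder (not BERT).
toy_encoder = tf.keras.Sequential(
    [tf.keras.layers.Dense(16, activation="relu"), tf.keras.layers.Dense(8)],
    name="toy_encoder",
)
inputs = tf.keras.Input(shape=(4,))
outputs = tf.keras.layers.Dense(3, activation="softmax")(toy_encoder(inputs))
toy_model = tf.keras.Model(inputs, outputs)

# Freeze the encoder first, then compile so training uses the final flags.
toy_encoder.trainable = False
toy_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

def count_params(weights):
    """Total number of scalar parameters in a list of weight tensors."""
    return sum(int(np.prod(tuple(w.shape))) for w in weights)

n_trainable = count_params(toy_model.trainable_weights)
n_frozen = count_params(toy_model.non_trainable_weights)
print(n_trainable, n_frozen)
```

Only the classification head remains trainable here. Comparing fine-tuning variants, as the project objective describes, would amount to changing which layers have `trainable = True` before compiling and fitting.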

In [187]:
model.summary()

Model: "model_12"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_31 (InputLayer)          [(None, 512)]        0           []                               
                                                                                                  
 input_32 (InputLayer)          [(None, 512)]        0           []                               
                                                                                                  
 tf_bert_model_1 (TFBertModel)  TFBaseModelOutputWi  109482240   ['input_31[0][0]',               
                                thPoolingAndCrossAt               'input_32[0][0]']               
                                tentions(last_hidde                                               
                                n_state=(None, 512,                                        

In [188]:
checkpoint_filepath = './tmp/checkpoint'

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

early_stopping_callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    mode="auto",
)
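
The `patience=5` setting above means training stops once the monitored value has failed to improve for five consecutive epochs. The plain-Python sketch below illustrates that rule on a made-up loss curve. It is not Keras's implementation, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement resets the counter
        else:
            wait += 1              # another epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: best value at epoch 1, late recovery at epoch 5.
losses = [1.00, 0.90, 0.95, 0.96, 0.97, 0.80]
stop_at = early_stopping_epoch(losses, patience=2)
never = early_stopping_epoch(losses, patience=5)
print(stop_at, never)
```

With a patience of 2 the run stops before the late improvement at epoch 5, while a patience of 5 lets it continue. With at most 10 epochs, a patience of 5 leaves room for only a few checks of this kind.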



In [None]:
history = model.fit([encodings['input_ids'], 
                     encodings['attention_mask']], 
                    y_, 
                    validation_split=.2,
                    epochs=10,
                    batch_size=64,
                    shuffle=True,
                    callbacks=[model_checkpoint_callback, early_stopping_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10

In [None]:
train_labels = df_train['topic']
test_labels = df_test['topic']

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings),
                                                         train_labels))

test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings),
                                                        test_labels))
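
The datasets above pair tokenizer output with raw topic strings. Before such a dataset can be used for training, the labels generally need to be numeric and the examples batched. The sketch below shows one way to do this on tiny made-up inputs. `toy_encodings`, `toy_labels`, and `label_to_id` are hypothetical names, and the token ids are arbitrary.

```python
import tensorflow as tf

# Hypothetical stand-ins for tokenizer output and string labels.
toy_encodings = {
    "input_ids": [[101, 7, 102, 0], [101, 8, 102, 0], [101, 9, 102, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]],
}
toy_labels = ["biology", "sports", "biology"]

# Map each topic name to an integer id, since losses expect numeric targets.
label_to_id = {name: i for i, name in enumerate(sorted(set(toy_labels)))}
label_ids = [label_to_id[name] for name in toy_labels]

# Each element is (features dict, label); shuffle, then group into batches.
toy_dataset = (
    tf.data.Dataset.from_tensor_slices((toy_encodings, label_ids))
    .shuffle(buffer_size=3, seed=42)
    .batch(2)
)

for features, labels in toy_dataset.take(1):
    print(features["input_ids"].shape, labels.shape)
```

Integer labels like these pair with `sparse_categorical_crossentropy`. The one-hot array `y_` used earlier pairs with `categorical_crossentropy` instead.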