This notebook documents the DistilBERT model fit during the first phase of the Modelling Racial Caste System Project.

## Packages

* Additional packages are imported in, or immediately before, the cells that use them. This reduces import conflicts and keeps each step self-contained.

In [1]:
import transformers

print(transformers.__version__)

4.33.2


## Data Ingestion

Load the dataset into a pandas DataFrame. The links below contain helpful information on loading datasets for transformer classification.

https://huggingface.co/docs/datasets/tabular_load#pandas-dataframes  
https://huggingface.co/docs/datasets/loading  
https://huggingface.co/docs/transformers/tasks/sequence_classification

This version of the training dataset includes implicit and extrinsic sentences from the UNC dataset, which explains the weaker accuracy and F1 scores below. Even so, these scores were higher than the Random Forest F1 scores. Because this is one of several experiments, the bottom of the notebook does not include checkpoint images; the results are reported in the description and the technical document. Please read the technical document for the other experiments we ran to determine the best model.

In [3]:
from datasets import Dataset
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

#Implicit and extrinsic sentences from the UNC dataset. 
df = pd.read_csv("fullnjc_dataframe9.csv")
df.drop(columns = 'Unnamed: 0', inplace=True)
features = df.loc[:,['year','sentence', 'jim_crow','state']].copy()
train, test = train_test_split(df, test_size = 0.2, random_state = 210)

#Specify the training and test columns
train = train.loc[:,['sentence', 'jim_crow']].copy()
test = test.loc[:,['sentence', 'jim_crow']].copy()

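#Use 'sentence' as the index (it is carried into the Dataset as the "sentence" column)
train = train.set_index('sentence', inplace=False)
test = test.set_index('sentence', inplace=False)
#Rename the target column to "label", the name the model expects
train = train.rename(columns={"jim_crow": "label"})
test = test.rename(columns={"jim_crow": "label"})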

In [4]:
train_ds = Dataset.from_pandas(train, split="train")
test_ds = Dataset.from_pandas(test, split="test")

## Preprocessing and Analysis Setup

We tokenized and prepared the sentence column for input into the DistilBERT model:

* The AutoTokenizer from the transformers library, pre-trained for distilbert-base-uncased, was used for tokenization.
* Sentences were tokenized into numerical IDs and attention masks, automatically handling tasks like lowercasing and truncation of sequences longer than the model's maximum length of 512 tokens.
* The tokenization process was applied to the training and testing datasets using the map function, ensuring that the outputs were formatted correctly for the DistilBERT model.
* Dynamic Padding: We used the DataCollatorWithPadding to dynamically pad sequences in each batch to the length of the longest sequence. This ensures efficient batching and compatibility with the TensorFlow framework, without introducing unnecessary computational overhead.
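
As a rough illustration of the dynamic padding step described above, the sketch below pads a batch of token ID lists to the length of its longest member and builds the matching attention masks. It is plain Python and not the `DataCollatorWithPadding` implementation; the function name `pad_batch` and the example token IDs are hypothetical, and we assume a padding ID of 0.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one.

    `batch` is a list of token-ID lists. `pad_id=0` is an assumption
    made for this illustration.
    """
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for two sentences of different lengths
example = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 102]])
print(example["input_ids"])
print(example["attention_mask"])
```

Because padding is computed per batch, a batch of short sentences is padded only to its own longest sentence rather than to the 512-token maximum.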

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [6]:
def preprocess_function(examples):
    return tokenizer(examples["sentence"], truncation=True)

In [7]:
tokenized_train = train_ds.map(preprocess_function, batched=True)
tokenized_test = test_ds.map(preprocess_function, batched=True)

Map:   0%|          | 0/14468 [00:00<?, ? examples/s]

Map:   0%|          | 0/3617 [00:00<?, ? examples/s]

In [8]:
#import tensorflow specific libraries

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras

2023-11-19 21:32:25.439435: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-19 21:32:26.093907: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [9]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

## Classification Metrics

We designated accuracy and F1 as the primary metrics of interest to evaluate model performance.

To implement this:

#### Metrics Computation:
* A compute_metrics function calculates accuracy and F1 scores by comparing model predictions with true labels.
* Predictions are derived from model logits using np.argmax to identify the most probable class.
#### Label Mapping:
* The id2label and label2id dictionaries map numeric labels (used internally by the model) to descriptive labels ("jim_crow" and "non_jim_crow") for interpretability.
#### Optimizer and Learning Rate Scheduler:
* An Adam-based optimizer is configured with an initial learning rate (init_lr=2e-5) and a warm-up schedule to stabilize training.
* Training setup includes a batch size of 16, 6 epochs, and dynamically calculated total_train_steps.
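
To make the metric definitions above concrete, here is a minimal pure-Python sketch of how accuracy and binary F1 follow from logits. It is not the `evaluate` library code; the helper names and the toy logits are hypothetical, and we assume class 1 ("jim_crow") is the positive class for F1.

```python
def argmax_rows(logits):
    """Return the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(logits, labels, positive=1):
    """Compute accuracy and binary F1, treating `positive` as the target class."""
    preds = argmax_rows(logits)
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    accuracy = sum(1 for p, y in pairs if p == y) / len(labels)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy logits for four sentences (made up for illustration)
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 1]
print(accuracy_and_f1(toy_logits, toy_labels))
```

In this toy case three of four predictions are correct, and one positive sentence is missed, so F1 is lower than it would be with perfect recall.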

In [10]:
import numpy as np
import evaluate

# Load the metric objects once instead of on every evaluation call
load_accuracy = evaluate.load("accuracy")
load_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}

In [11]:
id2label = {0: "non_jim_crow", 1: "jim_crow"}
label2id = {"non_jim_crow": 0, "jim_crow": 1}

In [12]:
from transformers import create_optimizer

batch_size = 16
num_epochs = 6
batches_per_epoch = len(tokenized_train) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=5,  # Short warm-up chosen because this dataset is small relative to corpora with millions of examples; adjust for your dataset size.
    num_train_steps=total_train_steps,
)
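
The step arithmetic in the cell above can be checked by hand. The sketch below recomputes `total_train_steps` for the 14,468 training examples reported by the `Map` progress output, and models the learning-rate schedule as a linear warm-up followed by a linear decay to zero. That schedule shape is our assumption about the default behaviour of `create_optimizer`, and `lr_at` is a hypothetical helper, not a library function.

```python
def total_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when incomplete batches are not counted."""
    return (num_examples // batch_size) * num_epochs


def lr_at(step, init_lr, warmup_steps, train_steps):
    """Learning rate at `step`, assuming linear warm-up then linear decay to 0."""
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    decay_steps = train_steps - warmup_steps
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)


steps = total_steps(num_examples=14468, batch_size=16, num_epochs=6)
print(steps)  # 904 batches per epoch * 6 epochs = 5424

for step in (0, 5, steps // 2, steps):
    print(step, lr_at(step, init_lr=2e-5, warmup_steps=5, train_steps=steps))
```

With only 5 warm-up steps out of 5,424, the model trains at close to the peak rate of 2e-5 almost immediately, and the rate then falls steadily over the 6 epochs.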

## Model Training and Testing

We prepared the model and datasets for training, monitored its performance using metrics, and saved the final model for deployment:

#### Model Initialization:
* The DistilBERT model is loaded with pre-trained weights for text classification.
* The model is configured for binary classification with num_labels=2 and uses the label mappings (id2label and label2id) for consistency.
#### Dataset Preparation:
* Training and validation datasets were prepared using the prepare_tf_dataset method. This ensured proper batching and padding of tokenized data for compatibility with TensorFlow.
#### Compilation and Training:
* The model was compiled with the Adam-based optimizer.
* Training was performed for 6 epochs with a batch size of 16, and validation metrics were monitored using the KerasMetricCallback for accuracy and F1. Epoch 3 had the highest accuracy and F1 scores.
#### Saving the Model:
* The trained model and tokenizer were saved to "my_bertmodel7" and "tokenizer7", enabling reuse for future predictions or further fine-tuning.

In [13]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

2023-11-19 21:32:42.579126: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-19 21:32:45.002744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 77576 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:0f:00.0, compute capability: 8.0
2023-11-19 21:32:45.006547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 78300 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:bd:00.0, compute capability: 8.0
2023-11-19 21:32:47.705785: I tensorflow/stream_executor/cuda/cuda_blas.cc:1614] TensorFloat-32 will be used for the matrix multiplicati

In [14]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_train,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_test,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [15]:
model.compile(optimizer=optimizer) 

In [16]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)


In [17]:
# Training Chunk

callbacks = [metric_callback]
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=6, callbacks=callbacks)

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7f54dc1899a0>

In [None]:
model.save_pretrained("my_bertmodel7")
tokenizer.save_pretrained("tokenizer7")

## Test Set Performance

We use the saved my_bertmodel7 to predict labels for a sample of 10,000 random sentences.

In [None]:
from transformers import pipeline

# Load the CSV file into a DataFrame
df = pd.read_csv('oct1_10k_samplefixed.csv')
df = df.loc[:,['sentence', 'jim_crow']].copy()
df = df.rename(columns={"jim_crow": "label"})

# Initialize the UVA model and tokenizer inference pipeline for text classification
nlp = pipeline("sentiment-analysis", model="my_bertmodel7", tokenizer="tokenizer7", truncation=True)

# Define the name of the text column in your CSV
text_column = 'sentence'  

# Perform inference and add the inferred labels to a new column
df['inferred_label'] = df[text_column].apply(lambda x: nlp(x)[0]['label'])

# Save the updated DataFrame to a new CSV file
df.to_csv('bert10k_with_inference7.csv', index=False)

print("Inference results saved to 'bert10k_with_inference7.csv'")


In [None]:
df['jim_crow'] = df['inferred_label'].apply(lambda x: 1 if x == 'jim_crow' else 0)
df.to_csv('bert10k_with_inference7.csv', index=False)

This version predicted 9621 non-Jim Crow and 379 Jim Crow sentences.

In [None]:
df.value_counts(["jim_crow"])
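
Beyond the raw prediction counts, the inferred labels could be compared against the `label` column. The sketch below is one way to do that in plain Python, assuming `label` holds reference 0/1 values for the sample; the function names and the toy data are hypothetical and are not results from this experiment.

```python
from collections import Counter


def confusion_counts(gold, inferred, positive="jim_crow"):
    """Count (gold, predicted) pairs after mapping label strings to 0/1."""
    counts = Counter()
    for y, name in zip(gold, inferred):
        pred = 1 if name == positive else 0
        counts[(y, pred)] += 1
    return counts


def precision_recall(counts):
    """Precision and recall for the positive class (1)."""
    tp, fp, fn = counts[(1, 1)], counts[(0, 1)], counts[(1, 0)]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data only; with the real DataFrame this would be
# confusion_counts(df["label"], df["inferred_label"])
toy_gold = [1, 0, 0, 1, 0]
toy_inferred = ["jim_crow", "non_jim_crow", "jim_crow", "non_jim_crow", "non_jim_crow"]
counts = confusion_counts(toy_gold, toy_inferred)
print(dict(counts))
print(precision_recall(counts))
```

Since Jim Crow sentences are a small share of the sample, precision and recall on the positive class are likely to be more informative than overall accuracy.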

In [None]:
pd.set_option('display.max_colwidth', None)
df.loc[df.jim_crow != 0, ["sentence","jim_crow","inferred_label"]].sample(50)

In [None]:
df.loc[df.jim_crow != 1, ["sentence","jim_crow","inferred_label"]].sample(50)