# Building an AI-detector: fine-tuning DistilBERT with Keras (GPT only)

In this notebook I'll go step by step through the process of building an AI detector by fine-tuning a pre-trained language model ([DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)). The training data consists of human text samples from [English-language Wikipedia](https://en.wikipedia.org), [the IMDB review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) from the Stanford AI Lab, and [Reddit](https://reddit.com), together with AI text generated by gpt-4o and gpt-4o-mini.

## Install and import dependencies

First, we import the necessary libraries, making sure that the latest version of the Hugging Face `transformers` library is installed and that it is compatible with Keras.

In [None]:
pip install --upgrade transformers



In [None]:
!pip install tf-keras
import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'



In [None]:
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizerFast
import tensorflow as tf
import numpy as np

## Loading the training data

Next, we load and explore the training data. The test data, which consists of texts from completely different sources to the training data, will not be seen by the model until the testing stage.

In [None]:
import pandas as pd

In [None]:
human_train = pd.read_csv('human_train_gpt_only.csv')
human_train

Unnamed: 0,text,source
0,Alan Mathison Turing (; 23 June 1912 – 7 June ...,English Wikipedia
1,"James Dewey Watson (born April 6, 1928) is an ...",English Wikipedia
2,"Harry George Drickamer (November 19, 1918 – Ma...",English Wikipedia
3,Anthony Stephen Fauci ( FOW-chee; born Decemb...,English Wikipedia
4,"Charles Hard Townes (July 28, 1915 – January 2...",English Wikipedia
...,...,...
25175,I’ve been reading through AITA and found a pos...,Reddit (r/OffMyChest)
25176,"So, my mom bakes cakes and she got an order t...",Reddit (r/OffMyChest)
25177,My brother is 16 and has Down Syndrome. For a ...,Reddit (r/OffMyChest)
25178,With the news of Bill and Melinda Gates divorc...,Reddit (r/OffMyChest)


The human samples are labelled with the source:

In [None]:
human_train['source'].unique()

array(['English Wikipedia', 'IMDB review', 'Reddit (r/AmItheAsshole)',
       'Reddit (r/relationship_advice)', 'Reddit (r/dating_advice)',
       'Reddit (r/tifu)', 'Reddit (r/TrueOffMyChest)',
       'Reddit (r/confessions)', 'Reddit (r/FML)', 'Reddit (r/parenting)',
       'Reddit (r/inlaws)', 'Reddit (r/OffMyChest)'], dtype=object)

In [None]:
for idx, row in human_train.sample(10, random_state=623).iterrows():
  print(f"Source: {row['source']}, text: {row['text']}")
  print('*'*20)

Source: IMDB review, text: I first saw "Signs of Life" on PBS as an American Playhouse presentation. It's a wonderfully written, ensemble production with terrific performances by Michael Lewis as Joey and Vincent D'Onofrio as his brother, Daryl. Arthur Kennedy, in one of his last roles, is also excellent as an aging shipbuilder whose family business is about to close. The rest of the cast which includes Beau Bridges, Kathy Bates and Mary-Louise Parker give remarkable clarity and substance to their characters.

The direction is subtle and effective. I've watched this movie several times over the years and would very much recommend it. A beautiful piece of filmmaking.
********************
Source: Reddit (r/AmItheAsshole), text: 
To start off I want to say that my husband (36M) has an old friend (33M) that he's known since highschool. they're inseperable and spend the entire week together. like they're really really close.

My husband and I struggled with fertility issues for years. we re

Let's now take a look at the AI samples:

In [None]:
AI_train = pd.read_csv('AI_train_gpt_only.csv')
AI_train

Unnamed: 0,text,prompt,system,model,temperature,cleaning
0,Alan Turing (23 June 1912 – 7 June 1954) was a...,Write the introductory section to a Wikipedia ...,You are a wikipedia contributor.,gpt-4o-mini,0.22,Removed headers and markdown formatting
1,"James Dewey Watson (born April 6, 1920) is an ...",Write the introductory section to a Wikipedia ...,You are a wikipedia contributor.,gpt-4o-mini,0.31,Removed headers and markdown formatting
2,Harry George Drickamer (born [insert date of b...,Write the introductory section to a Wikipedia ...,You are a wikipedia contributor.,gpt-4o-mini,0.54,Removed headers and markdown formatting
3,"Anthony Stephen Fauci (born December 24, 1940)...",Write the introductory section to a Wikipedia ...,You are a wikipedia contributor.,gpt-4o-mini,0.38,Removed headers and markdown formatting
4,"Charles H. Townes (July 28, 1915 – January 27,...",Write the introductory section to a Wikipedia ...,You are a wikipedia contributor.,gpt-4o-mini,0.03,Removed headers and markdown formatting
...,...,...,...,...,...,...
25179,"I don’t even know where to begin, but I’ve got...",Write a post in r/OffMyChest with the title: C...,You are a redditor.,gpt-4o-mini,1.11,Removed headers
25180,"So, I just got home from school and found out ...",Write a post in r/OffMyChest with the title: M...,You are a redditor.,gpt-4o-mini,1.03,Removed headers
25181,"Hey everyone,\n\nI just wanted to take a momen...",Write a post in r/OffMyChest with the title: T...,You are a redditor.,gpt-4o-mini,1.17,Removed headers
25182,"I’ve been sitting on this for a while now, and...",Write a post in r/OffMyChest with the title: N...,You are a redditor.,gpt-4o-mini,1.15,Removed headers


We see that, in addition to the text samples, the file contains details of the prompt, model, temperature and cleaning. Let's explore these to get a better idea of the training data:

In [None]:
for idx, row in AI_train.sample(24, random_state=623).iterrows():
  print(f"Prompt: {row['prompt']}\n System: {row['system']}")
  print('*'*20)

Prompt: Write an IMDB review for the following movie: Cars (2006)
 System: You are an amateur movie critic leaving an IMDB review. You aren't writing professionally, but you still put thought into your comments.
********************
Prompt: Write an IMDB review for the following movie: Big Daddy (1999)
 System: You are a casual moviegoer with no experience in creative writing who occasionally writes IMDB reviews while waiting for food to cook in the microwave.
********************
Prompt: Write the introductory section to a Wikipedia page with the following title: Interquartile range
 System: You are a wikipedia contributor.
********************
Prompt: Write an IMDB review for the following movie: Guardians of the Galaxy (2014)
 System: You are a casual moviegoer with no experience in creative writing, leaving an IMDB review for the first time.
********************
Prompt: Write an IMDB review for the following movie: Ruthless People (1986)
 System: You are a casual moviegoer with no 

We can see that the GPT models were prompted specifically to imitate the authors of the human texts. The prompts cover a diverse range of writing styles, which should help the detector learn patterns that are characteristic of GPT-generated text rather than of any one genre, and so generalise to unseen writing.

## Tokenizing the training data

The next step is to tokenize the data. We need to ensure that no tokenized sample exceeds the model's maximum sequence length:

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

In [None]:
tokenizer.model_max_length

512

In [None]:
AI_train['token_count'] = AI_train['text'].apply(lambda text: len(tokenizer(text)['input_ids']))

Token indices sequence length is longer than the specified maximum sequence length for this model (16381 > 512). Running this sequence through the model will result in indexing errors


In [None]:
human_train['token_count'] = human_train['text'].apply(lambda text: len(tokenizer(text)['input_ids']))

We see that some of the samples greatly exceed the max length allowed by the model:

In [None]:
AI_train['token_count'].describe()

Unnamed: 0,token_count
count,25184.0
mean,335.167408
std,163.435898
min,78.0
25%,242.0
50%,315.0
75%,419.0
max,16381.0


In [None]:
AI_train.sort_values(by='token_count', ascending=False)

Unnamed: 0,text,prompt,system,model,temperature,cleaning,token_count
415,Metacritic is a review aggregation website tha...,Write the introductory section to a Wikipedia ...,You are a wikipedia contributor.,gpt-4o-mini,0.41,Removed headers and markdown formatting,16381
19486,"So, here's the situation. I (26M) have a half-...",Write a post in r/AmItheAsshole with the title...,You are a redditor.,gpt-4o,1.19,Removed headers,6670
21694,"Obligatory, this didn't happen today, but rath...",Write a post in r/tifu with the title: TIFU by...,You are a redditor.,gpt-4o,1.18,Removed headers,5146
21676,"So, fellow Redditors, grab your popcorn becaus...",Write a post in r/tifu with the title: TIFU by...,You are a redditor.,gpt-4o-mini,1.07,Removed headers,868
21654,"So, I (26M) have dabbled in some psychedelics ...",Write a post in r/tifu with the title: TIFU us...,You are a redditor.,gpt-4o-mini,1.18,Removed headers,846
...,...,...,...,...,...,...,...
23499,"Today, I was at a crucial point during an impo...",Write a post in r/FML with the title: I coughe...,You are a redditor.,gpt-4o,1.03,Removed headers,89
23565,Post: FML. Just found out there's a new lockdo...,Write a post in r/FML with the title: New lock...,You are a redditor.,gpt-4o,1.17,Removed headers,83
23832,I'm so sorry to hear that you're going through...,Write a post in r/parenting with the title: My...,You are a redditor.,gpt-4o-mini,1.16,Removed headers,82
23543,"Today, I found out that my younger sister, who...",Write a post in r/FML with the title: She's bu...,You are a redditor.,gpt-4o,1.19,Removed headers,81


In [None]:
human_train.sort_values(by='token_count', ascending=False)

Unnamed: 0,text,source,token_count
22797,"Okay, fair warning, this one is long as hell. ...",Reddit (r/confessions),4463
21670,Obligatory this happened 9 years ago but I sti...,Reddit (r/tifu),4371
22882,Some years ago I decided to go alone on a beau...,Reddit (r/confessions),4337
20252,[Original Post](https://www.reddit.com/r/relat...,Reddit (r/relationship_advice),4205
22957,Disclaimer: his vaccine injury has been confir...,Reddit (r/confessions),4189
...,...,...,...
15652,"A great film in its genre, the direction, acti...",IMDB review,31
12475,Great movie - especially the music - Etta Jame...,IMDB review,30
16757,One of the funniest movies made in recent year...,IMDB review,26
9720,You'd better choose Paul Verhoeven's even if y...,IMDB review,21


We will need to preprocess the training data so that the model does not receive input that exceeds the max sequence length. While the tokenizer can truncate its input, this would leave the model with some samples that are cut off mid-sentence. Instead, I'll chunk the text into paragraphs and, if the input is too long, truncate it to the leading paragraphs rather than the leading tokens.
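
To make the difference between the two strategies concrete, here is a minimal sketch that uses whitespace-separated words as a stand-in for real tokens. The helper names `truncate_tokens` and `truncate_paragraphs` are hypothetical and are not part of the notebook's pipeline, which also has to handle a first paragraph that is too long on its own.

```python
def truncate_tokens(tokens, max_len):
    """Keep the first max_len tokens, regardless of paragraph boundaries."""
    return tokens[:max_len]


def truncate_paragraphs(paragraphs, max_len):
    """Keep whole leading paragraphs while the running token count fits."""
    kept, used = [], 0
    for para in paragraphs:
        n_tokens = len(para.split())  # whitespace words stand in for tokens
        if used + n_tokens > max_len:
            break
        kept.append(para)
        used += n_tokens
    return kept


text = "one two three four.\n\nfive six seven.\n\neight nine ten eleven."
paragraphs = text.split("\n\n")

by_tokens = truncate_tokens(text.split(), max_len=8)
by_paragraphs = truncate_paragraphs(paragraphs, max_len=8)

print(by_tokens)      # ends partway through the third paragraph
print(by_paragraphs)  # stops cleanly after the second paragraph
```

Token-level truncation uses the full budget but ends mid-paragraph, while paragraph-level truncation gives up a little of the budget in exchange for a clean boundary.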

Since DistilBERT does not recognise a paragraph delimiter as a token, I'll add it to the vocabulary. This would provide the model insight into the paragraph structure of the text:

In [None]:
tokenizer.encode('Hello! Hello!')

[101, 8667, 106, 8667, 106, 102]

In [None]:
tokenizer.encode('Hello!\n\nHello!')

[101, 8667, 106, 8667, 106, 102]

In [None]:
tokenizer.add_tokens(['\n\n'])

1

Next, we will define a function that takes a list of tokenized "chunks", combines as many chunks as will fit into the model, and then pads to the sequence length.

In [None]:
def truncator(group_encodings):
    """Pack as many leading chunks as fit between [CLS] and [SEP], then pad.

    Returns the padded encodings and the number of chunks that were used.
    """
    input_ids = [tokenizer.cls_token_id]
    attention_mask = [1]
    n = 0
    while n < len(group_encodings):
        # stop once the next chunk plus the closing [SEP] would no longer fit
        if len(input_ids) + len(group_encodings[n]['input_ids']) + 1 >= tokenizer.model_max_length:
            break
        input_ids = [*input_ids, *group_encodings[n]['input_ids']]
        attention_mask = [*attention_mask, *group_encodings[n]['attention_mask']]
        n += 1

    input_ids.append(tokenizer.sep_token_id)
    attention_mask.append(1)

    # pad to the model's max length; padding positions get attention mask 0
    pad_length = tokenizer.model_max_length - len(input_ids)
    input_ids = [*input_ids, *[tokenizer.pad_token_id]*pad_length]
    attention_mask = [*attention_mask, *[0]*pad_length]

    return {'input_ids': input_ids,
            'attention_mask': attention_mask}, n
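
To see what this packing produces, the following toy version runs the same greedy logic on plain lists of made-up token ids, with no tokenizer involved. The name `pack_chunks` and the maximum length of 12 are assumptions chosen for illustration; the ids 101, 102 and 0 are the usual [CLS], [SEP] and [PAD] ids in BERT-style vocabularies.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # usual BERT-style special token ids
MAX_LEN = 12  # toy value; the notebook uses tokenizer.model_max_length (512)


def pack_chunks(chunks, max_len=MAX_LEN):
    """Greedily pack leading chunks between [CLS] and [SEP], then pad."""
    input_ids = [CLS_ID]
    n_used = 0
    for chunk in chunks:
        # same test as truncator: stop when chunk + [SEP] would reach max_len
        if len(input_ids) + len(chunk) + 1 >= max_len:
            break
        input_ids.extend(chunk)
        n_used += 1
    input_ids.append(SEP_ID)
    n_real = len(input_ids)
    attention_mask = [1] * n_real + [0] * (max_len - n_real)
    input_ids = input_ids + [PAD_ID] * (max_len - n_real)
    return input_ids, attention_mask, n_used


# three hypothetical "paragraphs" of 3, 4 and 4 tokens
chunks = [[11, 12, 13], [21, 22, 23, 24], [31, 32, 33, 34]]
ids, mask, n_used = pack_chunks(chunks)
print(ids)     # [101, 11, 12, 13, 21, 22, 23, 24, 102, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(n_used)  # 2
```

The third chunk is dropped whole rather than being cut in the middle, and the attention mask marks the padding positions so that the model ignores them.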

Let's look at an example text that's too long:

In [None]:
sample_text = human_train[human_train['token_count'] > tokenizer.model_max_length].sample(1, random_state=623).iloc[0]['text']
sample_text

'Stuttgart (German: [ˈʃtʊtɡaʁt] ; Swabian: Schduagert [ˈʒ̊d̥ua̯ɡ̊ɛʕd̥]; names in other languages) is the capital and largest city of the German state of Baden-Württemberg. It is located on the Neckar river in a fertile valley known as the Stuttgarter Kessel (Stuttgart Cauldron) and lies an hour from the Swabian Jura and the Black Forest. Stuttgart has a population of 632,865 as of 2022, making it the sixth largest city in Germany, while over 2.8 million people live in the city\'s administrative region and nearly 5.5 million people in its metropolitan area, making it the fourth largest metropolitan area in Germany. The city and metropolitan area are consistently ranked among the top 5 European metropolitan areas by GDP; Mercer listed Stuttgart as 21st on its 2015 list of cities by quality of living; innovation agency 2thinknow ranked the city 24th globally out of 442 cities in its Innovation Cities Index; and the Globalization and World Cities Research Network ranked the city as a Beta-

If we divide the text into paragraphs, we also need to account for the possibility that the first paragraph alone exceeds the model's max length. We can introduce a hierarchy of delimiters: we first split the text into paragraphs; if the first paragraph is too long, we split it into lines, in case the line delimiter ('\n') is used; if the first line is too long, we split it into sentences; and if the first sentence is still too long, *then* we simply truncate. We'll implement this with regular expressions:

In [None]:
import regex

# split after each blank line, keeping the '\n\n' delimiter attached to its paragraph
PARAGRAPH_SEP_PATTERN = regex.compile(r'(?<=\n\n)')
# split on runs of newlines (the newlines themselves are dropped)
LINE_SEP_PATTERN = regex.compile('[\n]+')
# split between a punctuation character and the whitespace that follows it
PUNCT_PATTERN = regex.compile(r'(?<=[\p{P}])(?=\s+)')

def tokenizer_custom_truncation(text):
    # split text into paragraphs and tokenize
    paragraphs = PARAGRAPH_SEP_PATTERN.split(text)
    paragraph_encodings = [tokenizer(para, add_special_tokens=False) for para in paragraphs]

    # if first paragraph is too long, further split it into lines and tokenize
    if len(paragraph_encodings[0]['input_ids']) + 2 >= tokenizer.model_max_length:
        lines = LINE_SEP_PATTERN.split(paragraphs[0])
        line_encodings = [tokenizer(line, add_special_tokens=False) for line in lines]

        # if first line is still too long, split first line on punctuation and tokenize
        if len(line_encodings[0]['input_ids']) + 2 >= tokenizer.model_max_length:
            sentences = PUNCT_PATTERN.split(lines[0])
            sentence_encodings = [tokenizer(sentence, add_special_tokens=False) for sentence in sentences]

            # if first sentence is still too long, just return truncated first sentence
            if len(sentence_encodings[0]['input_ids']) + 2 >= tokenizer.model_max_length:
                return tokenizer(sentences[0], truncation=True, padding='max_length')
            # otherwise truncate first line split on sentences
            else:
                encodings, _ = truncator(sentence_encodings)
                return encodings
        # otherwise truncate first paragraph split on lines
        else:
            encodings, _ = truncator(line_encodings)
            return encodings
    # otherwise truncate whole text split on paragraphs
    encodings, _ = truncator(paragraph_encodings)
    return encodings
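
The splitting behaviour of the three patterns can be checked on a small string. The sketch below uses the standard-library `re` module instead of `regex`, so the punctuation class is an ASCII approximation of `\p{P}` (an assumption made only to keep the example dependency-free); the paragraph and line patterns are the same as above.

```python
import re

# Stdlib stand-ins for the notebook's patterns. The ASCII punctuation class
# is an assumption replacing \p{P}, which needs the third-party `regex` package.
PARAGRAPH_SEP = re.compile(r"(?<=\n\n)")
LINE_SEP = re.compile(r"\n+")
PUNCT = re.compile(r"(?<=[.,;:!?])(?=\s+)")

text = "First line.\nSecond line, same paragraph.\n\nNew paragraph."

paragraphs = PARAGRAPH_SEP.split(text)
lines = LINE_SEP.split(paragraphs[0])
sentences = PUNCT.split(lines[1])

print(paragraphs)  # the '\n\n' stays attached to the end of the first paragraph
print(lines)       # newlines are dropped; note the trailing empty string
print(sentences)   # punctuation stays with the text before the split
```

Because the paragraph pattern is a zero-width lookbehind, the `'\n\n'` delimiter survives the split and can still be encoded as the token added to the vocabulary earlier. Splitting on a zero-width match requires Python 3.7 or later when using `re`.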

Let's see how the custom tokenizer processes our previous text:

In [None]:
tokenized = tokenizer_custom_truncation(sample_text)

In [None]:
print(tokenizer.decode(tokenized['input_ids']))

[CLS] Stuttgart ( German : [ ˈʃtʊtɡaʁt ] ; Swabian : Schduagert [ [UNK] ] ; names in other languages ) is the capital and largest city of the German state of Baden - Württemberg. It is located on the Neckar river in a fertile valley known as the Stuttgarter Kessel ( Stuttgart Cauldron ) and lies an hour from the Swabian Jura and the Black Forest. Stuttgart has a population of 632, 865 as of 2022, making it the sixth largest city in Germany, while over 2. 8 million people live in the city ' s administrative region and nearly 5. 5 million people in its metropolitan area, making it the fourth largest metropolitan area in Germany. The city and metropolitan area are consistently ranked among the top 5 European metropolitan areas by GDP ; Mercer listed Stuttgart as 21st on its 2015 list of cities by quality of living ; innovation agency 2thinknow ranked the city 24th globally out of 442 cities in its Innovation Cities Index ; and the Globalization and World Cities Research Network ranked the

We've defined a custom tokenizer for a single text sample, but we also need to create a function that can tokenize a list of samples, since this is what we'll need to shape our input into a tensor:

In [None]:
def tokenize_list(texts):
    encodings = [tokenizer_custom_truncation(text) for text in texts]
    return {'input_ids': np.array([e['input_ids'] for e in encodings]),
            'attention_mask': np.array([e['attention_mask'] for e in encodings])}

We can now tokenize the data. We'll first label the human and AI training data and combine them into a single set:

In [None]:
AI_train['label'] = 1
human_train['label'] = 0
full_train = pd.concat([AI_train[['text','label']], human_train[['text','label']]], ignore_index=True)

In [None]:
%%time
full_train_encodings = tokenize_list(full_train['text'].tolist())

CPU times: user 1min 28s, sys: 1.07 s, total: 1min 29s
Wall time: 1min 29s


## Hyperparameter tuning using a validation set

Before we can train a model, we need to determine the optimal number of epochs and the learning rate schedule using a validation set. The human and AI training data have already been labelled and combined into a single set, so all that remains is a train-validation split of the combined data. I'll use a random 80/20 split, stratified by label:

In [None]:
from sklearn.model_selection import train_test_split

train_indices, val_indices = train_test_split(np.arange(len(full_train)), test_size=0.2, random_state=623, stratify=full_train['label'])

train_encodings = {'input_ids': full_train_encodings['input_ids'][train_indices,:],
                   'attention_mask': full_train_encodings['attention_mask'][train_indices,:]}
val_encodings = {'input_ids': full_train_encodings['input_ids'][val_indices,:],
                 'attention_mask': full_train_encodings['attention_mask'][val_indices,:]}
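
The `stratify` argument keeps the proportion of human and AI samples the same in both parts of the split. As an illustration of that idea only, and not of how scikit-learn implements it, here is a hand-rolled stratified split on toy labels; the function name `stratified_split` and the deliberately imbalanced labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction, seed):
    """Split indices so each class contributes the same fraction to validation."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_val = round(len(idx) * test_fraction)
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = [0] * 10 + [1] * 20  # toy labels: 10 "human", 20 "AI"
train_idx, val_idx = stratified_split(labels, test_fraction=0.2, seed=623)

val_labels = [labels[i] for i in val_idx]
print(len(train_idx), len(val_idx))              # 24 6
print(val_labels.count(0), val_labels.count(1))  # 2 4
```

The validation set keeps the 1:2 class ratio of the toy data, which is the property that matters when accuracy is later compared between the training and validation sets.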

Next, we convert the train and val encodings into TensorFlow datasets:

In [None]:
def create_dataset(encodings, labels, batch_size):
    input_ids = tf.convert_to_tensor(encodings['input_ids'], dtype=tf.int32)
    attention_mask = tf.convert_to_tensor(encodings['attention_mask'], dtype=tf.int32)

    return tf.data.Dataset.from_tensor_slices(
        ({
          'input_ids': input_ids,
          'attention_mask': attention_mask
          }, labels)
        ).shuffle(buffer_size=len(encodings['input_ids'])).batch(batch_size).prefetch(tf.data.AUTOTUNE)

In [None]:
batch_size = 8

train_dataset = create_dataset(train_encodings, full_train.iloc[train_indices]['label'].values, batch_size)
val_dataset = create_dataset(val_encodings, full_train.iloc[val_indices]['label'].values, batch_size)

We are now ready to train the model. Because I'm running this notebook in Google Colab with a single GPU, the batch size needs to be small, and the number of batches will be high. I'll need to train with a low learning rate and monitor the validation metrics epoch by epoch. We'll begin at a learning rate of $1\times 10^{-5}$.

In [None]:
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import LearningRateScheduler

def lr_scheduler(epoch, lr):
    return learning_rate

def get_compile_model():
    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)
    model.resize_token_embeddings(len(tokenizer))
    model.compile(optimizer=RMSprop(learning_rate=learning_rate),
                  metrics = ['accuracy'])
    model.config.id2label = {0: 'human', 1: 'AI'}
    return model

def fit_model(model, train, val=None):
    # `train` and `val` are already-batched tf.data datasets, so no batch_size is passed here
    history = model.fit(train,
                        epochs=epochs,
                        callbacks=[LearningRateScheduler(lr_scheduler)],
                        validation_data=val,
                        verbose=1)
    return history

In [None]:
learning_rate = 1e-5
model_for_val = get_compile_model()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [None]:
epochs=1
fit_model(model_for_val, train_dataset, val_dataset)



After one epoch, the validation accuracy is already nearly 99%. We'll halve the learning rate and train for another epoch.

In [None]:
learning_rate /= 2
fit_model(model_for_val, train_dataset, val_dataset)



We can see that both the train and validation loss have improved. We'll halve the learning rate and train for another epoch.

In [None]:
learning_rate /= 2
fit_model(model_for_val, train_dataset, val_dataset)



As the train and validation loss have continued to improve, we'll train again at half the learning rate.

In [None]:
learning_rate /=2
fit_model(model_for_val, train_dataset, val_dataset)



The training loss has improved, but the validation loss has become worse, indicating that the model is starting to overfit to the training data. We'll now train a new model on the full dataset with the same learning rate schedule, saving the model after each epoch so that every checkpoint can be evaluated.

## Training on the full dataset

In [None]:
learning_rate = 1e-5
model_full_train = get_compile_model()

In [None]:
batch_size = 8
full_train_dataset = create_dataset(full_train_encodings, full_train['label'].values, batch_size)

In [None]:
epochs=1
fit_model(model_full_train, full_train_dataset)



In [None]:
model_full_train.save_pretrained('model_epoch_1')

In [None]:
learning_rate /= 2
fit_model(model_full_train, full_train_dataset)



In [None]:
model_full_train.save_pretrained('model_epoch_2')

In [None]:
learning_rate /= 2
fit_model(model_full_train, full_train_dataset)



In [None]:
model_full_train.save_pretrained('model_epoch_3')

In [None]:
learning_rate /= 2
fit_model(model_full_train, full_train_dataset)



In [None]:
model_full_train.save_pretrained('model_epoch_4')