<a href="https://colab.research.google.com/github/tommyliphysics/tommyli-ml/blob/main/ai_detector/notebooks/eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model evaluation

We are now ready to evaluate, on the held-out test data, the two language models (deBERTa and distilBERT) that we fine-tuned in the preceding two notebooks.

Let's load the samples and keep the test split, which is marked by `TTV split == -1`:

In [1]:
import pandas as pd

samples = pd.read_csv('https://raw.githubusercontent.com/tommyliphysics/tommyli-ml/main/ai_detector/notebooks/samples.csv')
test = samples[samples['TTV split']==-1]
test

Unnamed: 0,text,source,topic,TTV split,label
2,It would be unwise to judge that that either n...,imdb,movie review,-1.0,0
4,"I am a fan of Jess Franco's bizarre style, and...",imdb,movie review,-1.0,0
20,"""Lights of New York"" originally started out as...",imdb,movie review,-1.0,0
22,"OK, so my summary line is a cheap trick. But t...",imdb,movie review,-1.0,0
38,"I can't quite say that ""Jerry Springer:Ringmas...",imdb,movie review,-1.0,0
...,...,...,...,...,...
20812,"A standards organization, also known as a stan...",wikipedia by GPT,Standards organization,-1.0,1
20813,The International Electrotechnical Commission ...,wikipedia by GPT,International Electrotechnical Commission,-1.0,1
20814,"Bhutan, officially known as the Kingdom of Bhu...",wikipedia by GPT,Bhutan,-1.0,1
20815,Jigme Khesar Namgyel Wangchuck is a prominent ...,wikipedia by GPT,Jigme Khesar Namgyel Wangchuck,-1.0,1


Let's import the two models, which we pushed to the Hugging Face Hub. We can load them as classification pipelines, which combine tokenisation and inference so that we can obtain the model predictions directly from unprocessed text.

In [2]:
from transformers import pipeline

model_deberta = pipeline('text-classification', model='tommyliphys/ai-detector-deberta', max_length=512, truncation=True)
model_distilbert = pipeline('text-classification', model='tommyliphys/ai-detector-distilbert', max_length=512, truncation=True)

All model checkpoint layers were used when initializing TFDebertaV2ForSequenceClassification.

All the layers of TFDebertaV2ForSequenceClassification were initialized from the model checkpoint at tommyliphys/ai-detector-deberta.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2ForSequenceClassification for predictions without further training.


All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at tommyliphys/ai-detector-distilbert.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


We can now perform inference by directly calling the models on the test samples.

In [3]:
predictions = {}

In [4]:
%%time
predictions['deberta'] = model_deberta(test['text'].tolist(), return_all_scores=True)



CPU times: user 3h 9min 47s, sys: 13min 52s, total: 3h 23min 40s
Wall time: 50min 10s


In [5]:
%%time
predictions['distilbert'] = model_distilbert(test['text'].tolist(), return_all_scores=True)

CPU times: user 1h 41min 3s, sys: 3min 22s, total: 1h 44min 26s
Wall time: 27min 58s


In [6]:
y_probas = {}
y_preds = {}
for model_name in predictions:
    # Score at index 1 is the probability assigned to the positive (AI) class
    y_probas[model_name] = [scores[1]['score'] for scores in predictions[model_name]]
    # Threshold at 0.5 to obtain hard predictions
    y_preds[model_name] = [y_proba >= 0.5 for y_proba in y_probas[model_name]]
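
To make the indexing in the loop above concrete, here is a minimal sketch of the structure the pipeline returns with `return_all_scores=True`: one list per input text, holding one `{'label', 'score'}` dict per class. The scores and the label names `LABEL_0`/`LABEL_1` below are made up for illustration, and the sketch assumes that index 1 corresponds to the AI class, as the loop above does.

```python
# Hypothetical pipeline output for two texts (scores invented for illustration).
mock_predictions = [
    [{'label': 'LABEL_0', 'score': 0.98}, {'label': 'LABEL_1', 'score': 0.02}],
    [{'label': 'LABEL_0', 'score': 0.10}, {'label': 'LABEL_1', 'score': 0.90}],
]

# Probability of the positive (AI) class is the score at index 1.
mock_probas = [scores[1]['score'] for scores in mock_predictions]
# Threshold at 0.5 to obtain hard predictions.
mock_preds = [p >= 0.5 for p in mock_probas]

print(mock_probas)
print(mock_preds)
```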

## Classification reports

We can now compare the model predictions to the labels and generate the confusion matrices and metrics for the two models. I'll print the accuracy, precision, recall, F1 and the areas under the ROC and precision-recall curves.

In [44]:
from sklearn.metrics import (accuracy_score, auc, confusion_matrix, f1_score,
                             precision_recall_curve, precision_score,
                             recall_score, roc_curve)

def classification_report(y_eval, y_pred, y_proba):
    # Rows of the confusion matrix are true labels, columns are predictions
    cm = confusion_matrix(y_eval, y_pred)
    # Use the function arguments rather than the globals y_test and y_probas
    fpr, tpr, _ = roc_curve(y_eval, y_proba)
    roc_auc = auc(fpr, tpr)
    precision, recall, _ = precision_recall_curve(y_eval, y_proba)
    pr_auc = auc(recall, precision)

    metrics = {'accuracy': accuracy_score(y_eval, y_pred),
               'precision': precision_score(y_eval, y_pred),
               'recall': recall_score(y_eval, y_pred),
               'f1': f1_score(y_eval, y_pred),
               'ROC AUC': roc_auc,
               'P-R AUC': pr_auc}
    cm_df = pd.DataFrame(cm, columns=['predicted human', 'predicted AI'], index=['human', 'AI'])
    return cm_df, metrics
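
As a sanity check on what the ROC AUC measures: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The following pure-Python sketch computes it by brute force on made-up toy data. `pairwise_roc_auc` is a hypothetical helper, not part of scikit-learn, and it is not the method used in this notebook.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count one half
    return wins / (len(pos) * len(neg))

# Toy labels and scores, invented for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

The quadratic loop is only meant to expose the definition; on real data the `roc_curve` and `auc` calls above are the practical choice.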

In [58]:
cm_list = {}
metrics_df = []
y_test = test['label']

for model_name in predictions:
    cm, metrics = classification_report(y_test, y_preds[model_name], y_probas[model_name])
    cm_list[model_name] = cm
    metrics['model_name'] = model_name
    metrics_df.append(metrics)
metrics_df = pd.DataFrame(metrics_df).set_index('model_name', drop=True)

Let's look at the confusion matrices:

In [49]:
cm_list['deberta']

,predicted human,predicted AI
human,2384,30
AI,0,2358


In [50]:
cm_list['distilbert']

,predicted human,predicted AI
human,2391,23
AI,0,2358


In [59]:
metrics_df

model_name,accuracy,precision,recall,f1,ROC AUC,P-R AUC
deberta,0.993713,0.987437,1.0,0.993679,0.999895,0.999876
distilbert,0.99518,0.99034,1.0,0.995147,0.99988,0.999862


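We find that both models are exceptionally accurate at distinguishing AI-generated from human-written text, with only 30 and 23 misclassified samples out of 4,772 for deBERTa and distilBERT respectively. In both cases every error is a human-written text flagged as AI: no AI-generated sample was missed, hence the recall of 1.0.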
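
The reported metrics can be recomputed by hand from the confusion matrices, treating AI as the positive class and reading the matrices in scikit-learn's convention (rows are true labels, columns are predictions). A short sketch using the counts above; `metrics_from_counts` is a hypothetical helper introduced here for illustration:

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Counts taken from the two confusion matrices (human = 0, AI = 1).
deberta = metrics_from_counts(tn=2384, fp=30, fn=0, tp=2358)
distilbert = metrics_from_counts(tn=2391, fp=23, fn=0, tp=2358)

for name, m in [('deberta', deberta), ('distilbert', distilbert)]:
    print(name, {k: round(v, 6) for k, v in m.items()})
```

The values agree with the accuracy, precision, recall and F1 columns of the metrics table, which confirms that all misclassifications are false positives.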