In [2]:
!pip install transformers datasets torch scikit-learn pandas accelerate matplotlib seaborn numpy


In [3]:
import kagglehub

# Download the kaggle dataset
path = kagglehub.dataset_download("shanegerami/ai-vs-human-text")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/shanegerami/ai-vs-human-text?dataset_version_number=1...


100%|██████████| 350M/350M [00:09<00:00, 39.1MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/shanegerami/ai-vs-human-text/versions/1


#Fine-Tuning a Transformer Model




###Hugging Face Datasets & Transformers:


First, we load the Kaggle dataset, which contains a large collection of texts, each labeled as either “AI-generated” or “Human-written.”

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv(f"{path}/AI_Human.csv")

# Preview the data
print(df.head())
print(df.shape)

                                                text  generated
0  Cars. Cars have been around since they became ...        0.0
1  Transportation is a large necessity in most co...        0.0
2  "America's love affair with it's vehicles seem...        0.0
3  How often do you ride in a car? Do you drive a...        0.0
4  Cars are a wonderful thing. They are perhaps o...        0.0
(487235, 2)


The output shows that the dataset has two columns, the text and its label (`generated`), and contains 487,235 rows.

Then we load the DataFrame into the Hugging Face Datasets library.

In [5]:
from datasets import Dataset

# Create a Hugging Face Dataset from the full DataFrame
hf_dataset = Dataset.from_pandas(df)

# Preview the dataset
print(hf_dataset)

Dataset({
    features: ['text', 'generated'],
    num_rows: 487235
})


Next, we split the full dataset into two groups: one for training the model and one for testing it later. We use 80 percent of the data for training and 20 percent for testing, and the stratify option makes sure the split keeps the same proportion of AI and human labels. After that, we shrink both the training and test sets to just 3% of their original sizes so the model can be trained and tested more quickly, as the original dataset is too large to train on our devices. We then print the sizes of these smaller sets to see how many rows they have. Finally, we convert both the training and test data from pandas DataFrames into Hugging Face Datasets so they are in the right format for the tools we will use later.


In [6]:

from sklearn.model_selection import train_test_split

# Split the dataset using train_test_split (80-20)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['generated'])

# Take 3% samples from each set
train_df = train_df.sample(frac=0.03, random_state=42)
test_df = test_df.sample(frac=0.03, random_state=42)


print("Train shape (small):", train_df.shape)
print("Test shape (small):", test_df.shape)

# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)


Train shape (small): (11694, 2)
Test shape (small): (2923, 2)
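
Note that `DataFrame.sample` draws rows uniformly at random, so the 3% subsets keep the class proportions of the stratified split only approximately. If exact proportions matter, one option is to sample within each class. The sketch below illustrates the idea on a small synthetic DataFrame; the toy data and the names `toy_df` and `toy_small` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Synthetic stand-in for the real data: 600 human (0.0) and 400 AI (1.0) rows
toy_df = pd.DataFrame({
    "text": [f"example {i}" for i in range(1000)],
    "generated": [0.0] * 600 + [1.0] * 400,
})

# Sample 10% within each class so the 60/40 class ratio is preserved
toy_small = toy_df.groupby("generated", group_keys=False).sample(
    frac=0.1, random_state=42
)

# Count how many rows of each class ended up in the sample
print(toy_small["generated"].value_counts().sort_index())
```

Applied to the notebook, the same pattern would replace the two `sample(frac=0.03, ...)` calls; whether the extra step is worth it depends on how far the plain random sample drifts from the original class ratio.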


###Fine-Tuning:

Next, we load a pre-trained DistilBERT tokenizer, which turns text into tokens the model can understand.

We create a function that takes the text from our dataset and tokenizes it by splitting it into subword units, converting those into IDs, padding shorter texts, and truncating longer ones so every sequence has the same length of 256 tokens. We then apply this function to both the training and test datasets so they are fully tokenized and ready for the model.

After tokenization, we rename the label column from “generated” to “labels” so the Trainer API knows which column contains the correct answers. We also set the dataset format to PyTorch tensors, keeping only the token IDs, attention masks, and labels. Finally, we print the first example from each tokenized dataset to inspect the labels and check that the format is correct. Note that `set_format` does not change the data type, so the labels still appear as floats (for example `tensor(0.)`).
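
To make the padding and truncation step more concrete, here is a toy sketch in plain Python. It is not how the Hugging Face tokenizer is implemented: the helper `pad_or_truncate` is hypothetical, it works on ready-made token IDs, and unlike the real tokenizer it does not keep the closing special token when it truncates. The padding ID of 0 is assumed to match DistilBERT's `[PAD]` token.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration: force a list of token IDs to exactly max_length."""
    # Truncate sequences that are too long
    ids = token_ids[:max_length]
    # Real tokens get attention 1, padding positions get attention 0
    attention_mask = [1] * len(ids)
    # Pad sequences that are too short
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

short = pad_or_truncate([101, 2004, 2974, 102], max_length=6)
long = pad_or_truncate([101, 2004, 2974, 7457, 1010, 1996, 2801, 102], max_length=6)
print(short)
print(long)
```

Both results have exactly 6 entries, which mirrors how every sequence in the notebook ends up with 256 token IDs and a matching attention mask.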

In [7]:
from transformers import AutoTokenizer
from datasets import Dataset

# Load the tokenizer for a pre-trained model (DistilBERT)
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a function that tokenizes the text using the loaded tokenizer.
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)

# Apply this tokenization function to train and test datasets
train_tokenized = train_dataset.map(tokenize_function, batched=True)
test_tokenized = test_dataset.map(tokenize_function, batched=True)

# Rename column (generated to labels)
train_tokenized = train_tokenized.rename_column("generated", "labels")
test_tokenized = test_tokenized.rename_column("generated", "labels")

# Return PyTorch tensors and keep only the columns the model needs (the label dtype is unchanged)
train_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# show examples of tokenization
print(train_tokenized[0])
print(test_tokenized[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



{'labels': tensor(0.), 'input_ids': tensor([  101,  2004,  2974,  7457,  1010,  1996,  2801,  1997,  4062,  3238,
         3765,  2003,  3225,  2000,  2468,  1037,  4507,  1012,  1000, 10793,
        17770,  1010, 20075,  1010,  1998, 16509,  2933,  2000,  2031,  3765,
         2008,  2064,  3298,  3209,  2011, 12609,  1012,  1000,  2004,  1996,
         6019,  2163,  2008, 17856,  3238,  3765,  2024,  2006,  1996,  2126,
         1998,  2024,  2525,  1999,  5082,  1012,  2348,  1996,  2245,  1997,
         2025,  2383,  2000,  2031,  2115,  4952,  2015,  4165,  2307,  1010,
         1045, 20704,  2114,  1996,  4503,  2724,  1997,  2122,  3765,  1012,
         1996,  4503,  2724,  1997, 17856,  3238,  3765,  2052,  3426,  4736,
         2247,  8712,  1998,  1996,  2111,  1012,  2036,  1010,  2045,  2071,
         2022,  6228,  3471,  2008, 11821,  2247,  2007,  1996,  4503,  2724,
         1997,  4062,  3238,  3765,  4786,  3808,  3314,  1012, 22267,  1010,
        29536,  3367,  2111,

Next, we load a pre-trained DistilBERT model that is designed for sequence classification tasks. We set num_labels=2 because our task has two possible classes: AI-generated and human-written. The model is loaded with weights that have already been trained on a large general text dataset, but the final classification layers are freshly initialized so they can learn our specific classification task. Once the model is loaded, we move it to the GPU (if available).

With integer class labels of shape [batch_size], the model uses a cross-entropy loss that works directly with predictions of shape [batch_size, 2]. In our dataset, however, the `generated` column is stored as floats (0.0 and 1.0, visible as `tensor(0.)` in the output above). When the labels are floats and num_labels is greater than 1, the Transformers classification head treats the task as multi-label classification and uses a binary cross-entropy loss, which requires the labels to have the same shape as the model's outputs. Passing float labels of shape [batch_size] therefore causes this error message:

```ValueError: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2]))```

This means the model’s output for each batch has two values per example (one for each class), but the labels only have one value per example, so the shapes do not match for the loss function being used.

To fix this problem, we use one-hot encoding. This takes each label and turns it into a vector where the correct class position is 1 and the other is 0. For example, the label 0 becomes [1, 0] and the label 1 becomes [0, 1]. Now the labels have the same shape as the model’s outputs ([batch_size, 2]), which allows the loss function to compare them directly without errors. (Casting the labels to integers would be an alternative fix that keeps the standard cross-entropy loss.) We apply this one-hot encoding to both the training and test datasets and then set their format for PyTorch so that the Trainer can use them in model training and evaluation.
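
As a small, self-contained illustration of the encoding described above, the sketch below one-hot encodes a handful of made-up float labels with NumPy and then recovers the class indices again. It uses `np.eye` as a compact shortcut rather than the per-example function used in the notebook; the label values are invented for the example.

```python
import numpy as np

# Labels as they come out of the CSV: floats, not integers
labels = np.array([0.0, 1.0, 1.0, 0.0])

# One-hot encode: row i has a 1 in the column of its class
one_hot = np.eye(2)[labels.astype(int)]
print(one_hot.shape)  # same shape as a batch of 2-class model outputs

# argmax recovers the original class index from each one-hot row
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

The `argmax` step is the same operation the metric function later uses to turn one-hot labels back into single class indices.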



In [8]:
from transformers import AutoModelForSequenceClassification
import torch
import numpy as np

#Load the pre-trained model architecture suitable for sequence classification
model_Seq_class = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Move the model to gpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_Seq_class.to(device)

# One-hot encode the labels (human-written becomes [1, 0], AI-generated becomes [0, 1])
def one_hot_encode(example):
    label = example['labels']
    one_hot = np.zeros(2)
    one_hot[int(label)] = 1
    example['labels'] = one_hot
    return example

# Apply one-hot encoding to the tokenized datasets
train_tokenized_one_hot = train_tokenized.map(one_hot_encode)
test_tokenized_one_hot = test_tokenized.map(one_hot_encode)

# Set the format for PyTorch
train_tokenized_one_hot.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_tokenized_one_hot.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Now, we set up the training arguments that will control how the model is fine-tuned. We use the TrainingArguments class from the Transformers library to define all the important settings in one place.

1. We specify the output_dir where the model files and results will be saved after training.
2. We set eval_strategy to "epoch" so that the model is evaluated once after each full pass through the training data. The number of training epochs is set to 1, meaning the model sees the entire training dataset only once in this run.
3. We define the batch sizes for both training and evaluation as 8, which means the model processes 8 examples at a time.
4. We set logging_strategy to "steps" so that training information is logged at a fixed interval, and logging_steps to 100 so this happens every 100 batches.
5. We choose a learning rate of 2e-5, which controls how large each update to the model’s weights is during training.
6. We set a weight decay value of 0.01, which helps prevent overfitting by slightly penalizing large weight values.
7. We set report_to="none" to make sure no training logs are sent to external tracking tools like Weights & Biases.

We experiment with these arguments by increasing learning rate, number of epochs and batch size and examine how the output differs. The results of these experiments are shown in the report.
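
To see how these settings translate into training steps, the short calculation below works out the number of optimizer steps for one epoch. It assumes a single device and no gradient accumulation, so that each step consumes exactly one batch; the variable names are chosen for this example only.

```python
import math

train_rows = 11694     # size of the reduced training set printed earlier
batch_size = 8         # per_device_train_batch_size
num_epochs = 1         # num_train_epochs
logging_steps = 100    # logging interval in steps

# Assuming one device and no gradient accumulation,
# each optimizer step consumes one batch
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

# Number of times the training loss is logged during the run
log_events = total_steps // logging_steps

print(total_steps)  # 1462
print(log_events)   # 14
```

The total of 1462 steps is the `global_step` value that `trainer.train()` reports further down in the notebook.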

In [9]:
from transformers import TrainingArguments, Trainer


# Instantiate TrainingArguments from the transformers library
training_args = TrainingArguments(
    output_dir="distilbert-finetune-ai-human",
    eval_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_strategy="steps",
    logging_steps=100,
    logging_dir="logs",
    learning_rate=2e-5,
    weight_decay=0.01,
    # Do not send training logs to external tracking tools such as Weights & Biases
    report_to="none"
)


Now, we create a function `compute_metrics` that measures how well the model is performing during training and evaluation. The function receives the model’s raw predictions (logits) and the true labels. Since our labels are one-hot encoded, we first convert them back to single integers by finding the position of the 1 in each label vector. We then take the logits and find the index of the highest value for each prediction, which gives us the predicted class. Using these classes, we calculate the accuracy and the weighted F1-score.

After that, we set up the Hugging Face Trainer, which will handle the entire training and evaluation process. We pass in the pre-trained model, the training arguments, the training and evaluation datasets in their one-hot encoded form, the tokenizer, and the metric function. This Trainer will feed the data into the model, update the model’s weights, evaluate its performance, and log the results.
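
The metric logic can be tried out on a few made-up values without training anything. The sketch below uses invented logits and one-hot labels, and computes the weighted F1 by hand (per-class F1 weighted by the number of true examples of each class) instead of calling scikit-learn, purely to show what "weighted" means; the helper `f1_for_class` is hypothetical.

```python
import numpy as np

# Made-up logits for four examples (two scores per example)
logits = np.array([[2.1, -1.3], [-0.4, 1.7], [0.9, 0.2], [-2.0, 3.1]])
# Matching one-hot labels: classes 0, 1, 1, 1
labels = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])

true_classes = np.argmax(labels, axis=1)  # position of the 1 in each row
pred_classes = np.argmax(logits, axis=1)  # index of the largest logit

accuracy = float(np.mean(true_classes == pred_classes))

def f1_for_class(c):
    """F1 for class c, computed from true/false positives and false negatives."""
    tp = np.sum((pred_classes == c) & (true_classes == c))
    fp = np.sum((pred_classes == c) & (true_classes != c))
    fn = np.sum((pred_classes != c) & (true_classes == c))
    return 2 * tp / (2 * tp + fp + fn)

# Weight each class's F1 by its number of true examples (its support)
support = np.bincount(true_classes, minlength=2)
weighted_f1 = float(sum(f1_for_class(c) * support[c] for c in range(2)) / support.sum())

print(pred_classes, accuracy, round(weighted_f1, 4))
```

Here the third example is misclassified, so the accuracy is 0.75 while the weighted F1 is slightly higher because the larger class has the better per-class score.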

In [10]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

#Define a function compute_metrics(eval_pred) that takes evaluation predictions, calculates desired metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # If labels are one-hot encoded, get the index of the correct label
    if labels.ndim > 1:
        labels = np.argmax(labels, axis=1)
    preds = np.argmax(logits, axis=1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')
    return {"accuracy": acc, "f1": f1}

#Instantiate the Trainer class, passing the model, training arguments, training dataset, evaluation dataset, tokenizer, and the compute_metrics function.
trainer = Trainer(
    model=model_Seq_class,
    args=training_args,
    train_dataset=train_tokenized_one_hot,
    eval_dataset=test_tokenized_one_hot,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [11]:
# Fine-tune the model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.0531,0.08619,0.981868,0.981922


TrainOutput(global_step=1462, training_loss=0.106790119417715, metrics={'train_runtime': 347.3376, 'train_samples_per_second': 33.668, 'train_steps_per_second': 4.209, 'total_flos': 774536879941632.0, 'train_loss': 0.106790119417715, 'epoch': 1.0})

The result shows that the last logged training loss was about 0.0531 (the average over the epoch was about 0.107) and the validation loss was about 0.0862, which means that the model made few mistakes on both the training and test data. The accuracy and F1 score are both very high at around 0.982, which means the model correctly classified almost all examples and did well in balancing precision and recall.

###Initial Evaluation & Comparison:

We then evaluate the trained model on the test set. As the results show, the accuracy and F1 score are both very high for this model (about 98%).

In [12]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
#After training, evaluate the model on the test set by calling trainer.evaluate().
eval_results = trainer.evaluate()
print(eval_results)


{'eval_loss': 0.08619033545255661, 'eval_accuracy': 0.9818679438932604, 'eval_f1': 0.981921793261991, 'eval_runtime': 22.2308, 'eval_samples_per_second': 131.484, 'eval_steps_per_second': 16.464, 'epoch': 1.0}


In [13]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Run the model on the test set
preds_output = trainer.predict(test_tokenized_one_hot)

# Get predicted class labels from logits
pred_labels = np.argmax(preds_output.predictions, axis=1)

# Get ground-truth class labels from the prediction output
true_labels = preds_output.label_ids

# If labels are one-hot encoded, convert them back to integers
if len(true_labels.shape) > 1:
    true_labels = np.argmax(true_labels, axis=1)

# Classification Report
target_names = ["Human-Written", "AI-Generated"]
print("\nClassification Report:")
print(classification_report(true_labels, pred_labels, target_names=target_names))

# Accuracy (optional if not included in report)
acc = accuracy_score(true_labels, pred_labels)
print("Accuracy:", round(acc, 4))


Classification Report:
               precision    recall  f1-score   support

Human-Written       0.99      0.98      0.99      1839
 AI-Generated       0.96      0.99      0.98      1084

     accuracy                           0.98      2923
    macro avg       0.98      0.98      0.98      2923
 weighted avg       0.98      0.98      0.98      2923

Accuracy: 0.9819


The classification report shows that our model performs extremely well at telling human-written text apart from AI-generated text. For human-written text, the precision is 0.99, meaning that almost every time the model predicts something is human-written it is correct, and it also captures almost all real human-written examples with a recall of 0.98. For AI-generated text, the precision is slightly lower at 0.96, but the recall is very high at 0.99, meaning it catches almost all AI-written examples with very few misses. The F1-scores, which balance precision and recall, are 0.99 for human-written and 0.98 for AI-generated, showing strong and consistent performance across both classes. Overall, the accuracy is about 98.19%, so almost all predictions are correct, confirming that our fine-tuned DistilBERT model has learned the task very effectively.

In [14]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Generate the confusion matrix
cm = confusion_matrix(true_labels, pred_labels)
print("\nConfusion Matrix:\n", cm)



Confusion Matrix:
 [[1796   43]
 [  10 1074]]


In this confusion matrix, rows are true labels and columns are predicted labels, and we treat AI-generated as the positive class. Out of 1,839 human-written examples, the model produced 1,796 true negatives and 43 false positives, which means it correctly identified most human-written texts while misclassifying 43 of them as AI-generated. Out of 1,084 AI-generated examples, it produced 1,074 true positives and 10 false negatives, which means it detected nearly all AI-generated texts and mistook only 10 of them for human-written. These results indicate that the model achieves high true positive and true negative counts while keeping false negatives very low; however, the false positives are relatively higher, which is not ideal for this task. A possible reason is that some human-written texts share stylistic patterns, vocabulary, or sentence structures with AI-generated content, causing the model to mistake them for AI text. This might also happen if the training data contains overlapping features between the two classes, leading the model to overfit on certain cues that are not exclusive to one type of writing.
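
The per-class scores in the classification report can be re-derived from the four counts in the confusion matrix printed above. The short calculation below does this by hand, treating AI-generated as the positive class; the variable names are chosen for this example.

```python
# Counts copied from the confusion matrix printed above
# (rows: true class, columns: predicted class; order: Human, AI)
cm = [[1796, 43],
      [10, 1074]]

tn, fp = cm[0]  # human texts predicted Human / predicted AI
fn, tp = cm[1]  # AI texts predicted Human / predicted AI

precision_ai = tp / (tp + fp)
recall_ai = tp / (tp + fn)
precision_human = tn / (tn + fn)
recall_human = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_human, 2), round(recall_human, 2))  # 0.99 0.98
print(round(precision_ai, 2), round(recall_ai, 2))        # 0.96 0.99
print(round(accuracy, 4))                                 # 0.9819
```

These values agree with the classification report and the accuracy printed earlier, which is a quick way to confirm that the matrix and the report describe the same predictions.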

##Example used in baseline model

We check the example used in the baseline model (a human-written text that the baseline misclassified as AI).

In [25]:
from datasets import Dataset
import numpy as np
import torch

# Example Human-written text
text = """
Dear Senator,

Retain the Electoral College. The Electoral College consists of 538 electors, and a majority of 270 electors is required to elect the President. Each state has its own electors, which are chosen by the candidate’s political party. You should keep the Electoral College because it provides certainty of outcome, and the President represents everyone, not just one group.

The first reason why you should stay with the Electoral College is because you are certain that the outcome will be in favor of one of the candidates. A tie in the nationwide electoral vote may happen, but it is very unlikely, even though the 538 electors in the Electoral College is an even number. For example, in the 2012 election, Obama received 61.7 percent of the electoral votes compared to 51.3 percent of the popular vote cast for him. This is because all states award electoral votes on a winner-take-all basis — even a slight plurality in a state creates a landslide electoral vote victory in that state. However, because of the winner-take-all system in each state, candidates don’t spend time in states they know they have no chance of winning; they only focus on the close, tight races in the “swing” states. But the winning candidate’s share of the Electoral College invariably exceeds his share of the popular vote.

The second reason you should keep the Electoral College is because the President is everyone’s President. The Electoral College requires a presidential candidate to have transregional appeal. No region has enough electoral votes to elect a president by itself. For example, a solid regional favorite, such as Rodney was in the South, has no incentive to campaign heavily in those states, for he gains no additional electoral votes by increasing his plurality in states he knows for sure he will win. A president with only regional appeal is very unlikely to be a successful president. The residents of other regions may feel like their votes don’t count or that he really isn’t their president.

In conclusion, you should stay with the Electoral College simply because it is very unlikely that there will be a tie, and because the President is everyone’s.
"""

single_dataset = Dataset.from_dict({"text": [text]})

# Tokenize
single_tokenized = single_dataset.map(tokenize_function, batched=True)
single_tokenized.set_format("torch", columns=["input_ids", "attention_mask"])

# Predict
pred_output = trainer.predict(single_tokenized)

# Get predicted label ID
pred_label_id = np.argmax(pred_output.predictions, axis=1)[0]

# Get probabilities with softmax
probs = torch.softmax(torch.tensor(pred_output.predictions), dim=1).numpy()[0]

# Disable scientific notation and round to 2 decimal places
np.set_printoptions(suppress=True)
probs = np.round(probs, 2)

# Map labels
label_map = {0: "Human-Written", 1: "AI-Generated"}

print("Predicted label:", label_map[pred_label_id])
print("Prediction confidence (class probabilities):", probs.tolist())


Predicted label: Human-Written
Prediction confidence (class probabilities): [1.0, 0.0]


As seen from the output, even though the baseline model misclassified this human-written text as AI-generated, the fine-tuned model correctly identifies it as human-written.
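
The probabilities of exactly 1.0 and 0.0 are partly an effect of rounding to two decimals. The sketch below applies a softmax to a pair of hypothetical logits (the real logits for this example are not shown in the notebook) to illustrate how a large gap between the two scores produces values that round to 1.0 and 0.0 without being exactly that.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one example; chosen only to show the rounding effect
probs = softmax([6.0, -6.0])
print(probs)
print([round(p, 2) for p in probs])  # [1.0, 0.0]
```

Printing more decimal places in the notebook cell would show how close to 1 the model's score for this text actually is.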

Next, for error analysis, we examine the false positives and false negatives of our model.

In [30]:
import pandas as pd
import numpy as np

# Get predictions from trainer
preds_output = trainer.predict(test_tokenized_one_hot)
pred_labels = np.argmax(preds_output.predictions, axis=1)

# Convert one-hot labels back to integers
true_labels = preds_output.label_ids
if len(true_labels.shape) > 1:
    true_labels = np.argmax(true_labels, axis=1)

# Extract the original text from test_df
test_texts = test_df.reset_index(drop=True)["text"]

# Create a DataFrame for analysis
results_df = pd.DataFrame({
    "text": test_texts,
    "true_label": true_labels,
    "pred_label": pred_labels
})

# Map labels to names
label_map = {0: "Human", 1: "AI"}
results_df["true_name"] = results_df["true_label"].map(label_map)
results_df["pred_name"] = results_df["pred_label"].map(label_map)

# Note: this cell treats Human as the positive class, so a
# "false positive" here is an AI text predicted as Human
false_positives = results_df[(results_df["true_label"] == 1) & (results_df["pred_label"] == 0)]

# A "false negative" here is a Human text predicted as AI
false_negatives = results_df[(results_df["true_label"] == 0) & (results_df["pred_label"] == 1)]

print("False Positives (AI predicted as Human):", len(false_positives))
for i, row in false_positives.head(2).iterrows():
    print("\n--- False Positive Example ---")
    print(f"True Label: {row['true_name']}, Predicted: {row['pred_name']}")
    print(f"Text:\n{row['text']}")

# Show only 2 full examples of false negatives
print("\nFalse Negatives (Human predicted as AI):", len(false_negatives))
for i, row in false_negatives.head(2).iterrows():
    print("\n--- False Negative Example ---")
    print(f"True Label: {row['true_name']}, Predicted: {row['pred_name']}")
    print(f"Text:\n{row['text']}")


False Positives (AI predicted as Human): 10

--- False Positive Example ---
True Label: AI, Predicted: Human
Text:
 Albert Schweitzer once said, "Example is Nat the main thing in influencing either; it is the any thing." However, it is a glad example by your awn behavior the best way ta influence either people. People may fallow every step you the, because they want ta success in life by being influenced by either. It is the main thing ta Shaw people that they can the what they think they are Nat able ta the.

Firstly, people may fallow every step you the, because they want ta success in life by being influenced by either. Far instance, my friend John has a daughter whey wants ta became a DATAR. John's wife, whose name is Sarah, is a DATAR herself. Although their daughter, Sarah, is taking any biology classes in college, which it has nothing ta the with medicine. John and Sarah want ta influence Sarah ta became a DATAR. But Sarah is being influenced by her English teacher. Sarah has a 