# Reproducing PII Detection Results from Paper

This notebook reproduces the results from the paper "Detecting Personally Identifiable Information Through Natural Language Processing: A Step Forward" (asi-08-00055.pdf).

## Methodology Summary

- **Model**: BERT (Bidirectional Encoder Representations from Transformers)
- **Training Dataset**:
  - 43,000 records from pii-masking-200k (with PII)
  - 43,000 records from Generic Sentiment dataset (without PII)
  - Total: 86,000 balanced records (50% PII, 50% non-PII)
- **Testing Dataset**:
  - pii-masking-43k
  - Dialog Dataset (3,726 records)
  - Movie Review/IMDB (10,000 records)
  - GPT-4 sentences (1,000 records) - Note: This needs to be generated

## Hyperparameters

- Optimizer: adamw_torch
- Learning rate: 5×10⁻⁵
- Weight decay: 0.01
- Batch size: 64
- Epochs: 3
- Train/Val/Test split: 80%/10%/10%
- Cross-validation: 5-fold and stratified 5-fold
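
A note on the stratified variant: stratification keeps the PII / non-PII ratio roughly constant in every fold. The notebook imports `StratifiedKFold` from scikit-learn for this; the block below is only a pure-Python sketch of the idea on hypothetical toy flags, not the paper's procedure.

```python
import random
from collections import defaultdict

def stratified_folds(flags, k=5, seed=42):
    """Assign each record index to one of k folds, keeping the class ratio per fold."""
    by_class = defaultdict(list)
    for idx, flag in enumerate(flags):
        by_class[flag].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical toy data: 50 PII records followed by 50 non-PII records.
has_pii_flags = [True] * 50 + [False] * 50
folds = stratified_folds(has_pii_flags)
fold_sizes = [len(fold) for fold in folds]
pii_per_fold = [sum(has_pii_flags[i] for i in fold) for fold in folds]
print(fold_sizes, pii_per_fold)
```

Each fold then serves once as the held-out set while the other four are used for training.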

## Expected Results

- Accuracy: 99.558%
- Precision: 99.564%
- Recall: 99.558%
- F1-score: 99.559%


## 1. Import Libraries and Set Up Environment


In [None]:
import os
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.model_selection import KFold, StratifiedKFold
import warnings
warnings.filterwarnings('ignore')

# Let PyTorch fall back to CPU for operations the Apple MPS backend does not support
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Check GPU availability and provide information
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("No GPU detected. Training will use CPU (this will be slower).")
    print("Note: For faster training, consider using a GPU-enabled environment.")
    print("      You may want to reduce batch_size if you encounter memory issues.")

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)


Using device: cuda
GPU: Tesla T4
CUDA Version: 12.6
GPU Memory: 14.74 GB


## 2. Load Training Datasets

### 2.1 Load pii-masking-200k Dataset


In [None]:
# Load pii-masking-200k dataset
print("Loading pii-masking-200k dataset...")
pii_200k = load_dataset("ai4privacy/pii-masking-200k")
print(f"Dataset structure: {pii_200k}")

# Get the training split
train_pii = pii_200k['train']
print(f"Total records in pii-masking-200k: {len(train_pii)}")
print(f"Sample record: {train_pii[0]}")

# Extract all records whose language is English ('en') or German ('de')
print("\nFiltering records with language 'en' or 'de'...")
pii_records = []
for example in train_pii:
    # Keep only English and German records
    if example.get('language') in ('en', 'de'):
        bio_labels = example.get('mbert_bio_labels', [])
        pii_records.append({
            'tokens': example.get('mbert_text_tokens', []),
            'labels': bio_labels,
            'source_text': example.get('source_text', ''),
            'has_pii': any(label != 'O' for label in bio_labels)  # True if any token carries a PII tag
        })

print(f"Selected {len(pii_records)} records with language 'en' or 'de'")
print(f"Records with PII in en/de subset: {sum(1 for r in pii_records if r['has_pii'])}")
print(f"Records without PII in en/de subset: {sum(1 for r in pii_records if not r['has_pii'])}")

Loading pii-masking-200k dataset...



Dataset structure: DatasetDict({
    train: Dataset({
        features: ['source_text', 'target_text', 'privacy_mask', 'span_labels', 'mbert_text_tokens', 'mbert_bio_labels', 'id', 'language', 'set'],
        num_rows: 209261
    })
})
Total records in pii-masking-200k: 209261
Sample record: {'source_text': "A student's assessment was found on device bearing IMEI: 06-184755-866851-3. The document falls under the various topics discussed in our Optimization curriculum. Can you please collect it?", 'target_text': "A student's assessment was found on device bearing IMEI: [PHONEIMEI]. The document falls under the various topics discussed in our [JOBAREA] curriculum. Can you please collect it?", 'privacy_mask': [{'value': '06-184755-866851-3', 'start': 57, 'end': 75, 'label': 'PHONEIMEI'}, {'value': 'Optimization', 'start': 138, 'end': 150, 'label': 'JOBAREA'}], 'span_labels': '[[0, 57, "O"], [57, 75, "PHONEIMEI"], [75, 138, "O"], [138, 150, "JOBAREA"], [150, 189, "O"]]', 'mbert_text_tokens
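
The `has_pii` flag above is derived from the BIO tags: a record counts as PII if any tag differs from `O`. The sketch below shows the same check on a hypothetical tag sequence shaped like `mbert_bio_labels`, and also collects the entity types. The helper name `summarize_bio` is an assumption for illustration.

```python
def summarize_bio(labels):
    """Return (has_pii, entity_types) for a sequence of BIO tags."""
    # Strip the "B-" / "I-" prefix to recover the entity type of each tagged token.
    entity_types = sorted({label.split("-", 1)[1] for label in labels if label != "O"})
    return bool(entity_types), entity_types

# Hypothetical tag sequence; real records carry one tag per mBERT subword token.
example_labels = ["O", "O", "B-PHONEIMEI", "I-PHONEIMEI", "O", "B-JOBAREA"]
has_pii, entity_types = summarize_bio(example_labels)
print(has_pii, entity_types)
```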

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


### 2.2 Load Generic Sentiment Dataset (Non-PII)


In [None]:
import sys

print("\n1. Loading mBERT tokenizer...")
try:
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    print("   ✓ mBERT tokenizer loaded successfully!")
    print("   Model: bert-base-multilingual-cased")
    print(f"   Tokenizer type: {type(tokenizer).__name__}")
    print(f"   Vocabulary size: {len(tokenizer.vocab)}")
except Exception as e:
    print(f"   ✗ Error loading mBERT tokenizer: {e}")
    print("   Check that the transformers package is installed and the model name is correct.")
    sys.exit(1)


1. Loading mBERT tokenizer...



   ✓ mBERT tokenizer loaded successfully!
   Model: bert-base-multilingual-cased
   Tokenizer type: BertTokenizerFast
   Vocabulary size: 119547


In [None]:
import sys
# Load Generic Sentiment Dataset (Non-PII)
print("Loading Generic Sentiment Dataset...")

try:
    non_pii_df = pd.read_csv('/content/generic_sentiment_dataset_50k.csv')
    print(f"Loaded {len(non_pii_df)} records from Generic Sentiment Dataset")
    print("Sample record:")
    display(non_pii_df.head())

    # Assuming the non-PII data has a 'text' column
    # Tokenize and format it similar to the PII data
    non_pii_records = []
    for index, row in non_pii_df.iterrows():
        text = str(row.get('text', ''))
        # Tokenize using the mBERT tokenizer
        tokens = tokenizer.tokenize(text)

        # Filter out samples where token length > 512
        if len(tokens) > 512:
            # print(f"Skipping record {index} due to token length > 512: {len(tokens)}")
            continue

        non_pii_records.append({
            'tokens': tokens,
            'labels': ['O'] * len(tokens),  # All labels are 'O' for non-PII
            'source_text': text,
            'has_pii': False
        })

    print(f"Processed {len(non_pii_records)} records for non-PII training data after filtering.")

except FileNotFoundError:
    print("Error: '/content/generic_sentiment_dataset_50k.csv' not found.")
    print("Please upload the dataset file or provide the correct path.")
    non_pii_records = []
except Exception as e:
    print(f"An error occurred while processing the Generic Sentiment Dataset: {e}")
    non_pii_records = []

Loading Generic Sentiment Dataset...
Loaded 50000 records from Generic Sentiment Dataset
Sample record:


Unnamed: 0,sentiment,text,label
0,positive,good mobile. battery is 5000 mah is very big. ...,2
1,positive,Overall in hand ecpirience is quite good matt ...,2
2,positive,"1. Superb Camera,\n2. No lag\n3. This is my fi...",2
3,positive,Bigger size of application names doesn't allow...,2
4,negative,Just a hype of stock android which is not flaw...,0


Token indices sequence length is longer than the specified maximum sequence length for this model (603 > 512). Running this sequence through the model will result in indexing errors


Processed 49839 records for non-PII training data after filtering.
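
Note that the filter above keeps sequences of up to 512 subword tokens, while the preprocessing step in Section 4 truncates to 510 to leave room for `[CLS]` and `[SEP]`. A minimal sketch of a check that accounts for the two special tokens up front (the helper name and constants are assumptions for illustration):

```python
MAX_POSITIONS = 512      # position limit of BERT-base checkpoints
NUM_SPECIAL_TOKENS = 2   # [CLS] and [SEP]

def fits_without_truncation(tokens, max_positions=MAX_POSITIONS):
    """True if the tokens plus [CLS] and [SEP] fit in the model's position limit."""
    return len(tokens) + NUM_SPECIAL_TOKENS <= max_positions

# Hypothetical token lists of different lengths.
short_tokens = ["tok"] * 510
borderline_tokens = ["tok"] * 512
print(fits_without_truncation(short_tokens), fits_without_truncation(borderline_tokens))
```

Records of 511 or 512 tokens pass the notebook's filter but lose their last one or two tokens later.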


In [None]:
import pandas as pd
import csv

# 1. Load the German data
print("Loading German Sentiment Dataset...")
data_frames = []

try:
    # Assumes train_v1.4.tsv is located in /content/
    df = pd.read_csv(r'/content/train_v1.4.tsv',
                     sep='\t',
                     header=None,
                     names=['url', 'text', 'relevance', 'label', 'aspect'],
                     quoting=csv.QUOTE_NONE,
                     on_bad_lines='skip')
    data_frames.append(df)
except FileNotFoundError:
    print("Error: File not found.")

# 2. Concatenate and process
if data_frames:
    full_df = pd.concat(data_frames, ignore_index=True)
    print("-" * 30)
    print(f"Loaded {len(full_df)} records from German Dataset")
    print("Sample raw record:")
    display(full_df.head(1))  # Show one row as a sanity check

    # 3. Tokenize and convert to the non-PII record format
    german_non_pii_records = []

    print("\nProcessing records to Non-PII format...")

    for index, row in full_df.iterrows():
        # Read the text and cast it to a string to avoid errors on NaN values
        text_content = str(row.get('text', ''))
        tokens = tokenizer.tokenize(text_content)

        if len(tokens) > 512:
            # print(f"Skipping record {index} due to token length > 512: {len(tokens)}")
            continue
        # Skip rows that produce no tokens (empty rows)
        if not tokens:
            continue

        german_non_pii_records.append({
            'tokens': tokens,
            'labels': ['O'] * len(tokens),  # Every label is 'O' (Outside)
            'source_text': text_content,
            'has_pii': False
        })

    print(f"Processed {len(german_non_pii_records)} records for German non-PII training data")

    # Inspect the result after processing
    if len(german_non_pii_records) > 0:
        print("\nSample processed record:")
        print(german_non_pii_records[0])

else:
    print("No data to process.")

Loading German Sentiment Dataset...
------------------------------
Loaded 20941 records from German Dataset
Sample raw record:


Unnamed: 0,url,text,relevance,label,aspect
0,http://twitter.com/reneesa\_devin/statuses/628...,"@DB_Bahn ja, weil in Wuppertal Bauarbeiten sin...",True,neutral,Allgemein#Haupt:neutral



Processing records to Non-PII format...
Processed 20037 records for German non-PII training data

Sample processed record:
{'tokens': ['@', 'DB', '_', 'Bahn', 'ja', ',', 'weil', 'in', 'Wuppertal', 'Bau', '##arbeiten', 'sind', ',', 'so', '##weit', 'bin', 'ich', 'auch', ',', 'aber', 'wies', '##o', 'nur', 'am', 'Wochen', '##ende', 'und', 'grade', 'jetzt', '?'], 'labels': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'source_text': '@DB_Bahn ja, weil in Wuppertal Bauarbeiten sind, soweit bin ich auch, aber wieso nur am Wochenende und grade jetzt?', 'has_pii': False}


### 2.3 Combine Training Datasets


In [None]:
# Combine PII and non-PII records (plain concatenation; the classes are not rebalanced here)
all_training_records = pii_records + non_pii_records + german_non_pii_records

print(f"Total training records: {len(all_training_records)}")
print(f"Records with PII: {sum(1 for r in all_training_records if r['has_pii'])}")
print(f"Records without PII: {sum(1 for r in all_training_records if not r['has_pii'])}")

# Shuffle the dataset (seeded so that the order is reproducible)
import random
random.seed(42)
random.shuffle(all_training_records)

# Create a dataset from the records
training_dataset = Dataset.from_list(all_training_records)
print(f"\nTraining dataset created: {training_dataset}")
print(f"Features: {training_dataset.features}")


Total training records: 166194
Records with PII: 96318
Records without PII: 69876

Training dataset created: Dataset({
    features: ['tokens', 'labels', 'source_text', 'has_pii'],
    num_rows: 166194
})
Features: {'tokens': List(Value('string')), 'labels': List(Value('string')), 'source_text': Value('string'), 'has_pii': Value('bool')}
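
The counts printed above (96,318 with PII vs. 69,876 without) are roughly a 58% / 42% split, not the 50% / 50% balance listed in the methodology summary. One possible way to restore the balance is to downsample the larger class. The sketch below shows this on hypothetical toy records; the helper `downsample_to_balance` is an assumption, not part of the notebook or the paper.

```python
import random

def downsample_to_balance(records, seed=42):
    """Randomly drop records from the larger class until both classes have equal size."""
    rng = random.Random(seed)
    positives = [r for r in records if r["has_pii"]]
    negatives = [r for r in records if not r["has_pii"]]
    target = min(len(positives), len(negatives))
    balanced = rng.sample(positives, target) + rng.sample(negatives, target)
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy records: 7 with PII, 4 without.
toy_records = [{"has_pii": True}] * 7 + [{"has_pii": False}] * 4
balanced = downsample_to_balance(toy_records)
pii_count = sum(r["has_pii"] for r in balanced)
print(len(balanced), pii_count)
```

Downsampling discards data; whether that trade-off is worthwhile depends on how closely the run should match the paper's setup.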


In [None]:
# Simple EDA on the training_dataset
print("Training dataset structure:")
print(training_dataset)
print("\nTraining dataset features:")
print(training_dataset.features)
print(f"\nNumber of records in training dataset: {len(training_dataset)}")
print("\nSample record from training dataset:")
print(training_dataset[0])

# You can also convert to pandas DataFrame for more extensive EDA if needed
# try:
#     training_df = training_dataset.to_pandas()
#     print("\nConverted training dataset to pandas DataFrame:")
#     display(training_df.head())
# except Exception as e:
#     print(f"\nCould not convert training dataset to pandas DataFrame: {e}")

Training dataset structure:
Dataset({
    features: ['tokens', 'labels', 'source_text', 'has_pii'],
    num_rows: 166194
})

Training dataset features:
{'tokens': List(Value('string')), 'labels': List(Value('string')), 'source_text': Value('string'), 'has_pii': Value('bool')}

Number of records in training dataset: 166194

Sample record from training dataset:
{'tokens': ['RT', '@', 'LP', '##N', '##ation', '##al', ':', 'And', 'now', '@', 'Lindsey', '##G', '##rah', '##am', '##SC', 'will', 'send', 'more', 'troops', 'to', '#', 'Iraq', 'and', '#', 'Afghanistan', '.', 'Any', 'bet', '##s', 'on', 'what', 'country', 'is', 'next', '?', '#', 'GO', '##PD', '##eba', '##te', '#', 'war', '[UNK]'], 'labels': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'source_text': 'RT @LPNational: And now @LindseyGrahamSC will send more troops to

## 3. Load BERT Model and Tokenizer


In [None]:
# Load BERT model and tokenizer
# bert-base-multilingual-cased is used here because the training data mixes English and German
# For a ~340M-parameter model, bert-large-uncased is an option, but it is English-only
model_name = "google-bert/bert-base-multilingual-cased"

print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Get all unique labels from the training data
all_labels = set()
for record in all_training_records:
    all_labels.update(record['labels'])

label_list = sorted(list(all_labels))
label_to_id = {label: idx for idx, label in enumerate(label_list)}
id_to_label = {idx: label for idx, label in enumerate(label_list)}

print(f"Number of labels: {len(label_list)}")
print(f"Labels: {label_list[:20]}...")  # Show first 20 labels

# Load model for token classification
num_labels = len(label_list)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id_to_label,
    label2id=label_to_id
)

print(f"Model loaded. Number of parameters: {sum(p.numel() for p in model.parameters()):,}")


Loading model: google-bert/bert-base-multilingual-cased



Number of labels: 113
Labels: ['B-ACCOUNTNAME', 'B-ACCOUNTNUMBER', 'B-AGE', 'B-AMOUNT', 'B-BIC', 'B-BITCOINADDRESS', 'B-BUILDINGNUMBER', 'B-CITY', 'B-COMPANYNAME', 'B-COUNTY', 'B-CREDITCARDCVV', 'B-CREDITCARDISSUER', 'B-CREDITCARDNUMBER', 'B-CURRENCY', 'B-CURRENCYCODE', 'B-CURRENCYNAME', 'B-CURRENCYSYMBOL', 'B-DATE', 'B-DOB', 'B-EMAIL']...



Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded. Number of parameters: 177,349,745
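
Because `label_list` is sorted alphabetically, all `B-` tags come first, then the `I-` tags, and `O` ends up with the highest ID. With 113 labels that ID is 112, which is the value that fills the `labels` of non-PII samples after preprocessing. The sketch below reproduces the ordering with a hypothetical subset of three entity types.

```python
# Hypothetical subset of entity types; the real label set is read from the data.
entity_types = ["EMAIL", "CITY", "AGE"]

labels = {"O"}
for entity_type in entity_types:
    labels.update({f"B-{entity_type}", f"I-{entity_type}"})

# Same construction as in the notebook: sort, then enumerate.
label_list = sorted(labels)
label_to_id = {label: idx for idx, label in enumerate(label_list)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

print(label_list)
print(label_to_id["O"])
```

A count of 113 would be consistent with 56 entity types that each have a `B-` and an `I-` tag, plus `O`, although this notebook does not verify that every type has both.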


## 4. Preprocess Data for BERT


In [None]:
def process_data_for_mbert(examples):
    """Convert pre-tokenized mBERT subwords and BIO labels into model inputs."""
    input_ids_list = []
    attention_mask_list = []
    labels_list = []

    # Iterate over each example in the batch
    for tokens, original_labels in zip(examples['tokens'], examples['labels']):
        # 1. Convert tokens to IDs
        # The tokens are already mBERT subwords, so convert_tokens_to_ids is used directly
        ids = tokenizer.convert_tokens_to_ids(tokens)

        # 2. Process the labels
        # Map the string labels (e.g. B-EMAIL) to integer IDs
        label_ids = [label_to_id[l] for l in original_labels]

        # 3. Truncate, then add the special tokens ([CLS] and [SEP])
        # The length limit is 512; subtract 2 for [CLS] and [SEP]
        max_len = 512 - 2
        if len(ids) > max_len:
            ids = ids[:max_len]
            label_ids = label_ids[:max_len]

        # Prepend [CLS] (101) and append [SEP] (102)
        final_input_ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]

        # Label [CLS] and [SEP] with -100 so that PyTorch ignores them in the loss
        final_label_ids = [-100] + label_ids + [-100]

        # Build the attention mask (1 for every real token)
        final_attention_mask = [1] * len(final_input_ids)

        input_ids_list.append(final_input_ids)
        labels_list.append(final_label_ids)
        attention_mask_list.append(final_attention_mask)

    # Return the format the Trainer expects
    return {
        "input_ids": input_ids_list,
        "labels": labels_list,
        "attention_mask": attention_mask_list
    }

# Apply the function to the whole dataset
print("Processing data direct mapping...")
tokenized_dataset = training_dataset.map(
    process_data_for_mbert,
    batched=True,
    remove_columns=training_dataset.column_names  # Drop the original columns to save RAM
)

print(f"Sample processed: {tokenized_dataset[0]}")

Processing data direct mapping...



Sample processed: {'labels': [-100, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, -100], 'input_ids': [101, 56898, 137, 17521, 11537, 11809, 10415, 131, 12689, 11858, 137, 89132, 11447, 23497, 11008, 36175, 11337, 45567, 10798, 20836, 10114, 108, 21455, 10111, 108, 18776, 119, 47336, 13009, 10107, 10135, 12976, 12723, 10124, 13451, 136, 108, 41525, 63450, 101929, 10216, 108, 10338, 100, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
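
The processed samples have different lengths, so they must be padded to a common length when batched. In this notebook that job is left to `DataCollatorForTokenClassification` (Section 7). The block below is only a conceptual sketch of right-padding, not the library's implementation; the pad token ID of 0 and the toy features are assumptions.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every feature to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        batch["attention_mask"].append(f["attention_mask"] + [0] * gap)  # 0 = padding
        batch["labels"].append(f["labels"] + [label_pad_id] * gap)       # -100 = ignored by the loss
    return batch

# Hypothetical features shaped like the output of process_data_for_mbert.
features = [
    {"input_ids": [101, 7, 8, 9, 102], "attention_mask": [1] * 5, "labels": [-100, 112, 3, 112, -100]},
    {"input_ids": [101, 7, 102], "attention_mask": [1] * 3, "labels": [-100, 112, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1], batch["labels"][1])
```

Padding the labels with -100 means padded positions, like `[CLS]` and `[SEP]`, do not contribute to the loss or to the metrics in Section 6.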


## 5. Split Data into Train/Validation/Test Sets


In [None]:
# Split dataset: 80% train, 10% validation, 10% test
train_testvalid = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
test_valid = train_testvalid['test'].train_test_split(test_size=0.5, seed=42)

train_dataset = train_testvalid['train']
val_dataset = test_valid['train']
test_dataset = test_valid['test']

print(f"Training set: {len(train_dataset)} samples")
print(f"Validation set: {len(val_dataset)} samples")
print(f"Test set: {len(test_dataset)} samples")


Training set: 132955 samples
Validation set: 16619 samples
Test set: 16620 samples
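
The printed sizes can be checked by hand. The arithmetic below assumes that each split rounds the held-out portion up to the next whole record, which is consistent with the numbers above; the helper `split_sizes` is for illustration only.

```python
import math

def split_sizes(n, test_fraction):
    """Return (kept, held_out) sizes, rounding the held-out part up."""
    n_held_out = math.ceil(n * test_fraction)
    return n - n_held_out, n_held_out

# First split: 80% train, 20% held out. Second split: held-out half/half.
n_train, n_holdout = split_sizes(166194, 0.2)
n_val, n_test = split_sizes(n_holdout, 0.5)
print(n_train, n_val, n_test)
```

The split is random rather than stratified, so the PII / non-PII ratio may differ slightly between the three sets.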


## 6. Define Evaluation Metrics


In [None]:
def compute_metrics(eval_pred):
    """Compute metrics for evaluation"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # Flatten lists
    true_predictions_flat = [item for sublist in true_predictions for item in sublist]
    true_labels_flat = [item for sublist in true_labels for item in sublist]

    # Calculate metrics
    accuracy = accuracy_score(true_labels_flat, true_predictions_flat)
    precision, recall, f1, _ = precision_recall_fscore_support(
        true_labels_flat, true_predictions_flat, average='weighted', zero_division=0
    )

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
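
These metrics are computed per token, over the flattened label sequences, with `average='weighted'`. One consequence of support-weighted averaging is that weighted recall always equals accuracy, which is why the two columns match in the results. The pure-Python sketch below shows this on hypothetical toy labels; it mirrors the weighted average conceptually and is not scikit-learn's implementation.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall) for flat label lists."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        class_precision = tp / predicted if predicted else 0.0  # like zero_division=0
        class_recall = tp / n
        # Weight each class by its share of the true labels.
        precision += class_precision * n / total
        recall += class_recall * n / total
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return accuracy, precision, recall

# Hypothetical token-level labels.
y_true = ["O", "O", "O", "B-EMAIL", "I-EMAIL", "O"]
y_pred = ["O", "O", "B-EMAIL", "B-EMAIL", "O", "O"]
accuracy, precision, recall = weighted_scores(y_true, y_pred)
print(accuracy, precision, recall)
```

Because `O` tokens dominate the data, weighted token-level scores are driven mostly by the `O` class; a per-class `classification_report` gives a clearer view of the rarer PII tags.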


## 7. Training Configuration


In [None]:
# Training arguments follow the paper, except for the batch size:
# 32 on GPU and 16 on CPU are used here instead of the paper's 64 to reduce memory use
batch_size = 32 if torch.cuda.is_available() else 16  # Smaller batch for CPU
eval_batch_size = 32 if torch.cuda.is_available() else 16

training_args = TrainingArguments(
    output_dir="./pii_detection_model",
    num_train_epochs=3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=eval_batch_size,
    learning_rate=5e-5,
    weight_decay=0.01,
    optim="adamw_torch",
    logging_dir="./logs",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=2,
    seed=42,
    fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
    dataloader_num_workers=0 if not torch.cuda.is_available() else 4,  # Reduce workers for CPU
    report_to="none",  # Disable wandb/tensorboard if not needed
)

# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

print("Training configuration:")
print(f"  Device: {device}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size} (adjusted for {device})")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Weight decay: {training_args.weight_decay}")
print(f"  Optimizer: {training_args.optim}")
print(f"  Mixed precision (fp16): {training_args.fp16}")
if not torch.cuda.is_available():
    print("\n⚠️  Note: Training on CPU will be significantly slower.")
    print("   Estimated time: Several hours to days depending on dataset size.")
    print("   Consider using Google Colab (free GPU) or a cloud GPU instance for faster training.")

Training configuration:
  Device: cuda
  Epochs: 3
  Batch size: 32 (adjusted for cuda)
  Learning rate: 5e-05
  Weight decay: 0.01
  Optimizer: OptimizerNames.ADAMW_TORCH
  Mixed precision (fp16): True


## 8. Train Model with Train-Validation-Test Split


In [None]:
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
print("Starting training...")
trainer.train()

# Evaluate on test set
print("\nEvaluating on test set...")
test_results = trainer.evaluate(test_dataset)
print(f"\nTest Results:")
print(f"  Accuracy: {test_results['eval_accuracy']:.6f} ({test_results['eval_accuracy']*100:.3f}%)")
print(f"  Precision: {test_results['eval_precision']:.6f} ({test_results['eval_precision']*100:.3f}%)")
print(f"  Recall: {test_results['eval_recall']:.6f} ({test_results['eval_recall']*100:.3f}%)")
print(f"  F1-Score: {test_results['eval_f1']:.6f} ({test_results['eval_f1']*100:.3f}%)")

# Save the final model
trainer.save_model("./pii_detection_model_final")
print("\nModel saved to ./pii_detection_model_final")


Starting training...


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0538,0.045839,0.982974,0.981617,0.982974,0.978983
2,0.0394,0.041138,0.983837,0.982877,0.983837,0.982981
3,0.0334,0.038895,0.98523,0.984336,0.98523,0.984111



Evaluating on test set...



Test Results:
  Accuracy: 0.985634 (98.563%)
  Precision: 0.984882 (98.488%)
  Recall: 0.985634 (98.563%)
  F1-Score: 0.984565 (98.456%)

Model saved to ./pii_detection_model_final
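
To use the trained model, the per-token BIO predictions have to be grouped back into entities. The sketch below does this for WordPiece tokens and predicted tags. The helper names, the example tokens, and the tags are hypothetical; it assumes the predictions have already been mapped from IDs to label strings with `id_to_label`.

```python
def join_wordpieces(pieces):
    """Merge WordPiece tokens: '##' pieces attach to the previous piece."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        else:
            text += (" " if text else "") + piece
    return text

def decode_entities(tokens, labels):
    """Group tokens into (entity_type, text) pairs using BIO tags."""
    entities, current_type, pieces = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("I-") and label[2:] == current_type:
            pieces.append(token)  # continue the open entity
            continue
        if current_type is not None:  # close the open entity
            entities.append((current_type, join_wordpieces(pieces)))
            current_type, pieces = None, []
        if label != "O":  # a B- tag (or a stray I- tag) opens a new entity
            current_type, pieces = label[2:], [token]
    if current_type is not None:
        entities.append((current_type, join_wordpieces(pieces)))
    return entities

# Hypothetical tokens and predicted tags.
tokens = ["Contact", "Lindsey", "##G", "##rah", "##am", "in", "Wuppertal"]
labels = ["O", "B-FIRSTNAME", "I-FIRSTNAME", "I-FIRSTNAME", "I-FIRSTNAME", "O", "B-CITY"]
entities = decode_entities(tokens, labels)
print(entities)
```

Spacing between non-`##` pieces is approximated with single spaces, so the recovered text may not match the source string exactly.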


In [None]:
from google.colab import drive
import os

# Mount Google Drive
print("\nMounting Google Drive...")
drive.mount('/content/drive')

# Define the path in Google Drive to save the model
drive_model_path = "/content/drive/MyDrive/pii_detection_model_final_drive"

# Create the directory if it doesn't exist
if not os.path.exists(drive_model_path):
    os.makedirs(drive_model_path)
    print(f"Created directory: {drive_model_path}")
else:
    print(f"Directory already exists: {drive_model_path}")

# Save the final trained model using the existing trainer object
print(f"\nSaving model to {drive_model_path}...")
trainer.save_model(drive_model_path)
print("Model successfully saved to Google Drive.")


Mounting Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Directory already exists: /content/drive/MyDrive/pii_detection_model_final_drive

Saving model to /content/drive/MyDrive/pii_detection_model_final_drive...
Model successfully saved to Google Drive.
