<a href="https://colab.research.google.com/github/sharmaratnesh/RatneshTestRepository/blob/master/bert_finetune_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set Up Google Colab Environment

This notebook demonstrates how to fine-tune a BERT model for email classification using Google Colab. You can use Colab for free GPU acceleration and easy access to Google Drive files.

**How to use:**
1. Go to https://colab.research.google.com/
2. Click `File > Upload notebook` and upload this notebook file.
3. Or, create a new notebook and copy-paste the cells from here.

In [2]:
# Import Required Libraries
import numpy as np
import pandas as pd

# Mount Google Drive

To access files from your Google Drive, run the following cell and follow the authentication steps.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
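
Once Drive is mounted, its files appear under `/content/drive/MyDrive`. A minimal sketch of a path helper follows; the helper name `drive_path` and the file name are hypothetical and only illustrate how a Drive path is put together.

```python
from pathlib import Path

# Default location of "My Drive" after drive.mount('/content/drive').
DRIVE_ROOT = Path('/content/drive/MyDrive')

def drive_path(relative, root=DRIVE_ROOT):
    """Return the full path of a file stored under 'My Drive'."""
    return root / relative

# Hypothetical file name, used only for illustration.
example = drive_path('datasets/emails.csv')
print(example)
```

Keeping the root in one constant means later cells can switch between session storage and Drive by changing a single line.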

# Read and Write Files in Colab

You can read CSV files from Google Drive and write output files back to Drive. Adjust the file paths as needed.

In [3]:
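# Example: read a CSV file. This path points at Colab session storage (/content);
# for a file kept on Drive, use a path under /content/drive/MyDrive/ instead.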
train_path = '/content/consolidated_email_thread_labeled_data_for_training.csv'

df = pd.read_csv(train_path)
df.head()

Unnamed: 0,thread_id,text,classification,reasoning
0,1,\r\n\r\n -----Original Message-----\r\nFrom: =...,Confidential,Talks about termination. Need to be kept high ...
1,2,I'll be there... I will attend. Suzanne:\r\nHe...,Internal,Team lunch planning. Internal discussion.
2,3,"Hey there; \r\n""Do you know who your ""big toe""...",Sensitive,talks about work environmnet and facing aggres...
3,4,thanks for the update.\r\nPL that is ok. Than...,Internal,day to day discussion of team.
4,5,I think you can send it just so he has the for...,Confidential,confidential settlement and liquidation damage...
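
Before training, it can help to check how the labels are distributed, since a stratified split needs at least two examples per class. The sketch below is dependency-free and uses only the five labels visible in the preview above; the helper name `class_balance` is an assumption. In the notebook itself, `df['classification'].value_counts()` gives the same counts for the full dataset.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: (count, share)} after normalizing case and whitespace."""
    counts = Counter(label.strip().lower() for label in labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# The five labels shown in the df.head() preview.
preview_labels = ['Confidential', 'Internal', 'Sensitive', 'Internal', 'Confidential']

for label, (count, share) in class_balance(preview_labels).items():
    print(f'{label}: {count} ({share:.0%})')
```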


# Install and Use External Packages

You can install any required package using pip in Colab. For example, to install the latest Hugging Face Transformers and PyTorch:

In [4]:
# Install required packages (run if needed)
!pip install torch transformers pandas scikit-learn sentence-transformers faiss-cpu openpyxl

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0
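
To confirm what ended up installed, the versions can be read from package metadata without importing the packages. This is a minimal sketch using only the standard library; the helper name `installed_versions` is an assumption.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# The distributions installed by the pip command above.
packages = ['torch', 'transformers', 'pandas', 'scikit-learn',
            'sentence-transformers', 'faiss-cpu', 'openpyxl']

for name, version in installed_versions(packages).items():
    print(f'{name}: {version or "not installed"}')
```

Recording these versions alongside results makes a run easier to reproduce later.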


# Use GPU/TPU Acceleration

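Colab provides free access to GPUs and TPUs, subject to availability. The cell below checks whether PyTorch can see a CUDA GPU. If it reports `False`, as in the output here, select a GPU under `Runtime > Change runtime type` and rerun the cell.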

In [5]:
# Check for GPU
import torch
print('CUDA available:', torch.cuda.is_available())
print('Device:', torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

CUDA available: False
Device: cpu
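
The output above comes from a CPU-only session. `Trainer` places the model on the available device automatically, but custom loops need an explicit choice. The following is a defensive sketch; the helper name `pick_device` is an assumption, and the `try`/`except` only exists so the snippet also runs where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # allows the sketch to run outside Colab as well
    torch = None

def pick_device():
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    if torch is not None and torch.cuda.is_available():
        return 'cuda'
    return 'cpu'

device = pick_device()
print('Using device:', device)
```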


# Fine-tune BERT for Email Classification

The following cells walk through label encoding, the train/validation split, model setup, training, and saving the fine-tuned model for the email classification task. The training data used here contains three categories: confidential, internal, and sensitive.

In [6]:
# Label encoding and train/val split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['classification'].str.lower())
label2id = {label: int(idx) for idx, label in enumerate(label_encoder.classes_)}
id2label = {int(idx): label for idx, label in enumerate(label_encoder.classes_)}
print('Label mapping:', label2id)

train_df, val_df = train_test_split(df, test_size=0.1, stratify=df['label'], random_state=42)

Label mapping: {'confidential': 0, 'internal': 1, 'sensitive': 2}
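
`LabelEncoder` assigns ids by sorting the distinct class names, which is why the mapping above is alphabetical. The plain-Python sketch below mirrors that behaviour on a toy list; the `demo_` names are hypothetical and chosen to avoid overwriting the notebook's own variables.

```python
# Toy labels, illustration only.
raw_labels = ['Confidential', 'Internal', 'Sensitive', 'Internal', 'Confidential']

# Lower-case first, as the cell above does with .str.lower().
normalized = [label.lower() for label in raw_labels]

# Sorted distinct classes define the ids.
demo_classes = sorted(set(normalized))
demo_label2id = {label: idx for idx, label in enumerate(demo_classes)}
demo_id2label = {idx: label for label, idx in demo_label2id.items()}
encoded = [demo_label2id[label] for label in normalized]

print(demo_label2id)  # {'confidential': 0, 'internal': 1, 'sensitive': 2}
print(encoded)        # [0, 1, 2, 1, 0]
```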


In [7]:
# Dataset setup (the Trainer builds the DataLoaders internally)
from transformers import BertTokenizer
import torch
from torch.utils.data import Dataset

MAX_LENGTH = 256
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

class EmailDataset(Dataset):
    def __init__(self, df):
        self.texts = df['text'].astype(str).tolist()
        self.labels = df['label'].tolist()
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        enc = tokenizer(
            self.texts[idx],
            truncation=True,
            padding='max_length',
            max_length=MAX_LENGTH,
            return_tensors='pt'
        )
        item = {key: val.squeeze(0) for key, val in enc.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

train_dataset = EmailDataset(train_df)
val_dataset = EmailDataset(val_df)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
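
With `padding='max_length'` and `truncation=True`, every example comes out with exactly `MAX_LENGTH` token ids plus an attention mask that marks real tokens with 1 and padding with 0. The sketch below shows the idea on plain lists. It is a simplification: the real tokenizer also keeps the closing `[SEP]` token when it truncates, and the token ids here are illustrative.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to max_length and build its attention mask."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding

# Illustrative ids: 101 and 102 are [CLS] and [SEP] in bert-base-uncased.
input_ids, attention_mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
print(input_ids)       # [101, 7592, 2088, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

Padding every example to 256 tokens is simple but wasteful for short emails; padding per batch is a common alternative.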



In [9]:
# Model, Trainer, and TrainingArguments setup
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label2id))

training_args = TrainingArguments(
    output_dir='./bert_email_classifier',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    save_total_limit=2,
    report_to="none"
)

def compute_metrics(eval_pred):
    import numpy as np
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    acc = (preds == labels).mean()
    return {"accuracy": acc}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
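
The `compute_metrics` function above takes the highest-scoring class for each row of logits and compares it with the true label. The same calculation in plain Python, on toy values that are not taken from the notebook's data:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Share of rows whose highest-scoring class equals the true label."""
    preds = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(preds, labels))
    return correct / len(labels)

# Toy logits for three examples and three classes.
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 0.9], [0.1, 1.5, 0.0]]
toy_labels = [0, 2, 0]
print(round(accuracy(toy_logits, toy_labels), 3))  # 0.667
```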


In [10]:
# Train the model
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.084944,0.333333
2,No log,1.077914,0.333333
3,1.085300,1.081252,0.333333




TrainOutput(global_step=12, training_loss=1.0909682512283325, metrics={'train_runtime': 582.9531, 'train_samples_per_second': 0.139, 'train_steps_per_second': 0.021, 'total_flos': 10656093417984.0, 'train_loss': 1.0909682512283325, 'epoch': 3.0})
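
The numbers above deserve a sanity check. A validation loss near 1.08 is close to ln 3, the cross-entropy of a uniform guess over three classes, and the accuracy stays at 0.333, so this run shows no sign of learning yet. The sketch below redoes the arithmetic from the reported values. The training-set size is an estimate derived from the rounded throughput figure, not a value stated in the notebook.

```python
import math

num_classes = 3  # from the label mapping printed earlier
chance_loss = math.log(num_classes)
print(f'Cross-entropy of a uniform guess: {chance_loss:.4f}')

# Values reported by TrainOutput and TrainingArguments above.
train_runtime = 582.9531
samples_per_second = 0.139
epochs = 3
batch_size = 8
global_step = 12

# Estimate only: samples_per_second is rounded in the log.
est_train_size = round(train_runtime * samples_per_second / epochs)
steps_per_epoch = math.ceil(est_train_size / batch_size)

print('Estimated training examples:', est_train_size)
print('Implied total steps:', steps_per_epoch * epochs, 'reported:', global_step)
```

If the estimate is right, the 10% validation split holds only about three examples, so accuracy can only move in steps of one third. With roughly four steps per epoch and `logging_steps=10`, no training loss is logged until the third epoch, which would explain the `No log` entries.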

In [12]:
# Save the model, tokenizer, and label mappings.
# Note: /content is Colab session storage and is cleared when the runtime ends;
# point save_dir at a folder under /content/drive/MyDrive/ to keep the files on Drive.
import json

save_dir = '/content/bert_email_classifier'
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
with open(f'{save_dir}/label2id.json', 'w') as f:
    json.dump(label2id, f, indent=2)
with open(f'{save_dir}/id2label.json', 'w') as f:
    json.dump(id2label, f, indent=2)
print(f'Model and label mappings saved to {save_dir}')

Model and label mappings saved to /content/bert_email_classifier
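
One detail to keep in mind when loading the mappings back: JSON object keys are always strings, so the integer keys of `id2label` come back as `'0'`, `'1'`, `'2'` and need converting. A minimal round-trip sketch, using a temporary directory and a `demo_` copy of the mapping so nothing in the notebook is overwritten:

```python
import json
import tempfile
from pathlib import Path

# Copy of the mapping printed earlier, for illustration.
demo_id2label = {0: 'confidential', 1: 'internal', 2: 'sensitive'}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'id2label.json'
    path.write_text(json.dumps(demo_id2label, indent=2))
    raw = json.loads(path.read_text())

# Keys are strings after the round trip; convert them back to int.
restored = {int(idx): label for idx, label in raw.items()}

print(raw)       # {'0': 'confidential', '1': 'internal', '2': 'sensitive'}
print(restored)  # {0: 'confidential', 1: 'internal', 2: 'sensitive'}
```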


# Evaluate on Test Dataset

Now that the model is trained, we can evaluate its performance on the separate test dataset.

In [None]:
# Load the test dataset
test_path = '/content/consolidated_email_thread_labeled_data_for_testing.csv'
test_df = pd.read_csv(test_path)
display(test_df.head())

In [None]:
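# Encode the test labels with the encoder fitted on the training data
# (assumes the test CSV has the same 'text' and 'classification' columns).
test_df['label'] = label_encoder.transform(test_df['classification'].str.lower())

# Create a test dataset object
test_dataset = EmailDataset(test_df)

# Make predictions on the test set
predictions = trainer.predict(test_dataset)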

In [None]:
from sklearn.metrics import classification_report
import numpy as np

# Get predicted labels
preds = np.argmax(predictions.predictions, axis=1)

# Get true labels
true_labels = predictions.label_ids

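# Generate classification report; passing labels keeps target_names aligned
# even if a class is missing from the test set
all_label_ids = list(range(len(label_encoder.classes_)))
report = classification_report(true_labels, preds, labels=all_label_ids, target_names=label_encoder.classes_)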
print(report)
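
To make the report's columns concrete, here is a plain-Python sketch of how per-class precision, recall, and support are derived from true and predicted ids. The helper name `per_class_report` and the toy labels are assumptions for illustration; `classification_report` remains the tool to use in the notebook.

```python
def per_class_report(y_true, y_pred, class_names):
    """Compute precision, recall, and support for each class id."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for idx, name in enumerate(class_names):
        tp = sum(t == idx and p == idx for t, p in pairs)  # true positives
        fp = sum(t != idx and p == idx for t, p in pairs)  # false positives
        fn = sum(t == idx and p != idx for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[name] = {'precision': precision, 'recall': recall, 'support': tp + fn}
    return report

# Toy labels, not taken from the notebook's data.
toy_true = [0, 0, 1, 2, 2, 1]
toy_pred = [0, 1, 1, 2, 0, 1]
names = ['confidential', 'internal', 'sensitive']

for name, scores in per_class_report(toy_true, toy_pred, names).items():
    print(name, scores)
```

Per-class figures matter here because overall accuracy can look acceptable while one category, such as the rarest one, is never predicted.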