# 🤖 Text Classification with XLNet

This notebook demonstrates fine-tuning **XLNet** for multi-class
text classification on an emotion-labeled dataset.

The workflow covers:
- Text preprocessing and dataset balancing  
- Tokenization using an XLNet tokenizer  
- Fine-tuning a pre-trained XLNet model  
- Model evaluation and inference on unseen text  

The implementation highlights practical understanding of
Transformer-based sequence classification.


In [None]:
import pandas as pd
import numpy as np
from cleantext import clean
import re
from transformers import XLNetTokenizer, XLNetForSequenceClassification, TrainingArguments, Trainer, pipeline
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from emoji import replace_emoji
import datasets
import evaluate
import random

---

## Model Background

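XLNet is a Transformer-based language model pre-trained with
**permutation language modeling**: tokens are predicted autoregressively
over sampled orderings of the sequence, so each token can be
conditioned on context from both sides.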

### Key characteristics
- Two variants: **Base** and **Large**  
- XLNet Base: ~110M parameters  
- XLNet Large: ~340M parameters  
- Built on a Transformer-XL architecture  

### Architectural comparison
- **GPT**: autoregressive (left-to-right prediction)  
- **BERT**: masked language modeling (bidirectional context)  
- **XLNet**: permutation-based modeling (captures both left and right context)
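
To make the comparison concrete, here is a minimal sketch of how a sampled factorization order decides which positions a token may attend to. It is a simplified illustration only: the function name `permutation_attention_mask` and the example order are assumptions for this sketch, and XLNet's two-stream attention is not modeled.

```python
def permutation_attention_mask(order):
    """Return mask[i][j] == True when position i may attend to position j.

    `order` is a factorization order: a permutation of range(len(order)).
    A position may only see positions that come earlier in that order.
    """
    # rank[pos] = step at which `pos` is predicted in the factorization order
    rank = {pos: step for step, pos in enumerate(order)}
    n = len(order)
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


tokens = ["New", "York", "is", "big"]
order = [2, 0, 3, 1]  # hypothetical sampled order: "is", "New", "big", "York"
mask = permutation_attention_mask(order)

for i, tok in enumerate(tokens):
    visible = [tokens[j] for j in range(len(tokens)) if mask[i][j]]
    print(f"{tok!r} is predicted from {visible}")
```

In this order, "York" is predicted last, so it sees "New" on its left as well as "is" and "big" on its right, while the prediction itself remains autoregressive.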


---

## Dataset Loading

Emotion-labeled text data is loaded from separate
training, validation, and test files.

The datasets are combined to enable:
- Unified preprocessing  
- Class balancing  
- Controlled train/validation/test splits


In [None]:
data_train = pd.read_csv("../../data/emotions_data/emotion-labels-train.csv")
data_test = pd.read_csv("../../data/emotions_data/emotion-labels-test.csv")
data_val = pd.read_csv("../../data/emotions_data/emotion-labels-val.csv")

---

## Text Preprocessing and Class Balancing

The raw text is cleaned to reduce noise and improve model performance.

### Preprocessing steps
- Emoji removal  
- Removal of user mentions  
- Basic text normalization  
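
As a small standalone illustration of the mention-removal and normalization steps, the sketch below uses only the standard library. The helper name `strip_mentions` and the whitespace-collapsing step are assumptions for illustration. Emoji removal is left to the `emoji` package used in the cells below.

```python
import re

MENTION = re.compile(r"@\S+")    # "@" followed by one or more non-whitespace characters
WHITESPACE = re.compile(r"\s+")  # any run of whitespace


def strip_mentions(text):
    """Remove @mentions, then collapse the leftover whitespace."""
    without_mentions = MENTION.sub("", text)
    return WHITESPACE.sub(" ", without_mentions).strip()


print(strip_mentions("@user so   angry at @airline today"))
```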

### Class balancing
- Group samples by label  
- Downsample each class to the minimum class size
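
The balancing idea can be sketched without pandas as follows. This is a minimal illustration on made-up rows, not the notebook's implementation. The function name `downsample_to_smallest_class` and the `seed` parameter are assumptions.

```python
import random
from collections import defaultdict


def downsample_to_smallest_class(rows, label_key="label", seed=0):
    """Keep an equal-sized random sample of every class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    # Every class is cut down to the size of the rarest class.
    target = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Toy data: 5 "joy", 2 "fear", 3 "anger" rows.
rows = (
    [{"label": "joy", "text": f"joy {i}"} for i in range(5)]
    + [{"label": "fear", "text": f"fear {i}"} for i in range(2)]
    + [{"label": "anger", "text": f"anger {i}"} for i in range(3)]
)
balanced = downsample_to_smallest_class(rows)
print(len(balanced))
```

Downsampling discards data from the larger classes, which is a reasonable trade-off when the classes are only mildly imbalanced.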


In [None]:
data_train.head()

In [None]:
data =  pd.concat([data_train, data_test, data_val], ignore_index=True)  #combine all data

In [None]:
data['text_clean'] = data['text'].apply(lambda x: replace_emoji(x, replace=''))  #remove emojis

In [None]:
data['text_clean'].head()    

In [None]:
data['text_clean'] = data['text_clean'].apply(lambda x: re.sub(r'@\S+', '', x))  #remove user mentions (punctuation is left untouched)


In [None]:
data['text_clean'].head()

In [None]:
data['label'].value_counts().plot(kind='bar')   #visualize label distribution

In [None]:
g =  data.groupby('label')  #group by label

In [None]:
data = g.sample(n=g.size().min()).reset_index(drop=True)  #balance the dataset: downsample every class to the smallest class size

In [None]:
data['label'].value_counts().plot(kind='bar')

---

## Label Encoding and Dataset Splits

Emotion labels are encoded into integer values for model compatibility.

The dataset is split into:
- Training set  
- Validation set  
- Test set  

This separation enables reliable evaluation during fine-tuning.
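
The sketch below illustrates both steps with the standard library only. `LabelEncoder` assigns integer ids in sorted order of the class names, which the dictionary comprehension mirrors. The helper `three_way_split` and its fractions are illustrative assumptions, not the notebook's exact settings.

```python
import random

labels = ["joy", "anger", "sadness", "fear", "joy", "anger"]

# Sorted class names -> consecutive integer ids, as LabelEncoder does.
classes = sorted(set(labels))
label2id = {name: idx for idx, name in enumerate(classes)}
print(label2id)


def three_way_split(items, test_frac=0.2, val_frac=0.1, seed=0):
    """Hold out a test set, then carve a validation set out of the remainder."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    test, rest = items[:n_test], items[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

With 100 items and these fractions, the validation fraction applies to the 80 items left after the test split, so the validation set is a share of the remainder rather than of the full dataset.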



In [None]:
data['label_int'] = LabelEncoder().fit_transform(data['label'])  #encode labels to integers

In [None]:
NUM_LABEL = 4  #number of unique labels

In [None]:
train_split, test_split = train_test_split(data, train_size=0.8)    #hold out 20% of the data as the test set
train_split, val_split = train_test_split(train_split, test_size=0.1)  #hold out 10% of the remaining data as the validation set

In [None]:
print(len(train_split))
print(len(test_split))
print(len(val_split))

---

## Dataset Preparation for Training

The processed data is converted into Hugging Face `Dataset`
objects for efficient batching and integration with
the Transformers training pipeline.


In [None]:
train_df =pd.DataFrame({
    "label" : train_split.label_int.values,
    "text" : train_split.text_clean.values
})    #train dataframe  
test_df =pd.DataFrame({
    "label" : test_split.label_int.values,
    "text" : test_split.text_clean.values
})    #test dataframe  

In [None]:
train_df = datasets.Dataset.from_pandas(train_df)  #convert the pandas DataFrame to a Hugging Face Dataset
test_df = datasets.Dataset.from_pandas(test_df)

In [None]:
dataset_dict = datasets.DatasetDict({"train": train_df, "test": test_df})
dataset_dict

---

## Tokenization with XLNet

Text samples are tokenized using the XLNet tokenizer.

Tokenization includes:
- Padding to a fixed maximum length  
- Truncation of long sequences  
- Generation of input IDs, token type IDs, and attention masks
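
To show what padding and truncation do to a single example, here is a toy sketch that works on plain lists of ids. The function `pad_or_truncate` and the `pad_id=0` value are assumptions for illustration. The real tokenizer also appends special tokens and uses its own pad id. Left padding is used here because, to my knowledge, the XLNet tokenizer pads on the left by default.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0, pad_left=True):
    """Return fixed-length input ids plus the matching attention mask."""
    ids = list(token_ids)[:max_length]  # truncation: drop ids beyond max_length
    n_pad = max_length - len(ids)       # how many pad positions are needed
    padding = [pad_id] * n_pad
    if pad_left:
        input_ids = padding + ids
        attention_mask = [0] * n_pad + [1] * len(ids)
    else:
        input_ids = ids + padding
        attention_mask = [1] * len(ids) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([7, 8, 9], max_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short)
print(long)
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.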


In [None]:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')  #load tokenizer

In [None]:
def tokenize_function(examples):  #tokenization function
    return tokenizer(examples['text'], padding="max_length", truncation=True,  max_length=128)

In [None]:
tokenized_datasets = dataset_dict.map(tokenize_function, batched=True)  #tokenize the dataset 

In [None]:
tokenized_datasets

---

## Fine-Tuning XLNet for Classification

A pre-trained XLNet model is loaded with a sequence
classification head.

To keep training efficient:
- A small subset (100 training and 100 evaluation examples) is used  
- Accuracy is used as the evaluation metric
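
For reference, accuracy over a batch of logits amounts to the following. This is a pure-Python sketch with made-up logits, and the helper names `argmax` and `accuracy` are illustrative.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four examples over four classes.
logits = [
    [2.0, 0.1, -1.0, 0.3],   # predicts class 0
    [0.2, 1.5, 0.0, -0.5],   # predicts class 1
    [0.0, 0.1, 0.2, 3.0],    # predicts class 3
    [1.0, 0.9, 0.8, 0.7],    # predicts class 0
]
labels = [0, 1, 3, 2]
print(accuracy(logits, labels))
```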


In [None]:
print(tokenized_datasets["train"]['text'][0])

In [None]:
print(tokenized_datasets["train"]['input_ids'][0])

In [None]:
tokenizer.decode(5)  #decode input id to token 

In [None]:
print(tokenized_datasets["train"]['token_type_ids'][0])

In [None]:
print(tokenized_datasets["train"]["attention_mask"][0])

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

In [None]:
model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=NUM_LABEL, id2label = {0: 'anger', 1: 'fear', 2: 'joy', 3: 'sadness'})  #load pre-trained model with classification head

In [None]:
metric = evaluate.load("accuracy")

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

---

## Model Training and Evaluation

The model is trained using the Hugging Face `Trainer` API.

Evaluation is performed at each epoch to monitor:
- Classification accuracy  
- Training convergence behavior
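
As a rough guide to how long training runs, the number of optimizer steps can be estimated as below. The sketch assumes the default per-device batch size of 8, a single device, and no gradient accumulation. The helper `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate total optimizer steps for a single-device run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * num_epochs


# 100 training examples, batch size 8 (assumed default), 3 epochs.
total_steps = optimizer_steps(num_examples=100, batch_size=8, num_epochs=3)
print(total_steps)
```

With evaluation set to run once per epoch, a three-epoch run produces three evaluation reports.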


In [None]:
training_args = TrainingArguments(
    output_dir="./test_trainer", eval_strategy="epoch",  #evaluation strategy at each epoch
    num_train_epochs=3,)  #number of training epochs

In [None]:
trainer = Trainer(
    model=model,                         #the pre-trained model
    args=training_args,                  #training arguments
    train_dataset=small_train_dataset,   #training dataset
    eval_dataset=small_eval_dataset,      #evaluation dataset
    compute_metrics=compute_metrics,      #evaluation metrics
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

---

## Model Saving and Inference

After training:
- The fine-tuned model is saved to disk  
- Reloaded for inference  
- Used with a Transformers pipeline to predict
  emotion labels on unseen text
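
For a single-label classifier, the per-label scores returned by the pipeline are, as far as I know, a softmax over the model's logits. The sketch below reproduces that conversion in plain Python. The logits are made up, and `scores_from_logits` is a hypothetical helper.

```python
import math

id2label = {0: "anger", 1: "fear", 2: "joy", 3: "sadness"}


def scores_from_logits(logits, id2label):
    """Softmax the logits and return label/score pairs, best first."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    scored = [
        {"label": id2label[i], "score": e / total} for i, e in enumerate(exps)
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)


result = scores_from_logits([0.1, -1.0, 2.3, 0.4], id2label)
for item in result:
    print(item["label"], round(item["score"], 3))
```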


In [None]:
model.save_pretrained("fine_tuned_model")

In [None]:
fine_tuned_model = XLNetForSequenceClassification.from_pretrained("fine_tuned_model")

In [None]:
clf =  pipeline("text-classification", model=fine_tuned_model, tokenizer=tokenizer)

In [None]:
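rand_int = random.randrange(len(val_split))  #random position inside the validation split (upper bound excluded)
text = val_split['text_clean'].iloc[rand_int]  #positional lookup: the index is not contiguous after splitting
print(text)
answer = clf(text, top_k=None)  #scores for every label
print(answer)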

---

## Summary

This notebook demonstrates:
- Practical fine-tuning of XLNet for text classification  
- Handling class imbalance in real-world datasets  
- End-to-end training using the Transformers ecosystem  
- Inference on unseen text using a Transformers pipeline  

The work reflects applied knowledge of modern
Transformer architectures and training workflows.
