## Objectives

This notebook demonstrates the process of text classification using three different transformer models: DistilBERT, BERT, and RoBERTa. We will train each model on labeled data, use them to label unlabeled data, and visualize the results.

**Steps to perform text classification**
1. Prepare and preprocess the data.
2. Tune hyperparameters using DistilBERT.
3. Train and evaluate DistilBERT, BERT, and RoBERTa models.
4. Label the unlabeled dataset using the trained models.
5. Visualize the label distribution.

## Imports

In [None]:
!pip install pandas
!pip install scikit-learn
!pip install transformers==4.18.0
!pip install tensorflow==2.16.1
!pip install transformers torch
!pip install emoji2emotion




In [None]:
import json
import os
import pandas as pd
import transformers
import matplotlib.pyplot as plt
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification #DistilBERT
from transformers import BertTokenizer, TFBertForSequenceClassification #BERT
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification #RoBERTa

from transformers import logging, TFTrainingArguments
from transformers.trainer_tf import TFTrainer
# from transformers import TextClassificationPipeline

import tensorflow as tf

**CODE FOR ADDING EMOJI_EMOTION SECTION**

In [None]:
!pip install emoji
!pip install emoji_emotion


In [None]:
import pandas as pd
from emoji_emotion import EmojiEmotion

# Load your dataset
df = pd.read_csv('/content/finalHindiDataset_withEmojis.csv')

# Initialize the EmojiEmotion class
emoji_emotion = EmojiEmotion()

# Function to get emotions from emojis
def extract_emotion(emoji_text):
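    # Skip missing values (pandas NaN is a float, not a str) and rows already marked neutral
    if isinstance(emoji_text, str) and emoji_text and emoji_text != "neutral":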
        emojis = emoji_text.split()  # Adjust delimiter if needed
        emotion_counts = {emotion: 0 for emotion in ['happy', 'sadness', 'anger', 'fear', 'surprise', 'disgust']}

        for emoji in emojis:
            emotion = emoji_emotion.get(emoji)
            if emotion in emotion_counts:
                emotion_counts[emotion] += 1

        # Determine the emotion with the highest count
        detected_emotion = max(emotion_counts, key=emotion_counts.get)
        return detected_emotion if emotion_counts[detected_emotion] > 0 else 'none'

    return 'neutral'  # If no emoji or it's neutral

# Apply the function to the emoji column
df['detected_emotion'] = df['Emojis'].apply(extract_emotion)

# Save the updated dataset
df.to_csv('updated_dataset_with_detected_emotions.csv', index=False)

print("Emotion detection from emojis complete!")


ModuleNotFoundError: No module named 'emoji_emotion'
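Because the `emoji_emotion` import fails, the majority-vote logic can be sketched without it. The version below is a minimal illustration that assumes a small hand-written lookup table (`EMOJI_TO_EMOTION` is hypothetical and far from complete). It keeps the same return conventions as the cell above: `'neutral'` for empty or neutral input, `'none'` when no emoji is recognised.

```python
from collections import Counter

# Hypothetical emoji-to-emotion lookup; a real table would need many more entries.
EMOJI_TO_EMOTION = {
    "😀": "happy",
    "😂": "happy",
    "😢": "sadness",
    "😡": "anger",
    "😱": "fear",
    "😮": "surprise",
    "🤢": "disgust",
}

def extract_emotion(emoji_text):
    """Return the most frequent emotion among whitespace-separated emojis."""
    # Missing values (e.g. NaN from pandas), blank strings and "neutral" map to 'neutral'.
    if not isinstance(emoji_text, str) or not emoji_text.strip() or emoji_text == "neutral":
        return "neutral"
    counts = Counter(
        EMOJI_TO_EMOTION[e] for e in emoji_text.split() if e in EMOJI_TO_EMOTION
    )
    if not counts:
        return "none"  # emojis were present, but none is in the lookup table
    # most_common(1) returns the emotion with the highest count
    return counts.most_common(1)[0][0]

print(extract_emotion("😀 😂 😢"))  # happy
```

The function can be applied to a DataFrame column in the same way as before, for example `df['Emojis'].apply(extract_emotion)`.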

**CODE FOR ADDING SENTIMENT ANALYSIS COLUMN**

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

# Load dataset
df = pd.read_csv('/content/finalHindiDataset_withEmojis.csv')

# Load the pre-trained IndicBERT checkpoint.
# NOTE: this is a general-purpose language model. Unless the checkpoint has been
# fine-tuned for sentiment, its classification head is newly initialized and the
# LABEL_0 / LABEL_1 outputs below are not meaningful sentiment predictions.
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForSequenceClassification.from_pretrained("ai4bharat/indic-bert")

# Initialize sentiment analysis pipeline (truncation=True clips inputs to the model's token limit)
sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, truncation=True)

# Function to get sentiment for Hindi text using Hugging Face pipeline
def get_hindi_sentiment(text):
    # text[:512] keeps the first 512 characters (not tokens) to bound the input size
    result = sentiment_analysis(text[:512])
    sentiment_label = result[0]['label']
    return 'positive' if sentiment_label == 'LABEL_1' else 'negative'

# Add sentiment column to the dataframe
df['hindi_sentiment'] = df['translated_text'].apply(get_hindi_sentiment)

# Save the updated dataset
df.to_csv('updated_dataset_with_hindi_sentiment.csv', index=False)

print("Hindi sentiment analysis column added successfully!")


## Data Preprocessing
In this section, we will load the dataset, preprocess the data, and split it into training and validation sets.

In [None]:
import pandas as pd
labeled_df = pd.read_csv('/content/finalHindiDataset_withEmojis.csv')

In [None]:
labeled_df = labeled_df.sample(frac=1).reset_index(drop=True)


In [None]:
labeled_df=labeled_df[:1000]

In [None]:
# checking count for each label
labeled_df['Label'].value_counts()

In [None]:
# labeled_df=labeled_df.dropna

In [None]:
labeled_df

In [None]:
# print(labeled_df.head())

In [None]:
# Preparing data for training and validation
data_texts = labeled_df['translated_text'].to_list()
data_labels = labeled_df['Label'].to_list()

### Train Test Split

In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(data_texts, data_labels, test_size = 0.2, random_state = 0 )
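The sequence classification models used later expect integer class ids in the range `0 .. num_labels - 1`. `LabelEncoder` is imported above but never applied, so if the `Label` column holds strings the labels need to be mapped first. The sketch below shows one way to build that mapping in plain Python; the label values are hypothetical stand-ins for the contents of `labeled_df['Label']`.

```python
# Hypothetical label values; substitute the contents of labeled_df['Label'].
raw_labels = ["positive", "negative", "neutral", "positive", "negative"]

# Sort the distinct labels so the mapping is stable across runs.
classes = sorted(set(raw_labels))
label2id = {label: idx for idx, label in enumerate(classes)}
id2label = {idx: label for label, idx in label2id.items()}

# Integer ids to feed to the model, and the size of the classification head.
encoded_labels = [label2id[label] for label in raw_labels]
num_labels = len(classes)

print(label2id)        # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded_labels)  # [2, 0, 1, 2, 0]
print(num_labels)      # 3
```

Deriving `num_labels` from the data in this way, and passing it to every `from_pretrained(..., num_labels=...)` call, keeps the size of the classification head consistent with the dataset.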

## Model Training and Evaluation

In [None]:
def compute_metrics(pred):
    '''
    function to compute metrics like accuracy, precision, recall, and F1-score to assess model performance
    '''
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }
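To make the `average='weighted'` setting concrete, here is a small pure-Python sketch of the same calculation: each class's precision, recall, and F1 is weighted by that class's share of the true labels. The toy labels and predictions are invented for illustration only.

```python
from collections import Counter

def weighted_metrics(labels, preds):
    """Support-weighted precision, recall and F1, plus accuracy, for integer class ids."""
    support = Counter(labels)  # number of true examples per class
    total = len(labels)
    precision = recall = f1 = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        predicted = sum(1 for p in preds if p == cls)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / n
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = n / total  # class share of the true labels
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: six samples, three classes, one mistake.
labels = [0, 0, 0, 1, 1, 2]
preds = [0, 0, 1, 1, 1, 2]
m = weighted_metrics(labels, preds)
print(round(m["accuracy"], 4), round(m["precision"], 4), round(m["recall"], 4))
# 0.8333 0.8889 0.8333
```

Note that support-weighted recall always equals accuracy, because the per-class weights cancel the per-class denominators. This identity is a handy sanity check when reading reported metrics.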

### DistilBERT

In this section, we will train a DistilBERT model on the labeled data and evaluate its performance.

DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than google-bert/bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

**Steps:**
1. **Load Tokenizer and Model**: Initialize the tokenizer and model from the Hugging Face library.
2. **Tokenize Data**: Tokenize the data using the DistilBERT tokenizer.
3. **Create Datasets**: Prepare the training and validation datasets using tokenized data.
4. **Define Training Arguments**: Set the parameters for training, such as learning rate and batch size.
5. **Train the Model**: Train the DistilBERT model using the training dataset.
6. **Evaluate the Model**: Evaluate the trained model on the validation dataset.
7. **Save the Model**: Save the trained model and tokenizer for future use.


In [None]:
# pip install --upgrade tensorflow transformers


In [None]:
# Load tokenizer from the pre-trained DistilBERT model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# Load the pre-trained DistilBERT model for sequence classification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',num_labels=2)

In [None]:
# tokenizing train_texts and val_texts
train_encodings = tokenizer(train_texts, truncation = True, padding = True )
val_encodings = tokenizer(val_texts, truncation = True, padding = True )

In [None]:
# Creating TensorFlow datasets for training and validation
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

In [None]:
def train_and_evaluate(learning_rate, epochs):
    # Define training arguments
    training_args = TFTrainingArguments(
        output_dir='./results',
        num_train_epochs=epochs,
        learning_rate=learning_rate,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        warmup_steps=30,
        logging_dir='./logs',
        eval_steps=10
    )

    # Build the model inside the distribution strategy scope
    with training_args.strategy.scope():
        trainer_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 2 )

    # Initializing the TFTrainer
    trainer = TFTrainer(
        model=trainer_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )

    # Training the model
    trainer.train()

    # Evaluating the model
    eval_results = trainer.evaluate()

    return eval_results

#### **Hyperparameter Tuning**

In this section, we aim to explore the effect of different hyperparameters on DistilBERT model performance. Specifically, we will experiment with various learning rates and the number of training epochs. The chosen hyperparameters for testing are:

- **Learning Rates**: 5e-5, 3e-5, 2e-5
- **Epochs**: 5, 6, 7, 8, 9

We will train and evaluate the model for each combination of learning rate and epochs. The evaluation metrics we will consider include accuracy, precision, recall, F1 score, and loss. These metrics will help us determine the optimal hyperparameters for our task.

The code below performs the hyperparameter testing and stores the results in a DataFrame for further analysis.

In [None]:
# hyperparameters to test
learning_rates = [5e-5, 3e-5, 2e-5]  # matches the rates listed above and discussed in the results
epochs_list = [5,6,7,8,9]

results = []

for lr in learning_rates:
    for epochs in epochs_list:
        print(f"Training with learning rate: {lr} and epochs: {epochs}")
        eval_results = train_and_evaluate(learning_rate=lr, epochs=epochs)
        results.append({
            'learning_rate': lr,
            'epochs': epochs,
            'accuracy': eval_results['eval_accuracy'],
            'precision': eval_results['eval_precision'],
            'loss': eval_results['eval_loss'],
            'recall': eval_results['eval_recall'],
            'f1': eval_results['eval_f1']
        })

# Converting results to DataFrame
results_df = pd.DataFrame(results)

In [None]:
results_df
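As a rough sketch of what the grid search involves, the snippet below enumerates the grid with `itertools.product` and picks a winning row from a list of result dictionaries. It uses the learning rates listed in the text above. The rows in `example_results` contain made-up numbers purely to exercise the selection logic, and `pick_best` is a hypothetical helper, not part of any library.

```python
from itertools import product

learning_rates = [5e-5, 3e-5, 2e-5]
epochs_list = [5, 6, 7, 8, 9]

# Every (learning rate, epochs) pair in the grid: 3 x 5 = 15 training runs.
grid = list(product(learning_rates, epochs_list))
print(len(grid))  # 15

def pick_best(results):
    """Return the row with the highest F1, breaking ties by the lowest loss."""
    return max(results, key=lambda row: (row["f1"], -row["loss"]))

# Placeholder rows with made-up numbers, only to show the selection logic.
example_results = [
    {"learning_rate": 5e-5, "epochs": 5, "f1": 0.80, "loss": 0.50},
    {"learning_rate": 3e-5, "epochs": 5, "f1": 0.85, "loss": 0.45},
    {"learning_rate": 2e-5, "epochs": 5, "f1": 0.85, "loss": 0.40},
]
best = pick_best(example_results)
print(best["learning_rate"], best["epochs"])  # 2e-05 5
```

The same rule could be applied to the real `results` list built by the tuning loop. Each grid point retrains the model from scratch, so the cost grows with the product of the two list lengths.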

#### Results
We trained the model with various learning rates and epochs. The following graphs show the performance of the model for different hyperparameter combinations.

In [None]:
results_df.to_csv("hyperparameter_tuning_results.csv")

In [None]:
# hyperparameter tuning results

# Plot accuracy
accuracy_fig = go.Figure()
for lr in learning_rates:
    subset = results_df[results_df['learning_rate'] == lr]
    accuracy_fig.add_trace(go.Scatter(x=subset['epochs'], y=subset['accuracy'], mode='lines', name=f'LR={lr}',
                                      hovertemplate='Learning Rate: %{customdata[0]}<br>Epochs: %{x}<br>Accuracy: %{customdata[1]}<br>Loss: %{customdata[2]}',
                                       customdata=subset[['learning_rate', 'accuracy', 'loss']]))

accuracy_fig.update_layout(
    xaxis_title='Epochs',
    yaxis_title='Accuracy',
    title='Accuracy vs. Epochs for Different Learning Rates',
    legend_title='Learning Rate',
    template='plotly_white',
    width=800
)

accuracy_fig.show()

# Plot F1-score
f1_fig = go.Figure()
for lr in learning_rates:
    subset = results_df[results_df['learning_rate'] == lr]
    f1_fig.add_trace(go.Scatter(x=subset['epochs'], y=subset['f1'], mode='lines', name=f'LR={lr}',
                                hovertemplate='Learning Rate: %{customdata[0]}<br>Epochs: %{x}<br>F1 Score: %{customdata[1]}<br>Loss: %{customdata[2]}',
                                       customdata=subset[['learning_rate', 'f1', 'loss']]))

f1_fig.update_layout(
    xaxis_title='Epochs',
    yaxis_title='F1 Score',
    title='F1 Score vs. Epochs for Different Learning Rates',
    legend_title='Learning Rate',
    template='plotly_white',
    width=800
)

f1_fig.show()

**Inference drawn:**

After conducting hyperparameter tuning by experimenting with different learning rates and epochs, we observe the following results:

- **Learning Rate 0.00005:**

    - Achieves the highest accuracy of 95.31% with 7 epochs.
    - Shows a precision of 94.66%, recall of 94.53%, and F1-score of 94.52%.

- **Learning Rate 0.00003:**

    - Achieves the highest accuracy of 96.09% with 9 epochs.
    - Shows a precision of 96.13%, recall of 96.09%, and F1-score of 96.10%.

- **Learning Rate 0.00002:**

    - Achieves the highest accuracy of 95.31% with 9 epochs.
    - Shows a precision of 95.42%, recall of 95.31%, and F1-score of 95.30%.

From these results, the best configuration is a learning rate of 0.00003 with 9 epochs: it gives the highest accuracy (96.09%) with closely matched precision, recall, and F1 scores. Because 9 is the largest epoch count tested, this is also the most expensive configuration to train, and longer runs were not explored.

**Key Observations**
- For the learning rates 0.00003 and 0.00002, the best accuracy was reached at the largest epoch count tested (9), which suggests the model benefits from more training iterations; for 0.00005 the best accuracy came earlier, at 7 epochs.
- The middle learning rate (0.00003) gave the best result. Neither the highest (0.00005) nor the lowest (0.00002) rate matched it, so a higher learning rate does not by itself improve performance.

#### Training with optimal parameters

Learning rate: 3e-5 and epochs: 9

In [None]:
# Define training arguments
training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=9,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=30,
    logging_dir='./logs',
    eval_steps=10
)

# Build the model inside the distribution strategy scope
with training_args.strategy.scope():
    trainer_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 2 )

# Initializing the TFTrainer
trainer = TFTrainer(
    model=trainer_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)
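To give a feel for what `warmup_steps=30` does, here is a sketch of a learning-rate schedule with linear warmup followed by linear decay to zero. That the trainer uses exactly this shape is an assumption on my part, so treat the function as illustrative. The step counts assume a single device, 800 training rows (80% of the 1000 sampled rows), and a batch size of 16.

```python
import math

def lr_at_step(step, peak_lr=3e-5, warmup_steps=30, total_steps=450):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

train_size = int(1000 * 0.8)                 # rows left after the 80/20 split
steps_per_epoch = math.ceil(train_size / 16)  # batches per epoch
total_steps = steps_per_epoch * 9             # 9 epochs
print(steps_per_epoch, total_steps)  # 50 450

# Learning rate at the start, mid-warmup, the peak, halfway through decay, and the end.
for step in (0, 15, 30, 240, 450):
    print(step, lr_at_step(step, total_steps=total_steps))
```

Under these assumptions the warmup covers less than one epoch (30 of 50 steps), so the peak rate of 3e-5 is reached early in training.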

In [None]:
# Training the model
trainer.train()

In [None]:
print("Learning rate: 3e-5 and epochs: 9\n")
# Evaluating the model
trainer.evaluate()

In [None]:
# Save the trained model and tokenizer
distilbert_save_directory = "distilbert_saved_models/"
trainer_model.save_pretrained(distilbert_save_directory)
tokenizer.save_pretrained(distilbert_save_directory)

### BERT
Now, we will train and evaluate a BERT model using the same dataset.

BERT is a bidirectional transformer pretrained using a combination of a masked language modeling objective and next sentence prediction.

**Steps:**
1. **Load BERT Tokenizer and Model**: Initialize the tokenizer and model from the Hugging Face library.
2. **Tokenize Data**: Tokenize the data using the BERT tokenizer.
3. **Create Datasets**: Prepare the training and validation datasets using tokenized data.
4. **Define Training Arguments**: Set the parameters for training.
5. **Train the Model**: Train the BERT model using the training dataset.
6. **Evaluate the Model**: Evaluate the performance of the trained BERT model on the validation dataset.
7. **Save the Model**: Save the trained model and tokenizer for future use.

In [None]:
# Load tokenizer from the pre-trained BERT model
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load the pre-trained BERT model for sequence classification
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

In [None]:
# tokenizing train_texts and val_texts
bert_train_encodings = bert_tokenizer(train_texts, truncation = True, padding = True  )
bert_val_encodings = bert_tokenizer(val_texts, truncation = True, padding = True )

In [None]:
# Creating TensorFlow datasets for training and validation
bert_train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(bert_train_encodings),
    train_labels
))

bert_val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(bert_val_encodings),
    val_labels
))

#### Training with optimal parameters
Learning rate: 3e-5 and epochs: 9

In [None]:
# Define training arguments
training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=9,
    learning_rate=3e-5,
    per_device_train_batch_size=8, # reduced batch size to avoid Out Of Memory error
    per_device_eval_batch_size=8, # reduced batch size to avoid Out Of Memory error
    warmup_steps=30,
    logging_dir='./logs',
    eval_steps=10
)

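# Build the model inside the distribution strategy scope
with training_args.strategy.scope():
    # num_labels must match the number of distinct classes in the data (3 here, as when bert_model was loaded above)
    bert_trainer_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)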

# Initializing the TFTrainer
trainer = TFTrainer(
    model=bert_trainer_model,
    args=training_args,
    train_dataset=bert_train_dataset,
    eval_dataset=bert_val_dataset,
    compute_metrics=compute_metrics
)

In [None]:
# Training the model
trainer.train()

In [None]:
print("Model: BERT")
print("Learning rate: 3e-5 and epochs: 9\n")
# Evaluating the model
trainer.evaluate()

In [None]:
# Save the trained model and tokenizer
bert_save_directory = "bert_saved_models/"
bert_trainer_model.save_pretrained(bert_save_directory)
bert_tokenizer.save_pretrained(bert_save_directory)

### RoBERTa
Finally, we will train and evaluate a RoBERTa model using the same dataset.

RoBERTa builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

**It is the same as BERT, with better pretraining tricks:**

- Dynamic masking: tokens are masked differently at each epoch, whereas BERT masks them once, during preprocessing.
- Training with larger batches.
- Byte-level BPE: bytes, rather than Unicode characters, are the base sub-unit, so any Unicode text can be encoded.
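The byte-level point is easy to see on non-English text. The short sketch below only inspects UTF-8 encodings and does not call the RoBERTa tokenizer. The Hindi word and the emoji are arbitrary examples, written with escape sequences so the code points are unambiguous.

```python
# "namaste" in Devanagari (6 code points) and a grinning-face emoji (1 code point).
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
emoji = "\U0001F600"

for text in (word, emoji):
    data = text.encode("utf-8")
    # Number of code points versus number of UTF-8 bytes
    print(len(text), len(data))

# Every byte is in 0..255, so 256 base symbols cover any input string.
all_bytes = word.encode("utf-8") + emoji.encode("utf-8")
print(all(0 <= b <= 255 for b in all_bytes))  # True
```

Each Devanagari code point takes three bytes and the emoji takes four, so a byte-level tokenizer never hits an unknown symbol. The sequences may be long, though, if the learned merges come mostly from English text.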

**Steps:**
1. **Load RoBERTa Tokenizer and Model**: Initialize the tokenizer and model from the Hugging Face library.
2. **Tokenize Data**: Tokenize the data using the RoBERTa tokenizer.
3. **Create Datasets**: Prepare the training and validation datasets using tokenized data.
4. **Define Training Arguments**: Set the parameters for training.
5. **Train the Model**: Train the RoBERTa model using the training dataset.
6. **Evaluate the Model**: Evaluate the performance of the trained RoBERTa model on the validation dataset.
7. **Save the Model**: Save the trained model and tokenizer for future use.

In [None]:
# Load tokenizer from the pre-trained RoBERTa model
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# Load the pre-trained RoBERTa model for sequence classification
roberta_model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

In [None]:
# tokenizing train_texts and val_texts
roberta_train_encodings = roberta_tokenizer(train_texts, truncation=True, padding=True)
roberta_val_encodings = roberta_tokenizer(val_texts, truncation=True, padding=True)

In [None]:
# Creating TensorFlow datasets for training and validation
roberta_train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(roberta_train_encodings),
    train_labels
))

roberta_val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(roberta_val_encodings),
    val_labels
))

#### Training with optimal parameters
Learning rate: 3e-5 and epochs: 9

In [None]:
# Define training arguments
training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=9,
    learning_rate=3e-5,
    per_device_train_batch_size=8, # reduced batch size to avoid Out Of Memory error
    per_device_eval_batch_size=8, # reduced batch size to avoid Out Of Memory error
    warmup_steps=30,
    logging_dir='./logs',
    eval_steps=10
)

# Build the model inside the distribution strategy scope
with training_args.strategy.scope():
    roberta_trainer_model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels = 3 )

# Initializing the TFTrainer
trainer = TFTrainer(
    model=roberta_trainer_model,
    args=training_args,
    train_dataset=roberta_train_dataset,
    eval_dataset=roberta_val_dataset,
    compute_metrics=compute_metrics
)

In [None]:
# Training the model
trainer.train()

In [None]:
print("Model: RoBERTa")
print("Learning rate: 3e-5 and epochs: 9\n")
# Evaluating the model
trainer.evaluate()

In [None]:
# Save the trained model and tokenizer
roberta_save_directory = "roberta_saved_models/"
roberta_trainer_model.save_pretrained(roberta_save_directory)
roberta_tokenizer.save_pretrained(roberta_save_directory)

In [None]:
# Load the fine-tuned tokenizer and model from the saved directory
tokenizer_fine_tuned = DistilBertTokenizer.from_pretrained(distilbert_save_directory)
model_fine_tuned = TFDistilBertForSequenceClassification.from_pretrained(distilbert_save_directory)
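Once a fine-tuned model is loaded, labeling new text comes down to turning each row of output logits into a class label. The sketch below shows that last step in plain Python. The logits and the `id2label` mapping are hypothetical; in practice the logits would come from the model's output for a batch of tokenized texts.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    return id2label[pred_id], probs[pred_id]

# Hypothetical mapping and logits for two input texts.
id2label = {0: "class_a", 1: "class_b", 2: "class_c"}
batch_logits = [[2.0, 0.5, -1.0], [-0.3, 0.1, 1.7]]

predictions = [predict_label(row, id2label) for row in batch_logits]
for label, prob in predictions:
    print(label, round(prob, 3))
```

Keeping the probability next to the label makes it possible to flag low-confidence predictions for manual review before they are added to the labeled dataset.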

## Conclusion

We implemented and evaluated three pre-trained transformer models (DistilBERT, BERT, and RoBERTa) on a text classification task. The primary goal was to determine the most effective model for classifying the dataset into three categories: FIN_TABLE, NOISE, and TEXT. Through hyperparameter tuning and subsequent evaluations, the following conclusions were drawn:

#### Hyperparameter Tuning

Hyperparameter tuning was performed using DistilBERT due to its faster and more lightweight nature. The learning rates and epochs were varied, and the optimal combination was identified based on evaluation metrics such as accuracy, precision, recall, and F1 score.

- **Learning Rates:** 5e-5, 3e-5, 2e-5
- **Epochs:** 5, 6, 7, 8, 9

The optimal hyperparameters identified for DistilBERT were a learning rate of 3e-5 and 9 epochs, resulting in the highest performance across all evaluation metrics.

| Learning Rate | Epochs | Accuracy | Precision | Loss     | Recall   | F1       |
|---------------|--------|----------|-----------|----------|----------|----------|
| 3e-5          | 9      | 0.960938 | 0.961328  | 0.188554 | 0.960938 | 0.960983 |

#### Model Evaluation

Using the optimal hyperparameters identified, training was performed on all three models:

1. **DistilBERT Model**
   - **Learning Rate:** 3e-5
   - **Epochs:** 9
   - **Evaluation Metrics:**
     - **Loss:** 0.1885537952184677
     - **Accuracy:** 0.9609375
     - **Precision:** 0.961328125
     - **Recall:** 0.9609375
     - **F1 Score:** 0.9609832569391393

2. **BERT Model**
   - **Learning Rate:** 3e-5
   - **Epochs:** 9
   - **Evaluation Metrics:**
     - **Loss:** 0.33096411228179934
     - **Accuracy:** 0.9
     - **Precision:** 0.9034313725490195
     - **Recall:** 0.9
     - **F1 Score:** 0.9003702603702604

3. **RoBERTa Model**
   - **Learning Rate:** 3e-5
   - **Epochs:** 9
   - **Evaluation Metrics:**
     - **Loss:** 0.3747982978820801
     - **Accuracy:** 0.9125
     - **Precision:** 0.9145167895167894
     - **Recall:** 0.9125
     - **F1 Score:** 0.9127909226190475

Based on the evaluation metrics, DistilBERT outperformed both BERT and RoBERTa in terms of accuracy, precision, recall, and F1 score. Despite being a lighter and faster model, DistilBERT achieved a higher evaluation performance, making it the most effective model for this text classification task.

- **DistilBERT** demonstrated superior performance with an F1 score of 0.96098, making it the best choice for the task.
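- **BERT** and **RoBERTa**, while still effective, did not perform as well as DistilBERT with the same learning rate and epoch count. Note that they were trained with a smaller batch size (8 instead of 16) to avoid out-of-memory errors, so the comparison is not strictly like-for-like.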
