# Model Architecture Overview
The model being used for this classification task is **DistilBERT**, which is a lighter, faster version of the popular BERT (Bidirectional Encoder Representations from Transformers) model. DistilBERT retains much of BERT's power while being more efficient in terms of size and inference speed. It's built on the transformer architecture, which is particularly powerful for Natural Language Processing (NLP) tasks like sentence classification.



* ### Transformer Encoder:
  DistilBERT, like BERT, is based on the transformer architecture, which uses self-attention mechanisms to understand the relationships between words in a sentence, regardless of their position.

* ### Bidirectional:
  Unlike traditional models that read text sequentially (left-to-right or right-to-left), DistilBERT reads the text in both directions, allowing it to capture context from both sides of a word.

* ### Lightweight and Efficient:
  DistilBERT is about 40% smaller than BERT and about 60% faster at inference while retaining roughly 97% of its language understanding capability. This is achieved through knowledge distillation and by halving the number of transformer layers (12 in BERT-base vs. 6 in DistilBERT).
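
To make the self-attention idea above concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is a simplification, not DistilBERT's actual implementation: it uses a single head and feeds the same toy vectors in as queries, keys, and values, whereas the real model applies learned projections and several heads. The `self_attention` helper and the toy vectors are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head attention where queries = keys = values = `vectors`."""
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Similarity of this token to every token, scaled by sqrt(dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all token vectors
        outputs.append([sum(wj * v[i] for wj, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights


# Three toy "token" vectors (hypothetical values)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(toy_vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token attends to every other token in one step, which is why position in the sentence does not limit which words can influence each other.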







The model is fine-tuned on a specific task (in this case, **active/passive sentence classification**) using labeled training data. For this task, DistilBERT's architecture includes a classification head (a dense layer) on top of the transformer layers to predict class labels (active or passive).

## Flow of the Model:
*Input Sentence → Tokenization → Word Embeddings → Transformer Encoder → Classifier → Output Label*






*   Tokenization: The input sentence is split into subword units by the DistilBERT tokenizer before being passed to the model.

*   Embedding Layer: The tokens are converted into embeddings (vectors) that represent each token in a high-dimensional space.

*   Transformer Encoder: The embeddings are passed through a stack of transformer layers that process the text bidirectionally (considering both left and right context).

*   Output Layer: The classifier outputs a vector with one score per class (2 in this case, active and passive). The index of the highest score is the predicted class.








# Getting Started

To run this project, which fine-tunes a DistilBERT model for Active/Passive sentence classification, you need several Python libraries for machine learning, data preprocessing, and model handling.

Execute the following block of code to import the libraries required to build and run this model.

In [1]:
'''
tensorflow : Trains the model; TensorFlow's Keras API handles the neural network and the training process.

transformers : Provides pre-trained transformer models such as DistilBERT, along with their tokenizers.

sklearn : Provides utilities such as shuffling, train/test splits, and evaluation metrics (accuracy, precision, recall, etc.).

numpy : A core library for numerical operations, widely used for array manipulation and processing model inputs/outputs.

pandas : Handles and processes tabular data (CSVs, DataFrames, etc.).

'''

import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer, create_optimizer
from sklearn.utils import shuffle
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

Once the libraries have been imported, upload the dataset. In this case, the dataset is a collection of 40 sentences in **Active** and **Passive** voice.

I have used a custom dataset (you can download it from here if you want). You can also use your own **classification dataset**.


> Remember that we are training the model on a binary classification dataset.



In [2]:
data = pd.read_excel("/content/drive/MyDrive/Voice_Dataset/immverse_ai_eval_dataset.xlsx")
data = shuffle(data)  # Shuffle the data

# Pre-Processing
Preprocessing is crucial because the model doesn't work directly with raw text; it needs the text to be transformed into numerical input. Here’s a step-by-step explanation of the preprocessing steps involved:

1. **Split the dataset into training (60%), validation (20%), and test (20%) sets.**

In [3]:
train_data = data[:24]  # 60% train
val_data = data[24:32]  # 20% validation
test_data = data[32:]   # 20% test
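
The fixed slice positions above only work for exactly 40 rows, and a plain shuffle does not guarantee that both voices are equally represented in each split. As an alternative, here is a minimal pure-Python sketch of a stratified 60/20/20 split. The `stratified_split` helper and the balanced toy label list are assumptions for illustration, not part of the original notebook; scikit-learn's `train_test_split` also offers a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.6, val_frac=0.2, seed=42):
    """Return (train, val, test) index lists, splitting each class by the same fractions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_val = round(len(indices) * val_frac)
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # whatever remains goes to test
    return train_idx, val_idx, test_idx


# Hypothetical balanced dataset: 20 Active and 20 Passive sentences
labels = ["Active"] * 20 + ["Passive"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a pandas DataFrame, the resulting index lists could be applied as `data.iloc[train_idx]`, `data.iloc[val_idx]`, and `data.iloc[test_idx]`.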

2. **Loading the Tokenizer:**

 The tokenizer used here is DistilBERT's Tokenizer. It is responsible for converting the input text into tokens that the model can process. Tokenization involves breaking the input text into individual tokens, which are often words or subwords. The tokenizer uses a WordPiece tokenization algorithm, which allows it to handle unseen words by breaking them down into smaller meaningful subword units.
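
The WordPiece idea described above can be sketched in a few lines of pure Python. This is a simplified greedy longest-match-first illustration, not the tokenizer's real implementation, and the tiny vocabulary is made up for the example; the actual DistilBERT vocabulary has roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary
toy_vocab = {"un", "##break", "##able", "write", "##s", "the"}
pieces = wordpiece_tokenize("unbreakable", toy_vocab)
print(pieces)  # → ['un', '##break', '##able']
```

A word that is not in the vocabulary as a whole can still be represented, as long as it can be assembled from known pieces.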

In [4]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")




**Tokenizing the Sentence**

For each input sentence, the tokenizer does the following:


*   Splitting the text into words/subwords: The sentence is broken down into tokens (words or subword pieces) that the model can understand.

*   Padding and truncating: To give all input sequences in a batch the same length, shorter sequences are padded and longer ones are truncated. With **padding=True** and **max_length=128**, as in the code below, sequences are padded to the longest one in the batch and truncated at 128 tokens.

*   Converting tokens into token IDs: Each token is mapped to its integer index in the pre-trained tokenizer's vocabulary.
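
The padding and truncation step can be sketched in plain Python as follows. This is an illustration under stated assumptions: the token IDs are made-up values, `pad_id=0` is assumed as the padding ID, and the helper `pad_and_truncate` is hypothetical. The real tokenizer also keeps the special tokens in place when it truncates, which this sketch does not attempt.

```python
def pad_and_truncate(batch_ids, max_length=128, pad_id=0):
    """Truncate each sequence to `max_length`, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target_len = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Two toy sequences of different lengths (hypothetical token IDs)
toy_batch = [[101, 7592, 102], [101, 1996, 3661, 2001, 2517, 102]]
input_ids, attention_mask = pad_and_truncate(toy_batch, max_length=5)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model tell real tokens from padding, which is why the tokenizer returns it alongside the token IDs.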



Run the following code block to tokenize the sentences.

In [5]:

def encode_data(dataset):
    # Pull the sentences and their voice labels out of the DataFrame
    texts = dataset['sentence'].tolist()
    labels = dataset['voice'].tolist()
    # Map string labels to numerical values (Active -> 0, anything else -> 1)
    labels = [0 if label == "Active" else 1 for label in labels]
    # Tokenize, truncate at 128 tokens, and pad to the longest sentence in the batch
    encodings = tokenizer(
        texts, truncation=True, padding=True, max_length=128, return_tensors="tf"
    )
    return encodings, tf.convert_to_tensor(labels)



The **encode_data()** function above processes the raw train_data, val_data, and test_data into a format that can be fed to the model: it tokenizes the text, encodes it as numerical tensors, and prepares the corresponding labels.

In [6]:
#These are the tokenized and encoded input data.
train_encodings, train_labels = encode_data(train_data)
val_encodings, val_labels = encode_data(val_data)
test_encodings, test_labels = encode_data(test_data)


The next step prepares the datasets for training, validation, and testing in a TensorFlow pipeline. The following code uses **tf.data.Dataset.from_tensor_slices()** to create TensorFlow datasets from the encoded data and labels. This method slices the tensors along their first dimension into individual examples, which are then grouped into batches of 8.

**dict(encodings)** converts the tokenizer output into a plain dictionary of tensors (such as input_ids and attention_mask), which is the input format expected by Keras models that take multiple named inputs, such as DistilBERT.

In [7]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels)).batch(8)
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels)).batch(8)
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), test_labels)).batch(8)
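
As a rough mental model of what the slicing and batching above produce, here is a pure-Python analogy. It does not use TensorFlow at all; the `slice_and_batch` helper and the toy feature values are assumptions made only to show the shape of the result.

```python
def slice_and_batch(features, labels, batch_size):
    """Group per-example feature rows and labels into consecutive batches."""
    batches = []
    for start in range(0, len(labels), batch_size):
        end = min(start + batch_size, len(labels))
        # Every named feature is cut at the same positions as the labels
        batch_features = {name: column[start:end] for name, column in features.items()}
        batches.append((batch_features, labels[start:end]))
    return batches


# 24 toy training examples, matching the size of the training split
features = {
    "input_ids": [[101, i, 102] for i in range(24)],
    "attention_mask": [[1, 1, 1] for _ in range(24)],
}
labels = [i % 2 for i in range(24)]

batches = slice_and_batch(features, labels, batch_size=8)
print(len(batches))  # → 3
```

Each batch is a pair of a feature dictionary and a label list, which mirrors the `(inputs, labels)` structure that Keras consumes during training.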

# Load the Pre-trained Model (DistilBERT)

We use the base, lightweight DistilBERT model (distilbert-base-uncased) and fine-tune it for the given task.

In [8]:
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 




> Key Points for the following code:

1. Pre-trained Model (DistilBERT): Fine-tuning a pre-trained transformer model gives better results with limited data.
2. Data Augmentation: While not included in this notebook, you can manually rewrite some sentences to add more training samples.
3. Early Stopping: Stops training when the validation loss stops improving, which helps prevent overfitting. The code below uses early stopping only and does not save checkpoints.
4. Batch Size: A small batch size (e.g., 8) gives more weight updates per epoch, which helps on tiny datasets.
5. Learning Rate (3e-5): A low learning rate suits fine-tuning because it avoids overwriting the pre-trained weights too aggressively.

Run the following code to train the model.

In [9]:
# Optimizer and Learning Rate Scheduler
num_train_steps = len(train_dataset) * 10  # 10 epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=3e-5, num_warmup_steps=0, num_train_steps=num_train_steps
)
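
To see where the step count comes from and how the learning rate is expected to change during training, here is a small pure-Python sketch. It assumes that, with no warmup steps, the schedule decays linearly from the initial rate towards zero over `num_train_steps`; I believe this matches the default behavior of `create_optimizer`, but treat it as an assumption and check the library documentation. The helper names are hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def linear_decay_lr(step, init_lr, num_train_steps, end_lr=0.0):
    """Learning rate after `step` updates under a linear decay schedule."""
    progress = min(step, num_train_steps) / num_train_steps
    return init_lr + (end_lr - init_lr) * progress


# 24 training sentences, batch size 8, 10 epochs
num_train_steps = steps_per_epoch(24, 8) * 10
print(num_train_steps)  # → 30

for step in (0, 15, 30):
    print(step, linear_decay_lr(step, init_lr=3e-5, num_train_steps=num_train_steps))
```

Note that if early stopping ends training before epoch 10, the learning rate never reaches the end of the schedule.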


In [10]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
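
For intuition about the loss configured above, here is a minimal pure-Python sketch of sparse categorical cross-entropy computed directly from logits. "Sparse" means the label is an integer class index (0 or 1) rather than a one-hot vector, and "from logits" means the softmax is folded into the loss. The function names and the toy logits are assumptions for illustration.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)
    # log-sum-exp trick keeps the computation numerically stable
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mean_loss(batch_logits, batch_labels):
    """Average the per-example loss over a batch."""
    losses = [sparse_categorical_crossentropy(z, y) for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


# Hypothetical logits for two sentences: [score for Active, score for Passive]
batch_logits = [[2.0, -1.0], [0.5, 1.5]]
batch_labels = [0, 1]
print(round(mean_loss(batch_logits, batch_labels), 4))
```

The loss is small when the logit for the correct class is much larger than the other one, and it grows quickly when the model is confidently wrong.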


In [11]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
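
The effect of `patience=3` can be illustrated with a simplified pure-Python version of the stopping rule. This is a sketch of the general patience logic, not Keras's actual implementation, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch where training stops, epoch with the best validation loss), 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1  # another epoch without improvement
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses over 7 epochs
val_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]
stopped, best = early_stopping_epoch(val_losses, patience=3)
print(stopped, best)  # → 6 3
```

In this example training stops at epoch 6 even though the best weights were seen at epoch 3. If you want to keep the best weights rather than the last ones, Keras's `EarlyStopping` accepts `restore_best_weights=True`, and `tf.keras.callbacks.ModelCheckpoint` can save the best model to disk; neither is used in this notebook.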

In [12]:
# Train the Model
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
    callbacks=[early_stopping]
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Model Evaluation and Saving

Run the following code to evaluate the trained model and then save it to a custom directory.

In [13]:
#Evaluate the model
test_loss, test_accuracy = model.evaluate(test_dataset)
print(f"Test Accuracy: {test_accuracy:.2f}")

Test Accuracy: 0.88
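
With only 8 test sentences, accuracy can only move in steps of 1/8, so the reported 0.88 is consistent with 7 of 8 correct predictions (0.875, rounded). Accuracy alone also hides which class the errors fall in. The sketch below computes precision and recall by hand; the label and prediction lists are hypothetical values chosen to give 7 of 8 correct, not the model's real outputs.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical test labels and predictions (0 = Active, 1 = Passive)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On real predictions, `sklearn.metrics` provides the same measures through `precision_score` and `recall_score`.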


In [14]:
# Save the Model
model.save_pretrained("./text_classifier")
tokenizer.save_pretrained("./text_classifier")

('./text_classifier/tokenizer_config.json',
 './text_classifier/special_tokens_map.json',
 './text_classifier/vocab.txt',
 './text_classifier/added_tokens.json')

In [15]:
# Load the fine-tuned model and tokenizer saved above
model = TFDistilBertForSequenceClassification.from_pretrained("./text_classifier")
tokenizer = DistilBertTokenizer.from_pretrained("./text_classifier")


Some layers from the model checkpoint at ./text_classifier were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./text_classifier and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Model Prediction

Once the tokenized input is ready, it is passed through the model. The transformer encoder processes the tokenized input to generate contextualized token embeddings that capture the relationships between tokens.

The classification head (dense layers on top of the encoder) takes the embedding of the first token ([CLS]) and outputs a vector of logits, which are **raw prediction scores**. Applying softmax to the logits gives class probabilities, and applying argmax gives the final class prediction.



This output corresponds to either **0 (Active)** or **1 (Passive)**.
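
The step from logits to a label can be sketched in pure Python as shown here. The logits are hypothetical values, and `interpret_logits` is an illustrative helper rather than part of the notebook. It also returns the softmax probability of the chosen class, which can serve as a rough confidence score.

```python
import math

LABELS = ["Active", "Passive"]  # index 0 = Active, index 1 = Passive


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret_logits(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return LABELS[predicted], probs[predicted]


# Hypothetical logits for one sentence
label, confidence = interpret_logits([-1.2, 2.3])
print(label, round(confidence, 3))
```

Softmax does not change which class has the highest score, so argmax on the raw logits gives the same label; softmax is only needed when you want probabilities.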

In [16]:
def predict_sentence_voice(sentence):

    inputs = tokenizer(sentence, return_tensors="tf", truncation=True, padding=True, max_length=128)

    logits = model(inputs)[0]

    # Convert logits to predicted class (0 = Active, 1 = Passive)
    predicted_class = tf.argmax(logits, axis=-1).numpy()[0]

    # Interpret the prediction
    if predicted_class == 0:
        return "Active"
    else:
        return "Passive"

# Testing the Model

Write any sentence in the given ***sentence*** variable and run this code block to determine if it is Active Voice or Passive Voice.

In [17]:
# Test the function with a sample sentence
sentence = "the letter was written by" # <--- write your sample sentence here
prediction = predict_sentence_voice(sentence)
print(f"Sentence: '{sentence}' is {prediction}.")

Sentence: 'the letter was written by' is Passive.
