# In this project we have used Transformers in place of LSTM Models


Transformer-based models such as BERT and RoBERTa have become popular for natural language processing tasks like sentiment analysis on the IMDB movie review dataset. Some of the reasons are listed below.

**Attention Mechanism:**

Transformers leverage the attention mechanism, allowing them to consider all parts of the input sequence simultaneously. This is particularly beneficial for understanding the context and relationships between words in a sentence.

LSTMs process input sequentially, which might result in information loss over long sequences.
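
As a rough illustration of how attention looks at all positions at once, the following is a minimal NumPy sketch of single-head scaled dot-product attention. The function name and the random toy inputs are assumptions made for this example; real transformer layers add learned projections, multiple heads, and masking.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for single-head attention without masking."""
    d_k = queries.shape[-1]
    # Similarity of every position with every other position, scaled by sqrt(d_k)
    scores = queries @ keys.T / np.sqrt(d_k)
    # Row-wise softmax (subtracting the max keeps the exponentials stable)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of all value vectors
    return weights @ values, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))  # 4 toy token vectors of dimension 8
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(outputs.shape)         # (4, 8)
print(weights.sum(axis=-1))  # every row of weights sums to 1
```

Every row of `weights` is computed from the whole sequence in one matrix product, which is what lets the model relate distant words without stepping through the sequence.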


**Pre-trained Representations:**

BERT and RoBERTa models are pre-trained on large corpora, capturing rich contextual information from diverse language patterns. This pre-training helps the model understand language nuances and semantic relationships, making them effective for various downstream tasks.

LSTM models often require extensive training data to learn effective representations.


**Transfer Learning:**

Transformers are designed for transfer learning. Pre-trained transformer models can be fine-tuned on specific tasks with relatively smaller datasets, providing good performance even when labeled data is limited.

LSTMs might struggle to generalize well with limited data and may require larger amounts of labeled data for satisfactory performance.

**Contextual Embeddings:**

Transformers produce contextual embeddings, meaning the representation of a word can vary based on its context within a sentence. This contextual understanding is crucial for tasks like sentiment analysis, where the meaning of a word can change based on the surrounding words.

LSTM models are often fed static word embeddings (one fixed vector per word), and their hidden state summarizes context in reading order only, so context-specific information can be missed.


**Parallelization:**

Transformers process all positions of a sequence in parallel, which makes efficient use of GPUs and scales well to training on large datasets.

LSTMs process sequences sequentially, limiting the extent to which parallelization can be achieved.

**State-of-the-Art Performance:**

Transformer-based models have achieved state-of-the-art performance on a wide range of NLP benchmarks, including sentiment analysis, due to their ability to capture complex relationships in data.



# **Libraries**
In Natural Language Processing (NLP) and sentiment analysis, specialized libraries streamline development. NumPy provides array and mathematical operations, Keras simplifies building neural networks, and the IMDb dataset bundled with Keras offers a benchmark for sentiment analysis. The Transformers library provides pre-trained models such as BERT and RoBERTa together with their tokenizers. Combined, these libraries cover model development, training, and evaluation.

In [None]:
import numpy as np
from keras.datasets import imdb
from keras.models import Model
from keras.layers import Input, Dense, Flatten
from transformers import BertTokenizer, TFBertModel, RobertaTokenizer, TFRobertaModel
from keras.preprocessing import sequence
from keras import backend as K

**top_words = 5000:**

This variable sets the desired size of the vocabulary. Only the top 5000 most frequent words in the dataset will be considered.

Note: *Reducing top_words saves computation and memory, at the cost of potential information loss and a coarser semantic representation.*



**(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words):**

***imdb.load_data(num_words=top_words)*** loads the IMDB movie review dataset from Keras.

The ***num_words*** parameter is set to ***top_words***, meaning only the 5000 most frequent words are kept; every rarer word is replaced with the out-of-vocabulary index (2 by default).
The dataset is split into training and testing sets, represented by ***(X_train, y_train)*** and ***(X_test, y_test)***.
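
The effect of `num_words` can be sketched in plain Python. The helper name `cap_vocabulary` and the sample review below are hypothetical; the sketch assumes the Keras default out-of-vocabulary index of 2.

```python
def cap_vocabulary(sequence_ids, num_words, oov_index=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in sequence_ids]

review = [1, 14, 22, 4999, 5000, 12873, 9]  # hypothetical encoded review
capped = cap_vocabulary(review, num_words=5000)
print(capped)  # [1, 14, 22, 4999, 2, 2, 9]
```

Rare words all collapse onto the same index, which is where the information loss mentioned in the note comes from.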

In [None]:
# Load the dataset but only keep the top n words; rarer words map to the out-of-vocabulary index
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


All sequences in both the training and test sets are brought to a consistent length of 500. Sequences longer than 500 are truncated, and shorter ones are padded with zeros; by default, `pad_sequences` pads and truncates at the start of a sequence. This step is crucial for creating uniform input dimensions when training neural networks, which typically expect fixed-size input sequences.


**Note:** *Reducing max_review_length saves computation, at the cost of potential information loss and reduced context.*







In [None]:
# Truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
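
To make the padding and truncation behavior concrete, here is a pure-Python sketch of what `pad_sequences` does to a single sequence. The helper `pad_or_truncate` is a hypothetical stand-in written for this example, not the Keras implementation; its defaults mirror the Keras defaults of `'pre'` padding and `'pre'` truncation.

```python
def pad_or_truncate(ids, maxlen, padding='pre', truncating='pre', value=0):
    """Pure-Python sketch of what pad_sequences does to one sequence."""
    if len(ids) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end
        ids = ids[-maxlen:] if truncating == 'pre' else ids[:maxlen]
    pad = [value] * (maxlen - len(ids))
    return pad + ids if padding == 'pre' else ids + pad

print(pad_or_truncate([5, 6, 7], maxlen=5))                  # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))      # [3, 4, 5, 6, 7]
print(pad_or_truncate([5, 6, 7], maxlen=5, padding='post'))  # [5, 6, 7, 0, 0]
```

With the defaults, a long review keeps its ending rather than its beginning.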

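Converts the integer sequences back into text. Each integer is an index into the IMDB vocabulary, shifted by 3 because indices 0, 1, and 2 are reserved for padding, start-of-sequence, and out-of-vocabulary words. Decoding through `imdb.get_word_index()` matters here because the BERT and RoBERTa tokenizers expect words, not strings of digits.

In [None]:
# Convert integer sequences back to words (indices are offset by 3; 0, 1, 2 are pad/start/unknown and are skipped)
index_to_word = {i + 3: w for w, i in imdb.get_word_index().items()}
def decode_review(seq):
    return ' '.join(index_to_word[i] for i in seq if i > 2)
X_train_texts = [decode_review(x) for x in X_train]
X_test_texts = [decode_review(x) for x in X_test]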

**Tokenization is the process of breaking down text into individual tokens or subwords.**

Tokenizing input sequences using two different transformer-based models:

# **BERT** (Bidirectional Encoder Representations from Transformers)

***bert_tokenizer:*** Initializes a tokenizer for the BERT model using the vocabulary of the 'bert-base-uncased' pre-trained checkpoint.

***max_sequence_length:*** Specifies the maximum length of the tokenized sequences.

***X_train_bert and X_test_bert:*** Lists that will store the tokenized representations of the input sequences for the training and test sets.

# **RoBERTa** (Robustly optimized BERT approach)

***roberta_tokenizer:*** Initializes a tokenizer for the RoBERTa model using the vocabulary of the 'roberta-base' pre-trained checkpoint.

***X_train_roberta and X_test_roberta:*** Lists that will store the tokenized representations of the input sequences for the training and test sets using RoBERTa.
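
To give a feel for subword tokenization, the sketch below splits a word with the greedy longest-match-first strategy used by WordPiece, the scheme behind BERT's tokenizer. The tiny vocabulary and the function name are made up for illustration; the real vocabulary has roughly 30,000 entries, and RoBERTa uses byte-level BPE instead, which follows different rules.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {'un', '##watch', '##able', 'movie', '##s'}  # hypothetical vocabulary
print(wordpiece_tokenize('unwatchable', toy_vocab))  # ['un', '##watch', '##able']
print(wordpiece_tokenize('movies', toy_vocab))       # ['movie', '##s']
print(wordpiece_tokenize('xyz', toy_vocab))          # ['[UNK]']
```

Because one word can become several pieces, a review of 500 words may produce more than 500 token IDs, which is why `max_length` and `truncation=True` are passed to `encode`.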



In [None]:
# Tokenize the input sequences for BERT
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_sequence_length = 512
X_train_bert = [bert_tokenizer.encode(text, add_special_tokens=True, max_length=max_sequence_length, truncation=True) for text in X_train_texts]
X_test_bert = [bert_tokenizer.encode(text, add_special_tokens=True, max_length=max_sequence_length, truncation=True) for text in X_test_texts]

# Tokenize the input sequences for RoBERTa
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
X_train_roberta = [roberta_tokenizer.encode(text, add_special_tokens=True, max_length=max_sequence_length, truncation=True) for text in X_train_texts]
X_test_roberta = [roberta_tokenizer.encode(text, add_special_tokens=True, max_length=max_sequence_length, truncation=True) for text in X_test_texts]



**Loading pre-trained transformer models, using the Hugging Face Transformers library.**

# BERT (Bidirectional Encoder Representations from Transformers)

***TFBertModel:*** Initializes an instance of the BERT model architecture using TensorFlow.

***from_pretrained('bert-base-uncased'):*** Loads pre-trained weights and configurations for the 'bert-base-uncased' variant of BERT. This variant is uncased, meaning it doesn't distinguish between uppercase and lowercase letters.


# RoBERTa (Robustly optimized BERT approach)

***TFRobertaModel:*** Initializes an instance of the RoBERTa model architecture using TensorFlow.

***from_pretrained('roberta-base'):*** Loads pre-trained weights and configurations for the 'roberta-base' variant of RoBERTa.



In [None]:
# Load pre-trained BERT model
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Load pre-trained RoBERTa model
roberta_model = TFRobertaModel.from_pretrained('roberta-base')



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.layer_norm.bias', 'lm_head.dense.weight', 'roberta.embeddings.position_ids', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaModel were not initialized from the PyTorch model and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infe

**Building two separate neural network models, one for BERT and one for RoBERTa, using the Keras library.**

# Input Layers:

**Input:** Defines the input layers for the models. Each input layer represents the tokenized input sequences for BERT and RoBERTa, with a specified shape (max_sequence_length) and data type ('int32').

# BERT Model Layers:

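**bert_output:** Passes the BERT input through the pre-trained BERT model (bert_model) and takes element `[0]` of its output, the last hidden state, which holds one 768-dimensional vector per token.

**Flatten:** Flattens the per-token vectors of each example into a single one-dimensional vector (512 × 768 = 393,216 values).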

**Dense:** Adds a fully connected layer with one output unit and a sigmoid activation function, suitable for binary classification tasks.

**Model:** Defines the complete BERT model, specifying the input and output layers.

# RoBERTa Model Layers:
**roberta_output:** Passes the RoBERTa input through the pre-trained RoBERTa model (roberta_model) and takes element `[0]` of its output, the last hidden state, which holds one 768-dimensional vector per token.

**Flatten:** Flattens the per-token vectors of each example into a single one-dimensional vector (512 × 768 = 393,216 values).

**Dense:** Adds a fully connected layer with one output unit and a sigmoid activation function, suitable for binary classification tasks.

**Model:** Defines the complete RoBERTa model, specifying the input and output layers.
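
The shapes involved can be checked with a small NumPy sketch. A random array stands in for the encoder's last hidden state, and the first-token slice is shown only as a common alternative for comparison, not as what this notebook does.

```python
import numpy as np

batch, seq_len, hidden = 2, 512, 768  # 768 is the hidden size of the base models
# Stand-in for the last hidden state returned by the encoder (random values)
last_hidden_state = np.random.default_rng(0).normal(size=(batch, seq_len, hidden))

flattened = last_hidden_state.reshape(batch, -1)  # what Flatten() produces
first_token = last_hidden_state[:, 0, :]          # vector at the [CLS] / <s> position

print(flattened.shape)    # (2, 393216)
print(first_token.shape)  # (2, 768)

# Weights plus bias of a Dense(1) head on top of each representation
print(flattened.shape[1] + 1)    # 393217
print(first_token.shape[1] + 1)  # 769
```

Flattening gives the classification head a separate weight for every position and dimension, so the head is far larger than one that reads a single summary vector.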

In [None]:
# Define input layers for BERT and RoBERTa
input_layer_bert = Input(shape=(max_sequence_length,), dtype='int32')
input_layer_roberta = Input(shape=(max_sequence_length,), dtype='int32')

# BERT layers
bert_output = bert_model(input_layer_bert)[0]
flatten_layer_bert = Flatten()(bert_output)
output_layer_bert = Dense(1, activation='sigmoid')(flatten_layer_bert)
model_bert = Model(inputs=input_layer_bert, outputs=output_layer_bert)

# RoBERTa layers
roberta_output = roberta_model(input_layer_roberta)[0]
flatten_layer_roberta = Flatten()(roberta_output)
output_layer_roberta = Dense(1, activation='sigmoid')(flatten_layer_roberta)
model_roberta = Model(inputs=input_layer_roberta, outputs=output_layer_roberta)


This cell controls which pre-trained layers are trainable. For BERT, only layers whose names start with 'pooler' are meant to remain trainable; every other layer is frozen. All layers in the RoBERTa model are frozen, preserving its pre-trained knowledge. In both models, the final Dense layer stays trainable.

Note that `bert_model.layers` typically contains a single top-level layer named 'bert', so the 'pooler' check may never match, in which case all of BERT is frozen as well. Freezing the encoders keeps training cheap and reuses the information embedded in these transformer models, but only the classification head adapts to the task.

In [None]:
# Freeze the pre-trained weights: keep only BERT layers named 'pooler*' trainable, freeze all of RoBERTa
for layer in bert_model.layers:
    if layer.name.startswith('pooler'):
        layer.trainable = True
    else:
        layer.trainable = False

for layer in roberta_model.layers:
    layer.trainable = False


# ***Compile Method***

This method configures the model for training. Three arguments are passed here:

**loss:** Specifies the loss function to measure the model's performance during training. Here, it's set to 'binary_crossentropy', which is suitable for binary classification tasks.

**optimizer:** Determines the optimization algorithm for adjusting the model's weights during training. 'adam' is a popular optimizer that adapts the step size for each parameter; passing the string uses its default settings.

**metrics:** A list of metrics used to evaluate the model's performance. In this case, it includes 'accuracy' to track classification accuracy during training.
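
As a reference point for reading the loss values, here is a small pure-Python sketch of binary cross-entropy. The function is written for illustration and the example probabilities are made up; Keras computes the same quantity on tensors.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident = binary_crossentropy(labels, [0.9, 0.1, 0.8, 0.2])
undecided = binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5])
print(confident < undecided)  # True
print(round(undecided, 4))    # 0.6931
```

A model that always predicts 0.5 scores ln 2 ≈ 0.693, so a loss well above that value means the model is confidently wrong on many examples.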

In [None]:
# Compile the models
model_bert.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_roberta.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


# pad_sequences Function: This function pads sequences to a specified maximum length.

***X_train_bert and X_test_bert:*** It pads the BERT sequences in both training and testing datasets to the max_sequence_length using post-padding and post-truncation.

***X_train_roberta and X_test_roberta:*** Similarly, it pads the RoBERTa sequences in both training and testing datasets to the same max_sequence_length with post-padding and post-truncation.

In [None]:
# Pad BERT sequences
X_train_bert = sequence.pad_sequences(X_train_bert, maxlen=max_sequence_length, padding='post', truncating='post')
X_test_bert = sequence.pad_sequences(X_test_bert, maxlen=max_sequence_length, padding='post', truncating='post')

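# Pad RoBERTa sequences with RoBERTa's own padding ID (index 0 is its <s> start token, not padding)
X_train_roberta = sequence.pad_sequences(X_train_roberta, maxlen=max_sequence_length, padding='post', truncating='post', value=roberta_tokenizer.pad_token_id)
X_test_roberta = sequence.pad_sequences(X_test_roberta, maxlen=max_sequence_length, padding='post', truncating='post', value=roberta_tokenizer.pad_token_id)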
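
Padded positions carry no content, and transformer models are usually told to ignore them through an attention mask. The cells in this notebook do not pass one, so the following is only a sketch of how such a mask could be derived from padded IDs. The token IDs in the toy batches are hypothetical; the padding IDs (0 for 'bert-base-uncased', 1 for 'roberta-base') are those of the two checkpoints.

```python
import numpy as np

def attention_mask(padded_ids, pad_token_id):
    """Return 1 for real tokens and 0 for padding positions."""
    return (np.asarray(padded_ids) != pad_token_id).astype('int32')

# Hypothetical post-padded batches; BERT pads with 0, RoBERTa with 1
bert_batch = [[101, 2023, 3185, 102, 0, 0]]
roberta_batch = [[0, 713, 1569, 2, 1, 1]]
print(attention_mask(bert_batch, pad_token_id=0))     # [[1 1 1 1 0 0]]
print(attention_mask(roberta_batch, pad_token_id=1))  # [[1 1 1 1 0 0]]
```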


The ***epoch*** represents one full pass through the entire training dataset. A single epoch may be insufficient for complex tasks, requiring multiple passes to optimize the model. However, too many epochs can lead to overfitting.

The ***batch size*** determines the number of training samples processed in one iteration. A larger batch offers computational efficiency but might sacrifice generalization. Smaller batches add gradient noise, which can act as a regularizer, at the cost of more update steps per epoch. Together, the number of epochs and the batch size set how many weight updates the model receives, so they trade off convergence against resource use.

After training, model evaluation on the test data reveals their generalization performance, crucial for assessing real-world effectiveness. The printed loss and accuracy metrics offer insights into the models' predictive capabilities.
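
The relationship between epochs, batch size, and the number of weight updates can be worked out directly. The sketch below uses the 25,000 reviews of the IMDB training split and the three epoch and batch-size settings tried in this notebook; the helper name is an assumption.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

num_train = 25000  # size of the IMDB training split
for epochs, batch_size in [(1, 64), (1, 32), (3, 64)]:
    steps = steps_per_epoch(num_train, batch_size)
    print(epochs, batch_size, steps, epochs * steps)
# 1 64 391 391
# 1 32 782 782
# 3 64 391 1173
```

Halving the batch size doubles the number of updates per epoch, and tripling the epochs triples the total, which is the main driver of training time on a limited Colab runtime.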

In [None]:
# Train the models
model_bert.fit(X_train_bert, y_train, epochs=1, batch_size=64)
model_roberta.fit(X_train_roberta, y_train, epochs=1, batch_size=64)

# Evaluate BERT model (the test inputs were already padded above)
print("Evaluating BERT model:")
loss_bert, accuracy_bert = model_bert.evaluate(X_test_bert, y_test)
print(f"BERT Model - Loss: {loss_bert}, Accuracy: {accuracy_bert}")

# Evaluate RoBERTa model (the test inputs were already padded above)
print("\nEvaluating RoBERTa model:")
loss_roberta, accuracy_roberta = model_roberta.evaluate(X_test_roberta, y_test)
print(f"RoBERTa Model - Loss: {loss_roberta}, Accuracy: {accuracy_roberta}")


Evaluating BERT model:
BERT Model - Loss: 5.510513782501221, Accuracy: 0.5192400217056274

Evaluating RoBERTa model:
RoBERTa Model - Loss: 3.5839855670928955, Accuracy: 0.5161200165748596


In [None]:
# Train the models
model_bert.fit(X_train_bert, y_train, epochs=1, batch_size=32)
model_roberta.fit(X_train_roberta, y_train, epochs=1, batch_size=32)

# Evaluate BERT model (the test inputs were already padded above)
print("Evaluating BERT model:")
loss_bert, accuracy_bert = model_bert.evaluate(X_test_bert, y_test)
print(f"BERT Model - Loss: {loss_bert}, Accuracy: {accuracy_bert}")

# Evaluate RoBERTa model (the test inputs were already padded above)
print("\nEvaluating RoBERTa model:")
loss_roberta, accuracy_roberta = model_roberta.evaluate(X_test_roberta, y_test)
print(f"RoBERTa Model - Loss: {loss_roberta}, Accuracy: {accuracy_roberta}")


Evaluating BERT model:
BERT Model - Loss: 4.223028659820557, Accuracy: 0.5532000064849854

Evaluating RoBERTa model:
RoBERTa Model - Loss: 12.79500961303711, Accuracy: 0.5004799962043762


In [None]:
# Train the models
model_bert.fit(X_train_bert, y_train, epochs=3, batch_size=64)
model_roberta.fit(X_train_roberta, y_train, epochs=3, batch_size=64)

# Evaluate BERT model (the test inputs were already padded above)
print("Evaluating BERT model:")
loss_bert, accuracy_bert = model_bert.evaluate(X_test_bert, y_test)
print(f"BERT Model - Loss: {loss_bert}, Accuracy: {accuracy_bert}")

# Evaluate RoBERTa model (the test inputs were already padded above)
print("\nEvaluating RoBERTa model:")
loss_roberta, accuracy_roberta = model_roberta.evaluate(X_test_roberta, y_test)
print(f"RoBERTa Model - Loss: {loss_roberta}, Accuracy: {accuracy_roberta}")


Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3

Note:

*We were unable to complete this training run because of compute limits and the limited credits available on a free Google Colab account.*