<a href="https://colab.research.google.com/github/ubinix-warun/mad-bootcamp-2024/blob/main/colab/Fine_tuning_BERT_Model_for_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Classification Using BERT**

## **Step 1: Import the necessary libraries**

In [None]:
# Reference: https://www.geeksforgeeks.org/sentiment-classification-using-bert/

# Import preprocessing library
import os
import pandas as pd
from bs4 import BeautifulSoup
import re

# Import modeling library
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Others
import warnings
warnings.filterwarnings("ignore")

## **Step 2: Load the dataset**

In [None]:
# Load IMDB dataset: A dataset for binary sentiment classification with 25,000 highly polar movie reviews for training, and 25,000 for testing
dataset = tf.keras.utils.get_file(
	fname="aclImdb.tar.gz",
	origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
	cache_dir=os.getcwd(),
	extract=True)

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [None]:
# Set directory path (Dataset Explanation: https://deepnote.com/app/ronakv/Sentiment-Analysis-9cb468b0-9200-400f-9896-e4e9d46dbc48)
dataset_dir = os.path.dirname(dataset)
imdb_dir = os.path.join(dataset_dir, 'aclImdb')
train_dir = os.path.join(imdb_dir,'train')
test_dir = os.path.join(imdb_dir,'test')

os.listdir(train_dir)

['unsupBow.feat',
 'neg',
 'urls_pos.txt',
 'unsup',
 'pos',
 'urls_unsup.txt',
 'urls_neg.txt',
 'labeledBow.feat']

In [None]:
def load_dataset(directory):
	data = {"sentence": [], "sentiment": []}
	for file_name in os.listdir(directory):
		if file_name == 'pos':
			positive_dir = os.path.join(directory, file_name)
			for text_file in os.listdir(positive_dir):
				text = os.path.join(positive_dir, text_file)
				with open(text, "r", encoding="utf-8") as f:
					data["sentence"].append(f.read())
					data["sentiment"].append(1)
		elif file_name == 'neg':
			negative_dir = os.path.join(directory, file_name)
			for text_file in os.listdir(negative_dir):
				text = os.path.join(negative_dir, text_file)
				with open(text, "r", encoding="utf-8") as f:
					data["sentence"].append(f.read())
					data["sentiment"].append(0)

	return pd.DataFrame.from_dict(data)

In [None]:
# Load the dataset from the train_dir and test_dir
train_df = load_dataset(train_dir)
test_df = load_dataset(test_dir)

In [None]:
# Training set
train_df.sample(n=5, random_state=1)

index,sentence,sentiment
21492,What does the Marquis de Sade have to do with ...,1
9488,The fact that this movie made it all the way t...,0
16933,"Hitokiri (which translates roughly as ""assassi...",1
12604,"Reviewed at the Sept 12, 2006 2nd screening at...",1
8222,Closer to reality and containing more depth th...,0


In [None]:
# Test set
test_df.sample(n=5, random_state=1)

index,sentence,sentiment
21492,I'm a 55-year-old fairly jaded gay white man. ...,1
9488,This movie came very close to being a good fli...,0
16933,I'll make this brief. This was a joy to watch....,1
12604,Lana Turner proved that she could really dance...,1
8222,Isaac Florentine has made some of the best wes...,0


In [None]:
# Test sentence example
test_df.loc[8222, 'sentence']

"Isaac Florentine has made some of the best western Martial Arts action movies ever produced. In particular US Seals 2, Cold Harvest, Special Forces and Undisputed 2 are all action classics. You can tell Isaac has a real passion for the genre and his films are always eventful, creative and sharp affairs, with some of the best fight sequences an action fan could hope for. In particular he has found a muse with Scott Adkins, as talented an actor and action performer as you could hope for. This is borne out with Special Forces and Undisputed 2, but unfortunately The Shepherd just doesn't live up to their abilities.<br /><br />There is no doubt that JCVD looks better here fight-wise than he has done in years, especially in the fight he has (for pretty much no reason) in a prison cell, and in the final showdown with Scott, but look in his eyes. JCVD seems to be dead inside. There's nothing in his eyes at all. It's like he just doesn't care about anything throughout the whole film. And this 

## **Step 3: Preprocessing**

In [None]:
# Clean texts
def text_cleaning(text):
	soup = BeautifulSoup(text, "html.parser")
	text = soup.get_text()
	pattern = r"[^a-zA-Z0-9\s,']"
	text = re.sub(pattern, '', text)
	return text

A regular expression (regex) is a sequence of characters that defines a search pattern for matching and manipulating text.

The pattern [^a-zA-Z0-9\s,'] matches any single character that is not one of the following:
*   a-z: Any lowercase letter.
*   A-Z: Any uppercase letter.
*   0-9: Any digit.
*   \s: Any whitespace character (such as spaces, tabs, or newlines).
*   ,: The comma character.
*   ': The apostrophe character.

The ^ immediately after the opening square bracket negates the character set, so the pattern matches every character not listed. Replacing each match with an empty string therefore strips punctuation such as periods, exclamation marks, and parentheses, while keeping letters, digits, whitespace, commas, and apostrophes.

In [None]:
# Ex.1
test = "<br /><br />(Wow!!!) He's very smart."
test_1 = BeautifulSoup(test, "html.parser").get_text()
print(test_1)

(Wow!!!) He's very smart.


In [None]:
# Ex.2
test_2 = re.sub(r"[^a-zA-Z0-9\s,']", '', test_1)
print(test_2)

Wow He's very smart
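
One side effect worth knowing: `get_text()` drops `<br />` tags without inserting any whitespace, so in a review such as the test sentence above ("...abilities.<br /><br />There is no doubt...") the words on either side of the line break end up joined once the period is removed. The sketch below is one possible way to avoid that, using only the standard library. The function name `clean_review` and the sample string are illustrative assumptions, and the tag-stripping regex is only a rough stand-in for BeautifulSoup, not a replacement for it.

```python
import re

def clean_review(text):
    # Hypothetical variant of text_cleaning that keeps words apart at line breaks
    # Replace HTML line breaks with a space so neighbouring words are not glued together
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    # Drop any remaining tags (rough approximation of BeautifulSoup's get_text)
    text = re.sub(r"<[^>]+>", "", text)
    # Keep only letters, digits, whitespace, commas and apostrophes
    text = re.sub(r"[^a-zA-Z0-9\s,']", "", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "It was great.<br /><br />There is no doubt, he's good!"
print(clean_review(sample))
```

Whether this matters in practice depends on the data; BERT's WordPiece tokenizer can still split a glued word into known pieces, but the pieces may differ from those of the two original words.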


In [None]:
# Train dataset
train_df['Cleaned_sentence'] = train_df['sentence'].apply(text_cleaning)
Reviews = train_df['Cleaned_sentence']
Target = train_df['sentiment']

# Test dataset
test_df['Cleaned_sentence'] = test_df['sentence'].apply(text_cleaning)
test_reviews = test_df['Cleaned_sentence']
test_targets = test_df['sentiment']

In [None]:
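# Split the held-out test data 50/50 into validation and test sets, keeping the class balance in each half
x_val, x_test, y_val, y_test = train_test_split(test_reviews,
												test_targets,
												test_size=0.5,
												stratify=test_targets)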
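
To illustrate what `stratify` does, here is a minimal pure-Python sketch of a stratified 50/50 split: indices are grouped by label, each group is shuffled, and half of each group goes to each side. The helper `stratified_halves`, its `seed` parameter, and the toy labels are assumptions made for illustration; this is not how scikit-learn implements `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_halves(labels, seed=0):
    """Split indices 50/50 so that each half keeps the same label proportions."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    first, second = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        mid = len(idxs) // 2
        first.extend(idxs[:mid])
        second.extend(idxs[mid:])
    return first, second

# Toy labels: six positive (1) and six negative (0) reviews
labels = [1] * 6 + [0] * 6
val_idx, test_idx = stratified_halves(labels)
print(sum(labels[i] for i in val_idx), sum(labels[i] for i in test_idx))
```

Each half receives three positive and three negative examples, which mirrors the balanced validation and test sets produced by the call above.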

## **Step 4: Tokenization & Encoding**

In [None]:
# Tokenize and encode the data using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


In [None]:
# Ex.1
sentence = "Her style is very conversational"
tokenizer.tokenize(sentence)

['her', 'style', 'is', 'very', 'conversation', '##al']

In [None]:
# Ex.2
encoding = tokenizer.encode(sentence)
encoding

[101, 2014, 2806, 2003, 2200, 4512, 2389, 102]

In [None]:
# Ex.3
tokenizer.convert_ids_to_tokens(encoding)

['[CLS]', 'her', 'style', 'is', 'very', 'conversation', '##al', '[SEP]']
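
The split of "conversational" into 'conversation' and '##al' comes from WordPiece, which breaks a word into the longest pieces found in the vocabulary, scanning left to right. The following is a simplified sketch of that greedy longest-match-first idea. The function `wordpiece_tokenize` and the six-entry `toy_vocab` are illustrative assumptions; the real tokenizer also handles punctuation, special tokens, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (hypothetical, just large enough for this sentence)
toy_vocab = {"her", "style", "is", "very", "conversation", "##al"}
sentence = "Her style is very conversational"
pieces = [p for w in sentence.lower().split() for p in wordpiece_tokenize(w, toy_vocab)]
print(pieces)
```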

In [None]:
# Ex.4
tokenizer.batch_encode_plus(["Her style is very conversational", "Her style is good"],
											padding=True,
											truncation=True,
											max_length=128,
											return_tensors='tf')

{'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
array([[ 101, 2014, 2806, 2003, 2200, 4512, 2389,  102],
       [ 101, 2014, 2806, 2003, 2204,  102,    0,    0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0]], dtype=int32)>}
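
The output above shows what `padding=True` does: the shorter sequence is extended with the padding id 0 up to the longest sequence in the batch, and the attention mask marks real tokens with 1 and padding with 0. A minimal pure-Python sketch of that step, reusing the token ids from the output above (the helper name `pad_batch` is an assumption, not a tokenizer API):

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention masks."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(s + [pad_id] * n_pad)
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 2014, 2806, 2003, 2200, 4512, 2389, 102],  # "Her style is very conversational"
    [101, 2014, 2806, 2003, 2204, 102],              # "Her style is good"
]
ids, mask = pad_batch(batch)
print(ids[1])
print(mask[1])
```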

In [None]:
max_len = 128

# Tokenize and encode the sentences
X_train_encoded = tokenizer.batch_encode_plus(Reviews.tolist(),
											padding=True,
											truncation=True,
											max_length = max_len,
											return_tensors='tf')

X_val_encoded = tokenizer.batch_encode_plus(x_val.tolist(),
											padding=True,
											truncation=True,
											max_length = max_len,
											return_tensors='tf')

X_test_encoded = tokenizer.batch_encode_plus(x_test.tolist(),
											padding=True,
											truncation=True,
											max_length = max_len,
											return_tensors='tf')

In [None]:
k = 4
print('Training Comments -->', Reviews[k])
print('\nInput Ids -->\n', X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->\n', tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->\n', X_train_encoded['attention_mask'][k])
print('\nLabels -->', Target[k])

Training Comments --> This is a movie that is bad in every imaginable way Sure we like to know what happened 12 years from the last movie, and it works on some level But the new characters are just not interesting Baby Melody is hideously horrible Alas, while the logic that humans can't stay underwater forever is maintained, other basic physical logic are ignored It's chilly if you don't have cold weather garments if you're in the Arctic I don't know why most comments here Return of Jafar rates worse, I thought this one is more horrible

Input Ids -->
 tf.Tensor(
[  101  2023  2003  1037  3185  2008  2003  2919  1999  2296 10047 22974
 22966  2126  2469  2057  2066  2000  2113  2054  3047  2260  2086  2013
  1996  2197  3185  1010  1998  2009  2573  2006  2070  2504  2021  1996
  2047  3494  2024  2074  2025  5875  3336  8531  2003 22293  2135  9202
 21862  2015  1010  2096  1996  7961  2008  4286  2064  1005  1056  2994
 11564  5091  2003  5224  1010  2060  3937  3558  7961  2024  643

## **Step 5: Build the classification model**

In [None]:
# Initialize the model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Compile the model with an appropriate optimizer, loss function, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
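
`from_logits=True` tells the loss that the model outputs raw, unnormalized scores rather than probabilities, so the softmax is applied inside the loss. For a single example, the value is the negative log of the softmax probability assigned to the true class. A minimal pure-Python sketch of that per-example computation follows; the function name and the sample logits are illustrative assumptions, and Keras additionally averages the result over the batch.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example, given raw logits and an integer class label."""
    # log-sum-exp with the maximum subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(softmax(logits)[label]) simplifies to log_sum_exp - logits[label]
    return log_sum_exp - logits[label]

uncertain = sparse_categorical_crossentropy_from_logits([0.0, 0.0], 1)   # equal scores
confident = sparse_categorical_crossentropy_from_logits([-2.0, 3.0], 1)  # strongly favours class 1
print(uncertain, confident)
```

With equal logits the loss is ln 2 (about 0.693), the value expected from an untrained binary classifier; it shrinks toward 0 as the logit of the correct class grows relative to the other.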

In [None]:
# Train the model
# The encodings are passed as a dict so that each tensor is matched to the model argument of the same name;
# a positional list is read in the order input_ids, attention_mask, token_type_ids
history = model.fit(
	dict(X_train_encoded),
	Target,
	validation_data=(dict(X_val_encoded), y_val),
	batch_size=32,
	epochs=3
)

Epoch 1/3


Epoch 2/3
Epoch 3/3


## **Step 6: Evaluate the model**

In [None]:
# Evaluate the model on the test data
# The encodings are passed as a dict so that each tensor is matched to the model argument of the same name
test_loss, test_accuracy = model.evaluate(
	dict(X_test_encoded),
	y_test
)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')

Test loss: 0.35724735260009766, Test accuracy: 0.8761600255966187


In [None]:
path = 'path-to-save'
# Save tokenizer
tokenizer.save_pretrained(path +'/Tokenizer')

# Save model
model.save_pretrained(path +'/Model')

In [None]:
# Load tokenizer
bert_tokenizer = BertTokenizer.from_pretrained(path +'/Tokenizer')

# Load model
bert_model = TFBertForSequenceClassification.from_pretrained(path +'/Model')

Some layers from the model checkpoint at path-to-save/Model were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at path-to-save/Model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [None]:
# Perform a more in-depth evaluation
# The encodings are passed as a dict so that each tensor is matched to the model argument of the same name
pred = bert_model.predict(dict(X_test_encoded))

# pred is of type TFSequenceClassifierOutput
logits = pred.logits

# Use argmax along the appropriate axis to get the predicted labels
pred_labels = tf.argmax(logits, axis=1)

# Convert the predicted labels to a NumPy array
pred_labels = pred_labels.numpy()

label = {
	1: 'Positive',
	0: 'Negative'
}

# Map the predicted labels to their corresponding strings using the label dictionary
pred_labels = [label[i] for i in pred_labels]
Actual = [label[i] for i in y_test]

print('Predicted Label :', pred_labels[:10])
print('Actual Label :', Actual[:10])

Predicted Label : ['Positive', 'Negative', 'Negative', 'Negative', 'Negative', 'Positive', 'Positive', 'Negative', 'Positive', 'Negative']
Actual Label : ['Positive', 'Negative', 'Positive', 'Negative', 'Negative', 'Positive', 'Positive', 'Negative', 'Positive', 'Negative']


In [None]:
print("Classification Report: \n", classification_report(Actual, pred_labels))

Classification Report: 
               precision    recall  f1-score   support

    Negative       0.84      0.93      0.88      6250
    Positive       0.92      0.83      0.87      6250

    accuracy                           0.88     12500
   macro avg       0.88      0.88      0.88     12500
weighted avg       0.88      0.88      0.88     12500
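
For reference, the precision, recall, and F1 columns can be computed by hand from the true and predicted labels. The sketch below does so in plain Python for the ten predictions printed earlier, where only the third example is misclassified. The helper `precision_recall_f1` is an illustrative assumption rather than a scikit-learn function, and because it uses only ten examples its numbers differ from the full report.

```python
def precision_recall_f1(actual, predicted, positive):
    """Precision, recall and F1 for one class, treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# First ten labels from the output above
predicted = ['Positive', 'Negative', 'Negative', 'Negative', 'Negative',
             'Positive', 'Positive', 'Negative', 'Positive', 'Negative']
actual = ['Positive', 'Negative', 'Positive', 'Negative', 'Negative',
          'Positive', 'Positive', 'Negative', 'Positive', 'Negative']

for cls in ('Negative', 'Positive'):
    p, r, f = precision_recall_f1(actual, predicted, cls)
    print(f"{cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```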



## **Step 7: Prediction with user inputs**

In [None]:
def Get_sentiment(Review, Tokenizer, Model):
	# Wrap a single review in a list so the tokenizer always receives a batch
	if not isinstance(Review, list):
		Review = [Review]

	# Encode the batch; the tensors are passed to the model by name rather than by position,
	# so the result does not depend on the order of the keys in the encoding
	encoded = Tokenizer.batch_encode_plus(Review,
										padding=True,
										truncation=True,
										max_length=128,
										return_tensors='tf')
	prediction = Model.predict(dict(encoded))

	# Use argmax along the class axis to get the predicted label ids
	pred_labels = tf.argmax(prediction.logits, axis=1)

	# Map each label id to its string using the label dictionary defined in Step 6
	pred_labels = [label[i] for i in pred_labels.numpy().tolist()]
	return pred_labels

In [None]:
Review ='''Bahubali is a blockbuster Indian movie that was released in 2015.
It is the first part of a two-part epic saga that tells the story of a legendary hero who fights for his kingdom and his love.
The movie has received rave reviews from critics and audiences alike for its stunning visuals,
spectacular action scenes, and captivating storyline.'''
Get_sentiment(Review, bert_tokenizer, bert_model)



['Positive']