# Sentiment Classification using BERT

Sentiment classification is a common business use case for any company that sells products or services. Companies use customer sentiment to improve existing products and services, and even to build new ones based on the feedback they receive. It helps them make informed decisions.

### Approach & Objective

We will build a sentiment classification model using BERT. Training and evaluating the model requires a labelled dataset; if the data is not labelled, it has to be labelled manually first. The workflow is:

- Data Collection
- Data Labelling
- Convert target variables into the numeric form
- Text Preprocessing
    - tokenization (subword tokenization handled by the WordPiece tokenizer)
    - text encoding
    - text embedding
- Model training
- Model Evaluation
- Model Serving
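
To make the subword tokenization step more concrete, here is a minimal sketch of the greedy longest-match-first idea behind WordPiece. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and exist only for illustration; the real `bert-base-uncased` tokenizer also handles punctuation, accents, and special tokens, which this sketch ignores.

```python
# Toy vocabulary: hypothetical, chosen for illustration only.
# Real BERT vocabularies have roughly 30,000 entries.
TOY_VOCAB = {"[UNK]", "play", "##ing", "##ed", "un", "##happy", "happy", "the"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subwords by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


print(tokenize("The unhappy playing", TOY_VOCAB))
# ['the', 'un', '##happy', 'play', '##ing']
```

Words that are not in the vocabulary are broken into known pieces, so the model rarely has to fall back to `[UNK]`.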

### Installing all the dependencies

In [26]:
!pip install -qU transformers torch

### Loading & Preparing Data

In [27]:
import pandas as pd

data_location = "/content/drive/MyDrive/sentiment_analysis_data/test.csv"

df = pd.read_csv(data_location)
df.head()

| | text | sentiment |
|---|---|---|
| 0 | Last session of the day http://twitpic.com/67ezh | neutral |
| 1 | Shanghai is also really exciting (precisely -... | positive |
| 2 | Recession hit Veronique Branquinho, she has to... | negative |
| 3 | happy bday! | positive |
| 4 | http://twitpic.com/4w75p - I like it!! | positive |


In [28]:
df[['sentiment']].value_counts()

sentiment
neutral      1430
positive     1103
negative     1001
dtype: int64

In [29]:
# Sample equal number of samples from each class
min_class_count = df['sentiment'].value_counts().min()
samples_per_class = min_class_count

# Sample data from each class
sampled_data = pd.concat([df[df['sentiment'] == category].sample(samples_per_class) for category in df['sentiment'].unique()])

# Reset the index of the sampled data
sampled_data.reset_index(drop=True, inplace=True)

sampled_data['sentiment'].value_counts()

neutral     1001
positive    1001
negative    1001
Name: sentiment, dtype: int64
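
The cell above draws a different random subset on every run because no seed is set (pandas' `sample` accepts a `random_state` argument for this). The sketch below shows the same downsampling logic with a fixed seed, using only the standard library. The `rows` list is a stand-in built from the class counts printed earlier, not the real tweets, and `downsample_balanced` is a hypothetical helper name.

```python
import random
from collections import Counter

# Class counts taken from the value_counts() output above.
class_counts = {"neutral": 1430, "positive": 1103, "negative": 1001}

# Stand-in dataset: (row_id, label) pairs instead of real tweets.
rows = [(f"{label}-{i}", label) for label, n in class_counts.items() for i in range(n)]


def downsample_balanced(rows, seed=42):
    """Keep as many rows per class as the smallest class has."""
    rng = random.Random(seed)  # fixed seed so reruns pick the same rows
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    n_keep = min(len(group) for group in by_class.values())
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, n_keep))
    return sampled


balanced = downsample_balanced(rows)
print(Counter(label for _, label in balanced))
```

Each class ends up with 1001 rows, matching the smallest class, and repeated calls with the same seed return the same rows.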

In [30]:
sampled_data.head()

| | text | sentiment |
|---|---|---|
| 0 | And Clang rocks, so you`re using it, right? ... | neutral |
| 1 | tell me what you think of Pride Prejudice and... | neutral |
| 2 | Change of plans, working inside bar tonight | neutral |
| 3 | Im so tired and sick i have to be better on S... | neutral |
| 4 | sad that will have to leave my beautiful apart... | neutral |


In [31]:
# Converting sentiment class label into numeric form
class_to_label = {"negative": 0, "neutral": 1, "positive": 2}

sampled_data['sentiment_score'] = sampled_data['sentiment'].map(class_to_label)
sampled_data.head()

| | text | sentiment | sentiment_score |
|---|---|---|---|
| 0 | And Clang rocks, so you`re using it, right? ... | neutral | 1 |
| 1 | tell me what you think of Pride Prejudice and... | neutral | 1 |
| 2 | Change of plans, working inside bar tonight | neutral | 1 |
| 3 | Im so tired and sick i have to be better on S... | neutral | 1 |
| 4 | sad that will have to leave my beautiful apart... | neutral | 1 |


In [32]:
import torch

complete_text = sampled_data['text'].to_list()
labels = torch.tensor(sampled_data['sentiment_score'].to_list())

In [33]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

# Preprocess the extracted texts
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = tokenizer(complete_text, padding=True, truncation=True, return_tensors='pt')

tokenized_texts.input_ids.shape, tokenized_texts.attention_mask.shape

(torch.Size([3003, 70]), torch.Size([3003, 70]))
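
The shape `(3003, 70)` reflects `padding=True`, which pads every text to the length of the longest tokenized text in the batch (70 tokens here). The attention mask records which positions hold real tokens. The sketch below reproduces that idea on plain lists. The ids `101`, `102`, and `0` are the `[CLS]`, `[SEP]`, and `[PAD]` ids of `bert-base-uncased`; the other ids and the helper `pad_batch` are made up for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two already-tokenized texts of different lengths (word ids are illustrative).
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
ids, mask = pad_batch(batch)

print(ids)   # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```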

### Split data into training and validation sets

In [34]:
input_ids = tokenized_texts.input_ids
attention_mask = tokenized_texts.attention_mask

print("Input ids Shape:",input_ids.shape)
print("attention_mask Shape:",attention_mask.shape)
print("labels Shape:",labels.shape)

# Split ids, masks, and labels in a single call so all three share the same row indices
train_inputs, val_inputs, train_masks, val_masks, train_labels, val_labels = train_test_split(
    input_ids, attention_mask, labels, random_state=42, test_size=0.2
)

print()

print("Train Input ids Shape:",train_inputs.shape)
print("Train attention_mask Shape:",train_masks.shape)
print("Train labels Shape:",train_labels.shape)

print()

print("Test Input ids Shape:",val_inputs.shape)
print("Test attention_mask Shape:",val_masks.shape)
print("Test labels Shape:",val_labels.shape)

print()

from collections import Counter

print("Train Sample Count[Class-wise]", dict(Counter(train_labels.tolist())))
print("Test Sample Count[Class-wise]", dict(Counter(val_labels.tolist())))

Input ids Shape: torch.Size([3003, 70])
attention_mask Shape: torch.Size([3003, 70])
labels Shape: torch.Size([3003])

Train Input ids Shape: torch.Size([2402, 70])
Train attention_mask Shape: torch.Size([2402, 70])
Train labels Shape: torch.Size([2402])

Test Input ids Shape: torch.Size([601, 70])
Test attention_mask Shape: torch.Size([601, 70])
Test labels Shape: torch.Size([601])

Train Sample Count[Class-wise] {2: 823, 1: 787, 0: 792}
Test Sample Count[Class-wise] {0: 209, 2: 178, 1: 214}


This random split left the classes unevenly represented in the training and validation sets. Let's split the data again using stratified sampling, which preserves the class proportions in both sets.

In [35]:
from sklearn.model_selection import StratifiedShuffleSplit

# Initialize StratifiedShuffleSplit with 80% train and 20% validation split
stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Split the data into train and validation sets using the split indices generated by StratifiedShuffleSplit
for train_index, val_index in stratified_split.split(input_ids, labels):
    train_inputs, val_inputs = input_ids[train_index], input_ids[val_index]
    train_masks, val_masks = attention_mask[train_index], attention_mask[val_index]
    train_labels, val_labels = labels[train_index], labels[val_index]

# Print the shapes of the split data
print("Train Input ids Shape:", train_inputs.shape)
print("Train attention_mask Shape:", train_masks.shape)
print("Train labels Shape:", train_labels.shape)
print("Validation Input ids Shape:", val_inputs.shape)
print("Validation attention_mask Shape:", val_masks.shape)
print("Validation labels Shape:", val_labels.shape)

print()

from collections import Counter

print("Train Sample Count[Class-wise]", dict(Counter(train_labels.tolist())))
print("Test Sample Count[Class-wise]", dict(Counter(val_labels.tolist())))

Train Input ids Shape: torch.Size([2402, 70])
Train attention_mask Shape: torch.Size([2402, 70])
Train labels Shape: torch.Size([2402])
Validation Input ids Shape: torch.Size([601, 70])
Validation attention_mask Shape: torch.Size([601, 70])
Validation labels Shape: torch.Size([601])

Train Sample Count[Class-wise] {0: 801, 2: 800, 1: 801}
Test Sample Count[Class-wise] {1: 200, 2: 201, 0: 200}
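
To see why stratification evens out the class counts, the sketch below splits each class separately and then merges the pieces. It is a simplified illustration, not scikit-learn's implementation: the helper `stratified_split` is hypothetical, and because it rounds per class it puts exactly 200 samples of each class in the validation set, whereas `StratifiedShuffleSplit` also distributes the leftover sample (hence 601 validation rows above).

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Same class sizes as the balanced dataset: 1001 samples per class.
toy_labels = [0] * 1001 + [1] * 1001 + [2] * 1001
train_idx, val_idx = stratified_split(toy_labels)

print(Counter(toy_labels[i] for i in train_idx))
print(Counter(toy_labels[i] for i in val_idx))
```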


In [36]:
# Create custom dataset and dataloaders
class CustomDataset(Dataset):
    def __init__(self, input_ids, attention_masks, labels):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_masks[idx],
            'labels': self.labels[idx]
        }

batch_size = 16

train_dataset = CustomDataset(train_inputs, train_masks, train_labels)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

val_dataset = CustomDataset(val_inputs, val_masks, val_labels)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
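
As a quick sanity check on the loaders, the arithmetic below works out how many batches each one yields per pass. It assumes the `DataLoader` default `drop_last=False`, so the final partial batch is kept; `num_batches` is a hypothetical helper, not a PyTorch function.

```python
import math


def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a loader yields per pass over the data."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


batch_size = 16
train_batches = num_batches(2402, batch_size)
val_batches = num_batches(601, batch_size)

# Size of the final, partially filled batch in each loader.
last_train_batch = 2402 - (train_batches - 1) * batch_size
last_val_batch = 601 - (val_batches - 1) * batch_size

print(train_batches, val_batches)        # 151 38
print(last_train_batch, last_val_batch)  # 2 9
```

With 151 training batches per epoch, each epoch performs 151 optimizer steps, and the last training batch contains only 2 samples.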

### Initialize and train BERT model

In [37]:
# Initialize and train BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
epochs = 3

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device::", device)
model.to(device)

for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            val_loss += outputs.loss.item()

    print(f'Epoch {epoch + 1}/{epochs}, Validation loss: {val_loss / len(val_loader)}')

# Save the trained model
model.save_pretrained('bert_sentiment_classification_model')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


device:: cuda
Epoch 1/3, Validation loss: 0.6624120669929605
Epoch 2/3, Validation loss: 0.5717421452465811
Epoch 3/3, Validation loss: 0.5887839758866712
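
The loss reported for the validation set falls from epoch 1 to epoch 2 and rises slightly at epoch 3. That can be an early sign of overfitting, though a single uptick may also be noise. Since the notebook saves the model after the final epoch, one option is to track the best epoch and stop or checkpoint there. The sketch below applies that bookkeeping to the three loss values from the log; `best_epoch` and `should_stop` are hypothetical helpers, and the patience of 1 is an arbitrary choice.

```python
# Per-epoch validation losses copied from the training log.
val_losses = [0.6624120669929605, 0.5717421452465811, 0.5887839758866712]


def best_epoch(losses):
    """Return (1-based epoch number, loss) of the lowest validation loss."""
    best_idx = min(range(len(losses)), key=losses.__getitem__)
    return best_idx + 1, losses[best_idx]


def should_stop(losses, patience=1):
    """True when the last `patience` epochs did not beat the best earlier loss."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return all(loss >= best_before for loss in losses[-patience:])


epoch, loss = best_epoch(val_losses)
print(f"Best epoch: {epoch} (validation loss {loss:.4f})")  # Best epoch: 2 (validation loss 0.5717)
print(should_stop(val_losses[:2]))  # False: epoch 2 improved on epoch 1
print(should_stop(val_losses))      # True: epoch 3 did not improve on epoch 2
```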


### Model Evaluation

In [38]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Each batch is moved to the device inside the evaluation loop, so no other tensors need to be transferred here.

In [39]:
# Put the model in evaluation mode
model.eval()

# Initialize variables to calculate accuracy
total_correct = 0
total_samples = 0

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Get model predictions
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)

        # Update accuracy metrics
        total_correct += torch.sum(predictions == labels).item()
        total_samples += len(labels)

# Calculate accuracy
accuracy = total_correct / total_samples
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.79
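
A single accuracy figure does not show which classes the model confuses. A confusion matrix and per-class recall give a finer view. The sketch below computes both in plain Python. The `y_true` and `y_pred` lists are made-up values for illustration, not the notebook's predictions; in practice they would be collected from the evaluation loop.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
print(matrix)                    # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_recall(matrix))  # [0.5, 1.0, 0.5]
```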


### Model Serving

In [40]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the trained model
model_path = 'bert_sentiment_classification_model'
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

# Function to predict sentiment class
def predict_sentiment_class(text):
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    predicted_class = torch.argmax(logits, dim=1).item()
    return predicted_class

In [41]:
sample_text = "I love this product and want to purchase more"

predicted_class = predict_sentiment_class(sample_text)
print(f'Predicted Class: {predicted_class}')

Predicted Class: 2


In [42]:
class_to_label

{'negative': 0, 'neutral': 1, 'positive': 2}
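
Going by the mapping above, class `2` corresponds to `positive`. For serving, it is usually more convenient to return the label name together with a confidence score. The sketch below inverts `class_to_label` and applies a softmax to the logits. The `example_logits` values are hypothetical, and `decode` is an illustrative helper rather than part of the notebook.

```python
import math

class_to_label = {"negative": 0, "neutral": 1, "positive": 2}
# Invert the mapping so a predicted index can be turned back into a name.
label_to_class = {idx: name for name, idx in class_to_label.items()}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_to_class[best], probs[best]


# Hypothetical logits for one text; a real call would use outputs.logits[0].tolist().
example_logits = [-1.2, 0.3, 2.5]
name, confidence = decode(example_logits)
print(f"{name} ({confidence:.2f})")
```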