# NLP Sentiment Analysis with RoBERTa

Train and evaluate a RoBERTa model for sentiment analysis using NLTK's movie_reviews corpus (2,000 reviews labeled as "positive" or "negative").

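RoBERTa (Robustly Optimized BERT Pretraining Approach) is a pre-trained transformer model. When fine-tuned on labeled reviews, it typically outperforms traditional machine learning approaches such as bag-of-words classifiers on sentiment tasks.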

### Data Pre-Processing

In [11]:
# Import libraries and download corpus, if needed
import nltk
# nltk.download('movie_reviews')
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split

# Load movie reviews as text, re-joining each review's word tokens with spaces
texts = []
labels = []

# RoBERTa tokenizes plain text itself; the labels must be integer class IDs
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        text = ' '.join(movie_reviews.words(fileid))
        texts.append(text)
        # Convert label to binary: 1 = positive, 0 = negative
        labels.append(1 if category == 'pos' else 0)

# Check results, displaying sample text and label
print(f"Loaded {len(texts)} movie reviews")
print(f"Example text: {texts[0][:50]}...")
print(f"Example label: {labels[0]} ({'positive' if labels[0] == 1 else 'negative'})")

Loaded 2000 movie reviews
Example text: plot : two teen couples go to a church party , dri...
Example label: 0 (negative)


In [12]:
# Split into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, # raw text of movie review
    labels, # binary label
    test_size=0.2, # reserve 20% for testing
    random_state=113 # set for reproducibility
)
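
The split above does not pass `stratify`, so the class ratio in each split depends on the shuffle. The following is a minimal, dependency-free sketch of a balance check. The helper name `label_balance` and the toy label lists are illustrative assumptions, not part of the notebook. If the ratios drift apart, `train_test_split` accepts `stratify=labels` to preserve the class ratio in both splits.

```python
from collections import Counter

def label_balance(labels):
    """Return the fraction of examples per class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Toy stand-ins for train_labels / test_labels (hypothetical values)
toy_train_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_test_labels = [1, 1, 0, 1]

train_balance = label_balance(toy_train_labels)
test_balance = label_balance(toy_test_labels)
print(train_balance)  # {0: 0.5, 1: 0.5}
print(test_balance)   # {0: 0.25, 1: 0.75}
```

In the notebook, the same helper could be called on `train_labels` and `test_labels` directly.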

### Model Training

- Tokenize text, converting from raw text to token IDs.
- Create a PyTorch Dataset, wrapping tokenized text so RoBERTa can use it.
- Load the pre-trained RoBERTa model (roberta-base).
- Specify training arguments, like number of epochs.
- Calculate performance metrics (accuracy, precision, recall, F1).

In [13]:
# Import libraries for RoBERTa
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments

# Load pre-trained RoBERTa tokenizer
model_name = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)

In [14]:
# Tokenize text into token IDs; reviews longer than 512 tokens are truncated, shorter ones are padded
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512)
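
To make the effect of `truncation`, `padding`, and `max_length` concrete, here is a toy sketch that uses whitespace splitting in place of RoBERTa's subword tokenizer. The `toy_encode` helper, its vocabulary, and the pad ID of 0 are all illustrative assumptions; the real tokenizer assigns different IDs and adds special tokens.

```python
def toy_encode(texts, max_length):
    """Mimic truncation plus padding-to-longest with a whitespace tokenizer."""
    vocab = {}
    pad_id = 0  # assumed pad ID, for this toy example only
    sequences = []
    for text in texts:
        ids = []
        for word in text.split():
            # Assign IDs in order of first appearance, starting at 1
            ids.append(vocab.setdefault(word, len(vocab) + 1))
        sequences.append(ids[:max_length])  # truncation: drop tokens past max_length
    longest = max(len(ids) for ids in sequences)
    input_ids, attention_mask = [], []
    for ids in sequences:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)              # pad to the longest sequence
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = toy_encode(["a great film", "dull", "a dull but very long film"], max_length=4)
print(encoded["input_ids"])       # [[1, 2, 3, 0], [4, 0, 0, 0], [1, 4, 5, 6]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```

The third text loses its last two words to truncation. In the same way, any review longer than 512 tokens reaches the model without its ending.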

In [15]:
# Create a PyTorch Dataset, wrapping the tokenized data so RoBERTa can use it
import torch

class MovieReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create dataset objects
train_dataset = MovieReviewDataset(train_encodings, train_labels)
test_dataset = MovieReviewDataset(test_encodings, test_labels)

In [16]:
# Load the pre-trained RoBERTa model (num_labels = 2 to categorize positive/negative)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Set up training configuration
# Ignore pinned memory warnings (harmless but annoying)
import warnings
warnings.filterwarnings('ignore', message='.*pin_memory.*')
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Specify training arguments like number of training epochs
training_args = TrainingArguments(
    num_train_epochs=3,              # Train for 3 complete passes through the data
    per_device_train_batch_size=10,  # Process 10 reviews at a time during training
    per_device_eval_batch_size=15,   # Process 15 reviews at a time during evaluation
    warmup_steps=500,                # Linearly warm up the learning rate over 500 steps (the run below totals only 480)
    weight_decay=0.01,               # Regularize weights to reduce overfitting
    logging_steps=100,               # Log training progress every 100 steps
    eval_strategy="epoch",           # Evaluate at the end of each epoch
    save_strategy="epoch",           # Save a checkpoint at the end of each epoch
    load_best_model_at_end=True,     # Reload the best checkpoint (lowest validation loss by default) when finished
)

def compute_metrics(pred):
    """Compute accuracy, F1, precision, and recall from the Trainer's predictions."""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # pick the class with the highest logit
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
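
A quick arithmetic check on the schedule these arguments imply. The sketch assumes a single device, no gradient accumulation, the 1,600-review training split created earlier, and a linear warmup-then-decay schedule with the `TrainingArguments` default peak learning rate of 5e-5. `lr_multiplier` is a hypothetical helper written for illustration, not a library function.

```python
import math

train_size = 1600      # 80% of the 2,000 reviews
batch_size = 10        # per_device_train_batch_size
epochs = 3             # num_train_epochs
warmup_steps = 500     # warmup_steps
peak_lr = 5e-5         # assumed: default learning_rate

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

def lr_multiplier(step, warmup, total):
    """Linear warmup to 1.0, then linear decay to 0.0 (sketch of a linear schedule)."""
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))

final_multiplier = lr_multiplier(total_steps, warmup_steps, total_steps)
print(total_steps)        # 480
print(final_multiplier)   # 0.96
print(peak_lr * final_multiplier)  # learning rate at the last step under these assumptions
```

Under these assumptions the warmup is longer than the whole run, so the learning rate is still rising when training stops and never reaches its peak. A smaller `warmup_steps` value would leave room for the decay phase.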

In [18]:
# Train the model
# Go get a coffee; this may take several minutes. :)
trainer.train()

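| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|-------|---------------|-----------------|----------|----------|-----------|----------|
| 1 | 0.6934 | 0.390613 | 0.875 | 0.880952 | 0.837104 | 0.929648 |
| 2 | 0.4401 | 0.321818 | 0.9075 | 0.907268 | 0.905 | 0.909548 |
| 3 | 0.3366 | 0.36198 | 0.8825 | 0.882206 | 0.88 | 0.884422 |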


TrainOutput(global_step=480, training_loss=0.458883011341095, metrics={'train_runtime': 467.8251, 'train_samples_per_second': 10.26, 'train_steps_per_second': 1.026, 'total_flos': 1262933065728000.0, 'train_loss': 0.458883011341095, 'epoch': 3.0})
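
The epoch-2 row can be sanity-checked by hand. The confusion-matrix counts below are not logged by the notebook. They are back-calculated as the integer counts that reproduce the reported metrics on the 400-review test split, so treat them as an assumption.

```python
# Inferred counts for epoch 2 (assumption: back-calculated, not logged output)
tp, fp, fn, tn = 181, 19, 18, 182

precision = tp / (tp + fp)   # of predicted positives, the share that are correct
recall = tp / (tp + fn)      # of actual positives, the share that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.6f} f1={f1:.6f}")
# accuracy=0.9075 precision=0.9050 recall=0.909548 f1=0.907268
```

With these counts, the model misclassifies 37 of the 400 test reviews: 19 false positives and 18 false negatives.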

### Model Evaluation

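Evaluate performance on the test set. Because the same test set served as the evaluation set during training, and `load_best_model_at_end=True` used it to select the epoch-2 checkpoint (lowest validation loss), the scores below are optimistic rather than a truly held-out estimate. In a future version, hold out a separate validation set and implement k-fold cross-validation as a more rigorous assessment.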

In [19]:
# Evaluate the model on the test set
results = trainer.evaluate()

# Display results
print("Test Results:")
print(f"Accuracy:  {results['eval_accuracy']*100:.2f}%")
print(f"Precision: {results['eval_precision']*100:.2f}%")
print(f"Recall:    {results['eval_recall']*100:.2f}%")
print(f"F1 Score:  {results['eval_f1']*100:.2f}%")

Test Results:
Accuracy:  90.75%
Precision: 90.50%
Recall:    90.95%
F1 Score:  90.73%
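
The k-fold cross-validation mentioned above could start from an index generator like the one below. This is a minimal, dependency-free sketch: `kfold_indices` is a hypothetical helper, and it does not stratify by label. In practice, scikit-learn's `StratifiedKFold` would be a natural choice.

```python
import random

def kfold_indices(n_samples, k, seed=113):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # deal indices round-robin into k folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=2000, k=5))
for fold_number, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: {len(train_idx)} train, {len(val_idx)} validation")
```

Each fold would need a freshly loaded `roberta-base` model so that no fold sees weights tuned on its own validation data, which multiplies training time by k.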
