# Phase 4: Sentiment & Emotion Analysis

## Overview
This notebook adds emotion scores to book descriptions using a fine-tuned emotion classification model. We'll classify book descriptions at the sentence level to capture multiple emotional tones, then extract maximum emotion scores per book for use in filtering and recommendation sorting.

## Objectives
1. Load fine-tuned emotion classification model (DistilRoBERTa-based, 6 emotions + neutral)
2. Test whole description vs. sentence-level classification
3. Process all book descriptions to extract emotion scores
4. Create emotion columns (anger, disgust, fear, joy, sadness, surprise, neutral)
5. Merge emotion scores back into main dataset
6. Save final dataset with emotions

## Expected Output
- **Final Dataset**: `data/books_with_emotions.csv` (with categories + emotions)
- **Emotion Columns**: 7 columns with probability scores (0-1) for each emotion
- **Metric**: Maximum emotion score per book (strongest emotional tone present)

In [1]:
import pandas as pd
from pathlib import Path

# Load cleaned dataset
data_path = Path("../data/books_with_categories.csv")
books = pd.read_csv(data_path)

print(f"✓ Dataset loaded: {books.shape[0]} rows, {books.shape[1]} columns")

✓ Dataset loaded: 5197 rows, 16 columns


In [2]:
from transformers import pipeline
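classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None,    # return scores for all labels, not just the top one
    device="mps",  # Apple Silicon GPU; use "cpu" or a CUDA device on other hardware
)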
classifier("I love this!")

Device set to use mps


[[{'label': 'surprise', 'score': 0.48696979880332947},
  {'label': 'neutral', 'score': 0.2234414517879486},
  {'label': 'joy', 'score': 0.14913064241409302},
  {'label': 'anger', 'score': 0.07174026966094971},
  {'label': 'sadness', 'score': 0.04664652422070503},
  {'label': 'disgust', 'score': 0.01629774644970894},
  {'label': 'fear', 'score': 0.005773560609668493}]]
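
With `top_k=None` the pipeline returns a score for every label, sorted from highest to lowest, wrapped in one inner list per input text. A minimal sketch of reshaping one such result into a label-keyed dictionary, using the output above as literal data rather than calling the model (`scores_by_label` and `top_label` are illustrative names, not part of the notebook):

```python
# Output copied from the cell above: one input text -> one inner list.
result = [[
    {"label": "surprise", "score": 0.48696979880332947},
    {"label": "neutral", "score": 0.2234414517879486},
    {"label": "joy", "score": 0.14913064241409302},
    {"label": "anger", "score": 0.07174026966094971},
    {"label": "sadness", "score": 0.04664652422070503},
    {"label": "disgust", "score": 0.01629774644970894},
    {"label": "fear", "score": 0.005773560609668493},
]]

# Key the scores by label so lookups no longer depend on the score ordering.
scores_by_label = {item["label"]: item["score"] for item in result[0]}

# The label with the highest score.
top_label = max(scores_by_label, key=scores_by_label.get)

# The seven scores form a probability distribution, so they sum to roughly 1.
total = sum(scores_by_label.values())

print(top_label)
print(round(total, 6))
```

Keying by label makes later steps independent of how the pipeline happens to order its output.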

In [3]:
books["description"][0]

'A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world ha

Run the emotion classifier over the whole description first. This yields a single score distribution for the entire text:

In [4]:
classifier(books["description"][0])

[[{'label': 'fear', 'score': 0.6548416018486023},
  {'label': 'neutral', 'score': 0.16985200345516205},
  {'label': 'sadness', 'score': 0.11640852689743042},
  {'label': 'surprise', 'score': 0.020700637251138687},
  {'label': 'disgust', 'score': 0.0191007312387228},
  {'label': 'joy', 'score': 0.015161268413066864},
  {'label': 'anger', 'score': 0.0039351521991193295}]]

In [5]:
# Split the description into sentences (naive split on periods)
description = books["description"][0]
sentences = description.split(".")

# Classify each sentence
predictions = classifier(sentences)
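
Splitting on `.` leaves leading spaces on each fragment and can produce empty strings, for example after a trailing period. A minimal sketch of one possible cleanup step follows; `split_sentences` is a hypothetical helper, not something the notebook defines, and it still splits on periods inside abbreviations or decimals:

```python
def split_sentences(text):
    """Split on periods, strip whitespace, and drop empty fragments."""
    return [part.strip() for part in text.split(".") if part.strip()]


sample = "John Ames is a preacher. It's 1956 in Gilead, Iowa. "
cleaned = split_sentences(sample)
print(cleaned)
```

Dropping fragments can shift sentence indices relative to a plain `split(".")`, so positions such as `sentences[4]` may refer to different text after cleanup.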

In [6]:
sentences[4]
predictions[4]

[{'label': 'sadness', 'score': 0.9671574234962463},
 {'label': 'neutral', 'score': 0.01510419137775898},
 {'label': 'disgust', 'score': 0.006480610463768244},
 {'label': 'fear', 'score': 0.005394005216658115},
 {'label': 'surprise', 'score': 0.002286945004016161},
 {'label': 'anger', 'score': 0.0018428919138386846},
 {'label': 'joy', 'score': 0.00173388107214123}]

In [7]:
sorted(predictions[4], key=lambda x: x["label"])

[{'label': 'anger', 'score': 0.0018428919138386846},
 {'label': 'disgust', 'score': 0.006480610463768244},
 {'label': 'fear', 'score': 0.005394005216658115},
 {'label': 'joy', 'score': 0.00173388107214123},
 {'label': 'neutral', 'score': 0.01510419137775898},
 {'label': 'sadness', 'score': 0.9671574234962463},
 {'label': 'surprise', 'score': 0.002286945004016161}]

## Efficient Emotion Score Extraction

### Problem
The emotion classifier returns one list of predictions per sentence, ordered by descending score rather than by label. This means:
- The label order can differ from sentence to sentence
- We need a consistent order to compare scores across sentences
- Processing 5000+ books requires efficient extraction
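
The ordering issue can be checked directly on the two outputs shown earlier. The sketch below copies them in as literal data (no model call) and compares label orders before and after sorting; `label_order` and `sort_by_label` are illustrative helper names:

```python
# Predictions copied from the outputs above: the whole-description result
# and the result for sentence 4. Both are ordered by descending score.
whole_description = [
    {"label": "fear", "score": 0.6548416018486023},
    {"label": "neutral", "score": 0.16985200345516205},
    {"label": "sadness", "score": 0.11640852689743042},
    {"label": "surprise", "score": 0.020700637251138687},
    {"label": "disgust", "score": 0.0191007312387228},
    {"label": "joy", "score": 0.015161268413066864},
    {"label": "anger", "score": 0.0039351521991193295},
]
sentence_4 = [
    {"label": "sadness", "score": 0.9671574234962463},
    {"label": "neutral", "score": 0.01510419137775898},
    {"label": "disgust", "score": 0.006480610463768244},
    {"label": "fear", "score": 0.005394005216658115},
    {"label": "surprise", "score": 0.002286945004016161},
    {"label": "anger", "score": 0.0018428919138386846},
    {"label": "joy", "score": 0.00173388107214123},
]


def label_order(prediction):
    """Return the labels in the order they appear in a prediction."""
    return [item["label"] for item in prediction]


def sort_by_label(prediction):
    """Return a copy of the prediction sorted alphabetically by label."""
    return sorted(prediction, key=lambda item: item["label"])


raw_orders_match = label_order(whole_description) == label_order(sentence_4)
sorted_order = label_order(sort_by_label(whole_description))
sorted_orders_match = sorted_order == label_order(sort_by_label(sentence_4))

print(raw_orders_match)     # False
print(sorted_orders_match)  # True
print(sorted_order)
```

Alphabetical order places `neutral` before `sadness` and `surprise`, so any label list used for positional indexing has to follow that same order.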

### Solution Strategy
1. **Sort predictions by label** for each sentence to ensure consistent emotion order
2. **Extract scores per emotion** across all sentences for a description
3. **Take maximum score** for each emotion (captures strongest emotional tone present)
4. **Batch process** with progress tracking for all books

### Approach
- Split descriptions into sentences (by period)
- Classify all sentences of a description in a single classifier call
- Sort each sentence's predictions by label name
- Collect scores per emotion across sentences
- Extract maximum per emotion to get strongest emotional tone
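
The steps above can be sketched end to end without loading the model. In this sketch `fake_classifier` is a hypothetical stand-in that returns invented scores in the same shape as the real pipeline output, and `max_scores_by_label` is an illustrative variant that looks scores up by label instead of sorting and indexing by position:

```python
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def fake_classifier(sentences):
    """Hypothetical stand-in for the real pipeline: returns one list of
    {label, score} dicts per sentence, ordered by descending score."""
    results = []
    for sentence in sentences:
        # Toy rule: sentences containing "lost" lean sad, all others neutral.
        dominant = "sadness" if "lost" in sentence else "neutral"
        scores = {label: (0.94 if label == dominant else 0.01) for label in EMOTIONS}
        ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
        results.append([{"label": label, "score": score} for label, score in ranked])
    return results


def max_scores_by_label(predictions):
    """Take the highest score seen for each label across all sentences."""
    best = {label: 0.0 for label in EMOTIONS}
    for prediction in predictions:
        for item in prediction:
            best[item["label"]] = max(best[item["label"]], item["score"])
    return best


description = "He is a preacher. His friend's lost son returns"
sentences = [part.strip() for part in description.split(".") if part.strip()]
book_scores = max_scores_by_label(fake_classifier(sentences))
print(book_scores)
```

Because each emotion's maximum is taken independently, a book can score high on several emotions at once (here both `neutral` and `sadness`), and the per-book scores no longer sum to 1.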

In [8]:
import numpy as np

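# Define emotion labels in alphabetical order: calculate_max_emotion_scores
# sorts each prediction by label and then indexes by position, so this list
# must match that sorted order (note that "neutral" precedes "sadness").
emotion_labels = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]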

def calculate_max_emotion_scores(predictions):
    """
    Extract maximum emotion score for each emotion across all sentence predictions.
    
    Args:
        predictions: List of per-sentence predictions, each a list of
            {"label": ..., "score": ...} dictionaries
    
    Returns:
        Dictionary with max score for each emotion label
    """
    # Dictionary to hold all scores per emotion
    per_emotion_scores = {label: [] for label in emotion_labels}
    
    # Process each sentence prediction
    for prediction in predictions:
        # Sort predictions by label to ensure consistent order
        sorted_predictions = sorted(prediction, key=lambda x: x["label"])
        
        # Extract score for each emotion label
        for index, label in enumerate(emotion_labels):
            per_emotion_scores[label].append(sorted_predictions[index]["score"])
    
    # Return dictionary with max score for each emotion
    return {label: np.max(scores) for label, scores in per_emotion_scores.items()}

In [9]:
# # Initialize containers for emotion scores
# isbn = []
# emotion_scores = {label: [] for label in emotion_labels}

# # Test on first 10 books
# for i in range(10):
#     isbn.append(books["isbn13"][i])
#     sentences = books["description"][i].split(".")
#     predictions = classifier(sentences)
#     max_scores = calculate_max_emotion_scores(predictions)
#     for label in emotion_labels:
#         emotion_scores[label].append(max_scores[label])

In [None]:
# emotion_scores

In [None]:
from tqdm import tqdm

# Initialize containers
isbn = []
emotion_scores = {label: [] for label in emotion_labels}

# Process all books
for i in tqdm(range(len(books)), desc="Processing book emotions"):
    isbn.append(books["isbn13"][i])
    sentences = books["description"][i].split(".")
    predictions = classifier(sentences)
    max_scores = calculate_max_emotion_scores(predictions)
    
    for label in emotion_labels:
        emotion_scores[label].append(max_scores[label])

Processing book emotions:  82%|████████▏ | 4257/5197 [03:15<01:06, 14.10it/s]

In [None]:
emotions_df = pd.DataFrame(emotion_scores)
emotions_df["isbn13"] = isbn 

In [None]:
emotions_df.head()

In [None]:
books = pd.merge(books, emotions_df, on="isbn13")
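
Since the merge keys on `isbn13`, a duplicated ISBN on either side would silently multiply rows. A minimal sketch of guarding against that with the `validate` argument of `pd.merge`, using small made-up frames (`books_demo` and `emotions_demo` are stand-ins, not the notebook's data):

```python
import pandas as pd

# Toy stand-ins for the notebook's `books` and `emotions_df` frames.
books_demo = pd.DataFrame({"isbn13": [1, 2, 3], "title": ["A", "B", "C"]})
emotions_demo = pd.DataFrame({"isbn13": [1, 2, 3], "joy": [0.9, 0.1, 0.5]})

# validate="one_to_one" raises pandas.errors.MergeError if isbn13 is
# duplicated in either frame, instead of quietly duplicating rows.
merged = pd.merge(
    books_demo, emotions_demo, on="isbn13", how="inner", validate="one_to_one"
)

# With unique keys on both sides, the row count is unchanged.
print(len(merged) == len(books_demo))
```

Whether a strict one-to-one check suits the real dataset depends on whether `isbn13` is unique there, which this notebook does not show.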

In [None]:
from pathlib import Path

output_path = Path("../data/books_with_emotions.csv")
books.to_csv(output_path, index=False)
print(f"✓ Dataset with emotions saved to: {output_path}")