<a href="https://colab.research.google.com/github/tuanqpham0921/Semantic-Book-Recommender-/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>📚 Book Recommendation Sentiment Analysis</center>

In this section, we enhance our vector search system by integrating **sentiment analysis** to provide richer and more personalized book recommendations. This allows users to filter or explore books based on the emotional tone they're seeking.

We classify books into one of **seven emotions**, using a pre-trained model from Hugging Face:  
👉 [`j-hartmann/emotion-english-distilroberta-base`](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base)

The seven emotion categories are:

- 😠 **Anger**  
- 🤢 **Disgust**  
- 😨 **Fear**  
- 😊 **Joy**  
- 😢 **Sadness**  
- 😲 **Surprise**  
- 😐 **Neutral**

By leveraging this model, we add a new dimension to recommendations — one that considers emotional resonance, not just content similarity.


In [3]:
!pip install transformers



In [4]:
# Read the Hugging Face API token from Colab's Secrets panel
from google.colab import userdata
HUGGINGFACEHUB_API_TOKEN = userdata.get('HUGGINGFACEHUB_API_TOKEN')

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import pandas as pd

# Replace this path with the location of books_with_simple_categories.csv in your Google Drive
file_path = '/content/drive/MyDrive/Colab Notebooks/Semantic Book Recommender/books_with_simple_categories.csv'

try:
    books = pd.read_csv(file_path)
    print("CSV file loaded successfully!")
    display(books.head())
except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

CSV file loaded successfully!


Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,title_and_subtiles,tagged_description,simple_categories,isbn
0,9780002005883,2005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,Gilead,9780002005883 A NOVEL THAT READERS and critics...,Fiction,
1,9780002261982,2261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,Spider's Web: A Novel,9780002261982 A new 'Christie for Christmas' -...,Fiction,9780002000000.0
2,9780006178736,6178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0,Rage of angels,"9780006178736 A memorable, mesmerizing heroine...",Fiction,
3,9780006280897,6280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0,The Four Loves,9780006280897 Lewis' work on the nature of lov...,Nonfiction,9780006000000.0
4,9780006280934,6280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,37569.0,The Problem of Pain,"9780006280934 ""In The Problem of Pain, C.S. Le...",Nonfiction,9780006000000.0


## 🧠 Genre Classification: Fiction vs. Non-Fiction

In this section, we implement **binary genre classification** using a simple dataset containing two categories: **Fiction** and **Non-Fiction**.

We use a model from Hugging Face that has **already been fine-tuned** for this task, so books can be classified from their descriptions or metadata with minimal additional training. This streamlines genre tagging and supports downstream tasks such as filtering and recommendation.

### 📊 Model Performance

The pre-trained model we use reports an **accuracy of ~66%**, as evaluated on **Dataloop** — a collaborative data management and model evaluation platform. While this performance may not be optimal for production-level pipelines, it serves well for prototyping and lightweight applications.

### ✅ Benefits

- **Quick categorization** of large book datasets.
- Enables **genre-based filtering** alongside sentiment filtering.
- Lays the groundwork for more complex multi-label classification in future iterations.

> **Note**: For better accuracy, fine-tuning the model on your own book dataset is recommended.

In [8]:
from transformers import pipeline

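# return_all_scores=True returns a score for every label, not just the top one.
# Recent transformers releases deprecate this flag in favor of top_k=None.
classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base",
                      return_all_scores=True)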

Device set to use cuda:0


In [9]:
classifier("I love this!")



[[{'label': 'anger', 'score': 0.004419785924255848},
  {'label': 'disgust', 'score': 0.001611991785466671},
  {'label': 'fear', 'score': 0.00041385178337804973},
  {'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'neutral', 'score': 0.005764591973274946},
  {'label': 'sadness', 'score': 0.0020923891570419073},
  {'label': 'surprise', 'score': 0.00852868054062128}]]

In [10]:
books["description"][0]

'A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world ha

In [11]:
classifier(books["description"][0])

[[{'label': 'anger', 'score': 0.003935146611183882},
  {'label': 'disgust', 'score': 0.019100705161690712},
  {'label': 'fear', 'score': 0.6548413634300232},
  {'label': 'joy', 'score': 0.015161297284066677},
  {'label': 'neutral', 'score': 0.16985207796096802},
  {'label': 'sadness', 'score': 0.11640888452529907},
  {'label': 'surprise', 'score': 0.02070062793791294}]]

The classifier's top prediction for this description is **fear** (score ≈ 0.65). However, a single label may not reflect the tone of the whole description.

For example, the opening reads:

> *"A novel that readers and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher..."*

This sounds more **surprising** or **awe-inspiring** than fearful.

Later, the tone shifts even further:

> *"Told in John Ames’s joyous, rambling voice that finds beauty, humour, and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world..."*

This clearly leans toward **joy** and **acceptance** rather than fear.
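As a small illustration of how the top label above is read off the classifier output, the sketch below picks the highest-scoring entry from a list of label/score dictionaries. The scores are copied from the output shown earlier and rounded to four decimals; `top_emotion` and `ranked` are hypothetical helper names, not part of the notebook.

```python
# Scores for the first description, copied from the output above (rounded).
scores = [
    {"label": "anger", "score": 0.0039},
    {"label": "disgust", "score": 0.0191},
    {"label": "fear", "score": 0.6548},
    {"label": "joy", "score": 0.0152},
    {"label": "neutral", "score": 0.1699},
    {"label": "sadness", "score": 0.1164},
    {"label": "surprise", "score": 0.0207},
]


def top_emotion(scores):
    """Return (label, score) for the highest-scoring emotion."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


def ranked(scores):
    """Return the labels ordered from highest to lowest score."""
    ordered = sorted(scores, key=lambda item: item["score"], reverse=True)
    return [item["label"] for item in ordered]


print(top_emotion(scores))  # ('fear', 0.6548)
print(ranked(scores)[:3])   # ['fear', 'neutral', 'sadness']
```

Looking at the runners-up as well as the winner shows how much probability mass sits outside the top label, which is about 35% here.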

### 🧠 Insight:
This example highlights how a single label for an entire paragraph or description may not capture the full range of emotions. To improve accuracy, we can **split the text into smaller sentences or chunks** and analyze each one individually. This sentence-level sentiment analysis allows us to better capture **mixed emotions** and provide a more nuanced emotional profile for each book.
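One way to do the splitting is sketched below, under the assumption that a simple rule is good enough: break after `.`, `!` or `?` when followed by whitespace, and discard empty pieces. `split_sentences` is a hypothetical helper. It is deliberately naive and will still split wrongly on abbreviations and initials such as "C.S. Lewis"; a dedicated sentence tokenizer would handle those better.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = "John Ames is a preacher. It's 1956 in Gilead, Iowa."

# A plain str.split(".") leaves an empty string after the final period.
print(sample.split("."))
# ['John Ames is a preacher', " It's 1956 in Gilead, Iowa", '']

print(split_sentences(sample))
# ['John Ames is a preacher.', "It's 1956 in Gilead, Iowa."]
```

Dropping empty fragments matters because the classifier still returns a full set of scores for an empty string, and those scores would then be treated as if they belonged to the book.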


In [12]:
# Split the description into sentences and classify each one separately
sentences = books["description"][0].split(".")
predictions = classifier(sentences)

In [13]:
import numpy as np

# Let's look at the first 5 sentences
for i in range(5):
  print(sentences[i])
  # Get the list of scores for the current sentence
  scores = predictions[i]
  # Find the index of the highest score
  max_score_index = np.argmax([score['score'] for score in scores])
  # Get the label corresponding to the highest score
  predicted_label = scores[max_score_index]['label']
  print("predicted_label: ", predicted_label)
  print("scores: ", scores)
  print()

A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives
predicted_label:  surprise
scores:  [{'label': 'anger', 'score': 0.009156348183751106}, {'label': 'disgust', 'score': 0.00262847519479692}, {'label': 'fear', 'score': 0.06816219538450241}, {'label': 'joy', 'score': 0.04794241115450859}, {'label': 'neutral', 'score': 0.1403856873512268}, {'label': 'sadness', 'score': 0.0021221605129539967}, {'label': 'surprise', 'score': 0.7296026945114136}]

 John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers
predicted_label:  neutral
scores:  [{'label': 'anger', 'score': 0.040478333830833435}, {'label': 'disgust', 'score': 0.2735914885997772}, {'label': 'fear', 'score': 0.006879060063511133}, {'label': 'joy', 'score': 0.10908304899930954}, {'label': 'neutral', 'score': 0.44937071204185486}, {'label': 'sadness', 'score': 0.09362724423408508}, {'label': 'su

We can observe that different sentences within the same book description express a **variety of emotions**. This reinforces the importance of performing sentiment analysis at a **finer granularity**, such as the sentence level, rather than classifying the entire text as a single emotion.

With this insight, we can now proceed to apply the **emotion classifier** across all book descriptions in the dataset. This will allow us to generate a more detailed emotional profile for each book, enhancing the quality and flexibility of our recommendation system.
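Before running this over the whole dataset, it may help to see how per-sentence scores can be collapsed into one profile per book. The sketch below is a minimal version that returns a dictionary instead of appending to shared lists. `book_profile` and `as_scores` are hypothetical helpers, and the numbers are made up for illustration, not model output.

```python
from statistics import mean

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def book_profile(predictions, reduce=max):
    """Collapse per-sentence scores into one score per emotion for a book."""
    per_label = {label: [] for label in EMOTIONS}
    for sentence_scores in predictions:
        for item in sentence_scores:
            per_label[item["label"]].append(item["score"])
    # A book with no sentences gets 0.0 for every emotion.
    return {label: (reduce(vals) if vals else 0.0) for label, vals in per_label.items()}


def as_scores(**given):
    """Build a full 7-label score list; labels not given default to 0.0."""
    return [{"label": label, "score": given.get(label, 0.0)} for label in EMOTIONS]


# Illustrative (invented) scores for a two-sentence description.
predictions = [
    as_scores(surprise=0.75, neutral=0.25),
    as_scores(neutral=0.5, disgust=0.25, joy=0.25),
]

print(book_profile(predictions)["surprise"])               # 0.75
print(book_profile(predictions, reduce=mean)["surprise"])  # 0.375
```

Taking the maximum keeps the strongest signal from any single sentence, while the mean dilutes it across the description. Which one suits a recommender is a design choice; the maximum makes a book findable by any emotion that appears strongly even once.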


In [14]:
import numpy as np

emotion_labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]
isbn = []

def calculate_emotion_scores(predictions, emotion_score):
  """Append one entry per emotion: the max score across a book's sentences."""
  # Start a new entry for this book, initialised to 0 for every emotion
  for label in emotion_labels:
    emotion_score[label].append(0)

  # predictions holds one list of {'label', 'score'} dicts per sentence
  for prediction in predictions:
    for item in prediction:
      label = item['label']
      score = item['score']
      # Keep the highest score seen so far for this emotion
      emotion_score[label][-1] = max(emotion_score[label][-1], score)

In [15]:
from tqdm import tqdm

emotion_score = {label: [] for label in emotion_labels}

for i in tqdm(range(len(books))):
  isbn.append(books["isbn13"][i])
  sentences = books["description"][i].split(".")
  predictions = classifier(sentences)
  calculate_emotion_scores(predictions, emotion_score)

  0%|          | 6/5197 [00:00<02:48, 30.81it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 5197/5197 [02:23<00:00, 36.34it/s]
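The warning in the progress output suggests that calling the pipeline once per book is not the most efficient pattern on a GPU. One possible alternative, sketched here with a stand-in classifier so it runs without a model, is to flatten every sentence into a single list, classify that list in one call, and regroup the results per book. `classify_books` and `fake_classify` are hypothetical. With the real pipeline, the `classify` argument could be something like `lambda s: classifier(s, batch_size=32)`, though that has not been tested against this notebook.

```python
def simple_split(text):
    """Split on periods and drop fragments that are empty or whitespace only."""
    return [s for s in text.split(".") if s.strip()]


def classify_books(descriptions, classify, split=simple_split):
    """Classify all sentences in one call, then regroup the results per book."""
    flat, owners = [], []
    for book_idx, text in enumerate(descriptions):
        for sentence in split(text):
            flat.append(sentence)
            owners.append(book_idx)  # remember which book each sentence came from

    results = classify(flat) if flat else []

    grouped = [[] for _ in descriptions]
    for book_idx, result in zip(owners, results):
        grouped[book_idx].append(result)
    return grouped


def fake_classify(sentences):
    """Stand-in for the real pipeline: one single-label score list per sentence."""
    return [[{"label": "neutral", "score": 1.0}] for _ in sentences]


grouped = classify_books(["A. B.", "", "C"], fake_classify)
print([len(per_book) for per_book in grouped])  # [2, 0, 1]
```

Each inner list has the same shape as the per-book `predictions` used above, so the existing aggregation step would not need to change.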


In [16]:
emotion_df = pd.DataFrame(emotion_score)
emotion_df["isbn13"] = isbn

emotion_df.head()

Unnamed: 0,anger,disgust,fear,joy,sadness,surprise,neutral,isbn13
0,0.064134,0.273591,0.928168,0.932797,0.967158,0.729603,0.646216,9780002005883
1,0.612619,0.348284,0.942528,0.704422,0.11169,0.252545,0.887939,9780002261982
2,0.064134,0.104007,0.972321,0.767237,0.11169,0.078766,0.549477,9780006178736
3,0.351484,0.150723,0.360706,0.251881,0.11169,0.078766,0.732685,9780006280897
4,0.081412,0.184495,0.095043,0.040564,0.475881,0.078766,0.88439,9780006280934


In [17]:
books_with_emotion = pd.merge(books, emotion_df, on="isbn13")

In [18]:
books_with_emotion.head()

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,...,tagged_description,simple_categories,isbn,anger,disgust,fear,joy,sadness,surprise,neutral
0,9780002005883,2005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,...,9780002005883 A NOVEL THAT READERS and critics...,Fiction,,0.064134,0.273591,0.928168,0.932797,0.967158,0.729603,0.646216
1,9780002261982,2261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,...,9780002261982 A new 'Christie for Christmas' -...,Fiction,9780002000000.0,0.612619,0.348284,0.942528,0.704422,0.11169,0.252545,0.887939
2,9780006178736,6178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,...,"9780006178736 A memorable, mesmerizing heroine...",Fiction,,0.064134,0.104007,0.972321,0.767237,0.11169,0.078766,0.549477
3,9780006280897,6280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,...,9780006280897 Lewis' work on the nature of lov...,Nonfiction,9780006000000.0,0.351484,0.150723,0.360706,0.251881,0.11169,0.078766,0.732685
4,9780006280934,6280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,...,"9780006280934 ""In The Problem of Pain, C.S. Le...",Nonfiction,9780006000000.0,0.081412,0.184495,0.095043,0.040564,0.475881,0.078766,0.88439


In [21]:
# Summary statistics for the seven emotion score columns in books_with_emotion
emotion_columns = ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'neutral']
emotion_stats = books_with_emotion[emotion_columns].describe()
emotion_stats = emotion_stats.drop(index='count')
print(emotion_stats)

         anger   disgust      fear       joy   sadness  surprise   neutral
mean  0.164808  0.200597  0.308601  0.280208  0.223608  0.174044  0.760011
std   0.218574  0.212761  0.342392  0.317908  0.248027  0.189109  0.204867
min   0.000606  0.000821  0.000442  0.000550  0.001251  0.000779  0.000981
25%   0.064134  0.104007  0.051363  0.040564  0.111690  0.078766  0.549477
50%   0.064134  0.104007  0.093588  0.087731  0.111690  0.078766  0.838376
75%   0.138384  0.187477  0.580464  0.498712  0.177615  0.198874  0.936846
max   0.989582  0.989417  0.995326  0.992068  0.989361  0.983455  0.974344


### 📊 Emotion Distribution Analysis

The mean scores of the six non-neutral emotions fall in a fairly narrow band, from about 0.16 (anger) to about 0.31 (fear). Keep in mind that each value is the **maximum** score across a book's sentences, so these statistics describe the strongest signal per book rather than its average tone.

**Neutral** stands out with the highest mean (about 0.76) and median (about 0.84). This is expected: book descriptions are largely informational, so most of them contain at least one sentence that the model scores as strongly neutral. The high average suggests that many descriptions do not lean toward a single intense emotion throughout.
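One detail in the table is worth a closer look before drawing conclusions: for anger, disgust, sadness, and surprise, the 25% and 50% quantiles are identical, and the same values recur across different books in the preview rows. A quick way to surface this is to count the most frequent value per column. The sketch below does so for three columns using the five preview rows shown earlier; `most_repeated` is a hypothetical helper. One possible explanation, which the notebook does not confirm, is a fragment shared by many descriptions, such as the empty string that `split(".")` leaves after a final period.

```python
from collections import Counter

# Values copied from the first five rows of the emotion table above.
preview = {
    "anger": [0.064134, 0.612619, 0.064134, 0.351484, 0.081412],
    "sadness": [0.967158, 0.11169, 0.11169, 0.11169, 0.475881],
    "surprise": [0.729603, 0.252545, 0.078766, 0.078766, 0.078766],
}


def most_repeated(values):
    """Return (value, count) for the most frequent value in a column."""
    value, count = Counter(values).most_common(1)[0]
    return value, count


for label, values in preview.items():
    print(label, most_repeated(values))
# anger (0.064134, 2)
# sadness (0.11169, 3)
# surprise (0.078766, 3)
```

If the same check on the full dataset shows one value dominating a column, that value is acting as a floor under the max aggregation and the lower quantiles say little about the books themselves.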


In [None]:
# Save the DataFrame to a CSV file in Google Drive
output_path = '/content/drive/MyDrive/Colab Notebooks/Semantic Book Recommender/books_with_emotion.csv'
books_with_emotion.to_csv(output_path, index=False)

print(f"DataFrame saved successfully to {output_path}")

DataFrame saved successfully to /content/drive/MyDrive/Colab Notebooks/Semantic Book Recommender/books_with_emotion.csv


At this stage, the output is structured as a **DataFrame**, containing each book along with its associated emotions.

Now, we’ll focus on transforming this data into a **more interactive and user-friendly format** — one that allows users to explore, filter, and gain insights from the emotional profiles of books. This step is essential for making the sentiment analysis **accessible**, **informative**, and **practically useful** for readers and recommender systems alike.
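As a rough idea of what such filtering could look like, the sketch below ranks books by a chosen emotion, optionally restricted to a category. It uses plain dictionaries with titles, categories, and scores copied from the preview rows above; `recommend` and `sample_books` are hypothetical names, and a real implementation would more likely operate on the saved DataFrame.

```python
# Titles, simple categories, and scores copied from the preview rows above.
sample_books = [
    {"title": "Gilead", "category": "Fiction", "joy": 0.932797, "fear": 0.928168, "sadness": 0.967158},
    {"title": "Spider's Web", "category": "Fiction", "joy": 0.704422, "fear": 0.942528, "sadness": 0.11169},
    {"title": "Rage of angels", "category": "Fiction", "joy": 0.767237, "fear": 0.972321, "sadness": 0.11169},
    {"title": "The Four Loves", "category": "Nonfiction", "joy": 0.251881, "fear": 0.360706, "sadness": 0.11169},
    {"title": "The Problem of Pain", "category": "Nonfiction", "joy": 0.040564, "fear": 0.095043, "sadness": 0.475881},
]


def recommend(books, emotion, category=None, top_n=3):
    """Return the top_n books with the highest score for the given emotion."""
    pool = [b for b in books if category is None or b["category"] == category]
    return sorted(pool, key=lambda b: b[emotion], reverse=True)[:top_n]


print([b["title"] for b in recommend(sample_books, "joy", top_n=2)])
# ['Gilead', 'Rage of angels']

print([b["title"] for b in recommend(sample_books, "sadness", category="Nonfiction")])
# ['The Problem of Pain', 'The Four Loves']
```

Combining the category filter with an emotion ranking gives a minimal version of the "explore by tone" behavior described above.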
