# Notebook 3: Weighted Rating (Popularity-Based) Model

**Objective:** Build a popularity-based recommendation system. This model is not personalized: every user sees the same ranking. That makes it a reasonable fallback for new users with no rating history and a natural source for a "Top Rated" list. We'll use a weighted rating formula to balance each book's average rating against the number of ratings it received.

**Formula:** `Weighted Rating (WR) = (v / (v + m)) * R + (m / (v + m)) * C`
- `v`: number of ratings for the book
- `m`: minimum ratings required (a percentile threshold)
- `R`: average rating for the book
- `C`: mean rating across *all* books (computed in this notebook as the mean of the per-book average ratings)
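
To get a feel for how the formula behaves, the sketch below evaluates it for a few hypothetical books. The function name `weighted_rating_value` and the values of `m` and `C` are assumptions chosen for illustration; they are not the values computed later in this notebook.

```python
def weighted_rating_value(v, R, m, C):
    """Blend a book's own average R with the overall mean C, weighted by its rating count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only
m, C = 5, 7.5

few = weighted_rating_value(v=1, R=10.0, m=m, C=C)     # a single perfect rating
some = weighted_rating_value(v=5, R=10.0, m=m, C=C)    # v == m: exactly halfway between R and C
many = weighted_rating_value(v=500, R=10.0, m=m, C=C)  # many ratings: close to R

print(f"{few:.3f} {some:.3f} {many:.3f}")
```

A book with very few ratings is pulled toward `C`, while a book with many ratings keeps a score close to its own average `R`.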

In [1]:
import pandas as pd
import numpy as np
import os

# Define file paths
ARTIFACTS_DIR = '../artifacts'
BOOKS_PATH = os.path.join(ARTIFACTS_DIR, 'cleaned_books.pkl')
RATINGS_PATH = os.path.join(ARTIFACTS_DIR, 'cleaned_ratings.pkl')

In [2]:
# Load data
books_df = pd.read_pickle(BOOKS_PATH)
ratings_df = pd.read_pickle(RATINGS_PATH)

print("Data loaded.")

Data loaded.


## 1. Prepare Data

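We only want to consider explicit ratings (1-10) for this model. The filter below keeps rows with `book_rating > 0`, so a rating of 0 is treated as an implicit interaction rather than a score.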

In [3]:
explicit_ratings_df = ratings_df[ratings_df['book_rating'] > 0]
print(f"Original ratings: {len(ratings_df)}, Explicit ratings: {len(explicit_ratings_df)}")

Original ratings: 1149780, Explicit ratings: 433671


In [4]:
# Attach book metadata (title, author, ...) to each rating by joining on 'isbn'.
# merge() defaults to an inner join, so ratings whose ISBN is missing from books_df are dropped.
ratings_with_books = explicit_ratings_df.merge(books_df, on='isbn')
ratings_with_books.head()

|  | user_id | isbn | book_rating | book_title | book_author | year_of_publication | publisher | image_url_s | image_url_m | image_url_l |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 276726 | 0155061224 | 5 | Rites of Passage | Judith Rae | 2001 | Heinle | http://images.amazon.com/images/P/0155061224.0... | http://images.amazon.com/images/P/0155061224.0... | http://images.amazon.com/images/P/0155061224.0... |
| 1 | 276729 | 052165615X | 3 | Help!: Level 1 | Philip Prowse | 1999 | Cambridge University Press | http://images.amazon.com/images/P/052165615X.0... | http://images.amazon.com/images/P/052165615X.0... | http://images.amazon.com/images/P/052165615X.0... |
| 2 | 276729 | 0521795028 | 6 | The Amsterdam Connection : Level 4 (Cambridge ... | Sue Leather | 2001 | Cambridge University Press | http://images.amazon.com/images/P/0521795028.0... | http://images.amazon.com/images/P/0521795028.0... | http://images.amazon.com/images/P/0521795028.0... |
| 3 | 276744 | 038550120X | 7 | A Painted House | JOHN GRISHAM | 2001 | Doubleday | http://images.amazon.com/images/P/038550120X.0... | http://images.amazon.com/images/P/038550120X.0... | http://images.amazon.com/images/P/038550120X.0... |
| 4 | 276747 | 0060517794 | 9 | Little Altars Everywhere | Rebecca Wells | 2003 | HarperTorch | http://images.amazon.com/images/P/0060517794.0... | http://images.amazon.com/images/P/0060517794.0... | http://images.amazon.com/images/P/0060517794.0... |
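
Because the merge above is an inner join, any rating whose `isbn` has no match in `books_df` disappears without warning. The sketch below shows one way to count such rows, using toy frames with made-up values in place of the real data. The variable names `ratings`, `books`, `checked`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are made up)
ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "isbn": ["A", "B", "Z"],   # "Z" has no entry in books
    "book_rating": [5, 8, 9],
})
books = pd.DataFrame({"isbn": ["A", "B"], "book_title": ["Book A", "Book B"]})

# A left merge with indicator=True adds a '_merge' column that records
# whether each key was found on the left only, the right only, or both
checked = ratings.merge(books, on="isbn", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(f"Ratings without book metadata: {len(unmatched)} of {len(ratings)}")
```

Running the same check on `explicit_ratings_df` and `books_df` would show how many explicit ratings the inner join discards.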


## 2. Calculate Rating Counts and Averages

In [5]:
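# Group by book title to get rating counts and average ratings.
# Using 'book_title' as the identifier groups different editions (ISBNs) that share a title,
# which is often desired for recommendations. Titles are matched exactly, so variants that
# differ in capitalization or punctuation are still counted as separate books.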
rating_stats_df = ratings_with_books.groupby('book_title').agg(
    ratings_count=('book_rating', 'count'),
    average_rating=('book_rating', 'mean')
).reset_index()

print(f"Unique books with explicit ratings: {len(rating_stats_df)}")
rating_stats_df.head()

Unique books with explicit ratings: 135567


|  | book_title | ratings_count | average_rating |
| --- | --- | --- | --- |
| 0 | A Light in the Storm: The Civil War Diary of ... | 1 | 9.0 |
| 1 | Ask Lily (Young Women of Faith: Lily Series, ... | 1 | 8.0 |
| 2 | Dark Justice | 1 | 10.0 |
| 3 | Earth Prayers From around the World: 365 Pray... | 7 | 7.142857 |
| 4 | Final Fantasy Anthology: Official Strategy Gu... | 2 | 10.0 |
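
Grouping on the raw title string is sensitive to small differences: the top-10 output later in this notebook lists "The Return of the King" twice, with "the Rings" and "The Rings" in the subtitle. The sketch below shows one possible way to build a normalized grouping key. The toy titles are made up, and the `title_key` column is a hypothetical addition, not part of the notebook's pipeline.

```python
import pandas as pd

# Made-up titles: each pair differs only in case or surrounding whitespace
toy = pd.DataFrame({
    "book_title": [
        "An Example Title (the Series)",
        "An Example Title (The Series)",
        "Another Book ",
        "another  book",
    ],
    "book_rating": [9, 7, 10, 8],
})

# Normalize: trim, collapse repeated whitespace, lowercase
toy["title_key"] = (
    toy["book_title"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.lower()
)

stats = toy.groupby("title_key").agg(
    ratings_count=("book_rating", "count"),
    average_rating=("book_rating", "mean"),
).reset_index()

print(stats)
```

Whether to normalize is a judgment call: it merges accidental variants, but it can also merge distinct books that happen to share a title.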


## 3. Calculate Weighted Rating

In [6]:
# Calculate C: the mean of the per-book average ratings
# (each title counts once, regardless of how many ratings it received)
C = rating_stats_df['average_rating'].mean()
print(f"Mean rating (C): {C:.2f}")

# Calculate m (minimum ratings threshold)
# We'll use the 90th percentile of ratings_count. A book qualifies if it has at least m ratings.
m = rating_stats_df['ratings_count'].quantile(0.90)
print(f"Minimum ratings threshold (m): {int(m)}")

Mean rating (C): 7.52
Minimum ratings threshold (m): 5
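
Note that `C` above averages the per-book means, which is not the same as averaging every individual rating. The sketch below contrasts the two on made-up data. The names `C_per_book` and `C_per_rating` are hypothetical and only serve the comparison.

```python
import pandas as pd

# Made-up ratings: one heavily rated title and two titles with a single rating each
toy = pd.DataFrame({
    "book_title": ["A"] * 8 + ["B", "C"],
    "book_rating": [6] * 8 + [10, 10],
})

per_book = toy.groupby("book_title")["book_rating"].agg(["count", "mean"])

C_per_book = per_book["mean"].mean()      # each title counts once (the approach used above)
C_per_rating = toy["book_rating"].mean()  # each rating counts once

print(f"Mean of per-book averages: {C_per_book:.2f}")
print(f"Mean of all ratings:       {C_per_rating:.2f}")
```

When most titles have only one or two ratings, the per-book mean gives those titles a large say in `C`. Either definition can be defended; the point is to know which one is in use.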


In [7]:
# Filter for books that meet the threshold 'm'
qualified_books_df = rating_stats_df[rating_stats_df['ratings_count'] >= m].copy()
print(f"Number of qualified books: {len(qualified_books_df)}")

Number of qualified books: 13740


In [8]:
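# Define the weighted rating function.
# It accepts a single row or a whole DataFrame, because the arithmetic is element-wise.
def weighted_rating(x, m=m, C=C):
    v = x['ratings_count']   # number of ratings for the book
    R = x['average_rating']  # the book's own average rating
    return (v / (v + m)) * R + (m / (v + m)) * C

# Compute the 'weighted_rating' column for all qualified books at once
# (vectorized; yields the same values as a row-wise apply, without the per-row overhead)
qualified_books_df['weighted_rating'] = weighted_rating(qualified_books_df)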

In [9]:
# Sort by weighted rating
qualified_books_df = qualified_books_df.sort_values('weighted_rating', ascending=False)
qualified_books_df.head(10)

|  | book_title | ratings_count | average_rating | weighted_rating |
| --- | --- | --- | --- | --- |
| 45377 | Harry Potter and the Chamber of Secrets Postca... | 23 | 9.869565 | 9.450495 |
| 119061 | The Two Towers (The Lord of the Rings, Part 2) | 136 | 9.330882 | 9.266765 |
| 29403 | Dilbert: A Book of Postcards | 13 | 9.923077 | 9.256326 |
| 17283 | Calvin and Hobbes | 24 | 9.583333 | 9.228064 |
| 79786 | Postmarked Yesteryear: 30 Rare Holiday Postcards | 11 | 10.0 | 9.225867 |
| 98333 | The Authoritative Calvin and Hobbes (Calvin an... | 20 | 9.6 | 9.184555 |
| 70903 | My Sister's Keeper : A Novel (Picoult, Jodi) | 22 | 9.545455 | 9.170884 |
| 115128 | The Return of the King (The Lord of the Rings,... | 103 | 9.213592 | 9.135314 |
| 115125 | The Return of the King (The Lord of The Rings,... | 16 | 9.625 | 9.12447 |
| 106043 | The Giving Tree | 26 | 9.423077 | 9.116576 |
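
With `m = 5`, titles with only 11 to 23 ratings can outrank titles with more than 100. The sketch below re-ranks four rows copied from the output above under a hypothetical larger `m` of 50, to show how sensitive the ordering is to that choice. It uses the rounded printed value of `C` (7.52), so scores match the table only approximately, and it varies `m` in the formula only, not in the qualification filter. The helper `rank_with` is an assumption for illustration.

```python
import pandas as pd

# Rows copied from the top-10 output above (titles as displayed, including truncation)
top = pd.DataFrame({
    "book_title": [
        "Harry Potter and the Chamber of Secrets Postca...",
        "The Two Towers (The Lord of the Rings, Part 2)",
        "Dilbert: A Book of Postcards",
        "Postmarked Yesteryear: 30 Rare Holiday Postcards",
    ],
    "ratings_count": [23, 136, 13, 11],
    "average_rating": [9.869565, 9.330882, 9.923077, 10.0],
})
C_rounded = 7.52  # printed value of C; the notebook itself uses the unrounded mean

def rank_with(m_value):
    """Recompute the weighted rating for a given m and sort best-first."""
    v, R = top["ratings_count"], top["average_rating"]
    wr = (v / (v + m_value)) * R + (m_value / (v + m_value)) * C_rounded
    return top.assign(weighted_rating=wr).sort_values("weighted_rating", ascending=False)

order_m5 = rank_with(5)["book_title"].tolist()
order_m50 = rank_with(50)["book_title"].tolist()  # hypothetical larger m

print("m=5 :", order_m5[0])
print("m=50:", order_m50[0])
```

A larger `m` pulls sparsely rated titles more strongly toward `C`, so heavily rated books move up the list.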


## 4. Add Book Details and Export

In [10]:
# Get other book details (author, image) by merging with the original books_df.
# books_df has one row per ISBN, so a title can appear several times; drop_duplicates keeps
# the first row per title, which prevents the merge from producing multiple entries per book.
book_details_df = books_df.drop_duplicates('book_title')[['book_title', 'book_author', 'image_url_m', 'image_url_s']]

final_popular_books_df = qualified_books_df.merge(book_details_df, on='book_title')

# Reorder columns for clarity
final_popular_books_df = final_popular_books_df[[
    'book_title', 
    'book_author', 
    'ratings_count', 
    'average_rating', 
    'weighted_rating', 
    'image_url_m', 
    'image_url_s'
]]

final_popular_books_df.head()

|  | book_title | book_author | ratings_count | average_rating | weighted_rating | image_url_m | image_url_s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Harry Potter and the Chamber of Secrets Postca... | J. K. Rowling | 23 | 9.869565 | 9.450495 | http://images.amazon.com/images/P/0439425220.0... | http://images.amazon.com/images/P/0439425220.0... |
| 1 | The Two Towers (The Lord of the Rings, Part 2) | J.R.R. TOLKIEN | 136 | 9.330882 | 9.266765 | http://images.amazon.com/images/P/0345339711.0... | http://images.amazon.com/images/P/0345339711.0... |
| 2 | Dilbert: A Book of Postcards | Scott Adams | 13 | 9.923077 | 9.256326 | http://images.amazon.com/images/P/0836213319.0... | http://images.amazon.com/images/P/0836213319.0... |
| 3 | Calvin and Hobbes | Bill Watterson | 24 | 9.583333 | 9.228064 | http://images.amazon.com/images/P/0836220889.0... | http://images.amazon.com/images/P/0836220889.0... |
| 4 | Postmarked Yesteryear: 30 Rare Holiday Postcards | Pamela E. Apkarian-Russell | 11 | 10.0 | 9.225867 | http://images.amazon.com/images/P/1888054557.0... | http://images.amazon.com/images/P/1888054557.0... |
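
The details merge relies on each title appearing exactly once on both sides. As a safeguard, `merge` accepts a `validate` argument that raises an error when that assumption is violated. The sketch below demonstrates it on toy frames with made-up values; the frame names `stats`, `books`, and `details` are hypothetical.

```python
import pandas as pd

# Toy frames (made-up values); "Book A" appears twice in books, as two editions would
stats = pd.DataFrame({"book_title": ["Book A", "Book B"], "weighted_rating": [9.1, 8.7]})
books = pd.DataFrame({
    "book_title": ["Book A", "Book A", "Book B"],
    "book_author": ["Author 1", "Author 1", "Author 2"],
})

details = books.drop_duplicates("book_title")  # keeps the first row per title

# validate="one_to_one" raises pandas.errors.MergeError if either side repeats a key
merged = stats.merge(details, on="book_title", validate="one_to_one")
assert len(merged) == len(stats)  # no rows gained or lost

# Without the de-duplication step, the same check fails loudly
try:
    stats.merge(books, on="book_title", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(f"Duplicate titles detected: {duplicate_detected}")
```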


In [11]:
# Export to artifacts
final_popular_books_df.to_pickle(os.path.join(ARTIFACTS_DIR, 'top_weighted_books.pkl'))

print(f"Top weighted books DataFrame saved to {ARTIFACTS_DIR}")

Top weighted books DataFrame saved to ../artifacts
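
A downstream consumer, such as an app showing a "Top Rated" list, could read the exported pickle back and take the first few rows. The sketch below round-trips a toy frame through a temporary directory to show the idea. The helper `top_n` and the toy values are assumptions, not part of this notebook.

```python
import os
import tempfile

import pandas as pd

def top_n(df, n=5):
    """Return the n rows with the highest weighted_rating, best first."""
    return df.nlargest(n, "weighted_rating").reset_index(drop=True)

# Toy frame standing in for final_popular_books_df (values are made up)
toy = pd.DataFrame({
    "book_title": ["Book A", "Book B", "Book C"],
    "weighted_rating": [8.2, 9.1, 7.9],
})

# Write and read back in a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "top_weighted_books.pkl")
    toy.to_pickle(path)
    reloaded = pd.read_pickle(path)

top_two = top_n(reloaded, n=2)
print(top_two["book_title"].tolist())
```

Since the exported frame is already sorted by `weighted_rating`, `head(n)` would also work on the real artifact; `nlargest` simply avoids depending on that ordering.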
