# Notebook 4: Collaborative Filtering (Item-Based) Model

**Objective:** Build an item-based collaborative filtering model using K-Nearest Neighbors. This model will find books similar to a given book based on user rating patterns.

**Strategy:**
1.  To handle the massive dataset and avoid memory errors, we first filter it down to a denser subset.
2.  We select users who have provided a substantial number of ratings (at least 200).
3.  We then select books that have received a substantial number of ratings (at least 50) from those users.
4.  We'll build the model on this filtered dataset.
5.  This approach is inspired by the articles provided and is standard practice for this dataset.

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

# Define file paths
ARTIFACTS_DIR = '../artifacts'
BOOKS_PATH = os.path.join(ARTIFACTS_DIR, 'cleaned_books.pkl')
RATINGS_PATH = os.path.join(ARTIFACTS_DIR, 'cleaned_ratings.pkl')

In [2]:
# Load data
books_df = pd.read_pickle(BOOKS_PATH)
ratings_df = pd.read_pickle(RATINGS_PATH)

print("Data loaded.")

Data loaded.


## 1. Merge and Filter Data

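First, we merge the ratings with the book metadata on `isbn` so that each rating carries its book title. `merge` performs an inner join by default, so ratings whose ISBN is missing from `books_df` are dropped.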

In [3]:
ratings_with_books = ratings_df.merge(books_df, on='isbn')
ratings_with_books.head()

,user_id,isbn,book_rating,book_title,book_author,year_of_publication,publisher,image_url_s,image_url_m,image_url_l
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
1,276726,0155061224,5,Rites of Passage,Judith Rae,2001,Heinle,http://images.amazon.com/images/P/0155061224.0...,http://images.amazon.com/images/P/0155061224.0...,http://images.amazon.com/images/P/0155061224.0...
2,276727,0446520802,0,The Notebook,Nicholas Sparks,1996,Warner Books,http://images.amazon.com/images/P/0446520802.0...,http://images.amazon.com/images/P/0446520802.0...,http://images.amazon.com/images/P/0446520802.0...
3,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999,Cambridge University Press,http://images.amazon.com/images/P/052165615X.0...,http://images.amazon.com/images/P/052165615X.0...,http://images.amazon.com/images/P/052165615X.0...
4,276729,0521795028,6,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001,Cambridge University Press,http://images.amazon.com/images/P/0521795028.0...,http://images.amazon.com/images/P/0521795028.0...,http://images.amazon.com/images/P/0521795028.0...


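Now we apply two filters in sequence to reduce sparsity. Both counts include implicit (0) ratings, and books are counted by `book_title`, so editions that share a title are pooled.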

In [4]:
# Filter 1: Users with at least 200 ratings
user_rating_counts = ratings_with_books['user_id'].value_counts()
qualified_users = user_rating_counts[user_rating_counts >= 200].index

filtered_df = ratings_with_books[ratings_with_books['user_id'].isin(qualified_users)]

print(f"Original ratings: {len(ratings_with_books)}")
print(f"Ratings after user filter: {len(filtered_df)}")

Original ratings: 1031136
Ratings after user filter: 475007


In [5]:
# Filter 2: Books with at least 50 ratings (from the filtered user group)
book_rating_counts = filtered_df['book_title'].value_counts()
qualified_books = book_rating_counts[book_rating_counts >= 50].index

final_df = filtered_df[filtered_df['book_title'].isin(qualified_books)]

print(f"Ratings after book filter: {len(final_df)}")
print(f"Final dataset shape: {final_df.shape}")
print(f"Unique Users: {final_df['user_id'].nunique()}")
print(f"Unique Books: {final_df['book_title'].nunique()}")

Ratings after book filter: 58823
Final dataset shape: (58823, 10)
Unique Users: 815
Unique Books: 707
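
To make "reducing sparsity" measurable, the sketch below computes the share of (user, book) cells that hold at least one rating, before and after a user filter. It runs on a made-up toy frame, and `rating_density` is a hypothetical helper introduced here for illustration; it is not part of the notebook's pipeline.

```python
import pandas as pd

def rating_density(df, user_col="user_id", item_col="book_title"):
    """Fraction of (user, item) cells that contain at least one rating."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    # Count each (user, item) pair once, even if it was rated several times
    n_pairs = len(df[[user_col, item_col]].drop_duplicates())
    return n_pairs / (n_users * n_items)

# Toy data (illustrative only): user 3 has rated a single book
toy = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2, 3],
    "book_title":  ["A", "B", "C", "A", "B", "D"],
    "book_rating": [8, 0, 7, 9, 6, 5],
})

# Keep users with at least 2 ratings, mirroring the >= threshold used above
counts = toy["user_id"].value_counts()
dense_toy = toy[toy["user_id"].isin(counts[counts >= 2].index)]

density_before = rating_density(toy)
density_after = rating_density(dense_toy)
print(f"Density before: {density_before:.3f}, after: {density_after:.3f}")
```

Applied to `final_df`, the same ratio is at most 58823 / (815 × 707), or about 0.10, and lower if some (user, title) pairs appear more than once.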


## 2. Create Pivot Table

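We need an item-user matrix with one row per book: `book_title` as the index, `user_id` as columns, and `book_rating` as values. This model uses only explicit ratings, so we filter out the 0s first. Users whose remaining ratings are all 0 disappear at this step, so the matrix can have fewer columns than the 815 users counted above. `pivot_table` averages duplicate (title, user) pairs by default, which happens when a user rated more than one edition of the same title.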

In [6]:
explicit_final_df = final_df[final_df['book_rating'] > 0]

pivot_df = explicit_final_df.pivot_table(index='book_title', columns='user_id', values='book_rating')

# Fill NaNs with 0 because the model needs a complete matrix; from here on, 0 means "not rated"
pivot_df.fillna(0, inplace=True)

pivot_df.head()

book_title \ user_id,254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
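
The behaviour of the pivot step is easier to see on a small example. The sketch below uses a made-up toy frame (not the real ratings) in which one user rated the same title twice and also left an implicit 0 rating. It shows that duplicates are averaged and that dropped or missing ratings end up as 0.

```python
import pandas as pd

# Toy data (illustrative only): user 20 rated "A" twice and gave "B" an implicit 0
toy = pd.DataFrame({
    "user_id":     [10, 10, 20, 20, 20],
    "book_title":  ["A", "B", "A", "A", "B"],
    "book_rating": [8, 6, 10, 6, 0],
})

# Same steps as the cell above: drop implicit ratings, pivot, fill gaps with 0
explicit = toy[toy["book_rating"] > 0]
pivot = explicit.pivot_table(
    index="book_title", columns="user_id", values="book_rating"
).fillna(0)

print(pivot)
# pivot.loc["A", 20] is the mean of 10 and 6; pivot.loc["B", 20] is a filled-in 0
```

After the fill, a 0 stands for "no explicit rating" rather than a low score, which is worth keeping in mind when reading the distances later on.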


## 3. Create Sparse Matrix and Train Model

In [7]:
# Convert the pivot table to a sparse matrix for efficiency
book_sparse_matrix = csr_matrix(pivot_df.values)

print(book_sparse_matrix)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 14345 stored elements and shape (707, 777)>
  Coords	Values
  (0, 0)	9.0
  (0, 14)	8.0
  (0, 23)	10.0
  (0, 34)	9.0
  (0, 38)	8.0
  (0, 55)	7.0
  (0, 123)	9.0
  (0, 139)	10.0
  (0, 151)	9.0
  (0, 167)	8.0
  (0, 198)	10.0
  (0, 199)	10.0
  (0, 218)	10.0
  (0, 245)	10.0
  (0, 319)	10.0
  (0, 346)	10.0
  (0, 427)	7.0
  (0, 439)	9.0
  (0, 479)	10.0
  (0, 520)	9.0
  (0, 527)	8.0
  (0, 603)	9.0
  (0, 666)	8.0
  (0, 672)	7.0
  (0, 724)	10.0
  :	:
  (706, 14)	8.0
  (706, 23)	7.0
  (706, 79)	10.0
  (706, 96)	5.0
  (706, 104)	5.0
  (706, 134)	6.0
  (706, 208)	7.0
  (706, 257)	8.0
  (706, 268)	9.0
  (706, 286)	10.0
  (706, 326)	10.0
  (706, 341)	9.0
  (706, 391)	7.0
  (706, 395)	9.0
  (706, 422)	7.0
  (706, 426)	6.0
  (706, 432)	10.0
  (706, 438)	8.0
  (706, 470)	10.0
  (706, 511)	8.0
  (706, 608)	10.0
  (706, 621)	8.0
  (706, 651)	10.0
  (706, 727)	9.0
  (706, 771)	8.0
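
The matrix printed above stores 14345 non-zero values in 707 × 777 cells, roughly 2.6%, so the dense pivot table is mostly zeros. For larger subsets where the dense pivot would not fit in memory, one option is to build the CSR matrix directly from the long-format ratings. The following is a minimal sketch on toy data, assuming duplicates are averaged first to mimic `pivot_table`; the variable names are illustrative and this is not the route the notebook takes.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format ratings (illustrative only)
toy = pd.DataFrame({
    "user_id":     [10, 10, 20, 30],
    "book_title":  ["A", "B", "A", "B"],
    "book_rating": [8.0, 6.0, 9.0, 7.0],
})

# Average duplicate (title, user) pairs, as pivot_table does by default
agg = toy.groupby(["book_title", "user_id"], as_index=False)["book_rating"].mean()

# Categorical codes give each book a row position and each user a column position
book_cat = agg["book_title"].astype("category")
user_cat = agg["user_id"].astype("category")

sparse = csr_matrix(
    (
        agg["book_rating"].to_numpy(),
        (book_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy()),
    ),
    shape=(len(book_cat.cat.categories), len(user_cat.cat.categories)),
)

# Cross-check against the dense pivot route used in this notebook
dense_via_pivot = toy.pivot_table(
    index="book_title", columns="user_id", values="book_rating"
).fillna(0)
matches_pivot = np.array_equal(sparse.toarray(), dense_via_pivot.to_numpy())
print(sparse.shape, sparse.nnz, matches_pivot)
```

With this approach `book_cat.cat.categories` plays the role of `pivot_df.index` when mapping row numbers back to titles.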


In [8]:
# Initialize and train the NearestNeighbors model
# metric='cosine' is cosine distance (1 - cosine similarity); algorithm='brute' compares the query against every book
model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(book_sparse_matrix)

print("Model trained successfully.")

Model trained successfully.
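
To make the metric concrete, the sketch below computes cosine distance by hand on a tiny made-up matrix and compares it with what `NearestNeighbors` returns. The toy ratings and the `cosine_distance` helper are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy item-user matrix (illustrative only): rows are books, columns are users
ratings = np.array([
    [9.0, 0.0, 8.0, 0.0],   # book 0
    [10.0, 0.0, 7.0, 0.0],  # book 1: rated by the same users as book 0
    [0.0, 6.0, 0.0, 9.0],   # book 2: no raters in common with book 0
])

def cosine_distance(u, v):
    """1 - cosine similarity between two rating vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(csr_matrix(ratings))

# Query with book 0 and ask for all three books, nearest first
dist, idx = toy_model.kneighbors(ratings[[0]], n_neighbors=3)
manual = [cosine_distance(ratings[0], ratings[j]) for j in idx[0]]

print(idx[0], np.round(dist[0], 4), np.round(manual, 4))
```

Two books with no raters in common have a distance of exactly 1, so with 0 standing in for "not rated" the ranking is driven largely by how many raters two books share.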


## 4. Test the Model

In [9]:
# Let's test with a random book from our pivot table (no seed is set, so the pick changes between runs)
test_book_index = np.random.choice(pivot_df.shape[0])
test_book_name = pivot_df.index[test_book_index]

print(f"Finding recommendations for: {test_book_name}")

# Get the distances and indices of the 6 nearest neighbors (the 1st is normally the book itself, at distance 0)
query_vector = pivot_df.iloc[test_book_index, :].values.reshape(1, -1)
distances, indices = model.kneighbors(query_vector, n_neighbors=6)

# Flatten the (1, 6) result arrays once and walk them in parallel
for i, (idx, dist) in enumerate(zip(indices.flatten(), distances.flatten())):
    label = "Query Book" if i == 0 else f"Recommendation {i}"
    print(f"{label}: {pivot_df.index[idx]} (Distance: {dist:.4f})")

Finding recommendations for: STONES FROM THE RIVER
Query Book: STONES FROM THE RIVER (Distance: 0.0000)
Recommendation 1: The Book of Ruth (Oprah's Book Club (Paperback)) (Distance: 0.7254)
Recommendation 2: Range of Motion (Distance: 0.7305)
Recommendation 3: Vinegar Hill (Oprah's Book Club (Paperback)) (Distance: 0.7330)
Recommendation 4: Cold Mountain : A Novel (Distance: 0.7349)
Recommendation 5: Ellen Foster (Distance: 0.7400)
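
The lookup above can be wrapped in a small function for reuse. The version below is a sketch: `recommend_books` and its parameters are hypothetical names, and it is exercised on a made-up toy matrix rather than the real `pivot_df`. It filters the query title out by name instead of assuming it is always the first neighbor.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend_books(title, pivot, knn, n_recommendations=5):
    """Return (title, distance) pairs for the nearest books, excluding the query."""
    if title not in pivot.index:
        raise KeyError(f"{title!r} is not in the pivot table")
    query = pivot.loc[[title]].to_numpy()  # shape (1, n_users)
    # Ask for one extra neighbor because the query book itself is usually returned
    n_neighbors = min(n_recommendations + 1, len(pivot))
    distances, indices = knn.kneighbors(query, n_neighbors=n_neighbors)
    results = [
        (pivot.index[i], float(d))
        for d, i in zip(distances[0], indices[0])
        if pivot.index[i] != title
    ]
    return results[:n_recommendations]

# Toy item-user matrix (illustrative only)
toy_pivot = pd.DataFrame(
    [[9.0, 0.0, 8.0, 0.0],
     [10.0, 0.0, 7.0, 0.0],
     [0.0, 6.0, 0.0, 9.0],
     [8.0, 5.0, 0.0, 0.0]],
    index=["Book A", "Book B", "Book C", "Book D"],
    columns=[101, 102, 103, 104],
)
toy_knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_pivot.to_numpy())

recs = recommend_books("Book A", toy_pivot, toy_knn, n_recommendations=2)
for rec_title, rec_distance in recs:
    print(f"{rec_title} (Distance: {rec_distance:.4f})")
```

With the notebook's own objects the call would be `recommend_books(test_book_name, pivot_df, model)`.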


## 5. Export Model and Data

In [10]:
# Save the model
with open(os.path.join(ARTIFACTS_DIR, 'knn_model.pkl'), 'wb') as f:
    pickle.dump(model, f)

# Save the pivot table (we need its index to map book titles)
pivot_df.to_pickle(os.path.join(ARTIFACTS_DIR, 'pivot_df.pkl'))

# We also need the main books_df for getting image URLs, which is already saved.

print(f"KNN model and pivot table saved to {ARTIFACTS_DIR}")

KNN model and pivot table saved to ../artifacts
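
A consumer of these artifacts needs to load both files and confirm they still belong together, since the model only returns row numbers and the pivot table supplies the titles. The sketch below demonstrates the round trip in a temporary directory with a toy model, so it does not touch `../artifacts`; the file names match the ones used above, and everything else is illustrative.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for pivot_df and model (illustrative only)
toy_pivot = pd.DataFrame(
    [[9.0, 0.0, 8.0], [10.0, 0.0, 7.0], [0.0, 6.0, 0.0]],
    index=["Book A", "Book B", "Book C"],
    columns=[101, 102, 103],
)
toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_pivot.to_numpy())

with tempfile.TemporaryDirectory() as artifacts_dir:
    model_path = os.path.join(artifacts_dir, "knn_model.pkl")
    pivot_path = os.path.join(artifacts_dir, "pivot_df.pkl")

    # Save, as in the cell above
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    toy_pivot.to_pickle(pivot_path)

    # Load, as a downstream app would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    loaded_pivot = pd.read_pickle(pivot_path)

# The model and pivot table must agree on the number of books and users
rows_match = loaded_model.n_samples_fit_ == len(loaded_pivot)
cols_match = loaded_model.n_features_in_ == loaded_pivot.shape[1]

_, idx_after = loaded_model.kneighbors(loaded_pivot.to_numpy()[[0]], n_neighbors=2)
print(rows_match, cols_match, list(loaded_pivot.index[idx_after[0]]))
```

Pickle files are tied to the library versions that wrote them and should only be loaded from trusted sources, so it helps to record the scikit-learn and pandas versions alongside the artifacts.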
