## Book Recommendation Project

### Books Dataset

Books are identified by their ISBN codes.  
Additionally, content-based information is included, such as **Book-Title**, **Book-Author**, **Year-Of-Publication**, and **Publisher**, which have been retrieved from Amazon Web Services.  
If a book has multiple authors, only the first author appears in the data.

Also included are cover image URLs in three sizes:  
**Image-URL-S**, **Image-URL-M**, **Image-URL-L** (small, medium, large).  
These URLs direct to Amazon's website.

### Ratings Dataset

This dataset contains book rating information.  
Ratings (**Book-Rating**) can be:
- **explicit**, on a scale of 1–10 (higher value = better rating), or  
- **implicit**, indicated by a value of 0 (user has not provided a numerical rating).
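
As a minimal sketch of this distinction, the snippet below splits a small made-up ratings frame into explicit and implicit parts. The sample rows and the variable names (`sample_ratings`, `explicit`, `implicit`) are illustrative assumptions; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; only the column names match the Ratings dataset.
sample_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["0195153448", "0002005018", "0195153448", "0060973129"],
    "Book-Rating": [0, 8, 5, 0],
})

# Explicit ratings carry a score from 1 to 10; a 0 marks an implicit entry.
explicit = sample_ratings[sample_ratings["Book-Rating"] > 0]
implicit = sample_ratings[sample_ratings["Book-Rating"] == 0]

print(len(explicit), "explicit rows,", len(implicit), "implicit rows")
```

Keeping the two groups separate matters because a 0 is not a low score; treating it as one would pull predicted ratings downward.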

## Project Objective

The project's objective is to build a book recommendation system that uses the Surprise library to train a user-specific recommendation model. The system aims to predict which books an individual user is likely to appreciate, based on that user's previous ratings and on the ratings of other users.

## Project Components

### 1. Data Preprocessing and Quality Checking

- Merging book and rating data
- Removing invalid ISBN codes
- Handling implicit entries (0-ratings)
- Possible filtering of infrequent users and books
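
The filtering step could look roughly like the sketch below. The helper name `filter_infrequent` and both thresholds are assumptions chosen for illustration, not values taken from the project.

```python
import pandas as pd

def filter_infrequent(df, min_user_ratings=2, min_book_ratings=2):
    """Keep rows whose user and book each have at least the given number of ratings.

    The thresholds are illustrative assumptions, not values from the project.
    """
    # Map every row to the number of ratings its user and its book have.
    user_counts = df["User-ID"].map(df["User-ID"].value_counts())
    book_counts = df["ISBN"].map(df["ISBN"].value_counts())
    return df[(user_counts >= min_user_ratings) & (book_counts >= min_book_ratings)]

# Hypothetical sample data using the dataset's column names.
sample = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "ISBN": ["A", "B", "A", "B", "C"],
    "Book-Rating": [7, 9, 6, 8, 10],
})
filtered = filter_infrequent(sample)
print(filtered)
```

This is a single pass: removing rows can push other users or books below the threshold, so a stricter version would repeat the filter until the frame stops shrinking.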

### 2. Building a Recommendation Model with the Surprise Library

- Training the model on user–book ratings
- Experimenting with different algorithms (e.g., **SVD**, **KNNWithMeans**, **BaselineOnly**)
- Evaluating model performance with cross-validation (MAE, RMSE)
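
For reference, the two error metrics named above can be written out in a few lines of plain Python. This is only a sketch of the definitions on made-up numbers; in Surprise, `cross_validate` with `measures=["RMSE", "MAE"]` reports the same quantities per fold.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but large errors weigh more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical true and predicted ratings on the 1-10 scale.
actual = [8, 5, 10, 7]
predicted = [7, 5, 8, 8]

print("MAE: ", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

RMSE is never smaller than MAE on the same data, and a wide gap between the two suggests that a few predictions are far off.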

### 3. Generating Predictions and Recommendations

- Using an anti-test set to predict ratings for books the user has not yet read
- Creating user-specific **Top-N recommendations**
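
The two steps above can be sketched without any library. In Surprise the corresponding pieces would be `trainset.build_anti_testset()` and `algo.test(...)`; here the helper names (`build_anti_pairs`, `top_n`), the sample ratings, and the stand-in predictor (a book's mean known rating) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_anti_pairs(known_ratings):
    """Return every (user, book) pair that has no rating in known_ratings."""
    users = sorted({user for user, _ in known_ratings})
    books = sorted({book for _, book in known_ratings})
    return [(u, b) for u in users for b in books if (u, b) not in known_ratings]

def top_n(predictions, n=2):
    """Group (user, book, estimate) triples by user and keep the n best per user."""
    per_user = defaultdict(list)
    for user, book, estimate in predictions:
        per_user[user].append((book, estimate))
    return {
        user: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for user, items in per_user.items()
    }

# Hypothetical known ratings: (user, book) -> rating.
known = {("u1", "A"): 9, ("u1", "B"): 4, ("u2", "B"): 8, ("u2", "C"): 7}

def predict(user, book):
    """Stand-in for a trained model: the mean known rating of the book."""
    scores = [r for (_, b), r in known.items() if b == book]
    return sum(scores) / len(scores)

anti_pairs = build_anti_pairs(known)
predictions = [(u, b, predict(u, b)) for u, b in anti_pairs]
recommendations = top_n(predictions, n=2)
print(recommendations)
```

On the full dataset the number of unrated pairs is roughly users × books, so predictions are usually generated for one user or a batch of users at a time rather than for everyone at once.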

### 4. Analysis and Interpretation of Results

- Examining the model's accuracy and its limitations
- Considering the impact of data structure on model performance
- Presenting possibilities for further development (e.g., content-based enrichment, hybrid models)

The project's end result is a functional prototype-level book recommendation system that can predict user preferences and provide them with personalized book suggestions based on user data.

In [6]:
import pandas as pd
import numpy as np
from collections import defaultdict
from surprise import SVD, Dataset, Reader, accuracy, KNNBasic
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

In [7]:
df_books = pd.read_csv("Books.csv")
df_books


| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton &amp; Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 271355 | 0440400988 | There's a Bat in Bunk Five | Paula Danziger | 1988 | Random House Childrens Pub (Mm) | http://images.amazon.com/images/P/0440400988.0... | http://images.amazon.com/images/P/0440400988.0... | http://images.amazon.com/images/P/0440400988.0... |
| 271356 | 0525447644 | From One to One Hundred | Teri Sloat | 1991 | Dutton Books | http://images.amazon.com/images/P/0525447644.0... | http://images.amazon.com/images/P/0525447644.0... | http://images.amazon.com/images/P/0525447644.0... |
| 271357 | 006008667X | Lily Dale : The True Story of the Town that Ta... | Christine Wicker | 2004 | HarperSanFrancisco | http://images.amazon.com/images/P/006008667X.0... | http://images.amazon.com/images/P/006008667X.0... | http://images.amazon.com/images/P/006008667X.0... |
| 271358 | 0192126040 | Republic (World's Classics) | Plato | 1996 | Oxford University Press | http://images.amazon.com/images/P/0192126040.0... | http://images.amazon.com/images/P/0192126040.0... | http://images.amazon.com/images/P/0192126040.0... |


In [8]:
df_books = df_books[["ISBN", "Book-Title"]]
df_books

| | ISBN | Book-Title |
|---|---|---|
| 0 | 0195153448 | Classical Mythology |
| 1 | 0002005018 | Clara Callan |
| 2 | 0060973129 | Decision in Normandy |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... |
| 4 | 0393045218 | The Mummies of Urumchi |
| ... | ... | ... |
| 271355 | 0440400988 | There's a Bat in Bunk Five |
| 271356 | 0525447644 | From One to One Hundred |
| 271357 | 006008667X | Lily Dale : The True Story of the Town that Ta... |
| 271358 | 0192126040 | Republic (World's Classics) |


In [9]:
df_ratings = pd.read_csv("Ratings.csv")
df_ratings

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |
| ... | ... | ... | ... |
| 1149775 | 276704 | 1563526298 | 9 |
| 1149776 | 276706 | 0679447156 | 0 |
| 1149777 | 276709 | 0515107662 | 10 |
| 1149778 | 276721 | 0590442449 | 10 |


In [10]:
df_merged = pd.merge(df_ratings, df_books, on="ISBN", how="inner")
df_merged

| | User-ID | ISBN | Book-Rating | Book-Title |
|---|---|---|---|---|
| 0 | 276725 | 034545104X | 0 | Flesh Tones: A Novel |
| 1 | 276726 | 0155061224 | 5 | Rites of Passage |
| 2 | 276727 | 0446520802 | 0 | The Notebook |
| 3 | 276729 | 052165615X | 3 | Help!: Level 1 |
| 4 | 276729 | 0521795028 | 6 | The Amsterdam Connection : Level 4 (Cambridge ... |
| ... | ... | ... | ... | ... |
| 1031131 | 276704 | 0876044011 | 0 | Edgar Cayce on the Akashic Records: The Book o... |
| 1031132 | 276704 | 1563526298 | 9 | Get Clark Smart : The Ultimate Guide for the S... |
| 1031133 | 276706 | 0679447156 | 0 | Eight Weeks to Optimum Health: A Proven Progra... |
| 1031134 | 276709 | 0515107662 | 10 | The Sherbrooke Bride (Bride Trilogy (Paperback)) |
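
An inner merge keeps only the ratings whose ISBN also appears in the books table, which is why the merged frame has fewer rows than the ratings frame. One way to see how many ratings are dropped is a left merge with `indicator=True`; the sketch below uses small made-up frames (`ratings`, `books`) rather than the project data.

```python
import pandas as pd

# Hypothetical sample frames using the dataset's column names.
ratings = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["A", "B", "Z"],
    "Book-Rating": [5, 0, 9],
})
books = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "Book-Title": ["Title A", "Title B", "Title C"],
})

# A left merge keeps every rating; the _merge column records whether a book matched.
checked = pd.merge(ratings, books, on="ISBN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

# The inner merge keeps only the ratings that found a matching book.
inner = pd.merge(ratings, books, on="ISBN", how="inner")

print(len(ratings), "ratings,", len(inner), "kept,", len(unmatched), "without a matching book")
```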


In [11]:
# Check the ISBN column for values that look malformed.
# ISBN codes have two standard lengths:
# - ISBN-10: 10 characters (the standard before 2007)
# - ISBN-13: 13 characters (the standard since 2007)
# Note: str.isdigit() also rejects ISBN-10 codes whose check digit is "X".
isbn_clean = df_merged['ISBN'].str.replace('-', '')
invalid_isbn = df_merged[~isbn_clean.str.isdigit() | ~isbn_clean.str.len().isin([10, 13])]
print("Illogical ISBN values:")
print(invalid_isbn)

Illogical ISBN values:
         User-ID        ISBN  Book-Rating  \
0         276725  034545104X            0   
3         276729  052165615X            3   
6         276744  038550120X            7   
10        276746  055356451X            0   
25        276762  034544003X            0   
...          ...         ...          ...   
1031064   276688  055308920X            0   
1031089   276688  068484267X            0   
1031093   276688  068810553X            0   
1031126   276704  059032120X            0   
1031129   276704  080410526X            0   

                                                Book-Title  
0                                     Flesh Tones: A Novel  
3                                           Help!: Level 1  
6                                          A Painted House  
10                                              Night Sins  
25       Southampton Row (Charlotte &amp; Thomas Pitt N...  
...                                                    ...  
1031064  

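The merged DataFrame has 1,031,136 rows × 4 columns, and 85,392 of those rows are flagged as having illogical ISBN values.
In other words, approximately 8% of the data is flagged. Note that every flagged row shown above ends in `X`, which is a legal ISBN-10 check digit (it stands for the value 10), so the `isdigit()` test flags such codes even when they are well-formed. In this project, the decision is made to clean out the flagged ISBN values.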
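
A checksum-based test is one way to separate well-formed ISBNs from malformed ones without rejecting a trailing `X`. The sketch below is not part of the notebook; the function names `is_valid_isbn10` and `is_valid_isbn13` are assumptions, and the logic follows the standard ISBN check-digit rules.

```python
def is_valid_isbn10(isbn):
    """ISBN-10: weights 10..1, the last character may be 'X' (value 10), sum divisible by 11."""
    isbn = isbn.replace("-", "").upper()
    if len(isbn) != 10 or not isbn.isascii():
        return False
    if not isbn[:9].isdigit() or not (isbn[9].isdigit() or isbn[9] == "X"):
        return False
    digits = [int(c) for c in isbn[:9]] + [10 if isbn[9] == "X" else int(isbn[9])]
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0

def is_valid_isbn13(isbn):
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    isbn = isbn.replace("-", "")
    if len(isbn) != 13 or not isbn.isascii() or not isbn.isdigit():
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn))
    return total % 10 == 0

# Two ISBNs taken from the tables above, plus one with a deliberately wrong check digit.
for code in ["034545104X", "0195153448", "0195153449"]:
    print(code, is_valid_isbn10(code))
```

Applied to a column, this could be used as `df_merged["ISBN"].map(is_valid_isbn10)`; how many of the flagged rows would pass is not known without running it on the data.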