Book Recommender System Using the [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)

### Importing Usual Libraries

In [None]:
import numpy as np          # linear algebra
import pandas as pd         # data processing
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings; warnings.simplefilter('ignore')

### Importing Dataset

In [None]:
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
book_data = pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX_Books.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
book_data.head()

* Let's check how one previously problematic row looks after the correction.

In [None]:
book_data[book_data['ISBN']== "078946697X"]

Everything looks nice and clean.

In [None]:
book_data.shape

In [None]:
user_data= pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX-Users.csv', sep= ';', encoding= 'latin-1')
user_data.head()

In [None]:
rating_data= pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX-Book-Ratings.csv', sep= ';', encoding= 'latin-1')
rating_data.head()

### Data Preprocessing

First, I'll inspect the tables to see whether any cleanup is needed. Then I'll apply the necessary operations so the data is easier to work with.

In [None]:
book_data.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis= 1, inplace= True)
book_data.columns= book_data.columns.str.strip().str.lower().str.replace('-', '_')
user_data.columns= user_data.columns.str.strip().str.lower().str.replace('-', '_')
rating_data.columns= rating_data.columns.str.strip().str.lower().str.replace('-', '_')

In [None]:
# None means "no limit"; the older value -1 is deprecated and rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)

In [None]:
print(book_data.dtypes)
print('-'*40)
print(book_data.isnull().sum())

Let's take care of the missing value in the 'book_author' column.

In [None]:
book_data.loc[(book_data['book_author'].isnull()),: ]

In [None]:
book_data.loc[(book_data['isbn'] == '9627982032'),'book_author'] = 'other'

* Let's look at the unique publication years to get a sense of the time period covered. Keep in mind that this dataset was created in 2004.

In [None]:
book_data['year_of_publication'].unique()

In [None]:
# def replace_df_value(df, idx, col_name, val):
#     df.loc[idx, col_name] = val


# replace_df_value(book_data, 209538, 'book_title', 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)')
# replace_df_value(book_data, 209538, 'book_author', 'Michael Teitelbaum')
# replace_df_value(book_data, 209538, 'year_of_publication', 2000)
# replace_df_value(book_data, 209538, 'publisher', 'DK Publishing Inc')

# replace_df_value(book_data, 221678, 'book_title', 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)')
# replace_df_value(book_data, 221678, 'book_author', 'James Buckley')
# replace_df_value(book_data, 221678, 'year_of_publication', 2000)
# replace_df_value(book_data, 221678, 'publisher', 'DK Publishing Inc')

# replace_df_value(book_data, 220731,'book_title', "Peuple du ciel, suivi de 'Les Bergers")
# replace_df_value(book_data, 220731, 'book_author', 'Jean-Marie Gustave Le ClÃ?Â©zio')
# replace_df_value(book_data, 220731, 'year_of_publication', 2003)
# replace_df_value(book_data, 220731, 'publisher', 'Gallimard')

* So the corrections are made. We've seen there are two missing values in the 'publisher' column. Let's take care of that.

In [None]:
book_data.loc[(book_data['publisher'].isnull()),'publisher'] = 'no mention'

In [None]:
print(book_data['publisher'].isnull().sum())

**Let's investigate the users dataset (`user_data`)**

In [None]:
print(user_data.shape)

In [None]:
user_data['user_id'].unique()

* So user_id's alright. Let's check out the age of the users.

In [None]:
user_data['age'].unique()

In [None]:
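# Treat implausible ages as missing; select the 'age' column so user_id and other columns are left intact
user_data.loc[(user_data['age'] > 90) | (user_data['age'] < 5), 'age'] = np.nan
# Fill missing ages with the mean of the remaining ages, then convert to integers
user_data['age'] = user_data['age'].fillna(user_data['age'].mean())
user_data['age'] = user_data['age'].astype('int64')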
user_data['age'].unique()
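
The idea behind this cleaning step can be illustrated on a small list: ages outside the range 5 to 90 are treated as missing, and every missing age is then replaced by the mean of the remaining valid ages. The following is a minimal plain-Python sketch; the ages are made up for illustration and do not come from the dataset.

```python
# Hypothetical ages; None stands for a missing value
ages = [25, None, 3, 40, 120, 31]

def is_valid(age):
    """An age counts as valid if it is present and between 5 and 90 inclusive."""
    return age is not None and 5 <= age <= 90

# Mean of the plausible ages only
valid = [a for a in ages if is_valid(a)]
mean_age = sum(valid) / len(valid)

# Replace missing or implausible ages with the mean, truncated to an integer
cleaned = [a if is_valid(a) else int(mean_age) for a in ages]
print(cleaned)
```

Filling with the mean keeps every user in the table, at the cost of creating an artificial spike at the mean age in any later histogram.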

* We need to investigate our rating dataset too.

In [None]:
rating_data.head()

* I'll only keep the ratings whose ISBNs also appear in the main book_data set.

In [None]:
unique_ratings = rating_data[rating_data.isbn.isin(book_data.isbn)]

* The ratings dataset should only contain ratings from users who exist in the users dataset, unless new users are added to the users dataset.

In [None]:
rating_data = rating_data[rating_data.user_id.isin(user_data.user_id)]

In [None]:
print(rating_data.shape)
print(unique_ratings.shape)
print(book_data.shape)
print(user_data.shape)

In [None]:
unique_ratings['book_rating'].unique()

## Visualization

Let's now make some plots to visualize the data.

The age distribution of the readers: 

In [None]:
user_data.age.hist(bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
plt.title('Age Distribution\n')
plt.xlabel('age')
plt.ylabel('count')
plt.savefig('age_dist.png', bbox_inches='tight')
plt.show()

Let's see which ratings people give most often:

In [None]:
sns.set_style('darkgrid')
sns.countplot(data= unique_ratings , x='book_rating')
plt.show()

This countplot shows that 0 is the most common rating, which can mean that those users haven't rated the books at all. We have to separate the explicit ratings, represented by 1–10, from the implicit ratings, represented by 0.

In [None]:
ratings_explicit= unique_ratings[unique_ratings['book_rating'] != 0]
ratings_implicit= unique_ratings[unique_ratings['book_rating'] == 0]

In [None]:
sns.set_style('darkgrid')
sns.countplot(data= ratings_explicit , x='book_rating')
plt.show()

This countplot of book_rating shows that higher ratings are more common among users, and that 8 is the rating given most often.

In [None]:
# book_data.year_of_publication = pd.to_numeric(book_data.year_of_publication, errors='coerce')

# # Check for 0's or NaNs in Year of Publication
# zero_year = book_data[book_data.year_of_publication == 0].year_of_publication.count()
# nan_year = book_data.year_of_publication.isnull().sum()

# print(f'There are {zero_year} entries as \'0\', and {nan_year} NaN entries in the Year of Publication field')

# # Replace all years of zero with NaN
# book_data.year_of_publication.replace(0, np.nan, inplace=True)

* The problem of string values in 'year_of_publication' was solved in the preprocessing steps, so there is no need to convert the data type anymore and the code above is commented out. The plot below shows that there are 4619 'year_of_publication' values in the range 0-99, which were not visible in the previous notebook. I therefore made this plot interactive to see the 'year_of_publication' distribution more accurately.

In [None]:
import plotly.express as px

fig = px.histogram(book_data, x = "year_of_publication", nbins = 30, width = 800, height = 500)
fig.update_xaxes(tick0 = 0 , dtick = 1000)
fig.show()

### Popularity Based Recommendation

Now let's build our first recommendation system, based on popularity. These recommendations are usually given to every user, irrespective of personal characteristics.

In [None]:
ratings_explicit.head()

In [None]:
print(unique_ratings.shape)
print(ratings_explicit.shape)

I'll only consider ISBNs that were explicitly rated for this recommendation system.

In [None]:
new_book_df= pd.merge(book_data, ratings_explicit, on='isbn')
new_book_df.head()

In [None]:
print(new_book_df.shape)

In [None]:
new_book_df['book_title'].nunique()

In [None]:
top_ten_books= pd.DataFrame(new_book_df.groupby('book_title')['book_rating'].count()
                         .sort_values(ascending=False).head(10))

print('The top ten books recommendation : ')
top_ten_books

So our "All Time Favourite" book recommendations are ready.
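
Note that "popular" here means the number of explicit ratings a title received, not how high those ratings are. The sketch below contrasts the two rankings on a handful of made-up (title, rating) pairs; the titles and scores are hypothetical stand-ins for the rows of `new_book_df`.

```python
from collections import Counter

# Hypothetical (book_title, book_rating) pairs
ratings = [("A", 9), ("B", 10), ("A", 7), ("C", 8), ("A", 8), ("C", 6)]

# Popularity as used above: how many explicit ratings each title received
counts = Counter(title for title, _ in ratings)

# For comparison: the average rating per title
sums = Counter()
for title, r in ratings:
    sums[title] += r
means = {t: sums[t] / counts[t] for t in counts}

top_by_count = [t for t, _ in counts.most_common(2)]
top_by_mean = sorted(means, key=means.get, reverse=True)[:2]
print(top_by_count, top_by_mean)
```

Ranking by count favours widely read books, while ranking by mean can be dominated by titles with a single enthusiastic rating, which is one reason a count-based list is a common starting point.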

### Memory-Based Collaborative Filtering

Memory-Based Collaborative Filtering comes in two kinds:
1. user-item filtering 
2. item-item filtering

User-item filtering takes a particular user and finds users who are similar to that user based on the similarity of their ratings. It then recommends items that those similar users liked.

Unlike user-item filtering, item-item filtering will take an item, find users who liked that item, and find other items that those users or similar users also liked. It takes items and recommends other items.

* Item-Item Collaborative Filtering: “Users who liked this item also liked …”
* User-Item Collaborative Filtering: “Users who are similar to you also liked …”
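
The difference between the two kinds comes down to which vectors are compared: rows of the rating matrix (users) or columns (items). Below is a minimal plain-Python sketch on a made-up 3x3 rating matrix; the `cosine` helper and the ratings are assumptions for illustration, not part of the notebook's pipeline.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical ratings: rows are users, columns are books, 0 means "not rated"
R = [
    [8, 0, 9],   # user 0
    [7, 0, 10],  # user 1
    [0, 9, 0],   # user 2
]

# User-based view: compare rows
user_sim_01 = cosine(R[0], R[1])   # users 0 and 1 rated the same books similarly
user_sim_02 = cosine(R[0], R[2])   # users 0 and 2 share no rated book

# Item-based view: compare columns, i.e. rows of the transpose
cols = list(zip(*R))
item_sim_02 = cosine(cols[0], cols[2])  # books 0 and 2 were liked by the same users
item_sim_01 = cosine(cols[0], cols[1])  # books 0 and 1 have no rater in common
print(user_sim_01, user_sim_02, item_sim_02, item_sim_01)
```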

### Train - Test Split

In [None]:
from sklearn import model_selection
train_data, test_data = model_selection.train_test_split(new_book_df, test_size=0.20)

In [None]:
print(f'Training set lengths: {len(train_data)}')
print(f'Testing set lengths: {len(test_data)}')
print(f'Test set is {(len(test_data)/(len(train_data)+len(test_data))*100):.0f}% of the full dataset.')

In [None]:
# Get int mapping for user_id in train dataset

u_unique_train = train_data.user_id.unique()  
train_data_user2idx = {o:i for i, o in enumerate(u_unique_train)}

# Get int mapping for isbn in train dataset

i_unique_train = train_data.isbn.unique()  
train_data_book2idx = {o:i for i, o in enumerate(i_unique_train)}

# Get int mapping for user_id in test dataset

u_unique_test = test_data.user_id.unique()  
test_data_user2idx = {o:i for i, o in enumerate(u_unique_test)}

# Get int mapping for isbn in test dataset

i_unique_test = test_data.isbn.unique() 
test_data_book2idx = {o:i for i, o in enumerate(i_unique_test)}


In [None]:
# TRAINING SET
train_data['u_unique'] = train_data['user_id'].map(train_data_user2idx)
train_data['i_unique'] = train_data['isbn'].map(train_data_book2idx)

# TESTING SET
test_data['u_unique'] = test_data['user_id'].map(test_data_user2idx)
test_data['i_unique'] = test_data['isbn'].map(test_data_book2idx)

# Convert back to 3-column df
train_data = train_data[['u_unique', 'i_unique', 'book_rating']]
test_data = test_data[['u_unique', 'i_unique', 'book_rating']]

In [None]:
train_data.sample(5)
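
One thing to watch: because the train and test mappings are built separately, the same integer index can refer to different users or books in the two sets, which makes a later cell-by-cell comparison of the two matrices hard to interpret. A possible alternative, sketched here in plain Python with hypothetical IDs and ratings, is to build one mapping from all IDs and apply it to both splits, so that row i and column j mean the same thing in both matrices. The names `user2idx`, `book2idx` and `to_matrix` are illustrative and do not appear elsewhere in this notebook.

```python
# Hypothetical (user_id, isbn, rating) rows for each split
train_rows = [(276744, "038550120X", 7), (116866, "0316666343", 9)]
test_rows = [(116866, "038550120X", 8)]

all_rows = train_rows + test_rows

# One mapping per ID type, shared by both splits; dict.fromkeys keeps first-seen order
user2idx = {u: i for i, u in enumerate(dict.fromkeys(r[0] for r in all_rows))}
book2idx = {b: i for i, b in enumerate(dict.fromkeys(r[1] for r in all_rows))}

def to_matrix(rows):
    """Dense user x book matrix as nested lists; 0 means 'no rating'."""
    m = [[0] * len(book2idx) for _ in user2idx]
    for user, book, rating in rows:
        # Indices produced by enumerate() start at 0 and are used directly
        m[user2idx[user]][book2idx[book]] = rating
    return m

train_m = to_matrix(train_rows)
test_m = to_matrix(test_rows)
print(train_m, test_m)
```

With a shared mapping both matrices have the same shape, and a non-zero cell in the test matrix lines up with the prediction for the same user and book.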

### User-Item Matrix for Train Data

First I'll create an empty users × books matrix, and then fill in the appropriate values by extracting them from the dataset.

In [None]:
n_users = train_data['u_unique'].nunique()
n_books = train_data['i_unique'].nunique()

train_matrix = np.zeros((n_users, n_books))

# entry[0] is the DataFrame index, entry[1] is u_unique, entry[2] is i_unique, entry[3] is the rating
for entry in train_data.itertuples():
    # The indices come from enumerate() and are already 0-based, so no offset is needed
    train_matrix[entry[1], entry[2]] = entry[3]

In [None]:
train_matrix.shape

### User-Item Matrix for Test Data

In [None]:
n_users = test_data['u_unique'].nunique()
n_books = test_data['i_unique'].nunique()

test_matrix = np.zeros((n_users, n_books))

for entry in test_data.itertuples():
    # u_unique and i_unique are already 0-based, so they are used as indices directly
    test_matrix[entry[1], entry[2]] = entry[3]

In [None]:
test_matrix.shape

### Cosine Similarity Based Recommendation System

Since I am running these calculations on my PC, computing similarities over the full matrix is not practical. I'll therefore work on a subset initially. Even so, it takes a bit of time to calculate.

A measure commonly used in recommender systems is *cosine similarity*, where the ratings are seen as vectors in ``n``-dimensional space and the similarity is calculated from the angle between these vectors.

To compute item-item similarity, we need to take the transpose of the matrix.
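
Written out for two rating vectors $u$ and $v$, the cosine similarity and the corresponding cosine distance are:

$$
\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}, \qquad d(u, v) = 1 - \mathrm{sim}(u, v)
$$

As a worked example with two hypothetical vectors $u = (3, 4)$ and $v = (4, 3)$: $u \cdot v = 12 + 12 = 24$ and $\lVert u \rVert = \lVert v \rVert = 5$, so $\mathrm{sim}(u, v) = 24 / 25 = 0.96$ and $d(u, v) = 0.04$. The distinction matters because scikit-learn's `pairwise_distances` with `metric='cosine'` returns the distance $d$, not the similarity.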

In [None]:
train_matrix_small = train_matrix[:5000, :5000]
test_matrix_small = test_matrix[:5000, :5000]

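from sklearn.metrics.pairwise import pairwise_distances
# Note: with metric='cosine' this returns the cosine distance (1 - cosine similarity),
# so larger values in these matrices mean less similar vectors
user_similarity = pairwise_distances(train_matrix_small, metric='cosine')
item_similarity = pairwise_distances(train_matrix_small.T, metric='cosine')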

Now I'll define a function to predict the ratings:

In [None]:
def predict_books(ratings, similarity, type='user'): # default type is 'user'
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        
        # Use np.newaxis so that mean_user_rating has the same format as ratings
        
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
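
To make the 'user' branch easier to follow, here is the same formula spelled out with explicit loops on a tiny example: each prediction is the user's mean rating plus a weighted average of the other users' deviations from their own means. This is a sketch in plain Python with made-up ratings and weights; `predict_user_based` is a hypothetical name. As in `predict_books`, the mean is taken over every entry of a row, including the zeros that stand for unrated books.

```python
def predict_user_based(ratings, similarity):
    """Loop version of: mean_u + sum_v sim[u][v] * (r[v][i] - mean_v) / sum_v |sim[u][v]|."""
    n_users, n_items = len(ratings), len(ratings[0])
    means = [sum(row) / n_items for row in ratings]
    preds = []
    for u in range(n_users):
        denom = sum(abs(s) for s in similarity[u])
        row = []
        for i in range(n_items):
            num = sum(similarity[u][v] * (ratings[v][i] - means[v])
                      for v in range(n_users))
            row.append(means[u] + num / denom)
        preds.append(row)
    return preds

# Hypothetical 2 users x 2 books rating matrix and user-user weight matrix
ratings = [[4, 0],
           [2, 2]]
similarity = [[1.0, 0.5],
              [0.5, 1.0]]

preds = predict_user_based(ratings, similarity)
print(preds)
```

Here both users have a mean of 2, and user 1's deviations are all zero, so the predictions for user 0 are driven only by user 0's own deviations scaled by 1 / 1.5.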

In [None]:
item_prediction = predict_books(train_matrix_small, item_similarity, type='item')
user_prediction = predict_books(train_matrix_small, user_similarity, type='user')

### Evaluation

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(prediction, test_matrix):
    prediction = prediction[test_matrix.nonzero()].flatten()
    test_matrix = test_matrix[test_matrix.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, test_matrix))

print(f'Item-based CF RMSE: {rmse(item_prediction, test_matrix_small)}')
print(f'User-based CF RMSE: {rmse(user_prediction, test_matrix_small)}')
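
The masking step in `rmse` means that only cells where the test matrix holds an actual rating contribute to the error; predictions for unrated cells are ignored. A small plain-Python sketch with made-up numbers shows the arithmetic (`rmse_on_rated` is a hypothetical helper, not the function defined above):

```python
from math import sqrt

def rmse_on_rated(prediction, actual):
    """RMSE over positions where `actual` holds a rating (non-zero)."""
    squared_errors = [
        (p - a) ** 2
        for pred_row, act_row in zip(prediction, actual)
        for p, a in zip(pred_row, act_row)
        if a != 0
    ]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions and test ratings; 0 marks "no rating in the test set"
prediction = [[7.0, 3.0], [5.0, 9.0]]
actual = [[8, 0], [0, 6]]

# Only two cells count: (7 - 8)^2 = 1 and (9 - 6)^2 = 9, so RMSE = sqrt(10 / 2)
score = rmse_on_rated(prediction, actual)
print(score)
```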


* We see that this recommendation system gives an RMSE of 7.94. I want to check whether another method can improve on this score. For this I will use singular value decomposition (SVD) from the Surprise library.

### SVD Based recommendation System

In [None]:
from surprise import Reader, Dataset

# Creating a 'Reader' object to set the limit of the ratings 

reader = Reader(rating_scale=(1, 10))

data = Dataset.load_from_df(ratings_explicit, reader)

In [None]:
from surprise import SVD, model_selection, accuracy

model = SVD()

# Run 5-fold cross-validation on the explicit ratings

%time model_selection.cross_validate(model, data, measures=['RMSE'], cv=5, verbose=True)

### Train - Test Split

In [None]:
trainset, testset = model_selection.train_test_split(data, test_size=0.2)

model = SVD()

model.fit(trainset)
predictions = model.test(testset)

accuracy.rmse(predictions)

We see that the RMSE has improved a lot. It now averages about 1.64, which is pretty good for ratings on a 1-10 scale.
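
For intuition about what the fitted model computes, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and by default clips the result to the rating scale. The sketch below reproduces that arithmetic in plain Python; the function name `svd_estimate` and all numeric values are hypothetical and are not taken from the trained model.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, scale=(1, 10)):
    """Estimate = mu + b_u + b_i + q_i . p_u, clipped to the rating scale."""
    est = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = scale
    return min(high, max(low, est))

# Hypothetical parameters: global mean, user bias, item bias, two latent factors
est = svd_estimate(mu=7.6, b_u=0.3, b_i=-0.5, p_u=[0.1, -0.2], q_i=[0.4, 0.5])

# An estimate above the scale is clipped to the upper bound
clipped = svd_estimate(mu=9.5, b_u=0.6, b_i=0.4, p_u=[0.0], q_i=[0.0])
print(est, clipped)
```

Because the biases already capture how generous a user is and how well liked a book is, a model of this kind can predict a sensible rating even when the latent factors contribute little.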

### Testing Results !

Let's take an arbitrary user-id and item-id to test our model. 

In [None]:
uid = 276744  
iid = '038550120X' 
pred = model.predict(uid, iid, verbose=True)

In [None]:
print(f'The estimated rating for the book with ISBN code {pred.iid} from user #{pred.uid} is {pred.est:.2f}.\n')
actual_rtg= ratings_explicit[(ratings_explicit.user_id==pred.uid) & 
                             (ratings_explicit.isbn==pred.iid)].book_rating.values[0]
print(f'The real rating given for this was {actual_rtg:.2f}.')

In [None]:
# The following function was adapted from the surprise docs
# and can be used to get the top book recommendations for each user.
from collections import defaultdict

def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
        
    return top_n
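
To see what this function returns, here is a small self-contained run on made-up predictions. Plain 5-tuples stand in for Surprise's `Prediction` objects, and `top_n_per_user` is a condensed restatement of the function for illustration only; the user and book IDs are hypothetical.

```python
from collections import defaultdict

def top_n_per_user(predictions, n=10):
    """Group (iid, est) pairs by user and keep the n highest estimates per user."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        top_n[uid] = sorted(user_ratings, key=lambda x: x[1], reverse=True)[:n]
    return top_n

# Hypothetical predictions: (user id, item id, true rating, estimate, details)
preds = [
    ("u1", "bookA", 8, 7.9, {}),
    ("u1", "bookB", 6, 8.4, {}),
    ("u1", "bookC", 9, 6.1, {}),
    ("u2", "bookA", 7, 7.2, {}),
]

result = top_n_per_user(preds, n=2)
print(dict(result))
```

Keep in mind that predictions produced by `model.test(testset)` only cover user-book pairs held out in the test set, so the resulting lists rank books each user has actually rated. To score pairs a user has not rated, the Surprise documentation uses `trainset.build_anti_testset()` as the input to `model.test` instead.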

In [None]:
pred = model.test(testset)
top_n = get_top_n(pred)

In [None]:
def get_reading_list(userid):
    """
    Retrieve full book titles from full 'books_users_ratings' dataframe
    """
    reading_list = defaultdict(list)
    top_n = get_top_n(pred, n=10)
    print(top_n[userid])

    for n in top_n[userid]:
        book, rating = n
        title = new_book_df.loc[new_book_df.isbn==book].book_title.unique()[0]
        reading_list[title] = rating
    print(reading_list)
    return reading_list

In [None]:
# Just take a random look at user_id=116866
example_reading_list = get_reading_list(userid = 116866)
for book, rating in example_reading_list.items():
    print(f'{book}: {rating}')

The books recommended above seem closely related to each other. So my first recommender engine is finished.

In [None]:
example_reading_list