# Introduction

**Here, I use pandas profiling to explore the data and a k-nearest neighbors (KNN) approach to build the recommendation system. The dataset contains 278,858 users (anonymized, but with demographic information) who provided 1,149,780 ratings (explicit and implicit) on 271,379 books.**
![](https://www.detail-online.com/fileadmin/uploads/04-Blog/MKCA_Concourse_House-teaser-gross.jpg)

**Let's dive into the analysis. If you like this notebook, please consider leaving an <font color='red'>Upvote</font>.**

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import pandas_profiling
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots
from plotly import tools
init_notebook_mode(connected = True)
import plotly.figure_factory as ff

from PIL import Image
import requests
from io import BytesIO

# Dataset

In [None]:
#Users
u_cols = ['user_id', 'location', 'age']
users = pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX-Users.csv', sep=';', names=u_cols, encoding='latin-1',low_memory=False)

#Books
i_cols = ['isbn', 'book_title' ,'book_author','year_of_publication', 'publisher', 'img_s', 'img_m', 'img_l']
books = pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX-Books.csv', sep=';', names=i_cols, encoding='latin-1',low_memory=False)

#Ratings
r_cols = ['user_id', 'isbn', 'rating']
ratings = pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX-Book-Ratings.csv', sep=';', names=r_cols, encoding='latin-1',low_memory=False)

In [None]:
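# Column names were passed via `names=`, so each file's header line was read as the first data row; drop it
users = users.iloc[1:]
books = books.iloc[1:]
ratings = ratings.iloc[1:]
# With the header row gone, the rating column can be converted to integers
ratings['rating'] = ratings['rating'].astype(int)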

**Users Dataset**

In [None]:
dat = ff.create_table(users.head())
dat.update_layout(autosize=False,height=200, width = 700)

**Ratings Dataset**

In [None]:
dat = ff.create_table(ratings.head())
dat.update_layout(autosize=False,height=200, width = 700)

**Books Dataset**

In [None]:
dat = ff.create_table(books.head())
dat.update_layout(autosize=False,height=200, width = 4400)

# EDA

In [None]:
df = pd.merge(ratings,books,on='isbn')
dat = ff.create_table(df.head())
dat.update_layout(autosize=False,height=200, width = 3990)

**Top 5 Rated Books**

In [None]:
df['rating'] = df['rating'].astype(int)
# Mean rating and number of ratings per title (ratings of 0 are included in the mean)
rating_stats = pd.DataFrame(df.groupby('book_title')['rating'].mean())
rating_stats['Total_Ratings'] = df.groupby('book_title')['rating'].count()
# Keep titles with more than 30 ratings and take the five highest mean ratings
Top5books = rating_stats[rating_stats['Total_Ratings'] > 30].sort_values(by = 'rating', ascending = False)[:5].reset_index()
# Attach the individual rating rows for these five titles
Top5books1 = pd.merge(Top5books, df, on = 'book_title', how = 'inner')
dat = ff.create_table(Top5books)
dat.update_layout(autosize=False,height=200, width = 1200)
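
**The introduction notes that the ratings are both explicit and implicit. Assuming an implicit rating is stored as 0 (as the Book-Crossing dataset is commonly described), those zeros pull the mean down. The sketch below uses a made-up `toy_df` to show how the ranking can change when only explicit ratings are averaged. It is an illustration, not part of the analysis above.**

```python
import pandas as pd

# Hypothetical toy data: 'Book A' has two implicit (0) ratings and two high explicit ones
toy_df = pd.DataFrame({
    'book_title': ['Book A'] * 4 + ['Book B'] * 3,
    'rating': [0, 0, 10, 8, 7, 7, 7],
})

# Mean over all ratings, zeros included (the approach used in the cell above)
mean_all = toy_df.groupby('book_title')['rating'].mean()

# Mean over explicit ratings only (assumption: 0 marks an implicit rating)
explicit = toy_df[toy_df['rating'] > 0]
mean_explicit = explicit.groupby('book_title')['rating'].mean()

print(pd.DataFrame({'mean_all': mean_all, 'mean_explicit': mean_explicit}))
```

**With zeros included, 'Book B' ranks first. With explicit ratings only, 'Book A' does.**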

**Pandas Profiling Report**

In [None]:
pandas_profiling.ProfileReport(df)

# Recommendation System: K-Nearest Neighbors
**We will find books that are similar to a given book based on how users rated them, and recommend its top-k nearest neighbors. The matrix will have books in rows, user ids in columns, and ratings as values.**

**Combining Books and Ratings on the basis of ISBN**

In [None]:
dat = ff.create_table(df.head())
dat.update_layout(autosize=False,height=200, width = 3990)

**Data Manipulation**

In [None]:
df.drop(['year_of_publication', 'publisher', 'book_author', 'img_s', 'img_m', 'img_l'], axis = 1, inplace = True)

In [None]:
dat = ff.create_table(df.head())
dat.update_layout(autosize=False,height=200, width = 800)

**Combined Book Ratings**

In [None]:
combine_book_rating = df.dropna(axis = 0, subset = ['book_title'])

book_ratingCount = (combine_book_rating.
     groupby(by = ['book_title'])['rating'].
     count().
     reset_index().
     rename(columns = {'rating': 'totalRatingCount'})
     [['book_title', 'totalRatingCount']]
    )

dat = ff.create_table(book_ratingCount.head())
dat.update_layout(autosize=False,height=200, width = 1200)

**Ratings With Total Rating Count**

In [None]:
rating_with_totalRatingCount = combine_book_rating.merge(book_ratingCount, left_on = 'book_title', right_on = 'book_title', how = 'left')

dat = ff.create_table(rating_with_totalRatingCount.head())
dat.update_layout(autosize=False,height=200, width = 1200)

**Statistics Of Total Ratings Count**

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(book_ratingCount['totalRatingCount'].describe())

**Distribution of the Top 10% of Books (90th to 99th Percentile)**

In [None]:
print(book_ratingCount['totalRatingCount'].quantile(np.arange(.9, 1, .01)))

<font color = 'gold'> Observations:</font>
**About 1% of books received 50 or more ratings, so we will keep only this top 1%.**
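
**To make the quantile reasoning concrete, here is a minimal sketch on made-up rating counts (`toy_counts` and its numbers are hypothetical, not taken from the dataset). It uses the same `quantile` call as above and then measures the share of titles that reach a chosen threshold.**

```python
import numpy as np
import pandas as pd

# Hypothetical rating counts per title: 99 rarely rated books and one very popular one
toy_counts = pd.Series([1] * 60 + [3] * 30 + [10] * 9 + [120], name='totalRatingCount')

# Same call pattern as above: rating counts at the upper percentiles
toy_quantiles = toy_counts.quantile(np.arange(.9, 1, .01))
print(toy_quantiles)

# Share of titles that meet the threshold
toy_threshold = 50
share_popular = (toy_counts >= toy_threshold).mean()
print('Share of titles with >= {0} ratings: {1:.2%}'.format(toy_threshold, share_popular))
```

**Checking the share directly is a quick way to confirm that a threshold really corresponds to the intended top percentage.**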

**Keeping Threshold**

In [None]:
#threshold
popularity_threshold = 50
rating_popular_book = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
dat = ff.create_table(rating_popular_book.head())
dat.update_layout(autosize=False,height=200, width = 900)

**To avoid memory issues, we will also restrict the users by location, keeping only those in the US, Canada, the UK, and Australia. We then combine the user data with the rating data and the total rating count data.**

In [None]:
#to avoid memory issue:
users['location'].value_counts()

In [None]:
combined = rating_popular_book.merge(users, on = 'user_id', how = 'left')

# na=False treats rows with a missing location as non-matching instead of producing NaN in the mask
all_user_rating = combined[combined['location'].str.contains("usa|canada|united kingdom|australia", na = False)]
all_user_rating=all_user_rating.drop('age', axis=1)

dat = ff.create_table(all_user_rating.head())
dat.update_layout(autosize=False,height=200, width = 1500)
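
**The location filter above is a plain substring match. The sketch below shows how it behaves on a few made-up rows (`toy_users` is hypothetical), including a missing location, which can appear after a left merge. Passing `na=False` is one way to make the mask strictly boolean.**

```python
import numpy as np
import pandas as pd

# Hypothetical users; the third one has no location
toy_users = pd.DataFrame({
    'user_id': ['1', '2', '3', '4'],
    'location': [
        'seattle, washington, usa',
        'berlin, berlin, germany',
        np.nan,
        'london, england, united kingdom',
    ],
})

pattern = 'usa|canada|united kingdom|australia'

# na=False maps missing locations to False, so the mask can be used for indexing
mask = toy_users['location'].str.contains(pattern, na=False)
kept = toy_users[mask]
print(kept)
```

**Because this is a substring match, any location that merely contains one of the patterns (for example a city name containing "usa") is kept as well.**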

# KNN Approach

**First, we pivot the data into a 2D matrix with books in rows and users in columns, and fill the missing values with 0, because we will be computing distances between the rating vectors of the books. We then convert the values of this matrix DataFrame into a SciPy sparse matrix for more efficient computation.**

In [None]:
all_user_rating = all_user_rating.drop_duplicates(['user_id', 'book_title'])
all_user_rating_pivot = all_user_rating.pivot(index = 'book_title', columns = 'user_id', values = 'rating').fillna(0)
print('No null values now..')

**Sparse Matrix**

In [None]:
all_user_rating_matrix = csr_matrix(all_user_rating_pivot.values)
print('Sparse matrix created..')
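
**The two steps above can be sketched on a tiny made-up table (`toy_ratings` is hypothetical): pivot to a book-by-user matrix, fill the gaps with 0, and wrap the values in a `csr_matrix`, which stores only the non-zero entries.**

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical long-format ratings: one row per (user, book) pair
toy_ratings = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u3'],
    'book_title': ['Book A', 'Book B', 'Book A', 'Book B', 'Book C'],
    'rating': [8, 5, 7, 9, 10],
})

# Books in rows, users in columns; unrated combinations become 0
toy_pivot = toy_ratings.pivot(index='book_title', columns='user_id', values='rating').fillna(0)

# Sparse representation: only the non-zero ratings are stored
toy_matrix = csr_matrix(toy_pivot.values)

print(toy_pivot)
print('shape:', toy_matrix.shape, 'stored non-zeros:', toy_matrix.nnz)
```

**Here 5 of the 9 cells are non-zero. On the real data the matrix is far sparser, which is where the sparse format pays off.**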

# Training Our Recommendation Model

**We will compute the nearest neighbors using the <font color = 'red'>brute</font>-force algorithm with the <font color = 'red'>cosine</font> metric, so that the model ranks books by the cosine distance (1 minus the cosine similarity) between their rating vectors. Fitting the model...**

In [None]:
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(all_user_rating_matrix)
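
**As a small worked check of what the `cosine` metric returns, the sketch below fits the same kind of model on three made-up vectors (`toy_vectors` is hypothetical) and compares the reported distances with 1 minus the cosine similarity computed by hand.**

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical rating vectors
toy_vectors = np.array([
    [8.0, 7.0, 0.0],
    [4.0, 3.5, 0.0],   # same direction as row 0, half the magnitude
    [0.0, 0.0, 10.0],  # orthogonal to row 0
])

toy_knn = NearestNeighbors(metric='cosine', algorithm='brute')
toy_knn.fit(toy_vectors)
toy_dist, toy_idx = toy_knn.kneighbors(toy_vectors[0].reshape(1, -1), n_neighbors=3)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Recompute the distances by hand, in the order the model returned the neighbors
manual = [1.0 - cosine_similarity(toy_vectors[0], toy_vectors[j]) for j in toy_idx.flatten()]

print('neighbor indices:', toy_idx.flatten())
print('model distances: ', toy_dist.flatten())
print('1 - similarity:  ', manual)
```

**Row 1 points in the same direction as row 0, so its distance is essentially 0 even though its ratings are lower. The orthogonal row 2 has a distance of 1. The metric compares rating patterns, not rating magnitudes.**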

# Testing Our Recommendation Model

# 1.

In [None]:
query_index = np.random.choice(all_user_rating_pivot.shape[0])
distances, indices = model_knn.kneighbors(all_user_rating_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(all_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, all_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

# 2.

In [None]:
query_index = np.random.choice(all_user_rating_pivot.shape[0])
distances, indices = model_knn.kneighbors(all_user_rating_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(all_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, all_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

# 3.

In [None]:
query_index = np.random.choice(all_user_rating_pivot.shape[0])
distances, indices = model_knn.kneighbors(all_user_rating_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(all_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, all_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))
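
**The three test cells above repeat the same steps, so they could be wrapped in a small helper. The sketch below is one possible way to do that; `recommend_similar_books` and the toy data are hypothetical names introduced only for this illustration.**

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend_similar_books(pivot, model, title, n_recommendations=5):
    """Return (title, distance) pairs for the nearest neighbors of `title`."""
    # Ask for one extra neighbor because the query book is normally its own nearest neighbor
    n_neighbors = min(n_recommendations + 1, pivot.shape[0])
    query = pivot.loc[title].values.reshape(1, -1)
    distances, indices = model.kneighbors(query, n_neighbors=n_neighbors)
    results = []
    for dist, idx in zip(distances.flatten(), indices.flatten()):
        neighbor = pivot.index[idx]
        if neighbor != title:  # skip the query book itself
            results.append((neighbor, float(dist)))
    return results[:n_recommendations]

# Hypothetical book-by-user matrix in the same layout as all_user_rating_pivot
toy_pivot = pd.DataFrame(
    [[8, 7, 0, 0], [9, 6, 0, 0], [0, 0, 10, 9], [0, 1, 8, 9]],
    index=['Book A', 'Book B', 'Book C', 'Book D'],
    columns=['u1', 'u2', 'u3', 'u4'],
    dtype=float,
)
toy_model = NearestNeighbors(metric='cosine', algorithm='brute').fit(toy_pivot.values)

recs = recommend_similar_books(toy_pivot, toy_model, 'Book A', n_recommendations=2)
print('Recommendations for Book A:\n')
for rank, (name, dist) in enumerate(recs, start=1):
    print('{0}: {1}, with distance of {2:.3f}'.format(rank, name, dist))
```

**Looking the book up by title instead of a random index makes the result reproducible, and the helper could be called with `all_user_rating_pivot` and `model_knn` in the same way.**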

# The End