Collaborative filtering looks for users who have given similar ratings to the same books. Once two users are linked in this way, books that one of them reviewed positively can be suggested to the other. The associations we look for are therefore between users, not between books.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0 (requires pandas >= 1.3)
books = pd.read_csv('BX-Books.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']
users = pd.read_csv('BX-Users.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']
ratings = pd.read_csv('Book_Ratings.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']


In [None]:
print(ratings.shape)
print(list(ratings.columns))

In [None]:
plt.rc("font", size=15)
ratings.bookRating.value_counts(sort=False).plot(kind='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('system1.png', bbox_inches='tight')
plt.show()

In [None]:
print(books.shape)
print(list(books.columns))
print(users.shape)
print(list(users.columns))

In [None]:
users.Age.hist(bins=[0, 10, 20, 30, 40, 50, 100])
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('system2.png', bbox_inches='tight')
plt.show()

## To ensure statistical significance, users with fewer than 200 ratings and books with fewer than 100 ratings are excluded.

In [None]:
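# Keep only users who have rated at least 200 books.
counts1 = ratings['userID'].value_counts()
ratings = ratings[ratings['userID'].isin(counts1[counts1 >= 200].index)]
# Keep only books (by ISBN) with at least 100 ratings.
# Counting 'bookRating' here would count rating values, not books.
counts = ratings['ISBN'].value_counts()
ratings = ratings[ratings['ISBN'].isin(counts[counts >= 100].index)]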
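
The thresholding idea can be checked on a small table. The sketch below uses made-up ratings and much smaller, hypothetical thresholds (`min_user_ratings`, `min_book_ratings`) so that the effect of each filter is easy to see; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy ratings table; column names mirror the notebook, values are made up.
toy_ratings = pd.DataFrame({
    'userID':     [1, 1, 1, 2, 2, 3],
    'ISBN':       ['a', 'b', 'c', 'a', 'b', 'a'],
    'bookRating': [5, 0, 8, 7, 0, 9],
})

min_user_ratings = 2   # hypothetical threshold (the notebook uses 200)
min_book_ratings = 2   # hypothetical threshold (the notebook uses 100)

# Keep users who rated at least `min_user_ratings` books.
user_counts = toy_ratings['userID'].value_counts()
active = toy_ratings[toy_ratings['userID'].isin(
    user_counts[user_counts >= min_user_ratings].index)]

# Keep books (ISBNs) that still have at least `min_book_ratings` ratings.
book_counts = active['ISBN'].value_counts()
filtered = active[active['ISBN'].isin(
    book_counts[book_counts >= min_book_ratings].index)]

print(filtered)
```

User 3 is dropped by the first filter and book `c` by the second, so four rows remain.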

## Collaborative Filtering Using k-Nearest Neighbors (kNN)

kNN is a machine learning algorithm that finds groups of similar users based on common book ratings, and makes predictions using the average rating of the top-k nearest neighbors. We first present the ratings in a matrix with one row for each item (book) and one column for each user. To get there, we start by merging the ratings with the book titles.
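
As a rough illustration of "average rating of the top-k nearest neighbors", the sketch below predicts a missing rating on a tiny made-up matrix laid out as described (rows are books, columns are users). The helper names `cosine_similarity_matrix` and `predict` are hypothetical, and this item-based averaging is one simple variant, not necessarily the exact prediction rule used elsewhere.

```python
import numpy as np

# Rows are books, columns are users; 0 means "not rated". Values are made up.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # book 0
    [4.0, 5.0, 1.0, 0.0],   # book 1
    [0.0, 1.0, 5.0, 4.0],   # book 2
    [5.0, 5.0, 0.0, 0.0],   # book 3
])

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / norms
    return unit @ unit.T

def predict(R, book, user, k=2):
    """Average the user's ratings on the k rated books most similar to `book`."""
    sims = cosine_similarity_matrix(R)[book]
    rated = np.flatnonzero(R[:, user] > 0)        # books this user has rated
    rated = rated[rated != book]                  # never use the target book itself
    top_k = rated[np.argsort(sims[rated])[::-1][:k]]
    return R[top_k, user].mean()

# User 3 has not rated book 3; estimate that rating from similar books.
print(predict(R, book=3, user=3, k=1))
print(predict(R, book=3, user=3, k=2))
```

With `k=1` the estimate comes only from book 0, the closest book this user has rated; with `k=2` it is the mean of the user's ratings for books 0 and 2.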

In [None]:
combine_book_rating = pd.merge(ratings, books, on='ISBN')  # merge the ratings and books datasets on "ISBN"
columns = ['yearOfPublication', 'publisher', 'bookAuthor', 'imageUrlS', 'imageUrlM', 'imageUrlL']  # columns not needed for recommendations
combine_book_rating = combine_book_rating.drop(columns, axis=1)
combine_book_rating.head()

We then group by book titles and create a new column for total rating count.

In [None]:
combine_book_rating = combine_book_rating.dropna(axis = 0, subset = ['bookTitle'])

book_ratingCount = (combine_book_rating.
     groupby(by = ['bookTitle'])['bookRating'].
     count().
     reset_index().
     rename(columns = {'bookRating': 'totalRatingCount'})
     [['bookTitle', 'totalRatingCount']]
    )
book_ratingCount.head()  # "book_ratingCount" is merged back into "combine_book_rating" in the next step



# totalRatingCount is the number of ratings each book has received.
# A book with only one, two or four ratings cannot be considered popular,
# so we apply a threshold: only rows for books whose total rating count
# reaches a chosen value (for example 50 or 100) are kept.

We combine the "combine_book_rating" data with the "book_ratingCount" data. This gives us exactly what we need to find out which books are popular and to filter out lesser-known books.

In [None]:
rating_with_totalRatingCount = combine_book_rating.merge(book_ratingCount, on='bookTitle', how='left')
rating_with_totalRatingCount.head()
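
To see what the count-and-merge step produces, here is a minimal sketch on made-up data. It also shows `groupby(...).transform('count')`, a standard pandas alternative that attaches the per-title count to every row in one step; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    'userID':     [1, 2, 3, 1, 2],
    'bookTitle':  ['A', 'A', 'A', 'B', 'C'],
    'bookRating': [5, 7, 0, 8, 9],
})

# Count ratings per title, then merge the counts back onto every row.
counts = (toy.groupby('bookTitle')['bookRating']
             .count()
             .reset_index()
             .rename(columns={'bookRating': 'totalRatingCount'}))
merged = toy.merge(counts, on='bookTitle', how='left')

# One-step alternative: broadcast each group's count to its rows.
toy['totalRatingCount'] = toy.groupby('bookTitle')['bookRating'].transform('count')

print(merged)
```

Both forms give every row of title `A` a count of 3 and the rows of `B` and `C` a count of 1.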

In [None]:
popularity_threshold = 50
rating_popular_book = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
rating_popular_book.head()

In [None]:
# Combine the popular-book ratings with the user data, then keep only users located in the USA or Canada.
combined = rating_popular_book.merge(users, on='userID', how='left')
us_canada_user_rating = combined[combined['Location'].str.contains("usa|canada", na=False)]  # keep rows whose Location contains "usa" or "canada"; na=False skips missing locations
us_canada_user_rating = us_canada_user_rating.drop('Age', axis=1)  # drop the Age column because recommendations are not based on age
us_canada_user_rating.head()
us_canada_user_rating.shape

## Implementing kNN
We convert our table to a 2D matrix and fill the missing values with zeros, since we will calculate distances between rating vectors. We then transform the values (ratings) of the matrix dataframe into a scipy sparse matrix for more efficient calculations.
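
A minimal sketch of this conversion on a toy table (made-up values, same column names as the notebook) shows the shape of the result and how few entries the sparse matrix actually stores:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format ratings: one row per (user, book) pair. Values are made up.
toy = pd.DataFrame({
    'userID':     [10, 11, 10, 12],
    'bookTitle':  ['A', 'A', 'B', 'C'],
    'bookRating': [8, 6, 9, 7],
})

# One row per title, one column per user; unrated cells become 0.
pivot = toy.pivot(index='bookTitle', columns='userID', values='bookRating').fillna(0)

# The CSR matrix stores only the non-zero entries.
matrix = csr_matrix(pivot.values)

print(pivot)
print(matrix.shape, matrix.nnz)
```

The 3 x 3 pivot holds nine cells, but only the four actual ratings are stored in the sparse matrix.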

## Finding the Nearest Neighbors
We use the unsupervised `NearestNeighbors` learner from sklearn.neighbors. The algorithm we use to compute the nearest neighbors is “brute”, and we specify “metric=cosine” so that the algorithm calculates the cosine distance (1 minus the cosine similarity) between rating vectors. Finally, we fit the model.
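
To make the metric concrete, the sketch below fits `NearestNeighbors` on a tiny made-up matrix and compares the distance it reports with a hand-computed value of 1 minus the cosine similarity. The matrix `X` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are books, columns are users; values are made-up ratings.
X = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(X)

# Query with the first row; the closest match is the row itself.
distances, indices = model.kneighbors(X[0].reshape(1, -1), n_neighbors=3)

# Recompute the distance between rows 0 and 1 by hand.
cos_sim = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
manual_distance = 1.0 - cos_sim

print(indices.flatten(), distances.flatten())
print(manual_distance)
```

Rows 0 and 1 are rated by the same users in similar proportions, so their distance is close to 0, while row 2 shares no raters with row 0 and sits at distance 1.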

In [None]:
from scipy.sparse import csr_matrix
us_canada_user_rating = us_canada_user_rating.drop_duplicates(['userID', 'bookTitle'])  # drop duplicate (userID, bookTitle) pairs

# Build the pivot table: book titles form the index, user IDs form the columns
# and the values are the book ratings. fillna(0) replaces missing ratings with 0.
us_canada_user_rating_pivot = us_canada_user_rating.pivot(index = 'bookTitle', columns = 'userID', values = 'bookRating').fillna(0)



us_canada_user_rating_matrix = csr_matrix(us_canada_user_rating_pivot.values)

In [None]:
us_canada_user_rating_pivot.head()

#us_canada_user_rating_matrix

In [None]:
from sklearn.neighbors import NearestNeighbors


model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(us_canada_user_rating_matrix)

In [None]:
query_index = np.random.choice(us_canada_user_rating_pivot.shape[0])
print(query_index)
distances, indices = model_knn.kneighbors(us_canada_user_rating_pivot.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 6)

us_canada_user_rating_pivot.iloc[query_index,:].values.reshape(1,-1)

In [None]:
us_canada_user_rating_pivot.index[query_index]

In [None]:
# The first neighbor is the query book itself, so it is used as the header line.
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(us_canada_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, us_canada_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))
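
The lookup above can be wrapped so that recommendations are requested by title rather than by a random index. The following is only a sketch: `recommend`, `toy_pivot` and `toy_model` are hypothetical names, and the toy data stands in for the real pivot table and fitted model.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend(title, pivot, model, n=5):
    """Return up to n (title, distance) pairs for the books closest to `title`."""
    row = pivot.loc[title].values.reshape(1, -1)
    # Ask for one extra neighbor because the query book is returned as well.
    n_neighbors = min(n + 1, len(pivot))
    distances, indices = model.kneighbors(row, n_neighbors=n_neighbors)
    pairs = zip(pivot.index[indices.flatten()], distances.flatten())
    return [(t, float(d)) for t, d in pairs if t != title][:n]

# Toy pivot table: titles as index, user IDs as columns. Values are made up.
toy_pivot = pd.DataFrame(
    [[5, 4, 0], [4, 5, 0], [0, 0, 3]],
    index=['Book A', 'Book B', 'Book C'],
    columns=[10, 11, 12],
    dtype=float,
)
toy_model = NearestNeighbors(metric='cosine', algorithm='brute').fit(toy_pivot.values)

recs = recommend('Book A', toy_pivot, toy_model, n=2)
print(recs)
```

Filtering on the title, instead of skipping position 0, avoids listing the query book when another book sits at the same distance.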