## Build a Book Recommender System

Recommender systems are used in all sorts of organizations to help users make decisions and, for many companies, to earn more revenue. In this project, we will build a book recommender system for Books’R’Us using Surprise, a Python library for building and evaluating recommender systems.

Books’R’Us is a national bookstore chain that sells books of all sorts to people all over the country. They recently built their website and now want to add a book recommender system to it.

We will prepare and train the recommender system on book reviews left on their site.
The reviews are loaded into a pandas DataFrame called **book_ratings**.

In [2]:
import pandas as pd
from surprise import Reader

book_ratings = pd.read_csv('goodreads_ratings.csv')
print(book_ratings.head())


                            user_id   book_id  \
0  d089c9b670c0b0b339353aebbace46a1   7686667   
1  6dcb2c16e12a41ae0c6c38e9d46f3292  18073066   
2  244e0ce681148a7586d7746676093ce9  13610986   
3  73fcc25ff29f8b73b3a7578aec846394  27274343   
4  f8880e158a163388a990b64fec7df300  11614718   

                          review_id  rating  \
0  3337e0e75701f7f682de11638ccdc60c       3   
1  7201aa3c1161f2bad81258b6d4686c16       5   
2  07a203f87bfe1b65ff58774667f6f80d       5   
3  8be2d87b07098c16f9742020ec459383       1   
4  a29c4ba03e33ad073a414ac775266c5f       4   

                                         review_text  \
0  Like Matched, this book felt like it was echoi...   
1  WOW again! 4,5 Stars \r\n So i wont forget to ...   
2  The second novel was hot & heavy. Not only in ...   
3  What a maddening waste of time. And I unfortun...   
4  4.5 stars! \r\n This was an awesome read! \r\n...   

                       date_added                    date_updated  \
0  Fri Apr 29 14

#### 1. Print dataset size and examine column data types

In [3]:
print(book_ratings.describe())
book_ratings.info()  # info() prints its report directly and returns None

            book_id       rating      n_votes   n_comments
count  3.500000e+03  3500.000000  3500.000000  3500.000000
mean   1.314011e+07     3.686000     3.038857     0.754286
std    9.143899e+06     1.251911    15.508018     3.474921
min    1.000000e+00     0.000000     0.000000     0.000000
25%    6.380978e+06     3.000000     0.000000     0.000000
50%    1.320679e+07     4.000000     0.000000     0.000000
75%    1.928702e+07     5.000000     1.000000     0.000000
max    3.589888e+07     5.000000   431.000000    77.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       3500 non-null   object
 1   book_id       3500 non-null   int64 
 2   review_id     3500 non-null   object
 3   rating        3500 non-null   int64 
 4   review_text   3500 non-null   object
 5   date_added    3500 non-null   object
 6   date_updated  3500 no

#### 2. To understand these ratings, let’s count how often each rating value occurs in the data.

In [4]:
print(book_ratings.rating.value_counts())

rating
4    1278
5    1001
3     707
2     269
1     125
0     120
Name: count, dtype: int64


#### 3. The counts show 120 ratings of 0, but ratings on the website only go from 1 to 5 inclusive. We filter out all ratings outside this range.

In [5]:
# Keep only ratings on the site's 1-5 scale (between() is inclusive at both ends)
book_ratings = book_ratings[book_ratings.rating.between(1, 5)]
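
The effect of this filter can be checked by hand from the rating counts printed in step 2. The sketch below uses plain Python and only the numbers shown in that output; the variable names are illustrative and not part of the notebook.

```python
# Rating counts copied from the value_counts() output in step 2.
rating_counts = {4: 1278, 5: 1001, 3: 707, 2: 269, 1: 125, 0: 120}

total_rows = sum(rating_counts.values())

# Keep only ratings on the site's 1-5 scale.
valid_counts = {r: n for r, n in rating_counts.items() if 1 <= r <= 5}
valid_rows = sum(valid_counts.values())

# Mean rating with and without the invalid zeros.
mean_all = sum(r * n for r, n in rating_counts.items()) / total_rows
mean_valid = sum(r * n for r, n in valid_counts.items()) / valid_rows

print(f"rows before filtering: {total_rows}")
print(f"rows after filtering:  {valid_rows}")
print(f"mean rating with zeros:    {mean_all:.3f}")
print(f"mean rating without zeros: {mean_valid:.3f}")
```

The mean with zeros matches the 3.686 reported by describe() in step 1, and dropping the 120 invalid rows leaves 3,380 ratings for training and testing.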

#### 4. We need to prepare this data for use in Surprise. First, we build a Surprise Reader object that uses the rating scale established above.

In [6]:
from surprise import Reader
reader = Reader(rating_scale=(1, 5))

#### 5. Load book_ratings into a Surprise Dataset so it can be used with Surprise’s algorithms.

In [7]:
from surprise import Dataset
rec_data = Dataset.load_from_df(book_ratings[['user_id',
                                              'book_id',
                                              'rating']],
                                reader)

#### 6. The dataset is now ready for use in Surprise. We split it into a training set (80% of the ratings) and a test set (20%), setting random_state to 7 so the split is reproducible.

In [8]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(rec_data, test_size=.2, random_state=7)
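
To make the role of random_state concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-cut split. It is not Surprise's implementation, and the rows it selects will differ from those picked by train_test_split; the function split_ratings and the toy triples are hypothetical names introduced only for illustration.

```python
import math
import random


def split_ratings(ratings, test_size=0.2, random_state=7):
    """Shuffle a copy of the ratings with a fixed seed, then cut off the test share."""
    shuffled = list(ratings)
    random.Random(random_state).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (user, book, rating) triples standing in for the real data.
toy_ratings = [(f"user{i % 5}", 100 + i, 1 + i % 5) for i in range(10)]

train_a, test_a = split_ratings(toy_ratings)
train_b, test_b = split_ratings(toy_ratings)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

Because the seed is fixed, repeated calls return the same partition, which is what makes the RMSE in later steps comparable between runs.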

#### 7. We can finally train a recommender system. We use Surprise’s KNNBasic algorithm to fit a collaborative filter on the training set.

In [9]:
from surprise import KNNBasic
knn = KNNBasic()
knn.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x1cffb503280>
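
The log line above shows that KNNBasic used the MSD (mean squared difference) similarity. As a rough illustration of the idea, the sketch below computes an MSD-based similarity between users and predicts a rating as a similarity-weighted average over neighbors. It assumes the documented KNNBasic defaults (user-based neighbors, k=40) and leaves out details such as minimum support; the toy users, books, and function names are hypothetical, and this is not Surprise's code.

```python
def msd_similarity(ratings_u, ratings_v):
    """Similarity from the mean squared difference over co-rated books: 1 / (msd + 1)."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    msd = sum((ratings_u[b] - ratings_v[b]) ** 2 for b in common) / len(common)
    return 1.0 / (msd + 1.0)


def predict_rating(target_user, book, all_ratings, k=40):
    """Similarity-weighted average over the k most similar users who rated the book."""
    neighbors = []
    for user, ratings in all_ratings.items():
        if user != target_user and book in ratings:
            sim = msd_similarity(all_ratings[target_user], ratings)
            if sim > 0:
                neighbors.append((sim, ratings[book]))
    top_k = sorted(neighbors, reverse=True)[:k]
    if not top_k:
        return None  # no usable neighbors, so no estimate can be formed
    return sum(s * r for s, r in top_k) / sum(s for s, _ in top_k)


# Hypothetical toy data: user -> {book: rating}.
toy = {
    "ann": {"dune": 5, "emma": 2},
    "bob": {"dune": 5, "emma": 2, "ubik": 4},
    "cat": {"dune": 1, "emma": 5, "ubik": 2},
}

estimate = predict_rating("ann", "ubik", toy)
print(round(estimate, 3))
```

In the toy data, bob rates exactly like ann, so his rating of 4 dominates the estimate, while cat's very different taste contributes only a small weight.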

#### 8. We calculate the RMSE of the recommender system on the test set.

In [10]:
from surprise import accuracy
predictions = knn.test(testset)
accuracy.rmse(predictions)

RMSE: 1.1105


1.110471008157185
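
RMSE is the square root of the average squared gap between true and predicted ratings, so a value of about 1.11 means predictions are typically off by roughly one star. The sketch below computes the same quantity by hand on a few made-up (true, estimated) pairs; the function and data are illustrative only.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (true rating, estimated rating) pairs.
toy_pairs = [(5, 4.0), (3, 4.0), (4, 4.0), (1, 3.0)]
toy_rmse = rmse(toy_pairs)
print(round(toy_rmse, 4))
```

Because errors are squared before averaging, the single two-star miss in the toy data contributes more to the result than the two one-star misses combined.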

#### 9. We can now predict ratings for specific user-book pairs.

User 8842281e1d1347389f2ab93d60773d4d gave the science-fiction book “The Three-Body Problem” (book_id=18245960) a 5.
What rating does the algorithm predict this user will give the science-fiction book “The Martian” (book_id=18007564)?

In [11]:
print(knn.predict('8842281e1d1347389f2ab93d60773d4d', '18007564').est)

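3.8250739644970415

Note that **book_id** is stored as int64, while the call above passes the string '18007564'. Surprise matches raw ids exactly, so the string is treated as an unknown book, and for unknown users or items predict falls back to the mean rating of the training set instead of a neighbor-based estimate. Passing the integer 18007564 and checking details['was_impossible'] on the returned prediction shows whether a collaborative-filtering estimate was actually produced.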
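
The printed estimate can be cross-checked against figures shown earlier in the notebook. The sketch below assumes the 80/20 split leaves exactly 2,704 training ratings (3,380 valid rows minus 676 held out) and tests whether the estimate is a plain average of that many integer ratings, which is what a fallback to the training-set mean would look like. The variable names are illustrative.

```python
n_rows = 3500          # rows in book_ratings before filtering (step 1)
n_zero = 120           # rows with rating 0, removed in step 3
n_valid = n_rows - n_zero
n_test = n_valid // 5  # 20% held out; 3380 divides evenly by 5
n_train = n_valid - n_test

printed_est = 3.8250739644970415  # value printed by the prediction cell

# If the estimate is the mean of n_train integer ratings,
# multiplying it by n_train must give a whole number.
implied_sum = printed_est * n_train
nearest_int = round(implied_sum)
is_plain_mean = abs(implied_sum - nearest_int) < 1e-6

print(n_train, nearest_int, is_plain_mean)
```

A similarity-weighted average would only rarely land on an exact multiple of 1/2704, so the check suggests the value is the training mean rather than a neighbor-based prediction.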


### Conclusion
We have built a working prototype of a book recommender system, with a test-set RMSE of about 1.11.
For further study, we can tune the hyperparameters of the collaborative filter, such as the number of neighbors and the similarity measure, and see whether the RMSE drops.