# Basic Collaborative Recommendation System

### Introduction

The purpose of this Jupyter Notebook is to illustrate a basic collaborative filtering recommendation system for books.  The data set is the Book-Crossing data set compiled by Cai-Nicolas Ziegler.  I follow roughly the same structure as Derrick Mwiti in his guide on *towardsdatascience.com*.  The ultimate purpose of designing this system is to recommend books to my girlfriend based on whether she likes her current book.

To start, there are some things that need to be said about the data set.  It was compiled in 2004, so all of the data is quite old.  This means the system can only recommend something if my girlfriend's current book is in the data set, which will only happen if it was published before 2004 or so.  Another issue is the sparsity of the data.  As we will see, about half of the books have only one rating.  This means a large portion of the data set will have to be thrown away.  There are two consequences to this.  First, it makes it less likely that my girlfriend's current book is in the data set, especially if she is reading a less popular book.  Thus, sometimes, the system will be unable to recommend anything.  Second, extreme sparsity will make the recommendations less accurate and reliable.  Despite these faults, let's see what we can make of it.  These challenges are discussed further at the end.

### Creating the Recommendation System

In [3]:
import numpy as np
import pandas as pd

We are about to import two tables, so let's prepare the column names.

In [4]:
column_names_books = ['isbn', 'title', 'author', 'year', 'publisher']
column_names_ratings = ['user_id', 'isbn', 'rating']

In [5]:
ratings = pd.read_csv('Ratings.csv', sep=';', names=column_names_ratings)
books = pd.read_csv('Books.csv', sep=';', names=column_names_books)


Let's take a quick look at the data sets.

In [6]:
ratings.head()

Unnamed: 0,user_id,isbn,rating
0,User-ID,ISBN,Rating
1,276725,034545104X,0
2,276726,0155061224,5
3,276727,0446520802,0
4,276729,052165615X,3


In [23]:
books.head()

Unnamed: 0,isbn,title,author,year,publisher
1,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
3,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
4,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
5,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


As we can see, the original header row was read in as a data entry, which is something we definitely do not want.  Let's drop that entry.

In [7]:
ratings = ratings.drop(0)
books = books.drop(0)
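
An alternative to dropping row 0 afterwards is to tell `read_csv` that the first line of the file is a header that should be replaced by our own names.  The sketch below is only an illustration: the three-line `raw` string is a made-up stand-in for `Ratings.csv`, and the `dtype` argument is an extra assumption to keep ISBNs and user ids as strings.

```python
import io
import pandas as pd

# Hypothetical stand-in for Ratings.csv: one header line plus two ratings.
raw = (
    '"User-ID";"ISBN";"Rating"\n'
    '"276725";"034545104X";"0"\n'
    '"276726";"0155061224";"5"\n'
)

column_names_ratings = ['user_id', 'isbn', 'rating']

# header=0 marks the first line as a header to be replaced by `names`,
# so it never appears as a data row and `rating` is parsed as an integer.
sample = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    names=column_names_ratings,
    header=0,
    dtype={'user_id': str, 'isbn': str},
)
print(sample)
print(sample.dtypes)
```

Read this way, the later `astype(int)` conversion of the rating column would also be unnecessary, because the header text no longer forces the column to be read as strings.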

As a precaution, let's make sure all columns have the appropriate type.  The only change that is essential for moving on is converting the rating column (in the ratings table) to an int.  The rest is solely a precaution.

In [8]:
books['isbn'] = books['isbn'].astype(str)
books['title'] = books['title'].astype(str)
books['author'] = books['author'].astype(str)
books['year'] = books['year'].astype(str)
books['publisher'] = books['publisher'].astype(str)
books.dtypes

isbn         object
title        object
author       object
year         object
publisher    object
dtype: object

In [9]:
ratings['user_id'] = ratings['user_id'].astype(str)
ratings['isbn'] = ratings['isbn'].astype(str)
ratings['rating'] = ratings['rating'].astype(int)
ratings.dtypes

user_id    object
isbn       object
rating      int32
dtype: object

It would be nice to merge the two tables together so that we can see the title and rating in the same entry without having to look them up by ISBN.  The ISBN is the key used to merge the tables.

In [10]:
books_and_ratings = pd.merge(books, ratings, on='isbn')

In [11]:
books_and_ratings.head(5)

Unnamed: 0,isbn,title,author,year,publisher,user_id,rating
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,2,0
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,8,5
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11400,0
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11676,8
4,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,41385,0


That looks like what we want.  The next thing we are going to need is a new table showing the average rating received by each book.  To accomplish this, we group the above table by title and then take the mean of each group.  The middle line just renames the column.

In [12]:
average_ratings = pd.DataFrame(books_and_ratings.groupby('title')['rating'].mean())
average_ratings.rename(columns={'rating':'average_rating'}, inplace=True)
average_ratings.head()

Unnamed: 0_level_0,average_rating
title,Unnamed: 1_level_1
"A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)",2.25
Always Have Popsicles,0.0
Apple Magic (The Collector's series),0.0
"Ask Lily (Young Women of Faith: Lily Series, Book 5)",8.0
Beyond IBM: Leadership Marketing and Finance for the 1990s,0.0


It's also important to keep track of the number of ratings each book received.  As stated in the introduction, this data set is very sparse, so we expect a large fraction of books to have only one rating.  To see this, we add another column to the table just created.

In [13]:
average_ratings['number_of_ratings'] = books_and_ratings.groupby('title')['rating'].count()
average_ratings.head()

Unnamed: 0_level_0,average_rating,number_of_ratings
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)",2.25,4
Always Have Popsicles,0.0,1
Apple Magic (The Collector's series),0.0,1
"Ask Lily (Young Women of Faith: Lily Series, Book 5)",8.0,1
Beyond IBM: Leadership Marketing and Finance for the 1990s,0.0,1
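
The two steps above (mean, then count) can also be written as a single grouped aggregation.  This is a minimal sketch on a made-up table; `toy` and its titles are hypothetical stand-ins for `books_and_ratings`.

```python
import pandas as pd

# Made-up stand-in for books_and_ratings: three ratings for A, one for B.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B'],
    'rating': [0, 5, 10, 8],
})

# Named aggregation builds both columns in one pass over the groups.
summary = toy.groupby('title')['rating'].agg(
    average_rating='mean',
    number_of_ratings='count',
)
print(summary)
```

The result has the same shape as `average_ratings` above: one row per title, with the mean and the count side by side.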


Now that the table is set up, we can use the value_counts() method to see how dire the situation is.

In [14]:
average_ratings['number_of_ratings'].value_counts()

1      127524
2       42734
3       20255
4       11364
5        7531
        ...  
828         1
443         1
314         1
313         1
569         1
Name: number_of_ratings, Length: 377, dtype: int64
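
The same column can also answer how many titles would survive a given cutoff.  A minimal sketch, using made-up counts in place of `average_ratings['number_of_ratings']`; the thresholds tried here are arbitrary examples.

```python
import pandas as pd

# Made-up rating counts for eight titles (stand-in for the real column).
number_of_ratings = pd.Series([1, 1, 1, 1, 2, 3, 25, 40])

# Fraction of titles that have exactly one rating.
share_single = (number_of_ratings == 1).mean()
print(f"share with a single rating: {share_single:.2f}")

# Number of titles that remain for a few candidate thresholds.
kept_by_threshold = {
    threshold: int((number_of_ratings >= threshold).sum())
    for threshold in (2, 5, 25)
}
print(kept_by_threshold)
```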

As we can see, a whopping 127,524 books have only 1 rating.  That is about half of the data set.  We should not include these books in our recommender system.  In fact, we shouldn't include books with very few ratings at all.  The question is where to draw the line: we want our data set to be large enough, but we also want our recommendations to be good.  After testing some values, I settled on 25 ratings as the threshold.  This cuts our data set down to just under 6,000 books!  But our recommendations will be better, and these 6,000 books are the ones my girlfriend is more likely to read anyway.  
Before we trim the data set, let's merge everything into the same table.

In [15]:
df = pd.merge(books_and_ratings, average_ratings, on='title')
df.head()

Unnamed: 0,isbn,title,author,year,publisher,user_id,rating,average_rating,number_of_ratings
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,2,0,3.5,2
1,801319536,Classical Mythology,Mark P. O. Morford,1998,John Wiley & Sons,269782,7,3.5,2
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,8,5,4.928571,14
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11400,0,4.928571,14
4,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11676,8,4.928571,14


In the cell below, the entries for books with fewer than 25 ratings are found and then removed from the table.

In [16]:
index_names = df[df['number_of_ratings'] < 25].index
df.drop(index_names, inplace=True)
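
The same trim can be expressed as a boolean mask, which keeps the rows we want instead of dropping the ones we don't.  A minimal sketch on a made-up table; `toy` and `min_ratings` are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up stand-in for df: title A has 30 ratings, title B has only 2.
toy = pd.DataFrame({
    'title': ['A', 'A', 'B'],
    'rating': [7, 0, 9],
    'number_of_ratings': [30, 30, 2],
})

min_ratings = 25

# Keep only the rows whose book meets the threshold.
kept = toy[toy['number_of_ratings'] >= min_ratings]
print(kept)
```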

In [17]:
df

Unnamed: 0,isbn,title,author,year,publisher,user_id,rating,average_rating,number_of_ratings
31,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,8,0,2.996785,311
32,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,11676,9,2.996785,311
33,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,29526,9,2.996785,311
34,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,36836,0,2.996785,311
35,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,46398,9,2.996785,311
...,...,...,...,...,...,...,...,...,...
817213,0310205719,The Purpose-Driven Life: What on Earth Am I He...,Rick Warren,2002,Zondervan Publishing Company,256407,9,3.594937,79
817214,0310205719,The Purpose-Driven Life: What on Earth Am I He...,Rick Warren,2002,Zondervan Publishing Company,257700,9,3.594937,79
817215,0310205719,The Purpose-Driven Life: What on Earth Am I He...,Rick Warren,2002,Zondervan Publishing Company,270006,0,3.594937,79
817216,0310205719,The Purpose-Driven Life: What on Earth Am I He...,Rick Warren,2002,Zondervan Publishing Company,272810,8,3.594937,79


In [18]:
len(df.title.unique())

5854

So we have 5,854 books and roughly 400,000 total reviews for those books.  
  
Now, we create a pivot table, which will be a very large sparse matrix.  Each row is a user, each column is a book, and each entry is the rating that user gave that book.  This is a very large matrix indeed, as it has a cell for every user and every book.

In [19]:
rating_matrix = df.pivot_table(index='user_id', columns='title', values='rating')
rating_matrix.head()

title,'Salem's Lot,10 Lb. Penalty,101 Dalmatians,"14,000 Things to Be Happy About",16 Lighthouse Road,1984,1st to Die: A Novel,2001: A Space Odyssey,2010: Odyssey Two,204 Rosewood Lane,...,Zodiac: The Eco-Thriller,Zombies of the Gene Pool,Zoya,ZwÃ?Â¶lf.,"\"" Lamb to the Slaughter and Other Stories (Penguin 60s S.)","\""O\"" Is for Outlaw","\""Surely You're Joking, Mr. Feynman!\"": Adventures of a Curious Character",e,iI Paradiso Degli Orchi,stardust
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10,,,,,,,,,,,...,,,,,,,,,,
100001,,,,,,,,,,,...,,,,,,,,,,
100002,,,,,,,,,,,...,,,,,,,,,,
100004,,,,,,,,,,,...,,,,,,,,,,
100009,,,,,,,,,,,...,,,,,,,,,,


The sparseness can be seen from the above.  Almost all entries are NaN, meaning that particular user has not rated that particular book.  They are not all NaN though.  As we saw above, we have about 400,000 actual ratings, and yet almost all entries are still NaN.  
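
To put a number on the sparseness, one can divide the count of filled cells by the total number of cells.  A minimal sketch on a made-up 4 x 3 matrix; `toy_matrix` is a hypothetical stand-in for `rating_matrix`.

```python
import numpy as np
import pandas as pd

# Made-up user-by-book matrix: 4 users, 3 books, 4 actual ratings.
toy_matrix = pd.DataFrame(
    [[5, np.nan, np.nan],
     [np.nan, np.nan, 8],
     [np.nan, 3, np.nan],
     [0, np.nan, np.nan]],
    index=['u1', 'u2', 'u3', 'u4'],
    columns=['A', 'B', 'C'],
)

filled = int(toy_matrix.notna().sum().sum())  # cells holding a rating
density = filled / toy_matrix.size            # size = rows * columns
print(f"{filled} of {toy_matrix.size} cells filled ({density:.1%})")
```

Applied to `rating_matrix`, the same two lines would show how small the filled fraction is.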
  
Out of curiosity, let us see what the most popular books were in this data set.

In [20]:
average_ratings.sort_values('number_of_ratings', ascending=False).head(10)

Unnamed: 0_level_0,average_rating,number_of_ratings
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Wild Animus,1.019584,2502
The Lovely Bones: A Novel,4.468726,1295
The Da Vinci Code,4.642539,898
A Painted House,3.231504,838
The Nanny Diaries: A Novel,3.530193,828
Bridget Jones's Diary,3.527607,815
The Secret Life of Bees,4.447028,774
Divine Secrets of the Ya-Ya Sisterhood: A Novel,3.437838,740
The Red Tent (Bestselling Backlist),4.334716,723
Angels & Demons,3.708955,670


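The ratings run from 0 to 10, and in the Book-Crossing data a 0 marks an implicit rating (the user logged the book without scoring it) rather than a strong dislike.  The low averages above therefore say more about how many zeros each popular book collected than about how much readers enjoyed it.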
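
Because Book-Crossing uses 0 for implicit ratings, it can help to compare the average over all rows with the average over explicit ratings (1 to 10) only.  A minimal sketch on made-up numbers; `toy` is a hypothetical stand-in for `books_and_ratings`, and whether to exclude zeros in the real system is a design choice this notebook does not make.

```python
import pandas as pd

# Made-up ratings: title A has three implicit zeros and one explicit 8.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'rating': [0, 0, 0, 8, 6, 0],
})

# Average over every row, zeros included (what the notebook computes).
all_avg = toy.groupby('title')['rating'].mean()

# Average over explicit ratings only.
explicit_avg = toy[toy['rating'] > 0].groupby('title')['rating'].mean()

print(pd.DataFrame({'all_rows': all_avg, 'explicit_only': explicit_avg}))
```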

### Testing the Recommendation System

At this point, the recommendation system is ready for use.  Let's say that my girlfriend just finished *The Lovely Bones* and wants to read something similar to it.  First we isolate the column corresponding to *The Lovely Bones* from our large matrix.

In [21]:
lovely_bones = rating_matrix['The Lovely Bones: A Novel']
lovely_bones.head()

user_id
10       NaN
100001   NaN
100002   NaN
100004   NaN
100009   NaN
Name: The Lovely Bones: A Novel, dtype: float64

Now we need to find the correlation between this series and every column of the ratings matrix.  This will yield a number between -1 and 1 for each pairing of *The Lovely Bones* with every other book.  A number close to 1 means that users who rated both books rated them similarly.  A number close to -1 indicates that users who rated both books rated them oppositely.  If my girlfriend loves *The Lovely Bones*, then I should recommend books with correlation close to 1.  If she didn't like *The Lovely Bones*, I should recommend books with correlation close to -1.
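
To make the sign of the coefficient concrete, here is a minimal sketch on a made-up four-user matrix.  The titles `Target`, `Agrees`, `Disagrees`, and `Barely shared` are hypothetical; `corrwith` compares each column with the target using only the users who rated both.

```python
import numpy as np
import pandas as pd

# Made-up ratings; NaN means the user did not rate that book.
toy_matrix = pd.DataFrame({
    'Target':        [8, 6, 2, np.nan],
    'Agrees':        [9, 7, 3, 5],            # moves with Target
    'Disagrees':     [1, 4, 9, np.nan],       # moves against Target
    'Barely shared': [10, np.nan, 4, 7],      # only two users in common
}, index=['u1', 'u2', 'u3', 'u4'])

target = toy_matrix['Target']

# Pearson correlation of every column with the target column.
corr = toy_matrix.corrwith(target)
print(corr.sort_values(ascending=False))
```

Note that `Barely shared` also comes out at a perfect correlation, simply because any two points with different values lie on a straight line.  A coefficient built on very few shared raters carries little information.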

In [22]:
similar_to_lovely_bones = rating_matrix.corrwith(lovely_bones).sort_values(ascending=False)
similar_to_lovely_bones.head(20)


title
Jailbird                                                                         1.0
Thai Horse                                                                       1.0
Eat Mangoes Naked : Finding Pleasure Everywhere (and dancing with the Pits)      1.0
The Scions of Shannara (Heritage of Shannara (Paperback))                        1.0
Suspicion of Malice                                                              1.0
Stupid White Men. Eine Abrechnung mit dem Amerika unter George W. Bush           1.0
The Illuminatus Trilogy: The Eye in the Pyramid, the Golden Apple & Leviathan    1.0
8 Weeks to Optimum Health                                                        1.0
Memorias de una geisha                                                           1.0
Impetuous                                                                        1.0
Startide Rising (The Uplift Saga, Book 2)                                        1.0
Anna Karenina                                              

All of the above books, according to this basic recommender system, would be a great recommendation for fans of *The Lovely Bones*.  A perfect correlation of 1.0 most likely means that only a handful of users rated both books, so these top results should be read with some caution.  If we included more values, the correlations would start to drop off.  So that is the strategy.  Once a book is read, the above lines need to be re-run to find the correlation coefficients for that book.  Once the correlation coefficients are found, a recommendation can be made.
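
The re-run step can be wrapped in a small function.  This is only a sketch under stated assumptions: `recommend`, `min_overlap`, and `top_n` are hypothetical names, and requiring a minimum number of shared raters is an addition of mine, not part of the system above.

```python
import numpy as np
import pandas as pd

def recommend(rating_matrix, title, min_overlap=3, top_n=5):
    """Hypothetical helper: rank other titles by correlation with `title`."""
    target = rating_matrix[title]
    corr = rating_matrix.corrwith(target)
    # For each title, count the users who rated both it and the target.
    shared = rating_matrix.loc[target.notna()].count()
    result = pd.DataFrame({'correlation': corr, 'shared_raters': shared})
    result = result.drop(index=title).dropna()
    # Discard correlations that rest on too few shared raters.
    result = result[result['shared_raters'] >= min_overlap]
    return result.sort_values('correlation', ascending=False).head(top_n)

# Made-up ratings; NaN means the user did not rate that book.
toy_matrix = pd.DataFrame({
    'Target':        [8, 6, 2, np.nan],
    'Agrees':        [9, 7, 3, 5],
    'Disagrees':     [1, 4, 9, np.nan],
    'Barely shared': [10, np.nan, 4, 7],
}, index=['u1', 'u2', 'u3', 'u4'])

recs = recommend(toy_matrix, 'Target', min_overlap=3)
print(recs)
```

With the real data, the call would look like `recommend(rating_matrix, 'The Lovely Bones: A Novel')`, and `min_overlap` would need tuning in the same way the 25-rating threshold did.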

### Conclusion

This was just the start of my exploration of recommender systems.  It was a good start, but the system is ultimately too restricted and primitive to be very useful.  
  
Currently, I am developing a way to collect a lot of data on more recent books.  Once I have an up-to-date data set, this strategy will work much better in practice.  I am also looking at more sophisticated algorithms, such as matrix factorization, to increase the usefulness of this recommender system.