## Popularity Based and Collaborative Filtering Based Recommendation System

Mounting the drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Importing the Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
books = pd.read_csv("/content/drive/MyDrive/ML_Soham/Books.csv")
users = pd.read_csv("/content/drive/MyDrive/ML_Soham/Users.csv")
ratings = pd.read_csv("/content/drive/MyDrive/ML_Soham/Ratings.csv")

### Pre-processing

In [None]:
df = books.merge(ratings, on = "ISBN")

In [None]:
users.shape

(278858, 3)

In [None]:
users['User-ID'].unique()

array([     1,      2,      3, ..., 278856, 278857, 278858])

In [None]:
df

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Book-Rating
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,2,0
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,8,5
2,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11400,0
3,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11676,8
4,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,41385,0
...,...,...,...,...,...,...,...,...,...,...
1031131,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,276463,7
1031132,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,276579,4
1031133,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,276680,0
1031134,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,276680,0


In [None]:
df_dropped = df.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis =1)

In [None]:
df_dropped

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,2,0
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,8,5
2,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11400,0
3,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11676,8
4,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,41385,0
...,...,...,...,...,...,...,...
1031131,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),276463,7
1031132,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,276579,4
1031133,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,276680,0
1031134,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,276680,0


Above, I have dropped the unnecessary columns, namely the three image-URL columns for each book. The 'df_dropped' dataframe is therefore the inner join of 'books' and 'ratings' on the 'ISBN' column, without those image columns.
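To make the join behaviour concrete, here is a minimal sketch on toy data. The frames `books_toy` and `ratings_toy` and their contents are hypothetical stand-ins for the real tables, not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for the real `books` and `ratings` tables (hypothetical data).
books_toy = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "Book-Title": ["Book A", "Book B", "Book C"],
    "Image-URL-S": ["a.jpg", "b.jpg", "c.jpg"],
})
ratings_toy = pd.DataFrame({
    "User-ID": [1, 2, 2, 3],
    "ISBN": ["111", "111", "222", "999"],  # "999" has no matching book
    "Book-Rating": [8, 0, 5, 7],
})

# merge() defaults to how="inner": only ISBNs present in both tables survive.
merged_toy = books_toy.merge(ratings_toy, on="ISBN")

# Drop the column that the recommender does not need.
merged_toy = merged_toy.drop(columns=["Image-URL-S"])

print(merged_toy)
```

Book C (never rated) and the rating for the unknown ISBN "999" both disappear, which is what an inner join implies for the real data as well.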

In [None]:
num_rating = df_dropped.groupby("Book-Title").count()['Book-Rating'].reset_index()

In [None]:
num_rating

Unnamed: 0,Book-Title,Book-Rating
0,A Light in the Storm: The Civil War Diary of ...,4
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1
4,Beyond IBM: Leadership Marketing and Finance ...,1
...,...,...
241066,Ã?Â?lpiraten.,2
241067,Ã?Â?rger mit Produkt X. Roman.,4
241068,Ã?Â?sterlich leben.,1
241069,Ã?Â?stlich der Berge.,3


In [None]:
num_rating.rename(columns = {'Book-Rating': 'No. of ratings'}, inplace = True)

In [None]:
num_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241071 entries, 0 to 241070
Data columns (total 2 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Book-Title      241071 non-null  object
 1   No. of ratings  241071 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.7+ MB


The dataframe 'num_rating' above is built by grouping the 'df_dropped' dataframe on the 'Book-Title' column. The number of ratings per book title is then counted and stored in the second column, which is renamed to 'No. of ratings'.

### Modelling: Popularity Based Recommender System

As in the previous section, we now calculate the average rating per book and store it in a new dataframe 'avg_rating'.

In [None]:
avg_rating = df_dropped.groupby('Book-Title')['Book-Rating'].mean().reset_index()
avg_rating.rename(columns = {'Book-Rating':'avg_rating'}, inplace = True)
avg_rating

Unnamed: 0,Book-Title,avg_rating
0,A Light in the Storm: The Civil War Diary of ...,2.250000
1,Always Have Popsicles,0.000000
2,Apple Magic (The Collector's series),0.000000
3,"Ask Lily (Young Women of Faith: Lily Series, ...",8.000000
4,Beyond IBM: Leadership Marketing and Finance ...,0.000000
...,...,...
241066,Ã?Â?lpiraten.,0.000000
241067,Ã?Â?rger mit Produkt X. Roman.,5.250000
241068,Ã?Â?sterlich leben.,7.000000
241069,Ã?Â?stlich der Berge.,2.666667


We will use the 50 highest-rated books among those that have more than 200 ratings.

In [None]:
popular_books = num_rating.merge(avg_rating, on = 'Book-Title')
popular_books

Unnamed: 0,Book-Title,No. of ratings,avg_rating
0,A Light in the Storm: The Civil War Diary of ...,4,2.250000
1,Always Have Popsicles,1,0.000000
2,Apple Magic (The Collector's series),1,0.000000
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1,8.000000
4,Beyond IBM: Leadership Marketing and Finance ...,1,0.000000
...,...,...,...
241066,Ã?Â?lpiraten.,2,0.000000
241067,Ã?Â?rger mit Produkt X. Roman.,4,5.250000
241068,Ã?Â?sterlich leben.,1,7.000000
241069,Ã?Â?stlich der Berge.,3,2.666667


In [None]:
popular_books = popular_books[popular_books['No. of ratings']>200].sort_values('avg_rating', ascending = False).head(50)

In [None]:
popular_books

Unnamed: 0,Book-Title,No. of ratings,avg_rating
80434,Harry Potter and the Prisoner of Azkaban (Book 3),428,5.852804
80422,Harry Potter and the Goblet of Fire (Book 4),387,5.824289
80441,Harry Potter and the Sorcerer's Stone (Book 1),278,5.73741
80426,Harry Potter and the Order of the Phoenix (Boo...,347,5.501441
60582,Ender's Game (Ender Wiggins Saga (Paperback)),249,5.409639
80414,Harry Potter and the Chamber of Secrets (Book 2),556,5.183453
191612,The Hobbit : The Enchanting Prelude to The Lor...,281,5.007117
187377,The Fellowship of the Ring (The Lord of the Ri...,368,4.94837
80445,Harry Potter and the Sorcerer's Stone (Harry P...,575,4.895652
211384,"The Two Towers (The Lord of the Rings, Part 2)",260,4.880769


The 'popular_books' dataframe holds the 50 books with the highest average rating among those that have more than 200 ratings. It is created by merging the 'num_rating' dataframe with the 'avg_rating' dataframe on the 'Book-Title' column, then filtering on the rating count and sorting by average rating.
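The count, mean, filter and sort steps can also be sketched in a single pass. This is only an illustration on toy data: the column name `num_ratings`, the constants `MIN_RATINGS` and `TOP_N`, and the ratings themselves are assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical ratings; the real notebook uses `df_dropped`.
toy = pd.DataFrame({
    "Book-Title": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [10, 8, 0, 9, 9, 10],
})

# Count and mean per title in a single groupby, using named aggregation.
stats = (
    toy.groupby("Book-Title")["Book-Rating"]
       .agg(num_ratings="count", avg_rating="mean")
       .reset_index()
)

MIN_RATINGS = 1  # the notebook uses 200; lowered here to fit the toy data
TOP_N = 2        # the notebook uses 50

popular_toy = (
    stats[stats["num_ratings"] > MIN_RATINGS]
    .sort_values("avg_rating", ascending=False)
    .head(TOP_N)
)
print(popular_toy)
```

Book C has the highest average but only one rating, so the count threshold removes it. That is the point of the filter: a single enthusiastic rating should not outrank widely rated books.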

### Collaborative Filtering Recommender System

For collaborative filtering, we will use item-item collaborative filtering. Furthermore, the utility matrix will be built using only users who have rated more than 200 books, on the assumption that such prolific raters give more credible ratings.

In [None]:
df_credible_users = pd.DataFrame(df_dropped.groupby('User-ID').count()['Book-Rating'])

In [None]:
df_credible_userid = df_credible_users[df_credible_users['Book-Rating']>200]

This gives the final list of 811 users who have rated more than 200 books. Only their ratings are taken into consideration when constructing the utility matrix of the books and their corresponding rating vectors.

In [None]:
df_credible_userid.index

Int64Index([   254,   2276,   2766,   2977,   3363,   4017,   4385,   6251,
              6323,   6543,
            ...
            271705, 273979, 274004, 274061, 274301, 274308, 275970, 277427,
            277639, 278418],
           dtype='int64', name='User-ID', length=811)

In [None]:
filtered_rating = df_dropped[df_dropped['User-ID'].isin(df_credible_userid.index)]

In [None]:
filtered_rating

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating
3,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11676,8
6,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,85526,0
7,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,96054,0
10,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,177458,0
21,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,110912,10
...,...,...,...,...,...,...,...
1031124,0231128444,Slow Food(The Case For Taste),Carlo Petrini,2003,Columbia University Press,275970,0
1031125,0520242335,Strong Democracy : Participatory Politics for ...,Benjamin R. Barber,2004,University of California Press,275970,0
1031126,0762412119,"Burpee Gardening Cyclopedia: A Concise, Up to ...",Allan Armitage,2002,Running Press Book Publishers,275970,0
1031127,1582380805,Tropical Rainforests: 230 Species in Full Colo...,"Allen M., Ph.D. Young",2001,Golden Guides from St. Martin's Press,275970,0


The 'filtered_rating' dataframe contains only the ratings given by users who have rated more than 200 books.
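The same user filter can be expressed without an intermediate dataframe by broadcasting each user's rating count back onto the rows. The sketch below uses hypothetical toy data and a hypothetical constant `MIN_USER_RATINGS`; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 3, 3],
    "Book-Title": ["A", "B", "C", "A", "B", "C"],
    "Book-Rating": [5, 0, 9, 7, 0, 8],
})

MIN_USER_RATINGS = 2  # the notebook's threshold is 200

# Number of ratings per user, broadcast back to every row of that user.
ratings_per_user = toy.groupby("User-ID")["Book-Rating"].transform("count")

# Keep only rows belonging to users above the threshold.
filtered_toy = toy[ratings_per_user > MIN_USER_RATINGS]
print(filtered_toy)
```

Only user 1, with three ratings, passes the strict `>` comparison; user 3 has exactly two and is dropped, mirroring the `> 200` condition used above.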

In [None]:
y = filtered_rating.groupby('Book-Title').count()['Book-Rating']>=50

Apart from filtering the credible users, we also filter for popular books. Each book considered for the recommendation system must have at least 50 ratings from the credible users. The boolean Series `y` above marks these books, and their titles are stored in 'famous_books'.

In [None]:
famous_books = y[y].index

In [None]:
final_ratings = filtered_rating[filtered_rating['Book-Title'].isin(famous_books)]

In [None]:
final_ratings

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating
31,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,11676,9
33,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,36836,0
34,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,46398,9
38,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,113270,0
39,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,113519,0
...,...,...,...,...,...,...,...
1028414,1878702831,Echoes,Nancy Morse,1992,Meteor Publishing Corporation,238781,0
1028600,0394429869,I Know Why the Caged Bird Sings,Maya Angelou,1996,Random House,239594,8
1028602,0449001164,The Promise,CHAIM POTOK,1997,Ballantine Books,239594,7
1028815,0743527631,The Pillars of the Earth,Ken Follett,2002,Encore,240144,0


The 'final_ratings' dataframe contains only books with at least 50 ratings and only users who have rated more than 200 books.
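The `y[y].index` idiom used for the book filter may be unfamiliar, so here is a small sketch of it on hypothetical data. The names `enough`, `famous_toy`, `final_toy` and `MIN_BOOK_RATINGS` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Book-Title": ["A", "A", "A", "B", "C", "C"],
    "Book-Rating": [5, 0, 9, 7, 0, 8],
})

MIN_BOOK_RATINGS = 2  # the notebook's threshold is 50

# Boolean Series indexed by title: True where the book has enough ratings.
enough = toy.groupby("Book-Title")["Book-Rating"].count() >= MIN_BOOK_RATINGS

# Indexing a boolean Series by itself keeps only the True entries,
# so `.index` yields the qualifying titles.
famous_toy = enough[enough].index

# Keep only the rating rows that belong to those titles.
final_toy = toy[toy["Book-Title"].isin(famous_toy)]
print(list(famous_toy))
```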

In [None]:
pt = final_ratings.pivot_table(index = 'Book-Title', columns = 'User-ID', values = 'Book-Rating')

In [None]:
pt.fillna(0, inplace = True)

In [None]:
pt

User-ID,254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
Book-Title,,,,,,,,,,,,,,,,,,,,,
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Here `pt` is our utility matrix. Each row is one book's vector of ratings across the credible users, with 0 wherever a user has not rated that book.
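A toy version of the pivot step shows how the long ratings table becomes a book-by-user matrix. The data here is hypothetical. One point worth knowing is that `pivot_table` aggregates with the mean by default, so if the same user has several ratings for one title they are averaged into a single cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "Book-Title": ["A", "A", "B", "C"],
    "User-ID": [1, 2, 1, 2],
    "Book-Rating": [9, 7, 5, 8],
})

# Rows are books, columns are users; missing (book, user) pairs become NaN.
pt_toy = toy.pivot_table(index="Book-Title", columns="User-ID", values="Book-Rating")

# Treat "not rated" as 0, as the notebook does.
pt_toy = pt_toy.fillna(0)
print(pt_toy)
```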

In [None]:
mn = pt.mean(axis = 1)  
mn

Book-Title
1984                                                                 0.262140
1st to Die: A Novel                                                  0.425309
2nd Chance                                                           0.336420
4 Blondes                                                            0.085185
A Bend in the Road                                                   0.220370
                                                                       ...   
Year of Wonders                                                      0.216049
You Belong To Me                                                     0.104938
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values    0.146914
Zoya                                                                 0.079012
"O" Is for Outlaw                                                    0.240741
Length: 706, dtype: float64

In [None]:
centred_pt = pt.apply(lambda col: col - mn, axis = 0)
centred_pt.mean(axis = 1)

Book-Title
1984                                                                -1.685894e-17
1st to Die: A Novel                                                  5.605941e-17
2nd Chance                                                          -3.225129e-16
4 Blondes                                                           -1.275728e-16
A Bend in the Road                                                   2.492519e-16
                                                                         ...     
Year of Wonders                                                     -4.708853e-16
You Belong To Me                                                     7.545062e-16
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values    2.312279e-16
Zoya                                                                 2.104284e-16
"O" Is for Outlaw                                                    1.356254e-16
Length: 706, dtype: float64

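We could also use `centred_pt` to calculate the similarity scores between the books. In this dataframe each book's row vector has mean 0, up to floating-point error. Cosine similarity computed on mean-centred rows equals the Pearson correlation between the two books' rating vectors. The rest of this notebook, however, computes the similarities from the uncentred `pt`.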
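The effect of mean-centring on the similarity can be checked numerically. The sketch below uses two hypothetical rating vectors and a small helper `cosine` written for this example. It compares plain cosine similarity, cosine similarity after subtracting each row's mean, and the Pearson correlation from `np.corrcoef`.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two hypothetical book rows (ratings from five users, 0 = not rated).
book_a = np.array([9.0, 0.0, 7.0, 0.0, 8.0])
book_b = np.array([8.0, 0.0, 0.0, 6.0, 9.0])

raw_sim = cosine(book_a, book_b)

# Subtract each row's own mean before comparing.
centred_sim = cosine(book_a - book_a.mean(), book_b - book_b.mean())

# Pearson correlation computed independently for comparison.
pearson = float(np.corrcoef(book_a, book_b)[0, 1])

print(raw_sim, centred_sim, pearson)
```

The centred cosine matches the Pearson value, while the raw cosine is noticeably higher. With non-negative ratings the raw score can never fall below 0, whereas the centred score can become negative for books rated in opposite ways.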

In [None]:
similarity_scores = cosine_similarity(pt)

In [None]:
similarity_scores.shape

(706, 706)

In [None]:
sorted(list(enumerate(similarity_scores[0])),key = lambda x:x[1], reverse = True)[0:5]

[(0, 0.9999999999999999),
 (47, 0.2702651417103732),
 (545, 0.2639619371123496),
 (82, 0.2366937434740099),
 (634, 0.23299389358170397)]

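Here we select the 5 highest-scoring books for the user's choice and recommend them to the user. As the output above shows, the top match is the chosen book itself (similarity 1), so the list consists of the chosen book followed by the 4 most similar other books.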

In [None]:
def recommend(book_name):
  # Row position of the chosen book in the utility matrix
  index = np.where(pt.index == book_name)[0][0]
  # Top 5 (position, score) pairs by similarity; the first is the book itself
  similar_items = sorted(list(enumerate(similarity_scores[index])),key = lambda x:x[1], reverse = True)[0:5]

  for i in similar_items:
    # Print each title together with its author(s)
    print(pt.index[i[0]]," ", final_ratings[final_ratings['Book-Title'] == pt.index[i[0]]]['Book-Author'].drop_duplicates().values, "\n")

In [None]:
recommend('1984')

1984   ['George Orwell'] 

Animal Farm   ['George Orwell'] 

The Handmaid's Tale   ['Margaret Atwood'] 

Brave New World   ['Aldous Huxley'] 

The Vampire Lestat (Vampire Chronicles, Book II)   ['ANNE RICE'] 



Here we take a book title as user input and return the 5 top-scoring books, which are the chosen book itself followed by the 4 books most similar to it. This concludes our task.
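As a possible variation, the lookup could return a list of titles, skip the queried book, and handle unknown titles. The sketch below is one way to do that under stated assumptions: `recommend_titles`, `pt_toy` and `sim_toy` are hypothetical names, the matrix is toy data, and the cosine similarity is computed directly with NumPy rather than with scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical utility matrix: rows are books, columns are users.
pt_toy = pd.DataFrame(
    [[9, 0, 8, 0],
     [8, 0, 9, 0],
     [0, 7, 0, 6],
     [9, 1, 0, 0]],
    index=["A", "B", "C", "D"],
    columns=[1, 2, 3, 4],
    dtype=float,
)

# Cosine similarity between all pairs of rows, computed with NumPy.
values = pt_toy.to_numpy()
unit_rows = values / np.linalg.norm(values, axis=1, keepdims=True)
sim_toy = unit_rows @ unit_rows.T

def recommend_titles(book_name, pt, sim, n=5):
    """Return up to n titles most similar to book_name, excluding itself."""
    if book_name not in pt.index:
        return []  # unknown title: nothing to recommend
    idx = pt.index.get_loc(book_name)
    order = np.argsort(sim[idx])[::-1]          # most similar first
    order = [i for i in order if i != idx][:n]  # drop the query book
    return [pt.index[i] for i in order]

print(recommend_titles("A", pt_toy, sim_toy, n=2))  # prints ['B', 'D']
```

Returning a list instead of printing makes the result easier to reuse, for example to look up authors afterwards. The row normalisation assumes no book has an all-zero row, which holds here and should hold for `pt` because every retained book has at least 50 rating entries, though a book whose entries are all 0 would need special handling.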