# Collaborative Filtering

Suppose there are two users, A and B, who have each liked some books.
If the books liked by user B are similar to the books liked by user A, then we recommend to user B the other books that A liked.

In [2]:
import numpy as np
import pandas as pd

## Book Dataset

In [3]:
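# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
book = pd.read_csv('datasets/Books.csv', on_bad_lines='skip')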



In [4]:
book.head(2)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...


In [5]:
# drop unnecessary columns
book.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], inplace=True)

# reset index
book.reset_index(drop=True, inplace=True)

# change the column name to one particular format
book.rename(columns={'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year'}, inplace=True)

In [6]:
book.head(2)

Unnamed: 0,ISBN,title,author,year,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


## Users Dataset

In [7]:
# on_bad_lines='skip' replaces the removed error_bad_lines=False (pandas >= 1.3)
users = pd.read_csv('datasets/Users.csv', on_bad_lines='skip')

In [8]:
users.rename(columns={'User-ID':'user'}, inplace=True)

In [9]:
users.head(2)

Unnamed: 0,user,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0


In [10]:
users.shape

(278858, 3)

## Ratings Dataset

In [11]:
# on_bad_lines='skip' replaces the removed error_bad_lines=False (pandas >= 1.3)
ratings = pd.read_csv('datasets/Ratings.csv', on_bad_lines='skip')

In [12]:
ratings.rename(columns={'User-ID' : 'user', 'Book-Rating':'ratings'}, inplace=True)

In [13]:
ratings.head(2)

Unnamed: 0,user,ISBN,ratings
0,276725,034545104X,0
1,276726,0155061224,5


In [14]:
ratings.shape

(1149780, 3)

## Preprocessing and Merging

For collaborative filtering we will use a pivot table.
In it, the columns represent users, the index represents books, and the values are the ratings each user gave to each book.

Before building it, we need to restrict the number of users. \
Why? \
For example, \
the opinion of a user who has read more than 50 books carries more weight than that of a user who has read only 2 or 3.

Likewise, we will also restrict the number of books \
and only consider books that have received at least 60 ratings.
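To make the layout concrete, here is a minimal sketch of such a pivot table on a made-up ratings table. The user IDs, titles, and ratings below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy ratings table; every value here is hypothetical.
toy = pd.DataFrame({
    "user":    [1, 1, 2, 2, 3],
    "title":   ["Book A", "Book B", "Book A", "Book C", "Book B"],
    "ratings": [8, 5, 7, 9, 6],
})

# Rows are books, columns are users, cells hold the rating.
# A cell is NaN where that user never rated that book.
toy_pivot = toy.pivot_table(index="title", columns="user", values="ratings")
print(toy_pivot)
```

Most cells of the real table will be empty in the same way, which is why the notebook later fills them with 0 and stores the result as a sparse matrix.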

In [15]:
print("users with the fewest ratings (one each):\n",ratings.user.value_counts().sort_values(ascending=True).head())

users with the fewest ratings (one each):
 223598    1
7874      1
16070     1
20168     1
18121     1
Name: user, dtype: int64


In [16]:
print("only one rating given to these books: \n",ratings.ISBN.value_counts().sort_values(ascending=True).head(2)) 

only one rating given to these books: 
 0262200600    1
1585710180    1
Name: ISBN, dtype: int64


## Restricting user count

In [17]:
# count the users who have given more than k ratings

def return_users_val(k):
    x = ratings['user'].value_counts() > k
    return x[x].shape


# ratings.user has one entry per rating, so its shape is the total number of ratings, not of users
print("total no. of ratings", ratings.user.shape)
print("no. of users who gave more than 60 ratings", return_users_val(60))
print("no. of users who gave more than 80 ratings", return_users_val(80))
print("no. of users who gave more than 100 ratings", return_users_val(100))
print("no. of users who gave more than 150 ratings", return_users_val(150))
print("no. of users who gave more than 180 ratings", return_users_val(180))
print("no. of users who gave more than 200 ratings", return_users_val(200))

total no. of ratings (1149780,)
no. of users who gave more than 60 ratings (2884,)
no. of users who gave more than 80 ratings (2236,)
no. of users who gave more than 100 ratings (1825,)
no. of users who gave more than 150 ratings (1223,)
no. of users who gave more than 180 ratings (1010,)
no. of users who gave more than 200 ratings (899,)


### Select only users who have given more than 180 ratings

In [18]:
tmp = ratings['user'].value_counts() > 180
tmp1 = tmp[tmp].index
df = ratings[ratings['user'].isin(tmp1)]

In [19]:
df.shape

(547483, 3)
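The same "keep only frequent values" filter is needed again later for books, so it could be wrapped in a small helper. The sketch below is one possible way to do that; `keep_frequent` is a hypothetical name and the toy data is made up.

```python
import pandas as pd

def keep_frequent(df, column, threshold):
    """Keep rows whose value in `column` occurs more than `threshold` times."""
    # Map every row to the number of times its value appears in the column.
    counts = df[column].map(df[column].value_counts())
    return df[counts > threshold]

# Hypothetical toy data: user 1 has three ratings, user 2 one, user 3 two.
toy = pd.DataFrame({"user": [1, 1, 1, 2, 3, 3], "ISBN": list("abcdef")})
frequent = keep_frequent(toy, "user", 1)
print(frequent["user"].unique())
```

On the real data, `keep_frequent(ratings, 'user', 180)` should select the same rows as the `isin` filter above, under the assumption that `ratings` is loaded as in this notebook.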

In [20]:
book.head(3)

Unnamed: 0,ISBN,title,author,year,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial


## Merge dataframes

In [21]:
newdf = df.merge(book, on='ISBN')
print(df.shape, newdf.shape) # the row count drops because the book dataset has no entry for some of the ISBNs in df
newdf.head()

(547483, 3) (507068, 7)


Unnamed: 0,user,ISBN,ratings,title,author,year,Publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
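To see which ratings are lost in an inner merge like this, one option is a left merge with `indicator=True`, which adds a `_merge` column marking rows that had no match. The sketch below uses made-up toy tables, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy tables: ISBN "999" is rated but missing from the book table.
toy_ratings = pd.DataFrame({"user": [1, 2, 3], "ISBN": ["111", "222", "999"], "ratings": [5, 0, 8]})
toy_books = pd.DataFrame({"ISBN": ["111", "222"], "title": ["Book A", "Book B"]})

# A left merge keeps every rating; _merge is "left_only" where no book matched.
checked = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
missing = checked[checked["_merge"] == "left_only"]
print(len(missing), "rating(s) refer to an ISBN that is not in the book table")
```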


## Restricting book count

In [22]:
tmpdf = newdf.copy()

tmp1df = tmpdf.groupby('title')['ratings'].count().reset_index().rename(columns={'ratings' : 'no. of raitngs'})

In [23]:
tmp1df.head()

Unnamed: 0,title,no. of raitngs
0,A Light in the Storm: The Civil War Diary of ...,3
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1


### Join newdf with tmp1df

In [24]:
newdf = newdf.merge(tmp1df, on='title')
print(newdf.shape)
newdf.head()

(507068, 8)


Unnamed: 0,user,ISBN,ratings,title,author,year,Publisher,no. of raitngs
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82


In [25]:
newdf.rename(columns={'no. of raitngs' : 'no_of_ratings'}, inplace=True)

In [26]:
newdf.drop_duplicates(['user', 'title'], inplace=True)

In [27]:
newdf.head(2)

Unnamed: 0,user,ISBN,ratings,title,author,year,Publisher,no_of_ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82


In [28]:
print(newdf[newdf['no_of_ratings'] >= 30].shape)
print(newdf[newdf['no_of_ratings'] >= 40].shape)
print(newdf[newdf['no_of_ratings'] >= 50].shape)
print(newdf[newdf['no_of_ratings'] >= 60].shape)

(103831, 8)
(81434, 8)
(65608, 8)
(53861, 8)


In [29]:
# keep only books with at least 60 ratings
newdf = newdf[newdf['no_of_ratings'] >= 60]

In [30]:
newdf.head()

Unnamed: 0,user,ISBN,ratings,title,author,year,Publisher,no_of_ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82


In [31]:
# saving this clean data, in case ...

# newdf.to_csv('Datasets/clean_data.csv')

Now that the data is in the right format, let's see what exactly we are trying to do.

From the given data, we need to find the patterns in the ratings that users give to books.

### Create pivot table

In [32]:
pivot_book = newdf.pivot_table(columns='user', index='title', values='ratings')

In [33]:
pivot_book

title / user,254,2033,2276,2766,2977,3363,3757,4017,4385,6242,...,274301,274308,274808,275970,276680,277427,277478,277639,278188,278418
1984,9.0,,,,,,,,,,...,,,,0.0,,,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2nd Chance,,,10.0,,,,,,,,...,,0.0,,,,,,0.0,,
4 Blondes,,,,,,,,,,,...,,,,,,,,,,
A Bend in the Road,0.0,,,7.0,,,,,,,...,,,,,,,,,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wuthering Heights,,,,,,,,,,,...,,0.0,,,,,,,,
Year of Wonders,,,,,7.0,,,,,7.0,...,,,,0.0,,,,,,
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,,,,,,0.0,,,,,...,,,,0.0,,,,,,
Zoya,,,,,,,,,,,...,,,,,,,,,,


In [34]:
pivot_book.fillna(0, inplace=True)

### Convert pivot table into sparse matrix

In [35]:
from scipy.sparse import csr_matrix

In [36]:
# the pivot table is mostly zeros (missing ratings), so converting it to a sparse matrix saves memory and helps in further calculations

sparse_mat = csr_matrix(pivot_book)
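As a rough illustration of why the sparse format helps, the sketch below builds a small made-up matrix and checks how many of its cells are actually stored. The values are hypothetical; on the real data the same two lines could be applied to `sparse_mat`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy book-by-user matrix: only two of the twelve cells hold a nonzero rating.
dense = np.array([[9, 0, 0, 0],
                  [0, 0, 7, 0],
                  [0, 0, 0, 0]], dtype=float)
sparse = csr_matrix(dense)

# nnz is the number of stored (nonzero) entries; density is their share of all cells.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"stored values: {sparse.nnz}, density: {density:.2%}")
```

One consequence worth keeping in mind: after `fillna(0)`, a book a user never rated and a book the user rated 0 look identical to the model.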

## Model


In [37]:
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(algorithm='brute')

In [38]:
model.fit(sparse_mat)

NearestNeighbors(algorithm='brute')
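With no `metric` argument, `NearestNeighbors` uses the Minkowski distance with `p=2`, i.e. Euclidean distance. A common alternative for rating vectors is cosine distance, which compares the pattern of ratings rather than their magnitude. The sketch below is not what this notebook does; it only shows the alternative on made-up data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy matrix: rows are books, columns are users (all values hypothetical).
toy = csr_matrix(np.array([
    [10, 8, 0, 0],   # book 0
    [5, 4, 0, 0],    # book 1: same rating pattern as book 0, at half the scale
    [0, 0, 9, 7],    # book 2: rated by a different set of users
], dtype=float))

cos_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy)

# Query with book 0; ask for its two nearest books (it is its own nearest neighbour).
dist, idx = cos_model.kneighbors(toy[0], n_neighbors=2)
print(idx[0], dist[0])
```

Here book 1 is at cosine distance close to 0 from book 0 even though its ratings are lower, whereas Euclidean distance would treat the two as clearly apart.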

In [39]:
pivot_book.index[430]

'The Glass Lake'

In [40]:
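# note: newdf.iloc[430] is row 430 of the ratings table, not row 430 of the pivot table,
# so the title shown here differs from pivot_book.index[430]
pivot_book[pivot_book.index==newdf.iloc[430,:].title]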

title / user,254,2033,2276,2766,2977,3363,3757,4017,4385,6242,...,274301,274308,274808,275970,276680,277427,277478,277639,278188,278418
One for the Money (Stephanie Plum Novels (Paperback)),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0


In [41]:
np.where(pivot_book.index=="The Glass Lake")[0][0]

430

In [42]:
distance, suggestion = model.kneighbors(pivot_book.iloc[180, :].values.reshape(1, -1), n_neighbors=10)

In [43]:
# print(distance)
# print(suggestion)
print(pivot_book.index[suggestion])

[['Harry Potter and the Chamber of Secrets (Book 2)'
  'Harry Potter and the Prisoner of Azkaban (Book 3)'
  'Harry Potter and the Goblet of Fire (Book 4)'
  "Harry Potter and the Sorcerer's Stone (Book 1)" 'The Cradle Will Fall'
  "Tom Clancy's Op-Center (Tom Clancy's Op Center (Paperback))"
  'Truly, Madly Manhattan' 'Bittersweet' 'Secrets' 'Zoya']]




In [44]:
import pickle

In [46]:
# use a context manager so the file is closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

In [47]:
# use a context manager so the file is closed after writing
with open('pivot.pkl', 'wb') as f:
    pickle.dump(pivot_book, f)

In [48]:
distance, suggestion = model.kneighbors(pivot_book.iloc[200, :].values.reshape(1, -1), n_neighbors=10)
suggestion = suggestion[0]
print(pivot_book.index[suggestion])

Index(['How to Be Good', 'The Cradle Will Fall', 'A Civil Action', 'Invasion',
       'Pleading Guilty',
       "Tom Clancy's Op-Center (Tom Clancy's Op Center (Paperback))",
       'Fatal Cure', 'Journey', 'Sleepers', 'Zoya'],
      dtype='object', name='title')
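The lookups above use a row position such as 180 or 200. A small wrapper that takes a title instead could look like the sketch below. `recommend` is a hypothetical helper, not part of the notebook, and it is demonstrated on a made-up pivot table. It assumes the model was fitted on the rows of the same pivot table that is passed in.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend(title, model, pivot, n=5):
    """Return up to n titles nearest to `title`, excluding the title itself."""
    row = np.where(pivot.index == title)[0]
    if len(row) == 0:
        raise KeyError(f"{title!r} is not in the pivot table")
    query = pivot.iloc[row[0], :].values.reshape(1, -1)
    # Ask for one extra neighbour because the query book is usually returned too.
    k = min(n + 1, len(pivot))
    _, idx = model.kneighbors(query, n_neighbors=k)
    return [t for t in pivot.index[idx[0]] if t != title][:n]

# Hypothetical toy pivot table: rows are books, columns are users.
toy_pivot = pd.DataFrame(
    [[9, 8, 0], [8, 9, 0], [0, 0, 7]],
    index=pd.Index(["Book A", "Book B", "Book C"], name="title"),
    columns=[1, 2, 3],
    dtype=float,
)
toy_model = NearestNeighbors(algorithm="brute").fit(toy_pivot.values)
print(recommend("Book A", toy_model, toy_pivot, n=1))
```

With the pickled objects from this notebook, the call would be along the lines of `recommend('The Glass Lake', model, pivot_book, n=9)`.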
