## Collaborative Filtering

In [1]:
# import pandas
import pandas as pd
from sklearn.neighbors import NearestNeighbors

We will build a recommender engine based on item-based collaborative filtering (IBCF), which finds the most similar books based on user ratings. The data can be downloaded from [here](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing).
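To make the idea concrete, here is a minimal sketch of item-based similarity on a made-up rating matrix. The matrix values and the helper name `cosine_similarity_matrix` are assumptions for illustration only; cosine similarity is one common choice, not necessarily the one this notebook ends up using.

```python
import numpy as np

# Toy item-by-user rating matrix (hypothetical values).
# Rows are books, columns are users, 0 means "no rating".
ratings = np.array([
    [9.0, 0.0, 8.0],   # book A
    [8.0, 0.0, 9.0],   # book B
    [0.0, 10.0, 0.0],  # book C
])


def cosine_similarity_matrix(matrix):
    """Return pairwise cosine similarity between the rows of `matrix`.

    Assumes no row is all zeros (otherwise the norm would be zero).
    """
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / norms
    return normalized @ normalized.T


similarity = cosine_similarity_matrix(ratings)
print(np.round(similarity, 3))
```

Books A and B were rated similarly by the same users, so they end up much closer to each other than either is to book C.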

In [25]:
book_ratings = pd.read_csv('../BX-Book-Ratings.csv',sep=";", encoding="latin").sample(frac=.02)
book_ratings

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 407576 | 98391 | 0373750080 | 9 |
| 164567 | 36369 | 8486587247 | 10 |
| 4621 | 278418 | 0307020460 | 0 |
| 211892 | 49154 | 0515132020 | 9 |
| 1068296 | 255092 | 0449217493 | 0 |
| ... | ... | ... | ... |
| 22686 | 4896 | 0811202925 | 0 |
| 118174 | 26883 | 0805055908 | 0 |
| 1017657 | 243930 | 0330315862 | 9 |
| 742427 | 179733 | 0440221501 | 0 |


In [26]:
books = pd.read_csv('../BX-Books.csv',sep=";", encoding="latin", on_bad_lines='skip', low_memory=False)
print(books.shape)
books.head()

(271360, 8)


| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |


* Explore both datasets

In [27]:
book_ratings['User-ID'].value_counts()

11676     285
198711    163
98391     125
153662    119
35859     101
         ... 
185137      1
251495      1
131032      1
249632      1
4896        1
Name: User-ID, Length: 9605, dtype: int64
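Beyond per-user counts, a few headline numbers help judge how sparse the ratings are. The following is a minimal sketch on a made-up table; the helper name `summarize_ratings` and the toy values are assumptions, and only the column names are taken from the data above.

```python
import pandas as pd


def summarize_ratings(ratings):
    """Return a few headline numbers about a long-format ratings table."""
    return {
        "n_ratings": len(ratings),
        "n_users": ratings["User-ID"].nunique(),
        "n_books": ratings["ISBN"].nunique(),
        # Share of rows whose rating is exactly 0.
        "share_zero": float((ratings["Book-Rating"] == 0).mean()),
    }


# Hypothetical toy ratings using the same column names as book_ratings.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["0515132020", "0811202925", "0515132020", "0811202925"],
    "Book-Rating": [9, 0, 0, 10],
})

summary = summarize_ratings(toy_ratings)
print(summary)
```

The same helper could be called on `book_ratings` to see how many distinct users and books the sample contains and how often the rating is 0.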

* Create a DataFrame named `df_book_features` from `book_ratings` with `ISBN` as the index, `User-ID` as the columns, and `Book-Rating` as the values.
    - The data are quite big, so it's OK to use only a sample if your PC has limited RAM.
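The target layout described above can be sketched on a toy table. The ratings below are made up and the names `toy_ratings` and `toy_features` are assumptions for illustration; `fill_value=0` is one possible way to avoid missing values, with the caveat that a missing rating then looks the same as a rating of 0.

```python
import pandas as pd

# Hypothetical long-format ratings using the same column names as book_ratings.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["0515132020", "0811202925", "0515132020", "0811202925"],
    "Book-Rating": [9, 0, 7, 10],
})

# One row per ISBN, one column per user, ratings as values.
# Missing (ISBN, user) pairs are filled with 0 instead of NaN.
toy_features = toy_ratings.pivot_table(
    index="ISBN", columns="User-ID", values="Book-Rating", fill_value=0
)
print(toy_features)
```

On the full data this dense layout grows as (number of books) x (number of users), which is why sampling or a sparse representation may be needed.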


In [28]:
df_book_features = book_ratings.merge(
    books[['ISBN', 'Book-Title']], left_on='ISBN', right_on='ISBN',
).set_index('ISBN')
print(df_book_features.shape)
df_book_features

(20670, 3)


| ISBN | User-ID | Book-Rating | Book-Title |
|---|---|---|---|
| 0373750080 | 98391 | 9 | Want Ad Wedding (Harlequin American Romance, 1... |
| 0307020460 | 278418 | 0 | Scuffy the Tugboat and His Adventures Down the... |
| 0515132020 | 49154 | 9 | Heaven and Earth (Three Sisters Island Trilogy) |
| 0515132020 | 159834 | 0 | Heaven and Earth (Three Sisters Island Trilogy) |
| 0515132020 | 176667 | 0 | Heaven and Earth (Three Sisters Island Trilogy) |
| ... | ... | ... | ... |
| 0374479828 | 76626 | 0 | Tristan and Iseult |
| 0345410998 | 196052 | 0 | Street Boys |
| 0060012358 | 214786 | 0 | The Amazing Maurice and His Educated Rodents |
| 0811202925 | 4896 | 0 | Siddhartha |


* Create an instance of the `NearestNeighbors` class

* Fit `NearestNeighbors` using `df_book_features`

In [33]:
from sklearn.neighbors import NearestNeighbors

In [32]:
# Put the data in a format that can be used for NearestNeighbors
book_rating = df_book_features.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating')
print(book_rating.shape)
book_rating.head()

(16391, 8633)


| ISBN (rows) / User-ID (columns) | 39 | 95 | 165 | 178 | 199 | 226 | 254 | 408 | 446 | 476 | ... | 278506 | 278522 | 278524 | 278535 | 278545 | 278554 | 278582 | 278633 | 278637 | 278843 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1047213 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1811150 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2114038 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2190915 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2215497 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [34]:
neigh = NearestNeighbors(n_neighbors=5)
neigh.fit(book_rating)

MemoryError: Unable to allocate 1.05 GiB for an array with shape (16391, 8633) and data type float64
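The `MemoryError` above comes from the dense 16391 x 8633 `float64` matrix, which needs about 1.05 GiB per copy. The pivoted frame also still contains `NaN` values, which `NearestNeighbors` with its default metric does not accept. One possible workaround, sketched here on toy data, is to build a SciPy sparse matrix directly from the long-format ratings and use a brute-force cosine search, which accepts sparse input. The helper name `build_sparse_item_matrix` and the toy ratings are assumptions for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def build_sparse_item_matrix(ratings):
    """Build a sparse ISBN-by-user matrix and the ISBN label of each row.

    Unrated (ISBN, user) pairs are implicit zeros. If the same pair occurs
    more than once, csr_matrix sums the duplicate ratings.
    """
    isbn_cat = ratings["ISBN"].astype("category")
    user_cat = ratings["User-ID"].astype("category")
    matrix = csr_matrix(
        (
            ratings["Book-Rating"].to_numpy(dtype=float),
            (isbn_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy()),
        ),
        shape=(len(isbn_cat.cat.categories), len(user_cat.cat.categories)),
    )
    return matrix, isbn_cat.cat.categories


# Hypothetical toy ratings using the same column names as book_ratings.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 1, 2, 3],
    "ISBN": ["0515132020", "0515132020", "0811202925", "0811202925", "0374479828"],
    "Book-Rating": [9, 8, 8, 9, 10],
})

matrix, isbns = build_sparse_item_matrix(toy_ratings)

# Brute-force search with cosine distance works on sparse matrices.
neigh = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
neigh.fit(matrix)

# Query the neighbours of one book; the book itself comes back first.
row = isbns.get_loc("0515132020")
distances, indices = neigh.kneighbors(matrix[row])
print(list(zip(isbns[indices[0]], distances[0])))
```

Only the observed ratings are stored, so memory grows with the number of ratings rather than with books x users.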

* Create a function that returns the top 5 most similar books (according to the KNN model) for a selected book
    * the input will be a Book-Title from the DataFrame `books`
    * the output will be the Book-Titles of the top 5 most similar books
    * for every book in the top 5, also print its distance from the selected book (the one we chose as input to the function)

* Apply the function to book of your choice
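A minimal sketch of such a function follows, run on toy data so that it is self-contained. The function name `recommend_similar_books`, its parameters, and the toy ratings are assumptions for illustration; the three titles and ISBNs are taken from the merged table above. It uses cosine distance with a brute-force search, which is one possible choice rather than a prescribed one.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def recommend_similar_books(title, books, model, matrix, isbns, n_neighbors=5):
    """Return (Book-Title, distance) pairs for the books closest to `title`."""
    isbn_to_title = books.drop_duplicates("ISBN").set_index("ISBN")["Book-Title"]
    # A title may map to several ISBNs; keep those present in the rating matrix.
    candidates = books.loc[books["Book-Title"] == title, "ISBN"]
    known = [isbn for isbn in candidates if isbn in isbns]
    if not known:
        raise KeyError(f"No ratings found for title: {title!r}")
    row = isbns.get_loc(known[0])
    # Ask for one extra neighbour because the query book matches itself.
    k = min(n_neighbors + 1, matrix.shape[0])
    distances, indices = model.kneighbors(matrix[row], n_neighbors=k)
    results = []
    for distance, index in zip(distances[0], indices[0]):
        if index == row:
            continue  # skip the query book itself
        isbn = isbns[index]
        results.append((isbn_to_title.get(isbn, isbn), float(distance)))
    return results[:n_neighbors]


# Hypothetical toy data using the notebook's column names.
toy_books = pd.DataFrame({
    "ISBN": ["0515132020", "0811202925", "0374479828"],
    "Book-Title": [
        "Heaven and Earth (Three Sisters Island Trilogy)",
        "Siddhartha",
        "Tristan and Iseult",
    ],
})
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 1, 2, 3, 1],
    "ISBN": ["0515132020", "0515132020", "0811202925",
             "0811202925", "0374479828", "0374479828"],
    "Book-Rating": [9, 8, 8, 9, 10, 1],
})

pivot = toy_ratings.pivot_table(
    index="ISBN", columns="User-ID", values="Book-Rating", fill_value=0
)
matrix = csr_matrix(pivot.to_numpy(dtype=float))
isbns = pivot.index
model = NearestNeighbors(metric="cosine", algorithm="brute").fit(matrix)

recommendations = recommend_similar_books(
    "Heaven and Earth (Three Sisters Island Trilogy)",
    toy_books, model, matrix, isbns, n_neighbors=2,
)
for book_title, distance in recommendations:
    print(f"{book_title}: distance {distance:.3f}")
```

On the real data the same function could be called with `books`, a fitted model, and the matching rating matrix and ISBN index, leaving `n_neighbors` at its default of 5.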