**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Run the cell below to load the datasets

In [21]:
# Boiler plate
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [22]:
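# Loading data
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines="skip" instead
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")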
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv("users.csv", sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv("ratings.csv", sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
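
The `Skipping line ...` messages above come from rows that contain more fields than the header declares. As a minimal sketch on made-up data (the sample text below is illustrative, not taken from the dataset), the same skipping behaviour can be reproduced on pandas 1.3 or later. Reading `ISBN` as a string is an extra assumption here, made so that leading zeros survive.

```python
import io
import pandas as pd

# Hypothetical three-column sample; the second data row has one field too many.
raw = "ISBN;bookTitle;yearOfPublication\n0001;A;2001\n0002;B;extra;2002\n0003;C;2003\n"

# on_bad_lines="skip" (pandas >= 1.3) drops malformed rows instead of raising.
sample = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip", dtype={"ISBN": str})
print(sample)
```

Skipping is convenient for exploration, but every skipped line is a book that silently disappears from the table, so it is worth counting them.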


### Check the number of records and features in each dataset

In [23]:
print(books.shape)
print(users.shape)
print(ratings.shape)


(271360, 8)
(278858, 3)
(1149780, 3)


## Exploring books dataset

In [24]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop the last three columns containing image URLs, which are not required for analysis

In [30]:
books_bkp = books.copy()  # backup of the raw books table, restored later in the notebook
books_df = books.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'], axis = 1) 
books_df.head()


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [33]:
books_df.groupby('yearOfPublication').size()

yearOfPublication
0                    3570
1806                    1
1900                    1
1901                    7
1902                    2
1904                    1
1906                    1
1908                    1
1910                    1
1911                   10
1914                    1
1917                    1
1920                   27
1921                    2
1923                    8
1924                    1
1925                    2
1926                    1
1927                    1
1928                    2
1929                    7
1930                   12
1931                    2
1932                    3
1933                    2
1934                    1
1935                    3
1936                    5
1937                    5
1938                    6
                     ... 
1986                 1583
1987                 1768
1988                 1947
1989                 2111
1990                 2266
1991                 2463
1992                

In [36]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
dtypes: object(5)
memory usage: 10.4+ MB


In [34]:
books_df.describe()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
count,271360,271360,271359,271360,271358
unique,271360,242135,102023,202,16807
top,449905543,Selected Poems,Agatha Christie,2002,Harlequin
freq,1,27,632,13903,7535


The outputs above indicate that this field contains some incorrect entries. It looks like the publisher names 'DK Publishing Inc' and 'Gallimard' have been loaded as yearOfPublication because of malformed rows in the CSV file.

Some entries are also stored as strings while the same years appear as numbers elsewhere, which is why the column has dtype `object`. We fix these issues in the following steps.
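
One way to spot such entries, sketched here on a made-up series rather than on `books_df`, is to coerce the column to numbers and look at what fails to convert. The values below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical mix of string years, integer years and a stray publisher name.
years = pd.Series(["2002", 2001, "DK Publishing Inc", "0"])

# errors="coerce" turns anything that is not a number into NaN.
as_numbers = pd.to_numeric(years, errors="coerce")

# The entries that became NaN are the non-numeric ones.
non_numeric = years[as_numbers.isna()]
print(non_numeric.tolist())
```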

### Check the rows having 'DK Publishing Inc' or 'Gallimard' as yearOfPublication

In [48]:
# Rows where a publisher name was loaded into yearOfPublication
print(books_df[(books_df.yearOfPublication == 'Gallimard') | (books_df.yearOfPublication == 'DK Publishing Inc')])



              ISBN                                          bookTitle  \
209538  078946697X  DK Readers: Creating the X-Men, How It All Beg...   
220731  2070426769  Peuple du ciel, suivi de 'Les Bergers\";Jean-M...   
221678  0789466953  DK Readers: Creating the X-Men, How Comic Book...   

       bookAuthor  yearOfPublication  \
209538       2000  DK Publishing Inc   
220731       2003          Gallimard   
221678       2000  DK Publishing Inc   

                                                publisher  
209538  http://images.amazon.com/images/P/078946697X.0...  
220731  http://images.amazon.com/images/P/2070426769.0...  
221678  http://images.amazon.com/images/P/0789466953.0...  


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [49]:
books_df.drop(books_df[(books_df.yearOfPublication == 'Gallimard') | (books_df.yearOfPublication == 'DK Publishing Inc' )].index, inplace = True) 

In [50]:
# checking
print(books_df[(books_df.yearOfPublication == 'Gallimard') | (books_df.yearOfPublication == 'DK Publishing Inc' )])

Empty DataFrame
Columns: [ISBN, bookTitle, bookAuthor, yearOfPublication, publisher]
Index: []


### Change the datatype of yearOfPublication to 'int'

In [51]:
books_df.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
dtype: object

In [52]:
books_df['yearOfPublication'] = books_df['yearOfPublication'].astype(int)

In [53]:
books_df.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

In [58]:
books_df[books_df['publisher'].isnull()]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


### Drop NaNs in `'publisher'` column


In [61]:
# Keep only rows with a non-null publisher (equivalent to books_df.dropna(subset=['publisher']))
books_df = books_df[pd.notnull(books_df['publisher'])]


In [62]:
#testing
books_df[books_df['publisher'].isnull()]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


In [64]:
books_df.shape

(271355, 5)

In [None]:
# In all, 5 rows were dropped: 3 for an invalid yearOfPublication and 2 for NaN in publisher

## Exploring Users dataset

In [65]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [68]:
#a = users['Age'].unique()
sorted(users['Age'].unique())

[nan,
 0.0,
 1.0,
 2.0,
 3.0,
 4.0,
 5.0,
 6.0,
 7.0,
 8.0,
 9.0,
 10.0,
 11.0,
 12.0,
 13.0,
 14.0,
 15.0,
 16.0,
 17.0,
 18.0,
 19.0,
 20.0,
 21.0,
 22.0,
 23.0,
 24.0,
 25.0,
 26.0,
 27.0,
 28.0,
 29.0,
 30.0,
 31.0,
 32.0,
 33.0,
 34.0,
 35.0,
 36.0,
 37.0,
 38.0,
 39.0,
 40.0,
 41.0,
 42.0,
 43.0,
 44.0,
 45.0,
 46.0,
 47.0,
 48.0,
 49.0,
 50.0,
 51.0,
 52.0,
 53.0,
 54.0,
 55.0,
 56.0,
 57.0,
 58.0,
 59.0,
 60.0,
 61.0,
 62.0,
 63.0,
 64.0,
 65.0,
 66.0,
 67.0,
 68.0,
 69.0,
 70.0,
 71.0,
 72.0,
 73.0,
 74.0,
 75.0,
 76.0,
 77.0,
 78.0,
 79.0,
 80.0,
 81.0,
 82.0,
 83.0,
 84.0,
 85.0,
 86.0,
 87.0,
 88.0,
 89.0,
 90.0,
 91.0,
 92.0,
 93.0,
 94.0,
 95.0,
 96.0,
 97.0,
 98.0,
 99.0,
 100.0,
 101.0,
 102.0,
 103.0,
 104.0,
 105.0,
 106.0,
 107.0,
 108.0,
 109.0,
 110.0,
 111.0,
 113.0,
 114.0,
 115.0,
 116.0,
 118.0,
 119.0,
 123.0,
 124.0,
 127.0,
 128.0,
 132.0,
 133.0,
 136.0,
 137.0,
 138.0,
 140.0,
 141.0,
 143.0,
 146.0,
 147.0,
 148.0,
 151.0,
 152.0,
 156.0,
 157.0,
 159.0,


The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

In [70]:
users.describe()

Unnamed: 0,userID,Age
count,278858.0,168096.0
mean,139429.5,34.751434
std,80499.51502,14.428097
min,1.0,0.0
25%,69715.25,24.0
50%,139429.5,32.0
75%,209143.75,44.0
max,278858.0,244.0


### Values below 5 and above 90 are implausible ages for users rating books, so replace them with NaN

In [75]:
# Replace implausible ages (above 90 or below 5) with NaN
users['Age'] = np.where(users['Age'] > 90.0, np.nan, users['Age'])
users['Age'] = np.where(users['Age'] < 5.0, np.nan, users['Age'])
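
The same range filter can be written in a single step. The sketch below uses a made-up series of ages (not the `users` table) to show how `Series.where` combined with `between` keeps the 5 to 90 range and blanks out everything else.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including values outside the 5-90 range used above.
ages = pd.Series([0, 4, 5, 32, 90, 91, 244, np.nan])

# Series.where keeps values where the condition holds and writes NaN elsewhere;
# between() is inclusive on both ends by default.
cleaned = ages.where(ages.between(5, 90))
print(cleaned.tolist())
```

Because `between` is inclusive, ages of exactly 5 and 90 are kept, which matches the `min` and `max` reported by `describe()` after cleaning.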



In [76]:
# checking
users.describe()

Unnamed: 0,userID,Age
count,278858.0,166784.0
mean,139429.5,34.72384
std,80499.51502,13.585761
min,1.0,5.0
25%,69715.25,24.0
50%,139429.5,32.0
75%,209143.75,44.0
max,278858.0,90.0


In [85]:
users.groupby('Age').size()
#sorted(users['Age'].unique())

Age
5.0       26
6.0       18
7.0       27
8.0       54
9.0       62
10.0      84
11.0     121
12.0     192
13.0     885
14.0    1962
15.0    2383
16.0    2570
17.0    3044
18.0    3703
19.0    3950
20.0    4056
21.0    4438
22.0    4714
23.0    5456
24.0    5687
25.0    5618
26.0    5547
27.0    5383
28.0    5347
29.0    5293
30.0    4778
31.0    4665
32.0    4781
33.0    4699
34.0    4656
        ... 
61.0    1035
62.0     882
63.0     792
64.0     680
65.0     593
66.0     545
67.0     465
68.0     426
69.0     373
70.0     315
71.0     286
72.0     223
73.0     200
74.0     170
75.0     119
76.0     114
77.0      82
78.0      73
79.0      62
80.0      48
81.0      46
82.0      25
83.0      24
84.0      22
85.0      17
86.0       7
87.0       6
88.0       2
89.0       2
90.0       5
Length: 86, dtype: int64

### Replace null values in column `Age` with the mean

In [89]:
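# Impute missing ages with the mean age, rounded to the nearest integer
Age_mean = users['Age'].mean()
Age_mean_int = round(Age_mean)
# Assign the result back instead of using inplace=True on a column selection
users['Age'] = users['Age'].fillna(Age_mean_int)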
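
As a small worked example of this imputation, assuming a made-up series of four ages, the mean is computed over the non-missing values only and then rounded before filling.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([20.0, 30.0, np.nan, 41.0])

# mean() skips NaN by default: (20 + 30 + 41) / 3 is about 30.33, which rounds to 30.
fill_value = int(round(ages.mean()))

# Fill the gap, then cast to int now that no NaN remains.
filled = ages.fillna(fill_value).astype(int)
print(filled.tolist())
```

The cast to `int` only works after filling, because NaN cannot be represented in a plain integer column.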


In [93]:
#sorted(users['Age'].unique())

In [91]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


### Change the datatype of `Age` to `int`

In [94]:
users['Age'] = users['Age'].astype(int)

In [95]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null int32
dtypes: int32(1), int64(1), object(1)
memory usage: 5.3+ MB


In [96]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### Check the shape

In [97]:
ratings.shape

(1149780, 3)

In [98]:
n_users = users.shape[0]
n_books = books.shape[0]

In [99]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [101]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [102]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [103]:
ratings_bkp = ratings.copy()
books_bkp2 = books.copy()

In [106]:
#print(ratings.reset_index(drop=True) == books.reset_index(drop=True))
ratings.sort_index(inplace=True)
books.sort_index(inplace=True)

In [108]:
#ratings.set_index('ISBN',inplace=True)
#books.set_index('ISBN',inplace=True)


In [114]:
ratings = ratings_bkp.copy()
books = books_bkp.copy()

In [119]:
ratings_merged = pd.merge(ratings, books_df, on='ISBN')  


In [121]:
ratings = ratings_merged.drop(['bookTitle', 'bookAuthor', 'yearOfPublication','publisher'], axis = 1) 
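
The merge followed by a column drop acts as a filter on `ISBN`. A possible alternative, sketched below on two made-up miniature tables, is to filter with `isin`, which avoids carrying the book columns along and keeps the original row index. It returns the same rows as the inner merge as long as each ISBN appears only once in the books table, which the `describe()` output earlier suggests is the case here.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
ratings_small = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "C", "A"],
    "bookRating": [0, 5, 7, 9],
})
books_small = pd.DataFrame({"ISBN": ["A", "B"], "bookTitle": ["First", "Second"]})

# Keep only the ratings whose ISBN appears in the books table.
kept = ratings_small[ratings_small["ISBN"].isin(books_small["ISBN"])]
print(kept)
```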

In [122]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031130 entries, 0 to 1031129
Data columns (total 3 columns):
userID        1031130 non-null int64
ISBN          1031130 non-null object
bookRating    1031130 non-null int64
dtypes: int64(2), object(1)
memory usage: 31.5+ MB


### The ratings dataset should contain only ratings from users that exist in the users dataset. Drop the remaining rows

In [123]:
ratings_bkp1 = ratings.copy()
users_bkp1 = users.copy()

In [124]:
ratings_merged = pd.merge(ratings, users, on='userID')  

In [126]:
ratings = ratings_merged.drop(['Location', 'Age'], axis = 1) 

In [127]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031130 entries, 0 to 1031129
Data columns (total 3 columns):
userID        1031130 non-null int64
ISBN          1031130 non-null object
bookRating    1031130 non-null int64
dtypes: int64(2), object(1)
memory usage: 31.5+ MB


### Keep only explicit ratings (1-10) and drop the implicit 0s in column `bookRating`

In [129]:
ratings.drop(ratings[ratings.bookRating == 0].index, inplace = True) 

### Find out which rating has been given the highest number of times

In [136]:
ratings.groupby('bookRating').size()

bookRating
1      1481
2      2375
3      5118
4      7617
5     45355
6     31687
7     66401
8     91804
9     60776
10    71225
dtype: int64

In [135]:
np.sort(ratings.groupby('bookRating').size())

array([ 1481,  2375,  5118,  7617, 31687, 45355, 60776, 66401, 71225,
       91804], dtype=int64)

In [137]:
# Rating 8 was given the highest number of times
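
Sorting the counts with `np.sort` discards the rating labels, so the answer has to be read off by matching numbers. A short sketch that keeps the labels, using the counts copied from the `groupby` output above, is to ask for the index of the maximum directly.

```python
import pandas as pd

# Counts copied from the groupby output above, indexed by rating value.
rating_counts = pd.Series(
    [1481, 2375, 5118, 7617, 45355, 31687, 66401, 91804, 60776, 71225],
    index=range(1, 11),
)

# idxmax returns the index label (the rating) with the largest count.
most_common = rating_counts.idxmax()
print(most_common, rating_counts.max())
```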

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [142]:
# Number of explicit ratings per user, highest first
ratings.groupby('userID')['userID'].count().sort_values(ascending=False).head()  

userID
11676     6943
98391     5689
189835    1899
153662    1845
23902     1180
Name: userID, dtype: int64
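
The cell above only counts ratings per user. One way the filtering step itself could look is sketched below on a made-up table, with a threshold of 3 instead of 100 so the effect is visible. The names `min_ratings`, `active_users` and `ratings_active` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical explicit ratings; user 1 has three ratings, the others fewer.
ratings_small = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "C", "A", "B", "C"],
    "bookRating": [5, 7, 9, 8, 6, 10],
})

min_ratings = 3  # the heading above asks for 100 on the full data

# Count ratings per user, then keep the users that reach the threshold.
counts = ratings_small["userID"].value_counts()
active_users = counts[counts >= min_ratings].index
ratings_active = ratings_small[ratings_small["userID"].isin(active_users)]
print(ratings_active)
```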

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace them with 0, which indicates the absence of a rating
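
The notebook leaves this step without code. A minimal sketch of one common approach, assuming a made-up ratings table, is to pivot the long table into a user-by-book matrix and then fill the gaps with 0. The variable name `matrix` is an assumption.

```python
import pandas as pd

# Hypothetical explicit ratings in long format.
ratings_small = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "B", "C"],
    "bookRating": [5, 7, 8, 10],
})

# One row per user, one column per book; unrated pairs come out as NaN.
matrix = ratings_small.pivot(index="userID", columns="ISBN", values="bookRating")

# Replace NaN with 0 to mark the absence of a rating.
matrix = matrix.fillna(0)
print(matrix)
```

`pivot` requires each (userID, ISBN) pair to be unique; if a user could rate the same book twice, `pivot_table` with an aggregation would be needed instead.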

### Generate the predicted ratings using SVD with 50 singular values
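
No code is given for this step, so the following is only a sketch of a truncated SVD using `numpy.linalg.svd` on a made-up 4 x 3 matrix with `k = 2`. It is not necessarily the routine the notebook used; for a matrix the size of the full ratings table, a sparse solver such as `scipy.sparse.linalg.svds` is a common alternative.

```python
import numpy as np

# Hypothetical 4 users x 3 books matrix of rank 2
# (row 3 is row 1 + row 2, row 4 is twice row 1).
R = np.array([
    [5.0, 3.0, 1.0],
    [1.0, 0.0, 4.0],
    [6.0, 3.0, 5.0],
    [10.0, 6.0, 2.0],
])

k = 2  # number of singular values to keep; the heading above asks for 50

# Thin SVD: R = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the k largest singular values and rebuild the rank-k approximation.
predicted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(predicted, 2))
```

Because the toy matrix has rank 2, keeping two singular values reproduces it exactly. On real data, `k` is far smaller than the rank, and the approximation fills the zero cells with non-zero scores that serve as predicted ratings.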

### Take a particular user_id

### Let's find the recommendations for the user with ID `2110`

#### Note: Run the cells below to load the variables

In [2]:
userID = 2110

In [3]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order
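
A sketch of this step on a made-up predicted matrix follows. The name `preds_df` and the values are assumptions; the row is selected by position, in the same way `user_id` is defined in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted-ratings matrix: rows are users, columns are ISBNs.
predicted = np.array([
    [0.2, 4.1, 1.3],
    [3.3, 0.1, 2.2],
    [1.0, 2.5, 4.8],
])
preds_df = pd.DataFrame(predicted, columns=["A", "B", "C"])

user_id = 2  # row position of the user in the matrix

# Select the user's row and sort the predicted ratings from highest to lowest.
sorted_preds = preds_df.iloc[user_id].sort_values(ascending=False)
print(sorted_preds)
```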

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [67]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating
14448,2110,60987529,7
14449,2110,64472779,8
14450,2110,140022651,10
14452,2110,142302163,8
14453,2110,151008116,5


In [68]:
user_data.shape

(103, 3)

### Combine `user_data` and the corresponding book data (`book_data`) into a single dataframe named `user_full_info`

In [70]:
book_data.shape

(103, 5)

In [71]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
246,0151008116,Life of Pi,Yann Martel,2002,Harcourt
904,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books
1000,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
1302,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
1472,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket


In [73]:
user_full_info.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,60987529,7,Confessions of an Ugly Stepsister : A Novel,Gregory Maguire,2000,Regan Books
1,2110,64472779,8,All-American Girl,Meg Cabot,2003,HarperTrophy
2,2110,140022651,10,Journey to the Center of the Earth,Jules Verne,1965,Penguin Books
3,2110,142302163,8,The Ghost Sitter,Peni R. Griffin,2002,Puffin Books
4,2110,151008116,5,Life of Pi,Yann Martel,2002,Harcourt
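
The cells that build `user_data`, `book_data` and `user_full_info` are not shown. A minimal sketch of how the three could be derived is given below on miniature tables modelled on the outputs above. The rating by user 7 is made up, and `ratings_small` and `books_small` are hypothetical stand-ins for the full tables.

```python
import pandas as pd

# Miniature stand-ins for the ratings and books tables.
ratings_small = pd.DataFrame({
    "userID": [2110, 2110, 7],
    "ISBN": ["0151008116", "0064472779", "0151008116"],
    "bookRating": [5, 8, 9],
})
books_small = pd.DataFrame({
    "ISBN": ["0151008116", "0064472779", "0345307674"],
    "bookTitle": ["Life of Pi", "All-American Girl", "Return of the Jedi (Star Wars)"],
})

userID = 2110

# Ratings given by this user.
user_data = ratings_small[ratings_small["userID"] == userID]

# Book details for exactly those ISBNs.
book_data = books_small[books_small["ISBN"].isin(user_data["ISBN"])]

# One row per rated book, with rating and book details side by side.
user_full_info = user_data.merge(book_data, on="ISBN")
print(user_full_info)
```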


### Get the top 10 recommendations for the userID given above from the books not already rated by that user
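
This final step also has no code in the notebook. One way to sketch it, assuming a made-up series of predicted ratings indexed by ISBN and a made-up list of already rated books, is to mask out the rated books, sort, and take the head.

```python
import pandas as pd

# Hypothetical predicted ratings for one user, indexed by ISBN.
user_preds = pd.Series({"A": 4.8, "B": 4.1, "C": 3.0, "D": 2.2, "E": 0.5})

# Hypothetical list of books this user has already rated.
already_rated = ["A", "C"]

top_n = 2  # the heading above asks for 10 on the full data

# Drop the rated books, then keep the highest remaining predictions.
recommendations = (
    user_preds[~user_preds.index.isin(already_rated)]
    .sort_values(ascending=False)
    .head(top_n)
)
print(recommendations)
```

To show titles rather than ISBNs, the resulting index can be looked up in the books table in the same way `user_full_info` combines ratings with book details.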