**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation), and an implicit rating is recorded as 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
import pandas as pd

In [2]:
#Loading data
#on_bad_lines="warn" (pandas >= 1.3) skips malformed rows and reports them;
#it replaces error_bad_lines=False, which was removed in pandas 2.0
books = pd.read_csv("books/books.csv", sep=";", on_bad_lines="warn", encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('books/users.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('books/ratings.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
  interactivity=interactivity, compiler=compiler, result=result)
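
The "expected 8 fields, saw 9" messages usually mean that a field contains an unescaped separator. As a rough illustration, the sketch below uses only the standard `csv` module on a made-up three-column sample (the sample text and the `find_bad_lines` helper are assumptions, not part of the dataset or of pandas) to list the lines whose field count differs from the header.

```python
import csv
import io

# Hypothetical sample: the last row has a stray ';' inside the title.
sample = (
    "ISBN;bookTitle;publisher\n"
    "0001;Plain Title;Pub A\n"
    "0002;Title; With Stray Separator;Pub B\n"
)

def find_bad_lines(text, sep=";"):
    """Return (line_number, field_count) for rows that do not match the header width."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    expected = len(next(reader))  # number of header fields
    bad = []
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        if len(row) != expected:
            bad.append((line_no, len(row)))
    return bad

bad_lines = find_bad_lines(sample)
print(bad_lines)
```

Inspecting such lines before skipping them can show whether the lost rows matter for the analysis.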


### Check the number of records and features in each dataset

In [3]:
#Checking info for books
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


In [4]:
#Checking info for users
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [5]:
#Checking info for ratings
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


## Exploring books dataset

In [6]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [7]:
#Dropping last 3 columns containing image URLs from books dataset
books.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'], axis=1, inplace=True)

In [8]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [9]:
#Checking unique values of yearOfPublication column
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

The output above shows some incorrect entries in this field. It looks like the publisher names 'DK Publishing Inc' and 'Gallimard' have been loaded as yearOfPublication because of errors in the csv file.

Some entries are also strings, so the same year appears both as a number and as text. We will fix both problems in the steps that follow.
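
As an alternative way to spot such entries without knowing the offending strings in advance, the sketch below coerces a small made-up column to numbers and flags whatever fails to parse. The sample values are assumptions chosen to mirror the mix seen above. This is a different technique from the name-based drop used in this notebook.

```python
import pandas as pd

# Hypothetical mixed-type column: ints, numeric strings and a stray publisher name
years = pd.Series([2002, "2000", 0, "DK Publishing Inc"], dtype=object)

# errors="coerce" turns anything non-numeric into NaN instead of raising
clean = pd.to_numeric(years, errors="coerce")

# Rows that could not be parsed as a year
unparsed = years[clean.isna()]
n_bad = int(clean.isna().sum())
print(unparsed.tolist(), n_bad)
```

The coerced column is float64 because of the NaN, so it still needs a cast after the bad rows are removed.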

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [10]:
#Checking the rows having 'DK Publishing Inc' as yearOfPublication
books[books['yearOfPublication'] == 'DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [11]:
#Importing numpy
import numpy as np

#Dropping rows having 'DK Publishing Inc' and 'Gallimard' as `yearOfPublication`
books.drop(index=books[books['yearOfPublication'] == 'DK Publishing Inc'].index, inplace=True)
books.drop(index=books[books['yearOfPublication'] == 'Gallimard'].index, inplace=True)

### Change the datatype of yearOfPublication to 'int'

In [12]:
#Converting datatype of yearOfPublication to numeric
books['yearOfPublication'] = pd.to_numeric(books['yearOfPublication'])

In [13]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [14]:
#Dropping NaN from 'publisher' column
books.dropna(subset=['publisher'], inplace=True)

## Exploring Users dataset

In [15]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [16]:
#Sorting unique values in 'Age' column
np.sort(users['Age'].unique())

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

The `Age` column has some invalid entries: NaN, 0, and implausibly high values (100 and above).

### Values below 5 and above 90 do not make much sense for our book rating case, so replace them with NaNs

In [17]:
#Replacing ages below 5 or above 90 with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
users.loc[users['Age'] < 5, 'Age'] = np.nan
users.loc[users['Age'] > 90, 'Age'] = np.nan

### Replace null values in column `Age` with mean

In [18]:
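#Replacing null values in 'Age' column with mean
#(assigning the result back avoids the chained inplace call, which recent pandas versions warn about)
users['Age'] = users['Age'].fillna(users['Age'].mean())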
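
To make the effect of the two cleaning steps concrete, here is a minimal sketch on five made-up ages (the values are assumptions). Because the out-of-range values are masked first, the mean is computed over the plausible ages only.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: 0 and 120 are out of range, one value is missing
ages = pd.Series([0.0, 18.0, np.nan, 30.0, 120.0])

# Keep values in [5, 90]; everything else becomes NaN
valid = ages.where(ages.between(5, 90))

# Mean over the remaining values only: (18 + 30) / 2
filled = valid.fillna(valid.mean())
print(filled.tolist())
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is worth keeping in mind if `Age` is used as a feature later.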

### Change the datatype of `Age` to `int`

In [19]:
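#Ensuring Age is numeric. Note that pd.to_numeric leaves the column as float64
#(the imputed mean is fractional); users['Age'].round().astype(int) would give integers
users['Age'] = pd.to_numeric(users['Age'])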

In [20]:
print(sorted(users.Age.unique()))

[5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 34.72384041634689, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0]


## Exploring the Ratings Dataset

### Check the shape

In [21]:
ratings.shape

(1149780, 3)

In [22]:
n_users = users.shape[0]
n_books = books.shape[0]

In [23]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Ratings dataset should have books only which exist in our books dataset. Drop the remaining rows

In [24]:
#Finding records in ratings Dataset that are there in books dataset
ratings = ratings[ratings['ISBN'].isin(books['ISBN'])].dropna()

### Ratings dataset should have ratings from users which exist in users dataset. Drop the remaining rows

In [25]:
#Keeping records in ratings dataset whose userID exists in users dataset
ratings = ratings[ratings['userID'].isin(users['userID'])].dropna()

### Keep only explicit ratings from 1-10 and drop the 0s in column `bookRating`

In [26]:
#Let's query for bookrating != 0 for ratings dataset
ratings = ratings.query('bookRating != 0')
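
The three filters above (known books, known users, explicit ratings) can also be combined into one boolean mask. The sketch below shows this on a made-up four-row frame. The frame and the `known_users` and `known_books` lists are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ratings: one implicit rating (0) and one unknown user/book (9, "c")
ratings_demo = pd.DataFrame({
    "userID": [1, 1, 2, 9],
    "ISBN": ["a", "b", "a", "c"],
    "bookRating": [0, 7, 5, 8],
})
known_users = [1, 2]
known_books = ["a", "b"]

# A row survives only if all three conditions hold
mask = (
    ratings_demo["userID"].isin(known_users)
    & ratings_demo["ISBN"].isin(known_books)
    & (ratings_demo["bookRating"] != 0)
)
explicit = ratings_demo[mask]
print(explicit)
```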

### Find out which rating has been given highest number of times

In [27]:
#Group by bookRating, count & sort in descending order
pd.DataFrame(ratings.groupby(by='bookRating')['userID'].count()).sort_values(by='userID', ascending=False)

Unnamed: 0_level_0,userID
bookRating,Unnamed: 1_level_1
8,91804
10,71225
7,66401
9,60776
5,45355
6,31687
4,7617
3,5118
2,2375
1,1481


### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated at least 100 books

In [28]:
#Creating new DataFrame flagging, per userID, whether they have rated more than 100 books
#(note: `> 100` is strict; use `>= 100` to include users with exactly 100 ratings)
ratings_100 = pd.DataFrame(ratings.groupby(by='userID')['bookRating'].count() > 100)

#Extract only the userIDs that pass the threshold
ratings_100 = pd.DataFrame(ratings_100[ratings_100['bookRating'] == True].index)

#Keeping records in ratings dataset that belong to these users
ratings_new = ratings[ratings['userID'].isin(ratings_100['userID'])].dropna()

#delete temporary Dataset created
del ratings_100
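
The difference between a strict and an inclusive threshold is easy to miss, so here is a small sketch with a made-up frame and a hypothetical threshold of 2 (both are assumptions) that shows which users each comparison keeps.

```python
import pandas as pd

# Hypothetical ratings: user 1 has 3 ratings, user 2 has 2, user 3 has 1
demo = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 3],
    "bookRating": [8, 7, 9, 5, 6, 10],
})

counts = demo.groupby("userID")["bookRating"].count()
min_ratings = 2  # hypothetical threshold

strict = counts[counts > min_ratings].index.tolist()      # more than 2 ratings
inclusive = counts[counts >= min_ratings].index.tolist()  # at least 2 ratings
print(strict, inclusive)
```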

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [29]:
# pivot entire dataset using pivot(), here index = 'userID', column = ISBN
R_df = ratings_new.pivot(index = 'userID', columns = 'ISBN', values = 'bookRating').fillna(0)
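
To show what the pivot produces, the sketch below reshapes three made-up ratings into a user-by-book matrix and measures how many cells end up as 0 (the sample data is an assumption). Note that `pivot` raises a `ValueError` if the same (userID, ISBN) pair occurs more than once.

```python
import pandas as pd

# Hypothetical long-format ratings: user 2 has not rated book "A"
demo = pd.DataFrame({
    "userID": [1, 1, 2],
    "ISBN": ["A", "B", "B"],
    "bookRating": [8, 6, 9],
})

# Rows are users, columns are books; missing ratings become 0
matrix = demo.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)

# Share of cells with no rating
sparsity = float((matrix == 0).to_numpy().mean())
print(matrix)
print(sparsity)
```

In the real matrix most cells are 0, so the factorization treats "not rated" the same as a very low score. That is a simplification of this approach.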

### Generate the predicted ratings using SVD with 50 singular values

In [30]:
#Import svds from sparse.linalg module in scipy library
from scipy.sparse.linalg import svds

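#Computing the 50 largest singular values and their vectors
#(svds expects a float array or sparse matrix, so pass the DataFrame's values)
U, sigma, Vt = svds(R_df.to_numpy(dtype=float), k=50)

#Converting sigma into diagonal matrix
sigma = np.diag(sigma)

#Predictions are the product of the three matrices: U, diag(sigma) and Vt
all_users_predicted_ratings = np.dot(np.dot(U, sigma), Vt)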

#Convert the predictions to DataFrame & set columns from ratings matrix
predicted = pd.DataFrame(all_users_predicted_ratings, index=R_df.index, columns=R_df.columns)
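
The idea behind the step above is a rank-k approximation of the ratings matrix. The sketch below illustrates it on a made-up 4x4 matrix using `np.linalg.svd` rather than `svds` (a substitution made here because the toy matrix is tiny). It checks that keeping all singular values reproduces the matrix, while keeping only `k` of them leaves an error equal to the norm of the discarded singular values.

```python
import numpy as np

# Hypothetical ratings matrix: two groups of users with different tastes
R = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Full decomposition; singular values come back in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # number of singular values to keep
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_full = np.linalg.norm(R - U @ np.diag(s) @ Vt)  # should be ~0
err_k = np.linalg.norm(R - R_k)                     # Frobenius error of the rank-k fit
print(err_full, err_k)
```

The low-rank matrix `R_k` has non-zero entries where `R` had zeros, and those filled-in values are what the notebook reads as predicted ratings.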

### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [31]:
userID = 2110

In [32]:
user_id = 2 #2nd row in ratings matrix and predicted matrix
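
Hard-coding the row position only works while the matrix keeps the same row order. A more robust option is to look the position up from the index, or to select by label directly. The sketch below assumes a made-up predictions frame with three userIDs.

```python
import pandas as pd

# Hypothetical predictions frame indexed by userID
pred_demo = pd.DataFrame(
    [[0.1, 0.2], [0.6, 0.3], [0.0, 0.5]],
    index=pd.Index([1733, 2110, 2276], name="userID"),
    columns=["isbn_a", "isbn_b"],
)

# Zero-based row position of userID 2110
row_pos = pred_demo.index.get_loc(2110)

# Selecting by label and by looked-up position gives the same row
by_label = pred_demo.loc[2110]
by_position = pred_demo.iloc[row_pos]
print(row_pos)
```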

### Get the predicted ratings for userID `2110` and sort them in descending order

In [33]:
#Get predicted ratings for userID 2110 & sort in descending order
predicted.iloc[user_id-1,:].sort_values(ascending = False)

ISBN
059035342X    0.666278
0345370775    0.356946
0345384911    0.332482
044021145X    0.328190
043935806X    0.305998
0451151259    0.302311
0439139597    0.284296
0439064872    0.278464
0380759497    0.278080
0451167317    0.249057
0345353145    0.240258
0880389117    0.238956
0618002227    0.237127
0451160525    0.234123
0446310786    0.229384
0451173317    0.227405
0060392452    0.226435
0440213525    0.226044
0618002235    0.225940
0345335287    0.222455
0451156609    0.221148
0441845630    0.220842
1560768304    0.220842
0451180232    0.220840
0439136350    0.220328
0345317580    0.218592
0439136369    0.218479
0451142934    0.218052
0312980140    0.217973
0670835382    0.215775
                ...   
078686804X   -0.041588
042518630X   -0.041742
0786000899   -0.042079
0553567683   -0.042111
0688088686   -0.042342
0345361571   -0.042936
042517770X   -0.043048
0671673661   -0.043183
0446603090   -0.043258
0684195984   -0.046513
0684195976   -0.046605
0679405283   -0.046727
055380

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [34]:
#Creating dataframe containing the books explicitly rated by userID 2110
user_data = pd.DataFrame(ratings_new[ratings_new['userID'] == 2110].drop(columns=['userID']))

In [35]:
user_data.head()

Unnamed: 0,ISBN,bookRating
14448,60987529,7
14449,64472779,8
14450,140022651,10
14452,142302163,8
14453,151008116,5


In [36]:
user_data.shape

(103, 2)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [37]:
#Finding records in user_data Dataset that are there in books dataset
book_data = books[books['ISBN'].isin(user_data['ISBN'])].dropna()

#Merging user_data & book_data data
user_full_info = pd.merge(user_data, book_data, on='ISBN')

In [38]:
book_data.shape

(103, 5)

In [39]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
246,0151008116,Life of Pi,Yann Martel,2002,Harcourt
904,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books
1000,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
1302,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
1472,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket


In [40]:
user_full_info.head()

Unnamed: 0,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,60987529,7,Confessions of an Ugly Stepsister : A Novel,Gregory Maguire,2000,Regan Books
1,64472779,8,All-American Girl,Meg Cabot,2003,HarperTrophy
2,140022651,10,Journey to the Center of the Earth,Jules Verne,1965,Penguin Books
3,142302163,8,The Ghost Sitter,Peni R. Griffin,2002,Puffin Books
4,151008116,5,Life of Pi,Yann Martel,2002,Harcourt


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [41]:
def recommend_books(predictions_df, userid, books_df, original_ratings_df, num_recommendations = 10):
    # predicted ratings for this user, highest first (uses the userid argument, not the global userID)
    sorted_user_predictions = predictions_df.loc[userid].sort_values(ascending = False)
    
    # books the user has already rated, joined with their book details
    user_data = original_ratings_df[original_ratings_df.userID == (userid)]
    user_full = (user_data.merge(books_df, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                sort_values(['bookRating'], ascending = False)
                )
    
    # dropna() keeps only the rated books that have complete book details
    print('User {0} has already rated {1} books.'.format(userid, user_full.dropna().shape[0]))
    
    # print how many books will be recommended
    print('Recommending the highest {0} predicted ratings books not already rated.'.format(num_recommendations))
    
    # attach predictions to the books the user has not rated yet
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').
                      rename(columns = {userid: 'Predictions'}).
                       
                       # sort by prediction, keep the top rows and drop the trailing 'Predictions' column
                      sort_values('Predictions', ascending = False).
                      iloc[:num_recommendations, :-1])
    return user_full, recommendations, sorted_user_predictions, user_data, user_full

In [42]:
already_rated, predictions, sorted_user_predictions, user_data2, user_full = recommend_books(predicted, userID, books, ratings, 10)
predictions

User 2110 has already rated 103 books.
Recommending the highest 10 predicted ratings books not already rated.


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
1192,0345370775,Jurassic Park,Michael Crichton,1999,Ballantine Books
6184,0345384911,Crystal Line,Anne McCaffrey,1993,Del Rey Books
455,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
5458,043935806X,Harry Potter and the Order of the Phoenix (Boo...,J. K. Rowling,2003,Scholastic
2031,0451151259,Eyes of the Dragon,Stephen King,1988,Penguin Putnam~mass
5383,0439139597,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic
3413,0439064872,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic
976,0380759497,Xanth 15: The Color of Her Panties,Piers Anthony,1992,Eos
6048,0451167317,The Dark Half,Stephen King,1994,Signet Book
2435,0345353145,Sphere,MICHAEL CRICHTON,1988,Ballantine Books
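
The core of the recommendation step is "drop what the user has already rated, then take the highest predicted scores". The sketch below isolates that logic on a made-up score series. The book keys and scores are assumptions, and it is a simplified stand-in rather than the function used above.

```python
import pandas as pd

# Hypothetical predicted scores for one user, keyed by book
scores = pd.Series({"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1})
already_rated = ["a"]

# Remove rated books, then keep the two highest remaining scores
top = scores.drop(index=already_rated).nlargest(2)
print(top.index.tolist())
```

This mirrors why the top recommended title above is not the highest entry in the sorted predictions: the highest-scoring book was already among the user's rated books.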
