## Book Rental Recommendation

### Description

Book Rent is the largest online and offline book rental chain in India. They provide books of various genres, such as thrillers, mysteries, romances, and science fiction. The company charges a fixed rental fee for a book per month. Lately, the company has been losing its user base. The main reason for this is that users are not able to choose the right books for themselves. The company wants to solve this problem and increase its revenue and profit. 

### Project Objective:

You, as an ML expert, should focus on improving the user experience by personalizing it to the user's needs. You have to model a recommendation engine so that users get recommendations for books based on the behavior of similar users. This will ensure that users are renting the books based on their tastes and traits.

Note: You have to perform user-based collaborative filtering and item-based collaborative filtering.
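The two approaches named in the note can be sketched on a toy matrix. This is a minimal illustration, not the project's required implementation: the matrix `R`, the helper `cosine_similarity`, and the choice of cosine similarity are all assumptions, and unrated cells are treated as zeros for simplicity.

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = books); 0 means "not rated".
# These values are invented for illustration only.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

def cosine_similarity(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1e-9  # avoid division by zero for all-zero rows
    unit = M / norms
    return unit @ unit.T

user_sim = cosine_similarity(R)    # shape (n_users, n_users)
item_sim = cosine_similarity(R.T)  # shape (n_books, n_books)

# User-based: each prediction is a similarity-weighted average, over all
# users, of the ratings given to the same book.
user_pred = user_sim @ R / np.abs(user_sim).sum(axis=1, keepdims=True)

# Item-based: each prediction is a similarity-weighted average, over all
# books, of the ratings given by the same user.
item_pred = R @ item_sim / np.abs(item_sim).sum(axis=0, keepdims=True)

print(user_pred.round(2))
print(item_pred.round(2))
```

User-based filtering compares rows of the matrix and item-based filtering compares columns; the prediction step is otherwise the same weighted average.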

### Dataset description:

BX-Users: It contains the information of users.

user_id - These have been anonymized and mapped to integers

Location - Demographic data, provided if available

Age - Demographic data, provided if available

Where demographic data is not available, these fields contain NULL values.

 

### BX-Books: 

isbn - Books are identified by their respective ISBNs. Invalid ISBNs have already been removed from the dataset.

book_title

book_author

year_of_publication

publisher


 

### BX-Book-Ratings: Contains the book rating information. 

user_id

isbn

rating - Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1–10 (higher values denoting higher appreciation), or implicit, expressed by 0.
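Because a 0 marks an implicit interaction rather than a low score, it can help to look at explicit and implicit ratings separately. A minimal sketch on an invented toy frame (the name `toy_ratings` and its values are assumptions, not project data):

```python
import pandas as pd

# Toy ratings in the same layout as BX-Book-Ratings (values invented).
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "isbn": ["A", "B", "A", "B", "C"],
    "rating": [0, 8, 5, 0, 10],
})

# Explicit ratings use the 1-10 scale; a 0 marks an implicit interaction.
explicit = toy_ratings[toy_ratings["rating"] > 0]
implicit = toy_ratings[toy_ratings["rating"] == 0]

print(len(explicit), len(implicit))
print(explicit["rating"].mean())
```

Averaging over all rows, zeros included, would pull the mean rating down even though the zeros are not scores.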


### Note: Download the “BX-Book-Ratings.csv”, “BX-Books.csv”, “BX-Users.csv”, and “Recommend.csv” using the link given in the Book Rental Recommendation project problem statement.

 

### Following operations should be performed:

Read the books dataset and explore it

Clean up NaN values

Read the data where ratings are given by users

Take a quick look at the number of unique users and books

Convert ISBN variables to numeric numbers in the correct order

Convert the user_id variable to numeric numbers in the correct order

Convert both user_id and ISBN to the ordered list, i.e., from 0...n-1

Re-index the columns to build a matrix

Split your data into two sets (training and testing)

Make predictions based on user and item variables

Use RMSE to evaluate the predictions




In [1]:
import numpy as np
import pandas as pd

### 1. Read the books dataset and explore it.

In [2]:
books = pd.read_csv("BX-Books.csv", delimiter=',', encoding="latin-1", on_bad_lines='skip')

users = pd.read_csv("BX-Users.csv",  sep=',', encoding='latin-1', on_bad_lines='skip')

recommend = pd.read_csv("Recommend.csv",  sep=',', encoding='latin-1', on_bad_lines='skip')

ratings = pd.read_csv("BX-Book-Ratings.csv",  sep=',', encoding='latin-1', on_bad_lines='skip')



In [100]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271379 entries, 0 to 271378
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   isbn                 271379 non-null  object
 1   book_title           271379 non-null  object
 2   book_author          271377 non-null  object
 3   year_of_publication  271379 non-null  object
 4   publisher            271377 non-null  object
dtypes: object(5)
memory usage: 10.4+ MB
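The summary above shows `year_of_publication` stored as `object` rather than as a number. One possible way to convert such a column is sketched below on a toy frame; the frame `toy_books` and its malformed value are invented, and whether the real column contains non-numeric entries is not checked here.

```python
import pandas as pd

# Toy frame; the non-numeric value is invented to stand in for a bad entry.
toy_books = pd.DataFrame({
    "isbn": ["A", "B", "C"],
    "year_of_publication": ["2002", "1999", "unknown"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy_books["year_of_publication"] = pd.to_numeric(
    toy_books["year_of_publication"], errors="coerce"
)

print(toy_books.dtypes)
print(toy_books["year_of_publication"].isna().sum())
```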


In [None]:
books.head()

In [105]:
# Count the total number of missing values across all columns

books.isnull().sum().sum()

4

In [106]:
books.isnull().sum()

isbn                   0
book_title             0
book_author            2
year_of_publication    0
publisher              2
dtype: int64

In [110]:
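# Handle missing values in the book_author column. Because book_author is categorical,
# the guideline suggests replacing NaN with the mode (the first value if there are ties).
# Assigning the result back avoids the chained inplace fill, which newer pandas versions warn about.

books['book_author'] = books['book_author'].fillna(books['book_author'].mode()[0])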

In [109]:
# Handle missing values in the publisher column. Because publisher is categorical,
# the guideline suggests replacing NaN with the mode (the first value if there are ties).

books['publisher'] = books['publisher'].fillna(books['publisher'].mode()[0])

In [111]:
books.isnull().sum()  # no missing values remain in books

isbn                   0
book_title             0
book_author            0
year_of_publication    0
publisher              0
dtype: int64

In [112]:
# Check and remove all duplicate records from the dataframe (drop_duplicates)

books.drop_duplicates(inplace=True)

In [113]:
books.shape

(271379, 5)

### Read the users dataset and explore it.

In [120]:
users.shape

(278859, 3)

In [121]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278859 entries, 0 to 278858
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   user_id   278859 non-null  object 
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), object(2)
memory usage: 6.4+ MB


In [122]:
users.head()

Unnamed: 0,user_id,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [123]:
# Count the total number of missing values across all columns

users.isnull().sum().sum()

110764

In [124]:
users.isnull().sum()

user_id          0
Location         1
Age         110763
dtype: int64

In [128]:

# Handle missing values in the Age column. Because Age is continuous numeric data
# (treated here as roughly normally distributed), the guideline suggests replacing NaN with the mean.

users['Age'] = users['Age'].fillna(users['Age'].mean())

In [126]:
users.head()

Unnamed: 0,user_id,Location,Age
0,1,"nyc, new york, usa",34.751434
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",34.751434
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",34.751434
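Mean imputation is sensitive to extreme values. If the Age column contains implausible entries (an assumption, not verified in this notebook), the median may be a more robust fill value. A small sketch with invented ages:

```python
import numpy as np
import pandas as pd

# Invented ages, including one implausible value and two missing entries.
toy_age = pd.Series([18.0, 17.0, np.nan, 25.0, 230.0, np.nan])

# Both fills ignore NaN when computing the statistic.
filled_mean = toy_age.fillna(toy_age.mean())
filled_median = toy_age.fillna(toy_age.median())

print(filled_mean.iloc[2], filled_median.iloc[2])
```

Here the single outlier drags the mean fill far above the typical age, while the median fill stays close to it.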


In [131]:
# Handle missing values in the Location column. Because Location is categorical,
# the guideline suggests replacing NaN with the mode (the first value if there are ties).

users['Location'] = users['Location'].fillna(users['Location'].mode()[0])

In [132]:
users.isnull().sum()

user_id     0
Location    0
Age         0
dtype: int64

In [133]:
# Check and remove all duplicate records from the dataframe (drop_duplicates)

users.drop_duplicates(inplace=True)  # there are no duplicates

In [135]:
users.shape 

(278859, 3)

### Read the recommend dataset and explore it.

In [138]:
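# Note: info is referenced without parentheses, so the cell displays the bound method
# (and with it the frame's repr) rather than the summary printed by recommend.info().
recommend.info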

<bound method DataFrame.info of        196   242  3  881250949
0      186   302  3  891717742
1       22   377  1  878887116
2      244    51  2  880606923
3      166   346  1  886397596
4      298   474  4  884182806
...    ...   ... ..        ...
99994  880   476  3  880175444
99995  716   204  5  879795543
99996  276  1090  1  874795795
99997   13   225  2  882399156
99998   12   203  3  879959583

[99999 rows x 4 columns]>

In [139]:
recommend.head()

Unnamed: 0,196,242,3,881250949
0,186,302,3,891717742
1,22,377,1,878887116
2,244,51,2,880606923
3,166,346,1,886397596
4,298,474,4,884182806


In [140]:
recommend.shape

(99999, 4)

In [144]:
recommend.isnull().sum().sum()

0
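The column labels shown above (`196`, `242`, `3`, `881250949`) look like data values, which suggests the file has no header row and its first record was consumed as the header. A hedged sketch of reading such a file with `header=None`; the in-memory sample reuses the first rows shown above, and the column names are hypothetical guesses rather than anything documented for `Recommend.csv`.

```python
import io
import pandas as pd

# First rows as displayed above, held in memory instead of read from disk.
sample = "196,242,3,881250949\n186,302,3,891717742\n22,377,1,878887116\n"

# Default behaviour: the first line becomes the header and is lost as data.
with_default = pd.read_csv(io.StringIO(sample))

# header=None keeps every line as data; the names below are assumptions.
assumed_names = ["user_id", "item_id", "rating", "timestamp"]
with_names = pd.read_csv(io.StringIO(sample), header=None, names=assumed_names)

print(with_default.shape, with_names.shape)
```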

### Read the ratings dataset and explore it.

- Read the data where ratings are given by users

In [146]:
ratings.isnull().sum().sum()

0

In [147]:
ratings.shape

(1048575, 3)

In [148]:
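# Note: info is referenced without parentheses, so the cell displays the bound method
# (and with it the frame's repr) rather than the summary printed by ratings.info().
ratings.info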

<bound method DataFrame.info of          user_id        isbn  rating
0         276725  034545104X       0
1         276726   155061224       5
2         276727   446520802       0
3         276729  052165615X       3
4         276729   521795028       6
...          ...         ...     ...
1048570   250764   451410777       0
1048571   250764   452264464       8
1048572   250764  048623715X       0
1048573   250764   486256588       0
1048574   250764   515069434       0

[1048575 rows x 3 columns]>

In [149]:
ratings.head()

Unnamed: 0,user_id,isbn,rating
0,276725,034545104X,0
1,276726,155061224,5
2,276727,446520802,0
3,276729,052165615X,3
4,276729,521795028,6


In [151]:
ratings.describe()

Unnamed: 0,user_id,rating
count,1048575.0,1048575.0
mean,128508.9,2.879907
std,74218.76,3.85787
min,2.0,0.0
25%,63394.0,0.0
50%,128835.0,0.0
75%,192779.0,7.0
max,278854.0,10.0


#### Merge the books and ratings datasets.

In [3]:
books_ratings = pd.merge(books, ratings, on = 'isbn')
books_ratings.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,user_id,rating
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,2,0
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,8,5
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11400,0
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11676,8
4,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,41385,0
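`pd.merge` defaults to an inner join, so ratings whose ISBN does not appear in the books table are dropped silently. One way to count them is a left join with `indicator=True`, sketched here on invented toy frames (`toy_books` and `toy_ratings` are not project data):

```python
import pandas as pd

toy_books = pd.DataFrame({"isbn": ["A", "B"],
                          "book_title": ["Title A", "Title B"]})
toy_ratings = pd.DataFrame({"user_id": [1, 2, 3],
                            "isbn": ["A", "B", "Z"],  # "Z" has no book record
                            "rating": [5, 0, 8]})

# A left join keeps every rating; the _merge column records where each row matched.
checked = pd.merge(toy_ratings, toy_books, on="isbn", how="left", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())

# The inner join used in the notebook keeps only the matched rows.
inner = pd.merge(toy_books, toy_ratings, on="isbn")

print(unmatched, len(inner))
```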


In [4]:
# Number of ratings for each book
rating_counts = pd.DataFrame(books_ratings["book_title"].value_counts())
rating_counts.head(10)

book_title,count
Wild Animus,2264
The Lovely Bones: A Novel,1164
The Da Vinci Code,828
A Painted House,766
The Nanny Diaries: A Novel,759
Bridget Jones's Diary,740
The Secret Life of Bees,704
Divine Secrets of the Ya-Ya Sisterhood: A Novel,669
The Red Tent (Bestselling Backlist),668
Angels &amp; Demons,616


In [201]:
rating_counts.shape

(230238, 1)

In [5]:
rating_counts['count']

book_title
Wild Animus                                                                                 2264
The Lovely Bones: A Novel                                                                   1164
The Da Vinci Code                                                                            828
A Painted House                                                                              766
The Nanny Diaries: A Novel                                                                   759
                                                                                            ... 
Bits and Pieces to Ponder                                                                      1
Doing Our Own Thing: The Degradation of Language and Music and Why We Should, Like, Care       1
Malice In London                                                                               1
What Would You Do?                                                                             1
L'Occhio Nero Al Pa

In [6]:
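# Intended: remove the books with fewer than 100 ratings from the data set.
# Note: isin() below receives the rare_books DataFrame, so it matches titles against
# the column label and filters nothing; rare_books.index holds the rare titles.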

rare_books = rating_counts[rating_counts["count"] < 100]
common_books = books_ratings[~books_ratings["book_title"].isin(rare_books)]
common_books.shape

(941148, 7)

In [7]:
common_books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,user_id,rating
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,2,0
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,8,5
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11400,0
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11676,8
4,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,41385,0


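The result still has 941148 rows, the same number as `books_ratings`. These are rating rows rather than books, and the filter above removed nothing: `isin` was given the `rare_books` DataFrame, so it compared titles against the column label rather than against the rare titles in `rare_books.index`.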
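The unchanged row count suggests the filter did not remove anything. A minimal sketch on toy data of the difference between passing a DataFrame and passing the titles to `isin`; the frame `toy`, the threshold `min_ratings`, and the other names are invented for the illustration.

```python
import pandas as pd

toy = pd.DataFrame({"book_title": ["X", "X", "X", "Y", "Z", "Z"],
                    "rating": [5, 0, 7, 3, 0, 9]})
min_ratings = 3  # hypothetical threshold for the toy data (the notebook uses 100)

counts = toy["book_title"].value_counts()

# Passing a DataFrame to isin() matches against its column labels,
# so no row is excluded:
counts_frame = pd.DataFrame(counts)
rare_frame = counts_frame[counts_frame.iloc[:, 0] < min_ratings]
unfiltered = toy[~toy["book_title"].isin(rare_frame)]

# Passing the titles themselves (the index of the counts) filters as intended:
rare_titles = counts[counts < min_ratings].index
common = toy[~toy["book_title"].isin(rare_titles)]

print(len(unfiltered), len(common))
```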

In [8]:
#user_books_df = common_books.pivot_table(index=["user_id"], columns=["book_title"], values="rating")
#pt = common_books.pivot_table(index='book_title',columns='user_id',values = 'rating')

#pt.shape


### 4. Take a quick look at the number of unique users and books.

In [31]:
# Number of unique users.
n_users = books_ratings['user_id'].nunique()
print("Number of users: {}".format(n_users))

Number of users: 83644


In [34]:
# Number of unique book titles (not unique ISBNs).
n_books = books_ratings['book_title'].nunique()
print("Number of books: {}".format(n_books))

Number of books: 230238
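Counting unique titles is not the same as counting unique ISBNs, because several editions can share one title. The notebook later reports 257832 unique ISBNs against the 230238 titles counted here. A toy illustration with invented rows:

```python
import pandas as pd

# Two editions of the same title carry different ISBNs (values invented).
toy = pd.DataFrame({
    "isbn": ["111", "222", "333"],
    "book_title": ["Same Title", "Same Title", "Other Title"],
})

n_titles = toy["book_title"].nunique()
n_isbns = toy["isbn"].nunique()

print(n_titles, n_isbns)
```

Any matrix indexed by an ISBN-based position needs its size taken from the ISBN count, not the title count.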


### 6. Convert the user_id variable to numeric numbers in the correct order.

In [9]:
# Collect the unique user_id values; a user's position in this list becomes its numeric index.

user_id_list = books_ratings['user_id'].unique()
print("length of user_id list: ", len(user_id_list))

length of user_id list:  83644


In [10]:
def userid_numeric(user_id):
    itemindex = np.where(user_id_list==user_id)
    return itemindex[0][0]

### 5. Convert ISBN variables to numeric numbers in the correct order

In [11]:
# Do the same with ISBN: a book's position in the unique list becomes its numeric index.

isbn_list = books_ratings['isbn'].unique()
print("length of isbn list: ", len(isbn_list))

length of isbn list:  257832


In [12]:
isbn_list

array(['195153448', '2005018', '60973129', ..., '1561709085', '312180640',
       '8874960018'], dtype=object)

In [13]:
def isbn_numeric_id(isbn):
    itemindex = np.where(isbn_list==isbn)
    return itemindex[0][0]

In [14]:
books_ratings.shape

(941148, 7)

### 7. Convert both user_id and ISBN to the ordered list, i.e., from 0...n-1

In [15]:
books_ratings['user_id_order'] = books_ratings['user_id'].apply(userid_numeric)

In [16]:
books_ratings['isbn_order'] = books_ratings['isbn'].apply(isbn_numeric_id)

In [17]:
books_ratings.head(20)

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,user_id,rating,user_id_order,isbn_order
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,2,0,0,0
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,8,5,1,1
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11400,0,2,1
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11676,8,3,1
4,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,41385,0,4,1
5,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,67544,8,5,1
6,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,85526,0,6,1
7,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,96054,0,7,1
8,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,116866,9,8,1
9,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,123629,9,9,1


In [18]:
books_ratings.shape

(941148, 9)
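The `np.where` lookup scans the whole unique list once per row, which can be slow on a frame of this size. As an alternative (not what the notebook uses), `pd.factorize` produces the same kind of 0...n-1 codes in order of first appearance in a single pass. A sketch on a toy frame with sample values:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [276725, 8, 276725, 11400],
    "isbn": ["034545104X", "2005018", "2005018", "034545104X"],
})

# factorize returns (codes, uniques); codes run 0..n-1 in order of first appearance.
toy["user_id_order"], user_index = pd.factorize(toy["user_id"])
toy["isbn_order"], isbn_index = pd.factorize(toy["isbn"])

print(toy)
```

`user_index` and `isbn_index` play the role of `user_id_list` and `isbn_list`: position `k` holds the original identifier for code `k`.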

### 8. Re-index the columns to build a matrix

In [19]:
ordered_cols = ['user_id_order', 'isbn_order', 'rating','book_title', 'book_author', 'year_of_publication','publisher',
               'user_id', 'isbn' ] 
books_ratings = books_ratings.reindex(columns =ordered_cols)

In [20]:
books_ratings.head()

Unnamed: 0,user_id_order,isbn_order,rating,book_title,book_author,year_of_publication,publisher,user_id,isbn
0,0,0,0,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,2,195153448
1,1,1,5,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,8,2005018
2,2,1,0,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11400,2005018
3,3,1,8,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,11676,2005018
4,4,1,0,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,41385,2005018


In [21]:
books_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941148 entries, 0 to 941147
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   user_id_order        941148 non-null  int64 
 1   isbn_order           941148 non-null  int64 
 2   rating               941148 non-null  int64 
 3   book_title           941148 non-null  object
 4   book_author          941146 non-null  object
 5   year_of_publication  941148 non-null  object
 6   publisher            941146 non-null  object
 7   user_id              941148 non-null  int64 
 8   isbn                 941148 non-null  object
dtypes: int64(4), object(5)
memory usage: 64.6+ MB


### 9. Split your data into two sets (training and testing)

In [25]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(books_ratings,
                                test_size=0.2,
                                random_state=10)

In [26]:
train.shape

(752918, 9)

In [27]:
test.head()

Unnamed: 0,user_id_order,isbn_order,rating,book_title,book_author,year_of_publication,publisher,user_id,isbn
939086,2674,255784,0,McSe Windows NT Server 4 for Dummies,Ken Majors,1998,John Wiley &amp; Sons Inc,240051,764504002
710360,511,105777,10,FoxTrot : En Masse,Bill Amend,1992,Andrews McMeel Publishing,101851,836218973
312354,1047,15728,0,"Valley of the Horses (Auel, Jean M. , Earth's ...",Jean M. Auel,1983,Bantam Doubleday Dell,44845,553234811
510868,1301,45954,0,Dreams in the Key of Blue,John Philpin,2000,Bantam Books,73394,055358006X
266030,3071,11538,5,Guilty Pleasures (Anita Blake Vampire Hunter (...,Laurell K. Hamilton,1995,Jove Books,139827,051513449X


In [28]:
test.shape

(188230, 9)

### 10. Make predictions based on user and item variables.

In [35]:
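# isbn_order ranges over every unique ISBN and user_id_order over every unique user,
# so the matrices are sized from isbn_list and user_id_list. Sizing axis 1 with
# n_books (unique titles, 230238) raised:
#   IndexError: index 232517 is out of bounds for axis 1 with size 230238
# Both order columns are already 0-based, so no -1 offset is applied.
# Note: a dense float64 matrix of this shape is very large; a sparse matrix may be needed.
n_users = len(user_id_list)
n_isbns = len(isbn_list)

train_matrix = np.zeros((n_users, n_isbns))
for line in train.itertuples():
    # line[0] is the DataFrame index; line[1], line[2], line[3] are user_id_order, isbn_order, rating
    train_matrix[line[1], line[2]] = line[3]

test_matrix = np.zeros((n_users, n_isbns))
for line in test.itertuples():
    test_matrix[line[1], line[2]] = line[3]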
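The remaining step in the task list is to evaluate predictions with RMSE. The sketch below fills a small dense matrix with vectorised indexing instead of a row loop and scores a simple baseline. The toy frames, the helpers `to_matrix` and `rmse`, and the mean-rating baseline are all assumptions. Restricting RMSE to nonzero test entries treats a 0 as "no explicit rating", which is one possible convention rather than a requirement of the project.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with 0-based user_id_order / isbn_order (values invented).
toy_train = pd.DataFrame({"user_id_order": [0, 0, 1, 2],
                          "isbn_order": [0, 1, 1, 2],
                          "rating": [8, 5, 7, 10]})
toy_test = pd.DataFrame({"user_id_order": [1, 2],
                         "isbn_order": [0, 1],
                         "rating": [6, 9]})
n_u, n_i = 3, 3

def to_matrix(df, shape):
    """Dense user x item matrix filled from the 0-based index columns."""
    m = np.zeros(shape)
    rows = df["user_id_order"].to_numpy()
    cols = df["isbn_order"].to_numpy()
    m[rows, cols] = df["rating"].to_numpy()
    return m

train_m = to_matrix(toy_train, (n_u, n_i))
test_m = to_matrix(toy_test, (n_u, n_i))

def rmse(prediction, ground_truth):
    """RMSE over the cells that hold a nonzero rating in ground_truth."""
    mask = ground_truth.nonzero()
    diff = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Baseline: predict the mean of the nonzero training ratings for every cell.
baseline = np.full((n_u, n_i), train_m[train_m > 0].mean())
print(rmse(baseline, test_m))
```

A user-based or item-based prediction matrix of the same shape can be passed to `rmse` in place of `baseline` to compare the two approaches on the same test cells.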