**About Book Crossing Dataset**<br>

This dataset has been compiled by Cai-Nicolas Ziegler in 2004, and it comprises of three tables for users, books and ratings. Explicit ratings are expressed on a scale from 1-10 (higher values denoting higher appreciation) and implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
import numpy as np
import pandas as pd
#Loading data
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']
books.head(5)
users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']
ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
  interactivity=interactivity, compiler=compiler, result=result)


### Check no.of records and features given in each dataset

## Exploring books dataset

In [2]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [3]:
books_df = pd.DataFrame(books)
#books_df.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'], axis=1)
#books_df.drop(books_df.index[[5,7]])
books_df.drop(columns=['imageUrlS', 'imageUrlM', 'imageUrlL'], axis = 1, inplace = True)
books_df.head(2)

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


**yearOfPublication**

### Check unique values of yearOfPublication


In [4]:
books_df.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As it can be seen from above that there are some incorrect entries in this field. It looks like Publisher names 'DK Publishing Inc' and 'Gallimard' have been incorrectly loaded as yearOfPublication in dataset due to some errors in csv file.


Also some of the entries are strings and same years have been entered as numbers in some places. We will try to fix these things in the coming questions.

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [5]:
yrofpub = books_df[books_df.yearOfPublication == 'DK Publishing Inc']
yrofpub

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [6]:
#books_df.ix[~(books_df['yearOfPublication'] == 'DK Publishing Inc') & (books_df['yearOfPublication'] == 'Gallimard')]
# indexNames = books_df[ (books_df['yearOfPublication'] == 'DK Publishing Inc') & (books_df['yearOfPublication'] == 'Gallimard')].index
# books_df.drop(indexNames , inplace=True)
# indexNames
# books_df[books_df.yearOfPublication == 'DK Publishing Inc']
books_df = books_df[~(books_df['yearOfPublication'].isin(["DK Publishing Inc", "Gallimard"]))]
#books_df.drop(books_df.yearOfPublication.isin(["DK Publishing Inc", "Gallimard"]).index, inplace = True)
#books_df = books_df[ (books_df['yearOfPublication'] != 'DK Publishing Inc') & (books_df['yearOfPublication'] != 'Gallimard')].index
#books_df[books_df.yearOfPublication == 'DK Publishing Inc']


In [7]:
#books_df[~books_df['yearOfPublication'].isin(['DK Publishing Inc', 'Gallimard'])]

In [8]:
books_df['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

### Change the datatype of yearOfPublication to 'int'

In [9]:
books_df['yearOfPublication'] = books_df['yearOfPublication'].astype(int)
#books_df['yearOfPublication'].astype(int)
# books_df['yearOfPublication'].astype(int)

In [10]:
books_df.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [11]:
books_df.dropna(subset=['publisher'])
books_df.isna()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


## Exploring Users dataset

In [12]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [13]:
asc_age = users['Age'].unique()
asc_age.sort()
print (asc_age)

[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
  28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
  42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
  70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
 113. 114. 115. 116. 118. 119. 123. 124. 127. 128. 132. 133. 136. 137.
 138. 140. 141. 143. 146. 147. 148. 151. 152. 156. 157. 159. 162. 168.
 172. 175. 183. 186. 189. 199. 200. 201. 204. 207. 208. 209. 210. 212.
 219. 220. 223. 226. 228. 229. 230. 231. 237. 239. 244.  nan]


Age column has some invalid entries like nan, 0 and very high values like 100 and above

### Values below 5 and above 90 do not make much sense for our book rating case...hence replace these by NaNs

In [14]:
from numpy import NaN
users_df = pd.DataFrame(users)
users_df = users_df.replace(users_df['Age'] < 5, NaN) 
users_df = users_df.replace(users_df['Age'] > 90, NaN)
users_df['Age']
#users.loc[users['Age'] < 5 & users['Age'] > 90, 'Age'] = Nan
#users_df = users_df.fillna([users['Age'] < 5 & users['Age'] > 90, 'Age'])


0          NaN
1         18.0
2          NaN
3         17.0
4          NaN
5         61.0
6          NaN
7          NaN
8          NaN
9         26.0
10        14.0
11         NaN
12        26.0
13         NaN
14         NaN
15         NaN
16         NaN
17        25.0
18        14.0
19        19.0
20        46.0
21         NaN
22         NaN
23        19.0
24        55.0
25         NaN
26        32.0
27        24.0
28        19.0
29        24.0
          ... 
278828     NaN
278829    28.0
278830     NaN
278831    62.0
278832    25.0
278833     NaN
278834    18.0
278835    47.0
278836     NaN
278837    15.0
278838     NaN
278839    45.0
278840     NaN
278841     NaN
278842    28.0
278843    28.0
278844     NaN
278845    23.0
278846     NaN
278847     NaN
278848    23.0
278849     NaN
278850    33.0
278851    32.0
278852    17.0
278853     NaN
278854    50.0
278855     NaN
278856     NaN
278857     NaN
Name: Age, Length: 278858, dtype: float64

### Replace null values in column `Age` with mean

In [15]:
users_df['Age'].fillna((users_df['Age'].mean()), inplace=True)
users_df['Age']

0         34.751434
1         18.000000
2         34.751434
3         17.000000
4         34.751434
5         61.000000
6         34.751434
7         34.751434
8         34.751434
9         26.000000
10        14.000000
11        34.751434
12        26.000000
13        34.751434
14        34.751434
15        34.751434
16        34.751434
17        25.000000
18        14.000000
19        19.000000
20        46.000000
21        34.751434
22        34.751434
23        19.000000
24        55.000000
25        34.751434
26        32.000000
27        24.000000
28        19.000000
29        24.000000
            ...    
278828    34.751434
278829    28.000000
278830    34.751434
278831    62.000000
278832    25.000000
278833    34.751434
278834    18.000000
278835    47.000000
278836    34.751434
278837    15.000000
278838    34.751434
278839    45.000000
278840    34.751434
278841    34.751434
278842    28.000000
278843    28.000000
278844    34.751434
278845    23.000000
278846    34.751434


### Change the datatype of `Age` to `int`

In [16]:
users_df['Age'] = users_df['Age'].astype(np.int64)
users_df.dtypes

userID       int64
Location    object
Age          int64
dtype: object

In [17]:
print(sorted(users_df.Age.unique()))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 113, 114, 115, 116, 118, 119, 123, 124, 127, 128, 132, 133, 136, 137, 138, 140, 141, 143, 146, 147, 148, 151, 152, 156, 157, 159, 162, 168, 172, 175, 183, 186, 189, 199, 200, 201, 204, 207, 208, 209, 210, 212, 219, 220, 223, 226, 228, 229, 230, 231, 237, 239, 244]


## Exploring the Ratings Dataset

### check the shape

In [18]:
ratings.shape

(1149780, 3)

In [19]:
n_users = users.shape[0]
n_books = books.shape[0]

In [20]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Ratings dataset should have books only which exist in our books dataset. Drop the remaining rows

In [21]:
#ratings_df = ratings['ISBN']==books_df['ISBN']
merged_bk_rt = pd.merge(books_df, ratings, on=['ISBN'], how='outer')
merged_bk_rt.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,userID,bookRating
0,195153448,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,2.0,0.0
1,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,8.0,5.0
2,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,11400.0,0.0
3,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,11676.0,8.0
4,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,41385.0,0.0


### Ratings dataset should have ratings from users which exist in users dataset. Drop the remaining rows

In [22]:
merged_usr_rt = pd.merge(users_df, ratings, on=['userID'], how='outer')
merged_usr_rt.head()

Unnamed: 0,userID,Location,Age,ISBN,bookRating
0,1,"nyc, new york, usa",34,,
1,2,"stockton, california, usa",18,195153448.0,0.0
2,3,"moscow, yukon territory, russia",34,,
3,4,"porto, v.n.gaia, portugal",17,,
4,5,"farnborough, hants, united kingdom",34,,


### Consider only ratings from 1-10 and leave 0s in column `bookRating`

In [28]:
merged_bk_rt = merged_bk_rt[merged_bk_rt['bookRating']!=0.0]
merged_bk_rt

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,userID,bookRating
1,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,8.0,5.0
3,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,11676.0,8.0
5,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,67544.0,8.0
8,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,116866.0,9.0
9,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,123629.0,9.0
11,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,200273.0,8.0
12,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,210926.0,9.0
13,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,219008.0,7.0
14,0002005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,263325.0,6.0
16,0060973129,Decision in Normandy,Carlo D'Este,1991.0,HarperPerennial,2954.0,8.0


### Find out which rating has been given highest number of times

In [30]:
merged_bk_rt['bookRating'].mode()
# User rating of 8.0 has the highest frequency

0    8.0
dtype: float64

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated atleast 100 books

In [36]:
user_list = merged_bk_rt['userID'].value_counts()
d = {'userID': user_list.index, 'nbrRated': user_list.values}
df_user = pd.DataFrame(d)

In [37]:
df_user.shape

(77805, 2)

In [39]:
df_user = df_user[df_user['nbrRated'] > 99] # Keep only users who have rated  > 100 books

In [40]:
df_user.shape

(495, 2)

In [42]:
eda_ratings = merged_bk_rt[merged_bk_rt['userID'].isin(df_user['userID'])]

In [45]:
eda_ratings.shape

(117645, 7)

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [47]:
eda_ratings_pivot_df = eda_ratings.pivot(index = 'userID', columns ='ISBN', values = 'bookRating')

In [48]:
eda_ratings_pivot_df.head()

ISBN,0375404120,9022906116,0*708880258,0.330241664,0000000000,00000000000,0000000000000,0000000029841,0000000051,0000018030,...,O67174142X,O9088446X,Q380708353,SBN67001026X,UNGRANDHOMMED,X000000000,ZR903CX0003,"\0432534220\""""","\2842053052\""""",Ô½crosoft
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2033.0,,,,,,,,,,,...,,,,,,,,,,
2110.0,,,,,,,,,,,...,,,,,,,,,,
2276.0,,,,,,,,,,,...,,,,,,,,,,
3757.0,,,,,,,,,,,...,,,,,,,,,,
4017.0,,,,,,,,,,,...,,,,,,,,,,


In [49]:
eda_ratings_pivot_df.replace(np.NaN,0,inplace = True)

In [50]:
eda_ratings_pivot_df.head()

ISBN,0375404120,9022906116,0*708880258,0.330241664,0000000000,00000000000,0000000000000,0000000029841,0000000051,0000018030,...,O67174142X,O9088446X,Q380708353,SBN67001026X,UNGRANDHOMMED,X000000000,ZR903CX0003,"\0432534220\""""","\2842053052\""""",Ô½crosoft
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2033.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3757.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [51]:
eda_ratings_pivot_matrix = eda_ratings_pivot_df.as_matrix()

  """Entry point for launching an IPython kernel.


In [52]:
eda_ratings_mean = np.mean(eda_ratings_pivot_matrix, axis = 1)
eda_ratings_demeaned = eda_ratings_pivot_matrix - eda_ratings_mean.reshape(-1, 1)

In [53]:
eda_ratings_demeaned.shape

(495, 78487)

In [54]:
user_id = list(eda_ratings_pivot_df.index)

### Generate the predicted ratings using SVD with no.of singular values to be 50

In [55]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(eda_ratings_demeaned, k = 50)

In [56]:
U.shape

(495, 50)

In [57]:
sigma = np.diag(sigma)
sigma.shape

(50, 50)

In [58]:
Vt.shape

(50, 78487)

In [59]:
pred_ratings_all_matrix = np.dot(np.dot(U, sigma), Vt) + eda_ratings_mean.reshape(-1, 1)

In [60]:
pred_ratings_df = pd.DataFrame(pred_ratings_all_matrix,columns = eda_ratings_pivot_df.columns, index = eda_ratings_pivot_df.index)

In [61]:
pred_ratings_df['userID'] = pred_ratings_df.index

### Take a particular user_id

### Lets find the recommendations for user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [24]:
userID = 2110

In [25]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order

In [62]:
predicted_rating = pred_ratings_df[pred_ratings_df['userID'] == userID].transpose()

In [63]:
predicted_rating = predicted_rating.sort_values([2110], ascending = False)
predicted_rating

userID,2110.0
ISBN,Unnamed: 1_level_1
userID,2110.000000
059035342X,0.493361
0345384911,0.291787
044021145X,0.288819
0451151259,0.275690
0380759497,0.249352
0439064872,0.235509
0345370775,0.225814
0880389117,0.220355
0345335287,0.215857


### Create a dataframe with name `user_data` containing userID `2110` explicitly interacted books

In [64]:
user_data = eda_ratings[eda_ratings['userID'] == 2110]

In [65]:
user_data.replace(0,np.NaN, inplace=True)#changing 0 ratings to NaN so that these can be dropped from the dataset

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  method=method)


In [66]:
user_data.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [67]:
user_data.shape

(103, 7)

In [None]:
user_data.shape

### Combine the user_data and and corresponding book data(`book_data`) in a single dataframe with name `user_full_info`

In [68]:
book_data = books[books['ISBN'].isin(user_data['ISBN'])]

In [69]:
book_data.shape

(103, 8)

In [70]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
246,0151008116,Life of Pi,Yann Martel,2002,Harcourt,http://images.amazon.com/images/P/0151008116.0...,http://images.amazon.com/images/P/0151008116.0...,http://images.amazon.com/images/P/0151008116.0...
904,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books,http://images.amazon.com/images/P/015216250X.0...,http://images.amazon.com/images/P/015216250X.0...,http://images.amazon.com/images/P/015216250X.0...
1000,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy,http://images.amazon.com/images/P/0064472779.0...,http://images.amazon.com/images/P/0064472779.0...,http://images.amazon.com/images/P/0064472779.0...
1302,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books,http://images.amazon.com/images/P/0345307674.0...,http://images.amazon.com/images/P/0345307674.0...,http://images.amazon.com/images/P/0345307674.0...
1472,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket,http://images.amazon.com/images/P/0671527215.0...,http://images.amazon.com/images/P/0671527215.0...,http://images.amazon.com/images/P/0671527215.0...


In [71]:
user_full_info = pd.merge(user_data, book_data, on="ISBN", how="left") 

In [72]:
user_full_info.head()

Unnamed: 0,ISBN,bookTitle_x,bookAuthor_x,yearOfPublication_x,publisher_x,userID,bookRating,bookTitle_y,bookAuthor_y,yearOfPublication_y,publisher_y,imageUrlS,imageUrlM,imageUrlL
0,0151008116,Life of Pi,Yann Martel,2002.0,Harcourt,2110.0,5.0,Life of Pi,Yann Martel,2002,Harcourt,http://images.amazon.com/images/P/0151008116.0...,http://images.amazon.com/images/P/0151008116.0...,http://images.amazon.com/images/P/0151008116.0...
1,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001.0,Magic Carpet Books,2110.0,8.0,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books,http://images.amazon.com/images/P/015216250X.0...,http://images.amazon.com/images/P/015216250X.0...,http://images.amazon.com/images/P/015216250X.0...
2,0064472779,All-American Girl,Meg Cabot,2003.0,HarperTrophy,2110.0,8.0,All-American Girl,Meg Cabot,2003,HarperTrophy,http://images.amazon.com/images/P/0064472779.0...,http://images.amazon.com/images/P/0064472779.0...,http://images.amazon.com/images/P/0064472779.0...
3,0345307674,Return of the Jedi (Star Wars),James Kahn,1983.0,Del Rey Books,2110.0,10.0,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books,http://images.amazon.com/images/P/0345307674.0...,http://images.amazon.com/images/P/0345307674.0...,http://images.amazon.com/images/P/0345307674.0...
4,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984.0,Pocket,2110.0,9.0,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket,http://images.amazon.com/images/P/0671527215.0...,http://images.amazon.com/images/P/0671527215.0...,http://images.amazon.com/images/P/0671527215.0...


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [73]:
recomm_rating = predicted_rating[~predicted_rating.index.isin(user_full_info['ISBN']) ]
recomm_rating = recomm_rating[1:]
recomm_rating = recomm_rating.nlargest(10,[2110])
recomm_rating = pd.merge(recomm_rating,books, on = "ISBN", how="left") 
recomm_rating

Unnamed: 0,ISBN,2110.0,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0345384911,0.291787,Crystal Line,Anne McCaffrey,1993,Del Rey Books,http://images.amazon.com/images/P/0345384911.0...,http://images.amazon.com/images/P/0345384911.0...,http://images.amazon.com/images/P/0345384911.0...
1,044021145X,0.288819,The Firm,John Grisham,1992,Bantam Dell Publishing Group,http://images.amazon.com/images/P/044021145X.0...,http://images.amazon.com/images/P/044021145X.0...,http://images.amazon.com/images/P/044021145X.0...
2,0451151259,0.27569,Eyes of the Dragon,Stephen King,1988,Penguin Putnam~mass,http://images.amazon.com/images/P/0451151259.0...,http://images.amazon.com/images/P/0451151259.0...,http://images.amazon.com/images/P/0451151259.0...
3,0380759497,0.249352,Xanth 15: The Color of Her Panties,Piers Anthony,1992,Eos,http://images.amazon.com/images/P/0380759497.0...,http://images.amazon.com/images/P/0380759497.0...,http://images.amazon.com/images/P/0380759497.0...
4,0439064872,0.235509,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic,http://images.amazon.com/images/P/0439064872.0...,http://images.amazon.com/images/P/0439064872.0...,http://images.amazon.com/images/P/0439064872.0...
5,0345370775,0.225814,Jurassic Park,Michael Crichton,1999,Ballantine Books,http://images.amazon.com/images/P/0345370775.0...,http://images.amazon.com/images/P/0345370775.0...,http://images.amazon.com/images/P/0345370775.0...
6,0880389117,0.220355,Flint the King (Dragonlance: Preludes),Mary Kirchoff,1990,Wizards of the Coast,http://images.amazon.com/images/P/0880389117.0...,http://images.amazon.com/images/P/0880389117.0...,http://images.amazon.com/images/P/0880389117.0...
7,1560768304,0.214248,"The Dragons of Krynn (Dragonlance Dragons, Vol...",Margaret Weis,1994,Wizards of the Coast,http://images.amazon.com/images/P/1560768304.0...,http://images.amazon.com/images/P/1560768304.0...,http://images.amazon.com/images/P/1560768304.0...
8,0441845630,0.214248,Unicorn Point (Apprentice Adept (Paperback)),Piers Anthony,1990,ACE Charter,http://images.amazon.com/images/P/0441845630.0...,http://images.amazon.com/images/P/0441845630.0...,http://images.amazon.com/images/P/0441845630.0...
9,0618002235,0.209075,"The Two Towers (The Lord of the Rings, Part 2)",J. R. R. Tolkien,1999,Houghton Mifflin Company,http://images.amazon.com/images/P/0618002235.0...,http://images.amazon.com/images/P/0618002235.0...,http://images.amazon.com/images/P/0618002235.0...


In [74]:
print('The top 10 recommendations for user 2110 are \n' , recomm_rating)

The top 10 recommendations for user 2110 are 
          ISBN    2110.0                                          bookTitle  \
0  0345384911  0.291787                                       Crystal Line   
1  044021145X  0.288819                                           The Firm   
2  0451151259  0.275690                                 Eyes of the Dragon   
3  0380759497  0.249352                 Xanth 15: The Color of Her Panties   
4  0439064872  0.235509   Harry Potter and the Chamber of Secrets (Book 2)   
5  0345370775  0.225814                                      Jurassic Park   
6  0880389117  0.220355             Flint the King (Dragonlance: Preludes)   
7  1560768304  0.214248  The Dragons of Krynn (Dragonlance Dragons, Vol...   
8  0441845630  0.214248       Unicorn Point (Apprentice Adept (Paperback))   
9  0618002235  0.209075     The Two Towers (The Lord of the Rings, Part 2)   

         bookAuthor yearOfPublication                     publisher  \
0    Anne McCaffrey    