**About Book Crossing Dataset**<br>

This dataset has been compiled by Cai-Nicolas Ziegler in 2004, and it comprises of three tables for users, books and ratings. Explicit ratings are expressed on a scale from 1-10 (higher values denoting higher appreciation) and implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
import pandas as pd

In [2]:
#Loading data
b = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
b.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

u = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
u.columns = ['userID', 'Location', 'Age']

r = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
r.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
  interactivity=interactivity, compiler=compiler, result=result)


### Check no.of records and features given in each dataset

In [3]:
print(b.shape)
b.info()

(271360, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


In [4]:
print(u.shape)
u.info()

(278858, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [5]:
print(r.shape)
r.info()

(1149780, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


## Exploring books dataset

In [6]:
b.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [7]:
b.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'], axis=1,inplace=True)

In [8]:
b.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [9]:
b['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As it can be seen from above that there are some incorrect entries in this field. It looks like Publisher names 'DK Publishing Inc' and 'Gallimard' have been incorrectly loaded as yearOfPublication in dataset due to some errors in csv file.


Also some of the entries are strings and same years have been entered as numbers in some places. We will try to fix these things in the coming questions.

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [10]:
b[b['yearOfPublication']=='DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [11]:
b = b[~b['yearOfPublication'].isin(['DK Publishing Inc','Gallimard'])]
#books = books[(books.yearOfPublication!='DK Publishing Inc') & (books.yearOfPublication!='Gallimard')]

In [12]:
b['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

### Change the datatype of yearOfPublication to 'int'

In [13]:
b['yearOfPublication'] = b['yearOfPublication'].astype(int)  

In [14]:
b.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [15]:
b['publisher'] = b['publisher'].dropna()

## Exploring Users dataset

In [16]:
print(u.shape)
u.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [17]:
age = u['Age'].unique()
print (sorted(age))

[nan, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 110.0, 111.0, 113.0, 114.0, 115.0, 116.0, 118.0, 119.0, 123.0, 124.0, 127.0, 128.0, 132.0, 133.0, 136.0, 137.0, 138.0, 140.0, 141.0, 143.0, 146.0, 147.0, 148.0, 151.0, 152.0, 156.0, 157.0, 159.0, 162.0, 168.0, 172.0, 175.0, 183.0, 186.0, 189.0, 199.0, 200.0, 201.0, 204.0, 207.0, 208.0, 209.0, 210.0, 212.0, 219.0, 220.0, 223.0, 226.0

Age column has some invalid entries like nan, 0 and very high values like 100 and above

### Values below 5 and above 90 do not make much sense for our book rating case...hence replace these by NaNs

In [18]:
import numpy as np
u.mask((u['Age'] <5.0) | (u['Age']>90.0), inplace=True)
#users.loc[(users['Age']<5.0) | (users['Age']>90.0),'Age'] = np.NaN

In [19]:
u['Age'].unique()

array([nan, 18., 17., 61., 26., 14., 25., 19., 46., 55., 32., 24., 20.,
       34., 23., 51., 31., 21., 44., 30., 57., 43., 37., 41., 54., 42.,
       50., 39., 53., 47., 36., 28., 35., 13., 58., 49., 38., 45., 62.,
       63., 27., 33., 29., 66., 40., 15., 60., 79., 22., 16., 65., 59.,
       48., 72., 56., 67., 80., 52., 69., 71., 73., 78.,  9., 64., 12.,
       74., 75., 76., 83., 68., 11., 77., 70.,  8.,  7., 81., 10.,  5.,
        6., 84., 82., 90., 85., 86., 87., 89., 88.])

### Replace null values in column `Age` with mean

In [20]:
u['Age'] = u['Age'].fillna(u['Age'].mean())

### Change the datatype of `Age` to `int`

In [21]:
u['Age'] = u['Age'].astype(int)  

In [22]:
print(sorted(u.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### check the shape

In [23]:
r.shape

(1149780, 3)

In [24]:
n_users = u.shape[0]
n_books = b.shape[0]

In [25]:
r.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Ratings dataset should have books only which exist in our books dataset. Drop the remaining rows

In [26]:
r = r[r['ISBN'].isin(b['ISBN'])]

In [27]:
r.shape

(1031132, 3)

### Ratings dataset should have ratings from users which exist in users dataset. Drop the remaining rows

In [28]:
r = r[r['userID'].isin(u['userID'])]

In [29]:
r.shape

(1026153, 3)

### Consider only ratings from 1-10 and leave 0s in column `bookRating`

In [30]:
r.bookRating.unique()

array([ 0,  5,  3,  6,  7,  9,  8, 10,  1,  4,  2], dtype=int64)

In [31]:
r = r[r.bookRating!=0]

### Find out which rating has been given highest number of times

In [32]:
r.bookRating.value_counts()

8     91365
10    70963
7     66101
9     60499
5     45154
6     31551
4      7576
3      5082
2      2360
1      1465
Name: bookRating, dtype: int64

Rating 8 is given the highest number of times

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated atleast 100 books

In [33]:
user_counts=r['userID'].value_counts()
r_df=r[r['userID'].isin(user_counts[user_counts >=100].index)]

In [34]:
r_df.shape

(102977, 3)

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [35]:
r_df['bookRating'].replace(np.nan,0,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


### Generate the predicted ratings using SVD with no.of singular values to be 50

In [36]:
# spliting entire dataset into two parts 80-20 split
from sklearn.model_selection import train_test_split

trainDF, tempDF = train_test_split(r_df, test_size = 0.2, random_state = 100)

In [37]:
tempDF.head()

Unnamed: 0,userID,ISBN,bookRating
449168,107784,373240848,5
973123,234828,345389921,7
1048098,250709,312959974,5
652767,158226,1563895196,7
334598,79441,805009329,8


In [38]:
testDF = tempDF.copy()
tempDF.bookRating = np.nan

r_df = pd.concat([trainDF, tempDF]).reset_index()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [39]:
r_df.head()

Unnamed: 0,index,userID,ISBN,bookRating
0,289107,69078,684842696,9.0
1,940822,227705,425100057,8.0
2,720986,174304,920668364,9.0
3,915009,223087,375503862,6.0
4,229442,52917,743219600,6.0


In [40]:
ratings_matrix = r_df.pivot(index = 'userID', columns = 'ISBN', values = 'bookRating').fillna(0)

In [41]:
ratings_matrix.tail()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
274061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
274301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(ratings_matrix, k = 50)
# Strength matrix having latent feature x Latent feature

print(sigma)

[131.53259802 131.86432012 132.62194658 135.71355716 135.74998601
 137.42710262 137.61458589 139.11623808 140.53671845 141.52934212
 143.09431351 143.42518179 143.79823063 145.72180315 148.48930437
 148.54047864 150.03740151 151.5816514  151.8364972  154.93920186
 156.44999175 158.01194694 158.86445587 161.61599517 164.02690612
 165.40832206 166.89550756 167.96299285 170.7193613  171.76113272
 177.39135505 178.01908362 181.20302677 181.26816142 184.67957506
 185.91919256 190.74553062 193.76516402 199.9624407  206.11999775
 212.17433621 219.45118895 222.43615126 231.53427002 237.20302076
 254.50332341 262.57262496 340.63353848 568.22210612 605.4619755 ]


In [43]:
# converting sigma into a diagonal matrix 

sigma = np.diag(sigma)

In [44]:
# Predictions are a dot product of (U, sigma), Vt, Three matrix 

all_users_predicted_ratings = np.dot(np.dot(U, sigma), Vt)

In [45]:
preds_df = pd.DataFrame(all_users_predicted_ratings, columns = ratings_matrix.columns)
preds_df

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.019765,-0.014054,-0.009370,-0.014054,0.0,0.006325,-0.017657,0.000147,0.000147,0.0,...,0.0,0.001989,0.008221,-0.020973,0.0,0.002831,0.028006,4.460485e-04,0.002776,0.070793
1,-0.004391,-0.001979,-0.001319,-0.001979,0.0,0.000722,0.002601,-0.002006,-0.002006,0.0,...,0.0,0.000214,-0.000040,0.002929,0.0,0.000894,0.000423,1.227725e-04,0.000715,-0.007688
2,-0.003625,-0.007168,-0.004779,-0.007168,0.0,0.002748,-0.004516,-0.003009,-0.003009,0.0,...,0.0,0.000913,0.016504,0.006033,0.0,0.003161,0.004011,-3.666755e-04,0.003067,-0.039418
3,-0.015641,0.002793,0.001862,0.002793,0.0,0.017471,-0.013903,-0.002316,-0.002316,0.0,...,0.0,0.010260,0.030610,-0.002084,0.0,0.020033,0.001327,2.268140e-03,0.024792,-0.040766
4,0.004365,-0.000885,-0.000590,-0.000885,0.0,-0.003280,0.008802,-0.001147,-0.001147,0.0,...,0.0,-0.000307,-0.023819,0.003060,0.0,-0.000893,-0.006937,6.242962e-04,0.000842,0.063320
5,0.004346,0.027710,0.018474,0.027710,0.0,0.003552,0.033671,-0.002140,-0.002140,0.0,...,0.0,0.001982,0.007801,-0.008640,0.0,0.004679,0.004500,1.024502e-03,0.003752,0.045722
6,-0.012028,0.021382,0.014255,0.021382,0.0,0.005509,0.016232,0.001833,0.001833,0.0,...,0.0,0.003280,0.002375,0.010341,0.0,0.006280,0.001432,1.291474e-03,0.008509,0.022697
7,0.005440,0.008189,0.005459,0.008189,0.0,0.017356,0.002327,0.024070,0.024070,0.0,...,0.0,0.006595,0.023793,0.008097,0.0,0.007661,0.024945,-6.727702e-04,0.013123,-0.033706
8,0.040404,-0.024775,-0.016517,-0.024775,0.0,0.014680,-0.025952,-0.014389,-0.014389,0.0,...,0.0,0.005287,0.022584,0.006779,0.0,0.010189,0.008991,1.252485e-03,0.013504,-0.003636
9,0.038780,-0.003638,-0.002425,-0.003638,0.0,0.035772,-0.015877,-0.027625,-0.027625,0.0,...,0.0,0.014306,0.068642,-0.004761,0.0,0.031868,0.016475,2.397238e-03,0.040687,0.025295


### Take a particular user_id

### Lets find the recommendations for user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [46]:
userID = 2110

In [47]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order

In [48]:
sorted_user_predictions = preds_df.loc[user_id].sort_values(ascending = False)
sorted_user_predictions

ISBN
059035342X    0.654127
044021145X    0.473488
0441003435    0.417661
0440211727    0.413928
0345350499    0.406484
0345354931    0.404001
0345368959    0.403138
0345318862    0.402560
043936213X    0.399582
0345370775    0.398111
051511605X    0.392267
0345322231    0.385572
0440213525    0.373708
0345318854    0.372685
0345384911    0.372223
006016848X    0.371408
0316666343    0.368452
069620780X    0.367877
0345313151    0.367766
0385504209    0.367385
0345322215    0.366318
0345313097    0.365528
0812523016    0.365465
1560768304    0.365150
0812517725    0.364035
0380759470    0.362970
0345383273    0.362350
0812551478    0.360556
0812548094    0.360353
0886773776    0.359660
                ...   
0374423075   -0.061898
0670852341   -0.061998
0515136530   -0.062008
0440235502   -0.062045
0486275426   -0.063856
0395177111   -0.065355
0140501800   -0.066191
0446674249   -0.066453
0679731180   -0.067418
0142000809   -0.069186
0451198808   -0.069601
0060953225   -0.071336
038549

### Create a dataframe with name `user_data` containing userID `2110` explicitly interacted books

In [49]:
user_data = r_df[r_df.userID == 2110]
user_data.head()

Unnamed: 0,index,userID,ISBN,bookRating
652,14463,2110,345317580,10.0
2324,14606,2110,1565111575,10.0
3787,14576,2110,679805265,10.0
4506,14462,2110,345314255,10.0
5558,14582,2110,743486625,10.0


In [50]:
user_data.head()

Unnamed: 0,index,userID,ISBN,bookRating
652,14463,2110,345317580,10.0
2324,14606,2110,1565111575,10.0
3787,14576,2110,679805265,10.0
4506,14462,2110,345314255,10.0
5558,14582,2110,743486625,10.0


In [51]:
user_data.shape

(103, 4)

### Combine the user_data and and corresponding book data(`book_data`) in a single dataframe with name `user_full_info`

In [52]:
user_full_info = (user_data.merge(b, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                sort_values(['bookRating'], ascending = False)
                )
user_full_info.head()

Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,14463,2110,345317580,10.0,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
19,14553,2110,590629808,10.0,"The Message (Animorphs , No 4)",K. A. Applegate,1996,Scholastic
36,14508,2110,439240700,10.0,"The Power of Two (T*Witches, No 1)",H. B. Gilmour,2001,Apple
37,14464,2110,345335287,10.0,The Black Unicorn (Magic Kingdom of Landover N...,Terry Brooks,1990,Del Rey Books
39,14507,2110,439222303,10.0,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [53]:
# read above first
def recommend_books(predictions_df, userID, books_df, original_ratings_df, num_recommendations = False):
    user_row_number = userID  #UserID starts at zero not 1
    sorted_user_predictions = predictions_df.loc[user_row_number].sort_values(ascending = False)
    
    user_data = original_ratings_df[original_ratings_df.userID == (userID)]
    user_full = (user_data.merge(books_df, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                sort_values(['bookRating'], ascending = False)
                )
    
    # finding the predictions by comparing original data with predictions outputs 
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').
                      rename(columns = {user_row_number: 'Predictions'}).
                       
                       #sort the predictions 
                      sort_values('Predictions', ascending = False).
                      iloc[:num_recommendations, :-1])
    return user_full, recommendations, sorted_user_predictions, user_data, user_full


In [54]:
already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_books(preds_df,user_id, b, r_df, 10)

In [55]:
predictions

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
2143,059035342X,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
456,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
16588,0441003435,The Adept (Adept),Katherine Kurtz,2003,Ace Books
953,0440211727,A Time to Kill,JOHN GRISHAM,1992,Dell
2122,0345350499,The Mists of Avalon,MARION ZIMMER BRADLEY,1987,Del Rey
174,0345354931,Night Mare (Xanth Novels (Paperback)),Piers Anthony,1990,Del Rey Books
16356,0345368959,The Dolphins of Pern (Dragonriders of Pern (Pa...,Anne McCaffrey,1995,Del Rey Books
20773,0345318862,Golem in the Gears (Xanth Novels (Paperback)),PIERS ANTHONY,1986,Del Rey
9026,043936213X,Harry Potter and the Sorcerer's Stone (Book 1),J. K. Rowling,2001,Scholastic
1195,0345370775,Jurassic Park,Michael Crichton,1999,Ballantine Books
