# Book-Crossing: User review ratings
A collection of book ratings

In [4]:
import numpy as np
import pandas as pd

head = ['ISBN', 'book_title' ,'book_author','year_of_publication', 'publisher', 'img_s', 'img_m', 'img_l']
df_ratings = pd.read_csv('data\\BX-Book-Ratings.csv',sep=';',encoding= 'unicode_escape')
df_books = pd.read_csv('data\\BX-Books.csv',encoding='unicode_escape',sep=';',skiprows=1,
                       names=head, low_memory=False)
df_users = pd.read_csv('data\\BX-Users.csv',sep=';',encoding= 'unicode_escape')


## Data Exploration

### Users

In [5]:
df_users.head(3)

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",


In [6]:
df_users.describe()

Unnamed: 0,User-ID,Age
count,278858.0,168096.0
mean,139429.5,34.751434
std,80499.51502,14.428097
min,1.0,0.0
25%,69715.25,24.0
50%,139429.5,32.0
75%,209143.75,44.0
max,278858.0,244.0


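There are 278858 users. Only 168096 of them report an age, and the reported values range from 0 to 244, so the Age column contains implausible entries.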

### Books

In [7]:
df_books.head(3)

Unnamed: 0,ISBN,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...


In [8]:
df_books['ISBN'].describe()

count         271379
unique        271379
top       2226133933
freq               1
Name: ISBN, dtype: object

In [9]:
df_books['book_title'].describe()

count             271379
unique            242154
top       Selected Poems
freq                  27
Name: book_title, dtype: object

There are 271379 books, and some share a title (242154 unique titles). The books have four features we'll exploit: _title, author, publisher_ and _year of publication_.

### Ratings

In [10]:
df_ratings.head(3)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0


In [11]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [12]:
#df_ratings = df_ratings.astype({'User-ID':object})

In [13]:
#rate = df_ratings.sort_values(by=['User-ID','ISBN']).reset_index(drop=True)
#rate

In [14]:
#pd.pivot_table(rate.iloc[:50],index=['User-ID'],columns='ISBN', values='Book-Rating')

In [15]:
df_ratings.ISBN.describe()

count        1149780
unique        340556
top       0971880107
freq            2502
Name: ISBN, dtype: object

There are more distinct books in the ratings dataset (340556) than in the books dataset (271379). We will need to restrict the ratings to ISBNs that appear in the books dataset.
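A minimal sketch of that restriction, shown on toy frames that reuse the notebook's column names. `keep_known_books` is a hypothetical helper, not something defined elsewhere in this notebook, and the toy values are illustrative only.

```python
import pandas as pd

def keep_known_books(ratings: pd.DataFrame, books: pd.DataFrame) -> pd.DataFrame:
    """Return only the ratings whose ISBN appears in the books table."""
    known = ratings['ISBN'].isin(books['ISBN'])
    return ratings.loc[known].reset_index(drop=True)

# Toy stand-ins for df_ratings and df_books (illustrative values).
toy_books = pd.DataFrame({'ISBN': ['0195153448', '0002005018']})
toy_ratings = pd.DataFrame({
    'User-ID': [1, 2, 3],
    'ISBN': ['0195153448', '034545104X', '0002005018'],
    'Book-Rating': [0, 5, 8],
})

filtered = keep_known_books(toy_ratings, toy_books)
print(filtered)
```

On the real data the call would presumably be `keep_known_books(df_ratings, df_books)`. An inner `pd.merge` on `ISBN` has the same filtering effect, which is what the merge later in the notebook relies on.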

In [16]:
df_ratings['Book-Rating'].value_counts()

0     716109
8     103736
10     78610
7      76457
9      67541
5      50974
6      36924
4       8904
3       5996
2       2759
1       1770
Name: Book-Rating, dtype: int64

The rating distribution is highly unbalanced: counts range from 1770 (rating 1) to 716109 (rating 0).
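One way to quantify the imbalance is to look at shares instead of raw counts, and to compare per-book means with and without the zero ratings. The sketch below uses a toy frame and assumes, as the Book-Crossing dataset description states, that a rating of 0 marks an implicit interaction and not a score. Whether to exclude zeros is a modelling choice this notebook does not make.

```python
import pandas as pd

# Toy stand-in for df_ratings (illustrative values).
toy_ratings = pd.DataFrame({
    'ISBN': ['A', 'A', 'A', 'B', 'B'],
    'Book-Rating': [0, 8, 10, 0, 0],
})

# Share of each rating value instead of raw counts.
share = toy_ratings['Book-Rating'].value_counts(normalize=True)

# Per-book mean over all rows, zeros included.
mean_all = toy_ratings.groupby('ISBN')['Book-Rating'].mean()

# Per-book mean over non-zero rows only; books with only zeros drop out.
explicit = toy_ratings[toy_ratings['Book-Rating'] > 0]
mean_explicit = explicit.groupby('ISBN')['Book-Rating'].mean()

print(share, mean_all, mean_explicit, sep='\n\n')
```

In the toy data, book `A` averages lower when its zero is included, and book `B` has no explicit mean at all.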

## Content Based Recommendation
Given the nature of the data, we'll create a content-based recommendation engine using each book's title, author, publisher and average rating.

Drop the book features that are not needed

In [17]:
books = df_books.drop(['img_s','img_m','img_l'],axis=1)

Obtain the average rating by book

In [18]:
avg_ratings = df_ratings[['ISBN','Book-Rating']].groupby(['ISBN']).mean()
avg_ratings = avg_ratings.reset_index()
avg_ratings.describe()

Unnamed: 0,Book-Rating
count,340556.0
mean,2.943595
std,3.345574
min,0.0
25%,0.0
50%,1.8
75%,5.0
max,10.0


In [19]:
df_merged = pd.merge(books,avg_ratings, on='ISBN')
df_merged

Unnamed: 0,ISBN,book_title,book_author,year_of_publication,publisher,Book-Rating
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,0.000000
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,4.928571
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,5.000000
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,4.272727
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,0.000000
...,...,...,...,...,...,...
270165,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),7.000000
270166,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,4.000000
270167,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,0.000000
270168,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,0.000000


Let's remove nulls

In [20]:
df_merged.isnull().sum()

ISBN                   0
book_title             0
book_author            1
year_of_publication    0
publisher              2
Book-Rating            0
dtype: int64

In [21]:
df_merged = df_merged.dropna(axis=0)
df_merged.isnull().sum()

ISBN                   0
book_title             0
book_author            0
year_of_publication    0
publisher              0
Book-Rating            0
dtype: int64

Create a new column that combines all relevant features, call it __all_info__

In [22]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning after dropna
df_merged = df_merged.assign(all_info=df_merged['book_title'] + ' written by ' + df_merged['book_author'] + ' published by ' + df_merged['publisher'])
df_merged.head(3)


Unnamed: 0,ISBN,book_title,book_author,year_of_publication,publisher,Book-Rating,all_info
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,0.0,Classical Mythology written by Mark P. O. Morf...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,4.928571,Clara Callan written by Richard Bruce Wright p...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,5.0,Decision in Normandy written by Carlo D'Este p...


Let's now use a TF-IDF vectorizer to extract features from the book content: title, author and publisher

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.7, min_df = 1, stop_words='english')
features_fit = vectorizer.fit(df_merged['all_info'])
features = features_fit.transform(df_merged['all_info'])
features.shape

(270167, 116364)

In [27]:
sample_user = df_merged.sample(5, random_state=20)
user_list = sample_user['all_info'].tolist()
user_features = features_fit.transform(user_list)
user_features.shape

(5, 116364)

For each book in the user list, compute the pairwise cosine distance to every book and take the closest ones as recommendations

In [28]:
from sklearn.metrics.pairwise import linear_kernel, pairwise_distances

# metrics include: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'].

similar_in_features = pairwise_distances(user_features, features, metric = 'cosine')
similar_sort_indices = similar_in_features.argsort()

for row in range(sample_user.shape[0]):
    print('Recommendations for', sample_user['book_title'].iloc[row])
    # position 0 is skipped on the assumption that it is the query book itself; keep the next 5
    recommend = df_merged['book_title'].iloc[similar_sort_indices[row,1:6]].tolist()
    print(pd.DataFrame(recommend, columns=['Title']))
    print('')

Recommendations for Jill
                 Title
0           The Island
1  Every Move You Make
2     My Lady Caroline
3   Circle of the Lily
4    The Scottish Rose

Recommendations for My Education: A Book of Dreams
                                              Title
0                    My Education: A Book of Dreams
1                                             Queer
2                          William Burroughs Reader
3                                    The Cat Inside
4  With William Burroughs: A Report from the Bunker

Recommendations for Champion of the Sidhe
                 Title
0      Riders of Sidhe
1  Master of the Sidhe
2  Riders of the Sidhe
3  A Storm upon Ulster
4           Otherworld

Recommendations for Der Bestseller.
                      Title
0           Der Bestseller.
1      Der Schrei der Eule.
2     Der See von Han-yuan.
3            The Bestseller
4  Der Geliebte der Mutter.

Recommendations for The Lord of the Rings (Leatherette Collector's Edition)
          

### Authors

In [29]:
for row in range(sample_user.shape[0]):
    print('Recommendations for', sample_user['book_title'].iloc[row])
    # position 0 is skipped on the assumption that it is the query book itself; keep the next 5
    recommend = df_merged['book_author'].iloc[similar_sort_indices[row,1:6]].tolist()
    print(pd.DataFrame(recommend, columns=['Author']))
    print('')

Recommendations for Jill
       Author
0  Jill Jones
1  Jill Jones
2  Jill Jones
3  Jill Jones
4  Jill Jones

Recommendations for My Education: A Book of Dreams
                 Author
0  William S. Burroughs
1     William Burroughs
2     William Burroughs
3  William S. Burroughs
4  William S. Burroughs

Recommendations for Champion of the Sidhe
             Author
0  Kenneth C. Flint
1  Kenneth C. Flint
2  Kenneth C. Flint
3  Kenneth C. Flint
4  Kenneth C. Flint

Recommendations for Der Bestseller.
               Author
0    Olivia Goldsmith
1  Patricia Highsmith
2    Robert van Gulik
3    Olivia Goldsmith
4          Urs Widmer

Recommendations for The Lord of the Rings (Leatherette Collector's Edition)
             Author
0  J. R. R. Tolkien
1  J. R. R. Tolkien
2  J. R. R. Tolkien
3  J. R. R. Tolkien
4  J. R. R. Tolkien
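
Both loops above take positions 1 to 5 of the sorted distances, which presumes that position 0 is the query book itself. That may not always hold: "My Education: A Book of Dreams" appears among its own recommendations, which suggests a second record with the same title sorted first. The sketch below excludes the query row by index instead. `top_k_similar` is a hypothetical helper, and it uses `linear_kernel` in place of `pairwise_distances`, relying on the fact that `TfidfVectorizer` L2-normalises rows by default so the dot product equals cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_similar(query_idx, features, k=5):
    """Indices of the k rows most similar to row `query_idx`, excluding itself."""
    # Dot products against every row; equals cosine similarity for L2-normalised rows.
    scores = linear_kernel(features[query_idx], features).ravel()
    scores[query_idx] = -np.inf                 # never return the query row
    k = min(k, len(scores) - 1)
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top k
    return top[np.argsort(-scores[top])]        # ordered by similarity

# Illustrative strings standing in for df_merged['all_info'].
docs = [
    'Riders of the Sidhe Kenneth Flint',
    'Master of the Sidhe Kenneth Flint',
    'Classical Mythology Mark Morford',
    'Decision in Normandy Carlo Este',
]
toy_features = TfidfVectorizer(stop_words='english').fit_transform(docs)

result = top_k_similar(0, toy_features, k=2)
print([docs[i] for i in result])
```

This removes only the exact query row. Other records with the same title (for example, a different edition) would still be returned, so de-duplicating by title is a separate step.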



### Initialize our users, book ratings and features

In [87]:
n_books = 271379
n_users = 278858
n_features = 4

# get the indices with nonzero ratings to create a sparse users_book-rating matrix
ratings_indices = df_ratings['Book-Rating'].to_numpy().nonzero()[0]
ratings_ISBN = df_ratings['ISBN'][ratings_indices]
ratings_ID = df_ratings['User-ID'][ratings_indices]

row_indices = (ratings_ID - 1)  #row indices represent user_ids - which start at 1 
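
Using `ratings_ID - 1` as a row index assumes one row per possible user ID, from 1 to `n_users`. An alternative sketch, under the assumption that compact indices are acceptable, is to let `pd.factorize` map the users and ISBNs that actually occur to consecutive integers and build a SciPy sparse matrix from them. The toy frame and all variable names below are illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for df_ratings (illustrative values).
toy_ratings = pd.DataFrame({
    'User-ID': [3, 7, 7, 10, 10],
    'ISBN': ['A', 'B', 'C', 'A', 'B'],
    'Book-Rating': [5, 3, 6, 8, 0],
})

# Keep only non-zero ratings, as in the cell above.
explicit = toy_ratings[toy_ratings['Book-Rating'] != 0]

# factorize returns integer codes plus the unique values they stand for.
user_codes, user_ids = pd.factorize(explicit['User-ID'])
book_codes, isbns = pd.factorize(explicit['ISBN'])

# Rows are users, columns are books. csr_matrix sums duplicate (row, col)
# entries, so this assumes each user rates a given book at most once.
matrix = csr_matrix(
    (explicit['Book-Rating'].to_numpy(), (user_codes, book_codes)),
    shape=(len(user_ids), len(isbns)),
)

print(matrix.toarray())
```

`user_ids[i]` and `isbns[j]` recover the original User-ID and ISBN for row `i` and column `j`.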

In [90]:
ratings_ID

1          276726
3          276729
4          276729
6          276736
7          276737
            ...  
1149773    276704
1149775    276704
1149777    276709
1149778    276721
1149779    276723
Name: User-ID, Length: 433671, dtype: int64

In [12]:
all_ISBN = df_books.ISBN.values  # ISBNs in the books table; named to match the cells below

In [16]:
np.where(all_ISBN=='034545104X')[0][0]

2966

In [19]:
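# NOTE: `in` with an array on the left is not an element-wise membership test;
# np.isin(all_ISBN, df_ratings.ISBN.values) would give one boolean per ISBN
all_ISBN in df_ratings.ISBN.values
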
False

In [29]:
sum(relevant_rows == 1)

43534

In [86]:
len(df_ratings)

1149780