# Using book ratings to predict similar books to a specific book

I started with a dataset of Amazon book reviews that has about 3 million rows. I created a smaller dataset by keeping only the columns I needed and dropping rows that contain a null value.
```
# rating_cols = ['Id', 'User_id', 'Title', 'review/score']
# reviews = pd.read_csv("../datasets/Books_rating.csv", names=rating_cols, usecols=range(4), header=0)

# reviews.dropna(inplace= True)
# reviews.to_csv('../datasets/Book_ratings_clean.csv')
```
To sample a random subset, use `reviews = reviews.sample(frac=0.5, random_state=1).reset_index(drop=True)`.
I didn't need to do this, but it would improve speed (at the cost of leaving out some hits).
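
As a small illustration of the sampling call above, here is a sketch on a made-up ten-row table (the titles and ratings are placeholders, not rows from the real dataset). It shows that fixing `random_state` makes the random half repeatable between runs.

```python
import pandas as pd

# Toy stand-in for the reviews table; all values are invented for illustration.
reviews = pd.DataFrame({
    "title": [f"book_{i}" for i in range(10)],
    "rating": [5, 4, 3, 5, 1, 2, 4, 5, 3, 4],
})

# Keep a random half of the rows; a fixed random_state makes the draw repeatable.
sample_a = reviews.sample(frac=0.5, random_state=1).reset_index(drop=True)
sample_b = reviews.sample(frac=0.5, random_state=1).reset_index(drop=True)

print(len(sample_a))              # number of rows kept
print(sample_a.equals(sample_b))  # whether both draws picked the same rows
```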

Next, I renamed the columns and imported the data into a Postgres table so I could search it easily. I'm sure it's just as fast to do this in the notebook, but I'm more comfortable with SQL at the moment.

In [1]:
import pandas as pd
# rating_cols = ['Id', 'User_id', 'Title', 'review/score']
# reviews = pd.read_csv("../datasets/Books_rating.csv", names=rating_cols, usecols=range(4), header=0)

# reviews.dropna(inplace= True)
# reviews.to_csv('../datasets/Book_ratings_clean.csv')

In [2]:
# using that cleaner dataset
ratings = pd.read_csv("../datasets/Book_ratings_clean.csv")
ratings.count()

id         414575
book_id    414575
title      414575
rating     414575
user_id    414575
dtype: int64

Many books have only one review, so I aggregated the rating count and mean per title to make the data easier to work with.

In [3]:
import numpy as np
# string aggregation names avoid the deprecation warning newer pandas raises for np.size / np.mean
bookReviews = ratings.groupby('title').agg({'rating': ['size', 'mean']})
bookReviews.head()


| title | rating (size) | rating (mean) |
| --- | --- | --- |
| "A Truthful Impression of the Country": British and American Travel Writing in China, 1880-1949 | 1 | 70.27 |
| "Beauty Shop-Philly Style" | 2 | 13.95 |
| "Civilizing" Rio: Reform and Resistance in a Brazilian City 1889-1930 | 1 | 32.95 |
| "Come and See" Kids: The Life of Jesus (Catholic Bible Study for Children) | 2 | 9.65 |
| "Come to Me" | 1 | 10.19 |


Too many rows for my laptop to churn through, so I'm dropping books with fewer than 30 reviews (the threshold used in the code below).

In [4]:
popularBooks = bookReviews['rating']['size'] >= 30
bookReviews[popularBooks].sort_values([('rating', 'mean')], ascending=False)[:15]

| title | rating (size) | rating (mean) |
| --- | --- | --- |
| Starkissed | 35 | 730.0 |
| The Collected Works of J.G. Frazer: The Golden Bough (The Collected Works of James G) | 49 | 630.0 |
| Biochemistry | 31 | 210.51 |
| Stone of Tears (Sword of Truth Series) | 386 | 167.25 |
| Examkrackers MCAT Audio Osmosis with Jordan and Jon | 74 | 129.5 |
| Just Say No | 77 | 119.95 |
| Learning Spanish Like Crazy: Spoken Spanish, Vol. 1 (2 volume set) | 55 | 118.49 |
| Out | 155 | 117.95 |
| The Portrait of a Lady (The Classic Collection) | 47 | 112.25 |
| Music & Silence | 30 | 110.95 |
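
The aggregate-then-filter step above can also be sketched with pandas named aggregation, which produces flat column names instead of a two-level header. This is only an illustration on invented data: `n_reviews`, `mean_rating`, and `MIN_REVIEWS` are hypothetical names, and the toy threshold of 2 stands in for the 30 used in the notebook.

```python
import pandas as pd

# Toy ratings table; values are invented for illustration.
ratings = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [5.0, 4.0, 3.0, 2.0, 5.0, 1.0],
})

# Named aggregation: one flat output column per (source column, function) pair.
book_stats = ratings.groupby("title").agg(
    n_reviews=("rating", "size"),
    mean_rating=("rating", "mean"),
)

MIN_REVIEWS = 2  # hypothetical threshold for the toy data; the notebook uses 30

# Keep only titles with enough reviews, best-rated first.
popular = book_stats[book_stats["n_reviews"] >= MIN_REVIEWS]
popular = popular.sort_values("mean_rating", ascending=False)
print(popular)
```

Flat column names also make later joins simpler, since both sides of the join then have single-level columns.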


Calling `pivot_table` on a DataFrame constructs a user/book rating matrix. This step takes about 5 minutes to run on my 2020 MBP; watch Activity Monitor to follow its progress.
Before pivoting, I turned the filtered result back into a DataFrame and joined it to the original ratings DataFrame to recover each rating and `user_id`. The left join leaves out books from the original set that aren't in the popular list (30 or more reviews, per the filter above).

In [7]:
popularBooksDF = bookReviews[popularBooks]
# bookReviews['title'] = bookReviews['title'].to_string()
popularBooksDF.head()
# popularBooksDF.count()
# join this back with the original ratings df to get the ratings and user_ids
popularBooksDF = popularBooksDF.join(ratings.set_index('title'), on='title', how='left')
popularBooksDF.count()



(rating, size)    191851
(rating, mean)    191851
id                191851
book_id           191851
rating            191851
user_id           191851
dtype: int64

In [8]:
bookReviewsPTable = popularBooksDF.pivot_table(index=['user_id'],columns=['title'],values='rating')
bookReviewsPTable.head()

| user_id | "More More More," Said the Baby Board Book (Caldecott Collection) | 1,000 Indian Recipes | 1,000 Vegetarian Recipes | 10 Days to Faster Reading | 100 Samurai Sudoku Puzzles | 1001 Books You Must Read Before You Die | 1001 Winning Chess Sacrifices and Combinations | 101 Things I Wish I Knew When I Got Married: Simple Lessons to Make Love Last | 101 Things to Do with a Tortilla | 110 People Who Are Screwing Up America (and Al Franken Is #37) | ... | Zane's Gettin' Buck Wild: Sex Chronicles II | Zane's Skyscraper: A Novel | Zanesville: A Novel | Zen Guitar | Zen Shorts (Caldecott Honor Book) | Zen and the Art of Poker: Timeless Secrets to Transform Your Game | Zen in the Martial Arts | Zin! Zin! Zin! A Violin (Aladdin Picture Books) | comeback - a mother and daughter's journey through hell and back | the Picture of Dorian Gray |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A00290423P2GEY37XWVAW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A00841771VHT9WNGU0X3J | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A00891092QIVH4W1YP46A | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A00940571GAOITYS675AR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A0099149GMVW6X5BHP2U | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
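
To make the shape of this matrix easier to picture, here is a minimal sketch of the same `pivot_table` call on an invented three-user, three-book table. It shows one row per user, one column per title, and `NaN` wherever a user has not rated a book, which is why the real matrix above is almost entirely empty.

```python
import pandas as pd

# Toy long-format ratings; values are invented for illustration.
ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "title":   ["A",  "B",  "A",  "C",  "B"],
    "rating":  [5.0,  3.0,  4.0,  2.0,  1.0],
})

# One row per user, one column per title; unrated pairs become NaN.
matrix = ratings.pivot_table(index="user_id", columns="title", values="rating")
print(matrix)

# Fraction of user/book cells that have no rating.
sparsity = matrix.isna().to_numpy().mean()
print(sparsity)
```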


In [9]:
specificBookRatings = bookReviewsPTable["The Picture of Dorian Gray"]
specificBookRatings.count()

577

From the Frank Kane course: pandas' `corrwith` function makes it easy to compute the pairwise correlation of the book's vector of user ratings with every other book. After that, we drop any results that have no data and construct a new DataFrame of books and their correlation score (similarity) to the specific book from above:

In [10]:
similarBooks = bookReviewsPTable.corrwith(specificBookRatings)
similarBooks = similarBooks.dropna()
df = pd.DataFrame(similarBooks)
df.head(10)

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


| title | 0 |
| --- | --- |
| A Christmas Carol (Classic Fiction) | 1.0 |
| A Christmas Carol, in Prose: Being a Ghost Story of Christmas (Collected Works of Charles Dickens) | 1.0 |
| A Tale of Two Cities - Literary Touchstone Edition | -1.0 |
| Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection (Brilliance Audio)) | 1.0 |
| Geoffrey Chaucer (Bloomsbury Poetry Classics) | -1.0 |
| The Jungle Book | -1.0 |
| The Picture of Dorian Gray | 1.0 |
| The Picture of Dorian Gray (The Classic Collection) | 0.998289 |
| The Red Badge of Courage (Lake Illustrated Classics, Collection 1) | 1.0 |
| Twenty Thousand Leagues Under the Sea (Library Edition) | -1.0 |


To do: resolve the warning above (the two echoed lines come from NumPy's correlation and covariance code). For now it's safe to ignore.
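
One plausible reading of the scores of exactly 1.0 and -1.0, and of the NumPy lines echoed above, is that some books share very few raters with the target book: two shared ratings always correlate at exactly plus or minus one (unless one side is constant), and a single shared rating cannot produce a correlation at all. The sketch below illustrates this on invented data and filters by the number of shared raters. The column names and `MIN_SHARED` are hypothetical, and this is an assumption about the cause, not a confirmed diagnosis of the notebook's output.

```python
import numpy as np
import pandas as pd

# Toy user/book matrix; values are invented for illustration.
matrix = pd.DataFrame(
    {
        "Target":     [5.0, 3.0, 4.0, 1.0, np.nan],
        "TwoShared":  [4.0, 2.0, np.nan, np.nan, 5.0],
        "FourShared": [5.0, 2.0, 4.0, 1.0, np.nan],
        "OneShared":  [np.nan, np.nan, np.nan, 2.0, 3.0],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)
target = matrix["Target"]

# Correlation of every column with the target (may emit NumPy runtime warnings).
similarity = matrix.corrwith(target)

# Number of users who rated both the target and each other book.
overlap = matrix.notna().loc[target.notna()].sum()

MIN_SHARED = 3  # hypothetical cutoff; tune for the real data
reliable = similarity[overlap >= MIN_SHARED].dropna().sort_values(ascending=False)

print(similarity)
print(overlap)
print(reliable)
```

Here `TwoShared` scores a perfect 1.0 from only two common raters, while `OneShared` comes out as `NaN`; requiring a minimum overlap keeps only the scores backed by enough shared ratings.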

In [11]:
similarBooks.sort_values(ascending=False)

title
Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection (Brilliance Audio))    1.000000
The Picture of Dorian Gray                                                                                1.000000
The Red Badge of Courage (Lake Illustrated Classics, Collection 1)                                        1.000000
Wuthering Heights                                                                                         1.000000
the Picture of Dorian Gray                                                                                1.000000
A Christmas Carol (Classic Fiction)                                                                       1.000000
A Christmas Carol, in Prose: Being a Ghost Story of Christmas (Collected Works of Charles Dickens)        1.000000
The Picture of Dorian Gray (The Classic Collection)                                                       0.998289
A Tale of Two Cities - Literary Touchstone Edition                        

The minimum review count might still be too low; I may play around with it.

In [15]:
df = bookReviews[popularBooks].join(pd.DataFrame(similarBooks, columns=['similarity']))



In [16]:
df.head()

| title | (rating, size) | (rating, mean) | similarity |
| --- | --- | --- | --- |
| "More More More," Said the Baby Board Book (Caldecott Collection) | 56 | 7.19 | NaN |
| 1,000 Indian Recipes | 42 | 7.0 | NaN |
| 1,000 Vegetarian Recipes | 40 | 7.0 | NaN |
| 10 Days to Faster Reading | 35 | 7.99 | NaN |
| 100 Samurai Sudoku Puzzles | 30 | 9.99 | NaN |


In [17]:
df.sort_values(['similarity'], ascending=False)[:15]

| title | (rating, size) | (rating, mean) | similarity |
| --- | --- | --- | --- |
| Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection (Brilliance Audio)) | 294 | 18.54 | 1.0 |
| The Picture of Dorian Gray | 1184 | 22.48 | 1.0 |
| The Red Badge of Courage (Lake Illustrated Classics, Collection 1) | 286 | 14.95 | 1.0 |
| Wuthering Heights | 1736 | 15.975 | 1.0 |
| the Picture of Dorian Gray | 592 | 12.99 | 1.0 |
| A Christmas Carol (Classic Fiction) | 957 | 13.98 | 1.0 |
| A Christmas Carol, in Prose: Being a Ghost Story of Christmas (Collected Works of Charles Dickens) | 957 | 88.94 | 1.0 |
| The Picture of Dorian Gray (The Classic Collection) | 592 | 25.04 | 0.998289 |
| A Tale of Two Cities - Literary Touchstone Edition | 827 | 5.99 | -1.0 |
| Geoffrey Chaucer (Bloomsbury Poetry Classics) | 129 | 14.26 | -1.0 |


Follow-ups: remove the original book from the results, and learn how to work with big data more efficiently.
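
For the first follow-up, one possible approach is to drop every title that starts with the target title, ignoring case, so that other editions and differently capitalized duplicates are removed along with the exact match. This is a rough sketch on invented titles and scores; the prefix match is an assumption and could wrongly remove unrelated books whose titles happen to begin the same way.

```python
import pandas as pd

# Toy similarity scores; titles and values are invented for illustration.
similarity = pd.Series(
    [1.0, 1.0, 0.99, 0.8, -0.4],
    index=[
        "Target Book",
        "target book",
        "Target Book (Deluxe Edition)",
        "Other Book A",
        "Other Book B",
    ],
    name="similarity",
)

target_title = "Target Book"

# Flag the target itself plus variants that begin with the same title.
same_work = similarity.index.str.lower().str.startswith(target_title.lower())

# Keep everything else, most similar first.
recommendations = similarity[~same_work].sort_values(ascending=False)
print(recommendations)
```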