# Using book ratings to find books similar to a specific book

I started with a dataset of Amazon book reviews that has about 3 million rows. From it I created a smaller dataset by keeping only the columns I needed and dropping rows that contain a null value.
```
# rating_cols = ['Id', 'User_id', 'Title', 'review/score']
# reviews = pd.read_csv("../datasets/Books_rating.csv", names=rating_cols, usecols=range(4), header=0)

# reviews.dropna(inplace= True)
# reviews.to_csv('../datasets/Book_ratings_clean.csv')
```
To sample a random subset, use `reviews = reviews.sample(frac=0.5, random_state=1).reset_index(drop=True)`.
I didn't need to do this, but it would improve the speed (at the cost of leaving out some matches).
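
As a minimal sketch of the load, clean, and sample steps, the snippet below reads a tiny in-memory CSV instead of the real file. The column layout of the stand-in data (including the extra `Price` column) is an assumption for illustration, not the confirmed layout of `Books_rating.csv`. The point it tries to show is that selecting columns by name with `usecols` does not depend on where they sit in the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw CSV; the real column order is an assumption.
raw_csv = io.StringIO(
    "Id,Title,Price,User_id,review/score\n"
    "B001,Book A,9.99,U1,5.0\n"
    "B001,Book A,9.99,,4.0\n"
    "B002,Book B,,U2,3.0\n"
    "B003,Book C,12.50,U3,\n"
)

wanted = ["Id", "User_id", "Title", "review/score"]

# Select by column name, then reorder to the order listed in `wanted`.
reviews = pd.read_csv(raw_csv, usecols=wanted)[wanted]

# Drop rows with a null in any of the selected columns.
reviews = reviews.dropna().reset_index(drop=True)

# Optional: a reproducible random half of the remaining rows.
sample = reviews.sample(frac=0.5, random_state=1).reset_index(drop=True)
```

With positional selection (`names=...` plus `usecols=range(4)`), the labels are applied to whatever the first four columns happen to be, so it is worth printing `reviews.head()` after loading to confirm each column holds what its name says.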

Next, I renamed the columns and imported the data into a Postgres table so I could search it easily. It is probably just as fast to do this in pandas, but I'm more comfortable with SQL at the moment.

In [1]:
import pandas as pd
# rating_cols = ['Id', 'User_id', 'Title', 'review/score']
# reviews = pd.read_csv("../datasets/Books_rating.csv", names=rating_cols, usecols=range(4), header=0)

# reviews.dropna(inplace= True)
# reviews.to_csv('../datasets/Book_ratings_clean.csv')

In [2]:
# using that cleaner dataset
ratings = pd.read_csv("../datasets/Book_ratings_clean.csv")
ratings.head()

| | id | book_id | title | rating | user_id |
|---|---|---|---|---|---|
| 0 | 10 | 829814000 | Wonderful Worship in Smaller Churches | 19.4 | AZ0IOBU20TBOP |
| 1 | 11 | 829814000 | Wonderful Worship in Smaller Churches | 19.4 | A373VVEU6Z9M0N |
| 2 | 12 | 829814000 | Wonderful Worship in Smaller Churches | 19.4 | AGKGOH65VTRR4 |
| 3 | 13 | 829814000 | Wonderful Worship in Smaller Churches | 19.4 | A3OQWLU31BU1Y |
| 4 | 14 | 595344550 | Whispers of the Wicked Saints | 10.95 | A3Q12RK71N74LB |


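Many books have only one review, so I aggregated the rating count and mean per title to make the data easier to work with. Note that the `rating` values shown above (19.4, 10.95) fall outside a 1-5 star range, so this column may actually hold a different field, such as the price; the column selection in the first cell is worth double-checking.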

In [3]:
import numpy as np
# string aggregator names produce the same 'size' and 'mean' columns as np.size / np.mean
bookReviews = ratings.groupby('title').agg({'rating': ['size', 'mean']})
bookReviews.head()


| title | rating (size) | rating (mean) |
|---|---|---|
| "A Truthful Impression of the Country": British and American Travel Writing in China, 1880-1949 | 1 | 70.27 |
| "Beauty Shop-Philly Style" | 2 | 13.95 |
| "Civilizing" Rio: Reform and Resistance in a Brazilian City 1889-1930 | 1 | 32.95 |
| "Come and See" Kids: The Life of Jesus (Catholic Bible Study for Children) | 2 | 9.65 |
| "Come to Me" | 1 | 10.19 |


There are too many rows for my laptop to churn through, so I dropped books with fewer than 10 reviews.
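
As a side sketch on toy data, the count-and-mean aggregation and the review-count cutoff can also be written with named aggregation, which gives flat column names instead of a two-level header. The names `book_stats`, `n_reviews`, `mean_rating`, and the cutoff of 2 are made up for this example; the notebook itself uses a cutoff of 10.

```python
import pandas as pd

# Toy ratings: book A has 3 reviews, B has 1, C has 2.
ratings = pd.DataFrame(
    {
        "title": ["A", "A", "A", "B", "C", "C"],
        "rating": [5.0, 4.0, 3.0, 2.0, 4.0, 5.0],
    }
)

min_reviews = 2  # hypothetical cutoff for the toy data

# One row per title with the number of reviews and the mean rating.
book_stats = ratings.groupby("title").agg(
    n_reviews=("rating", "size"),
    mean_rating=("rating", "mean"),
)

# Keep only titles that meet the cutoff.
popular = book_stats[book_stats["n_reviews"] >= min_reviews]

# Restrict the original rows to those popular titles.
popular_ratings = ratings[ratings["title"].isin(popular.index)]
```

Filtering the original rows with `isin` keeps the per-user ratings needed later while dropping the books that fall under the cutoff.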

In [4]:
popularBooks = bookReviews['rating']['size'] >= 10
bookReviews[popularBooks].sort_values([('rating', 'mean')], ascending=False)[:15]

| title | rating (size) | rating (mean) |
|---|---|---|
| Starkissed | 35 | 730.0 |
| The Collected Works of J.G. Frazer: The Golden Bough (The Collected Works of James G) | 49 | 630.0 |
| Atlas of Hematology | 10 | 455.0 |
| Social Theory and Methodology: From Max Weber: Essays in Sociology (International Library of Sociology) | 11 | 410.0 |
| Richard II (Shakespeare: The Critical Tradition) | 13 | 360.0 |
| Vietnamese, Comprehensive: Learn to Speak and Understand Vietnamese with Pimsleur Language Programs | 10 | 345.0 |
| Polish, Comprehensive: Learn to Speak and Understand Polish with Pimsleur Language Programs | 12 | 287.38 |
| Spanish III, Comprehensive: Learn to Speak and Understand Latin American Spanish with Pimsleur Language Programs | 28 | 287.38 |
| Italian II, Second Edition: Compehensive Compact Discs | 11 | 287.38 |
| Czech, Comprehensive: Learn to Speak and Understand Czech with Pimsleur Language Programs | 12 | 287.38 |


Calling `pivot_table` on a DataFrame constructs a user/book rating matrix. This step takes about 5 minutes to run on my 2020 MacBook Pro; watch Activity Monitor to follow the process.
Before pivoting, I converted `popularBooks` back to a DataFrame and joined it to the original ratings DataFrame to get the ratings and user IDs. The left join is meant to leave out books from the original set that aren't in the popular list (10 or more reviews). Note that `popularBooks` is a boolean mask covering every title, so rows where `size` is `False` still have to be filtered out before pivoting.

In [29]:
popularBooksDF = popularBooks.to_frame()
# join this back with the original ratings df to get the ratings and user_ids
popularBooksDF = popularBooksDF.join(ratings.set_index('title'), on='title', how='left')
popularBooksDF.head()

| title | size | id | book_id | rating | user_id |
|---|---|---|---|---|---|
| "A Truthful Impression of the Country": British and American Travel Writing in China, 1880-1949 | False | 2588738 | 0472111973 | 70.27 | A912C7977MO6O |
| "Beauty Shop-Philly Style" | False | 1274800 | 1420859323 | 13.95 | A23JCW11WWEQCQ |
| "Beauty Shop-Philly Style" | False | 1274801 | 1420859323 | 13.95 | A1Y2KST29TFAPF |
| "Civilizing" Rio: Reform and Resistance in a Brazilian City 1889-1930 | False | 1036878 | 027102870X | 32.95 | A276Y65EABFM69 |
| "Come and See" Kids: The Life of Jesus (Catholic Bible Study for Children) | False | 755357 | 1931018286 | 9.65 | A1JM25M2PSVPXE |


In [9]:
bookReviewsPTable = popularBooksDF.pivot_table(index=['user_id'],columns=['title'],values='rating')
bookReviewsPTable.head()

| user_id | "Cool Stuff" They Should Teach in School: Cruise into the Real World...with styyyle (jobs/people skills/attitude/goals/money) | "Happiness Is Not My Companion": The Life of General G. K. Warren | "I just got a job in sales. Now what?" A Playbook for Skyrocketing Your Commissions | "Life Was Never Meant to Be a Struggle" | "Mom, I Hate My Life!": Becoming Your Daughter's Ally Through the Emotional Ups and Downs of Adolescence (A Hand-in-Hand Book) | "More More More," Said the Baby Board Book (Caldecott Collection) | "Then Junior Said to Jeff. . .": The Best NASCAR Stories Ever Told (Best Sports Stories Ever Told) | 'night, Mother: A Play (Mermaid Dramabook) | .NET Game Programming with DirectX 9.0 | .NET Windows Forms Custom Controls | ... | Zondervan KJV Study Bible, Large Print | Zondervan NIV Matthew Henry Commentary | Zondervan's Compact Bible Dictionary | Zoo - ology | bills open kitchen | comeback - a mother and daughter's journey through hell and back | creative childbirth | e-Business and e-Commerce How to Program | how nature works: the science of self-organized criticality | the Picture of Dorian Gray |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A00117421L76WVWG4UX95 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A0015610VMNR0JC9XVL1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A00290423P2GEY37XWVAW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A00841771VHT9WNGU0X3J | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A0085845UER34CCMXCHL | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
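
To make the shape of that matrix easier to see, here is a toy sketch of the same `pivot_table` call on five made-up ratings. The user IDs, titles, and values are invented for illustration; only the call pattern mirrors the notebook.

```python
import pandas as pd

# Five made-up ratings from three users across three books.
toy = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u2", "u3"],
        "title": ["A", "B", "A", "C", "B"],
        "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
    }
)

# Rows are users, columns are titles, cells are ratings (NaN where unrated).
matrix = toy.pivot_table(index="user_id", columns="title", values="rating")

# Fraction of user/book cells that actually hold a rating.
fill_ratio = matrix.notna().sum().sum() / matrix.size
```

Even in this tiny case only 5 of the 9 cells are filled. With many thousands of users and books the matrix is almost entirely `NaN`, which is consistent with the empty cells in the output above and with the long run time.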


In [38]:
specificBookRatings = bookReviewsPTable["Pride and Prejudice (Bloom's Guides)"]
specificBookRatings.count()

KeyError: "Pride and Prejudice (Bloom's Guides)"
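
A `KeyError` like this means the exact title string is not a column of the pivot table, for example because the book fell under the review cutoff or is stored under slightly different wording. One way to find a usable title, sketched here on a hypothetical list of column names, is a case-insensitive substring search before indexing:

```python
import pandas as pd

# Hypothetical column index standing in for bookReviewsPTable.columns.
titles = pd.Index(
    [
        "Pride and Prejudice",
        "Pride and Prejudice (Penguin Classics)",
        "Wuthering Heights",
    ],
    name="title",
)

# Case-insensitive, literal substring match against the available titles.
mask = titles.str.contains("pride and prejudice", case=False, regex=False)
candidates = titles[mask]

wanted = "Pride and Prejudice (Bloom's Guides)"

# Use the exact title if present, otherwise fall back to the first candidate.
# (A real notebook should also handle the case where `candidates` is empty.)
chosen = wanted if wanted in titles else candidates[0]
```

In the notebook, `titles` would be `bookReviewsPTable.columns` and `chosen` would replace the hard-coded string in the lookup.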

From the Frank Kane course: pandas' `corrwith` function makes it easy to compute the pairwise correlation of one book's vector of user ratings with every other book. After that, we drop any results that have no data and construct a new DataFrame of books and their correlation score (similarity) to the specific book from above:

In [21]:
similarBooks = bookReviewsPTable.corrwith(specificBookRatings)
similarBooks = similarBooks.dropna()
df = pd.DataFrame(similarBooks)
df.head(10)

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


| title | 0 |
|---|---|
| A Christmas Carol (Classic Fiction) | 1.0 |
| A Christmas Carol, in Prose: Being a Ghost Story of Christmas (Collected Works of Charles Dickens) | 1.0 |
| A Tale of Two Cities - Literary Touchstone Edition | -1.0 |
| Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection (Brilliance Audio)) | 1.0 |
| Geoffrey Chaucer (Bloomsbury Poetry Classics) | -1.0 |
| The Jungle Book | -1.0 |
| The Picture of Dorian Gray | 1.0 |
| The Picture of Dorian Gray (The Classic Collection) | 0.998289 |
| The Red Badge of Courage (Lake Illustrated Classics, Collection 1) | 1.0 |
| Twenty Thousand Leagues Under the Sea (Library Edition) | -1.0 |
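
Scores of exactly 1.0 or -1.0 are what a Pearson correlation returns when only two users rated both books, since any two points lie on a straight line. The sketch below, on made-up data, counts how many users each book shares with the target and keeps only correlations backed by a minimum overlap. The column names and the `min_overlap` value are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up user/book matrix: rows are users, columns are books.
matrix = pd.DataFrame(
    {
        "target": [5.0, 3.0, 4.0, 1.0, np.nan],
        "well_covered": [4.0, 2.0, 5.0, 1.0, 3.0],
        "two_raters": [np.nan, np.nan, 2.0, 5.0, 4.0],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)

target = matrix["target"]

# Pairwise correlation of every book with the target book.
similarity = matrix.corrwith(target)

# Number of users who rated both the book and the target.
overlap = matrix.notna().loc[target.notna()].sum()

min_overlap = 3  # hypothetical threshold
reliable = similarity[overlap >= min_overlap]
```

Here `two_raters` shares only two users with the target and gets a correlation of -1.0, so it is filtered out, while `well_covered` shares four users and is kept.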


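To do: resolve the warnings shown above (the two NumPy source lines printed under cell 21). They seem to come from books that share too few raters with the target book for a correlation to be computed, and those results are removed by `dropna()`, so for now they are safe to ignore.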

In [30]:
similarBooks.sort_values(ascending=False)

title
Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection (Brilliance Audio))    1.000000
The Picture of Dorian Gray                                                                                1.000000
The Red Badge of Courage (Lake Illustrated Classics, Collection 1)                                        1.000000
Wuthering Heights                                                                                         1.000000
the Picture of Dorian Gray                                                                                1.000000
A Christmas Carol (Classic Fiction)                                                                       1.000000
A Christmas Carol, in Prose: Being a Ghost Story of Christmas (Collected Works of Charles Dickens)        1.000000
The Picture of Dorian Gray (The Classic Collection)                                                       0.998289
A Tale of Two Cities - Literary Touchstone Edition                        

A cutoff of ten reviews might be too low; I may experiment with it.

In [31]:
df = bookReviews[popularBooks].join(pd.DataFrame(similarBooks, columns=['similarity']))



In [32]:
df.head()

| title | (rating, size) | (rating, mean) | similarity |
|---|---|---|---|
| "Cool Stuff" They Should Teach in School: Cruise into the Real World...with styyyle (jobs/people skills/attitude/goals/money) | 21 | 11.21 | NaN |
| "Happiness Is Not My Companion": The Life of General G. K. Warren | 11 | 24.95 | NaN |
| "I just got a job in sales. Now what?" A Playbook for Skyrocketing Your Commissions | 16 | 19.95 | NaN |
| "Life Was Never Meant to Be a Struggle" | 17 | 5.0 | NaN |
| "Mom, I Hate My Life!": Becoming Your Daughter's Ally Through the Emotional Ups and Downs of Adolescence (A Hand-in-Hand Book) | 16 | 17.45 | NaN |


In [33]:
df.sort_values(['similarity'], ascending=False)[:15]

| title | (rating, size) | (rating, mean) | similarity |
|---|---|---|---|
| Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection (Brilliance Audio)) | 294 | 18.54 | 1.0 |
| The Picture of Dorian Gray | 1184 | 22.48 | 1.0 |
| The Red Badge of Courage (Lake Illustrated Classics, Collection 1) | 286 | 14.95 | 1.0 |
| Wuthering Heights | 1736 | 15.975 | 1.0 |
| the Picture of Dorian Gray | 592 | 12.99 | 1.0 |
| A Christmas Carol (Classic Fiction) | 957 | 13.98 | 1.0 |
| A Christmas Carol, in Prose: Being a Ghost Story of Christmas (Collected Works of Charles Dickens) | 957 | 88.94 | 1.0 |
| The Picture of Dorian Gray (The Classic Collection) | 592 | 25.04 | 0.998289 |
| A Tale of Two Cities - Literary Touchstone Edition | 827 | 5.99 | -1.0 |
| Geoffrey Chaucer (Bloomsbury Poetry Classics) | 129 | 14.26 | -1.0 |


Follow-ups: remove the original book from the results, and learn how to work with big data more efficiently.
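
For the first follow-up, a minimal sketch on made-up numbers: drop the target title from the results, apply a review-count floor, and sort by similarity. The titles, counts, and the floor of 20 are hypothetical.

```python
import pandas as pd

target_title = "Target Book"  # hypothetical title of the book being matched

# Made-up similarity scores and review counts, indexed by title.
similarity = pd.Series(
    {"Target Book": 1.0, "Book X": 0.8, "Book Y": -0.2, "Book Z": 0.95},
    name="similarity",
)
review_counts = pd.Series(
    {"Target Book": 120, "Book X": 40, "Book Y": 300, "Book Z": 12},
    name="n_reviews",
)

# Combine into one table with a column per series.
results = pd.concat([similarity, review_counts], axis=1)

# Remove the book itself; errors="ignore" avoids a KeyError if it is absent.
results = results.drop(index=target_title, errors="ignore")

# Keep books with enough reviews (hypothetical floor), most similar first.
results = results[results["n_reviews"] >= 20]
results = results.sort_values("similarity", ascending=False)
```

The target book always correlates perfectly with itself, so leaving it in would put it at the top of every result list.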