## Project Overview

The objective of this project is to give personalized book recommendations using Goodreads data. We first identify users with similar tastes in books, and then base the personalized recommendations (predictions) on what those users have liked. The Goodreads book data has already been scraped by UCSD Research Engineers and is available at https://sites.google.com/eng.ucsd.edu/ucsdbookgraph.

Data files used:
* goodreads_interactions.csv - the ratings users have given to the books they have read
* books_titles.json - information about each book: book id, title, number of ratings on Goodreads, URL, and cover image
* book_id_map.csv - the book ids in goodreads_interactions.csv and books_titles.json are different; this file maps the book_id in goodreads_interactions.csv to the book_id in books_titles.json

### Project Steps:
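* Identify users who have read the books we like
* Create a sparse user-book rating matrix and find the users most similar to us
* Recommend the books those users rated highly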

## Reading in books we like

In [1]:
import pandas as pd

In [3]:
my_books = pd.read_csv("liked_books.csv", index_col = 0)

In [5]:
my_books

Unnamed: 0,user_id,book_id,rating,title
0,-1,2517439,5,"The Forever War (The Forever War, #1)"
1,-1,113576,5,The Smartest Guys in the Room: The Amazing Ris...
2,-1,35100,5,Battle Cry of Freedom
3,-1,228221,5,The Mask of Command
5,-1,17662739,5,"2001: A Space Odyssey (Space Odyssey, #1)"
6,-1,356824,5,India After Gandhi: The History of the World's...
7,-1,12125412,5,The Lady or the Tiger?: and Other Logic Puzzles
8,-1,139069,5,Endurance: Shackleton's Incredible Voyage
10,-1,76680,5,"Foundation (Foundation, #1)"
11,-1,1898,5,Into Thin Air: A Personal Account of the Mount...


In [6]:
my_books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27 entries, 0 to 532
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  27 non-null     int64 
 1   book_id  27 non-null     int64 
 2   rating   27 non-null     int64 
 3   title    27 non-null     object
dtypes: int64(3), object(1)
memory usage: 1.1+ KB


In [7]:
my_books["book_id"] = my_books["book_id"].astype(str)

## Finding similar users

In [8]:
!head book_id_map.csv

book_id_csv,book_id
0,34684622
1,34536488
2,34017076
3,71730
4,30422361
5,33503613
6,33517540
7,34467031
8,6383669


In [12]:
# Map the book ids used in goodreads_interactions.csv to Goodreads book ids.
csv_book_mapping = {}
with open("book_id_map.csv", 'r') as f:
    next(f)  # skip the header row
    for line in f:
        csv_id, book_id = line.strip().split(",")
        csv_book_mapping[csv_id] = book_id

In [13]:
book_set = set(my_books['book_id'])

In [15]:
!head goodreads_interactions.csv

user_id,book_id,is_read,rating,is_reviewed
0,948,1,5,0
0,947,1,5,1
0,946,1,5,0
0,945,1,5,0
0,944,1,5,0
0,943,1,5,0
0,942,1,5,0
0,941,1,5,0
0,940,1,5,0


In [16]:
!wc -l goodreads_interactions.csv

228648343 goodreads_interactions.csv


In [19]:
# Count, for each user, how many of our liked books appear in their interactions.
overlap_users = {}

with open("goodreads_interactions.csv", 'r') as f:
    next(f)  # skip the header row
    for line in f:
        user_id, csv_id, _, rating, _ = line.strip().split(",")
        book_id = csv_book_mapping.get(csv_id)
        if book_id in book_set:
            overlap_users[user_id] = overlap_users.get(user_id, 0) + 1

In [20]:
len(overlap_users)

316341

In [21]:
# Keep only users who share more than a fifth of our liked books.
filtered_overlap_users = {k for k, n in overlap_users.items() if n > my_books.shape[0] / 5}
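
The cutoff above keeps a user only if they share more than a fifth of our liked books. With the 27 books in `my_books` that is 27 / 5 = 5.4, so a user needs at least 6 books in common. A minimal sketch of the same counting and filtering on made-up data (the user ids, book ids, and the `liked` set are hypothetical):

```python
from collections import Counter

# Hypothetical liked books and (user_id, book_id) interaction pairs.
liked = {"b1", "b2", "b3", "b4", "b5"}
pairs = [
    ("u1", "b1"), ("u1", "b2"), ("u1", "b9"),
    ("u2", "b3"),
    ("u3", "b1"), ("u3", "b4"), ("u3", "b5"),
]

# Count how many liked books each user has in common with us.
overlap = Counter(user for user, book in pairs if book in liked)

# Same rule as the notebook: strictly more than one fifth of the liked list.
threshold = len(liked) / 5  # 1.0 for this toy list
kept = {user for user, n in overlap.items() if n > threshold}
print(sorted(kept))
```

Raising the divisor admits more, less similar users; lowering it leaves fewer users who overlap more strongly.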

## Finding similar user book ratings

In [22]:
# Second pass over the file: collect every interaction of the overlapping users.
interactions_list = []

with open("goodreads_interactions.csv", 'r') as f:
    next(f)  # skip the header row
    for line in f:
        user_id, csv_id, _, rating, _ = line.strip().split(",")
        if user_id in filtered_overlap_users:
            book_id = csv_book_mapping.get(csv_id)
            interactions_list.append([user_id, book_id, rating])

In [23]:
len(interactions_list)

5638701

In [24]:
interactions_list[0]

['282', '627206', '4']

In [25]:
interactions = pd.DataFrame(interactions_list, columns = ["user_id", "book_id", "rating"])

In [26]:
interactions = pd.concat([my_books[["user_id", "book_id", "rating"]], interactions])

In [27]:
interactions

Unnamed: 0,user_id,book_id,rating
0,-1,2517439,5
1,-1,113576,5
2,-1,35100,5
3,-1,228221,5
5,-1,17662739,5
...,...,...,...
5638696,804100,475178,0
5638697,804100,186074,0
5638698,804100,153008,0
5638699,804100,45107,0


In [28]:
interactions["book_id"] = interactions["book_id"].astype(str)
interactions["user_id"] = interactions["user_id"].astype(str)
interactions["rating"] = pd.to_numeric(interactions["rating"])

In [29]:
interactions["user_id"].unique()

array(['-1', '282', '874', ..., '442043', '712588', '804100'],
      dtype=object)

In [31]:
interactions["user_index"] = interactions["user_id"].astype("category").cat.codes

In [32]:
interactions.iloc[2000]

user_id        874
book_id       5308
rating           3
user_index    1216
Name: 1973, dtype: object
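
`cat.codes` numbers the categories in sorted order, and because `user_id` is a string column here, the sort is lexicographic rather than numeric. That is why user `874` ends up with a high index while our own id `-1` gets index 0. A small illustration with made-up ids, assuming pandas' default behaviour of sorting categories when the values are sortable:

```python
import pandas as pd

# Hypothetical user ids stored as strings, as in the interactions frame.
ids = pd.Series(["282", "-1", "874", "1000"])

cat = ids.astype("category")
print(list(cat.cat.categories))  # categories in lexicographic order
print(list(cat.cat.codes))       # integer code assigned to each row

# Look-up table from id to code, for convenience.
codes = dict(zip(ids, cat.cat.codes))
```

The codes are only positions in the matrix; they carry no meaning beyond giving each user and book a row or column.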

In [37]:
len(interactions["user_index"].unique())

1259

interactions["book_index"] = interactions["book_id"].astype("category").cat.codes

In [36]:
len(interactions["book_index"].unique())

802870

## Create a sparse matrix

In [38]:
interactions.head()

Unnamed: 0,user_id,book_id,rating,user_index,book_index
0,-1,2517439,5,0,414880
1,-1,113576,5,0,38971
2,-1,35100,5,0,575858
3,-1,228221,5,0,356004
5,-1,17662739,5,0,214285


In [40]:
from scipy.sparse import coo_matrix

ratings_mat_coo = coo_matrix((interactions["rating"], (interactions["user_index"], interactions["book_index"])))

In [42]:
ratings_mat_coo

<1259x802870 sparse matrix of type '<class 'numpy.int64'>'
	with 5638728 stored elements in COOrdinate format>

In [43]:
ratings_mat = ratings_mat_coo.tocsr()
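
A sparse format matters here because of the matrix shape reported above: 1259 users by 802870 books, with only 5638728 stored ratings. A back-of-the-envelope check, assuming 8 bytes per `int64` cell if the same matrix were held as a dense array:

```python
n_users, n_books, n_stored = 1259, 802870, 5638728

cells = n_users * n_books
dense_gb = cells * 8 / 1e9   # 8 bytes per int64 value
density = n_stored / cells   # fraction of cells that hold a stored value

print(f"dense array: {dense_gb:.1f} GB, filled cells: {density:.2%}")
```

Converting to CSR afterwards is what allows row indexing such as `ratings_mat[my_index, :]`; the COO format does not support slicing.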

## Finding users similar to us

In [45]:
interactions[interactions["user_id"] == "-1"]

Unnamed: 0,user_id,book_id,rating,user_index,book_index
0,-1,2517439,5,0,414880
1,-1,113576,5,0,38971
2,-1,35100,5,0,575858
3,-1,228221,5,0,356004
5,-1,17662739,5,0,214285
6,-1,356824,5,0,581743
7,-1,12125412,5,0,59763
8,-1,139069,5,0,124430
10,-1,76680,5,0,722098
11,-1,1898,5,0,276178


In [46]:
my_index = 0  # our user_id "-1" was assigned user_index 0 (see the rows above)

In [47]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(ratings_mat[my_index, :], ratings_mat).flatten()

In [50]:
similarity[2]

0.06143442518998915
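
Cosine similarity is the dot product of two users' rating vectors divided by the product of their lengths; a book a user has no entry for counts as 0. The pure-Python sketch below only illustrates that arithmetic on two hypothetical users stored as `{book_id: rating}` dicts. The notebook itself relies on scikit-learn's `cosine_similarity`, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse rating vectors given as dicts."""
    # Only books present in both vectors contribute to the dot product.
    dot = sum(r * v[b] for b, r in u.items() if b in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical ratings keyed by book_id.
me = {"a": 5, "b": 5}
other = {"a": 5, "c": 5}

print(cosine(me, other))
```

A stored rating of 0 contributes nothing to either the dot product or the lengths, so it behaves exactly like a missing entry.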

In [51]:
import numpy as np

indices = np.argpartition(similarity, -15)[-15:]

In [52]:
indices

array([1188,  942,  218,  129,  496,  435, 1208,  795, 1213, 1210, 1143,
        321,  294,  862,    0], dtype=int64)
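
`np.argpartition(similarity, -15)[-15:]` returns the positions of the 15 largest similarities, but in no guaranteed order. Our own row (index 0) is among them because a non-zero vector has similarity 1 with itself. A sketch on a made-up array, with an extra `argsort` step (not part of the original notebook) for when a ranked list is wanted:

```python
import numpy as np

# Hypothetical similarity scores; position 0 plays the role of our own row.
similarity = np.array([1.0, 0.10, 0.45, 0.05, 0.30, 0.20])
k = 3

# Positions of the k largest values, in arbitrary order.
top_k = np.argpartition(similarity, -k)[-k:]

# Optional: rank those k positions from most to least similar.
ranked = top_k[np.argsort(similarity[top_k])[::-1]]
print(ranked)
```

`argpartition` avoids sorting the whole array, which is why it is a reasonable choice when only the top few entries matter.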

In [57]:
similar_users = interactions[interactions["user_index"].isin(indices)].copy()

In [59]:
similar_users = similar_users[similar_users["user_id"] != "-1"]

In [60]:
similar_users

Unnamed: 0,user_id,book_id,rating,user_index,book_index
45312,4133,5359,3,942,632143
45313,4133,10464963,4,942,13492
45314,4133,3858,3,942,593622
45315,4133,11827808,4,942,51904
45316,4133,7913305,4,942,732465
...,...,...,...,...,...
5638521,712588,32388712,3,1143,543119
5638522,712588,16322,5,1143,183365
5638523,712588,860543,0,1143,759827
5638524,712588,853510,5,1143,756768


## Creating book recommendations

In [61]:
book_recs = similar_users.groupby("book_id")["rating"].agg(["count", "mean"])

In [62]:
book_recs

Unnamed: 0_level_0,count,mean
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,6,3.833333
100322,1,0.000000
100365,1,0.000000
10046142,1,0.000000
1005,3,0.000000
...,...,...
99561,2,2.500000
99610,1,3.000000
99664,1,4.000000
9969571,3,2.333333


In [63]:
books_titles = pd.read_json("book_titles.json")
books_titles["book_id"] = books_titles["book_id"].astype(str)

In [64]:
book_recs = book_recs.merge(books_titles, how = "inner", on = "book_id")

In [65]:
book_recs

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title
0,1,6,3.833333,Harry Potter and the Half-Blood Prince (Harry ...,1713866,https://www.goodreads.com/book/show/1.Harry_Po...,https://images.gr-assets.com/books/1361039191m...,harry potter and the halfblood prince harry po...
1,100322,1,0.000000,Assata: An Autobiography,11057,https://www.goodreads.com/book/show/100322.Assata,https://images.gr-assets.com/books/1328857268m...,assata an autobiography
2,100365,1,0.000000,The Mote in God's Eye,48736,https://www.goodreads.com/book/show/100365.The...,https://images.gr-assets.com/books/1399490037m...,the mote in gods eye
3,10046142,1,0.000000,Dancing in the Glory of Monsters: The Collapse...,2391,https://www.goodreads.com/book/show/10046142-d...,https://images.gr-assets.com/books/1328757755m...,dancing in the glory of monsters the collapse ...
4,1005,3,0.000000,Think and Grow Rich,87634,https://www.goodreads.com/book/show/1005.Think...,https://s.gr-assets.com/assets/nophoto/book/11...,think and grow rich
...,...,...,...,...,...,...,...,...
2843,99561,2,2.500000,Looking for Alaska,804587,https://www.goodreads.com/book/show/99561.Look...,https://images.gr-assets.com/books/1394798630m...,looking for alaska
2844,99610,1,3.000000,The Best Laid Plans,17434,https://www.goodreads.com/book/show/99610.The_...,https://images.gr-assets.com/books/1353374848m...,the best laid plans
2845,99664,1,4.000000,The Painted Veil,24606,https://www.goodreads.com/book/show/99664.The_...,https://images.gr-assets.com/books/1320421719m...,the painted veil
2846,9969571,3,2.333333,Ready Player One,376328,https://www.goodreads.com/book/show/9969571-re...,https://images.gr-assets.com/books/1500930947m...,ready player one


In [66]:
book_recs["adjusted_count"] = book_recs["count"]*(book_recs["count"]/book_recs["ratings"])

In [67]:
book_recs["Score"] = book_recs["mean"]*book_recs["adjusted_count"]
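
Put together, the two cells above compute Score = mean * count^2 / ratings: books that many of the similar users read (high `count`) are boosted, and books that are popular with everyone on Goodreads (high `ratings`) are discounted. The sketch below reproduces that arithmetic for one row, using the values for Kane and Abel that appear later in the notebook (count 4, mean 4.25, 75215 ratings); `score` is a hypothetical helper name.

```python
def score(count, mean, ratings):
    """Score as computed in the notebook: mean * count**2 / ratings."""
    adjusted_count = count * (count / ratings)
    return mean * adjusted_count

# Kane and Abel: 4 similar users, mean rating 4.25, 75215 Goodreads ratings.
print(round(score(4, 4.25, 75215), 6))
```

Because `count` is squared, one extra similar reader can outweigh a large difference in overall popularity, which is worth keeping in mind when reading the ranking.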

In [69]:
# Remove books we have already read from the recommendations
book_recs = book_recs[~book_recs["book_id"].isin(my_books["book_id"])]

In [72]:
book_recs.head()

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,Score
0,1,6,3.833333,Harry Potter and the Half-Blood Prince (Harry ...,1713866,https://www.goodreads.com/book/show/1.Harry_Po...,https://images.gr-assets.com/books/1361039191m...,harry potter and the halfblood prince harry po...,2.1e-05,8.1e-05
1,100322,1,0.0,Assata: An Autobiography,11057,https://www.goodreads.com/book/show/100322.Assata,https://images.gr-assets.com/books/1328857268m...,assata an autobiography,9e-05,0.0
2,100365,1,0.0,The Mote in God's Eye,48736,https://www.goodreads.com/book/show/100365.The...,https://images.gr-assets.com/books/1399490037m...,the mote in gods eye,2.1e-05,0.0
3,10046142,1,0.0,Dancing in the Glory of Monsters: The Collapse...,2391,https://www.goodreads.com/book/show/10046142-d...,https://images.gr-assets.com/books/1328757755m...,dancing in the glory of monsters the collapse ...,0.000418,0.0
4,1005,3,0.0,Think and Grow Rich,87634,https://www.goodreads.com/book/show/1005.Think...,https://s.gr-assets.com/assets/nophoto/book/11...,think and grow rich,0.000103,0.0


In [71]:
my_books.head()

Unnamed: 0,user_id,book_id,rating,title
0,-1,2517439,5,"The Forever War (The Forever War, #1)"
1,-1,113576,5,The Smartest Guys in the Room: The Amazing Ris...
2,-1,35100,5,Battle Cry of Freedom
3,-1,228221,5,The Mask of Command
5,-1,17662739,5,"2001: A Space Odyssey (Space Odyssey, #1)"


In [73]:
my_books["mod_title"] = my_books["title"].str.replace("[^a-zA-Z0-9 ]", "", regex = True).str.lower()

In [74]:
my_books["mod_title"] = my_books["mod_title"].str.replace(r"\s+", " ", regex = True)

In [75]:
book_recs = book_recs[~book_recs["mod_title"].isin(my_books["mod_title"])]

In [76]:
book_recs

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,Score
0,1,6,3.833333,Harry Potter and the Half-Blood Prince (Harry ...,1713866,https://www.goodreads.com/book/show/1.Harry_Po...,https://images.gr-assets.com/books/1361039191m...,harry potter and the halfblood prince harry po...,0.000021,0.000081
1,100322,1,0.000000,Assata: An Autobiography,11057,https://www.goodreads.com/book/show/100322.Assata,https://images.gr-assets.com/books/1328857268m...,assata an autobiography,0.000090,0.000000
2,100365,1,0.000000,The Mote in God's Eye,48736,https://www.goodreads.com/book/show/100365.The...,https://images.gr-assets.com/books/1399490037m...,the mote in gods eye,0.000021,0.000000
3,10046142,1,0.000000,Dancing in the Glory of Monsters: The Collapse...,2391,https://www.goodreads.com/book/show/10046142-d...,https://images.gr-assets.com/books/1328757755m...,dancing in the glory of monsters the collapse ...,0.000418,0.000000
4,1005,3,0.000000,Think and Grow Rich,87634,https://www.goodreads.com/book/show/1005.Think...,https://s.gr-assets.com/assets/nophoto/book/11...,think and grow rich,0.000103,0.000000
...,...,...,...,...,...,...,...,...,...,...
2843,99561,2,2.500000,Looking for Alaska,804587,https://www.goodreads.com/book/show/99561.Look...,https://images.gr-assets.com/books/1394798630m...,looking for alaska,0.000005,0.000012
2844,99610,1,3.000000,The Best Laid Plans,17434,https://www.goodreads.com/book/show/99610.The_...,https://images.gr-assets.com/books/1353374848m...,the best laid plans,0.000057,0.000172
2845,99664,1,4.000000,The Painted Veil,24606,https://www.goodreads.com/book/show/99664.The_...,https://images.gr-assets.com/books/1320421719m...,the painted veil,0.000041,0.000163
2846,9969571,3,2.333333,Ready Player One,376328,https://www.goodreads.com/book/show/9969571-re...,https://images.gr-assets.com/books/1500930947m...,ready player one,0.000024,0.000056


In [77]:
book_recs = book_recs[book_recs["count"] > 2]

In [78]:
book_recs = book_recs[book_recs["mean"] > 4]

In [79]:
top_recs = book_recs.sort_values("Score", ascending = False)

In [80]:
top_recs

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,Score
2558,78983,4,4.25,"Kane and Abel (Kane and Abel, #1)",75215,https://www.goodreads.com/book/show/78983.Kane...,https://s.gr-assets.com/assets/nophoto/book/11...,kane and abel kane and abel 1,0.000213,0.000904
1441,2767793,4,4.25,"The Hero of Ages (Mistborn, #3)",149260,https://www.goodreads.com/book/show/2767793-th...,https://images.gr-assets.com/books/1480717763m...,the hero of ages mistborn 3,0.000107,0.000456
2260,62291,5,4.8,"A Storm of Swords (A Song of Ice and Fire, #3)",477834,https://www.goodreads.com/book/show/62291.A_St...,https://images.gr-assets.com/books/1497931121m...,a storm of swords a song of ice and fire 3,5.2e-05,0.000251
1173,2318271,3,4.333333,The Last Lecture,245804,https://www.goodreads.com/book/show/2318271.Th...,https://images.gr-assets.com/books/1388075896m...,the last lecture,3.7e-05,0.000159
1100,22034,3,4.333333,The Godfather,259150,https://www.goodreads.com/book/show/22034.The_...,https://images.gr-assets.com/books/1394988109m...,the godfather,3.5e-05,0.00015
243,119322,4,4.25,"The Golden Compass (His Dark Materials, #1)",973154,https://www.goodreads.com/book/show/119322.The...,https://images.gr-assets.com/books/1505766203m...,the golden compass his dark materials 1,1.6e-05,7e-05
1906,4381,3,4.333333,Fahrenheit 451,591506,https://www.goodreads.com/book/show/4381.Fahre...,https://images.gr-assets.com/books/1351643740m...,fahrenheit 451,1.5e-05,6.6e-05
600,157993,3,4.333333,The Little Prince,763309,https://www.goodreads.com/book/show/157993.The...,https://images.gr-assets.com/books/1367545443m...,the little prince,1.2e-05,5.1e-05


## Improve the display of the books

In [81]:
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<img src="{}" width=50></img>'.format(val)

In [82]:
top_recs.style.format({'url':make_clickable, 'cover_image': show_image})

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,Score
2558,78983,4,4.25,"Kane and Abel (Kane and Abel, #1)",75215,Goodreads,,kane and abel kane and abel 1,0.000213,0.000904
1441,2767793,4,4.25,"The Hero of Ages (Mistborn, #3)",149260,Goodreads,,the hero of ages mistborn 3,0.000107,0.000456
2260,62291,5,4.8,"A Storm of Swords (A Song of Ice and Fire, #3)",477834,Goodreads,,a storm of swords a song of ice and fire 3,5.2e-05,0.000251
1173,2318271,3,4.333333,The Last Lecture,245804,Goodreads,,the last lecture,3.7e-05,0.000159
1100,22034,3,4.333333,The Godfather,259150,Goodreads,,the godfather,3.5e-05,0.00015
243,119322,4,4.25,"The Golden Compass (His Dark Materials, #1)",973154,Goodreads,,the golden compass his dark materials 1,1.6e-05,7e-05
1906,4381,3,4.333333,Fahrenheit 451,591506,Goodreads,,fahrenheit 451,1.5e-05,6.6e-05
600,157993,3,4.333333,The Little Prince,763309,Goodreads,,the little prince,1.2e-05,5.1e-05


## Next Steps

* Try raising or lowering the overlap threshold used to build filtered_overlap_users; a different cutoff might give better recommendations
* Try adding more books to your liked books
* Try using more similar users; here we took only the 15 most similar, one of which is ourselves