#### Author: Morgan Fang & Lingyu Guo. All rights Reserved.

# 1. Revised introduction of your problem

The problem we address is the design of a recommendation system that both recommends books users are likely to enjoy and introduces them to other readers with similar taste. The idea is well established on platforms such as YouTube and Facebook. YouTube, for example, predicts what a user wants to watch and surfaces those videos on the home page, and it also suggests channels related to recently watched videos. Our goal is similar. Recommendation methods already help users choose movies, music, products, and dishes, but those systems are not always based entirely on the user's own preferences: the preferences of other users and current popularity also weigh heavily on the results. We therefore want a method that recommends books based on each reader's own preferences and on the network of relationships among readers.

To test our recommendation system, we will use the SVD algorithm from the Surprise library to predict users' book ratings as a baseline. We will then evaluate the predictions with two metrics, RMSE (Root Mean Square Error) and MAE (Mean Absolute Error). We hope this research helps readers find books they like faster and encourages them to read more. Beyond convenience, the system will also build connections between similar readers, providing a more active and sociable reading environment in which users can share their opinions.
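As a minimal sketch of the two evaluation metrics named above, the block below computes RMSE and MAE for a handful of ratings. The `actual` and `predicted` values are made up for illustration and are not results from this project.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ratings on the 1-10 scale (illustration only)
actual = [8, 5, 10, 7]
predicted = [7, 5, 8, 9]

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE because the differences are squared before averaging, so reporting both gives a sense of how much the error is driven by a few badly predicted ratings.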


# 2. Review of the relevant prior work


In this notebook, we build a book review recommender system that uses book ratings to recommend books a reader may like, based on the books he or she has already read on the platform. Many studies have analyzed the recommendation systems used by popular online book retailers such as Amazon.com or Douban (Leino & Räihä, 2007). Leino and Räihä (2007) stated that the purpose of any recommender system is to direct users to the items that best satisfy them, but they did not discuss how to avoid recommending books that users dislike. In another study of the recommender systems of online sales platforms in India, Chandak et al. (2015) reported that techniques such as Collaborative Filtering, Content-based, and Demographic recommendation have been adopted, but that several drawbacks prevent these techniques from providing effective recommendations.

However, our book review recommender system differs from those in that we use several attributes to control what we recommend, combining user-based and item-based collaborative filtering with demographic and geographic recommendation techniques. We also aim to emphasize the impact of negative ratings when making recommendations. In this way, we expect to improve both the efficiency and the effectiveness of our system.


# 3. Background and business inspirations

Our initial inspiration is that some current book recommender systems confuse us because they do not understand what we like and dislike. When a system recommends books we dislike, it does nothing to encourage us to read more. Our aim is therefore to build a hybrid recommender system that encourages readers to buy more books and so promotes sales for online bookselling platforms. For example, a quick search for “apple” on amazon.com can return completely different kinds of results. What we want to do is predict what users really want to find, using collaborative filtering. In addition, we expect the recommendation system to bring convenience to online book users. In China especially, many readers read before they buy, so referrals from peers (ratings, reviews, etc.) are of vital importance. With our recommender system, platforms could therefore become more sociable.

# 4. Major Algorithms and methods

## 4.1 k Nearest Neighbors (kNN) algorithm

The kNN algorithm is an instance-based (lazy) learning method: the function is only approximated locally, and all computation is deferred until a query is evaluated. kNN is commonly used for classification and regression, but here we use it as an unsupervised nearest-neighbor search, which needs no labels to feed its learning process and lets us find the neighbors of a book or user directly from the rating data.
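To make the neighbor search concrete, here is a minimal sketch in plain Python. The book names and rating vectors are hypothetical, and Euclidean distance is an assumption chosen because it is the default metric of scikit-learn's `NearestNeighbors`, which the notebook uses later.

```python
import math

def k_nearest(query, items, k):
    """Return the names of the k items closest to `query`, excluding the query itself."""
    scored = sorted(
        (math.dist(items[query], vector), name)
        for name, vector in items.items()
        if name != query
    )
    return [name for _, name in scored[:k]]

# Hypothetical book-by-user rating vectors (0 means "not rated")
items = {
    "Book A": [9, 8, 0, 0],
    "Book B": [8, 9, 0, 1],
    "Book C": [0, 0, 7, 9],
    "Book D": [0, 1, 8, 8],
}

neighbors = k_nearest("Book A", items, k=2)
print(neighbors)
```

No training step is involved: every query simply compares the stored vectors, which is why the method is called lazy.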

## 4.2 Collaborative Filtering

The motivation for collaborative filtering is that people often get the best recommendations from others with similar tastes. We also use Singular Value Decomposition (SVD) in our recommendation system: for the books a user has not rated, we project the rating data into a lower-dimensional space via SVD and predict the user's ratings from similar books. In addition, we check whether the system makes sense by testing whether some of the recommendations are books the user actually rated highly.
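The idea of predicting missing ratings from a low-rank projection can be sketched with NumPy as follows. This is plain truncated SVD on a mean-filled toy matrix, a simplification under stated assumptions: the matrix `R` is hypothetical, and the `SVD` algorithm in the Surprise library is a matrix-factorization model trained by stochastic gradient descent rather than this direct decomposition.

```python
import numpy as np

# Hypothetical user-by-book rating matrix; np.nan marks missing ratings
R = np.array([
    [9.0, 8.0, np.nan, 2.0],
    [8.0, np.nan, 3.0, 1.0],
    [2.0, 1.0, 9.0, np.nan],
    [np.nan, 2.0, 8.0, 9.0],
])

known = ~np.isnan(R)
global_mean = R[known].mean()

# Fill the gaps with the global mean, then centre the matrix around zero
filled = np.where(known, R, global_mean) - global_mean

# Truncated SVD: keep only the k largest singular values
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
approx = (U[:, :k] * s[:k]) @ Vt[:k, :] + global_mean

# Predicted ratings for the cells that were missing in R
predictions = {
    (int(i), int(j)): float(approx[i, j])
    for i, j in zip(*np.where(~known))
}
print(predictions)
```

Keeping only `k` singular values forces the reconstruction to explain the ratings with a few latent factors, which is what lets the known ratings inform the missing cells.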

## 4.3 Affiliation Network

Affiliation networks help the recommender system focus not only on the actors in the social network but also on the groups they belong to. The emphasis is on subsets or groups of readers rather than on direct ties between individuals. In this project, we draw an affiliation network graph to outline the structure of the dataset we used.
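A small sketch may help show how an affiliation (two-mode) network groups readers through the books they share. The edge list below is hypothetical, and the one-mode projection is an illustration of the concept rather than a step the notebook performs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, book) pairs forming a two-mode network
edges = [
    ("u1", "Book A"), ("u2", "Book A"),
    ("u2", "Book B"), ("u3", "Book B"),
    ("u1", "Book C"), ("u2", "Book C"),
]

# Group readers by the book they are affiliated with
readers_of = defaultdict(set)
for user, book in edges:
    readers_of[book].add(user)

# One-mode projection: the weight is the number of books two readers share
shared = defaultdict(int)
for readers in readers_of.values():
    for a, b in combinations(sorted(readers), 2):
        shared[(a, b)] += 1

print(dict(shared))
```

Readers `u1` and `u3` never rated each other and have no direct tie, yet the projection still relates each of them to `u2` through common books, which is the kind of group structure an affiliation network exposes.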


# 5. Coding Part

## 5.1 Preprocessing

### 5.1.1 Install Packages

In [None]:
# Install plotly if you haven't installed it yet
#!pip install plotly

In [None]:
import numpy as np
import pandas as pd
import pandas_profiling
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from PIL import Image
import requests


In [None]:
from io import BytesIO
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots
from plotly import tools
import plotly.figure_factory as ff

### 5.1.2 import rating data

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
ratings = pd.read_csv('BX-Book-Ratings.csv',sep=";",on_bad_lines='skip', encoding='latin-1')
ratings.head(10)

In [None]:
ratings.info()

In [None]:
ratings.shape

In [None]:
rating = ratings['Book-Rating']
rating_mean = rating.mean() 
rating_mean

In [None]:
ratings[ratings == 0].count()

### 5.1.3 Deleting duplicate rows

In [None]:
ratings.drop_duplicates(inplace=True, keep='first') 

print(ratings.shape)

### 5.1.4 missing value

In [None]:
ratings = ratings.dropna()
print(ratings.shape)

### 5.1.5 remove rows where rating = 0

In [None]:
ratings['Book-Rating'].mean()

In [None]:
ratings = ratings[ratings['Book-Rating'] != 0]
ratings.info()

In [None]:
#rating_clean.to_csv("rating_clean.csv")

## 5.2 import books data and users data

### 5.2.1 books

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
books = pd.read_csv('BX_Books.csv',sep=";",on_bad_lines='skip', encoding='latin-1')
books.head(3)

### 5.2.2 users

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
users = pd.read_csv('BX-Users.csv',sep=";",on_bad_lines='skip', encoding='latin-1')
users

In [None]:
#users['User-ID'].describe()

In [None]:
# the users table was loaded as `users` in 5.2.2
print(users['User-ID'].isnull().sum())

In [None]:
users_df0 = users.dropna()

In [None]:
users_df0['User-ID'].astype(np.int64)

### 5.2.3 merging dataset

In [None]:
# `ratings` already has duplicates, missing values and zero ratings removed
rating_clean = ratings
rating_clean.head(3)

In [None]:
B1 = pd.merge(rating_clean, users_df0, on='User-ID', how='left')
B1

In [None]:
# the books table was loaded as `books` in 5.2.1
B2 = pd.merge(B1, books, on='ISBN', how='left')
B2

In [None]:
#B2.to_csv("B2.csv")

### 5.2.4 cleaning

In [None]:
B3 = B2.dropna()
print(B3.shape)

In [None]:
#B3

In [None]:
#ratings1 = B3['Book-Rating']
#ratings1_mean = ratings1.mean() 
#ratings1_mean

### 5.2.5 columns renaming

In [None]:
B3.rename(columns={
    'User-ID': 'User_ID', 
    'Book-Rating': 'Book_Rating', 
    'Book-Title': 'Book_Title',
    'Book-Author': 'Book_Author',
    'Year-Of-Publication': 'Year_Of_Publication'
}, inplace=True)


In [None]:
#B3.head(2)

In [None]:
#B3['Country'] = B3['Country'].apply(lambda x:x[:-1])
B3.head(3)

In [None]:
#B3.to_csv("B3.csv")

In [None]:
B3.info()

## 5.3 summary statistics

In [None]:
#B3.describe()

In [None]:
bn = B3["Book_Title"].value_counts()
bn

In [None]:
#B3["Book_Title"].describe()


In [None]:
B3["User_ID"].value_counts()


In [None]:
user = B3['User_ID'].astype("str")
user.describe()

### 5.3.1 age

Further Cleaning

In [None]:
B4 = B3.drop(B3[B3['Age'] >= 80].index)
B4.shape

In [None]:
B4 = B4.drop(B4[B4['Age'] <= 10].index)
B4.shape

### 5.3.2 Year_Of_Publication

In [None]:
B4 = B4.drop(B4[B4['Year_Of_Publication'] >= 2010].index)
B4.shape

In [None]:
B4['Year_Of_Publication'].describe()

In [None]:
B4 = B4.drop(B4[B4['Year_Of_Publication'] <= 1200].index)
B4.shape

In [None]:
bn = B4["Book_Title"].value_counts()
bn

In [None]:
# Keep one row per user so that each reader's age is counted once
user = B4.drop_duplicates(subset='User_ID', keep='first')
user['Age'].plot(kind='hist', title='Age Distribution')
plt.show()

# Draw the rating histogram on its own figure instead of on top of the age histogram
B4['Book_Rating'].plot(kind='hist', title='Book_Rating Distribution')
plt.show()

In [None]:
from matplotlib import pyplot as plt
from matplotlib import font_manager

data1 = B4.groupby(by="Book_Title").count().sort_values(by="Book_Rating", ascending=False)[:5]["Book_Rating"]
_x = data1.index
_y = data1.values


plt.figure(figsize=(29,8), dpi=100)
plt.bar(range(len(_x)), _y, width=0.5)

plt.xticks(range(len(_x)), _x)
plt.xlabel("Book Title")
plt.ylabel("Num Counts")
plt.title("Most Rated Books")
plt.show()


In [None]:
#use B4 as our primary dataset
B4.info()

In [None]:
user = B4['User_ID'].astype("str")

In [None]:
B4["User_ID"].value_counts()

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
# Top 5 users and their rating counts, copied from the value_counts() output above
user_ids = ['98391', '153662', '235105', '16795', '171118']
rating_counts = [5689,1833,1017,956,954]
ax.bar(user_ids,rating_counts)
plt.xlabel("User ID")
plt.ylabel("Num Counts")
plt.title("Top5 Rating Users")
plt.show()

In [None]:
#user = B4['User_ID'].astype("str")
#user.describe()

## 5.4 Network and Clustering 

### 5.4.1 Affiliation Network and visualization

#### 5.4.1.1 preprocessing

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import nxviz as nv
from nxviz import CircosPlot
from networkx.algorithms import bipartite

In [None]:
#import and set sample size to 5000
data = pd.read_csv('B4.csv')
data=data.head(5000)
#data.info()
data['ISBN']=pd.to_numeric(data['ISBN'],errors='coerce')
data.dropna(inplace=True)
data.head()
data.info()


In [None]:
data.drop(['Unnamed: 0','Image-URL-S','Image-URL-M','Image-URL-L'],axis=1,inplace=True)
#data = data[data['Book_Rating']==10]
data.head()
data.info()

#### 5.4.1.2 Adjacency Matrix

In [None]:
# Note: G is constructed in section 5.4.2 below; run that cell first
G.adj

In [None]:
print(nx.adjacency_matrix(G).todense())

### 5.4.2 Graphing and Visualizing

In [None]:
G = nx.Graph()
m=list(data['User_ID'])
n=list(data['Book_Title'])
zip_list=list(zip(m,n))
# Add nodes with the node attribute "bipartite"
G.add_nodes_from(m, bipartite=0)
G.add_nodes_from(n, bipartite=1)
G.add_edges_from(list(zip(m,n))) 
    
bipartite.is_bipartite(G)

In [None]:
pdd=pd.DataFrame(zip_list,columns=['source','target'])
pdd.head()
pdd.to_csv('edgelist.csv')

In [None]:
top_nodes = {n for n, d in G.nodes(data=True) if d["bipartite"] == 0}
bottom_nodes = set(G) - top_nodes

In [None]:
nodes = G.nodes()
degree = G.degree()
colors = [degree[n] for n in nodes]

pos = nx.bipartite_layout(G,top_nodes)
cmap = plt.cm.viridis_r
#cmap = plt.cm.Greys

vmin = min(colors)
vmax = max(colors)

fig = plt.figure(figsize = (15,15), dpi=100)

nx.draw(G,pos,alpha = 0.8, nodelist = nodes, node_color = 'r', node_size = 10, with_labels= True,font_size = 6,font_color='b', width = 0.2, cmap = cmap, edge_color ='blue')
#fig.set_facecolor('#0B243B')

plt.show()

In [None]:
c = CircosPlot(G,node_color='bipartite',node_grouping='bipartite')
c.draw()
plt.show()


In [None]:
#write to gexf file for further development and optimization
nx.write_gexf(G,'bi-network.gexf')
print('success!!')

### 5.4.3 Calculate centrality

In [None]:
cent = nx.degree_centrality(G)
name = []
centrality = []

for key, value in cent.items():
    name.append(key)
    centrality.append(value)

In [None]:
cent = pd.DataFrame()    
cent['name'] = name
cent['centrality'] = centrality
cent = cent.sort_values(by='centrality', ascending=False)

In [None]:
plt.figure(figsize=(10, 25))
bb = sns.barplot(x='centrality', y='name', data=cent[:15], orient='h')
bb = plt.xlabel('Degree Centrality')
bb = plt.ylabel('Node (user ID or book title)')
bb = plt.title('Top 15 Degree Centrality Scores in the User-Book Network')
plt.show()
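For intuition about what the scores above mean, degree centrality can be reproduced by hand: NetworkX's `degree_centrality` divides each node's degree by `n - 1`, where `n` is the number of nodes. The sketch below does this for a hypothetical five-node user-book graph.

```python
# Hypothetical edge list for a tiny user-book graph
edges = [
    ("u1", "Book A"), ("u2", "Book A"), ("u3", "Book A"),
    ("u2", "Book B"), ("u3", "Book B"),
]

# Count how many edges touch each node
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

# Normalise by the maximum possible degree in a simple graph, n - 1
n = len(degree)
centrality = {node: d / (n - 1) for node, d in degree.items()}

for node, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(node, score)
```

In a user-book graph a high score for a book means many distinct readers rated it, and a high score for a user means that reader rated many distinct books.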

### 5.4.4 Calculate betweenness

In [None]:
between = nx.betweenness_centrality(G)
name = []
betweenness = []

In [None]:
for key, value in between.items():
    name.append(key)
    betweenness.append(value)

bet = pd.DataFrame()
bet['name'] = name
bet['betweenness'] = betweenness
bet = bet.sort_values(by='betweenness', ascending=False)

In [None]:
plt.figure(figsize=(10, 25))
aa = sns.barplot(x='betweenness', y='name', data=bet[:10], orient='h')
aa = plt.xlabel('Betweenness Centrality')
aa = plt.ylabel('Node (user ID or book title)')
aa = plt.title('Top 10 Betweenness Centrality Scores in the User-Book Network')
plt.show()

### 5.4.5 K-means Clustering

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

In [None]:
#import and set sample size to 5000
data = pd.read_csv('B4.csv')
data=data.head(5000)
#data.info()
data['ISBN']=pd.to_numeric(data['ISBN'],errors='coerce')
data.dropna(inplace=True)
data1=data.copy()
data.drop(['Book_Title','Publisher','Unnamed: 0','Location','Image-URL-S','Image-URL-M','Image-URL-L'],axis=1,inplace=True)
data.head()
data.info()

#use get_dummies function to change those qualitative columns into binary ones
data_encoded = pd.get_dummies(data)
data_encoded
data1.info()

In [None]:
# minmax scaler (this part is referred from HW2)
scaler = MinMaxScaler()
train_X,test_X = train_test_split(data_encoded, test_size=0.3, random_state=930)
X_train = scaler.fit_transform(train_X)
X_test = scaler.transform(test_X)

X = scaler.transform(data_encoded)
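The scaling step can be illustrated without scikit-learn. The sketch below applies the min-max formula `(x - min) / (max - min)` with the minimum and maximum taken from the training split only, which mirrors calling `fit_transform` on the training data and `transform` on the rest. The age values are hypothetical.

```python
def fit_min_max(column):
    """Learn the scaling parameters from the training data only."""
    return min(column), max(column)

def transform(column, lo, hi):
    """Apply (x - min) / (max - min) with previously learned parameters."""
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical ages in a training and a test split
train_ages = [20, 30, 50]
test_ages = [25, 60]

lo, hi = fit_min_max(train_ages)
scaled_train = transform(train_ages, lo, hi)
scaled_test = transform(test_ages, lo, hi)

print(scaled_train)
print(scaled_test)
```

Because the parameters come from the training split, values in the other splits can land slightly outside the range 0 to 1, as the age 60 does here. That is expected and avoids leaking test information into the scaler.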


In [None]:
# KMeans
# choose k value with elbow method
K = range(1, 20)
meanDispersions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_train)

    meanDispersions.append(kmeans.inertia_)

plt.plot(K, meanDispersions, 'rx-')
plt.xlabel('k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Selecting k with the Elbow Method')
plt.show() 
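The quantity plotted in the elbow curve, `kmeans.inertia_`, is the sum of squared distances from each point to its assigned cluster center. A minimal sketch with hypothetical 2-D points and hand-picked centers shows why it drops sharply once `k` matches the natural grouping:

```python
def inertia(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(
        (x - centers[label][0]) ** 2 + (y - centers[label][1]) ** 2
        for (x, y), label in zip(points, labels)
    )

# Hypothetical points forming two well-separated groups
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

# k = 1: a single center at the overall mean
inertia_k1 = inertia(points, centers=[(5, 1)], labels=[0, 0, 0, 0])

# k = 2: one center per group
inertia_k2 = inertia(points, centers=[(0, 1), (10, 1)], labels=[0, 0, 1, 1])

print(inertia_k1, inertia_k2)
```

Adding more clusters always lowers inertia, so the elbow method looks for the value of `k` after which the decrease becomes small rather than for the minimum.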



In [None]:
# From the graph I would select k=2 as the optimal number of clusters

# cluster
kmeans = KMeans(n_clusters=2)

y1 = kmeans.fit_predict(X_train)
y2 = kmeans.predict(X_test)
whole_data = kmeans.predict(X)

#generate two subsets with data generated from last step 
train = pd.DataFrame(train_X,columns = data_encoded.columns)
test = pd.DataFrame(test_X,columns = data_encoded.columns)
#then add the prediction of clustering to these data
train['Cluster'] = y1
test['Cluster'] = y2
data1['Cluster'] = whole_data
#data.to_csv('clustered.csv')

In [None]:
# check the outcomes of each cluster
# numeric_only=True skips text columns, which pandas 2.0+ would otherwise reject
groupby1 = data1.groupby(by='Cluster').mean(numeric_only=True)
groupby1

## 5.5 Collaborative Filtering

### 5.5.1 Further data cleaning and merging

In [None]:
data.head(5)

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
books = pd.read_csv('BX_Books.csv',sep=';',on_bad_lines='skip', encoding='latin-1')
books.drop(['Image-URL-S','Image-URL-M','Image-URL-L'],axis=1,inplace=True)
books.head()
books.info()

In [None]:
books.rename(columns={'Book-Title':'book_name','Book-Author':'author','Year-Of-Publication':'year','Publisher':'publisher'},inplace=True)
books.head()

In [None]:
#change the column names to make life easier
users.rename(columns={'User-ID':'user_id','Location':'location','Age':'age'},inplace=True)
users.head()

In [None]:
ratings.rename(columns={'User-ID':'user_id','ISBN':'ISBN','Book-Rating':'book-rating'},inplace=True)

In [None]:
#check the data structure

print(books.info())
print(ratings.info())
print(users.info())

In [None]:
# reduce the size of the data by keeping only users who have rated more than 30 books (frequent users)
x = ratings['user_id'].value_counts()>30
x.sum()  # number of frequent users

In [None]:
# keep only the ratings made by frequent users; x[x] selects the user ids where the mask is True
index1 = x[x].index
ratings = ratings[ratings['user_id'].isin(index1)]
ratings.head()
ratings.info()
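The frequent-user filter can be checked on a toy frame. The comparison `value_counts() > n` yields a boolean Series that still lists every user in its index, so the mask has to be applied to itself before `.index` is taken. The data and the threshold of 2 below are hypothetical.

```python
import pandas as pd

# Hypothetical ratings: user 1 rated three books, user 3 two, user 2 one
toy = pd.DataFrame({"user_id": [1, 1, 1, 2, 3, 3]})

counts = toy["user_id"].value_counts()
mask = counts > 2              # boolean Series indexed by every user_id
frequent = mask[mask].index    # only the ids where the mask is True

filtered = toy[toy["user_id"].isin(frequent)]

print(len(mask.index), "users in the mask index")
print(list(frequent), "pass the threshold")
print(len(filtered), "ratings kept")
```

Using `mask.index` directly would keep all three users and leave the data unfiltered, which is easy to miss because the code still runs without error.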

### 5.5.2 collaborative filtering

In [None]:
merged = ratings.merge(books, on = 'ISBN')
merged.head()
merged.info()

In [None]:
#merge the data with number of ratings
merged_groupby=merged.groupby('book_name')['book-rating'].count().reset_index()
merged_groupby.rename(columns={'book-rating':'number_of_ratings'},inplace=True)
#filter books with more than 30 reviews 
merged_groupby=merged_groupby[merged_groupby['number_of_ratings']>30]
merged_groupby.head()

In [None]:
#merge the above two files together to get an integrated book review data with total review count for each book;then remove the duplicates
integrated_merged=merged.merge(merged_groupby, on='book_name')
integrated_merged.drop_duplicates(['user_id','book_name'],inplace=True)
integrated_merged.head()
integrated_merged.info()

### 5.5.3 constructing a pivot table

In [None]:
pivot=pd.pivot_table(integrated_merged, columns='user_id',index='book_name',fill_value=0,values='book-rating')
pivot.shape
pivot

In [None]:
pivot_csr=csr_matrix(pivot)
pivot_csr
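To see what the pivot step produces, the sketch below builds the same kind of book-by-user table from a hypothetical four-row frame, reusing the column names from the cells above.

```python
import pandas as pd

# Hypothetical ratings in the same long format as integrated_merged
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "book_name": ["Book A", "Book B", "Book A", "Book B"],
    "book-rating": [8, 6, 9, 7],
})

# One row per book, one column per user; unrated cells become 0
pivot_toy = pd.pivot_table(
    toy, index="book_name", columns="user_id",
    values="book-rating", fill_value=0,
)
print(pivot_toy)
```

A zero in this table means "not rated" rather than a rating of zero. Most cells in the real table are zeros of this kind, which is why converting it to a sparse `csr_matrix` saves memory.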

### 5.5.4 construct kNN models

In [None]:
#construct kNN models
model=NearestNeighbors(algorithm='brute')
model.fit(pivot_csr)

In [None]:
#example of k neighbors: query the model with the rating vector of one book (row 55)
distances,suggestions=model.kneighbors(pivot.iloc[55,:].values.reshape(1,-1))

In [None]:
distances

In [None]:
suggestions

In [None]:
#test the kNN collaborative filtering model
for i in range(len(suggestions)):
    print(pivot.index[suggestions[i]])
    print(suggestions[i])

# 6 Book Review Recommender System

## 6.1 Building the recommender system

In [None]:
list1=list(books['book_name'])
list1

In [None]:
#final recommender system function building
def book_recommend(book_name):
    """Print the books whose rating vectors are closest to `book_name`."""
    # Only books that survived the filtering in 5.5 have a row in the pivot table
    if book_name not in pivot.index:
        raise ValueError(f'"{book_name}" is not in the pivot table')
    book_id = np.where(pivot.index == book_name)[0][0]
    distances, recommendations = model.kneighbors(pivot.iloc[book_id,:].values.reshape(1,-1))
    print(f"For book \"{book_name}\" we would recommend the following:")
    # The query book is normally its own nearest neighbor, so skip it
    for title in pivot.index[recommendations[0]]:
        if title != book_name:
            print(title)

In [None]:
name=input('Please Input a book name: ')
book_recommend(name)

# 7 Conclusion and Business Insights

The datasets we used come from Kaggle and contain about 278 thousand anonymized users providing over 1 million ratings of about 271 thousand books. After dropping duplicate, missing, and abnormal values, the merged dataset contains about 262 thousand ratings.

Users in this dataset are mainly between 20 and 50 years old, and the book rating distribution is strongly left-skewed: most ratings fall between 5 and 10.

Additionally, the top 5 most rated books are “Wild Animus”, “The Lovely Bones”, “The Da Vinci Code”, “The Secret Life of Bees”, and “Bridget Jones's Diary”. We also identify the top 5 readers who gave the most ratings across different books.

Next, we use user IDs and book titles as nodes and users' ratings of books as edges to build an affiliation network. Once the nodes are connected, we can compute the degree centrality and betweenness centrality of each node in the graph.

More concretely, the graph shows weighted relationships between nodes, so we can easily tell which books are the most popular.

The kNN model is the core algorithm of our recommender system: it makes recommendations for a target book by finding its k nearest neighbors. We also use SVD to reduce dimensionality, and a pivot table to speed up the calculation.
