<center><h1>Recommendations at Scale !! </h1>
<img src="https://i.gifer.com/DTtv.gif" width="800">
</center>
<br><br>

* [Collaborative Filtering](#section-one)
* [Paradox of Choice](#section-two)
* [Pre-Processing](#section-four)
    - [Candidate Generation](#section-four-one)
    - [Gaussian Normalization](#section-four-two)
* [Machine Learning and Matrix Factorization Models](#section-five)
    - [Machine Learning based Model](#section-five-one)
    - [Matrix Factorization](#section-five-two)
* [Recommendation Evaluation](#section-six)
* [Scoring](#section-seven)
* [Neural Net Model](#section-eight)

<a id="section-one"></a>
# Collaborative Filtering  👥

**Collaborative filtering addresses some of the limitations of content-based filtering. It relies on the idea that similar users have similar tastes and therefore tend to read similar books.**

> 📌 **If a user named *Akhil* likes certain genres and authors, and another user, *Ram*, likes the same genres and authors, then *Akhil* and *Ram* are treated as similar users. If *Akhil* reads a book and likes it, that book is recommended to *Ram*.**

<img src="https://socital.com/wp-content/uploads/2019/09/Collaborative-filtering.jpg" width="400">
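
To make "similar users" concrete, here is a minimal sketch that compares users by the cosine similarity of their rating vectors. The rating matrix and the third user, *Sita*, are made up for illustration, and cosine similarity is only one possible choice of similarity measure, not necessarily the one used later in this notebook.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are books (0 = not rated).
user_names = ["Akhil", "Ram", "Sita"]
toy_ratings = np.array([
    [9, 8, 0, 0],   # Akhil
    [8, 9, 0, 1],   # Ram
    [0, 1, 9, 8],   # Sita
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_akhil_ram = cosine_similarity(toy_ratings[0], toy_ratings[1])
sim_akhil_sita = cosine_similarity(toy_ratings[0], toy_ratings[2])
print(f"Akhil vs Ram:  {sim_akhil_ram:.2f}")
print(f"Akhil vs Sita: {sim_akhil_sita:.2f}")
```

Akhil and Ram rate the same books highly, so their similarity is close to 1, while Akhil and Sita barely overlap and score close to 0. A book that Akhil liked would therefore be suggested to Ram rather than to Sita.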

<a id="section-two"></a>
# Paradox of Choice

### Can One Desire Too Much of a Good Thing?

Recommendations must be a small selection drawn from a large corpus. A common architecture for recommendation systems consists of the following components:
* candidate generation
* scoring
* re-ranking
<center><img src="https://i0.wp.com/doist.com/blog/wp-content/uploads/sites/3/2015/07/paradox-of-choice.jpg?quality=85&strip=all&ssl=1" width="400"></center>
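
The three components above can be sketched as three small functions chained together. Everything in this example is hypothetical: the catalogue, the field names, the thresholds, and the "newest first" re-ranking rule are assumptions chosen only to show how the stages hand results to each other.

```python
# Hypothetical mini-catalogue; every field below is made up for illustration.
catalogue = [
    {"title": "Book A", "n_ratings": 12, "predicted": 4.6, "year": 1999},
    {"title": "Book B", "n_ratings": 2,  "predicted": 4.8, "year": 2003},
    {"title": "Book C", "n_ratings": 40, "predicted": 4.9, "year": 1987},
    {"title": "Book D", "n_ratings": 9,  "predicted": 3.8, "year": 2001},
    {"title": "Book E", "n_ratings": 7,  "predicted": 4.2, "year": 2004},
]

def generate_candidates(items, min_ratings=5):
    """Stage 1: keep only items with enough ratings to be trusted."""
    return [item for item in items if item["n_ratings"] > min_ratings]

def score(candidates, top_n=3):
    """Stage 2: rank candidates by predicted rating and keep the best top_n."""
    ranked = sorted(candidates, key=lambda item: item["predicted"], reverse=True)
    return ranked[:top_n]

def rerank(scored):
    """Stage 3: reorder the short list by a secondary criterion (newest first)."""
    return sorted(scored, key=lambda item: item["year"], reverse=True)

shortlist = [item["title"] for item in rerank(score(generate_candidates(catalogue)))]
print(shortlist)
```

Each stage shrinks or reorders the list it receives, so the expensive scoring step only ever sees the candidates that survived the cheap first filter.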

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="section-four"></a>

## Importing Necessary Libraries

#### This notebook uses the Surprise library, a scikit-style library for recommendation algorithms


In [None]:
from surprise.model_selection import train_test_split
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor,KNNBasic,KNNWithMeans,KNNWithZScore,KNNBaseline,SVD,BaselineOnly,SVDpp,NMF,SlopeOne,CoClustering
from surprise.accuracy import rmse
from surprise import accuracy


## Dataset Loading

In [None]:
users = pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX-Users.csv', sep='\";\"', names=['User-ID', 'Location', 'Age'], encoding='latin-1', skiprows=1)
books = pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX-Books.csv', sep='\";\"', names=['ISBN', 'Book-Title' ,'Book-Author','Year-Of-Publication', 'Publisher', 'Image-Url-S', 'Image-Url-M', 'Image-Url-L'], encoding='latin-1', skiprows=1)
ratings = pd.read_csv('../input/bookcrossing-dataset/Book reviews/BX-Book-Ratings.csv', sep='\";\"', names=['User-ID', 'ISBN', 'Book-Rating'], encoding='latin-1', skiprows=1)

## Data Cleaning

* Replacing NULL values
* Removing Unnecessary characters

In [None]:
users['User-ID'] = users['User-ID'].str.replace("\"","")
users['Location'] = users['Location'].str.replace("\";NULL","")
users['Age'] = users['Age'].fillna("0")
users['Age'] = users['Age'].str.replace("\"","")
books['ISBN'] = books['ISBN'].str.replace("\"","")
books['Book-Title'] = books['Book-Title'].str.replace("\"","")
ratings['User-ID'] = ratings['User-ID'].str.replace("\"","")
ratings['Book-Rating'] = ratings['Book-Rating'].str.replace("\"","").astype(int)

<a id="section-four-one"></a>

## Candidate Generation

**This is the first stage of a recommender system. Not every book or user provides a reliable signal, so only "quality" books and users are kept as candidates. Users also differ in how they rate: some are stringent and some are lenient.**

Stringent users: they rarely give high ratings and mostly give medium ratings to books.

Lenient users: they give high ratings, such as 9 or 10, to most of the books they rate.
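
One way to see the difference between the two kinds of users is to compare per-user rating statistics. The sketch below uses a made-up table with two users; the cut-off of 8 on the mean rating is an arbitrary assumption for illustration and is not used anywhere else in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on the 1-10 scale: u1 is stringent, u2 is lenient.
toy = pd.DataFrame({
    "User-ID": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "Book-Rating": [5, 6, 5, 9, 10, 10],
})

# Mean and spread of the ratings given by each user.
user_stats = toy.groupby("User-ID")["Book-Rating"].agg(["mean", "std"])

# Assumption: a mean of 8 or more marks a lenient rater.
user_stats["profile"] = np.where(user_stats["mean"] >= 8, "lenient", "stringent")
print(user_stats)
```

A 6 from `u1` is that user's best rating, while a 9 from `u2` is that user's worst, which is why raw ratings from different users are not directly comparable.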

### Normalization of user ratings is required

In [None]:
# Quality books: keep books with more than 5 non-zero ratings

quality_ratings = ratings[ratings['Book-Rating']!=0]
quality_book = quality_ratings['ISBN'].value_counts().rename_axis('ISBN').reset_index(name = 'Count')
quality_book = quality_book[quality_book['Count']>5]['ISBN'].to_list()
quality_ratings = quality_ratings[quality_ratings['ISBN'].isin(quality_book)]
quality_ratings

In [None]:
# Quality users: keep users with more than 5 ratings on quality books

quality_user = quality_ratings['User-ID'].value_counts().rename_axis('User-ID').reset_index(name = 'Count')
quality_user = quality_user[quality_user['Count']>5]['User-ID'].to_list()
quality_ratings = quality_ratings[quality_ratings['User-ID'].isin(quality_user)]
quality_ratings

<a id="section-four-two"></a>
## Gaussian Normalization

* Each user's ratings are centered on that user's mean rating and divided by the square root of that user's sum of squared deviations
* The normalized ratings are then rescaled to a 0-5 rating scale



\begin{equation*}
R_{norm}^{u_i}(b) = \frac{R_b - R_{mean}^{u_i}}{\sqrt{\sum_{j} (R_{b_j} - R_{mean}^{u_i})^2}}
\end{equation*}
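
As a small worked example of the formula above, the sketch below normalizes the ratings of two made-up users with `groupby(...).transform`. The toy data is hypothetical, and the rescaling to the 0-5 scale is left out so that the raw normalized values stay visible.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: u2 rates every book 3 points higher than u1.
toy = pd.DataFrame({
    "User-ID": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "Book-Rating": [5, 6, 7, 8, 9, 10],
})

# Numerator: deviation of each rating from that user's mean rating.
diff = toy["Book-Rating"] - toy.groupby("User-ID")["Book-Rating"].transform("mean")

# Denominator: square root of the per-user sum of squared deviations.
denom = np.sqrt((diff ** 2).groupby(toy["User-ID"]).transform("sum"))

# A user whose ratings are all identical gives 0/0 = NaN; map that to 0.
toy["Norm-Rating"] = (diff / denom).fillna(0)
print(toy)
```

Although `u2` is the more lenient rater, both users end up with the same normalized values, because only the position of a rating relative to the user's own mean and spread survives the normalization.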

In [None]:
# Normalizing the Ratings

mean_rating_user = quality_ratings.groupby('User-ID')['Book-Rating'].mean().reset_index(name='Mean-Rating-User')
mean_data = pd.merge(quality_ratings, mean_rating_user, on='User-ID')
mean_data['Diff'] = mean_data['Book-Rating'] - mean_data['Mean-Rating-User']
mean_data['Square'] = (mean_data['Diff'])**2
norm_data = mean_data.groupby('User-ID')['Square'].sum().reset_index(name='Mean-Square')
norm_data['Root-Mean-Square'] = np.sqrt(norm_data['Mean-Square'])
mean_data = pd.merge(norm_data, mean_data, on='User-ID')
mean_data['Norm-Rating'] = mean_data['Diff']/(mean_data['Root-Mean-Square'])  
mean_data['Norm-Rating'] = mean_data['Norm-Rating'].fillna(0)
max_rating = mean_data['Norm-Rating'].max()
min_rating = mean_data['Norm-Rating'].min()
mean_data['Norm-Rating'] = 5*(mean_data['Norm-Rating'] - min_rating)/(max_rating-min_rating)
mean_data['Norm-Rating'] = np.ceil(mean_data['Norm-Rating']).astype(int)
norm_ratings = mean_data[['User-ID','ISBN','Norm-Rating']]
mean_data.sort_values('Norm-Rating')

In [None]:

reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(norm_ratings[['User-ID', 'ISBN', 'Norm-Rating']], reader)

<a id="section-five"></a>
# Machine Learning and Matrix Factorization Models 


Run 3-fold cross-validation and compare the RMSE of the machine learning and matrix factorization algorithms available in the Surprise library.


In [None]:
benchmark = []
for algorithm in [SVD(), 
                  SVDpp(), 
                  SlopeOne(), 
                  NMF(), 
                  NormalPredictor(), 
                  KNNBaseline(), 
                  KNNBasic(), 
                  KNNWithMeans(),
                  KNNWithZScore(), 
                  BaselineOnly(),
                  CoClustering()]:
    
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)

#### We can observe that the Baseline ML algorithm and SVD-based Matrix Factorization have the lowest RMSE.

> 📌  An RMSE of 0.62 means that a predicted rating is typically off by about 0.62 on the normalized rating scale
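
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it by hand with NumPy on four made-up rating pairs; the numbers are hypothetical and unrelated to the benchmark results.

```python
import numpy as np

# Hypothetical actual and predicted ratings for four (user, book) pairs.
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.5, 4.0, 2.5])

# Square the errors, average them, then take the square root.
errors = actual - predicted
rmse_value = np.sqrt(np.mean(errors ** 2))
print(f"RMSE: {rmse_value:.4f}")
```

Because the errors are squared before averaging, one large miss (the error of 1.0 here) raises the RMSE more than several small ones, so the value describes a typical error rather than a worst case.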

In [None]:
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
surprise_results

<a id="section-five-one"></a>
# Machine Learning based Model

### BaselineOnly

Algorithm predicting the baseline estimate for given user and item.

\begin{equation*}
b_{ui}=\mu+b_u+b_i
\end{equation*}

If user u is unknown, then the bias b<sub>u</sub> is assumed to be zero. The same applies for item i with b<sub>i</sub>.

The biases are fitted with SGD (Stochastic Gradient Descent), which minimizes the regularized loss with a regularization parameter of 0.5.
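
To illustrate the formula, the sketch below computes a baseline estimate by hand on a made-up table. As a simplifying assumption it takes the biases as plain mean deviations from the global mean, with no regularization and no SGD, so it is not how Surprise fits `BaselineOnly`; it only shows how the three terms add up.

```python
import pandas as pd

# Hypothetical ratings used only to illustrate the three terms.
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "item": ["a", "b", "a", "c", "b"],
    "rating": [4, 5, 2, 3, 4],
})

mu = toy["rating"].mean()  # global mean rating

# Assumption: unregularized biases, taken as mean deviations from mu.
b_u = toy.groupby("user")["rating"].mean() - mu
b_i = toy.groupby("item")["rating"].mean() - mu

def baseline(user, item):
    """b_ui = mu + b_u + b_i; an unknown user or item contributes a bias of 0."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

print(baseline("u1", "c"))        # known user, known item
print(baseline("new", "unseen"))  # falls back to the global mean
```

User `u1` rates above average and item `c` is rated below average, so the two biases partly cancel. For an unseen user and item, both biases are zero and the estimate is just the global mean.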

In [None]:
# Baseline

train_set, test_set = train_test_split(data, test_size=0.25)
algo = BaselineOnly(bsl_options={'method': 'sgd','learning_rate': .00005, 'n_epochs':30, 'reg':0.5})
fit = algo.fit(train_set)
pred = fit.test(test_set)
accuracy.rmse(pred)

<a id="section-five-two"></a>
# Matrix Factorization Method

### SVD

The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to Probabilistic Matrix Factorization.

The prediction $\hat{r}_{ui}$ is set as:
\begin{equation*}
\hat{r}_{ui}=\mu+b_u+b_i+q_i^Tp_u
\end{equation*}


To estimate all the unknowns, we minimize the following regularized squared error:

\begin{equation*}
\sum_{r_{ui}\in R_{train}}(r_{ui}-\hat{r}_{ui})^2+\lambda(b^2_i+b^2_u+||q_i||^2+||p_u||^2)
\end{equation*}
\end{equation*}
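
The sketch below shows the prediction rule and a single stochastic gradient step on the regularized squared error for one rating. All values (the mean, the biases, the learning rate `lr`, the regularization `reg`, and the number of factors) are made up, and this is a hand-written illustration rather than the code Surprise runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 3
mu, b_u, b_i = 3.0, 0.2, -0.1          # global mean and biases (made-up values)
p_u = rng.normal(0, 0.1, n_factors)    # user factor vector
q_i = rng.normal(0, 0.1, n_factors)    # item factor vector
r_ui, lr, reg = 4.0, 0.005, 0.02       # observed rating, learning rate, lambda

def predict(mu, b_u, b_i, q_i, p_u):
    """Predicted rating: mu + b_u + b_i + q_i^T p_u."""
    return mu + b_u + b_i + q_i @ p_u

err_before = r_ui - predict(mu, b_u, b_i, q_i, p_u)

# One SGD step: move each parameter along its negative gradient.
b_u += lr * (err_before - reg * b_u)
b_i += lr * (err_before - reg * b_i)
# Update both factor vectors from their old values at the same time.
p_u, q_i = (p_u + lr * (err_before * q_i - reg * p_u),
            q_i + lr * (err_before * p_u - reg * q_i))

err_after = r_ui - predict(mu, b_u, b_i, q_i, p_u)
print(f"error before: {err_before:.4f}, after: {err_after:.4f}")
```

Repeating this step over every rating in the training set for several epochs is what gradually fits the biases and the factor vectors; the regularization term keeps them from growing too large.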


In [None]:
# SVD 

algo = SVD(reg_bi = 0.5, lr_bi=0.005)
fit = algo.fit(train_set)
pred = fit.test(test_set)
accuracy.rmse(pred)

In [None]:
recommend = algo.trainset
users_norm = list(set(norm_ratings['User-ID'].to_list()))
books_norm = list(set(norm_ratings['ISBN'].to_list()))
norm_ratings['User-ID'].unique()

In [None]:
# to_inner_uid / to_inner_iid raise ValueError for ids missing from the trainset
pred_users = []
for user in users_norm:
    try:
        recommend.to_inner_uid(user)
        pred_users.append(user)
    except ValueError:
        pass
pred_books = []
for book in books_norm:
    try:
        recommend.to_inner_iid(book)
        pred_books.append(book)
    except ValueError:
        pass
    

In [None]:
pred_users[:5]

<a id="section-six"></a>
# Recommendation Evaluation

In [None]:
def recommend_books(user_id, count):
    result=[]
    for b in pred_books:
        result.append([b,algo.predict(user_id,b,r_ui=4).est])
    recom = pd.DataFrame(result, columns=['ISBN','Rating'])
    merge = pd.merge(recom,books, on='ISBN' )
    return merge.sort_values('Rating', ascending=False).head(count)

In [None]:
recommendation = recommend_books('36938', 5)

<a id="section-seven"></a>
# Scoring 

After candidate generation, another model scores and ranks the generated candidates to select the set of items to display. The recommendation system may have multiple candidate generators that use different sources, such as the following:

* User features that account for personalization.
* Geographic information.
* Popular or trending items.


Here, scoring is based on the year of publication.

In [None]:
scoring = recommendation.sort_values('Year-Of-Publication')
view = "".join(["<span><img src='"+a+"'></span>" for a in scoring['Image-Url-M'].to_list()])
scoring[['Book-Title']]

In [None]:
view

<center><h1>My Top 5 Recommendations</h1></center>
<span><img src='http://images.amazon.com/images/P/0446310786.01.MZZZZZZZ.jpg'></span><span><img src='http://images.amazon.com/images/P/059035342X.01.MZZZZZZZ.jpg'></span><span><img src='http://images.amazon.com/images/P/0316666343.01.MZZZZZZZ.jpg'></span><span><img src='http://images.amazon.com/images/P/0385504209.01.MZZZZZZZ.jpg'></span><span><img src='http://images.amazon.com/images/P/0142001740.01.MZZZZZZZ.jpg'></span>

<a id="section-eight"></a>
# Neural Net Model

Will be updated !!