In [None]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## **SURPRISE**
<font size=4><b>Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.</b></font>
<font size=4><b>Surprise was designed with the following purposes in mind:</b></font>
* <font size=3>Give users perfect control over their experiments.</font>
* <font size=3>Alleviate the pain of Dataset handling. Users can use both built-in datasets (Movielens, Jester), and their own custom datasets.</font>
* <font size=3>Provide various ready-to-use prediction algorithms such as baseline algorithms, neighborhood methods, matrix factorization-based (SVD, PMF, SVD++, NMF), and many others. Also, various similarity measures (cosine, MSD, Pearson…) are built-in.</font>
* <font size=3>Make it easy to implement new algorithm ideas.</font>
* <font size=3>Provide tools to evaluate, analyse and compare the algorithms’ performance. Cross-validation procedures can be run very easily using powerful CV iterators (inspired by scikit-learn excellent tools), as well as exhaustive search over a set of parameters.</font>

### Import the libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from surprise import Dataset
from surprise import Reader
from surprise import SVD,accuracy,KNNBasic
from surprise.model_selection import cross_validate,train_test_split
import missingno as msno

In [None]:
df=pd.read_csv('../input/book-ratings/ratings.csv')

In [None]:
df

### Checking for missing values

In [None]:
msno.matrix(df)

In [None]:
df.info()

In [None]:
df.shape

<font size=4><b>Checking which rating values users give most often.</b></font>

In [None]:
plt.style.use('default')
plt.figure(figsize=(6,5))
sns.countplot(x='rating',data=df,color='dimgrey')
sns.despine(bottom=True,left=True)
plt.title('Distribution of Ratings')

<font size=4><b>Checking distinct users</b></font>

In [None]:
df['user_id'].nunique()

In [None]:
# Set the style and context first so they apply to the figure created below
plt.style.use('default')
sns.set_context('paper')
# Ten books with the highest number of user ratings
highest_no_rat=df.groupby('book_id')['user_id'].count().sort_values().tail(10)
highest_no_rat.plot.barh(figsize=(7,4),color='dimgrey')
sns.despine()
plt.xlabel('No. of User Ratings',fontsize=10)
plt.title('Book_ID with highest No. of User Ratings',fontsize=10)

<font size=4><b>Reader Class</b></font>
<br>
<font size=3>The Reader class is used to parse a file containing ratings. The default format in which it accepts data is that each rating is stored in a separate line in the order user item rating. This order and the separator can be configured using parameters:</font>
* <font size=3>line_format is a string that stores the order of the data with field names separated by a space, as in "item user rating".</font>
* <font size=3>  sep is used to specify separator between fields, such as ','.</font>
* <font size=3>  rating_scale is used to specify the rating scale. The default is (1, 5).</font>
* <font size=3>  skip_lines is used to indicate the number of lines to skip at the beginning of the file. The default is 0.</font>

<font size=4><b>Dataset Module</b></font>
<br>
<font size=3>The Dataset module is used to load data from files, Pandas dataframes, or even built-in datasets available for experimentation.</font>

In [None]:
reader=Reader(rating_scale=(1,5))

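# load_from_df expects exactly three columns, in the order user, item, rating
data=Dataset.load_from_df(df[['user_id','book_id','rating']],reader)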

algo=SVD()

cross_validate(algo,data,cv=3)


### Predict the rating that user_id=2 gives to book_id=4081

In [None]:
algo.fit(data.build_full_trainset())  # refit on all ratings; cross_validate only trains on folds
algo.predict(2,4081,r_ui=4)

In [None]:
train,test=train_test_split(data,test_size=.25)

### **CoClustering**
* <font size=3>A collaborative filtering algorithm based on co-clustering.</font>
* <font size=3>Users and items are assigned to clusters $C_u$ and $C_i$, and to co-clusters $C_{ui}$.</font>
* <font size=3>The prediction $\hat{r}_{ui}$ is given as:</font>
* <font size=3>$\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i - \overline{C_i})$</font>
* <font size=3>where $\overline{C_{ui}}$ is the average rating of co-cluster $C_{ui}$, $\overline{C_u}$ is the average rating of u’s cluster, $\overline{C_i}$ is the average rating of i’s cluster, and $\mu_u$ and $\mu_i$ are the mean ratings of user u and item i.</font>
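
To make the formula concrete, here is a small worked example in plain NumPy. It is only a sketch: the rating matrix is invented, and the cluster assignments are picked by hand, whereas `CoClustering` learns them during `fit`. The function name `cocluster_predict` is hypothetical.

```python
import numpy as np

# Toy rating matrix (rows = users, columns = items); np.nan marks a missing rating.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 2.0, 1.0],
    [1.0, 2.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])

# Hand-picked cluster assignments (an assumption; the real algorithm learns these).
user_cluster = np.array([0, 0, 1, 1])
item_cluster = np.array([0, 0, 1, 1])

def cocluster_predict(u, i):
    """Co-cluster average plus the user's and item's offsets from their cluster averages."""
    mu_u = np.nanmean(R[u, :])               # mean rating of user u
    mu_i = np.nanmean(R[:, i])               # mean rating of item i
    in_cu = user_cluster == user_cluster[u]  # users in u's cluster
    in_ci = item_cluster == item_cluster[i]  # items in i's cluster
    avg_cu = np.nanmean(R[in_cu, :])         # average rating of u's cluster
    avg_ci = np.nanmean(R[:, in_ci])         # average rating of i's cluster
    avg_cui = np.nanmean(R[np.ix_(in_cu, in_ci)])  # average rating of the co-cluster
    return avg_cui + (mu_u - avg_cu) + (mu_i - avg_ci)

# Predict the missing rating of user 0 for item 2.
print(cocluster_predict(0, 2))
```

For user 0 and item 2 the terms are 4/3 for the co-cluster, 10/3 − 17/6 = 0.5 for the user offset and 11/3 − 3 = 2/3 for the item offset, which sum to 2.5.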

#### As with sklearn's estimators, we can use the **fit** and **test** methods to train the model and predict on unseen data

In [None]:
from surprise import CoClustering
co_clu=CoClustering()
co_clu.fit(train)
preds=co_clu.test(test)
accuracy.rmse(preds)

#### As before, let's predict for user_id=2 and book_id=4081, passing the original rating (r_ui=4)

In [None]:
co_clu.predict(2,4081,r_ui=4)

#### The rating predicted by the CoClustering algorithm was close to the actual rating.