In [264]:
import numpy as np

In [265]:
from surprise import Reader, Dataset


# 1. How Surprise works

## 1.1 Specifying how to read the file with a Reader object

In [266]:
reader = Reader(line_format='item user rating timestamp', sep=',')

In [267]:
# The Reader class automatically rearranges the indices to form tuples, i.e.
# (user, item, rating, timestamp), irrespective of the column order in the file.
reader.indexes

[1, 0, 2, 3]
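As a rough illustration of where `[1, 0, 2, 3]` comes from, the sketch below recomputes the same index list in plain Python and applies it to one line. It is not Surprise's own code: `field_indexes`, `parse_line`, and the sample line are hypothetical, with the sample line assumed to follow the `item,user,rating,timestamp` layout declared above.

```python
# Plain-Python sketch (not Surprise's implementation) of how a line format
# maps onto the fixed (user, item, rating, timestamp) order.
ENTITIES = ['user', 'item', 'rating', 'timestamp']


def field_indexes(line_format):
    """Return, for each entity, its column position in the file."""
    fields = line_format.split()
    return [fields.index(name) for name in ENTITIES if name in fields]


def parse_line(line, line_format, sep=','):
    """Reorder one line of the file into (user, item, rating, timestamp)."""
    columns = line.strip().split(sep)
    user, item, rating, timestamp = (columns[i] for i in field_indexes(line_format))
    return user, item, float(rating), timestamp


indexes = field_indexes('item user rating timestamp')
# Hypothetical line, laid out as item,user,rating,timestamp.
parsed = parse_line('14890,1013034,5,2005-07-07', 'item user rating timestamp')
print(indexes)
print(parsed)
```

The position of `user` in the format string is 1 and that of `item` is 0, which is why the first two entries are swapped.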

## 1.2 Load the dataset from file (sample.csv)

- The Reader object tells `Dataset` how the data is laid out in the file.

In [268]:
# Load the dataset
data = Dataset.load_from_file('sample.csv', reader=reader)

In [269]:
# Loading the dataset simply means turning each line of the file into a tuple.
data.ratings_file, data.raw_ratings[:3], len(data.raw_ratings)

('sample.csv',
 [('1013034', '14890', 5.0, '2005-07-07'),
  ('413056', '17169', 5.0, '2005-06-14'),
  ('823022', '9526', 3.0, '2003-04-01')],
 100000)

## 1.3 Preparing the dataset

### 1.3.1 Creating the Trainset and Testset objects:
- Use this if you want to split the data into training and test sets.

In [270]:
from surprise.model_selection import train_test_split

* Now let's make training and test sets to train and evaluate our algorithm.


* `train_test_split` returns a Trainset object and a testset (a list of tuples).

In [271]:
trainset, testset = train_test_split(data, test_size=.25)
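To make the effect of `test_size=.25` concrete, here is a minimal plain-Python sketch of a shuffle-and-cut split. It only mimics the proportions and the shape of the test set seen in this notebook. `simple_train_test_split` and the toy ratings are hypothetical, and the real function additionally wraps the training part in a Trainset object.

```python
import random


def simple_train_test_split(raw_ratings, test_size=0.25, seed=0):
    """Shuffle a copy of the ratings and cut off a `test_size` share for testing."""
    shuffled = list(raw_ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    n_train = len(shuffled) - n_test
    train = shuffled[:n_train]
    # The test set keeps raw ids and drops the timestamp: (user, item, rating).
    test = [(u, i, r) for (u, i, r, _) in shuffled[n_train:]]
    return train, test


# Toy (user, item, rating, timestamp) tuples, invented for illustration.
toy_ratings = [(f'u{k % 5}', f'i{k % 7}', float(k % 5 + 1), None) for k in range(20)]
train, test = simple_train_test_split(toy_ratings)
print(len(train), len(test))
```

With 20 toy ratings the split is 15 for training and 5 for testing. The same ratio applied to the 100000 ratings loaded above gives 75000 and 25000.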

* **The training set is a `Trainset` object that stores various information along with the user-item ratings, which is useful for further computations.**


* **Trainset** has ***defaultdicts of users and items***. `ur` is a dictionary keyed by user id (replaced with an index in order of occurrence) whose values are lists of (item, rating) pairs. Item ids are likewise replaced with indices in order of occurrence.


* The data is not corrupted by this. The same user-item ratings are preserved, only under the new indices.
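The re-indexing described above can be sketched in a few lines of plain Python. This is an assumption-laden illustration rather than the library's code: `build_inner_structures` and the toy ratings are hypothetical, while `ur` and `ir` are named after the Trainset attributes they imitate.

```python
from collections import defaultdict


def build_inner_structures(raw_ratings):
    """Map raw ids to consecutive inner ids and group ratings by user and item."""
    raw2inner_users, raw2inner_items = {}, {}
    ur = defaultdict(list)  # inner user id -> [(inner item id, rating), ...]
    ir = defaultdict(list)  # inner item id -> [(inner user id, rating), ...]
    for raw_user, raw_item, rating in raw_ratings:
        # setdefault hands out the next free index the first time an id is seen.
        uid = raw2inner_users.setdefault(raw_user, len(raw2inner_users))
        iid = raw2inner_items.setdefault(raw_item, len(raw2inner_items))
        ur[uid].append((iid, rating))
        ir[iid].append((uid, rating))
    return raw2inner_users, raw2inner_items, ur, ir


toy = [('alice', 'm1', 4.0), ('bob', 'm2', 3.0), ('alice', 'm2', 5.0)]
users, items, ur, ir = build_inner_structures(toy)
print(users, items)
print(dict(ur))
print(dict(ir))
```

Every rating is still attached to the same user and item. Only the labels have changed from raw strings to small integers, which are cheaper to use as array indices during training.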

In [272]:
type(trainset)

surprise.trainset.Trainset

In [273]:
list(trainset.ur.items())[:10]

[(0, [(0, 4.0)]),
 (1, [(1, 4.0), (2350, 3.0), (159, 3.0)]),
 (2, [(2, 5.0), (2659, 4.0), (73, 4.0)]),
 (3, [(3, 3.0)]),
 (4, [(4, 3.0), (1844, 3.0), (1246, 4.0)]),
 (5, [(5, 5.0)]),
 (6, [(6, 5.0)]),
 (7, [(7, 4.0), (673, 3.0)]),
 (8, [(8, 2.0), (3251, 4.0)]),
 (9, [(9, 3.0)])]

- To look up the ratings of any user, index the defaultdict with that user's inner id.

In [274]:
trainset.ur[4]

[(4, 3.0), (1844, 3.0), (1246, 4.0)]

* *Here is what the testset looks like*

In [275]:
# (user, item, rating) triples, using the raw ids from the file

type(testset), testset[:2]

(list, [('1974804', '7163', 3.0), ('286397', '2389', 3.0)])

### 1.3.2 Creating only a Trainset:
- Use this if you want to train on all the data in the input file.

- This is done only when you have a separate file for testing.

In [276]:
trainset_1 = data.build_full_trainset()

* The training set structure is the same as above (some info along with defaultdicts of users and items).

***We can get the mappings from raw user and item ids to inner_user_ids and inner_item_ids***

In [277]:
# The mappings from 'OriginalItemIds' to 'NewItemIds' are stored in "_raw2inner_id_items".
# Similarly, the mappings from 'OriginalUserIds' to 'NewUserIds' are in "_raw2inner_id_users".

print('User_mapping:',list(trainset._raw2inner_id_users.items())[:3])
print('Item_mapping:',list(trainset._raw2inner_id_items.items())[:3])

User_mapping: [('1427149', 0), ('997033', 1), ('1058842', 2)]
Item_mapping: [('16242', 0), ('8904', 1), ('4306', 2)]


***Some info that a Trainset holds***

In [278]:
# It has several attributes and methods that are useful while training the model.
# Example (knows_item expects an inner item id):
trainset.global_mean, trainset.n_items, trainset.knows_item(3241)
# There are many more; they are used inside the library during training.

(3.608373333333333, 8376, True)

## 1.4 Run the SVD algorithm on the data (Trainset object)

==============================================================
### A sneak peek into the algorithm

- It is not pure SVD; it is **inspired by SVD**.


- The standard SVD is adapted to the rating prediction problem by adding user biases and item biases. The resulting optimization problem is solved with **SGD** (**with regularization**).


- In other words, we minimize the prediction error while accounting for the biases of the user and the item.

http://www.albertauyeung.com/post/python-matrix-factorization/

**The implementation in the above blog post is completely different, but the concept is the same.**

> ***The predicted rating is calculated as follows***

\begin{aligned}
\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u
\end{aligned}
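In the formula, $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the user and item factor vectors. The sketch below evaluates the formula with NumPy and applies one regularized SGD step for a single observed rating. It uses the usual textbook update (move each parameter along the prediction error, shrunk by the regularization term) and is not taken from the library's source. The function names and toy values are hypothetical, and the defaults `lr=0.007` and `reg=0.02` simply mirror the values passed to `SVD` in this notebook.

```python
import numpy as np


def predict_rating(mu, b_u, b_i, p_u, q_i):
    """r_hat = mu + b_u + b_i + q_i . p_u"""
    return mu + b_u + b_i + np.dot(q_i, p_u)


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.007, reg=0.02):
    """One regularized SGD update for a single observed rating r_ui."""
    err = r_ui - predict_rating(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i.
    new_p_u = p_u + lr * (err * q_i - reg * p_u)
    new_q_i = q_i + lr * (err * p_u - reg * q_i)
    return new_b_u, new_b_i, new_p_u, new_q_i


# Toy values, invented for illustration.
mu, b_u, b_i = 3.6, 0.1, -0.2
p_u = np.array([0.1, 0.2])
q_i = np.array([0.3, -0.1])
r_ui = 4.0

before = predict_rating(mu, b_u, b_i, p_u, q_i)
b_u2, b_i2, p_u2, q_i2 = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
after = predict_rating(mu, b_u2, b_i2, p_u2, q_i2)
print(before, after)
```

After the step, the prediction sits slightly closer to the observed rating. Repeating such steps over all training ratings for several epochs is what fitting the model amounts to.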

==============================================================

In [279]:
# import algorithm.....
from surprise import SVD

* lr_all: learning rate for all parameters. (Separate learning rates can be set per parameter, e.g. `lr_bu`, `lr_pu`.)


* reg_all: regularization term for all parameters. (Separate regularization terms can be set per parameter, e.g. `reg_bu`, `reg_pu`.)


* Here `reg_all=0.02` is the library default, while `lr_all=0.007` is slightly above the default of 0.005.

> We can also build an unbiased SVD (`biased=False`).

In [280]:
algo = SVD(random_state=15, biased=True, verbose=True, lr_all=0.007, reg_all=0.02) 
algo

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1c125ea7ba8>

* **20 is the default number of epochs for SGD**

In [281]:
# Train the model on the dataset (trainset)
algo.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1c125ea7ba8>

## 1.5 Model Evaluation using RMSE

### 1.5.1 This is easy if we have created a testset

In [282]:
predictions = algo.test(testset)

In [283]:
print(type(predictions[0]))
print('-'*40)
print('Prediction Instance')
print('-'*40)
print(predictions[0])

<class 'surprise.prediction_algorithms.predictions.Prediction'>
----------------------------------------
Prediction Instance
----------------------------------------
user: 1974804    item: 7163       r_ui = 3.00   est = 3.33   {'was_impossible': False}


***We can directly compute RMSE from predictions***

In [284]:
# calculating RMSE 
from surprise import accuracy
accuracy.rmse(predictions, verbose=False)

1.0369002579766469
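For reference, RMSE is the square root of the mean squared difference between the true rating `r_ui` and the estimate `est`. A minimal hand-rolled version, with a hypothetical helper `rmse_from_pairs` and invented toy pairs, looks like this:

```python
import math


def rmse_from_pairs(pairs):
    """pairs: iterable of (true_rating, estimated_rating)."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy (r_ui, est) pairs, invented for illustration.
toy_pairs = [(3.0, 3.5), (4.0, 3.5), (5.0, 4.0)]
toy_rmse = rmse_from_pairs(toy_pairs)
print(toy_rmse)
```

The squared errors here are 0.25, 0.25 and 1.0, so the result is the square root of 0.5. Each `Prediction` shown above carries the `r_ui` and `est` fields, so the same computation applies to the list of predictions.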

### 1.5.2 If we have test data separately (not in testset format)

* ***1.5.2.1 We can predict the rating directly from the library***

In [285]:
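# We can also ask for an individual user-item prediction.
# NOTE: estimate() expects inner ids. Raw string ids are passed here, so neither
# the user nor the item is recognised and the result is just trainset.global_mean.
# For raw ids, use algo.predict(uid, iid) instead.
algo.estimate(u=str(2422966), i=str(6439))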

3.608373333333333
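The value returned here equals `trainset.global_mean` printed earlier, which is what a biased model falls back to when it recognises neither the user nor the item. A minimal sketch of that fallback logic follows, with toy dictionaries standing in for the learned parameters. All names and values are hypothetical, and this is not the library's internal code.

```python
def biased_estimate(u, i, mu, bu, bi, pu, qi):
    """Start from the global mean and add only the terms that are known."""
    est = mu
    known_user = u in bu
    known_item = i in bi
    if known_user:
        est += bu[u]
    if known_item:
        est += bi[i]
    if known_user and known_item:
        est += sum(q * p for q, p in zip(qi[i], pu[u]))
    return est


# Toy parameters keyed by inner ids (integers), invented for illustration.
mu = 3.6
bu, bi = {0: 0.1}, {0: -0.2}
pu, qi = {0: [0.1, 0.2]}, {0: [0.3, -0.1]}

known = biased_estimate(0, 0, mu, bu, bi, pu, qi)
unknown = biased_estimate('2422966', '6439', mu, bu, bi, pu, qi)
print(known, unknown)
```

A raw string id never matches an integer inner id, so every lookup fails and only the global mean remains.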

- ***1.5.2.2 Getting the ACTUAL RATING from the data with USERID and ITEMID***
    - (needed to compute the error, i.e. RMSE)

* ***ir*** is a defaultdict mapping each item to a list of (user, rating) tuples
* items and users are identified by their inner_item_id and inner_user_id

In [188]:
item_ratings = trainset.ir

- Let's say we want the rating of a given (user, item) pair from our training set.

In [286]:
u, i = '1013034', '14890' # actual ids from file

- Convert the **user id and item id** into **inner ids** of the training set.

In [225]:
iid = trainset.to_inner_iid(i) # Item_Inner_Id
uid = trainset.to_inner_uid(u)  # User_Inner_Id

 - We can get all the ***(user, rating)*** pairs for the users who rated that ***movie (i.e., iid)***
 
 - We convert them into a dictionary for faster access

In [226]:
item_rating_dict = dict(item_ratings[iid])

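- We look the rating up only if both ids are known to the training set. (`to_inner_iid` and `to_inner_uid` already raise a `ValueError` for raw ids that are not in it.)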

In [231]:
trainset.knows_item(iid) and trainset.knows_user(uid)

True

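* ***We can now get the rating given by any user who rated that movie.***
    * The check above only guards against unknown ids. A `KeyError` is still raised if this user did not rate this item in the training set, so `item_rating_dict.get(uid)` is the safer lookup.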

In [230]:
item_rating_dict[uid]

5.0
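The lookup steps above can be folded into one helper that returns `None` instead of raising when the pair is missing. This is a sketch under stated assumptions: `actual_rating` is a hypothetical name, and the toy mappings stand in for the Trainset's `ir` dictionary and its raw-to-inner id maps.

```python
def actual_rating(ir, raw2inner_users, raw2inner_items, raw_user, raw_item):
    """Return the stored rating, or None if the pair is not in the training data."""
    uid = raw2inner_users.get(raw_user)
    iid = raw2inner_items.get(raw_item)
    if uid is None or iid is None:
        return None  # unknown user or unknown item
    # ir[iid] is a list of (inner user id, rating) tuples for that item.
    return dict(ir[iid]).get(uid)


# Toy structures, invented for illustration.
raw2inner_users = {'1013034': 0, '413056': 1}
raw2inner_items = {'14890': 0, '17169': 1}
ir = {0: [(0, 5.0)], 1: [(1, 5.0)]}

found = actual_rating(ir, raw2inner_users, raw2inner_items, '1013034', '14890')
not_rated = actual_rating(ir, raw2inner_users, raw2inner_items, '1013034', '17169')
unknown_user = actual_rating(ir, raw2inner_users, raw2inner_items, '999', '14890')
print(found, not_rated, unknown_user)
```

Building `dict(ir[iid])` on every call is fine for a one-off lookup. For many lookups against the same item, building the dictionary once and reusing it, as the notebook does, is cheaper.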