In [1]:
import numpy as np

In [2]:
from surprise import Reader, Dataset


# 1. How Surprise works

## 1.1 Specifying how to read the file through a Reader object

In [3]:
reader = Reader(line_format='item user rating timestamp', sep=',')

In [4]:
# The Reader class automatically rearranges the indices to form tuples, i.e.
# (user, item, rating, timestamp), irrespective of the order in our file.
reader.indexes

[1, 0, 2, 3]
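
The reordering shown above can be pictured with a small pure-Python sketch. It does not use Surprise itself; `field_indexes` is a hypothetical helper that only mimics the idea of locating each canonical field (user, item, rating, timestamp) in the order given by `line_format`.

```python
def field_indexes(line_format):
    """Return the positions of user, item, rating (and timestamp) in a line."""
    fields = line_format.split()
    entities = ["user", "item", "rating"]
    if "timestamp" in fields:
        entities.append("timestamp")
    # For each entity in canonical order, find where it sits in the file's order.
    return [fields.index(entity) for entity in entities]


print(field_indexes("item user rating timestamp"))  # [1, 0, 2, 3]
print(field_indexes("user item rating"))            # [0, 1, 2]
```

With the file order `item user rating timestamp`, the user is found at position 1 and the item at position 0, which matches the `[1, 0, 2, 3]` reported by `reader.indexes`.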

## 1.2 Load the dataset from file (sample.csv)

- We pass the Reader object, which knows how the data is stored in the file.

In [5]:
# LOAD the dataset.....
data = Dataset.load_from_file('sample.csv',reader=reader)

In [6]:
# loading the dataset, literally means making tuples of our data..
data.ratings_file, data.raw_ratings[:3], len(data.raw_ratings)

('sample.csv',
 [('1013034', '14890', 5.0, '2005-07-07'),
  ('413056', '17169', 5.0, '2005-06-14'),
  ('823022', '9526', 3.0, '2003-04-01')],
 100000)
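
As a rough illustration of what "making tuples of our data" means, the sketch below turns one text line into a `(user, item, rating, timestamp)` tuple using the index order from the reader. `parse_line` is a hypothetical helper, and the input line is an assumed example of how the first rating might look in the file; it is not read from `sample.csv`.

```python
def parse_line(line, indexes, sep=","):
    """Split one line and reorder its fields to (user, item, rating, timestamp)."""
    parts = [p.strip() for p in line.split(sep)]
    user, item, rating = (parts[k] for k in indexes[:3])
    # The timestamp is optional, so only read it when an index is given.
    timestamp = parts[indexes[3]] if len(indexes) > 3 else None
    return user, item, float(rating), timestamp


# Assumed file line in 'item user rating timestamp' order.
row = parse_line("14890,1013034,5.0,2005-07-07", [1, 0, 2, 3])
print(row)  # ('1013034', '14890', 5.0, '2005-07-07')
```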

## 1.3 Preparing the dataset

### 1.3.1 Creating the Trainset and testset objects
- Use this when you want to split the data into train and test sets.

In [7]:
from surprise.model_selection import train_test_split

* Now let's make the training and testing sets to train our algorithm.


* It returns a Trainset object and a testset (a list of tuples).

In [8]:
trainset, testset = train_test_split(data, test_size=.25)
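
To make the idea of a 75/25 split concrete, here is a minimal pure-Python sketch. It is not Surprise's implementation: `split_ratings` is a hypothetical helper that shuffles a list of rating tuples and returns two plain lists, whereas `train_test_split` returns a Trainset object and a list.

```python
import random


def split_ratings(ratings, test_size=0.25, seed=0):
    """Shuffle the ratings and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(ratings)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Toy (user, item, rating) tuples, made up for illustration.
ratings = [(f"u{k}", f"i{k % 3}", float(k % 5 + 1)) for k in range(8)]
train, test = split_ratings(ratings)
print(len(train), len(test))  # 6 2
```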

* **`trainset` is a Trainset object that stores the user-item ratings along with extra information that is useful for further computations.**


* **Trainset** has ***defaultdicts for users and items***. The user dictionary maps each user id (replaced by an index in order of occurrence) to a list of (item, rating) pairs. Item ids are likewise replaced by indices in order of occurrence.


* The data is not corrupted by this. The same user-item ratings are preserved, only under the new indices.
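
The re-indexing described above can be sketched in plain Python. This is an assumption-laden illustration rather than Surprise's code: `build_inner_structures` is a hypothetical helper, and the toy ratings are made up. It assigns inner ids in order of first occurrence and fills user and item dictionaries shaped like `ur` and `ir`.

```python
from collections import defaultdict


def build_inner_structures(raw_ratings):
    """Map raw ids to inner ids and group ratings by user and by item."""
    raw2inner_users, raw2inner_items = {}, {}
    ur, ir = defaultdict(list), defaultdict(list)
    for raw_user, raw_item, rating in raw_ratings:
        # Assign the next free integer the first time an id is seen.
        u = raw2inner_users.setdefault(raw_user, len(raw2inner_users))
        i = raw2inner_items.setdefault(raw_item, len(raw2inner_items))
        ur[u].append((i, rating))
        ir[i].append((u, rating))
    return raw2inner_users, raw2inner_items, ur, ir


toy = [("alice", "m1", 4.0), ("bob", "m2", 3.0), ("alice", "m2", 5.0)]
users, items, ur, ir = build_inner_structures(toy)
print(users)     # {'alice': 0, 'bob': 1}
print(dict(ur))  # {0: [(0, 4.0), (1, 5.0)], 1: [(1, 3.0)]}
```

Because the mapping from raw to inner ids is kept, no information is lost: every rating can be traced back to its original user and item.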

In [9]:
type(trainset)

surprise.trainset.Trainset

In [10]:
list(trainset.ur.items())[:10]

[(0, [(0, 4.0)]),
 (1, [(1, 3.0), (6014, 3.0)]),
 (2, [(2, 4.0), (1598, 3.0)]),
 (3, [(3, 3.0), (1028, 5.0)]),
 (4, [(4, 5.0)]),
 (5, [(5, 2.0), (1954, 5.0), (2517, 4.0)]),
 (6, [(6, 2.0), (1349, 3.0)]),
 (7, [(7, 4.0)]),
 (8, [(8, 4.0)]),
 (9, [(9, 3.0), (4202, 5.0)])]

- To look up the ratings of any user, index the defaultdict with that user's inner id.

In [11]:
trainset.ur[4]

[(4, 5.0)]

* *Here is what the testset looks like*

In [12]:
# (user, item, rating) triples, using the raw ids from the file

type(testset), testset[:2]

(list, [('286106', '13795', 4.0), ('1544601', '1307', 2.0)])

### 1.3.2 Creating only Trainset:
- Use this when you want to train on all the data in the input file.

- This only makes sense when you have a separate file for testing.

In [13]:
trainset_1 = data.build_full_trainset()

* The training set structure is the same as above (some metadata along with defaultdicts for users and items).

***We can get the mappings from raw user and item ids to inner user and item ids***

In [14]:
# The mapping from original item ids to new (inner) item ids is stored in "_raw2inner_id_items".
# Similarly, the mapping from original user ids to new (inner) user ids is stored in "_raw2inner_id_users".

print('User_mapping:',list(trainset._raw2inner_id_users.items())[:3])
print('Item_mapping:',list(trainset._raw2inner_id_items.items())[:3])

User_mapping: [('548549', 0), ('2411423', 1), ('118552', 2)]
Item_mapping: [('4745', 0), ('4171', 1), ('14185', 2)]


***Some information that a Trainset holds***

In [15]:
# It has several attributes and methods that are useful while training the model.
# Example:
trainset.global_mean, trainset.n_items, trainset.knows_item(3241)
# There are many more; they are used inside the library during training.

(3.6051466666666667, 8367, True)

## 1.4 Run the SVD algorithm on the data (Trainingset Object)

==============================================================
### A sneak peek at the algorithm

- It is not pure SVD; it is **inspired by SVD**.


- The standard SVD is adapted to the rating prediction problem by adding user biases and item biases. The resulting optimization problem is solved with **SGD** (**with regularization**).


- That is, we minimize the prediction error while accounting for the biases of the user and the item.

http://www.albertauyeung.com/post/python-matrix-factorization/

**The implementation in the blog post above is completely different, but the concept is the same.**

> ***The predicted rating is computed as follows:***

\begin{aligned}
\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u
\end{aligned}
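
The formula above can be evaluated by hand for a single user-item pair. The sketch below is plain Python with made-up numbers: `predict_rating` is a hypothetical helper, `mu` stands for the global mean, `b_u` and `b_i` for the user and item biases, and `p_u`, `q_i` for the latent factor vectors.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Global mean + user bias + item bias + dot product of the factor vectors."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))


# Toy values, chosen only for illustration.
est = predict_rating(mu=3.6, b_u=0.2, b_i=-0.1, p_u=[0.5, -1.0], q_i=[0.4, 0.3])
print(f"{est:.2f}")  # 3.60
```

Here the dot product is 0.4 * 0.5 + 0.3 * (-1.0) = -0.1, so the estimate is 3.6 + 0.2 - 0.1 - 0.1 = 3.6.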

==============================================================

In [16]:
# import algorithm.....
from surprise import SVD

* lr_all : learning rate for all parameters (separate rates can be set per parameter, e.g. `lr_bu`, `lr_pu`).


* reg_all: regularization term for all parameters (separate terms can be set per parameter, e.g. `reg_bu`, `reg_pu`).


* Both have defaults in the library; the cell below sets them explicitly.

> We can also build an unbiased SVD by setting `biased=False`.
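
To show where the learning rate and the regularization term enter, here is a minimal sketch of one regularized SGD update for a single rating. It is a plain-Python illustration under the assumption of the usual update rules for biased matrix factorization, not the library's code; `sgd_step` is a hypothetical helper and the numbers are made up.

```python
def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr_all=0.007, reg_all=0.02):
    """Apply one SGD update for a single observed rating r_ui."""
    dot = sum(q * p for q, p in zip(q_i, p_u))
    err = r_ui - (mu + b_u + b_i + dot)  # prediction error before the update
    # Each parameter moves along the error gradient and is shrunk by reg_all.
    new_b_u = b_u + lr_all * (err - reg_all * b_u)
    new_b_i = b_i + lr_all * (err - reg_all * b_i)
    new_p_u = [p + lr_all * (err * q - reg_all * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr_all * (err * p - reg_all * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err


# Toy starting point: zero biases and small factors.
b_u, b_i, p_u, q_i, err_before = sgd_step(5.0, 3.6, 0.0, 0.0, [0.1, 0.1], [0.1, 0.1])
# A second call reports the error that remains after the first update.
*_, err_after = sgd_step(5.0, 3.6, b_u, b_i, p_u, q_i)
print(err_before > err_after > 0)  # True
```

A larger `lr_all` takes bigger steps per rating, while a larger `reg_all` pulls the biases and factors more strongly towards zero.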

In [17]:
algo = SVD(random_state=15, biased=True, verbose=True, lr_all=0.007, reg_all=0.02) 
algo

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1cf240bdf98>

* **20 is the default number of epochs for SGD**

In [18]:
 # Train the model on the dataset (trainset)
algo.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1cf240bdf98>

## 1.5 Model Evaluation using RMSE

### 1.5.1 Evaluation is easy when we have created a testset

In [19]:
predictions = algo.test(testset)

In [35]:
predictions[0]

Prediction(uid='286106', iid='13795', r_ui=4.0, est=3.4096806037322622, details={'was_impossible': False})

In [31]:
testset[:10]

[('286106', '13795', 4.0),
 ('1544601', '1307', 2.0),
 ('1149601', '14909', 4.0),
 ('709168', '1144', 3.0),
 ('2580101', '3224', 5.0),
 ('429701', '17671', 4.0),
 ('1152574', '25', 5.0),
 ('398579', '14240', 5.0),
 ('455843', '9818', 3.0),
 ('528287', '6206', 4.0)]

In [20]:
print(type(predictions[0]))
print('-'*40)
print('Prediction Instance')
print('-'*40)
print(predictions[0])

<class 'surprise.prediction_algorithms.predictions.Prediction'>
----------------------------------------
Prediction Instance
----------------------------------------
user: 286106     item: 13795      r_ui = 4.00   est = 3.41   {'was_impossible': False}


***We can directly compute RMSE from predictions***

In [21]:
# calculating RMSE 
from surprise import accuracy
accuracy.rmse(predictions, verbose=False)

1.0394377863252822
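
For reference, RMSE is the square root of the mean squared difference between the true ratings and the estimates. The sketch below computes it in plain Python on made-up `(true, estimate)` pairs; `rmse` here is a hypothetical helper, not `accuracy.rmse`.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy pairs: squared errors are 1, 1, 0 and 4, so the mean is 1.5.
pairs = [(4.0, 3.0), (2.0, 3.0), (5.0, 5.0), (3.0, 1.0)]
print(round(rmse(pairs), 4))  # 1.2247
```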

### 1.5.2 If we have the test data separately (not in testset format)

* ***1.5.2.1 We can predict the rating directly from the library***

In [22]:
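# We can also ask for an individual user-item prediction.
# Note: estimate() works with inner ids; given raw string ids it treats the
# pair as unknown and falls back to the global mean (see the output below).
# For raw ids, algo.predict(uid, iid) is the intended entry point.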
algo.estimate(u=str(2422966), i=str(6439))

3.6051466666666667

- ***1.5.2.2 Getting the actual rating from the data with the user id and item id***
    - (to compute the error, i.e. RMSE)

* ***ir*** is a defaultdict keyed by item, where each value is a list of (user, rating) tuples
* Items and users are stored as inner item ids and inner user ids

In [23]:
item_ratings = trainset.ir

- Let's say we want the rating of a (user, item) pair from our training set.

In [24]:
u, i = '1013034', '14890' # actual ids from file

- Convert the **user id and item id** into **inner ids** of the training set. This raises a ValueError if the id is not part of the trainset.

In [26]:
iid = trainset.to_inner_iid(i) # Item_Inner_Id
uid = trainset.to_inner_uid(u)  # User_Inner_Id

ValueError: User 1013034 is not part of the trainset.

 - We can get all the ***(user, rating)*** pairs for the users who rated that ***movie (i.e. iid)***
 
 - We convert that list into a dictionary for faster access

In [27]:
item_rating_dict = dict(item_ratings[iid])

- We should look up a rating only if both the user and the item are in our training set. Note that `knows_user` and `knows_item` expect inner ids.

In [33]:
trainset.knows_item(i),  trainset.knows_user(u)

(False, False)

* ***We can now get the rating of any user who rated that movie.***
    * Checking membership before accessing the rating avoids a KeyError. The dictionary is keyed by inner user ids, so a raw id such as `u` is not a valid key.

In [34]:
item_rating_dict[u]

KeyError: '1013034'
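
The lookup steps above can be combined into one guarded function. The sketch below is plain Python on toy dictionaries that only imitate the shape of the raw-to-inner mappings and of `ir`; `actual_rating` is a hypothetical helper. With a real Trainset, `to_inner_uid` and `to_inner_iid` play the role of the mapping lookups and raise a ValueError for unknown ids.

```python
def actual_rating(raw_user, raw_item, raw2inner_users, raw2inner_items, ir):
    """Return the stored rating for a raw (user, item) pair, or None if absent."""
    if raw_user not in raw2inner_users or raw_item not in raw2inner_items:
        return None  # user or item never appeared in the training data
    uid = raw2inner_users[raw_user]
    iid = raw2inner_items[raw_item]
    # ir[iid] is a list of (inner_user_id, rating); a dict gives fast lookup.
    return dict(ir[iid]).get(uid)


# Toy data, made up for illustration.
raw2inner_users = {"alice": 0, "bob": 1}
raw2inner_items = {"m1": 0, "m2": 1}
ir = {0: [(0, 4.0)], 1: [(1, 3.0), (0, 5.0)]}

print(actual_rating("alice", "m2", raw2inner_users, raw2inner_items, ir))  # 5.0
print(actual_rating("carol", "m1", raw2inner_users, raw2inner_items, ir))  # None
```

Returning `None` for unknown ids or unrated pairs avoids both the ValueError and the KeyError seen in the cells above.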

# 2. Building Our Own Prediction Algorithm

- There is a class named **AlgoBase** from which every prediction algorithm in this library is derived.


- It has an ***estimate()*** method, which is called by ***predict(uid, iid)*** and returns the ***predicted rating***.


- We can build our own prediction algorithm on top of it.

In [1]:
from surprise import AlgoBase, Dataset, PredictionImpossible

In [2]:
class OurPredictionAlgo_1(AlgoBase):
    
    # Calling AlgoBase's __init__ is mandatory.
    def __init__(self):
        AlgoBase.__init__(self)

    def estimate(self, u, i):
        # We can write our own predictor here.
        # We can also return extra information in a ***details*** dictionary.
        return self.trainset.global_mean

Or, we can make our own prediction algorithms that use similarity or baseline models.

In [3]:
class OurPredictionAlgo_2(AlgoBase):
    
    def __init__(self, sim_options={}, bsl_options={}):
        """If we want to set some parameters of AlgoBase, we can pass like this"""
        # We have to call the __init__ of AlgoBase, It is mandatory, 
        AlgoBase.__init__(self, sim_options= sim_options,
                         bsl_options = bsl_options)
    
    
    def fit(self, trainset):
        """
        Train the model: compute the baselines (user and item biases) and the
        similarities (user-user or item-item), and store them on the model so
        that they can be used while predicting.
        """
        AlgoBase.fit(self, trainset)
        # Both are computed from the options provided at initialization.
        self.bu, self.bi = self.compute_baselines()
        self.sim = self.compute_similarities()
        return self
    
    
    def estimate(self, u, i):
        """Code snippet adapted from the docs (github.com/.../examples/..).
        1. Compute similarities between u and v, where v ranges over all other
           users that have also rated item i.
        2. Sort these neighbors by similarity.
        """
        
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')

        neighbors = [(v, self.sim[u, v]) for (v, r) in self.trainset.ir[i]]
        neighbors = sorted(neighbors, key=lambda x: x[1], reverse=True)

        print('The 3 nearest neighbors of user', str(u), 'are:')
        for v, sim_uv in neighbors[:3]:
            print('user {0:} with sim {1:1.2f}'.format(v, sim_uv))

        # ... Aaaaand return the baseline estimate anyway ;)
        bsl = self.trainset.global_mean + self.bu[u] + self.bi[i]
        return bsl

        

#### Our probe.txt format : 

- movie_id_1:
    - user_id_1
    - user_id_2
    - ......
    - ......
- movie_id_2:
    - user_id_42
    - user_id_15
    - ......
    - ......
- ......
    - ......
    - ......
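
A file in this layout can be flattened into (user, movie) pairs with a few lines of Python. The sketch below assumes the format described above, where a line ending in `:` starts a new movie and every following line is a user id. `parse_probe` is a hypothetical helper and the sample text is made up.

```python
def parse_probe(text):
    """Turn 'movie_id:' blocks followed by user ids into (user, movie) pairs."""
    pairs = []
    movie_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(":"):
            movie_id = line[:-1]  # a new movie block starts here
        elif movie_id is not None:
            pairs.append((line, movie_id))
    return pairs


sample = "1:\n111\n222\n2:\n333\n"
print(parse_probe(sample))  # [('111', '1'), ('222', '1'), ('333', '2')]
```

The ids are kept as strings so that they match the raw ids Surprise reads from a file.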

# 3. Loading the actual Netflix Prize dataset from our CSV file

- Our train.csv file is formatted as (movie, user, rating, date) and is free of duplicates.


- We will form our trainset from this train.csv file.

 ### 3.1 Our train.csv format : 

- 
    - **_movie_id_, _user_id_, _rating_, _date_** ( This line is not present in file..)

    -  1        ,  1488844 ,    3.0  , 2005-09-06
    -  1        ,   822109 ,    5.0  , 2005-05-13
    -  1        ,   885013 ,    4.0  , 2005-10-19
    -  1        ,     30878,    4.0  , 2005-12-26
    - ........................................
    - .................................
    - ...........................


In [4]:
from surprise import Reader, Dataset

- The Reader reorders the format to ***(user, item, rating [, timestamp])***.

In [5]:
nfx_reader = Reader(line_format='item user rating timestamp', sep=',')
nfx_reader.indexes

[1, 0, 2, 3]

In [6]:
from datetime import datetime

In [7]:
# load the data from the file in specified format
start = datetime.now()
nfx_data = Dataset.load_from_file('train.csv', reader=nfx_reader)
print("time :",datetime.now()-start)

time : 0:03:20.163452


In [16]:
print("There are ",len(nfx_data.raw_ratings)," tuples(ratings)")

There are  100479540  tuples(ratings)


In [19]:
nfx_data.raw_ratings[:3]

[('1488844', '1', 3.0, '2005-09-06'),
 ('822109', '1', 5.0, '2005-05-13'),
 ('885013', '1', 4.0, '2005-10-19')]