1. The basics

    Creating your own prediction algorithm is pretty simple: an algorithm is nothing but a class derived from AlgoBase that has an estimate method. This is the method that is called by the predict() method. It takes an inner user id and an inner item id (see this note), and returns the estimated rating $\hat{r}_{ui}$:

In [2]:
from surprise import AlgoBase
from surprise import Dataset
from surprise.model_selection import cross_validate

class MyOwnAlgorithm(AlgoBase):
    
    def __init__(self):
        # Always call base method before doing anything.
        AlgoBase.__init__(self)
        
    def estimate(self,u,i):
        return 3

In [3]:
data = Dataset.load_builtin('ml-100k')

In [4]:
algo = MyOwnAlgorithm()

In [5]:
cross_validate(algo,data,verbose=True)

Evaluating RMSE, MAE of algorithm MyOwnAlgorithm on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2447  1.2486  1.2380  1.2423  1.2470  1.2441  0.0038  
MAE (testset)     1.0038  1.0040  0.9952  1.0011  1.0041  1.0017  0.0034  
Fit time          0.00    0.02    0.02    0.02    0.02    0.01    0.01    
Test time         0.08    0.06    0.05    0.06    0.06    0.06    0.01    


{'test_rmse': array([1.24472889, 1.24863926, 1.23798223, 1.24227614, 1.24703649]),
 'test_mae': array([1.00385, 1.004  , 0.9952 , 1.00115, 1.0041 ]),
 'fit_time': (0.0,
  0.015622615814208984,
  0.015620708465576172,
  0.015618562698364258,
  0.015624046325683594),
 'test_time': (0.07807540893554688,
  0.062484025955200195,
  0.04686689376831055,
  0.062480926513671875,
  0.062484025955200195)}

    This algorithm is the dumbest we could have thought of: it just predicts a rating of 3, regardless of users and items.

    If you want to store additional information about the prediction, estimate can also return a tuple: the estimated rating followed by a dictionary of details:

In [2]:
def estimate(self,u,i):
    details = {'info1':'That was',
               'info2':'easy stuff:)'}
    return 3,details

    This dictionary will be stored in the prediction as the details field and can be used for later analysis.

2. The fit method

    Now, let’s make a slightly cleverer algorithm that predicts the average of all the ratings in the trainset. As this is a constant value that does not depend on the current user or item, we would rather compute it once and for all. This can be done by defining the fit method:

In [4]:
import numpy as np  # needed for np.mean below

class MyOwnAlgorithm(AlgoBase):
    
    def __init__(self):
        
        # Always call base method before doing anything.
        AlgoBase.__init__(self)
        
    def fit(self,trainset):
        
        # Here again: call base method before doing anything.
        AlgoBase.fit(self,trainset)
        
        # Compute the average rating. We might as well use the 
        # trainset.global_mean attribute ;)
        self.the_mean = np.mean([r for (_, _, r) in self.trainset.all_ratings()])
        
        return self
    
    def estimate(self,u,i):
        
        return self.the_mean

    The fit method is called, for example, by the cross_validate function at each fold of a cross-validation process (but you can also call it yourself). Before doing anything else, it should call the base class fit() method.

    Note that the fit() method returns self. This allows chained expressions such as algo.fit(trainset).test(testset).

3. The trainset attribute

    Once the base class fit() method has returned, all the info you need about the current training set (rating values, etc…) is stored in the self.trainset attribute. This is a Trainset object that has many attributes and methods of interest for prediction.

    To illustrate its usage, let’s make an algorithm that predicts the average of three values: the mean of all ratings, the mean rating of the user, and the mean rating of the item. The last two are included only when the user or item is known to the trainset:

In [6]:
def estimate(self,u,i):
    
    sum_means = self.trainset.global_mean
    div = 1
    
    if self.trainset.knows_user(u):
        sum_means += np.mean([r for (_, r) in self.trainset.ur[u]])
        div += 1
    if self.trainset.knows_item(i):
        sum_means += np.mean([r for (_, r) in self.trainset.ir[i]])
        div += 1
        
    return sum_means / div

    Note that it would have been a better idea to compute all the user and item means once in the fit method, instead of repeating the same computations on every call to estimate.
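
    A minimal sketch of that idea follows. ToyTrainset and MeansAlgorithm are hypothetical stand-ins written so the snippet runs without Surprise installed; only the member names ur, ir, global_mean, knows_user and knows_item mirror the Trainset members used above. In a real algorithm the class would derive from AlgoBase, and fit would first call AlgoBase.fit(self, trainset).

```python
from collections import defaultdict


class ToyTrainset:
    """Hypothetical stand-in exposing only the Trainset members used here."""

    def __init__(self, ratings):
        # ratings: iterable of (inner_user_id, inner_item_id, rating)
        self.ur = defaultdict(list)  # user -> [(item, rating), ...]
        self.ir = defaultdict(list)  # item -> [(user, rating), ...]
        total, count = 0.0, 0
        for u, i, r in ratings:
            self.ur[u].append((i, r))
            self.ir[i].append((u, r))
            total += r
            count += 1
        self.global_mean = total / count

    def knows_user(self, u):
        return u in self.ur

    def knows_item(self, i):
        return i in self.ir


def mean_rating(pairs):
    """Mean of the ratings in a list of (id, rating) pairs."""
    return sum(r for _, r in pairs) / len(pairs)


class MeansAlgorithm:
    """Sketch only: a real algorithm would derive from AlgoBase."""

    def fit(self, trainset):
        self.trainset = trainset
        # Compute every per-user and per-item mean once, at fit time.
        self.user_means = {u: mean_rating(p) for u, p in trainset.ur.items()}
        self.item_means = {i: mean_rating(p) for i, p in trainset.ir.items()}
        return self

    def estimate(self, u, i):
        # Only dictionary lookups remain at prediction time.
        means = [self.trainset.global_mean]
        if self.trainset.knows_user(u):
            means.append(self.user_means[u])
        if self.trainset.knows_item(i):
            means.append(self.item_means[i])
        return sum(means) / len(means)


toy = ToyTrainset([(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0)])
algo = MeansAlgorithm().fit(toy)
print(algo.estimate(0, 0))  # known user and known item
print(algo.estimate(7, 1))  # unknown user, known item
```

    The trade-off is a little extra memory for the two dictionaries in exchange for constant-time work per prediction.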

4. When the prediction is impossible

    It’s up to your algorithm to decide if it can or cannot yield a prediction. If the prediction is impossible, then you can raise the PredictionImpossible exception. You’ll need to import it first:

In [7]:
from surprise import PredictionImpossible

    This exception will be caught by the predict() method, and the estimation $\hat{r}_{ui}$ will be set according to the default_prediction() method, which can be overridden. By default, it returns the average of all ratings in the trainset.
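
    The control flow described above can be sketched in plain Python. Everything here is a simplified stand-in, not Surprise's actual implementation: the local PredictionImpossible class, the ToyAlgo class, its known_users set and the keys of the details dictionary are all assumptions made for illustration.

```python
class PredictionImpossible(Exception):
    """Local stand-in for surprise.PredictionImpossible."""


class ToyAlgo:
    """Hypothetical class showing the fallback logic only."""

    global_mean = 3.5      # pretend this was computed during fit
    known_users = {0, 1}   # pretend these inner ids were seen during fit

    def estimate(self, u, i):
        if u not in self.known_users:
            raise PredictionImpossible('User is unknown.')
        return 4.0

    def default_prediction(self):
        # Overridable fallback; here, the mean of all training ratings.
        return self.global_mean

    def predict(self, u, i):
        details = {}
        try:
            est = self.estimate(u, i)
            details['was_impossible'] = False
        except PredictionImpossible as e:
            # estimate gave up: fall back to the default prediction.
            est = self.default_prediction()
            details['was_impossible'] = True
            details['reason'] = str(e)
        return est, details


algo = ToyAlgo()
print(algo.predict(0, 5))  # estimate succeeds
print(algo.predict(9, 5))  # estimate raises, so the fallback is used
```

    Overriding default_prediction in a subclass changes the fallback value without touching the estimate logic.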

5. Using similarities and baselines

    Should your algorithm use a similarity measure or baseline estimates, you’ll need to accept bsl_options and sim_options as parameters to the __init__ method, and pass them along to the base class. See how to use these parameters in the Using prediction algorithms section.

    The compute_baselines() and compute_similarities() methods can be called in the fit method (or anywhere else).

In [8]:
# From file examples/building_custom_algorithms/with_baselines_or_sim.py
class MyOwnAlgorithm(AlgoBase):
    
    def __init__(self,sim_options={},bsl_options={}):
        
        AlgoBase.__init__(self,sim_options=sim_options,bsl_options=bsl_options)
        
    def fit(self,trainset):
        
        AlgoBase.fit(self,trainset)
        
        # compute baselines and similarities
        self.bu, self.bi = self.compute_baselines()
        self.sim = self.compute_similarities()
        
        return self
    
    def estimate(self,u,i):
        
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')
            
        # Compute similarities between u and v, where v describes all other
        # users that have also rated item i.
        neighbors = [(v, self.sim[u, v]) for (v, r) in self.trainset.ir[i]]
        # Sort these neighbors by similarity
        neighbors = sorted(neighbors, key=lambda x: x[1], reverse=True)

        print('The 3 nearest neighbors of user', str(u), 'are:')
        for v, sim_uv in neighbors[:3]:
            print('user {0:} with sim {1:1.2f}'.format(v, sim_uv))

        # ... Aaaaand return the baseline estimate anyway ;)
        bsl = self.trainset.global_mean + self.bu[u] + self.bi[i]
        return bsl
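
    For illustration, option dictionaries of the kind this constructor accepts might look as follows. The values are arbitrary choices for the example, not recommendations. The instantiation line is left as a comment because it needs the MyOwnAlgorithm class above and an installed Surprise.

```python
# Similarity options: which measure to use, and whether similarities are
# computed between users (True) or between items (False).
sim_options = {
    'name': 'cosine',
    'user_based': True,
}

# Baseline options: the fitting method and its hyperparameters.
bsl_options = {
    'method': 'als',
    'n_epochs': 5,
    'reg_u': 12,
    'reg_i': 5,
}

# algo = MyOwnAlgorithm(sim_options=sim_options, bsl_options=bsl_options)
print(sim_options)
print(bsl_options)
```

    Since estimate above indexes self.sim with two user ids, user_based should stay True for this particular class.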

    Feel free to explore the prediction_algorithms package source to get an idea of what can be done.