In [247]:
import pandas as pd
import random

In [46]:
class Data:
    def __init__(self):
        
        self.available_databases=['ml_100k', 'ml_1m','jester']
        #self.the_data_reader= getattr(Data, 'read_'+database_name.lower())
        #self.the_data_reader= 'read_'+database_name.lower()
    def show_available_databases(self):
        print('The avaliable database are:')
        for i,database in enumerate(self.available_databases):
            print(str(i)+': '+database)            


        
    def read_data(self,database_name):
        self.database_name=database_name
        self.the_data_reader= getattr(self, 'read_'+database_name.lower())
        self.the_data_reader()
        # Datasets for collaborative filtering must be a dataframe of ratings
        # with three columns, in this order: the user (raw) ids,
        # the item (raw) ids, and the ratings.

    def _read_builtin(self,builtin_name):
        # Load one of surprise's built-in datasets and keep only the three
        # columns needed for collaborative filtering.
        from surprise import Dataset
        data = Dataset.load_builtin(builtin_name)
        self.df = pd.DataFrame(data.raw_ratings, columns=['userID','itemID','rating','timestamp'])
        self.df.drop(columns=['timestamp'],inplace=True)

    def read_ml_100k(self):
        self._read_builtin('ml-100k')

    def read_ml_1m(self):
        self._read_builtin('ml-1m')

    def read_jester(self):
        self._read_builtin('jester')

## Dataset

The *Data* class reads one of the 3 *datasets* provided by the *surprise* package (ml_100k, ml_1m, jester). An example of the resulting dataframe is shown below:

In [255]:
data=Data()
data.read_data('ml_100k')
data.df

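|       | userID | itemID | rating |
|-------|--------|--------|--------|
| 0     | 196    | 242    | 3.0    |
| 1     | 186    | 302    | 3.0    |
| 2     | 22     | 377    | 1.0    |
| 3     | 244    | 51     | 2.0    |
| 4     | 166    | 346    | 1.0    |
| ...   | ...    | ...    | ...    |
| 99995 | 880    | 476    | 3.0    |
| 99996 | 716    | 204    | 5.0    |
| 99997 | 276    | 1090   | 1.0    |
| 99998 | 13     | 225    | 2.0    |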


The datasets shipped with the package consist of numeric ratings given by users (1 to 5 for the MovieLens datasets; Jester uses a scale from -10 to 10). For this reason, the only recommender systems that can be built from them are rating-based (collaborative filtering). Systems based on system-user dialogue, natural language, or demographic data are left out.

The classes need to be extended to cover recommender systems based on these other kinds of user information.
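The rating range matters later, when the ratings are handed to surprise through `Reader(rating_scale=...)`. As a minimal sketch, the observed range could be read from the dataframe instead of being hardcoded. The helper name `infer_rating_scale` and the toy data are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

def infer_rating_scale(df, rating_col="rating"):
    """Return the (lowest, highest) rating observed in the dataframe.

    Hypothetical helper: the observed range may be narrower than the
    true scale of the dataset if the extremes never occur in the data.
    """
    ratings = df[rating_col]
    return (float(ratings.min()), float(ratings.max()))

# Toy ratings in the same three-column layout produced by the Data class.
toy = pd.DataFrame(
    {
        "userID": ["196", "186", "22"],
        "itemID": ["242", "302", "377"],
        "rating": [3.0, 3.0, 1.0],
    }
)
scale = infer_rating_scale(toy)
print(scale)
```

The resulting tuple could then be passed as the `rating_scale` argument, keeping in mind the caveat noted in the docstring.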

In [172]:
class Method:
    def __init__(self,df):
        
        self.df=df
        self.available_methods=[
            'surprise.NormalPredictor',
            'surprise.BaselineOnly',
            'surprise.KNNBasic',
            'surprise.KNNWithMeans',
            'surprise.KNNWithZScore',
            'surprise.KNNBaseline',
            'surprise.SVD',
            'surprise.SVDpp',
            'surprise.NMF',
            'surprise.SlopeOne',
            'surprise.CoClustering',
        ]        
        
    def show_methods(self):
        print('The avaliable methods are:')
        for i,method in enumerate(self.available_methods):
            print(str(i)+': '+method)



    def run(self,the_method):
        self.the_method=the_method
        if self.the_method.startswith('surprise'):
            self.run_surprise()
        elif self.the_method.startswith('Gensim'):
            # NOTE: run_gensim is not defined in this notebook yet.
            self.run_gensim()
        elif self.the_method.startswith('Transformers-'):
            # NOTE: run_transformers is not defined in this notebook yet.
            self.run_transformers()
        else:
            print('This method is not defined! Try another one.')

    def run_surprise(self):
        from surprise import Reader
        from surprise import Dataset
        from surprise.model_selection import train_test_split
        reader = Reader(rating_scale=(1, 5))
        data = Dataset.load_from_df(self.df[['userID', 'itemID', 'rating']], reader)        
        trainset, testset = train_test_split(data, test_size=.30)
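        # Resolve the algorithm class by name with getattr rather than
        # eval/exec, which is fragile and unsafe for arbitrary strings.
        import surprise
        the_method=self.the_method.replace("surprise.", "")
        the_algorithm=getattr(surprise, the_method)()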
        the_algorithm.fit(trainset)
        self.predictions=the_algorithm.test(testset)
        list_predictions=[(uid,iid,r_ui,est) for uid,iid,r_ui,est,_ in self.predictions]        
        self.predictions_df = pd.DataFrame(list_predictions, columns =['user_id', 'item_id', 'rating','predicted_rating'])

## Method

The *surprise* library provides 11 prediction algorithms that estimate the rating a user would give to an item, based on several different *collaborative-filtering* techniques. The available models are listed below with a brief description. For more details, see the [package documentation.](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html)

*random_pred.NormalPredictor*:
Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

*baseline_only.BaselineOnly*:
Algorithm predicting the baseline estimate for given user and item.

*knns.KNNBasic*:
A basic collaborative filtering algorithm.

*knns.KNNWithMeans*:
A basic collaborative filtering algorithm, taking into account the mean ratings of each user.

*knns.KNNWithZScore*:
A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

*knns.KNNBaseline*:
A basic collaborative filtering algorithm taking into account a baseline rating.

*matrix_factorization.SVD*:
The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.

*matrix_factorization.SVDpp*:
The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

*matrix_factorization.NMF*:
A collaborative filtering algorithm based on Non-negative Matrix Factorization.

*slope_one.SlopeOne*:
A simple yet accurate collaborative filtering algorithm.

*co_clustering.CoClustering*:
A collaborative filtering algorithm based on co-clustering.
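To make the first entry in the list concrete, the idea behind *NormalPredictor* can be sketched in a few lines of NumPy: estimate the mean and standard deviation of the training ratings, draw predictions from that normal distribution, and clip them to the rating scale. This is an illustrative sketch, not the library's implementation. The function name `normal_predictor`, the `seed` parameter, and the toy ratings are assumptions.

```python
import numpy as np

def normal_predictor(train_ratings, n_predictions, rating_scale=(1, 5), seed=0):
    """Draw random ratings from a normal fitted to the training ratings."""
    rng = np.random.default_rng(seed)
    mu = np.mean(train_ratings)     # estimated mean of the training ratings
    sigma = np.std(train_ratings)   # estimated standard deviation
    samples = rng.normal(mu, sigma, size=n_predictions)
    low, high = rating_scale
    # Keep every prediction inside the valid rating range.
    return np.clip(samples, low, high)

preds = normal_predictor([3.0, 3.0, 1.0, 2.0, 1.0], n_predictions=4)
print(preds)
```

Because the predictions ignore both the user and the item, a predictor of this kind mostly serves as a reference point for the other methods.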

A custom dataframe can be passed to this class as an argument. It must have 3 columns with the following names: ['userID', 'itemID', 'rating'].
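A custom dataframe could be checked against this requirement before it is handed to `Method`. The sketch below assumes a hypothetical helper, `check_ratings_df`, that is not part of the notebook. It raises an error when a required column is missing and otherwise returns the three columns in the expected order.

```python
import pandas as pd

REQUIRED_COLUMNS = ["userID", "itemID", "rating"]

def check_ratings_df(df):
    """Return df restricted to the required columns, or raise ValueError."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Reorder and drop any extra columns so the layout is predictable.
    return df[REQUIRED_COLUMNS]

# Toy custom dataframe with one extra column that is dropped by the check.
custom = pd.DataFrame(
    {"userID": [0, 1], "itemID": [10, 11], "rating": [4, 2], "extra": ["a", "b"]}
)
clean = check_ratings_df(custom)
print(list(clean.columns))
```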

In [181]:
class Evaluator:

    def __init__(self,predictions_df):

        self.available_evaluators=['surprise.rmse','surprise.mse',
                                   'surprise.mae','surprise.fcp']
        self.predictions_df=predictions_df
        
    def show_evaluators(self):
        print('The avaliable evaluators are:')
        for i,evaluator in enumerate(self.available_evaluators):
            print(str(i)+': '+evaluator)
        


    def run(self,the_evaluator):        
        self.the_evaluator=the_evaluator
        if(self.the_evaluator[0:8]=='surprise'):
            self.run_surprise()
        else:
            print('This evaluator is not available!')

    def run_surprise(self):
        from surprise import accuracy
        from surprise.prediction_algorithms.predictions import Prediction
        # Rebuild surprise Prediction objects from the dataframe rows.
        predictions=[Prediction(row['user_id'],row['item_id'],row['rating'],row['predicted_rating'],{}) for index,row in self.predictions_df.iterrows()]
        self.predictions=predictions
        # Look the metric function up by name rather than building a string for eval.
        metric_name=self.the_evaluator.replace("surprise.", "")
        self.the_evaluator='accuracy.'+metric_name
        print(getattr(accuracy, metric_name)(predictions,verbose=True))

## Evaluator

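The *surprise* library provides 4 different metrics to evaluate the accuracy of the rating predictions: RMSE (root mean squared error), MSE (mean squared error), MAE (mean absolute error), and FCP (fraction of concordant pairs).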
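Three of these metrics (MSE, RMSE, and MAE) are simple functions of the prediction errors and can be sketched directly on a dataframe that has the `rating` and `predicted_rating` columns produced by `Method`. The function name `error_metrics` and the toy values are assumptions for illustration. FCP is left out because it compares pairs of items per user.

```python
import numpy as np
import pandas as pd

def error_metrics(predictions_df):
    """Compute MSE, RMSE and MAE from true and predicted ratings."""
    err = predictions_df["predicted_rating"] - predictions_df["rating"]
    mse = float(np.mean(err ** 2))
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),        # RMSE is the square root of MSE
        "mae": float(np.mean(np.abs(err))),
    }

# Toy predictions: the errors are -1, 0 and -2.
toy = pd.DataFrame(
    {"rating": [3.0, 1.0, 5.0], "predicted_rating": [2.0, 1.0, 3.0]}
)
metrics = error_metrics(toy)
print(metrics)
```

RMSE and MAE are expressed in rating units, while MSE is in squared units, which is worth keeping in mind when comparing values across metrics.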

## Experiment

The code below was written by Professor Cajueiro based on the KNNBasic rating-prediction method, and was later extended by Vítor to cover other methods. The code can now use any combination of the surprise library's features for recommender systems based on *collaborative-filtering*.

In [246]:
data=Data()
data.show_available_databases()
data.read_data('ml_100k')
method=Method(data.df)  
method.show_methods()
method.run('surprise.KNNWithMeans')
predictions_df=method.predictions_df
evaluator=Evaluator(predictions_df)
evaluator.show_evaluators()
evaluator.run('surprise.mse')

The avaliable database are:
0: ml_100k
1: ml_1m
2: jester
The avaliable methods are:
0: surprise.NormalPredictor
1: surprise.BaselineOnly
2: surprise.KNNBasic
3: surprise.KNNWithMeans
4: surprise.KNNWithZScore
5: surprise.KNNBaseline
6: surprise.SVD
7: surprise.SVDpp
8: surprise.NMF
9: surprise.SlopeOne
10: surprise.CoClustering
Computing the msd similarity matrix...
Done computing similarity matrix.
The avaliable evaluators are:
0: surprise.rmse
1: surprise.mse
2: surprise.mae
3: surprise.fcp
MSE: 0.9252
0.9251836900859874


The code below was written by Vítor to randomly generate *ratings* for 20 fictitious *ppf* users, for a total of 10,000 ratings. It then repeats the rating-prediction and evaluation methodology.

In [252]:
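opo = pd.read_csv('oportunidades.csv')
# Random (user, item, rating) triples: 20 users, one item per row of `opo`.
# Note: random.randrange(1,5) excludes the upper bound, so ratings range from 1 to 4.
df = [(random.randrange(20), random.randrange(len(opo)), random.randrange(1,5)) for _ in range(10000)]
df = pd.DataFrame(df, columns = ['userID', 'itemID', 'rating'])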

method=Method(df)  
method.show_methods()
method.run('surprise.KNNWithMeans')
predictions_df=method.predictions_df
evaluator=Evaluator(predictions_df)
evaluator.show_evaluators()
evaluator.run('surprise.mse')

The avaliable methods are:
0: surprise.NormalPredictor
1: surprise.BaselineOnly
2: surprise.KNNBasic
3: surprise.KNNWithMeans
4: surprise.KNNWithZScore
5: surprise.KNNBaseline
6: surprise.SVD
7: surprise.SVDpp
8: surprise.NMF
9: surprise.SlopeOne
10: surprise.CoClustering
Computing the msd similarity matrix...
Done computing similarity matrix.
The avaliable evaluators are:
0: surprise.rmse
1: surprise.mse
2: surprise.mae
3: surprise.fcp
MSE: 1.5102
1.5102012144185555
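Because these ratings are drawn independently of the user and the item, there is no structure for a model to learn. A useful reference point is the MSE of always predicting the mean rating, which equals the variance of the ratings (1.25 for a uniform draw over 1 to 4, the values that `random.randrange(1,5)` produces). The sketch below estimates that value with the standard library. The seed is an assumption added only to make the sketch repeatable.

```python
import random
import statistics

random.seed(0)  # assumed seed, not used in the notebook

# Same rating generator as the experiment: integers from 1 to 4.
ratings = [random.randrange(1, 5) for _ in range(10000)]

# MSE of a constant predictor that always outputs the mean rating.
mean_rating = statistics.fmean(ratings)
squared_errors = [(r - mean_rating) ** 2 for r in ratings]
mse_of_mean = statistics.fmean(squared_errors)
print(round(mse_of_mean, 2))
```

An MSE above this reference, such as the 1.5102 reported above, is consistent with a model fitting noise rather than finding a real pattern.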
