Experimento  - Comparação dos métodos de Filtragem Colaborativa (FC)

Nesse experimento, será implementado uma comparação entre os métodos de FC baseado em memória e baseado em Modelo.

A comparação é feita usando os seguintes dataset



*   MovieLens (ML): Sobre avaliações de filmes
*   Restaurant & consumer data Data Set (RCD): Avaliação de Restaurante
*   LastFM (LFM): Sobre Musica
*   Book-Crossing Dataset (BCD): Avaliação sobre Livros




In [1]:
#!pip install surprise


In [2]:
#!pip install pyspark

In [3]:
azure = True
g_colab = False

In [4]:
import pandas as pd
import numpy as np

In [5]:
if azure:
    path_bcd ="./dataset FC/RatingBook/ratings.csv"
    path_ml  = "./dataset FC/ML/ml-100k/rating_ml.csv"
    path_rdc = "./dataset FC/RDC/rating_final.csv"
    
    
if g_colab:
    from google.colab import drive 
    drive.mount('/content/gdrive')
    path_bcd ="/content/gdrive/My Drive/Monografia/dataset FC/RatingBook/ratings.csv"
    path_ml = "/content/gdrive/My Drive/Monografia/dataset FC/ML/ml-100k/rating_ml.csv"
    path_rdc = "/content/gdrive/My Drive/Monografia/dataset FC/RDC/rating_final.csv"

In [6]:
df_book=pd.read_csv(path_bcd)
df_book = df_book.rename(columns = {"book_id":"itemid","user_id":"userid"})

qr = df_book[["userid","rating"]].groupby("userid").count() 
#qr = qr[qr["rating"] > 3]
ur = qr.index.unique()
df_ur  = pd.DataFrame(ur)
df_book = df_book.merge(df_ur,how = 'inner',on =  "userid")

print('Escala Minima Rating:',np.min(df_book["rating"]))
print('Escala maxima Rating:',np.max(df_book["rating"]))
print("Numero de avaliaçoes",len(df_book))
print("Numero de Usuários",len(df_book.userid.unique()))
print("Numero de itens",len(df_book.itemid.unique()))
df_book.head()

Escala Minima Rating: 1
Escala maxima Rating: 5
Numero de avaliaçoes 981756
Numero de Usuários 53424
Numero de itens 10000


Unnamed: 0,itemid,userid,rating
0,1,314,5
1,3,314,3
2,5,314,4
3,6,314,5
4,12,314,4


In [7]:
df_rdc=pd.read_csv(path_rdc)
print("Numero de avaliaçoes",len(df_rdc))
print("Numero de Usuários",len(df_rdc.userID.unique()))
print("Numero de itens",len(df_rdc.placeID.unique()))
df_rdc['userID'] = pd.to_numeric(df_rdc['userID'].map(lambda x: x.replace('U','')))
df_rdc = df_rdc.rename(columns = {"userID":"userid","placeID":"itemid"})
df_rdc = df_rdc.drop(columns = {"food_rating","service_rating"})

#trocando a escala de 0 a 2 para 1 a 3.
df_rdc["rating"] = df_rdc["rating"] + 1

print("Numero de avaliaçoes",len(df_rdc))
print('Escala Minima Rating:',np.min(df_rdc["rating"]))
print('Escala maxima Rating:',np.max(df_rdc["rating"]))
df_rdc.head()

Numero de avaliaçoes 1161
Numero de Usuários 138
Numero de itens 130
Numero de avaliaçoes 1161
Escala Minima Rating: 1
Escala maxima Rating: 3


Unnamed: 0,userid,itemid,rating
0,1077,135085,3
1,1077,135038,3
2,1077,132825,3
3,1077,135060,2
4,1068,135104,2


In [8]:
#Normalização do Rating entre 1 a 5
max_rating = np.max(df_rdc["rating"]) 
df_rdc["rating"] = (df_rdc["rating"]*4) / max_rating + 1
df_rdc["rating"] = df_rdc["rating"].round(2)
df_rdc.head()

Unnamed: 0,userid,itemid,rating
0,1077,135085,5.0
1,1077,135038,5.0
2,1077,132825,5.0
3,1077,135060,3.67
4,1068,135104,3.67


In [9]:
df_ml=pd.read_csv(path_ml,sep= ";")
print('Escala Minima Rating:',np.min(df_ml["rating"]))
print('Escala maxima Rating:',np.max(df_ml["rating"]))
print("Numero de avaliaçoes",len(df_ml))
df_ml.head()

Escala Minima Rating: 1
Escala maxima Rating: 5
Numero de avaliaçoes 100000


Unnamed: 0,userid,itemid,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


***Modelo baseado em Filtragem Colaborativa usando Singular Value Decomposition (SVD) ***


Nessa etapa sera implementado o SVD para todos os datasets

Nesse passo sera criado a partição de treino e teste para todos os dataset. O objetivo é usar as mesmas  partição para todos os métodos.


In [10]:
from surprise.model_selection import cross_validate
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split
import pandas as pd
import numpy  as np

In [11]:
# Ler os dados sobre rating 
reader_ml  = Reader(rating_scale=(1, 5))
reader_bcd  = Reader(rating_scale=(1, 5))
reader_rdc  = Reader(rating_scale=(1, 5))

data_ml = Dataset.load_from_df(df_ml[['userid','itemid','rating']], reader_ml)
data_book = Dataset.load_from_df(df_book[['userid','itemid','rating']], reader_bcd)
data_rdc = Dataset.load_from_df(df_rdc[['userid','itemid','rating']], reader_rdc)
 
# Treinamento com 70% da massa de dados e deixar 30% para fazer a validação
trainset_ml,  testset_ml  = train_test_split(data_ml,   test_size=.30,random_state=42)
trainset_bcd, testset_bcd = train_test_split(data_book, test_size=.30,random_state=42)
trainset_rdc, testset_rdc = train_test_split(data_rdc,  test_size=.30,random_state=42)


SVD_ml  = SVD()
SVD_bcd = SVD()
SVD_rdc = SVD()



# Criando o Modelo
SVD_ml.fit(trainset_ml)
SVD_bcd.fit(trainset_bcd)
SVD_rdc.fit(trainset_rdc)

#predição base de Test
SVD_predictions_ml  = SVD_ml.test(testset_ml)
SVD_predictions_bcd = SVD_bcd.test(testset_bcd)
SVD_predictions_rdc = SVD_rdc.test(testset_rdc)



In [12]:
# Verificar o valor do erro quadrático médico para verificar a qualidade do  modelo.
print("rmse MovieLens:",accuracy.rmse(SVD_predictions_ml))
print("rmse Book:",accuracy.rmse(SVD_predictions_bcd))
print("rmse Restaurant:",accuracy.rmse(SVD_predictions_rdc))

RMSE: 0.9450
rmse MovieLens: 0.9450446152058093
RMSE: 0.8488
rmse Book: 0.8488450533877537
RMSE: 0.8905
rmse Restaurant: 0.8904663383210858


In [13]:
# Verificar o valor do erro quadrático médico para verificar a qualidade do  modelo.
print("mae MovieLens:",accuracy.mae(SVD_predictions_ml))
print("mae Book:",accuracy.mae(SVD_predictions_bcd))
print("mae Restaurant:",accuracy.mae(SVD_predictions_rdc))

MAE:  0.7448
mae MovieLens: 0.7448075130624082
MAE:  0.6649
mae Book: 0.6648640003267104
MAE:  0.7537
mae Restaurant: 0.7537094859882338


*Modelo baseado em Filtragem Colaborativa usando ALS*


In [14]:
import pyspark
from pyspark.sql import SQLContext
import datetime
import os
import sys
import glob
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.recommendation import MatrixFactorizationModel
from sklearn.feature_extraction.text import TfidfVectorizer
from pyspark import SparkContext, SparkConf
from io import StringIO
from sklearn.metrics import mean_squared_error, mean_absolute_error
from pyspark import SparkContext, SQLContext   
from pyspark.ml.recommendation import ALS     
from contextlib import contextmanager
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession

In [15]:
conf=SparkConf()
conf.set("spark.executor.memory", "110g")
conf.set("spark.driver.memory", "110g")
conf.set("spark.cores.max", "16")
#conf.set("spark.driver.extraClassPath")
   # driver_home+'/jdbc/postgresql-9.4-1201-jdbc41.jar:'\
   # +driver_home+'/jdbc/clickhouse-jdbc-0.1.52.jar:'\
   # +driver_home+'/mongo/mongo-spark-connector_2.11-2.2.3.jar:'\
   # +driver_home+'/mongo/mongo-java-driver-3.8.0.jar') 

sc = SparkContext.getOrCreate(conf)

sqlContext = SQLContext(sc)

In [16]:
df_ml2  = sqlContext.createDataFrame(df_ml)
df_bcd2 = sqlContext.createDataFrame(df_book)
df_rdc2 = sqlContext.createDataFrame(df_rdc)

## usando a semente 42 para garantir que as partições geradas são as mesmas
trainset_ml_spark, testset_ml_spark   = df_ml2.randomSplit([0.7,0.3],seed =42)

trainset_bcd_spark, testset_bcd_spark = df_bcd2.randomSplit([0.7,0.3],seed =42)

trainset_rdc_spark, testset_rdc_spark = df_rdc2.randomSplit([0.7,0.3],seed =42)

In [17]:
# MovieLens 
als_ml = ALS(userCol="userid",itemCol="itemid",ratingCol="rating",rank=5, maxIter=10, seed=42)
model_ml = als_ml.fit(trainset_ml_spark)

# Book 
als_bcd = ALS(userCol="userid",itemCol="itemid",ratingCol="rating",rank=5, maxIter=10, seed=42)
model_bcd = als_bcd.fit(trainset_bcd_spark)

# Restaurant 
als_rdc = ALS(userCol="userid",itemCol="itemid",ratingCol="rating",rank=5, maxIter=10, seed=42)
model_rdc = als_rdc.fit(trainset_rdc_spark)

In [19]:
# MovieLens 

predictions_ml = model_ml.transform(testset_ml_spark[["userid","itemid"]]) 
ratesAndPreds_ml = testset_ml_spark.join(other=predictions_ml,on=['userid','itemid'],how='inner').na.drop() 

rating_ml = np.array(ratesAndPreds_ml.select("rating").collect()).ravel()
prediction_ml = np.array(ratesAndPreds_ml.select("prediction").collect()).ravel()
uid_ml =  np.array(ratesAndPreds_ml.select("userid").collect()).ravel()

# Book

predictions_bcd = model_bcd.transform(testset_bcd_spark[["userid","itemid"]]) 
ratesAndPreds_bcd = testset_bcd_spark.join(other=predictions_bcd,on=['userid','itemid'],how='inner').na.drop() 

rating_bcd = np.array(ratesAndPreds_bcd.select("rating").collect()).ravel()
prediction_bcd = np.array(ratesAndPreds_bcd.select("prediction").collect()).ravel()
uid_bcd =  np.array(ratesAndPreds_bcd.select("userid").collect()).ravel()
# Restaurant 

predictions_rdc = model_rdc.transform(testset_rdc_spark[["userid","itemid"]]) 
ratesAndPreds_rdc = testset_rdc_spark.join(other=predictions_rdc,on=['userid','itemid'],how='inner').na.drop() 

rating_rdc = np.array(ratesAndPreds_rdc.select("rating").collect()).ravel()
prediction_rdc = np.array(ratesAndPreds_rdc.select("prediction").collect()).ravel()
uid_rdc = np.array(ratesAndPreds_rdc.select("userid").collect()).ravel()

In [20]:
# Calculando o Valor do RMSE:
print('MovieLens RMSE: ',mean_squared_error(prediction_ml,rating_ml))
print('Book RMSE: ',mean_squared_error(prediction_bcd,rating_bcd))
print('RDC RMSE: ',mean_squared_error(prediction_rdc,rating_rdc))

MovieLens RMSE:  0.8618483408709152
Book RMSE:  0.8321611698323095
RDC RMSE:  1.6867267492541786


In [21]:
# Calculando o Valor do MAE:
print('MovieLens MAE: ',mean_absolute_error(prediction_ml,rating_ml))
print('Book MAE: ',mean_absolute_error(prediction_bcd,rating_bcd))
print('RDC MAE: ',mean_absolute_error(prediction_rdc,rating_rdc))

MovieLens MAE:  0.7360417826275114
Book MAE:  0.7047965554333295
RDC MAE:  0.9747706817526748


### Calculando Recall

In [22]:
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

### Calculando Recall para o SVD

In [23]:
# Escolhendo Corte como 3.5
from collections import defaultdict
#predição base de Test
k = 1000000
threshold = 3.5

precisions_ml, recalls_ml = precision_recall_at_k(SVD_predictions_ml,k=k, threshold=threshold)

precisions_bcd, recalls_bcd = precision_recall_at_k(SVD_predictions_bcd,k=k, threshold=threshold)

precisions_rdc, recalls_rdc = precision_recall_at_k(SVD_predictions_rdc,k=k, threshold=threshold)



print('Movielens ',sum(rec for rec in recalls_ml.values()) / len(recalls_ml))


print('Books ',sum(rec for rec in recalls_bcd.values()) / len(recalls_bcd))

print('RDC', sum(rec for rec in recalls_rdc.values()) / len(recalls_rdc))

Movielens  0.6762366501451783
Books  0.755301047431303
RDC 0.8327823691460055


## Calculando o Recall para o ALS

In [24]:
def recall_from_als(df,threshold = 4,k=20):
    
    #df = df.sort_values(by = ["est"], ascending = False)
    df_recall = pd.DataFrame([], columns = {"uid","est","true_r"})
    n_rel_true = []
    n_rel_est = []
    n_rel_and_rec_k = []
    user = df["uid"].unique()
    
    # Number of recommended items in top k
    
    for uid in range(len(user)):
        
        df_temp = df[df["uid"] == df["uid"][uid]]
        df_temp = df_temp.sort_values(by = ["est"], ascending = False)
        
        df_temp = df_temp.head(k)
        
        df_recall = df_recall.append(df_temp,ignore_index=True)
        
        
        # Number of relevant items
        
    for i in range(len(df_recall)):
            
            
        if df_recall["true_r"][i] > threshold:
            n_rel_true.append(1)
        else:
            n_rel_true.append(0)
                    
                    
        if df_recall["est"][i] > threshold:
            n_rel_est.append(1)
        else:
            n_rel_est.append(0)
                            
        if n_rel_est[i] == 1 and n_rel_true[i] == 1:
            
            n_rel_and_rec_k.append(1)
        
        else:
            
            n_rel_and_rec_k.append(0)
            
            
            
                
                
    return sum(n_rel_and_rec_k) / sum(n_rel_true) 

In [25]:

#MovieLens
df_final_ml = pd.DataFrame({'uid':uid_ml,'est':prediction_ml,'true_r':rating_ml})

#Book
df_final_bcd = pd.DataFrame({'uid':uid_bcd,'est':prediction_bcd,'true_r':rating_bcd})

#RDC
df_final_rdc = pd.DataFrame({'uid':uid_rdc,'est':prediction_rdc,'true_r':rating_rdc})

In [26]:
print("Recall para Movilens",recall_from_als(df_final_ml,k=1000,threshold = 3.5))



of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


Recall para Movilens 0.6473882462748758


In [27]:
print("Recall para Book",recall_from_als(df_final_bcd,k=1000,threshold = 3.5))

Recall para Book 0.7740156697422844


In [28]:
print("Recall para rdc",recall_from_als(df_final_rdc,k=1000,threshold = 3.5))

Recall para rdc 0.4626865671641791
