# Trabalho de Filtragem Colaborativa

Modelo de filtragem colaborativa usa as informações de ratings dos usuários para prover recomendações. <br>
**Ideia Principal:** A semelhança entre os usuários a partir do que se observa da interseção dos seus ratings permite inferir que os dados não inputados por apenas um destes poderia ter a mesma semelhança com relação ao outro. <br><br>
**Problemas Principais:** 
- A **esparsidade** dos dados, afinal a informação que um usuário provê é normalmente a um subconjunto muito pequeno dos itens. Logo a maioria da base de dados é de dados *faltantes* ou *não observados*. 
- **Cold-Start**: a falta de dados inicial para ter informação relevante seja para entender a personlidade de um usuário ou a preditibilidade de um item<br>
<br>
Há 2 métodos de filtragem colaborativa: <br>

**Memory-Based:** Também chamado de *neighborhood-based collaborative filtering algorithms*. Que se dividem basicamente em *user-based collaborative filtering* e *item-based collaborative filtering*. <br>
**Model-Based:** Modelos baseados em **machine learning** e **data mining** há um processo de aprendizado prévio para parametrizição. Alguns métodos são Decisions Trees, métodos bayesianos, modelos baseados em regras e latent factor method. <br>

## Implementação usando MovieLens
Carregar a base de 25M ou 100K


In [1]:
import pandas as pd                          #DataFrames e operações associadas
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity          #Similaridade
import math  
import scipy.stats
import sklearn.metrics  
from colorama import Fore, Back, Style       #prints coloridos e estilizados
from IPython.display import display
import time

ratings = pd.DataFrame()
movies = pd.DataFrame()

def carregar_base(ratings_filename, movies_filename, n_ratings = 25000000):
    global ratings
    ratings = pd.read_csv(ratings_filename)[:n_ratings] #ler as primeiras n linhas da base completa; default 25.000.000
    global movies
    movies = pd.read_csv(movies_filename).set_index("movieId")
    print(f'Arquivo \'{ratings_filename}\' carregado.')

carregar_base('ratings.csv', 'movies.csv', n_ratings=1000000)
#carregar_base('ratings_small.csv', 'movies_small.csv')

#### Funções para auxiliar
def get_filmes_avaliados(usuario_):
    if(type(usuario_)==list):
        return sorted(ratings[ratings['userId'].isin(usuario_)]['movieId'].unique().tolist())
    else:
        return ratings[ratings['userId']==usuario_]['movieId'].values.tolist()

def eliminar_colunas_zeradas(matriz):
    return matriz.loc[:, (matriz != 0).any(axis=0)] #elimina todas as colunas cujos todos os valores são 0

def get_nomes_filmes(indices):
    return movies.loc[indices]['title'].values.tolist()

def get_media_avaliacao(usuario_, decimais=2):
    return round(ratings[ratings['userId'].isin([usuario_])]['rating'].mean(),decimais)

def print_destaque(texto):
    print(Back.BLUE + Fore.LIGHTYELLOW_EX+ f' {texto} ')
    print(Style.RESET_ALL)

Arquivo 'ratings.csv' carregado.


In [2]:
#Pequeno relatório dos datasets originais
def print_report(ratings):
    n_ratings = len(ratings)
    global lista_usuarios
    lista_usuarios = sorted(ratings['userId'].unique())
    global lista_filmes_avaliados
    lista_filmes_avaliados = sorted(ratings['movieId'].unique())
    print(f"Total de ratings: {n_ratings}")
    print(f"Total de filmes: {len(movies)}")
    print(f"Filmes avaliados: {len(lista_filmes_avaliados)}")
    print(f"Total de usuários: {len(lista_usuarios)}")
    print(f"Média de ratings/user: {round(n_ratings/len(lista_usuarios), 2)}")
    print(f"Shape de Ratings: {ratings.shape}")
    esparsidade = round(1.0 -n_ratings/float(len(lista_usuarios) * len(lista_filmes_avaliados)),3)
    print(f"O nível de esparsidade do dataset é {esparsidade * 100}%\n")

    print_destaque('Relatório dos ratings dados pelos usuários')
    print(ratings.groupby('userId')['rating'].count().describe())

print_destaque('Relatório da base original usada')
print_report(ratings)
ratings.sample(5).sort_index()


[44m[93m Relatório da base original usada 
[0m
Total de ratings: 1000000
Total de filmes: 62423
Filmes avaliados: 21952
Total de usuários: 6747
Média de ratings/user: 148.21
Shape de Ratings: (1000000, 4)
O nível de esparsidade do dataset é 99.3%

[44m[93m Relatório dos ratings dados pelos usuários 
[0m
count    6747.000000
mean      148.214021
std       237.566936
min        20.000000
25%        35.000000
50%        69.000000
75%       155.000000
max      4227.000000
Name: rating, dtype: float64


Unnamed: 0,userId,movieId,rating,timestamp
216399,1526,4018,3.0,1136343818
383609,2644,5377,4.0,1546155436
402866,2778,344,4.5,1213465963
483401,3305,2193,3.0,1329577232
716561,4874,736,3.0,859315456


### Classe Treino_Teste Cross-Validation
<center><img src="img/x-validation.jpg" style="max-width: 40%"></center>

In [3]:
class Treino_Teste:
    fatias = []
    k = 0
    def __init__(self, database, k):
        database = database.sample(frac=1) #misturar randomicamente os dados
        self.k = k
        self.fatias = np.array_split(database, k)

    def get_treino_teste(self):
        treino = pd.concat(self.fatias[:-1])
        teste = self.fatias[-1].sort_values(by=['userId','movieId'])
        return treino, teste

    def get_k(self):
        return self.k
    
    def proxima_folha(self): #próximo split
        self.fatias.append(self.fatias.pop(0))

tt = Treino_Teste(ratings,10)
ratings_treino, ratings_teste = tt.get_treino_teste()
print_destaque('Relatório da base de treino (9/10) da original')
print_report(ratings_treino)

[44m[93m Relatório da base de treino (9/10) da original 
[0m
Total de ratings: 900000
Total de filmes: 62423
Filmes avaliados: 21215
Total de usuários: 6747
Média de ratings/user: 133.39
Shape de Ratings: (900000, 4)
O nível de esparsidade do dataset é 99.4%

[44m[93m Relatório dos ratings dados pelos usuários 
[0m
count    6747.000000
mean      133.392619
std       213.814675
min        14.000000
25%        31.000000
50%        62.000000
75%       141.000000
max      3774.000000
Name: rating, dtype: float64


## Gerar matriz Movies X User

In [4]:
def gerar_matriz_movies_user(dados):
    global ratings_usuarios
    ratings_usuarios = dados.groupby(['userId', 'movieId'])['rating'].first().unstack(fill_value=0.0)
    ratings_usuarios = pd.DataFrame(data=ratings_usuarios, columns=lista_filmes_avaliados).fillna(0)
    return ratings_usuarios

#### Funções para auxiliar
def listar_filmes_ja_vistos(usuario):
    #filmes_ja_vistos_bin = matriz_filmes_X_usuarios.loc[usuario].gt(0)   #gerar array com o que usuário já deu rating: True ou False
    #return filmes_ja_vistos_bin.index[filmes_ja_vistos_bin].to_list() #com base no anterior, listar filmes que já viu        
    if(type(usuario)==list):
        filmes_usuario = ratings_usuarios.loc[usuario].sum(axis = 0)
    else: #if(type(usuario)==int):
        filmes_usuario = ratings_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario!=0].index.to_list()

def listar_filmes_nao_vistos(usuario):
    if(type(usuario)==list):
        filmes_usuario = ratings_usuarios.loc[usuario].sum(axis = 0)
    else: 
        filmes_usuario = ratings_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario==0].index.to_list()
    
def listar_notas_usuario(userId):
    return ratings_usuarios[ratings_usuarios.index==userId]
   
gerar_matriz_movies_user(ratings_treino)

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,207309,207311,207367,207405,207642,207830,207890,208002,208080,208737
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6743,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6744,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6745,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6746,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## User-Based Collaborative Filtering

### Matriz de Similaridade por Usuário
A medida comumente usada é a similaridade do cosseno.
Essa medida de similaridade deve seu nome ao fato de ser igual ao cosseno do ângulo entre os dois vetores que estão sendo comparados:  vetores de similaridade de usuário (ou item) de ratings. Quanto menor o ângulo entre dois vetores, maior será o cosseno, resultando em um fator de similaridade mais alto. 

Dado 2 vetores, A e B, a similiridade por cosseno, cos($\theta$), é representada pelo produto escalar
$$\text{cosine similarity} =S_C (x,y):= \cos(\theta) = {\mathbf{x} \cdot \mathbf{y} \over \|\mathbf{x}\| \|\mathbf{y}\|} = \frac{ \sum\limits_{i=1}^{n}{x_i  y_i} }{ \sqrt{\sum\limits_{i=1}^{n}{x_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{y_i^2}} }$$

### Pegar os k usuários mais similares ao Target selecionado
<center><img src="img/user-based-similaridades.jpg" ></center>

Como selecionar os mais similares?
- Todos os vizinhos
- Selecionar randomicamente
- Todos acima de um threshold
- **Top-k por simaridade**

Problemas
- Custo computacional
- Mais vizinhos = mais ruído
- Poucos vizinhos = pouca cobertura
- Usar entre 25 e 100

In [5]:
def obter_mais_similares(target, matriz_similaridade, k = 25, min_score = 0):
    try:
        similares = matriz_similaridade.loc[target].sort_values(ascending=False).drop(target)
        similares = similares[similares >= min_score]
        return similares.iloc[:k]
    except:
        return []

def gerar_matriz_similaridade_user(dados):
    global users_cosine
    users_cosine_array = cosine_similarity(dados)
    users_cosine = pd.DataFrame(data=users_cosine_array, index=dados.index, columns=dados.index)
    return users_cosine

gerar_matriz_similaridade_user(ratings_usuarios).round(3).head()

userId,1,2,3,4,5,6,7,8,9,10,...,6738,6739,6740,6741,6742,6743,6744,6745,6746,6747
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.021,0.054,0.022,0.017,0.0,0.098,0.018,0.025,0.028,...,0.032,0.0,0.05,0.025,0.14,0.047,0.0,0.022,0.044,0.021
2,0.021,1.0,0.162,0.156,0.134,0.084,0.07,0.152,0.11,0.156,...,0.048,0.189,0.143,0.12,0.088,0.084,0.221,0.13,0.202,0.221
3,0.054,0.162,1.0,0.323,0.06,0.105,0.034,0.066,0.049,0.129,...,0.011,0.155,0.08,0.018,0.194,0.18,0.135,0.093,0.183,0.107
4,0.022,0.156,0.323,1.0,0.05,0.068,0.0,0.074,0.065,0.076,...,0.0,0.058,0.086,0.0,0.093,0.157,0.109,0.053,0.172,0.082
5,0.017,0.134,0.06,0.05,1.0,0.102,0.162,0.284,0.187,0.256,...,0.122,0.045,0.259,0.139,0.108,0.043,0.152,0.151,0.147,0.214


### Gerar a Recomendação de acordo com a nota dada pelos usuários similares

Podemos usar várias métricas de acordo com os ratings dos vizinhos: mínimo, máximo, média, mediana, média ponderada, agregação supervisionada.
<br>
Usaremos **média ponderada** logo:
1. Para cada filme que se deseja saber a nota: 
2. Para cada usuário similar da lista:
    1. Se nota foi dada: somar nota seguindo a fórmula
$$ notaMédia = {\sum coeficiente * nota \over \sum coeficiente} $$
<center><img src="img/user-based-similaridades2.jpg" ></center>

In [6]:
def predizer_notas_userb_mediap(target, filmes, matriz_similaridade = users_cosine, matriz_dados = ratings_usuarios, k=25, min_threshold=0):  # min_threshold: qtd mínima de notas para ser considerada ao recomendar um filme                                        
    similares = obter_mais_similares(target, matriz_similaridade, 25)

    resultado = pd.DataFrame(columns=filmes, index=['User Pred', '# Notas'] ) 
    resultado.columns.name = 'movieId'
    for filme in filmes:
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for similar in similares.index:
            try: #Tenta achar a nota na matriz de dados, mas pode não existir se o filme só aparece na matriz de teste
                nota = matriz_dados.loc[similar,filme]
            except:
                nota = 0
            if (nota != 0):
                coeficiente = similares[similar]
                numerador += nota * coeficiente
                denominador += coeficiente
                qtd_notas += 1
        if (qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['User Pred',filme] = round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['User Pred',filme] = 0
            resultado.at['# Notas',filme] = qtd_notas
    return resultado

def recomendar_user_based(target, k_similares = 30, qtd_sugestoes = 20, min_threshold=5):
    usuarios_mais_similares = obter_mais_similares(target, users_cosine, k_similares).index
    filmes_vistos_pelo_usuario = listar_filmes_ja_vistos(target) 
    filmes_vistos_pelos_similares = listar_filmes_ja_vistos(usuarios_mais_similares.values.tolist())
    filmes_a_avaliar = list(set(filmes_vistos_pelos_similares)-set(filmes_vistos_pelo_usuario))
    recomendacao = predizer_notas_userb_mediap(target, filmes_a_avaliar, users_cosine, ratings_usuarios, k_similares, min_threshold=5)
    return recomendacao.T.sort_values(by=['User Pred','# Notas',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on=["movieId"])

In [7]:
recomendar_user_based(1)


Unnamed: 0_level_0,User Pred,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1199,4.7,6,Brazil (1985)
1206,4.6,11,"Clockwork Orange, A (1971)"
7438,4.6,9,Kill Bill: Vol. 2 (2004)
2997,4.6,7,Being John Malkovich (1999)
1258,4.5,11,"Shining, The (1980)"
56782,4.5,6,There Will Be Blood (2007)
1080,4.5,5,Monty Python's Life of Brian (1979)
541,4.4,11,Blade Runner (1982)
4878,4.4,10,Donnie Darko (2001)
1219,4.4,8,Psycho (1960)


#### Nota por *Mean-Centered* (trad: média centralizada?)
$$ notaPredita_u = \mu_u + {\sum coeficiente * (nota-\mu_v) \over \sum coeficiente} $$

In [8]:
def predizer_notas_userb_meanc(target, filmes, matriz_similaridade = users_cosine, matriz_dados = ratings_usuarios, k=25, min_threshold=0): 
                                            # min_threshold: qtd mínima de notas para ser considerada ao recomendar um filme
    similares = obter_mais_similares(target, matriz_similaridade, 25)

    resultado = pd.DataFrame(columns=filmes, index=['User Pred2', '# Notas.'] ) 
    resultado.columns.name = 'movieId'
    media_target = get_media_avaliacao(target)
    for filme in filmes:
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for similar in similares.index:
            nota = matriz_dados.loc[similar,filme]
            if (nota != 0):
                coeficiente = similares[similar]
                numerador += coeficiente * (nota - get_media_avaliacao(similar))
                denominador += coeficiente
                qtd_notas += 1
        if (qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['User Pred2',filme] = media_target + round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['User Pred2',filme] = 0
            resultado.at['# Notas.',filme] = qtd_notas
    return resultado.fillna(0)

def recomendar_user_based_2(target, k_similares = 30, qtd_sugestoes = 20, min_threshold=5):
    usuarios_mais_similares = obter_mais_similares(target, users_cosine, k_similares).index
    filmes_vistos_pelo_usuario = listar_filmes_ja_vistos(target) 
    filmes_vistos_pelos_similares = listar_filmes_ja_vistos(usuarios_mais_similares.values.tolist())
    filmes_a_avaliar = list(set(filmes_vistos_pelos_similares)-set(filmes_vistos_pelo_usuario))
    recomendacao = predizer_notas_userb_meanc(target, filmes_a_avaliar, users_cosine, ratings_usuarios, k_similares, min_threshold=5)
    return recomendacao.T.sort_values(by=['User Pred2','# Notas.',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on=["movieId"])

In [9]:
recomendar_user_based_2(1) # Reaprar que a qt de notas avaliadas é a mesma com os mesmos parâmetros mas o cálculo é bem diferente


Unnamed: 0_level_0,User Pred2,# Notas.,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
56782,4.81,6.0,There Will Be Blood (2007)
541,4.71,11.0,Blade Runner (1982)
7438,4.71,9.0,Kill Bill: Vol. 2 (2004)
1199,4.71,6.0,Brazil (1985)
86320,4.71,5.0,Melancholia (2011)
111,4.61,8.0,Taxi Driver (1976)
1208,4.61,8.0,Apocalypse Now (1979)
5971,4.61,6.0,My Neighbor Totoro (Tonari no Totoro) (1988)
1080,4.61,5.0,Monty Python's Life of Brian (1979)
5995,4.51,10.0,"Pianist, The (2002)"


## Testar modelo usando kfold e Raiz do Erro Quadrático Médio

### Calcular margem de erro 

Uma medida frequentemente usada na verificação da acurácia de modelos numéricos é o Erro Quadrático Médio (MSE na sigla em Inglês) como descrito, por exemplo, em Wilks (2006).MSE é sempre positivo. MSE = 0 indica simulação perfeita. MSE é definido por:
$$ MSE = \frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2 $$

Em adição ao MSE, a raiz quadrada de MSE, ou Raiz do Erro Quadrático Médio (RMSE em Inglês), é comumente usada para expressar a acurácia dos resultados numéricos com a vantagem de que RMSE apresenta valores do erro nas mesmas dimensões da variável analisada. O RMSE é definido por:
$$ RMSE = \sqrt{\frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2} $$

In [10]:
def calcular_rmse(real, previsao):
    return math.sqrt(sklearn.metrics.mean_squared_error(real, previsao))  

def testar_predicao_user_based(userId, filmes, notas_reais):
    notas_preditas = predizer_notas_userb_mediap(userId, filmes, users_cosine, ratings_usuarios,25,0)
    notas_preditas = notas_preditas.loc["User Pred"].values.tolist()
    return calcular_rmse(notas_reais,notas_preditas)

def get_filmes_teste(usuario):
    return list(ratings_teste[ratings_teste['userId'] == usuario]['movieId'])

def get_notas_teste(usuario):
    return list(ratings_teste[ratings_teste['userId'] == usuario]['rating'])

def testar_user_based(teste = ratings_teste):
    usuarios_teste = list(set(teste.userId))
    estatistica_user_based = pd.DataFrame(index = usuarios_teste)
    for usuario in usuarios_teste:
        filmes_teste =  get_filmes_teste(usuario)
        notas_dadas = get_notas_teste(usuario)
        estatistica_user_based.at[usuario,"rmse_User"]=testar_predicao_user_based(usuario, filmes_teste, notas_dadas)
    estatistica_user_based['rmse_User'].std()
    return estatistica_user_based.describe().T

tt = Treino_Teste(ratings,10)
resultado = []

for i in range(tt.get_k()):
    inicio = time.time()
    ratings_treino, ratings_teste = tt.get_treino_teste()
    matriz = gerar_matriz_movies_user(ratings_treino)
    gerar_matriz_similaridade_user(matriz)
    parcial = testar_user_based(ratings_teste)
    resultado.append(parcial)
    fim = time.time()
    print(f'{i+1}º processo user-based gastou {round((fim - inicio),4)} segs', end='\r')
    tt.proxima_folha()

resultado = pd.concat(resultado)
display(resultado)

medias_u = resultado['mean'].mean()
desvios_padrao_u = resultado['std'].mean()
confianca_u = 0.99
conf_int_u = scipy.stats.norm.interval(confianca_u, loc=medias_u, scale=desvios_padrao_u) 
print_destaque('Relatório da final da USER BASED')
print(f"Média das RMSE:    {medias_u}\nDesvio padrão médio: {desvios_padrao_u}")
print(f"Intervalo de {confianca_u} confiança: {conf_int_u}\n")
print_destaque('Fim da 1ª parte')

10º processo user-based gastou 55.845 segs

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rmse_User,6609.0,1.458614,0.791335,0.0,0.892188,1.3,1.860108,5.0
rmse_User,6592.0,1.445294,0.801173,0.0,0.876525,1.281405,1.837323,5.0
rmse_User,6617.0,1.458222,0.796983,0.0,0.884433,1.302196,1.850946,5.0
rmse_User,6601.0,1.434811,0.788552,0.0,0.875785,1.283088,1.826615,5.0
rmse_User,6611.0,1.460524,0.80395,0.0,0.89821,1.29647,1.861355,5.0
rmse_User,6603.0,1.420793,0.786038,0.0,0.870803,1.255843,1.807084,5.0
rmse_User,6620.0,1.440709,0.780262,0.0,0.886472,1.298318,1.851613,5.0
rmse_User,6623.0,1.461711,0.801446,0.0,0.898686,1.3,1.860587,5.0
rmse_User,6598.0,1.453117,0.784906,0.0,0.890058,1.302261,1.852176,5.0
rmse_User,6605.0,1.443009,0.782961,0.0,0.891133,1.290153,1.826198,5.0


[44m[93m Relatório da final da USER BASED 
[0m
Média das RMSE:    1.4476802972601517
Desvio padrão médio: 0.7917606847765684
Intervalo de 0.99 confiança: (-0.591760075985277, 3.4871206705055804)

[44m[93m Fim da 1ª parte 
[0m


---
---

## Item-Based Collaborative Filtering

### Gerar matriz Users X Movies
Transposição da matriz que tinha usuários nas linhas e filmes nas colunas, para filmes nas linhas e usuários nas colunas

In [11]:
ratings_filmes = ratings_usuarios.T
ratings_filmes.head(4)

userId,1,2,3,4,5,6,7,8,9,10,...,6738,6739,6740,6741,6742,6743,6744,6745,6746,6747
1,0.0,3.5,0.0,3.0,4.0,0.0,0.0,4.0,0.0,3.5,...,0.0,0.0,3.0,0.0,0.0,4.0,4.0,0.0,0.0,5.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,3.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0


### Matriz de Similaridade Item a Item
Similaridade por Cosseno dos filmes entre si

In [12]:
def gerar_matriz_similaridade_item(dados):
    global movies_cosine
    movies_cosine_array = cosine_similarity(dados)
    movies_cosine = pd.DataFrame(data=movies_cosine_array, index=dados.index, columns=dados.index)
    return movies_cosine

gerar_matriz_similaridade_item(ratings_filmes).round(3).head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,207309,207311,207367,207405,207642,207830,207890,208002,208080,208737
1,1.0,0.34,0.25,0.104,0.267,0.325,0.269,0.085,0.151,0.32,...,0.007,0.0,0.01,0.009,0.0,0.013,0.024,0.0,0.016,0.0
2,0.34,1.0,0.205,0.155,0.207,0.239,0.18,0.161,0.152,0.361,...,0.0,0.0,0.0,0.021,0.0,0.0,0.0,0.0,0.0,0.0
3,0.25,0.205,1.0,0.197,0.369,0.257,0.314,0.086,0.208,0.173,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.104,0.155,0.197,1.0,0.142,0.096,0.139,0.11,0.141,0.115,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.267,0.207,0.369,0.142,1.0,0.214,0.336,0.112,0.15,0.159,...,0.02,0.0,0.027,0.0,0.0,0.0,0.0,0.0,0.043,0.0


### Selecionar um usuário e analisar os filmes que ele não deu nota
Os targets serão os filmes que o usuário não deu nota. É analisado os k filmes mais similares ao que ele não viu, e destes, de acordo com as notas que o usuário deu, é calculado a nota estimada. Isto é feito para todos os filmes.

<center><img src="img/item-based-cosseno-predicao.jpg" style="max-width: 20%"></center>

1. pegar um usuário e os filmes que ele não assistiu
2. pegar um filme que ele não assistiu e selecionar os K mais semelhantes & que o usuário deu nota
3. fazer a média ponderada entre as notas que ele deu pra estes filmes semelhantes para definir a nota nova faltante

In [13]:
def predizer_notas_item_b(usuario, filmes, matriz_similaridade=movies_cosine, matriz_dados = ratings_filmes, k=25, min_threshold=3):
    usuario = 1
    filmes_não_avaliados = listar_filmes_nao_vistos(usuario) #O que essa essa matriz dados???

    #recomendacao = pd.DataFrame(columns=("movieId", "Nota", "Qt de Notas"))
    resultado = pd.DataFrame(columns=filmes, index=['Item Pred', '# Notas'] ) 
    resultado.columns.name = 'movieId'

    set_avaliados = set(get_filmes_avaliados(usuario))

    for filme in filmes:  #Para cada filme que desejamos a predição, obterei aqueles mais similares
        filmes_mais_similares = obter_mais_similares(filme, matriz_similaridade, k, 0.2)
        if(len(filmes_mais_similares)==0):
            continue
        similares_vistos = list(set(filmes_mais_similares.index) & set_avaliados) #e ver quais que já foram avaliados
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for i in similares_vistos: #para cada filme similar avaliado, calcular a média ponderada
            coeficiente = matriz_similaridade[filme][i]
            nota = ratings_filmes[usuario][i] 
            if (nota != 0):
                numerador += nota * coeficiente
                denominador += coeficiente
                qtd_notas += 1  
        if(qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['Item Pred',filme] = round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['Item Pred',filme] = 0
            resultado.at['# Notas',filme] = qtd_notas
            #recomendacao.loc[len(recomendacao)] = [int(filme), round(numerador/denominador,2), qtd_notas]
    return resultado.fillna(0)


filmes_não_avaliados = listar_filmes_nao_vistos(1)
recomendacao = predizer_notas_item_b(1, filmes_não_avaliados[:15000],movies_cosine,ratings_filmes,).T.join(movies[['title']], on=["movieId"]).sort_values(by=['Item Pred','# Notas','title'],ascending=False)
recomendacao.head(20)

Unnamed: 0_level_0,Item Pred,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
27773,5.0,3.0,Old Boy (2003)
6870,5.0,3.0,Mystic River (2003)
4848,5.0,3.0,Mulholland Drive (2001)
30749,5.0,3.0,Hotel Rwanda (2004)
2997,5.0,3.0,Being John Malkovich (1999)
46723,5.0,3.0,Babel (2006)
6773,4.9,4.0,"Triplets of Belleville, The (Les triplettes de..."
5902,4.9,4.0,Adaptation (2002)
8784,4.9,3.0,Garden State (2004)
2329,4.8,5.0,American History X (1998)


In [14]:
def testar_predicao_item_based(userId, filmes, notas_reais):
    notas_preditas = predizer_notas_item_b(userId, filmes, movies_cosine, ratings_filmes,25,0)
    notas_preditas = notas_preditas.loc["Item Pred"].values.tolist()
    return calcular_rmse(notas_reais,notas_preditas)

def testar_item_based(teste = ratings_teste):
    usuarios_teste = list(set(teste.userId))
    estatistica_item_based = pd.DataFrame(index = usuarios_teste)
    for usuario in usuarios_teste:
        filmes_teste =  get_filmes_teste(usuario)
        notas_dadas = get_notas_teste(usuario)
        estatistica_item_based.at[usuario,"rmse_Item"]=testar_predicao_item_based(usuario, filmes_teste, notas_dadas)
    estatistica_item_based['rmse_Item'].std()
    return estatistica_item_based.describe().T

resultado = []

for i in range(tt.get_k()):    
    inicio = time.time()
    matriz = gerar_matriz_movies_user(ratings_treino).T
    gerar_matriz_similaridade_item(matriz)
    parcial = testar_item_based(ratings_teste)
    resultado.append(parcial)
    fim = time.time()
    print(f'{i+1}º processo item-based gastou {round((fim - inicio),3)} segs', end='\r')
    tt.proxima_folha()

resultado = pd.concat(resultado)
display(resultado)

medias_i = resultado['mean'].mean()
desvios_padrao_i = resultado['std'].mean()
confianca_i = 0.99
conf_int_i = scipy.stats.norm.interval(confianca_i, loc=medias_i, scale=desvios_padrao_i) 
print_destaque('Relatório da final da ITEM BASED')
print(f"Média das RMSE:    {medias_i}\nDesvio padrão médio: {desvios_padrao_i}")
print(f"Intervalo de {confianca_i} confiança: {conf_int_i}\n")
print_destaque('Fim da 2ª parte')

10º processo item-based gastou 259.279 segs

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0
rmse_Item,6605.0,2.904681,0.817098,0.0,2.511512,2.964176,3.395303,5.0


[44m[93m Relatório da final da ITEM BASED 
[0m
Média das RMSE:    2.9046810289452636
Desvio padrão médio: 0.817098149066744
Intervalo de 0.99 confiança: (0.7999756727035767, 5.00938638518695)

[44m[93m Fim da 2ª parte 
[0m


---
---

# SVD: Fatoração de Matriz
Devido a esparsidade do dataset, os métodos tradicionais de filtragem colaborativa podem não serem adequados a demanda de processamento. Uma forma de tratar é fazendo uso do algoritmo de **Singular Value Decomposition**, SVD.<br>
Neste algoritmo, a matriz é decomposta em  em outras 3 matrizes de menor dimensionalidade.
$$ A = USV^T$$
- A é a matriz original m x n
- U é uma matriz ortogonal m x n (mesmo shape de A)
- S é uma matriz diagonal n x n (valores $\sigma_1 \geqslant \sigma_2 \geqslant ... \geqslant \sigma_n$ => ordenados por importância)
- V é uma matriz ortogonal n x n

In [15]:
from scipy.sparse.linalg import svds
from numpy import count_nonzero
U, sigma, Vt = svds(ratings_usuarios.to_numpy(), k = 10) 

print(f"Matriz original{ratings_usuarios.shape} decomposta em U{U.shape}, sigma {sigma.shape} e Vt{Vt.shape}.")

sigma_matriz_diagonal=np.diag(sigma) #sigma é um array contendo a diagonal
all_user_predicted_ratings = np.dot(np.dot(U, sigma_matriz_diagonal), Vt)
matriz_SVD = pd.DataFrame(all_user_predicted_ratings, columns = ratings_usuarios.columns, index=ratings_usuarios.index)

esparsidade_SVD = 1.0 - ( count_nonzero(matriz_SVD) / float(matriz_SVD.size) )
print("Esparsidade: ", esparsidade_SVD,"%")

matriz_SVD.head()

Matriz original(6747, 21215) decomposta em U(6747, 10), sigma (10,) e Vt(10, 21215).
Esparsidade:  0.03158142823473953 %


Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,207309,207311,207367,207405,207642,207830,207890,208002,208080,208737
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.316991,0.071769,-0.036591,-0.007795,-0.003778,-0.014944,0.004399,-0.001487,-0.030675,0.066213,...,-0.005078,0.0,-0.008943,-0.004747,0.0,-0.002721,-0.002019,-0.001174,-0.002265,3.1e-05
2,2.302708,0.665585,0.31281,-0.030012,0.260814,0.130548,0.498683,0.024175,0.054898,0.778422,...,-0.014935,0.0,-0.020815,-0.017087,0.0,-0.009259,-0.002197,-0.002434,-0.009408,-0.000568
3,1.279988,1.036564,0.232577,-0.028414,0.306969,2.220678,0.130908,-0.037104,0.271859,0.488881,...,0.043701,0.0,0.058553,0.037112,0.0,0.036244,0.005182,0.00663,0.024354,0.000902
4,2.186584,0.514753,-0.087683,-0.070939,-0.074234,0.295913,-0.044596,-0.008031,0.01318,0.052585,...,0.013437,0.0,0.027374,0.027976,0.0,0.023093,0.002922,0.004156,0.004201,0.001038
5,2.565389,0.874138,1.050205,0.226905,0.949202,1.55103,1.140282,0.090032,0.42067,1.109107,...,0.002548,0.0,-0.000758,-0.003278,0.0,-0.002843,-0.001698,0.000727,0.001704,0.000117


In [16]:
def teste_SVD2(): # teste de https://github.com/khanhnamle1994/movielens/blob/master/SVD_Model.ipynb acurácia menor que o que eu implementei
    R = ratings_usuarios.values
    user_ratings_mean = np.mean(R, axis = 1)
    Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)
    U, sigma, Vt = svds(Ratings_demeaned, k = 50)
    sigma = np.diag(sigma)
    all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
    preds = pd.DataFrame(all_user_predicted_ratings, columns = ratings_usuarios.columns)
    return preds
matriz_SVD2 = teste_SVD2()

In [17]:
def testar_predicao_svd(userId, filmes, notas_reais):
    try:
        notas_preditas = matriz_SVD[matriz_SVD.index==userId][filmes].values.tolist()[0]
    except:
        return 0
    return calcular_rmse(notas_reais,notas_preditas)

def testar_svd(teste = ratings_teste):
    usuarios_teste = list(set(teste.userId))
    estatistica_user_based = pd.DataFrame(index = usuarios_teste)
    for usuario in usuarios_teste:
        filmes_teste =  get_filmes_teste(usuario)
        notas_dadas = get_notas_teste(usuario)
        estatistica_user_based.at[usuario,"rmse_svd"]=testar_predicao_svd(usuario, filmes_teste, notas_dadas)
    estatistica_user_based['rmse_svd'].std()
    return estatistica_user_based.describe().T

resultado = []

inicio = time.time()
for i in range(tt.get_k()):
    inicio = time.time()
    ratings_treino, ratings_teste = tt.get_treino_teste()
    parcial = testar_svd(ratings_teste)
    resultado.append(parcial)
    fim = time.time()
    print(f'{i+1}º processo svd gastou {round((fim - inicio),5)} segs', end='\r')
    tt.proxima_folha()

resultado = pd.concat(resultado)
display(resultado)

medias_s = resultado['mean'].mean()
desvios_padrao_s = resultado['std'].mean()
confianca_s = 0.99
conf_int_s = scipy.stats.norm.interval(confianca_s, loc=medias_s, scale=desvios_padrao_s) 
print_destaque('Relatório da final da SVD')
print(f"Média das RMSE:    {medias_s}\nDesvio padrão médio: {desvios_padrao_s}")
print(f"Intervalo de {confianca_s} confiança: {conf_int_s}\n")
print_destaque('Fim de execução')

10º processo svd gastou 7.22699 segs

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rmse_svd,6609.0,3.002203,0.758693,0.0,2.542601,3.011773,3.496127,5.003161
rmse_svd,6592.0,3.002512,0.773298,0.0,2.547127,3.013409,3.51268,5.034362
rmse_svd,6617.0,3.001966,0.765231,0.0,2.532174,3.009694,3.519463,5.00126
rmse_svd,6601.0,3.008754,0.750816,0.0,2.550901,3.007611,3.496328,5.035563
rmse_svd,6611.0,2.996087,0.762864,0.0,2.535555,3.009185,3.493939,4.995471
rmse_svd,6603.0,2.998936,0.765822,0.0,2.536099,2.998479,3.500567,5.014938
rmse_svd,6620.0,3.003873,0.747259,0.0,2.546102,3.020105,3.489102,5.003561
rmse_svd,6623.0,2.995446,0.772499,0.0,2.533819,3.017296,3.503793,5.002598
rmse_svd,6598.0,2.995974,0.759152,0.0,2.538114,3.010039,3.482527,4.999142
rmse_svd,6605.0,3.04167,0.762682,0.0,2.579887,3.043836,3.540335,5.022537


[44m[93m Relatório da final da SVD 
[0m
Média das RMSE:    3.004741988625619
Desvio padrão médio: 0.7618315934159713
Intervalo de 0.99 confiança: (1.0423938459354085, 4.967090131315829)

[44m[93m Fim de execução 
[0m


---
---

# Executando todos aos juntos

In [18]:
print_destaque('Relatório da final da USER BASED')
print(f"Média das RMSE:    {medias_u}\nDesvio padrão médio: {desvios_padrao_u}")
print(f"Intervalo de {confianca_u} confiança: {conf_int_u}\n")

print_destaque('Relatório da final da ITEM BASED')
print(f"Média das RMSE:    {medias_i}\nDesvio padrão médio: {desvios_padrao_i}")
print(f"Intervalo de {confianca_i} confiança: {conf_int_i}\n")

print_destaque('Relatório da final da SVD')
print(f"Média das RMSE:    {medias_s}\nDesvio padrão médio: {desvios_padrao_s}")
print(f"Intervalo de {confianca_s} confiança: {conf_int_s}\n")
print_destaque('Fim de execução')

[44m[93m Relatório da final da USER BASED 
[0m
Média das RMSE:    1.4476802972601517
Desvio padrão médio: 0.7917606847765684
Intervalo de 0.99 confiança: (-0.591760075985277, 3.4871206705055804)

[44m[93m Relatório da final da ITEM BASED 
[0m
Média das RMSE:    2.9046810289452636
Desvio padrão médio: 0.817098149066744
Intervalo de 0.99 confiança: (0.7999756727035767, 5.00938638518695)

[44m[93m Relatório da final da SVD 
[0m
Média das RMSE:    3.004741988625619
Desvio padrão médio: 0.7618315934159713
Intervalo de 0.99 confiança: (1.0423938459354085, 4.967090131315829)

[44m[93m Fim de execução 
[0m


In [19]:
tt = Treino_Teste(ratings,10)
resultado_user = []
resultado_item = []
resultado_svd = []

for i in range(tt.get_k()):
    ratings_treino, ratings_teste = tt.get_treino_teste()

    inicio = time.time()
    matriz_user = gerar_matriz_movies_user(ratings_treino)
    gerar_matriz_similaridade_user(matriz_user)
    parcial_user = testar_user_based(ratings_teste)    
    resultado_user.append(parcial_user)
    fim = time.time()
    print(f'{i+1}º processo user-based gastou {fim - inicio} segs')

    inicio = time.time()
    matriz_item = matriz_user.T
    gerar_matriz_similaridade_item(matriz_item)
    parcial_item = testar_item_based(ratings_teste)    
    resultado_item.append(parcial_item)
    fim = time.time()
    print(f'{i+1}º processo item-based gastou {fim - inicio} segs')

    inicio = time.time()
    parcial_svd = testar_svd(ratings_teste)
    print(f'{i+1}º processo(s)', end='\r')
    resultado_svd.append(parcial_svd)
    fim = time.time()
    print(f'{i+1}º processo SVD gastou {fim - inicio} segs')

    tt.proxima_folha()

resultado_user = pd.concat(resultado_user) #transformar a lista em DataFrame
medias_u = resultado_user['mean'].mean()
desvios_padrao_u = resultado_user['std'].mean()
confianca_u = 0.99
conf_int_u = scipy.stats.norm.interval(confianca_u, loc=medias_u, scale=desvios_padrao_u) 
print_destaque('Relatório da final da USER BASED')
print(f"Média das RMSE:    {medias_u}\nDesvio padrão médio: {desvios_padrao_u}")
print(f"Intervalo de {confianca_u} confiança: {conf_int_u}\n")

resultado_item = pd.concat(resultado_item) 
medias_i = resultado_item['mean'].mean()
desvios_padrao_i = resultado_item['std'].mean()
confianca_i = 0.99
conf_int_i = scipy.stats.norm.interval(confianca_i, loc=medias_i, scale=desvios_padrao_i) 
print_destaque('Relatório da final da ITEM BASED')
print(f"Média das RMSE:    {medias_i}\nDesvio padrão médio: {desvios_padrao_i}")
print(f"Intervalo de {confianca_i} confiança: {conf_int_i}\n")

resultado_svd = pd.concat(resultado_svd) 
medias_s = resultado_svd['mean'].mean()
desvios_padrao_s = resultado_svd['std'].mean()
confianca_s = 0.99
conf_int_s = scipy.stats.norm.interval(confianca_s, loc=medias_s, scale=desvios_padrao_s) 
print_destaque('Relatório da final da SVD')
print(f"Média das RMSE:    {medias_s}\nDesvio padrão médio: {desvios_padrao_s}")
print(f"Intervalo de {confianca_s} confiança: {conf_int_s}\n")
print_destaque('Fim de execução')

1º processo user-based gastou 56.23309087753296 segs
1º processo item-based gastou 253.98591375350952 segs
1º processo SVD gastou 7.720290899276733 segs
2º processo user-based gastou 59.94591021537781 segs
2º processo item-based gastou 255.5558397769928 segs
2º processo SVD gastou 9.966126918792725 segs
3º processo user-based gastou 56.00648498535156 segs
3º processo item-based gastou 258.1151087284088 segs
3º processo SVD gastou 10.393805742263794 segs
4º processo user-based gastou 58.32508087158203 segs
4º processo item-based gastou 261.4222481250763 segs
4º processo SVD gastou 10.028176307678223 segs
5º processo user-based gastou 55.94313621520996 segs
5º processo item-based gastou 258.3511424064636 segs
5º processo SVD gastou 10.618063926696777 segs
6º processo user-based gastou 59.00544834136963 segs
6º processo item-based gastou 258.72985196113586 segs
6º processo SVD gastou 11.222956418991089 segs
7º processo user-based gastou 59.541555881500244 segs
7º processo item-based gasto