# Trabalho de Filtragem Colaborativa

Modelo de filtragem colaborativa usa as informações de ratings dos usuários para prover recomendações. <br>
**Ideia Principal:** A semelhança entre os usuários a partir do que se observa da interseção dos seus ratings permite inferir que os dados não inputados por apenas um destes poderia ter a mesma semelhança com relação ao outro. <br>
**Problema Principal:** A esparsidade dos dados, afinal a informação que um usuário provê é normalmente a um subconjunto muito pequeno dos itens. Logo a maioria da base de dados é de dados *faltantes* ou *não observados*. <br>
<br>
Há 2 métodos de filtragem colaborativa: <br>

**Memory-Based:** Também chamado de *neighborhood-based collaborative filtering algorithms*. Que se dividem basicamente em *user-based collaborative filtering* e *item-based collaborative filtering*. <br>
**Model-Based:** Modelos baseados em **machine learning** e **data mining** há um processo de aprendizado prévio para parametrizição. Alguns métodos são Decisions Trees, métodos bayesianos, modelos baseados em regras e latent factor method. <br>

## Implementação usando MovieLens


In [8]:
import pandas as pd         #DataFrames e operações associadas
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity          #Similaridade
import math  
import sklearn.metrics  


### Carregar a base de 25M ou 100K

In [10]:
base = 'small'  #qualquer outro valor e carregará a base completa
ratings = pd.DataFrame()
movies = pd.DataFrame()

if (base == 'small'):
    ratings = pd.read_csv('ratings_small.csv')
    movies = pd.read_csv('movies_small.csv').set_index("movieId")
else:
    ratings = pd.read_csv('ratings.csv')
    movies = pd.read_csv('movies.csv')

n_ratings = len(ratings)
n_users = len(ratings['userId'].unique())
n_movies_avaliados = len(ratings['movieId'].unique())
print(f"Total de ratings: {n_ratings}")
print(f"Total de filmes: {len(movies)}")
print(f"Filmes avaliados: {n_movies_avaliados}")
print(f"Total de usuários: {n_users}")
print(f"Média de ratings/user: {round(n_ratings/n_users, 2)}")
print(f"Shape de Ratings: {ratings.shape}")
esparsidade = round(1.0 -n_ratings/float(n_users * n_movies_avaliados),3)
print(f"O nível de esparsidade do dataset é {esparsidade * 100}%")
ratings.sample(6).sort_index()


Total de ratings: 100836
Total de filmes: 9742
Filmes avaliados: 9724
Total de usuários: 610
Média de ratings/user: 165.3
Shape de Ratings: (100836, 4)
O nível de esparsidade do dataset é 98.3%


Unnamed: 0,userId,movieId,rating,timestamp
15515,102,150,3.0,835875691
32538,221,7216,4.5,1119984215
37739,256,453,4.0,1447000649
62026,411,111,4.0,835532253
63922,414,5347,2.0,1047266264
77839,483,72998,4.5,1262525638


In [11]:
print("Relatório dos ratings dados pelos usuários:")
ratings.groupby('userId')['rating'].count().describe()

Relatório dos ratings dados pelos usuários:


count     610.000000
mean      165.304918
std       269.480584
min        20.000000
25%        35.000000
50%        70.500000
75%       168.000000
max      2698.000000
Name: rating, dtype: float64

#### Funções para auxiliar

In [82]:
def listar_filmes_ja_vistos(usuario, matriz_filmes_X_usuarios):
    #filmes_ja_vistos_bin = matriz_filmes_X_usuarios.loc[usuario].gt(0)   #gerar array com o que usuário já deu rating: True ou False
    #return filmes_ja_vistos_bin.index[filmes_ja_vistos_bin].to_list() #com base no anterior, listar filmes que já viu        
    if(type(usuario)==list):
        filmes_usuario = matriz_filmes_X_usuarios.loc[usuario].sum(axis = 0)
    else: #if(type(usuario)==int):
        filmes_usuario = matriz_filmes_X_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario!=0].index.to_list()

def listar_filmes_nao_vistos(usuario, matriz_filmes_X_usuarios):
    if(type(usuario)==list):
        filmes_usuario = matriz_filmes_X_usuarios.loc[usuario].sum(axis = 0)
    else: 
        filmes_usuario = matriz_filmes_X_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario==0].index.to_list()

def eliminar_colunas_zeradas(matriz):
    return matriz.loc[:, (matriz != 0).any(axis=0)] #elimina todas as colunas cujos todos os valores são 0

def listar_nomes_filmes(indices):
    return movies.loc[[indices]]['title'].values.tolist()

## Gerar matriz Movies X User

In [14]:
#movies_X_users = ratings.pivot_table(index="userId", columns="movieId", values="rating", fill_value=0)
# A LINHA DE CIMA FAZ O MESMO QUE A DE BAIXO GASTANDO 15X MAIS TEMPO
movies_X_users = ratings.groupby(['userId', 'movieId'])['rating'].first().unstack(fill_value=0.0)
movies_X_users

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,3.895825,-0.104175,3.895825,-0.104175,-0.104175,3.895825,-0.104175,-0.104175,-0.104175,-0.104175,...,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175
2,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,...,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775
3,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,...,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770
4,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,...,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980
5,3.983546,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,...,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.080625,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,2.080625,-0.419375,-0.419375,-0.419375,...,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375
607,3.927190,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,...,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810
608,2.232158,1.732158,1.732158,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,3.732158,...,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842
609,2.987557,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,3.987557,...,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443


In [27]:
moviesTitles_users = movies_X_users
moviesTitles_users.columns = movies.loc[movies_X_users.columns.values.tolist()].title.values.tolist()
moviesTitles_users.head(3)

Unnamed: 0_level_0,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0,0.0,0.0,...,0.0,0,0,0,0.0,0,0.0,0.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,...,0.0,0,0,0,0.0,0,0.0,0.0,0.0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,...,0.0,0,0,0,0.0,0,0.0,0.0,0.0,0


## User-Based Collaborative Filtering

### Matriz de Similaridade por Usuário
A medida comumente usada é a similaridade do cosseno.
Essa medida de similaridade deve seu nome ao fato de ser igual ao cosseno do ângulo entre os dois vetores que estão sendo comparados:  vetores de similaridade de usuário (ou item) de ratings. Quanto menor o ângulo entre dois vetores, maior será o cosseno, resultando em um fator de similaridade mais alto. 

In [16]:
users_cosine_array = cosine_similarity(movies_X_users)
users_cosine = pd.DataFrame(data=users_cosine_array, index=movies_X_users.index, columns=movies_X_users.index)
users_cosine.round(3).head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.027,0.06,0.194,0.129,0.128,0.159,0.137,0.064,0.017,...,0.081,0.164,0.221,0.071,0.154,0.164,0.269,0.291,0.094,0.145
2,0.027,1.0,0.0,0.004,0.017,0.025,0.028,0.027,0.0,0.067,...,0.203,0.017,0.012,0.0,0.0,0.028,0.013,0.046,0.028,0.102
3,0.06,0.0,1.0,0.002,0.005,0.004,0.0,0.005,0.0,0.0,...,0.005,0.005,0.025,0.0,0.011,0.013,0.019,0.021,0.0,0.032
4,0.194,0.004,0.002,1.0,0.129,0.088,0.115,0.063,0.011,0.031,...,0.086,0.128,0.308,0.053,0.085,0.2,0.132,0.15,0.032,0.108
5,0.129,0.017,0.005,0.129,1.0,0.3,0.108,0.429,0.0,0.031,...,0.068,0.419,0.11,0.259,0.149,0.106,0.153,0.136,0.261,0.061


#### Similaridade por Pearson
Fazer uma matriz usando também 'pearson': standard correlation coefficient (mediana + cosseno) <br>
kendall : Kendall Tau correlation coefficient <br>
spearman : Spearman rank correlation

In [33]:
#normalizar as notas pelas média EXECUTAR ANTES DE COSINE PARA OBTER VALORES IGUAIS A PEASRON
#movies_X_users = movies_X_users - np.asarray([(np.mean(movies_X_users, 1))]).T
#movies_X_users

#fazer pearson!
users_pearson = movies_X_users.T.corr(method='pearson')
users_pearson.round(3).head()


userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.019,0.053,0.177,0.121,0.104,0.144,0.129,0.055,-0.0,...,0.066,0.15,0.187,0.057,0.134,0.122,0.254,0.262,0.085,0.099
2,0.019,1.0,-0.003,-0.004,0.013,0.016,0.022,0.024,-0.003,0.062,...,0.199,0.011,-0.004,-0.005,-0.008,0.011,0.006,0.033,0.024,0.089
3,0.053,-0.003,1.0,-0.005,0.002,-0.005,-0.006,0.002,-0.003,-0.006,...,0.0,-0.001,0.011,-0.005,0.004,-0.003,0.013,0.008,-0.003,0.016
4,0.177,-0.004,-0.005,1.0,0.121,0.066,0.101,0.054,0.002,0.016,...,0.073,0.114,0.282,0.04,0.065,0.165,0.115,0.117,0.024,0.063
5,0.121,0.013,0.002,0.121,1.0,0.294,0.102,0.427,-0.004,0.023,...,0.062,0.415,0.095,0.254,0.141,0.09,0.146,0.123,0.258,0.04


### Selecionar a similaridade desejada

In [17]:
#selecionar se vai usar pearson ou cosine para similaridade
matriz_similaridade = users_cosine 
if (False):     # Se verdadeiro, mudar para similaridade por pearson
    matriz_similaridade = users_pearson

### Pegar os k usuários mais similares ao Target selecionado

In [18]:
def selecionar_usuarios_mais_similares(target = 1, k = 25):
    todas_similaridades_com_usuario = matriz_similaridade.loc[target].to_numpy()     #criar um array com a linha do target na matrix de similaridades
    usuarios_mais_similares = movies_X_users.index[todas_similaridades_com_usuario.argpartition(-k)[-k-1:-1]] #seleciona os k com similaridade mais alta no vetor excluindo ele mesmo
    return usuarios_mais_similares 

### Matriz de filmes não vistos pelo usuário target e que receberam notas dos usuários mais similares

In [19]:
target = 1
usuarios_mais_similares = selecionar_usuarios_mais_similares(1,)
#gerar matriz dos usuarios mais similares x filmes não assistidos ainda pelo usuário
usuarios_similares_X_filmes_nao_vistos = movies_X_users.loc[usuarios_mais_similares].drop(columns=listar_filmes_ja_vistos(1,movies_X_users)) 
usuarios_similares_X_filmes_nao_vistos = eliminar_colunas_zeradas(usuarios_similares_X_filmes_nao_vistos)
print('Matriz de filmes não vistos por usuários mais similares:',usuarios_similares_X_filmes_nao_vistos.shape)
usuarios_similares_X_filmes_nao_vistos.head(8)

Matriz de filmes não vistos por usuários mais similares: (25, 2843)


movieId,2,5,7,9,10,11,12,13,15,16,...,146656,148626,149406,152081,160438,164179,165101,166528,166643,168174
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
294,3.0,0.0,0.0,0.0,3.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
328,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,...,0.0,4.0,0.0,2.0,0.0,0.0,0.5,0.0,4.0,0.0
288,2.0,2.0,0.0,0.0,3.0,0.0,2.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
91,3.0,0.0,0.0,0.0,3.5,0.0,0.0,0.0,0.0,4.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
313,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
590,2.5,2.0,0.0,0.0,3.5,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
266,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
64,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Gerar a Recomendação de acordo com a nota dada pelos usuários similares

1. Para cada filme: 
2. Para cada usuário da lista de mais similar:
    1. Se nota foi dada: somar nota seguindo a fórmula
$$ nota_ = {\sum coeficienteUsuário * notaUsuário \over \sum coeficienteUsuário} $$

In [172]:
min_threshold = 5
qtd_sugestoes = 20 #qtd de sugestões para exibir na tela

#Dataframe com a similaridade_dos_mais_similares: [userId, coeficiente de similaridade] 
similaridade_dos_mais_similares = matriz_similaridade.loc[target,usuarios_mais_similares]  

recomendacao = pd.DataFrame(columns=usuarios_similares_X_filmes_nao_vistos.columns,index=['Nota Final', '# Notas'])
for filme in usuarios_similares_X_filmes_nao_vistos.columns:
    numerador = 0
    denominador = 0
    qtd_notas = 0
    for id in similaridade_dos_mais_similares.index:        
        nota_ = usuarios_similares_X_filmes_nao_vistos.loc[id,filme]
        if ( nota_ != 0):
            peso_ = similaridade_dos_mais_similares[id]
            numerador += nota_ * peso_
            denominador += peso_
            qtd_notas += 1
    if (qtd_notas < min_threshold):
        recomendacao = recomendacao.drop([filme], axis=1) #se a qtd de notas computadas dos usuários similares for abaixo do limiar, descartar a recomendação daquele filme
    else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
        recomendacao.at['Nota Final',filme] = round(numerador/denominador,1)
        recomendacao.at['# Notas',filme] = qtd_notas
    
recomendacao.T.sort_values(by=['Nota Final','# Notas',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on="movieId")

Unnamed: 0_level_0,Nota Final,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
750,4.7,9,Dr. Strangelove or: How I Learned to Stop Worr...
1203,4.7,7,12 Angry Men (1957)
953,4.7,6,It's a Wonderful Life (1946)
858,4.6,19,"Godfather, The (1972)"
1387,4.6,19,Jaws (1975)
318,4.5,14,"Shawshank Redemption, The (1994)"
1193,4.5,13,One Flew Over the Cuckoo's Nest (1975)
4993,4.5,11,"Lord of the Rings: The Fellowship of the Ring,..."
1204,4.5,9,Lawrence of Arabia (1962)
1199,4.5,8,Brazil (1985)


In [20]:
def predizer_notas(target, similares, filmes, min_threshold=0): # min_threshold: qtd mínima de notas para ser considerada ao recomendar um filme
    #Dataframe com a similaridade_dos_mais_similares: [userId, coeficiente de similaridade] 
    similaridade_dos_mais_similares = matriz_similaridade.loc[target,similares] 

    #matriz_similares_filmes = movies_X_users.loc[similares
    resultado = pd.DataFrame(columns=filmes, index=['Nota Final', '# Notas'] ) 
    resultado.columns.name = 'movieId'
    for filme in filmes:
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for similar in similares:
            nota = movies_X_users.loc[similar,filme]
            if (nota != 0):
                coeficiente = similaridade_dos_mais_similares[similar]
                numerador += nota * coeficiente
                denominador += coeficiente
                qtd_notas += 1
        if (qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            resultado.at['Nota Final',filme] = round(numerador/denominador,1)
            resultado.at['# Notas',filme] = qtd_notas
    return resultado

In [21]:
qtd_sugestoes = 20 #qtd de sugestões para exibir na tela

filmes_vistos_pelo_usuario = listar_filmes_ja_vistos(target, movies_X_users) 
filmes_vistos_pelos_similares = listar_filmes_ja_vistos(usuarios_mais_similares.values.tolist(), movies_X_users)
filmes_a_avaliar = list(set(filmes_vistos_pelos_similares)-set(filmes_vistos_pelo_usuario))
recomendacao = predizer_notas(target, usuarios_mais_similares, filmes_a_avaliar, 5)
#recomendacao
recomendacao.T.sort_values(by=['Nota Final','# Notas',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on=["movieId"])


Unnamed: 0_level_0,Nota Final,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
750,4.7,9,Dr. Strangelove or: How I Learned to Stop Worr...
1203,4.7,7,12 Angry Men (1957)
953,4.7,6,It's a Wonderful Life (1946)
858,4.6,19,"Godfather, The (1972)"
1387,4.6,19,Jaws (1975)
318,4.5,14,"Shawshank Redemption, The (1994)"
1193,4.5,13,One Flew Over the Cuckoo's Nest (1975)
4993,4.5,11,"Lord of the Rings: The Fellowship of the Ring,..."
1204,4.5,9,Lawrence of Arabia (1962)
1199,4.5,8,Brazil (1985)


## Avaliando a eficácia do método
1. Selecionar um target randômico
2. Selecionar alguns valores de notas dadas por ele
3. Tentar predizer sua nota com base nos seus similares, 
4. Calcular margem de erro 

Uma medida frequentemente usada na verificação da acurácia de modelos numéricos é o Erro Quadrático Médio (MSE na sigla em Inglês) como descrito, por exemplo, em Wilks (2006).MSE é sempre positivo. MSE = 0 indica simulação perfeita. MSE é definido por:
$$ MSE = \frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2 $$

Em adição ao MSE, a raiz quadrada de MSE, ou Raiz do Erro Quadrático Médio (RMSE em Inglês), é comumente usada para expressar a acurácia dos resultados numéricos com a vantagem de que RMSE apresenta valores do erro nas mesmas dimensões da variável analisada. O RMSE é definido por:
$$ RMSE = \sqrt{\frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2} $$

In [22]:
rnd_user = movies_X_users.sample()
rnd_user = eliminar_colunas_zeradas(rnd_user)
rnd_user

movieId,2,3,6,19,21,25,36,39,48,50,...,79132,87234,89745,91500,91529,92259,95377,96821,106642,117192
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
308,3.0,0.5,1.0,1.0,1.0,1.0,1.0,3.0,4.0,2.5,...,1.0,5.0,4.5,3.0,3.0,1.0,3.0,5.0,5.0,5.0


In [84]:
rnd_user_id = rnd_user.index.values[0]

filmes_assistidos = listar_filmes_ja_vistos(rnd_user_id, movies_X_users)
similares = selecionar_usuarios_mais_similares(rnd_user_id)
predicao = predizer_notas(rnd_user_id, similares, filmes_assistidos, 0)
predicao

movieId,2,3,6,19,21,25,36,39,48,50,...,79132,87234,89745,91500,91529,92259,95377,96821,106642,117192
Nota Final,3.4,0.5,2.6,2.5,1.7,1.3,1.4,3.0,3.4,4.0,...,3.5,5.0,4.2,3.4,3.9,3.1,3.0,4.2,4.6,5.0
# Notas,9.0,1.0,4.0,8.0,3.0,2.0,3.0,5.0,7.0,9.0,...,15.0,1.0,12.0,9.0,12.0,6.0,1.0,5.0,3.0,1.0


In [85]:
def calcular_rmse(real, previsao):
    mse = sklearn.metrics.mean_squared_error(notasReais, notasPreditas)     
    #mse = np.square(np.subtract(notasReais,notasPreditas)).mean()      
    print("Erro Quadrático Médio (MSE):", mse)         
    return math.sqrt(mse)  

notasReais = rnd_user.values.tolist()[0]
notasPreditas = predicao.loc["Nota Final"].values.tolist()
print("Raiz do Erro Quadrático Médio :", calcular_rmse(notasReais, notasPreditas))  

Erro Quadrático Médio (MSE): 1.5205217391304346
Raiz do Erro Quadrático Médio : 1.2330943755975998


---
---

## Item-Based Collaborative Filtering

### Gerar matriz Users X Movies
Transposição da matriz que tinha usuários nas linhas e filmes nas colunas, para filmes nas linhas e usuários nas colunas

In [2]:
users_X_movies = movies_X_users.T
users_X_movies.head(4)

NameError: name 'movies_X_users' is not defined

### Matriz de Similaridade
Similaridade por Coseno dos filmes entre si

In [210]:
movies_cosine_array = cosine_similarity(users_X_movies)
movies_cosine = pd.DataFrame(data=movies_cosine_array, index=users_X_movies.index, columns=users_X_movies.index)
movies_cosine.head()
#movies_pearson = movies_users.corr(method='pearson')

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.410562,0.296917,0.035573,0.308762,0.376316,0.277491,0.131629,0.232586,0.395573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.410562,1.0,0.282438,0.106415,0.287795,0.297009,0.228576,0.172498,0.044835,0.417693,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.296917,0.282438,1.0,0.092406,0.417802,0.284257,0.402831,0.313434,0.30484,0.242954,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.035573,0.106415,0.092406,1.0,0.188376,0.089685,0.275035,0.158022,0.0,0.095598,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.308762,0.287795,0.417802,0.188376,1.0,0.298969,0.474002,0.283523,0.335058,0.218061,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Selecionar um usuário target e seus filmes favoritos
É obtido a nota mais alta que este usuário deu e todos os filmes  com a mesma nota. Estes chamaremos de filmes favorito e de acordo eles é buscada a similaridade entre os itens

In [302]:
usuarioId_target = 1
usuario = movies_X_users.loc[usuarioId_target].sort_values(ascending=False)
rate_mais_alto = usuario.iloc[0]
filmes_mais_gosta = usuario[usuario >= rate_mais_alto].index.tolist()
assistiu_n_filmes = len(usuario[usuario > 0].index.to_list())
print(f"O usuário {usuarioId_target} assistiu {assistiu_n_filmes} filmes e deu nota {rate_mais_alto} para estes {len(filmes_mais_gosta)} filmes: ")
print(listar_nomes_filmes(filmes_mais_gosta))

O usuário 1 assistiu 232 filmes e deu nota 5.0 para estes 124 filmes: 
['Live and Let Die (1973)', 'Blues Brothers, The (1980)', 'Fantasia (1940)', 'Edward Scissorhands (1990)', "Gulliver's Travels (1939)", 'Indiana Jones and the Last Crusade (1989)', 'Pink Floyd: The Wall (1982)', 'Dirty Dozen, The (1967)', 'Goldfinger (1964)', 'From Russia with Love (1963)', 'Dr. No (1962)', 'Fugitive, The (1993)', 'Fight Club (1999)', 'Who Framed Roger Rabbit? (1988)', 'Mr. Smith Goes to Washington (1939)', 'Thunderball (1965)', 'Rob Roy (1995)', 'Rocky (1976)', 'Young Frankenstein (1974)', 'Highlander (1986)', 'Tombstone (1993)', 'Pinocchio (1940)', 'Henry V (1989)', 'JFK (1991)', 'Quiet Man, The (1952)', 'Terminator, The (1984)', 'American History X (1998)', 'Gladiator (2000)', 'Back to the Future (1985)', 'American Beauty (1999)', 'Bottle Rocket (1996)', 'Billy Madison (1995)', 'Duck Soup (1933)', 'Excalibur (1981)', "Schindler's List (1993)", 'NeverEnding Story, The (1984)', 'Spaceballs (1987)',

### Similaridade dos filmes favoritos com os filmes não assistidos
Matriz onde cada um dos filmes favoritos é um índice e cada coluna é um filme que não foram assistidos ainda.

In [289]:
lista_filmes_ja_vistos = listar_filmes_ja_vistos(usuarioId_target,movies_X_users)
print(f"Usuario {usuarioId_target} já assistiu {len(lista_filmes_ja_vistos)} filmes.")
filmes_pro_usuario = movies_cosine.loc[filmes_mais_gosta].drop(columns=lista_filmes_ja_vistos)
filmes_pro_usuario = eliminar_colunas_zeradas(filmes_pro_usuario) #eliminar as colunas dos filmes que não similaridade nenhuma com nada
filmes_pro_usuario

Usuario 1 já assistiu 232 filmes.


movieId,2,4,5,7,8,9,10,11,12,13,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2991,0.148219,0.000000,0.126004,0.045809,0.057514,0.054039,0.292736,0.136541,0.219055,0.049193,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1220,0.301501,0.054021,0.138388,0.151197,0.067566,0.011203,0.306229,0.243034,0.097315,0.117648,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1282,0.268140,0.020001,0.203573,0.197865,0.121510,0.016118,0.188510,0.235216,0.110627,0.196821,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2291,0.355192,0.030574,0.179859,0.130588,0.072232,0.029656,0.263267,0.176428,0.071878,0.084399,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2899,0.040948,0.000000,0.000000,0.000000,0.000000,0.116505,0.061309,0.000000,0.143113,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1196,0.379890,0.043835,0.151397,0.164152,0.067499,0.051514,0.396171,0.229860,0.153564,0.121005,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2353,0.294842,0.000000,0.042770,0.082134,0.088108,0.040906,0.268768,0.196302,0.132080,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1927,0.085195,0.000000,0.016860,0.030874,0.000000,0.000000,0.026092,0.065416,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2700,0.300846,0.022952,0.120011,0.086085,0.094616,0.026422,0.348305,0.133002,0.133884,0.129379,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Transformamos a matriz numa 1xN filmes com o valor máximo encontra de similaridade; e ordenamos essa matriz, selecionando os k filmes com maior similaridade apresentada <br>
Feito também um **join** com movies para mostrar o título

In [290]:
k = 20
#pegar a similaridade máxima que cada um dos filmes não vistos possui com os filmes já vistos
recomendacao = filmes_pro_usuario.max().sort_values(ascending=False).head(k)
recomendacao = pd.DataFrame(recomendacao).join(movies['title'], on='movieId')
recomendacao

Unnamed: 0_level_0,0,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
2683,0.751857,Austin Powers: The Spy Who Shagged Me (1999)
3633,0.724992,On Her Majesty's Secret Service (1969)
5083,0.707107,Rare Birds (2001)
3747,0.707107,Jesus' Son (1999)
2607,0.706998,Get Real (1998)
5347,0.706018,Deuces Wild (2002)
5425,0.704664,Dark Blue World (Tmavomodrý svet) (2001)
5486,0.7,Who Is Cletis Tout? (2001)
3078,0.7,Liberty Heights (1999)
589,0.695724,Terminator 2: Judgment Day (1991)


In [1]:
recomendacao = recomendacao.rename(columns={'title': 'Recomendação', 0: 'Score'})
pq_vc_assistiu = []
nota_media = []
for id in recomendacao.index:
    pq_vc_assistiu.append(filmes_pro_usuario.index[filmes_pro_usuario[id] == recomendacao.loc[id][0]].tolist()[0])
    nota_media.append(movies_X_users[id].mean())
recomendacao['Nota média'] = nota_media
recomendacao['Pq vc assistiu'] = listar_nomes_filmes(pq_vc_assistiu)
recomendacao

NameError: name 'recomendacao' is not defined

---
---

# SVD: Fatoração de Matriz
Devido a esparsidade do dataset, os métodos tradicionais de filtragem colaborativa podem não serem adequados a demanda de processamento. Uma forma de tratar é fazendo uso do algoritmo de **Singular Value Decomposition**, SVD.<br>
Neste algoritmo, a matriz é decomposta em  em outras 3 matrizes de menor dimensionalidade.
$$ A = USV^T$$
- A é a matriz original m x n
- U é uma matriz ortogonal m x n
- S é uma matriz diagona n x n
- V é uma matriz ortogonal n x n

https://heartbeat.comet.ml/recommender-systems-with-python-part-iii-collaborative-filtering-singular-value-decomposition-5b5dcb3f242b

https://www.kaggle.com/code/cast42/simple-svd-movie-recommender

In [315]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(movies_X_users.to_numpy(), k = 50) #o que são essas k features

print(f"Matriz original{movies_X_users.shape} decomposta em U{U.shape}, sigma {sigma.shape} e Vt{Vt.shape}.")

Matriz original(610, 9724) decomposta em U(610, 50), sigma (50,) e Vt(50, 9724).


In [318]:
sigma_diag_matrix=np.diag(sigma) #sigma é um array contendo a diagonal
all_user_predicted_ratings = np.dot(np.dot(U, sigma_diag_matrix), Vt)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = movies_X_users.columns, index=movies_X_users.index)
preds_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2.181872,0.393674,0.838186,-0.082365,-0.546279,2.521662,-0.887231,-0.025221,0.196969,1.606758,...,-0.024984,-0.021415,-0.028553,-0.028553,-0.024984,-0.028553,-0.024984,-0.024984,-0.024984,-0.058988
2,0.209809,0.004821,0.030742,0.017252,0.183764,-0.06066,0.083306,0.023797,0.0481,-0.151968,...,0.018895,0.016196,0.021594,0.021594,0.018895,0.021594,0.018895,0.018895,0.018895,0.031966
3,0.013394,0.034726,0.050525,0.0002,-0.005577,0.114673,-0.007461,0.000738,0.004747,-0.061284,...,-0.001612,-0.001382,-0.001843,-0.001843,-0.001612,-0.001843,-0.001612,-0.001612,-0.001612,-0.00053
4,2.012072,-0.394882,-0.290386,0.093864,0.123312,0.259765,0.472667,0.035965,0.011293,-0.021983,...,0.001966,0.001685,0.002247,0.002247,0.001966,0.002247,0.001966,0.001966,0.001966,-0.021462
5,1.336714,0.772954,0.064577,0.11388,0.274994,0.58448,0.251048,0.131534,-0.08631,1.035361,...,-0.004407,-0.003778,-0.005037,-0.005037,-0.004407,-0.005037,-0.004407,-0.004407,-0.004407,-0.006099


In [367]:
def get_high_recommended_movies(userId):
    movies_rated_by_user = movies_X_users.loc[userId]
    movies_high_rated_by_user =  movies_rated_by_user[movies_rated_by_user > 4.5].index
    movies_recommended_for_user = preds_df.loc[userId]
    movies_high_recommend_for_user = movies_recommended_for_user[movies_recommended_for_user > 4].index
    return set(movies_high_recommend_for_user) - set(movies_high_rated_by_user)

In [368]:
user = 1

rec = get_high_recommended_movies(user)
rec_ = pd.DataFrame(index=list(rec), columns=['Título', 'Nota'])
rec_.index.name='movieId'
rec_['Título'] = listar_nomes_filmes(rec)
for id in rec:
    rec_.at[id,'Nota'] = preds_df.loc[user,id]
rec_



  return movies.loc[indices]['title'].values.tolist()


Unnamed: 0_level_0,Título,Nota
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
480,Jurassic Park (1993),4.074629
296,Pulp Fiction (1994),4.854252
1036,Die Hard (1988),4.009045
2028,Saving Private Ryan (1998),5.465609
593,"Silence of the Lambs, The (1991)",4.269116
733,"Rock, The (1996)",4.005126
