# Trabalho de Filtragem Colaborativa

Modelo de filtragem colaborativa usa as informações de ratings dos usuários para prover recomendações. <br>
**Ideia Principal:** A semelhança entre os usuários a partir do que se observa da interseção dos seus ratings permite inferir que os dados não inputados por apenas um destes poderia ter a mesma semelhança com relação ao outro. <br><br>
**Problemas Principais:** 
- A **esparsidade** dos dados, afinal a informação que um usuário provê é normalmente a um subconjunto muito pequeno dos itens. Logo a maioria da base de dados é de dados *faltantes* ou *não observados*. 
- **Cold-Start**: a falta de dados inicial para ter informação relevante seja para entender a personlidade de um usuário ou a preditibilidade de um item<br>
<br>
Há 2 métodos de filtragem colaborativa: <br>

**Memory-Based:** Também chamado de *neighborhood-based collaborative filtering algorithms*. Que se dividem basicamente em *user-based collaborative filtering* e *item-based collaborative filtering*. <br>
**Model-Based:** Modelos baseados em **machine learning** e **data mining** há um processo de aprendizado prévio para parametrizição. Alguns métodos são Decisions Trees, métodos bayesianos, modelos baseados em regras e latent factor method. <br>

## Implementação usando MovieLens
Carregar a base de 25M ou 100K


In [56]:
import pandas as pd                          #DataFrames e operações associadas
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity          #Similaridade
import math  
import scipy.stats
import sklearn.metrics  
from colorama import Fore, Back, Style       #prints coloridos e estilizados
from IPython.display import display

ratings = pd.DataFrame()
movies = pd.DataFrame()

def carregar_base(ratings_filename, movies_filename, n_ratings = 25000000):
    global ratings
    ratings = pd.read_csv(ratings_filename)[:n_ratings] #ler as primeiras n linhas da base completa; default 25.000.000
    global movies
    movies = pd.read_csv(movies_filename).set_index("movieId")
    print(f'Arquivo \'{ratings_filename}\' carregado.')

#carregar_base('ratings.csv', 'movies.csv', n_ratings=150000)
carregar_base('ratings_small.csv', 'movies_small.csv')

#### Funções para auxiliar
def get_filmes_avaliados(usuario_):
    if(type(usuario_)==list):
        return sorted(ratings[ratings['userId'].isin(usuario_)]['movieId'].unique().tolist())
    else:
        return ratings[ratings['userId']==usuario_]['movieId'].values.tolist()

def eliminar_colunas_zeradas(matriz):
    return matriz.loc[:, (matriz != 0).any(axis=0)] #elimina todas as colunas cujos todos os valores são 0

def get_nomes_filmes(indices):
    return movies.loc[indices]['title'].values.tolist()

def get_media_avaliacao(usuario_, decimais=2):
    return round(ratings[ratings['userId'].isin([usuario_])]['rating'].mean(),decimais)

def print_destaque(texto):
    print(Back.BLUE + Fore.LIGHTYELLOW_EX+ f' {texto} ')
    print(Style.RESET_ALL)

Arquivo 'ratings_small.csv' carregado.


In [138]:
#Pequeno relatório dos datasets originais
def print_report(ratings):
    n_ratings = len(ratings)
    global lista_usuarios
    lista_usuarios = sorted(ratings['userId'].unique())
    global lista_filmes_avaliados
    lista_filmes_avaliados = sorted(ratings['movieId'].unique())
    print(f"Total de ratings: {n_ratings}")
    print(f"Total de filmes: {len(movies)}")
    print(f"Filmes avaliados: {len(lista_filmes_avaliados)}")
    print(f"Total de usuários: {len(lista_usuarios)}")
    print(f"Média de ratings/user: {round(n_ratings/len(lista_usuarios), 2)}")
    print(f"Shape de Ratings: {ratings.shape}")
    esparsidade = round(1.0 -n_ratings/float(len(lista_usuarios) * len(lista_filmes_avaliados)),3)
    print(f"O nível de esparsidade do dataset é {esparsidade * 100}%\n")

    print_destaque('Relatório dos ratings dados pelos usuários')
    print(ratings.groupby('userId')['rating'].count().describe())

print_destaque('Relatório da base original usada')
print_report(ratings)
ratings.sample(5).sort_index()


[44m[93m Relatório da base original usada 
[0m
Total de ratings: 100836
Total de filmes: 9742
Filmes avaliados: 9724
Total de usuários: 610
Média de ratings/user: 165.3
Shape de Ratings: (100836, 4)
O nível de esparsidade do dataset é 98.3%

[44m[93m Relatório dos ratings dados pelos usuários 
[0m
count     610.000000
mean      165.304918
std       269.480584
min        20.000000
25%        35.000000
50%        70.500000
75%       168.000000
max      2698.000000
Name: rating, dtype: float64


Unnamed: 0,userId,movieId,rating,timestamp
11850,73,8361,3.0,1464275283
49146,318,2396,3.0,1261342560
49342,318,8360,3.0,1237233422
52892,347,160,3.0,847645839
92580,597,2956,3.0,941730630


In [96]:
class Treino_Teste:
    fatias = []
    k = 0
    def __init__(self, database, k):
        database = database.sample(frac=1) #misturar randomicamente os dados
        self.k = k
        self.fatias = np.array_split(database, k)

    def get_treino_teste(self):
        treino = pd.concat(self.fatias[:-1])
        teste = self.fatias[-1].sort_values(by=['userId','movieId'])
        return treino, teste

    def get_k(self):
        return self.k
    
    def proxima_folha(self):
        self.fatias.append(self.fatias.pop(0))

tt = Treino_Teste(ratings,10)
ratings_treino, ratings_teste = tt.get_treino_teste()
print_destaque('Relatório da base de treino (9/10) da original')
print_report(ratings_treino)

[44m[93m Relatório da base de treino (9/10) da original 
[0m
Total de ratings: 90753
Total de filmes: 9742
Filmes avaliados: 9380
Total de usuários: 610
Média de ratings/user: 148.78
Shape de Ratings: (90753, 4)
O nível de esparsidade do dataset é 98.4%

[44m[93m Relatório dos ratings dados pelos usuários 
[0m
count     610.00000
mean      148.77541
std       243.07404
min        16.00000
25%        32.00000
50%        62.50000
75%       152.00000
max      2448.00000
Name: rating, dtype: float64


## Gerar matriz Movies X User

In [145]:
def gerar_matriz_movies_user(dados):
    global ratings_usuarios
    ratings_usuarios = dados.groupby(['userId', 'movieId'])['rating'].first().unstack(fill_value=0.0)
    ratings_usuarios = pd.DataFrame(data=ratings_usuarios, columns=lista_filmes_avaliados).fillna(0)
    return ratings_usuarios

#### Funções para auxiliar
def listar_filmes_ja_vistos(usuario):
    #filmes_ja_vistos_bin = matriz_filmes_X_usuarios.loc[usuario].gt(0)   #gerar array com o que usuário já deu rating: True ou False
    #return filmes_ja_vistos_bin.index[filmes_ja_vistos_bin].to_list() #com base no anterior, listar filmes que já viu        
    if(type(usuario)==list):
        filmes_usuario = ratings_usuarios.loc[usuario].sum(axis = 0)
    else: #if(type(usuario)==int):
        filmes_usuario = ratings_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario!=0].index.to_list()

def listar_filmes_nao_vistos(usuario):
    if(type(usuario)==list):
        filmes_usuario = ratings_usuarios.loc[usuario].sum(axis = 0)
    else: 
        filmes_usuario = ratings_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario==0].index.to_list()
    
def listar_notas_usuario(userId):
    return ratings_usuarios[ratings_usuarios.index==userId]
   
gerar_matriz_movies_user(ratings_treino)

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## User-Based Collaborative Filtering

### Matriz de Similaridade por Usuário
A medida comumente usada é a similaridade do cosseno.
Essa medida de similaridade deve seu nome ao fato de ser igual ao cosseno do ângulo entre os dois vetores que estão sendo comparados:  vetores de similaridade de usuário (ou item) de ratings. Quanto menor o ângulo entre dois vetores, maior será o cosseno, resultando em um fator de similaridade mais alto. 

Dado 2 vetores, A e B, a similiridade por cosseno, cos($\theta$), é representada pelo produto escalar
$$\text{cosine similarity} =S_C (x,y):= \cos(\theta) = {\mathbf{x} \cdot \mathbf{y} \over \|\mathbf{x}\| \|\mathbf{y}\|} = \frac{ \sum\limits_{i=1}^{n}{x_i  y_i} }{ \sqrt{\sum\limits_{i=1}^{n}{x_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{y_i^2}} }$$

### Pegar os k usuários mais similares ao Target selecionado
<center><img src="img/user-based-similaridades.jpg" ></center>

Como selecionar os mais similares?
- Todos os vizinhos
- Selecionar randomicamente
- Todos acima de um threshold
- **Top-k por simaridade**

Problemas
- Custo computacional
- Mais vizinhos = mais ruído
- Poucos vizinhos = pouca cobertura
- Usar entre 25 e 100

In [146]:
def obter_mais_similares(target, matriz_similaridade, k = 25, min_score = 0):
    similares = matriz_similaridade.loc[target].sort_values(ascending=False).drop(target)
    similares = similares[similares >= min_score]
    return similares.iloc[:k]

def gerar_matriz_similaridade_user(dados):
    global users_cosine
    users_cosine_array = cosine_similarity(dados)
    users_cosine = pd.DataFrame(data=users_cosine_array, index=dados.index, columns=dados.index)
    return users_cosine

users_cosine.round(3).head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.015,0.05,0.181,0.112,0.126,0.131,0.137,0.067,0.01,...,0.08,0.133,0.21,0.066,0.143,0.157,0.229,0.279,0.082,0.133
2,0.015,1.0,0.0,0.004,0.0,0.015,0.012,0.0,0.0,0.05,...,0.181,0.0,0.013,0.0,0.0,0.021,0.0,0.043,0.0,0.093
3,0.05,0.0,1.0,0.003,0.006,0.003,0.0,0.005,0.0,0.0,...,0.006,0.002,0.029,0.0,0.012,0.015,0.022,0.025,0.0,0.026
4,0.181,0.004,0.003,1.0,0.123,0.083,0.082,0.058,0.012,0.025,...,0.074,0.128,0.279,0.046,0.078,0.17,0.104,0.132,0.031,0.097
5,0.112,0.0,0.006,0.123,1.0,0.276,0.118,0.38,0.0,0.011,...,0.073,0.339,0.089,0.202,0.099,0.09,0.139,0.11,0.236,0.041


### Gerar a Recomendação de acordo com a nota dada pelos usuários similares

Podemos usar várias métricas de acordo com os ratings dos vizinhos: mínimo, máximo, média, mediana, média ponderada, agregação supervisionada.
<br>
Usaremos **média ponderada** logo:
1. Para cada filme que se deseja saber a nota: 
2. Para cada usuário similar da lista:
    1. Se nota foi dada: somar nota seguindo a fórmula
$$ notaMédia = {\sum coeficiente * nota \over \sum coeficiente} $$
<center><img src="img/user-based-similaridades2.jpg" ></center>

In [147]:
def predizer_notas_userb_mediap(target, filmes, matriz_similaridade = users_cosine, matriz_dados = ratings_usuarios, k=25, min_threshold=0):  # min_threshold: qtd mínima de notas para ser considerada ao recomendar um filme                                        
    similares = obter_mais_similares(target, matriz_similaridade, 25)

    resultado = pd.DataFrame(columns=filmes, index=['User Pred', '# Notas'] ) 
    resultado.columns.name = 'movieId'
    for filme in filmes:
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for similar in similares.index:
            try: #Tenta achar a nota na matriz de dados, mas pode não existir se o filme só aparece na matriz de teste
                nota = matriz_dados.loc[similar,filme]
            except:
                nota = 0
            if (nota != 0):
                coeficiente = similares[similar]
                numerador += nota * coeficiente
                denominador += coeficiente
                qtd_notas += 1
        if (qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['User Pred',filme] = round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['User Pred',filme] = 0
            resultado.at['# Notas',filme] = qtd_notas
    return resultado

def recomendar_user_based(target, k_similares = 30, qtd_sugestoes = 20, min_threshold=5):
    usuarios_mais_similares = obter_mais_similares(target, users_cosine, k_similares).index
    filmes_vistos_pelo_usuario = listar_filmes_ja_vistos(target) 
    filmes_vistos_pelos_similares = listar_filmes_ja_vistos(usuarios_mais_similares.values.tolist())
    filmes_a_avaliar = list(set(filmes_vistos_pelos_similares)-set(filmes_vistos_pelo_usuario))
    recomendacao = predizer_notas_userb_mediap(target, filmes_a_avaliar, users_cosine, ratings_usuarios, k_similares, min_threshold=5)
    return recomendacao.T.sort_values(by=['User Pred','# Notas',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on=["movieId"])

In [148]:
recomendar_user_based(1)


Unnamed: 0_level_0,User Pred,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1204,4.8,7,Lawrence of Arabia (1962)
4993,4.7,9,"Lord of the Rings: The Fellowship of the Ring,..."
750,4.7,8,Dr. Strangelove or: How I Learned to Stop Worr...
58559,4.7,5,"Dark Knight, The (2008)"
858,4.6,19,"Godfather, The (1972)"
5952,4.6,6,"Lord of the Rings: The Two Towers, The (2002)"
1203,4.6,5,12 Angry Men (1957)
260,4.5,23,Star Wars: Episode IV - A New Hope (1977)
318,4.5,15,"Shawshank Redemption, The (1994)"
1387,4.5,15,Jaws (1975)


#### Nota por *Mean-Centered* (trad: média centralizada?)
$$ notaPredita_u = \mu_u + {\sum coeficiente * (nota-\mu_v) \over \sum coeficiente} $$

In [70]:
def predizer_notas_userb_meanc(target, filmes, matriz_similaridade = users_cosine, matriz_dados = ratings_usuarios, k=25, min_threshold=0): 
                                            # min_threshold: qtd mínima de notas para ser considerada ao recomendar um filme
    similares = obter_mais_similares(target, matriz_similaridade, 25)

    resultado = pd.DataFrame(columns=filmes, index=['User Pred2', '# Notas.'] ) 
    resultado.columns.name = 'movieId'
    media_target = get_media_avaliacao(target)
    for filme in filmes:
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for similar in similares.index:
            nota = matriz_dados.loc[similar,filme]
            if (nota != 0):
                coeficiente = similares[similar]
                numerador += coeficiente * (nota - get_media_avaliacao(similar))
                denominador += coeficiente
                qtd_notas += 1
        if (qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['User Pred2',filme] = media_target + round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['User Pred2',filme] = 0
            resultado.at['# Notas.',filme] = qtd_notas
    return resultado

def recomendar_user_based_2(target, k_similares = 30, qtd_sugestoes = 20, min_threshold=5):
    usuarios_mais_similares = obter_mais_similares(target, users_cosine, k_similares).index
    filmes_vistos_pelo_usuario = listar_filmes_ja_vistos(target) 
    filmes_vistos_pelos_similares = listar_filmes_ja_vistos(usuarios_mais_similares.values.tolist())
    filmes_a_avaliar = list(set(filmes_vistos_pelos_similares)-set(filmes_vistos_pelo_usuario))
    recomendacao = predizer_notas_userb_meanc(target, filmes_a_avaliar, users_cosine, ratings_usuarios, k_similares, min_threshold=5)
    return recomendacao.T.sort_values(by=['User Pred2','# Notas.',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on=["movieId"])

In [71]:
recomendar_user_based_2(1) # Reaprar que a qt de notas avaliadas é a mesma com os mesmos parâmetros mas o cálculo é bem diferente


Unnamed: 0_level_0,User Pred2,# Notas.,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
58559,5.67,5,"Dark Knight, The (2008)"
913,5.57,7,"Maltese Falcon, The (1941)"
1204,5.57,6,Lawrence of Arabia (1962)
953,5.57,5,It's a Wonderful Life (1946)
1653,5.47,12,Gattaca (1997)
7153,5.47,10,"Lord of the Rings: The Return of the King, The..."
750,5.47,9,Dr. Strangelove or: How I Learned to Stop Worr...
1199,5.47,9,Brazil (1985)
1234,5.47,8,"Sting, The (1973)"
4878,5.47,7,Donnie Darko (2001)


## Testar modelo usando kfold e Raiz do Erro Quadrático Médio

### Calcular margem de erro 

Uma medida frequentemente usada na verificação da acurácia de modelos numéricos é o Erro Quadrático Médio (MSE na sigla em Inglês) como descrito, por exemplo, em Wilks (2006).MSE é sempre positivo. MSE = 0 indica simulação perfeita. MSE é definido por:
$$ MSE = \frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2 $$

Em adição ao MSE, a raiz quadrada de MSE, ou Raiz do Erro Quadrático Médio (RMSE em Inglês), é comumente usada para expressar a acurácia dos resultados numéricos com a vantagem de que RMSE apresenta valores do erro nas mesmas dimensões da variável analisada. O RMSE é definido por:
$$ RMSE = \sqrt{\frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2} $$

In [167]:
def calcular_rmse(real, previsao):
    return math.sqrt(sklearn.metrics.mean_squared_error(real, previsao))  

def testar_predicao_user_based(userId, filmes, notas_reais):
    notas_preditas = predizer_notas_userb_mediap(userId, filmes, users_cosine, ratings_usuarios,25,0)
    notas_preditas = notas_preditas.loc["User Pred"].values.tolist()
    return calcular_rmse(notas_reais,notas_preditas)

def get_filmes_teste(usuario):
    return list(ratings_teste[ratings_teste['userId'] == usuario]['movieId'])

def get_notas_teste(usuario):
    return list(ratings_teste[ratings_teste['userId'] == usuario]['rating'])

def testar_user_based(teste = ratings_teste):
    usuarios_teste = list(set(teste.userId))
    estatistica_user_based = pd.DataFrame(index = usuarios_teste)
    for usuario in usuarios_teste:
        filmes_teste =  get_filmes_teste(usuario)
        notas_dadas = get_notas_teste(usuario)
        estatistica_user_based.at[usuario,"rmse_User"]=testar_predicao_user_based(usuario, filmes_teste, notas_dadas)
    estatistica_user_based['rmse_User'].std()
    return estatistica_user_based.describe().T

tt = Treino_Teste(ratings,10)
resultado = []

for i in range(tt.get_k()):
    ratings_treino, ratings_teste = tt.get_treino_teste()
    matriz = gerar_matriz_movies_user(ratings_treino)
    gerar_matriz_similaridade_user(matriz)
    parcial = testar_user_based(ratings_teste)
    print(f'{i+1} processo(s)', end='\r')
    resultado.append(parcial)
    tt.proxima_folha()

resultado = pd.concat(resultado)
display(resultado)

medias = resultado['mean'].mean()
desvios_padrao = resultado['std'].mean()
confianca = 0.99
conf_int = scipy.stats.norm.interval(confianca, loc=medias, scale=desvios_padrao) 
print_destaque('Relatório da final da USER BASED')
print(f"Média das RMSE:    {medias}\nDesvio padrão médio: {desvios_padrao}")
print(f"Intervalo de {confianca} confiança: {conf_int}\n")
print_destaque('Fim da 1ª parte')

10 processo(s)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rmse_User,594.0,1.371313,0.6749,0.0,0.916411,1.275455,1.636125,4.315669
rmse_User,598.0,1.339649,0.704309,0.0,0.858596,1.219866,1.682197,5.0
rmse_User,599.0,1.367841,0.735049,0.0,0.891575,1.229431,1.697906,5.0
rmse_User,598.0,1.312977,0.686705,0.0,0.867355,1.205542,1.618903,5.0
rmse_User,599.0,1.355825,0.677883,0.0,0.851469,1.259705,1.677796,5.0
rmse_User,601.0,1.331811,0.73478,0.0,0.824621,1.205845,1.675746,5.0
rmse_User,598.0,1.352173,0.690444,0.0,0.878347,1.251598,1.663749,4.756574
rmse_User,595.0,1.346318,0.716816,0.0,0.873157,1.222957,1.693426,5.0
rmse_User,598.0,1.331276,0.710928,0.0,0.80006,1.236119,1.61995,5.0
rmse_User,594.0,1.363639,0.710411,0.0,0.866566,1.214118,1.733764,5.0


[44m[93m Relatório da final da USER BASED 
[0m
Média das RMSE:    1.3472821343510157
Desvio padrão médio: 0.7042225668724023
Intervalo de 0.99 confiança: (-0.4666749896193432, 3.1612392583213746)

[44m[93m Fim da 1ª parte 
[0m


---
---

## Item-Based Collaborative Filtering

### Gerar matriz Users X Movies
Transposição da matriz que tinha usuários nas linhas e filmes nas colunas, para filmes nas linhas e usuários nas colunas

In [159]:
ratings_filmes = ratings_usuarios.T
ratings_filmes.head(4)

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,0.0,2.5,0.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,0.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Matriz de Similaridade Item a Item
Similaridade por Cosseno dos filmes entre si

In [160]:
def gerar_matriz_similaridade_item(dados):
    global movies_cosine
    movies_cosine_array = cosine_similarity(dados)
    movies_cosine = pd.DataFrame(data=movies_cosine_array, index=dados.index, columns=dados.index)
    return movies_cosine

gerar_matriz_similaridade_item(ratings_filmes).round(3).head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.0,0.374,0.246,0.042,0.277,0.352,0.247,0.113,0.228,0.353,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.374,1.0,0.202,0.129,0.217,0.283,0.158,0.178,0.022,0.352,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.246,0.202,1.0,0.115,0.34,0.241,0.329,0.286,0.314,0.19,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.042,0.129,0.115,1.0,0.233,0.052,0.262,0.179,0.0,0.085,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.277,0.217,0.34,0.233,1.0,0.306,0.365,0.171,0.321,0.177,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Selecionar um usuário e analisar os filmes que ele não deu nota
Os targets serão os filmes que o usuário não deu nota. É analisado os k filmes mais similares ao que ele não viu, e destes, de acordo com as notas que o usuário deu, é calculado a nota estimada. Isto é feito para todos os filmes.

<center><img src="img/item-based-cosseno-predicao.jpg" style="max-width: 30%"></center>

1. pegar um usuário e os filmes que ele não assistiu
2. pegar um filme que ele não assistiu e selecionar os K mais semelhantes & que o usuário deu nota
3. fazer a média ponderada entre as notas que ele deu pra estes filmes semelhantes para definir a nota nova faltante

In [161]:
def predizer_notas_item_b(usuario, filmes, matriz_similaridade=movies_cosine, matriz_dados = ratings_filmes, k=25, min_threshold=3):
    usuario = 1
    filmes_não_avaliados = listar_filmes_nao_vistos(usuario) #O que essa essa matriz dados???

    #recomendacao = pd.DataFrame(columns=("movieId", "Nota", "Qt de Notas"))
    resultado = pd.DataFrame(columns=filmes, index=['Item Pred', '# Notas'] ) 
    resultado.columns.name = 'movieId'

    set_avaliados = set(get_filmes_avaliados(usuario))

    for filme in filmes:  #Para cada filme que desejamos a predição, obterei aqueles mais similares
        filmes_mais_similares = obter_mais_similares(filme, matriz_similaridade, k, 0.2)
        similares_vistos = list(set(filmes_mais_similares.index) & set_avaliados) #e ver quais que já foram avaliados
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for i in similares_vistos: #para cada filme similar avaliado, calcular a média ponderada
            coeficiente = matriz_similaridade[filme][i]
            nota = ratings_filmes[usuario][i] 
            if (nota != 0):
                numerador += nota * coeficiente
                denominador += coeficiente
                qtd_notas += 1  
        if(qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['Item Pred',filme] = round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['Item Pred',filme] = 0
            resultado.at['# Notas',filme] = qtd_notas
            #recomendacao.loc[len(recomendacao)] = [int(filme), round(numerador/denominador,2), qtd_notas]
    return resultado


filmes_não_avaliados = listar_filmes_nao_vistos(1)
recomendacao = predizer_notas_item_b(1, filmes_não_avaliados[:15000],movies_cosine,ratings_filmes,).T.join(movies[['title']], on=["movieId"]).sort_values(by=['Item Pred','# Notas','title'],ascending=False)
recomendacao.head(20)

Unnamed: 0_level_0,Item Pred,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4011,5.0,7,Snatch (2000)
1035,5.0,5,"Sound of Music, The (1965)"
3256,5.0,5,Patriot Games (1992)
2871,5.0,5,Deliverance (1972)
3082,5.0,4,"World Is Not Enough, The (1999)"
1947,5.0,4,West Side Story (1961)
4262,5.0,4,Scarface (1983)
1125,5.0,4,"Return of the Pink Panther, The (1975)"
3949,5.0,4,Requiem for a Dream (2000)
6807,5.0,4,Monty Python's The Meaning of Life (1983)


In [164]:
def testar_predicao_item_based(userId, filmes, notas_reais):
    notas_preditas = predizer_notas_item_b(userId, filmes, movies_cosine, ratings_filmes,25,0)
    notas_preditas = notas_preditas.loc["Item Pred"].values.tolist()
    return calcular_rmse(notas_reais,notas_preditas)

def testar_item_based(teste = ratings_teste):
    usuarios_teste = list(set(teste.userId))
    estatistica_user_based = pd.DataFrame(index = usuarios_teste)
    for usuario in usuarios_teste:
        filmes_teste =  get_filmes_teste(usuario)
        notas_dadas = get_notas_teste(usuario)
        estatistica_user_based.at[usuario,"rmse_User"]=testar_predicao_item_based(usuario, filmes_teste, notas_dadas)
    estatistica_user_based['rmse_User'].std()
    return estatistica_user_based.describe().T

resultado = []

for i in range(tt.get_k()):
    ratings_treino, ratings_teste = tt.get_treino_teste()
    matriz = gerar_matriz_movies_user(ratings_treino).T
    gerar_matriz_similaridade_item(matriz)
    parcial = testar_item_based(ratings_teste)
    print(f'{i+1} processo(s)', end='\r')
    resultado.append(parcial)
    tt.proxima_folha()

resultado = pd.concat(resultado)
display(resultado)

medias = resultado['mean'].mean()
desvios_padrao = resultado['std'].mean()
confianca = 0.99
conf_int = scipy.stats.norm.interval(confianca, loc=medias, scale=desvios_padrao) 
print_destaque('Relatório da final da ITEM BASED')
print(f"Média das RMSE:    {medias}\nDesvio padrão médio: {desvios_padrao}")
print(f"Intervalo de {confianca} confiança: {conf_int}\n")
print_destaque('Fim da 2ª parte')

10 processo(s)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rmse_User,594.0,2.229973,0.932865,0.0,1.67705,2.30288,2.832396,5.0
rmse_User,598.0,2.257631,0.960296,0.0,1.599023,2.292802,2.913181,5.0
rmse_User,599.0,2.255913,0.991404,0.0,1.616993,2.348977,2.898677,5.0
rmse_User,598.0,2.205408,0.953393,0.0,1.602746,2.27471,2.845977,5.0
rmse_User,599.0,2.248244,0.969008,0.0,1.553037,2.334262,2.89791,5.0
rmse_User,601.0,2.243253,0.95287,0.0,1.675492,2.34338,2.872281,4.630065
rmse_User,598.0,2.297363,0.931041,0.0,1.73704,2.392018,2.889635,5.0
rmse_User,595.0,2.208409,0.981083,0.0,1.584034,2.302655,2.881964,5.0
rmse_User,598.0,2.261041,0.9427,0.0,1.692683,2.339293,2.86245,5.0
rmse_User,594.0,2.240662,0.93263,0.0,1.674596,2.349666,2.888271,5.0


[44m[93m Relatório da final da ITEM BASED 
[0m
Média das RMSE:    2.244789895011093
Desvio padrão médio: 0.9547289785764498
Intervalo de 0.99 confiança: (-0.21442898495343643, 4.704008774975623)

[44m[93m Fim da 2ª parte 
[0m


---
---

# SVD: Fatoração de Matriz
Devido a esparsidade do dataset, os métodos tradicionais de filtragem colaborativa podem não serem adequados a demanda de processamento. Uma forma de tratar é fazendo uso do algoritmo de **Singular Value Decomposition**, SVD.<br>
Neste algoritmo, a matriz é decomposta em  em outras 3 matrizes de menor dimensionalidade.
$$ A = USV^T$$
- A é a matriz original m x n
- U é uma matriz ortogonal m x n (mesmo shape de A)
- S é uma matriz diagonal n x n (valores $\sigma_1 \geqslant \sigma_2 \geqslant ... \geqslant \sigma_n$ => ordenados por importância)
- V é uma matriz ortogonal n x n

https://www.researchgate.net/publication/330136513_Building_a_Movie_Recommendation_System_using_SVD_algorithm

https://heartbeat.comet.ml/recommender-systems-with-python-part-iii-collaborative-filtering-singular-value-decomposition-5b5dcb3f242b

http://alyssaq.github.io/2015/20150426-simple-movie-recommender-using-svd/

https://www.kaggle.com/code/cast42/simple-svd-movie-recommender

https://csiu.github.io/blog/update/2017/04/18/day53.html

Bom PDF para ler sobre fatoração de matrizes:
https://www.maxwell.vrac.puc-rio.br/19273/19273_4.PDF

In [277]:
from scipy.sparse.linalg import svds
from numpy import count_nonzero
U, sigma, Vt = svds(ratings_usuarios.to_numpy(), k = 10) #o que são essas k features

print(f"Matriz original{ratings_usuarios.shape} decomposta em U{U.shape}, sigma {sigma.shape} e Vt{Vt.shape}.")

sigma_diag_matrix=np.diag(sigma) #sigma é um array contendo a diagonal
all_user_predicted_ratings = np.dot(np.dot(U, sigma_diag_matrix), Vt)
matriz_SVD = pd.DataFrame(all_user_predicted_ratings, columns = ratings_usuarios.columns, index=ratings_usuarios.index)

esparsidade_SVD = 1.0 - ( count_nonzero(matriz_SVD) / float(matriz_SVD.size) )
print("Esparsidade: ", esparsidade_SVD,"%")

matriz_SVD.head()

Matriz original(610, 9724) decomposta em U(610, 10), sigma (10,) e Vt(10, 9724).
Esparsidade:  0.0 %


movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2.861726,0.937778,0.975957,-0.01783,0.216177,1.702736,0.105428,0.000205,0.146076,1.997412,...,-0.012587,-0.010789,-0.014386,-0.014386,-0.012587,-0.014386,-0.012587,-0.012587,-0.012587,-0.019309
2,0.192522,-0.008264,-0.026998,0.003019,0.025943,0.086721,-0.056641,0.012829,-0.012171,-0.0788,...,0.005953,0.005103,0.006803,0.006803,0.005953,0.006803,0.005953,0.005953,0.005953,0.013188
3,0.031622,0.016571,0.019504,-0.004169,-0.01513,0.078697,-0.013183,0.000482,0.007625,0.06326,...,0.000239,0.000205,0.000273,0.000273,0.000239,0.000273,0.000239,0.000239,0.000239,-0.001889
4,1.574333,0.233673,0.278655,0.05657,0.187347,0.26725,0.328903,-0.054729,-0.01887,0.078537,...,-0.001781,-0.001527,-0.002036,-0.002036,-0.001781,-0.002036,-0.001781,-0.001781,-0.001781,-0.010068
5,1.277728,0.976974,0.42294,0.126277,0.537362,0.751711,0.630211,0.116558,0.117054,1.154697,...,0.000582,0.000499,0.000665,0.000665,0.000582,0.000665,0.000582,0.000582,0.000582,-6.6e-05


In [278]:
def testar_predicao_SVD(userId, printar_notas=False):
    notas_reais = ratings_usuarios[ratings_usuarios.index==userId]
    notas_reais = eliminar_colunas_zeradas(notas_reais)
    filmes_assistidos = notas_reais.columns
    notas_preditas = matriz_SVD[matriz_SVD.index==userId][filmes_assistidos]
    notas_reais = notas_reais.values.tolist()[0]
    notas_preditas = notas_preditas.values.tolist()[0]
    if(printar_notas):
        print(notas_reais)
        print(notas_preditas)
    return math.sqrt(sklearn.metrics.mean_squared_error(notas_reais, notas_preditas))

In [292]:
estatistica_SVD = pd.DataFrame(index = usuarios)
for i in usuarios:
    estatistica_SVD.at[i,"rmse_SVD"]=testar_predicao_SVD(i)
estatistica_SVD['rmse_SVD'].describe()
#pd.merge(estatistica_user_based.describe(), estatistica_item_based.describe(), left_index=True, right_index=True,)

count    610.000000
mean       2.997130
std        0.628125
min        0.870686
25%        2.547759
50%        3.066764
75%        3.447550
max        4.961767
Name: rmse_SVD, dtype: float64

In [286]:
estatistica_SVD['rmse_SVD'].std()

0.6320116962221388

In [298]:
dados_do_SVD = pd.DataFrame()
for r in range(1,30):
    k = r*5
    U, sigma, Vt = svds(ratings_usuarios.to_numpy(), k ) #o que são essas k features

    sigma_diag_matrix=np.diag(sigma) #sigma é um array contendo a diagonal
    all_user_predicted_ratings = np.dot(np.dot(U, sigma_diag_matrix), Vt)
    matriz_SVD = pd.DataFrame(all_user_predicted_ratings, columns = ratings_usuarios.columns, index=ratings_usuarios.index)

    for i in usuarios:
        estatistica_SVD.at[i,'rmse_SVD']=testar_predicao_SVD(i)
    dp = estatistica_SVD['rmse_SVD'].std()
    print(k, ": ", dp)
    #dados_do_SVD.concat([r,dp])
    #dados_do_SVD.at[i] = estatistica_SVD['rmse_SVD'].std()
    #dados_do_SVD = pd.merge(dados_do_SVD, estatistica_SVD.describe(), left_index=True, right_index=True,)

#dados_do_SVD

5 :  0.5935545846837844
10 :  0.6320116962221389
15 :  0.6549003426587992
20 :  0.677607031737377
25 :  0.6914823343136309
30 :  0.7112759495178654
35 :  0.7304640369118949
40 :  0.7460425279736048
45 :  0.7611835366268828
50 :  0.7769561335205206
55 :  0.7925902729742547
60 :  0.8071133525278414
65 :  0.8194576292017971
70 :  0.8321580093951625
75 :  0.842762385046984
80 :  0.8544159111048617
85 :  0.8649921516739925
90 :  0.8764957939634447
95 :  0.886035617069391
100 :  0.8950161458611524
105 :  0.9015433032305508
110 :  0.9086900873753875
115 :  0.9160600423800445
120 :  0.9225156737447514
125 :  0.9287809004071581
130 :  0.9335728984682884
135 :  0.9389268513165727
140 :  0.9430686702753261
145 :  0.9454801183514994


In [263]:
testar_predicao_SVD(1)

3.070575958820681

In [309]:
def predizer_notas3(userId):
    filmes_não_vistos = listar_filmes_nao_vistos(userId)
    recomendação = matriz_SVD[matriz_SVD.index==userId][filmes_não_vistos]
    recomendação.drop(columns= recomendação[recomendação<0].columns)
    return recomendação

predizer_notas3(1)

movieId,2,4,5,7,8,9,10,11,12,13,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,-0.472394,0.089944,0.23467,0.298102,-0.036421,-0.209432,0.020422,0.471189,-0.043272,-0.153334,...,0.032753,0.028074,0.037432,0.037432,0.032753,0.037432,0.032753,0.032753,0.032753,-0.00861


In [225]:
user = 1

rec_ = pd.DataFrame(index=list(rec), columns=['Título', 'Nota'])
rec_.index.name='movieId'
rec_['Título'] = get\_nomes_filmes(rec)
for id in rec:
    rec_.at[id,'Nota'] = matriz_SVD.loc[user,id]
rec_



Unnamed: 0_level_0,Título,Nota
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
480,Jurassic Park (1993),4.074629
296,Pulp Fiction (1994),4.854252
1036,Die Hard (1988),4.009045
2028,Saving Private Ryan (1998),5.465609
593,"Silence of the Lambs, The (1991)",4.269116
733,"Rock, The (1996)",4.005126


In [223]:
recomendacao2

Unnamed: 0_level_0,Item Pred,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4011,5.0,9,Snatch (2000)
1234,5.0,7,"Sting, The (1973)"
4262,5.0,6,Scarface (1983)
4226,5.0,6,Memento (2000)
7153,5.0,6,"Lord of the Rings: The Return of the King, The..."
...,...,...,...
2160,3.5,5,Rosemary's Baby (1968)
1717,3.5,3,Scream 2 (1997)
112,3.4,5,Rumble in the Bronx (Hont faan kui) (1995)
2541,3.1,4,Cruel Intentions (1999)


https://colab.research.google.com/drive/1GYSJNXK6lRl8kb2FvtMPLqW8EuZLcJjN?usp=sharing#scrollTo=NwjVPkZPctFU

# Copiar as explicações e gráficos