# Trabalho de Filtragem Colaborativa

Modelo de filtragem colaborativa usa as informações de ratings dos usuários para prover recomendações. <br>
**Ideia Principal:** A semelhança entre os usuários a partir do que se observa da interseção dos seus ratings permite inferir que os dados não inputados por apenas um destes poderia ter a mesma semelhança com relação ao outro. <br><br>
**Problemas Principais:** 
- A **esparsidade** dos dados, afinal a informação que um usuário provê é normalmente a um subconjunto muito pequeno dos itens. Logo a maioria da base de dados é de dados *faltantes* ou *não observados*. 
- **Cold-Start**: a falta de dados inicial para ter informação relevante seja para entender a personlidade de um usuário ou a preditibilidade de um item<br>
<br>
Há 2 métodos de filtragem colaborativa: <br>

**Memory-Based:** Também chamado de *neighborhood-based collaborative filtering algorithms*. Que se dividem basicamente em *user-based collaborative filtering* e *item-based collaborative filtering*. <br>
**Model-Based:** Modelos baseados em **machine learning** e **data mining** há um processo de aprendizado prévio para parametrizição. Alguns métodos são Decisions Trees, métodos bayesianos, modelos baseados em regras e latent factor method. <br>

## Implementação usando MovieLens
Carregar a base de 25M ou 100K


In [1]:
import pandas as pd                          #DataFrames e operações associadas
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity          #Similaridade
import math  
import scipy.stats
import sklearn.metrics  
from colorama import Fore, Back, Style       #prints coloridos e estilizados
from IPython.display import display
import time

ratings = pd.DataFrame()
movies = pd.DataFrame()

def carregar_base(ratings_filename, movies_filename, n_ratings = 25000000):
    global ratings
    ratings = pd.read_csv(ratings_filename)[:n_ratings] #ler as primeiras n linhas da base completa; default 25.000.000
    global movies
    movies = pd.read_csv(movies_filename).set_index("movieId")
    print(f'Arquivo \'{ratings_filename}\' carregado.')

carregar_base('ratings.csv', 'movies.csv', n_ratings=150000)
#carregar_base('ratings_small.csv', 'movies_small.csv')

#### Funções para auxiliar
def get_filmes_avaliados(usuario_):
    if(type(usuario_)==list):
        return sorted(ratings[ratings['userId'].isin(usuario_)]['movieId'].unique().tolist())
    else:
        return ratings[ratings['userId']==usuario_]['movieId'].values.tolist()

def eliminar_colunas_zeradas(matriz):
    return matriz.loc[:, (matriz != 0).any(axis=0)] #elimina todas as colunas cujos todos os valores são 0

def get_nomes_filmes(indices):
    return movies.loc[indices]['title'].values.tolist()

def get_media_avaliacao(usuario_, decimais=2):
    return round(ratings[ratings['userId'].isin([usuario_])]['rating'].mean(),decimais)

def print_destaque(texto):
    print(Back.BLUE + Fore.LIGHTYELLOW_EX+ f' {texto} ')
    print(Style.RESET_ALL)

Arquivo 'ratings.csv' carregado.


In [2]:
#Pequeno relatório dos datasets originais
def print_report(ratings):
    n_ratings = len(ratings)
    global lista_usuarios
    lista_usuarios = sorted(ratings['userId'].unique())
    global lista_filmes_avaliados
    lista_filmes_avaliados = sorted(ratings['movieId'].unique())
    print(f"Total de ratings: {n_ratings}")
    print(f"Total de filmes: {len(movies)}")
    print(f"Filmes avaliados: {len(lista_filmes_avaliados)}")
    print(f"Total de usuários: {len(lista_usuarios)}")
    print(f"Média de ratings/user: {round(n_ratings/len(lista_usuarios), 2)}")
    print(f"Shape de Ratings: {ratings.shape}")
    esparsidade = round(1.0 -n_ratings/float(len(lista_usuarios) * len(lista_filmes_avaliados)),3)
    print(f"O nível de esparsidade do dataset é {esparsidade * 100}%\n")

    print_destaque('Relatório dos ratings dados pelos usuários')
    print(ratings.groupby('userId')['rating'].count().describe())

print_destaque('Relatório da base original usada')
print_report(ratings)
ratings.sample(5).sort_index()


[44m[93m Relatório da base original usada 
[0m
Total de ratings: 150000
Total de filmes: 62423
Filmes avaliados: 11988
Total de usuários: 1053
Média de ratings/user: 142.45
Shape de Ratings: (150000, 4)
O nível de esparsidade do dataset é 98.8%

[44m[93m Relatório dos ratings dados pelos usuários 
[0m
count    1053.000000
mean      142.450142
std       233.761717
min        20.000000
25%        36.000000
50%        66.000000
75%       155.000000
max      3212.000000
Name: rating, dtype: float64


Unnamed: 0,userId,movieId,rating,timestamp
5284,38,593,3.5,1417643613
10661,78,7451,5.0,1498275507
11678,84,69604,4.0,1481830241
31723,240,1200,5.0,962963784
70025,548,53326,3.5,1431644883


In [3]:
class Treino_Teste:
    fatias = []
    k = 0
    def __init__(self, database, k):
        database = database.sample(frac=1) #misturar randomicamente os dados
        self.k = k
        self.fatias = np.array_split(database, k)

    def get_treino_teste(self):
        treino = pd.concat(self.fatias[:-1])
        teste = self.fatias[-1].sort_values(by=['userId','movieId'])
        return treino, teste

    def get_k(self):
        return self.k
    
    def proxima_folha(self):
        self.fatias.append(self.fatias.pop(0))

tt = Treino_Teste(ratings,10)
ratings_treino, ratings_teste = tt.get_treino_teste()
print_destaque('Relatório da base de treino (9/10) da original')
print_report(ratings_treino)

[44m[93m Relatório da base de treino (9/10) da original 
[0m
Total de ratings: 135000
Total de filmes: 62423
Filmes avaliados: 11496
Total de usuários: 1053
Média de ratings/user: 128.21
Shape de Ratings: (135000, 4)
O nível de esparsidade do dataset é 98.9%

[44m[93m Relatório dos ratings dados pelos usuários 
[0m
count    1053.000000
mean      128.205128
std       210.217852
min        16.000000
25%        32.000000
50%        59.000000
75%       140.000000
max      2883.000000
Name: rating, dtype: float64


## Gerar matriz Movies X User

In [4]:
def gerar_matriz_movies_user(dados):
    global ratings_usuarios
    ratings_usuarios = dados.groupby(['userId', 'movieId'])['rating'].first().unstack(fill_value=0.0)
    ratings_usuarios = pd.DataFrame(data=ratings_usuarios, columns=lista_filmes_avaliados).fillna(0)
    return ratings_usuarios

#### Funções para auxiliar
def listar_filmes_ja_vistos(usuario):
    #filmes_ja_vistos_bin = matriz_filmes_X_usuarios.loc[usuario].gt(0)   #gerar array com o que usuário já deu rating: True ou False
    #return filmes_ja_vistos_bin.index[filmes_ja_vistos_bin].to_list() #com base no anterior, listar filmes que já viu        
    if(type(usuario)==list):
        filmes_usuario = ratings_usuarios.loc[usuario].sum(axis = 0)
    else: #if(type(usuario)==int):
        filmes_usuario = ratings_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario!=0].index.to_list()

def listar_filmes_nao_vistos(usuario):
    if(type(usuario)==list):
        filmes_usuario = ratings_usuarios.loc[usuario].sum(axis = 0)
    else: 
        filmes_usuario = ratings_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario==0].index.to_list()
    
def listar_notas_usuario(userId):
    return ratings_usuarios[ratings_usuarios.index==userId]
   
gerar_matriz_movies_user(ratings_treino)

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,205106,205156,205413,205499,205557,206272,206499,206523,207309,208002
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1049,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1050,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1051,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1052,3.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## User-Based Collaborative Filtering

### Matriz de Similaridade por Usuário
A medida comumente usada é a similaridade do cosseno.
Essa medida de similaridade deve seu nome ao fato de ser igual ao cosseno do ângulo entre os dois vetores que estão sendo comparados:  vetores de similaridade de usuário (ou item) de ratings. Quanto menor o ângulo entre dois vetores, maior será o cosseno, resultando em um fator de similaridade mais alto. 

Dado 2 vetores, A e B, a similiridade por cosseno, cos($\theta$), é representada pelo produto escalar
$$\text{cosine similarity} =S_C (x,y):= \cos(\theta) = {\mathbf{x} \cdot \mathbf{y} \over \|\mathbf{x}\| \|\mathbf{y}\|} = \frac{ \sum\limits_{i=1}^{n}{x_i  y_i} }{ \sqrt{\sum\limits_{i=1}^{n}{x_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{y_i^2}} }$$

### Pegar os k usuários mais similares ao Target selecionado
<center><img src="img/user-based-similaridades.jpg" ></center>

Como selecionar os mais similares?
- Todos os vizinhos
- Selecionar randomicamente
- Todos acima de um threshold
- **Top-k por simaridade**

Problemas
- Custo computacional
- Mais vizinhos = mais ruído
- Poucos vizinhos = pouca cobertura
- Usar entre 25 e 100

In [37]:
def obter_mais_similares(target, matriz_similaridade, k = 25, min_score = 0):
    try:
        similares = matriz_similaridade.loc[target].sort_values(ascending=False).drop(target)
        similares = similares[similares >= min_score]
        return similares.iloc[:k]
    except:
        return []

def gerar_matriz_similaridade_user(dados):
    global users_cosine
    users_cosine_array = cosine_similarity(dados)
    users_cosine = pd.DataFrame(data=users_cosine_array, index=dados.index, columns=dados.index)
    return users_cosine

gerar_matriz_similaridade_user(ratings_usuarios).round(3).head()

userId,1,2,3,4,5,6,7,8,9,10,...,1044,1045,1046,1047,1048,1049,1050,1051,1052,1053
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.047,0.064,0.04,0.018,0.0,0.068,0.025,0.0,0.029,...,0.004,0.0,0.0,0.1,0.105,0.018,0.036,0.08,0.02,0.023
2,0.047,1.0,0.162,0.174,0.14,0.057,0.071,0.164,0.118,0.151,...,0.181,0.048,0.074,0.051,0.225,0.145,0.017,0.131,0.208,0.143
3,0.064,0.162,1.0,0.31,0.053,0.086,0.022,0.068,0.055,0.118,...,0.087,0.047,0.044,0.023,0.311,0.105,0.018,0.111,0.114,0.101
4,0.04,0.174,0.31,1.0,0.063,0.071,0.017,0.092,0.052,0.089,...,0.124,0.026,0.033,0.048,0.186,0.099,0.018,0.023,0.074,0.083
5,0.018,0.14,0.053,0.063,1.0,0.101,0.193,0.29,0.212,0.269,...,0.192,0.018,0.275,0.077,0.123,0.119,0.206,0.09,0.187,0.212


### Gerar a Recomendação de acordo com a nota dada pelos usuários similares

Podemos usar várias métricas de acordo com os ratings dos vizinhos: mínimo, máximo, média, mediana, média ponderada, agregação supervisionada.
<br>
Usaremos **média ponderada** logo:
1. Para cada filme que se deseja saber a nota: 
2. Para cada usuário similar da lista:
    1. Se nota foi dada: somar nota seguindo a fórmula
$$ notaMédia = {\sum coeficiente * nota \over \sum coeficiente} $$
<center><img src="img/user-based-similaridades2.jpg" ></center>

In [8]:
def predizer_notas_userb_mediap(target, filmes, matriz_similaridade = users_cosine, matriz_dados = ratings_usuarios, k=25, min_threshold=0):  # min_threshold: qtd mínima de notas para ser considerada ao recomendar um filme                                        
    similares = obter_mais_similares(target, matriz_similaridade, 25)

    resultado = pd.DataFrame(columns=filmes, index=['User Pred', '# Notas'] ) 
    resultado.columns.name = 'movieId'
    for filme in filmes:
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for similar in similares.index:
            try: #Tenta achar a nota na matriz de dados, mas pode não existir se o filme só aparece na matriz de teste
                nota = matriz_dados.loc[similar,filme]
            except:
                nota = 0
            if (nota != 0):
                coeficiente = similares[similar]
                numerador += nota * coeficiente
                denominador += coeficiente
                qtd_notas += 1
        if (qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['User Pred',filme] = round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['User Pred',filme] = 0
            resultado.at['# Notas',filme] = qtd_notas
    return resultado

def recomendar_user_based(target, k_similares = 30, qtd_sugestoes = 20, min_threshold=5):
    usuarios_mais_similares = obter_mais_similares(target, users_cosine, k_similares).index
    filmes_vistos_pelo_usuario = listar_filmes_ja_vistos(target) 
    filmes_vistos_pelos_similares = listar_filmes_ja_vistos(usuarios_mais_similares.values.tolist())
    filmes_a_avaliar = list(set(filmes_vistos_pelos_similares)-set(filmes_vistos_pelo_usuario))
    recomendacao = predizer_notas_userb_mediap(target, filmes_a_avaliar, users_cosine, ratings_usuarios, k_similares, min_threshold=5)
    return recomendacao.T.sort_values(by=['User Pred','# Notas',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on=["movieId"])

In [9]:
recomendar_user_based(1)


Unnamed: 0_level_0,User Pred,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
318,4.7,16,"Shawshank Redemption, The (1994)"
1207,4.7,5,To Kill a Mockingbird (1962)
27773,4.7,5,Old Boy (2003)
1221,4.6,10,"Godfather: Part II, The (1974)"
1204,4.6,8,Lawrence of Arabia (1962)
858,4.5,13,"Godfather, The (1972)"
750,4.5,12,Dr. Strangelove or: How I Learned to Stop Worr...
1206,4.5,11,"Clockwork Orange, A (1971)"
1208,4.5,9,Apocalypse Now (1979)
215,4.5,5,Before Sunrise (1995)


#### Nota por *Mean-Centered* (trad: média centralizada?)
$$ notaPredita_u = \mu_u + {\sum coeficiente * (nota-\mu_v) \over \sum coeficiente} $$

In [10]:
def predizer_notas_userb_meanc(target, filmes, matriz_similaridade = users_cosine, matriz_dados = ratings_usuarios, k=25, min_threshold=0): 
                                            # min_threshold: qtd mínima de notas para ser considerada ao recomendar um filme
    similares = obter_mais_similares(target, matriz_similaridade, 25)

    resultado = pd.DataFrame(columns=filmes, index=['User Pred2', '# Notas.'] ) 
    resultado.columns.name = 'movieId'
    media_target = get_media_avaliacao(target)
    for filme in filmes:
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for similar in similares.index:
            nota = matriz_dados.loc[similar,filme]
            if (nota != 0):
                coeficiente = similares[similar]
                numerador += coeficiente * (nota - get_media_avaliacao(similar))
                denominador += coeficiente
                qtd_notas += 1
        if (qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['User Pred2',filme] = media_target + round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['User Pred2',filme] = 0
            resultado.at['# Notas.',filme] = qtd_notas
    return resultado.fillna(0)

def recomendar_user_based_2(target, k_similares = 30, qtd_sugestoes = 20, min_threshold=5):
    usuarios_mais_similares = obter_mais_similares(target, users_cosine, k_similares).index
    filmes_vistos_pelo_usuario = listar_filmes_ja_vistos(target) 
    filmes_vistos_pelos_similares = listar_filmes_ja_vistos(usuarios_mais_similares.values.tolist())
    filmes_a_avaliar = list(set(filmes_vistos_pelos_similares)-set(filmes_vistos_pelo_usuario))
    recomendacao = predizer_notas_userb_meanc(target, filmes_a_avaliar, users_cosine, ratings_usuarios, k_similares, min_threshold=5)
    return recomendacao.T.sort_values(by=['User Pred2','# Notas.',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on=["movieId"])

In [11]:
recomendar_user_based_2(1) # Reaprar que a qt de notas avaliadas é a mesma com os mesmos parâmetros mas o cálculo é bem diferente


Unnamed: 0_level_0,User Pred2,# Notas.,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1207,4.91,5,To Kill a Mockingbird (1962)
318,4.81,16,"Shawshank Redemption, The (1994)"
858,4.71,13,"Godfather, The (1972)"
1732,4.71,12,"Big Lebowski, The (1998)"
1206,4.71,11,"Clockwork Orange, A (1971)"
1221,4.71,10,"Godfather: Part II, The (1974)"
1204,4.71,8,Lawrence of Arabia (1962)
4848,4.71,8,Mulholland Drive (2001)
1035,4.71,6,"Sound of Music, The (1965)"
5971,4.71,6,My Neighbor Totoro (Tonari no Totoro) (1988)


## Testar modelo usando kfold e Raiz do Erro Quadrático Médio

### Calcular margem de erro 

Uma medida frequentemente usada na verificação da acurácia de modelos numéricos é o Erro Quadrático Médio (MSE na sigla em Inglês) como descrito, por exemplo, em Wilks (2006).MSE é sempre positivo. MSE = 0 indica simulação perfeita. MSE é definido por:
$$ MSE = \frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2 $$

Em adição ao MSE, a raiz quadrada de MSE, ou Raiz do Erro Quadrático Médio (RMSE em Inglês), é comumente usada para expressar a acurácia dos resultados numéricos com a vantagem de que RMSE apresenta valores do erro nas mesmas dimensões da variável analisada. O RMSE é definido por:
$$ RMSE = \sqrt{\frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2} $$

In [16]:
def calcular_rmse(real, previsao):
    return math.sqrt(sklearn.metrics.mean_squared_error(real, previsao))  

def testar_predicao_user_based(userId, filmes, notas_reais):
    notas_preditas = predizer_notas_userb_mediap(userId, filmes, users_cosine, ratings_usuarios,25,0)
    notas_preditas = notas_preditas.loc["User Pred"].values.tolist()
    return calcular_rmse(notas_reais,notas_preditas)

def get_filmes_teste(usuario):
    return list(ratings_teste[ratings_teste['userId'] == usuario]['movieId'])

def get_notas_teste(usuario):
    return list(ratings_teste[ratings_teste['userId'] == usuario]['rating'])

def testar_user_based(teste = ratings_teste):
    usuarios_teste = list(set(teste.userId))
    estatistica_user_based = pd.DataFrame(index = usuarios_teste)
    for usuario in usuarios_teste:
        filmes_teste =  get_filmes_teste(usuario)
        notas_dadas = get_notas_teste(usuario)
        estatistica_user_based.at[usuario,"rmse_User"]=testar_predicao_user_based(usuario, filmes_teste, notas_dadas)
    estatistica_user_based['rmse_User'].std()
    return estatistica_user_based.describe().T

tt = Treino_Teste(ratings,10)
resultado = []

for i in range(tt.get_k()):
    inicio = time.time()
    ratings_treino, ratings_teste = tt.get_treino_teste()
    matriz = gerar_matriz_movies_user(ratings_treino)
    gerar_matriz_similaridade_user(matriz)
    parcial = testar_user_based(ratings_teste)
    resultado.append(parcial)
    fim = time.time()
    print(f'{i+1}º processo user-based gastou {round((fim - inicio),4)} segs', end='\r')
    tt.proxima_folha()

resultado = pd.concat(resultado)
display(resultado)

medias_u = resultado['mean'].mean()
desvios_padrao_u = resultado['std'].mean()
confianca_u = 0.99
conf_int_u = scipy.stats.norm.interval(confianca_u, loc=medias_u, scale=desvios_padrao_u) 
print_destaque('Relatório da final da USER BASED')
print(f"Média das RMSE:    {medias_u}\nDesvio padrão médio: {desvios_padrao_u}")
print(f"Intervalo de {confianca_u} confiança: {conf_int_u}\n")
print_destaque('Fim da 1ª parte')

10º processo user-based gastou 8.0858 segs

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rmse_User,1039.0,1.386561,0.742252,0.0,0.880341,1.255388,1.746616,5.0
rmse_User,1031.0,1.421068,0.7272,0.0,0.923889,1.275408,1.828649,5.0
rmse_User,1033.0,1.400918,0.765805,0.0,0.8652,1.279844,1.768615,5.0
rmse_User,1023.0,1.440847,0.770514,0.0,0.919692,1.312202,1.795126,5.0
rmse_User,1038.0,1.371692,0.735688,0.0,0.852205,1.257995,1.729251,5.0
rmse_User,1032.0,1.420485,0.742245,0.0,0.916174,1.299769,1.783641,5.0
rmse_User,1033.0,1.410386,0.715144,0.0,0.916935,1.310777,1.784549,5.0
rmse_User,1032.0,1.384676,0.729192,0.0,0.883825,1.270374,1.743305,5.0
rmse_User,1029.0,1.418289,0.740397,0.0,0.9,1.299231,1.748857,5.0
rmse_User,1030.0,1.37242,0.72745,0.0,0.855082,1.266816,1.772493,5.0


[44m[93m Relatório da final da USER BASED 
[0m
Média das RMSE:    1.4027344558933068
Desvio padrão médio: 0.7395884375762437
Intervalo de 0.99 confiança: (-0.5023191141815284, 3.307788025968142)

[44m[93m Fim da 1ª parte 
[0m


---
---

## Item-Based Collaborative Filtering

### Gerar matriz Users X Movies
Transposição da matriz que tinha usuários nas linhas e filmes nas colunas, para filmes nas linhas e usuários nas colunas

In [17]:
ratings_filmes = ratings_usuarios.T
ratings_filmes.head(4)

userId,1,2,3,4,5,6,7,8,9,10,...,1044,1045,1046,1047,1048,1049,1050,1051,1052,1053
1,0.0,3.5,0.0,3.0,4.0,0.0,0.0,4.0,0.0,3.5,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0


### Matriz de Similaridade Item a Item
Similaridade por Cosseno dos filmes entre si

In [18]:
def gerar_matriz_similaridade_item(dados):
    global movies_cosine
    movies_cosine_array = cosine_similarity(dados)
    movies_cosine = pd.DataFrame(data=movies_cosine_array, index=dados.index, columns=dados.index)
    return movies_cosine

gerar_matriz_similaridade_item(ratings_filmes).round(3).head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,205106,205156,205413,205499,205557,206272,206499,206523,207309,208002
1,1.0,0.27,0.285,0.13,0.252,0.32,0.314,0.048,0.168,0.283,...,0.0,0.056,0.07,0.07,0.07,0.049,0.031,0.0,0.0,0.0
2,0.27,1.0,0.189,0.051,0.133,0.206,0.14,0.108,0.179,0.321,...,0.0,0.0,0.0,0.0,0.0,0.068,0.0,0.0,0.0,0.0
3,0.285,0.189,1.0,0.194,0.388,0.249,0.339,0.0,0.284,0.165,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.13,0.051,0.194,1.0,0.168,0.107,0.091,0.0,0.257,0.06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.252,0.133,0.388,0.168,1.0,0.232,0.322,0.0,0.133,0.139,...,0.0,0.0,0.136,0.136,0.136,0.0,0.0,0.0,0.0,0.0


### Selecionar um usuário e analisar os filmes que ele não deu nota
Os targets serão os filmes que o usuário não deu nota. É analisado os k filmes mais similares ao que ele não viu, e destes, de acordo com as notas que o usuário deu, é calculado a nota estimada. Isto é feito para todos os filmes.

<center><img src="img/item-based-cosseno-predicao.jpg" style="max-width: 30%"></center>

1. pegar um usuário e os filmes que ele não assistiu
2. pegar um filme que ele não assistiu e selecionar os K mais semelhantes & que o usuário deu nota
3. fazer a média ponderada entre as notas que ele deu pra estes filmes semelhantes para definir a nota nova faltante

In [43]:
def predizer_notas_item_b(usuario, filmes, matriz_similaridade=movies_cosine, matriz_dados = ratings_filmes, k=25, min_threshold=3):
    usuario = 1
    filmes_não_avaliados = listar_filmes_nao_vistos(usuario) #O que essa essa matriz dados???

    #recomendacao = pd.DataFrame(columns=("movieId", "Nota", "Qt de Notas"))
    resultado = pd.DataFrame(columns=filmes, index=['Item Pred', '# Notas'] ) 
    resultado.columns.name = 'movieId'

    set_avaliados = set(get_filmes_avaliados(usuario))

    for filme in filmes:  #Para cada filme que desejamos a predição, obterei aqueles mais similares
        filmes_mais_similares = obter_mais_similares(filme, matriz_similaridade, k, 0.2)
        if(len(filmes_mais_similares)==0):
            continue
        similares_vistos = list(set(filmes_mais_similares.index) & set_avaliados) #e ver quais que já foram avaliados
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for i in similares_vistos: #para cada filme similar avaliado, calcular a média ponderada
            coeficiente = matriz_similaridade[filme][i]
            nota = ratings_filmes[usuario][i] 
            if (nota != 0):
                numerador += nota * coeficiente
                denominador += coeficiente
                qtd_notas += 1  
        if(qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            try:
                resultado.at['Item Pred',filme] = round(numerador/denominador,1)
            except:
                #print(f"filme:{filme}, numerador:{numerador}, denominador: {denominador}, qt. notas: {qtd_notas}.")
                resultado.at['Item Pred',filme] = 0
            resultado.at['# Notas',filme] = qtd_notas
            #recomendacao.loc[len(recomendacao)] = [int(filme), round(numerador/denominador,2), qtd_notas]
    return resultado.fillna(0)


filmes_não_avaliados = listar_filmes_nao_vistos(1)
recomendacao = predizer_notas_item_b(1, filmes_não_avaliados[:15000],movies_cosine,ratings_filmes,).T.join(movies[['title']], on=["movieId"]).sort_values(by=['Item Pred','# Notas','title'],ascending=False)
recomendacao.head(20)

Unnamed: 0_level_0,Item Pred,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8464,5.0,3.0,Super Size Me (2004)
2329,5.0,3.0,American History X (1998)
5995,4.9,5.0,"Pianist, The (2002)"
4878,4.9,5.0,Donnie Darko (2001)
4027,4.9,4.0,"O Brother, Where Art Thou? (2000)"
8949,4.9,3.0,Sideways (2004)
46578,4.9,3.0,Little Miss Sunshine (2006)
27773,4.8,3.0,Old Boy (2003)
3481,4.8,3.0,High Fidelity (2000)
5669,4.7,5.0,Bowling for Columbine (2002)


In [44]:
def testar_predicao_item_based(userId, filmes, notas_reais):
    notas_preditas = predizer_notas_item_b(userId, filmes, movies_cosine, ratings_filmes,25,0)
    notas_preditas = notas_preditas.loc["Item Pred"].values.tolist()
    return calcular_rmse(notas_reais,notas_preditas)

def testar_item_based(teste = ratings_teste):
    usuarios_teste = list(set(teste.userId))
    estatistica_item_based = pd.DataFrame(index = usuarios_teste)
    for usuario in usuarios_teste:
        filmes_teste =  get_filmes_teste(usuario)
        notas_dadas = get_notas_teste(usuario)
        estatistica_item_based.at[usuario,"rmse_Item"]=testar_predicao_item_based(usuario, filmes_teste, notas_dadas)
    estatistica_item_based['rmse_Item'].std()
    return estatistica_item_based.describe().T

resultado = []

for i in range(tt.get_k()):    
    inicio = time.time()
    matriz = gerar_matriz_movies_user(ratings_treino).T
    gerar_matriz_similaridade_item(matriz)
    parcial = testar_item_based(ratings_teste)
    resultado.append(parcial)
    fim = time.time()
    print(f'{i+1}º processo item-based gastou {round((fim - inicio),3)} segs', end='\r')
    tt.proxima_folha()

resultado = pd.concat(resultado)
display(resultado)

medias_i = resultado['mean'].mean()
desvios_padrao_i = resultado['std'].mean()
confianca_i = 0.99
conf_int_i = scipy.stats.norm.interval(confianca_i, loc=medias_i, scale=desvios_padrao_i) 
print_destaque('Relatório da final da ITEM BASED')
print(f"Média das RMSE:    {medias_i}\nDesvio padrão médio: {desvios_padrao_i}")
print(f"Intervalo de {confianca_i} confiança: {conf_int_i}\n")
print_destaque('Fim da 2ª parte')

10º processo item-based gastou 25.562 segs

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0
rmse_Item,1030.0,2.958678,0.852885,0.0,2.551546,3.036033,3.48052,5.0


[44m[93m Relatório da final da ITEM BASED 
[0m
Média das RMSE:    2.9586782480083818
Desvio padrão médio: 0.8528851516469436
Intervalo de 0.99 confiança: (0.761791681834437, 5.155564814182327)

[44m[93m Fim da 2ª parte 
[0m


---
---

# SVD: Fatoração de Matriz
Devido a esparsidade do dataset, os métodos tradicionais de filtragem colaborativa podem não serem adequados a demanda de processamento. Uma forma de tratar é fazendo uso do algoritmo de **Singular Value Decomposition**, SVD.<br>
Neste algoritmo, a matriz é decomposta em  em outras 3 matrizes de menor dimensionalidade.
$$ A = USV^T$$
- A é a matriz original m x n
- U é uma matriz ortogonal m x n (mesmo shape de A)
- S é uma matriz diagonal n x n (valores $\sigma_1 \geqslant \sigma_2 \geqslant ... \geqslant \sigma_n$ => ordenados por importância)
- V é uma matriz ortogonal n x n

In [45]:
from scipy.sparse.linalg import svds
from numpy import count_nonzero
U, sigma, Vt = svds(ratings_usuarios.to_numpy(), k = 10) 

print(f"Matriz original{ratings_usuarios.shape} decomposta em U{U.shape}, sigma {sigma.shape} e Vt{Vt.shape}.")

sigma_diag_matrix=np.diag(sigma) #sigma é um array contendo a diagonal
all_user_predicted_ratings = np.dot(np.dot(U, sigma_diag_matrix), Vt)
matriz_SVD = pd.DataFrame(all_user_predicted_ratings, columns = ratings_usuarios.columns, index=ratings_usuarios.index)

esparsidade_SVD = 1.0 - ( count_nonzero(matriz_SVD) / float(matriz_SVD.size) )
print("Esparsidade: ", esparsidade_SVD,"%")

matriz_SVD.head()

Matriz original(1053, 11496) decomposta em U(1053, 10), sigma (10,) e Vt(10, 11496).
Esparsidade:  0.03740431454418924 %


Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,205106,205156,205413,205499,205557,206272,206499,206523,207309,208002
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.07437,0.011611,-0.082976,-0.01989,-0.050221,0.063823,-0.046668,0.005242,-0.044586,0.064718,...,-0.002857,0.054214,-0.004031,-0.005375,-0.006047,-0.000839,0.026551,0.002967,-0.000649,-0.000649
2,2.335915,0.84719,0.459541,-0.003254,0.336054,0.304765,0.572595,0.061998,-0.023433,0.744642,...,-0.004418,0.030318,0.026269,0.035025,0.039403,-0.002004,0.00731,0.002631,-0.009158,-0.009158
3,1.398219,1.290431,-0.432665,-0.140748,-0.377726,0.793672,-0.843311,-0.015128,0.078108,0.684256,...,0.027305,-0.008195,0.03532,0.047094,0.052981,0.039013,0.029256,0.047231,0.038913,0.038913
4,1.874945,0.320528,-0.044137,-0.021046,-0.014767,-0.153808,-0.092385,-0.040386,0.033143,0.054434,...,0.015399,0.037186,0.047721,0.063628,0.071581,0.027479,0.040265,0.023666,0.025284,0.025284
5,1.984519,0.560427,0.596705,0.136933,0.580859,1.019603,0.646601,0.0007,0.215776,1.257375,...,0.000784,0.000686,-0.003021,-0.004028,-0.004532,0.001135,0.003992,0.000492,0.004258,0.004258


In [58]:
def testar_predicao_svd(userId, filmes, notas_reais):
    try:
        notas_preditas = matriz_SVD[matriz_SVD.index==userId][filmes].values.tolist()[0]
    except:
        return 0
    return calcular_rmse(notas_reais,notas_preditas)

def testar_svd(teste = ratings_teste):
    usuarios_teste = list(set(teste.userId))
    estatistica_user_based = pd.DataFrame(index = usuarios_teste)
    for usuario in usuarios_teste:
        filmes_teste =  get_filmes_teste(usuario)
        notas_dadas = get_notas_teste(usuario)
        estatistica_user_based.at[usuario,"rmse_svd"]=testar_predicao_svd(usuario, filmes_teste, notas_dadas)
    estatistica_user_based['rmse_svd'].std()
    return estatistica_user_based.describe().T

resultado = []

inicio = time.time()
for i in range(tt.get_k()):
    inicio = time.time()
    ratings_treino, ratings_teste = tt.get_treino_teste()
    parcial = testar_svd(ratings_teste)
    resultado.append(parcial)
    fim = time.time()
    print(f'{i+1}º processo svd gastou {round((fim - inicio),5)} segs', end='\r')
    tt.proxima_folha()

resultado = pd.concat(resultado)
display(resultado)

medias_s = resultado['mean'].mean()
desvios_padrao_s = resultado['std'].mean()
confianca_s = 0.99
conf_int_s = scipy.stats.norm.interval(confianca_s, loc=medias_s, scale=desvios_padrao_s) 
print_destaque('Relatório da final da SVD')
print(f"Média das RMSE:    {medias_s}\nDesvio padrão médio: {desvios_padrao_s}")
print(f"Intervalo de {confianca_s} confiança: {conf_int_s}\n")
print_destaque('Fim de execução')

10º processo svd gastou 1.05002 segs

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rmse_svd,1039.0,3.029917,0.85778,0.0,2.565546,3.072937,3.599444,4.981269
rmse_svd,1031.0,3.033963,0.880989,0.0,2.595213,3.090667,3.617542,4.970101
rmse_svd,1033.0,3.046745,0.872073,0.0,2.590069,3.094148,3.591594,4.993025
rmse_svd,1023.0,3.017356,0.894885,0.0,2.566778,3.065632,3.58257,5.003578
rmse_svd,1038.0,3.023418,0.865083,0.0,2.589569,3.069263,3.618769,4.98047
rmse_svd,1032.0,3.044509,0.84794,0.0,2.608994,3.102679,3.558621,4.977617
rmse_svd,1033.0,3.055097,0.857001,0.0,2.601501,3.080264,3.608107,4.972825
rmse_svd,1032.0,3.048528,0.822769,0.0,2.59457,3.059078,3.581574,4.996976
rmse_svd,1029.0,3.026954,0.859384,0.0,2.592141,3.070314,3.606942,4.969173
rmse_svd,1030.0,3.097392,0.852867,0.0,2.674523,3.159728,3.625509,5.003663


[44m[93m Relatório da final da SVD 
[0m
Média das RMSE:    3.0423879467038137
Desvio padrão médio: 0.861077188922275
Intervalo de 0.99 confiança: (0.8244000908603053, 5.260375802547323)

[44m[93m Fim de execução 
[0m


---
---

# Executando todos aos juntos

In [59]:
print_destaque('Relatório da final da USER BASED')
print(f"Média das RMSE:    {medias_u}\nDesvio padrão médio: {desvios_padrao_u}")
print(f"Intervalo de {confianca_u} confiança: {conf_int_u}\n")

print_destaque('Relatório da final da ITEM BASED')
print(f"Média das RMSE:    {medias_i}\nDesvio padrão médio: {desvios_padrao_i}")
print(f"Intervalo de {confianca_i} confiança: {conf_int_i}\n")

print_destaque('Relatório da final da SVD')
print(f"Média das RMSE:    {medias_s}\nDesvio padrão médio: {desvios_padrao_s}")
print(f"Intervalo de {confianca_s} confiança: {conf_int_s}\n")
print_destaque('Fim de execução')

[44m[93m Relatório da final da USER BASED 
[0m
Média das RMSE:    1.4027344558933068
Desvio padrão médio: 0.7395884375762437
Intervalo de 0.99 confiança: (-0.5023191141815284, 3.307788025968142)

[44m[93m Relatório da final da ITEM BASED 
[0m
Média das RMSE:    2.9586782480083818
Desvio padrão médio: 0.8528851516469436
Intervalo de 0.99 confiança: (0.761791681834437, 5.155564814182327)

[44m[93m Relatório da final da SVD 
[0m
Média das RMSE:    3.0423879467038137
Desvio padrão médio: 0.861077188922275
Intervalo de 0.99 confiança: (0.8244000908603053, 5.260375802547323)

[44m[93m Fim de execução 
[0m


In [60]:
tt = Treino_Teste(ratings,10)
resultado_user = []
resultado_item = []
resultado_svd = []

for i in range(tt.get_k()):
    ratings_treino, ratings_teste = tt.get_treino_teste()

    inicio = time.time()
    matriz_user = gerar_matriz_movies_user(ratings_treino)
    gerar_matriz_similaridade_user(matriz_user)
    parcial_user = testar_user_based(ratings_teste)    
    resultado_user.append(parcial_user)
    fim = time.time()
    print(f'{i+1}º processo user-based gastou {fim - inicio} segs')

    inicio = time.time()
    matriz_item = matriz_user.T
    gerar_matriz_similaridade_item(matriz_item)
    parcial_item = testar_item_based(ratings_teste)    
    resultado_item.append(parcial_item)
    fim = time.time()
    print(f'{i+1}º processo item-based gastou {fim - inicio} segs')

    inicio = time.time()
    parcial_svd = testar_svd(ratings_teste)
    print(f'{i+1}º processo(s)', end='\r')
    resultado_svd.append(parcial_svd)
    fim = time.time()
    print(f'{i+1}º processo SVD gastou {fim - inicio} segs')

    tt.proxima_folha()


medias_u = resultado_user['mean'].mean()
desvios_padrao_u = resultado_user['std'].mean()
confianca_u = 0.99
conf_int_u = scipy.stats.norm.interval(confianca_u, loc=medias_u, scale=desvios_padrao_u) 
print_destaque('Relatório da final da USER BASED')
print(f"Média das RMSE:    {medias_u}\nDesvio padrão médio: {desvios_padrao_u}")
print(f"Intervalo de {confianca_u} confiança: {conf_int_u}\n")

medias_i = resultado_item['mean'].mean()
desvios_padrao_i = resultado_item['std'].mean()
confianca_i = 0.99
conf_int_i = scipy.stats.norm.interval(confianca_i, loc=medias_i, scale=desvios_padrao_i) 
print_destaque('Relatório da final da ITEM BASED')
print(f"Média das RMSE:    {medias_i}\nDesvio padrão médio: {desvios_padrao_i}")
print(f"Intervalo de {confianca_i} confiança: {conf_int_i}\n")

medias_s = resultado_svd['mean'].mean()
desvios_padrao_s = resultado_svd['std'].mean()
confianca_s = 0.99
conf_int_s = scipy.stats.norm.interval(confianca_s, loc=medias_s, scale=desvios_padrao_s) 
print_destaque('Relatório da final da SVD')
print(f"Média das RMSE:    {medias_s}\nDesvio padrão médio: {desvios_padrao_s}")
print(f"Intervalo de {confianca_s} confiança: {conf_int_s}\n")
print_destaque('Fim de execução')

1º processo user-based gastou 7.741596221923828 segs
1º processo item-based gastou 25.50557041168213 segs
1º processo SVD gastou 1.1067733764648438 segs
2º processo user-based gastou 7.81731104850769 segs
2º processo item-based gastou 26.169904708862305 segs
2º processo SVD gastou 1.030620813369751 segs
3º processo user-based gastou 8.088613271713257 segs
3º processo item-based gastou 26.540303707122803 segs
3º processo SVD gastou 1.41971755027771 segs
4º processo user-based gastou 7.292355537414551 segs
4º processo item-based gastou 26.641812324523926 segs
4º processo SVD gastou 1.0738272666931152 segs
5º processo user-based gastou 7.17916464805603 segs
5º processo item-based gastou 26.426568269729614 segs
5º processo SVD gastou 1.1548855304718018 segs
6º processo user-based gastou 7.338716745376587 segs
6º processo item-based gastou 28.963640451431274 segs
6º processo SVD gastou 1.1502759456634521 segs
7º processo user-based gastou 8.515458583831787 segs
7º processo item-based gastou

TypeError: list indices must be integers or slices, not str