# Trabalho de Filtragem Colaborativa

Modelo de filtragem colaborativa usa as informações de ratings dos usuários para prover recomendações. <br>
**Ideia Principal:** A semelhança entre os usuários a partir do que se observa da interseção dos seus ratings permite inferir que os dados não inputados por apenas um destes poderia ter a mesma semelhança com relação ao outro. <br>
**Problema Principal:** A esparsidade dos dados, afinal a informação que um usuário provê é normalmente a um subconjunto muito pequeno dos itens. Logo a maioria da base de dados é de dados *faltantes* ou *não observados*. <br>
<br>
Há 2 métodos de filtragem colaborativa: <br>

**Memory-Based:** Também chamado de *neighborhood-based collaborative filtering algorithms*. Que se dividem basicamente em *user-based collaborative filtering* e *item-based collaborative filtering*. <br>
**Model-Based:** Modelos baseados em **machine learning** e **data mining** há um processo de aprendizado prévio para parametrizição. Alguns métodos são Decisions Trees, métodos bayesianos, modelos baseados em regras e latent factor method. <br>

## Implementação usando MovieLens


In [21]:
import pandas as pd         #DataFrames e operações associadas
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity          #Similaridade
import math  
import sklearn.metrics  

#### Funções para auxiliar
def listar_filmes_ja_vistos(usuario, matriz_filmes_X_usuarios):
    #filmes_ja_vistos_bin = matriz_filmes_X_usuarios.loc[usuario].gt(0)   #gerar array com o que usuário já deu rating: True ou False
    #return filmes_ja_vistos_bin.index[filmes_ja_vistos_bin].to_list() #com base no anterior, listar filmes que já viu        
    if(type(usuario)==list):
        filmes_usuario = matriz_filmes_X_usuarios.loc[usuario].sum(axis = 0)
    else: #if(type(usuario)==int):
        filmes_usuario = matriz_filmes_X_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario!=0].index.to_list()

def listar_filmes_nao_vistos(usuario, matriz_filmes_X_usuarios):
    if(type(usuario)==list):
        filmes_usuario = matriz_filmes_X_usuarios.loc[usuario].sum(axis = 0)
    else: 
        filmes_usuario = matriz_filmes_X_usuarios.loc[usuario]
    return filmes_usuario[filmes_usuario==0].index.to_list()

def eliminar_colunas_zeradas(matriz):
    return matriz.loc[:, (matriz != 0).any(axis=0)] #elimina todas as colunas cujos todos os valores são 0

def listar_nomes_filmes(indices, movies):
    return movies.loc[indices]['title'].values.tolist()


### Carregar a base de 25M ou 100K

In [2]:
small = False  #usar a base de ratings reduzida ou não
ratings = pd.DataFrame()
movies = pd.DataFrame()

if (small):
    ratings = pd.read_csv('ratings_small.csv')
    movies = pd.read_csv('movies_small.csv').set_index("movieId")
else:
    ratings = pd.read_csv('ratings.csv')[:2000000] #ler as primeiras 2.000.000 linhas da base completa
    movies = pd.read_csv('movies.csv')
    movies = movies.set_index('movieId')

n_ratings = len(ratings)
n_users = len(ratings['userId'].unique())
n_movies_avaliados = len(ratings['movieId'].unique())
print(f"Total de ratings: {n_ratings}")
print(f"Total de filmes: {len(movies)}")
print(f"Filmes avaliados: {n_movies_avaliados}")
print(f"Total de usuários: {n_users}")
print(f"Média de ratings/user: {round(n_ratings/n_users, 2)}")
print(f"Shape de Ratings: {ratings.shape}")
esparsidade = round(1.0 -n_ratings/float(n_users * n_movies_avaliados),3)
print(f"O nível de esparsidade do dataset é {esparsidade * 100}%")
ratings.sample(6).sort_index()


Total de ratings: 2000000
Total de filmes: 62423
Filmes avaliados: 27321
Total de usuários: 13322
Média de ratings/user: 150.13
Shape de Ratings: (2000000, 4)
O nível de esparsidade do dataset é 99.5%


Unnamed: 0,userId,movieId,rating,timestamp
711224,4835,62155,4.0,1349509445
1103766,7441,71535,4.5,1463282000
1313896,8868,5445,3.5,1529187954
1351877,9108,2076,3.5,1493604288
1702030,11332,440,4.0,833144036
1715857,11421,72395,5.0,1454415895


In [3]:
print("Relatório dos ratings dados pelos usuários:")
ratings.groupby('userId')['rating'].count().describe()

Relatório dos ratings dados pelos usuários:


count    13322.000000
mean       150.127608
std        238.442136
min         20.000000
25%         35.000000
50%         70.000000
75%        158.000000
max       4689.000000
Name: rating, dtype: float64

## Gerar matriz Movies X User

In [4]:
#movies_X_users = ratings.pivot_table(index="userId", columns="movieId", values="rating", fill_value=0)
# A LINHA DE CIMA FAZ O MESMO QUE A DE BAIXO GASTANDO 15X MAIS TEMPO
movies_X_users = ratings.groupby(['userId', 'movieId'])['rating'].first().unstack(fill_value=0.0)
movies_X_users

movieId,1,2,3,4,5,6,7,8,9,10,...,208531,208545,208683,208715,208737,208787,208793,208795,208939,209163
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13319,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13320,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13321,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
titulos_users = movies_X_users.copy() #se não mandar uma cópia, titulo users passa ser outro nome da mesma matriz e altera ela diretamente
titulos_users.columns = movies.loc[movies_X_users.columns.values.tolist()].title.values.tolist()
titulos_users.head(3)


Unnamed: 0_level_0,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Earthquake Bird (2019),Eminence Hill (2019),The Man Without Gravity (2019),Let It Snow (2019),Midway (2019),Marvel Renaissance (2014),Watchman (2019),Zana (2019),Klaus (2019),Bad Poems (2018)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## User-Based Collaborative Filtering

### Matriz de Similaridade por Usuário
A medida comumente usada é a similaridade do cosseno.
Essa medida de similaridade deve seu nome ao fato de ser igual ao cosseno do ângulo entre os dois vetores que estão sendo comparados:  vetores de similaridade de usuário (ou item) de ratings. Quanto menor o ângulo entre dois vetores, maior será o cosseno, resultando em um fator de similaridade mais alto. 

In [6]:
users_cosine_array = cosine_similarity(movies_X_users)
users_cosine = pd.DataFrame(data=users_cosine_array, index=movies_X_users.index, columns=movies_X_users.index)
users_cosine.round(3).head()

userId,1,2,3,4,5,6,7,8,9,10,...,13313,13314,13315,13316,13317,13318,13319,13320,13321,13322
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.041,0.061,0.041,0.016,0.0,0.094,0.021,0.023,0.026,...,0.011,0.009,0.055,0.032,0.0,0.025,0.111,0.0,0.024,0.028
2,0.041,1.0,0.179,0.197,0.158,0.13,0.065,0.177,0.129,0.157,...,0.204,0.107,0.231,0.179,0.174,0.106,0.0,0.036,0.117,0.141
3,0.061,0.179,1.0,0.358,0.061,0.115,0.031,0.081,0.062,0.132,...,0.046,0.19,0.168,0.061,0.07,0.104,0.062,0.063,0.049,0.139
4,0.041,0.197,0.358,1.0,0.066,0.072,0.016,0.089,0.066,0.084,...,0.09,0.1,0.147,0.049,0.072,0.064,0.089,0.015,0.027,0.057
5,0.016,0.158,0.061,0.066,1.0,0.115,0.202,0.308,0.216,0.27,...,0.098,0.113,0.206,0.072,0.045,0.088,0.053,0.013,0.33,0.129


### Selecionar a similaridade desejada

Nota: se fazer média das notas primeiro e usar cosseno, obtem-se mesmos valores que dá fazendo pearson diretamente. 
```python
    movies_X_users = movies_X_users - np.asarray([(np.mean(movies_X_users, 1))]).T
```
Implementação a seguir de person não é recomendada pois gasta mais que 5x o tempo de cosseno 

In [7]:
#selecionar se vai usar pearson ou cosine para similaridade
usar_pearson = False
matriz_similaridade = users_cosine 
if (usar_pearson):     # Se verdadeiro, mudar para similaridade por pearson
    matriz_similaridade = movies_X_users.T.corr(method='pearson') #pode usar tb kendall e spearman

### Pegar os k usuários mais similares ao Target selecionado

In [8]:
def selecionar_usuarios_mais_similares(target = 1, k = 25):
    todas_similaridades_com_usuario = matriz_similaridade.loc[target].to_numpy()     #criar um array com a linha do target na matrix de similaridades
    usuarios_mais_similares = movies_X_users.index[todas_similaridades_com_usuario.argpartition(-k)[-k-1:-1]] #seleciona os k com similaridade mais alta no vetor excluindo ele mesmo
    return usuarios_mais_similares 

### Matriz de filmes não vistos pelo usuário target e que receberam notas dos usuários mais similares

In [9]:
target = 1
usuarios_mais_similares = selecionar_usuarios_mais_similares(1,)
#gerar matriz dos usuarios mais similares x filmes não assistidos ainda pelo usuário
usuarios_similares_X_filmes_nao_vistos = movies_X_users.loc[usuarios_mais_similares].drop(columns=listar_filmes_ja_vistos(1,movies_X_users)) 
usuarios_similares_X_filmes_nao_vistos = eliminar_colunas_zeradas(usuarios_similares_X_filmes_nao_vistos)
print('Matriz de filmes não vistos por usuários mais similares:',usuarios_similares_X_filmes_nao_vistos.shape)
usuarios_similares_X_filmes_nao_vistos.head(8)

Matriz de filmes não vistos por usuários mais similares: (25, 2396)


movieId,1,2,5,6,10,11,12,14,16,17,...,98604,98961,99114,104841,106487,106489,106916,108727,108981,114662
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7928,4.0,0.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7766,5.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7431,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6772,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11885,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Gerar a Recomendação de acordo com a nota dada pelos usuários similares

1. Para cada filme: 
2. Para cada usuário da lista de mais similar:
    1. Se nota foi dada: somar nota seguindo a fórmula
$$ nota_ = {\sum coeficienteUsuário * notaUsuário \over \sum coeficienteUsuário} $$

In [10]:
def predizer_notas(target, similares, filmes, min_threshold=0): # min_threshold: qtd mínima de notas para ser considerada ao recomendar um filme
    #Dataframe com a similaridade_dos_mais_similares: [userId, coeficiente de similaridade] 
    similaridade_dos_mais_similares = matriz_similaridade.loc[target,similares] 

    #matriz_similares_filmes = movies_X_users.loc[similares
    resultado = pd.DataFrame(columns=filmes, index=['Nota Final', '# Notas'] ) 
    resultado.columns.name = 'movieId'
    for filme in filmes:
        numerador = 0
        denominador = 0
        qtd_notas = 0
        for similar in similares:
            nota = movies_X_users.loc[similar,filme]
            if (nota != 0):
                coeficiente = similaridade_dos_mais_similares[similar]
                numerador += nota * coeficiente
                denominador += coeficiente
                qtd_notas += 1
        if (qtd_notas < min_threshold):
            resultado = resultado.drop([filme], axis=1) #se a qt de notas for menor que limiar, descartar coluna com informação daquele filme
        else: #se não, prencher a nota calculada da média ponderada e a qtd de notas dadas
            resultado.at['Nota Final',filme] = round(numerador/denominador,1)
            resultado.at['# Notas',filme] = qtd_notas
    return resultado

In [11]:
qtd_sugestoes = 20 #qtd de sugestões para exibir na tela

filmes_vistos_pelo_usuario = listar_filmes_ja_vistos(target, movies_X_users) 
filmes_vistos_pelos_similares = listar_filmes_ja_vistos(usuarios_mais_similares.values.tolist(), movies_X_users)
filmes_a_avaliar = list(set(filmes_vistos_pelos_similares)-set(filmes_vistos_pelo_usuario))
recomendacao = predizer_notas(target, usuarios_mais_similares, filmes_a_avaliar, 5)
#recomendacao
recomendacao.T.sort_values(by=['Nota Final','# Notas',],ascending=False).head(qtd_sugestoes).join(movies[['title']], on=["movieId"])


Unnamed: 0_level_0,Nota Final,# Notas,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8197,4.7,6,Hiroshima Mon Amour (1959)
750,4.6,13,Dr. Strangelove or: How I Learned to Stop Worr...
1232,4.6,8,Stalker (1979)
2313,4.6,7,"Elephant Man, The (1980)"
1251,4.6,6,8 1/2 (8½) (1963)
858,4.5,16,"Godfather, The (1972)"
1208,4.5,13,Apocalypse Now (1979)
1193,4.5,11,One Flew Over the Cuckoo's Nest (1975)
293,4.5,10,Léon: The Professional (a.k.a. The Professiona...
8950,4.5,5,The Machinist (2004)


## Avaliando a eficácia do método
1. Selecionar um target randômico
2. Selecionar alguns valores de notas dadas por ele
3. Tentar predizer sua nota com base nos seus similares, 
4. Calcular margem de erro 

Uma medida frequentemente usada na verificação da acurácia de modelos numéricos é o Erro Quadrático Médio (MSE na sigla em Inglês) como descrito, por exemplo, em Wilks (2006).MSE é sempre positivo. MSE = 0 indica simulação perfeita. MSE é definido por:
$$ MSE = \frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2 $$

Em adição ao MSE, a raiz quadrada de MSE, ou Raiz do Erro Quadrático Médio (RMSE em Inglês), é comumente usada para expressar a acurácia dos resultados numéricos com a vantagem de que RMSE apresenta valores do erro nas mesmas dimensões da variável analisada. O RMSE é definido por:
$$ RMSE = \sqrt{\frac{1}{n} \Sigma_{i=1}^n({y}-\hat{y})^2} $$

In [12]:
rnd_user = movies_X_users.sample()
rnd_user = eliminar_colunas_zeradas(rnd_user)
rnd_user

movieId,19,25,32,47,110,111,150,153,161,165,...,587,588,589,590,592,593,597,648,736,1036
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4705,3.0,3.0,4.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,...,3.0,3.0,4.0,4.0,3.0,4.0,4.0,3.0,3.0,4.0


In [13]:
rnd_user_id = rnd_user.index.values[0]

filmes_assistidos = listar_filmes_ja_vistos(rnd_user_id, movies_X_users)
similares = selecionar_usuarios_mais_similares(rnd_user_id)
predicao = predizer_notas(rnd_user_id, similares, filmes_assistidos, 0)
predicao

movieId,19,25,32,47,110,111,150,153,161,165,...,587,588,589,590,592,593,597,648,736,1036
Nota Final,2.8,3.4,4.2,3.8,4.0,3.0,3.7,3.3,3.8,3.9,...,3.3,3.4,4.0,3.8,3.4,4.3,3.1,3.4,3.6,4.0
# Notas,16.0,2.0,17.0,22.0,22.0,1.0,25.0,24.0,22.0,25.0,...,18.0,22.0,25.0,24.0,24.0,20.0,16.0,4.0,5.0,2.0


In [14]:
def calcular_rmse(real, previsao):
    mse = sklearn.metrics.mean_squared_error(notasReais, notasPreditas)     
    #mse = np.square(np.subtract(notasReais,notasPreditas)).mean()      
    print("Erro Quadrático Médio (MSE):", mse)         
    return math.sqrt(mse)  

notasReais = rnd_user.values.tolist()[0]
notasPreditas = predicao.loc["Nota Final"].values.tolist()
print("Raiz do Erro Quadrático Médio :", calcular_rmse(notasReais, notasPreditas))  

Erro Quadrático Médio (MSE): 0.36239130434782607
Raiz do Erro Quadrático Médio : 0.6019894553460434


---
---

## Item-Based Collaborative Filtering

### Gerar matriz Users X Movies
Transposição da matriz que tinha usuários nas linhas e filmes nas colunas, para filmes nas linhas e usuários nas colunas

In [15]:
users_X_movies = movies_X_users.T
users_X_movies.head(4)

userId,1,2,3,4,5,6,7,8,9,10,...,13313,13314,13315,13316,13317,13318,13319,13320,13321,13322
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,3.5,4.0,3.0,4.0,0.0,0.0,4.0,0.0,3.5,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Matriz de Similaridade
Similaridade por Cosseno dos filmes entre si

In [16]:
movies_cosine_array = cosine_similarity(users_X_movies)
movies_cosine = pd.DataFrame(data=movies_cosine_array, index=users_X_movies.index, columns=users_X_movies.index)
movies_cosine.head()
#movies_pearson = movies_users.corr(method='pearson')

movieId,1,2,3,4,5,6,7,8,9,10,...,208531,208545,208683,208715,208737,208787,208793,208795,208939,209163
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.385083,0.276592,0.109643,0.285718,0.354674,0.28509,0.101321,0.162151,0.361706,...,0.009099,0.00182,0.012739,0.014558,0.016137,0.014558,0.018198,0.018198,0.0,0.0
2,0.385083,1.0,0.220953,0.140544,0.23072,0.265091,0.189098,0.164472,0.155061,0.40285,...,0.0,0.006661,0.019982,0.0,0.027943,0.019982,0.026643,0.0,0.016652,0.023312
3,0.276592,0.220953,1.0,0.168761,0.429549,0.259667,0.364669,0.107634,0.250113,0.207599,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.109643,0.140544,0.168761,1.0,0.169346,0.095199,0.176095,0.150909,0.108823,0.099429,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.285718,0.23072,0.429549,0.169346,1.0,0.231271,0.399015,0.132778,0.232934,0.176002,...,0.0,0.019849,0.024811,0.0,0.0,0.004962,0.0,0.0,0.0,0.0


### Selecionar um usuário target e seus filmes favoritos
É obtido a nota mais alta que este usuário deu e todos os filmes  com a mesma nota. Estes chamaremos de filmes favorito e de acordo eles é buscada a similaridade entre os itens

In [20]:
movies.loc[filmes_mais_gosta]['title'].values.tolist()

['Night, The (Notte, La) (1960)',
 'Saragossa Manuscript, The (Rekopis znaleziony w Saragossie) (1965)',
 'Seventh Seal, The (Sjunde inseglet, Det) (1957)',
 'Run Lola Run (Lola rennt) (1998)',
 'Dolls (2002)',
 'Three Colors: Blue (Trois couleurs: Bleu) (1993)',
 'Eternal Sunshine of the Spotless Mind (2004)',
 'Pulp Fiction (1994)',
 'Dolce Vita, La (1960)',
 'Requiem for a Dream (2000)',
 'Underground (1995)',
 'City of God (Cidade de Deus) (2002)',
 'Lost in Translation (2003)',
 'Idiots, The (Idioterne) (1998)',
 'In the Mood For Love (Fa yeung nin wa) (2000)',
 'Teddy Bear (Mis) (1981)',
 'Look at Me (Comme une image) (2004)']

In [22]:
usuarioId_target = 1
usuario = movies_X_users.loc[usuarioId_target].sort_values(ascending=False)
rate_mais_alto = usuario.iloc[0]
filmes_mais_gosta = usuario[usuario >= rate_mais_alto].index.tolist()
assistiu_n_filmes = len(usuario[usuario > 0].index.to_list())
print(f"O usuário {usuarioId_target} assistiu {assistiu_n_filmes} filmes e deu nota {rate_mais_alto} para estes {len(filmes_mais_gosta)} filmes: ")
print(listar_nomes_filmes(filmes_mais_gosta, movies))

O usuário 1 assistiu 70 filmes e deu nota 5.0 para estes 17 filmes: 
['Night, The (Notte, La) (1960)', 'Saragossa Manuscript, The (Rekopis znaleziony w Saragossie) (1965)', 'Seventh Seal, The (Sjunde inseglet, Det) (1957)', 'Run Lola Run (Lola rennt) (1998)', 'Dolls (2002)', 'Three Colors: Blue (Trois couleurs: Bleu) (1993)', 'Eternal Sunshine of the Spotless Mind (2004)', 'Pulp Fiction (1994)', 'Dolce Vita, La (1960)', 'Requiem for a Dream (2000)', 'Underground (1995)', 'City of God (Cidade de Deus) (2002)', 'Lost in Translation (2003)', 'Idiots, The (Idioterne) (1998)', 'In the Mood For Love (Fa yeung nin wa) (2000)', 'Teddy Bear (Mis) (1981)', 'Look at Me (Comme une image) (2004)']


### Similaridade dos filmes favoritos com os filmes não assistidos
Matriz onde cada um dos filmes favoritos é um índice e cada coluna é um filme que não foram assistidos ainda.

In [23]:
lista_filmes_ja_vistos = listar_filmes_ja_vistos(usuarioId_target,movies_X_users)
print(f"Usuario {usuarioId_target} já assistiu {len(lista_filmes_ja_vistos)} filmes.")
filmes_pro_usuario = movies_cosine.loc[filmes_mais_gosta].drop(columns=lista_filmes_ja_vistos)
filmes_pro_usuario = eliminar_colunas_zeradas(filmes_pro_usuario) #eliminar as colunas dos filmes que não similaridade nenhuma com nada
filmes_pro_usuario

Usuario 1 já assistiu 70 filmes.


movieId,1,2,3,4,5,6,7,8,9,10,...,208531,208545,208683,208715,208737,208787,208793,208795,208939,209163
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4325,0.048319,0.036425,0.004197,0.0,0.003165,0.052934,0.010252,0.0,0.005654,0.030384,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.212605
2632,0.033449,0.029535,0.014118,0.0,0.002129,0.040721,0.009906,0.0,0.0,0.026479,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1237,0.145767,0.1015,0.049393,0.038993,0.055269,0.148725,0.056772,0.033396,0.018669,0.105733,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058498,0.0,0.040948
2692,0.280149,0.182277,0.097522,0.050761,0.071719,0.247829,0.088598,0.025237,0.039103,0.194257,...,0.0,0.022133,0.025295,0.0,0.0,0.012647,0.031619,0.031619,0.022133,0.022133
8327,0.039146,0.040715,0.005662,0.0,0.00934,0.045019,0.013325,0.0,0.0,0.02385,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.179259
307,0.137478,0.087935,0.057192,0.056846,0.051696,0.163889,0.085958,0.021172,0.035518,0.112383,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056588,0.0,0.039612
7361,0.343043,0.227357,0.074001,0.025174,0.082209,0.220624,0.07291,0.017835,0.028312,0.175972,...,0.018354,0.002294,0.018354,0.020648,0.021875,0.0,0.0,0.022943,0.018354,0.01606
296,0.491905,0.342615,0.16418,0.099265,0.151422,0.411448,0.158854,0.046971,0.100742,0.423525,...,0.014375,0.014375,0.014375,0.0115,0.012884,0.00575,0.014375,0.014375,0.014375,0.0115
8154,0.083079,0.061808,0.024624,0.02191,0.026139,0.091842,0.056287,0.010164,0.014778,0.058553,...,0.100019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3949,0.272124,0.194965,0.073408,0.019802,0.059297,0.237404,0.055039,0.014923,0.035329,0.15843,...,0.020156,0.002879,0.0,0.0,0.006864,0.0,0.023036,0.028794,0.023036,0.023036


Transformamos a matriz numa 1xN filmes com o valor máximo encontra de similaridade; e ordenamos essa matriz, selecionando os k filmes com maior similaridade apresentada <br>
Feito também um **join** com movies para mostrar o título

In [24]:
k = 20
#pegar a similaridade máxima que cada um dos filmes não vistos possui com os filmes já vistos
recomendacao = filmes_pro_usuario.max().sort_values(ascending=False).head(k)
recomendacao = pd.DataFrame(recomendacao).join(movies['title'], on='movieId')
recomendacao

Unnamed: 0_level_0,0,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
593,0.698877,"Silence of the Lambs, The (1991)"
26094,0.687403,"Eclisse, L' (Eclipse) (1962)"
308,0.685197,Three Colors: White (Trzy kolory: Bialy) (1994)
318,0.665221,"Shawshank Redemption, The (1994)"
47,0.650256,Seven (a.k.a. Se7en) (1995)
50,0.646176,"Usual Suspects, The (1995)"
356,0.642366,Forrest Gump (1994)
2959,0.600179,Fight Club (1999)
1089,0.591013,Reservoir Dogs (1992)
4226,0.581538,Memento (2000)


In [35]:
recomendacao = recomendacao.rename(columns={'title': 'Recomendação', 0: 'Score'})
pq_vc_assistiu = []
nota_media = []
for id in recomendacao.index:
    pq_vc_assistiu.append(filmes_pro_usuario.index[filmes_pro_usuario[id] == recomendacao.loc[id][0]].tolist()[0])
    nota_media.append(ratings.groupby(['movieId'])['rating'].mean().loc[id])
recomendacao['Nota média'] = nota_media
recomendacao['Pq vc assistiu'] = listar_nomes_filmes(pq_vc_assistiu, movies)
recomendacao

Unnamed: 0_level_0,Score,Recomendação,Nota média,Pq vc assistiu
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
593,0.698877,"Silence of the Lambs, The (1991)",4.153263,Pulp Fiction (1994)
26094,0.687403,"Eclisse, L' (Eclipse) (1962)",3.8,"Night, The (Notte, La) (1960)"
308,0.685197,Three Colors: White (Trzy kolory: Bialy) (1994),3.932,Three Colors: Blue (Trois couleurs: Bleu) (1993)
318,0.665221,"Shawshank Redemption, The (1994)",4.429786,Pulp Fiction (1994)
47,0.650256,Seven (a.k.a. Se7en) (1995),4.090097,Pulp Fiction (1994)
50,0.646176,"Usual Suspects, The (1995)",4.307735,Pulp Fiction (1994)
356,0.642366,Forrest Gump (1994),4.045913,Pulp Fiction (1994)
2959,0.600179,Fight Club (1999),4.219444,Pulp Fiction (1994)
1089,0.591013,Reservoir Dogs (1992),4.101577,Pulp Fiction (1994)
4226,0.581538,Memento (2000),4.162661,Eternal Sunshine of the Spotless Mind (2004)


---
---

# SVD: Fatoração de Matriz
Devido a esparsidade do dataset, os métodos tradicionais de filtragem colaborativa podem não serem adequados a demanda de processamento. Uma forma de tratar é fazendo uso do algoritmo de **Singular Value Decomposition**, SVD.<br>
Neste algoritmo, a matriz é decomposta em  em outras 3 matrizes de menor dimensionalidade.
$$ A = USV^T$$
- A é a matriz original m x n
- U é uma matriz ortogonal m x n
- S é uma matriz diagona n x n
- V é uma matriz ortogonal n x n

https://heartbeat.comet.ml/recommender-systems-with-python-part-iii-collaborative-filtering-singular-value-decomposition-5b5dcb3f242b

https://www.kaggle.com/code/cast42/simple-svd-movie-recommender

In [36]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(movies_X_users.to_numpy(), k = 50) #o que são essas k features

print(f"Matriz original{movies_X_users.shape} decomposta em U{U.shape}, sigma {sigma.shape} e Vt{Vt.shape}.")

Matriz original(13322, 27321) decomposta em U(13322, 50), sigma (50,) e Vt(50, 27321).


In [37]:
sigma_diag_matrix=np.diag(sigma) #sigma é um array contendo a diagonal
all_user_predicted_ratings = np.dot(np.dot(U, sigma_diag_matrix), Vt)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = movies_X_users.columns, index=movies_X_users.index)
preds_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,208531,208545,208683,208715,208737,208787,208793,208795,208939,209163
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.772561,-0.031004,-0.045287,-0.036903,0.009587,-0.207948,0.064104,-0.032135,-0.001484,0.12384,...,-0.000974,0.000111,0.002047,0.00024,-0.001915,-0.001339,-0.002163,0.002527,-0.001393,0.006462
2,4.280314,0.440407,0.044153,-0.013698,-0.120013,0.462136,0.19492,0.09333,-0.101376,0.448111,...,-0.002205,-0.001829,-0.011183,0.002208,-0.003763,-0.002593,0.001678,0.01695,-0.006913,0.006643
3,1.260915,0.471058,-0.260996,-0.13309,-0.017546,0.523084,0.170512,-0.146719,-0.173209,-0.390248,...,0.00239,0.007356,0.016333,-0.002372,0.020703,0.015702,-0.006491,-0.020402,0.011331,-0.002112
4,2.631603,0.220179,-0.178007,-0.029201,-0.084359,0.069683,0.065234,-0.031115,-0.009492,0.328353,...,0.010204,-0.00038,0.00721,0.003881,0.017068,0.00244,-0.006303,0.001462,0.002747,0.007366
5,4.305501,0.937399,1.366239,0.108312,1.150614,1.609428,1.093901,0.065092,0.444644,1.228014,...,-0.002498,-0.000333,-0.001014,-0.001421,-0.006514,0.001098,0.001582,-0.000176,-0.002648,0.002661


In [38]:
def get_high_recommended_movies(userId):
    movies_rated_by_user = movies_X_users.loc[userId]
    movies_high_rated_by_user =  movies_rated_by_user[movies_rated_by_user > 4.5].index
    movies_recommended_for_user = preds_df.loc[userId]
    movies_high_recommend_for_user = movies_recommended_for_user[movies_recommended_for_user > 4].index
    return set(movies_high_recommend_for_user) - set(movies_high_rated_by_user)

In [39]:
user = 1

rec = get_high_recommended_movies(user)
rec_ = pd.DataFrame(index=list(rec), columns=['Título', 'Nota'])
rec_.index.name='movieId'
rec_['Título'] = listar_nomes_filmes(rec, movies)
for id in rec:
    rec_.at[id,'Nota'] = preds_df.loc[user,id]
rec_



  return movies.loc[indices]['title'].values.tolist()


Unnamed: 0_level_0,Título,Nota
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
