## Project - BUILDING A MOVIE RECOMMENDATION SYSTEM - Victor Tintel <br>
<p> Data source: https://www.kaggle.com/code/alyssonbispopereira/recomenda-o-de-filmes-ptbr/data

This project implements a movie recommendation system using Machine Learning techniques, similar in spirit to the algorithms used by platforms such as Netflix and Spotify. The same approach can be adapted to recommend other products such as music, books, or e-commerce items.

In [3]:
# Importing the packages used in this notebook

import pandas as pd
import numpy as np

In [5]:
# Import the movies file and view the first rows

filmes = pd.read_csv('movies_metadata.csv', low_memory = False)
filmes.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [7]:
# Importing the ratings file and inspecting the first rows

avaliacoes = pd.read_csv('ratings.csv')
avaliacoes.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


## Data Preprocessing

In [None]:
# Keeping only the required columns and renaming them

# Select only the columns that will be used; .copy() avoids pandas' SettingWithCopyWarning on the rename below
filmes = filmes[['id','original_title','original_language','vote_count']].copy()

# The remaining columns are not needed for the analysis.

# Rename the columns
filmes.rename(columns = {'id':'ID_FILME','original_title':'TITULO','original_language':'LINGUAGEM','vote_count':'QT_AVALIACOES'}, inplace = True)

In [13]:
# Show the first rows of the processed DataFrame
filmes.head()

Unnamed: 0,ID_FILME,TITULO,LINGUAGEM,QT_AVALIACOES
0,862,Toy Story,en,5415.0
1,8844,Jumanji,en,2413.0
2,15602,Grumpier Old Men,en,92.0
3,31357,Waiting to Exhale,en,34.0
4,11862,Father of the Bride Part II,en,173.0


In [17]:
# Keeping only the required columns and renaming them

# Select only the columns that will be used; .copy() avoids pandas' SettingWithCopyWarning on the rename below
avaliacoes = avaliacoes[['userId','movieId','rating']].copy()

# Rename the columns
avaliacoes.rename(columns = {'userId':'ID_USUARIO','movieId':'ID_FILME','rating':'AVALIACAO'}, inplace = True)

In [19]:
# Show the first rows of the processed DataFrame

avaliacoes.head()

Unnamed: 0,ID_USUARIO,ID_FILME,AVALIACAO
0,1,110,1.0
1,1,147,4.5
2,1,858,5.0
3,1,1221,5.0
4,1,1246,5.0


In [21]:
# Checking for null values

filmes.isna().sum()

ID_FILME          0
TITULO            0
LINGUAGEM        11
QT_AVALIACOES     6
dtype: int64

In [25]:
# There are only a few null values, so dropping those rows should have negligible impact

filmes.dropna(inplace = True)

In [59]:
# Checking for null values in the filmes DataFrame

filmes.isna().sum()

ID_FILME         0
TITULO           0
LINGUAGEM        0
QT_AVALIACOES    0
dtype: int64

In [61]:
# Checking for null values in the avaliacoes DataFrame

avaliacoes.isna().sum()

ID_USUARIO    0
ID_FILME      0
AVALIACAO     0
dtype: int64

In [31]:
# Checking the number of ratings per user

avaliacoes['ID_USUARIO'].value_counts()

ID_USUARIO
45811     18276
8659       9279
270123     7638
179792     7515
228291     7410
          ...  
30155         1
9641          1
164717        1
243426        1
234625        1
Name: count, Length: 270896, dtype: int64

In [65]:
# groupby could be used as well

avaliacoes.groupby('ID_USUARIO').size().sort_values(ascending=False)

ID_USUARIO
45811     18276
8659       9279
270123     7638
179792     7515
228291     7410
          ...  
14354      1000
196384     1000
220764     1000
53075      1000
30733      1000
Length: 2509, dtype: int64

- For a simple count, value_counts() is more concise and readable.

- groupby().size() is the better fit when further per-group operations are needed afterwards.
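The equivalence of the two approaches can be checked on a small table. This is a minimal sketch with made-up data; only the column name `ID_USUARIO` comes from the notebook, and the variable names are illustrative.

```python
import pandas as pd

# Toy table (hypothetical data): user 1 rated 3 times, user 2 twice, user 3 once
toy = pd.DataFrame({"ID_USUARIO": [1, 1, 1, 2, 2, 3]})

# value_counts() counts and sorts in descending order in one call
by_value_counts = toy["ID_USUARIO"].value_counts()

# groupby().size() counts per group; sorting is a separate step
by_groupby = toy.groupby("ID_USUARIO").size().sort_values(ascending=False)

# Compare the counts user by user
same_counts = by_value_counts.sort_index().tolist() == by_groupby.sort_index().tolist()
print(same_counts)
```
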

In [70]:
# Get the ID_USUARIO only of users who made more than 999 ratings

qt_avaliacoes = avaliacoes['ID_USUARIO'].value_counts() > 999
y = qt_avaliacoes[qt_avaliacoes].index
y.shape

(2509,)

In [72]:
# Viewing the selected users

y

Index([ 45811,   8659, 270123, 179792, 228291, 243443,  98415, 229879,  98787,
       172224,
       ...
       257117, 105631,  76945, 214328, 182812,  14354, 196384, 220764,  53075,
        30733],
      dtype='int64', name='ID_USUARIO', length=2509)

In [74]:
# Checking the size of the ratings DataFrame

avaliacoes.shape

(3844582, 3)

In [39]:
# Keeping only the ratings from users who rated more than 999 times

avaliacoes = avaliacoes[avaliacoes['ID_USUARIO'].isin(y)]

In [76]:
# Checking the size of the ratings DataFrame

avaliacoes.shape

(3844582, 3)
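The effect of this kind of filter on the row count is easy to verify on a small table. The sketch below uses made-up ratings and a hypothetical threshold `min_ratings` in place of the notebook's 999.

```python
import pandas as pd

# Toy ratings (hypothetical data)
toy = pd.DataFrame({
    "ID_USUARIO": [1, 1, 1, 2, 2, 3],
    "AVALIACAO": [4.0, 3.5, 5.0, 2.0, 3.0, 1.0],
})
min_ratings = 2  # hypothetical threshold; the notebook uses 999

# Boolean Series indexed by user, True where the user has more than min_ratings ratings
counts = toy["ID_USUARIO"].value_counts()
active_users = counts[counts > min_ratings].index

# Keep only the rows that belong to the active users
filtered = toy[toy["ID_USUARIO"].isin(active_users)]
print(toy.shape, filtered.shape)
```

Comparing the shape before and after the filter is a quick sanity check that the filter was actually applied.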

In [82]:
filmes.shape

(1100, 4)

In [78]:
# Viewing the avaliacoes DataFrame

avaliacoes.head()

Unnamed: 0,ID_USUARIO,ID_FILME,AVALIACAO
17291,229,1,3.0
17292,229,2,3.0
17293,229,4,2.0
17294,229,5,1.0
17295,229,7,2.0


In [80]:
# Viewing the filmes DataFrame

filmes.head()

Unnamed: 0,ID_FILME,TITULO,LINGUAGEM,QT_AVALIACOES
0,862,Toy Story,en,5415.0
1,8844,Jumanji,en,2413.0
5,949,Heat,en,1886.0
9,710,GoldenEye,en,1194.0
15,524,Casino,en,1343.0


In [84]:
# Keep only the movies with more than 999 ratings

filmes = filmes[filmes['QT_AVALIACOES'] > 999]

In [88]:
# Count and view the number of movies per language

filmes_linguagem = filmes['LINGUAGEM'].value_counts()
filmes_linguagem.head(20)

LINGUAGEM
en    1100
Name: count, dtype: int64

In [90]:
# Select only the movies whose language is EN (English)

filmes = filmes[filmes['LINGUAGEM'] == 'en']

In [92]:
# View the data types of the columns

filmes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1100 entries, 0 to 44842
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID_FILME       1100 non-null   object 
 1   TITULO         1100 non-null   object 
 2   LINGUAGEM      1100 non-null   object 
 3   QT_AVALIACOES  1100 non-null   float64
dtypes: float64(1), object(3)
memory usage: 43.0+ KB


In [94]:
avaliacoes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3844582 entries, 17291 to 26023521
Data columns (total 3 columns):
 #   Column      Dtype  
---  ------      -----  
 0   ID_USUARIO  int64  
 1   ID_FILME    int64  
 2   AVALIACAO   float64
dtypes: float64(1), int64(2)
memory usage: 117.3 MB


In [96]:
# ID_FILME is stored as object (text) in filmes but as int64 in avaliacoes, so it must be converted to integer before merging

filmes = filmes.assign(ID_FILME = filmes['ID_FILME'].astype(int))
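`astype(int)` raises an error if any ID is not a valid number. A more defensive variant, sketched here on made-up values, is to coerce unparseable entries to NaN with `pd.to_numeric` and drop them before casting. Requesting `int64` explicitly also keeps the dtype the same across platforms, whereas plain `int` can resolve to `int32` on some systems.

```python
import pandas as pd

# Hypothetical raw IDs stored as text; the last entry is not a number
raw = pd.DataFrame({"ID_FILME": ["862", "8844", "1997-08-20"]})

# errors="coerce" turns unparseable entries into NaN instead of raising
raw["ID_FILME"] = pd.to_numeric(raw["ID_FILME"], errors="coerce")

# Drop the rows that could not be parsed, then cast to a fixed-width integer
clean = raw.dropna(subset=["ID_FILME"]).astype({"ID_FILME": "int64"})
print(clean["ID_FILME"].tolist())
```
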

In [98]:
# Checking the number of movies via the DataFrame shape

filmes.shape

(1100, 4)

In [100]:
filmes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1100 entries, 0 to 44842
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID_FILME       1100 non-null   int32  
 1   TITULO         1100 non-null   object 
 2   LINGUAGEM      1100 non-null   object 
 3   QT_AVALIACOES  1100 non-null   float64
dtypes: float64(1), int32(1), object(2)
memory usage: 38.7+ KB


In [102]:
# Merging the DataFrames on ID_FILME (inner join by default)

avaliacoes_e_filmes = avaliacoes.merge(filmes, on = 'ID_FILME')
avaliacoes_e_filmes.head()

Unnamed: 0,ID_USUARIO,ID_FILME,AVALIACAO,TITULO,LINGUAGEM,QT_AVALIACOES
0,229,12,1.0,Finding Nemo,en,6292.0
1,229,70,3.0,Million Dollar Baby,en,2519.0
2,229,77,3.0,Memento,en,4168.0
3,229,85,3.0,Raiders of the Lost Ark,en,3949.0
4,229,106,4.0,Predator,en,2129.0
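Before relying on the merge, it may be worth confirming that both ID_FILME columns follow the same ID scheme. An inner join on mismatched schemes still returns rows, but it pairs ratings with the wrong titles. The sketch below assumes a hypothetical mapping table `links` with illustrative columns `movieId` and `tmdbId`; whether such a file ships with this dataset is an assumption to verify. `validate="many_to_one"` makes pandas raise if a movie ID appears more than once on the right side.

```python
import pandas as pd

# Toy tables (hypothetical data and ID values)
ratings = pd.DataFrame({
    "ID_USUARIO": [1, 1, 2],
    "movieId": [1, 2, 1],          # IDs in the ratings file's own scheme
    "AVALIACAO": [4.0, 3.0, 5.0],
})
links = pd.DataFrame({"movieId": [1, 2], "tmdbId": [862, 8844]})  # hypothetical mapping
movies = pd.DataFrame({"ID_FILME": [862, 8844], "TITULO": ["Toy Story", "Jumanji"]})

# Translate the rating IDs into the scheme used by the movies table
mapped = ratings.merge(links, on="movieId").rename(columns={"tmdbId": "ID_FILME"})

# Join ratings to titles; validate guards against duplicated movie IDs
merged = mapped.merge(movies, on="ID_FILME", how="inner", validate="many_to_one")
print(merged[["ID_USUARIO", "TITULO", "AVALIACAO"]])
```
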


In [104]:
# Checking the number of rating rows left after the merge

avaliacoes_e_filmes.shape

(189882, 6)

In [106]:
# Checking for null values

avaliacoes_e_filmes.isna().sum()

ID_USUARIO       0
ID_FILME         0
AVALIACAO        0
TITULO           0
LINGUAGEM        0
QT_AVALIACOES    0
dtype: int64

In [110]:
# View the first 20 rows of the DataFrame

avaliacoes_e_filmes.head(20)

Unnamed: 0,ID_USUARIO,ID_FILME,AVALIACAO,TITULO,LINGUAGEM,QT_AVALIACOES
0,229,12,1.0,Finding Nemo,en,6292.0
1,229,70,3.0,Million Dollar Baby,en,2519.0
2,229,77,3.0,Memento,en,4168.0
3,229,85,3.0,Raiders of the Lost Ark,en,3949.0
4,229,106,4.0,Predator,en,2129.0
5,229,141,1.0,Donnie Darko,en,3574.0
6,229,153,2.0,Lost in Translation,en,1943.0
7,229,161,3.0,Ocean's Eleven,en,3857.0
8,229,170,3.0,28 Days Later,en,1816.0
9,229,176,4.0,Saw,en,2255.0


In [112]:
# Drop duplicated rows so that the same user does not appear rating the same movie
# more than once

avaliacoes_e_filmes.drop_duplicates(['ID_USUARIO','ID_FILME'], inplace = True)

In [116]:
# Checking whether the number of rows changed

avaliacoes_e_filmes.shape

(189882, 6)
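Counting the duplicated (user, movie) pairs before dropping them shows how many rows the step would remove. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy ratings (hypothetical data): user 1 rated movie 10 twice
toy = pd.DataFrame({
    "ID_USUARIO": [1, 1, 2],
    "ID_FILME": [10, 10, 10],
    "AVALIACAO": [3.0, 4.5, 2.0],
})

# Number of rows that repeat an earlier (user, movie) pair
n_dupes = int(toy.duplicated(["ID_USUARIO", "ID_FILME"]).sum())

# drop_duplicates keeps the first occurrence of each pair by default
deduped = toy.drop_duplicates(["ID_USUARIO", "ID_FILME"])
print(n_dupes, deduped.shape)
```

A count of zero means the shape stays the same after `drop_duplicates`, which matches an unchanged row count.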

In [118]:
# Drop the ID_FILME column because it will not be used from here on

del avaliacoes_e_filmes['ID_FILME']

In [122]:
# DataFrame without the ID_FILME column

avaliacoes_e_filmes.head()

Unnamed: 0,ID_USUARIO,AVALIACAO,TITULO,LINGUAGEM,QT_AVALIACOES
0,229,1.0,Finding Nemo,en,6292.0
1,229,3.0,Million Dollar Baby,en,2519.0
2,229,3.0,Memento,en,4168.0
3,229,3.0,Raiders of the Lost Ark,en,3949.0
4,229,4.0,Predator,en,2129.0


In [124]:
# Now we need a PIVOT: one row per movie title and one column per ID_USUARIO,
# with each cell holding that user's rating for that movie

filmes_pivot = avaliacoes_e_filmes.pivot_table(columns = 'ID_USUARIO', index = 'TITULO', values = 'AVALIACAO')

In [126]:
# Inspect the pivoted DataFrame

filmes_pivot.head(20)

ID_USUARIO,229,231,741,836,1104,1136,1243,1380,1652,1846,...,269632,269750,269913,270071,270123,270213,270237,270564,270654,270887
TITULO,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10 Things I Hate About You,,,,,,,,,,,...,,2.5,,3.0,3.0,,,,,
12 Angry Men,,,,,,,,,,,...,,,,,,,,,3.5,
127 Hours,,,,,,,,,,,...,,,,,,,,,,
1408,,,,,,,,,,,...,,,,,2.5,2.0,,,,
2 Fast 2 Furious,,,,,,,,,,,...,,,,,,,,,,
2001: A Space Odyssey,,,3.5,2.5,,,,,,,...,,,,3.0,4.0,,,,,
27 Dresses,,,,,,,,,,,...,,,,,,,,,,
28 Days Later,3.0,,2.0,3.0,,,,,4.0,,...,,,,,2.5,3.0,,,3.0,
28 Weeks Later,,,2.0,,,,,,4.0,1.0,...,,0.5,3.0,1.0,2.0,3.0,,2.5,,
300,,3.0,,,,,,,,,...,,,,,0.5,4.0,3.0,3.0,1.5,
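The reshaping done by `pivot_table` is easier to follow on a tiny table. This sketch uses made-up titles and ratings with the notebook's column names. Cells for (title, user) pairs with no rating come out as NaN, and repeated pairs would be averaged, since `pivot_table` aggregates with the mean by default.

```python
import pandas as pd

# Toy long-format ratings (hypothetical data)
toy = pd.DataFrame({
    "ID_USUARIO": [1, 1, 2, 3],
    "TITULO": ["A", "B", "A", "B"],
    "AVALIACAO": [4.0, 3.0, 5.0, 2.0],
})

# One row per title, one column per user
pivot = toy.pivot_table(index="TITULO", columns="ID_USUARIO", values="AVALIACAO")
print(pivot)
```
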


In [130]:
# Fill the null values with zero

filmes_pivot.fillna(0, inplace = True)
filmes_pivot.head()

ID_USUARIO,229,231,741,836,1104,1136,1243,1380,1652,1846,...,269632,269750,269913,270071,270123,270213,270237,270564,270654,270887
TITULO,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10 Things I Hate About You,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.5,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0
127 Hours,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.5,2.0,0.0,0.0,0.0,0.0
2 Fast 2 Furious,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [133]:
# Import csr_matrix from the SciPy package
# It lets us build a sparse matrix

from scipy.sparse import csr_matrix


# Convert the pivoted DataFrame into a sparse matrix
filmes_sparse = csr_matrix(filmes_pivot)

- When a matrix is mostly zeros it can be stored compactly: the CSR (compressed sparse row) format keeps only the non-zero values together with their positions, so the zeros themselves take up no space.
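To make the storage layout concrete, the sketch below converts a small made-up matrix and prints the three arrays a CSR matrix keeps internally.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small dense matrix (hypothetical values), mostly zeros
dense = np.array([
    [0.0, 0.0, 3.5],
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 4.0],
])
sparse = csr_matrix(dense)

print(sparse.data)     # the non-zero values, row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # offsets into data marking where each row starts and ends
print(sparse.nnz)      # number of stored values
```

Only 3 of the 9 entries are stored here. On a ratings matrix, where most users have not rated most movies, the saving is much larger.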

In [135]:
# Type of the object

type(filmes_sparse)

scipy.sparse._csr.csr_matrix

In [137]:
# Import the nearest-neighbors (KNN) algorithm from scikit-learn

from sklearn.neighbors import NearestNeighbors 

In [139]:
# Creating and fitting the nearest-neighbors model

modelo = NearestNeighbors(algorithm = 'brute')
modelo.fit(filmes_sparse)
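With only `algorithm = 'brute'` set, `NearestNeighbors` uses its defaults: 5 neighbors and the Minkowski metric with p=2, that is, Euclidean distance. Cosine distance is a common alternative for rating vectors because it compares the direction of the vectors rather than their magnitude. The sketch below shows that variant on made-up data; it illustrates the option and is not the configuration used in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Rows are items, columns are users (hypothetical ratings, 0 = not rated)
ratings = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [1.0, 0.0, 0.0, 0.0],
]))

# Brute-force search with cosine distance, returning 2 neighbors per query
model = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
model.fit(ratings)

# Query with the first item; the result includes the item itself
distances, indices = model.kneighbors(ratings[:1])
print(indices[0], distances[0])
```
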

## Generating movie suggestions

In [144]:
# 127 Hours
# kneighbors returns the distances to the nearest neighbors and their row indices
# (5 neighbors by default; the first one is normally the queried movie itself)

distances, sugestions = modelo.kneighbors(filmes_pivot.filter(items = ['127 Hours'], axis=0).values.reshape(1, -1))

for i in range(len(sugestions)):
    print(filmes_pivot.index[sugestions[i]])  

Index(['127 Hours', 'American Hustle', 'The Expendables 2', 'Lord of War',
       'RED 2'],
      dtype='object', name='TITULO')


In [146]:
# Toy Story

distances, sugestions = modelo.kneighbors(filmes_pivot.filter(items = ['Toy Story'], axis=0).values.reshape(1, -1))

for i in range(len(sugestions)):
    print(filmes_pivot.index[sugestions[i]])  

Index(['Toy Story', 'Meet the Fockers', 'Top Gun',
       'Harry Potter and the Chamber of Secrets',
       'Austin Powers: International Man of Mystery'],
      dtype='object', name='TITULO')


In [148]:
# 1408

distances, sugestions = modelo.kneighbors(filmes_pivot.filter(items = ['1408'], axis=0).values.reshape(1, -1))

for i in range(len(sugestions)):
    print(filmes_pivot.index[sugestions[i]]) 

Index(['1408', 'Pirates of the Caribbean: At World's End', 'Platoon', 'Snitch',
       'The Expendables 2'],
      dtype='object', name='TITULO')


In [150]:
# 2 Fast 2 Furious

distances, sugestions = modelo.kneighbors(filmes_pivot.filter(items = ['2 Fast 2 Furious'], axis=0).values.reshape(1, -1))

for i in range(len(sugestions)):
    print(filmes_pivot.index[sugestions[i]]) 

Index(['2 Fast 2 Furious', 'Bambi', 'The Matrix Reloaded',
       'Brokeback Mountain', 'Lord of War'],
      dtype='object', name='TITULO')
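The lookup repeated in the cells above could be wrapped in a small helper. The sketch below is one possible shape for it: `recommend` is a hypothetical function name, and the toy pivot table and model stand in for `filmes_pivot` and `modelo`. It asks for one extra neighbor and filters out the queried title, so the movie is not recommended to itself.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recommend(title, pivot, model, n=3):
    """Return the n titles closest to `title`, excluding the title itself."""
    if title not in pivot.index:
        raise KeyError(f"Unknown title: {title}")
    # Select the row as a 2D array of shape (1, n_users)
    query = pivot.loc[[title]].to_numpy()
    # Ask for one extra neighbor because the queried title is usually returned too
    _, idx = model.kneighbors(query, n_neighbors=n + 1)
    neighbours = [pivot.index[i] for i in idx[0]]
    return [t for t in neighbours if t != title][:n]

# Toy pivot table (hypothetical titles and ratings): rows are titles, columns are users
toy_pivot = pd.DataFrame(
    [[5.0, 4.0, 0.0], [4.5, 4.0, 0.0], [0.0, 0.0, 5.0]],
    index=["Film A", "Film B", "Film C"],
    columns=[1, 2, 3],
)
toy_model = NearestNeighbors(algorithm="brute").fit(csr_matrix(toy_pivot.to_numpy()))

print(recommend("Film A", toy_pivot, toy_model, n=1))
```
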
