# Recommendation system for IMDB movies

We started by using the IMDB API to fetch a base of 10,000 movies. From this base we will build a content-based recommendation system.
For this study we will first vectorize the movies and then compare them using cosine similarity. The aim of this notebook is to explore in more depth the concepts used to build machine learning models, especially recommendation systems.

There are three types of recommendation systems:

* Content-based filtering: This kind of system uses the items the user liked in the past to recommend new ones. For example, if the user liked comedy movies, new movies of that genre will be recommended. The filter can be applied to both the item and the user.

* Collaborative filtering: Looks at similarities between items and users at the same time. It takes into account how similar the users' tastes are and which items they liked. For example, if user A watches movies similar to those watched by user B, then when user A tries and likes a new movie, the system will recommend it to user B.

* Hybrid: A combination of the two methods above.

Our model will use a system of the first type.
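
Because the comparison step relies on cosine similarity, a minimal sketch of the measure may help. For two vectors it is the dot product divided by the product of their norms, so it depends on direction rather than length. The helper name `cosine_sim` and the vectors below are made up for illustration and are not part of the movie base.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (illustrative values only)
movie_a = [1.0, 2.0, 0.0]
movie_b = [2.0, 4.0, 0.0]   # same direction as movie_a, twice the length
movie_c = [0.0, 0.0, 3.0]   # shares no non-zero component with movie_a

print(cosine_sim(movie_a, movie_b))  # close to 1: same direction
print(cosine_sim(movie_a, movie_c))  # 0.0: orthogonal vectors
```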

## Importing libraries

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

## Loading and inspecting the data

In [3]:
def load_csv_data(path):
    df = pd.read_csv(path)
    return df

df_filmes = load_csv_data('dados/filmes.csv')
df_filmes.head()

|   | id | id_genero | titulo | resumo | lancamento | idioma_original |
|---|----|-----------|--------|--------|------------|-----------------|
| 0 | 238 | [18, 80] | The Godfather | Spanning the years 1945 to 1955, a chronicle o... | 1972-03-14 | en |
| 1 | 278 | [18, 80] | The Shawshank Redemption | Framed in the 1940s for the double murder of h... | 1994-09-23 | en |
| 2 | 569094 | [28, 12, 16, 878] | Spider-Man: Across the Spider-Verse | After reuniting with Gwen Stacy, Brooklyn’s fu... | 2023-05-31 | en |
| 3 | 240 | [18, 80] | The Godfather Part II | In the continuing saga of the Corleone crime f... | 1974-12-20 | en |
| 4 | 424 | [18, 36, 10752] | Schindler's List | The true story of how businessman Oskar Schind... | 1993-12-15 | en |


In [4]:
df_filmes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               10000 non-null  int64 
 1   id_genero        10000 non-null  object
 2   titulo           10000 non-null  object
 3   resumo           9998 non-null   object
 4   lancamento       10000 non-null  object
 5   idioma_original  10000 non-null  object
dtypes: int64(1), object(5)
memory usage: 468.9+ KB


Since only two rows have an empty summary (`resumo`), I chose to drop them; removing so few rows will not affect the analysis.

In [5]:
df_filmes = df_filmes.dropna(axis=0, how='any')
len(df_filmes)

9998
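
According to the `info()` output, `resumo` is the only column with missing values here, so `how='any'` over all columns removes exactly those two rows. As a sketch of a more explicit alternative, `subset=['resumo']` restricts the check to the summary column, which would matter if other columns had gaps. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy frame (made-up rows) with one missing summary and one missing language
toy = pd.DataFrame({
    "titulo": ["A", "B", "C"],
    "resumo": ["first summary", None, "third summary"],
    "idioma_original": ["en", "en", None],
})

# how='any' over all columns drops both B and C
dropped_any = toy.dropna(axis=0, how="any")

# subset=['resumo'] only drops rows whose summary is missing (B)
dropped_subset = toy.dropna(subset=["resumo"])

print(len(dropped_any), len(dropped_subset))
```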

Creating the `tags` column by concatenating the variables

In [6]:
df_filmes['tags'] = df_filmes['resumo'] + ' ' + df_filmes['id_genero'] + ' ' + df_filmes['idioma_original'] + ' ' + df_filmes['lancamento']
df_filmes['tags']

0       Spanning the years 1945 to 1955, a chronicle o...
1       Framed in the 1940s for the double murder of h...
2       After reuniting with Gwen Stacy, Brooklyn’s fu...
3       In the continuing saga of the Corleone crime f...
4       The true story of how businessman Oskar Schind...
                              ...                        
9995    In the aftermath of a nuclear disaster, a star...
9996    Vitoria-Gasteiz, Basque Country, Spain, 2019. ...
9997    400 years into the future, disease has wiped o...
9998    A mild-mannered guy who is engaged to a monstr...
9999    At the end of WWII, an ambitious bootlegger an...
Name: tags, Length: 9998, dtype: object
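
One caveat about this concatenation: the default `TfidfVectorizer` tokenizer keeps runs of two or more word characters, so `[18, 80]` becomes the tokens `18` and `80`, and `1972-03-14` becomes `1972`, `03` and `14`. Genre ids, date parts and numbers in the summaries therefore share one vocabulary. The sketch below shows one possible way to keep them apart with prefixed tokens. The helpers `genre_tokens` and `year_token` and the `genre_`, `lang_` and `year_` prefixes are hypothetical and not part of the original notebook, and the summaries are shortened placeholders.

```python
import ast
import pandas as pd

# Genre ids, languages and dates copied from the head() output; summaries are placeholders
toy = pd.DataFrame({
    "resumo": ["a crime family saga", "a banker is framed for murder"],
    "id_genero": ["[18, 80]", "[18, 80]"],
    "idioma_original": ["en", "en"],
    "lancamento": ["1972-03-14", "1994-09-23"],
})

def genre_tokens(raw):
    """Turn the string '[18, 80]' into 'genre_18 genre_80'."""
    return " ".join(f"genre_{g}" for g in ast.literal_eval(raw))

def year_token(date_str):
    """Keep only the year of an ISO date, e.g. '1972-03-14' -> 'year_1972'."""
    return "year_" + date_str[:4]

toy["tags"] = (
    toy["resumo"] + " "
    + toy["id_genero"].map(genre_tokens) + " "
    + "lang_" + toy["idioma_original"] + " "
    + toy["lancamento"].map(year_token)
)
print(toy["tags"].tolist())
```

Because the underscore counts as a word character, each prefixed value such as `genre_18` stays a single token under the default tokenizer.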

In [7]:
df_filmes_process = df_filmes[['id', 'titulo', 'tags']]
df_filmes_process = df_filmes_process.drop_duplicates()
df_filmes_process.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9998 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      9998 non-null   int64 
 1   titulo  9998 non-null   object
 2   tags    9998 non-null   object
dtypes: int64(1), object(2)
memory usage: 312.4+ KB


## Vectorizing the data

In [8]:
tf = TfidfVectorizer(max_features=10000)
dados_vetorizados = tf.fit_transform(df_filmes_process['tags'].values)
tf.get_feature_names_out()

array(['000', '007', '01', ..., 'zoo', 'zoos', 'état'], dtype=object)

In [9]:
df_vetorizado =  pd.DataFrame(dados_vetorizados.toarray(), index=df_filmes_process['tags'].index.tolist())
df_vetorizado.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
0,0.0,0.0,0.0,0.0,0.080317,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.069141,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
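
Almost every entry in this table is zero, and a dense 9998 × 10000 array of `float64` takes roughly 0.8 GB. As a sketch of an alternative, the sparse matrix returned by `fit_transform` can be passed to `cosine_similarity` directly, without `toarray()`. The toy corpus below is made up; the point is only that both inputs give the same similarities.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus standing in for the `tags` column
corpus = [
    "crime family saga in new york",
    "banker framed for murder goes to prison",
    "crime boss and his family fight a rival",
]

tfidf = TfidfVectorizer().fit_transform(corpus)   # SciPy sparse matrix

sim_sparse = cosine_similarity(tfidf)             # sparse input
sim_dense = cosine_similarity(tfidf.toarray())    # dense input, as in the cell above

print(type(tfidf).__name__, tfidf.shape)
print(np.allclose(sim_sparse, sim_dense))
```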


## Recommendation system

In [12]:
similarity = cosine_similarity(df_vetorizado)
similarity

array([[1.        , 0.06849751, 0.02936376, ..., 0.04936903, 0.04225582,
        0.04138453],
       [0.06849751, 1.        , 0.05707625, ..., 0.0401657 , 0.06139804,
        0.08307346],
       [0.02936376, 0.05707625, 1.        , ..., 0.05375855, 0.03362548,
        0.01953145],
       ...,
       [0.04936903, 0.0401657 , 0.05375855, ..., 1.        , 0.02846571,
        0.02434483],
       [0.04225582, 0.06139804, 0.03362548, ..., 0.02846571, 1.        ,
        0.03308925],
       [0.04138453, 0.08307346, 0.01953145, ..., 0.02434483, 0.03308925,
        1.        ]])
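
`TruncatedSVD` is imported at the top of the notebook but does not appear in the cells shown here. As a hedged sketch of how it could fit into this pipeline, the TF-IDF matrix can be projected onto a few components before the similarity is computed. The corpus is made up, and `n_components=2` is an arbitrary toy value, not a setting taken from the notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus standing in for the `tags` column
corpus = [
    "crime family saga in new york",
    "banker framed for murder goes to prison",
    "crime boss and his family fight a rival",
    "teenager becomes a superhero in new york",
]

tfidf = TfidfVectorizer().fit_transform(corpus)

# Project the sparse TF-IDF matrix onto 2 components (toy value)
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

# Similarity is now computed on the low-dimensional vectors
sim_reduced = cosine_similarity(reduced)
print(reduced.shape, sim_reduced.shape)
```

The reduced similarities will generally differ from the full TF-IDF ones, so the effect on the recommendations would need to be evaluated on the real data.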

In [20]:
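def recomendation_system(movie):
    # Index label of the first row whose title matches
    label = df_filmes_process[df_filmes_process['titulo'] == movie].index[0]
    # `similarity` is indexed by row position, and the labels have gaps after dropna,
    # so convert the label to a position first
    pos = df_filmes_process.index.get_loc(label)
    distances = similarity[pos]
    # Sort by similarity, skip the movie itself and keep the next 14
    movie_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:15]
    for movie_pos, _ in movie_list:
        print(df_filmes_process.iloc[movie_pos].titulo)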

In [23]:
recomendation_system('The Godfather')

The Godfather Part II
The Godfather Part III
The Replacement Killers
Shoplifters
Blood Ties
Kind Hearts and Coronets
Joe
Sansho the Bailiff
SPL: Kill Zone
The Best of Youth
Loose Cannons
The Color Purple
3 Ninjas
The Beastmaster
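
The function above prints its results, and `.index[0]` raises an `IndexError` when the title is not in the base. As a minimal sketch of a variant that returns a list and fails with a clearer message, the hypothetical `recommend` helper below takes the DataFrame and the similarity matrix as arguments. It looks up the row position of the title rather than its index label, since the similarity matrix is ordered by position. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def recommend(title, movies, similarity, top_n=5):
    """Return up to `top_n` titles most similar to `title`.

    `movies` needs a 'titulo' column whose row order matches `similarity`.
    """
    positions = np.flatnonzero(movies["titulo"].to_numpy() == title)
    if positions.size == 0:
        raise KeyError(f"Title not found: {title!r}")
    pos = positions[0]
    order = np.argsort(similarity[pos])[::-1]   # most similar first
    order = order[order != pos][:top_n]         # drop the movie itself
    return movies["titulo"].iloc[order].tolist()

# Toy data with a gap in the index (label 1 is missing), as happens after dropna
toy_movies = pd.DataFrame({"titulo": ["A", "B", "C"]}, index=[0, 2, 3])
toy_similarity = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

print(recommend("B", toy_movies, toy_similarity, top_n=2))  # ['C', 'A']
```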
