### **Movie recommendation system**
#### Datasets used

- Netflix - https://www.kaggle.com/shivamb/netflix-shows
- Prime Video - https://www.kaggle.com/shivamb/amazon-prime-movies-and-tv-shows
- Disney Plus - https://www.kaggle.com/shivamb/disney-movies-and-tv-shows
- IMDB - https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset/

In [None]:
#!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
#!pip install --upgrade pandas

In [None]:
import numpy as np
import pandas as pd 
from pandas_profiling import ProfileReport

In [None]:

df_imdb_movies=pd.read_csv('/content/imdb_titles.csv',
                           usecols=['imdb_title_id','title','original_title',
                           'year','genre','country','director','actors','description','avg_vote','votes'],sep=',',low_memory=False)

df_netflix_movies=pd.read_csv('/content/netflix_titles.csv',low_memory=False,encoding='utf-8')
df_disney_plus_movies=pd.read_csv('/content/disney_plus_titles.csv',low_memory=False,encoding='utf-8')

df_prime_video_movies=pd.read_csv('/content/amazon_prime_titles.csv', low_memory=False,encoding='utf-8')

In [None]:
#df_imdb_movies['year'].unique()

In [None]:
# Drop the malformed row whose `year` holds 'TV Movie 2019' instead of a plain release year, then cast the column to int
df_imdb_movies = df_imdb_movies.drop(df_imdb_movies.query("year=='TV Movie 2019'").index)
df_imdb_movies['year'] = df_imdb_movies['year'].astype(np.int32)
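
As an alternative to dropping the malformed row, the year could be extracted from the string so the title is kept. The snippet below is a minimal sketch on made-up values (the `years` series is hypothetical, not the real IMDB column), assuming the release year is always the first four-digit run in the cell.

```python
import pandas as pd

# Toy values standing in for the IMDB `year` column (hypothetical data).
years = pd.Series(["1940", "2019", "TV Movie 2019", None])

# Pull out the first 4-digit run; cells without one become NaN.
extracted = years.str.extract(r"(\d{4})", expand=False)

# Convert to numbers; anything unparseable is coerced to NaN instead of raising.
parsed = pd.to_numeric(extracted, errors="coerce")

# The nullable integer dtype keeps missing values without falling back to float.
parsed = parsed.astype("Int64")
print(parsed.tolist())
```

Whether to keep or drop such rows depends on how much the affected titles matter to the recommendation step; the notebook's choice of dropping a single row is also reasonable.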


In [None]:
# Keep only IMDB titles released from 1940 onward, since data for earlier years is sparse
# Keep only movies with at least 50 votes so that the ratings are more informative

df_imdb_movies = df_imdb_movies[df_imdb_movies['year'] >= 1940]
df_imdb_movies = df_imdb_movies[df_imdb_movies['votes'] >= 50]

# Drop rows with missing values in the "description" column
df_imdb_movies = df_imdb_movies.dropna(subset=["description"], axis=0)


In [None]:
df_imdb_movies.head()

Unnamed: 0,imdb_title_id,title,original_title,year,genre,country,director,actors,description,avg_vote,votes
560,tt0017938,La glace à trois faces,La glace à trois faces,1983,"Drama, Romance",France,Jean Epstein,"Jeanne Helbling, Suzy Pierson, Olga Day, Raymo...",Psychological narrative avantgarde film about ...,7.0,759
2760,tt0029284,Le mie due mogli,My Favorite Wife,1940,"Comedy, Romance",USA,Garson Kanin,"Irene Dunne, Cary Grant, Randolph Scott, Gail ...","Missing for seven years and presumed dead, a w...",7.4,9008
3189,tt0031077,Band Waggon,Band Waggon,1940,"Comedy, Musical",UK,Marcel Varnel,"Arthur Askey, Jack Hylton and His Band, Richar...",A plot involving spies in a haunted castle giv...,5.5,132
3191,tt0031084,Piccola ladra,Battement de coeur,1940,"Comedy, Drama",France,Henri Decoin,"Danielle Darrieux, Claude Dauphin, André Lugue...",Danielle Darrieux plays an impoverished reform...,6.8,223
3222,tt0031192,Crook's Tour,Crook's Tour,1940,"Comedy, Mystery",UK,John Baxter,"Basil Radford, Naunton Wayne, Greta Gynt, Char...","Charters and Caldicott, touring in the Near Ea...",5.8,284


In [None]:
df_netflix_movies.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [None]:
df_disney_plus_movies.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",,"November 26, 2021",2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",,"November 26, 2021",1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,"November 26, 2021",2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",,"November 26, 2021",2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!"
4,s5,TV Show,The Beatles: Get Back,,"John Lennon, Paul McCartney, George Harrison, ...",,"November 25, 2021",2021,,1 Season,"Docuseries, Historical, Music",A three-part documentary from Peter Jackson ca...


In [None]:
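# In the Disney Plus dataset, Marvel titles carry the prefix "Marvel Studios'". The prefix is removed because
# the other datasets list these titles without it.
# An anchored regex is used because str.lstrip strips any leading characters from the given set, not the literal prefix.
df_disney_plus_movies['title'] = df_disney_plus_movies['title'].str.replace(r"^Marvel Studios'\s*", "", regex=True)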

In [None]:
# Confirm that the "Marvel Studios'" prefix was removed (the query should return no rows)
#df_disney_plus_movies[df_disney_plus_movies['title'].str.startswith("Marvel Studios'")]
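
For reference, pandas' `str.lstrip` treats its argument as a set of characters to strip, not as a literal prefix, so it can also eat the start of unrelated titles. The sketch below compares it with an anchored regex on a few made-up titles (the `titles` series is illustrative only).

```python
import pandas as pd

# Hypothetical sample; only the first two titles carry the prefix.
titles = pd.Series(["Marvel Studios' Thor", "Marvel Studios' Iron Man", "Moana"])

# Strips every leading character found in the set {M, a, r, v, e, l, ' ', S, t, u, d, i, o, s, '}.
char_stripped = titles.str.lstrip("Marvel Studios'")

# Removes only the literal prefix (plus trailing whitespace) at the start of the string.
prefix_removed = titles.str.replace(r"^Marvel Studios'\s*", "", regex=True)

print(char_stripped.tolist())   # ['Thor', 'Iron Man', 'na']
print(prefix_removed.tolist())  # ['Thor', 'Iron Man', 'Moana']
```

"Moana" loses its first three letters under `lstrip` because `M`, `o` and `a` all belong to the character set, which would later prevent the title from matching its IMDB counterpart.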


In [None]:
df_prime_video_movies.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...


### Pandas Profiling

The `ProfileReport` class from pandas-profiling gives a high-level overview of the data, which simplifies tasks such as cleaning the dataset. Notably, it reports the number of missing values for each feature.
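
The per-feature missing-value counts can also be checked quickly with plain pandas, without generating a full report. This is a minimal sketch on a toy frame (the `df` contents and the `n_missing` / `pct_missing` column names are illustrative assumptions).

```python
import pandas as pd

# Toy frame with a few gaps (hypothetical data).
df = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "director": ["x", None, None, "y"],
    "country": [None, "US", "BR", "UK"],
})

# Count missing cells per column and express them as a percentage of rows.
missing = df.isna().sum().to_frame("n_missing")
missing["pct_missing"] = (100 * missing["n_missing"] / len(df)).round(1)

# Show the most incomplete columns first.
missing = missing.sort_values("n_missing", ascending=False)
print(missing)
```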

In [None]:
movies_imdb_profile = ProfileReport(df_imdb_movies)
movies_imdb_profile.to_notebook_iframe()

Summarize dataset: 100%|██████████| 43/43 [00:36<00:00,  1.17it/s, Completed]                      
Generate report structure: 100%|██████████| 1/1 [00:07<00:00,  7.74s/it]
Render HTML: 100%|██████████| 1/1 [00:01<00:00,  1.96s/it]


In [None]:
movies_netflix_profile = ProfileReport(df_netflix_movies)
movies_netflix_profile.to_notebook_iframe()

Summarize dataset: 100%|██████████| 28/28 [00:03<00:00,  8.49it/s, Completed]                         
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.67s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.34it/s]


In [None]:
movies_disney_profile = ProfileReport(df_disney_plus_movies)
movies_disney_profile.to_notebook_iframe()

Summarize dataset: 100%|██████████| 28/28 [00:01<00:00, 18.32it/s, Completed]                         
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.67s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.25it/s]


In [None]:

movies_amazon_profile = ProfileReport(df_prime_video_movies)
movies_amazon_profile.to_notebook_iframe()

Summarize dataset: 100%|██████████| 28/28 [00:04<00:00,  5.83it/s, Completed]                         
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.93s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.19it/s]


In [None]:
# Concatenate the Netflix, Prime Video and Disney Plus dataframes
streaming = pd.concat([df_netflix_movies,df_prime_video_movies,df_disney_plus_movies],join='outer')
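
Note that `pd.concat` keeps each frame's own index by default, so row labels repeat across platforms, and the origin of each row is lost. One possible variant, sketched below on toy frames, tags each row with a hypothetical `platform` column and rebuilds the index with `ignore_index=True`.

```python
import pandas as pd

# Toy frames standing in for the platform catalogs (hypothetical data).
netflix = pd.DataFrame({"title": ["A", "B"], "release_year": [2020, 2021]})
prime = pd.DataFrame({"title": ["C"], "release_year": [2014]})

frames = {"netflix": netflix, "prime_video": prime}

# Tag every row with its source, then stack with a fresh 0..n-1 index.
streaming_demo = pd.concat(
    [frame.assign(platform=name) for name, frame in frames.items()],
    ignore_index=True,
)
print(streaming_demo)
```

Keeping the source also makes it possible to tell the user where a recommended title can be watched.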


In [None]:
# Inner join between the streaming dataframe and the IMDB dataset, keeping only titles present in both (same title and release year)
merged_titles = df_imdb_movies.merge(streaming,left_on=['title','year'],right_on=['title','release_year'],how='inner')
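
Because the join compares titles as exact strings, differences in case or surrounding whitespace silently drop a match. The sketch below, on toy data, normalizes the titles into a hypothetical `title_key` column and uses a left merge with `indicator=True` to see how many IMDB rows found a counterpart. It is an illustration of the idea, not the notebook's actual merge.

```python
import pandas as pd

# Toy data (hypothetical); the first streaming title differs in case and spacing.
imdb = pd.DataFrame({"title": ["The Avengers", "Moana", "Up"],
                     "year": [2012, 2016, 2009]})
stream = pd.DataFrame({"title": ["the avengers ", "Moana", "Coco"],
                       "release_year": [2012, 2016, 2017]})

# Normalized join key: trimmed and lowercased title.
imdb["title_key"] = imdb["title"].str.strip().str.lower()
stream["title_key"] = stream["title"].str.strip().str.lower()

# The left merge keeps every IMDB row; `_merge` says whether it matched.
check = imdb.merge(
    stream,
    left_on=["title_key", "year"],
    right_on=["title_key", "release_year"],
    how="left",
    indicator=True,
    suffixes=("_imdb", "_stream"),
)
matched = check[check["_merge"] == "both"]
print(check["_merge"].value_counts())
```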

### Recommendation

The libraries below are imported to build the recommendation step.

- `CountVectorizer` class: counts the absolute occurrences of each term, so very frequent words stand out.
- `TfidfVectorizer` class: TF-IDF (term frequency-inverse document frequency) weights each word by its frequency within a document and down-weights words that appear in many documents.

- Linear kernel (`linear_kernel`): a plain dot product. It equals the cosine similarity when the vectors are already L2-normalized (the `TfidfVectorizer` default) and is cheaper to compute in that case.
- Cosine similarity (`cosine_similarity`): normalizes the vectors before taking the dot product, so it also works on raw counts.
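
The relationship between the two similarity functions can be checked on a toy corpus. This is a minimal sketch (the `docs` list is made up): on TF-IDF vectors, which scikit-learn L2-normalizes by default, both functions agree, while on raw counts they do not.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus (hypothetical descriptions).
docs = [
    "space heroes save the world",
    "heroes fight in space",
    "a quiet romance in paris",
]

counts = CountVectorizer().fit_transform(docs)  # raw term counts
tfidf = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalized by default

# On normalized rows the dot product already is the cosine similarity.
cos_tfidf = cosine_similarity(tfidf)
lin_tfidf = linear_kernel(tfidf)
print(np.allclose(cos_tfidf, lin_tfidf))

# On raw counts the dot product is not normalized, so the results differ.
cos_counts = cosine_similarity(counts)
lin_counts = linear_kernel(counts)
print(np.allclose(cos_counts, lin_counts))
```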

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity



### Bag of words

- Select the features of interest
- Normalize the text to lowercase, so that comparisons such as the title lookup do not depend on case (`CountVectorizer` already lowercases by default)
- Remove spaces, so that the parts of a compound name are not matched separately in unrelated contexts
- Concatenate all features into a single column
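
The steps listed above can be sketched on a single toy row. The `squash_names` helper and the sample values are hypothetical; the point is that collapsing "Chris Evans" into `chrisevans` stops the vectorizer from matching every other "Chris" in the corpus.

```python
import pandas as pd

# One made-up row with the kind of columns used in the bag of words.
df = pd.DataFrame({
    "title": ["The Avengers"],
    "genre": ["Action, Sci-Fi"],
    "director": ["Joss Whedon"],
    "actors": ["Robert Downey Jr., Chris Evans"],
})

def squash_names(cell):
    """Lowercase each comma-separated entry and remove its inner spaces."""
    return " ".join(part.replace(" ", "").lower() for part in str(cell).split(","))

# Concatenate the normalized features into one text column.
bow = (
    df["title"].str.lower()
    + " " + df["genre"].map(squash_names)
    + " " + df["director"].map(squash_names)
    + " " + df["actors"].map(squash_names)
)
print(bow.iloc[0])
```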

In [None]:
#merged_titles.columns
# List the features of interest that make each title's record as descriptive as possible
features = ['title','genre','director_x','description_x','description_y','actors','listed_in','rating','year']
def clean(row):
  return str.lower(row)

In [None]:
merged_titles = merged_titles[features].copy()

In [None]:
# Convert all content to lowercase
for i in features:
  merged_titles[i]= merged_titles[i].astype(str).apply(clean)


In [None]:
def concat(col):
  return  col['title']+' '+col['genre']+' '+col['director_x']+' '+col['description_x']+' '+col['description_y']+' '+col['actors']+' '+col['listed_in']+' '+col['rating']+' '+col['year']

In [None]:
# Store the features concatenated and separated by spaces
merged_titles['bow']=merged_titles.apply(concat,axis=1)

### Count Vectorizer

Stop words: words that add nothing to the analysis, such as articles and prepositions, are removed.

### Similarity matrix

The similarity matrix holds the cosine similarity between each movie and every other movie in the dataset.
Note that the main diagonal of the matrix equals 1, the maximum cosine similarity, because each title is compared with itself.
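
The effect of stop-word removal can be seen by comparing vocabularies on a tiny made-up corpus. This sketch uses scikit-learn's built-in English list via `stop_words="english"`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical).
docs = ["the hero of the city", "a hero and a villain"]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)

# The default tokenizer already ignores one-letter tokens such as "a".
print(sorted(plain.vocabulary_))     # ['and', 'city', 'hero', 'of', 'the', 'villain']
print(sorted(filtered.vocabulary_))  # ['city', 'hero', 'villain']
```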

In [None]:
count_term= CountVectorizer(stop_words='english')
count_matrix = count_term.fit_transform(merged_titles['bow'])
similarity = cosine_similarity(count_matrix,count_matrix)
similarity

array([[1.        , 0.21308068, 0.        , ..., 0.03655751, 0.11550616,
        0.01516894],
       [0.21308068, 1.        , 0.09805807, ..., 0.05806867, 0.13760418,
        0.02409464],
       [0.        , 0.09805807, 1.        , ..., 0.02775875, 0.00730882,
        0.06142951],
       ...,
       [0.03655751, 0.05806867, 0.02775875, ..., 1.        , 0.04328183,
        0.17052062],
       [0.11550616, 0.13760418, 0.00730882, ..., 0.04328183, 1.        ,
        0.0448977 ],
       [0.01516894, 0.02409464, 0.06142951, ..., 0.17052062, 0.0448977 ,
        1.        ]])

In [None]:
merged_titles = merged_titles.reset_index() # TODO: review whether this reset is needed
title_index = pd.Series(merged_titles.index, index=merged_titles['title'])

### Association
The similarity matrix gives the degree of similarity between the titles, but its values are not yet associated with the titles themselves.
An ordered list of (movie index, score) pairs is therefore built, so that the movie corresponding to each computed score can be retrieved.

In [None]:
def recomend(df,title,sim):
  index=title_index[title]  # row position of the requested title
  score=list(enumerate(sim[index]))  # (row position, similarity) pairs
  score = sorted(score,key=lambda value:value[1],reverse=True) # sort the tuples by score (value[1]), descending
  movie_index=[movie[0] for movie in score]  # row positions in ranked order
  recomendation = df.iloc[movie_index].copy()
  recomendation['score'] = score  # each cell keeps the full (row position, similarity) tuple
  return recomendation[['title','score']]


In [None]:
recomend(merged_titles,'the avengers',similarity)

Unnamed: 0,title,score
641,the avengers,"(641, 0.999999999999999)"
1340,avengers: age of ultron,"(1340, 0.4912120848266927)"
1780,avengers: endgame,"(1780, 0.43342293359826184)"
1779,avengers: infinity war,"(1779, 0.41555369899548994)"
1778,avengers: infinity war,"(1778, 0.4010410874644873)"
...,...,...
2931,mallesham,"(2931, 0.0)"
2933,the kissing booth 2,"(2933, 0.0)"
2939,15 august,"(2939, 0.0)"
2940,jai mummy di,"(2940, 0.0)"
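
In the output above, the queried title ranks first, the `score` column holds (index, score) tuples, and some titles appear more than once. One possible variant is sketched below: the hypothetical helper `top_similar` returns plain float scores, skips the queried row, limits the result to `n` rows, and uses the first occurrence when a title is duplicated (an assumption, not the notebook's behavior). It is demonstrated on a toy frame.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_similar(df, title, sim, n=5):
    """Return the n rows most similar to `title`, excluding the query row."""
    matches = np.flatnonzero(df["title"].to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"title not found: {title!r}")
    idx = matches[0]  # assumption: use the first row if the title is duplicated
    order = np.argsort(-sim[idx], kind="stable")  # most similar first
    order = order[order != idx][:n]               # drop the query itself
    return df.iloc[order].assign(score=sim[idx][order])[["title", "score"]]

# Toy catalog (hypothetical titles and bag-of-words strings).
demo = pd.DataFrame({
    "title": ["the avengers", "avengers: endgame", "moana", "midnight in paris"],
    "bow": [
        "avengers heroes save earth",
        "avengers heroes fight thanos",
        "ocean island voyage",
        "heroes romance paris",
    ],
})
sim_demo = cosine_similarity(CountVectorizer().fit_transform(demo["bow"]))

result = top_similar(demo, "the avengers", sim_demo, n=2)
print(result)
```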
