Questions:
- How many genres does the movie "The Matrix" have? Which genres are they?
- How many genres does each movie have, on average?
- Which five movies have the most genres?
- How many users rated the movie "The Matrix"?
- How many users, on average, rated each movie?
- Which five movies were rated by the most users?
- What is the average rating of the movie "The Matrix"?
- Each movie has an average rating. What is the mean of those ratings, by genre?
- For each movie genre, how many users gave each possible rating?

Tips:
1. Only the movies and ratings tables are needed for this project.
2. Look up the "explode" function in pandas. It may be useful.
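
As a rough illustration of the second tip, the sketch below applies `explode` to a small made-up frame laid out like the `genres` column of movies.csv. The names `toy_movies` and `toy_long` and the rows themselves are assumptions for illustration only.

```python
import pandas as pd

# Toy data in the same pipe-separated layout as the "genres" column (illustrative rows).
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Heat (1995)"],
    "genres": ["Adventure|Animation|Children", "Action|Crime"],
})

# Split each string into a list, then give every list element its own row.
toy_long = toy_movies.assign(genres=toy_movies["genres"].str.split("|")).explode("genres")

print(toy_long)
```

Each exploded row keeps the index of the row it came from, so the index repeats; call `reset_index(drop=True)` if a unique index is needed.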

## Importing the Libraries

In [160]:
import pandas as pd
import numpy as np

## Data Extraction

In [181]:
def read_data_csv(filename: str) -> pd.DataFrame:
    # Read one MovieLens CSV file from the raw-data directory.
    path_dir = 'dados/raw/movie_lens'
    df = pd.read_csv(f'{path_dir}/{filename}', decimal='.')
    return df

In [182]:
df_movies = read_data_csv('movies.csv')
df_ratings = read_data_csv('ratings.csv')

In [183]:
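# Take the four-digit year from the trailing "(YYYY)"; titles may contain other parentheses.
df_movies["ano"] = df_movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
# Strip everything from the first " (" to the last ")" (the year and any alternate title).
df_movies["title"] = df_movies["title"].str.replace(r" \(.*\)","", regex=True)
# Number of pipe-separated genres per movie.
df_movies['genres_count'] = df_movies['genres'].str.split('|').str.len()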

In [187]:
# Attach each movie's title to its ratings (the left join keeps every rating row).
id_title = df_movies[['movieId', 'title']]
movies_ratings = pd.merge(df_ratings, id_title, how='left', on='movieId')

In [185]:
movies_genres = df_movies.copy()
movies_genres['genres'] = movies_genres['genres'].str.split('|')
movies_genres = movies_genres.explode('genres')
movies_genres = movies_genres[['movieId','title', 'genres']]
movies_genres

Unnamed: 0,movieId,title,genres
0,1,Toy Story,Adventure
0,1,Toy Story,Animation
0,1,Toy Story,Children
0,1,Toy Story,Comedy
0,1,Toy Story,Fantasy
...,...,...,...
9738,193583,No Game No Life: Zero,Fantasy
9739,193585,Flint,Drama
9740,193587,Bungo Stray Dogs: Dead Apple,Action
9740,193587,Bungo Stray Dogs: Dead Apple,Animation


## Data Exploration

### How many genres does the movie "The Matrix" have? Which genres are they?

In [167]:
df_aux = df_movies.copy()
df_aux["title"] = df_movies['title'].str.replace('[^A-Za-z0-9]+','', regex=True)
df_aux["title"] = df_aux["title"].str.lower()

In [168]:
# "Matrix, The" normalizes to "matrixthe"; take the movieId of the first match.
matrix_id = df_aux.loc[df_aux["title"] == "matrixthe", "movieId"].iloc[0]
df_aux[(df_aux['movieId'] == matrix_id)]['genres_count'].to_frame("Qtd_genres_matrix")

Unnamed: 0,Qtd_genres_matrix
1939,3
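
The cell above returns only the count. One possible way to also list the genres is to split the pipe-separated string for that `movieId`. The sketch below uses a small stand-in frame (`toy_movies`, `toy_matrix_id`, and `matrix_genres` are hypothetical names); the Matrix row reuses the id and genre string that appear in the ratings output further down.

```python
import pandas as pd

# Stand-in for df_movies after the title clean-up (illustrative rows).
toy_movies = pd.DataFrame({
    "movieId": [1, 2571],
    "title": ["Toy Story", "Matrix, The"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Action|Sci-Fi|Thriller"],
})

toy_matrix_id = 2571

# Select the genres string for that movie and split it into a list.
genres_str = toy_movies.loc[toy_movies["movieId"] == toy_matrix_id, "genres"].iloc[0]
matrix_genres = genres_str.split("|")

print(len(matrix_genres), matrix_genres)
```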


### How many genres does each movie have, on average?

In [169]:
genres_mean = df_movies["genres_count"].mean()
genres_mean

2.2668856497639087

### Which five movies have the most genres?

In [170]:
genres_highest = df_movies.nlargest(5, "genres_count")
genres_highest = genres_highest["title"].to_frame("Movies")
genres_highest

Unnamed: 0,Movies
7441,Rubber
5556,Patlabor: The Movie
1390,Mulan
2250,Who Framed Roger Rabbit?
3460,Osmosis Jones


### How many users rated the movie "The Matrix"?

In [171]:
ratings_matrix_count = movies_ratings[(movies_ratings["movieId"] == matrix_id)]
ratings_matrix_count.count()["rating"]

278

### How many users, on average, rated each movie?

In [172]:
users_aver_movie = movies_ratings.groupby("title")["userId"].count().to_frame("users_average")
users_aver_movie['users_average'].mean()

10.730658720868362
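
Grouping by `title` pools the ratings of any distinct films that share a title once the year has been stripped, so the per-movie average may differ when the grouping key is `movieId`. The sketch below shows the effect on made-up data; `toy_ratings` and its values are assumptions, not MovieLens rows.

```python
import pandas as pd

# Two different films (movieId 10 and 20) share the title "Hamlet".
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [10, 10, 10, 20, 30, 30],
    "title":   ["Hamlet", "Hamlet", "Hamlet", "Hamlet", "Heat", "Heat"],
})

# Distinct users per group, first keyed by title, then by movieId.
per_title = toy_ratings.groupby("title")["userId"].nunique()
per_movie = toy_ratings.groupby("movieId")["userId"].nunique()

print(per_title.mean())  # average over 2 title groups
print(per_movie.mean())  # average over 3 movie groups
```

With the title key the two "Hamlet" films collapse into one group, which changes both the number of groups and the users counted in each.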

### Which five movies were rated by the most users?

In [173]:
movies_ratings.groupby("title")["title"].count().to_frame("total_reviews").nlargest(5, "total_reviews")

title,total_reviews
Forrest Gump,329
"Shawshank Redemption, The",317
Pulp Fiction,307
"Silence of the Lambs, The",279
"Matrix, The",278


### What is the average rating of the movie "The Matrix"?

In [174]:
ratings_matrix_mean = movies_ratings[(movies_ratings["movieId"] == matrix_id)]

In [175]:
ratings_matrix_mean

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
166,1,2571,5.0,964981888,"Matrix, The",Action|Sci-Fi|Thriller
442,4,2571,1.0,945173385,"Matrix, The",Action|Sci-Fi|Thriller
1130,10,2571,0.5,1455356378,"Matrix, The",Action|Sci-Fi|Thriller
1367,13,2571,5.0,987457007,"Matrix, The",Action|Sci-Fi|Thriller
1475,15,2571,4.0,1510571744,"Matrix, The",Action|Sci-Fi|Thriller
...,...,...,...,...,...,...
96689,603,2571,5.0,963177606,"Matrix, The",Action|Sci-Fi|Thriller
97791,606,2571,5.0,1171325929,"Matrix, The",Action|Sci-Fi|Thriller
98603,607,2571,5.0,963080047,"Matrix, The",Action|Sci-Fi|Thriller
99065,608,2571,5.0,1117162423,"Matrix, The",Action|Sci-Fi|Thriller


In [176]:
ratings_matrix_mean["rating"].mean()

4.192446043165468

### Each movie has an average rating. What is the mean of those ratings, by genre?

In [177]:
mean_by_genres = movies_ratings.groupby("movieId")["rating"].mean().to_frame("rating_mean")
mean_by_genres = pd.merge(movies_genres, mean_by_genres, how = 'left', on = 'movieId')
mean_by_genres.groupby("genres")["rating_mean"].mean().to_frame("mean_genres")

genres,mean_genres
(no genres listed),3.371148
Action,3.094498
Adventure,3.21523
Animation,3.497119
Children,3.10769
Comedy,3.181716
Crime,3.301844
Documentary,3.781682
Drama,3.420905
Fantasy,3.218843
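
The question asks for a mean of per-movie means, which weights every movie equally no matter how many ratings it received. This is not the same as averaging all ratings in a genre directly. The sketch below contrasts the two on made-up data; all names and values are illustrative assumptions.

```python
import pandas as pd

# Movie 1 has four ratings of 5.0, movie 2 has a single rating of 1.0; both are "Drama".
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 1, 2],
    "rating":  [5.0, 5.0, 5.0, 5.0, 1.0],
})
toy_genres = pd.DataFrame({
    "movieId": [1, 2],
    "genres":  ["Drama", "Drama"],
})

# Step 1: one mean per movie. Step 2: average those means within each genre.
movie_means = (
    toy_ratings.groupby("movieId", as_index=False)["rating"].mean()
    .rename(columns={"rating": "rating_mean"})
)
mean_of_means = (
    toy_genres.merge(movie_means, on="movieId", how="left")
    .groupby("genres")["rating_mean"].mean()
)

# For contrast: pool every individual rating in the genre.
pooled = (
    toy_ratings.merge(toy_genres, on="movieId", how="left")
    .groupby("genres")["rating"].mean()
)

print(mean_of_means["Drama"], pooled["Drama"])
```

In this toy case the mean of means is (5.0 + 1.0) / 2, while the pooled mean is pulled toward the heavily rated movie.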


### For each movie genre, how many users gave each possible rating?

In [231]:
id_title=movies_genres[['movieId','genres']]
df_aux = pd.merge(movies_ratings, id_title, how = 'left', on = 'movieId')
df_aux

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story,Adventure
1,1,1,4.0,964982703,Toy Story,Animation
2,1,1,4.0,964982703,Toy Story,Children
3,1,1,4.0,964982703,Toy Story,Comedy
4,1,1,4.0,964982703,Toy Story,Fantasy
...,...,...,...,...,...,...
274475,610,168252,5.0,1493846352,Logan,Sci-Fi
274476,610,170875,3.0,1493846415,The Fate of the Furious,Action
274477,610,170875,3.0,1493846415,The Fate of the Furious,Crime
274478,610,170875,3.0,1493846415,The Fate of the Furious,Drama


In [232]:
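# Casting to int truncates half-star ratings (0.5 becomes 0, 4.5 becomes 4).
df_aux["rating"]=df_aux["rating"].astype(int)
# Keep one row per (userId, rating) pair: the distinct truncated ratings each user gave.
df_aux = df_aux.drop_duplicates(subset=['userId', 'rating'])
df_aux = df_aux.reset_index()
df_aux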

Unnamed: 0,index,userId,movieId,rating,timestamp,title,genres
0,0,1,1,4,964982703,Toy Story,Adventure
1,10,1,47,5,964983815,Seven,Mystery
2,15,1,70,3,964982400,From Dusk Till Dawn,Action
3,229,1,1219,2,964983393,Psycho,Crime
4,621,1,3176,1,964983504,"Talented Mr. Ripley, The",Drama
...,...,...,...,...,...,...,...
2882,270774,610,16,4,1479542171,Casino,Crime
2883,270788,610,95,3,1479542004,Broken Arrow,Action
2884,270856,610,412,2,1479542198,"Age of Innocence, The",Drama
2885,271147,610,2332,1,1493849039,Belly,Crime


In [236]:
df_aux.groupby('userId')['rating'].unique().to_frame("list_ratings")

userId,list_ratings
1,"[4, 5, 3, 2, 1]"
2,"[3, 4, 5, 2]"
3,"[0, 5, 3, 4, 2]"
4,"[3, 2, 4, 5, 1]"
5,"[4, 3, 5, 1, 2]"
...,...
606,"[2, 3, 4, 5, 1, 0]"
607,"[4, 3, 5, 2, 1]"
608,"[2, 4, 3, 0, 1, 5]"
609,"[3, 4]"
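
The table above lists the ratings each user gave, but it is not broken down by genre. One possible way to get a genre-by-rating table is to count distinct users per (`genres`, `rating`) pair and pivot the ratings into columns. The sketch below is an alternative on made-up data, not the notebook's method; `toy` and `counts` are hypothetical names, and the half-star ratings are kept as floats rather than truncated.

```python
import pandas as pd

# One row per (user, movie, genre) rating, as in the exploded and merged frame (illustrative rows).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "genres": ["Action", "Drama", "Action", "Drama", "Action"],
    "rating": [4.0, 4.5, 4.0, 3.0, 0.5],
})

# Count distinct users for every (genre, rating) pair, then spread ratings across columns.
counts = (
    toy.groupby(["genres", "rating"])["userId"]
    .nunique()
    .unstack(fill_value=0)
)

print(counts)
```

Using `nunique` counts a user once per genre and rating even if they gave that rating to several movies in the genre; `size()` would count rating events instead.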
