### Centro Universitário Senac  
**Professor:** Rafael Cóbe  
**Course:** Introduction to Machine Learning  

### Exercise 1 - **Analysis of the MovieLens 100k dataset with Pandas**

### Authors
**Renato Calabro (calabro@live.com)**  
**Ágata Oliveira (agata.aso@hotmail.com)**

In [None]:
!../.venv/bin/python --version
%pip install -r ../requirements.txt
# %pip install numpy pandas matplotlib seaborn ipykernel scikit-learn scipy
# %pip freeze > requirements.txt


This set of exercises uses the
[MovieLens 100k](https://grouplens.org/datasets/movielens/100k/) dataset.

The MovieLens data was collected by the GroupLens Research Project
at the University of Minnesota.

The dataset consists of:
* 100,000 ratings (1-5) from 943 users on 1,682 movies.
* Each user has rated at least 20 movies.

The dataset is split across several files.
Using the [Pandas](https://pandas.pydata.org/) library, implement
functions that perform the following tasks:

#### Working with the user rating data

1. Compute the mean, standard deviation, and variance for the complete
   ratings dataset (per movie);
2. Compute the mean, standard deviation, and variance for each user (store
   these values in new columns of the dataset);
3. Find the individuals who rate movies most uniformly, i.e., whose
   ratings stay close to their own mean;

#### Working with the movie data

1. Create a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame) containing information about the movies:

| movie id | movie title | release date | video release date | IMDb URL |  unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
--------|---------|--------|--------|--------|--------|--------|---------|--------|----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|
|2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|
|3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|
|...|

2. Identify which movie genre has the largest number of examples;
3. Check whether any data is missing.

#### Creating a new dataset

1. Create a new [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame) that condenses the genre information for each movie:

| movie id | movie title | release date | video release date | IMDb URL | genre|
-----------|-----------|-----------|-----------|-----------|------------|
|1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|Animation,Children's,Comedy|
|2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|Action,Adventure,Thriller|
|3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|Thriller|
|...|

2. Add columns that store the total number of ratings, the sum of the
   ratings, the mean, the maximum (and minimum) value, the standard deviation, and the variance;
3. Show the movies with the most (and the fewest) ratings;
4. Normalization is one of the most important tasks when preparing a
   dataset for Machine Learning algorithms. Implement the
   following normalization strategies:
   * [Min-max normalization](https://en.wikipedia.org/wiki/Feature_scaling#Rescaling_(min-max_normalization))
   * [Mean normalization](https://en.wikipedia.org/wiki/Feature_scaling#Mean_normalization)
   * [Z-score normalization](https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization))

### Dataset Ingestion

The dataset used in this analysis is **MovieLens 100k**, made freely available for academic and research purposes by the [GroupLens Research Project](https://grouplens.org/datasets/movielens/100k/).

The archive can be downloaded directly from:  
[http://files.grouplens.org/datasets/movielens/ml-100k.zip](http://files.grouplens.org/datasets/movielens/ml-100k.zip)

> MovieLens 100k contains 100,000 movie ratings from 943 users on 1,682 movies.

#### MovieLens 100k File Descriptions

- **u.data** – 100,000 ratings with the fields `user_id | item_id | rating | timestamp` (Unix timestamp).
- **u.info** – Total counts of users, movies, and ratings.
- **u.item** – Movie details with genre flags (0 or 1 per genre); includes title and dates.
- **u.genre** – List of the available genres.
- **u.user** – User demographics: `user_id | age | gender | occupation | zip code`.
- **u.occupation** – List of the possible user occupations.
- **u1.base / u1.test … u5.base / u5.test** – 80/20 splits of the dataset for 5-fold cross-validation.
- **ua.base / ua.test / ub.base / ub.test** – Splits with exactly 10 ratings per user in the test set.
- **mku.sh** – Script that generates the `.base` and `.test` files from `u.data`.
- **allbut.pl** – Script that creates training/test sets using "all but N ratings".


In [None]:
!mkdir -p ../datasets/movielens100k
!wget -O ../datasets/movielens100k/movielens100k.zip http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ../datasets/movielens100k/movielens100k.zip -d ../datasets/movielens100k
!rm ../datasets/movielens100k/movielens100k.zip


In [149]:
from pathlib import Path
import pandas as pd

base_path = Path("../datasets/movielens100k/ml-100k")

### Working with the user rating data

In [150]:
ratings_file = base_path / "u.data"

# Check that the file exists and load the DataFrame
if ratings_file.exists():
    df_user_rating = pd.read_csv(
        ratings_file,
        sep='\t',
        header=None,
        names=["user_id", "item_id", "rating", "timestamp"]
    )
    display(df_user_rating.head())
else:
    print(f"File not found: {ratings_file.resolve()}")

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


##### Mean, standard deviation, and variance for the ratings dataset

In [151]:
# Global statistics over all ratings
# (per-movie statistics are computed later, in df_movie_rating)
media = df_user_rating["rating"].mean()
print(f"Mean rating: {media:.2f}")

desvio_padrao = df_user_rating["rating"].std()
print(f"Standard deviation: {desvio_padrao:.2f}")

variancia = df_user_rating["rating"].var()
print(f"Variance: {variancia:.2f}")

Mean rating: 3.53
Standard deviation: 1.13
Variance: 1.27
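
The task statement asks for these statistics per movie, while the cell above reports them over the whole ratings table. A minimal sketch of the per-movie version, run on a made-up toy table (the values and the name `toy_ratings` are illustrative, not taken from `u.data`):

```python
import pandas as pd

# Toy ratings table (made-up values) with the same column names as u.data.
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 3],
    "item_id": [10, 10, 10, 20, 20, 30],
    "rating":  [4, 5, 3, 2, 4, 5],
})

# One row per movie: mean, sample standard deviation, and sample variance.
per_movie = toy_ratings.groupby("item_id")["rating"].agg(["mean", "std", "var"])
print(per_movie)
```

A movie with a single rating (item 30 here) gets `NaN` for the standard deviation and variance, because pandas uses the sample estimators (`ddof=1`) by default.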


##### Mean, standard deviation, and variance for each user (stored in new columns of the dataset)

In [152]:
# Per-user statistics, broadcast back to every rating row
df_user_rating["user_mean"] = df_user_rating.groupby("user_id")["rating"].transform("mean")
df_user_rating["user_std"] = df_user_rating.groupby("user_id")["rating"].transform("std")
df_user_rating["user_var"] = df_user_rating.groupby("user_id")["rating"].transform("var")

# Show the first rows with the new columns
display(df_user_rating.head())

Unnamed: 0,user_id,item_id,rating,timestamp,user_mean,user_std,user_var
0,196,242,3,881250949,3.615385,1.016065,1.032389
1,186,302,3,891717742,3.413043,1.223867,1.49785
2,22,377,1,878887116,3.351562,1.493239,2.229761
3,244,51,2,880606923,3.651261,1.071406,1.14791
4,166,346,1,886397596,3.55,1.431782,2.05


##### Find the users who rate movies most uniformly, i.e., whose ratings stay close to their own mean

In [153]:
# New DataFrame with one row of statistics per user
df_user_stats = df_user_rating.groupby("user_id")["rating"].agg(
    user_mean="mean",
    user_std="std",
    user_var="var",
    rating_count="count"
).reset_index()

display(df_user_stats.head())

Unnamed: 0,user_id,user_mean,user_std,user_var,rating_count
0,1,3.610294,1.263585,1.596646,272
1,2,3.709677,1.030472,1.061872,62
2,3,2.796296,1.219026,1.486024,54
3,4,4.333333,0.916831,0.84058,24
4,5,2.874286,1.362963,1.857668,175


In [154]:
# Top 10 most uniform users (with at least 10 ratings)
display(df_user_stats[df_user_stats["rating_count"] >= 10].sort_values(by="user_std", ascending=True).head(10))

Unnamed: 0,user_id,user_mean,user_std,user_var,rating_count
848,849,4.869565,0.34435,0.118577,23
354,355,4.076923,0.483576,0.233846,26
476,477,4.457143,0.505433,0.255462,35
468,469,4.534884,0.549841,0.302326,43
32,33,3.708333,0.550033,0.302536,24
551,552,3.095238,0.551429,0.304073,84
766,767,4.432432,0.554804,0.307808,37
383,384,4.136364,0.560226,0.313853,22
887,888,4.3,0.571241,0.326316,20
9,10,4.206522,0.582777,0.339629,184
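
The ranking above uses the standard deviation as the measure of uniformity. Another way to read "ratings close to the individual's mean" is the mean absolute deviation from that mean. The sketch below shows this alternative on made-up data; the names `toy`, `abs_dev`, and `mad_per_user` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy ratings (made-up): user 1 always gives 4, user 2 spreads over 1..5.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "rating":  [4, 4, 4, 1, 3, 5],
})

# Distance of each rating from the mean of the user who gave it.
user_mean = toy.groupby("user_id")["rating"].transform("mean")
toy["abs_dev"] = (toy["rating"] - user_mean).abs()

# Mean absolute deviation per user; smaller means more uniform.
mad_per_user = toy.groupby("user_id")["abs_dev"].mean().sort_values()
print(mad_per_user)
```

Both measures put the same kind of user first, but the mean absolute deviation is less sensitive to a few extreme ratings than the standard deviation.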


### Working with the movie data

In [155]:
item_file = base_path / "u.item"

# Column names as given in the dataset documentation
genre_columns = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]

# Main columns (not related to genre)
main_columns = [
    "movie_id", "title", "release_date", "video_release_date", "imdb_url"
]

# Final list of columns for the DataFrame
columns = main_columns + genre_columns

# Read the file with ISO-8859-1 encoding to avoid decoding errors on special characters
if item_file.exists():
    df_movie_genre = pd.read_csv(
        item_file,
        sep='|',
        header=None,
        names=columns,
        encoding="ISO-8859-1"
    )
    display(df_movie_genre.head())
else:
    print(f"File not found: {item_file.resolve()}")


Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


##### Identify which movie genre has the largest number of examples

In [156]:
genre_counts = df_movie_genre[genre_columns].sum().sort_values(ascending=False)

df_genre_counts = genre_counts.reset_index()
df_genre_counts.columns = ["genre", "count"]

print("Genre with the most movies:")
display(df_genre_counts.head(1))

Genre with the most movies:


Unnamed: 0,genre,count
0,Drama,725
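
When only the name of the top genre is needed, sorting the whole table is not required. A small sketch on made-up genre flags (the frame `toy_flags` and its values are illustrative) using `Series.idxmax`:

```python
import pandas as pd

# Toy one-hot genre flags (made-up): one row per movie.
toy_flags = pd.DataFrame({
    "Action": [1, 0, 0, 1],
    "Comedy": [0, 1, 0, 0],
    "Drama":  [1, 1, 1, 0],
})

# Column sums give the number of movies tagged with each genre.
genre_totals = toy_flags.sum()

# idxmax returns the label of the largest total.
top_genre = genre_totals.idxmax()
print(top_genre, genre_totals[top_genre])
```

Because a movie can carry several genre flags, these totals count genre tags and can add up to more than the number of movies.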


##### Check whether any data is missing

In [157]:
# Count missing values per column
missing_data = df_movie_genre.isnull().sum()

# Keep only the columns with at least one missing value
missing_data = missing_data[missing_data > 0]

if missing_data.empty:
    print("✅ No missing data in the DataFrame.")
else:
    print("⚠️ Missing data found:")
    display(missing_data)


⚠️ Missing data found:


release_date             1
video_release_date    1682
imdb_url                 3
dtype: int64
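
Raw counts are easier to judge next to the table size: here `video_release_date` is missing in all 1,682 rows, while the other two columns have only a handful of gaps. A minimal sketch, on made-up data with hypothetical column names, of reporting the share of missing values per column:

```python
import pandas as pd

# Toy frame (made-up): column "a" has two gaps, column "b" has none.
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [1, 2, 3, 4],
})

# isna() yields booleans; their mean is the fraction of missing values.
missing_share = toy.isna().mean()

# Keep only the columns that actually have gaps.
missing_share = missing_share[missing_share > 0]
print(missing_share)
```

A column with a share of 1.0 carries no information and can usually be dropped before further analysis.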

### Creating a new dataset

##### Create a new DataFrame that condenses the genre information for each movie

In [158]:
def extract_genres(row):
    return ",".join([genre for genre in genre_columns if row[genre] == 1])

# Build the 'genres' column from the binary genre columns
df_movie_clean = df_movie_genre[main_columns].copy()
df_movie_clean["genres"] = df_movie_genre.apply(extract_genres, axis=1)

# Show the result
display(df_movie_clean.head())

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genres
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,"Animation,Children's,Comedy"
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,"Action,Adventure,Thriller"
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,Thriller
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,"Action,Comedy,Drama"
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),"Crime,Drama,Thriller"
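
Row-wise `apply` is easy to read but calls a Python function once per row. One alternative is to loop over the underlying NumPy array of flags instead. The sketch below uses a made-up three-genre table; `toy_genres` and `toy_movies` are hypothetical names.

```python
import pandas as pd

toy_genres = ["Action", "Comedy", "Drama"]

# Toy movie table (made-up) with one-hot genre flags.
toy_movies = pd.DataFrame({
    "title":  ["A", "B", "C"],
    "Action": [1, 0, 0],
    "Comedy": [1, 1, 0],
    "Drama":  [0, 1, 0],
})

# Iterate over plain arrays of flags and join the names whose flag is set.
flags = toy_movies[toy_genres].to_numpy()
toy_movies["genres"] = [
    ",".join(name for name, flag in zip(toy_genres, row) if flag)
    for row in flags
]
print(toy_movies[["title", "genres"]])
```

A movie with no flag set (title `C` here) ends up with an empty string, which is the same result the `extract_genres` helper gives.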


##### Add columns that store the total number of ratings, the sum of the ratings, the mean, the maximum (and minimum) value, the standard deviation, and the variance

In [159]:
df_movie_rating = df_user_rating.groupby("item_id")["rating"].agg(
    rating_count="count",
    rating_sum="sum",
    rating_mean="mean",
    rating_max="max",
    rating_min="min",
    rating_std="std",
    rating_var="var"
).reset_index()


df_movie_rating.rename(columns={"item_id": "movie_id"}, inplace=True)

display(df_movie_rating.head())

Unnamed: 0,movie_id,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var
0,1,452,1753,3.878319,5,1,0.927897,0.860992
1,2,131,420,3.206107,5,1,0.966497,0.934116
2,3,90,273,3.033333,5,1,1.21276,1.470787
3,4,209,742,3.550239,5,1,0.965069,0.931358
4,5,86,284,3.302326,5,1,0.946446,0.895759


In [160]:
df_movie_full = df_movie_clean.merge(df_movie_rating, on="movie_id", how="left")
display(df_movie_full.head())

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genres,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,"Animation,Children's,Comedy",452,1753,3.878319,5,1,0.927897,0.860992
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,"Action,Adventure,Thriller",131,420,3.206107,5,1,0.966497,0.934116
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,Thriller,90,273,3.033333,5,1,1.21276,1.470787
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,"Action,Comedy,Drama",209,742,3.550239,5,1,0.965069,0.931358
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),"Crime,Drama,Thriller",86,284,3.302326,5,1,0.946446,0.895759


##### Show the movies with the most (and the fewest) ratings

In [161]:
print("Least-rated movie:")
display(df_movie_full.sort_values(by="rating_count", ascending=True).head(1))

print("Most-rated movie:")
display(df_movie_full.sort_values(by="rating_count", ascending=False).head(1))

Least-rated movie:


Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genres,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var
1672,1673,Mirage (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Mirage%20(1995),"Action,Thriller",1,3,3.0,3,3,,


Most-rated movie:


Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genres,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var
49,50,Star Wars (1977),01-Jan-1977,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,"Action,Adventure,Romance,Sci-Fi,War",583,2541,4.358491,5,1,0.881341,0.776762
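
Because `head(1)` returns a single row, any other movies tied at the same rating count are not shown. A sketch of listing every tied movie, using a made-up table (`toy_counts` is a hypothetical name):

```python
import pandas as pd

# Toy table (made-up): two movies share the minimum, two share the maximum.
toy_counts = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating_count": [1, 7, 1, 7],
})

min_count = toy_counts["rating_count"].min()
max_count = toy_counts["rating_count"].max()

# Boolean masks keep every row that matches the extreme value, not just the first.
least_rated = toy_counts[toy_counts["rating_count"] == min_count]
most_rated = toy_counts[toy_counts["rating_count"] == max_count]

print(least_rated)
print(most_rated)
```

The same two masks could be applied to `df_movie_full` to see how many movies share the lowest rating count.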


##### Normalization is one of the most important tasks when preparing a dataset for Machine Learning algorithms.
Implement the following normalization strategies:
   * Min-max normalization
   * Mean normalization
   * Z-score normalization
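
For reference, the three strategies can be written as small helper functions. The sketch below applies them to a made-up series; the helper names `min_max`, `mean_norm`, and `z_score` are illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    return (s - s.min()) / (s.max() - s.min())

def mean_norm(s: pd.Series) -> pd.Series:
    """Center on the mean and divide by the range: (x - mean) / (max - min)."""
    return (s - s.mean()) / (s.max() - s.min())

def z_score(s: pd.Series, ddof: int = 1) -> pd.Series:
    """Center on the mean and divide by the standard deviation."""
    return (s - s.mean()) / s.std(ddof=ddof)

# Made-up values, chosen so the results are easy to check by hand.
values = pd.Series([10.0, 20.0, 30.0, 40.0])

print(min_max(values).tolist())
print(mean_norm(values).tolist())
print(z_score(values).tolist())
```

All three assume the series is not constant: a zero range or zero standard deviation would lead to division by zero.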

In [162]:
df_rating_normalized = df_movie_rating.copy()

##### Min-max normalization

In [163]:
# Pure pandas

min_val = df_rating_normalized["rating_sum"].min()
max_val = df_rating_normalized["rating_sum"].max()
df_rating_normalized["rating_minmax_pandas"] = (
    df_rating_normalized["rating_sum"] - min_val
) / (max_val - min_val)

display(df_rating_normalized.head())

Unnamed: 0,movie_id,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var,rating_minmax_pandas
0,1,452,1753,3.878319,5,1,0.927897,0.860992,0.689764
1,2,131,420,3.206107,5,1,0.966497,0.934116,0.164961
2,3,90,273,3.033333,5,1,1.21276,1.470787,0.107087
3,4,209,742,3.550239,5,1,0.965069,0.931358,0.291732
4,5,86,284,3.302326,5,1,0.946446,0.895759,0.111417


In [164]:
# Sklearn MinMaxScaler

from sklearn.preprocessing import MinMaxScaler



minmax_scaler = MinMaxScaler()
df_rating_normalized["rating_minmax_sklearn"] = minmax_scaler.fit_transform(df_rating_normalized[["rating_sum"]])

display(df_rating_normalized.head())

Unnamed: 0,movie_id,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var,rating_minmax_pandas,rating_minmax_sklearn
0,1,452,1753,3.878319,5,1,0.927897,0.860992,0.689764,0.689764
1,2,131,420,3.206107,5,1,0.966497,0.934116,0.164961,0.164961
2,3,90,273,3.033333,5,1,1.21276,1.470787,0.107087,0.107087
3,4,209,742,3.550239,5,1,0.965069,0.931358,0.291732,0.291732
4,5,86,284,3.302326,5,1,0.946446,0.895759,0.111417,0.111417


##### Mean normalization

In [165]:
# Pure pandas

mean_val = df_rating_normalized["rating_sum"].mean()
min_val = df_rating_normalized["rating_sum"].min()
max_val = df_rating_normalized["rating_sum"].max()

# Mean normalization: (x - mean) / (max - min)
df_rating_normalized["rating_mean_norm"] = (
    df_rating_normalized["rating_sum"] - mean_val
) / (max_val - min_val)


display(df_rating_normalized.head())

Unnamed: 0,movie_id,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var,rating_minmax_pandas,rating_minmax_sklearn,rating_mean_norm
0,1,452,1753,3.878319,5,1,0.927897,0.860992,0.689764,0.689764,0.607535
1,2,131,420,3.206107,5,1,0.966497,0.934116,0.164961,0.164961,0.082732
2,3,90,273,3.033333,5,1,1.21276,1.470787,0.107087,0.107087,0.024858
3,4,209,742,3.550239,5,1,0.965069,0.931358,0.291732,0.291732,0.209504
4,5,86,284,3.302326,5,1,0.946446,0.895759,0.111417,0.111417,0.029189


##### Z-score normalization

In [166]:
# Pure pandas

mean_val = df_rating_normalized["rating_sum"].mean()
std_val = df_rating_normalized["rating_sum"].std()

df_rating_normalized["rating_zscore_pandas"] = (
    df_rating_normalized["rating_sum"] - mean_val
) / std_val

display(df_rating_normalized.head())

Unnamed: 0,movie_id,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var,rating_minmax_pandas,rating_minmax_sklearn,rating_mean_norm,rating_zscore_pandas
0,1,452,1753,3.878319,5,1,0.927897,0.860992,0.689764,0.689764,0.607535,5.002385
1,2,131,420,3.206107,5,1,0.966497,0.934116,0.164961,0.164961,0.082732,0.681207
2,3,90,273,3.033333,5,1,1.21276,1.470787,0.107087,0.107087,0.024858,0.204678
3,4,209,742,3.550239,5,1,0.965069,0.931358,0.291732,0.291732,0.209504,1.725032
4,5,86,284,3.302326,5,1,0.946446,0.895759,0.111417,0.111417,0.029189,0.240336


In [167]:
# scikit-learn StandardScaler for z-score normalization (uses the population standard deviation, ddof=0)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_rating_normalized["rating_zscore_sklearn"] = scaler.fit_transform(
    df_rating_normalized[["rating_sum"]]
)

display(df_rating_normalized.head())

Unnamed: 0,movie_id,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var,rating_minmax_pandas,rating_minmax_sklearn,rating_mean_norm,rating_zscore_pandas,rating_zscore_sklearn
0,1,452,1753,3.878319,5,1,0.927897,0.860992,0.689764,0.689764,0.607535,5.002385,5.003873
1,2,131,420,3.206107,5,1,0.966497,0.934116,0.164961,0.164961,0.082732,0.681207,0.681409
2,3,90,273,3.033333,5,1,1.21276,1.470787,0.107087,0.107087,0.024858,0.204678,0.204739
3,4,209,742,3.550239,5,1,0.965069,0.931358,0.291732,0.291732,0.209504,1.725032,1.725545
4,5,86,284,3.302326,5,1,0.946446,0.895759,0.111417,0.111417,0.029189,0.240336,0.240408
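
The `rating_zscore_pandas` and `rating_zscore_sklearn` columns differ slightly (5.002385 versus 5.003873 in the first row). This is expected: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). The two results therefore differ by a constant factor of sqrt(n / (n - 1)), which is about 1.0003 for the 1,682 movies here. A small sketch on made-up values, using only pandas, illustrates the relationship:

```python
import math
import pandas as pd

# Made-up values with mean 5 and population standard deviation 2.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)

std_sample = values.std(ddof=1)      # pandas default (divides by n - 1)
std_population = values.std(ddof=0)  # what StandardScaler uses (divides by n)

centered = values - values.mean()
z_sample = centered / std_sample
z_population = centered / std_population

# The two standard deviations differ by the constant factor sqrt(n / (n - 1)).
ratio = std_sample / std_population
print(ratio, math.sqrt(n / (n - 1)))
```

Passing `ddof=0` to the pandas `std()` call would make the pandas column match the scikit-learn one.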


In [168]:
# SciPy z-score normalization

from scipy.stats import zscore

# ddof=1 matches the sample standard deviation used by pandas' std()
df_rating_normalized["rating_zscore_scipy"] = zscore(
    df_rating_normalized["rating_sum"], ddof=1
)

display(df_rating_normalized.head())

Unnamed: 0,movie_id,rating_count,rating_sum,rating_mean,rating_max,rating_min,rating_std,rating_var,rating_minmax_pandas,rating_minmax_sklearn,rating_mean_norm,rating_zscore_pandas,rating_zscore_sklearn,rating_zscore_scipy
0,1,452,1753,3.878319,5,1,0.927897,0.860992,0.689764,0.689764,0.607535,5.002385,5.003873,5.002385
1,2,131,420,3.206107,5,1,0.966497,0.934116,0.164961,0.164961,0.082732,0.681207,0.681409,0.681207
2,3,90,273,3.033333,5,1,1.21276,1.470787,0.107087,0.107087,0.024858,0.204678,0.204739,0.204678
3,4,209,742,3.550239,5,1,0.965069,0.931358,0.291732,0.291732,0.209504,1.725032,1.725545,1.725032
4,5,86,284,3.302326,5,1,0.946446,0.895759,0.111417,0.111417,0.029189,0.240336,0.240408,0.240336
