# Introduction

&nbsp;&nbsp;&nbsp;&nbsp; Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec elementum porta urna. Donec sagittis, ligula et feugiat mattis, lacus nisi efficitur lorem, eget imperdiet elit sem ut justo.  

## Meaning of Each Column

- track_id: Unique ID of each track

- artists: Names of the artists who performed the track, separated by ';'

- album_name: Name of the album on which the track appears

- track_name: Name of the track

- duration_ms: Duration of the track in milliseconds

- explicit: Boolean indicating whether the track has explicit content

- danceability: Describes how "danceable" a track is (0.0 = least danceable, 1.0 = most danceable)

- energy: Represents the intensity and activity of a track (0.0 = low energy, 1.0 = high energy)

- key: The musical key of the track, mapped to integers using standard Pitch Class notation (12 musical notes, 0 to 11)

- loudness: Overall loudness of the track in decibels (dB)

- mode: Indicates the modality (major or minor) of the track

- speechiness: Detects the presence of spoken words in the track

- acousticness: Confidence measure of whether the track is acoustic (0.0 = not acoustic, 1.0 = highly acoustic)

- instrumentalness: Predicts whether a track contains no vocals (0.0 = contains vocals, 1.0 = instrumental)

- liveness: Detects the presence of an audience in the recording (0.0 = studio recording, 1.0 = live performance)

- valence: Measures the musical positivity conveyed by a track (0.0 = negative, 1.0 = positive)

- tempo: Estimated tempo of the track in beats per minute (BPM)

- time_signature: Estimated time signature of the track (nominally from 3 to 7, although the training data summarized in the preprocessing output contains values from 0 to 5)

- track_genre: The genre of the track

- popularity_target: Boolean indicating whether the track is popular or not (stored as 0 or 1)
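
As a small illustration of how some of these columns can be read, the sketch below decodes a `key`/`mode` pair into a readable label and splits the `artists` field. The helper names `describe_key` and `split_artists` are hypothetical, and the convention that `mode` uses 1 for major and 0 for minor is an assumption that the column descriptions above do not state explicitly.

```python
from typing import List

# Pitch Class notation: integers 0-11 map to the twelve chromatic notes.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a readable label such as 'A major' for a (key, mode) pair.

    Assumption: mode is 1 for major and 0 for minor.
    """
    if not 0 <= key < len(PITCH_CLASSES):
        return "unknown"
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"


def split_artists(artists: str) -> List[str]:
    """Split the ';'-separated artists field into individual names."""
    return [name.strip() for name in artists.split(";") if name.strip()]


# Values taken from rows shown in the training data preview
print(describe_key(9, 1))
print(split_artists("Franz Liszt;YUNDI"))
```

Decoding like this is only meant for inspection; the numeric columns themselves are what the later steps work with.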

## Importing Libraries

In [93]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

## Functions

In [86]:
def display_statistics(df):
    """
    Compute and display descriptive statistics for the numeric columns of a DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame for which the statistics are computed.
    """
    # Select the numeric columns once; non-numeric columns are ignored
    numeric_df = df.select_dtypes(include=['number'])

    # Compute the statistics (labels are kept in Portuguese to match the outputs)
    stats = {
        'Mínimo': numeric_df.min(),
        'Máximo': numeric_df.max(),
        'Contagem de Valores Únicos': numeric_df.nunique(),
        'Média': numeric_df.mean()
    }

    # Build a new DataFrame from the statistics
    stats_df = pd.DataFrame(stats)

    # Sort by unique-value count (ascending), then maximum (descending), then minimum (ascending)
    sorted_stats_df = stats_df.sort_values(
        by=['Contagem de Valores Únicos', 'Máximo', 'Mínimo'],
        ascending=[True, False, True]
    )

    # Display the sorted DataFrame
    display(sorted_stats_df)

# Preprocessing

In [87]:
df_train_cleaned = pd.read_csv("data/train.csv")

display(df_train_cleaned)
display(df_train_cleaned.info())
# Generate basic statistics (this helper is preferred here over .describe())
display_statistics(df_train_cleaned)

unique_counts = df_train_cleaned.nunique()

print("Contagem de valores únicos para cada coluna:")
print(unique_counts)

Unnamed: 0,track_unique_id,track_id,artists,album_name,track_name,duration_ms,explicit,danceability,energy,key,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,popularity_target
0,41996,7hUhmkALyQ8SX9mJs5XI3D,Love and Rockets,Love and Rockets,Motorcycle,211533,False,0.305,0.84900,9,...,1,0.0549,0.000058,0.056700,0.4640,0.3200,141.793,4,goth,0
1,76471,5x59U89ZnjZXuNAAlc8X1u,Filippa Giordano,Filippa Giordano,"Addio del passato - From ""La traviata""",196000,False,0.287,0.19000,7,...,0,0.0370,0.930000,0.000356,0.0834,0.1330,83.685,4,opera,0
2,54809,70Vng5jLzoJLmeLu3ayBQq,Susumu Yokota,Symbol,Purple Rose Minuet,216506,False,0.583,0.50900,1,...,1,0.0362,0.777000,0.202000,0.1150,0.5440,90.459,3,idm,1
3,16326,1cRfzLJapgtwJ61xszs37b,Franz Liszt;YUNDI,Relajación y siestas,"Liebeslied (Widmung), S. 566",218346,False,0.163,0.03680,8,...,1,0.0472,0.991000,0.899000,0.1070,0.0387,69.442,3,classical,0
4,109799,47d5lYjbiMy0EdMRV8lRou,Scooter,Scooter Forever,The Darkside,173160,False,0.647,0.92100,2,...,1,0.1850,0.000939,0.371000,0.1310,0.1710,137.981,4,techno,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79795,76820,6mmbWSbU5FElQOocyktyUZ,Amilcare Ponchielli;Gothenburg Symphony Orches...,"Ballet Highlights - The Nutcracker, Romeo & Ju...",La Gioconda / Act 3: Dance Of The Hours,162613,False,0.554,0.00763,4,...,1,0.0502,0.915000,0.000970,0.2210,0.1560,119.502,4,opera,1
79796,110268,0XL75lllKb1jTmEamqwVU6,Sajanka,Time of India,Time of India,240062,False,0.689,0.55400,9,...,1,0.0759,0.091000,0.914000,0.0867,0.1630,148.002,4,trance,0
79797,103694,763FEhIZGILafwlkipdgtI,Frankie Valli & The Four Seasons,Merry Christmas,I Saw Mommy Kissing Santa Claus,136306,False,0.629,0.56000,0,...,0,0.0523,0.595000,0.000000,0.1820,0.8800,118.895,3,soul,0
79798,860,2VVWWwQ3FiWnmbukTb6Kd3,The Mayries,I Will Wait,I Will Wait,216841,False,0.421,0.10700,6,...,1,0.0335,0.948000,0.000000,0.0881,0.1180,104.218,4,acoustic,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79800 entries, 0 to 79799
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   track_unique_id    79800 non-null  int64  
 1   track_id           79800 non-null  object 
 2   artists            79800 non-null  object 
 3   album_name         79800 non-null  object 
 4   track_name         79800 non-null  object 
 5   duration_ms        79800 non-null  int64  
 6   explicit           79800 non-null  bool   
 7   danceability       79800 non-null  float64
 8   energy             79800 non-null  float64
 9   key                79800 non-null  int64  
 10  loudness           79800 non-null  float64
 11  mode               79800 non-null  int64  
 12  speechiness        79800 non-null  float64
 13  acousticness       79800 non-null  float64
 14  instrumentalness   79800 non-null  float64
 15  liveness           79800 non-null  float64
 16  valence            798

None

Unnamed: 0,Mínimo,Máximo,Contagem de Valores Únicos,Média
mode,0.0,1.0,2,0.637732
popularity_target,0.0,1.0,2,0.487845
time_signature,0.0,5.0,5,3.902556
key,0.0,11.0,12,5.307043
danceability,0.0,0.985,1120,0.567318
speechiness,0.0,0.965,1454,0.08475
liveness,0.0,1.0,1706,0.213313
valence,0.0,0.995,1737,0.474267
energy,1.9e-05,1.0,1932,0.641529
acousticness,0.0,0.996,4856,0.314979


Contagem de valores únicos para cada coluna:
track_unique_id      79800
track_id             66720
artists              25775
album_name           37315
track_name           55767
duration_ms          40712
explicit                 2
danceability          1120
energy                1932
key                     12
loudness             17562
mode                     2
speechiness           1454
acousticness          4856
instrumentalness      5252
liveness              1706
valence               1737
tempo                37292
time_signature           5
track_genre            114
popularity_target        2
dtype: int64


## Removing Duplicates

&nbsp;&nbsp;&nbsp;&nbsp; At first there appear to be no duplicated rows, since `df_train_cleaned[df_train_cleaned.duplicated()]` returns nothing. The track_id column, however, shows that some tracks are repeated with only the genre changed, and the track_name column reveals further repeated tracks.
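
A minimal sketch on toy data, not the real dataset, may help show why the full-row check finds nothing while a check restricted to one column does. The `toy` frame and its values are made up for illustration and only mirror the pattern described above.

```python
import pandas as pd

# Toy data mirroring the pattern in the dataset: the same track_id
# appears twice, differing only in track_genre.
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2"],
    "track_name": ["Better", "Better", "Motorcycle"],
    "track_genre": ["soul", "chill", "goth"],
})

# Full-row comparison: no two rows are identical, so nothing is flagged.
full_row_dupes = toy[toy.duplicated()]

# Comparison restricted to track_id: both 'a1' rows are flagged,
# because keep=False marks every member of a duplicate group.
id_dupes = toy[toy.duplicated(subset="track_id", keep=False)]

print(len(full_row_dupes))  # 0
print(len(id_dupes))        # 2
```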

In [88]:
# Identify rows duplicated by 'track_id' and by 'track_name' (keep=False marks every occurrence)
df_duplicated_by_id = df_train_cleaned[df_train_cleaned.duplicated(subset='track_id', keep=False)].sort_values(by='track_id')
df_duplicated_by_name = df_train_cleaned[df_train_cleaned.duplicated(subset='track_name', keep=False)].sort_values(by='track_name')

# Display the DataFrames with the duplicated rows
print("Linhas duplicadas pela coluna track_id")
display(df_duplicated_by_id)
print("Linhas duplicadas pela coluna track_name")
display(df_duplicated_by_name)

Linhas duplicadas pela coluna track_id


Unnamed: 0,track_unique_id,track_id,artists,album_name,track_name,duration_ms,explicit,danceability,energy,key,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,popularity_target
37758,103211,001APMDOl3qtx1526T11n1,Pink Sweat$;Kirby,New RnB,Better,176320,False,0.613,0.471,1,...,0,0.1070,0.316000,0.000001,0.1170,0.406,143.064,4,soul,0
26786,15028,001APMDOl3qtx1526T11n1,Pink Sweat$;Kirby,New RnB,Better,176320,False,0.613,0.471,1,...,0,0.1070,0.316000,0.000001,0.1170,0.406,143.064,4,chill,0
39619,85578,001YQlnDSduXd5LgBd66gT,Soda Stereo,Soda Stereo (Remastered),El Tiempo Es Dinero - Remasterizado 2007,177266,False,0.554,0.921,2,...,1,0.0758,0.019400,0.088100,0.3290,0.700,183.571,1,punk-rock,1
78451,100420,001YQlnDSduXd5LgBd66gT,Soda Stereo,Soda Stereo (Remastered),El Tiempo Es Dinero - Remasterizado 2007,177266,False,0.554,0.921,2,...,1,0.0758,0.019400,0.088100,0.3290,0.700,183.571,1,ska,1
58237,2106,003vvx7Niy0yvhvHt4a68B,The Killers,Hot Fuss,Mr. Brightside,222973,False,0.352,0.911,1,...,1,0.0747,0.001210,0.000000,0.0995,0.236,148.033,4,alt-rock,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
314,72679,7zv2vmZq8OjS54BxFzI2wM,Attila,Soundtrack to a Party (Bonus),Lets Start the Party,125859,True,0.592,0.932,1,...,1,0.0558,0.000005,0.859000,0.0730,0.677,133.987,4,metalcore,0
24108,22326,7zv2vmZq8OjS54BxFzI2wM,Attila,Soundtrack to a Party (Bonus),Lets Start the Party,125859,True,0.592,0.932,1,...,1,0.0558,0.000005,0.859000,0.0730,0.677,133.987,4,death-metal,0
45951,91401,7zwn1eykZtZ5LODrf7c0tS,The Neighbourhood,Hard To Imagine The Neighbourhood Ever Changing,You Get Me So High,153000,False,0.551,0.881,7,...,0,0.0542,0.186000,0.079100,0.1520,0.387,88.036,4,rock,1
1105,3100,7zwn1eykZtZ5LODrf7c0tS,The Neighbourhood,Hard To Imagine The Neighbourhood Ever Changing,You Get Me So High,153000,False,0.551,0.881,7,...,0,0.0542,0.186000,0.079100,0.1520,0.387,88.036,4,alternative,1


Linhas duplicadas pela coluna track_name


Unnamed: 0,track_unique_id,track_id,artists,album_name,track_name,duration_ms,explicit,danceability,energy,key,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,popularity_target
27456,39446,4yfFraZrNnh2zJTok5fzq7,Felix Mendelssohn;Christopher Herrick;Simon Pr...,Klassische Weihnachtsmusik,"""Hark! The Herald Angels Sing""",141000,False,0.158,0.210,7,...,1,0.0330,0.9530,0.860,0.153,0.1930,80.984,4,german,0
53334,16385,4yfFraZrNnh2zJTok5fzq7,Felix Mendelssohn;Christopher Herrick;Simon Pr...,Klassische Weihnachtsmusik,"""Hark! The Herald Angels Sing""",141000,False,0.158,0.210,7,...,1,0.0330,0.9530,0.860,0.153,0.1930,80.984,4,classical,0
16168,111315,1Ffxfl1vuEDc0xBVPIQ50s,Kid Koala,"""Was He Slow?"" (Music From The Motion Picture ...","""Was He Slow?"" - Music From The Motion Picture...",106880,False,0.754,0.719,10,...,0,0.3710,0.3730,0.259,0.120,0.6740,175.990,4,trip-hop,1
36378,111402,03XMXPAE2Yx6HeqCpAPL5o,Kid Koala,Baby Driver (Music from the Motion Picture),"""Was He Slow?"" - Music From The Motion Picture...",106880,False,0.754,0.719,10,...,0,0.3710,0.3730,0.259,0.120,0.6740,175.990,4,trip-hop,0
36323,4152,2Bc4llhjJBW77I552RgA3L,Aphex Twin,Selected Ambient Works Volume II,#3,464293,False,0.167,0.071,6,...,0,0.0410,0.9110,0.855,0.107,0.0613,143.315,5,ambient,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45090,70244,4JrGpUQothsSVR6iBhR9NC,Namewee;Leehom Wang,亞洲通車,飄向北方,269181,False,0.500,0.778,0,...,0,0.1260,0.3870,0.000,0.160,0.2790,173.906,4,mandopop,1
36388,62807,5v1dhqe9vgvp87eyd27hkb,yama,the meaning of life,麻痺,198040,False,0.309,0.941,0,...,0,0.0763,0.0587,0.000,0.435,0.7040,100.020,4,j-pop,1
25111,62556,1LwSnnsoKcAUv9TPFEZ7iQ,yama,麻痺,麻痺,198100,False,0.532,0.932,5,...,0,0.0522,0.0661,0.000,0.380,0.7380,100.100,4,j-pop,1
3139,12313,0FBjsiiUFLET2xqeKVrGBE,Eason Chan,我的快樂時代 (華星40系列),黃金時代,248040,False,0.555,0.494,4,...,1,0.0285,0.2460,0.000,0.291,0.2800,144.044,4,cantopop,1


&nbsp;&nbsp;&nbsp;&nbsp; Duplicated tracks would distort model training, so these rows are removed to keep the analyses reliable. The track_id column is also dropped at this point: it no longer identifies anything useful, since the track_unique_id column already identifies each row.

In [89]:
# Remove duplicates based on the 'track_id' column, keeping only the first occurrence
df_train_cleaned = df_train_cleaned.drop_duplicates(subset='track_id', keep='first')

# Remove duplicates based on the 'track_name' column, keeping only the first occurrence
df_train_cleaned = df_train_cleaned.drop_duplicates(subset='track_name', keep='first')

# Drop the track_id column
df_train_cleaned = df_train_cleaned.drop('track_id', axis=1)

# Print the new shape of the DataFrame
print(df_train_cleaned.shape)

(55767, 20)



&nbsp;&nbsp;&nbsp;&nbsp; This removed 24,033 duplicated rows from the dataset (from 79,800 to 55,767).
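
Deduplicating on `track_name` alone treats any two rows with the same title as the same song, which could also drop distinct songs that happen to share a title. Whether that matters for this dataset is not established here. The sketch below, on made-up toy data, only contrasts the title-only key with a hypothetical composite key of title plus artists; it is an alternative for comparison, not the approach used in this notebook.

```python
import pandas as pd

# Toy data: two different artists share the title "Intro",
# and one track is listed twice under different genres.
toy = pd.DataFrame({
    "track_name": ["Intro", "Intro", "Better", "Better"],
    "artists": ["Artist A", "Artist B", "Artist C", "Artist C"],
    "track_genre": ["rock", "ambient", "soul", "chill"],
})

# Title-only key: one "Intro" row is dropped even though the artists differ.
by_name = toy.drop_duplicates(subset="track_name", keep="first")

# Composite key (assumption for illustration): both "Intro" rows survive,
# while the repeated "Better" row is still removed.
by_name_and_artist = toy.drop_duplicates(subset=["track_name", "artists"], keep="first")

print(len(by_name))             # 2
print(len(by_name_and_artist))  # 3
```

Comparing the two row counts on the real data would be one way to gauge how many of the removed rows were same-title tracks by different artists.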

## Encoding

&nbsp;&nbsp;&nbsp;&nbsp; First, the categorical columns are converted to integers.

In [90]:
# Create a LabelEncoder object
label_encoder = LabelEncoder()

# List the categorical columns
colunas_categoricas = ['artists', 'album_name', 'track_name', 'track_genre']

# Apply label encoding to each categorical column (the encoder is refit for every column)
for coluna in colunas_categoricas:
    df_train_cleaned[coluna] = label_encoder.fit_transform(df_train_cleaned[coluna])
    
df_train_cleaned.info()
display_statistics(df_train_cleaned)

<class 'pandas.core.frame.DataFrame'>
Index: 55767 entries, 0 to 79799
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   track_unique_id    55767 non-null  int64  
 1   artists            55767 non-null  int64  
 2   album_name         55767 non-null  int64  
 3   track_name         55767 non-null  int64  
 4   duration_ms        55767 non-null  int64  
 5   explicit           55767 non-null  bool   
 6   danceability       55767 non-null  float64
 7   energy             55767 non-null  float64
 8   key                55767 non-null  int64  
 9   loudness           55767 non-null  float64
 10  mode               55767 non-null  int64  
 11  speechiness        55767 non-null  float64
 12  acousticness       55767 non-null  float64
 13  instrumentalness   55767 non-null  float64
 14  liveness           55767 non-null  float64
 15  valence            55767 non-null  float64
 16  tempo              55767 no

Unnamed: 0,Mínimo,Máximo,Contagem de Valores Únicos,Média
mode,0.0,1.0,2,0.633511
popularity_target,0.0,1.0,2,0.491061
time_signature,0.0,5.0,5,3.895476
key,0.0,11.0,12,5.28108
track_genre,0.0,113.0,114,56.492549
danceability,0.0,0.985,1106,0.559778
speechiness,0.0,0.965,1442,0.089276
liveness,0.0,1.0,1702,0.221363
valence,0.0,0.995,1731,0.467922
energy,1.9e-05,1.0,1892,0.638406


&nbsp;&nbsp;&nbsp;&nbsp; The boolean column also needs to be converted to a numeric column.

In [91]:
# Convert the 'explicit' column to 0 and 1
df_train_cleaned['explicit'] = df_train_cleaned['explicit'].astype(int)

df_train_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 55767 entries, 0 to 79799
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   track_unique_id    55767 non-null  int64  
 1   artists            55767 non-null  int64  
 2   album_name         55767 non-null  int64  
 3   track_name         55767 non-null  int64  
 4   duration_ms        55767 non-null  int64  
 5   explicit           55767 non-null  int64  
 6   danceability       55767 non-null  float64
 7   energy             55767 non-null  float64
 8   key                55767 non-null  int64  
 9   loudness           55767 non-null  float64
 10  mode               55767 non-null  int64  
 11  speechiness        55767 non-null  float64
 12  acousticness       55767 non-null  float64
 13  instrumentalness   55767 non-null  float64
 14  liveness           55767 non-null  float64
 15  valence            55767 non-null  float64
 16  tempo              55767 no

&nbsp;&nbsp;&nbsp;&nbsp; The output above shows the resulting data structure.
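
The two encoding steps in this section can be sketched on a toy frame. One difference from the cells above, introduced here as an assumption about what might be useful later, is that a separate `LabelEncoder` is stored per column in a hypothetical `encoders` dictionary, so each learned mapping stays available, for example to decode predictions or to apply the same mapping to another split.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (made up) with one categorical and one boolean column
toy = pd.DataFrame({
    "track_genre": ["opera", "goth", "opera", "idm"],
    "explicit": [False, True, False, False],
})

# One encoder per column, so each learned mapping can be reused later
encoders = {}
for column in ["track_genre"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# LabelEncoder numbers the classes in sorted order: goth=0, idm=1, opera=2
genre_codes = toy["track_genre"].tolist()
genre_classes = encoders["track_genre"].classes_.tolist()

# Booleans become 0/1 with a plain cast
toy["explicit"] = toy["explicit"].astype(int)
explicit_codes = toy["explicit"].tolist()

# The stored encoder maps integers back to the original labels
decoded = encoders["track_genre"].inverse_transform([2, 0]).tolist()

print(genre_codes, genre_classes, explicit_codes, decoded)
```

Note that the integers produced by label encoding imply an ordering that the categories do not actually have, which some models may pick up on.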

## Scaling

&nbsp;&nbsp;&nbsp;&nbsp; So that features with larger numeric ranges do not carry more weight in the algorithms, all features need to be on the same scale. Here the MinMaxScaler technique from the Scikit-Learn library is used: it maps the minimum value of each column to 0 and the maximum to 1, and places every other value proportionally between 0 and 1.

In [96]:
display_statistics(df_train_cleaned)

# Instantiate the MinMaxScaler
scaler = MinMaxScaler()

# Select the columns to scale (all except 'track_unique_id' and 'popularity_target')
columns_to_scale = df_train_cleaned.columns.difference(['track_unique_id', 'popularity_target'])

# Apply the scaler only to the selected columns
df_scaled_values = scaler.fit_transform(df_train_cleaned[columns_to_scale])

# Build a new DataFrame from the scaled values
df_scaled = pd.DataFrame(df_scaled_values, columns=columns_to_scale)

# Add back the columns that were not scaled
df_scaled['track_unique_id'] = df_train_cleaned['track_unique_id'].values
df_scaled['popularity_target'] = df_train_cleaned['popularity_target'].values

display_statistics(df_scaled)

Unnamed: 0,Mínimo,Máximo,Contagem de Valores Únicos,Média
explicit,0.0,1.0,2,0.085212
mode,0.0,1.0,2,0.633511
popularity_target,0.0,1.0,2,0.491061
time_signature,0.0,5.0,5,3.895476
key,0.0,11.0,12,5.28108
track_genre,0.0,113.0,114,56.492549
danceability,0.0,0.985,1106,0.559778
speechiness,0.0,0.965,1442,0.089276
liveness,0.0,1.0,1702,0.221363
valence,0.0,0.995,1731,0.467922


Unnamed: 0,Mínimo,Máximo,Contagem de Valores Únicos,Média
explicit,0.0,1.0,2,0.085212
mode,0.0,1.0,2,0.633511
popularity_target,0.0,1.0,2,0.491061
time_signature,0.0,1.0,5,0.779095
key,0.0,1.0,12,0.480098
track_genre,0.0,1.0,114,0.499934
danceability,0.0,1.0,1106,0.568302
speechiness,0.0,1.0,1442,0.092514
liveness,0.0,1.0,1702,0.221363
valence,0.0,1.0,1731,0.470274


&nbsp;&nbsp;&nbsp;&nbsp; The output above makes it possible to compare the minimum and maximum values before and after scaling and to see its effect.
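
The transformation can also be checked by hand. Min-max scaling with the default range applies x' = (x - min) / (max - min) to each column, so column means move by the same linear rule. The sketch below is a plain-Python illustration of that formula, not the Scikit-Learn implementation; the function name `min_max_scale` is hypothetical, and mapping a constant column to 0 is a choice made for this sketch.

```python
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; this sketch maps everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# time_signature spans 0 to 5 before scaling
scaled = min_max_scale([0, 3, 4, 5])
print(scaled)  # [0.0, 0.6, 0.8, 1.0]

# The mean is rescaled by the same rule: (3.895476 - 0) / (5 - 0)
scaled_mean = round((3.895476 - 0) / (5 - 0), 6)
print(scaled_mean)  # 0.779095
```

The rescaled mean agrees with the time_signature row in the second statistics table above, which is a quick consistency check on the scaled output.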

# Exploration

# Hypotheses

# Model Training

## Feature Selection

# Hyperparameter Fine-Tuning

# Conclusion

&nbsp;&nbsp;&nbsp;&nbsp; Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec elementum porta urna. Donec sagittis, ligula et feugiat mattis, lacus nisi efficitur lorem, eget imperdiet elit sem ut justo.  