# Exploratory Data Analysis
## Machine Learning (APRAU) - Group 3

### Context and Objectives

This notebook documents the exploratory analysis of the music dataset distributed to the group.

**Structure of the Analysis:**

1. **Descriptive Statistics** - Quantitative characterisation of the variables through measures of central tendency, dispersion and distribution shape
2. **Univariate Analysis** - Study of the individual distributions to identify patterns, skewness and outliers
3. **Bivariate Analysis** - Investigation of the relationships between predictors and target variables, quantifying associations and identifying the best candidates for modelling

This analysis provides the empirical basis for the preprocessing and method-selection decisions in the subsequent phases of the project.

## 1. Setup and Data Loading

In [None]:
# Library imports
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.cm import get_cmap
from scipy import stats as sp_stats
import warnings

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.4f}'.format)

In [None]:
# Colours for the classes
CLASS_COLORS = {
    'class_13': '#e74c3c',
    'class_74': '#3498db',
    'class_78': '#2ecc71'
}

In [None]:
# Load the data
df = pd.read_csv('Data/group_3.csv')
print(f"Dataset loaded: {df.shape[0]} observations × {df.shape[1]} variables")
N_SAMPLES = len(df)

Dataset loaded: 3000 observations × 49 variables


## 2. Dataset Overview

The initial inspection aims to identify the structure of the data, the variable types and potential quality problems that could compromise the subsequent analyses.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 49 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   duration_1                     3000 non-null   float64
 1   duration_2                     3000 non-null   float64
 2   duration_3                     3000 non-null   float64
 3   duration_4                     3000 non-null   float64
 4   duration_5                     3000 non-null   float64
 5   loudness_level                 3000 non-null   float64
 6   popularity_level               3000 non-null   float64
 7   tempo_class                    3000 non-null   float64
 8   time_signature                 3000 non-null   float64
 9   key_mode                       3000 non-null   float64
 10  artist_song_count              3000 non-null   float64
 11  album_freq                     3000 non-null   float64
 12  movement_index                 3000 non-null   f

In [None]:
df.head(10)

Unnamed: 0,duration_1,duration_2,duration_3,duration_4,duration_5,loudness_level,popularity_level,tempo_class,time_signature,key_mode,artist_song_count,album_freq,movement_index,intensity_level,verbal_density,purity_score,positivity_index,activity_rate,loudness_intensity,happy_dance,acoustics_instrumental,artists_avg_popularity,tempo_vs_genre,energy_rank_pct,loud_energy_ratio,mood_pca,mood_cluster,acoustic_valence_mood_cluster,explicit,signal_strength,mode_indicator,focus_factor,ambient_level,key_sin,key_cos,duration_log,duration_log_z,time_signature_class_boolean,loudness_yeo,is_instrumental,is_dance_hit,temp_zscore,resonance_factor,timbre_index,echo_constant,distorted_movement,signal_power,target_class,target_regression
0,0.0,0.0,0.0,1.0,0.0,4.0,3.0,1.0,0.2218,1.5834,3.5122,0.0816,0.053,-0.9566,-0.4299,1.3175,0.37,0.1627,-1.2027,0.1925,0.4606,-0.0012,0.0958,-0.7463,-0.0093,-0.0054,-0.7758,1.0552,0.0,0.505,0.0,0.199,0.151,-0.5,0.866,1.4027,-0.4185,1.0,-1.1059,0.0,0.0,0.1627,1.1748,0.2844,1,0.9519,0.505,class_13,0.617
1,0.0,0.0,0.0,1.0,0.0,4.0,3.0,1.0,0.2218,1.6116,3.5122,0.0816,0.0588,-1.2775,-0.5112,1.5611,-1.2075,-0.1038,0.5227,-0.9957,-0.3029,-0.0012,-0.1828,-1.4492,-0.0086,-1.5631,1.2282,-0.9221,0.0,0.228,1.0,0.000805,0.384,-0.5,0.866,1.3539,-0.587,1.0,-1.3361,0.0,0.0,-0.1038,-1.879,0.1965,1,-0.0968,0.228,class_13,0.4825
2,0.0,0.0,0.0,1.0,0.0,4.0,3.0,1.0,0.2218,-0.3582,3.5122,0.0816,0.4276,-1.4189,-0.3816,1.2543,-0.6675,-0.0058,-1.6133,-0.4908,-0.0823,-0.0012,-0.0803,-0.8415,-0.0093,-0.663,1.2282,1.0552,0.0,0.479,1.0,0.0598,0.115,0.866,-0.5,1.4323,-0.3162,1.0,-1.4305,0.0,0.0,-0.0058,-0.3278,0.8812,1,-0.7479,0.479,class_13,0.7515
3,0.0,0.0,0.0,1.0,0.0,4.0,3.0,1.0,0.2218,-0.0768,3.5122,-0.5149,0.2662,-1.4018,-0.5084,1.3656,-0.7948,0.3298,0.5933,-0.6354,-0.3044,-0.0012,0.2705,-1.4769,-0.0085,-1.2868,-0.7758,-0.9221,0.0,0.21,1.0,0.000477,0.139,0.5,-0.866,1.2459,-0.9597,1.0,-1.4193,0.0,0.0,0.3298,-0.0417,0.2064,1,1.1421,0.21,class_13,0.6618
4,0.0,0.0,1.0,0.0,0.0,2.0,3.0,1.0,0.2218,-1.4837,1.8291,-0.4723,0.4564,-0.541,-0.4687,1.534,-0.035,0.2591,0.1436,0.0515,-0.3063,0.7005,0.1966,-1.1626,-0.009,-0.4491,-0.7758,1.0552,0.0,0.37,1.0,5.17e-06,0.09,0.0,1.0,1.572,0.1659,1.0,-0.7678,0.0,0.0,0.2591,0.0159,0.7486,1,1.24,0.37,class_13,0.9308
5,0.0,0.0,0.0,1.0,0.0,4.0,3.0,1.0,0.2218,0.486,3.5122,0.0816,-0.6558,-1.1208,-0.4015,1.1551,-0.3011,1.8701,-0.5025,-0.5592,-0.2577,-0.0012,1.8809,-1.137,-0.0091,-0.8182,-1.4438,1.0552,0.0,0.38,1.0,0.0136,0.145,-0.5,-0.866,1.3163,-0.7167,1.0,-1.2266,0.0,0.0,1.8701,0.1805,0.094,1,1.4417,0.38,class_13,0.5721
6,0.0,0.0,1.0,0.0,0.0,2.0,3.0,1.0,-2.0897,-1.4837,3.5122,0.0816,-0.1314,-0.344,-0.4715,1.3927,-0.0658,0.7616,0.2375,-0.2098,0.498,-0.0012,0.722,-1.1149,-0.009,-0.5462,-0.4418,1.0552,0.0,0.389,1.0,0.202,0.109,0.0,1.0,1.4922,-0.1094,1.0,-0.5876,0.0,0.0,0.7616,0.6104,0.2867,1,0.9692,0.389,class_13,0.3928
7,0.0,0.0,0.0,1.0,0.0,2.0,3.0,1.0,0.2218,1.6116,3.5122,-0.0462,-0.5463,-0.1173,-0.4961,0.0995,0.4202,1.6553,0.684,-0.0757,-0.3057,-0.0012,1.6563,-1.2196,-0.0089,-0.3924,-1.4438,1.3847,0.0,0.344,1.0,0.00032,0.0958,-0.5,0.866,1.2974,-0.7817,1.0,-0.3604,0.0,0.0,1.6553,-0.0354,0.1246,1,0.7099,0.344,class_13,0.3928
8,0.0,0.0,0.0,1.0,0.0,4.0,3.0,1.0,0.2218,-1.4837,3.5122,0.0816,0.0127,-1.1089,-0.4583,1.3385,-0.9337,-0.2989,0.3831,-0.7983,-0.3034,-0.0012,-0.3868,-1.3963,-0.0087,-1.3299,1.2282,-0.9221,0.0,0.261,1.0,0.000749,0.136,0.0,1.0,1.1325,-1.351,1.0,-1.218,0.0,0.0,-0.2989,-0.3681,0.4059,1,0.7502,0.261,class_13,0.348
9,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.2218,1.3302,-0.4551,-0.5149,0.6465,0.2485,-0.4847,0.5657,0.6053,-0.5048,0.2056,0.6979,-0.3063,1.4277,-0.6021,-0.5256,-0.0092,0.3726,0.2262,-0.263,0.0,0.563,1.0,0.0,0.102,-0.866,0.5,1.5545,0.1054,1.0,0.0645,0.0,0.0,-0.5048,-0.4076,0.7622,1,0.8713,0.563,class_13,1.4688


In [None]:
# Check for missing values
missing_data = df.isnull().sum()
print(f"Total missing values: {missing_data.sum()}")
if missing_data.any():
    print("\nColumns with missing values:")
    print(missing_data[missing_data > 0])

Total missing values: 0


In [None]:
# Check the one-hot encoding of the duration columns
duration_cols = ['duration_1', 'duration_2', 'duration_3', 'duration_4', 'duration_5']
print("Row sums:")
print(df[duration_cols].sum(axis=1).value_counts())

Row sums:
1.0000    3000
Name: count, dtype: int64


### Structural Observations

**Dimensionality:** The dataset has 3000 observations across 49 variables: 46 stored as float64 (not all of them genuinely continuous, as Section 3 shows), 1 stored as integer (echo_constant) and 2 stored as object (focus_factor, target_class).

**Data Integrity:** No missing values were identified in the features, so no imputation is required before modelling.

**One-Hot Encoding:** The variables duration_1 to duration_5 are the one-hot encoding of an ordinal categorical variable. Every row sums to exactly 1, which is consistent with a valid encoding. This representation will be consolidated into a single ordinal variable to simplify the analysis and to avoid structural multicollinearity in the design matrices of the regression models.
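
A row sum of 1 is necessary for a valid one-hot encoding but not sufficient on its own: a row such as (0.5, 0.5, 0, 0, 0) also sums to 1. The sketch below, on hypothetical toy data rather than the project dataset, adds the check that every entry is exactly 0 or 1. The helper name `is_valid_one_hot` and the column names `d_1` to `d_3` are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def is_valid_one_hot(frame: pd.DataFrame) -> bool:
    """Return True if every entry is 0 or 1 and each row has exactly one 1."""
    values = frame.to_numpy(dtype=float)
    only_zeros_and_ones = np.isin(values, [0.0, 1.0]).all()
    one_active_per_row = (values.sum(axis=1) == 1.0).all()
    return bool(only_zeros_and_ones and one_active_per_row)


# Toy data (hypothetical, not taken from group_3.csv)
good = pd.DataFrame({"d_1": [1.0, 0.0, 0.0],
                     "d_2": [0.0, 1.0, 0.0],
                     "d_3": [0.0, 0.0, 1.0]})
# Rows of `bad` sum to 1, but the first row is not a one-hot vector
bad = pd.DataFrame({"d_1": [0.5, 0.0],
                    "d_2": [0.5, 1.0],
                    "d_3": [0.0, 0.0]})

print(is_valid_one_hot(good))
print(is_valid_one_hot(bad))
```

Applied to the notebook's data, the same helper could be called as `is_valid_one_hot(df[duration_cols])`.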

### 2.1. Variable Consolidation

In [None]:
# Consolidate the duration_* columns into a single ordinal variable.
# Each one-hot row is multiplied by the level vector [1, 2, 3, 4, 5]; the
# parentheses make the integer cast apply to the result, not only to the levels.
df['duration'] = (df[duration_cols].values @ np.arange(1, 6)).astype(int)
df = df.drop(columns=duration_cols)
print(f"duration_* columns removed. Current shape: {df.shape}")

duration_* columns removed. Current shape: (3000, 45)


Converting the one-hot encoding to an ordinal representation reduces dimensionality without loss of information, and improves the interpretability and numerical stability of the models.
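
The claim that no information is lost can be checked by rebuilding the one-hot matrix from the ordinal column. The following is a minimal sketch on hypothetical toy rows, assuming the levels are numbered 1 to 5 in column order, as in the consolidation above.

```python
import numpy as np
import pandas as pd

# Toy one-hot rows (hypothetical): active levels are 4, 4, 3, 1, 5
one_hot = pd.DataFrame(
    np.eye(5)[[3, 3, 2, 0, 4]],
    columns=[f"duration_{i}" for i in range(1, 6)],
)

levels = np.arange(1, 6)

# Forward step: the dot product with the level vector picks the active level
ordinal = (one_hot.to_numpy() @ levels).astype(int)

# Backward step: compare each ordinal value with every level to rebuild the one-hot rows
rebuilt = (ordinal[:, None] == levels).astype(float)

print(ordinal)
print(np.array_equal(rebuilt, one_hot.to_numpy()))
```

The round trip is exact, but a single numeric column also implies equal spacing between levels when used in a linear model, which the one-hot form does not assume.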

A formatting problem was identified in the variable `focus_factor`: it is stored as text (dtype object) rather than as a number. The cell below replaces decimal commas with decimal points and converts the column to float.

In [None]:
# Convert focus_factor to numeric
if df['focus_factor'].dtype == 'object':
    df['focus_factor'] = df['focus_factor'].str.replace(',', '.', regex=False).astype(float)
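
`astype(float)` raises an error on the first string it cannot parse. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` on a hypothetical toy Series, so that unparseable entries become NaN and can be listed before deciding how to handle them. The sample strings are invented for illustration and are not values from the dataset.

```python
import pandas as pd

# Toy strings (hypothetical): decimal comma, decimal point, scientific notation, junk
raw = pd.Series(["0,199", "0.384", "5,17e-06", "n/a"], name="focus_factor_raw")

# Replace the decimal comma literally, then parse; failures become NaN
normalised = raw.str.replace(",", ".", regex=False)
parsed = pd.to_numeric(normalised, errors="coerce")

# Entries that were present in the input but could not be parsed
unparseable = raw[parsed.isna() & raw.notna()]

print(parsed.tolist())
print(unparseable.tolist())
```

Listing the unparseable entries first makes it possible to confirm that the decimal comma is the only formatting issue in the column.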

## 3. Feature Classification and Typology

A sound exploratory analysis requires each feature to be classified correctly according to its nature. This classification determines which statistical techniques and visualisations are appropriate for each variable.

In [None]:
# Cardinality of each variable
unique_counts = df.nunique().sort_values()
for feat, count in unique_counts.items():
    print(f"{feat:40s}: {count:4d} unique values")

echo_constant                           :    1 unique values
is_dance_hit                            :    1 unique values
is_instrumental                         :    2 unique values
time_signature_class_boolean            :    2 unique values
mode_indicator                          :    2 unique values
explicit                                :    2 unique values
target_class                            :    3 unique values
tempo_class                             :    3 unique values
loudness_level                          :    5 unique values
duration                                :    5 unique values
popularity_level                        :    5 unique values
time_signature                          :    5 unique values
acoustic_valence_mood_cluster           :   11 unique values
key_cos                                 :   11 unique values
key_sin                                 :   11 unique values
mood_cluster                            :   11 unique values
key_mode

In [None]:
# Typological classification of the features
constant_features = ['echo_constant', 'is_dance_hit']
binary_features = ['explicit', 'mode_indicator', 'time_signature_class_boolean', 'is_instrumental']
ordinal_features = ['loudness_level', 'popularity_level', 'tempo_class', 'duration', 'time_signature']
nominal_features = ['mood_cluster', 'acoustic_valence_mood_cluster']
regression_target = 'target_regression'
classification_target = 'target_class'

# Aggregate all categorical features and targets
all_categorical = constant_features + binary_features + ordinal_features + nominal_features
all_targets = [regression_target, classification_target]
# Continuous features are all columns that are neither categorical nor targets
continuous_features = [col for col in df.columns if col not in all_categorical + all_targets]

print("Feature classification:")
print("="*80)
print(f"1. Constant: {len(constant_features)}")
print(f"2. Binary: {len(binary_features)}")
print(f"3. Ordinal: {len(ordinal_features)}")
print(f"4. Nominal: {len(nominal_features)}")
print(f"5. Continuous: {len(continuous_features)}")
print(f"6. Targets: Regression={regression_target}, Classification={classification_target}")

features_numericas = continuous_features + ordinal_features + constant_features
print(f"\nTotal numeric features: {len(features_numericas)}")

Feature classification:
1. Constant: 2
2. Binary: 4
3. Ordinal: 5
4. Nominal: 2
5. Continuous: 30
6. Targets: Regression=target_regression, Classification=target_class

Total numeric features: 37
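
Because the continuous group is defined as "everything not listed elsewhere", a misspelled name in one of the manual lists would silently push that column into the continuous group. The sketch below shows one possible sanity check on hypothetical toy lists. The helper `check_partition` and the deliberately faulty groups are illustrative assumptions, not part of the original notebook.

```python
def check_partition(columns, groups):
    """Return (leftover, duplicated, unknown) names for a manual grouping.

    leftover   - columns not assigned to any group
    duplicated - names listed in more than one place
    unknown    - listed names that are not columns at all (e.g. typos)
    """
    assigned = [name for members in groups.values() for name in members]
    leftover = sorted(set(columns) - set(assigned))
    duplicated = sorted({name for name in assigned if assigned.count(name) > 1})
    unknown = sorted(set(assigned) - set(columns))
    return leftover, duplicated, unknown


# Toy example (hypothetical lists with two deliberate mistakes)
columns = ["explicit", "tempo_class", "mood_cluster", "key_sin", "target_class"]
groups = {
    "binary": ["explicit"],
    "ordinal": ["tempo_class"],
    "nominal": ["mood_cluster", "explicit"],      # duplicate on purpose
    "targets": ["target_clas"],                   # typo on purpose
}

leftover, duplicated, unknown = check_partition(columns, groups)
print("Leftover (would be treated as continuous):", leftover)
print("Listed more than once:", duplicated)
print("Not in the data:", unknown)
```

In the toy case the typo leaves `target_class` among the leftover columns, which is exactly the kind of silent misclassification the check is meant to expose.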


In [None]:
# All variables are kept for the analysis (including constants)
print(f"All variables kept. Final shape: {df.shape}")
classes_sorted = sorted(df['target_class'].unique())

All variables kept. Final shape: (3000, 45)


### Result of the Typological Classification

**Constant Features (2):** echo_constant and is_dance_hit each have a single unique value, so they carry no variance and cannot contribute to any model. They will be removed during preprocessing.

**Binary Features (4):** Binary indicators (0/1) that capture dichotomous properties of the tracks. The distribution of explicit is strongly imbalanced (90.2% non-explicit), which may limit its predictive value.

**Ordinal Features (5):** Categorical variables with a natural order. The concentration observed in time_signature (83% in a single value) and tempo_class (85% in one class) indicates low variability, which may limit the discriminative power of these features.

**Nominal Features (2):** mood_cluster and acoustic_valence_mood_cluster are categorical groupings with no natural order, each with 11 levels.

**Continuous Features (30):** These make up most of the feature space and have high cardinality (more than 900 unique values for most of them). They offer the granularity needed to capture complex patterns in the regression and classification tasks.
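
The concentration figures quoted above are shares of the most frequent level. A minimal sketch of how such shares could be computed and used to flag constant or near-constant features is shown below, on hypothetical toy columns. The helper `dominant_share` and the 0.8 cut-off are illustrative assumptions, not values used in the notebook.

```python
import pandas as pd


def dominant_share(series: pd.Series) -> float:
    """Share of observations taken by the most frequent level."""
    return float(series.value_counts(normalize=True, dropna=False).iloc[0])


# Toy columns (hypothetical): constant, heavily skewed, and balanced
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "skewed":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "balanced": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

shares = toy.apply(dominant_share).sort_values(ascending=False)

THRESHOLD = 0.8  # hypothetical cut-off for "low variability"
flagged = shares[shares >= THRESHOLD].index.tolist()

print(shares)
print("Flagged as low-variability:", flagged)
```

A share of exactly 1.0 identifies a constant feature, while values close to 1.0 mark features such as explicit that may still be informative but deserve a closer look.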