# K-Means Model: Hit Cluster Analysis

This notebook applies **K-Means** to the **prepared** dataset and adds extra diagnostics and artifacts.

### What you will see
1. Load the most recent prepared dataset (CSV/Parquet).
2. Select numeric features (already normalized/encoded).
3. Choose **K** by **Silhouette** (K=2..8).
4. Train the final K-Means, measure **hit_rate per cluster**, map each cluster to **hit-like** (hit_rate ≥ 0.5), and compare with `is_hit`.
5. Find the **top features per cluster** (standardized deviations).
6. List the **charts and output artifacts** and save the per-track cluster assignments.
7. Additional **diagnostics** (Elbow, Calinski-Harabasz, Davies-Bouldin, per-sample Silhouette).
8. **Stability** via bootstrap-style resampling (ARI).
9. Alternative with **PCA (95%)** + K-Means and comparison.
10. **Profiles** and **examples** per cluster.
11. **Export artifacts**: model, metrics, profiles, images.


In [2]:

from __future__ import annotations
import os
from pathlib import Path
import pandas as pd, numpy as np, matplotlib.pyplot as plt
import sklearn

# Directories (override the default with the PROCESSED_DIR environment variable)
PROCESSED_DIR = os.environ.get('PROCESSED_DIR', r'D:\Previsor-de-Hits-IA\data\processed')
PROCESSED_DIR = Path(PROCESSED_DIR)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print('Processed dir:', PROCESSED_DIR)
print('sklearn :', sklearn.__version__)
print('pandas  :', pd.__version__)


Processed dir: D:\Previsor-de-Hits-IA\data\processed
sklearn : 1.7.2
pandas  : 2.3.3


### 1) Load the most recent prepared dataset
We prefer **CSV**; if none is found, we fall back to **Parquet**.


In [4]:

from pathlib import Path
import pandas as pd

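PROCESSED_DIR = Path(PROCESSED_DIR)  # make sure this is a Path even if the cell runs alone

# File names end in a YYYYMMDD_HHMMSS stamp, so a reverse name sort puts the newest first
csvs = sorted(PROCESSED_DIR.glob('spotify_prepared_v2_*.csv'), reverse=True)
parqs = sorted(PROCESSED_DIR.glob('spotify_prepared_v2_*.parquet'), reverse=True)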

if csvs:
    prepared_path = csvs[0]
    X_tr_df = pd.read_csv(prepared_path)
    origem = 'CSV'
elif parqs:
    prepared_path = parqs[0]
    X_tr_df = pd.read_parquet(prepared_path)
    origem = 'Parquet'
else:
    raise FileNotFoundError(f'No prepared file found in: {PROCESSED_DIR}')

print(f'Using {origem}:', prepared_path)
X_tr_df.head(3)


Using CSV: D:\Previsor-de-Hits-IA\data\processed\spotify_prepared_v2_20251114_104521.csv


Unnamed: 0,num01__danceability,num01__energy,num01__speechiness,num01__acousticness,num01__instrumentalness,num01__liveness,num01__valence,numscale__tempo,numscale__duration_ms,numscale__loudness,...,cat__release_year_2021.0,cat__release_year_2022.0,cat__release_year_2023.0,cat__release_year_2024.0,track_name,playlist_genre,track_popularity,is_hit_rule,is_hit_agree,is_hit
0,0.521,0.592,0.0304,0.308,0.0,0.122,0.535,0.75027,0.404633,0.848451,...,0.0,0.0,0.0,1.0,Die With A Smile,pop,100,1,1.0,1
1,0.747,0.507,0.0358,0.2,0.0608,0.117,0.438,0.318378,0.301981,0.785096,...,0.0,0.0,0.0,1.0,BIRDS OF A FEATHER,pop,97,1,1.0,1
2,0.554,0.808,0.0368,0.214,0.0,0.159,0.372,0.347475,0.192423,0.943932,...,0.0,0.0,0.0,1.0,That’s So True,pop,93,1,1.0,1


### 2) Select features for K-Means
- We drop auxiliary/label columns (`track_name`, `playlist_genre`, `is_hit`, `is_hit_rule`, `is_hit_agree`, `track_popularity`).
- We keep **numeric columns only**, which are already normalized/encoded.
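
The explicit drop matters because several label columns are numeric, so a dtype filter alone would let them through. A minimal sketch on a toy frame (the values are made up for illustration; the column names follow the ones in this notebook):

```python
import pandas as pd

# Toy frame: two real features, one text column, two numeric label columns
toy = pd.DataFrame({
    'num01__energy': [0.59, 0.51, 0.81],
    'cat__mode_1.0': [1.0, 0.0, 1.0],
    'track_name': ['a', 'b', 'c'],
    'track_popularity': [100, 97, 93],
    'is_hit': [1, 1, 1],
})
label_cols = ['track_name', 'playlist_genre', 'is_hit',
              'is_hit_rule', 'is_hit_agree', 'track_popularity']

# A dtype filter alone removes the text column but keeps the numeric labels
numeric_only = toy.select_dtypes(include=['number'])

# Dropping the labels first leaves only genuine features
features = (toy.drop(columns=label_cols, errors='ignore')
               .select_dtypes(include=['number']))

print(list(numeric_only.columns))
print(list(features.columns))
```

With `errors='ignore'`, label columns that are absent from the frame (here `playlist_genre`, `is_hit_rule`, `is_hit_agree`) are skipped silently.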


In [5]:

drop_cols = [c for c in ['track_name','playlist_genre','is_hit','is_hit_rule','is_hit_agree','track_popularity'] if c in X_tr_df.columns]
X_num = X_tr_df.drop(columns=drop_cols, errors='ignore').select_dtypes(include=['number']).copy()
feature_cols = list(X_num.columns)
print('Features used in K-Means:', len(feature_cols))
assert len(feature_cols) > 0, 'No numeric feature available for clustering.'
# Label vector (only for later evaluation; it is not fed to K-Means)
y_hit = X_tr_df['is_hit'].astype(int).values if 'is_hit' in X_tr_df.columns else None
X_feat = X_num.values
X_num.head(3)


Features used in K-Means: 211


Unnamed: 0,num01__danceability,num01__energy,num01__speechiness,num01__acousticness,num01__instrumentalness,num01__liveness,num01__valence,numscale__tempo,numscale__duration_ms,numscale__loudness,...,cat__release_year_2015.0,cat__release_year_2016.0,cat__release_year_2017.0,cat__release_year_2018.0,cat__release_year_2019.0,cat__release_year_2020.0,cat__release_year_2021.0,cat__release_year_2022.0,cat__release_year_2023.0,cat__release_year_2024.0
0,0.521,0.592,0.0304,0.308,0.0,0.122,0.535,0.75027,0.404633,0.848451,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.747,0.507,0.0358,0.2,0.0608,0.117,0.438,0.318378,0.301981,0.785096,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.554,0.808,0.0368,0.214,0.0,0.159,0.372,0.347475,0.192423,0.943932,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### 3) Choosing K by **Silhouette**
We test K=2..8 and keep the K with the highest Silhouette score (higher is better).
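
For a sense of scale: Silhouette ranges from -1 to 1, and well-separated groups score close to 1. The sketch below runs the same selection loop on synthetic data; the four blobs from `make_blobs` are an assumption chosen for illustration, not the notebook's dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four tight, well-separated blobs at the corners of a square (toy data)
X_toy, _ = make_blobs(n_samples=400,
                      centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                      cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Keep the K with the highest mean silhouette
best_k_toy = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()})
print('best K on the toy data:', best_k_toy)
```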


In [6]:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

sil_scores = {}
best_k, best_sil = None, -1.0

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_feat)
    try:
        sil = silhouette_score(X_feat, labels)
    except Exception:
        sil = np.nan
    sil_scores[k] = sil
    if np.isfinite(sil) and sil > best_sil:
        best_sil, best_k = sil, k

print('Silhouette by K:', {k: (None if pd.isna(v) else round(v,3)) for k,v in sil_scores.items()})
print(f'Chosen K: {best_k} \n silhouette ≈ {best_sil:.3f}' if best_k is not None else 'Chosen K: None')

# Silhouette vs K chart
fig, ax = plt.subplots(figsize=(6,4))
ax.plot(list(sil_scores.keys()), [sil_scores[k] for k in sil_scores], '-o')
ax.set_title('Silhouette vs K'); ax.set_xlabel('K'); ax.set_ylabel('Silhouette')
plt.tight_layout()
(PROCESSED_DIR / 'silhouette_vs_k.png').unlink(missing_ok=True)
plt.savefig(PROCESSED_DIR / 'silhouette_vs_k.png', dpi=120)
plt.close()


Silhouette by K: {2: 0.101, 3: 0.097, 4: 0.071, 5: 0.057, 6: 0.052, 7: 0.057, 8: 0.06}
Chosen K: 2 
 silhouette ≈ 0.101


### 4) Train the final K-Means and compute **hit_rate per cluster**
We map a cluster to **hit-like** when `hit_rate ≥ 0.5` and measure an **exploratory accuracy** against `is_hit`.


In [7]:

km_final = KMeans(n_clusters=best_k if best_k else 2, n_init=10, random_state=42)
clusters = km_final.fit_predict(X_feat)
X_num['cluster'] = clusters
if y_hit is not None:
    X_num['is_hit'] = y_hit

cluster_stats = (
    X_num.groupby('cluster')['is_hit']
    .agg(['count','mean'])
    .rename(columns={'count':'n','mean':'hit_rate'})
    .sort_values('hit_rate', ascending=False)
)
cluster_stats


cluster,n,hit_rate
0,2716,0.358984
1,2115,0.33617


In [8]:

cluster_to_hit = (cluster_stats['hit_rate'] >= 0.5).astype(int).to_dict()
X_num['kmeans_hit_pred'] = X_num['cluster'].map(cluster_to_hit)
acc = (X_num['kmeans_hit_pred'] == X_num['is_hit']).mean() if 'is_hit' in X_num.columns else np.nan
print('Cluster→hit-like mapping:', cluster_to_hit)
print(f'Exploratory accuracy (clusters→hit vs is_hit): {acc:.3f}')

# hit_rate per cluster chart
fig, ax = plt.subplots(figsize=(6,4))
cluster_stats.sort_index()['hit_rate'].plot(kind='bar', ax=ax)
ax.set_ylim(0,1); ax.set_ylabel('hit_rate'); ax.set_title('Hit rate per cluster')
plt.tight_layout(); plt.savefig(PROCESSED_DIR / 'cluster_hit_rate.png', dpi=120); plt.close()


Cluster→hit-like mapping: {0: 0, 1: 0}
Exploratory accuracy (clusters→hit vs is_hit): 0.651
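
Both clusters have a hit rate below 0.5, so every track is mapped to non-hit. In that case the exploratory accuracy is simply the share of non-hits in the dataset, and it says nothing about how well the clusters separate hits. A quick check using the counts and rates from the `cluster_stats` table above:

```python
# Cluster sizes and hit rates copied from the cluster_stats output
n = {0: 2716, 1: 2115}
hit_rate = {0: 0.358984, 1: 0.33617}

total = sum(n.values())
hits = sum(n[c] * hit_rate[c] for c in n)
overall_hit_rate = hits / total

# Predicting "non-hit" for every track is correct whenever the track is not a hit
majority_baseline = 1.0 - overall_hit_rate

print(f'overall hit rate  : {overall_hit_rate:.3f}')
print(f'majority baseline : {majority_baseline:.3f}')
```

The baseline matches the 0.651 reported above, so the accuracy figure should be read as the majority-class rate rather than as evidence of predictive clusters.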


### 5) Top features per cluster (standardized deviations)
For each feature we compare the cluster mean with the global mean and divide the difference by the global standard deviation.


In [9]:

feat_cols = [c for c in X_num.columns if c not in ['cluster','is_hit','kmeans_hit_pred']]
global_mean = X_num[feat_cols].mean()
global_std = X_num[feat_cols].std().replace(0, np.nan)
cluster_means = X_num.groupby('cluster')[feat_cols].mean()
zscores = (cluster_means - global_mean) / global_std

top_features_per_cluster = {}
for cl in zscores.index:
    z = zscores.loc[cl].abs().sort_values(ascending=False)
    top_features_per_cluster[int(cl)] = list(z.head(10).index)

pd.DataFrame({k:v for k,v in top_features_per_cluster.items()}).to_csv(PROCESSED_DIR / 'top_features_per_cluster.csv', index=False)
top_features_per_cluster


{0: ['cat__mode_0.0',
  'cat__mode_1.0',
  'cat__key_0.0',
  'num01__danceability',
  'cat__key_4.0',
  'cat__key_2.0',
  'numscale__loudness',
  'cat__playlist_subgenre_pop',
  'cat__playlist_genre_turkish',
  'cat__key_11.0'],
 1: ['cat__mode_0.0',
  'cat__mode_1.0',
  'cat__key_0.0',
  'num01__danceability',
  'cat__key_4.0',
  'cat__key_2.0',
  'numscale__loudness',
  'cat__playlist_subgenre_pop',
  'cat__playlist_genre_turkish',
  'cat__key_11.0']}
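
The two lists are identical, which is expected for K=2: the size-weighted deviations of the two cluster means from the global mean cancel out, so each feature deviates in opposite directions in the two clusters and the ranking by absolute value is the same. Keeping the sign makes the lists informative again. A hedged sketch on synthetic data (the frame and the column names `f1`..`f3` are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['f1', 'f2', 'f3'])
toy['cluster'] = (toy['f1'] > 0).astype(int)  # stand-in for K-Means labels with K=2

feat_cols = ['f1', 'f2', 'f3']
global_mean = toy[feat_cols].mean()
global_std = toy[feat_cols].std().replace(0, np.nan)
z = (toy.groupby('cluster')[feat_cols].mean() - global_mean) / global_std

# Rank by |z| but report the signed value so the direction stays visible
for cl in z.index:
    order = z.loc[cl].abs().sort_values(ascending=False).index
    print(cl, z.loc[cl, order].round(2).to_dict())

# With two clusters the rankings coincide and the signs are mirrored
rank0 = list(z.loc[0].abs().sort_values(ascending=False).index)
rank1 = list(z.loc[1].abs().sort_values(ascending=False).index)
same_ranking = rank0 == rank1
opposite_signs = bool((np.sign(z.loc[0]) == -np.sign(z.loc[1])).all())
print(same_ranking, opposite_signs)
```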

### 6) Output charts and artifacts
- `silhouette_vs_k.png`: metric vs K.
- `cluster_hit_rate.png`: hit rate per cluster.
- `pca2_clusters_k{best_k}.png`: 2D view in PCA space (only if generated in block 9).
- `top_features_per_cluster.csv`: top features per cluster.
- `kmeans_cluster_assignments_*.csv`: per-track audit (cluster and hit-like prediction).


In [10]:

assign_path = PROCESSED_DIR / f"kmeans_cluster_assignments_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.csv"
aud_cols = ['cluster']
if 'kmeans_hit_pred' in X_num.columns:
    aud_cols.append('kmeans_hit_pred')
if 'is_hit' in X_num.columns:
    aud_cols.append('is_hit')
X_num[aud_cols].to_csv(assign_path, index=False)

print('Generated files:')
print('-', PROCESSED_DIR / 'silhouette_vs_k.png')
print('-', PROCESSED_DIR / 'cluster_hit_rate.png')
print('-', PROCESSED_DIR / 'top_features_per_cluster.csv')
print('-', assign_path)


Generated files:
- D:\Previsor-de-Hits-IA\data\processed\silhouette_vs_k.png
- D:\Previsor-de-Hits-IA\data\processed\cluster_hit_rate.png
- D:\Previsor-de-Hits-IA\data\processed\top_features_per_cluster.csv
- D:\Previsor-de-Hits-IA\data\processed\kmeans_cluster_assignments_20251114_121726.csv


### 7) Additional diagnostics (Elbow, CH, DB) and per-sample Silhouette


In [11]:

from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score, silhouette_samples

Ks = list(range(2, 9))
rows = []
for k in Ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    lbl = km.fit_predict(X_feat)
    inertia = km.inertia_
    try:
        ch = calinski_harabasz_score(X_feat, lbl)
    except Exception:
        ch = np.nan
    try:
        db = davies_bouldin_score(X_feat, lbl)
    except Exception:
        db = np.nan
    try:
        sil = silhouette_score(X_feat, lbl)
    except Exception:
        sil = np.nan
    rows.append({'K':k, 'inertia':inertia, 'calinski_harabasz':ch, 'davies_bouldin':db, 'silhouette':sil})

diag_df = pd.DataFrame(rows)
diag_df.to_csv(PROCESSED_DIR / 'kmeans_diagnostics_by_k.csv', index=False)

fig, ax = plt.subplots(2,2, figsize=(10,8)); ax = ax.ravel()
ax[0].plot(diag_df['K'], diag_df['inertia'], '-o'); ax[0].set_title('Elbow (Inertia)'); ax[0].set_xlabel('K'); ax[0].set_ylabel('Inertia')
ax[1].plot(diag_df['K'], diag_df['silhouette'], '-o'); ax[1].set_title('Silhouette'); ax[1].set_xlabel('K'); ax[1].set_ylabel('Score')
ax[2].plot(diag_df['K'], diag_df['calinski_harabasz'], '-o'); ax[2].set_title('Calinski-Harabasz'); ax[2].set_xlabel('K'); ax[2].set_ylabel('Score')
ax[3].plot(diag_df['K'], diag_df['davies_bouldin'], '-o'); ax[3].set_title('Davies-Bouldin (lower is better)'); ax[3].set_xlabel('K'); ax[3].set_ylabel('Score')
plt.tight_layout(); plt.savefig(PROCESSED_DIR / 'kmeans_diagnostics_grid.png', dpi=120); plt.close()

km_best = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_feat)
labels_best = km_best.labels_
sil_samp = silhouette_samples(X_feat, labels_best)
ss_df = pd.DataFrame({'silhouette': sil_samp, 'cluster': labels_best})
fig, ax = plt.subplots(figsize=(8,5))
start = 0
for cl in sorted(np.unique(labels_best)):
    vals = ss_df.loc[ss_df['cluster']==cl, 'silhouette'].sort_values().values
    ax.fill_betweenx(np.arange(start, start+len(vals)), 0, vals, alpha=0.6, label=f'Cluster {cl}')
    start += len(vals)
ax.axvline(ss_df['silhouette'].mean(), color='k', ls='--', lw=1)
ax.set_title(f'Silhouette per sample (K={best_k})'); ax.set_xlabel('Silhouette'); ax.set_yticks([]); ax.legend(loc='lower right')
plt.tight_layout(); plt.savefig(PROCESSED_DIR / f'silhouette_samples_k{best_k}.png', dpi=120); plt.close()
print('Diagnostics saved.')


Diagnostics saved.
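
The diagnostics do not all read the same way: Silhouette and Calinski-Harabasz are higher-is-better, Davies-Bouldin is lower-is-better, and inertia is read by looking for an elbow rather than an optimum. A small sketch of how each metric would pick K, using synthetic blobs from `make_blobs` (assumed toy data, not the notebook's dataset):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Three tight, well-separated blobs (toy data)
X_toy, _ = make_blobs(n_samples=600, centers=[[0, 0], [10, 0], [0, 10]],
                      cluster_std=0.6, random_state=0)

rows = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    rows.append({
        'K': k,
        'inertia': km.inertia_,
        'silhouette': silhouette_score(X_toy, km.labels_),
        'calinski_harabasz': calinski_harabasz_score(X_toy, km.labels_),
        'davies_bouldin': davies_bouldin_score(X_toy, km.labels_),
    })
toy_diag = pd.DataFrame(rows).set_index('K')

# Each metric picks K in its own direction
picks = {
    'silhouette': int(toy_diag['silhouette'].idxmax()),                # higher is better
    'calinski_harabasz': int(toy_diag['calinski_harabasz'].idxmax()),  # higher is better
    'davies_bouldin': int(toy_diag['davies_bouldin'].idxmin()),        # lower is better
}
print(toy_diag.round(3))
print(picks)
```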


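### 8) Clustering stability (Bootstrap + ARI)
We refit K-Means on 20 random 70% subsamples (drawn without replacement) and compare each refit with the base model's predictions on the same rows using the Adjusted Rand Index (ARI). Values close to 1 mean the partition barely changes from one subsample to the next.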


In [12]:

from sklearn.metrics import adjusted_rand_score
n_boot = 20
sample_frac = 0.7
N = X_feat.shape[0]
base_km = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_feat)
aris = []
for b in range(n_boot):
    rng = np.random.RandomState(42 + b)
    idx = rng.choice(N, int(sample_frac*N), replace=False)
    km_b = KMeans(n_clusters=best_k, n_init=10, random_state=42+b).fit(X_feat[idx])
    pred_base_on_sub = base_km.predict(X_feat[idx])
    ar = adjusted_rand_score(pred_base_on_sub, km_b.labels_)
    aris.append(ar)
aris = np.array(aris)
ari_mean, ari_std = float(np.mean(aris)), float(np.std(aris))

pd.Series(aris).to_csv(PROCESSED_DIR / f'bootstrap_ari_k{best_k}.csv', index=False)
fig, ax = plt.subplots(figsize=(6,4))
ax.hist(aris, bins=10, alpha=0.8)
ax.axvline(ari_mean, color='k', ls='--', label=f'Mean ARI = {ari_mean:.3f} ± {ari_std:.3f}')
ax.set_title(f'Stability (bootstrap ARI), K={best_k}')
ax.set_xlabel('ARI'); ax.set_ylabel('Frequency'); ax.legend()
plt.tight_layout(); plt.savefig(PROCESSED_DIR / f'bootstrap_ari_k{best_k}.png', dpi=120); plt.close()
print(f'Mean ARI (±sd) = {ari_mean:.3f} ± {ari_std:.3f}')


Mean ARI (±sd) = 1.000 ± 0.000
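
Two details help when reading this number. ARI compares groupings rather than label values, so it does not matter if a refit numbers the clusters differently. Also, the loop above draws each subsample without replacement; a bootstrap in the strict sense would draw N rows with replacement. A sketch of both points on toy values (the arrays below are illustrative only):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Same grouping, different label numbers: ARI is still perfect
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]
ari_relabelled = adjusted_rand_score(a, b)
print(ari_relabelled)  # 1.0

rng = np.random.RandomState(42)
N = 1000

# Subsampling as in the cell above: 70% of the rows, each at most once
idx_sub = rng.choice(N, int(0.7 * N), replace=False)

# Strict bootstrap: N draws with replacement, so some rows repeat and others are left out
idx_boot = rng.choice(N, N, replace=True)

print(len(np.unique(idx_sub)), len(np.unique(idx_boot)))
```

Either resampling scheme can be used for a stability check; the point is only that the two are not the same procedure.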


### 9) PCA (95% variance) + K-Means and comparison


In [13]:

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95, random_state=42)
X_pca = pca.fit_transform(X_feat)
km_pca = KMeans(n_clusters=best_k, n_init=10, random_state=42)
labels_pca = km_pca.fit_predict(X_pca)
from sklearn.metrics import silhouette_score
sil_orig = silhouette_score(X_feat, base_km.labels_) if 'base_km' in globals() else silhouette_score(X_feat, km_final.labels_)
sil_pca = silhouette_score(X_pca, labels_pca)
print(f'Silhouette (original): {sil_orig:.3f}')
print(f'Silhouette (PCA 95%): {sil_pca:.3f}')
print('PCA components:', X_pca.shape[1])

if X_pca.shape[1] >= 2:
    fig, ax = plt.subplots(figsize=(6,5))
    sc = ax.scatter(X_pca[:,0], X_pca[:,1], c=labels_pca, s=5, cmap='tab10', alpha=0.6)
    ax.set_title(f'Clusters in PCA space (2D), K={best_k}')
    ax.set_xlabel('PC1'); ax.set_ylabel('PC2')
    plt.colorbar(sc, ax=ax, label='cluster')
    plt.tight_layout(); plt.savefig(PROCESSED_DIR / f'pca2_clusters_k{best_k}.png', dpi=120); plt.close()

import joblib
joblib.dump(pca, PROCESSED_DIR / 'pca_95_var.pkl')


Silhouette (original): 0.101
Silhouette (PCA 95%): 0.106
PCA components: 100


['D:\\Previsor-de-Hits-IA\\data\\processed\\pca_95_var.pkl']
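
Passing a float between 0 and 1 as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction; here that reduced 211 features to 100 components. Note that the two silhouette values above are computed in different spaces, so the small gap (0.101 vs 0.106) is indicative rather than a strict like-for-like comparison. A sketch of the variance rule on synthetic correlated data (the latent-factor setup is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 40 observed columns driven by 5 latent factors plus a little noise (toy data)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 40))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(500, 40))

pca_toy = PCA(n_components=0.95).fit(X_toy)
cum = np.cumsum(pca_toy.explained_variance_ratio_)

print('components kept    :', pca_toy.n_components_)
print('cumulative variance:', cum.round(3))
```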

### 10) Profiles and examples per cluster


In [14]:

annot_df = X_tr_df.copy()
annot_df['cluster'] = X_num['cluster'].values
if 'kmeans_hit_pred' in X_num.columns:
    annot_df['kmeans_hit_pred'] = X_num['kmeans_hit_pred'].values
if 'is_hit' in X_num.columns:
    annot_df['is_hit'] = X_num['is_hit'].values

name_col = 'track_name' if 'track_name' in annot_df.columns else None
genre_col = 'playlist_genre' if 'playlist_genre' in annot_df.columns else None
pop_col = 'track_popularity' if 'track_popularity' in annot_df.columns else None

def top_values(series, k=3):
    if series is None or series.isna().all():
        return []
    vc = series.value_counts()
    return [f"{idx} ({cnt})" for idx, cnt in vc.head(k).items()]

profiles = []
for cl, grp in annot_df.groupby('cluster'):
    n = len(grp)
    hit_rate = grp['is_hit'].mean() if 'is_hit' in grp.columns else np.nan
    top_genres = top_values(grp[genre_col], 3) if genre_col else []
    year_cols = [c for c in grp.columns if c.startswith('cat__release_year_')]
    top_years = []
    if year_cols:
        sums = grp[year_cols].sum().sort_values(ascending=False)
        top_years = [f"{c.replace('cat__release_year_','')} ({int(v)})" for c, v in sums.head(3).items()]
    pop_mean = grp[pop_col].mean() if pop_col else np.nan
    profiles.append({'cluster': int(cl), 'n': int(n), 'hit_rate': float(hit_rate) if pd.notna(hit_rate) else np.nan,
                     'popularity_mean': float(pop_mean) if pd.notna(pop_mean) else np.nan,
                     'top_genres': '; '.join(top_genres), 'top_release_years': '; '.join(top_years)})

profiles_df = pd.DataFrame(profiles).sort_values('hit_rate', ascending=False)
profiles_df.to_csv(PROCESSED_DIR / 'cluster_profiles.csv', index=False)

examples_rows = []
N_EX = 5
for cl, grp in annot_df.groupby('cluster'):
    g = grp.copy()
    if pop_col:
        g = g.sort_values(pop_col, ascending=False)
    g = g.head(N_EX)
    for _, r in g.iterrows():
        examples_rows.append({'cluster': int(cl), 'track_name': r.get(name_col, None),
                              'playlist_genre': r.get(genre_col, None),
                              'track_popularity': r.get(pop_col, None),
                              'is_hit': r.get('is_hit', None)})
examples_df = pd.DataFrame(examples_rows)
examples_df.to_csv(PROCESSED_DIR / 'cluster_examples_topN.csv', index=False)

profiles_df


Unnamed: 0,cluster,n,hit_rate,popularity_mean,top_genres,top_release_years
0,0,2716,0.358984,54.814433,pop (330); electronic (287); rock (242),2024.0 (787); 2023.0 (409); 2022.0 (232)
1,1,2115,0.33617,54.692671,electronic (302); latin (228); hip-hop (188),2024.0 (645); 2023.0 (329); 2022.0 (176)


### 11) Export model artifacts


In [15]:

import json, joblib, shutil, time
stamp = time.strftime('%Y%m%d_%H%M%S')
artifacts_dir = PROCESSED_DIR / f'kmeans_artifacts_{stamp}'
artifacts_dir.mkdir(parents=True, exist_ok=True)

joblib.dump(km_final, artifacts_dir / 'kmeans_final.pkl')
if 'pca' in globals():
    joblib.dump(pca, artifacts_dir / 'pca_95_var.pkl')

meta = {'best_k': int(best_k), 'n_features': int(len(feature_cols))}
if 'is_hit' in X_num.columns:
    c2h = (X_num.groupby('cluster')['is_hit'].mean() >= 0.5).astype(int).to_dict()
    meta['cluster_to_hit'] = {int(k): int(v) for k,v in c2h.items()}
with open(artifacts_dir / 'metadata.json', 'w', encoding='utf-8') as f:
    json.dump(meta, f, ensure_ascii=False, indent=2)

for p in [
    PROCESSED_DIR / 'kmeans_diagnostics_by_k.csv',
    PROCESSED_DIR / 'kmeans_diagnostics_grid.png',
    PROCESSED_DIR / f'silhouette_samples_k{best_k}.png',
    PROCESSED_DIR / f'bootstrap_ari_k{best_k}.csv',
    PROCESSED_DIR / f'bootstrap_ari_k{best_k}.png',
    PROCESSED_DIR / 'silhouette_vs_k.png',
    PROCESSED_DIR / 'cluster_hit_rate.png',
    PROCESSED_DIR / 'top_features_per_cluster.csv',
    PROCESSED_DIR / 'cluster_profiles.csv',
    PROCESSED_DIR / 'cluster_examples_topN.csv',
    PROCESSED_DIR / f'pca2_clusters_k{best_k}.png',
]:
    if p.exists():
        shutil.copy2(p, artifacts_dir / p.name)

print('Artifacts exported to:', artifacts_dir)


Artifacts exported to: D:\Previsor-de-Hits-IA\data\processed\kmeans_artifacts_20251114_121756
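
To reuse the exported model, new data needs the same feature columns, in the same order, as the training matrix; `metadata.json` stores `n_features`, which allows at least a basic shape check. A hedged round-trip sketch with a toy model in a temporary directory (the file names mirror the ones above; the data and the check itself are illustrative assumptions):

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))  # toy stand-in for X_feat
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)

with tempfile.TemporaryDirectory() as tmp:
    art = Path(tmp)

    # Export, as in the cell above
    joblib.dump(km, art / 'kmeans_final.pkl')
    (art / 'metadata.json').write_text(
        json.dumps({'best_k': 2, 'n_features': int(X_train.shape[1])}),
        encoding='utf-8')

    # Later, possibly in another session: load the model and its metadata
    km_loaded = joblib.load(art / 'kmeans_final.pkl')
    meta = json.loads((art / 'metadata.json').read_text(encoding='utf-8'))

# Basic guard before predicting on new rows
X_new = rng.normal(size=(5, 4))
if X_new.shape[1] != meta['n_features']:
    raise ValueError('feature count does not match the exported model')

new_clusters = km_loaded.predict(X_new)
print(new_clusters)
```

A shape check does not catch reordered columns, so saving the list of feature names next to the model would be a stricter safeguard.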
