# Main Hypothesis: Album names with certain linguistic characteristics (such as being short, emotional, or easy to remember) are positively correlated with higher popularity on Spotify


In [2]:

# BLOCK 1: Import libraries and download the dataset
import os
import pandas as pd
import kagglehub

path = kagglehub.dataset_download("melissamonfared/spotify-tracks-attributes-and-popularity")

print("Downloaded dataset path:", path)
print("Available files:", os.listdir(path))

# Automatically pick the first CSV file found in the dataset folder
csv_files = [f for f in os.listdir(path) if f.endswith(".csv")]

if len(csv_files) == 0:
    raise FileNotFoundError("No CSV file was found in the dataset.")
else:
    print("CSV file detected:", csv_files[0])
    file_path = os.path.join(path, csv_files[0])

df = pd.read_csv(file_path)

print("Dataset loaded successfully")
print("Shape:", df.shape)
print("Available columns:", df.columns.tolist())

display(df.head())

# BLOCK 2: Load the Hugging Face model for our project
from transformers import pipeline

# No model name is passed, so the pipeline falls back to its default
# English sentiment classifier (named in the warning in the cell output).
sentiment_model = pipeline("sentiment-analysis")

prueba = sentiment_model("This song makes me so happy!")
print("Model test result:", prueba)

# BLOCK 3: Apply the model to the dataset
if 'track_name' in df.columns:

    # Take the first five non-null track names as a quick check
    sample_tracks = df['track_name'].dropna().head(5).tolist()

    print("Sample songs to analyze:")
    for song in sample_tracks:
        print("-", song)

    resultados = [sentiment_model(song)[0] for song in sample_tracks]

    print("\nSentiment analysis results:")
    for nombre, resultado in zip(sample_tracks, resultados):
        print(f"{nombre}: {resultado['label']} (confidence {round(resultado['score'], 3)})")
else:
    print("Column 'track_name' was not found in the dataset.")




Downloaded dataset path: /root/.cache/kagglehub/datasets/melissamonfared/spotify-tracks-attributes-and-popularity/versions/1
Available files: ['dataset.csv']
CSV file detected: dataset.csv
Dataset loaded successfully
Shape: (114000, 21)
Available columns: ['index', 'track_id', 'artists', 'album_name', 'track_name', 'popularity', 'duration_ms', 'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature', 'track_genre']


Unnamed: 0,index,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cuda:0


Model test result: [{'label': 'POSITIVE', 'score': 0.9998815059661865}]
Sample songs to analyze:
- Comedy
- Ghost - Acoustic
- To Begin Again
- Can't Help Falling In Love
- Hold On

Sentiment analysis results:
Comedy: POSITIVE (confidence 1.0)
Ghost - Acoustic: NEGATIVE (confidence 0.94)
To Begin Again: POSITIVE (confidence 0.996)
Can't Help Falling In Love: POSITIVE (confidence 0.996)
Hold On: POSITIVE (confidence 0.999)
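
The hypothesis singles out short titles, so it helps to see how a simple word count behaves on the five sample titles printed above. The sketch below uses whitespace splitting, the same rule the regression cell later in this notebook applies for `title_length`. The helper name `count_words` is hypothetical and only used for illustration.

```python
# Hypothetical helper: count whitespace-separated tokens in a title.
def count_words(title):
    return len(str(title).split())


# The five sample titles from the output above.
sample_titles = [
    "Comedy",
    "Ghost - Acoustic",
    "To Begin Again",
    "Can't Help Falling In Love",
    "Hold On",
]

word_counts = {title: count_words(title) for title in sample_titles}
for title, n_words in word_counts.items():
    print(f"{title}: {n_words}")
```

A standalone hyphen, as in "Ghost - Acoustic", counts as a token under this rule, so version suffixes can inflate the measured length of some titles.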


In [None]:
# Block 4: Create our sentiment variable on a sample
if 'df' not in locals() or 'sentiment_model' not in locals():
    raise ValueError("Make sure you have run Part 1 before continuing.")

# Drop rows without a track name before sampling
df = df.dropna(subset=['track_name']).copy()

# Fixed seed so the 5,000-row sample is reproducible
df_sample = df.sample(5000, random_state=42)

df_sample['sentiment_result'] = df_sample['track_name'].apply(lambda x: sentiment_model(x)[0]['label'])
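
Calling the model once per row, as in the cell above, can be slow for thousands of titles. Hugging Face pipelines also accept a list of strings and return one result dict per input, so the titles can be sent in chunks. The sketch below is one possible way to do that, not the notebook's method: `classify_in_batches` and `fake_model` are hypothetical names, and `fake_model` only stands in for `sentiment_model` so the example runs without downloading anything.

```python
def classify_in_batches(texts, model, batch_size=64):
    """Return one label per text, calling `model` on chunks of texts."""
    labels = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # `model` is expected to return a list of {"label": ..., "score": ...}
        labels.extend(result["label"] for result in model(chunk))
    return labels


def fake_model(texts):
    """Stand-in for a sentiment pipeline (hypothetical, illustration only)."""
    return [
        {"label": "NEGATIVE" if "ghost" in t.lower() else "POSITIVE", "score": 1.0}
        for t in texts
    ]


titles = ["Comedy", "Ghost - Acoustic", "To Begin Again", "Hold On"]
labels = classify_in_batches(titles, fake_model, batch_size=3)
print(labels)
```

With the real pipeline, the call would presumably look like `classify_in_batches(df_sample['track_name'].tolist(), sentiment_model)`, and the returned list can be assigned to `df_sample['sentiment_result']`.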


In [None]:
df_sample

Unnamed: 0,index,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,sentiment_result,indice_afectivo
7168,6477,35G3OPh6mgKHN9I0W76Ybi,Sargeist,Let the Devil In,As Darkness Tears the World Apart,17,235440,True,0.127,0.9710,...,0.0872,0.000002,0.931000,0.0727,0.0244,84.483,4,black-metal,POSITIVE,0.0244
27145,24155,24vNJIb0lDqw9R084eYiQ6,Omar S;John FM,The Best!,Heard'chew Single,39,592085,False,0.808,0.7270,...,0.0841,0.006190,0.784000,0.1980,0.3030,122.982,4,detroit-techno,NEGATIVE,0.3030
59166,53101,6uBhi9gBXWjanegOb2Phh0,Zedd;Alessia Cara,Stay,Stay,76,210090,False,0.690,0.6220,...,0.0622,0.253000,0.000000,0.1160,0.5440,102.040,4,house,POSITIVE,0.5440
44167,39180,0bhf5QzhPVVDrRnEK938gx,Zoe Wees,pov: it's 2021,Girls Like Us,0,190668,False,0.529,0.6480,...,0.0422,0.288000,0.000000,0.0569,0.1640,100.146,4,german,POSITIVE,0.1640
97549,85514,4JAtDXVF8f5OgCkz1vEAfU,Rybičky 48,Muži,Muži,38,220031,False,0.569,0.7140,...,0.0380,0.008250,0.000000,0.2190,0.5090,120.054,4,punk-rock,NEGATIVE,0.5090
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87034,76409,4OKWI2ODVp27li22Bpey4d,Andrea Bocelli,Romanza (20th Anniversary Edition / Deluxe),E chiove,22,265760,False,0.426,0.3970,...,0.0330,0.555000,0.009200,0.0862,0.2000,143.984,4,opera,POSITIVE,0.2000
66340,59386,0WcALqtM7nYmZ6SjwGoULv,Martin Shamoonpour,8 Bit,Gole Pamchal (Folk Song),0,113000,False,0.106,0.5300,...,0.0341,0.508000,0.860000,0.1250,0.6870,80.563,4,iranian,POSITIVE,0.6870
64570,57744,6ngu6sBaJMM05mAVATApZJ,KALEO,Feeling Good - Adult Pop Favorites,Save Yourself,0,273880,False,0.534,0.3380,...,0.0346,0.515000,0.000009,0.1240,0.1110,120.399,4,indie,NEGATIVE,0.1110
71353,64050,1qCQTy0fTXerET4x8VHyr9,Louis Armstrong,What A Wonderful World,What A Wonderful World,73,137520,False,0.399,0.2580,...,0.0330,0.792000,0.000002,0.1280,0.1920,108.174,3,jazz,POSITIVE,0.1920


In [None]:
df_sample["sentiment_result"].value_counts()


sentiment_result,count
POSITIVE,3217
NEGATIVE,1783
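
To read these counts as shares of the 5,000-title sample, divide each by the total. A minimal sketch using only the two counts shown above:

```python
# Label counts copied from the value_counts() output above.
counts = {"POSITIVE": 3217, "NEGATIVE": 1783}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Roughly 64% of the sampled titles are labeled positive, so the binary sentiment variable is moderately unbalanced.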


MODEL


In [7]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# 1. Clean the original dataset:
df_clean = df.dropna(subset=[
    "popularity",
    "danceability",
    "energy",
    "loudness",
    "track_name"
]).copy()

# 2. Sample of 25,000 songs
df_sample = df_clean.sample(25000, random_state=42).copy()

# 3. Create the sentiment_result variable using Hugging Face
df_sample["sentiment_result"] = df_sample["track_name"].apply(
    lambda x: sentiment_model(x)[0]["label"]
)

# 4. Build the model variables

# Variable: positive sentiment = 1
df_sample["sentiment_bin"] = df_sample["sentiment_result"].apply(
    lambda x: 1 if str(x).lower() == "positive" else 0
)

# Additional linguistic feature: title length (in words)
df_sample["title_length"] = df_sample["track_name"].apply(
    lambda x: len(str(x).split())
)

# 5. Drop nulls in the model variables
df_iv = df_sample.dropna(subset=[
    "popularity",
    "danceability",
    "energy",
    "loudness",
    "title_length",
    "sentiment_bin"
]).copy()

# ===========================
# 6. OLS MODEL (baseline)
# ===========================
y = df_iv["popularity"]

X_ols = df_iv[[
    "danceability",
    "energy",
    "loudness",
    "title_length",
    "sentiment_bin"
]]
X_ols = sm.add_constant(X_ols)

ols_model = sm.OLS(y, X_ols).fit()
print("\n===== OLS RESULTS =====")
print(ols_model.summary())

# ===========================
# 7. WLS MODEL (improvement over OLS)
# ===========================

# OLS residuals
resid = ols_model.resid

# Weights: inverse of the squared residuals (with epsilon to avoid division by zero)
epsilon = 1e-8
weights = 1 / (resid**2 + epsilon)

wls_model = sm.WLS(y, X_ols, weights=weights).fit()
print("\n===== WLS RESULTS =====")
print(wls_model.summary())
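
One caveat about the weighting rule in this cell: using each observation's own squared OLS residual as its variance estimate gives extremely large weights to observations that happen to sit near the fitted line. The sketch below only illustrates that arithmetic with made-up residual values (they are assumptions, not values from the notebook).

```python
epsilon = 1e-8

# Made-up residuals for illustration: one point almost on the fitted line,
# one moderately off, one far off.
example_resid = [0.001, 5.0, 20.0]

# Same weighting rule as the cell above: inverse squared residual.
example_weights = [1 / (r ** 2 + epsilon) for r in example_resid]
for r, w in zip(example_resid, example_weights):
    print(f"resid={r:>6}: weight={w:.6g}")

# Ratio between the largest and the smallest weight.
ratio = max(example_weights) / min(example_weights)
print(f"largest / smallest weight: {ratio:.3g}")
```

With weights this uneven, the fit is dominated by the few points whose residuals are close to zero. A more common feasible-WLS approach models the variance as a function of the regressors instead, for example by regressing the log of the squared residuals on the predictors and weighting by the inverse of the fitted variance.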




===== OLS RESULTS =====
                            OLS Regression Results                            
Dep. Variable:             popularity   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     42.60
Date:                Tue, 18 Nov 2025   Prob (F-statistic):           7.25e-44
Time:                        16:55:21   Log-Likelihood:            -1.1298e+05
No. Observations:               25000   AIC:                         2.260e+05
Df Residuals:                   24994   BIC:                         2.260e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            4
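
The summary rounds R-squared to 0.008, which hides the exact share of variance explained. For an OLS model with an intercept, the F-statistic and R-squared are linked by F = (R^2 / df_model) / ((1 - R^2) / df_resid), so the unrounded value can be recovered from the reported F-statistic and degrees of freedom. A small sketch using only numbers printed in the summary above:

```python
f_stat = 42.60      # F-statistic from the summary
df_model = 5        # Df Model
df_resid = 24994    # Df Residuals

# Invert F = (R2 / df_model) / ((1 - R2) / df_resid) to solve for R2.
r_squared = (f_stat * df_model) / (f_stat * df_model + df_resid)
print(f"Implied R-squared: {r_squared:.5f}")
```

The implied value is a little under 0.0085, so the five predictors together explain less than 1% of the variance in popularity even though the overall F-test is highly significant with 25,000 observations.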

RESULTS


In [8]:

ols_robust = sm.OLS(y, X_ols).fit(cov_type='HC3')

print("\n===== ROBUST OLS RESULTS (HC3) =====")
print(ols_robust.summary())
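
Passing `cov_type='HC3'` leaves the coefficients unchanged and only replaces the covariance estimate with a heteroskedasticity-robust one. The sketch below reproduces that calculation with NumPy on synthetic data, using the standard HC3 form (X'X)^-1 X' diag(e_i^2 / (1 - h_i)^2) X (X'X)^-1. The data and variable names are assumptions for illustration and are unrelated to the notebook's dataset.

```python
import numpy as np

# Synthetic data (assumption): noise spread grows with |x|.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y_syn = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))
X = np.column_stack([np.ones(n), x])  # intercept + one regressor

# OLS coefficients and residuals.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y_syn
resid_syn = y_syn - X @ beta

# Leverage h_i = x_i' (X'X)^-1 x_i for each observation.
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# HC3 "meat": squared residuals inflated by 1 / (1 - h_i)^2.
omega = (resid_syn / (1.0 - h)) ** 2
meat = X.T @ (X * omega[:, None])
cov_hc3 = XtX_inv @ meat @ XtX_inv
se_hc3 = np.sqrt(np.diag(cov_hc3))

# Classical (nonrobust) standard errors for comparison.
sigma2 = resid_syn @ resid_syn / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

print("coefficients:", beta)
print("classical SE:", se_classical)
print("HC3 SE:      ", se_hc3)
```

Because only the standard errors change, R-squared and the coefficient estimates in the robust summary match the nonrobust fit, while the test statistics and confidence intervals can differ.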


===== ROBUST OLS RESULTS (HC3) =====
                            OLS Regression Results                            
Dep. Variable:             popularity   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     43.54
Date:                Tue, 18 Nov 2025   Prob (F-statistic):           7.30e-45
Time:                        16:55:45   Log-Likelihood:            -1.1298e+05
No. Observations:               25000   AIC:                         2.260e+05
Df Residuals:                   24994   BIC:                         2.260e+05
Df Model:                           5                                         
Covariance Type:                  HC3                                         
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
cons