# Spotify Analysis: What Makes a Song Popular?

**Executive Summary** - Key findings from the analysis.

For detailed exploration, see:
- [01_exploration.ipynb](01_exploration.ipynb) - Data loading & quality checks
- [02_feature_analysis.ipynb](02_feature_analysis.ipynb) - Deep feature analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

df = pd.read_csv(Path("../data/processed/spotify_cleaned.csv"))
df['release_date'] = pd.to_datetime(df['release_date'])

audio_features = ['danceability', 'energy', 'loudness', 'speechiness', 
                  'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']

## Dataset Overview

In [None]:
print(f"Total tracks analyzed: {len(df):,}")
print(f"Unique artists: {df['artist'].nunique():,}")
print(f"Date range: {df['release_date'].min().year} - {df['release_date'].max().year}")
print(f"Average popularity: {df['popularity'].mean():.1f}/100")

## Finding 1: Feature-Popularity Correlations

Which audio characteristics are most associated with popular songs?

In [None]:
correlations = df[audio_features + ['popularity']].corr()['popularity'].drop('popularity').sort_values()

plt.figure(figsize=(10, 6))
colors = ['#e74c3c' if x < 0 else '#27ae60' for x in correlations]
correlations.plot(kind='barh', color=colors)
plt.xlabel('Correlation with Popularity')
plt.title('Audio Feature Correlations with Popularity')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()
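
`DataFrame.corr()` defaults to Pearson correlation, which only captures linear association. As a rough cross-check, the sketch below computes Spearman rank correlations, which also pick up monotonic but non-linear relationships. The helper name `rank_correlations` and the synthetic `demo` frame are illustrative assumptions and not part of the original analysis. On the real data one would pass `df` and `audio_features` instead.

```python
import numpy as np
import pandas as pd

def rank_correlations(frame, features, target="popularity"):
    """Spearman correlation of each feature with the target, sorted ascending."""
    corr = frame[features + [target]].corr(method="spearman")[target]
    return corr.drop(target).sort_values()

# Tiny synthetic stand-in for df (values are made up for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"energy": rng.random(200), "valence": rng.random(200)})
# Popularity is monotonic but non-linear in energy, and unrelated to valence
demo["popularity"] = np.exp(3 * demo["energy"])

spearman_demo = rank_correlations(demo, ["energy", "valence"])
print(spearman_demo)
```

If the Pearson and Spearman rankings of features disagree noticeably, that may hint at non-linear relationships or outliers worth a closer look.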

## Finding 2: Hit Song Profile

How do songs in the top 10% by popularity differ from the rest?

In [None]:
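# Popularity score at the 90th percentile; tracks at or above it count as hits.
# Ties at the threshold can push the hit group slightly above 10% of tracks.
hit_threshold = df['popularity'].quantile(0.9)
hits = df[df['popularity'] >= hit_threshold]
non_hits = df[df['popularity'] < hit_threshold]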

comparison = pd.DataFrame({
    'Hits': hits[audio_features].mean(),
    'Others': non_hits[audio_features].mean()
}).round(3)

plt.figure(figsize=(12, 6))
x = np.arange(len(audio_features))
width = 0.35

plt.bar(x - width/2, comparison['Hits'], width, label=f'Hits (Top 10%, n={len(hits):,})', color='#f1c40f')
plt.bar(x + width/2, comparison['Others'], width, label=f'Others (n={len(non_hits):,})', color='#95a5a6')

plt.ylabel('Average Value')
plt.title('Hit Songs vs Others: Audio Feature Comparison')
plt.xticks(x, [f.title() for f in audio_features], rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()
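
Raw means put features on very different scales side by side: `loudness` is measured in decibels and `tempo` in beats per minute, while most other features lie between 0 and 1. One possible way to make the bars comparable is to min-max scale each feature before averaging. The sketch below shows the idea on made-up numbers. The helper `scaled_group_means` and the `demo` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def scaled_group_means(frame, features, is_hit):
    """Min-max scale each feature to [0, 1], then average within hits and others."""
    lo = frame[features].min()
    span = (frame[features].max() - lo).replace(0, 1)  # guard against constant columns
    scaled = (frame[features] - lo) / span
    return pd.DataFrame({
        "Hits": scaled[is_hit].mean(),
        "Others": scaled[~is_hit].mean(),
    })

# Made-up values for illustration only
demo = pd.DataFrame({
    "loudness": [-20.0, -10.0, -5.0, 0.0],
    "tempo": [60.0, 120.0, 180.0, 90.0],
    "popularity": [10, 20, 80, 90],
})
is_hit = demo["popularity"] >= 80
scaled_means = scaled_group_means(demo, ["loudness", "tempo"], is_hit)
print(scaled_means.round(3))
```

Scaled means lose their original units, so they suit a visual comparison better than reporting.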

## Finding 3: Optimal Song Duration

Which duration range has the highest average popularity?

In [None]:
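# Open-ended last bin so tracks longer than 10 minutes land in '5+ min' instead of NaN
duration_bins = pd.cut(df['duration_min'], bins=[0, 2, 3, 4, 5, np.inf],
                       labels=['<2 min', '2-3 min', '3-4 min', '4-5 min', '5+ min'])
# observed=False keeps empty bins in the result and avoids the pandas FutureWarning
duration_popularity = df.groupby(duration_bins, observed=False)['popularity'].mean()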

plt.figure(figsize=(8, 5))
colors = ['#3498db' if x != duration_popularity.max() else '#e74c3c' for x in duration_popularity]
duration_popularity.plot(kind='bar', color=colors)
plt.xlabel('Duration')
plt.ylabel('Average Popularity')
plt.title('Song Duration vs Average Popularity')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

print(f"Optimal duration: {duration_popularity.idxmax()}")
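
A bin's average can be misleading when only a handful of tracks fall into it, so it may help to report the track count next to the mean. The sketch below does this on a small made-up frame, using an open-ended upper edge (`np.inf`) for the last bin so very long tracks are counted. The `demo` data and the `summary` name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up durations (minutes) and popularity scores for illustration only
demo = pd.DataFrame({
    "duration_min": [1.5, 2.5, 3.2, 3.8, 4.5, 12.0],
    "popularity": [30, 50, 70, 60, 40, 20],
})

bins = [0, 2, 3, 4, 5, np.inf]
labels = ["<2 min", "2-3 min", "3-4 min", "4-5 min", "5+ min"]
demo["duration_bin"] = pd.cut(demo["duration_min"], bins=bins, labels=labels)

# Mean popularity and number of tracks per duration bin
summary = demo.groupby("duration_bin", observed=False)["popularity"].agg(["mean", "count"])
print(summary)
```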

## Key Takeaways

In [None]:
sorted_corr = correlations.abs().sort_values(ascending=False)

print("=" * 60)
print("KEY FINDINGS: WHAT MAKES A SONG POPULAR?")
print("=" * 60)

print("\n1. STRONGEST PREDICTORS:")
for feature in sorted_corr.head(3).index:
    corr_val = correlations[feature]
    direction = "Higher" if corr_val > 0 else "Lower"
    print(f"   - {direction} {feature} is associated with higher popularity (r = {corr_val:+.3f})")

print("\n2. HIT SONG CHARACTERISTICS (Top 10%):")
for feature in ['loudness', 'instrumentalness', 'danceability']:
    hit_val = hits[feature].mean()
    other_val = non_hits[feature].mean()
    # Relative difference is undefined when the baseline is zero, so report NaN rather than 0
    diff_pct = ((hit_val - other_val) / abs(other_val)) * 100 if other_val != 0 else np.nan
    print(f"   - {feature.title()}: {hit_val:.3f} vs {other_val:.3f} ({diff_pct:+.1f}%)")

print(f"\n3. OPTIMAL DURATION: {duration_popularity.idxmax()}")

print("\n" + "=" * 60)
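
Percent differences are hard to compare across features, especially for `loudness`, whose values are negative decibels. A standardized mean difference (Cohen's d with a pooled standard deviation) is one alternative that puts every feature on the same footing. This is a minimal sketch on made-up numbers. The function name `standardized_difference` is hypothetical, and on the real data one would pass `hits[feature]` and `non_hits[feature]`.

```python
import numpy as np
import pandas as pd

def standardized_difference(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up loudness values (dB) for illustration only
hit_loudness = pd.Series([-5.0, -4.0, -3.0])
other_loudness = pd.Series([-10.0, -8.0, -6.0])

effect = standardized_difference(hit_loudness, other_loudness)
print(f"Cohen's d: {effect:.2f}")
```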