# Spotify Song Popularity Analysis - Practice Assignment

## Overview
This notebook demonstrates the process of analyzing Spotify song features to predict song popularity using machine learning. This is a practice assignment to showcase data science workflow and documentation.

## Steps We'll Follow:
1. **Setup and Data Loading**
   - Import required libraries
   - Load and examine the dataset

2. **Data Preprocessing**
   - Clean the data
   - Engineer new features
   - Transform categorical variables

3. **Model Development**
   - Split data into training and testing sets
   - Scale features
   - Train and optimize model

4. **Evaluation and Visualization**
   - Assess model performance
   - Create visualizations
   - Document insights

## Note:
This is a practice exercise. Each step will be documented with explanations of what we're doing and why. The focus is on understanding the machine learning workflow and proper documentation practices.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import os

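# Set style for better visualizations
# (the bare 'seaborn' matplotlib style name was removed in recent matplotlib
# releases, so the theme is set through seaborn itself)
sns.set_theme(style='darkgrid', palette='viridis')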

## Step 1: Setup and Initial Data Loading

First, we'll import all necessary libraries for our analysis. Each library serves a specific purpose:
- pandas: for data manipulation and analysis
- numpy: for numerical operations
- scikit-learn: for machine learning tools
- matplotlib/seaborn: for visualization

We're also setting visual styles for consistent, professional-looking plots.

In [None]:
# Load the dataset
df = pd.read_csv('../Untitled/Resources/spotify_songs.csv')

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nSample of the data:")
df.head()

## Step 2: Data Preprocessing and Feature Engineering

In this section, we'll:
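1. Create popularity categories (High/Medium/Low) instead of using raw scores, using fixed cut points on the 0-100 `track_popularity` scale:
   - High: ≥67 (intended as the top 25%)
   - Medium: 34-66 (intended as the middle 50%)
   - Low: <34 (intended as the bottom 25%)

   The percentage shares hold only if the score distribution happens to match these cut points, so they should be checked against the data.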

2. Engineer new features:
   - energy_danceability: the product of energy and danceability
   - loudness_scaled: loudness min-max scaled to the range 0 to 1
   - tempo_scaled: tempo divided by the maximum tempo in the dataset

3. Handle categorical variables:
   - One-hot encode genre information

We'll visualize the distribution of our target variable to understand class balance.
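
Whether the fixed cut points correspond to the stated percentile shares depends on the score distribution. A minimal sketch of that check follows, using a synthetic `scores` series as a stand-in for `df['track_popularity']`; the values are made up and integer scores are assumed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['track_popularity']; values are illustrative only
rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(0, 101, size=1000), name="track_popularity")

# Fixed cut points used in this notebook (integer scores assumed):
# Low: 0-33, Medium: 34-66, High: 67-100
fixed = pd.cut(scores, bins=[-1, 33, 66, 100], labels=["Low", "Medium", "High"])

# Share of songs that actually falls in each fixed category
fixed_share = fixed.value_counts(normalize=True).sort_index()

# Quartile boundaries observed in the data, for comparison with 34 and 67
observed_quartiles = scores.quantile([0.25, 0.75])

print(fixed_share)
print(observed_quartiles)
```

If the observed quartiles sit far from 34 and 67, the category labels still work, but the "top 25%" reading does not.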

In [None]:
# Create popularity categories
def categorize_popularity(x):
    """Map a 0-100 popularity score to a fixed-threshold category."""
    if x >= 67:  # High: 67 and above
        return 'High'
    elif x >= 34:  # Medium: 34 to 66
        return 'Medium'
    else:  # Low: below 34
        return 'Low'

df['popularity_category'] = df['track_popularity'].apply(categorize_popularity)

# Feature engineering
df['energy_danceability'] = df['energy'] * df['danceability']
df['loudness_scaled'] = (df['loudness'] - df['loudness'].min()) / (df['loudness'].max() - df['loudness'].min())
df['tempo_scaled'] = df['tempo'] / df['tempo'].max()

# One-hot encode genre
genre_dummies = pd.get_dummies(df['playlist_genre'], prefix='genre')
df = pd.concat([df, genre_dummies], axis=1)

# Display the distribution of popularity categories
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='popularity_category', order=['Low', 'Medium', 'High'])
plt.title('Distribution of Song Popularity Categories')
plt.xlabel('Popularity Category')
plt.ylabel('Count')
plt.show()

## Step 3: Feature Selection and Data Preparation

Now we'll prepare our data for modeling:
1. Select relevant features:
   - Audio features (danceability, energy, etc.)
   - Engineered features
   - Genre information

2. Split data into training and testing sets:
   - 80% training, 20% testing
   - Use stratification to maintain class distribution

3. Scale features:
   - Use StandardScaler to standardize each feature to zero mean and unit variance
   - Fit the scaler on the training data only, so no test-set statistics leak into training
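
To see what the stratified split in item 2 preserves, the sketch below splits a small, imbalanced set of synthetic labels and compares the class shares in each partition. The data and the `_demo` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for popularity_category
y_demo = pd.Series(["Low"] * 50 + ["Medium"] * 120 + ["High"] * 30)
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y_demo, class proportions are kept (near-)identical in both parts
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```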

In [None]:
# Select features for prediction
audio_features = [
    'danceability', 'energy', 'key', 'loudness_scaled', 'speechiness',
    'acousticness', 'instrumentalness', 'liveness', 'valence',
    'tempo_scaled', 'energy_danceability'
]

genre_columns = [col for col in df.columns if col.startswith('genre_')]
features = audio_features + genre_columns

# Prepare X (features) and y (target)
X = df[features]
y = df['popularity_category']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training set shape:", X_train_scaled.shape)
print("Testing set shape:", X_test_scaled.shape)

## Step 4: Model Training and Optimization

We'll use GridSearchCV with RandomForestClassifier to find the best model:
1. Define parameter grid:
   - n_estimators: number of trees
   - max_depth: tree depth
   - min_samples_split/leaf: controls tree structure

2. Train model with cross-validation:
   - 5-fold cross-validation
   - Parallel processing enabled
   - Track best parameters
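
One possible refinement, not used in this notebook, is to wrap the scaler and classifier in a `Pipeline` so that the scaler is refit inside every cross-validation fold instead of once on the full training set. The sketch below assumes synthetic data from `make_classification` and a deliberately small grid; the step names `scaler` and `rf` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the Spotify features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
small_grid = {
    "rf__n_estimators": [25, 50],
    "rf__max_depth": [5, None],
}

search = GridSearchCV(pipe, param_grid=small_grid, cv=3)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

With this layout, `search.predict` can be called on unscaled test data, because the fitted pipeline applies the scaler itself.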

In [None]:
# Define parameter grid for GridSearch
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Initialize and train model with GridSearch
rf_model = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    verbose=1
)
rf_model.fit(X_train_scaled, y_train)

print("Best parameters:", rf_model.best_params_)

## Step 5: Model Evaluation

Let's evaluate our model's performance:
1. Generate predictions on test set
2. Calculate key metrics:
   - Overall accuracy
   - Per-class precision, recall, F1-score
   - Confusion matrix

This will help us understand how well our model performs across different popularity categories.
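
By default, `confusion_matrix` orders rows and columns by sorted label, which for these strings is alphabetical (High, Low, Medium). Passing `labels` sets the order explicitly. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for illustration
y_true_demo = ["Low", "Low", "Medium", "Medium", "High", "High"]
y_pred_demo = ["Low", "Medium", "Medium", "Medium", "High", "Medium"]

# Explicit ordinal order instead of the default alphabetical one
order = ["Low", "Medium", "High"]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=order)

# Rows are actual classes, columns are predicted classes, both in `order`
print(cm)
```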

In [None]:
# Make predictions with best model
best_model = rf_model.best_estimator_
y_pred = best_model.predict(X_test_scaled)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy Score: {accuracy:.2f}")
print("\nClassification Report:")
print(class_report)

## Step 6: Results Visualization

We'll create visualizations to understand our results:
1. Confusion Matrix:
   - Shows the count of songs for each pair of actual and predicted category
   - Highlights which categories the model confuses with each other

2. Feature Importance:
   - Top 15 most influential features
   - Shows which features the model relies on most when predicting popularity
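
Impurity-based importances from a random forest describe how the model uses its inputs rather than what causes popularity, and they can favor features with many distinct values. As a cross-check, scikit-learn's `permutation_importance` measures the drop in held-out score when a single feature is shuffled. The sketch below runs on synthetic data; the feature names `f0` to `f5` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Spotify features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and record the mean accuracy drop
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

comparison = pd.DataFrame({
    "feature": names,
    "impurity": model.feature_importances_,
    "permutation": result.importances_mean,
}).sort_values("permutation", ascending=False)
print(comparison)
```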

In [None]:
# Plot confusion matrix
plt.figure(figsize=(10, 8))
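# Tick labels come from the fitted model, whose classes_ are in the same
# sorted (alphabetical) order that confusion_matrix uses by default
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=best_model.classes_,
            yticklabels=best_model.classes_)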
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Feature importance plot
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': best_model.feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=feature_importance.head(15))
plt.title('Top 15 Features for Predicting Song Popularity')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

## Step 7: Save and Document Results

Finally, we'll save our results for future reference:
1. Create dedicated output directory
2. Save various outputs:
   - Feature importance rankings
   - Model predictions
   - Performance metrics

This creates a permanent record of our analysis and findings.
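
The outputs listed above do not include the fitted model itself. If reloading it later would be useful, one option is `joblib`, sketched here with a small stand-in model and a temporary directory; the file name `model.joblib` is arbitrary.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model; in the notebook this would be the fitted best_model
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, model_path)      # serialize the fitted estimator
    reloaded = joblib.load(model_path)  # restore it from disk

# The reloaded model should reproduce the original predictions
same_predictions = bool((reloaded.predict(X_demo) == model.predict(X_demo)).all())
print("Predictions match after reload:", same_predictions)
```

In this notebook the fitted `scaler` would need to be saved alongside the model, since predictions depend on both.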

In [None]:
# Create output directory
output_dir = 'analysis_output'
os.makedirs(output_dir, exist_ok=True)

# Save feature importance
feature_importance.to_csv(os.path.join(output_dir, 'feature_importance.csv'), index=False)

# Save predictions
predictions_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})
predictions_df.to_csv(os.path.join(output_dir, 'predictions.csv'), index=False)

# Save model performance metrics
with open(os.path.join(output_dir, 'model_performance.txt'), 'w') as f:
    f.write(f"Best Parameters: {rf_model.best_params_}\n")
    f.write(f"Accuracy Score: {accuracy:.2f}\n")
    f.write("\nClassification Report:\n")
    f.write(class_report)

print("Results saved to", output_dir)