# 🎵 BPM Prediction - Exploratory Data Analysis

## Kaggle Playground Series - Season 5, Episode 9

This notebook contains the initial exploratory data analysis for the BPM prediction competition.

### 🎯 Objectives
- Load and explore the competition datasets
- Understand the target variable (BPM) distribution
- Analyze feature characteristics and relationships
- Identify data quality issues
- Generate insights for feature engineering and modeling


## 📚 Setup and Imports


In [2]:
# Install packages if needed (run this cell first if you get import errors)
import sys
import subprocess

def install_package(package):
    """Install a package with pip into the environment of the running kernel."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Uncomment and run if you get ModuleNotFoundError
# install_package("pandas>=1.5.0")
# install_package("numpy>=1.21.0")
# install_package("matplotlib>=3.5.0")
# install_package("seaborn>=0.11.0")
# install_package("scipy")  # imported below; no minimum version is pinned here
# install_package("scikit-learn>=1.1.0")

# Standard library imports
import os
import warnings
# Suppress all warnings to keep cell output readable (this also hides deprecation notices)
warnings.filterwarnings('ignore')

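# Add src directory to path (relative to the notebook's working directory)
if '../src' not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append('../src')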

# Core data science libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Project utilities
from utils import (
    load_competition_data,
    plot_target_distribution,
    plot_feature_correlations,
    print_data_info,
    calculate_rmse
)
from data_preprocessing import (
    validate_data_quality,
    create_feature_summary,
    detect_outliers
)

# Configure plotting
plt.style.use('default')  # Use default matplotlib style
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
sns.set_palette("husl")
sns.set_style("whitegrid")  # Set seaborn style separately

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✅ Setup complete!")


✅ Setup complete!
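
The helpers imported above from `utils` are project code whose source is not shown in this notebook. For readers without the `src` directory, the following is a minimal sketch of what a loader and an info printer of this kind might look like. The function names `load_competition_data_sketch` and `print_data_info_sketch`, the file names `train.csv`, `test.csv` and `sample_submission.csv`, and the toy `target` column are all assumptions made for illustration; the real helpers may behave differently.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_competition_data_sketch(data_dir):
    """Hypothetical stand-in for utils.load_competition_data.

    Assumes data_dir contains train.csv, test.csv and sample_submission.csv
    (a common Kaggle layout); this is not the project's actual implementation.
    """
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    submission = pd.read_csv(data_dir / "sample_submission.csv")
    print("Data loaded successfully!")
    print(f"   - Training set: {train.shape}")
    print(f"   - Test set: {test.shape}")
    print(f"   - Sample submission: {submission.shape}")
    return train, test, submission


def print_data_info_sketch(df, name):
    """Hypothetical stand-in for utils.print_data_info."""
    # Memory is reported in units of 1024**2 bytes.
    memory = df.memory_usage(deep=True).sum() / 1024**2
    print(f"{name} Information:")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {memory:.2f} MB")
    df.info()


# Toy demonstration on a temporary directory; the values are made up.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    toy_train = pd.DataFrame({
        "id": [0, 1, 2],
        "RhythmScore": [0.5, 0.6, 0.7],
        "target": [120.0, 98.5, 140.2],  # placeholder target column name
    })
    toy_train.to_csv(tmp / "train.csv", index=False)
    toy_train.drop(columns="target").iloc[:2].to_csv(tmp / "test.csv", index=False)
    pd.DataFrame({"id": [0, 1], "target": [0.0, 0.0]}).to_csv(
        tmp / "sample_submission.csv", index=False
    )
    train_toy, test_toy, sub_toy = load_competition_data_sketch(tmp)

print_data_info_sketch(train_toy, "Toy Training Set")
```

Returning the three frames as a tuple mirrors how the notebook unpacks the result of `load_competition_data('../data')`.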


## 📊 Data Loading and Initial Exploration


In [5]:
# Load competition data
train_df, test_df, sample_submission_df = load_competition_data('../data')

# Display basic information
print("\n" + "="*60)
print("🎵 BPM PREDICTION DATASET OVERVIEW")
print("="*60)

print_data_info(train_df, "Training Set")
print("-"*60)
print_data_info(test_df, "Test Set")


✅ Data loaded successfully!
   - Training set: (524164, 11)
   - Test set: (174722, 10)
   - Sample submission: (174722, 2)
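
The shapes above suggest that the test set is missing exactly one training column (presumably the target) and that the sample submission has one row per test row. A quick way to confirm this kind of consistency is sketched below. The helper `check_split_consistency` and the toy `target` column are hypothetical names introduced for illustration; only the `id` column name comes from the dataset listing.

```python
import pandas as pd


def check_split_consistency(train, test, submission, id_col="id"):
    """Compare columns, row counts and ids across the three frames.

    Hypothetical helper, not part of the project's utils module.
    """
    return {
        # Columns present in one split but not the other.
        "train_only_columns": sorted(set(train.columns) - set(test.columns)),
        "test_only_columns": sorted(set(test.columns) - set(train.columns)),
        # The submission should have one row per test row, with the same ids.
        "submission_rows_match_test": len(submission) == len(test),
        "submission_ids_match_test": set(submission[id_col]) == set(test[id_col]),
        # Ids shared by train and test would hint at leakage or a bad split.
        "train_test_id_overlap": len(set(train[id_col]) & set(test[id_col])),
    }


# Toy frames with made-up values; "target" is a placeholder column name.
toy_train = pd.DataFrame({
    "id": [0, 1, 2],
    "RhythmScore": [0.5, 0.6, 0.7],
    "target": [120.0, 98.5, 140.2],
})
toy_test = pd.DataFrame({"id": [3, 4], "RhythmScore": [0.4, 0.9]})
toy_submission = pd.DataFrame({"id": [3, 4], "target": [0.0, 0.0]})

report = check_split_consistency(toy_train, toy_test, toy_submission)
for key, value in report.items():
    print(f"{key}: {value}")
```

On the real data, the same call would be `check_split_consistency(train_df, test_df, sample_submission_df)`, and a single entry in `train_only_columns` would identify the target column.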

🎵 BPM PREDICTION DATASET OVERVIEW

📋 Training Set Information:
Shape: (524164, 11)
Memory usage: 43.99 MB

📊 Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 524164 entries, 0 to 524163
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   id                         524164 non-null  int64  
 1   RhythmScore                524164 non-null  float64
 2   AudioLoudness              524164 non-null  float64
 3   VocalContent               524164 non-null  float64
 4   AcousticQuality            524164 non-null  float64
 5   InstrumentalScore          524164 non-null  float64
 6   LivePerformanceLikelihood  524164 non-null  float64
 7   MoodScore                  524164 non-null  float64
 8   TrackDurationMs            524164 non-null  float64
 9   Ene