# Spotify Music Stream Analytics

## Project Overview
This guide provides a step-by-step approach to building an advanced analytics solution using the Spotify Songs Dataset. It combines data engineering best practices, machine learning techniques, and business intelligence strategies for the hackathon competition.

Dataset: Spotify Songs Dataset (114,000 rows and 21 columns in the file loaded below)

Kaggle Link: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset

Primary Goal: Build an interactive dashboard with a recommendation engine and advanced insights

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [2]:
# Load the Spotify dataset
df = pd.read_csv('dataset.csv')

In [None]:
# Display the shape of the DataFrame

df.shape


(114000, 21)

In [None]:
# Listing columns helps me spot weird names or typos early
df.columns

Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')
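The `Unnamed: 0` entry in the listing above is the name pandas gives to a column whose header field is empty, which usually means a row index was written into the CSV. A minimal sketch, assuming that is what happened here, of reading such a file so the column becomes the index instead of a feature. The two-row inline CSV is a hypothetical stand-in for `dataset.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for dataset.csv; the leading empty header
# field is what pandas names 'Unnamed: 0' on a plain read_csv.
csv_text = ",track_id,popularity\n0,abc,73\n1,def,55\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Treat the first column as the row index instead of a data column.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

The outputs in this notebook were produced without `index_col=0`, so they still list `Unnamed: 0` and report 21 columns.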

In [None]:
# Check for missing values
df.isnull().sum()


Unnamed: 0          0
track_id            0
artists             1
album_name          1
track_name          1
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64
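The counts above show one missing value each in `artists`, `album_name`, and `track_name`, but they do not say whether those gaps sit in the same row. A small sketch of how the affected rows could be pulled out for inspection; the `sample` frame and its values are hypothetical, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same kind of gap as the real data.
sample = pd.DataFrame({
    "track_id": ["a1", "b2", "c3"],
    "artists": ["Artist A", np.nan, "Artist C"],
    "track_name": ["Song A", np.nan, "Song C"],
    "popularity": [50, 0, 72],
})

# Keep only the rows where at least one column is missing.
missing_rows = sample[sample.isnull().any(axis=1)]
print(missing_rows)
```

On the real data the same filter would be `df[df.isnull().any(axis=1)]`.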

In [13]:
# Fill missing values with the placeholder string 'unknown'
# (only the text columns artists, album_name and track_name have gaps)
df.fillna('unknown', inplace=True)
# Verify no missing values remain
df.isnull().sum()

Unnamed: 0          0
track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64
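Calling `fillna('unknown')` on the whole frame works here because only text columns contain gaps. If a numeric column ever had a missing value, the same call would write a string into it. A hedged sketch of a narrower alternative that fills only the three text columns reported as incomplete above; the `sample` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: gaps in text columns and, for illustration, in tempo.
sample = pd.DataFrame({
    "artists": ["Artist A", np.nan],
    "album_name": ["Album A", np.nan],
    "track_name": ["Song A", np.nan],
    "tempo": [120.0, np.nan],
})

# Fill only the text columns that had missing values in the real data.
text_cols = ["artists", "album_name", "track_name"]
sample[text_cols] = sample[text_cols].fillna("unknown")

# tempo keeps its NaN and stays numeric, so it can be handled separately.
print(sample.dtypes)
```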

In [14]:
# Data types
print(f"\nData types:\n{df.dtypes}")


Data types:
Unnamed: 0            int64
track_id             object
artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
time_signature        int64
track_genre          object
dtype: object
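The dtype listing mixes identifiers and text (`object`), a flag (`bool`), and numeric measurements (`int64`, `float64`). A minimal sketch, on a hypothetical four-column frame, of splitting the columns by dtype so that later statistics or models only receive numeric inputs.

```python
import pandas as pd

# Hypothetical frame using a few column names and dtypes from the listing.
sample = pd.DataFrame({
    "popularity": [73, 55],
    "danceability": [0.68, 0.42],
    "explicit": [False, True],
    "track_genre": ["acoustic", "rock"],
})

# 'number' matches integer and float columns; bool columns are not included.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = [c for c in sample.columns if c not in numeric_cols]

print(numeric_cols)
print(other_cols)
```

On the real frame, integer-coded columns such as `key`, `mode`, and `time_signature` would also land in the numeric group, so whether to treat them as categories is a separate decision.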
