# **The Loudness of Spotify Tracks Over Time**


# Data Setup

## [1] Selected Dataset

[Spotify Dataset 1921-2020, 600k+ Tracks](https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks/data?select=tracks.csv)

## [2] Project Description

#### **Problem**
In the modern streaming era, artists compete for listeners' limited attention. A common theory in the music industry, known as the "Loudness War," holds that producers have progressively mixed songs louder so that they stand out on radio and in playlists. But is this trend merely anecdotal, or can it be demonstrated with data? This project investigates how song loudness has changed over the last century to determine whether popular music has become measurably louder or quieter.

#### **Data**
This project uses the 'Spotify Dataset 1921-2020, 600k+ Tracks' from Kaggle by Yamac Eren Ay (updated in 2022). Isolating the loudness and duration attributes makes it possible to analyze how they change over roughly 100 years of releases.

#### **Methodology**
1. Data Cleaning: Filtering out podcasts, unreasonably long tracks, and rows with missing release dates.
2. Exploratory Data Analysis (EDA): Visualizing trends with heatmaps and histograms to identify correlations.
3. Predictive Modeling: Training a simple linear regression model to predict loudness from release year and quantify the rate at which it changes.
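
As a rough illustration of step 3, the sketch below fits a straight line to loudness versus release year and reads the slope as a rate of change. The data here is synthetic and the names (`true_slope`, `years_since_1921`) are hypothetical; it is not the model trained later in the project, only a minimal example of the idea using `numpy.polyfit`.

```python
import numpy as np

# Synthetic stand-in for yearly loudness values; NOT taken from the dataset.
rng = np.random.default_rng(0)
years = np.arange(1921, 2021)
years_since_1921 = years - 1921

true_slope = 0.05        # hypothetical trend, in loudness units per year
true_intercept = -15.0   # hypothetical loudness in 1921
loudness = (
    true_intercept
    + true_slope * years_since_1921
    + rng.normal(0.0, 1.0, size=years.size)  # random noise around the trend
)

# A degree-1 polynomial fit is a simple linear regression.
slope, intercept = np.polyfit(years_since_1921, loudness, deg=1)

print(f"Estimated change per year:   {slope:.3f}")
print(f"Estimated change per decade: {slope * 10:.2f}")
```

A positive slope would indicate that tracks get louder over time and a negative slope that they get quieter; the magnitude gives the rate.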

## [3] Checking the Data

### [3.1] Summary

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data/tracks.csv')

print('Shape is:', df.shape)
df.info(verbose=True)
df.describe()

Shape is: (586672, 20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586672 entries, 0 to 586671
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                586672 non-null  object 
 1   name              586601 non-null  object 
 2   popularity        586672 non-null  int64  
 3   duration_ms       586672 non-null  int64  
 4   explicit          586672 non-null  int64  
 5   artists           586672 non-null  object 
 6   id_artists        586672 non-null  object 
 7   release_date      586672 non-null  object 
 8   danceability      586672 non-null  float64
 9   energy            586672 non-null  float64
 10  key               586672 non-null  int64  
 11  loudness          586672 non-null  float64
 12  mode              586672 non-null  int64  
 13  speechiness       586672 non-null  float64
 14  acousticness      586672 non-null  float64
 15  instrumentalness  586672 non-null  float64
 1

| | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 | 586672.0 |
| mean | 27.570053 | 230051.2 | 0.044086 | 0.563594 | 0.542036 | 5.221603 | -10.206067 | 0.658797 | 0.104864 | 0.449863 | 0.113451 | 0.213935 | 0.552292 | 118.464857 | 3.873382 |
| std | 18.370642 | 126526.1 | 0.205286 | 0.166103 | 0.251923 | 3.519423 | 5.089328 | 0.474114 | 0.179893 | 0.348837 | 0.266868 | 0.184326 | 0.257671 | 29.764108 | 0.473162 |
| min | 0.0 | 3344.0 | 0.0 | 0.0 | 0.0 | 0.0 | -60.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 13.0 | 175093.0 | 0.0 | 0.453 | 0.343 | 2.0 | -12.891 | 0.0 | 0.034 | 0.0969 | 0.0 | 0.0983 | 0.346 | 95.6 | 4.0 |
| 50% | 27.0 | 214893.0 | 0.0 | 0.577 | 0.549 | 5.0 | -9.243 | 1.0 | 0.0443 | 0.422 | 2.4e-05 | 0.139 | 0.564 | 117.384 | 4.0 |
| 75% | 41.0 | 263867.0 | 0.0 | 0.686 | 0.748 | 8.0 | -6.482 | 1.0 | 0.0763 | 0.785 | 0.00955 | 0.278 | 0.769 | 136.321 | 4.0 |
| max | 100.0 | 5621218.0 | 1.0 | 0.991 | 1.0 | 11.0 | 5.376 | 1.0 | 0.971 | 0.996 | 1.0 | 1.0 | 1.0 | 246.381 | 5.0 |


### [3.2] Cleaning the Dataset

#### [3.2.1] Convert duration from ms (int) to minutes (float)

In [2]:
df['duration_min'] = df['duration_ms'] / 60000
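
To make the unit conversion concrete, the short sketch below applies the same division by 60,000 to three `duration_ms` values copied from the summary statistics in [3.1] (minimum, median, and maximum). The dictionary names are illustrative only.

```python
MS_PER_MINUTE = 60_000  # 1 minute = 60 seconds * 1000 milliseconds

# Values copied from the duration_ms column of the describe() summary in [3.1].
duration_ms_summary = {"min": 3344, "median": 214893, "max": 5621218}

# Same arithmetic as df['duration_ms'] / 60000, applied to plain numbers.
duration_min_summary = {
    label: ms / MS_PER_MINUTE for label, ms in duration_ms_summary.items()
}

for label, minutes in duration_min_summary.items():
    print(f"{label}: {minutes:.2f} min")
```

The median track is a little over three and a half minutes long, while the longest entry runs for more than an hour and a half, which is why a duration cutoff is applied later.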

#### [3.2.2] Convert date from string obj to year (int)

In [3]:
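# Remove rows without release dates (a safeguard: the summary above
# shows no missing values in release_date)
df = df.dropna(subset=['release_date'])

# Parse the ISO 8601 date strings and keep only the year
# (format='ISO8601' requires pandas >= 2.0)
df['release_year'] = pd.to_datetime(df['release_date'], format='ISO8601').dt.year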
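
The sketch below shows what this conversion does on a few made-up date strings of differing precision. It assumes pandas 2.0 or later, where `format='ISO8601'` is available. The string-slicing cross-check is an alternative technique, not the approach used above, and it assumes every value starts with a four-digit year.

```python
import pandas as pd

# Made-up sample values for illustration; not rows from tracks.csv.
sample_dates = pd.Series(["1922-02-22", "1985", "2008-03"])

# format='ISO8601' accepts ISO 8601 strings even when their precision differs.
years = pd.to_datetime(sample_dates, format="ISO8601").dt.year

# Cross-check (assumption: each string begins with a four-digit year).
years_from_prefix = sample_dates.str.slice(0, 4).astype(int)

print(years.tolist())
print(years_from_prefix.tolist())
```

If the two results ever disagree on the real data, that would point to release dates that are not in ISO 8601 form and need separate handling.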

### [3.3] Creating a Subset
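
The subset in this section is built with a boolean mask that combines a duration condition and a release-year condition. As a minimal sketch of that pattern, the example below applies the same two conditions to a tiny made-up frame (`toy` and its values are hypothetical) and counts how many rows the filter removes.

```python
import pandas as pd

# Tiny made-up frame for illustration; not rows from tracks.csv.
toy = pd.DataFrame({
    "duration_min": [3.5, 93.7, 14.9, 2.1],
    "release_year": [1995, 2010, 1900, 1921],
})

# Each comparison yields a boolean Series; & keeps rows where both hold.
mask = (toy["duration_min"] < 15) & (toy["release_year"] >= 1921)

# .copy() makes the subset an independent frame rather than a view-like slice.
toy_sub = toy[mask].copy()

removed = len(toy) - len(toy_sub)
print(f"Kept {len(toy_sub)} rows, removed {removed}")
```

Comparing row counts before and after the filter is a quick way to confirm that the cutoff trims only a small share of the data.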

In [4]:
# Keep tracks shorter than 15 minutes (an attempt to trim off excessively
# long items such as podcasts) that were released in 1921 or later
df_sub = df[
    (df['duration_min'] < 15) &
    (df['release_year'] >= 1921)
].copy()

df_sub.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 584576 entries, 0 to 586671
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                584576 non-null  object 
 1   name              584505 non-null  object 
 2   popularity        584576 non-null  int64  
 3   duration_ms       584576 non-null  int64  
 4   explicit          584576 non-null  int64  
 5   artists           584576 non-null  object 
 6   id_artists        584576 non-null  object 
 7   release_date      584576 non-null  object 
 8   danceability      584576 non-null  float64
 9   energy            584576 non-null  float64
 10  key               584576 non-null  int64  
 11  loudness          584576 non-null  float64
 12  mode              584576 non-null  int64  
 13  speechiness       584576 non-null  float64
 14  acousticness      584576 non-null  float64
 15  instrumentalness  584576 non-null  float64
 16  liveness          584576 