# Predicting Track Popularity

This was my final project for my data science class at Brainstation.


## Spotify Dataset

The dataset is available [here](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv). It contains a sample of 160,000 songs from Spotify, along with additional sheets that aggregate the data by artist, year, or genre.

Spotify provides these audio features for a track:
- **Acousticness:** A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- **Danceability:** How suitable a track is for dancing, from 0.0 (least danceable) to 1.0 (most danceable). 
- **Duration:** The song length in milliseconds. This typically ranges from 200,000 to 300,000 ms (roughly 3.3 to 5 minutes).
- **Energy:** A perceptual measure of intensity and activity, from 0.0 (low energy) to 1.0 (high energy).
- **Explicit:** A binary value indicating whether the track contains explicit content (1) or not (0).
- **Instrumentalness:** Predicts whether a track contains no vocals. The closer this is to 1.0, the greater the likelihood that the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- **Liveness:** Whether there's a live audience in the recording, from 0.0 (low probability) to 1.0 (high probability). A value above 0.8 indicates a strong likelihood that the track is live.
- **Loudness:** The overall loudness of a track in decibels (dB), with a typical range between -60 and 0 dB.
- **Popularity:** The popularity of a track in the US based on how frequently and recently it's played. Ranges from 0 (obscure) to 100 (very popular). 
- **Speechiness:** Detects the presence of spoken words in a track. Values below 0.33 most likely represent music. Values between 0.33 and 0.66 describe tracks that may contain both music and speech. Values from 0.66 to 1.0 describe tracks that are mostly spoken word.
- **Tempo:** The overall estimated tempo of a track in beats per minute (BPM).
- **Valence:** A measure from 0.0 (more negative) to 1.0 (more positive) describing the musical positiveness conveyed by a track.

More details about what these audio features mean are [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/).

I chose this dataset for a few reasons:
- I'm on Spotify frequently (almost every day).
- I did a project a few months ago that used Spotify data, a data visualization of the 100 greatest metal albums. It was a fun way to understand and interpret music, and I have some familiarity with Spotify's audio data.  
- This is a very extensive dataset that's nicely formatted. 

## Setup

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
%matplotlib inline 

tracks = pd.read_csv('/kaggle/input/spotify-dataset-19212020-160k-tracks/data.csv')

## Data Cleaning

In [None]:
tracks.head()

In [None]:
tracks.tail()

In [None]:
tracks.info()

The data types look as expected.

In [None]:
tracks.isnull().sum()

Surprisingly, there aren't any null values.

In [None]:
tracks.describe()

For the most part, the data looks as expected. There are a few changes I'd make:
- Duration can be expressed in minutes, instead of milliseconds. This is because songs are generally a few minutes long.
- Release_date can be dropped, because it's an object with a few different date formats. Year seems to convey the same information (when a track was released), but it's consistently formatted, so we can use that instead.
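One way to sanity-check that year really carries the same information as release_date is to compare the two columns directly. The sketch below uses a small made-up frame rather than the real data, and it assumes every date format starts with a four-digit year, which may not hold for every row in the actual file.

```python
import pandas as pd

# Hypothetical rows mimicking mixed release_date formats (not the real data).
sample = pd.DataFrame({
    "release_date": ["1921", "1985-07", "2020-03-27"],
    "year": [1921, 1985, 2020],
})

# Assumption: each date format begins with a four-digit year.
year_from_date = sample["release_date"].str[:4].astype(int)

# Fraction of rows where the extracted year matches the year column.
agreement = (year_from_date == sample["year"]).mean()
print(agreement)
```

If the agreement on the real data were noticeably below 1.0, the mismatched rows would be worth inspecting before dropping the column.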

In [None]:
tracks['duration_m'] = tracks['duration_ms']/60000
tracks = tracks.reindex(sorted(tracks.columns), axis=1)
tracks.head()

In [None]:
tracks.drop('duration_ms', axis = 1, inplace = True)
tracks.drop('release_date', axis = 1, inplace = True)
tracks.head()

## Initial Data Exploration

In [None]:
tracks.describe()

I can spot an outlier: the max duration_m is 90 minutes long. I wonder which track(s) clock in at an hour or more? 

In [None]:
tracks[tracks['duration_m']>60]

The tracks that are over an hour long are ambient sounds and one Brian Eno song. The energy measurements are surprisingly high for ocean wave sounds meant for sleep and relaxation.

In [None]:
tracks.hist(figsize=(15, 15), color = 'black')
plt.show()

## Hypothesis

I want to know: which audio features are related to a track's popularity?

In [None]:
plt.figure(figsize=(20, 10))
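# Restrict to numeric columns: recent pandas versions raise an error if corr() sees text columns
sns.heatmap(tracks.select_dtypes('number').corr(), annot = True)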

A few factors stand out:
- Acousticness 
- Energy 
- Loudness
- Tempo
- Year

More popular songs tend to be less acoustic-sounding, higher energy, louder, and more recently released. Being danceable, having explicit content, having fewer vocals, and having a fast tempo aren't as strong factors.

Of the available features, release year has the strongest correlation with popularity. This makes sense, given how much Spotify showcases new content in the app.
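Reading the strongest relationships off a large heatmap can be error-prone, so it may help to rank the features by the absolute value of their correlation with popularity. This is a minimal sketch on synthetic data: the `demo` frame and its numbers are invented for illustration and are not the Spotify dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data: popularity is built to depend on year but not on tempo.
year = rng.integers(1921, 2021, size=n)
demo = pd.DataFrame({
    "year": year,
    "tempo": rng.normal(120, 30, size=n),
    "popularity": 0.7 * (year - 1921) + rng.normal(0, 10, size=n),
})

# Correlation of every feature with popularity, strongest (by magnitude) first.
corr_with_popularity = (
    demo.corr()["popularity"]
    .drop("popularity")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_popularity)
```

Sorting by absolute value keeps strong negative correlations (such as acousticness in the real data) near the top instead of pushing them to the bottom.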

Scatterplots mapping different audio features to popularity are shown below. In these charts, if some audio features are extreme enough, you can see if popularity drops or a technical limit is reached. 

In [None]:
sns.scatterplot(x = 'acousticness', y = 'popularity', data = tracks, alpha = 0.03, color = 'black')

In [None]:
sns.scatterplot(x = 'danceability', y = 'popularity', data = tracks, alpha = 0.03, color = 'black')

In [None]:
sns.scatterplot(x = 'energy', y = 'popularity', data = tracks, alpha = 0.03, color = 'black')

At energy = 0.5, it looks like the popularity tapers. 

In [None]:
sns.scatterplot(x = 'explicit', y = 'popularity', data = tracks, alpha = 0.03, color = 'black')

In [None]:
sns.scatterplot(x = 'loudness', y = 'popularity', data = tracks, alpha = 0.03, color = 'black')

In [None]:
sns.scatterplot(x = 'speechiness', y = 'popularity', data = tracks, alpha = 0.03, color = 'black')

In [None]:
sns.scatterplot(x = 'tempo', y = 'popularity', data = tracks, alpha = 0.03, color = 'black')

In [None]:
sns.scatterplot(x = 'year', y = 'popularity', data = tracks, alpha = 0.03, color = 'black')

In the bottom right corner, there's a cluster of tracks that are new and haven't gotten many listeners yet.

## Data Modeling

**Linear regression to predict popularity based on release year**

In [None]:
model_year = smf.ols(data = tracks, formula = "popularity ~ year")
result_year = model_year.fit()
result_year.summary()

Results:
- R-squared: 0.744
- Coef: Intercept = -1404.3129, year = 0.7263

If year increases by 1, then the predicted popularity score goes up by 0.73. For comparison, the popularity score ranges from 0 to 100.
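To make the coefficients concrete, the fitted line can be evaluated by hand for a few years. The sketch below plugs the intercept and slope reported above into the regression equation; the helper name `predict_popularity` is just for illustration.

```python
# Coefficients copied from the regression summary above.
INTERCEPT = -1404.3129
YEAR_COEF = 0.7263

def predict_popularity(year):
    """Predicted popularity from the single-feature model: intercept + slope * year."""
    return INTERCEPT + YEAR_COEF * year

for year in (1921, 2000, 2020):
    print(year, round(predict_popularity(year), 2))
```

The prediction is about 48 for a track from 2000, but slightly below zero for 1921. A straight line isn't constrained to the 0 to 100 range of the popularity score, which is one limitation of this simple model.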

In [None]:
tracks['predicted_popularity_yr'] = result_year.predict(tracks)
tracks.head()

In [None]:
tracks.tail()

**Linear regression to predict popularity based on acousticness**

In [None]:
model_ac = smf.ols(data = tracks, formula = "popularity ~ acousticness")
result_ac = model_ac.fit()
result_ac.summary()

Results:
- R-squared: 0.329
- Coef: Intercept = 48.1366, acousticness = -33.2690

A track that's not acoustic (score of 0) would have a predicted popularity score about 33 points higher than one that is (score of 1).

In [None]:
tracks['predicted_popularity_ac'] = result_ac.predict(tracks)
tracks.head()

In [None]:
tracks.tail()

**Linear regression to predict popularity based on release year, acousticness, energy, and loudness**

In [None]:
model_3 = smf.ols(formula = "popularity ~ year + acousticness + energy + loudness", data = tracks).fit()
model_3.summary()

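Here, the coefficient for energy is -3.78, which looks counterintuitive because earlier, we saw that 1) the correlation with popularity is a positive value and 2) the scatterplot showed a positive trend.

The likely explanation is the influence of the other factors: year, acousticness, and loudness. In a multiple regression, each coefficient describes the effect of that feature with the other predictors held fixed, so if energy overlaps heavily with them, its coefficient can shrink or even change sign.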
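This kind of sign flip can be reproduced on synthetic data. The sketch below is not fitted to the Spotify tracks: the variables are invented stand-ins, with energy constructed to be strongly correlated with loudness, and it uses NumPy's least-squares solver rather than statsmodels to stay self-contained.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Synthetic stand-ins (not the real data): energy closely tracks loudness.
loudness = rng.normal(size=n)
energy = loudness + rng.normal(scale=0.5, size=n)

# By construction, energy has a negative effect once loudness is accounted for.
popularity = 2.0 * loudness - 1.0 * energy + rng.normal(scale=0.5, size=n)

def ols_coefs(y, *predictors):
    """Ordinary least squares fit; returns [intercept, coef_1, coef_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Energy alone: it acts as a proxy for loudness, so the slope is positive.
simple_energy_coef = ols_coefs(popularity, energy)[1]

# Energy alongside loudness: the slope turns negative.
multi_energy_coef = ols_coefs(popularity, energy, loudness)[1]

print(simple_energy_coef, multi_energy_coef)
```

Whether the same mechanism explains the -3.78 coefficient in the real model would need to be checked, for example by looking at how strongly energy correlates with loudness and acousticness in the heatmap.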

In [None]:
tracks['predicted_popularity'] = model_3.predict(tracks)
tracks.head()

In [None]:
tracks.tail()

**Cost functions for each of these models**

How far off are the 3 sample models?

In [None]:
# First attempt: sum of squared errors. It isn't divided by the number of tracks, so the totals are hard to interpret.
cost_yr = ((tracks['predicted_popularity_yr'] - tracks['popularity'])**2).sum()
cost_ac = ((tracks['predicted_popularity_ac'] - tracks['popularity'])**2).sum()
cost = ((tracks['predicted_popularity'] - tracks['popularity'])**2).sum()

print(cost_yr)
print(cost_ac)
print(cost)

In [None]:
# Improved cost function: mean absolute error (MAE), the average miss in popularity points
cost_yr = (tracks['predicted_popularity_yr'] - tracks['popularity']).abs().mean()
cost_ac = (tracks['predicted_popularity_ac'] - tracks['popularity']).abs().mean()
cost = (tracks['predicted_popularity'] - tracks['popularity']).abs().mean()

print(cost_yr)
print(cost_ac)
print(cost)

The third model (predicting based on release year, acousticness, energy, and loudness) was the most accurate of the three.
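The two cost functions above correspond to a sum of squared errors and a mean absolute error (MAE). As a rough sketch of how they relate, the helpers below compute MAE and root mean squared error (RMSE), both of which are expressed in popularity points. The function names and the toy numbers are illustrative only.

```python
import numpy as np

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and actual values."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(np.abs(predicted - actual))

def root_mean_squared_error(predicted, actual):
    """Square root of the average squared difference; penalizes large misses more."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Toy example (made-up numbers): errors are 0, -2, and -4.
predicted = [1.0, 2.0, 3.0]
actual = [1.0, 4.0, 7.0]

mae = mean_absolute_error(predicted, actual)
rmse = root_mean_squared_error(predicted, actual)
print(mae, rmse)
```

RMSE comes out larger than MAE here because squaring gives the 4-point miss more weight. Comparing the two on the real models would show whether a few badly mispredicted tracks dominate the error.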