# Spotify Artist Prediction

Author: Sanath Nair

Course Project, UC Irvine, Math 10, Summer 2023

## Introduction

The goal of this project is to see which combination of variables and model best predicts the artist of a song. More specifically, I want to compare whether the _**characteristics of a song**_ (i.e. danceability, valence, energy, acousticness, instrumentalness, liveness, speechiness) are a better predictor than the _**number of playlists the song is in**_ and the _**number of streams the song gets**_.

## Imports

In [1]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier

## Preprocessing

In [2]:
df = pd.read_csv("spotify-2023.csv")
df.head()

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92,C#,Major,71,61,74,7,0,10,4
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140003974,94,...,138,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,Taylor Swift,1,2019,8,23,7858,100,800840817,116,...,170,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144,A,Minor,65,23,80,14,63,11,6


In [3]:
df.info()
# as we can see the "streams" column has a dtype of object which doesn't make sense

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

In [4]:
# converting the dtypes of certain columns to allow for manipulation/analysis later on
df["streams"] = pd.to_numeric(df["streams"], errors="coerce") # coerce to set invalid to NaN
df = df.rename(columns={
    "artist(s)_name": "artist",
    "danceability_%": "danceability",
    "valence_%": "valence",
    "energy_%": "energy",
    "acousticness_%": "acousticness",
    "instrumentalness_%": "instrumentalness",
    "liveness_%": "liveness",
    "speechiness_%": "speechiness"
})
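df["total playlists"] = df["in_spotify_playlists"] + df["in_apple_playlists"]
# Convert the percentage columns to fractions. The slice starts at danceability (column 17)
# and also covers the new "total playlists" column, so that column is divided by 100 as well.
df.iloc[:, 17:] = df.iloc[:, 17:] / 100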

I ignored Deezer because its user base is comparatively small: around 16 million users versus 551 million for Spotify (see [Spotify vs. Deezer: Which Is Better?](https://www.headphonesty.com/2021/08/spotify-vs-deezer/#:~:text=Deezer%20also%20launched%20in%20several,compared%20to%20Spotify's%20406%20million.)).

Before we begin any in-depth analysis, I wanted to see the most popular artists and the respective number of times they show up in the dataset. This matters because artists represented by only a couple of hit songs give the model very little to learn from, which would bias and disrupt our prediction/classifier models. To get a more accurate model, we need to remove such outliers first.

In [5]:
artist_value_counts = df["artist"].value_counts()
artist_value_counts

Taylor Swift                         34
The Weeknd                           22
Bad Bunny                            19
SZA                                  19
Harry Styles                         17
                                     ..
Junior H, Eden Muñoz                  1
Taylor Swift, Lana Del Rey            1
Jnr Choi                              1
Aitana, zzoilo                        1
Frank Sinatra, B. Swanson Quartet     1
Name: artist, Length: 645, dtype: int64

From this we can see that near the bottom there are a lot of artists with only 1 hit song. To better understand the distribution of these value counts, let's create a histogram.

In [6]:
artist_count_df = pd.DataFrame({
    "name": artist_value_counts.index,
    "occurrences": artist_value_counts.values
})

alt.Chart(artist_count_df).mark_bar().encode(
    x=alt.X("occurrences", bin=True, title="Number of Occurrences in dataset"),
    y=alt.Y("distinct(name)", title="Number of Artists"),
    tooltip=[alt.Tooltip("distinct(name)", title="Count")]
).interactive()

From this histogram we can see that the data is very right-skewed, with about 629 artists falling in the 0-5 bin. However, if we keep only the artists that appear more than 5 times and at most 25 times, the data is far less skewed while still leaving enough unique artists for our ML models.

In [7]:
new_artist_value_counts = artist_value_counts[(artist_value_counts > 5) & (artist_value_counts <= 25)]
new_artist_value_counts

The Weeknd          22
Bad Bunny           19
SZA                 19
Harry Styles        17
Kendrick Lamar      12
Morgan Wallen       11
Ed Sheeran           9
BTS                  8
Feid                 8
Drake, 21 Savage     8
Labrinth             7
Olivia Rodrigo       7
Doja Cat             6
NewJeans             6
Name: artist, dtype: int64

In [8]:
artist_count_df2 = pd.DataFrame({
    "name": new_artist_value_counts.index,
    "occurrences": new_artist_value_counts.values
})

alt.Chart(artist_count_df2).mark_bar().encode(
    x=alt.X("occurrences", bin=True, title="Number of Occurrences in dataset"),
    y=alt.Y("distinct(name)", title="Number of Artists"),
    tooltip=[alt.Tooltip("distinct(name)", title="Count")]
).interactive()

In [9]:
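# new df with only the artists that show up more than 5 and at most 25 times
# .copy() so that later column assignments do not act on a slice of the original frame
df = df[df["artist"].isin(new_artist_value_counts.index)].copy()
df.head()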

Unnamed: 0,track_name,artist,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,key,mode,danceability,valence,energy,acousticness,instrumentalness,liveness,speechiness,total playlists
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140004000.0,94,...,F,Major,0.51,0.32,0.53,0.17,0.0,0.31,0.06,14.91
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236300.0,84,...,A,Minor,0.65,0.23,0.8,0.14,0.63,0.11,0.06,32.17
11,Super Shy,NewJeans,1,2023,7,7,422,55,58255150.0,37,...,F,Minor,0.78,0.52,0.82,0.18,0.0,0.15,0.07,4.59
14,As It Was,Harry Styles,1,2022,3,31,23575,130,2513188000.0,403,...,F#,Minor,0.52,0.66,0.73,0.34,0.0,0.31,0.06,239.78
15,Kill Bill,SZA,1,2022,12,8,8109,77,1163094000.0,183,...,G#,Major,0.64,0.43,0.73,0.05,0.17,0.16,0.04,82.92


_Now that we have narrowed down the artists we want to analyze, let's dive deeper into their music._

## Exploration of the Dataset

Before we dive deeper into specific relations, let's set a baseline accuracy that we want to beat.

In [10]:
df["artist"].unique().shape

(14,)

Since there are 14 artists, if we randomly guessed an artist we should expect to be correct with an accuracy of 1/14 or roughly 7.14%. Therefore any model we create should have an accuracy greater than this in order for it to be useful.
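
Uniform guessing is not the only trivial baseline. A slightly stricter one is to always predict the most frequent artist. The sketch below computes both from the song counts printed in cell 7; the variable names are illustrative and not part of the original notebook.

```python
# Song counts per artist, copied from the filtered value counts above.
artist_counts = {
    "The Weeknd": 22, "Bad Bunny": 19, "SZA": 19, "Harry Styles": 17,
    "Kendrick Lamar": 12, "Morgan Wallen": 11, "Ed Sheeran": 9, "BTS": 8,
    "Feid": 8, "Drake, 21 Savage": 8, "Labrinth": 7, "Olivia Rodrigo": 7,
    "Doja Cat": 6, "NewJeans": 6,
}

n_songs = sum(artist_counts.values())
n_artists = len(artist_counts)

# Baseline 1: pick one of the artists uniformly at random.
uniform_baseline = 1 / n_artists

# Baseline 2: always predict the artist with the most songs.
majority_artist = max(artist_counts, key=artist_counts.get)
majority_baseline = artist_counts[majority_artist] / n_songs

print(f"{n_songs} songs, {n_artists} artists")
print(f"uniform guess: {uniform_baseline:.2%}")
print(f"always '{majority_artist}': {majority_baseline:.2%}")
```

On the full filtered dataset the majority baseline is 22/159, about 13.8%, so it is a harder bar to clear than 7.14%. The exact figure on a held-out split depends on which songs land in it.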

### Extra: Label Encoding

To make the target easier to work with, we are going to label encode the artists, mapping each name to an integer. Numeric labels are cheaper to compare than strings, although scikit-learn classifiers also accept string labels directly, so any speed-up is likely to be small.

In [11]:
le = LabelEncoder()
df["artist_le"] = le.fit_transform(df["artist"])
df["artist_le"]

2      11
4       1
11     10
14      6
15     12
       ..
935     3
937     3
939     3
943     3
946     3
Name: artist_le, Length: 159, dtype: int64
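
To read the encoded values, it helps to know that `LabelEncoder` assigns integers in sorted order of the class names and can map them back with `inverse_transform`. The following is a small stand-alone sketch using the 14 artist names from the filtered dataset; `name_to_id` is a helper name introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

artists = [
    "The Weeknd", "Bad Bunny", "SZA", "Harry Styles", "Kendrick Lamar",
    "Morgan Wallen", "Ed Sheeran", "BTS", "Feid", "Drake, 21 Savage",
    "Labrinth", "Olivia Rodrigo", "Doja Cat", "NewJeans",
]

encoder = LabelEncoder().fit(artists)

# classes_ holds the names in sorted order; the position is the integer label.
name_to_id = {name: int(i) for i, name in enumerate(encoder.classes_)}
print(name_to_id["Olivia Rodrigo"], name_to_id["Bad Bunny"])

# inverse_transform maps integer labels (e.g. model predictions) back to names.
print(encoder.inverse_transform([6, 12]))
```

In the notebook, the same lookup would be `le.inverse_transform(...)` applied to a model's predictions.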

### Relation between # of Playlists and # of Streams

First I wanted to get a general understanding of the relation between # of playlists and # of streams. To do this I made a simple scatter plot.

In [12]:
alt.Chart(df).mark_circle().encode(
    x=alt.X("total playlists"),
    y=alt.Y("streams"),
    color="artist",
    tooltip=["in_spotify_playlists", "in_apple_playlists", "artist", "track_name"]
).interactive()

While this chart suggests a positive linear association between playlists and streams, it doesn't give us a clear picture of how those variables determine the artist.

To see if there is a relation, let's make a LogisticRegression model trained on total playlists and streams. Since we also want to measure the accuracy of our model, we will split the data into a train and test set.

First we standardize the input columns to have mean 0 and standard deviation of 1. This is a common machine learning practice to improve performance and accuracy.

In [13]:
df_std = pd.DataFrame({
    "artist": df["artist_le"],
    "artist_name": df["artist"]
})
df_std["total_playlist_standardized"] = ( df["total playlists"] - df["total playlists"].mean() ) / df["total playlists"].std()
df_std["streams_standardized"] = ( df["streams"] - df["streams"].mean() ) / df["streams"].std()
df_std

Unnamed: 0,artist,artist_name,total_playlist_standardized,streams_standardized
2,11,Olivia Rodrigo,-0.445540,-0.579838
4,1,Bad Bunny,-0.192130,-0.334684
11,10,NewJeans,-0.597058,-0.702613
14,6,Harry Styles,2.855987,2.984369
15,12,SZA,0.552978,0.956707
...,...,...,...,...
935,3,"Drake, 21 Savage",-0.466682,-0.534166
937,3,"Drake, 21 Savage",-0.397237,-0.577259
939,3,"Drake, 21 Savage",-0.418085,-0.492186
943,3,"Drake, 21 Savage",-0.515720,-0.621240
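
The cell above computes the mean and standard deviation over all rows, including the ones that later end up in the test set. A common alternative is to estimate these statistics on the training rows only, for example with a scikit-learn pipeline. The sketch below uses synthetic stand-in data (the arrays, sizes and class count are made up for illustration), not the Spotify columns. Note also that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample version (`ddof=1`), so the two give slightly different values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50, 20, size=200),     # hypothetical "playlists"-like feature
    rng.normal(5e8, 2e8, size=200),   # hypothetical "streams"-like feature
])
y = rng.integers(0, 4, size=200)      # four made-up class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The scaler is fit inside the pipeline, so it only ever sees training rows.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
print("means learned from training rows:", scaler.mean_)
print("test accuracy:", model.score(X_test, y_test))
```

Because the labels here are random, the printed accuracy is meaningless; the point is only where the scaling statistics come from.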


In [14]:
X_train, X_test, y_train, y_test = train_test_split(df_std[["total_playlist_standardized", "streams_standardized"]], df_std["artist"], random_state=1)
# use a seed for reproducible results

In [15]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression()

In [16]:
clf.score(X_train, y_train), clf.score(X_test, y_test)

(0.2773109243697479, 0.25)

While on its own this model is only accurate 25% of the time on the test set, it is a major improvement over randomly guessing (about 7%). In addition, since the training and test accuracies are close (about 27.7% vs. 25%), there is no sign of significant overfitting.

We can attribute this inaccuracy to the fact that different artists can have songs that perform similarly in both playlists and streams. In addition, these two variables are themselves related, so adding one to the other provides little extra predictive power and may even worsen the prediction.
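
How strongly the two inputs are related can be quantified with a Pearson correlation. The sketch below uses only the five filtered rows displayed earlier (playlist totals recomputed as Spotify plus Apple counts, streams as displayed, i.e. rounded), so it illustrates the calculation rather than measuring the correlation on the full dataset.

```python
import pandas as pd

# Five rows taken from the df.head() output of the filtered data.
sample = pd.DataFrame({
    "track_name": ["vampire", "WHERE SHE GOES", "Super Shy", "As It Was", "Kill Bill"],
    "total_playlists": [1397 + 94, 3133 + 84, 422 + 37, 23575 + 403, 8109 + 183],
    "streams": [140_004_000, 303_236_300, 58_255_150, 2_513_188_000, 1_163_094_000],
})

# Pearson correlation between the two predictors.
r = sample["total_playlists"].corr(sample["streams"])
print(f"Pearson r on these five songs: {r:.3f}")
```

Applied to the full frame, the equivalent call would be `df["total playlists"].corr(df["streams"])`. A value close to 1 would mean the second feature adds little information beyond the first.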

### Relation between Characteristics and Artist

Second, I wanted to get a general understanding of the relation between the various characteristics of a song and its artist. To do this, I made multiple bar charts showing each artist and their average value of each characteristic.

In [17]:
fields = [
    "danceability",
    "valence",
    "energy",
    "acousticness",
    "instrumentalness",
    "liveness",
    "speechiness",
]

In [18]:
# min-max scale each characteristic to the range [0, 1]
df_std[fields] = (df[fields] - df[fields].min()) / (df[fields].max() - df[fields].min())
df_std

Unnamed: 0,artist,artist_name,total_playlist_standardized,streams_standardized,danceability,valence,energy,acousticness,instrumentalness,liveness,speechiness
2,11,Olivia Rodrigo,-0.445540,-0.579838,0.363636,0.298851,0.445946,0.180851,0.000000,0.397059,0.070175
4,1,Bad Bunny,-0.192130,-0.334684,0.575758,0.195402,0.810811,0.148936,0.875000,0.102941,0.070175
11,10,NewJeans,-0.597058,-0.702613,0.772727,0.528736,0.837838,0.191489,0.000000,0.161765,0.087719
14,6,Harry Styles,2.855987,2.984369,0.378788,0.689655,0.716216,0.361702,0.000000,0.397059,0.070175
15,12,SZA,0.552978,0.956707,0.560606,0.425287,0.716216,0.053191,0.236111,0.176471,0.035088
...,...,...,...,...,...,...,...,...,...,...,...
935,3,"Drake, 21 Savage",-0.466682,-0.534166,0.863636,0.310345,0.216216,0.021277,0.000000,0.514706,1.000000
937,3,"Drake, 21 Savage",-0.397237,-0.577259,0.696970,0.218391,0.554054,0.010638,0.000000,0.411765,0.087719
939,3,"Drake, 21 Savage",-0.418085,-0.492186,0.757576,0.160920,0.675676,0.010638,0.000000,0.176471,0.052632
943,3,"Drake, 21 Savage",-0.515720,-0.621240,1.000000,0.643678,0.554054,0.000000,0.000000,0.117647,0.315789


In [19]:
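# aggregate="mean" plots each artist's average value; without it the bars stack (sum) the songs
alt.Chart(df_std).mark_bar().encode(
    x=alt.X("artist_name"),
    y=alt.Y(alt.repeat("column"), aggregate="mean", type="quantitative"),
    color="artist_name"
).repeat(
    column=fields
)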

These charts give us various insights into each characteristic, but what is most notable is the artist ***Labrinth***, whose values for most characteristics run in the opposite direction from those of the other artists. This leads me to believe that predicting whether a song was made by Labrinth will be easier than predicting other artists.

To test this hypothesis, let's build another LogisticRegression model and test its accuracy.

In [20]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(df_std[fields], df_std["artist"], random_state=1)

In [21]:
clf2 = LogisticRegression()
clf2.fit(X_train2, y_train2)

LogisticRegression()

In [22]:
(clf2.score(X_train2, y_train2), clf2.score(X_test2, y_test2))

(0.2773109243697479, 0.25)

Surprisingly, this model performs on par with our classifier based on playlists and number of streams: the training and test accuracies are identical.
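
Overall accuracy does not say how well any single artist, such as Labrinth, is predicted. One way to check is per-class recall: the fraction of each artist's songs that the model labels correctly. The helper below is a hypothetical sketch (`per_class_recall` is not part of the notebook), demonstrated on made-up labels.

```python
import numpy as np
from sklearn.metrics import recall_score

def per_class_recall(y_true, y_pred, class_names):
    """Return {class name: recall} for integer-encoded labels 0..n-1."""
    labels = np.arange(len(class_names))
    # average=None yields one recall value per label; zero_division=0
    # reports 0 for classes that never occur in y_true.
    recalls = recall_score(y_true, y_pred, labels=labels,
                           average=None, zero_division=0)
    return dict(zip(class_names, recalls))

# Made-up example with three classes.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]
print(per_class_recall(y_true, y_pred, ["A", "B", "C"]))
```

With the notebook's objects this could be called as `per_class_recall(y_test2, clf2.predict(X_test2), le.classes_)`. Since the test set is small, some artists have very few test songs, so individual recall values would be noisy.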

However, I still believe that these variables should be a better predictor of the artist, and thus I'll create a GradientBoostingClassifier.

### Extra: GradientBoostingClassifier

The differences between GradientBoostingClassifier and RandomForestClassifier are explained in detail in this [article](https://www.datasciencecentral.com/decision-tree-vs-random-forest-vs-boosted-trees-explained/).

A quick summary:
- ***Random forests*** build a large number of trees independently and combine them (using averages or “majority rules”) at the end of the process.
- ***Gradient boosting*** machines also combine decision trees, but build them one after another, with each new tree correcting the errors of the trees before it, so the combining starts at the beginning instead of at the end.
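
The two ensemble styles can be tried side by side. The following sketch uses a synthetic dataset from `make_classification` (all sizes and parameters are arbitrary choices for illustration) rather than the Spotify data, so the scores say nothing about this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem with 7 features (arbitrary illustrative sizes).
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = {
    # Trees are grown independently on bootstrap samples, then their votes are combined.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Trees are added one after another, each fit to the errors of the current ensemble.
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

Comparing the train and test score of each model is also a quick way to spot overfitting.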

In [23]:
clf3 = GradientBoostingClassifier(learning_rate=0.01)
clf3.fit(X_train2, y_train2)

GradientBoostingClassifier(learning_rate=0.01)

In [24]:
(clf3.score(X_train2, y_train2), clf3.score(X_test2, y_test2))

(0.9327731092436975, 0.275)

As we can see, our GradientBoostingClassifier performs slightly better on the test set than our LogisticRegression models (27.5% vs. 25%), but not at an accuracy level with which we could make accurate decisions.

However, there is significant overfitting: the model performs really well on the training data (about 93%) but poorly on the test data.
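
It also helps to put the test scores in context. Assuming the default 25% split of the 159 songs, the test set holds 40 songs, so 0.25 and 0.275 correspond to 10 and 11 correct predictions. A rough binomial standard error, sketched below, gives a sense of how much such a score could move by chance alone.

```python
import math

n_songs = 159
n_test = math.ceil(0.25 * n_songs)  # size of the default 25% test split

for accuracy in (0.25, 0.275):
    n_correct = round(accuracy * n_test)
    # Standard error of a proportion estimated from n_test independent trials.
    std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
    print(f"accuracy={accuracy:.3f}: {n_correct}/{n_test} correct, "
          f"standard error ~ {std_err:.3f}")
```

Both standard errors are around 0.07, far larger than the 0.025 gap between the two classifiers, so the one-song difference is well within noise.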

## Summary

We made three models, two LogisticRegression and one GradientBoostingClassifier, to see whether it is possible to predict the artist from various information about a specific song. All of our models gave better predictions than random guessing; however, with test accuracies of only 25% to 27.5%, the improvement was not large enough to claim that there is a clear relationship between those variables and the artist.

Regardless of the results, this project was entertaining, as I learned a lot about different artists and the types of music they make while cleaning the data and creating visualizations.

## References


* What is the source of your dataset(s)?
https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023

* List any other references that you found helpful.

https://www.geeksforgeeks.org/how-to-standardize-data-in-a-pandas-dataframe/
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
https://altair-viz.github.io/user_guide/compound_charts.html
