<h1><center>Spotify Classification Dashboard and Model Analysis</center></h1>
<h3><center>By Piero Trujillo</center></h3>

## Introduction

In this project, my friend Nirvit and I shared our 2023 Spotify Wrapped playlists so we could compare our music tastes visually and then build models that predict whose playlist a song belongs to. Finally, I compiled the results of each model into an interactive dashboard using [Panel](https://panel.holoviz.org/).

This blog post will have the following sections:

* Setup and Preprocessing

* Exploratory Data Analysis for Feature Selection

* Prepping Data For Machine Learning Models

* Creating Machine Learning Models

* Panel Dashboard

* Final Thoughts

Now, let’s dive into the exciting world of music data analysis!

### Understanding our Spotify Dataset

**Track Metadata**
| column | description |
| --- | --- |
| Song | Song title |
| Artist | Song artist |
| Genre | Song genre category |

**Audio Numerical Quantitative Data**
| column | description |
| --- | --- |
| Loud | How loud a song is (dB) |
| Time Seconds | Duration of the song in seconds |
| BPM | Average song tempo / how fast a song is |

**Audio Qualitative Data**
| column | description |
| --- | --- |
| Energy | How energetic the song is |
| Dance | How easy the song is to dance to |
| Happy | How positive the mood of the song is |
| Acoustic | How acoustic sounding the song is |
| Speech | How much of a song is spoken word |
| Popularity |  How popular a song is (at time of data collection) |
| Live | How likely the song is a live recording (higher value = live recording) |
| Instrumental | Measures if the song is more music and less vocals |


**Audio Categorical Data**
| column | description |
| --- | --- |
| Key | The most repeated key in the song |
| Time Signature | Numerical representation of rhythmic structure in song |
| Camelot | Musical key of a song for harmonic mixing |
| Playlist Owner | Whose playlist the song belongs to |


[**Gather Your Own Spotify Dataset**](https://www.chosic.com/spotify-playlist-analyzer/?plid=37i9dQZF1Fa1IIVtEpGUcU)


## Setup and Preprocessing

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Read in csv file to create tabular dataframe 
piero_top_songs = pd.read_csv("/Users/piero/Downloads/Spotify_Project/Piero_Top_Songs_2023.csv") 
nirvit_top_songs = pd.read_csv("/Users/piero/Downloads/Spotify_Project/Nirvit_Top_Songs_2023.csv") 

# Add a truth column to classify whether a song is from Piero's or Nirvit's playlist
piero_top_songs['Playlist Owner'] = 'Piero'
nirvit_top_songs['Playlist Owner'] = 'Nirvit'

# Convert time to seconds
piero_top_songs['Time Seconds'] = pd.to_timedelta('00:' + piero_top_songs['Time']).dt.total_seconds().astype(int)
nirvit_top_songs['Time Seconds'] = pd.to_timedelta('00:' + nirvit_top_songs['Time']).dt.total_seconds().astype(int)

# Remove unnecessary columns
piero_top_songs = piero_top_songs.drop(columns=['Song Preview', 'Spotify Track Img', 'Album Label', 'Spotify Track Id', 'Added At', '#', 'Album', 'Album Date', 'Time'])
nirvit_top_songs = nirvit_top_songs.drop(columns=['Song Preview', 'Spotify Track Img', 'Album Label', 'Spotify Track Id', 'Added At', '#', 'Album', 'Album Date', 'Time'])

# Join playlists into one dataframe
all_songs = pd.concat([piero_top_songs, nirvit_top_songs])

# Convert all object columns to type string
object_columns = all_songs.select_dtypes(include=['object']).columns # First, create list of object columns to convert
all_songs[object_columns] = all_songs[object_columns].astype('string')

#print(all_songs.dtypes) # Check that column types have been converted to string
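
The `'00:' + Time` trick above relies on every track being shorter than an hour. As a minimal sketch of the same conversion done by hand (the durations below are made-up examples, and the variable names are only for illustration), the minutes and seconds can be split and recombined explicitly:

```python
import pandas as pd

# Hypothetical durations in the "m:ss" format used by the 'Time' column
durations = pd.Series(["3:18", "1:53", "10:05"])

# Split each value into minutes and seconds, then combine into total seconds
parts = durations.str.split(":", expand=True).astype(int)
time_seconds = parts[0] * 60 + parts[1]

print(time_seconds.tolist())
```

This version also assumes exactly one colon per value, so a track longer than an hour would need an extra column for hours.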

In [2]:
# Check for null values
all_songs.isnull().sum().sum() # 9 NaN values in 'Genres' and 'Parent Genres' columns

# Create dataframe of songs containing NaN values in either 'Genres' or 'Parent Genres'
nan_rows = all_songs[(all_songs['Genres'].isnull()) | (all_songs['Parent Genres'].isnull())]

# Fill NaNs in 'Genres' column with 'Unknown' since I cannot find them on Spotify or Google
all_songs[['Genres']] = all_songs[['Genres']].fillna('Unknown')

# Populate NaN values in 'Parent Genres' column with genres found on Spotify or Google for specified song and artist
all_songs.loc[(all_songs['Song'] == 'Mumbo Sugar') & (all_songs['Artist'] == 'Arc De Soleil'), ['Parent Genres']] = ['R&B, Soul']

all_songs.loc[(all_songs['Song'] == 'Give It Back') & (all_songs['Artist'] == 'Gaelle'), ['Parent Genres']] = ['Dance, Electronic']

all_songs.loc[(all_songs['Song'] == '愛してる') & (all_songs['Artist'] == "callin'"), ['Parent Genres']] = ['Anime, J-Pop']

all_songs.loc[(all_songs['Song'] == 'You Are Mine') & (all_songs['Artist'] == 'Jay Robinson'), ['Parent Genres']] = ['Classic Soul']

all_songs.loc[(all_songs['Song'] == 'Thank You DubNation! (the page will never be long enough)') & (all_songs['Artist'] == 'herlovebeheadsdaisies'), ['Parent Genres']] = ['Screamo']

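# Convert any remaining object columns to categories so non-numeric data can be used in statistical modeling
# (the text columns were already cast to 'string' above, so this selection is empty and their dtypes stay 'string')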
object_columns = all_songs.select_dtypes(include=['object']).columns # First, create list of object columns to convert
all_songs[object_columns] = all_songs[object_columns].astype('category')

In [3]:
# Making sure there are no null values left in the dataset
nan_rows = all_songs[(all_songs['Genres'].isnull()) | (all_songs['Parent Genres'].isnull())]
nan_rows

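| | Song | Artist | Popularity | BPM | Genres | Parent Genres | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Playlist Owner | Time Seconds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |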


In [4]:
# Splitting 'Parent Genres' column since there are so many different genres
first_instance = all_songs['Parent Genres'].str.split(',').str[0] # extract first genre element

# Assign first instance to new 'Genre' column
all_songs['Genre'] = first_instance

unique_genres = all_songs['Genre'].unique()
num_unique_genres = len(unique_genres)
print("Number of unique genres:", num_unique_genres)

# Counting unique genres
all_songs['Genre'].value_counts() # 17 (now) vs 56 (before)

# Remove unnecessary columns
all_songs = all_songs.drop(columns=['Parent Genres', 'Genres'])

Number of unique genres: 17


In [5]:
print(all_songs.dtypes) # Check categorical column types have been converted to string

Song              string
Artist            string
Popularity         int64
BPM                int64
Dance              int64
Energy             int64
Acoustic           int64
Instrumental       int64
Happy              int64
Speech             int64
Live               int64
Loud               int64
Key               string
Time Signature     int64
Camelot           string
Playlist Owner    string
Time Seconds       int64
Genre             object
dtype: object


#### Final Dataset

In [6]:
# Save dataset as csv file
#all_songs.to_csv('all_spotify_songs.csv')

# Final dataset
all_songs

| | Song | Artist | Popularity | BPM | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Playlist Owner | Time Seconds | Genre |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CAN'T SAY | Travis Scott | 80 | 148 | 70 | 71 | 20 | 0 | 71 | 0 | 10 | -5 | A#/B♭ Minor | 4 | 3A | Piero | 198 | Hip Hop |
| 1 | New Gold (feat. Tame Impala and Bootie Brown) | Gorillaz,Tame Impala,Bootie Brown | 71 | 108 | 70 | 92 | 4 | 5 | 55 | 0 | 10 | -4 | C♯/D♭ Minor | 3 | 12A | Piero | 215 | Hip Hop |
| 2 | 1AM FREESTYLE | Joji | 68 | 126 | 62 | 54 | 75 | 0 | 12 | 0 | 10 | -6 | C Minor | 4 | 5A | Piero | 113 | Pop |
| 3 | 20 Min | Lil Uzi Vert | 84 | 123 | 77 | 75 | 11 | 0 | 78 | 10 | 10 | -4 | G#/A♭ Minor | 4 | 1A | Piero | 220 | Hip Hop |
| 4 | The Less I Know The Better | Tame Impala | 88 | 117 | 64 | 74 | 1 | 1 | 79 | 0 | 10 | -4 | E Major | 4 | 12B | Piero | 216 | Metal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | FLASH CASANOVA | Yabujin | 53 | 143 | 42 | 72 | 1 | 0 | 41 | 10 | 0 | -10 | C♯/D♭ Major | 4 | 3B | Nirvit | 163 | Hip Hop |
| 96 | Sinceramente | Sérgio Sampaio | 51 | 92 | 71 | 25 | 94 | 0 | 85 | 0 | 10 | -11 | E Minor | 4 | 9A | Nirvit | 78 | Jazz |
| 97 | 24 Hr Drive-Thru | Origami Angel | 52 | 155 | 57 | 96 | 2 | 0 | 26 | 10 | 30 | -4 | G#/A♭ Major | 4 | 4B | Nirvit | 164 | Rock |
| 98 | If I Ain't Got You | Alicia Keys | 84 | 118 | 61 | 44 | 60 | 0 | 17 | 10 | 10 | -9 | G Major | 3 | 9B | Nirvit | 228 | R&B |


## Exploratory Data Analysis for Feature Selection

### Correlation Heatmap

In [7]:
import plotly.graph_objects as go

def corr_plot(data):
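    # Calculate the correlation matrix (numeric columns only; text columns cannot be correlated)
    correlation_matrix = data.corr(numeric_only=True)

    # Build a text annotation showing the rounded correlation value for each tile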
    annotations = []
    for i, row in enumerate(correlation_matrix.values):
        for j, value in enumerate(row):
            font_color = 'white' if value > -0.4 else '#7fc591'  # Set font color based on z value
            annotations.append(dict(x=correlation_matrix.columns[j], y=correlation_matrix.index[i],
                                text=str(round(value, 2)),
                                showarrow=False, font=dict(color=font_color)))

    # Create heatmap using Plotly
    fig = go.Figure(data=go.Heatmap(
                    z=correlation_matrix.values,
                    x=correlation_matrix.columns,
                    y=correlation_matrix.index,
                    colorscale='Greens',  # Choose your preferred colorscale
                    colorbar=dict(title='Correlation<br>Strength<br>')
    ))



    fig.update_layout(
        title=dict(text ='<b>Correlation Heatmap</b>', x=0.5, y=0.85),
        xaxis=dict(title='<b>Features</b>'),
        yaxis=dict(title='<b>Features</b>'),
        annotations=annotations,
        template="plotly_dark",
        height=500,
        width=700,
        hoverlabel=dict(
            bgcolor="#008000")
    )

    return fig

corr_plot(all_songs)



A correlation heatmap visualizes how strongly each pair of numeric variables moves together. By showing the strength and direction of these relationships, it helps identify patterns, trends, and dependencies within the data. With the `Greens` colorscale used here, the darkest tiles mark strong positive correlations and the lightest tiles mark strong negative ones, so those are the tiles we are most interested in.

A few main takeaways:
* **Loud:** Strong positive correlation with Energy suggests that louder songs tend to have higher energy levels.

* **Acoustic:** Strong negative correlation with Energy and Loudness implies that acoustic songs tend to have lower energy and loudness levels.

* **Energy:** Strong positive correlation with Loudness indicates that energetically intense songs tend to be louder.

* **Instrumental:** Moderate negative correlation with Danceability and Popularity suggests that instrumental songs are less danceable and less popular.

* **Dance:** It has a moderate positive correlation with Popularity, Energy, and Happiness, suggesting that more danceable songs tend to be more popular, energetic, and happier.

* **Popularity:** It shows weak positive correlations with attributes like Dance, Energy, Happy, and Loudness, indicating that more popular songs tend to have higher danceability, energy, happiness, and loudness.
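
To make the takeaways above more concrete, here is a small sketch of how the strongest relationships could be pulled out of a correlation matrix programmatically. The numbers are toy values invented for illustration, not values from our playlists, and the variable names are my own.

```python
from itertools import combinations

import pandas as pd

# Toy values (not from the real playlists) for three audio features
toy = pd.DataFrame({
    "Energy":   [20, 40, 60, 80, 95],
    "Loud":     [-14, -11, -8, -6, -4],
    "Acoustic": [90, 70, 45, 20, 5],
})

# Pearson correlation between every pair of columns
corr = toy.corr()

# The matrix is symmetric, so keep each pair of features only once
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Rank pairs by strength, ignoring whether the correlation is positive or negative
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)

for (a, b), r in ranked:
    print(f"{a} vs {b}: {r:.2f}")
```

Sorting by absolute value matters because a strong negative correlation (such as Acoustic vs Energy) is just as informative as a strong positive one.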

### Interactive Scatterplot Comparing Similarity Between Music Tastes

In [8]:
import plotly.graph_objects as go

# Define colors for each playlist owner
color_map = {'Piero': '#1ED760', 'Nirvit': '#ff00ff'} #1db96e , #b91d82

# Define symbols for each playlist owner
symbol_map = {'Piero': 'circle', 'Nirvit': 'diamond'} #triangle-up

# Define a function to create scatter plot with my original dataset
def create_original_scatter_plot(all_songs):
    # Create scatter plot
    fig = go.Figure()

    # Add text markers when hovering over points
    for group, data in all_songs.groupby('Playlist Owner'):
        fig.add_trace(go.Scatter(
            x=data['Happy'],
            y=data['Energy'],
            opacity=0.75,
            mode='markers',
            name=group,
            text=data.apply(lambda row: f"Song: {row['Song']}, Artist: {row['Artist']}, Energy: {row['Energy']}, Happiness: {row['Happy']}", axis=1),  # Hover text
            marker=dict(
                color=color_map[group],  # Color points based on group
                size=10,
                symbol=symbol_map.get(group, 'circle'),
                line=dict(
                    color='#2a8ccb',
                    width=2
                )
            )
        ))

    # Scatterplot layout
    fig.update_layout(
        title={
            'text': "<b>Top 100 Songs by Mood</b>", # Top 100 Songs by Positivity and Energy Levels
            'font': {'size': 14},
            'x': 0.5,  # Centered title
            'y': 0.9  # Adjust vertical position of title
        },
        xaxis_title="Happiness Level",
        yaxis_title="Energy Level",
        legend_title="Listener",
        width=1070,  # Set width to 1070 pixels
        height=525,  # Set height to 525 pixels
        template="plotly_dark",
        # Make hover text white
        hoverlabel=dict( 
            font=dict(
                color="white"  # Text color inside hover label
            ))
        
    )


    # Label song mood quadrants
    fig.add_annotation(
        x=0, y=105,
        text="<b>Chaotic/Angry</b>",
        font=dict(
            size=12,
            color="white",
        ),
        showarrow=False
    ) 

    fig.add_annotation(
        x= 100, y=105,
        text="<b>Happy/Upbeat</b>",
        font=dict(
            size=12,
            color="white"
        ),
        showarrow=False
    )

    fig.add_annotation(
        x= 100, y=-5,
        text="<b>Chill/Peaceful</b>",
        font=dict(
            size=12,
            color="white"
        ),
        showarrow=False
    )

    fig.add_annotation(
        x=0, y=-5,
        text="<b>Sad/Depressing</b>",
        font=dict(
            size=12,
            color="white"
        ),
        showarrow=False
    )

    # Adding cross section to distinguish mood sectors

    # Vertical line
    fig.add_shape(
        type="line",
        x0=50, y0=0,
        x1=50, y1=100,
        line=dict(
            color="white",
            width=1,
            dash="dash"
        )
    )

    # Horizontal line
    fig.add_shape(
        type="line",
        x0=0, y0=50,
        x1=100, y1=50,
        line=dict(
            color="white",
            width=1,
            dash="dash"
        )
    )

    # Return the figure so it can be displayed
    return fig

create_original_scatter_plot(all_songs)

This scatterplot compares the energy and happiness levels of all songs in our Spotify Wrapped playlists. To interpret the plot, it’s important to think about how energy and happiness features interact. 

* Low Energy + Low Happiness =  **Sad / Depressing**

* Low Energy + High Happiness = **Chill / Peaceful**

* High Energy + High Happiness = **Happy / Upbeat**

* High Energy + Low Happiness = **Chaotic / Angry**

The scatterplot reveals that songs from my playlist are primarily clustered in the upper half of the plot (the two high-energy quadrants), reflecting a mix of chaotic/angry and happy/upbeat tunes. This clustering pattern could significantly influence a model's predictive capabilities, potentially making the dataset more predictable than anticipated. Another notable trend emerges as well: while Nirvit's music taste is spread more evenly across the plot, a higher proportion of his songs fall in the sad and chill quadrants than mine do.

Make sure to hover over the various points on the scatterplot, to see which songs they represent.
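
The quadrant rules listed above can also be written as a small function. This is only a sketch: the function name is hypothetical, and treating a value of exactly 50 as "high" is my assumption, since the plot simply draws its dividing lines at 50.

```python
def mood_quadrant(happy, energy, midpoint=50):
    """Label a song by the quadrant it falls in on the Happy/Energy plot."""
    if energy >= midpoint:
        return "Happy/Upbeat" if happy >= midpoint else "Chaotic/Angry"
    return "Chill/Peaceful" if happy >= midpoint else "Sad/Depressing"


# (Happy, Energy) values copied from the dataset previews in this post
print(mood_quadrant(71, 71))  # CAN'T SAY
print(mood_quadrant(12, 54))  # 1AM FREESTYLE
print(mood_quadrant(85, 25))  # Sinceramente
print(mood_quadrant(4, 6))    # Clair de lune
```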

## Prepping Data For Machine Learning Models
### Normalize Data

In [9]:
from sklearn.preprocessing import MinMaxScaler

# Remove Artist and Song columns
normalized_songs = all_songs.drop(columns=['Song', 'Artist'])

# Select numerical columns to normalize
columns_to_normalize = ['Popularity', 'BPM', 'Dance', 'Energy', 'Acoustic', 'Instrumental', 'Happy', 'Speech', 'Live', 'Loud', 'Time Signature', 'Time Seconds']

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler on the selected columns
scaler.fit(normalized_songs[columns_to_normalize])

# Transform the selected columns
normalized_songs[columns_to_normalize] = scaler.transform(normalized_songs[columns_to_normalize])

# Create a new binary response column
normalized_songs['Binary Response'] = (normalized_songs['Playlist Owner'] == 'Piero').astype(int)

# Drop the original 'Playlist Owner' column since it is now encoded in 'Binary Response'
normalized_songs.drop(columns=['Playlist Owner'], inplace=True)

Now we've got ourselves a normalized dataset!
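
For reference, min-max scaling maps each value to `(x - min) / (max - min)`, so every scaled column ends up between 0 and 1. The sketch below checks that formula against `MinMaxScaler` on a few toy popularity scores (illustrative values, not the real column):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy popularity scores (illustrative values, not the real column)
popularity = np.array([[8.0], [50.0], [80.0], [92.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (popularity - popularity.min()) / (popularity.max() - popularity.min())

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(popularity)

print(np.allclose(manual, scaled))
print(scaled.ravel().round(6))
```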

In [10]:
normalized_songs

| | Popularity | BPM | Dance | Energy | Acoustic | Instrumental | Happy | Speech | Live | Loud | Key | Time Signature | Camelot | Time Seconds | Genre | Binary Response |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.857143 | 0.550725 | 0.732558 | 0.707071 | 0.202020 | 0.000000 | 0.715789 | 0.000000 | 0.125 | 0.820513 | A#/B♭ Minor | 0.75 | 3A | 0.115156 | Hip Hop | 1 |
| 1 | 0.750000 | 0.260870 | 0.732558 | 0.919192 | 0.040404 | 0.050505 | 0.547368 | 0.000000 | 0.125 | 0.846154 | C♯/D♭ Minor | 0.50 | 12A | 0.127786 | Hip Hop | 1 |
| 2 | 0.714286 | 0.391304 | 0.639535 | 0.535354 | 0.757576 | 0.000000 | 0.094737 | 0.000000 | 0.125 | 0.794872 | C Minor | 0.75 | 5A | 0.052006 | Pop | 1 |
| 3 | 0.904762 | 0.369565 | 0.813953 | 0.747475 | 0.111111 | 0.000000 | 0.789474 | 0.166667 | 0.125 | 0.846154 | G#/A♭ Minor | 0.75 | 1A | 0.131501 | Hip Hop | 1 |
| 4 | 0.952381 | 0.326087 | 0.662791 | 0.737374 | 0.010101 | 0.010101 | 0.800000 | 0.000000 | 0.125 | 0.846154 | E Major | 0.75 | 12B | 0.128529 | Metal | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.535714 | 0.514493 | 0.406977 | 0.717172 | 0.010101 | 0.000000 | 0.400000 | 0.166667 | 0.000 | 0.692308 | C♯/D♭ Major | 0.75 | 3B | 0.089153 | Hip Hop | 0 |
| 96 | 0.511905 | 0.144928 | 0.744186 | 0.242424 | 0.949495 | 0.000000 | 0.863158 | 0.000000 | 0.125 | 0.666667 | E Minor | 0.75 | 9A | 0.026003 | Jazz | 0 |
| 97 | 0.523810 | 0.601449 | 0.581395 | 0.959596 | 0.020202 | 0.000000 | 0.242105 | 0.166667 | 0.375 | 0.846154 | G#/A♭ Major | 0.75 | 4B | 0.089896 | Rock | 0 |
| 98 | 0.904762 | 0.333333 | 0.627907 | 0.434343 | 0.606061 | 0.000000 | 0.147368 | 0.166667 | 0.125 | 0.717949 | G Major | 0.50 | 9B | 0.137444 | R&B | 0 |


### Setting Up Training and Testing Data

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

# Define features (X) and target variable (y)
X = normalized_songs[['Popularity', 'BPM', 'Dance', 'Energy', 'Acoustic', 'Instrumental', 'Happy', 'Speech', 'Live', 'Loud', 'Key', 'Time Signature', 'Camelot', 'Time Seconds', 'Genre']] # Features

y = normalized_songs['Binary Response'] # Target variable Playlist Owner

# Initialize OneHotEncoder (`sparse_output` replaced the `sparse` argument in scikit-learn 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)

# One-hot encode categorical columns
X_encoded = pd.DataFrame(encoder.fit_transform(X[['Key', 'Camelot', 'Genre']]))  # Only encode categorical columns
X_encoded.columns = encoder.get_feature_names_out(['Key', 'Camelot', 'Genre'])  # Get categorical column names

# Reset indices of X and X_encoded
X.reset_index(drop=True, inplace=True)
X_encoded.reset_index(drop=True, inplace=True)

# Concatenate numerical and encoded categorical columns
X_final = pd.concat([X, X_encoded], axis=1)

# Drop original columns since they have been encoded to new columns
X_final.drop(columns=['Key', 'Camelot', 'Genre'], inplace=True)

# Splitting up the data into training and testing sets (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.4, random_state=18, shuffle=True)





Now that the data has been split into testing and training sets, the next step involves creating machine learning models to predict which Spotify Wrapped playlist a song belongs to.

## Creating Machine Learning Models

### Logistic Regression

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create and train the logistic regression model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred_lr = lr_model.predict(X_test)

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy:", accuracy_lr)

Accuracy: 0.7375
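
Accuracy is simply the fraction of test songs whose owner was predicted correctly. A score of 0.7375 is consistent with 59 correct predictions out of an 80-song test set (59 / 80 = 0.7375). The sketch below reproduces that arithmetic with made-up labels; it does not use the real predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 80 test songs
y_true = np.array([1] * 40 + [0] * 40)

# Copy the true labels, then flip 21 of them so that 59 predictions stay correct
y_pred = y_true.copy()
y_pred[:21] = 1 - y_pred[:21]

# Accuracy by hand: the share of positions where prediction and truth agree
manual_accuracy = (y_true == y_pred).mean()

print(manual_accuracy, accuracy_score(y_true, y_pred))
```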


### Feature Importance Plot (Logistic Regression)
I developed this feature importance plot function to identify the most and least useful predictors in each model.

In [13]:
import plotly.graph_objects as go
import panel as pn

def plot_linear_feature_importance(model_name):
    # Get model coefficients (positive values push predictions toward class 1, Piero)
    lr_importances = model_name.coef_[0]
    indices = np.argsort(lr_importances)[::-1]  # Sort from most positive to most negative

    # Get feature names
    feature_names = X_train.columns

    # Create custom color gradient
    colors = ['#1DB954', '#2BBE60', '#3AC26C', '#48C778', '#57CB84', '#65D08F', '#74D49B', '#83D9A7', '#91DDB3', '#9FE2BF'] 

    # Create figure
    fig = go.Figure()

    # Add bars to plot
    fig.add_trace(go.Bar(
        x=lr_importances[indices][:10],  # Grabs the top 10 features
        y=[feature_names[i] for i in indices[:10]],  # Grabs their corresponding feature names
        marker=dict(color=colors),
        orientation='h'  # Style as horizontal bar chart
    ))

    # Style barplot
    fig.update_layout(
        title=dict(text="<b>Top 10 Feature Importances</b>", x=0.5, font=dict(size=16, color='white', family='Arial, sans-serif')),
        xaxis=dict(title=dict(text='<b>Importance</b>', font=dict(size=14, color='white', family='Arial, sans-serif'))),
        yaxis=dict(title=dict(text='<b>Features</b>', font=dict(size=14, color='white', family='Arial, sans-serif'))),
        font=dict(size=12, color='white', family='Arial, sans-serif'),
        margin=dict(l=100, r=20, t=40, b=20),
        height=500, #500
        width=700,  # 800
        template="plotly_dark", # dark mode
         # Make hover markers have white text
        hoverlabel=dict(
            font=dict(
                color="white"
            )
        )
    )

    return fig # Display plot in dashboard when clicked

# Call function for logistic regression
plot_linear_feature_importance(lr_model)


### Creating a Visualization Dataset
To craft scatterplots, we need a streamlined visualization dataset containing only essential columns. This dataset, labeled `viz_dataset`, is extracted from the original dataset, `all_songs`, and encompasses descriptive song attributes like 'Song', 'Artist', 'Playlist Owner', in addition to 'Happy' and 'Energy' levels. The extraction process involves selecting rows corresponding to indices found within the `X_test` dataset.

In [14]:
# Reset index of the all_songs DataFrame
all_songs_reset_index = all_songs.reset_index(drop=True)

# Extract rows from the original dataset based on indices in X_test
viz_dataset = all_songs_reset_index.loc[X_test.index, ['Song', 'Artist', 'Playlist Owner','Happy', 'Energy']]

viz_dataset

| | Song | Artist | Playlist Owner | Happy | Energy |
| --- | --- | --- | --- | --- | --- |
| 134 | Suite bergamasque, L. 75: III. Clair de lune | Claude Debussy,Philippe Entremont | Nirvit | 4 | 6 |
| 91 | Fair Trade (with Travis Scott) | Drake,Travis Scott | Piero | 29 | 47 |
| 81 | Father Stretch My Hands Pt. 1 | Kanye West | Piero | 44 | 57 |
| 108 | 愛してる | callin' | Nirvit | 31 | 31 |
| 170 | Disfarça E Chora | Cartola | Nirvit | 96 | 44 |
| ... | ... | ... | ... | ... | ... |
| 126 | Kiss the Ladder | Fleshwater | Nirvit | 25 | 99 |
| 37 | lose | Travis Scott | Piero | 28 | 56 |
| 27 | Doin' it Right (feat. Panda Bear) | Daft Punk,Panda Bear | Piero | 19 | 45 |
| 2 | 1AM FREESTYLE | Joji | Piero | 12 | 54 |
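
The lookup above works because `train_test_split` keeps each row's original index label, so `X_test.index` can be used to pull the matching rows from another table that shares the same 0-to-n index. Here is a minimal sketch of that idea with toy data (all names and values below are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix, the labels, and the original song table
features = pd.DataFrame({"Happy": [10, 80, 45, 60, 30]})
labels = pd.Series([0, 1, 0, 1, 0])
songs = pd.DataFrame({"Song": ["A", "B", "C", "D", "E"], "Happy": [10, 80, 45, 60, 30]})

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.4, random_state=18)

# X_te keeps the original row labels, so they can be used to look up song details
lookup = songs.loc[X_te.index, ["Song", "Happy"]]
print(lookup)
```

This only lines up if both tables use the same index, which is why the index of `all_songs` is reset before the lookup.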


### Creating a Scatterplot Function to Show Logistic Regression Classification Results
This function can create scatterplots for any of the models in this post, whether linear, tree-based, or distance-based. The plan is to use it in the dashboard to show each model's predicted playlist owner for every song in the test set.

In [15]:
import plotly.graph_objects as go

def model_plot(y_pred):

    # Define colors for each playlist owner
    color_map = {1: '#1ED760', 0: '#ff00ff'} #1db96e , #b91d82

    # Define symbols for each playlist owner
    symbol_map = {1: 'circle', 0: 'diamond'} 

    # Map class labels to name legend labels
    legend_labels = {1: 'Piero', 0: 'Nirvit'}

    # Replace prediction labels (1,0) for names (Piero, Nirvit) in the legend
    legend_names = [legend_labels[label] for label in color_map.keys()]

    # Attach each model's predictions (`y_pred`) to the visualization dataset as a new column
    viz_dataset['Predicted Owner'] = y_pred

    # Create scatter plot
    fig = go.Figure()

    # Add text markers when hovering over points
    for group, data in viz_dataset.groupby('Predicted Owner'):
        fig.add_trace(go.Scatter(
            x=data['Happy'],
            y=data['Energy'],
            opacity=0.75,
            mode='markers',
            name=legend_labels[group],
            text=data.apply(lambda row: f"Song: {row['Song']}, Artist: {row['Artist']}, Energy: {row['Energy']}, Happiness: {row['Happy']}", axis=1),  # Hover text
            marker=dict(
                color=color_map[group],  # Color points based on group
                size=10,
                symbol=symbol_map.get(group, 'circle'),
                line=dict(
                    color='#2a8ccb', ##2a8ccb
                    width=2
                )
            )
        ))


    # Change scatterplot appearance / styles
    fig.update_layout(
         title={
        'text': "<b>Top 100 Songs by Mood</b>", # Top 100 Songs by Positivity and Energy Levels
        'font': {'size': 14},
        'x': 0.5,  # Centered title
        'y': 0.9  # Adjust vertical position of title
        },
        xaxis_title="Happiness Level",
        yaxis_title="Energy Level",
        legend_title="Listener",
        width=1070,
        height=525,
        template="plotly_dark",
        # Make hover text white
        hoverlabel=dict(
            font=dict(
                color="white"
            )
        )
       
    )

    # Label song mood quadrants
    fig.add_annotation(
        x=0, y=105,
        text="<b>Chaotic/Angry</b>",
        font=dict(
            size=12,
            color="white"
        ),
        showarrow=False
    )

    fig.add_annotation(
        x= 100, y=105,
        text="<b>Happy/Upbeat</b>",
        font=dict(
            size=12,
            color="white"
        ),
        showarrow=False
    )


    fig.add_annotation(
        x= 100, y=-5,
        text="<b>Chill/Peaceful</b>",
        font=dict(
            size=12,
            color="white"
        ),
        showarrow=False
    )

    fig.add_annotation(
        x=0, y=-5,
        text="<b>Sad/Depressing</b>",
        font=dict(
            size=12,
            color="white"
        ),
        showarrow=False
    )

    # Adding cross section to distinguish mood sectors

    # Vertical line
    fig.add_shape(
        type="line",
        x0=50, y0=0,
        x1=50, y1=100,
        line=dict(
            color="white",
            width=1,
            dash="dash"
        )
    )

    # Horizontal line
    fig.add_shape(
        type="line",
        x0=0, y0=50,
        x1=100, y1=50,
        line=dict(
            color="white",
            width=1,
            dash="dash"
        )
    )

    # Show scatterplot
    return fig 

In [16]:
model_plot(y_pred_lr)  # Scatterplot for logistic regression


### Random Forest

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create random forest model
rf_model = RandomForestClassifier(n_estimators=1000, random_state=18)

# Train Model
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate model performance
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print("Accuracy:", rf_accuracy)

Accuracy: 0.7875


### Feature Importance Plot (Random Forest)
Since linear models and tree-based models store their feature importances differently, two separate feature importance plot functions are required.

In linear models, such as linear regression or logistic regression, feature importance is derived from the coefficients assigned to each feature during model fitting. Each coefficient's sign gives the direction of the relationship with the target and its magnitude gives the strength, provided the features are on comparable scales (as they are here after min-max scaling and one-hot encoding). Accessing the `.coef_` attribute retrieves these coefficients, which can then be interpreted as feature importances.

In tree-based models like Random Forests, feature importance is typically computed based on how much each feature contributes to decreasing impurity (e.g., Gini impurity or entropy) across all the trees in the forest. The `.feature_importances_` attribute of a trained Random Forest model provides the importance scores for each feature, calculated based on this criterion.

So, while linear models directly use the coefficients as feature importance, Random Forest models use a measure of impurity decrease to determine feature importance across the ensemble of trees.
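
The difference between the two attributes is easier to see on a small synthetic dataset. In the sketch below (toy data and variable names of my own, not the playlist features), only the first feature determines the label. Because coefficients are signed, sorting them by raw value puts the most positive ones first; ranking by absolute value, as done here, is one alternative when the goal is overall influence.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first of three features determines the label
rng = np.random.default_rng(18)
X_toy = rng.random((200, 3))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

lr = LogisticRegression().fit(X_toy, y_toy)
rf = RandomForestClassifier(n_estimators=100, random_state=18).fit(X_toy, y_toy)

# Coefficients are signed, so rank them by absolute value
lr_ranking = np.argsort(np.abs(lr.coef_[0]))[::-1]

# Impurity-based importances are non-negative and sum to 1
rf_ranking = np.argsort(rf.feature_importances_)[::-1]

print("Logistic regression:", lr.coef_[0].round(2), "ranking:", lr_ranking)
print("Random forest:", rf.feature_importances_.round(2), "ranking:", rf_ranking)
```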

In [18]:
import plotly.graph_objects as go
import panel as pn

def plot_tree_feature_importance(model_name):
    # Get feature importances for tree-based model
    importances = model_name.feature_importances_
    indices = np.argsort(importances)[::-1]

    # Get corresponding feature names
    feature_names = X_train.columns

    # Create custom color gradient
    colors = ['#1DB954', '#2BBE60', '#3AC26C', '#48C778', '#57CB84', '#65D08F', '#74D49B', '#83D9A7', '#91DDB3', '#9FE2BF']

    # Create figure
    fig = go.Figure()

    # Add bars to plot
    fig.add_trace(go.Bar(
        x=importances[indices][:10],  # Grab top 10 features in the model
        y=[feature_names[i] for i in indices[:10]],  # Get corresponding feature names
        marker=dict(color=colors), # assign color gradient to bars
        orientation='h'  # Style as horizontal barplot
    ))

    # Style barplot
    fig.update_layout(
        title=dict(text="<b>Top 10 Feature Importances</b>", x=0.5, font=dict(size=16, color='white', family='Arial, sans-serif')),
        xaxis=dict(title=dict(text='<b>Importance</b>', font=dict(size=14, color='white', family='Arial, sans-serif'))),
        yaxis=dict(title=dict(text='<b>Features</b>', font=dict(size=14, color='white', family='Arial, sans-serif'))),
        font=dict(size=12, color='white', family='Arial, sans-serif'),
        margin=dict(l=100, r=20, t=40, b=20),
        height=500,
        width=700,
        template="plotly_dark",
         # Make hover text white
        hoverlabel=dict(
            font=dict(
                color="white"
            )
        )
    )

    return fig # display plot in dashboard when clicked

# Plot random forest barplot
plot_tree_feature_importance(rf_model)


### Boosted Trees

In [19]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Create boosted trees model
boost_model = GradientBoostingClassifier(n_estimators=1000,
                                max_depth=3,
                                learning_rate=0.1,
                                min_samples_split=3)

# Fit the model to training set
boost_model.fit(X_train, y_train)

# Predictions
y_pred_boost = boost_model.predict(X_test)

# Evaluate boosted trees model accuracy
boost_accuracy = accuracy_score(y_test, y_pred_boost)
print("Accuracy:", boost_accuracy)

Accuracy: 0.8


### Feature Importance Plot (Boosted Trees)

In [20]:
plot_tree_feature_importance(boost_model)

### K-Nearest Neighbors (KNN)

In [21]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Create K-nearest neighbors classifier
knn_model = KNeighborsClassifier(n_neighbors=5)  # You can adjust the number of neighbors as needed

# Fit the model to training set
knn_model.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn_model.predict(X_test)

# Calculate accuracy
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print("Accuracy:", knn_accuracy)


Accuracy: 0.6375


### Feature Importance Plot (KNN)
Unfortunately, a feature importance bar plot cannot be plotted because the K-Nearest Neighbors algorithm doesn't inherently provide feature importance scores like tree-based algorithms or linear models. Instead, K-Nearest Neighbors is a distance-based algorithm that makes predictions using Euclidean distance to measure proximity and similarity between data points. Due to the lack of feature importance scores and its low performance, it will not be included in the final dashboard.
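
For completeness, one model-agnostic option that I did not use in this project is permutation importance: shuffle one feature at a time and measure how much the score drops. The sketch below applies scikit-learn's `permutation_importance` to a KNN model on synthetic data (not the playlist features), purely as an illustration of the idea.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: only the first of three features determines the label
rng = np.random.default_rng(18)
X_toy = rng.random((300, 3))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

# Train on the first 200 rows and hold out the last 100 for scoring
knn = KNeighborsClassifier(n_neighbors=5).fit(X_toy[:200], y_toy[:200])

# Shuffle each feature 10 times and record the average drop in accuracy
result = permutation_importance(knn, X_toy[200:], y_toy[200:], n_repeats=10, random_state=18)

print(result.importances_mean.round(3))
```

Features whose shuffling barely changes accuracy get an importance near zero, while the feature that drives the label shows a clear drop.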

### Support Vector Machine

In [22]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Create Support Vector Machine Classifier
svm_model = SVC(kernel='linear')  # Other kernel options include 'rbf' and 'poly'

# Fit the model to training set
svm_model.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm_model.predict(X_test)

# Calculate accuracy
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print("Accuracy:", svm_accuracy)

Accuracy: 0.75


### Feature Importance Plot (SVM)

In [23]:
plot_linear_feature_importance(svm_model)

### Decision Trees

In [24]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create Decision Tree Classifier
dec_tree_model = DecisionTreeClassifier()

# Fit the model to training set
dec_tree_model.fit(X_train, y_train)

# Make predictions
y_pred_dec_tree = dec_tree_model.predict(X_test)

# Calculate accuracy
dec_tree_accuracy = accuracy_score(y_test, y_pred_dec_tree)
print("Accuracy:", dec_tree_accuracy)

Accuracy: 0.7875


### Feature Importance Plot (Decision Tree)

In [25]:
plot_tree_feature_importance(dec_tree_model)

### Gauge Visualization
Next, I define a gauge visualization function that displays each model's accuracy on the dashboard, along with the change from the previously selected model.

In [26]:
import panel as pn
import plotly.graph_objects as go

# Create gauge visualization function
def gauge_accuracy_viz(model_performance, last_reference):
    # Plotly derives the delta from `reference` below, showing whether the
    # current model performs better or worse than the previous one

    # Create gauge chart
    fig = go.Figure(go.Indicator(
        mode="gauge+number+delta",
        value= model_performance * 100,
        domain={'x': [0, 1], 'y': [0, 1]},
        title={'text': "Accuracy", 'font': {'size': 24, 'color': "#00ff7f"}},
        delta={'reference': last_reference * 100, 'increasing': {'color': "#00ff00"}, 'decreasing': {'color': "#ff7373"}},
        gauge={
            'axis': {'range': [None, 100], 'tickwidth': 2, 'tickcolor': "#70D2A2"},
            'bar': {'color': "#1DB954"},
            'bgcolor': "white",
            'borderwidth': 3,
            'bordercolor': "#00ff7f",
            'steps': [
                {'range': [0, 50], 'color': '#b91d82'},
                {'range': [50, 100], 'color': '#fff68f'}],
            'threshold': {
                'line': {'color': "#cc0000", 'width': 4},
                'thickness': 0.75,
                'value': model_performance * 100}}
    ))
    
    # Add percent sign to value and delta
    fig.update_traces(number={'suffix': '%'}, delta={'suffix': '%'})
    # Visualize gauge in dark mode
    fig.update_layout(template="plotly_dark", font={'color': "#00ff7f", 'family': "Arial"}, height=500, width=364)
    
    return fig 
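
To make the delta on the gauge concrete, here is a small sketch of the arithmetic behind it, using the test accuracies reported in the cells above. The helper `gauge_readout` is hypothetical and only mirrors the value and delta that the indicator displays (value minus reference, both in percent); it does not draw anything.

```python
# Test accuracies reported in the cells above
accuracies = {
    "Boosted Trees": 0.8,
    "K-Nearest Neighbors": 0.6375,
    "Support Vector Machine": 0.75,
    "Decision Trees": 0.7875,
}

def gauge_readout(model_performance, last_reference):
    """Return the (value, delta) pair the gauge would display, in percent."""
    value = model_performance * 100
    delta = value - last_reference * 100
    return round(value, 2), round(delta, 2)

last_reference = 0  # same starting baseline as the dashboard
for name in ["Support Vector Machine", "Decision Trees", "Boosted Trees"]:
    value, delta = gauge_readout(accuracies[name], last_reference)
    print(f"{name}: {value}% ({delta:+}%)")
    last_reference = accuracies[name]  # the next click compares against this
```

Clicking the models in that order would show a large first delta (measured against the baseline of 0) followed by small positive deltas as each model slightly outperforms the one before it.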

## Panel Dashboard

In [30]:
import panel as pn
import pandas as pd
import plotly.graph_objects as go
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
pn.extension('plotly')  # every pane below is a pn.pane.Plotly, so load the Plotly extension

# Create buttons for selecting models
button_original_dataset = pn.widgets.Button(name = 'Original Dataset')
button_logistic_regression = pn.widgets.Button(name='Logistic Regression')
button_random_forest = pn.widgets.Button(name='Random Forest')
button_boosted_trees = pn.widgets.Button(name='Boosted Trees')
button_decision_trees = pn.widgets.Button(name='Decision Trees')
button_svm = pn.widgets.Button(name='Support Vector Machine')
button_knn = pn.widgets.Button(name='K-Nearest Neighbors')

last_reference = 0 # Create global variable to store the previous model's accuracy score

# Define callback functions for the buttons
def on_click_original_dataset(event):
    scatter_plot.object = create_original_scatter_plot(all_songs)
    feature_importance_plot.object = corr_plot(all_songs) # Switch in a corr plot since there are no features to show

    global last_reference
    gauge_pane.object = gauge_accuracy_viz(0,0)
    last_reference = 0 # reset the baseline so the next model is compared against zero

def on_click_logistic_regression(event):
    scatter_plot.object = model_plot(y_pred_lr)  
    feature_importance_plot.object = plot_linear_feature_importance(lr_model)
    
    global last_reference
    gauge_pane.object = gauge_accuracy_viz(accuracy_lr, last_reference)
    last_reference = accuracy_lr

def on_click_random_forest(event):
    scatter_plot.object =  model_plot(y_pred_rf) 
    feature_importance_plot.object = plot_tree_feature_importance(rf_model)

    global last_reference
    gauge_pane.object = gauge_accuracy_viz(rf_accuracy, last_reference)
    last_reference = rf_accuracy

def on_click_boosted_trees(event):
    scatter_plot.object =  model_plot(y_pred_boost) 
    feature_importance_plot.object =  plot_tree_feature_importance(boost_model)

    global last_reference
    gauge_pane.object = gauge_accuracy_viz(boost_accuracy, last_reference)
    last_reference = boost_accuracy

def on_click_decision_trees(event):
    scatter_plot.object = model_plot(y_pred_dec_tree) 
    feature_importance_plot.object = plot_tree_feature_importance(dec_tree_model)
    
    global last_reference
    gauge_pane.object = gauge_accuracy_viz(dec_tree_accuracy, last_reference)
    last_reference = dec_tree_accuracy

def on_click_svm(event):
    scatter_plot.object = model_plot(y_pred_svm) 
    feature_importance_plot.object = plot_linear_feature_importance(svm_model)

    global last_reference
    gauge_pane.object = gauge_accuracy_viz(svm_accuracy, last_reference)
    last_reference = svm_accuracy

def on_click_knn(event):
    scatter_plot.object = model_plot(y_pred_knn) 
    feature_importance_plot.object = None # KNN exposes no feature importance scores to plot

    global last_reference
    gauge_pane.object = gauge_accuracy_viz(knn_accuracy, last_reference)
    last_reference = knn_accuracy


# Bind callbacks when button is clicked
button_original_dataset.on_click(on_click_original_dataset)
button_logistic_regression.on_click(on_click_logistic_regression)
button_random_forest.on_click(on_click_random_forest)
button_boosted_trees.on_click(on_click_boosted_trees)
button_decision_trees.on_click(on_click_decision_trees)
button_svm.on_click(on_click_svm)
button_knn.on_click(on_click_knn)


# Create scatter plot widget
scatter_plot = pn.pane.Plotly()

# Create feature importance plot widget
feature_importance_plot = pn.pane.Plotly()  # plot_feature_importance(lr_model) 

# Create gauge visualization pane
gauge_pane = pn.pane.Plotly() #gauge_accuracy_viz(rf_accuracy, last_reference)

# Create logo pane
panel_logo = pn.pane.PNG(
    '/Users/piero/Downloads/Spotify_Project/Spotify_Logo_RGB_Green.png',
    width=150, height=95, align='center'
)

#text1 = 'Visualize the performance of machine learning models in classifying songs from my playlist and my friends.' 
text2 = 'Select a model using the buttons above to visualize its performance.' 
text3 = '[View dashboard code](link_to_your_code)'

# Dashboard layout
template = pn.template.FastListTemplate(theme="dark",
    logo = '/Users/piero/Downloads/Spotify_Project/Spotify_Logo_RGB_Green.png',
    title = "Visualizing Spotify Song Classification Performance",
    sidebar =[pn.pane.Markdown("## Reset"),   
             button_original_dataset, pn.pane.Markdown("## Models"), button_logistic_regression, button_random_forest, 
             button_boosted_trees, button_decision_trees, button_svm, text2, text3],
    main=[
            pn.Row(pn.Column(scatter_plot, sizing_mode='stretch_both', margin=(-20,0,0,-24))),
            pn.Row(pn.Column(feature_importance_plot, margin=(11,0,0,-24)),
                   pn.Column(gauge_pane, margin=(11,0,0,-13)), sizing_mode='stretch_both', height=400, width=950
                  )
          ],
    theme_toggle = False,
    accent_base_color="#0bff38", # change color of hyperlink text
    header_background="#1f2630", # change color of header banner | previous color: #009E60
    header_color = '#0bff38', # change color of header text | previous color: #57ff76
    main_max_width = '900', # maximum width of the main area containing all plots
    main_layout = None,
    sidebar_width=172, # adjust sidebar size
    font = 'https://fonts.googleapis.com/css2?family=Raleway:ital,wght@0,100;1,100&display=swap'
    
) 

# Load the original dataset plots on startup
on_click_original_dataset(None)

# Display the dashboard
template.show()





Launching server at http://localhost:50831
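
The model callbacks above all follow the same three steps: update the scatter plot, update the feature importance plot, then refresh the gauge and store the new reference. As a possible refactor (not what the dashboard code does), a small factory function could generate them. The sketch below uses `SimpleNamespace` stand-ins instead of real Panel panes so it runs on its own, and `make_callback` and `state` are names invented for this example.

```python
from types import SimpleNamespace

# Stand-ins for the three Panel panes; only their `.object` attribute matters here
scatter_plot = SimpleNamespace(object=None)
feature_importance_plot = SimpleNamespace(object=None)
gauge_pane = SimpleNamespace(object=None)

state = {"last_reference": 0}  # takes the place of the module-level global

def make_callback(make_scatter, make_importance, accuracy, make_gauge):
    """Build an on_click handler that refreshes all three panes for one model."""
    def on_click(event):
        scatter_plot.object = make_scatter()
        feature_importance_plot.object = make_importance()
        gauge_pane.object = make_gauge(accuracy, state["last_reference"])
        state["last_reference"] = accuracy  # remembered for the next click
    return on_click

# Hypothetical plot builders that return labels instead of Plotly figures
on_click_svm = make_callback(
    make_scatter=lambda: "scatter for SVM",
    make_importance=lambda: "importance for SVM",
    accuracy=0.75,
    make_gauge=lambda acc, ref: f"gauge {acc} vs {ref}",
)
on_click_svm(None)
print(gauge_pane.object, state)
```

In the real dashboard, the lambdas would wrap calls such as `model_plot(y_pred_svm)` and `plot_linear_feature_importance(svm_model)`, and `make_gauge` would be `gauge_accuracy_viz`.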



## Final Thoughts
Overall, this project was a great opportunity to dig into music data analysis, machine learning, and dashboarding. It was fascinating to uncover patterns in our Spotify Wrapped playlists, compare our music tastes statistically, and visualize where our listening habits overlap. Our classification models were far from perfect, but they still performed well: the boosted trees model, for example, reached 80% accuracy on the test set, which hints at meaningful differences between the songs favored by Nirvit and me.

For those interested in conducting a similar analysis using Python, I recommend exploring my [GitHub repository](https://github.com/suppiero/spotify_classification_dash) dedicated to this project.

## Sources
I'd like to extend a special thank you to the wonderful data analysts who inspired me to make this project, offering invaluable ideas and sharing fantastic source code.
* [Whose Song is it Anyway? By Lewis White.](https://lewis-r-white.github.io/posts/2023-03-13-spotify-ML-blog/)
* [How to Create a Beautiful Python Visualization Dashboard With Panel/Hvplot. By Thu Vu Data Analytics.](https://www.youtube.com/watch?v=uhxiXOTKzfs)
* [Predicting Song Popularity. By Alison Salerno.](https://github.com/AlisonSalerno/song-popularity-linear-regression/tree/master)
* [App Gallery. By Panel.](https://panel.holoviz.org/gallery/index.html)