# Discovering The Most Important Attributes of Tracks on Spotify

## Task Overview
In this notebook we will be analyzing a Spotify dataset in order to discover the most important attributes of a track.  Along the way we will take a special look at the metrics of the most popular tracks within each genre.

We will start by cleaning the data and preparing it for analysis.

## Data Origin
This dataset comes from [Kaggle](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download). It includes a CSV file containing the following metrics:
- `track_id`: The Spotify ID for the track
- `artists`: The artists' names who performed the track. If there is more than one artist, they are separated by a ;
- `album_name`: The album name in which the track appears
- `track_name`: Name of the track
- `popularity`: The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity.
- `duration_ms`: The track length in milliseconds
- `explicit`: Whether or not the track has explicit lyrics (true = yes it does; false = no it does not OR unknown)
- `danceability`: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable
- `energy`: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale
- `key`: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1
- `loudness`: The overall loudness of a track in decibels (dB)
- `mode`: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0
- `speechiness`: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks
- `acousticness`: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic
- `instrumentalness`: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content
- `liveness`: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live
- `valence`: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)
- `tempo`: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration
- `time_signature`: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of 3/4, to 7/4.
- `track_genre`: The genre in which the track belongs

### Run the cell below to load the data.

In [None]:
### Run me!
import pandas as pd
import numpy as np
from pandas.testing import assert_frame_equal
import matplotlib.pyplot as plt


spotify_df = pd.read_csv('dataset.csv')
original_df = spotify_df.copy()

# Clean The Data

**Input:** `spotify_df`: A Pandas DataFrame, as described above

**Return:** `cleaned_df`: A Pandas DataFrame that has been cleansed for processing

**Requirements:** 

0. Do not modify the original DataFrame
1. Remove the unnamed index column.
2. Rename the `valence` column to `positivity`.
3. Rename the `loudness` column to `loudness_DBFS`.

In [None]:
def clean_data(spotify_df: pd.DataFrame) -> pd.DataFrame:
    # drop() and rename() return new DataFrames, so the input is left unmodified
    df = spotify_df.drop(columns=['Unnamed: 0'])
    df = df.rename(columns={'valence': 'positivity', 'loudness': 'loudness_DBFS'})
    return df

# demo function call
results = clean_data(spotify_df)
display(results.head())
assert_frame_equal(spotify_df, original_df)
clean_df = results.copy()
print("Passed.")
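
As a minimal sketch of why the cleaning step leaves its input untouched: `drop` and `rename` return new DataFrames unless `inplace=True` is passed. The frame `toy_df` below is hypothetical and uses made-up values; it only mimics the columns that the cleaning step touches.

```python
import pandas as pd

# Hypothetical toy frame mimicking the raw CSV layout (values are made up).
toy_df = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'valence': [0.7, 0.2],
    'loudness': [-6.5, -12.0],
})
toy_before = toy_df.copy()

# Both calls return new objects, so toy_df itself is never modified.
toy_clean = (toy_df
             .drop(columns=['Unnamed: 0'])
             .rename(columns={'valence': 'positivity', 'loudness': 'loudness_DBFS'}))

print(list(toy_clean.columns))
print(toy_df.equals(toy_before))
```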

# A Quick Helper Function

Before starting our analysis, let's build a helper function.

**Input:** `clean_df`: A Pandas DataFrame, as described above

`n`: An integer

**Return:** `top_n_tracks_by_genre`: A Pandas DataFrame containing the n most popular tracks in each genre

**Requirements:** 

1. Reset the index

In [None]:
def top_n_tracks_by_genre(clean_df: pd.DataFrame, n: int) -> pd.DataFrame:
    # For each genre, keep the n rows with the highest popularity
    out = (clean_df
           .groupby(by='track_genre')
           .apply(lambda group: group.nlargest(n, 'popularity'))
           .reset_index(drop=True)
          )
    return out

# demo function call
top_10_tracks_by_genre_df = top_n_tracks_by_genre(clean_df,10)
display(top_10_tracks_by_genre_df.head(15))
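
For comparison, here is a sketch of an alternative way to get the top `n` rows per group that avoids `apply`: sort once by popularity, then take the first `n` rows of each genre with `groupby(...).head(n)`. This is not the method used above, the toy data is made up, and ties in popularity may be broken differently than with `nlargest`.

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy_df = pd.DataFrame({
    'track_genre': ['rock', 'rock', 'rock', 'jazz', 'jazz'],
    'track_name': ['a', 'b', 'c', 'd', 'e'],
    'popularity': [50, 90, 70, 30, 60],
})

n = 2
toy_top = (toy_df
           .sort_values('popularity', ascending=False)  # most popular first
           .groupby('track_genre')
           .head(n)                                     # first n rows of each genre
           .sort_values(['track_genre', 'popularity'], ascending=[True, False])
           .reset_index(drop=True))
print(toy_top)
```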

# Initial Analysis

Now let's perform an initial analysis by taking averages of the most popular songs in each genre. 

**Input:** `clean_df`: A Pandas DataFrame, as described above

**Return:** `top_stats_df`: A Pandas DataFrame containing the required statistics for each genre

**Requirements:** 

Do not modify the original DataFrame.

The return DataFrame should include the following mean values for the top 10 songs in each genre, sorted by `avg_popularity` descending:

- `avg_popularity`
- `avg_duration_mins`: convert to minutes
- `avg_danceability`
- `avg_energy`
- `avg_loudness_DBFS`
- `avg_speechiness`
- `avg_acousticness`
- `avg_instrumentalness`
- `avg_liveness`
- `avg_positivity`

`track_genre` should also be renamed `genre` for this analysis



In [None]:
def top_stats(clean_df: pd.DataFrame):
    t_df = top_n_tracks_by_genre(clean_df, 10)
    out = t_df.groupby(by='track_genre').mean(numeric_only=True)
    out['avg_duration_mins'] = out['duration_ms'].apply(lambda ms: ms / 60000)
    out = out.drop(columns=['duration_ms'])
    out = out.reset_index(drop=False)
    out = out.rename(columns={'popularity': 'avg_popularity',
                              'danceability': 'avg_danceability',
                              'energy': 'avg_energy',
                              'loudness_DBFS': 'avg_loudness_DBFS',
                              'speechiness': 'avg_speechiness',
                              'acousticness': 'avg_acousticness',
                              'instrumentalness': 'avg_instrumentalness',
                              'liveness': 'avg_liveness',
                              'positivity': 'avg_positivity',
                              'track_genre': 'genre',
                              'key': 'avg_key',
                              'tempo': 'avg_tempo',
                              'time_signature': 'avg_time_signature',
                              'explicit': 'avg_explicit',
                              'mode': 'avg_mode'})
    out = out.sort_values(by='avg_popularity', ascending=False)
    out = out.reset_index(drop=True)
    return out

# demo function call
top_stats_df = top_stats(clean_df)
display(top_stats_df)

# Visualization of Preliminary Results

In [None]:
vis = top_stats_df.drop(columns='avg_popularity')
# Plot every remaining column against the average popularity of each genre
for col, data in vis.items():
    plt.scatter(data, top_stats_df['avg_popularity'])
    plt.title(f'{col} vs avg_popularity')
    plt.xlabel(col)
    plt.ylabel('Genre Popularity')
    plt.show()

# Which Genres Reach The Highest?

Genre seems to be an indicator of popularity.  Let's find out which genres had top tracks with an average popularity score above 80.

**Input:** 

`top_stats_df`: A Pandas DataFrame, as described above

`threshold`: An integer representing the minimum average popularity value allowed

**Return:** `popular_genres_df`: A Pandas DataFrame containing genres with `avg_popularity` values above the `threshold`

**Requirements:** 


In [None]:
def popular_genres(top_stats_df: pd.DataFrame, threshold: int):
    return top_stats_df[top_stats_df['avg_popularity'] > threshold][['genre', 'avg_popularity']]


# demo function call
popular_genres_df = popular_genres(top_stats_df, 80)
display(popular_genres_df)

# Hypothesis

At this point we have dug deep enough into the data to hypothesize about which attributes may be significant.  It looks to me as though overall loudness plays a substantial role in the success of a track, along with the right balance of energy and positivity, low instrumentalness, and a duration in the low 3-minute range.  Genre also looks interesting, but it is qualitative. Let's perform a **principal component analysis** (PCA) to see if my intuition is correct.

### Start by collecting only the quantitative data from the dataset.

**Input:** `clean_df`: A Pandas DataFrame, as described above

**Return:** `quant_df`: A Pandas DataFrame containing only the quantitative data of `clean_df`

**Requirements:** 

Drop any columns that are strings.  Convert `explicit` to 1 or 0 instead of `True` or `False`, where 1 means `True`.

In [None]:
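def quant_data(clean_df: pd.DataFrame) -> pd.DataFrame:
    # Find the string (object-dtype) columns and drop them; drop() returns a new DataFrame
    str_cols = clean_df.select_dtypes(include='object').columns
    out = clean_df.drop(columns=str_cols)
    # Cast the boolean flag to 1/0 in one vectorized step
    out['explicit'] = out['explicit'].astype(int)
    return out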

# demo function call
quant_df = quant_data(clean_df)
display(quant_df)
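
The same idea on a toy frame, as a sketch: instead of dropping string columns, this variant keeps only numeric and boolean columns with `select_dtypes(include=['number', 'bool'])`, then casts the flag to integers. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy_df = pd.DataFrame({
    'track_name': ['a', 'b'],
    'explicit': [True, False],
    'tempo': [120.0, 98.5],
})

# Keep numeric and boolean columns only; copy() so the cast below
# cannot touch toy_df.
toy_quant = toy_df.select_dtypes(include=['number', 'bool']).copy()
toy_quant['explicit'] = toy_quant['explicit'].astype(int)
print(toy_quant)
```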

# Centering The Data

The first step of PCA is to center the data about the mean.  We will do this for all columns in `quant_df`.

**Input:** `quant_df`: A Pandas DataFrame of quantitative data only

**Return:** `centered_clean_df`: A Pandas DataFrame containing the data of `quant_df` centered about their mean, i.e., $\frac{1}{m} \sum_{i=0}^{m-1} \hat{x}_i = 0$. 

**Requirements:** 

1. Do not modify the original DataFrame.
2. All data should be centered about their mean column-wise. Each observation is represented by a $d$-dimensional real-valued vector corresponding to $d$ measured predictors. We have already stacked them into a data matrix, denoted $X \equiv \left(\begin{array}{c} \hat{x}_0^T \\ \vdots \\ \hat{x}_{m-1}^T \end{array}\right)$, and wish to ensure that $\frac{1}{m} \sum_{i=0}^{m-1} \hat{x}_i = 0$. 

In [None]:
def centered_data(quant_df: pd.DataFrame) -> pd.DataFrame:
    # Column-wise means; the subtraction aligns on column labels and returns a new DataFrame
    means = quant_df.mean()
    centered = quant_df - means
    return centered

# demo function call
centered_clean_df = centered_data(quant_df)
display(centered_clean_df)
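
A quick sanity check of the centering step, sketched on a hypothetical two-column frame with made-up values: after subtracting the column means, every column should average to zero up to floating-point error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up).
toy_df = pd.DataFrame({'tempo': [100.0, 120.0, 140.0],
                       'energy': [0.2, 0.5, 0.8]})

# toy_df.mean() is a Series indexed by column name, so the subtraction
# is applied column by column across all rows.
toy_centered = toy_df - toy_df.mean()

print(toy_centered)
print(toy_centered.mean())
```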

# Singular Value Decomposition

Next we will compute the SVD using NumPy.

**Input:** `df`: A Pandas DataFrame, with columns centered about their means

**Return:** `(U, Sigma, VT)`: A Tuple representing the SVD of `df`

**Requirements:** 

1. Do not modify the original DataFrame.

In [None]:
def pd_svd(df: pd.DataFrame):
    data = df.to_numpy()
    return np.linalg.svd(data, full_matrices=False)

# demo function call
U, Sigma, VT = pd_svd(centered_clean_df)
display(f'U shape: {U.shape},Sigma shape: {Sigma.shape},VT shape: {VT.shape}')
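
To make the output of the reduced SVD more concrete, the sketch below runs it on a small random matrix (not the Spotify data) and checks two standard facts: the three factors multiply back to the centered matrix, and the squared singular values, normalized to sum to 1, give the share of variance captured along each principal direction. That share is one common way to judge how many components matter.

```python
import numpy as np

# Small random matrix standing in for a centered data matrix (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X = X - X.mean(axis=0)

U, Sigma, VT = np.linalg.svd(X, full_matrices=False)

# With full_matrices=False and m >= d: U is (m, d), Sigma is (d,), VT is (d, d).
print(U.shape, Sigma.shape, VT.shape)

# The factors reconstruct X (up to floating-point error).
X_rebuilt = U @ np.diag(Sigma) @ VT

# Fraction of total variance along each principal direction.
explained = Sigma**2 / np.sum(Sigma**2)
print(explained)
```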

# Determine The Most Informative Attributes

The SVD contains insights into which musical attributes carry the most information about a track.  Each row of `VT` is a right-singular vector $v_k$ with one entry per attribute. Let's examine the magnitudes of those entries and use the largest one in each vector to identify the attribute that dominates it.  The entries with the largest magnitudes contribute the most to the projections of the original vectors.

**Input:** `VT`: A NumPy array of transposed right-singular vectors.

`n`: An integer number of top attributes to return

**Return:** `top_attr`: a list of column names from `centered_clean_df` 

**Requirements:** 

1. Do not modify the original DataFrame.
2. Only take the top attribute from each vector $v_k$

In [None]:
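def top_attributes(VT: np.ndarray, n: int) -> list:
    col_indexes = []
    for k in range(n):
        # Row k of VT is the right-singular vector v_k; find its largest-magnitude entry
        v_k = np.abs(VT[k, :])
        col_indexes.append(np.argmax(v_k))

    # Note: column names are looked up in the global centered_clean_df
    return centered_clean_df.columns[col_indexes].to_list()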

# demo function call
top_attr = top_attributes(VT, 7)
display(top_attr)
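
The selection rule can be sketched on synthetic data. The two columns below are generated independently with made-up parameters, one on a scale of hundreds of thousands and one between 0 and 1. When the data is centered but not rescaled, the column with the larger variance dominates the first right-singular vector; this is a general property of PCA on unscaled data, and whether it drives the Spotify results is not checked here.

```python
import numpy as np
import pandas as pd

# Synthetic, independent columns on very different scales (not real data).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    'duration_ms': rng.normal(200_000, 50_000, size=200),
    'energy': rng.uniform(0, 1, size=200),
})
toy_centered = toy_df - toy_df.mean()
_, _, VT = np.linalg.svd(toy_centered.to_numpy(), full_matrices=False)

# For each right-singular vector (row of VT), pick the column whose
# entry has the largest absolute value.
top_idx = np.abs(VT).argmax(axis=1)
toy_top_attr = toy_centered.columns[top_idx].to_list()
print(toy_top_attr)
```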

For the next section it will be nice to have the list of genres on-hand. Here are all the genres to choose from:

In [None]:
pd.set_option('display.max_rows', None)
genres = clean_df['track_genre'].drop_duplicates().reset_index(drop=True)
display(genres)
pd.set_option('display.max_rows', 15)

# Determine The Most Informative Attributes By Genre

Let's see if these results change on a genre-to-genre basis.  I will perform a similar calculation for a genre of my choice.  Feel free to change the genre to your favorite!

**Input:** `clean_df`: A pandas DataFrame of data about tracks which includes a `track_genre` column.

`genre`: A string from the list of genres in `genres`

`n`: An integer of top attributes to return

**Return:** `top_attr_by_genre`: a list of column names from `quant_df` 

**Requirements:** 

1. Do not modify the original DataFrame.

In [None]:
def top_attributes_by_genre(clean_df: pd.DataFrame, genre: str, n: int):
    _,_, VT = pd_svd(centered_data(quant_data(clean_df[clean_df['track_genre'] == genre])))
    return top_attributes(VT, n)

# demo function call
top_attr_by_genre = top_attributes_by_genre(clean_df, 'ska', 7)
display(top_attr_by_genre)

# What Genres Are Different?

Lots of genres seem to value the same attributes, just in slightly different orders, or at different values.  But which ones stand out from the pack?  Let's find out.

**Input:** `clean_df`: A pandas DataFrame of data about tracks, including a `track_genre` column.

`n`: An integer of top attributes to return

**Return:** `genre_differences_df`: a pandas DataFrame with one row per genre. The `non-standard_attributes` column lists any top attributes of that genre that do not appear in the genre-agnostic result, and is empty otherwise.

**Requirements:** 

In [None]:
def genre_differences(clean_df: pd.DataFrame, n: int) -> pd.DataFrame:
    # Genre-agnostic baseline: top attributes across all tracks
    _, _, VT = pd_svd(centered_data(quant_data(clean_df)))
    no_g = set(top_attributes(VT, n))
    genres = clean_df['track_genre'].drop_duplicates().reset_index(drop=True)

    # create results DataFrame
    results = pd.DataFrame(genres)
    results['non-standard_attributes'] = ''

    for genre in genres:
        _, _, VT_g = pd_svd(centered_data(quant_data(clean_df[clean_df['track_genre'] == genre])))
        g = set(top_attributes(VT_g, n))
        # Attributes that appear for this genre but not in the baseline
        diff = g.difference(no_g)
        if diff:
            # sorted() makes the output order deterministic
            results.loc[results['track_genre'] == genre, 'non-standard_attributes'] = ' '.join(sorted(diff))
    return results

# demo function call
pd.set_option('display.max_rows', None)
genre_differences_df = genre_differences(clean_df, 7)
display(genre_differences_df)
pd.set_option('display.max_rows', 15)
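
One detail worth keeping in mind when reading this table: a set difference is one-directional. The sketch below uses two made-up attribute lists (hypothetical, not results from the dataset) to show that the attributes found only for a genre and the baseline attributes that the genre lacks are separate questions.

```python
# Hypothetical attribute lists, for illustration only.
baseline_attrs = ['duration_ms', 'tempo', 'popularity']
genre_attrs = ['duration_ms', 'tempo', 'loudness_DBFS']

# In the genre's list but not in the baseline.
only_in_genre = sorted(set(genre_attrs) - set(baseline_attrs))
# In the baseline but missing from the genre's list.
only_in_baseline = sorted(set(baseline_attrs) - set(genre_attrs))

print(only_in_genre)
print(only_in_baseline)
```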

# Conclusion

I did not expect duration to be the most important quantitative attribute of a track! Tempo also made a surprise appearance.  Loudness playing a big role makes sense to me, since we perceive loud sounds as better than quiet ones (until it's too loud!).  Popularity also makes sense as an important factor, since music is viral.  

My analysis could have been improved by using a more mathematical method of determining the number of attributes taken from the vectors $v_k$.  I chose to use one from each after visually inspecting the coefficients, noting that there was almost always one value of 0.9 or higher followed by very small values, i.e. the principal components were very similar to the columns themselves.

Some data could have been better represented by using the mode, rather than the mean.  I would expect tempo to be multimodal, for example. Key is another attribute that probably needs a mode for an average.

In the future I'd like to look into the ideal values of the most important attributes for each genre as well.