# Spotify Top 50 tracks analysis


## Context

Imagine you're a data analyst working for Spotify. Your team is responsible for content analysis and in this quarter you've decided to analyze Spotify's top hits to quantify what makes a hit song. Your team's product manager has many ideas and has prepared a list of questions (requirements) that she wants you to answer. After reviewing the list of over 20 questions, you are not in a good mood - it will take a couple of days to get all the answers. 

Luckily, a few days ago, an experienced data scientist working in your team queried the top 50 tracks for her machine learning project and agreed to share the data with you. This is a great help - your SQL skills are not too sharp yet, and you don't yet know where to find all the relevant tables in your data warehouse. With this dataset, you are confident that you'll be able to answer all of your PM's questions, plus maybe even look into some additional points of interest.

## Preparing for exploratory data analysis

First, we'll import the `numpy` and `pandas` libraries for our analysis. After reading the `.csv` dataset, we'll print the first five rows to get a feel for the features in the dataset.

In [None]:
import numpy as np
import pandas as pd

top50 = pd.read_csv("datasets/spotifytoptracks.csv")
top50.head()

## 00 Data cleaning

Perform data cleaning by:
- Handling missing values.
- Removing duplicate samples and features.
- Treating the outliers.

### Missing values

We'll check for missing values using the combination of `isna()` and `sum()` methods. If it returns zero, we have no missing values to deal with.

In [None]:
top50.isna().sum()

### Duplicate samples and features

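Next, we use the `duplicated()` method to check for repeated rows. Summing the resulting boolean Series gives the number of duplicated rows, so a result of zero means that every row is unique.

In [None]:
top50.duplicated().sum()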
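The heading above mentions duplicate features as well as duplicate samples. As a minimal sketch of how both could be checked, the example below uses a small made-up frame (`toy` and its column names are illustrative assumptions, not part of the Spotify dataset): rows are checked with `duplicated()` directly, and columns by applying the same method to the transposed frame.

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the second one,
# and "energy_copy" repeats the "energy" column.
toy = pd.DataFrame({
    "track_name": ["A", "B", "B"],
    "energy": [0.5, 0.7, 0.7],
    "energy_copy": [0.5, 0.7, 0.7],
})

# Number of rows that repeat an earlier row
n_duplicate_rows = int(toy.duplicated().sum())

# Columns whose values repeat an earlier column (checked on the transpose)
duplicate_columns = toy.columns[toy.T.duplicated()].tolist()

print(n_duplicate_rows)
print(duplicate_columns)
```

Transposing is convenient for a dataset of this size, but it may be slow on wide or very long frames.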

### Outliers

So far, we have no duplicate rows and no missing values. However, we still need to find out whether any columns contain outliers. Here I define the function `find_outliers_IQR(df, dim)`, which uses the interquartile range (IQR = q3 - q1) to flag values that fall below q1 - 1.5\*IQR or above q3 + 1.5\*IQR.

Using this function, we find the outliers in each column separately, and then use DataFrame methods to extract the songs that are outliers in at least one feature (column). The result is a boolean DataFrame showing which songs are outliers relative to the whole `top50` dataset, with the number of outlier dimensions per song counted in the column `out_dim_count`.

In [None]:
dim = list(top50.columns[5:16].values)

def find_outliers_IQR(df, dim):
    temp = []
    for d in dim:
        q1=df[d].quantile(0.25)
        q3=df[d].quantile(0.75)
        IQR=q3-q1
        outliers = df[d][((df[d]<(q1-1.5*IQR)) | (df[d]>(q3+1.5*IQR)))]
        temp.append(outliers)
    outliers_concatenated = pd.concat(temp, axis=1)
    return outliers_concatenated

outliers = find_outliers_IQR(top50, dim)
out_sorted = outliers.sort_index().notna()
out_sorted['out_dim_count'] = out_sorted.sum(axis=1)
outlier_tracks = top50.loc[out_sorted.index, ['artist', 'album', 'track_name']]
out_sorted = pd.concat([outlier_tracks, out_sorted], axis=1)
out_sorted.sort_values(['out_dim_count'], ascending=False)

From this initial analysis we see that Billie Eilish's 'everything i wanted' clearly differs from the rest, as it is an outlier in three dimensions – instrumentalness, acousticness and loudness. But that's only our initial scoping of the dataframe. Half of the songs are outliers in at least one dimension, meaning that removing them early on would significantly reduce our sample size. So we decided to leave them in for now and proceed with the product manager's questions.
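To make the IQR rule concrete, here is a small worked sketch on a single made-up column. The `loudness` values below are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical loudness values (dB); the last one is far quieter than the rest
loudness = pd.Series([-6.0, -5.5, -5.0, -4.5, -4.0, -14.0], name="loudness")

q1 = loudness.quantile(0.25)
q3 = loudness.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (loudness < lower) | (loudness > upper)

print(f"q1={q1}, q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(loudness[is_outlier])
```

With only 50 tracks, the fences can be fairly tight, which is one reason so many songs end up flagged in at least one dimension.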

## 01 Data analysis

### 1. How many observations are there in this dataset?

In [None]:
samples = len(top50.index)
features = len(top50.columns)
values = samples * features
print(f"There are {samples} observations (rows), each described by {features} features, giving {values} individual values")

### 2. How many features does this dataset have?

In [None]:
print(f"There are {features} features in the dataset")

### 3. Which of the features are categorical?

In [None]:
top50.select_dtypes(include=['category']).columns

A quick stop here: apparently, no column has the `category` dtype, even though the `genre` values clearly repeat and can divide the data into categories. We could convert the column to categorical data, but is it useful?

In [None]:
top50['genre'].value_counts()

We see that the column has multiple genres in some cells, but only 10 observations are very 'exotic' – that is, have some unusual combinations of genres. It might be worth converting the column for usability purposes later in our exploration.

In [None]:
top50['genre'] = top50['genre'].astype('category')

Let's check whether we successfully converted genres into categorical data:

In [None]:
print("The following features are categorical: ")
top50.select_dtypes(include=['category']).columns

### 4. Which of the features are numeric?

To answer this question, we select columns with data types `int64` and `float64`.

In [None]:
numeric_features = pd.Series(top50.select_dtypes(include=['int64', 'float64']).columns)

print("The following features are numerical: ")
numeric_features
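As a hedged aside, `select_dtypes` also accepts the generic `"number"` selector, which picks up every numeric dtype without listing `int64` and `float64` by hand. The sketch below uses a small invented frame (`toy` and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical frame with one text, one categorical and two numeric columns
toy = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "genre": pd.Categorical(["Pop", "Pop", "Hip-Hop/Rap"]),
    "energy": [0.73, 0.69, 0.59],
    "key": [1, 5, 0],
})

# "number" matches all numeric dtypes (integers and floats of any width)
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

# "category" matches only columns explicitly converted to the category dtype
categorical_columns = toy.select_dtypes(include="category").columns.tolist()

print(numeric_columns)
print(categorical_columns)
```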

### 5. Are there any artists that have more than 1 popular track? If yes, which and how many?

For this, we count how often each name appears in the `artist` column and keep only the artists whose count is greater than 1.

In [None]:
artists_several_tracks = top50['artist'].value_counts()

print("The following artists have more than one track in the top50 list: ")
artists_several_tracks[artists_several_tracks > 1]

### 6. Who was the most popular artist?

There are two ways to answer this question:
1. By top hit count: we already saw that Billie Eilish, Dua Lipa and Travis Scott have the most tracks in the list;
2. Another way is to take the top row of the list.

As there is no `no_of_plays` feature, which would indicate how popular the track was, we assume that ordering by index represents the popularity. 

In [None]:
print("The following is the most popular artist with the first position in chart: ")

top50['artist'].head(1)

### 7. How many artists in total have their songs in the top 50?

For this, we'll look for unique entries in the `artist` column.

In [None]:
artist_count = len(top50['artist'].unique())
print(f"{artist_count} artists have their songs in the top 50")

### 8. Are there any albums that have more than 1 popular track? If yes, which and how many?

In this case, we need to group by the `album` column, treating repetitions as a song count. If it exceeds 1, we assume the album has more than one song in top 50.

In [None]:
albums_several_tracks = top50.groupby(['album', 'artist']).size().reset_index(name='song_count')
with_multiple_tracks = albums_several_tracks[albums_several_tracks['song_count'] > 1]

print("These albums have more than one track on the top 50 list: ")
with_multiple_tracks.sort_values(by='song_count', ascending=False)

### 9. How many albums in total have their songs in the top 50?

We use the `unique` method to collect all distinct values; the length of the result gives us the number of albums.

In [None]:
album_count = len(top50['album'].unique())
print(f"There are {album_count} albums in top 50.")

### 10. Which tracks have a danceability score above 0.7?

We select the dataframe with `artist`, `track_name` and `danceability` columns. We then apply the condition, based on the `danceability` values, and sort the resulting dataframe by them in descending order.

In [None]:
danceability_score = top50[['artist', 'track_name', 'danceability']]
high_danceability = danceability_score[danceability_score['danceability'] > 0.7].sort_values(by='danceability', ascending=False)

print("The following are the tracks with high danceability score: ")
high_danceability

### 11. Which tracks have a danceability score below 0.4?

We use a similar approach here, only changing the condition and sorting in ascending order.

In [None]:
low_danceability = danceability_score[danceability_score['danceability'] < 0.4].sort_values(by='danceability', ascending=True)

print("The following are the tracks with the low danceability score: ")
low_danceability

### 12. Which tracks have their loudness above -5?

Here we follow the `danceability` logic, only changing the features and conditions.

In [None]:
loudness_score = top50[['artist', 'track_name', 'loudness']]
loud_tracks = loudness_score[loudness_score['loudness'] > -5].sort_values(by='loudness', ascending=False)

print("These are the loudest tracks in the top 50: ")
loud_tracks

### 13. Which tracks have their loudness below -8?

In [None]:
quiet_tracks = loudness_score[loudness_score['loudness'] < -8].sort_values(by='loudness', ascending=True)

print("These are the quietest tracks in the top 50: ")
quiet_tracks

### 14. Which track is the longest?

For this, again, we select a subset of the original data frame and locate the maximum value in the `duration_ms` column using the `idxmax()` method.

In [None]:
track_duration = top50[['artist', 'track_name', 'duration_ms']]
longest_track = track_duration['duration_ms'].idxmax()
longest_track
track_duration.loc[longest_track]

Not the most user-friendly format, so let's convert it into `minutes:seconds`.

In [None]:
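def to_minutes_seconds(duration):
    # Round to whole seconds first, so the seconds part can never show up as 60
    total_seconds = int(round(duration / 1000))
    minutes, seconds = divmod(total_seconds, 60)
    # Zero-pad the seconds, so that 3 min 5 s reads "3:05" rather than "3:5"
    return f"{minutes}:{seconds:02d}"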

long_duration = track_duration.loc[longest_track]['duration_ms']

print(f"The following is the longest track of the top 50: \n \n {track_duration.loc[longest_track]} \n")

print(f"Length in minutes: {to_minutes_seconds(long_duration)}")

### 15. Which track is the shortest?

In [None]:
shortest_track = track_duration['duration_ms'].idxmin()
shortest_track
track_duration.loc[shortest_track]

In [None]:
short_duration = track_duration.loc[shortest_track]['duration_ms']

print(f"The following is the shortest track of the top 50: \n \n {track_duration.loc[shortest_track]} \n")

print(f"Length in minutes: {to_minutes_seconds(short_duration)}")
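The same conversion can also be applied to a whole column at once instead of one track at a time. The sketch below is one possible vectorized version, shown on made-up durations (the values and the `formatted` name are illustrative assumptions).

```python
import pandas as pd

# Hypothetical track durations in milliseconds
durations_ms = pd.Series([179_600, 185_000, 312_820], name="duration_ms")

# Round to whole seconds, then split into minutes and remaining seconds
total_seconds = (durations_ms / 1000).round().astype(int)
minutes = total_seconds // 60
seconds = total_seconds % 60

# Build "m:ss" strings, zero-padding the seconds part
formatted = minutes.astype(str) + ":" + seconds.astype(str).str.zfill(2)

print(formatted.tolist())
```

Rounding before splitting avoids results such as `2:60`, and the zero-padding keeps short second values readable.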

### 16. Which genre is the most popular?

We have already touched upon this question when considering categorical variables. Let's redo the Series:

In [None]:
top50['genre'].value_counts()

From the data we see that Pop dominates, with Hip-Hop/Rap close behind, especially if we count sub-genres with one or two observations (chamber pop, hip-hop/trap, etc.). Suffice it to say that the pop and hip-hop/rap genres clearly dominate over the other 'distinct' genre-level alternatives such as electronic music, alternative and R&B.

### 17. Which genres have just one song on the top 50?

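For this, we rely on the fact that `value_counts()` returns a Series with the genres as its index and their counts as values. We keep the genres whose count is below 2 and then select the tracks that belong to them.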

In [None]:
song_genre = top50[['artist', 'track_name', 'genre']]
genre_list = top50['genre'].value_counts()
single_track_per_genre = genre_list[genre_list < 2].index

unique_genre_tracks = song_genre[song_genre['genre'].isin(single_track_per_genre)]

print("The following genres only appear once in the dataset: ")
unique_genre_tracks

A short note here. As we already discovered when dealing with the categorical data type, some of these genres are just combinations of other genres. These exotic sets raise the question of how to deal with them: do we treat each combination as a unique genre, or should we look for ways to split them into separate features? And if so, which genre should be taken as the main one? This caveat is only a side note for further analysis, as we'll now focus on the 'pure' genres, which make up the majority of the top 50 list.

### 18. How many genres in total are represented in the top 50?

Setting aside the philosophical considerations from the previous question about what defines a genre, we simply use the `unique` method here.

In [None]:
print(f"There are {len(top50['genre'].unique())} genres in the top 50")

### 19. Which features are strongly positively correlated?

We have to choose, somewhat arbitrarily, a threshold that indicates strong correlation. We will take r = 0.75 as the threshold and treat any pair of features with r > 0.75 as strongly correlated. The same will go for negative correlation (r < -0.75).

In [None]:
features = top50.iloc[:, 5:16]
features.corr() > 0.75

Ignoring the diagonal (every feature correlates perfectly with itself), we see the following correlation:

- energy – loudness

This is consistent with the truism: the louder the music, the more energetic the vibe. Correlation alone cannot establish causation, though.
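A boolean matrix is hard to scan by eye, so one option is to list the qualifying pairs directly. The sketch below is an illustrative approach on invented data (the `toy` frame and its values are assumptions, not the Spotify features): it keeps only the upper triangle of the correlation matrix, so each pair appears once and the diagonal is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical feature values, chosen so that some pairs correlate strongly
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8, 0.9],
    "loudness": [-12.0, -9.0, -7.0, -5.0, -4.0],
    "acousticness": [0.9, 0.7, 0.4, 0.2, 0.1],
    "tempo": [120, 95, 140, 100, 118],
})

corr = toy.corr()

# True strictly above the diagonal: each pair once, no self-correlations
upper_triangle = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one row per feature pair, indexed by (feature_a, feature_b)
pairs = corr.where(upper_triangle).stack().dropna()

strong_positive = pairs[pairs > 0.75]
strong_negative = pairs[pairs < -0.75]

print(strong_positive)
print(strong_negative)
```

The same pattern works for any threshold, including a band such as (-0.25, 0.25) for weakly correlated pairs.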

### 20. Which features are strongly negatively correlated?

In [None]:
features = top50.iloc[:, 5:16]
features.corr() < -0.5

Choosing a threshold of r = -0.75 yielded no results, so we relaxed it to r = -0.5 just to see some moderate patterns. Those are:

- energy – acousticness
- instrumentalness – loudness

From these initial inferences we can conclude that the energy of a track is strongly positively correlated with loudness, while it correlates negatively with acousticness – acoustic songs are calmer. Also, instrumental tracks seem to be quieter.

### 21. Which features are not correlated?

We use the same conditional approach here, setting the range of low correlation to (-0.25, 0.25).

In [None]:
correlation_matrix = features.corr()
correlation_matrix
no_correlation_features = (correlation_matrix > -0.25) & (correlation_matrix < 0.25)
no_correlation_features

In short, it's hard to say which features are not correlated given only the (-0.25, 0.25) range. We clearly see that duration and tempo are essentially uncorrelated with the other features, except for the stronger relationship between speechiness and duration. The other dimensions show weak-to-moderate relationships, with none standing out.

### 22. How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

For this, we'll print summaries of descriptive statistics using the `describe` method. We also keep the key genres in a separate list so that they are easy to reuse later on. For a more informative approach, we'll also print the `median` explicitly (it also appears as the 50% row of `describe`).

In [None]:
key_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']

In [None]:
danceability_by_key_genre = top50[['danceability', 'genre']].set_index('genre').loc[key_genres]
for genre in key_genres:
    print(genre, '\n', danceability_by_key_genre.loc[genre].describe(), '\n', "median: ", danceability_by_key_genre.loc[genre].median(), '\n')

On average, we see that Hip-Hop/Rap and Dance/Electronic tracks have a higher danceability score. Hip-Hop/Rap also has the most danceable track (0.896) of all genres. The least danceable genre is Alternative/Indie, which also has the least danceable track (0.459)
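As an alternative to printing one `describe()` summary per genre, the comparison can be collected into a single table with `groupby` and `agg`. This is only a sketch on made-up numbers (the `toy` frame is an assumption for illustration), not the notebook's actual output.

```python
import pandas as pd

# Hypothetical tracks; R&B/Soul is included to show that it gets filtered out
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Dance/Electronic", "R&B/Soul"],
    "danceability": [0.60, 0.70, 0.80, 0.90, 0.75, 0.50],
})
key_genres = ["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"]

# Keep only the key genres, then summarise danceability per genre in one table
summary = (
    toy[toy["genre"].isin(key_genres)]
    .groupby("genre")["danceability"]
    .agg(["count", "mean", "median", "min", "max"])
)

print(summary)
```

If `genre` has already been converted to a categorical dtype, passing `observed=True` to `groupby` keeps categories with no tracks out of the result.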

### 23. How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [None]:
loudness_by_key_genre = top50[['loudness', 'genre']].set_index('genre').loc[key_genres]
for genre in key_genres:
    print(genre, '\n', loudness_by_key_genre.loc[genre].describe(), '\n', "median: ", loudness_by_key_genre.loc[genre].median(), '\n')

As expected, Dance/Electronic music is the loudest on average, while Hip-Hop/Rap has the quietest songs. However, if we compare the median values, Alternative/Indie has the higher middle value. This means either that a few really loud Dance/Electronic songs skew the mean, or that the majority of Alternative/Indie songs tend to be louder.

### 24. How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [None]:
acousticness_by_key_genre = top50[['acousticness', 'genre']].set_index('genre').loc[key_genres]
for genre in key_genres:
    print(genre, '\n', acousticness_by_key_genre.loc[genre].describe(), '\n', "median: ", acousticness_by_key_genre.loc[genre].median(), '\n')

The most acoustic genre is Alternative/Indie, which has both the highest mean and the highest median. Its median also exceeds its mean, meaning that most of its values sit on the higher side, with a few low values pulling the mean down. Dance/Electronic tracks are the least acoustic.
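The mean-versus-median reasoning above can be illustrated with a tiny invented sample (the numbers below are assumptions, not values from the dataset): when most values are high and one is very low, the median stays high while the mean is dragged down.

```python
import pandas as pd

# Hypothetical acousticness scores: four high values and one very low one
acousticness = pd.Series([0.05, 0.55, 0.60, 0.65, 0.70])

mean_value = acousticness.mean()
median_value = acousticness.median()

# A negative skew indicates a longer tail on the low side
skewness = acousticness.skew()

print(f"mean={mean_value:.2f}, median={median_value:.2f}, skew={skewness:.2f}")
```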

## Some takeaways and considerations

### What did we learn from this analysis?

1. Billie Eilish, Travis Scott and Dua Lipa have the most songs on the top 50 list (3 each), meaning that together they make up almost a fifth of the chart (9 of 50 tracks, 18%). Given that there are 40 artists in total, less than 10% of the artists account for almost 20% of the tracks on the list;
2. Billie Eilish's song 'everything i wanted' is an outlier in the most dimensions (3). However, 25 of the tracks are outliers in at least one dimension;
3. Four genres cover almost three quarters (36 out of 50) of the tracks, but we should be aware that the majority of the less frequent genres are combinations of other genres, and these should be treated differently, either by cleaning or by further manipulation. Notice that among the 'pure' genres R&B/Soul has only two samples, which may hint at limited popularity, although a single chart cannot show a trend – and the most popular track is R&B;
4. Energy and loudness correlate strongly, while we see negative correlations between energy and acousticness and between instrumentalness and loudness (but the r threshold here is lower). Tempo and track duration are largely uncorrelated with the other features of the track;
5. Some genre-related insights on several features:
- Hip-Hop/Rap and Dance/Electronic tracks have a higher danceability score. Hip-Hop/Rap also has the most danceable track (0.896) of all genres. The least danceable genre is Alternative/Indie, which also has the least danceable track (0.459);
- As expected, the Dance/Electronic music is the loudest, while Hip-Hop/Rap has the quietest songs. However, if we compare the median values, Alternative/Indie has the higher middle value, meaning that there are really loud songs among Dance/Electronic, which skew the data, or the majority of Alternative/Indie songs tend to be louder;
- The most acoustic genre is Alternative/Indie, having both the highest mean and median values, the latter also surpassing the former – meaning that the values are in general on the higher side. Dance/Electronic tracks are the least acoustic.

### How to improve it and further things to look at:

1. We didn't have the listener / play count in this dataset, meaning that we had to trust the row order as a proxy for popularity. It would be interesting to see the relationship between play count and other features in numerical terms;
2. We could also tinker with the genre approach, building more general categories (for example, a Pop category including nu-pop or disco-pop). Regular expressions and a mapping could be employed here to add several genre columns;
3. The `key` feature is also interesting to analyze, especially in relation to valence or energy. Initially we saw no correlation, but knowing whether a track is in a major or minor key might give us a hint of how top music sounds.
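As a rough sketch of the genre idea in point 2, raw genre labels could be mapped to broader categories with a small list of regular expressions. Everything below is an assumption made for illustration: the sample labels, the patterns, the category names and the `to_broad_genre` helper are hypothetical, and a real mapping would need to be agreed with the team.

```python
import re

import pandas as pd

# Hypothetical raw genre labels, similar in spirit to the mixed labels discussed above
genres = pd.Series(
    ["Pop", "Nu-disco", "Dance-pop/Disco", "Hip-Hop/Rap", "Hip-Hop/Trap", "Pop rap"],
    name="genre",
)

# Hypothetical (pattern, broad category) rules; order matters, the first match wins
broad_patterns = [
    (r"hip-hop|rap|trap", "Hip-Hop/Rap"),
    (r"disco|dance|electro", "Dance/Electronic"),
    (r"pop", "Pop"),
]

def to_broad_genre(raw):
    """Return the first broad category whose pattern matches the raw label."""
    for pattern, label in broad_patterns:
        if re.search(pattern, raw, flags=re.IGNORECASE):
            return label
    return "Other"

broad = genres.map(to_broad_genre)
print(pd.concat([genres, broad.rename("broad_genre")], axis=1))
```

Because the first matching rule wins, the order of the patterns encodes which genre is treated as the main one for mixed labels, which is exactly the judgment call raised earlier in the analysis.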
