# AN ANALYSIS OF TOP 50 SPOTIFY TRACKS IN 2020

This notebook, called "spotify_project", contains an analysis of the top 50 Spotify tracks of 2020. It includes the analysis questions, the code, and the answers derived from the dataset. The dataset has 50 observations (each row represents one song) and 16 features (each column represents a song attribute), not counting the index column.

In [29]:
import kagglehub
import pandas as pd
import numpy as np
import os

path = kagglehub.dataset_download("atillacolak/top-50-spotify-tracks-2020")

print("Path to dataset files:", path)

print("Files in the downloaded directory:", os.listdir(path))

csv_file_path = os.path.join(path, "spotifytoptracks.csv")

top_tracks = pd.read_csv(csv_file_path, index_col=0)

top_tracks.head()

Path to dataset files: /home/ubuntu/.cache/kagglehub/datasets/atillacolak/top-50-spotify-tracks-2020/versions/2
Files in the downloaded directory: ['spotifytoptracks.csv']


Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


In [31]:
top_tracks.shape

(50, 16)

# DATA CLEANING

First, I clean the data. The dataset contains no NULL values, and all rows are unique.

In [2]:
top_tracks.isna().any()

Unnamed: 0          False
artist              False
album               False
track_name          False
track_id            False
energy              False
danceability        False
key                 False
loudness            False
acousticness        False
speechiness         False
instrumentalness    False
liveness            False
valence             False
tempo               False
duration_ms         False
genre               False
dtype: bool

In [3]:
uniques = top_tracks.drop_duplicates()

uniques

Unnamed: 0.1,Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco
5,5,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.69,0.746,11,-7.956,0.247,0.164,0.0,0.101,0.497,89.977,181733,Hip-Hop/Rap
6,6,Harry Styles,Fine Line,Watermelon Sugar,6UelLqGlWMcVH1E5c4H7lY,0.816,0.548,0,-4.209,0.122,0.0465,0.0,0.335,0.557,95.39,174000,Pop
7,7,Powfu,death bed (coffee for your head),death bed (coffee for your head),7eJMfftS33KTjuF7lTsMCx,0.431,0.726,8,-8.765,0.731,0.135,0.0,0.696,0.348,144.026,173333,Hip-Hop/Rap
8,8,Trevor Daniel,Nicotine,Falling,2rRJrJEo19S2J82BDsQ3F7,0.43,0.784,10,-8.756,0.123,0.0364,0.0,0.0887,0.236,127.087,159382,R&B/Hip-Hop alternative
9,9,Lewis Capaldi,Divinely Uninspired To A Hellish Extent,Someone You Loved,7qEHsqek33rTcFNT9PFqLf,0.405,0.501,1,-5.679,0.751,0.0319,0.0,0.105,0.446,109.891,182161,Alternative/Indie
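
Displaying the result of `drop_duplicates()` does not by itself show whether any rows were dropped. A minimal sketch of a more explicit check follows; it assumes a small hypothetical sample in place of `top_tracks`, and the choice of `artist` plus `track_name` as the identifying columns is my assumption, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical three-row sample; in the notebook this would be `top_tracks`.
sample = pd.DataFrame(
    {
        "artist": ["The Weeknd", "Tones And I", "Roddy Ricch"],
        "track_name": ["Blinding Lights", "Dance Monkey", "The Box"],
    }
)

# duplicated() marks every row that repeats an earlier row.
duplicate_rows = int(sample.duplicated().sum())

# Checking a subset of columns catches a song listed twice with different metrics.
duplicate_songs = int(sample.duplicated(subset=["artist", "track_name"]).sum())

print(f"Duplicate rows: {duplicate_rows}, duplicate songs: {duplicate_songs}")
```

A count of zero confirms uniqueness directly, without comparing table lengths by eye.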


Below is a list of data types of all columns:

In [4]:
top_tracks.dtypes

Unnamed: 0            int64
artist               object
album                object
track_name           object
track_id             object
energy              float64
danceability        float64
key                   int64
loudness            float64
acousticness        float64
speechiness         float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms           int64
genre                object
dtype: object

Let's convert the genre column to the categorical data type to speed up processing and reduce memory use:

In [5]:
top_tracks["genre"] = top_tracks["genre"].astype("category")

top_tracks.dtypes

Unnamed: 0             int64
artist                object
album                 object
track_name            object
track_id              object
energy               float64
danceability         float64
key                    int64
loudness             float64
acousticness         float64
speechiness          float64
instrumentalness     float64
liveness             float64
valence              float64
tempo                float64
duration_ms            int64
genre               category
dtype: object
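
To see what the conversion buys, the memory footprint before and after can be compared. The sketch below uses a hypothetical column of four genre labels repeated many times rather than the real data. With only 50 rows, as in this notebook, the saving is likely to be small.

```python
import pandas as pd

# Hypothetical column: four genre labels repeated 250 times each.
genres = pd.Series(
    ["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"] * 250
)

# deep=True also counts the memory held by the Python string objects.
object_bytes = genres.memory_usage(deep=True)
category_bytes = genres.astype("category").memory_usage(deep=True)

print(f"as strings: {object_bytes} bytes, as category: {category_bytes} bytes")
```

The categorical version stores each distinct label once plus a small integer code per row, which is why it tends to win when labels repeat often.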

Then, I will check for outliers in numeric columns:

In [6]:
numeric_columns = ["energy",
                   "danceability",
                   "loudness",
                   "acousticness",
                   "speechiness",
                   "liveness",
                   "valence",
                   "tempo",
                   "duration_ms"
                   ]

outliers_dict = {}

for column in numeric_columns:
    Q1 = top_tracks[column].quantile(0.25)
    Q3 = top_tracks[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = top_tracks[
        (top_tracks[column] < lower_bound) | (top_tracks[column] > upper_bound)
    ]

    outliers_dict[column] = outliers

for column, outliers in outliers_dict.items():
    print(f"Column: {column}. Number of outliers: {len(outliers)}")

Column: energy. Number of outliers: 0
Column: danceability. Number of outliers: 3
Column: loudness. Number of outliers: 1
Column: acousticness. Number of outliers: 7
Column: speechiness. Number of outliers: 6
Column: liveness. Number of outliers: 3
Column: valence. Number of outliers: 0
Column: tempo. Number of outliers: 0
Column: duration_ms. Number of outliers: 2


I found outliers in danceability, loudness, acousticness, speechiness, liveness, and duration. However, I chose to keep them, since songs naturally vary in their attributes.
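
The 1.5 × IQR rule used above can be illustrated on a toy series. This is a small sketch with made-up values (not taken from the dataset); `iqr_bounds` is a hypothetical helper name.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Hypothetical toy values, not taken from the dataset.
toy = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_bounds(toy)
flagged = toy[(toy < lower) | (toy > upper)]

print(lower, upper)      # -1.0 7.0
print(flagged.tolist())  # [100]
```

Here Q1 is 2 and Q3 is 4, so the IQR is 2 and the fences sit at -1 and 7; only the value 100 falls outside them.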

# EXPLORATORY DATA ANALYSIS

Below is a list of artists who have more than one popular track:

In [7]:
artist_count = top_tracks["artist"].value_counts()

multiple_tracks_artists = artist_count[artist_count > 1]

multiple_tracks_artists

artist
Dua Lipa         3
Billie Eilish    3
Travis Scott     3
Harry Styles     2
Lewis Capaldi    2
Justin Bieber    2
Post Malone      2
Name: count, dtype: int64

The most popular artists among the top 50 Spotify tracks of 2020, based on the number of tracks they have, are Dua Lipa, Billie Eilish, and Travis Scott.

In total, 40 artists have songs in this list:

In [8]:
unique_artist_count = len(artist_count)

unique_artist_count

40

In total, 4 albums have more than one popular track:

In [9]:
album_count = top_tracks["album"].value_counts()

multiple_tracks_albums = album_count[album_count > 1]

multiple_tracks_albums

album
Future Nostalgia        3
Hollywood's Bleeding    2
Fine Line               2
Changes                 2
Name: count, dtype: int64

In total, there are 45 albums in this list:

In [10]:
unique_album_count = len(album_count)

unique_album_count

45

Turning to the attributes of the songs, below is a list of 32 tracks with a danceability score above 0.7. A higher danceability value means that the song is easier to dance to.

In [11]:
high_danceability_tracks = top_tracks[top_tracks["danceability"] > 0.7]

high_danceability = high_danceability_tracks[["track_name", "danceability"]]
high_length = len(high_danceability)

print(high_danceability)
print()
print(f"Total number of high danceability tracks: {high_length}.")


                                       track_name  danceability
1                                    Dance Monkey         0.825
2                                         The Box         0.896
3                           Roses - Imanbek Remix         0.785
4                                 Don't Start Now         0.793
5                    ROCKSTAR (feat. Roddy Ricch)         0.746
7                death bed (coffee for your head)         0.726
8                                         Falling         0.784
10                                           Tusa         0.803
13                                Blueberry Faygo         0.774
14                       Intentions (feat. Quavo)         0.806
15                                   Toosie Slide         0.830
17                                         Say So         0.787
18                                       Memories         0.764
19                     Life Is Good (feat. Drake)         0.795
20               Savage Love (Laxed - Si

Only 1 track has a danceability score below 0.4. This suggests that most of the top 50 tracks have a medium or high danceability score:

In [12]:
low_danceability_tracks = top_tracks[top_tracks["danceability"] < 0.4]

low_danceability = low_danceability_tracks[["artist", "track_name", "danceability"]]
low_length = len(low_danceability)

print(low_danceability)
print()
print(f"Total number of low danceability tracks: {low_length}.")

           artist            track_name  danceability
44  Billie Eilish  lovely (with Khalid)         0.351

Total number of low danceability tracks: 1.
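
The two thresholds (below 0.4 and above 0.7) can also be applied in a single step that assigns every track to a band. The sketch below uses the danceability values of the first ten tracks displayed earlier. The band names `low`, `medium`, and `high` are my own labels.

```python
import numpy as np
import pandas as pd

# Danceability of the first ten tracks shown earlier in the notebook.
danceability = pd.Series(
    [0.514, 0.825, 0.896, 0.785, 0.793, 0.746, 0.548, 0.726, 0.784, 0.501]
)

# Same cut-offs as the text: below 0.4 is low, above 0.7 is high.
band = np.select(
    [danceability < 0.4, danceability > 0.7],
    ["low", "high"],
    default="medium",
)

band_counts = pd.Series(band).value_counts()
print(band_counts)
```

On these ten rows the result is 7 high and 3 medium tracks, which agrees with the first rows of the high-danceability list above.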


Loudness is measured in Loudness Units relative to Full Scale (LUFS) and reported here in dB, so values closer to 0 are louder. Below is a list of 19 relatively loud tracks with a loudness above -5 dB:

In [13]:
loud_tracks_list = top_tracks[top_tracks["loudness"] > -5]

loud_tracks = loud_tracks_list[["artist", "track_name", "loudness"]]
loud_length = len(loud_tracks)

print(loud_tracks)
print()
print(f"Total number of relatively loud tracks: {loud_length}.")

           artist                                     track_name  loudness
4        Dua Lipa                                Don't Start Now    -4.521
6    Harry Styles                               Watermelon Sugar    -4.209
10        KAROL G                                           Tusa    -3.280
12    Post Malone                                        Circles    -3.497
16  Lewis Capaldi                                  Before You Go    -4.858
17       Doja Cat                                         Say So    -4.577
21   Harry Styles                                      Adore You    -3.675
23       24kGoldn                         Mood (feat. iann dior)    -3.558
31       Dua Lipa                                 Break My Heart    -3.434
32            BTS                                       Dynamite    -4.410
33          BENEE               Supalonely (feat. Gus Dapperton)    -4.746
35      Lady Gaga                Rain On Me (with Ariana Grande)    -3.764
37    Post Malone  Sunflo

Below is a list of 9 relatively quiet tracks with a loudness below -8 dB:

In [14]:
quiet_tracks_list = top_tracks[top_tracks["loudness"] < -8]

quiet_tracks = quiet_tracks_list[["artist", "track_name", "loudness"]]
quiet_length = len(quiet_tracks)

print(quiet_tracks)
print()
print(f"Total number of relatively quiet tracks: {quiet_length}.")

           artist                                      track_name  loudness
7           Powfu                death bed (coffee for your head)    -8.765
8   Trevor Daniel                                         Falling    -8.756
15          Drake                                    Toosie Slide    -8.820
20      Jawsh 685                Savage Love (Laxed - Siren Beat)    -8.520
24  Billie Eilish                             everything i wanted   -14.454
26  Billie Eilish                                         bad guy   -10.965
36   Travis Scott                             HIGHEST IN THE ROOM    -8.764
44  Billie Eilish                            lovely (with Khalid)   -10.109
47        JP Saxe  If the World Was Ending - feat. Julia Michaels   -10.086

Total number of relatively quiet tracks: 9.


The shortest track is "Mood (feat. iann dior)" and the longest track is "SICKO MODE":

In [15]:
shortest_duration = top_tracks["duration_ms"].min()
longest_duration = top_tracks["duration_ms"].max()

shortest_track_name = top_tracks.loc[
    top_tracks["duration_ms"] == shortest_duration, "track_name"
].values[0]

longest_track_name = top_tracks.loc[
    top_tracks["duration_ms"] == longest_duration, "track_name"
].values[0]

print(
    f"The shortest track is '{shortest_track_name}' with a duration of "
    f"{shortest_duration} ms."
)
print(
    f"The longest track is '{longest_track_name}' with a duration of "
    f"{longest_duration} ms."
)

The shortest track is 'Mood (feat. iann dior)' with a duration of 140526 ms.
The longest track is 'SICKO MODE' with a duration of 312820 ms.
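
Durations in milliseconds are hard to read at a glance. A small sketch of converting the two values above to minutes and seconds follows; `format_duration` is a hypothetical helper, and it simply drops the fractional second.

```python
def format_duration(duration_ms: int) -> str:
    """Format milliseconds as m:ss, dropping the fractional second."""
    minutes, seconds = divmod(duration_ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"


shortest = format_duration(140526)
longest = format_duration(312820)

print(shortest, longest)  # 2:20 5:12
```

So the shortest track runs about 2 minutes 20 seconds and the longest about 5 minutes 12 seconds.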


The most popular genre among the top 50 Spotify tracks of 2020, based on the number of songs per genre, is Pop with 14 tracks:

In [16]:
genre_count = top_tracks["genre"].value_counts()

genre_count.head()

genre
Pop                  14
Hip-Hop/Rap          13
Dance/Electronic      5
Alternative/Indie     4
 Electro-pop          2
Name: count, dtype: int64
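
The label ` Electro-pop` in the output above starts with a space. If other rows carried the same label without the space, the two spellings would be counted as separate genres. A minimal sketch of normalizing such labels follows. The sample values are hypothetical and only mimic that leading space.

```python
import pandas as pd

# Hypothetical labels; the stray spaces mimic the ' Electro-pop' entry above.
genre = pd.Series([" Electro-pop", "Pop", "Electro-pop", "Pop "])

# str.strip() removes leading and trailing whitespace from every label.
cleaned = genre.str.strip()

print(cleaned.value_counts())
```

In the notebook, this step would presumably go before the conversion to the categorical type, so that the categories are built from the cleaned labels.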

The 10 genres that have only one song in the list are:
- Chamber pop
- Alternative/reggaeton/experimental
- Dreampop/Hip-Hop/R&B
- Disco-pop
- Dance-pop/Disco
- Hip-Hop/Trap
- Nu-disco
- Pop rap
- Pop/Soft Rock
- R&B/Hip-Hop alternative

In [17]:
one_song_genre = genre_count[genre_count == 1]

one_song_genre

genre
Chamber pop                           1
Alternative/reggaeton/experimental    1
Dreampop/Hip-Hop/R&B                  1
Disco-pop                             1
Dance-pop/Disco                       1
Hip-Hop/Trap                          1
Nu-disco                              1
Pop rap                               1
Pop/Soft Rock                         1
R&B/Hip-Hop alternative               1
Name: count, dtype: int64

In total, there are 16 unique genres:

In [18]:
unique_genres_count = len(genre_count)

unique_genres_count

16

When analyzing correlations, the Pearson correlation coefficient (r) measures the strength and direction of a linear relationship between two variables. The boundaries for interpreting Pearson's r are generally as follows:
- Strong positive correlation: r > 0.7
- Moderate positive correlation: 0.3 < r ≤ 0.7
- Weak positive correlation: 0 < r ≤ 0.3
- No correlation: r = 0
- Weak negative correlation: −0.3 ≤ r < 0
- Moderate negative correlation: −0.7 ≤ r < −0.3
- Strong negative correlation: r < −0.7
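
The general boundaries listed above can be written as a small lookup function. This is only an illustrative sketch, and `describe_correlation` is a hypothetical name. The two test values are coefficients reported later in this notebook.

```python
def describe_correlation(r: float) -> str:
    """Label Pearson's r using the general boundaries listed above."""
    if r > 0.7:
        return "strong positive"
    if r > 0.3:
        return "moderate positive"
    if r > 0:
        return "weak positive"
    if r == 0:
        return "none"
    if r >= -0.3:
        return "weak negative"
    if r >= -0.7:
        return "moderate negative"
    return "strong negative"


print(describe_correlation(0.79164))    # strong positive
print(describe_correlation(-0.682479))  # moderate negative
```

Under these general boundaries, a coefficient of -0.682479 counts as moderate rather than strong.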

For this dataset, I use slightly modified thresholds and focus on only three categories:
- Strong positive correlation: r > 0.7
- No correlation: −0.05 < r < 0.05
- Strong negative correlation: r < −0.6

I am analyzing the correlation of measurable variables (numeric columns). The correlation matrix is converted to a long format and duplicate pairs and self-correlations are filtered out.

First, one pair of song attributes is strongly positively correlated (r > 0.7): energy and loudness, with a correlation of 0.79164. This means that more energetic songs tend to be louder.

In [19]:
numeric_columns = ["energy",
                   "danceability",
                   "loudness",
                   "acousticness",
                   "speechiness",
                   "liveness",
                   "valence",
                   "tempo",
                   "duration_ms"]

correlation = top_tracks[numeric_columns].corr()

correlation_long = correlation.where(
    np.triu(np.ones(correlation.shape), k=1).astype(bool)
)

correlation_long = correlation_long.stack().reset_index()

correlation_long.columns = ['Feature 1', 'Feature 2', 'Correlation']

strong_positive_correlation = correlation_long[
    correlation_long['Correlation'] > 0.7
]

strong_positive_correlation

Unnamed: 0,Feature 1,Feature 2,Correlation
1,energy,loudness,0.79164


Next, one pair of song attributes is strongly negatively correlated (r < −0.6): energy and acousticness, with a correlation of −0.682479. A negative correlation is an inverse one: as one attribute goes up, the other tends to go down. Here, more energetic songs tend to be less acoustic, while acoustic songs tend to have lower energy levels, which indicates a potential trade-off between these attributes in music production.

In [20]:
strong_negative_correlation = correlation_long[correlation_long['Correlation'] < -0.6]

strong_negative_correlation

Unnamed: 0,Feature 1,Feature 2,Correlation
2,energy,acousticness,-0.682479


Lastly, there is no correlation (−0.05 < r < 0.05) between these attributes: 
- danceability and liveness (correlation: -0.006648)
- danceability and duration (correlation: -0.033763)
- loudness and speechiness (correlation: -0.021693)
- acousticness and duration (correlation: -0.010988)
- liveness and valence (correlation: -0.033366)
- liveness and tempo (correlation: 0.025457)
- valence and tempo (correlation: 0.045089)
- valence and duration (correlation: -0.039794)

This means that these attribute pairs show essentially no linear relationship with one another in this dataset.
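
One caveat when reading near-zero coefficients: Pearson's r captures only linear relationships. The toy sketch below uses made-up values (not from the dataset) in which one variable is fully determined by the other, yet r is zero.

```python
import numpy as np

# Hypothetical toy data: y is fully determined by x, but not linearly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# corrcoef returns the 2x2 correlation matrix; [0, 1] is r between x and y.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A scatter plot of any pair of interest would be a reasonable way to rule out this kind of pattern.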

In [21]:
no_correlation = correlation_long[(correlation_long['Correlation'] < 0.05)
                                  & (correlation_long['Correlation'] > -0.05)]

no_correlation

Unnamed: 0,Feature 1,Feature 2,Correlation
11,danceability,liveness,-0.006648
14,danceability,duration_ms,-0.033763
16,loudness,speechiness,-0.021693
25,acousticness,duration_ms,-0.010988
30,liveness,valence,-0.033366
31,liveness,tempo,0.025457
33,valence,tempo,0.045089
34,valence,duration_ms,-0.039794


Let's analyze the comparisons of different song attributes between various genres.

First, I compare danceability scores across the Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres. The average danceability score is highest for Hip-Hop/Rap (0.7655) and Dance/Electronic (0.7550). In contrast, the Alternative/Indie genre has the lowest average danceability score (0.6618), and Pop is also lower (0.6776). This suggests that tracks in the Hip-Hop/Rap and Dance/Electronic genres tend to be more rhythmically engaging than those in the Alternative/Indie genre.

In [38]:
genres_of_interest = ["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"]

filtered_tracks = top_tracks.loc[top_tracks["genre"].isin(genres_of_interest)]

print("Unique genres in filtered data:", filtered_tracks["genre"].unique())

danceability_comparison = filtered_tracks.groupby("genre", as_index=False,
                                                  observed=True)["danceability"].mean()

danceability_comparison.columns = ["genre", "average_danceability"]

danceability_comparison.sort_values(by="average_danceability", ascending=False, inplace=True)

danceability_comparison


Unique genres in filtered data: ['Alternative/Indie' 'Hip-Hop/Rap' 'Dance/Electronic' 'Pop']


Unnamed: 0,genre,average_danceability
2,Hip-Hop/Rap,0.765538
1,Dance/Electronic,0.755
3,Pop,0.677571
0,Alternative/Indie,0.66175
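
The four genres differ a lot in size (14, 13, 5, and 4 tracks according to the genre counts above), so it can help to report the track count next to each average. The sketch below uses only the first ten tracks displayed earlier, so its numbers differ from the full-data averages above. The column names `average_danceability` and `track_count` are my own.

```python
import pandas as pd

# Genre and danceability of the first ten tracks shown earlier in the notebook.
sample = pd.DataFrame(
    {
        "genre": [
            "R&B/Soul", "Alternative/Indie", "Hip-Hop/Rap", "Dance/Electronic",
            "Nu-disco", "Hip-Hop/Rap", "Pop", "Hip-Hop/Rap",
            "R&B/Hip-Hop alternative", "Alternative/Indie",
        ],
        "danceability": [
            0.514, 0.825, 0.896, 0.785, 0.793, 0.746, 0.548, 0.726, 0.784, 0.501,
        ],
    }
)

# Named aggregation gives the mean and the group size in one table.
summary = (
    sample.groupby("genre")["danceability"]
    .agg(average_danceability="mean", track_count="count")
    .sort_values("average_danceability", ascending=False)
)
print(summary)
```

An average over 4 tracks is much more sensitive to a single song than an average over 14, which is worth keeping in mind when comparing genres.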


Next, I compare loudness in dB across the Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres. The average loudness is highest for Dance/Electronic (-5.338) and Alternative/Indie (-5.421). The Hip-Hop/Rap genre (-6.918) and the Pop genre (-6.460) have lower average loudness. These findings indicate that Dance/Electronic tracks tend to be produced with higher loudness levels, which can enhance their impact in club and dance environments.

In [40]:
loudness_comparison = filtered_tracks.groupby("genre", as_index=False,
                                              observed=True)["loudness"].mean()

loudness_comparison.columns = ["genre", "average_loudness"]

loudness_comparison.sort_values(by="average_loudness", ascending=False, inplace=True)

loudness_comparison

Unnamed: 0,genre,average_loudness
1,Dance/Electronic,-5.338
0,Alternative/Indie,-5.421
3,Pop,-6.460357
2,Hip-Hop/Rap,-6.917846


Lastly, I compare acousticness scores across the Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres. The acousticness value describes how acoustic a song is. This attribute varies considerably among the genres of interest, suggesting that acousticness tends to differ across musical styles. The Alternative/Indie genre has the highest average acousticness score (0.5835), while the Dance/Electronic genre has the lowest (0.0994). In between are the Pop genre (0.3238) and the Hip-Hop/Rap genre (0.1887).

This variation indicates that Alternative/Indie music incorporates more acoustic elements, while Dance/Electronic tracks are more electronically produced, resulting in lower acousticness scores.

In [41]:
acousticness_comparison = filtered_tracks.groupby("genre", as_index=False,
                                                  observed=True)["acousticness"].mean()

acousticness_comparison.columns = ["genre", "average_acousticness"]

acousticness_comparison.sort_values(by="average_acousticness", ascending=False, inplace=True)

acousticness_comparison

Unnamed: 0,genre,average_acousticness
0,Alternative/Indie,0.5835
3,Pop,0.323843
2,Hip-Hop/Rap,0.188741
1,Dance/Electronic,0.09944
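
The three per-genre comparisons above follow the same pattern, so they could also be computed in one pass. This sketch again uses only the first ten tracks displayed earlier, so the resulting averages are not the full-data values reported above.

```python
import pandas as pd

# First ten tracks shown earlier in the notebook (genre plus three attributes).
sample = pd.DataFrame(
    {
        "genre": [
            "R&B/Soul", "Alternative/Indie", "Hip-Hop/Rap", "Dance/Electronic",
            "Nu-disco", "Hip-Hop/Rap", "Pop", "Hip-Hop/Rap",
            "R&B/Hip-Hop alternative", "Alternative/Indie",
        ],
        "danceability": [
            0.514, 0.825, 0.896, 0.785, 0.793, 0.746, 0.548, 0.726, 0.784, 0.501,
        ],
        "loudness": [
            -5.934, -6.401, -6.687, -5.457, -4.521,
            -7.956, -4.209, -8.765, -8.756, -5.679,
        ],
        "acousticness": [
            0.00146, 0.688, 0.104, 0.0149, 0.0123, 0.247, 0.122, 0.731, 0.123, 0.751,
        ],
    }
)

genres_of_interest = ["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"]
attributes = ["danceability", "loudness", "acousticness"]

# Filter to the four genres, then average all three attributes per genre.
genre_means = (
    sample[sample["genre"].isin(genres_of_interest)]
    .groupby("genre")[attributes]
    .mean()
)
print(genre_means.round(4))
```

A single table like this makes it easier to read the genres side by side than three separately sorted outputs.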


# IMPROVEMENTS OF ANALYSIS

Here are some suggestions for how this analysis could be improved and extended:
- Specific questions from the product manager and the senior data analyst would help direct further analysis.
- We could compare the top 50 tracks on Spotify year by year. This would show which song attributes are most common and how they change over time. Since most of the top 50 tracks in 2020 have a medium or high danceability score, we could investigate whether high danceability is unique to 2020 or is popular in other years as well.
- We could also analyze the popularity of different genres across several years.