# Case Study: What are the characteristics of modern popular songs?

### Link to the Kaggle dataset: https://www.kaggle.com/datasets/yelexa/spotify200

We import the Python libraries that we're going to use for this analysis. NumPy and pandas are useful to process and transform the data. Matplotlib, seaborn and Plotly come in handy to build visualizations from the transformed data, and kagglehub downloads the dataset.

In [None]:
!git clone https://github.com/theophile-bb/Spotify-ranking-an-analysis.git

%cd Spotify-ranking-an-analysis

In [None]:
%pip install -r requirements.txt

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import kagglehub
import os

In [None]:
dataset_path = kagglehub.dataset_download("yelexa/spotify200")

print("Downloaded files:", os.listdir(dataset_path))

csv_files = [f for f in os.listdir(dataset_path) if f.endswith(".csv")]

if csv_files:
    csv_path = os.path.join(dataset_path, csv_files[0])

    df = pd.read_csv(csv_path, low_memory=False)
else:
    print("No CSV file found in the dataset folder.")

The dataset we're using for this study is the 'Spotify Weekly Top 200 Songs Streaming Data' dataset from Kaggle. It gathers data about the top Spotify streams over a 75-week time span.

In [None]:
df

## Data cleaning

The data has to be cleaned before it can be used in our study. We start off by removing the columns we won't need:

In [None]:
df2 = df.drop(columns = ["Unnamed: 0","uri","artist_img", "album_cover", "collab", "pivot","artist_id","album_num_tracks"])
df2

We then have to change the types of our columns. Here all the columns have the type 'object', so we convert each one to the type that best fits its data:

In [None]:
df2.dtypes

In [None]:
# Drop rows that are stray copies of the header (their 'rank' cell holds the string 'rank')
df2 = df2[df2['rank'] != 'rank'].reset_index(drop=True)
df2
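
The filter above presumably targets header lines repeated inside the CSV, which would also explain why every column is read as text. A minimal sketch on made-up data (the values are hypothetical, not taken from the dataset) shows the effect:

```python
import io
import pandas as pd

# Hypothetical CSV in which the header line is repeated in the middle of the data
raw = "rank,streams\n1,100\n2,90\nrank,streams\n1,80\n"
toy = pd.read_csv(io.StringIO(raw))

# The stray header prevents pandas from inferring a numeric dtype
print(toy.dtypes)

# Keep only real data rows, rebuild a clean index, then convert the types
clean = toy[toy["rank"] != "rank"].reset_index(drop=True)
clean = clean.astype({"rank": int, "streams": int})
print(clean)
```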

In [None]:
def fix_dates(x):
    if len(str(x)) == 4:  # If the date is only a year like '2005'
        return f"{x}-01-01"
    return x

df2['release_date'] = df2['release_date'].apply(fix_dates)
df2['release_date'] = pd.to_datetime(df2['release_date'], errors='coerce')

column_types = {
    'rank': int,
    'artists_num': float,
    'peak_rank': int,
    'previous_rank': int,
    'weeks_on_chart': int,
    'streams': int,
    'mode': float,
    'danceability': float,
    'energy': float,
    'key': float,
    'loudness': float,
    'speechiness': float,
    'acousticness': float,
    'instrumentalness': float,
    'liveness': float,
    'valence': float,
    'tempo': float,
    'duration': float
}

df2 = df2.astype(column_types)

df2['week'] = pd.to_datetime(df2['week'])
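
`astype` raises an error as soon as one value cannot be parsed, and plain `int` columns cannot hold missing values. If that ever happens with this data, one possible alternative (not what this notebook does) is `pd.to_numeric` with `errors='coerce'`, shown here on a made-up column:

```python
import pandas as pd

# Hypothetical column with one unparseable entry
toy = pd.DataFrame({"streams": ["100", "250", "n/a"]})

# Unparseable strings become NaN instead of raising a ValueError
toy["streams"] = pd.to_numeric(toy["streams"], errors="coerce")

# The nullable 'Int64' dtype keeps integers while still allowing missing values
toy["streams"] = toy["streams"].astype("Int64")
print(toy)
```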

In [None]:
df2.dtypes

After changing the types, we check whether our columns contain 'NaN' values. We drop the rows without a valid release date and fill the remaining NaN values of the audio features with the mean of their column.

In [None]:
# Share of missing values in each column (0 = none, 1 = all)
isnull = df2.isnull().sum()/len(df2)

print("Fraction of NaN values per column:\n", isnull)

In [None]:
mean_values = {
    'danceability': df2['danceability'].mean(),
    'energy': df2['energy'].mean(),
    'key': df2['key'].mean(),
    'mode': df2['mode'].mean(),
    'loudness': df2['loudness'].mean(),
    'speechiness': df2['speechiness'].mean(),
    'acousticness': df2['acousticness'].mean(),
    'instrumentalness': df2['instrumentalness'].mean(),
    'liveness': df2['liveness'].mean(),
    'valence': df2['valence'].mean(),
    'tempo': df2['tempo'].mean(),
    'duration': df2['duration'].mean()
}

df2 = df2.dropna(subset=['release_date'])

df2 = df2.fillna(value=mean_values)
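
Mean imputation is reasonable for continuous features such as `tempo`, but `key` and `mode` are integer-coded categories in Spotify's audio features, so their mean is generally not a valid category. A possible refinement, sketched on made-up values (it is not the approach used in the rest of this notebook), fills those two columns with their most frequent value instead:

```python
import numpy as np
import pandas as pd

# Hypothetical audio features with gaps
toy = pd.DataFrame({
    "tempo": [120.0, np.nan, 100.0],
    "key": [5.0, 5.0, np.nan],
    "mode": [1.0, np.nan, 1.0],
})

continuous = ["tempo"]
categorical = ["key", "mode"]

# Continuous features: fill with the column mean
toy[continuous] = toy[continuous].fillna(toy[continuous].mean())

# Category-like features: fill with the most frequent value of each column
toy[categorical] = toy[categorical].fillna(toy[categorical].mode().iloc[0])
print(toy)
```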

In [None]:
df2

We notice that some rows have an aberrant 'artist_genre' value. For example, here are all the rows whose artist_genre is equal to '0'.

In [None]:
df2[df2['artist_genre']=='0']

In all these rows, we replace the artist_genre '0' with the value 'other'.

In [None]:
df2.loc[:, 'artist_genre'] = df2['artist_genre'].replace('0', 'other')

In [None]:
df2[df2['artist_genre']=='other']

## Data visualizations

#### In this part we are going to make many visualizations based on our data to highlight its specificities.

### Genre distribution

First, here is a quick overview of the genre distribution of the tracks in the dataset, restricted to the 15 most frequent genres. We can see that it is mostly pop, rap and Latin music such as reggaeton.

In [None]:
genre_distribution = df2['artist_genre'].value_counts().head(15)

fig = px.pie(genre_distribution, values=genre_distribution, names=genre_distribution.index, title='Genre Distribution in Weekly Top Songs')
fig.show()

### Streams by region of the world

Let's look at the total streams by region. The 'Global' chart is way ahead in terms of streams. It is not a geographic region: it ranks songs at the scale of the world, so it gathers the most famous artists. Its lead is also due to the time ranges not being the same: the 'Global' chart covers the weeks from 2016/12/29 to 2022/07/14, whereas the other charts only cover the weeks from 2021/02/04 to 2022/07/14. The first actual region when it comes to streams is Europe, followed by America.

In [None]:
region_popularity = df2.groupby('region')['streams'].sum().sort_values(ascending=True)

fig = px.bar(x=region_popularity.index, y=region_popularity.values,
             title='Streams Count of Weekly Top Songs by Region',
             labels={'x': 'Region', 'y': 'Streams Count'})
fig.show()

In [None]:
region_distribution = region_popularity

fig = px.pie(region_distribution, values=region_distribution, names=region_distribution.index, title='Popularity in number of streams around the regions of the world')
fig.show()

### How many songs does an average user listen to?

The data we use is the Spotify data from 2021/02/04 to 2022/07/14, which is 75 weeks in total. We use all the streams over this time span, excluding the Global chart, to estimate how many songs from this Spotify top an average user listens to.

In [None]:
nb_streams = df2[~df2["country"].isin(["Global", "country"])]
nb_streams = nb_streams["streams"].sum()
print('Total number of streams over the period: ', nb_streams)

According to this Statista study, there were around 500 million active Spotify users in the world in 2023:
https://fr.statista.com/statistiques/574665/spotify-nombre-d-utilisateurs-actifs-dans-le-monde/
From this, we can estimate how many 'popular' tracks from the Spotify top an average user listens to.

In [None]:
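# Average number of chart streams per user over the whole period
streams_per_user = nb_streams / 500_000_000
print(streams_per_user)

An average Spotify user listens to (or at least plays) about 845 songs from this top over the 75 weeks, repeat plays included.

In [None]:
weekly = streams_per_user / 75
weekly

That makes around 11 songs from this top played by an average Spotify user each week.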
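
The estimate boils down to two divisions. The sketch below wraps them in a hypothetical helper, `streams_per_user_per_week` (not part of the notebook), and checks the last step with the rounded figures quoted above: 845 plays per user over 75 weeks.

```python
def streams_per_user_per_week(total_streams, users, n_weeks):
    """Average number of chart streams per user and per week."""
    return total_streams / users / n_weeks

# Using the rounded per-user figure from the text (845 plays for one user over 75 weeks)
print(round(streams_per_user_per_week(845, 1, 75), 2))  # 11.27
```

Since the user count is a 2023 figure while the streams cover 2021 to 2022, the result is best read as an order of magnitude.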

### Streams by country

After looking at the streams by region, we can go into more detail and focus on the number of streams by country. The United States clearly ranks first.

In [None]:
country_popularity = df2.groupby('country')['streams'].sum().sort_values(ascending=False)

# Remove the 'Global' chart by name (rather than by position), then keep the top 20 countries
top_20_countries2 = country_popularity.drop('Global', errors='ignore').head(20)

fig = px.bar(x=top_20_countries2.index,
             y=top_20_countries2.values,
             title='Streams count of Weekly Top Songs by country (Top 20)',
             labels={'x': 'Country', 'y': 'Streams Count'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()

The problem with our dataset is that the first value of the country ranking is 'Global', which refers to the worldwide chart of world-famous artists. We had to remove this value for the rest of our study because it doesn't represent any country and is far greater than the second entry.

In [None]:
country_popularity = df2.groupby('country')['streams'].sum().sort_values(ascending=False)

# Turn the Series into a two-column DataFrame: 'country' and 'streams'
pop_df = country_popularity.reset_index()
pop_df = pop_df[~pop_df["country"].isin(["Global", "country"])]
pop_df

Here is a visualization of these numbers on a world map:

In [None]:
fig = px.scatter_geo(pop_df, locations="country", locationmode="country names",
                     color="streams", size="streams",
                     projection="natural earth",
                     hover_name="country", scope="world")

fig.show()

### Top Global artists based on stream numbers

Next, we look at which artists were streamed the most in the Global chart.

In [None]:
dfGlobal = df2[df2["country"]=='Global']

In [None]:
top_artists = dfGlobal.groupby('artist_individual')['streams'].sum().sort_values(ascending=False).head(20)

fig = px.bar(x=top_artists.index, y=top_artists.values,
             title='Top 20 Artists in Weekly Top Songs',
             labels={'x': 'Artists', 'y': 'Streams'},
             color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

### Top artists based on streams and on featurings

We plot a pie chart of the number of artists credited on each song. A bit more than a third of the songs are made by a single artist, and the other two thirds are songs with featurings.

In [None]:
df_artist_num = dfGlobal['artists_num'].value_counts().head(10)
fig = px.pie(df_artist_num, values=df_artist_num.values, names=df_artist_num.index, title='Distribution of the number of artists per song in the Global Weekly Top Songs')
fig.show()

We first plot the top 20 artists that have the most streams on songs with featurings.

In [None]:
df3 = dfGlobal[dfGlobal["artists_num"]> 1.0]

In [None]:
top_artists_feat = df3.groupby('artist_individual')['streams'].sum().sort_values(ascending=False).head(20)

fig = px.bar(x=top_artists_feat.index,
             y=top_artists_feat.values,
             title='Top 20 Artists in Weekly Top Songs for featurings',
             labels={'x': 'Artists', 'y': 'Streams'})
fig.show()

We then plot the top 20 artists that have the most streams on solo songs this time.

In [None]:
df4 = dfGlobal[dfGlobal["artists_num"]== 1.0]

In [None]:
top_artists_solo = df4.groupby('artist_individual')['streams'].sum().sort_values(ascending=False).head(20)

fig = px.bar(x=top_artists_solo.index,
             y=top_artists_solo.values,
             title='Top 20 Artists in Weekly Top Songs for solo songs',
             labels={'x': 'Artists', 'y': 'Streams'})
fig.show()

And finally here is a comparison of these two plots :

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x=top_artists_solo.index,
                y=top_artists_solo.values,
                name='solo',
                marker_color='rgb(55, 83, 109)'
                ))
fig.add_trace(go.Bar(x=top_artists_feat.index,
                y=top_artists_feat.values,
                name='feat',
                marker_color='lightsalmon'
                ))
fig.update_layout(barmode='group', xaxis_tickangle=-45)

fig.show()

Few artists are in the top 20 for both solo and featuring songs. Ed Sheeran and The Weeknd have many more streams on their solo songs than on their featurings. On the other hand, Dua Lipa, Post Malone and Drake have more streams on their featuring songs. The last artist is Bad Bunny, who also has far more streams for his featuring songs than for his solo songs (twice as many).

We can also visualize this comparison between solo and featuring songs among the top 20 artists by streams.

In [None]:
solo_streams_all = dfGlobal[dfGlobal["artists_num"] == 1.0].groupby('artist_individual')['streams'].sum().to_frame(name='solo_streams')
feat_streams_all = dfGlobal[dfGlobal["artists_num"] > 1.0].groupby('artist_individual')['streams'].sum().to_frame(name='feat_streams')

df5 = pd.DataFrame(index=top_artists.index)

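# Left merges keep all top artists; a missing solo or featuring total becomes NaN, then 0
df5 = df5.merge(solo_streams_all, left_index=True, right_index=True, how='left')
df5 = df5.merge(feat_streams_all, left_index=True, right_index=True, how='left')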
df5 = df5.fillna(0)
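
The `how` argument of `merge` decides what happens to an artist missing from one of the two tables: `inner` drops the row, while `left` keeps it with a NaN that `fillna(0)` can then replace. A small sketch with made-up artists and numbers:

```python
import pandas as pd

# Hypothetical totals; artist 'B' has no featuring song in the chart
idx = pd.Index(["A", "B"], name="artist_individual")
top = pd.DataFrame(index=idx)
solo = pd.DataFrame({"solo_streams": [10, 20]}, index=idx)
feat = pd.DataFrame({"feat_streams": [5]},
                    index=pd.Index(["A"], name="artist_individual"))

with_solo = top.merge(solo, left_index=True, right_index=True, how="left")

# 'left' keeps both artists, 'inner' silently drops artist B
left = with_solo.merge(feat, left_index=True, right_index=True, how="left").fillna(0)
inner = with_solo.merge(feat, left_index=True, right_index=True, how="inner")

print(left)
print(inner)
```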

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x=df5.index,
                y=df5['solo_streams'],
                name='solo',
                marker_color='rgb(55, 83, 109)'
                ))
fig.add_trace(go.Bar(x=df5.index,
                y= df5['feat_streams'],
                name='feat',
                marker_color='lightsalmon'
                ))
fig.update_layout(barmode='group', xaxis_tickangle=-45)

fig.show()

We can see familiar names in this plot such as Bad Bunny, Ed Sheeran, Justin Bieber or Drake.

Out of the 20 most streamed artists:

*   10/20 have a majority of streams on their solo songs
*   10/20 have a majority of streams on their featuring songs

The top 20 artists hold up pretty well with only their solo songs.

The gap comes from the top 5 artists. Among them:

*   4/5 have a majority of streams on their featuring songs
*   Only 1/5 has a majority of streams on their solo songs

These big names, Bad Bunny especially, clearly weigh much more in the numbers.


Let's compare the total number of streams of the two groups:

In [None]:
featstreams = df3['streams'].sum()
featstreams

In [None]:
solostreams = df4['streams'].sum()
solostreams

We can clearly see that the songs with featurings generate twice as many streams as the solo songs. This suggests that featuring songs are much more popular than solo songs. It is one of the criteria that can explain the popularity of a song. A likely explanation is that a song featuring multiple artists appeals to the fan bases of all the artists involved, and therefore generates many more streams.
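
One point worth checking when reading these totals: if the dataset holds one row per credited artist (an assumption suggested by the `artist_individual` column, to be verified against the data), a song with several artists contributes its streams several times to `df3['streams'].sum()`. A possible check, sketched on made-up rows with a hypothetical `track` column standing in for whatever uniquely identifies a song, keeps one row per track and week before summing:

```python
import pandas as pd

# Hypothetical rows: one line per (track, week, credited artist)
toy = pd.DataFrame({
    "track": ["duet", "duet", "solo_song"],
    "week": ["2022-01-06"] * 3,
    "artist_individual": ["A", "B", "C"],
    "artists_num": [2.0, 2.0, 1.0],
    "streams": [100, 100, 80],
})

# Naive total: the duet is counted once per credited artist
naive_feat = toy.loc[toy["artists_num"] > 1, "streams"].sum()

# Keep a single row per track and week, then split solo vs. featuring songs
per_track = toy.drop_duplicates(subset=["track", "week"])
feat_total = per_track.loc[per_track["artists_num"] > 1, "streams"].sum()
solo_total = per_track.loc[per_track["artists_num"] == 1, "streams"].sum()

print(naive_feat, feat_total, solo_total)  # 200 100 80
```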

### Weeks on chart

We'll now look at the artists that stay in the chart the longest, based on the number of weeks.

In [None]:
distinct_weeks = dfGlobal.groupby('artist_individual')['week'].nunique().sort_values(ascending=False).head(20)

fig = px.bar(x=distinct_weeks.index,
             y=distinct_weeks.values,
             title='Artists with the most distinct weeks in the Top',
             labels={'x': 'Artists', 'y': 'Number of distinct weeks in the Top'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()

We can clearly see that three artists compete for the highest number of weeks in the top: Ed Sheeran, Imagine Dragons and Drake.

Now let's see the maximum number of consecutive weeks in the chart:

In [None]:
weeks = dfGlobal.groupby('artist_individual')['weeks_on_chart'].max().sort_values(ascending=False).head(40)

fig = px.bar(x=weeks.index,
             y=weeks.values,
             title='Artists with the most consecutive weeks in the Top',
             labels={'x': 'Artists', 'y': 'Max number of weeks in the Top'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()

We find Ed Sheeran and Imagine Dragons at the top again, with respectively 285 and 284 consecutive weeks in the chart. That's approximately five and a half years, which is quite impressive.
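
Whether `weeks_on_chart` counts strictly consecutive weeks is not verified here. As a cross-check, the longest uninterrupted run can be recomputed from the `week` dates themselves. A minimal sketch, assuming weekly charts exactly 7 days apart; `longest_weekly_streak` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def longest_weekly_streak(weeks):
    """Length of the longest run of chart dates spaced exactly 7 days apart."""
    dates = pd.Series(pd.to_datetime(weeks)).drop_duplicates().sort_values()
    if dates.empty:
        return 0
    # A new run starts whenever the gap to the previous date is not one week
    new_run = dates.diff() != pd.Timedelta(days=7)
    run_id = new_run.cumsum()
    return int(run_id.value_counts().max())

# Hypothetical chart dates: three weeks in a row, a gap, then two weeks
toy_weeks = ["2022-01-06", "2022-01-13", "2022-01-20", "2022-02-10", "2022-02-17"]
print(longest_weekly_streak(toy_weeks))  # 3
```

Applied per artist, for instance with `dfGlobal.groupby('artist_individual')['week'].apply(longest_weekly_streak)`, this would give the longest unbroken presence of each artist in the Global chart, all songs combined.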

## Conclusion

### Genre Distribution
Popular songs predominantly fall into the **Pop, Rap, and Latin music** categories, such as Reggaeton.

### Regional and Country-Specific Popularity
*   **Global artists** lead in overall streams, though this is influenced by a longer data temporal range for 'Global' charts compared to country-specific data.
*   Among specific regions, **Europe** and **America** show the highest stream counts.
*   At the country level, the **United States** has the most streams, followed by **Brazil** and **Mexico**, highlighting the strong presence of these markets in global music consumption.

### Average User Listening Habits
Based on the dataset, an average Spotify user listens to approximately **845 popular songs** (from the weekly top charts) over a 75-week period, which translates to roughly **11 top songs per week**.

### Solo vs. Featuring Songs
*   A significant trend observed is the prevalence of collaborations: approximately **two-thirds of popular songs feature multiple artists**.
*   **Songs with featurings generate nearly twice as many streams** as solo songs, suggesting that collaborations significantly boost a song's reach by appealing to combined fanbases.
*   Notable artists like **Ed Sheeran** and **The Weeknd** gain more streams from their solo work, while artists such as **Dua Lipa, Post Malone, Drake**, and especially **Bad Bunny**, achieve greater stream counts through their featuring songs.

### Significant Artists and Longevity
*   **Bad Bunny** stands out as a dominant artist, leading significantly in overall streams, largely due to his numerous successful collaborations.
*   On the other hand, **Ed Sheeran** appears 4th in the overall streams ranking, but he is first in the solo songs streams category. He is also the most consistent artist in the top, placing first among the 40 longest chart runs of the Global chart with 285 consecutive weeks (5.5 years), a ranking where Bad Bunny doesn't even appear.

The explanation for these observations may be that Ed Sheeran had a very popular song that stayed in the top for a very long time (possibly 'Shape of You') and concentrated most of his streams. Bad Bunny gathered many more streams, possibly thanks to the release of an album with plenty of featuring songs. These songs didn't make it into the 40 longest chart runs but still gathered many streams that, once added up, make him the most streamed artist over this period.


***
In conclusion, modern popular songs are often characterized by a strong presence of Pop, Rap, and Latin genres, benefit significantly from artist collaborations, and find their largest audiences in North and South America and Europe. Artists like Bad Bunny and Ed Sheeran enjoy lasting popularity, through frequent high-impact features and consistent solo work, respectively.