<h4>This dataset lists the 100 most-streamed songs of all time on Spotify</h4>
(This analysis dates from March 8th 2023)  


Dataset source: https://www.kaggle.com/datasets/amaanansari09/most-streamed-songs-all-time




<h4>Features description:</h4>

- <b>duration:</b> Duration of the song (minutes).


- <b>energy:</b> A perceptual measure of intensity and activity.


- <b>key:</b> The musical key of the track as an integer pitch class: 0 = C, 1 = C#, ..., 11 = B.


- <b>loudness:</b> The overall loudness of a track in decibels (dB).


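- <b>mode:</b> 0 = Major Key; 1 = Minor Key (the mapping assumed throughout this analysis). Note that Spotify's Web API documents the opposite encoding (1 = major, 0 = minor), so this should be verified against the dataset source.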


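- <b>speechiness:</b> Presence of spoken words in a track, from 0.0 to 1.0; values closer to 1.0 indicate more speech-like recordings (definition per Spotify's audio-features documentation, as are the four below).


- <b>acousticness:</b> A confidence measure, from 0.0 to 1.0, of whether the track is acoustic.


- <b>instrumentalness:</b> Predicts whether a track contains no vocals; values closer to 1.0 indicate a higher likelihood of no vocal content.


- <b>liveness:</b> Detects the presence of an audience in the recording; higher values indicate a higher probability that the track was performed live.


- <b>valence:</b> A measure, from 0.0 to 1.0, of the musical positiveness of a track; high values sound more positive (happy, cheerful), low values more negative (sad, angry).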


- <b>tempo:</b> Estimated tempo of the song in beats per minute (BPM).


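- <b>danceability:</b> How suitable a track is for dancing, from 0.0 to 1.0, based on tempo, rhythm stability, beat strength, and overall regularity (definition per Spotify's audio-features documentation).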
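
As a rough illustration of how the `key` and `mode` columns combine into a conventional key name, the sketch below maps the integer pitch class to a note name and appends "m" for minor. The helper `key_label`, the `PITCH_CLASSES` list, and the `minor_value` parameter are hypothetical names introduced here, not part of the dataset. The default `minor_value=1` follows the mapping stated in this notebook and would need to be flipped if the data turns out to encode minor as 0.

```python
# Note names for key = 0..11 (0 = C, 1 = C#, ..., 11 = B).
# The choice between sharp and flat spellings is a convention, not dataset information.
PITCH_CLASSES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "G#", "A", "Bb", "B"]


def key_label(key: int, mode: int, minor_value: int = 1) -> str:
    """Return a label such as 'D' or 'Dm' for a (key, mode) pair.

    minor_value is the integer that marks a minor key; 1 is an assumption
    taken from this notebook's feature description.
    """
    if not 0 <= key <= 11:
        raise ValueError(f"key must be in 0..11, got {key}")
    suffix = "m" if mode == minor_value else ""
    return PITCH_CLASSES[key] + suffix


print(key_label(2, 1))  # minor under the notebook's mapping
print(key_label(2, 0))  # major under the notebook's mapping
```

Keeping the minor marker as a parameter means the labels can be regenerated in one place if the encoding is found to be the other way around.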






<h3>Objectives and Key Insights</h3>

<h4> Answers to be found:</h4>

1. Which artists have the most songs in the list?

2. Which artists have the most streams in the list?

3. Is there a pattern in terms of "key" and "mode" for these top songs?

4. Do these songs gravitate around an "optimal duration"?

5. Are there any correlations between "energy" and "tempo"?




<h4> Insights that cannot be derived from this dataset (but that I am particularly interested in):</h4>

1. What are the chord progressions of each song? Is there a pattern among these top streamed ones?

2. What is the primary language the song is sung in? And how many words are there?

3. What song structure does each song use? (e.g. Intro, Verse, Bridge, Pre-Chorus, Chorus, Outro, etc.)


In [373]:
import os
import pandas as pd
import numpy as np 
import seaborn as sb
import matplotlib.pyplot as plt
import plotly.express as px
from datetime import date
from dateutil.parser import parse


pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 10000)

In [None]:
# Plain strings: no f-string prefix is needed because nothing is interpolated
features_file_path = "/Users/sandrolobao/Desktop/Python Projects/Streamed Songs Spotify/streamed songs spotify/Features.csv"

streams_file_path = "/Users/sandrolobao/Desktop/Python Projects/Streamed Songs Spotify/streamed songs spotify/Streams.csv"

In [None]:
df_features = pd.read_csv(features_file_path)
df_streams = pd.read_csv(streams_file_path)

df_merge = pd.merge(left=df_streams, left_on="Song", right=df_features, right_on="name")

df_merge = df_merge[["Song", "Artist", "Streams (Billions)", "Release Date", "duration", "energy", "key", "mode", "loudness", "tempo"]]

df_merge.head()

In [None]:
# Use pandas datetimes so the subtraction yields a timedelta64 column (required by the .dt accessor later)
todays_date = pd.Timestamp(date(2023, 3, 8))
df_merge.insert(4, "Today", todays_date)
df_merge['Release Date'] = df_merge['Release Date'].astype('string')
df_merge['Release Date'] = pd.to_datetime(df_merge['Release Date'].apply(parse))
df_merge['Aging in Days'] = df_merge['Today'] - df_merge['Release Date']
df_merge.head()

In [None]:
df_merge['Aging in Days'] = df_merge['Aging in Days'].dt.total_seconds() / (60*60*24)
df_merge.head()

In [None]:
sb.scatterplot(y='Streams (Billions)', x='Aging in Days', data=df_merge)

In [None]:
sb.scatterplot(y='Streams (Billions)', x='duration', data=df_merge)

In [None]:
df_merge.nunique()

In [None]:
# Artists with the most songs in the list
artistOccurrences = df_merge['Artist'].value_counts()
artistOccurrences

In [None]:
barplot_1 = sb.barplot(x=artistOccurrences.index[:15], y=artistOccurrences.values[:15])

barplot_1.set_xticklabels(barplot_1.get_xticklabels(), rotation=45, ha='right')

sb.despine()

In [None]:
# Artists with the most total streams in the list
streamSum = df_merge.groupby('Artist')['Streams (Billions)'].sum()
streamSum = streamSum.sort_values(ascending=False)
streamSum

TODO: Address the issue of "featuring" artists (e.g. "Post Malone featuring 21 Savage" is counted as a different artist from "Post Malone"). Perhaps I should consider only the main artist...
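
One possible way to approach this TODO is to keep only the text before the first "featuring"-style marker and group by that. The sketch below is an assumption-laden illustration: the helper `main_artist`, the `Main Artist` column, the list of markers, and the sample rows (including the stream values) are all made up for the example and do not come from the dataset.

```python
import re

import pandas as pd

# Markers that are assumed to introduce a guest artist; extend as needed.
FEATURE_MARKER = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+", flags=re.IGNORECASE)


def main_artist(credit: str) -> str:
    """Return the part of an artist credit before the first guest marker."""
    return FEATURE_MARKER.split(credit, maxsplit=1)[0].strip()


# Invented rows that mimic the "Artist" and "Streams (Billions)" columns.
sample = pd.DataFrame({
    "Artist": ["Post Malone", "Post Malone featuring 21 Savage", "Some Artist feat. Guest"],
    "Streams (Billions)": [1.0, 2.0, 0.5],
})

sample["Main Artist"] = sample["Artist"].map(main_artist)

# Total streams per main artist, highest first.
streams_by_main = (
    sample.groupby("Main Artist")["Streams (Billions)"]
    .sum()
    .sort_values(ascending=False)
)
print(streams_by_main)
```

Separators such as "&" or "and" are deliberately left out, since they also appear inside the names of single acts; whether to split on them is a judgment call.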

TODO: I would like to see what the average streams per day look like for each artist...
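
A minimal sketch of this second TODO, assuming `Aging in Days` has already been converted to a plain number of days: divide each song's streams by its age, then average per artist. The sample rows and the `Streams per Day (Millions)` column name are invented for illustration.

```python
import pandas as pd

# Invented rows that mimic the relevant columns of df_merge.
sample = pd.DataFrame({
    "Artist": ["Artist A", "Artist A", "Artist B"],
    "Streams (Billions)": [3.0, 1.0, 2.0],
    "Aging in Days": [1000.0, 500.0, 2000.0],
})

# Per-song daily rate, converted from billions to millions for readability.
sample["Streams per Day (Millions)"] = (
    sample["Streams (Billions)"] * 1000 / sample["Aging in Days"]
)

# Average daily rate across each artist's songs, highest first.
avg_per_artist = (
    sample.groupby("Artist")["Streams per Day (Millions)"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_per_artist)
```

Averaging per-song rates treats every song equally; dividing an artist's total streams by the total age of their songs would weight older songs more heavily, so the two definitions can rank artists differently.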

In [None]:
# x and y are passed as arrays, so no data_frame argument is needed
fig = px.bar(x=streamSum.index[:15], y=streamSum.values[:15])
fig.update_layout(title='Streams per Artist in Billions', xaxis_title='Artist', yaxis_title='Streams (Billions)')
fig.show()

In [None]:
key_mode_count = df_merge.groupby(['key', 'mode'])['Artist'].size()
key_mode_count.sort_values(ascending=False)

1. There are no songs in D (key=2 and mode=0), whereas there are 8 songs in Dm (key=2 and mode=1).
2. Songs in Eb, Ebm, F, and G# are at the bottom of the list.
3. Minor keys take the top three spots by number of songs: C#m, Dm, and Gm.
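
One caveat when reading these counts: `groupby(...).size()` only lists key/mode combinations that actually occur, so a pair with no songs is absent from the output rather than shown as zero. The sketch below, run on invented rows rather than the real data, shows one way to build a full 12 x 2 table in which empty combinations appear explicitly.

```python
import pandas as pd

# Invented rows that mimic the "key" and "mode" columns of df_merge.
sample = pd.DataFrame({
    "key": [2, 2, 1, 7, 0],
    "mode": [1, 1, 1, 1, 0],
})

# Count each (key, mode) pair, pivot mode into columns, then force all
# 12 keys and both modes to be present, filling missing pairs with 0.
counts = (
    sample.groupby(["key", "mode"])
    .size()
    .unstack(fill_value=0)
    .reindex(index=range(12), columns=[0, 1], fill_value=0)
)
print(counts)
```

With the table in this shape, a missing combination such as key=2, mode=0 shows up as a 0 in its own cell instead of having to be inferred from its absence.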