## Data analysis of the Billboard 100 data

This document is part of the showcase, where I replicate the same brief and simple analyses with different tools.

This particular file focuses on data analysis (a few queries) of the Billboard 100 data from the TidyTuesday project.

The data can be found at <https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-09-14>. They consist of two files: *billboard.csv* contains information about the songs, focusing on their position in the top 100 list in different weeks; *audio_features.csv* contains specific attributes of the songs from Spotify.

For the specific analysis I will use **Python** and **Dask** (plus **Jupyter notebook**).

We start by importing the Dask distributed client and starting a local cluster with four workers:

In [1]:
from dask.distributed import Client
client = Client(n_workers=4)

Next, we load the *billboard* dataset with pandas and convert it to a Dask dataframe with four partitions:

In [2]:
import dask.dataframe as dd
import pandas as pd
billboard = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv')
billboard = dd.from_pandas(billboard, npartitions=4)

We can have a look at information about the dask dataframe:

In [3]:
billboard.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 10 entries, url to weeks_on_chart
dtypes: object(5), float64(1), int64(4)

and the summary statistics:

In [4]:
billboard.describe().compute()

| | week_position | instance | previous_week_position | peak_position | weeks_on_chart |
|---|---|---|---|---|---|
| count | 327895.0 | 327895.0 | 295941.0 | 327895.0 | 327895.0 |
| mean | 50.499309 | 1.072538 | 47.604066 | 41.358307 | 9.153793 |
| std | 28.865707 | 0.334188 | 28.056915 | 29.542497 | 7.590281 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 39.0 | 1.0 | 33.0 | 19.0 | 3.0 |
| 50% | 66.0 | 1.0 | 62.0 | 45.0 | 8.0 |
| 75% | 84.0 | 1.0 | 81.0 | 82.0 | 19.0 |
| max | 100.0 | 10.0 | 100.0 | 100.0 | 87.0 |


For the first main query, our aim is to select only the songs that have reached the No. 1 spot on the Billboard chart and see how many weeks they stayed on the chart in total (note that sorting does not work for Dask Series, so the result is not sorted by number of weeks):

In [5]:
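# Keep only the rows where the peak position is 1
top_songs = billboard[billboard['peak_position'] == 1]
# Drop columns that are not needed for this query
top_songs = top_songs.drop(['url', 'week_position', 'previous_week_position'], axis=1)
# Largest weeks_on_chart value per performer and song
top_songs = top_songs.groupby(['performer', 'song']).weeks_on_chart.max()
top_songs.compute().reset_index(name="max_weeks")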

| | performer | song | max_weeks |
|---|---|---|---|
| 0 | 50 Cent Featuring Olivia | Candy Shop | 23 |
| 1 | ? (Question Mark) & The Mysterians | 96 Tears | 15 |
| 2 | Ace Of Base | The Sign | 41 |
| 3 | Adele | Hello | 26 |
| 4 | Aerosmith | I Don't Want To Miss A Thing | 20 |
| ... | ... | ... | ... |
| 1119 | Walter Murphy & The Big Apple Band | A Fifth Of Beethoven | 28 |
| 1120 | Will Smith Featuring Dru Hill & Kool Mo Dee | Wild Wild West | 17 |
| 1121 | Wiz Khalifa | Black And Yellow | 25 |
| 1122 | XXXTENTACION | Sad! | 38 |
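
The object returned by `.compute()` is a regular pandas object, so it can be sorted at that point. The following is a minimal sketch in plain pandas on made-up rows (the performer and song names are hypothetical), repeating the steps of the query and then ordering the result by the number of weeks:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "performer": ["A", "A", "B", "B", "C"],
    "song": ["x", "x", "y", "y", "z"],
    "peak_position": [1, 1, 1, 1, 5],
    "weeks_on_chart": [3, 4, 10, 11, 2],
})

# Same steps as the query: filter on No. 1, then take the maximum per song.
number_ones = toy[toy["peak_position"] == 1]
max_weeks = (
    number_ones.groupby(["performer", "song"])["weeks_on_chart"]
    .max()
    .reset_index(name="max_weeks")
    # Sorting happens on the pandas result, longest chart run first.
    .sort_values("max_weeks", ascending=False)
)
print(max_weeks)
```

With the Dask version of the query, the same `sort_values` call could be chained after `.compute().reset_index(name="max_weeks")`.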


Now we load the *audio features* dataset:

In [6]:
audio_features = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv')
audio_features = dd.from_pandas(audio_features, npartitions=4)

And once again we look at the summary statistics:

In [7]:
audio_features.describe().compute()

| | spotify_track_duration_ms | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | spotify_track_popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 24397.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24334.0 | 24397.0 |
| mean | 220684.3 | 0.599945 | 0.618096 | 5.231651 | -8.664607 | 0.727172 | 0.073554 | 0.294635 | 0.032539 | 0.192098 | 0.601746 | 120.276066 | 3.931577 | 41.224413 |
| std | 67746.71 | 0.153133 | 0.199078 | 3.560211 | 3.601119 | 0.445422 | 0.083153 | 0.2823 | 0.136276 | 0.159073 | 0.238645 | 28.046937 | 0.320858 | 22.477405 |
| min | 29688.0 | 0.0 | 0.000581 | 0.0 | -28.03 | 0.0 | 0.0 | 3e-06 | 0.0 | 0.00967 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 177029.5 | 0.506 | 0.487 | 2.0 | -10.87775 | 0.0 | 0.0325 | 0.0518 | 0.0 | 0.0917 | 0.427 | 100.264 | 4.0 | 25.0 |
| 50% | 216026.5 | 0.613 | 0.643 | 5.0 | -7.981 | 1.0 | 0.0422 | 0.215 | 5e-06 | 0.134 | 0.628 | 119.947 | 4.0 | 45.0 |
| 75% | 254813.2 | 0.715 | 0.781 | 8.0 | -5.717 | 1.0 | 0.0724 | 0.539 | 0.00051 | 0.25575 | 0.806 | 137.5155 | 4.0 | 61.0 |
| max | 3079157.0 | 0.988 | 0.997 | 11.0 | 2.291 | 1.0 | 0.951 | 0.991 | 0.982 | 0.999 | 0.991 | 241.009 | 5.0 | 100.0 |
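
In the summary above, the `count` row is lower for the audio attributes (24334) than for `spotify_track_duration_ms` and `spotify_track_popularity` (24397), so some tracks lack those attributes. A minimal sketch of how such gaps could be counted, using plain pandas on made-up rows (the values are hypothetical; only the column names come from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only; the column names follow audio_features.csv.
toy_features = pd.DataFrame({
    "song_id": ["a", "b", "c", "d"],
    "danceability": [0.61, None, 0.72, None],
    "energy": [0.50, None, 0.81, 0.64],
    "spotify_track_popularity": [40, 55, 12, 70],
})

# Number of missing values in each column
missing_per_column = toy_features.isna().sum()
print(missing_per_column)

# Rows in which all of the listed audio attributes are missing
audio_columns = ["danceability", "energy"]
no_audio = toy_features[toy_features[audio_columns].isna().all(axis=1)]
print(no_audio["song_id"].tolist())
```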


For the second main query, our aim is to combine the peak position each song has reached on the Billboard chart with its main Spotify attributes. For this, we need to join the two datasets on `song_id`:

In [8]:
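# Note: .max() returns the largest peak_position value per song; as No. 1 is the
# best rank, .min() would give the best position reached.
merge_left = billboard.groupby(['performer', 'song', 'song_id']).peak_position.max().compute().reset_index(name="best_position")
# Keep the identifiers, genre, and audio attributes
merge_right = audio_features[['song_id', 'performer', 'song', 'spotify_genre', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
# Join on song_id; performer and song exist on both sides, hence the _x/_y suffixes
data_join = dd.merge(merge_left, merge_right, on="song_id")
data_join.head(10)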

| | performer_x | song_x | song_id | best_position | performer_y | song_y | spotify_genre | danceability | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | "Weird Al" Yankovic | Amish Paradise | Amish Paradise"Weird Al" Yankovic | 65 | "Weird Al" Yankovic | Amish Paradise | ['antiviral pop', 'comedy rock', 'comic', 'par... | 0.728 | 0.448 | 8.0 | -10.54 | 0.172 | 0.103 | 0.0 | 0.267 | 0.483 | 80.902 |
| 1 | "Weird Al" Yankovic | Canadian Idiot | Canadian Idiot"Weird Al" Yankovic | 82 | "Weird Al" Yankovic | Canadian Idiot | ['antiviral pop', 'comedy rock', 'comic', 'par... | 0.543 | 0.697 | 8.0 | -9.211 | 0.0612 | 0.00206 | 2e-06 | 0.343 | 0.861 | 185.978 |
| 2 | "Weird Al" Yankovic | Eat It | Eat It"Weird Al" Yankovic | 59 | "Weird Al" Yankovic | Eat It | ['antiviral pop', 'comedy rock', 'comic', 'par... | 0.767 | 0.811 | 7.0 | -8.548 | 0.0766 | 0.0866 | 0.0 | 0.0684 | 0.858 | 147.423 |
| 3 | 'N Sync | (God Must Have Spent) A Little More Time On You | (God Must Have Spent) A Little More Time On Yo... | 60 | 'N Sync | (God Must Have Spent) A Little More Time On You | [] | | | | | | | | | | |
| 4 | 'N Sync | Bye Bye Bye | Bye Bye Bye'N Sync | 42 | 'N Sync | Bye Bye Bye | [] | | | | | | | | | | |
| 5 | 'Til Tuesday | (Believed You Were) Lucky | (Believed You Were) Lucky'Til Tuesday | 98 | 'Til Tuesday | (Believed You Were) Lucky | ['boston rock', 'dance rock', 'new romantic', ... | 0.612 | 0.523 | 5.0 | -11.425 | 0.0321 | 0.448 | 2e-06 | 0.0727 | 0.495 | 124.315 |
| 6 | 'Til Tuesday | Coming Up Close | Coming Up Close'Til Tuesday | 90 | 'Til Tuesday | Coming Up Close | ['boston rock', 'dance rock', 'new romantic', ... | 0.368 | 0.473 | 8.0 | -14.943 | 0.0316 | 0.181 | 0.000251 | 0.0665 | 0.273 | 80.186 |
| 7 | 0 | Deacon Blues | Deacon Blues0 | 86 | 0 | Deacon Blues | ['album rock', 'art rock', 'blues rock', 'clas... | 0.751 | 0.572 | 0.0 | -12.324 | 0.0388 | 0.481 | 0.000532 | 0.105 | 0.58 | 115.839 |
| 8 | 1 Of The Girls | Do Da What | Do Da What1 Of The Girls | 91 | 1 Of The Girls | Do Da What | [] | | | | | | | | | | |
| 9 | 10,000 Maniacs | Because The Night | Because The Night10,000 Maniacs | 92 | 10,000 Maniacs | Because The Night | ['alternative metal', 'christian rock', 'nu me... | | | | | | | | | | |
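
Two details of the join may be worth a closer look. On a chart, lower numbers are better, so the best position a song reached is the minimum of `peak_position`. In addition, `performer` and `song` appear in both tables, which is what produces the `_x`/`_y` column pairs. A minimal sketch in plain pandas on made-up rows (all values are hypothetical, and dropping the duplicated columns before the merge is just one possible choice):

```python
import pandas as pd

# Made-up chart rows for illustration only.
chart = pd.DataFrame({
    "song_id": ["s1", "s1", "s1", "s2", "s2"],
    "performer": ["P1", "P1", "P1", "P2", "P2"],
    "song": ["One", "One", "One", "Two", "Two"],
    "peak_position": [65, 58, 53, 90, 90],
})
# Made-up audio attributes; None mimics a track without Spotify data.
features = pd.DataFrame({
    "song_id": ["s1", "s2"],
    "performer": ["P1", "P2"],
    "song": ["One", "Two"],
    "danceability": [0.7, None],
})

# Best position per song: the smallest rank number.
best = (
    chart.groupby(["performer", "song", "song_id"])["peak_position"]
    .min()
    .reset_index(name="best_position")
)

# Drop the columns already present on the left side so that no
# _x/_y suffixes are created, then join on song_id (inner join).
joined = best.merge(
    features.drop(columns=["performer", "song"]),
    on="song_id",
    how="inner",
)
print(joined)
```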
