## Data analysis of the Billboard 100 data

This document is part of a showcase in which I replicate the same brief, simple analyses with different tools.

This particular file focuses on data analysis (a few queries) of the Billboard 100 data from the TidyTuesday project.

The data can be found at <https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-09-14>. It consists of two files: *billboard.csv* contains information about the songs, focusing on their position in the Top 100 list in different weeks; *audio_features.csv* contains Spotify attributes of the songs.

For this analysis I use **Python** and **Pandas** (in a **Jupyter notebook**).

We start by loading the packages:

In [1]:
import pandas as pd

and the *billboard* dataset:

In [2]:
billboard = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv')

We can have a look at the schema of the billboard data:

In [3]:
billboard.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327895 entries, 0 to 327894
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   url                     327895 non-null  object 
 1   week_id                 327895 non-null  object 
 2   week_position           327895 non-null  int64  
 3   song                    327895 non-null  object 
 4   performer               327895 non-null  object 
 5   song_id                 327895 non-null  object 
 6   instance                327895 non-null  int64  
 7   previous_week_position  295941 non-null  float64
 8   peak_position           327895 non-null  int64  
 9   weeks_on_chart          327895 non-null  int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 25.0+ MB
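
The schema shows `week_id` stored as `object`, that is, as plain strings. As a minimal sketch, assuming the strings follow the month/day/year pattern of the `7/17/1965` value in the summary statistics below, they could be parsed into real dates. The frame `toy` and the columns `week_date` and `year` are illustrative names, not part of the dataset.

```python
import pandas as pd

# Toy rows mimicking the month/day/year strings seen in `week_id`
toy = pd.DataFrame({
    "week_id": ["7/17/1965", "11/4/2017", "1/2/1999"],
    "week_position": [34, 1, 50],
})

# Parse with an explicit format so pandas does not have to guess it
toy["week_date"] = pd.to_datetime(toy["week_id"], format="%m/%d/%Y")

# A derived year column makes per-year or per-decade grouping easy
toy["year"] = toy["week_date"].dt.year

# Sorting now follows the calendar rather than string order
toy = toy.sort_values("week_date").reset_index(drop=True)
print(toy)
```

With a proper datetime column, chronological sorting and time-based grouping behave as expected, which string values would not guarantee.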


and the summary statistics:

In [4]:
billboard.describe(include='all')

,url,week_id,week_position,song,performer,song_id,instance,previous_week_position,peak_position,weeks_on_chart
count,327895,327895,327895.0,327895,327895,327895,327895.0,295941.0,327895.0,327895.0
unique,3279,3279,,24360,10061,29389,,,,
top,http://www.billboard.com/charts/hot-100/1965-0...,7/17/1965,,Stay,Taylor Swift,RadioactiveImagine Dragons,,,,
freq,100,100,,208,1022,87,,,,
mean,,,50.499309,,,,1.072538,47.604066,41.358307,9.153793
std,,,28.865707,,,,0.334188,28.056915,29.542497,7.590281
min,,,1.0,,,,1.0,1.0,1.0,1.0
25%,,,25.5,,,,1.0,23.0,14.0,4.0
50%,,,50.0,,,,1.0,47.0,39.0,7.0
75%,,,75.0,,,,1.0,72.0,66.0,13.0
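
According to the schema above, `previous_week_position` has 295941 non-null values out of 327895 rows, so 31954 entries are missing. A small sketch of how such gaps could be inspected, using made-up rows; the assumption that the gaps sit in a song's first charting week is only illustrative and would need checking against the real data.

```python
import numpy as np
import pandas as pd

# Toy rows: the first week of each song has no previous position (assumed pattern)
toy = pd.DataFrame({
    "song_id": ["A", "A", "A", "B", "B"],
    "weeks_on_chart": [1, 2, 3, 1, 2],
    "previous_week_position": [np.nan, 40.0, 35.0, np.nan, 90.0],
})

# Number of missing values in every column
missing_per_column = toy.isna().sum()

# Where do the gaps occur? Count missing values per weeks_on_chart value
missing_by_week = (
    toy["previous_week_position"].isna().groupby(toy["weeks_on_chart"]).sum()
)

print(missing_per_column)
print(missing_by_week)
```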


For the first main query, we select only the songs that reached the No. 1 spot on the Billboard chart and look at how many weeks in total each of them stayed on the chart:

In [5]:
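# Keep only chart rows where the recorded peak position is No. 1
top_songs = billboard[billboard['peak_position'] == 1]
# Drop columns that are not needed for this query
top_songs = top_songs.drop(['url', 'week_position', 'previous_week_position'], axis=1)
# For each performer/song pair, take the largest weeks_on_chart value
top_songs = top_songs.groupby(['performer', 'song']).agg(max_weeks=('weeks_on_chart', 'max'))
# Longest-charting songs first
top_songs = top_songs.sort_values("max_weeks", ascending=False)
top_songs.head(10)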

performer,song,max_weeks
LMFAO Featuring Lauren Bennett & GoonRock,Party Rock Anthem,68
Adele,Rolling In The Deep,65
Post Malone,Circles,61
Los Del Rio,Macarena (Bayside Boys Mix),60
John Legend,All Of Me,59
Gotye Featuring Kimbra,Somebody That I Used To Know,59
Ed Sheeran,Shape Of You,58
Santana Featuring Rob Thomas,Smooth,58
Katy Perry Featuring Juicy J,Dark Horse,57
The Black Eyed Peas,I Gotta Feeling,56
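
The filter, group, aggregate and sort steps of this query can also be written as one method chain. The sketch below runs on a few made-up rows (the frame `toy` and its values are invented for illustration) and uses `as_index=False` so that `performer` and `song` stay ordinary columns instead of becoming a MultiIndex.

```python
import pandas as pd

# Made-up chart rows: song s1 reaches No. 1 in its second week, s2 never does
toy = pd.DataFrame({
    "performer": ["X", "X", "X", "Y", "Y"],
    "song": ["s1", "s1", "s1", "s2", "s2"],
    "peak_position": [5, 1, 1, 10, 8],
    "weeks_on_chart": [1, 2, 3, 1, 2],
})

top_toy = (
    toy[toy["peak_position"] == 1]                  # rows recorded with a No. 1 peak
    .groupby(["performer", "song"], as_index=False)  # keep keys as columns
    .agg(max_weeks=("weeks_on_chart", "max"))        # longest stay per song
    .sort_values("max_weeks", ascending=False)
)
print(top_toy)
```

Keeping the keys as columns yields a flat result that is easier to export or to join with other tables.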


Now we load the *audio features* dataset:

In [6]:
audio_features = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv')

And once again we look at the schema:

In [7]:
audio_features.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29503 entries, 0 to 29502
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   song_id                    29503 non-null  object 
 1   performer                  29503 non-null  object 
 2   song                       29503 non-null  object 
 3   spotify_genre              27903 non-null  object 
 4   spotify_track_id           24397 non-null  object 
 5   spotify_track_preview_url  14491 non-null  object 
 6   spotify_track_duration_ms  24397 non-null  float64
 7   spotify_track_explicit     24397 non-null  object 
 8   spotify_track_album        24391 non-null  object 
 9   danceability               24334 non-null  float64
 10  energy                     24334 non-null  float64
 11  key                        24334 non-null  float64
 12  loudness                   24334 non-null  float64
 13  mode                       24334 non-null  float64

And the summary statistics (of the numeric attributes):

In [8]:
audio_features.describe()

,spotify_track_duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,spotify_track_popularity
count,24397.0,24334.0,24334.0,24334.0,24334.0,24334.0,24334.0,24334.0,24334.0,24334.0,24334.0,24334.0,24334.0,24397.0
mean,220684.3,0.599945,0.618096,5.231651,-8.664607,0.727172,0.073554,0.294635,0.032539,0.192098,0.601746,120.276066,3.931577,41.224413
std,67746.71,0.153133,0.199078,3.560211,3.601119,0.445422,0.083153,0.2823,0.136276,0.159073,0.238645,28.046937,0.320858,22.477405
min,29688.0,0.0,0.000581,0.0,-28.03,0.0,0.0,3e-06,0.0,0.00967,0.0,0.0,0.0,0.0
25%,175053.0,0.499,0.476,2.0,-11.034,0.0,0.0321,0.0467,0.0,0.0909,0.415,99.06075,4.0,23.0
50%,214850.0,0.608,0.634,5.0,-8.205,1.0,0.0413,0.195,5e-06,0.131,0.622,118.9105,4.0,43.0
75%,253253.0,0.708,0.778,8.0,-5.85625,1.0,0.0683,0.508,0.00046,0.24875,0.802,136.48375,4.0,59.0
max,3079157.0,0.988,0.997,11.0,2.291,1.0,0.951,0.991,0.982,0.999,0.991,241.009,5.0,100.0
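
`describe()` covers only the numeric columns, so `spotify_genre` (an `object` column) is not summarised here. In the joined output later in this document its values look like stringified Python lists, for example `['deep northern soul']`. As a hedged sketch, assuming every non-missing value is a valid Python list literal, the strings could be parsed and counted as follows. The helper `parse_genres` and the column `genres` are hypothetical names introduced for this example.

```python
import ast

import pandas as pd

# Toy rows imitating the stringified lists in `spotify_genre`
toy = pd.DataFrame({
    "song_id": ["A", "B", "C"],
    "spotify_genre": [
        "['country gospel', 'traditional country']",
        "['deep northern soul']",
        None,  # genre information can be missing
    ],
})


def parse_genres(value):
    """Turn a stringified list into a real list; missing values become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)


toy["genres"] = toy["spotify_genre"].apply(parse_genres)

# One row per (song, genre) pair, then count how often each genre appears
genre_counts = toy.explode("genres")["genres"].value_counts()
print(genre_counts)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.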


For the second main query, we relate the peak position a song reached on the Billboard chart to its main Spotify attributes. For this, we need to join the two datasets:

In [9]:
# One row per song; 'max' keeps the largest peak_position value recorded for it
merge_left = billboard.groupby(['performer', 'song', 'song_id']).agg(best_position=('peak_position', 'max'))
# Select the identifying columns and the Spotify attributes of interest
merge_right = audio_features[['song_id', 'performer', 'song', 'spotify_genre', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
# Inner join on song_id (the pandas default), then order by performer
data_join = pd.merge(merge_left, merge_right, on="song_id")
data_join = data_join.sort_values('performer', ascending = True)
data_join.head(10)

,song_id,best_position,performer,song,spotify_genre,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,"Misty""Groove"" Holmes",100,"""Groove"" Holmes",Misty,"['instrumental soul', 'jazz funk', 'jazz organ...",0.553,0.487,8.0,-9.938,0.0324,0.783,0.876,0.292,0.524,93.665
1,"What Now My Love""Groove"" Holmes",98,"""Groove"" Holmes",What Now My Love,"['instrumental soul', 'jazz funk', 'jazz organ...",,,,,,,,,,
2,"May The Bird Of Paradise Fly Up Your Nose""Litt...",90,"""Little"" Jimmy Dickens",May The Bird Of Paradise Fly Up Your Nose,"['country gospel', 'traditional country']",0.66,0.801,4.0,-8.446,0.115,0.738,1e-05,0.627,0.867,104.374
3,"I Know I Know""Pookie"" Hudson",96,"""Pookie"" Hudson",I Know I Know,['deep northern soul'],,,,,,,,,,
14,"Word Crimes""Weird Al"" Yankovic",39,"""Weird Al"" Yankovic",Word Crimes,"['antiviral pop', 'comedy rock', 'comic', 'par...",0.897,0.43,7.0,-12.759,0.0551,0.0118,0.0,0.0473,0.964,121.987
12,"Smells Like Nirvana""Weird Al"" Yankovic",95,"""Weird Al"" Yankovic",Smells Like Nirvana,"['antiviral pop', 'comedy rock', 'comic', 'par...",0.591,0.786,6.0,-7.664,0.0749,0.162,0.00178,0.28,0.729,120.762
11,"Ricky""Weird Al"" Yankovic",90,"""Weird Al"" Yankovic",Ricky,"['antiviral pop', 'comedy rock', 'comic', 'par...",0.521,0.814,4.0,-8.27,0.108,0.066,0.0,0.0868,0.972,153.747
10,"Like A Surgeon""Weird Al"" Yankovic",74,"""Weird Al"" Yankovic",Like A Surgeon,"['antiviral pop', 'comedy rock', 'comic', 'par...",0.838,0.671,3.0,-8.328,0.0346,0.252,0.0,0.056,0.961,126.012
9,"King Of Suede""Weird Al"" Yankovic",77,"""Weird Al"" Yankovic",King Of Suede,"['antiviral pop', 'comedy rock', 'comic', 'par...",0.78,0.808,11.0,-10.056,0.0382,0.271,0.0,0.0958,0.569,128.505
13,"White & Nerdy""Weird Al"" Yankovic",28,"""Weird Al"" Yankovic",White & Nerdy,"['antiviral pop', 'comedy rock', 'comic', 'par...",0.791,0.613,1.0,-11.628,0.0763,0.0986,0.0,0.0765,0.896,143.017
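
Two details of this join are worth checking. First, on a chart a smaller number is a better rank. If `peak_position` is a running peak (the best rank reached up to that week, which is an assumption about the data), then `'max'` returns the weakest value recorded for a song and `'min'` returns its best rank. Second, the default inner join silently drops songs without a match. The sketch below illustrates both points on made-up rows; all frame names and values are invented, and `validate="one_to_one"` assumes `song_id` is unique on both sides, which may not hold for the real files.

```python
import pandas as pd

# Made-up chart history: song A climbs from 28 to 9, song B never improves on 90
toy_chart = pd.DataFrame({
    "song_id": ["A", "A", "A", "B", "B"],
    "week_position": [28, 15, 9, 90, 95],
    "peak_position": [28, 15, 9, 90, 90],  # running peak (assumed semantics)
})

# Compare both aggregations of the running peak
summary = toy_chart.groupby("song_id", as_index=False).agg(
    max_peak=("peak_position", "max"),  # weakest recorded peak
    min_peak=("peak_position", "min"),  # best rank reached
)

# Made-up audio features: song B has no entry, song C never charted
toy_features = pd.DataFrame({
    "song_id": ["A", "C"],
    "danceability": [0.79, 0.50],
})

# Left join keeps every charted song; `indicator` records where each row came from
joined = summary.merge(
    toy_features,
    on="song_id",
    how="left",
    validate="one_to_one",  # raises if song_id is duplicated on either side
    indicator=True,
)
print(joined)
```

The `_merge` column shows which songs lack audio features, and `validate` fails early if duplicate keys would otherwise multiply rows.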
