In [1]:
import pandas as pd
import numpy as np
import os

In [2]:

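def read_dataset(name):
    """Load a CSV file from the ../resources directory into a DataFrame."""
    path = os.path.join(os.getcwd(), '..', 'resources', name)
    return pd.read_csv(path)

def print_column_names(df):
    """Print each column name of df on its own line."""
    for colname in df.columns:
        print(colname)

def print_changes(changes):
    """Print each rename as 'before -> after'."""
    for c in changes:
        print(f"{c.before} -> {c.after}")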

## Preprocessing

Read each of the datasets we will be working with:

In [3]:
top_2010_2019_df = read_dataset('top-tracks-2010-to-2019.csv')
top_2017_df = read_dataset('top-tracks-2017.csv')
top_2018_df = read_dataset('top-tracks-2018.csv')
top_2019_df = read_dataset('top-tracks-2019.csv')
tracks_1921_2020_df = read_dataset('tracks-1921-2020.csv')

Since the datasets come from different sources, we need to normalize the column names as well as the value scales across the dataframes. Most of the datasets originate from the Spotify API, so we can find out which column names to expect by looking at:

https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/

### Automatic column renaming

Based on the Spotify API reference, most of the expected column names are:

In [4]:
columns = [
    
  # metadata
  'id',
  'artist',
  'duration_ms',
  'genre',
  'title',
  'year',
    
  # audio features
  'acousticness',
  'danceability',
  'energy',
  'explicit',
  'instrumentalness',
  'key',
  'liveness',
  'loudness',
  'mode',
  'speechiness',
  'tempo',
  'time_signature',
  'valence',
]

Using the column renamer, we can normalize the column names (casing, spelling, punctuation, etc.) across all of our datasets, which makes exploration easier down the road.

In [5]:
from util.preprocessing import ColumnRenamer

renamer = ColumnRenamer()

# top_2017_df = renamer.normalize(top_2017_df)
# top_2018_df = renamer.normalize(top_2018_df)
# top_2019_dr = renamer.normalize(top_2019_df)
# tracks_1921_2020 = renamer.normalize(tracks_1921_2020)
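
The source of `ColumnRenamer` is not shown in this notebook. As a rough idea of what this kind of normalization involves, here is a minimal sketch that assumes fuzzy string matching with the standard library's `difflib`. The function name `match_columns` and the `cutoff` value are illustrative assumptions, not the actual `util.preprocessing` implementation, and its matches may differ from the outputs recorded in this notebook.

```python
import difflib
import re

def match_columns(expected, actual, cutoff=0.8):
    """Hypothetical sketch: map raw column names to their closest expected name."""
    renames, unmatched = {}, []
    for col in actual:
        # Lowercase and drop everything except letters, digits and underscores
        cleaned = re.sub(r'[^a-z0-9_]', '', str(col).lower())
        # Pick the single most similar expected name, if it is similar enough
        best = difflib.get_close_matches(cleaned, expected, n=1, cutoff=cutoff)
        if best:
            renames[col] = best[0]
        else:
            unmatched.append(col)
    return renames, unmatched

renames, unmatched = match_columns(
    ['artist', 'valence', 'loudness', 'tempo'],
    ['artists', 'Valence.', 'Loudness..dB..', 'bpm'],
)
print(renames)
print(unmatched)
```

The resulting mapping could then be applied with `df.rename(columns=renames)`. An abbreviation such as `bpm` shares too few characters with `tempo` to pass the cutoff, which is why a manual mapping is still needed for some columns.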

#### Top tracks from 2010 to 2019 renaming

In [47]:
top_2010_2019_df, changed, not_changed = renamer.normalize(columns, top_2010_2019_df)

Changed:

In [48]:
print_changes(changed)

title -> title
artist -> artist
genre -> genre
year -> year
energy -> energy
liveness -> liveness
valence -> valence


Not changed:

In [49]:
print_changes(not_changed)

Unnamed: 0 -> None
bpm -> None
dnce -> None
dB -> None
dur -> None
acous -> None
spch -> None
pop -> None


#### Top tracks 2017 renaming

In [9]:
top_2017_df, changed, not_changed = renamer.normalize(columns, top_2017_df)

Changed:

In [10]:
print_changes(changed)

id -> id
artists -> artist
danceability -> danceability
energy -> energy
key -> key
loudness -> loudness
mode -> mode
speechiness -> speechiness
acousticness -> acousticness
instrumentalness -> instrumentalness
liveness -> liveness
valence -> valence
tempo -> tempo
duration_ms -> duration_ms
time_signature -> time_signature


Not changed:

In [11]:
print_changes(not_changed)

name -> None


#### Top tracks 2018 renaming

In [12]:
top_2018_df, changed, not_changed = renamer.normalize(columns, top_2018_df)

Changed:

In [13]:
print_changes(changed)

id -> id
artists -> artist
danceability -> danceability
energy -> energy
key -> key
loudness -> loudness
mode -> mode
speechiness -> speechiness
acousticness -> acousticness
instrumentalness -> instrumentalness
liveness -> liveness
valence -> valence
tempo -> tempo
duration_ms -> duration_ms
time_signature -> time_signature


Not changed:

In [14]:
print_changes(not_changed)

name -> None


#### Top tracks 2019 renaming

In [15]:
top_2019_df, changed, not_changed = renamer.normalize(columns, top_2019_df)

Changed:

In [16]:
print_changes(changed)

Artist.Name -> artist
Genre -> genre
Energy -> energy
Danceability -> danceability
Loudness..dB.. -> loudness
Liveness -> liveness
Valence. -> valence
Acousticness.. -> acousticness
Speechiness. -> speechiness


Not changed:

In [17]:
print_changes(not_changed)

Unnamed: 0 -> None
Track.Name -> None
Beats.Per.Minute -> None
Length. -> None
Popularity -> None


#### Tracks from 1921 to 2020 renaming

In [18]:
tracks_1921_2020_df, changed, not_changed = renamer.normalize(columns, tracks_1921_2020_df)

Changed:

In [19]:
print_changes(changed)

acousticness -> acousticness
artists -> artist
danceability -> danceability
duration_ms -> duration_ms
energy -> energy
explicit -> explicit
id -> id
instrumentalness -> instrumentalness
key -> key
liveness -> liveness
loudness -> loudness
mode -> mode
speechiness -> speechiness
tempo -> tempo
valence -> valence
year -> year


Not changed:

In [20]:
print_changes(not_changed)

name -> None
popularity -> None
release_date -> None


### Manual column renaming

Based on the results of the automatic column renaming (the columns listed under *Not changed*), we can manually fix the names of those columns as follows:

In [21]:
columns_map = {
    'unamed': 'id',
    'bpm': 'tempo',
    'dnce': 'danceability',
    'db': 'loudness',
    'acous': 'acousticness',
    'spch': 'speechness',
    'pop': 'popularity',
    'Track.name': 'title',
    'beats.per.minute': 'tempo',
    'length': 'duration_ms',
    'name': 'title',
    'release_date': 'release_date',
}

#### Top tracks from 2010 to 2019 renaming

In [51]:
top_2010_2019_df, changed, not_changed = renamer.map(columns_map, top_2010_2019_df)

Changed:

In [52]:
print_changes(changed)

Unnamed: 0 -> id
bpm -> tempo
dnce -> danceability
dB -> loudness
acous -> acousticness
spch -> speechness
pop -> popularity


Not changed:

In [53]:
print_changes(not_changed)

title -> None
artist -> None
genre -> None
year -> None
energy -> None
liveness -> None
valence -> None
dur -> None


#### Top tracks 2017 renaming

In [54]:
top_2017_df, changed, not_changed = renamer.map(columns_map, top_2017_df)

Changed:

In [55]:
print_changes(changed)

Not changed:

In [56]:
print_changes(not_changed)

id -> None
title -> None
artist -> None
danceability -> None
energy -> None
key -> None
loudness -> None
mode -> None
speechiness -> None
acousticness -> None
instrumentalness -> None
liveness -> None
valence -> None
tempo -> None
duration_ms -> None
time_signature -> None


#### Top tracks 2018 renaming

In [57]:
top_2018_df, changed, not_changed = renamer.map(columns_map, top_2018_df)

Changed:

In [35]:
print_changes(changed)

name -> title


Not changed:

In [36]:
print_changes(not_changed)

id -> None
artist -> None
danceability -> None
energy -> None
key -> None
loudness -> None
mode -> None
speechiness -> None
acousticness -> None
instrumentalness -> None
liveness -> None
valence -> None
tempo -> None
duration_ms -> None
time_signature -> None


#### Top tracks 2019 renaming

In [37]:
top_2019_df, changed, not_changed = renamer.map(columns_map, top_2019_df)

Changed:

In [38]:
print_changes(changed)

Unnamed: 0 -> id
Track.Name -> title
Beats.Per.Minute -> tempo
Length. -> duration_ms


Not changed:

In [39]:
print_changes(not_changed)

artist -> None
genre -> None
energy -> None
danceability -> None
loudness -> None
liveness -> None
valence -> None
acousticness -> None
speechiness -> None
Popularity -> None


#### Tracks from 1921 to 2020 renaming

In [41]:
tracks_1921_2020_df, changed, not_changed = renamer.map(columns_map, tracks_1921_2020_df)

Changed:

In [42]:
print_changes(changed)

release_date -> release_date


Not changed:

In [43]:
print_changes(not_changed)

acousticness -> None
artist -> None
danceability -> None
duration_ms -> None
energy -> None
explicit -> None
id -> None
instrumentalness -> None
key -> None
liveness -> None
loudness -> None
mode -> None
title -> None
popularity -> None
speechiness -> None
tempo -> None
valence -> None
year -> None
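
After both renaming passes, it can help to check which columns still fall outside the expected vocabulary. The following is a minimal sketch, assuming the column lists are copied from the `head()` outputs in the Exploration section and that `popularity` and `release_date` are accepted extras; the helper name `leftover_columns` is hypothetical.

```python
# Expected names from the `columns` list, plus two assumed extras
expected = {
    'id', 'artist', 'duration_ms', 'genre', 'title', 'year',
    'acousticness', 'danceability', 'energy', 'explicit',
    'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
    'speechiness', 'tempo', 'time_signature', 'valence',
    'popularity', 'release_date',
}

def leftover_columns(names, allowed=expected):
    """Return the column names that are not part of the allowed vocabulary."""
    return [name for name in names if name not in allowed]

# Column names as they appear after renaming in this notebook
cols_2010_2019 = [
    'id', 'title', 'artist', 'genre', 'year', 'tempo', 'energy',
    'danceability', 'loudness', 'liveness', 'valence', 'dur',
    'acousticness', 'speechness', 'popularity',
]
cols_2019 = [
    'id', 'title', 'artist', 'genre', 'tempo', 'energy', 'danceability',
    'loudness', 'liveness', 'valence', 'duration_ms', 'acousticness',
    'speechiness', 'Popularity',
]

left_2010_2019 = leftover_columns(cols_2010_2019)
left_2019 = leftover_columns(cols_2019)
print(left_2010_2019)  # ['dur', 'speechness']
print(left_2019)       # ['Popularity']
```

With the dataframes at hand, `leftover_columns(top_2010_2019_df.columns)` would work the same way. For this run the check flags `dur`, `speechness` (the manual map spells the target `speechness`, whereas the expected name is `speechiness`) and the capitalized `Popularity`, all of which would need fixing before the frames can be compared column by column.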


## Exploration

#### Top tracks from 2010 to 2019

In [67]:
top_2010_2019_df.head()

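| | id | title | artist | genre | year | tempo | energy | danceability | loudness | liveness | valence | dur | acousticness | speechness | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Hey, Soul Sister | Train | neo mellow | 2010 | 97 | 89 | 67 | -4 | 8 | 80 | 217 | 19 | 4 | 83 |
| 1 | 2 | Love The Way You Lie | Eminem | detroit hip hop | 2010 | 87 | 93 | 75 | -5 | 52 | 64 | 263 | 24 | 23 | 82 |
| 2 | 3 | TiK ToK | Kesha | dance pop | 2010 | 120 | 84 | 76 | -3 | 29 | 71 | 200 | 10 | 14 | 80 |
| 3 | 4 | Bad Romance | Lady Gaga | dance pop | 2010 | 119 | 92 | 70 | -4 | 8 | 71 | 295 | 0 | 4 | 79 |
| 4 | 5 | Just the Way You Are | Bruno Mars | pop | 2010 | 109 | 84 | 64 | -5 | 9 | 43 | 221 | 2 | 4 | 78 |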


#### Top tracks 2017

In [68]:
top_2017_df.head()

| | id | title | artist | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7qiZfU4dY1lWllzX7mPBI | Shape of You | Ed Sheeran | 0.825 | 0.652 | 1.0 | -3.183 | 0.0 | 0.0802 | 0.581 | 0.0 | 0.0931 | 0.931 | 95.977 | 233713.0 | 4.0 |
| 1 | 5CtI0qwDJkDQGwXD1H1cL | Despacito - Remix | Luis Fonsi | 0.694 | 0.815 | 2.0 | -4.328 | 1.0 | 0.12 | 0.229 | 0.0 | 0.0924 | 0.813 | 88.931 | 228827.0 | 4.0 |
| 2 | 4aWmUDTfIPGksMNLV2rQP | Despacito (Featuring Daddy Yankee) | Luis Fonsi | 0.66 | 0.786 | 2.0 | -4.757 | 1.0 | 0.17 | 0.209 | 0.0 | 0.112 | 0.846 | 177.833 | 228200.0 | 4.0 |
| 3 | 6RUKPb4LETWmmr3iAEQkt | Something Just Like This | The Chainsmokers | 0.617 | 0.635 | 11.0 | -6.769 | 0.0 | 0.0317 | 0.0498 | 1.4e-05 | 0.164 | 0.446 | 103.019 | 247160.0 | 4.0 |
| 4 | 3DXncPQOG4VBw3QHh3S81 | I'm the One | DJ Khaled | 0.609 | 0.668 | 7.0 | -4.284 | 1.0 | 0.0367 | 0.0552 | 0.0 | 0.167 | 0.811 | 80.924 | 288600.0 | 4.0 |


#### Top tracks 2018

In [69]:
top_2018_df.head()

| | id | title | artist | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6DCZcSspjsKoFjzjrWoCd | God's Plan | Drake | 0.754 | 0.449 | 7.0 | -9.211 | 1.0 | 0.109 | 0.0332 | 8.3e-05 | 0.552 | 0.357 | 77.169 | 198973.0 | 4.0 |
| 1 | 3ee8Jmje8o58CHK66QrVC | SAD! | XXXTENTACION | 0.74 | 0.613 | 8.0 | -4.88 | 1.0 | 0.145 | 0.258 | 0.00372 | 0.123 | 0.473 | 75.023 | 166606.0 | 4.0 |
| 2 | 0e7ipj03S05BNilyu5bRz | rockstar (feat. 21 Savage) | Post Malone | 0.587 | 0.535 | 5.0 | -6.09 | 0.0 | 0.0898 | 0.117 | 6.6e-05 | 0.131 | 0.14 | 159.847 | 218147.0 | 4.0 |
| 3 | 3swc6WTsr7rl9DqQKQA55 | Psycho (feat. Ty Dolla $ign) | Post Malone | 0.739 | 0.559 | 8.0 | -8.011 | 1.0 | 0.117 | 0.58 | 0.0 | 0.112 | 0.439 | 140.124 | 221440.0 | 4.0 |
| 4 | 2G7V7zsVDxg1yRsu7Ew9R | In My Feelings | Drake | 0.835 | 0.626 | 1.0 | -5.833 | 1.0 | 0.125 | 0.0589 | 6e-05 | 0.396 | 0.35 | 91.03 | 217925.0 | 4.0 |


#### Top tracks 2019

In [70]:
top_2019_df.head()

| | id | title | artist | genre | tempo | energy | danceability | loudness | liveness | valence | duration_ms | acousticness | speechiness | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Señorita | Shawn Mendes | canadian pop | 117 | 55 | 76 | -6 | 8 | 75 | 191 | 4 | 3 | 79 |
| 1 | 2 | China | Anuel AA | reggaeton flow | 105 | 81 | 79 | -4 | 8 | 61 | 302 | 8 | 9 | 92 |
| 2 | 3 | boyfriend (with Social House) | Ariana Grande | dance pop | 190 | 80 | 40 | -4 | 16 | 70 | 186 | 12 | 46 | 85 |
| 3 | 4 | Beautiful People (feat. Khalid) | Ed Sheeran | pop | 93 | 65 | 64 | -8 | 8 | 55 | 198 | 12 | 19 | 86 |
| 4 | 5 | Goodbyes (Feat. Young Thug) | Post Malone | dfw rap | 150 | 65 | 58 | -4 | 11 | 18 | 175 | 45 | 7 | 94 |


#### Tracks from 1921 to 2020

In [72]:
tracks_1921_2020_df.head()

| | acousticness | artist | danceability | duration_ms | energy | explicit | id | instrumentalness | key | liveness | loudness | mode | title | popularity | release_date | speechiness | tempo | valence | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.995 | ['Carl Woitschach'] | 0.708 | 158648 | 0.195 | 0 | 6KbQ3uYMLKb5jDxLF7wYDD | 0.563 | 10 | 0.151 | -12.428 | 1 | Singende Bataillone 1. Teil | 0 | 1928 | 0.0506 | 118.469 | 0.779 | 1928 |
| 1 | 0.994 | ['Robert Schumann', 'Vladimir Horowitz'] | 0.379 | 282133 | 0.0135 | 0 | 6KuQTIu1KoTTkLXKrwlLPV | 0.901 | 8 | 0.0763 | -28.454 | 1 | Fantasiestücke, Op. 111: Più tosto lento | 0 | 1928 | 0.0462 | 83.972 | 0.0767 | 1928 |
| 2 | 0.604 | ['Seweryn Goszczyński'] | 0.749 | 104300 | 0.22 | 0 | 6L63VW0PibdM1HDSBoqnoM | 0.0 | 5 | 0.119 | -19.924 | 0 | Chapter 1.18 - Zamek kaniowski | 0 | 1928 | 0.929 | 107.177 | 0.88 | 1928 |
| 3 | 0.995 | ['Francisco Canaro'] | 0.781 | 180760 | 0.13 | 0 | 6M94FkXd15sOAOQYRnWPN8 | 0.887 | 1 | 0.111 | -14.734 | 0 | Bebamos Juntos - Instrumental (Remasterizado) | 0 | 1928-09-25 | 0.0926 | 108.003 | 0.72 | 1928 |
| 4 | 0.99 | ['Frédéric Chopin', 'Vladimir Horowitz'] | 0.21 | 687733 | 0.204 | 0 | 6N6tiFZ9vLTSOIxkj8qKrd | 0.908 | 11 | 0.098 | -16.829 | 1 | Polonaise-Fantaisie in A-Flat Major, Op. 61 | 1 | 1928 | 0.0424 | 62.149 | 0.0693 | 1928 |


### Ranges and scaling

From the exploration above, we can see that some values are not on the same scale across datasets. For example, `danceability` takes values such as 67 in the 2010-2019 and 2019 datasets but 0.825 in the 2017 dataset, and durations appear to be in seconds in the former (e.g. 217) but in milliseconds in the latter (e.g. 233713.0). We therefore have to bring them to a common scale before going further. The Spotify audio features page (https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) specifies which values represent a percentage (0.0-1.0, i.e. 0%-100%) and which do not.

In [73]:
percentage_columns = [
    'acousticness',
    'danceability',
    'energy',
    'instrumentalness',
    'liveness',
    'speechiness',
    'valence',
]
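
One way to bring these columns onto the 0.0-1.0 range is sketched below. It assumes a simple heuristic (a column whose maximum exceeds 1 is taken to be on a 0-100 scale), and the helper name `rescale_percentages` is hypothetical. The heuristic would misfire on a 0-100 column whose values all happen to be at most 1, so it is a starting point rather than a definitive rule.

```python
import pandas as pd

def rescale_percentages(df, cols):
    """Return a copy of df where 0-100 percentage columns are divided by 100."""
    out = df.copy()
    for col in cols:
        # Skip columns missing from this dataset; rescale only 0-100 columns
        if col in out.columns and out[col].max() > 1:
            out[col] = out[col] / 100.0
    return out

# Small illustrative frame mixing both scales (values taken from the outputs above)
demo = pd.DataFrame({
    'energy': [89, 93],       # 0-100 scale, as in the 2010-2019 dataset
    'valence': [0.931, 0.813],  # already 0.0-1.0, as in the 2017 dataset
    'tempo': [97, 87],        # not a percentage, must stay untouched
})
scaled = rescale_percentages(demo, ['energy', 'valence', 'liveness'])
print(scaled)
```

Only the columns passed in `cols` are considered, so non-percentage features such as `tempo` or `loudness` are left as they are, and the original frame is not modified.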

## Exploration

In [26]:
# TBD

## Feature selection and reduction

In [27]:
# TBD