# __A peek at Spotify's top 100__


In this project we analyse the Top 100 songs of each year on Spotify from 2009 to 2021. The main goal of this project is to answer the question, "What does it take to make it to Spotify's Top 100 list?"
These are the 3 business questions we seek to answer:
* Which genres most often make it to Spotify's Top 100?
* What are the main indicators that a song will make it to Spotify's Top 100?
* Can we use the indicators identified to predict the performance of new music genres?


__Context of the Data__

Column Name | Column Description
-----------------|------------------------
title|Song's Title
artist|Song's artist
top genre|Genre of the song
year released|Year the song was released
added|Day the song was added to Spotify's Top Hits playlist
bpm|Beats Per Minute - The tempo of the song
nrgy|Energy - How energetic the song is
dnce|Danceability - How easy it is to dance to the song
dB|Decibel - How loud the song is
live|How likely the song is a live recording
val|How positive the mood of the song is
dur|Duration of the song
acous|How acoustic the song is
spch|How much the song is focused on spoken word
pop|Popularity of the song (not a ranking)
top year|Year the song was a top hit
artist type|Tells if the artist is solo, duo, trio, or a band



Data obtained from https://www.kaggle.com/datasets/muhmores/spotify-top-100-songs-of-20152019

Articles & Works Cited
https://www.washingtonpost.com/news/to-your-health/wp/2015/10/30/the-mathematical-formula-behind-feel-good-songs/
https://towardsdatascience.com/crisp-dm-methodology-for-your-first-data-science-project-769f35e0346c
https://www.kaggle.com/code/jkanthony/spotify-top-hit-songs-fix-the-data-and-eda

In [2]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Read the downloaded CSV file into a DataFrame
df = pd.read_csv('Spotify 2010 - 2019 Top 100.csv')
df.head()

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
0,STARSTRUKK (feat. Katy Perry),3OH!3,dance pop,2009.0,2022‑02‑17,140.0,81.0,61.0,-6.0,23.0,23.0,203.0,0.0,6.0,70.0,2010.0,Duo
1,My First Kiss (feat. Ke$ha),3OH!3,dance pop,2010.0,2022‑02‑17,138.0,89.0,68.0,-4.0,36.0,83.0,192.0,1.0,8.0,68.0,2010.0,Duo
2,I Need A Dollar,Aloe Blacc,pop soul,2010.0,2022‑02‑17,95.0,48.0,84.0,-7.0,9.0,96.0,243.0,20.0,3.0,72.0,2010.0,Solo
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,atl hip hop,2010.0,2022‑02‑17,93.0,87.0,66.0,-4.0,4.0,38.0,180.0,11.0,12.0,80.0,2010.0,Solo
4,Nothin' on You (feat. Bruno Mars),B.o.B,atl hip hop,2010.0,2022‑02‑17,104.0,85.0,69.0,-6.0,9.0,74.0,268.0,39.0,5.0,79.0,2010.0,Solo


_Describe the data_

In [4]:
num_rows = df.shape[0] #Provide the number of rows in the dataset
num_cols = df.shape[1] #Provide the number of columns in the dataset


In [5]:
num_rows

1003

In [6]:
num_cols

17

In [3]:
df.isnull().sum()

title            3
artist           3
top genre        3
year released    3
added            3
bpm              3
nrgy             3
dnce             3
dB               3
live             3
val              3
dur              3
acous            3
spch             3
pop              3
top year         3
artist type      3
dtype: int64

The bottom 3 rows are null in every column, so let's delete them.

In [4]:
df = df.iloc[0:999]
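
Slicing by position only works if we know exactly where the empty rows sit. Note that `df.iloc[0:999]` keeps rows 0 through 998, so if the three null rows are the last three of the 1003 rows, row 999 is removed as well. A minimal sketch of an alternative that does not depend on row positions is shown here. The `sample` frame and its values are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: two valid rows followed by
# two rows that are null in every column.
sample = pd.DataFrame({
    'title': ['Song A', 'Song B', np.nan, np.nan],
    'bpm': [120.0, 95.0, np.nan, np.nan],
})

# Drop only the rows in which every column is null, then renumber the index.
cleaned = sample.dropna(how='all').reset_index(drop=True)

print(len(sample), len(cleaned))
```

Using `how='all'` leaves rows with only some missing values untouched, so it removes nothing beyond the fully empty rows.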

Data exploration

In [13]:
#Convert year released and top year from float to int
df['year released'] = df['year released'].astype('Int64')
df['top year'] = df['top year'].astype('Int64')
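
The capital "I" in `'Int64'` matters: it selects pandas' nullable integer type, which can hold missing values, whereas the plain NumPy `'int64'` cannot. The sketch below illustrates the difference on a small made-up series (the values are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical year column stored as float because it contains a missing value.
years = pd.Series([2009.0, 2010.0, np.nan], name='year released')

# The nullable integer dtype keeps the missing value as <NA>.
nullable = years.astype('Int64')
print(nullable)

# The plain NumPy integer dtype cannot represent the missing value.
plain_int_failed = False
try:
    years.astype('int64')
except ValueError as exc:
    plain_int_failed = True
    print('int64 conversion failed:', exc)
```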

Data Quality

In [14]:
df.groupby('year released')['year released'].count()

year released
2009     24
2010     96
2011     95
2012    110
2013     86
2014    101
2015     99
2016     87
2017     99
2018    112
2019     90
Name: year released, dtype: int64

There are some inconsistencies in the data: some songs have a release year recorded as 1975, 2020 or 2021. Let's fix them.
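
One way to surface such rows without checking each suspicious year by hand is to flag every release year that falls outside the expected window. This is only a sketch: the 2009 to 2019 window is an assumption based on the yearly counts above, and `sample` is a made-up frame standing in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df with two implausible release years.
sample = pd.DataFrame({
    'title': ['Song A', 'Song B', 'Song C', 'Song D'],
    'year released': pd.array([2010, 1975, 2021, 2019], dtype='Int64'),
    'top year': pd.array([2010, 2011, 2012, 2019], dtype='Int64'),
})

# Assumed plausible window for release years (both ends inclusive).
valid = sample['year released'].between(2009, 2019)

# Rows whose release year needs a manual check.
suspect = sample[~valid]
print(suspect)
```

Comparing `year released` with `top year` for the flagged rows can help decide what the corrected value should be.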

In [15]:
df[df['year released']==1975]

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type


In [16]:
df['year released'].iloc[982]

2011

In [17]:
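# Replace the erroneous 1975 release year with 2011 in one vectorised assignment.
# Using df.loc avoids the chained assignment of df[col].iloc[i] = value.
df.loc[df['year released'] == 1975, 'year released'] = 2011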

In [8]:
df[df['year released']==2020]

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
69,Eenie Meenie,Sean Kingston,dance pop,2020.0,2022‑02‑17,121.0,64.0,73.0,-3.0,10.0,84.0,202.0,3.0,3.0,72.0,2010.0,Solo
98,We No Speak Americano (Edit),Yolanda Be Cool,australian dance,2020.0,2022‑02‑17,125.0,81.0,90.0,-5.0,9.0,75.0,157.0,7.0,5.0,65.0,2010.0,Duo
651,Gold,Kiiara,alt z,2020.0,2020‑06‑08,113.0,41.0,60.0,-9.0,13.0,41.0,226.0,62.0,34.0,64.0,2016.0,Solo
901,Easier,5 Seconds of Summer,boy band,2020.0,2020‑06‑22,176.0,46.0,56.0,-4.0,11.0,62.0,158.0,48.0,26.0,74.0,2019.0,Band/Group
951,i'm so tired...,Lauv,dance pop,2020.0,2020‑06‑22,102.0,73.0,60.0,-7.0,24.0,53.0,163.0,18.0,20.0,81.0,2019.0,Solo
972,Options,NSG,afro dancehall,2020.0,2020‑08‑20,102.0,62.0,84.0,-5.0,10.0,76.0,240.0,39.0,9.0,62.0,2019.0,Band/Group


In [18]:
# Correct the rows recorded as released in 2020; each is set to the song's top year.
# Row labels equal row positions here because df uses the default integer index.
corrections = {69: 2010, 98: 2010, 651: 2016, 901: 2019, 951: 2019, 972: 2019}
for row, year in corrections.items():
    df.loc[row, 'year released'] = year

In [19]:
# Correct three more rows (their original values are not shown in this notebook).
for row, year in {168: 2011, 182: 2012, 608: 2019}.items():
    df.loc[row, 'year released'] = year

In [20]:
df.groupby('year released')['year released'].count()

year released
2009     24
2010     96
2011     95
2012    110
2013     86
2014    101
2015     99
2016     87
2017     99
2018    112
2019     90
Name: year released, dtype: int64

In [21]:
artist_with_most_hit_songs = pd.DataFrame()
artist_with_most_hit_songs['occurence'] = df.groupby('artist')['artist'].count()
artist_with_most_hit_songs.sort_values('occurence',ascending=False)

artist,occurence
Taylor Swift,21
Drake,18
Calvin Harris,18
Ariana Grande,14
Rihanna,14
...,...
Milky Chance,1
Mr. Probz,1
Muse,1
Mustard,1


The top 5 artists are Taylor Swift, Drake, Calvin Harris, Ariana Grande and Rihanna.

## Top Genre

In [14]:
df.groupby('top genre')['top genre'].count().sort_values(ascending=False)

top genre
dance pop        361
pop               57
atl hip hop       39
art pop           37
hip hop           21
                ... 
idol               1
indie folk         1
dark clubbing      1
basshall           1
acoustic pop       1
Name: top genre, Length: 132, dtype: int64
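
Raw counts are easier to compare once they are expressed as shares. In the output above, dance pop accounts for 361 of roughly 1,000 songs, or about 36%. A minimal sketch of computing such shares with `value_counts(normalize=True)` is shown here, using a small made-up series in place of `df['top genre']`.

```python
import pandas as pd

# Hypothetical stand-in for df['top genre'].
genres = pd.Series(
    ['dance pop', 'dance pop', 'pop', 'dance pop', 'atl hip hop'],
    name='top genre',
)

# Share of each genre as a percentage, largest first.
share = genres.value_counts(normalize=True).mul(100).round(1)
print(share)
```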

## Indicators 

According to an article by The Washington Post, BPM plays a critical role, so we will use the data in this section to look for that sweet spot.

In [15]:
df['bpm'].sort_values(ascending=False)

563     206.0
109     206.0
898     204.0
227     202.0
969     202.0
        ...  
38       65.0
896      65.0
1000      NaN
1001      NaN
1002      NaN
Name: bpm, Length: 1003, dtype: float64

In [16]:
df.iloc[563]

title            FourFiveSeconds
artist                   Rihanna
top genre          barbadian pop
year released               2015
added                 2020‑06‑19
bpm                        206.0
nrgy                        27.0
dnce                        58.0
dB                          -6.0
live                        13.0
val                         35.0
dur                        188.0
acous                       88.0
spch                         5.0
pop                         82.0
top year                    2015
artist type                 Solo
Name: 563, dtype: object
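
The fastest song alone says little about a sweet spot. One possible way to look for it is to group tempos into bands and see which band holds the most songs. The sketch below assumes 20 BPM wide bands from 60 to 220 (an arbitrary choice) and uses a handful of made-up tempo values in place of `df['bpm']`.

```python
import pandas as pd

# Hypothetical stand-in for df['bpm'].
bpm = pd.Series([65, 93, 95, 104, 120, 125, 126, 128, 138, 140, 206], name='bpm')

# Assumed band edges: 60, 80, ..., 220. Each band includes its upper edge.
edges = list(range(60, 221, 20))
bands = pd.cut(bpm, bins=edges)

# Number of songs per tempo band, ordered from slowest to fastest.
counts = bands.value_counts().sort_index()
print(counts)

# The band with the most songs is a candidate "sweet spot".
print(counts.idxmax())
```

Narrower bands give a finer picture but need more songs per band to be meaningful, so the band width is worth experimenting with.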

### Most energetic song

In [17]:
df['nrgy'].sort_values(ascending=False)

312     98.0
70      98.0
284     97.0
254     97.0
88      97.0
        ... 
916     11.0
675      6.0
1000     NaN
1001     NaN
1002     NaN
Name: nrgy, Length: 1003, dtype: float64

In [18]:
df.iloc[[70,312]]

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
70,Riverside,Sidney Samson,dutch house,2009,2022‑02‑17,126.0,98.0,80.0,-2.0,13.0,29.0,321.0,0.0,5.0,42.0,2010,Solo
312,Get Up (Rattle),Bingo Players,big room,2013,2020‑06‑11,128.0,98.0,80.0,-3.0,26.0,80.0,167.0,2.0,7.0,61.0,2013,Duo
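
Reading the top indices off a sorted list and passing them to `iloc` by hand is easy to get wrong. As a sketch, the rows that share the maximum value can be selected directly with a boolean mask. The `sample` frame below is made up and stands in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df with two songs tied for the highest energy.
sample = pd.DataFrame({
    'title': ['Song A', 'Song B', 'Song C', 'Song D'],
    'nrgy': [98.0, 97.0, 98.0, 11.0],
})

# Keep every row whose energy equals the column maximum, ties included.
most_energetic = sample[sample['nrgy'] == sample['nrgy'].max()]
print(most_energetic)
```

The same pattern works for `dnce` or any other numeric column.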


### Most danceable

In [19]:
df['dnce'].sort_values(ascending=False)

461     96.0
856     96.0
923     95.0
750     94.0
705     94.0
        ... 
521     26.0
355     19.0
1000     NaN
1001     NaN
1002     NaN
Name: dnce, Length: 1003, dtype: float64

In [20]:
df.iloc[[461,856]]

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
461,Anaconda,Nicki Minaj,dance pop,2014,2020‑06‑10,130.0,61.0,96.0,-6.0,21.0,65.0,260.0,7.0,18.0,70.0,2014,Solo
856,Yes Indeed,Lil Baby,atl hip hop,2018,2020‑06‑22,120.0,35.0,96.0,-9.0,11.0,56.0,142.0,4.0,53.0,84.0,2018,Solo


### Shortest and Longest songs

In [21]:
df['dur'].sort_values()

955     113.0
956     115.0
795     119.0
896     122.0
750     124.0
        ...  
350     484.0
443     688.0
1000      NaN
1001      NaN
1002      NaN
Name: dur, Length: 1003, dtype: float64

In [22]:
df.iloc[[955,443]]

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
955,Old Town Road,Lil Nas X,lgbtq+ hip hop,2019,2020‑06‑22,136.0,53.0,91.0,-6.0,10.0,51.0,113.0,6.0,13.0,80.0,2019,Solo
443,Not a Bad Thing,Justin Timberlake,dance pop,2013,2020‑06‑10,86.0,56.0,31.0,-9.0,13.0,11.0,688.0,53.0,7.0,61.0,2014,Solo
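
The shortest and longest songs can also be looked up without sorting the whole column, using `idxmin` and `idxmax`. The sketch below reuses three titles and durations shown in this notebook. It assumes `dur` is measured in seconds, which the column description does not state.

```python
import pandas as pd

# Small stand-in for df built from rows shown in this notebook.
sample = pd.DataFrame({
    'title': ['Old Town Road', 'Not a Bad Thing', 'Gold'],
    'dur': [113.0, 688.0, 226.0],
})

# Index labels of the shortest and longest songs.
shortest, longest = sample['dur'].idxmin(), sample['dur'].idxmax()

# Assumption: dur is in seconds, so dividing by 60 gives minutes.
extremes = sample.loc[[shortest, longest], ['title', 'dur']].assign(
    minutes=lambda d: (d['dur'] / 60).round(2)
)
print(extremes)
```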
