# What is the recipe for a music hit?
## Overview
In this project I try to answer the question: what is the recipe for a music hit? Why do some songs become hits while others remain relatively unknown, even when written and performed by the same artist?
To answer this question, I will look for distinctive features of proven hits, both from the perspective of music theory and in terms of the [semi-]subjective impression they leave.
### Methodology
For my research I am going to analyze the list of Billboard Hot 100 tracks for the years 1999-2019. I am going to compare the tracks from this dataset with tracks by the same artists that did not make the Hot 100.
The tracks will be compared across objective dimensions of:
- Musical meter
- Tempo
- Key

and subjective dimensions of:
- Energy
- Danceability
- Instrumentalness

The measures for both objective and subjective dimensions will be obtained from Spotify via the Spotify API.
In addition to analyzing the data by artist, I am going to perform a "landscape" analysis to identify trends in the analyzed variables over time.

## Data Profile

1. I am going to join the datasets by track name and artist, then focus my analysis on track qualities such as key, tempo, energy level, instrumentalness, etc.

2. I am going to get another dataset of track qualities for all popular tracks by the artists in the Hot 100 and compare the qualities of hits vs. popular songs.

3. In addition, if time allows, I plan to perform an exploratory analysis of the data for curiosities. I want to look for features such as the preferred key by artist, genre or year, as well as anomalies or trends in music qualities over the years.
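
The join by track name and artist in step 1 depends on both sources spelling names the same way, which is not guaranteed. Below is a minimal sketch of one way to build a tolerant match key. The helper `match_key` and the placeholder rows are hypothetical and for illustration only; the normalization keeps ASCII letters and digits, so it is an assumption that would need extending for non-English titles.

```python
import re

def match_key(artist, name):
    """Build a case- and punctuation-insensitive key for matching tracks."""
    def norm(text):
        # lowercase, turn every run of non-alphanumeric characters into one space
        return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())
    return (norm(artist), norm(name))

# Hypothetical placeholder rows, not real chart data
billboard_rows = [
    {"Artists": "Artist A", "Name": "Song One"},
    {"Artists": "Artist B", "Name": "Song Two!"},
]
spotify_rows = [
    {"artist.name": "artist a", "track.name": "Song  One", "audio.tempo": 120.0},
    {"artist.name": "Artist C", "track.name": "Song Three", "audio.tempo": 98.5},
]

# Index the Spotify rows by key, then look up each Billboard row (an inner join)
spotify_by_key = {match_key(r["artist.name"], r["track.name"]): r for r in spotify_rows}
joined = [
    {**b, **spotify_by_key[match_key(b["Artists"], b["Name"])]}
    for b in billboard_rows
    if match_key(b["Artists"], b["Name"]) in spotify_by_key
]
print(joined)
```

Only the first Billboard row finds a partner here, because the second artist has no Spotify row in the placeholder data.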

**Note:** There is a checkpoint further below where the data can be loaded from CSV files instead of being retrieved from the API, so that you don't have to wait 30 minutes for the data to be pulled.

In [None]:
#I am going to use SpotiPy library to work with Spotify API
!pip install spotipy


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sys
import spotipy
import spotipy.util as util
import json

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# secrets are used to store Spotify API client ID and secret
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
SPOTIFY_CLIENT_ID = user_secrets.get_secret("spotify_client_id")
SPOTIFY_CLIENT_SECRET = user_secrets.get_secret("spotify_client_secret")

# construct API handler
from spotipy.oauth2 import SpotifyClientCredentials
client_credentials_manager = SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, client_secret=SPOTIFY_CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

This is what the Billboard Hot 100 dataset looks like. We are going to focus on track name, artist and date.

In [None]:
#These are the tracks (unique list)
hot100 = (
    pd.read_csv("/kaggle/input/data-on-songs-from-billboard-19992019/BillboardFromLast20/billboardHot100_1999-2019.csv")
    .drop(columns=['Unnamed: 0','Weekly.rank','Peak.position','Weeks.on.chart','Week','Genre','Writing.Credits','Lyrics','Features'])
    .drop_duplicates()
)
hot100.head()

It is a long list of songs!

In [None]:
len(hot100)

And many artists as well!

In [None]:
# These are the artists and the number of songs by the first 20 of them
hot100_artists = hot100['Artists'].drop_duplicates()
# print both values: a notebook cell only displays its last expression
print(len(hot100_artists))
print(len(hot100[hot100['Artists'].isin(hot100_artists.head(20))]))


### Data Retrieval
We obtain the data from the Spotify API.
We first search for each song in the Hot 100 list, then parse the Spotify response to obtain track and artist information. Finally, we call the Spotify API again to obtain the audio features of each track.
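
Calling the audio-features endpoint once per track is the main reason the retrieval is slow. A possible alternative, sketched below, is to collect track IDs first and request features in batches. The helpers `chunked` and `fetch_audio_features` are hypothetical, the batch size of 100 is an assumption based on my reading of the Spotify documentation for this endpoint, and `_StubClient` only stands in for the real `sp` object so the sketch runs without credentials.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of `ids` holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def fetch_audio_features(client, track_ids, size=100):
    """Request audio features batch by batch and skip tracks without features."""
    features = []
    for batch in chunked(track_ids, size):
        # the API is expected to return None for tracks it has no features for
        features.extend(f for f in client.audio_features(batch) if f is not None)
    return features

class _StubClient:
    """Stand-in for spotipy.Spotify; records how many calls were made."""
    def __init__(self):
        self.calls = 0
    def audio_features(self, ids):
        self.calls += 1
        return [None if i == "missing" else {"id": i, "tempo": 120.0} for i in ids]

stub = _StubClient()
ids = ["t%d" % n for n in range(250)] + ["missing"]
result = fetch_audio_features(stub, ids)
print(len(result), "feature rows from", stub.calls, "API calls")
```

With the real client the call would be `fetch_audio_features(sp, list_of_track_ids)`.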

In [None]:
# Limit number of artists to survey
ARTIST_LIMIT=300

#this function takes care of one track. It accepts track record and returns semi-parsed information from Spotify API
def searchSp (track):
    #Search for the track
    raw_search=sp.search(q='artist:"'+track.Artists+'" track:"'+track.Name+'"', type='track')
    if len(raw_search['tracks']['items'])>0:
        #track is found, let's take the first search result
        p1=raw_search['tracks']['items'][0]
        #let's extract album information
        album=pd.json_normalize(p1['album'])[['id', 'name']]
        
        #and artist information
        artist=pd.json_normalize(p1['artists'][0])[['id','name']]
        
        #and track info
        track_info = pd.json_normalize(p1)[['id','name','popularity']]
        
        #now let's get audio features
        a1 = sp.audio_features(track_info['id'])
        audio_features=pd.json_normalize(a1[0]).drop(columns=['type', 'id','uri','track_href', 'analysis_url', 'duration_ms'])
        
        #put everything together, into single line dataframe
        result=pd.concat([track_info.add_prefix('track.'), audio_features.add_prefix('audio.'), album.add_prefix('album.'), artist.add_prefix('artist.')], axis=1)             
    else:    
        result = pd.DataFrame( columns=['track.id','track.name','track.popularity','audio.danceability','audio.energy','audio.key','audio.loudness','audio.mode','audio.speechiness','audio.acousticness','audio.instrumentalness','audio.liveness','audio.valence','audio.tempo','audio.time_signature','album.id','album.name' , 'artist.id', 'artist.name'])
    return result#.to_dict(orient='list')


#iterate through hot100 and pull track data from Spotify for each track by the selected artists
selected_tracks = hot100[hot100['Artists'].isin(hot100_artists.head(ARTIST_LIMIT))]
#collect one frame per track and concatenate once (DataFrame.append was removed in pandas 2.0)
sp_h100 = pd.concat([searchSp(tr) for tr in selected_tracks.itertuples()], ignore_index=True)
#let's save our hot track list as csv
sp_h100.to_csv('Spotify_data_for_hot_100_select.csv')
sp_h100.head(5)

In [None]:
#let's retrieve the hot track list from csv
sp_h100=pd.read_csv('/kaggle/working/Spotify_data_for_hot_100_select.csv').drop(columns=['Unnamed: 0'])
print("Hot 100 before de-duplication", len(sp_h100))
sp_h100=sp_h100.drop_duplicates(subset='track.id')
print("Hot 100 after de-duplication", len(sp_h100))
sp_h100.head()

Now we retrieve the top tracks (up to 10) for each of the artists we selected.

In [None]:
#this function retrieves the top tracks of a given artist, together with their audio features

def tracks_by_artist_Sp (a):
    #a is a row tuple: a[1] is the Spotify artist id, a[2] is the artist name
    raw_search=sp.artist_top_tracks(a[1])
    if len(raw_search['tracks'])>0:
        raw_tracks = pd.json_normalize(pd.json_normalize(raw_search).explode('tracks')['tracks'])
    
        #now let's get audio features
        a1 = sp.audio_features(raw_tracks['id'].tolist())
        audio_features = pd.json_normalize(a1).drop(columns=['type', 'uri', 'track_href', 'analysis_url', 'duration_ms']).add_prefix("audio.")
        result=raw_tracks[['id', 'name', 'popularity']].add_prefix('track.').merge(audio_features, left_on='track.id', right_on='audio.id').drop(columns='audio.id') 
        result=pd.concat([result, raw_tracks[['album.id', 'album.name']]],axis=1)
        result['artist.id']=a[1]
        result['artist.name']=a[2]
    else:
        result = pd.DataFrame( columns=['track.id','track.name','track.popularity','audio.danceability','audio.energy','audio.key','audio.loudness','audio.mode','audio.speechiness','audio.acousticness','audio.instrumentalness','audio.liveness','audio.valence','audio.tempo','audio.time_signature','album.id','album.name' , 'artist.id', 'artist.name'])
       
    return result

TRACK_LIMIT=len(sp_h100)
#iterate through the hot tracks and pull the top tracks of each track's artist from Spotify
artist_rows = sp_h100.head(TRACK_LIMIT)[['artist.id', 'artist.name']]
#collect one frame per row and concatenate once (DataFrame.append was removed in pandas 2.0)
sp_a20 = pd.concat([tracks_by_artist_Sp(tr) for tr in artist_rows.itertuples()], ignore_index=True)
print ("Tracks before deduplication", len(sp_a20))
sp_a20 = sp_a20.drop_duplicates(subset='track.id')
print ("Tracks after deduplication", len(sp_a20))
sp_a20.head(5)

Next, we remove the hot tracks from this list, so that we can compare hot vs. other tracks by artist.

In [None]:
sp_a20 = sp_a20[~sp_a20['track.id'].isin(sp_h100['track.id'])]
print ("Popular tracks after exclusion of hot", len(sp_a20))
sp_a20.to_csv('Spotify_data_for_popular_select.csv')

### This is a checkpoint where we can start if we don't want to load data from the API and want to proceed straight to the analysis

In [None]:
#let's retrieve the popular track list and the hot 100 list from csv
sp_h100=pd.read_csv('/kaggle/working/Spotify_data_for_hot_100_select.csv').drop(columns=['Unnamed: 0'])
sp_h100=sp_h100.drop_duplicates(subset='track.id')
sp_a20=pd.read_csv('/kaggle/working/Spotify_data_for_popular_select.csv').drop(columns=['Unnamed: 0'])

## Analysis

In [None]:
#let's summon the forces of matplot
import matplotlib.pyplot as plt

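#for visualization we are going to use several functions to decode audio analysis
def audio_key(key):
    all_keys=['C', 'C#', 'D','D#','E','F','F#','G','G#','A','A#','B']
    #Spotify reports -1 when no key was detected; without this check all_keys[-1] would return 'B'
    if key<0 or key>len(all_keys)-1: return 'Unknown'
    return all_keys[key]
def audio_meter(sig):
    #Spotify documents time_signature as 3 to 7, meaning 3/4 to 7/4
    if 3<=sig<=7: return str(sig)+'/4'
    else: return 'Unknown'

#the seaborn styles were renamed in matplotlib 3.6, so use whichever name is available
poster_style='seaborn-v0_8-poster' if 'seaborn-v0_8-poster' in plt.style.available else 'seaborn-poster'
plt.style.use(poster_style)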
hh = pd.DataFrame()
hh['Hits']=sp_h100['audio.key'].apply(audio_key).value_counts(normalize=True)
hh['Popular Songs']=sp_a20['audio.key'].apply(audio_key).value_counts(normalize=True)
hh.plot(kind='bar', title='Musical Key of Tracks in Hot 100 and Top 10 by Artist', legend=True)
#plt.scatter(final_dataset['audio.key'],final_dataset['audio.tempo'], alpha=0.5, c='Purple') 

### Interesting finding 1
While the most popular key for both hits and popular songs is C#, it turns out **hits use it more frequently**

### Interesting finding 2
The key of D# is very unpopular, while C# is the most popular of all keys. While I can see why D# could be unpopular (it is hard to play in that key on a keyboard), the popularity of C# is hard to explain (it is also one of the most challenging keys to play on a keyboard). I suspect that the music on Spotify is pitched up by a semitone to make the pick-up process harder. This might affect **interesting finding 1**.
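
To see what the suspected semitone shift would mean for the key distribution, the key numbers can be transposed before counting. This is a minimal sketch with made-up key values; `key_shares` is a hypothetical helper, and the shift itself remains an unverified guess, not something the data confirms.

```python
from collections import Counter

ALL_KEYS = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def key_shares(keys, shift=0):
    """Share of each key name after transposing by `shift` semitones; -1 (no key) is skipped."""
    names = [ALL_KEYS[(k + shift) % 12] for k in keys if k >= 0]
    counts = Counter(names)
    return {name: counts[name] / len(names) for name in ALL_KEYS if counts[name]}

sample_keys = [1, 1, 1, 3, 0, -1]  # made-up pitch-class values, not Spotify data

as_reported = key_shares(sample_keys)
shifted_down = key_shares(sample_keys, shift=-1)
print(as_reported)
print(shifted_down)
```

If the hypothesis held, the peak reported at C# would correspond to recordings originally in C, which is a far more common choice on a keyboard.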

In [None]:
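#let's summarize the other audio parameters (selected by name rather than by column position)
score_columns=['audio.danceability','audio.energy','audio.speechiness','audio.acousticness','audio.instrumentalness','audio.liveness','audio.valence']
sp_h100[score_columns].plot.box(fontsize=8, title='Hits')
sp_a20[score_columns].plot.box(fontsize=8, title='Popular Songs')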

### Interesting finding 3 (or the lack of one)
Based on the ratings assigned to tracks by the Spotify algorithm, there is no visible difference between hits and merely popular songs. The two box plots look like twins.

In [None]:
sp_a20['audio.tempo'].value_counts(bins=200, normalize=True)

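#reset the indexes so that the two columns are not aligned (and truncated) on unrelated row labels
tt = pd.DataFrame({'Hits': sp_h100['audio.tempo'].reset_index(drop=True), 'Popular Songs': sp_a20['audio.tempo'].reset_index(drop=True)})
tt.plot.hist(bins=200, alpha=0.5, density=True)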

### Interesting finding 4
The most popular tempo among popular songs is 140 BPM, followed by a tie between 120 and 130 BPM. Hits use 140 BPM even more often.

### Interesting finding 5
Popular tempos follow a 10 BPM grid. While tempos naturally tend to follow x2/x3 patterns, the use of software in music production seems to allow changes in steps of +/-10 BPM.
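
The grid impression from the histogram can be turned into a number: the share of tempos that fall close to a multiple of 10 BPM. The sketch below uses made-up tempos; `on_grid_share` and its 1 BPM tolerance are assumptions chosen for illustration.

```python
def on_grid_share(tempos, step=10, tolerance=1.0):
    """Fraction of tempos lying within `tolerance` BPM of a multiple of `step`."""
    def distance_to_grid(tempo):
        remainder = tempo % step
        return min(remainder, step - remainder)
    close = sum(1 for t in tempos if distance_to_grid(t) <= tolerance)
    return close / len(tempos)

sample_tempos = [119.9, 120.0, 130.4, 126.0, 140.1, 95.0]  # made-up values
share = on_grid_share(sample_tempos)
print(round(share, 3))
```

If tempos were spread evenly, about `2 * tolerance / step` of them (20% with these defaults) would land this close to the grid, so a clearly higher share on the real data, for example from `sp_a20['audio.tempo'].dropna().tolist()`, would support the finding.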

## Conclusions
It would be interesting to apply statistical methods to the comparison of hits vs. popular songs, as this research offered only a visual comparison. It would also be interesting to perform this analysis on a per-artist basis to determine whether there are key/tempo preferences and how they change between hits and other songs.
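
As a starting point for such a statistical comparison, here is a minimal sketch of a two-sample permutation test on the difference in means, written with the standard library only. It is one possible method, not the one used in this notebook; `permutation_test` is a hypothetical helper and the sample values are made up.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # reshuffle the labels and recompute the difference in means
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # add one to both counts so the estimate is never exactly zero
    return (extreme + 1) / (n_permutations + 1)

hits_energy = [0.71, 0.65, 0.80, 0.59, 0.74, 0.68]     # made-up values
popular_energy = [0.69, 0.62, 0.77, 0.61, 0.72, 0.70]  # made-up values
p_value = permutation_test(hits_energy, popular_energy)
print(p_value)
```

On the real data the inputs could be, for example, `sp_h100['audio.energy'].dropna().tolist()` and `sp_a20['audio.energy'].dropna().tolist()`. A large p-value would be consistent with the twin-like box plots above.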