# Musical Attribute Analysis - Data Collection & Preprocessing

I have always been passionate about music and have been an avid listener from a young age. This analysis aims to answer the following question: **"What makes up my musical taste?"**

To answer this question, I analyze the audio features of the songs I listen to using the Spotify API. The data collection, preprocessing, analysis, and visualization steps are explained in detail below.

This [notebook](https://github.com/sreegp/Musical-Attribute-Analysis/blob/master/Musical%20Attribute%20Analysis%20-%20Data%20Collection%20%26%20Preprocessing.ipynb) focuses on Data Collection & Preprocessing, and the second [notebook](https://github.com/sreegp/Musical-Attribute-Analysis/blob/master/Musical%20Attribute%20Analysis%20-%20Data%20Analysis%20%26%20Visualization.ipynb) focuses on Data Analysis & Visualization.

In [16]:
import pandas as pd
import numpy as np
from pprint import pprint
import json
import math
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Spotify API credentials (left blank here; fill in your own client ID and secret)
cid = ""
secret = ""
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
sp.trace = False


## Data Collection: Input Data From iTunes

Since iTunes is my primary music library, I **exported my top ~4000 iTunes songs**, ranked by play count, into a CSV file (itunes_music.csv).

Attributes imported per song:
- **Name**
- **Artist**
- **Album**
- **Genre**
- **Played (i.e. play count)**

In [2]:
itunes_data = pd.read_csv('itunes_music.csv')

### Sample iTunes Data 

In [3]:
itunes_data.head() # Example of top 5 songs

| Unnamed: 0 | name | artist | album | genre | played |
|---|---|---|---|---|---|
| 0 | Naan Yen | A. R. Rahman & Rayhanah | Coke Studio India Season 3: Episode 1 | Indian Pop | 339 |
| 1 | Vishnu Sahasranamam | Devotional Songs | Vishnu Sahasranamam | Soundtrack | 210 |
| 2 | Rehna Tu | A.R. Rahman | MTV Unplugged | Video | 204 |
| 3 | Maahi Ve | A.R.Rahman | Highway | StarMusiQ.Com | 190 |
| 4 | Daydreamer | Adele | 19 | Pop | 183 |


## Data Preprocessing: Mapping Songs between iTunes Dataset and Spotify

To use audio features from the Spotify API, I had to map each song in my iTunes library to its unique Spotify track URI, stored here as 'name_uri'. The track URI is how songs are referenced throughout the Spotify API.

To make the search accurate, I built a query for every song by concatenating the song name and the album name. Using this query with the Spotify Search API, I obtained a unique Spotify identifier (name_uri) for every matched song, which I then use for further analysis. A song counts as found only if one of the first 10 search results has exactly the same title as the iTunes entry.

Out of the **~4000 iTunes songs**, I was only able to find **821** songs on Spotify. This was due to the difference in coverage of artists between iTunes and Spotify. The results were then written to the result.json file.



In [4]:
result = []

for index, row in itunes_data.iterrows():
    # Skip rows with a missing song name or album
    if pd.isnull(row['name']) or pd.isnull(row['album']):
        continue
    q = row['name'] + ' ' + row['album']  # query: song name + album name
    play_count = row['played']  # number of times I played the song
    search_response = sp.search(q=q, limit=10)
    # Keep the first search result whose title exactly matches the iTunes name
    for item in search_response['tracks']['items']:
        if item['name'] == row['name']:
            result.append({"name": row['name'],
                           "artist": item['artists'][0]['name'],
                           "artist_uri": item['artists'][0]['uri'],
                           "album": item['album']['name'],
                           "album_uri": item['album']['uri'],
                           "popularity": item['popularity'],
                           "name_uri": item['uri'],
                           "play_count": play_count
                          })
            break

with open('result.json', 'w') as outfile:
    json.dump(result, outfile)
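
The loop above accepts a search result only when its title is character-for-character identical to the iTunes name. As a possible refinement (not part of the original notebook), the comparison could ignore case, extra whitespace, and Unicode compatibility variants. The sketch below is only an illustration; the helper names `normalize_title` and `titles_match` are hypothetical.

```python
import unicodedata


def normalize_title(title):
    """Return a case-folded, whitespace-collapsed form of a song title."""
    # NFKC folds Unicode compatibility characters (e.g. full-width letters)
    text = unicodedata.normalize("NFKC", title)
    # casefold() is a stronger lower() intended for caseless comparison
    text = text.casefold()
    # split() + join collapses runs of whitespace and trims both ends
    return " ".join(text.split())


def titles_match(itunes_name, spotify_name):
    """Compare two titles after normalization (hypothetical helper)."""
    return normalize_title(itunes_name) == normalize_title(spotify_name)


print(titles_match("Rehna Tu ", "rehna  tu"))      # True
print(titles_match("Daydreamer", "Day Dreamer"))   # False: the words differ
```

In the search loop, `titles_match(row['name'], item['name'])` could stand in for the `==` comparison. Looser matching may also admit wrong tracks, so the matched results would be worth spot-checking.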

### Sample Song, with Spotify URI

In [11]:
result[0] # Sample Song 

{'album': 'Coke Studio India Season 3: Episode 1',
 'album_uri': 'spotify:album:2CStgaiOhe1w4OXAoqP1gl',
 'artist': 'A.R. Rahman',
 'artist_uri': 'spotify:artist:1mYsTxnqsietFxj1OgoGbG',
 'name': 'Naan Yen',
 'name_uri': 'spotify:track:0PuaYHseEiuGj3syu49k6G',
 'play_count': 339,
 'popularity': 26}

### Ensuring Data Entries are Unique

In [260]:
with open('result.json') as infile:
    data = json.load(infile)

# Keep only the first entry for each 'name album' key; later duplicates are dropped.
# (Deleting from the list while looping over its indices would skip entries or raise IndexError.)
seen_before = set()
unique_data = []
for entry in data:
    key = entry['name'] + ' ' + entry['album']
    if key not in seen_before:
        seen_before.add(key)
        unique_data.append(entry)
data = unique_data
print("Num of unique data points=" + str(len(data)))

Num of unique data points=821


## Data Preprocessing: Retrieve Audio Features for each Song through Spotify API

For each song, I call the Spotify Audio Features API (https://developer.spotify.com/web-api/get-audio-features/) to obtain the features listed below. The result is the preprocessed dataset required for analysis, which is stored in final_data.json.

The following are details regarding the attributes/features in the dataset extracted from Spotify.

- **Acousticness:** A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

- **Danceability:** Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

- **Energy:** Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.

- **Instrumentalness:** Predicts whether a track contains no vocals. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

- **Key:** The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

- **Liveness:** Detects the presence of an audience in the recording. A value above 0.8 provides strong likelihood that the track is live.

- **Loudness:** The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Values typically range between -60 and 0 dB.

- **Mode:** Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

- **Speechiness:** Speechiness detects the presence of spoken words in a track.

- **Valence:** A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

- **Popularity:** Popularity of the song as defined by Spotify's algorithms, which take into account factors like play count over a timeframe.
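
To make the **Key** and **Mode** encodings concrete, here is a small sketch that turns the two integers into a readable label. The function name `describe_key` is hypothetical, and the handling of out-of-range values rests on my understanding that Spotify reports -1 when no key is detected.

```python
# Pitch classes 0..11 in standard Pitch Class notation (0 = C, 1 = C♯/D♭, ...)
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key, mode):
    """Return a label such as 'D minor' for a (key, mode) pair."""
    # Assumption: values outside 0..11 (e.g. -1) mean no key was detected
    if not 0 <= key <= 11:
        return "unknown"
    scale = "major" if mode == 1 else "minor"  # 1 = major, 0 = minor
    return f"{PITCH_CLASSES[key]} {scale}"


print(describe_key(1, 1))   # C♯/D♭ major
print(describe_key(2, 0))   # D minor
print(describe_key(-1, 1))  # unknown
```

A track with `key` 1 and `mode` 1 therefore reads as C♯/D♭ major.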

In [276]:
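data_with_features = []

for entry in data:
    track_id = entry['name_uri'].split(':')[2]  # 'spotify:track:<id>' -> '<id>'
    temp_features = sp.audio_features(track_id)[0]
    if temp_features is None:
        # No audio features were returned for this track; skip it
        continue
    temp = temp_features.copy()
    temp.update(entry)  # merge in name, artist, album, popularity and play_count
    data_with_features.append(temp)

with open('final_data.json', 'w') as outfile:
    json.dump(data_with_features, outfile, sort_keys=True, indent=4)

# data has been saved to 'final_data.json'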

### Sample Song with Audio Features

In [17]:
with open('final_data.json') as infile:
    final_data = json.load(infile)
pprint(final_data[0])

{'acousticness': 0.46,
 'album': 'Coke Studio India Season 3: Episode 1',
 'album_uri': 'spotify:album:2CStgaiOhe1w4OXAoqP1gl',
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/0PuaYHseEiuGj3syu49k6G',
 'artist': 'A.R. Rahman',
 'artist_uri': 'spotify:artist:1mYsTxnqsietFxj1OgoGbG',
 'danceability': 0.318,
 'duration_ms': 325787,
 'energy': 0.508,
 'id': '0PuaYHseEiuGj3syu49k6G',
 'instrumentalness': 0.000121,
 'key': 1,
 'liveness': 0.106,
 'loudness': -6.067,
 'mode': 1,
 'name': 'Naan Yen',
 'name_uri': 'spotify:track:0PuaYHseEiuGj3syu49k6G',
 'play_count': 339,
 'popularity': 26,
 'speechiness': 0.0319,
 'tempo': 157.897,
 'time_signature': 3,
 'track_href': 'https://api.spotify.com/v1/tracks/0PuaYHseEiuGj3syu49k6G',
 'type': 'audio_features',
 'uri': 'spotify:track:0PuaYHseEiuGj3syu49k6G',
 'valence': 0.301}
