## Extracting Features From the Original Dataset

This notebook extracts features from the original dataset, which on its own gives us only limited information about each song. Here, we use the "ari.py" script to extract a set of audio features for each song, together with its genres and the popularity of both the track and its artist.

In [1]:
#Import from the other file
from scripts.ari import ari_to_features
import pandas as pd
from tqdm import tqdm
import re

In [2]:
#Load the raw_data from the repo
dataPath = '../data/raw_data.csv'
df = pd.read_csv(dataPath)
df.head()


Unnamed: 0.1,Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name,name
0,0,0,Missy Elliott,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks
1,1,1,Britney Spears,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Toxic,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800,In The Zone,Throwbacks
2,2,2,Beyoncé,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Crazy In Love,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933,Dangerously In Love (Alben für die Ewigkeit),Throwbacks
3,3,3,Justin Timberlake,spotify:track:1AWQoqb9bSvzTjaLralEkT,spotify:artist:31TPClRtHm23RisEBtV3X7,Rock Your Body,spotify:album:6QPkyl04rXwTGlGlcYaRoW,267266,Justified,Throwbacks
4,4,4,Shaggy,spotify:track:1lzr43nnXAijIGYnCT8M8H,spotify:artist:5EvFsr3kj42KNv97ZEnqij,It Wasn't Me,spotify:album:6NmFmPX56pcLBOFMhIiKvF,227600,Hot Shot,Throwbacks


In [3]:
#Edit the track-uris to a more usable format
df["track_uri"] = df["track_uri"].apply(lambda x: re.findall(r'\w+$', x)[0])
df["track_uri"]

0        0UaMYEvWZi0ZqiDOoHU3YI
1        6I9VzXrHxO9rA9A5euc8Ak
2        0WqIKmW4BTrj3eJFmnCKMv
3        1AWQoqb9bSvzTjaLralEkT
4        1lzr43nnXAijIGYnCT8M8H
                  ...          
67498    5uCax9HTNlzGybIStD3vDh
67499    0P1oO2gREMYUCoOkzYAyFu
67500    2oM4BuruDnEvk59IvIXCwn
67501    4Ri5TTUgjM96tbQZd5Ua7V
67502    5RVuBrXVLptAEbGJdSDzL5
Name: track_uri, Length: 67503, dtype: object
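
As a small illustration of the step above, the sketch below strips the `spotify:track:` prefix from a hypothetical two-row sample (the IDs are copied from the output above). It uses `str.rsplit` instead of the regular expression, which is an alternative we assume is equivalent for URIs of this shape, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical sample in the same "spotify:track:<id>" format as raw_data.csv
sample = pd.Series([
    "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
    "spotify:track:6I9VzXrHxO9rA9A5euc8Ak",
])

# Keep everything after the last colon (vectorised, no regex needed)
track_ids = sample.str.rsplit(":", n=1).str[-1]
print(track_ids.tolist())
```

Either approach leaves the bare track ID, which is the form passed to `ari_to_features` later on.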

In [4]:
#Keep an independent copy of the original data for the final merge
testDF = df.copy()
#feature = ari_to_features(df["track_uri"])
#feature_df_test = pd.DataFrame(feature)
#feature_df_test.head()

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name,name
0,0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks
1,1,1,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Toxic,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800,In The Zone,Throwbacks
2,2,2,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Crazy In Love,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933,Dangerously In Love (Alben für die Ewigkeit),Throwbacks
3,3,3,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,spotify:artist:31TPClRtHm23RisEBtV3X7,Rock Your Body,spotify:album:6QPkyl04rXwTGlGlcYaRoW,267266,Justified,Throwbacks
4,4,4,Shaggy,1lzr43nnXAijIGYnCT8M8H,spotify:artist:5EvFsr3kj42KNv97ZEnqij,It Wasn't Me,spotify:album:6NmFmPX56pcLBOFMhIiKvF,227600,Hot Shot,Throwbacks


### Included Features
The code cell below shows the features extracted for a single track. This is the kind of information used later on to cluster the data.

In [6]:
#Test the feature extraction script, and display features
ari_to_features(df["track_uri"][0])
#print(df["track_uri"][0])

{'danceability': 0.904,
 'energy': 0.813,
 'key': 4,
 'loudness': -7.105,
 'mode': 0,
 'speechiness': 0.121,
 'acousticness': 0.0311,
 'instrumentalness': 0.00697,
 'liveness': 0.0471,
 'valence': 0.81,
 'tempo': 125.461,
 'type': 'audio_features',
 'id': '0UaMYEvWZi0ZqiDOoHU3YI',
 'uri': 'spotify:track:0UaMYEvWZi0ZqiDOoHU3YI',
 'track_href': 'https://api.spotify.com/v1/tracks/0UaMYEvWZi0ZqiDOoHU3YI',
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/0UaMYEvWZi0ZqiDOoHU3YI',
 'duration_ms': 226864,
 'time_signature': 4,
 'artist_pop': 71,
 'genres': 'dance_pop hip_hop hip_pop pop pop_rap r&b rap urban_contemporary virginia_hip_hop',
 'track_pop': 66}
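
To show how a dictionary like this might feed into clustering, here is a minimal sketch. It assumes only the numeric fields are used directly and that `genres` is split on spaces; neither choice is confirmed by this notebook. The values are an abbreviated copy of the output above.

```python
import pandas as pd

# Abbreviated copy of the dictionary shown above
features = {
    "danceability": 0.904,
    "energy": 0.813,
    "tempo": 125.461,
    "type": "audio_features",
    "id": "0UaMYEvWZi0ZqiDOoHU3YI",
    "artist_pop": 71,
    "genres": "dance_pop hip_hop hip_pop pop pop_rap r&b rap urban_contemporary virginia_hip_hop",
    "track_pop": 66,
}

# One dictionary becomes one row of a DataFrame
row = pd.DataFrame([features])

# Numeric columns are the natural candidates for distance-based clustering
numeric_cols = row.select_dtypes(include="number").columns.tolist()

# Genres arrive as one space-separated string with underscores inside each genre
genre_list = row.loc[0, "genres"].split()

print(numeric_cols)
print(genre_list)
```

Identifier-like fields such as `id`, `uri` and `type` carry no musical information, so they would normally be dropped before clustering.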

## Extraction
Below, we extract features for each unique track using the Spotify API and the track's URI. Because this process has an extremely long runtime, it is done in three batches. The results are then collected into a single DataFrame.

In [None]:
#Split the unique track URIs into three batches
unique_uris = df["track_uri"].unique()
first_batch = unique_uris[:10000]
second_batch = unique_uris[10000:20000]
third_batch = unique_uris[20000:]
dataLIST = [first_batch, second_batch, third_batch]

In [None]:
featureLIST = []

for i in tqdm(dataLIST[0]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:
        #Skip tracks whose features could not be retrieved
        continue

In [None]:
for i in tqdm(dataLIST[1]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:
        #Skip tracks whose features could not be retrieved
        continue

In [None]:
for i in tqdm(dataLIST[2]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:
        #Skip tracks whose features could not be retrieved
        continue
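
The loops above skip failed tracks silently. As a hedged sketch of one alternative, the helper below processes IDs in batches, records which IDs failed and why, and calls an optional checkpoint hook after each batch. `fetch_features`, `extract_in_batches` and `checkpoint` are hypothetical names introduced for illustration; `fetch_features` is a stand-in for `ari_to_features` so that the example runs without API access.

```python
def fetch_features(track_id):
    """Hypothetical stand-in for ari_to_features; one ID fails to mimic an API error."""
    if track_id == "bad_id":
        raise ValueError("simulated API failure")
    return {"id": track_id, "energy": 0.5}


def extract_in_batches(track_ids, batch_size, fetch=fetch_features, checkpoint=None):
    """Fetch features batch by batch, keeping a record of failures."""
    results, failed = [], []
    for start in range(0, len(track_ids), batch_size):
        for track_id in track_ids[start:start + batch_size]:
            try:
                results.append(fetch(track_id))
            except Exception as exc:
                # Remember the failing ID and the error instead of dropping it silently
                failed.append((track_id, repr(exc)))
        if checkpoint is not None:
            # Called once per finished batch, e.g. to save partial results to disk
            checkpoint(results)
    return results, failed


progress = []
results, failed = extract_in_batches(
    ["a", "bad_id", "b", "c"],
    batch_size=2,
    checkpoint=lambda partial: progress.append(len(partial)),
)
print(len(results), "extracted,", len(failed), "failed, progress:", progress)
```

Keeping the list of failed IDs makes it possible to retry only those tracks later, rather than rerunning a whole batch.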

In [None]:
#Preview the DataFrame
featureDF = pd.DataFrame(featureLIST)
featureDF

## Finalising and Export
Finally, we merge the feature DataFrame with the original dataset, which also contains useful information such as the artist name and track name. The result is then exported as our processed data.

In [None]:
new_df = pd.merge(testDF, featureDF, left_on="track_uri", right_on="id")
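
Two properties of this merge are worth keeping in mind, shown below on hypothetical toy data. `pd.merge` performs an inner join by default, so tracks without extracted features are dropped. Because both frames have a `duration_ms` column, pandas also renames the pair to `duration_ms_x` and `duration_ms_y`.

```python
import pandas as pd

# Toy stand-in for the original dataset (hypothetical values)
tracks = pd.DataFrame({
    "track_uri": ["id_a", "id_b", "id_c"],
    "track_name": ["Song A", "Song B", "Song C"],
    "duration_ms": [200000, 210000, 220000],
})

# Toy stand-in for the extracted features; "id_c" is missing, as if extraction failed
features = pd.DataFrame({
    "id": ["id_a", "id_b"],
    "energy": [0.8, 0.4],
    "duration_ms": [200001, 210000],
})

# Default how="inner": only rows with a match on both sides survive
merged = pd.merge(tracks, features, left_on="track_uri", right_on="id")

print(merged.columns.tolist())
print(len(merged), "of", len(tracks), "tracks kept")
```

If rows without features should be kept, passing `how="left"` retains them with missing values in the feature columns.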

In [None]:
#index=False avoids writing the row index as an extra "Unnamed: 0" column
new_df.to_csv('../data/processed_data.csv', index=False)