## Pre-processing Code & Documentation

**Note:** This file documents only the pre-processing; documentation for the Cypher queries is included in "Cypher Queries + Documentation.txt".

First, import the Spotify music records dataset, and take a look at the different attributes and data types. This will help in deciding what features to use and how to normalize the data. For similarity calculation, it will be best to choose numerical/quantitative features.

In [1]:
import pandas as pd

# Import the dataset.
df = pd.read_csv('spotify.csv')
df.dtypes

Unnamed: 0            int64
track_id             object
artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
time_signature        int64
track_genre          object
dtype: object
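
As a quick way to shortlist candidate features from a listing like the one above, the numeric columns can be pulled out programmatically. The sketch below runs on a small made-up frame (`toy`) rather than the real dataset; its column names mirror a few of those above, and the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "track_name": ["a", "b"],
    "explicit": [False, True],
    "danceability": [0.5, 0.7],
    "key": [4, 0],
    "track_genre": ["acoustic", "rock"],
})

# Keep only the integer and float columns; bool and object columns are excluded.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Not every numeric column is necessarily a good feature (an identifier column would also be numeric), so the result is a starting point to review by hand.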

To build the sample dataset, the songs are split into groups by genre. For each of the 114 genres, 10 songs are randomly sampled and added to the collective sample set. Any song by "The Strokes" or from the album "Is This It" is also included in the sample set. A primary goal in the design of the pre-processing code was to make it easy to modify and extend, so most operations are wrapped in functions. For example, the sample_songs_by_genre() function accepts any sample size.

After sampling songs from each genre, additional processing is performed, including dropping rows with null values and resetting the index. An 'id' column is also added to serve as the node id, which makes the transfer to Neo4J easier. The resulting sample set of ~1200 records is shown below.

In [2]:
# Determine the sample set used to build the model. Randomly select 10 songs from each of the 114 genres for a
# sample of 1140 songs, plus the always-included songs by The Strokes or from the album "Is This It".
# Keeping an equal distribution of genre share to make the model generalizable (essentially a smaller version
# of the full dataset, retaining its integrity).

# For a given genre, use random sampling to generate a sample of the desired size.
def sample_songs_by_genre(genre, sample_size):
  genre_songs = df[df['track_genre'] == genre].reset_index(drop=True)
  # Always include songs by The Strokes or from the album "Is This It".
  always_included = genre_songs[
    (genre_songs["artists"] == "The Strokes") | (genre_songs["album_name"] == "Is This It")
  ]
  # Note: replace=True samples with replacement, so the same song can be drawn more than once.
  random_sample = genre_songs.sample(n=sample_size, replace=True)
  return pd.concat([always_included, random_sample], ignore_index=True)

# Utility to get all the unique genres present in the dataset.
def get_unique_genres():
  return df['track_genre'].unique()

# Create a sample for the whole dataset considering each genre.
def create_sample_set():
  sample_set = pd.DataFrame()
  # For each distinct genre, generate a sampling of 10 songs and add it to the collective sample set.
  # Note: Initially this was 100 songs per genre (a total of 11-12k songs), but loading even the
  # pre-processed edges (a reduction of 84%) took too long. With 10 per genre, about 1200 songs are sampled.
  for genre in get_unique_genres():
    sample_set = pd.concat([sample_set, sample_songs_by_genre(genre, 10)], ignore_index=True)
  return sample_set

# Generate the sample set of songs to work from, these will serve as the nodes in the graph model.
sample_set = create_sample_set()
sample_set.drop(columns=["Unnamed: 0"], inplace=True)
sample_set = sample_set.dropna()
sample_set.reset_index(drop=True, inplace=True)
sample_set['id'] = sample_set.index
sample_set

index,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,id
0,7AZAGSn2B1APKe2XBbzB2G,Eddie Vedder,Into The Wild (Music For The Motion Picture),Far Behind,50,135320,False,0.467,0.855,4,...,1,0.0515,0.17400,0.000009,0.1100,0.5650,135.813,4,acoustic,0
1,1HfxPaJggVwFsvOtHbVzMz,Kris Allen,Kris Allen,Live Like We're Dying,55,212506,False,0.589,0.893,0,...,1,0.0397,0.02730,0.000000,0.3430,0.9400,92.011,4,acoustic,1
2,0JN6DXOZ08IwQhEBZJ3MFd,Filip Nordin,Covers - Chill Covers Of Popular Songs,Stitches,26,140547,False,0.686,0.347,0,...,1,0.0341,0.72500,0.000000,0.4110,0.5390,145.837,4,acoustic,2
3,0v1yN5C75um5Wx2WPtFl6k,Sara Farell,Sunflower,Sunflower,51,181897,False,0.568,0.213,2,...,1,0.0379,0.95500,0.000000,0.1130,0.3020,72.932,4,acoustic,3
4,6IF2P93LkyW4GqDQu1yS7H,Susie Suh;Robot Koch,Here with Me,Here with Me,55,238971,False,0.582,0.375,1,...,0,0.0329,0.70200,0.063100,0.1360,0.1060,129.909,4,acoustic,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1187,1gdKqUMnmKcTae3AXsBW4l,New Life Worship;Integrity's Hosanna! Music;Ro...,I Am Free,I Am Free,37,329106,False,0.516,0.717,2,...,1,0.0329,0.00194,0.003060,0.9040,0.1760,135.900,4,world-music,1187
1188,5uXMiTPXw21xFvyeyqxyIw,Hillsong Worship;Benjamin William Hastings,That's The Power (Live),That's The Power - Live,43,274533,False,0.454,0.635,10,...,1,0.0331,0.01030,0.000000,0.2330,0.0931,148.169,4,world-music,1188
1189,73THBZXMt4CZD0RCaUDXid,Lucas Cervetti,Frecuencias Álmicas en 432hz,"Frecuencia Álmica, Pt. 3",28,314000,False,0.410,0.107,7,...,1,0.0328,0.97400,0.774000,0.0780,0.0712,64.018,4,world-music,1189
1190,2FKd5vdRSk313W6B9s69tB,Hillsong Worship;Benjamin William Hastings,Awake,Come Alive - Studio,46,272426,False,0.335,0.571,0,...,1,0.0345,0.00492,0.000000,0.0656,0.1690,101.809,4,world-music,1190
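
The per-genre sampling above can also be expressed with pandas' grouped sampling. This is a minimal sketch on a made-up frame, not the code used to produce the sample set: `toy`, `per_genre`, and the fixed `random_state` are assumptions for illustration, and unlike the cell above it samples without replacement and does not add the always-included songs.

```python
import pandas as pd

# Hypothetical frame with three genres of four songs each.
toy = pd.DataFrame({
    "track_name": [f"song_{i}" for i in range(12)],
    "track_genre": ["acoustic"] * 4 + ["rock"] * 4 + ["jazz"] * 4,
})

per_genre = 2  # illustrative sample size per genre

# Draw the same number of rows from every genre, then rebuild a clean index.
sample = (
    toy.groupby("track_genre")
       .sample(n=per_genre, random_state=0)
       .reset_index(drop=True)
)

# Mirror the notebook's convention of using the index as the node id.
sample["id"] = sample.index
print(sample)
```

Fixing `random_state` makes the sample reproducible between runs, which the notebook's cell does not do; that is why its printed benchmark can differ from run to run.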


In the graph model, each song from the sample set is represented as a node whose properties are the attributes listed above. Edges represent a similarity relationship. For example, an edge from song A to song B shows that song A has a certain similarity to song B.

For any pair of nodes, the similarity score is determined by calculating the cosine similarity. For two data points A and B, this is the cosine of the angle between their vector representations $\mathbf{a}$ and $\mathbf{b}$:

$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$
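
The formula can be checked by hand on tiny vectors. The sketch below is a plain-Python illustration with made-up two-dimensional vectors; the helper name `cosine` is hypothetical, and the notebook itself relies on scikit-learn rather than this function.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [1.0, 1.0]

# The vectors are 45 degrees apart, so the result is about 0.7071.
print(cosine(a, b))

# Scaling a vector does not change its direction, so the similarity is 1.0.
print(cosine(a, [2.0, 0.0]))
```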

For each song, 10 quantitative attributes are considered as features in the vector representation: danceability, energy, key, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo.

From the sample set dataframe, the columns for these features are extracted. Then, for each pair of nodes, the cosine similarity is computed with cosine_similarity from sklearn.metrics.pairwise, following the process described above. Next, a benchmark for filtering out edges is generated from a given percentile of all the similarity scores. I first experimented with the 75th percentile, but ended up choosing the 90th percentile to reduce the number of edges and make the graph representation more efficient.

As such, an edge A -> B is only added if the cosine similarity is at or above this benchmark and is not equal to 1 (which would indicate a self-pairing). In addition, each pair of songs is considered only once, so repeat edges such as both A -> B and B -> A are not generated.

The edges included were then placed into a dataframe with the properties "song_1", "song_2", and "similarity" (score) which will act as a weight for the edge.
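
The filtering steps described above can be sketched in vectorized form on a toy matrix. This is an illustration under stated assumptions, not the notebook's implementation: the 4x4 `scores` matrix is made up, the 50th percentile is used only so that the toy example keeps a few edges, and the percentile is taken over the upper triangle (each pair once) rather than over every score below 1.

```python
import numpy as np
import pandas as pd

# Made-up symmetric similarity matrix for four songs (diagonal = self-pairings).
scores = np.array([
    [1.00, 0.90, 0.20, 0.95],
    [0.90, 1.00, 0.30, 0.40],
    [0.20, 0.30, 1.00, 0.85],
    [0.95, 0.40, 0.85, 1.00],
])

# Strict upper triangle: every unordered pair (i, j) with i < j exactly once,
# which skips both self-pairings and repeat edges.
rows, cols = np.triu_indices(len(scores), k=1)
pair_scores = scores[rows, cols]

# Illustrative cutoff; the notebook uses the 90th percentile instead.
benchmark = np.percentile(pair_scores, 50)
keep = pair_scores >= benchmark

toy_edges = pd.DataFrame({
    "song_1": rows[keep],
    "song_2": cols[keep],
    "similarity": pair_scores[keep],
})
print(benchmark)
print(toy_edges)
```

Working on index arrays avoids the nested Python loop, which may matter as the number of songs grows, since the number of pairs grows quadratically.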


In [3]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

features = ["danceability", "energy", "key", "loudness", "speechiness", "acousticness",
  "instrumentalness", "liveness", "valence", "tempo"]

# Obtains just the columns of these features.
sample_filtered_by_features = sample_set.loc[:, features]

# Evaluate which edges should be added by computing a cosine similarity metric for each pair of nodes, only
# adding the ones which are greater than the 90th percentile of calculated scores (ex. 0.9997493336635214).
def determine_edges():
    edges = []

    formatted_song_metrics = sample_filtered_by_features.values.tolist()
    scores = cosine_similarity(formatted_song_metrics)

    # Get the 90th percentile of all similarity scores. (excluding 1's - which indicate a self pairing/node-to-itself).
    benchmark = np.percentile(scores[scores < 1], 90)
    print(benchmark)

    # Iterate through the scores to find valid edges.
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # Add an edge if the similarity score is above the 90th percentile.
            if scores[i, j] >= benchmark and scores[i, j] != 1:
                edges.append({'song_1': i, 'song_2': j, 'similarity': scores[i][j]})

    return pd.DataFrame(edges)


edges = determine_edges()
edges

0.9997428063101169


index,song_1,song_2,similarity
0,0,10,0.999887
1,0,12,0.999989
2,0,15,0.999747
3,0,17,0.999934
4,0,19,0.999838
...,...,...,...
70789,1183,1188,0.999965
70790,1185,1188,0.999807
70791,1186,1187,0.999866
70792,1186,1191,0.999748


Finally, export the sampled songs (nodes) and the edges, along with their properties, as CSV files to import into Neo4J.

In [4]:
# Download the nodes and edges as a csv to import into Neo4J.
from google.colab import files

edges.to_csv('edges.csv', index=False)
files.download('edges.csv')

sample_set.to_csv("nodes.csv")
files.download('nodes.csv')
