This notebook walks through my process for experimenting with clustering using the DBSCAN algorithm.

In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

In [2]:
raw_data = pd.read_csv('dataset.csv')
print(raw_data.head())

   danceability  energy  key  loudness  speechiness  acousticness  \
0         0.680   0.729    7    -5.097       0.0418        0.3670   
1         0.584   0.607   11    -6.605       0.0356        0.4260   
2         0.703   0.643    7    -5.544       0.0706        0.1920   
3         0.746   0.450    7    -8.543       0.0872        0.0407   
4         0.561   0.597   11    -6.000       0.0405        0.2860   

   instrumentalness  liveness  valence    tempo  time_signature  \
0               0.0    0.1590    0.830  120.029               4   
1               0.0    0.1010    0.374  117.817               4   
2               0.0    0.1430    0.528  102.059               4   
3               0.0    0.1720    0.336   95.998               4   
4               0.0    0.0979    0.355   76.826               4   

                 song_uri                     name  mode_major  mode_minor  
0  3TMUdD9vE4DoqDYi7VXStt              Fool's Gold           1           0  
1  3TKpJrY9q49Mj1JOsM9zGL   

In [3]:
raw_data = raw_data.astype({'key': 'float64', 'time_signature':'float64'}, copy=True)
prep_data = raw_data.copy()
prep_data = prep_data.drop(columns=['song_uri', 'name'])

print(prep_data.shape)
print(prep_data.head())
algo_data = prep_data.to_numpy()

(2156, 13)
   danceability  energy   key  loudness  speechiness  acousticness  \
0         0.680   0.729   7.0    -5.097       0.0418        0.3670   
1         0.584   0.607  11.0    -6.605       0.0356        0.4260   
2         0.703   0.643   7.0    -5.544       0.0706        0.1920   
3         0.746   0.450   7.0    -8.543       0.0872        0.0407   
4         0.561   0.597  11.0    -6.000       0.0405        0.2860   

   instrumentalness  liveness  valence    tempo  time_signature  mode_major  \
0               0.0    0.1590    0.830  120.029             4.0           1   
1               0.0    0.1010    0.374  117.817             4.0           1   
2               0.0    0.1430    0.528  102.059             4.0           1   
3               0.0    0.1720    0.336   95.998             4.0           1   
4               0.0    0.0979    0.355   76.826             4.0           0   

   mode_minor  
0           0  
1           0  
2           0  
3           0  
4           1

DBSCAN has two parameters: `min_samples`, the minimum number of points in a neighborhood, and `eps` (epsilon), the maximum distance between two points in the same neighborhood.<br>
Since our clusters represent playlists, the smallest number of points we would want in a playlist might be 10, as that would give us a 30 minute playlist, assuming about 3 minutes per song.<br>
To find some realistic values for distances, let's do some quick analysis. We have 13 total features.
2 of the features have a minimum of 0 and a maximum of 1. If we assume that the other 11 features have a minimum of -8 and a maximum of 8, then the vector representing the furthest possible distance would have 11 components of 16 and 2 components of 1. Rather than rely on those assumed ranges, the cell below measures each feature's actual range in the data and takes the length of the resulting vector.

In [4]:
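# Per-feature minimum and maximum (one value per column).
# DataFrame.min()/max() reduce column-wise; np.min(df) returns a single scalar on pandas 2.x.
min_feature_values = prep_data.min()
max_feature_values = prep_data.max()
# The length of the vector of per-feature ranges is an upper bound on any pairwise distance
distance = max_feature_values - min_feature_values
print('\nMax possible distance: {}'.format(np.linalg.norm(distance.values)))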


Max possible distance: 162.51068223535762
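As a rough cross-check, the hand-worked bound from the assumed ranges (11 features spanning 16 and 2 features spanning 1) can be computed the same way. This is only a sketch of that arithmetic: the ranges are the assumption stated in the text, not values measured from the data, and `assumed_ranges` is a name made up for this example.

```python
import numpy as np

# Assumed per-feature ranges from the text: 11 features spanning 16 (-8 to 8)
# and 2 features spanning 1 (0 to 1). These are NOT measured from the dataset.
assumed_ranges = np.array([16.0] * 11 + [1.0] * 2)

# Euclidean length of the range vector: sqrt(11 * 16**2 + 2 * 1**2) = sqrt(2818)
assumed_max_distance = np.linalg.norm(assumed_ranges)
print('Assumed max distance: {:.2f}'.format(assumed_max_distance))
```

This comes out to about 53, well below the measured value of about 162, which suggests that at least some features span a much wider range than the assumed -8 to 8.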


So, the absolute maximum distance between two points in our transformed feature space is about 162.<br>
As epsilon rises, there will be less noise, as more points will be reachable by others.<br>
It is important to note that as min_samples rises, fewer points qualify as core points, so there will generally be fewer clusters and more noise.
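To get a feel for how the two parameters interact, here is a minimal sketch that sweeps each one while holding the other fixed. It runs on synthetic blobs from `make_blobs` rather than the song data, and the `summarize` helper and the specific parameter values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the song features (an assumption, not the notebook's data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

def summarize(eps, min_samples):
    """Return (noise fraction, number of clusters) for one DBSCAN run."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    noise_fraction = float(np.mean(labels == -1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return noise_fraction, n_clusters

eps_values = (0.3, 0.6, 1.0, 2.0)
min_samples_values = (3, 5, 10, 20)

# Hold min_samples fixed and raise eps
eps_sweep = [summarize(eps, 5) for eps in eps_values]
# Hold eps fixed and raise min_samples
min_samples_sweep = [summarize(0.6, m) for m in min_samples_values]

for eps, (noise, k) in zip(eps_values, eps_sweep):
    print('eps={}: noise={:.2f}, clusters={}'.format(eps, noise, k))
for m, (noise, k) in zip(min_samples_values, min_samples_sweep):
    print('min_samples={}: noise={:.2f}, clusters={}'.format(m, noise, k))
```

The noise fraction can only fall as eps rises and can only rise as min_samples rises. The cluster count is less predictable, since clusters can merge or split along the way, so it is worth printing both numbers when tuning.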

In [5]:
# Cluster the songs: eps is the neighborhood radius, min_samples the core-point threshold
algo = DBSCAN(eps=2, min_samples=5).fit(algo_data)
labels = algo.labels_
clusters = set(labels)

# DBSCAN labels noise points as -1
total_songs = len(labels)
num_noise = np.sum(labels == -1)
print('Percent Noise: {:.2f}'.format(100*num_noise/total_songs))
n_clusters_ = len(clusters) - (1 if -1 in clusters else 0)
print('Number of clusters: {}'.format(n_clusters_))

# Keep only clusters with more than 10 songs as playlist candidates
playlist_clusters = []
for ndx in clusters:
    if (ndx == -1):
        print('\nNoise Cluster')
    else:
        num_points = np.sum(labels == ndx)
        if (num_points > 10):
            # Cluster labels start at 0, so display them 1-based
            print('\nCluster {}'.format(ndx + 1))
            print('Number of Points: {}'.format(num_points))
            cluster = raw_data.iloc[(labels == ndx), :]
            print('Cluster Songs:\n{}'.format(cluster.name))
            playlist_clusters.append(ndx)

Percent Noise: 30.94
Number of clusters: 44

Cluster 1
Number of Points: 83
Cluster Songs:
0                                       Fool's Gold
32                  I Wanna Know (feat. Bea Miller)
67                                     Cold Showers
122                                      heavy snow
134                                 Ungrateful Eyes
                           ...                     
2144                       Say My Name (feat. Zyra)
2147                         It’s Only (feat. Zyra)
2148    Memories That You Call (feat. Monsoonsiren)
2151                                        Sundara
2155               Sun Models (feat. Madelyn Grant)
Name: name, Length: 83, dtype: object

Cluster 2
Number of Points: 351
Cluster Songs:
2                             For You
3                             Mean It
19                            Houdini
26                  You Are Losing Me
31                           Memories
                    ...              
2136                   

In [6]:
# Copy after selecting columns so the assignment below does not trigger SettingWithCopyWarning
output_data = raw_data[['song_uri', 'name']].copy()
output_data['cluster'] = labels
output_data = output_data[output_data.cluster.isin(playlist_clusters)]
print(output_data.shape)

(1295, 3)
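The size filter and the `isin` selection can also be written with pandas alone, which avoids the explicit loop. This is a sketch on toy data: the labels, the song names, and the `min_size` threshold are all made up for the example (the notebook itself keeps clusters with more than 10 songs).

```python
import numpy as np
import pandas as pd

# Toy labels standing in for algo.labels_ (hypothetical values; -1 marks noise)
labels = np.array([0, 0, 0, 1, 1, -1, 2, 2, 2, 2, -1, 1])
songs = pd.DataFrame({
    'name': ['song_{}'.format(i) for i in range(len(labels))],
    'cluster': labels,
})

min_size = 3  # hypothetical threshold for this toy example

# Count songs per cluster, ignoring noise
sizes = songs.loc[songs.cluster != -1, 'cluster'].value_counts()
# Keep clusters with more than min_size songs
keep = sizes[sizes > min_size].index
playlists = songs[songs.cluster.isin(keep)]
print(playlists)
```

Here clusters 0 and 1 have 3 songs each and are dropped, so only the 4 songs in cluster 2 remain.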


In [7]:
print(output_data.head())

                 song_uri                name  cluster
0  3TMUdD9vE4DoqDYi7VXStt         Fool's Gold        0
2  0CJvDUBeEL1Rmpx7MH28CQ             For You        1
3  3GRSqlALWISqLeNncZMbpX             Mean It        1
6  649o53ULWYN1y7V2OI5kgo  Heat of the Summer        3
8  2a3dopgTF1q4rMVDJ1rwBU        Push My Luck        4


In [65]:
output_data.to_csv('clustered_data.csv')