# Spotify Data Project

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Initial EDA (Pre-Clustering Work)

The datasets we are using come from a Kaggle dataset that uses the Spotify API to query song data: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

In [3]:
df = pd.read_csv("archive/data.csv")
df_artists = pd.read_csv("archive/data_by_artist.csv")
df_genres = pd.read_csv("archive/data_by_genres.csv")
df_year = pd.read_csv("archive/data_by_year.csv")
df_w_genres = pd.read_csv("archive/data_w_genres.csv")

In [4]:
df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.995,['Carl Woitschach'],0.708,158648,0.195,0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,10,0.151,-12.428,1,Singende Bataillone 1. Teil,0,1928,0.0506,118.469,0.779,1928
1,0.994,"['Robert Schumann', 'Vladimir Horowitz']",0.379,282133,0.0135,0,6KuQTIu1KoTTkLXKrwlLPV,0.901,8,0.0763,-28.454,1,"Fantasiestücke, Op. 111: Più tosto lento",0,1928,0.0462,83.972,0.0767,1928
2,0.604,['Seweryn Goszczyński'],0.749,104300,0.22,0,6L63VW0PibdM1HDSBoqnoM,0.0,5,0.119,-19.924,0,Chapter 1.18 - Zamek kaniowski,0,1928,0.929,107.177,0.88,1928
3,0.995,['Francisco Canaro'],0.781,180760,0.13,0,6M94FkXd15sOAOQYRnWPN8,0.887,1,0.111,-14.734,0,Bebamos Juntos - Instrumental (Remasterizado),0,1928-09-25,0.0926,108.003,0.72,1928
4,0.99,"['Frédéric Chopin', 'Vladimir Horowitz']",0.21,687733,0.204,0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,11,0.098,-16.829,1,"Polonaise-Fantaisie in A-Flat Major, Op. 61",1,1928,0.0424,62.149,0.0693,1928


Most of the other datasets are aggregations of this one. The genre data is the only source of information not found in this dataset: it either aggregates the data at the genre level or lists the genres each artist is associated with.

In [5]:
df_artists.head()

Unnamed: 0,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
0,"""Cats"" 1981 Original London Cast",0.575083,0.44275,247260.0,0.386336,0.022717,0.287708,-14.205417,0.180675,115.9835,0.334433,38.0,5,1,12
1,"""Cats"" 1983 Broadway Cast",0.862538,0.441731,287280.0,0.406808,0.081158,0.315215,-10.69,0.176212,103.044154,0.268865,33.076923,5,1,26
2,"""Fiddler On The Roof” Motion Picture Chorus",0.856571,0.348286,328920.0,0.286571,0.024593,0.325786,-15.230714,0.118514,77.375857,0.354857,34.285714,0,1,7
3,"""Fiddler On The Roof” Motion Picture Orchestra",0.884926,0.425074,262890.962963,0.24577,0.073587,0.275481,-15.63937,0.1232,88.66763,0.37203,34.444444,0,1,27
4,"""Joseph And The Amazing Technicolor Dreamcoat""...",0.605444,0.437333,232428.111111,0.429333,0.037534,0.216111,-11.447222,0.086,120.329667,0.458667,42.555556,11,1,9


In [None]:
df_genres.head()

In [None]:
df_w_genres.head()

In [None]:
df_year.head()

In [None]:
df.groupby("year").mean().head()

The columns in this dataset mostly cover technical musical information. More detail can be found at this link: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/

This link contains a detailed description of the popularity variable: https://developer.spotify.com/documentation/web-api/reference/tracks/get-track/

#### EDA

Problem Statement Idea:

- Can we extract genre from the various musical features at the song level, using an unsupervised learning technique?
    - most likely learn genre via clustering, K-means or GMM?

In [None]:
df.isna().sum()

Null value check. `isna()` may miss other ways this data represents missing values, such as empty brackets, zero values, or certain text strings.
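To make that concern concrete, here is a minimal sketch of a check that counts placeholder strings alongside true NaN values. The helper name `count_placeholder_nulls`, the placeholder list, and the toy frame are all assumptions for illustration, not part of the Kaggle data. Zero values are deliberately not counted, since zero is a legitimate value for many of these features.

```python
import pandas as pd

def count_placeholder_nulls(frame, placeholders=("[]", "")):
    """Count, per column, values that are NaN or equal to a placeholder string."""
    counts = {}
    for col in frame.columns:
        series = frame[col]
        n_missing = int(series.isna().sum())
        if not pd.api.types.is_numeric_dtype(series):
            # Compare stripped string forms so " [] " also matches
            as_text = series.dropna().astype(str).str.strip()
            n_missing += int(as_text.isin(placeholders).sum())
        counts[col] = n_missing
    return pd.Series(counts)

# Tiny made-up frame, not rows from the real dataset
toy = pd.DataFrame({
    "artists": ["A", "B", "C"],
    "genres": ["['pop']", "[]", None],
    "tempo": [120.0, 0.0, 98.5],
})
placeholder_counts = count_placeholder_nulls(toy)
print(placeholder_counts)
```

On the real data this could be called as `count_placeholder_nulls(df_w_genres)` to see how many genre entries are effectively empty.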

First off, how does Spotify define genre? Let's take a look at how many genres appear in their aggregate dataset.

In [None]:
unique_genres = df_genres["genres"].unique()
print(len(unique_genres))
unique_genres[:20]

In [None]:
df_w_genres[df_w_genres["genres"] == "[]"].head()

In [None]:
df_w_genres[df_w_genres["genres"] == "[]"].loc[56]

In trying to query our favorite artists from the song data, we noticed an interesting issue with how the artists are represented.

There are 2,664 genres, which is a very large number, and several of them share an "acid" prefix. We will likely either cluster and assign our own intuitive genres to each cluster, or reduce this genre layer to something we can use for supervised learning.

We also see that an empty genre value is indicated by '[]', so null values in this dataset are not always represented as 'na'.

Through this, we determined that the non-numeric variables are stored as strings (even though some appear to be lists; this comes from later EDA). This means we will have to do some preprocessing if we want to use pandas functions to query through them.
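One way to do that preprocessing, sketched under the assumption that the strings are valid Python list literals, is `ast.literal_eval` from the standard library. Unlike plain quote stripping, it keeps apostrophes inside names intact. The helper name `parse_list_string` and the sample strings are made up for illustration (the second entry is a hypothetical example of a name containing an apostrophe).

```python
import ast
import pandas as pd

def parse_list_string(text):
    """Turn a string such as "['A', 'B']" into the Python list ['A', 'B']."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: keep the raw text as a one-item list
        return [text]
    return list(value) if isinstance(value, (list, tuple)) else [value]

toy_artists = pd.Series([
    "['Robert Schumann', 'Vladimir Horowitz']",
    "[\"Guns N' Roses\"]",   # hypothetical entry with an apostrophe in the name
    "[]",
])
parsed = toy_artists.apply(parse_list_string)
print(parsed.tolist())
```

Note that an empty `"[]"` parses to a truly empty list here, so downstream length checks would treat it as zero genres or artists rather than one.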

In [None]:
print(type(df["artists"][0]))
df["artists"]

In [None]:
df["artists"] = df["artists"].apply(lambda x: x.replace("'", "").strip('][').split(', '))

In [None]:
type(df["artists"][0])

In [None]:
def query_artist(artist):
    # Boolean mask: True for songs credited solely to `artist`
    return df["artists"].apply(lambda names: names == [artist])

In [None]:
df[query_artist("MGMT")].sort_values("popularity", ascending = False).head()

In [None]:
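# top10_artists is not defined earlier in the notebook; assumed here to be the
# ten highest-popularity rows of the artist-level data
top10_artists = df_artists.sort_values("popularity", ascending=False).head(10)
# artist = top10_artists["artists"]
# artists_pop = top10_artists["popularity"]
# plt.bar(artist, artists_pop)
top10_artists.plot.bar("artists", "popularity")
plt.xticks(rotation=45)
plt.title("Top 10 Most Popular Artists")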

Looking at artist popularity, we see that the top 10 artists are relatively unknown (at least to us). Why could that be?


In [None]:
top10_artists

We see that the count values for these artists are extremely low, so these artists are likely "one-hit wonders" or have two very successful songs. Let's see how popularity measures for a universally loved artist like The Beatles.

In [None]:
df_artists[df_artists["artists"] == "The Beatles"]

Interestingly, The Beatles have a popularity score of 48.06 compared to the above artists' scores of 86-95. How does Spotify measure popularity? Let's look to the API documentation:

The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.

So the Beatles' score is likely averaged over all their songs, which lowers it, as they have a count of 823. It is also interesting that popularity is affected by how recently a song has been played. Let's see how time affects popularity.

In [None]:
plt.plot(df_year['year'], df_year['popularity'])
plt.xlabel("Year")
plt.ylabel("Popularity")
plt.title("Popularity Over Time")

As we suspected, popularity increases over time, favoring more recent songs. This indicates that popularity is better defined as "current popularity". Thus, the variable does not indicate how popular a song was when it came out, but rather how popular it was when the data was queried, roughly October 11th, 2020. This means it may not be a reliable variable to use, or we must use it while acknowledging that it measures how popular an artist or song is currently, not historically.

Let's look at the most popular songs in the dataset as a sanity check

In [None]:
df.sort_values("popularity", ascending = False).head(20)

These look a lot more like familiar artists. This indicates we want to stick to the song level data as opposed to data aggregated at the artist level so we do not lose detail about the data through issues like the "one-hit wonder" inflation seen above.
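The "one-hit wonder" inflation can be illustrated with a small sketch that ranks artists by mean popularity with and without a minimum song count. The helper `top_artists`, the `min_count` threshold, and the numbers in the toy frame are assumptions for illustration only; the column names mirror those in the artist-level data.

```python
import pandas as pd

def top_artists(artist_frame, n=10, min_count=1):
    """Rank artists by mean popularity, keeping only those with enough songs."""
    enough_songs = artist_frame[artist_frame["count"] >= min_count]
    return enough_songs.sort_values("popularity", ascending=False).head(n)

# Made-up numbers shaped like the artist-level data
toy = pd.DataFrame({
    "artists": ["One Hit", "Two Hits", "Big Catalog", "Mid Catalog"],
    "popularity": [93.0, 88.0, 48.0, 60.0],
    "count": [1, 2, 800, 40],
})
unfiltered = top_artists(toy, n=2)
filtered = top_artists(toy, n=2, min_count=20)
print(unfiltered["artists"].tolist())
print(filtered["artists"].tolist())
```

A threshold like this changes the ranking but does not fix the underlying averaging, which is one more reason to prefer the song-level data.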

It seems that we will want to focus our efforts on clustering the song level data using the technical music aspects to try and discern some innate pattern that we can abstract as genre. Let's look at some of this technical music data and observe.

In [None]:
df["acousticness"].hist()
plt.title("Distribution of Acousticness")
plt.xlabel("Acousticness")
plt.ylabel("Count")

In [None]:
df["danceability"].hist()
plt.title("Distribution of Danceability")
plt.xlabel("Danceability")
plt.ylabel("Count")

We see that acousticness is heavily concentrated in the bins near 0 and 1, while danceability is spread more evenly, with low concentration in the lowest and highest bins.

We'll use the genre-level data to look at trends in the technical music aspects, since it shows how these features vary across genres.

In [None]:
plt.scatter(df_genres["acousticness"], df_genres["energy"])
plt.title("Scatter Plot of Genres: Acousticness vs Energy")
plt.xlabel("Acousticness")
plt.ylabel("Energy")

In [None]:
plt.scatter(df_genres["acousticness"], df_genres["loudness"])
plt.title("Scatter Plot of Genres: Acousticness vs Loudness")
plt.xlabel("Acousticness")
plt.ylabel("Loudness")

Overall, through our EDA we've decided to try clustering for genres at the song level. The other datasets are aggregated, and we found that aggregation loses specificity, so we will use the raw song data. Perhaps one way that we can measure our success is to compare our song clusters to the genres assigned to artists (though there is no genre variable in the song dataset).

In [None]:
plotting_cols = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "valence", "speechiness"]
def plot_song(song):
    song_df = df[df["name"] == song]
    song_df.iloc[0][plotting_cols].plot.bar()
    plt.xticks(rotation= 45)
    plt.title("Technical Values of " + song)
    plt.xlabel("Musical Features measured from 0-1")
    plt.ylabel("Value")

In [None]:
plot_song("Ymca")

## Unsupervised Clustering Section

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial import distance

In [None]:
df = pd.read_csv("archive/data.csv")
df_artist = pd.read_csv("archive/data_by_artist.csv")
df_genres = pd.read_csv("archive/data_by_genres.csv")
df_year = pd.read_csv("archive/data_by_year.csv")
df_w_genres = pd.read_csv("archive/data_w_genres.csv")

In [None]:
# Reducing data down to the columns of interest for pca
# I do not include explicit because that is not available in the genre aggregate data and I need the data to be consistent
# Note: 'danceability' is listed twice, so it enters the PCA as two identical columns (11 columns in total)
X = df[['acousticness', 
       'danceability',
       'energy',
       'danceability', 
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo']]

# Created this so I could see if there were better clusters with less data, the data stays consistently blob-like
# X = df[['acousticness', 
#        'danceability',
#        'energy',
#        'danceability', 
#        'instrumentalness', 
#        'liveness', 
#        'loudness',
#        'speechiness', 
#        'tempo']]

#Performing PCA to see what's going on
pca = PCA(n_components=2)
#reducing data down to 2d and plotting it
X_2d = pca.fit_transform(X)
plt.scatter([i[0] for i in X_2d], [i[1] for i in X_2d])

In [None]:
#Creating new PCA for visualizing explained variance for principal components
pca2 = PCA()
pca2.fit(X)

ratios = pca2.explained_variance_ratio_
# Running total: entry k-1 is the fraction explained by the first k components
summed_ratios = np.cumsum(ratios)
ks = range(1, len(ratios) + 1)
for k, k_ratio in zip(ks, summed_ratios):
    print(f"The fraction of the total variance explained by the first {k} principal component(s) is {k_ratio}")

# Plot against 1..n components so the final point (all components) is included
plt.figure(figsize=(15, 10))
plt.plot(ks, summed_ratios)
plt.xlabel("Number of Principal Components")
plt.ylabel("Fraction of Total Variance")
plt.title("Fraction of total variance vs. number of principal components")

In [None]:
#Creating a function to plot PCA
def plot_pca():
    # On matplotlib 3.6+ this style is named 'seaborn-v0_8-whitegrid'
    with plt.style.context('seaborn-whitegrid'):
        plt.figure(figsize=(15, 10))
         
        plt.scatter([i[0] for i in X_2d], [i[1] for i in X_2d], c = 'b')

        plt.xlabel('Principal Component 1')
        plt.ylabel('Principal Component 2')
        #plt.legend()
        plt.title("Principal Components 1 and 2")
        plt.tight_layout()

In [None]:
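#Creating a function to plot kmeans and circles on top of the pca graph 
def plot_kmeans():
    plt.plot(centers_2d[:,0], centers_2d[:,1], 'ro', label  = "centroid")

    # enumerate pairs each cluster index with its 2-D center; zipping over
    # kmeans.labels_ would pair centers with the labels of the first few samples
    for ind, center in enumerate(centers_2d):
        class_inds = np.where(kmeans.labels_ == ind)[0]
        if len(class_inds) == 0:
            # no points assigned to this cluster, so there is no circle to draw
            continue
        X_class = X_2d[class_inds]

        # radius = distance from the center to the farthest member of the cluster
        dists = metrics.pairwise_distances([center], X_class)
        max_dist = np.max(dists)
        plt.gca().add_artist(plt.Circle(center, max_dist, fill=False))
    plt.legend()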

Using this list of genres from https://examples.yourdictionary.com/major-types-of-music-from-around-the-world.html as "Top Music Genres In the World": Classical, Country, Electronic dance music (EDM), Hip-hop, Indie rock, Jazz, K-pop, Metal, Oldies, Pop, Rap, Rhythm & blues (R&B), Rock

In [None]:
#Using 12 because that's the number of top genres described above, excluding oldies since that isn't a Spotify genre
kmeans = KMeans(n_clusters= 12)
kmeans.fit(X)

#reducing down the centers into 2d so they can be plotted along with our reduced data
centers = kmeans.cluster_centers_
centers_2d = pca.transform(centers)
centers_2d

##### Plotting with 2d clusters (when kmeans is trying to fit 2-d data instead of 11-d data)

In [None]:
kmeans_2d = KMeans(n_clusters = 12)
kmeans_2d.fit(X_2d)

centers_2dreal = kmeans_2d.cluster_centers_
centers_2dreal

plt.figure(figsize=(15, 10))
         
plt.scatter([i[0] for i in X_2d], [i[1] for i in X_2d], c = 'b')

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
        #plt.legend()
plt.title("Principal Components 1 and 2")
plt.tight_layout()

plt.plot(centers_2dreal[:,0], centers_2dreal[:,1], 'ro', label  = "centroid")

The clusters are more or less the same even when fit in 2-D. This makes sense given the explained variance chart, where explained variance levels off at around 2 principal components.
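One possible contributor to the variance concentrating in so few components is feature scale: PCA is sensitive to units, so a wide-ranging column such as tempo can dominate columns bounded between 0 and 1. The sketch below uses synthetic stand-in columns (not the real song data) to compare explained variance with and without `StandardScaler`. Whether scaling is appropriate for this project is an assumption to be tested, not something the notebook establishes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: three independent features on very different scales
toy = np.column_stack([
    rng.uniform(0, 1, 500),      # roughly like acousticness (0 to 1)
    rng.normal(-12, 5, 500),     # roughly like loudness (dB)
    rng.normal(120, 30, 500),    # roughly like tempo (BPM)
])

# Explained variance on the raw columns versus standardized columns
raw_ratio = PCA().fit(toy).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(toy)).explained_variance_ratio_
print("raw:   ", raw_ratio.round(3))
print("scaled:", scaled_ratio.round(3))
```

In the raw case the first component mostly tracks the widest column. After scaling, the variance is shared far more evenly, which would also change what K-means treats as "close".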

In [None]:
#Using 12 because that's the number of top genres described above, excluding oldies since that isn't a Spotify genre
kmeans = KMeans(n_clusters= 12)
kmeans.fit(X)

#reducing down the centers into 2d so they can be plotted along with our reduced data
centers = kmeans.cluster_centers_
centers_2d = pca.transform(centers)
centers_2d

In [None]:
#making sure pca function works on its own
plot_pca()

In [None]:
#Using the two functions together
plot_pca()
plot_kmeans()

In [None]:
centers

In [None]:
centers_2d

In [None]:
kmeans_2d = KMeans(n_clusters= 12)
kmeans_2d.fit(X_2d)
centers = kmeans_2d.cluster_centers_
#centers_2d


plt.figure(figsize=(15, 10))
     
plt.scatter([i[0] for i in X_2d], [i[1] for i in X_2d], c = 'b')

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
    #plt.legend()
plt.title("Principal Components 1 and 2")
plt.tight_layout()
        
plt.plot(centers[:,0], centers[:,1], 'ro', label  = "centroid")

Visualizing our clusters in 2-D space will be pretty tough if this is all correct. You can't really tell the clusters apart. I think this is mostly because this data does not separate well in a 2-dimensional space. If the data is doomed under dimension reduction, then how do we visualize our clusters and try to discern genre?

What if I up the number of clusters?

In [None]:
kmeans = KMeans(n_clusters= 60)
kmeans.fit(X)
centers = kmeans.cluster_centers_
centers_2d = pca.transform(centers)

plot_pca()
plot_kmeans()

Adding clusters doesn't do much; we are still getting massive circles. Is my distance calculation correct? It would also be worth comparing circle functions with other people.
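Part of the "massive circles" effect may come from using the maximum distance as the radius, since a single far-away point sets the size of the whole circle. A hedged alternative is a percentile radius, sketched below on synthetic points. The helper `cluster_radius` and the choice of the 90th percentile are assumptions for illustration, not something taken from the notebook.

```python
import numpy as np

def cluster_radius(points, center, percentile=100):
    """Distance from `center` that covers the given percentile of `points`."""
    dists = np.linalg.norm(points - center, axis=1)
    return float(np.percentile(dists, percentile))

rng = np.random.default_rng(0)
cluster = rng.normal(0, 1, size=(200, 2))
# One extreme point is enough to blow up a max-distance radius
cluster_with_outlier = np.vstack([cluster, [[50.0, 50.0]]])
center = cluster_with_outlier.mean(axis=0)

max_radius = cluster_radius(cluster_with_outlier, center)
p90_radius = cluster_radius(cluster_with_outlier, center, percentile=90)
print(max_radius, p90_radius)
```

The percentile circle no longer encloses every point, so it should be read as "where most of the cluster sits" rather than as a boundary.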

#### Trying a different approach using the genre as a centroid
First we cut down our genre data to our genres of interest

In [None]:
df_genres["genres"] = df_genres["genres"].apply(lambda x: x.replace("'", "").strip('][').split(', '))

# Keep only rows whose parsed genre list has exactly one entry
df_genres = df_genres[df_genres["genres"].apply(len) == 1]
df_genres = df_genres.reset_index(drop = True)

In [None]:
popular_genres = ["classical", "pop", "country", "edm", "hip hop", "indie rock", "jazz", "k-pop", "metal", "oldies", "rap", "r&b", "rock"]

In [None]:
# Keep only rows whose (single) genre is one of the popular genres
trimmed_genre_df = df_genres[df_genres["genres"].apply(lambda g: g[0] in popular_genres)]
trimmed_genre_df.head()

In [None]:
genre_X = trimmed_genre_df[['acousticness', 
       'danceability',
       'energy',
       'danceability', 
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo']]

genre_y = trimmed_genre_df["genres"]
genre_2d = pca.transform(genre_X)

In [None]:
kmeans = KMeans(n_clusters= 12)
kmeans.fit(X)
centers = kmeans.cluster_centers_
centers_2d = pca.transform(centers)
plot_pca()
plot_kmeans()
plt.plot()
plt.plot(genre_2d[:,0], genre_2d[:,1], 'yo', label  = "Average Genre Value")
plt.legend()

In [None]:
#Seeing where each of our cluster genres land to interpret associating clusters with genres later on 
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(15, 10))
         
    plt.scatter([i[0] for i in X_2d], [i[1] for i in X_2d], c = 'black')

    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    #plt.legend()
    plt.title("Principal Components 1 and 2")
    plt.tight_layout()
#plt.plot(genre_2d[:,0], genre_2d[:,1], 'yo', label  = "Average Genre Value")
#plot_kmeans()
plt.plot(centers_2d[:,0], centers_2d[:,1], 'ro', label  = "centroid")

for i in range(len(genre_2d)):
    center = genre_2d[i]
    lab = genre_y.values[i][0]
    plt.scatter(center[0], center[1], label = lab)
plt.legend()

We can't visualize this very well in 2 dimensions: these are values we can attribute to cluster centers, and they are not clearly separable. Additionally, this data is at the song level.

In [None]:
centers = kmeans.cluster_centers_
genre_2d

What I'm trying to do is name each cluster center after the genre whose average feature vector is nearest to it in Euclidean distance.

In [None]:
#Calculating min distance and adding it into a list corresponding with the clusters
center_genre_names = []
for center in centers:
    
    min_dist = float("inf")  # any real distance is smaller than this
    min_genre = ""
    for i in range(len(genre_X)):
        genre = genre_y.values[i]
        row = genre_X.iloc[i]
        dist = distance.euclidean(row.values, center)
        if dist < min_dist:
            min_dist = dist
            min_genre = genre
    center_genre_names.append(min_genre)
        

center_genre_names

In [None]:
#Printing it out nicely
for i in range(len(centers)):
    print("Cluster ", str(i + 1), "is closest to the genre:", center_genre_names[i][0])
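The nearest-genre assignment above can also be sketched in vectorized form with `scipy.spatial.distance.cdist`, which computes all center-to-genre Euclidean distances in one call. The helper `nearest_labels` and the toy arrays are hypothetical stand-ins for `centers`, `genre_X.values`, and the genre names.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearest_labels(centers, reference_points, reference_labels):
    """For each center, return the label of the closest reference point."""
    # dist_matrix[i, j] = Euclidean distance from center i to reference point j
    dist_matrix = cdist(centers, reference_points)
    nearest = dist_matrix.argmin(axis=1)
    return [reference_labels[j] for j in nearest]

# Made-up 2-D values, only to show the mechanics
toy_centers = np.array([[0.1, 0.9], [0.8, 0.2]])
toy_genre_means = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
toy_genre_names = ["classical", "metal", "pop"]

assigned = nearest_labels(toy_centers, toy_genre_means, toy_genre_names)
print(assigned)
```

Because the distances are unscaled, columns with large ranges will dominate which genre counts as "nearest", in this sketch and in the loop version alike.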

Well, it looks like our clusters are mostly concentrated around these genres, which suggests they aren't picking up on any underlying patterns in the data. It is also possible that these underlying patterns don't exist. Perhaps we can learn more by looking at supervised clustering (Isaac's work).

Seeing how things work if I turn the average values into cluster centers

In [None]:
kmeans = KMeans(n_clusters= 12)
kmeans.fit(X)
kmeans.cluster_centers_ = genre_X.values
centers = kmeans.cluster_centers_

In [None]:
centers == genre_X.values

In [None]:
centers_2d = pca.transform(centers)
plot_pca()
plot_kmeans()
plt.plot()
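Overwriting `cluster_centers_` after fitting leaves `labels_` tied to the old centers. An alternative worth trying, sketched here on synthetic data, is to pass the genre averages as the starting centers through the `init` parameter of `KMeans` (with `n_init=1`, since there is only one initialization to run). The arrays below are made up and only stand in for the song features and `genre_X.values`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs standing in for song features
toy_songs = np.vstack([
    rng.normal([0, 0], 0.3, size=(50, 2)),
    rng.normal([5, 5], 0.3, size=(50, 2)),
])
# Stand-in for per-genre average feature vectors
toy_genre_means = np.array([[0.5, 0.5], [4.5, 4.5]])

# Seed K-means with the genre averages instead of random starting points
seeded = KMeans(n_clusters=len(toy_genre_means), init=toy_genre_means, n_init=1)
seeded.fit(toy_songs)
print(seeded.cluster_centers_)
```

The centers are still allowed to move during fitting, so this tests whether clusters that start at the genre averages stay near them. Labels and centers remain consistent with each other, and cluster `i` corresponds to row `i` of the seed array.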

## "Supervised" Clustering

In [None]:
# Do the necessary imports
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

In [None]:
#Import the data
dat = pd.read_csv("archive/data.csv")
dat_artist = pd.read_csv("archive/data_by_artist.csv")
dat_genres = pd.read_csv("archive/data_by_genres.csv")
dat_year = pd.read_csv("archive/data_by_year.csv")
dat_w_genres = pd.read_csv("archive/data_w_genres.csv")

#### Smaller Dataset work ("Pure", guaranteed songs in the genre)

In [None]:
# Fix issue with genres list (From string to list of strings)
dat_w_genres["genres"] = dat_w_genres["genres"].apply(lambda x: x.replace("'", "").strip('][').split(', '))


In [None]:
# An empty genre list ('[]') parses to [''], which also has length one; shouldn't be an issue
# This leaves us with only artists that have worked in one genre
dat_w_genres = dat_w_genres[dat_w_genres["genres"].apply(len) == 1]
dat_w_genres = dat_w_genres.reset_index(drop = True)

In [None]:
# Fix for the stringed list that we have for artists (same as for genres)
dat["artists"] = dat["artists"].apply(lambda x: x.replace("'", "").strip('][').split(', '))

In [None]:
# Get all classical artists in "Pure" genres
classical_artists = dat_w_genres[[True if "classical" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]

# Get pop artists
pop_artists = dat_w_genres[[True if "pop" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
pop_artists.head()

In [None]:
# Get classical artists
classical_artists = classical_artists["artists"]
classical_artists.head()

In [None]:
# Get country artists
country_artists = dat_w_genres[[True if "country" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
country_artists.head()

In [None]:
# Get other genre artists
edm_artists = dat_w_genres[[True if "edm" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
hiphop_artists = dat_w_genres[[True if "hip hop" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
indierock_artists = dat_w_genres[[True if "indie rock" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
jazz_artists = dat_w_genres[[True if "jazz" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
kpop_artists = dat_w_genres[[True if "k-pop" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
metal_artists = dat_w_genres[[True if "metal" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
# No oldies for spotify that I could find
oldies_artists = dat_w_genres[[True if "oldies" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
rap_artists = dat_w_genres[[True if "rap" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
randb_artists = dat_w_genres[[True if "r&b" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
rock_artists = dat_w_genres[[True if "rock" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
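The per-genre filters above all follow one pattern, which could be captured in a small helper and a dictionary keyed by genre. This is only a sketch: `artists_in_genre`, the toy frame, and the short genre list are hypothetical, and the membership test is the same exact-match check used in the cells above.

```python
import pandas as pd

def artists_in_genre(frame, genre):
    """Artists whose parsed genre list contains exactly the given genre name."""
    has_genre = frame["genres"].apply(lambda genres: genre in genres)
    return frame.loc[has_genre, "artists"]

# Made-up rows shaped like the parsed genre data
toy = pd.DataFrame({
    "artists": ["A", "B", "C"],
    "genres": [["pop"], ["rap"], ["pop rock"]],
})
genre_names = ["pop", "rap", "oldies"]
artists_by_genre = {g: artists_in_genre(toy, g) for g in genre_names}
print({g: s.tolist() for g, s in artists_by_genre.items()})
```

Because the test is list membership rather than a substring match, an artist tagged only "pop rock" is not returned for "pop", which matches the "pure genre" intent of this section.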


In [None]:
# Get all songs from the pop, classical and country artists
possible_pop = dat[[pop_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_classical = dat[[classical_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_country = dat[[country_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]

In [None]:
# Get all songs from the edm, hiphop and indierock artists
possible_edm = dat[[edm_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_hiphop = dat[[hiphop_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_indierock = dat[[indierock_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]


In [None]:
# Get all songs from the jazz, kpop and metal artists
possible_jazz = dat[[jazz_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_kpop = dat[[kpop_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_metal = dat[[metal_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]

In [None]:
# Get all songs from the rap, r and b and rock artists
possible_rap = dat[[rap_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_randb = dat[[randb_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_rock = dat[[rock_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
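The song lookups above call `isin` once per song, which can be slow on a dataset of this size. A hedged alternative is to build a set of artist names once and test each song's artist list against it. The helper `songs_by_artists` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def songs_by_artists(songs, artist_names):
    """Songs credited to at least one of the given artists."""
    wanted = set(artist_names)
    # isdisjoint is False as soon as one credited artist is in the wanted set
    mask = songs["artists"].apply(lambda credited: not wanted.isdisjoint(credited))
    return songs[mask]

# Made-up rows shaped like the parsed song data
toy_songs = pd.DataFrame({
    "name": ["s1", "s2", "s3"],
    "artists": [["A"], ["B", "C"], ["D"]],
})
possible = songs_by_artists(toy_songs, pd.Series(["A", "C"]))
print(possible["name"].tolist())
```

This keeps the same "any credited artist matches" rule as the cells above, so collaborations with an out-of-genre artist are still included.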

In [None]:
# Choose the numeric features, add the genre we choose
# 'key', 'mode', 'loudness' and 'explicit' are listed twice, so each frame has
# 18 feature columns (the PCA cell below is sized to match)
feature_cols = ['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']

def label_genre(songs, genre):
    # .copy() so adding the label column does not write into a slice of `dat`
    labeled = songs[feature_cols].copy()
    labeled["genre"] = genre
    return labeled

pop = label_genre(possible_pop, "pop")
classical = label_genre(possible_classical, "classical")
country = label_genre(possible_country, "country")
edm = label_genre(possible_edm, "edm")
hiphop = label_genre(possible_hiphop, "hiphop")
indierock = label_genre(possible_indierock, "indierock")
jazz = label_genre(possible_jazz, "jazz")
kpop = label_genre(possible_kpop, "kpop")
metal = label_genre(possible_metal, "metal")
rap = label_genre(possible_rap, "rap")
randb = label_genre(possible_randb, "randb")
rock = label_genre(possible_rock, "rock")

# Choose Genres
allg = pd.concat([pop, rap, indierock, rock, metal, randb, kpop, jazz, classical, hiphop, edm, country])
X = pd.concat([pop, rap])

In [None]:
# See how many of each genre
allg["genre"].value_counts()

In [None]:
# Do PCA, find explained variance for each component
pca = PCA(n_components=18)
y = X["genre"]
X = X[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]
pca.fit(X)
pcavar = pca.explained_variance_ratio_
plt.plot(pcavar)

In [None]:
# Get 2D Data visualization
pca_mod = PCA(n_components=2)
pcadat = pca_mod.fit_transform(X)
d = pd.DataFrame(data=pcadat, columns=["Principal Component 1", "PC2"])
pops = d[(y == "pop").reset_index(drop = True)]
raps = d[(y == "rap").reset_index(drop = True)]
plt.plot(pops["Principal Component 1"], pops["PC2"], 'bo', raps["Principal Component 1"], raps["PC2"], 'ro')
plt.xticks([], [])
plt.yticks([], [])
plt.legend(("Pop", "Rap"))

In [None]:
# Split into Train and Test data
train_data, test_data, train_labels, test_labels = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Look for best clustering of pop and rap (From project 3 function)
train_labels = train_labels.reset_index(drop = True)
test_labels = test_labels.reset_index(drop = True)

dat = pd.DataFrame(columns=["Type of Covariance", "Number of PCA Components", "Number of GMM Components", "Parameters", "Accuracy"])
l = 0
    
    # Start of Function
    # Values of PCA Components
for i in range(1, 18):
        # Fit the PCA, get Pos and Neg
    pca_mod = PCA(n_components=i)
    pcadat=pca_mod.fit_transform(train_data) 
    data = pd.DataFrame(data=pcadat)
        
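    # Project the test data with the PCA fitted on the training data, so both
    # sets share the same components (a second PCA fitted on the test set
    # would produce a different basis)
    pcadat2 = pca_mod.transform(test_data)
    data2 = pd.DataFrame(data=pcadat2)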
            
    popsongs = data[train_labels == "pop"]
    poplabels = train_labels[train_labels == "pop"]
    

    rapsongs = data[train_labels == "rap"]
    raplabels = train_labels[train_labels == "rap"]
    
    for j in [1, 2, 3, 4, 5, 6, 7, 8]:
        
        params = 2 * (j + (j - 1) + (j*np.sum(range(1, i+1))))
                
        modelpop = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelpop.fit(popsongs, poplabels)
        
        modelrap = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelrap.fit(rapsongs, raplabels)
            
        poslik = modelpop.score_samples(data2)
        neglik = modelrap.score_samples(data2)
                
                 # Label More Likely outcome
        labs = []
        for k in range(len(poslik)):
            if poslik[k] > neglik[k]:
                labs.append("pop")
            else:
                labs.append("rap")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Full", i, j, params, totalacc]
        l += 1
    
    for j in [1, 2, 3, 4, 5, 6, 7, 8]:
        params = 2 * ((2*j - 1) + i*j)
            
        modelpos = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelpos.fit(popsongs, poplabels)
        
        modelneg = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelneg.fit(rapsongs, raplabels)
            
        poslik = modelpos.score_samples(data2)
        neglik = modelneg.score_samples(data2)
                
                 # Label More Likely outcome
        labs = []
        for k in range(len(poslik)):
            if poslik[k] > neglik[k]:
                labs.append("pop")
            else:
                labs.append("rap")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Diag", i, j, params, totalacc]
        l += 1
    
    for j in [1, 2, 3, 4, 5, 6, 7, 8]:
        params = 2 * (3*j - 1)
            
        modelpos = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelpos.fit(popsongs, poplabels)
        
        modelneg = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelneg.fit(rapsongs, raplabels)
            
        poslik = modelpos.score_samples(data2)
        neglik = modelneg.score_samples(data2)
            
            # Label More Likely outcome
        labs = []
        for k in range(len(poslik)):
            if poslik[k] > neglik[k]:
                labs.append("pop")
            else:
                labs.append("rap")
    
            # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Spherical", i, j, params, totalacc]
        l += 1
        
    for j in range(1, 13):
        params = 2 * (np.sum(range(1, i)) + 2*j - 1)
                
        modelpos = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelpos.fit(popsongs, poplabels)
                
        modelneg = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelneg.fit(rapsongs, raplabels)
                
        poslik = modelpos.score_samples(data2)
        neglik = modelneg.score_samples(data2)
                
                # Label More Likely outcome
        labs = []
        for k in range(len(poslik)):
            if poslik[k] > neglik[k]:
                labs.append("pop")
            else:
                labs.append("rap")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Tied", i, j, params, totalacc]
        l += 1

        # Get model with best accuracy
dat[dat["Accuracy"] == max(dat["Accuracy"])]
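
The inner step of the search above (fit one GMM per genre, score the test songs under each, pick the higher log-likelihood) can be written as one small helper. This is a sketch under assumptions, not the function from the original project: the name `gmm_class_accuracy` and the synthetic two-class data are made up for illustration, and the PCA is fitted on the training rows only and then reused for the test rows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def gmm_class_accuracy(train_X, train_y, test_X, test_y, n_pca, n_gmm,
                       cov_type="full", seed=12345):
    """Fit one GMM per class in PCA space and return the test accuracy."""
    train_y = np.asarray(train_y)
    test_y = np.asarray(test_y)

    # Fit PCA on the training data only, then reuse the projection for test data
    pca = PCA(n_components=n_pca).fit(train_X)
    train_p = pca.transform(train_X)
    test_p = pca.transform(test_X)

    classes = np.unique(train_y)
    # One column of per-song log-likelihoods for each class model
    loglik = np.column_stack([
        GaussianMixture(n_components=n_gmm, covariance_type=cov_type, random_state=seed)
        .fit(train_p[train_y == c])
        .score_samples(test_p)
        for c in classes
    ])
    preds = classes[np.argmax(loglik, axis=1)]
    return float(np.mean(preds == test_y))

# Synthetic two-class demo (assumption: stands in for the pop/rap features)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=0.0, size=(150, 4)),
                    rng.normal(loc=4.0, size=(150, 4))])
y_demo = np.array(["pop"] * 150 + ["rap"] * 150)
idx = rng.permutation(300)
tr, te = idx[:240], idx[240:]
acc = gmm_class_accuracy(X_demo[tr], y_demo[tr], X_demo[te], y_demo[te], n_pca=2, n_gmm=1)
print(acc)
```

Because the helper takes the covariance type as an argument, the four near-identical blocks above could become one loop over `["full", "diag", "spherical", "tied"]`.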

#### Larger clustering work ("Possible" data - songs not guaranteed in genre)

In [None]:
# Reimport data for fresh start
dat = pd.read_csv("archive/data.csv")
dat_artist = pd.read_csv("archive/data_by_artist.csv")
dat_genres = pd.read_csv("archive/data_by_genres.csv")
dat_year = pd.read_csv("archive/data_by_year.csv")
dat_w_genres = pd.read_csv("archive/data_w_genres.csv")

In [None]:
# Artists fix again
dat["artists"] = dat["artists"].apply(lambda x: x.replace("'", "").strip('][').split(', '))
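
One caveat with the string replacement above is that it also strips apostrophes that belong to an artist's name. As an alternative sketch, assuming the column holds Python-style list literals like the ones shown in `df.head()`, `ast.literal_eval` parses the list without altering the names. The sample strings below are illustrative, and how the dataset actually quotes names containing apostrophes is an assumption.

```python
import ast
import pandas as pd

# Small stand-in for dat["artists"] (assumption: values are Python list literals)
artists_demo = pd.Series([
    "['Carl Woitschach']",
    "['Robert Schumann', 'Vladimir Horowitz']",
    "[\"Guns N' Roses\"]",
])
parsed = artists_demo.apply(ast.literal_eval)
print(parsed.tolist())
```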

In [None]:
# Get all artists for a genre
classical_artists = dat_w_genres[[True if "classical" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

pop_artists = dat_w_genres[[True if "pop" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

country_artists = dat_w_genres[[True if "country" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

edm_artists = dat_w_genres[[True if "edm" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

hiphop_artists = dat_w_genres[[True if "hip hop" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

indierock_artists = dat_w_genres[[True if "indie rock" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

jazz_artists = dat_w_genres[[True if "jazz" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

kpop_artists = dat_w_genres[[True if "k-pop" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

metal_artists = dat_w_genres[[True if "metal" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

# No "oldies" genre tag in the Spotify data that I could find
oldies_artists = dat_w_genres[[True if "oldies" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

rap_artists = dat_w_genres[[True if "rap" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

randb_artists = dat_w_genres[[True if "r&b" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

rock_artists = dat_w_genres[[True if "rock" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
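
The substring test used above is deliberately loose: `"pop"` also matches `"k-pop"`, and `"rock"` also matches `"indie rock"`, which is part of why these songs are only "possible" members of a genre. A minimal sketch on a toy frame (hypothetical rows, not taken from the dataset, and assuming `genres` is stored as a list literal string) compares the substring test with an exact-token test:

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    "artists": ["A", "B", "C"],
    "genres": ["['k-pop']", "['pop', 'dance pop']", "['indie rock']"],
})

# Substring match, equivalent to the `"pop" in genres` test used above
loose = toy[toy["genres"].str.contains("pop", regex=False)]["artists"]

# Exact-token match: parse the list literal and test membership
strict = toy[toy["genres"].apply(lambda g: "pop" in ast.literal_eval(g))]["artists"]
print(loose.tolist(), strict.tolist())
```

The loose version keeps artist `A` (k-pop only) in the pop set, while the strict version does not.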

In [None]:
# Get all songs from the artists in a genre
possible_pop = dat[[pop_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_classical = dat[[classical_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_country = dat[[country_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]

possible_edm = dat[[edm_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_hiphop = dat[[hiphop_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_indierock = dat[[indierock_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]

possible_jazz = dat[[jazz_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_kpop = dat[[kpop_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_metal = dat[[metal_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]

possible_rap = dat[[rap_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_randb = dat[[randb_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_rock = dat[[rock_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
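
The row-by-row `isin` checks above work but get slow on the full track table. One possible alternative, sketched here on toy data with hypothetical names, explodes the artist lists into one row per (song, artist) pair and flags songs with at least one matching artist:

```python
import pandas as pd

# Toy stand-ins (hypothetical): songs with parsed artist lists, and one genre's artists
songs = pd.DataFrame({
    "name": ["s1", "s2", "s3"],
    "artists": [["A"], ["B", "C"], ["D"]],
})
genre_artists = pd.Series(["C", "D"])

# One row per (song, artist) pair, keeping the original song index
exploded = songs["artists"].explode()

# A song qualifies if any of its artists is in the genre's artist set
mask = exploded.isin(set(genre_artists)).groupby(level=0).any()
possible = songs[mask]
print(possible["name"].tolist())
```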

In [None]:
# Choose numeric features, add genre
# Note: 'key', 'mode', 'loudness', and 'explicit' are listed twice,
# so those columns appear more than once downstream
genre_feature_cols = ['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']

def with_genre(frame, genre):
    # Copy the selected features so adding the label does not touch the source frame
    out = frame[genre_feature_cols].copy()
    out["genre"] = genre
    return out

pop = with_genre(possible_pop, "pop")
classical = with_genre(possible_classical, "classical")
country = with_genre(possible_country, "country")
edm = with_genre(possible_edm, "edm")
hiphop = with_genre(possible_hiphop, "hiphop")
indierock = with_genre(possible_indierock, "indierock")
jazz = with_genre(possible_jazz, "jazz")
kpop = with_genre(possible_kpop, "kpop")
metal = with_genre(possible_metal, "metal")
rap = with_genre(possible_rap, "rap")
randb = with_genre(possible_randb, "randb")
rock = with_genre(possible_rock, "rock")

# Choose Genres
X = pd.concat([pop, rap, rock, indierock, metal, jazz, randb, kpop, classical, country, hiphop, edm])

In [None]:
# Set X and y properly
y = X["genre"]
X = X[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]

In [None]:
# Get number of each
y.value_counts()

In [None]:
# Just predicting rock gives us 28%
# This was a sanity check -- modified multiple times
len(y_test)
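
The majority-class baseline mentioned in the comment above can be read straight off the label counts. A minimal sketch with toy labels (the counts here are made up, not the project's):

```python
import pandas as pd

# Toy labels (hypothetical counts)
y_demo = pd.Series(["rock"] * 5 + ["pop"] * 3 + ["jazz"] * 2)

# Share of each genre; always predicting the largest one gives the baseline accuracy
shares = y_demo.value_counts(normalize=True)
baseline_label = shares.idxmax()
baseline_acc = shares.max()
print(baseline_label, baseline_acc)
```

Any model in the search below should beat this number to be worth keeping.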

In [None]:
# Test train split
train_data, test_data, train_labels, test_labels = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Run to find best clustering model for all 12 genres
train_labels = train_labels.reset_index(drop = True)
test_labels = test_labels.reset_index(drop = True)

# Results table: one row per covariance type, PCA size, and GMM size
dat = pd.DataFrame(columns=["Type of Covariance", "Number of PCA Components", "Number of GMM Components", "Parameters", "Accuracy"])
l = 0

# Loop over the number of PCA components
for i in range(1, 18):
    # Fit the PCA on the training data, then split the projection by genre
    pca_mod = PCA(n_components=i)
    pcadat=pca_mod.fit_transform(train_data) 
    data = pd.DataFrame(data=pcadat)
        
    # Project the test data with the PCA fitted on the training data,
    # so train and test share the same component axes
    pcadat2 = pca_mod.transform(test_data)
    data2 = pd.DataFrame(data=pcadat2)
            
    popsongs = data[train_labels == "pop"]
    poplabels = train_labels[train_labels == "pop"]
    
    classicalsongs = data[train_labels == "classical"]
    classicallabels = train_labels[train_labels == "classical"]
    
    hiphopsongs = data[train_labels == "hiphop"]
    hiphoplabels = train_labels[train_labels == "hiphop"]
    
    jazzsongs = data[train_labels == "jazz"]
    jazzlabels = train_labels[train_labels == "jazz"]
    
    indierocksongs = data[train_labels == "indierock"]
    indierocklabels = train_labels[train_labels == "indierock"]
    
    rocksongs = data[train_labels == "rock"]
    rocklabels = train_labels[train_labels == "rock"]
    
    edmsongs = data[train_labels == "edm"]
    edmlabels = train_labels[train_labels == "edm"]
    
    countrysongs = data[train_labels == "country"]
    countrylabels = train_labels[train_labels == "country"]
    
    kpopsongs = data[train_labels == "kpop"]
    kpoplabels = train_labels[train_labels == "kpop"]
    
    metalsongs = data[train_labels == "metal"]
    metallabels = train_labels[train_labels == "metal"]
    
    randbsongs = data[train_labels == "randb"]
    randblabels = train_labels[train_labels == "randb"]

    rapsongs = data[train_labels == "rap"]
    raplabels = train_labels[train_labels == "rap"]
    
    for j in [1, 2, 3, 4]:
        
        params = 2 * (j + (j - 1) + (j*np.sum(range(1, i+1))))
                
        modelpop = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelpop.fit(popsongs, poplabels)
        
        modelhiphop = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelhiphop.fit(hiphopsongs, hiphoplabels)
        
        modeljazz = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modeljazz.fit(jazzsongs, jazzlabels)
        
        modelclassical = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelclassical.fit(classicalsongs, classicallabels)
        
        modeledm = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modeledm.fit(edmsongs, edmlabels)
        
        modelcountry = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelcountry.fit(countrysongs, countrylabels)
        
        modelrandb = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelrandb.fit(randbsongs, randblabels)
        
        modelindierock = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelindierock.fit(indierocksongs, indierocklabels)
        
        modelmetal = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelmetal.fit(metalsongs, metallabels)
        
        modelkpop = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelkpop.fit(kpopsongs, kpoplabels)
        
        modelrock = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelrock.fit(rocksongs, rocklabels)
        
        modelrap = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelrap.fit(rapsongs, raplabels)
            
        lik1 = modelpop.score_samples(data2)
        lik2 = modelrap.score_samples(data2)
        lik3 = modeledm.score_samples(data2)
        lik4 = modelrock.score_samples(data2)
        lik5 = modelindierock.score_samples(data2)
        lik6 = modeljazz.score_samples(data2)
        lik7 = modelcountry.score_samples(data2)
        lik8 = modelclassical.score_samples(data2)
        lik9 = modelkpop.score_samples(data2)
        lik10 = modelmetal.score_samples(data2)
        lik11 = modelrandb.score_samples(data2)
        lik12 = modelhiphop.score_samples(data2)
                
                 # Label More Likely outcome
        labs = []
        for k in range(len(lik1)):
            vals = [lik1[k], lik2[k], lik3[k], lik4[k], lik5[k], lik6[k], lik7[k], lik8[k], lik9[k], lik10[k], lik11[k], lik12[k]]
            if np.argmax(vals) == 0:
                labs.append("pop")
            elif np.argmax(vals) == 1:
                labs.append("rap")
            elif np.argmax(vals) == 2:
                labs.append("edm")
            elif np.argmax(vals) == 3:
                labs.append("rock")
            elif np.argmax(vals) == 4:
                labs.append("indierock")
            elif np.argmax(vals) == 5:
                labs.append("jazz")
            elif np.argmax(vals) == 6:
                labs.append("country")
            elif np.argmax(vals) == 7:
                labs.append("classical")
            elif np.argmax(vals) == 8:
                labs.append("kpop")
            elif np.argmax(vals) == 9:
                labs.append("metal")
            elif np.argmax(vals) == 10:
                labs.append("randb")
            else:
                labs.append("hiphop")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Full", i, j, params, totalacc]
        l += 1
    
    for j in [1, 2, 3, 4]:
        params = 2 * ((2*j - 1) + i*j)
            
        modelpop = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelpop.fit(popsongs, poplabels)
        
        modelhiphop = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelhiphop.fit(hiphopsongs, hiphoplabels)
        
        modeljazz = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modeljazz.fit(jazzsongs, jazzlabels)
        
        modelclassical = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelclassical.fit(classicalsongs, classicallabels)
        
        modeledm = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modeledm.fit(edmsongs, edmlabels)
        
        modelcountry = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelcountry.fit(countrysongs, countrylabels)
        
        modelrandb = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelrandb.fit(randbsongs, randblabels)
        
        modelindierock = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelindierock.fit(indierocksongs, indierocklabels)
        
        modelmetal = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelmetal.fit(metalsongs, metallabels)
        
        modelkpop = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelkpop.fit(kpopsongs, kpoplabels)
        
        modelrock = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelrock.fit(rocksongs, rocklabels)
        
        modelrap = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelrap.fit(rapsongs, raplabels)
            
        lik1 = modelpop.score_samples(data2)
        lik2 = modelrap.score_samples(data2)
        lik3 = modeledm.score_samples(data2)
        lik4 = modelrock.score_samples(data2)
        lik5 = modelindierock.score_samples(data2)
        lik6 = modeljazz.score_samples(data2)
        lik7 = modelcountry.score_samples(data2)
        lik8 = modelclassical.score_samples(data2)
        lik9 = modelkpop.score_samples(data2)
        lik10 = modelmetal.score_samples(data2)
        lik11 = modelrandb.score_samples(data2)
        lik12 = modelhiphop.score_samples(data2)
                
                 # Label More Likely outcome
        labs = []
        for k in range(len(lik1)):
            vals = [lik1[k], lik2[k], lik3[k], lik4[k], lik5[k], lik6[k], lik7[k], lik8[k], lik9[k], lik10[k], lik11[k], lik12[k]]
            if np.argmax(vals) == 0:
                labs.append("pop")
            elif np.argmax(vals) == 1:
                labs.append("rap")
            elif np.argmax(vals) == 2:
                labs.append("edm")
            elif np.argmax(vals) == 3:
                labs.append("rock")
            elif np.argmax(vals) == 4:
                labs.append("indierock")
            elif np.argmax(vals) == 5:
                labs.append("jazz")
            elif np.argmax(vals) == 6:
                labs.append("country")
            elif np.argmax(vals) == 7:
                labs.append("classical")
            elif np.argmax(vals) == 8:
                labs.append("kpop")
            elif np.argmax(vals) == 9:
                labs.append("metal")
            elif np.argmax(vals) == 10:
                labs.append("randb")
            else:
                labs.append("hiphop")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Diag", i, j, params, totalacc]
        l += 1
    
    for j in [1, 2, 3, 4]:
        params = 2 * (3*j - 1)
            
        modelpop = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelpop.fit(popsongs, poplabels)
        
        modelhiphop = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelhiphop.fit(hiphopsongs, hiphoplabels)
        
        modeljazz = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modeljazz.fit(jazzsongs, jazzlabels)
        
        modelclassical = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelclassical.fit(classicalsongs, classicallabels)
        
        modeledm = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modeledm.fit(edmsongs, edmlabels)
        
        modelcountry = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelcountry.fit(countrysongs, countrylabels)
        
        modelrandb = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelrandb.fit(randbsongs, randblabels)
        
        modelindierock = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelindierock.fit(indierocksongs, indierocklabels)
        
        modelmetal = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelmetal.fit(metalsongs, metallabels)
        
        modelkpop = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelkpop.fit(kpopsongs, kpoplabels)
        
        modelrock = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelrock.fit(rocksongs, rocklabels)
        
        modelrap = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelrap.fit(rapsongs, raplabels)
            
        lik1 = modelpop.score_samples(data2)
        lik2 = modelrap.score_samples(data2)
        lik3 = modeledm.score_samples(data2)
        lik4 = modelrock.score_samples(data2)
        lik5 = modelindierock.score_samples(data2)
        lik6 = modeljazz.score_samples(data2)
        lik7 = modelcountry.score_samples(data2)
        lik8 = modelclassical.score_samples(data2)
        lik9 = modelkpop.score_samples(data2)
        lik10 = modelmetal.score_samples(data2)
        lik11 = modelrandb.score_samples(data2)
        lik12 = modelhiphop.score_samples(data2)
                
                 # Label More Likely outcome
        labs = []
        for k in range(len(lik1)):
            vals = [lik1[k], lik2[k], lik3[k], lik4[k], lik5[k], lik6[k], lik7[k], lik8[k], lik9[k], lik10[k], lik11[k], lik12[k]]
            if np.argmax(vals) == 0:
                labs.append("pop")
            elif np.argmax(vals) == 1:
                labs.append("rap")
            elif np.argmax(vals) == 2:
                labs.append("edm")
            elif np.argmax(vals) == 3:
                labs.append("rock")
            elif np.argmax(vals) == 4:
                labs.append("indierock")
            elif np.argmax(vals) == 5:
                labs.append("jazz")
            elif np.argmax(vals) == 6:
                labs.append("country")
            elif np.argmax(vals) == 7:
                labs.append("classical")
            elif np.argmax(vals) == 8:
                labs.append("kpop")
            elif np.argmax(vals) == 9:
                labs.append("metal")
            elif np.argmax(vals) == 10:
                labs.append("randb")
            else:
                labs.append("hiphop")
    
            # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Spherical", i, j, params, totalacc]
        l += 1
        
    for j in range(1, 7):
        params = 2 * (np.sum(range(1, i)) + 2*j - 1)
                
        modelpop = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelpop.fit(popsongs, poplabels)
                
        modelhiphop = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelhiphop.fit(hiphopsongs, hiphoplabels)
        
        modeljazz = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modeljazz.fit(jazzsongs, jazzlabels)
        
        modelclassical = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelclassical.fit(classicalsongs, classicallabels)
        
        modeledm = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modeledm.fit(edmsongs, edmlabels)
        
        modelcountry = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelcountry.fit(countrysongs, countrylabels)
        
        modelrandb = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelrandb.fit(randbsongs, randblabels)
        
        modelindierock = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelindierock.fit(indierocksongs, indierocklabels)
        
        modelmetal = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelmetal.fit(metalsongs, metallabels)
        
        modelkpop = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelkpop.fit(kpopsongs, kpoplabels)
        
        modelrock = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelrock.fit(rocksongs, rocklabels)
        
        modelrap = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelrap.fit(rapsongs, raplabels)
            
        lik1 = modelpop.score_samples(data2)
        lik2 = modelrap.score_samples(data2)
        lik3 = modeledm.score_samples(data2)
        lik4 = modelrock.score_samples(data2)
        lik5 = modelindierock.score_samples(data2)
        lik6 = modeljazz.score_samples(data2)
        lik7 = modelcountry.score_samples(data2)
        lik8 = modelclassical.score_samples(data2)
        lik9 = modelkpop.score_samples(data2)
        lik10 = modelmetal.score_samples(data2)
        lik11 = modelrandb.score_samples(data2)
        lik12 = modelhiphop.score_samples(data2)
                
                 # Label More Likely outcome
        labs = []
        for k in range(len(lik1)):
            vals = [lik1[k], lik2[k], lik3[k], lik4[k], lik5[k], lik6[k], lik7[k], lik8[k], lik9[k], lik10[k], lik11[k], lik12[k]]
            if np.argmax(vals) == 0:
                labs.append("pop")
            elif np.argmax(vals) == 1:
                labs.append("rap")
            elif np.argmax(vals) == 2:
                labs.append("edm")
            elif np.argmax(vals) == 3:
                labs.append("rock")
            elif np.argmax(vals) == 4:
                labs.append("indierock")
            elif np.argmax(vals) == 5:
                labs.append("jazz")
            elif np.argmax(vals) == 6:
                labs.append("country")
            elif np.argmax(vals) == 7:
                labs.append("classical")
            elif np.argmax(vals) == 8:
                labs.append("kpop")
            elif np.argmax(vals) == 9:
                labs.append("metal")
            elif np.argmax(vals) == 10:
                labs.append("randb")
            else:
                labs.append("hiphop")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Tied", i, j, params, totalacc]
        l += 1

        # Get maximum accuracy model
dat[dat["Accuracy"] == max(dat["Accuracy"])]
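
The long `if`/`elif` chain that turns an argmax position into a genre name can also be expressed as array indexing. The sketch below uses random stand-in log-likelihoods (an assumption, in place of `lik1` through `lik12`) and keeps the same genre order as the cell above:

```python
import numpy as np

# Genre order matching lik1..lik12 in the cell above
genres = np.array(["pop", "rap", "edm", "rock", "indierock", "jazz",
                   "country", "classical", "kpop", "metal", "randb", "hiphop"])

# Stand-in log-likelihoods (assumption): 12 genre models x 5 test songs
rng = np.random.default_rng(1)
liks = rng.normal(size=(12, 5))

# For each song, pick the genre whose model gives the highest log-likelihood
labs = genres[np.argmax(liks, axis=0)]

# Demo labels: copy the predictions and force one mismatch
test_labels_demo = labs.copy()
test_labels_demo[0] = "other"
acc = np.mean(labs == test_labels_demo)
print(labs, acc)
```

In the real loop, `liks` would be `np.vstack([lik1, lik2, ..., lik12])`, and the accuracy line replaces the second `for` loop.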

In [None]:
# Get ready for 2 genre clustering on possible data set
X = pd.concat([pop, rap])

y = X["genre"]
X = X[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]

train_data, test_data, train_labels, test_labels = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# 2D PCA to visualize
pca_mod = PCA(n_components=2)
pcadat = pca_mod.fit_transform(X)
d = pd.DataFrame(data=pcadat, columns=["Principal Component 1", "PC2"])
pops = d[(y == "pop").reset_index(drop = True)]
raps = d[(y == "rap").reset_index(drop = True)]
plt.plot(pops["Principal Component 1"], pops["PC2"], 'bo', raps["Principal Component 1"], raps["PC2"], 'ro')
plt.xticks([], [])
plt.legend(("Pop", "Rap"))
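
PCA is sensitive to feature scale, and `duration_ms` is measured in milliseconds while most of the audio features lie between 0 and 1, so the first component may mostly track duration. The sketch below uses synthetic data (an assumption, not the project's features) to show the effect of standardizing before PCA. Standardization is an optional variation here, not something the original analysis does.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): one feature in [0, 1], one on a millisecond scale
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(0, 1, 300),
                          rng.normal(200_000, 60_000, 300)])

# Without scaling, the large-range column dominates the first component
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# With scaling, both columns contribute on comparable terms
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)
scaled_ratio = scaled.named_steps["pca"].explained_variance_ratio_
print(raw_ratio.round(4), scaled_ratio.round(4))
```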

In [None]:
# Look for best clustering for Pop and Rap
train_labels = train_labels.reset_index(drop = True)
test_labels = test_labels.reset_index(drop = True)

dat = pd.DataFrame(columns=["Type of Covariance", "Number of PCA Components", "Number of GMM Components", "Parameters", "Accuracy"])
l = 0
    
    # Start of Function
    # Values of PCA Components
for i in range(1, 18):
        # Fit the PCA, get Pos and Neg
    pca_mod = PCA(n_components=i)
    pcadat=pca_mod.fit_transform(train_data) 
    data = pd.DataFrame(data=pcadat)
        
    # Project the test set with the PCA fitted on the training data,
    # so train and test points share the same component axes
    pcadat2 = pca_mod.transform(test_data) 
    data2 = pd.DataFrame(data=pcadat2)
            
    popsongs = data[train_labels == "pop"]
    poplabels = train_labels[train_labels == "pop"]
    

    rapsongs = data[train_labels == "rap"]
    raplabels = train_labels[train_labels == "rap"]
    
    for j in [1, 2, 3, 4, 5, 6, 7, 8]:
        
        params = 2 * (j + (j - 1) + (j*np.sum(range(1, i+1))))
                
        modelpop = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelpop.fit(popsongs, poplabels)
        
        modelrap = GaussianMixture(n_components=j,covariance_type='full',random_state=12345)
        modelrap.fit(rapsongs, raplabels)
            
        poslik = modelpop.score_samples(data2)
        neglik = modelrap.score_samples(data2)
                
                 # Label More Likely outcome
        labs = []
        for k in range(len(poslik)):
            if poslik[k] > neglik[k]:
                labs.append("pop")
            else:
                labs.append("rap")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Full", i, j, params, totalacc]
        l += 1
    
    for j in [1, 2, 3, 4, 5, 6, 7, 8]:
        params = 2 * ((2*j - 1) + i*j)
            
        modelpos = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelpos.fit(popsongs, poplabels)
        
        modelneg = GaussianMixture(n_components=j,covariance_type='diag',random_state=12345)
        modelneg.fit(rapsongs, raplabels)
            
        poslik = modelpos.score_samples(data2)
        neglik = modelneg.score_samples(data2)
                
                 # Label More Likely outcome
        labs = []
        for k in range(len(poslik)):
            if poslik[k] > neglik[k]:
                labs.append("pop")
            else:
                labs.append("rap")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Diag", i, j, params, totalacc]
        l += 1
    
    for j in [1, 2, 3, 4, 5, 6, 7, 8]:
        params = 2 * (3*j - 1)
            
        modelpos = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelpos.fit(popsongs, poplabels)
        
        modelneg = GaussianMixture(n_components=j,covariance_type='spherical',random_state=12345)
        modelneg.fit(rapsongs, raplabels)
            
        poslik = modelpos.score_samples(data2)
        neglik = modelneg.score_samples(data2)
            
            # Label More Likely outcome
        labs = []
        for k in range(len(poslik)):
            if poslik[k] > neglik[k]:
                labs.append("pop")
            else:
                labs.append("rap")
    
            # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Spherical", i, j, params, totalacc]
        l += 1
        
    for j in range(1, 13):
        params = 2 * (np.sum(range(1, i)) + 2*j - 1)
                
        modelpos = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelpos.fit(popsongs, poplabels)
                
        modelneg = GaussianMixture(n_components=j,covariance_type='tied',random_state=12345)
        modelneg.fit(rapsongs, raplabels)
                
        poslik = modelpos.score_samples(data2)
        neglik = modelneg.score_samples(data2)
                
                # Label More Likely outcome
        labs = []
        for k in range(len(poslik)):
            if poslik[k] > neglik[k]:
                labs.append("pop")
            else:
                labs.append("rap")
    
                # Get accuracy
        acc = []
        for k in range(len(labs)):
            if labs[k] == test_labels[k]:
                acc.append(1)
            else:
                acc.append(0)
                    
        totalacc = sum(acc) / len(acc)
            
        dat.loc[l] = ["Tied", i, j, params, totalacc]
        l += 1

        # Get max accuracy
dat[dat["Accuracy"] == max(dat["Accuracy"])]
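
The four inner loops all follow the same recipe: fit one Gaussian mixture per genre, score every test song under each mixture, and assign the genre with the higher log-likelihood. Below is a compact sketch of that recipe on synthetic data. The helper names `fit_class_gmms` and `predict_most_likely` and the toy arrays are hypothetical, and ties go to the first class in sorted order rather than to "rap" as in the loops.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def fit_class_gmms(features, labels, n_components, covariance_type):
    """Fit one Gaussian mixture per class label."""
    models = {}
    for name in np.unique(labels):
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type=covariance_type,
                              random_state=12345)
        models[name] = gmm.fit(features[labels == name])
    return models


def predict_most_likely(models, features):
    """Assign each row to the class whose mixture gives the highest log-likelihood."""
    names = list(models)
    loglik = np.column_stack([models[n].score_samples(features) for n in names])
    return np.array(names)[loglik.argmax(axis=1)]


# Synthetic, well-separated stand-ins for the two genres
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(6, 1, (200, 4))])
toy_y = np.array(["pop"] * 200 + ["rap"] * 200)
train_idx = rng.permutation(400)[:300]
test_mask = np.ones(400, dtype=bool)
test_mask[train_idx] = False

# Fit PCA on the training rows only, then reuse it for the test rows
pca = PCA(n_components=2).fit(toy_X[train_idx])
train_proj = pca.transform(toy_X[train_idx])
test_proj = pca.transform(toy_X[test_mask])

models = fit_class_gmms(train_proj, toy_y[train_idx], 1, "full")
pred = predict_most_likely(models, test_proj)
accuracy = np.mean(pred == toy_y[test_mask])
```

Comparing the predictions with a single vectorized `==` replaces the explicit accuracy loop, and the same two helpers would extend to more than two genres without new branches.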

In [None]:
# Add test on Pure songs (Uses the guaranteed genre songs as the test set)
dat = pd.read_csv("archive/data.csv")
dat_artist = pd.read_csv("archive/data_by_artist.csv")
dat_genres = pd.read_csv("archive/data_by_genres.csv")
dat_year = pd.read_csv("archive/data_by_year.csv")
dat_w_genres = pd.read_csv("archive/data_w_genres.csv")

# Genre list fix
dat_w_genres["genres"] = dat_w_genres["genres"].apply(lambda x: x.replace("'", "").strip('][').split(', '))

# Will be left with some observations as [''], shouldn't matter given what we do later
# Get list for pure artists
dat_w_genres = dat_w_genres[dat_w_genres["genres"].str.len() == 1]
dat_w_genres = dat_w_genres.reset_index(drop = True)

# Artists list fix
dat["artists"] = dat["artists"].apply(lambda x: x.replace("'", "").strip('][').split(', '))

# Get all pure pop, rap artists
pop_artists = dat_w_genres[[True if "pop" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
rap_artists = dat_w_genres[[True if "rap" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

# Find songs for those artists
possible_pop = dat[[pop_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_rap = dat[[rap_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]

# Choose the correct features, add labels
pops = possible_pop[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]
pops["genre"] = "pop"

raps = possible_rap[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]
raps["genre"] = "rap"

Xs = pd.concat([pops, raps])

y_labels = Xs["genre"]
y_labels = y_labels.reset_index(drop = True)
x_data = Xs[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]
x_data = x_data.reset_index(drop = True)

In [None]:
# Test the model trained on the possible (larger) data set against the pure songs
# Uses the best configuration found above: 7 PCA components, 6 tied GMM components
train_labels = train_labels.reset_index(drop = True)

pca_mod = PCA(n_components=7)
pcadat=pca_mod.fit_transform(train_data) 
data = pd.DataFrame(data=pcadat)
        
# Project the pure songs with the PCA fitted on the training data
pcadat2 = pca_mod.transform(x_data) 
data2 = pd.DataFrame(data=pcadat2)

popsongs = data[train_labels == "pop"]
poplabels = train_labels[train_labels == "pop"]

rapsongs = data[train_labels == "rap"]
raplabels = train_labels[train_labels == "rap"]

modelpop = GaussianMixture(n_components=6,covariance_type='tied',random_state=12345)
modelpop.fit(popsongs, poplabels)
        
modelrap = GaussianMixture(n_components=6,covariance_type='tied',random_state=12345)
modelrap.fit(rapsongs, raplabels)
            
poslik = modelpop.score_samples(data2)
neglik = modelrap.score_samples(data2)
                
                 # Label More Likely outcome
labs = []
for k in range(len(poslik)):
    if poslik[k] > neglik[k]:
        labs.append("pop")
    else:
        labs.append("rap")
    
                # Get accuracy
acc = []
for k in range(len(labs)):
    if labs[k] == y_labels[k]:
        acc.append(1)
    else:
        acc.append(0)
                    
totalacc = sum(acc) / len(acc)

# Accuracy on Pure as test
totalacc

In [None]:
# Test data is now the possible pop and rap songs (a large but imperfect set, since some songs may be neither genre)
# Will use to test the pure models (built on guaranteed songs)
test_data = pd.concat([pop, rap])
test_labels = test_data["genre"]
test_labels = test_labels.reset_index(drop = True)
test_data = test_data[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]
test_data = test_data.reset_index(drop = True)

In [None]:
# Reimport to get the pure data
dat = pd.read_csv("archive/data.csv")
dat_artist = pd.read_csv("archive/data_by_artist.csv")
dat_genres = pd.read_csv("archive/data_by_genres.csv")
dat_year = pd.read_csv("archive/data_by_year.csv")
dat_w_genres = pd.read_csv("archive/data_w_genres.csv")

In [None]:
# Fix for list of genres - changes from string of list to actual list
dat_w_genres["genres"] = dat_w_genres["genres"].apply(lambda x: x.replace("'", "").strip('][').split(', '))

# Will be left with some observations as [''], shouldn't matter given what we do later
dat_w_genres = dat_w_genres[dat_w_genres["genres"].str.len() == 1]
dat_w_genres = dat_w_genres.reset_index(drop = True)

In [None]:
# Artist fix
dat["artists"] = dat["artists"].apply(lambda x: x.replace("'", "").strip('][').split(', '))
# Find the artists
pop_artists = dat_w_genres[[True if "pop" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]
rap_artists = dat_w_genres[[True if "rap" in dat_w_genres.loc[i,"genres"] else False for i in range(len(dat_w_genres))]]["artists"]

In [None]:
# Find the songs for pop and rap artists
possible_pop = dat[[pop_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]
possible_rap = dat[[rap_artists.isin(dat.loc[i, "artists"]).any() for i in range(len(dat))]]

In [None]:
# Get numeric features, add genres
pop = possible_pop[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]
pop["genre"] = "pop"

rap = possible_rap[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]
rap["genre"] = "rap"

X = pd.concat([pop, rap])

In [None]:
# Use previous best to evaluate other data set
# Best model from pure using the possible data as a test set
train_labels = X["genre"]
train_labels = train_labels.reset_index(drop = True)
train_data = X[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'key', 'mode', 'loudness', 'explicit', 'duration_ms']]
train_data = train_data.reset_index(drop = True)

pca_mod = PCA(n_components=5)
pcadat=pca_mod.fit_transform(train_data) 
data = pd.DataFrame(data=pcadat)
        
# Project the test songs with the PCA fitted on the training data
pcadat2 = pca_mod.transform(test_data) 
data2 = pd.DataFrame(data=pcadat2)

popsongs = data[train_labels == "pop"]
poplabels = train_labels[train_labels == "pop"]

rapsongs = data[train_labels == "rap"]
raplabels = train_labels[train_labels == "rap"]

modelpop = GaussianMixture(n_components=1,covariance_type='full',random_state=12345)
modelpop.fit(popsongs, poplabels)
        
modelrap = GaussianMixture(n_components=1,covariance_type='full',random_state=12345)
modelrap.fit(rapsongs, raplabels)
            
poslik = modelpop.score_samples(data2)
neglik = modelrap.score_samples(data2)
                
                 # Label More Likely outcome
labs = []
for k in range(len(poslik)):
    if poslik[k] > neglik[k]:
        labs.append("pop")
    else:
        labs.append("rap")
    
                # Get accuracy
acc = []
for k in range(len(labs)):
    if labs[k] == test_labels[k]:
        acc.append(1)
    else:
        acc.append(0)
                    
totalacc = sum(acc) / len(acc)

# Get the overall accuracy
totalacc

## Switch to Supervised (Predicting Popularity)

#### Linear and Logistic Regression

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import linear_model

In [None]:
dat = pd.read_csv("archive/data.csv")
dat_artist = pd.read_csv("archive/data_by_artist.csv")
dat_genres = pd.read_csv("archive/data_by_genres.csv")
dat_year = pd.read_csv("archive/data_by_year.csv")
dat_w_genres = pd.read_csv("archive/data_w_genres.csv")

In [None]:
# Set features to the numeric ones, get Popularity as Y
X = dat[['acousticness', 
       'danceability',
       'energy',
       'year', 
       'explicit',
       'instrumentalness', 
       'key', 
       'liveness', 
       'loudness',
       'mode', 
       'speechiness', 
       'tempo',
        'valence', 'duration_ms']]
y = dat["popularity"]

In [None]:
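# Histogram with a density estimate to visualize the popularity values
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
import seaborn as sns
sns.histplot(y, kde=True)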
plt.xlabel("Popularity")

In [None]:
# Train Test split on the initial data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
y_test.head()

In [None]:
# Look to see if non-distortionary scaling will impact our regression results
from sklearn import preprocessing
mm_scaler = preprocessing.MinMaxScaler()

X_train_minmax = mm_scaler.fit_transform(X_train)
X_test_minmax = mm_scaler.transform(X_test)

In [None]:
# Run Transformed linear regression, get Mean Squared Error
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_minmax, y_train)
preds = model.predict(X_test_minmax)

mean_squared_error(y_test, preds)

In [None]:
# Run classic linear regression no correction
model1 = LinearRegression()
model1.fit(X_train, y_train)
preds = model1.predict(X_test)

mean_squared_error(y_test, preds)

In [None]:
# Get score (R^2) value for the transformed linear regression
model.score(X_test_minmax, y_test)

In [None]:
# Get score (R^2) value for the classic linear regression (slightly lower but functionally the same)
model1.score(X_test, y_test)
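
The two scores are expected to agree because ordinary least squares with an intercept is unaffected by rescaling each feature: the coefficients change, but the fitted values do not. A small sketch on synthetic data (the `toy_X` and `toy_y` arrays are assumptions for illustration) shows the predictions matching up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Three synthetic features on very different scales
toy_X = rng.normal(size=(300, 3)) * np.array([1.0, 50.0, 2000.0])
toy_y = toy_X @ np.array([2.0, -0.1, 0.003]) + rng.normal(size=300)

# Fit the scaler on the training rows only, as in the cells above
scaler = MinMaxScaler().fit(toy_X[:200])
raw_model = LinearRegression().fit(toy_X[:200], toy_y[:200])
scaled_model = LinearRegression().fit(scaler.transform(toy_X[:200]), toy_y[:200])

raw_preds = raw_model.predict(toy_X[200:])
scaled_preds = scaled_model.predict(scaler.transform(toy_X[200:]))

# Largest disagreement between the two sets of predictions
max_gap = np.max(np.abs(raw_preds - scaled_preds))
print(max_gap)
```

Any visible difference between the two scores on the real data is therefore numerical noise rather than an effect of the scaling. Scaling does matter for the penalized models further down, where the penalty depends on coefficient size.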

In [None]:
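# Run linear regression on standardized features
# (the `normalize` argument was removed from LinearRegression in scikit-learn 1.2,
# so the scaling is an explicit pipeline step here)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model2 = make_pipeline(StandardScaler(), LinearRegression())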
model2.fit(X_train, y_train)
preds = model2.predict(X_test)

mean_squared_error(y_test, preds)

In [None]:
# See score, slight improvement overall
model2.score(X_test, y_test)

In [None]:
# Get Prediction vs Residual plot
resid = preds - y_test
plt.plot(preds, resid, 'bo')

In [None]:
# Set up data frame for Pop, Pred, Resid
data = {"Actual Popularity": y_test, "Predicted": preds, "Residuals": preds-y_test}
data = pd.DataFrame(data=data)
data.head()

In [None]:
# Residual Prediction Plot
import seaborn as sns
g = sns.lmplot(x="Predicted", y="Residuals", data=data)

In [None]:
# Nice plot for popularity vs predicted
import seaborn as sns
g = sns.lmplot(x="Actual Popularity", y="Predicted", data=data)

In [None]:
# Ridge Regression CV
reg = linear_model.RidgeCV(alphas=np.logspace(-9, 9, 19))
reg.fit(X_train, y_train)
reg.alpha_

In [None]:
# Score for Ridge CV Model
reg.score(X_test, y_test)

In [None]:
# LASSO CV
reg = linear_model.LassoCV(cv=5,alphas=np.logspace(-6, 6, 13), max_iter=1000000).fit(X_train, y_train)
reg.alpha_

In [None]:
# Score for LASSO
reg.score(X_test, y_test)

In [None]:
# CV for Elastic Net
regr = linear_model.ElasticNetCV(cv=5, random_state=0, alphas=np.logspace(-6,6,13), max_iter=10000000, l1_ratio=[.1, .2, .3, .4, .5, .6, .7, .8, .9])
regr.fit(X_train, y_train)
regr.alpha_

In [None]:
# Get other parameter (l1 share)
regr.l1_ratio_

In [None]:
# Score for elastic net model
regr.score(X_test, y_test)
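
Ridge, LASSO and elastic net penalize coefficient size, so unlike plain least squares their results depend on the scale of each feature. One common arrangement, sketched here on synthetic data, is to standardize inside a pipeline so the scaler is refit on whatever rows the model sees during `fit`. The alpha grid and the toy arrays are assumptions, and this is not the configuration used in the cells above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales, with a known linear signal
toy_X = rng.normal(size=(400, 3)) * np.array([1.0, 10.0, 100_000.0])
toy_y = toy_X[:, 0] * 3.0 + toy_X[:, 2] * 1e-5 + rng.normal(size=400)

# Standardize, then let RidgeCV pick alpha from a small illustrative grid
alphas = np.logspace(-3, 3, 7)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge.fit(toy_X[:300], toy_y[:300])

chosen_alpha = ridge.named_steps["ridgecv"].alpha_
test_r2 = ridge.score(toy_X[300:], toy_y[300:])
print(chosen_alpha, test_r2)
```

With standardized inputs the same alpha grid means the same thing for every feature, which makes the selected value easier to interpret.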

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# Get mean for later usage (binarize)
np.mean(y_train)

In [None]:
# Do Logistic Regression for > 50
y_train = [1 if i > 50 else 0 for i in y_train]
y_test = [1 if i > 50 else 0 for i in y_test]

In [None]:
# Binarize on 50, find best model
from sklearn.metrics import accuracy_score
dat = pd.DataFrame(columns=["Regularization Parameter Lambda", "L1 or L2", "Accuracy"])
l = 0
    # Sweep the inverse regularization strength C for both penalties
for i in np.logspace(-6,6,13):
    modell1 = LogisticRegression(C=i, solver="liblinear", multi_class="auto", penalty="l1", tol=0.015)
    modell1.fit(X_train, y_train)
    preds = modell1.predict(X_test)
    dat.loc[l] = [(1/i), "l1", accuracy_score(y_test, preds)]
    l+=1
    
    modell2 = LogisticRegression(C=i, solver="liblinear", multi_class="auto", penalty="l2", tol=0.015)
    modell2.fit(X_train, y_train)
    preds = modell2.predict(X_test)
    dat.loc[l] = [(1/i), "l2", accuracy_score(y_test, preds)]
    l+=1

In [None]:
# Can predict if > 50 with 80% accuracy, best model
dat[dat["Accuracy"] == max(dat["Accuracy"])]

In [None]:
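# Binarize on mean
# y_train and y_test already hold the 0/1 labels from the > 50 split, so repeat
# the same seeded split to recover the raw popularity scores before thresholding
_, _, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_train = [1 if i > 31.56 else 0 for i in y_train]
y_test = [1 if i > 31.56 else 0 for i in y_test]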

In [None]:
# Get best logistic regression for new binarize
from sklearn.metrics import accuracy_score
dat = pd.DataFrame(columns=["Regularization Parameter Lambda", "L1 or L2", "Accuracy"])
l = 0
    # Sweep the inverse regularization strength C for both penalties
for i in np.logspace(-6,6,13):
    modell1 = LogisticRegression(C=i, solver="liblinear", multi_class="auto", penalty="l1", tol=0.015)
    modell1.fit(X_train, y_train)
    preds = modell1.predict(X_test)
    dat.loc[l] = [(1/i), "l1", accuracy_score(y_test, preds)]
    l+=1
    
    modell2 = LogisticRegression(C=i, solver="liblinear", multi_class="auto", penalty="l2", tol=0.015)
    modell2.fit(X_train, y_train)
    preds = modell2.predict(X_test)
    dat.loc[l] = [(1/i), "l2", accuracy_score(y_test, preds)]
    l+=1

In [None]:
# Get accuracy
dat[dat["Accuracy"] == max(dat["Accuracy"])]

In [None]:
# Percentage of 0's
1 - (sum(y_train) / len(y_train))
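
The share of 0's matters because it is the accuracy a model would get by always predicting the majority class, so the logistic regression accuracies are best read against it. A minimal sketch of that baseline with scikit-learn's `DummyClassifier`, using made-up label arrays (`toy_train`, `toy_test`) rather than the notebook's data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up binary labels: 80% zeros in training, 75% zeros in test
toy_train = np.array([0] * 80 + [1] * 20)
toy_test = np.array([0] * 15 + [1] * 5)

# The baseline ignores the features, so a placeholder column is enough
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(toy_train), 1)), toy_train)

baseline_acc = accuracy_score(
    toy_test, baseline.predict(np.zeros((len(toy_test), 1)))
)
print(baseline_acc)  # 0.75, the share of the majority class in toy_test
```

A classifier is only informative to the extent that it beats this number, which is especially relevant for the > 50 cutoff where the classes are unbalanced.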

In [None]:
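# Get median of the raw popularity scores for binarize
# (y_train holds 0/1 labels at this point, so recover the raw training scores with the same seeded split)
_, _, y_train_raw, _ = train_test_split(X, y, test_size=0.2, random_state=42)
np.median(y_train_raw)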

In [None]:
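# Binarize on median
# y_train and y_test hold 0/1 labels from the previous cutoff, so repeat
# the same seeded split to recover the raw popularity scores before thresholding
_, _, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_train = [1 if i > 33 else 0 for i in y_train]
y_test = [1 if i > 33 else 0 for i in y_test]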

In [None]:
# Logistic with median binarize
from sklearn.metrics import accuracy_score
dat = pd.DataFrame(columns=["Regularization Parameter Lambda", "L1 or L2", "Accuracy"])
l = 0
    # Sweep the inverse regularization strength C for both penalties
for i in np.logspace(-6,6,13):
    modell1 = LogisticRegression(C=i, solver="liblinear", multi_class="auto", penalty="l1", tol=0.015)
    modell1.fit(X_train, y_train)
    preds = modell1.predict(X_test)
    dat.loc[l] = [(1/i), "l1", accuracy_score(y_test, preds)]
    l+=1
    
    modell2 = LogisticRegression(C=i, solver="liblinear", multi_class="auto", penalty="l2", tol=0.015)
    modell2.fit(X_train, y_train)
    preds = modell2.predict(X_test)
    dat.loc[l] = [(1/i), "l2", accuracy_score(y_test, preds)]
    l+=1

In [None]:
# Get accuracy
dat[dat["Accuracy"] == max(dat["Accuracy"])]
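
The three cutoff experiments (fixed 50, mean, median) differ only in the threshold, so they can be run from one loop that keeps the raw popularity scores untouched and derives fresh labels each time. The sketch below does this on synthetic data. The `binarize` helper, the toy arrays and the default regularization strength are assumptions, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def binarize(scores, threshold):
    """Return 1 where the score exceeds the threshold, else 0."""
    return (np.asarray(scores) > threshold).astype(int)


# Synthetic popularity scores driven mostly by the first feature
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(600, 3))
toy_pop = np.clip(35 + 15 * toy_X[:, 0] + rng.normal(scale=5, size=600), 0, 100)
X_tr, X_te, pop_tr, pop_te = train_test_split(
    toy_X, toy_pop, test_size=0.2, random_state=42
)

# Thresholds are computed from the raw training scores, never from 0/1 labels
cutoffs = [("fixed 50", 50),
           ("mean", pop_tr.mean()),
           ("median", np.median(pop_tr))]

rows = []
for name, threshold in cutoffs:
    clf = LogisticRegression(solver="liblinear")
    clf.fit(X_tr, binarize(pop_tr, threshold))
    acc = accuracy_score(binarize(pop_te, threshold), clf.predict(X_te))
    rows.append({"Threshold": name, "Cutoff": threshold, "Accuracy": acc})

results = pd.DataFrame(rows)
print(results)
```

Keeping `pop_tr` and `pop_te` as raw scores means each cutoff is applied to the original values, and the results land in one table for side-by-side comparison.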