# The Most Complete Spotify Genre Analysis

![](https://images.unsplash.com/photo-1513829596324-4bb2800c5efb?ixlib=rb-1.2.1&auto=format&fit=crop&w=750&q=80)

### In this notebook, we will explore the hidden facts about Spotify genres and answer the following questions:

### * What do genres look like?

We will explore the hidden truths behind the data by creating
heatmaps, scatterplots, barplots, a word cloud, etc.

### * Can we predict genres at all?
 
We will test how well the K-Nearest Neighbors and cosine similarity algorithms
impute missing genres.

### * Can we cluster genres into bigger genre groups?

How could we teach machines to tell the difference between the (canadian pop, swedish pop, dance pop) 
 and (blues, country, folk music) genre groups? K-Means Clustering is our method of choice!
 
### * Are the newly discovered clusters that different?

To evaluate our new findings, we will visualize the results, calculate silhouette scores, and run t-tests and one-way ANOVA tests for each choice of the number of clusters. 

### And much more... So let's get started!

In [None]:
#basic libraries
!pip install ppscore
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import ppscore as pps
import ast
from tqdm.notebook import tqdm
import math
from collections import Counter

#visualization
import seaborn as sns
from matplotlib import pyplot as plt
from wordcloud import WordCloud
plt.style.use("ggplot")

#statistical analysis & machine learning
from sklearn.cluster import KMeans as KM
from sklearn.metrics import silhouette_score as score
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split as splitter
from sklearn.model_selection import cross_val_score as validator
from statsmodels.stats import power as sms
from scipy.stats import pearsonr, shapiro, ttest_ind, f_oneway, levene

#text preprocessing
import nltk
import string
from nltk.corpus import stopwords

In [None]:
#string representation of list -> list
def str2list(x):
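    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        #not a valid list literal (e.g. the 0 placeholder used for missing genres)
        return np.nan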

#accuracy metric
def accuracy(true, pred):
    return sum(true == pred)/len(pred)

#calculate accuracy of knn model (imputing algorithm 1)
def knn_score(k, X_known, y_known):
    model = KNN(n_neighbors = k)
    return validator(model, X_known, y_known, cv=5)

#calculate accuracy of many models
def many_knn_score(start_n, end_n, X_known, y_known):
    scores = []
    for i in range(start_n, end_n):
        scores.append(knn_score(i, X_known, y_known).mean())
    progress = pd.Series(scores, index = np.arange(start_n, end_n))
    fig = plt.figure(figsize = (10, 10))
    ax = fig.subplots()
    progress.plot(ax=ax, kind="line")
    ax.set_ylabel("accuracy")
    ax.set_xlabel("n_neighbors")
    plt.show()

#cosine similarity
def cosine_similarity(x1, x2):
    return np.dot(x1, x2)/(np.linalg.norm(x1)*np.linalg.norm(x2))

#imputing algorithm 2
def similarity_algorithm(X_train):
    similarity_lists = []
    for x in tqdm(range(len(X_train))):
        similarity_list = []
        for genre in range(len(genre_profile)):
            similarity = cosine_similarity(X_train.iloc[x], genre_profile.iloc[genre])
            similarity_list.append(similarity)
        similarity_lists.append(similarity_list)
    df = pd.DataFrame(similarity_lists, columns=popular_genres) 
    return df.transpose()

#silhouette score
def calc_silhouette(X, y):
    return score(X, y)

#determine clusters and find out how good model is using silhouette score, t-test and/or anova
def k_means(data, n_clusters, n_components=2):
    model = KM(n_clusters=n_clusters)
    model.fit(df_std)
    preds = model.predict(data)
    decomposer = PCA(n_components=n_components)
    decomposer.fit(df_std)
    data_for_plot = decomposer.transform(data)
    cols = ["x"+str(x+1) for x in range(n_components)]
    df_genre_profile = pd.DataFrame(data_for_plot, index=data.index, columns=cols)
    df_genre_profile["cluster"] = preds
    cluster_score = calc_silhouette(data, preds)
    tester = Test(alpha=0.05)
    anova_score = tester.anova(*[df_genre_profile[df_genre_profile["cluster"] == x]["x1"].values for x in range(n_clusters)])
    anova_score = anova_score["anova_stat"] if anova_score["test_is_accepted"] else None
    ttest_score = None
    if n_clusters == 2:
        ttest_score = tester.ttest(df_genre_profile[df_genre_profile["cluster"] == 0]["x1"].values,
                                   df_genre_profile[df_genre_profile["cluster"] == 1]["x1"].values)
        ttest_score = ttest_score["ttest_stat"] if ttest_score["test_is_accepted"] else None
    return df_genre_profile, cluster_score, anova_score, ttest_score

#plot a kmeans model
def plot_k_means(data, n_clusters):
    df_genre_profile, cluster_score, anova_score, ttest_score = k_means(data=data, n_clusters=n_clusters)
    rgb_colormap = np.random.randint(0, 255, size=(n_clusters, 3))/255
    rgb_values = rgb_colormap[df_genre_profile["cluster"]]
    
    fig = plt.figure(figsize = (10, 10))
    ax = fig.subplots()
    df_genre_profile.plot(ax=ax, x="x1", y="x2", kind = "scatter", c = rgb_values)
    title = f"n_clusters: {n_clusters}, silhouette: {cluster_score:.4f}"
    #anova_score and ttest_score are floats, or None when the test was rejected
    if anova_score is not None:
        title += f", anova: {anova_score:.4f}"
    if ttest_score is not None:
        title += f", t-test: {ttest_score:.4f}"
    ax.set_title(title)
    return ax

#plot many k-means models based on different n_clusters
def k_many_clusters(data, start_n, end_n):
    for i in range(start_n, end_n):
        plot_k_means(data, i)
        plt.show()
        
#Test class includes both ttest and anova tests
class Test():
    def __init__(self, alpha, power=0.90, only_result=True, ind_limit=0.20):
        """alpha and power required for identifying the min. sample size
        and ind_limit that defines the dependence using correlation coefficient"""
        self.alpha = alpha
        self.power = power
        self.only_result = only_result
        self.ind_limit = ind_limit
    def ttest(self, a, b):
        """min. sample size, shapiro, pearsonr and ttest, and their corresponding p-values"""
        only_result = self.only_result
        power, alpha = self.power, self.alpha
        p_a = a.mean()
        p_b = b.mean()
        n_a = len(a)
        n_b = len(b)
        effect_size = (p_b-p_a)/a.std()
        n_req = int(sms.TTestPower().solve_power(effect_size=effect_size, power=power, alpha=alpha))
        if len(a) > len(b):
            a = a[:len(b)]
        elif len(a) < len(b):
            b = b[:len(a)]
        stat1, p1 = shapiro(a)
        stat2, p2 = shapiro(b)
        stat3, p3 = pearsonr(a, b)
        stat4, p4 = ttest_ind(b, a)
        
        result_dict = {"power": power, "alpha": alpha, "n_req": n_req,
                       "n_control": n_a, "n_test": n_b, "shapiro_control_stat": stat1,
                       "shapiro_control_p": p1, "shapiro_test_stat": stat2, "shapiro_test_p": p2,
                       "pearsonr_stat": stat3, "pearsonr_p": p3, "ttest_stat": stat4,
                       "ttest_p": p4, "ind_limit": self.ind_limit, "very_low_number": n_req > n_a or n_req > n_b,
                       "control_is_normal": alpha < p1, "test_is_normal": alpha < p2,
                       "very_low_correlation": self.ind_limit > abs(stat3), "very_high_dependence": p3 < alpha,
                       "no_difference": p4 > alpha, "test_is_bigger": stat4 > 0, "control_is_bigger": stat4 < 0}
        
        accepted = all([
            not(result_dict["very_low_number"]),
            (result_dict["control_is_normal"] and result_dict["test_is_normal"]),
            (not(result_dict["very_high_dependence"]) or result_dict["very_low_correlation"])
        ])
        
        result_dict.update({"test_is_accepted": accepted})
        result_dict = {key: result_dict[key] for key in ["ttest_stat", "ttest_p", "test_is_accepted"]} if only_result else result_dict
        return result_dict
    def anova(self, *args):
        """shapiro, levene, one-way anova and their corresponding p-values"""
        only_result = self.only_result
        alpha = self.alpha
        normality = [shapiro(x) for x in args]
        every_group_is_normal = True if all([x[1] > alpha for x in normality]) else False
        stat1, p1 = levene(*args)
        equal_variance = False if(p1 < alpha) else True
        stat2, p2 = f_oneway(*args)
        equal_means = False if (p2 < alpha) else True
        accepted = all([every_group_is_normal, equal_variance, not(equal_means)])
        result_dict = {"alpha": alpha, "normality":every_group_is_normal, "levene_stat":stat1, "levene_p": p1,
                       "homogenity": equal_variance, "anova_stat": stat2, "anova_p": p2, 
                       "groups_are_different": not(equal_means), "test_is_accepted": accepted}
        result_dict = {key: result_dict[key] for key in ["anova_stat", "anova_p", "test_is_accepted"]} if only_result else result_dict
        return result_dict
    
#predictions over new data
def predict_cluster(sample):
    sample = sample.copy()
    for i in range(len(in_cols)):
        col = in_cols[i]
        sample[col] = (sample[col]-genre_means[i])/genre_stds[i]
    df_genre_profile = k_means(sample, 4)[0]
    return df_genre_profile["cluster"].to_dict()

#generate fake name for clusters
def create_cluster_name(artists_clusters):
    counts = []
    artists_data = [artists_clusters.query("cluster == "+str(x)).index for x in range(4)]
    for data in artists_data:
        data_string = " ".join(data).lower()
        tokens = nltk.word_tokenize(data_string)
        stopset = set(stopwords.words('english') + list(string.punctuation) + ["orchestra", "band", "symphony"])
        data = [token for token in tokens if token not in stopset and len(token) > 2]
        count = pd.Series(dict(Counter(data)))
        counts.append(count.sort_values(ascending=False)[:3].to_dict())
    name_dict = {}
    for i, count in enumerate(counts):
        indices = np.random.permutation(len(count))
        count = np.array(list(count.keys()))
        count = count[indices]
        genre_name = " ".join(count)
        name_dict.update({"Cluster "+str(i): genre_name})
    return name_dict

In [None]:
df = pd.read_csv("/kaggle/input/spotify-dataset-19212020-160k-tracks/data_by_genres.csv")
df_2 = pd.read_csv("/kaggle/input/spotify-dataset-19212020-160k-tracks/data_w_genres.csv")

In [None]:
out_cols = ["genres", "artists", "mode", "count", "key"]
in_cols = [x for x in df.columns if x not in out_cols] 

df = df.set_index("genres")[in_cols].drop("[]", axis=0)
df #genre data

In [None]:
#mark empty genre lists ("[]") as missing, then fill the missing values with 0
df_2.set_index("artists", inplace=True)
df_2.loc[df_2["genres"] == "[]", "genres"] = np.nan
df_2["genres"] = df_2["genres"].fillna(0)
df_2

In [None]:
#standardize data
df_2_std = df_2.copy()
for col in in_cols:
    df_2_std[col] = (df_2[col]-df_2[col].mean())/df_2[col].std()
       
#extract individual genres from genre lists
df_2_std.reset_index(inplace = True)
collist = list(df_2_std.columns)
new_rows = []
for index in tqdm(range(len(df_2_std))):
    row = df_2_std.iloc[index]
    genre_list = str2list(row["genres"])
    row = pd.DataFrame(row).transpose()
    if(not(isinstance(genre_list, list) and len(genre_list) != 0)):
        pass
    else:
        if(len(genre_list) == 1):
            row["genres"] = genre_list[0]
            new_rows.append(list(row.values[0]))
        else:
            row = pd.concat([row for i in range(len(genre_list))], axis=0)
            row["genres"] = genre_list
            for i in range(len(genre_list)):
                new_rows.append(list(row.values[i]))
                
df_known = pd.DataFrame(new_rows, columns = collist)

In [None]:
#export
df_known.to_csv("data_each_genres.csv")
df_known

In [None]:
X_known = df_known[in_cols]
y_known = df_known["genres"]

In [None]:
#missing data
df_unknown = df_2_std[df_2_std["genres"] == 0]
df_unknown

In [None]:
X_unknown = df_unknown[in_cols]
y_unknown = df_unknown["genres"]

In [None]:
correlations = pps.matrix(df_known.reset_index()[["artists", "genres"]])

fig = plt.figure(figsize=(5, 5))
ax = fig.subplots()
sns.heatmap(pd.DataFrame(correlations["ppscore"].values.reshape(2, 2),
                         columns = ["artists", "genres"], index = ["artists", "genres"]),
                         cmap = "Wistia", ax = ax)
ax.set_title("Predictive Power Score of Artists and Genres")
plt.show()

### -> The artists feature isn't a good indicator (its predictive power score is nearly 0), so we cannot really predict genres by artists.

In [None]:
y_known.value_counts()[:25].to_dict()

In [None]:
fig = plt.figure(figsize = (10, 10))
ax = fig.subplots()
y_known.value_counts()[:25].plot(ax=ax, kind = "pie")
ax.set_ylabel("")
ax.set_title("Top 25 most popular genres")
plt.show()

### -> There is a huge bias towards the most popular genres, especially Pop and Rock. In the following sections, we will increase the accuracy of missing data imputation with a simple trick: only include the most popular genres!
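
To get a feeling for how strong this bias is, one can measure which share of all rows the top-k genres cover. The following is a minimal sketch on made-up labels; `top_k_coverage` and `toy_genres` are hypothetical names, not part of the notebook above.

```python
from collections import Counter

def top_k_coverage(labels, k):
    """Share of all rows that belong to the k most frequent labels."""
    counts = Counter(labels)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(labels)

# Hypothetical toy labels, not taken from the dataset
toy_genres = ["pop", "pop", "rock", "pop", "jazz", "rock", "pop", "folk"]
print(top_k_coverage(toy_genres, 2))  # 0.75
```

On the real data, the same idea could be applied to `y_known` to see how much of the dataset survives when only the top 25 genres are kept.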

In [None]:
max_words = 400
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', max_words = max_words, colormap="viridis",
                min_font_size = 10).generate(" ".join(df.index))

plt.figure(figsize=(10, 10))
plt.imshow(wordcloud)
plt.axis("off") 
plt.tight_layout(pad = 0)
plt.title(f"The {max_words} most frequent words in genres")
plt.show()

### -> The most frequent words mostly come from the most popular genres. Pop and Rock, the top 2 most popular genres, are also the top 2 most frequent words, which suggests that these genres have many subgenres (e.g. Canadian Pop => Pop, Alternative Rock => Rock).
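
The word cloud only shows frequencies visually. If exact numbers are needed, the words in the genre names can be counted directly. A small sketch, assuming genre names are plain whitespace-separated strings; `genre_word_counts` and `toy_names` are hypothetical.

```python
from collections import Counter

def genre_word_counts(genre_names):
    """Count how often each whitespace-separated word appears across genre names."""
    counts = Counter()
    for name in genre_names:
        counts.update(name.lower().split())
    return counts

# Hypothetical toy genre names, not taken from the dataset
toy_names = ["canadian pop", "dance pop", "alternative rock", "pop rock", "folk"]
counts = genre_word_counts(toy_names)
print(counts.most_common(2))  # [('pop', 3), ('rock', 2)]
```

In the notebook, `genre_word_counts(df.index)` would be the corresponding call.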

In [None]:
fig = plt.figure(figsize=(10, 10))
ax = fig.subplots()
ax.set_title("Top 25 artists having the most genres")
ax.set_ylabel("Genres")
ax.set_xlabel("Artists")
df_known["artists"].value_counts()[:25].plot(ax=ax, kind="bar")
plt.show()

### -> There are nearly 10k artists who do not have any genre at all, and there are artists who have a lot of genres (Deerhunter (23), followed by Wire (20)). Artists of the second type are pretty rare, which makes the data very sparse. That's why the artists' predictive power score is so low. 

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Long_tail.svg/1200px-Long_tail.svg.png)

### -> Visualization of "Long Tail" phenomenon, X=number of genres, Y=number of artists

In [None]:
many_knn_score(1, 6, X_known, y_known)

### -> When we simply impute missing genres using KNN, the accuracy is very low. The more neighbors are used, the more accurate the results are: N_neighbors = 100 => Acc = 0.04, N_neighbors = 500 => Acc = 0.06. The most probable reason for the increase is that the data is biased towards the most frequent genres, so larger neighborhoods vote for them more often. A natural remedy is to include only the most popular genres. 
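
One way to check whether KNN learns anything beyond the class imbalance is to compare it with a classifier that always predicts the most frequent label. The sketch below uses synthetic data in which the features carry no information about the label; all names and numbers are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic, imbalanced toy data: the features are pure noise
X = rng.normal(size=(300, 4))
y = rng.choice(["pop", "rock", "jazz"], size=300, p=[0.7, 0.2, 0.1])

# Accuracy of always predicting the most frequent class
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()

knn_scores = {}
for k in (1, 25, 100):
    model = KNeighborsClassifier(n_neighbors=k)
    knn_scores[k] = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k:>3}: knn={knn_scores[k]:.3f}  majority baseline={baseline:.3f}")
```

If KNN with a large `n_neighbors` only approaches the baseline, the gain comes from the label distribution rather than from the audio features.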

In [None]:
popular_genres = list(y_known.value_counts()[:25].index)
df_known_w_populars = df_known[df_known["genres"].isin(popular_genres)]
X_known_w_populars = df_known_w_populars[in_cols]
y_known_w_populars = df_known_w_populars["genres"]

many_knn_score(1, 26, X_known_w_populars, y_known_w_populars)

### -> When we only use the most popular genres in the KNN imputing and increase the number of neighbors up to 25, the accuracy rises to 0.22: what a great improvement! Let's increase the number of neighbors further!

In [None]:
many_knn_score(250, 260, X_known_w_populars, y_known_w_populars)

### -> It looks like the accuracy does not improve any more... Imputing missing genres this way is not a good idea, because only about 1 in 5 predictions is correct. 

In [None]:
genre_profile = df_known_w_populars[["genres", *in_cols]].groupby("genres").mean()
similarity_matrix=similarity_algorithm(X_known)

preds=list(similarity_matrix.index[similarity_matrix.values.argmax(axis=0)])

accuracy(preds, y_known)

### -> As it turns out, cosine similarity isn't a good measure for this specific purpose: its accuracy score (5e-3) is the worst of all the approaches. It is even worse than filling all missing genres with the most popular genre, Pop, which results in 9e-3. It is also time-consuming and requires a lot of memory.
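
Most of the runtime comes from the nested Python loops. As a sketch of a faster variant, the same cosine similarities can be computed with one matrix product, assuming no row has zero norm. `cosine_similarity_matrix` is a hypothetical helper; unlike `similarity_algorithm`, it returns samples as rows and profiles as columns.

```python
import numpy as np

def cosine_similarity_matrix(X, profiles):
    """Cosine similarity between every row of X and every row of profiles."""
    X = np.asarray(X, dtype=float)
    P = np.asarray(profiles, dtype=float)
    # Normalize rows to unit length (assumes no all-zero rows)
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    P_unit = P / np.linalg.norm(P, axis=1, keepdims=True)
    return X_unit @ P_unit.T  # shape: (n_samples, n_profiles)

# Toy data: three samples, two profiles
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
profiles = np.array([[2.0, 0.0], [0.0, 1.0]])
sims = cosine_similarity_matrix(X, profiles)
nearest = sims.argmax(axis=1)  # index of the most similar profile per sample
print(nearest)  # [0 1 0]
```

With the notebook's variables, `cosine_similarity_matrix(X_known, genre_profile)` followed by `argmax(axis=1)` would give positions into `genre_profile.index`.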

In [None]:
genre_means, genre_stds = [], []
df_std = df.copy()
for col in in_cols:
    mean = df_std[col].mean()
    std = df_std[col].std()
    genre_means.append(mean)
    genre_stds.append(std)
    df_std[col] = (df_std[col] - mean) / std
df_std

In [None]:
k_many_clusters(df_std, 2, 10)

### -> Observations: 
### ---> As the number of clusters increases, the silhouette score decreases (N_clusters = 2 => Silhouette = 0.30, N_clusters = 9 => Silhouette = 0.15), which means the quality of the clustering decreases
### ---> The one-way ANOVA and the t-test are rejected because at least one of the following requirements is not met:

### -------> T-Test: a) Normality, b) Different Means, c) No Correlation
### -------> One-Way ANOVA: a) Normality, b) Different Means, c) Equal Variance

### The data is not normally distributed, because we are transforming multi-dimensional normal data into single-dimensional non-normal data (the first PCA component, x1). I believe this is not a deal-breaker, so let's do the calculation anyway!
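
For reference, this is what the silhouette score looks like when clusters really are well separated. A minimal sketch on synthetic blobs, with `random_state` fixed so the labels are reproducible; the data and parameter values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic toy data: two well-separated 2-D blobs
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"n_clusters={k}: silhouette={scores[k]:.3f}")

# Number of clusters with the highest silhouette score
best_k = max(scores, key=scores.get)
```

Scores close to 1 indicate compact, well-separated clusters, so values like the 0.30 and 0.15 reported above point to heavily overlapping groups.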

In [None]:
i = 2
tester = Test(alpha = 0.05, only_result=False)
df_genre_profile = k_means(df_std, 2)[0]
ttest_result = tester.ttest(df_genre_profile[df_genre_profile["cluster"] == 0]["x1"].values, df_genre_profile[df_genre_profile["cluster"] == 1]["x1"].values)

ttest_result

### -> Given that we pick 2 as the number of clusters, the t-test is rejected due to the lack of normality. Apart from that, the results indicate a very strong mean difference (Group 1 has a much bigger mean than Group 0).

In [None]:
anova_results = []
for i in range(2, 9):
    tester = Test(alpha=0.05, only_result=False)
    df_genre_profile = k_means(df_std, i)[0]
    test_data = [df_genre_profile[df_genre_profile["cluster"] == x]["x1"] for x in range(i)]
    anova_results.append(tester.anova(*test_data))
    
pd.DataFrame(anova_results)

### -> Normality and homogeneity are not fulfilled, so the test was rejected every time. Interestingly, however, the groups are always different. As the number of clusters increases, the mean difference decreases (6130 -> 1667), so picking a low number is the better option. The distribution is most homogeneous if the number of clusters is equal to 4 or 5 (levene-p is high and variance score is low). As you may notice in the plots above, 4 would be the best option! 
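
When normality and equal variance fail, a rank-based test is one possible alternative to one-way ANOVA. The sketch below uses the Kruskal-Wallis H-test from SciPy on synthetic, skewed groups. This is a swapped-in technique and not part of the analysis above; the data is made up.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic, clearly non-normal groups with different scales
groups = [rng.exponential(scale=s, size=200) for s in (1.0, 2.0, 4.0)]

# Kruskal-Wallis compares the groups based on ranks instead of raw values
stat, p = kruskal(*groups)
print(f"H={stat:.2f}, p={p:.3g}")
```

In the notebook, the per-cluster `x1` values collected in `test_data` could be passed to `kruskal` in the same way.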

In [None]:
#predictions over new samples
#e.g. predict_cluster(df.iloc[:350])

#prediction over all data
{k: v for i, (k, v) in enumerate(predict_cluster(df).items()) if i%100==0}

In [None]:
pred = predict_cluster(df)

df_known_new = df_known.copy()
df_known_new["cluster"] = df_known_new["genres"].map(lambda x: pred[x])

In [None]:
#artists - clusters (group by artists and find the most frequent clusters)
artists_clusters = df_known_new[["cluster", "artists"]].groupby("artists").agg(lambda x: x.value_counts().index[0])
artists_clusters.value_counts(normalize=True)

In [None]:
fig = plt.figure(figsize = (10, 10))
ax = fig.subplots()
ax.set_title("The distribution of clusters w.r.t. artists")
ax.set_xlabel("Clusters")
artists_clusters.value_counts().plot(ax=ax, kind="pie", ylabel="Percentage", legend=True)
plt.show()

### -> The distribution is nearly uniform, except that one cluster covers nearly half of the artists.
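
The artist-to-cluster assignment above boils down to "most frequent cluster per artist". Here is the same idea on a tiny made-up table, which makes it easy to check by hand; `toy` is hypothetical data.

```python
import pandas as pd

# Hypothetical toy table: one row per (artist, genre) pair with the genre's cluster
toy = pd.DataFrame({
    "artists": ["A", "A", "A", "B", "B", "C"],
    "cluster": [0, 0, 2, 1, 1, 3],
})

# Most frequent cluster per artist
dominant = toy.groupby("artists")["cluster"].agg(lambda s: s.value_counts().index[0])
# Share of artists per dominant cluster
shares = dominant.value_counts(normalize=True)
print(dominant.to_dict())  # {'A': 0, 'B': 1, 'C': 3}
```

Note that an artist whose genres are split evenly between clusters gets whichever cluster `value_counts` lists first.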

In [None]:
create_cluster_name(artists_clusters)

### -> Machine-generated names for possible genre groups. Each name is created by picking the top 3 most frequent words in the artists' names that are not stopwords like "the" or "a", and ordering them randomly. Isn't that amazing?
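
The naming idea can be reproduced without NLTK on a small scale. The sketch below uses a tiny hand-written stop list and simple whitespace tokenization, both of which are simplifying assumptions. It also skips the random shuffling, so the result is deterministic.

```python
from collections import Counter
import string

# Small hypothetical stop list (the notebook uses NLTK's English stopwords instead)
STOPWORDS = {"the", "a", "and", "of", "orchestra", "band", "symphony"}

def top_tokens(names, n=3):
    """Return the n most frequent non-stopword tokens across all names."""
    tokens = []
    for name in names:
        for token in name.lower().split():
            token = token.strip(string.punctuation)
            if len(token) > 2 and token not in STOPWORDS:
                tokens.append(token)
    return [word for word, _ in Counter(tokens).most_common(n)]

# Hypothetical artist names
toy_artists = ["The London Symphony Orchestra", "London Jazz Trio",
               "Jazz Band of London", "Blue London Jazz Trio"]
print(top_tokens(toy_artists))  # ['london', 'jazz', 'trio']
```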

# Conclusion

### -> The data is biased towards the most popular genres, mostly Pop and Rock, which is why it is more difficult to predict less popular genres. 
### ---> Suggestion: Try over- or undersampling. 
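
As a starting point for the oversampling suggestion, here is a minimal sketch of random oversampling in plain Python. `random_oversample` is a hypothetical helper; dedicated libraries offer more refined strategies.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra rows with replacement from the same class
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: three "pop" rows and one "jazz" row
new_rows, new_labels = random_oversample([[1], [2], [3], [4]], ["pop", "pop", "pop", "jazz"])
print(dict(Counter(new_labels)))  # {'pop': 3, 'jazz': 3}
```

Oversampling should only be applied to the training part of each split. Otherwise duplicated rows leak into the validation data and inflate the accuracy.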

### -> The imputing seems to work poorly both for K-Nearest Neighbors and Cosine Similarity Matcher.

### ---> Suggestion: Play around with the KNN hyperparameters such as the distance metric. Try a neural network classifier with tuned hyperparameters, a MissForest imputer, or even an NLP classifier that analyzes the artists' names. If it goes well, impute the data and repeat the analysis.
### ---> Hint: Classification results are more accurate when only the most popular genres are included in the labels. With this trick, the accuracy of KNN has increased from 2-3% to 22-23%! (n_clusters in 2-9 interval)

### -> The silhouette scores are low and we cannot prove significant group differences, which is why K-Means Clustering didn't work as expected.
### ---> Suggestion: Play around with the hyperparameters of the K-Means model (e.g. the initialization method `init` or the number of restarts `n_init`) or try other clustering algorithms such as DBSCAN and Mean Shift.
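
To illustrate the DBSCAN suggestion, here is a minimal sketch on synthetic blobs. The `eps` and `min_samples` values are picked for this toy data only and would need tuning on the standardized genre data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic toy data: two dense, well-separated 2-D blobs
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

Unlike K-Means, DBSCAN does not need the number of clusters in advance, which avoids the search over `n_clusters` done above.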

### -> According to statistical tests, the most suitable pick for number of clusters is 4.

# The ball is in your court, try it yourself. By the way, don't forget to upvote my notebook :)