In [None]:
import pandas as pd
import numpy as np
import pickle # do i need this if I only use the pandas method?
import skfuzzy as fuzz

import ipympl
#%config InlineBackend.figure_format='retina'

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib_inline
import matplotlib.cm as cm
import seaborn as sns

import warnings 
warnings.filterwarnings("ignore")

Since the t-SNE values showed visible clusters that HDBSCAN clustering could not recover, I will use the t-SNE values to perform fuzzy c-means clustering.

## Fuzzy classification

Multilabel classification predicts multiple discrete labels/classes per data point. It should not be confused with multiclass classification, which simply means there are more than two potential classes to choose from, but usually only one is assigned per data point. In multilabel classification, every label is assigned a 0 or 1 for every data point. Although no membership probability is given, these assignments can be used to rank belongingness. These methods have the advantage of being available in `scikit-learn`.
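To make the distinction concrete, here is a small sketch of the two label layouts. The arrays are made-up toy values, not taken from the dataset.

```python
import numpy as np

# Toy example: 4 data points and 3 possible classes (all values are invented).

# Multiclass: exactly one class index per data point.
multiclass_labels = np.array([0, 2, 1, 2])

# Multilabel: one binary indicator row per data point; several 1s are allowed.
multilabel_indicator = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
])

# Number of labels assigned to each data point in the multilabel case.
labels_per_point = multilabel_indicator.sum(axis=1)
print(labels_per_point)  # [1 2 1 2]
```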

However, in this case "fuzzy" classification may be more accurate (and more interesting if there are overlapping Roud numbers), as it explicitly outputs degrees of membership. Fuzzy classification is not part of `scikit-learn`. I considered the following options:
- `scikit-fuzzy` offers it as part of its [fuzzy clustering](https://pythonhosted.org/scikit-fuzzy/api/skfuzzy.cluster.html) module, but not as a pure classification model. The underlying c-means model requires the number of clusters `c` to be specified in advance; I can get around this by setting `c` to 3210, the number of clusters from the best HDBSCAN run. However, the other limitations may still be a problem (i.e. it assumes a uniform cluster shape and struggles with higher-dimensional data).

In [None]:
import skfuzzy as fuzz
df_classify.reset_index()

Unnamed: 0,level_0,index,key_name,name,version_in_key,bi_file,dt_file,roud,lyrics,roud_count,lyric_embed_instructor,tsne_embedding,tsne_hdb_cluster
0,0,0,"A Robin, Jolly Robin",A Robyn Jolly Robyn,A,Perc1185,HEYROBIN,,"""[F]rom what appears to be the most ancient of...",,"[-0.032208655, -0.0039244993, -0.02159848, 0.0...","[-98.88156127929688, 28.498804092407227]",214
1,1,1,"A Robin, Jolly Robin",(No Title),B,Perc1185,HEYROBIN,,"71 'Hey, Robin, jolly Robin, 72 Tell me how...",,"[-0.032028995, 0.020379173, -0.016789645, 0.03...","[-99.28376770019531, 28.85739517211914]",214
2,2,2,"A, U, Hinny Bird","A, U, Hinny Bird",A,StoR160,,235,"A, U, hinny burd; The bonny lass o' Benwell, A...",1.0,"[-0.025857605, 0.010645705, -0.02403562, 0.050...","[-74.82328796386719, -26.113561630249023]",3050
3,3,3,Adieu to Erin (The Emigrant),Adieu to Erin,A,SWMS255,,2068,"Oh, when I breathed a last adieu, To Erin's an...",1.0,"[-0.043128256, 0.008317871, -0.040352777, 0.01...","[-18.571348190307617, -37.902496337890625]",2787
4,4,4,"Agincourt Carol, The",The Song of Agincourt,A,MEL51,AGINCRT1,V29347,"Deo gracias anglia, Redde pro victoria, 1 Owre...",3.0,"[-2.9962948e-05, -0.009317334, -0.017600924, 0...","[-85.41560363769531, 44.82973861694336]",83
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9968,9968,10108,,Zeb Tourney's Girl,,LE18,ZEBTURNY,2249,"Down in the Tennessee mountains,\nFar from the...",2.0,"[-0.0389951, -0.008086728, -0.058877233, 0.037...","[47.74774932861328, 17.576581954956055]",2258
9969,9969,10109,,Zebra Dun,,LB16,ZEBRADUN,3237,We was camped on the plains at the head of the...,1.0,"[-0.044695836, -0.012055198, -0.042596623, 0.0...","[49.38885498046875, 6.010506629943848]",2451
9970,9970,10110,,Zen Gospel Singing,,,ZENGOSPE,,I once was a Baptist and on each Sunday morn\n...,,"[-0.043211307, -0.0138436, -0.017630804, 0.003...","[68.2158203125, 18.50948715209961]",1222
9971,9971,10111,,Zuleika,,,ZULIKA,,"Zuleika was fair to see,\nA fair Persian maide...",,"[-0.0115272915, -0.029302498, -0.022729361, 0....","[15.239205360412598, -15.814308166503906]",2158


The data must be input as a 2D array of size (S, N), where S is the number of features in each vector and N is the number of "data sets". I found this wording ambiguous, and [O'Reilly](https://www.oreilly.com/library/view/mastering-machine-learning/9781788621113/6967d36f-e04e-46d3-8c99-e30e2193d464.xhtml) describes N differently, but the output shapes below confirm that N is the number of data points: each column is one data point.

Note: Initially the data was the wrong way round (wrong axis), so each of the 768 embedding dimensions was clustered instead of each of the 9973 data points. Per O'Reilly, I transposed the `data` numpy array and the clustering was successful.
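The orientation fix can be checked on a toy array before running the full clustering. This is only a sketch: the sizes 10 and 4 stand in for 9973 and 768, and the random vectors stand in for the real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_features = 10, 4  # stand-ins for 9973 data points and 768 features

# A list of per-point vectors, like df_classify['lyric_embed_instructor'].tolist()
embeddings = [rng.normal(size=n_features).tolist() for _ in range(n_points)]

data = np.array(embeddings)   # shape (N, S): one row per data point
data_for_cmeans = data.T      # shape (S, N): one column per data point

print(data.shape, data_for_cmeans.shape)  # (10, 4) (4, 10)
```

A quick `assert data_for_cmeans.shape[1] == n_points` before fitting would have caught the wrong-axis mistake early.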

In [None]:
df_classify = pd.read_pickle('df_classify.p')
ncenters = 3210
data = np.array(df_classify['lyric_embed_instructor'].tolist())
data = data.transpose()
data.shape

(768, 9973)

Fit the model. `cmeans` returns:
1. cluster centres: `cntr`: the position of each of the `c` requested cluster centres along each feature. 2D array. The documentation gives its size as (S, c), but the returned array has shape (c, S), as the shapes below show.
2. matrices: `u`: final fuzzy c-partitioned matrix. `u0`: initial guess at the `u` matrix. `d`: final Euclidean distance matrix. 2D arrays, size (c, N).
3. model assessment information: `jm`: "Objective function history" of model performance at each iteration. 1D array, length P. `p`: number of iterations run. `fpc`: final fuzzy partition coefficient.
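To see where these outputs come from, here is a minimal numpy sketch of the standard fuzzy c-means update loop. It is my own toy version, not the `scikit-fuzzy` implementation: the function name `fuzzy_cmeans_sketch`, the seed handling and the toy blobs are assumptions for illustration, and it does not return `u0` or `fpc`.

```python
import numpy as np

def fuzzy_cmeans_sketch(data, c, m=2.0, error=0.005, maxiter=1000, seed=0):
    """Toy fuzzy c-means. `data` has shape (S, N): features by data points."""
    rng = np.random.default_rng(seed)
    n_points = data.shape[1]
    u = rng.random((c, n_points))
    u /= u.sum(axis=0, keepdims=True)        # memberships of each point sum to 1
    jm = []
    for p in range(1, maxiter + 1):
        u_old = u
        um = u_old ** m
        # Centres: membership-weighted means of the points, shape (c, S)
        cntr = (um @ data.T) / um.sum(axis=1, keepdims=True)
        # Euclidean distance from every centre to every point, shape (c, N)
        d = np.linalg.norm(cntr[:, None, :] - data.T[None, :, :], axis=2)
        d = np.fmax(d, np.finfo(float).eps)  # guard against division by zero
        jm.append(float((um * d ** 2).sum()))  # objective function history
        # Membership update: closer centres receive larger memberships
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=0, keepdims=True)
        if np.linalg.norm(u - u_old) < error:
            break
    return cntr, u, d, np.array(jm), p

# Two well-separated toy blobs in 2 dimensions (made-up data).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
toy = np.vstack([blob_a, blob_b]).T          # shape (S, N) = (2, 40)

cntr, u, d, jm, p = fuzzy_cmeans_sketch(toy, c=2)
print(cntr.shape, u.shape, d.shape)          # (2, 2) (2, 40) (2, 40)
```

The pairwise distance step builds a (c, N, S) array, which is fine for a toy but would be far too large for 3210 clusters, 9973 points and 768 features.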

In [None]:
cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(data, ncenters, 2, error=0.005, maxiter=1000, init=None)

Examine the clusters:

In [None]:
u.shape

(3210, 9973)

In [None]:
cntr.shape

(3210, 768)

Only 462 clusters were chosen as a first choice for each data point. The largest cluster was assigned 1184 points.

In [None]:
cluster_membership = np.argmax(u, axis=0)  # u is (c, N), so axis=0 picks the highest-membership cluster for each data point
clusters, counts = np.unique(cluster_membership, return_counts=True)
counts.max()

1184
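The number of distinct first-choice clusters and the size of the largest one can both be read from `np.unique`. A sketch on a made-up membership matrix (4 clusters, 6 points), not the real `u`:

```python
import numpy as np

# Toy membership matrix, shape (c, N) = (4, 6); each column sums to 1.
u_toy = np.array([
    [0.7, 0.6, 0.1, 0.2, 0.5, 0.7],
    [0.1, 0.2, 0.1, 0.1, 0.2, 0.1],
    [0.1, 0.1, 0.7, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
])

first_choice = np.argmax(u_toy, axis=0)  # best cluster per data point
used_clusters, sizes = np.unique(first_choice, return_counts=True)

print(len(used_clusters))  # 2 clusters are ever a first choice
print(sizes.max())         # 4 points in the largest one
```

On the real data, `len(clusters)` from the cell above plays the role of `len(used_clusters)` here.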

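The fuzzy partition coefficient (FPC) is a metric of how cleanly the data is described by the model. It ranges from 1/c (every point belongs equally to every cluster) to 1 (every point belongs to exactly one cluster), with 1 being the best. My FPC is about 0.00031, which is essentially the lower bound 1/3210 for this run. The memberships are therefore almost uniform across clusters, which suggests a poor clustering.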

In [None]:
fpc

0.00031152649052995895
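The FPC is the mean over data points of the sum of squared memberships, i.e. trace(u uᵀ) / N. A small sketch of its two extremes, using toy matrices and a helper name (`fuzzy_partition_coefficient`) of my own:

```python
import numpy as np

def fuzzy_partition_coefficient(u):
    """FPC = trace(u @ u.T) / N for a membership matrix u of shape (c, N)."""
    n_points = u.shape[1]
    return float((u ** 2).sum() / n_points)

c, n_points = 5, 8

# Completely fuzzy: every point belongs equally to every cluster.
uniform = np.full((c, n_points), 1.0 / c)

# Completely crisp: every point belongs to exactly one cluster.
crisp = np.zeros((c, n_points))
crisp[np.arange(n_points) % c, np.arange(n_points)] = 1.0

print(fuzzy_partition_coefficient(uniform))  # lower bound, 1/c
print(fuzzy_partition_coefficient(crisp))    # upper bound, 1.0
print(1 / 3210)                              # lower bound for c = 3210
```

Comparing the last value with the `fpc` output above shows how close this run sits to the fully fuzzy extreme.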

Let's add the first-choice labels and visualise them:

In [None]:
df_classify['fuzzy_cluster_0'] = cluster_membership
df_classify['fuzzy_cluster_0_label'] = cluster_membership.astype(str) #string version just for plotting labels
df_classify

In [None]:
#plot
import plotly.express as px  # px is used below but was not imported at the top of the notebook

# make non-cluster -1 grey
color_map = {'-1': 'lightgray'}  # nb key is a string
# colours for other cluster labels; Light24 has only 24 colours, so cycle through them
num_clusters = ncenters
palette = px.colors.qualitative.Light24
color_map.update({str(i): palette[i % len(palette)] for i in range(num_clusters)})

# # Calculate localized centroids for 'roud' annotations
# def calculate_annotations():
#     unique_roud_values = df_classify['roud'].unique()
#     annotations = []

#     for roud_value in unique_roud_values:
#         roud_clusters = df_classify[df_classify['roud'] == roud_value]['fuzzy_cluster_0_label'].unique()
        
#         for cluster_label in roud_clusters:
#             roud_points = df_classify[(df_classify['roud'] == roud_value) & (df_classify['fuzzy_cluster_0_label'] == cluster_label)]
#             if len(roud_points) > 0:
#                 centroid_x = roud_points['tsne_embedding'].str.get(0).mean()
#                 centroid_y = roud_points['tsne_embedding'].str.get(1).mean()
#                 if len(roud_points) > 1:
#                     annotations.append(dict(x=centroid_x, y=centroid_y,
#                                             xref="x", yref="y",
#                                             text=roud_value,
#                                             showarrow=True,
#                                             arrowhead=2,
#                                             ax=0,
#                                             ay=-25))
#                 else:
#                     annotations.append(dict(x=centroid_x, y=centroid_y,
#                                             xref="x", yref="y",
#                                             text=roud_value,
#                                             showarrow=False,
#                                             ax=0,
#                                             ay=0))
#     return annotations

# annotations = calculate_annotations()

# Create the scatter plot
# Add the t-SNE coordinates as columns before sorting, so x/y stay aligned with colour and hover data
plot_df = df_classify.assign(
    tsne_x=df_classify['tsne_embedding'].str.get(0),
    tsne_y=df_classify['tsne_embedding'].str.get(1),
).sort_values(by='fuzzy_cluster_0')

fig = px.scatter(plot_df,
                 x='tsne_x',
                 y='tsne_y',
                 color='fuzzy_cluster_0_label',
                 color_discrete_map=color_map,
                 hover_name='name',
                 hover_data={'roud': True},
                 labels={'fuzzy_cluster_0_label': 'Cluster (Fuzzy, 0)'},
                 title='Whole dataset Fuzzy clustering (instructor embeddings)'
                )

# # Add buttons to toggle annotations
# button_on = dict(label='Annotations On',
#                  method='relayout',
#                  args=[{'annotations': annotations}])

# button_off = dict(label='Annotations Off',
#                   method='relayout',
#                   args=[{'annotations': []}])

# fig.update_layout(updatemenus=[
#     dict(type='buttons', showactive=True, buttons=[button_on, button_off])
# ])

fig.update_layout(xaxis_title='t-SNE X', yaxis_title='t-SNE Y')
fig.update_xaxes(scaleanchor="y", scaleratio=1)
fig.update_yaxes(dtick=20)
fig.update_xaxes(dtick=20)

fig.update_layout(width=1400, height=1100)

fig.show()


Let's now examine the first-, second- and third-choice clusters and their membership degrees:
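One way to extract the ranked choices is to sort each column of the membership matrix in descending order. This is a sketch on a made-up (4 clusters, 3 points) matrix; the variable names are my own and nothing here has been run on the real `u`.

```python
import numpy as np

# Toy membership matrix, shape (c, N) = (4, 3); each column sums to 1.
u_toy = np.array([
    [0.50, 0.15, 0.25],
    [0.30, 0.20, 0.05],
    [0.15, 0.60, 0.30],
    [0.05, 0.05, 0.40],
])

top_k = 3
# Cluster ids per data point, ordered by descending membership: shape (top_k, N)
order = np.argsort(-u_toy, axis=0)[:top_k]
# The matching membership degrees, same shape
top_memberships = np.take_along_axis(u_toy, order, axis=0)

print(order[0])            # first-choice cluster per point:  [0 2 3]
print(order[1])            # second-choice cluster per point: [1 1 2]
print(top_memberships[0])  # first-choice memberships: [0.5 0.6 0.4]
```

Each row of `order` and `top_memberships` has one entry per data point, so it could be assigned directly to a new dataframe column (for example a hypothetical `fuzzy_cluster_1` for `order[1]`).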

In [None]:
#TODO: first extract the fuzzy data and append to df

#get a group of songs
df_classify[df_classify['name'] == 'Alouette'] # reveals cluster 2240
df_classify[df_classify['fuzzy_cluster_0'] == 2240]

Unnamed: 0,index,key_name,name,version_in_key,bi_file,dt_file,roud,lyrics,roud_count,lyric_embed_instructor,tsne_embedding,tsne_hdb_cluster,fuzzy_cluster_0,fuzzy_cluster_0_label
236,253,"Craven Churn-Supper Song, The","Craven Churn-Supper Song, The",A,BeCo382,,13471.0,"Be not moved at my strain, For nothing study s...",1.0,"[-0.012664082, -0.023014212, -0.0296101, 0.056...","[15.842228889465332, 22.00914764404297]",-1,2240,2240
273,290,"Do, Do, Pity My Case","Do, Do, Pity My Case",A,BAF805,,11590.0,"Do, do pity my case, In some lady's garden, My...",1.0,"[-0.013756712, -0.00012300834, -0.022789443, 0...","[-14.78901481628418, -49.41680908203125]",2872,2240,2240
538,570,Jinny Get Your Hoecake Done,The Hoe-Cake,A,Fus158C,,16825.0,"Jinny, get your hoecake done, my love, Jinny, ...",1.0,"[-0.03661858, -0.027427817, -0.039609604, 0.02...","[-69.33150482177734, -37.35103225708008]",3043,2240,2240
608,645,Lazy Mary (She Won't Get Up),What Will You Give Me if I Get Up?,B,R396,,6561.0,"""What will you give me if I get up, If I get u...",2.0,"[-0.024192596, -0.0023780533, -0.0232418, 0.02...","[-70.6299819946289, -78.60198974609375]",119,2240,2240
641,679,London Bridge Is Falling Down,London Bridge Is Falling Down,A,R578,,502.0,"London bridge is falling down, Falling down, f...",1.0,"[-0.02975991, 0.015273309, -0.0085026175, 0.00...","[-19.276100158691406, -47.125858306884766]",3135,2240,2240
909,973,"Quaker's Courtship, The","Quaker's Courtship, The",A,R362,,716.0,"Oh-dear-me! I'm for pleasure, not for sportin'...",1.0,"[-0.027867021, 0.0044446876, -0.023296269, 0.0...","[-42.103084564208984, -22.050146102905273]",3027,2240,2240
1075,1152,Three Blind Mice,Three Blind Mice,A,FSWB413A,THREEBLN,3753.0,"The music notation is archaic, with the vertic...",4.0,"[-0.008864541, 0.0069865636, -0.007616302, 0.0...","[29.552587509155273, -30.147357940673828]",2540,2240,2240
1380,1463,,Alouette,,,ALOETT,,"Alouette, gentile Alouette,\nAlouette, je te p...",,"[-0.02266857, 0.012477776, -0.010748615, 0.031...","[-10.653217315673828, 14.567709922790527]",1141,2240,2240
2387,2477,,Brother Gorilla,,,BROGORIL,,Translated by Jake Thackray from the French\nT...,,"[-0.05307475, -0.012492705, -0.032474436, 0.05...","[-7.643563270568848, -38.861446380615234]",2841,2240,2240
2413,2503,,Buckingham Palace,,,CHNGGARD,,They're changing guard at Buckingham Palace\nC...,,"[-0.029468872, 0.025955772, -0.02959578, 0.033...","[-86.67301940917969, 33.97333908081055]",193,2240,2240


In [None]:
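# Generic template. Assumes 'lyrics_dataset.p' unpickles to a numeric
# array-like of shape (N, S), i.e. one row per data point.
data = pd.read_pickle('lyrics_dataset.p')

# Perform Fuzzy C-Means clustering; cmeans expects shape (S, N), hence data.T
cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
    data.T, c=3, m=2, error=0.005, maxiter=1000, init=None
)

# Retrieve the cluster centers, transposed from (c, S) to (S, c)
cluster_centers = cntr.T

# Retrieve the membership values, transposed from (c, N) to (N, c): one row per data point
membership_values = u.T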