### K-Means Clustering on a Multi-Class and Multi-Label Data Set


In this project, I study k-means clustering as a way to assign class labels on the Anuran Calls (MFCCs) Data Set. <br>
I also repeat the procedure in a <b>Monte Carlo simulation</b> and report the mean and standard deviation of the resulting Hamming scores.

It is a multi-label dataset with three label columns (Family, Genus, Species). The dataset was created by segmenting 60 audio records belonging to 4 different families, 8 genera, and 10 species. Each audio record corresponds to one specimen (an individual frog).

The data can be downloaded from:
https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.metrics import hamming_loss
import random
import statistics

In [3]:
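# Load the data; forward slashes avoid backslash-escape problems and work on every OS
all_data = pd.read_csv('../data/Frogs_MFCCs.csv')
# RecordID is an identifier, not a feature
all_data = all_data.drop('RecordID', axis=1)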

In [4]:
cols=[e for e in all_data if e not in ( 'Family','Genus','Species')]
X=all_data.loc[:,cols]
Y=all_data.loc[:, ['Family','Genus','Species']]
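
The next cell chooses K by the average silhouette score. As a rough illustration of what that score measures, the sketch below computes it by hand for a handful of 1-D toy points and picks the K with the highest value. The helper `silhouette_avg`, the points, and the candidate labelings are hypothetical; the notebook itself relies on `sklearn.metrics.silhouette_score`.

```python
from statistics import mean

def silhouette_avg(points, labels):
    """Average silhouette coefficient for 1-D points (hypothetical helper)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance from p to the other members of its own cluster
        own = [abs(p - q) for j, (q, lab_q) in enumerate(zip(points, labels))
               if lab_q == lab and j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = mean(own)
        # b: smallest mean distance from p to the members of any other cluster
        b = min(
            mean(abs(p - q) for q, lab_q in zip(points, labels) if lab_q == other)
            for other in set(labels) - {lab}
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Toy data (assumed): two tight groups around 1 and 5
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
candidates = {
    2: [0, 0, 0, 1, 1, 1],  # matches the two groups
    3: [0, 0, 1, 2, 2, 2],  # splits the first group unnecessarily
}
silh = {k: silhouette_avg(points, labs) for k, labs in candidates.items()}
best_k = max(silh, key=silh.get)  # same selection rule as the notebook
print(silh, best_k)
```

Scores close to 1 mean points sit much nearer to their own cluster than to any other, which is why the labeling with the highest average is preferred.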

In [5]:

final_ham_loss=[]
final_ham_score=[]

fam_maj_trip={p:[] for p in range(1,51)}
genus_maj_trip={p:[] for p in range(1,51)}
species_maj_trip={p:[] for p in range(1,51)}
#-----------------------------------------------------------------------------
# Monte Carlo Simulation- Performing the procedure 50 times
#-----------------------------------------------------------------------------

for cls in range(1,51):
    silh_avg = dict()
#-----------------------------------------------------------------------------
# Finding the optimal K automatically using the average silhouette score
# (range(2, 20) tries K = 2, ..., 19)
#-----------------------------------------------------------------------------

    for k in range(2,20):
        rand_value=random.randint(0, 900)
        # n_init=10 keeps the default used in this run and silences the FutureWarning
        k_means = KMeans(n_clusters=k,init='k-means++',n_init=10,random_state=rand_value).fit(X)
        labels = k_means.labels_
        silh_avg[k] = metrics.silhouette_score(X, labels)

    #print("Iteration : ",cls)
    #print("Average silhouette score values : ",silh_avg)
#-----------------------------------------------------------------------------
# Selecting the K value with the maximum silhouette score
#-----------------------------------------------------------------------------

    optimal_k = max(silh_avg,key=silh_avg.get)
    
    rand_value=random.randint(0, 900)

#-----------------------------------------------------------------------------
# Performing K-means Clustering For Optimal K-value Found
#-----------------------------------------------------------------------------

    X1=X
    # Use the K selected above instead of a hard-coded 4
    k_means_f = KMeans(n_clusters=optimal_k, n_init=10, random_state=rand_value).fit(X1)
    cluster_labels = k_means_f.labels_

    clusters = pd.concat([X1,Y,pd.DataFrame({'labels':cluster_labels.tolist()})],axis = 1)
    clusters['labels'].value_counts()


    #print("Optimal Cluster value : ",optimal_k)
    for k in range(optimal_k):
        find= clusters[clusters['labels']==k]
        #print('Cluster',k+1)
        #print('\nMajority class in family - ',find['Family'].value_counts().index[0])
        #print('Majority class in genus - ',find['Genus'].value_counts().index[0])
        #print('Majority class in species - ',find['Species'].value_counts().index[0])
        #print('\n')

#-----------------------------------------------------------------------------
# Determining the Majority Triplet for each Cluster
#-----------------------------------------------------------------------------

    maj_trip = {k:[] for k in range(optimal_k)}
    for k in range(optimal_k):
        c_value = clusters[clusters['labels']==k]
        maj_trip[k].append(c_value['Family'].value_counts().index[0])
        maj_trip[k].append(c_value['Genus'].value_counts().index[0])
        maj_trip[k].append(c_value['Species'].value_counts().index[0])
        fam_maj_trip[cls].append(c_value['Family'].value_counts().index[0])
        genus_maj_trip[cls].append(c_value['Genus'].value_counts().index[0])
        species_maj_trip[cls].append(c_value['Species'].value_counts().index[0])


    clusters['family_pred'] = 'none'
    clusters['genus_pred'] = 'none'
    clusters['species_pred'] = 'none'

    # Every sample inherits the majority triplet of its cluster
    for k in range(optimal_k):
        clusters['family_pred'] = np.where(clusters['labels']==k,maj_trip[k][0],clusters['family_pred'])
        clusters['genus_pred'] = np.where(clusters['labels']==k,maj_trip[k][1],clusters['genus_pred'])
        clusters['species_pred'] = np.where(clusters['labels']==k,maj_trip[k][2],clusters['species_pred'])

#-----------------------------------------------------------------------------
# Calculating the average Hamming loss over the three label columns
#-----------------------------------------------------------------------------
    fam_s=hamming_loss(clusters['Family'],clusters['family_pred'])
    gen_s=hamming_loss(clusters['Genus'],clusters['genus_pred'])
    spec_s=hamming_loss(clusters['Species'],clusters['species_pred'])

    ham_loss_s=(fam_s+gen_s+spec_s)/3

    #print("Hamming Loss : ",np.round(ham_loss_s,6))
    #print("Hamming Score : ",1-ham_loss_s)

#-----------------------------------------------------------------------------
# Recording this run's Hamming loss and score; mean and standard deviation follow the loop
#-----------------------------------------------------------------------------

    final_ham_loss.append(np.round(ham_loss_s,6))
    final_ham_score.append((1-ham_loss_s))


print("Standard Deviation of the 50 Hamming Scores : {}".format(statistics.stdev(final_ham_score)))

print("Average of the 50 Hamming Scores : {}".format(statistics.mean(final_ham_score)))


Standard Deviation of the 50 Hamming Scores : 0.006124184940487849
Average of the 50 Hamming Scores : 0.775394950196896
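
The labeling step in the cell above assigns every sample the majority Family, Genus, and Species of its cluster. The sketch below reproduces that idea with the standard library on a few toy samples; the function `majority_triplets` and the sample data are hypothetical stand-ins for the `value_counts()` calls on the real DataFrame.

```python
from collections import Counter, defaultdict

def majority_triplets(cluster_ids, true_labels):
    """Return {cluster: (majority family, majority genus, majority species)}."""
    grouped = defaultdict(list)
    for cid, triplet in zip(cluster_ids, true_labels):
        grouped[cid].append(triplet)
    result = {}
    for cid, triplets in grouped.items():
        # Vote column by column, mirroring one value_counts() call per label column
        result[cid] = tuple(
            Counter(column).most_common(1)[0][0] for column in zip(*triplets)
        )
    return result

# Toy data (assumed): cluster index and true labels for seven samples
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
true_labels = [
    ("Hylidae", "Hypsiboas", "HypsiboasCordobae"),
    ("Hylidae", "Hypsiboas", "HypsiboasCordobae"),
    ("Hylidae", "Hypsiboas", "HypsiboasCinerascens"),
    ("Leptodactylidae", "Adenomera", "AdenomeraHylaedactylus"),
    ("Leptodactylidae", "Adenomera", "AdenomeraHylaedactylus"),
    ("Leptodactylidae", "Adenomera", "AdenomeraHylaedactylus"),
    ("Leptodactylidae", "Adenomera", "AdenomeraAndre"),
]

maj = majority_triplets(cluster_ids, true_labels)
# Each sample is predicted to carry the majority triplet of its cluster
predicted = [maj[cid] for cid in cluster_ids]
print(maj)
```

Because the vote is taken per column, the three winners need not come from the same sample; in principle a cluster could receive a triplet that never occurs together in the data.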




In [6]:
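# Group the per-iteration majority labels as (families, genera, species)
fin=list(zip(fam_maj_trip.values(),genus_maj_trip.values(),species_maj_trip.values()))

# For each iteration, build one (Family, Genus, Species) triplet per cluster
major_trip={}
for j in range(0,50):
    major_trip[j]=list(zip(fin[j][0],fin[j][1],fin[j][2]))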
        

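### Results After Monte Carlo Simulation

#### Standard Deviation of the 50 Hamming Scores: 0.009651644097019792
#### Average of the 50 Hamming Scores: 0.7751104933981933

These values come from a separate run. Each run draws new random seeds, so they differ slightly from the printed output of the simulation cell.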
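
To make the reported quantities concrete, here is a minimal worked example of how one run's Hamming loss and score are obtained and how several runs are then summarized. The helper `mismatch_rate`, the toy labels, and the three run scores are made up for illustration and are not results from the notebook. For a single label column, the Hamming loss reduces to the fraction of mismatched samples.

```python
from statistics import mean, stdev

def mismatch_rate(y_true, y_pred):
    """Fraction of samples whose label differs (Hamming loss for one column)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy run (assumed): four samples, three label columns
true_cols = {
    "Family":  ["Hylidae", "Hylidae", "Leptodactylidae", "Leptodactylidae"],
    "Genus":   ["Hypsiboas", "Hypsiboas", "Adenomera", "Adenomera"],
    "Species": ["HypsiboasCordobae", "HypsiboasCinerascens",
                "AdenomeraHylaedactylus", "AdenomeraAndre"],
}
pred_cols = {
    "Family":  ["Hylidae", "Hylidae", "Leptodactylidae", "Leptodactylidae"],
    "Genus":   ["Hypsiboas", "Hypsiboas", "Adenomera", "Adenomera"],
    "Species": ["HypsiboasCordobae", "HypsiboasCordobae",
                "AdenomeraHylaedactylus", "AdenomeraHylaedactylus"],
}

losses = [mismatch_rate(true_cols[c], pred_cols[c]) for c in true_cols]
ham_loss = mean(losses)   # (0 + 0 + 0.5) / 3
ham_score = 1 - ham_loss

# Summary over several runs (made-up scores)
run_scores = [0.78, 0.76, 0.77]
print(ham_loss, ham_score, mean(run_scores), stdev(run_scores))
```

Note that `statistics.stdev` is the sample standard deviation (it divides by n - 1), which is the function the notebook calls on its 50 scores.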

In [10]:
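i=range(1,51)
# Note: optimal_k holds only the value from the last iteration, so the same K is shown on every row
final=pd.DataFrame({"Iteration" : i,"Optimal K":optimal_k,"Hamming Score":final_ham_score,"Majority triplets (Family,Genus,Species) for every cluster":list(major_trip.values())})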

# Changes the pandas display settings for the rest of the session
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# None disables truncation of cell contents; passing -1 is deprecated
pd.set_option('display.max_colwidth', None)


final


Unnamed: 0,Iteration,Optimal K,Hamming Score,"Majority triplets (Family,Genus,Species) for every cluster"
0,1,4,0.777577,"[(Hylidae, Hypsiboas, HypsiboasCordobae), (Hylidae, Hypsiboas, HypsiboasCinerascens), (Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Dendrobatidae, Ameerega, Ameeregatrivittata)]"
1,2,4,0.777577,"[(Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Hylidae, Hypsiboas, HypsiboasCinerascens), (Dendrobatidae, Ameerega, Ameeregatrivittata), (Hylidae, Hypsiboas, HypsiboasCordobae)]"
2,3,4,0.777577,"[(Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Hylidae, Hypsiboas, HypsiboasCordobae), (Hylidae, Hypsiboas, HypsiboasCinerascens), (Dendrobatidae, Ameerega, Ameeregatrivittata)]"
3,4,4,0.777577,"[(Hylidae, Hypsiboas, HypsiboasCinerascens), (Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Dendrobatidae, Ameerega, Ameeregatrivittata), (Hylidae, Hypsiboas, HypsiboasCordobae)]"
4,5,4,0.778226,"[(Hylidae, Hypsiboas, HypsiboasCordobae), (Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Hylidae, Hypsiboas, HypsiboasCinerascens), (Dendrobatidae, Ameerega, Ameeregatrivittata)]"
5,6,4,0.754876,"[(Hylidae, Hypsiboas, HypsiboasCordobae), (Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Leptodactylidae, Adenomera, AdenomeraAndre), (Hylidae, Hypsiboas, HypsiboasCordobae)]"
6,7,4,0.777577,"[(Dendrobatidae, Ameerega, Ameeregatrivittata), (Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Hylidae, Hypsiboas, HypsiboasCordobae), (Hylidae, Hypsiboas, HypsiboasCinerascens)]"
7,8,4,0.777577,"[(Dendrobatidae, Ameerega, Ameeregatrivittata), (Hylidae, Hypsiboas, HypsiboasCinerascens), (Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Hylidae, Hypsiboas, HypsiboasCordobae)]"
8,9,4,0.777716,"[(Hylidae, Hypsiboas, HypsiboasCinerascens), (Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Hylidae, Hypsiboas, HypsiboasCordobae), (Dendrobatidae, Ameerega, Ameeregatrivittata)]"
9,10,4,0.777855,"[(Hylidae, Hypsiboas, HypsiboasCinerascens), (Leptodactylidae, Adenomera, AdenomeraHylaedactylus), (Hylidae, Hypsiboas, HypsiboasCordobae), (Dendrobatidae, Ameerega, Ameeregatrivittata)]"
