# Calculating the performance

The silhouette score is used to evaluate the quality of the clusters created by the K-Means clustering algorithm. The score lies in the range [-1, 1]. To calculate the silhouette score of an observation (data point), the following two distances are needed:

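   1. Mean distance between the observation and all other data points in the same cluster. This is also called the mean intra-cluster distance and is denoted by "a".
   
   2. Mean distance between the observation and all data points of the nearest neighboring cluster. This is also called the mean nearest-cluster distance and is denoted by "b".

The silhouette score of the observation is then (b - a) / max(a, b). Values near 1 mean the point sits well inside its own cluster, values near 0 mean it lies between two clusters, and negative values suggest it may have been assigned to the wrong cluster.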
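
The sketch below shows how a and b turn into a silhouette score for each point, using plain Python on a made-up four-point dataset. The points, the labels and the helper name `silhouette_of` are illustrative assumptions; for real data, `sklearn.metrics.silhouette_samples` returns the per-point values that `silhouette_score` averages.

```python
import math

# Hypothetical toy data: two well-separated clusters on a line (1-D points as tuples).
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]

def silhouette_of(i, points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for the point at index i."""
    own = labels[i]
    # Collect distances from point i to every other point, grouped by cluster label.
    dists = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            dists.setdefault(lab, []).append(math.dist(points[i], p))
    if own not in dists:
        return 0.0  # common convention: a singleton cluster scores 0
    # a: mean intra-cluster distance
    a = sum(dists[own]) / len(dists[own])
    # b: smallest mean distance to any other cluster (mean nearest-cluster distance)
    b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
    return (b - a) / max(a, b)

scores = [silhouette_of(i, points, labels) for i in range(len(points))]
average = sum(scores) / len(scores)
print(scores, average)
```

Because the two toy clusters are far apart relative to their width, every point scores close to 1.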
   
 

### Packages Required

In [243]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import os
import glob
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


### Importing multiple files using os and glob

The glob module is used here to retrieve the names of the CSV files matching the specified pattern.

In [171]:
path="C://Users/Home/sample-dataset/ant-1.3"
read_files=glob.glob(os.path.join(path,"*.csv"))

In [239]:
read_files[:5]

['C://Users/Home/sample-dataset/ant-1.3\\ant-1.3.csv',
 'C://Users/Home/sample-dataset/ant-1.3\\ant-1.4.csv',
 'C://Users/Home/sample-dataset/ant-1.3\\ant-1.5.csv',
 'C://Users/Home/sample-dataset/ant-1.3\\ant-1.6.csv',
 'C://Users/Home/sample-dataset/ant-1.3\\ant-1.7.csv']
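
`glob.glob` returns its matches in arbitrary order, so sorting the result keeps the per-project output in a predictable order. A small sketch of the pattern matching, using a temporary directory and made-up file names in place of the real dataset folder:

```python
import glob
import os
import tempfile

# Build a throwaway folder with hypothetical file names (stand-ins for the dataset).
with tempfile.TemporaryDirectory() as folder:
    for name in ["ant-1.4.csv", "ant-1.3.csv", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()

    # "*.csv" matches any file name ending in .csv; sorted() fixes the order.
    csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    csv_names = [os.path.basename(f) for f in csv_files]

print(csv_names)
```

The `.txt` file is skipped because it does not match the pattern.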

### Calculating Silhouette Score

A range of candidate values of k (the number of clusters), [2,3,4,5,6,7,8], is picked, and a K-Means model is trained for each value. For each K-Means model the average silhouette coefficient over all observations is calculated. A high average indicates that observations are well matched to their own cluster and poorly matched to neighboring clusters (high inter-cluster distance and low intra-cluster distance).

In [247]:
def calculateSilhoutteScore(x,target):
    # Returns the k with the highest average silhouette score and a row
    # [target, score for k=2, ..., score for k=8] for the comparison table.
    max_val=-1
    k=None
    range_n_clusters = [2,3,4,5,6,7,8]
    row_list=[target]
    for n_clusters in range_n_clusters:
        # Fixed random_state so repeated runs give the same clustering.
        clusterer = KMeans(n_clusters=n_clusters, random_state=10)
        cluster_labels = clusterer.fit_predict(x)
        silhouette_avg = silhouette_score(x, cluster_labels)
        # Keep track of the best-scoring k seen so far.
        if silhouette_avg>max_val:
            k=n_clusters
            max_val=silhouette_avg
        print("For n_clusters =",n_clusters, "The average silhouette_score is :", silhouette_avg)
        row_list.append(round(silhouette_avg,4))
    return k,row_list
        

    

In [248]:

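df_list=[]

for files in read_files:
    data=pd.read_csv(files)
    # Columns from index 3 onward are used as clustering features.
    x=data.iloc[:,3:]
    # Project label built from the first two columns, e.g. "ant-1.3".
    target=data.iloc[0,0]+"-"+str(data.iloc[0,1])
    print("Project: ",target)
    n_cluster,row_list=calculateSilhoutteScore(x,target)
    df_list.append(row_list)
    # Refit with the best k and keep the cluster label of every row.
    KMean= KMeans(n_clusters=n_cluster)
    label=KMean.fit_predict(x)
    print("----------------")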


Project:  ant-1.3
For n_clusters = 2 The average silhouette_score is : 0.867656262723419
For n_clusters = 3 The average silhouette_score is : 0.6449335120984063
For n_clusters = 4 The average silhouette_score is : 0.6260178768500043
For n_clusters = 5 The average silhouette_score is : 0.5733877763741747
For n_clusters = 6 The average silhouette_score is : 0.5651878315778446
For n_clusters = 7 The average silhouette_score is : 0.5716345889622133
For n_clusters = 8 The average silhouette_score is : 0.4618421135202239
----------------
Project:  ant-1.4
For n_clusters = 2 The average silhouette_score is : 0.6909507763284335
For n_clusters = 3 The average silhouette_score is : 0.6535662481960056
For n_clusters = 4 The average silhouette_score is : 0.6517469430248358
For n_clusters = 5 The average silhouette_score is : 0.6640923576900749
For n_clusters = 6 The average silhouette_score is : 0.5986950664576589
For n_clusters = 7 The average silhouette_score is : 0.5042445125615287
For n_cluste

### Comparison

The average silhouette scores for all values of k are collected into a DataFrame with one row per project.

In [249]:
def create_df(df_list,target_list):
    new_df = pd.DataFrame(columns=target_list, data=df_list)
    return new_df

In [250]:
column_list=["PROJECT-NAME","2-CLUSTERS","3-CLUSTERS","4-CLUSTERS","5-CLUSTERS","6-CLUSTERS","7-CLUSTERS","8-CLUSTERS"]
new_df=create_df(df_list,column_list)
new_df.head(10)

| | PROJECT-NAME | 2-CLUSTERS | 3-CLUSTERS | 4-CLUSTERS | 5-CLUSTERS | 6-CLUSTERS | 7-CLUSTERS | 8-CLUSTERS |
|---|---|---|---|---|---|---|---|---|
| 0 | ant-1.3 | 0.8677 | 0.6449 | 0.626 | 0.5734 | 0.5652 | 0.5716 | 0.4618 |
| 1 | ant-1.4 | 0.691 | 0.6536 | 0.6517 | 0.6641 | 0.5987 | 0.5042 | 0.4949 |
| 2 | ant-1.5 | 0.7951 | 0.7194 | 0.7119 | 0.6651 | 0.6602 | 0.5671 | 0.5768 |
| 3 | ant-1.6 | 0.7497 | 0.7336 | 0.6732 | 0.671 | 0.598 | 0.6013 | 0.6003 |
| 4 | ant-1.7 | 0.7941 | 0.7678 | 0.6999 | 0.5783 | 0.5785 | 0.5808 | 0.5813 |
| 5 | ArcPlatform-1 | 0.8036 | 0.7668 | 0.7456 | 0.5646 | 0.5625 | 0.5504 | 0.5351 |
| 6 | Berek-1 | 0.8901 | 0.6676 | 0.6426 | 0.6361 | 0.6175 | 0.6333 | 0.5991 |
| 7 | camel-1 | 0.9023 | 0.7299 | 0.7051 | 0.5565 | 0.5627 | 0.5345 | 0.5351 |
| 8 | camel-1.2 | 0.9538 | 0.8464 | 0.6387 | 0.6153 | 0.5597 | 0.5623 | 0.5012 |
| 9 | camel-1.4 | 0.9756 | 0.8613 | 0.6504 | 0.6492 | 0.6462 | 0.649 | 0.5817 |


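In the silhouette dataframe, k = 2 has the highest average score for every project shown, and k = 3 comes second in all of them except ant-1.4 (where k = 5 scores slightly higher). If more than two clusters are wanted, an n_cluster value of 3 is a good pick.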
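
To read the best candidates off the table programmatically, the k values can be ranked per project. The sketch below copies three rows of the table above into a plain dictionary (the variable and function names are illustrative, not part of the notebook) and reports the best and second-best k for each.

```python
# Scores for k = 2..8, copied from three rows of the comparison table.
k_values = [2, 3, 4, 5, 6, 7, 8]
scores = {
    "ant-1.3": [0.8677, 0.6449, 0.626, 0.5734, 0.5652, 0.5716, 0.4618],
    "ant-1.4": [0.691, 0.6536, 0.6517, 0.6641, 0.5987, 0.5042, 0.4949],
    "camel-1.4": [0.9756, 0.8613, 0.6504, 0.6492, 0.6462, 0.649, 0.5817],
}

def rank_k(row):
    """Return the k values ordered from highest to lowest silhouette score."""
    return [k for _, k in sorted(zip(row, k_values), reverse=True)]

ranking = {project: rank_k(row) for project, row in scores.items()}
for project, order in ranking.items():
    print(project, "best k:", order[0], "runner-up:", order[1])
```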

### Information gain 

In [267]:
from __future__ import division  # only has an effect on Python 2
from collections import Counter
import math


In [268]:
def entropy(l):
    p = Counter(l)
    total = float(len(l))
    return -sum(count/total * math.log(count/total, 2) for count in p.values())
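
As a quick sanity check of the entropy formula, the sketch below repeats the function (using `math.log2`, which is equivalent to `math.log(x, 2)`) and applies it to a few made-up label lists. Two equally likely classes give 1 bit, a skewed split gives less, and a single class gives zero.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Hypothetical label lists, not taken from the dataset.
balanced = entropy([0, 1, 0, 1])   # two equally likely classes
skewed = entropy([0, 0, 0, 1])     # one class dominates
pure = entropy([1, 1, 1, 1])       # a single class

print(balanced, round(skewed, 4), abs(pure))
```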


In [280]:
dfList=[]
for files in read_files:
    data=pd.read_csv(files)
    # Entropy of the class label ('bug') over the whole file.
    class_entropy = round(entropy(data['bug'].tolist()),4)
    print('Class entropy:', class_entropy)
    
    # Note: initialized once per file, so it accumulates across the columns below.
    decision_entropy = 0
    target=data.iloc[0,0]+"-"+str(data.iloc[0,1])
    r_list=[]
    r_list.append(target)
    r_list.append(class_entropy)
    # The 18 metric columns (wmc ... amc).
    col=list(data.columns[3:21].values)
    for i in col:
        # Weighted entropy of 'bug' within each distinct value of column i.
        for name, g in data.groupby(i):
            g_total = g.count().values[0]
            p = Counter(g['bug'].tolist())
            r = []
            for v in p.values():
                r.append(-v/g_total * math.log(v/g_total, 2))
            decision_entropy += g_total/data.count().values[0] * sum(r)
        gain=round(abs(class_entropy - decision_entropy),4)
        print('Gain of :'+i+ str(gain ))
        r_list.append(gain)
    dfList.append(r_list)
    print(r_list)
    print("-----")

Class entropy: 0.8645
Gain of :wmc0.4248
Gain of :dit0.3744
Gain of :noc1.1848
Gain of :cbo1.6589
Gain of :rfc1.8447
Gain of :lcom2.1333
Gain of :ca2.6906
Gain of :ce3.2264
Gain of :npm3.6839
Gain of :lcom33.7471
Gain of :loc3.7791
Gain of :dam4.4177
Gain of :moa5.1634
Gain of :mfa5.6333
Gain of :cam5.7188
Gain of :ic6.5095
Gain of :cbm7.2728
Gain of :amc7.2888
['ant-1.3', 0.8645, 0.4248, 0.3744, 1.1848, 1.6589, 1.8447, 2.1333, 2.6906, 3.2264, 3.6839, 3.7471, 3.7791, 4.4177, 5.1634, 5.6333, 5.7188, 6.5095, 7.2728, 7.2888]
-----
Class entropy: 0.9181
Gain of :wmc0.2178
Gain of :dit0.629
Gain of :noc1.4844
Gain of :cbo2.0975
Gain of :rfc2.528
Gain of :lcom3.0553
Gain of :ca3.8332
Gain of :ce4.4812
Gain of :npm5.1743
Gain of :lcom35.558
Gain of :loc5.6949
Gain of :dam6.4864
Gain of :moa7.2834
Gain of :mfa7.6674
Gain of :cam7.9923
Gain of :ic8.8549
Gain of :cbm9.6956
Gain of :amc9.7925
['ant-1.4', 0.9181, 0.2178, 0.629, 1.4844, 2.0975, 2.528, 3.0553, 3.8332, 4.4812, 5.1743, 5.558, 5.6949, 
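
In the loop above, `decision_entropy` is set to zero once per file and then keeps growing as each column is processed, which is why most of the printed values exceed the class entropy. The information gain of a single attribute is usually defined as the class entropy minus the weighted class entropy within each value of that attribute, so it lies between 0 and the class entropy. A minimal sketch of that per-column definition on a made-up table follows; the rows, the choice of columns and the helper `information_gain` are illustrative assumptions, not the notebook's data or method.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    """Class entropy minus the weighted class entropy within each attribute value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # The conditional entropy is computed from scratch for every attribute.
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - conditional

# Hypothetical rows: "dit" separates the classes perfectly, "noc" not at all.
rows = [
    {"dit": 1, "noc": 0, "bug": 0},
    {"dit": 1, "noc": 1, "bug": 0},
    {"dit": 2, "noc": 0, "bug": 1},
    {"dit": 2, "noc": 1, "bug": 1},
]
gains = {col: information_gain(rows, col, "bug") for col in ["dit", "noc"]}
print(gains)
```

With this definition a perfectly informative column reaches the class entropy and an uninformative one scores zero, which makes the columns directly comparable.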

In [288]:
col=list(data.columns[3:21].values)
col.insert(0,"project name")
col.insert(1,"class-entropy")
newDf = pd.DataFrame(columns=col, data=dfList)
newDf.head()

| | project name | class-entropy | wmc | dit | noc | cbo | rfc | lcom | ca | ce | npm | lcom3 | loc | dam | moa | mfa | cam | ic | cbm | amc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ant-1.3 | 0.8645 | 0.4248 | 0.3744 | 1.1848 | 1.6589 | 1.8447 | 2.1333 | 2.6906 | 3.2264 | 3.6839 | 3.7471 | 3.7791 | 4.4177 | 5.1634 | 5.6333 | 5.7188 | 6.5095 | 7.2728 | 7.2888 |
| 1 | ant-1.4 | 0.9181 | 0.2178 | 0.629 | 1.4844 | 2.0975 | 2.528 | 3.0553 | 3.8332 | 4.4812 | 5.1743 | 5.558 | 5.6949 | 6.4864 | 7.2834 | 7.6674 | 7.9923 | 8.8549 | 9.6956 | 9.7925 |
| 2 | ant-1.5 | 0.5466 | 0.2266 | 0.2907 | 0.8052 | 1.1776 | 1.3283 | 1.4814 | 1.9566 | 2.288 | 2.6541 | 2.78 | 2.8414 | 3.2051 | 3.6751 | 3.8639 | 3.9936 | 4.4866 | 4.9477 | 4.9477 |
| 3 | ant-1.6 | 1.327 | 0.5754 | 0.6684 | 1.9254 | 2.7281 | 3.1831 | 3.678 | 4.7069 | 5.5978 | 6.4245 | 6.7012 | 6.798 | 7.7778 | 8.8935 | 9.536 | 9.8482 | 11.1112 | 12.3082 | 12.3911 |
| 4 | ant-1.7 | 1.206 | 0.4436 | 0.71 | 1.8577 | 2.7091 | 3.23 | 3.6917 | 4.709 | 5.5722 | 6.434 | 6.7505 | 6.9255 | 7.8231 | 8.8682 | 9.4621 | 9.7981 | 10.9611 | 12.0857 | 12.1802 |
