# Erisoglu 2011 "New" Algorithm

See: [A new algorithm for initial cluster centers in k-means algorithm](https://www.sciencedirect.com/science/article/pii/S0167865511002248)

## Initial setup and configuration

In [None]:
import imports
import numpy as np
import sklearn.datasets as skdatasets
import sklearn.metrics as skmetrics 
from metrics import accuracy
from matplotlib import pyplot as plt
import kmeans
from initialisations import erisoglu

In [None]:
eris = erisoglu.Erisoglu()

dataset = skdatasets.load_iris()
data = dataset.data
target = dataset.target

K = 3

## Attempts to reproduce results

### Some descriptive statistics (Table 1)

In [None]:
for i in range(len(dataset.feature_names)):
    column = dataset.data.T[i]

    mean = np.mean(column)
    stddev = np.std(column)  # NumPy default: population standard deviation (ddof=0)
    cvj = eris.variation_coefficient(column)

    print('%s: mean=%f, stddev=%f, cvj=%f' % (dataset.feature_names[i], mean, stddev, cvj))

**TODO:** work out why the standard deviations differ slightly from those in Table 1.
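
One possible explanation, offered as an assumption that has not been checked against the paper, is the choice of divisor. `np.std` divides by *n* by default (`ddof=0`), whereas many statistics packages report the sample standard deviation, which divides by *n* - 1. The sketch below uses made-up values purely to show the size of that effect:

```python
import numpy as np

# Illustrative values only (not taken from the paper)
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4])

population_std = np.std(values)       # ddof=0: divide by n (NumPy's default)
sample_std = np.std(values, ddof=1)   # ddof=1: divide by n - 1

print('population: %f, sample: %f' % (population_std, sample_std))
```

The two values differ by a factor of sqrt(n / (n - 1)), which is small for 150 rows. If Table 1 matches the `ddof=1` figures, that would settle the question.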

### Correlations with other features (Table 2)

In [None]:
# Correlate petal width (feature index 3) with each of the other features
for i in [0, 1, 2]:
    corr = eris.correlation_coefficient(data.T[3], data.T[i])

    print('%s:\t%f' % (dataset.feature_names[i], corr))

**NOTE:** these match Table 2 exactly, provided the correlation coefficient is computed as Pearson's *r* rather than with the formula given in the paper.
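
For reference, Pearson's *r* can be written out directly and checked against `np.corrcoef`. This is a minimal sketch, not the code inside `eris.correlation_coefficient`; the function name `pearson` and the sample values are illustrative assumptions.

```python
import numpy as np

def pearson(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Illustrative values only
x = np.array([0.2, 0.2, 1.4, 1.5, 2.5, 1.8])
y = np.array([1.4, 1.3, 4.7, 4.5, 6.0, 5.1])

r = pearson(x, y)
print(r, np.corrcoef(x, y)[0, 1])  # the two should agree
```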

### Main and secondary axes (paper selects petal width and sepal width)

In [None]:
main = eris._find_main_axis(data.T)
secondary = eris._find_secondary_axis(data.T, main)

print("Main feature deemed to be: %s" % (dataset.feature_names[main]))
print("Secondary feature deemed to be: %s" % (dataset.feature_names[secondary]))
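
As I read the paper, the main axis is the feature with the largest absolute coefficient of variation, and the secondary axis is the feature least correlated (in absolute value) with the main one. The sketch below follows that reading. `find_axes` and the toy data are hypothetical and are not the code behind `eris._find_main_axis` or `eris._find_secondary_axis`.

```python
import numpy as np

def find_axes(features):
    """features: (n_features, n_samples) array, one row per feature."""
    features = np.asarray(features, dtype=float)

    # Coefficient of variation per feature: standard deviation over mean
    cv = features.std(axis=1) / features.mean(axis=1)
    main = int(np.argmax(np.abs(cv)))

    # Absolute Pearson correlation of every feature with the main one
    corr = np.abs(np.corrcoef(features)[main])
    corr[main] = np.inf  # the main axis cannot also be the secondary axis
    secondary = int(np.argmin(corr))

    return main, secondary

# Toy data: row 1 varies most relative to its mean; row 2 is row 1 plus 10
toy = np.array([
    [10, 11, 9, 10, 12, 8],
    [1, 5, 2, 8, 3, 9],
    [11, 15, 12, 18, 13, 19],
])

main, secondary = find_axes(toy)
print(main, secondary)
```

Row 2 is perfectly correlated with row 1, so the weakly correlated row 0 is picked as the secondary axis.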

### Paper selects *m* = [1.993, 3.0573]

In [None]:
initial, axes = eris._initialise(dataset.data)
center = eris._find_center(data.T, axes)

print("Center was:", center)
print("Initial seed is:", initial)

### Find candidates (c1...K)

In [None]:
c = eris._generate_candidates(dataset.data, K, initial, axes)

print(c)

These candidates match the graph and distances in the paper:

![Screenshot of the candidate graph from the paper](ss1.png)
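
The selection can be sketched as follows, under my reading of the paper: take the mean of the data on the two chosen axes as the centre, pick the point farthest from it as the first candidate, then repeatedly pick the point whose summed distance to all candidates chosen so far is largest. `pick_candidates` is a hypothetical helper, not the implementation of `eris._generate_candidates`, and it returns row indices rather than coordinates.

```python
import numpy as np

def pick_candidates(points, k):
    """points: (n, 2) array of the data on the main and secondary axes."""
    points = np.asarray(points, dtype=float)
    centre = points.mean(axis=0)

    # First candidate: the point farthest from the centre
    chosen = [int(np.argmax(np.linalg.norm(points - centre, axis=1)))]

    # Later candidates: largest cumulative distance to those already chosen
    cumulative = np.zeros(len(points))
    while len(chosen) < k:
        cumulative += np.linalg.norm(points - points[chosen[-1]], axis=1)
        cumulative[chosen] = -np.inf  # never pick the same point twice
        chosen.append(int(np.argmax(cumulative)))

    return centre, chosen

toy = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [10, 0]])
centre, chosen = pick_candidates(toy, 3)
print(centre, chosen)
```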



### Try to replicate means of initial clusters (M1...K)

The paper gives the following means (with the order corrected):

    m1 = [5.1774, 3.6516, 1.4903, 0.2677]  
    m2 = [6.4024, 2.9506, 5.1193, 1.7916]  
    m3 = [5.1278, 2.7917, 2.5722, 0.6361]

In [None]:
M = eris.generate(dataset.data, K)

print(M)
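
A minimal sketch of how such means could be derived from the candidates: assign every row to its nearest candidate and average each group over all features. Measuring distance on the two selected axes only is an assumption on my part, and `initial_means` is a hypothetical helper rather than the body of `eris.generate`.

```python
import numpy as np

def initial_means(data, candidates, axes):
    """data: (n, d) array; candidates: row indices; axes: the two column indices."""
    data = np.asarray(data, dtype=float)
    projected = data[:, axes]
    seeds = projected[candidates]

    # Distance from every row to every candidate, on the chosen axes only
    dist = np.linalg.norm(projected[:, None, :] - seeds[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)

    # Mean of each group over all features
    return np.array([data[nearest == j].mean(axis=0)
                     for j in range(len(candidates))])

toy = np.array([[0, 5, 0], [1, 7, 0], [10, 1, 10], [11, 3, 10]])
means = initial_means(toy, candidates=[0, 3], axes=[0, 2])
print(means)
```

Each candidate is a data point and therefore its own nearest seed, so groups are non-empty as long as no two candidates share the same coordinates.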

## Run k-means

In [None]:
result = kmeans.cluster(data, K, c.copy())

print(result['labels'])
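
The local `kmeans` module is not shown here. For orientation, a plain Lloyd's iteration seeded with given centroids looks roughly like the sketch below. The function name `lloyd` and the `'centroids'` key are assumptions; only the `'labels'` key is taken from the cell above.

```python
import numpy as np

def lloyd(data, centroids, max_iter=100):
    """Standard k-means (Lloyd's algorithm) starting from the given centroids."""
    data = np.asarray(data, dtype=float)
    centroids = np.array(centroids, dtype=float)  # copy, so the seeds are untouched

    for _ in range(max_iter):
        # Assignment step: nearest centroid for every row
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)

        # Update step: mean of each cluster (an empty cluster keeps its centroid)
        updated = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated

    return {'centroids': centroids, 'labels': labels}

toy = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
toy_result = lloyd(toy, [[0, 0], [10, 10]])
print(toy_result['labels'])
```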

### Metrics

In [None]:
#acc = skmetrics.accuracy_score(target, result['labels'])
ari = skmetrics.adjusted_rand_score(target, result['labels'])

#print("Accuracy Score:", acc)
print("Adjusted Rand Index:", ari)

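**NOTE:** the paper uses the unadjusted Rand Index, whereas the cell above reports the Adjusted Rand Index. scikit-learn 0.24 and later also provides the unadjusted version as `sklearn.metrics.rand_score`; earlier releases only offered the adjusted one.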
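
As a cross-check that does not depend on the installed library version, the unadjusted Rand Index can be computed directly: it is the fraction of point pairs on which the two labelings agree (both put the pair together, or both put it apart). `rand_index` below is a hypothetical helper, quadratic in the number of points, which is fine for 150 rows.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += int(same_true == same_pred)
        total += 1
    return agree / total

# Cluster numbering does not matter: a relabelled perfect clustering scores 1.0
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = rand_index([0, 0, 1, 1], [0, 0, 0, 1])
print(perfect, partial)
```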

### Confusion matrix and actual accuracy score

In [None]:
cm = skmetrics.confusion_matrix(target, result['labels'])
print(cm)

# Correctly assigned rows per class, read by hand from the confusion matrix above
print((50 + 47 + 36) / len(data))
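
Cluster numbers are arbitrary, so the correct counts need not sit on the diagonal of the confusion matrix. A minimal sketch of deriving the score without reading values by hand: try every assignment of clusters to classes and keep the best. `matched_accuracy` is a hypothetical helper, and the matrix below is illustrative: its best-match counts equal the totals used above, but the column layout is assumed.

```python
from itertools import permutations
import numpy as np

def matched_accuracy(cm):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    cm = np.asarray(cm)
    k = cm.shape[0]
    best = max(
        sum(cm[i, perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return best / cm.sum()

# Illustrative confusion matrix (rows: true classes, columns: cluster labels)
example_cm = [[0, 50, 0],
              [47, 0, 3],
              [14, 0, 36]]

acc = matched_accuracy(example_cm)
print(acc)
```

Trying all permutations is only practical for small K. For larger K, `scipy.optimize.linear_sum_assignment` solves the same matching problem efficiently.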

## Graphs 

### Fig. 4a

In [None]:
formatter = plt.FuncFormatter(lambda i, *args: dataset.target_names[int(i)])

plt.figure(figsize=(5, 4))

# N.B. the paper plots the secondary axis on x and the main axis on y
plt.scatter(data[:, axes.secondary], data[:, axes.main], c=target)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel(dataset.feature_names[axes.secondary])
plt.ylabel(dataset.feature_names[axes.main])

plt.tight_layout()
plt.show()