<center><h1>Regional classifier (Visualization 2)</h1></center>

### Step 1: SOM training.

The code below implements the class *SOM* (a self-organizing map on a two-dimensional grid) and a function to plot the data
and the neurons over all training iterations in the special case where the feature space is also two-dimensional.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import numpy as np
import pandas as pd
import plotly.offline as plt
import plotly.graph_objs as go

from devcode.models.som import SOM
from devcode.utils import scale_feat, printDateTime
from load_dataset import datasets


from math import ceil

plt.init_notebook_mode(connected=True) # enabling plotly inside jupyter notebook

Dataset:  Features.shape:   # of classes:
vc2c      (310, 6)          2
vc3c      (310, 6)          3
wf24f     (5456, 24)        4
wf4f      (5456, 4)         4
wf2f      (5456, 2)         4
pk        (195, 22)         2


Example of class `SOM` running:

The code below trains one SOM on each dataset.

Note: The number of neurons is approximately $5\sqrt{N}$, where $N$ is the number of samples in the dataset, and the neurons are arranged in a square grid.
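As a quick check of the sizing rule in the note, the sketch below applies the same side-length formula as the training cell to the three dataset sizes listed earlier (310, 5456 and 195 samples). The helper name `grid_side` is made up for this illustration.

```python
from math import ceil, sqrt

def grid_side(n_samples):
    """Side of the smallest square grid holding at least 5*sqrt(N) neurons."""
    return ceil(sqrt(5 * sqrt(n_samples)))

for n in (310, 5456, 195):
    side = grid_side(n)
    print(n, side, side * side)  # samples, grid side, neurons in an l x l grid
```

Because the side is rounded up, an `l` x `l` grid is somewhat larger than $5\sqrt{N}$: 10 x 10 for 310 samples, 20 x 20 for 5456 samples, and 9 x 9 for 195 samples.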

In [3]:
%%time

nEpochs = 100
soms = {}
for name, data in datasets.items():
    N = len(data['features'].index) # number of datapoints
    l = ceil((5*N**.5)**.5) # side length of square grid of neurons
    X = data['features'].values
    
    # min-max scaling; scale_feat returns a pair and only the first array is used here
    X, _ = scale_feat(X, X, scaleType='min-max')
    
    # SOM training
    som = SOM(l,l)
    som.fit(X=X, alpha0=0.1, sigma0=3, nEpochs=nEpochs)
    
    soms[name] = som
    
    print("{} done!".format(name))
    
# CPU times: user 3min 10s, sys: 114 ms, total: 3min 10s
# Wall time: 3min 10s

vc2c done!
vc3c done!
wf24f done!
wf4f done!
wf2f done!
pk done!
CPU times: total: 2min 12s
Wall time: 2min 13s
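
The `SOM` class comes from `devcode` and its internals are not shown in this notebook. To give an idea of what a call such as `fit(X, alpha0, sigma0, nEpochs)` typically does, here is a minimal NumPy sketch of online SOM training. It is not the `devcode` implementation: the Gaussian neighborhood, the exponential decay schedules, the initialization from random samples, and the names `min_max_scale` and `train_som_sketch` are all assumptions made for illustration.

```python
import numpy as np

def min_max_scale(X):
    """Rescale every feature to [0, 1] (constant features map to 0)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def train_som_sketch(X, rows, cols, alpha0=0.1, sigma0=3.0, n_epochs=10, seed=0):
    """Online SOM training on a rows x cols grid; returns (rows*cols, d) prototypes."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # grid coordinates of each neuron, shape (rows*cols, 2)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # assumption: prototypes start at randomly chosen samples
    W = X[rng.integers(0, n, size=rows * cols)].astype(float)
    total_steps = n_epochs * n
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            frac = step / total_steps
            alpha = alpha0 * (0.01 / alpha0) ** frac  # learning rate, alpha0 -> 0.01
            sigma = sigma0 * (0.1 / sigma0) ** frac   # neighborhood width, sigma0 -> 0.1
            # best-matching unit: prototype closest to the current sample
            bmu = np.argmin(((W - X[i]) ** 2).sum(axis=1))
            # Gaussian neighborhood measured on the grid, centered at the BMU
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))
            # move every prototype toward the sample, weighted by the neighborhood
            W += alpha * h[:, None] * (X[i] - W)
            step += 1
    return W

# toy usage on random data (not one of the notebook's datasets)
rng = np.random.default_rng(1)
X = min_max_scale(rng.normal(size=(200, 3)))
W = train_som_sketch(X, rows=4, cols=4, n_epochs=5)
print(W.shape)
```

Each update is a convex combination of a prototype and a sample, so with scaled inputs the prototypes stay inside the unit hypercube.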


### Step 2: Clustering of the SOM.

Cluster the SOM prototypes with k-means and search for the optimal $k$ from 2 up to (but not including) $\lceil\sqrt{C}\rceil$, where $C$ is the number of SOM prototypes:
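
Python's `range` excludes its upper bound, so the search stops at $\lceil\sqrt{C}\rceil - 1$. The short sketch below lists the candidate values for three prototype counts. The helper `candidate_ks` is hypothetical, and the counts 81, 100 and 400 assume that each SOM holds exactly `l*l` prototypes.

```python
from math import ceil, sqrt

def candidate_ks(n_prototypes):
    """Values of k tried for a SOM with n_prototypes neurons."""
    return list(range(2, ceil(sqrt(n_prototypes))))

for c in (81, 100, 400):
    ks = candidate_ks(c)
    print(c, ks[0], ks[-1], len(ks))  # prototypes, smallest k, largest k, number of candidates
```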

In [4]:
%%time
    
from sklearn.cluster import KMeans
from devcode.utils.metrics import DB, dunn_fast, CH

printDateTime()
validation_indices = {
    'DB':   {},
    'Dunn': {},
    'CH':   {}
}

for dataset_name in datasets:
    som = soms[dataset_name]
    C = len(som.neurons)  # number of SOM prototypes
    ks = list(range(2, ceil(C**(1/2))))  # candidate values of k for k-means

    n_init = 10  # number of independent rounds of initialization
    validation_indices['DB'][dataset_name]   = [0]*len(ks)
    validation_indices['Dunn'][dataset_name] = [0]*len(ks)
    validation_indices['CH'][dataset_name]   = [0]*len(ks)
    for i, k in enumerate(ks):
        kmeans = KMeans(n_clusters=k, n_init=n_init, init='random').fit(som.neurons)
        # compute DB only if all centroids are distinct; otherwise mark this k as invalid
        centroids = kmeans.cluster_centers_
        if len(centroids) == len(np.unique(centroids, axis=0)):
            validation_indices['DB'][dataset_name][i] = DB(kmeans, som.neurons)
        else:
            validation_indices['DB'][dataset_name][i] = np.inf

        validation_indices['Dunn'][dataset_name][i] = dunn_fast(som.neurons, kmeans.labels_)
        validation_indices['CH'][dataset_name][i]   = CH(kmeans, som.neurons)

    print("End of dataset {}".format(dataset_name))

2022-08-07 08:32:23.502624
End of dataset vc2c
End of dataset vc3c
End of dataset wf24f
End of dataset wf4f
End of dataset wf2f
End of dataset pk
CPU times: total: 33.5 s
Wall time: 4.62 s
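
The functions `DB`, `dunn_fast` and `CH` come from `devcode.utils.metrics` and are not shown here. As a reference for reading the first of them, the sketch below computes the standard Davies-Bouldin index (lower is better) directly from the data and the cluster labels. The name `davies_bouldin_sketch` is hypothetical, and the `devcode` version takes different arguments and may differ in details.

```python
import numpy as np

def davies_bouldin_sketch(X, labels):
    """Standard Davies-Bouldin index: mean over clusters of the worst
    (scatter_i + scatter_j) / centroid_distance_ij ratio."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # average distance of each cluster's points to its own centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    # pairwise distances between centroids
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # a cluster is never compared with itself
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    return ratios.max(axis=1).mean()

# toy data: two tight pairs of points far apart
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = np.array([0, 0, 1, 1])  # groups the nearby points together
bad = np.array([0, 1, 0, 1])   # mixes the two pairs
print(davies_bouldin_sketch(X, good), davies_bouldin_sketch(X, bad))
```

The sensible labeling scores far lower than the mixed one. Two coinciding centroids would make a centroid distance zero and the ratio undefined, which is consistent with the duplicate-centroid guard in the cell above.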


In [5]:
from devcode.utils.visualization import plot_validation_indices

for dataset_name in datasets:
    plot_validation_indices(dataset_name, validation_indices)
    for index_name, results_vec in validation_indices.items():
        results = results_vec[dataset_name]
        # DB is minimized; Dunn and CH are maximized
        k_opt = np.argmin(results) if index_name=='DB' else np.argmax(results)
        k_opt += 2  # position 0 in `results` corresponds to k = 2
        print("K_opt for {} dataset using {} index: {}".format(
              dataset_name, index_name, k_opt))

K_opt for vc2c dataset using DB index: 6
K_opt for vc2c dataset using Dunn index: 9
K_opt for vc2c dataset using CH index: 2


K_opt for vc3c dataset using DB index: 8
K_opt for vc3c dataset using Dunn index: 9
K_opt for vc3c dataset using CH index: 2


K_opt for wf24f dataset using DB index: 17
K_opt for wf24f dataset using Dunn index: 4
K_opt for wf24f dataset using CH index: 2


K_opt for wf4f dataset using DB index: 5
K_opt for wf4f dataset using Dunn index: 17
K_opt for wf4f dataset using CH index: 2


K_opt for wf2f dataset using DB index: 5
K_opt for wf2f dataset using Dunn index: 3
K_opt for wf2f dataset using CH index: 2


K_opt for pk dataset using DB index: 3
K_opt for pk dataset using Dunn index: 4
K_opt for pk dataset using CH index: 2
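
The values above are obtained by taking the position of the best score and adding 2, because the candidate list starts at $k=2$. A slightly more explicit variant, sketched below, indexes the candidate list directly so no offset has to be remembered. The helper `best_k` and the toy scores are made up for this illustration and are not results from the notebook.

```python
import numpy as np

def best_k(ks, scores, lower_is_better):
    """Return the k with the best score (the first one wins on ties)."""
    pos = int(np.argmin(scores)) if lower_is_better else int(np.argmax(scores))
    return ks[pos]

ks = [2, 3, 4, 5]
toy_scores = [0.9, 0.4, 0.7, np.inf]  # made-up values; inf marks an invalid clustering

print(best_k(ks, toy_scores, lower_is_better=True))   # rule used for DB
print(best_k(ks, toy_scores, lower_is_better=False))  # rule used for Dunn and CH
```

Using `np.inf` as a placeholder is safe only for an index that is minimized: under `argmin` it can never be selected, whereas under `argmax` it would always win.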


In [6]:
from devcode.utils.visualization import plot_kmeans
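# `kmeans` and `som` are left over from the last iteration of the loops in In [4],
# i.e. the last dataset (pk) clustered with the largest k tested
plot_kmeans(kmeans, som.neurons)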