<center><h1>Classification by regional modeling</h1></center>

Classification by regional modeling consists of a six-step approach:
1. Setting the hyper-parameters. In this step, we specify the number of SOM prototypes $C$. We must also define
the maximum number of regions $K_{max}$. Without any prior knowledge, we set $K_{max} = \sqrt{C}$ in this example.


2. SOM training. To build regional models, we follow the procedure introduced by Vesanto and Alhoniemi [1].
Thus, the very first step requires training the SOM as usual, with $C$ prototypes.


3. Clustering of the SOM. This step consists in performing clustering over the $C$ SOM prototypes. Although one may
use any clustering algorithm for this step, for the sake of simplicity we use the standard K-means algorithm in
combination with the Davies–Bouldin (DB) index. The DB index is a clustering validity measure commonly used for
finding the optimal number of clusters, but any suitable measure can be used instead (see [2]). Thus, for each
$K = 2, \ldots, K_{max}$ we compute a partitioning of the SOM prototypes and the corresponding DB index value
(the index is not defined for a single cluster). The optimal partitioning, with $K_{opt}$ partitions, is the one
whose value of $K$ minimizes the DB index.


4. Partitioning SOM prototypes into regions. Once $K_{opt}$ is selected, the $r$-th cluster of SOM prototypes,
$r = 1, \ldots, K_{opt}$, is composed of all weight vectors $w_i$ that are mapped onto the prototype $p_r$ of the K-means
algorithm. More formally, the set of SOM prototypes associated with the $r$-th prototype of the K-means algorithm
is defined as: $$W_r = \{w_i \in \mathbb{R}^{p+q} \mid \|w_i-p_r\| < \|w_i-p_j\|, \; \forall j = 1, \ldots, K_{opt}, \; j \neq r \}$$


5. Mapping data points to regions. The fifth step consists in finding $K_{opt}$ data partitions, denoted by
$\{X_1\}$, $\{X_2\}$, ... , $\{X_{K_{opt}}\}$, of the training dataset by mapping each data point to a region
$r \in \{1, ... , K_{opt}\}$. Let $N_r$ denote the number of data vectors in $\{X_r\}$.
Then the partition $\{X_r\}$ is composed of those input vectors $x_{r\mu}$, $\mu = 1, ... , N_r$, whose closest SOM
prototype belongs to $W_r$.


6. Building classification models over the regions. Finally, once the original dataset has been divided into $K_{opt}$
subsets (one per region), the last step consists in building $K_{opt}$ regional classification models using
$X_r$, $r = 1, ... , K_{opt}$.
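
Step 3 above can be sketched with scikit-learn alone. The snippet below is only an illustration and not the code used later in this notebook: it assumes a synthetic matrix `prototypes` standing in for the trained SOM weights, uses `sklearn.metrics.davies_bouldin_score` rather than the notebook's own `DB` helper, and starts the search at $K = 2$ because the DB index needs at least two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for the C SOM prototypes: three well-separated blobs in 2-D.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
prototypes = np.vstack([c + 0.3 * rng.standard_normal((12, 2)) for c in centers])

C = len(prototypes)        # number of prototypes (36 here)
K_max = int(np.sqrt(C))    # K_max = sqrt(C) = 6

db_scores = {}
for K in range(2, K_max + 1):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(prototypes)
    db_scores[K] = davies_bouldin_score(prototypes, km.labels_)

# The partitioning with the smallest DB index is selected.
K_opt = min(db_scores, key=db_scores.get)
print(db_scores, K_opt)
```

On this toy data the three blobs are far apart relative to their spread, so the search is expected to settle on three regions; on real SOM prototypes the DB curve is usually much flatter.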

To test this framework the datasets below were gathered from the UCI repository:
* Vertebral Column
* Wall-Following
* Alzheimer (the one used in the course)

Note: We chose to keep $k-1$ dummy variables for a variable with $k$ categories, so the omitted category is identified
when all dummy variables are zero.
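
As a small illustration of this coding, the sketch below uses `pandas.get_dummies` with `drop_first=True` on a made-up label column. The column name and category values are hypothetical, and the notebook's own `load_dataset` module may build the dummies differently.

```python
import pandas as pd

# Hypothetical categorical column with k = 3 categories.
labels = pd.Series(["A", "B", "C", "A"], name="cls")

# drop_first=True keeps k-1 = 2 dummy columns; the dropped category ("A")
# is then encoded as a row of zeros.
dummies = pd.get_dummies(labels, drop_first=True).astype(int)
print(dummies)
```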

## Step 1: Setting the hyper-parameters.
## Step 2: SOM training.

The code below imports the class *SOM* (a self-organizing map on a two-dimensional grid), which also provides a method
to plot the data and the neurons over all training iterations in the special case where the feature space is also
two-dimensional.
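
For readers without access to the `som` module, the sketch below shows a generic online SOM in NumPy. It is not the notebook's `SOM` class: the function names, the linear decay of the learning rate and neighborhood width, and the initialization from random data points are all assumptions chosen for brevity.

```python
import numpy as np

def train_som(X, rows, cols, alpha0=0.1, sigma0=3.0, n_epochs=30, seed=0):
    """Online SOM on a rows x cols grid; returns weights of shape (rows*cols, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Grid coordinates of each neuron, used by the neighborhood function.
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    # Initialize the weights with randomly chosen training points.
    W = X[rng.integers(0, n, size=rows * cols)].copy()
    total_steps = n_epochs * n
    step = 0
    for _ in range(n_epochs):
        for idx in rng.permutation(n):
            frac = step / total_steps
            alpha = alpha0 * (1.0 - frac)            # decaying learning rate (assumed schedule)
            sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighborhood width (assumed schedule)
            x = X[idx]
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # best-matching unit
            grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))     # Gaussian neighborhood on the grid
            W += alpha * h[:, None] * (x - W)                # pull neurons toward x
            step += 1
    return W

def quantization_error(X, W):
    """Mean squared distance from each point to its closest prototype."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(1)
X = rng.random((200, 2))                 # toy data already scaled to [0, 1]
W = train_som(X, rows=4, cols=4, n_epochs=10)
print(W.shape, quantization_error(X, W))
```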

In [2]:
import numpy as np
import pandas as pd

from utils import scale_feat, printDateTime
from load_dataset import datasets
from som import SOM

# New SOM
import plotly.offline as plt
import plotly.graph_objs as go
from math import ceil

plt.init_notebook_mode(connected=True) # enabling plotly inside jupyter notebook

Example of class `SOM` running:

In [3]:

%%time
import datetime
print(datetime.datetime.now())

nEpochs = 30
print("nEpochs = {}".format(nEpochs))

data = datasets['wf2f']
N = len(data['features'].index) # number of datapoints
l = ceil((5*N**.5)**.5) # side length of the square grid of neurons
X = data['features'].values.copy()

X1, X2 = scale_feat(X,X,scaleType='min-max')
X=X1

som = SOM(l, l)
som.fit(X=X, alpha0=0.1, sigma0=3, nEpochs=nEpochs, saveNeuronsHist=True, verboses=1)
print(som.ssdHist[-1])

2022-07-29 14:42:48.295286
nEpochs = 30
End of epoch 1
End of epoch 2
End of epoch 3
End of epoch 4
End of epoch 5
End of epoch 6
End of epoch 7
End of epoch 8
End of epoch 9
End of epoch 10
End of epoch 11
End of epoch 12
End of epoch 13
End of epoch 14
End of epoch 15
End of epoch 16
End of epoch 17
End of epoch 18
End of epoch 19
End of epoch 20
End of epoch 21
End of epoch 22
End of epoch 23
End of epoch 24
End of epoch 25
End of epoch 26
End of epoch 27
End of epoch 28
End of epoch 29
End of epoch 30
0.12625690481847635
CPU times: total: 22.7 s
Wall time: 22.9 s


In [4]:
som.plotSSD()

In [5]:
som.plotSOM(X)



The code below trains a SOM on each dataset:

Note: The number of neurons is approximately $5\sqrt{N}$, where $N$ is the number of data points in the dataset, and the neurons are arranged in a square grid.
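
The grid-size rule can be checked by hand. The helper below reproduces the expression `ceil((5*N**.5)**.5)` used in the cells of this notebook; the function name and the two sample counts are only illustrative.

```python
from math import ceil, sqrt

def som_grid_side(n_samples):
    """Side of the smallest square grid holding at least 5*sqrt(N) neurons."""
    return ceil(sqrt(5 * sqrt(n_samples)))

for n in (310, 5456):          # illustrative sample counts (assumed)
    side = som_grid_side(n)
    print(n, side, side * side)
```

Because the side length is rounded up, the actual number of neurons (`side * side`) is slightly larger than $5\sqrt{N}$.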

In [None]:
%%time
printDateTime()

nEpochs = 100
soms = {}
for name, data in datasets.items():
    N = len(data['features'].index) # number of datapoints
    l = ceil((5*N**.5)**.5) # side length of square grid of neurons
    X = data['features'].values
    
    # scaling dataset
    X1, X2 = scale_feat(X,X,scaleType='min-max')
    X=X1
    
    # SOM training
    som = SOM(l,l)
    som.fit(X=X, alpha0=0.1, sigma0=3, nEpochs=nEpochs)
    
    soms[name] = som
    
    print("{} done!".format(name))
    
# CPU times: user 3min 10s, sys: 114 ms, total: 3min 10s
# Wall time: 3min 10s

2022-07-29 14:43:14.528479
vc2c done!
vc3c done!
wf24f done!


In [30]:
for dataset in datasets:
    print(dataset)
    soms[dataset].plotSSD()

vc2c


vc3c


wf24f


wf4f


wf2f


pk


# Step 3: Clustering of the SOM.

Clustering and searching for the optimal $k$ in the range 2 to $\sqrt{C}$, where $C$ is the number of SOM prototypes:

In [8]:
%%time
    
from sklearn.cluster import KMeans
from base import DB, dunn_fast, CH

printDateTime()
validation_indices = {
    'DB':   {},
    'Dunn': {},
    'CH':   {}
}

for dataset_name in datasets:
        som = soms[dataset_name]
        C = len(som.neurons)
        ks = [i for i in range(2, ceil(C**(1/2)))] # candidate k for k-means: 2 .. ceil(sqrt(C)) - 1

        n_init = 10 # number of independent rounds of initialization
        validation_indices['DB'][dataset_name]   = [0]*len(ks)
        validation_indices['Dunn'][dataset_name] = [0]*len(ks)
        validation_indices['CH'][dataset_name]   = [0]*len(ks)
        for i in range(len(ks)):
            kmeans = KMeans(n_clusters=ks[i], n_init=n_init, init='random').fit(som.neurons)
            # test if number of distinct clusters == number of clusters specified
            centroids = kmeans.cluster_centers_
            if len(centroids) == len(np.unique(centroids,axis=0)):
                validation_indices['DB'][dataset_name][i] = DB(kmeans,som.neurons)
            else:
                validation_indices['DB'][dataset_name][i] = np.inf

            validation_indices['Dunn'][dataset_name][i] = dunn_fast(som.neurons, kmeans.labels_)
            validation_indices['CH'][dataset_name][i]   = CH(kmeans, som.neurons)

        print("End of dataset {}".format(dataset_name))

2019-07-06 14:53:05.423260


NameError: name 'soms' is not defined

In [35]:
def plot_validation_indices(dataset_name, validation_indices):
    data = []
    for index_name, results_vec in validation_indices.items():
    #for validation_index in validation_indices:
        #print(index_name)
        #print(results_vec[dataset_name])
        data.append(go.Scatter(
            x=[i for i in range(2, len(results_vec[dataset_name])+2)],
            y=results_vec[dataset_name], 
            mode='lines+markers', 
            name="{} index".format(index_name)))

    
    layout = go.Layout(
        title = "Indices vs k [{} dataset]".format(dataset_name),
        legend=dict(orientation="h", y=-.05),
        xaxis=dict(title="Number of clusters (k)"),
        yaxis=dict(title="Indices values")
    )

    fig = go.Figure(data=data, layout=layout)
    plt.iplot(fig)


for dataset_name in datasets:
    plot_validation_indices(dataset_name, validation_indices)
    #plot_db(db[dataset_name], dataset_name)
    for index_name, results_vec in validation_indices.items():
        results = results_vec[dataset_name]
        k_opt = np.argmin(results) if index_name=='DB' else np.argmax(results)
        k_opt += 2
        print("K_opt for {} dataset using {} index: {}".format(
              dataset_name, index_name, k_opt))
        
    #print("K_opt for {} dataset is: {}".format(dataset_name, np.argmin(db[dataset_name])+2))

K_opt for vc2c dataset using DB index: 4
K_opt for vc2c dataset using Dunn index: 8
K_opt for vc2c dataset using CH index: 2


K_opt for vc3c dataset using DB index: 2
K_opt for vc3c dataset using Dunn index: 7
K_opt for vc3c dataset using CH index: 2


K_opt for wf24f dataset using DB index: 17
K_opt for wf24f dataset using Dunn index: 7
K_opt for wf24f dataset using CH index: 2


K_opt for wf4f dataset using DB index: 5
K_opt for wf4f dataset using Dunn index: 19
K_opt for wf4f dataset using CH index: 2


K_opt for wf2f dataset using DB index: 5
K_opt for wf2f dataset using Dunn index: 3
K_opt for wf2f dataset using CH index: 2


K_opt for pk dataset using DB index: 2
K_opt for pk dataset using Dunn index: 7
K_opt for pk dataset using CH index: 2


In [8]:
def plot_kmeans(kmeans, X):
    data = []
    if X is not None:
        datapoints = go.Scatter(
            x = X[:,0], 
            y = X[:,1], 
            mode='markers',
            name='data',
            marker = dict(
                 size = 5,
                 color = '#03A9F4'
                )
        )
        data.append(datapoints)
    
    kmeans_clusters = go.Scatter(
        x=kmeans.cluster_centers_[:,0],
        y=kmeans.cluster_centers_[:,1], 
        mode='markers', 
        name='kmeans clusters', 
        marker = dict(size=10,color = '#673AB7')
    )
    data.append(kmeans_clusters)

    layout = go.Layout(
        title = "Data + KMeans clusters",
        xaxis=dict(title="$x_1$"),
        yaxis=dict(title="$x_2$"),
    )

    fig = go.Figure(data=data, layout=layout)
    plt.iplot(fig)

plot_kmeans(kmeans,som.neurons)

NameError: name 'kmeans' is not defined

# Step 4: Partitioning SOM prototypes into regions:
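
A minimal sketch of this step, written directly from the definition of $W_r$: each prototype is assigned to its nearest K-means centroid. The random matrix `W` and the fixed `K_opt` are placeholders for the trained SOM weights and the value selected in Step 3; the notebook's `RegionalModel` class may organize this differently.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.random((25, 3))        # hypothetical SOM prototypes w_i (C = 25)
K_opt = 3                      # assumed to come from Step 3

km = KMeans(n_clusters=K_opt, n_init=10, random_state=0).fit(W)
P = km.cluster_centers_        # K-means prototypes p_r

# Distance from every SOM prototype to every K-means prototype, shape (C, K_opt).
dist = np.linalg.norm(W[:, None, :] - P[None, :, :], axis=2)
nearest = dist.argmin(axis=1)

# regions[r] holds the indices i of the prototypes w_i that are closer to p_r
# than to any other p_j, i.e. the members of W_r.
regions = {r: np.flatnonzero(nearest == r) for r in range(K_opt)}
```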

# Step 5: Mapping data points to regions:
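
Mapping data points to regions can be sketched as follows: each point takes the region of its closest SOM prototype. The arrays `X`, `W` and `proto_region` are made up for the example; in the actual pipeline they would come from the training split, the trained SOM and the partitioning of the prototypes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))                                # hypothetical training data
W = rng.random((9, 2))                                  # hypothetical SOM prototypes
proto_region = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])    # region of each prototype (assumed given)

# Closest SOM prototype of every data point.
dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)   # shape (N, C)
bmu = dists.argmin(axis=1)

# A point inherits the region of its closest prototype.
point_region = proto_region[bmu]
X_parts = [X[point_region == r] for r in range(3)]      # the partitions X_r
N_r = [len(part) for part in X_parts]                   # their sizes
print(N_r)
```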

# Step 6: Building classification models over the regions.

Example of class `RegionalModel` running with linear models:

In [28]:
%%time
from utils import dummie2multilabel
from regional_learning import RegionalModel
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.cluster import KMeans


dataset_name='wf2f'

X = datasets[dataset_name]['features'].values
Y = datasets[dataset_name]['labels'].values

# Train/Test split = 80%/20%
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)

# scaling features
X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType='min-max')

#N = len(dataset['features'].index) # number of datapoints
N = len(X_tr_norm) # number of datapoints in the train split
l = ceil((5*N**.5)**.5) # side length of square grid of neurons

som = SOM(l,l)
som_params={
    'alpha0':    0.01,
    'sigma0':    1,
    'nEpochs':   1,
    'verboses':  0            
}

C = l**2 # number of SOM neurons in the 2D grid
k_values = [i for i in range(2, ceil(np.sqrt(C)))] # candidate k: 2 .. ceil(sqrt(C)) - 1
cluster_params={
    'n_clusters': {'metric':   DB,        # when a dictionary is passed, a search over k is run
                   'criteria': np.argmin, # pick the k with the smallest DB score
                   'k_values': k_values}, # among the values provided in 'k_values'
    'n_init':     10, # number of initializations
    'init':       'random', 
    #'n_jobs':     -1
}

linearModel = linear_model.LinearRegression(n_jobs=-1)

rm = RegionalModel(som, linearModel)
rm.fit(X=X_tr_norm, Y=y_train, verboses=0,
        SOM_params     = som_params,
        Cluster_params = cluster_params)

# Evaluating in the test dataset
y_pred = rm.predict(X_ts_norm)
y_pred = np.round(np.clip(y_pred, 0, 1)) # rounding prediction numbers

cm = confusion_matrix(dummie2multilabel(y_test),
                      dummie2multilabel(y_pred))
#cm = np.asarray(cm).reshape(-1) # matrix => array
acc = np.trace(cm) / np.sum(cm) # accuracy = sum of the diagonal / total number of samples

CPU times: user 1.69 s, sys: 2.98 ms, total: 1.69 s
Wall time: 1.69 s


In [29]:
acc

0.8113553113553114
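
The idea behind a regional linear classifier can be illustrated without the `regional_learning` module. The sketch below is a simplification, not the `RegionalModel` implementation: it assumes the region of each sample is already known, uses full one-hot targets with an arg-max decision (the notebook instead keeps $k-1$ dummies and rounds the clipped predictions), and the helper names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_regional(X, Y, point_region, n_regions):
    """Fit one least-squares model per region on one-hot targets Y."""
    return [LinearRegression().fit(X[point_region == r], Y[point_region == r])
            for r in range(n_regions)]

def predict_regional(models, X, point_region):
    """Route each sample to the model of its region and take the arg-max class."""
    y_hat = np.empty(len(X), dtype=int)
    for r, model in enumerate(models):
        mask = point_region == r
        if mask.any():
            y_hat[mask] = model.predict(X[mask]).argmax(axis=1)
    return y_hat

# Toy problem: the class boundary differs on each side of x0 = 0.5.
rng = np.random.default_rng(0)
X = rng.random((400, 2))
region = (X[:, 0] > 0.5).astype(int)          # assumed output of the mapping step
y = np.where(region == 0, X[:, 1] > 0.3, X[:, 1] < 0.7).astype(int)
Y = np.eye(2)[y]                              # one-hot labels

models = fit_regional(X, Y, region, n_regions=2)
acc = (predict_regional(models, X, region) == y).mean()
print(acc)
```

A single global linear model cannot represent both boundaries at once, which is the motivation for fitting one model per region.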

Evaluation of regional OLS on the datasets:

In [113]:
# constant hyperparameters:
test_size = 0.2
scaleType = 'min-max'
n_resamplings = 100

# hyperparameters grid search:
num = 3
alphas = np.linspace(0.1, 0.5,  num=num).tolist()
sigmas = np.linspace(3,    10,   num=num).tolist()
epochs = np.linspace(100,  500, num=num, dtype='int').tolist()

# vector of random states for train/test split
random_states = np.random.randint(np.iinfo(np.int32).max, size=n_resamplings).tolist()
cases = [
    {
         "dataset_name" : dataset_name
        ,"random_state":  random_state
        ,"som_params":    { "alpha0"  : alpha0
                           ,"sigma0"  : sigma0
                           ,"nEpochs" : nEpochs
                          }
    }
    # hyperparameters possible values
    for dataset_name in datasets.keys()
    for random_state in random_states
    for alpha0       in alphas
    for sigma0       in sigmas
    for nEpochs      in epochs
]

print("alphas: {}\nsigmas: {}\nepochs: {}\n".format(alphas,sigmas,epochs))

print("# of alphas: {}\n# of sigmas: {}\n# of epochs: {}\n# of random_states: {}\n# of datasets: {}\n".format(
    len(alphas), len(sigmas), len(epochs), len(random_states), len(list(datasets.keys()))))

print("# of cases: {}".format(len(cases)))

alphas: [0.1, 0.30000000000000004, 0.5]
sigmas: [3.0, 6.5, 10.0]
epochs: [100, 300, 500]

# of alphas: 3
# of sigmas: 3
# of epochs: 3
# of random_states: 100
# of datasets: 6

# of cases: 16200
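
The case count above is simply the size of a Cartesian product. The sketch below recomputes it with `itertools.product`, using rounded copies of the printed grid values; only the counts matter here.

```python
from itertools import product

n_datasets, n_resamplings = 6, 100
alphas = [0.1, 0.3, 0.5]
sigmas = [3.0, 6.5, 10.0]
epochs = [100, 300, 500]

grid = list(product(alphas, sigmas, epochs))       # all SOM hyperparameter settings
n_cases = n_datasets * n_resamplings * len(grid)   # one run per dataset, split and setting
print(len(grid), n_cases)
```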


In [12]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from functools import partial
from multiprocessing import Pool

def evalRLM(case):
    dataset_name = case['dataset_name']
    random_state = case['random_state']
    som_params   = case['som_params']
    
    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values
    
    # Train/Test split
    X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size, random_state=random_state)
    # scaling features
    X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scaleType)

    N = len(X_tr_norm) # number of datapoints in the train split
    l = ceil((5*N**.5)**.5) # side length of square grid of neurons

    som = SOM(l,l)

    C = l**2 # number of SOM neurons in the 2D grid
    k_values = [i for i in range(2, ceil(np.sqrt(C)))] # candidate k: 2 .. ceil(sqrt(C)) - 1
    cluster_params={
        'n_clusters': {'metric':   DB,        # when a dictionary is passed, a search over k is run
                       'criteria': np.argmin, # pick the k with the smallest DB score
                       'k_values': k_values}, # among the values provided in 'k_values'
        'n_init':     10, # number of initializations
        'init':       'random'
        #'n_jobs':     0
    }

    linearModel = linear_model.LinearRegression()

    rlm = RegionalModel(som, linearModel)
    rlm.fit(X=X_tr_norm, Y=y_train,
            SOM_params     = som_params,
            Cluster_params = cluster_params)

    # Evaluating in the train set
    y_tr_pred = rlm.predict(X_tr_norm)
    y_tr_pred = np.round(np.clip(y_tr_pred, 0, 1)) # rounding prediction numbers

    cm_tr = confusion_matrix(dummie2multilabel(y_train),
                             dummie2multilabel(y_tr_pred)
                            ).reshape(-1) # matrix => array

    # Evaluating in the test set
    y_ts_pred = rlm.predict(X_ts_norm)
    y_ts_pred = np.round(np.clip(y_ts_pred, 0, 1)) # rounding prediction numbers

    cm_ts = confusion_matrix(dummie2multilabel(y_test),
                             dummie2multilabel(y_ts_pred)
                            ).reshape(-1) # matrix => array

    data = [dataset_name, random_state]+list(som_params.values())+[cm_tr]+[cm_ts]
    return data

In [8]:
# browser notification when a cell finishes, via '%%notify'
# import jupyternotify
# ip = get_ipython()
# ip.register_magics(jupyternotify.JupyterNotifyMagics)


In [15]:
%%notify
from multiprocessing import Pool
import tqdm

data = [None]*len(cases)
count=0
pool = Pool()
for i in tqdm.tqdm(pool.imap_unordered(evalRLM, cases), total=len(cases)):
    data[count] = i
    count+=1
pool.close()
pool.join()


results = np.vstack(data)
header  = ["dataset_name", "random_state", "alpha0", "sigma0", "nEpochs", "cm_tr", "cm_ts"]
results_df = pd.DataFrame(results, columns=header)

filename = "ROLS - all - n_res={n_resamplings} - {datetime}.csv".format(
    n_resamplings=n_resamplings,
    datetime=datetime.datetime.now()
)
results_df.to_csv(filename,index=False) # saving results in csv file

# [{elapsed}<{remaining}

100%|██████████| 16200/16200 [84:53:44<00:00,  1.07s/it]   



### Processing results:

In [111]:
# loading simulation results
df_results = pd.read_csv('ROLS - all - n_res=100 - 2019-07-10 04:50:57.404253.csv')
df_results.head()

Unnamed: 0,dataset_name,random_state,alpha0,sigma0,nEpochs,cm_tr,cm_ts
0,vc2c,127815836,0.1,3.0,100,[156 13 15 64],[32 9 7 14]
1,vc2c,127815836,0.1,10.0,100,[153 16 16 63],[32 9 6 15]
2,vc2c,127815836,0.1,6.5,100,[162 7 21 58],[34 7 9 12]
3,vc2c,127815836,0.1,10.0,300,[157 12 21 58],[32 9 8 13]
4,vc2c,127815836,0.1,6.5,300,[155 14 19 60],[33 8 7 14]
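
The `cm_tr` and `cm_ts` columns hold flattened confusion matrices saved as strings. A possible way to turn one of them into an accuracy value is sketched below; the helper name is hypothetical and the input string is the `cm_ts` entry of the first row shown above.

```python
import numpy as np

def accuracy_from_cm_string(cm_str):
    """Parse a flattened confusion matrix such as '[32 9 7 14]' and return the accuracy."""
    flat = np.array([int(tok) for tok in cm_str.strip("[]").split()])
    side = int(round(np.sqrt(flat.size)))   # the matrix is square: side x side
    cm = flat.reshape(side, side)
    return np.trace(cm) / cm.sum()          # correct predictions / all predictions

acc_first_row = accuracy_from_cm_string("[32 9 7 14]")
print(acc_first_row)
```

For this row the diagonal sums to 46 out of 62 test samples.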


Results for each data set:

In [148]:
som_params = [
    {
     "alpha0"  : alpha0
    ,"sigma0"  : sigma0
    ,"nEpochs" : nEpochs
    }
    for alpha0       in alphas
    for sigma0       in sigmas
    for nEpochs      in epochs
]

header = list(som_params[0].keys()) + ['Minimum', 'Maximum', 'Median', 'Mean', 'Std. Deviation']

df_ds = {}
for dataset_name in datasets: # For this specific dataset
    print(dataset_name)
    df = df_results.loc[df_results['dataset_name'] == dataset_name] # get simulation results

    count = 0
    df_data   = np.zeros((len(som_params), len(header))) # matrix that will hold the numeric results
    for params in som_params:
        df_case = df.loc[(df['alpha0']  == params['alpha0']) & 
                         (df['sigma0']  == params['sigma0']) &
                         (df['nEpochs'] == params['nEpochs'])]

        # converting confusion matrix from string to numpy array
        cm_ts = np.array([[int(x) for x in result[1:-1].split()] for result in df_case['cm_ts'].values])

        #data = cm_ts
        length = cm_ts.shape[1]
        cm_side = int(np.sqrt(length))

        acc   = [0]*len(cm_ts)
        for i in range(len(cm_ts)):
            cm = np.reshape(cm_ts[i], (cm_side,cm_side))
            acc[i] = np.trace(cm)/np.sum(cm)

        df_data[count,:] = np.array([
            params['alpha0'], params['sigma0'], params['nEpochs'],
            min(acc), max(acc), np.median(acc), np.mean(acc), np.std(acc)])
        count+=1


    df_ds[dataset_name] = pd.DataFrame(df_data, columns=header)
    print(df_ds[dataset_name].head()) # TODO: display
    print('-'*100,'\n'*2)

vc2c


Unnamed: 0,alpha0,sigma0,nEpochs,Minimum,Maximum,Median,Mean,Std. Deviation
0,0.1,3.0,100.0,0.677419,0.919355,0.83871,0.827097,0.050134
1,0.1,3.0,300.0,0.725806,0.935484,0.83871,0.829516,0.045291
2,0.1,3.0,500.0,0.709677,0.919355,0.83871,0.82871,0.043863
3,0.1,6.5,100.0,0.693548,0.903226,0.83871,0.828871,0.046573
4,0.1,6.5,300.0,0.709677,0.935484,0.830645,0.828871,0.045785


---------------------------------------------------------------------------------------------------- 


vc3c


Unnamed: 0,alpha0,sigma0,nEpochs,Minimum,Maximum,Median,Mean,Std. Deviation
0,0.1,3.0,100.0,0.645161,0.919355,0.822581,0.811129,0.049624
1,0.1,3.0,300.0,0.709677,0.919355,0.822581,0.814194,0.046326
2,0.1,3.0,500.0,0.645161,0.919355,0.806452,0.815484,0.05193
3,0.1,6.5,100.0,0.693548,0.919355,0.814516,0.810806,0.044748
4,0.1,6.5,300.0,0.709677,0.919355,0.822581,0.818226,0.044573


---------------------------------------------------------------------------------------------------- 


wf24f


Unnamed: 0,alpha0,sigma0,nEpochs,Minimum,Maximum,Median,Mean,Std. Deviation
0,0.1,3.0,100.0,0.79304,0.878205,0.850733,0.849762,0.015719
1,0.1,3.0,300.0,0.813187,0.881868,0.850733,0.850852,0.013273
2,0.1,3.0,500.0,0.808608,0.885531,0.851648,0.85217,0.012592
3,0.1,6.5,100.0,0.818681,0.884615,0.854396,0.853965,0.012983
4,0.1,6.5,300.0,0.804945,0.880952,0.854396,0.852564,0.012035


---------------------------------------------------------------------------------------------------- 


wf4f


Unnamed: 0,alpha0,sigma0,nEpochs,Minimum,Maximum,Median,Mean,Std. Deviation
0,0.1,3.0,100.0,0.839744,0.930403,0.873626,0.877564,0.019076
1,0.1,3.0,300.0,0.832418,0.93956,0.877747,0.881126,0.019123
2,0.1,3.0,500.0,0.839744,0.93315,0.877289,0.877711,0.01598
3,0.1,6.5,100.0,0.840659,0.935897,0.876374,0.879954,0.021112
4,0.1,6.5,300.0,0.830586,0.92674,0.877289,0.876429,0.017391


---------------------------------------------------------------------------------------------------- 


wf2f


Unnamed: 0,alpha0,sigma0,nEpochs,Minimum,Maximum,Median,Mean,Std. Deviation
0,0.1,3.0,100.0,0.769231,0.960623,0.908883,0.911016,0.023783
1,0.1,3.0,300.0,0.822344,0.956044,0.90522,0.905302,0.021214
2,0.1,3.0,500.0,0.812271,0.955128,0.904762,0.907152,0.021152
3,0.1,6.5,100.0,0.855311,0.957875,0.907509,0.907326,0.017636
4,0.1,6.5,300.0,0.830586,0.954212,0.906136,0.907115,0.015866


---------------------------------------------------------------------------------------------------- 


pk


Unnamed: 0,alpha0,sigma0,nEpochs,Minimum,Maximum,Median,Mean,Std. Deviation
0,0.1,3.0,100.0,0.717949,1.0,0.871795,0.860513,0.059177
1,0.1,3.0,300.0,0.692308,0.974359,0.871795,0.870513,0.056686
2,0.1,3.0,500.0,0.692308,0.974359,0.871795,0.859487,0.056466
3,0.1,6.5,100.0,0.692308,1.0,0.871795,0.865128,0.06203
4,0.1,6.5,300.0,0.692308,1.0,0.871795,0.856923,0.058939


---------------------------------------------------------------------------------------------------- 




Selecting, for each dataset, the hyperparameter setting with the highest mean accuracy:

In [149]:
data = np.array([df.sort_values('Mean', ascending=False).iloc[0,:].values for df in df_ds.values()])
idx_label = list(df_ds.keys())
df_rols = pd.DataFrame(data, columns=header, index=[idx_label])
df_rols

Unnamed: 0,alpha0,sigma0,nEpochs,Minimum,Maximum,Median,Mean,Std. Deviation
vc2c,0.1,10.0,100.0,0.725806,0.951613,0.83871,0.832419,0.04504
vc3c,0.1,6.5,300.0,0.709677,0.919355,0.822581,0.818226,0.044573
wf24f,0.5,10.0,300.0,0.825092,0.886447,0.85989,0.860018,0.012187
wf4f,0.5,6.5,300.0,0.850733,0.935897,0.878205,0.883727,0.020317
wf2f,0.1,10.0,300.0,0.849817,0.960623,0.908425,0.911868,0.017521
pk,0.1,3.0,300.0,0.692308,0.974359,0.871795,0.870513,0.056686


# Global OLS

In [116]:
# constant hyperparameters:
test_size = 0.2
scaleType = 'min-max'
n_resamplings = 100

# vector of random states for train/test split
random_states = np.random.randint(np.iinfo(np.int32).max, size=n_resamplings).tolist()
cases = [
    {
         "dataset_name" : dataset_name
        ,"random_state":  random_state
    }
    for dataset_name in datasets.keys()
    for random_state in random_states
]

print("# of random_states: {}\n# of datasets: {}\n".format(
    len(random_states), len(list(datasets.keys()))))

print("# of cases: {}".format(len(cases)))

# of random_states: 100
# of datasets: 6

# of cases: 600


In [75]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import linear_model

def evalGOLS(case):
    dataset_name = case['dataset_name']
    random_state = case['random_state']
    
    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values
    
    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size, random_state=random_state)

    # scaling features
    X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType='min-max')

    model = linear_model.LinearRegression().fit(X_tr_norm,y_train)
    
    # Evaluating in the train set
    y_tr_pred = model.predict(X_tr_norm)
    y_tr_pred = np.round(np.clip(y_tr_pred, 0, 1)) # rounding prediction numbers

    cm_tr = confusion_matrix(dummie2multilabel(y_train),
                             dummie2multilabel(y_tr_pred)
                            ).flatten() # matrix => array
    
    # Evaluating in the test set
    y_ts_pred = model.predict(X_ts_norm)
    y_ts_pred = np.round(np.clip(y_ts_pred, 0, 1)) # rounding prediction numbers

    cm_ts = confusion_matrix(dummie2multilabel(y_test),
                             dummie2multilabel(y_ts_pred)
                            ).flatten() # matrix => array

    
    data = [dataset_name, random_state]+[cm_tr]+[cm_ts]
    return data

In [76]:
from multiprocessing import Pool
import tqdm

data = [None]*len(cases)

pool = Pool()
data =[result for result in tqdm.tqdm(pool.imap_unordered(evalGOLS,cases), total=len(cases))]
pool.close()
pool.join()

results = np.vstack(data)
header  = ["dataset_name", "random_state", "cm_tr", "cm_ts"]
results_df = pd.DataFrame(results, columns=header)

100%|██████████| 600/600 [00:07<00:00, 82.56it/s] 


In [77]:
results_df.head()

Unnamed: 0,dataset_name,random_state,cm_tr,cm_ts
0,vc2c,696399911,"[162, 10, 32, 44]","[37, 1, 8, 16]"
1,vc2c,665400398,"[156, 12, 27, 53]","[36, 6, 5, 15]"
2,vc2c,1026613726,"[148, 17, 18, 65]","[39, 6, 4, 13]"
3,vc2c,1775733262,"[156, 15, 28, 49]","[37, 2, 7, 16]"
4,vc2c,406143792,"[163, 10, 28, 47]","[33, 4, 10, 15]"


Processing results (summary statistics of the test accuracy over the resamplings, for each dataset):

In [150]:
header = ['Minimum', 'Maximum', 'Median', 'Mean', 'Std. Deviation']

data      = np.zeros(( len(datasets.keys()), len(header) ))
idx_label = [' ']*len(datasets.keys())
count=0
for dataset_name in datasets: # For this specific dataset
    df = results_df.loc[results_df['dataset_name'] == dataset_name] # get simulation results
    
    # converting confusion matrices to numpy matrix
    cm_ts = np.array([array for array in df['cm_ts'].values])
       
    length = cm_ts.shape[1]
    cm_side = int(np.sqrt(length))
    acc   = [0]*len(cm_ts)
    for i in range(len(cm_ts)):
        cm = np.reshape(cm_ts[i], (cm_side,cm_side))
        acc[i] = np.trace(cm)/np.sum(cm)

    data[count,:] = np.array([min(acc), max(acc), np.median(acc), np.mean(acc), np.std(acc)])
    idx_label[count] = dataset_name
    count+=1
    
df_ols = pd.DataFrame(data, columns=header, index=[idx_label])
df_ols

Unnamed: 0,Minimum,Maximum,Median,Mean,Std. Deviation
vc2c,0.725806,0.919355,0.83871,0.833387,0.041377
vc3c,0.66129,0.903226,0.774194,0.778226,0.047867
wf24f,0.600733,0.666667,0.638278,0.638022,0.013931
wf4f,0.700549,0.761905,0.727564,0.727793,0.012644
wf2f,0.691392,0.759158,0.72207,0.722976,0.013688
pk,0.692308,0.974359,0.871795,0.866667,0.04972


# Comparing results:

In [171]:
header = ['Dataset', 'Model']+list(df_ols.columns)

temp_rols = df_rols.rename_axis('Dataset').reset_index().loc[:,[x for x in header if x!='Model']]
temp_rols.insert(1,'Model',['ROLS']*len(datasets.keys()))

temp_ols = df_ols.rename_axis('Dataset').reset_index()
temp_ols.insert(1,'Model',['OLS']*len(datasets.keys()))

print(
    pd.concat([temp_ols,temp_rols]).sort_index()
) # TODO: display

Unnamed: 0,Dataset,Model,Minimum,Maximum,Median,Mean,Std. Deviation
0,vc2c,OLS,0.725806,0.919355,0.83871,0.833387,0.041377
0,vc2c,ROLS,0.725806,0.951613,0.83871,0.832419,0.04504
1,vc3c,OLS,0.66129,0.903226,0.774194,0.778226,0.047867
1,vc3c,ROLS,0.709677,0.919355,0.822581,0.818226,0.044573
2,wf24f,OLS,0.600733,0.666667,0.638278,0.638022,0.013931
2,wf24f,ROLS,0.825092,0.886447,0.85989,0.860018,0.012187
3,wf4f,OLS,0.700549,0.761905,0.727564,0.727793,0.012644
3,wf4f,ROLS,0.850733,0.935897,0.878205,0.883727,0.020317
4,wf2f,OLS,0.691392,0.759158,0.72207,0.722976,0.013688
4,wf2f,ROLS,0.849817,0.960623,0.908425,0.911868,0.017521
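
To read the comparison at a glance, the difference in mean test accuracy can be computed per dataset. The sketch below copies the `Mean` column of the rows printed above into a small DataFrame; it is a convenience view only, not a statistical test.

```python
import pandas as pd

# Mean test accuracies copied from the rows printed above.
mean_acc = pd.DataFrame(
    {"OLS":  [0.833387, 0.778226, 0.638022, 0.727793, 0.722976],
     "ROLS": [0.832419, 0.818226, 0.860018, 0.883727, 0.911868]},
    index=["vc2c", "vc3c", "wf24f", "wf4f", "wf2f"])

# Positive values mean the regional model has the higher mean accuracy.
mean_acc["gain"] = mean_acc["ROLS"] - mean_acc["OLS"]
print(mean_acc.round(3))
```

Note that the ROLS rows are the best of the 27 SOM settings per dataset, selected on the same test accuracies, so the gains are likely somewhat optimistic.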


# References

[1] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans.
Neural Netw. 11 (2000) 586–600.

[2] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, J. Intell. Inf. Syst. 17 (2001) 107–145.