<center><h1>Classification by local modeling</h1></center>

## Summary:
1. [Introduction](#introduction)

2. [Local Ordinary Least Squares (L-OLS)](#lols)
    
    2.1. [Influence of the number of clusters on model accuracy](#lols-#-clusters)
    
3. [Local Least Squares Support Vector Machine (L-LSSVM)](#l_lssvm)

### 1. Introduction <a class="anchor" id="introduction"></a>

Classification by local modeling is classically a two-step approach:

1. An unsupervised clustering algorithm is run to find regions in the dataset;
2. For each region, a model is built with the respective data partition.

For inference the procedure is similar:

1. A similarity metric is used to assign the new data point to a region, e.g. the Euclidean distance to each region's prototype;
2. The model from that specific region is used to predict the class of the new data point.

Many clustering algorithms exist but, for the sake of simplicity, only K-means is used here.
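
The training and inference steps listed above can be sketched in a few lines. This is only an illustrative sketch, not the notebook's `LocalModel` class: `SimpleLocalModel` is a hypothetical name, scikit-learn's `LogisticRegression` is swapped in as the per-region classifier, and the toy data is made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


class SimpleLocalModel:
    """Hypothetical stand-in for illustration; not the notebook's LocalModel."""

    def __init__(self, n_clusters=4, random_state=0):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10,
                             random_state=random_state)
        self.models = {}

    def fit(self, X, y):
        # Training step 1: partition the training set into regions.
        regions = self.kmeans.fit_predict(X)
        # Training step 2: fit one classifier per region.
        for r in np.unique(regions):
            X_r, y_r = X[regions == r], y[regions == r]
            if np.unique(y_r).size == 1:
                # A pure region needs no classifier: remember its only class.
                self.models[r] = y_r[0]
            else:
                self.models[r] = LogisticRegression().fit(X_r, y_r)
        return self

    def predict(self, X):
        # Inference step 1: nearest prototype (centroid) in Euclidean distance.
        regions = self.kmeans.predict(X)
        y_pred = np.empty(len(X), dtype=int)
        # Inference step 2: delegate to the model of the assigned region.
        for r in np.unique(regions):
            mask = regions == r
            model = self.models[r]
            if isinstance(model, LogisticRegression):
                y_pred[mask] = model.predict(X[mask])
            else:
                y_pred[mask] = model
        return y_pred


# Toy XOR-like data (assumed for this example): four blobs,
# where diagonally opposite blobs share a class.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [0, 5], [5, 0], [5, 5]])
labels = np.array([0, 1, 1, 0])
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
y = np.repeat(labels, 50)

model = SimpleLocalModel(n_clusters=4).fit(X, y)
accuracy = np.mean(model.predict(X) == y)
print(f"Training accuracy: {accuracy:.2f}")
```

No single linear classifier separates this XOR-like layout, which is the kind of situation where splitting the input space into regions first can help.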

The **LocalModel** class used below provides an easy way to implement and test local models for classification:

In [48]:
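import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from devcode.models.lssvm import LSSVM
# datasets, LocalModel, scale_feat, dummie2multilabel and cm2acc are not imported
# here; they are assumed to be defined earlier in the notebook.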

In [49]:
def run_simulation(dataset_name, kmeans, clf_model, test_size=0.2):
    """Train and evaluate a local model on one dataset and print both accuracies."""
    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values

    # Train/test split (80%/20% with the default test_size)
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size)

    # Min-max scaling of the features
    X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType='min-max')

    # Cluster the training data and build one model per region
    lm = LocalModel(ClusterAlg=kmeans, ModelAlg=clf_model)
    lm.fit(X_tr_norm, y_train, verboses=1)

    y_pred_tr = lm.predict(X_tr_norm, rounded=True)
    y_pred_ts = lm.predict(X_ts_norm, rounded=True)

    # Confusion matrices for the training and test sets
    cm_tr = confusion_matrix(dummie2multilabel(y_train),
                             dummie2multilabel(y_pred_tr))
    cm_ts = confusion_matrix(dummie2multilabel(y_test),
                             dummie2multilabel(y_pred_ts))

    acc_tr = cm2acc(cm_tr)
    acc_ts = cm2acc(cm_ts)

    print(f"Train accuracy: {acc_tr}\nTest accuracy:  {acc_ts}\n")
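
The helpers `scale_feat`, `dummie2multilabel` and `cm2acc` are not shown in this section. The sketch below illustrates what such helpers might do, under the assumptions that labels are one-hot (dummy) encoded, that scaling statistics come from the training split only, and that accuracy is the trace of the confusion matrix over its sum. The function names are hypothetical stand-ins, not the project's actual implementations.

```python
import numpy as np


def min_max_scale(X_train, X_test):
    """Scale features to [0, 1] using the training split's minima and maxima."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    x_min = X_train.min(axis=0)
    span = X_train.max(axis=0) - x_min
    span[span == 0] = 1.0  # constant features: avoid division by zero
    return (X_train - x_min) / span, (X_test - x_min) / span


def onehot_to_labels(Y):
    """Convert one-hot rows (or per-class scores) to integer class labels."""
    return np.argmax(Y, axis=1)


def accuracy_from_cm(cm):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()


# Small made-up example
X_tr, X_ts = min_max_scale([[0, 10], [2, 20], [4, 30]], [[1, 15]])
print(X_ts)                                         # [[0.25 0.25]]
print(onehot_to_labels(np.array([[1, 0], [0, 1]])))  # [0 1]
print(accuracy_from_cm([[50, 5], [3, 42]]))          # 0.92
```

Reusing the training minima and maxima on the test split keeps test information out of preprocessing, at the cost of test values occasionally falling outside [0, 1].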

### 2. Local Ordinary Least Squares (L-OLS) <a class="anchor" id="lols"></a>

#### Description
Example of the local learning method using Ordinary Least Squares (OLS) as the base classifier.
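
OLS is a regression method, so using it for classification needs a convention. A common one, assumed here, is to regress one-hot class targets on the features and predict the class with the largest output. The sketch below shows this with NumPy's least-squares solver; the function names are hypothetical and the notebook itself relies on scikit-learn's `LinearRegression` inside `LocalModel`.

```python
import numpy as np


def fit_ols_classifier(X, Y_onehot):
    """Least-squares fit of one-hot targets; returns weights including a bias row."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column
    W, *_ = np.linalg.lstsq(Xb, Y_onehot, rcond=None)
    return W


def predict_ols_classifier(W, X):
    """Predict the class whose regression output is largest."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xb @ W, axis=1)


# Made-up 1-D example: classes 0, 0, 1, 1 along the axis
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.eye(2)[[0, 0, 1, 1]]          # one-hot targets
W = fit_ols_classifier(X, Y)
print(predict_ols_classifier(W, X))  # [0 0 1 1]
```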

In [50]:
%%time

# 1. Select the number of clusters (i.e., number of local regions)
n_clusters = 5

linear_clf = LinearRegression()
kmeans     = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)

run_simulation(dataset_name="vc2c", kmeans=kmeans, clf_model=linear_clf)

Start of clusterization: 2022-08-05 12:23:24.470290




Start of local models training: 2022-08-05 12:23:25.023427
Train accuracy: 0.8709677419354839
Test accuracy:  0.8387096774193549

CPU times: total: 6.77 s
Wall time: 679 ms


In [62]:
%%time

# Same experiment, now with an RBF-kernel LSSVM as the base classifier
n_clusters = 5

lssvm_clf = LSSVM(gamma=1, kernel='rbf', sigma=4)
kmeans    = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)

run_simulation(dataset_name="vc2c", kmeans=kmeans, clf_model=lssvm_clf)

Start of clusterization: 2022-08-05 12:38:48.127997




Start of local models training: 2022-08-05 12:38:48.678009
Train accuracy: 0.717741935483871
Test accuracy:  0.7258064516129032

CPU times: total: 6.97 s
Wall time: 676 ms


### 3. Local Least Squares Support Vector Machine (L-LSSVM) <a class="anchor" id="l_lssvm"></a>

#### Description
Example of the local learning method using the Least Squares Support Vector Machine (LSSVM) as the base classifier.
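
For context, an LSSVM classifier replaces the SVM's quadratic program with a single linear system in the bias `b` and the dual variables `alpha`. The sketch below solves that system for a binary problem with an RBF kernel. It is a minimal illustration, not the `devcode.models.lssvm.LSSVM` implementation: the function names, the label coding in {-1, +1}, and the kernel parametrization `exp(-||x - z||^2 / (2 * sigma^2))` are assumptions.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """Assumed RBF form: exp(-||a - b||^2 / (2 * sigma^2))."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))


def lssvm_fit(X, y, gamma=1.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b, alpha]^T = [0, 1]^T, y in {-1, +1}."""
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # bias b, dual variables alpha


def lssvm_predict(X_train, y_train, b, alpha, X_new, sigma=1.0):
    """Decision rule: sign(sum_i alpha_i * y_i * K(x, x_i) + b)."""
    K = rbf_kernel(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)


# Made-up example: two well-separated groups of three points
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)
b, alpha = lssvm_fit(X, y, gamma=1.0, sigma=1.0)
X_new = np.array([[0.5, 0.5], [5.5, 5.5]])
print(lssvm_predict(X, y, b, alpha, X_new))  # [-1.  1.]
```

Here `gamma` plays the role of the regularization constant and `sigma` the kernel width, mirroring the parameter names passed to `LSSVM` in the cells of this notebook.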

In [65]:
%%time

import numpy as np
from sklearn.cluster import KMeans
from devcode.models.lssvm import LSSVM

dataset_name = "vc2c"

# One LSSVM base classifier per kernel type
clf_dict = {
    'linear': LSSVM(gamma=1, kernel='linear'),
    'poly'  : LSSVM(gamma=1, kernel='poly', d=2),
    'rbf'   : LSSVM(gamma=1, kernel='rbf', sigma=1)
}

print(dataset_name)
n_samples = datasets[dataset_name]['features'].values.shape[0]
n_train   = int(0.8 * n_samples)  # run_simulation trains on 80% of the data by default

# Five cluster counts, evenly spaced from 2 to sqrt(n_train)
k_values = np.linspace(2, np.ceil(np.sqrt(n_train)), num=5, dtype='int').tolist()

for n_clusters in k_values:
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)

    for kernel_type, clf in clf_dict.items():
        print(f'Nº of clusters: {n_clusters} | Kernel: {kernel_type}')
        run_simulation(dataset_name=dataset_name, kmeans=kmeans, clf_model=clf)

vc2c
Nº of clusters: 2 | Kernel: linear
Start of clusterization: 2022-08-05 12:48:55.618971




Start of local models training: 2022-08-05 12:48:56.159649
Train accuracy: 0.8346774193548387
Test accuracy:  0.7580645161290323

Nº of clusters: 2 | Kernel: poly
Start of clusterization: 2022-08-05 12:48:56.273131




Start of local models training: 2022-08-05 12:48:56.858883
Train accuracy: 0.8346774193548387
Test accuracy:  0.8387096774193549

Nº of clusters: 2 | Kernel: rbf
Start of clusterization: 2022-08-05 12:48:56.975883




Start of local models training: 2022-08-05 12:48:57.562245
Train accuracy: 0.8306451612903226
Test accuracy:  0.8064516129032258

Nº of clusters: 5 | Kernel: linear
Start of clusterization: 2022-08-05 12:48:57.687411




Start of local models training: 2022-08-05 12:48:58.260738
Train accuracy: 0.8145161290322581
Test accuracy:  0.8064516129032258

Nº of clusters: 5 | Kernel: poly
Start of clusterization: 2022-08-05 12:48:58.377293




Start of local models training: 2022-08-05 12:48:58.978408
Train accuracy: 0.8145161290322581
Test accuracy:  0.7580645161290323

Nº of clusters: 5 | Kernel: rbf
Start of clusterization: 2022-08-05 12:48:59.091407




Start of local models training: 2022-08-05 12:48:59.683461
Train accuracy: 0.8387096774193549
Test accuracy:  0.7419354838709677

Nº of clusters: 9 | Kernel: linear
Start of clusterization: 2022-08-05 12:48:59.803455




Start of local models training: 2022-08-05 12:49:00.406731
Train accuracy: 0.7741935483870968
Test accuracy:  0.7419354838709677

Nº of clusters: 9 | Kernel: poly
Start of clusterization: 2022-08-05 12:49:00.524731




Start of local models training: 2022-08-05 12:49:01.132685
Train accuracy: 0.8064516129032258
Test accuracy:  0.7741935483870968

Nº of clusters: 9 | Kernel: rbf
Start of clusterization: 2022-08-05 12:49:01.257682




Start of local models training: 2022-08-05 12:49:01.861394
Train accuracy: 0.8346774193548387
Test accuracy:  0.7258064516129032

Nº of clusters: 12 | Kernel: linear
Start of clusterization: 2022-08-05 12:49:01.983475




Start of local models training: 2022-08-05 12:49:02.602396
Train accuracy: 0.8306451612903226
Test accuracy:  0.6774193548387096

Nº of clusters: 12 | Kernel: poly
Start of clusterization: 2022-08-05 12:49:02.723723




Start of local models training: 2022-08-05 12:49:03.327291
Train accuracy: 0.8225806451612904
Test accuracy:  0.8225806451612904

Nº of clusters: 12 | Kernel: rbf
Start of clusterization: 2022-08-05 12:49:03.436666




Start of local models training: 2022-08-05 12:49:04.045010
Train accuracy: 0.7943548387096774
Test accuracy:  0.8225806451612904

Nº of clusters: 16 | Kernel: linear
Start of clusterization: 2022-08-05 12:49:04.186819




Start of local models training: 2022-08-05 12:49:04.793954
Train accuracy: 0.8104838709677419
Test accuracy:  0.7419354838709677

Nº of clusters: 16 | Kernel: poly
Start of clusterization: 2022-08-05 12:49:04.918954




Start of local models training: 2022-08-05 12:49:05.525229
Train accuracy: 0.8467741935483871
Test accuracy:  0.8064516129032258

Nº of clusters: 16 | Kernel: rbf
Start of clusterization: 2022-08-05 12:49:05.651495




Start of local models training: 2022-08-05 12:49:06.266793
Train accuracy: 0.8266129032258065
Test accuracy:  0.8225806451612904

CPU times: total: 2min
Wall time: 10.8 s
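
The cluster counts in the run above (2, 5, 9, 12, 16) follow from the `np.linspace` grid between 2 and the square root of the training-set size. The snippet below reproduces them, assuming 248 training samples. That value is chosen here because it is consistent with the printed counts; the notebook does not state it.

```python
import numpy as np

n_train = 248  # assumed training-set size, for illustration only

k_max = np.ceil(np.sqrt(n_train))  # sqrt(248) is about 15.75, rounded up to 16.0
# linspace yields 2, 5.5, 9, 12.5, 16; the integer dtype truncates the fractions
k_values = np.linspace(2, k_max, num=5, dtype='int').tolist()
print(k_values)  # [2, 5, 9, 12, 16]
```

Because the fractional grid points are truncated rather than rounded, the spacing between consecutive cluster counts alternates between 3 and 4.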
