<center><h1>Classification by local modeling</h1></center>

## Summary:
1. [Introduction](#introduction)

2. [Local Ordinary Least Squares (L-OLS)](#lols)
    
    2.1. [Influence of the number of clusters on model accuracy](#lols-#-clusters)
    
3. [Local Least Squares Support Vector Machine (L-LSSVM)](#l_lssvm)

# 1. Introduction <a class="anchor" id="introduction"></a>

Classification by local modeling classically trains in two steps:

1. An unsupervised clustering algorithm is run to find regions in the dataset;
2. For each region, a model is built with the respective data partition.

For inference the procedure is similar:

1. A similarity metric is used to assign the new data point to a region, e.g. the Euclidean distance to the region prototypes;
2. The model from that specific region is used to predict the class of the new data point.
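
The two procedures above can be sketched in a few lines. This is only an illustrative sketch, not the `LocalModel` implementation: the helper names `fit_local_models` and `predict_local`, the toy data, and the choice to take the arg-max of the regression outputs as the predicted class are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


def fit_local_models(X, Y, n_clusters=2, random_state=0):
    """Training: cluster X, then fit one linear model per region."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    models = {}
    for k in range(n_clusters):
        mask = kmeans.labels_ == k          # data partition of region k
        models[k] = LinearRegression().fit(X[mask], Y[mask])
    return kmeans, models


def predict_local(kmeans, models, X_new):
    """Inference: route each point to its nearest prototype, use that region's model."""
    regions = kmeans.predict(X_new)         # nearest centroid (Euclidean distance)
    labels = np.empty(len(X_new), dtype=int)
    for k, model in models.items():
        mask = regions == k
        if mask.any():
            # One-hot targets: the largest regression output gives the class
            labels[mask] = model.predict(X_new[mask]).argmax(axis=1)
    return labels


# Toy data (hypothetical): two blobs whose class rule is reversed between blobs
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])
cls = np.concatenate([blob_a[:, 0] > 0.0, blob_b[:, 0] < 5.0]).astype(int)
Y = np.eye(2)[cls]                           # one-hot labels

kmeans, models = fit_local_models(X, Y, n_clusters=2)
pred = predict_local(kmeans, models, X)
acc = (pred == cls).mean()
print("Training accuracy of the sketch:", acc)
```

The toy problem is built so that the class rule flips from one blob to the other, which is the kind of situation where one model per region can be expected to help over a single global linear model.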

Many clustering algorithms exist but, for the sake of simplicity, only K-means is used here.

The **LocalModel** class (imported from `local_learning` in the examples below) provides an easy way to implement and test local models for classification.

# 2. Local Ordinary Least Squares (L-OLS) <a class="anchor" id="lols"></a>

Example of `LocalModel` class running with OLS:

In [3]:
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

import numpy as np
import pandas as pd

from utils import scale_feat, dummie2multilabel, cm2acc
from local_learning import LocalModel
from load_dataset import datasets



dataset_name = "vc2c"
X = datasets[dataset_name]['features'].values
Y = datasets[dataset_name]['labels'].values

# Train/Test split = 80%/20%
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
# scaling features
X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType='min-max')

n_clusters=5
print("Number of clusters: {}".format(n_clusters))
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
linReg = LinearRegression()

lm = LocalModel(ClusterAlg=kmeans, ModelAlg=linReg)
lm.fit(X_tr_norm, y_train, verboses=1)

y_pred_tr = lm.predict(X_tr_norm, rounded=True)
y_pred_ts = lm.predict(X_ts_norm, rounded=True)


cm_tr = confusion_matrix(dummie2multilabel(y_train),
                         dummie2multilabel(y_pred_tr))
cm_ts = confusion_matrix(dummie2multilabel(y_test),
                         dummie2multilabel(y_pred_ts))
   
acc_tr = cm2acc(cm_tr)
acc_ts = cm2acc(cm_ts)

print("Train accuracy: {}\nTest accuracy:  {}".format(acc_tr, acc_ts))

Number of clusters: 5
Start of clusterization: 2022-08-02 16:49:02.236676
Start of local models training: 2022-08-02 16:49:02.270674
Train accuracy: 0.8629032258064516
Test accuracy:  0.8225806451612904
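
The accuracies printed above are derived from confusion matrices. The helpers `dummie2multilabel` and `cm2acc` come from the local `utils` module and are not shown here, so the sketch below uses hypothetical stand-ins (`onehot_to_labels`, `accuracy_from_cm`) that assume one-hot encoded labels and the usual definition of accuracy as the sum of the diagonal divided by the total count.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def onehot_to_labels(Y):
    """Hypothetical stand-in for dummie2multilabel: one-hot rows -> class indices."""
    return np.asarray(Y).argmax(axis=1)


def accuracy_from_cm(cm):
    """Hypothetical stand-in for cm2acc: correct predictions / all predictions."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()


# Five made-up samples, one of which is misclassified
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_pred = np.array([[1, 0], [0, 1], [0, 1], [0, 1], [0, 1]])

cm = confusion_matrix(onehot_to_labels(y_true), onehot_to_labels(y_pred))
acc = accuracy_from_cm(cm)
print(cm)    # rows: true class, columns: predicted class
print(acc)   # 4 correct out of 5 -> 0.8
```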


# 3. Local Least Squares Support Vector Machine (L-LSSVM) <a class="anchor" id="l_lssvm"></a>
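
As background, an LS-SVM replaces the quadratic program of a standard SVM with a single linear system. The sketch below follows the textbook binary formulation with labels in {-1, +1} and an RBF kernel; it is not the `lssvm` package used in this notebook, whose internals (including how it handles multi-class labels) are not shown here. The function names `rbf_kernel`, `lssvm_fit` and `lssvm_predict`, and the toy data, are assumptions for illustration.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))


def lssvm_fit(X, y, gamma=1.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b, alpha]^T = [0, 1]^T."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                   # bias b, multipliers alpha


def lssvm_predict(X_new, X, y, b, alpha, sigma=1.0):
    """Decision rule: sign(sum_i alpha_i * y_i * K(x, x_i) + b)."""
    scores = rbf_kernel(X_new, X, sigma) @ (alpha * y) + b
    return np.where(scores >= 0, 1, -1)


# Toy data (hypothetical): two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, size=(20, 2)),
               rng.normal(2.0, 0.3, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])

b, alpha = lssvm_fit(X, y, gamma=1.0, sigma=1.0)
pred = lssvm_predict(X, X, y, b, alpha, sigma=1.0)
```

Here `gamma` plays the role of the regularization parameter and `sigma` the kernel width, mirroring the names passed to `LSSVM` in the cell below.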

Example of `LocalModel` class running with LSSVM:

In [None]:
%%time
# %autoreload

from sklearn.cluster import KMeans
from lssvm import LSSVM
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# small_datasets =['vc2c', 'vc3c', 'pk']

for dataset_name in datasets:
    print(dataset_name)

    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values

    X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)  # Train/Test split = 80%/20%
    X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType='min-max') # scaling features
    
    # 5 evenly spaced cluster counts between 2 and ceil(sqrt(N_train)), truncated to integers
    k_values = np.linspace(2, np.ceil(np.sqrt(len(X_train))), num=5, dtype='int').tolist()
    
    for n_clusters in k_values:
        print("# of clusters: {}".format(n_clusters))
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)

        clf_dict = {
            'linear': LSSVM(gamma=1, kernel='linear'),
            'poly'  : LSSVM(gamma=1, kernel='poly', d=2),
            'rbf'   : LSSVM(gamma=1, kernel='rbf', sigma=1)
        }

        for kernel_type, clf in clf_dict.items():
            print('kernel: {}'.format(kernel_type))
            lm = LocalModel(ClusterAlg=kmeans, ModelAlg=clf)
            lm.fit(X_tr_norm, y_train, verboses=0)

            y_pred_tr = lm.predict(X_tr_norm)
            y_pred_ts = lm.predict(X_ts_norm)

            cm_tr = confusion_matrix(dummie2multilabel(y_train),
                                     dummie2multilabel(y_pred_tr))
            cm_ts = confusion_matrix(dummie2multilabel(y_test),
                                     dummie2multilabel(y_pred_ts))

            acc_tr = cm2acc(cm_tr)
            acc_ts = cm2acc(cm_ts)

            print("Train accuracy: {}\nTest accuracy:  {}\n".format(acc_tr, acc_ts))

    print('\n')
    print('#'*60)
    print('\n'*2)
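
The grid of cluster counts used in the loop above runs from 2 to the ceiling of the square root of the training-set size. A small sketch of how such a grid behaves is given below; the helper name `cluster_grid` and the sample sizes are assumptions for illustration. Unlike the cell above, the sketch also removes duplicate values, which can appear after integer truncation when the training set is small.

```python
import numpy as np


def cluster_grid(n_train, num=5):
    """Evenly spaced, de-duplicated cluster counts from 2 to ceil(sqrt(n_train))."""
    upper = int(np.ceil(np.sqrt(n_train)))
    grid = np.linspace(2, upper, num=num, dtype=int)   # fractional values are truncated
    return sorted(set(grid.tolist()))


grid_large = cluster_grid(248)   # [2, 5, 9, 12, 16]
grid_small = cluster_grid(10)    # [2, 3, 4]; without set() this would be [2, 2, 3, 3, 4]
print(grid_large, grid_small)
```

Without the de-duplication, a repeated value of `n_clusters` would simply retrain and re-evaluate the same configuration.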