<center><h1>Classification by local modeling</h1></center>

### Summary:
1. [Introduction](#introduction)

2. [Local Ordinary Least Squares (L-OLS)](#lols)
    
    2.1. [Influence of the number of clusters on model accuracy](#lols-#-clusters)
    
3. [Local Least Squares Support Vector Machine (L-LSSVM)](#l_lssvm)

### 1. Introduction <a class="anchor" id="introduction"></a>

Classic classification by local modeling is a two-step approach:

1. An unsupervised clustering algorithm is run to find regions in the dataset;
2. For each region, a model is built with the respective data partition.

For inference the procedure is similar:

1. A similarity metric is used to determine the region of the new data point, e.g. the Euclidean distance to the region prototypes;
2. The model from that specific region is used to predict the class of the new data point.

Many clustering algorithms exist but, for the sake of simplicity, only K-means is used here.
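
The two procedures above can be sketched in a few lines. This is a minimal illustration, assuming one-hot encoded labels and one least-squares model per region; the helper names `fit_local_models` and `predict_local` and the XOR-like toy data are hypothetical and are not the `LocalModel` class from `devcode`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


def fit_local_models(X, Y, n_clusters, seed=0):
    """Cluster X with K-means, then fit one linear model per cluster on one-hot labels Y."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    models = {}
    for c in range(n_clusters):
        mask = kmeans.labels_ == c  # samples that belong to region c
        models[c] = LinearRegression().fit(X[mask], Y[mask])
    return kmeans, models


def predict_local(kmeans, models, X):
    """Route each sample to its nearest prototype and use that region's model."""
    regions = kmeans.predict(X)  # nearest centroid, Euclidean distance
    y_pred = np.empty(len(X), dtype=int)
    for c in np.unique(regions):
        mask = regions == c
        y_pred[mask] = models[int(c)].predict(X[mask]).argmax(axis=1)
    return y_pred


# Toy data (assumption): an XOR-like problem that no single line can separate
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(400, 2))
labels_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)
Y_demo = np.eye(2)[labels_demo]  # one-hot targets

kmeans_demo, models_demo = fit_local_models(X_demo, Y_demo, n_clusters=4)
pred_demo = predict_local(kmeans_demo, models_demo, X_demo)
local_acc = (pred_demo == labels_demo).mean()
print(f"Training accuracy of the local models: {local_acc:.3f}")
```

Taking the `argmax` of the regression output is one simple way to turn a least-squares fit into a class decision; other decision rules are possible.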

In [28]:
import warnings
warnings.filterwarnings('ignore')

In [29]:
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from devcode.utils import scale_feat, dummie2multilabel, cm2acc
from devcode import run_simulation, run_round
from devcode.models.lssvm import LSSVM
from devcode.models.local_learning import LocalModel

import numpy as np
import pandas as pd
from load_dataset import get_datasets

In [30]:
datasets = get_datasets()

base_path = "results/simulations/" # Folder where the simulation files will be saved

# Hyper-parameters:
n_init    = 5         # Number of independent runs
test_size = 0.2       # Test size of 20%
scaleType = 'min-max' # Type of feature scaling

results_df = {'GOLS': {}, 'LOLS': {}}

### 2. Global model training

As a baseline, we use a global model: a single linear model fitted to all the training data.
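
The implementation of `run_round` is not shown in this notebook. As a rough idea of what one evaluation round of a global linear classifier can look like, here is a sketch under stated assumptions: labels are one-hot encoded, features are min-max scaled with statistics from the training split only, and the predicted class is the `argmax` of the regression output. The function `one_round` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def one_round(X, Y, test_size, seed=None):
    """Hypothetical stand-in for a single round (not devcode.run_round)."""
    X_tr, X_ts, Y_tr, Y_ts = train_test_split(
        X, Y, test_size=test_size, random_state=seed)

    # Scaling statistics come from the training split only
    scaler = MinMaxScaler().fit(X_tr)
    X_tr, X_ts = scaler.transform(X_tr), scaler.transform(X_ts)

    # Ordinary least squares on the one-hot targets
    model = LinearRegression().fit(X_tr, Y_tr)

    acc_tr = accuracy_score(Y_tr.argmax(axis=1), model.predict(X_tr).argmax(axis=1))
    acc_ts = accuracy_score(Y_ts.argmax(axis=1), model.predict(X_ts).argmax(axis=1))
    return acc_tr, acc_ts


# Toy data (assumption): two well-separated Gaussian blobs
X_toy, labels_toy = make_blobs(n_samples=300, centers=[[-3, -3], [3, 3]],
                               cluster_std=1.0, random_state=0)
Y_toy = np.eye(2)[labels_toy]  # one-hot targets

acc_tr_toy, acc_ts_toy = one_round(X_toy, Y_toy, test_size=0.2, seed=0)
print(f"Train accuracy: {acc_tr_toy:.3f}")
print(f"Test accuracy:  {acc_ts_toy:.3f}")
```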

In [31]:
%%time
# Applying Global Ordinary Least Squares (GOLS) to every dataset
header = list(datasets.keys())
results = np.zeros((n_init, len(header)))

for count, dataset_name in enumerate(datasets): # count = column index of the dataset
    X = datasets[dataset_name]['features'].values
    y = datasets[dataset_name]['labels'].values
    
    acc = [0]*n_init
    for i in range(n_init): # n_init independent rounds
        acc_tr, acc_ts = run_round(X, y, test_size, LinearRegression, {})
        # NOTE: the train accuracy is stored here, while the local models
        # below are scored on the test set; store acc_ts for a like-for-like comparison
        acc[i] = acc_tr
            
    results[:,count] = acc
    
results_gols = pd.DataFrame(results, columns=header)
results_gols.head()
# Wall time: 3.21 s

Train accuracy: 0.8387096774193549
Test accuracy:  0.9032258064516129

Train accuracy: 0.842741935483871
Test accuracy:  0.8064516129032258

Train accuracy: 0.8467741935483871
Test accuracy:  0.8548387096774194

Train accuracy: 0.8508064516129032
Test accuracy:  0.8387096774193549

Train accuracy: 0.8266129032258065
Test accuracy:  0.8709677419354839

Train accuracy: 0.8145161290322581
Test accuracy:  0.6935483870967742

Train accuracy: 0.7983870967741935
Test accuracy:  0.8225806451612904

Train accuracy: 0.7741935483870968
Test accuracy:  0.8709677419354839

Train accuracy: 0.7741935483870968
Test accuracy:  0.7580645161290323

Train accuracy: 0.7741935483870968
Test accuracy:  0.8225806451612904

Train accuracy: 0.6482584784601283
Test accuracy:  0.6401098901098901

Train accuracy: 0.6287809349220899
Test accuracy:  0.6666666666666666

Train accuracy: 0.6455087076076994
Test accuracy:  0.6172161172161172

Train accuracy: 0.6402383134738772
Test accuracy:  0.63003663003663

Train acc

| run | vc2c | vc3c | wf24f | wf4f | wf2f | pk |
|---|---|---|---|---|---|---|
| 0 | 0.83871 | 0.814516 | 0.648258 | 0.723419 | 0.724794 | 0.910256 |
| 1 | 0.842742 | 0.798387 | 0.628781 | 0.72846 | 0.72594 | 0.903846 |
| 2 | 0.846774 | 0.774194 | 0.645509 | 0.726169 | 0.721815 | 0.916667 |
| 3 | 0.850806 | 0.774194 | 0.640238 | 0.727773 | 0.724794 | 0.910256 |
| 4 | 0.826613 | 0.774194 | 0.64253 | 0.722273 | 0.721815 | 0.910256 |


### 3. Local model training

In [32]:
%%time
import datetime

# Applying Local Ordinary Least Squares

n_ks = 2 # number of clusters to evaluate for each dataset
# for each dataset
for dataset_name in datasets:
    print("Starting '{}' at {}".format(dataset_name,datetime.datetime.now()))
    
    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values
    
    max_k = int(0.8*len(X)*(1-test_size)) # k_max = 80% of the number of samples on the train set
    ks = np.linspace(2, max_k, num=n_ks, dtype='int')
    print("ks = {}".format(ks))
    header = ["k={}".format(i) for i in ks] # header with the number of clusters
    
    results = np.zeros((n_init, len(ks)))
    for j in range(len(ks)): # for each value of k
        print(ks[j])
        for i in range(n_init): # run n_init independent train/test split and evaluation
            # Train/Test split
            X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size)

            # scaling features
            X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scaleType)
    
            # creating clustering and model algorithm
            kmeans = KMeans(n_clusters=ks[j], n_init=10)
            linReg = LinearRegression()
            
            # creating and fitting local model
            lm = LocalModel(ClusterAlg=kmeans, ModelAlg=linReg)
            lm.fit(X_tr_norm, y_train)

            # evaluating accuracy on the test set
            y_pred_ts = lm.predict(X_ts_norm, rounded=True)
            cm_ts = confusion_matrix(dummie2multilabel(y_test),
                                     dummie2multilabel(y_pred_ts))

            acc_ts = cm2acc(cm_ts)
            results[i,j] = acc_ts

        
    results_df = pd.DataFrame(results, columns=header)
    filename   = f"{base_path}LOLS - {dataset_name} - n_init {n_init}.csv"
    results_df.to_csv(filename, sep='\t') # saving results as a tab-separated file
    print("{} done!".format(dataset_name))
    print(" ")
    
# CPU times: user 55min 5s, sys: 18min 22s, total: 1h 13min 28s
# Wall time: 12h 57min 42s

Starting 'vc2c' at 2022-08-15 12:07:00.110313
ks = [  2 198]
2
198
vc2c done!
 
Starting 'vc3c' at 2022-08-15 12:07:01.765312
ks = [  2 198]
2
198
vc3c done!
 
Starting 'wf24f' at 2022-08-15 12:07:03.461310
ks = [   2 3491]
2
3491
wf24f done!
 
Starting 'wf4f' at 2022-08-15 12:09:14.614312
ks = [   2 3491]
2
3491


KeyboardInterrupt: 
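
The cluster counts printed above follow directly from the grid definition in the cell. A small sketch that reproduces it in isolation, assuming 310 samples for `vc2c` (an assumption, consistent with the printed `ks = [  2 198]`); the helper name `cluster_grid` is hypothetical.

```python
import numpy as np


def cluster_grid(n_samples, test_size, n_ks, frac=0.8):
    """Evenly spaced cluster counts from 2 up to frac * (size of the train set)."""
    max_k = int(frac * n_samples * (1 - test_size))
    return np.linspace(2, max_k, num=n_ks, dtype=int)


ks_two = cluster_grid(310, test_size=0.2, n_ks=2)   # same setting as n_ks = 2 above
ks_five = cluster_grid(310, test_size=0.2, n_ks=5)  # a denser grid
print(ks_two)
print(ks_five)
```

With `n_ks = 2` only the two extremes are evaluated, so a larger `n_ks` is needed to see how accuracy changes between them, at the cost of a longer run.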

### 4. Processing results:

The following results are part of the experiments conducted during the development of this work.

In [33]:
# Note that the 'base_path' tells you where to get the results files. 
# In this case, we are getting the files regarding our experiments.
base_path = "results/"

#loading results
pk_path = f"{base_path}LOLS - pk - n_init 100 - 2019-06-12.csv"

vc2c_path = f"{base_path}LOLS - vc2c - n_init 100 - 2019-06-12.csv"
vc3c_path = f"{base_path}LOLS - vc3c - n_init 100 - 2019-06-12.csv"

wf2f_path  = f"{base_path}LOLS - wf2f - n_init 100 - 2019-06-12.csv"
wf4f_path  = f"{base_path}LOLS - wf4f - n_init 100 - 2019-06-12.csv"
wf24f_path = f"{base_path}LOLS - wf24f - n_init 100 - 2019-06-12.csv"

In [34]:
results = {'GOLS': {}, 'LOLS': {}}

results['GOLS'] = results_gols

# sep=r'\s+' splits on any whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas versions
results['LOLS']['pk']   = pd.read_csv(pk_path, sep=r'\s+')

results['LOLS']['vc2c'] = pd.read_csv(vc2c_path, sep=r'\s+')
results['LOLS']['vc3c'] = pd.read_csv(vc3c_path, sep=r'\s+')

results['LOLS']['wf2f']  = pd.read_csv(wf2f_path, sep=r'\s+')
results['LOLS']['wf4f']  = pd.read_csv(wf4f_path, sep=r'\s+')
results['LOLS']['wf24f'] = pd.read_csv(wf24f_path, sep=r'\s+')

FileNotFoundError: [Errno 2] No such file or directory: 'results/LOLS - pk - n_init 100 - 2019-06-12.csv'

In [None]:
import plotly.offline as py
import plotly.graph_objs as go

from devcode.utils.visualization import render_boxplot
# py.init_notebook_mode(connected=True) # enabling plot within jupyter notebook

for dataset_name in datasets:
    ks = results['LOLS'][dataset_name].columns.tolist()
    ks.insert(0,"k=0")
    
    render_boxplot(results, dataset_name, ks)

We can see that:

* On the Vertebral Column dataset, local modeling reduced accuracy, which suggests that the problem is simple enough to be solved with a single global linear classifier;
* On the Wall-following dataset, local modeling improved accuracy, and the gain grew with the number of features. This is evidence that the classification problem has a non-linear decision boundary and that a combination of local linear classifiers can approximate it;
* On the Parkinson dataset, accuracy improved slightly, so the local linear classifiers performed better than the global linear classifier.
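
The same comparison can be made numerically instead of by reading box plots. The sketch below assumes result tables shaped like the ones in this notebook (one row per run, one column per value of k, with the global model labeled `k=0`). The accuracies are random placeholders, not experimental results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder accuracies (random numbers, NOT experimental results)
gols_acc = pd.Series(rng.uniform(0.6, 0.7, size=5), name="k=0")
lols_acc = pd.DataFrame(rng.uniform(0.6, 0.9, size=(5, 3)),
                        columns=["k=2", "k=10", "k=50"])

# One column per model, one row per independent run
table = pd.concat([gols_acc, lols_acc], axis=1)

# Per-model summary statistics, one row per model
summary = table.agg(["mean", "std", "median"]).T
best_k = summary["mean"].idxmax()

print(summary)
print("Highest mean accuracy:", best_k)
```

A difference in means is only indicative; the spread across runs (the `std` column, or the boxes in the plots) shows whether the difference is larger than the run-to-run variation.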