<center><h1>Classification by local modeling</h1></center>

### Summary:
1. [Introduction](#introduction)

2. [Local Ordinary Least Squares (L-OLS)](#lols)
    
    2.1. [Influence of the number of clusters on model accuracy](#lols-#-clusters)
    
3. [Local Least Squares Support Vector Machine (L-LSSVM)](#l_lssvm)

### 1. Introduction <a class="anchor" id="introduction"></a>

In its classic form, classification by local modeling trains a model in two steps:

1. An unsupervised clustering algorithm is run to find regions in the dataset;
2. For each region, a model is built with the respective data partition.

For inference the procedure is similar:

1. A similarity metric, e.g. the Euclidean distance to the region prototypes, determines the region of the new data point;
2. The model from that specific region is used to predict the class of the new data point.

Many clustering algorithms could be used but, for the sake of simplicity, only K-means is used here.
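The training and inference steps above can be sketched as follows. This is a minimal illustration, not the `LocalModel` class from `devcode`: the class name `LocalLinearSketch`, the one-hot targets with an argmax decision, and the synthetic XOR-like data are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


class LocalLinearSketch:
    """Hypothetical K-means + per-region linear model (not devcode's LocalModel)."""

    def __init__(self, n_clusters, random_state=0):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10,
                             random_state=random_state)
        self.models = {}

    def fit(self, X, Y):
        # Training step 1: partition the training set into regions
        regions = self.kmeans.fit_predict(X)
        # Training step 2: fit one linear model per region on one-hot targets
        for r in np.unique(regions):
            mask = regions == r
            self.models[r] = LinearRegression().fit(X[mask], Y[mask])
        return self

    def predict(self, X):
        # Inference step 1: nearest prototype (Euclidean distance to centroids)
        regions = self.kmeans.predict(X)
        # Inference step 2: class = argmax of the regional model's output
        y_pred = np.empty(len(X), dtype=int)
        for r in np.unique(regions):
            mask = regions == r
            y_pred[mask] = self.models[r].predict(X[mask]).argmax(axis=1)
        return y_pred


# Synthetic XOR-like data (assumption): four blobs, opposite corners share a class
rng = np.random.default_rng(0)
centers = np.array([[-2, -2], [2, 2], [-2, 2], [2, -2]])
corner_labels = np.array([0, 0, 1, 1])
idx = rng.integers(0, 4, size=400)
X = centers[idx] + 0.3 * rng.standard_normal((400, 2))
y = corner_labels[idx]
Y = np.eye(2)[y]  # one-hot encoding of the class labels

model = LocalLinearSketch(n_clusters=4).fit(X, Y)
acc = (model.predict(X) == y).mean()
print(f"training accuracy: {acc:.3f}")
```

No single linear boundary separates the two classes in this toy problem, but each region contains mostly one class, so a linear model per region is enough.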

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from load_dataset import datasets

from devcode.utils import scale_feat, dummie2multilabel, cm2acc
from devcode import run_simulation, run_round
from devcode.models.lssvm import LSSVM
from devcode.models.local_learning import LocalModel

Dataset:  Features.shape:   # of classes:
vc2c      (310, 6)          2
vc3c      (310, 6)          3
wf24f     (5456, 24)        4
wf4f      (5456, 4)         4
wf2f      (5456, 2)         4
pk        (195, 22)         2


In [3]:
import numpy as np
import pandas as pd
from load_dataset import datasets

base_path = "C:/Users/Centaurinho/Desktop/Fonteles/Repository/regional-classifiers/results/"

# Hyper-parameters:
n_init    = 5 # number of independent runs
test_size = 0.2 # test size of 20%
scaleType = 'min-max' # type of feature scaling

results_df = {'GOLS': {}, 'LOLS': {}}

### 2.1. Global model training

As a baseline we use a global model, in this case a single linear model fitted on the whole training set.
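One common way to use ordinary least squares as a classifier is to regress one-hot class indicators on the features and predict the class with the largest output. The sketch below assumes that scheme, which may differ from what `run_round` does internally, and uses the Iris dataset bundled with scikit-learn as a stand-in for the datasets of this notebook. The 20% test split and min-max scaling mirror the hyper-parameters defined above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in dataset (assumption): Iris, 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)
Y = np.eye(3)[y]  # one-hot targets, one column per class

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# Min-max scaling fitted on the training set only
scaler = MinMaxScaler().fit(X_train)
X_tr_norm = scaler.transform(X_train)
X_ts_norm = scaler.transform(X_test)

# Global OLS: a single linear model for the whole feature space
ols = LinearRegression().fit(X_tr_norm, Y_train)

# Predicted class = index of the largest regression output
y_pred_tr = ols.predict(X_tr_norm).argmax(axis=1)
y_pred_ts = ols.predict(X_ts_norm).argmax(axis=1)

acc_tr = (y_pred_tr == Y_train.argmax(axis=1)).mean()
acc_ts = (y_pred_ts == Y_test.argmax(axis=1)).mean()
print(f"Train accuracy: {acc_tr:.3f}")
print(f"Test accuracy:  {acc_ts:.3f}")
```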

In [7]:
%%time
# Applying Global Ordinary Least Squares
header = list(datasets.keys())
results = np.zeros((n_init, len(header)))

for count, dataset_name in enumerate(datasets): # count indexes the dataset column
    X = datasets[dataset_name]['features'].values
    y = datasets[dataset_name]['labels'].values
    
    acc = [0]*n_init
    for i in range(n_init): # n_init independent train/test rounds
        acc_tr, acc_ts = run_round(X, y, test_size, LinearRegression, {})
        acc[i] = acc_tr # NOTE: the training accuracy is stored; acc_ts is not kept
            
    results[:,count] = acc
    
results_gols = pd.DataFrame(results, columns=header)
results_gols.head()
# Wall time: 3.21 s

Train accuracy: 0.8467741935483871
Test accuracy:  0.8387096774193549

Train accuracy: 0.8387096774193549
Test accuracy:  0.8387096774193549

Train accuracy: 0.842741935483871
Test accuracy:  0.8548387096774194

Train accuracy: 0.8225806451612904
Test accuracy:  0.9032258064516129

Train accuracy: 0.8346774193548387
Test accuracy:  0.8870967741935484

Train accuracy: 0.8064516129032258
Test accuracy:  0.6612903225806451

Train accuracy: 0.7943548387096774
Test accuracy:  0.7580645161290323

Train accuracy: 0.8104838709677419
Test accuracy:  0.6774193548387096

Train accuracy: 0.7903225806451613
Test accuracy:  0.7741935483870968

Train accuracy: 0.8024193548387096
Test accuracy:  0.8225806451612904

Train accuracy: 0.6452795600366636
Test accuracy:  0.6410256410256411

Train accuracy: 0.6404674610449129
Test accuracy:  0.6391941391941391

Train accuracy: 0.648487626031164
Test accuracy:  0.6144688644688645

Train accuracy: 0.6370302474793768
Test accuracy:  0.6428571428571429


run  vc2c      vc3c      wf24f     wf4f      wf2f      pk
0    0.846774  0.806452  0.64528   0.72319   0.723648  0.884615
1    0.83871   0.794355  0.640467  0.722044  0.725023  0.897436
2    0.842742  0.810484  0.648488  0.725023  0.717461  0.929487
3    0.822581  0.790323  0.63703   0.727544  0.717003  0.923077
4    0.834677  0.802419  0.641613  0.72319   0.719065  0.910256


In [None]:
%%time
import datetime

# Applying Local Ordinary Least Squares

n_ks = 3 # number of clusters to evaluate for each dataset
# for each dataset
for dataset_name in datasets:
    print("Starting '{}' at {}".format(dataset_name,datetime.datetime.now()))
    
    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values
    
    max_k = int(0.8*len(X)*(1-test_size)) # k_max = 80% of the number of samples on the train set
    ks = np.linspace(2, max_k, num=n_ks, dtype='int')
    print("ks = {}".format(ks))
    header = ["k={}".format(i) for i in ks] # header with the number of clusters
    
    results = np.zeros((n_init, len(ks)))
    for j in range(len(ks)): # for each value of k
        print(ks[j])
        for i in range(n_init): # run n_init independent train/test split and evaluation
            # Train/Test split
            X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size)

            # scaling features
            X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scaleType)
    
            # creating clustering and model algorithm
            kmeans = KMeans(n_clusters=ks[j], n_init=10)
            linReg = LinearRegression()
            
            # creating and fitting local model
            lm = LocalModel(ClusterAlg=kmeans, ModelAlg=linReg)
            lm.fit(X_tr_norm, y_train)

            # evaluating accuracy on the test set
            y_pred_ts = lm.predict(X_ts_norm, rounded=True)
            cm_ts = confusion_matrix(dummie2multilabel(y_test),
                                     dummie2multilabel(y_pred_ts))

            acc_ts = cm2acc(cm_ts)
            results[i,j] = acc_ts

        
    results_df = pd.DataFrame(results, columns=header)
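    # NOTE: no extension or date suffix is added here, unlike the file names loaded in the next section
    filename = f"{base_path}LOLS - {dataset_name} - n_init {n_init}"
    results_df.to_csv(filename, sep='\t') # saving results as a tab-separated file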
    print("{} done!".format(dataset_name))
    print(" ")
    
# CPU times: user 55min 5s, sys: 18min 22s, total: 1h 13min 28s
# Wall time: 12h 57min 42s
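To make the grid of cluster counts concrete, the sketch below reproduces the `max_k` and `ks` computation from the cell above for three of the dataset sizes listed earlier. The helper name `candidate_ks` is hypothetical and introduced only for this illustration.

```python
import numpy as np

test_size = 0.2  # same hyper-parameters as in the notebook
n_ks = 3


def candidate_ks(n_samples, test_size=test_size, n_ks=n_ks):
    """Hypothetical helper: evenly spaced values of k from 2 to max_k."""
    # max_k = 80% of the number of samples in the training set
    max_k = int(0.8 * n_samples * (1 - test_size))
    return np.linspace(2, max_k, num=n_ks, dtype='int')


# Dataset sizes taken from the table of datasets above
for name, n_samples in [("vc2c", 310), ("wf24f", 5456), ("pk", 195)]:
    print(name, candidate_ks(n_samples))
```

For `vc2c`, the training set has 248 samples and the grid is k = 2, 100, 198. At the largest value each cluster holds, on average, little more than one training sample, so the local models at that end of the grid are fitted on very little data.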

### 2.2. Processing results

In [8]:
#loading results
pk_path = f"{base_path}LOLS - pk - n_init 100 - 2019-06-12.csv"

vc2c_path = f"{base_path}LOLS - vc2c - n_init 100 - 2019-06-12.csv"
vc3c_path = f"{base_path}LOLS - vc3c - n_init 100 - 2019-06-12.csv"

wf2f_path  = f"{base_path}LOLS - wf2f - n_init 100 - 2019-06-12.csv"
wf4f_path  = f"{base_path}LOLS - wf4f - n_init 100 - 2019-06-12.csv"
wf24f_path = f"{base_path}LOLS - wf24f - n_init 100 - 2019-06-12.csv"

results = {'GOLS': {}, 'LOLS': {}}

results['GOLS'] = results_gols

# sep=r'\s+' splits on any whitespace; it replaces the deprecated delim_whitespace=True
results['LOLS']['pk']   = pd.read_csv(pk_path, sep=r'\s+')

results['LOLS']['vc2c'] = pd.read_csv(vc2c_path, sep=r'\s+')
results['LOLS']['vc3c'] = pd.read_csv(vc3c_path, sep=r'\s+')

results['LOLS']['wf2f']  = pd.read_csv(wf2f_path, sep=r'\s+')
results['LOLS']['wf4f']  = pd.read_csv(wf4f_path, sep=r'\s+')
results['LOLS']['wf24f'] = pd.read_csv(wf24f_path, sep=r'\s+')

In [11]:
import plotly.offline as py
import plotly.graph_objs as go

from devcode.utils.visualization import render_boxplot
# py.init_notebook_mode(connected=True) # enabling plot within jupyter notebook

for dataset_name in datasets:
    ks = results['LOLS'][dataset_name].columns.tolist()
    ks.insert(0,"k=0")
    
    render_boxplot(results, dataset_name, ks)

We can see that:

* On the Vertebral Column dataset, accuracy dropped with local modeling, which suggests that the problem is simple enough to be solved by a single global linear classifier;
* On the Wall-following dataset, accuracy improved, and the gain grew with the number of features. This is evidence that the classification problem has a non-linear decision boundary, which local modeling approximates with a combination of local linear classifiers;
* On the Parkinson dataset, accuracy improved slightly, so the local linear classifiers performed somewhat better than the global linear classifier.