<center><h1>Classification by local modeling</h1></center>

### Summary:
1. [Introduction](#introduction)

2. [Local Ordinary Least Squares (L-OLS)](#lols)
    
    2.1. [Influence of the number of clusters on model accuracy](#lols-#-clusters)
    
3. [Local Least Squares Support Vector Machine (L-LSSVM)](#l_lssvm)

### 1. Introduction <a class="anchor" id="introduction"></a>

In its classic form, classification by local modeling trains in two steps:

1. An unsupervised clustering algorithm is run to find regions in the dataset;
2. For each region, a model is built with the respective data partition.

For inference the procedure is similar:

1. A similarity metric is used to determine the region of the new data point, e.g. the Euclidean distance to the region prototypes;
2. The model from that specific region is used to predict the class of the new data point.

Many clustering algorithms are available but, for the sake of simplicity, only K-means is used here.
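The two training steps and the two inference steps above can be sketched as follows. This is a minimal illustration, assuming one-hot encoded labels and scikit-learn estimators; `LocalModelSketch` and the toy data are hypothetical and are not the `LocalModel` class used later in this notebook, whose implementation is not shown.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


class LocalModelSketch:
    """Illustrative stand-in; NOT the notebook's LocalModel implementation."""

    def __init__(self, cluster_alg, model_alg):
        self.cluster_alg = cluster_alg  # unsupervised clustering, e.g. KMeans
        self.model_alg = model_alg      # template for the per-region model
        self.models_ = {}

    def fit(self, X, Y):
        # Training step 1: partition the training set into regions.
        regions = self.cluster_alg.fit_predict(X)
        # Training step 2: fit one model per region on its own data partition.
        for r in np.unique(regions):
            mask = regions == r
            self.models_[int(r)] = clone(self.model_alg).fit(X[mask], Y[mask])
        return self

    def predict(self, X):
        # Inference step 1: KMeans.predict assigns each point to its nearest
        # centroid, i.e. Euclidean distance to the region prototypes.
        regions = self.cluster_alg.predict(X)
        # Inference step 2: query the model that owns each point's region and
        # take the class with the largest output.
        classes = np.empty(len(X), dtype=int)
        for r in np.unique(regions):
            mask = regions == r
            scores = self.models_[int(r)].predict(X[mask])
            classes[mask] = scores.argmax(axis=1)
        return classes


# Toy data (assumption): three well-separated Gaussian blobs, one class each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 50)
X = centers[labels] + rng.normal(scale=0.3, size=(150, 2))
Y = np.eye(3)[labels]  # one-hot targets

lm = LocalModelSketch(KMeans(n_clusters=3, n_init=10, random_state=0),
                      LinearRegression())
lm.fit(X, Y)
pred_centers = lm.predict(centers)            # class predicted at each blob center
train_acc = np.mean(lm.predict(X) == labels)  # fraction of correct training predictions
print(pred_centers, train_acc)
```

Passing the clustering algorithm and the model template as constructor arguments keeps the sketch agnostic to both choices, which mirrors how `LocalModel(ClusterAlg=..., ModelAlg=...)` is called later on.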

### 2. Local Ordinary Least Squares (L-OLS) <a class="anchor" id="lols"></a>

### 2.1. Influence of the number of clusters on model accuracy <a class="anchor" id="lols-#-clusters"></a>

As a baseline, we use a global model: a single linear model (ordinary least squares) fitted on the whole training set.
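A global linear classifier of this kind can be sketched as below: a least-squares regression is fitted to one-hot targets and each prediction is decoded back to a class. The toy data and variable names are assumptions. Decoding with `argmax` and computing accuracy as the trace of the confusion matrix over its sum stand in for the notebook's `dummie2multilabel` and `cm2acc` helpers, whose implementations are not shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Toy two-class data (assumption): one Gaussian blob per class.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.repeat(np.arange(2), 100)
X = centers[labels] + rng.normal(scale=0.7, size=(200, 2))
Y = np.eye(2)[labels]  # one-hot ("dummy") encoding of the class labels

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# One linear model for the whole input space, regressed on the one-hot targets.
ols = LinearRegression().fit(X_train, Y_train)
scores = ols.predict(X_test)

# Decode one-hot rows and regression scores to class indices.
y_true = Y_test.argmax(axis=1)
y_pred = scores.argmax(axis=1)

# Accuracy = correctly classified samples (diagonal) over all samples.
cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```

The cell below decodes by clipping the scores to [0, 1] and rounding instead. `argmax` is used in this sketch because it always selects exactly one class per sample.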

In [3]:
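import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Project-local module that provides the `datasets` dict.
# NOTE: scale_feat, dummie2multilabel, cm2acc and LocalModel are also project
# helpers used below; the modules they come from are not shown in this notebook.
from load_datasets import datasets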

base_path = "C:/Users/Centaurinho/Desktop/Fonteles/Repository/regional-classifiers/results/"

# Hyper-parameters:
n_init    = 5 # number of independent runs
test_size = 0.2 # test size of 20%
scaleType = 'min-max' # type of feature scaling

results_df = {'GOLS': {}, 'LOLS': {}}

ModuleNotFoundError: No module named 'load_datasets'

In [4]:
%%time
# Applying Global Ordinary Least Squares
header = [dataset_name for dataset_name in datasets.keys()]
results = np.zeros((n_init, len(header)))

count=0 # counting datasets
for dataset_name in datasets:
    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values
    
    acc = [0]*n_init
    for i in range(n_init):
        # Train/Test split
        X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size)
        
        # scaling features
        X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scaleType)
        
        linearModel = linear_model.LinearRegression().fit(X_tr_norm,y_train)
                   
        # Evaluating in the test dataset
        y_pred = linearModel.predict(X_ts_norm)
        y_pred = np.round(np.clip(y_pred, 0, 1)) # rounding prediction numbers

        cm = confusion_matrix(dummie2multilabel(y_test),
                              dummie2multilabel(y_pred))
        acc[i] = cm2acc(cm)
            
    results[:,count] = acc
    count+=1
    
results_gols = pd.DataFrame(results, columns=header)
# Wall time: 3.21 s

NameError: name 'datasets' is not defined

In [4]:
%%time
import datetime

# Applying Local Ordinary Least Squares

n_ks = 3 # number of clusters to evaluate for each dataset
# for each dataset
for dataset_name in datasets:
    print("Starting '{}' at {}".format(dataset_name,datetime.datetime.now()))
    
    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values
    
    max_k = int(0.8*len(X)*(1-test_size)) # k_max = 80% of the number of samples on the train set
    ks = np.linspace(2, max_k, num=n_ks, dtype='int')
    print("ks = {}".format(ks))
    header = ["k={}".format(i) for i in ks] # header with the number of clusters
    
    results = np.zeros((n_init, len(ks)))
    for j in range(len(ks)): # for each value of k
        print(ks[j])
        for i in range(n_init): # run n_init independent train/test split and evaluation
            # Train/Test split
            X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size)

            # scaling features
            X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scaleType)
    
            # creating clustering and model algorithm
            kmeans = KMeans(n_clusters=ks[j], n_init=10)
            linReg = LinearRegression()
            
            # creating and fitting local model
            lm = LocalModel(ClusterAlg=kmeans, ModelAlg=linReg)
            lm.fit(X_tr_norm, y_train)

            # evaluating accuracy on the test set
            y_pred_ts = lm.predict(X_ts_norm, rounded=True)
            cm_ts = confusion_matrix(dummie2multilabel(y_test),
                                     dummie2multilabel(y_pred_ts))

            acc_ts = cm2acc(cm_ts)
            results[i,j] = acc_ts

        
    results_df = pd.DataFrame(results, columns=header)
    filename = f"{base_path}LOLS - {dataset_name} - n_init {n_init}.csv"
    results_df.to_csv(filename, sep='\t') # saving results as a tab-separated file
    print("{} done!".format(dataset_name))
    print(" ")
    
# CPU times: user 55min 5s, sys: 18min 22s, total: 1h 13min 28s
# Wall time: 12h 57min 42s

Starting 'vc2c' at 2022-08-02 16:44:21.253953
ks = [  2 100 198]
2
100
198


NameError: name 'base_path' is not defined

### 2.2. Processing results

In [5]:
#loading results
pk_path = f"{base_path}LOLS - pk - n_init 100 - 2019-06-12.csv"

vc2c_path = f"{base_path}LOLS - vc2c - n_init 100 - 2019-06-12.csv"
vc3c_path = f"{base_path}LOLS - vc3c - n_init 100 - 2019-06-12.csv"

wf2f_path  = f"{base_path}LOLS - wf2f - n_init 100 - 2019-06-12.csv"
wf4f_path  = f"{base_path}LOLS - wf4f - n_init 100 - 2019-06-12.csv"
wf24f_path = f"{base_path}LOLS - wf24f - n_init 100 - 2019-06-12.csv"

results = {'GOLS': {}, 'LOLS': {}}

results['GOLS'] = results_gols

# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas
results['LOLS']['pk']   = pd.read_csv(pk_path, sep=r'\s+')

results['LOLS']['vc2c'] = pd.read_csv(vc2c_path, sep=r'\s+')
results['LOLS']['vc3c'] = pd.read_csv(vc3c_path, sep=r'\s+')

results['LOLS']['wf2f']  = pd.read_csv(wf2f_path, sep=r'\s+')
results['LOLS']['wf4f']  = pd.read_csv(wf4f_path, sep=r'\s+')
results['LOLS']['wf24f'] = pd.read_csv(wf24f_path, sep=r'\s+')

FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Centaurinho/Desktop/Fonteles/Repository/regional-classifiers/results/LOLS - pk - n_init 100 - 2019-06-12.csv'

In [None]:
import plotly.offline as py
import plotly.graph_objs as go
# py.init_notebook_mode(connected=True) # enabling plot within jupyter notebook

for dataset_name in datasets:
    # keep only the "k=<n>" columns (skips any index column stored in the file)
    ks = [c for c in results['LOLS'][dataset_name].columns if str(c).startswith("k=")]
    
    # first box: global model (GOLS), labelled as 0 clusters
    data = [go.Box(
        y=results['GOLS'][dataset_name].values,
        name = "0",
        marker = dict(color = '#2980b9')
    )]
    # one box per number of clusters (LOLS)
    for k in ks:
        trace = go.Box(
            y=results['LOLS'][dataset_name][k].values,
            name = k[2:],
            marker = dict(color = '#2980b9')
        )
        data.append(trace)

    layout = go.Layout(
        title = "Accuracy vs number of clusters [{}]".format(dataset_name),
        showlegend=False,
        yaxis=dict(title="Accuracy on the test set"),
        xaxis=dict(title="Number of clusters")
    )

    fig = go.Figure(data=data,layout=layout)
    py.iplot(fig)

We can see that:

* On the Vertebral Column datasets, accuracy dropped with local modeling, which suggests that the problem is
simple enough to be solved by a single global linear classifier;
* On the Wall-following datasets, accuracy improved, and the gain grew with the number of features.
This is evidence that the classification problem has a non-linear decision boundary and that local modeling can
approximate this non-linearity with a combination of local linear classifiers;
* On the Parkinson dataset, accuracy improved slightly, so the local linear classifiers did somewhat better
than the global linear classifier.
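The idea that several local linear classifiers can approximate a non-linear decision boundary can be checked on a toy problem. The sketch below is an illustration only, not a reproduction of the experiments above: it uses hypothetical XOR-like data, where no single linear boundary separates the classes, and compares a global least-squares classifier (`k = 1`) with local ones built on K-means regions. The function name `local_ols_accuracy` is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# XOR-like toy data (assumption): four blobs, opposite corners share a class.
rng = np.random.default_rng(0)
corners = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
corner_class = np.array([0, 0, 1, 1])
blob = np.repeat(np.arange(4), 100)
X = corners[blob] + rng.normal(scale=0.08, size=(400, 2))
Y = np.eye(2)[corner_class[blob]]  # one-hot targets

X_tr, X_ts, Y_tr, Y_ts = train_test_split(X, Y, test_size=0.2, random_state=0)


def local_ols_accuracy(k):
    """Test accuracy of k local least-squares classifiers (k=1 is the global model)."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_tr)
    regions_tr = kmeans.labels_        # region of each training point
    regions_ts = kmeans.predict(X_ts)  # nearest prototype for each test point
    y_pred = np.empty(len(X_ts), dtype=int)
    for r in range(k):
        in_tr = regions_tr == r
        model = LinearRegression().fit(X_tr[in_tr], Y_tr[in_tr])
        in_ts = regions_ts == r
        if in_ts.any():
            y_pred[in_ts] = model.predict(X_ts[in_ts]).argmax(axis=1)
    return np.mean(y_pred == Y_ts.argmax(axis=1))


accuracy_by_k = {k: local_ols_accuracy(k) for k in (1, 2, 4)}
print(accuracy_by_k)
```

On data like this, each region contains blobs that a single line can separate, so the local models can succeed where the global one cannot. Whether the same holds on a real dataset depends on how well the clusters align with the class structure.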

### 3. Local Least Squares Support Vector Machine (L-LSSVM) <a class="anchor" id="l_lssvm"></a>

Example of `LocalModel` class running with LSSVM:

In [None]:
# %load_ext autoreload

In [None]:
%%time
# %autoreload

import numpy as np
from sklearn.cluster import KMeans
from lssvm import LSSVM
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# small_datasets =['vc2c', 'vc3c', 'pk']

for dataset_name in datasets:
    print(dataset_name)

    X = datasets[dataset_name]['features'].values
    Y = datasets[dataset_name]['labels'].values

    X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)  # Train/Test split = 80%/20%
    X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType='min-max') # scaling features
    
    k_values = np.linspace(2, np.ceil(np.sqrt(len(X_train))), num=5, dtype='int').tolist() # 2 to sqrt(N)
    
    for n_clusters in k_values:
        print("# of clusters: {}".format(n_clusters))
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)

        clf_dict = {
            'linear': LSSVM(gamma=1, kernel='linear'),
            'poly'  : LSSVM(gamma=1, kernel='poly', d=2),
            'rbf'   : LSSVM(gamma=1, kernel='rbf', sigma=1)
        }

        for kernel_type, clf in clf_dict.items():
            print('kernel: {}'.format(kernel_type))
            lm = LocalModel(ClusterAlg=kmeans, ModelAlg=clf)
            lm.fit(X_tr_norm, y_train, verboses=0)

            y_pred_tr = lm.predict(X_tr_norm)
            y_pred_ts = lm.predict(X_ts_norm)

            cm_tr = confusion_matrix(dummie2multilabel(y_train),
                                     dummie2multilabel(y_pred_tr))
            cm_ts = confusion_matrix(dummie2multilabel(y_test),
                                     dummie2multilabel(y_pred_ts))

            acc_tr = cm2acc(cm_tr)
            acc_ts = cm2acc(cm_ts)

            print("Train accuracy: {}\nTest accuracy:  {}\n".format(acc_tr, acc_ts))

    print('\n')
    print('#'*60)
    print('\n'*2)
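For readers unfamiliar with the base model used in the cell above, the sketch below shows the standard binary LS-SVM classifier in plain NumPy. Training reduces to solving one linear system for the bias `b` and the multipliers `alpha`. This is not the `lssvm.LSSVM` class imported in the notebook, whose internals are not shown. Reading `gamma` as the regularization constant and `sigma` as the RBF width is an assumption based on the usual LS-SVM notation, and the function names, the toy data and the values `gamma=10`, `sigma=0.5` are illustrative.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2))."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2 * sigma**2))


def lssvm_fit(X, y, gamma=1.0, sigma=1.0):
    """Solve the LS-SVM system [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]
    for labels y in {-1, +1}, with Omega[i, j] = y_i y_j K(x_i, x_j)."""
    n = len(X)
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]  # bias b, multipliers alpha


def lssvm_predict(X_new, X, y, b, alpha, sigma=1.0):
    """Decision rule sign(sum_i alpha_i y_i K(x, x_i) + b)."""
    return np.sign(rbf_kernel(X_new, X, sigma) @ (alpha * y) + b)


def make_xor(n_per_blob, rng):
    """XOR-like toy data (assumption): opposite corners share a label."""
    corners = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
    signs = np.array([-1.0, -1.0, 1.0, 1.0])
    blob = np.repeat(np.arange(4), n_per_blob)
    X = corners[blob] + rng.normal(scale=0.1, size=(len(blob), 2))
    return X, signs[blob]


rng = np.random.default_rng(0)
X_train, y_train = make_xor(50, rng)
X_test, y_test = make_xor(25, rng)

b, alpha = lssvm_fit(X_train, y_train, gamma=10.0, sigma=0.5)
y_pred = lssvm_predict(X_test, X_train, y_train, b, alpha, sigma=0.5)
test_acc = np.mean(y_pred == y_test)
print(test_acc)
```

Because every training point receives a non-zero multiplier, the linear system grows with the number of samples. Fitting one LS-SVM per K-means region, as the local model does, keeps each system small.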