# Machine Learning - Mini Project II

### Learning Algorithm: Classifier

Created on March 10, 2019 by Diogo Cosin <d.ayresdeoliveira@jacobs-university.de> and Ralph Florent <r.florent@jacobs-university.de>.

### Description
Train a classifier for the Digits dataset by implementing a full processing pipeline from feature extraction to (linear) classifier training, attempting to squeeze performance out of the classifier using cross-validation and regularization techniques.

### Summary
The script below is intended to... 

WIP

Note: The algorithm is tested on the OCR datasets from `DigitsBasicsRoutine.zip`, provided by Professor Dr. H. Jaeger, Machine Learning Professor at [Jacobs University Bremen](https://www.jacobs-university.de).

In [21]:
""" Learning Algorithm: Classifier """

# -*- coding: utf-8 -*-
# 
# Created on April 01, 2019
# Authors: 
#        Diogo Cosin <d.ayresdeoliveira@jacobs-university.de>,
#        Ralph Florent <r.florent@jacobs-university.de>


# Import relevant libraries
import sys
import matplotlib.pyplot as plt
import numpy as np

sys.path.append('./assets/')
from miniprojectone import get_data_points, k_means, get_codebooks


# START: Helper functions
def load_data(one_hot_encoding=None):
    """ Load the data and, if requested, append class labels to each row.
        
        Parameters
        ----------
        one_hot_encoding: bool, None
            None returns the unlabeled data; False appends numbered class
            labels; True appends one-hot encoded class labels
            
        Returns
        -------
        dataframe: array-like (n_samples, m_features)
    """
    # load data frame with no class labels defined
    dataframe = get_data_points()
    if one_hot_encoding is None:
        return dataframe
    return inject_label(dataframe, one_hot_encoding)

def one_hot_encode(ith, k=10):
    """ One-hot encode a class label as a k-dimensional vector
        
        This function is a lightweight one-hot encoder adapted to our
        specific needs.
        
        Parameters
        ----------
        ith: int
            1-based position of the one-of-K discrete label
        
        k: int
            number of classes, i.e. the length of the encoding vector
            
        Returns
        -------
        encoded: array
            vector of zeros with a one in the ith position
    """
    if ith > k:
        ith = k # avoid out of bound exception
        
    index = ith - 1
    encoded = np.zeros(k)
    encoded[index] = 1
    return encoded


def inject_label(dataset, one_hot_encoding=False):
    """ Append class labels, numbered or one-hot encoded, to each row.
    
        Assumes the rows are ordered by digit in 10 consecutive blocks;
        digit "0" is labeled as class 10.
    """
    dataframe, k_class = [], 10
    digits = np.array_split(dataset, k_class) # split into 10 arrays
    for i in range(k_class):
        digit =  digits[i] # n-obs x k-dim
        if i == 0: 
            i = 10 # define "0" as class 10
        encoded = one_hot_encode(i) if one_hot_encoding else i
        
        for point in digit:
            dataframe.append(np.insert(point, len(point), encoded))
            
    return np.array(dataframe) 


def split_data(dataframe):
    """ Split data frame into training and testing data
    
    Parameters
    ----------
    dataframe: array-like (n_samples, m_features)
        2000 digit patterns x 240 features, optionally followed by class
        labels, ordered by digit

    Returns
    -------
    splits: list
        [training data, testing data]; for each digit, the first 100
        patterns go to training and the remaining ones to testing
    """
    digits = np.array_split(dataframe, 10)
    
    train_data, test_data = [], []

    for digit in digits:
        train_data.extend(digit[:100])
        test_data.extend(digit[100:])
    
    return [np.array(train_data), np.array(test_data)]


def select_features(dataset, K=1):
    """ Represent each point by its Euclidean distances to the k-means
    codebook vectors (centroids), one feature per codebook vector.
    """
    clusters = k_means(dataset, K)
    codebooks = get_codebooks(clusters)
    features = []
    # transform xi -> fi, reducing dimensionality from n to k features
    for point in dataset:
        # compute the distance between the data point and each centroid
        feature = [np.linalg.norm(point - c) for c in codebooks]
        features.append(np.array(feature))
    
    return np.array(features)

def linear_regressor(alpha=0):
    weights = []
    return weights

# END: Helper functions

### Loading training and testing data
The `load_data` function frames the data with numbered or one-hot encoded class labels to build an in-memory data frame.

Given this data frame, let's load the first 100 patterns of each digit as training data and the remaining 100 as testing data.

In [22]:
dataframe = load_data(one_hot_encoding=True)

# Distribute dataframe into training and testing data
training_data, testing_data = split_data(dataframe)

print(dataframe.shape, training_data.shape, testing_data.shape)

(2000, 250) (1000, 250) (1000, 250)
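
The 250 columns are the 240 raw features followed by a 10-dimensional one-hot label. Because digit "0" is stored as class 10, its one sits in the last position. The sketch below illustrates this row layout on a dummy pattern; `encode_digit` is a hypothetical standalone helper written for this illustration, not a function from the notebook.

```python
import numpy as np

def encode_digit(digit, k=10):
    """One-hot encode a digit, treating 0 as class 10 (last position)."""
    label = 10 if digit == 0 else digit  # same convention as inject_label
    encoded = np.zeros(k)
    encoded[label - 1] = 1
    return encoded

# A dummy 240-feature pattern followed by its label, mimicking one row
pattern = np.zeros(240)
row = np.append(pattern, encode_digit(0))

print(row.shape)             # (250,)
print(np.argmax(row[240:]))  # 9, i.e. the last label column
```

Reading a label back is then a matter of taking `argmax` over the last 10 columns and mapping index 9 to digit "0".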


### Feature selection

Let's use the K-means clustering algorithm to select relevant features in the training data.

In [23]:

        
raw_dataframe = load_data()
raw_training_data, raw_testing_data = split_data(raw_dataframe)

selected_features = select_features(raw_training_data, 20)
print(raw_dataframe.shape, selected_features.shape)

(2000, 240) (1000, 19)
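
The mapping done in `select_features` can be illustrated without the course-provided `k_means`: given any set of centroids, each point is replaced by its vector of Euclidean distances to them. The sketch below uses toy 2-D data with hand-picked centroids (an assumption for illustration, they are not computed by K-means), and `distance_features` is a hypothetical helper name.

```python
import numpy as np

def distance_features(points, centroids):
    """Return an (n_points, n_centroids) matrix of Euclidean distances."""
    # Broadcasting: (n, 1, d) - (1, c, d) -> (n, c, d)
    diffs = points[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diffs, axis=2)

# Toy data: three 2-D points and two hand-picked centroids
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centroids = np.array([[0.0, 0.0], [3.0, 4.0]])

features = distance_features(points, centroids)
print(features.shape)  # (3, 2)
print(features)        # rows: [0, 5], [5, 0], [10, 5]
```

The vectorized form gives the same values as the per-point loop in `select_features`, with one output column per centroid.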


In [24]:
features_labels = inject_label(selected_features, one_hot_encoding=True)
features_labels.shape

(1000, 29)

### Classifier: Linear Regression

Let's compute the linear regression weights, appending a trailing 1 to each feature vector as the bias term.
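
Since `linear_regressor` is still a stub, here is one possible way to fill it in, not necessarily the approach this project will take: closed-form ridge regression, W = (XᵀX + αI)⁻¹ XᵀY, where X holds the feature vectors with a trailing 1 and Y the one-hot labels. The names `add_bias`, `fit_linear_regressor`, and `predict_classes` are hypothetical, and the data is a synthetic toy set. For simplicity this sketch also regularizes the bias weight.

```python
import numpy as np

def add_bias(features):
    """Append a trailing column of ones to act as the bias input."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_linear_regressor(features, targets, alpha=0.0):
    """Ridge regression weights, shape (n_features + 1, n_classes)."""
    X = add_bias(features)
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ targets)

def predict_classes(features, weights):
    """Pick the class whose regression output is largest."""
    return np.argmax(add_bias(features) @ weights, axis=1)

# Synthetic toy data: three well-separated 2-D classes, 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
labels = np.repeat(np.arange(3), 20)
features = centers[labels] + 0.1 * rng.standard_normal((60, 2))
targets = np.eye(3)[labels]  # one-hot targets

weights = fit_linear_regressor(features, targets, alpha=1e-3)
accuracy = np.mean(predict_classes(features, weights) == labels)
print(weights.shape)  # (3, 3)
print(accuracy)
```

With the notebook's data, `features` would be the K-means distance features and `targets` the 10 one-hot label columns. `alpha` is the regularization strength that cross-validation would tune.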

