# How to Add Primitives into EMADE
### Written by Austin Dunn

In this notebook I will demonstrate how to add a primitive into EMADE using the new wrapper classes.

First, we need to recreate the wrapper classes EMADE uses.

In [1]:
from abc import ABC, abstractmethod
from functools import partial

class RegistryWrapper(ABC):
    """
    Abstract Base Class cannot be Instantiated

    Stores a mapping of primitives used in generating the PrimitiveSet
    Can be subclassed for different types of primitives

    The first object of input_types must be an EmadeDataPair

    Args:
        input_types: common inputs for every primitive stored in the wrapper
                     used to create a mapping between arg and index
                     example mapping created: {'EmadeDataPair0': 0, 'TriState1': 1}

    """
    def __init__(self, input_types=[]):
        self.prependInputs = input_types
        self.kwpos = {}
        for i in range(len(input_types)):
            self.kwpos[input_types[i].__name__ + str(i)] = i
        self.registry = {}

    @abstractmethod
    def register(self, name, p_fn, s_fn, input_types):
        pass

    def get_registry(self):
        return self.registry

The above class cannot be instantiated on its own. If I try to run the code:

`my_wrapper = RegistryWrapper()`

Python throws the TypeError below:

`TypeError: Can't instantiate abstract class RegistryWrapper with abstract methods register`

To avoid this we need to make a subclass of RegistryWrapper and implement the abstract method "register".
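As a quick check of this behaviour, here is a minimal sketch using a stripped-down stand-in for `RegistryWrapper`. The names `MiniRegistry`, `ConcreteRegistry`, and `can_instantiate` are hypothetical and exist only for this illustration; they are not part of EMADE.

```python
from abc import ABC, abstractmethod


class MiniRegistry(ABC):
    """Hypothetical stand-in for RegistryWrapper, reduced to its abstract method."""

    def __init__(self):
        self.registry = {}

    @abstractmethod
    def register(self, name, p_fn, s_fn, input_types):
        pass


class ConcreteRegistry(MiniRegistry):
    """Implements register, so Python allows this class to be instantiated."""

    def register(self, name, p_fn, s_fn, input_types):
        self.registry[name] = {"function": p_fn, "types": input_types}
        return p_fn


def can_instantiate(cls):
    """Return True if cls() succeeds, False if Python raises a TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False


print(can_instantiate(MiniRegistry))      # False: register is still abstract
print(can_instantiate(ConcreteRegistry))  # True: register is implemented
```

The exact wording of the `TypeError` message varies between Python versions, so the sketch checks only the exception type.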

However, we are still missing a piece. We need a wrapper for registering the primitives into the primitive set AND a wrapper for returning the results of a primitive.

Below is a condensed version of a primitive wrapper used by EMADE (cache handling, mode handling, and error handling have been removed). 

I have also omitted the `kwpos` argument used for mapping arguments in `*args` and handling multiple data objects.

If you are interested in how that code works, you can find it in the source code under `wrapper_methods.py`.

In [2]:
import copy as cp

def primitive_wrapper(p_name, primitive_f, setup_f, *args, **kwargs):
    """Condensed Wrapper method for handling data in primitives
       This wrapper expects a list of numpy arrays instead of EMADE's DataPair object

        Note: Any utility objects, e.g. convolution kernels, should be initialized in setup_f or passed in as kwargs
        Supports primitives that operate on 1 data pair. Abstracts data traversal, uses function on inner data.
        Checklist for users:
        - Write setup_f - will receive first instance of data, and primitive args.
            - If defined, return named arguments that primitive_f will use
        - Write primitive_f - primitive_f is given data of an instance, and named arguments from setup_f
            - If setup_f is not defined, primitive_f will receive data and primitive args
        - Docstring must be defined using __doc__
        - Primitives that use a setup method must be called without kwargs in unit tests

    Args:
        p_name: string name of the primitive
        primitive_f: method to call on data
        setup_f: optional setup method
        args: primitive method arguments (list type)
        kwargs: primitive method keyword arguments (dict type)

    Returns:
        updated list of numpy arrays
    """
    # For debugging purposes let's print out method name
    print(p_name)
    
    # we can assume in this case that the first argument is always the numpy array
    data_list = args[0]
    args = args[1:]
    # ^ the decision to not let inner functions see the numpy array explicitly

    # Note - if setup_f is None, we forward all args passed to primitive to inner func
    # The choice is made to withhold primitive args if setup_f is provided, setup_f must label them as kwargs
    # This makes the data processing signature as concise as possible
    is_setup = False
    helper_kwargs = {} # Set up reference

    def setup_wrap(data):
        nonlocal helper_kwargs, args, is_setup
        if setup_f is not None:
            helper_kwargs = {**kwargs, **setup_f(data, *args)}
            args = ()
        is_setup = True
        
    # Make a deepcopy of the data list to make sure the original object is not modified
    data_list = cp.deepcopy(data_list)
        
    for i in range(len(data_list)):
        data = data_list[i]
        if not is_setup:
            # run the setup method if exists
            # this will only get called once
            setup_wrap(data)
        # run arbitrary method on the numpy array
        data_list[i] = primitive_f(data, *args, **helper_kwargs)
    
    return data_list
    

Now that we have our primitive wrapper we can finally define our custom registry wrapper below.

In [3]:
class MyRegistryWrapper(RegistryWrapper):
    """
    This wrapper is a standard registry wrapper for primitives
    Used by signal_methods, spatial_methods, and feature_extraction_methods

    Stores a mapping of primitives used in generating the PrimitiveSet

    The first object of input_types must be an EmadeDataPair

    Args:
        input_types: common inputs for every primitive stored in the wrapper
                     used to create a mapping between arg and index
                     example mapping created: {'EmadeDataPair0': 0, 'TriState1': 1}

    """
    def __init__(self, input_types=[]):
        super().__init__(input_types)

    def register(self, name, p_fn, s_fn, input_types):
        # create wrapped method
        wrapped_fn = partial(primitive_wrapper, name, p_fn, s_fn)

        # create a mapping for adding primitive to pset
        self.registry[name] = {"function": wrapped_fn,
                               "types": input_types}

        # return wrapped method
        return wrapped_fn

Note that the `register` method is now implemented, and we let the superclass (`RegistryWrapper`) handle the constructor (`__init__`).

Let's start by explaining this line: `wrapped_fn = partial(primitive_wrapper, name, p_fn, s_fn)`

- **primitive_wrapper** is the wrapper method we defined above to process the numpy arrays.

- **name** is a simple string of the primitive name. We'll see an example of this later.

- **p_fn** stands for primitive function/method. It's the method modifying our passed in numpy arrays.

- **s_fn** stands for setup function/method. It's an optional method for setting up constants used by p_fn.

The most important part is the `partial` method call. 

`partial` returns a new callable with the `name`, `p_fn`, and `s_fn` arguments already bound, so they stay the same every time that callable is invoked.
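To see the binding in isolation, here is a small sketch. The `describe` and `double` functions are hypothetical stand-ins for `primitive_wrapper` and a primitive helper; only `functools.partial` itself is real.

```python
from functools import partial


def describe(p_name, primitive_f, setup_f, *args):
    """Hypothetical stand-in for primitive_wrapper: reports what it received."""
    return p_name, primitive_f(*args), setup_f


def double(x):
    return 2 * x


# Bind the first three arguments once; callers supply only the remaining ones.
wrapped = partial(describe, "Double", double, None)

print(wrapped(21))   # ('Double', 42, None)
print(wrapped.args)  # the bound positional arguments: name, helper, setup
```

Calling `wrapped(21)` is equivalent to calling `describe("Double", double, None, 21)`, which is why the registry can hand out `wrapped_fn` as if it were an ordinary primitive.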

The next line: `self.registry[name] = {"function": wrapped_fn, "types": input_types}`

is simply mapping the string name of the primitive to the partial function we made and the input types passed into register.

What is `input_types`? In this case it is the list of types for any additional arguments to a primitive beyond those declared in the registry's constructor.

Let's take a look at an example below:

In [4]:
method_registry = MyRegistryWrapper([list])

When we instantiated the `MyRegistryWrapper` we gave it a list of types. In this case we only gave it `list` because our `primitive_wrapper` method expects a list of numpy arrays as the first argument.
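The constructor also builds the `kwpos` mapping from those types. The sketch below reproduces only that loop from `RegistryWrapper.__init__`; the function name `build_kwpos` is hypothetical and used just for illustration.

```python
def build_kwpos(input_types):
    """Mirror of the loop in RegistryWrapper.__init__: '<type name><index>' -> index."""
    return {t.__name__ + str(i): i for i, t in enumerate(input_types)}


print(build_kwpos([list]))        # {'list0': 0}
print(build_kwpos([list, list]))  # {'list0': 0, 'list1': 1}
```

With EMADE's own types in place of `list`, the same loop produces keys like the `'EmadeDataPair0'` and `'TriState1'` shown in the class docstring.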

As a consequence, any method we register into `method_registry` must expect a numpy array as its first argument. Why?
Because of this block of code we wrote earlier:

```
for i in range(len(data_list)):
    data = data_list[i]
    if not is_setup:
        # run the setup method if exists
        # this will only get called once
        setup_wrap(data)
    # run arbitrary method on the numpy array
    data_list[i] = primitive_f(data, *args, **helper_kwargs)
```

`primitive_wrapper` expects a list, but each primitive must expect a single numpy array.

Now we have everything we need to start making primitives.

We'll go through examples of how to code different kinds of primitives in EMADE.

# Standard Primitives
### Description: Methods using the standard primitive_wrapper
---

## Sharing a helper method:

In [5]:
def my_add_helper(data, value):
    return data + value

my_add = method_registry.register("AddInt", my_add_helper, None, [int])
my_add.__doc__ = """
Adds an integer to every element of a numpy array

Args:
    data: numpy array
    value: integer to add
"""

my_add_float = method_registry.register("AddFloat", my_add_helper, None, [float])
my_add_float.__doc__ = """
Adds a float to every element of a numpy array

Args:
    data: numpy array
    value: float to add
"""

Above are two simple primitives. 

Below is additional documentation for how the registry system works.

- **register:** For both primitives we defined the name, passed in a method that modifies a numpy array, told the registry the primitive has no setup method (`None`), and told the registry which extra argument type the primitive requires (`int` for `AddInt`, `float` for `AddFloat`).


- **`__doc__`:** This is required for EMADE documentation. Our documentation will show the method `my_add`, which is actually our wrapped partial method. However, this method does not have the typical `def my_add():` definition most Python methods have, so we have to assign our documentation to its `__doc__` attribute for it to appear.


- **my_add_helper:** EMADE's registry wrapper allows us to use the same helper method for both primitives defined above. The primitive_wrapper used by both primitives will call `my_add_helper` during EMADE's individual evaluation.
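The point about `__doc__` can be checked with a short sketch. The `wrapper` and `add_helper` functions are hypothetical stand-ins for `primitive_wrapper` and a primitive helper; the behaviour shown is that of `functools.partial` in CPython.

```python
from functools import partial


def add_helper(data, value):
    """Adds value to data."""
    return data + value


def wrapper(p_name, primitive_f, *args):
    """Hypothetical stand-in for primitive_wrapper."""
    return primitive_f(*args)


my_add = partial(wrapper, "AddInt", add_helper)

# The partial object does not pick up the docstring of add_helper (or wrapper);
# until we assign one, it exposes the generic functools.partial docstring.
before = my_add.__doc__
print(before == add_helper.__doc__)  # False

# Assigning __doc__ explicitly gives the wrapped primitive its own documentation.
my_add.__doc__ = "Adds an integer to the data"
print(my_add.__doc__)
```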

Example:

In [6]:
import numpy as np

# create a list of numpy arrays
my_list = [np.array([1., 2., 3., 4.]), np.array([5., 6., 7., 8.])]

# run primitive using registry wrapper + primitive wrapper
print(my_add(my_list, 10))
print(my_add_float(my_list, 1.2))

AddInt
[array([11., 12., 13., 14.]), array([15., 16., 17., 18.])]
AddFloat
[array([2.2, 3.2, 4.2, 5.2]), array([6.2, 7.2, 8.2, 9.2])]


## Utilizing a Setup Method:

In [7]:
def fraction_setup(data, value):
    if value != 0:
        value = 1 / value
    else:
        value = np.inf
    return { 'value':value }

def fraction_helper(data, value=0.33):
    return data * value

fraction = method_registry.register("Fraction", fraction_helper, fraction_setup, [float])
fraction.__doc__ = """
Multiplies every element of a numpy array by a fraction

Args:
    data: numpy array
    value: fraction denominator
"""

print(fraction(my_list, 2))
print(fraction(my_list, 0))

Fraction
[array([0.5, 1. , 1.5, 2. ]), array([2.5, 3. , 3.5, 4. ])]
Fraction
[array([inf, inf, inf, inf]), array([inf, inf, inf, inf])]


The above primitive utilizes both a helper method and a setup method.

Note: All primitives are required to have a helper method. You can NOT add a primitive in with only a setup method.

Why use a setup method? The answer is in the `primitive_wrapper` code we looked at earlier.

Setup methods only run once in `primitive_wrapper`. This means we do not have to recalculate the value of the fraction for every numpy array in the data list. We can assume it will be constant for every array.
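To make the "runs once" claim concrete, here is a minimal sketch that counts calls. It assumes a simplified wrapper (`mini_wrapper`, hypothetical) and plain Python lists instead of numpy arrays, so it is an illustration of the idea rather than EMADE's actual wrapper.

```python
calls = {"setup": 0, "helper": 0}


def fraction_setup(data, value):
    calls["setup"] += 1
    return {"value": 1 / value if value != 0 else float("inf")}


def fraction_helper(data, value=0.33):
    calls["helper"] += 1
    return [x * value for x in data]


def mini_wrapper(primitive_f, setup_f, data_list, *args):
    """Hypothetical, simplified primitive_wrapper: setup once, helper per item."""
    helper_kwargs = {}
    if setup_f is not None and data_list:
        # setup sees only the first data item and the primitive's arguments
        helper_kwargs = setup_f(data_list[0], *args)
        args = ()
    return [primitive_f(data, *args, **helper_kwargs) for data in data_list]


result = mini_wrapper(fraction_helper, fraction_setup,
                      [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], 2)
print(result)  # [[0.5, 1.0], [1.5, 2.0], [2.5, 3.0]]
print(calls)   # {'setup': 1, 'helper': 3}
```

The reciprocal is computed a single time even though the helper runs for each of the three items.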

When implementing new primitives you should always be looking out for cases when a setup method can be used.

But wait there's **more**!

Multiple primitives can use the same setup method just like the example with helper methods we saw earlier.

## Multiple data objects:

In [8]:
def my_add_pair_helper(data, second_data):
    return data + second_data

my_add_arr = method_registry.register("AddArray", my_add_pair_helper, None, [])
my_add_arr.__doc__ = """
Adds two numpy arrays together (list + array)

Args:
    data: numpy array
    second_data: numpy array
"""

# create a new numpy array
my_arr = np.array([9., 10., 11., 12.])

# run primitive
print(my_add_arr(my_list, my_arr))

'''
Optional multiple data_list code
'''

method_registry_2 = MyRegistryWrapper([list, list])

my_add_pair = method_registry_2.register("AddPair", my_add_pair_helper, None, [])
my_add_pair.__doc__ = """
Adds two numpy arrays together (list + list)

Args:
    data: numpy array
    second_data: numpy array
"""

# create a new list of numpy arrays
# used for the optional multiple data_list code
my_list_2 = [np.array([9., 10., 11., 12.]), np.array([13., 14., 15., 16.])]

# print(my_add_pair(my_list, my_list_2))

AddArray
[array([10., 12., 14., 16.]), array([14., 16., 18., 20.])]


Sometimes in EMADE we want to operate on more than one data object. In the example above we only took in a numpy array as the second argument and not a new list of numpy arrays.

This is because our condensed `primitive_wrapper` does not support more than one data list. However, EMADE does support two data objects at once using the `kwpos` argument we saw earlier.

I encourage you to modify the `primitive_wrapper` in this notebook to support `my_list_2`. If you're stuck you can look at the `primitive_wrapper` code in `wrapper_methods.py`.
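If you want a starting point, here is one possible sketch of a wrapper that walks two data lists in step. It is an assumption-laden simplification: `pair_wrapper` is a hypothetical name, it does not use EMADE's `kwpos` mechanism, and it uses plain Python lists so that it runs without numpy.

```python
import copy as cp


def pair_wrapper(p_name, primitive_f, first_list, second_list, *args):
    """Hypothetical variant of the condensed wrapper for two data lists."""
    if len(first_list) != len(second_list):
        raise ValueError("both data lists must have the same length")
    # copy so the caller's first list is left untouched
    first_list = cp.deepcopy(first_list)
    for i, second in enumerate(second_list):
        first_list[i] = primitive_f(first_list[i], second, *args)
    return first_list


def add_pair_helper(data, second_data):
    # element-wise sum; with numpy arrays this would simply be data + second_data
    return [a + b for a, b in zip(data, second_data)]


list_1 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
list_2 = [[9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]]

summed = pair_wrapper("AddPair", add_pair_helper, list_1, list_2)
print(summed)  # [[10.0, 12.0, 14.0, 16.0], [18.0, 20.0, 22.0, 24.0]]
```

Pairing items by index is only one design choice; the real wrapper in `wrapper_methods.py` resolves which arguments are data objects through `kwpos`.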

# Abnormal Primitives
### Description: Methods using a wrapper different from primitive_wrapper
---

## Machine Learning Methods:

Machine learning methods do not follow the same framework as other primitives because of three things:

- **LearnerType:** This is a custom object the learner wrapper requires. It selects a random learner from a list of valid ones with a constant set of initial parameters.


- **ModifyLearner:** This is a special primitive which modifies the parameters of a LearnerType.


- **Estimator Objects:** These are the classifier objects you typically see in scikit-learn, such as `DecisionTreeClassifier`. EMADE creates this object based on the parameters stored in `LearnerType`.

Below is a condensed version of the code EMADE uses to implement machine learning methods:

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import random

class LearnerType:
    '''
    Stores information about a machine learning algorithm
    '''

    def __init__(self):
        self.learnerName, self.learnerParams = learnerGen()

    def __repr__(self):
        return 'learnerType(\'' + str(self.learnerName) + '\', ' + str(self.learnerParams) + ')'
    
    
def learnerGen():
    """Generates a machine learner object

    Returns:
        a tuple of (learner name, dictionary of initial parameters)
    """
    learners = ["KNN", "DECISION_TREE", "RAND_FOREST"]

    selectedLearner = random.choice(learners)
    learnerParams = None
    if selectedLearner == "KNN":
        learnerParams = {'K': 3, 'weights':0}
    elif selectedLearner == "DECISION_TREE":
        learnerParams = {'criterion':0, 'splitter':0}
    elif selectedLearner == "RAND_FOREST":
        learnerParams = {'n_estimators': 100, 'criterion':0, 'max_depth': 3, 'class_weight':0}

    return selectedLearner, learnerParams


def modifyLearner(learner, newValue, pos=0):
    """Modifies a machine learner object

    Args:
        learner: learner to modify
        newValue: new value for learner parameter
        pos: which parameter to insert new value

    Returns:
        a learner object
    """
    learnerName = learner.learnerName
    
    if (learnerName == "KNN") and (type(newValue) is int) and pos%2==0:
        learner.learnerParams['K'] = newValue
    elif (learnerName == "KNN") and (type(newValue) is int) and pos%2==1:
        learner.learnerParams['weights'] = newValue
        
    elif (learnerName == "DECISION_TREE") and (type(newValue) is int) and pos%2==0:
        learner.learnerParams['criterion'] = newValue
    elif (learnerName == "DECISION_TREE") and (type(newValue) is int) and pos%2==1:
        learner.learnerParams['splitter'] = newValue
        
    elif (learnerName == "RAND_FOREST") and (type(newValue) is int) and pos%3==0:
        learner.learnerParams['n_estimators'] = newValue
    elif (learnerName == "RAND_FOREST") and (type(newValue) is int) and pos%3==1:
        learner.learnerParams['class_weight'] = newValue
    elif (learnerName == "RAND_FOREST") and (type(newValue) is int) and pos%3==2:
        learner.learnerParams['criterion'] = newValue

    return learner


def mod_select(array, dictionary, key):
    """Selects based on key and mod

    Args:
        array: given array
        dictionary: given dictionary
        key: given key

    Returns:
        An item from the array
    """
    if key not in dictionary:
        return array[0]
    else:
        integer = dictionary[key]
        return array[integer % len(array)]


def get_scikit_model(learner):
    """Generates a machine learning classifier

    Given a learner object, produce an estimator that can be used
    either by itself, or with an ensemble technique

    Args:
        learner: type of machine learning classifier to return

    Returns:
        a machine learning classifier
    """
    # This can occur if we specify a learner not implemented in scikit
    estimator = None
    
    if learner.learnerName == "KNN":
        k = abs(int(learner.learnerParams['K']))
        weights_list = ['uniform', 'distance']
        weights = mod_select(weights_list, learner.learnerParams, 'weights')
        estimator = KNeighborsClassifier(n_neighbors=k,
                                         weights=weights)
        
    elif learner.learnerName == "DECISION_TREE":
        criterions = ['gini', 'entropy']
        criterion = mod_select(criterions, learner.learnerParams, 'criterion')
        splitters = ['best', 'random']
        splitter = mod_select(splitters, learner.learnerParams, 'splitter')
        estimator = DecisionTreeClassifier(criterion=criterion, splitter=splitter)
        
    elif learner.learnerName == "RAND_FOREST":
        n_estimators = abs(int(learner.learnerParams['n_estimators']))
        criterions = ['gini', 'entropy']
        criterion = mod_select(criterions, learner.learnerParams, 'criterion')
        max_depth = abs(int(learner.learnerParams['max_depth']))
        class_weights = [None, 'balanced', 'balanced_subsample']
        class_weight = mod_select(class_weights, learner.learnerParams, 'class_weight')
        estimator = RandomForestClassifier(n_estimators=n_estimators,
                                           criterion=criterion,
                                           max_depth=max_depth,
                                           class_weight=class_weight)

    return estimator


def learner_wrapper(train_data, test_data, label_data, learner):
    """
    Core method for all learners
    Creates the model
    Trains on training data
    Predicts on testing data
    Returns predicted labels of testing data

    Args:
        train_data: numpy array (training examples to train on)
        test_data: numpy array (test examples to evaluate on)
        label_data: labels for each training example (numpy array)
        learner: data structure storing information needed to generate model

    Returns:
        predicted labels
    """
    print("Learner Name:", learner.learnerName)
    print("Learner Params:", learner.learnerParams)

    """
    Setup Variables
    """
    # Get the underlying base estimator to use
    # This is the most important step because it builds on all our previous code
    base_estimator = get_scikit_model(learner)
    
    # deepcopy numpy arrays
    train_data = cp.deepcopy(train_data)
    test_data = cp.deepcopy(test_data)
    label_data = cp.deepcopy(label_data)
    
    '''
    For this example we do not need to deepcopy the numpy arrays because the underlying data is not being modified.
    However, in EMADE we always deepcopy the object storing all this data, which means doing so here is good practice.
    
    Why do we need the deepcopy? Because the multiprocessing EMADE uses allows more than one individual to be evaluated
    at the same time. If all these evaluations are using data with the same memory reference one evaluation will affect
    another.
    '''
    
    '''
    Fit estimator to training data
    Predict labels of testing data
    '''
    base_estimator.fit(train_data, label_data)
    predicted_classes = base_estimator.predict(test_data)
    
    '''
    Return prediction
    '''
    return predicted_classes

Hopefully you have read over the code above, because now we are going to run a small-scale machine learning example.

Keep in mind as you read that all the code in this notebook is designed for genetic programming with DEAP.
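One visible consequence of that design in the condensed code is that learner parameters are stored as plain integers and mapped onto valid options with a modulo. My reading is that this keeps any evolved integer valid, though the notebook itself does not state the motivation. The sketch below repeats the `mod_select` logic on its own so it can be run in isolation.

```python
def mod_select(array, dictionary, key):
    """Same logic as mod_select above: wrap an integer onto a list of options."""
    if key not in dictionary:
        return array[0]
    return array[dictionary[key] % len(array)]


class_weights = [None, "balanced", "balanced_subsample"]

# Any integer, including out-of-range and negative ones, maps to a valid option.
for raw in (0, 1, 2, 3, 7, -1):
    print(raw, "->", mod_select(class_weights, {"class_weight": raw}, "class_weight"))

# A missing key falls back to the first option.
print(mod_select(class_weights, {}, "class_weight"))  # None
```

Note that the index must be an integer: a float value such as `1.0` would pass the modulo but fail when used as a list index.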

### Data
| Input | Output |
| ----------- | ----------- |
| 0 0 1 | 0 |
| 1 1 1 | 1 |
| 1 0 1 | 1 |
| 0 1 1 | 0 |
| 1 0 0 | 1 |
| 0 1 0 | 0 |


In [19]:
# initialize our data
train_data = np.array([[0,0,1],
                       [1,1,1],
                       [1,0,1],
                       [0,1,1]])

train_labels = np.array([[0],
                         [1],
                         [1],
                         [0]])

test_data = np.array([[1,0,0],
                      [0,1,0]])

test_labels = np.array([[1],
                        [0]])

# Make a random learner
my_learner = LearnerType()

# Modify the learner
# Try playing around with this method!
my_learner = modifyLearner(my_learner, 1, pos=1)

# print out the predicted labels
print(learner_wrapper(train_data, test_data, train_labels.ravel(), my_learner))

Learner Name: RAND_FOREST
Learner Params: {'n_estimators': 100, 'criterion': 0, 'max_depth': 3, 'class_weight': 1}
[1 0]


Hopefully you now have an understanding of how machine learning primitives work in EMADE.

I encourage you to add new machine learning methods to the code above. 

Try adding different datasets and regression. You can even add a neural network!

## Fit-Transform Methods:

Some primitives such as PCA, K-means clustering, and feature selection methods fit and transform on an entire dataset.

These methods do not operate example by example like `primitive_wrapper` and they do not use `.predict` like `learner_wrapper`.
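To make the distinction concrete, here is a minimal sketch of the fit-then-transform pattern. `MeanCenterer` is a hypothetical transformer written for this example with a scikit-learn-like `fit_transform`/`transform` interface, and plain lists stand in for numpy arrays.

```python
class MeanCenterer:
    """Hypothetical transformer: subtracts the per-column mean learned from training data."""

    def fit_transform(self, rows, labels=None):
        n = len(rows)
        # the fitted state depends on the whole training set, not on one example
        self.means_ = [sum(col) / n for col in zip(*rows)]
        return self.transform(rows)

    def transform(self, rows):
        return [[x - m for x, m in zip(row, self.means_)] for row in rows]


def fit_transform_helper(train_data, test_data, labels, method):
    # fit on the training set, then reuse the fitted state on the test set
    new_train = method.fit_transform(train_data, labels)
    new_test = method.transform(test_data)
    return new_train, new_test


train = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
test = [[2.0, 2.0]]

new_train, new_test = fit_transform_helper(train, test, [0, 1, 1], MeanCenterer())
print(new_train)  # [[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]]
print(new_test)   # [[0.0, -2.0]]
```

The test data is transformed with the means learned from the training data, which is why a per-example loop like the one in `primitive_wrapper` is not enough here.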

We will need a new registry wrapper and primitive wrapper.

The code for these is below:

In [27]:
import sklearn.feature_selection

SCORING_FUNCTION_LIST = [
    sklearn.feature_selection.f_classif,
    sklearn.feature_selection.chi2,
    sklearn.feature_selection.f_regression
    ]

def fit_transform_wrapper(p_name, helper_function, setup_function, train_data, test_data, label_data, *args):
    """Template for primitives which fit and transform data
       but do not modify the target

    Args:
        p_name:           name of primitive
        helper_function:  returns transformed data
        setup_function:   returns method to fit to data
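        train_data:       numpy array of training examples
        test_data:        numpy array of test examples
        label_data:       numpy array containing the labels of train_data
        args:             list of arguments passed to setup_function

    Returns:
        tuple of (transformed training data, transformed test data)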
    """
    print(p_name)
    
    '''
    Transform
    '''
    method = setup_function(*args)
    new_train_data, new_test_data = helper_function(train_data,
                                                    test_data,
                                                    label_data,
                                                    method)
    
    return new_train_data, new_test_data


class RegistryWrapperFT(RegistryWrapper):
    """
    This wrapper is specific to methods which fit and transform data
    but do not modify the target

    Stores a mapping of primitives used in generating the PrimitiveSet

    The first object of input_types must be a numpy array

    Args:
        input_types: common inputs for every primitive stored in the wrapper
                     used to create a mapping between arg and index
                     example mapping created: {'EmadeDataPair0': 0, 'TriState1': 1}

    """
    def __init__(self, input_types=[]):
        super().__init__(input_types)

    def register(self, name, p_fn, s_fn, input_types):
        # create wrapped method
        wrapped_fn = partial(fit_transform_wrapper, name, p_fn, s_fn)

        # create a mapping for adding primitive to pset
        self.registry[name] = {"function": wrapped_fn,
                               "types": input_types}

        # return wrapped method
        return wrapped_fn

    
# initialize new registry wrapper
transform_registry = RegistryWrapperFT([np.ndarray, np.ndarray, np.ndarray])
    
    
def feature_selection_helper(train_data, test_data, labels, function):
    new_train_data = function.fit_transform(train_data, labels)
    new_test_data = function.transform(test_data)
    return new_train_data, new_test_data

def select_k_best_scikit_setup(scoring_function=0, k=10):
    # force k to be positive
    k = abs(k)
    # Get scoring function
    scoring_function = SCORING_FUNCTION_LIST[scoring_function%len(SCORING_FUNCTION_LIST)]
    # Create a selector for scikit's select k best
    return sklearn.feature_selection.SelectKBest(score_func=scoring_function, k=k)

select_k_best_scikit = transform_registry.register("mySelKBest", feature_selection_helper, select_k_best_scikit_setup, [int, int])
select_k_best_scikit.__doc__ = """
Returns the result of a select k best method using scikit

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

Args:
    scoring_function: function taking two arrays X and y, and returning a
                      pair of arrays (scores, pvalues) or a single array of scores
    k: number of top features to select

Returns:
    transformed training data and transformed testing data
"""


'''
Run Primitive
'''
new_train_data, new_test_data = select_k_best_scikit(train_data, test_data, train_labels.ravel(), 0, 1)

print("Selected Training Data:\n", new_train_data)
print("Selected Testing Data:\n", new_test_data)

mySelKBest
Selected Training Data:
 [[0]
 [1]
 [1]
 [0]]
Selected Testing Data:
 [[1]
 [0]]




## Congratulations! 
## You (Hopefully) know how to add primitives to EMADE now