# Comparing the Performance of scikit-eLCS and the Original eLCS Algorithm

Author: Robert Zhang - University of Pennsylvania, B.S.E. Computer Science, B.S.E. Economics (SEAS '22, WH '22)

Advisor: Ryan Urbanowicz, PhD - University of Pennsylvania, Department of Biostatistics, Epidemiology, and Informatics & Institute for Biomedical Informatics (IBI)

Date: 04/05/2020

Requirements: (Python 3)
<ul>
    <li>scikit-eLCS</li>
    <li>pandas</li>
    <li>numpy</li>
    <li>scipy</li>
    <li>scikit-learn</li>
</ul>

## Introduction
This notebook presents a comparison between the performance of the original eLCS Algorithm, as presented in the 2017 textbook "Introduction to Learning Classifier Systems" by Ryan Urbanowicz and Will Browne, and the new scikit-eLCS Python package.

The scikit-eLCS package is an sklearn-compatible Python implementation of the original eLCS algorithm. It was designed to match the original in training/testing accuracy and training time, while being significantly more user-friendly and including an array of additional real-time and post-training analysis tools. This notebook will demonstrate these capabilities in detail.

The scikit-eLCS source code and a complete walkthrough of its usage can be found at <a href=https://github.com/UrbsLab/scikit-eLCS>this GitHub repository</a>. The package can be installed via **pip3 install scikit-eLCS**.

This notebook uses a slightly modified version of the original eLCS algorithm. The modifications improve usability and clarity, and make its runtime more comparable to that of the scikit-eLCS package, which would otherwise run slightly faster because it has no mandatory evaluation, printing, or exporting during training. Thus, this notebook compares only the runtime of the core algorithm implementations, rather than the runtimes of the two packages as a whole. The modifications are:
<ul>
    <li>Removed all print statements from the original eLCS</li>
    <li>Removed all obligatory evaluation procedures during training from the original eLCS</li>
    <li>Removed all export functionality during training from the original eLCS</li>
    <li>Made the training accuracy of the original eLCS easier to access</li>
    <li>Removed the need for a config file (parameters are passed directly instead)</li>
    <li>Removed the need for separate train and test files, and switched all data files from txt to csv</li>
</ul>
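The helper functions in Part 0 read per-component timings (deletion, evaluation, matching, selection, subsumption, total) from each implementation's timer object. As a rough illustration of how such timings can be accumulated, here is a minimal sketch. `PhaseTimer` and its methods are hypothetical names chosen for this example; this is not the actual timer code of either implementation.

```python
import time


class PhaseTimer:
    """Hypothetical accumulator for time spent in named training phases."""

    def __init__(self):
        self.totals = {}    # phase name -> accumulated seconds
        self._starts = {}   # phase name -> start timestamp of the open interval

    def start(self, phase):
        self._starts[phase] = time.perf_counter()

    def stop(self, phase):
        elapsed = time.perf_counter() - self._starts.pop(phase)
        self.totals[phase] = self.totals.get(phase, 0.0) + elapsed


timer = PhaseTimer()
for _ in range(3):
    timer.start("matching")
    sum(range(10000))        # stand-in for the real matching work
    timer.stop("matching")

print(timer.totals)
```

Because each phase is accumulated separately, work that is excluded from the comparison (printing, exporting, mandatory evaluation) simply never contributes to the totals being compared.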

## Notebook Organization
**Part 0: Setting Up Helper Methods**

**Part 1: Comparing Training Accuracy and Runtime**
<ul>
    <li> 6-bit Multiplexer Problem </li>
    <li> 11-bit Multiplexer Problem </li>
    <li> 20-bit Multiplexer Problem </li>
</ul>

**Part 2: Comparing Testing Accuracy**
<ul>
    <li> 6-bit Multiplexer Problem </li>
    <li> 11-bit Multiplexer Problem </li>
    <li> 20-bit Multiplexer Problem </li>
</ul>

**Part 3: Quick Demo of Additional Analysis Tools Provided by scikit-eLCS**
<ul>
    <li> Iteration Tracking Tool </li>
    <li> Rule Population Tool </li>
    <li> Population Statistics Tools </li>
</ul>

## Part 0: Setting Up Helper Methods

In [1]:
from eLCS_Timer import Timer
from eLCS_ParamParser import ParamParser
from eLCS_Offline_Environment import Offline_Environment
from eLCS_Algorithm import eLCS
from eLCS_Constants import *
import numpy as np
import time

def runOriginaleLCS(dataFile,labelPhenotype,learningIterations,randomSeed,cv=False):
    # Run the original eLCS algorithm.
    # cv=False: train on the full dataset; return training accuracy and component timings.
    # cv=k (int): run k-fold cross-validation; return the mean testing accuracy.
    if not cv:
        ParamParser(dataFile,cv=cv,labelPhenotype=labelPhenotype,learningIterations=learningIterations,randomSeed=randomSeed)
        timer = Timer()
        cons.referenceTimer(timer)
        env = Offline_Environment()
        cons.referenceEnv(env)
        cons.parseIterations()
        e = eLCS()
        return np.array([e.trainEval[0],cons.timer.globalDeletion,cons.timer.globalEvaluation,cons.timer.globalMatching,cons.timer.globalSelection,cons.timer.globalSubsumption,cons.timer.globalTime])
    else:
        testAccuracies = []
        ParamParser(dataFile,cv=cv,labelPhenotype=labelPhenotype,learningIterations=learningIterations,randomSeed=randomSeed)
        for i in range(cv):
            cons.setCV()
            timer = Timer()
            cons.referenceTimer(timer)
            env = Offline_Environment()
            cons.referenceEnv(env)
            cons.parseIterations()
            e = eLCS()
            testAccuracies.append(e.testEval[0])
        return np.mean(np.array(testAccuracies))

import skeLCS
import pandas as pd
from sklearn.model_selection import cross_val_score

def runScikiteLCS(dataFile,classLabel,learningIterations,randomSeed,cv=False):
    # Run scikit-eLCS; mirrors runOriginaleLCS.
    data = pd.read_csv(dataFile)
    dataFeatures = data.drop(classLabel,axis=1).values
    dataPhenotypes = data[classLabel].values
    model = skeLCS.eLCS(learningIterations = learningIterations,randomSeed = randomSeed)

    if not cv:
        # Train and score on the full dataset; return training accuracy and component timings.
        model.fit(dataFeatures,dataPhenotypes)
        score = model.score(dataFeatures,dataPhenotypes)
        return np.array([score,model.timer.globalDeletion,model.timer.globalEvaluation,model.timer.globalMatching,model.timer.globalSelection,model.timer.globalSubsumption,model.timer.globalTime])
    else:
        # Shuffle features and labels together (row-wise), then split them apart again.
        formatted = np.insert(dataFeatures,dataFeatures.shape[1],dataPhenotypes,1)
        np.random.shuffle(formatted)
        dataFeatures = np.delete(formatted,-1,axis=1)
        dataPhenotypes = formatted[:,-1]
        # Mean testing accuracy over the cv folds.
        return np.mean(cross_val_score(model,dataFeatures,dataPhenotypes,cv=cv))

randomSeeds = [0,1,2,3,4]

## Part 1: Comparing Training Accuracy and Runtime
We will use the n-bit Multiplexer Problem to test the training accuracy and runtime of the two eLCS implementations. The Multiplexer Problem is a standard LCS benchmark because of its highly epistatic and heterogeneous nature.
<br>
<br>
<img src="MP.jpg">
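To make the problem concrete, here is a minimal sketch of the multiplexer function. It assumes the common convention that the address bits come first (most significant bit first) and are followed by the data bits; the column order in the `Datasets/` files may differ. The function names are illustrative only.

```python
import itertools


def multiplexer(bits, n_address):
    """Return the data bit selected by the leading address bits."""
    # Assumption: the first n_address bits form a binary address, MSB first.
    address = int("".join(str(b) for b in bits[:n_address]), 2)
    # The remaining 2**n_address bits are the data register.
    return bits[n_address + address]


def all_instances(n_address):
    """Enumerate every (bits, class) pair for one multiplexer size."""
    n_bits = n_address + 2 ** n_address
    return [(bits, multiplexer(bits, n_address))
            for bits in itertools.product((0, 1), repeat=n_bits)]


six_bit = all_instances(2)                  # 2 address bits + 4 data bits
print(len(six_bit))                         # 64
print(multiplexer((1, 0, 0, 0, 1, 0), 2))   # address 10 selects data bit 2 -> 1
```

The 11-bit and 20-bit problems follow the same pattern with 3 address bits (8 data bits) and 4 address bits (16 data bits). Which data bit determines the class depends on the address bits, so no single feature is predictive on its own.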

We will use the same hyperparameters and the same random seeds for both eLCS implementations to ensure exact replicability. (Without fixed random seeds, the analysis still leads to highly similar conclusions.)
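Each cell in this part repeats the same pattern: run one implementation once per seed, sum the returned metric vectors, and divide by the number of seeds. A minimal sketch of that pattern as a reusable helper follows. `average_over_seeds`, `report`, and `fake_run` are hypothetical names, and `fake_run` merely stands in for `runOriginaleLCS` or `runScikiteLCS`.

```python
METRICS = ["Training Accuracy", "Deletion Time", "Evaluation Time",
           "Matching Time", "Selection Time", "Subsumption Time",
           "Total Training Time"]


def average_over_seeds(run, seeds):
    """Call run(seed) for each seed and average the returned metric vectors."""
    totals = [0.0] * len(METRICS)
    for seed in seeds:
        result = run(seed)
        totals = [t + float(r) for t, r in zip(totals, result)]
    return [t / len(seeds) for t in totals]


def report(averages):
    """Build one 'Average <metric>: value' line per metric."""
    return ["Average " + name + ": " + str(value)
            for name, value in zip(METRICS, averages)]


# Hypothetical stand-in for a real training run; returns made-up numbers.
def fake_run(seed):
    return [1.0, 0.1 * seed, 0.2, 0.3, 0.4, 0.5, 2.0 + seed]


averages = average_over_seeds(fake_run, [0, 1, 2, 3, 4])
for line in report(averages):
    print(line)
```

With the Part 0 helpers, `run` could be something like `lambda seed: runScikiteLCS('Datasets/Multiplexer6.csv','class',5000,seed)`.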

### 6-bit Multiplexer Problem with Original eLCS

In [2]:
# avgOriginal = np.array([0,0,0,0,0,0,0])
# for seed in randomSeeds:
#     avgOriginal = np.add(avgOriginal,runOriginaleLCS('Datasets/Multiplexer6.csv','class','5000',seed))
# avgOriginal /= 5

# print("Average Training Accuracy: "+str(avgOriginal[0]))
# print("Average Deletion Time: "+str(avgOriginal[1]))
# print("Average Evaluation Time: "+str(avgOriginal[2]))
# print("Average Matching Time: "+str(avgOriginal[3]))
# print("Average Selection Time: "+str(avgOriginal[4]))
# print("Average Subsumption Time: "+str(avgOriginal[5]))
# print("Average Total Training Time: "+str(avgOriginal[6]))

### 6-bit Multiplexer Problem with scikit-eLCS

In [3]:
# avgScikit = np.array([0,0,0,0,0,0,0])
# for seed in randomSeeds:
#     avgScikit = np.add(avgScikit,runScikiteLCS('Datasets/Multiplexer6.csv','class',5000,seed))
# avgScikit /= 5

# print("Average Training Accuracy: "+str(avgScikit[0]))
# print("Average Deletion Time: "+str(avgScikit[1]))
# print("Average Evaluation Time: "+str(avgScikit[2]))
# print("Average Matching Time: "+str(avgScikit[3]))
# print("Average Selection Time: "+str(avgScikit[4]))
# print("Average Subsumption Time: "+str(avgScikit[5]))
# print("Average Total Training Time: "+str(avgScikit[6]))

### 11-bit Multiplexer Problem with Original eLCS

In [4]:
# avgOriginal = np.array([0,0,0,0,0,0,0])
# for seed in randomSeeds:
#     avgOriginal = np.add(avgOriginal,runOriginaleLCS('Datasets/Multiplexer11.csv','class','5000',seed))
# avgOriginal /= 5

# print("Average Training Accuracy: "+str(avgOriginal[0]))
# print("Average Deletion Time: "+str(avgOriginal[1]))
# print("Average Evaluation Time: "+str(avgOriginal[2]))
# print("Average Matching Time: "+str(avgOriginal[3]))
# print("Average Selection Time: "+str(avgOriginal[4]))
# print("Average Subsumption Time: "+str(avgOriginal[5]))
# print("Average Total Training Time: "+str(avgOriginal[6]))

### 11-bit Multiplexer Problem with scikit-eLCS

In [5]:
# avgScikit = np.array([0,0,0,0,0,0,0])
# for seed in randomSeeds:
#     avgScikit = np.add(avgScikit,runScikiteLCS('Datasets/Multiplexer11.csv','class',5000,seed))
# avgScikit /= 5

# print("Average Training Accuracy: "+str(avgScikit[0]))
# print("Average Deletion Time: "+str(avgScikit[1]))
# print("Average Evaluation Time: "+str(avgScikit[2]))
# print("Average Matching Time: "+str(avgScikit[3]))
# print("Average Selection Time: "+str(avgScikit[4]))
# print("Average Subsumption Time: "+str(avgScikit[5]))
# print("Average Total Training Time: "+str(avgScikit[6]))

### 20-bit Multiplexer Problem with Original eLCS

In [6]:
# avgOriginal = np.array([0,0,0,0,0,0,0])
# for seed in randomSeeds:
#     avgOriginal = np.add(avgOriginal,runOriginaleLCS('Datasets/Multiplexer20.csv','class','10000',seed))
# avgOriginal /= 5

# print("Average Training Accuracy: "+str(avgOriginal[0]))
# print("Average Deletion Time: "+str(avgOriginal[1]))
# print("Average Evaluation Time: "+str(avgOriginal[2]))
# print("Average Matching Time: "+str(avgOriginal[3]))
# print("Average Selection Time: "+str(avgOriginal[4]))
# print("Average Subsumption Time: "+str(avgOriginal[5]))
# print("Average Total Training Time: "+str(avgOriginal[6]))

### 20-bit Multiplexer Problem with scikit-eLCS

In [7]:
# avgScikit = np.array([0,0,0,0,0,0,0])
# for seed in randomSeeds:
#     avgScikit = np.add(avgScikit,runScikiteLCS('Datasets/Multiplexer20.csv','class',10000,seed))
# avgScikit /= 5

# print("Average Training Accuracy: "+str(avgScikit[0]))
# print("Average Deletion Time: "+str(avgScikit[1]))
# print("Average Evaluation Time: "+str(avgScikit[2]))
# print("Average Matching Time: "+str(avgScikit[3]))
# print("Average Selection Time: "+str(avgScikit[4]))
# print("Average Subsumption Time: "+str(avgScikit[5]))
# print("Average Total Training Time: "+str(avgScikit[6]))

## Part 2: Comparing Testing Accuracy
We will conduct 3-fold cross-validation five times (once per random seed) for each of the three Multiplexer Problems above.
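The evaluation scheme (k folds per seed, averaged over folds and then over seeds) can be sketched in plain Python as below. This is only an illustration of the bookkeeping: the fold assignment shown here is an assumption and is not necessarily how either implementation partitions the data, and `dummy_evaluate` is a hypothetical placeholder for "train on the training fold, score on the test fold".

```python
import random


def k_fold_splits(n_samples, k, seed):
    """Shuffle indices with a fixed seed and yield (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_cv(evaluate, n_samples, k, seeds):
    """Average the test score over k folds, then over all seeds."""
    per_seed = []
    for seed in seeds:
        scores = [evaluate(train, test, seed)
                  for train, test in k_fold_splits(n_samples, k, seed)]
        per_seed.append(sum(scores) / k)
    return sum(per_seed) / len(seeds)


# Hypothetical evaluator: returns the test-fold fraction instead of an accuracy.
def dummy_evaluate(train, test, seed):
    return len(test) / (len(train) + len(test))


print(repeated_cv(dummy_evaluate, n_samples=64, k=3, seeds=[0, 1, 2, 3, 4]))
```

Every sample appears in exactly one test fold per seed, so each reported average is computed from predictions on data the model did not see during training.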

### 6-bit Multiplexer Problem with Original eLCS

In [8]:
avgOriginal = 0
for seed in randomSeeds:
    avgOriginal += runOriginaleLCS('Datasets/Multiplexer6.csv','class','5000',seed,cv=3)
avgOriginal /= 5

print("Average Testing Accuracy: "+str(avgOriginal))

Average Testing Accuracy: 0.9242424242424242


### 6-bit Multiplexer Problem with scikit-eLCS

In [9]:
avgScikit = 0
for seed in randomSeeds:
    avgScikit += runScikiteLCS('Datasets/Multiplexer6.csv','class',5000,seed,cv=3)
avgScikit /= 5

print("Average Testing Accuracy: "+str(avgScikit))

Average Testing Accuracy: 0.8242424242424242


### 11-bit Multiplexer Problem with Original eLCS

In [10]:
avgOriginal = 0
for seed in randomSeeds:
    avgOriginal += runOriginaleLCS('Datasets/Multiplexer11.csv','class','5000',seed,cv=3)
avgOriginal /= 5

print("Average Testing Accuracy: "+str(avgOriginal))

Average Testing Accuracy: 0.9906295754026354


### 11-bit Multiplexer Problem with scikit-eLCS

In [11]:
avgScikit = 0
for seed in randomSeeds:
    avgScikit += runScikiteLCS('Datasets/Multiplexer11.csv','class',5000,seed,cv=3)
avgScikit /= 5

print("Average Testing Accuracy: "+str(avgScikit))

Average Testing Accuracy: 0.982326090560386


### 20-bit Multiplexer Problem with Original eLCS

In [12]:
avgOriginal = 0
for seed in randomSeeds:
    avgOriginal += runOriginaleLCS('Datasets/Multiplexer20.csv','class','10000',seed,cv=3)
avgOriginal /= 5

print("Average Testing Accuracy: "+str(avgOriginal))

Average Testing Accuracy: 0.8267866066966516


### 20-bit Multiplexer Problem with scikit-eLCS

In [13]:
avgScikit = 0
for seed in randomSeeds:
    avgScikit += runScikiteLCS('Datasets/Multiplexer20.csv','class',10000,seed,cv=3)
avgScikit /= 5

print("Average Testing Accuracy: "+str(avgScikit))

Average Testing Accuracy: 0.7678457448116813
