# Adaptive CMA-ES configurations - Pre-processing

This Python Notebook covers the pre-processing of data for the adaptive CMA-ES research.

The input data consists of raw **BBOB** logging files (a few GB in total).

As output, we store the _steepnesses_ of each pre-specified 'section' for all runs as CSV files, one file per function/dimensionality pair.

> Sander van Rijn<br>
> s.j.van.rijn@liacs.leidenuniv.nl<br>
> LIACS<br>
> 2018-02-28

In [1]:
%matplotlib inline

from __future__ import division, print_function

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import product
from collections import Counter

In [2]:
# Some utility functions for dealing with the representations

# First, some hardcoded variables
num_options_per_module = [2]*9        # Binary part
num_options_per_module.extend([3]*2)  # Ternary part
max_length = 11
factors = [2304, 1152, 576, 288, 144, 72, 36, 18, 9, 3, 1]

def list_all_representations():
    """ Create a list of all possible representations for the modular CMA-ES.
        Each representation is itself a list with <max_length> integer entries {0, 1, ..., n},
        where 'n' is the number of options for the module in that position.
    """
    products = []
    # count how often there is a choice of x options
    counts = Counter(num_options_per_module)
    for num, count in sorted(counts.items(), key=lambda x: x[0]):
        products.append(product(range(num), repeat=count))
    all_representations = []
    for representation in list(product(*products)):
        all_representations.append(list(sum(representation, ())))
    return all_representations


def reprToString(representation):
    """ Function that converts the structure parameters of a given ES-structure representation to a string

        >>> reprToString([0,0,0,0,0,1,0,1,0,1,0])
        '00000101010'
    """
    return ''.join([str(i) for i in representation[:max_length]])
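
The hardcoded `factors` list is not used by the functions above. Its values match the place values of a mixed-radix number with nine binary and two ternary digits, so it presumably serves to convert a representation into an integer index. The sketch below checks that reading; `mixed_radix_factors` and `repr_to_int` are hypothetical helpers and not part of the original code.

```python
from itertools import product

num_options_per_module = [2] * 9 + [3] * 2
factors = [2304, 1152, 576, 288, 144, 72, 36, 18, 9, 3, 1]


def mixed_radix_factors(num_options):
    """Place value of each position: the product of all option counts to its right."""
    place_values = []
    weight = 1
    for n in reversed(num_options):
        place_values.append(weight)
        weight *= n
    # After the loop, weight equals the total number of combinations
    return place_values[::-1], weight


def repr_to_int(representation, place_values):
    """Hypothetical helper: map a representation to a unique integer index."""
    return sum(value * digit for value, digit in zip(place_values, representation))


computed_factors, num_configurations = mixed_radix_factors(num_options_per_module)

# Index every possible representation to confirm the mapping is one-to-one
all_indices = {
    repr_to_int(rep, computed_factors)
    for rep in product(*(range(n) for n in num_options_per_module))
}
```

Under this reading, the 4608 possible configurations (2^9 * 3^2) map one-to-one onto the integers 0 to 4607.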

In [3]:
data_location = '/media/rijnsjvan/Data/SurfDrive/Research Data/Adaptive ES/test_results/'
repetition_format = '-{rep:02d}'
file_name = '{config}/{D}d-f{f}/data_f{f}/bbobexp{rep}_f{f}_DIM{D}.dat'

instances = list(range(5))
num_repetitions = 5
ndims = [5, 20]
fids = [1, 10, 15, 20]

num_steps = 51
powers = np.linspace(2, -8, num_steps)
target_values = np.power([10]*num_steps, powers)

all_configurations = list_all_representations()

In [4]:
# A utility function for loading a full result file
def loadfile(fname, max_budget):
    """ Load a file
        :param fname:       The name of the file to retrieve the data from
        :param max_budget:  Maximum available budget for this optimization run
        :return:            Data from the given file as numpy float array
    """

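    # Parse each row as an (int index, float value) pair.
    # The np.int and np.float aliases were removed in NumPy 1.24; the builtins are equivalent.
    data = np.genfromtxt(fname, delimiter=' ', skip_header=1, dtype=[int, float])
    indices, values = map(np.array, list(zip(*data)))
    # Number of evaluations for which each logged value remains the current one
    repetitions = np.append(indices[1:],[max_budget + 1]) - indices

    # The last logged index may lie beyond the budget
    if repetitions[-1] < 0:
        repetitions[-1] = 0

    # Expand the sparse log into one value per evaluation, cut off at the budget
    fitnesses = np.repeat(values, repetitions)[:max_budget]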
        
    return fitnesses
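
To make the expansion step in `loadfile` concrete, here is a small sketch on made-up numbers (the indices, values and budget are assumptions chosen for illustration, not taken from a real log file). Each logged value is repeated until the next logged index, so the result holds one value per evaluation.

```python
import numpy as np

# Toy log: a new value was recorded at evaluations 1, 3 and 6 (made-up numbers)
indices = np.array([1, 3, 6])
values = np.array([5.0, 3.0, 1.0])
max_budget = 8

# Each value stays current until the next logged index (or the end of the budget)
repetitions = np.append(indices[1:], [max_budget + 1]) - indices
fitnesses = np.repeat(values, repetitions)[:max_budget]

print(repetitions)  # [2 3 3]
print(fitnesses)    # [5. 5. 3. 3. 3. 1. 1. 1.]
```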

In [5]:
def determineTimesToTargets(data, targets):
    """ Given the per-evaluation fitness data of a run, determine for each target
        the index of the first evaluation at which the fitness is below that target.
        Targets that are never reached keep the value 0.
    """
    times_to_targets = np.zeros(len(targets), dtype=float)
    # Targets are assumed to be sorted in decreasing order, so once one target
    # is not reached, none of the remaining ones can be reached either
    for idx, target in enumerate(targets):
        below_target = data < target
        indices = np.argwhere(below_target)
        if len(indices) > 0:
            times_to_targets[idx] = np.min(indices)
        else:
            break
    
    return times_to_targets
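
A toy run illustrates what this function returns. The sketch repeats the same logic under the hypothetical name `times_to_targets` to stay self-contained (`np.flatnonzero` is used in place of `np.argwhere`, which is equivalent for 1-D input); the data and targets are made up.

```python
import numpy as np


def times_to_targets(data, targets):
    """Index of the first evaluation below each target; 0 if a target is never reached."""
    result = np.zeros(len(targets))
    for idx, target in enumerate(targets):
        hits = np.flatnonzero(data < target)
        if len(hits) == 0:
            break
        result[idx] = hits[0]
    return result


data = np.array([50.0, 5.0, 0.5, 0.05])     # made-up fitness per evaluation
targets = np.array([10.0, 1.0, 0.1, 0.01])  # decreasing targets

first = times_to_targets(data, targets)
second = times_to_targets(data, np.array([100.0, 0.01]))

print(first)   # [1. 2. 3. 0.]
print(second)  # [0. 0.]
```

Note that in the second call the value 0 is ambiguous: the first target is reached at evaluation index 0, while the second is never reached. Any later analysis of the CSVs may need to account for this.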

# Simplifying: loads of data to manageable CSVs
So far, this has all been basic setup. Now we are going to actually simplify our data.

Rather than working with the data of all complete runs, we will reduce it to what we are actually interested in: the performance gradients/convergence speeds of each algorithm at various points during the optimization process (terminology not yet final).

In [6]:
def createSteepnessRecord(representation, ndim, fid, iid, rep, *, budget_factor=1e4):
    """ Create a single record for a given run: algorithm {representation} on {ndim}D f{fid},
        instance {iid}, repetition {rep}. The 'steepness' values currently hold the result of
        determineTimesToTargets for each value in target_values.
    """
    budget = int(ndim * budget_factor)
    run_num = iid*num_repetitions + rep
    if run_num == 0:
        run_num = ''
    else:
        run_num = repetition_format.format(rep=run_num)

    fname = file_name.format(config=reprToString(representation), f=fid, D=ndim, rep=run_num)
    
    data = loadfile(fname, budget)
    # steepnesses = determine_steepnesses(data, steepness_sections)
    steepnesses = determineTimesToTargets(data, target_values)
    return (representation, ndim, fid, iid, rep, *steepnesses)

# Labels for the records created by the function above to be used when loading the records into a pandas dataframe
record_labels = [
    'Representation', 
    'ndim', 
    'function ID', 
    'instance ID', 
    'repetition', 
    *(str(sec) for sec in target_values)
]
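
The way `createSteepnessRecord` derives the log file path from the instance and repetition numbers can be sketched as follows. The format strings and `num_repetitions` are copied from the cells above; `run_file_name` is a hypothetical helper that only mirrors the path logic.

```python
repetition_format = '-{rep:02d}'
file_name = '{config}/{D}d-f{f}/data_f{f}/bbobexp{rep}_f{f}_DIM{D}.dat'
num_repetitions = 5


def run_file_name(config, ndim, fid, iid, rep):
    """Hypothetical helper mirroring the path logic in createSteepnessRecord."""
    run_num = iid * num_repetitions + rep
    # The very first run has no numeric suffix; later runs get '-01', '-02', ...
    suffix = '' if run_num == 0 else repetition_format.format(rep=run_num)
    return file_name.format(config=config, f=fid, D=ndim, rep=suffix)


print(run_file_name('00000000000', 5, 1, iid=0, rep=0))
# 00000000000/5d-f1/data_f1/bbobexp_f1_DIM5.dat
print(run_file_name('00000101010', 20, 15, iid=1, rep=2))
# 00000101010/20d-f15/data_f15/bbobexp-07_f15_DIM20.dat
```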

In [7]:
# defining a progress bar (https://github.com/alexanderkuk/log-progress)
def log_progress(sequence, every=None, size=None, name='Items'):
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)     # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )

**WARNING**

The following block of code does the heavy lifting. It is not parallelized and easily takes 30-60 seconds for every 1,000 records it creates. Creating the ~900,000 records in the first run of this code took 7 hours and 40 minutes.

_You have been warned..._
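
As a rough sanity check on these numbers, the expected runtime can be estimated from the per-record rate. The rate and the case counts come from this notebook; `estimated_hours` is a hypothetical helper.

```python
def estimated_hours(num_records, seconds_per_thousand):
    """Rough sequential runtime estimate in hours."""
    return num_records / 1000 * seconds_per_thousand / 3600


# configurations x dimensionalities x functions x instances x repetitions
num_cases = 4608 * 2 * 4 * 5 * 5
low = estimated_hours(num_cases, 30)
high = estimated_hours(num_cases, 60)

print('{} cases: {:.2f} to {:.2f} hours'.format(num_cases, low, high))
# 921600 cases: 7.68 to 15.36 hours
```

The lower estimate of 7.68 hours is roughly 7 hours and 41 minutes, in line with the runtime reported above.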

In [8]:
# %%notify
# https://github.com/shoprunner/jupyter-notify


def createsummarycsv(ndim, fid, cases):
    all_records = []
    for configuration, iid, rep in log_progress(cases, every=100, name='{}D F{}'.format(ndim, fid)):
        try:
            record = createSteepnessRecord(configuration, ndim, fid, iid, rep)
            all_records.append(record)
        except OSError:
            # Skip runs whose log file is missing or unreadable
            # (FileNotFoundError is a subclass of OSError)
            pass

    df = pd.DataFrame.from_records(all_records, columns=record_labels)
    df.to_csv('steepness_data_{}D-f{}.csv'.format(ndim, fid))


from IPython.lib import backgroundjobs as bg
os.chdir(data_location)

cases = list(product(all_configurations, instances, list(range(num_repetitions))))
num_cases = len(all_configurations)*len(ndims)*len(fids)*len(instances)*num_repetitions
print('Found {} cases to process. This may take a while...'.format(num_cases))

jobs = bg.BackgroundJobManager()

for ndim, fid in product(ndims, fids):
    jobs.new(createsummarycsv, ndim, fid, cases)

Found 921600 cases to process. This may take a while...
Starting job # 0 in a separate thread.
Starting job # 2 in a separate thread.
Starting job # 3 in a separate thread.
Starting job # 4 in a separate thread.
Starting job # 5 in a separate thread.


Starting job # 6 in a separate thread.
Starting job # 7 in a separate thread.
Starting job # 8 in a separate thread.


Once all background jobs have finished, we're done. We now have pre-processed CSV files to work with, which we will do in a separate Notebook for clarity's sake.

Of course, if the data in the CSVs has to be changed, this script has to be run again.
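
For reference, reading one of the resulting files back in could look like the sketch below. The file name pattern and the `Representation` column come from the code above; `load_summary` and its `directory` parameter are hypothetical. It assumes the representations were written as the text form of a Python list, which is what `to_csv` produces for list-valued cells.

```python
import ast
import os

import pandas as pd


def load_summary(ndim, fid, directory='.'):
    """Hypothetical loader for one 'steepness_data_{ndim}D-f{fid}.csv' file."""
    path = os.path.join(directory, 'steepness_data_{}D-f{}.csv'.format(ndim, fid))
    # to_csv also wrote the DataFrame index as the first, unnamed column
    df = pd.read_csv(path, index_col=0)
    # The representation lists were stored as text such as '[0, 0, 1]'; parse them back
    df['Representation'] = df['Representation'].apply(ast.literal_eval)
    return df
```

A call such as `load_summary(5, 1, directory=data_location)` would then return the records for the 5D f1 runs as a DataFrame.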