# Semantic Debugging

Given the many executions we can generate, it is only natural to apply _machine learning_ to these executions in order to learn which features of the input (or the execution) are associated with failures.

In this chapter, we study the _Alhazen_ approach, one of the first of this kind.
Alhazen by Kampmann et al. [[KHSZ20](https://publications.cispa.saarland/3107/7/fse2020-alhazen.pdf)] automatically learns the associations between the failure of a program and _features of the input data_, such as "The error occurs whenever the `<expr>` element is negative".


This chapter is based on an Alhazen implementation and exercise contributed by [Martin Eberlein](https://martineberlein.github.io), TU Berlin. Thanks a lot, Martin!

In [None]:
# from bookutils import YouTubeVideo
# YouTubeVideo("w4u5gCgPlmg")

**Prerequisites**

* This chapter extends the ideas from [the chapter on Generalizing Failure Circumstances](DDSetDebugger.ipynb).

In [None]:
import bookutils.setup

In [None]:
from typing import List, Tuple, Dict, Any

## Synopsis

<!-- Automatically generated. Do not edit. -->



_For those only interested in using the code in this chapter (without wanting to know how it works), give an example.  This will be copied to the beginning of the chapter (before the first section) as text with rendered input and output._

You can use `int_fuzzer()` as:

```python
print(int_fuzzer())
```
```python
=> 76.5

```


## Alhazen in a Nutshell

When diagnosing why a program fails, the first step is to determine the circumstances under which the program failed. Kampmann et al. [[KHSZ20](https://publications.cispa.saarland/3107/7/fse2020-alhazen.pdf)] presented an approach to automatically discover the circumstances of program behavior.
Their approach associates the program’s failure with the syntactical features of the input data, allowing them to learn and extract the properties that result in the specific behavior.

Their reference implementation _Alhazen_ can generate a diagnosis and explain why, for instance, a particular bug occurs.
More formally, Alhazen forms a hypothetical model based on the observed inputs.
Additional test inputs are generated and executed to refine or refute the hypothesis, eventually obtaining a prediction model of the circumstances of why the behavior in question takes place.
Alhazen uses a _Decision Tree classifier_ to learn the association between the program behavior and the input features.
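
The hypothesize-and-refine loop described above can be illustrated with a deliberately tiny sketch. It is not Alhazen's implementation: the program under test (`fails`), the single numeric feature, and the one-threshold "model" are all assumptions made for illustration. They stand in for a real program, grammar-derived features, and a decision tree.

```python
import random
from typing import Callable, List, Tuple

Observation = Tuple[float, bool]  # (input value, did the program fail?)


def fails(x: float) -> bool:
    """Hypothetical program under test: it fails for negative inputs."""
    return x < 0


def boundary(data: List[Observation]) -> Tuple[float, float]:
    """Returns the largest failing and the smallest passing value seen so far."""
    largest_failing = max(x for x, failed in data if failed)
    smallest_passing = min(x for x, failed in data if not failed)
    return largest_failing, smallest_passing


def learn_threshold(data: List[Observation]) -> float:
    """Hypothesis: the program fails iff x < t, with t halfway across the boundary."""
    low, high = boundary(data)
    return (low + high) / 2


def refine(initial_inputs: List[float],
           program: Callable[[float], bool],
           iterations: int = 10,
           seed: int = 0) -> float:
    """Assumes `initial_inputs` holds at least one failing and one passing input."""
    rng = random.Random(seed)
    data = [(x, program(x)) for x in initial_inputs]
    threshold = learn_threshold(data)
    for _ in range(iterations):
        # Generate new inputs around the hypothesized boundary ...
        low, high = boundary(data)
        new_inputs = [threshold] + [rng.uniform(low, high) for _ in range(5)]
        # ... execute them, and use the outcomes to refine the hypothesis
        data += [(x, program(x)) for x in new_inputs]
        threshold = learn_threshold(data)
    return threshold


threshold = refine([-900.0, 24.0], fails)
print(f"Hypothesis: the program fails if x < {threshold:.3f}")
```

Each round tests the current hypothesis at its own decision boundary, so the learned threshold moves closer to the true failure condition (`x < 0`) with every iteration.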

![title](PICS/Alhazen.png)

The tool is named after [Ḥasan Ibn al-Haytham](https://en.wikipedia.org/wiki/Ibn_al-Haytham) (latinized name: Alhazen).
Often referred to as the "Father of modern optics", Ibn al-Haytham made significant contributions to the principles of optics and visual perception.
Most notably, he was an early proponent of the concept that a hypothesis must be supported by experiments, and thus
one of the inventors of the _scientific method_, the key process in the Alhazen tool.

## Motivation

In [None]:
from fuzzingbook.Grammars import Grammar

In [None]:
CALC_GRAMMAR: Grammar = {
    "<start>":
        ["<function>(<term>)"],

    "<function>":
        ["sqrt", "tan", "cos", "sin"],

    "<term>": ["-<value>", "<value>"],

    "<value>":
        ["<integer>.<integer>",
         "<integer>"],

    "<integer>":
        ["<digit><integer>", "<digit>"],

    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
START_SYMBOL = "<start>"

## Features

In this section, we are concerned with the problem of extracting semantic features from inputs. In particular, Alhazen defines various features based on the input grammar, such as *existence* and *numeric interpretation*. These features are then extracted from the parse trees of the inputs (see [Section 3 of the paper](https://publications.cispa.saarland/3107/7/fse2020-alhazen.pdf) for more details).

The implementation of the feature extraction module consists of the following three tasks:
1. Implementation of individual feature classes, whose instances allow us to derive specific feature values from inputs
2. Extraction of features from the grammar through instantiation of the aforementioned feature classes
3. Computation of feature vectors from a set of inputs, which will then be used as input for the decision tree
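
The three tasks above can be sketched in miniature without any library support. The following is a simplified illustration only, not the implementation developed in this chapter: the toy grammar, the hand-built derivation tree, and the helper names (`exists`, `extract_features`, `tree_text`) are assumptions. Alternatives are matched by simply concatenating the symbols of a node's children.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A derivation tree is a pair (symbol, children); terminals have no children.
Tree = Tuple[str, list]
FeatureFunction = Callable[[Tree], float]


def tree_text(tree: Tree) -> str:
    """Concatenates all terminal symbols of a tree."""
    symbol, children = tree
    if not children:
        return symbol
    return "".join(tree_text(child) for child in children)


def exists(rule: str, expansion: Optional[str] = None) -> FeatureFunction:
    """Task 1: a feature that is 1 if `rule` (expanded as `expansion`, if given)
    occurs anywhere in the tree, and 0 otherwise."""
    def feature(tree: Tree) -> float:
        node, children = tree
        if node == rule:
            if expansion is None or "".join(c[0] for c in children) == expansion:
                return 1.0
        return max((feature(child) for child in children), default=0.0)
    return feature


def extract_features(grammar: Dict[str, List[str]]) -> Dict[str, FeatureFunction]:
    """Task 2: one feature per rule and one per alternative of each rule."""
    features = {}
    for rule, expansions in grammar.items():
        features[f"exists({rule})"] = exists(rule)
        for i, expansion in enumerate(expansions):
            features[f"exists({rule}@{i})"] = exists(rule, expansion)
    return features


TOY_GRAMMAR = {
    "<start>": ["<function>(<term>)"],
    "<function>": ["sqrt", "sin"],
    "<term>": ["-<value>", "<value>"],
    "<value>": ["1", "2"],
}

# Hand-built derivation tree for the input "sqrt(-1)"
SQRT_MINUS_1: Tree = ("<start>", [
    ("<function>", [("sqrt", [])]),
    ("(", []),
    ("<term>", [("-", []), ("<value>", [("1", [])])]),
    (")", []),
])

# Task 3: the feature vector of one input
vector = {name: feature(SQRT_MINUS_1)
          for name, feature in extract_features(TOY_GRAMMAR).items()}
print(tree_text(SQRT_MINUS_1), vector)
```

In this sketch, the vector records that the input uses `sqrt` rather than `sin` and a negated term rather than a plain value. A classifier could later associate exactly this kind of information with failures.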

### Feature Classes

In [None]:
from abc import ABC, abstractmethod

class Feature(ABC):
    '''
    The abstract base class for grammar features.

    Args:
        name : A unique identifier name for this feature. Should not contain whitespace.
               e.g., 'type(<feature>@1)'
        rule : The production rule (e.g., '<function>' or '<value>').
        key  : The feature key (e.g., the chosen alternative or rule itself).
    '''

    def __init__(self, name: str, rule: str, key: str) -> None:
        self.name = name
        self.rule = rule
        self.key = key
        super().__init__()

    def __repr__(self) -> str:
        '''Returns a printable string representation of the feature.'''
        return self.name_rep()

    @abstractmethod
    def name_rep(self) -> str:
        pass

    @abstractmethod
    def get_feature_value(self, derivation_tree) -> float:
        '''Returns the feature value for a given derivation tree of an input.'''
        pass

In [None]:
from fuzzingbook.GrammarFuzzer import expansion_to_children, DerivationTree

class ExistenceFeature(Feature):
    '''
    This class represents existence features of a grammar. Existence features indicate
    whether a particular production rule was used in the derivation sequence of an input.
    For a given production rule P -> A | B, a production existence feature for P and
    alternative existence features for each alternative (i.e., A and B) are defined.

    name : A unique identifier name for this feature. Should not contain whitespace.
           e.g., 'exist(<digit>@1)'
    rule : The production rule.
    key  : The feature key, equal to the rule attribute for production features,
           or equal to the corresponding alternative for alternative features.
    '''
    def __init__(self, name: str, rule: str, key: str) -> None:
        super().__init__(name, rule, key)

    def name_rep(self) -> str:
        if self.rule == self.key:
            return f"exists({self.rule})"
        else:
            return f"exists({self.rule} == {self.key})"

    def get_feature_value(self, derivation_tree: DerivationTree) -> float:
        '''Returns 1 if this feature is matched anywhere in the derivation tree, 0 otherwise.'''
        (node, children) = derivation_tree

        # The local match (1 if the feature is matched for the current node, 0 if not)
        count = 0

        # First check if the current node can be matched with the rule
        if node == self.rule:

            # Production existence feature
            if self.rule == self.key:
                count = 1

            # Production alternative existence feature
            # We compare the children of the expansion with the actual children
            else:
                expansion_children = [child[0] for child in expansion_to_children(self.key)]
                node_children = [child[0] for child in children]
                if expansion_children == node_children:
                    count = 1

        # Recursively check all children; the feature exists if it is matched anywhere in the tree
        for child in children:
            count = max(count, self.get_feature_value(child))

        return count

In [None]:
from fuzzingbook.GrammarFuzzer import tree_to_string
from numpy import nanmax, isnan

class NumericInterpretation(Feature):
    '''
    This class represents numeric interpretation features of a grammar. These features
    are defined for productions that only derive words composed of the characters
    [0-9], '.', and '-'. The returned feature value corresponds to the maximum
    floating-point number interpretation of the derived words of a production.

    name : A unique identifier name for this feature. Should not contain whitespace.
           e.g., 'num(<integer>)'
    rule : The production rule.
    '''
    def __init__(self, name: str, rule: str) -> None:
        super().__init__(name, rule, rule)

    def name_rep(self) -> str:
        return f"num({self.key})"

    def get_feature_value(self, derivation_tree: DerivationTree) -> float:
        '''Determines the maximum float of this feature in the derivation tree.'''
        (node, children) = derivation_tree

        value = float('nan')
        if node == self.rule:
            try:
                value = float(tree_to_string(derivation_tree))
            except ValueError:
                # The derived string is not a valid number; keep NaN
                pass

        # Return maximum value encountered in tree, ignoring all NaNs
        tree_values = [value] + [self.get_feature_value(c) for c in children]
        if all(isnan(tree_values)):
            return value
        else:
            return nanmax(tree_values)
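
To see what "maximum value, ignoring all NaNs" means in practice, here is a small self-contained sketch. It mirrors the idea of `NumericInterpretation`, but it uses plain tuples and `math.isnan()` instead of `fuzzingbook` and `numpy`. The helper names and the hand-built tree for `-3.14` are assumptions made for illustration.

```python
import math
from typing import Tuple

# A derivation tree is a pair (symbol, children); terminals have no children.
Tree = Tuple[str, list]


def tree_text(tree: Tree) -> str:
    """Concatenates all terminal symbols of a tree."""
    symbol, children = tree
    if not children:
        return symbol
    return "".join(tree_text(child) for child in children)


def numeric_value(tree: Tree, rule: str) -> float:
    """Maximum numeric interpretation of `rule` in `tree`; NaN if there is none."""
    node, children = tree
    values = [numeric_value(child, rule) for child in children]
    if node == rule:
        try:
            values.append(float(tree_text(tree)))
        except ValueError:
            pass  # the derived string is not a number
    numbers = [v for v in values if not math.isnan(v)]
    return max(numbers) if numbers else float("nan")


# Hand-built derivation tree for the term "-3.14"
MINUS_3_14: Tree = ("<term>", [
    ("-", []),
    ("<value>", [
        ("<integer>", [("<digit>", [("3", [])])]),
        (".", []),
        ("<integer>", [
            ("<digit>", [("1", [])]),
            ("<integer>", [("<digit>", [("4", [])])]),
        ]),
    ]),
])

for rule in ["<term>", "<value>", "<integer>", "<digit>", "<function>"]:
    print(f"num({rule}) = {numeric_value(MINUS_3_14, rule)}")
```

Note that `<integer>` occurs three times in this tree (as `3`, `14`, and the nested `4`); the feature reports only the largest of these interpretations. A rule that does not occur at all, such as `<function>`, yields NaN.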

### Extracting Feature Sets from Grammars

In [None]:
def extract_existence(grammar: Grammar) -> List[ExistenceFeature]:
    '''
        Extracts all existence features from the grammar and returns them as a list.
        grammar : The input grammar.
    '''

    features = []

    for rule in grammar:
        # add the rule
        features.append(ExistenceFeature(f"exists({rule})", rule, rule))
        # add all alternatives
        for count, expansion in enumerate(grammar[rule]):
            features.append(ExistenceFeature(f"exists({rule}@{count})", rule, expansion))

    return features

In [None]:
from fuzzingbook.Grammars import reachable_nonterminals
from collections import defaultdict
import re

# Regex for non-terminal symbols in expansions
RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')

def extract_numeric(grammar: Grammar) -> List[NumericInterpretation]:
    '''
        Extracts all numeric interpretation features from the grammar and returns them as a list.

        grammar : The input grammar.
    '''

    features = []

    # Mapping from non-terminals to derivable terminal chars
    derivable_chars = defaultdict(set)

    for rule in grammar:
        for expansion in grammar[rule]:
            # Remove non-terminal symbols and whitespace from expansion
            terminals = re.sub(RE_NONTERMINAL, '', expansion).replace(' ', '')

            # Add each terminal char to the set of derivable chars
            for c in terminals:
                derivable_chars[rule].add(c)

    # Repeatedly update the mapping until convergence
    while True:
        updated = False
        for rule in grammar:
            for r in reachable_nonterminals(grammar, rule):
                before = len(derivable_chars[rule])
                derivable_chars[rule].update(derivable_chars[r])
                after = len(derivable_chars[rule])

                # Set of derivable chars was updated
                if after > before:
                    updated = True

        if not updated:
            break

    numeric_chars = set(['0','1','2','3','4','5','6','7','8','9','.','-'])

    for key in derivable_chars:
        # Check if derivable chars contain only numeric chars
        if len(derivable_chars[key] - numeric_chars) == 0:
            features.append(NumericInterpretation(f"num({key})", key))

    return features

In [None]:
def get_all_features(grammar: Grammar) -> List[Feature]:
    return extract_existence(grammar) + extract_numeric(grammar)

In [None]:
get_all_features(CALC_GRAMMAR)
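
The numeric features are found through a fixed-point computation: the set of characters each nonterminal can derive is grown until nothing changes anymore. The following self-contained sketch shows this idea on a reduced grammar. It is a simplified variant, not the code above: it propagates character sets along direct references instead of calling `reachable_nonterminals()`, and the names `derivable_chars()` and `numeric_rules()` are assumptions.

```python
import re
from typing import Dict, List, Set

NONTERMINAL = re.compile(r"<[^<> ]*>")
NUMERIC_CHARS = set("0123456789.-")


def derivable_chars(grammar: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Maps each nonterminal to the set of terminal characters it can derive."""
    chars: Dict[str, Set[str]] = {rule: set() for rule in grammar}
    changed = True
    while changed:  # repeat until a fixed point is reached
        changed = False
        for rule, expansions in grammar.items():
            for expansion in expansions:
                # Terminal characters that occur directly in the expansion
                new = set(NONTERMINAL.sub("", expansion)) - {" "}
                # Characters derivable from the referenced nonterminals
                for nonterminal in NONTERMINAL.findall(expansion):
                    new |= chars[nonterminal]
                if not new <= chars[rule]:
                    chars[rule] |= new
                    changed = True
    return chars


def numeric_rules(grammar: Dict[str, List[str]]) -> List[str]:
    """Nonterminals that derive only digits, '.', and '-'."""
    return [rule for rule, derived in derivable_chars(grammar).items()
            if derived <= NUMERIC_CHARS]


TOY_GRAMMAR = {
    "<start>": ["<function>(<term>)"],
    "<function>": ["sqrt", "sin"],
    "<term>": ["-<value>", "<value>"],
    "<value>": ["<digit>.<digit>", "<digit>"],
    "<digit>": ["0", "1"],
}

print(numeric_rules(TOY_GRAMMAR))
```

Here, `<start>` and `<function>` are excluded because they can derive parentheses and letters; only the remaining rules qualify for a numeric interpretation.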

### Extracting Feature Values from Inputs

**Note**: This is a rather slow implementation. For grammars with many syntactic features, the feature collection can be optimized.

In [None]:
from fuzzingbook.Parser import EarleyParser
from fuzzingbook.Grammars import Grammar
import pandas
from pandas import DataFrame

def collect_features(sample_list: List[str],
                     grammar: Grammar) -> DataFrame:

    data = []

    # parse grammar and extract features
    all_features = get_all_features(grammar)

    # the parser only depends on the grammar, so we create it once
    earley = EarleyParser(grammar)

    # iterate over all samples
    for sample in sample_list:
        parsed_features = {}
        parsed_features["sample"] = sample
        # initialize dictionary
        for feature in all_features:
            parsed_features[feature.name] = 0

        # Obtain the parse tree(s) for each input; if an input has
        # multiple parse trees, the values of the last tree are kept
        for tree in earley.parse(sample):

            for feature in all_features:
                parsed_features[feature.name] = feature.get_feature_value(tree)

        data.append(parsed_features)

    return pandas.DataFrame.from_records(data)

In [None]:
sample_list = ["sqrt(-900)", "sin(24)", "cos(-3.14)"]
collect_features(sample_list, CALC_GRAMMAR)

In [None]:
# TODO: handle multiple trees
from fuzzingbook.Parser import EarleyParser

def compute_feature_values(sample: str, grammar: Grammar, features: List[Feature]) -> Dict[str, float]:
    '''
        Extracts all feature values from an input.

        sample   : The input.
        grammar  : The input grammar.
        features : The list of input features extracted from the grammar.

    '''
    # Use the grammar and features passed as arguments
    earley = EarleyParser(grammar)

    feature_values = {}
    for tree in earley.parse(sample):
        for feature in features:
            feature_values[feature.name_rep()] = feature.get_feature_value(tree)
    return feature_values

In [None]:
all_features = get_all_features(CALC_GRAMMAR)
for sample in sample_list:
    print(f"Features of {sample}:")
    features = compute_feature_values(sample, CALC_GRAMMAR, all_features)
    for feature, value in features.items():
        print(f"    {feature}: {value}")

## The Alhazen Implementation

In [None]:
from typing import List
import pandas

GENERATOR_TIMEOUT = 10 # timeout in seconds

class Alhazen:
    def __init__(self, initial_inputs: List[str],
                 grammar: Grammar,
                 max_iter: int = 10,
                 generator_timeout: int = GENERATOR_TIMEOUT):

        self._initial_inputs = initial_inputs
        self._grammar = grammar
        self._max_iter = max_iter
        self._previous_samples = None
        self._data = None
        self._trees = []
        self._generator_timeout = generator_timeout
        self._setup()

    def _setup(self):
        self._previous_samples = self._initial_inputs

        self._all_features = extract_existence(self._grammar) + extract_numeric(self._grammar)
        self._feature_names = [f.name for f in self._all_features]

    def run(self):
        raise NotImplementedError()

    def _add_new_data(self, exec_data, feature_data):
        joined_data = exec_data.join(feature_data.drop(['sample'], axis=1))

        # Only add valid data, i.e., drop all samples whose oracle is undefined
        new_data = joined_data.drop(joined_data[joined_data.oracle.astype(str) == "UNDEF"].index)
        if 0 != len(new_data):
            if self._data is None:
                self._data = new_data
            else:
                self._data = pandas.concat([self._data, new_data], sort=False)

    def _finalize(self):
        return self._trees

In [None]:
class Alhazen(Alhazen):
    def run(self):
        for iteration in range(self._max_iter):
            print(f"Starting Iteration: {iteration}")
            self._loop(self._previous_samples)

        return self._finalize()

    def _loop(self, sample_list):
        # obtain labels, execute samples (Initial Step, Activity 5)
        exec_data = execute_samples(sample_list)

        # collect features from the new samples (Activity 1)
        feature_data = collect_features(sample_list, self._grammar)

        # combine the new data with the already existing data
        self._add_new_data(exec_data, feature_data)

        # train a tree (Activity 2)
        dec_tree = train_tree(self._data)
        self._trees.append(dec_tree)

        # extract new requirements from the tree (Activity 3)
        new_input_specifications = get_all_input_specifications(dec_tree,
                                                self._all_features,
                                                self._feature_names,
                                                self._data.drop(['oracle'], axis=1))

        # generate new inputs according to the new input specifications
        # (Activity 4)
        new_samples = generate_samples(self._grammar, new_input_specifications, self._generator_timeout)
        self._previous_samples = new_samples
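
The first step of the loop labels each sample by executing it. As a rough idea of what such a labeling step might look like, here is a hypothetical, self-contained sketch for the calculator inputs. It is an assumption for illustration, not the `execute_samples()` helper used above: the names `Outcome`, `label_sample()`, and `label_samples()` are invented, and treating a `ValueError` as "the bug" is merely one possible oracle.

```python
import math
import re
from enum import Enum
from typing import Dict, List


class Outcome(Enum):
    """Hypothetical labels, loosely mirroring the oracle values used above."""
    BUG = "BUG"
    NO_BUG = "NO_BUG"
    UNDEF = "UNDEF"


# Inputs of the form <function>(<term>), as in the calculator grammar
CALL = re.compile(r"(sqrt|tan|cos|sin)\((-?\d+(?:\.\d+)?)\)")


def label_sample(sample: str) -> Outcome:
    """Runs a toy calculator on `sample` and classifies the outcome."""
    match = CALL.fullmatch(sample)
    if match is None:
        return Outcome.UNDEF  # not a valid input: no verdict
    function, argument = match.group(1), float(match.group(2))
    try:
        getattr(math, function)(argument)
    except ValueError:
        return Outcome.BUG  # e.g., math.sqrt() of a negative number
    return Outcome.NO_BUG


def label_samples(samples: List[str]) -> List[Dict[str, object]]:
    """Returns one record per sample, with the sample and its label."""
    return [{"sample": s, "oracle": label_sample(s)} for s in samples]


for record in label_samples(["sqrt(-900)", "sin(24)", "cos(-3.14)", "foo"]):
    print(record["sample"], record["oracle"].value)
```

Records labeled `UNDEF` carry no information about the failure and would be dropped before training, which is what `_add_new_data()` does with its `"UNDEF"` filter.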

### Excursion: All the Details

This text will only show up on demand (HTML) or not at all (PDF). This is useful for longer implementations, or repetitive, or specialized parts.

### End of Excursion

## _Section 3_

\todo{Add}

_If you want to introduce code, it is helpful to state the most important functions, as in:_

* `random.randrange(start, end)` - return a random integer from the half-open range [`start`, `end`), i.e., excluding `end`
* `range(start, end)` - create a sequence of integers from `start` to `end` - 1.  Typically used in iterations.
* `for elem in list: body` executes `body` in a loop with `elem` taking each value from `list`.
* `for i in range(start, end): body` executes `body` in a loop with `i` from `start` to `end` - 1.
* `chr(n)` - return a character with ASCII code `n`

In [None]:
import random

In [None]:
def int_fuzzer() -> float:
    """A simple function that returns a random float"""
    return random.randrange(1, 100) + 0.5

In [None]:
# More code
pass

## _Section 4_

\todo{Add}

## Synopsis

_For those only interested in using the code in this chapter (without wanting to know how it works), give an example.  This will be copied to the beginning of the chapter (before the first section) as text with rendered input and output._

You can use `int_fuzzer()` as:

In [None]:
print(int_fuzzer())

## Lessons Learned

* _Lesson one_
* _Lesson two_
* _Lesson three_

## Next Steps

_Link to subsequent chapters (notebooks) here, as in:_

* [use _assertions_ to check conditions at runtime](Assertions.ipynb)
* [reduce _failing inputs_ for efficient debugging](DeltaDebugger.ipynb)


## Background

_Cite relevant works in the literature and put them into context, as in:_

The idea of ensuring that each expansion in the grammar is used at least once goes back to Burkhardt \cite{Burkhardt1967}, to be later rediscovered by Paul Purdom \cite{Purdom1972}.

## Exercises

_Close the chapter with a few exercises such that people have things to do.  To make the solutions hidden (to be revealed by the user), have them start with_

```
**Solution.**
```

_Your solution can then extend up to the next title (i.e., any markdown cell starting with `#`)._

_Running `make metadata` will automatically add metadata to the cells such that the cells will be hidden by default, and can be uncovered by the user.  The button will be introduced above the solution._

### Exercise 1: _Title_

_Text of the exercise_

In [None]:
# Some code that is part of the exercise
pass

_Some more text for the exercise_

**Solution.** _Some text for the solution_

In [None]:
# Some code for the solution
2 + 2

_Some more text for the solution_

### Exercise 2: _Title_

_Text of the exercise_

**Solution.** _Solution for the exercise_