
Abstract

Concurrent software bugs appear intermittently due to the non-deterministic nature of thread scheduling. A concurrency bug can be fixed by using known concurrency mechanisms to ensure proper thread scheduling. ARC takes advantage of this fact by evolving a buggy program with known concurrency mutation operators in an attempt to ensure proper program execution. An evolved program is evaluated using ConTest, which repeatedly explores thread interleavings to provide a certain level of confidence that the majority of schedules have been explored. A fitness function evaluates the functional correctness of the evolved program, and the evaluation feedback from ConTest is used to heuristically select the most appropriate mutation operator to apply next. Once an appropriate level of functional fitness is achieved, a second phase of evolution optimizes the non-functional fitness: it minimizes the use of synchronization mechanisms to ensure timely execution of the evolved program while retaining functional correctness.

#Things to consider:

  • If no improvements are made in phase 1 (bug fixing), should the algorithm proceed to phase 2 (optimization) or stop? Is there any point in attempting to optimize a program that doesn't work correctly?
  • Should the user have the option to run phase 1 only, or phase 2 only? Say the program works and they want to only try optimizing it. (Or time is short and they only want to try and bug fix it.)
  • We should update our terminology. We are really using an Evolutionary Programming approach: EP is mutation driven, doesn't use crossover and every parent passes a child into the next generation. (http://en.wikipedia.org/wiki/Evolutionary_programming)

#TODO

  • When mutated code is compiled, indicate whether it succeeded or failed
  • Capture the Java classpath
  • Somehow differentiate directory structure (/generation/member/...) for both phases
    • Save project in pristine state AND bug fixed state AND optimized state

#FUTURE WORK

  • Make testing phase concurrent
  • Add more stopping criteria to the GA (Time, no progress)
  • Increase functionality of the TXL operators
  • Recognize more Java project types (Maven?)
  • Allow phases 1 and 2 to operate independently of each other

#Purpose

The goal of ARC (Automatically Repair Concurrency) is to provide a tool that attempts to repair some concurrency bugs in Java source code. Java was selected due to its mature concurrency libraries, idioms and projects with known concurrency bugs.

ARC operates in two stages. The first attempts to fix concurrency bugs; the second attempts to improve the performance of the concurrent code. A genetic algorithm progressively mutates the buggy program's concurrent code blocks towards a state where the program executes in a timely fashion without errors. Two main criteria are used to determine the fitness of a mutant:

  • Correctness of the code -> No concurrency bugs
  • Performance of the code -> Executes as quickly as possible

#Features

  • Uses tools with known track records: Python, TXL, ConTest, ConMan, Java and Ant
  • Genetic algorithm (GA) runs for a user-defined number of generations
  • User-definable parameters:
    • Population size
    • Mutation rate
    • Mutation operators used
    • Timeout for detecting deadlocks
  • Testing is augmented through a noise-making technique (random thread delays); a short sketch follows below
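
The noise-making idea can be sketched in a few lines. ConTest applies noise automatically through bytecode instrumentation; the maybeDelay helper below is purely illustrative and only demonstrates the principle of perturbing the scheduler with random delays:

import java.util.Random;

// Illustrative sketch: random delays at thread-interaction points widen the set
// of interleavings a test run explores (the principle behind ConTest's noise).
class NoiseSketch {
    private static final Random RANDOM = new Random();

    // Hypothetical helper, called where threads touch shared state.
    static void maybeDelay() {
        if (RANDOM.nextInt(4) == 0) { // delay at roughly one in four call sites
            try {
                Thread.sleep(RANDOM.nextInt(10)); // pause for 0-9 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}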

#Problem Definition

Concurrent Java programs are hard to debug. A small number of interleavings of the concurrent code may cause the problem, making the bug hard to find, verify and fix. This segment of code illustrates a concurrency bug, writing to an unprotected shared variable:

obj.write ( var1 ); 

Note that this bug is specific to concurrent programs; it doesn't occur in programs that use only a single thread. This segment might result in a data race during execution if two or more threads try to write to the variable, since which one writes first is non-deterministic. To prevent this problem, synchronization is used to control access to the variable:

synchronized ( lock ) { 
    obj.write ( var1 ); 
}

This is a simple example of how a concurrency bug is fixed: the addition of synchronization where necessary. One problem with adding synchronization is that it imposes additional overhead through lock acquisition and thread contention: acquiring the lock takes time, and while the lock is held no other thread can execute the locked code. Adding synchronization might also introduce another concurrency bug, deadlock: if threads block each other from proceeding, execution halts. A fine balance between adding synchronization to ensure correctness and maintaining a high level of performance is desirable; this is what ARC attempts to achieve.
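
To make the deadlock risk concrete, here is a hypothetical, self-contained example (not taken from ARC's test programs). Each thread holds one lock while waiting for the other, so neither can proceed:

public class DeadlockExample {
    private static final Object lock1 = new Object();
    private static final Object lock2 = new Object();

    public static void main(String[] args) {
        Thread t1 = new Thread(() -> {
            synchronized (lock1) {         // t1 acquires lock1 first
                pause();                   // give t2 time to acquire lock2
                synchronized (lock2) { }   // then blocks waiting for lock2
            }
        });
        Thread t2 = new Thread(() -> {
            synchronized (lock2) {         // t2 acquires the locks in the opposite order
                pause();
                synchronized (lock1) { }   // and blocks waiting for lock1: deadlock
            }
        });
        t1.start();
        t2.start();
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
    }
}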

The problem is defined as:

Finding the configuration of a concurrent program that results in the highest possible correctness and performance.

Search-Based Software Engineering (SBSE) techniques are often used to tackle problems of this type. [Cite Arcuri and Weimer et al.] Genetic algorithms (GAs) are used to guide the evolution of the concurrent code towards the most correct configuration. In the second phase, the GA attempts to optimize the execution of the concurrent parts of the source code without sacrificing correctness. A genetic algorithm consists of a number of parts:

  1. Genetic representation of the problem
    • A member is a project with mutations applied to it
  2. Mutation operators that act on the member
    • Apply mutation operators to the source code of the project in such a way that it still compiles
  3. Fitness evaluation of the candidate fixes
  4. Terminating condition for the algorithm

###Algorithm Overview

The following poster gives a high-level overview of the problem and the general approach to finding a solution.

###High-Level Overview:

Phase 1: Bug Fixing:

  1. Parse the project's source to find the concurrent code
  2. Create an initial population of proposed solutions and mutate each one once
    • Mutate each project once or mutate each Java file in the project once?
  3. Run each member (solution) a number of times and evaluate it in terms of: successful executions, deadlocks detected and data races detected (and others?)
  4. If one of the members runs without errors, the fix has been found. Proceed to phase 2.
  5. Based on the results of step 3, choose an appropriate mutation operator and apply it to each solution
  6. If search time remains, return to step 3; otherwise stop.

Phase 2: Optimization:

  1. As in phase 1, parse the modified source to find the concurrent code
  2. Select the subset of the operators from phase 1 that is useful for optimization
  3. Create an initial population of proposed solutions and mutate each one once
    • Any mutated project that decreases correctness is discarded
  4. Run each solution a number of times and evaluate its performance
  5. If one of the solutions meets a predefined performance goal, return it and stop
  6. If time remains, return to step 3. Otherwise, return the fixed program with the best performance found so far
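
A minimal sketch of the phase 1 loop follows. The helper names (initialPopulation, runsWithoutErrors, chooseOperatorAndMutate) are illustrative, not ARC's actual API; phase 2 has the same shape, with a performance goal in place of the bug-free check:

import java.util.List;

// Sketch of phase 1 (steps 2-6); all names are placeholders.
abstract class Phase1Sketch {
    String search(String buggyProject, int generations) {
        List<String> population = initialPopulation(buggyProject); // each member mutated once
        for (int gen = 0; gen < generations; gen++) {
            for (String member : population) {
                if (runsWithoutErrors(member)) { // repeated runs under ConTest noise
                    return member;               // fix found: hand off to phase 2
                }
            }
            // Choose an operator based on the evaluation feedback, apply it to each member.
            population = chooseOperatorAndMutate(population);
        }
        return null; // search time exhausted without finding a fix
    }

    abstract List<String> initialPopulation(String project);
    abstract boolean runsWithoutErrors(String member);
    abstract List<String> chooseOperatorAndMutate(List<String> population);
}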

Genetic Representation

A buggy project's concurrent source code is the target of mutations. Before any fixing can begin, all of the concurrent code must be identified. In the example below, lines of Java code are referred to as line_1, line_2, and so on. Concurrent lines are identified by the presence of a Mutation_OP label:

Line_1 has concurrent code in it because Mutation_OP_2 can operate on it. As mutation operator 2 can also operate on lines 5 and 7, we differentiate each instance of operator 2 by calling them spots A, B and C. Line 2 has no concurrent code.

line_1     Mutation_OP_2_spot_A
line_2
line_3     Mutation_OP_1_spot_A
line_4
line_5     Mutation_OP_2_spot_B, Mutation_OP_3_spot_A
line_6     Mutation_OP_3_spot_B 
line_7     Mutation_OP_2_spot_C

There are three mutation operators, so three arrays are created to represent them. Each array contains one bit for every place the corresponding mutation could occur. Bits are ordered as the locations occur (in the order that TXL parses the source code, not necessarily in line-number order). Continuing the above example, the arrays are:

Mutation_1   0             (Mutation 1 occurs once in the code)
Mutation_2 000             (Mutation 2 occurs three times ...)
Mutation_3  00

A flipped bit represents an applied mutation. For this example, if the mutation is applied at the first bit of the second mutation operator, the source code becomes:

line_1     Mutation_OP_2_spot_A_APPLIED
line_2
line_3     Mutation_OP_1_spot_A
line_4
line_5     Mutation_OP_2_spot_B, Mutation_OP_3_spot_A
line_6     Mutation_OP_3_spot_B 
line_7     Mutation_OP_2_spot_C

And the representation becomes:

Mutation_1   0
Mutation_2 100             (Mutation 2 applied to spot A)
Mutation_3  00

This genetic representation allows for a simple encoding of the state of a mutant. Each mutation alters the source, and may create new mutation locations or remove existing ones. The mutant source must therefore be re-parsed after every mutation to find the new set of possible mutations and their locations.

Continuing our previous example, the re-parse identifies the following mutation locations:

line_1     Mutation_OP_2_spot_A_APPLIED
line_2
line_3     Mutation_OP_1_spot_A
line_4
line_5     Mutation_OP_3_spot_A
line_6     Mutation_OP_3_spot_B 
line_7     Mutation_OP_2_spot_B, Mutation_OP_1_spot_B

Notice how Mutation_OP_2_spot_B has disappeared from line 5: even though the mutation was applied on line 1, whole blocks of code can be affected. The ordering of the second mutation operator's spots has also changed: as 2_B has disappeared, 2_C on line 7 becomes 2_B. A new location for mutation operator 1 has appeared on line 7 (1_B). This list is recomputed each generation.
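
A minimal sketch of this representation, using illustrative names (the Map-based layout is an assumption, not ARC's internal data structure):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GeneticRepresentation {
    // One row of bits per mutation operator; one bit per spot, in TXL parse order.
    private final Map<String, List<Boolean>> bits = new LinkedHashMap<>();

    // Called after each (re-)parse with the number of spots found for an operator.
    void setOperator(String operator, int spots) {
        List<Boolean> row = new ArrayList<>();
        for (int i = 0; i < spots; i++) {
            row.add(false); // 0 = mutation not applied at this spot
        }
        bits.put(operator, row);
    }

    // Flipping a bit records that the operator was applied at that spot.
    void apply(String operator, int spot) {
        bits.get(operator).set(spot, true);
    }
}

For the running example, setOperator("Mutation_1", 1), setOperator("Mutation_2", 3) and setOperator("Mutation_3", 2) build the three arrays, and apply("Mutation_2", 0) flips the bit for spot A. After the re-parse, setOperator is called again with the new spot counts, rebuilding the rows.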

Mutation Operators

A mutation operator is a function that performs one mutation on a source file.

TODO: Operator list

#Functional Mutations

Three mutation operators fix data races and deadlocks by adding synchronization to variables, methods and existing synchronization blocks.

TODO: More here.

#Non-Functional Mutations

Four mutation operators improve performance by removing synchronization from variables, methods and existing synchronization blocks, and by removing any one synchronization call.

TODO: More here.

NOTE: The before and after effects of a mutation look like:

#Original:

  synchronized (lock1) { 
    ...
  }

#Mutated:

  /* MUTANT : "ASAS (Added Sync Around Sync)" */
  synchronized (this) {
    synchronized (lock1) { 
      ...
    }
  }
  /* MUTANT : "ASAS (Added Sync Around Sync)" */

Functional Mutations

  • Add synchronization around a variable (ASAV):

An unprotected shared variable (resource) causes a data race. Synchronizing access to it may fix the data race:

obj.write ( var1 ); 
/* MUTANT : "ASAV (Added Sync Around Variable)" */
synchronized ( lock ) { 
    obj.write ( var1 ); 
}
/* MUTANT : "ASAV (Added Sync Around Variable)" */
  • Expand synchronization regions to include unprotected source code (EXSB and EXSA)

Data races can also occur when a synchronization region does not fully encapsulate access to the shared resources. Expanding the synchronization region may fix the data race:

synchronized ( lock ) { 
    obj.write ( var1 ); 
}
obj.write ( var2 );
/* MUTANT : "EXSA (EXpand Sync region After)" */
synchronized ( lock ) { 
    obj.write ( var1 ); 
    obj.write ( var2 );
}
/* MUTANT : "EXSA (EXpand Sync region After)" */
  • Change Sync Order (CSO)

A common cause of deadlock is the order of lock acquisition. Interchanging nested lock objects can fix such deadlocks:

synchronized ( lock1 ) { 
    synchronized ( lock2 ) { 
        obj.write ( var1 ); 
    }  
}
/* MUTANT : "CSO (Change Sync Order)" */
synchronized ( lock2 ) { 
    synchronized ( lock1 ) { 
        obj.write ( var1 ); 
    }  
}
/* MUTANT : "CSO (Change Sync Order)" */

TODO: Add the rest

Non-Functional Mutations

  • Remove unnecessary synchronization regions

Synchronization regions create overhead due to the time required to acquire and release the lock, plus delays from waiting on the lock. Removing unnecessary synchronization regions improves performance:

synchronized ( lock ) { 
    obj.write ( var1 ); 
}
/* MUTANT : "RSAV (Remove Sync Around Variable)" */
...
obj.write ( var1 ); 
...
/* MUTANT : "RSAV (Remove Sync Around Variable)" */
  • Shrink synchronization region

Reducing the number of statements encapsulated in a synchronization region allows the lock to be released sooner. The less time a thread holds the lock, the less thread contention exists, improving performance:

synchronized ( lock ) { 
    obj.write ( var1 ); 
    obj.write ( var2 );
}
/* MUTANT : "SHSA (SHrink Sync region After)" */
synchronized ( lock ) { 
    obj.write ( var1 ); 
}
obj.write ( var2 );
/* MUTANT : "SHSA (SHrink Sync region After)" */

TODO: Add the rest

Fitness Function

To evolve a fix for concurrency bugs, the proposed fixes must be evaluated and ranked. In the GA, a fitness function ranks each proposed solution after the mutation step. Separate fitness functions are used for the fixing (functional) and optimization (non-functional) phases.

Functional

This fitness function uses correct executions as the primary ranking mechanism:

[functional\ fitness(P) = \sum\limits_{i=1}^{n}\frac{interleavings\ without\ a\ bug_i}{interleavings\ tested_i}]\ [n = \#\ of\ test\ cases]

To ensure a high level of confidence, IBM's ConTest tool is used to evaluate each mutated program. ConTest inserts random delays into the code; these delays increase the chances of unusual thread schedules occurring. By exploring more thread schedules it is possible to cover more of the paths that might lead to a bug.
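
A sketch of this computation, assuming per-test-case interleaving counts are collected from the ConTest runs (the array-based interface is illustrative):

class FunctionalFitness {
    // Sum, over all test cases, of the fraction of tested interleavings that ran bug-free.
    static double of(int[] bugFreeInterleavings, int[] testedInterleavings) {
        double fitness = 0.0;
        for (int i = 0; i < bugFreeInterleavings.length; i++) {
            fitness += (double) bugFreeInterleavings[i] / testedInterleavings[i];
        }
        return fitness;
    }
}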

Non-Functional

This fitness function uses speed of execution as the primary ranking mechanism. Note that any mutant that decreases correctness is automatically rejected:

[non-functional\ fitness(P) = \sum\limits_{i=1}^{n}\frac{average\ CPU\ time_i}{interleavings\ tested_i}]\ [n = \#\ of\ test\ cases]

The first phase can lead to unnecessary amounts of synchronization, and performance degrades due to the overhead of lock acquisition and thread contention. This phase attempts to improve performance by shrinking or removing unnecessary synchronization.
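
Under the same illustrative interface, the non-functional score can be sketched as follows (lower is better, since the score is based on CPU time):

class NonFunctionalFitness {
    // Sum, over all test cases, of average CPU time per tested interleaving.
    static double of(double[] averageCpuTime, int[] testedInterleavings) {
        double score = 0.0;
        for (int i = 0; i < averageCpuTime.length; i++) {
            score += averageCpuTime[i] / testedInterleavings[i];
        }
        return score;
    }
}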

Terminating Condition

ARC has two terminating conditions:

  1. A fix is found in phase 1 and optimizations that don't impact correctness are found in phase 2.
  2. No improvement in correctness is found. After a set number of attempts the algorithm quits, indicating that no (or minimal) improvement was made.

#State Space

The state space of this problem is unusual because it changes as the search proceeds. Initially the source code is parsed for the set of possible mutation locations, and these locations are used to encode the program into the bit arrays.

For example, consider a situation with three mutation operators, each having three possible mutation locations.

[possible\ locations\ to\ mutate = \sum\limits_{i=1}^{n} \#\ bits\ in\ array\ i]\ [n = \#\ of\ mutation\ operators]

There are 9 possible locations that can be mutated, assuming none of the locations has been mutated yet. Since each of the mutations can be applied in any combination, we must consider all combinations of the mutation locations:

[possible\ states = 2^{\,possible\ locations\ to\ mutate}]

(It is possible for a mutation to occur on a mutation.) For this example there are 2^9 = 512 possible states, and reaching the fully mutated state takes 9 mutations, as only one mutation is applied at a time. This assumes that the number of mutation locations does not change after a mutation occurs, which is where the unusualness of the state space comes from: when a mutation occurs, the number of mutation locations may decrease, increase, or stay the same.

An unusual situation therefore arises: the state space grows or shrinks on every mutation. There is no way to quantify this difference, as a change in the state space can loop back to a previous state. Additionally, it is difficult to bound the number of mutations that can occur (it is possibly infinite).

[all\ possible\ states = \sum\limits_{i=0}^{n}(possible\ locations\ to\ mutate \times n)^2]\ [n = \#\ of\ mutation\ actions]

#Initial State

The initial state of a program is one without any mutations applied. An initial parse of the program yields all the possible mutations and the initial genetic representation, in which every bit of each mutation operator array is set to 0.

#Action

An action in this algorithm is the application of a mutation to the program's source code. The mutation applies a single mutation operator, resulting in a single change with respect to concurrency. Once the mutation has been applied, the source code must be re-parsed to find the new genetic representation. The action is reflected in the genetic representation by flipping the corresponding bit.

#Goal

The mutants are evaluated using the two defined fitness functions (functional and non-functional). When a state is found that satisfies the functional fitness function (no bugs remain, or no progress has been made for a few generations), it is used as the starting point for finding a state that retains the functional fitness while minimizing the non-functional fitness. The product of this approach is the state that maximizes functional fitness and minimizes non-functional fitness.

#Cost

The cost of a solution is the number of mutations required to reach the goal state from the original state. The cost can be improved: if one mutation reverses another, the pair cancels out. For example, a path of 7 mutations containing one cancelling pair has cost 7 - 2 = 5.

[cost\ of\ path = number\ of\ mutations - (cancelled\ pairs\ of\ mutations \times 2)]

This gives the cheapest path from the initial state to the goal state; a lower-cost path is preferred over a higher-cost one.