Skip to content

Chaffin method

Greg Egan edited this page May 25, 2019 · 12 revisions

The Chaffin method

The Chaffin method (named after Benjamin Chaffin, who devised it in 2014, as described in this blog article by Nathaniel Johnston) is a branch-and-bound algorithm for searching the tree of all digit strings (with digits from 1 to n) for those that have the largest number of distinct permutations as substrings, while only wasting a certain number of digits. When a string is reached that contains all the permutations, it is guaranteed to be an example of the shortest possible superpermutation for n.

By a wasted digit, we mean a digit that, when added to the string, leaves the last n digits as either not a permutation at all, or a permutation that has previously appeared in the string. For example, if we add the digit 3 to the string 123456 to get 1234563, the last 6 digits, 234563, is not a permutation, so the 3 we have added is a wasted digit.

In a Chaffin method search, we want to find the largest number of permutations, p, that we can visit while only wasting w digits. In the examples that follow, we will set n=6, and we will start all our strings with 123456. We need to consider the whole search tree that has 123456 as the root, and then 1234561, 1234562 ... 1234565 as the next nodes, and so on. We never add a digit that is the same as the one before it, as this could never yield a better result than we would obtain by dropping the repeated digit. But the full tree is impossible to search, as it will contain 5^x different strings, where x is the number of digits we add. But we can prune some branches of the search tree by using what we know about the maximum number of permutations that we could visit with a smaller value of w.

For example, suppose we found that for w=1, the most permutations we can visit is p=12. Then if we are doing the search for a larger value of w, at each node in the search tree (each partial string of digits we are looking at) we track how many more digits we can waste before we reach w, and also how many permutations we have visited so far. If we only have 1 more digit we can afford to waste, from our previous result we know that by spending that wasted digit we can only gain 12 more permutations. Suppose that, earlier in the search, we found 18 permutations in one of our strings, and at the node where we only have 1 more digit to waste, we have only visited 6 permutations. Then we know that the best we can do by continuing to search this branch is 6 permutations (those already visited in the string we have so far) plus 12 permutations (the maximum number of permutations that can be visited with only 1 wasted digit), for a total of 18 permutations. But that isn't any better than the current maximum of 18 that we found with the earlier string. So there is nothing to gain by searching this branch, and we can prune it from the search.

This is the basic approach used by the single-computer algorithm implemented in the code ChaffinMethod.c in the repository, along with some further refinements:

  • For each successive value of w, a possible value for the maximum permutation count, p, is estimated before the search begins, and the search proceeds with a pruning strategy that discards any branches that could not reach this target. If the estimate is too high, no strings at all are found, and the search is repeated with a lower estimate, having ruled out the value of p for which it was conducted. If the estimate is not too high, the search will eventually find a string, at which point it will either finish immediately (because the previous iteration ruled out any higher value for p), or, if it is the first iteration, it will need to continue until the entire breadth of the tree has been searched, in order to rule out any higher p. But with each improved value of p achieved when a new string is encountered, the pruning strategy will discard more of the tree, as more branches in the search will fail to offer a chance of obtaining an even higher value of p.
  • For values of w greater than or equal to (n-1)!, an additional pruning strategy is employed, in which the number of unvisited permutations in each cyclic-class (or ''1-cycle'') of the set of permutations is tracked throughout the search. A 1-cycle consists of all the permutations that can be reached by simply cycling the digits, e.g. the 1-cycle containing 123456 consists of the six permutations 123456, 234561, 345612, 456123, 561234 and 612345. Because it is impossible to move from one 1-cycle to another without wasting a digit, an upper bound can be placed on the number of permutations that can be added to a string while wasting the number of digits needed to move between the different 1-cycles and visit the previously unvisited permutations they contain. This bound can be lower than that obtained by only considering the maximum number of permutations that can be reached for a given number of wasted digits, as it takes into account the current distribution of permutations between the 1-cycles, whereas the general maximum p for a given w assumes that all permutations are unvisited.

Distributed Version

The distributed version, implemented in DistributedChaffinMethod.c and the associated server code, begins by giving one program the task of searching from the root node, say 123456. After searching to the full depth for a certain period of time, the program then switches to a splitting mode, in which it delegates any sufficiently deep sub-trees it encounters to separate tasks that other instances of the program can carry out, starting from their own nodes. This process repeats as many times as necessary, until the final tasks can be completed without spawning any new ones.

Each search of the tree for a given value of w and a given initial estimate for p involves a new iteration of this process. The final result for each w might be achieved with a single iteration, or it might require two or more if the estimate for p is initially too high.

At any point during an iteration, there are three kinds of tasks:

  • Assigned tasks: tasks that one of the ''client'' computers in the search is currently performing.
  • Unassigned tasks: tasks that are waiting to be allocated to a client.
  • Pending tasks: tasks that have been split off in the process of another search that is still ongoing. Pending tasks do not become unassigned (that is, they do not become available to be allocated to clients) until the task that spawned them has finished, because if the client abandons the task for some reason and it needs to be re-assigned to a different client, the re-assigned version of the task would potentially spawn duplicates of those pending tasks. This is further complicated by the fact that the precise set of tasks spawned by a search that starts at a given node will depend on the speed of the client computer, as the algorithm decides when to start splitting the task, and how deeply to explore each sub-tree before delegating it, based on elapsed time rather than fixed properties of the search tree. This means that in practice, the only way to avoid duplicate searches is to delay the assignment of spawned tasks until their parent task has finished.

Tasks can vary greatly in the amount of the tree that they search. An objective measure of the amount of computation that has been performed is a count of the nodes of the search tree that have been checked (which for this search are conveniently measured in multiples of 10^9, or ''giganodes'').

You can’t perform that action at this time.