Initial Findings

Mutating Source Code

I’ve been writing computer source code since about 1978, around age 7. I went to college, got my bachelor’s degree in Computer Information Systems in 1992, and have worked as a software engineer ever since.

Over the years, I’ve found that it takes a lot of design, experience, and coordination to keep any codebase from deteriorating as software engineers touch it. I’ve also found that making elegant systems takes a lot of forward-looking strategic effort. When I heard that blind random mutations and natural selection caused elegant interconnected systems to come into being, my instincts told me that wasn’t possible. I decided to write a code “mutator” that prepends, appends, overwrites, and inserts a mix of characters and python keywords to Python programs to see if my instincts were right or wrong.

In addition to the mutator, I also needed to create a selector to choose the acceptable mutations. There are a few levels of selectors. One is just seeing if the code will compile. Another more difficult selector I added was a suite of unit tests that explicitly test certain functionality against expected outputs. The third selector tests the code’s validity and readability/maintainability – a program called pylint.

I found that my instincts were correct. Not only did random mutation and selection of computer code not have creative power, but they were also highly destructive. The only way for these programs to survive was to have strong enough selectors to protect the vital parts of the code. Any portion of the code that selectors didn’t protect was utterly destroyed. I have included some preliminary examples below.

Mathematically this makes complete sense. There are practically infinite ways to arrange the characters that make up a program and only a comparatively tiny subset of those that are valid. An even smaller subset of those would do anything useful. And an even smaller subset would do it elegantly with many parts working together to achieve an end result. Random mutations are so much more likely to find a useless change than a useful change that they end up destroying any pre-existing order. For example, the tiny hello world example below has about 1.08e+70 possible combinations of characters that could make up a program of that length. That’s about 36 different characters to the 45th power, which is the length of the program. The human genome contains about 3.5 billion of 4 possible characters, which is 4 to the power of over 3 billion, which is practically infinite. For comparison, there are only 10 to the 80th power of atoms in the known universe. The numbers here are beyond astronomical, and these mathematical models support that random mutation should exhibit only destructive power at any significant scale.

Note that I’m not trying to simulate any biological processes here. I’m just testing the informational premises underlying mutation and selection and have tried to pick the simplest examples to demonstrate these mechanisms’ true nature. The examples below use a program that visually shows the differences between files - often called a diff program. Programmers use these to see what has changed over time in a program.

There are roughly 35 to 40 different 8-bit characters used in the program examples below. In biology, 4 different base pairs combine in triplets to map to one of 20 amino acids that link together to create proteins. My characters use 8 bits, 256 possible combinations, so I’m using a subset of the total possible characters. The 4-base triplets have 64 possible combinations, which redundantly map to one of 20 amino acids. So the mechanics are similar but not identical in scale between bases/bits and triplets mapped to characters/amino acids to make up programs/proteins.

Hello World - untested

The image above shows the differences between the starting point of the untested hello world program (on the left) and the final state of it after 1000 generations (mutations) (on the right). This program had no unit tests, so all that was tested is that it could be executed and wouldn’t return an error because of syntax errors or other errors.

There were only a few elements that survived. The first was the word def, a python keyword used to define a function. Mutate a letter of that word, and it turns into a syntax error – a fatal mutation. The second item was the first and second pairs of parenthesis. I was surprised to see these survive until I realized that the other would be unbalanced if only one got mutated and cause a syntax error. So pairs of things survive mutation. The colon at the end of the “def” line is the third thing that survived. Thinking about it now, it’s actually paired with the “def” keyword. If either one is destroyed, the other becomes unbalanced, and you get a syntax error. The same pairing concept also applies to the single quotes around the string literal.

The parts that were destroyed were just as interesting. The numbers at the start and end of the program are valid Python, but they do nothing at all. Some of these numbers managed to get a “J” suffix, which indicates that they are imaginary literals. One even has “%” signs embedded in it. This is the modulus operator, which returns the remainder of a division operation. Again, doing this operation on a number on a line by itself is valid but accomplishes nothing. The function name is destroyed since Python doesn’t care what you call your functions. A parameter “EL” was then added to the function that initially didn’t take a parameter. This new parameter is not used within the function, which is valid unless you run pylint against it, which will complain about it. I wasn’t using pylint on this experiment. Next, the “return” was destroyed and replaced with a garbage function call. I was initially surprised by this, but then realized that this wouldn’t cause a compile-time error, but a run-time error only if this hello_world function was actually called by something, which in the untested case, it isn’t. Of course, the text being returned is also destroyed since Python doesn’t care about the content of strings except in some special cases – this string not being one of those cases.

So without tests to check what this program is doing, it gets destroyed.

Hello World – tested

Hello World - tested

This looks much better. The unit test for this hello world program calls the function named hello_world() and then verifies that the text returned is “Hello World!”. Just doing that causes a lot of things to change from the untested version.

There are still random numbers above and below the function, which affect nothing. The name of the function is immutable because if it ever gets mutated, the unit tests will not find the function, and they will therefore fail, and the mutation will be rejected.

No parameter was added to the function because such a parameter isn’t optional. And since the unit test doesn’t pass in a parameter, the unit test would fail, and the mutation would be rejected.

The “return” and the text in quotes remain unchanged because either of those being changed would cause the unit test to fail. There was an “R” inserted before the quoted string, which, in Python, indicates that the text inside the string is to be taken literally and escape characters like backslashes are to be taken literally. Since there are no escape characters inside the string, this mutation did not affect the string being returned and therefore was accepted. There was also a newline character inserted before the closing parenthesis, which has no effect on the code’s operation but makes it less readable. Once I add pylint to the test suite, I believe most of these mutations will become impossible since they all degrade the quality, readability, and maintainability of the code.

Hello World - tested and pylinted

Here we have the pylint-constricted test that looks worse in some ways and better in others. To get a score of 10 out of 10 from pylint, you must have module- and function-level docstrings, which are the strings you see at the top of the file and within the function itself that start and end with triple quotes, which are multi-line strings in Python. These got destroyed, and so that’s the main regression with the pylint test. The unit test to check the phrase returned is still in place.

Notice that the random numbers are gone. They are rejected by pylint as being “pointless statements” and are therefore rejected. The docstrings are destroyed completely as they aren’t held in place by any unit tests.

Notice the top docstring has had its triple quote (“” ”) mutated into U” h”” which works since the prefix “U” means the string is Unicode, then the first quote makes this a normal string, and the next quote ends that string of “h”, then the next quote starts a new string which ends as the first quote of the triple quote, then the two quotes after that are just an empty string. This isn’t very important, just interesting that this paired item (opening and closing triple quotes) could be mutated because they can be split up and still work, unlike single quotes, or parenthesis, which are just single characters and can’t be split.

Pylint also prevented any lines from getting “too long” and held the number of newlines between certain parts of the code to acceptable numbers of newlines.

In summary, pylint prevented the numeric garbage from happening and kept line lengths from growing too far, and that’s about it.

Abiogenesis - untested

I decided to start with an empty file and see what would grow. Since I wasn’t sure what to test, I didn’t add a test to this one. I could have written a test to see if a self-replicating program appeared with many functions that work together to reproduce itself - like what would be necessary for a cell to emerge before it could replicate itself - but my instincts are that any such test would prevent any mutation that didn’t produce something functional right off the bat. I’d end up with an empty file. And really, without replication, there are no tests. No replication means no natural selection, which means no tests.

Again we see here that numeric constants can appear. These constants do nothing and are functionally equivalent to an empty file. Some “J” suffixes crept in, and some “%” modulus operators did as well. I ran another version of the mutator that inserts Python keywords and random characters and had a similar result but with a few “or” and “and” statements thrown into the middle of the random numbers.

So the only thing allowed to grow was something random that didn’t actually do anything productive.

Beak Length - tested

I wanted to test if adaptation could occur by mutating an already existing “knob” like beak length. Assume that shorter beak lengths will survive better than longer beak lengths and that anything over length 9 will be selected out. The unit test lets any beak length 9 or under live and kills off anything larger. In one of the tests, I got a beak length of 3 to live (the 9 was overwritten with a 3). Nothing else interesting happened as the rest of the mutations proved fatal by breaking the unit test or invalidating syntax. Adaptation works.

Multiple interdependent functions - tested

Multiple Interdependent functions - tested

Since unit tests are a form of pairs of things that work interdependently and protect each other from mutations, I decided to write a slightly more complex program with 4 functions in it. The fourth function calling each of the three other functions to get information from them and return the combined result of them. The unit test tested each of the three functions to ensure they returned the correct intermediate result and tested the fourth function’s result, which is redundant.

The results were as expected. No mutations that broke syntax or broke any intermediate or final result were allowed, leaving a tiny target for successful mutations. The familiar random numbers at the beginning and end of the program crept in, and a couple of newline characters placed where they had no effect also crept in.

Body Plans - tested

I recently heard a theory around multiple body plans being inside of a single genome. One possible method for many different body plans to emerge in a relatively short amount of time was for these body plans to be switched on rather than evolving into each other. I decided to see how information in such a situation would behave.

I assumed that some parts of the genome were common between the body plans and other parts were different between the body plans, and so I made a base class with some common information in the form of the word “Animal”. I then allowed 9 body plans to exist and selected one of them to be currently active – number 3 of 9. Then I started mutating the code. The unit tests just made sure that any one of the 9 body plans was returned. If body plan 3 got switched to body plan 7 - as long as body plan 7 was not destroyed by mutation - it would pass the test just fine.

The unit test held both the common “Animal” text and the currently selected body plan (number 3) in place. The rest of the body plans’ information was destroyed since those body plans weren’t interacting with the unit tests, and so mutations were allowed to destroy them. The selected body plan was never allowed to switch (i.e., from 3 to 7, for example) since switching to one of the deselected body plans would return the garbage data that had been mutated in that deselected body plan.

Notice that the if (or elif, short for “else if”) statements for body plans before number 3 are protected against mutation since the interpreter goes through them in order, and syntax errors there would be rejected. Notice, however, that the elif statements below body plan 3 were destroyed. This was allowed because once body plan 3 matched, it stopped evaluating the rest of the if statements, so they could contain runtime errors without breaking anything. This is similar to the mechanism that allows a “0 and …” to grow very long on a single line during mutation because after you evaluate “0 and” you don’t have to evaluate the rest of the line since you already know 0 AND anything is going to return false. This is called short circuit evaluation, and it prevents the parts of the code that are not evaluated to be mutated. The double equal signs and the numbers, elif’s, and colons were protected by their pairing and probably by the small target that single digits are – comparing the size of a single digit to the overall size of the code.

The exception raised on an invalid body plan also got mutated to the point of dysfunction except for the “raise” keyword and the usual suspects of paired keywords/symbols like parenthesis, quotes, etc. This experiment also suffered from the commonly seen garbage numbers at the top and bottom of the file.

Below is the code coverage for a body plans mutation run. Notice the lines that got coverage in the unit tests (green) are the sane lines, and the lines without unit test coverage (red) are the insane lines:

Body Plans - test coverage

English – tested

English - tested

Next up is the mutation of an English sentence. The starting point for our sentence is from Shakespeare. The ending point is not. The unit test for this one just checks to see if the mutation caused a spelling error or not. If it did, the mutation is rejected. If the words are all still valid, it accepts it. “The” was turned into “:De” – yeah, de is in the dictionary I found online, and punctuation is ignored in the test, so some words might gain a colon or other punctuation marks. “he” was changed into “MD”. “is” was turned into “iS” which is a clever coincidence to change the case of a letter. “Wise” was changed to “wide”. “The” was appended to be “theM”. The second “wise” in the sentence was turned into “wiRe”, which breaks the connection the two “wises” had in the beginning. “Man” was turned into “wan”. “to” became “tAp”. “be” became “EeL”. “fool” gained a colon as well.

The meaning of the original sentence has been lost in just 1000 generations. 91 mutations were accepted. Several mutations were spent making the garbage numbers before and after the quote line. I plan to add some grammar checking to see if we can keep a valid sentence structure intact. If we can, I’m not sure how I’d evaluate whether the sentence was “better” than the original or had gained some form of useful information above the original.

It’s interesting to note that the modifications being made in these examples modify the medium that information is flowing through between a human programmer and a computing device or between one mind and another mind. Normally, to alter ideas or create new ideas, that’s not done using the medium that the ideas are traveling through. It’s done in someone’s mind with a goal and purpose, which is then encoded into the medium used to deliver the idea. Trying to get new ideas by tampering with the medium itself is somewhat backward.

Next steps

I have several ideas about new tests I’d like to run:

I’m in the process of creating self-replicating code that will procreate and compete with other offspring for limited resources, etc., to see if that type of environment changes the nature of what random mutation and selection can do. See issue #11 for more detail. So far, the preliminary tests show the same kinds of useless garbage mutations as the tests above.
I would also like to perform more advanced grammar tests on the English test to see if the English sentence information content or validity could be preserved.
Continue enhancing the mutator to allow for other types of mutations, including duplication of parts of the code, deletions or truncations of parts of the code, keyword or operator pairs instead of single mutations, etc. It can currently prepend, append, insert, and overwrite with a 25% chance of occurring. Once I’ve added other kinds of mutation, I would like to adjust the mutation type probability to match DNA mutation.
I would like to try unit tests that can guide code to a desired result to see if that’s possible – kind of a more robust form of the adaptation idea presented above.

As I perform each test and see the results, it gives me more ideas for future tests.

Conclusion

So far, my intuition has been confirmed by these initial test results: random mutation and selection were not shown to have creative power and, in fact, showed only destructive (breakage) or wasteful power (in the case of random numbers appearing). The destruction occurred at different levels: in the syntax itself, in the information represented by that syntax, and the design behind achieving the desired result. Selection only helps protect the code from accepting mutations, protecting the original code from being destroyed. It seems that if anything could be created through this process, the resulting information would have to come from the environment/selection, not through the mutation. But the environment doesn’t care about internal organs. It just cares about surface interactions between the creature and the environment, like the unit test testing external interfaces.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly