Added README.md.

tmciver · May 24, 2013 · 3063609
1 parent a114a78
commit 3063609
Showing 1 changed file with 63 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,63 @@
+# ByteGrep
+
+A Java library/executable for finding byte sequences in files.  Please note that this project was created as a learning exercise.  It is incomplete and not very fast, so it is not recommended for production use.
+
+# Usage
+
+The jar file produced from this project is executable.  Two command line arguments are needed.  The first is the regular expression describing the sequence of bytes to find.  The second argument is a path to the file to be searched:
+
+    java -jar bytegrep.jar some-regex path/to/some/file
+
+If a byte sequence described by the given regular expression is found, the output should look something like:
+
+> Found match at byte offset 72
+
+or
+
+> No match found
+
+if a match was not found.
+
+# Regular Expressions
+
+The regular expression syntax is exactly what you'd expect with one caveat: the literal syntax is different.  Since we are looking for bytes and not characters, the following byte literal syntax is used:
+
+    0xXY
+
+where X and Y are any hexadecimal digits.  So if you wanted to find the following four bytes:
+
+    0xCAFEBABE
+
+you'd use this regular expression syntax:
+
+    0xCA0xFE0xBA0xBE
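
To make the literal syntax concrete, here is a minimal sketch that turns a pattern consisting only of `0xXY` literals into the bytes it describes. The class `ByteLiteralSketch` and the method `parseLiterals` are hypothetical names chosen for illustration; they are not part of ByteGrep, and the real parser may handle literals differently.

```java
import java.util.Arrays;

// Hypothetical helper, not part of ByteGrep: converts a pattern made up only
// of 0xXY literals into the byte sequence it describes.
class ByteLiteralSketch {

    static byte[] parseLiterals(String pattern) {
        // Every literal is exactly four characters long: '0', 'x', and two hex digits.
        if (pattern.length() % 4 != 0) {
            throw new IllegalArgumentException("Expected a sequence of 0xXY literals: " + pattern);
        }
        byte[] bytes = new byte[pattern.length() / 4];
        for (int i = 0; i < bytes.length; i++) {
            String literal = pattern.substring(4 * i, 4 * i + 4);
            int high = Character.digit(literal.charAt(2), 16);
            int low = Character.digit(literal.charAt(3), 16);
            if (!literal.startsWith("0x") || high < 0 || low < 0) {
                throw new IllegalArgumentException("Bad byte literal: " + literal);
            }
            // The cast wraps values above 0x7F into Java's signed byte range.
            bytes[i] = (byte) (high * 16 + low);
        }
        return bytes;
    }

    public static void main(String[] args) {
        // Prints [-54, -2, -70, -66], the signed form of CA FE BA BE.
        System.out.println(Arrays.toString(parseLiterals("0xCA0xFE0xBA0xBE")));
    }
}
```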
+
+Grouping, alternation and the quantifiers *, + and ? are supported.  So the following is also valid ByteGrep syntax:
+
+    (0x8F0x45)+0xAA?0x3C
+
+Currently the following meta-characters are not supported:
+
+    []{}^.$
+
+# Issues
+
+Besides the unsupported regular expression syntax mentioned above, this implementation does not do any backtracking.  This means that a byte sequence described by an expression such as
+
+    0x45*0x450xAA
+
+will not be found, because the first part of the expression (0x45*) will consume all the 0x45 bytes before the 0xAA, and then the next part of the expression (0x45) will have nothing left to match.  Backtracking may be implemented in a future version.
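
The effect can be reproduced outside ByteGrep. The sketch below hand-codes a greedy, non-backtracking matcher for this one expression and compares it with `java.util.regex`, which does backtrack. The class and method names are hypothetical, and the hand-coded matcher only imitates the behavior described above; it is not ByteGrep's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Illustration only: contrasts a greedy matcher that never backtracks with a
// backtracking engine on the expression 0x45*0x450xAA.
class BacktrackingSketch {

    // Matches 0x45*0x450xAA at offset 0 without ever giving back consumed bytes.
    static boolean matchWithoutBacktracking(byte[] input) {
        int pos = 0;
        // 0x45*: greedily consume every leading 0x45.
        while (pos < input.length && input[pos] == 0x45) {
            pos++;
        }
        // 0x45: can never succeed here, because the loop above already
        // consumed every 0x45 at this position.
        if (pos >= input.length || input[pos] != 0x45) {
            return false;
        }
        pos++;
        // 0xAA
        return pos < input.length && input[pos] == (byte) 0xAA;
    }

    // java.util.regex backtracks, so the star gives one 0x45 back.
    // ISO-8859-1 maps each byte to the char with the same numeric value.
    static boolean matchWithBacktracking(byte[] input) {
        String text = new String(input, StandardCharsets.ISO_8859_1);
        return Pattern.compile("\\x45*\\x45\\xAA").matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        byte[] input = {0x45, 0x45, (byte) 0xAA};
        System.out.println(matchWithoutBacktracking(input)); // false
        System.out.println(matchWithBacktracking(input));    // true
    }
}
```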
+
+# Rationale
+
+As stated previously, this project was created as a learning experience.  In particular, there are two main concepts I wanted to learn.
+
+## Parsing
+
+The file `DefaultParser.java` is an implementation of what's known as a predictive recursive descent parser.  It's what parses the regular expression syntax string and creates from that a syntax tree representing the regular expression.  There's a nice comment at the top of that file that describes the grammar parsed by the parser in all its gory detail.
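
For readers new to the technique, the following is a minimal sketch of a predictive recursive descent parser for the syntax described in this README. The grammar in the comment is an assumption made for illustration and is not copied from `DefaultParser.java`. The class name `ParserSketch` is hypothetical, and for brevity the tree is returned as a parenthesized string rather than as `RegularExpression` objects.

```java
// Sketch of a predictive recursive descent parser. Assumed grammar:
//   regex  := term ('|' term)*
//   term   := factor factor*
//   factor := atom ('*' | '+' | '?')?
//   atom   := '(' regex ')' | '0x' HEX HEX
class ParserSketch {
    private final String src;
    private int pos = 0;

    private ParserSketch(String src) { this.src = src; }

    // Parses the whole string and returns the tree in prefix notation.
    static String parse(String src) {
        ParserSketch parser = new ParserSketch(src);
        String tree = parser.regex();
        if (parser.pos != src.length()) throw parser.error("unexpected input");
        return tree;
    }

    private String regex() {
        String left = term();
        while (peek() == '|') {
            pos++;
            left = "(alt " + left + " " + term() + ")";
        }
        return left;
    }

    private String term() {
        String left = factor();
        // "Predictive": one character of lookahead tells us whether another factor follows.
        while (peek() == '(' || peek() == '0') {
            left = "(seq " + left + " " + factor() + ")";
        }
        return left;
    }

    private String factor() {
        String atom = atom();
        switch (peek()) {
            case '*': pos++; return "(star " + atom + ")";
            case '+': pos++; return "(plus " + atom + ")";
            case '?': pos++; return "(opt " + atom + ")";
            default:  return atom;
        }
    }

    private String atom() {
        if (peek() == '(') {
            pos++;
            String inner = regex();
            if (peek() != ')') throw error("expected ')'");
            pos++;
            return inner;
        }
        if (src.startsWith("0x", pos) && pos + 4 <= src.length()
                && Character.digit(src.charAt(pos + 2), 16) >= 0
                && Character.digit(src.charAt(pos + 3), 16) >= 0) {
            String literal = src.substring(pos, pos + 4);
            pos += 4;
            return literal;
        }
        throw error("expected '(' or a 0xXY literal");
    }

    // Returns the next character, or '\0' at the end of the input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at index " + pos);
    }

    public static void main(String[] args) {
        // Prints (seq (seq (plus (seq 0x8F 0x45)) (opt 0xAA)) 0x3C)
        System.out.println(parse("(0x8F0x45)+0xAA?0x3C"));
    }
}
```

Each grammar rule becomes one method, and each method decides what to do next by looking at a single upcoming character, which is what makes the parser predictive.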
+
+## Syntax Trees and the Interpreter and Composite patterns
+
+The class `com.timmciver.bytegrep.RegularExpression` defines the interface (though it's currently an abstract class) that is implemented by the other classes in that package.  Every class in the package except `RegularExpression` itself and `LiteralByte` takes one or more `RegularExpression`s in its constructor.  This use of the [Composite Pattern](http://en.wikipedia.org/wiki/Composite_pattern) allows one to build up an arbitrarily complex syntax tree representing a regular expression for a byte sequence.
+
+The `match()` method defined in `RegularExpression` and implemented by each of its subclasses is a variation of the [Interpreter Pattern](http://en.wikipedia.org/wiki/Interpreter_pattern).  But instead of executing operations, each subclass's `match()` method checks the given input against the regular expression that the subclass represents.
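
The way the two patterns fit together can be sketched as follows. Everything in this example is an assumption made for illustration: the class names (`Expr`, `Literal`, `Sequence`, `OneOrMore`) are hypothetical stand-ins, and the `match(byte[], int)` signature is a guess, since the README does not show the real signature used by `RegularExpression`.

```java
// Sketch of Composite + Interpreter for byte regular expressions.
// All names and the match() signature are assumptions, not ByteGrep's API.
class CompositeSketch {

    // Common interface: try to match at pos and return the offset just past
    // the match, or -1 if there is no match.
    static abstract class Expr {
        abstract int match(byte[] input, int pos);
    }

    // Leaf node: matches exactly one byte.
    static class Literal extends Expr {
        private final byte value;
        Literal(int value) { this.value = (byte) value; }
        int match(byte[] input, int pos) {
            return (pos < input.length && input[pos] == value) ? pos + 1 : -1;
        }
    }

    // Composite node: matches first, then second.
    static class Sequence extends Expr {
        private final Expr first, second;
        Sequence(Expr first, Expr second) { this.first = first; this.second = second; }
        int match(byte[] input, int pos) {
            int mid = first.match(input, pos);
            return mid < 0 ? -1 : second.match(input, mid);
        }
    }

    // Composite node: the '+' quantifier, greedy and without backtracking.
    static class OneOrMore extends Expr {
        private final Expr inner;
        OneOrMore(Expr inner) { this.inner = inner; }
        int match(byte[] input, int pos) {
            int end = inner.match(input, pos);
            if (end < 0) return -1;
            while (true) {
                int next = inner.match(input, end);
                if (next <= end) return end; // no further progress: stop repeating
                end = next;
            }
        }
    }

    public static void main(String[] args) {
        // Tree for (0x8F0x45)+0x3C
        Expr regex = new Sequence(
                new OneOrMore(new Sequence(new Literal(0x8F), new Literal(0x45))),
                new Literal(0x3C));
        byte[] input = {(byte) 0x8F, 0x45, (byte) 0x8F, 0x45, 0x3C};
        System.out.println(regex.match(input, 0)); // 5
    }
}
```

Because every node exposes the same `match` operation, a composite node can delegate to its children without knowing whether they are leaves or further composites.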