Regular Expressions
===================

Regular expressions can be used to do moderately complex string matching and searching.  Regular expressions in Python are available via the `re` module, which wraps a fast regular expression engine written in C.

In [None]:
import re

Typically you want to use either the `match()` or the `search()` function.  `match()` only finds a pattern that occurs at the start of a string, while `search()` finds a pattern anywhere in a string.

In [None]:
string = "hello world"
print(re.match("hello (\w+)", string))
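
To make the difference between the two functions concrete, here is a minimal sketch.  The sample text is made up for illustration; the pattern deliberately avoids backslashes so that escaping does not get in the way.

```python
import re

text = "say hello world"

# match() only succeeds if the pattern matches at position 0.
at_start = re.match("hello ([a-z]+)", text)

# search() scans forward and returns the first match anywhere.
anywhere = re.search("hello ([a-z]+)", text)

print(at_start)           # None
print(anywhere.group(0))  # hello world
```

Both functions return `None` when nothing matches, which is why the examples here check for `None` before using the result.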

If a pattern is found, you get a `match` object back which contains information about what was matched.  In particular, the `group()` method is very useful: it returns the text matched by a given group.

In [None]:
match = re.match("hello (\w+)", string)
if match is not None:
    print(match.group(1))

In [None]:
string = "hello there"
match = re.match("hello (\w+)", string)
if match is not None:
    print(match.group(1))

Group 0 is the entire match, and groups are numbered after that by the order of the open parentheses (including nested parentheses).
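
The numbering rule is easiest to see with nested parentheses.  The following is a small sketch using a made-up date string; the pattern is only for illustration.

```python
import re

# Groups are numbered by the position of their opening parenthesis:
# group 1 opens first and contains groups 2 and 3; group 4 opens last.
m = re.match("(([0-9]+)-([0-9]+))-([0-9]+)", "2016-03-14")

print(m.group(0))  # 2016-03-14  (the entire match)
print(m.group(1))  # 2016-03
print(m.group(2))  # 2016
print(m.group(3))  # 03
print(m.group(4))  # 14
print(m.groups())  # all numbered groups (1 and up) as a tuple
```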

If you are re-using patterns over and over again, it can make sense to compile them before doing your matching.  The `re` module is reasonably smart about caching compiled expressions, so this may only matter if you are doing a lot of string matching.

In [None]:
pattern1 = re.compile('hello (\w+)')
match = pattern1.match(string)
if match is not None:
    print(match.group(1))

pattern1.match("hello world")
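
A compiled pattern pays off most when the same expression is applied to many strings.  A minimal sketch, assuming a hypothetical list of input lines:

```python
import re

# Compile once, outside the loop.
greeting = re.compile("hello ([a-z]+)")

# Hypothetical input lines, for illustration only.
lines = ["hello world", "goodbye world", "hello there"]

names = []
for line in lines:
    m = greeting.match(line)
    if m is not None:
        names.append(m.group(1))

print(names)  # ['world', 'there']
```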

One thing to be aware of is that regular expressions use `\` to escape special characters a lot.  This can be problematic at times.

For example, let's say we wanted to split parts of a Windows path, which are separated by backslashes (note that there are better ways of doing this using `os.path`!).  In this case, we want to match a single `\`, but because `\` is a special character for regular expressions, we need to escape it, so the actual expression is:

  \\

Where things get nasty is that if we just type this as a Python string, we get:

In [None]:
pattern = "\\"
print(pattern)

The problem is that Python strings also use `\` for escaping, so if you want to match a single `\`, then you need to use four (!) backslashes.

In [None]:
pattern = "\\\\"
path = "C:\\foo\\bar\\baz.txt"
print(re.split(pattern, path))

Because this sort of thing can be annoying, Python provides "raw" strings that ignore Python's usual escaping, which simplifies writing regular expressions:

In [None]:
pattern = r"\\"
path = r"C:\foo\bar\baz.txt"
print(re.split(pattern, path))
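
As a quick check, the raw and the escaped spellings produce exactly the same two-character pattern.  The sketch below also uses `re.escape()`, which is not covered above; it builds the escaped form of a literal string for you.

```python
import re

escaped = "\\\\"  # ordinary string: four typed backslashes
raw = r"\\"       # raw string: two typed backslashes

print(escaped == raw)  # True
print(len(raw))        # 2

# re.escape() produces the regex that matches a literal string,
# here a single backslash.
print(re.escape("\\") == raw)  # True

parts = re.split(raw, r"C:\foo\bar\baz.txt")
print(parts)  # ['C:', 'foo', 'bar', 'baz.txt']
```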

You should also be aware that there are (mathematically proven) limits on the sort of pattern matching that you can do with regular expressions alone.  If the patterns are sufficiently complex, then you should either use a parser which is already written and available in the standard library (such as for comma-separated values, XML, or Python code itself) or write your own custom parser using tools such as PLY or PyParsing.
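
Comma-separated values are a good illustration of where an existing parser is the safer choice.  In this sketch (the sample line is made up) a naive regular expression split breaks a quoted field apart, while the standard library `csv` module handles it:

```python
import csv
import re

# Hypothetical CSV line with a comma inside a quoted field.
line = 'widget,"Smith, Jane",42'

# Splitting on every comma cuts the quoted field in two.
naive = re.split(",", line)
print(naive)   # ['widget', '"Smith', ' Jane"', '42']

# csv.reader accepts any iterable of lines and understands quoting.
parsed = next(csv.reader([line]))
print(parsed)  # ['widget', 'Smith, Jane', '42']
```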

Numpy's `fromregex()`
=====================

Numpy provides a `fromregex()` function that allows you to use regular expressions to extract values from text into a structured numpy array.

If we have a file with values:

    1312 foo
    1534    bar
    444  qux


In [None]:
with open('test.dat', 'w') as f:
    f.write("""1312 foo
1534    bar
444  qux""")

We can create a regular expression which matches digits followed by some amount of whitespace, followed by any three characters:

In [None]:
pattern = r"(\d+)\s+(...)"

The dtype of the array should then correspond to the groups of the pattern:

In [None]:
dt = [('num', 'int64'), ('key', 'S3')]

And then we can read in from the file using `fromregex`:

In [None]:
from numpy import fromregex
output = fromregex('test.dat', pattern, dt)
print(output)

In [None]:
print(output['num'])
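
To see which text each group captures, the same pattern can be tried with plain `re.findall()`.  This is only a standard-library sketch of the idea, not how NumPy implements `fromregex()`, and it produces Python lists rather than a structured array:

```python
import re

# Same contents as test.dat above, held in a string for illustration.
text = """1312 foo
1534    bar
444  qux"""

pattern = r"(\d+)\s+(...)"

# With two groups, findall() returns one (num, key) tuple per match.
rows = re.findall(pattern, text)
print(rows)  # [('1312', 'foo'), ('1534', 'bar'), ('444', 'qux')]

# Convert the first column by hand; a dtype does this step in NumPy.
nums = [int(num) for num, key in rows]
keys = [key for num, key in rows]
print(nums)  # [1312, 1534, 444]
```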

Copyright 2008-2016, Enthought, Inc.
Use only permitted under license.  Copying, sharing, redistributing or other unauthorized use strictly prohibited.
http://www.enthought.com