#Regular Expressions

##Introduction
###Regular expressions
* A regular expression is a sequence of symbols and characters that expresses a text pattern.
* It lets us specify a string pattern that we can then search for in bodies of text.
* Regular expressions form a mini-language with its own syntax, and they exist in pretty much every programming language (and OS) worth its salt.

Some examples of regular expressions are:

```
\w+_\d
^SL\d{3}_UC$
^1[012]0[01]\d{4,5}
```
Regexps can appear quite dense, but they can be decomposed into **character classes** and **metacharacters**.
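
As a rough illustration of that decomposition, the sketch below rewrites the second example pattern in verbose mode (`re.VERBOSE`), so that each piece can carry its own comment. The test strings are made up for this example.

```python
import re

# The second example pattern, split into its pieces.
# With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored.
pattern = re.compile(r'''
    ^        # metacharacter: anchor at the start of the string
    SL       # two literal characters
    \d{3}    # character class \d (any digit), repeated exactly 3 times
    _UC      # four literal characters
    $        # metacharacter: anchor at the end of the string
''', re.VERBOSE)

# Made-up test strings, not taken from any real data set.
samples = ['SL123_UC', 'SL12_UC', 'XSL123_UC']
results = [bool(pattern.match(s)) for s in samples]
print(results)
```

Only the first string matches: the second has too few digits and the third does not start with `SL`.
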
###Character classes
###Metacharacters


In Python we use the `re` module to create and use regular expressions.

In [157]:
import re

###Task
We have a list of strings and some of these contain names that we want to extract. The names have the format
```
0123_FirstName_LastName
```
where the number of digits at the beginning of the string varies.

In [142]:
L = [
'123_Sam_Smith',
'Blah blah',
'2342_Katie_Price',
'More blah blah',
'String_without_numbers']

In [143]:
# [A-Za-z] matches letters only; a comma inside the brackets would also match a literal ','
p = re.compile(r'\d+_([A-Za-z]+)_([A-Za-z]+)')

In [144]:
for el in L:
    m = p.match(el)  # returns None when the string does not fit the pattern
    if m:
        print(m.groups())

('Sam', 'Smith')
('Katie', 'Price')


In [139]:
m = p.match(L[0])

In [140]:
m.groups()

('Sam', 'Smith')
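
When a pattern has several groups, naming them can make the extraction easier to read. The following is a minimal sketch of the same pattern using named groups; the group names `first` and `last` are chosen here for illustration and are not part of the original task.

```python
import re

# Same pattern as above, but each group gets a name via (?P<name>...).
p = re.compile(r'\d+_(?P<first>[A-Za-z]+)_(?P<last>[A-Za-z]+)')

m = p.match('123_Sam_Smith')
first = m.group('first')   # look up a group by name
last = m.group('last')
names = m.groupdict()      # all named groups as a dictionary
print(names)
```
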

###Task
Find all occurrences of "AGT" within a string of DNA, where contiguous repeated occurrences should be counted only once.

In [106]:
dna = 'AGTAGTACTACAAGTAGTCCAGTCCTTGGGAGTAGTAGTAGTAAGGGCCT'

In [107]:
p = re.compile(r'(AGT)+')  # one or more back-to-back copies of 'AGT'
m = p.finditer(dna)        # iterator over non-overlapping matches
for match in m:
    print('(start, stop): {}'.format(match.span()))
    print('matching string: {}'.format(match.group()))

(start, stop): (0, 6)
matching string: AGTAGT
(start, stop): (12, 18)
matching string: AGTAGT
(start, stop): (20, 23)
matching string: AGT
(start, stop): (30, 42)
matching string: AGTAGTAGTAGT
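
To turn these matches into counts, one option is to collect the runs in a list. This is a sketch rather than the only way to do it; it uses a non-capturing group `(?:...)`, and the variable names are chosen for this example.

```python
import re

dna = 'AGTAGTACTACAAGTAGTCCAGTCCTTGGGAGTAGTAGTAGTAAGGGCCT'

# (?:AGT)+ groups 'AGT' for repetition without creating a capture group.
p = re.compile(r'(?:AGT)+')

runs = [match.group() for match in p.finditer(dna)]
n_runs = len(runs)                        # contiguous repeats counted once
repeats = [len(run) // 3 for run in runs] # copies of 'AGT' inside each run
n_total = len(re.findall(r'AGT', dna))    # every single occurrence

print(n_runs)
print(repeats)
print(n_total)
```
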


In [74]:
# A single 'AGT' matched at the start of the string spans indices 0 to 3
m = re.match(r'AGT', dna)
m.span()

(0, 3)

In [35]:
p = re.compile(r'\W+')
p.split('This is a test, short and sweet, of split().')

['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
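
The empty string at the end of that list appears because the text ends with non-word characters, so `split` reports an empty piece after the last separator. Two possible ways to avoid it are sketched below.

```python
import re

text = 'This is a test, short and sweet, of split().'

# Option 1: split as before, then drop empty pieces.
pieces = re.split(r'\W+', text)
words = [w for w in pieces if w]

# Option 2: match the words directly instead of the separators.
also_words = re.findall(r'\w+', text)

print(words)
```
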

In [82]:
match.group()

'AGTAGTAGTAGT'

###Task
It has become a critical part of your job to determine if a string contains "wazup" or "wazzup" or "wazzzup", etc.

In [158]:
L = [
'wazzzzzzzup',
'wazup',
'waup',
'what is up',
'wazzzzzzzzzzzzzzzzzzzzzzzup']

In [159]:
p = re.compile(r'waz+up')

In [160]:
for el in L:
    # search() scans the whole string; match() would only test the beginning
    if p.search(el):
        print(el)

wazzzzzzzup
wazup
wazzzzzzzzzzzzzzzzzzzzzzzup
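
Since the task asks whether a string *contains* the pattern, it helps to keep the difference between `match` and `search` in mind: `match` only tests the beginning of the string, while `search` scans all of it. A small sketch with a made-up sentence:

```python
import re

p = re.compile(r'waz+up')
text = 'hey, wazzup?'  # made-up example; the pattern is not at the start

at_start = p.match(text) is not None   # False: the text starts with 'hey'
anywhere = p.search(text) is not None  # True: found later in the string
found = p.search(text).group()

print(at_start)
print(anywhere)
print(found)
```
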


##Resources
* https://docs.python.org/2/howto/regex.html
* https://www.regex101.com/#python