# Regular Expressions in Python

## Introduction

Regular expressions (also called regex, regexp or RE) are a tiny programming language for describing sets of strings, i.e. patterns.

Regular expressions often provide a more precise and more flexible way of searching than a simple text search.


They can be used for several things:

- Powerful search and replace function
- Extract information
- Format checking


Here we practise regular expressions using Python. We could also use the shell (grep/egrep, sed, awk), R, any other programming language and even many text editors and OpenOffice.

## Simple Patterns

The easiest regular expression is a literal string. For example, the regular expression `test` will match the string `test` exactly.


In Python, we first have to import the built-in module `re`:
```python
import re
```

The basic format is (query being a regular expression):
```python
Result = re.search(query, string)
```

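The other functions of the module take their arguments in a similar way. For example, `re.sub()` replaces the part of the string that matches the query:

```python
import re

MyString = "Arabidopsis thaliana"
re.sub("thaliana", "lyrata", MyString)
'Arabidopsis lyrata'
```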

Let's start with a simple example. We define a regular expression `MyRe` as a raw string by putting the letter `r` immediately before the opening quotation mark, so that Python passes the backslashes on to the regex engine unchanged:
```python
MyRe = r"(\w)(\w+)"
```

**Character classes**

what | matches
-------------------|----------------
`.` | any character (except a newline)
`\d` | digit `[0-9]`
`\w` | word character `[a-zA-Z0-9_]`
`\s` | whitespace
`[AGCT]` | exactly one of the enclosed characters

**Quantifiers**

what | matches
-------------------|----------------
`\w+` | one or more word characters
`\w*` | zero or more word characters
`{n}` | the preceding element exactly n times
`{a,b}` | the preceding element between a and b times
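
As a rough illustration of how the character classes and quantifiers above combine, the sketch below applies a few patterns with `re.findall` to a made-up string. The string and the variable names are assumptions chosen for illustration only.

```python
import re

# Hypothetical example string, chosen only for illustration
text = "Sample 42: ACGT, read length 150"

# \d+ : one or more digits
numbers = re.findall(r"\d+", text)

# [ACGT]{4} : exactly four characters from the class [ACGT]
bases = re.findall(r"[ACGT]{4}", text)

# \w{4,6} : runs of four to six word characters
words = re.findall(r"\w{4,6}", text)

print(numbers)  # ['42', '150']
print(bases)    # ['ACGT']
print(words)    # ['Sample', 'ACGT', 'read', 'length']
```

Note that `42` and `150` are not in `words` because they are shorter than four characters.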

```python  
MyRe = r"(\w)(\w+)"
```

The pattern `MyRe` captures two substrings, one per pair of parentheses, counted from left to right: (1) the first word character and (2) all remaining characters of the word.

Now we search for our regular expression in the string `Arabidopsis thaliana`:

```python
MyRe = r"(\w)(\w+)"

MyString = "Arabidopsis thaliana"
MyResult = re.search(MyRe, MyString)

# The whole match (group 0)
print(MyResult.group(0))

# The first captured group (first pair of parentheses)
print(MyResult.group(1))

# The second captured group (second pair of parentheses)
print(MyResult.group(2))
```

```
Arabidopsis
A
rabidopsis
```


## Functions that use regular expressions

Function | Description
-------------------|----------------   
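re.search() | Detect the presence of a pattern anywhere in the string (**only first match**)
re.match() | Like re.search but the match must start at the beginning of the string (use re.fullmatch() to require that the entire string matches)
re.findall() | Like re.search but returns **all** non-overlapping matches as a list of strings
re.finditer() | Like re.findall but returns an iterator over match objects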
re.sub(query, replacement, string) | Make **all** substitutions in a string
re.split() | Split a string according to a pattern
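
The differences between these functions are easiest to see side by side. The following is a minimal sketch on a made-up string; the gene identifiers and variable names are assumptions for illustration.

```python
import re

# Hypothetical input string for illustration
text = "AT1G01010 and AT5G11100"
pattern = r"AT\dG\d{5}"

first = re.search(pattern, text)      # first match anywhere in the string
at_start = re.match(pattern, text)    # match only at the start of the string
whole = re.fullmatch(pattern, text)   # None: the pattern does not cover the whole string
all_ids = re.findall(pattern, text)   # list of all matching substrings
spans = [m.span() for m in re.finditer(pattern, text)]  # iterate over match objects
parts = re.split(r"\s+and\s+", text)  # split the string at each match of the pattern

print(first.group())     # AT1G01010
print(at_start.group())  # AT1G01010
print(whole)             # None
print(all_ids)           # ['AT1G01010', 'AT5G11100']
print(spans)             # [(0, 9), (14, 23)]
print(parts)             # ['AT1G01010', 'AT5G11100']
```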




## Replacing text / Substitution

`re.sub()` replaces text. Its arguments are `pattern` (the regular expression), `repl` (the replacement) and `string` (the string to work on).

The simplest regular expression patterns are just literal characters:
```python
MyString = "23rd May 2000"
re.sub("May", "July", MyString)
'23rd July 2000'
```

`sub()` replaces all occurrences by default. Here we remove every space:
```python
re.sub(" ", "", " Hello World ")
'HelloWorld'
```

We can also limit the number of replacements using the `count` argument:
```python
re.sub(" ", "", " Hello World ", count=1)
'Hello World '

re.sub(" ", "", " Hello World ", count=2)
'HelloWorld '

re.sub(" ", "", " Hello World ", count=3)
'HelloWorld'
```

Regular expressions can do much more. Here we use one to keep only the first three digits after the decimal point:

```python
re.sub(r"(\d+\.\d{3})\d+", r"\1", "34.73322532")
'34.733'
```

or reformat a date:

```python
re.sub(r"(\d+)\w{2} (\w+) (\d+)", r"\2 \1 \3", "23rd May 2015")
'May 23 2015'
```

Often the same result can be achieved with string methods such as `split()`, followed by joining the parts of interest.
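
As a sketch of that alternative, the date from the previous example can be rearranged with plain string methods. This assumes the input always consists of exactly three whitespace-separated parts and a two-letter ordinal suffix on the day.

```python
# Same date reformatting as above, without regular expressions
date = "23rd May 2015"

day, month, year = date.split()  # split on whitespace into three parts
day = day[:-2]                   # drop the two-letter ordinal suffix ("rd")

reformatted = " ".join([month, day, year])
print(reformatted)  # May 23 2015
```

The regular expression is more tolerant of variation in the input, whereas the string version is arguably easier to read.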



## Escape special characters

To search for a meta-character (`$ * + . ? [ ] ^ { } | ( ) \`) literally, we precede it with a backslash. Because these characters have a special meaning in a pattern, the `\` tells the regex engine to treat them as ordinary characters.
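
A small sketch of the difference between an escaped and an unescaped dot, using made-up strings. `re.escape()` is a standard helper that adds the backslashes for us when a literal string should be used as a pattern.

```python
import re

# Made-up strings: only the first one contains a real dot
gene_ids = "AT5G11100.3 AT5G11100x3"

# Unescaped dot: matches any character, so both strings match
unescaped = re.findall(r"AT5G11100.3", gene_ids)

# Escaped dot: matches a literal dot only
escaped = re.findall(r"AT5G11100\.3", gene_ids)

# re.escape() escapes the meta-characters of a literal string
auto = re.escape("AT5G11100.3")

print(unescaped)  # ['AT5G11100.3', 'AT5G11100x3']
print(escaped)    # ['AT5G11100.3']
print(auto)       # AT5G11100\.3
```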

In Arabidopsis, isoforms of a gene X are named X.1, X.2, X.3 and so on. Sometimes we need a list of genes rather than isoforms, which we can get by removing the isoform suffix with a regular expression. The pattern below also works for X.11 or X.100.

```python
re.sub(r"([^\.]+)\.\d+", r"\1", "AT5G11100.3")
'AT5G11100'
```

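With square brackets `[]` we define our own character class. A `^` as the first character inside the brackets negates the class, so it matches any character except the listed ones. Outside a character class the dot must be escaped as `\.`, otherwise it matches any character; inside a class the dot is already literal, so the backslash in `[^\.]` is optional. Thus `[^\.]+` matches a run of characters that are not dots, which is a safe way to capture everything up to the first dot.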


There are often many different ways to write a working regular expression. These give the same result:

```python
re.sub(r"([^\.]+)\.[0123456789]+", r"\1", "AT5G11100.3")
'AT5G11100'

re.sub(r"([A-Za-z0-9]+)\.[0-9]+", r"\1", "AT5G11100.3")
'AT5G11100'
```
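
To get from isoforms to a list of genes, the substitution can be applied to every identifier and the duplicates removed. The sketch below uses hypothetical identifiers and a slightly different pattern that anchors the suffix to the end of the string with `$`.

```python
import re

# Hypothetical isoform identifiers
isoforms = ["AT5G11100.1", "AT5G11100.3", "AT1G01010.11", "AT2G02020.100"]

# Strip the trailing ".<number>" from each identifier, then remove duplicates
genes = sorted({re.sub(r"\.\d+$", "", iso) for iso in isoforms})

print(genes)  # ['AT1G01010', 'AT2G02020', 'AT5G11100']
```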


## Finding text / Format checking

`re.search()` tells you whether a regular expression matches anywhere in the input string. If it matches, it returns a match object; if not, it returns `None`.


```python
import re

if re.search(r"21", "21 Aug 2014"):
    print("found")
else:
    print("not found")
```

```
found
```


If it matches, the resulting match object has many attributes and methods (names starting with an underscore are not shown):

```python
dir(re.search(r"21", "21 Aug 2014"))
 'end',
 'endpos',
 'expand',
 'group',
 'groupdict',
 'groups',
 'lastgroup',
 'lastindex',
 'pos',
 're',
 'regs',
 'span',
 'start',
 'string'
```
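
A few of these are used far more often than the rest. The following sketch shows `group()`, `groups()`, `start()`, `end()` and `span()` on a made-up sentence.

```python
import re

m = re.search(r"(\d+) (\w+) (\d+)", "Sampled on 21 Aug 2014")

if m:  # a match object is truthy, None is falsy
    print(m.group(0))  # 21 Aug 2014 (the whole match)
    print(m.groups())  # ('21', 'Aug', '2014') (all captured groups)
    print(m.start())   # 11 (index of the first matched character)
    print(m.end())     # 22 (index just after the last matched character)
    print(m.span())    # (11, 22)
```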

## Splitting

`re.split()` cuts a string at every match of the pattern and returns the pieces as a list:

```python
>>> s = "a 1 and 2 and 3 and 4"
>>> a = re.split(r"\d", s)  # every digit is a delimiter
>>> a
['a ', ' and ', ' and ', ' and ', '']
```
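
Splitting with a pattern is most useful when the delimiter is not consistent. A minimal sketch, assuming a made-up record in which fields are separated by commas, semicolons or runs of whitespace:

```python
import re

# Hypothetical record with inconsistent delimiters
record = "geneA, geneB;geneC   geneD"

# Split on one or more commas, semicolons or whitespace characters
fields = re.split(r"[,;\s]+", record)

print(fields)  # ['geneA', 'geneB', 'geneC', 'geneD']
```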

## The power of Regexs

Regular expressions can, for example, be used to remove HTML tags to obtain plain text:

```python
# This string contains HTML.
>>> v = """<p id=x>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"""

# Replace HTML tags with an empty string.
>>> result = re.sub(r"<.*?>", "", v)
>>> print(result)
Sometimes, simpler is better, but not always.
```

The real power comes from combining these tools:
```
^AUG[AUGC]{30,1000}A{5,10}$
```
This pattern would identify full-length eukaryotic messenger RNA sequences (AUG start, followed by 30-1000 nucleotides, terminated by a 5-10 nt poly-A tail).
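
To see the pattern in action, it can be tried on a few toy sequences. The sequences below are built programmatically for illustration and are not real transcripts.

```python
import re

mrna_pattern = r"^AUG[AUGC]{30,1000}A{5,10}$"

# Toy sequences (assumptions for illustration only)
full_length = "AUG" + "GCU" * 12 + "A" * 6  # start codon, 36 nt, poly-A tail
too_short = "AUG" + "GCU" * 3 + "A" * 6     # only 9 nt between start codon and tail
no_tail = "AUG" + "GCU" * 12                # poly-A tail is missing

results = [bool(re.search(mrna_pattern, seq))
           for seq in (full_length, too_short, no_tail)]

print(results)  # [True, False, False]
```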


## Nota bene

- Quantifiers are 'greedy' by default, i.e. they match as many characters as possible. This can be changed with the `?` modifier (e.g. `+?` or `*?`). For example, the pattern `a.*c` applied to the string `abcdefghabc` matches `abcdefghabc`, whereas the non-greedy `a.*?c` matches only `abc`.
- Regular expressions are case-sensitive by default. This can be changed with the `re.IGNORECASE` flag.
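
Both points can be checked quickly in Python. This is a minimal sketch; the variable names are chosen for illustration.

```python
import re

text = "abcdefghabc"

greedy = re.search(r"a.*c", text).group()  # as many characters as possible
lazy = re.search(r"a.*?c", text).group()   # as few characters as possible

print(greedy)  # abcdefghabc
print(lazy)    # abc

# Case sensitivity: no match by default, a match with re.IGNORECASE
sensitive = re.search(r"arabidopsis", "Arabidopsis thaliana")
insensitive = re.search(r"arabidopsis", "Arabidopsis thaliana", re.IGNORECASE)

print(sensitive)            # None
print(insensitive.group())  # Arabidopsis
```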


## Application in Bioinformatics

Regular expressions are used a lot for data wrangling (format conversion). They are also often used for parsing, e.g. to extract information from a BLAST report (such as the sequence ID and the E-value). Regular expressions can also be used to identify sequence motifs (e.g. to search for a motif with 3 basic amino acids across 5 positions). Typical targets include:

- protein domains
- DNA transcription factor binding motifs
- restriction enzyme cut sites
- degenerate PCR primer sites
- runs of mononucleotides
- read mapping locations
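
As one example from this list, runs of mononucleotides can be located with a quantifier per base, combined with the pipe character. The sequence and the minimum run length of five are assumptions for illustration.

```python
import re

# Made-up DNA sequence: a run of six T and a run of five A
dna = "ACG" + "T" * 6 + "GC" + "A" * 5 + "C"

# Five or more identical bases in a row, one alternative per base
run_pattern = r"A{5,}|C{5,}|G{5,}|T{5,}"

runs = [(m.group(), m.span()) for m in re.finditer(run_pattern, dna)]

print(runs)  # [('TTTTTT', (3, 9)), ('AAAAA', (11, 16))]
```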

### Final Comment

Trial and error: Sometimes regular expressions do not behave as expected. In case of difficulties try to start simple, test parts of the regular expression and combine them once the subparts work. Often it also helps to do two rounds of replacements.
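
One way to follow this advice is to keep the sub-patterns in separate variables, test each one on its own, and only then join them. The sketch below does this for a date such as "21. Aug 2014" (the format used in the exercises); the variable names and the month pattern are illustrative assumptions.

```python
import re

# Sub-patterns for a date such as "21. Aug 2014"
day = r"\d{2}\."
month = r"[A-Z][a-z]{2}"
year = r"\d{4}"

# Test each part on its own before combining
assert re.fullmatch(day, "21.")
assert re.fullmatch(month, "Aug")
assert re.fullmatch(year, "2014")

# Combine the tested parts into the full pattern
date_pattern = f"{day} {month} {year}"

is_date = bool(re.fullmatch(date_pattern, "21. Aug 2014"))
print(is_date)  # True
```

If the combined pattern fails although every part works, the problem is most likely in the glue between the parts (here: the single spaces).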

## Sources

- [Regular Expression HOWTO (Python doc)](https://docs.python.org/3/howto/regex.html)
- [re library documentation](https://docs.python.org/3/library/re.html)
- [Software Carpentry v4](http://software-carpentry.org/v4/regexp/index.html)
- [Haddock & Dunn. Practical Computing for Biologists. Sinauer Associates 2011.](http://practicalcomputing.org)
- [Python for Biologists](http://pythonforbiologists.com/regular-expressions/)

## Further reading

- [Regex Cheatsheet](https://www.debuggex.com/cheatsheet/regex/python)
- [RegExr](https://regexr.com/) is an online tool to learn, build, & test Regexs
- [Sequence Analysis with RegExs](http://www.dalkescientific.com/writings/NBN/slides/regexps.pdf)

**Regular expression in other languages**  
- [in R](http://en.wikibooks.org/wiki/R_Programming/Text_Processing#Functions_which_use_regular_expressions_in_R)  
- [using sed](http://www.grymoire.com/Unix/Sed.html#uh-4)

## Exercises

1. Write a function is_vowel() that takes a character (i.e. a string of length 1) and returns True if it is a vowel, False otherwise.

2. Define a function is_palindrome() that recognizes palindromes (i.e. words that look the same written backwards). For example, is_palindrome("radar") should return True.

3. Modify some of the input strings and patterns/regular expression in the lesson and check whether they produce the expected results. You can e.g. add additional text to the input string or remove parts of the regex.

4. Write a regular expression that checks whether a string has a valid format like "21. Aug 2014".

5. Often we get very long sample names from sequencing centers. Develop a regex to remove the noninformative (repeated) parts of these 2 sample names:  
  X20120401_Wyderrun31_1367.05.1_05.RCC  
  X20120401_Wyderrun31_1482.05.1_08.RCC  

6. Search for the motif [CG]CTCGA in the sequence GTGCCCCTCGAGAGGAGGGCGCGCGCCGCGCGCTCGACGCGATCGGCGCTCAGCGAGCGAGCTCCTCGAAGCGATCCGCGCGCGCT. Print position and motif for each match. Hint: use `re.finditer` and a `for` loop.

7. (advanced) Have a closer look at the syntax for regular expression logic. For example, one can write `or` statements using the pipe character:

  ```python
  motif = re.compile('[CG]CTCGA|GCGCGC')
  ```

  Write a function that finds the occurrences and positions of the motif CAGCCGCG in the following gapped sequences and returns the gapped position of each match (i.e. don't cheat by simply ignoring the gaps):

  CCA--G-C---A--GCCG---C-GG--TA-AT

  CGCA--G-C---A--GCCG---C-GG--TA-AT

  TGCA--G-C---A--GCCG---C-GG--TA-AT
  
## Solutions

```python
# Solution for Exercise 1

import re

def is_vowel(char):
    # fullmatch ensures that the input is exactly one vowel
    return bool(re.fullmatch(r"[aeiouAEIOU]", char))

print(is_vowel("a"))
print(is_vowel("z"))
```

```
True
False
```


```python
# Solution for Exercise 2 (no regular expression needed)

def is_palindrome(word):
    # A palindrome is equal to its own reverse
    return word == word[::-1]

print(is_palindrome('radar'))
print(is_palindrome('mars'))
```

```
True
False
```


```python
# Solution for Exercise 4

import re

def is_correct_format(text):
    # fullmatch also rejects trailing characters, e.g. '21. Aug 20145'
    return bool(re.fullmatch(r"\d{2}\. \w{3} \d{4}", text))

# correct format
print(is_correct_format('21. Aug 2014'))
# day is 1-digit only
print(is_correct_format('2. Aug 2014'))
# correct format
print(is_correct_format('02. Aug 2014'))
# dot is missing
print(is_correct_format('02 Aug 2014'))
# year is only 2-digit
print(is_correct_format('02. Aug 20'))
# \w{3} accepts any three word characters, so an invalid month passes.
# A really good version should also check that the middle part is a valid month
# and that day and year are in range; the groups could be captured and tested individually.
print(is_correct_format('02. NNN 2018'))
```

```
True
False
True
False
False
True
```


```python
# Solution for Exercise 5

# Several solutions are possible: we can use a regexp, or split the name
# into parts at '_' and concatenate the parts of interest.
import re

def shorten_samples_names(name):
    # [^_] matches any character but a '_'
    return re.sub(r"X.*Wyderrun\d+_([^_]+_\d+)\.RCC", r"\1", name)


print(shorten_samples_names('X20120401_Wyderrun31_1367.05.1_05.RCC'))
print(shorten_samples_names('X20120401_Wyderrun31_1482.05.1_08.RCC'))
```

```
1367.05.1_05
1482.05.1_08
```


```python
# Solution for Exercise 6

import re

motif = r"[CG]CTCGA"

dna = 'GTGCCCCTCGAGAGGAGGGCGCGCGCCGCGCGCTCGACGCGATCGGCGCTCAGCGAGCGAGCTCCTCGAAGCGATCCGCGCGCGCT'

# Print the matched motif and its (start, end) position for each match
for match in re.finditer(motif, dna):
    print(match.group(), match.span())
```

```
CCTCGA (5, 11)
GCTCGA (31, 37)
CCTCGA (63, 69)
```


```python
# Solution for Exercise 7

import re

def match_motif(motif, seq):
    # Insert optional gaps '-*' between all characters of the motif
    regex = "-*".join(motif)
    print("regex string: " + regex)

    # Print the gapped position and the gapped match for every occurrence
    for match in re.finditer(regex, seq):
        print(match.span())
        print(match.group())

motif1 = 'CAGCCGCG'

# The last sequence is an additional test case that is not listed in the exercise
seqs = ['CCA--G-C---A--GCCG---C-GG--TA-AT',
        'CGCA--G-C---A--GCCG---C-GG--TA-AT',
        'TGCA--G-C---A--GCCG---C-GG--TA-AT',
        '---CTTGCTCGT---C---A------GC-CGC----------G-----']

for seq in seqs:
    match_motif(motif1, seq)
```

```
regex string: C-*A-*G-*C-*C-*G-*C-*G
(7, 24)
C---A--GCCG---C-G
regex string: C-*A-*G-*C-*C-*G-*C-*G
(8, 25)
C---A--GCCG---C-G
regex string: C-*A-*G-*C-*C-*G-*C-*G
(8, 25)
C---A--GCCG---C-G
regex string: C-*A-*G-*C-*C-*G-*C-*G
(15, 43)
C---A------GC-CGC----------G
```
