
![Py4Eng](https://dl.dropboxusercontent.com/u/1578682/py4eng_logo.png)

# Regular expressions
## Yoav Ram

# Regular expressions
Suppose we have a DNA sequence in which we want to look for a specific pattern, say, 'TATAGGA'.  
What do we do?  
Easy, we use the `find` method.

In [None]:
seq = "ccgcaattcactctataggagcaggaacatggataaagctcacagtcgca"
if seq.find('tatagga') >= 0:
    print('pattern found!')

OK, but what if we need to look for a more flexible pattern, such as 'TATAGGN'?  
We can do:

In [None]:
if seq.find('tatagga') >= 0 or seq.find('tataggt') >= 0 or seq.find('tataggc') >= 0 or seq.find('tataggg') >= 0:
    print('pattern found!')

But that's lots of work and also, what if we need 'TATAGNN'?  
There are too many combinations to cover manually!  
What we need is a more general way of doing such matching. This is what __Regular expressions__ are for!

### What are regular expressions?
Regular expressions (regex) are sequences of characters that represent a search pattern. It's like a small language that was designed to describe how a text string should look. It includes special symbols which allow us to describe flexible strings.  
This is a very powerful tool when looking for patterns or parsing text.  
We'll soon see what we can do with it and how to use it.

### Using regular expressions
In order to use regex, we use Python's built-in `re` module. That means we don't have to install anything, we just import it.

In [None]:
import re

#### Raw strings
We've already encountered some special characters, such as _\n_ (newline) and _\t_ (tab).  
In regular expressions we want to avoid any confusion, so we use a special notation that tells Python not to treat backslash sequences as special characters. We simply put an _r_ __outside__ the quotation marks, right before the opening one.

In [None]:
normal_string = "There will be\na new line"
raw_string = r"There won't be\na new line"
print(normal_string)
print(raw_string)

__ALWAYS use raw strings when working with regular expressions!__
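
One way to see what the _r_ prefix does is to compare string lengths. This is only a small illustrative sketch; the variable names `normal` and `raw` are made up for the example.

```python
# In a normal string, \n is ONE character (a newline).
normal = "\n"
# In a raw string, \n is TWO characters: a backslash and the letter n.
raw = r"\n"

print(len(normal))   # 1
print(len(raw))      # 2

# A raw string is the same as a normal string with the backslash doubled
print(raw == "\\n")  # True
```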

### Searching for patterns
This is the most basic task regex is used for. We just want to know if a pattern can be found within a string.  
The first step when working with regex is usually _compiling_. This means we transform a simple string such as 'tatagga' into a regex pattern object. This is done using `re.compile()`:

In [None]:
regex = re.compile(r'tatagga') # notice the 'r'

We didn't match anything yet, just prepared the regex pattern. Once we have it, we can use it to search within another string. For this we can use the `re.search()` function. It takes two parameters, the regex and the string to search (the 'target string'), and returns a match object if the pattern was found and `None` otherwise. Since a match object counts as true and `None` counts as false, we can use the result directly in an `if` statement.

In [None]:
if re.search(regex,seq):
    print('pattern found!')
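
To see what the `if` is really testing, we can print the value that `re.search()` gives back: a match object when the pattern is found, and `None` when it is not. The sketch below repeats `seq` so that it runs on its own; the names `found`, `missing`, `same_1` and `same_2` are just for illustration.

```python
import re

seq = "ccgcaattcactctataggagcaggaacatggataaagctcacagtcgca"
regex = re.compile(r'tatagga')

found = re.search(regex, seq)        # the pattern is in seq
missing = re.search(regex, "cccc")   # the pattern is not in this string

print(found)     # a match object
print(missing)   # None

# A compiled pattern also has its own search() method,
# and re.search() accepts a plain pattern string as well.
same_1 = regex.search(seq)
same_2 = re.search(r'tatagga', seq)
print(same_1 is not None, same_2 is not None)
```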

#### Character groups
The last example wasn't particularly useful, right?  
OK, so here's when it gets interesting. We can define character groups within our regex, so that any of them will be matched. We do that using square brackets, and put all possible matches within them. So if we want to match 'TATAGGN' we'll do:

In [None]:
regex = re.compile(r'tatagg[atgc]')
if re.search(regex,seq):
    print('pattern found!')

In [None]:
if re.search(regex,"tataggn"):
    print('pattern found!')
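
The 'TATAGNN' case from the introduction, which was too much work with `find`, can be handled the same way by using two character groups in a row. A minimal sketch, repeating `seq` so it runs on its own:

```python
import re

seq = "ccgcaattcactctataggagcaggaacatggataaagctcacagtcgca"

# 'tatag' followed by any two nucleotides
regex = re.compile(r'tatag[atgc][atgc]')
if re.search(regex, seq):
    print('pattern found!')
```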

We can put any list of characters within the brackets. There are also a few tricks to make things easier:  
* [0-9] - any digit
* [a-z] - any lowercase letter
* [a-p] - any lowercase letter between a and p
  
There are also special symbols for common groups:  
* \d - any digit (equivalent to [0-9])
* \w - any 'word' character - letters, digits and underscore (for ASCII text, equivalent to [a-zA-Z0-9\_])
* \s - any whitespace character - space, tab, newline and other weird stuff (equivalent to [ \t\n\r\f\v])
  
And finally, there's the _wildcard_ symbol, represented by a dot (.).  
This means any character (except for a newline).  
__Careful with this one!__ It'll take almost anything, so use it wisely.

In [None]:
# examples:
regex = re.compile(r'\d[d-k][2-8].')
if re.search(regex,'7f6,'):
    print('pattern found!')

In [None]:
if re.search(regex,'hello7f6world'):
    print('pattern found!')

In [None]:
if re.search(regex,'5l7o'):
    print('pattern found!')

In [None]:
if re.search(regex,'7f6'):
    print('pattern found!')

#### Being negative
Sometimes we want to tell python to search for 'anything but...'. We can do that in two ways:  
If we are using character groups in square brackets, we can simply add a caret (^) before the characters. For example `[^gnp%]` means 'match anything but 'g', 'n', 'p' or '%''. If we are using the special character groups, we can replace the symbol with a capital letter, so for example \D means 'match anything but a digit'.

In [None]:
regex = re.compile(r'AAT[^G]TAA')
if re.search(regex,'AATCTAA'):
    print('pattern found!')

In [None]:
if re.search(regex,'AATGTAA'):
    print('pattern found!')

In [None]:
regex = re.compile(r'AAT\STAA')
if re.search(regex,'AATCTAA'):
    print('pattern found!')

In [None]:
if re.search(regex,'AATGTAA'):
    print('pattern found!')

In [None]:
if re.search(regex,'AAT TAA'):
    print('pattern found!')

#### Alternation
When we want to create multiple options for longer patterns, character groups are not enough. In these cases we have to use the special '|' (pipe) character, which simply means 'or'.  
For example, if we want to match a pattern that starts with AGG, then either CCG __or__ TAG, and finally GTG, we can do: 

In [None]:
regex = re.compile(r'AGG(CCG|TAG)GTG')
if re.search(regex,'AGGTAGGTG'):
    print('pattern found!')

In [None]:
if re.search(regex,'AGGCCGGTG'):
    print('pattern found!')

In [None]:
if re.search(regex,'AGGCCTGTG'):
    print('pattern found!')
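
The parentheses matter here: they limit what the '|' applies to. Without them, the 'or' splits the whole pattern in two. The sketch below compares the two versions; the names `with_parens` and `without_parens` are made up for the example.

```python
import re

with_parens = re.compile(r'AGG(CCG|TAG)GTG')
without_parens = re.compile(r'AGGCCG|TAGGTG')   # means 'AGGCCG' or 'TAGGTG'

# 'AGGCCG' alone is not the full pattern...
print(re.search(with_parens, 'AGGCCG'))                 # None
# ...but without parentheses it satisfies the left-hand option.
print(re.search(without_parens, 'AGGCCG') is not None)  # True
```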

#### Repetition
In many cases, we want to write regular expressions where a part of the pattern repeats itself multiple times. For that, we use _quantifiers_.  
If we know exactly how many repetitions we want, we can use `{<number>}`:

In [None]:
regex = re.compile(r'GA{5}T')
if re.search(regex,'GAAAAAT'):
    print('pattern found!')

In [None]:
if re.search(regex,'GAAAT'):
    print('pattern found!')

We can also set an acceptable range for the number of repeats, which is done using `{<minimum repeats>,<maximum repeats>}`:

In [None]:
regex = re.compile(r'GA{3,5}T')
if re.search(regex,'GAAAAAT'):
    print('pattern found!')

In [None]:
if re.search(regex,'GAAAT'):
    print('pattern found!')

In [None]:
if re.search(regex,'GAAAAAAT'):
    print('pattern found!')

To say 'x or more repetitions', we use `{x,}`. For 'up to x repetitions', we can use `{0,x}`.
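
A short sketch of the two open-ended forms, using the same G...T pattern as above (the names `at_least_3` and `up_to_3` are just for illustration):

```python
import re

at_least_3 = re.compile(r'GA{3,}T')   # 3 or more A's
up_to_3 = re.compile(r'GA{0,3}T')     # 0 to 3 A's

# print, for each string, whether each pattern is found in it
for s in ['GT', 'GAAT', 'GAAAT', 'GAAAAAAT']:
    print(s, bool(re.search(at_least_3, s)), bool(re.search(up_to_3, s)))
```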

For more general cases, there are three special symbols we can use:  
- \+ - repeat 1 or more times
- \* - repeat 0 or more times
- ? - repeat 0 or 1 times, or in other words 'optional' character.

In [None]:
regex = re.compile(r'GA+TT?[AC]*')
if re.search(regex,'GAATTACCA'):
    print('pattern found!')

In [None]:
if re.search(regex,'GATACCA'):
    print('pattern found!')

In [None]:
if re.search(regex,'GTACCA'):
    print('pattern found!')

In [None]:
if re.search(regex,'GAAAAAAAT'):
    print('pattern found!')

__Note 1__: Quantifiers always refer to the single character or character group that appears right before them. If we want to repeat a sequence of several characters, we enclose it in parentheses ().

In [None]:
regex = re.compile(r'GGCG(AT)+GGG')
if re.search(regex,'GGCGATATATATGGG'):
    print('pattern found!')

In [None]:
if re.search(regex,'GGCGATTAATGGG'):
    print('pattern found!')

In [None]:
regex = re.compile(r'GGCG(AT)?GGG')
if re.search(regex,'GGCGATGGG'):
    print('pattern found!')

In [None]:
if re.search(regex,'GGCGGGG'):
    print('pattern found!')

__Note 2__: Whenever we want to match one of the special regex characters in its 'normal' context, we simply put a '\' before it. For example: \\*, \\+, \\{...

In [None]:
regex = re.compile(r'.+\{\d+\}\.')
sentence = 'A sentence that ends with number in curly brackets {345}.'
if re.search(regex,sentence):
    print('pattern found!')
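
When a whole piece of text should be matched literally, adding the backslashes by hand gets tedious. The `re` module has a helper, `re.escape()`, that does it for us. A minimal sketch, with made-up names `literal` and `escaped`:

```python
import re

literal = '{345}.'
# re.escape() puts a backslash before the special characters
escaped = re.escape(literal)
regex = re.compile(escaped)

print(bool(re.search(regex, 'ends with {345}.')))  # True
print(bool(re.search(regex, 'ends with 345')))     # False
```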

## <span style="color:blue">Class exercise 4D</span>

The code below includes a list of made-up gene names. Complete it to only print gene names that satisfy the following criteria:  
1. Contain the letter 'd' __or__ 'e'  
2. Contain the letter 'd' __and__ 'e', in that order (not necessarily in a row)
3. Contain three or more digits in a row

In [None]:
import re
genes = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847', 'hedle3455', 'xjhd53e', '45da', 'de37dp','map492ty']

# 1.
print('Gene names containing d or e:')
regex1 = re.compile(r'[de]')
for gene in genes:
    if re.search(regex1,gene):
        print(gene)
        
print('------------------------')

# 2.
print('Gene names containing d and e, in that order:')
regex2 = re.compile(r'd[^e]*e')
for gene in genes:
    if re.search(regex2,gene):
        print(gene)
        
print('------------------------')

# 3.
print('Gene names containing three digits in a row:')
regex3 = re.compile(r'\d{3,}')
for gene in genes:
    if re.search(regex3,gene):
        print(gene)

#### Enforcing positions
We can force a regex to match only at the start or the end of the input string. We do that by using the ^ and $ symbols, respectively.

In [None]:
regex = re.compile(r'^my name')
if re.search(regex,'my name is Slim Shady'):
    print('pattern found!')

In [None]:
if re.search(regex,'This is my name'):
    print('pattern found!')

In [None]:
regex = re.compile(r'my name$')
if re.search(regex,'This is my name'):
    print('pattern found!')

We can combine the start and end symbols to match a whole string:

In [None]:
regex = re.compile(r'^GC[GTC]{2,10}TTA$')
if re.search(regex,'GCTTCGCTTA'):
    print('pattern found!')

In [None]:
if re.search(regex,'GCTTCGCTTAG'):
    print('pattern found!')
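
As an alternative to writing both ^ and $, the `re` module also offers `re.fullmatch()`, which only succeeds if the pattern covers the entire string. The sketch below uses the same pattern without the position symbols:

```python
import re

regex = re.compile(r'GC[GTC]{2,10}TTA')   # no ^ or $ here

print(bool(re.fullmatch(regex, 'GCTTCGCTTA')))    # True
print(bool(re.fullmatch(regex, 'GCTTCGCTTAG')))   # False, the extra G is not covered
print(bool(re.search(regex, 'GCTTCGCTTAG')))      # True, search accepts a partial match
```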

### Extracting matches
OK, now that we know the 'language' of regular expressions, let's see another useful thing we can do with it.  
So far, we only used regex to test if a string matches a pattern, but sometimes we also want to extract parts of the string for later use.  
Let's take an example.

#### The GATA-4 Transcription factor
GATA-4 is a TF in humans, known to have an important role in cardiac development (Oka, T., Maillet, M., Watt, A. J., Schwartz, R. J., Aronow, B. J., Duncan, S. A., & Molkentin, J. D. (2006). Cardiac-specific deletion of Gata4 reveals its requirement for hypertrophy, compensation, and myocyte viability. Circulation research, 98(6), 837-845.)  
It is also known to bind the motif: AGATADMAGRSA (where M = A or C, D = A,G or T, R = A or G and S = C or G).  
Using regex, it's easy to write a function that checks if a sequence includes this motif.
![Motif](lec4_files/gata4.jpg)
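
Each ambiguity code in the motif corresponds to one character group, so the regex can be built mechanically from the motif. The sketch below shows one possible way to do that; `IUPAC_GROUPS` and `iupac_to_regex` are hypothetical names, and the table only covers the four codes used in this motif.

```python
# Assumed lookup table: only the ambiguity codes that appear in this motif
IUPAC_GROUPS = {
    'D': '[AGT]',
    'M': '[AC]',
    'R': '[AG]',
    'S': '[CG]',
}

def iupac_to_regex(motif):
    """Hypothetical helper: replace each ambiguity code with a character group."""
    # letters that are not in the table (A, C, G, T) are kept as they are
    return ''.join(IUPAC_GROUPS.get(base, base) for base in motif)

print(iupac_to_regex('AGATADMAGRSA'))   # AGATA[AGT][AC]AG[AG][CG]A
```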

In [None]:
def check_for_GATA4(sequence):
    motif_regex = re.compile(r'AGATA[AGT][AC]AG[AG][CG]A')   # D = A, G or T
    if re.search(motif_regex,sequence):
        return True
    else:
        return False

In [None]:
test_seq1 = 'AGAGTCTTTGAGATAGCAGACATAGTATATGGATTACGCTGGTCTTGTAAACCATAAAAGGAGAGCCACACTCTCCCTAAGACTCAGGGAAGAGGCCAAAGCCCCACCACCAGCACCCAAAGCTG'
check_for_GATA4(test_seq1)

In [None]:
test_seq2 = 'AGAGTCTTTGAGATAGTAGACATAGTATATGGATTACGCTGGTCTTGTAAACCATAAAAGGAGAGCCACACTCTCCCTAAGACTCAGGGAAGAGGCCAAAGCCCCACCACCAGCACCCAAAGCTG'
check_for_GATA4(test_seq2)

But what if we want to extract the actual sequence that matches the regex?  
Let's have another look at the `re.search()` function. So far, we only used it to test if a match exists or not. But it actually returns a match object, from which we can get the exact matched text using the `group()` method.  
This method is used on the search result to get the match. So the following function will return the actual match in the sequence, if one exists. Otherwise, it will return `None`.

In [None]:
def find_GATA4_motif(sequence):
    motif_regex = re.compile(r'AGATA[AGT][AC]AG[AG][CG]A')   # D = A, G or T
    result = re.search(motif_regex,sequence)   # notice the assignment here
    if result is None:
        return None
    else:
        return result.group()

In [None]:
print(find_GATA4_motif(test_seq1))

In [None]:
print(find_GATA4_motif(test_seq2))

Since most of the motif is fixed, we might only be interested in the 'ambiguous' parts (that is, the DM part and the RS part). We can _capture_ specific parts of the pattern by enclosing them with parentheses. Then we can extract them by giving the `group()` method an argument, where '1' means 'extract the first captured part', '2' means 'extract the second captured part' and so on. The following function will capture the ambiguous positions and return them as elements of a list.

In [None]:
def extract_ambiguous_for_GATA4(sequence):
    motif_regex = re.compile(r'AGATA([AGT])([AC])AG([AG])([CG])A') # notice the parentheses
    result = re.search(motif_regex,sequence)
    if result is None:
        return None
    else:
        D = result.group(1)
        M = result.group(2)
        R = result.group(3)
        S = result.group(4)
        return [D,M,R,S]

In [None]:
D,M,R,S = extract_ambiguous_for_GATA4(test_seq1)
print('D nucleotide:',D)
print('M nucleotide:',M)
print('R nucleotide:',R)
print('S nucleotide:',S)
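
Instead of calling `group()` once per captured part, the match object also has a `groups()` method that returns all captured parts at once, as a tuple. A minimal sketch on a short made-up sequence:

```python
import re

motif_regex = re.compile(r'AGATA([AGT])([AC])AG([AG])([CG])A')
result = re.search(motif_regex, 'TTGAGATAGCAGACATAG')   # made-up test sequence

if result is not None:
    print(result.group())     # the whole match: AGATAGCAGACA
    print(result.groups())    # all captured parts: ('G', 'C', 'A', 'C')
    D, M, R, S = result.groups()
```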

### More on regular expression
There are some other cool things we can do with regex, which we'll not discuss here:
* Split strings by regex
* Substitute parts of string using regex
* Get the position in the string where a pattern was found

If you want to do any of these, take a look at the re module documentation:  
https://docs.python.org/3/library/re.html
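
For orientation only, here is a minimal sketch of what these three operations look like with the `re` module, on a made-up sequence in which runs of 'N' separate three identical fragments:

```python
import re

seq = 'ACGTNNACGTNNNACGT'   # made-up sequence

# split the string wherever there is a run of N's
print(re.split(r'N+', seq))      # ['ACGT', 'ACGT', 'ACGT']

# substitute every run of N's with a dash
print(re.sub(r'N+', '-', seq))   # ACGT-ACGT-ACGT

# get the position of the first run of N's
result = re.search(r'N+', seq)
print(result.start(), result.end())   # 4 6
```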

#### Recommended:
The Regex Coach is a very useful program when dealing with more complex patterns. It lets you try your regular expressions interactively, see if they work and what parts are extracted. Download and more information [here](http://www.weitz.de/regex-coach/#install).

## <span style="color:blue">Class exercise 4E</span>

The 'GATA4_promoters.fasta' file includes (made-up) promoter sequences for genes suspected to be regulated by GATA-4.  
We'll use everything we've learned so far to write a program that summarizes some interesting statistics regarding the GATA-4 motifs in these promoters.  
First, let's adjust the parse\_fasta() function we created earlier for the specific format of the promoters file:

In [None]:
def parse_promoters_fasta(file_name):
    """
    Receives a path to a fasta file, and returns a dictionary where the keys
    are the sequence names and the values are the sequences.
    """
    # create an empty dictionary to store the sequences
    sequences = {}
    # open fasta file for reading
    with open(file_name,'r') as f:
        # Loop over file lines
        for line in f:
            # if header line
            if line.startswith('>'):
                seq_id = line[1:-1]   # take the whole line, except the '>' in the beginning and '\n' at the end
            # if sequence line
            else:
                seq = line.strip()
                sequences[seq_id] = seq
    return sequences

1)
Write a function that receives a promoters fasta dictionary, and counts how many of the promoters have the GATA-4 motif. Use any of the functions defined above and complete the code:

In [None]:
def count_promoters_with_motif(promoters_dictionary):
    """
    Receives a dictionary representing a promoters fasta file,
    and counts how many of the promoters include a GATA-4 motif.
    """
    promoters_count = 0   # store the number of promoters with GATA-4 motif
    for p in promoters_dictionary:
        if check_for_GATA4(promoters_dictionary[p]):
            promoters_count += 1
    return promoters_count

2) For promoters that do include the GATA-4 motif, we would like to know the frequencies of the different nucleotides for each of the four variable positions in the motif. Complete the code:

In [None]:
def get_positions_statistics(promoters_dictionary):
    """
    Receives a dictionary representing a promoters fasta file,
    and returns the frequencies of possible nucleotides in 
    each variable position.
    """
    # define a  dictionary for each position, to store the nucleotide frequencies
    # D position
    D_dict = {'A':0, 'G':0, 'T':0}
    # M position
    M_dict = {'A':0, 'C':0}
    # R position
    R_dict = {'A':0, 'G':0}
    # S position
    S_dict = {'C':0, 'G':0}
    
    # iterate over promoters
    for p in promoters_dictionary:
        # get variable nucleotides in promoter
        # (the result is None if the promoter has no GATA-4 motif)
        result = extract_ambiguous_for_GATA4(promoters_dictionary[p])
        if result is not None:
            D,M,R,S = result
            # insert to dictionaries
            D_dict[D] += 1
            M_dict[M] += 1
            R_dict[R] += 1
            S_dict[S] += 1
            
    return D_dict, M_dict, R_dict, S_dict

3) Now, we just have to write a function that will summarize the results in a CSV file. It should receive the frequencies dictionaries and write statistics to an output file. Complete the code:

In [None]:
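import csv   # needed for csv.writer below

def summarize_results(D_dict, M_dict, R_dict, S_dict, output_file):
    # newline='' is what the csv module recommends for files given to csv.writer
    with open(output_file, 'w', newline='') as fo: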
        csv_writer = csv.writer(fo)
        # write headers line
        csv_writer.writerow(['Position','A','G','C','T'])
        # summarize D position
        csv_writer.writerow(['D',D_dict['A'],D_dict['G'],0,D_dict['T']])
        # summarize M position
        csv_writer.writerow(['M',M_dict['A'],0,M_dict['C'],0])
        # summarize R position
        csv_writer.writerow(['R',R_dict['A'],R_dict['G'],0,0])
        # summarize S position
        csv_writer.writerow(['S',0,S_dict['G'],S_dict['C'],0])

4) Now that we have all the functions ready, we can write the main program. Complete the code:

In [None]:
import csv
promoters_file = "lec4_files/GATA4_promoters.fasta"
output_file = "lec4_files/promoters_stats.csv"

# parse fasta file
promoters_dict = parse_promoters_fasta(promoters_file)

# Count promoters with/without GATA-4 motif
promoters_with_motif = count_promoters_with_motif(promoters_dict)
promoters_without_motif = len(promoters_dict) - promoters_with_motif
print('Total promoters:',promoters_with_motif + promoters_without_motif)
print('Promoters with GATA-4 motif:',promoters_with_motif)
print('Promoters without GATA-4 motif:',promoters_without_motif)

# Get statistics
D_dict, M_dict, R_dict, S_dict = get_positions_statistics(promoters_dict)
# write to CSV
summarize_results(D_dict, M_dict, R_dict, S_dict,output_file)

## Colophon
This notebook was written by [Yoav Ram](http://www.yoavram.com) and is part of the _Python for Engineers_ course.

The notebook was written using [Python](http://python.org/) 3.4.4, [IPython](http://ipython.org/) 4.0.3 and [Jupyter](http://jupyter.org) 4.0.6.

This work is licensed under a CC BY-NC-SA 4.0 International License.

![Python logo](https://www.python.org/static/community_logos/python-logo.png)