# Regular Expressions

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match.


Import the re module:

In [38]:
import re

# re.search()
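The re.search function in Python's re module scans a string for the first location where a regex pattern matches.

It stops at the first match and returns a match object. If the pattern is not found anywhere in the string, it returns None, which is why the examples below test the result with `if match:` before using it.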

Syntax -

match_object = re.search(pattern, string)

Match Object Methods
The match object returned by re.search has several useful methods and attributes:

group(): Returns the matched string.

start(): Returns the starting position of the match.

end(): Returns the ending position of the match. The end index is exclusive, so it points just past the last matched character.

span(): Returns a tuple containing the start and end positions of the match.

In [39]:
#search for a simple motif sequence
import re

pattern = r"CGT"
sequence = "ATGCGTACGTAG"
match = re.search(pattern, sequence)

if match:
    print(f"Match found: {match.group()} at position {match.start()}-{match.end()}")
else:
    print("No match found.")

Match found: CGT at position 3-6


In [40]:
match.span()

(3, 6)

In [41]:
# using a flag: re.IGNORECASE makes the lowercase pattern match uppercase bases
import re

pattern = r"cgt"
sequence = "ATGCGTACGTAG"
match = re.search(pattern, sequence, re.IGNORECASE)

if match:
    print(f"Match found: {match.group()} at position {match.start()}-{match.end()}")
else:
    print("No match found.")


Match found: CGT at position 3-6
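A flag can also be written inline at the start of the pattern as `(?i)`, and several flags can be combined with `|`. The sketch below reuses the sequence from the cell above; the variable names and the commented pattern are illustrative assumptions, not part of the original notebook.

```python
import re

sequence = "ATGCGTACGTAG"

# Flag passed as the third argument
with_argument = re.search(r"cgt", sequence, re.IGNORECASE)

# The same flag written inline at the start of the pattern
with_inline = re.search(r"(?i)cgt", sequence)

# Flags combined with bitwise OR; re.VERBOSE ignores whitespace in the
# pattern and allows a trailing comment inside it
combined = re.search(r"c g t  # motif", sequence, re.IGNORECASE | re.VERBOSE)

for m in (with_argument, with_inline, combined):
    print(m.group(), m.span())
```

All three calls should report the same match, CGT at (3, 6).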


# re.match()
The re.match() function is used to determine if the beginning of a string matches a specified pattern. It only checks for a match at the start of the string.

In [45]:
# re.match()
print("\nre.match() Example:")
sequence = "ATGCGTACGTAG"
match = re.match(r"ATG", sequence)
if match:
    print("Match found at the start of the sequence")
else:
    print("No match found at the start of the sequence")


re.match() Example:
Match found at the start of the sequence


Note - unlike re.match(), re.search() scans the entire string and returns the first occurrence of the pattern wherever it appears, not only at the start.
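The difference is easiest to see when the same pattern is tried with each function. The following is a small sketch reusing the sequence from the cells above; the variable names are illustrative, and re.fullmatch is included only for comparison because it requires the whole string to match.

```python
import re

sequence = "ATGCGTACGTAG"

# re.match only tries position 0, so "CGT" is not found
at_start = re.match(r"CGT", sequence)

# re.search scans forward and finds the first "CGT"
anywhere = re.search(r"CGT", sequence)

# re.fullmatch succeeds only if the entire string fits the pattern
whole = re.fullmatch(r"[ATGC]+", sequence)

print(at_start)          # None
print(anywhere.span())   # (3, 6)
print(whole is not None) # True
```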

# re.findall()
The re.findall function in Python's re module finds all non-overlapping occurrences of a pattern in a string and returns them as a list. Unlike re.match and re.search, which return a match object (or None), re.findall returns the matched text directly: a list of strings when the pattern has no capture group or exactly one, and a list of tuples when it has several.

In [48]:
# re.findall()
print("\nre.findall() Example:")
sequence = "ATGCGTACGTAG"
matches = re.findall(r"CGT", sequence)
print(f"All matches: {matches}")



re.findall() Example:
All matches: ['CGT', 'CGT']
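What re.findall returns depends on the capture groups in the pattern, which is a common source of surprises. A short sketch on the same sequence follows; the patterns are chosen only for illustration.

```python
import re

sequence = "ATGCGTACGTAG"

# No capture group: the whole match is returned
no_group = re.findall(r"CG[AT]", sequence)

# One capture group: only the captured part is returned
one_group = re.findall(r"CG([AT])", sequence)

# Two capture groups: a tuple per match
two_groups = re.findall(r"(CG)([AT])", sequence)

print(no_group)    # ['CGT', 'CGT']
print(one_group)   # ['T', 'T']
print(two_groups)  # [('CG', 'T'), ('CG', 'T')]
```

Writing a group as non-capturing, `(?:...)`, keeps the whole match in the result.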


# re.finditer()
The re.finditer function returns an iterator of match objects, one per non-overlapping match. This gives you not only the matched substrings but also their positions within the original string, which is useful for tasks that require precise location information.

In [49]:
import re

print("\nre.finditer() Example:")
sequence = "ATGCGTACGTAG"
matches = re.finditer(r"CGT", sequence)

print("All matches with indices:")
for match in matches:
    print(f"Match: {match.group()} at position: {match.start()}-{match.end()}")



re.finditer() Example:
All matches with indices:
Match: CGT at position: 3-6
Match: CGT at position: 7-10
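Because matches are non-overlapping, a motif that overlaps with itself is reported fewer times than it actually occurs. One common workaround is to wrap the motif in a zero-width lookahead and read the positions from finditer. The sequence below is a made-up example, not one from the notebook.

```python
import re

sequence = "ATATATA"  # hypothetical sequence where "ATA" overlaps itself

# Ordinary matching consumes each hit, so the middle occurrence is skipped
plain = [m.start() for m in re.finditer(r"ATA", sequence)]

# A lookahead matches without consuming characters, so every start is found
overlapping = [m.start() for m in re.finditer(r"(?=ATA)", sequence)]

print(plain)        # [0, 4]
print(overlapping)  # [0, 2, 4]
```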


# re.split()
The re.split function in Python's re module is used to split a string by the occurrences of a specified pattern. It works similarly to the built-in str.split method but allows for more complex and flexible patterns using regular expressions.

In [50]:
# re.split()
print("\nre.split() Example:")
sequence = "ATG-CGT-ACG-TAG"
parts = re.split(r"-", sequence)
print(f"Split sequence: {parts}")


re.split() Example:
Split sequence: ['ATG', 'CGT', 'ACG', 'TAG']


In [51]:
import re

# Define the DNA sequence with enzyme cut sites
sequence = "ATCGAATTCGCTTAAGCTTATGGAATTCTTCAAGCTTGGAATTCCAAGCTT"

# Define the pattern to match both EcoRI (GAATTC) and HindIII (AAGCTT) cut sites
pattern = r"GAATTC|AAGCTT"

# Split the sequence at the enzyme sites. The sites themselves are removed,
# and a trailing '' appears because the sequence ends with a site.
parts = re.split(pattern, sequence)

# Print the resulting fragments
print(f"Split sequence: {parts}")

Split sequence: ['ATC', 'GCTT', 'ATG', 'TTC', 'G', 'C', '']
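If the sites need to be kept, putting the pattern in a capturing group makes re.split include the delimiters in the result. Empty strings can be filtered out afterwards. This is a sketch on the same sequence, with illustrative variable names.

```python
import re

sequence = "ATCGAATTCGCTTAAGCTTATGGAATTCTTCAAGCTTGGAATTCCAAGCTT"

# Capturing group: the matched sites are returned between the fragments
parts_with_sites = re.split(r"(GAATTC|AAGCTT)", sequence)

# No group, empty strings dropped: fragments only
fragments = [p for p in re.split(r"GAATTC|AAGCTT", sequence) if p]

print(parts_with_sites)
print(fragments)  # ['ATC', 'GCTT', 'ATG', 'TTC', 'G', 'C']
```

Joining `parts_with_sites` back together reproduces the original sequence, which is a quick check that nothing was lost.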


# re.sub()
The re.sub function in Python's re module is used to search for a pattern in a string and replace it with a specified replacement string. This is useful for modifying strings by replacing certain patterns with new values.

In [52]:
# re.sub()
print("\nre.sub() Example:")
sequence = "ATGCGTACGTAG"
new_sequence = re.sub(r"CGT", "XXX", sequence)
print(f"Replaced sequence: {new_sequence}")


re.sub() Example:
Replaced sequence: ATGXXXAXXXAG
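Two related options are worth knowing: the `count` argument limits how many replacements are made, and re.subn also reports how many were made. A minimal sketch on the same sequence:

```python
import re

sequence = "ATGCGTACGTAG"

# Replace only the first occurrence
first_only = re.sub(r"CGT", "XXX", sequence, count=1)

# re.subn returns a (new_string, number_of_replacements) tuple
replaced, n = re.subn(r"CGT", "XXX", sequence)

print(first_only)   # ATGXXXACGTAG
print(replaced, n)  # ATGXXXAXXXAG 2
```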


In [53]:
import re

# Define the DNA sequence with codons
sequence = "ATGCGTTGGAAGCTATAGGGAATGTTATGA"

# Define the pattern and replacement dictionary
codon_to_aa = {
    r"ATG": "M",  # Methionine
    r"TGG": "W",  # Tryptophan
    r"GAA": "E",  # Glutamic acid
    r"TAA": "*",  # Stop codon
    r"TAG": "*",  # Stop codon
    r"TGA": "*"   # Stop codon
}

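# Function to replace codons with corresponding amino acid abbreviations.
# Note: each codon is replaced wherever it occurs, regardless of reading frame,
# and in dictionary order, so this is plain substitution rather than true translation.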
def translate_codons(sequence, codon_to_aa):
    for codon, aa in codon_to_aa.items():
        sequence = re.sub(codon, aa, sequence)
    return sequence

# Translate the DNA sequence into a protein sequence
protein_sequence = translate_codons(sequence, codon_to_aa)

# Print the original and modified sequences
print(f"Original DNA sequence: {sequence}")
print(f"Translated protein sequence: {protein_sequence}")


Original DNA sequence: ATGCGTTGGAAGCTATAGGGAATGTTATGA
Translated protein sequence: MCGTWAAGCTA*GGAMTTMA
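The output above still contains untranslated bases because the substitution ignores the reading frame. One possible frame-aware variant passes a function to re.sub so that each consecutive triplet is looked up exactly once. The helper name `translate_in_frame` and the "X" placeholder for codons missing from the small table are assumptions made for this sketch.

```python
import re

sequence = "ATGCGTTGGAAGCTATAGGGAATGTTATGA"

# Same partial codon table as above
codon_to_aa = {
    "ATG": "M", "TGG": "W", "GAA": "E",
    "TAA": "*", "TAG": "*", "TGA": "*",
}

def translate_in_frame(dna, table, unknown="X"):
    """Replace each consecutive triplet, starting at position 0, with its amino acid."""
    # The callback receives a match object for each triplet in turn
    return re.sub(r"[ACGT]{3}", lambda m: table.get(m.group(), unknown), dna)

protein = translate_in_frame(sequence, codon_to_aa)
print(protein)  # MXWXX*XMX*
```

Bases left over at the end of a sequence whose length is not a multiple of three are not touched by this pattern.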


# re.compile()
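The re.compile function in Python's re module compiles a regular expression pattern into a regex object. The object's methods (search, match, findall, finditer, split, sub) perform the same operations as the module-level functions. Compiling once keeps the pattern in one place and avoids reprocessing it each time it is used. Because the re module also caches recently used patterns internally, the speed-up is often modest, but it can matter when the same pattern is applied many times in a loop.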

In [54]:
# re.compile() (reuses `sequence` from the previous cell)
print("\nre.compile() Example:")
pattern = re.compile(r"CGT")
match = pattern.search(sequence)
if match:
    print(f"Compiled match found at position: {match.start()}")


re.compile() Example:
Compiled match found at position: 3
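The benefit of a compiled pattern shows most clearly when one pattern is applied to many strings. The list of sequences below is invented for this sketch.

```python
import re

# Compile once, outside the loop
motif = re.compile(r"CGT")

# Hypothetical batch of sequences
sequences = ["ATGCGTACGTAG", "TTTTAAAA", "CGTCGT"]

# The compiled object exposes the same methods as the re module
counts = {s: len(motif.findall(s)) for s in sequences}

for s, c in counts.items():
    print(f"{s}: {c} occurrence(s) of {motif.pattern}")
```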


In [55]:
import re

# Define the DNA sequence
sequence = (
    "ATGCGTACTGGTAAATGCGTACGTAGATGCGTACCTGAATGCGTACTGATGA"
)

# Define the pattern for extracting gene sequences
# The pattern starts with ATG and ends with TAA, TAG, or TGA
pattern = re.compile(r"ATG(?:[ATGC]{3})*?(?:TAA|TAG|TGA)")

# Find all gene sequences
matches = pattern.findall(sequence)

# Print the extracted gene sequences
print("Extracted gene sequences:")
for match in matches:
    print(match)


Extracted gene sequences:
ATGCGTACGTAG
ATGCGTACCTGA
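To also see where each extracted gene lies, the same compiled pattern can be used with finditer instead of findall. The sketch below repeats the sequence and pattern from the cell above; only the variable `genes` is new.

```python
import re

sequence = "ATGCGTACTGGTAAATGCGTACGTAGATGCGTACCTGAATGCGTACTGATGA"
pattern = re.compile(r"ATG(?:[ATGC]{3})*?(?:TAA|TAG|TGA)")

# Collect (matched text, start, end) for every gene-like match
genes = [(m.group(), m.start(), m.end()) for m in pattern.finditer(sequence)]

for text, start, end in genes:
    print(f"{text} at {start}-{end} ({end - start} bases)")
```

The ATG at position 0 is not reported because no stop codon occurs in its reading frame.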


# Practice Questions

# Q1
Given a mixed string containing multiple DNA sequences separated by various delimiters (e.g., commas, semicolons, and spaces), write a function to extract all valid DNA sequences. A valid DNA sequence consists only of the nucleotides A, T, G, and C. Additionally, validate each extracted sequence to ensure it is at least 10 nucleotides long.

seq = "ATGCGTACGTAG;TTAGCGTAT,CGTATGGC ATGNNNNNNNN,CCCGTGATGAAGT"

In [18]:
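#write your code here

import re

seq = "ATGCGTACGTAG;TTAGCGTAT,CGTATGGC ATGNNNNNNNN,CCCGTGATGAAGT"

def extract_valid_sequences(text, min_length=10):
    """Return the delimiter-separated tokens made only of A/T/G/C and at least min_length long."""
    # Split on commas, semicolons, and whitespace
    tokens = re.split(r"[,;\s]+", text)
    # fullmatch rejects a token that contains any other character (e.g. N)
    valid = re.compile(rf"[ATGC]{{{min_length},}}")
    return [token for token in tokens if valid.fullmatch(token)]

for string in extract_valid_sequences(seq):
    print(string)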



ATGCGTACGTAG
CCCGTGATGAAGT


# Q2
Write a function to identify and extract variable length tandem repeats (e.g., "CAG", "GAA") from a DNA sequence. A tandem repeat is a short unit that appears two or more times in a row.

seq = "CAGCAGGAAAAGGAAATGCGTCAGCAGCAG"


In [30]:
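#write your code here
import re

seq = "CAGCAGGAAAAGGAAATGCGTCAGCAGCAG"

def find_tandem_repeats(sequence):
    """Return every run of two or more consecutive CAG or GAA units."""
    # Non-capturing groups with {2,} require at least two copies in a row
    pattern = re.compile(r"(?:CAG){2,}|(?:GAA){2,}", re.IGNORECASE)
    return [match.group() for match in pattern.finditer(sequence)]

for string in find_tandem_repeats(seq):
    print(string)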






CAGCAG
CAGCAGCAG
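The solution above lists the repeat units explicitly. A backreference lets the pattern find a repeat of any unit without naming it: `([ACGT]{3})` captures a unit and `\1` requires the same text again. The function name and its parameters are assumptions made for this sketch, and the unit length is fixed at three bases by default.

```python
import re

seq = "CAGCAGGAAAAGGAAATGCGTCAGCAGCAG"

def find_any_tandem_repeats(sequence, unit_len=3, min_copies=2):
    """Return (unit, copies, start, end) for each run of a repeated unit."""
    # \1 must match the captured unit again at least (min_copies - 1) times
    pattern = re.compile(rf"([ACGT]{{{unit_len}}})\1{{{min_copies - 1},}}")
    results = []
    for m in pattern.finditer(sequence):
        unit = m.group(1)
        copies = (m.end() - m.start()) // unit_len
        results.append((unit, copies, m.start(), m.end()))
    return results

repeats = find_any_tandem_repeats(seq)
for unit, copies, start, end in repeats:
    print(f"{unit} x{copies} at {start}-{end}")
```

With this pattern a run of a single base such as AAAAAA also counts as a repeat of the unit AAA, which may or may not be wanted.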


# Q3
Given a DNA sequence, write a function to identify and extract nested motifs using regular expressions with lookaheads. Specifically, find all occurrences of the motif "CG" that are followed by exactly two "A"s and another "CG". For each match, return the motif along with its start and end indices in the sequence.

In [None]:
#write your code here


Nested motifs with indices:
Repeat: CG at position: 2-4
Repeat: CG at position: 15-17


In [32]:
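#write your code here
import re

seq = "CAGCGAACGCAGGAAAAGGAAATGCGTCAGCAGCAG"
# The lookahead (?=AACG) requires "AA" and then "CG" right after the motif
# without including them in the match, so only "CG" itself is reported.
exp = r"CG(?=AACG)"

pattern = re.compile(exp, re.IGNORECASE)
matches = pattern.finditer(seq)

print("Nested motifs with indices:")
for match in matches :
    print(f"Repeat: {match.group()} at position: {match.start()}-{match.end()}")



Nested motifs with indices:
Repeat: CG at position: 3-5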
