# Problem Solving with Regular Expressions

### 1. Extracting components from a FASTA file using regular expressions
```
hsbgpg1.fasta

>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACC
ATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGTGCCCC
CACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCCCTTTG
GCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGCAGAGG
GAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGGTGAAC
CAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGATGGACC
TGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCTGTCTG
GCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTCTGCTC
CACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGGCTGGG
GTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGGGAGTG
GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGGAGAGG
AGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCGGATCC
CCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTTCAGGA
GGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTGCTCCT
CTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCCCCATC
ATCCCAGCTGCTCCCAAATAAACTCCAGAAG
```

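The pattern below captures the sequence ID, the description, and only the first line of the sequence, because `.` does not match a newline by default. Handling the multi-line sequence still needs further work.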

In [10]:
import re

with open("hsbgpg1.fasta") as F:
    s=F.read()
    w=re.search(r">(.+?)(\s.*?)\n(.*)",s)
    if w:
        seqid=w.group(1).strip()
        seqdesc=w.group(2).strip()
        seq=w.group(3).strip()
        print(seqid)
        print(seqdesc)
        print(seq)

HSBGPG
Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
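
One possible way to pick up the whole multi-line sequence is to let the last group span newlines with `re.DOTALL` and then strip the whitespace. This is a minimal sketch, not the notebook's own solution: the helper name `parse_fasta_record` is hypothetical, the text is assumed to hold exactly one record, and the record below is shortened for illustration.

```python
import re

def parse_fasta_record(text):
    """Return (seqid, description, sequence) for a single FASTA record, or None."""
    # re.DOTALL lets the last group run across newlines to the end of the text.
    # Assumption: the text contains exactly one record.
    m = re.search(r">(\S+)[ \t]*([^\n]*)\n(.*)", text, re.DOTALL)
    if m is None:
        return None
    # Remove line breaks and any other whitespace inside the sequence.
    seq = re.sub(r"\s+", "", m.group(3))
    return m.group(1), m.group(2).strip(), seq

# Shortened record, for illustration only
record = ">HSBGPG Human gene for bone gla protein (BGP)\nGGCAGATTCC\nATAAACAGTG\n"
seqid, seqdesc, seq = parse_fasta_record(record)
print(seqid)
print(seqdesc)
print(seq)
```

For the real file, the same function could be called on the string returned by `F.read()`.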


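### 2. Splitting a DNA string at EcoRI sites (G/AATTC)

The pattern below prints only the part of each fragment that ends at the cut site. The `AATTC` half captured by the second group, and the fragment after the last site, still need to be handled correctly.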


In [12]:
import re
dna="AAACCCCGAATTCGGCTGTGAATTCCCCCCCCGAATTCTTGTATA"
p=re.compile("(.*?G)(AATTC.*?)")
L=p.findall(dna)
for x in L:
    print(x[0])

AAACCCCG
GGCTGTG
CCCCCCCG
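
One way to keep both halves of each site is to split at the zero-width position between `G` and `AATTC`, using a lookbehind and a lookahead. This is a sketch rather than the notebook's own answer, and it assumes Python 3.7 or later, where `re.split` accepts patterns that match an empty string.

```python
import re

dna = "AAACCCCGAATTCGGCTGTGAATTCCCCCCCCGAATTCTTGTATA"

# Split at the empty position that has G on its left and AATTC on its right.
# Nothing is consumed, so joining the fragments gives back the original string.
fragments = re.split(r"(?<=G)(?=AATTC)", dna)
for frag in fragments:
    print(frag)
```

Every fragment after the first starts with `AATTC`, and the trailing piece after the last site is included.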


### 3. Display the CDS with the highest GC content
The target pattern is ATG......(TGA|TAA|TAG): a start codon, some bases, then one of the three stop codons.

In [24]:
import re
def getGC(seq):
    return (seq.count("G")+seq.count("C"))/len(seq)*100

def sortongc(d):
    return d['gc']

dna="AAATGGCGCGCGCGTAAGTGACCATGGGGGGTGAGAAAATGGGGGG"
# Group the three stop codons so the alternation applies only to the stop position
L=re.findall("(ATG.+?)(TGA|TAA|TAG)",dna)
FL=[]
for x in L:
    d={}
    d["seq"]=x[0]
    d["gc"]=getGC(x[0])
    FL.append(d)

FL.sort(reverse=True,key=sortongc)

print(FL[0]["seq"])

ATGGCGCGCGCG
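
A lazy `.+?` does not enforce a reading frame, so a stop codon can be matched out of frame. The sketch below shows one possible stricter variant: it assumes a CDS is `ATG` followed by whole codons up to the first in-frame stop, and it includes the stop codon in the reported sequence. The names `find_cds` and `gc_percent` are hypothetical, and only non-overlapping matches are reported.

```python
import re

def gc_percent(seq):
    """GC content of seq as a percentage."""
    return (seq.count("G") + seq.count("C")) / len(seq) * 100

def find_cds(dna):
    """Return non-overlapping ATG...stop stretches whose length is a multiple of 3."""
    # (?:[ACGT]{3})*? consumes whole codons lazily, so the stop codon is in frame.
    pattern = r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)"
    return [m.group(0) for m in re.finditer(pattern, dna)]

dna = "AAATGGCGCGCGCGTAAGTGACCATGGGGGGTGAGAAAATGGGGGG"
cds_list = find_cds(dna)
if cds_list:
    best = max(cds_list, key=gc_percent)
    print(best)
```

With this stricter definition, the second and third `ATG` in the example string have no in-frame stop codon and are skipped.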


### 4. Delete all occurrences of a motif from a DNA string


In [26]:
import re
dna="AAATGGCGCGCGCGTAAGTGACCATGGGGGGTGAGAAAATGGGGGG"
motif="ATG.C"  # regex motif: ATG, any one base, then C
s=re.sub(motif,"",dna)
print(s)

AAGCGCGCGTAAGTGACCATGGGGGGTGAGAAAATGGGGGG
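
When the motif is a literal string rather than a pattern, it can help to escape it and to count how many copies were removed. A minimal sketch, assuming the literal motif `ATG`: `re.escape` protects any regex metacharacters, and `re.subn` returns both the new string and the number of substitutions.

```python
import re

dna = "AAATGGCGCGCGCGTAAGTGACCATGGGGGGTGAGAAAATGGGGGG"
motif = "ATG"  # treated as a literal string, not as a regex

# re.subn returns a tuple: (string after deletion, number of matches removed)
cleaned, n_removed = re.subn(re.escape(motif), "", dna)
print(cleaned)
print(n_removed)
```

Deleting a motif can in general bring neighbouring bases together into a new copy of it, so a single pass does not guarantee a motif-free result for every input.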
