## Homework 8:

__Exercise 1.__ Go to PROSITE and find the Gamma-glutamyl phosphate reductase signature. Write a script to detect all yeast proteins that have this signature. Your script should output a dataframe with the following information about each protein: accession number, match to the motif, span of the motif, and the protein's description.

**NOTE:** You can just copy the file with all the yeast proteins from last class into the current directory; you don't need to download it again.

```
[VA]-x(5)-A-[LIVAMTCK]-x-[HWFY]-[IM]-x(2)-[HYWNRFT]-[GSNT]-[STAG]-x(0,1)-H-[ST]-[DE]-x(1,2)-I
```
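Translating PROSITE notation into a Python regex is mostly mechanical, so it can be sketched as a small helper. This is a minimal sketch under stated assumptions: the name `prosite_to_regex` is hypothetical (it is not part of Biopython or any PROSITE tool), it writes `x` as `.` rather than `\w`, and it handles only the common syntax elements (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`).

```python
import re

def prosite_to_regex(prosite):
    """Hypothetical helper: convert a PROSITE pattern string to a regex string."""
    regex = prosite.strip().rstrip(".")                   # drop the optional trailing period
    regex = regex.replace("-", "")                        # '-' only separates positions
    regex = regex.replace("x", ".")                       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4) -> .{2,4}
    regex = regex.replace("<", "^").replace(">", "$")     # N- / C-terminal anchors
    return regex

signature = ("[VA]-x(5)-A-[LIVAMTCK]-x-[HWFY]-[IM]-x(2)-[HYWNRFT]-"
             "[GSNT]-[STAG]-x(0,1)-H-[ST]-[DE]-x(1,2)-I")
gamma_regex = prosite_to_regex(signature)
print(gamma_regex)
```

The resulting string can be passed to `re.search` in the same way as a hand-written pattern.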

In [1]:
from Bio import SeqIO
import pandas as pd
import re

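# Regex for Y-x-[NQHD]-[KHR]-[DE]-[IVA]-F-[LM]-R-[ED]; this is not the
# Gamma-glutamyl phosphate reductase signature quoted above.
pattern = r"Y\w[NQHD][KHR][DE][IVA]F[LM]R[ED]"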
protFile = "orf_trans.fasta"

def findinseqfile(pattern, filein):
    information = []
    for seq_record in SeqIO.parse(filein, "fasta"):
        result = re.search(pattern,str(seq_record.seq))
        if result:
            information.append([seq_record.name, result.group(), result.span(), str(seq_record.seq)])

    return pd.DataFrame(information, columns=['acc','match','start_end','seq'])

findinseqfile(pattern, protFile)

| | acc | match | start_end | seq |
|---|---|---|---|---|
| 0 | YMR186W | YSNKEIFLRE | (23, 33) | MAGETFEFQAEITQLMSLIINTVYSNKEIFLRELISNASDALDKIR... |
| 1 | YPL240C | YSNKEIFLRE | (23, 33) | MASETFEFQAEITQLMSLIINTVYSNKEIFLRELISNASDALDKIR... |


__Exercise 2.__ Now do the same for the Hexapeptide-repeat containing-transferases signature.

```
[LIV]-[GAED]-x(2)-[STAV]-x-[LIV]-x(3)-[LIVAC]-x-[LIV]-[GAED]-x(2)-[STAVR]-x-[LIV]-[GAED]-x(2)-[STAV]-x-[LIV]-x(3)-[LIV]
```

In [7]:
ex2 = r"[LIV][GAED]\w{2}[STAV]\w[LIV]\w{3}[LIVAC]\w[LIV][GAED]\w{2}[STAVR]\w[LIV][GAED]\w{2}[STAV]\w[LIV]\w{3}[LIV]"
findinseqfile(ex2, protFile)

| | acc | match | start_end | seq |
|---|---|---|---|---|
| 0 | YDL055C | IDPTAKISSTAKIGPDVVIGPNVTIGDGV | (256, 285) | MKGLILVGGYGTRLRPLTLTVPKPLVEFGNRPMILHQIEALANAGV... |
| 1 | YJL218W | IGGGVSIIPGVNIGKNSVIAAGSVVIRDI | (138, 167) | MGVLENIVPGELYDANYDPDLLKIRKETKIKLHEYNTLSPADENKK... |


__Exercise 3.__ Now find the 14-3-3 protein signatures. The 14-3-3 proteins seem to have multiple biological activities and play a key role in signal transduction pathways and the cell cycle. The PROSITE database uses two motifs to identify members of this family.

Write a script to search for proteins in yeast that have both domains in either order. You should find two proteins.

Your script should show a dataframe with the following for each protein: accession number, match to the first motif, span of the first motif, match to the second motif, span of the second motif, and the protein's description.

Although your regex doesn't need to match the domains in the reverse order to identify both yeast proteins, for the purpose of this exercise I would like you to write a regex that can also identify that case.

```
[RA]-N-L-[LIV]-S-[VG]-[GA]-Y-[KN]-N-[IVA]
```

and

```
Y-K-[DE]-[SG]-T-L-I-[IML]-Q-L-[LF]-[RHC]-D-N-[LF]-T-[LS]-W-[TANS]-[SAD]
```
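One way to meet the either-order requirement with a single regex is to wrap each motif in a lookahead, so that both are searched from the start of the sequence independently of their relative position. The sketch below is only one possible approach; the names `both_motifs` and `find_both` and the toy sequences are made up for illustration.

```python
import re

motif1 = r"[RA]NL[LIV]S[VG][GA]Y[KN]N[IVA]"
motif2 = r"YK[DE][SG]TLI[IML]QL[LF][RHC]DN[LF]T[LS]W[TANS][SAD]"

# Each lookahead scans from the start of the sequence, so the two motifs
# may appear in either order; named groups keep the text and span of each.
both_motifs = re.compile(rf"(?=.*?(?P<m1>{motif1}))(?=.*?(?P<m2>{motif2}))")

def find_both(sequence):
    """Return match/span info for both motifs, or None if either is missing."""
    hit = both_motifs.match(sequence)
    if hit is None:
        return None
    return {
        "match1": hit.group("m1"), "start_end1": hit.span("m1"),
        "match2": hit.group("m2"), "start_end2": hit.span("m2"),
    }

# Toy sequences (made up for illustration, not real yeast proteins)
forward = "MSQ" + "RNLLSVAYKNV" + "GGG" + "YKDSTLIMQLLRDNLTLWTS"
reverse = "MSQ" + "YKDSTLIMQLLRDNLTLWTS" + "GGG" + "RNLLSVAYKNV"
print(find_both(forward))
print(find_both(reverse))
```

Running two separate `re.search` calls and requiring both to succeed is equivalent and arguably easier to read; the single-regex form is shown only because the exercise asks for it.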

In [10]:
def findinseqfile(pattern, p2, filein):
    information = []
    for seq_record in SeqIO.parse(filein, "fasta"):
        result = re.search(pattern,str(seq_record.seq))
        re1 = re.search(p2,str(seq_record.seq))
        if result and re1:
            information.append([seq_record.name, result.group(), re1.group(), result.span(), re1.span(), str(seq_record.seq)])

    return pd.DataFrame(information, columns=['acc','match1', 'match2','start_end1','start_end2','seq'])

pattern = '[RA]NL[LIV]S[VG][GA]Y[KN]N[IVA]'
p2 = 'YK[DE][SG]TLI[IML]QL[LF][RHC]DN[LF]T[LS]W[TANS][SAD]'

findinseqfile(pattern, p2, protFile)

| | acc | match1 | match2 | start_end1 | start_end2 | seq |
|---|---|---|---|---|---|---|
| 0 | YDR099W | RNLLSVAYKNV | YKDSTLIMQLLRDNLTLWTS | (42, 53) | (215, 235) | MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLL... |
| 1 | YER177W | RNLLSVAYKNV | YKDSTLIMQLLRDNLTLWTS | (42, 53) | (215, 235) | MSTSREDSVYLAKLAEQAERYEEMVENMKTVASSGQELSVEERNLL... |


__Exercise 4.__ Parsing and extracting data from a URL:

This is from the tutorial that you should have completed.

When working with files and resources over a network, you will often come across URIs and URLs, which can be parsed and worked with directly. Most standard libraries have classes to parse and construct these kinds of identifiers, but if you need to match them in logs or a larger corpus of text, you can use regular expressions to pull information out of their structured format quite easily.

URIs, or Uniform Resource Identifiers, are a representation of a resource that is generally composed of a scheme, host, port (optional), and resource path, respectively highlighted below.

http://regexone.com:80/page

The scheme describes the protocol to communicate with, the host and port describe the source of the resource, and the full path describes the location at the source for the resource.
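As a point of comparison for the regex approach, the components described above can also be read with Python's standard-library parser. This is a small illustration using `urllib.parse.urlsplit` on the example URI; it is not part of the exercise itself.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://regexone.com:80/page")
print(parts.scheme)    # http
print(parts.hostname)  # regexone.com
print(parts.port)      # 80 (an int, or None when no port is given)
print(parts.path)      # /page
```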

In the exercise below, try to extract the protocol, host and port of all the resources listed in this string.

```
ftp://file_server.com:21/top_secret/life_changing_plans.pdf
https://regexone.com/lesson/introduction#section
file://localhost:4040/zip_file
https://s3cur3-server.com:9999/
market://search/angry%20birds
```

You can work interactively here: https://regexone.com/problem/extracting_url_data to find the right regular expression, then use re.finditer to create a dataframe with columns protocol, host and port for each of the matches in the string.
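The `re.finditer` approach can be sketched as follows. The variable names and the named groups (`protocol`, `host`, `port`) are illustrative choices, and the pattern is one of several that work for these five URLs; it is not claimed to be a general URL parser.

```python
import re

text = """ftp://file_server.com:21/top_secret/life_changing_plans.pdf
https://regexone.com/lesson/introduction#section
file://localhost:4040/zip_file
https://s3cur3-server.com:9999/
market://search/angry%20birds"""

# protocol, then "://", then host, then an optional ":port" (digits only captured)
url_pattern = re.compile(r"(?P<protocol>\w+)://(?P<host>[\w\-.]+)(?::(?P<port>\d+))?")

# One dict per match; port is None when the URL has no explicit port
rows = [m.groupdict() for m in url_pattern.finditer(text)]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` gives one row per URL with the columns `protocol`, `host` and `port`.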

In [8]:
link = ['ftp://file_server.com:21/top_secret/life_changing_plans.pdf']
link.append('https://regexone.com/lesson/introduction#section')
link.append('file://localhost:4040/zip_file')
link.append('https://s3cur3-server.com:9999/')
link.append('market://search/angry%20birds')
    
data = {}

for i in range(len(link)):
    tbl = 'link'+ str(i+1)
    data[tbl] = {}
               
    prot_search = re.search(r'(\w+)://', str(link[i]))
    prot = prot_search.group(1)
    data[tbl]['protocol'] = prot
    
    host_search = re.search(r'://([\w\-\.]+)', str(link[i]))
    host = host_search.group(1)
    data[tbl]['host'] = host

    port = 'N/a'
    portsearch = re.search(r'(:(\d+))', str(link[i]))
    if portsearch:
        port = portsearch.group(1)
    data[tbl]['port'] = port
                     
    path = link[i]
    data[tbl]['path'] = path
    
pd.DataFrame(data)


| | link1 | link2 | link3 | link4 | link5 |
|---|---|---|---|---|---|
| protocol | ftp | https | file | https | market |
| host | file_server.com | regexone.com | localhost | s3cur3-server.com | search |
| port | :21 | N/a | :4040 | :9999 | N/a |
| path | ftp://file_server.com:21/top_secret/life_chang... | https://regexone.com/lesson/introduction#section | file://localhost:4040/zip_file | https://s3cur3-server.com:9999/ | market://search/angry%20birds |
