## Finding a Protein Motif

### Background
A structural and functional unit of the protein is a **protein domain**: in terms of the protein's primary structure, the domain is an interval of amino acids that can evolve and function independently.

Just like species, proteins can evolve, forming homologous groups called protein families. Proteins from one family usually have the same set of domains, performing similar functions; see Figure 1.

A component of a domain essential for its function is called a **motif**, a term that in general has the same meaning as it does in nucleic acids, although many other terms are also used (blocks, signatures, fingerprints, etc.). Protein motifs are usually evolutionarily conserved, meaning that they appear without much change in different species.

Proteins are identified in different labs around the world and gathered into freely accessible databases. A central repository for protein data is **[UniProt](https://www.uniprot.org/)**, which provides detailed protein annotation, including function description, domain structure, and post-translational modifications. UniProt also supports protein similarity search, taxonomy analysis, and literature citations.

### UniProt
To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
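
The shorthand maps almost directly onto regular-expression syntax, which a small sketch can make concrete. The helper names `motif_to_regex` and `find_motif` are hypothetical, and the toy sequence is made up for illustration. The sketch assumes the shorthand contains only plain letters, `[..]` groups, and `{..}` exclusions, and it wraps the pattern in a lookahead so that overlapping occurrences are not skipped.

```python
import re

def motif_to_regex(motif: str) -> str:
    """Translate the motif shorthand into a regular expression (hypothetical helper)."""
    # {X} means "any amino acid except X", which is [^X] in regex syntax.
    # [XY] already means "either X or Y" in regex syntax, so it is left alone.
    body = motif.replace("{", "[^").replace("}", "]")
    # A lookahead matches without consuming characters, so overlapping hits are kept.
    return f"(?=({body}))"

def find_motif(sequence: str, motif: str) -> list:
    """Return the 1-based start position of every occurrence of the motif."""
    pattern = re.compile(motif_to_regex(motif))
    return [match.start() + 1 for match in pattern.finditer(sequence)]

print(motif_to_regex("N{P}[ST]{P}"))              # (?=(N[^P][ST][^P]))
print(find_motif("ANNTSAPNPSA", "N{P}[ST]{P}"))  # [2, 3]
```

In the toy sequence the two hits overlap (`NNTS` at position 2 and `NTSA` at position 3), while `NPSA` is rejected because a proline follows the asparagine.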

To view the complete description and features of a particular protein in the UniProt database, insert its access ID in place of "uniprot_id" in the following URL:
```
http://www.uniprot.org/uniprot/uniprot_id
```

Alternatively, you can obtain the protein sequence in FASTA format by appending `.fasta` to that URL:
```
http://www.uniprot.org/uniprot/uniprot_id.fasta
```

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.
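
As a rough sketch of what happens with such a URL, the snippet below builds the FASTA address for an access ID and parses FASTA text into a name-to-sequence dictionary. Both function names are hypothetical. The URL pattern is the `rest.uniprot.org` form used by the request in the code cell of this notebook, not the older form quoted above, and the sample record is invented so that the example runs without network access.

```python
def fasta_url(uniprot_id: str) -> str:
    """Build the FASTA URL for an access ID (assumed REST-style pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"

def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; drop the leading '>' from the header.
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}

# Made-up record for illustration only; real records come from UniProt.
sample = """>sp|X00000|DEMO_TOY Made-up record for illustration
MKNKFKTQEE
LVNHLKTVGF
"""

print(fasta_url("B5ZC00"))
print(parse_fasta(sample))
```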

### Problem
**Given:** At most 15 UniProt Protein Database access IDs.

**Return:** For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.

### Example
Input:
```
A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST
```

Output:
```
B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614
```
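
The expected output has two properties worth spelling out: proteins without the motif (here A2Z669) are omitted, and each reported protein is labelled with the ID exactly as given, even when that ID carries a suffix such as `_TRBM_HUMAN`. A minimal sketch of that reporting step follows. The names `accession` and `motif_report` are hypothetical, the sequences are toy strings rather than real proteins, and splitting on the first underscore is an assumption about how the suffixed IDs are structured.

```python
import re

# N, then anything but P, then S or T, then anything but P; lookahead keeps overlaps.
MOTIF = re.compile(r"(?=N[^P][ST][^P])")

def accession(given_id: str) -> str:
    """Strip an optional suffix such as '_TRBM_HUMAN' (assumed ID layout)."""
    return given_id.split("_", 1)[0]

def motif_report(sequences: dict) -> str:
    """Format {given_id: sequence} in the layout of the expected output."""
    lines = []
    for given_id, sequence in sequences.items():
        positions = [match.start() + 1 for match in MOTIF.finditer(sequence)]
        if positions:  # proteins without the motif are left out entirely
            lines.append(given_id)
            lines.append(" ".join(str(position) for position in positions))
    return "\n".join(lines)

# Toy sequences for illustration; TOY_A contains no motif and is skipped.
toy = {"TOY_A": "MKPLAAG", "TOY_B": "ANNTSAPNPSA"}
print(accession("P07204_TRBM_HUMAN"))  # P07204
print(motif_report(toy))
```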

In [8]:
import requests
from wasims_toolbox import read_fasta

def get_protein_sequence(access_id: str) -> dict:
    
    """
    Returns protein sequence from given UniProt access ID
    as dictionary with key-value pairs of name to sequence.
    
    Args:
        access_id (str): UniProt Protein Database access ID
    
    Returns:
        dict: Protein name and sequence
    """
    
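    # Fetch the FASTA record and fail loudly on HTTP errors instead of parsing an error page.
    response = requests.get(f"https://rest.uniprot.org/uniprotkb/{access_id}.fasta", timeout=30)
    response.raise_for_status()
    protein = read_fasta(sequences=response.text)
    return protein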

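proteins = {}
# IDs given with a suffix (e.g. "P07204_TRBM_HUMAN") are shortened to the accession before the underscore.
access_ids = ["A2Z669", "B5ZC00", "P07204", "P20840"]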

for access_id in access_ids:
    protein = get_protein_sequence(access_id)
    proteins.update(protein)

proteins

{'sp|A2Z669|CSPLT_ORYSI CASP-like protein 5A2 OS=Oryza sativa subsp. indica OX=39946 GN=OsI_33147 PE=3 SV=1': 'MRASRPVVHPVEAPPPAALAVAAAAVAVEAGVGAGGGAAAHGGENAQPRGVRMKDPPGAPGTPGGLGLRLVQAFFAAAALAVMASTDDFPSVSAFCYLVAAAILQCLWSLSLAVVDIYALLVKRSLRNPQAVCIFTIGDGITGTLTLGAACASAGITVLIGNDLNICANNHCASFETATAMAFISWFALAPSCVLNFWSMASR',
 'sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) OX=565575 GN=glyQS PE=3 SV=1': 'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQKDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSSNEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVNFKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKYLNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYDLSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILMDLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIYCLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK',
 'sp|P07204|TRBM_HUMAN Thrombomodulin OS=Homo sapiens OX=9606 GN=

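In [9]:
import re

def find_n_glycosylation_motifs(protein: str) -> list:
    
    """
    Returns locations of the N-glycosylation motif
    N{P}[ST]{P} in the given protein sequence.
    
    Args:
        protein (str): Protein sequence
    
    Returns:
        list: 1-based positions in the protein chain where the 
        N-glycosylation motif occurs.
    """

    # The lookahead matches without consuming characters, so overlapping
    # occurrences (such as 115 and 116 in the example output) are all reported.
    return [match.start() + 1 for match in re.finditer(r"(?=N[^P][ST][^P])", protein)]

# The function expects a sequence, not an access ID, so fetch the sequence first.
sequence = next(iter(get_protein_sequence("B5ZC00").values()))
find_n_glycosylation_motifs(sequence)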