# BINF 200 - Biological structures - 1

In this assignment, we will query and navigate biological knowledgebases programmatically and manually. Please read the text carefully and execute the code chunks in sequential order.

The chunks in this notebook present the following icons, indicating what is expected of you:
- ❓: Please edit the text to fill in your answer.
- ▶: Check what the code is doing and run the chunk.
- 💻: Please edit the code according to the instructions and run the chunk.


### ❓ Your name in amino acids

Edit the table below so that the first column contains the letters of your first and last name, one letter per row.

| Letter | Amino acid? | Name | Polarity | Net charge at pH 7.4 |
| - | - | - | - | - |
| M | Yes | Methionine | Nonpolar | Neutral |
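
The lookup behind the first three columns can be sketched in a few lines of Python. This is a minimal sketch, not part of the original assignment: the function name `letters_to_amino_acids` and the example name are hypothetical, and only the 20 standard amino acids are counted. Letters such as B, J, X, and Z are ambiguity codes, and U and O denote selenocysteine and pyrrolysine, so all of these are reported as "No" here. Polarity and net charge are left for you to look up.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACID_NAMES = {
    "A": "Alanine", "C": "Cysteine", "D": "Aspartic acid", "E": "Glutamic acid",
    "F": "Phenylalanine", "G": "Glycine", "H": "Histidine", "I": "Isoleucine",
    "K": "Lysine", "L": "Leucine", "M": "Methionine", "N": "Asparagine",
    "P": "Proline", "Q": "Glutamine", "R": "Arginine", "S": "Serine",
    "T": "Threonine", "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
}

def letters_to_amino_acids(name):
    """Return (letter, is_amino_acid, amino_acid_name) for each letter of name."""
    rows = []
    for letter in name.upper():
        if not letter.isalpha():
            continue  # skip spaces, hyphens, and other non-letters
        aa_name = AMINO_ACID_NAMES.get(letter)
        rows.append((letter, aa_name is not None, aa_name or "-"))
    return rows

# Hypothetical example name; replace with your own.
for letter, is_aa, aa_name in letters_to_amino_acids("Ada Bo"):
    print(f"| {letter} | {'Yes' if is_aa else 'No'} | {aa_name} |")
```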


### ▶ Download human proteins from UniProt

The code below downloads a table of reviewed human proteins that was exported from UniProt and is stored in the course repository.

In [1]:
import requests
from io import BytesIO
import gzip
import pandas as pd

response = requests.get("https://github.com/mvaudel/BINF200-bio-sequences-structures/raw/main/assignments/assignment_3/resources/uniprotkb_reviewed_true_AND_model_organ_2023_10_28.tsv.gz")

# When getting something from a URL, the response code 200 indicates a success
if response.status_code == 200:

    # Get the content of the response
    content = response.content

    # Create a BytesIO object to work with the gzipped content
    with BytesIO(content) as bio:

        # Use gzip to decompress the content
        with gzip.open(bio, 'rb') as f:

            # Use pandas to read the tab-separated table (adjust the options as needed)
            proteins_df = pd.read_csv(f, delimiter='\t', header=0)

# If we did not get an HTTP code 200, check the meaning of the code to troubleshoot the issue
else:
    print(f"Failed to download the file. HTTP status code: {response.status_code}.")


### 💻 Select a protein

Edit the code below to replace `YOUR_NAME` with your actual name and run the chunk to sample a protein.

💡 In the unlikely event that you get the same protein as me or one of your friends, change the string `my_name` again to get a new protein, e.g. using your cat's name.

In [2]:
# Your name
my_name = "YOUR_NAME"

# Sample protein
index = hash(my_name)

if index < 0:
    index += 2**32

index = index % len(proteins_df)

print(f"Congratulations! Your protein accession is: {proteins_df['Entry'][index]}")


Congratulations! Your protein accession is: Q14117
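
Note that Python salts the built-in `hash()` of strings per interpreter process unless `PYTHONHASHSEED` is set, so the chunk above may return a different protein after a kernel restart. Also, `%` with a positive divisor never returns a negative number in Python, so the sign correction is not strictly needed. If a reproducible choice is preferred, a deterministic variant could look like the sketch below. The helper `stable_index` and the table size are hypothetical, and this variant selects a different protein than the chunk above.

```python
import hashlib

def stable_index(name, n_rows):
    """Map a name to a row index in [0, n_rows) that is identical in every session."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then wrap to the table size.
    return int.from_bytes(digest[:8], "big") % n_rows

# Hypothetical table size; in the notebook this would be len(proteins_df).
n_rows = 20000
print(stable_index("YOUR_NAME", n_rows))
```

In the notebook, the result would be used in place of `index`, e.g. `proteins_df['Entry'][stable_index(my_name, len(proteins_df))]`.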


🕵 Instructions

Work through the questions below and fill in the answers using your protein. Examples with my protein are provided for guidance.

### ❓ Protein information

Find your protein in UniProt using a web browser. Edit the table below with information on your protein.

| Attribute | Value |
| - | - |
| Protein Accession | Q9NSA1 |
| Gene name | FGF21 |
| Protein name | Fibroblast growth factor 21 |

### ❓ Protein function

What is the function of your protein according to UniProt?

> Stimulates glucose uptake in differentiated adipocytes via the induction of glucose transporter SLC2A1/GLUT1 expression (but not SLC2A4/GLUT4 expression). Activity requires the presence of KLB. Regulates systemic glucose homeostasis and insulin sensitivity.

### ❓ Protein sequence

What is the sequence of your protein according to UniProt?

> MDSDETGFEHSGLWVSVLAGLLLGACQAHPIPDSSPLLQFGGQVRQRYLYTDDAQQTEAHLEIREDGTVGGAADQSPESLLQLKALKPGVIQILGVKTSRFLCQRPDGALYGSLHFDPEACSFRELLLEDGYNVYQSEAHGLPLHLPGNKSPHRDPAPRGPARFLPLPGLPPALPEPPGILAPQPPDVGSSDPLSMVGPSQGRSPSYAS

### ❓ Protein structure

How were the structures in UniProt determined?

> X-ray, NMR, and predicted using AlphaFold

What parts of the sequence do these cover?

> X-ray 186-209, NMR 42-169, NMR 42-164, AlphaFold 1-209.
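
To put these ranges in perspective, it can help to express each one as a share of the full sequence. The sketch below uses the ranges from the example answer and the length of the example sequence (209 residues). The variable and function names are illustrative, not part of the assignment.

```python
# Residue ranges (1-based, inclusive) taken from the example answer.
structures = [
    ("X-ray", 186, 209),
    ("NMR", 42, 169),
    ("NMR", 42, 164),
    ("AlphaFold", 1, 209),
]
sequence_length = 209  # length of the example protein sequence

def coverage(start, end, length):
    """Fraction of the sequence covered by residues start..end (inclusive)."""
    return (end - start + 1) / length

for method, start, end in structures:
    n_residues = end - start + 1
    share = coverage(start, end, sequence_length)
    print(f"{method}: {start}-{end}, {n_residues} residues, {share:.0%}")

# Residues covered by at least one experimental (non-predicted) structure.
experimental = set()
for method, start, end in structures:
    if method != "AlphaFold":
        experimental.update(range(start, end + 1))
print(f"Experimental coverage: {len(experimental)} of {sequence_length} residues")
```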

What features does the structure present?

> Helix, beta strands, and turns

### ❓ Genetic sequence

Look up the gene name in Ensembl. How many transcripts are encoded by this gene according to Ensembl?

> 2

Under `Transcript ID` in the transcript table, select a transcript encoding your UniProt Match. If multiple transcripts match, select the first one.

Select `Exons` in the menu to the left. How many introns and exons are in this gene?

> 4 exons, 3 introns

Select `cDNA`. What is the nucleotide sequence encoding the first ten amino acids of your protein sequence?

> ATGGACTCGGACGAGACCGGGTTCGAGCAC
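
One way to double-check such an answer is to translate the nucleotides back into amino acids and compare with the start of the protein sequence. The sketch below builds the standard genetic code (NCBI translation table 1) from its compact string form. The helper name `translate` is illustrative, and the sketch assumes the input starts in frame at the first codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AA_BY_CODON_ORDER = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AA_BY_CODON_ORDER)
}

def translate(cdna):
    """Translate an in-frame cDNA string into one-letter amino acids ('*' = stop)."""
    cdna = cdna.upper()
    usable = len(cdna) - len(cdna) % 3  # ignore a trailing incomplete codon
    return "".join(CODON_TABLE[cdna[i:i + 3]] for i in range(0, usable, 3))

print(translate("ATGGACTCGGACGAGACCGGGTTCGAGCAC"))  # prints MDSDETGFEH
```

For the example protein, the result matches the first ten residues of the UniProt sequence.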

Select `Protein`, how many exons make up the final sequence?

> 3

Is there an amino acid overlapping with a splice site? If yes, between which exons?

> S between exon 1 and 2

### ❓ Genetic variation

Now select `Variants` in the menu to the left. In the variants table, is there a missense variant? If yes, which residues are changed?

> rs762567273, D, V

Does the table contain both tolerated and deleterious variants according to SIFT? What are their respective frequencies?

> - rs574758901, tolerated, MAF < 0.01
> - rs1432613460, deleterious, MAF < 1e-6

Query your gene in ClinVar. How many variants are listed in the different levels of clinical significance?

| Clinical significance | N |
| - | - |
| Conflicting interpretations | 0 |
| Benign | 3 |
| Likely benign | 2 |
| Uncertain significance | 20 |
| Likely pathogenic | 1 |
| Pathogenic | 9 |

If pathogenic variants were found, can you find an example of an associated condition?

> Developmental and intellectual disability

### ❓ PDB entries

Query the UniProt identifier of your protein in the PDB. How many structures are available? How were they determined?

> 3 structures: 2 NMR, 1 X-ray

Select the largest structure. Select `Structure` under `Explore in 3D`. Using the JSmol viewer, build a 3D representation using the space fill style with the secondary structure indicated in color and attach it to your assignment.






