# 19302 human sequences proteome of Ensembl 92 canonical transcripts 

This notebook contains the code to generate the proteome of 19302 human sequences from Ensembl 92 canonical transcripts (downloaded using Biomart). The canonical transcripts have been further filtered to be unique: transcripts located on scaffolds are excluded and only those on standard chromosomes are kept.

With this approach, we aim to minimize the presence of replicated IDs, specifically symbols mapped to more than one ENSG-ENST-ENSP trio. If any replicate remains, that protein will not be analyzed in the functional validation.
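As an illustration of how remaining replicates could be spotted, here is a minimal sketch on a made-up table with the same column names as the canonical-transcript file used later in this notebook. The helper `replicated_symbols` and all values in `toy` are hypothetical and are not part of the original analysis.

```python
import pandas as pd

# Toy table mimicking the columns of the canonical-transcript file.
# All identifiers below are invented for illustration.
toy = pd.DataFrame({
    "ensg": ["ENSG_A", "ENSG_B", "ENSG_C"],
    "enst": ["ENST_A", "ENST_B", "ENST_C"],
    "symbol": ["GENE1", "GENE2", "GENE1"],
    "chr": ["1", "2", "X"],
})


def replicated_symbols(df, column="symbol"):
    """Return every row whose value in `column` occurs more than once."""
    return df[df.duplicated(column, keep=False)]


replicates = replicated_symbols(toy)
print(replicates)
```

Passing `keep=False` marks all copies of a repeated symbol rather than only the later ones, which makes it easy to inspect every ENSG/ENST pair involved.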

This is the proteome used for degrons functional analysis.

## Import libraries

In [1]:
# Automatically reload changes made to the imported scripts.
%load_ext autoreload
%autoreload 2

In [2]:
import os    # needed for os.path.join when building the paths below
import sys
import pandas as pd
import gzip

## my modules ##
sys.path.append("../scripts/Utils/")    # modules folder
from fasta_utils import readFasta_header_gzip

## Define variables and paths

In [1]:
base = "../../"

data = "data/"

uniq_can_transcripts_path = os.path.join(base, data, "external/biomart/ensembl_canonical_transcripts_uniq.tsv")
uniq_can_transcripts_enst_path = os.path.join(base, data, 
                                              "external/biomart/ensembl_canonical_transcripts_uniq_ENSTs.txt")
protein_seqs_path = os.path.join(base, data, "external/biomart/biomart92_proteome_cantranscripts_uniq.txt.gz")

## 1. List of ENSTs from canonical transcripts belonging to standard chromosomes

Used in Biomart as filter to download the protein sequences.

*ENST = Ensembl Transcript ID

In [4]:
canonical_transcripts_uniq = pd.read_csv(uniq_can_transcripts_path, sep = "\t")

In [5]:
canonical_transcripts_uniq

Unnamed: 0,ensg,enst,symbol,chr
0,ENSG00000013503,ENST00000228347,POLR3B,12
1,ENSG00000136044,ENST00000551662,APPL2,12
2,ENSG00000136051,ENST00000620430,WASHC4,12
3,ENSG00000151967,ENST00000445224,SCHIP1,3
4,ENSG00000166535,ENST00000299698,A2ML1,12
...,...,...,...,...
19297,ENSG00000125780,ENST00000381458,TGM3,20
19298,ENSG00000268104,ENST00000598581,SLC6A14,X
19299,ENSG00000204227,ENST00000374656,RING1,6
19300,ENSG00000040275,ENST00000265295,SPDL1,5


In [8]:
canonical_transcripts_uniq.chr.unique()

array(['12', '3', '1', '4', '5', '8', '15', '14', '22', '17', '21', '6',
       '13', '2', '11', '7', '18', '19', '20', '16', '9', '10', 'X', 'Y'],
      dtype=object)
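The check above relies on reading the array by eye. A small sketch of how the same check could be automated is given below; the set `STANDARD_CHROMS` mirrors the 24 values printed above (1 to 22, X, Y), and the helper `non_standard_rows` and the toy table are hypothetical.

```python
import pandas as pd

# Chromosome names considered "standard" here: autosomes 1-22 plus X and Y,
# matching the values listed in the output above.
STANDARD_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y"}


def non_standard_rows(df, chrom_col="chr"):
    """Return rows whose chromosome is not in STANDARD_CHROMS."""
    return df[~df[chrom_col].astype(str).isin(STANDARD_CHROMS)]


# Toy example with one invented scaffold entry.
toy = pd.DataFrame({
    "enst": ["ENST_A", "ENST_B", "ENST_C"],
    "chr": ["12", "X", "SCAFFOLD_1"],
})
leftover = non_standard_rows(toy)
print(leftover)
```

On the real table, an empty result from `non_standard_rows(canonical_transcripts_uniq)` would confirm that no scaffold entries remain.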

In [11]:
canonical_transcripts_uniq["enst"].to_csv(uniq_can_transcripts_enst_path, index = False, header = False)

## 2. Process Biomart 92 proteome containing standard canonical transcripts only

Fetch protein sequences from Biomart.
- Biomart specifications:
    - Database: Ensembl Genes 92
    - Dataset: Human genes (GRCh38.p12)
    - Filters: GENE -> Input external references ID list -> Transcript stable ID -> upload list of canonical transcripts belonging to standard chroms (`../data/external/biomart/ensembl_canonical_transcripts_uniq_ENSTs.txt`)
    - Attributes (in Sequences): Gene stable ID, Transcript stable ID, Protein stable ID, Gene name, Peptide. 
- Resulting file: `../data/external/biomart/biomart92_proteome_cantranscripts.txt.gz`.

In [4]:
proteome = readFasta_header_gzip(protein_seqs_path)

# Expected number of read sequences: 19302 (Count functionality in Biomart)

Number of retrieved sequences: 19302



Note that the peptides downloaded from Biomart end with an asterisk, which marks the STOP codon at the end of the translated sequence. These asterisks are of no interest for downstream analysis, so they are removed when loading with `readFasta_header_gzip` above.
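The code of `readFasta_header_gzip` lives in `fasta_utils` and is not shown in this notebook. As a rough idea of what such a loader could look like, the sketch below reads a gzipped FASTA file into a dictionary and strips trailing asterisks. The function name `read_fasta_gzip_sketch`, the toy file and its headers are assumptions made for illustration, not the actual implementation.

```python
import gzip
import os
import tempfile


def read_fasta_gzip_sketch(path):
    """Read a gzipped FASTA file into {header: sequence}.

    Trailing '*' characters (STOP codon markers) are removed.
    """
    sequences = {}
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Store the previous record before starting a new one.
                if header is not None:
                    sequences[header] = "".join(chunks).rstrip("*")
                header = line[1:]
                chunks = []
            else:
                chunks.append(line)
    # Store the last record of the file.
    if header is not None:
        sequences[header] = "".join(chunks).rstrip("*")
    return sequences


# Build a tiny gzipped FASTA file with invented records to try the reader.
toy_path = os.path.join(tempfile.mkdtemp(), "toy.fa.gz")
with gzip.open(toy_path, "wt") as handle:
    handle.write(">G1|T1|P1|SYM1\nMDRK\nVARE*\n>G2|T2|P2|SYM2\nMKV*\n")

toy_proteome = read_fasta_gzip_sketch(toy_path)
print(toy_proteome)
```

Using `rstrip("*")` only touches the end of each sequence, so an asterisk in the middle of a peptide would be left in place and could be inspected separately.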

In [5]:
# check * have been removed

proteome["ENSG00000006611|ENST00000005226|ENSP00000005226|USH1C|Q9Y6N9"]

'MDRKVAREFRHKVDFLIENDAEKDYLYDVLRMYHQTMDVAVLVGDLKLVINEPSRLPLFDAIRPLIPLKHQVEYDQLTPRRSRKLKEVRLDRLHPEGLGLSVRGGLEFGCGLFISHLIKGGQADSVGLQVGDEIVRINGYSISSCTHEEVINLIRTKKTVSIKVRHIGLIPVKSSPDEPLTWQYVDQFVSESGGVRGSLGSPGNRENKEKKVFISLVGSRGLGCSISSGPIQKPGIFISHVKPGSLSAEVGLEIGDQIVEVNGVDFSNLDHKEAVNVLKSSRSLTISIVAAAGRELFMTDRERLAEARQRELQRQELLMQKRLAMESNKILQEQQEMERQRRKEIAQKAAEENERYRKEMEQIVEEEEKFKKQWEEDWGSKEQLLLPKTITAEVHPVPLRKPKSFGWFYRYDGKFPTIRKKGKDKKKAKYGSLQDLRKNKKELEFEQKLYKEKEEMLEKEKQLKINRLAQEVSETEREDLEESEKIQYWVERLCQTRLEQISSADNEISEMTTGPPPPPPSVSPLAPPLRRFAGGLHLHTTDLDDIPLDMFYYPPKTPSALPVMPHPPPSNPPHKVPAPPVLPLSGHVSASSSPWVQRTPPPIPIPPPPSVPTQDLTPTRPLPSALEEALSNHPFRTGDTGNPVEDWEAKNHSGKPTNSPVPEQSFPPTPKTFCPSPQPPRGPGVSTISKPVMVHQEPNFIYRPAVKSEVLPQEMLKRMVVYQTAFRQDFRKYEEGFDPYSMFTPEQIMGKDVRLLRIKKEGSLDLALEGGVDSPIGKVVVSAVYERGAAERHGGIVKGDEIMAINGKIVTDYTLAEAEAALQKAWNQGGDWIDLVVAVCPPKEYDDELASLPSSVAESPQPVRKLLEDRAAVHRHGFLLQLEPTDLLLKSKRGNQIHR'
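The spot check above looks at a single entry. A possible way to cover the whole dictionary is sketched below; `headers_with_asterisk` is a hypothetical helper and the toy dictionary is invented.

```python
def headers_with_asterisk(proteome):
    """Return the headers of all sequences that still contain a '*'."""
    return [header for header, seq in proteome.items() if "*" in seq]


# Toy dictionary in the same {header: sequence} shape as `proteome`.
toy_proteome = {
    "G1|T1|P1|SYM1": "MDRK",
    "G2|T2|P2|SYM2": "MKV*",
}
flagged = headers_with_asterisk(toy_proteome)
print(flagged)
```

An empty list for the real `proteome` would indicate that no asterisk is left in any sequence, whether at the end or at an internal position.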

In [9]:
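# Save with asterisks removed
# NOTE: data_path, others_data_path and proteome_file are not defined in the
# cells shown in this notebook; they must be set before running this cell.

with open(data_path+others_data_path+proteome_file, "w") as f:

    # Write each record as a FASTA header line followed by its sequence
    for header, seq in proteome.items():

        f.write(">"+header+"\n")
        f.write(seq+"\n")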
        
        

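The cell above writes a plain-text FASTA file, whereas the verification cell that follows reads a gzipped one; how the file is compressed in between is not shown in this notebook. One possible way to write the compressed file directly is sketched here. `write_fasta_gzip`, the toy dictionary and the temporary output path are assumptions for illustration only.

```python
import gzip
import os
import tempfile


def write_fasta_gzip(proteome, path):
    """Write a {header: sequence} dictionary as a gzipped FASTA file."""
    with gzip.open(path, "wt") as handle:
        for header, seq in proteome.items():
            handle.write(f">{header}\n{seq}\n")


# Toy dictionary with invented records.
toy_proteome = {
    "G1|T1|P1|SYM1": "MDRK",
    "G2|T2|P2|SYM2": "MKV",
}
out_path = os.path.join(tempfile.mkdtemp(), "toy_out.fa.gz")
write_fasta_gzip(toy_proteome, out_path)

# Read the file back to confirm its content.
with gzip.open(out_path, "rt") as handle:
    written = handle.read()
print(written)
```

Opening the file with `gzip.open(..., "wt")` keeps the writing loop identical to the plain-text version while avoiding a separate compression step.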
In [12]:
# check if the proteome has been saved properly

proteome = readFasta_header_gzip(data_path+others_data_path+proteome_file_gz)

Number of retrieved sequences: 19302



In [13]:
proteome["ENSG00000006611|ENST00000005226|ENSP00000005226|USH1C|Q9Y6N9"] # correct

'MDRKVAREFRHKVDFLIENDAEKDYLYDVLRMYHQTMDVAVLVGDLKLVINEPSRLPLFDAIRPLIPLKHQVEYDQLTPRRSRKLKEVRLDRLHPEGLGLSVRGGLEFGCGLFISHLIKGGQADSVGLQVGDEIVRINGYSISSCTHEEVINLIRTKKTVSIKVRHIGLIPVKSSPDEPLTWQYVDQFVSESGGVRGSLGSPGNRENKEKKVFISLVGSRGLGCSISSGPIQKPGIFISHVKPGSLSAEVGLEIGDQIVEVNGVDFSNLDHKEAVNVLKSSRSLTISIVAAAGRELFMTDRERLAEARQRELQRQELLMQKRLAMESNKILQEQQEMERQRRKEIAQKAAEENERYRKEMEQIVEEEEKFKKQWEEDWGSKEQLLLPKTITAEVHPVPLRKPKSFGWFYRYDGKFPTIRKKGKDKKKAKYGSLQDLRKNKKELEFEQKLYKEKEEMLEKEKQLKINRLAQEVSETEREDLEESEKIQYWVERLCQTRLEQISSADNEISEMTTGPPPPPPSVSPLAPPLRRFAGGLHLHTTDLDDIPLDMFYYPPKTPSALPVMPHPPPSNPPHKVPAPPVLPLSGHVSASSSPWVQRTPPPIPIPPPPSVPTQDLTPTRPLPSALEEALSNHPFRTGDTGNPVEDWEAKNHSGKPTNSPVPEQSFPPTPKTFCPSPQPPRGPGVSTISKPVMVHQEPNFIYRPAVKSEVLPQEMLKRMVVYQTAFRQDFRKYEEGFDPYSMFTPEQIMGKDVRLLRIKKEGSLDLALEGGVDSPIGKVVVSAVYERGAAERHGGIVKGDEIMAINGKIVTDYTLAEAEAALQKAWNQGGDWIDLVVAVCPPKEYDDELASLPSSVAESPQPVRKLLEDRAAVHRHGFLLQLEPTDLLLKSKRGNQIHR'