# Last-exon coordinates of the 19302 human canonical transcripts (Ensembl 92)

This notebook contains the code to fetch the coordinates of the last exon of each of the 19302 human canonical transcripts in Ensembl 92 (downloaded using Biomart). The canonical transcripts were first filtered to be unique, that is, transcripts belonging to scaffolds were excluded and only those on standard chromosomes were kept.

Downloads from Biomart:

Exons. Settings:
   - Biomart specifications:
        - Database: Ensembl Genes 92
        - Dataset: Human genes (GRCh38.p12)
        - Filters: GENE -> Input external references ID list -> Transcript stable ID -> upload the list of canonical transcripts belonging to standard chromosomes (`../data/external/biomart/ensembl_canonical_transcripts_uniq_ENSTs.txt`)
        - Attributes (in Sequences): Gene stable ID, Transcript stable ID, Exon stable ID, Exon region start (bp), Strand, Exon region end (bp), Exon rank in transcript, Exon sequences.
            - **Some comments**:
                - 'Exon rank in transcript' refers to the exon order, from the beginning of the gene to the end (taking into account strand direction)
   - Resulting file: `../data/external/biomart/biomart92_proteome_exons_cantranscripts.fasta.gz`. 

## Import libraries

In [1]:
import sys
import pandas as pd
import gzip
import os
from tqdm import tqdm
tqdm.pandas()

# Local modules
sys.path.append("../scripts/Utils/")    # modules folder
from fasta_utils import readFasta_header_gzip
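
The source of `readFasta_header_gzip` is not shown in this notebook. As a rough idea of what such a helper does, here is a minimal sketch of a reader that returns the same `{header: sequence}` shape. The function name `read_fasta_gzip`, the demo file and its invented IDs are assumptions for illustration, not the author's module.

```python
import gzip
import os
import tempfile


def read_fasta_gzip(path):
    """Return {header: sequence} for a gzip-compressed FASTA file (hypothetical helper)."""
    records = {}
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: store the previous one, if any
                if header is not None:
                    records[header] = "".join(chunks)
                header = line[1:]
                chunks = []
            else:
                # Sequence lines may be wrapped, so collect and join them
                chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Demo on a tiny invented file with pipe-delimited headers
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write(">G1|T1|E1|100|200|1|1\nACGT\nTTGA\n>G1|T1|E2|300|400|1|2\nGGCC\n")

demo_records = read_fasta_gzip(demo_path)
```

Keeping the full header as the dictionary key means the later header-splitting step has everything it needs.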

## Define paths and variables

In [59]:
base = "../../"

data = "data/"

genecoords_path = os.path.join(base, data, "external/biomart/biomart92_ensg_enst_genomecoords.txt.gz")
exons_path = os.path.join(base, data, "external/biomart/biomart92_proteome_exons_cantranscripts.fasta.gz")
last_exons_prepro_path = os.path.join(base, data, "external/biomart/biomart92_proteome_lastexons_cantranscripts_relcoords_valid.fasta.gz")

## 1. Load exons

In [8]:
def sep_index(row):
    """
    Splits the FASTA header stored in the "index" column on "|" and
    adds each field to the row as a new column
    """

    values = row["index"].split("|")

    row["ENSG"] = values[0]
    row["ENST"] = values[1]
    row["ENSE"] = values[2]
    row["start"] = values[3]   
    row["end"] = values[4]
    row["strand"] = values[5]
    row["exon_rank"] = values[6]

    return row

In [11]:
# Read the FASTA in dictionary format {header: sequence}
exons = readFasta_header_gzip(exons_path)

# Process the dictionary to generate a df (for later groupby)
exons = pd.DataFrame.from_dict(exons, orient = 'index', columns = ['sequence'])
exons.reset_index(inplace = True)
## the header in the FASTA file contains all the information except the exon's sequence
exons = exons.progress_apply(lambda row: sep_index(row), axis = 1)
exons.drop(['index'], axis = 1, inplace = True)
exons = exons[["ENSG", "ENST", "ENSE", "start", "end", "strand", "exon_rank", "sequence"]]

Number of retrieved sequences: 202869

100%|██████████| 202869/202869 [15:32<00:00, 217.63it/s]
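
The row-wise `progress_apply` above took about 15 minutes for 202869 headers. A vectorized split of the header column would likely be much faster. The sketch below shows the idea on toy records with invented IDs and sequences. It assumes the same pipe-delimited field order that `sep_index` expects, and the variable names `toy_records`, `toy_exons` and `fields` are hypothetical.

```python
import pandas as pd

columns = ["ENSG", "ENST", "ENSE", "start", "end", "strand", "exon_rank"]

# Toy records in the {header: sequence} shape; IDs and sequences are invented
toy_records = {
    "G1|T1|E1|100|200|1|1": "ACGT",
    "G1|T1|E2|300|400|1|2": "GGCC",
    "G2|T2|E3|900|950|-1|1": "TTAA",
}

toy_exons = pd.DataFrame.from_dict(toy_records, orient="index", columns=["sequence"])
toy_exons = toy_exons.reset_index()

# Split every header in one vectorized call instead of row by row
fields = toy_exons["index"].str.split("|", expand=True)
fields.columns = columns

# Convert the numeric fields from strings to integers
fields = fields.astype({name: int for name in ["start", "end", "strand", "exon_rank"]})

# Reassemble the table with the sequence as the last column
toy_exons = pd.concat([fields, toy_exons["sequence"]], axis=1)
```

With this approach the numeric conversion happens in the same step, so a separate `pd.to_numeric` pass would not be needed.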


In [42]:
# Convert the numeric columns from strings to integers
exons["exon_rank"] = pd.to_numeric(exons["exon_rank"])
exons["start"] = pd.to_numeric(exons["start"])
exons["end"] = pd.to_numeric(exons["end"])
exons["strand"] = pd.to_numeric(exons["strand"])

In [43]:
exons

| | ENSG | ENST | ENSE | start | end | strand | exon_rank | sequence |
|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000039139 | ENST00000265104 | ENSE00003548872 | 13809044 | 13809186 | -1 | 46 | GTACATGGACGCACTGGAACACGCGTACCCAGGAATACCTGTATCC... |
| 1 | ENSG00000100097 | ENST00000215909 | ENSE00003491979 | 37678483 | 37678654 | 1 | 3 | CTTCGTGCTGAACCTGGGCAAAGACAGCAACAACCTGTGCCTGCAC... |
| 2 | ENSG00000152359 | ENST00000428202 | ENSE00001004832 | 75677774 | 75677950 | -1 | 11 | TATGTGCCAAGAGTTGTAACCTCTGCACAACAGAAAGCAGGAAGAA... |
| 3 | ENSG00000168575 | ENST00000342228 | ENSE00003528807 | 42459896 | 42459992 | -1 | 5 | GAAGACCCTGTTCCCAATGGCCTCCGGGCACTCCCAGTATTCTATG... |
| 4 | ENSG00000204590 | ENST00000376621 | ENSE00001932362 | 30541377 | 30546313 | -1 | 12 | GCACCTGGGAGTCCCATCCAGAGACCACGGAGCTGGTGGTTTTGCA... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 202864 | ENSG00000186432 | ENST00000334256 | ENSE00001002719 | 160531462 | 160531557 | -1 | 6 | GAAGCTTTTGTCCAGTGATCGAAATCCACCAATTGATGACTTAATA... |
| 202865 | ENSG00000241973 | ENST00000255882 | ENSE00001634710 | 20747583 | 20747702 | -1 | 29 | GAATATCTGAACAAACATCAGAACTGGGTATCGGGACTGTCCCAGC... |
| 202866 | ENSG00000111785 | ENST00000392837 | ENSE00003628202 | 106870823 | 106870942 | 1 | 9 | CATTAATCTTATCACTGGTCATTTAGAGGAACCAATGCCAAACCCC... |
| 202867 | ENSG00000186432 | ENST00000334256 | ENSE00001224224 | 160494995 | 160502202 | -1 | 17 | ATTGATGAAGACCCTAGCCTTGTTCCAGAGGCAATTCAAGGCGGAA... |


## 2. Identify the last exon of every protein (genomic coordinates)

1. Group the exons dataframe by ENST (Ensembl Transcript ID)
2. Identify the last exon, taken here as the row with the maximum value in the `exon_rank` column when the strand is +1 (forward) and the row with the minimum value when the strand is -1 (reverse)
3. Append all last-exon rows and generate a new dataframe with them
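
A per-transcript selection like this can also be written without an explicit loop, and it is worth cross-checking against genomic coordinates. The sketch below uses toy data with invented IDs and coordinates. It assumes, as the Biomart comment on 'Exon rank in transcript' states, that `exon_rank` is counted in transcript direction on both strands. Under that assumption the highest rank is the 3'-most exon regardless of strand, and the block checks this against coordinates (largest `end` on the forward strand, smallest `start` on the reverse strand). If the real data follows a different rank convention, the choice of `idxmax` would have to change accordingly.

```python
import pandas as pd

# Toy exons (invented): T1 on the forward strand, T2 on the reverse strand.
# Ranks follow transcript direction, so on T2 a higher rank means a lower coordinate.
toy = pd.DataFrame({
    "ENST": ["T1", "T1", "T1", "T2", "T2"],
    "start": [100, 300, 500, 900, 700],
    "end": [200, 400, 600, 950, 800],
    "strand": [1, 1, 1, -1, -1],
    "exon_rank": [1, 2, 3, 1, 2],
})

# One row per transcript: the exon with the highest rank
by_rank = toy.loc[toy.groupby("ENST")["exon_rank"].idxmax()].set_index("ENST")

# Independent check from coordinates: start of the 3'-most exon per transcript
coord_starts = {}
for enst, group in toy.groupby("ENST"):
    if group["strand"].iloc[0] == 1:
        # Forward strand: the 3'-most exon has the largest end coordinate
        coord_starts[enst] = int(group.loc[group["end"].idxmax(), "start"])
    else:
        # Reverse strand: the 3'-most exon has the smallest start coordinate
        coord_starts[enst] = int(group["start"].min())

# True when the rank-based and coordinate-based choices pick the same exons
agree = by_rank["start"].to_dict() == coord_starts
```

Running the same comparison on the real `exons` table would show whether the rank-based rule and the coordinates agree for every transcript.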

In [48]:
# Group by transcript (ENST)
exons_grp = exons.groupby("ENST")
len(exons_grp)   # there are fewer ENSTs than expected (19302 in the genecoords Df)

19261
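
To find out which transcripts account for the gap between the 19302 requested and the 19261 retrieved, the two ID sets can be compared directly. The sketch below uses invented IDs. With the real data, `requested_ensts` would presumably be read from the uploaded ID list and `retrieved_ensts` built from `set(exons["ENST"])`. Both variable names are hypothetical.

```python
# Toy IDs (invented): what was uploaded to Biomart vs. what came back in the FASTA
requested_ensts = {"T1", "T2", "T3", "T4"}
retrieved_ensts = {"T1", "T2", "T4"}

# Requested transcripts for which no exon was returned
missing_ensts = sorted(requested_ensts - retrieved_ensts)

# Retrieved transcripts that were never requested (expected to be empty)
unexpected_ensts = sorted(retrieved_ensts - requested_ensts)

print(f"{len(missing_ensts)} requested transcripts have no exons: {missing_ensts}")
```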

In [49]:
# Identify last exon
last_exons = []

for enst, enst_exons in tqdm(exons_grp):

    # Identify strand direction (taken from the first exon of the transcript)
    strand = enst_exons["strand"].iloc[0]
    if strand == 1:  # positive strand
        last_exon_row = enst_exons.loc[enst_exons["exon_rank"] == enst_exons["exon_rank"].max()]
    elif strand == -1:   # negative strand
        last_exon_row = enst_exons.loc[enst_exons["exon_rank"] == enst_exons["exon_rank"].min()]
    else:
        # Guard against unexpected values instead of silently reusing the previous row
        raise ValueError(f"Unexpected strand {strand!r} for {enst}")

    last_exons.append(last_exon_row)


100%|██████████| 19261/19261 [00:19<00:00, 991.74it/s] 


In [50]:
last_exons_df = pd.concat(last_exons)
last_exons_df

| | ENSG | ENST | ENSE | start | end | strand | exon_rank | sequence |
|---|---|---|---|---|---|---|---|---|
| 146177 | ENSG00000004059 | ENST00000000233 | ENSE00000882271 | 127591213 | 127591705 | 1 | 6 | TGGTATGTCCAGGCCACCTGTGCCACCCAAGGCACAGGTCTGTACG... |
| 201825 | ENSG00000003056 | ENST00000000412 | ENSE00002286327 | 8949488 | 8949955 | -1 | 1 | CCGGGAGCGGTCAGGCGCGTGACCCCGCGTGACCGGGGTGCGCGAG... |
| 49871 | ENSG00000004478 | ENST00000001008 | ENSE00000802792 | 2803151 | 2805423 | 1 | 10 | GCCAAGGCAGAGGCTTCCTCAGGAGACCATCCCACTGACACAGAGA... |
| 55350 | ENSG00000003137 | ENST00000001146 | ENSE00001956510 | 72147631 | 72148038 | -1 | 1 | AGGCAATTTTTTTCCTCCCTCTCTCCGCTCCCCTCGCAGCCTCCAC... |
| 18539 | ENSG00000003509 | ENST00000002125 | ENSE00003687373 | 37248135 | 37249160 | 1 | 10 | GTTCTTTTAGATAAATCAAATGAGCCATCAGTGAGGCAGCAGTTAC... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 181030 | ENSG00000090534 | ENST00000647395 | ENSE00003827644 | 184378075 | 184378207 | -1 | 1 | AGAAGTGGCCCAGGCAGGCGTATGACCTGCTGCTGTGGAGGGGCTG... |
| 67034 | ENSG00000131951 | ENST00000647410 | ENSE00003830242 | 60063323 | 60063559 | 1 | 33 | AATTTTTGGGAGCAACTTTCCAAGATCAAATCGAATGTAACTGCCT... |
| 124510 | ENSG00000184007 | ENST00000647444 | ENSE00003824320 | 31937987 | 31938379 | -1 | 1 | GGGAGCTGGTTCCGGCTGCGCGCGCAGCGGTGGTGGTGGCGGCGCG... |
| 153242 | ENSG00000122335 | ENST00000647468 | ENSE00003827563 | 158168140 | 158168280 | -1 | 1 | AGAGCGGGCCGAGGGGGCGGGGTCACGAGCCGCCAGCCGCCGGGTG... |
