# SUBSETTING MOUSE TRANSCRIPTOME 

We explore the
[APPRIS](http://appris-tools.org/) 
mouse transcriptome and run some sanity checks. Based on our observations, we decide which transcripts to exclude.

Note that we are inside the scripts folder and need to set up our paths accordingly.

In [1]:
! ls ..

raw  riboflow_annot_and_ref  scripts


To start, we add our library folder to the Python path. Make sure that `ref_lib` can be found under the relative path inserted below.

In [2]:
import sys
sys.path.insert(0, '../../../..')
import gzip

from collections import defaultdict

import ref_lib
from ref_lib.Fasta import FastaFile, FastaEntry
from ref_lib.GTF import GTFfile, GTFEntry, get_gtf_contents

Here are the reference files we will be using:

In [3]:
transcripts_fasta = "../raw/sequence/gencode.vM23.transcripts.fa.gz"
gencode_gtf       = "../raw/annotation/gencode.vM23.annotation.gtf.gz"
appris_file       = "../raw/annotation/appris_data.principal.txt.gz"

Let's check the documentation and see what the output of get_gtf_contents is like

In [4]:
help(get_gtf_contents)

Help on function get_gtf_contents in module ref_lib.GTF:

get_gtf_contents(gtf_file)
    Reads a gtf file into a dictionary. 
    
    NOTE 1:
    We tested this on encode gtf file.
    Some unexpected behavior might occur on GTF files from other sources.
    
    NOTE 2:
    For compatibility reasons, in gene_id and transcript_id, we are removing the part after
    ".". So, for example, ENSG00000178199.13 becomes ENSG00000178199
    
    It returns a dictionary where its keys are gene ids and values are again (sub)dictionaries
    where each (sub dictionary is of the form)
        transcript_id -> { "exons": list(), "CDS": list(), "strand": k.strand,
                           "start": k.start, "end": k.end, length: int}
                           
    Later when we want to determine relative UTR, CDS coordinates  etc. 
    We can place them in these (sub)dictionaries



The GENCODE GTF contents are stored in a dictionary. Reading the whole GTF file might take a while:

In [5]:
gtf_all = get_gtf_contents( gencode_gtf ) 

Two arbitrary entries from the GTF contents:

In [6]:
list(gtf_all.items())[3000:3002]

[('ENSMUSG00000066677',
  {'ENSMUST00000182880': {'exons': [(173673675, 173673966),
     (173677678, 173680332)],
    'CDS': [],
    'strand': '+',
    'start': 173673675,
    'end': 173680332,
    'gene_type': 'protein_coding',
    'length': 2947},
   'ENSMUST00000085876': {'exons': [(173673680, 173673966),
     (173677678, 173677962),
     (173678930, 173679036),
     (173682665, 173683618),
     (173690667, 173690734),
     (173695572, 173695869),
     (173697405, 173698395)],
    'CDS': [(173677686, 173677962, 1),
     (173678930, 173679036, 2),
     (173682665, 173683618, 3),
     (173690667, 173690734, 4),
     (173695572, 173695869, 5),
     (173697405, 173697464, 6)],
    'strand': '+',
    'start': 173673680,
    'end': 173698395,
    'gene_type': 'protein_coding',
    'length': 2990},
   'ENSMUST00000169857': {'exons': [(173677686, 173677962),
     (173678930, 173679036),
     (173682665, 173683618),
     (173690667, 173690734),
     (173695572, 173695869),
     (173697405, 1

We define functions to extract the CDS and UTR coordinates relative to the transcript start.
These are stored in the gtf_all dictionary under keys with the "bed_" prefix.

In [7]:
def _find_rel_cds_end_coords( exon_list, cds_contents, coord_type, strand ):
    """
    Returns the start or end of the CDS in 0-based BED format,
    so the start position is included and the end position is excluded.
    """
    # c is the corresponding_exon_index to the cds chunk
    
    if coord_type == "start":
        cds_chunk = cds_contents[0]
        c = cds_chunk[2]
        if strand == "+":
            this_sum = cds_chunk[0] - exon_list[c][0]
        else:
            this_sum = exon_list[c][1] - cds_chunk[1]
    else:
        cds_chunk = cds_contents[-1]
        c = cds_chunk[2]
        this_sum = (cds_chunk[1] - cds_chunk[0]) + 1
        
        if len(cds_contents) == 1:
            if strand == "+":
                this_sum += cds_chunk[0]  - exon_list[c][0]  
            else:
                this_sum += exon_list[c][1] - cds_chunk[1]
        """
        Delete this later!!!
        if strand == "+":
            this_sum = (cds_chunk[1] - cds_chunk[0]) + 1
        else:
            this_sum = cds_chunk[0] - exon_list[c][0]
        """
    for e in exon_list[:c]:
        this_sum += (e[1] - e[0]) + 1
        
    return this_sum


def find_rel_utr_cds_positions( genes_dict ):
    """
    For each transcript, the first nucleotide (of the transcript) has position 0.
    The coordinates are in BED style, so for an interval (A, B):
    A: start, is INCLUDED
    B: end, is EXCLUDED
    The coordinates are stored under keys with the "bed_" prefix
    in the corresponding transcript dictionary.
    """
    # 
    # 0-based and end coordinate is excluded
    # 
    for g_name, transcripts in genes_dict.items():
    
        for t_name, t_contents in transcripts.items():
            CDS_contents = t_contents.get("CDS", list())
            if len(CDS_contents) == 0:
                continue

            cds_rel_start = \
                _find_rel_cds_end_coords( t_contents["exons"], CDS_contents, "start", t_contents["strand"]   )
            cds_rel_end = \
                _find_rel_cds_end_coords( t_contents["exons"], CDS_contents, "end", t_contents["strand"] )
        
            t_length = t_contents["length"]
                
            t_contents["bed_CDS"] = ( cds_rel_start, cds_rel_end )
            if cds_rel_start > 0:
                t_contents["bed_UTR_5"] = (0, cds_rel_start  )
                
            if cds_rel_end < t_length:
                t_contents["bed_UTR_3"] = (cds_rel_end, t_length )

In [8]:
find_rel_utr_cds_positions(gtf_all)

Now we can see that they are in the dictionary:

In [9]:
list(gtf_all.items())[4000:4002]

[('ENSMUSG00000085545',
  {'ENSMUST00000133025': {'exons': [(26191621, 26191678),
     (26186729, 26188640)],
    'CDS': [],
    'strand': '-',
    'start': 26186729,
    'end': 26191678,
    'gene_type': 'lncRNA',
    'length': 1970}}),
 ('ENSMUSG00000026934',
  {'ENSMUST00000054099': {'exons': [(26208102, 26208289),
     (26203950, 26204121),
     (26202974, 26203176),
     (26202350, 26202501),
     (26202094, 26202262),
     (26200212, 26201511)],
    'CDS': [(26208102, 26208189, 0),
     (26203950, 26204121, 1),
     (26202974, 26203176, 2),
     (26202350, 26202501, 3),
     (26202094, 26202262, 4),
     (26201096, 26201511, 5)],
    'strand': '-',
    'start': 26200212,
    'end': 26208289,
    'gene_type': 'protein_coding',
    'length': 2184,
    'bed_CDS': (100, 1300),
    'bed_UTR_5': (0, 100),
    'bed_UTR_3': (1300, 2184)},
   'ENSMUST00000127035': {'exons': [(26202350, 26202866),
     (26202094, 26202262),
     (26200216, 26201511)],
    'CDS': [],
    'strand': '-',
    

## Selecting Longest Appris Transcripts

When going through Appris transcripts, we pick only the "PRINCIPAL*" ones.

In [10]:
def read_principal_appris_trascript_list(appris_file, gtf_contents):
    """
    Reads the APPRIS isoform file into a dictionary.
    
    We assume that the file is of the form:
    GENE_NAME \t GENE_ID \t TRANSCRIPT_ID \t CCDS_ID \t APPRIS_CATEGORY
    # C1orf112	ENSG00000000460	ENST00000286031	CCDS1285	PRINCIPAL:1
    
    Only entries whose category starts with "PRINCIPAL" are kept.
    The output dictionary is of the form:
    appris_genes[gene_id][transcript_id] = gtf_contents[gene_id][transcript_id]
    """
    
    appris_genes = defaultdict(dict)
    myopen=open
    if appris_file.endswith(".gz"):
        myopen = gzip.open
    
    with myopen(appris_file, "rt") as input_stream:
        for entry in input_stream:
            contents = entry.strip().split()
            if len(contents) < 5:
                continue
            gene_name, gene_id, transcript_id, dummy, category = contents
            if not category.startswith("PRINCIPAL"):
                continue
            appris_genes[gene_id][transcript_id] = gtf_contents[gene_id][transcript_id]
            
    return appris_genes

In [11]:
appris_principal_transcripts = read_principal_appris_trascript_list(appris_file, gtf_all)
len(appris_principal_transcripts)

22431
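
The filtering step above can be sketched on an in-memory file. This is only a minimal illustration: the first toy line is the example from the docstring, while the second line (its IDs and its "ALTERNATIVE:1" label) is made up, and `principal_ids` is a hypothetical helper that is not part of `ref_lib`.

```python
import io

# First line: the example from the docstring above.
# Second line: a made-up entry, for illustration only.
toy_appris = io.StringIO(
    "C1orf112\tENSG00000000460\tENST00000286031\tCCDS1285\tPRINCIPAL:1\n"
    "GeneX\tENSG_TOY\tENST_TOY\t-\tALTERNATIVE:1\n"
)

def principal_ids(stream):
    """Collect (gene_id, transcript_id) pairs whose category is PRINCIPAL."""
    picked = []
    for line in stream:
        fields = line.strip().split()
        if len(fields) < 5:
            continue  # skip short or malformed lines
        gene_name, gene_id, transcript_id, _ccds, category = fields[:5]
        if category.startswith("PRINCIPAL"):
            picked.append((gene_id, transcript_id))
    return picked

picked = principal_ids(toy_appris)
print(picked)
```

Only the first toy entry survives the filter, since its category starts with "PRINCIPAL".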

In [12]:
def pick_longest_appris_transcripts( appris_genes ):
    longest_picks = defaultdict(dict)
    
    for gene, transcripts in appris_genes.items():
        g_transcripts = list( transcripts.items() )
        longest_transcript = g_transcripts[0]
        for t_name, t_contents in g_transcripts[1:]:
            if t_contents["length"] > longest_transcript[1]["length"]:
                longest_transcript = (t_name, t_contents)
        longest_picks[ gene ][longest_transcript[0]] = longest_transcript[1]   
    return longest_picks

In [13]:
longest_appris_transcripts = pick_longest_appris_transcripts(appris_principal_transcripts)
len(longest_appris_transcripts)

22431
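
As an aside, the same selection can be written with `max`. The sketch below is an alternative formulation, not the code used in this notebook. The toy dictionary holds the two Gapdh principal transcripts with the lengths printed further below, and `longest_transcript` is a hypothetical helper name.

```python
# Lengths copied from the Gapdh output shown later in this notebook
toy_gene = {
    "ENSMUST00000073605": {"length": 1272},
    "ENSMUST00000118875": {"length": 1420},
}

def longest_transcript(transcripts):
    """Return (name, contents) of the longest transcript.

    On ties, max() keeps the first transcript encountered, which
    matches the strict '>' comparison used in the loop above.
    """
    return max(transcripts.items(), key=lambda item: item[1]["length"])

name, contents = longest_transcript(toy_gene)
print(name, contents["length"])
```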

How many of these transcripts do NOT have CDS?

In [14]:
no_cds_transcripts = defaultdict(dict)
for g, transcripts in longest_appris_transcripts.items():
    # There is only one (longest) transcript so pick that
    t_name, t_contents = list(transcripts.items())[0]
    CDS = t_contents.get("bed_CDS", None)
    if CDS == None:
        no_cds_transcripts[g][t_name] = t_contents
print(len(no_cds_transcripts))    

0


Great! So all of the selected APPRIS transcripts have an annotated CDS region.

Just for verification, let's look at GAPDH and BRCA2 and see if we picked the longest transcript:

In [15]:
for t_name, t_contents in appris_principal_transcripts["ENSMUSG00000057666"].items():
    print(t_name, t_contents["length"])
print("--------")
for t_name, t_contents in appris_principal_transcripts["ENSMUSG00000041147"].items():
    print(t_name, t_contents["length"])

ENSMUST00000073605 1272
ENSMUST00000118875 1420
--------
ENSMUST00000044620 10724
ENSMUST00000202313 10517


In [16]:
longest_appris_transcripts["ENSMUSG00000057666"]

{'ENSMUST00000118875': {'exons': [(125165552, 125165773),
   (125165271, 125165311),
   (125163139, 125163436),
   (125162843, 125163040),
   (125162478, 125162708),
   (125162212, 125162393),
   (125161854, 125162101)],
  'CDS': [(125165271, 125165293, 1),
   (125163139, 125163436, 2),
   (125162843, 125163040, 3),
   (125162478, 125162708, 4),
   (125162212, 125162393, 5),
   (125162035, 125162101, 6)],
  'strand': '-',
  'start': 125161854,
  'end': 125165773,
  'gene_type': 'protein_coding',
  'length': 1420,
  'bed_CDS': (240, 1239),
  'bed_UTR_5': (0, 240),
  'bed_UTR_3': (1239, 1420)}}
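
To see how the relative coordinates come about, we can recompute them for this transcript by hand. The sketch below is an independent re-derivation under the conventions visible in the output (1-based inclusive genomic coordinates, exons listed in transcript order, and the third CDS field being the 0-based exon index). The helper `relative_cds_bounds` is hypothetical and is not the function defined earlier.

```python
# Exons and CDS chunks of ENSMUST00000118875, copied from the output above
exons = [(125165552, 125165773), (125165271, 125165311),
         (125163139, 125163436), (125162843, 125163040),
         (125162478, 125162708), (125162212, 125162393),
         (125161854, 125162101)]
cds = [(125165271, 125165293, 1), (125163139, 125163436, 2),
       (125162843, 125163040, 3), (125162478, 125162708, 4),
       (125162212, 125162393, 5), (125162035, 125162101, 6)]

def exon_length(exon):
    # Genomic coordinates are 1-based and inclusive
    return exon[1] - exon[0] + 1

def relative_cds_bounds(exons, cds, strand):
    """Return the 0-based, end-exclusive CDS interval on the transcript."""
    first, last = cds[0], cds[-1]
    if strand == "+":
        start_offset = first[0] - exons[first[2]][0]
        end_offset = last[1] - exons[last[2]][0] + 1
    else:
        # On the minus strand the transcript runs from high to low coordinates
        start_offset = exons[first[2]][1] - first[1]
        end_offset = exons[last[2]][1] - last[0] + 1
    rel_start = sum(exon_length(e) for e in exons[:first[2]]) + start_offset
    rel_end = sum(exon_length(e) for e in exons[:last[2]]) + end_offset
    return rel_start, rel_end

bounds = relative_cds_bounds(exons, cds, "-")
total_length = sum(exon_length(e) for e in exons)
print(bounds, total_length)
```

The first exon contributes 222 nucleotides and the CDS starts 18 nucleotides into the second exon, which gives the start 240. The result agrees with the `bed_CDS` and `length` values shown above.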

In [17]:
longest_appris_transcripts["ENSMUSG00000041147"]

{'ENSMUST00000044620': {'exons': [(150522630, 150523090),
   (150523175, 150523279),
   (150529470, 150529697),
   (150531069, 150531177),
   (150531684, 150531733),
   (150532055, 150532095),
   (150532291, 150532402),
   (150534593, 150534642),
   (150535439, 150535550),
   (150536031, 150537137),
   (150538649, 150543457),
   (150544895, 150544990),
   (150547941, 150548010),
   (150548489, 150548835),
   (150550832, 150551013),
   (150552207, 150552394),
   (150552937, 150553107),
   (150554155, 150554509),
   (150556861, 150557016),
   (150557101, 150557245),
   (150558390, 150558511),
   (150560004, 150560196),
   (150560428, 150560591),
   (150560685, 150560823),
   (150566888, 150567129),
   (150567607, 150567768),
   (150568939, 150569746)],
  'CDS': [(150523213, 150523279, 1),
   (150529470, 150529697, 2),
   (150531069, 150531177, 3),
   (150531684, 150531733, 4),
   (150532055, 150532095, 5),
   (150532291, 150532402, 6),
   (150534593, 150534642, 7),
   (150535439, 1505355

This gives us confidence that we picked the longest principal transcripts.

## Sanity Checks

At this point, we have the relative coordinates of the CDS and UTRs, and we have picked the longest principal APPRIS transcripts.
Now we can do some sanity checks to see if there are any issues with our implementation.

These sanity checks are:

1. The length of the CDS should be divisible by 3.
2. The start codon should be ATG. The stop codon should be one of TAG, TGA, TAA.

Also, it is claimed that the alternative isoforms for the principal APPRIS genes have identical CDS. We also want to confirm this:

3. CDS regions of the principal transcripts of the same gene should be identical.
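
The first two checks can be sketched on a toy sequence. This is a minimal illustration that assumes, as in the cells below, that `bed_CDS` is a 0-based, end-exclusive interval and that the stop codon sits right after the CDS end. The sequence and the helper `check_cds` are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(sequence, bed_cds):
    """Run the CDS length, start codon and stop codon checks."""
    start, end = bed_cds
    return {
        "length_multiple_of_3": (end - start) % 3 == 0,
        "starts_with_ATG": sequence[start:start + 3] == "ATG",
        # The stop codon is read from the 3 nucleotides after the CDS end
        "followed_by_stop": sequence[end:end + 3] in STOP_CODONS,
    }

# Toy transcript: 3 nt of 5' UTR, a 9 nt CDS, a stop codon and 4 more nt
toy_sequence = "GGC" + "ATGAAACCC" + "TGA" + "TTTT"
good = check_cds(toy_sequence, (3, 12))

# Toy transcript whose CDS runs to the very end (no 3' UTR)
truncated = check_cds("ATGAAA", (0, 6))
print(good)
print(truncated)
```

In the second case the slice after the CDS end is empty, so the stop codon check fails even though the CDS itself looks fine.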

### 1. CDS Length

In [18]:
# The counts are stored at the index position of the corresponding remainder,
# i.e., CDS_remainder_by_3[1] will hold the number of transcripts whose CDS length is of the form 3m + 1
CDS_remainder_by_3 = [0,0,0]
non_zero_remainder_transcripts = defaultdict(dict)

for g, transcripts in longest_appris_transcripts.items():
    for t_name, t_contents in transcripts.items():
        bed_CDS = t_contents.get("bed_CDS", None)
        this_remainder = (bed_CDS[1] - bed_CDS[0]) % 3
        CDS_remainder_by_3[this_remainder] += 1
        if this_remainder != 0:
            non_zero_remainder_transcripts[g][t_name] = t_contents

print("CDS length divisible by 3: ", CDS_remainder_by_3[0])
print("CDS length NOT divisible by 3: ", CDS_remainder_by_3[1] + CDS_remainder_by_3[2])


CDS length divisible by 3:  21937
CDS length NOT divisible by 3:  494


Most of the transcripts have a CDS length that is a multiple of 3, but some of them don't. Now we look at what kind of transcripts these are:

In [19]:
CDS_non_zero_remainder_plus_strand = 0
one_exon_non_zero_remainder_cases = list()
gene_types = defaultdict(int)

for g, transcripts in non_zero_remainder_transcripts.items():
     for t_name, t_contents in transcripts.items():
        gene_types[gtf_all[g][t_name]["gene_type"]] +=1  
        if t_contents["strand"] == "+":
            CDS_non_zero_remainder_plus_strand += 1
        if len(t_contents["CDS"]) == 1:
            one_exon_non_zero_remainder_cases.append( t_name)

all_total = 0
for g_type, g_count in gene_types.items():
    print(g_type, g_count)
    all_total += g_count
print(all_total)

protein_coding 115
polymorphic_pseudogene 35
TR_V_gene 124
TR_J_gene 33
TR_C_gene 7
IG_LV_gene 3
IG_V_gene 135
IG_J_gene 9
IG_C_gene 12
TR_D_gene 2
IG_D_gene 19
494


In [20]:
all_gene_types = defaultdict(int)
total_t = 0

for g, transcripts in appris_principal_transcripts.items():
    for t_name, t_contents in transcripts.items():
        all_gene_types[ t_contents["gene_type"] ] +=1
        total_t += 1
        
        
for  g_type, g_count in all_gene_types.items():
    print(g_type, g_count)
print(total_t)

protein_coding 28689
polymorphic_pseudogene 97
IG_V_gene 218
TR_V_gene 144
TR_J_gene 70
TR_C_gene 9
IG_LV_gene 4
IG_J_gene 14
IG_C_gene 13
TR_D_gene 4
IG_D_gene 19
29281


In [21]:
print(len(one_exon_non_zero_remainder_cases), "\n" ,  one_exon_non_zero_remainder_cases[:20])

137 
 ['ENSMUST00000117860', 'ENSMUST00000117704', 'ENSMUST00000117982', 'ENSMUST00000117787', 'ENSMUST00000121914', 'ENSMUST00000208175', 'ENSMUST00000216831', 'ENSMUST00000133118', 'ENSMUST00000103278', 'ENSMUST00000192366', 'ENSMUST00000103288', 'ENSMUST00000103289', 'ENSMUST00000193061', 'ENSMUST00000103293', 'ENSMUST00000103295', 'ENSMUST00000103296', 'ENSMUST00000103297', 'ENSMUST00000103298', 'ENSMUST00000103340', 'ENSMUST00000103351']


A closer look at the transcript ENSMUST00000117704 directly in the file gencode.vM23.annotation.gtf.gz:

In [22]:
! zcat ../raw/annotation/gencode.vM23.annotation.gtf.gz | grep ENSMUST00000117704 

chr2	HAVANA	transcript	86606627	86607569	.	-	.	gene_id "ENSMUSG00000068824.4"; transcript_id "ENSMUST00000117704.1"; gene_type "polymorphic_pseudogene"; gene_name "Olfr1083-ps"; transcript_type "polymorphic_pseudogene"; transcript_name "Olfr1083-ps-201"; level 2; protein_id "ENSMUSP00000149090.1"; transcript_support_level "NA"; mgi_id "MGI:3030917"; ont "PGO:0000018"; tag "basic"; tag "appris_principal_1"; havana_gene "OTTMUSG00000013643.3"; havana_transcript "OTTMUST00000032780.3";
chr2	HAVANA	exon	86606627	86607569	.	-	.	gene_id "ENSMUSG00000068824.4"; transcript_id "ENSMUST00000117704.1"; gene_type "polymorphic_pseudogene"; gene_name "Olfr1083-ps"; transcript_type "polymorphic_pseudogene"; transcript_name "Olfr1083-ps-201"; exon_number 1; exon_id "ENSMUSE00000715632.1"; level 2; protein_id "ENSMUSP00000149090.1"; transcript_support_level "NA"; mgi_id "MGI:3030917"; ont "PGO:0000018"; tag "basic"; tag "appris_principal_1"; havana_gene "OTTMUSG00000013643.3"; havana_transcript "OTTMU

This looks like a pseudogene: its gene_type is "polymorphic_pseudogene".

Let's look at another one:

In [23]:
! zcat ../raw/annotation/gencode.vM23.annotation.gtf.gz | grep ENSMUST00000103298

chr6	HAVANA	transcript	41543810	41543856	.	+	.	gene_id "ENSMUSG00000076497.1"; transcript_id "ENSMUST00000103298.1"; gene_type "TR_J_gene"; gene_name "Trbj2-7"; transcript_type "TR_J_gene"; transcript_name "Trbj2-7-201"; level 2; protein_id "ENSMUSP00000141994.1"; transcript_support_level "NA"; mgi_id "MGI:4439731"; tag "mRNA_start_NF"; tag "mRNA_end_NF"; tag "cds_start_NF"; tag "cds_end_NF"; tag "basic"; tag "appris_principal_1"; havana_gene "OTTMUSG00000051368.3"; havana_transcript "OTTMUST00000129900.3";
chr6	HAVANA	exon	41543810	41543856	.	+	.	gene_id "ENSMUSG00000076497.1"; transcript_id "ENSMUST00000103298.1"; gene_type "TR_J_gene"; gene_name "Trbj2-7"; transcript_type "TR_J_gene"; transcript_name "Trbj2-7-201"; exon_number 1; exon_id "ENSMUSE00000663112.1"; level 2; protein_id "ENSMUSP00000141994.1"; transcript_support_level "NA"; mgi_id "MGI:4439731"; tag "mRNA_start_NF"; tag "mRNA_end_NF"; tag "cds_start_NF"; tag "cds_end_NF"; tag "basic"; tag "appris_principal_1"; havana_gen

ENSMUST00000103298 is another transcript that is not labeled as "protein_coding"; its gene_type is "TR_J_gene".

### 2. Start & Stop Codons

In [24]:
t_fasta_stream = FastaFile(transcripts_fasta)

clipped = list()
start_triplets = defaultdict(int)
stop_triplets = defaultdict(int)

for this_fasta_entry in t_fasta_stream:
    contents = this_fasta_entry.header.strip().split("|")
    this_t = contents[0].split(".")[0]
    
    g_contents = contents[1].split(".")
    this_g = g_contents[0]
    if len(g_contents) >= 2 and "PAR_Y" in g_contents[1]:
        continue
    
    transcripts = longest_appris_transcripts.get(this_g, None)
    # Skip genes that are not in our selection
    if transcripts is None:
        continue
    if this_t in transcripts:
        bed_CDS = longest_appris_transcripts[this_g][this_t].get("bed_CDS", None)
        
        if bed_CDS != None:
            start_codon_coord = bed_CDS[0]
            start_triplets[ this_fasta_entry.sequence[ start_codon_coord:start_codon_coord+3 ]  ] += 1
            stop_triplets[ this_fasta_entry.sequence[ bed_CDS[1]:bed_CDS[1] + 3 ]  ] += 1

print(start_triplets)
print("---------------------------")
print(stop_triplets)

defaultdict(<class 'int'>, {'ATG': 21972, 'TGG': 7, 'GTC': 5, 'TTC': 6, 'GGA': 16, 'TGT': 12, 'CTG': 21, 'ATA': 8, 'GTG': 13, 'ATC': 9, 'TTG': 5, 'GTT': 9, 'AGA': 15, 'CCT': 10, 'ATT': 6, 'TTT': 7, 'CTT': 5, 'GAA': 21, 'GTA': 3, 'AAT': 3, 'TAC': 3, 'ACG': 2, 'GGG': 7, 'AGG': 6, 'GGC': 9, 'AAA': 9, 'TAT': 2, 'CTC': 6, 'AAG': 4, 'GCA': 9, 'GAG': 20, 'AGC': 6, 'GCC': 11, 'TCT': 11, 'CAG': 26, 'GCT': 5, 'TCA': 29, 'CAA': 15, 'GAC': 16, 'TAA': 5, 'AGT': 5, 'AAC': 3, 'GAT': 5, 'CTA': 9, 'ACC': 3, 'CCA': 5, 'CAC': 3, 'GCG': 1, 'TTA': 4, 'ACA': 9, 'CGC': 2, 'CCC': 3, 'ACT': 2, 'GGT': 4, 'CAT': 2, 'TGC': 2, 'TGA': 10, 'TAG': 2, 'CCG': 2, 'CGG': 1})
---------------------------
defaultdict(<class 'int'>, {'TAA': 6054, 'TAG': 4957, 'TGA': 10727, '': 680, 'AGG': 1, 'AAA': 1, 'TGC': 1, 'CGC': 1, 'TT': 1, 'AAT': 1, 'TGT': 1, 'GCA': 1, 'TTC': 1, 'GAG': 1, 'CCC': 1, 'GTC': 1, 'TCT': 1})


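Most of our transcripts start with ATG (21972 out of 22431). Note that the stop triplet is read from the three nucleotides right after the CDS end, so an empty triplet means that the CDS extends to the end of the transcript sequence.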
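
The loop above derives the transcript and gene IDs from the FASTA header by splitting on "|" and dropping the version suffix. Here is a minimal sketch of that step. The toy header only mimics the layout the loop relies on (transcript ID first, gene ID second); its version suffixes and trailing fields are assumptions, and `parse_gencode_header` is a hypothetical helper.

```python
def parse_gencode_header(header):
    """Return (transcript_id, gene_id) without version suffixes."""
    fields = header.strip().split("|")
    transcript_id = fields[0].split(".")[0]
    gene_id = fields[1].split(".")[0]
    return transcript_id, gene_id

# Version suffixes and trailing fields are made up for illustration
toy_header = "ENSMUST00000118875.1|ENSMUSG00000057666.1|Gapdh-toy|Gapdh|1420|protein_coding|"

ids = parse_gencode_header(toy_header)
print(ids)
```

Dropping the version suffix keeps the IDs compatible with the keys of the GTF dictionary, which are stored without versions.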

In [25]:
length_dict = defaultdict(int)
short_and_p_coding = defaultdict(dict)
for g, transcripts in longest_appris_transcripts.items():
    for t_name , t_contents in transcripts.items():
        cds_length = t_contents["bed_CDS"][1] - t_contents["bed_CDS"][0]
        length_dict[ cds_length ] += 1
        if cds_length < 150 and t_contents["gene_type"] == "protein_coding":
            short_and_p_coding[g][t_name] = t_contents
            
print("There are", len(short_and_p_coding), "many short and protein coding transcripts (CDS length < 150)")

There are 101 many short and protein coding transcripts (CDS length < 150)


We think that these few short transcripts will not impact our results significantly. Also, very few reads are going to map to them.

### Selecting Only Protein-Coding Genes

Our sanity checks above suggest that many (though not all) of the transcripts failing the CDS length and start / stop codon checks belong to non-protein-coding genes. So we decided to pick only protein-coding genes.

In [26]:
protein_coding_longest_transcripts = defaultdict(dict)
for g, transcripts in longest_appris_transcripts.items():
    for t_name , t_contents in transcripts.items():
        if t_contents["gene_type"] == "protein_coding":
            protein_coding_longest_transcripts[g][t_name] = t_contents
long_p_coding_count = len(protein_coding_longest_transcripts)
print("There are {} protein coding transcripts picked".format(long_p_coding_count))

There are 21849 protein coding transcripts picked


Now we revisit our sanity checks using the protein-coding transcripts only.

### 1. CDS Length (Revisited)

In [27]:
CDS_remainder_by_3 = [0,0,0]
non_zero_remainder_transcripts = defaultdict(dict)

for g, transcripts in protein_coding_longest_transcripts.items():
    for t_name, t_contents in transcripts.items():
        bed_CDS = t_contents.get("bed_CDS", None)
        this_remainder = (bed_CDS[1] - bed_CDS[0]) % 3
        CDS_remainder_by_3[this_remainder] += 1
        if this_remainder != 0:
            non_zero_remainder_transcripts[g][t_name] = t_contents

print("CDS length divisible by 3: ", CDS_remainder_by_3[0])
print("CDS length NOT divisible by 3: ", CDS_remainder_by_3[1] + CDS_remainder_by_3[2])

CDS length divisible by 3:  21734
CDS length NOT divisible by 3:  115
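
To put the counts into perspective, we can compare the failure rates before and after restricting to protein-coding genes. The snippet below is plain arithmetic on the numbers printed in this notebook; no new data is involved.

```python
# Counts copied from the outputs above
all_total, all_failing = 22431, 494   # longest principal transcripts
pc_total, pc_failing = 21849, 115     # protein_coding subset

# Everything that is not protein_coding
other_total = all_total - pc_total
other_failing = all_failing - pc_failing

rates = {
    "all longest principal": all_failing / all_total,
    "protein_coding": pc_failing / pc_total,
    "other gene types": other_failing / other_total,
}
for label, rate in rates.items():
    print(f"{label}: {rate:.2%}")
```

Roughly 2% of all selected transcripts fail the CDS length check, about 0.5% of the protein-coding ones, and about 65% of the remaining gene types.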


### 2. Start & Stop Codons (Revisited)

In [28]:
t_fasta_stream = FastaFile(transcripts_fasta)

clipped = list()
start_triplets = defaultdict(int)
stop_triplets = defaultdict(int)
empty_stop_triplets = defaultdict(dict)
transcripts_with_stop_codon = defaultdict(dict)

for this_fasta_entry in t_fasta_stream:
    contents = this_fasta_entry.header.strip().split("|")
    this_t = contents[0].split(".")[0]
    
    g_contents = contents[1].split(".")
    this_g = g_contents[0]
    if len(g_contents) >= 2 and "PAR_Y" in g_contents[1]:
        continue
    
    transcripts = protein_coding_longest_transcripts.get(this_g, None)
    # Skip genes that are not in our selection
    if transcripts is None:
        continue
    if this_t in transcripts:
        bed_CDS = protein_coding_longest_transcripts[this_g][this_t].get("bed_CDS", None)
        
        if bed_CDS != None:
            start_codon_coord = bed_CDS[0]
            start_triplets[ this_fasta_entry.sequence[ start_codon_coord:start_codon_coord+3 ]  ] += 1
            observed_stop_triplet = this_fasta_entry.sequence[ bed_CDS[1]:bed_CDS[1] + 3 ]
            stop_triplets[ observed_stop_triplet ] += 1
            if observed_stop_triplet == '':
                empty_stop_triplets[this_g][this_t] = protein_coding_longest_transcripts[this_g][this_t]
            else:
                transcripts_with_stop_codon[this_g][this_t] = protein_coding_longest_transcripts[this_g][this_t]

print("start_triplets")                
print(start_triplets)
print("---------------------------")
print("stop_triplets")  
print(stop_triplets)
print("---------------------------")
print("Number of transcripts_with_stop_codon = ", len(transcripts_with_stop_codon) )

start_triplets
defaultdict(<class 'int'>, {'ATG': 21591, 'TGG': 4, 'GTC': 4, 'TTC': 4, 'GGA': 6, 'CTG': 20, 'ATA': 4, 'ATC': 5, 'TTG': 4, 'GTT': 5, 'AGA': 11, 'CCT': 4, 'ATT': 4, 'TGT': 8, 'TTT': 4, 'CTT': 4, 'GAA': 8, 'GTA': 2, 'AAT': 1, 'TAC': 3, 'ACG': 2, 'GGG': 2, 'AGG': 1, 'GGC': 5, 'AAA': 8, 'TAT': 1, 'CTC': 4, 'AAG': 4, 'GCA': 6, 'GAG': 12, 'AGC': 5, 'GTG': 10, 'GCC': 8, 'TCT': 4, 'CAG': 2, 'GCT': 3, 'TCA': 28, 'CAA': 7, 'GAC': 6, 'CTA': 3, 'ACC': 3, 'CAC': 3, 'GCG': 1, 'TTA': 3, 'ACA': 5, 'CGC': 2, 'CCC': 3, 'ACT': 1, 'GGT': 3, 'AGT': 2, 'CCG': 2, 'CGG': 1, 'TGC': 1, 'GAT': 1, 'CCA': 1})
---------------------------
stop_triplets
defaultdict(<class 'int'>, {'TAA': 6027, 'TAG': 4936, 'TGA': 10703, '': 180, 'CGC': 1, 'TT': 1, 'GAG': 1})
---------------------------
Number of transcripts_with_stop_codon =  21669


Clearly, picking only protein-coding genes largely improved our results!

The empty triplets come from transcripts without a 3' UTR.

We can see this from the GTF annotation or from our dictionary, where these transcripts have no "bed_UTR_3" entry:

In [29]:
empty_stop_triplets

defaultdict(dict,
            {'ENSMUSG00000100679': {'ENSMUST00000190734': {'exons': [(53297111,
                 53297188),
                (53298972, 53299152),
                (53305455, 53305606),
                (53314860, 53315099),
                (53317902, 53318035)],
               'CDS': [(53298979, 53299152, 1),
                (53305455, 53305606, 2),
                (53314860, 53315099, 3),
                (53317902, 53318035, 4)],
               'strand': '+',
               'start': 53297111,
               'end': 53318035,
               'gene_type': 'protein_coding',
               'length': 785,
               'bed_CDS': (85, 785),
               'bed_UTR_5': (0, 85)}},
             'ENSMUSG00000100265': {'ENSMUST00000186341': {'exons': [(117697131,
                 117697228),
                (117697312, 117697512),
                (117698849, 117698975),
                (117726775, 117726903),
                (117727324, 117727441)],
               'CDS': [(117697

When we examined these genes on the UCSC browser, we saw that they didn't have a 3' UTR.

We decided to take them out as well. 

Now we have narrowed down our set to 
`transcripts_with_stop_codon`.
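
A transcript without a 3' UTR can also be recognized from the coordinates alone, since its CDS ends exactly where the transcript ends. The sketch below illustrates this with values copied from the outputs in this notebook. Note that it is not the filter used above, which relies on the observed stop triplet being empty, and `has_3utr` is a hypothetical helper.

```python
def has_3utr(t_contents):
    """True when the CDS ends before the transcript does."""
    return t_contents["bed_CDS"][1] < t_contents["length"]

# ENSMUST00000190734: CDS runs to the end of the transcript
no_utr = {"length": 785, "bed_CDS": (85, 785)}
# ENSMUST00000118875 (Gapdh): 181 nt remain after the CDS
with_utr = {"length": 1420, "bed_CDS": (240, 1239)}

print(has_3utr(no_utr), has_3utr(with_utr))
```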

In [30]:
t_fasta_stream = FastaFile(transcripts_fasta)

start_triplets = defaultdict(int)
stop_triplets = defaultdict(int)

for this_fasta_entry in t_fasta_stream:
    contents = this_fasta_entry.header.strip().split("|")
    this_t = contents[0].split(".")[0]
    
    g_contents = contents[1].split(".")
    this_g = g_contents[0]
    if len(g_contents) >= 2 and "PAR_Y" in g_contents[1]:
        continue
    
    transcripts = transcripts_with_stop_codon.get(this_g, None)
    # Skip genes that are not in our selection
    if transcripts is None:
        continue
    if this_t in transcripts:
        bed_CDS = transcripts_with_stop_codon[this_g][this_t].get("bed_CDS", None)
        
        if bed_CDS != None:
            start_codon_coord = bed_CDS[0]
            start_triplets[ this_fasta_entry.sequence[ start_codon_coord:start_codon_coord+3 ]  ] += 1
            observed_stop_triplet = this_fasta_entry.sequence[ bed_CDS[1]:bed_CDS[1] + 3 ]
            stop_triplets[ observed_stop_triplet ] += 1
            
print(start_triplets)
print("---------------------------")
print(stop_triplets)

defaultdict(<class 'int'>, {'ATG': 21528, 'TGG': 2, 'GTC': 2, 'TTC': 2, 'GGA': 5, 'CTG': 18, 'AGA': 5, 'TGT': 3, 'GAA': 2, 'AAT': 1, 'ACG': 2, 'GCA': 5, 'GTT': 4, 'AGC': 4, 'GGC': 1, 'GTG': 4, 'GCC': 5, 'TCT': 1, 'GCT': 3, 'AAA': 1, 'CAA': 1, 'GAC': 4, 'CTC': 1, 'CTA': 1, 'ATC': 3, 'CCT': 1, 'GCG': 1, 'CGC': 2, 'ATT': 2, 'CCC': 2, 'ACT': 1, 'GGT': 3, 'CTT': 3, 'ACA': 1, 'GAG': 2, 'AAG': 2, 'AGT': 2, 'TCA': 27, 'TTG': 1, 'ACC': 1, 'GGG': 1, 'CCG': 1, 'GTA': 1, 'TTT': 1, 'TTA': 1, 'CAG': 1, 'TGC': 1, 'GAT': 1, 'CCA': 1, 'ATA': 1})
---------------------------
defaultdict(<class 'int'>, {'TAA': 6027, 'TAG': 4936, 'TGA': 10703, 'CGC': 1, 'TT': 1, 'GAG': 1})


In [31]:
len(transcripts_with_stop_codon)

21669

In [32]:
CDS_remainder_by_3 = [0,0,0]
non_zero_remainder_transcripts = defaultdict(dict)

for g, transcripts in transcripts_with_stop_codon.items():
    for t_name, t_contents in transcripts.items():
        bed_CDS = t_contents.get("bed_CDS", None)
        this_remainder = (bed_CDS[1] - bed_CDS[0]) % 3
        CDS_remainder_by_3[this_remainder] += 1
        if this_remainder != 0:
            non_zero_remainder_transcripts[g][t_name] = t_contents

print("CDS length divisible by 3: ", CDS_remainder_by_3[0])
print("CDS length NOT divisible by 3: ", CDS_remainder_by_3[1] + CDS_remainder_by_3[2])

CDS length divisible by 3:  21601
CDS length NOT divisible by 3:  68


### 3. CDS Regions

We check whether different principal transcripts of the same gene in the APPRIS list have identical CDS sequences.

In [33]:
t_fasta_stream = FastaFile(transcripts_fasta)
selected_CDS_sequences = defaultdict(dict)
different_CDS_transcripts = list()

for this_fasta_entry in t_fasta_stream:
    contents = this_fasta_entry.header.strip().split("|")
    this_t = contents[0].split(".")[0]
    
    g_contents = contents[1].split(".")
    this_g = g_contents[0]
    #if len(g_contents) >= 2 :
    #    continue
    
    pre_transcripts = transcripts_with_stop_codon.get(this_g, None)
    if pre_transcripts == None or pre_transcripts == {}:
        continue
    
    transcripts = appris_principal_transcripts.get(this_g, {})
    if this_t not in transcripts.keys():
        continue
    cds_boundaries = transcripts[this_t]["bed_CDS"]
    
    selected_CDS_sequences[this_g][this_t] = this_fasta_entry.sequence[ cds_boundaries[0]: cds_boundaries[1] ]
                 
for g, transcripts in selected_CDS_sequences.items():
    
    transcript_items = list(transcripts.items ())
    first_t, first_t_cds = transcript_items[0] 
    
    for t_name, t_cds in transcript_items[1:]:
        if t_cds != first_t_cds:
            different_CDS_transcripts.append( (first_t, t_name, first_t_cds, t_cds) )

In [34]:
for e in different_CDS_transcripts:
    print( e[0], e[1], len(e[2]), len(e[3]) )

ENSMUST00000059975 ENSMUST00000186780 570 570
ENSMUST00000066432 ENSMUST00000100069 5841 5841
ENSMUST00000212009 ENSMUST00000077816 1725 1725
ENSMUST00000199891 ENSMUST00000084922 1422 1422
ENSMUST00000147802 ENSMUST00000234625 801 801
ENSMUST00000111817 ENSMUST00000079314 426 426
ENSMUST00000155364 ENSMUST00000046754 327 327
ENSMUST00000184635 ENSMUST00000061331 1773 1773


Interestingly, in every pair the two differing CDS sequences have the same length, which suggests that the sequences are almost identical. So we compare them nucleotide by nucleotide.

In [35]:
def find_disagreeing_indices( seq_1, seq_2 ):
    disagreeing_indices = []
    
    for i in range( len(seq_1) ):
        if seq_1[i] != seq_2[i]:
            disagreeing_indices.append(i) 
    
    return disagreeing_indices

In [36]:
for e in different_CDS_transcripts:
    print( find_disagreeing_indices( e[2], e[3] ) )

[569]
[383]
[74]
[614]
[449]
[200]
[74]
[779]


Each pair differs at exactly one position. Hence, APPRIS principal transcripts of the same gene have identical CDS, with the very few exceptions listed above. For mapping purposes, this won't make an important difference.

### 4. CDS Length Distribution

In [37]:
cds_lengths = []
gene_lengths = []

for g, transcripts in transcripts_with_stop_codon.items():
    for t_name, t_contents in transcripts.items():
        cds_lengths.append( t_contents["bed_CDS"][1] - t_contents["bed_CDS"][0] )
        gene_lengths.append(t_contents["length"])
    
len(cds_lengths)

21669

In [38]:
import numpy as np

In [39]:
# np.int was removed in NumPy 1.24; the builtin int is the equivalent dtype
cds_lengths_np = np.array(cds_lengths, dtype=int)
gene_lengths_np = np.array(gene_lengths, dtype=int)

In [40]:
np.mean(cds_lengths)

1617.0516867414278

In [41]:
np.median(cds_lengths)

1173.0

These numbers are within the mouse transcript length range given by Busotti et al. 
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4864457/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4864457/).
See supplemental figure 4 for details.
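
The mean above is noticeably larger than the median, which is what we expect from a right-skewed length distribution with a few very long CDS regions. The sketch below shows the effect with the standard library only. The toy lengths are hypothetical and are not taken from our data.

```python
import statistics

# Hypothetical CDS lengths, for illustration only
toy_cds_lengths = [300, 450, 900, 1173, 1500, 2400, 6000]

mean_length = statistics.mean(toy_cds_lengths)
median_length = statistics.median(toy_cds_lengths)
# Quartiles of the toy distribution (requires Python 3.8 or later)
q1, q2, q3 = statistics.quantiles(toy_cds_lengths, n=4)

print(mean_length, median_length, (q1, q2, q3))
```

A single long entry pulls the mean well above the median, while the median and the quartiles stay put.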

## Summary & Conclusion

We explored [APPRIS](https://academic.oup.com/nar/article/46/D1/D213/4561658) transcripts. Based on our observations, we decided to filter out some transcripts for our purposes. In particular, we took the following steps:

1. We picked only "PRINCIPAL" entries and left out "ALTERNATIVE" entries. For each gene, we kept the longest principal transcript.
2. We picked only protein-coding genes by choosing "gene_type protein_coding" in the GTF file.
3. We excluded transcripts that didn't have a 3' UTR and hence had no stop codon in their sequence.

These steps are mostly based on our 'Sanity Checks'.

After filtering, we had a total of 21669 transcripts.

* 21601 of these transcripts have a CDS whose length is a multiple of 3. 68 of them don't.
* 21528 of them start with ATG (the start codon).
* All but 3 of the transcripts are followed by a proper stop codon (TAA, TAG or TGA). The 3 exceptions are followed by CGC, TT and GAG.
* The principal transcripts of an APPRIS gene have identical CDS regions, with 8 exceptions: pairs of transcripts that have the same CDS length but differ at a single position. You can find the complete list above.
* The median length of the CDS regions is 1173. This is consistent with the earlier findings on the median length of mouse transcripts.



## Files and Hash Sums

Appris Version GencodeM23/Ensembl98

The APPRIS list was downloaded from http://appris-tools.org/#/downloads
on November 28, 2019.

GENCODE:

All other files have been downloaded from the GENCODE website: [https://www.gencodegenes.org/mouse/](https://www.gencodegenes.org/mouse/)

Version: Release M23 (GRCm38.p6)

In [42]:
! zcat ../raw/sequence/gencode.vM23.transcripts.fa.gz | md5sum

2f31e11ad8dc6269ec51dee7c890d280  -


In [43]:
! zcat ../raw/annotation/gencode.vM23.annotation.gtf.gz | md5sum

32d3d2088c237464ee0416f441e3f6bd  -


In [44]:
! zcat ../raw/annotation/appris_data.principal.txt.gz | md5sum

fa302779d9e094c090f99377ad99860b  -
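
On systems without `zcat` or `md5sum`, the same kind of checksum can be computed in Python. The sketch below hashes the decompressed contents of a gzip file in chunks, which should correspond to what the shell pipelines above compute. The function name `md5_of_gzip_contents` is hypothetical.

```python
import gzip
import hashlib

def md5_of_gzip_contents(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the decompressed bytes of a gzip file."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as stream:
        # Read in chunks so that large files do not need to fit in memory
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_gzip_contents("../raw/annotation/appris_data.principal.txt.gz")` can be compared against the hash sum listed above.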


## Generating RiboFlow Reference and Annotation

If the riboflow reference and annotation have not already been generated, you can run 'generate_riboflow_ref_and_annot.sh' inside the scripts folder.


## Note

This notebook has been adapted from our earlier notebook where we explored the human transcriptome.