# Exploring rMATS Output

rMATS calls and quantifies alternative splicing sites using RNA-seq data. Although most of the splicing sites are known sites from the GTF file rather than noval, we still want to know if any of the alternative events are noval from the GTF file. So here I used the rMATS output of the RNA-seq data from a sample using GENCODE v36. When looking for the splicing patterns, a match is called when the exon and its upstream and downstream exist. If the exon is found but either the upstream or downstream is missing, a match is not called from the transcript of the gene.

In [1]:
import os
import pickle
from dataclasses import dataclass
from typing import List, Iterable

In [3]:
os.chdir('..')
from moPepGen import gtf
with open('test/files/gencode_36_index/annotation.pickle', 'rb') as fh:
    annotation = pickle.load(fh)

## 1. Skipped Exon (SE)

In [4]:
@dataclass
class SkippedExon:
    gene_id: str
    chrom: str
    skipped_exon_start:int
    skipped_exon_end:int
    upstream_exon_start:int
    upstream_exon_end:int
    downstream_exon_start:int
    downstream_exon_end:int

In [5]:
skipped_exons = []
with open('../../rmats/output/CPT0208690010_SE.MATS.JC.txt', 'rt') as fh:
    line = next(fh, None)
    line = next(fh, None)
    while line:
        fields = line.rstrip().split('\t')
        gene_id = fields[1].strip('"')
        chrom = fields[3]
        skipped_exon_start = int(fields[5])
        skipped_exon_end = int(fields[6])
        upstream_exon_start = int(fields[7])
        upstream_exon_end = int(fields[8])
        downstream_exon_start = int(fields[9])
        downstream_exon_end = int(fields[10])
        record = SkippedExon(gene_id, chrom, skipped_exon_start,
            skipped_exon_end, upstream_exon_start, upstream_exon_end,
            downstream_exon_start, downstream_exon_end)
        skipped_exons.append(record)
        line = next(fh, None)

In [81]:
results = []
i = 0
for record in skipped_exons:
    i += 1
    transcript_ids = annotation.genes[record.gene_id]
    transcripts = [annotation.transcripts[x] for x in transcript_ids.transcripts]
    retained = False
    skipped = False
    for transcript in transcripts:
        if retained and skipped:
            break
        it = iter(transcript.exon)
        exon = next(it, None)
        while exon:
            if int(exon.location.end) == record.upstream_exon_end:
                exon = next(it, None)
                
                if not exon:
                    continue

                if int(exon.location.start) == record.downstream_exon_start:
                    skipped = True
                    break
                if exon.location.start == record.skipped_exon_start and \
                        exon.location.end == record.skipped_exon_end:
                    exon = next(it, None)
                    if not exon:
                        continue
                    if int(exon.location.start) == record.downstream_exon_start:
                        retained = True
                        break
            exon = next(it, None)
    results.append((retained, skipped))

In [41]:
both, neither, skipped_only, retained_only = 0, 0, 0, 0
for skipped, retained in results:
    if skipped and retained:
        both += 1
    elif skipped:
        skipped_only += 1
    elif retained:
        retained_only += 1
    else:
        neither += 1

In [42]:
print(f'both:          {both}')
print(f'skipped only:  {skipped_only}')
print(f'retained only: {retained_only}')
print(f'neither:       {neither}')

both:          53133
skipped only:  11206
retained only: 12620
neither:       3698


+ 53133 genes have transcripts with or without the exon of interest. For those genes, no non-canonical peptides are resulted.
+ 11206 genes only have the transcripts that the exon of interest is skipped. For those genes, retaining the exon will generate non-canonical peptides.
+ 12620 genes only have the transcripts that teh exon of interest is retained. For those genes, the skipping the exon will generate non-canonical peptides.
+ 3698 genes have neither the skipped or retained transcript. This maybe because the upstream or downstream exon is also skipped. But since rMATS is not able to detect the skipping of multiple exons, we are then not able to know what really happened.

## 2. Alternative 5' Splicing Site (A5SS)

In [48]:
@dataclass
class AlternativeSplicingSite():
    gene_id: str
    chrom: str
    long_exon_start:int
    long_exon_end:int
    short_exon_start:int
    short_exon_end:int
    flanking_exon_start:int
    flanking_exon_end:int

In [49]:
a5ss = []
with open('../../rmats/output/CPT0208690010_A5SS.MATS.JC.txt', 'rt') as fh:
    line = next(fh, None)
    line = next(fh, None)
    while line:
        fields = line.rstrip().split('\t')
        line = next(fh, None)
        gene_id = fields[1].strip('"')
        chrom = fields[3]
        long_exon_start = int(fields[5])
        long_exon_end = int(fields[6])
        short_exon_start = int(fields[7])
        short_exon_end = int(fields[8])
        flanking_exon_start = int(fields[9])
        flanking_exon_end = int(fields[10])
        record = AlternativeSplicingSite(gene_id, chrom, long_exon_start, long_exon_end,
            short_exon_start, short_exon_end, flanking_exon_start,
            flanking_exon_end)
        a5ss.append(record)

In [53]:
results = []
for record in a5ss:
    transcript_ids = annotation.genes[record.gene_id]
    transcripts = [annotation.transcripts[x] for x in transcript_ids.transcripts]
    short = False
    long = False
    for transcript in transcripts:
        if short and long:
            break
        if transcript.transcript.strand == 1:
            it = iter(transcript.exon)
        else:
            it = reversed(transcript.exon)
        exon = next(it, None)
        while exon:
            if int(exon.location.start) == record.long_exon_start and \
                    int(exon.location.end) == record.long_exon_end:
                exon = next(it, None)
                if exon and exon.location.start == record.flanking_exon_start:
                    long = True
                break
            if int(exon.location.start) == record.short_exon_start and \
                    int(exon.location.end) == record.short_exon_end:
                exon = next(it, None)
                if exon and exon.location.start == record.flanking_exon_start:
                    short = True
                break
            exon = next(it, None)
    results.append((long, short))

In [54]:
both, neither, long_only, short_only = 0, 0, 0, 0
for long, short in results:
    if long and short:
        both += 1
    elif long:
        long_only += 1
    elif short:
        short_only += 1
    else:
        neither += 1

In [55]:
print(f'both:       {both}')
print(f'long only:  {long_only}')
print(f'short only: {short_only}')
print(f'neither:    {neither}')

both:       7694
long only:  3666
short only: 4309
neither:    1653


+ 7694 genes have transcripts with both the long and short version of the exon. For those genes, no non-canonical peptides are resulted.
+ 3666 genes only have the long exon, so the A5SS will genereate non-caninical peptides.
+ 4309 genes only have the short exon. The A5SS also generates non-canonical peptides.
+ 1653 genes don't have the exon annotated. The alternative splicing is more complicated and can not be infered at this stage.

## 3. Alternative 3' Splicing Site (A3SS)

In [26]:
a3ss = []
with open('../../rmats/output/CPT0208690010_A3SS.MATS.JC.txt', 'rt') as fh:
    line = next(fh, None)
    line = next(fh, None)
    while line:
        fields = line.rstrip().split('\t')
        line = next(fh, None)
        gene_id = fields[1].strip('"')
        chrom = fields[3]
        long_exon_start = int(fields[5])
        long_exon_end = int(fields[6])
        short_exon_start = int(fields[7])
        short_exon_end = int(fields[8])
        flanking_exon_start = int(fields[9])
        flanking_exon_end = int(fields[10])
        record = AlternativeSplicingSite(gene_id, chrom, long_exon_start, long_exon_end,
            short_exon_start, short_exon_end, flanking_exon_start,
            flanking_exon_end)
        a3ss.append(record)

In [27]:
results = []
for record in a3ss:
    transcript_ids = annotation.genes[record.gene_id]
    transcripts = [annotation.transcripts[x] for x in transcript_ids.transcripts]
    short = False
    long = False
    for transcript in transcripts:
        if short and long:
            break
        if transcript.transcript.strand == 1:
            it = iter(transcript.exon)
        else:
            it = reversed(transcript.exon)
        exon = next(it, None)
        while exon:
            if int(exon.location.start) == record.flanking_exon_start:
                exon = next(it, None)
                if not exon:
                    break
                if exon.location.start == record.long_exon_start and \
                    int(exon.location.end) == record.long_exon_end:
                    long = True
                elif int(exon.location.start) == record.short_exon_start and \
                    int(exon.location.end) == record.short_exon_end:
                    short = True
                break
            exon = next(it, None)
    results.append((long, short))

In [28]:
both, neither, long_only, short_only = 0, 0, 0, 0
for long, short in results:
    if long and short:
        both += 1
    elif long:
        long_only += 1
    elif short:
        short_only += 1
    else:
        neither += 1

In [29]:
print(f'both:       {both}')
print(f'long only:  {long_only}')
print(f'short only: {short_only}')
print(f'neither:    {neither}')

both:       12134
long only:  6318
short only: 4272
neither:    1902


+ 12134 genes have transcripts with both the long and short version of the exon. For those genes, no non-canonical peptides are resulted.
+ 6318 genes only have the long exon, so the A5SS will genereate non-caninical peptides.
+ 4272 genes only have the short exon. The A5SS also generates non-canonical peptides.
+ 1902 genes don't have the exon annotated. The alternative splicing is more complicated and can not be infered at this stage.

## 4. Mutually Exclusive Exons (MXE)

In [58]:
@dataclass
class MutuallyExclusiveExons():
    gene_id: str
    chrom: str
    first_exon_start:int
    first_exon_end:int
    second_exon_start:int
    second_exon_end:int
    upstream_exon_start:int
    upstream_exon_end:int
    downstream_exon_start:int
    downstream_exon_end:int

In [94]:
mxes = []
with open('../../rmats/output/CPT0208690010_MXE.MATS.JC.txt', 'rt') as fh:
    line = next(fh, None)
    line = next(fh, None)
    while line:
        fields = line.rstrip().split('\t')
        line = next(fh, None)
        gene_id = fields[1].strip('"')
        chrom = fields[3]
        first_exon_start = int(fields[5])
        first_exon_end = int(fields[6])
        second_exon_start = int(fields[7])
        second_exon_end = int(fields[8])
        upstream_exon_start = int(fields[9])
        upstream_exon_end = int(fields[10])
        downstream_exon_start = int(fields[11])
        downstream_exon_end = int(fields[12])
        record = MutuallyExclusiveExons(gene_id, chrom, first_exon_start,
            first_exon_end, second_exon_start, second_exon_end,
            upstream_exon_start, upstream_exon_end, downstream_exon_start,
            downstream_exon_end)
        mxes.append(record)

In [95]:
results = []
for record in mxes:
    transcript_ids = annotation.genes[record.gene_id]
    transcripts = [annotation.transcripts[x] for x in transcript_ids.transcripts]
    first, second, both = False, False, False
    for transcript in transcripts:
        if first and second:
            break
        it = iter(transcript.exon)
        exon = next(it, None)
        while exon:
            if int(exon.location.end) == record.upstream_exon_end:
                exon = next(it, None)
                if not exon:
                    break
                if int(exon.location.start) == record.first_exon_start and \
                        int(exon.location.end) == record.first_exon_end:
                    exon = next(it, None)
                    if not exon:
                        break
                    if int(exon.location.start) == record.downstream_exon_start:
                        first = True
                    elif int(exon.location.start) == record.second_exon_start and \
                            int(exon.location.end) == record.second_exon_end:
                        exon = next(it, None)
                        if not exon:
                            break
                        if int(exon.location.start) == record.downstream_exon_start:
                            both = True
                elif int(exon.location.start) == record.second_exon_start and \
                        int(exon.location.end) == record.second_exon_end:
                    exon = next(it, None)
                    if not exon:
                        break
                    if int(exon.location.start) == record.downstream_exon_start:
                        second = True
            exon = next(it, None)
    results.append((first, second, both))

In [96]:
both, neither, first_only, second_only = 0, 0, 0, 0
for i_first, i_second, i_both in results:
    if i_both:
        both += 1
    elif i_first:
        first_only += 1
    elif i_second:
        second_only += 1
    else:
        neither += 1

In [97]:
print(f'both:        {both}')
print(f'first only:  {first_only}')
print(f'second only: {second_only}')
print(f'neither:     {neither}')

both:        2691
first only:  3220
second only: 485
neither:     852


+ 2691 genes have both the first and second exons. For those genes, no non-canonical peptides are resulted.
+ 3220 genes only have the first exon, so the non-caninical peptides will be generated when the first exon is replaced with the second.
+ 485 genes only have the second exon. Non-caninical peptides will be generated when the second exon is replaced with the first.
+ 1902 genes don't have either version. It could also because the upstream or downstream exons are skipped. The alternative splicing is more complicated and can not be infered at this stage.

## Conclusion

rMATS is able to call 5 alternative splicing events. They are skipped exons (SE), alternative 5' splicing site, alternative 3' splicing site, mutual exclusive exons and retained intron. Although all alternative slicing sites come from the provided genomic annotation (GTF), not all of the events are included in it. For example for skipped exon, in the example above, there are 14,893 events of which only the exon skipped version is annotated, 4,765 events of which only the exon retained version is annotated. 17,672 events that have both the skipped and retained version, while 43,327 of them have neither skipped or retained version of transcript.