# Isoform selection

This notebook parses the FASTA file containing all peptide sequences of *Theobroma cacao*. Whenever a gene identifier has two or more sequences (isoforms), only the longest one is kept.

The FASTA files come from ftp://ftp.ensemblgenomes.org/pub/plants/release-41/fasta/theobroma_cacao
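The selection rule can be illustrated on a toy example before touching the real file. This is only a minimal sketch in plain Python: the gene names, transcript names and lengths are made up, and `longest_per_gene` is a hypothetical helper, not part of the notebook.

```python
# Toy records: (gene id, transcript id, peptide length). All values are made up.
records = [
    ("geneA", "tA.1", 120),
    ("geneA", "tA.2", 340),
    ("geneB", "tB.1", 95),
    ("geneC", "tC.1", 210),
    ("geneC", "tC.2", 180),
]

def longest_per_gene(records):
    """Return {gene: (transcript, length)}, keeping the longest isoform per gene."""
    best = {}
    for gene, transcript, length in records:
        # Strict '>' keeps the first isoform seen when two lengths tie.
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript, length)
    return best

selected = longest_per_gene(records)
for gene, (transcript, length) in selected.items():
    print(gene, transcript, length)
```

Genes with a single isoform pass through unchanged, which is also what the cells below do.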

In [None]:
from Bio import SeqIO
import pandas as pd

In [2]:
isoform_stats = {"gene" : [] ,
                 "transcript" : [] ,
                 "length" : []}

In [3]:
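# parse the input multi-FASTA file; the 4th and 5th space-separated tokens
# of each description line hold the gene and transcript identifiers
for s in SeqIO.parse("source_data/Theobroma_cacao.Theobroma_cacao_20110822.pep.all.fa", "fasta"):
    fields = s.description.split(" ")
    isoform_stats['gene'].append(fields[3])
    isoform_stats['transcript'].append(fields[4])
    # store the length as an int so that sorting is numeric, not lexicographic
    isoform_stats['length'].append(len(s.seq))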
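The cell above relies on the gene and transcript fields always sitting at fixed positions in the description line. A possibly more tolerant variant looks the fields up by their `key:value` prefix instead. The sketch below assumes headers made of space-separated `key:value` tokens; the example header is invented for illustration and `parse_header_fields` is a hypothetical helper. Unlike the cell above, it returns the identifiers without their `gene:` or `transcript:` prefix.

```python
def parse_header_fields(description):
    """Collect 'key:value' tokens of a FASTA description line into a dict."""
    fields = {}
    for token in description.split():
        key, sep, value = token.partition(":")
        # Keep only tokens that contain a colon; the first occurrence of a key wins.
        if sep and key not in fields:
            fields[key] = value
    return fields

# Invented header, for illustration only.
example = "T0001 pep chromosome:ASM:1:100:900:1 gene:G0001 transcript:T0001 gene_biotype:protein_coding"
fields = parse_header_fields(example)
print(fields["gene"], fields["transcript"])
```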

In [4]:
# generating the gene list (without duplicates) for the parsed FASTA file
isoform_stats = pd.DataFrame(isoform_stats)
genes = list(set(isoform_stats['gene']))

In [5]:
# pandas data frame for selected isoforms
# isoform selection; the longest isoform is selected for each gene in "genes"
# (DataFrame.append was removed in pandas 2.0, so the rows are collected
# in a list and concatenated once at the end)
longest = []
for gene in genes:
    isoforms = isoform_stats.loc[isoform_stats['gene'] == gene]
    isoforms = isoforms.sort_values(by=['length'], ascending=False)
    longest.append(isoforms.head(1))
selected_isoforms = pd.concat(longest)
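Sorting by `length` only picks the longest isoform when the lengths are compared as numbers. Compared as text, they are ordered character by character, so a two-digit length can rank above a four-digit one. A small illustration with made-up values:

```python
# Made-up lengths, once as text and once as integers.
lengths_as_text = ["99", "1000", "250"]
lengths_as_int = [int(x) for x in lengths_as_text]

# Descending sort, then take the first element as the "longest".
longest_text = sorted(lengths_as_text, reverse=True)[0]
longest_int = sorted(lengths_as_int, reverse=True)[0]

print(longest_text)  # text comparison puts "99" first
print(longest_int)   # numeric comparison puts 1000 first
```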

In [6]:
# sort the selected isoforms by row index; each index is the position of the record in the FASTA file
selected_isoforms = selected_isoforms.sort_index()
list_ind = list(selected_isoforms.index)

In [7]:
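# Writing out a new FASTA file containing only the selected isoforms
selected_ind = set(list_ind)
with open("selected_isoforms.fasta", "w") as OUT:
    # record positions in the file match the row index of isoform_stats;
    # a set lookup avoids indexing an empty list once all records are written
    for i, seq in enumerate(SeqIO.parse("./source_data/Theobroma_cacao.Theobroma_cacao_20110822.pep.all.fa", "fasta")):
        if i in selected_ind:
            SeqIO.write(seq, OUT, "fasta")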