# Retrieve Phatr3 protein sequences

### Latest update: 29 April 2020

### Raffaela Abbriano (raffaela.abbriano@uts.edu.au)

This code reads a list of Phatr3 protein IDs (supplied by you in a text file) and returns the protein sequences for those IDs in FASTA format. It is compatible with the protein IDs from the Phaeodactylum tricornutum genome re-annotation available on EnsemblProtists. Protein sequences are taken from Phaeodactylum_tricornutum.ASM15095v2.pep.all.fa, which was downloaded from https://protists.ensembl.org/Phaeodactylum_tricornutum/Info/Index on 24 April 2020. The code is customized for the Phatr3 annotation, but it can be modified to retrieve sequences from any FASTA file.

To run, save your protein IDs in a text file, one ID per line in the format Phatr3_XYZ (see my_ids.txt for an example). Results will be saved in a new FASTA file, 'my_ids_aa.fasta', in the same directory.
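
The ID preparation described above can be sketched as a small helper. This is an illustration only and not part of the original notebook: the function name `prepare_ids` is hypothetical, the IDs are placeholders in the Phatr3_XYZ format, and the sketch assumes every ID should receive the `.p1` suffix used in the Phatr3 protein headers.

```python
def prepare_ids(raw_lines, suffix=".p1"):
    """Turn raw text lines into unique FASTA lookup keys (hypothetical helper)."""
    keys = []
    seen = set()
    for line in raw_lines:
        pid = line.strip()      # drop the newline and surrounding whitespace
        if not pid:             # ignore blank lines
            continue
        key = pid + suffix
        if key not in seen:     # keep only the first occurrence of each ID
            seen.add(key)
            keys.append(key)
    return keys


# Placeholder IDs, including a blank line and a repeat
example = ["Phatr3_XYZ\n", "\n", "Phatr3_ABC\n", "Phatr3_XYZ\n"]
print(prepare_ids(example))  # ['Phatr3_XYZ.p1', 'Phatr3_ABC.p1']
```

Unlike a plain `set`, this variant keeps the IDs in the order they appear in the text file, which may be convenient if the output should follow the input order.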

In [118]:
#Import necessary libraries, requires Biopython package
from Bio import SeqIO

In [119]:
#Create an empty list
lines = []   

#Import the text file (the with block closes the file once the lines are read)
with open('my_ids.txt', 'r') as handle:
    f = handle.readlines()

In [120]:
#For each ID in the text file, remove leading/trailing whitespace and add to the list
for line in f:
    newid = line.strip()
    if newid: #skip blank lines
        lines.append(newid + '.p1') #append .p1 to ID to make compatible with Phatr3 headers
    
#Print out how many IDs are being used as keys to pull sequences from the fasta file
print('Searching for',len(lines),'protein sequences')

Searching for 3 protein sequences


In [121]:
#Use set to create unordered set of unique ids and put them into a new list called mylist 
#Eliminates repeat IDs in text file
myset = set(lines)
mylist = list(myset)

In [122]:
#Index the Phatr3 annotation file; SeqIO.index returns a dictionary-like object keyed by record ID
pt_dict = SeqIO.index("Phaeodactylum_tricornutum.ASM15095v2.pep.all.fa", "fasta") 
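
The keys of this index are the record IDs, which Biopython takes from the first whitespace-delimited word of each FASTA header line. The sketch below mimics that rule in plain Python to show why the `.p1` suffix has to be appended to the IDs. The helper name `header_to_key` is hypothetical, and the header text is an invented placeholder rather than a line copied from the Ensembl file.

```python
def header_to_key(header_line):
    """Return the first word after '>' in a FASTA header (hypothetical helper)."""
    return header_line.lstrip(">").split()[0]


# Invented header for illustration; real headers carry additional description fields
header = ">Phatr3_XYZ.p1 pep description text"
print(header_to_key(header))                  # Phatr3_XYZ.p1
print(header_to_key(header) == "Phatr3_XYZ")  # False: the bare ID does not match
```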

In [123]:
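outfilename = "my_ids_aa.fasta" #names the output file
outfile = open(outfilename, "wb") #creates output file in binary mode (get_raw returns bytes); "wb" overwrites an existing file so re-running does not append duplicate records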

In [124]:
#Loop over each ID in mylist and write the raw FASTA record (header and aa sequence) from the index
for PID in mylist:
    if PID in pt_dict:
        outfile.write(pt_dict.get_raw(PID))
    else:
        print('ID not found:', PID) #report IDs missing from the annotation file

In [125]:
#Close output file
outfile.close()
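
As noted in the introduction, the same idea can be adapted to any FASTA file. The following is one possible Biopython-free sketch, not the method used in this notebook: it streams through the file once instead of building an index, and the function name `extract_fasta_records` and its parameters are hypothetical. It assumes the record ID is the first whitespace-delimited word of each header line.

```python
def extract_fasta_records(fasta_path, wanted_ids, out_path):
    """Copy records whose ID is in wanted_ids to out_path; return the IDs found."""
    wanted = set(wanted_ids)
    found = set()
    keep = False  # True while the current record is one of the wanted ones
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Header line: decide whether to keep this record
                fields = line[1:].split()
                record_id = fields[0] if fields else ""
                keep = record_id in wanted
                if keep:
                    found.add(record_id)
            if keep:
                dst.write(line)  # header and sequence lines are copied unchanged
    return found
```

A call such as `found = extract_fasta_records("any.fa", mylist, "subset.fa")` (file names are placeholders) writes the matching records, and `set(mylist) - found` then gives the IDs that were not present in the file. Records are written in the order they occur in the source FASTA file, not in the order of the ID list.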