# Querying endolysin sequences from the NCBI Protein database

This notebook builds a FASTA file of endolysin sequences retrieved from the NCBI Protein database. The query was run on 27/05/2024 with the search term "endolysin[Protein Name] AND txid28883[Organism]" and returned 9495 entries.

In [1]:
# Import necessary packages
from Bio import Entrez, SeqIO

In [2]:
# Set your email address (required by NCBI)
Entrez.email = "pg49130@alunos.uminho.pt"

# Define the search term; txid28883 restricts the results to tailed bacteriophages
# (Caudoviricetes, formerly Caudovirales)
search_term = "endolysin[Protein Name] AND txid28883[Organism]"

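# Perform the search; retmax is set above the expected hit count so that all matching IDs
# come back in one request (NCBI documents a cap of 10,000 UIDs per ESearch call)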
handle = Entrez.esearch(db="protein", term=search_term, retmax=100000)
record = Entrez.read(handle)
handle.close()

# Get a list of protein IDs
protein_ids = record["IdList"]
# Print the number of protein entries
print(len(protein_ids), "entries")

# Fetch the sequences for all IDs as FASTA-formatted text (returned as a single string)
handle = Entrez.efetch(db="protein", id=protein_ids, rettype="fasta", retmode="text")
protein_records = handle.read()
handle.close()

9495 entries
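
The cell above requests all IDs in a single `efetch` call. For larger result sets it may be safer to fetch in batches, so that one failed request does not lose the whole download. The following is a minimal sketch of that idea, not the method used in this notebook: the helper names `chunked` and `fetch_in_batches`, the default batch size of 500, and the `fetch` callable (assumed to take a list of IDs and return FASTA text) are all illustrative assumptions.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` holding at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_in_batches(
    ids: Sequence[str],
    fetch: Callable[[List[str]], str],
    batch_size: int = 500,  # assumed value, not taken from the notebook
) -> str:
    """Call `fetch` once per batch and concatenate the returned FASTA text."""
    parts = []
    for batch in chunked(ids, batch_size):
        text = fetch(batch)
        # End every batch with a newline so the next header starts on its own line.
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    return "".join(parts)


# Hypothetical usage with the notebook's own call (requires network access):
# def fetch(batch):
#     handle = Entrez.efetch(db="protein", id=batch, rettype="fasta", retmode="text")
#     try:
#         return handle.read()
#     finally:
#         handle.close()
# protein_records = fetch_in_batches(protein_ids, fetch)
```

Passing the download step in as a callable keeps the batching logic independent of the network, so it can be checked with a stand-in function before it is pointed at NCBI.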


In [None]:
# Save the protein entries to a FASTA file
output_file = "endolysin-ncbi-raw.fasta"
with open(output_file, "w") as f:
    f.write(protein_records)
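
Before using the file downstream, it can help to confirm that the number of FASTA records matches the number of IDs returned by the search, since a truncated download would otherwise go unnoticed. A minimal sketch of such a check follows; the function names `count_fasta_records` and `check_record_count` are hypothetical helpers, not part of the original notebook, and the count relies only on the FASTA convention that each record starts with a `>` header line.

```python
def count_fasta_records(fasta_text: str) -> int:
    """Count FASTA records by counting header lines, which start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def check_record_count(fasta_text: str, expected: int) -> bool:
    """Return True if the text holds exactly `expected` records, else warn."""
    found = count_fasta_records(fasta_text)
    if found != expected:
        print(f"Warning: expected {expected} records, found {found}")
    return found == expected


# Hypothetical usage with the notebook's variables:
# check_record_count(protein_records, len(protein_ids))
```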