# Preparing GTDB files for use with Kaiju

Steps:
1) Download protein sequence data
2) Convert the FASTA headers into the format Kaiju expects
3) Build the Kaiju index
4) Run Kaiju

# Step 1

GTDB's main website: https://gtdb.ecogenomic.org/
Protein sequences were retrieved from: https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps/
The data are in "gtdb_proteins_aa_reps.tar.gz".
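
Once the archive has been downloaded, it needs to be unpacked before the files can be read. A minimal sketch using Python's standard `tarfile` module; the helper name `extract_archive` is hypothetical, and the layout of the archive (for example, whether it unpacks to the `./bacteria` directory read in Step 2) is an assumption that should be checked against the actual download.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()  # paths stored in the archive
        tar.extractall(dest)    # only unpack archives from a trusted source
    return names
```

Calling `extract_archive("gtdb_proteins_aa_reps.tar.gz", ".")` would unpack the download into the current directory and return the list of extracted paths, which is a quick way to see how the files are organized.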

# Step 2
The GTDB protein files currently use FASTA headers of this form:
>Accession#1.1 | more_info_about_species
AYKMQFHEGTGNPRQIHGCDSKLATVEGHAVIAWLACYWAGI
>Accession#1.2 | more_info_about_species
GIQRTVDHCQATWLPMIHFCAKWLAQPIGTVWIGM

However, kaiju-makedb requires FASTA headers that contain only the taxon ID, as follows:
>TaxID 
AYKMQFHEGTGNPRQIHGCDSKLATVEGHAVIAWLACYWAGI
>TaxID
GIQRTVDHCQATWLPMIHFCAKWLAQPIGTVWIGM

The next two code cells rewrite the GTDB files into this format.

In [None]:
# This program loops through the bacteria files, takes the first accession
# number in each file, looks up its taxon ID, and writes a new file whose
# headers contain only the taxon ID, followed by the protein sequences.

import os
import subprocess

directory = "./bacteria"
with open("proteinsFIRST.faa", "w") as fout:
    for file in os.listdir(directory):  # loop through files
        with open(os.path.join(directory, file)) as f:
            contents = f.readlines()  # read lines in each file
        if not contents:
            continue  # skip empty files
        first = contents[0].strip()
        per = first.find(".")
        acc = first[1:per]  # accession number: text between ">" and the first "."
        # use subprocess to run the Entrez Direct CLI tools (efetch, xtract)
        taxid = subprocess.check_output(
            f'efetch -db nuccore -id {acc} -format docsum | '
            'xtract -pattern DocumentSummary -element TaxId',
            shell=True,
        )
        # strip the trailing newline; the result is empty if no taxon ID was found
        taxid = taxid.decode("utf-8").strip()
        for line in contents:
            # write the taxon ID as the header and copy sequence lines unchanged
            if line.startswith(">"):
                fout.write(">" + taxid + "\n")
            else:
                fout.write(line)
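
The header rewriting can be tried without contacting NCBI by running it on the example records from Step 2. This is a minimal sketch: the function names `accession_from_header` and `reheader` are hypothetical, and the taxon ID `12345` is a made-up placeholder rather than a real lookup result.

```python
def accession_from_header(header):
    """Return the text between '>' and the first '.' of the header's first field."""
    token = header.lstrip(">").split()[0]  # first whitespace-delimited field
    return token.split(".")[0]


def reheader(lines, taxid):
    """Yield FASTA lines with every header replaced by '>taxid'."""
    for line in lines:
        if line.startswith(">"):
            yield f">{taxid}\n"
        else:
            yield line


example = [
    ">Accession#1.1 | more_info_about_species\n",
    "AYKMQFHEGTGNPRQIHGCDSKLATVEGHAVIAWLACYWAGI\n",
    ">Accession#1.2 | more_info_about_species\n",
    "GIQRTVDHCQATWLPMIHFCAKWLAQPIGTVWIGM\n",
]

acc = accession_from_header(example[0])
converted = list(reheader(example, "12345"))  # 12345 is a placeholder taxon ID
print(acc)
print("".join(converted), end="")
```

Separating the parsing from the lookup makes it possible to check the output format on a few records before starting the much slower pass over the full download.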

In [None]:
# This program loops through the FASTA file and removes sequences with
# missing taxon IDs.

keep = False  # True while the current record has a numeric taxon ID header

with open("./proteinsFIRST.faa") as f1, open("./proteins.faa", "w") as f2:
    for line in f1:
        if line.startswith(">"):
            # keep the record only if the header is ">" followed by digits
            keep = line[1:].strip().isdigit()
        if keep:
            f2.write(line)
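
Before building the index, it can be worth confirming that every header left in `proteins.faa` is numeric. A minimal sketch of such a check, shown on a small in-memory sample; the function name `count_headers` and the sample records are hypothetical.

```python
def count_headers(lines):
    """Return (numeric, non_numeric) counts of FASTA header lines."""
    numeric = 0
    non_numeric = 0
    for line in lines:
        if line.startswith(">"):
            if line[1:].strip().isdigit():
                numeric += 1
            else:
                non_numeric += 1
    return numeric, non_numeric


# Illustrative sample: two records with taxon IDs, one with an empty header
sample = [">562\n", "AYKM\n", ">\n", "GIQR\n", ">1280\n", "MKV\n"]
numeric, non_numeric = count_headers(sample)
print(numeric, non_numeric)

# On the real file, the same function accepts an open file handle:
# with open("proteins.faa") as f:
#     print(count_headers(f))
```

After the filtering step, the second number should be zero for `proteins.faa`.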

# Step 3
Build the Kaiju index (.fmi file) from proteins.faa, following the instructions at:
https://github.com/bioinformatics-centre/kaiju


In [None]:
kaiju-mkbwt -n 5 -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa
kaiju-mkfmi proteins

# Step 4
Run Kaiju on the input reads. The following command is from Kaiju's website:

In [None]:
kaiju -z 25 -t path/to/nodes.dmp -f path/to/proteins.fmi -i path/to/inputfile.fastq -o kaiju.out
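
The resulting `kaiju.out` can be summarized with a short script. This sketch assumes Kaiju's default tab-separated output, in which the first column is `C` (classified) or `U` (unclassified), the second is the read name, and the third is the assigned taxon ID; the function name `tally_assignments` and the sample lines are hypothetical.

```python
from collections import Counter


def tally_assignments(lines):
    """Count classified reads per taxon ID and the number of unclassified reads."""
    counts = Counter()
    unclassified = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if fields[0] == "C":
            counts[fields[2]] += 1
        elif fields[0] == "U":
            unclassified += 1
    return counts, unclassified


# Illustrative lines in the assumed output format
sample = [
    "C\tread1\t562\n",
    "U\tread2\t0\n",
    "C\tread3\t562\n",
    "C\tread4\t1280\n",
]
counts, unclassified = tally_assignments(sample)
print(counts.most_common(), unclassified)

# On the real output file:
# with open("kaiju.out") as f:
#     counts, unclassified = tally_assignments(f)
```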