# PsiBLAST processing

This notebook extracts profiles with psiblast for each of the sequences in the training set. I am using psiblast from the official Linux distribution on the NCBI website:

`ncbi-blast-2.11.0+-x64-linux.tar.gz`

The archive contains precompiled executables. It also ships a script, `update_blastdb.pl`, that I use to download the databases with this syntax:

`./update_blastdb.pl --decompress <database>`

To see available databases run:

`./update_blastdb.pl --showall`

For this project I am using the `nr` database, but I also downloaded `swissprot` since it is much smaller and I did not have enough space initially. swissprot takes a few GB, while nr is around 300 GB decompressed. There is no need to run `makeblastdb` on the databases downloaded with `update_blastdb.pl`, since they are already in a usable format. Running psiblast on nr is quite heavy, so I run it on the server aeserv19a. Simple file movements are omitted here.

In [1]:
import pandas as pd
import numpy as np
import json

in_path = '../processing/input_sequences/'
input_list_file = '../processing/input_list.txt'
seq_json_file = '../dataset/Reeb2020/sequences.json'

I create one fasta file for each sequence in the Reeb2020 set, named in a similar way to how the datasets are referenced in the csv of the paper.

In [2]:
# load the sequences and close the file handle when done
with open(seq_json_file) as json_seqs_filein:
    json_seqs_dict = json.load(json_seqs_filein)
for paper in json_seqs_dict:
    for dms_set in json_seqs_dict[paper]:
        with open(in_path + paper + '-' + dms_set + '.fasta', 'w') as fastaout:
            fastaout.write('>' + paper + '/' + dms_set + '\n' + json_seqs_dict[paper][dms_set] + '\n')

I put the fasta basenames in a file for easier handling

In [3]:
!ls $in_path | cut -d '.' -f '1' > $input_list_file
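The shell pipeline above keeps everything before the first dot of each file name, so a paper or set name that itself contains a dot would be truncated. A minimal pure-Python sketch of the same step is shown here for cases where the shell is not available; the helper name `write_basename_list` and the demo file names are hypothetical, not part of the original workflow.

```python
from pathlib import Path
import tempfile


def write_basename_list(fasta_dir, list_file):
    """Write one basename per line, keeping the text before the first dot."""
    names = sorted(
        p.name.split('.')[0] for p in Path(fasta_dir).iterdir() if p.is_file()
    )
    Path(list_file).write_text(''.join(name + '\n' for name in names))
    return names


# Demonstration on a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / 'input_sequences'
    demo_dir.mkdir()
    for fname in ('paperA-set1.fasta', 'paperB-set2.fasta'):
        (demo_dir / fname).write_text('>x\nACD\n')
    demo_list = Path(tmp) / 'input_list.txt'
    demo_names = write_basename_list(demo_dir, demo_list)
    demo_written = demo_list.read_text()
```

In the notebook this would be called as `write_basename_list(in_path, input_list_file)`.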

Run this only if you want to process the swissprot pssm files

In [2]:
in_ckp_path = '../processing/psiblast_processing/swissprot_db/ckp_files/'
out_profile_path = '../processing/psiblast_processing/swissprot_db/profiles/'
out_pssm_path = '../processing/psiblast_processing/swissprot_db/pssm_array/'

Run this only if you want to process the nr pssm files

In [2]:
in_ckp_path = '../processing/psiblast_processing/nr_db/ckp_files/'
out_profile_path = '../processing/psiblast_processing/nr_db/profiles/'
out_pssm_path = '../processing/psiblast_processing/nr_db/pssm_array/'

I run psiblast from a wrapper script:

`./psiblast_wrapper.sh <list of fasta files> <database>`

This is very heavy, so I run it on the server and not in the notebook. The parameters used are:

- E-value reporting threshold of 0.01
- 3 iterations
- PSSM matrices written as output
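The contents of `psiblast_wrapper.sh` are not shown here, so the following is only a sketch of how the parameters listed above might map onto a psiblast command line. It assumes the reporting threshold corresponds to `-evalue`, the iteration count to `-num_iterations`, and the text PSSM output to `-out_ascii_pssm`; the function name, the example paths and the choice of output file name (taken from the `.psiblast.ckp` suffix used later in this notebook) are assumptions.

```python
def build_psiblast_cmd(fasta_file, database, out_prefix,
                       evalue=0.01, num_iterations=3):
    """Return a psiblast argument list (hypothetical helper, not the wrapper itself)."""
    return [
        'psiblast',
        '-query', fasta_file,
        '-db', database,
        '-evalue', str(evalue),                  # E-value reporting threshold
        '-num_iterations', str(num_iterations),  # number of PSI-BLAST rounds
        '-out_ascii_pssm', out_prefix + '.psiblast.ckp',  # text PSSM output
    ]


# Hypothetical example paths, for illustration only
cmd = build_psiblast_cmd('paperA-set1.fasta', 'nr', 'paperA-set1')
print(' '.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where psiblast and the database are installed.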

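After putting the pssm files in the appropriate folder, I extract the pssm and the profile from each of them and save them in separate .npy files. In each row of the file, after the position and the residue, the first 20 columns are the pssm, the next 20 columns are the profile (as percentages), and the last 2 columns are discarded. I also move the resulting .npy files to the appropriate folder.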
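To make the column layout concrete, here is a small sketch that splits one row the same way the functions in the next cell do. The row is synthetic, built only for illustration, and does not come from a real psiblast run.

```python
import numpy as np

# Synthetic row: position, residue, 20 scores, 20 percentages, 2 trailing values
demo_scores = [-1] * 19 + [5]
demo_percent = [0] * 19 + [100]
demo_row = ' '.join(
    ['1', 'M'] + [str(v) for v in demo_scores + demo_percent] + ['0.55', '0.10']
)

fields = demo_row.split()
pssm_row = np.array([int(v) for v in fields[2:22]])               # log-odds scores
profile_row = np.array([float(v) for v in fields[22:-2]]) / 100  # fractions

print(pssm_row.shape, profile_row.shape)
```

Both slices have 20 entries, one per amino acid, and the profile row is rescaled from percentages to fractions.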

In [3]:
def get_profile(checkpoint_file):
    """
    Extract the profile portion from a single-sequence psiblast checkpoint
    file and return it as a numpy array.
    """
    ckp = []
    header = True
    footer = False
    with open(checkpoint_file) as handle:
        for line in handle:
            line_l = line.rstrip().split()

            if len(line_l) > 0 and line_l[0] == "1":
                header = False

            if not header and len(line_l) == 0:
                footer = True

            if not header and not footer:
                # select only the profile and discard the pssm
                # and the last 2 columns
                ckp.append([float(el) for el in line_l[22:-2]])
    ckp_mat = np.array(ckp)

    # the file stores percentages, convert them to fractions
    profile = ckp_mat / 100
    
    return profile


def get_pssm(checkpoint_file):
    """
    Extract the pssm portion from a single-sequence psiblast checkpoint
    file and return it as a numpy array.
    """
    pssm = []
    header = True
    footer = False
    with open(checkpoint_file) as handle:
        for line in handle:
            line_l = line.rstrip().split()

            if len(line_l) > 0 and line_l[0] == "1":
                header = False

            if not header and len(line_l) == 0:
                footer = True

            if not header and not footer:
                # select only the pssm and discard the profile
                pssm.append([int(el) for el in line_l[2:22]])
    pssm_mat = np.array(pssm)
    
    # no normalization done, could be revisited later
    
    return pssm_mat

with open(input_list_file) as handle:
    for line in handle:
        curr_header = line.rstrip()
        ckp_file = in_ckp_path + curr_header + '.psiblast.ckp'
        profile_outfile = out_profile_path + curr_header + '.profile.npy'
        pssm_outfile = out_pssm_path + curr_header + '.pssm.npy'
        curr_profile, curr_pssm = get_profile(ckp_file), get_pssm(ckp_file)
        np.save(profile_outfile, curr_profile)
        np.save(pssm_outfile, curr_pssm)

Now all the profiles are stored in numpy arrays and ready for further processing.
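As a quick sanity check before further processing, the saved arrays can be reloaded and their shapes compared. This is a minimal sketch: the helper `load_profile_and_pssm` is hypothetical, and the demo uses zero-filled arrays in a temporary folder instead of real psiblast output.

```python
import os
import tempfile

import numpy as np


def load_profile_and_pssm(profile_file, pssm_file):
    """Load both arrays and check that each is (sequence length, 20)."""
    profile = np.load(profile_file)
    pssm = np.load(pssm_file)
    for name, arr in (('profile', profile), ('pssm', pssm)):
        if arr.ndim != 2 or arr.shape[1] != 20:
            raise ValueError(f'{name} has unexpected shape {arr.shape}')
    if profile.shape != pssm.shape:
        raise ValueError('profile and pssm shapes differ')
    return profile, pssm


# Demonstration with placeholder arrays for a 5-residue sequence
with tempfile.TemporaryDirectory() as tmp:
    demo_profile_file = os.path.join(tmp, 'demo.profile.npy')
    demo_pssm_file = os.path.join(tmp, 'demo.pssm.npy')
    np.save(demo_profile_file, np.zeros((5, 20)))
    np.save(demo_pssm_file, np.zeros((5, 20), dtype=int))
    demo_profile, demo_pssm = load_profile_and_pssm(demo_profile_file, demo_pssm_file)
```

An empty or one-dimensional array here would suggest that the corresponding checkpoint file was missing rows or was not parsed as expected.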