# Profile processing

In this notebook I process the profiles, stored as numpy arrays, that I obtained from the psiblast PSSMs. My aim is to extract features that are useful for the project.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

profile_path = '../processing/psiblast/swissprot_db/profiles/'
out_path = '../processing/profiles_processing/swissprot/shannon/'
input_list_file = '../processing/input_list.txt'

I can use Shannon entropy as a measure of absolute conservation at each position: the lower the entropy, the more conserved the position. For each profile, I create a vector of Shannon entropies (one per position). `scipy.stats.entropy` normalizes its input so that it sums to 1, so I can add a pseudocount to the profile without renormalizing it by hand.

In [2]:
def shannon_entropy(profile, pseudocount=0.001):
    # One entropy value (in nats) per row, assuming one row per position.
    return stats.entropy(profile + pseudocount, axis=1)

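# Compute and save one entropy vector per input name listed in the file.
with open(input_list_file) as handle:
    for line in handle:
        input_name = line.strip()
        if not input_name:
            continue  # skip blank lines, which would otherwise build an invalid file name
        profile = np.load(profile_path + input_name + '.profile.npy')
        entropy_array = shannon_entropy(profile)
        np.save(out_path + input_name + '.shannon.npy', entropy_array)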
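As a quick sanity check of the entropy step, here is a minimal sketch on a toy profile. The array `toy_profile` is made up for illustration (4 columns instead of the 20 a real profile would presumably have), and it assumes one row per position, as in the cell above.

```python
import numpy as np
from scipy import stats

# Hypothetical toy profile: 3 positions (rows) x 4 residue types (columns).
toy_profile = np.array([
    [10.0, 0.0, 0.0, 0.0],  # fully conserved position
    [5.0, 5.0, 0.0, 0.0],   # two equally frequent residues
    [1.0, 1.0, 1.0, 1.0],   # no conservation at all
])

pseudocount = 0.001
toy_entropy = stats.entropy(toy_profile + pseudocount, axis=1)

# stats.entropy rescales each row to sum to 1, so normalizing by hand
# beforehand gives the same result.
shifted = toy_profile + pseudocount
normalized = shifted / shifted.sum(axis=1, keepdims=True)
same_result = np.allclose(toy_entropy, stats.entropy(normalized, axis=1))

print(toy_entropy)   # entropies in nats, lowest for the conserved position
print(same_result)
```

The values are in nats because `stats.entropy` uses the natural logarithm by default; passing `base=2` would give bits instead.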

TODO: `stats.entropy` can also calculate the Kullback-Leibler divergence when given two input vectors. Check what it is.
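As a starting point for this TODO: when a second distribution `qk` is passed, `stats.entropy(pk, qk)` returns the Kullback-Leibler divergence `sum(pk * log(pk / qk))`, which measures how far `pk` is from `qk`. The sketch below compares each position of a made-up toy profile with a uniform background. Both `toy_profile` and `uniform_background` are assumptions for illustration; a real analysis might use amino acid background frequencies instead.

```python
import numpy as np
from scipy import stats

# Hypothetical toy profile: 2 positions (rows) x 4 residue types (columns).
toy_profile = np.array([
    [10.0, 0.0, 0.0, 0.0],  # fully conserved position
    [1.0, 1.0, 1.0, 1.0],   # matches a uniform background
])

# Assumed background: every residue type equally likely.
# Built with the same shape as the profile to avoid relying on broadcasting.
uniform_background = np.full(toy_profile.shape, 0.25)

pseudocount = 0.001
kl_divergence = stats.entropy(toy_profile + pseudocount, uniform_background, axis=1)

print(kl_divergence)  # large for the conserved position, about 0 for the uniform one
```

With a uniform background over n residue types, the divergence equals `log(n)` minus the Shannon entropy, so it ranks positions in the opposite order to the entropy. A non-uniform background would instead measure conservation relative to the expected residue frequencies.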