# Training dataframe to vectors Reeb2020
This notebook converts the complete training dataframe into NumPy arrays that can be used directly for training.

In [45]:
import pandas as pd
import numpy as np

out_path = '../processing/training_set/'
df = pd.read_csv('../dataset/Reeb2020/SetAll_only_deleterious.csv')

Some entries have an undefined value in the `ToAA` field, marked with the `*` character. Since there are only a few of them, I remove them.

In [35]:
print(len(df[df['FromAA'] == '*']))
print(len(df[df['ToAA'] == '*']))

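# combine both conditions in one boolean mask; copy so later column assignments are safe
df_clean = df[(df['FromAA'] != '*') & (df['ToAA'] != '*')].copy()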

0
30
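
The filtering step can be sketched on a toy dataframe. The rows below are made up for illustration and only reuse the `FromAA` and `ToAA` column names from the real CSV; the point is that both conditions are combined into a single boolean mask with `&` before indexing.

```python
import pandas as pd

# Toy frame standing in for the real CSV; the rows are made up for illustration.
toy = pd.DataFrame({
    'FromAA': ['A', 'R', 'G', 'K'],
    'ToAA':   ['V', '*', 'D', '*'],
})

# Combine both conditions into a single boolean mask with `&`.
mask = (toy['FromAA'] != '*') & (toy['ToAA'] != '*')
toy_clean = toy[mask].reset_index(drop=True)

print(mask.sum(), 'rows kept out of', len(toy))
print(toy_clean)
```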


I encode each mutation as a 40-element vector: the one-hot encodings of `FromAA` and `ToAA`, concatenated. The resulting array has one row per mutation.

In [46]:
def seq_to_one_hot(sequence):
    # the aa order is the same used in psiblast pssm
    aa_tuple = tuple('ARNDCQEGHILKMFPSTWYV')
    growing_arr = []
    for char in sequence:
        # one row per residue, with a single 1 at the index of the matching amino acid
        curr_row = [1 if char == aa else 0 for aa in aa_tuple]
        growing_arr.append(curr_row)
    one_hot_vec = np.array(growing_arr)
    # every row must contain exactly one 1 (fails on characters outside the alphabet)
    assert (one_hot_vec.sum(axis=1) == 1).all()
    assert one_hot_vec.sum() == len(sequence)
    # only 0 and 1 are allowed as values
    assert np.isin(one_hot_vec, (0, 1)).all()
    return one_hot_vec

from_aa_vec = seq_to_one_hot(df_clean['FromAA'])
to_aa_vec = seq_to_one_hot(df_clean['ToAA'])
mutations_two_hot = np.concatenate((from_aa_vec, to_aa_vec), axis=1)
np.save(out_path + 'mutations_two_hot.npy', mutations_two_hot)
mutations_two_hot.shape

(45352, 40)
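
As a sanity check on the layout, a two-hot row can be decoded back into its amino-acid pair. This is a minimal sketch: `two_hot_to_pair` is a hypothetical helper that is not part of the notebook, and the example row is built by hand. It assumes the first 20 columns encode `FromAA` and the last 20 encode `ToAA`, in the same order used by `seq_to_one_hot`.

```python
import numpy as np

AA_ORDER = 'ARNDCQEGHILKMFPSTWYV'  # same order as in seq_to_one_hot

def two_hot_to_pair(row):
    """Hypothetical helper: map one 40-element row back to (FromAA, ToAA)."""
    row = np.asarray(row)
    from_idx = int(np.argmax(row[:20]))  # columns 0-19 encode FromAA
    to_idx = int(np.argmax(row[20:]))    # columns 20-39 encode ToAA
    return AA_ORDER[from_idx], AA_ORDER[to_idx]

# Build the two-hot row for the made-up mutation A -> V by hand.
example = np.zeros(40, dtype=int)
example[AA_ORDER.index('A')] = 1
example[20 + AA_ORDER.index('V')] = 1

print(two_hot_to_pair(example))
```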

I also save the position of each mutation along the sequence in an array (0-indexed).

In [63]:
positions_vec = np.array(df_clean['Position(1-indexed)']).reshape(-1,1).astype(int)
# the dataset was 1-indexed and now I make it 0-indexed
# compare element-wise first, then reduce with .all()
assert (positions_vec >= 1).all()
positions_vec = positions_vec - 1
assert (positions_vec >= 0).all()
np.save(out_path + 'position_vec.npy', positions_vec)
positions_vec.shape

(45352, 1)
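
The reason for shifting to 0-indexed positions is that they can then index a Python string or NumPy array directly. The sketch below illustrates this with a made-up sequence and made-up mutations; none of these values come from the dataset, and the notebook itself does not load any sequences.

```python
import numpy as np

# Made-up protein sequence and mutations, for illustration only.
sequence = 'MKTAYIAK'
from_aa = np.array(['M', 'T', 'K'])
positions_1_indexed = np.array([1, 3, 8]).reshape(-1, 1)

# Element-wise check before shifting.
assert (positions_1_indexed >= 1).all()
positions_0_indexed = positions_1_indexed - 1

# After the shift each position can index the Python string directly.
residues = np.array([sequence[p] for p in positions_0_indexed.ravel()])
print((residues == from_aa).all())
```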

I save an array of strings containing the source dataset of each mutation. In each entry I replace the `/` character with `-` so that it matches the corresponding filename.

In [68]:
# regex=False makes the replacement a plain string substitution
df_clean['DatasetId_clean'] = df_clean['DatasetId'].str.replace('/', '-', regex=False)
dataset_vec = np.array(df_clean['DatasetId_clean']).reshape(-1,1).astype(str)
np.save(out_path + 'dataset_vec.npy', dataset_vec)
dataset_vec.shape

(45352, 1)
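
A string array saved this way can be loaded back and used to select the rows of a single dataset. The sketch below uses made-up identifiers and a temporary directory instead of `out_path`; `np.char.replace` stands in for the pandas `str.replace` call above.

```python
import os
import tempfile
import numpy as np

# Made-up dataset identifiers; the real ones come from the `DatasetId` column.
raw_ids = np.array(['lab/assayA', 'lab/assayA', 'other/assayB'])
clean_ids = np.char.replace(raw_ids, '/', '-').reshape(-1, 1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dataset_vec.npy')
    np.save(path, clean_ids)
    loaded = np.load(path)  # fixed-width unicode dtype, so no pickling is needed

# Boolean mask selecting the rows that belong to one dataset.
mask = (loaded == 'lab-assayA').ravel()
print(loaded.dtype.kind, mask)
```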

Finally, I extract the target value: the normalized score of each mutation.

In [73]:
score_vec = np.array(df_clean['NormalizedScore']).reshape(-1,1)
np.save(out_path + 'norm_score.npy', score_vec)
score_vec.shape

(45352, 1)
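
Since the four arrays are saved separately, it is worth checking that they stay aligned row by row when loaded for training. The sketch below reuses the file names from this notebook but writes small toy arrays to a temporary directory, so the values and the 3-row size are assumptions for illustration only.

```python
import os
import tempfile
import numpy as np

names = ['mutations_two_hot', 'position_vec', 'dataset_vec', 'norm_score']

with tempfile.TemporaryDirectory() as tmp:
    # Toy stand-ins with 3 rows each, written under the notebook's file names.
    np.save(os.path.join(tmp, 'mutations_two_hot.npy'), np.zeros((3, 40), dtype=int))
    np.save(os.path.join(tmp, 'position_vec.npy'), np.array([[0], [4], [9]]))
    np.save(os.path.join(tmp, 'dataset_vec.npy'), np.array([['a'], ['a'], ['b']]))
    np.save(os.path.join(tmp, 'norm_score.npy'), np.array([[0.1], [0.5], [0.9]]))

    arrays = {n: np.load(os.path.join(tmp, n + '.npy')) for n in names}

# Every array must describe the same mutations, row by row.
n_rows = {a.shape[0] for a in arrays.values()}
assert len(n_rows) == 1, 'arrays have different numbers of rows'
print({n: a.shape for n, a in arrays.items()})
```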