# Deep learning training dataset creation Gray2018

My general approach is to give the raw MSA as input to the model and let it extract its own features. Here I compile the data that I will then use for deep learning.

In [45]:
import pandas as pd; pd.set_option('display.max_columns', None)
import requests
import numpy as np
import joblib
from Bio import SeqIO, AlignIO

## Retrieving input sequences

I exclude the studies that were excluded in the Gray2018 paper.

In [7]:
df = pd.read_csv('../dataset/gray2018/dmsTraining_2017-02-20.csv')

# some of the studies in the training set were excluded from training in the original paper
excluded_studies = ['Brca1_E3', 'Brca1_Y2H', 'E3_ligase']
df = df[~df['dms_id'].isin(excluded_studies)]

I fetch the WT sequences in the training set from UniProt and write each one to its own FASTA file. I also check that the declared length agrees with the UniProt length.

In [46]:
uniprot_uri = 'https://www.uniprot.org/uniprot/'
out_sequences_path = '../processing/gray2018/sequences/'

uniprot_id_set = set(df['uniprot_id'])
for uniprot_id in uniprot_id_set:
    current_fasta_url = uniprot_uri + uniprot_id + '.fasta'
    r = requests.get(current_fasta_url)
    r.raise_for_status()  # stop here rather than saving an error page as FASTA
    fasta_sequence = r.text
    fasta_file = out_sequences_path + uniprot_id + '.fasta'
    with open(fasta_file, 'w') as handle:
        handle.write(fasta_sequence)

    # this is just a consistency check
    record = next(SeqIO.parse(fasta_file, "fasta"))
    declared_protein_len = set(df[df['uniprot_id'] == uniprot_id]['protein_size'])
    assert len(declared_protein_len) == 1
    assert declared_protein_len.pop() == len(record.seq)
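
The length check itself can be tried without network access or Biopython. The following is only a minimal sketch: `fasta_sequence_length` is a hypothetical helper, and the toy record is made up rather than taken from UniProt.

```python
def fasta_sequence_length(fasta_text):
    """Return the length of the first sequence in a FASTA-formatted string."""
    length = 0
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            if seen_header:
                break  # a second record starts here; only the first is needed
            seen_header = True
        elif seen_header:
            length += len(line)
    return length


# toy record (invented for illustration, not a real UniProt entry)
toy_fasta = ">sp|X00000|TOY_EXAMPLE toy protein\nMKTAYIAKQR\nQISFVK\n"
toy_length = fasta_sequence_length(toy_fasta)
print(toy_length)
```

The returned value is what would be compared against `protein_size` in the assertion above.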

## MSA preparation

I decided to use HHblits to build the MSA for each query sequence. As a database I use the pre-formatted `uniclust30`. I run 3 iterations with a maximum E-value of 0.001. I took these parameters from the rawMSA paper.
I use `-all` to disable the MSA filters, since I will apply them later. The original rawMSA publication also passes the filter `-diff inf`, but this is not needed: for values of 0 or NaN the filter is inactive, and 0 is the default. `-diff n` retains only the n most diverse sequences in the alignment, chosen so that each alignment block of 50 positions has at least n sequences covering it.

```
hhblits -i <input_seq> -o <result_file> -d <database> -oa3m <out_msa_file> -n 3 -e 0.001 -all
```
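
To run the same command for every protein from the notebook, the call can be assembled as an argument list. This is a sketch under assumptions: `build_hhblits_command` is a hypothetical helper, and the file and database names in the example are placeholders, not the paths I actually used. Only the flags shown in the command above are included.

```python
import shlex


def build_hhblits_command(input_seq, result_file, database, out_msa_file,
                          iterations=3, max_evalue=0.001):
    """Assemble the hhblits call shown above as an argument list."""
    return [
        'hhblits',
        '-i', input_seq,
        '-o', result_file,
        '-d', database,
        '-oa3m', out_msa_file,
        '-n', str(iterations),
        '-e', str(max_evalue),
        '-all',
    ]


# placeholder file names, for illustration only
cmd = build_hhblits_command('P62593.fasta', 'P62593.hhr', 'uniclust30_db', 'P62593.a3m')
print(shlex.join(cmd))
# to actually launch the search: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.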

I filter for sequences with at least 50% coverage of the query and at most 99% pairwise identity.
The coverage and identity filters are applied afterwards to the resulting MSA using `hhfilter`, so that I can try different parameters without rerunning hhblits.

```
hhfilter -i <in_msa_file> -o <out_msa_file> -id 99 -cov 50
```
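
Since the filtering is decoupled from the search, trying several thresholds only means generating several `hhfilter` calls on the same unfiltered MSA. A possible sketch follows; `hhfilter_commands`, the output naming scheme and the extra threshold values (90, 75) are assumptions made for illustration, not settings I have tested.

```python
from itertools import product


def hhfilter_commands(in_msa_file, out_prefix, max_identities, min_coverages):
    """Build one hhfilter argument list per (identity, coverage) combination."""
    commands = []
    for max_id, min_cov in product(max_identities, min_coverages):
        # hypothetical naming scheme that records the thresholds in the file name
        out_msa_file = f'{out_prefix}_id{max_id}_cov{min_cov}.a3m'
        commands.append(['hhfilter', '-i', in_msa_file, '-o', out_msa_file,
                         '-id', str(max_id), '-cov', str(min_cov)])
    return commands


sweep = hhfilter_commands('P62593.a3m', 'P62593',
                          max_identities=[90, 99], min_coverages=[50, 75])
for command in sweep:
    print(' '.join(command))
```

Each list can then be passed to `subprocess.run(command, check=True)`.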

I can reformat the a3m alignments to FASTA using a script in the HH-suite. In the rawMSA paper the authors also remove the insertions relative to the query by filtering out the lowercase letters in the a3m file. I do the same with the HH-suite reformat script by specifying the `-r` parameter.

```
reformat.pl -r a3m fas <infile> <outfile>
```
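
To make explicit what removing the insertions means, here is a small Python sketch of the lowercase filtering on a single a3m row. `remove_insertions` is a hypothetical helper and the sequences are toy examples; it is not a replacement for `reformat.pl`.

```python
def remove_insertions(a3m_sequence):
    """Drop lowercase residues (insertions relative to the query) and '.' placeholders."""
    return ''.join(c for c in a3m_sequence if not c.islower() and c != '.')


# toy example: the hit has two inserted residues ('aq') and one deletion ('-')
query = 'MKTAYI'
hit = 'MK-AaqYI'
hit_no_insert = remove_insertions(hit)
print(hit_no_insert)
```

After this step every row has the same length as the query, which is what allows the alignment to be stored as a rectangular array.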

Now that I have the MSA in FASTA format, I need to convert it to an integer representation for the Keras embedding layer that I will apply to it. I save the output as a numpy array. I do not filter by depth here; that will be done later on the numpy array at training time.

## MSA visualization

I use AliView to visualize the MSAs, but there are too many sequences to view comfortably.
I create 2 different sets for viewing. One is a set of 30 sequences representative of the diversity in the MSA.

```
hhfilter -i <in_msa_file> -o <out_msa_file> -diff 30
```

The other is the top 1000 sequences, which are supposed to be the most similar to the query and are the ones that I will actually end up using in the prediction. I use the seqkit tool.

```
seqkit head -n 1000 <in_msa_file> > <out_msa_file>
```
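
The same selection can be done from inside the notebook if seqkit is not at hand. This is a rough equivalent in plain Python, not what I ran: `iter_fasta_records` and `head_records` are hypothetical helpers, and the alignment is a toy one.

```python
from itertools import islice


def iter_fasta_records(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, ''.join(chunks)


def head_records(lines, n):
    """Return the first n records, keeping the order of the file."""
    return list(islice(iter_fasta_records(lines), n))


# toy alignment; with a real file, pass the open file handle instead of a list
toy_msa = ['>query\n', 'MKT-\n', '>hit1\n', 'MKTA\n', '>hit2\n', 'M-TA\n']
top_two = head_records(toy_msa, 2)
print(top_two)
```

Because the records are read lazily, only the first `n` entries of a large MSA are parsed.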

These edited alignments can be visualised with AliView.

## Vectorizing the MSA

I convert the full MSA to an integer matrix in which each symbol is mapped to an integer. I do so since I will use an embedding layer in the network.

In [4]:
msa_path = '../processing/gray2018/hhblits_msa_filtered_noinsert/'
out_vec_path = '../processing/gray2018/msa_vectors/'
possible_chars = 'ARNDCQEGHILKMFPSTWYV-XBZU'

# read the protein ids up front so that the list file is closed right away
with open('../processing/gray2018/input_list.txt') as id_list:
    proteins = [line.rstrip() for line in id_list]

for protein in proteins:
    msa_file = msa_path + protein + '.fasta'
    msa = AlignIO.read(msa_file, 'fasta')
    out_vec = []
    print('Processing file:', msa_file)
    for i, record in enumerate(msa):
        if i == 0:
            # the first sequence of the MSA must be the query itself
            first_seq_id = record.id.split('|')[1]
            assert first_seq_id == protein
        sequence = str(record.seq).upper()
        for char in set(sequence):
            assert char in possible_chars
        # I add 1 since 0 is reserved for padding in the embedding input
        seq_mapped = [possible_chars.index(char) + 1 for char in sequence]
        out_vec.append(seq_mapped)
    out_vec = np.array(out_vec)
    assert out_vec.shape == (len(msa), msa.get_alignment_length())
    joblib.dump(out_vec, out_vec_path + protein + '.npy.joblib.xz', compress=True)

Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P00552.fasta
Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P02829.fasta
Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P04147.fasta
Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P06654.fasta
Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P0CG48.fasta
Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P0CG63.fasta
Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P31016.fasta
Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P46937.fasta
Processing file: ../processing/gray2018/hhblits_msa_filtered_noinsert/P62593.fasta
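
As a quick illustration of the encoding and of why 0 is kept free, here is a toy example in plain Python. The alphabet is the one used above; `encode_sequence` and `crop_or_pad_depth` are hypothetical helpers, and the way the depth is fixed here (keep the first rows, pad with all-zero rows) is only an assumption about how the depth filtering could look at training time.

```python
possible_chars = 'ARNDCQEGHILKMFPSTWYV-XBZU'
# index 0 is left unused so that it can mark padding
char_to_index = {char: i + 1 for i, char in enumerate(possible_chars)}


def encode_sequence(sequence):
    """Map each alignment symbol to its integer code (1-based)."""
    return [char_to_index[char] for char in sequence.upper()]


def crop_or_pad_depth(encoded_msa, depth):
    """Keep the first `depth` rows; add all-zero rows if the MSA is shallower."""
    width = len(encoded_msa[0]) if encoded_msa else 0
    rows = [list(row) for row in encoded_msa[:depth]]
    missing = depth - len(rows)
    rows.extend([0] * width for _ in range(missing))
    return rows


toy_msa = [encode_sequence(seq) for seq in ['ARN-', 'ar-d']]
fixed_depth = crop_or_pad_depth(toy_msa, 3)
print(fixed_depth)
```

With real data the same cropping would be applied to the numpy arrays saved above, for instance by slicing the first rows of each array.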


I could now process sliding windows for each mutation, but first I want to have a look at the literature.