UPDATED SPLITS WITH SPECIES: Preprocess UniProt data (limited to signal peptides (SPs) that are experimentally verified and SPs verified by sequence analysis) and split into train/val/test with tokens for the species.

First, go through and split the sequences into the signal peptide and the remainder of the sequence. 
Discard sequences where the signal peptide does not start at the first position. Then, discard sequences
where the signal peptide is not between 10 and 70 amino acids, inclusive. Also discard sequences where 
the remaining sequence is not strictly longer than the signal peptide. 
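
The filtering rules above can be sketched as a small helper. This is a minimal sketch, not the code that built the dataset: the function name `split_and_filter` and the assumption that the signal peptide is annotated as a 1-based, inclusive `(sp_start, sp_end)` range are illustrative only.

```python
def split_and_filter(sequence, sp_start, sp_end, min_sp=10, max_sp=70):
    """Return (signal_peptide, remainder), or None if the entry is discarded.

    Hypothetical helper: sp_start and sp_end are assumed to be 1-based,
    inclusive positions of the annotated signal peptide.
    """
    # The signal peptide must start at the first position.
    if sp_start != 1:
        return None
    signal_peptide = sequence[:sp_end]
    remainder = sequence[sp_end:]
    # The signal peptide must be 10 to 70 amino acids long, inclusive.
    if not (min_sp <= len(signal_peptide) <= max_sp):
        return None
    # The remainder must be strictly longer than the signal peptide.
    if len(remainder) <= len(signal_peptide):
        return None
    return signal_peptide, remainder
```

Entries for which the helper returns `None` would simply be skipped when building the list of signal peptide/protein pairs.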

In one training dataset, keep the first 100 amino acids of the mature protein. In another training dataset, keep the first 95, 100, and 105 amino acids of each mature protein as separate examples, which varies the length of the protein sequences. This way, we get "more" training data, with up to three examples for each original pair.

Remove examples where the SP is the same and the protein sequences have a similarity greater than 0.5.
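
The similarity filter could look roughly like the sketch below. The notebook does not say how similarity was measured, so the use of `difflib.SequenceMatcher.ratio()` here is an assumption, a stand-in for whatever metric was actually used, and `drop_near_duplicates` is a hypothetical name.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def drop_near_duplicates(pairs, threshold=0.5):
    """Keep a (sp, prot) pair only if no already-kept pair with the same
    signal peptide has a protein similarity above the threshold.

    Assumption: similarity is approximated with SequenceMatcher.ratio();
    the original metric is not stated in the notebook.
    """
    kept_by_sp = defaultdict(list)  # signal peptide -> proteins kept so far
    kept = []
    for sp, prot in pairs:
        too_similar = any(
            SequenceMatcher(None, prot, other, autojunk=False).ratio() > threshold
            for other in kept_by_sp[sp]
        )
        if not too_similar:
            kept_by_sp[sp].append(prot)
            kept.append((sp, prot))
    return kept
```

Because the comparison is greedy, the result depends on the order of `pairs`: the first example of each near-duplicate group is the one that survives.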

For each example, also save the organism. All organisms with fewer than 5 examples are lumped together as token 0: 'AAUnknown'. There are a total of 754 species tokens.
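
One way to build such a species vocabulary is sketched below. The function name `build_species_tokens` is hypothetical, and numbering the remaining organisms alphabetically is an assumption; the text above only fixes token 0 as 'AAUnknown' and the threshold of 5 examples.

```python
from collections import Counter


def build_species_tokens(organisms, min_count=5, unknown="AAUnknown"):
    """Map each organism to an integer token.

    Organisms with fewer than min_count examples share token 0 (unknown).
    Assumption: the remaining organisms are numbered alphabetically from 1.
    """
    counts = Counter(organisms)
    frequent = sorted(org for org, n in counts.items() if n >= min_count)
    vocab = {unknown: 0}
    for token, org in enumerate(frequent, start=1):
        vocab[org] = token
    # Rare organisms are absent from vocab and fall back to token 0.
    tokens = [vocab.get(org, 0) for org in organisms]
    return vocab, tokens
```

The returned `tokens` list is aligned with the input, so it can be zipped with the signal peptide/protein pairs to form triplets.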

There are a total of 32263 examples. 

Finally, shuffle the signal peptide/mature protein pairs and set aside 20% each as test and validation sets. The split is 19359/6452/6452. 

Minimal SP Length: 10 AA
Maximal SP Length: 70 AA
 
Defaults from SignalP http://www.cbs.dtu.dk/services/SignalP/instructions.php#limits
 
Minimal Protein Length: Longer than Signal Peptide
Maximal Protein Length: truncated to 70 -> according to SignalP's SI (below)
https://images.nature.com/full/nature-assets/nmeth/journal/v8/n10/extref/nmeth.1701-S1.pdf

In [None]:
%matplotlib inline
import pickle
import random
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

import csv

In [2]:
# read in the protein sequences of the dataset from a tab-delimited file
# (a set keeps the membership test `pr in dataset_75` fast)
dataset_75 = set()

filename = "dataset_2.csv"
with open(filename, "r") as f:
    reader = csv.reader(f, delimiter="\t")
    for line in reader:  # each row; the first column holds the protein sequence
        dataset_75.add(line[0])

In [3]:
# load in prot sequences from dataset
df = pd.read_excel('../dataset.xls')
si_c = df['Signal peptides'].values
pr_c = df['Prot Sequences'].values

In [4]:
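# keep (signal peptide, protein) pairs of all lengths whose protein is in dataset_75
# (the list is named `triplets`, but each entry is a pair without species)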
triplets = []

for si, pr in zip(si_c, pr_c):
    if pr in dataset_75:
        triplets.append((si, pr))

In [5]:
# Remove exact duplicates
triplets = list(set(triplets))
len(triplets)

22115

In [6]:
random.seed(a=1)
random.shuffle(triplets)

In [7]:
# hold out 20% each for test and validation; the remaining 60% is training data
L = len(triplets) // 5
test = triplets[-L:]
val = triplets[-2 * L:-L]
train = triplets[:-2 * L]
len(train), len(val), len(test)
# check tokens
# check similarity

(13269, 4423, 4423)
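
The cells above shuffle the output of `list(set(...))`. In Python 3 the iteration order of a set of strings can differ between interpreter runs unless `PYTHONHASHSEED` is fixed, so seeding `random` alone may not reproduce the same split. The sketch below shows one way to make the split deterministic by sorting before shuffling. It is an illustration under that assumption, not the procedure used to produce the numbers above, and `shuffle_and_split` is a hypothetical name.

```python
import random


def shuffle_and_split(pairs, seed=1):
    """De-duplicate, shuffle reproducibly, and split 60/20/20.

    Sorting first gives a fixed starting order, so the same seed always
    yields the same train/val/test split.
    """
    ordered = sorted(set(pairs))
    rng = random.Random(seed)  # local generator; global state is untouched
    rng.shuffle(ordered)
    n_holdout = len(ordered) // 5
    if n_holdout == 0:
        # Too few examples to hold anything out.
        return ordered, [], []
    test = ordered[-n_holdout:]
    val = ordered[-2 * n_holdout:-n_holdout]
    train = ordered[:-2 * n_holdout]
    return train, val, test
```

Sorting requires every entry to be comparable, so this assumes all sequences are strings (no missing values read from the spreadsheet).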

In [8]:
# Truncate prot seqs to at most 100 aa in train, val, and test
# (sequences shorter than 100 aa keep their original length)
# train_len is the training set with the fixed 100 aa cutoff

train_len = [(si, pr[:100]) for si, pr in train]
val_len = [(si, pr[:100]) for si, pr in val]
test_len = [(si, pr[:100]) for si, pr in test]

val = val_len
test = test_len

In [9]:
# number of training pairs whose prot seq is shorter than 100 aa
length = sum(1 for si, pr in train if len(pr) < 100)

length

118

In [10]:
with open('../../6-11_data/train_75.pkl', 'wb') as f:
    pickle.dump(train_len, f)
with open('../../6-11_data/validate_75.pkl', 'wb') as f:
    pickle.dump(val, f)
with open('../../6-11_data/test_75.pkl', 'wb') as f:
    pickle.dump(test, f)

In [11]:
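# Return the elements that occur more than once in a list
# (despite its name, this does not return the de-duplicated list)
def Remove(duplicate):
    final_list = []  # first occurrence of each element
    dup = []         # every repeated occurrence
    for num in duplicate:
        if num not in final_list:
            final_list.append(num)
        else:
            dup.append(num)
    return dup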

In [12]:
# vary prot length in training set
maximum = 105
train_vlen = []   # shortest variant of each protein
train_vlen1 = []  # middle variant
train_vlen2 = []  # longest variant

# For proteins of at least 105 aa, keep the first 95, 100, and 105 aa.
# For shorter proteins, keep len - 10, len - 5, and the full length
# (assumes len(pr) > 10, which holds when the protein is longer than its signal peptide).
for si, pr in train:
    if len(pr) < maximum:
        train_vlen.append((si, pr[:len(pr) - 10]))
        train_vlen1.append((si, pr[:len(pr) - 5]))
        train_vlen2.append((si, pr))
    else:
        train_vlen.append((si, pr[:95]))
        train_vlen1.append((si, pr[:100]))
        train_vlen2.append((si, pr[:105]))

print(len(train_vlen))
print(len(train_vlen1))
print(len(train_vlen2))

train = train_vlen + train_vlen1 + train_vlen2

print(len(train))
# Remove exact duplicates
train = list(set(train))
print(len(train))

13269
13269
13269
39807
39746


In [13]:
# dump the augmented training set of (sp, prot) pairs, with varied prot seq lengths of
# 95, 100, and 105, to train_augmented_75.pkl (no species information is stored)

with open('../../6-11_data/train_augmented_75.pkl', 'wb') as f:
    pickle.dump(train, f)

In [14]:
with open('../../6-11_data/train_75.pkl', 'rb') as f:
    t = pickle.load(f)
t[0]

('MKKLISNDVTPEEIFYQRRKIIKAFGLSAVATALPTFSFA',
 'QESSDLKALEYKKSTESTLILTPENKVTGYNNFYEFGVDKGSPAHYAKNFQVNPWKLDIGGEVENPFTLNYDQLFTQFPLEERIYRFRCVEAWAMVVPWI')

In [15]:
# train already holds (sp, prot) pairs without species, so this comprehension only copies the list
# the augmented training set (varied lengths of 95, 100, and 105) is written to train_augmented_75.pkl again

train_nosp = [(si, pr) for si, pr in train]

with open('../../6-11_data/train_augmented_75.pkl', 'wb') as f:
    pickle.dump(train_nosp, f)

In [16]:
train = [('MRLSTAQLIAIAYYMLSIGATVPQVDG', 'QGETEEALIQKRSYDYYQEPCDDYPQQQQQQEPCDYPQQQQQEEPCDYPQQQPQEPCDYPQQPQEPCDYPQQPQEPCDYPQQPQEPCDNPPQPDV', 121), ('MLTPRVLRALGWTGLFFLLLSPSNVLG', 'ASLSRDLETPPFLSFDPSNISINGAPLTEVPHAPSTESVSTNSESTNEHTITETTGKNAYIHNNASTDKQNANDTHKTPNILCDTEEVFVFLNET', 260)]
# scratch check: build the three length variants for the two example triplets above
lst = []

for si, pr, sp in train:
    lst.append(pr[:len(pr) - 10])
    lst.append(pr[:len(pr) - 5])
    lst.append(pr)

print(len(lst))
# Remove exact duplicates
lst = list(set(lst))
print(len(lst))

6
6
