### Script snippet to manipulate the dataset

The PSSM files generated by PSI-BLAST were parsed to retrieve the relative frequencies. The output matrices were saved as numpy arrays for efficient storage. The whole training set of 1832414 protein sequences was then split randomly into batches of 10k proteins each. Each batch is stored in a dictionary object, where the key is the Uniref50 header of the sequence and the value is the corresponding numpy array of relative frequencies. Each batch was then serialized to a byte stream and saved to disk.
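The batching and serialization step can be sketched roughly as follows. This is a minimal illustration, not the script that produced the dataset: `split_into_batches` and `save_batches` are hypothetical helpers, the toy headers are made up, and plain nested lists stand in for the numpy arrays. The `<batch_name>_labels.data` file name follows the pattern used in the commented-out loading code further down.

```python
import pickle
import random
import tempfile
from pathlib import Path


def split_into_batches(pssm_by_header, batch_size=10_000, seed=0):
    """Shuffle the headers and group them into batches of at most batch_size."""
    headers = list(pssm_by_header)
    random.Random(seed).shuffle(headers)
    batches = {}
    for start in range(0, len(headers), batch_size):
        name = f"batch_{start // batch_size + 1}"
        chunk = headers[start:start + batch_size]
        batches[name] = {h: pssm_by_header[h] for h in chunk}
    return batches


def save_batches(batches, out_dir):
    """Pickle every batch to <out_dir>/<batch_name>_labels.data."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, batch in batches.items():
        with open(out_dir / f"{name}_labels.data", "wb") as fh:
            pickle.dump(batch, fh, protocol=pickle.HIGHEST_PROTOCOL)


# Toy data: 25 made-up headers, each with a tiny stand-in matrix.
toy = {f"UniRef50_TOY{i}": [[0.05] * 20] for i in range(25)}
toy_batches = split_into_batches(toy, batch_size=10)

with tempfile.TemporaryDirectory() as tmp:
    save_batches(toy_batches, tmp)
    with open(Path(tmp) / "batch_1_labels.data", "rb") as fh:
        reloaded = pickle.load(fh)

print({name: len(batch) for name, batch in toy_batches.items()})
# {'batch_1': 10, 'batch_2': 10, 'batch_3': 5}
```

With this scheme the last batch is smaller whenever the total is not a multiple of the batch size; 1832414 sequences at 10k per batch would give 184 batches.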

The final set of input sequences was also stored as a dictionary, where the key is the batch name and the value is a sub-dictionary with two entries: the key "Headers" maps to a list of 10k Uniref50 headers, and the key "Sequences" maps to the list of protein sequences in the same order as the headers. In other words, for a header at position 50, the corresponding sequence is at the same index, 50, in the list of sequences.

The same structure is applied to the validation and test sets, though they contain only 879 protein sequences each.
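To make the layout concrete, here is a small sketch of how headers and sequences pair up by index. The toy dictionary and the helper `iter_records` are illustrative assumptions; only the "Headers" and "Sequences" keys come from the description above.

```python
# Toy stand-in for the sequence dictionary; headers and sequences are made up.
toy_set = {
    "batch_1": {
        "Headers": ["UniRef50_TOY1", "UniRef50_TOY2"],
        "Sequences": ["MKV", "MAEI"],
    },
}


def iter_records(dataset):
    """Yield (batch_name, header, sequence), pairing the two lists by index."""
    for batch_name, batch in dataset.items():
        headers, sequences = batch["Headers"], batch["Sequences"]
        if len(headers) != len(sequences):
            raise ValueError(f"{batch_name}: header/sequence count mismatch")
        for header, sequence in zip(headers, sequences):
            yield batch_name, header, sequence


records = list(iter_records(toy_set))
print(records[1])
# ('batch_1', 'UniRef50_TOY2', 'MAEI')
```

The length check is a cheap guard against the two lists drifting out of step, since the pairing relies purely on position.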

#### 1) Import pickle to deserialize and load the data

In [23]:
import pickle

#### 2) Load input sequences of the training, test, and validation sets

In [24]:
with open('Proteins/train_set/train_set_sequences/training_set_seqs.data', 'rb') as filehandle:
    # read the data as binary data stream
    training_data = pickle.load(filehandle)

with open('Proteins/test_set/test_set_sequences/test_set_seqs.data', 'rb') as filehandle:
    # read the data as binary data stream
    test_data = pickle.load(filehandle)

with open('Proteins/val_set/val_set_sequences/val_set_seqs.data', 'rb') as filehandle:
    # read the data as binary data stream
    val_data = pickle.load(filehandle)

#### 3) Example displaying the header, sequence, and corresponding labels (i.e. the relative-frequency matrix stored as a numpy array) for the first batch of 10k proteins

In [25]:
for counter, (batch_name, batch_data) in enumerate(training_data.items(),1):
    if counter > 1:
        break
    #with open('./Proteins/train_set/train_set_labels/'+batch_name+'_labels.data', 'rb') as filehandle2:
        # read the data as binary data stream
        #training_labels = pickle.load(filehandle2)
    for i, seq in enumerate(training_data[batch_name]['Sequences']):
        if i>10:
            break
        #print("> "+training_data[batch_name]['Headers'][i])
        print(seq)
        #print(training_labels[training_data[batch_name]['Headers'][i]])

MAEIAFISRILMFCSVLVMLLCSVGLVVWGAFMTDSSYDRKLKDVVFNYNSSEPLAGDKSNMRSWYQCCGVRSGGWTCPNNTPCDTAVFNSVDNAMMIAGIIMIPLLLLQFFIIGFSALVLRIEPRVVKERKRGEEVTNTDETWSS
MSASTTKKKLNINSQPFNIEQPEKQPHHSLVHTSKPYQPKNEIQEYFSQKSYTPIDELAPDFNYHKSTQHQQEPAQVVNQNQGPKEKKKLRKTHYEFQVDPNLAEKIDQKSKEELELQKLEEQRKIEEEKLKLEQEELRRQEELIKLEELRKQEEENYQPRPYTKEQLESMFANMNDFQYFESDINWIKIKDRKHFDYKEFKRYEQRKDSKIPTQKKDTKTEIQLDRQPVQQNEPPQKFPKRAQISIQEQLWKDKLKQEAYAWSQILDTTPEQKKAIEKDLKYRLNQLAPDNEGQVSKYIIETIEQNQDFIDFLTEKIIEKAQIEPKYRGLYINLCSKLATLPSLQLKVQDKNGKTKQQSKFKASLLNRVQKMFNSRKTMDYDTSKLKQEEKIQFHMIRKRKIMGNVRFIGGLYLTSLLPIQALSSVLSELMGEYLIGYVPKEDTADESLEGLIELIDQIGQSFNQNERIVDEAFINKIGQILNGQKKSLDQIEGLFRDIKKQTVLDLMQKVFEKLLENYHFSQRIRLLIENLLDKIKKGWQGSYAKKEELAESAQKNNEQEEDDPLELQLREARQNSAKKENSSNELDQKIRAKDIAMFVSSSFQFDPVDQMDKLRGQLKYIKLEAPELMIVFFETFFQNMLASKQVKQNIERAQLVFDLLVMEEADEEVVSKALQYILDEDLVYNISESKTSWKLITNILSYFILEQNWNALKKLYLKPLLDIEEFNEFLIRVGNQLIIQIPDQSGRELIVKYFNMLKQDLSTQLNFD
MGACCSQPSEKYEVQGQSSKPKPATGSGAKAAAKQPATPDFGLQTTHEVIKLLGRGGEGETWLCRDKETNAEVAIKLIKRPI
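The label loading is commented out in the cell above. A sketch of how the labels of one batch could be loaded and matched to its sequences is given below. It assumes the label file is a pickled dictionary keyed by header, as described at the top of this section; `load_batch_labels` and `first_examples` are hypothetical helpers, and the demo uses made-up toy data written to a temporary directory rather than the real files.

```python
import pickle
import tempfile
from pathlib import Path


def load_batch_labels(labels_dir, batch_name):
    """Unpickle <labels_dir>/<batch_name>_labels.data into a header -> matrix dict."""
    with open(Path(labels_dir) / f"{batch_name}_labels.data", "rb") as fh:
        return pickle.load(fh)


def first_examples(batch, labels, limit=3):
    """Return up to `limit` (header, sequence, matrix) triples for one batch."""
    triples = []
    for header, seq in zip(batch["Headers"], batch["Sequences"]):
        if len(triples) == limit:
            break
        triples.append((header, seq, labels[header]))
    return triples


# Toy demo: made-up headers, with nested lists standing in for numpy arrays.
toy_batch = {"Headers": ["UniRef50_TOY1", "UniRef50_TOY2"], "Sequences": ["MK", "MAE"]}
toy_labels = {"UniRef50_TOY1": [[0.1] * 20] * 2, "UniRef50_TOY2": [[0.1] * 20] * 3}

with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "batch_1_labels.data", "wb") as fh:
        pickle.dump(toy_labels, fh)
    labels = load_batch_labels(tmp, "batch_1")

for header, seq, matrix in first_examples(toy_batch, labels):
    print(">", header, seq, len(matrix))
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code while deserializing.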

In [26]:
len(training_data)

184

In [27]:
type(training_data['batch_182']['Sequences'])

list

In [28]:
len(training_data['batch_182']['Sequences'])

10000

In [29]:
for keys,values in training_data.items():
    print(keys)

batch_1
batch_2
batch_3
batch_4
batch_5
batch_6
batch_7
batch_8
batch_9
batch_10
batch_11
batch_12
batch_13
batch_14
batch_15
batch_16
batch_17
batch_18
batch_19
batch_20
batch_21
batch_22
batch_23
batch_24
batch_25
batch_26
batch_27
batch_28
batch_29
batch_30
batch_31
batch_32
batch_33
batch_34
batch_35
batch_36
batch_37
batch_38
batch_39
batch_40
batch_41
batch_42
batch_43
batch_44
batch_45
batch_46
batch_47
batch_48
batch_49
batch_50
batch_51
batch_52
batch_53
batch_54
batch_55
batch_56
batch_57
batch_58
batch_59
batch_60
batch_61
batch_62
batch_63
batch_64
batch_65
batch_66
batch_67
batch_68
batch_69
batch_70
batch_71
batch_72
batch_73
batch_74
batch_75
batch_76
batch_77
batch_78
batch_79
batch_80
batch_81
batch_82
batch_83
batch_84
batch_85
batch_86
batch_87
batch_88
batch_89
batch_90
batch_91
batch_92
batch_93
batch_94
batch_95
batch_96
batch_97
batch_98
batch_99
batch_100
batch_101
batch_102
batch_103
batch_104
batch_105
batch_106
batch_107
batch_108
batch_109
batch_110
batch_11

In [30]:
len(batch_data['Sequences'])

10000

In [31]:
batch_name

'batch_2'

In [32]:
import sys
print(len(training_data[batch_name]['Sequences']))
print(type(training_data[batch_name]['Sequences']))
print("Size of list: " + str(sys.getsizeof(training_data[batch_name]['Sequences'])) + "bytes")

10000
<class 'list'>
Size of list: 90104bytes
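Note that `sys.getsizeof` reports only the memory of the list object itself (essentially its array of references), not the strings it points to, so the figure above understates the real footprint. A rough sketch of a deeper estimate follows; `deep_size_of_str_list` is a hypothetical helper and the toy sequences are made up.

```python
import sys


def deep_size_of_str_list(strings):
    """Bytes used by the list object plus every distinct string it references."""
    seen = set()
    total = sys.getsizeof(strings)
    for s in strings:
        if id(s) not in seen:  # count a shared string object only once
            seen.add(id(s))
            total += sys.getsizeof(s)
    return total


# Toy sequences of growing length, built at run time so each is a distinct object.
toy_sequences = ["M" + "A" * i for i in range(1000)]

shallow = sys.getsizeof(toy_sequences)
deep = deep_size_of_str_list(toy_sequences)
print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

The exact numbers depend on the Python build, so none are quoted here; the point is only that the deep figure is the one that matters when judging memory use.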


In [33]:
combined_train = []
for keys,values in training_data.items():
    combined_train.extend(training_data[keys]['Sequences'])
print(len(combined_train))
print(combined_train[2])

combined_test = []
for keys,values in test_data.items():
    combined_test.extend(test_data[keys]['Sequences'])
print(len(combined_test))
print(combined_test[2])

combined_val = []
for keys,values in val_data.items():
    combined_val.extend(val_data[keys]['Sequences'])
print(len(combined_val))
print(combined_val[2])

1832414
MGACCSQPSEKYEVQGQSSKPKPATGSGAKAAAKQPATPDFGLQTTHEVIKLLGRGGEGETWLCRDKETNAEVAIKLIKRPIPKPAIQVIKREIKIQADLGQGHLNIVSADEVILSKTYLGLIMEYVPGGNMVSYVTKKRETKSERAGLCIDEDEARYFFIQLVSAVEYCHRNNVAHRDLKLDNTLLDNHVPSWLKLCDFGFAKHWQANSNMDTMRIGTPEYMGPELISSRTGYDGKKVDVWAAGVLLYVMLVGMFPFETQDDNFNNTAGLYDIWLQQIKTSWREVPNNTSAASRLTPELKDLLDKMFDVKQESRASIETIKNHPWFKKPLPEPYESSLKELQEEQRSIDEQVSKGAFQSAERDKALEALLDRAVTPSLPSEEVTRLSLSKIKRAYSILKGKGNAGAGMAAVVEED
879
MALRFSRILKAMRRRPRALAAGLALAVAAAGAGGFVARQHAGADGDVRLTSGPLPVVVPDPPAPPKALQDELSALALAYGEEVGIAVTDLDQGWTAGVDPDGVYPQQSVSKLWVAIAVLKAADEGRLELDRTVLLTDADRSVFFQPVAYNIGPQGYATTIEALLRRAIVQSDNAANDRLMREVGGPEAVAEALDGLGLKGVTVGAYERDLQAKVAGLVWRPEYGVGWNFQAAREQLSRAERERALEAYLAEPMDGAQPAAIARALAALKKGELLSPASTDRLLTLMEEVRTGPRRLKGGLPPGWSIGHKTGTGQDYRGASVGINDVGLLTAPDGRIYAVAVMMRRTWKPVPQRLAFMQAVSRAVAAEWARAREPDFTTAD
879
MVQINRVRMKEVKKVAETLFLAFQNEPFFDYLLSPTSWKSAKSLTKFRRDFFDYIAYAFIMSGTVYEIDGFKGVALWAGPGQDPYSTYTVLRSGLWRCVYRVPKEVVCRYTDEYLSKTCDARERLMGKKKHWYLCFLAVRPEHQKRGLSRPLLDEVHRLCDKKKQQIYLECNQSTNRSYYEHLGYS

In [35]:
all_chars = set(''.join(combined_train))
print(all_chars)
print(len(''.join(combined_train)))
print(len(all_chars))

{'R', 'D', 'Y', 'V', 'H', 'C', 'K', 'P', 'X', 'L', 'B', 'E', 'S', 'G', 'Q', 'A', 'W', 'U', 'Z', 'M', 'N', 'T', 'I', 'F'}
680506270
24


In [36]:
remaining_chars= all_chars - set( ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'])
print(remaining_chars)
print(len(remaining_chars))

{'B', 'X', 'Z', 'U'}
4
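The four extra letters are the usual IUPAC one-letter codes for ambiguous or rare residues: B (Asp or Asn), Z (Glu or Gln), X (unknown) and U (selenocysteine). To see how often they occur, rather than just whether they occur, a counting sketch like the one below could be used. `nonstandard_counts` is a hypothetical helper and the toy sequences are made up.

```python
from collections import Counter

# The 20 standard amino acids, same set as in the cell above.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def nonstandard_counts(sequences):
    """Count every residue letter that is not one of the 20 standard amino acids."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch not in STANDARD_AMINO_ACIDS)
    return counts


toy_sequences = ["MKVXB", "MAEUZX", "MGAC"]
counts = nonstandard_counts(toy_sequences)
print(dict(counts))
```

Because it walks the sequences one at a time, this avoids joining the whole set into a single string of several hundred million characters.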


In [37]:
all_chars = set(''.join(combined_val))
print(all_chars)
print(len(''.join(combined_val)))
print(len(all_chars))
remaining_chars= all_chars - set( ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'])
print(remaining_chars)
print(len(remaining_chars))

{'R', 'D', 'Y', 'V', 'H', 'C', 'K', 'P', 'X', 'L', 'E', 'S', 'G', 'Q', 'A', 'W', 'M', 'N', 'T', 'I', 'F'}
319739
21
{'X'}
1


In [38]:
all_chars = set(''.join(combined_test))
print(all_chars)
print(len(''.join(combined_test)))
print(len(all_chars))
remaining_chars= all_chars - set( ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'])
print(remaining_chars)
print(len(remaining_chars))

{'R', 'D', 'Y', 'V', 'H', 'C', 'K', 'P', 'X', 'L', 'E', 'S', 'G', 'Q', 'A', 'W', 'M', 'N', 'T', 'I', 'F'}
330411
21
{'X'}
1


In [39]:
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(combined_train, test_size=0.50)
len(x_train),len(x_test)

(916207, 916207)
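Since no seed is passed, the split above differs on every run (scikit-learn's `train_test_split` accepts a `random_state` argument to fix it). The idea of a seeded 50/50 split can be sketched without any dependency as below. `seeded_split` is a hypothetical helper, not the scikit-learn implementation, and will not reproduce the same partition as `train_test_split`.

```python
import random


def seeded_split(items, test_fraction=0.5, seed=42):
    """Shuffle a copy of `items` with a fixed seed and cut it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


toy_sequences = [f"SEQ{i}" for i in range(10)]
train_part, test_part = seeded_split(toy_sequences)
print(len(train_part), len(test_part))
# 5 5
```

Calling the function again with the same seed returns the same partition, which makes downstream results comparable between runs.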

In [40]:
x_test[:10]

['MSSHKQYRLDLDRIEASLREVQADFDRISEKLLMRREPITDQIIANLLEGYAYVDQLLEDGIEVFSPSGLDHILELNHIVLCGVNDSLRLEYGSHLLETRRRFADGIGAIADYYEAKRHKAAPRRAAGVYVLSVSQPQLFLEGNHRTGALVASYILVQEGLPPFVLTPKNAVAYFNPSTLIKWRDKQKFLDRQYYLKKYRRVLQDFFEETLDKRFRRSLRLDKDVTAKD',
 'MSSVIRFPIILLFLLRLAIIVKAESPEYRYHFCSNTTTFTPNSTYQTNLNQVLSSLSNNANNSIGFFNIPSGQQPDDVYGSFLCRGDVSTDVCQDCVTFATQDIVKRCPIEKVAIVWYDQCLLRYENQSFISTMDQTPGVFLSNTQDISDPDRFDNLLATTMENLATDASSAASGEKKFAATDDNFTAFQKLYSLVQCTPDLSNPGCRQCLRGAISNLPSCCGGKRGANVLYPSCNVRYEVYPFFNATAVEPPPPSPSPAVPPPPPTGSGTRPETEGQSGISTVIIVAIVAPVAIAIVLFSLAYCYLRRRPRKKYDAVQEDGNEITTVESLQIDLNTVEAATNKFSADNKLGEGGFGEVYKGILPNGQEIAVKKLSRSSGQGAQEFKNEVVLLAKLQHRNLVRLLGFCLEGAEKILVYEFVSNKSLDYFLFDPEKQRQLDWSTRYKIVGGIARGILYLHEDSQLRIVHRDLKVSNILLDRNMNPKISDFGTARIFGVDQSQGNTKRIVGTYGYMSPEYAMHGQFSVKSDMYSFGVLILEIICGKKNSSFYEIDGAGDLVSYVWKHWRDGTPMEVMDPVIKDSYSRNEVLRCIQIGLLCVQEDPADRLTMATVVLMLNSFSVTLPVPQQPAFLIHSRSQPTMPMKGLELDKSTPKSMQLSVDQEPITQIYPR',
 'MTDSSLIRPVDGEVPDGYGFYNEDTRNLYETLSKAFTDVDDALVDVTITSPPYADVKDYGYDEELQVGLGDDYEDYLEELRDIYKQTY

In [41]:
import json
import os

save_path = os.path.join(os.getcwd(), 'Proteins/tokenizer_data.json')
with open(save_path, "w") as f:   # serialize to JSON (not pickle)
    json.dump(combined_train, f, indent=2) 

In [20]:
print(len(x_train))
print(len(combined_train))

916207
1832414


In [21]:
save_path = os.path.join(os.getcwd(), 'Proteins/train.json')
with open(save_path, "w") as f:   # serialize to JSON (not pickle)
    json.dump(combined_train, f, indent=2) 

In [22]:
save_path = os.path.join(os.getcwd(), 'Proteins/test.json')
with open(save_path, "w") as f:   # serialize to JSON (not pickle)
    json.dump(combined_test, f, indent=2) 

save_path = os.path.join(os.getcwd(), 'Proteins/val.json')
with open(save_path, "w") as f:   # serialize to JSON (not pickle)
    json.dump(combined_val, f, indent=2) 
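A quick way to check the exported files is to read one back and compare it with the list in memory. The sketch below does this on made-up toy data in a temporary directory; `save_sequences` and `load_sequences` are hypothetical helpers that mirror the `json.dump` calls above.

```python
import json
import tempfile
from pathlib import Path


def save_sequences(sequences, path):
    """Write a list of sequences as a JSON array, one element per line."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sequences, fh, indent=2)


def load_sequences(path):
    """Read a JSON array of sequences back into a Python list."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


toy_sequences = ["MKV", "MAEI", "MGACC"]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.json"
    save_sequences(toy_sequences, path)
    restored = load_sequences(path)

print(restored == toy_sequences)
# True
```

For a flat list of strings, `indent=2` places each sequence on its own line, which keeps the file easy to inspect at the cost of a few extra bytes per entry.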