# Processing the Sequence Data
Before training the model, the sequence data must be processed into training and validation datasets. To keep training consistent across runs, I create permanent datasets. DeepVHPPI stores its datasets in JSON format.

I also need to create a negative training dataset whose sequences are less than 80% similar to those in the positive training dataset and that contains no duplicates. In preliminary tests, I found that training became confused and inconsistent when the negative training dataset contained duplicates or sequences that were not less than 80% similar to the positive set.

## Dataset Origins
### HPIDB dataset
This dataset was obtained in 2022 from [HPIDB](https://hpidb.igbb.msstate.edu/), a host-pathogen database containing experimental and predicted interactions between various hosts and pathogens. I downloaded 10704 host-bacterial pathogen interactions by clicking the bacteria section of [The Pie Chart](https://hpidb.igbb.msstate.edu/hpi30_statistics.html).

The HPIDB dataset is used to train the BERT model for PPI prediction.

#### Negative HPIDB dataset
The negative interaction dataset was created by downloading random sequences from UniProt and pairing each one with a bacterial pathogen sequence. This yields a negative human-pathogen interaction dataset that can be used for training. To ensure that no human sequence from the positive dataset appears in the negative dataset, we used CD-HIT-2D to compare the sequences in both datasets.

1. Create a negative set with the same number of sequences as the positive set of human sequences.
2. Use CD-HIT-2D to find examples in the negative dataset that are more than 80% similar to sequences in the positive dataset.
3. Remove those examples, replace them with new examples (creating a new list), and compare the new list against the positive dataset again.
4. The end result is a positive and a negative dataset that can be used for training.
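
The steps above can be sketched in a few lines of Python. This is only an illustration of the reject-and-replace idea, not the script I ran: `similarity` uses `difflib.SequenceMatcher` as a stand-in for CD-HIT-2D's identity measure, and `build_negative_set`, the toy sequences, and the candidate pool are all hypothetical.

```python
import random
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]. This is NOT CD-HIT's identity measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def build_negative_set(positives, candidate_pool, threshold=0.8, seed=0):
    """Pick len(positives) unique candidates that stay below the similarity threshold."""
    rng = random.Random(seed)
    pool = list(dict.fromkeys(candidate_pool))  # drop duplicate candidates, keep order
    rng.shuffle(pool)
    negatives = []
    for seq in pool:
        if len(negatives) == len(positives):
            break
        # Steps 2-3: skip a candidate that is too similar and try the next one instead.
        if any(similarity(seq, pos) >= threshold for pos in positives):
            continue
        negatives.append(seq)
    if len(negatives) < len(positives):
        raise ValueError("candidate pool too small after filtering")
    return negatives


# Toy strings only; real protein sequences are far longer.
positives = ["MKTAYIAKQR", "MSSMNPEYDY"]
pool = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG", "PPPPPWWWWW", "GGGGGGGGGG"]
negatives = build_negative_set(positives, pool)
```

Here the exact copy and the near copy of the first positive sequence are rejected, and the duplicated candidate is only considered once.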

### Create interaction dataframe
We create an interaction dataframe from the MITAB-formatted file downloaded from HPIDB. I wrote Python scripts that format the data and create a dataframe using pandas. The MITAB file is a tab-separated text file.
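
As a rough idea of what reading such a file involves, here is a minimal sketch with pandas. It does not reproduce `create_interaction_df`; the column names (`protein_xref_1`, `protein_taxid_1`, and so on) and the `db:value` cell layout are assumptions made for this toy example and may not match the real HPIDB export.

```python
import io

import pandas as pd

# Toy two-row stand-in for a MITAB export (column names are assumed, not taken from HPIDB).
toy_mitab = io.StringIO(
    "protein_xref_1\tprotein_xref_2\tprotein_taxid_1\tprotein_taxid_2\n"
    "uniprotkb:Q81SE4\tuniprotkb:P22897\ttaxid:1392\ttaxid:9606\n"
    "uniprotkb:Q7BU69\tuniprotkb:P62820\ttaxid:623\ttaxid:9606\n"
)

# MITAB is tab separated, so sep="\t" is the only non-default option needed here.
raw = pd.read_csv(toy_mitab, sep="\t", dtype=str)

# Keep only the part after the "db:" prefix in each cell.
toy_df = pd.DataFrame(
    {
        "Uniprot_A": raw["protein_xref_1"].str.split(":").str[-1],
        "Uniprot_B": raw["protein_xref_2"].str.split(":").str[-1],
        "TaxonID_A": raw["protein_taxid_1"].str.split(":").str[-1].astype(int),
        "TaxonID_B": raw["protein_taxid_2"].str.split(":").str[-1].astype(int),
    }
)
```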

In [2]:
# Helper scripts written for this project
from data.process_hpidb_mtb import create_interaction_df
from data.fasta_to_pandas import convert_fasta_to_pandas

# Location of the downloaded HPIDB MITAB file
root_data = '../data/williams_MTB/'
input_mitab_file = root_data + 'pathcat_BACTERIA.mitab_plus.txt'

# Build the dataframe of interacting protein pairs
interactionDF = create_interaction_df(input_mitab_file)
interactionDF


Unnamed: 0,Uniprot_A,Uniprot_B,TaxonID_A,TaxonID_B,Species_A,Species_B,Seq1,Seq2
0,Q81SE4,P22897,1392,9606,Bacillus anthracis,Homo sapiens,MFKIDSARTYFSIFLAASFVVALLIPLPPFILDIIIVFLLSMSVLI...,MRLPLLLVFASVIPGAVLLLDTRQFLIYNEDHKRCVDAVSPSAVQT...
1,Q5NID2,P22897,177416,9606,Francisella tularensis subsp. tularensis SCHU S4,Homo sapiens,MSYSYAEKKRIRKEFGVLPHILDVPYLLSIQTESYKKFLTADAAKG...,MRLPLLLVFASVIPGAVLLLDTRQFLIYNEDHKRCVDAVSPSAVQT...
2,Q5NGQ8,Q96Q89,177416,9606,Francisella tularensis subsp. tularensis SCHU S4,Homo sapiens,MIVDNSKDFDLKSFLANLTTHSGVYRMLDKHGEIIYVGKAKNLKNR...,MESNFNQEGVPRPSYVFSADPIARPSEINFDGIKLDLSHEFSLVAP...
3,Q7BU69,P62820,623,9606,Shigella flexneri,Homo sapiens,MQTSNITNHERNDSSWMSTVKSTTEVSWNKLSFCDILLKIITFGIY...,MSSMNPEYDYLFKLLLIGDSGVGKSCLLLRFADDTYTESYISTIGV...
4,Q5ZWZ3,P62820,272624,9606,Legionella pneumophila subsp. pneumophila str....,Homo sapiens,MAKDNKSHQVKTSEGSLESVKTKEKEPVLEKMRVEDSKKEDKLSMP...,MSSMNPEYDYLFKLLLIGDSGVGKSCLLLRFADDTYTESYISTIGV...
...,...,...,...,...,...,...,...,...
10475,A0A0F7RFB0,Q9ZST4,1392,9606,Bacillus anthracis,Homo sapiens,MSRYDDSQNKFSKPCFPSSAGRIPNTPSIPVTKAQLRTFRAIIIDL...,MRAQGRGRLPRRLLLLLALWVQAARPMGYFELQLSALRNVNGELLS...
10476,A0A0J1I2N1,Q9ZST4,1392,9606,Bacillus anthracis,Homo sapiens,MADVVKITFPDGAVKEFPKGVTTEEIAASISPGLKKKAVAGKLNDE...,MRAQGRGRLPRRLLLLLALWVQAARPMGYFELQLSALRNVNGELLS...
10477,Q5NF53,Q9ZST4,1392,9606,Bacillus anthracis,Homo sapiens,MEWLKKTCFSNLEKESQKNHLLLFITICSFFLGIIAIGYYGYIFTA...,MSTVEEDSDTVTVETVNSVTLTQDTEGNLILHCPQNEADEIDSEDS...
10478,Q8ZGW9,Q9ZW31,1392,9606,Bacillus anthracis,Homo sapiens,MDLNQMTTKTQEAIMSAQSLAVSHHHQEVDTVHLLFTLLEEQDGLA...,MSTVEEDSDTVTVETVNSVTLTQDTEGNLILHCPQNEADEIDSEDS...


The negative dataset initially contained duplicates and examples with >80% similarity to the positive dataset. I used CD-HIT-2D to find the high-similarity entries and scripts to remove the duplicates.

The end result was the file 'new_negative_examples.fasta'.
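
Duplicate removal can be done with a short pass over the FASTA records. The sketch below is a guess at the general approach rather than the actual scripts: `read_fasta` and `drop_duplicates` are hypothetical helpers, and the IDs and sequences are toy values.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicates(records):
    """Keep the first record seen for each distinct ID and each distinct sequence."""
    seen_ids, seen_seqs, unique = set(), set(), []
    for header, seq in records:
        if header in seen_ids or seq in seen_seqs:
            continue
        seen_ids.add(header)
        seen_seqs.add(seq)
        unique.append((header, seq))
    return unique


# Toy FASTA text with hypothetical IDs.
toy_fasta = io.StringIO(
    ">neg_A\nMSDANKAAIA\nAEKEALNLKL\n"
    ">neg_B\nMEGSLEREAP\n"
    ">neg_A\nMSDANKAAIA\nAEKEALNLKL\n"  # exact duplicate record
    ">neg_C\nMEGSLEREAP\n"  # new ID, but the sequence duplicates neg_B
)
unique_records = drop_duplicates(read_fasta(toy_fasta))
```

Checking both the ID and the sequence is a design choice: the same protein can appear under two accessions, and either kind of repeat would put a duplicate example into the negative set.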

In [3]:
negative_examples_fasta = root_data + 'new_negative_examples.fasta'
negative_examples_df = convert_fasta_to_pandas(negative_examples_fasta, 'Uniprot_B', 'Seq2')
negative_examples_df

Dataframe   Uniprot_B                                               Seq2
0    Q6ZR08  MSDANKAAIAAEKEALNLKLPPIVHLPENIGVDTPTQSKLLKYRRS...
1    Q8TAX9  MFSVFEEITRIVVKEMDAGGDMIAVRSLVDADRFRCFHLVGEKRTF...
2    Q9P2G4  MAASLSERLFSLELLVDWVRLEARLLPSPAAAVEQEEEEEEKEQGE...
3    Q9XRX5  MFGACYKQPLKPSGSEPPAEECRMTPRHAGCDVTEMQRILSQPTFT...
4    P49366  MEGSLEREAPAGALAAVLKHSSTLPPESTQVRGYDFNRGVNYRALL...
Dataframe columns Index(['Uniprot_B', 'Seq2'], dtype='object')


Unnamed: 0,Uniprot_B,Seq2
0,Q6ZR08,MSDANKAAIAAEKEALNLKLPPIVHLPENIGVDTPTQSKLLKYRRS...
1,Q8TAX9,MFSVFEEITRIVVKEMDAGGDMIAVRSLVDADRFRCFHLVGEKRTF...
2,Q9P2G4,MAASLSERLFSLELLVDWVRLEARLLPSPAAAVEQEEEEEEKEQGE...
3,Q9XRX5,MFGACYKQPLKPSGSEPPAEECRMTPRHAGCDVTEMQRILSQPTFT...
4,P49366,MEGSLEREAPAGALAAVLKHSSTLPPESTQVRGYDFNRGVNYRALL...
...,...,...
10475,Q8IV45,MCPQESSFQPSQFLLLVGVPVASVLLLAQCLRWHCPRRLLGACWTL...
10476,Q99469,MIPPSSPREDGVDGLPKEAVGAEQPPSPASTSSQESKLQKLKRSLS...
10477,P16402,MSETAPLAPTIPAPAEKTPVKKKAKKAGATAGKRKASGPPVSELIT...
10478,Q8TAA5,MAVRSLWAGRLRVQRLLAWSAAWESKGWPLPFSTATQRTAGEDCRS...


In [4]:
negative_interaction_df = interactionDF[['Uniprot_A','Seq1']]
negative_interaction_df[['Uniprot_B','Seq2']] = negative_examples_df[['Uniprot_B','Seq2']]
negative_interaction_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


Unnamed: 0,Uniprot_A,Seq1,Uniprot_B,Seq2
0,Q81SE4,MFKIDSARTYFSIFLAASFVVALLIPLPPFILDIIIVFLLSMSVLI...,Q6ZR08,MSDANKAAIAAEKEALNLKLPPIVHLPENIGVDTPTQSKLLKYRRS...
1,Q5NID2,MSYSYAEKKRIRKEFGVLPHILDVPYLLSIQTESYKKFLTADAAKG...,Q8TAX9,MFSVFEEITRIVVKEMDAGGDMIAVRSLVDADRFRCFHLVGEKRTF...
2,Q5NGQ8,MIVDNSKDFDLKSFLANLTTHSGVYRMLDKHGEIIYVGKAKNLKNR...,Q9P2G4,MAASLSERLFSLELLVDWVRLEARLLPSPAAAVEQEEEEEEKEQGE...
3,Q7BU69,MQTSNITNHERNDSSWMSTVKSTTEVSWNKLSFCDILLKIITFGIY...,Q9XRX5,MFGACYKQPLKPSGSEPPAEECRMTPRHAGCDVTEMQRILSQPTFT...
4,Q5ZWZ3,MAKDNKSHQVKTSEGSLESVKTKEKEPVLEKMRVEDSKKEDKLSMP...,P49366,MEGSLEREAPAGALAAVLKHSSTLPPESTQVRGYDFNRGVNYRALL...
...,...,...,...,...
10475,A0A0F7RFB0,MSRYDDSQNKFSKPCFPSSAGRIPNTPSIPVTKAQLRTFRAIIIDL...,Q8IV45,MCPQESSFQPSQFLLLVGVPVASVLLLAQCLRWHCPRRLLGACWTL...
10476,A0A0J1I2N1,MADVVKITFPDGAVKEFPKGVTTEEIAASISPGLKKKAVAGKLNDE...,Q99469,MIPPSSPREDGVDGLPKEAVGAEQPPSPASTSSQESKLQKLKRSLS...
10477,Q5NF53,MEWLKKTCFSNLEKESQKNHLLLFITICSFFLGIIAIGYYGYIFTA...,P16402,MSETAPLAPTIPAPAEKTPVKKKAKKAGATAGKRKASGPPVSELIT...
10478,Q8ZGW9,MDLNQMTTKTQEAIMSAQSLAVSHHHQEVDTVHLLFTLLEEQDGLA...,Q8TAA5,MAVRSLWAGRLRVQRLLAWSAAWESKGWPLPFSTATQRTAGEDCRS...
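
The `SettingWithCopyWarning` in the output above appears because new columns are assigned to a selection taken from `interactionDF`. One common way to avoid it is to take an explicit copy first. The sketch below shows that pattern on toy frames; the frame names and values are hypothetical, and pairing rows by position through `to_numpy()` is my own choice here, not something the original cell does.

```python
import pandas as pd

# Toy stand-ins for interactionDF and negative_examples_df.
positive = pd.DataFrame(
    {
        "Uniprot_A": ["pathogen_1", "pathogen_2"],
        "Seq1": ["MFKIDS", "MSYSYA"],
        "Uniprot_B": ["human_p", "human_q"],
        "Seq2": ["MRLPLL", "MESNFN"],
    }
)
negatives = pd.DataFrame({"Uniprot_B": ["human_x", "human_y"], "Seq2": ["MSDANK", "MFSVFE"]})

# .copy() makes the selection an independent frame, so adding columns is unambiguous.
negative_pairs = positive[["Uniprot_A", "Seq1"]].copy()

# to_numpy() pairs rows by position and ignores index labels (both frames must be equally long).
negative_pairs["Uniprot_B"] = negatives["Uniprot_B"].to_numpy()
negative_pairs["Seq2"] = negatives["Seq2"].to_numpy()
```

With the copy in place, `positive` is left untouched and pandas has no reason to warn.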


Training and test lists were created from these two dataframes and then exported in JSON format as the training and test files.
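
For orientation, here is one way such lists could be assembled. It is a sketch under assumptions, not the body of `create_train_test_list`: the record layout follows the output printed in this notebook, while `to_records`, `split_records`, the 20% test fraction, and the fixed seed are hypothetical. Rows are plain dicts, which is what `DataFrame.to_dict("records")` returns.

```python
import random


def to_records(rows, label):
    """Convert rows with Uniprot_A/Seq1/Uniprot_B/Seq2 keys into labelled pair records."""
    return [
        {
            "protein_1": {"id": row["Uniprot_A"], "primary": row["Seq1"]},
            "protein_2": {"id": row["Uniprot_B"], "primary": row["Seq2"]},
            "is_interaction": label,
        }
        for row in rows
    ]


def split_records(positive_rows, negative_rows, test_fraction=0.2, seed=0):
    """Label positives 1 and negatives 0, shuffle, and split into (train, test)."""
    records = to_records(positive_rows, 1) + to_records(negative_rows, 0)
    random.Random(seed).shuffle(records)  # fixed seed keeps the split reproducible
    n_test = int(len(records) * test_fraction)
    return records[n_test:], records[:n_test]


# Toy rows with hypothetical IDs and sequences.
positive_rows = [
    {"Uniprot_A": f"pathogen_{i}", "Seq1": "MKT", "Uniprot_B": f"host_{i}", "Seq2": "MSS"}
    for i in range(5)
]
negative_rows = [
    {"Uniprot_A": f"pathogen_{i}", "Seq1": "MKT", "Uniprot_B": f"random_{i}", "Seq2": "GGG"}
    for i in range(5)
]
toy_train, toy_test = split_records(positive_rows, negative_rows)
```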

In [5]:
from data.process_hpidb_mtb import create_train_test_list

train_list, test_list = create_train_test_list(interactionDF, negative_interaction_df)
train_list

[{'protein_1': {'id': 'Q00496',
   'primary': 'MQKFTFFVSCAKGIELLLKDELERLGISSQEKLAGVEFEGSIKDAYKVCIYSYLASQVMLKVATDKVINQQDLYEFISSINWMDYFAVDKTFKIIISGKHYDFNNTMFVSQKTKDAIVDQFRNVTNQRPNIDTENPDNVIKLHLHKQFVNVFLCLNIDSLHKRSYRQFQGQAPLKESLAAAILIKAGWLEELKKHQPILIDPMCGSGTILIEAALMAKNIAPVLLNKEFKIFNSKFHNQELWDNLLEIAKNSQKVTNAIICGFDIDNNVLDKAQRNIYQAGVEDVITVKRQDIRDLENEFESEGLIVTNPPYGERLYGDQLDELLDIFNGFGNRLSQDFYGWKVAVLTSFADSIKEMQLRTTERNKFYNGAIETILYQFEINEHAKFKHETQLEKNIRIAEASAQKSDEHIDFANKLKKNLKSLKPWLKQTGLECYRLYDADIPTFAVAVDVYSEHIFLQEYRADATIDQNIAKQRFYQAIYQIHKTLDIKYENIHTRVRQRQKGKEQYQKENDKNKFHIINEFDAKFYVNFDDYLDTGIFLDHRKIRQLVAKAAKNKTLLNLFSYTCTASVHAALKGAKTTSVDMSNTYLEWGKNNFTLNNIDAKKHSFIQADCISWLKTNKDKFDVIFLDPPTFSNSKRMDDILDIQRDHELLINLAMDSLKKDGILYFSNNYRRFKMSPQILEKFNCENIDKICLSRDFLSNKNIHNCWEIKYK'},
  'protein_2': {'id': 'O00299',
   'primary': 'MQIMFSSVVRISGLCLFPNGGMTYNLFCLYLSIHQGAVFSASRPSYCQAGYDSEDVI'},
  'is_interaction': 1},
 {'protein_1': {'id': 'A0A384LB40',
   'primary': 'MTSSDTQNNKTLAAMKNFAEQYAKRTDTYFCSDLSVTAVVIEGLARHKE

We now dump the lists to disk as JSON. Note that this is the original data I used. The files generated here are not in the data folder required for training.

In [8]:
import json
with open('williams_MTB/hidb_temp_train.json', 'w') as f:
    json.dump(train_list, f)
with open('williams_MTB/hidb_temp_test.json', 'w') as f:
    json.dump(test_list, f)
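
A quick way to confirm that a dump is intact is to read it back and compare. The sketch below does this with a toy record written to a temporary directory, so it does not touch the real files; the record contents and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# One toy record in the same layout as the training list.
toy_records = [
    {
        "protein_1": {"id": "pathogen_1", "primary": "MKT"},
        "protein_2": {"id": "host_1", "primary": "MSS"},
        "is_interaction": 1,
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_train.json"
    with open(path, "w") as f:
        json.dump(toy_records, f)
    # Load the file again to check that nothing was lost on the way to disk.
    with open(path) as f:
        reloaded = json.load(f)
```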

### Using CD-HIT-2D and removing duplicates