# Pastebin for the Kunitz domain project

# Structure selection
I will collect all the available PDB structures of the Kunitz-BPTI domain. I tried different approaches (see pastebin), but in the end I get way too many structures, and there is no need to go crazy.
I believe that the Uniprot annotation is the best unbiased source, and it gives me only a few proteins that I can inspect manually.

## Retrieving structures from proteins annotated with the Kunitz Pfam domain in Uniprot
I search Uniprot for entries (reviewed and unreviewed) annotated with the Pfam Kunitz domain that have a 3D structure available.

`database:(type:pfam pf00014) database:(type:pdb)`

I retrieve only 33 proteins. I save the IDs in `Uniprot_Kunitz_3D.list`.

In [55]:
!head Uniprot_Kunitz_3D.list

P08592
Q06481
P05067
Q7TQN3
P12111
P02760
P12023
P00974
P00981
P10646


I map the 33 entries to all their PDB structures using the ID mapping tool at Uniprot.
I retrieve 346 structures. I save the list of PDB IDs in `Uniprot_Kunitz_3D_PDBAC.list`.

In [56]:
!head Uniprot_Kunitz_3D_PDBAC.list

1AAL
1AAP
1ADZ
1AMB
1AMC
1AML
1B0C
1BA4
1BA6
1BF0


This list contains only unique identifiers: the line count does not change after `sort -u`.

In [57]:
!cat Uniprot_Kunitz_3D_PDBAC.list|wc -l
!cat Uniprot_Kunitz_3D_PDBAC.list|sort -u|wc -l

346
346


## Retrieving structures annotated with the Kunitz CATH domain in RCSB PDB

I search in the RCSB PDB for entries containing the Kunitz CATH domain.

`CathTree Search for Factor Xa Inhibitor (4.10.410.10)`

I retrieve 136 matches. The list of PDB IDs is saved in CSV format in `PDB_CATH_Kunitz.csv`.
Now I parse the file into a pandas Series.

In [None]:
import pandas as pd

with open("./PDB_CATH_Kunitz.csv") as PDB_CATH_filein:
    # strip() drops the trailing newline so the last ID stays clean
    PDB_CATH_list = PDB_CATH_filein.read().strip().split(", ")
    PDB_CATH_series = pd.Series(PDB_CATH_list)
PDB_CATH_series.head()

## Retrieving Kunitz structures with PDBfold
I select two prototype structures and scan the PDB for similar structures with PDBfold:

* Single-chain structure at 1 Å: `5PTI`
* Structure in complex with trypsin at 1.8 Å: `3TGI`

I perform a search on PDBfold with default parameters using chain I of `3TGI`, retrieving 609 matches. I save the result summary in `pdbfold_3tgi.dat`.

I perform a search on PDBfold with default parameters using chain A of `5PTI`, retrieving 584 matches. I save the result summary in `pdbfold_5pti.dat`.

I will write a short script to parse the `.dat` files obtained from PDBfold into a pandas DataFrame. It seems that I cannot download the results in a more useful format. First I need to strip the `PDB` string from all the columns, otherwise the files are difficult to parse.

In [None]:
!sed -i 's/PDB / /g' ./pdbfold_*.dat

In [None]:
def get_pdbfold_df(filepath):
    with open(filepath) as dat_filein:
        pdbfold_df = pd.read_csv(dat_filein, skiprows=(0, 1, 2, 3), sep=r"\s+").set_index('##')  # raw string avoids an invalid-escape warning
    pdbfold_df["Query"]= pdbfold_df["Query"].str.split(":")
    pdbfold_df["Target"]= pdbfold_df["Target"].str.split(":")
    return pdbfold_df

pdbfold_3tgi_df = get_pdbfold_df("./pdbfold_3tgi.dat")
pdbfold_5pti_df = get_pdbfold_df("./pdbfold_5pti.dat")

pdbfold_3tgi_df.head()

## Merging all the PDB IDs into a unique list
Now I want to get a single list of unique target PDB IDs from the two PDBfold files and put them in the list `PDBfold_IDs`. I get 402 IDs.

In [None]:
def get_PDB_list(df):
    PDB_list = []
    for element in df['Target']:
        ID = element[0].upper()
        PDB_list.append(ID)
    return PDB_list

PDBfold_IDs = []
PDBfold_IDs += get_PDB_list(pdbfold_3tgi_df)
PDBfold_IDs += get_PDB_list(pdbfold_5pti_df)
PDBfold_IDs = list(dict.fromkeys(PDBfold_IDs)) # remove duplicate IDs
len(PDBfold_IDs)

Now I merge all the PDB IDs retrieved in the different ways. The last number printed by the cell below is the count of unique IDs.

In [None]:
PDB_IDs = []
print(len(PDB_IDs))
PDB_IDs += UniprotPDB_IDs
print(len(PDB_IDs))
PDB_IDs += PDBfold_IDs
print(len(PDB_IDs))
PDB_IDs += PDB_CATH_list
print(len(PDB_IDs))
PDB_IDs = list(dict.fromkeys(PDB_IDs)) # remove duplicate IDs
print(len(PDB_IDs))
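
The merge above uses `UniprotPDB_IDs`, which is not defined in any of the cells shown. Presumably it holds the PDB IDs saved earlier in `Uniprot_Kunitz_3D_PDBAC.list`. A minimal sketch of how it could be loaded before the merge, assuming one ID per line; the helper name `read_id_list` is hypothetical:

```python
from pathlib import Path


def read_id_list(path):
    """Return the non-empty lines of a one-ID-per-line file, upper-cased."""
    text = Path(path).read_text()
    return [line.strip().upper() for line in text.splitlines() if line.strip()]


# Assumed usage (file name taken from the Uniprot section above):
# UniprotPDB_IDs = read_id_list("./Uniprot_Kunitz_3D_PDBAC.list")
```

Upper-casing keeps these IDs comparable with the PDBfold IDs, which `get_PDB_list` also upper-cases, so that `dict.fromkeys` can drop the duplicates.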

At this point my set of structures should be as complete as possible.
First I write all the PDB IDs to `PDB_IDs.list`.

In [None]:
try:
    with open("./PDB_IDs.list","x") as PDB_IDs_fileout: # x avoids overwriting files
        file_content = "\n".join(PDB_IDs)
        PDB_IDs_fileout.write(file_content)
except FileExistsError:
    pass

Now I map the list to Uniprot ACs using the Uniprot ID mapping tool, so as to get a list of individual proteins.
I get 425 Uniprot IDs, which I save in `Uniprot_IDs_mappedfromPDB.list`.
24 PDB IDs were not mapped. I save them in `PDB_IDs_notmapped.list`.
For now I have decided to ignore them.
The ID list is unique.

In [None]:
!sort ./Uniprot_IDs_mappedfromPDB.list|uniq|wc -l

Next steps:

* I will find a way to remove duplicate structures.
* I will find a way to select the right chain in the PDB file.
* I will find a way to select the right domain in the chain.
* I will exclude mutated proteins.
* I will adopt a resolution threshold.

Now I scramble my databases, so that I can split each one into two random halves.

In [155]:
!scramble_fasta.py positives_all_cleaned.fasta|tee positives_all_cleaned_scrambled.fasta|grep ">"|wc -l
!scramble_fasta.py negatives_all.fasta|tee negatives_all_scrambled.fasta|grep ">"|wc -l

346
561552


In [157]:
print(346/2)
561552/2

173.0


280776.0

In [173]:
!cat positives_all_cleaned_scrambled.fasta|grep -m 174 ">" -n|tail -1
!cat negatives_all_scrambled.fasta|grep -m 280777 ">" -n|tail -1

347:>sp|A8Y7N5|VKTC2_DABSI Kunitz-type serine protease inhibitor C2 OS=Daboia siamensis OX=343250 PE=2 SV=1
561553:>sp|A9MCV8|TRUA_BRUC2 tRNA pseudouridine synthase A OS=Brucella canis (strain ATCC 23365 / NCTC 10854) OX=483179 GN=truA PE=3 SV=1


In [160]:
!cat positives_all_cleaned_scrambled.fasta|head -346|tee positives_set1.fasta|grep ">"|wc -l
!cat positives_all_cleaned_scrambled.fasta|tail -n +347|tee positives_set2.fasta|grep ">"|wc -l

173
173


In [168]:
!cat negatives_all_scrambled.fasta|head -561552|tee negatives_set1.fasta|grep ">"|wc -l
!cat negatives_all_scrambled.fasta|tail -n +561553|tee negatives_set2.fasta|grep ">"|wc -l

280776
280776


In [172]:
!cat positives_set1.fasta|tail -1
!cat positives_set2.fasta|head -1
!cat negatives_set1.fasta|tail -1
!cat negatives_set2.fasta|head -1

MGTARFLSAVLLLSVLLMVTFPALLSAEYHDGRVDICSLPSDSGDRLRFFEMWYFDGTTCTKFVYGGYGGNDNRFPTEKACMKRCAKA
>sp|A8Y7N5|VKTC2_DABSI Kunitz-type serine protease inhibitor C2 OS=Daboia siamensis OX=343250 PE=2 SV=1
MDLTPSPRKHRSVSHSQSSDSGPPSSTKSNSGVPAGSNRKGFNINIAVSPISVSSPPVSHKHTLTRSHSHSSKHRRGSSASTNNPLPQLLEDADGPQLPEWPQPANESQGLRYNLELPSDEHLASLDIDDQLKFLALKEMGIVELKDKISQLNSILHKGEKDLHRLRELVQRSLYKEMSAGYTGSSKHVRQSSNPRDEAIASTKNRTRRRTLSSSSSPSKYLPVPEQSEPDSKSRLWSNLSKPLGFIQQFDSMLQNEFERSLIPQVSNSANPQPRTSEESYQSPLRSRSKNNDVDLPTEWTSSRSSSPQRASRNPEEMFQAVSSSIWSFVNDVRENMLPPREEEEKDKELYNLDNGSTVSVENMNNSDYDETTTETLPRRRSRQNSNAIDK
>sp|A9MCV8|TRUA_BRUC2 tRNA pseudouridine synthase A OS=Brucella canis (strain ATCC 23365 / NCTC 10854) OX=483179 GN=truA PE=3 SV=1


In [174]:
!grep ">" positives_set1.fasta|wc -l
!grep ">" positives_set2.fasta|wc -l
!grep ">" negatives_set1.fasta|wc -l
!grep ">" negatives_set2.fasta|wc -l

173
173
280776
280776
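
The split above works on line counts, so it depends on every record taking exactly two lines (header plus a single sequence line), which the `grep -n` check confirms for these files. `scramble_fasta.py` itself is not shown. As an alternative sketch that works on whole records regardless of line wrapping, the shuffle and the halving could be done in Python; the function names and the fixed seed are assumptions made for illustration:

```python
import random


def read_fasta(lines):
    """Group FASTA lines into (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def shuffle_and_halve(records, seed=0):
    """Shuffle a copy of the records and return two halves.

    With an odd number of records the second half gets the extra one.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]
```

With 346 positive records this would give two sets of 173, the same sizes as the `head`/`tail` split.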


## HMM search
I use `hmmsearch` to obtain an E-value for the sequences in all datasets.
I use the `--max` flag to turn off all the heuristic filters.

In [176]:
!hmmsearch --max --tblout positives_set1.hmmsearch.tbl kunitz_bpti.hmm positives_set1.fasta|tee positives_set1.hmmsearch|wc -l
!hmmsearch --max --tblout positives_set2.hmmsearch.tbl kunitz_bpti.hmm positives_set2.fasta|tee positives_set2.hmmsearch|wc -l
!hmmsearch --max --tblout negatives_set1.hmmsearch.tbl kunitz_bpti.hmm negatives_set1.fasta|tee negatives_set1.hmmsearch|wc -l
!hmmsearch --max --tblout negatives_set2.hmmsearch.tbl kunitz_bpti.hmm negatives_set2.fasta|tee negatives_set2.hmmsearch|wc -l

2543
2543
258
114


## Parsing and creation of the draft datasets
I parse the output of hmmsearch and produce a file that I can use with the `model_performance.py` script.
I wrote my own script for this; it already sorts the entries and keeps them unique.
Note that these files contain only the sequences for which hmmsearch reported a score. I still need to add the missing IDs.

In [177]:
!hmmalign_to_dataset.sh positives_set1.hmmsearch.tbl 1|tee positives_set1.dat|wc -l
!hmmalign_to_dataset.sh positives_set2.hmmsearch.tbl 1|tee positives_set2.dat|wc -l
!hmmalign_to_dataset.sh negatives_set1.hmmsearch.tbl 0|tee negatives_set1.dat|wc -l
!hmmalign_to_dataset.sh negatives_set2.hmmsearch.tbl 0|tee negatives_set2.dat|wc -l

173
173
16
6
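
The script `hmmalign_to_dataset.sh` is not reproduced here. A rough Python sketch of what such a parser could look like, under these assumptions: the target name in the `--tblout` table has the Uniprot form `sp|ACCESSION|NAME`, the full-sequence E-value is the fifth whitespace-separated column, and when an ID appears more than once the lowest E-value is kept. The function name `tblout_to_dataset` is hypothetical.

```python
def tblout_to_dataset(lines, label):
    """Return sorted (accession, E-value, label) tuples from hmmsearch --tblout lines."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target = fields[0]
        # "sp|P00974|BPT1_BOVIN" -> "P00974"; otherwise keep the name as is
        parts = target.split("|")
        accession = parts[1] if len(parts) >= 3 else target
        evalue = float(fields[4])  # full-sequence E-value
        if accession not in best or evalue < best[accession]:
            best[accession] = evalue
    return sorted((acc, ev, label) for acc, ev in best.items())
```

The three-column output (ID, E-value, class label) has the same shape as the `.dat` lines shown in this notebook.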


## Completion of the dataset
Now I add the missing IDs by using my scripts.

In [178]:
!ID_from_uniprot_fasta.sh positives_set1.fasta|tee positives_set1.list|wc -l # the .list files contain the IDs
!ID_from_uniprot_fasta.sh positives_set2.fasta|tee positives_set2.list|wc -l
!ID_from_uniprot_fasta.sh negatives_set1.fasta|tee negatives_set1.list|wc -l
!ID_from_uniprot_fasta.sh negatives_set2.fasta|tee negatives_set2.list|wc -l

173
173
280776
280776


In [188]:
!cat positives_set1.dat|sort -grk 2|head -1 # the worst score
!cat positives_set2.dat|sort -grk 2|head -1
!cat negatives_set1.dat|sort -grk 2|head -1
!cat negatives_set2.dat|sort -grk 2|head -1

P0CAR0 9.3e-12 1
D3GGZ8 4.6e-05 1
Q9FF80 29 0
Q07494 25 0


In [181]:
!add_missing_IDs.sh positives_set1.dat positives_set1.list 30 0|tee positives_set1_complete.dat|wc -l # the .dat files are the datasets obtained from hmmsearch
!add_missing_IDs.sh positives_set2.dat positives_set2.list 30 0|tee positives_set2_complete.dat|wc -l
!add_missing_IDs.sh negatives_set1.dat negatives_set1.list 30 0|tee negatives_set1_complete.dat|wc -l
!add_missing_IDs.sh negatives_set2.dat negatives_set2.list 30 0|tee negatives_set2_complete.dat|wc -l

173
173
280776
280776
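
`add_missing_IDs.sh` is not shown either. Reading the command line, the last two arguments look like the default E-value (30, just above the worst E-value of 29 seen among the negatives) and the class label given to sequences that hmmsearch did not report; this is an assumption. A minimal Python sketch of that completion step, with hypothetical names:

```python
def add_missing_ids(scored, all_ids, default_evalue=30.0, default_label=0):
    """Append (ID, default E-value, default label) for every ID without a score.

    `scored` is a list of (ID, E-value, label) tuples; `all_ids` lists every
    ID in the corresponding FASTA file.
    """
    seen = {row[0] for row in scored}
    completed = list(scored)
    for seq_id in all_ids:
        if seq_id not in seen:
            completed.append((seq_id, default_evalue, default_label))
            seen.add(seq_id)  # guard against duplicate IDs in all_ids
    return completed
```

After this step each completed dataset should have as many rows as its `.list` file, which is what the counts above show.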


## Merging of positives and negatives

In [182]:
!cat positives_set1_complete.dat negatives_set1_complete.dat > set1.dat
!cat positives_set2_complete.dat negatives_set2_complete.dat > set2.dat

In [183]:
!cat set1.dat|wc -l
!cat set2.dat|wc -l

280949
280949


# Calculation of the confusion matrix
I first want to see the best scores of the negatives and the worst scores of the positives in the training set (`set1.dat`).
G3LH89 is in the negative group but does indeed seem to have a PROSITE-annotated Kunitz-BPTI domain.

In [205]:
!cat set1.dat|awk '{if($3==1){print $0}}'|sort -gk 2|tail # the worst score of the positives
!echo
!cat set1.dat|awk '{if($3==0){print $0}}'|sort -gk 2|head # the best score of the negatives

Q63870 1.4e-17 1
B5L5Q3 1.7e-17 1
P0DMW9 1.1e-15 1
C0LNR2 3.5e-15 1
P0DMW8 9.8e-15 1
Q7Z1K3 1.9e-14 1
Q2ES49 3e-14 1
H2A0N1 1e-13 1
P26226 7.5e-12 1
P0CAR0 9.3e-12 1

G3LH89 1.4e-18 0
P56409 0.00098 0
P84555 0.003 0
P83605 0.044 0
P0DJ63 0.17 0
P84556 0.73 0
P36235 0.78 0
P71089 1.2 0
P85040 1.2 0
P83604 3.3 0


In [208]:
!model_stats.py set1.dat 1e-3
!model_stats.py set2.dat 1e-3

Confusion matrix:
(173, 2)
(280774, 0)

ACC: 0.9999928812702661 MCC: 0.9942657526348435
Confusion matrix:
(173, 0)
(280776, 0)

ACC: 1.0 MCC: 1.0
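
`model_stats.py` is not shown. From the output, the confusion matrix appears to be printed as (TP, FP) on the first line and (TN, FN) on the second, with a sequence predicted positive when its E-value is below the threshold. A sketch of the calculation under those assumptions, for a three-column dataset (ID, E-value, label); the function names are hypothetical:

```python
import math


def confusion_counts(rows, threshold):
    """Count TP, FP, TN, FN; a row is predicted positive if E-value < threshold."""
    tp = fp = tn = fn = 0
    for _seq_id, evalue, label in rows:
        predicted = evalue < threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of rows classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Plugging in the set 1 counts, `accuracy(173, 2, 280774, 0)` and `mcc(173, 2, 280774, 0)` reproduce the ACC and MCC values reported above. The two false positives at the 1e-3 threshold are consistent with G3LH89 (1.4e-18) and P56409 (0.00098) in the list of best-scoring negatives.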
