# Pastebin for the Kunitz domain project

# Structure selection
I will collect all the available PDB structures of the Kunitz-BPTI domain. I tried different approaches (see pastebin), but in the end I get way too many structures and there is no need to go crazy.
I believe that the Uniprot annotation is the best unbiased source, and it gives me only a few proteins that I can inspect manually.

## Retrieving structures from proteins annotated with the Kunitz Pfam domain in Uniprot
I search Uniprot for entries (reviewed and unreviewed) annotated with the Pfam Kunitz domain that have a 3D structure available.

`database:(type:pfam pf00014) database:(type:pdb)`

I retrieve only 33 proteins. I save the IDs in `Uniprot_Kunitz_3D.list`.

In [55]:
!head Uniprot_Kunitz_3D.list

P08592
Q06481
P05067
Q7TQN3
P12111
P02760
P12023
P00974
P00981
P10646


I map the 33 entries to all their PDB structures using the ID mapping tool at Uniprot.
I retrieve 346 structures. I save the list of PDB IDs in `Uniprot_Kunitz_3D_PDBAC.list`.

In [56]:
!head Uniprot_Kunitz_3D_PDBAC.list

1AAL
1AAP
1ADZ
1AMB
1AMC
1AML
1B0C
1BA4
1BA6
1BF0


This list contains only unique identifiers: the line count is the same before and after removing duplicates.

In [57]:
!cat Uniprot_Kunitz_3D_PDBAC.list|wc -l
!cat Uniprot_Kunitz_3D_PDBAC.list|sort -u|wc -l

346
346
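
The merge step further down uses a Python list called `UniprotPDB_IDs`, so the list file has to be read into the notebook at some point. A minimal sketch of how that could look, assuming one identifier per line; `read_id_list` is a hypothetical helper name, and the demo runs on a throwaway file rather than the real list.

```python
import tempfile
from pathlib import Path


def read_id_list(path):
    """Read one identifier per line, skipping blank lines (hypothetical helper)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo on a temporary file with the same one-ID-per-line layout
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = Path(tmpdir) / "demo.list"
    demo_path.write_text("1AAL\n1AAP\n\n1ADZ\n")
    demo_ids = read_id_list(demo_path)

print(demo_ids)  # ['1AAL', '1AAP', '1ADZ']
```

In the notebook this would presumably be called as `UniprotPDB_IDs = read_id_list("./Uniprot_Kunitz_3D_PDBAC.list")`.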


## Retrieving structures annotated with the Kunitz CATH domain in RCSB PDB

I search in the RCSB PDB for entries containing the Kunitz CATH domain.

`CathTree Search for Factor Xa Inhibitor (4.10.410.10)`

I retrieve 136 matches. The list of PDB IDs, in csv format, is in `PDB_CATH_Kunitz.csv`.
Now I parse the file into a pandas Series.

In [None]:
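import pandas as pd

with open("./PDB_CATH_Kunitz.csv") as PDB_CATH_filein:
    # strip() removes the trailing newline that would otherwise stick to the last ID
    PDB_CATH_list = PDB_CATH_filein.read().strip().split(", ")
PDB_CATH_series = pd.Series(PDB_CATH_list)
PDB_CATH_series.head()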

## Retrieving Kunitz structures with PDBfold
I select a prototype structure and scan the PDB for similar structures with PDBfold.

* Single-chain structure at 1 Å: `5PTI`
* Structure in complex with trypsin at 1.8 Å: `3TGI`

I perform a search on PDBfold with default parameters using chain I of `3TGI`, retrieving 609 matches. I save the result summary in `pdbfold_3tgi.dat`.

I perform a search on PDBfold with default parameters using chain A of `5PTI`, retrieving 584 matches. I save the result summary in `pdbfold_5pti.dat`.

I will write a short script to parse the .dat files obtained from PDBfold into a pandas DataFrame. It seems that I cannot download the results in a more useful format. First I need to strip the `PDB` string from all the columns, otherwise the files are difficult to parse.

In [None]:
!sed -i 's/PDB / /g' ./pdbfold_*.dat
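
Since `sed -i` edits the downloaded files in place, a non-destructive variant may be handy if the original downloads should be kept. A minimal sketch doing the same literal replacement in Python and writing to a new file; `strip_pdb_tag` is a hypothetical helper, and the demo line is made up, so the real .dat layout may differ.

```python
import tempfile
from pathlib import Path


def strip_pdb_tag(src_path, dst_path):
    """Write a copy of src_path with every 'PDB ' replaced by a single space."""
    text = Path(src_path).read_text()
    Path(dst_path).write_text(text.replace("PDB ", " "))


# Demo on a made-up line; this is not the actual PDBfold output format
with tempfile.TemporaryDirectory() as tmpdir:
    src = Path(tmpdir) / "demo.dat"
    dst = Path(tmpdir) / "demo_clean.dat"
    src.write_text("1 PDB 3tgi:I PDB 5pti:A\n")
    strip_pdb_tag(src, dst)
    original_text = src.read_text()  # source file is left untouched
    cleaned_text = dst.read_text()

print(repr(cleaned_text))
```

The cleaned copy can then be parsed instead of the original file.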

In [None]:
def get_pdbfold_df(filepath):
    # Skip the first 4 lines; columns are whitespace-separated
    # (raw string avoids the invalid "\s" escape warning)
    pdbfold_df = pd.read_csv(filepath, skiprows=4, sep=r"\s+").set_index('##')
    # Split the colon-separated fields into lists; element 0 is the PDB ID
    pdbfold_df["Query"] = pdbfold_df["Query"].str.split(":")
    pdbfold_df["Target"] = pdbfold_df["Target"].str.split(":")
    return pdbfold_df

pdbfold_3tgi_df = get_pdbfold_df("./pdbfold_3tgi.dat")
pdbfold_5pti_df = get_pdbfold_df("./pdbfold_5pti.dat")

pdbfold_3tgi_df.head()

## Merging all the PDB IDs into a unique list
Now I want a single list of unique target PDB IDs from the two PDBfold files, stored in the list `PDBfold_IDs`. I get 402 IDs.

In [None]:
def get_PDB_list(df):
    PDB_list = []
    for element in df['Target']:
        ID = element[0].upper()
        PDB_list.append(ID)
    return PDB_list

PDBfold_IDs = []
PDBfold_IDs += get_PDB_list(pdbfold_3tgi_df)
PDBfold_IDs += get_PDB_list(pdbfold_5pti_df)
PDBfold_IDs = list(dict.fromkeys(PDBfold_IDs)) # remove duplicate IDs
len(PDBfold_IDs)
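
To see how much the two PDBfold searches overlap, the two ID lists can be compared as sets. A minimal sketch; `compare_id_lists` is a hypothetical helper that is not part of the original notebook, and the demo IDs are only placeholders.

```python
def compare_id_lists(first, second):
    """Return the IDs shared by both lists and those unique to each one."""
    first_set, second_set = set(first), set(second)
    return {
        "shared": sorted(first_set & second_set),
        "only_first": sorted(first_set - second_set),
        "only_second": sorted(second_set - first_set),
    }


# Demo with placeholder IDs (duplicates within a list are collapsed by set())
demo_overlap = compare_id_lists(["5PTI", "3TGI", "1AAP", "5PTI"], ["5PTI", "1BPI"])
print(demo_overlap)
```

In the notebook the call would presumably be `compare_id_lists(get_PDB_list(pdbfold_3tgi_df), get_PDB_list(pdbfold_5pti_df))`.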

Now I merge all the PDB IDs retrieved in the different ways and remove duplicates. The last number printed is the count of unique IDs.

In [None]:
PDB_IDs = []
print(len(PDB_IDs))
PDB_IDs += UniprotPDB_IDs
print(len(PDB_IDs))
PDB_IDs += PDBfold_IDs
print(len(PDB_IDs))
PDB_IDs += PDB_CATH_list
print(len(PDB_IDs))
PDB_IDs = list(dict.fromkeys(PDB_IDs)) # remove duplicate IDs
print(len(PDB_IDs))
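
Removing duplicates with `dict.fromkeys` only works if the same structure is spelled identically in every source. The PDBfold IDs are upper-cased explicitly, but I have not checked the case or stray whitespace in the other lists, so a normalization pass before merging may be safer. A minimal sketch; `normalize_ids` is a hypothetical helper.

```python
def normalize_ids(ids):
    """Strip whitespace, upper-case, drop empty strings and de-duplicate,
    keeping the first-seen order."""
    cleaned = (pdb_id.strip().upper() for pdb_id in ids)
    return list(dict.fromkeys(pdb_id for pdb_id in cleaned if pdb_id))


# Demo: mixed case, trailing whitespace and an empty entry
demo_merged = normalize_ids(["5pti", "5PTI ", "3TGI", "", "1aap\n"])
print(demo_merged)  # ['5PTI', '3TGI', '1AAP']
```

Applied here, it would replace the plain `dict.fromkeys` call: `PDB_IDs = normalize_ids(PDB_IDs)`.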

At this point my set of structures is as complete as possible.
First I write all the PDB IDs to `PDB_IDs.list`.

In [None]:
try:
    with open("./PDB_IDs.list","x") as PDB_IDs_fileout: # x avoids overwriting files
        file_content = "\n".join(PDB_IDs)
        PDB_IDs_fileout.write(file_content)
except FileExistsError:
    pass

Now I map the list to Uniprot ACs using the Uniprot ID mapping tool, so as to get a list of individual proteins.
I get 425 Uniprot IDs, which I save in `Uniprot_IDs_mappedfromPDB.list`.
24 PDB IDs were not mapped; I save them in `PDB_IDs_notmapped.list`.
For now I have decided to ignore them.
The list of Uniprot IDs contains no duplicates.

In [None]:
!sort ./Uniprot_IDs_mappedfromPDB.list|uniq|wc -l

* I will find a way to remove duplicate structures.
* I will find a way to select the right chain in the PDB file.
* I will find a way to select the right domain in the chain.
* I will exclude mutated proteins.
* I will adopt a resolution threshold.