# Random snippets
This notebook collects snippets for reuse across projects, including parsing functions and other commonly used routines. I may migrate content from here to a more specialised location when needed. Most snippets use a combination of Python and shell scripting.

## Parsing the output summary of a PDBeFold search
The sample file for this snippet is in `./sample_files/pdbfold_output.dat`.
The output of PDBeFold is information-rich but somewhat difficult to parse, so I convert it to a pandas dataframe. The columns are separated by runs of whitespace (`\s+`).
The PDB IDs of the query and of the target are each preceded by the string 'PDB', separated from the actual ID by a single space. When parsing with a plain `pandas.read_csv()`, this causes 'PDB' and the actual ID to land in different columns of the dataframe, so the values no longer line up with the column headers. Requiring more than one space as the separator does not help either, because some of the column headers are themselves separated by a single space. The easiest solution I found is to remove the 'PDB' string with sed before reading the file.

In [41]:
!sed -i 's/PDB / /g' ./sample_files/pdbfold_output.dat
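
The sed call above edits the sample file in place. If the original file should stay untouched, the same substitution can be done in memory instead. The following is only a minimal sketch: the helper names `strip_pdb_prefix` and `cleaned_buffer` are hypothetical, and the sample row is made up for illustration, not taken from the real file.

```python
import io


def strip_pdb_prefix(text):
    """Replace every 'PDB ' marker with a single space,
    mirroring the sed expression s/PDB / /g."""
    return text.replace("PDB ", " ")


def cleaned_buffer(filepath):
    """Read `filepath`, strip the 'PDB ' markers in memory and return a
    file-like object; the file on disk is not modified."""
    with open(filepath) as handle:
        return io.StringIO(strip_pdb_prefix(handle.read()))


# Made-up line in the style of a PDBeFold result row (not real data).
sample_row = "1 1.0000 16.2 11.9 0.000 56 4 PDB 3tgi:I PDB 1f7z:I"
cleaned_row = strip_pdb_prefix(sample_row)
print(cleaned_row.split())
```

Since `pandas.read_csv()` accepts file-like objects as well as paths, the buffer returned by `cleaned_buffer` could be passed to it directly.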

In [42]:
import pandas as pd


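def get_pdbfold_df(filepath):
    """Parse a PDBeFold summary (with the 'PDB ' markers already removed) into a dataframe."""
    # Skip the four lines before the column headers; columns are separated by runs of whitespace.
    pdbfold_df = pd.read_csv(filepath, skiprows=4, sep=r"\s+").set_index('##')
    # Split each 'pdbid:chain' string into a [pdbid, chain] list.
    pdbfold_df["Query"] = pdbfold_df["Query"].str.split(":")
    pdbfold_df["Target"] = pdbfold_df["Target"].str.split(":")
    return pdbfold_df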

In the resulting dataframe, each entry of the 'Query' and 'Target' columns is split into a list of the form [PDB ID, chain ID], so that PDB IDs and chain IDs are easier to access.
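
As a rough illustration of how those list entries might be used, the sketch below pulls the PDB ID and the chain ID into separate columns with the `.str` accessor, which also indexes into lists. The toy dataframe and the column names `target_pdb` and `target_chain` are assumptions made for this example only.

```python
import pandas as pd

# Toy frame that mimics the 'Query' and 'Target' columns after the split.
toy_df = pd.DataFrame(
    {
        "Query": [["3tgi", "I"], ["3tgi", "I"]],
        "Target": [["1f7z", "I"], ["1ykt", "B"]],
    }
)

# .str[i] picks the i-th element of each list in the column.
toy_df["target_pdb"] = toy_df["Target"].str[0]
toy_df["target_chain"] = toy_df["Target"].str[1]

print(toy_df[["target_pdb", "target_chain"]])
```

The same indexing should work on the dataframe returned by `get_pdbfold_df`, for example to filter the hits by target chain.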

In [43]:
pdbfold_df = get_pdbfold_df('./sample_files/pdbfold_output.dat')
pdbfold_df

##,Q-score,P-score,Z-score,RMSD,Nalgn,Nsse,Ngaps,Seq-%,Nmd,Nres-Q,Nsse-Q,Nres-T,Nsse-T,Query,Target
1,1.00000,16.18000,11.9300,0.000,56,4,0,1.00000,0,56,4,56,4,"[3tgi, I]","[3tgi, I]"
2,0.99860,13.58000,10.9000,0.112,56,4,0,1.00000,0,56,4,56,4,"[3tgi, I]","[1f7z, I]"
3,0.99840,13.58000,10.9000,0.121,56,4,0,1.00000,0,56,4,56,4,"[3tgi, I]","[1fy8, I]"
4,0.99840,13.58000,10.9000,0.122,56,4,0,1.00000,0,56,4,56,4,"[3tgi, I]","[1ykt, B]"
5,0.99750,13.13000,10.7100,0.151,56,4,0,1.00000,0,56,4,56,4,"[3tgi, I]","[3tgk, I]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,0.03488,0.03441,0.9695,4.611,27,3,2,0.07407,0,56,4,111,4,"[3tgi, I]","[6umg, c]"
606,0.03392,0.27470,1.7650,4.455,26,3,2,0.00000,0,56,4,111,4,"[3tgi, I]","[5mq4, D]"
607,0.02931,0.07331,1.1740,4.677,25,3,2,0.08000,0,56,4,111,4,"[3tgi, I]","[5mq4, B]"
608,0.02929,0.07073,1.1640,4.680,25,3,2,0.08000,0,56,4,111,4,"[3tgi, I]","[5mq4, E]"
