# Random snippets
This notebook collects snippets for reuse across different projects, including parsing functions and other commonly used routines. I can migrate content from here to a more specialised location when I feel it is needed. Mostly I will use a combination of Python and shell scripting.

## Parsing the output summary of a PDBeFold search
The sample file for this snippet is in `./sample_files/pdbfold_output.dat`.
The output of PDBeFold is information-rich but somewhat difficult to parse, so I convert it to a pandas dataframe. Columns in the file are separated by whitespace (`\s+`).
The PDB IDs of the query and of the target are each preceded by the string 'PDB', separated from the actual ID by a single space. When parsing with a plain `pandas.read_csv()`, this causes 'PDB' and the actual ID to land in different columns of the dataframe, so the data no longer match the column headers. If I stop treating a single space as a separator, I cannot parse the column headers correctly, since some of them are separated by only a single space. The easiest solution I found is to remove the 'PDB' string with sed before reading the file.

In [1]:
!sed -i 's/PDB / /g' ./sample_files/pdbfold_output.dat


In [3]:
import pandas as pd


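def get_pdbfold_df(filepath):
    # Skip the four lines above the column headers; columns are whitespace-separated.
    # The raw string keeps "\s" from being treated as an invalid escape sequence.
    with open(filepath) as dat_filein:
        pdbfold_df = pd.read_csv(dat_filein, skiprows=4, sep=r"\s+")
    return pdbfold_df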
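
If modifying the file in place with sed is not desirable, the same cleanup can be done in memory before handing the text to pandas. The following is only a sketch: `read_pdbfold_text` is a hypothetical helper, and `SAMPLE` is an invented stand-in with made-up preamble lines, column names, and IDs, not real PDBeFold output.

```python
import io

import pandas as pd

# Invented sample (not real PDBeFold output): four preamble lines,
# then a header row and two data rows with the 'PDB ' prefix.
SAMPLE = """\
preamble line 1
preamble line 2
preamble line 3
preamble line 4
Q-score RMSD Query Target
0.95 1.20 PDB 1abc:A PDB 2xyz:B
0.80 2.10 PDB 1abc:A PDB 3def:C
"""


def read_pdbfold_text(text):
    # Hypothetical helper: drop every 'PDB ' prefix, like the sed call does,
    # then parse the cleaned text without touching the file on disk.
    cleaned = text.replace("PDB ", "")
    return pd.read_csv(io.StringIO(cleaned), skiprows=4, sep=r"\s+")


sample_df = read_pdbfold_text(SAMPLE)
```

For a real file, the text could be read with `open(filepath).read()` and passed to the helper. As with the sed command, the replacement is global, so it would also alter any other occurrence of 'PDB ' in the file.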

Sometimes I want easy access to the PDB ID and the chain name of a target. This function splits each entry of the 'Target' column into a list containing the PDB ID and the chain. It assigns the whole column in one step, which avoids the chained-assignment warning that pandas raises for element-by-element updates.

In [3]:
def separate_chain_ID(pdbfold_df):
    # Split every 'ID:chain' entry on the colon, for the whole column at once.
    pdbfold_df["Target"] = pdbfold_df["Target"].str.split(":")
    return pdbfold_df
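
When separate columns are more convenient than a list in each cell, the split can be expanded into two new columns instead. This is a minimal sketch: the function name `add_target_columns` and the column names `Target_PDB` and `Target_chain` are my own choices for illustration, and the IDs in `demo_df` are invented.

```python
import pandas as pd


def add_target_columns(pdbfold_df):
    # Work on a copy so the caller's dataframe is left unchanged.
    out = pdbfold_df.copy()
    # expand=True returns a dataframe with one column per piece.
    # Assumes every entry has the form 'ID:chain'.
    parts = out["Target"].str.split(":", n=1, expand=True)
    out["Target_PDB"] = parts[0]
    out["Target_chain"] = parts[1]
    return out


# Invented example values, only to show the call.
demo_df = pd.DataFrame({"Target": ["2xyz:B", "3def:C"]})
result = add_target_columns(demo_df)
```

Keeping the original 'Target' column alongside the two new ones makes it easy to filter by chain or by PDB ID without losing the original identifier.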