# Random snippets
This notebook contains a series of snippets for reuse in different projects. These include parsing functions and other commonly used routines. I can migrate content from here to a more specialised location when I feel it is needed. Mostly I will use a combination of Python and shell scripting.

## Calculate the Matthews Correlation Coefficient from the confusion matrix
The function below takes a 2x2 confusion matrix and returns the Matthews Correlation Coefficient (MCC), computed as `(TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))`.
The MCC ranges from -1 (total disagreement between prediction and truth) to +1 (perfect prediction), with 0 corresponding to a prediction no better than random.

In [None]:
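import numpy as np


def get_MCC(confusion_mat):
    # Layout read by the indexing below: [[TP, FP], [TN, FN]].
    # Note that this differs from scikit-learn's [[TN, FP], [FN, TP]].
    t_pos = confusion_mat[0, 0]
    t_neg = confusion_mat[1, 0]
    f_pos = confusion_mat[0, 1]
    f_neg = confusion_mat[1, 1]
    above_frac = t_pos * t_neg - f_pos * f_neg
    below_frac = np.sqrt((t_pos + f_pos) * (t_pos + f_neg) * (t_neg + f_pos) * (t_neg + f_neg))
    # The denominator is 0 only when one of the four sums is empty; return 0 in that case.
    # (max(below_frac, 1) would distort the result for matrices holding fractions.)
    if below_frac == 0:
        return 0.0
    return above_frac / below_frac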
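
As a quick sanity check of the formula, the following minimal sketch computes the MCC directly from the four counts. The helper name `mcc_from_counts` and the example counts are assumptions made for illustration, not part of the original snippet; returning 0 for an empty denominator is a common convention, assumed here.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # A zero denominator means one of the four sums is empty;
    # returning 0.0 in that case is an assumed convention.
    return numerator / denominator if denominator else 0.0


# Illustrative counts: 6 true positives, 3 true negatives, 1 false positive, 2 false negatives
example = mcc_from_counts(tp=6, tn=3, fp=1, fn=2)
print(round(example, 3))  # 0.478
```

The formula is symmetric in the two error counts, so swapping `fp` and `fn` gives the same value.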

## Avoid broken pipes in python and jupyter
When a shell pipe is run from Python, or inside Jupyter with `!some command` in a Python cell, a broken pipe error is sometimes raised.
This happens because of how Python handles SIGPIPE, the signal sent to a process that writes to a pipe whose reading end has been closed (e.g. when `head` stops reading because it has reached the requested number of lines).
Python ignores this signal by default, so the failed write surfaces as an error instead.
The following snippet, if executed before the offending line, restores the system default behaviour (`SIG_DFL`) in place of Python's own handling.
Here the `signal` function assigns `SIG_DFL` as the handler for SIGPIPE.

For more: https://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python

In [2]:
# this avoids broken pipes by making the default system handler handle the SIGPIPE call
# see https://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL) 

<Handlers.SIG_DFL: 0>
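
If changing the handler for the whole session feels too broad, the same idea can be scoped to a block of code. The sketch below is one possible way to do that; the context manager name `default_sigpipe` is hypothetical, and it assumes a Unix system and that the code runs in the main thread (a requirement of `signal.signal`).

```python
import signal
from contextlib import contextmanager


@contextmanager
def default_sigpipe():
    """Temporarily set SIGPIPE to the system default, then restore the old handler."""
    previous = signal.signal(signal.SIGPIPE, signal.SIG_DFL)
    try:
        yield
    finally:
        # signal.signal returns None if the old handler was not installed from Python;
        # in that case there is nothing that can be restored from here.
        if previous is not None:
            signal.signal(signal.SIGPIPE, previous)


with default_sigpipe():
    # run the pipe-heavy code here
    print(signal.getsignal(signal.SIGPIPE))
```

Outside the `with` block the handler that was active before is back in place.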

## Parsing the output summary of a PDBeFold search
The sample file for this snippet is in `./sample_files/pdbfold_output.dat`.
The output of PDBeFold is information-rich but somewhat difficult to parse, so I convert it to a pandas dataframe, using whitespace (`\s+`) as the separator.
The PDB IDs of the query and of the target are each preceded by the string 'PDB', separated from the actual ID by a single space. When parsing with a plain `pandas.read_csv()`, this causes 'PDB' and the actual ID to end up in different columns of the dataframe, so the data no longer line up with the column headers. I cannot simply stop treating a single space as a separator, because some of the column headers are themselves separated by a single space. The easiest solution I found is to remove the 'PDB' string with sed before reading the file. Note that `sed -i` edits the file in place.

In [41]:
!sed -i 's/PDB / /g' ./sample_files/pdbfold_output.dat
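
As an alternative to editing the file in place, the same substitution could be done in memory before handing the text to pandas. This is only a sketch: the function name `read_pdbfold_text` is hypothetical, and `sample_text` is a made-up miniature of the real file (four placeholder preamble lines and only a few of the columns), assumed here purely for illustration.

```python
import io

import pandas as pd

# Made-up miniature of a PDBeFold summary: four preamble lines, a header row,
# then rows in which 'PDB ' precedes each ID. The real file has more columns.
sample_text = """\
preamble line 1
preamble line 2
preamble line 3
preamble line 4
## Q-score Query Target
1 1.0000 PDB 3tgi:I PDB 3tgi:I
2 0.9986 PDB 3tgi:I PDB 1f7z:I
"""


def read_pdbfold_text(text):
    # Same substitution as the sed command, applied to the string instead of the file
    cleaned = text.replace("PDB ", " ")
    return pd.read_csv(io.StringIO(cleaned), skiprows=4, sep=r"\s+").set_index("##")


sample_df = read_pdbfold_text(sample_text)
print(sample_df)
```

For a real file, the text would come from `open(filepath).read()`; the file on disk then stays untouched.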

In [42]:
import pandas as pd


def get_pdbfold_df(filepath):
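    with open(filepath) as dat_filein:
        # skip the first four lines; fields are separated by runs of whitespace
        pdbfold_df = pd.read_csv(dat_filein, skiprows=(0, 1, 2, 3), sep=r"\s+").set_index('##')
    # split "ID:chain" strings into [ID, chain] lists
    pdbfold_df["Query"] = pdbfold_df["Query"].str.split(":")
    pdbfold_df["Target"] = pdbfold_df["Target"].str.split(":")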
    return pdbfold_df

The 'Query' and 'Target' columns of the resulting dataframe each hold a `[PDB ID, chain ID]` list, so that PDB IDs and chain IDs can be accessed more easily.

In [44]:
pdbfold_df = get_pdbfold_df('./sample_files/pdbfold_output.dat')
pdbfold_df.head()

| ## | Q-score | P-score | Z-score | RMSD | Nalgn | Nsse | Ngaps | Seq-% | Nmd | Nres-Q | Nsse-Q | Nres-T | Nsse-T | Query | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 16.18 | 11.93 | 0.0 | 56 | 4 | 0 | 1.0 | 0 | 56 | 4 | 56 | 4 | [3tgi, I] | [3tgi, I] |
| 2 | 0.9986 | 13.58 | 10.9 | 0.112 | 56 | 4 | 0 | 1.0 | 0 | 56 | 4 | 56 | 4 | [3tgi, I] | [1f7z, I] |
| 3 | 0.9984 | 13.58 | 10.9 | 0.121 | 56 | 4 | 0 | 1.0 | 0 | 56 | 4 | 56 | 4 | [3tgi, I] | [1fy8, I] |
| 4 | 0.9984 | 13.58 | 10.9 | 0.122 | 56 | 4 | 0 | 1.0 | 0 | 56 | 4 | 56 | 4 | [3tgi, I] | [1ykt, B] |
| 5 | 0.9975 | 13.13 | 10.71 | 0.151 | 56 | 4 | 0 | 1.0 | 0 | 56 | 4 | 56 | 4 | [3tgi, I] | [3tgk, I] |
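
Once the 'Query' and 'Target' columns hold lists, the PDB ID and chain ID can be pulled out element-wise with the `.str[i]` accessor. The sketch below uses a hypothetical two-row stand-in (`demo_df`) built from values in the output above rather than the real file, and the new column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the parsed dataframe, using two rows from the output above
demo_df = pd.DataFrame(
    {"Query": ["3tgi:I", "3tgi:I"], "Target": ["3tgi:I", "1ykt:B"]},
    index=pd.Index([1, 4], name="##"),
)
for col in ("Query", "Target"):
    demo_df[col] = demo_df[col].str.split(":")

# .str[i] indexes into each list element-wise
demo_df["target_pdb"] = demo_df["Target"].str[0]
demo_df["target_chain"] = demo_df["Target"].str[1]
print(demo_df[["target_pdb", "target_chain"]])
```

Keeping the ID and the chain in separate columns like this can make filtering (for example, selecting all hits on a given chain) more direct than working with the lists themselves.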
