In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import mybiotools as mbt
import os, sys

# 2018-07-19 New peaks
I reran Xavi's pipeline that generates the genomic coordinates of the "all_treated", "4HCP", etc. peaks. However, my results differ from the ones he obtained.

Investigating the origin of this discrepancy, I realized that the original data had been reprocessed. At some point the ChIP-seq analysis pipeline was rerun with an additional filter meant to exclude false positives. As a result, the new analysis yields fewer peaks. Frustratingly, there is a second issue: the genomic coordinates of the peaks are not exactly the same. So let's find out whether the old and new peaks really overlap, and which peaks disappeared from the analysis.

First of all, as usual, let's load the data files.

In [None]:
old_datadir = '/mnt/xavi/projects/gvicent/2017-01-23_characterisation_prbs_r5020_titration/tables'
new_datadir = '/home/rcortini/work/CRG/projects/pr_peaks/data/peak_analysis'
peak_ids = ['all_treated','4HCP','3HCP','1HCP']
old = {}
new = {}
for peak_id in peak_ids :
    old_fname = '%s/genomic_coordinates_by_peak_population_%s.bed'%(old_datadir,peak_id)
    new_fname = '%s/genomic_coordinates_by_peak_population_%s.bed'%(new_datadir,peak_id)
    old[peak_id] = mbt.parse_simple_bed(old_fname)
    new[peak_id] = mbt.parse_simple_bed(new_fname)

So now we need to find a way of assessing the overlap between the peaks. I'll write a small function to do this.

In [None]:
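def have_overlap(peak1, peak2) :
    """Return True if the two peaks share at least one base."""
    # peaks on different chromosomes can never overlap
    if peak1['chr'] != peak2['chr'] :
        return False
    a1, b1 = peak1['start'], peak1['end']
    a2, b2 = peak2['start'], peak2['end']
    # lengths of the two peaks
    L1 = np.abs(a1 - b1)
    L2 = np.abs(a2 - b2)
    # length of the region spanned by the two peaks together
    L = np.abs(min(a1, a2) - max(b1, b2))
    # the peaks overlap if together they are longer than the region they span
    return L1+L2 > L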

Let's test it on a pair of peaks.

In [None]:
peak1 = old['all_treated'][4]
peak2 = new['all_treated'][5]
print(peak1)
print(peak2)
print(have_overlap(peak1, peak2))
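
A single pair of real peaks does not say much about the edge cases, so a few hand-made intervals can make the expected behaviour explicit. The sketch below is self-contained: `intervals_overlap` and `make_peak` are hypothetical names, the coordinates are made up, and the comparison-based test should agree with `have_overlap` only under the assumption that every peak has `start < end` in half-open BED coordinates.

```python
def intervals_overlap(peak1, peak2):
    """Comparison-based overlap test; assumes start < end in each peak."""
    if peak1['chr'] != peak2['chr']:
        return False
    # two half-open intervals overlap when each starts before the other ends
    return peak1['start'] < peak2['end'] and peak2['start'] < peak1['end']

def make_peak(chrom, start, end):
    """Build a peak record with the three fields used in this notebook."""
    return {'chr': chrom, 'start': start, 'end': end}

# made-up intervals covering the typical configurations
cases = [
    ('partial overlap', make_peak('chr1', 100, 200), make_peak('chr1', 150, 250)),
    ('nested', make_peak('chr1', 100, 300), make_peak('chr1', 150, 200)),
    ('touching ends', make_peak('chr1', 100, 200), make_peak('chr1', 200, 300)),
    ('disjoint', make_peak('chr1', 100, 200), make_peak('chr1', 300, 400)),
    ('different chromosome', make_peak('chr1', 100, 200), make_peak('chr2', 100, 200)),
]
results = {}
for label, p1, p2 in cases:
    results[label] = intervals_overlap(p1, p2)
    print('%s: %s' % (label, results[label]))
```

Note that peaks that merely touch are not counted as overlapping, which matches the strict inequality used in `have_overlap`.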

Okay, now we are ready to run the test on a large scale.

In [None]:
for peak_id in peak_ids :
    mbt.log_message('find overlap', peak_id)
    for peak1 in old[peak_id] :
        overlap_found = False
        for peak2 in new[peak_id] :
            if have_overlap(peak1, peak2) :
                overlap_found = True
                break
        # report the old peaks that overlap none of the new peaks
        if not overlap_found :
            print(peak1)

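So there are quite a few differences here. Every peak printed above is an old peak that overlaps none of the new peaks: some of them were probably removed by the new filter, while others may have shifted far enough that they no longer overlap.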
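
The nested loop above compares every old peak with every new peak, so its cost grows with the product of the two list sizes. If that ever becomes a bottleneck, one possible alternative is to sort the new peaks by start within each chromosome and use a binary search. The sketch below is not what was run in this notebook: `build_index` and `overlaps_any` are hypothetical helpers, the peaks are made up, and the same strict, half-open notion of overlap is assumed.

```python
from bisect import bisect_left
from collections import defaultdict

def build_index(peaks):
    """Per chromosome: sorted starts and the running maximum of the ends."""
    by_chrom = defaultdict(list)
    for peak in peaks:
        by_chrom[peak['chr']].append((peak['start'], peak['end']))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [start for start, _ in intervals]
        max_ends = []
        running = None
        for _, end in intervals:
            running = end if running is None else max(running, end)
            max_ends.append(running)
        index[chrom] = (starts, max_ends)
    return index

def overlaps_any(peak, index):
    """Return True if the peak overlaps at least one indexed peak."""
    if peak['chr'] not in index:
        return False
    starts, max_ends = index[peak['chr']]
    # number of indexed peaks that start before this peak ends
    n = bisect_left(starts, peak['end'])
    # one of them overlaps if it ends after this peak starts
    return n > 0 and max_ends[n - 1] > peak['start']

# made-up peaks, for illustration only
old_peaks = [{'chr': 'chr1', 'start': 100, 'end': 200},
             {'chr': 'chr1', 'start': 500, 'end': 600},
             {'chr': 'chr2', 'start': 100, 'end': 200}]
new_peaks = [{'chr': 'chr1', 'start': 150, 'end': 250},
             {'chr': 'chr1', 'start': 700, 'end': 800}]

index = build_index(new_peaks)
lost = [peak for peak in old_peaks if not overlaps_any(peak, index)]
for peak in lost:
    print(peak)
```

The running maximum of the ends is what keeps the lookup correct even when the indexed peaks overlap each other or are nested.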