## task:
we need to map each peak to one chunk in the Hi-C bed file, and then check whether the other chunk in contact with it overlaps the promoter of any gene.

* SNP to peak definition:  
the SNP falls within the peak region extended by +/- 250bp

* promoter definition:  
5kb upstream of gene TSS 

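* in contact definition:  
each row in the Hi-C file defines one pair of chunks that are in contact.

* peak-HiC and promoter-HiC contact:  
a peak is assigned to a Hi-C chunk when the overlap covers at least 20% of the peak size; a promoter is assigned to a Hi-C chunk when the overlap covers at least 20% of the promoter size.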
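
The 20% rule above can be sketched in plain Python. This is only an illustration of the criterion, not the code used in this notebook (the notebook relies on the `f` and `F` options of bedtools intersect); the function names and the example coordinates are made up.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Fraction of interval A (half-open, BED-style) that is covered by interval B."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return overlap / (a_end - a_start)


def assigned_to_chunk(feature, chunk, min_frac=0.2):
    """True when at least min_frac of the feature (peak or promoter) lies in the Hi-C chunk."""
    return overlap_fraction(feature[0], feature[1], chunk[0], chunk[1]) >= min_frac


# hypothetical coordinates, for illustration only
chunk = (50_000, 100_000)
peak_mostly_inside = (49_900, 50_900)   # 900 of 1000 bp inside the chunk
peak_barely_inside = (49_000, 50_100)   # 100 of 1100 bp inside the chunk
print(assigned_to_chunk(peak_mostly_inside, chunk))
print(assigned_to_chunk(peak_barely_inside, chunk))
```

The fraction is always taken relative to the feature (peak or promoter), never relative to the much larger Hi-C chunk.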

## format:
* Hi-C files  

    * [PsychEncode data](https://science.sciencemag.org/content/362/6420/eaat4311), obtained from Siwei, in hg19. The Hi-C bed files are named XXX.50000_blocks.bed; each number A in the 2nd/3rd column represents a fragment that spans from A to A+50000.
    * [Nature 2016](https://www.nature.com/articles/nature19847)
    * [NG 2019](http://dx.doi.org/10.1038/s41588-019-0472-1)
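
As a rough sketch of how one row of a `XXX.50000_blocks.bed` file can be read, the function below expands a row into two mirrored contact records, so that either end can be matched against a peak. It assumes the first column holds the chromosome name without the `chr` prefix (as the awk conversion later in this notebook does); the function name and the example row are hypothetical.

```python
def blocks_row_to_contacts(chrom, a, b, size=50_000):
    """Expand one blocks row into two records (chr, start1, end1, start2, end2).

    a and b are the fragment starts from the 2nd/3rd column; each fragment
    spans from its start to start + size. The second record swaps the two ends.
    """
    chrom = f"chr{chrom}"
    end1 = (a, a + size)
    end2 = (b, b + size)
    return [(chrom, *end1, *end2), (chrom, *end2, *end1)]


# hypothetical row, for illustration only
for record in blocks_row_to_contacts("1", 100_000, 250_000):
    print("\t".join(map(str, record)))
```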


* ASoC SNP file  
in hg38.

* peak file  
in hg38. CN means glutamatergic neurons, and I guess we should use Hi-C files starting with 'Neu';
NSC means neural progenitor cells, and I guess we should use Hi-C files starting with 'NPC'.

* gene anno file  
in hg19.

## file source
`genecode.v29lift37.genes` from Zhongshan.  
the rest from Yifan.

In [1]:
cd /project2/xinhe/simingz/neuron_atac_seq/HiC

/project2/xinhe/simingz/neuron_atac_seq/HiC


In [1]:
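# promoter annotation file
# keep the original, then retain only rows on chr* with a positive start
!mv genecode.v29lift37.genes genecode.v29lift37.genes.ori
!awk -F'\t' '$2 >0 && $1 ~ /chr/ {print $0}' genecode.v29lift37.genes.ori > genecode.v29lift37.genes
# promoter = 3000 bp upstream of the TSS, strand-aware; the start is clamped at 0 so no negative coordinates are written
!awk 'BEGIN{FS="\t";OFS="\t"}{if($4=="+"){s=$2-3000; if(s<0){s=0}; print $1,s,$2,$5}else{print $1,$3,$3+3000,$5}}'  genecode.v29lift37.genes > genecode.v29lift37.genes.promoter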
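
The strand logic of the awk command above can be written out in Python as a sketch. The window defaults to 3000 bp to match the command (the task list mentions 5kb, so it is left as a parameter here); clamping the start at 0 is a safeguard assumed for this example, and the function name and coordinates are hypothetical.

```python
def promoter_window(chrom, start, end, strand, upstream=3000):
    """Return (chrom, start, end) of the region upstream of the TSS.

    On the + strand the TSS is the gene start, so the window ends there.
    On the - strand the TSS is the gene end, so the window begins there.
    """
    if strand == "+":
        return chrom, max(0, start - upstream), start
    return chrom, end, end + upstream


# hypothetical gene coordinates, for illustration only
print(promoter_window("chr1", 10_000, 20_000, "+"))
print(promoter_window("chr1", 10_000, 20_000, "-"))
```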

In [3]:
# HiC file to canonical bed file and duplicate each entry by switching order of the two ends
!awk -F'\t' '{print "chr"$1"\t"$2"\t"$2+50000"\t"$3"\t"$3+50000"\nchr"$1"\t"$3"\t"$3+50000"\t"$2"\t"$2+50000}' NPC.50000_blocks.bed > NPC.50000_blocks.converted.bed
!awk -F'\t' '{print "chr"$1"\t"$2"\t"$2+50000"\t"$3"\t"$3+50000"\nchr"$1"\t"$3"\t"$3+50000"\t"$2"\t"$2+50000}' Neu.50000_blocks.bed > Neu.50000_blocks.converted.bed
!awk -F'\t' '{print "chr"$1"\t"$2"\t"$2+100000"\t"$3"\t"$3+100000"\nchr"$1"\t"$3"\t"$3+100000"\t"$2"\t"$2+100000}' NPC.100000_blocks.bed > NPC.100000_blocks.converted.bed
!awk -F'\t' '{print "chr"$1"\t"$2"\t"$2+100000"\t"$3"\t"$3+100000"\nchr"$1"\t"$3"\t"$3+100000"\t"$2"\t"$2+100000}' Neu.100000_blocks.bed > Neu.100000_blocks.converted.bed

In [4]:
import pybedtools
from pyliftover import LiftOver
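def snp2gene(snpf, peakf, hicf, annof, outtag):
    '''Link SNPs to genes: SNP -> peak (hg38) -> Hi-C contact (hg19) -> gene promoter (hg19).
    Writes outtag.snp2peak.bed, outtag.snp2peak.lo.bed and outtag.bed.
    '''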
    # snp to peak
    peakflsize = 250
    snps = pybedtools.BedTool(snpf)
    peaks = pybedtools.BedTool(peakf)
    peaksfl = peaks.slop(genome = 'hg38', b = peakflsize)
    snp2peak = snps.intersect(peaksfl, wb=True)
    snp2peak.saveas(outtag + '.snp2peak.bed')  
    
    # snp2peak liftover
    lo = LiftOver('hg38', 'hg19')
    with open(outtag + '.snp2peak.lo.bed', 'w') as snp2peaklo:
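        for snp in snp2peak:
            # fields 8-10 are the flanked peak (assumes an 8-column SNP file)
            s = lo.convert_coordinate(snp.fields[8], int(snp.fields[9]))
            e = lo.convert_coordinate(snp.fields[8], int(snp.fields[10]))
            # convert_coordinate returns None for an unknown chromosome and [] when nothing maps
            if s and e:
                # keep peaks whose two ends land on the same chromosome, in the original order
                if s[0][0] == e[0][0] and s[0][1] < e[0][1]:
                    peaklo = [s[0][0], str(s[0][1]), str(e[0][1])]
                    snp2peaklo.write('\t'.join(peaklo + list(snp.fields)) + '\n')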
    
    # annotate with HiC contact
    hics = pybedtools.BedTool(hicf)
    snppeak = pybedtools.BedTool(outtag + '.snp2peak.lo.bed')
    snppeakhic = snppeak.intersect(hics, wao=True, f=0.2)
    
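    # annotate the other end with promoter info
    # put the other Hi-C end (chr, start, end) first; rows without a Hi-C overlap (field 19 == '0') are dropped
    # (field indices assume 8 SNP columns and 3 peak columns)
    snppeakhic1 = pybedtools.BedTool([[i[14]] + i[17:19] + i[0:17] + [i[19]] for i in snppeakhic if i[19]!='0'])
    annos = pybedtools.BedTool(annof)
    snppeakhic2 = snppeakhic1.intersect(annos, wao=True, F=0.2)
    # keep rows whose other end overlaps a promoter (field 25 is the overlap in bp, field 24 the gene ID)
    out = pybedtools.BedTool([i for i in snppeakhic2 if i[25] != '0'])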
    
    # write output
    with open(outtag + '.bed', 'w') as outfile:
        # outfile.write('\t'.join(['SNP.chr','SNP.start','SNP.end','SNP.label','SNP','SNP','SNP','SNP', 'peak.chr', 'peak.start', 'peak.end', 'geneID']) +'\n')
        for outr in out:
            outfile.write('\t'.join(outr[6:17] + [outr[24]]) + '\n')
    return out
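
The liftover filter inside `snp2gene` can be isolated as a small helper, sketched below. It takes what `LiftOver.convert_coordinate` returned for the two peak ends (a list of `(chromosome, position, strand, score)` tuples, an empty list, or `None`) and keeps the peak only if both ends map to the same chromosome in the original order. The helper name is hypothetical and the tuples in the example are made-up values, not real liftover results.

```python
def lifted_interval(start_hits, end_hits):
    """Return (chrom, start, end) built from the first hit of each end, or None."""
    # None (unknown chromosome) and [] (no mapping) are both treated as a failed lift
    if not start_hits or not end_hits:
        return None
    s_chrom, s_pos = start_hits[0][0], start_hits[0][1]
    e_chrom, e_pos = end_hits[0][0], end_hits[0][1]
    # reject peaks split across chromosomes or flipped by the lift
    if s_chrom != e_chrom or s_pos >= e_pos:
        return None
    return s_chrom, s_pos, e_pos


# made-up hits, for illustration only
print(lifted_interval([("chr1", 1_000, "+", 1)], [("chr1", 1_500, "+", 1)]))
print(lifted_interval([("chr1", 1_000, "+", 1)], None))
```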

In [5]:
snpf = 'NSC_20_ASoC_FDR0.05.bed'
peakf = 'NSC_merged_all_20_peaks.hotspot.bed'
hicf = 'NPC.50000_blocks.converted.bed'
annof = 'genecode.v29lift37.genes.promoter'
outtag = 'NSC.hic'
out = snp2gene(snpf, peakf, hicf, annof, outtag)

In [7]:
snpf = 'CN_20_ASoC_FDR0.05.bed'
peakf = 'CN_merged_all_20_peaks.hotspot.bed'
hicf = 'Neu.50000_blocks.converted.bed'
annof = 'genecode.v29lift37.genes.promoter'
outtag = 'CN.hic'
out = snp2gene(snpf, peakf, hicf, annof, outtag)
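
A possible follow-up is to collapse the output file into one gene set per SNP. The sketch below assumes the column layout given in the commented-out header inside `snp2gene` (SNP label in the 4th column, gene ID in the last column); if the SNP file has a different layout, `snp_col` needs to change. The function name is hypothetical.

```python
import csv
from collections import defaultdict


def snp_to_genes(path, snp_col=3, gene_col=-1):
    """Read a tab-separated output file and return {SNP label: set of gene IDs}."""
    genes = defaultdict(set)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row:  # skip empty lines
                genes[row[snp_col]].add(row[gene_col])
    return dict(genes)


# usage, once snp2gene has written its output:
# nsc_links = snp_to_genes("NSC.hic.bed")
```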