## task:
We need to map each peak to one fragment (chunk) in the Hi-C bed file, and then check whether the other fragment in contact with it overlaps the promoter of any gene. The difference from Hi-C.ipynb is that the gene annotation file is provided by Jubao. Also, when we test whether a fragment intersects a promoter, if the fragment is shorter than 5 kb, we extend it equally on both sides so that its total length is 5 kb.

* SNP to peak definition:  
a SNP is assigned to a peak if it falls within the peak region extended by 250 bp on each side

* promoter definition:  
2kb upstream and 1kb downstream of gene TSS 

* in contact definition:  
each row in the Hi-C file defines one pair of fragments in contact.

* peak-HiC and promoter-HiC overlap:  
the overlap between a peak and a Hi-C fragment must cover 20% of the peak size, and the overlap between a promoter and a Hi-C fragment must cover 20% of the promoter size. If a fragment is shorter than 5 kb, we extend it equally on both sides so that its total length is 5 kb.
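
The promoter and overlap definitions above can be sketched in plain Python. This is only an illustration under stated assumptions: intervals are 0-based, half-open BED intervals, "upstream" is taken relative to the gene's strand, the 20% figure is treated as a minimum, and the helper names `promoter_window` and `overlaps_enough` are hypothetical (they are not used elsewhere in this notebook).

```python
def promoter_window(tss, strand, upstream=2000, downstream=1000):
    """Return (start, end) of the promoter around a TSS, clamped at 0."""
    if strand == '+':
        start, end = tss - upstream, tss + downstream
    elif strand == '-':
        # on the minus strand, upstream lies at larger coordinates
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end


def overlaps_enough(feature, fragment, min_frac=0.2):
    """True if the overlap covers at least min_frac of the feature length."""
    f_start, f_end = feature
    g_start, g_end = fragment
    overlap = max(0, min(f_end, g_end) - max(f_start, g_start))
    return overlap >= min_frac * (f_end - f_start)


# Made-up coordinates, for illustration only
promoter = promoter_window(10000, '+')
print(promoter)
print(overlaps_enough(promoter, (10000, 15000)))  # 1000 bp of a 3000 bp promoter
print(overlaps_enough(promoter, (10900, 15900)))  # only 100 bp overlap
```

With bedtools, a comparable filter is the `-f` option of `intersect`, which sets the minimum overlap as a fraction of the A feature.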

## format:
* Hi-C files  

    * [PsychEncode data](https://science.sciencemag.org/content/362/6420/eaat4311), obtained from Siwei, in hg19. The Hi-C bed file is named XXX.50000_blocks.bed; each number A in the 2nd or 3rd column represents a fragment that spans from A to A+50000.
    * [Nature 2016](https://www.nature.com/articles/nature19847)
    * [NG 2019](http://dx.doi.org/10.1038/s41588-019-0472-1), in hg19:  
    `wget https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-019-0472-1/MediaObjects/41588_2019_472_MOESM4_ESM.xlsx`


* ASoC SNP file  
in hg38. from Yifan.

* peak file  
in hg38. CN means glutamatergic neurons, and I guess we should use Hi-C files starting with 'Neu';
NSC means neural progenitor cells, and I guess we should use Hi-C files starting with 'NPC'. from Yifan.

* gene anno file  
in hg19. `GENCODE_V31lift37_Duan.dms` from Jubao.
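
For the PsychEncode `XXX.50000_blocks.bed` files described above, each block start A has to be turned into the interval A to A+50000 before it can be intersected. A minimal sketch, assuming a three-column row with the chromosome in the first column; the function name `block_row_to_contact` and the example values are hypothetical.

```python
def block_row_to_contact(row, block_size=50000):
    """Turn (chrom, A, B) into [chrom, A, A+size, B, B+size]."""
    chrom, a, b = row[0], int(row[1]), int(row[2])
    return [chrom, a, a + block_size, b, b + block_size]


# Made-up row, for illustration only
print(block_row_to_contact(['chr1', '750000', '1200000']))
```

The result has the same five-column layout (chrom, start1, end1, start2, end2) as the converted NG 2019 files used below.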

In [1]:
cd /project2/xinhe/simingz/neuron_atac_seq/HiC

/project2/xinhe/simingz/neuron_atac_seq/HiC


In [107]:
import pybedtools
from pyliftover import LiftOver
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import csv
import subprocess
import os
from functools import reduce
from brokenaxes import brokenaxes
from IPython.display import set_matplotlib_formats
%matplotlib inline
set_matplotlib_formats('svg')

In [67]:
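def snp2gene(snpf, peakf, hicf, annof, outtag):
    '''Link SNPs to gene promoters: SNP -> peak -> Hi-C contact -> promoter.

    snpf, peakf: SNP and peak bed files (hg38)
    hicf: Hi-C contacts (hg19), one pair per row: chrom, start1, end1, start2, end2
    annof: promoter bed file (hg19)
    outtag: prefix for all output files
    The column indices used below assume 8 SNP columns and 3 peak columns.
    '''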
    # snp to peak
    peakflsize = 250
    snps = pybedtools.BedTool(snpf)
    peaks = pybedtools.BedTool(peakf)
    peaksfl = peaks.slop(genome = 'hg38', b = peakflsize)
    snp2peak = snps.intersect(peaksfl, wb=True)
    snp2peak.saveas(outtag + '.snp2peak.bed')  
    
    # snp2peak liftover
    lo = LiftOver('hg38', 'hg19')
    with open(outtag + '.snp2peak.lo.bed', 'w') as snp2peaklo:
        for snp in snp2peak:
            s = lo.convert_coordinate(snp.fields[8], int(snp.fields[9]))
            e = lo.convert_coordinate(snp.fields[8], int(snp.fields[10]))
            # convert_coordinate gives None for an unknown chromosome and [] for an unmapped position
            if s and e:
                if s[0][0] == e[0][0] and s[0][1] < e[0][1]:
                    peaklo = [s[0][0], str(s[0][1]), str(e[0][1])]
                    snp2peaklo.write('\t'.join(peaklo + list(snp)) + '\n')
    
    # annotate with HiC contact
    hics = pybedtools.BedTool(hicf)
    snppeak = pybedtools.BedTool(outtag + '.snp2peak.lo.bed')
    snppeakhic = snppeak.intersect(hics, wao=True)
    snppeakhic.saveas(outtag + '.snp2peak.lo.hic1.bed')
    
    # annotate the other end with promoter info
    snppeakhic1 = pybedtools.BedTool([[i[14]] + i[17:19] + i[0:17] + [i[19]] for i in snppeakhic if i[19]!='0'])
    annos= pybedtools.BedTool(annof)
    snppeakhic2 = snppeakhic1.intersect(annos, wao= True)
    snppeakhic2.saveas(outtag + '.snp2peak.lo.hic2.bed')
    # with wao=True the last field is the overlap in bp; keep only rows that hit a promoter
    out = pybedtools.BedTool([i for i in snppeakhic2 if i.fields[-1] != '0'])
    out2 = list(set(['\t'.join(outr[6:17] + outr[24:27]) for outr in out]))
    
    # write output
    with open(outtag + '.bed', 'w') as outfile:
        # outfile.write('\t'.join(['SNP.chr','SNP.start','SNP.end','SNP.label','SNP','SNP','SNP','SNP', 'peak.chr', 'peak.start', 'peak.end', 'geneID1','geneID2','geneID3']) +'\n')
        for outr in out2:
            outfile.write(outr + '\n')
    return(out)
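
The liftover rule used inside `snp2gene` (keep a peak only if both ends map, land on the same chromosome, and stay in order) can be isolated as a small helper. This is a sketch, not the notebook's code: `lift_interval` is a hypothetical name, and the converter is passed in as a callable so that the logic can be tried without chain files. It is assumed to behave like `LiftOver.convert_coordinate`, returning a list of `(chrom, pos, strand, score)` tuples, an empty list, or `None`.

```python
def lift_interval(convert, chrom, start, end):
    """Lift (chrom, start, end); return None if it cannot be lifted cleanly."""
    s = convert(chrom, start)
    e = convert(chrom, end)
    if not s or not e:                 # None or empty list: nothing mapped
        return None
    s_chrom, s_pos = s[0][:2]          # take the first (best) hit, as in snp2gene
    e_chrom, e_pos = e[0][:2]
    if s_chrom != e_chrom or s_pos >= e_pos:
        return None
    return s_chrom, s_pos, e_pos


# Toy converter standing in for LiftOver('hg38', 'hg19').convert_coordinate;
# the fixed offset of 100 bp is invented for illustration.
def fake_convert(chrom, pos):
    if chrom != 'chr1':
        return None
    return [(chrom, pos - 100, '+', 1)]


print(lift_interval(fake_convert, 'chr1', 1000, 1500))
print(lift_interval(fake_convert, 'chrUn', 1000, 1500))
```

In real use the first argument would be something like `lo.convert_coordinate`, with `lo` created as in the function above.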

In [None]:
# promoter annotation file; unwanted chromosomes need to be removed manually
awk -F'\t' '{print $0}'

## NG 2019

## Further HiC processing

In [13]:
!head NG2019-HiC-hippo.converted.bed

chr1	876092	889423	1206874	1212438
chr1	889424	903640	927395	936954
chr1	903641	927394	936955	942417
chr1	903641	927394	978382	989620
chr1	903641	927394	1206874	1212438
chr1	927395	936954	942418	943676
chr1	927395	936954	943677	957199
chr1	927395	936954	1069046	1083958
chr1	927395	936954	1083959	1091234
chr1	927395	936954	1109733	1122642


In [70]:
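# extend to 5kb
def To5kb(hicf, outf):
    '''Extend Hi-C fragments shorter than 5 kb to a total length of 5 kb.

    Each row of hicf is: chrom, start1, end1, start2, end2.
    A short fragment is extended equally on both sides of its midpoint.
    Note that the new start is not clamped at 0.
    '''
    with open(hicf) as f1, open(outf,'w') as f2:
        csvreader = csv.reader(f1, delimiter = '\t')
        csvwriter = csv.writer(f2, delimiter = '\t', lineterminator=os.linesep)
        for row in csvreader:
            # first fragment of the pair
            if int(row[2]) -int(row[1]) < 5000:
                mid = round(0.5 * (int(row[1]) + int(row[2])))
                row[1] = str(mid - 2500)
                row[2] = str(mid + 2500)
            # second fragment of the pair
            if int(row[4]) -int(row[3]) < 5000:
                mid = round(0.5 * (int(row[3]) + int(row[4])))
                row[3] = str(mid - 2500)
                row[4] = str(mid + 2500)
            csvwriter.writerow(row)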

To5kb('NG2019-HiC-hippo.converted.bed', 'NG2019-HiC-hippo.converted5kb.bed')
To5kb('NG2019-HiC-ex.converted.bed', 'NG2019-HiC-ex.converted5kb.bed')

In [72]:
!head NG2019-HiC-hippo.converted5kb.bed

chr1	876091	889423	1206873	1212438
chr1	889423	903640	927394	936954
chr1	903640	927394	936954	942417
chr1	903640	927394	978381	989620
chr1	903640	927394	1206873	1212438
chr1	927394	936954	940546	945546
chr1	927394	936954	943676	957199
chr1	927394	936954	1069045	1083958
chr1	927394	936954	1083958	1091234
chr1	927394	936954	1109732	1122642
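
To make the extension rule concrete, the sketch below applies the same arithmetic as `To5kb` to single fragments. `expand_to_5kb` is a hypothetical helper, not part of the notebook. The first call assumes the short fragment enters `To5kb` as 942417 to 943676 (start already shifted by one, as the other starts in the output above suggest). Note that Python 3's `round` rounds halves to the nearest even integer, so the midpoint 943046.5 becomes 943046.

```python
def expand_to_5kb(start, end, width=5000):
    """Widen (start, end) around its midpoint if it is shorter than width."""
    if end - start >= width:
        return start, end
    mid = round(0.5 * (start + end))   # halves round to the even neighbour
    return mid - width // 2, mid + width // 2


print(expand_to_5kb(942417, 943676))   # short fragment: gets extended
print(expand_to_5kb(927394, 936954))   # already longer than 5 kb: unchanged
```

The first call gives the interval shown in the sixth row of the output above.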


## Merge HiC and eQTL results

Merge the HiC and eQTL results into a single table in which each row is one ASoC SNP.

In [55]:
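import mygene

def eID2gene(eids):
    '''Convert a comma-separated string of Ensembl gene IDs to gene symbols.

    Version suffixes (the part after '.') are dropped before the query.
    Returns a comma-separated string; if any hit lacks a symbol, returns ''.
    '''
    geneList = [i.split('.')[0] for i in eids.split(',')]
    mg = mygene.MyGeneInfo()
    geneSyms = mg.querymany(geneList , scopes='ensembl.gene', fields='symbol', species='human')
    try:
        gene = [i['symbol'] for i in geneSyms]
    except KeyError:
        # one hit without a symbol discards the whole list
        gene=[]
    geneout = ','.join(gene)
    return geneout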
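
Because the `KeyError` handler in `eID2gene` empties the whole list, one unmatched ID hides the symbols of all other genes in the same row. A possible alternative, sketched here on hand-written records, is to skip only the hits that have no symbol. `symbols_from_hits` is a hypothetical helper, the IDs and symbols are placeholders, and the record shape (a `notfound` flag for misses) is an assumption about what `querymany` returns.

```python
def symbols_from_hits(hits):
    """Collect gene symbols from querymany-style hits, skipping misses."""
    return ','.join(h['symbol'] for h in hits if 'symbol' in h)


# Placeholder records, not real query results
hits = [
    {'query': 'ENSG_A', 'symbol': 'GENE1'},
    {'query': 'ENSG_B', 'notfound': True},
    {'query': 'ENSG_C', 'symbol': 'GENE3'},
]
print(symbols_from_hits(hits))
```

Whether to skip misses or blank the whole entry depends on how the eQTL column will be read downstream; the sketch only shows the first option.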

### CN

In [106]:
snpf = 'CN_20_ASoC_FDR0.05.bed'
peakf = 'CN_merged_all_20_peaks.hotspot.bed'
hicf = 'NG2019-HiC-hippo.converted5kb.bed'
annof = 'GENCODE_V31lift37_Duan.dms.promoter.bed'
outtag = 'CN.hic-ng2019-hippo2'
out = snp2gene(snpf, peakf, hicf, annof, outtag)
hicf = 'NG2019-HiC-ex.converted5kb.bed'
outtag = 'CN.hic-ng2019-ex2'
out = snp2gene(snpf, peakf, hicf, annof, outtag)

In [110]:
snpflo = 'CN_20_ASoC_FDR0.05.bed.hg19'
snps = pybedtools.BedTool(snpflo)
annos= pybedtools.BedTool(annof)
snppro = snps.intersect(annos, wb=True)
snppro.sort().merge(c="4,12,13,14", o="distinct").saveas('CN_20_ASoC_FDR0.05_promotor2.bed.hg19')

<BedTool(CN_20_ASoC_FDR0.05_promotor2.bed.hg19)>

In [50]:
# join
f1 = pd.read_csv(snpf, delimiter= '\t', header= None)
f1snp = f1.iloc[:,0:4]
f1snp.columns = ['chrom', 'snp.start', 'snp.end', 'rsID']

In [52]:
f2 = pd.read_csv('CN_20_ASoC_FDR0.05_promotor2.bed.hg19', delimiter= '\t', header= None)
f2 = f2.iloc[:,3:7]
f2.columns = ['rsID','promoter_transcriptID', 'promoter_geneID', 'promoter_gene_symbol']

In [None]:
eqtlf = '../eQTL/CN_ASoC_500kb_0.05_CMC_eQTLanno.bed'
f3 = pd.read_csv(eqtlf + '.hg38', delimiter= '\t', header= None)
f3.columns = ['chrom', 'snp.end', 'eQTL_geneID']
f3['eQTL_gene_symbol'] = f3['eQTL_geneID'].apply(eID2gene)

In [91]:
f4 = pd.read_csv('CN.hic-ng2019-ex2.bed', delimiter= '\t', header= None)
f4ex1 = f4.groupby([3])[11].apply(','.join).reset_index()
f4ex1.columns = ['rsID','excitatory_HiC_transcriptID']
f4ex2 = f4.groupby([3])[12].apply(','.join).reset_index()
f4ex2.columns = ['rsID','excitatory_HiC_geneID'] 
f4ex3 = f4.groupby([3])[13].apply(','.join).reset_index()
f4ex3.columns = ['rsID','excitatory_HiC_gene_symbol'] 
f4ex12 = pd.merge(f4ex1,f4ex2,left_on=['rsID'], right_on = ['rsID'])
f4ex = pd.merge(f4ex12,f4ex3,left_on=['rsID'], right_on = ['rsID'])

In [95]:
f5 = pd.read_csv('CN.hic-ng2019-hippo2.bed', delimiter= '\t', header= None)
f5hippo1 = f5.groupby([3])[11].apply(','.join).reset_index()
f5hippo1.columns = ['rsID','hippocampus_HiC_transcriptID']
f5hippo2 = f5.groupby([3])[12].apply(','.join).reset_index()
f5hippo2.columns = ['rsID','hippocampus_HiC_geneID'] 
f5hippo3 = f5.groupby([3])[13].apply(','.join).reset_index()
f5hippo3.columns = ['rsID','hippocampus_HiC_gene_symbol'] 
f5hippo12 = pd.merge(f5hippo1,f5hippo2,left_on=['rsID'], right_on = ['rsID'])
f5hippo = pd.merge(f5hippo12,f5hippo3,left_on=['rsID'], right_on = ['rsID'])

In [96]:
f1f2 = pd.merge(f1snp, f2,  how='left', left_on=['rsID'], right_on = ['rsID'])

In [97]:
f1f2f3 = pd.merge(f1f2, f3, how='left', left_on=['chrom', 'snp.end'], right_on = ['chrom','snp.end'])

In [98]:
f_merged = reduce(lambda left,right: pd.merge(left,right,on=['rsID'],
                                            how='left'), [f1f2f3,f4ex,f5hippo])

In [100]:
f_merged.fillna('-').to_csv('CN_20_ASoC_FDR0.05_eQTL_HiC2.txt',sep = '\t',index=False)
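
The per-column `groupby` and `merge` steps above repeat the same pattern for the excitatory and hippocampus files. A compact sketch of that pattern on a toy table is given here; `collapse_by_rsid` is a hypothetical helper, the rows are invented, and the column positions (rsID in column 3, annotation in columns 11 to 13) are taken from the cells above.

```python
from functools import reduce
import pandas as pd


def collapse_by_rsid(df, names, id_col=3, value_cols=(11, 12, 13)):
    """Join the values of each annotation column per rsID with commas."""
    grouped = df.groupby(id_col)
    cols = {name: grouped[col].apply(','.join)
            for name, col in zip(names, value_cols)}
    return pd.DataFrame(cols).rename_axis('rsID').reset_index()


# Toy stand-in for a file such as 'CN.hic-ng2019-ex2.bed' (14 columns)
rows = [['chr1', 0, 1, 'rsA'] + ['.'] * 7 + ['T1', 'G1', 'S1'],
        ['chr1', 0, 1, 'rsA'] + ['.'] * 7 + ['T2', 'G2', 'S2'],
        ['chr2', 5, 6, 'rsB'] + ['.'] * 7 + ['T3', 'G3', 'S3']]
f4_toy = pd.DataFrame(rows)

snps_toy = pd.DataFrame({'rsID': ['rsA', 'rsB', 'rsC']})
ex_toy = collapse_by_rsid(f4_toy, ['excitatory_HiC_transcriptID',
                                   'excitatory_HiC_geneID',
                                   'excitatory_HiC_gene_symbol'])

# Left joins keep every SNP; SNPs without any annotation get '-'
merged_toy = reduce(lambda left, right: pd.merge(left, right, on='rsID', how='left'),
                    [snps_toy, ex_toy]).fillna('-')
print(merged_toy)
```

The left join is what keeps one row per ASoC SNP even when a SNP has no Hi-C target.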

### NSC

In [None]:
snpf = 'NSC_20_ASoC_FDR0.05.bed'
peakf = 'NSC_merged_all_20_peaks.hotspot.bed'
hicf = 'NG2019-HiC-hippo.converted5kb.bed'
annof = 'GENCODE_V31lift37_Duan.dms.promoter.bed'
outtag = 'NSC.hic-ng2019-hippo2'
out = snp2gene(snpf, peakf, hicf, annof, outtag)
hicf = 'NG2019-HiC-ex.converted5kb.bed'
outtag = 'NSC.hic-ng2019-ex2'
out = snp2gene(snpf, peakf, hicf, annof, outtag)

snpflo = 'NSC_20_ASoC_FDR0.05.bed.hg19'
snps = pybedtools.BedTool(snpflo)
annos= pybedtools.BedTool(annof)
snppro = snps.intersect(annos, wb=True)
snppro.sort().merge(c="4,12,13,14", o="distinct").saveas('NSC_20_ASoC_FDR0.05_promotor2.bed.hg19')

# join
f1 = pd.read_csv(snpf, delimiter= '\t', header= None)
f1snp = f1.iloc[:,0:4]
f1snp.columns = ['chrom', 'snp.start', 'snp.end', 'rsID']

f2 = pd.read_csv('NSC_20_ASoC_FDR0.05_promotor2.bed.hg19', delimiter= '\t', header= None)
f2 = f2.iloc[:,3:7]
f2.columns = ['rsID','promoter_transcriptID', 'promoter_geneID', 'promoter_gene_symbol']

eqtlf = '../eQTL/NSC_ASoC_500kb_0.05_CMC_eQTLanno.bed'
f3 = pd.read_csv(eqtlf + '.hg38', delimiter= '\t', header= None)
f3.columns = ['chrom', 'snp.end', 'eQTL_geneID']
f3['eQTL_gene_symbol'] = f3['eQTL_geneID'].apply(eID2gene)

f4 = pd.read_csv('NSC.hic-ng2019-ex2.bed', delimiter= '\t', header= None)
f4ex1 = f4.groupby([3])[11].apply(','.join).reset_index()
f4ex1.columns = ['rsID','excitatory_HiC_transcriptID']
f4ex2 = f4.groupby([3])[12].apply(','.join).reset_index()
f4ex2.columns = ['rsID','excitatory_HiC_geneID'] 
f4ex3 = f4.groupby([3])[13].apply(','.join).reset_index()
f4ex3.columns = ['rsID','excitatory_HiC_gene_symbol'] 
f4ex12 = pd.merge(f4ex1,f4ex2,left_on=['rsID'], right_on = ['rsID'])
f4ex = pd.merge(f4ex12,f4ex3,left_on=['rsID'], right_on = ['rsID'])

f5 = pd.read_csv('NSC.hic-ng2019-hippo2.bed', delimiter= '\t', header= None)
f5hippo1 = f5.groupby([3])[11].apply(','.join).reset_index()
f5hippo1.columns = ['rsID','hippocampus_HiC_transcriptID']
f5hippo2 = f5.groupby([3])[12].apply(','.join).reset_index()
f5hippo2.columns = ['rsID','hippocampus_HiC_geneID'] 
f5hippo3 = f5.groupby([3])[13].apply(','.join).reset_index()
f5hippo3.columns = ['rsID','hippocampus_HiC_gene_symbol'] 
f5hippo12 = pd.merge(f5hippo1,f5hippo2,left_on=['rsID'], right_on = ['rsID'])
f5hippo = pd.merge(f5hippo12,f5hippo3,left_on=['rsID'], right_on = ['rsID'])

f1f2 = pd.merge(f1snp, f2,  how='left', left_on=['rsID'], right_on = ['rsID'])

f1f2f3 = pd.merge(f1f2, f3, how='left', left_on=['chrom', 'snp.end'], right_on = ['chrom','snp.end'])

f_merged = reduce(lambda left,right: pd.merge(left,right,on=['rsID'],
                                            how='left'), [f1f2f3,f4ex,f5hippo])

f_merged.fillna('-').to_csv('NSC_20_ASoC_FDR0.05_eQTL_HiC2.txt',sep = '\t',index=False)

In [111]:
snpf = 'NSC_20_ASoC_FDR0.05.bed'
peakf = 'NSC_merged_all_20_peaks.hotspot.bed'
hicf = 'NG2019-HiC-hippo.converted5kb.bed'
annof = 'GENCODE_V31lift37_Duan.dms.promoter.bed'
outtag = 'NSC.hic-ng2019-hippo2'
out = snp2gene(snpf, peakf, hicf, annof, outtag)
hicf = 'NG2019-HiC-ex.converted5kb.bed'
outtag = 'NSC.hic-ng2019-ex2'
out = snp2gene(snpf, peakf, hicf, annof, outtag)

snpflo = 'NSC_20_ASoC_FDR0.05.bed.hg19'
snps = pybedtools.BedTool(snpflo)
annos= pybedtools.BedTool(annof)
snppro = snps.intersect(annos, wb=True)
snppro.sort().merge(c="4,12,13,14", o="distinct").saveas('NSC_20_ASoC_FDR0.05_promotor2.bed.hg19')



<BedTool(NSC_20_ASoC_FDR0.05_promotor2.bed.hg19)>