### task description
brain eQTL enrichment analysis

* First dataset: BrainSeq Phase II, http://eqtl.brainseq.org/phase2/
`SupplementaryTable15_eQTL.tar.gz`, in hg38 with Gencode v25.
Tables of the significant eQTL associations (FDR < 1%) for DLPFC, HIPPO and the brain region interaction (DLPFC vs HIPPO) at the gene, exon, exon-exon junction and transcript expression levels.  

    * snp: SNP ID.
    * feature_id: expression feature ID. You might also want the EnsemblGeneID column (Ensembl gene ID) or the Symbol one (gene symbol, when available).
    * statistic: eQTL t-statistic computed by MatrixEQTL.
    * pvalue: p-value.
    * FDR: FDR adjusted p-value.
    * beta: eQTL beta coefficient.
    
    * SNP annotation file: `wget https://libd-brainseq2.s3.us-east-2.amazonaws.com/BrainSeqPhaseII_snp_annotation.txt.gz`

* Second dataset, from the [CommonMind Consortium](https://www.nimhgenetics.org/resources/commonmind): https://www.sugarsync.com/pf/D7756315_09685381_064295
`CMC_MSSM-Penn-Pitt_DLPFC_mRNA_eQTL-adjustedSVA.txt`, in hg19


    
* Random SNP sets generated by Kevin Luo: `CN_20_ASoC_FDR0.05_matchedSNP` and `NSC_20_ASoC_FDR0.05_matchedSNP`, in hg19

    * The file named “SNPmatch_100_matchedSNPs.txt.gz” contains a table of input SNPs and their matched SNPs.
    The first column is the input SNP, the second is the number of SNPs in the SNPsnap database that satisfied our thresholds, and the following 100 columns are 100 matched SNPs sampled from those.
    * The file named “snpsnap_input_exist.txt” lists the input SNPs that exist in the database.
    * The file named “unincluded_snps.txt” lists the input SNPs that do not exist in the database.
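
The BrainSeq column layout listed above can be read by column name rather than by position. The following is a minimal sketch, assuming a tab-delimited table with a header row containing the listed columns; `read_eqtl_rows` is a hypothetical helper, and the two sample rows are made up for illustration, not taken from the dataset.

```python
import csv
import io

# Made-up sample in the documented column layout (not real data).
SAMPLE = (
    "snp\tfeature_id\tstatistic\tpvalue\tFDR\tbeta\n"
    "rs0000001\tENSG_EXAMPLE_1\t6.1\t1e-9\t0.001\t0.42\n"
    "rs0000002\tENSG_EXAMPLE_2\t-5.2\t3e-7\t0.008\t-0.31\n"
)

def read_eqtl_rows(handle):
    """Yield one dict per eQTL association, with numeric fields converted to float."""
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("statistic", "pvalue", "FDR", "beta"):
            row[key] = float(row[key])
        yield row

rows = list(read_eqtl_rows(io.StringIO(SAMPLE)))
print(rows[0]["snp"], rows[0]["FDR"])
```

Reading by name keeps the code independent of column order, at the cost of assuming the header spelling matches the list above.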

### prepare input file for SNPsnap (generate random set of SNPs)

In [1]:
cd /home/simingz/neuron_atac_seq/HiC/

/project2/xinhe/simingz/neuron_atac_seq/HiC


In [10]:
!awk '{print $1"\t"$3}' CN_20_ASoC_FDR0.05.bed > ../eQTL/CN_20_ASoC_FDR0.05.bed.chrpos.hg38
!awk '{print $1"\t"$3}' NSC_20_ASoC_FDR0.05.bed > ../eQTL/NSC_20_ASoC_FDR0.05.bed.chrpos.hg38

In [2]:
cd ../eQTL/

/project2/xinhe/simingz/neuron_atac_seq/eQTL


In [8]:
from pyliftover import LiftOver

# hg38 -> hg19 converter. NOTE: pyliftover treats positions as 0-based;
# the BED end coordinate is passed through unchanged below.
lo = LiftOver('hg38', 'hg19')

def lofile(f, outf, lo):
    """Lift a 'chrom<TAB>pos' file and write one 'chrom:pos' per line, without the 'chr' prefix."""
    with open(f) as f1, open(outf, 'w') as outf1:
        for line in f1:
            fields = line.rstrip('\n').split('\t')
            out = lo.convert_coordinate(fields[0], int(fields[1]))
            if out:  # None or an empty list means the position could not be lifted
                chrom = out[0][0][3:] if out[0][0].startswith('chr') else out[0][0]
                outf1.write(chrom + ':' + str(out[0][1]) + '\n')

def lofile2(f, outf, lo):
    """Lift the end coordinate of each BED record, keeping columns 4 onward."""
    with open(f) as f1, open(outf, 'w') as outf1:
        for line in f1:
            fields = line.rstrip('\n').split('\t')
            out = lo.convert_coordinate(fields[0], int(fields[2]))  # end column, unlike lofile above
            if out:
                outf1.write('\t'.join([out[0][0], str(out[0][1] - 1), str(out[0][1])] + fields[3:]) + '\n')


lofile('CN_20_ASoC_FDR0.05.bed.chrpos.hg38', 'CN_20_ASoC_FDR0.05.bed.chrpos.hg19', lo)
lofile('NSC_20_ASoC_FDR0.05.bed.chrpos.hg38', 'NSC_20_ASoC_FDR0.05.bed.chrpos.hg19', lo)
lofile2('/home/simingz/neuron_atac_seq/HiC/CN_20_ASoC_FDR0.05.bed', '/home/simingz/neuron_atac_seq/HiC/CN_20_ASoC_FDR0.05.bed.hg19', lo)
lofile2('/home/simingz/neuron_atac_seq/HiC/NSC_20_ASoC_FDR0.05.bed', '/home/simingz/neuron_atac_seq/HiC/NSC_20_ASoC_FDR0.05.bed.hg19', lo)
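
When a lifted chromosome name such as `chr1` is reduced to `1`, splitting on `'r'` works for the primary chromosomes but truncates any contig name that contains another `r`. The following is a minimal sketch of a stricter formatter; `format_chrpos` is a hypothetical helper, and the hit lists only mimic the list-of-tuples shape returned by `convert_coordinate` instead of calling pyliftover.

```python
def format_chrpos(hits):
    """Return 'chrom:pos' without the 'chr' prefix, or None if there is no hit.

    `hits` mimics the shape returned by convert_coordinate: a list of
    (chrom, pos, strand, score) tuples, an empty list, or None.
    """
    if not hits:
        return None
    chrom, pos = hits[0][0], hits[0][1]
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}:{pos}"

# Primary chromosome: both approaches agree.
print(format_chrpos([("chr1", 12345, "+", 100)]))  # 1:12345

# Contig name with a second 'r': split('r') cuts the name short.
truncated = "chr1_KI270706v1_random".split("r")[1]
print(truncated)  # 1_KI270706v1_
print(format_chrpos([("chr1_KI270706v1_random", 10, "+", 100)]))
```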

### prepare brain eQTL1
SNP IDs starting with `chrX:` are skipped, as are IDs without a `:` separator.

In [76]:
import csv
eQTLf1 = 'BrainSeqPhaseII_eQTL_FDR1perc_DLPFC_gene.txt'
annof = 'BrainSeqPhaseII_snp_annotation.txt'
# Map each SNP ID (column 1) to 'chrom:pos' built from columns 5 and 6 of the annotation file.
annodict = {}
with open(annof, 'r') as anno:
    csv_reader = csv.reader(anno, delimiter = '\t')
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        annodict[row[0]] = row[4] + ':' + row[5]

In [86]:
eqtl1 = []
fail = []
with open(eQTLf1, 'r') as f1:
    csv_reader = csv.reader(f1, delimiter = '\t')
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        parts = row[0].split(':')
        # Skip IDs without a ':' separator and anything on chrX.
        if len(parts) > 1 and parts[0] != 'chrX':
            try:
                int(parts[0])  # numeric chromosome: the ID itself carries chrom:pos
                eqtl1.append('chr' + parts[0] + ':' + parts[1])
            except ValueError:
                # Otherwise look the position up in the SNP annotation.
                try:
                    eqtl1.append(annodict[row[0]])
                except KeyError:
                    fail.append(row[0])

eqtl1 = set(eqtl1)
fail = set(fail)

In [80]:
print('Number of eQTL: ' + str(len(eqtl1)) + '\nNumber failed to include: ' + str(len(fail)))

Number of eQTL: 824893
Number failed to include: 0
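
The branching used in the cell above can be isolated into a single function, which makes each case easy to check on its own. The following is a minimal sketch; `normalize_snp_id` is a hypothetical helper, and the SNP IDs and annotation entry in the example are made up.

```python
def normalize_snp_id(snp_id, annodict):
    """Map an eQTL SNP ID to 'chrN:pos', or return None if the ID is skipped.

    IDs without ':' and IDs whose first field is 'chrX' are skipped. A numeric
    first field is used directly; anything else is looked up in annodict, and
    a KeyError propagates so the caller can record the failure.
    """
    parts = snp_id.split(":")
    if len(parts) < 2 or parts[0] == "chrX":
        return None
    try:
        int(parts[0])
    except ValueError:
        return annodict[snp_id]
    return "chr" + parts[0] + ":" + parts[1]

# Made-up annotation entry and IDs, for illustration only.
demo_anno = {"chr5:100:A:G": "chr5:100"}
print(normalize_snp_id("1:12345:A:G", demo_anno))   # chr1:12345
print(normalize_snp_id("chr5:100:A:G", demo_anno))  # chr5:100
print(normalize_snp_id("rs123", demo_anno))         # None
```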


### prepare brain eQTL2

In [57]:
eQTLf2 = 'CMC_MSSM-Penn-Pitt_DLPFC_mRNA_eQTL-adjustedSVA.txt'

In [91]:
eqtl2 = []
with open(eQTLf2, 'r') as f2:
    csv_reader = csv.reader(f2, delimiter = ' ')
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        try:
            eqtl2.append(row[7])  # column 8 is taken as the SNP key
        except IndexError:
            print(row)  # report rows with fewer than 8 fields

eqtl2 = set(eqtl2)

In [92]:
print('Number of eQTL: ' + str(len(eqtl2)))

Number of eQTL: 2070780


### target SNP1

In [96]:
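snpf1 = 'CN_20_ASoC_FDR0.05.bed.chrpos.hg19'
snp1 = []
with open(snpf1, 'r') as f1:
    csv_reader = csv.reader(f1)
    # NOTE: lofile() as defined above writes no header line, so this call
    # drops the first SNP unless a header was added to the file separately.
    next(csv_reader)
    for row in csv_reader:
        snp1.append('chr' + row[0])  # restore the 'chr' prefix
snp1 = set(snp1)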

In [99]:
print('Number of SNPs: ' + str(len(snp1)))

Number of SNPs: 5610


### target SNP2

In [101]:
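snpf2 = 'NSC_20_ASoC_FDR0.05.bed.chrpos.hg19'
snp2 = []
with open(snpf2, 'r') as f2:
    csv_reader = csv.reader(f2)
    # NOTE: lofile() as defined above writes no header line, so this call
    # drops the first SNP unless a header was added to the file separately.
    next(csv_reader)
    for row in csv_reader:
        snp2.append('chr' + row[0])  # restore the 'chr' prefix
snp2 = set(snp2)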

In [102]:
print('Number of SNPs: ' + str(len(snp2)))

Number of SNPs: 3545


### SNP1 matched random SNPs

In [112]:
snprf1 = 'CN_20_ASoC_FDR0.05_matchedSNP/SNPmatch_100_matchedSNPs.txt'
snpr1 = []
with open(snprf1, 'r') as f1:
    csv_reader = csv.reader(f1, delimiter="\t")
    next(csv_reader)
    for row in csv_reader:
        snpr1.extend(['chr' + i for i in row[2:102]])  # columns 3-102: the 100 matched SNPs
snpr1 = set(snpr1)

In [113]:
print('Number of SNPs: ' + str(len(snpr1)))

Number of SNPs: 422013


### SNP2 matched random SNPs

In [116]:
snprf2 = 'NSC_20_ASoC_FDR0.05_matchedSNP/SNPmatch_100_matchedSNPs.txt'
snpr2 = []
with open(snprf2, 'r') as f2:
    csv_reader = csv.reader(f2, delimiter="\t")
    next(csv_reader)
    for row in csv_reader:
        snpr2.extend(['chr' + i for i in row[2:102]])  # columns 3-102: the 100 matched SNPs
snpr2 = set(snpr2)

In [117]:
print('Number of SNPs: ' + str(len(snpr2)))

Number of SNPs: 280230
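
The matched-SNP table is distributed as `SNPmatch_100_matchedSNPs.txt.gz`, while the cells above read an uncompressed copy. The following is a minimal sketch of reading the gzipped file directly and keeping the matched SNPs grouped by input SNP; `read_matched_snps` is a hypothetical helper, and the demo writes a tiny made-up table with three matched columns instead of 100 (the header names are also made up).

```python
import csv
import gzip
import os
import tempfile

def read_matched_snps(path, n_matched=100):
    """Return {input_snp: ['chr'-prefixed matched SNPs]} from a gzipped table.

    Assumes a header row, the input SNP in column 1, the candidate count in
    column 2, and the matched SNPs in the following n_matched columns.
    """
    matched = {}
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            matched[row[0]] = ["chr" + snp for snp in row[2:2 + n_matched]]
    return matched

# Tiny made-up table (three matched columns) to exercise the function.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_matchedSNPs.txt.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("input_snp\tn_candidates\tm1\tm2\tm3\n")
    out.write("1:100\t250\t1:900\t2:50\t3:70\n")
    out.write("2:200\t80\t2:50\t4:10\t5:20\n")

demo = read_matched_snps(demo_path, n_matched=3)
background = set().union(*demo.values())
print(len(background))  # 5 distinct matched SNPs; '2:50' appears twice
```

Keeping the per-input grouping makes it possible to see how many distinct SNPs the pooled background set actually contains after deduplication.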


### Enrichment analysis

In [120]:
import scipy.stats as stats

In [130]:
# Rows: target SNPs, matched random SNPs; columns: eQTL, not eQTL.
mystat = [[len(snp1 & eqtl1), len(snp1 - eqtl1)], [len(snpr1 & eqtl1), len(snpr1 - eqtl1)]]
odds, pvalue = stats.fisher_exact(mystat)
print('For SNP set1, enrichment in eQTL dataset1 results:')
print('Fraction of SNPs that are eQTLs in SNP set1: ' + str(len(snp1 & eqtl1)/len(snp1)))
print('Fraction of SNPs that are eQTLs in matched random set: ' + str(len(snpr1 & eqtl1)/len(snpr1)))
print('odds ratio: ' + str(odds) + "\np value: "+ str(pvalue))

For SNP set1, enrichment in eQTL dataset1 results:
Fraction of SNPs that are eQTLs in SNP set1: 0.14367201426024956
Fraction of SNPs that are eQTLs in matched random set: 0.1066933957010803
odds ratio: 1.4047370927845446
p value: 1.6877869556860068e-17


In [142]:
# Rows: target SNPs, matched random SNPs; columns: eQTL, not eQTL.
mystat = [[len(snp2 & eqtl1), len(snp2 - eqtl1)], [len(snpr2 & eqtl1), len(snpr2 - eqtl1)]]
odds, pvalue = stats.fisher_exact(mystat)
print('For SNP set2, enrichment in eQTL dataset1 results:')
print('Fraction of SNPs that are eQTLs in SNP set2: ' + str(len(snp2 & eqtl1)/len(snp2)))
print('Fraction of SNPs that are eQTLs in matched random set: ' + str(len(snpr2 & eqtl1)/len(snpr2)))
print('odds ratio: ' + str(odds) + "\np value: "+ str(pvalue))

For SNP set2, enrichment in eQTL dataset1 results:
Fraction of SNPs that are eQTLs in SNP set2: 0.12609308885754583
Fraction of SNPs that are eQTLs in matched random set: 0.10493523177390002
odds ratio: 1.2307199661101298
p value: 7.190754513783032e-05


In [144]:
# Rows: target SNPs, matched random SNPs; columns: eQTL, not eQTL.
mystat = [[len(snp1 & eqtl2), len(snp1 - eqtl2)], [len(snpr1 & eqtl2), len(snpr1 - eqtl2)]]
odds, pvalue = stats.fisher_exact(mystat)
print('For SNP set1, enrichment in eQTL dataset2 results:')
print('Fraction of SNPs that are eQTLs in SNP set1: ' + str(len(snp1 & eqtl2)/len(snp1)))
print('Fraction of SNPs that are eQTLs in matched random set: ' + str(len(snpr1 & eqtl2)/len(snpr1)))
print('odds ratio: ' + str(odds) + "\np value: "+ str(pvalue))

For SNP set1, enrichment in eQTL dataset2 results:
Fraction of SNPs that are eQTLs in SNP set1: 0.3762923351158645
Fraction of SNPs that are eQTLs in matched random set: 0.28609308244058834
odds ratio: 1.5054922495116099
p value: 2.33319980128235e-47


In [140]:
# Rows: target SNPs, matched random SNPs; columns: eQTL, not eQTL.
mystat = [[len(snp2 & eqtl2), len(snp2 - eqtl2)], [len(snpr2 & eqtl2), len(snpr2 - eqtl2)]]
odds, pvalue = stats.fisher_exact(mystat)
print('For SNP set2, enrichment in eQTL dataset2 results:')
print('Fraction of SNPs that are eQTLs in SNP set2: ' + str(len(snp2 & eqtl2)/len(snp2)))
print('Fraction of SNPs that are eQTLs in matched random set: ' + str(len(snpr2 & eqtl2)/len(snpr2)))
print('odds ratio: ' + str(odds) + "\np value: "+ str(pvalue))

For SNP set2, enrichment in eQTL dataset2 results:
Fraction of SNPs that are eQTLs in SNP set2: 0.34809590973201693
Fraction of SNPs that are eQTLs in matched random set: 0.280391107304714
odds ratio: 1.370400473687971
p value: 2.7802319096376094e-18
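
The four cells above repeat the same computation with different pairs of sets. The following is a minimal sketch of a shared helper that builds the 2×2 table and the sample odds ratio in plain Python; `enrichment_table` is a hypothetical name and the toy sets are made up. The returned table has the same layout as `mystat`, so it could be passed to `stats.fisher_exact` in the same way as in the cells above.

```python
def enrichment_table(target, background, eqtl):
    """Return the 2x2 table [[a, b], [c, d]] and the sample odds ratio.

    a, b: target SNPs that are / are not eQTLs
    c, d: background (matched) SNPs that are / are not eQTLs
    """
    a, b = len(target & eqtl), len(target - eqtl)
    c, d = len(background & eqtl), len(background - eqtl)
    # Sample odds ratio (a/b) / (c/d); undefined when b or c is zero.
    odds = (a * d) / (b * c) if b and c else float("nan")
    return [[a, b], [c, d]], odds

# Toy sets, made up for illustration.
target = {"chr1:1", "chr1:2", "chr1:3", "chr1:4"}
background = {"chr2:1", "chr2:2", "chr2:3", "chr2:4", "chr2:5", "chr2:6"}
eqtl = {"chr1:1", "chr1:2", "chr2:1", "chr2:2"}

table, odds = enrichment_table(target, background, eqtl)
print(table, odds)  # [[2, 2], [2, 4]] 2.0
```

As a check against the reported numbers, the first result above gives (0.1437 / 0.8563) / (0.1067 / 0.8933) ≈ 1.40, which agrees with the printed odds ratio for SNP set1 in eQTL dataset1.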
