In [2]:
import pandas as pd

Instructions: run this workflow in RStudio at http://gphost03.bcgsc.ca:8787/

http://simon-coetzee.github.io/motifBreakR/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4653394/  
Transcription factor binding sites (TFBS) are typically short DNA sequence motifs that facilitate binding of specific transcription factors via protein–DNA interactions.  
motifbreakR produces a GRanges table of match statistics: the percent of the maximum matrix score for each allele of the SNP, the matrix values for each allele (useful for judging the severity of the disruption), the strand, and whether the disruption is strong or weak.

https://bioconductor.org/packages/devel/bioc/manuals/motifbreakR/man/motifbreakR.pdf  
**seqMatch** the sequence on the 5’ -> 3’ direction of the "+" strand that corresponds to DNA at the position that the TF binding motif was found.  

**pctRef** The score as determined by the scoring method, when the sequence contains the reference SNP allele, normalized to a scale from 0 - 1. If filterp = FALSE, this is the value that is thresholded.  

**pctAlt** The score as determined by the scoring method, when the sequence contains the alternate SNP allele, normalized to a scale from 0 - 1. If filterp = FALSE, this is the value that is thresholded.  

**scoreRef** The score as determined by the scoring method, when the sequence contains the reference SNP allele

**scoreAlt** The score as determined by the scoring method, when the sequence contains the alternate SNP allele

**Refpvalue** p-value for the match for the pctRef score, initially set to NA. See calculatePvalue for more information. This is the significance of the match to the PWM (position weight matrix).  

**Altpvalue** p-value for the match for the pctAlt score, initially set to NA. See calculatePvalue for more information.  

**alleleRef** The proportional frequency of the reference allele at position motifPos in the motif    

**alleleAlt** The proportional frequency of the alternate allele at position motifPos in the motif    

**effect** One of weak, strong, or neutral, indicating the strength of the effect.
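
To make the score fields concrete, here is a minimal sketch using three rows copied from the motifbreakR results table shown later in this notebook. It computes the change in normalized score between alleles. The `pct_diff` and `direction` columns are illustrative names introduced here, not motifbreakR outputs.

```python
import pandas as pd

# Three rows copied from the motifbreakR results table in this notebook.
hits = pd.DataFrame({
    "geneSymbol": ["BCLAF1", "CTCF", "CTCF"],
    "pctRef": [0.771581, 0.942173, 0.985911],
    "pctAlt": [0.910434, 0.855357, 0.702726],
    "effect": ["strong", "weak", "strong"],
})

# Illustrative columns (not produced by motifbreakR):
# pct_diff > 0 means the alternate allele scores higher against the motif,
# pct_diff < 0 means it scores lower.
hits["pct_diff"] = hits["pctAlt"] - hits["pctRef"]
hits["direction"] = hits["pct_diff"].apply(lambda d: "gain" if d > 0 else "loss")

print(hits)
```

In this sample the BCLAF1 match is strengthened by the alternate allele, while both CTCF matches are weakened.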


This is the format of the BED file used as input. If you have rsIDs, a list of rsIDs can be used as input instead.

[szong@szong01 motifbreakR]$ head /projects/da_workspace/software/motifbreakR/motifBreakR/inst/extdata/snps.bed

chr2	12581137	12581138	rs10170896	0	+

chr2	12594017	12594018	chr2:12594018:G:A	0	+

chr3	192388677	192388678	rs13068005	0	+

chr4	122361479	122361480	rs12644995	0	+
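
The coordinate convention in these lines can be sketched in Python. This is a minimal illustration, assuming single-base variants given as a 1-based position; `variant_to_bed_line` is a hypothetical helper written for this note, not part of motifbreakR.

```python
def variant_to_bed_line(chrom, pos, ref, alt, name=None):
    """Build one 6-column BED line for a single-base variant.

    pos is the 1-based position of the variant. BED start is 0-based,
    so the line spans (pos - 1, pos).
    """
    chrom = str(chrom)
    if not chrom.startswith("chr"):
        chrom = f"chr{chrom}"
    if name is None:
        # Custom ID in the same form as the example file: chr:pos:ref:alt
        name = f"{chrom}:{pos}:{ref}:{alt}"
    return "\t".join([chrom, str(pos - 1), str(pos), name, "0", "+"])


line = variant_to_bed_line("2", 12594018, "G", "A")
print(line)
```

The printed line matches the second row of the example file above, including the `chr2:12594018:G:A` identifier.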



### input file

In [22]:
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [37]:
f = '/projects/trans_scratch/validations/workspace/szong/Cervical/hotspots/rainstorm/hotspots_20190426.txt'
df = pd.read_csv(f, sep='\t', usecols=['chr', 'start_x', 'end_x','Reference_Allele', 'Tumor_Seq_Allele2'])
df.head(2)

Unnamed: 0,chr,start_x,end_x,Reference_Allele,Tumor_Seq_Allele2
0,5,1295228,1295229,G,A
1,5,1295250,1295251,G,A


In [86]:
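# Split multi-allelic calls (e.g. "T;G") into one row per alternate allele.
s = df.Tumor_Seq_Allele2.str.split(';', expand=True).stack()
idx = s.index.get_level_values(0)
dfn = df.loc[idx].copy()  # explicit copy avoids SettingWithCopyWarning below
dfn['alt'] = list(s)
dfn.drop('Tumor_Seq_Allele2', axis=1, inplace=True)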

dfn.head(2)

Unnamed: 0,chr,start_x,end_x,Reference_Allele,alt
0,5,1295228,1295229,G,A
1,5,1295250,1295251,G,A
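
An alternative way to get one row per alternate allele is `DataFrame.explode` (available in pandas 0.25 and later). The sketch below runs on a small toy frame with the same column names as the hotspot file; the toy rows are for illustration only.

```python
import pandas as pd

# Toy frame with the same columns as the hotspot file (illustrative rows).
toy = pd.DataFrame({
    "chr": [5, 6],
    "start_x": [1295228, 142706209],
    "end_x": [1295229, 142706210],
    "Reference_Allele": ["G", "C"],
    "Tumor_Seq_Allele2": ["A", "T;G"],
})

# Split "T;G" into a list, then emit one row per list element.
exploded = (
    toy.assign(alt=toy["Tumor_Seq_Allele2"].str.split(";"))
       .explode("alt")
       .drop(columns="Tumor_Seq_Allele2")
)

print(exploded)
```

As with the `stack` approach, rows that came from the same original record keep the same index label, so a multi-allelic site appears as repeated index values.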


### The input file needs a 'chr' prefix on chromosome names. BED coordinates are 0-based and half-open, so the mutation's 1-based position goes in the end (third) column and the start is that position minus 1.


In [87]:
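# Variant ID in the form chr<chrom>:<1-based position>:<ref>:<alt>
dfn['rsid'] = dfn.apply(lambda x: f"chr{x['chr']}:{x['start_x']}:{x['Reference_Allele']}:{x['alt']}", axis=1)
dfn = dfn.iloc[:, [0,1,5]].copy()  # keep chr, start_x, rsid
dfn[4] = 0     # BED score column (placeholder)
dfn[5] = "+"   # BED strand column
dfn['start'] = dfn.start_x - 1  # BED start is 0-based; start_x becomes the BED end
# Column order: chr, start, end (start_x), name, score, strand
dfn = dfn.iloc[:,[0,5,1,2,3,4]].drop_duplicates()
dfn['chr'] = dfn['chr'].apply(lambda x: ''.join(['chr', str(x)]))
dfn
of = '/projects/trans_scratch/validations/workspace/szong/Cervical/hotspots/rainstorm/mt_input.txt'
dfn.to_csv(of, sep='\t', index=False, header=False)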

Unnamed: 0,chr,start,start_x,rsid,4,5
0,chr5,1295227,1295228,chr5:1295228:G:A,0,+
1,chr5,1295249,1295250,chr5:1295250:G:A,0,+
2,chr6,92890259,92890260,chr6:92890260:A:T,0,+
3,chr6,142706205,142706206,chr6:142706206:G:A,0,+
4,chr6,142706208,142706209,chr6:142706209:C:T,0,+
4,chr6,142706208,142706209,chr6:142706209:C:G,0,+
5,chr8,83993736,83993737,chr8:83993737:T:A,0,+
6,chr11,39877968,39877969,chr11:39877969:A:G,0,+
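
Before handing the file to motifbreakR, it can help to check the coordinate convention programmatically. The following sketch parses three of the rows shown above from an in-memory string and verifies each one; the column names and the `checks` frame are illustrative, and the ID checks apply only to custom `chr:pos:ref:alt` identifiers, not to rsIDs.

```python
import io

import pandas as pd

# Three rows copied from the BED output above, as tab-separated text.
bed_text = (
    "chr5\t1295227\t1295228\tchr5:1295228:G:A\t0\t+\n"
    "chr6\t142706208\t142706209\tchr6:142706209:C:T\t0\t+\n"
    "chr6\t142706208\t142706209\tchr6:142706209:C:G\t0\t+\n"
)
bed = pd.read_csv(
    io.StringIO(bed_text),
    sep="\t",
    header=None,
    names=["chrom", "start", "end", "name", "score", "strand"],
)

# Split the custom ID into its chrom / position / ref / alt parts.
id_parts = bed["name"].str.split(":", expand=True)

checks = pd.DataFrame({
    "has_chr_prefix": bed["chrom"].str.startswith("chr"),
    "single_base": (bed["end"] - bed["start"]) == 1,
    "id_chrom_matches": id_parts[0] == bed["chrom"],
    # The 1-based position in the ID should equal the BED end column.
    "id_pos_matches_end": id_parts[1].astype(int) == bed["end"],
})

print(checks.all())
```

To check the real file, replace `io.StringIO(bed_text)` with the path that was written above.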


### this is the R script to run motifBreakR
file.path(R.home("bin"), "R")  
.libPaths( c('/projects/da_workspace/software/motifbreakR_libs', "/gsc/software/linux-x86_64-centos7/R-3.5.1/lib64/R/library", .libPaths()) )  
.libPaths()  
library(motifbreakR)  
library(BSgenome)  
library(SNPlocs.Hsapiens.dbSNP142.GRCh37) # dbSNP142 on GRCh37 (hg19)  
library(BSgenome.Hsapiens.UCSC.hg19)     # hg19 genome  
library(MotifDb)  

snps.bed.file <- '/projects/trans_scratch/validations/workspace/szong/Cervical/hotspots/rainstorm/mt_input.txt'
read.table(snps.bed.file, header = FALSE)

#import the BED file
snps.mb.frombed <- snps.from.file(file = snps.bed.file,
                                  #dbSNP = SNPlocs.Hsapiens.dbSNP142.GRCh37,
                                  search.genome = BSgenome.Hsapiens.UCSC.hg19,
                                  format = "bed")
snps.mb.frombed

data(motifbreakR_motif)
motifbreakR_motif


results <- motifbreakR(snpList = snps.mb.frombed, filterp = TRUE,  
                       pwmList = motifbreakR_motif,  
                       threshold = 1e-4,  
                       method = "ic",  
                       bkg = c(A=0.25, C=0.25, G=0.25, T=0.25),  
                       BPPARAM = BiocParallel::bpparam())

results

f <- '/projects/da_workspace/software/motifbreakR/results.csv'  
write.csv(results, file = f)

### calculate p values
sub_res <- results[names(results) %in% 'chr5:1295228:G:A']  
pvalue <- calculatePvalue(sub_res)  
pvalue

plotMB(results = results, rsid = "6:142706209:142706210:C:G", effect = "strong")


### filter output file

In [17]:
f = '/projects/da_workspace/software/motifbreakR/results_pvalues.csv'
df = pd.read_csv(f)
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()


Unnamed: 0,seqnames,start,end,width,strand,REF,ALT,snpPos,motifPos,geneSymbol,...,seqMatch,pctRef,pctAlt,scoreRef,scoreAlt,Refpvalue,Altpvalue,alleleRef,alleleAlt,effect
0,chr5,1295222,1295231,10,+,G,A,1295228,7,BCLAF1,...,cccggaGggg,0.771581,0.910434,8.09901,9.55651,0.001601,6e-05,0.0,0.971154,strong
1,chr5,1295220,1295235,16,+,G,A,1295228,9,CTCF,...,ggcccggaGggggctg,0.942173,0.855357,5.52348,5.014519,8.4e-05,0.001699,0.612805,0.0,weak
2,chr5,1295219,1295237,19,+,G,A,1295228,10,CTCF,...,gggcccggaGggggctggg,0.898012,0.752305,10.995074,9.211073,1e-05,0.001171,0.979328,0.002839,strong
3,chr5,1295219,1295237,19,+,G,A,1295228,10,CTCF,...,gggcccggaGggggctggg,0.857465,0.724539,10.114097,8.546181,3.9e-05,0.001859,0.991218,0.005488,strong
4,chr5,1295222,1295235,14,+,G,A,1295228,7,CTCF,...,cccggaGggggctg,0.985911,0.702726,5.706734,4.067583,3e-06,0.032342,1.0,0.0,strong


In [18]:
df['pvalue_diff'] = (df.Refpvalue - df.Altpvalue).abs()
df['score_diff'] = (df.scoreRef - df.scoreAlt).abs()
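
Because p-values span several orders of magnitude, an absolute difference is dominated by whichever p-value is larger. One possible complement, sketched here as a suggestion rather than as part of the original analysis, is a log10 ratio of the two p-values. The rows are copied from the results tables in this notebook; `log10_p_ratio` and `abs_log10_p_ratio` are illustrative column names.

```python
import numpy as np
import pandas as pd

# Rows copied from the results tables in this notebook.
pv = pd.DataFrame({
    "geneSymbol": ["POU2F2", "EBF1", "BCLAF1"],
    "Refpvalue": [7.9e-05, 8.3e-05, 0.001601],
    "Altpvalue": [0.092089, 0.051075, 6e-05],
})

# Positive: the alternate allele is a less significant match (motif weakened).
# Negative: the alternate allele is a more significant match (motif strengthened).
pv["log10_p_ratio"] = np.log10(pv["Altpvalue"] / pv["Refpvalue"])
pv["abs_log10_p_ratio"] = pv["log10_p_ratio"].abs()

pv = pv.sort_values("abs_log10_p_ratio", ascending=False)
print(pv)
```

The sign keeps the direction of the change, which `pvalue_diff` discards after taking the absolute value.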

In [12]:
df.shape

(232, 26)

In [26]:
df1 = df.sort_values(['pvalue_diff', 'score_diff'], ascending=False)

In [27]:
of = '/projects/da_workspace/software/motifbreakR/motifbreakr_results_pvalues_20190528.csv'
df1.to_csv(of, sep='\t', index=False)

In [25]:
df[df.snpPos == 1295228].sort_values(['pvalue_diff', 'score_diff'], ascending=False).head(2)

Unnamed: 0,seqnames,start,end,width,strand,REF,ALT,snpPos,motifPos,geneSymbol,...,pctAlt,scoreRef,scoreAlt,Refpvalue,Altpvalue,alleleRef,alleleAlt,effect,pvalue_diff,score_diff
46,chr5,1295225,1295237,13,+,G,A,1295228,4,POU2F2,...,0.643708,4.886466,3.247314,7.9e-05,0.092089,1.0,0.0,strong,0.092009,1.639152
7,chr5,1295223,1295235,13,+,G,A,1295228,6,EBF1,...,0.682239,5.717843,4.078691,8.3e-05,0.051075,1.0,0.0,strong,0.050993,1.639152


### transcription factor motif databases used by motifBreakR

HOMER, including JASPAR
