# Introduction

The purpose of this notebook is to generate combinatorial perturbations of curated motifs using the `ZDTBCG` model. We generate perturbations across both the nucleosome range (200 bp) and the entire window (1000 bp). Combinatorial perturbations are perturbations of motifs in distinct combinations within each given enhancer/peak. We perturb up to 2 motifs at a time, in any combination. From this, we can extract pairwise motif-motif synergy and single-motif effects on different modes of binding.
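To make "up to 2 motifs at a time, in any combination" concrete, here is a minimal sketch that enumerates the mutant combinations for one region. The function name `enumerate_perturbations` and the motif labels are hypothetical, and this is not necessarily how `bpnet_generate_perturbations.py` builds its combinations internally.

```python
from itertools import combinations


def enumerate_perturbations(motif_ids, comb_max=2):
    """Return every group of motifs to mutate together, from 1 up to comb_max motifs."""
    combos = []
    for k in range(1, comb_max + 1):
        combos.extend(combinations(motif_ids, k))
    return combos


# Hypothetical region containing three mapped motifs.
region_motifs = ["Zld_1", "Dl_7", "Twi_12"]
for combo in enumerate_perturbations(region_motifs):
    print(combo)
```

A region with n motifs yields n single mutants plus n(n-1)/2 double mutants, so the number of predictions grows quadratically with motif density.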

# Computational setup

In [1]:
import warnings
warnings.filterwarnings("ignore")
from tensorflow.python.util import deprecation
deprecation._PRINT_DEPRECATION_WARNINGS = False

#Packages
import os
import sys
import pandas as pd
import numpy as np
from pybedtools import BedTool
from bpnet.cli.contrib import bpnet_contrib
from bpnet.cli.modisco import cwm_scan

# Settings
os.chdir('/l/Zeitlinger/ZeitlingerLab/Manuscripts/Zelda_and_Nucleosomes/Analysis/analysis/')
pd.set_option('display.max_columns', 100)

# Pre-existing variables
fasta_file = '../data/indexes/bowtie2/dm6.fa'
model_dir = 'bpnet/models/optimized_model/fold1'
modisco_dir = 'bpnet/modisco/fold1/'
curated_motifs = 'bed/mapped_motifs/all_instances_curated_0based.bed'
curated_regions = 'bed/mapped_motifs/all_grouped_regions_0based.bed'
tasks = ['Zld', 'Dl', 'Twi', 'Bcd', 'Cad', 'GAF']

# Dependent variables
perturb_output_dir = 'tsv/perturbs/binding/genomic'

Using TensorFlow backend.


In [2]:
!mkdir -p {perturb_output_dir}

# Collect mapped motifs together

Here, we import the motifs that were curated during `3_` and add the columns that the `bpnet_generate_perturbations` script requires. The script expects a 0-based coordinate .tsv file with the following columns: `pattern_name`, `example_idx`, `example_chrom`, `pattern_start`, `pattern_end`, `pattern_len`.

In [3]:
#Import set of motifs
motifs_df = BedTool(curated_motifs).to_dataframe()
motifs_df.columns = ['example_chrom','start','end','name','score','strand']

#Separate motif name
motifs_df['pattern_len'] = motifs_df['end'] - motifs_df['start']
motifs_df['pattern_name'] = [n.split('_')[0] for n in motifs_df.name]
motifs_df['motif_id'] = [n.split('_')[1] for n in motifs_df.name]
motifs_df['example_idx'] = [n.split('_')[2] for n in motifs_df.name]
motifs_df['region_id'] = [n.split('_')[2] for n in motifs_df.name]
motifs_df.shape

2022-08-04 10:21:22,889 [INFO] Note: NumExpr detected 64 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-08-04 10:21:22,892 [INFO] NumExpr defaulting to 8 threads.


(69318, 11)

Match grouped regions to motifs and compute each motif's position within its window to obtain `pattern_start` and `pattern_end`. We do this now because it is far easier to work with 0-based coordinates in Python than in R.
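As a small worked example of this conversion, the sketch below shifts genomic motif coordinates into window coordinates on toy data. The coordinates and motif lengths are made up for illustration; only the column names follow this notebook.

```python
import pandas as pd

# Toy data: one 1000 bp region and two motifs inside it (0-based, half-open).
regions = pd.DataFrame({"region_id": ["0"], "region_start": [5000], "region_end": [6000]})
motifs = pd.DataFrame({
    "region_id": ["0", "0"],
    "start": [5100, 5480],
    "end": [5107, 5490],
})

# Attach each motif's region, then shift genomic coordinates to window coordinates.
merged = motifs.merge(regions, on="region_id", how="left")
merged["pattern_start"] = merged["start"] - merged["region_start"]
merged["pattern_end"] = merged["end"] - merged["region_start"]

# Sanity check: every motif should fall inside its window.
window_len = merged["region_end"] - merged["region_start"]
inside = (merged["pattern_start"] >= 0) & (merged["pattern_end"] <= window_len)
print(merged[["pattern_start", "pattern_end"]])
print("all motifs inside window:", inside.all())
```

Because both coordinate systems are 0-based and half-open, a plain subtraction of `region_start` is enough and no off-by-one correction is needed.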

In [4]:
regions_df = BedTool(curated_regions).to_dataframe()
regions_df.columns = ['region_chrom','region_start','region_end','region_id','region_score','region_strand']
regions_df['region_id']=regions_df['region_id'].astype(str)

In [5]:
motifs_df = motifs_df.merge(regions_df[['region_start','region_end','region_id']], on = 'region_id', how = 'left')
motifs_df['pattern_start'] = motifs_df['start']-motifs_df['region_start']
motifs_df['pattern_end'] = motifs_df['end']-motifs_df['region_start']

#Save motifs
motifs_df.to_csv(f'{perturb_output_dir}/all_instances_curated_formatted_0based.tsv.gz', sep = '\t', index = False)
print(motifs_df.pattern_name.value_counts())
print(motifs_df.region_id.value_counts().value_counts())

Twi    23825
Bcd    17344
Cad    13491
GAF     7429
Zld     5264
Dl      1965
Name: pattern_name, dtype: int64
1     28097
2      9921
3      3436
4      1361
5       543
6       224
7       100
8        51
9        23
10       11
12        6
11        4
14        1
13        1
Name: region_id, dtype: int64
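
Before passing the saved table to the perturbation script, a check along the following lines could confirm that the required columns are present and internally consistent. This is a sketch only: `missing_columns` and `check_motif_table` are hypothetical helpers, and the toy rows are invented for illustration.

```python
import pandas as pd

# Columns listed above as required by the perturbation script.
REQUIRED_COLUMNS = ["pattern_name", "example_idx", "example_chrom",
                    "pattern_start", "pattern_end", "pattern_len"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


def check_motif_table(df):
    """Raise ValueError if columns are missing or motif lengths are inconsistent."""
    missing = missing_columns(df)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    bad = df[(df["pattern_end"] - df["pattern_start"]) != df["pattern_len"]]
    if len(bad) > 0:
        raise ValueError(f"{len(bad)} rows have pattern_end - pattern_start != pattern_len")
    return True


# Toy table with made-up values.
toy = pd.DataFrame({
    "pattern_name": ["Zld", "Twi"],
    "example_idx": ["0", "0"],
    "example_chrom": ["chr2L", "chr2L"],
    "pattern_start": [100, 480],
    "pattern_end": [107, 490],
    "pattern_len": [7, 10],
})
print(check_motif_table(toy))
```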


# Generate contributions to match the curated coordinates

Because the `example_idx` assignments changed, we generate contributions that match the modified coordinates. Otherwise, `ContribFile(original_contrib.h5).get_seq()` would return the wrong sequence for a given `example_idx`.
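The toy sketch below illustrates the mismatch. It assumes, without checking the bpnet source, that `get_seq()` returns a one-hot array of shape (N, L, 4) indexed by `example_idx`; the arrays here are hypothetical stand-ins built from short made-up sequences, not real contribution files.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array in A, C, G, T order."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        arr[i, BASES.index(base)] = 1.0
    return arr


def decode(onehot):
    """Decode an (L, 4) one-hot array back to a DNA string."""
    return "".join(BASES[i] for i in onehot.argmax(axis=1))


# Hypothetical stand-ins for the (N, L, 4) sequence arrays.
old_regions = ["AACAGGTAGTT", "TTGGGAATTCC"]  # order in the original contribution file
new_regions = ["TTGGGAATTCC", "AACAGGTAGTT"]  # order after example_idx changed
old_seqs = np.stack([one_hot(s) for s in old_regions])
new_seqs = np.stack([one_hot(s) for s in new_regions])

# A motif annotated against the new regions.
example_idx, pattern_start, pattern_end = 1, 2, 9
print(decode(new_seqs[example_idx, pattern_start:pattern_end]))  # matching file
print(decode(old_seqs[example_idx, pattern_start:pattern_end]))  # stale file
```

With the matching array the slice recovers the annotated motif, while the stale array returns unrelated sequence from a different region at the same index.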

In [6]:
# Define the output path once and reuse it in the command
contrib_file = 'bpnet/preds/fold1/all_grouped_regions_0based_contrib.h5'
contrib_cmd = f'bpnet contrib --method deeplift --memfrac-gpu 1 --regions {curated_regions} \
{model_dir} {contrib_file}'
print(contrib_cmd)

bpnet contrib --method deeplift --memfrac-gpu 1 --regions bed/mapped_motifs/all_grouped_regions_0based.bed bpnet/models/optimized_model/fold1 bpnet/preds/fold1/all_grouped_regions_0based_contrib.h5


# Generate perturbations

`/n/projects/mw2098/shared_code/bpnet/bpnet_generate_perturbations.py` contains the code required to generate the sum and maximum values for each (1) task, (2) annotated motif, and (3) mutant combination. Pseudocounts of the entire window for each (1) task and (2) mutant combination are also included for further analysis.

To reduce the time needed to generate predictions, you can enable a GPU.

## Generate windowed perturbations

These perturbations consider a window of 50 bp around the center of each motif. This lets us look at the maximum profile height effects across specific windows. This analysis is intended for TFs that bind normally and are localized across motifs. The code is meant to recreate Figure 5 of the BPNet paper (Avsec et al. 2021, Nature Genetics).
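As an illustration of the windowed summary described above, the following sketch computes the sum and maximum of a predicted profile inside a window centered on a motif. The function `window_summary`, the 1D profile layout, and the toy values are assumptions made for this example; the actual script may handle strands, tasks, and window edges differently.

```python
import numpy as np


def window_summary(profile, center, window=50):
    """Return (sum, max) of a 1D profile inside a window centered on `center`.

    The window is clipped to the bounds of the profile.
    """
    half = window // 2
    lo = max(center - half, 0)
    hi = min(center + half, len(profile))
    segment = profile[lo:hi]
    return float(segment.sum()), float(segment.max())


# Hypothetical predicted profile over a 1000 bp window with a footprint near position 300.
profile = np.zeros(1000)
profile[295:305] = [1, 2, 4, 8, 10, 10, 8, 4, 2, 1]

# Motif coordinates within the window (pattern_start, pattern_end).
motif_start, motif_end = 297, 304
center = (motif_start + motif_end) // 2
total, peak = window_summary(profile, center, window=50)
print(total, peak)
```

Comparing these summaries between the wild-type prediction and each mutant combination gives the per-motif effect at that window.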



### Generate across nucleosome range

In [7]:
nuc_perturb_cmd = f'~/anaconda3/envs/bpnet-gpu/bin/python scripts/py/bpnet_generate_perturbations.py \
-d {perturb_output_dir}/all_instances_curated_formatted_0based.tsv.gz \
-m {model_dir} -c {contrib_file} -o {perturb_output_dir}/perturbs_200bp --comb_max 2 -t 16 -n 16 -w 200 -g 0 -x .9'
nuc_perturb_cmd

'~/anaconda3/envs/bpnet-gpu/bin/python scripts/py/bpnet_generate_perturbations.py -d tsv/perturbs/binding/genomic/all_instances_curated_formatted_0based.tsv.gz -m bpnet/models/optimized_model/fold1 -c bpnet/preds/fold1/all_grouped_regions_0based_contrib.h5 -o tsv/perturbs/binding/genomic/perturbs_200bp --comb_max 2 -t 16 -n 16 -w 200 -g 0 -x .9'

### Generate across whole enhancer window

In [8]:
all_perturb_cmd=f'~/anaconda3/envs/bpnet-gpu/bin/python scripts/py/bpnet_generate_perturbations.py \
-d {perturb_output_dir}/all_instances_curated_formatted_0based.tsv.gz \
-m {model_dir} -c {contrib_file} -o {perturb_output_dir}/perturbs_1000bp --comb_max 2 -t 16 -n 16 --use_whole_window -g 0 -x .9'
all_perturb_cmd

'~/anaconda3/envs/bpnet-gpu/bin/python scripts/py/bpnet_generate_perturbations.py -d tsv/perturbs/binding/genomic/all_instances_curated_formatted_0based.tsv.gz -m bpnet/models/optimized_model/fold1 -c bpnet/preds/fold1/all_grouped_regions_0based_contrib.h5 -o tsv/perturbs/binding/genomic/perturbs_1000bp --comb_max 2 -t 16 -n 16 --use_whole_window -g 0 -x .9'

## Wrap commands into an `.sge` script.

In [9]:
basedir = os.getcwd()
sge_header = ['#$ -cwd', '#$ -S /bin/bash', '#$ -N genomic_perturbs', '#$ -pe smp 50', '#$ -l h_rt=24:00:00', '#$ -V']
setup_cmds = ['conda activate /home/mw2098/anaconda3/envs/bpnet-gpu', f'cd {basedir}']
cmds = sge_header + setup_cmds + [contrib_cmd, nuc_perturb_cmd, all_perturb_cmd]
output_cmd_path = 'tmp/ZDTBCG_genomic_perturbs.sge'

# Write one command per line; the context manager closes the file even on error
with open(output_cmd_path, "w") as output_script:
    for cmd in cmds:
        output_script.write(cmd + "\n")

In [10]:
%%script false --no-raise-error
!qsub tmp/ZDTBCG_genomic_perturbs.sge

Analysis of these genomic perturbations will be performed in subsequent notebooks (i.e. `5a_*...`).