# Introduction

The purpose of this notebook is to generate combinatorial perturbations of curated motifs using the `CM_vs_MN_merged_bias` model. We will generate perturbations across the nucleosome (200 bp) and enhancer (500 bp) ranges.

## Note about perturbation limitations

For this initial pass, we will only consider combinations of three or fewer mutations. This keeps the number of perturbations within a manageable range. Deeper combinations of mutations will be considered in the future at a more focused level.
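
To make the size of this limit concrete, the sketch below counts how many mutant combinations a region produces when combinations are capped at three motifs, compared with every non-empty subset. The helper name `n_mutant_combinations` is hypothetical and is not part of the perturbation script; it only illustrates the arithmetic.

```python
from math import comb

def n_mutant_combinations(n_motifs: int, comb_max: int = 3) -> int:
    """Count the non-empty motif subsets of size 1..comb_max (hypothetical helper)."""
    return sum(comb(n_motifs, k) for k in range(1, comb_max + 1))

# Compare the capped count with the count of all non-empty subsets.
for n in (3, 8, 16):
    capped = n_mutant_combinations(n)
    uncapped = 2 ** n - 1
    print(f"{n} motifs: {capped} capped vs {uncapped} uncapped")
```

For a region with 16 motifs, the cap gives 16 + 120 + 560 = 696 combinations instead of 65,535.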

# Computational setup

In [1]:
import warnings
warnings.filterwarnings("ignore")
from tensorflow.python.util import deprecation
deprecation._PRINT_DEPRECATION_WARNINGS = False

#Packages
import os
import sys
import itertools
import pandas as pd
import numpy as np
from pybedtools import BedTool
from bpnet.cli.contrib import bpnet_contrib
from bpnet.cli.modisco import cwm_scan

#Setup
os.chdir('/n/projects/mw2098/publications/2022_maven_ISL1/')
pd.set_option('display.max_columns', 100)
%matplotlib inline

# function to return key for any value 
def get_key(val, my_dict): 
    for key, value in my_dict.items(): 
        if val == value: 
            return key 
    return "key doesn't exist"

Using TensorFlow backend.


In [2]:
#Pre-existing variables
fasta_file = '/n/projects/mw2098/genomes/hg19/hg19.fa'
model_prefix = 'seq_width1000-lr0.001-lambda100-n_dil_layers9-conv_kernel_size7-tconv_kernel_size7-filters64'

tasks = ['I_WT_D6CM','I_WT_S3MN']

# Determined variables
model_dir = f'models/{model_prefix}'
modisco_dir = f'modisco/{model_prefix}'
curated_motifs = 'analysis/bed/mapped_motifs/all_instances_curated_0based.bed'
curated_regions = 'analysis/bed/mapped_motifs/all_grouped_regions_0based.bed'

# Dependent variables
perturb_output_dir = 'analysis/tsv/perturbs/'

In [3]:
!mkdir -p {perturb_output_dir}
!mkdir -p figures/5_collect_genomic_perturbations

# Collect mapped motifs together

Here, we import the motifs that were curated during `3_` and add the columns required by the `scripts/bpnet_generate_seq_perturbations.py` script. The script expects a 0-based coordinate .tsv file with the following columns: `pattern_name`, `example_idx`, `example_chrom`, `pattern_start`, `pattern_end`, `pattern_len`.
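
A small check like the one below can confirm that a table carries the six columns listed above before it is saved. This is a minimal sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the toy row only mimics the layout of the curated motifs.

```python
import pandas as pd

# Columns named in the text above as required by the perturbation script.
REQUIRED_COLUMNS = ["pattern_name", "example_idx", "example_chrom",
                    "pattern_start", "pattern_end", "pattern_len"]

def missing_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]

# Toy single-row table for illustration only.
toy = pd.DataFrame({
    "pattern_name": ["EBF1"],
    "example_idx": ["559"],
    "example_chrom": ["chr1"],
    "pattern_start": [492],
    "pattern_end": [502],
    "pattern_len": [10],
})

print(missing_columns(toy))                              # nothing missing
print(missing_columns(toy.drop(columns="pattern_len")))  # reports the dropped column
```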

In [4]:
#Import set of motifs
motifs_df = BedTool(curated_motifs).to_dataframe()
motifs_df.columns = ['example_chrom','start','end','name','score','strand']

#Motif length from the 0-based BED coordinates
motifs_df['pattern_len'] = motifs_df['end'] - motifs_df['start']

#Separate motif name, formatted as {pattern_name}_{motif_id}_{region_id}
motifs_df['pattern_name'] = [n.split('_')[0] for n in motifs_df.name]
motifs_df['motif_id'] = [n.split('_')[1] for n in motifs_df.name]
#example_idx and region_id both hold the region identifier (as strings)
motifs_df['example_idx'] = [n.split('_')[2] for n in motifs_df.name]
motifs_df['region_id'] = [n.split('_')[2] for n in motifs_df.name]
motifs_df.shape

2022-01-07 11:16:54,719 [INFO] Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2022-01-07 11:16:54,721 [INFO] Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-01-07 11:16:54,722 [INFO] NumExpr defaulting to 8 threads.


(53353, 11)

Match grouped regions to motifs and collect each motif's position within its window to obtain `pattern_start` and `pattern_end`. We do this now because it is far easier to work with 0-based coordinates in Python than in R.
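
As a worked example of this conversion, the sketch below uses the coordinates of the first motif row displayed later in this notebook (region `558`). The toy frames `motifs` and `regions` are stand-ins for the real tables, and the final containment check is an added sanity test rather than part of the original workflow.

```python
import pandas as pd

# Toy rows using the genomic coordinates of region 558.
motifs = pd.DataFrame({"region_id": ["558"],
                       "start": [100629813],
                       "end": [100629851]})
regions = pd.DataFrame({"region_id": ["558"],
                        "region_start": [100629332],
                        "region_end": [100630332]})

# Attach the region bounds, then shift motif coordinates to be region-relative.
merged = motifs.merge(regions, on="region_id", how="left")
merged["pattern_start"] = merged["start"] - merged["region_start"]
merged["pattern_end"] = merged["end"] - merged["region_start"]

# Sanity check: the motif should fall entirely inside its region.
inside = (merged["pattern_start"] >= 0) & (merged["end"] <= merged["region_end"])

print(merged[["pattern_start", "pattern_end"]])
```

Here 100629813 - 100629332 = 481 and 100629851 - 100629332 = 519, which matches the `pattern_start` and `pattern_end` values for that row.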

In [5]:
regions_df = BedTool(curated_regions).to_dataframe()
regions_df.columns = ['region_chrom','region_start','region_end','region_id','region_score','region_strand']
regions_df['region_id']=regions_df['region_id'].astype(str)

In [6]:
motifs_df = motifs_df.merge(regions_df[['region_start','region_end','region_id']], on = 'region_id', how = 'left')
motifs_df['pattern_start'] = motifs_df['start']-motifs_df['region_start']
motifs_df['pattern_end'] = motifs_df['end']-motifs_df['region_start']
motifs_df = motifs_df[motifs_df['example_chrom']!='chrY']

In [7]:
#Save motifs
motifs_df.to_csv(f'{perturb_output_dir}/all_instances_curated_formatted_0based.tsv.gz', sep = '\t', index = False)
print(motifs_df.pattern_name.value_counts())
print(motifs_df.region_id.value_counts().value_counts())

LHX-ISL1-28    16748
Onecut2         6699
GATA            5173
LHX             4826
NKX2.5          4510
NeuroD          4010
ISL1            2911
EBF1            2338
LHX-ISL1-9      2193
LHX-ISL1-10     2073
NKX2.5-alt      1827
Name: pattern_name, dtype: int64
3     3104
2     2906
4     2631
1     2445
5     1972
6     1089
7      609
8      271
9      135
10      69
11      24
12       8
13       4
14       3
15       1
16       1
Name: region_id, dtype: int64


In [8]:
motifs_df.head(n=5)

| | example_chrom | start | end | name | score | strand | pattern_len | pattern_name | motif_id | example_idx | region_id | region_start | region_end | pattern_start | pattern_end |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 100629813 | 100629851 | LHX-ISL1-28_1_558 | 0 | - | 38 | LHX-ISL1-28 | 1 | 558 | 558 | 100629332 | 100630332 | 481 | 519 |
| 1 | chr1 | 100643263 | 100643280 | LHX-ISL1-10_2_559 | 0 | - | 17 | LHX-ISL1-10 | 2 | 559 | 559 | 100642844 | 100643844 | 419 | 436 |
| 2 | chr1 | 100643287 | 100643325 | LHX-ISL1-28_3_559 | 0 | - | 38 | LHX-ISL1-28 | 3 | 559 | 559 | 100642844 | 100643844 | 443 | 481 |
| 3 | chr1 | 100643336 | 100643346 | EBF1_4_559 | 0 | + | 10 | EBF1 | 4 | 559 | 559 | 100642844 | 100643844 | 492 | 502 |
| 4 | chr1 | 100643398 | 100643436 | LHX-ISL1-28_5_559 | 0 | - | 38 | LHX-ISL1-28 | 5 | 559 | 559 | 100642844 | 100643844 | 554 | 592 |


# Generate contributions to match the curated coordinates

Because the `example_idx` values have changed, we generate contributions that match the modified coordinates. Otherwise, `ContribFile(original_contrib.h5).get_seq()` will return the incorrect sequence.

In [9]:
print(f'bpnet contrib --method deeplift --memfrac-gpu .4 --regions {curated_regions} \
{model_dir} preds/{model_prefix}/all_grouped_regions_0based_contrib.h5')

bpnet contrib --method deeplift --memfrac-gpu .4 --regions analysis/bed/mapped_motifs/all_grouped_regions_0based.bed models/seq_width1000-lr0.001-lambda100-n_dil_layers9-conv_kernel_size7-tconv_kernel_size7-filters64 preds/seq_width1000-lr0.001-lambda100-n_dil_layers9-conv_kernel_size7-tconv_kernel_size7-filters64/all_grouped_regions_0based_contrib.h5


In [10]:
contrib_file = f'preds/{model_prefix}/all_grouped_regions_0based_contrib.h5'

# Generate perturbations

`scripts/bpnet_generate_seq_perturbations.py` contains the code required to generate the sum and maximum values for each (1) task, (2) annotated motif, and (3) mutant combination. Pseudocounts of the entire window for each (1) task and (2) mutant combination are also included for further analysis.
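
The grouping described above (task, annotated motif, mutant combination) can be pictured as a long-format table with one row per triple. The sketch below only builds that skeleton for a single toy region. The column names, the motif ids, and the helper `mutant_combinations` are assumptions for illustration, and the script's real output layout may differ (for example, it may also include an unmutated reference).

```python
import itertools
import pandas as pd

def mutant_combinations(motif_ids, comb_max=3):
    """List every subset of motif_ids of size 1..comb_max (hypothetical helper)."""
    combos = []
    for k in range(1, comb_max + 1):
        combos.extend(itertools.combinations(motif_ids, k))
    return combos

tasks = ["I_WT_D6CM", "I_WT_S3MN"]  # task names used in this notebook
motif_ids = ["2", "3", "4"]         # hypothetical motif ids within one region

# One row per (task, mutated motif set, motif being measured).
rows = [
    {"task": task, "mutated": "+".join(combo), "measured_motif": motif}
    for task, combo, motif in itertools.product(
        tasks, mutant_combinations(motif_ids), motif_ids
    )
]
layout = pd.DataFrame(rows)
print(layout.shape)
```

With 2 tasks, 7 mutant combinations, and 3 measured motifs, the skeleton has 42 rows. The sum and maximum values would be attached to each row.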

To reduce the time needed to generate predictions, you can enable a GPU.

## Generate windowed perturbations

Windowed perturbations let us measure effects on the maximum profile height within a given window. This analysis is intended for TFs that bind normally and whose signal is localized at their motifs. The code is meant to recreate Figure 5 of the BPNet paper.
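
To illustrate what a windowed summary means here, the sketch below computes the sum and maximum of a profile inside a fixed-width window. It assumes the window is centered on a given position and clipped to the region bounds. The source does not state how the script places its windows, so `window_summary` is a hypothetical stand-in, not the script's implementation.

```python
import numpy as np

def window_summary(profile, center, width):
    """Sum and max of profile within a window of `width` centered on `center`.

    The window is clipped to the array bounds (assumed behavior).
    """
    half = width // 2
    lo = max(0, center - half)
    hi = min(len(profile), center + half)
    window = profile[lo:hi]
    return float(window.sum()), float(window.max())

# Toy 1000 bp profile with one peak near the center and one near the edge.
profile = np.zeros(1000)
profile[500] = 5.0
profile[900] = 9.0

enhancer = window_summary(profile, center=500, width=500)   # covers 250..749
whole = window_summary(profile, center=500, width=1000)     # covers 0..999
print(enhancer, whole)
```

The 500 bp window only sees the central peak, while the whole-window summary also picks up the peak at position 900. This is why the same perturbation can give different values at different window sizes.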

### Generate across enhancer range

In [11]:
#%%script false --no-raise-error
! echo python scripts/bpnet_generate_seq_perturbations.py \
-d {perturb_output_dir}/all_instances_curated_formatted_0based.tsv.gz \
-m {model_dir} -c {contrib_file} -o {perturb_output_dir}/perturbs_500bp --comb_max 3 -t 64 -n 32 -w 500 -g 0 -x .4

python scripts/bpnet_generate_seq_perturbations.py -d analysis/tsv/perturbs//all_instances_curated_formatted_0based.tsv.gz -m models/seq_width1000-lr0.001-lambda100-n_dil_layers9-conv_kernel_size7-tconv_kernel_size7-filters64 -c preds/seq_width1000-lr0.001-lambda100-n_dil_layers9-conv_kernel_size7-tconv_kernel_size7-filters64/all_grouped_regions_0based_contrib.h5 -o analysis/tsv/perturbs//perturbs_500bp --comb_max 3 -t 64 -n 32 -w 500 -g 0 -x .4


## Generate across whole window range

In [12]:
! echo python scripts/bpnet_generate_seq_perturbations.py \
-d {perturb_output_dir}/all_instances_curated_formatted_0based.tsv.gz \
-m {model_dir} -c {contrib_file} -o {perturb_output_dir}/perturbs_all --comb_max 3 -t 64 -n 32 --use_whole_window -g 0 -x .4

python scripts/bpnet_generate_seq_perturbations.py -d analysis/tsv/perturbs//all_instances_curated_formatted_0based.tsv.gz -m models/seq_width1000-lr0.001-lambda100-n_dil_layers9-conv_kernel_size7-tconv_kernel_size7-filters64 -c preds/seq_width1000-lr0.001-lambda100-n_dil_layers9-conv_kernel_size7-tconv_kernel_size7-filters64/all_grouped_regions_0based_contrib.h5 -o analysis/tsv/perturbs//perturbs_all --comb_max 3 -t 64 -n 32 --use_whole_window -g 0 -x .4


From these co-occurrence plots, we can see that the motifs do not co-occur significantly at their own peak sets when the peak set of each cell type is treated as a separate group.