# Impute cistromes

The `impute_cistrome` command uses ChromBERT's learned co-association patterns to impute cistrome signals (e.g., ChIP-seq) for factor–cell pairs where experimental data is unavailable.

**Note**: The remaining examples show only the direct command usage.

If you need to run `chrombert-tools` inside a Singularity container, see the [`singularity_use.ipynb`](singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.

In [1]:
import pandas as pd
import numpy as np
import os
workdir="/mnt/Storage2/home/chenqianqian/projects/chrombert_tools/2.test/pull/ChromBERT-tools/examples/cli" # your workdir
os.chdir(workdir)

In [2]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome             Extract cell-specific cistrome...
  embed_cell_gene                 Extract cell-specific gene embeddings
  embed_cell_region               Extract cell-specific region embeddings
  embed_cell_regulator            Extract cell-specific regulator...
  embed_cistrome                  Extract general cistrome embeddings on...
  embed_gene                      Extract general gene embeddings
  embed_region                    Extract general region embeddings
  embed_regulator                 Extract general regulator embeddings on...
  find_context_specific_cofactor  Find context-specific cofactors in...
  find_driver_in_transition       Find driver factors in cell

In [3]:
!chrombert-tools impute_cistrome -h

Usage: chrombert-tools impute_cistrome [OPTIONS]

  Impute cistromes on specified regions

Options:
  --region FILE                   Region BED file.  [required]
  --cistrome TEXT                 factor:cell e.g.
                                  BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2. Use ';'
                                  to separate multiple cistromes.  [required]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb]              Resolution. Only supports 1kb resolution in
                                  imputing cistromes task.  [default: 1kb]
  --batch-size INTEGER            Batch size. if you have enough GPU memory,
                                  you can set it to a larger value.  [default:
                                  4]
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache directory (containing
                                  config/ and
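
The `--cistrome` option takes `factor:cell` pairs joined by `;`, which is easy to mistype when the list grows. The sketch below shows one way to assemble that string and the full argument list from Python. The helper names `build_cistrome_arg` and `build_command` are hypothetical and not part of `chrombert-tools`; only flags listed in the help text above are used, and rejecting `:` or `;` inside a name is an assumption about what the parser would accept.

```python
import shlex


def build_cistrome_arg(pairs):
    """Join (factor, cell) pairs into 'factor:cell;factor:cell' form.

    Hypothetical helper: it only formats the string described in the
    help text and does not check names against ChromBERT's metadata.
    """
    parts = []
    for factor, cell in pairs:
        factor, cell = factor.strip(), cell.strip()
        if not factor or not cell:
            raise ValueError("factor and cell must both be non-empty")
        # Assumption: separators inside a name would confuse the parser.
        if any(sep in name for name in (factor, cell) for sep in ":;"):
            raise ValueError(f"unexpected separator in {factor!r} or {cell!r}")
        parts.append(f"{factor}:{cell}")
    return ";".join(parts)


def build_command(pairs, region, odir="./output", genome="hg38", batch_size=4):
    """Return the argument list for `chrombert-tools impute_cistrome`."""
    return [
        "chrombert-tools", "impute_cistrome",
        "--cistrome", build_cistrome_arg(pairs),
        "--region", region,
        "--odir", odir,
        "--genome", genome,
        "--batch-size", str(batch_size),
    ]


pairs = [("BCL11A", "GM12878"), ("BRD4", "MCF7"), ("CTCF", "HepG2")]
cmd = build_command(
    pairs, "../data/CTCF_ENCFF664UGR_sample100.bed", odir="./output_impute"
)
# shlex.join quotes the ';'-separated value so it is safe to paste into a shell.
print(shlex.join(cmd))
```

Passing `cmd` as a list to `subprocess.run(cmd, check=True)` would avoid shell quoting altogether, since `;` would otherwise be read by the shell as a command separator.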

In [4]:
!chrombert-tools impute_cistrome \
    --cistrome "BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2;MYC:H1;MYC:h9;SPI1:GSM2702714" \
    --region "../data/CTCF_ENCFF664UGR_sample100.bed" \
    --odir "./output_impute" \
    --genome "hg38" \
    --resolution "1kb"


Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, We keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region),non-overlapping: 0
celltype: h1 has no corresponding wild type dnase data in ChromBERT.
Note: All cistromes names were converted to lowercase for matching.
Cistromes count summary - requested: 6, matched in ChromBERT: 5, not found: 1, not found cistromes: ['myc:h1']
ChromBERT cistromes metas: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.tsv
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
use organisim hg38; max sequence length is 6391
use organisim hg38; max sequenc
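
The region summary in the log states that an overlap is kept when it covers at least 50% of either the ChromBERT bin or the input region. The sketch below is one reading of that rule, not the tool's actual implementation. It assumes half-open BED coordinates, and the function name `keeps_overlap` is hypothetical. The first check uses an input region and its matched bin from the results table in this tutorial; the second uses a neighbouring 1 kb bin that is assumed here purely for contrast.

```python
def keeps_overlap(in_start, in_end, bin_start, bin_end, min_frac=0.5):
    """Return True if the overlap covers >= min_frac of either interval.

    Assumes half-open [start, end) coordinates, as in BED files.
    """
    overlap = max(0, min(in_end, bin_end) - max(in_start, bin_start))
    if overlap == 0:
        return False
    frac_of_input = overlap / (in_end - in_start)
    frac_of_bin = overlap / (bin_end - bin_start)
    return frac_of_input >= min_frac or frac_of_bin >= min_frac


# chr1:37989946-37990368 against its reported bin 37990000-37991000:
# 368 of 422 input bases overlap, so the pair is kept.
print(keeps_overlap(37989946, 37990368, 37990000, 37991000))

# Same input against a hypothetical upstream bin 37989000-37990000:
# only 54 bases overlap, which is below 50% of both intervals.
print(keeps_overlap(37989946, 37990368, 37989000, 37990000))
```

Under this rule an input region that straddles a bin boundary close to evenly could map to two ChromBERT bins, which is consistent with the log's remark that one region may overlap multiple ChromBERT regions.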

In [5]:
# results_pro_df: Imputed peak probabilities.
results_pro_df = pd.read_csv("./output_impute/results_prob_df.csv")
results_pro_df


Unnamed: 0,input_chrom,input_start,input_end,chrombert_build_region_index,chrombert_start,chrombert_end,bcl11a:gm12878,brd4:mcf7,ctcf:hepg2,myc:h9,spi1:gsm2702714
0,chr1,37989946,37990368,32658,37990000,37991000,0.781250,0.660156,0.984375,0.972656,0.632812
1,chr11,2400199,2400617,289179,2400000,2401000,0.664062,0.570312,0.972656,0.882812,0.917969
2,chr12,6778809,6779319,391108,6779000,6780000,0.527344,0.412109,0.980469,0.871094,0.503906
3,chr12,52980788,52981316,424926,52981000,52982000,0.174805,0.601562,0.976562,0.812500,0.345703
4,chr12,53676021,53676448,425578,53676000,53677000,0.494141,0.699219,0.968750,0.945312,0.570312
...,...,...,...,...,...,...,...,...,...,...,...
95,chr6,53171843,53172315,1660979,53172000,53173000,0.408203,0.474609,0.988281,0.566406,0.617188
96,chr6,131628105,131628616,1713078,131628000,131629000,0.632812,0.667969,0.988281,0.894531,0.773438
97,chr6,158704189,158704642,1735665,158704000,158705000,0.558594,0.251953,0.972656,0.613281,0.554688
98,chr9,128117589,128118035,2049996,128117000,128118000,0.597656,0.468750,0.972656,0.812500,0.468750
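
A common next step is to check which requested cistromes actually came back and to turn the probabilities into binary peak calls. The sketch below does both on the first three rows of the table above, copied inline so it runs without the output file. The 0.5 cutoff is an illustrative assumption, not a threshold recommended by ChromBERT. Cistrome columns are selected by excluding known metadata columns rather than by searching for `:`, because the index column `Unnamed: 0` also contains a colon.

```python
import pandas as pd

# First three rows of the table above, copied by hand for a self-contained example.
results = pd.DataFrame(
    {
        "input_chrom": ["chr1", "chr11", "chr12"],
        "input_start": [37989946, 2400199, 6778809],
        "input_end": [37990368, 2400617, 6779319],
        "bcl11a:gm12878": [0.781250, 0.664062, 0.527344],
        "brd4:mcf7": [0.660156, 0.570312, 0.412109],
        "ctcf:hepg2": [0.984375, 0.972656, 0.980469],
        "myc:h9": [0.972656, 0.882812, 0.871094],
        "spi1:gsm2702714": [0.632812, 0.917969, 0.503906],
    }
)

# Metadata columns as they appear in the header of results_prob_df.csv.
META_COLS = {
    "Unnamed: 0", "input_chrom", "input_start", "input_end",
    "chrombert_build_region_index", "chrombert_start", "chrombert_end",
}
cistrome_cols = [c for c in results.columns if c not in META_COLS]

# The tool lowercases cistrome names, so compare in lowercase.
requested = "BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2;MYC:H1;MYC:h9;SPI1:GSM2702714"
missing = [c for c in requested.lower().split(";") if c not in cistrome_cols]
print("not imputed:", missing)

THRESHOLD = 0.5  # illustrative cutoff (assumption)
calls = results[cistrome_cols] >= THRESHOLD
per_cistrome = calls.sum()  # number of regions called as peaks per cistrome
print(per_cistrome)
```

On the real output, replacing the inline frame with `pd.read_csv("./output_impute/results_prob_df.csv", index_col=0)` should drop the `Unnamed: 0` column while leaving the rest of the sketch unchanged.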
