# e4-2 Borzoi Inference Exercise

This exercise uses the Python package `grelu` to make predictions for a given DNA sequence with a pre-trained Borzoi model.

About the grelu package:

- GitHub repo: https://github.com/Genentech/gReLU
- Citation: Lal, A. et al. Decoding sequence determinants of gene expression in diverse cellular and disease states. bioRxiv 2024.10.09.617507 (2024) doi:10.1101/2024.10.09.617507.

About Borzoi: Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. 57, 949–961 (2025).

## Before the class

On Discovery, set up a conda environment and install grelu. Run these commands in a terminal, not in Python.

In [3]:
source /optnfs/common/miniconda3/etc/profile.d/conda.sh
conda create --name py10 python=3.10
conda activate py10
pip install grelu


In [None]:
# run python 
python

You should be in an interactive Python session now. We'll execute the rest of the code in this Python shell in class.
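Before class, it can be useful to confirm from the interactive session that the installation worked. The following is a minimal sketch using only the standard library; the helper name `is_installed` is illustrative and not part of grelu.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None


# "grelu", "torch" and "pandas" are the import names used later in this exercise.
for name in ("grelu", "torch", "pandas"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If any package is reported as missing, check that the `py10` environment is active before starting `python`.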

* An alternative way to install grelu: if you have more than 50 GB of disk space, you can set up grelu inside a container as follows.

Navigate to your lab share and create a container directory.

In [2]:
# cd to your lab share. Your home directory only has 45 GB, so it is not suitable for this installation method.
mkdir containers
cd containers
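Before pulling the image, it may help to check how much free space the target directory has. This is a rough sketch using Python's standard library; the helper name `free_gb` is hypothetical, and the 50 GB threshold is the figure quoted above. On quota-managed file systems the reported free space may not reflect your personal quota, so treat the result as an estimate.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Return the free space, in GB, on the file system that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1e9


REQUIRED_GB = 50  # threshold taken from the text above

available = free_gb(".")
if available < REQUIRED_GB:
    print(f"Only {available:.1f} GB free here; consider another location.")
else:
    print(f"{available:.1f} GB free; enough for the container setup.")
```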


Pull the image into the container directory. This step takes about 10 minutes and requires at least 10 GB of storage space.

In [None]:
singularity pull --dir ~/containers docker://nvcr.io/nvidia/pytorch:24.08-py3 

Build a sandbox from the image

In [None]:
singularity build --sandbox inference_pytorch_24.08-py3/ pytorch_24.08-py3.sif

Initialize the container

In [None]:
CONTAINER_NAME="inference_pytorch_24.08-py3/"
PROJECT_BASE="/dartfs-hpc/rc/home/j/$netID" # Change as needed
CONTAINER_LOCATION="/dartfs-hpc/rc/home/j/$netID/containers/$CONTAINER_NAME" # I use /dartfs/rc/lab/S/Szhao/grahams

In [None]:
PROJECT_BASE="/dartfs/rc/lab/S/Szhao/grahams"
CONTAINER_LOCATION="/dartfs/rc/lab/S/Szhao/grahams/containers/$CONTAINER_NAME"

In [None]:
export PATH=$PATH:/root/.local/bin

In [None]:
singularity shell --fakeroot --writable --nv --contain \
--home "${PROJECT_BASE}:/root" \
"${CONTAINER_LOCATION}"

In [None]:
pip install grelu
export WANDB_MODE=disabled
python

You should be in an interactive Python session now. We'll execute the rest of the code in this Python shell in class.

## Predict features based on input sequence

Load the pre-trained Borzoi model from the gReLU model zoo

In [None]:
import grelu.resources
model = grelu.resources.load_model(
    project="borzoi",
    model_name="human_rep0",
)

Check model metadata and trained cell contexts

In [None]:
model.data_params.keys()

In [None]:
import pandas as pd

# Table of the prediction tasks (tracks) stored in the model metadata
tasks = pd.DataFrame(model.data_params['tasks'])
tasks.head(3)

View Borzoi hyperparameters and training intervals

In [None]:
model.data_params['train'].keys()

In [None]:
for key in model.data_params['train'].keys():
    if key !="intervals":
        print(key, model.data_params['train'][key])

In [None]:
pd.DataFrame(model.data_params['train']['intervals']).head()

Make the inference intervals

In [None]:
input_len = model.data_params["train"]["seq_len"]
chrom = "chr1"
input_start = 69993520
input_end = input_start + input_len

In [None]:
input_intervals = pd.DataFrame({
    'chrom':[chrom], 'start':[input_start], 'end':[input_end], "strand":["+"],
})

input_intervals

Extract the sequence from the hg38 genome assembly (takes a few minutes)

In [None]:
import grelu.sequence.format

input_seqs = grelu.sequence.format.convert_input_type(
    input_intervals,
    output_type="strings",
    genome="hg38"
)
input_seq = input_seqs[0]

len(input_seq)

In [None]:
input_seq[:10]
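Beyond looking at the length and the first few bases, a quick composition check can reveal problems such as long runs of `N`. The sketch below is illustrative only: `summarize_sequence` is a hypothetical helper, and `toy_seq` is a made-up string standing in for `input_seq`.

```python
from collections import Counter


def summarize_sequence(seq: str) -> dict:
    """Summarize length, GC fraction, N count and unexpected characters."""
    counts = Counter(seq.upper())
    length = len(seq)
    gc_fraction = (counts["G"] + counts["C"]) / length if length else 0.0
    # Anything other than A, C, G, T or N is flagged as unexpected.
    unexpected = sorted(set(counts) - set("ACGTN"))
    return {
        "length": length,
        "gc_fraction": gc_fraction,
        "n_count": counts["N"],
        "unexpected": unexpected,
    }


toy_seq = "ACGTNNacgtGG"  # hypothetical stand-in for input_seq
summary = summarize_sequence(toy_seq)
print(summary)
```

In class, the same helper could be called as `summarize_sequence(input_seq)`.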

Run inference on sequence

In [None]:
import torch

# Run inference on the CPU
device = torch.device('cpu')

model.to(device)
preds = model.predict_on_seqs(input_seqs, device=device)
preds.shape

Note the shape of `preds`: it is in the format (batch, tasks, length), so we have 1 sequence, 7611 tasks, and 6144 bins along the length axis.
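The length axis can be related back to base pairs with a little arithmetic. The sketch below assumes Borzoi's published settings (a 524,288 bp input window and 32 bp output bins) and assumes the output window is cropped symmetrically from the input. In the exercise itself, the input length is read from `model.data_params`, and the model's own interval conversion is the authoritative source for the output coordinates.

```python
# Shape reported in the text: (batch, tasks, length)
pred_shape = (1, 7611, 6144)
n_seqs, n_tasks, n_bins = pred_shape

BIN_SIZE = 32        # assumption: Borzoi predicts coverage in 32 bp bins
INPUT_LEN = 524_288  # assumption: Borzoi input window in bp

# Number of base pairs covered by the predictions
output_bp = n_bins * BIN_SIZE               # 196608

# Assuming symmetric cropping, this many bp are dropped on each side
crop_each_side = (INPUT_LEN - output_bp) // 2  # 163840

input_start = 69_993_520  # start coordinate used in this exercise
expected_output_start = input_start + crop_each_side
expected_output_end = expected_output_start + output_bp

print(output_bp, crop_each_side)
print(expected_output_start, expected_output_end)
```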

Get output intervals:

In [None]:
output_intervals = model.input_intervals_to_output_intervals(input_intervals)
output_intervals

In [None]:
output_start = output_intervals.start[0]
output_end = output_intervals.end[0]
output_len = output_end - output_start
print(output_len)
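Once the output start coordinate is known, each bin along the length axis can be mapped to a genomic interval, and a genomic position can be mapped back to a bin. This is a minimal sketch, not part of grelu: both helper names are hypothetical, the 32 bp bin size is an assumption based on Borzoi's resolution, and `example_output_start` is an illustrative value to be replaced with `output_start` from the cell above.

```python
def bin_to_interval(bin_index: int, output_start: int, bin_size: int = 32) -> tuple:
    """Return the (start, end) genomic coordinates covered by one output bin."""
    start = output_start + bin_index * bin_size
    return start, start + bin_size


def position_to_bin(position: int, output_start: int, n_bins: int, bin_size: int = 32) -> int:
    """Return the index of the output bin that contains a genomic position."""
    offset = position - output_start
    if not 0 <= offset < n_bins * bin_size:
        raise ValueError("position lies outside the output window")
    return offset // bin_size


example_output_start = 70_157_360  # illustrative; use output_start in class
n_bins = 6144

first_bin = bin_to_interval(0, example_output_start)
last_bin = bin_to_interval(n_bins - 1, example_output_start)
middle_bin = position_to_bin(example_output_start + 100, example_output_start, n_bins)

print(first_bin, last_bin, middle_bin)
```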

Plot the predictions for selected tracks and save them as an image

In [None]:
cage_brain_tasks = tasks[(tasks.assay=="CAGE") & (tasks["sample"].str.contains("brain"))].head(2)
rna_brain_tasks = tasks[(tasks.assay=="RNA") & (tasks["sample"].str.contains("brain"))].head(2)

tasks_to_plot = cage_brain_tasks.index.tolist() + rna_brain_tasks.index.tolist()
task_names = tasks.description[tasks_to_plot].tolist() # Description of these tracks from the `tasks` dataframe

print(tasks_to_plot)
print(task_names)
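The selection logic above can be tried on a small table first. The sketch below uses a made-up `toy_tasks` DataFrame with the same column names (`assay`, `sample`, `description`); the rows are hypothetical and do not come from the Borzoi metadata. Unlike the cell above, this variant matches the keyword case-insensitively and treats missing sample names as non-matches, which is a design choice rather than the exercise's method.

```python
import pandas as pd

# Hypothetical task table; only the column names mirror the real one.
toy_tasks = pd.DataFrame({
    "assay": ["CAGE", "RNA", "CAGE", "RNA", "DNASE"],
    "sample": ["brain, adult", "Brain cortex", "liver", None, "brain"],
    "description": ["cage_brain", "rna_brain", "cage_liver", "rna_unknown", "dnase_brain"],
})


def select_tasks(tasks: pd.DataFrame, assay: str, keyword: str, n: int = 2) -> list:
    """Return the index labels of up to n tasks matching an assay and a sample keyword."""
    mask = (tasks["assay"] == assay) & tasks["sample"].str.contains(
        keyword, case=False, na=False
    )
    return tasks[mask].head(n).index.tolist()


cage_idx = select_tasks(toy_tasks, "CAGE", "brain")
rna_idx = select_tasks(toy_tasks, "RNA", "brain")
names = toy_tasks.loc[cage_idx + rna_idx, "description"].tolist()

print(cage_idx + rna_idx)
print(names)
```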

In [None]:
import grelu.visualize

fig = grelu.visualize.plot_tracks(
    preds[0, tasks_to_plot, :],  # Outputs to plot
    start_pos=output_start,      # Start coordinate for the x-axis label
    end_pos=output_end,         # End coordinate for the x-axis label
    titles=task_names,          # Titles for each track
    figsize=(10, 3.5),           # Width, height
)

# Save the figure as a JPG
fig.savefig("predictions.jpg", format='jpg')