# Determining padlock secondary structure with NUPACK

For a given pool of padlocks, we perform two secondary structure analyses.


1. Padlocks may form hairpins or homodimers that impair binding of the left arm, the right arm, or both. Either would prevent the padlock from binding its target effectively and being ligated. To measure the effect of secondary structure on both arms, we simulate a test tube containing just this padlock and its target. The readout of the simulation is the fraction of total target bound by the padlock.

   If we simulated the whole target sequence together with the padlock, a complex in which only one arm is hybridized to the target would still count as bound, even though the other arm may not have bound. To check that neither arm has problematic secondary structure, we split the target sequence into two pieces corresponding to the binding sites of the left and right arms. We then determine the fraction of left target and of right target bound by the padlock in a simulated tube containing all three sequences.

   As in the real experiment, the padlock is added at its actual working concentration, and the two target sequences are set to a very low concentration, mimicking cDNA found in tissue. The fraction of targets found in a triple-bound configuration will therefore be very low, but for a good probe the individual fractions of left-arm target and right-arm target bound to padlock should both be near 1. We can use the average of these two fractions, or a similar metric, to separate good padlocks from bad ones.

   This is done by the function `analyze_padlock_bridging`.

2. Padlocks may form heterodimers with each other, which prevents them from binding their targets. To capture all possible pairwise interactions, we run a test tube simulation for every unique pair of padlocks in the pool and determine the fraction of each pair that binds together and the free energy of that binding. This is done by the function `submit_exhaustive_heterodimer_jobs`.
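
The per-arm readout described in item 1 can be sketched as follows. This is a minimal illustration, not the code inside `analyze_padlock_bridging`: the input format (a dict mapping each complex to its concentration), the strand labels, the example concentrations, and the helper `bound_fraction` are all hypothetical. The score is the simple average of the two arm fractions.

```python
def bound_fraction(complex_conc, target, padlock="padlock"):
    """Fraction of `target` strands sitting in a complex that also contains the padlock.

    complex_conc maps a tuple of strand names (one complex) to its
    equilibrium concentration in molar. This input format is hypothetical.
    """
    total = 0.0
    bound = 0.0
    for strands, conc in complex_conc.items():
        n_target = strands.count(target)  # copies of the target in this complex
        total += n_target * conc
        if padlock in strands:
            bound += n_target * conc
    return bound / total if total > 0 else float("nan")


# Made-up equilibrium concentrations for a tube with the padlock at 1e-9 M
# and each half-target at 1e-19 M (illustrative numbers only).
example = {
    ("padlock",): 1e-9,
    ("left_target",): 2e-21,
    ("right_target",): 4e-20,
    ("padlock", "left_target"): 9.8e-20,
    ("padlock", "right_target"): 6e-20,
}

left = bound_fraction(example, "left_target")
right = bound_fraction(example, "right_target")
score = (left + right) / 2  # average of the two arm fractions
print(f"left={left:.2f} right={right:.2f} score={score:.2f}")
```

In this toy tube the left arm is almost saturated while the right arm is bound much less often, so the averaged score flags the padlock even though a whole-target simulation might have reported it as bound.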


In [None]:
# Dataframe-driven tube analysis
from pathlib import Path

import numpy as np
import pandas as pd

from lib import nupack_heterodimer

# Utilities moved to lib.nupack_tube_analysis
from lib.nupack_tube_analysis import (
    analyze_padlock_bridging,
)

In [None]:
# Perform per padlock-target analysis
df = pd.read_csv(
    "/camp/lab/znamenskiyp/home/users/becalia/code/multi_padlock_design/notebooks/monahan_panel_barcoded.csv")
per_df = analyze_padlock_bridging(
    df,
    padlock_conc=1e-9,
    target_conc=1e-19,
    progress=True,
    parallel=1)
per_df.to_csv("monahan_panel_barcoded_NUPACK.csv", index=False)

In [None]:
slurm_folder = Path.home() / "slurm_logs" / "olfr_probe_design"
slurm_folder.mkdir(parents=True, exist_ok=True)
output_dir = Path("/camp/lab/znamenskiyp/scratch/old_olfr_dg_hetero_NUPACK")
output_dir.mkdir(parents=True, exist_ok=True)

# Submit jobs (adjust n_batches/time/mem as needed)
nupack_heterodimer.submit_exhaustive_heterodimer_jobs(
    probe_df_path="/camp/lab/znamenskiyp/home/users/becalia/code/multi_padlock_design/padlock_checking/all_olf_probes_barcoded.csv",
    output_dir=output_dir,
    slurm_folder=str(slurm_folder),
    n_batches=350,
    time="7-00:00:00",
    mem="8G",
    partition="ncpu",
    cpus_per_task=1,
    sequence_col="padlock",
    dependency_aggregate_time="02:00:00",
    dependency_aggregate_mem="32G",
    padlock_conc=1e-9,
    dry_run=False,
)
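
The exhaustive analysis covers every unordered pair of padlocks, so the workload grows quadratically with pool size. The sketch below shows one way such pairs could be enumerated and divided into `n_batches` chunks. It is an assumption about the general approach, not the actual logic of `submit_exhaustive_heterodimer_jobs`; `split_pairs` is a hypothetical helper, and whether the real function also includes self-pairs is not stated here.

```python
from itertools import combinations


def split_pairs(names, n_batches):
    """Enumerate all unordered pairs and deal them round-robin into batches."""
    pairs = list(combinations(names, 2))  # N * (N - 1) / 2 pairs, no self-pairs
    return [pairs[i::n_batches] for i in range(n_batches)]


names = [f"padlock_{i}" for i in range(6)]  # toy pool of 6 probes
batches = split_pairs(names, n_batches=4)

n_pairs = sum(len(b) for b in batches)
print(n_pairs, [len(b) for b in batches])
```

Round-robin assignment keeps batch sizes within one pair of each other, which helps when every batch gets the same time and memory limits.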


In [None]:
# Load the output of heterodimer jobs and convert to DataFrame
# Convert heterodimer dict -> square pandas DataFrame (index/columns = padlock names)
# Only the output directory is needed to load results in this cell
output_dir = Path("/camp/lab/znamenskiyp/scratch/old_olfr_dg_hetero_NUPACK")
df = pd.read_pickle(output_dir / "heterodimer_nupack_percent_matrix.pkl")

if isinstance(df, dict) and {'matrix', 'names'} <= set(df.keys()):
    names = df['names']
    matrix = df['matrix']
    # Ensure numpy array
    matrix = np.asarray(matrix)
    if matrix.shape != (len(names), len(names)):
        raise ValueError(f"Matrix shape {matrix.shape} != ({len(names)}, {len(names)}) from names list length")
    # Optional memory optimization: downcast to float32 if float64
    if matrix.dtype == np.float64:
        matrix = matrix.astype('float32')
    dg_matrix_df = pd.DataFrame(matrix, index=names, columns=names)
    # Replace df with the DataFrame for convenience
    df = dg_matrix_df
    print(f"Converted heterodimer matrix to DataFrame: shape={df.shape}, dtype={df.dtypes.iloc[0]}")
    # Show a small corner preview
    print(df.iloc[:5, :5])
else:
    raise TypeError("Expected df to be a dict with keys 'matrix' and 'names'.")

df.to_csv(output_dir / "heterodimer_nupack_percent_matrix.csv")
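
Once the matrix is a labelled DataFrame, the strongest interactions can be pulled out by looking at the upper triangle only. This sketch uses a small made-up matrix and assumes that the matrix is symmetric and that larger values mean more heterodimer formation; neither is confirmed by the code above. `top_pairs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def top_pairs(matrix_df, n=3):
    """Return the n largest off-diagonal entries as (probe_a, probe_b, value) rows."""
    values = matrix_df.to_numpy()
    rows, cols = np.triu_indices_from(values, k=1)  # upper triangle, no diagonal
    out = pd.DataFrame({
        "probe_a": matrix_df.index[rows],
        "probe_b": matrix_df.columns[cols],
        "value": values[rows, cols],
    })
    return out.sort_values("value", ascending=False).head(n).reset_index(drop=True)


# Toy symmetric matrix standing in for the real heterodimer matrix
names = ["p1", "p2", "p3"]
toy = pd.DataFrame(
    [[0.0, 0.2, 5.0],
     [0.2, 0.0, 1.5],
     [5.0, 1.5, 0.0]],
    index=names, columns=names,
)
ranked = top_pairs(toy, n=2)
print(ranked)
```

Restricting to the upper triangle avoids reporting each pair twice and skips the diagonal, which would otherwise mix self-interactions into the ranking.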