# Building a JointDb: Integrating Hi-C Embeddings with Epigenomic Signal Tracks

This notebook demonstrates how to create a `JointDb` database that integrates Hi-C embeddings from `jointly-hic` with additional epigenomic datasets, such as ChIP-seq signal tracks.

## What is JointDb?

`JointDb` is an optional output format of the `jointly-hic` toolkit. It is a compact HDF5-based database that combines:

- **Low-dimensional Hi-C embeddings** (e.g., PCA or NMF coordinates)
- **Sample metadata** (e.g., biosample, cell type, condition)
- **BigWig signal tracks** from ChIP-seq, ATAC-seq, DNase-seq, or RNA-seq

This unified format enables:

- Efficient storage and querying of per-bin signal and embedding data
- Integration of 3D chromatin structure with regulatory activity
- Visualization and analysis across genomic regions and samples

## Overview of This Notebook

In this example, we use public Hi-C data from *Choppavarapu et al.* and enrich it with H3K27me3 ChIP-seq signal tracks.

We will:

1. Load and annotate post-processed Hi-C embeddings (`.pq`) with sample metadata
2. Generate experiment and track metadata files in YAML format
3. Create a JointDb HDF5 file combining all datasets

## Input Files

- **`ENCODE_breast_metadata.csv`**: Header-free metadata for matched experiments curated from the ENCODE database. It lists the accession numbers of ChIP-seq, ATAC-seq, DNase-seq, and RNA-seq experiments performed on healthy human breast tissue and MCF-7 cells, matching the breast tissue Hi-C experiments from *Choppavarapu et al.*
- **`breast-tissue-demo-output_PCA-32_50000bp_hg38_post_processed.pq`**: Post-processed Hi-C embeddings at 50 kb resolution
- **BigWig files**: Referenced in `ENCODE_breast_metadata.csv` and retrieved automatically

## Output Files

- **`breast-tissue-demo-annotated.pq`**: Hi-C embeddings with merged sample metadata
- **`experiment-metadata.yaml`**: YAML file describing Hi-C experiments
- **`track-metadata.yaml`**: YAML file describing ChIP-seq signal tracks
- **`breast-demo-jointdb.h5`**: The final JointDb HDF5 database

With this setup, you can analyze or visualize how chromatin structure (via Hi-C embeddings) correlates with epigenetic features across breast cancer and normal breast tissue at various genomic loci.


In [1]:
import pandas as pd

In [2]:
# Path to the post-processed jointly-hic embeddings
EMBEDDINGS = "./data/breast-tissue-demo-output_PCA-32_50000bp_hg38_post_processed.pq"

# Load the embeddings
df = pd.read_parquet(EMBEDDINGS)
df.head()

Unnamed: 0,chrom,start,end,weight,good_bin,filename,PCA1,PCA2,PCA3,PCA4,...,leiden_0_3_n500,leiden_0_5_n500,leiden_0_8_n500,leiden_1_0_n500,umap1_n30,umap2_n30,umap1_n100,umap2_n100,umap1_n500,umap2_n500
0,chr1,0,50000,,False,Normal_breast1.hg38.mapq_30.1000.mcool,,,,,...,,,,,,,,,,
1,chr1,50000,100000,,False,Normal_breast1.hg38.mapq_30.1000.mcool,,,,,...,,,,,,,,,,
2,chr1,100000,150000,,False,Normal_breast1.hg38.mapq_30.1000.mcool,,,,,...,,,,,,,,,,
3,chr1,150000,200000,,False,Normal_breast1.hg38.mapq_30.1000.mcool,,,,,...,,,,,,,,,,
4,chr1,200000,250000,,False,Normal_breast1.hg38.mapq_30.1000.mcool,,,,,...,,,,,,,,,,
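
Before annotating the table, it can help to confirm its shape per input file. The sketch below is not part of `jointly-hic`; `summarize_bins` is a hypothetical helper that assumes only the `filename`, `start`, `end`, and `good_bin` columns shown in the output above.

```python
import pandas as pd


def summarize_bins(embeddings: pd.DataFrame) -> pd.DataFrame:
    """Per-file bin count, usable-bin count, and largest observed bin width."""
    # Bin width in base pairs; the last bin of a chromosome may be shorter
    with_width = embeddings.assign(bin_width=embeddings["end"] - embeddings["start"])
    return with_width.groupby("filename").agg(
        n_bins=("good_bin", "size"),        # all bins, including masked ones
        n_good_bins=("good_bin", "sum"),    # bins flagged good_bin == True
        max_bin_width=("bin_width", "max"),
    )
```

In this notebook, `summarize_bins(df)` would be expected to report a `max_bin_width` of 50000 for every file if the embeddings were produced at the 50 kb resolution that the file name suggests.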


In [3]:
# Copy the DataFrame and shorten each filename to its sample name (the text before the first '.')
new_df = df.copy()
new_df['filename'] = new_df['filename'].str.split('.').str[0]
new_df

Unnamed: 0,chrom,start,end,weight,good_bin,filename,PCA1,PCA2,PCA3,PCA4,...,leiden_0_3_n500,leiden_0_5_n500,leiden_0_8_n500,leiden_1_0_n500,umap1_n30,umap2_n30,umap1_n100,umap2_n100,umap1_n500,umap2_n500
0,chr1,0,50000,,False,Normal_breast1,,,,,...,,,,,,,,,,
1,chr1,50000,100000,,False,Normal_breast1,,,,,...,,,,,,,,,,
2,chr1,100000,150000,,False,Normal_breast1,,,,,...,,,,,,,,,,
3,chr1,150000,200000,,False,Normal_breast1,,,,,...,,,,,,,,,,
4,chr1,200000,250000,,False,Normal_breast1,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
727555,chrX,155800000,155850000,0.025782,True,Recurrent_breast_tumor5,0.002060,0.000881,0.000495,-0.000279,...,1.0,2.0,3.0,6.0,2.878007,12.033370,2.633736,1.973420,3.249315,1.966544
727556,chrX,155850000,155900000,0.021622,True,Recurrent_breast_tumor5,0.002514,0.000704,0.000012,-0.000670,...,1.0,2.0,3.0,3.0,3.030249,11.754386,2.470406,1.810485,3.098918,1.576779
727557,chrX,155900000,155950000,0.028212,True,Recurrent_breast_tumor5,0.003229,0.001286,-0.000395,-0.000509,...,2.0,2.0,3.0,6.0,3.956839,12.381848,3.116170,1.077261,3.723387,1.207550
727558,chrX,155950000,156000000,0.021044,True,Recurrent_breast_tumor5,0.002438,0.001010,0.000107,-0.000218,...,1.0,2.0,3.0,6.0,2.872449,12.048409,2.637435,1.818989,3.204717,1.799860


In [4]:
# Add a biosample column: names starting with 'Normal' are labeled normal, all others cancer
new_df['biosample'] = new_df['filename'].str.startswith('Normal').map({True: 'normal', False: 'cancer'})
new_df

Unnamed: 0,chrom,start,end,weight,good_bin,filename,PCA1,PCA2,PCA3,PCA4,...,leiden_0_5_n500,leiden_0_8_n500,leiden_1_0_n500,umap1_n30,umap2_n30,umap1_n100,umap2_n100,umap1_n500,umap2_n500,biosample
0,chr1,0,50000,,False,Normal_breast1,,,,,...,,,,,,,,,,normal
1,chr1,50000,100000,,False,Normal_breast1,,,,,...,,,,,,,,,,normal
2,chr1,100000,150000,,False,Normal_breast1,,,,,...,,,,,,,,,,normal
3,chr1,150000,200000,,False,Normal_breast1,,,,,...,,,,,,,,,,normal
4,chr1,200000,250000,,False,Normal_breast1,,,,,...,,,,,,,,,,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
727555,chrX,155800000,155850000,0.025782,True,Recurrent_breast_tumor5,0.002060,0.000881,0.000495,-0.000279,...,2.0,3.0,6.0,2.878007,12.033370,2.633736,1.973420,3.249315,1.966544,cancer
727556,chrX,155850000,155900000,0.021622,True,Recurrent_breast_tumor5,0.002514,0.000704,0.000012,-0.000670,...,2.0,3.0,3.0,3.030249,11.754386,2.470406,1.810485,3.098918,1.576779,cancer
727557,chrX,155900000,155950000,0.028212,True,Recurrent_breast_tumor5,0.003229,0.001286,-0.000395,-0.000509,...,2.0,3.0,6.0,3.956839,12.381848,3.116170,1.077261,3.723387,1.207550,cancer
727558,chrX,155950000,156000000,0.021044,True,Recurrent_breast_tumor5,0.002438,0.001010,0.000107,-0.000218,...,2.0,3.0,6.0,2.872449,12.048409,2.637435,1.818989,3.204717,1.799860,cancer
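
Because the prefix rule labels every sample whose name does not start with `Normal` as cancer, a misnamed file would be mislabeled silently. One way to review the assignment is to list each sample once with its label. This is a minimal sketch; `biosample_table` is a hypothetical helper, not part of `jointly-hic`, and assumes the `filename` and `biosample` columns created above.

```python
import pandas as pd


def biosample_table(annotated: pd.DataFrame) -> pd.DataFrame:
    """Return one row per sample with its assigned biosample label."""
    samples = annotated[["filename", "biosample"]].drop_duplicates()
    # After de-duplication, a repeated filename means conflicting labels
    if samples["filename"].duplicated().any():
        raise ValueError("a sample has more than one biosample label")
    return samples.sort_values("filename").reset_index(drop=True)
```

Calling `biosample_table(new_df)` gives a short table that can be checked by eye before the annotated embeddings are saved.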


In [6]:
# Save the embeddings with metadata
new_df.to_parquet("./data/breast-tissue-demo-annotated.pq")

In [7]:
# Export the sample labels from the annotated embeddings to YAML for building the JointDb database
! jointly embedding2yaml --parquet-file ./data/breast-tissue-demo-annotated.pq \
    --accession-column filename \
    --metadata-columns biosample \
    --yaml-file experiment-metadata.yaml

2025-04-17 17:31:34,310::INFO::main:   Starting jointly-hic
2025-04-17 17:31:35,054::INFO::main:   Metadata successfully written to experiment-metadata.yaml
2025-04-17 17:31:35,084::INFO::main:   Finished jointly-hic


In [8]:
# Next, prepare the track metadata from ENCODE_breast_metadata.csv.
# The file is a header-free list of RNA-seq, DNase-seq, histone ChIP-seq, transcription factor ChIP-seq,
# and ATAC-seq experiments available on ENCODE for cancer cells (MCF-7) and normal cells (tissue from donors).
# It must contain only the following columns: biosample, assay, experiment, accession

! jointly tracks2yaml ENCODE_breast_metadata.csv track-metadata.yaml

2025-04-17 17:31:42,909::INFO::main:   Starting jointly-hic
2025-04-17 17:31:42,941::INFO::main:   Finished jointly-hic
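
Since the track CSV has no header, a malformed row is easy to miss. The sketch below checks the file before it is passed to `jointly tracks2yaml`. It is an illustration only: `check_track_csv` is a hypothetical helper, and it assumes the file is comma-delimited with the four columns in the order listed above.

```python
import csv
from collections import Counter

# Assumed column order for the header-free file
EXPECTED_COLUMNS = ("biosample", "assay", "experiment", "accession")


def check_track_csv(path: str) -> Counter:
    """Validate a header-free track CSV and count rows per (biosample, assay)."""
    counts = Counter()
    with open(path, newline="") as handle:
        for line_number, row in enumerate(csv.reader(handle), start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != len(EXPECTED_COLUMNS):
                raise ValueError(
                    f"line {line_number}: expected {len(EXPECTED_COLUMNS)} "
                    f"fields, found {len(row)}"
                )
            record = dict(zip(EXPECTED_COLUMNS, (field.strip() for field in row)))
            if not all(record.values()):
                raise ValueError(f"line {line_number}: empty field")
            counts[(record["biosample"], record["assay"])] += 1
    return counts
```

`check_track_csv("ENCODE_breast_metadata.csv")` returns a counter keyed by `(biosample, assay)`, which also shows at a glance how many tracks each assay contributes.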


In [None]:
# Now we can create a JointDb database
! jointly hdf5db --experiments experiment-metadata.yaml \
    --tracks track-metadata.yaml \
    --embeddings ./data/breast-tissue-demo-annotated.pq \
    --accession filename \
    --output ./data/breast-demo-jointdb.h5
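
Once the command finishes, the resulting file can be explored like any other HDF5 file. The sketch below uses `h5py` (an assumption; it is not imported elsewhere in this notebook) to list every group and dataset. It makes no assumption about how JointDb organizes its contents, and `list_hdf5_contents` is a hypothetical helper.

```python
import h5py


def list_hdf5_contents(path: str) -> list[str]:
    """Return one descriptive line per group or dataset in an HDF5 file."""
    lines = []

    def describe(name, obj):
        # visititems calls this for every object below the root group
        if isinstance(obj, h5py.Dataset):
            lines.append(f"{name}  dataset  shape={obj.shape}  dtype={obj.dtype}")
        else:
            lines.append(f"{name}  group")

    with h5py.File(path, "r") as handle:
        handle.visititems(describe)
    return lines
```

For example, `print("\n".join(list_hdf5_contents("data/breast-demo-jointdb.h5")))` prints the full hierarchy, which is a reasonable first step before querying signal or embedding data.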