# Building a JointDb: Integrating Hi-C Embeddings with Epigenomic Signal Tracks

This notebook demonstrates how to create a `JointDb` database that integrates Hi-C embeddings from `jointly-hic` with additional epigenomic datasets, such as ChIP-seq signal tracks.

## What is JointDb?

`JointDb` is an optional output format of the `jointly-hic` toolkit. It is a compact HDF5-based database that combines:
- Low-dimensional Hi-C embeddings (e.g., PCA or NMF coordinates)
- Sample metadata (e.g., biosample, cell type, condition)
- BigWig signal tracks from ChIP-seq, ATAC-seq, or RNA-seq

This unified format enables:
- Efficient storage and query of per-bin signal and embedding data
- Integration of 3D chromatin structure with regulatory activity
- Visualization and analysis across genomic regions and samples
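
Because a `JointDb` is stored as HDF5, generic HDF5 tooling can inspect it. The sketch below uses `h5py` to list every group and dataset in a file. It makes no assumption about the internal layout that `jointly-hic` writes, and `list_hdf5_contents` is a hypothetical helper name, not part of the toolkit.

```python
import h5py


def list_hdf5_contents(path):
    """Return (name, kind, shape) for every group and dataset in an HDF5 file."""
    entries = []

    def visit(name, obj):
        # visititems calls this once per object, with the path relative to the root
        if isinstance(obj, h5py.Dataset):
            entries.append((name, "dataset", obj.shape))
        else:
            entries.append((name, "group", None))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return entries
```

Once the database has been built, calling `list_hdf5_contents("data/t-cell-jointdb.h5")` should show how the bins, embeddings, and tracks are organized.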

## Overview of This Notebook

In this example, we use public CD4+ and CD8+ T cell Hi-C data from ENCODE and enrich it with H3K27ac ChIP-seq signal tracks.

We will:
1. Load the post-processed Hi-C embeddings (Parquet, `.pq`) and annotate them with sample metadata
2. Generate experiment and track metadata files in YAML format
3. Create a `JointDb` HDF5 file combining all datasets

## Input Files

- `HiC-metadata.txt`: Metadata table for CD4+/CD8+ Hi-C samples (ENCODE accessions, biosample info)
- `ChIP-metadata.csv`: Metadata for matched ChIP-seq experiments (e.g., H3K27ac)
- `data/t-cell-demo-output_PCA-32_80000bp_hg38_post_processed.pq`: Post-processed Hi-C embeddings
- BigWig files: Referenced in the ChIP-seq metadata and automatically retrieved

## Output

- `data/t-cell-demo-annotated.pq`: Embeddings with merged sample metadata
- `experiment-metadata.yaml`: YAML describing Hi-C experiments
- `track-metadata.yaml`: YAML describing ChIP-seq signal tracks
- `data/t-cell-jointdb.h5`: The final JointDb HDF5 database

With this setup, you can analyze or visualize how chromatin structure (via Hi-C embeddings) correlates with epigenetic features across T cell types and genomic loci.

In [1]:
import pandas as pd

In [2]:
# Path to the post-processed embeddings produced by jointly-hic
EMBEDDINGS = "./data/t-cell-demo-output_PCA-32_80000bp_hg38_post_processed.pq"

# Load the embeddings
df = pd.read_parquet(EMBEDDINGS)
df.head()

Unnamed: 0,chrom,start,end,weight,good_bin,filename,PCA1,PCA2,PCA3,PCA4,...,leiden_0_3_n500,leiden_0_5_n500,leiden_0_8_n500,leiden_1_0_n500,umap1_n30,umap2_n30,umap1_n100,umap2_n100,umap1_n500,umap2_n500
0,chr1,0,80000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,,,,,,
1,chr1,80000,160000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,,,,,,
2,chr1,160000,240000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,,,,,,
3,chr1,240000,320000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,,,,,,
4,chr1,320000,400000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,,,,,,
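
Each row of the table is one fixed-width genomic bin (80,000 bp here, per the file name and the `start`/`end` columns). As a minimal sketch, assuming half-open `[start, end)` intervals as the table suggests, the bin containing a given position can be computed with integer division. `bin_interval` is a hypothetical helper, not a `jointly-hic` function.

```python
def bin_interval(position, resolution=80_000):
    """Return the (start, end) of the fixed-width bin containing a 0-based position."""
    if position < 0:
        raise ValueError("position must be non-negative")
    start = (position // resolution) * resolution
    return start, start + resolution


print(bin_interval(250_000))  # (240000, 320000)
```

The final bin of a real chromosome is usually shorter than the resolution, so the computed `end` should be capped at the chromosome length when that matters.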


In [3]:
# Load metadata describing ENCODE CD4+/CD8+ Hi-C experiments
meta = pd.read_csv("HiC-metadata.txt", sep="\t")
meta

Unnamed: 0,accession,assay_term_name,assay_title,biosample,biosample_summary,hic_accession
0,ENCSR351NAI,HiC,intact Hi-C,"activated CD8-positive, alpha-beta T cell","Homo sapiens activated CD8-positive, alpha-bet...",ENCFF493SFI
1,ENCSR923PPH,HiC,intact Hi-C,"activated CD4-positive, alpha-beta T cell","Homo sapiens activated CD4-positive, alpha-bet...",ENCFF355VJW
2,ENCSR321BHC,HiC,intact Hi-C,"CD8-positive, alpha-beta T cell","Homo sapiens CD8-positive, alpha-beta T cell m...",ENCFF009ONH
3,ENCSR215PTV,HiC,intact Hi-C,"naive thymus-derived CD4-positive, alpha-beta ...",Homo sapiens naive thymus-derived CD4-positive...,ENCFF520GFL
4,ENCSR335JYP,HiC,intact Hi-C,"CD4-positive, alpha-beta T cell","Homo sapiens CD4-positive, alpha-beta T cell m...",ENCFF958DWQ
5,ENCSR379TIE,HiC,intact Hi-C,"CD4-positive, alpha-beta memory T cell","Homo sapiens CD4-positive, alpha-beta memory T...",ENCFF442IGJ
6,ENCSR421CGL,HiC,intact Hi-C,"activated CD4-positive, alpha-beta T cell","Homo sapiens activated CD4-positive, alpha-bet...",ENCFF962EDB
7,ENCSR195ETV,HiC,intact Hi-C,"activated naive CD4-positive, alpha-beta T cell","Homo sapiens activated naive CD4-positive, alp...",ENCFF571GTF
8,ENCSR458VJJ,HiC,intact Hi-C,"naive thymus-derived CD4-positive, alpha-beta ...",Homo sapiens naive thymus-derived CD4-positive...,ENCFF044TCQ
9,ENCSR291EBI,HiC,intact Hi-C,"activated CD4-positive, alpha-beta memory T cell","Homo sapiens activated CD4-positive, alpha-bet...",ENCFF980NXK


In [4]:
# Derive "hic_accession" from "filename" so the embeddings can be joined to the metadata.
# Take the last path component and drop the extension: "data/ENCFF009ONH.mcool" -> "ENCFF009ONH"
df["hic_accession"] = df["filename"].map(lambda x: x.split("/")[-1].split(".")[0])
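
The accession is the file name with its directories and extension removed. A sketch of the same extraction using `pathlib`, which does not depend on how many directory levels precede the file name, could look like this. `accession_from_path` is a hypothetical helper introduced only for illustration.

```python
from pathlib import PurePosixPath


def accession_from_path(filename):
    """Return the file name without directories or extension(s)."""
    name = PurePosixPath(filename).name
    # Split at the first dot so multi-part extensions are dropped as well
    return name.split(".", 1)[0]


print(accession_from_path("data/ENCFF009ONH.mcool"))  # ENCFF009ONH
```

It can be applied to the column with `df["filename"].map(accession_from_path)`.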

In [5]:
# Merge the sample metadata into the embeddings on the shared "hic_accession" column
df = pd.merge(df, meta, on="hic_accession")
df.head()

Unnamed: 0,chrom,start,end,weight,good_bin,filename,PCA1,PCA2,PCA3,PCA4,...,umap1_n100,umap2_n100,umap1_n500,umap2_n500,hic_accession,accession,assay_term_name,assay_title,biosample,biosample_summary
0,chr1,0,80000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,ENCFF009ONH,ENCSR321BHC,HiC,intact Hi-C,"CD8-positive, alpha-beta T cell","Homo sapiens CD8-positive, alpha-beta T cell m..."
1,chr1,80000,160000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,ENCFF009ONH,ENCSR321BHC,HiC,intact Hi-C,"CD8-positive, alpha-beta T cell","Homo sapiens CD8-positive, alpha-beta T cell m..."
2,chr1,160000,240000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,ENCFF009ONH,ENCSR321BHC,HiC,intact Hi-C,"CD8-positive, alpha-beta T cell","Homo sapiens CD8-positive, alpha-beta T cell m..."
3,chr1,240000,320000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,ENCFF009ONH,ENCSR321BHC,HiC,intact Hi-C,"CD8-positive, alpha-beta T cell","Homo sapiens CD8-positive, alpha-beta T cell m..."
4,chr1,320000,400000,,False,data/ENCFF009ONH.mcool,,,,,...,,,,,ENCFF009ONH,ENCSR321BHC,HiC,intact Hi-C,"CD8-positive, alpha-beta T cell","Homo sapiens CD8-positive, alpha-beta T cell m..."
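
`pd.merge` performs an inner join by default, so bins whose accession is missing from the metadata table are dropped without warning. The sketch below shows one way to list such accessions. It is meant to run on the embeddings as they are before the merge, and `unmatched_accessions` and the toy frames are illustrative names and values, not part of the notebook's data.

```python
import pandas as pd


def unmatched_accessions(embeddings, metadata, key="hic_accession"):
    """Return the sorted key values present in embeddings but missing from metadata."""
    probe = embeddings[[key]].drop_duplicates().merge(
        metadata[[key]].drop_duplicates(), on=key, how="left", indicator=True
    )
    # indicator=True adds a "_merge" column; "left_only" marks keys with no match
    return sorted(probe.loc[probe["_merge"] == "left_only", key])


# Toy frames standing in for the real tables (values are illustrative only)
toy_embeddings = pd.DataFrame({"hic_accession": ["A", "A", "B", "C"]})
toy_metadata = pd.DataFrame({"hic_accession": ["A", "B"]})
print(unmatched_accessions(toy_embeddings, toy_metadata))  # ['C']
```

An empty list means every embedding row will survive the join.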


In [6]:
# Save the embeddings with metadata
df.to_parquet("./data/t-cell-demo-annotated.pq")

In [7]:
# Now, we can extract the metadata as YAML to make the JointDb database
! jointly embedding2yaml --parquet-file ./data/t-cell-demo-annotated.pq \
    --accession-column hic_accession \
    --metadata-columns biosample biosample_summary \
    --yaml-file experiment-metadata.yaml

2025-04-09 10:37:31,329::INFO::main:   Starting joint PCA
2025-04-09 10:37:31,803::INFO::main:   Metadata successfully written to experiment-metadata.yaml
2025-04-09 10:37:31,806::INFO::main:   Finished joint PCA


In [8]:
# Let's also prepare our track metadata from ChIP-metadata.csv
# This file must contain only the following columns: biosample, assay, experiment, accession
! jointly tracks2yaml ChIP-metadata.csv track-metadata.yaml

2025-04-09 10:37:32,589::INFO::main:   Starting joint PCA
2025-04-09 10:37:32,600::INFO::main:   Finished joint PCA
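
Since the track metadata file is expected to contain only the columns `biosample`, `assay`, `experiment`, and `accession`, it can help to check the header before running `tracks2yaml`. This is a minimal sketch using only the standard library. The column list comes from the comment in the cell above; whether `jointly` itself enforces it, or cares about column order, is not shown here. `check_track_header` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = {"biosample", "assay", "experiment", "accession"}


def check_track_header(handle):
    """Return (missing, extra) column-name sets for an open CSV file handle."""
    header = next(csv.reader(handle), [])
    found = {name.strip() for name in header}
    return REQUIRED_COLUMNS - found, found - REQUIRED_COLUMNS


# Illustrative in-memory CSV; the row values are placeholders, not real accessions
sample = io.StringIO("biosample,assay,experiment,accession,notes\nx,y,z,w,v\n")
missing, extra = check_track_header(sample)
print(sorted(missing), sorted(extra))  # [] ['notes']
```

For the real file, pass `open("ChIP-metadata.csv", newline="")` as the handle; both returned sets should be empty.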


In [9]:
# Now we can create a JointDb database
! jointly hdf5db --experiments experiment-metadata.yaml \
    --tracks track-metadata.yaml \
    --embeddings ./data/t-cell-demo-annotated.pq \
    --accession hic_accession \
    --output data/t-cell-jointdb.h5

2025-04-09 10:37:33,307::INFO::main:   Starting joint PCA
2025-04-09 10:37:33,307::INFO::main:   Creating HDF5 file at data/t-cell-jointdb.h5
2025-04-09 10:37:33,347::INFO::create_hdf5:   HDF5 file created at data/t-cell-jointdb.h5
2025-04-09 10:37:33,347::INFO::main:   Loading and storing bins...
2025-04-09 10:37:33,720::INFO::store_bins:   Bin information stored successfully.
2025-04-09 10:37:33,721::INFO::main:   Loading and storing experiment metadata...
2025-04-09 10:37:33,725::INFO::store_experiment_metadata:   Experiment metadata stored successfully.
2025-04-09 10:37:33,725::INFO::main:   Loading and storing embeddings...
2025-04-09 10:37:35,721::INFO::store_embeddings:   [1/52] Storing embeddings for PCA1.
2025-04-09 10:37:36,244::INFO::store_embeddings:   [2/52] Storing embeddings for PCA2.
2025-04-09 10:37:36,762::INFO::store_embeddings:   [3/52] Storing embeddings for PCA3.
2025-04-09 10:37:37,278::INFO::store_embeddings:   [4/52] Storing embeddings for PCA4.
2025-04-09 10:3