# Analyzing T-Cell Hi-C Data with `jointly-hic`

This notebook demonstrates how to use the `jointly-hic` toolkit to jointly embed Hi-C contact matrices from public CD4+ and CD8+ T cell data available through the ENCODE portal.

## Background

Hi-C is a genome-wide chromosome conformation capture method that quantifies 3D chromatin contacts. It provides a matrix of contact frequencies between genomic bins. By embedding these matrices into a low-dimensional space, we can compare 3D chromatin structure across different conditions and cell types.
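
To make the idea of embedding contact matrices concrete, here is a minimal sketch using NumPy on toy data. The random matrices, the bin count, and the choice of two components are all assumptions made for illustration. The sketch shows plain PCA via SVD on row-stacked matrices; it is not the `jointly-hic` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 6  # hypothetical number of genomic bins


def toy_contacts(rng, n_bins):
    """Random symmetric matrix standing in for one sample's contact map."""
    m = rng.random((n_bins, n_bins))
    return (m + m.T) / 2


# Two hypothetical samples over the same bins, stacked row-wise
samples = [toy_contacts(rng, n_bins) for _ in range(2)]
stacked = np.vstack(samples)  # shape: (2 * n_bins, n_bins)

# PCA via SVD of the column-centered matrix; keep the first 2 components
centered = stacked - stacked.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
embedding = u[:, :2] * s[:2]  # one 2-D coordinate per (sample, bin) row

print(embedding.shape)
```

Each row of `embedding` corresponds to one bin in one sample, so the same bin can be compared across samples in the shared coordinate system.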

In this demo, we:
- Convert ENCODE portal `.hic` files to `.cool` format using [`hictk`](https://github.com/paulsengroup/hictk)
- Generate multi-resolution `.mcool` files using `cooler zoomify`
- Jointly embed the resulting matrices with `jointly embed`

## Setup

To run this notebook, you will need the following Python packages and command-line tools:

### Python Packages
- `jointly-hic`: For joint embedding and analysis of Hi-C data
- `cooler`: For handling `.cool` and `.mcool` Hi-C files
- `cooltools`: Additional utilities for Hi-C file operations and balancing
- `pandas`, `numpy`, etc. (installed as dependencies)

Install them using pip:

```bash
pip install jointly-hic cooler cooltools
```

### Command-Line Tools -- hictk
`hictk` converts `.hic` files (the binary format produced by Juicer) to `.cool` format.

Install via conda:

```bash
conda install bioconda::hictk
```
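
Before running the notebook, it can help to confirm that the command-line tools are on your `PATH`. The following is a small sketch using only the standard library; the helper name `missing_tools` is hypothetical, and the tool names are the executables invoked later in this notebook.

```python
import shutil


def missing_tools(tools=("hictk", "cooler", "jointly")):
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing command-line tools:", ", ".join(missing))
else:
    print("All required command-line tools were found.")
```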

In [1]:
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import pandas as pd

from jointly_hic.notebook_utils.encode_utils import EncodeFile

In [2]:
# Load metadata describing ENCODE CD4+/CD8+ Hi-C experiments
meta = pd.read_csv("HiC-metadata.txt", sep="\t")
meta

| Unnamed: 0 | accession | assay_term_name | assay_title | biosample | biosample_summary | hic_accession |
|---|---|---|---|---|---|---|
| 0 | ENCSR351NAI | HiC | intact Hi-C | activated CD8-positive, alpha-beta T cell | Homo sapiens activated CD8-positive, alpha-bet... | ENCFF493SFI |
| 1 | ENCSR923PPH | HiC | intact Hi-C | activated CD4-positive, alpha-beta T cell | Homo sapiens activated CD4-positive, alpha-bet... | ENCFF355VJW |
| 2 | ENCSR321BHC | HiC | intact Hi-C | CD8-positive, alpha-beta T cell | Homo sapiens CD8-positive, alpha-beta T cell m... | ENCFF009ONH |
| 3 | ENCSR215PTV | HiC | intact Hi-C | naive thymus-derived CD4-positive, alpha-beta ... | Homo sapiens naive thymus-derived CD4-positive... | ENCFF520GFL |
| 4 | ENCSR335JYP | HiC | intact Hi-C | CD4-positive, alpha-beta T cell | Homo sapiens CD4-positive, alpha-beta T cell m... | ENCFF958DWQ |
| 5 | ENCSR379TIE | HiC | intact Hi-C | CD4-positive, alpha-beta memory T cell | Homo sapiens CD4-positive, alpha-beta memory T... | ENCFF442IGJ |
| 6 | ENCSR421CGL | HiC | intact Hi-C | activated CD4-positive, alpha-beta T cell | Homo sapiens activated CD4-positive, alpha-bet... | ENCFF962EDB |
| 7 | ENCSR195ETV | HiC | intact Hi-C | activated naive CD4-positive, alpha-beta T cell | Homo sapiens activated naive CD4-positive, alp... | ENCFF571GTF |
| 8 | ENCSR458VJJ | HiC | intact Hi-C | naive thymus-derived CD4-positive, alpha-beta ... | Homo sapiens naive thymus-derived CD4-positive... | ENCFF044TCQ |
| 9 | ENCSR291EBI | HiC | intact Hi-C | activated CD4-positive, alpha-beta memory T cell | Homo sapiens activated CD4-positive, alpha-bet... | ENCFF980NXK |
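
The `biosample` strings are long, so a compact label per sample can make later plots easier to read. Below is one possible sketch in plain Python. The helper `short_label` and the label vocabulary (`CD4`, `CD8`, `activated`, `not_activated`) are assumptions introduced here, not part of `jointly-hic` or the ENCODE metadata.

```python
def short_label(biosample: str) -> str:
    """Derive a compact lineage/state label from an ENCODE biosample string."""
    if "CD8-positive" in biosample:
        lineage = "CD8"
    elif "CD4-positive" in biosample:
        lineage = "CD4"
    else:
        lineage = "other"
    # Activated samples in the metadata start with the word "activated"
    state = "activated" if biosample.startswith("activated") else "not_activated"
    return f"{lineage}-{state}"


examples = [
    "activated CD8-positive, alpha-beta T cell",
    "CD4-positive, alpha-beta memory T cell",
]
labels = [short_label(b) for b in examples]
print(labels)
```

With the metadata table loaded, the same function could be applied column-wise, for example `meta["label"] = meta["biosample"].map(short_label)`.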


## Preprocessing: Converting .hic to .mcool

ENCODE provides Hi-C data in `.hic` format. For compatibility with `jointly-hic`, we first convert each file to a single-resolution, unnormalized `.cool` file at 10 kb using `hictk`, then apply `cooler zoomify` to generate a balanced, multi-resolution `.mcool` file.

In [None]:
# Set up output directory
output_dir = Path("./data")
output_dir.mkdir(parents=True, exist_ok=True)


def convert_to_mcool(row):
    """Download an ENCODE Hi-C file and convert it to .mcool for jointly-hic."""
    accession = row.hic_accession
    biosample = row.biosample
    print(f"[{accession}] Starting conversion ({biosample})...")

    with EncodeFile(accession, "hic") as local_hic_path:
        hic_path = Path(local_hic_path)
        mcool_path = output_dir / f"{accession}.mcool"
        temp_cool = mcool_path.with_suffix(".cool")

        try:
            # Step 1: Convert .hic to .cool
            cmd_convert = [
                "hictk",
                "convert",
                "--output-fmt",
                "cool",
                "--resolutions",
                "10000",
                "--normalization-methods",
                "NONE",
                str(hic_path),
                str(temp_cool),
            ]
            print(f"[{accession}] Running: {' '.join(cmd_convert)}")
            subprocess.run(cmd_convert, check=True)

            # Step 2: Zoomify and balance
            cmd_zoomify = [
                "cooler",
                "zoomify",
                "--balance",
                "--nproc",
                "8",
                str(temp_cool),
                "--out",
                str(mcool_path),
            ]
            print(f"[{accession}] Running: {' '.join(cmd_zoomify)}")
            subprocess.run(cmd_zoomify, check=True)

            print(f"[{accession}] Done → {mcool_path.name}")
            return accession, "success"

        except subprocess.CalledProcessError as e:
            print(f"[{accession}] FAILED: {e}")
            return accession, "error"


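# Run all conversions in parallel.
# Each job also runs `cooler zoomify --nproc 8`, so concurrent jobs can
# oversubscribe CPU cores; lower max_workers on smaller machines.
with ThreadPoolExecutor(max_workers=12) as executor:
    futures = [executor.submit(convert_to_mcool, row) for row in meta.itertuples()]
    for future in as_completed(futures):
        accession, status = future.result()
        print(f"[{accession}] {status}")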
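
The `(accession, status)` pairs returned by each job can also be collected into a summary, which makes failed downloads or conversions easy to spot and retry. The sketch below uses a hypothetical stand-in function, `fake_convert`, and made-up accession names so that it runs without any downloads; the collection pattern is the part that carries over.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_convert(accession):
    """Hypothetical stand-in for convert_to_mcool: fails for one accession."""
    status = "error" if accession == "ENCFF_B" else "success"
    return accession, status


accessions = ["ENCFF_A", "ENCFF_B", "ENCFF_C"]  # placeholder names
results = {}
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(fake_convert, a) for a in accessions]
    for future in as_completed(futures):
        accession, status = future.result()
        results[accession] = status

# Summarize outcomes and list the accessions that need a retry
counts = Counter(results.values())
failed = sorted(a for a, s in results.items() if s != "success")
print(f"{counts['success']} succeeded, {counts['error']} failed: {failed}")
```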

## Joint Embedding of T Cell Hi-C Matrices

Now that we have preprocessed `.mcool` files for each ENCODE sample, we can jointly embed these contact matrices into a low-dimensional vector space using `jointly-hic`.

The `jointly embed` command performs out-of-core matrix decomposition using PCA (default), NMF, or SVD. It stacks contact matrices from all samples at a specified resolution and learns a shared representation that preserves biological variation across samples.

We'll run it at a fairly coarse resolution of 320 kb. Higher resolutions such as 50 kb or even 25 kb are often preferred when the data can support them, but 320 kb gives results faster.
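
The choice of 320 kb also has to match a resolution that is actually stored in the `.mcool` files. As far as we understand the `cooler zoomify` documentation, when `--resolutions` is not given it coarsens the base resolution in factor-of-2 steps; treat that default as an assumption and check it against your installed version. Under that assumption, the sketch below lists the resolutions derived from a 10 kb base. The helper `binary_ladder` and the 1 Mb cutoff are illustrative choices only.

```python
def binary_ladder(base: int, max_resolution: int) -> list[int]:
    """Resolutions of the form base * 2**k that do not exceed max_resolution."""
    ladder = []
    res = base
    while res <= max_resolution:
        ladder.append(res)
        res *= 2
    return ladder


ladder = binary_ladder(10_000, 1_000_000)
print(ladder)
for target in (25_000, 50_000, 320_000):
    verdict = "on the ladder" if target in ladder else "not on the ladder"
    print(f"{target}: {verdict}")
```

If this assumption holds, 320 kb is available from the conversion above, while 50 kb would need to be requested explicitly through `--resolutions`, and 25 kb would need a base resolution that divides it evenly.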

This embedding can then be used for:
- Dimensionality reduction and visualization
- Clustering and trajectory inference
- Integration with RNA-seq, ATAC-seq, or ChIP-seq via JointDb

In this example, we run `jointly embed` with:
- Input: all `.mcool` files in `./data/`
- Resolution: 320 kb
- Assembly: `hg38`
- Method: PCA
- Components: 32

In [None]:
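# Gather list of all mcool files
mcool_files = sorted(Path("./data").glob("*.mcool"))
if not mcool_files:
    raise FileNotFoundError("No .mcool files found in ./data; run the conversion step first.")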

cmd = [
    "jointly",
    "embed",
    "--mcools",
    *map(str, mcool_files),
    "--resolution",
    "320000",
    "--assembly",
    "hg38",
    "--method",
    "PCA",
    "--components",
    "32",
    "--output",
    "data/t-cell-demo-output",
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)
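
Once the command finishes, a quick way to see what was produced is to list the files that share the output prefix. This sketch assumes `--output` is used as a filename prefix for the result files, which we have not verified here; if it is treated as a directory instead, list that directory directly. The helper `list_outputs` is hypothetical.

```python
from pathlib import Path


def list_outputs(prefix: str) -> list[Path]:
    """Return files whose names start with the given output prefix."""
    prefix_path = Path(prefix)
    if not prefix_path.parent.is_dir():
        return []
    matches = prefix_path.parent.glob(prefix_path.name + "*")
    return sorted(p for p in matches if p.is_file())


for path in list_outputs("data/t-cell-demo-output"):
    print(f"{path.name}\t{path.stat().st_size} bytes")
```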