# Analyzing Breast Tissue (Normal vs. Cancer) Hi-C Data with `jointly-hic`

This notebook demonstrates how to use the `jointly-hic` toolkit to jointly embed Hi-C contact matrices derived from breast tissue samples, as published in *Choppavarapu et al.*

## Background

Hi-C is a genome-wide chromosome conformation capture technique that quantifies 3D chromatin contacts by generating matrices of contact frequencies between genomic bins. By embedding these matrices into a low-dimensional space, we can compare chromatin architecture across different conditions and cell types.

In this demonstration, we:
- Map Hi-C data from 12 breast tissue samples from *Choppavarapu et al.* using [`distiller-nf`](https://github.com/open2c/distiller-nf).
- Prepare the mapped data for embedding and analysis using `jointly-hic`.

## Hi-C Data Source and Mapping

The raw and processed Hi-C data for the breast tissue samples were originally published by *Choppavarapu et al.* and are available through the NCBI Gene Expression Omnibus (GEO) under accession number [**GSE261230**](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE261230).

To map the raw Hi-C data, we used the **SRA Run (SRR) identifiers** provided in the GEO entry to download the original FASTQ files. These were then processed using the `distiller-nf` pipeline.

Once the pipeline is set up, each sample can be mapped by referencing its corresponding SRR ID within the `project.yml` configuration file.
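
As a rough illustration of how SRR identifiers might be wired into `project.yml`, the Python sketch below renders an `input.raw_reads_paths` block. It assumes distiller-nf's `sra:<accession>` path syntax. The sample names and the `SRR0000001`-style accessions are placeholders, not the real runs from GSE261230, and the rest of the configuration (genome, binning, and so on) is omitted.

```python
# Hypothetical sample-to-run mapping; replace with the SRR IDs listed under GSE261230.
sample_runs = {
    "sample_01": ["SRR0000001"],
    "sample_02": ["SRR0000002", "SRR0000003"],
}


def render_raw_reads_paths(runs_by_sample):
    """Render the `input: raw_reads_paths:` block of a distiller-style project.yml."""
    lines = ["input:", "    raw_reads_paths:"]
    for sample, runs in runs_by_sample.items():
        lines.append(f"        {sample}:")
        # Each run is placed in its own lane (lane1, lane2, ...).
        for lane_number, run in enumerate(runs, start=1):
            lines.append(f"            lane{lane_number}:")
            lines.append(f"                - sra:{run}")
    return "\n".join(lines)


yaml_fragment = render_raw_reads_paths(sample_runs)
print(yaml_fragment)
```

The printed fragment can be compared against the `project.yml` shipped with this demo before launching the pipeline.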

## Setup

To run this notebook, you will need the following Python packages and command-line tools:

### Python Packages
- `jointly-hic`: For joint embedding and analysis of Hi-C data.
- `cooler`: For handling `.cool` and `.mcool` Hi-C files.
- `cooltools`: Additional utilities for Hi-C file operations and balancing.
- `pandas`, `numpy`, and other common dependencies.

Install them using `pip`:

```bash
pip install jointly-hic cooler cooltools pandas numpy
```
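
To confirm the installation before running the notebook, a quick check like the one below may help. It is a minimal sketch: the module names are assumptions based on each package's usual import name (`jointly_hic` is the one used in the import cell of this notebook).

```python
import importlib.util

# Import names are assumptions based on each package's usual module name.
required_modules = ["jointly_hic", "cooler", "cooltools", "pandas", "numpy"]


def find_missing(modules):
    """Return the modules that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(required_modules)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```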

### Command-Line Tools: `distiller-nf`

The `distiller-nf` tool is a modular Hi-C mapping pipeline. To set up a new project, execute the following command in your project folder:

```bash
nextflow clone open2c/distiller-nf ./
```

After cloning, follow the setup instructions in the [distiller-nf GitHub repository](https://github.com/open2c/distiller-nf).  
The folder for this demo also includes the `project.yml` file, which can be used with the following command:

```bash
nextflow run distiller.nf -params-file project.yml
```


In [4]:
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import pandas as pd

from jointly_hic.notebook_utils.encode_utils import EncodeFile

## Joint Embedding of Breast Tissue Hi-C Matrices

We have obtained `.mcool` files for 12 breast tissue samples. The next step is to jointly embed these contact matrices into a low-dimensional vector space using the `jointly-hic` tool.

The `jointly embed` command performs out-of-core matrix decomposition using PCA (default), NMF, or SVD. It stacks contact matrices from all samples at a specified resolution and learns a shared representation that preserves biological variation across samples.
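
The idea of stacking matrices and learning a shared basis can be sketched with a toy example. The code below is a conceptual, in-memory illustration using NumPy and random data. It is not the `jointly-hic` implementation, which works out-of-core, and the row-wise stacking layout is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bins, n_components = 3, 40, 4

# Toy symmetric "contact matrices", one per sample (random stand-ins for real Hi-C data).
matrices = []
for _ in range(n_samples):
    raw = rng.random((n_bins, n_bins))
    matrices.append((raw + raw.T) / 2)

# Stack vertically: each row is one genomic bin from one sample.
stacked = np.vstack(matrices)  # shape: (n_samples * n_bins, n_bins)

# PCA via SVD of the column-centered stacked matrix.
centered = stacked - stacked.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:n_components]       # shared basis learned across all samples
embedding = centered @ components.T  # shape: (n_samples * n_bins, n_components)

# Split the joint embedding back into per-sample blocks.
per_sample = embedding.reshape(n_samples, n_bins, n_components)
print(per_sample.shape)
```

Because every sample is projected onto the same components, the coordinates of a given bin are directly comparable across samples.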

Since the Hi-C data are generated from human samples, we will set the resolution to **50kb**.
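
Before embedding, it can be useful to confirm that each `.mcool` file actually contains the 50kb resolution. The sketch below assumes the standard `.mcool` layout, in which each resolution is stored under a `/resolutions/<binsize>` group, and uses `cooler.fileops.list_coolers` to list those groups. The helper names are hypothetical.

```python
def available_resolutions(group_paths):
    """Extract integer bin sizes from group paths such as '/resolutions/50000'."""
    resolutions = []
    for path in group_paths:
        parts = path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "resolutions" and parts[1].isdigit():
            resolutions.append(int(parts[1]))
    return sorted(resolutions)


def has_resolution(mcool_path, resolution=50_000):
    """Check whether an .mcool file stores the requested resolution."""
    import cooler  # imported lazily so the parser above works without cooler installed

    groups = cooler.fileops.list_coolers(str(mcool_path))
    return resolution in available_resolutions(groups)


# Example with hand-written group paths (no file needed):
example = available_resolutions(
    ["/resolutions/10000", "/resolutions/50000", "/resolutions/1000"]
)
print(example)
```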

### Applications of this Embedding:
- **Dimensionality reduction** and visualization
- **Clustering** and trajectory inference
- **Integration** with RNA-seq, ATAC-seq, or ChIP-seq via **JointDb**

### In this example, we will run `jointly embed` with the following parameters:
- **Input**: All `.mcool` files located in the `./data/` directory
- **Resolution**: 50kb
- **Assembly**: `hg38`
- **Method**: PCA
- **Components**: 32

Since these are clinical samples from female donors, we exclude **chrY** and **chrM** from the embedding by limiting the analysis to the first 23 chromosomes (chr1 to chr22 plus chrX). This reduces noise and improves the quality of the output.
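
The command used later in this notebook passes `--chrom-limit 23`. The short sketch below shows which chromosomes such a limit would keep, assuming the limit selects the first N chromosomes and that they follow the canonical hg38 order. Both points are assumptions for illustration.

```python
# Assumed canonical hg38 ordering; the real order comes from the chromosome sizes in use.
canonical_chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]

chrom_limit = 23
kept = canonical_chroms[:chrom_limit]
dropped = canonical_chroms[chrom_limit:]

print("Kept:", kept[0], "...", kept[-1])
print("Dropped:", dropped)
```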


In [None]:
# Set up output directory
output_dir = Path("./data")
output_dir.mkdir(parents=True, exist_ok=True)

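# Gather all .mcool files from the data directory
mcool_files = sorted(output_dir.glob("*.mcool"))
if not mcool_files:
    raise FileNotFoundError(f"No .mcool files found in {output_dir}")

# Run `jointly embed`, excluding chrY and chrM by setting --chrom-limit to 23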

cmd = [
    "jointly",
    "embed",
    "--mcools",
    *map(str, mcool_files),
    "--resolution",
    "50000",
    "--assembly",
    "hg38",
    "--method",
    "PCA",
    "--components",
    "32",
    "--chrom-limit",
    "23",
    "--output",
    "data/breast-tissue-demo-output",
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)

# The same command can also be run directly in a terminal.
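
If the command is going to be pasted into a terminal, joining the arguments with plain spaces can break on paths that contain spaces. A minimal sketch using the standard library's `shlex.join` is shown below. The two input paths are hypothetical and chosen only to show the quoting.

```python
import shlex

# Same arguments as the cell above, with two hypothetical input paths for illustration.
example_cmd = [
    "jointly", "embed",
    "--mcools", "data/sample 01.mcool", "data/sample_02.mcool",
    "--resolution", "50000",
    "--assembly", "hg38",
    "--method", "PCA",
    "--components", "32",
    "--chrom-limit", "23",
    "--output", "data/breast-tissue-demo-output",
]

# shlex.join quotes any argument that the shell would otherwise split or interpret.
shell_command = shlex.join(example_cmd)
print(shell_command)
```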