<a href="https://colab.research.google.com/github/Noble-Lab/HiCFoundation/blob/main/HiCFoundation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HiCFoundation: a generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species

<a href="https://github.com/Noble-Lab/HiCFoundation">
   <img src="https://img.shields.io/badge/HiCFoundation-v1.0.0-green">
   <img src="https://img.shields.io/badge/platform-Linux%20%7C%20Mac%20-green">
   <img src="https://img.shields.io/badge/Language-python3-green">
   <img src="https://img.shields.io/badge/dependencies-tested-green">
</a>  

HiCFoundation is a generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species.

Copyright (C) 2024 Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, and Sheng Wang

License: Apache License 2.0

Contact:  Sergei Doulatov (doulatov@uw.edu) & William Stafford Noble (wnoble@uw.edu) & Sheng Wang (swang@cs.washington.edu)

For technical problems or questions, please reach out to Xiao Wang (wang3702@uw.edu) and Yuanyuan Zhang (zhang038@purdue.edu).

**We strongly recommend using Google Chrome for HiCFoundation Colab. Other browsers such as Safari may raise errors when uploading or downloading files.**

If you are using other browsers, disabling tracking protection may help resolve the errors when uploading or downloading files.

For more details, see the **<a href="#Instructions">Instructions</a>** section of this notebook and check out the **[HiCFoundation GitHub](https://github.com/Noble-Lab/HiCFoundation)**. If you use HiCFoundation, please cite it: **<a href="#Citation">Citation</a>**.

# Overall Protocol
1) Pre-training stage: the model is trained in a self-supervised fashion on massive quantities of unlabeled Hi-C data. The model takes masked Hi-C submatrices as input, optimizing for the reconstruction of the full submatrix.

2) Fine-tuning stage: the model is fine-tuned and tested for diverse downstream applications, including genome architecture analysis, multi-species analysis, neutrophil differentiation analysis, multi-omics analysis and single-cell analysis.
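
The masked-input idea behind the pre-training stage can be illustrated with a small sketch. This is not the HiCFoundation training code: the helper name `mask_patches`, the 75% mask ratio, the zero fill value, and the toy Poisson matrix are all assumptions chosen for illustration. It only shows how a 224$\times$224 submatrix splits into 16$\times$16 patches and how a random subset of them can be hidden, leaving the full submatrix as the reconstruction target.

```python
import numpy as np


def mask_patches(submatrix, patch_size=16, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping patches (illustrative only).

    Returns the masked copy and a boolean grid marking which patches were hidden.
    patch_size, mask_ratio and the zero fill value are assumptions of this sketch.
    """
    height, width = submatrix.shape
    if height % patch_size or width % patch_size:
        raise ValueError("submatrix dimensions must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size

    # Choose which patches to hide, without replacement
    rng = np.random.default_rng(seed)
    num_masked = int(round(rows * cols * mask_ratio))
    chosen = rng.choice(rows * cols, size=num_masked, replace=False)
    patch_mask = np.zeros(rows * cols, dtype=bool)
    patch_mask[chosen] = True
    patch_mask = patch_mask.reshape(rows, cols)

    # Expand the patch-level mask to pixel level and blank the hidden pixels
    pixel_mask = np.repeat(np.repeat(patch_mask, patch_size, axis=0), patch_size, axis=1)
    masked = np.where(pixel_mask, 0.0, submatrix)
    return masked, patch_mask


# Toy contact counts standing in for a real Hi-C submatrix
toy = np.random.default_rng(1).poisson(5.0, size=(224, 224)).astype(float)
masked, patch_mask = mask_patches(toy)
print("patch grid:", patch_mask.shape, "hidden patches:", int(patch_mask.sum()))
```

In such a setup, a model would receive `masked` as input and be trained to recover `toy`.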


<p align="center">
  <img src="https://github.com/Noble-Lab/HiCFoundation/raw/main/imgs/framework_github.png" alt="HiCFoundation framework" width="70%">
</p>

# Instructions <a name="Instructions"></a>
## Steps
1. Connect to a **GPU machine** by clicking the **"Connect"** button at the top right of the notebook so that HiCFoundation can run with GPU support. We strongly recommend connecting to an **A100** with **high memory** (this is a paid option).<br>
You can also connect to a local runtime or a GCE VM on Google Cloud.
2. Click the left running button in <a href="#Dependency">Install Dependencies</a> to install dependencies.
3. Upload a .hic/.cool/.pairs file in <a href="#file">Input file</a> for inference.
4. Specify the parameters in <a href="#Param">Parameters</a>. After any modification, please click the left running button to apply the configuration. If you keep the default values, you still need to run this cell.
5. Run HiCFoundation by clicking the left running button in <a href="#Running">Run HiCFoundation</a>.
6. Download HiCFoundation output. Please click the left running button in <a href="#Download">Download</a> to download the zip files. <br>
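
Before running step 2, it can help to confirm that the runtime actually sees a GPU. The following is a minimal sketch, assuming the `nvidia-smi` tool is on the `PATH` of GPU runtimes; the helper name `gpu_visible` is hypothetical and not part of HiCFoundation.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools on PATH, so most likely a CPU-only runtime
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, reconnect to a GPU runtime before installing dependencies.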

PS: For the **reproducibility task**, please run steps 3-6 for each Hi-C map of interest to obtain its embedding file in .pkl format. Save these files to your local computer; you can then upload them to the <a href="https://github.com/Noble-Lab/HiCFoundation/blob/main/Reproducibility.ipynb">Repro Colab</a> to calculate the similarity.
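
As a rough illustration of comparing embeddings from two Hi-C maps, the sketch below averages the cosine similarity of matching submatrix embeddings. The actual file layout and similarity measure are defined by the Repro Colab; here the `(n_submatrices, dim)` array layout, the helper name `mean_cosine_similarity`, and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def mean_cosine_similarity(emb_a, emb_b, eps=1e-12):
    """Average cosine similarity between matching rows of two embedding arrays.

    Both inputs are assumed to have shape (n_submatrices, dim), with row i of
    each array describing the same genomic location (an assumption of this sketch).
    """
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("expected two 2-D arrays of identical shape")
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Guard against division by zero for all-zero embeddings
    return float(np.mean(dots / np.maximum(norms, eps)))


# Synthetic stand-ins for embeddings loaded from saved files
rng = np.random.default_rng(0)
map_a = rng.normal(size=(8, 32))
replicate = map_a + 0.05 * rng.normal(size=(8, 32))  # near copy of map_a
unrelated = rng.normal(size=(8, 32))                 # independent sample
print("replicate similarity:", mean_cosine_similarity(map_a, replicate))
print("unrelated similarity:", mean_cosine_similarity(map_a, unrelated))
```

In this toy setting the near copy scores close to 1, while the independent sample scores much lower.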

In [None]:
#@title <a name="Dependency">Install dependencies</a>
#@markdown Please make sure the notebook is already connected to a **GPU**; HiCFoundation needs GPU support to run.
#@markdown <br> This step restarts the kernel and may report some errors. This is expected; please ignore them and let the cell finish.

%%shell
cd /content
git clone https://github.com/Noble-Lab/HiCFoundation --quiet
cd HiCFoundation

pip install -q condacolab
python3 -c "import condacolab;condacolab.install()"
eval "$(conda shell.bash hook)"
conda env create -f /content/HiCFoundation/environment_notorch.yml
conda activate HiCFoundation
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install timm==0.3.2
cd /content/HiCFoundation
cd hicfoundation_model
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_pretrain.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_reproducibility.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_loop.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_loop_lc.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_resolution.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_epigenomic.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_schic.pth.tar
cd ..


In [None]:
#@title  <a name="file">Input file</a>
from google.colab import files
import os
import os.path
import re
import hashlib
import random
import string
from google.colab import drive

from datetime import datetime
# Get the current date and time
current_datetime = datetime.now()
# Convert to string in desired format
current_datetime_str = current_datetime.strftime("%Y-%m-%d-%H-%M-%S")


rand_letters = string.ascii_lowercase
rand_letters = ''.join(random.choice(rand_letters) for i in range(20))
#@markdown ## Option 1: download from database <br>
#@markdown Instead of uploading, you can also specify the link here to automatically download maps from ENCODE and other servers.
#@markdown Example: https://www.encodeproject.org/files/ENCFF689CUX/@@download/ENCFF689CUX.hic
download_link = '' #@param {type:"string"}
output_dir="/content/"
if download_link!='':
  root_dir = os.getcwd()
  upload_dir = os.path.join(root_dir,rand_letters)
  if not os.path.exists(upload_dir):
    os.mkdir(upload_dir)
  os.chdir(upload_dir)
  os.system("wget %s"%download_link)
  parse_link=download_link.split("/")[-1]
  hic_input_path = os.path.join(upload_dir,parse_link)
  os.chdir(root_dir)
  output_dir="/content/hicfoundation_output/"

else:
  #@markdown ## Option 2: Use files in Google Drive
  #@markdown If you want to use Google Drive, please enter the path of the file within your Google Drive.
  #@markdown <br> This is recommended when your file is larger than 1GB.
  #@markdown <br> Please authorize Google Colab to access your Google Drive; a window will pop up for authentication.
  #@markdown <br> **The Colab session is private to you and Google. Developers cannot see your data, so privacy is guaranteed.**
  #@markdown <br> The output is saved directly to your Google Drive, where you can download and analyze it.
  google_drive_file = "" #@param {type:"string"}

  if google_drive_file=="":
    #@markdown ## Option 3: Upload from your local file system
    #@markdown This is not recommended for big files. Please only consider this when the input file is small (<=100 MB).
    print("Please upload your input files")
    os.chdir("/content/")
    root_dir = os.getcwd()
    upload_dir = os.path.join(root_dir,rand_letters)
    if not os.path.exists(upload_dir):
      os.mkdir(upload_dir)
    os.chdir(upload_dir)
    map_input = files.upload()
    for fn in map_input.keys():
      print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(map_input[fn])))
      hic_input_path = os.path.abspath(fn)
      print("The input was saved to %s"%hic_input_path)
    os.chdir(root_dir)
    output_dir="/content/hicfoundation_output/"

  else:

    #loading the data from google drive

    drive.mount('/content/drive')
    hic_input_path = "/content/drive/My Drive/"+google_drive_file
    if not os.path.exists(hic_input_path):
      print(f"The specified path {hic_input_path} does not contain the input file.")
      print("You may consider simply uploading your file to the root directory of your Google Drive. Then you only need to enter the file name in the field.")
      print("If you still encounter problems, please contact us!")
      exit()
    print(f"Google drive path is available at {hic_input_path}")
    output_dir=f"/content/drive/My Drive/hicfoundation_output_{current_datetime_str}"
os.makedirs(output_dir,exist_ok=True)
print(f"HiCFoundation output will be redirected to {output_dir}")


In [None]:

#@title  <a name="Param">Configure parameters</a>
#@markdown We support six different tasks; please select the one that fits your needs.
#@markdown - Reproducibility analysis: HiCFoundation will generate embeddings of the input Hi-C, and the submatrix embeddings can be used to compare across biological replicates and non-replicates.
#@markdown - Chromatin loop detection: HiCFoundation will generate loop calls for the input Hi-C in .bedpe format.
#@markdown - Resolution enhancement: HiCFoundation will generate an enhanced Hi-C map given the input Hi-C.
#@markdown - Epigenomic assay profiling: HiCFoundation will generate the corresponding epigenomic assays in .bigWig format given the input Hi-C.
#@markdown - Single-cell Hi-C enhancement: HiCFoundation will generate the enhanced scHi-C given the input scHi-C.
#@markdown - Hi-C Embedding: Generates different levels of embeddings for Hi-C:
#@markdown  - patch level embedding: an embedding vector corresponds to a 16*16 patch space at the specified resolution.
#@markdown  - submatrix level embedding: an embedding vector corresponds to the specified submatrix at the specified resolution.
#@markdown  - chromosome level embedding: embedding vectors correspond to different chromosomes at the specified resolution.
#@markdown  - genome wide embedding: an embedding vector corresponds to the input Hi-C at the specified resolution.

task = "Hi-C Embedding" # @param ["Reproducibility analysis", "Loop detection", "Resolution enhancement","Epigenomic profiling","Single-cell Hi-C","Hi-C Embedding"]
task_name=task
task_list=["Reproducibility analysis", "Loop detection", "Resolution enhancement","Epigenomic profiling","Single-cell Hi-C","Hi-C Embedding"]
task = task_list.index(task)+1
root_dir="/content/HiCFoundation/hicfoundation_model/"
if task==1:
  model_path=root_dir+"hicfoundation_reproducibility.pth.tar"
elif task==2:
  model_path=root_dir+"hicfoundation_loop.pth.tar"
elif task==3:
  model_path=root_dir+"hicfoundation_resolution.pth.tar"
elif task==4:
  model_path=root_dir+"hicfoundation_epigenomic.pth.tar"
elif task==5:
  model_path=root_dir+"hicfoundation_schic.pth.tar"
else:
  model_path=root_dir+"hicfoundation_pretrain.pth.tar"
#configure input size
#@markdown Configure the input submatrix size of HiCFoundation. <br>
#@markdown For the "Reproducibility analysis", "Loop detection", "Resolution enhancement", and "Single-cell Hi-C" tasks, the suggested input size is 224$\times$224.
#@markdown <br> For "Epigenomic profiling", the suggested input size is 128$\times$4000.
#@markdown <br> For "Hi-C Embedding", please choose the submatrix size that you are interested in. Please make sure both numbers are multiples of 16.

submat_height = 128 #@param {type:"number"}
submat_width = 4000 #@param {type:"number"}
#@markdown Configure the scanning stride for the input Hi-C matrix. The default should apply to most settings.
stride = 32 #@param {type:"number"}
#@markdown Configure the scanning boundary for the input Hi-C matrix. The default of 0 indicates scanning along the diagonal, which should work for most tasks. You can adjust it based on your needs.
scan_boundary = 0 #@param {type:"number"}
#@markdown Configure the batch size for HiCFoundation inference.
batch_size = 1 #@param {type:"number"}
#@markdown Configure resolution for inference:
#@markdown - Chromatin loop detection/Resolution enhancement: 10000 (10kb).
#@markdown - Reproducibility analysis: 25000 (25kb)
#@markdown - Epigenomic assay profiling: 1000 (1kb)
#@markdown - Single-cell Hi-C analysis: 1000000 (1Mb)
resolution = 10000 #@param {type:"number"}

#@markdown (Only applicable to "Resolution enhancement") Configure the genome id or the path to a chromosome sizes file.
genome_id = "hg38" #@param {type:"string"}
#@markdown (Only applicable to "Hi-C Embedding") Configure the embedding depth. The default of 0 returns the encoder output embeddings. You can also specify k from 1 to 8 to return the output of the k-th layer of the pre-trained decoder.
embed_depth = 0 #@param {type:"number"}
#@markdown (Only applicable to "Loop detection") Specify whether this is a low-coverage Hi-C map (e.g., <=100Mb reads for the human genome).
low_coverage_input = False # @param {type:"boolean"}
print("Configured task for inference: %s"%task_name)
if task==2 and low_coverage_input:
  model_path=root_dir+"hicfoundation_loop_lc.pth.tar"
  print("Low coverage input for loop detection, used model %s"%model_path)
else:
  print("Configured HiCFoundation model path %s"%model_path)
command_line=f"python3 inference.py --input '{hic_input_path}' --batch_size {batch_size} --resolution {resolution} \
  --task {task} --input_row_size {submat_height} --input_col_size {submat_width} \
  --stride {stride} --bound {scan_boundary} --model_path '{model_path}' \
  --output '{output_dir}'"
print("Configure command:",command_line)
#write it to a running file
with open("/content/HiCFoundation/run.sh","w") as file:
  file.write("%s"%command_line)

In [None]:
#@title <a name="Running">Run HiCFoundation</a>
#@markdown Please allow more than 30 minutes for the run to finish. Epigenomic assay profiling may take longer because of its finer resolution.
#@markdown <br>The running time increases with the size of the input Hi-C.
#@markdown <br>If your Hi-C contact map is too big, please run HiCFoundation locally using our [GitHub](https://github.com/Noble-Lab/HiCFoundation) repository.
#@markdown <br>If you don't have GPU resources, please contact us and we are happy to run it for you.

%%shell
cd /content/HiCFoundation
git pull origin main
eval "$(conda shell.bash hook)"
conda activate HiCFoundation
bash /content/HiCFoundation/run.sh


In [None]:
#@title <a name="Download">Download Output</a>
#@markdown Select this option to download all output files as a single .zip file.
#@markdown Otherwise, they will be downloaded as a .tar.gz file.
from google.colab import files
import os, tarfile
import shutil
import zipfile

zip_format = True #@param {type:"boolean"}
download_path = "/content/download/"
os.makedirs(download_path,exist_ok=True)
print("Copying files from %s"%output_dir)
listfiles=[x for x in os.listdir(output_dir) if "HiCFoundation" in x]
for item in listfiles:
  src_file = os.path.join(output_dir,item)
  dest_file = os.path.join(download_path,item)
  shutil.copy(src_file,dest_file)
if zip_format:
  tar_path = "/content/HiCFoundation_output.zip"
else:
  tar_path = "/content/HiCFoundation_output.tar.gz"
def zip_file(tar_path,src_dir):
    # Write every file under src_dir into a zip archive, keeping paths relative to src_dir
    with zipfile.ZipFile(tar_path,'w',zipfile.ZIP_DEFLATED) as z:
        for dirpath, dirnames, filenames in os.walk(src_dir):
            for filename in filenames:
                file_path = os.path.join(dirpath, filename)
                z.write(file_path,os.path.relpath(file_path,src_dir))
                print('==Compress Success!==',filename)

def make_targz(output_filename, source_dir):
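    """
    Pack source_dir into a gzip-compressed tar archive.
    :param output_filename: path of the .tar.gz file to create
    :param source_dir: directory whose contents are archived
    :return: bool, True on success and False if an error occurred
    """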
    try:
        with tarfile.open(output_filename, "w:gz") as tar:
            tar.add(source_dir, arcname=os.path.basename(source_dir))

        return True
    except Exception as e:
        print(e)
        return False
if zip_format:
  zip_file(tar_path,download_path)
else:
  make_targz(tar_path,download_path)
files.download(tar_path)



# Citation: <a name="Citation"></a>

Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, and Sheng Wang. "A generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species." bioRxiv (2024).
<a href="https://www.biorxiv.org/content/10.1101/2024.12.16.628821v1">Paper</a>
```
@article{wang2024generalizable,
  title={A generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species},
  author={Wang, Xiao and Zhang, Yuanyuan and Ray, Suhita and Jha, Anupama and Fang, Tangqi and Hang, Shengqi and Doulatov, Sergei and Noble, William Stafford and Wang, Sheng},
  journal={bioRxiv},
  year={2024}
}
```