<a href="https://colab.research.google.com/github/Noble-Lab/HiCFoundation/blob/main/Reproducibility.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HiCFoundation: a generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species
**This notebook only calculates the reproducibility score with HiCFoundation.**

HiCFoundation is a generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species.

Copyright (C) 2024 Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, and Sheng Wang

License: Apache License 2.0

Contact:  Sergei Doulatov (doulatov@uw.edu) & William Stafford Noble (wnoble@uw.edu) & Sheng Wang (swang@cs.washington.edu)

For technical problems or questions, please reach out to Xiao Wang (wang3702@uw.edu) and Yuanyuan Zhang (zhang038@purdue.edu).


If you are using other browsers, disabling tracking protection may help resolve the errors when uploading or downloading files.

For more details, see the **<a href="#Instructions">Instructions</a>** in this notebook and check out the **[HiCFoundation GitHub](https://github.com/Noble-Lab/HiCFoundation)**. If you use HiCFoundation, please cite it: **<a href="#Citation">Citation</a>**.

# Instructions <a name="Instructions"></a>
## Steps
1. Run the <a href="https://github.com/Noble-Lab/HiCFoundation/blob/main/HiCFoundation.ipynb">HiCFoundation Colab</a> on the two Hi-C maps you want to compare and download the embedding pickle files for further processing.
2. Connect to a **CPU machine** by clicking the **"Connect"** button in the top-right corner of the notebook.
3. Upload the embedding of the 1st Hi-C map (.pkl file) in <a href="#file">Input file1</a>.
4. Upload the embedding of the 2nd Hi-C map (.pkl file) in <a href="#file">Input file2</a>.
5. Run the score calculation by clicking the run button on the left in <a href="#Running">Run</a>.
6. Check the output in the same tab to get the similarity score.
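
The score reported in the last step is computed by the final code cell of this notebook as the mean over chromosomes of the mean cosine similarity between matching embeddings. The snippet below is a minimal sketch of that idea on toy data, not the notebook's exact code. It assumes each pickle holds a dict that maps keys such as `1_1:1960,1960` to 1-D NumPy vectors. The names `cosine` and `toy_score` and the toy embeddings are hypothetical, and the sketch skips the key-name fallbacks and the chrY/chrM/Un/Alt filtering done by the real cell.

```python
from collections import defaultdict

import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def toy_score(emb1, emb2):
    """Mean over chromosomes of the mean cosine similarity of shared keys."""
    per_chrom = defaultdict(list)
    for key in emb1:
        if key not in emb2:
            continue  # simplified: the notebook also tries alternative key names
        # "1_1:1960,1960" -> chromosome "1"
        chrom = key.split(":")[0].split("_")[0].replace("chr", "")
        per_chrom[chrom].append(cosine(emb1[key], emb2[key]))
    return float(np.mean([np.mean(v) for v in per_chrom.values()]))


# Hypothetical toy embeddings: identical on chromosome 1, orthogonal on chromosome 2.
emb_a = {"1_1:0,0": np.array([1.0, 0.0]), "2_2:0,0": np.array([0.0, 1.0])}
emb_b = {"1_1:0,0": np.array([1.0, 0.0]), "2_2:0,0": np.array([1.0, 0.0])}

score = toy_score(emb_a, emb_b)
print(score)  # chromosome 1 gives 1.0, chromosome 2 gives 0.0, so the mean is 0.5
```

Averaging within each chromosome first gives every chromosome equal weight in the final score, regardless of how many embeddings it contributes.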

In [None]:
#@title  <a name="file">Input embedding file1</a>
from google.colab import files
import os
import os.path
import re
import hashlib
import random
import string
from google.colab import drive

from datetime import datetime
# Get the current date and time
current_datetime = datetime.now()
# Convert to string in desired format
current_datetime_str = current_datetime.strftime("%Y-%m-%d-%H-%M-%S")
rand_letters = string.ascii_lowercase
rand_letters = ''.join(random.choice(rand_letters) for i in range(20))
output_dir="/content/"

#@markdown ## Upload the calculated embedding file (.pkl) of the 1st Hi-C map from your local file system
print("Please upload your input file")
os.chdir("/content/")
root_dir = os.getcwd()
upload_dir = os.path.join(root_dir,rand_letters)
if not os.path.exists(upload_dir):
  os.mkdir(upload_dir)
os.chdir(upload_dir)
map_input = files.upload()
for fn in map_input.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
    name=fn, length=len(map_input[fn])))
  hic_input_path1 = os.path.abspath(fn)
  print("The input was saved to %s"%hic_input_path1)
os.chdir(root_dir)



In [None]:
#@title  <a name="file">Input embedding file2</a>
#@markdown ## Upload the calculated embedding file (.pkl) of the 2nd Hi-C map from your local file system
os.chdir(upload_dir)
map_input = files.upload()
for fn in map_input.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
    name=fn, length=len(map_input[fn])))
  hic_input_path2 = os.path.abspath(fn)
  print("The input was saved to %s"%hic_input_path2)
os.chdir(root_dir)

In [None]:
# @title
# This script calculates the similarity between two Hi-C maps from the embeddings produced by a pre-trained reproducibility model.

import os
import sys
import numpy as np
import pickle
from collections import defaultdict

input_pickle1 = hic_input_path1
input_pickle2 = hic_input_path2

def load_pickle(file_path):
    with open(file_path, 'rb') as f:
        data = pickle.load(f)
    return data

input1 = load_pickle(input_pickle1)
input2 = load_pickle(input_pickle2)

def find_key(chrom, loc, key_list):
    """
    Find the key in key_list that matches the given chromosome and location.
    Tries the naming variants "1:loc", "chr1:loc", "1_1:loc" and "chr1_chr1:loc"
    in that order; returns None if none of them is present.
    """
    candidates = [
        chrom + ":" + loc,
        "chr" + chrom + ":" + loc,
        chrom + "_" + chrom + ":" + loc,
        "chr" + chrom + "_chr" + chrom + ":" + loc,
    ]
    for candidate in candidates:
        if candidate in key_list:
            return candidate
    return None

def calculate_similarity(input1, input2):
    """
    Calculate the similarity between two Hi-C maps from their embeddings:
    the mean over chromosomes of the mean cosine similarity between matching embeddings.
    """
    similarity_dict = defaultdict(list)
    for key in input1.keys():
        #1_1:1960,1960 format of key
        split_chromosome = key.split(":")[0]
        split_loc = key.split(":")[1]
        combine_key = split_chromosome + ":" + split_loc
        chr = split_chromosome.split("_")[0]
        chr = chr.replace("chr","")
        if combine_key not in input2.keys():
            combine_key = find_key(chr,split_loc,input2.keys())
            if combine_key is None:
                continue

        embedding1 = input1[key]
        embedding2 = input2[combine_key]
        # Calculate the similarity between the two embeddings
        similarity = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
        if np.isnan(similarity):
            continue
        similarity_dict[chr].append(similarity)
    #ignore chrY, chrM, Un, Alt cases
    similarity_list=[]
    for chrom in similarity_dict:
        if "Y" in chrom or "M" in chrom or "Un" in chrom or "Alt" in chrom:
            continue
        mean_val = np.mean(similarity_dict[chrom])
        similarity_list.append(mean_val)
    similarity = np.mean(similarity_list)
    return similarity

similarity = calculate_similarity(input1, input2)
print("The reproducibility score between the two Hi-C maps is: ", similarity)
