## How to normalize a Hi-C map?



To normalize a Hi-C map, the following methods are implemented:

* [Matrix balancing algorithm by Knight and Ruiz](http://imajna.oxfordjournals.org/content/33/3/1029)
* Iterative correction (IC)
* Normalization by distance-specific average contact frequency



***

### Matrix balancing algorithm by Knight and Ruiz

**Import the `gcMapExplorer.ccmap` and `gcMapExplorer.normalizer` modules**

In [1]:
import gcMapExplorer.ccmap as cmp
import gcMapExplorer.normalizer as cnorm
import numpy as np
import os

The `ccmap` module is used here to read a previously imported ccmap file. The `normalizer` module contains the methods used to normalize the Hi-C map data.
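
Matrix balancing rescales a symmetric contact matrix so that every row and column has the same sum. As a rough illustration of that goal only, the sketch below balances a small dense matrix with a damped Sinkhorn-style iteration in plain NumPy. It is not the Knight-Ruiz algorithm, which uses a faster Newton-type scheme, and it is not the gcMapExplorer implementation. The function name `balance_symmetric` and the toy matrix are hypothetical, and the sketch assumes strictly positive entries, whereas real Hi-C maps usually contain empty bins.

```python
import numpy as np


def balance_symmetric(matrix, tol=1e-10, max_iter=1000):
    """Scale a symmetric, strictly positive matrix so that all rows sum to 1.

    Hypothetical helper for illustration; NOT the Knight-Ruiz algorithm.
    Returns the balanced matrix and the scaling vector x, where
    balanced = diag(x) @ matrix @ diag(x).
    """
    a = np.asarray(matrix, dtype=float)
    x = np.ones(a.shape[0])
    for _ in range(max_iter):
        row_sums = x * (a @ x)                 # row sums of diag(x) @ a @ diag(x)
        if np.max(np.abs(row_sums - 1.0)) < tol:
            break
        x = x / np.sqrt(row_sums)              # damped update, keeps the result symmetric
    return a * np.outer(x, x), x


# Toy symmetric matrix standing in for a raw contact map (assumption).
rng = np.random.default_rng(0)
raw = rng.uniform(1.0, 10.0, size=(6, 6))
raw = (raw + raw.T) / 2.0

balanced, scale = balance_symmetric(raw)
print(balanced.sum(axis=0))                    # column sums after balancing
print(balanced.sum(axis=1))                    # row sums after balancing
```

Because the same scaling vector is applied to rows and columns, the balanced matrix stays symmetric, which is the property a contact map needs.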



**Load Hi-C map file (.ccmap)**

First, we read the `output/CooMatrix/chr22_100kb_RawObserved.ccmap` file and normalize it using the Knight-Ruiz algorithm.

In [2]:
raw_ccmap = cmp.load_ccmap('output/CooMatrix/chr22_100kb_RawObserved.ccmap')

**Normalize and save Hi-C map**

In [3]:
norm_ccmap = cnorm.normalizeByKnightRuiz(raw_ccmap, memory='RAM', workDir=os.getcwd())       # Matrix balancing
cmp.save_ccmap(norm_ccmap, 'output/CooMatrix/normalized/chr22_100kb_normKR.ccmap', compress=True)   # Save ccmap

del raw_ccmap     # Remove CCMAP object from memory and any related temporary files
del norm_ccmap    # Remove CCMAP object from memory and any related temporary files

**Do RAM and HDD yield the same result?**

Here, we test whether normalization using RAM and using HDD yields the same matrix. We also measure the time taken by each mode.



In [4]:
raw_ccmap = cmp.load_ccmap('output/CooMatrix/chr1_100kb_RawObserved.ccmap')

print('Time using RAM: ')
%timeit norm_ram = cnorm.normalizeByKnightRuiz(raw_ccmap, memory='RAM', workDir=os.getcwd()) # Balancing using RAM

print('Time using HDD: ')
%timeit norm_hdd = cnorm.normalizeByKnightRuiz(raw_ccmap, memory='HDD', workDir=os.getcwd()) # Balancing using disk

# Normalize again, because %timeit does not keep variables assigned inside the timed statement
norm_hdd = cnorm.normalizeByKnightRuiz(raw_ccmap, memory='HDD', workDir=os.getcwd())
norm_ram = cnorm.normalizeByKnightRuiz(raw_ccmap, memory='RAM', workDir=os.getcwd())
norm_ram.make_readable()
norm_hdd.make_readable()

print('If matrix from RAM and HDD are similar: ', np.allclose(norm_ram.matrix, norm_hdd.matrix) )
del raw_ccmap
del norm_ram
del norm_hdd

Time using RAM: 
1 loop, best of 3: 13.4 s per loop
Time using HDD: 
1 loop, best of 3: 30.8 s per loop
If matrix from RAM and HDD are similar:  True
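
The same kind of check can be reproduced without gcMapExplorer. The sketch below assumes that a disk-backed matrix behaves like a `numpy.memmap` array; whether the `memory='HDD'` option works this way internally is an assumption, not something stated here. It applies one identical operation to an in-memory array and to its disk-backed copy, then compares the results with `np.allclose`. The file name `matrix.dat` and the row-scaling step are hypothetical stand-ins.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(1)
in_ram = rng.random((50, 50))                      # matrix held in memory

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'matrix.dat')     # hypothetical temporary file
    on_disk = np.memmap(path, dtype=in_ram.dtype, mode='w+', shape=in_ram.shape)
    on_disk[:] = in_ram                            # copy the data to the disk-backed array
    on_disk.flush()

    # Apply the same operation (row scaling) to both copies.
    ram_result = in_ram / in_ram.sum(axis=1, keepdims=True)
    disk_result = np.asarray(on_disk / on_disk.sum(axis=1, keepdims=True))

    same = np.allclose(ram_result, disk_result)
    del on_disk                                    # release the memory map before cleanup

print('Results from RAM and disk are similar: ', same)
```

A disk-backed array is typically slower because each access may go through the file system, but it allows matrices larger than the available RAM to be processed.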


***


**Normalize and save all Hi-C maps**

In [5]:
chroms = [1, 5, 15, 20, 21]      # List of chromosomes

# Loop for each chromosome
for chrom in chroms:
    input_file = 'output/CooMatrix/chr{0}_100kb_RawObserved.ccmap'.format(chrom)
    output_file = 'output/CooMatrix/normalized/chr{0}_100kb_normKR.ccmap'.format(chrom)
    
    raw_ccmap = cmp.load_ccmap(input_file)
    norm_ccmap = cnorm.normalizeByKnightRuiz(raw_ccmap, memory='RAM', workDir=os.getcwd())
    cmp.save_ccmap(norm_ccmap, output_file, compress=True)

    del raw_ccmap     # Remove CCMAP object from memory and any related temporary files
    del norm_ccmap    # Remove CCMAP object from memory and any related temporary files

All normalized Hi-C map files are saved in the `output/CooMatrix/normalized` directory.

***

### Normalize by Iterative correction method

This method normalizes the raw contact map by removing biases introduced by the experimental procedure.
For more details, see [this publication](http://www.nature.com/nmeth/journal/v9/n10/full/nmeth.2148.html).
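
To give an idea of how such a correction proceeds, the following is a simplified sketch of the iterative idea in plain NumPy: the total contacts (coverage) of each bin are treated as a bias estimate, every entry is divided by the biases of its row and column, and the step is repeated until all non-empty bins have equal coverage. It is not the `normalizeByIC` implementation; the function name `iterative_correction`, the tolerance, and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def iterative_correction(matrix, tol=1e-8, max_iter=500):
    """Simplified iterative correction of a symmetric, non-negative contact matrix.

    Hypothetical helper for illustration only.
    Returns the corrected matrix and the accumulated per-bin bias.
    """
    w = np.array(matrix, dtype=float)              # work on a copy
    bias = np.ones(w.shape[0])
    for _ in range(max_iter):
        coverage = w.sum(axis=1)                   # total contacts of each bin
        filled = coverage > 0                      # empty bins are left untouched
        if not filled.any():
            break
        step = np.ones_like(coverage)
        step[filled] = coverage[filled] / coverage[filled].mean()
        w /= np.outer(step, step)                  # divide W_ij by step_i * step_j
        bias *= step
        if np.max(np.abs(step - 1.0)) < tol:       # filled bins now have equal coverage
            break
    return w, bias


# Toy symmetric map with one empty bin (assumption).
rng = np.random.default_rng(0)
raw = rng.uniform(1.0, 10.0, size=(6, 6))
raw = (raw + raw.T) / 2.0
raw[2, :] = 0.0
raw[:, 2] = 0.0

corrected, bias = iterative_correction(raw)
coverage = corrected.sum(axis=1)
print(coverage)                                    # coverage of each bin after correction
```

The accumulated `bias` vector relates the two maps: multiplying the corrected entry at (i, j) by `bias[i] * bias[j]` recovers the raw value.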



In [6]:
chroms = [1, 5, 15, 20, 21]      # List of chromosomes

# Loop for each chromosome
for chrom in chroms:
    input_file = 'output/CooMatrix/chr{0}_100kb_RawObserved.ccmap'.format(chrom)
    output_file = 'output/CooMatrix/normalized/chr{0}_100kb_IC.ccmap'.format(chrom)
    
    raw_ccmap = cmp.load_ccmap(input_file)
    norm_ccmap = cnorm.normalizeByIC(raw_ccmap)
    cmp.save_ccmap(norm_ccmap, output_file, compress=True)

    del raw_ccmap     # Remove CCMAP object from memory and any related temporary files
    del norm_ccmap    # Remove CCMAP object from memory and any related temporary files

***

### Normalize by Average Contact Frequency

This method normalizes a Hi-C map using the average contact frequency at each distance between two locations (coordinates). First, the average contact frequency is calculated for each distance. Then, each observed contact frequency is divided by the average contact frequency of the corresponding distance.

$$d_{|m-n|}  = \frac { \sum_{i=0}^{N-1} \sum_{j=0}^{i} \begin{cases} C_{ij}, & \text{if }|m-n| = |i-j| \\\\ 0, & \text{otherwise} \end{cases}} {L_{|m-n|}}$$

$$v_{ij} = v_{ji}= \frac { C_{ij} } { d_{|i-j|} }$$

where $d_{|m-n|}$ is the average contact frequency between locations separated by the distance $|m-n|$, $C_{ij}$ is the observed contact between locations $i$ and $j$, $N$ is the number of locations (bins) in the map, and $L_{|m-n|}$ is the number of instances in which the contact between locations separated by the distance $|m-n|$ was larger than zero. $v_{ij}$ and $v_{ji}$ are the normalized contact values for locations $i$ and $j$.
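
The two formulas can be traced on a small matrix. The sketch below is a plain NumPy illustration under the definitions given here, not the `normalizeByAvgContact` implementation: for each distance it averages the non-zero contacts along the corresponding diagonal, then divides every entry by the average for its distance. The function name `normalize_by_distance_average` and the 3 x 3 toy matrix are hypothetical, and distances without any contact are left at zero by assumption.

```python
import numpy as np


def normalize_by_distance_average(contacts):
    """Divide each contact by the average non-zero contact at its distance.

    Hypothetical helper for illustration only.
    Returns the normalized matrix and the per-distance averages.
    """
    c = np.asarray(contacts, dtype=float)
    n = c.shape[0]

    # Average contact per distance k, counting only non-zero contacts.
    avg = np.zeros(n)
    for k in range(n):
        diagonal = np.diagonal(c, offset=k)        # all pairs with |i - j| == k
        nonzero = np.count_nonzero(diagonal)
        if nonzero:
            avg[k] = diagonal.sum() / nonzero

    # Look up the average for every (i, j) pair and divide.
    idx = np.arange(n)
    distance = np.abs(idx[:, None] - idx[None, :])
    d = avg[distance]
    normalized = np.divide(c, d, out=np.zeros_like(c), where=d > 0)
    return normalized, avg


raw = np.array([[4.0, 2.0, 0.0],
                [2.0, 6.0, 3.0],
                [0.0, 3.0, 8.0]])

normalized, avg = normalize_by_distance_average(raw)
print(avg)          # d_0 = 6.0, d_1 = 2.5, d_2 = 0.0 (no contact at distance 2)
print(normalized)
```

In this example the contact of 3 at distance 1 becomes 3 / 2.5 = 1.2, meaning it is 20% above the average for that distance.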


In [7]:
chroms = [1, 5, 15, 20, 21]      # List of chromosomes

# Loop for each chromosome
for chrom in chroms:
    input_file = 'output/CooMatrix/chr{0}_100kb_RawObserved.ccmap'.format(chrom)
    output_file = 'output/CooMatrix/normalized/chr{0}_100kb_AvgContact.ccmap'.format(chrom)
    
    raw_ccmap = cmp.load_ccmap(input_file)
    norm_ccmap = cnorm.normalizeByAvgContact(raw_ccmap)
    cmp.save_ccmap(norm_ccmap, output_file, compress=True)

    del raw_ccmap     # Remove CCMAP object from memory and any related temporary files
    del norm_ccmap    # Remove CCMAP object from memory and any related temporary files