Skip to content

tutorial

stefanofranzini edited this page Sep 27, 2020 · 3 revisions

Welcome to essHIC tutorial. Here you will learn how to use this package to analyze your Hi-C data. You can find a working example of this tutorial at this link


Using essHIC in a python script

To use essHIC in one of your python scripts simply import it as any other python module:

import essHIC

If you installed the package into a local directory, you need to specify the path in your script in order for python to find the package:

import sys
sys.path.append('path/to/essHIC')
import essHIC

Tutorial materials

To be able to complete this tutorial you will need to install matplotlib on your computer (if you installed essHIC from the python package manager, it will already be installed).

You will also need to download the raw_data directory. This directory contains 12 Hi-C experiments divided into 2 cell-types (GM12878 and IMR90), which we will analyze in this tutorial, as well as their metadata file. Copy raw_data into your working directory and let's get started.


Building a library of your experiments

essHIC relies on a particular directory structure in order to function properly. The make_hic module provides an easy way to create a library of your experiments with the required characteristics, as well as to convert raw counts matrices into Observed over Expected matrices, which we are going to use in all analyses from here on out.

To start initialize make_hic with

mymaker = essHIC.make_hic('raw_data','hic_data')

This will set the input path raw_data and the output directory hic_data. You do not need to create the hic_data directory in advance.

Next we will provide mymaker with the metadata it requires to work:

mymaker.get_metadata('raw_data/metadata.txt')

make_hic will create new names for the experiments, ordering them according to their cell-type. Of course it will also create a reference table in the output directory, naming it metadata.txt, which will contain the cell-types of the experiments along with their original name.

Finally we are going to save our original matrices under the new names and process them in order to obtain the OoE matrices we are going to use later on.

mymaker.save_data('nrm',res='100kb',dirtree='05_sub-matrices/nrm')
mymaker.save_decay_norm('nrm',res='100kb',dirtree='05_sub-matrices/nrm',makenew=True)

With this we have a new directory containing all our data, in binary pickle format.


Analyzing single Hi-C matrices

essHIC provide a wrapper class for Hi-C matrices, containing not only the matrix of the experiment, but also all the relevant metadata attached.

Let's start by uploading two experiments:

hicA = essHIC.hic('hic_data/hic001/dkn/dkn_chr17_100kb.npy')
hicB = essHIC.hic('hic_data/hic012/dkn/dkn_chr17_100kb.npy')

All the metadata informations about the experiments will be automatically uploaded along with their matrices.

print 'hicA contains experiment %s, chromosome %d, its cell-type is %s' % ( hicA.oldrefname, hicA.chromo, hicA.cell )
print 'hicB contains experiment %s, chromosome %d, its cell-type is %s' % ( hicB.oldrefname, hicB.chromo, hicB.cell )

With this we can immediatly recover the cell-type, chromosome, and original reference name of the experiment. We could also obtain the resolution and the length (in bins) of the matrices, if we wanted to.

The hic class allows us to easily compute eigenvalues and eigenvectors of our matrices and store them:

hicA.get_spectrum()	
hicB.get_spectrum()	

We can plot the spectra with matplotlib:

plt.plot(np.abs(hicA.eig),'C0o',label=hicA.oldrefname)
plt.plot(np.abs(hicB.eig),'C1o',label=hicB.oldrefname)
plt.legend(fontsize=20)
plt.xlabel('eigen-number',fontsize=30)
plt.ylabel('eigenvalue',fontsize=30)
plt.tight_layout()
plt.show()

Next we will plot the two matrices, so we can look at the differences between them.

hicA.plot()	
hicB.plot()

plt.show()

You can immediatly spot the differences between the two matrices, however they appear to be quite fuzzy. We can improve the sharpness of the long scale patterns by applying a spectral filter and obtaining their essential matrices (using 10 eigenspaces):

hicA.reduce(10,norm='none')
hicB.reduce(10,norm='none')

Now we plot the matrices again, as above. The patterns sharpness has increased in both matrices, making it easier to identify the differences.


Computing essential distances between all experiments in the dataset

To compute distances between experiments we use the clas ess. We start by initializing an instance of the class which will compute distances between chromosome 17 OoE matrices at 100kb resolution:

mydistance = essHIC.ess('hic_data','dkn',17,'100kb')

For all the collected matrices, we compute the first 100 eigenspaces with:

mydistance.get_spectra(100)

Before proceeding any further, we must create an output directory for the distances, which we can simply call dist. Once this is done, we can compute the essential distances: ess allows us to vary the number of employed eigenspaces on the fly (up to the maximum number we computed above, 100 in this case).

for k in range(1,100):
	mydistance.get_essential_distance('dist/dist_chr17_100kb_%d.dat' % (k),nvec=k,spectrum=1.)

The distances will be saved in dist for later use.


Analyzing a distance matrix

The dist class provides useful tools to analyze the properties of a distance matrix and plot the results. Let's see what we can do.

We start by uploading the essential distance at 10 eigenspaces which we just computed above:

mydist = essHIC.dist('dist/dist_chr17_100kb_10.dat', metafile='hic_data/metadata.txt')

Remember to provide the metadata file contained in hic_data, so that dist can interpret the distance matrix we provided. We can then plot the matrix with:

mydist.plot()
plt.show()

This shows the distance map: notice that there are two clear blocks of similar experiments. These are biological replicates from the same cell-type, which only differ for their aspecific parts.

We can quantify how good the separation between replicates and non-replicates is by looking at ROC curves

mydist.get_roc()
AUC = mydist.get_roc_area()

print 'The area under the curve is %f' % AUC
mydist.plot_roc()
plt.show()

This will compute and plot the ROC curve for this distance matrix, as well as provide the value of the area under the curve (AUC), which summarizes the quality of the curve: the higher the better our distance is at separating replicates and non-replicates.

Next we can use two different clustering algorithms (Ward's hierarchical clustering and spectral clustering) to separate our dataset into two clusters.

mydist.hier_clustering(nclust=2)
mydist.show_clusters()

mydist.get_gauss_sim()

mydist.spec_clustering(nclust=2)
mydist.show_clusters()

Both methods return the same clustering, which tells us this is a robust subdivision. Notice that to employ spectral clustering we need a similarity matrix, rather than a distance matrix, so we computed one with the method get_gauss_sim().

We can also visualize the dendrogram of the dataset: the tree plot shows the hierarchical structure of the dataset and allows us to spot the presence of obvious divisions.

mydist.plot_dendrogram(leafname='old')
plt.show()

Notice the large jump between the red and the blue branches: this means that the two are well separated.

We can also use multidimensional scaling (MDS) to represent all experiments as points in 3D.

mydist.MDS()
mydist.plot_MDS()
plt.show()

Again we notice the separation between red and blue experiments.


Comparing essential distances with different numbers of eigenspaces

We can compare the quality of the distances as the number of employed eigenspaces increases by computing their AUC:

rocAUC = []	# empty list of roc AUC values

for k in range(1,100):
	mydist = essHIC.dist('dist/dist_chr17_100kb_%d.dat' % (k), metafile='hic_data/metadata.txt')
	
	mydist.get_roc()
	rocAUC += [ mydist.get_roc_area() ]	
	
plt.plot(range(1,100),rocAUC,'C3-',lw=2.)
plt.xlabel('eigen-numer',fontsize=30)
plt.ylabel('AUC', fontsize=30)
plt.tight_layout()
plt.show()

The curve grows up to the 6th eigenspace, it flattens up to the 30th, then starts decreasing. It shows that, while the quality is still quite high, increasing the number of eigenspaces does not improve the discrimination of replicates and non replicates.

Finally we compare the dendrograms for different numbers of eigenspaces employed (2,10,99):

mydist2	  = essHIC.dist('dist/dist_chr17_100kb_2.dat',  metafile='hic_data/metadata.txt') 
mydist10  = essHIC.dist('dist/dist_chr17_100kb_10.dat', metafile='hic_data/metadata.txt') 
mydist99  = essHIC.dist('dist/dist_chr17_100kb_99.dat', metafile='hic_data/metadata.txt') 
 
mydist2.plot_dendrogram(leafname='old') 
plt.show()

mydist10.plot_dendrogram(leafname='old')
plt.show()

mydist99.plot_dendrogram(leafname='old')
plt.show()

Notice that as the number of eigenspaces increase, the separation between replicates also increase: the aspecific part of the matrices makes it difficult to recognize their common features.


Thanks for following this tutorial, now you are ready to implement your own code using essHIC.