essHIC.make_hic

stefanofranzini edited this page Sep 27, 2020 · 1 revision
essHIC.make_hic(indir,outdir,loader='from_index', chromosomes='human')

The function of make_hic is to sort the raw data, normalize them, and create metadata files.

In practice the other classes of essHIC (such as essHIC.hic or essHIC.red) rely on the path tree and naming scheme defined by make_hic to properly function.


Parameters:

indir: string
Directory containing the raw data.
outdir: string
output directory where the sorted and normalized data, as well as the metadata, will be put.
loader: {from_index, from_matrix}, default: from_index
The loader function to use, depending on the format of your data.
chromosomes: {human} or list of integers, default: human
Tells the class how to read your input matrices and break them down into chromosomes. If the HiC maps are from a human cell line, the chromosome lengths are already stored in the package. Otherwise, provide a list of integers containing the length of each chromosome in your organism.

Attributes:

input_directory: string
the directory where the raw data are stored.
output_directory: string
the directory where new data and metadata will be saved.
loader: function
the loader function to read HiC data.
chromosomes: list of integers
list of the chromosome lengths.
list_directories: list of strings
list of the directories (one for each experiment) found in input_directory.
newref_to_oldref: dictionary
a dictionary that maps each name assigned by essHIC.make_hic to the old name of the experiment.
oldref_to_newref: dictionary
a dictionary that maps each old name of an experiment to the name assigned by essHIC.make_hic.
oldref_to_cell: dictionary
a dictionary that maps each old name of an experiment to its cell type.
newref_to_cell: dictionary
a dictionary that maps each new name of an experiment to its cell type.


Methods

method function
save_metadata saves a metadata.txt file which contains metadata information about the HiC matrices.
save_chromosomes saves a chromosome.txt file which contains the lengths of the chromosomes at different resolutions.
save_data saves raw data as binary files.
save_decay_norm saves normalized data as binary files.
get_metadata reads a metadata file and creates dictionaries to convert between the new and the old nomenclature of the data.
get_chromosomes creates chromosome boundaries at the wanted resolution.
load_from_index loads a HiC matrix saved in the index format.
load_from_matrix loads a HiC matrix saved in the matrix format.
compute_decay_norm normalizes a matrix according to the decay norm.

__init__(indir,outdir,loader='from_index', chromosomes='human')

initialize self.


save_metadata

save_metadata(self)

Saves a metadata.txt file in the output directory (self.output_directory). It contains the new names of the HiC experiments, the old ones, and the cell type. All the new names follow the format "hicxxx", where xxx is an integer padded with leading zeros (for example hic001). The experiments are sorted according to their cell type. The fourth column of the file can be used to mark outliers for removal by writing "remove". These experiments will not be considered when analysing the distance matrix.

Returns:

none
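
The naming scheme described above can be illustrated with a minimal sketch. This is not the essHIC implementation: the helper name `assign_new_names`, the three-digit width, the tie-breaking by old name, and the experiment names in the example are all assumptions made for illustration.

```python
def assign_new_names(oldref_to_cell):
    """Sort experiments by cell type and give each a name like 'hic001'.

    Hypothetical helper: essHIC may order ties or pad names differently.
    """
    # Sort by cell type first, then by old name so the result is deterministic.
    ordered = sorted(oldref_to_cell, key=lambda old: (oldref_to_cell[old], old))
    oldref_to_newref = {}
    for i, old in enumerate(ordered, start=1):
        oldref_to_newref[old] = "hic%03d" % i  # zero-padded to three digits
    newref_to_oldref = {new: old for old, new in oldref_to_newref.items()}
    return oldref_to_newref, newref_to_oldref


# Made-up experiment names and cell types, for illustration only.
cells = {"GSM01": "IMR90", "GSM02": "GM12878", "GSM03": "IMR90"}
old_to_new, new_to_old = assign_new_names(cells)
```

The two returned dictionaries play the role of the oldref_to_newref and newref_to_oldref attributes listed above.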

save_chromosomes

save_chromosomes(self,res_list)

Saves a chromosomes.txt file in the output directory (self.output_directory). It contains a line for each resolution found in the raw data. Each line starts with the resolution identifier and contains the length of each chromosome at that resolution.

Parameters:

res_list: list of strings
the list of the resolutions found in the raw data.

Returns:

none

save_data

save_data(self,from_norm='all',res='all',dirtree='',ext='abc',nameformat='NCR',full=False)

Saves selected raw data in the output directory as binary files which contain a numpy array of shape (3,K), where K is the number of nonzero bins in the upper triangular matrix. The first two rows of the array contain the indices of the bins, while the third contains their values. One can restrict the method to files with a given normalization, resolution and extension. The method will name data according to the normalization, chromosome, and resolution of the matrix, as well as to additional information, if provided.

Parameters:

from_norm: string, default='all'
identifier of the normalization of the matrices to save. If it is not 'all' it will only save data with the given normalization.
res: string, default='all'
identifier of the resolution of the matrices to save. If it is not 'all' it will only save data with the given resolution.
dirtree: string, default=''
path from the experiment directory to the directory containing the matrices.
ext: string, default='abc'
extension of the data files to save. It will ignore files with different extensions.
nameformat: string, default='NCR'
format of the name of the files. Each file name will be stripped of its extension and segmented using underscore ( "_" ) separators. Information about the HiC matrix is extracted in the order indicated by the string. 'N' stands for normalization, 'C' stands for chromosome, 'R' stands for resolution. You can skip a field by putting a placeholder character in the string (for example 'x'). You can have the method save additional information by putting 'A' characters at the positions of the fields to read.
full: bool, default=False
if True, it will save full matrices instead of breaking them into single chromosomes.

Returns:

none
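
To make the nameformat convention concrete, here is a minimal sketch of how a file name could be split into fields. The function `parse_name` and the file names are hypothetical and only follow the description above; the actual parsing inside essHIC may differ in its details.

```python
def parse_name(filename, nameformat="NCR"):
    """Split a file name on '_' and label each field using nameformat."""
    stem = filename.rsplit(".", 1)[0]  # strip the extension
    fields = stem.split("_")
    info = {"N": None, "C": None, "R": None, "A": []}
    for code, field in zip(nameformat, fields):
        if code in ("N", "C", "R"):
            info[code] = field
        elif code == "A":
            info["A"].append(field)  # additional information to keep
        # any other character (for example 'x') means: skip this field
    return info


# Made-up file names, for illustration only.
parsed = parse_name("raw_chr1_100kb.abc", nameformat="NCR")
skipped = parse_name("sample7_raw_chr1_100kb_rep2.abc", nameformat="xNCRA")
```

In the second call the leading 'x' discards the first field, while the trailing 'A' keeps the last field as additional information.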

save_decay_norm

save_decay_norm(self,from_norm='nrm',res='all',makenew=False,dirtree='',ext='abc',nameformat='NCR',full=False,compute_chbounds=True)

Computes the decay norm for selected matrices and saves them in the output directory as binary files which contain a numpy array of shape (3,K), where K is the number of nonzero bins in the upper triangular matrix. The first two rows of the array contain the indices of the bins, while the third contains their normalized values. One can restrict the method to files with a given normalization, resolution and extension. The method will name data according to the normalization, chromosome, and resolution of the matrix, as well as to additional information, if provided.

Parameters:

from_norm: string, default='nrm'
identifier of the normalization of the matrices to normalize. If it is not 'all' it will only process data with the given normalization.
res: string, default='all'
identifier of the resolution of the matrices to save. If it is not 'all' it will only save data with the given resolution.
dirtree: string, default=''
path from the experiment directory to the directory containing the matrices.
ext: string, default='abc'
extension of the data files to save. It will ignore files with different extensions.
nameformat: string, default='NCR'
format of the name of the files. Each file name will be stripped of its extension and segmented using underscore ( "_" ) separators. Information about the HiC matrix is extracted in the order indicated by the string. 'N' stands for normalization, 'C' stands for chromosome, 'R' stands for resolution. You can skip a field by putting a placeholder character in the string (for example 'x'). You can have the method save additional information by putting 'A' characters at the positions of the fields to read.
full: bool, default=False
if True, it will save full matrices instead of breaking them into single chromosomes.
compute_chbounds: bool, default=True
if False, it does not compute the chromosome boundaries (chbounds) and takes the largest possible distance between two indices instead.

Returns:

none

get_metadata

get_metadata(self, metadata)

Obtains metadata about the original dataset from a metadata file which contains the name of the experiment in the first column, and the cell type (or some other attribute of the experiments) in the second.

Parameters:

metadata: string
path to the metadata file.

Returns:

none
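
As an illustration of the two-column layout described above, the following sketch builds an old-name to cell-type dictionary from a few lines of text. It assumes whitespace-separated columns, which the source does not state; `read_metadata` and the sample lines are hypothetical.

```python
def read_metadata(lines):
    """Build an old-name -> cell-type dictionary from two-column lines."""
    oldref_to_cell = {}
    for line in lines:
        columns = line.split()  # assumption: columns separated by whitespace
        if len(columns) < 2:
            continue  # skip blank or incomplete lines
        oldref_to_cell[columns[0]] = columns[1]
    return oldref_to_cell


# Made-up metadata lines, for illustration only.
example = ["GSM01 IMR90", "GSM02 GM12878", ""]
oldref_to_cell = read_metadata(example)
```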

get_chromosomes

get_chromosomes(self,res)

Computes the boundaries of each chromosome at the given resolution.

Parameters:

res: string
the resolution specification, written as a string of which the last two characters are the unit of measure (kb = 1000 b, Mb = 1000000 b, Gb = 1000000000 b ).

Returns:

chromo_bound: nested list of integers
It contains the chromosome boundaries. The first index of the list refers to the chromosome, the second to the lower (0) or upper (1) boundary of the chromosome.
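
The sketch below shows one way such boundaries could be derived from chromosome lengths and a resolution string. It is an illustration under stated assumptions, not the essHIC code: bins are assumed to be numbered consecutively along the genome, the last partial bin of each chromosome is kept, and the upper boundary is exclusive. The names `parse_resolution` and `chromosome_bounds` are hypothetical.

```python
UNITS = {"kb": 10**3, "Mb": 10**6, "Gb": 10**9}


def parse_resolution(res):
    """Convert a string such as '100kb' into a number of base pairs."""
    return int(float(res[:-2]) * UNITS[res[-2:]])


def chromosome_bounds(lengths_bp, res):
    """Return [lower, upper] bin indices for each chromosome.

    Assumption: bins are numbered consecutively along the genome and the
    last, partial bin of every chromosome is kept (ceiling division).
    """
    step = parse_resolution(res)
    bounds, start = [], 0
    for length in lengths_bp:
        n_bins = -(-length // step)  # ceiling division
        bounds.append([start, start + n_bins])
        start += n_bins
    return bounds


# Two made-up chromosomes of 2.5 Mb and 1 Mb at 1 Mb resolution.
bounds = chromosome_bounds([2_500_000, 1_000_000], "1Mb")
```

Here `bounds[0][0]` and `bounds[0][1]` are the lower and upper boundary of the first chromosome, matching the indexing of chromo_bound described above.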

load_from_index

load_from_index(self,fname)

Loads a matrix from a file in the index format, where the first column contains the bin lower index number, the second the bin upper index number, and the third specifies the bin value. Lines which start with # are disregarded.

Parameters:

fname: string
the name of the index file.

Returns:

loaded: array-like
the matrix.
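
A minimal sketch of reading the index format, written against a list of lines rather than a file so that it is self-contained. The function name `parse_index_lines`, the whitespace separator, and the sample lines are assumptions; the return layout mirrors the (3,K) arrangement described for save_data.

```python
def parse_index_lines(lines):
    """Read 'i j value' lines into three parallel lists (rows, cols, values)."""
    rows, cols, values = [], [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        i, j, value = line.split()[:3]
        rows.append(int(i))
        cols.append(int(j))
        values.append(float(value))
    return [rows, cols, values]


# Made-up file content, for illustration only.
loaded = parse_index_lines(["# bin_i bin_j count", "0 0 12", "0 1 3.5", "1 1 8"])
```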

load_from_matrix

load_from_matrix(self,fname)

Loads a matrix from a file in the matrix format. The file contains the values of each bin arranged in N rows and N columns, where N is the size of the matrix. Lines which start with # are disregarded.

Parameters:

fname: string
the name of the matrix file.

Returns:

loaded: array-like
the matrix.
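
The relation between the dense matrix format and the sparse (3,K) layout used elsewhere on this page can be sketched as follows. Whether load_from_matrix itself returns this layout, and whether the diagonal is kept, are assumptions; `dense_to_index` is a hypothetical helper.

```python
def dense_to_index(matrix):
    """Keep the nonzero bins of the upper triangle as rows, cols, values."""
    rows, cols, values = [], [], []
    for i, row in enumerate(matrix):
        for j in range(i, len(row)):  # upper triangle, diagonal included
            if row[j] != 0:
                rows.append(i)
                cols.append(j)
                values.append(row[j])
    return [rows, cols, values]


# A small symmetric matrix with made-up counts.
dense = [[5, 2, 0],
         [2, 0, 1],
         [0, 1, 4]]
sparse = dense_to_index(dense)
```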

compute_decay_norm

compute_decay_norm(self,loaded,delta_max=-1)

Computes the decay normalization (or observed versus expected normalization) of a matrix passed by its indices. The decay norm is given by dividing each bin by the average interaction at its genomic distance.

Parameters:

loaded: array-like
(3,N) array-like which contains the indices of the bins in its first two rows, and their value on the third row.
delta_max: integer, default=-1
maximum genomic distance. If -1 it will compute the maximum distance directly from the indices of the matrix.

Returns:

dkn: array-like
the normalized matrix.
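
The normalization described above can be sketched in plain Python. This is an illustration, not the essHIC implementation: the function name `decay_norm` is hypothetical, and the average at each genomic distance is taken over the bins listed in `loaded` only. Dividing by the number of all possible bin pairs at that distance is another common convention and gives different values. A user-supplied `delta_max` must be at least the largest distance present in the data.

```python
def decay_norm(loaded, delta_max=-1):
    """Divide every bin by the mean value at its genomic distance |j - i|.

    Assumption: the mean is taken over the bins listed in `loaded`.
    """
    rows, cols, values = loaded
    if delta_max == -1:
        # Largest distance found in the indices themselves.
        delta_max = max(abs(j - i) for i, j in zip(rows, cols))
    totals = [0.0] * (delta_max + 1)
    counts = [0] * (delta_max + 1)
    for i, j, v in zip(rows, cols, values):
        d = abs(j - i)
        totals[d] += v
        counts[d] += 1
    normalized = [v / (totals[abs(j - i)] / counts[abs(j - i)])
                  for i, j, v in zip(rows, cols, values)]
    return [list(rows), list(cols), normalized]


# Made-up bins: two on the diagonal (distance 0) and two at distance 1.
loaded = [[0, 0, 1, 1], [0, 1, 1, 2], [4.0, 3.0, 2.0, 1.0]]
dkn = decay_norm(loaded)
```

In this example the mean at distance 0 is 3 and the mean at distance 1 is 2, so each value is divided by the mean of its own diagonal.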