essHIC.make_hic

essHIC.make_hic(indir,outdir,loader='from_index', chromosomes='human')

make_hic sorts the raw data, normalizes them, and creates the metadata files.

In practice the other classes of essHIC (such as essHIC.hic or essHIC.red) rely on the path tree and naming scheme defined by make_hic to properly function.

Parameters:

indir: string: Directory containing the raw data.
outdir: string: output directory where the sorted and normalized data, as well as the metadata, will be put.
loader: {'from_index', 'from_matrix'}, default='from_index': the loader function to use, depending on the format of your data.
chromosomes: 'human' or list of integers, default='human': tells the class how to read your input matrices and break them down into chromosomes. If the HiC maps are from a human cell line, the chromosome lengths are already stored in the package. Otherwise, provide a list of integers containing the length of each chromosome in your organism.

Attributes:

input_directory: string: the directory where the raw data are stored.
output_directory: string: the directory where new data and metadata will be saved.
loader: function: the loader function to read HiC data.
chromosomes: list of integers: list of the chromosome lengths.
list_directories: list of strings: list of the directories (one for each experiment) found in input_directory.
newref_to_oldref: dictionary: maps each name assigned by essHIC.make_hic to the old name of the experiment.
oldref_to_newref: dictionary: maps each old name of an experiment to the name assigned by essHIC.make_hic.
oldref_to_cell: dictionary: maps each old name of an experiment to its cell type.
newref_to_cell: dictionary: maps each new name of an experiment to its cell type.

Methods

method	function
save_metadata	saves a metadata.txt file which contains metadata information about the HiC matrices.
save_chromosomes	saves a chromosome.txt file which contains the lengths of the chromosomes at different resolutions.
save_data	saves raw data as binary files.
save_decay_norm	saves normalized data as binary files.
get_metadata	reads a metadata file and creates dictionaries to convert between the new and the old nomenclature of the data.
get_chromosomes	creates chromosome boundaries at the requested resolution.
load_from_index	loads a HiC matrix saved in the index format.
load_from_matrix	loads a HiC matrix saved in the matrix format.
compute_decay_norm	normalizes a matrix according to the decay norm.

__init__(indir,outdir,loader='from_index', chromosomes='human')

initialize self.

save_metadata

save_metadata(self)

Saves a metadata.txt file in the output directory (self.output_directory). It contains the new names of the HiC experiments, the old ones, and the cell type. All new names follow the format 'hicxxx', where xxx is an integer padded with leading zeros (for example hic001). The experiments are sorted by cell type. The fourth column of the file can be used to mark outliers for removal by writing 'remove'; these experiments are then excluded when analysing the distance matrix.
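
The naming rule described above can be illustrated with a minimal sketch. The helper `assign_new_names` and the experiment names are hypothetical and not part of essHIC; the sketch assumes a three-digit zero-padded counter starting at 1 and alphabetical ordering of the cell types, neither of which is confirmed here.

```python
def assign_new_names(old_to_cell):
    """Hypothetical helper (not essHIC API): map old names to 'hicNNN' names."""
    # Sort by cell type first, then by old name for a reproducible order (assumption)
    ordered = sorted(old_to_cell.items(), key=lambda kv: (kv[1], kv[0]))
    # Assumption: the counter starts at 1 and is zero-padded to three digits
    return {old: f"hic{n:03d}" for n, (old, _cell) in enumerate(ordered, start=1)}


# Made-up experiment names and cell types, for illustration only
old_to_cell = {"GSM100": "IMR90", "GSM101": "GM12878", "GSM102": "IMR90"}
new_names = assign_new_names(old_to_cell)
```

Experiments sharing a cell type end up with consecutive names, which is what makes the sorted listing in metadata.txt easy to scan.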

Returns:

none

save_chromosomes

save_chromosomes(self,res_list)

Saves a chromosomes.txt file in the output directory (self.output_directory). It contains a line for each resolution found in the raw data. Each line starts with the resolution identifier and contains the length of each chromosome at that resolution.

Parameters:

res_list: list of strings: the list of the resolutions found in the raw data.

Returns:

none

save_data

save_data(self,from_norm='all',res='all',dirtree='',ext='abc',nameformat='NCR',full=False)

Saves selected raw data in the output directory as binary files, each containing a numpy array of shape (3,K), where K is the number of nonzero bins in the upper triangular matrix. The first two rows of the array contain the indices of the bins, while the third contains their values. One can restrict the method to copying only files with a given normalization, resolution, and extension. The method names the data according to the normalization, chromosome, and resolution of the matrix, as well as any additional information provided.
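
As a rough illustration of the (3,K) layout, the sketch below converts a small dense matrix into that form. The helper `dense_to_index` is hypothetical and not part of essHIC; it assumes the diagonal counts as part of the upper triangle, and it does not reproduce how essHIC writes the array to disk.

```python
import numpy as np


def dense_to_index(matrix):
    """Hypothetical helper: convert a square matrix to the (3, K) index layout."""
    # Keep the diagonal and everything above it (assumption: diagonal included)
    upper = np.triu(np.asarray(matrix, dtype=float))
    # Positions of the nonzero bins in the upper triangle
    rows, cols = np.nonzero(upper)
    # Row 0: first index, row 1: second index, row 2: bin value
    return np.vstack([rows, cols, upper[rows, cols]])


dense = np.array([[2.0, 1.0, 0.0],
                  [1.0, 0.0, 3.0],
                  [0.0, 3.0, 4.0]])
sparse = dense_to_index(dense)   # K = 4 nonzero bins in the upper triangle
```

Because a HiC map is symmetric, storing only the upper triangle loses no information and roughly halves the size of the saved array.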

Parameters:

from_norm: string, default='all': identifier of the normalization of the matrices to save. If it is not 'all' it will only save data with the given normalization.
res: string, default='all': identifier of the resolution of the matrices to save. If it is not 'all' it will only save data with the given resolution.
dirtree: string, default='': path from the experiment directory to the directory containing the matrices.
ext: string, default='abc': extension of the data files to save. It will ignore files with different extensions.
nameformat: string, default='NCR': format of the name of the files. Each file name is stripped of its extension and split at underscore ( "_" ) separators. Information about the HiC matrix is extracted in the order indicated by the string. 'N' stands for normalization, 'C' stands for chromosome, 'R' stands for resolution. You can skip a field by putting a wildcard character in the string (for example 'x'). You can have the method save additional information by putting 'A' characters at the positions of the fields to read.
full: bool, default=False: if True, it will save full matrices instead of breaking them into single chromosomes.
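
The nameformat rule can be sketched as follows. The helper `parse_name` and the file names are hypothetical and only mirror the description above; how essHIC handles names with more or fewer fields than the format string is not stated, so the sketch simply ignores the surplus.

```python
import os


def parse_name(filename, nameformat="NCR"):
    """Hypothetical helper: split a file name into fields following a format string."""
    # Strip the directory and the extension, then split at underscores
    stem, _ext = os.path.splitext(os.path.basename(filename))
    fields = stem.split("_")
    info = {"norm": None, "chrom": None, "res": None, "extra": []}
    keys = {"N": "norm", "C": "chrom", "R": "res"}
    # zip stops at the shorter sequence, so surplus fields are ignored (assumption)
    for code, field in zip(nameformat, fields):
        if code in keys:
            info[keys[code]] = field
        elif code == "A":
            info["extra"].append(field)
        # any other character (for example 'x') skips the field
    return info


# Made-up file names, for illustration only
parsed = parse_name("raw_chr1_100kb.abc")
parsed_extra = parse_name("rep1_raw_tmp_chr2_1Mb.abc", nameformat="ANxCR")
```

With the format 'ANxCR', the first field is kept as additional information, the third is skipped, and the remaining three are read as normalization, chromosome, and resolution.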

Returns:

none

save_decay_norm

save_decay_norm(self,from_norm='nrm',res='all',makenew=False,dirtree='',ext='abc',nameformat='NCR',full=False,compute_chbounds=True)

Computes the decay norm for selected matrices and saves them in the output directory as binary files, each containing a numpy array of shape (3,K), where K is the number of nonzero bins in the upper triangular matrix. The first two rows of the array contain the indices of the bins, while the third contains their normalized values. One can restrict the method to processing only files with a given normalization, resolution, and extension. The method names the data according to the normalization, chromosome, and resolution of the matrix, as well as any additional information provided.

Parameters:

from_norm: string, default='nrm': identifier of the normalization of the matrices to process. If it is not 'all' it will only process data with the given normalization.
res: string, default='all': identifier of the resolution of the matrices to process. If it is not 'all' it will only process data with the given resolution.
dirtree: string, default='': path from the experiment directory to the directory containing the matrices.
ext: string, default='abc': extension of the data files to save. It will ignore files with different extensions.
nameformat: string, default='NCR': format of the name of the files. Each file name is stripped of its extension and split at underscore ( "_" ) separators. Information about the HiC matrix is extracted in the order indicated by the string. 'N' stands for normalization, 'C' stands for chromosome, 'R' stands for resolution. You can skip a field by putting a wildcard character in the string (for example 'x'). You can have the method save additional information by putting 'A' characters at the positions of the fields to read.
full: bool, default=False: if True, it will save full matrices instead of breaking them into single chromosomes.
compute_chbounds: bool, default=True: if False, the chromosome boundaries are not computed and the largest possible distance between two indices is used instead.

Returns:

none

get_metadata

get_metadata(self, metadata)

Obtains metadata about the original dataset from a metadata file which contains the name of the experiment on the first column, and the cell type (or some other attribute of the experiments) on the second.

Parameters:

metadata: string: path to the metadata file.

Returns:

none
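
A minimal sketch of reading such a two-column file is given below. The reader `read_metadata` is hypothetical and not the essHIC implementation; it assumes whitespace-separated columns, which is not specified above, and the sample rows are made up.

```python
import io


def read_metadata(handle):
    """Hypothetical reader: first column = experiment name, second = cell type."""
    old_to_cell = {}
    for line in handle:
        columns = line.split()   # assumption: whitespace-separated columns
        if len(columns) < 2:
            continue             # skip blank or incomplete lines
        old_to_cell[columns[0]] = columns[1]
    return old_to_cell


# An in-memory stand-in for a metadata file
example = io.StringIO("GSM100 IMR90\nGSM101 GM12878\n\nGSM102 IMR90\n")
old_to_cell = read_metadata(example)
```

The resulting dictionary plays the role described for the oldref_to_cell attribute: a lookup from the original experiment name to its cell type.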

get_chromosomes

get_chromosomes(self,res)

Computes the boundaries of each chromosome at the given resolution.

Parameters:

res: string: the resolution specification, written as a string of which the last two characters are the unit of measure (kb = 1000 b, Mb = 1000000 b, Gb = 1000000000 b ).

Returns:

chromo_bound: nested list of integers: It contains the chromosome boundaries. The first index of the list refers to the chromosome, the second to the lower (0) or upper (1) boundary of the chromosome.
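
The resolution string and the boundary list can be sketched as follows. Both helpers are hypothetical and not part of essHIC; the sketch assumes an integer numeric part in the resolution string, partial bins rounded up, and cumulative boundaries with an exclusive upper bound, none of which is confirmed above.

```python
import math

# Units as listed in the parameter description
UNITS = {"kb": 10**3, "Mb": 10**6, "Gb": 10**9}


def parse_resolution(res):
    """Convert a resolution string such as '100kb' to base pairs."""
    value, unit = res[:-2], res[-2:]   # the last two characters are the unit
    return int(value) * UNITS[unit]    # assumption: the numeric part is an integer


def chromosome_bounds(lengths_bp, res):
    """Hypothetical helper: cumulative [lower, upper] bin boundaries per chromosome."""
    step = parse_resolution(res)
    bounds, start = [], 0
    for length in lengths_bp:
        nbins = math.ceil(length / step)   # assumption: partial bins are rounded up
        bounds.append([start, start + nbins])
        start += nbins
    return bounds


# Two toy chromosomes of 250 kb and 120 kb binned at 100 kb
bounds = chromosome_bounds([250_000, 120_000], "100kb")
```

Here `bounds[0][0]` is the lower boundary of the first chromosome and `bounds[0][1]` its upper boundary, matching the indexing described for chromo_bound.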

load_from_index

load_from_index(self,fname)

Loads a matrix from a file in the index format, where the first column contains the bin lower index number, the second the bin upper index number, and the third specifies the bin value. Lines which start with # are disregarded.

Parameters:

fname: string: the name of the index file.

Returns:

loaded: array-like: the matrix.
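
As an illustration of the index format, the sketch below reads a small three-column table while skipping comment lines. The reader `read_index_file` is hypothetical and not the essHIC implementation; returning a (3,K) array is an assumption based on the input expected by compute_decay_norm.

```python
import io

import numpy as np


def read_index_file(handle):
    """Hypothetical reader for the three-column index format."""
    # np.loadtxt drops everything after '#', which covers comment lines
    table = np.loadtxt(handle, comments="#", ndmin=2)
    # Assumption: return rows as (first index, second index, value)
    return table.T


# An in-memory stand-in for an index file
example = io.StringIO("# bin_i bin_j value\n0 0 2.0\n0 1 1.0\n1 2 3.0\n2 2 4.0\n")
loaded = read_index_file(example)
```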

load_from_matrix

load_from_matrix(self,fname)

Loads a matrix from a file in the matrix format. The file contains the values of each bin arranged in N rows and N columns, where N is the size of the matrix. Lines which start with # are disregarded.

Parameters:

fname: string: the name of the matrix file.

Returns:

loaded: array-like: the matrix.
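
A dense N x N text file of this kind can be read along the following lines. The reader `read_matrix_file` is hypothetical and not the essHIC implementation; whether essHIC returns the dense matrix or converts it to index form is not stated above, so the sketch stops at the dense array.

```python
import io

import numpy as np


def read_matrix_file(handle):
    """Hypothetical reader for the dense N x N text format."""
    # Lines starting with '#' are treated as comments and skipped
    matrix = np.loadtxt(handle, comments="#", ndmin=2)
    if matrix.shape[0] != matrix.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {matrix.shape}")
    return matrix


# An in-memory stand-in for a matrix file
example = io.StringIO("# 3 x 3 toy map\n2 1 0\n1 0 3\n0 3 4\n")
dense = read_matrix_file(example)
```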

compute_decay_norm

compute_decay_norm(self,loaded,delta_max=-1)

Computes the decay normalization (or observed-versus-expected normalization) of a matrix passed by its indices. The decay norm is obtained by dividing each bin by the average interaction at its genomic distance:

$M_{i,j}^{(norm)} = \frac{ M_{i,j} }{ \frac{1}{N_{|i-j|}} \sum_{m,n} M_{m,n}\big|_{|m-n|=|i-j|}}$

where $N_{|i-j|}$ is the number of bins entering the sum at genomic distance $|i-j|$.

Parameters:

loaded: array-like: (3,N) array-like which contains the indices of the bins in its first two rows, and their value on the third row.
delta_max: integer, default=-1: maximum genomic distance. If -1 it will compute the maximum distance directly from the indices of the matrix.

Returns:

dkn: array-like: the normalized matrix.
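
The prose definition above (each bin divided by the average interaction at its genomic distance) can be sketched as follows. The function `decay_norm` is a hypothetical re-implementation, not the essHIC code; in particular it assumes that the average runs only over the bins stored in the (3,N) array, and essHIC may instead average over all bins at a given distance.

```python
import numpy as np


def decay_norm(loaded, delta_max=-1):
    """Hypothetical sketch: divide each bin by the mean value at its distance."""
    loaded = np.asarray(loaded, dtype=float)
    i, j, values = loaded[0].astype(int), loaded[1].astype(int), loaded[2]
    dist = np.abs(i - j)                  # genomic distance in bins
    if delta_max < 0:
        delta_max = int(dist.max())       # infer the maximum distance from the indices
    # Sum and count of the stored bins at each distance
    totals = np.bincount(dist, weights=values, minlength=delta_max + 1)
    counts = np.bincount(dist, minlength=delta_max + 1)
    # Assumption: the average runs over stored (nonzero) bins only
    means = np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)
    normalized = loaded.copy()
    normalized[2] = values / means[dist]
    return normalized


# Toy 3 x 3 map in index form: rows are first index, second index, value
loaded = np.array([[0, 1, 2, 0, 1, 0],
                   [0, 1, 2, 1, 2, 2],
                   [2.0, 4.0, 6.0, 1.0, 3.0, 5.0]])
dkn = decay_norm(loaded)
```

In this toy map the diagonal averages to 4, the first off-diagonal to 2, and the single bin at distance 2 equals its own average, so it normalizes to 1.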
