essHIC.make_hic
essHIC.make_hic(indir,outdir,loader='from_index', chromosomes='human')
The function of make_hic is to sort the raw data, normalize them, and create metadata files.
In practice the other classes of essHIC (such as essHIC.hic or essHIC.red) rely on the path tree and naming scheme defined by make_hic to properly function.
- indir: string
- Directory containing the raw data.
- outdir: string
- output directory where the sorted and normalized data, as well as the metadata, will be put.
- loader: {from_index, from_matrix}, default: from_index
- The loader function to use, depending on the format of your data.
- chromosomes: {human} or list of integers, default:human
- Tells the class how to read the input matrices and split them into chromosomes. For HiC maps from a human cell line, the chromosome lengths are already stored in the package. Otherwise, provide a list of integers containing the length of each chromosome of your organism.
- input_directory: string
- the directory where the raw data are stored.
- output_directory: string
- the directory where new data and metadata will be saved.
- loader: function
- the loader function to read HiC data.
- chromosomes: list of integers
- list of the chromosome lengths.
- list_directories: list of strings
- list of the directories (one for each experiment) found in input_directory.
- newref_to_oldref: dictionary
- a dictionary mapping the names assigned by essHIC.make_hic to the original names of the experiments.
- oldref_to_newref: dictionary
- a dictionary mapping the original names of the experiments to the names assigned by essHIC.make_hic.
- oldref_to_cell: dictionary
- a dictionary mapping the original names of the experiments to their cell types.
- newref_to_cell: dictionary
- a dictionary mapping the names assigned by essHIC.make_hic to the cell types of the experiments.
method | function |
---|---|
save_metadata | saves a metadata.txt file which contains metadata information about the HiC matrices. |
save_chromosomes | saves a chromosomes.txt file which contains the lengths of the chromosomes at different resolutions. |
save_data | saves raw data as binary files. |
save_decay_norm | saves normalized data as binary files. |
get_metadata | reads a metadata file and creates dictionaries to convert between the new and the old nomenclature of the data. |
get_chromosomes | computes the chromosome boundaries at the requested resolution. |
load_from_index | loads an HiC matrix saved in the index format. |
load_from_matrix | loads an HiC matrix saved in the matrix format. |
compute_decay_norm | normalizes a matrix according to the decay norm. |
__init__(indir,outdir,loader='from_index', chromosomes='human')
initialize self.
save_metadata(self)
Saves a metadata.txt file in the output directory (self.output_directory). It contains the new names of the HiC experiments, the old ones, and the cell type. All the new names follow the format "hicxxx", where xxx is an integer padded with leading zeros (for example hic001). The experiments are sorted according to their cell type. The fourth column of the file can be used to mark outliers for removal by writing "remove". These experiments will not be considered when analysing the distance matrix.
- none
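The naming scheme described for save_metadata can be illustrated with a short sketch. This is not the essHIC implementation: the helper `assign_names`, the three-digit width, the numbering starting at 1, and the example experiment names are all assumptions made for illustration.

```python
def assign_names(oldref_to_cell, width=3):
    """Return (new_name, old_name, cell_type) rows sorted by cell type.

    Hypothetical helper, not part of essHIC.
    """
    # Sort by cell type first, then by old name so the order is deterministic.
    ordered = sorted(oldref_to_cell.items(), key=lambda kv: (kv[1], kv[0]))
    rows = []
    for i, (old, cell) in enumerate(ordered, start=1):
        # Zero-padded counter: 1 -> "hic001" when width is 3.
        rows.append((f"hic{i:0{width}d}", old, cell))
    return rows


# Made-up experiment names and cell types.
experiments = {"GSM03": "IMR90", "GSM01": "GM12878", "GSM02": "IMR90"}
rows = assign_names(experiments)
for new, old, cell in rows:
    print(new, old, cell)
```

Experiments of the same cell type end up with consecutive new names, which is the property the sorted metadata file provides.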
save_chromosomes(self,res_list)
Saves a chromosomes.txt file in the output directory (self.output_directory). It contains a line for each resolution found in the raw data. Each line starts with the resolution identifier and contains the length of each chromosome at that resolution.
- res_list: list of strings
- the list of the resolutions found in the raw data.
- none
save_data(self,from_norm='all',res='all',dirtree='',ext='abc',nameformat='NCR',full=False)
Saves selected raw data in the output directory as binary files which contain a numpy array of shape (3,K), where K is the number of nonzero bins in the upper triangular matrix. The first two rows of the array contain the indices of the bins, while the third contains their values. One can restrict the method to only copying files with a given normalization, resolution and extension. The method names the data according to the normalization, chromosome, and resolution of the matrix, as well as any additional information, if provided.
- from_norm: string, default='all'
- identifier of the normalization of the matrices to save. If it is not 'all' it will only save data with the given normalization.
- res: string, default='all'
- identifier of the resolution of the matrices to save. If it is not 'all' it will only save data with the given resolution.
- dirtree: string, default=''
- path from the experiment directory to the directory containing the matrices.
- ext: string, default='abc'
- extension of the data files to save. It will ignore files with different extensions.
- nameformat: string, default='NCR'
- format of the name of the files. Each file name is stripped of its extension and split into fields at underscore ( "_" ) separators. Information about the HiC matrix is extracted in the order indicated by the string: 'N' stands for normalization, 'C' for chromosome, 'R' for resolution. You can skip a field by putting a placeholder character (for example 'x') at its position in the string. To have the method save additional information, put an 'A' at the position of each field to read.
- full: bool, default=False
- if 'True', it will save full matrices instead of breaking them into single chromosomes.
- none
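The (3,K) layout described for save_data can be sketched with NumPy. The helper `dense_to_triplets` is hypothetical, and including the diagonal in the upper triangle is an assumption. The on-disk binary format used by essHIC is not reproduced here.

```python
import numpy as np


def dense_to_triplets(matrix):
    """Convert a square matrix to a (3, K) array of its nonzero upper-triangular bins.

    Hypothetical helper, not part of essHIC. The diagonal is kept (an assumption).
    """
    upper = np.triu(np.asarray(matrix, dtype=float))
    rows, cols = np.nonzero(upper)
    # Row 0: first bin index, row 1: second bin index, row 2: bin value.
    return np.vstack([rows, cols, upper[rows, cols]])


dense = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 2.0],
                  [0.0, 2.0, 5.0]])
triplets = dense_to_triplets(dense)
print(triplets.shape)
```

Only the five nonzero bins on or above the diagonal are kept, so a symmetric contact map is stored without redundancy.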
save_decay_norm(self,from_norm='nrm',res='all',makenew=False,dirtree='',ext='abc',nameformat='NCR',full=False,compute_chbounds=True)
Computes the decay norm for selected matrices and saves them in the output directory as binary files which contain a numpy array of shape (3,K), where K is the number of nonzero bins in the upper triangular matrix. The first two rows of the array contain the indices of the bins, while the third contains their normalized values. One can restrict the method to only processing files with a given normalization, resolution and extension. The method names the data according to the normalization, chromosome, and resolution of the matrix, as well as any additional information, if provided.
- from_norm: string, default='nrm'
- identifier of the normalization of the matrices to save. If it is not 'all' it will only save data with the given normalization.
- res: string, default='all'
- identifier of the resolution of the matrices to save. If it is not 'all' it will only save data with the given resolution.
- dirtree: string, default=''
- path from the experiment directory to the directory containing the matrices.
- ext: string, default='abc'
- extension of the data files to save. It will ignore files with different extensions.
- nameformat: string, default='NCR'
- format of the name of the files. Each file name is stripped of its extension and split into fields at underscore ( "_" ) separators. Information about the HiC matrix is extracted in the order indicated by the string: 'N' stands for normalization, 'C' for chromosome, 'R' for resolution. You can skip a field by putting a placeholder character (for example 'x') at its position in the string. To have the method save additional information, put an 'A' at the position of each field to read.
- full: bool, default=False
- if 'True', it will save full matrices instead of breaking them into single chromosomes.
- compute_chbounds: bool, default=True
- if 'False', the chromosome boundaries (chbounds) are not computed, and the largest possible distance between two indices is used instead.
- none
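How a nameformat string lines up with the underscore-separated fields of a file name can be sketched as follows. The helper `parse_name`, the dictionary keys and the example file names are hypothetical, and the real essHIC parser may behave differently (for instance when the number of fields and the length of the string disagree).

```python
import os


def parse_name(filename, nameformat="NCR"):
    """Split a file name on underscores and label each field per `nameformat`.

    Hypothetical helper for illustration; not the essHIC parser.
    """
    stem, _ext = os.path.splitext(os.path.basename(filename))
    fields = stem.split("_")
    info = {"norm": None, "chromosome": None, "resolution": None, "additional": []}
    keys = {"N": "norm", "C": "chromosome", "R": "resolution"}
    for code, field in zip(nameformat, fields):
        if code in keys:
            info[keys[code]] = field
        elif code == "A":
            info["additional"].append(field)
        # Any other character (for example 'x') skips the field.
    return info


# Made-up file names with the default 'abc' extension.
parsed = parse_name("raw_chr17_100kb.abc", nameformat="NCR")
skipped = parse_name("sample1_raw_chr17_100kb_rep2.abc", nameformat="xNCRA")
print(parsed)
print(skipped)
```

In the second call the leading field is ignored because of the 'x', and the trailing field is collected as additional information because of the 'A'.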
get_metadata(self, metadata)
Obtains metadata about the original dataset from a metadata file which contains the name of the experiment on the first column, and the cell type (or some other attribute of the experiments) on the second.
- metadata: string
- path to the metadata file.
- none
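A two-column metadata file of the kind get_metadata expects can be read into a dictionary with a few lines of Python. This is a minimal sketch: the helper `read_metadata`, the whitespace separator, the handling of blank or commented lines, and the sample names are assumptions, not the essHIC behaviour.

```python
def read_metadata(lines):
    """Map experiment name -> cell type from two-column, whitespace-separated lines.

    Hypothetical helper, not part of essHIC.
    """
    oldref_to_cell = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue  # skip blank, incomplete, or commented lines (an assumption)
        oldref_to_cell[parts[0]] = parts[1]
    return oldref_to_cell


# Made-up metadata lines: experiment name, then cell type.
sample = ["GSM01 GM12878", "GSM02 IMR90", ""]
cells = read_metadata(sample)
print(cells)
```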
get_chromosomes(self,res)
Computes the boundaries of each chromosome at the given resolution.
- res: string
- the resolution specification, written as a string of which the last two characters are the unit of measure (kb = 1000 b, Mb = 1000000 b, Gb = 1000000000 b ).
- chromo_bound: nested list of integers
- It contains the chromosome boundaries. The first index of the list refers to the chromosome, the second to the lower (0) or upper (1) boundary of the chromosome.
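The resolution string and the nested boundary list can be illustrated with a small sketch. The helpers `parse_resolution` and `chromosome_bounds` are hypothetical. Rounding each chromosome up to a whole number of bins and treating the upper boundary as exclusive are assumptions that may not match essHIC.

```python
# Units given in the description of `res`: the last two characters of the string.
UNITS = {"kb": 1_000, "Mb": 1_000_000, "Gb": 1_000_000_000}


def parse_resolution(res):
    """Convert a string such as '100kb' into a number of base pairs."""
    return int(float(res[:-2]) * UNITS[res[-2:]])


def chromosome_bounds(lengths_bp, res):
    """Return [lower, upper] bin boundaries for consecutive chromosomes.

    Assumes each chromosome is rounded up to a whole number of bins and that
    the upper boundary is exclusive; essHIC may use a different convention.
    """
    step = parse_resolution(res)
    bounds, start = [], 0
    for length in lengths_bp:
        nbins = -(-length // step)  # ceiling division
        bounds.append([start, start + nbins])
        start += nbins
    return bounds


# Two made-up chromosomes of 2.5 Mb and 1.2 Mb at 1 Mb resolution.
bounds = chromosome_bounds([2_500_000, 1_200_000], "1Mb")
print(bounds)
```

Here `bounds[1][0]` is the lower boundary of the second chromosome and `bounds[1][1]` its upper boundary, matching the indexing described above.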
load_from_index(self,fname)
Loads a matrix from a file in the index format, where the first column contains the bin lower index number, the second the bin upper index number, and the third specifies the bin value. Lines which start with # are disregarded.
- fname: string
- the name of the index file.
- loaded: array-like
- the matrix.
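The index format can be read with `numpy.loadtxt`, which skips lines starting with `#` by default. The sketch below is an illustration under assumptions: the helper `load_index_text` is hypothetical, the columns are assumed to be whitespace-separated, and returning a (3,K) array is a choice made to match the layout used by compute_decay_norm.

```python
import io

import numpy as np


def load_index_text(handle):
    """Read 'i j value' rows, skipping '#' comment lines, into a (3, K) array.

    Hypothetical helper, not part of essHIC.
    """
    data = np.loadtxt(handle, comments="#", ndmin=2)
    # One column per input row: indices in rows 0 and 1, values in row 2.
    return data.T


# An in-memory stand-in for an index file with a commented header line.
text = io.StringIO("# bin_i bin_j count\n0 0 4\n0 1 1\n1 1 3\n1 2 2\n")
loaded = load_index_text(text)
print(loaded.shape)
```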
load_from_matrix(self,fname)
Loads a matrix from a file in the matrix format. The file contains the values of each bin arranged in N rows and N columns, where N is the size of the matrix. Lines which start with # are disregarded.
- fname: string
- the name of the matrix file.
- loaded: array-like
- the matrix.
compute_decay_norm(self,loaded,delta_max=-1)
Computes the decay normalization (or observed versus expected normalization) of a matrix passed by its indices. The decay norm is given by dividing each bin by the average interaction at its genomic distance.
- loaded: array-like
- (3,N) array-like which contains the indices of the bins in its first two rows, and their value on the third row.
- delta_max: integer, default=-1
- maximum genomic distance. If -1 it will compute the maximum distance directly from the indices of the matrix.
- dkn: array-like
- the normalized matrix.
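The decay (observed versus expected) normalization can be sketched with NumPy as follows. This is an illustration, not the essHIC code: the function `decay_norm` is hypothetical, and it assumes that the average at each genomic distance is taken over all bin pairs at that distance in a matrix of size delta_max + 1, counting missing bins as zeros. It also assumes delta_max is at least the largest distance present in the data.

```python
import numpy as np


def decay_norm(loaded, delta_max=-1):
    """Divide each bin by the average value at its genomic distance |i - j|.

    Sketch under stated assumptions; essHIC may average differently.
    """
    loaded = np.asarray(loaded, dtype=float)
    i, j = loaded[0].astype(int), loaded[1].astype(int)
    dist = np.abs(i - j)
    if delta_max < 0:
        # Take the maximum distance directly from the indices.
        delta_max = int(dist.max())
    # Total signal at each distance.
    sums = np.bincount(dist, weights=loaded[2], minlength=delta_max + 1)
    # Number of bin pairs at each distance in a (delta_max + 1)-sized matrix.
    npairs = delta_max + 1 - np.arange(sums.size)
    expected = sums / npairs
    return np.vstack([i, j, loaded[2] / expected[dist]])


# Upper triangle of a 3x3 map whose values depend only on distance.
loaded = np.array([[0, 0, 0, 1, 1, 2],
                   [0, 1, 2, 1, 2, 2],
                   [4, 2, 1, 4, 2, 4]], dtype=float)
dkn = decay_norm(loaded)
print(dkn[2])
```

Because the example map depends only on the distance between bins, every normalized value equals 1. Deviations from 1 in real data mark contacts that are enriched or depleted relative to the distance decay.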