GitHub - tanakanishi/findclosest

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
GEOdata		GEOdata
additional		additional
ColoredClustering.py		ColoredClustering.py
README.txt		README.txt
calculate.sh		calculate.sh
colname.txt		colname.txt
distance.py		distance.py
merge.pvalue2.64000.GEO.txt		merge.pvalue2.64000.GEO.txt
readfile1.sh		readfile1.sh
readfile2.sh		readfile2.sh
rowname.txt		rowname.txt
test1.narrowPeak		test1.narrowPeak
ward_closest.py		ward_closest.py

Repository files navigation

###Overview

    Calculate the Hamming distances between input sample and 77 ATAC-seq datasets from 13 human primary blood cell types.
    Evaluate which type of hematopoietic cell is closest to an input sample	


###Requirement

bedtools (https://bedtools.readthedocs.io/en/latest/content/installation.html)
python (python 2.7.12 or higher or python 3)
python modules : numpy, pandas, statistics, scipy.cluster, scipy.spatial.distance


###Preprocessing
mapping : bwa (version 0.7.16a or higher)

sam to bam exchange & selecting aligned reads with high mapping quality : samtools (0.1.18 or higher)

remove duplicates : PICARD software (v1.119 or higher)

peak calling : MACS2 (v2.1.2 or higher)  --nomodel --nolambda --keep-dup all -p 0.01



###Input file format

Please input {samplename}.narrowPeak file, the output of MACS2


###Output

{samplename} directory is made and under the current directory.
(i) Rank the peak results in the order of ascending p-values and acquire top 64000 peaks  => {samplename}.64000.bed
(ii) Calculate the Hamming distance between input sample and 77 ATAC-seq datasets => {samplename}.distanceresult.withtitle.txt
(iii) Clustering result => {samplename}.distanceresult.withtitle.pdf
(iv) Which cell type is closeset to the input sample => Please see {samplename}.close_cell_type.txt. Each column indicates "ranking", "cell type", "diatance".

above (i), (ii), (iii),(iv)  files are stored under ./samplename directory


###Usage
sh calculate.sh {INPUTfile}


###Explanation of each file
(1) merge.pvalue2.64000.GEO.txt => 
	distance matrix between 77 ATAC-seq datasets from 13 human primary blood cell types (For the calculation, top 64000 peaks of each sample is used)

(2) Files in the ./GEOdata directory => 
	For example, "Bcell_1.pvalue2.64000.bed" is produced by ranking the peak result in the order of ascending p-values and acquiring top 64000 peaks of Bcell_1 peak calling result.

(3) Files in the ./addiotional  directory
	(i) samplelist.txt => 
		Sample name and its corresponding GSM number
	(ii) penaltyscore.sh => 
		Calculate penalty score. This script is only for 77 ATAC-seq datasets from 13 human primary blood cell types. 
		When using this script, the order of the input file is important. Please refer to the "pvalue2_64000_77samples.txt" in this directory.
	(iii) pvalue2_64000_77samples.txt => 
		Hamming distance matrix of 77 ATAC-seq datasets from 13 human primary blood cell types.
		In order to calculate the penalty score, please run
		sh penaltyscore.sh pvalue2_64000_77samples.txt 
	

(4) test1.narrowPeak  => A testdata. This data is from CLL sample. 
	If you run  
		sh calculate.sh test1.narrowPeak
	then, "test1" directory is made and under this directory, output files are stored.


###Contact

Azusa Tanaka :  a-tanaka at m.u-tokyo.ac.jp

About

No description, website, or topics provided.

Readme

Activity

0 stars

1 watching

0 forks

Report repository

Releases

No releases published

Packages

No packages published

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GEOdata

GEOdata

additional

additional

ColoredClustering.py

ColoredClustering.py

README.txt

README.txt

calculate.sh

calculate.sh

colname.txt

colname.txt

distance.py

distance.py

merge.pvalue2.64000.GEO.txt

merge.pvalue2.64000.GEO.txt

readfile1.sh

readfile1.sh

readfile2.sh

readfile2.sh

rowname.txt

rowname.txt

test1.narrowPeak

test1.narrowPeak

ward_closest.py

ward_closest.py

Repository files navigation

About

Releases

Packages

Languages

tanakanishi/findclosest

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages