
Issue with Gencode_TSS_pc_lincRNA_antisense.bed file #2

Open

plbaldoni opened this issue Feb 5, 2020 · 2 comments
@plbaldoni

Hi,

The file Gencode_TSS_pc_lincRNA_antisense.bed has an extra trailing tab at the end of every line, which causes the R pipeline to fail when computing the annotation.
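As a workaround (a minimal sketch, assuming GNU sed and that the trailing tab is the only problem with the file), the extra tab can be stripped in place before running the pipeline:

 sed -i 's/\t$//' Gencode_TSS_pc_lincRNA_antisense.bed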

Also, it would be great if the authors could make the code used to process the raw data (from SRA files to count matrices) available to the community. It is unclear how to obtain the count matrices from the SRA files posted at https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA481734.

Best,
Pedro Baldoni

@caokai001

Hello, have you managed to run this code?

  • Could we get the test files referenced in the script?
 # input$count_matrix_1 = "test_set/HBCx_95_hg38_paper/MatCov_Sample1_merged123_hg38_rmdup_1000reads_hg38_50kb_chunks.txt"
 # input$count_matrix_2 = "test_set/HBCx_95_hg38_paper/MatCov_Sample2_merged123_hg38_rmdup_1000reads_hg38_50kb_chunks.txt"
 # input$bam1 =  "test_set/HBCx_95_hg38_paper/HBCx_95_flagged_rmDup.bam"
 # input$bam2 =  "test_set/HBCx_95_hg38_paper/HBCx_95_flagged_rmDup.bam"
  • Like you, I'm not sure how to obtain the count matrices from the SRA files.

@Pacomito
Member

Pacomito commented Jun 2, 2020

Hello @plbaldoni & @caokai001,

To get count matrices from the raw data, you'll have to download the pipeline we developed: https://github.com/vallotlab/scChIPseq_DataEngineering/tree/devel (make sure to use the devel branch).
This is a slightly modified version of the pipeline used in Grosselin et al., but it will produce more accurate results (better duplicate removal).

You will have to manually change the paths to all the tools in the first lines of the 'CONFIG_TEMPLATE' file, at the root of the directory. (If you don't have bowtie1 or BWA, you can just omit them, as they are not used by default.)
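To spot the hardcoded paths that need updating, assuming they follow the same '/data/...' pattern as the examples below from 'species_design_configs.csv', something like this should list them with their line numbers:

 grep -n '/data/' CONFIG_TEMPLATE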

You will also have to modify the following in 'species_design_configs.csv', at the root of the directory (a sed sketch for these substitutions is given after this list):

  1. The paths to the STAR-2.6.0c mapping indexes (STAR is used for genome mapping by default), for human & mouse:
    '/data/annotations/pipelines/Human/hg38/indexes/STAR' -> '/path/to/hg38/STAR/'
    '/data/annotations/Mouse/mm10/complete/STAR_indexes' -> '/path/to/mm10/STAR/'

  2. The path to the Bowtie2 barcode indexes (Hifibio design, used for demultiplexing):
    '/data/users/pprompsy/Annotation/Barcodes_HiFiBio/index_barcode/bowtie_2_index_short/ref_index_' -> '/path/to/source/Barcodes/Barcodes_HiFiBio/index_barcode/bowtie_2_index_short/ref_index_'

  3. The paths to the blacklist regions for human & mouse:
    '/data/users/pprompsy/Annotation/bed/hg38-blacklist.v2.bed' -> '/path/to/source/BED/hg38-blacklist.v2.bed'
    '/data/users/pprompsy/Annotation/bed/mm10.blacklist.merged.bed' -> '/path/to/source/BED/mm10.blacklist.merged.bed'

where /path/to/source is the root of the GitHub directory.
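As an illustration (not taken from the pipeline documentation), the hg38 substitutions above could be applied with GNU sed; adjust the replacement paths to your own setup before running:

 sed -i 's|/data/annotations/pipelines/Human/hg38/indexes/STAR|/path/to/hg38/STAR/|g' species_design_configs.csv
 sed -i 's|/data/users/pprompsy/Annotation/bed/hg38-blacklist.v2.bed|/path/to/source/BED/hg38-blacklist.v2.bed|g' species_design_configs.csv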

For all the samples described in the paper, the barcode design is 'Hifibio'.
So, for instance, for the HBCx95_hg38 H3K27me3 sample you would run:

ASSEMBLY=hg38
OUTPUT_CONFIG=~/CONFIG_Hifibio
MARK=h3k27me3

./schip_processing.sh GetConf --template CONFIG_TEMPLATE --configFile species_design_configs.csv --designType Hifibio --genomeAssembly ${ASSEMBLY} --outputConfig ${OUTPUT_CONFIG} --mark ${MARK}

You might want to omit the coverage step, which produces bigwigs but needs a bioconda environment whose path is hardcoded in the pipeline.

You would run, after downloading the fastq files from SRA:

OUTPUT_DIR=/path/to/output/
NAME="HBCx95_hg38_k27"
READ1=/data/users/pprompsy/tests/[fastq_downloaded_from_SRA].R1.fastq.gz
READ2=/data/users/pprompsy/tests/[fastq_downloaded_from_SRA].R2.fastq.gz

./schip_processing.sh Barcoding+Trimming+Mapping+Filtering+Counting -f ${READ1} -r ${READ2} -c ${OUTPUT_CONFIG}  -o ${OUTPUT_DIR} --name ${NAME}

You need to have at least 40-60 GB of RAM and 8 cores available. This will produce BAM files as well as count matrices in the output directory, which you can input into this R downstream analysis script.
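Regarding the "downloading the fastq files from SRA" step mentioned above, a rough sketch using the SRA Toolkit (the SRR accession is a placeholder; check the run table of PRJNA481734 for the actual IDs):

 prefetch SRRXXXXXXX
 fasterq-dump --split-files SRRXXXXXXX -O fastq/
 gzip fastq/SRRXXXXXXX_1.fastq fastq/SRRXXXXXXX_2.fastq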

If you have issues with the pipelines, don't hesitate to post on the respective page.

Thank you for noticing that the bed file is corrupted; I will correct it.
Pacome
