
Issue with Gencode_TSS_pc_lincRNA_antisense.bed file #2

Open

plbaldoni opened this issue Feb 5, 2020 · 2 comments
@plbaldoni

Hi,

The file Gencode_TSS_pc_lincRNA_antisense.bed has an extra trailing tab at the end of every line, which causes the R pipeline to fail when computing the annotation.
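As a workaround (a minimal sketch, assuming GNU sed and that the trailing tab is the only problem with the file), the extra tab can be stripped in place before running the pipeline:

 sed -i 's/\t$//' Gencode_TSS_pc_lincRNA_antisense.bed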

Also, it would be great if the authors could make the code used to process the raw data (from SRA files to count matrices) available to the community. It is unclear how to obtain the count matrices from the SRA files posted at https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA481734.

Best,
Pedro Baldoni

@caokai001

Hello, have you managed to run this code?

  • Could we get the test files referenced in the script?
 # input$count_matrix_1 = "test_set/HBCx_95_hg38_paper/MatCov_Sample1_merged123_hg38_rmdup_1000reads_hg38_50kb_chunks.txt"
 # input$count_matrix_2 = "test_set/HBCx_95_hg38_paper/MatCov_Sample2_merged123_hg38_rmdup_1000reads_hg38_50kb_chunks.txt"
 # input$bam1 =  "test_set/HBCx_95_hg38_paper/HBCx_95_flagged_rmDup.bam"
 # input$bam2 =  "test_set/HBCx_95_hg38_paper/HBCx_95_flagged_rmDup.bam"
  • Like you, I'm not sure how to obtain the count matrices from the SRA files.

@Pacomito
Member

Pacomito commented Jun 2, 2020

Hello @plbaldoni & @caokai001,

To get count matrices from the raw data, you'll have to download the pipeline we developed: https://github.com/vallotlab/scChIPseq_DataEngineering/tree/devel (make sure to use the devel branch).
This is a slightly modified version of the pipeline used in Grosselin et al., but it will produce more accurate results (better duplicate removal).

You will have to manually change the paths to all the tools in the first lines of the 'CONFIG_TEMPLATE' file, at the root of the directory. (If you don't have bowtie1 or BWA, you can just omit them, as they are not used by default.)
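To spot the hardcoded paths that need updating, assuming they follow the same '/data/...' pattern as the examples below from 'species_design_configs.csv', something like this should list them with their line numbers:

 grep -n '/data/' CONFIG_TEMPLATE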

You will also have to modify the following in 'species_design_configs.csv', at the root of the directory (a sed sketch for these substitutions is given after this list):

  1. The paths to the STAR-2.6.0c mapping indexes (STAR is used for genome mapping by default), for human & mouse:
    '/data/annotations/pipelines/Human/hg38/indexes/STAR' -> '/path/to/hg38/STAR/'
    '/data/annotations/Mouse/mm10/complete/STAR_indexes' -> '/path/to/mm10/STAR/'

  2. The path to the Bowtie2 barcode indexes (Hifibio design, used for demultiplexing):
    '/data/users/pprompsy/Annotation/Barcodes_HiFiBio/index_barcode/bowtie_2_index_short/ref_index_' -> '/path/to/source/Barcodes/Barcodes_HiFiBio/index_barcode/bowtie_2_index_short/ref_index_'

  3. The paths to the blacklist regions for human & mouse:
    '/data/users/pprompsy/Annotation/bed/hg38-blacklist.v2.bed' -> '/path/to/source/BED/hg38-blacklist.v2.bed'
    '/data/users/pprompsy/Annotation/bed/mm10.blacklist.merged.bed' -> '/path/to/source/BED/mm10.blacklist.merged.bed'

where /path/to/source is the root of the GitHub directory.
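As an illustration (not taken from the pipeline documentation), the hg38 substitutions above could be applied with GNU sed; adjust the replacement paths to your own setup before running:

 sed -i 's|/data/annotations/pipelines/Human/hg38/indexes/STAR|/path/to/hg38/STAR/|g' species_design_configs.csv
 sed -i 's|/data/users/pprompsy/Annotation/bed/hg38-blacklist.v2.bed|/path/to/source/BED/hg38-blacklist.v2.bed|g' species_design_configs.csv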

For all the samples described in the paper, the barcode design is 'Hifibio'.
So, for instance, for the HBCx95_hg38 H3K27me3 sample you would run:

ASSEMBLY=hg38
OUTPUT_CONFIG=~/CONFIG_Hifibio
MARK=h3k27me3

./schip_processing.sh GetConf --template CONFIG_TEMPLATE --configFile species_design_configs.csv --designType Hifibio --genomeAssembly ${ASSEMBLY} --outputConfig ${OUTPUT_CONFIG} --mark ${MARK}

You might want to omit the coverage step, which produces bigwigs but needs a bioconda environment whose path is hardcoded in the pipeline.

You would run, after downloading the fastq files from SRA:

OUTPUT_DIR=/path/to/output/
NAME="HBCx95_hg38_k27"
READ1=/data/users/pprompsy/tests/[fastq_downloaded_from_SRA].R1.fastq.gz
READ2=/data/users/pprompsy/tests/[fastq_downloaded_from_SRA].R2.fastq.gz

./schip_processing.sh Barcoding+Trimming+Mapping+Filtering+Counting -f ${READ1} -r ${READ2} -c ${OUTPUT_CONFIG}  -o ${OUTPUT_DIR} --name ${NAME}

You need to have at least 40-60 GB of RAM and 8 cores available. This will produce BAM files as well as count matrices in the output directory, which you can input into this R downstream analysis script.
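Regarding the "downloading the fastq files from SRA" step mentioned above, a rough sketch using the SRA Toolkit (the SRR accession is a placeholder; check the run table of PRJNA481734 for the actual IDs):

 prefetch SRRXXXXXXX
 fasterq-dump --split-files SRRXXXXXXX -O fastq/
 gzip fastq/SRRXXXXXXX_1.fastq fastq/SRRXXXXXXX_2.fastq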

If you have issues with the pipelines, don't hesitate to post on the respective page.

Thank you for noticing that the bed file is corrupted; I will correct it.
Pacome
