multi_rna.md

Usage

  1. Make an RNA genomeDir

Homo sapiens

mkdir hs_ensembl_99
cd hs_ensembl_99

wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz

gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.99.gtf.gz

conda activate celescope
celescope utils mkgtf Homo_sapiens.GRCh38.99.gtf Homo_sapiens.GRCh38.99.filtered.gtf
celescope rna mkref \
 --genome_name Homo_sapiens_ensembl_99_filtered \
 --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa \
 --gtf Homo_sapiens.GRCh38.99.filtered.gtf \
 --mt_gene_list mt_gene_list.txt

Mus musculus

mkdir mmu_ensembl_99
cd mmu_ensembl_99

wget ftp://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-99/gtf/mus_musculus/Mus_musculus.GRCm38.99.gtf.gz

gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz 
gunzip Mus_musculus.GRCm38.99.gtf.gz

conda activate celescope
celescope utils mkgtf Mus_musculus.GRCm38.99.gtf Mus_musculus.GRCm38.99.filtered.gtf

celescope rna mkref \
 --genome_name Mus_musculus_ensembl_99_filtered \
 --fasta Mus_musculus.GRCm38.dna.primary_assembly.fa \
 --gtf Mus_musculus.GRCm38.99.filtered.gtf \
 --mt_gene_list mt_gene_list.txt
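
Both mkref commands above read mt_gene_list.txt, but the steps do not show how that file is produced. The following is a minimal sketch of one way to build it from the GTF. It assumes the Ensembl conventions (mitochondrial chromosome named MT, a gene_name attribute in column 9) and assumes that --mt_gene_list expects one gene name per line; check celescope rna mkref -h for the format your version requires. The helper name extract_mt_genes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the unique gene names annotated on the
# mitochondrial chromosome of an Ensembl GTF (seqname "MT", feature "gene").
extract_mt_genes() {
    awk -F '\t' '$1 == "MT" && $3 == "gene" {
        if (match($9, /gene_name "[^"]+"/)) {
            # skip the 11-character attribute key and opening quote,
            # and drop the closing quote
            print substr($9, RSTART + 11, RLENGTH - 12)
        }
    }' "$1" | sort -u
}

# Usage (file name taken from the Homo sapiens example above):
# extract_mt_genes Homo_sapiens.GRCh38.99.filtered.gtf > mt_gene_list.txt
```

Running it on the filtered GTF rather than the original keeps the list consistent with the genes that end up in the reference.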

  2. Generate scripts for each sample

Under your working directory, write a shell script run.sh as

multi_rna \
	--mapfile ./rna.mapfile \
	--genomeDir {path to hs_ensembl_99 or mmu_ensembl_99} \
	--thread 8 \
	--mod shell

--mapfile Required. The mapfile is a tab-delimited text file with at least three columns. Each line of the mapfile describes one pair of paired-end fastq files.

1st column: Fastq file prefix.
2nd column: Fastq file directory path.
3rd column: Sample name, which is the prefix of all output files.
4th column: Optional. Its meaning differs between assays. For rna, it is the forced cell number. For other assays, see here.
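
Because the columns must be separated by real tab characters, it can be safer to generate the mapfile than to type it by hand. The sketch below writes a two-row mapfile in which only the second sample sets the optional 4th column. The function name write_example_mapfile, the prefixes, directories, sample names, and the cell number 3000 are all hypothetical, and it assumes rows with and without the 4th column can coexist in one file.

```shell
#!/bin/sh
# Hypothetical helper: write a small tab-delimited mapfile to the given path.
# printf '\t' guarantees real tabs, which editors often replace with spaces.
write_example_mapfile() {
    out=$1
    {
        # prefix, directory, sample name
        printf '%s\t%s\t%s\n'     fastq_prefix1 fastq_dir1 sample1
        # same three columns plus a forced cell number (illustrative value)
        printf '%s\t%s\t%s\t%s\n' fastq_prefix3 fastq_dir1 sample2 3000
    } > "$out"
}

# Usage:
# write_example_mapfile ./rna.mapfile
```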

Example

Sample1 has 2 pairs of paired-end fastq files located in 2 different directories (fastq_dir1 and fastq_dir2). Sample2 has 1 pair of paired-end fastq files located in fastq_dir1.

$ cat ./my.mapfile
fastq_prefix1	fastq_dir1	sample1
fastq_prefix2	fastq_dir2	sample1
fastq_prefix3	fastq_dir1	sample2

$ ls fastq_dir1
fastq_prefix1_1.fq.gz	fastq_prefix1_2.fq.gz
fastq_prefix3_1.fq.gz	fastq_prefix3_2.fq.gz

$ ls fastq_dir2
fastq_prefix2_1.fq.gz	fastq_prefix2_2.fq.gz
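
A mapfile like the one above can be checked before any scripts are generated. This is a minimal sketch, not part of CeleScope: the function name check_mapfile is hypothetical, and it assumes the read files are named {prefix}_1.fq.gz and {prefix}_2.fq.gz as in the listing above (CeleScope may accept other suffixes). Directory paths are resolved relative to the current directory.

```shell
#!/bin/sh
# Hypothetical helper: report mapfile rows with fewer than 3 tab-separated
# columns or without matching read files. Returns non-zero on any problem.
check_mapfile() {
    mapfile_path=$1
    status=0
    tab=$(printf '\t')
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS="$tab" read -r prefix dir sample rest || [ -n "$prefix" ]; do
        [ -z "$prefix" ] && continue            # skip blank lines
        if [ -z "$dir" ] || [ -z "$sample" ]; then
            echo "fewer than 3 columns: $prefix" >&2
            status=1
            continue
        fi
        for mate in 1 2; do
            # File naming assumed from the example listing above.
            if [ ! -e "$dir/${prefix}_${mate}.fq.gz" ]; then
                echo "missing: $dir/${prefix}_${mate}.fq.gz (sample $sample)" >&2
                status=1
            fi
        done
    done < "$mapfile_path"
    return "$status"
}

# Usage:
# check_mapfile ./my.mapfile && echo "mapfile looks consistent"
```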

--genomeDir Required. The path of the genome directory after running celescope rna mkref.

--thread Threads to use. The recommended setting is 8, and the maximum should not exceed 20.

--mod Which kind of scripts to create: sjm (simple job manager, https://github.com/StanfordBioinformatics/SJM) or shell.

After you run sh run.sh, a shell directory containing one {sample}.sh file per sample will be generated.

  3. Start the analysis by running:
sh ./shell/{sample}.sh

Note that ./shell/{sample}.sh must be run from the working directory, not from inside the shell directory.
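
When there are several samples, the generated scripts can be run one after another from the working directory. The sketch below is one way to do that and is not part of CeleScope. The function name run_all_samples is hypothetical, and it assumes the scripts are independent of each other and that stopping at the first failure is the desired behavior.

```shell
#!/bin/sh
# Hypothetical helper: run every generated sample script from WORKDIR
# (default: current directory), stopping at the first failure.
run_all_samples() {
    workdir=${1:-.}
    if [ ! -d "$workdir/shell" ]; then
        echo "no shell directory under $workdir; run sh run.sh there first" >&2
        return 1
    fi
    (
        # Run in a subshell so the caller's directory is left unchanged.
        cd "$workdir" || exit 1
        for script in ./shell/*.sh; do
            [ -e "$script" ] || continue       # glob matched nothing
            echo "running $script" >&2
            sh "$script" || exit 1
        done
    )
}

# Usage, from anywhere:
# run_all_samples /path/to/working_directory
```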

Main output

  • outs/{sample}_Aligned.sortedByCoord.out.bam This BAM file contains coordinate-sorted reads aligned to the genome.
  • outs/raw Gene expression matrix containing all barcodes (background + cell) from the barcode whitelist.
  • outs/filtered Gene expression matrix containing only cell barcodes.

Seurat CreateSeuratObject

When using Seurat's CreateSeuratObject, the default names.delim is an underscore ("_"). Because the cell barcode is itself separated by underscores (for example, ATCGATCGA_ATCGATCGA_ATCGATCGA), leaving names.delim = "_" will incorrectly set orig.ident to a segment of the barcode (the first segment, given the default names.field = 1) instead of the project name. This problem can be avoided by setting names.delim to a character that does not occur in the barcode, such as names.delim = "-".

seurat.object = CreateSeuratObject(matrix, names.delim="-", project="sample_name")