- Make an RNA genomeDir
mkdir hs_ensembl_99
cd hs_ensembl_99
wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.99.gtf.gz
conda activate celescope
celescope utils mkgtf Homo_sapiens.GRCh38.99.gtf Homo_sapiens.GRCh38.99.filtered.gtf
celescope rna mkref \
--genome_name Homo_sapiens_ensembl_99_filtered \
--fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--gtf Homo_sapiens.GRCh38.99.filtered.gtf \
--mt_gene_list mt_gene_list.txt
mkdir mmu_ensembl_99
cd mmu_ensembl_99
wget ftp://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-99/gtf/mus_musculus/Mus_musculus.GRCm38.99.gtf.gz
gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm38.99.gtf.gz
conda activate celescope
celescope utils mkgtf Mus_musculus.GRCm38.99.gtf Mus_musculus.GRCm38.99.filtered.gtf
celescope rna mkref \
--genome_name Mus_musculus_ensembl_99_filtered \
--fasta Mus_musculus.GRCm38.dna.primary_assembly.fa \
--gtf Mus_musculus.GRCm38.99.filtered.gtf \
--mt_gene_list mt_gene_list.txt
- Generate scripts for each sample
Under your working directory, write a shell script, run.sh, as follows:
multi_rna \
--mapfile ./rna.mapfile \
--genomeDir {path to hs_ensembl_99 or mmu_ensembl_99} \
--thread 8 \
--mod shell
--mapfile
Required. The mapfile is a tab-delimited text file with at least three columns. Each line of the mapfile represents one pair of paired-end fastq files.
1st column: Fastq file prefix.
2nd column: Fastq file directory path.
3rd column: Sample name, which is the prefix of all output files.
4th column: Optional; its meaning differs by assay. For rna, it is the forced cell number. For other assays, see the documentation for that assay.
Example
Sample1 has 2 pairs of fastq files located in 2 different directories (fastq_dir1 and fastq_dir2). Sample2 has 1 pair of fastq files located in fastq_dir1.
$cat ./my.mapfile
fastq_prefix1 fastq_dir1 sample1
fastq_prefix2 fastq_dir2 sample1
fastq_prefix3 fastq_dir1 sample2
$ls fastq_dir1
fastq_prefix1_1.fq.gz fastq_prefix1_2.fq.gz
fastq_prefix3_1.fq.gz fastq_prefix3_2.fq.gz
$ls fastq_dir2
fastq_prefix2_1.fq.gz fastq_prefix2_2.fq.gz
--genomeDir
Required. The path of the genome directory created by running celescope rna mkref.
--thread
Threads to use. The recommended setting is 8, and the maximum should not exceed 20.
--mod
Create sjm (simple job manager, https://github.com/StanfordBioinformatics/SJM) or shell scripts.
After you run sh run.sh, a shell directory containing {sample}.sh files will be generated.
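Before launching run.sh, it can help to confirm that the mapfile matches the layout described above. The following is a minimal sketch, not part of CeleScope: `check_mapfile` is a hypothetical helper, and it assumes the `_1.fq.gz` / `_2.fq.gz` suffixes shown in the example listing. Adjust the suffixes if your fastq files are named differently.

```shell
#!/bin/sh
# Hypothetical helper (not a CeleScope command): sanity-check a mapfile.
# Assumes fastq pairs are named {prefix}_1.fq.gz and {prefix}_2.fq.gz,
# as in the example listing above.
check_mapfile() {
    mapfile_path=$1
    status=0
    line_no=0
    tab=$(printf '\t')
    # Split each row on tabs; any 4th column and beyond lands in "rest".
    while IFS="$tab" read -r prefix fastq_dir sample rest || [ -n "$prefix" ]; do
        line_no=$((line_no + 1))
        # Skip blank lines.
        [ -z "$prefix$fastq_dir$sample" ] && continue
        # Each row needs at least three tab-separated columns.
        if [ -z "$prefix" ] || [ -z "$fastq_dir" ] || [ -z "$sample" ]; then
            echo "line $line_no: fewer than 3 columns"
            status=1
            continue
        fi
        # Both reads of the pair must exist in the given directory.
        for read_no in 1 2; do
            fastq="$fastq_dir/${prefix}_${read_no}.fq.gz"
            if [ ! -f "$fastq" ]; then
                echo "line $line_no: missing $fastq"
                status=1
            fi
        done
    done < "$mapfile_path"
    return $status
}
```

For example, `check_mapfile ./rna.mapfile && sh run.sh` would only generate the per-sample scripts if every row passes the check.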
- Start the analysis by running:
sh ./shell/{sample}.sh
Note that ./shell/{sample}.sh must be run from the working directory. Do not run it from inside the shell directory.
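If scripts were generated for several samples, they can be run one after another from the working directory. This is a minimal sketch rather than a CeleScope feature: `run_all_samples` is a hypothetical helper name, and it simply stops at the first script that fails.

```shell
#!/bin/sh
# Hypothetical helper (not a CeleScope command): run every generated
# ./shell/{sample}.sh in turn, always from the working directory.
run_all_samples() {
    workdir=${1:-.}
    (
        # Run in a subshell so the caller's current directory is unchanged.
        cd "$workdir" || exit 1
        for script in ./shell/*.sh; do
            # If the glob matched nothing, the pattern is left as-is.
            [ -f "$script" ] || { echo "no scripts found in $workdir/shell"; exit 1; }
            echo "running $script"
            sh "$script" || { echo "failed: $script"; exit 1; }
        done
    )
}
```

Calling `run_all_samples` with no argument uses the current directory, which should be the directory where run.sh was executed.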
outs/{sample}_Aligned.sortedByCoord.out.bam
This bam file contains coordinate-sorted reads aligned to the genome.
outs/raw
Gene expression matrix file containing all barcodes (background + cell) from the barcode whitelist.
outs/filtered
Gene expression matrix file containing only cell barcodes.
When using Seurat's CreateSeuratObject, the default names.delim is an underscore (_). Since the cell barcode is separated by underscores (for example, ATCGATCGA_ATCGATCGA_ATCGATCGA), using names.delim = "_" will incorrectly set orig.ident to a segment of the barcode. This problem can be avoided by setting names.delim to another character, such as names.delim="-":
seurat.object = CreateSeuratObject(matrix, names.delim="-", project="sample_name")