Building a Custom Reference for Cell Ranger (mkref): Adding a Marker Gene to the FASTA and GTF

1. **Prepare the Marker Gene Sequence:**

Create a new FASTA file (e.g., marker_gene.fa) containing the sequence of the marker gene. The sequence header should follow the convention used in the main reference FASTA file, and the name after ">" (here, GFP) must match the first column of the GTF entry created in step 3.

Example marker_gene.fa:

>GFP
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTT
GTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACAT
ACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCAC
TACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAG
AGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACAC
GTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGA
AGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGAC
AAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAG
CAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTC
CACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACA
GCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAG
TGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTAT
AGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCT
ACTGGAGTGTAT

2. **Count the Total Number of Bases in the Sequence:**

Use the following command to count the total number of bases in the marker gene sequence:

grep -v "^>" marker_gene.fa | tr -d "\n" | wc -c

The sequence contains 922 bases. This length is used as the exon end coordinate in the GTF entry created in the next step.
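As a minimal sketch, the count can be captured in a shell variable and the sequence checked for unexpected characters at the same time. The file toy_marker.fa and the variable names are hypothetical stand-ins for illustration; substitute marker_gene.fa in practice.

```shell
# Scratch directory with a toy FASTA (hypothetical stand-in for marker_gene.fa).
workdir=$(mktemp -d)
printf '>GFP\nACGTACGTAC\nGGTTAA\n' > "$workdir/toy_marker.fa"

# Count bases: drop header lines, strip all whitespace, count remaining characters.
# The final tr removes the padding that some wc implementations add.
seq_len=$(grep -v '^>' "$workdir/toy_marker.fa" | tr -d '[:space:]' | wc -c | tr -d ' ')

# Count characters other than A, C, G, T, N (either case); 0 means none were found.
bad_chars=$(grep -v '^>' "$workdir/toy_marker.fa" | tr -d '[:space:]' | tr -d 'ACGTNacgtn' | wc -c | tr -d ' ')

echo "length=$seq_len unexpected_characters=$bad_chars"
```

Stripping all whitespace rather than only "\n" also guards against stray carriage returns if the file was edited on Windows.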

3. **Create a Custom GTF for the Marker Gene:**

Write a custom GTF entry for the marker gene:

echo -e 'GFP\tunknown\texon\t1\t922\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";' > marker_gene.gtf

The marker_gene.gtf file should look like this:

GFP     unknown exon    1       922     .       +       .       gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
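The GTF line can also be generated from the FASTA itself, so that the sequence name and end coordinate cannot drift out of sync with the sequence. The following is a sketch under assumptions: the toy files and variable names are hypothetical, and printf is used in place of echo -e only because its handling of \t is consistent across shells.

```shell
# Scratch directory with a toy FASTA (hypothetical stand-in for marker_gene.fa).
workdir=$(mktemp -d)
printf '>GFP\nACGTACGTAC\nGGTTAA\n' > "$workdir/toy_marker.fa"

# Sequence name: first whitespace-delimited token of the header, without '>'.
seq_name=$(grep '^>' "$workdir/toy_marker.fa" | head -n 1 | sed 's/^>//' | awk '{print $1}')

# Sequence length: number of non-whitespace characters outside header lines.
seq_len=$(grep -v '^>' "$workdir/toy_marker.fa" | tr -d '[:space:]' | wc -c | tr -d ' ')

# Write a single exon record spanning the whole sequence, tab-separated.
printf '%s\tunknown\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s"; gene_name "%s"; gene_biotype "protein_coding";\n' \
  "$seq_name" "$seq_len" "$seq_name" "$seq_name" "$seq_name" > "$workdir/toy_marker.gtf"

# A GTF record has exactly 9 tab-separated columns; report what was written.
n_fields=$(awk -F '\t' '{print NF}' "$workdir/toy_marker.gtf")
echo "columns=$n_fields"
cat "$workdir/toy_marker.gtf"
```

If the column count is not 9, the most likely cause is that tabs were converted to spaces somewhere along the way.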

4. **Integrate GFP into Existing Reference Files:**

Copy the existing genome FASTA file to a new file:

cp path/to/refdata/fasta/genome.fa genome_marker_gene.fa

Append the marker gene sequence to the new genome FASTA file:

cat marker_gene.fa >> genome_marker_gene.fa

Verify the addition by checking for the marker gene entry in the file:

grep ">" genome_marker_gene.fa

The output looks similar to the following, with >GFP as the last header:

>1 dna:chromosome chromosome:GRCz11:1:1:59578282:1 REF
>10 dna:chromosome chromosome:GRCz11:10:1:45420867:1 REF
>11 dna:chromosome chromosome:GRCz11:11:1:45484837:1 REF
>12 dna:chromosome chromosome:GRCz11:12:1:49182954:1 REF
>13 dna:chromosome chromosome:GRCz11:13:1:52186027:1 REF
>14 dna:chromosome chromosome:GRCz11:14:1:52660232:1 REF
>15 dna:chromosome chromosome:GRCz11:15:1:48040578:1 REF
>16 dna:chromosome chromosome:GRCz11:16:1:55266484:1 REF
>17 dna:chromosome chromosome:GRCz11:17:1:53461100:1 REF
>18 dna:chromosome chromosome:GRCz11:18:1:51023478:1 REF
>19 dna:chromosome chromosome:GRCz11:19:1:48449771:1 REF
>2 dna:chromosome chromosome:GRCz11:2:1:59640629:1 REF
>20 dna:chromosome chromosome:GRCz11:20:1:55201332:1 REF
>21 dna:chromosome chromosome:GRCz11:21:1:45934066:1 REF
>22 dna:chromosome chromosome:GRCz11:22:1:39133080:1 REF
>23 dna:chromosome chromosome:GRCz11:23:1:46223584:1 REF
>24 dna:chromosome chromosome:GRCz11:24:1:42172926:1 REF
>25 dna:chromosome chromosome:GRCz11:25:1:37502051:1 REF
>3 dna:chromosome chromosome:GRCz11:3:1:62628489:1 REF
>4 dna:chromosome chromosome:GRCz11:4:1:78093715:1 REF
>5 dna:chromosome chromosome:GRCz11:5:1:72500376:1 REF
>6 dna:chromosome chromosome:GRCz11:6:1:60270059:1 REF
>7 dna:chromosome chromosome:GRCz11:7:1:74282399:1 REF
>8 dna:chromosome chromosome:GRCz11:8:1:54304671:1 REF
>9 dna:chromosome chromosome:GRCz11:9:1:56459846:1 REF
>MT dna:chromosome chromosome:GRCz11:MT:1:16596:1 REF
>KN149696.2 dna:scaffold scaffold:GRCz11:KN149696.2:1:368252:1 REF
>KN147651.2 dna:scaffold scaffold:GRCz11:KN147651.2:1:351968:1 REF
>KN149690.1 dna:scaffold scaffold:GRCz11:KN149690.1:1:343018:1 REF
>KN149686.1 dna:scaffold scaffold:GRCz11:KN149686.1:1:260365:1 REF
>KN147652.2 dna:scaffold scaffold:GRCz11:KN147652.2:1:252640:1 REF
>KN149688.2 dna:scaffold scaffold:GRCz11:KN149688.2:1:252035:1 REF
>KN149691.1 dna:scaffold scaffold:GRCz11:KN149691.1:1:233193:1 REF
...

>GFP
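One pitfall when appending with cat is a genome FASTA that does not end in a newline: the >GFP header would then be glued onto the last sequence line and would not be recognized as a new record. The sketch below guards against that on toy data; all file and variable names are hypothetical.

```shell
# Scratch directory with toy inputs (hypothetical stand-ins for the real files).
workdir=$(mktemp -d)
# The toy genome is deliberately written WITHOUT a trailing newline.
printf '>1\nACGT\n>2\nGGCC' > "$workdir/toy_genome.fa"
printf '>GFP\nACGTAC\n' > "$workdir/toy_marker.fa"

cp "$workdir/toy_genome.fa" "$workdir/toy_genome_marker.fa"

# Command substitution strips a trailing newline, so a non-empty result
# means the last byte of the file is something other than a newline.
if [ -n "$(tail -c 1 "$workdir/toy_genome_marker.fa")" ]; then
  echo >> "$workdir/toy_genome_marker.fa"
fi

cat "$workdir/toy_marker.fa" >> "$workdir/toy_genome_marker.fa"

# Count header lines that are exactly '>GFP'; exactly 1 is expected.
marker_headers=$(grep -c '^>GFP$' "$workdir/toy_genome_marker.fa")
total_headers=$(grep -c '^>' "$workdir/toy_genome_marker.fa")
echo "marker_headers=$marker_headers total_headers=$total_headers"
```

A count of 0 suggests the header was merged into a sequence line; a count above 1 suggests the marker was appended more than once.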

5. **Modify the GTF File:**

Copy the existing filtered GTF file to a new file:

cp path/to/refdata/genes/genes.gtf genes_marker_gene.gtf

Append the marker gene GTF entry to the new filtered GTF file:

cat marker_gene.gtf >> genes_marker_gene.gtf

Verify the addition by checking the end of the file:

tail genes_marker_gene.gtf

The output looks similar to the following, with the marker gene entry as the last line of the file:

MT  RefSeq  start_codon 15308   15310   .   +   0   gene_id "ENSDARG00000063924"; gene_version "3"; transcript_id "ENSDART00000093625"; transcript_version "3"; exon_number "1"; gene_name "mt-cyb"; gene_source "RefSeq"; gene_biotype "protein_coding"; transcript_name "mt-cyb-201"; transcript_source "RefSeq"; transcript_biotype "protein_coding";
GFP unknown exon    1   922 .   +   .   gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
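Before building the reference, it can be worth confirming that every sequence name used in the first column of the GTF also appears as a header in the FASTA. The following sketch does this on toy data; the file names are hypothetical, and it assumes the FASTA sequence name is the first whitespace-delimited token of each header line.

```shell
# Scratch directory with toy inputs (hypothetical stand-ins for the real files).
workdir=$(mktemp -d)
printf '>1 dna:chromosome\nACGT\n>GFP\nACGTAC\n' > "$workdir/toy_genome_marker.fa"
printf '1\tensembl\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' > "$workdir/toy_genes_marker.gtf"
printf 'GFP\tunknown\texon\t1\t6\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP";\n' >> "$workdir/toy_genes_marker.gtf"

# Sequence names in the FASTA: first token of each header, without '>'.
grep '^>' "$workdir/toy_genome_marker.fa" | sed 's/^>//' | awk '{print $1}' | sort -u > "$workdir/fasta_names.txt"

# Sequence names used in the GTF (column 1), skipping comment lines.
grep -v '^#' "$workdir/toy_genes_marker.gtf" | cut -f1 | sort -u > "$workdir/gtf_names.txt"

# comm -13 prints lines that appear only in the second (GTF) list; none are expected.
missing=$(comm -13 "$workdir/fasta_names.txt" "$workdir/gtf_names.txt")
if [ -z "$missing" ]; then
  echo "all GTF sequence names are present in the FASTA"
else
  echo "missing from FASTA: $missing"
fi
```

Any name reported as missing points to a mismatch between the GTF and the FASTA, such as a header spelled differently from the GTF's first column.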


6. **Build a Custom Reference with Cell Ranger:**

Use the cellranger mkref command with the new genome FASTA and GTF files to create a custom reference directory:

cellranger mkref --genome=markergene_custom_genome \
                 --fasta=genome_marker_gene.fa \
                 --genes=genes_marker_gene.gtf

This command creates a custom reference directory named markergene_custom_genome/.

Tips:

1. Update cellranger.gex.sh so that it points to the correct reference directory.

2. After updating cellranger.gex.sh, update run.gex.sh so that it points to the correct "script" directory.

3. More details: https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/cr-tutorial-mr