tdayris/fair_genome_indexer

Snakemake workflow used to deploy genome sequences and build basic indexes for them.

This is done for teaching purposes, as an example of FAIR principles applied with Snakemake.

Usage

The usage of this workflow is described in the Snakemake workflow catalog. It is also available locally on a single page.

Results

The expected results of this pipeline are described here.

Material and methods

The tools used in this pipeline are described here textually.

Step by step

Get DNA sequences

| Step | Commands |
| --- | --- |
| Download DNA Fasta from Ensembl | ensembl-sequence |
| Remove non-canonical chromosomes | pyfaidx |
| Index DNA sequence | samtools |
| Create sequence dictionary | picard |

┌────────────────────────────────────────┐                                     
│Download Ensembl Sequence (wget + gzip) │                                     
└──────────────────┬─────────────────────┘                                     
                   │                                                           
                   │                                                           
┌──────────────────▼────────────────────────┐                                  
│Remove non-canonical chromosomes (pyfaidx) │                                  
└──────────────────┬──────────────────────┬─┘                                  
                   │                      │                                    
                   │                      │                                    
┌──────────────────▼──────────┐         ┌─▼───────────────────────────────────┐
│Index DNA Sequence (samtools)│         │Create sequence dictionary (Picard)  │
└─────────────────────────────┘         └─────────────────────────────────────┘
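
The filtering step above can be illustrated with a minimal pure-Python sketch. It is not the pyfaidx-based rule used by the workflow; it only shows the idea of keeping records whose sequence name belongs to an allowed set. The `CANONICAL` set and the `filter_fasta` helper are assumptions made for this example, and the workflow may define canonical chromosomes differently.

```python
import io

# Assumption: "canonical" means human chromosomes 1-22, X, Y and MT,
# named as in Ensembl. The workflow itself may use a different list.
CANONICAL = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def filter_fasta(lines, keep):
    """Yield only the FASTA lines that belong to sequences named in `keep`."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the header text up to the first whitespace.
            fields = line[1:].split()
            keeping = bool(fields) and fields[0] in keep
        if keeping:
            yield line


# Tiny made-up input: two canonical sequences and one unplaced scaffold.
example = io.StringIO(
    ">1 dna:chromosome\nACGT\nAC\n"
    ">KI270728.1 dna:scaffold\nTTTT\n"
    ">MT dna:chromosome\nGGCC\n"
)
filtered = "".join(filter_fasta(example, CANONICAL))
print(filtered)
```

Streaming line by line keeps memory usage flat, which matters for genome-sized files; an indexed reader such as pyfaidx is a better fit when random access is needed.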

Get genome annotation (GTF)

| Step | Commands |
| --- | --- |
| Download GTF annotation | ensembl-annotation |
| Fix format errors | Agat |
| Remove non-canonical chromosomes, based on above DNA Fasta | Agat |
| Remove <NA> Transcript support levels | Agat |
| Convert GTF to GenePred format | gtf2genepred |

┌─────────────────────────────────────────┐                                                   
│Download Ensembl Annotation (wget + gzip)│                                                   
└─────────────┬───────────────────────────┘                                                   
              │                                                                               
              │                                                                               
┌─────────────▼─────────┐                                                                     
│Fix format Error (Agat)│                                                                     
└─────────────┬─────────┘                                                                     
              │                                                                               
              │                                                                               
┌─────────────▼─────────────────────────┐           ┌────────────────────────────────────────┐
│Remove non-canonical chromosomes (Agat)◄───────────┤Fasta sequence index (see Get DNA Fasta)│
└─────────────┬─────────────────────────┘           └────────────────────────────────────────┘
              │                                                                               
              │                                                                               
┌─────────────▼───────────────────────┐                                                       
│Remove <NA> transcript levels (Agat) │                                                       
└─────────────┬───────────────────────┘                                                       
              │                                                                               
              │                                                                               
┌─────────────▼────────────────┐                                                              
│Convert GTF to GenePred (UCSC)│                                                              
└──────────────────────────────┘                                                              
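
The link between the Fasta index and the annotation filter can be sketched in plain Python: read the sequence names from the first column of a `.fai` index, then keep only the GTF lines located on those sequences. This is an illustration of the idea, not the Agat command the workflow runs; `contigs_from_fai` and `filter_gtf` are hypothetical helper names and the records are made up.

```python
def contigs_from_fai(fai_lines):
    """Collect sequence names from the first column of a .fai index."""
    return {line.split("\t", 1)[0] for line in fai_lines if line.strip()}


def filter_gtf(gtf_lines, contigs):
    """Yield comment lines and the GTF records located on known contigs."""
    for line in gtf_lines:
        if line.startswith("#"):
            yield line  # keep header and comment lines untouched
            continue
        seqname = line.split("\t", 1)[0]
        if seqname in contigs:
            yield line


# Made-up index and annotation records, for illustration only.
fai = ["1\t248956422\t56\t60\t61\n", "MT\t16569\t3151738873\t60\t61\n"]
gtf = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n',
    'KI270728.1\tensembl\tgene\t1\t100\t.\t-\t.\tgene_id "ENSG00000277400";\n',
]
kept = list(filter_gtf(gtf, contigs_from_fai(fai)))
print("".join(kept))
```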

Get transcripts sequence

| Step | Commands |
| --- | --- |
| Extract transcript sequences from above DNA Fasta and GTF | gffread |
| Index DNA sequence | samtools |
| Create sequence dictionary | picard |

┌───────────────────────────────┐       ┌─────────────────────────────┐       
│GTF (see get genome annotation)│       │DNA Fasta (See get dna fasta)│       
└────────────────────┬──────────┘       └────────┬────────────────────┘       
                     │                           │                            
                     │                           │                            
              ┌──────▼───────────────────────────▼─────┐                      
              │Extract transcripts sequences (gffread) │                      
              └──────┬───────────────────────────┬─────┘                      
                     │                           │                            
                     │                           │                            
┌────────────────────▼────┐             ┌────────▼───────────────────────────┐
│Index sequence (samtools)│             │Create sequence dictionary (Picard) │
└─────────────────────────┘             └────────────────────────────────────┘
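
To make the indexing step more concrete, here is a small sketch of what a `.fai` index records for each sequence: name, length, byte offset of the first base, bases per line, and bytes per line. It is a simplified reimplementation for teaching, not a replacement for `samtools faidx`. The `build_fai` helper is hypothetical and assumes a well-formed Fasta in which every line of a sequence, except possibly the last one, has the same width.

```python
def build_fai(fasta: bytes):
    """Return (name, length, offset, linebases, linewidth) per sequence."""
    records = []
    offset = 0  # byte offset of the current line within the file
    name = None
    length = seq_offset = linebases = linewidth = 0
    for line in fasta.splitlines(keepends=True):
        if line.startswith(b">"):
            if name is not None:
                records.append((name, length, seq_offset, linebases, linewidth))
            name = line[1:].split()[0].decode()
            length = linebases = linewidth = 0
            seq_offset = offset + len(line)  # first base follows the header
        elif name is not None:
            bases = len(line.rstrip(b"\r\n"))
            if linebases == 0:
                # The first sequence line defines the line layout.
                linebases, linewidth = bases, len(line)
            length += bases
        offset += len(line)
    if name is not None:
        records.append((name, length, seq_offset, linebases, linewidth))
    return records


# Made-up two-sequence Fasta, for illustration only.
index = build_fai(b">1 desc\nACGT\nAC\n>MT\nGGCC\n")
for record in index:
    print("\t".join(str(field) for field in record))
```

With these five fields, a reader can compute the byte position of any base without scanning the file, which is what makes random access to large genomes fast.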

Get cDNA sequences

| Step | Commands |
| --- | --- |
| Extract coding transcripts from above GTF | Agat |
| Extract coding sequences from above DNA Fasta and GTF | gffread |
| Index DNA sequence | samtools |
| Create sequence dictionary | picard |

┌───────────────────────────────┐       ┌─────────────────────────────┐       
│GTF (see get genome annotation)│       │DNA Fasta (See get dna fasta)│       
└────────────────────┬──────────┘       └────────┬────────────────────┘       
                     │                           │                            
                     │                           │                            
              ┌──────▼───────────────────────────▼─────┐                      
              │Extract cDNA sequences (gffread)        │                      
              └──────┬───────────────────────────┬─────┘                      
                     │                           │                            
                     │                           │                            
┌────────────────────▼────┐             ┌────────▼───────────────────────────┐
│Index sequence (samtools)│             │Create sequence dictionary (Picard) │
└─────────────────────────┘             └────────────────────────────────────┘

Get dbSNP variants

| Step | Commands |
| --- | --- |
| Download dbSNP variants | ensembl-variation |
| Filter non-canonical chromosomes | pyfaidx + BCFTools |
| Index variants | tabix |

┌──────────────────────────────────────────┐            
│Download dbSNP variants (wget + bcftools) │            
└──────────┬───────────────────────────────┘            
           │                                            
           │                                            
┌──────────▼───────────────────────────────────────────┐
│Remove non-canonical chromosomes (bcftools + bedtools)│
└──────────┬───────────────────────────────────────────┘
           │                                            
           │                                            
┌──────────▼─────────────┐                              
│Index variants (tabix)  │                              
└────────────────────────┘                              

Get transcript_id, gene_id, and gene_name correspondence

| Step | Commands |
| --- | --- |
| Extract gene_id <-> gene_name correspondence | pyroe |
| Extract transcript_id <-> gene_id <-> gene_name | Agat + XSV |

┌────────────────────────────────┐                                                                 
│Genome annotation (see get GTF) ├──────────────────┐                                              
└──────┬─────────────────────────┘                  │                                              
       │                                            │                                              
       │                                            │                                              
┌──────▼──────────────────────────────┐    ┌────────▼─────────────────────────────────────────────┐
│Extract gene_id <-> gene_name (pyroe)│    │Extract gene_id <-> gene_name <-> transcript_id (Agat)│
└──────┬──────────────────────────────┘    └────────┬─────────────────────────────────────────────┘
       │                                            │                                              
       │                                            │                                              
┌──────▼─────┐                             ┌────────▼────┐                                         
│Format (XSV)│                             │Format (XSV) │                                         
└────────────┘                             └─────────────┘                                         
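
The identifier table produced here can be approximated with a short Python sketch that reads the attribute column of `transcript` records in a GTF. It stands in for the pyroe, Agat and XSV steps only as an illustration; `parse_attributes` and `tx2gene` are hypothetical names, and the regular expression assumes Ensembl-style `key "value";` attributes.

```python
import re

# Assumption: attributes look like `key "value";`, as in Ensembl GTF files.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict (last value wins for repeats)."""
    return dict(ATTRIBUTE.findall(field))


def tx2gene(gtf_lines):
    """Yield unique (transcript_id, gene_id, gene_name) tuples."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "transcript":
            continue
        attrs = parse_attributes(columns[8])
        row = (
            attrs.get("transcript_id", ""),
            attrs.get("gene_id", ""),
            attrs.get("gene_name", ""),
        )
        if row not in seen:
            seen.add(row)
            yield row


# Made-up records, for illustration only.
gtf = [
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_name "DDX11L1";\n',
    '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; '
    'transcript_id "ENST00000456328"; gene_name "DDX11L1";\n',
]
rows = list(tx2gene(gtf))
for row in rows:
    print("\t".join(row))
```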

Get blacklisted regions

| Step | Commands |
| --- | --- |
| Download blacklisted regions | Github source |
| Merge overlapping intervals | bedtools |

┌────────────────────────────────┐       
│Download known blacklists (wget)│       
└────────────┬───────────────────┘       
             │                           
             │                           
┌────────────▼──────────────────────────┐
│Merge overlapping intervals (bedtools) │
└───────────────────────────────────────┘
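
Merging overlapping intervals is a classic sort-and-sweep operation, sketched below in Python. This is an illustration of the technique, not the bedtools call used by the workflow; `merge_intervals` is a hypothetical helper, and it treats touching (book-ended) intervals as mergeable, which is an assumption about the desired behaviour.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    # Sorting by chromosome then start lets a single pass do the merge.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            previous = merged[-1]
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up blacklist intervals, for illustration only.
blacklist = [
    ("1", 100, 200),
    ("1", 150, 300),
    ("1", 300, 400),
    ("2", 50, 60),
    ("1", 500, 600),
]
result = merge_intervals(blacklist)
print(result)
```

Chromosome names are sorted as strings here, so "10" comes before "2"; that is fine for merging but may differ from the ordering of the reference genome.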

GenePred format

| Step | Commands |
| --- | --- |
| GTF to GenePred | UCSC-tools |

┌────────────────────────────────┐       
│Genome annotation (see get GTF) │       
└────────────┬───────────────────┘       
             │                           
             │                           
┌────────────▼──────────────┐
│GTFtoGenePred (UCSC-tools) │
└───────────────────────────┘