-
Notifications
You must be signed in to change notification settings - Fork 16
Pipeline Processes
generate_reference_tables
Generates reference tables for downstream pipeline processes using the script
prepare_reference_tables.py
- gsheynkmanlab/generate-reference-tables
- gencode_gtf
- gencode_transcript_fasta
1
outdir/name/reference_tables
All input are mapped within the Nextflow process, generate_reference_tables to files from channels
.
- ch_gencode_gtf
- ch_gencode_transcript_fasta_uncompressed
- ch_ensg_gene
- ch_enst_isoname
- ch_gene_ensp
- ch_gene_isoname
- ch_isoname_lens
- ch_gene_lens
- ch_protein_coding_genes
prepare_reference_tables.py \
--gtf $gencode_gtf \
--fa $gencode_transcript_fasta \
--ensg_gene ensg_gene.tsv \
--enst_isoname enst_isoname.tsv \
--gene_ensp gene_ensp.tsv \
--gene_isoname gene_isoname.tsv \
--isoname_lens isoname_lens.tsv \
--gene_lens gene_lens.tsv \
--protein_coding_genes protein_coding_genes.txt
The ch_protein_coding_genes Nextflow channel
, gets split into multiple output channels for simultaneous consumption by two downstream processes
- ch_protein_coding_genes.into{
- ch_protein_coding_genes_gencode_fasta
- ch_protein_coding_genes_filter_sqanti
- }
The ch_ensg_gene Nextflow channel
, gets split into multiple output channels for simultaneous consumption by multiple downstream processes
- ch_ensg_gene.into{
- ch_ensg_gene_filter
- ch_ensg_gene_six_frame
- ch_ensg_gene_pclass
- }
The ch_gene_lens Nextflow channel
, gets split into multiple output channels for simultaneous consumption by two downstream processes
- ch_gene_lens.into{
- ch_gene_lens_transcriptome
- ch_gene_lens_aggregate
- }
The ch_gene_isoname Nextflow channel
, gets split into multiple output channels for simultaneous consumption by two downstream processes
- ch_gene_isoname.into{
- ch_gene_isoname_pep_viz
- ch_gene_isoname_pep_analysis
- }
make_gencode_database
Clusters same-protein GENCODE entries
- gencode_translation_fasta
1
outdir/name/gencode_db
- ch_gencode_translation_fasta_uncompressed
- ch_protein_coding_genes_gencode_fasta
- ch_gencode_protein_fasta
- gencode_isoname_clusters.csv
make_gencode_database.py \
--gencode_fasta $gencode_translation_fasta \
--protein_coding_genes $protein_coding_genes \
--output_fasta gencode_protein.fasta \
--output_cluster gencode_isoname_clusters.tsv
The ch_gencode_protein_fasta Nextflow channel
, gets split into multiple output channels for simultaneous consumption by multiple downstream processes
- ch_gencode_protein_fasta.into{
- ch_gencode_protein_fasta_metamorpheus
- ch_gencode_protein_fasta_mapping
- ch_gencode_protein_fasta_hybrid
- ch_gencode_protein_fasta_novel
- }
isoseq3
Runs IsoSeq3 on CCS reads, aligning to genome and collapsing redundant reads
STEPS
- ensure only qv10 reads from ccs are kept as input
- find and remove adapters/barcodes
- filter for non-concatamer, polya containing reads
- clustering of reads
- align reads to the genome
- collapse redundant reads
- sample_ccs
- genome_fasta
- primers_fasta
User provides as input --max_cpus which is then read by the Nextflow workflow as params.max_cpus
outdir/name/isoseq3
- ch_sample_ccs
- ch_genome_fasta_isoseq
- ch_primers_fasta
- file("${params.name}.collapsed.gff") into ch_isoseq_gtf
- file("${params.name}.collapsed.abundance.txt") into ch_fl_count
- file("${params.name}.collapsed.fasta")
- file("${params.name}.collapsed.report.json")
- file("${params.name}.demult.lima.summary")
- file("${params.name}.flnc.bam")
- file("${params.name}.flnc.bam.pbi")
- file("${params.name}.flnc.filter_summary.json")
Ensure that only qv10 reads from ccs are input
bamtools filter -tag 'rq':'>=0.90' -in $sample_ccs -out filtered.$sample_ccs
create an index for the ccs bam
pbindex filtered.$sample_ccs
Find and remove adapters/barcodes
lima --isoseq --dump-clips --peek-guess -j ${task.cpus} filtered.$sample_ccs $primers_fasta ${params.name}.demult.bam
Filter for non-concatamer, polya containing reads
isoseq3 refine --require-polya ${params.name}.demult.NEB_5p--NEB_3p.bam $primers_fasta ${params.name}.flnc.bam
Cluster reads, can only make faster by putting more cores on machine (cannot parallelize)
isoseq3 cluster ${params.name}.flnc.bam ${params.name}.clustered.bam --verbose --use-qvs
Align reads to the genome, takes few minutes (40 core machine)
pbmm2 align $genome_fasta ${params.name}.clustered.hq.bam ${params.name}.aligned.bam --preset ISOSEQ --sort -j ${task.cpus} --log-level INFO
Collapse redundant reads
isoseq3 collapse ${params.name}.aligned.bam ${params.name}.collapsed.gff
star_generate_genome
generate the genome index used in STAR. not run if star index directory is provided
outdir/name/star_index
star_alignment
Runs STAR alignment process
outdir/name/star
sqanti3
Corrects any errors in alignment from IsoSeq3 and classifies each accession in relation to the reference genome
Uses SQANTI v1.3
outdir/name/sqanti3
filter_sqanti
Filter SQANTI
- Filter SQANTI results based on several criteria
- protein coding only
PB transcript aligns to a GENCODE-annotated protein coding gene.
- percent A downstream
perc_A_downstreamTTS : percent of genomic "A"s in the downstream 20 bp window. If this number if high (> 80%), the 3' end have arisen from intra-priming during the RT step
- RTS stage
RTS_stage: TRUE if one of the junctions could be an RT template switching artifact.
- Structural Category
keep only transcripts that have a isoform structural category of:
- novel_not_in_catalog
- novel_in_catalog
- incomplete-splice_match
- full-splice_match
- protein coding only
- gsheynkmanlab/proteogenomics-base
Uses inputs from both sqanti3 and generate_reference_tables
Inputs from sqanti3 are these three files:
- name_from_sqanti3__classification.txt
- name_from_sqanti3__corrected.fasta
- name_from_sqanti3__corrected.gtf
Inputs from generate_reference_tables process (prepare_reference_tables.py)
- protein_coding_genes.txt
- ensg_gene.tsv
outdir/name/sqanti3-filtered
pacbio_6frm_gene_grouped
Generates a fasta file of all possible protein sequences derivable from each PacBio transcript, by translating the fasta file in all six frames (3+, 3-). This is used to examine what peptides could theoretically match the peptides found via a mass spectrometry search against GENCODE.
outdir/name/pacbio_6frm_gene_grouped
transcriptome_summary
Compares the abundance (CPM) based on long-read sequencing to the abundances (TPM) inferred from short-read sequencing, as computed by Kallisto (analyzed outside of this pipeline). Additionally produces a pacbio-gene reference table
outdir/name/transcriptome_summary
cpat
CPAT is a bioinformatics tool to predict an RNA’s coding probability based on the RNA sequence characteristics. To achieve this goal, CPAT calculates scores of sequence-based features from a set of known protein-coding genes and background set of non-coding genes.
Features
- ORF size
- ORF coverage
- Fickett score
- Hexamer usage bias
CPAT will then builds a logistic regression model using these 4 features as predictor variables and the “protein-coding status” as the response variable. After evaluating the performance and determining the probability cutoff, the model can be used to predict new RNA sequences.
- gsheynkmanlab/cpat:addr
Download from the Zenodo repository the required hexamer and Logit model required for the species. Unzip the compressed logit model.
wget https://zenodo.org/record/5076056/files/Human_Hexamer.tsv
wget https://zenodo.org/record/5076056/files/Human_logitModel.RData.gz
gunzip Human_logitModel.RData.gz
Also required is the output for the sample from the filter_sqanti.py step.
- Human_Hexamer.tsv
- Human_LogitModel.RData
- param_name_corrected.5degfilter.fasta*
outdir/name/cpat
orf_calling
Selects the most plausible ORF from each pacbio transcript, using the following information
- comparison of ATG start to reference (GENCODE)
selects ORF with ATG start matching the one in the reference, if it exists
- coding probability score from CPAT
- number of upstream ATGs for the candidate ORF
decrease score as number of upstream ATGs increases using sigmoid function
Additionally provides calling confidence of each ORF called
- Clear Best ORF : best score and fewest upstream ATGs of all called ORFs
- Plausible ORF : not clear best, but decent CPAT coding_score (>0.364)
- Low Quality ORF : low CPAT coding_score (<0.364)
outdir/name/orf_calling
refined_database
- Filteres ORF database to only include accessions with a CPAT coding score above a threshold (default 0.0)
- Filters ORFs to only include ORFs that have a stop codon
- Collapses transcripts that produce the same protein into one entry, keeping a base accession (first alphanumeric).
Abundances of transcripts (CPM) are collapsed during this process.
outdir/name/refined_database
pacbio_cds
derive a GTF file that includes the ORF regions (as CDS features)
outdir/name/pacbio_cds
rename_cds
Preprocessing step to SQANTI Protein CDS is renamed to exon and transcript stop and start locations are updated to reflect CDS start and stop
outdir/name/rename_cds
sqanti_protein
Classify protein splice sites and calculates additional statistics for start and stop of ORF
outdir/name/sqanti_protein
five_prime_utr
Intermediate step for protein classification
Determines the 5' UTR status of the protein in order to classify protein category in latter step
None
protein_classification
Classifies protein based on splicing and start site
main classifications are
- pFSM: full-protein-match
protein fully matches a gencode protein
- pISM: incomplete-protein-match
protein only partially matches gencode protein
considered an N- or C-terminus truncation artifact
- pNIC: novel-in-catelog
protein composed of known N-term, splicing, and/or C-term in new combinations
- pNNC: novel-not-in-catelog
protein composed of novel N-term, splicing, and/or C-terminus
outdir/name/protein_classification
protein_gene_rename
Mapings of PacBio transcripts/proteins to GENCODE genes. Some PacBio transcripts and the associated PacBio predicted protein can map two different genes. Some transcripts can also map to multiple genes.
outdir/name/protein_gene_rename
protein_filter
Filters out proteins that are:
- not pFSM, pNIC, pNNC
- are pISMs (either N-terminus or C-terminus truncations)
- pNNC with junctions after the stop codon (default 2)
outdir/name/protein_filter
hybrid_protein_database
Makes a hybrid database that is composed of high-confidence PacBio proteins and GENCODE proteins for genes that are not in the high-confidence space.
High-confidence is defined as genes in which the PacBio sampling is adequate (average transcript length 1-4kb)and a total of 3 CPM (counts per million) per gene
outdir/name/hybrid_protein_database
metamorpheus
MetaMorpheus is a bottom-up proteomics database search software with integrated post-translational modification (PTM) discovery capability. This program combines features of Morpheus and G-PTM-D in a single tool.
metamorpheus_with_gencode_database
Runs Metamorpheus MS search using the GENCODE database
outdir/name/metamorpheus/gencode
metamorpheus_with_uniprot_database
Runs Metamorpheus MS search using the UniProt database
outdir/name/metamorpheus/uniprot
pacbio
metamorpheus_with_sample_specific_database_refined
Runs Metamorpheus MS search using the PacBio refined database
outdir/name/metamorpheus/pacbio/refined
metamorpheus_with_sample_specific_database_filtered
Runs Metamorpheus MS search using the PacBio filtered database
outdir/name/metamorpheus/pacbio/filtered
metamorpheus_with_sample_specific_database_hybrid
Runs Metamorpheus MS search using the PacBio hybrid database
outdir/name/metamorpheus/pacbio/hybrid
metamorpheus_with_sample_specific_database_rescue_resolve
Runs Metamorpheus MS search using the hybrid database and the version of Metamorpheus containing the "Rescue and Resolve" algorithm. This algorithm allows for "Rescue" of protein isoforms that are normally discarded during protein parsimony, given that the protein isoform has strong evidence of transcriptional expression.
outdir/name/metamorpheus/pacbio/rescue_resolve
peptide_analysis
Generate a table comparing MS peptide results between the PacBio and GENCODE databases.
outdir/name/peptide_analysis
track_visualization
reference
Creates tracks to use in UCSC Genome Browser for GENCODE database.
outdir/name/track_visualization/reference
protein_track_visualization
Creates tracks to use in UCSC Genome Browser for refined, filtered, and hybrid PacBio databases.
Shades tracks based on abundance (CPMs) and protein classification.
outdir/name/track_visualization
make_multiregion
Makes multiregion BED file for UCSC genome browser for refined, filtered, and hybrid databases. Regions in the BED correpond to coding regions, thereby allowing for ntronic regions to be minimized for easier isoform viewing.
outdir/name/track_visualization
peptide_track_visualization
Makes peptide tracks for UCSC Genome Browser for refined, filtered and hybrid databases
outdir/name/track_visualization
accession_mapping
Maps protein entries within GENCODE, UniProt and PacBio databases to one another based on sequence similarity
outdir/name/accession_mapping
protein_group_compare
Determine the relationship between protein groups identified in Metamorpheus using GENCODE, UniProt, and/or PacBio databases
outdir/name/protein_group_compare
Sheynkman-Lab