Pipeline Processes

generate_reference_tables

Generate Reference Tables

Generates reference tables for downstream pipeline processes using the script prepare_reference_tables.py

container

gsheynkmanlab/generate-reference-tables

cpus

1

publishDir

outdir/name/reference_tables

Input

All inputs are mapped, within the Nextflow process generate_reference_tables, to files from channels.

ch_gencode_gtf
ch_gencode_transcript_fasta_uncompressed

Output

ch_ensg_gene
ch_enst_isoname
ch_gene_ensp
ch_gene_isoname
ch_isoname_lens
ch_gene_lens
ch_protein_coding_genes

Script

prepare_reference_tables.py \
  --gtf $gencode_gtf \
  --fa $gencode_transcript_fasta \
  --ensg_gene ensg_gene.tsv \
  --enst_isoname enst_isoname.tsv \
  --gene_ensp gene_ensp.tsv \
  --gene_isoname gene_isoname.tsv \
  --isoname_lens isoname_lens.tsv \
  --gene_lens gene_lens.tsv \
  --protein_coding_genes protein_coding_genes.txt
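
As an illustration of the kind of table this step produces, here is a minimal Python sketch that builds an ENSG-to-gene-name mapping from GENCODE GTF lines. It is not the code of prepare_reference_tables.py; the function name ensg_to_gene and the toy GTF lines are assumptions made for this example, and the actual column layout of ensg_gene.tsv may differ.

```python
import re

# Matches key "value" pairs in the GTF attribute column, e.g. gene_id "ENSG...";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def ensg_to_gene(gtf_lines):
    """Map each Ensembl gene ID to its gene name using 'gene' feature rows."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only 'gene' rows are needed for this table
        attributes = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
        if "gene_id" in attributes and "gene_name" in attributes:
            mapping[attributes["gene_id"]] = attributes["gene_name"]
    return mapping


# Toy GTF lines for illustration (hypothetical input, not pipeline data).
example_gtf = [
    "##description: toy example\n",
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5"; gene_type "pseudogene"; gene_name "DDX11L1";\n',
    'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";\n',
]
table = ensg_to_gene(example_gtf)
```

Restricting the scan to 'gene' rows keeps one entry per gene; the other tables (for example enst_isoname) would need the 'transcript' rows instead.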

ch_protein_coding_genes

The ch_protein_coding_genes Nextflow channel is split into multiple output channels for simultaneous consumption by two downstream processes.

ch_protein_coding_genes.into{
ch_protein_coding_genes_gencode_fasta
ch_protein_coding_genes_filter_sqanti
}

ch_ensg_gene

The ch_ensg_gene Nextflow channel is split into multiple output channels for simultaneous consumption by multiple downstream processes.

ch_ensg_gene.into{
ch_ensg_gene_filter
ch_ensg_gene_six_frame
ch_ensg_gene_pclass
}

ch_gene_lens

The ch_gene_lens Nextflow channel is split into multiple output channels for simultaneous consumption by two downstream processes.

ch_gene_lens.into{
ch_gene_lens_transcriptome
ch_gene_lens_aggregate
}

ch_gene_isoname

The ch_gene_isoname Nextflow channel is split into multiple output channels for simultaneous consumption by two downstream processes.

ch_gene_isoname.into{
ch_gene_isoname_pep_viz
ch_gene_isoname_pep_analysis
}

make_gencode_database

Make GENCODE database

Clusters same-protein GENCODE entries

cpus

1

publishDir

outdir/name/gencode_db

Input

ch_gencode_translation_fasta_uncompressed
ch_protein_coding_genes_gencode_fasta

Output

ch_gencode_protein_fasta
gencode_isoname_clusters.tsv

Script

make_gencode_database.py \
  --gencode_fasta $gencode_translation_fasta \
  --protein_coding_genes $protein_coding_genes \
  --output_fasta gencode_protein.fasta \
  --output_cluster gencode_isoname_clusters.tsv

ch_gencode_protein_fasta

The ch_gencode_protein_fasta Nextflow channel is split into multiple output channels for simultaneous consumption by multiple downstream processes.

ch_gencode_protein_fasta.into{
ch_gencode_protein_fasta_metamorpheus
ch_gencode_protein_fasta_mapping
ch_gencode_protein_fasta_hybrid
ch_gencode_protein_fasta_novel
}

isoseq3

ISOSEQ

Runs IsoSeq3 on CCS reads, aligning to genome and collapsing redundant reads

STEPS

- ensure only qv10 reads from CCS are kept as input
- find and remove adapters/barcodes
- filter for non-concatemer, polyA-containing reads
- cluster reads
- align reads to the genome
- collapse redundant reads

cpus

The user provides --max_cpus as input, which the Nextflow workflow reads as params.max_cpus.

publishDir

outdir/name/isoseq3

Input

ch_sample_ccs
ch_genome_fasta_isoseq
ch_primers_fasta

Output

file("${params.name}.collapsed.gff") into ch_isoseq_gtf
file("${params.name}.collapsed.abundance.txt") into ch_fl_count
file("${params.name}.collapsed.fasta")
file("${params.name}.collapsed.report.json")
file("${params.name}.demult.lima.summary")
file("${params.name}.flnc.bam")
file("${params.name}.flnc.bam.pbi")
file("${params.name}.flnc.filter_summary.json")

Script

Ensure that only qv10 reads from ccs are input

bamtools filter -tag 'rq':'>=0.90' -in $sample_ccs -out filtered.$sample_ccs

Create an index for the filtered CCS BAM

pbindex filtered.$sample_ccs

Find and remove adapters/barcodes

lima --isoseq --dump-clips --peek-guess -j ${task.cpus} filtered.$sample_ccs $primers_fasta ${params.name}.demult.bam

Filter for non-concatemer, polyA-containing reads

isoseq3 refine --require-polya ${params.name}.demult.NEB_5p--NEB_3p.bam $primers_fasta ${params.name}.flnc.bam

Cluster reads. This step cannot be parallelized; the only way to speed it up is to use a machine with more cores.

isoseq3 cluster ${params.name}.flnc.bam ${params.name}.clustered.bam --verbose --use-qvs

Align reads to the genome. This takes a few minutes on a 40-core machine.

pbmm2 align $genome_fasta ${params.name}.clustered.hq.bam ${params.name}.aligned.bam --preset ISOSEQ --sort -j ${task.cpus} --log-level INFO

Collapse redundant reads

isoseq3 collapse ${params.name}.aligned.bam ${params.name}.collapsed.gff
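
The first command in the script filters on rq >= 0.90, which is what "qv10" refers to: the rq tag stores predicted read accuracy, and a Phred-scaled quality value is QV = -10 * log10(1 - accuracy). A small Python sketch of that conversion (the function names are chosen for this example and are not part of the pipeline):

```python
import math


def accuracy_to_qv(accuracy):
    """Convert predicted read accuracy (the 'rq' tag) to a Phred-scaled QV.

    An accuracy of exactly 1.0 has no finite QV and raises ValueError.
    """
    return -10 * math.log10(1 - accuracy)


def qv_to_accuracy(qv):
    """Convert a Phred-scaled QV back to predicted accuracy."""
    return 1 - 10 ** (-qv / 10)


# rq = 0.90 corresponds to QV10 (up to floating-point rounding).
qv_at_cutoff = accuracy_to_qv(0.90)
```

By the same formula, QV20 corresponds to an accuracy of 0.99, so raising the rq cutoff to 0.99 would keep only QV20 reads.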

star_generate_genome

STAR Generate Genome

Generates the genome index used by STAR. This step is not run if a STAR index directory is provided.

STAR Alignment

Output directory

outdir/name/star_index

star_alignment

STAR Alignment

Runs STAR alignment process

STAR Alignment

Output directory

outdir/name/star

sqanti3

SQANTI3

Corrects any errors in alignment from IsoSeq3 and classifies each accession in relation to the reference genome

Uses SQANTI v1.3

SQANTI GitHub

Output directory

outdir/name/sqanti3

filter_sqanti

Filter SQANTI

Filters SQANTI results based on the following criteria:
- protein coding only
  
  PB transcript aligns to a GENCODE-annotated protein coding gene.
- percent A downstream
  
  perc_A_downstreamTTS : percent of genomic "A"s in the downstream 20 bp window. If this number is high (> 80%), the 3' end may have arisen from intra-priming during the RT step.
- RTS stage
  
  RTS_stage: TRUE if one of the junctions could be an RT template switching artifact.
- Structural Category
  keep only transcripts that have an isoform structural category of:
  - novel_not_in_catalog
  - novel_in_catalog
  - incomplete-splice_match
  - full-splice_match
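
The four criteria above can be sketched as a single predicate over one row of the SQANTI classification table. This is a minimal Python illustration, not the code of filter_sqanti.py: the row is assumed to be a dict, the "gene" key is hypothetical, and the field names perc_A_downstreamTTS and RTS_stage are taken from the text above and may be spelled differently in the actual classification file.

```python
KEPT_CATEGORIES = {
    "novel_not_in_catalog",
    "novel_in_catalog",
    "incomplete-splice_match",
    "full-splice_match",
}


def passes_filter(row, protein_coding_genes, max_percent_a=80.0):
    """Return True if one classification row meets all four criteria."""
    if row["gene"] not in protein_coding_genes:
        return False  # not a GENCODE-annotated protein-coding gene
    if row["perc_A_downstreamTTS"] > max_percent_a:
        return False  # likely intra-priming artifact
    if row["RTS_stage"]:
        return False  # possible RT template-switching artifact
    return row["structural_category"] in KEPT_CATEGORIES


# Hypothetical example row.
coding_genes = {"GENE1"}
example_row = {
    "gene": "GENE1",
    "perc_A_downstreamTTS": 35.0,
    "RTS_stage": False,
    "structural_category": "full-splice_match",
}
keep = passes_filter(example_row, coding_genes)
```

Applying the cheap set-membership check first means most non-coding rows are rejected before the other fields are inspected.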

Container

gsheynkmanlab/proteogenomics-base

Inputs

Uses inputs from both sqanti3 and generate_reference_tables

Inputs from sqanti3 are these three files:

name_from_sqanti3__classification.txt
name_from_sqanti3__corrected.fasta
name_from_sqanti3__corrected.gtf

Inputs from generate_reference_tables process (prepare_reference_tables.py)

protein_coding_genes.txt
ensg_gene.tsv

Output directory

outdir/name/sqanti3-filtered

pacbio_6frm_gene_grouped

Six-Frame Translation

Generates a fasta file of all possible protein sequences derivable from each PacBio transcript, by translating the fasta file in all six frames (3+, 3-). This is used to examine what peptides could theoretically match the peptides found via a mass spectrometry search against GENCODE.
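
A minimal Python sketch of six-frame translation using the standard genetic code follows. It only illustrates the idea: the function names are chosen for this example, and the pipeline's own script may handle ambiguous bases, stop codons, and gene grouping differently.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Return translations for the three forward and three reverse frames."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame("ATGGCCTAA")
```

For the toy input ATGGCCTAA, frame +1 reads ATG GCC TAA and yields "MA*"; trailing bases that do not form a complete codon are dropped in every frame.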

Output directory

outdir/name/pacbio_6frm_gene_grouped

transcriptome_summary

Transcriptome Summary

Compares the abundance (CPM) based on long-read sequencing to the abundances (TPM) inferred from short-read sequencing, as computed by Kallisto (analyzed outside of this pipeline). Additionally produces a PacBio-gene reference table.
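
For reference, CPM (counts per million) rescales raw read counts so that they sum to one million across the sample. A short Python sketch, with a hypothetical function name and toy accessions:

```python
def counts_to_cpm(counts):
    """Scale raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Toy full-length read counts per PacBio accession (illustrative values only).
example_cpm = counts_to_cpm({"PB.1.1": 30, "PB.1.2": 10, "PB.2.1": 60})
```

Unlike TPM, CPM applies no length normalization, which suits long-read data where one read is assumed to represent one transcript molecule.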

Output directory

outdir/name/transcriptome_summary

cpat

CPAT

CPAT is a bioinformatics tool to predict an RNA’s coding probability based on the RNA sequence characteristics. To achieve this goal, CPAT calculates scores of sequence-based features from a set of known protein-coding genes and a background set of non-coding genes.

Features

ORF size

ORF coverage

Fickett score

Hexamer usage bias

CPAT then builds a logistic regression model using these 4 features as predictor variables and the “protein-coding status” as the response variable. After evaluating the performance and determining the probability cutoff, the model can be used to predict new RNA sequences.
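
To make the model form concrete, here is a minimal Python sketch of how a logistic regression turns the four features into a probability. The coefficients, intercept, and feature values below are made up for illustration; they are not CPAT's fitted parameters, and the function is not part of the CPAT API.

```python
import math


def coding_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Made-up coefficients for illustration only; NOT CPAT's fitted values.
toy_coefficients = {"orf_size": 0.01, "orf_coverage": 2.0, "fickett": 1.5, "hexamer": 3.0}
toy_features = {"orf_size": 300, "orf_coverage": 0.8, "fickett": 1.0, "hexamer": 0.2}
probability = coding_probability(toy_features, toy_coefficients, intercept=-5.0)
```

The resulting probability is then compared against a species-specific cutoff to call a sequence coding or non-coding.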

CPAT

Container

gsheynkmanlab/cpat:addr

Inputs

Download the hexamer table and logit model required for the species from the Zenodo repository, then unzip the compressed logit model.

wget https://zenodo.org/record/5076056/files/Human_Hexamer.tsv
wget https://zenodo.org/record/5076056/files/Human_logitModel.RData.gz
gunzip Human_logitModel.RData.gz

Also required is the output for the sample from the filter_sqanti.py step.

Human_Hexamer.tsv
Human_logitModel.RData
param_name_corrected.5degfilter.fasta*

Output directory

outdir/name/cpat

orf_calling

ORF Calling

Selects the most plausible ORF from each PacBio transcript, using the following information:
- comparison of ATG start to reference (GENCODE)
  - selects the ORF with an ATG start matching the one in the reference, if it exists
- coding probability score from CPAT
- number of upstream ATGs for the candidate ORF
  - decreases the score as the number of upstream ATGs increases, using a sigmoid function

Additionally provides the calling confidence of each ORF called:
- Clear Best ORF : best score and fewest upstream ATGs of all called ORFs
- Plausible ORF : not clear best, but decent CPAT coding_score (>0.364)
- Low Quality ORF : low CPAT coding_score (<0.364)
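
The confidence labels above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline's implementation: the exact sigmoid used for the upstream-ATG penalty is not given here, so upstream_atg_penalty and its midpoint parameter are hypothetical, and a coding_score of exactly 0.364 is arbitrarily treated as low quality.

```python
import math

CPAT_CUTOFF = 0.364  # coding_score threshold quoted above


def upstream_atg_penalty(n_upstream_atgs, midpoint=5.0):
    """Hypothetical sigmoid weight that falls toward 0 as upstream ATGs increase."""
    return 1.0 / (1.0 + math.exp(n_upstream_atgs - midpoint))


def orf_confidence(is_best_score, has_fewest_upstream_atgs, coding_score):
    """Assign the calling-confidence label for a selected ORF."""
    if is_best_score and has_fewest_upstream_atgs:
        return "Clear Best ORF"
    if coding_score > CPAT_CUTOFF:
        return "Plausible ORF"
    return "Low Quality ORF"


label = orf_confidence(is_best_score=False, has_fewest_upstream_atgs=True, coding_score=0.8)
```

A smooth penalty avoids a hard cutoff on the number of upstream ATGs, so a strong CPAT score can still outweigh one or two upstream start codons.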

Output directory

outdir/name/orf_calling

refined_database

Refined DB Generation

Filters the ORF database to only include accessions with a CPAT coding score above a threshold (default 0.0).

Filters ORFs to only include ORFs that have a stop codon

Collapses transcripts that produce the same protein into one entry, keeping a base accession (the first in alphanumeric order).

Abundances of transcripts (CPM) are collapsed during this process.
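
A minimal Python sketch of the collapsing step follows. It assumes that "collapsed" abundances means the CPMs of identical-protein transcripts are summed, and that the base accession is the first one under plain string sorting; both are assumptions of this example, as are the function name and the toy records.

```python
from collections import defaultdict


def collapse_by_protein(orfs):
    """Group (accession, protein_sequence, cpm) records by identical sequence."""
    groups = defaultdict(list)
    for accession, sequence, cpm in orfs:
        groups[sequence].append((accession, cpm))
    collapsed = []
    for sequence, members in groups.items():
        accessions = sorted(acc for acc, _ in members)
        collapsed.append({
            "base_accession": accessions[0],  # first in sorted order
            "accessions": accessions,
            "sequence": sequence,
            "cpm": sum(cpm for _, cpm in members),  # assumed: CPMs are summed
        })
    return collapsed


# Toy records (hypothetical accessions and sequences).
entries = collapse_by_protein([
    ("PB.1.2", "MKT", 4.0),
    ("PB.1.1", "MKT", 6.0),
    ("PB.2.1", "MAA", 1.5),
])
```

Here PB.1.1 and PB.1.2 encode the same protein, so they become one entry with base accession PB.1.1 and a combined abundance of 10.0 CPM.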

Output directory

outdir/name/refined_database

pacbio_cds

PacBio CDS GTF

Derives a GTF file that includes the ORF regions (as CDS features).

Output directory

outdir/name/pacbio_cds

rename_cds

Rename CDS to Exon

Preprocessing step for SQANTI Protein. CDS features are renamed to exon, and transcript start and stop locations are updated to reflect the CDS start and stop.

Output directory

outdir/name/rename_cds

sqanti_protein

SQANTI Protein

Classifies protein splice sites and calculates additional statistics for the start and stop of the ORF.

Output directory

outdir/name/sqanti_protein

five_prime_utr

5' UTR Status

Intermediate step for protein classification

Determines the 5' UTR status of the protein in order to classify the protein category in a later step.

Output directory

None

protein_classification

Protein Classification

Classifies proteins based on splicing and start site.

The main classifications are:
- pFSM: full-protein-match
  - protein fully matches a GENCODE protein
- pISM: incomplete-protein-match
  - protein only partially matches a GENCODE protein
  - considered an N- or C-terminus truncation artifact
- pNIC: novel-in-catalog
  - protein composed of known N-term, splicing, and/or C-term in new combinations
- pNNC: novel-not-in-catalog
  - protein composed of novel N-term, splicing, and/or C-terminus

Output directory

outdir/name/protein_classification

protein_gene_rename

Protein Gene Rename

Mappings of PacBio transcripts/proteins to GENCODE genes. A PacBio transcript and its associated predicted protein can map to two different genes. Some transcripts can also map to multiple genes.

Output directory

outdir/name/protein_gene_rename

protein_filter

Protein Filtering

Filters out proteins that are:
- not pFSM, pNIC, or pNNC
- pISMs (either N-terminus or C-terminus truncations)
- pNNC with junctions after the stop codon (default 2)

Output directory

outdir/name/protein_filter

hybrid_protein_database

Protein Hybrid Database

Makes a hybrid database that is composed of high-confidence PacBio proteins and GENCODE proteins for genes that are not in the high-confidence space.

High-confidence is defined as genes in which the PacBio sampling is adequate (average transcript length 1-4 kb) and a total of 3 CPM (counts per million) per gene.
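
The selection rule can be sketched in Python as below. The thresholds come from the definition above, but the details are assumptions of this example: the length bounds are treated as inclusive, "a total of 3 CPM" is read as a minimum, and the function names and toy gene statistics are hypothetical.

```python
def is_high_confidence(avg_transcript_length, total_cpm,
                       min_length=1000, max_length=4000, min_cpm=3.0):
    """Return True if a gene falls in the high-confidence PacBio space."""
    length_ok = min_length <= avg_transcript_length <= max_length
    return length_ok and total_cpm >= min_cpm


def choose_database(gene_stats):
    """Map each gene to the database that supplies its proteins."""
    return {
        gene: "pacbio" if is_high_confidence(length, cpm) else "gencode"
        for gene, (length, cpm) in gene_stats.items()
    }


# Toy per-gene (average transcript length in bp, total CPM) values.
sources = choose_database({
    "GENE_A": (2500, 12.0),
    "GENE_B": (6000, 40.0),
    "GENE_C": (1800, 0.5),
})
```

In this toy input only GENE_A is taken from PacBio; GENE_B is too long on average and GENE_C is too lowly expressed, so both fall back to GENCODE.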

Output directory

outdir/name/hybrid_protein_database

metamorpheus

MetaMorpheus

MetaMorpheus is a bottom-up proteomics database search software with integrated post-translational modification (PTM) discovery capability. This program combines features of Morpheus and G-PTM-D in a single tool.

MetaMorpheus GitHub

metamorpheus_with_gencode_database

MetaMorpheus GENCODE

Runs a MetaMorpheus MS search using the GENCODE database.

Output directory

outdir/name/metamorpheus/gencode

metamorpheus_with_uniprot_database

MetaMorpheus UniProt

Runs a MetaMorpheus MS search using the UniProt database.

Output directory

outdir/name/metamorpheus/uniprot

pacbio

PacBio

metamorpheus_with_sample_specific_database_refined

MetaMorpheus on PacBio Refined Database

Runs a MetaMorpheus MS search using the PacBio refined database.

Output directory

outdir/name/metamorpheus/pacbio/refined

metamorpheus_with_sample_specific_database_filtered

MetaMorpheus on PacBio Filtered Database

Runs a MetaMorpheus MS search using the PacBio filtered database.

Output directory

outdir/name/metamorpheus/pacbio/filtered

metamorpheus_with_sample_specific_database_hybrid

MetaMorpheus on PacBio Hybrid Database

Runs a MetaMorpheus MS search using the PacBio hybrid database.

Output directory

outdir/name/metamorpheus/pacbio/hybrid

metamorpheus_with_sample_specific_database_rescue_resolve

MetaMorpheus using "Rescue and Resolve" Algorithm

Runs a MetaMorpheus MS search using the hybrid database and the version of MetaMorpheus containing the "Rescue and Resolve" algorithm. This algorithm allows for "Rescue" of protein isoforms that are normally discarded during protein parsimony, given that the protein isoform has strong evidence of transcriptional expression.

Output directory

outdir/name/metamorpheus/pacbio/rescue_resolve

peptide_analysis

Peptide Analysis

Generate a table comparing MS peptide results between the PacBio and GENCODE databases.

Output directory

outdir/name/peptide_analysis

track_visualization

Track Visualization

reference

Reference Track Visualization

Creates tracks to use in UCSC Genome Browser for GENCODE database.

Output directory

outdir/name/track_visualization/reference

protein_track_visualization

Protein Track Visualization

Creates tracks to use in UCSC Genome Browser for refined, filtered, and hybrid PacBio databases.

Shades tracks based on abundance (CPMs) and protein classification.

Output directory

outdir/name/track_visualization

make_multiregion

Multiregion BED generation

Makes a multiregion BED file for the UCSC Genome Browser for the refined, filtered, and hybrid databases. Regions in the BED correspond to coding regions, thereby allowing intronic regions to be minimized for easier isoform viewing.
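
One way to derive such regions is to merge overlapping coding intervals across isoforms and emit one BED line per merged block. The Python sketch below illustrates that idea under assumptions: the function names and the padding parameter are hypothetical, coordinates are treated as BED-style 0-based half-open intervals, and the pipeline's own script may build regions differently.

```python
def merge_regions(regions, padding=0):
    """Merge overlapping or touching (start, end) intervals after optional padding."""
    padded = sorted(
        (max(0, start - padding), end + padding) for start, end in regions
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Extend the previous block instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def to_bed_lines(chrom, regions):
    """Format merged regions as minimal three-column BED lines."""
    return [f"{chrom}\t{start}\t{end}" for start, end in regions]


# Toy coding intervals from two isoforms of one gene.
blocks = merge_regions([(100, 200), (150, 300), (500, 600)])
bed_lines = to_bed_lines("chr1", blocks)
```

Everything between the merged blocks (here 300 to 500) is intronic in all isoforms and can be hidden in the browser's multi-region view.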

Output directory

outdir/name/track_visualization

peptide_track_visualization

Peptide Track Visualization

Makes peptide tracks for UCSC Genome Browser for refined, filtered and hybrid databases

Output directory

outdir/name/track_visualization

accession_mapping

Accession Mapping

Maps protein entries within GENCODE, UniProt and PacBio databases to one another based on sequence similarity

Output directory

outdir/name/accession_mapping

protein_group_compare

Protein Group Comparison

Determines the relationship between protein groups identified in MetaMorpheus using the GENCODE, UniProt, and/or PacBio databases.

Output directory

outdir/name/protein_group_compare

peptide_novelty_analysis

Novel Peptides

Finds novel peptides between the sample database and GENCODE. A novel peptide is defined as a peptide found in PacBio that could not be found in GENCODE.

Novel peptides are found for the refined, filtered, and hybrid databases.
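
The definition of a novel peptide can be sketched as a substring check against every GENCODE protein sequence. This Python example is a simplified illustration with a hypothetical function name and toy sequences; the pipeline's script may add further handling that is not shown here.

```python
def find_novel_peptides(pacbio_peptides, gencode_proteins):
    """Return peptides that occur in no GENCODE protein sequence."""
    return sorted(
        peptide
        for peptide in set(pacbio_peptides)
        if not any(peptide in protein for protein in gencode_proteins)
    )


# Toy data: one peptide is contained in a GENCODE protein, one is not.
novel = find_novel_peptides(
    pacbio_peptides=["MKTAY", "QLVNR", "MKTAY"],
    gencode_proteins=["AAMKTAYIAK", "GGSLLPR"],
)
```

Converting the peptide list to a set first avoids rechecking duplicates, which matters because the same peptide is typically identified from many spectra.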

Output directory

outdir/name/novel_peptides

Sheynkman-Lab
