
Pipeline Processes

Anne Deslattes Mays edited this page Jan 30, 2022 · 15 revisions
generate_reference_tables

Generates reference tables for downstream pipeline processes using the script prepare_reference_tables.py

container

  • gsheynkmanlab/generate-reference-tables

tags

  • gencode_gtf
  • gencode_transcript_fasta

cpus

1

publishDir

outdir/name/reference_tables

Input

All inputs are mapped to files from channels within the Nextflow process generate_reference_tables.

  • ch_gencode_gtf
  • ch_gencode_transcript_fasta_uncompressed

Output

  • ch_ensg_gene
  • ch_enst_isoname
  • ch_gene_ensp
  • ch_gene_isoname
  • ch_isoname_lens
  • ch_gene_lens
  • ch_protein_coding_genes

Script

prepare_reference_tables.py \
  --gtf $gencode_gtf \
  --fa $gencode_transcript_fasta \
  --ensg_gene ensg_gene.tsv \
  --enst_isoname enst_isoname.tsv \
  --gene_ensp gene_ensp.tsv \
  --gene_isoname gene_isoname.tsv \
  --isoname_lens isoname_lens.tsv \
  --gene_lens gene_lens.tsv \
  --protein_coding_genes protein_coding_genes.txt
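
As a rough illustration of the kind of table this step produces, the sketch below pulls gene_id and gene_name pairs out of GENCODE GTF lines to build an ENSG-to-gene mapping. It is not the code in prepare_reference_tables.py; the helper name ensg_to_gene and the sample GTF line are hypothetical.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def ensg_to_gene(gtf_lines):
    """Return {gene_id: gene_name} from GTF lines (hypothetical helper)."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            mapping[attrs["gene_id"]] = attrs["gene_name"]
    return mapping

# Made-up example input, not real GENCODE content.
example = [
    "##description: made-up example lines\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\t'
    'gene_id "ENSG00000000001.1"; gene_type "protein_coding"; gene_name "GENE1";\n',
]
for ensg, gene in ensg_to_gene(example).items():
    print(f"{ensg}\t{gene}")
```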

ch_protein_coding_genes

The ch_protein_coding_genes Nextflow channel is split into multiple output channels for simultaneous consumption by two downstream processes.

  • ch_protein_coding_genes.into{
  • ch_protein_coding_genes_gencode_fasta
  • ch_protein_coding_genes_filter_sqanti
  • }

ch_ensg_gene

The ch_ensg_gene Nextflow channel is split into multiple output channels for simultaneous consumption by multiple downstream processes.

  • ch_ensg_gene.into{
  • ch_ensg_gene_filter
  • ch_ensg_gene_six_frame
  • ch_ensg_gene_pclass
  • }

ch_gene_lens

The ch_gene_lens Nextflow channel is split into multiple output channels for simultaneous consumption by two downstream processes.

  • ch_gene_lens.into{
  • ch_gene_lens_transcriptome
  • ch_gene_lens_aggregate
  • }

ch_gene_isoname

The ch_gene_isoname Nextflow channel is split into multiple output channels for simultaneous consumption by two downstream processes.

  • ch_gene_isoname.into{
  • ch_gene_isoname_pep_viz
  • ch_gene_isoname_pep_analysis
  • }

make_gencode_database

Make GENCODE database

Clusters same-protein GENCODE entries

tags

  • gencode_translation_fasta

cpus

1

publishDir

outdir/name/gencode_db

Input

  • ch_gencode_translation_fasta_uncompressed
  • ch_protein_coding_genes_gencode_fasta

Output

  • ch_gencode_protein_fasta
  • gencode_isoname_clusters.tsv

Script

make_gencode_database.py \
  --gencode_fasta $gencode_translation_fasta \
  --protein_coding_genes $protein_coding_genes \
  --output_fasta gencode_protein.fasta \
  --output_cluster gencode_isoname_clusters.tsv
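
Clustering same-protein entries can be pictured as grouping isoforms by identical protein sequence. The sketch below is a minimal illustration under that assumption, not the logic of make_gencode_database.py; the function name, the choice of representative, and the sample records are hypothetical.

```python
from collections import defaultdict

def cluster_same_protein(records):
    """Group (isoform_name, protein_sequence) pairs by identical sequence."""
    clusters = defaultdict(list)
    for name, seq in records:
        clusters[seq].append(name)
    # Assumption: the alphabetically first isoform name represents the cluster.
    return {sorted(names)[0]: sorted(names) for names in clusters.values()}

# Made-up isoform names and sequences.
records = [("GENE1-202", "MAAK"), ("GENE1-201", "MAAK"), ("GENE2-201", "MSTP")]
for representative, members in cluster_same_protein(records).items():
    print(representative, ",".join(members), sep="\t")
```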

ch_gencode_protein_fasta

The ch_gencode_protein_fasta Nextflow channel is split into multiple output channels for simultaneous consumption by multiple downstream processes.

  • ch_gencode_protein_fasta.into{
  • ch_gencode_protein_fasta_metamorpheus
  • ch_gencode_protein_fasta_mapping
  • ch_gencode_protein_fasta_hybrid
  • ch_gencode_protein_fasta_novel
  • }
isoseq3

ISOSEQ

Runs IsoSeq3 on CCS reads, aligning to genome and collapsing redundant reads

STEPS

  • ensure only qv10 reads from ccs are kept as input
  • find and remove adapters/barcodes
  • filter for non-concatemer, polyA-containing reads
  • clustering of reads
  • align reads to the genome
  • collapse redundant reads

tags

  • sample_ccs
  • genome_fasta
  • primers_fasta

cpus

The user provides --max_cpus as input, which the Nextflow workflow then reads as params.max_cpus.

publishDir

outdir/name/isoseq3

Input

  • ch_sample_ccs
  • ch_genome_fasta_isoseq
  • ch_primers_fasta

Output

  • file("${params.name}.collapsed.gff") into ch_isoseq_gtf
  • file("${params.name}.collapsed.abundance.txt") into ch_fl_count
  • file("${params.name}.collapsed.fasta")
  • file("${params.name}.collapsed.report.json")
  • file("${params.name}.demult.lima.summary")
  • file("${params.name}.flnc.bam")
  • file("${params.name}.flnc.bam.pbi")
  • file("${params.name}.flnc.filter_summary.json")

Script

Ensure that only qv10 reads from ccs are input

bamtools filter -tag 'rq':'>=0.90' -in $sample_ccs -out filtered.$sample_ccs

Create an index for the ccs bam

pbindex filtered.$sample_ccs
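
The 0.90 threshold on the rq tag in the filter above corresponds to QV10, because a Phred-scaled quality value is -10 * log10(1 - accuracy). The short sketch below works through that arithmetic; the function name rq_to_qv is hypothetical.

```python
import math

def rq_to_qv(rq):
    """Convert predicted read accuracy (rq) to a Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - rq)

# Rounded to hide floating-point noise in the logarithm.
print(round(rq_to_qv(0.90), 3))  # accuracy 0.90, the cutoff used above
print(round(rq_to_qv(0.99), 3))  # accuracy 0.99, for comparison
```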

Find and remove adapters/barcodes

lima --isoseq --dump-clips --peek-guess -j ${task.cpus} filtered.$sample_ccs $primers_fasta ${params.name}.demult.bam

Filter for non-concatemer, polyA-containing reads

isoseq3 refine --require-polya ${params.name}.demult.NEB_5p--NEB_3p.bam $primers_fasta ${params.name}.flnc.bam

Cluster reads. This step cannot be parallelized; it can only be made faster by running on a machine with more cores.

isoseq3 cluster ${params.name}.flnc.bam ${params.name}.clustered.bam --verbose --use-qvs

Align reads to the genome. This takes a few minutes on a 40-core machine.

pbmm2 align $genome_fasta ${params.name}.clustered.hq.bam ${params.name}.aligned.bam --preset ISOSEQ --sort -j ${task.cpus} --log-level INFO

Collapse redundant reads

isoseq3 collapse ${params.name}.aligned.bam ${params.name}.collapsed.gff

star_generate_genome

STAR Generate Genome

Generates the genome index used by STAR. Not run if a STAR index directory is provided.

STAR Alignment

Output directory

outdir/name/star_index


star_alignment

STAR Alignment

Runs STAR alignment process

STAR Alignment

Output directory

outdir/name/star


sqanti3

SQANTI3

Corrects any errors in alignment from IsoSeq3 and classifies each accession in relation to the reference genome

Uses SQANTI v1.3

SQANTI GitHub

Output directory

outdir/name/sqanti3


filter_sqanti

Filter SQANTI

Filter SQANTI

  • Filter SQANTI results based on several criteria
    • protein coding only

      PB transcript aligns to a GENCODE-annotated protein coding gene.

    • percent A downstream

      perc_A_downstreamTTS: percent of genomic "A"s in the downstream 20 bp window. If this number is high (> 80%), the 3' end may have arisen from intra-priming during the RT step.

    • RTS stage

      RTS_stage: TRUE if one of the junctions could be an RT template switching artifact.

    • Structural Category

      keep only transcripts that have an isoform structural category of:

      • novel_not_in_catalog
      • novel_in_catalog
      • incomplete-splice_match
      • full-splice_match
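
The four criteria above can be sketched as a single predicate over one row of the SQANTI classification table. This is a minimal illustration, not the filter_sqanti.py implementation: the row is assumed to be a dict, the key "gene" is hypothetical, the other column names are taken from the text above, and perc_A_downstreamTTS is assumed to be on a 0-100 scale.

```python
KEPT_CATEGORIES = {
    "novel_not_in_catalog",
    "novel_in_catalog",
    "incomplete-splice_match",
    "full-splice_match",
}

def passes_filter(row, protein_coding_genes, max_percent_a=80.0):
    """Apply the four criteria to one classification row (a dict)."""
    return (
        row["gene"] in protein_coding_genes                       # protein coding only
        and float(row["perc_A_downstreamTTS"]) <= max_percent_a   # not intra-primed
        and str(row["RTS_stage"]).upper() != "TRUE"               # no RT switching
        and row["structural_category"] in KEPT_CATEGORIES         # allowed category
    )

# Made-up rows for illustration.
coding = {"GENE1"}
kept = {"gene": "GENE1", "perc_A_downstreamTTS": "35.0",
        "RTS_stage": "FALSE", "structural_category": "full-splice_match"}
dropped = dict(kept, perc_A_downstreamTTS="95.0")
print(passes_filter(kept, coding), passes_filter(dropped, coding))
```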

Container

  • gsheynkmanlab/proteogenomics-base

Inputs

Uses inputs from both sqanti3 and generate_reference_tables

Inputs from sqanti3 are these three files:

  • name_from_sqanti3__classification.txt
  • name_from_sqanti3__corrected.fasta
  • name_from_sqanti3__corrected.gtf

Inputs from generate_reference_tables process (prepare_reference_tables.py)

  • protein_coding_genes.txt
  • ensg_gene.tsv

Output directory

outdir/name/sqanti3-filtered


pacbio_6frm_gene_grouped

Six-Frame Translation

Generates a fasta file of all possible protein sequences derivable from each PacBio transcript, by translating the fasta file in all six frames (3+, 3-). This is used to examine what peptides could theoretically match the peptides found via a mass spectrometry search against GENCODE.
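
For readers unfamiliar with six-frame translation, here is a minimal sketch using the standard genetic code: three frames on the forward strand and three on the reverse complement. It is an illustration only, not the pipeline's script; the function names and the frame labels (+1 to +3, -1 to -3) are assumptions, and codons containing non-ACGT characters are rendered as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons of seq; '*' marks a stop codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame(seq):
    """Return translations for the three forward and three reverse frames."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for frame, protein in six_frame("ATGGCCTAA").items():
    print(frame, protein)
```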

Output directory

outdir/name/pacbio_6frm_gene_grouped


transcriptome_summary

Transcriptome Summary

Compares the abundance (CPM) based on long-read sequencing to the abundances (TPM) inferred from short-read sequencing, as computed by Kallisto (analyzed outside of this pipeline). Additionally produces a pacbio-gene reference table

Output directory

outdir/name/transcriptome_summary


cpat

CPAT

CPAT is a bioinformatics tool to predict an RNA’s coding probability based on the RNA sequence characteristics. To achieve this goal, CPAT calculates scores of sequence-based features from a set of known protein-coding genes and background set of non-coding genes.

Features

  • ORF size
  • ORF coverage
  • Fickett score
  • Hexamer usage bias

CPAT then builds a logistic regression model using these 4 features as predictor variables and the “protein-coding status” as the response variable. After evaluating the performance and determining the probability cutoff, the model can be used to predict new RNA sequences.
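
Two of the four features, ORF size and ORF coverage, are simple enough to sketch directly. The code below is an illustration, not CPAT's implementation: it assumes the ORF is the longest ATG-to-stop stretch on the forward strand, that the stop codon counts toward the ORF size, and that coverage is ORF size divided by transcript length. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest forward-strand ATG..stop ORF."""
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best

def orf_features(seq):
    """Return (ORF size in nt, ORF coverage as a fraction of the transcript)."""
    start, end = longest_orf(seq)
    size = end - start
    coverage = size / len(seq) if seq else 0.0
    return size, coverage

size, coverage = orf_features("CCATGGCCTAAGG")
print(size, round(coverage, 3))
```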

CPAT

Container

  • gsheynkmanlab/cpat:addr

Inputs

Download the hexamer table and logit model required for the species from the Zenodo repository. Unzip the compressed logit model.

wget https://zenodo.org/record/5076056/files/Human_Hexamer.tsv
wget https://zenodo.org/record/5076056/files/Human_logitModel.RData.gz
gunzip Human_logitModel.RData.gz

Also required is the output for the sample from the filter_sqanti.py step.

  • Human_Hexamer.tsv
  • Human_logitModel.RData
  • param_name_corrected.5degfilter.fasta*

Output directory

outdir/name/cpat


orf_calling

ORF Calling

Selects the most plausible ORF from each PacBio transcript, using the following information:

  • comparison of ATG start to reference (GENCODE)

    selects ORF with ATG start matching the one in the reference, if it exists

  • coding probability score from CPAT
  • number of upstream ATGs for the candidate ORF

    the score decreases as the number of upstream ATGs increases, following a sigmoid function

Additionally provides calling confidence of each ORF called

  • Clear Best ORF : best score and fewest upstream ATGs of all called ORFs
  • Plausible ORF : not clear best, but decent CPAT coding_score (>0.364)
  • Low Quality ORF : low CPAT coding_score (<0.364)
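
The confidence labels above can be sketched as follows. This is a hedged illustration rather than the pipeline's ORF-calling code: the dict keys, the function names, and the sigmoid midpoint and steepness are hypothetical, and a coding_score exactly equal to 0.364 is treated as low quality because the text does not say which side it falls on.

```python
import math

CPAT_CUTOFF = 0.364  # coding_score cutoff quoted in the text above

def atg_weight(n_upstream_atgs, midpoint=5.0, steepness=1.0):
    """Hypothetical sigmoid weight in (0, 1) that falls as upstream ATGs increase."""
    return 1.0 / (1.0 + math.exp(steepness * (n_upstream_atgs - midpoint)))

def call_confidence(orf, candidates):
    """Label one candidate ORF relative to all candidates of the same transcript."""
    best_score = max(c["coding_score"] for c in candidates)
    fewest_atgs = min(c["upstream_atgs"] for c in candidates)
    if orf["coding_score"] == best_score and orf["upstream_atgs"] == fewest_atgs:
        return "Clear Best ORF"
    if orf["coding_score"] > CPAT_CUTOFF:
        return "Plausible ORF"
    return "Low Quality ORF"

# Made-up candidate ORFs for one transcript.
candidates = [
    {"coding_score": 0.95, "upstream_atgs": 0},
    {"coding_score": 0.50, "upstream_atgs": 2},
    {"coding_score": 0.10, "upstream_atgs": 4},
]
for orf in candidates:
    print(call_confidence(orf, candidates), round(atg_weight(orf["upstream_atgs"]), 3))
```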

Output directory

outdir/name/orf_calling


refined_database

Refined DB Generation

  • Filters ORF database to only include accessions with a CPAT coding score above a threshold (default 0.0)
  • Filters ORFs to only include ORFs that have a stop codon
  • Collapses transcripts that produce the same protein into one entry, keeping a base accession (first alphanumeric).

Abundances of transcripts (CPM) are collapsed during this process.
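
The collapse step can be pictured as grouping ORFs by protein sequence, keeping one base accession per group, and combining abundances. The sketch below assumes that "first alphanumeric" means the first accession in plain string sort order and that CPMs are summed; both are assumptions, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def collapse_by_protein(orfs):
    """orfs: iterable of (accession, protein_sequence, cpm) tuples."""
    groups = defaultdict(list)
    for accession, protein, cpm in orfs:
        groups[protein].append((accession, cpm))
    collapsed = []
    for protein, members in groups.items():
        base = min(acc for acc, _ in members)       # first accession in sort order
        total_cpm = sum(cpm for _, cpm in members)  # assumption: CPMs are summed
        collapsed.append((base, protein, total_cpm))
    return sorted(collapsed)

# Made-up accessions, sequences, and abundances.
orfs = [("PB.1.2", "MAAK", 2.0), ("PB.1.1", "MAAK", 3.0), ("PB.2.1", "MSTP", 1.0)]
for base, protein, cpm in collapse_by_protein(orfs):
    print(base, protein, cpm, sep="\t")
```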

Output directory

outdir/name/refined_database


pacbio_cds

PacBio CDS GTF

Derives a GTF file that includes the ORF regions (as CDS features).

Output directory

outdir/name/pacbio_cds

rename_cds

Rename CDS to Exon

Preprocessing step for SQANTI Protein. CDS features are renamed to exon, and transcript start and stop locations are updated to reflect the CDS start and stop.

Output directory

outdir/name/rename_cds


sqanti_protein

SQANTI Protein

Classifies protein splice sites and calculates additional statistics for the start and stop of the ORF.

Output directory

outdir/name/sqanti_protein


five_prime_utr

5' UTR Status

Intermediate step for protein classification

Determines the 5' UTR status of the protein in order to classify the protein category in a later step.

Output directory

None


protein_classification

Protein Classification

Classifies protein based on splicing and start site

The main classifications are:

  • pFSM: full-protein-match

    protein fully matches a GENCODE protein

  • pISM: incomplete-protein-match

    protein only partially matches a GENCODE protein

    considered an N- or C-terminus truncation artifact

  • pNIC: novel-in-catalog

    protein composed of known N-term, splicing, and/or C-term in new combinations

  • pNNC: novel-not-in-catalog

    protein composed of novel N-term, splicing, and/or C-terminus

Output directory

outdir/name/protein_classification


protein_gene_rename

Protein Gene Rename

Mappings of PacBio transcripts/proteins to GENCODE genes. Some PacBio transcripts and the associated PacBio predicted protein can map to two different genes. Some transcripts can also map to multiple genes.

Output directory

outdir/name/protein_gene_rename


protein_filter

Protein Filtering

Filters out proteins that are:

  • not pFSM, pNIC, or pNNC
  • pISMs (either N-terminus or C-terminus truncations)
  • pNNC with junctions after the stop codon (default 2)

Output directory

outdir/name/protein_filter


hybrid_protein_database

Protein Hybrid Database

Makes a hybrid database that is composed of high-confidence PacBio proteins and GENCODE proteins for genes that are not in the high-confidence space.

High-confidence genes are defined as genes in which the PacBio sampling is adequate: an average transcript length of 1-4 kb and a total of 3 CPM (counts per million) per gene.
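
The high-confidence rule can be expressed as a small predicate. This sketch assumes that the length bounds are inclusive and that 3 CPM is a minimum; neither is stated explicitly above, and the function and parameter names are hypothetical.

```python
def is_high_confidence(avg_transcript_len_kb, gene_cpm,
                       lower_kb=1.0, upper_kb=4.0, lower_cpm=3.0):
    """Return True if a gene falls in the high-confidence space."""
    # Assumption: bounds are inclusive and the CPM value is a lower limit.
    in_length_range = lower_kb <= avg_transcript_len_kb <= upper_kb
    return in_length_range and gene_cpm >= lower_cpm

print(is_high_confidence(2.5, 10.0))  # adequate length and abundance
print(is_high_confidence(6.0, 10.0))  # average transcript too long
print(is_high_confidence(2.5, 1.0))   # abundance too low
```

Under this reading, genes for which the predicate is False would take their entries from GENCODE in the hybrid database.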

Output directory

outdir/name/hybrid_protein_database


metamorpheus

MetaMorpheus

MetaMorpheus is a bottom-up proteomics database search software with integrated post-translational modification (PTM) discovery capability. This program combines features of Morpheus and G-PTM-D in a single tool.

MetaMorpheus GitHub

metamorpheus_with_gencode_database

MetaMorpheus GENCODE

Runs MetaMorpheus MS search using the GENCODE database

Output directory

outdir/name/metamorpheus/gencode


metamorpheus_with_uniprot_database

MetaMorpheus UniProt

Runs MetaMorpheus MS search using the UniProt database

Output directory

outdir/name/metamorpheus/uniprot


pacbio

PacBio

metamorpheus_with_sample_specific_database_refined

MetaMorpheus on PacBio Refined Database

Runs MetaMorpheus MS search using the PacBio refined database

Output directory

outdir/name/metamorpheus/pacbio/refined


metamorpheus_with_sample_specific_database_filtered

MetaMorpheus on PacBio Filtered Database

Runs MetaMorpheus MS search using the PacBio filtered database

Output directory

outdir/name/metamorpheus/pacbio/filtered


metamorpheus_with_sample_specific_database_hybrid

MetaMorpheus on PacBio Hybrid Database

Runs MetaMorpheus MS search using the PacBio hybrid database

Output directory

outdir/name/metamorpheus/pacbio/hybrid


metamorpheus_with_sample_specific_database_rescue_resolve

MetaMorpheus using "Rescue and Resolve" Algorithm

Runs MetaMorpheus MS search using the hybrid database and the version of MetaMorpheus containing the "Rescue and Resolve" algorithm. This algorithm allows for "Rescue" of protein isoforms that are normally discarded during protein parsimony, provided that the protein isoform has strong evidence of transcriptional expression.

Output directory

outdir/name/metamorpheus/pacbio/rescue_resolve


peptide_analysis

Peptide Analysis

Generate a table comparing MS peptide results between the PacBio and GENCODE databases.

Output directory

outdir/name/peptide_analysis


track_visualization

Track Visualization

reference

Reference Track Visualization

Creates tracks to use in UCSC Genome Browser for GENCODE database.

Output directory

outdir/name/track_visualization/reference

protein_track_visualization

Protein Track Visualization

Creates tracks to use in UCSC Genome Browser for refined, filtered, and hybrid PacBio databases.

Shades tracks based on abundance (CPMs) and protein classification.

Output directory

outdir/name/track_visualization


make_multiregion

Multiregion BED generation

Makes a multiregion BED file for the UCSC Genome Browser for the refined, filtered, and hybrid databases. Regions in the BED correspond to coding regions, thereby allowing intronic regions to be minimized for easier isoform viewing.

Output directory

outdir/name/track_visualization


peptide_track_visualization

Peptide Track Visualization

Makes peptide tracks for UCSC Genome Browser for refined, filtered and hybrid databases

Output directory

outdir/name/track_visualization


accession_mapping

Accession Mapping

Maps protein entries within GENCODE, UniProt and PacBio databases to one another based on sequence similarity

Output directory

outdir/name/accession_mapping

protein_group_compare

Protein Group Comparison

Determines the relationship between protein groups identified in MetaMorpheus using GENCODE, UniProt, and/or PacBio databases

Output directory

outdir/name/protein_group_compare


peptide_novelty_analysis

Novel Peptides

Finds novel peptides between the sample database and GENCODE. A novel peptide is defined as a peptide found in PacBio that could not be found in GENCODE.

Novel peptides are found for the refined, filtered, and hybrid databases.
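
A minimal sketch of the novelty definition above: a peptide is reported as novel if it does not occur as an exact substring of any GENCODE protein sequence. Exact substring matching is an assumption (the pipeline may, for example, treat leucine and isoleucine as equivalent), and the function name and sample sequences are hypothetical.

```python
def novel_peptides(sample_peptides, gencode_proteins):
    """Return sorted peptides that are absent from every GENCODE protein."""
    return sorted(
        peptide
        for peptide in set(sample_peptides)
        if not any(peptide in protein for protein in gencode_proteins)
    )

# Made-up sequences for illustration.
gencode = ["MAAKSTPLLR", "MSTPQQK"]
peptides = ["AAKSTP", "STPQQK", "NOVELPEPK", "NOVELPEPK"]
print(novel_peptides(peptides, gencode))
```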

Output directory

outdir/name/novel_peptides
