Output Files

bj8th edited this page Jul 27, 2021 · 16 revisions

Output file names and directories are customized based on the "name" parameter provided as input. On this page, "name" in a path or filename stands for the value of that parameter.
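As a rough illustration of the naming convention above, the sketch below builds the expected location of an output file from the outdir/name/module layout used throughout this page. The helper names (`module_output_dir`, `expected_file`), the `{name}` template syntax, and the sample values `results` and `sample1` are hypothetical and not part of the pipeline.

```python
from pathlib import Path


def module_output_dir(outdir: str, name: str, module: str) -> Path:
    """Return outdir/name/module, the directory layout listed on this page."""
    return Path(outdir) / name / module


def expected_file(outdir: str, name: str, module: str, template: str) -> Path:
    """Fill the "name" placeholder of a filename template taken from the tables.

    The "{name}" template syntax is an assumption made for this sketch.
    """
    return module_output_dir(outdir, name, module) / template.format(name=name)


# Hypothetical values: outdir "results", name "sample1".
path = expected_file("results", "sample1", "sqanti3", "{name}_classification.txt")
print(path.as_posix())  # results/sample1/sqanti3/sample1_classification.txt
```

Files whose names do not contain "name" (for example ensg_gene.tsv) can be passed as templates unchanged.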

reference_tables Output directory: outdir/name/reference_tables
filename description
ensg_gene.tsv GENCODE gene_id to gene_name mapping
enst_isoname.tsv GENCODE transcript_id to transcript_name mapping
gene_ensp.tsv gene_name to GENCODE protein_id mapping
gene_isoname.tsv gene_name and transcript_name mapping
gene_lens.tsv Gene nucleotide length statistics
isoname_lens.tsv Gene isoform length information
protein_coding_genes.txt list of protein coding genes determined by GENCODE
gencode_db Output directory: outdir/name/gencode_db
filename description
gencode_isoname_cluster.tsv Listing of GENCODE transcript_names (i.e., isonames) that create the same proteins. Reference transcript_name (arbitrarily selected) and clustered transcript_names provided
gencode_protein.fasta protein sequence of GENCODE, reference isonames only
isoseq3 Output directory: outdir/name/isoseq3
filename description
name.collapsed.abundance.txt Collapsed isoform abundances
name.collapsed.fasta Representative transcript sequences of collapsed isoforms
name.collapsed.gff Collapsed transcript alignment Collapsed isoform statistics
name.demult.lima.summary Statistics after lima command
name.flnc.bam Full-length non-concatemer reads
name.flnc.bam.pbi Index file for the full-length non-concatemer reads BAM
name.flnc.filter_summary.json Full-length non-concatemer reads summary
star_index Output directory: outdir/name/star_index
star Output directory: outdir/name/star
filename description STAR results in tab format Log file and summary statistics
sqanti3 Output directory: outdir/name/sqanti3
filename description
name_classification.txt SQANTI transcript classification of isoforms
name_corrected.fasta Transcript sequences after correction using genome sequence
name_corrected.gtf Alignment of corrected sequences
name_junctions.txt File with attribute information at splice junction level (table explaining feature meaning inside output_info).
name_sqanti_report.pdf PDF file showing different quality control and descriptive plots. An example can be found here
name.params.txt SQANTI parameters used
sqanti3-filtered Output directory: outdir/name/sqanti3-filtered
filename description
filtered_name_classification.tsv SQANTI classification filtered based on protein coding, percent polyA downstream, RTS stage
filtered_name_corrected.fasta SQANTI fasta filtered based on protein coding, percent polyA downstream, RTS stage
filtered_name_corrected.gtf SQANTI gtf filtered based on protein coding, percent polyA downstream, RTS stage
name_classification.5degfilter.txt SQANTI classification filtered by the filtered_ criteria and additionally for 5' degradation
name_corrected.5degfilter.fasta SQANTI fasta filtered by the filtered_ criteria and additionally for 5' degradation
name_corrected.5degfilter.gtf SQANTI gtf filtered by the filtered_ criteria and additionally for 5' degradation
pacbio_6frm_gene_grouped Output directory: outdir/name/pacbio_6frm_gene_grouped
filename description
name.6frame.fasta Six-frame translation (3 forward, 3 reverse) of PacBio transcripts
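For readers unfamiliar with six-frame translation, here is a minimal sketch of the idea: translate the transcript in three reading frames on the forward strand and three on the reverse complement. It assumes the standard genetic code and is not the pipeline's own implementation; the function names and the frame labels ("+1" to "-3") are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq: str) -> dict:
    """Translate a sequence in all six frames, keyed "+1".."+3" and "-1".."-3"."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


print(six_frame("ATGGCCTAA")["+1"])  # MA*
```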
transcriptome_summary Output directory: outdir/name/transcriptome_summary
filename description
gene_level_tab.tsv CPM (long-read) and TPM (short-read) info provided on gene level
pb_gene.tsv PacBio to gene mapping
sqanti_isoform_info.tsv simplified SQANTI classification info
cpat Output directory: outdir/name/cpat
filename description
CPAT_run_info.log CPAT logging file
name_cpat.error CPAT error / logging info
name_cpat.output CPAT output info
name.no_ORF.txt list of PacBio isoforms that did not produce a valid ORF ORF file, best defined by CPAT
name.ORF_prob.tsv All ORFs found and scored by CPAT
name.ORF_seqs.fa All ORF nucleotide sequences found by CPAT
name.r R code run by CPAT to produce ORFs
orf_calling Output directory: outdir/name/orf_calling
filename description
name_best_orf.tsv Best ORF for each PacBio accession, as determined by algorithm
refined_database Output directory: outdir/name/refined_database
filename description
name_orf_refined.fasta protein sequence of ORFs after collapsing based on transcripts producing same protein sequence
name_orf_refined.tsv ORF info, transcripts producing same protein collapsed
pacbio_cds Output directory: outdir/name/pacbio_cds
filename description
name_no_transcript_with_cds.gtf PacBio gtf with CDS info added, transcript line not included
name_with_cds.gtf PacBio gtf with CDS info added
make_pacbio_cds.log logging file
rename_cds Output directory: outdir/name/rename_cds
filename description
gencode.cds_renamed_exon.gtf GENCODE gtf file, exons removed and CDS renamed to exon
gencode.transcript_exons_only.gtf GENCODE gtf file, exons and transcript only
name.cds_renamed_exon.gtf PacBio gtf file, exons removed and CDS renamed to exon
name.transcript_exons_only.gtf PacBio gtf file, exons and transcript only
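The "exons removed and CDS renamed to exon" transformation described above can be sketched as follows. This is an illustration only, not the pipeline's script: the function name `cds_renamed_exon` is hypothetical, and keeping all other feature types (such as transcript lines) unchanged is an assumption.

```python
def cds_renamed_exon(gtf_lines):
    """Drop exon records and relabel CDS records as exon.

    Assumption: comment lines and all other feature types are kept as-is.
    """
    out = []
    for line in gtf_lines:
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        feature = fields[2]  # third GTF column holds the feature type
        if feature == "exon":
            continue  # original exon records are removed
        if feature == "CDS":
            fields[2] = "exon"  # CDS records take over the exon label
        out.append("\t".join(fields) + "\n")
    return out


# Made-up records for illustration.
example = [
    'chr1\tPacBio\ttranscript\t1\t900\t.\t+\t.\tgene_id "G";\n',
    'chr1\tPacBio\texon\t1\t900\t.\t+\t.\tgene_id "G";\n',
    'chr1\tPacBio\tCDS\t100\t400\t.\t+\t0\tgene_id "G";\n',
]
for record in cds_renamed_exon(example):
    print(record, end="")
```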
sqanti_protein Output directory: outdir/name/sqanti_protein
filename description
name_sqanti_protein_classification.tsv splice classification data for proteins generated by PacBio
protein_classification Output directory: outdir/name/protein_classification
filename description
name_genes.tsv Mapping of PacBio accession to transcript gene and protein gene. These can differ if the transcript read spans multiple genes
name_unfiltered.protien_classification.tsv protein classification of all PacBio proteins
protein_gene_rename Output directory: outdir/name/protein_gene_rename
filename description
name_orf_refined_gene_update.tsv refined database with gene name updated to reflect protein gene
name_with_cds_refined.gtf PacBio gtf file that includes CDS information with gene name updated to reflect protein gene
name_protein_refined.fasta protein fasta with gene name updated to reflect protein gene
protein_filter Output directory: outdir/name/protein_filter
filename description
name_with_cds_filtered.gtf GTF, filtered to remove intergenic and truncations
name_classification_filtered.tsv Protein classification, filtered to remove intergenic and truncations
name.filtered_protein.fasta protein sequences, filtered to remove intergenic and truncations
hybrid_protein_database Output directory: outdir/name/hybrid_protein_database

High confidence: at least 3 CPM per gene and an average gene nucleotide length of 1-4 kb

filename description
name_cds_high_confidence.gtf GTF of high confidence genes
name_high_confidence_genes.tsv list of high confidence genes
name_hybrid.fasta sequence information of high confidence PacBio and GENCODE genes
name_refined_high_confidence.tsv high confidence ORF metadata
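The high-confidence criteria stated for this directory (3+ CPM per gene, 1-4 kb average gene length) can be expressed as a simple predicate. This is a sketch under assumptions: whether the bounds are inclusive is not stated on this page, and the function name, field names, and example genes below are made up for illustration.

```python
def is_high_confidence(cpm, avg_len_nt, min_cpm=3.0, min_len_nt=1000, max_len_nt=4000):
    """Apply the thresholds from this page; inclusive bounds are an assumption."""
    return cpm >= min_cpm and min_len_nt <= avg_len_nt <= max_len_nt


# Illustrative records only, not real pipeline output.
genes = [
    {"gene": "GENE_A", "cpm": 12.5, "avg_len_nt": 2300},
    {"gene": "GENE_B", "cpm": 1.2, "avg_len_nt": 2300},   # too few CPM
    {"gene": "GENE_C", "cpm": 40.0, "avg_len_nt": 6100},  # too long
]
high_conf = [g["gene"] for g in genes if is_high_confidence(g["cpm"], g["avg_len_nt"])]
print(high_conf)  # ['GENE_A']
```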

Database Information

database directory database name in files
GENCODE gencode Gencode
UniProt uniprot UniProt
PacBio Filtered pacbio/filtered filtered
PacBio Refined pacbio/refined refined
PacBio Hybrid pacbio/hybrid hybrid
PacBio Rescue & Resolve pacbio/resue_resolve rescue_resolve

toml files

directory database/toml

filename description
CalibrationTask.toml not used
GlycoSearchTask.toml not used
GptmdTask.toml not used
SearchTask.toml Metamorpheus run parameters
XLSearchTask.toml not used

Search Results Files In search_results/Task1SearchTask

filename description
AllPSMs.psmtsv PSMs found
AllQuantifiedPeaks.tsv quantified peaks found
prose.txt Run information PSM's found in Percolator format
AllQuantifiedPeptides.tsv quantified peptides
results.txt summary statistics
AllPeptides.database.psmtsv peptides found
AllQuantifiedProteinGroups.database.tsv protein groups found
peptide_analysis Output directory: outdir/name/peptide_analysis
filename description
gc_pb_overlap_peptides.tsv overlap of GENCODE peptides with theoretical peptides that could be found in PacBio databases


Output directory: outdir/name/track_visualization/reference

filename description
gencode_shaded.bed12 GENCODE bed alignment colored
gencode.filtered.gtf GENCODE alignment

pacbio databases

Output directory: outdir/name/track_visualization/database

database database name
PacBio Refined refined
PacBio Filtered filtered
PacBio Hybrid hybrid


filename description
name_database_peptides.bed12 peptide bed alignment
name_database_peptides.gtf peptide gtf alignment
name_database_shaded_peptides.bed12 peptide bed alignment, shaded green


filename description
name_hybrid_shaded_cpm.bed12 protein alignment, shaded by transcript abundance (CPM)
name_hybrid_shaded_protein_class.bed12 protein alignment, shaded by protein classification
accession_mapping Output directory: outdir/name/accession_mapping
filename description
accession_map_gencode_uniprot_pacbio.tsv accession mapping between GENCODE, UniProt and PacBio
accession_map_stats.tsv frequency of overlap between databases
protein_group_compare Output directory: outdir/name/protein_group_compare
filename description
ProteinInference_GENCODE_PacBio_comparisons.xlsx protein inference overlap between GENCODE and PacBio Hybrid
ProteinInference_UniProt_PacBio_comparisons.xlsx protein inference overlap between UniProt and PacBio Hybrid
ProteinInference_GENCODE_UniProt_comparisons.xlsx protein inference overlap between GENCODE and UniProt
novel_peptides Output directory: outdir/name/novel_peptides
filename description
name_database.pacbio_novel_peptides_to_gencode.tsv novel peptides found in PacBio compared to GENCODE database
name_database.pacbio_novel_peptides_to_uniprot.tsv novel peptides found in PacBio compared to UniProt database
name_database.pacbio_novel_peptides.tsv novel peptides found in PacBio compared to GENCODE and UniProt databases
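Conceptually, the three files above correspond to set differences between the peptides identified with a PacBio database and those found in the reference databases. The sketch below illustrates that idea only; it assumes that "novel" means the exact peptide sequence is absent from the reference set, which this page does not state, and the function name, dictionary keys, and peptide sequences are hypothetical.

```python
def novel_peptides(pacbio, gencode, uniprot):
    """Return PacBio peptides missing from GENCODE, from UniProt, and from both.

    Assumption: novelty is decided by exact sequence match.
    """
    pacbio, gencode, uniprot = set(pacbio), set(gencode), set(uniprot)
    return {
        "to_gencode": pacbio - gencode,
        "to_uniprot": pacbio - uniprot,
        "to_gencode_and_uniprot": pacbio - gencode - uniprot,
    }


# Made-up peptide sequences for illustration.
result = novel_peptides(
    pacbio={"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"},
    gencode={"PEPTIDEA"},
    uniprot={"PEPTIDEA", "PEPTIDEB"},
)
print(sorted(result["to_gencode"]))              # ['PEPTIDEB', 'PEPTIDEC']
print(sorted(result["to_gencode_and_uniprot"]))  # ['PEPTIDEC']
```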