Sub-workflow for gene statistics #121

BethYates · 2024-06-20T12:57:52Z

We want to include some standard basic statistics on the gene/protein annotation set for an assembly in a genome note. This sub-workflow should accept an annotation set and calculate some statistics. The exact values are still to be determined, but will most likely include the number of protein-coding genes, the number of non-coding genes and exons per transcript, as well as BUSCO scores.

This could be a standalone pipeline or could be added to either the genomenote pipeline or to the ensemblgenedownload pipeline, although it may not always be Ensembl that provides the annotations.

Exon count
Exon length
Intron length
DNA strand bias
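
To make the four statistics above concrete, here is a minimal Python sketch. It assumes the annotation is available as GFF3 with `exon` features that carry a `Parent` attribute. The function name `exon_intron_stats`, the returned keys, and the definition of strand bias as the fraction of exon features on the `+` strand are illustrative assumptions, not something decided in this issue.

```python
from collections import defaultdict
from statistics import mean


def exon_intron_stats(gff3_lines):
    """Compute simple exon/intron statistics from GFF3 lines (illustrative only)."""
    exons = defaultdict(list)  # transcript ID -> list of (start, end)
    strands = []               # strand of every exon feature seen
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        start, end, strand = int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # An exon shared by several transcripts lists them comma-separated in Parent,
        # so it is counted once per parent transcript here.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((start, end))
        strands.append(strand)

    exon_lengths, intron_lengths = [], []
    for spans in exons.values():
        spans.sort()
        # GFF3 coordinates are 1-based and inclusive
        exon_lengths += [e - s + 1 for s, e in spans]
        # An intron is the gap between two consecutive exons of one transcript
        intron_lengths += [nxt[0] - cur[1] - 1 for cur, nxt in zip(spans, spans[1:])]

    return {
        "exon_count": len(exon_lengths),
        "mean_exon_length": mean(exon_lengths) if exon_lengths else 0,
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0,
        "plus_strand_fraction": strands.count("+") / len(strands) if strands else 0,
    }
```

In practice an existing tool would probably compute these values; the sketch only shows what each quantity means.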

BethYates · 2024-06-24T10:47:16Z

After discussing this with the ToL Genome Notes editor, they would like to see OMArk (https://omark.omabrowser.org/) included in the set of tools used to evaluate an annotation set.

BethYates · 2024-06-26T12:43:15Z

To close this issue:

Add a new optional parameter --annotation_set that takes the path to a directory containing Ensembl gene annotation. This directory should contain files consistent with the output directory produced by running the sanger-tol/ensemblgenedownload pipeline.
In the main workflow or sub-workflow, parse and validate the parameter value and pass the files from that path to a new sub-workflow, annotation_stats.
Define the sub-workflow annotation_stats and configure it to run only if the --annotation_set parameter is passed.
Create the sub-workflow annotation_stats. It should read in the relevant annotation files and process the data to generate the following statistics:

- TRANSC_MRNA: the number of transcribed mRNAs 
- PCG: the number of protein coding genes
- NCG: the number of non-coding genes
- CDS_PER_GENE: the average number of coding transcripts per gene
- EXONS_PER_TRANSC: the average number of exons per transcript
- CDS_LENGTH: the average length of coding sequence
- EXON_SIZE: the average length of a coding exon
- INTRON_SIZE: the average length of a coding intron

Output these statistics to a CSV file named after the assembly ID (assemblyID.csv) in a directory called annotation_statistics, created in the results directory. The CSV file should have the following columns:
Variable,Value
where Variable is the name of the variable from the list above and Value is the statistic you have generated.
Run tests to make sure the output file is produced.
Update the documentation.
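
As a rough illustration of the output format described above, here is a minimal Python sketch. The function name `write_stats_csv`, its arguments, and the choice to leave missing values blank are assumptions made for this example. Only the column header, the variable names, the file name and the directory name come from this issue.

```python
import csv
from pathlib import Path

# Variable names as listed in this issue, in the order they should appear
VARIABLES = [
    "TRANSC_MRNA", "PCG", "NCG", "CDS_PER_GENE",
    "EXONS_PER_TRANSC", "CDS_LENGTH", "EXON_SIZE", "INTRON_SIZE",
]


def write_stats_csv(stats, assembly_id, results_dir):
    """Write <results_dir>/annotation_statistics/<assembly_id>.csv (hypothetical helper)."""
    out_dir = Path(results_dir) / "annotation_statistics"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{assembly_id}.csv"
    with open(out_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Variable", "Value"])
        for name in VARIABLES:
            # Assumption: a statistic that could not be computed is left blank
            writer.writerow([name, stats.get(name, "")])
    return out_file
```

In a Nextflow pipeline the final location would more likely be set by the publishing configuration than by the script itself, in which case the script would only need to write the file into its working directory.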

BethYates · 2024-08-16T15:22:32Z

The work in #135 goes most of the way to solving this issue, but the sub-workflow currently outputs two files with lots of information rather than one file with just the specific data that we are interested in. The next step is to extend the sub-workflow to parse the files produced by the AGAT_SPSTATISTICS and AGAT_SQSTATBASIC modules.

What I would like to see is:

Creation of a new local module containing a process that calls a Python script (see parse_metadata.nf for an example of a local module that performs a similar function). The input to this module should be the output from the two AGAT modules.
Addition of a Python script to /bin that reads through the input files and parses out the following information:

- TRANSC_MRNA: the number of transcribed mRNAs 
- PCG: the number of protein coding genes
- NCG: the number of non-coding genes
- CDS_PER_GENE: the average number of coding transcripts per gene
- EXONS_PER_TRANSC: the average number of exons per transcript
- CDS_LENGTH: the average length of coding sequence
- EXON_SIZE: the average length of a coding exon
- INTRON_SIZE: the average length of a coding intron

If you're not sure what any of these terms mean let me know!

The parsed information should be written out to a CSV file named after the assembly ID (assemblyID.csv) in a directory called annotation_statistics, created in the results directory. The CSV file should have the following columns:
Variable,Value
where Variable is the name of the variable from the list above and Value is the statistic you have generated.
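
One possible shape for the parsing step is sketched below in Python. It assumes the AGAT reports contain lines of the form "label, run of spaces, number". The label strings in `LABEL_TO_VARIABLE` are placeholders that must be checked against real AGAT_SPSTATISTICS and AGAT_SQSTATBASIC output, and `parse_report` is a hypothetical name.

```python
import re

# ASSUMPTION: example labels only; replace with the exact labels found in the
# real AGAT reports, and extend the mapping to cover all eight variables.
LABEL_TO_VARIABLE = {
    "number of mrna": "TRANSC_MRNA",
    "mean exons per mrna": "EXONS_PER_TRANSC",
    "mean cds length (bp)": "CDS_LENGTH",
}

# A label, at least two spaces, then a single integer or decimal value
LINE_RE = re.compile(r"^(?P<label>\S.*?)\s{2,}(?P<value>-?\d+(?:\.\d+)?)\s*$")


def parse_report(text, label_map=LABEL_TO_VARIABLE):
    """Extract the mapped statistics from a label/value text report."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # headers, separators and free text are ignored
        variable = label_map.get(match.group("label").strip().lower())
        # Keep the first occurrence in case a label is repeated in a later section
        if variable and variable not in stats:
            value = match.group("value")
            stats[variable] = float(value) if "." in value else int(value)
    return stats
```

The returned dictionary maps variable names to values, so it can be written out directly as Variable,Value rows.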

muffato added the feature label (Requests for new features) Jun 20, 2024

SandraBabirye self-assigned this Jul 3, 2024

SandraBabirye mentioned this issue Aug 16, 2024

Added new subworkflow 'ANNOTATION_STATS' and new parameter --annotation_set to solve issue #121 #135

Merged

BethYates added a commit that referenced this issue Aug 16, 2024

Merge pull request #135 from sanger-tol/annotation_statistics

7e201c7

Added new subworkflow 'ANNOTATION_STATS' and new parameter --annotation_set to solve issue #121

tkchafin linked a pull request Aug 29, 2024 that will close this issue

Added new local module to extract only relevant annotation statistics #136

Merged


SandraBabirye mentioned this issue Sep 3, 2024

Added new local module to extract only relevant annotation statistics #136

Merged

