Enhancements in vfdb_parser.py for VFDB full dataset support #320

Changing the `<vfdb_id>` group fixes VFDB gene IDs that come without a GenBank accession. Changing the `<name>` group fixes gene names that contain whitespace characters or brackets (e.g. `cryIA(a)`). Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-14: `VFdbParser._fa_header_to_name_pieces` did not return any None values.
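As a rough illustration of the two cases above, here is a minimal sketch of a header regex with `<vfdb_id>`, `<name>` and `<description>` groups. The pattern, the function name `parse_header` and the example headers are assumptions made for illustration; they are not the code in vfdb_parser.py.

```python
import re

# Hypothetical pattern, NOT the regex used in vfdb_parser.py.
# <vfdb_id>: "VFG" plus digits, with an optional "(gb|ACCESSION)" suffix.
# <name>: lazily matched text inside the outer brackets, so it may contain
#         whitespace or nested brackets such as "cryIA(a)".
HEADER_RE = re.compile(
    r"^(?P<vfdb_id>VFG\d+(?:\(gb\|[^)]+\))?)\s+"
    r"\((?P<name>.+?)\)\s+"
    r"(?P<description>.+)$"
)


def parse_header(header):
    """Return (vfdb_id, name, description), or None if the header does not match."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return match.group("vfdb_id"), match.group("name"), match.group("description")


# Made-up headers covering the two cases named in the commit message.
with_accession = "VFG000001(gb|AAA00001) (cryIA(a)) toxin [Toxin (VF0001)] [Genus species]"
without_accession = "VFG000002 (abc def) some protein [X (VF0002)] [Genus species]"
```

The lazy `.+?` in `<name>` stops at the first closing bracket that is followed by whitespace, which is what lets a name like `cryIA(a)` survive intact.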

Newly implemented functionality: extracts VFIDs from the `<description>` part of each seq.id in the VFDB .fa file, downloads the VFs.xls.gz file from VFDB, and links the two by VFID to create a metadata file that includes more information. Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-17. Manual changes to the resulting .fa file are still needed, because VFDB contains duplicates.
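The linking step can be sketched roughly as follows. This is only an illustration: it assumes the VFID appears in the description as `(VF` plus digits `)`, and that the contents of VFs.xls.gz have already been loaded into a dict keyed by VFID. The names `extract_vfid`, `link_metadata` and `vf_table` are hypothetical, and the download itself is left out.

```python
import re

# Assumed VFID format: "VF" followed by digits, enclosed in round brackets.
VFID_RE = re.compile(r"\((VF\d+)\)")


def extract_vfid(description):
    """Return the first VFID found in a header description, or None."""
    match = VFID_RE.search(description)
    return match.group(1) if match else None


def link_metadata(descriptions, vf_table):
    """Map each sequence id to the VF table row for its VFID.

    descriptions: dict of sequence id -> header description
    vf_table: dict of VFID -> dict of extra fields (assumed already loaded)
    Sequences with no VFID, or with a VFID missing from the table, get {}.
    """
    linked = {}
    for seq_id, description in descriptions.items():
        vfid = extract_vfid(description)
        linked[seq_id] = vf_table.get(vfid, {}) if vfid else {}
    return linked
```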

Included a list-based filter for duplicate sequence IDs in the downloaded VFDB fasta file. As a consequence, `ariba prepareref` can be run after vfdb_parser without manually deleting duplicate entries. Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-18: `ariba prepareref` no longer raises an error because of duplicate seq.id values (although 1254 sequences are still filtered out because they are not recognized as genes).
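A list-based duplicate filter could look roughly like the sketch below. It assumes records are `(seq_id, sequence)` pairs and that the first occurrence of each ID is the one to keep; neither detail is stated in the commit message, and `filter_duplicate_ids` is a hypothetical name.

```python
def filter_duplicate_ids(records):
    """Keep only the first record for each sequence id.

    records: iterable of (seq_id, sequence) pairs.
    Returns a list of the records that were kept, in input order.
    """
    seen_ids = []   # list-based, mirroring the description above
    unique = []
    for seq_id, sequence in records:
        if seq_id in seen_ids:
            continue  # assumption: later duplicates are dropped
        seen_ids.append(seq_id)
        unique.append((seq_id, sequence))
    return unique
```

Membership tests on a list are linear in its length, so for very large files a `set` of seen IDs would do the same job faster.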

Included a check, using methods from pyfastaq, of whether a sequence can be translated. A sequence that cannot be translated is declared non-coding in the resulting metadata file, so `ariba prepareref` can process the dataset without filtering out such sequences. Included a function that reports the maximum length and advises on the choice of parameters for further processing. Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-19: `ariba prepareref` does not remove any sequence from the dataset when run with the parameters advised by vfdb_parser.
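The idea can be sketched in plain Python without pyfastaq. The criterion used here (length divisible by three, a start codon first, a stop codon last, no internal stop codon) is an assumption and may differ from what the pyfastaq-based check in this PR does. The names `looks_translatable` and `max_sequence_length` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of accepted start codons


def looks_translatable(sequence):
    """Return True if the sequence looks like a complete coding sequence.

    Assumed criterion: whole number of codons, start codon first,
    stop codon last, and no stop codon in between.
    """
    sequence = sequence.upper()
    if len(sequence) < 6 or len(sequence) % 3 != 0:
        return False
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def max_sequence_length(records):
    """Return the length of the longest sequence, or 0 if there are none."""
    return max((len(sequence) for _, sequence in records), default=0)
```

A sequence that fails the check would be marked as non-coding in the metadata rather than dropped, and the reported maximum length can guide the length-related parameters in later steps.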

Commits on Jan 14, 2022

Commits on Jan 17, 2022

Commits on Jan 18, 2022

Commits on Jan 19, 2022
