Enhancements in vfdb_parser.py for VFDB full dataset support #320

Open
wants to merge 4 commits into
base: master

Commits on Jan 14, 2022

  1. Update name_regex in vfdb_parser.py

    The change to the <vfdb_id> group fixes VFDB gene IDs that have no GenBank accession attached.
    The change to the <name> group fixes gene names that contain whitespace or brackets (e.g. `cryIA(a)`).
    Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-14: VFdbParser._fa_header_to_name_pieces did not return None for any header.
    lknegendorf authored Jan 14, 2022
    0cc26fe
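
The header parsing described in this commit could look roughly like the sketch below. This is not the regex from the PR: the pattern, the `parse_header` helper, and the example headers (including their IDs) are assumptions made up for illustration. It only shows one way to make the accession optional and to let the gene name contain brackets or whitespace.

```python
import re

# Hypothetical pattern, NOT the one committed in the PR.
HEADER_RE = re.compile(
    r'^(?P<vfdb_id>VFG\d+(?:\([^)]*\))?)'  # gene ID, optionally followed by "(gb|...)"
    r' \((?P<name>.+?)\)'                  # gene name; may contain spaces or brackets
    r' (?P<description>.*\S)'              # free-text description
    r' \[(?P<genus_etc>[^\[\]]*)\]$'       # trailing organism in square brackets
)


def parse_header(header):
    """Return the named groups as a dict, or None if the header does not match."""
    match = HEADER_RE.match(header)
    return match.groupdict() if match else None


# Invented example headers (without the leading '>').
with_accession = parse_header(
    'VFG000001(gb|AAA00000) (cryIA(a)) example toxin [Example VF (VF0001)] [Genus species]'
)
without_accession = parse_header(
    'VFG000002 (gene name) example protein [Example VF (VF0002)] [Genus species]'
)
```

The lazy `name` group stops at the first `)` that is followed by a space, so a name that itself contains `) ` would be cut short; a real parser would need to decide how to handle that case.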

Commits on Jan 17, 2022

  1. Implement support for VFDB VFs.xls file

    New functions extract the VFIDs from the <description> part of each seq.id in the VFDB .fa file, download the VFs.xls.gz file from VFDB, and link the two by VFID to create a metadata file that includes more information.
    Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-17.
    The resulting .fa file still needs manual changes because VFDB contains duplicates.
    lknegendorf authored Jan 17, 2022
    8140047
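
As a rough illustration of the VFID linking described in this commit, the sketch below pulls a VFID out of a description string and looks it up in a table. It assumes the VFID appears in parentheses as `(VF` followed by digits; the function names, the `vf_table` dictionary, and its keys are hypothetical and do not reflect the actual columns of VFs.xls or the PR's code.

```python
import re

# Assumed VFID format: "(VF" + digits + ")". A category ID such as
# "(VFC0001)" is not matched because "C" is not a digit.
VFID_RE = re.compile(r'\((VF\d+)\)')


def extract_vfid(description):
    """Return the first VFID found in the description, or None."""
    match = VFID_RE.search(description)
    return match.group(1) if match else None


def link_metadata(descriptions, vf_table):
    """Map each sequence name to the table row for its VFID ({} if none is found)."""
    linked = {}
    for seq_name, description in descriptions.items():
        vfid = extract_vfid(description)
        linked[seq_name] = vf_table.get(vfid, {}) if vfid else {}
    return linked


# Invented data; a real table would be read from the downloaded VFs file.
vf_table = {"VF0001": {"vf_name": "Example factor"}}
descriptions = {
    "gene_a": "example toxin [Example factor (VF0001) - Example category (VFC0001)]",
    "gene_b": "protein with no VFID in its description",
}
linked = link_metadata(descriptions, vf_table)
```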

Commits on Jan 18, 2022

  1. Remove duplicate references with vfdb_parser

    Added a list-based filter for duplicate sequence IDs in the downloaded VFDB FASTA file. As a consequence, `ariba prepareref` can be run after vfdb_parser without manually deleting duplicate entries.
    Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-18. `ariba prepareref` does not raise an error about duplicate seq.id values, although 1254 sequences are still filtered out because they are not recognized as genes.
    lknegendorf authored Jan 18, 2022
    42b856f
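
A duplicate-ID filter of the kind described in this commit can be sketched as follows. The commit mentions a list-based filter; this sketch swaps in a set for the membership check, which behaves the same way but avoids scanning a growing list for every record. The function name and the `(seq_id, sequence)` tuple layout are assumptions for illustration.

```python
def drop_duplicate_ids(records):
    """Keep only the first record for each sequence ID, preserving input order."""
    seen = set()
    kept = []
    for seq_id, sequence in records:
        if seq_id in seen:
            continue  # a record with this ID was already kept
        seen.add(seq_id)
        kept.append((seq_id, sequence))
    return kept


# Invented records; "id1" appears twice.
records = [("id1", "ATGAAATAA"), ("id2", "ATGCCCTGA"), ("id1", "ATGGGGTAG")]
unique_records = drop_duplicate_ids(records)
```

Keeping the first occurrence is a design choice made here; whether the first or a later entry is the right one to keep depends on the data.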

Commits on Jan 19, 2022

  1. Add support for noncoding seqs in vfdb_parser

    Added a check, using methods from pyfastaq, of whether a sequence can be translated. A sequence that cannot be translated is declared non-coding in the resulting metadata file, so `ariba prepareref` can process it without filtering it out.
    Added a function that reports the maximum length as advice for the choice of parameters in further processing.
    Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-19. `ariba prepareref` does not remove any sequence from the dataset when run with the parameters advised by vfdb_parser.
    lknegendorf authored Jan 19, 2022
    b9b48d0
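
The coding/non-coding decision and the length report from this commit can be illustrated with the pure-Python sketch below. It is a stand-in, not the pyfastaq-based check used in the PR: it only tests the forward strand in frame 1, and the codon sets, function names, and sample sequences are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of accepted start codons


def looks_coding(sequence):
    """Rough test: whole codons, a start codon, one stop codon at the very end."""
    sequence = sequence.upper()
    if len(sequence) < 6 or len(sequence) % 3 != 0:
        return False
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    # An internal stop codon means the sequence cannot be translated end to end.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def max_sequence_length(sequences):
    """Return the length of the longest sequence, or 0 for an empty input."""
    return max((len(seq) for seq in sequences.values()), default=0)


# Invented sequences.
sequences = {
    "coding": "ATGAAATAA",
    "internal_stop": "ATGTAAAAATAA",
    "noncoding": "ACGTACG",
}
coding_flags = {name: looks_coding(seq) for name, seq in sequences.items()}
longest = max_sequence_length(sequences)
```

A stricter or more permissive check (other frames, the reverse strand, alternative genetic codes) would change which sequences end up flagged as non-coding.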