fix the default option value of --field-accession-re, #65
shenwei356 committed Sep 19, 2022
1 parent fda834a commit a93923a
Showing 3 changed files with 21 additions and 16 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,9 @@
- do not panic for invalid TaxIds, e.g., the column name, when using `-I--taxid-field`.
- `taxonkit create-taxdump`:
- fix merged.dmp and delnodes.dmp. Thanks to @apcamargo ! [gtdb-taxdump/issues/2](https://github.com/shenwei356/gtdb-taxdump/issues/2).
- - fix bug of handling non-GTDB data when using `-A/--field-accession` and no rank names given.
+ - fix bug of handling non-GTDB data when using `-A/--field-accession` and no rank names given:
+ the colname of the accession column would be treated as one of the ranks, which messed up all the ranks.
+ - fix the default option value of `--field-accession-re` which wrongly remove prefix like `Sp_`. [#65](https://github.com/shenwei356/taxonkit/issues/65)
- `taxonkit list`:
- fix warning message of merged taxids.
- [TaxonKit v0.12.0](https://github.com/shenwei356/taxonkit/releases/tag/v0.12.0)
24 changes: 11 additions & 13 deletions doc/docs/usage.md
@@ -1533,20 +1533,19 @@ Examples:
| sed 1d \
> mgv.tsv

- $ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 5 -R order,family,genus,species,contig
- 16:45:40.555 [WARN] --field-accession-re failed to extract genome accession, the origninal value is used instead. e.g., MGV-GENOME-0231225
- 16:45:40.817 [INFO] 189680 records saved to mgv/taxid.map
- 16:45:40.846 [INFO] 54224 records saved to mgv/nodes.dmp
- 16:45:40.864 [INFO] 54224 records saved to mgv/names.dmp
- 16:45:40.864 [INFO] 0 records saved to mgv/merged.dmp
- 16:45:40.864 [INFO] 0 records saved to mgv/delnodes.dmp
+ $ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 5 -R order,family,genus,species
+ 23:33:18.098 [INFO] 189680 records saved to mgv/taxid.map
+ 23:33:18.131 [INFO] 58102 records saved to mgv/nodes.dmp
+ 23:33:18.150 [INFO] 58102 records saved to mgv/names.dmp
+ 23:33:18.150 [INFO] 0 records saved to mgv/merged.dmp
+ 23:33:18.150 [INFO] 0 records saved to mgv/delnodes.dmp
$ head -n 5 mgv/taxid.map
- MGV-GENOME-0364295 3108345579
- MGV-GENOME-0364296 2356405276
- MGV-GENOME-0364303 1099424244
- MGV-GENOME-0364311 4037644503
- MGV-GENOME-0364312 452745976
+ MGV-GENOME-0364295 677052301
+ MGV-GENOME-0364296 677052301
+ MGV-GENOME-0364303 1414406025
+ MGV-GENOME-0364311 1849074420
+ MGV-GENOME-0364312 2074846424
$ echo 677052301 | taxonkit lineage --data-dir mgv/
677052301 Caudovirales;crAss-phage;OTU-61123
@@ -1579,7 +1578,6 @@ Examples:
# the first column as accession
$ taxonkit create-taxdump -A 1 example/taxonomy.tsv -O example/taxdump
16:31:31.828 [INFO] I will use the first row of input as rank names
- 16:31:31.841 [WARN] --field-accession-re failed to extract genome accession, the origninal value is used instead. e.g., GCF_000742135.1
16:31:31.843 [INFO] 13 records saved to example/taxdump/taxid.map
16:31:31.843 [INFO] 39 records saved to example/taxdump/nodes.dmp
16:31:31.843 [INFO] 39 records saved to example/taxdump/names.dmp
9 changes: 7 additions & 2 deletions taxonkit/cmd/create-taxdump.go
@@ -105,8 +105,14 @@ Attentions:

var err error

+ isGTDB := getFlagBool(cmd, "gtdb")
+
reGenomeIDStr := getFlagString(cmd, "field-accession-re")

+ if isGTDB && !cmd.Flags().Lookup("field-accession-re").Changed {
+ reGenomeIDStr = `^\w\w_(.+)$`
+ }
+
var reGenomeID *regexp.Regexp
if reGenomeIDStr != "" {
if !regexp.MustCompile(`\(.+\)`).MatchString(reGenomeIDStr) {
@@ -119,7 +125,6 @@ Attentions:
}
}

- isGTDB := getFlagBool(cmd, "gtdb")
reGTDBStr := getFlagString(cmd, "gtdb-re-subs")

var reGTDBsubspe *regexp.Regexp
@@ -906,7 +911,7 @@ func init() {
RootCmd.AddCommand(createTaxDumpCmd)

createTaxDumpCmd.Flags().IntP("field-accession", "A", 0, "field index of assembly accession (genome ID), for outputting taxid.map")
- createTaxDumpCmd.Flags().StringP("field-accession-re", "", `^\w\w_(.+)$`, `regular expression to extract assembly accession`)
+ createTaxDumpCmd.Flags().StringP("field-accession-re", "", `^(.+)$`, `regular expression to extract assembly accession`)
createTaxDumpCmd.Flags().BoolP("field-accession-as-subspecies", "S", false, "treate the accession as subspecies rank")
// -------------------------------------------------------------------
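
Note on the change: the old default `^\w\w_(.+)$` strips any two-character prefix followed by an underscore, which suits GTDB-style accessions but also eats prefixes like `Sp_` in non-GTDB data. The following is a minimal standalone sketch of that behaviour, not the code in `create-taxdump.go`. The helper names `chooseAccessionRe` and `extractAccession` and the sample values `RS_GCF_000742135.1` and `Sp_001` are assumptions made for illustration; only the two patterns and `MGV-GENOME-0231225` come from the diff.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the diff.
const (
	gtdbAccessionRe    = `^\w\w_(.+)$` // old default; after this commit only used with --gtdb
	defaultAccessionRe = `^(.+)$`      // new default: keep the whole value
)

// chooseAccessionRe is a hypothetical helper mirroring the patched logic:
// the GTDB pattern is used only when --gtdb is set and the user did not
// pass --field-accession-re explicitly.
func chooseAccessionRe(isGTDB, userChanged bool, userValue string) string {
	if isGTDB && !userChanged {
		return gtdbAccessionRe
	}
	return userValue
}

// extractAccession (hypothetical) returns the first capture group, or the
// original value when the pattern does not match, which is the fallback
// described by the warning message in the usage examples.
func extractAccession(pattern, value string) string {
	m := regexp.MustCompile(pattern).FindStringSubmatch(value)
	if m == nil {
		return value
	}
	return m[1]
}

func main() {
	// Illustrative inputs; the first two are assumed sample values.
	samples := []string{"RS_GCF_000742135.1", "Sp_001", "MGV-GENOME-0231225"}
	for _, v := range samples {
		fmt.Printf("%-20s old=%-18s new=%s\n", v,
			extractAccession(gtdbAccessionRe, v),
			extractAccession(defaultAccessionRe, v))
	}
}
```

Running the sketch prints each sample next to what the old and new defaults would extract from it, which makes the `Sp_` truncation easy to see.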
