[WIP] feat(ref): return NCBI assembly reports via gget ref (#179) #237

Open
Elarwei001 wants to merge 6 commits into scverse:dev from Elarwei001:feature/ref-ncbi-assembly-report-179
Elarwei001 (Contributor) commented Jun 24, 2026

Resolves #179

What this adds

gget ref --assembly_report (Python: assembly_report=True) fetches the NCBI assembly report for an assembly and returns it as a pandas DataFrame (or a list of dicts with json=True). The report maps each sequence/chromosome name across the Ensembl/short, GenBank, RefSeq, and UCSC conventions (e.g. 1, CM000663.2, NC_000001.11, chr1), which is handy for translating chromosome names between databases. A standalone gget.assembly_report() is also exposed.
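
To illustrate the kind of name translation the report enables, here is a minimal stdlib-only sketch. It is not the PR's implementation: the helper names parse_assembly_report and build_name_map are hypothetical, the sample text is a trimmed, made-up excerpt with only four columns, and the column names and the "na" placeholder for missing values reflect my understanding of NCBI's report layout.

```python
# Trimmed, illustrative excerpt: '#' comment lines first, the last of which
# holds the column names, followed by tab-delimited data rows.
SAMPLE_REPORT = (
    "# (metadata lines omitted)\n"
    "# Sequence-Name\tGenBank-Accn\tRefSeq-Accn\tUCSC-style-name\n"
    "1\tCM000663.2\tNC_000001.11\tchr1\n"
)


def parse_assembly_report(text):
    """Return one dict per data row, keyed by the report's column names."""
    header = None
    rows = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Keep overwriting: the last '#' line before the data is the header.
            header = line.lstrip("# ").split("\t")
            continue
        if not line.strip():
            continue
        if header is None:
            raise ValueError("no header line found in assembly report")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def build_name_map(rows, source, target):
    """Map names in one convention to another, skipping rows marked 'na'."""
    return {row[source]: row[target] for row in rows if row[source] != "na"}


rows = parse_assembly_report(SAMPLE_REPORT)
ucsc_to_refseq = build_name_map(rows, "UCSC-style-name", "RefSeq-Accn")
print(ucsc_to_refseq["chr1"])  # NC_000001.11
```

With the DataFrame returned by the feature itself, the same lookup would presumably be a column-to-column mapping rather than a hand-rolled parser.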

gget ref GCF_000001405.40 --assembly_report
  • Version-optional accession: pass GCF_000001405 (no version) and the latest version is resolved automatically.
  • Look up by organism name: --taxon (Python: taxon=True) interprets the input as an organism name (e.g. "homo sapiens") and resolves it to that taxon's NCBI reference assembly via the bundled datasets CLI, so you don't need to know the accession.
  • Browse a taxon's assemblies: --list_assemblies (Python: list_assemblies=True) lists all NCBI assemblies for an organism name (accession, name, category, level; reference/representative first), so you can pick a specific non-reference assembly to fetch.
  • CLI format follows gget's convention: JSON by default, --csv for CSV.
gget ref "homo sapiens" --assembly_report --taxon            # reference assembly's report
gget ref "homo sapiens" --assembly_report --list_assemblies  # list all assemblies to choose from
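
The "JSON by default, --csv for CSV" convention from the list above can be sketched with argparse's store_false action. This is only an illustration under assumptions: the parser name, the dest="json" choice, and the build_parser helper are hypothetical and may not match how gget wires its flags.

```python
import argparse


def build_parser():
    """Hypothetical parser showing a --csv flag that switches off JSON output."""
    parser = argparse.ArgumentParser(prog="ref-sketch")
    parser.add_argument("accession")
    parser.add_argument("--assembly_report", action="store_true")
    # store_false: the destination defaults to True (JSON) and --csv flips it.
    parser.add_argument(
        "--csv",
        dest="json",
        action="store_false",
        help="return CSV instead of JSON",
    )
    return parser


default_args = build_parser().parse_args(["GCF_000001405.40", "--assembly_report"])
print(default_args.json)  # True

csv_args = build_parser().parse_args(["GCF_000001405.40", "--assembly_report", "--csv"])
print(csv_args.json)  # False
```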

Name resolution (--taxon / --list_assemblies) is opt-in and imported lazily, so accession callers are unaffected and make no extra call.

How it works

Builds the NCBI FTP path by sharding the accession digits (GCA/981/441/385/…), lists the parent directory to find the exact assembly folder (matched precisely so .4 can't collide with .40), then parses the tab-delimited *_assembly_report.txt. Organism-name modes shell out to the bundled datasets summary genome taxon CLI.
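
The path construction and folder matching described above might look roughly like the following sketch. The base URL, the helper names parent_dir_url and pick_folder, and the folder names in the example are assumptions for illustration; "latest" is interpreted here as the highest numeric version, which the description does not spell out.

```python
import re

# Assumed base URL for NCBI's genomes tree; the description only says "FTP path".
NCBI_GENOMES_BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"

# GCA_/GCF_ prefix, nine digits, optional ".<version>" suffix.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})(?:\.(\d+))?$")


def parent_dir_url(accession):
    """Build the parent directory URL by sharding the nine digits into threes."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, digits = match.group(1), match.group(2)
    shards = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([NCBI_GENOMES_BASE, prefix, *shards]) + "/"


def pick_folder(accession, folder_names):
    """Pick the assembly folder for an accession from a directory listing."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    if match.group(3) is not None:
        # Versioned: the trailing "_" keeps ".4" from matching ".40".
        hits = [name for name in folder_names if name.startswith(accession + "_")]
        if not hits:
            raise ValueError(f"no folder found for {accession}")
        return hits[0]
    # Version-less: collect every "<accession>.<n>_..." folder, take the highest n.
    versioned = re.compile(r"^" + re.escape(accession) + r"\.(\d+)_")
    candidates = []
    for name in folder_names:
        found = versioned.match(name)
        if found:
            candidates.append((int(found.group(1)), name))
    if not candidates:
        raise ValueError(f"no folder found for {accession}")
    return max(candidates)[1]


folders = ["GCF_000001405.4_example", "GCF_000001405.40_example"]  # made-up names
print(parent_dir_url("GCF_000001405.40"))
# https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
print(pick_folder("GCF_000001405.4", folders))  # GCF_000001405.4_example
print(pick_folder("GCF_000001405", folders))    # GCF_000001405.40_example
```

Comparing versions as integers rather than strings matters in the version-less case, since a string comparison would rank ".9" above ".40".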

Testing

  • Network-free unit tests (tests/test_ref.py): report parsing, json/save paths, the five error branches, version-prefix guard, version-less latest resolution, taxon resolution and assembly listing (parse / order / not-found / datasets-failure), and the ref() delegation.
  • Live tests (TestAssemblyReportLive, skip-on-network/transient): SARS-CoV-2 report anchored on stable identity columns, --taxon "homo sapiens" resolving to the human reference, and --list_assemblies listing the human assemblies.
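
As a rough picture of what a network-free test of this kind can look like, the sketch below patches the HTTP call so no request leaves the machine. It is not taken from tests/test_ref.py: the helpers fetch_text and count_sequences are hypothetical, and urllib is used in place of requests only to keep the example stdlib-only.

```python
import unittest
import urllib.request
from unittest import mock


def fetch_text(url):
    """Hypothetical helper: download a URL and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def count_sequences(url):
    """Hypothetical helper: count the non-comment rows of a report."""
    text = fetch_text(url)
    return sum(1 for line in text.splitlines() if line and not line.startswith("#"))


class TestOffline(unittest.TestCase):
    URL = "https://example.invalid/report.txt"  # never contacted: urlopen is patched

    def test_counts_rows_without_network(self):
        fake_body = b"# Sequence-Name\tUCSC-style-name\n1\tchr1\n"
        with mock.patch("urllib.request.urlopen") as fake_urlopen:
            # urlopen() is used as a context manager, so configure __enter__.
            opened = fake_urlopen.return_value.__enter__.return_value
            opened.read.return_value = fake_body
            self.assertEqual(count_sequences(self.URL), 1)
            fake_urlopen.assert_called_once_with(self.URL)

    def test_network_error_is_propagated(self):
        with mock.patch("urllib.request.urlopen", side_effect=OSError("offline")):
            with self.assertRaises(OSError):
                count_sequences(self.URL)
```

Saved to a file, this can be run with python -m unittest; a live counterpart would instead catch the connection error and call self.skipTest(...) so that transient outages do not fail the suite.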

codecov-commenter commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 85.27132% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.31%. Comparing base (e43a804) to head (c2ceff6).
⚠️ Report is 2 commits behind head on dev.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| gget/gget_ref.py | 85.03% | 19 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #237      +/-   ##
==========================================
+ Coverage   56.70%   57.31%   +0.61%     
==========================================
  Files          29       29              
  Lines        9392     9517     +125     
==========================================
+ Hits         5326     5455     +129     
+ Misses       4066     4062       -4     

@Elarwei001 Elarwei001 marked this pull request as draft June 25, 2026 03:44
@lauraluebbert lauraluebbert deleted the branch scverse:dev June 28, 2026 20:31
@lauraluebbert lauraluebbert reopened this Jun 28, 2026
Elarwei001 and others added 3 commits June 28, 2026 19:51
Add an `assembly_report` mode to `gget ref` (Python API and CLI). When
enabled, the positional argument is interpreted as an NCBI assembly
accession (e.g. GCF_000001405.40) and the NCBI assembly report is
fetched and parsed, mapping sequence/chromosome names across the
Ensembl/short, GenBank, RefSeq and UCSC naming conventions.

- New `gget.assembly_report()` helper and `ref(..., assembly_report=True)`.
- CLI: `gget ref <accession> --assembly_report [--csv]`.
- Returns a pandas DataFrame by default, list of dicts / JSON with json=True.
- Tests + fixtures (live NCBI, small SARS-CoV-2 assembly) and docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a network-free TestAssemblyReportOffline class that mocks requests
to cover assembly_report (accession validation, parent-dir and report
HTTP errors, no-folder/missing-header errors, parsing, verbose, and
json/save/CSV branches) and the ref(assembly_report=True) delegation.
All PR-added lines are now covered except the json_package.dump rename
deep in the Ensembl network save path (un-coverable network-free).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lauraluebbert lauraluebbert force-pushed the feature/ref-ncbi-assembly-report-179 branch from a871709 to 26ab068 Compare June 28, 2026 23:55
Elarwei001 added a commit to Elarwei001/gget_issues that referenced this pull request Jul 5, 2026
Bilingual (ZH/EN) analysis of scverse/gget#237 (adds assembly_report to
gget ref: interpret the positional arg as an NCBI assembly accession, return
the cross-naming map chr1<->1<->CM000663.2<->NC_000001.11). Covers platform
fit, the --csv-inverted-vs-gget-norm UX wart, and the de-facto live fixture
test. Updated README index + index.html card. Analysis by subagent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Elarwei001 and others added 2 commits July 5, 2026 22:29
…on-less accession (scverse#179)

Addresses review findings on the assembly_report feature:

- --csv now follows gget's convention: it is store_false (default JSON on
  the CLI, --csv converts to CSV), so a flag named --csv gives CSV like every
  other module, instead of the previous inverted behaviour (--csv -> JSON).
- Directory match uses `startswith(accession + "_")` so e.g. 'GCF_000001405.4'
  no longer prefix-matches the folder for 'GCF_000001405.40'.
- The accession version suffix is now optional; when omitted (e.g.
  'GCF_000001405') the latest available version is resolved. Saved files and
  logs carry the resolved version.
- Move the network-touching fixture out of test_ref.json into an explicit
  TestAssemblyReportLive (skip-on-network / transient; anchored to the frozen
  SARS-CoV-2 report but asserting only stable identity columns). Add offline
  tests for the version-prefix guard and the version-less latest resolution.
- Fix a pre-existing mypy error on df.to_dict(orient="records") via cast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cverse#179)

assembly_report previously required an NCBI assembly accession. Add an explicit
`taxon=True` option (CLI `--taxon`/`-tx`) that interprets the input as an
organism/taxon name (e.g. "homo sapiens") and resolves it to that taxon's NCBI
reference assembly accession before fetching the report, using the datasets CLI
already bundled with gget.

- Resolution is opt-in: accession callers are unaffected and make no extra call.
  gget_virus (for the datasets binary) is imported lazily, only in taxon mode.
- Default resolves to the reference assembly; to fetch a specific non-reference
  assembly, pass its accession directly.
- Tests: offline (taxon path wiring, resolver parsing, not-found -> ValueError,
  datasets failure -> RuntimeError) and a live test resolving "homo sapiens".
- Docs: --taxon flag + example (en); changelog note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Elarwei001 Elarwei001 marked this pull request as ready for review July 5, 2026 15:00
@Elarwei001 Elarwei001 changed the title from "feat(ref): return NCBI assembly reports via gget ref (#179)" to "[WIP]feat(ref): return NCBI assembly reports via gget ref (#179)" Jul 5, 2026
@Elarwei001 Elarwei001 marked this pull request as draft July 5, 2026 15:03
…rse#179)

Complements --taxon (which resolves to the reference assembly): --list_assemblies
(Python: list_assemblies=True) interprets the input as an organism/taxon name and
returns a table of all NCBI assemblies for that taxon (accession, assembly_name,
refseq_category, assembly_level, organism), with reference/representative first, so
a user can discover and then fetch a specific non-reference assembly by accession.

Uses the bundled datasets CLI (imported lazily). Tests: offline (parse/order,
empty -> ValueError, delegation) and a live test listing the human assemblies.
Docs: --list_assemblies flag + example; changelog note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Elarwei001 Elarwei001 marked this pull request as ready for review July 5, 2026 15:13
Elarwei001 (Contributor, Author) commented Jul 5, 2026

Hi @lauraluebbert — gget ref --assembly_report (resolves #179) fetches an NCBI assembly report to translate sequence/chromosome names across Ensembl/GenBank/RefSeq/UCSC. Key points:

  1. Core. --assembly_report GCF_000001405.40 returns the name-mapping table as a DataFrame (json=True for dicts); gget.assembly_report() is also exposed.
  2. Version optional. Passing GCF_000001405 (no version) resolves to the latest.
  3. By organism name. --taxon "homo sapiens" resolves to that taxon's reference assembly; --list_assemblies "homo sapiens" lists all its assemblies to pick from. Both are opt-in and reuse the bundled datasets CLI (lazy import) — accession callers are unaffected.
  4. CLI format follows gget's convention: JSON by default, --csv for CSV.
  5. Tests. Network-free unit tests for parsing/errors/resolution, plus live tests that skip on network/transient errors.

I would appreciate a review when you have a moment. Thanks.
