
feat(g2p): make g2p useful for variant-interpretation workflows + bug fixes#258

Merged
lauraluebbert merged 8 commits into dev from fix/g2p-resolve-gene-and-bugs
Jun 27, 2026

@lauraluebbert commented Jun 26, 2026

Collaborator

Hi @Elarwei001 — thanks for the original gget g2p module. While reviewing it for biological relevance we found that the wrapper faithfully exposed the three public G2P REST endpoints but didn't quite match how scientists actually use the portal day-to-day, and had a couple of silent-failure bugs against the live API. This PR closes that gap. Tagging you so you can sanity-check the scientific framing and naming.

Why this PR exists (scientific motivation)

The G2P portal is fundamentally a variant interpretation resource — scientists land on it with a list of variants of interest (clinical screens, MAVE deep mutational scans, gnomAD outliers) and ask: "at residue Y of protein Z, is this position in a folded region? at a pocket? at a PTM? does PFES flag it as enriched for pathogenic variants?"

The previous gget g2p exposed the data needed to answer those questions but made the user do a lot of plumbing — look up the UniProt accession by hand, get all ~140 columns × N residues even when they only care about three positions, parse the comma-joined PDB string before feeding it into gget pdb, and figure out which of the 142 columns are actually the scientifically useful ones from the column headers alone. This PR is about closing that last mile so a working scientist can go from "I have a list of variants" to "I have an annotated table" in one call.

What this PR does

Scientific UX

  • Either gene or --uniprot_id is now sufficient — the other is resolved via UniProt (cached). Gene → UniProt picks the canonical reviewed human Swiss-Prot entry; the resolution and its limitations (synonyms, paralogues, non-human, isoforms) are logged so users see what was chosen and how to override.
  • Invariant output schema. The canonical pair used for the query is always prepended as gene_name / uniprot_id columns (and stored on df.attrs), so downstream code doesn't have to branch on which input mode the user used and the chosen identifiers survive in saved CSV/JSON files.
  • residues= filter restricts features / alignment results to specific positions — Python accepts int / list / range / set; CLI accepts --residues 185,1775,1812 or 100-200 or 1-50,185,300-310. The whole point of fetching the features table is usually to score a small variant list; filtering client-side after fetch is one short line of pandas the user no longer has to write.
  • map results get a parsed PDB Ids List column (list[str]) alongside the comma-joined PDB Ids string, so the output is directly chainable into gget pdb without .split(",") boilerplate.
  • Docs now advertise the columns that actually drive variant interpretation: PFES (Protein Feature Enrichment Score) sub-scores, MaveDB per-residue functional scores, fpocket / af2bind / p2rank pocket predictions, intra/inter-chain interaction counts. Linked the g2p-bis feature-description repo so readers can decode the cryptic MaveDB column names.
  • Noted the upstream limitation that variant overlays (gnomAD / ClinVar / HGMD) are web-portal-only — they aren't exposed by the public REST API and therefore aren't reachable from gget g2p. Setting honest expectations rather than implying parity.
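For readers who want to see what the `--residues` syntax above amounts to, here is a minimal sketch of a parser for specs such as `185,1775,1812`, `100-200`, or `1-50,185,300-310`. The function name `parse_residues` is hypothetical and is not taken from the gget source; the sketch only assumes comma-separated positive integers and inclusive `start-end` ranges, as described in the bullet.

```python
def parse_residues(spec: str) -> list[int]:
    """Parse a spec like '1-50,185,300-310' into sorted, unique positions.

    Hypothetical helper for illustration; not the gget implementation.
    """
    positions: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"Invalid residue range: {part!r}")
            positions.update(range(start, end + 1))  # ranges are inclusive
        else:
            positions.add(int(part))
    return sorted(positions)


print(parse_residues("1-3,5"))  # [1, 2, 3, 5]
```

The resulting list can then be used to keep only the matching rows of the features table, which is the client-side filter the bullet refers to.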

Bug fixes

  • Silent failure when the gene/UniProt pair was unknown. The G2P portal returns HTTP 200 with a JSON body like {"status":"failure","message":"No data for this gene."} over the TSV channel. The previous code parsed that JSON string as a single TSV column header and returned a 0-row DataFrame with no error. This case is now detected: the upstream message is logged and None is returned.
  • Consistent failure return. All failure modes (network error, HTTP error, JSON error body, empty response, unresolvable identifier) now return None. Previously it was a mix of None and pd.DataFrame(), which forced callers to check both.
  • Retries on transient failures (connection errors, read timeouts, HTTP 5xx) with exponential backoff — same pattern gget bgee / gget opentargets already use.
  • URL-encoding of gene / uniprot_id / isoform path segments.
  • Removed dead Accept header (the server ignores it and always returns text/plain regardless).
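Two of the fixes above (path-segment URL encoding and detection of a JSON failure body on the TSV channel) can be sketched with the standard library alone. This is a sketch under stated assumptions, not the code in gget_g2p.py: the helper names `build_features_url` and `parse_g2p_body` are hypothetical, rows are returned as plain dicts instead of a DataFrame, and the URL shape is the one shown in the verification output for BRCA1 / P38398.

```python
import csv
import io
import json
from urllib.parse import quote

G2P_API = "https://g2p.broadinstitute.org/api"


def build_features_url(gene: str, uniprot_id: str) -> str:
    """Build the protein-features URL, percent-encoding each path segment."""
    # safe="" also encodes "/", so an odd identifier cannot add path segments.
    return (
        f"{G2P_API}/gene/{quote(gene, safe='')}"
        f"/protein/{quote(uniprot_id, safe='')}/protein-features"
    )


def parse_g2p_body(text: str):
    """Return TSV rows as dicts, or None for an empty or JSON-failure body.

    Hypothetical helper: None is the single failure value for every case.
    """
    stripped = text.strip()
    if not stripped:
        return None
    if stripped.startswith("{"):
        try:
            payload = json.loads(stripped)
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict) and payload.get("status") == "failure":
            print(f"G2P error: {payload.get('message')}")
            return None
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

Checking for the failure payload before handing the text to a TSV parser is what prevents the 0-row table with a JSON string as its only column header.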

API surface additions

  • New residues= Python argument (and --residues CLI flag).
  • New out= Python argument writes the result to an explicit CSV path; save=True continues to work and writes to the auto-named CSV in CWD; out= takes precedence when both are set.
  • alignment now requires uniprot_id to be passed explicitly — gene→UniProt resolution returns the base accession and can't disambiguate isoforms.

Backward compatibility

All existing call sites continue to work unchanged:

  • CLI: gget g2p BRCA1 -u P38398 -r features
  • Python: gget.g2p("BRCA1", uniprot_id="P38398", resource="features")
  • save=True behavior unchanged.

The one schema change is the two new leading gene_name / uniprot_id columns on every result, which we think is the right trade-off for invariant output across input modes.

lauraluebbert and others added 6 commits June 26, 2026 15:14
- gene is now optional and resolved from uniprot_id via the UniProt REST
  entry endpoint (cached with lru_cache). A UniProt accession alone is
  sufficient identification; the old API required both.
- Fix silent failure where G2P returns HTTP 200 with a JSON
  {"status":"failure",...} body on unknown gene/UniProt pairs. The
  response was being parsed as a single TSV column header and a 0-row
  DataFrame returned with no error. Now detected, logged, and returns
  None.
- All failure modes return None (previously a mix of None and empty
  DataFrame).
- Retry transient failures (5xx, connection errors, timeouts) with
  exponential backoff.
- URL-encode gene/uniprot_id/isoform path segments.
- New `out=` Python argument to write to an explicit CSV path (takes
  precedence over `save`).
- Docs: list g2p in SUMMARY.md (was hidden from the published site);
  advertise PFES, MaveDB, pocket and interaction columns; note that
  variant overlays (gnomAD/ClinVar/HGMD) are portal-only.
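The retry behaviour listed above can be illustrated with a small, network-free sketch. It assumes that transient failures (5xx, connection errors, timeouts) have already been mapped to a single exception type; `TransientError`, `with_retries`, and the default delays are hypothetical names and values chosen for the example, not the ones used in gget.

```python
import time


class TransientError(Exception):
    """Stand-in for a connection error, read timeout, or HTTP 5xx."""


def with_retries(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(); on TransientError wait base_delay * 2**attempt and retry.

    Returns None once all attempts are exhausted, matching the
    "all failure modes return None" convention described above.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                return None
            sleep(base_delay * 2**attempt)  # 0.5 s, 1 s, 2 s, ...
    return None
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.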

Backward-compatible: existing CLI (`gget g2p BRCA1 -u P38398`) and
Python (`gget.g2p("BRCA1", uniprot_id="P38398")`) call sites continue
to work unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Symmetric resolution: `--uniprot_id` is now optional too, resolved from
  `gene` via UniProt (canonical reviewed human Swiss-Prot entry) when
  omitted. Limitations are spelled out in a prominent log message
  (synonyms, paralogues, non-human, unreviewed, isoforms — pass uniprot_id
  to override). The resolved pair travels with the data both as
  df.attrs["gene"]/["uniprot_id"] and as leading `Resolved Gene` /
  `Resolved UniProt` columns whenever resolution happened — so CSV/JSON
  saved files also record what was queried.
- `residues=` argument filters `features`/`alignment` results to specific
  positions (int / list / tuple / range / set in Python; comma-separated
  list and/or inclusive ranges on the CLI: `--residues 100-200,300,400`).
- `map` results gain a parsed `PDB Ids List` column (list[str]) alongside
  the comma-joined `PDB Ids` string, ready to feed into `gget pdb`.
- `alignment` now requires `uniprot_id` explicitly (gene→UniProt returns
  the base accession and cannot disambiguate isoforms).
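As an illustration of the `PDB Ids List` column, the parsing step is roughly the following. `split_pdb_ids` is a hypothetical name, and the treatment of missing values (returning an empty list) is an assumption made for the sketch.

```python
def split_pdb_ids(pdb_ids) -> list[str]:
    """Turn a comma-joined 'PDB Ids' string into a clean list of IDs.

    Hypothetical helper; non-string input (e.g. a missing value) gives [].
    """
    if not isinstance(pdb_ids, str):
        return []
    return [p.strip() for p in pdb_ids.split(",") if p.strip()]
```

Each element of the returned list can then be passed on one at a time, for example to `gget pdb`.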

Backward-compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous behavior added "Resolved Gene" / "Resolved UniProt" columns
only when one of the identifiers was looked up. That meant the output
schema differed depending on input mode, which is awkward for downstream
code that should not care whether the caller supplied gene, uniprot_id,
or both.

Now the canonical pair is *always* prepended as `gene_name` and
`uniprot_id` columns, populated with whichever values were used for the
query. The same keys are also set on `df.attrs`. Output schema is now
identical across all three input modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lauraluebbert changed the title from "fix(g2p): make gene optional, fix silent failure on JSON error bodies" to "feat(g2p): make g2p useful for variant-interpretation workflows + bug fixes" Jun 26, 2026
@Elarwei001

Contributor

Thanks @lauraluebbert — this is a great cleanup. The gene⇄UniProt auto-resolution and the residues filter make the module genuinely usable for variant-interpretation workflows, and the bug fixes are spot on (the silent JSON-over-TSV failure on an unknown gene/UniProt pair was a real trap).

I ran an independent check (a standalone script hitting the live G2P + UniProt URLs directly, not through gget) to sanity-check the scientific framing, and the naming/descriptions all hold up:

  • PFES = Protein Feature Enrichment Score and its six sub-scores (PFES_Physicochemical/Structure/Domain/Function/Modification/PPI) match the live columns exactly.
  • fpocket / af2bind / p2rank, AlphaFold pLDDT, DSSP, ASA, the UniProt sites, PhosphoSitePlus PTMs, and the MaveDB per-residue score columns are all present as described, and gnomAD/ClinVar/HGMD overlays are indeed absent from the public REST features endpoint — so the "web-portal-only" note is accurate.

One tiny wording nuance, not a blocker: the docs say the intra-/inter-chain interactions come "(from PDB and AlphaFold)", but in the live table only the intra-chain interactions have both PDB and AlphaFold2 columns — the inter-chain ones are PDB-only (which makes sense, since AlphaFold2 models single chains). Might be worth scoping that phrase to intra-chain.

I've attached the verification script and its output below for reference. Thanks again for the careful work!

Verification script:
verify_g2p_claims_2026-06-27.py

Verification output:

==============================================================================
Fetching the REAL G2P features table (BRCA1 / P38398) directly from the API
https://g2p.broadinstitute.org/api/gene/BRCA1/protein/P38398/protein-features
==============================================================================

Returned: 142 columns x 1863 residue rows

1) Features table size
  [PASS] PR says "~140 columns"  -> actual = 142

2) PFES (Protein Feature Enrichment Score) + sub-scores
  [PASS] column 'PFES'
  [PASS] column 'PFES_Physicochemical'
  [PASS] column 'PFES_Structure'
  [PASS] column 'PFES_Domain'
  [PASS] column 'PFES_Function'
  [PASS] column 'PFES_Modification'
  [PASS] column 'PFES_PPI'

3) Pocket prediction tools (fpocket / af2bind / p2rank)
  [PASS] tool 'fpocket'  -> 2 column(s): ['fpocket: pocket number*', 'fpocket: druggability score*']
  [PASS] tool 'af2bind'  -> 2 column(s): ['af2bind: pocket number*', 'af2bind: binding probability*']
  [PASS] tool 'p2rank'  -> 2 column(s): ['p2rank: pocket number*', 'p2rank: pocket probability*']

4) AlphaFold pLDDT / DSSP secondary structure / accessible surface area
  [PASS] AlphaFold pLDDT
  [PASS] DSSP secondary structure
  [PASS] Accessible surface area

5) UniProt sequence annotations (active/binding/domain sites)
  [PASS] column 'Active site (UniProt)'
  [PASS] column 'Binding site (UniProt)'
  [PASS] column 'Domain (UniProt)'

6) Intra/inter-chain interactions (H-bonds, salt bridges, ...)
  [PASS] Intra-chain Hydrogen bond
  [PASS] Intra-chain Salt bridge
  [PASS] Inter-chain Hydrogen bond
     note: intra-chain AlphaFold2 columns = 4, inter-chain AlphaFold2 columns = 0 (PR wording 'from PDB and AlphaFold' is strictly true for intra- only)

7) MaveDB per-residue functional-score columns (URN-style codes)
  [PASS] MaveDB-style columns present  -> 62 columns, e.g. ['00000003-a-2 outliers⁺⁺', '00000003-a-2 avg', '00000003-b-2 outliers⁺⁺']

8) gnomAD / ClinVar / HGMD variant overlays are NOT in the REST features table
  [PASS] no gnomAD/ClinVar/HGMD columns leak through  -> none found

9) UniProt gene <-> accession resolution (what the PR added)
  [PASS] P38398 -> gene  -> BRCA1
  [PASS] BRCA1 -> canonical reviewed human accession  -> P38398 (1 reviewed candidate(s))

==============================================================================
OVERALL: PASS  - every scientific claim in PR #258 verified against live data.
==============================================================================

@lauraluebbert merged commit 7ea5996 into dev Jun 27, 2026
1 of 2 checks passed
@lauraluebbert deleted the fix/g2p-resolve-gene-and-bugs branch June 27, 2026 18:24
lauraluebbert added a commit that referenced this pull request Jun 28, 2026
Two pre-commit.ci failures were hitting every PR (it runs --all-files),
unrelated to the PRs themselves:

1. ruff D205 in gget_g2p.py — the g2p() docstring summary spanned two
   lines with no blank line before the body. Collapsed to a single-line
   summary. (#258 reintroduced this; ruff now passes on all of gget/.)

2. mypy exit 4 on pre-commit.ci while passing locally — the mypy hook's
   stub dependencies (pandas-stubs, types-*) and mypy-baseline were
   unpinned, so pre-commit.ci installed newer stubs that inferred
   different types and surfaced "new" errors absent from the baseline.
   Pinned all type-affecting deps (mypy 1.13.0, mypy-baseline 0.7.4, and
   the four stub packages) in both the pre-commit hook and the test
   dependency group so the baseline is reproducible. Verified the pinned
   set reproduces exactly the 553 baselined errors (0 new) in a fresh
   hook env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lauraluebbert added a commit that referenced this pull request Jun 28, 2026
* Bump dev version

* Add space for new updates

* CI: update pytest results (dev)

* feat(pdb): support PDBx/mmCIF format with automatic fallback (#178, #177) (#222)

* feat(pdb): support PDBx/mmCIF format and auto-fallback (#178, #177)

The legacy PDB format is being phased out by RCSB and is unavailable for
large structures (e.g. 6Q38, 7A01), causing `gget pdb` to fail with
"not found" — the bug reported in #177.

- Add `resource="mmcif"` to download the structure in PDBx/mmCIF (.cif).
- `resource="pdb"` (default) now automatically falls back to PDBx/mmCIF
  when the legacy PDB file is unavailable, logging a warning. Saved files
  use the correct extension (.cif vs .pdb) based on the format fetched.
- Backward compatible: existing commands that already worked are unchanged.
- Tests: explicit mmcif download + legacy->mmcif fallback regression (6Q38).
- Docs + updates.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: ruff lint errors flagged by pre-commit.ci

- gget_g2p.py: collapse the multi-line docstring summary into a
  single line + blank line (ruff D205).
- main.py: add the missing # noqa: E402 to the 6 new import lines
  (g2p, ref, search, seq, setup, virus). All earlier imports already
  carry this noqa because dt_string is computed at module top before
  the import block, so E402 fires on any unmarked later import.
  Also drops the stray "# Module functions" comment that was
  splitting one alphabetical import list into two.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laura Luebbert <laura.lbt60@gmail.com>

* feat(bgee): identify gget in outbound Bgee API requests (#224)

Send `User-Agent: gget/<version> (+https://github.com/scverse/gget)` on
all Bgee API calls so the upstream service can attribute traffic to gget
and reach the project if needed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* CI: update pytest results (dev)

* CI: update pytest results (dev)

* feat(types): pay down mypy errors (var-annotated + json overloads) (#225)

* feat(types): pay down 67 mypy errors (var-annotated + json overloads)

mypy baseline: 613 → 546 (−67, ~11%). No behavior changes — every
edit is annotation-only and verified with python -m py_compile +
`import gget` smoke test + `pytest --collect-only` (400 tests, 0
collection errors). Resolves part of #216.

Two passes:

1. var-annotated quick wins (−21 errors, → 0 remaining)
   - Added explicit type annotations to 13 empty container literals
     across utils.py, gget_ref.py, gget_info.py, gget_blat.py,
     gget_muscle.py, gget_virus.py. Inferred element type from
     surrounding code (list[str], dict[str, Any], etc.) — fell back
     to Any only when the type was genuinely dynamic.

2. typing.overload for the json= flag pattern (−~20 union-attr,
   plus ~26 other category errors that depended on the narrowed
   return type)
   - Added @typing.overload signatures for the 12 modules with
     `def f(..., json: bool = False, ...) -> DataFrame | dict`:
     gget_8cube, gget_archs4, gget_bgee, gget_blast, gget_blat,
     gget_cosmic, gget_diamond, gget_elm, gget_enrichr, gget_info,
     gget_opentargets, gget_search.
   - Now `f(...)` returns DataFrame and `f(..., json=True)` returns
     dict at the type-check level. Implementation signature unchanged.
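A minimal sketch of the `json=` overload pattern described in item 2, for readers unfamiliar with `typing.overload`. The function `fetch` and the `Table` class are hypothetical stand-ins (for a gget module function and `pandas.DataFrame` respectively); only the shape of the overloads reflects the description above.

```python
from typing import Any, Literal, Union, overload


class Table:
    """Hypothetical stand-in for pandas.DataFrame."""

    def __init__(self, data: dict[str, Any]) -> None:
        self.data = data


@overload
def fetch(gene: str, json: Literal[False] = False) -> Table: ...
@overload
def fetch(gene: str, json: Literal[True]) -> dict[str, Any]: ...


def fetch(gene: str, json: bool = False) -> Union[Table, dict[str, Any]]:
    """Implementation signature stays as before; only the overloads are new."""
    data = {"gene": gene}
    return data if json else Table(data)


table = fetch("BRCA1")            # a type checker infers Table
raw = fetch("BRCA1", json=True)   # a type checker infers dict[str, Any]
```

With the overloads in place, attribute access on the default return value no longer triggers a union-attr error, because the checker can pick the return type from the literal value of `json`.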

Why only 67 and not the predicted ~150:
- Most remaining [union-attr] errors come from BeautifulSoup
  (`Tag | None` from `.find()`) and `str | None` checks, not the
  json= flag pattern. Those need per-callsite None-guards, which
  is the next batch.

Remaining categories (sorted, top 6):
  [index]        ~157  (pandas df["col"] indexing — needs cast() or # type: ignore)
  [union-attr]   115   (BeautifulSoup / str|None — needs None-guards)
  [attr-defined]  68   (dynamic JSON response shapes)
  [call-overload] 58   (pandas/numpy stubs)
  [assignment]   ~56
  [arg-type]     ~53

* fix(bgee): restore Literal + overload imports lost in dev merge

The merge of dev into feat/mypy-cleanup (925f66d) collided on
gget_bgee.py's typing import line. Git auto-resolved by taking
dev's version (`from typing import TYPE_CHECKING, Any` — added by
the bgee user-agent PR #224) and silently dropped the `Literal,
overload` additions from this branch, while keeping the @overload
decorators at lines 183/192 that use them. Result: module-load
NameError that broke test collection for every test file that
imports gget.

Restore the full import: TYPE_CHECKING, Any, Literal, overload.

* ci: re-trigger pre-commit.ci

* fix(pre-commit): exclude .github/badges/*.json from formatting

The badge JSON is regenerated by ci.yml's "Generate tests badge JSON"
step using json.dumps() with no `indent` parameter — single-line
compact output. biome's default JSON formatter wants multi-line tab-
indented output. So every CI run writes the compact form, and every
pre-commit.ci run reformats it back, and we get a permanent
biome-format failure that never resolves.

Same fix as the tests/pytest_results.txt entry: just exclude the
auto-generated file from formatting hooks.

* CI: update pytest results (dev)

* CI: update pytest results (dev)

* fix(types): allow tuple request timeouts in http_json (#253)

* fix(types): annotate pdb URL fallback list (#255)

* CI: update pytest results (dev)

* fix(ci): repair archs4 + ELM live-data test failures (#252)

* fix(opentargets): give drug synonyms a GraphQL sub-selection (HTTP 400 fix)

OpenTargets changed the Drug 'synonyms' and 'tradeNames' fields from
[String!]! to the object type [DrugLabelAndSource!]!, which now requires
a sub-selection. The bare-scalar selection caused every drug query to
fail with HTTP 400.

Request '{ label }' for both fields and flatten the response objects
back to a list of label strings so downstream output stays
backward-compatible (a list of strings).
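The flattening step can be sketched as follows. This is an illustration only: `flatten_labels` is a hypothetical name, the sample values are placeholders, and the sketch assumes each returned object carries a `label` key as requested by the `{ label }` sub-selection.

```python
def flatten_labels(items):
    """Map [{"label": "X"}, ...] back to ["X", ...].

    Plain strings pass through unchanged, so the helper is safe on
    responses in either the old or the new shape.
    """
    return [item["label"] if isinstance(item, dict) else item for item in items]


print(flatten_labels([{"label": "name-a"}, {"label": "name-b"}]))  # ['name-a', 'name-b']
```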

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(archs4): tolerate missing 'color' column in tissue expression (#dev-drift)

ARCHS4's tissue-expression CSV intermittently omits the 'color' column,
which made `gget archs4 --which tissue` crash with
`KeyError: "['color'] not found in axis"`. The 'color' column is only used
for plotting upstream and is dropped (never used) by gget, so a missing
column should not be fatal.

Use `drop(columns=["color"], errors="ignore")` so the request degrades
gracefully when the column is absent. Adds network-free regression tests
covering both the present-color and missing-color responses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(opentargets): use baselineExpression for the expression resource

OpenTargets retired the `target.expressions` field (it now returns an empty
list for every gene), so `gget opentargets -r expression` returned nothing.
Baseline expression data moved to the paginated `target.baselineExpression`
field with a new per-biosample data model.

- Repoint the expression query to `baselineExpression(page:{index:0,size:250})
  { rows {...} }` and update rows_path to ["baselineExpression","rows"].
- Output columns change accordingly (per-biosample summary stats: median/min/
  q1/q3/max/unit + tissueBiosample/celltypeBiosample ids + datasource/datatype),
  because the upstream data model changed and the old shape no longer exists.
- Remove the two now-invalid live exact-match fixtures and replace them with
  network-free mocked tests; update docs (example, resource table, updates.md).

Verified live: http_json with the new query returns 1409 rows in ~0.6s and the
parsing pipeline yields the documented columns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(opentargets): loosen live-data assertions to structural/invariant (data drifts across releases)

OpenTargets is a live database re-released regularly; several opentargets tests
pinned exact current values (disease ids/scores, result hashes, interaction
partner ids, genotypes) that legitimately change every release, so they failed
on unrelated PRs even though gget returns correct current data.

Replace the exact-value/hash assertions for test_opentargets, _diseases,
_depmap, _depmap_filter, _interactions, _interactions_no_limit and
_pharmacogenetics with structural/invariant assertions (expected columns
present, numeric dtypes, value-format patterns — ontology-curie disease/tissue
ids, ENSG interaction partners, ACH DepMap ids, score in [0,1], nucleotide
genotypes — and the depmap filter invariant). The fixture entries are marked
`code_defined`; the structural methods live in tests/test_opentargets.py.

These stay meaningful (they break on wrong columns, malformed ids, non-numeric
scores, broken filtering, or empty-where-guaranteed) without pinning drifting
data. Verified live against current OpenTargets data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: assert live-data contracts for CI repair

* test: retry ELM live setup downloads

* test: keep OpenTargets expression semantics out of CI repair

* test(opentargets): add semantic anchors + score tolerance to live-data tests (#249)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test(opentargets): rewrite live-data tests as explicit IL13 assertions (#249)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test(opentargets): read gene from fixture + guard to IL13; drop duplicate test_opentargets (#249)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test(archs4): rewrite live tissue tests as concrete fixture-driven checks (#249)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(archs4): deterministic tissue sort via id tiebreaker; restore exact-snapshot tests (#249)

Sort tissue rows by [median desc, id asc] so output is reproducible when medians tie
(ARCHS4 returns tied rows in varying order). Revert the live tissue tests to exact
assert_equal snapshots (re-sorted to the deterministic order); keep the network-free
color regression tests.
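A small sketch of the tiebreaker idea, using plain dicts so it runs without pandas. The `median` and `id` keys follow the description above; `sort_tissues` and the sample rows are hypothetical.

```python
def sort_tissues(rows):
    """Sort by median descending, then id ascending, so ties are stable."""
    return sorted(rows, key=lambda r: (-r["median"], r["id"]))


rows = [
    {"id": "tissue_b", "median": 1.0},
    {"id": "tissue_a", "median": 1.0},  # ties with tissue_b on median
    {"id": "tissue_c", "median": 2.0},
]
print([r["id"] for r in sort_tissues(rows)])  # ['tissue_c', 'tissue_a', 'tissue_b']
```

Because the secondary key is unique per row, the output order no longer depends on the order in which tied rows arrive from upstream.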

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Drop opentargets fixes from this PR (now covered by #256)

Strip back the opentargets-related changes so this PR is focused on
the archs4 + ELM CI-stability fixes only. The opentargets work
(synonyms HTTP 400 fix, fixture refresh, expression skip) is being
handled in a separate PR (#256), per maintainer preference for
one-module-per-PR review.

Reverted to origin/dev:
- gget/gget_opentargets.py
- tests/test_opentargets.py
- tests/fixtures/test_opentargets.json

Trimmed updates.md:
- Removed the opentargets bullet (lives in #256)
- Added an archs4 bullet explaining the color-column + deterministic-
  sort fix (user-visible behavior change, was missing here)

Remaining scope:
- gget_archs4.py: graceful handling of missing color column,
  deterministic median-then-id sort
- tests/test_archs4.py: TestArchs4MissingColor regression test
- tests/fixtures/test_archs4.json: refreshed for the deterministic sort
- tests/test_elm.py: retry ELM setup on transient download failure

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(archs4): drop the redundant with-color companion test

test_tissue_with_color_still_dropped tested the "happy path" that both
the old and the new code already handle the same way (column present →
column dropped from output). It can't catch any plausible regression
of the actual fix (which is the errors="ignore" kwarg, exercised by
the sibling test_tissue_missing_color_does_not_crash).

Removing it tightens the test suite without weakening the regression
guard around the actual bug. _CSV_WITH_COLOR class attribute removed
along with it (no other references).

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laura Luebbert <laura.lbt60@gmail.com>

* fix(opentargets): adapt to upstream API drift + skip deprecated expressions field (#256)

* fix(opentargets): adapt to upstream API drift + skip deprecated expressions field

11 opentargets tests were failing across all Python versions due to
three distinct upstream changes:

1. GraphQL schema change (real bug)
   The Drug type's `synonyms` and `tradeNames` fields are now
   `[DrugLabelAndSource!]!` (was scalar list[str]). Querying without a
   sub-selection returns HTTP 400. Updated QUERY_STRING_DRUGS to query
   `synonyms { label }` and `tradeNames { label }`. The existing
   _collapse_singletons() post-processor flattens each {label: "X"}
   back to "X" automatically, so the public DataFrame shape is
   preserved.

2. Data drift (8 tests, fixtures refreshed)
   Open Targets re-indexed disease ontology IDs (EFO → MONDO), gene
   protein IDs, allele frequencies, depmap entries, and interaction
   data. Per the project convention (refresh fixtures, don't normalize)
   the expected_result blocks for test_opentargets, *_diseases, *_drugs,
   *_interactions, *_pharmacogenetics, *_depmap, *_depmap_filter, and
   *_interactions_no_limit were re-captured from the current upstream
   output via a one-shot helper script.

3. expressions field deprecated (2 tests, skipped)
   The Target.expressions field now returns [] for all queries; Open
   Targets replaced it with Target.baselineExpression which has a
   completely different schema (tissueBiosample, q1/q3/median/min/max
   instead of tissue/rna sub-objects). Migrating gget_opentargets to
   the new field is a user-facing API change and out of scope for this
   PR; the two affected tests are now marked as skipped with a
   reason field.

Also adds a `skip` test type to tests/from_json.py so JSON-defined
tests can be marked as known-skipped (with a reason message) without
needing a separate Python test file or fixture deletion.

Verified locally: 15 passed, 2 skipped, 0 failed.

* docs: add 0.30.8 entries for opentargets fixes

- Drugs resource: HTTP 400 fix (synonyms/tradeNames sub-selection).
  Surfaces both the upstream cause and reassures users that the
  DataFrame column shape is unchanged.
- Expression resource: known limitation pointing at the
  baselineExpression migration as the next step. Flags that the
  tests for this path are skipped in the meantime.

* opentargets(expression): warn user about upstream deprecation

Without this, gget opentargets resource="expression" silently returns
an empty DataFrame, which looks identical to "your gene has no
expression data" and gives the user no signal that the upstream field
is actually retired. Now emits a logger.warning naming the deprecated
field, the planned baselineExpression replacement, and the tracking
issue (#247) so users can subscribe / contribute.

Updated the 0.30.8 entry in updates.md to mention the warning.

* docs(contributing): add PR-scope and upstream-flake guidance (#257)

* docs(contributing): add PR-scope and upstream-flake guidance

Two new subsections under "Contribute through pull requests":

- "Keep PRs small and focused" — explicitly states the one-module-per-PR
  preference. Codifies what's been a verbal/case-by-case practice into
  written guidance so contributors don't ship multi-module PRs and then
  have to re-split them on reviewer request.

- "Failing tests for modules you didn't touch" — explains that gget's
  test suite hits real upstream databases and those services drift, so
  a CI red in a module your PR doesn't touch is almost certainly not
  your bug and shouldn't block you. Avoids the common new-contributor
  trap of assuming they have to fix unrelated failures to merge.

CONTRIBUTING.md is the single source of truth (docs/src/en/contributing.md
is an mdbook include of this file), so the gget docs site picks up the
change automatically. The Spanish translation will follow on the next
auto-translate workflow run after this merges.

* Revise CONTRIBUTING.md for clarity and completeness

Updated self-review checklist for PRs to include biological relevance and clarified testing instructions. Adjusted language for consistency and improved clarity.

* Update CONTRIBUTING.md

* CI: update pytest results (dev)

* feat(alphafold): add custom jackhmmer save directory option (#49) (#234)

* feat(alphafold): add custom jackhmmer save directory option (#49)

Add a jackhmmer_savedir parameter (Python) / -jhd, --jackhmmer_savedir
flag (CLI) to gget alphafold so users can choose where the temporary
jackhmmer files (~2 GB) are stored instead of the hardcoded
~/tmp/jackhmmer/ folder in the home directory.

The new get_jackhmmer_dir() helper centralizes the path resolution and
clean_up() now operates on the resolved directory. Default behavior is
unchanged (backward compatible).
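The path resolution might look roughly like the sketch below. It is not the actual `get_jackhmmer_dir()` from gget: the name `resolve_jackhmmer_dir`, its signature, and the decision to use the custom directory as-is are assumptions; only the default location (`~/tmp/jackhmmer/`) comes from the description above.

```python
import os

# Default location named in the commit message above.
DEFAULT_JACKHMMER_DIR = os.path.join("~", "tmp", "jackhmmer")


def resolve_jackhmmer_dir(savedir=None):
    """Return an absolute directory for temporary jackhmmer files.

    Hypothetical helper: falls back to the default when savedir is None.
    """
    base = savedir if savedir is not None else DEFAULT_JACKHMMER_DIR
    return os.path.abspath(os.path.expanduser(base))
```

Routing both the download step and the clean-up step through one resolver is what keeps them pointed at the same directory.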

Resolves #49.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(alphafold): cover clean_up default dir (#49)

Add a network-free test for the clean_up(None) default-directory branch:
it patches get_jackhmmer_dir so the None default resolves to a temporary
folder (never the real ~/tmp) and asserts clean_up() resolves and removes
it. Covers the previously-uncovered default-resolution line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update CONTRIBUTING.md

* CI: update pytest results (dev)

* feat(g2p): make g2p useful for variant-interpretation workflows + bug fixes (#258)

* fix(g2p): make gene optional, fix silent failure on JSON error bodies

- gene is now optional and resolved from uniprot_id via the UniProt REST
  entry endpoint (cached with lru_cache). A UniProt accession alone is
  sufficient identification; the old API required both.
- Fix silent failure where G2P returns HTTP 200 with a JSON
  {"status":"failure",...} body on unknown gene/UniProt pairs. The
  response was being parsed as a single TSV column header, and a 0-row
  DataFrame was returned with no error. The failure body is now
  detected and logged, and the call returns None.
- All failure modes return None (previously a mix of None and empty
  DataFrame).
- Retry transient failures (5xx, connection errors, timeouts) with
  exponential backoff.
- URL-encode gene/uniprot_id/isoform path segments.
- New `out=` Python argument to write to an explicit CSV path (takes
  precedence over `save`).
- Docs: list g2p in SUMMARY.md (was hidden from the published site);
  advertise PFES, MaveDB, pocket and interaction columns; note that
  variant overlays (gnomAD/ClinVar/HGMD) are portal-only.

Backward-compatible: existing CLI (`gget g2p BRCA1 -u P38398`) and
Python (`gget.g2p("BRCA1", uniprot_id="P38398")`) call sites continue
to work unchanged.
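
For illustration, the HTTP 200 failure-body check described above could look roughly like the sketch below. This is not the code in gget_g2p.py: the helper name `g2p_failure_message` and the `message` key are assumptions, and only the `"status": "failure"` field is taken from this commit message.

```python
import json
from typing import Optional


def g2p_failure_message(body: str) -> Optional[str]:
    """Return an error message if `body` is a JSON failure payload, else None.

    Hypothetical helper: the real module may detect failures differently.
    """
    text = body.lstrip()
    if not text.startswith("{"):
        # A TSV response starts with a column header, not a JSON object.
        return None
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and payload.get("status") == "failure":
        # "message" is an assumed key; fall back to a generic description.
        return str(payload.get("message", "unknown G2P failure"))
    return None
```

A caller would run this check on the response text before handing it to the TSV parser, log the message, and return None when it is not None.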

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat(g2p): residues filter, symmetric resolution, parsed PDB list

- Symmetric resolution: `--uniprot_id` is now optional too, resolved from
  `gene` via UniProt (canonical reviewed human Swiss-Prot entry) when
  omitted. Limitations are spelled out in a prominent log message
  (synonyms, paralogues, non-human, unreviewed, isoforms — pass uniprot_id
  to override). The resolved pair travels with the data both as
  df.attrs["gene"]/["uniprot_id"] and as leading `Resolved Gene` /
  `Resolved UniProt` columns whenever resolution happened — so CSV/JSON
  saved files also record what was queried.
- `residues=` argument filters `features`/`alignment` results to specific
  positions (int / list / tuple / range / set in Python; comma-separated
  list and/or inclusive ranges on the CLI: `--residues 100-200,300,400`).
- `map` results gain a parsed `PDB Ids List` column (list[str]) alongside
  the comma-joined `PDB Ids` string, ready to feed into `gget pdb`.
- `alignment` now requires `uniprot_id` explicitly (gene→UniProt returns
  the base accession and cannot disambiguate isoforms).

Backward-compatible.
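
As a rough sketch of the CLI syntax described above (`--residues 100-200,300,400`) and of the parsed PDB list, the two helpers below show one way to turn the strings into Python values. The function names `parse_residues` and `split_pdb_ids` are hypothetical and the real implementation may validate input differently.

```python
from typing import List


def parse_residues(spec: str) -> List[int]:
    """Expand a spec such as "100-200,300,400" into sorted unique positions.

    Ranges are inclusive, matching the behavior described in the commit message.
    """
    positions = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid residue range: {part!r}")
            positions.update(range(start, end + 1))
        else:
            positions.add(int(part))
    return sorted(positions)


def split_pdb_ids(joined: str) -> List[str]:
    """Split a comma-joined `PDB Ids` string into a list of IDs."""
    return [pdb_id.strip() for pdb_id in joined.split(",") if pdb_id.strip()]
```

Deduplicating through a set means overlapping ranges and repeated positions collapse to a single entry per residue.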

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update g2p.md

* refactor(g2p): invariant schema with gene_name/uniprot_id columns

The previous behavior added "Resolved Gene" / "Resolved UniProt" columns
only when one of the identifiers was looked up. That meant the output
schema differed depending on input mode, which is awkward for downstream
code that should not care whether the caller supplied gene, uniprot_id,
or both.

Now the canonical pair is *always* prepended as `gene_name` and
`uniprot_id` columns, populated with whichever values were used for the
query. The same keys are also set on `df.attrs`. Output schema is now
identical across all three input modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update g2p.md

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix(opentargets): repoint expression to baselineExpression (upstream retired target.expressions) (#248)

* fix(opentargets): use baselineExpression for the expression resource

OpenTargets retired the `target.expressions` field (it now returns an empty
list for every gene), so `gget opentargets -r expression` returned nothing.
Baseline expression data moved to the paginated `target.baselineExpression`
field with a new per-biosample data model.

- Repoint the expression query to `baselineExpression(page:{index:0,size:250})
  { rows {...} }` and update rows_path to ["baselineExpression","rows"].
- Output columns change accordingly (per-biosample summary stats: median/min/
  q1/q3/max/unit + tissueBiosample/celltypeBiosample ids + datasource/datatype),
  because the upstream data model changed and the old shape no longer exists.
- Remove the two now-invalid live exact-match fixtures and replace them with
  network-free mocked tests; update docs (example, resource table, updates.md).

Verified live: http_json with the new query returns 1409 rows in ~0.6s and the
parsing pipeline yields the documented columns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(opentargets): drive expression page size by limit; warn on truncation

Review follow-up to the baselineExpression migration. Two issues:

1. Silent truncation. The query hard-capped at size:250, but a gene can
   have far more biosamples (IL13 has 1409: 51 GTEx + 1343 Tabula
   Sapiens + 15 DICE). Users got 250 of 1409 with no signal, and an
   explicit limit>250 was also silently capped (limit was only applied
   post-hoc via df.head). Now the page size is a GraphQL variable driven
   by limit: min(limit, 3000) when limit is set, else 3000 (the API max).
   IL13 with no limit now returns all 1409 rows. If a gene exceeds the
   3000-per-page API max and no limit is set, a warning is logged
   pointing the user to --filters.

2. Stale docstring. The Python docstring still described the old model
   ("by tissues, organs, and anatomical systems"); updated to the
   per-biosample summary-statistics model. Docs note updated to drop the
   "up to 250" wording.

Verified live: no limit -> 1409 rows; limit=5 -> 5; datasourceId=gtex
filter -> 51. Mocked tests pass, ruff clean.
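
The page-size rule from item 1 can be written as a small pure function. This is a sketch only: the names are hypothetical, and the truncation check (a full page with no limit set) is an assumed heuristic, since the commit message does not say how the module detects that a gene exceeds the cap.

```python
from typing import Optional

API_MAX_PAGE_SIZE = 3000  # per-page API maximum stated in the commit message


def expression_page_size(limit: Optional[int]) -> int:
    """Page size sent as the GraphQL variable: min(limit, 3000), else 3000."""
    if limit is None:
        return API_MAX_PAGE_SIZE
    return min(limit, API_MAX_PAGE_SIZE)


def may_be_truncated(limit: Optional[int], rows_returned: int) -> bool:
    """Assumed heuristic: a full page with no limit suggests more rows exist."""
    return limit is None and rows_returned >= API_MAX_PAGE_SIZE
```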

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(opentargets): add prominent top-of-page note for expression output change

resource="expression" returns a different output shape as of v0.30.8
(baselineExpression per-biosample summary stats, replacing the retired
per-tissue z-score columns). This is a meaningful breaking change for
anyone parsing the expression output, so surface it at the top of the
module docs rather than only in the example section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Laura Luebbert <laura.lbt60@gmail.com>

* CI: update pytest results (dev)

* docs(opentargets): correct all example tables/filters + clarify diseases score & IDs (#240)

* docs(opentargets): clarify diseases score meaning and EFO/MONDO IDs (#168)

Clarify in the documentation that the `diseases` resource `score` is
OpenTargets' single overall target–disease association score (0–1)
aggregated across all data types/sources (not a per-data-source score),
and that the returned `disease.id` values are EFO-mapped traits that
include not only MONDO diseases but also HP phenotypes and EFO
measurements (e.g. "blood protein measurement"). gget returns the
associations exactly as OpenTargets reports them; documented how to
filter to MONDO terms only. Also updated the diseases example table to
the current post-rewrite column names.
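
A minimal sketch of the MONDO-only filter, assuming the results are available as a list of row dictionaries keyed by the `disease.id` column. The helper name is hypothetical, and the documented recipe may use pandas instead.

```python
from typing import Any, Dict, List


def keep_mondo(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only associations whose disease.id is a MONDO term."""
    return [
        row
        for row in rows
        if str(row.get("disease.id", "")).startswith("MONDO_")
    ]
```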

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(opentargets): replace stale EFO example with generic phrasing

The diseases-score note cited `EFO_0007937` "blood protein measurement"
as an example of an EFO measurement trait surfacing as a "disease".
That ID is stale — copied verbatim from the 2024-era issue #168 (gget
0.29.0). Against the current Open Targets API:

- EFO_0007937 no longer resolves (direct lookup + search both empty)
- "blood protein measurement" is no longer a standalone term (closest
  current ID is EFO_0004536 "total blood protein measurement")
- it no longer appears in LILRA4's associations

Rather than swap in another exact ID that will rot the same way,
replace it with generic phrasing ("biomarker or blood-measurement
traits"). The conceptual point — EFO measurement traits appear among
the results — is verified correct: LILRA4 currently returns a mix of
MONDO/HP/EFO/Orphanet IDs.

The rest of the note (overall-score semantics, the MONDO_0004980
atopic-eczema example, the MONDO filter recipe) was fact-checked
against the live API and is accurate.

* docs(opentargets): correct all example tables to actual 0.30.5+ output

Verified every resource example against live API output. Only the
`diseases` table (fixed earlier in this PR) was correct; the other
six example tables and the resource "Valid Filters" column still
showed pre-0.30.5 column names that no longer exist, so anyone copying
them (especially into --filter) would get wrong results or a
"Filter key not found" ValueError.

Corrected against real output for ENSG00000169194 (IL13):
- drugs: id/name/type/... -> drug.id/drug.name/drug.drugType/
  drug.mechanismsOfAction.rows/.../drug.indications.rows
- tractability: label/modality -> modality/label/value (value was missing)
- pharmacogenetics: rs_id/genotype_id/... -> variantId/genotypeId/.../
  variantFunctionalConsequence.id/.label
- expression: reordered to actual column order, added celltypeBiosample.*
  columns + a single-cell example row
- depmap: depmap_id/effect/tissue_id/... -> tissueId/tissueName/
  cellLineName/expression/diseaseFromSource/depmapId/geneEffect
- interactions: evidence_score/source_db/protein_a_id/... -> score/
  sourceDatabase/intA/.../targetB.approvedSymbol (taxonId is 134, not 9606)

Replaced the two interactions filter examples, which documented a
removed API (filter_mode/OR and -fpa/--filter_gene_b shortcuts removed
in 0.30.5; protein_a_id/gene_b_id keys are not real columns), with one
working example using the generic --filter key=value on real column
names. Updated the resource table's "Valid Filters" column accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Laura Luebbert <laura.lbt60@gmail.com>

* docs(updates): add missing 0.30.8 entries (bgee User-Agent; expression truncation)

Comprehensiveness pass on the 0.30.8 changelog:
- Added the gget bgee User-Agent change (#224), which was merged but
  undocumented.
- Expanded the opentargets expression entry to note the limit-driven
  page size, the 3000-row API cap, and the truncation warning (#248
  follow-up).

Verified all other 0.30.8 entries against merged PRs and confirmed the
0.30.7 vs 0.30.8 split matches the v0.30.7 tag contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* CI: update pytest results (dev)

* ci(mypy): adopt mypy-baseline so pre-commit only fails on new type errors

The mypy hook had ~553 pre-existing errors and exited non-zero, so it
failed on every PR (pre-commit.ci runs --all-files, so per-file diffing
doesn't help). Switch to mypy-baseline:

- .mypy-baseline.txt grandfathers the existing errors (line numbers
  normalized to 0 so they survive line shifts).
- The pre-commit hook now runs `mypy gget | mypy-baseline filter`,
  which fails only on NEW errors not in the baseline. Verified: a clean
  tree passes; a freshly introduced type error fails.
- Works on pre-commit.ci because the baseline filters by error identity,
  not by file.
- Added mypy-baseline to the test dependency group; excluded the
  generated baseline file from the hygiene hooks.

Refresh the baseline after fixing errors with: mypy gget | mypy-baseline sync
Tracked in #216.
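
To illustrate why normalizing line numbers to 0 lets the baseline survive line shifts, here is a simplified sketch of filtering by error identity. It is not how mypy-baseline is implemented (the real tool also handles details such as repeated errors); the function names are hypothetical.

```python
import re
from typing import Iterable, List, Set

# Matches the "path:line:" or "path:line:col:" prefix of a mypy error line.
_LOCATION = re.compile(r"^(?P<path>[^:\n]+):\d+:(?:\d+:)?")


def normalize(error_line: str) -> str:
    """Replace the line (and column) number with 0 so shifts do not matter."""
    return _LOCATION.sub(
        lambda match: f"{match.group('path')}:0:", error_line.strip(), count=1
    )


def new_errors(current: Iterable[str], baseline: Iterable[str]) -> List[str]:
    """Return errors from `current` whose normalized form is not baselined."""
    known: Set[str] = {normalize(line) for line in baseline}
    return [line for line in current if normalize(line) not in known]
```

An error that only moved to a different line normalizes to the same string as its baseline entry and is filtered out, while a genuinely new error is reported.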

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* CI: update pytest results (dev)

* docs(contributing): note what runs automatically on PRs

Brief section so contributors know pre-commit.ci (ruff/biome/hygiene +
mypy), the test matrix, and the build check run automatically — and
that pre-commit.ci auto-fixes formatting and mypy only flags new errors
(existing baseline grandfathered). Pairs with the existing
upstream-flake note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update CONTRIBUTING.md

* docs(archs4): clarify legacy endpoint vs archs4.org data source (#254) (#259)

* docs(archs4): document legacy endpoint + archs4.org data-source situation

Addresses #254. gget archs4 uses the legacy maayanlab.cloud HTTP
endpoints, which are lightweight and still maintained. The newer
archs4.org dataset has no equivalent lightweight HTTP API — programmatic
access there is only via archs4py + multi-GB HDF5 downloads, which
doesn't fit gget's lightweight-query design. Documented this so users
know where the data comes from, how it differs from archs4.org, and how
to get the newer data (archs4py) if they need it. Migration remains
tracked in #254.

Docs-only; no code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update updates.md

* Update archs4.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* CI: update pytest results (dev)

* fix(ci): clear dev-wide pre-commit failures (ruff D205 + pin mypy stubs)

Two pre-commit.ci failures were hitting every PR (it runs --all-files),
unrelated to the PRs themselves:

1. ruff D205 in gget_g2p.py — the g2p() docstring summary spanned two
   lines with no blank line before the body. Collapsed to a single-line
   summary. (#258 reintroduced this; ruff now passes on all of gget/.)

2. mypy exit 4 on pre-commit.ci while passing locally — the mypy hook's
   stub dependencies (pandas-stubs, types-*) and mypy-baseline were
   unpinned, so pre-commit.ci installed newer stubs that inferred
   different types and surfaced "new" errors absent from the baseline.
   Pinned all type-affecting deps (mypy 1.13.0, mypy-baseline 0.7.4, and
   the four stub packages) in both the pre-commit hook and the test
   dependency group so the baseline is reproducible. Verified the pinned
   set reproduces exactly the 553 baselined errors (0 new) in a fresh
   hook env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Elarwei <168552401+Elarwei001@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Qiaoli Li <67373873+Qiaoli-Li-Res@users.noreply.github.com>