Releases: raphschlatt/ads-bib
v0.3.3
v0.3.2
Changed
- Configuration loading is now strict for CLI, Python overrides, and run variants: unknown sections or keys fail early instead of being accepted silently.
- Toponymy outputs now use only canonical snake_case layer columns such as `topic_layer_0_id` and `topic_layer_0_label`; the working-layer `topic_id` and `Name` columns remain.
- The root `ads_bib` package now exports only the primary public entry points; lower-level stage helpers stay importable from their own modules.
- Hugging Face configuration now uses `HF_TOKEN` as the single public token environment variable.
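
To illustrate what strict loading means in practice, the sketch below rejects unknown sections and keys before any work starts. The schema, the `validate_config` helper, the `search.query` key, and the error wording are hypothetical and are not taken from `ads_bib`.

```python
# Hypothetical schema: section name -> allowed keys. Not the real ads_bib schema.
ALLOWED = {
    "search": {"query"},
    "curation": {"clusters_to_remove", "layered_clusters_to_remove"},
}


def validate_config(config: dict) -> None:
    """Raise ValueError on the first unknown section or key."""
    for section, values in config.items():
        if section not in ALLOWED:
            raise ValueError(f"unknown config section: {section!r}")
        unknown = sorted(set(values) - ALLOWED[section])
        if unknown:
            raise ValueError(f"unknown keys in {section!r}: {unknown}")
```

With a check of this kind, a config that still carries a removed or misspelled key stops the run at load time instead of being ignored.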
Removed
- Removed the intermediate `curation.cluster_targets` config key. Use `curation.layered_clusters_to_remove` for layer-aware Toponymy curation.
- Removed legacy `Topic_Layer_X` output columns from new Toponymy runs.
- Removed public `HF_API_KEY` and `HUGGINGFACE_API_KEY` token names from docs, bootstrap, and doctor hints.
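
For anyone migrating an environment, a small check like the following can flag the removed token names. Both helpers are hypothetical and only read environment variables; they say nothing about how `ads_bib` resolves the token internally.

```python
import os


def resolve_hf_token(env=None):
    """Return the token from HF_TOKEN, or None if it is unset or empty."""
    env = os.environ if env is None else env
    return env.get("HF_TOKEN") or None


def legacy_token_names_in_use(env=None):
    """List removed token names that are still set, so a caller can warn about them."""
    env = os.environ if env is None else env
    return [name for name in ("HF_API_KEY", "HUGGINGFACE_API_KEY") if env.get(name)]
```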
v0.3.1
Added
- `ads_bib.run(from_run=...)` now mirrors CLI run variants from Python, including dotted overrides and run-local artifact reuse.
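
Dotted overrides address nested config keys with a path such as `curation.clusters_to_remove`. The sketch below shows one way such overrides could be merged into a nested dict. `apply_dotted_overrides` is a hypothetical helper; how `ads_bib.run` accepts overrides is not shown in these notes.

```python
import copy


def apply_dotted_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config with keys like 'a.b.c' written into nested dicts."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result
```

For example, applying `{"curation.clusters_to_remove": [7]}` to `{"curation": {"clusters_to_remove": []}}` yields `{"curation": {"clusters_to_remove": [7]}}` and leaves the input dict unchanged.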
Improved
- Toponymy topic-map layer colors now spread across the full palette for clearer hierarchy inspection.
- Curation docs now clarify that cluster IDs are run-local and that removals should be applied through a variant from the inspected run.
Fixed
- `curation.clusters_to_remove` is now validated when configs load; scalar values such as `7` fail early with a hint to use `[7]`.
- CLI `run` and `doctor` now report config validation errors without a traceback.
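
The scalar-versus-list check described above can be sketched as follows. The function name and the message wording are assumptions, not the package's actual validator.

```python
def validate_clusters_to_remove(value):
    """Accept a list of integer cluster IDs; reject anything else with a hint."""
    if not isinstance(value, list):
        raise ValueError(
            f"curation.clusters_to_remove must be a list, got {value!r}; "
            f"use [{value!r}] instead"
        )
    # bool is a subclass of int, so exclude it explicitly.
    bad = [item for item in value if isinstance(item, bool) or not isinstance(item, int)]
    if bad:
        raise ValueError(f"cluster IDs must be integers, got {bad!r}")
    return value
```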
v0.3.0
Added
- External source input support for starting runs from prepared publication and reference Parquet files instead of an ADS search/export step.
- Repository utility scripts for preparing source input from Semantic Scholar and INSPIRE exports.
Changed
- The `local_gpu` preset is now the recommended Colab/GPU road: TranslateGemma for translation, Qwen3 embeddings, and Qwen3 topic labels.
- `pipeline.ipynb` is now a preset-driven Colab quickstart that loads `local_gpu`, sets only the example query/run context, and uses notebook-friendly output.
- The previous Gemma embedding/labeling notebook has moved to `notebooks/pipeline_gemma_experiment.ipynb` as a non-primary experiment.
Improved
- Local Transformers translation now batches TranslateGemma calls, retries smaller batches on memory pressure, and applies dynamic generation limits inside the configured maximum.
- Local model stages now share cleanup helpers so GPU memory is released between translation, embeddings, and topic labeling.
- Notebook runs now pre-load the local models through package paths and keep normal progress output cleaner.
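
The "retry smaller batches on memory pressure" behavior can be pictured with the following minimal sketch. It uses Python's built-in `MemoryError` as a stand-in for whatever out-of-memory error the GPU stack raises, and `run_in_batches` and `process_batch` are hypothetical names, not the package's API.

```python
def run_in_batches(items, process_batch, batch_size):
    """Process items in batches, halving the batch size when memory runs out."""
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(process_batch(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # a single item does not fit; nothing smaller to try
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results
```

The reduced batch size is kept for the rest of the run in this sketch; whether the real implementation grows it back is not stated in the notes.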
Fixed
- Local topic labeling now uses chat-template based interaction for local chat models, avoiding instruction text leaking into BERTopic labels.
- Optional TorchCodec import failures are handled for the local Hugging Face stack used in Colab.
- Topic-map rendering now falls back cleanly when datamapplot cannot form cluster boundary polygons.
- The colorspacious SyntaxWarning emitted during topic-map setup is filtered at the runtime logging boundary.
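
Handling a failing optional import usually comes down to a guarded import. The sketch below shows the general pattern. The helper name is hypothetical, and the broad `except` reflects an assumption that a broken optional dependency can raise errors other than `ImportError`.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None when it is missing or fails to load."""
    try:
        return importlib.import_module(module_name)
    except Exception:
        # Assumption: a broken optional dependency may raise more than ImportError
        # (for example when a native library is absent), so catch broadly.
        return None
```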
v0.2.0
Added
- CLI run variants via `ads-bib run --from-run <run-id-or-path> --set ...`, including automatic stage planning, `--dry-run`, downstream artifact hydration for visualization/citation-only variants, and optional `variant` provenance in `run_summary.yaml`.
Changed
- Final dataset bundle exports now clean publication/reference keys, prune dangling reference IDs, and remove placeholder or duplicate author UIDs before writing public Parquet outputs and the dataset manifest.
- New runs now use a modular artifact layout under `runs/<run_id>/data/`, with run-local stage restart points (`search`, `export`, `translated`, `tokenized`, `and`) plus final `dataset` and `citations` outputs; `artifact_layout_version: 2` is recorded in `run_summary.yaml`.
- Translated and tokenized snapshots now carry metadata fingerprints so changed-config variants do not reuse stale source/config combinations, and enabled AND runs let `ads-and` validate its own cache metadata instead of loading disambiguated snapshots directly.
- OpenRouter embedding defaults now use larger documented batch sizes, the OpenRouter preset pins Toponymy-internal embeddings to Qwen3, and OpenRouter retries now fail fast on non-retryable request/auth/payment errors.
- Tokenized snapshot metadata now stores AND source fingerprints so validated AND cache hits can avoid recomputing expensive frame fingerprints on future runs.
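
The idea behind metadata fingerprints can be sketched as hashing the inputs a snapshot was built from and comparing that hash before reuse. The field names, the choice of SHA-256, and both helpers below are assumptions for illustration; the notes do not describe the actual fingerprint format.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """Hash a JSON-serializable payload in a key-order-independent way."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot_is_reusable(snapshot_meta: dict, source_id: str, stage_config: dict) -> bool:
    """Reuse a snapshot only if it was built from the same source and config."""
    expected = fingerprint({"source": source_id, "config": stage_config})
    return snapshot_meta.get("fingerprint") == expected
```

A variant that changes any key in the stage config produces a different hash, so the stale snapshot is skipped and the stage reruns.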
v0.1.1
Fixed
- Toponymy fitting now avoids a large fixed-width Unicode array allocation that could cause memory errors on large corpora.
- Notebook and session resume now load translated, tokenized, and author-disambiguated snapshots even when earlier-stage frames are already in memory.
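
To see why a fixed-width Unicode array becomes expensive on large corpora: NumPy's fixed-width string dtype pads every element to the length of the longest string and uses 4 bytes per character. The helper below only estimates that footprint; it is an illustration, not the fix that was applied.

```python
def fixed_width_unicode_bytes(texts):
    """Bytes a NumPy fixed-width Unicode array would need for these strings.

    Every element is stored at the width of the longest string,
    at 4 bytes per character.
    """
    if not texts:
        return 0
    return len(texts) * max(len(t) for t in texts) * 4
```

By this estimate, one million documents whose longest entry has 20,000 characters would need 1,000,000 × 20,000 × 4 bytes, about 80 GB, regardless of how short the other documents are.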
Changed
- The OpenRouter notebook example now uses Gemini Flash 3 for translation and topic labeling, with Qwen3 embeddings.
v0.1.0
Added
- Shared package runner in `ads_bib.pipeline` with structured `PipelineConfig`, named stages, and reusable stage functions.
- Thin CLI batch entrypoint: `ads-bib run --config ...` with optional `--from`, `--to`, `--run-name`, and `--set` overrides.
- Notebook adapter in `ads_bib.notebook` with `NotebookSession` and package-side config invalidation.
- Native `huggingface_api` translation path via `huggingface_hub.AsyncInferenceClient`.
- Official packaged runtime presets exposed via `ads-bib run --preset ...` and `ads-bib preset write ...`.
- Workspace bootstrap and stage-aware doctor commands for first-run setup and preflight validation.
- Offline HF provider smoke coverage plus env-gated live HF smoke tests for translation, embeddings, and BERTopic labeling.
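
Env-gated live tests are typically skipped unless a credential is present. The sketch below shows the pattern with `unittest`. Gating on `HF_TOKEN` is an assumption, and the test class and its body are placeholders rather than the project's actual tests.

```python
import os
import unittest


def live_hf_enabled(env=None):
    """True when a token is present, so live tests can run."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN"))


class LiveTranslationSmokeTest(unittest.TestCase):
    @unittest.skipUnless(live_hf_enabled(), "set HF_TOKEN to run live HF smoke tests")
    def test_translation_round_trip(self):
        # Placeholder body: a real test would call the provider here.
        self.assertTrue(live_hf_enabled())
```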
Changed
- Base `ads-bib` installs now own the official runtime stack; only non-default algorithm overrides remain behind the `umap` and `hdbscan` extras.
- Pin `datamapplot` to `>=0.6.4,<0.7`: 0.7.x changed the `selection_handlers` layout and breaks `ads_bib.visualize` until imports are updated.
- GitHub Actions now install only the active base contract plus `test`, `umap`, and `hdbscan`; removed references to historical extras and install profiles.
- `pipeline.ipynb` now uses explicit section dicts plus `NotebookSession`; it no longer owns config assembly, invalidation, `globals()` syncing, or `START_STAGE`/`STOP_STAGE`.
- Stage slicing remains a CLI/YAML concern; notebook reruns are driven by executing the corresponding stage cell.
- Notebook stage cells are now strict and no longer auto-chain earlier stages such as `translate -> export`.
- Fresh in-memory notebook state now takes precedence over same-stage translated/tokenized/disambiguated snapshots when a config change invalidates later stages.
- Run config snapshots are now serialized from structured pipeline config instead of raw notebook globals.
- Prompt selection now supports `topic_model.llm_prompt_name` with package-side resolution and `.env` fallbacks for ADS/OpenRouter secrets.
- Tokenization defaults now use `en_core_web_md` rather than `en_core_web_lg`.
- AND integration remains optional, but the active path is now the source-based external adapter rather than a placeholder notebook contract.
- `run_pipeline()` remains the dependency-aware batch path; notebook stage execution now has intentionally different UX semantics.
- Runtime output is now frontend-aware: CLI runs use compact stage-first console output, notebook runs stay slightly more explanatory, and raw third-party stdout/stderr is redirected into `runs/<run_id>/logs/runtime.log`.
- Nested progress-bar noise was reduced so normal runs show at most one primary progress bar per stage.
- `huggingface_api` embeddings now use the native Hugging Face async client instead of LiteLLM, while BERTopic labeling keeps BERTopic's LiteLLM adapter with normalized HF-native model ids.
- Pipeline config preparation now injects `HF_TOKEN` into translation, embedding, and BERTopic labeling configs when `huggingface_api` is selected.
- CLI runs now persist `run_summary.yaml` just like notebook runs, including partial/failure status metadata.
- OpenRouter and Hugging Face chat translation now share the same centralized scientific translation prompt contract.
- Official runtime roads now ship as four packaged generic presets accessed via CLI rather than repo-root YAML files.
- Stable local presets now pin only GGUF model families that are validated against the baseline `ADS_env` runtime; the CPU labeling preset uses `Qwen/Qwen2.5-0.5B-Instruct-GGUF` instead of unsupported `qwen35` variants.
- Base runtime dependencies now include the provider and topic stack needed by the four official roads; `huggingface-hub` remains part of that default install.
- Hugging Face API key resolution now accepts `HF_TOKEN`, `HF_API_KEY`, and `HUGGINGFACE_API_KEY`.
- Core runtime dependencies now include `pyarrow` and `networkx`, and translation now validates the `openai` dependency for OpenRouter before execution.
- Packaging extras no longer expose the obsolete `translate-local`/`translate-api` names; the remaining extras are `test`, `umap`, and `hdbscan`.
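
Redirecting noisy library output into a per-run log file can be sketched with the standard library's `contextlib` helpers. `capture_third_party_output` is a hypothetical name; only the `logs/runtime.log` location comes from the notes above. This version captures Python-level writes only, so output written directly to the process file descriptors by native code would need a lower-level approach.

```python
import contextlib
from pathlib import Path


@contextlib.contextmanager
def capture_third_party_output(run_dir):
    """Send stdout and stderr to <run_dir>/logs/runtime.log while the block runs."""
    log_path = Path(run_dir) / "logs" / "runtime.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as log_file:
        with contextlib.redirect_stdout(log_file), contextlib.redirect_stderr(log_file):
            yield log_path
```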
Docs
- Site configuration lives at `zensical.toml` in the repository root; build and preview use `zensical ...` from the root (including GitHub Actions).
- `Package_ToDo.md` remains maintainer-local; repository cleanup no longer depends on a versioned backlog file in the public tree.
- Removed `CLAUDE.md`; repository engineering rules and conventions live in `AGENTS.md` only.
- Public docs and metadata now position the installed package and CLI as the primary runtime path, with `pipeline.ipynb` documented as an optional GitHub companion.
- `AGENTS.md` architecture notes now record the notebook-session adapter and the source-based AND step.
- README/runtime docs now document `ads-bib run` as the happy path, keep `bootstrap` and `doctor` as support commands, and treat `huggingface_api` as a full official road across both topic backends.