TextEmbedding Module for Multimodality Reproducibility Study #808
Merged
jhnwu3 merged 25 commits into sunlabuiuc:master · Feb 3, 2026
Conversation
Contributor
Author
Added TextEmbedding documentation w/ the same structure as VisionEmbeddingModel.
jhnwu3 (Collaborator) approved these changes Feb 3, 2026 and left a comment:
lgtm, we can always patch if things break when we start building unified multimodal embeddings.
Summary:
Adds a new `TextEmbedding` module for encoding clinical text using pretrained transformer encoders (default: Bio_ClinicalBERT).

Motivation / Context:
Supports the study protocol requiring non-overlapping 128-token text chunks ("cut off in 128 token text bits").
Serves as a reusable embedding component for PyHealth models that consume clinical notes.
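The 128-token chunking and the `max_chunks` guardrail described in this PR could look roughly like the following. This is a minimal pure-Python sketch, not the module's actual implementation; `chunk_token_ids`, its signature, and the warning text are all hypothetical.

```python
import warnings

def chunk_token_ids(token_ids, chunk_size=128, max_chunks=None, pad_id=0):
    """Hypothetical sketch: split a token id sequence into non-overlapping,
    fixed-size chunks, with an optional cap on the number of chunks."""
    chunks = [token_ids[i:i + chunk_size]
              for i in range(0, len(token_ids), chunk_size)]
    if max_chunks is not None and len(chunks) > max_chunks:
        # Guardrail: cap the chunk count and warn, so extremely long
        # notes cannot blow up memory downstream.
        warnings.warn(
            f"Input produced {len(chunks)} chunks; truncating to {max_chunks}.")
        chunks = chunks[:max_chunks]
    # Right-pad the final chunk so every chunk has exactly chunk_size tokens.
    if chunks and len(chunks[-1]) < chunk_size:
        chunks[-1] = chunks[-1] + [pad_id] * (chunk_size - len(chunks[-1]))
    mask = [True] * len(chunks)  # every retained chunk is valid
    return chunks, mask
```

The real module works on tensors and tokenizer output rather than plain lists, but the shape of the logic (non-overlapping slices, cap-and-warn, pad the tail) is the same idea.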
Changes:
- Chunking: splits long clinical notes into non-overlapping, fixed-size chunks (default `chunk_size=128`).
- Pooling: configurable pooling strategy for chunk embeddings (`pooling` can be `none`, `cls`, or `mean`) to support different downstream architectures.
- Performance guardrails: `max_chunks` caps the chunk count and emits a warning when truncation is applied, reducing out-of-memory (OOM) risk on extremely long texts.
- Mask output: returns boolean masks compatible with the PyHealth `TransformerLayer` for downstream attention/masking.
- Backward compatibility: `return_mask` supports legacy call sites expecting a single-tensor return.

API / Behavior Notes:
Inputs:
Outputs:
Embeddings: chunked transformer representations (shape depends on pooling mode; includes an explicit chunk dimension when chunking is enabled).
Mask: boolean tensor indicating valid chunks for each sample, intended for downstream masking in transformer layers.
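One way to picture the per-sample chunk-validity mask is the sketch below. It is a hypothetical illustration using nested lists; the actual module returns a boolean tensor, and `build_chunk_mask` is not a real API name.

```python
def build_chunk_mask(chunk_counts, max_chunks=None):
    """Hypothetical sketch: given the number of valid chunks per sample in a
    batch, build a [batch_size, max_chunks] boolean mask where True marks a
    position that holds a real chunk and False marks padding."""
    if max_chunks is None:
        max_chunks = max(chunk_counts)
    return [[j < n for j in range(max_chunks)] for n in chunk_counts]
```

A downstream transformer layer would consume such a mask to ignore padded chunk positions during attention.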
Pooling behavior:
- `none`: preserves token-level outputs (for architectures that apply their own pooling).
- `cls`: uses the CLS representation per chunk.
- `mean`: mean-pools across tokens per chunk (excluding padding).
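The three modes amount to roughly the following. This is an illustrative sketch over plain Python lists with a hypothetical helper name (`pool_chunk`); the actual module operates on transformer output tensors.

```python
def pool_chunk(token_vectors, token_mask, pooling="mean"):
    """Hypothetical sketch: reduce one chunk's per-token vectors to a single
    chunk representation. token_mask is True for real tokens, False for padding."""
    if pooling == "none":
        return token_vectors            # leave token-level outputs untouched
    if pooling == "cls":
        return token_vectors[0]         # position 0 holds the [CLS] token
    if pooling == "mean":
        # Average only over non-padding tokens, dimension by dimension.
        valid = [v for v, keep in zip(token_vectors, token_mask) if keep]
        return [sum(dim) / len(valid) for dim in zip(*valid)]
    raise ValueError(f"Unknown pooling mode: {pooling!r}")
```

The `none` mode exists so that downstream models can apply their own attention-based pooling instead of a fixed reduction.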
Backward Compatibility:
No breaking changes expected. `return_mask` preserves the legacy single-output return path for existing code.

Performance Considerations:
Runtime/memory scale with the number of chunks derived from input length.
`max_chunks` truncates excessively long inputs and warns when truncation occurs, reducing the likelihood of GPU/CPU OOM.

Files Added:
- `pyhealth/models/text_embedding.py` - main module
- `tests/test_text_embedding.py` - test suite (13 tests; 12 passing)
- `examples/text_embedding_tutorial.ipynb` - tutorial notebook

Files Modified:
- `pyhealth/models/__init__.py` - export `TextEmbedding`
- `pyhealth/models/tfm_tokenizer.py` - lazy-import adjustment to avoid a repo-wide ImportError when optional dependencies are not installed

Tests:
New: 13 tests (12 passing; 1 skipped when CUDA unavailable)
Overall: 595 tests passing (no regressions)
Documentation/Examples:
Docstrings include I/O specifications and parameter descriptions.
Inline rationale included for backward compatibility (`return_mask`) and performance guardrails (`max_chunks`).
Tutorial notebook executed end-to-end successfully.