
TextEmbedding Module for Multimodality Reproducibility Study#808

Merged
jhnwu3 merged 25 commits into sunlabuiuc:master from Rian354:master
Feb 3, 2026

Conversation


Rian354 (Contributor) commented Feb 3, 2026

Summary:
Adds a new TextEmbedding module for encoding clinical text using pretrained transformer encoders (default: Bio_ClinicalBERT).

Motivation / Context:
Supports the study protocol requiring non-overlapping 128-token text chunks ("cut off in 128 token text bits").

Serves as a reusable embedding component for PyHealth models that consume clinical notes.

Changes:

  • Chunking: splits long clinical notes into non-overlapping, fixed-size chunks (default chunk_size=128).

  • Pooling: configurable pooling strategy for chunk embeddings (pooling can be none, cls, or mean) to support different downstream architectures.

  • Performance guardrails: max_chunks caps the chunk count and emits a warning when truncation is applied, reducing out-of-memory (OOM) risk on extremely long texts.

  • Mask output: returns boolean masks compatible with PyHealth TransformerLayer for downstream attention/masking.

  • Backward compatibility: a return_mask flag preserves legacy call sites expecting a single-tensor return.
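As a rough sketch of the chunking guardrail described above (illustrative only; the function name and exact behavior are assumptions, not the module's actual code), the idea is to slice a tokenized note into non-overlapping windows and warn when max_chunks forces truncation:

```python
import warnings
from typing import List, Optional

def chunk_token_ids(
    token_ids: List[int],
    chunk_size: int = 128,
    max_chunks: Optional[int] = None,
) -> List[List[int]]:
    """Split a token-id sequence into non-overlapping fixed-size chunks.

    The final chunk may be shorter than chunk_size. If max_chunks is set,
    the chunk list is capped and a warning is emitted on truncation.
    """
    chunks = [
        token_ids[i : i + chunk_size]
        for i in range(0, len(token_ids), chunk_size)
    ]
    if max_chunks is not None and len(chunks) > max_chunks:
        warnings.warn(
            f"Input produced {len(chunks)} chunks; truncating to {max_chunks}."
        )
        chunks = chunks[:max_chunks]
    return chunks
```

A 300-token note with chunk_size=128 would yield chunks of 128, 128, and 44 tokens; setting max_chunks=2 drops the tail and warns.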

API / Behavior Notes:

Inputs:

  • clinical text strings (single note or batched notes).

Outputs:

  • Embeddings: chunked transformer representations (shape depends on pooling mode; includes an explicit chunk dimension when chunking is enabled).

  • Mask: boolean tensor indicating valid chunks for each sample, intended for downstream masking in transformer layers.
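The chunk-level mask can be pictured with a small sketch (a hypothetical helper, not the module's code): given how many real chunks each sample produced, mark valid positions True and padded positions False in a (batch, max_chunks) boolean array.

```python
import numpy as np

def build_chunk_mask(chunk_counts, max_chunks):
    """Boolean mask over a padded chunk dimension.

    chunk_counts: number of real chunks per sample in the batch.
    Returns an array of shape (batch, max_chunks); True marks a valid
    chunk, False marks padding, as expected by downstream attention.
    """
    mask = np.zeros((len(chunk_counts), max_chunks), dtype=bool)
    for row, count in enumerate(chunk_counts):
        mask[row, :count] = True
    return mask
```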

Pooling behavior:

  • none: preserves token-level outputs (for architectures that apply their own pooling).

  • cls: uses the CLS representation per chunk.

  • mean: mean-pools across tokens per chunk (excluding padding).
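The three pooling modes can be sketched as follows (an illustrative NumPy version under assumed tensor shapes; the real module presumably operates on PyTorch tensors):

```python
import numpy as np

def pool_chunk_embeddings(hidden, attention_mask, pooling="mean"):
    """Pool per-chunk token embeddings.

    hidden:         (num_chunks, seq_len, dim) token-level encoder outputs.
    attention_mask: (num_chunks, seq_len), 1 for real tokens, 0 for padding.
    """
    if pooling == "none":
        return hidden                      # keep token-level outputs
    if pooling == "cls":
        return hidden[:, 0, :]             # CLS token sits at position 0
    if pooling == "mean":
        mask = attention_mask[..., None].astype(hidden.dtype)
        summed = (hidden * mask).sum(axis=1)
        counts = np.maximum(mask.sum(axis=1), 1.0)  # avoid divide-by-zero
        return summed / counts             # padding excluded from the mean
    raise ValueError(f"Unknown pooling mode: {pooling!r}")
```

Masked mean-pooling is what keeps padded tokens from diluting the chunk embedding, which is why the mask must be applied before summing.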

Backward Compatibility:

  • No breaking changes expected.

  • return_mask preserves the legacy single-output return path for existing code.
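The legacy return path can be illustrated with a minimal shim (hypothetical names and placeholder computation; only the return-shape contract reflects the PR description): new callers receive (embeddings, mask), while legacy callers passing return_mask=False get the single tensor they expect.

```python
import numpy as np

def embed(notes, return_mask=True):
    """Illustrative return-path shim, not the module's actual forward().

    return_mask=True  -> (embeddings, chunk_mask) tuple for new callers.
    return_mask=False -> embeddings only, matching legacy call sites.
    """
    # Placeholder outputs standing in for real encoder results.
    embeddings = np.zeros((len(notes), 4, 8))
    mask = np.ones((len(notes), 4), dtype=bool)
    if return_mask:
        return embeddings, mask
    return embeddings
```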

Performance Considerations:

  • Runtime/memory scale with the number of chunks derived from input length.

  • max_chunks truncates excessively long inputs and warns when truncation occurs, reducing likelihood of GPU/CPU OOM.

Files Added:

  • pyhealth/models/text_embedding.py - main module

  • tests/test_text_embedding.py - test suite (13 tests; 12 passing, 1 skipped)

  • examples/text_embedding_tutorial.ipynb - tutorial notebook

Files Modified:

  • pyhealth/models/__init__.py - export TextEmbedding

  • pyhealth/models/tfm_tokenizer.py - lazy import adjustment to avoid a repo-wide ImportError when optional dependencies are not installed

Tests:

New: 13 tests (12 passing; 1 skipped when CUDA unavailable)

Overall: 595 tests passing (no regressions)

Documentation/Examples:

  • Docstrings include I/O specifications and parameter descriptions.

  • Inline rationale included for backward-compatibility (return_mask) and performance guardrails (max_chunks).

  • Tutorial notebook executed end-to-end successfully.


jhnwu3 (Collaborator) left a comment


Can you add it to the docs/ like Josh did it here?

#806


Rian354 (Contributor, Author) commented Feb 3, 2026

Added TextEmbedding documentation with the same structure as VisionEmbeddingModel.


jhnwu3 (Collaborator) left a comment


lgtm, we can always patch if things break when we start building unified multimodal embeddings.

@jhnwu3 jhnwu3 merged commit 5d0422b into sunlabuiuc:master Feb 3, 2026
1 check passed
