TextEmbedding Module for Multimodality Reproducibility Study #808
Merged
jhnwu3 merged 25 commits into sunlabuiuc:master · Feb 3, 2026
Conversation
Contributor
Author
Added TextEmbedding documentation w/ the same structure as VisionEmbeddingModel.
jhnwu3 (Collaborator) approved these changes Feb 3, 2026 and left a comment:
lgtm, we can always patch if things break when we start building unified multimodal embeddings.
Summary:
Adds a new `TextEmbedding` module for encoding clinical text using pretrained transformer encoders (default: Bio_ClinicalBERT).

Motivation / Context:
Supports the study protocol requiring non-overlapping 128-token text chunks ("cut off in 128 token text bits").
Serves as a reusable embedding component for PyHealth models that consume clinical notes.
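The 128-token chunking and the `max_chunks` guardrail described in this PR could look roughly like the following. This is a minimal pure-Python sketch, not the module's actual implementation; `chunk_token_ids`, its signature, and the warning text are all hypothetical.

```python
import warnings

def chunk_token_ids(token_ids, chunk_size=128, max_chunks=None, pad_id=0):
    """Hypothetical sketch: split a token id sequence into non-overlapping,
    fixed-size chunks, with an optional cap on the number of chunks."""
    chunks = [token_ids[i:i + chunk_size]
              for i in range(0, len(token_ids), chunk_size)]
    if max_chunks is not None and len(chunks) > max_chunks:
        # Guardrail: cap the chunk count and warn, so extremely long
        # notes cannot blow up memory downstream.
        warnings.warn(
            f"Input produced {len(chunks)} chunks; truncating to {max_chunks}.")
        chunks = chunks[:max_chunks]
    # Right-pad the final chunk so every chunk has exactly chunk_size tokens.
    if chunks and len(chunks[-1]) < chunk_size:
        chunks[-1] = chunks[-1] + [pad_id] * (chunk_size - len(chunks[-1]))
    mask = [True] * len(chunks)  # every retained chunk is valid
    return chunks, mask
```

The real module works on tensors and tokenizer output rather than plain lists, but the shape of the logic (non-overlapping slices, cap-and-warn, pad the tail) is the same idea.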
Changes:
- Chunking: splits long clinical notes into non-overlapping, fixed-size chunks (default `chunk_size=128`).
- Pooling: configurable pooling strategy for chunk embeddings (`pooling` can be `none`, `cls`, or `mean`) to support different downstream architectures.
- Performance guardrails: `max_chunks` caps the chunk count and emits a warning when truncation is applied, reducing out-of-memory (OOM) risk on extremely long texts.
- Mask output: returns boolean masks compatible with the PyHealth `TransformerLayer` for downstream attention/masking.
- Backward compatibility: `return_mask` supports legacy call sites expecting a single-tensor return.

API / Behavior Notes:
Inputs:
Outputs:
Embeddings: chunked transformer representations (shape depends on pooling mode; includes an explicit chunk dimension when chunking is enabled).
Mask: boolean tensor indicating valid chunks for each sample, intended for downstream masking in transformer layers.
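One way to picture the per-sample chunk-validity mask is the sketch below. It is a hypothetical illustration using nested lists; the actual module returns a boolean tensor, and `build_chunk_mask` is not a real API name.

```python
def build_chunk_mask(chunk_counts, max_chunks=None):
    """Hypothetical sketch: given the number of valid chunks per sample in a
    batch, build a [batch_size, max_chunks] boolean mask where True marks a
    position that holds a real chunk and False marks padding."""
    if max_chunks is None:
        max_chunks = max(chunk_counts)
    return [[j < n for j in range(max_chunks)] for n in chunk_counts]
```

A downstream transformer layer would consume such a mask to ignore padded chunk positions during attention.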
Pooling behavior:
- `none`: preserves token-level outputs (for architectures that apply their own pooling).
- `cls`: uses the CLS representation per chunk.
- `mean`: mean-pools across tokens per chunk (excluding padding).
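The three modes amount to roughly the following. This is an illustrative sketch over plain Python lists with a hypothetical helper name (`pool_chunk`); the actual module operates on transformer output tensors.

```python
def pool_chunk(token_vectors, token_mask, pooling="mean"):
    """Hypothetical sketch: reduce one chunk's per-token vectors to a single
    chunk representation. token_mask is True for real tokens, False for padding."""
    if pooling == "none":
        return token_vectors            # leave token-level outputs untouched
    if pooling == "cls":
        return token_vectors[0]         # position 0 holds the [CLS] token
    if pooling == "mean":
        # Average only over non-padding tokens, dimension by dimension.
        valid = [v for v, keep in zip(token_vectors, token_mask) if keep]
        return [sum(dim) / len(valid) for dim in zip(*valid)]
    raise ValueError(f"Unknown pooling mode: {pooling!r}")
```

The `none` mode exists so that downstream models can apply their own attention-based pooling instead of a fixed reduction.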
Backward Compatibility:
No breaking changes expected. `return_mask` preserves the legacy single-output return path for existing code.

Performance Considerations:
Runtime/memory scale with the number of chunks derived from input length.
`max_chunks` truncates excessively long inputs and warns when truncation occurs, reducing the likelihood of GPU/CPU OOM.

Files Added:
- `pyhealth/models/text_embedding.py` - main module
- `tests/test_text_embedding.py` - test suite (13 tests; 12 passing)
- `examples/text_embedding_tutorial.ipynb` - tutorial notebook

Files Modified:
- `pyhealth/models/__init__.py` - export `TextEmbedding`
- `pyhealth/models/tfm_tokenizer.py` - lazy-import adjustment to avoid a repo-wide ImportError when optional dependencies are not installed

Tests:
New: 13 tests (12 passing; 1 skipped when CUDA unavailable)
Overall: 595 tests passing (no regressions)
Documentation/Examples:
Docstrings include I/O specifications and parameter descriptions.
Inline rationale included for backward compatibility (`return_mask`) and performance guardrails (`max_chunks`).
Tutorial notebook executed end-to-end successfully.