A small PyTorch project for studying latent flow text-to-image training.
The repository trains a compact generator on MONET image latents and caption embeddings. It is meant to make the training stack inspectable: data shards, latent shapes, flow matching, sampling, prompt conditioning, checkpoint galleries, and a simple Self-Flow-lite ablation.
Read the visual walkthrough: What training a tiny text-to-image model looks like
The core path is:
caption -> frozen text embedding
image -> precomputed image latent
noise + text + timestep
-> trainable latent flow model
-> generated latent
-> frozen decoder
-> image
- A minimal latent flow trainer in PyTorch.
- A small DiT-style transformer over 32 x 16 x 16 image latents.
- Baseline rectified-flow / flow-matching training.
- Self-Flow-lite with per-token timesteps, masked corruption, and an auxiliary reconstruction loss.
- MONET subset preparation from Hugging Face datasets.
- Synthetic smoke tests that run without external data.
- Sampling, VAE decoding, checkpoint galleries, and prompt-conditioning diagnostics.
- Optional GCS, GCE, and Vertex helpers for user-provided cloud projects.
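As a rough illustration of how a DiT-style transformer can consume a 32 x 16 x 16 latent, the sketch below flattens each latent into a sequence of 256 tokens with 32 channels each. It assumes the shape is channels x height x width and a patch size of 1; the helper names are hypothetical and the repository's own model may patchify differently.

```python
import torch


def latents_to_tokens(latents: torch.Tensor) -> torch.Tensor:
    """Flatten (B, C, H, W) latents into (B, H*W, C) tokens, one per latent position."""
    # Assumption: patch size 1, so every spatial position becomes one token.
    return latents.flatten(2).transpose(1, 2)


def tokens_to_latents(tokens: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Inverse of latents_to_tokens: (B, H*W, C) back to (B, C, H, W)."""
    batch, _, channels = tokens.shape
    return tokens.transpose(1, 2).reshape(batch, channels, height, width)


latents = torch.randn(4, 32, 16, 16)            # batch of 4 fake latents
tokens = latents_to_tokens(latents)             # shape (4, 256, 32)
restored = tokens_to_latents(tokens, 16, 16)    # shape (4, 32, 16, 16)
```

The round trip is lossless, which makes it a convenient shape test before wiring tokens into attention layers.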
configs/ Local and placeholder cloud training configs
docs/ Source notes and cloud setup notes
scripts/ Data prep, sampling, diagnostics, and optional cloud helpers
src/ Python package source
tests/ Shape/objective tests
Create a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
For the fastest local smoke test, the base install is enough:
python -m pip install -e .
For MONET data preparation:
python -m pip install -e ".[data]"
For latent decoding:
python -m pip install -e ".[decode]"
For cloud helpers:
python -m pip install -e ".[cloud]"
The synthetic smoke test creates fake latent shards and trains for a few steps. It does not download MONET or decode images.
python scripts/create_toy_subset.py --output-dir data/processed/toy --num-samples 128
python -m monet_flow.train --config configs/smoke.yaml
python -m monet_flow.train --config configs/smoke_self_flow_lite.yaml
Expected outputs:
outputs/smoke/
outputs/smoke_self_flow_lite/
Each output folder contains:
config.resolved.json
metrics.jsonl
checkpoints/
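For a quick look at a run, metrics.jsonl can be parsed with the standard library. This is a minimal sketch: the record keys `step` and `loss` are assumptions about the file's schema, not something this README specifies, so adjust them to whatever the trainer actually writes.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def series(records, key, step_key="step"):
    """Return (step, value) pairs for records that contain both keys."""
    # Assumption: each record stores its training step under `step_key`.
    return [(r[step_key], r[key]) for r in records if step_key in r and key in r]


example = ['{"step": 1, "loss": 0.5}', "", '{"step": 2, "loss": 0.25}']
loss_curve = series(parse_jsonl(example), "loss")
```

With a real run, pass an open file handle instead of the example list, for instance `parse_jsonl(open("outputs/smoke/metrics.jsonl", encoding="utf-8"))`.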
MONET includes precomputed SANA DC-AE latents, so this project can train a small generator in latent space instead of directly on pixels.
Before downloading MONET, review the dataset card and license terms linked in
docs/sources.md. If Hugging Face requires authentication in
your environment, log in with the Hugging Face CLI or set HF_TOKEN.
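The larger subset command in this section passes quality and geometry thresholds to scripts/prepare_monet_subset.py. The sketch below shows one way such thresholds can be read as a per-sample predicate. The metadata keys (`aesthetic`, `nsfw`, `watermark`, `width`, `height`) and the width / height aspect convention are assumptions for illustration; the script's real column names and logic may differ.

```python
from dataclasses import dataclass


@dataclass
class FilterThresholds:
    # Defaults mirror the flag values used in the larger subset command.
    min_aesthetic: float = 5.5
    max_nsfw: float = 0.2
    max_watermark: float = 0.2
    min_aspect: float = 0.75
    max_aspect: float = 1.33
    min_least_dimension: int = 512


def keep_sample(meta: dict, th: FilterThresholds) -> bool:
    """Return True when a sample passes every threshold (hypothetical field names)."""
    width, height = meta["width"], meta["height"]
    aspect = width / height  # assumption: aspect ratio is width over height
    return (
        meta["aesthetic"] >= th.min_aesthetic
        and meta["nsfw"] <= th.max_nsfw
        and meta["watermark"] <= th.max_watermark
        and th.min_aspect <= aspect <= th.max_aspect
        and min(width, height) >= th.min_least_dimension
    )


good = {"aesthetic": 6.1, "nsfw": 0.01, "watermark": 0.05, "width": 1024, "height": 1024}
too_small = dict(good, width=400, height=400)
too_wide = dict(good, width=2048, height=1024)
```

Strict filters like these explain why `--max-scanned` is much larger than `--max-samples`: many scanned rows are rejected before a shard fills up.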
Prepare a tiny probe:
python scripts/prepare_monet_subset.py \
--output-dir data/processed/monet_probe \
--max-samples 8 \
--shard-size 8
Prepare a larger filtered subset:
python scripts/prepare_monet_subset.py \
--output-dir data/processed/monet_10k \
--max-samples 10000 \
--max-scanned 150000 \
--shard-size 1024 \
--caption-column caption_gemini-2.5-flash-lite \
--min-aesthetic 5.5 \
--max-nsfw 0.2 \
--max-watermark 0.2 \
--min-aspect 0.75 \
--max-aspect 1.33 \
--min-least-dimension 512
Split into train and validation shards:
python scripts/split_subset.py \
--input-dir data/processed/monet_10k \
--train-dir data/processed/monet_train \
--val-dir data/processed/monet_val \
--val-shards 1
Baseline flow matching:
python -m monet_flow.train --config configs/baseline.yaml
Self-Flow-lite:
python -m monet_flow.train --config configs/self_flow_lite.yaml
The local configs expect:
data/processed/monet_train/
data/processed/monet_val/
Generate and decode a grid:
python -m monet_flow.sample \
--config configs/baseline.yaml \
--checkpoint outputs/monet_baseline_flow/checkpoints/latest.pt \
--output-dir samples/manual_baseline \
--steps 64 \
--guidance-scale 1.0 \
--decode \
--prompts \
"a red car on a road" \
"a cat sitting on grass" \
"a mountain landscape at sunset" \
"a bowl of fruit on a table"
The decoded grid is written to:
samples/manual_baseline/grid.png
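To make `--steps` and `--guidance-scale` concrete, here is a minimal Euler sampler with classifier-free guidance. It is a sketch under stated assumptions, not the code in monet_flow.sample: it assumes the model predicts a velocity, that t = 0 is pure noise and t = 1 is data, and that the model is called as `model(x, t, text_emb)`. The `null_emb` argument (an empty-prompt embedding) is also hypothetical.

```python
import torch


@torch.no_grad()
def sample_latents(model, text_emb, null_emb, noise, steps=64, guidance_scale=1.0):
    """Integrate dx/dt = v(x, t, text) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = noise
    dt = 1.0 / steps
    for i in range(steps):
        # One timestep value per batch element.
        t = torch.full((x.shape[0],), i * dt, device=x.device, dtype=x.dtype)
        v_cond = model(x, t, text_emb)
        if guidance_scale != 1.0:
            # Classifier-free guidance: extrapolate away from the unconditional velocity.
            v_uncond = model(x, t, null_emb)
            v = v_uncond + guidance_scale * (v_cond - v_uncond)
        else:
            v = v_cond
        x = x + dt * v
    return x


# Toy check: a "model" that always points straight at a known target latent.
target = torch.randn(2, 32, 16, 16)
toy_model = lambda x, t, emb: (emb - x) / (1.0 - t).view(-1, 1, 1, 1)
out = sample_latents(toy_model, target, target, torch.randn(2, 32, 16, 16), steps=8)
```

With a guidance scale of 1.0 the unconditional pass is skipped entirely, so sampling costs one model call per step.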
Use the same prompts and the same initial noise for every checkpoint:
scripts/sample_progress.sh \
configs/baseline.yaml \
outputs/monet_baseline_flow \
--prompts \
"a red car on a road" \
"a cat sitting on grass" \
"a mountain landscape at sunset" \
"a bowl of fruit on a table"
This writes:
samples/progress/
step-0002500/grid.png
step-0005000/grid.png
index.json
gallery.html
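Reusing the same initial noise across checkpoints means any change between grids comes from the weights rather than the random draw. One common way to get that in PyTorch is a seeded `torch.Generator`, sketched below. The function name, default seed, and latent shape are assumptions; scripts/sample_progress.sh may implement this differently.

```python
import torch


def fixed_noise(num_prompts, seed=0, shape=(32, 16, 16)):
    """Return the same (num_prompts, *shape) noise tensor for a given seed."""
    # A dedicated CPU generator keeps the draw independent of global RNG state.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    return torch.randn((num_prompts, *shape), generator=generator)


noise_a = fixed_noise(4, seed=123)
noise_b = fixed_noise(4, seed=123)   # identical to noise_a
noise_c = fixed_noise(4, seed=124)   # a different draw
```

Generating on CPU and then moving the tensor to the target device keeps the noise identical across machines with different accelerators.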
The baseline uses one timestep per image.
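A minimal sketch of a one-timestep-per-image objective follows. It assumes a linear path between noise (t = 0) and data (t = 1), uniform timestep sampling, and a `model(x_t, t, text_emb)` call that returns a velocity; the trainer in this repository may use a different convention or timestep distribution.

```python
import torch
import torch.nn.functional as F


def flow_matching_loss(model, latents, text_emb):
    """Rectified-flow style loss with a single timestep shared by each image."""
    noise = torch.randn_like(latents)
    # One t per image, broadcast over channels and spatial positions.
    t = torch.rand(latents.shape[0], device=latents.device)
    t_b = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_b) * noise + t_b * latents   # point on the straight path
    target = latents - noise                    # constant velocity along that path
    pred = model(x_t, t, text_emb)
    return F.mse_loss(pred, target)


# Toy check with a "model" that always predicts zero velocity.
zero_model = lambda x, t, emb: torch.zeros_like(x)
loss = flow_matching_loss(zero_model, torch.randn(4, 32, 16, 16), torch.randn(4, 8))
```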
Self-Flow-lite changes the corruption pattern:
normal flow:
every latent patch uses the same t
self-flow-lite:
each latent patch can use a different t
some patches can be heavily corrupted
an auxiliary head reconstructs clean latent tokens
The goal is to push the model to use local context and text conditioning instead of only learning one uniform cleanup rule. In small matched runs, Self-Flow-lite can improve the velocity loss before the samples become prompt-coherent, so inspect loss and generated grids together.
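The per-token idea can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: tokens have shape (B, N, C), t = 0 means pure noise, "heavily corrupted" is modelled by forcing a random subset of tokens to t = 0, and the model returns a `(velocity, clean_reconstruction)` pair. The `mask_ratio` and `aux_weight` values are placeholders.

```python
import torch
import torch.nn.functional as F


def corrupt_tokens(tokens, mask_ratio=0.25):
    """Corrupt each token with its own timestep; a random subset becomes pure noise."""
    batch, num_tokens, _ = tokens.shape
    noise = torch.randn_like(tokens)
    t = torch.rand(batch, num_tokens, device=tokens.device)
    heavy = torch.rand(batch, num_tokens, device=tokens.device) < mask_ratio
    t = torch.where(heavy, torch.zeros_like(t), t)   # heavy tokens get t = 0
    t_b = t.unsqueeze(-1)
    x_t = (1.0 - t_b) * noise + t_b * tokens
    return x_t, t, noise, heavy


def self_flow_lite_loss(model, tokens, text_emb, mask_ratio=0.25, aux_weight=0.1):
    """Velocity loss plus an auxiliary clean-token reconstruction loss."""
    x_t, t, noise, _ = corrupt_tokens(tokens, mask_ratio)
    velocity_pred, clean_pred = model(x_t, t, text_emb)
    velocity_loss = F.mse_loss(velocity_pred, tokens - noise)
    aux_loss = F.mse_loss(clean_pred, tokens)
    return velocity_loss + aux_weight * aux_loss, velocity_loss, aux_loss


tokens = torch.randn(2, 256, 32)
zero_model = lambda x, t, emb: (torch.zeros_like(x), torch.zeros_like(x))
total, v_loss, aux = self_flow_lite_loss(zero_model, tokens, None)
```

Returning the velocity and auxiliary terms separately makes it easier to log them side by side, which matters here because the velocity loss alone can improve before samples follow the prompt.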
The trainer supports gs:// data and output paths when the cloud extra is
installed.
Example configs:
configs/baseline_gcs.yaml
configs/self_flow_lite_gcs.yaml
configs/monet_probe_gcs_cpu.yaml
Replace placeholder bucket paths with your own storage location.
Sync local shards:
scripts/sync_to_gcs.sh \
data/processed/monet_train \
gs://YOUR_BUCKET/datasets/monet_10k/train
scripts/sync_to_gcs.sh \
data/processed/monet_val \
gs://YOUR_BUCKET/datasets/monet_10k/val
Create basic cloud assets:
PROJECT_ID=YOUR_PROJECT_ID \
REGION=us-central1 \
BUCKET_NAME=YOUR_BUCKET_NAME \
scripts/create_gcp_assets.sh
Launch a GCE training worker:
PROJECT_ID=YOUR_PROJECT_ID \
SERVICE_ACCOUNT=YOUR_SERVICE_ACCOUNT_EMAIL \
ZONE=us-central1-a \
MACHINE_TYPE=g2-standard-8 \
ACCELERATOR_TYPE= \
scripts/launch_gce_training.sh \
configs/baseline_gcs.yaml \
gs://YOUR_BUCKET
Submit a Vertex package job:
PACKAGE_URI="$(scripts/build_and_upload_package.sh gs://YOUR_BUCKET)"
PROJECT_ID=YOUR_PROJECT_ID \
REGION=us-central1 \
scripts/submit_vertex_package_job.sh \
"$PACKAGE_URI" \
monet-flow-tiny-baseline \
configs/baseline_gcs.yaml \
gs://YOUR_BUCKET \
monet_flow.train
See docs/cloud.md for the same flow as a compact guide.
This repo keeps generated artifacts out of git. Dataset shards, decoded samples, checkpoints, and model weights belong in local output folders or user-controlled storage.
Cloud configs use placeholders such as:
gs://REPLACE_WITH_BUCKET/...
Keep service-account files, tokens, private project ids, and bucket names outside
the repository. The helper scripts read explicit environment variables such as
PROJECT_ID, BUCKET_NAME, SERVICE_ACCOUNT, and HF_TOKEN when you provide
them.
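Before launching a cloud run, it can help to confirm that no placeholder survived in a config. The helper below is a hypothetical convenience, not part of this repository; it only scans text for `REPLACE_WITH_...` and `YOUR_...` style tokens like the ones used in this README.

```python
import re

# Matches placeholder tokens such as REPLACE_WITH_BUCKET or YOUR_PROJECT_ID.
PLACEHOLDER_PATTERN = re.compile(r"REPLACE_WITH_[A-Z_]+|YOUR_[A-Z_]+")


def find_placeholders(text):
    """Return the sorted, de-duplicated placeholder tokens found in text."""
    return sorted(set(PLACEHOLDER_PATTERN.findall(text)))


config_text = "output_dir: gs://REPLACE_WITH_BUCKET/runs\nproject: YOUR_PROJECT_ID\n"
leftovers = find_placeholders(config_text)
```

An empty result means the text contains none of these patterns; it does not validate that the bucket or project actually exists.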
Run tests:
python -m pytest
Run lint:
python -m ruff check .
Evaluate a checkpoint:
python -m monet_flow.evaluate \
--config configs/baseline.yaml \
--checkpoint outputs/monet_baseline_flow/checkpoints/latest.pt \
--output outputs/eval.json
Diagnose prompt sensitivity:
python scripts/diagnose_checkpoint_conditioning.py \
--config configs/baseline.yaml \
--checkpoint outputs/monet_baseline_flow/checkpoints/latest.pt \
--output-dir samples/diagnostics/baseline_latest \
--steps 64 \
--decode
The SANA DC-AE decoder expects latents to be divided by its configured scaling
factor before decode. The decoder helper in src/monet_flow/vae.py handles
this:
decoded = vae.decode(latents.float() / scaling_factor).sample
If decoded images look washed out or incorrectly scaled, check this path first.
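When checking this path, it can be useful to look at latent statistics before and after the division. The sketch below is a hypothetical diagnostic, separate from src/monet_flow/vae.py; the scaling factor of 0.5 is a placeholder, and the real value should be read from the decoder's configuration.

```python
import torch


def prepare_for_decode(latents, scaling_factor):
    """Cast to float32 and undo the latent scaling before calling the decoder."""
    if scaling_factor <= 0:
        raise ValueError("scaling_factor must be positive")
    return latents.float() / scaling_factor


def latent_stats(latents):
    """Summarize a latent tensor so scaled and unscaled versions can be compared."""
    return {
        "mean": latents.mean().item(),
        "std": latents.std().item(),
        "abs_max": latents.abs().max().item(),
    }


# Placeholder values for illustration only.
scaled = torch.full((2, 32, 16, 16), 0.5, dtype=torch.float16)
ready = prepare_for_decode(scaled, scaling_factor=0.5)
stats = latent_stats(ready)
```

If the statistics of generated latents sit far from those of dataset latents after the same division, the mismatch is more likely in the model or sampler than in the decoder.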
This repository uses MONET metadata and latents and compares against ideas from Self-Flow. See docs/sources.md for dataset, paper, and third-party project links. This repository does not redistribute MONET itself.
MIT. See LICENSE.