silentvoice/monet-flow-tiny

MONET Flow Tiny

A small PyTorch project for studying latent flow text-to-image training.

The repository trains a compact generator on MONET image latents and caption embeddings. It is meant to make the training stack inspectable: data shards, latent shapes, flow matching, sampling, prompt conditioning, checkpoint galleries, and a simple Self-Flow-lite ablation.

Read the visual walkthrough: What training a tiny text-to-image model looks like

The core path is:

caption -> frozen text embedding
image   -> precomputed image latent

noise + text + timestep
  -> trainable latent flow model
  -> generated latent
  -> frozen decoder
  -> image

What's Here

  • A minimal latent flow trainer in PyTorch.
  • A small DiT-style transformer over 32 x 16 x 16 image latents.
  • Baseline rectified-flow / flow-matching training.
  • Self-Flow-lite with per-token timesteps, masked corruption, and an auxiliary reconstruction loss.
  • MONET subset preparation from Hugging Face datasets.
  • Synthetic smoke tests that run without external data.
  • Sampling, VAE decoding, checkpoint galleries, and prompt-conditioning diagnostics.
  • Optional GCS, GCE, and Vertex helpers for user-provided cloud projects.
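The baseline objective in the list above can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's trainer: it assumes the common rectified-flow convention in which t = 0 is the clean latent and t = 1 is pure noise, and the names flow_matching_pair and velocity_loss are hypothetical.

```python
import numpy as np


def flow_matching_pair(x0, noise, t):
    """Build the noisy input and velocity target for one batch.

    Assumed convention (not confirmed by the repository): t=0 is the clean
    latent, t=1 is pure noise, and the path between them is a straight line.
    """
    t = t.reshape(-1, 1, 1, 1)        # one timestep per image, broadcast over C, H, W
    x_t = (1.0 - t) * x0 + t * noise  # point on the straight path at time t
    target = noise - x0               # constant velocity d(x_t)/dt along that path
    return x_t, target


def velocity_loss(pred, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 16, 16))  # batch of 32 x 16 x 16 latents
noise = rng.standard_normal(x0.shape)
t = rng.uniform(0.0, 1.0, size=4)

x_t, target = flow_matching_pair(x0, noise, t)
loss_perfect = velocity_loss(target, target)             # a perfect prediction
loss_zero = velocity_loss(np.zeros_like(target), target)  # an all-zero prediction
```

In a real training step the prediction would come from the transformer, called with x_t, t, and the caption embedding.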

Project Layout

configs/      Local and placeholder cloud training configs
docs/         Source notes and cloud setup notes
scripts/      Data prep, sampling, diagnostics, and optional cloud helpers
src/          Python package source
tests/        Shape/objective tests

Install

Create a virtual environment:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

For the fastest local smoke test, the base install is enough:

python -m pip install -e .

For MONET data preparation:

python -m pip install -e ".[data]"

For latent decoding:

python -m pip install -e ".[decode]"

For cloud helpers:

python -m pip install -e ".[cloud]"

Quick Smoke Test

This creates fake latent shards and trains for a few steps. It does not download MONET or decode images.

python scripts/create_toy_subset.py --output-dir data/processed/toy --num-samples 128
python -m monet_flow.train --config configs/smoke.yaml
python -m monet_flow.train --config configs/smoke_self_flow_lite.yaml

Expected outputs:

outputs/smoke/
outputs/smoke_self_flow_lite/

Each output folder contains:

config.resolved.json
metrics.jsonl
checkpoints/
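To check a run quickly, metrics.jsonl can be read with the standard library. The sketch below assumes only that the file holds one JSON object per line. The key name "loss" in the usage comment is a hypothetical example, so check a line of your own file for the actual keys.

```python
import json
from pathlib import Path


def load_metrics(path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def last_value(records, key):
    """Return the most recent value logged under `key`, or None if absent."""
    for record in reversed(records):
        if key in record:
            return record[key]
    return None


# Usage (the key name is an assumption):
#   records = load_metrics("outputs/smoke/metrics.jsonl")
#   print(last_value(records, "loss"))
```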

Prepare MONET Latents

MONET includes precomputed SANA DC-AE latents, so this project can train a small generator in latent space instead of directly on pixels.

Before downloading MONET, review the dataset card and license terms linked in docs/sources.md. If Hugging Face requires authentication in your environment, log in with the Hugging Face CLI or set HF_TOKEN.

Prepare a tiny probe:

python scripts/prepare_monet_subset.py \
  --output-dir data/processed/monet_probe \
  --max-samples 8 \
  --shard-size 8

Prepare a larger filtered subset:

python scripts/prepare_monet_subset.py \
  --output-dir data/processed/monet_10k \
  --max-samples 10000 \
  --max-scanned 150000 \
  --shard-size 1024 \
  --caption-column caption_gemini-2.5-flash-lite \
  --min-aesthetic 5.5 \
  --max-nsfw 0.2 \
  --max-watermark 0.2 \
  --min-aspect 0.75 \
  --max-aspect 1.33 \
  --min-least-dimension 512
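One plausible reading of these flags is sketched below. It is not the script's actual code: the row field names are hypothetical placeholders for the dataset's metadata columns, aspect is assumed to mean width divided by height, and --max-scanned is assumed to cap how many rows are examined while --max-samples caps how many are kept.

```python
from dataclasses import dataclass


@dataclass
class Filters:
    min_aesthetic: float = 5.5
    max_nsfw: float = 0.2
    max_watermark: float = 0.2
    min_aspect: float = 0.75
    max_aspect: float = 1.33
    min_least_dimension: int = 512


def keep(row, f):
    """Return True if a metadata row passes every threshold.

    The keys used here ("aesthetic", "nsfw", ...) are hypothetical names.
    """
    aspect = row["width"] / row["height"]
    return (
        row["aesthetic"] >= f.min_aesthetic
        and row["nsfw"] <= f.max_nsfw
        and row["watermark"] <= f.max_watermark
        and f.min_aspect <= aspect <= f.max_aspect
        and min(row["width"], row["height"]) >= f.min_least_dimension
    )


def select(rows, f, max_samples, max_scanned):
    """Scan at most max_scanned rows and keep at most max_samples of them."""
    kept = []
    for scanned, row in enumerate(rows, start=1):
        if scanned > max_scanned or len(kept) >= max_samples:
            break
        if keep(row, f):
            kept.append(row)
    return kept
```

Under this reading, the command above stops after 10,000 accepted rows or 150,000 scanned rows, whichever comes first.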

Split into train and validation shards:

python scripts/split_subset.py \
  --input-dir data/processed/monet_10k \
  --train-dir data/processed/monet_train \
  --val-dir data/processed/monet_val \
  --val-shards 1
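The split works on whole shards, so the validation size depends on the shard count. The sketch below shows the arithmetic and one simple hold-out policy. Holding out the last shards after sorting is an assumption about how the script chooses them, and the shard names in the usage note are hypothetical.

```python
def shard_count(num_samples, shard_size):
    """Number of shards needed, counting a final partial shard."""
    return -(-num_samples // shard_size)  # ceiling division


def split_shards(shard_names, val_shards):
    """Hold out the last `val_shards` shards (after sorting) for validation."""
    if not 0 < val_shards < len(shard_names):
        raise ValueError("val_shards must leave at least one training shard")
    ordered = sorted(shard_names)
    return ordered[:-val_shards], ordered[-val_shards:]
```

If the 10,000-sample subset above is filled, shard_count(10000, 1024) gives ten shards (nine full, one with 784 samples), so --val-shards 1 holds out roughly a tenth of the data. Calling split_shards on ten names such as "shard-00000" through "shard-00009" returns nine training names and one validation name.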

Train

Baseline flow matching:

python -m monet_flow.train --config configs/baseline.yaml

Self-Flow-lite:

python -m monet_flow.train --config configs/self_flow_lite.yaml

The local configs expect:

data/processed/monet_train/
data/processed/monet_val/

Sample A Checkpoint

Generate and decode a grid:

python -m monet_flow.sample \
  --config configs/baseline.yaml \
  --checkpoint outputs/monet_baseline_flow/checkpoints/latest.pt \
  --output-dir samples/manual_baseline \
  --steps 64 \
  --guidance-scale 1.0 \
  --decode \
  --prompts \
  "a red car on a road" \
  "a cat sitting on grass" \
  "a mountain landscape at sunset" \
  "a bowl of fruit on a table"

The decoded grid is written to:

samples/manual_baseline/grid.png
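To make --steps and --guidance-scale concrete, here is a minimal sketch of fixed-step Euler sampling with classifier-free guidance. It illustrates the general technique, not the code in monet_flow.sample: it assumes the t = 1 noise, t = 0 data convention, and velocity_fn, cond, and uncond are hypothetical stand-ins for the model and its text embeddings.

```python
import numpy as np


def guided_velocity(velocity_fn, x, t, cond, uncond, guidance_scale):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    v_cond = velocity_fn(x, t, cond)
    if guidance_scale == 1.0:
        return v_cond  # scale 1.0 reduces to the plain conditional prediction
    v_uncond = velocity_fn(x, t, uncond)
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def euler_sample(velocity_fn, noise, cond, uncond, steps=64, guidance_scale=1.0):
    """Integrate from t=1 (noise) down to t=0 (data) with equal Euler steps."""
    x = noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt  # current time, never reaches exactly 0 inside the loop
        x = x - dt * guided_velocity(velocity_fn, x, t, cond, uncond, guidance_scale)
    return x
```

With a guidance scale of 1.0, as in the command above, each step needs only one model call. Larger scales need a second, unconditional call per step.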

Track Progress

To make grids comparable across checkpoints, use the same prompts and the same initial noise for every checkpoint:

scripts/sample_progress.sh \
  configs/baseline.yaml \
  outputs/monet_baseline_flow \
  --prompts \
  "a red car on a road" \
  "a cat sitting on grass" \
  "a mountain landscape at sunset" \
  "a bowl of fruit on a table"

This writes:

samples/progress/
  step-0002500/grid.png
  step-0005000/grid.png
  index.json
  gallery.html
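Keeping the initial noise fixed comes down to drawing it from a seeded generator. The sketch below uses NumPy to show the idea. The seed value, the function names, and the zero-padded directory format (inferred from the listing above) are assumptions rather than the script's confirmed behaviour.

```python
import numpy as np


def fixed_noise(num_prompts, seed=0, shape=(32, 16, 16)):
    """Draw initial noise from a seeded generator so every checkpoint starts identically."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_prompts, *shape)).astype(np.float32)


def step_dir_name(step):
    """Zero-pad the step so directories sort in training order."""
    return f"step-{step:07d}"
```

Calling fixed_noise(4) twice returns identical arrays, so any change between two grids comes from the checkpoint rather than from sampling randomness.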

Self-Flow-Lite

The baseline uses one timestep per image.

Self-Flow-lite changes the corruption pattern:

baseline flow:
  every latent patch uses the same t

self-flow-lite:
  each latent patch can use a different t
  some patches can be heavily corrupted
  an auxiliary head reconstructs clean latent tokens

The goal is to push the model to use local context and text conditioning rather than learn a single denoising rule that treats every patch the same. In small matched runs, Self-Flow-lite can lower the velocity loss before the samples become prompt-coherent, so a lower loss alone does not show that the images have improved. Inspect the loss and the generated grids together.
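The corruption pattern can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: each latent position is treated as one token (patch size 1), heavily corrupted tokens are set to t = 1, and heavy_fraction and aux_weight are hypothetical parameters with arbitrary defaults.

```python
import numpy as np


def latents_to_tokens(latents):
    """Reshape (B, C, H, W) latents into (B, H*W, C) tokens, one per position."""
    b, c, h, w = latents.shape
    return latents.reshape(b, c, h * w).transpose(0, 2, 1)


def per_token_corruption(tokens, noise, rng, heavy_fraction=0.25):
    """Corrupt each token with its own timestep.

    Returns the noisy tokens, the per-token t, the velocity target,
    and the mask of heavily corrupted tokens.
    """
    b, n, _ = tokens.shape
    t = rng.uniform(0.0, 1.0, size=(b, n))              # a different t per token
    heavy = rng.uniform(size=(b, n)) < heavy_fraction   # tokens to corrupt heavily
    t = np.where(heavy, 1.0, t)                         # assumption: heavy means pure noise
    t3 = t[..., None]
    x_t = (1.0 - t3) * tokens + t3 * noise
    return x_t, t, noise - tokens, heavy


def self_flow_lite_loss(v_pred, x0_pred, target, tokens, aux_weight=0.1):
    """Velocity loss plus a weighted reconstruction loss on the clean tokens."""
    velocity = np.mean((v_pred - target) ** 2)
    reconstruction = np.mean((x0_pred - tokens) ** 2)
    return float(velocity + aux_weight * reconstruction)
```

For a 32 x 16 x 16 latent, latents_to_tokens yields 256 tokens of width 32, and the baseline is the special case in which all 256 tokens share one t.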

Cloud Helpers

The trainer supports gs:// data and output paths when the cloud extra is installed.

Example configs:

configs/baseline_gcs.yaml
configs/self_flow_lite_gcs.yaml
configs/monet_probe_gcs_cpu.yaml

Replace placeholder bucket paths with your own storage location.

Sync local shards:

scripts/sync_to_gcs.sh \
  data/processed/monet_train \
  gs://YOUR_BUCKET/datasets/monet_10k/train

scripts/sync_to_gcs.sh \
  data/processed/monet_val \
  gs://YOUR_BUCKET/datasets/monet_10k/val

Create basic cloud assets:

PROJECT_ID=YOUR_PROJECT_ID \
REGION=us-central1 \
BUCKET_NAME=YOUR_BUCKET_NAME \
scripts/create_gcp_assets.sh

Launch a GCE training worker:

PROJECT_ID=YOUR_PROJECT_ID \
SERVICE_ACCOUNT=YOUR_SERVICE_ACCOUNT_EMAIL \
ZONE=us-central1-a \
MACHINE_TYPE=g2-standard-8 \
ACCELERATOR_TYPE= \
scripts/launch_gce_training.sh \
  configs/baseline_gcs.yaml \
  gs://YOUR_BUCKET

Submit a Vertex package job:

PACKAGE_URI="$(scripts/build_and_upload_package.sh gs://YOUR_BUCKET)"

PROJECT_ID=YOUR_PROJECT_ID \
REGION=us-central1 \
scripts/submit_vertex_package_job.sh \
  "$PACKAGE_URI" \
  monet-flow-tiny-baseline \
  configs/baseline_gcs.yaml \
  gs://YOUR_BUCKET \
  monet_flow.train

See docs/cloud.md for the same flow as a compact guide.

Data, Weights, And Credentials

This repo keeps generated artifacts out of git. Dataset shards, decoded samples, checkpoints, and model weights belong in local output folders or user-controlled storage.

Cloud configs use placeholders such as:

gs://REPLACE_WITH_BUCKET/...
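Before launching a cloud job, it can help to confirm that no placeholder is left in a config. The check below is a hypothetical helper that is not part of the repository. It looks only for the REPLACE_WITH_ and YOUR_ patterns that appear in this README.

```python
import re

# Patterns taken from the placeholders shown in this README.
PLACEHOLDER = re.compile(r"REPLACE_WITH_\w+|YOUR_[A-Z_]+")


def find_placeholders(config_text):
    """Return (line_number, line) pairs that still contain an unfilled placeholder."""
    return [
        (i, line.strip())
        for i, line in enumerate(config_text.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]
```

Passing the text of a config file to find_placeholders returns an empty list once every bucket path has been replaced.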

Keep service-account files, tokens, private project ids, and bucket names outside the repository. The helper scripts read explicit environment variables such as PROJECT_ID, BUCKET_NAME, SERVICE_ACCOUNT, and HF_TOKEN when you provide them.

Useful Commands

Run tests:

python -m pytest

Run lint:

python -m ruff check .

Evaluate a checkpoint:

python -m monet_flow.evaluate \
  --config configs/baseline.yaml \
  --checkpoint outputs/monet_baseline_flow/checkpoints/latest.pt \
  --output outputs/eval.json

Diagnose prompt sensitivity:

python scripts/diagnose_checkpoint_conditioning.py \
  --config configs/baseline.yaml \
  --checkpoint outputs/monet_baseline_flow/checkpoints/latest.pt \
  --output-dir samples/diagnostics/baseline_latest \
  --steps 64 \
  --decode

Notes On Image Decoding

The SANA DC-AE decoder expects latents to be divided by its configured scaling factor before decoding. The decoder helper in src/monet_flow/vae.py handles this:

decoded = vae.decode(latents.float() / scaling_factor).sample

If decoded images look washed out or incorrectly scaled, check this path first.

Sources And Attribution

This repository uses MONET metadata and latents and compares against ideas from Self-Flow. See docs/sources.md for dataset, paper, and third-party project links. This repository does not redistribute MONET itself.

License

MIT. See LICENSE.
