feat(clickhouse): support LowCardinality, FixedString, CODEC, and SAMPLE BY#8

Merged
lohanidamodar merged 4 commits into main from feat/clickhouse-schema-extras
May 11, 2026

@lohanidamodar
Contributor

Summary

Adds support for four ClickHouse schema features that are common in
production OLAP workloads but currently can't be expressed via the
schema builder, forcing users to drop down to raw DDL — exactly what
a typed schema builder is meant to prevent.

Each addition lives in src/Query/Schema/ alongside the existing
ClickHouse modifiers (ttl, engine, orderBy, settings,
skip-index algorithms). Other dialects throw UnsupportedException
at compile time so misuse is caught early.

What's new

LowCardinality(T) column modifier

$schema->table('events')
    ->bigInteger('id')->primary()
    ->string('status')->lowCardinality()
    ->string('country')->lowCardinality()->nullable()
    ->create();
// ... `status` LowCardinality(String), `country` Nullable(LowCardinality(String)) ...

LowCardinality is a standard ClickHouse storage modifier for string
columns with a bounded number of distinct values — status enums, type
discriminators, country/category codes. Dictionary encoding cuts storage
and accelerates reads, so leaving such columns as plain String is a common
anti-pattern in production OLAP schemas. Nullable is applied outside LowCardinality to match
ClickHouse's required wrapping order.

FixedString(N) column type

$schema->table('locations')
    ->fixedString('country_code', 2)   // ISO 3166-1 alpha-2
    ->fixedString('currency_code', 3)  // ISO 4217
    ->fixedString('digest', 32)        // hex-encoded MD5 (a raw MD5 digest is 16 bytes)
    ->create();

Fixed-length strings skip the per-value length prefix that String stores,
so they suit values whose byte length is known and constant: ISO codes,
hash digests, fixed-width identifiers. Adds Table::fixedString($name, $length)
plus a matching forwarder on Column. Length must be at least 1.
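
The length rule can be illustrated with a minimal sketch. The class name `FixedStringTableSketch` and the use of `InvalidArgumentException` are assumptions made for illustration; this is not the library's actual `Table` class or exception type. The sketch validates before registering the column, so a rejected call leaves the table state unchanged.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the table builder; names are illustrative only.
final class FixedStringTableSketch
{
    /** @var array<string, int> column name => byte length */
    private array $fixedStrings = [];

    public function fixedString(string $name, int $length): self
    {
        // Validate first so that a rejected call does not register a column.
        if ($length < 1) {
            throw new InvalidArgumentException(
                "FixedString length must be at least 1, got {$length}."
            );
        }
        $this->fixedStrings[$name] = $length;

        return $this;
    }

    /** @return list<string> one DDL fragment per registered column */
    public function compileColumns(): array
    {
        $fragments = [];
        foreach ($this->fixedStrings as $name => $length) {
            $fragments[] = "`{$name}` FixedString({$length})";
        }

        return $fragments;
    }
}
```

Calling `fixedString('bad', 0)` on this sketch throws, and `compileColumns()` afterwards still returns only the columns that were accepted earlier.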

Column-level CODEC(...) clauses

$schema->table('metrics')
    ->bigInteger('id')->primary()
    ->datetime('ts', 3)->codec('Delta(4)')->codec('LZ4')   // monotonic timestamps
    ->bigInteger('value')->codec('T64')->codec('LZ4')      // integer column
    ->string('payload')->codec('ZSTD(3)')                  // text column
    ->create();

Multiple codec() calls accumulate and emit
CODEC(c1, c2, ...). Each codec string is emitted verbatim, so
arguments live inline ('Delta(4)', 'ZSTD(3)') and the modifier
stays a thin wrapper around the underlying DDL. Empty strings and
semicolons are rejected at configure time. Column-level codecs are a
core ClickHouse feature for tuning storage size and read throughput;
the schema builder couldn't express them before this PR.
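
As a rough sketch of the accumulate-and-validate behaviour described above: the class `CodecColumnSketch`, its method `compileCodecClause()`, and the exception type are hypothetical names chosen for this illustration, not the PR's actual code.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the column builder; not the library's class.
final class CodecColumnSketch
{
    /** @var list<string> codec specs in the order they were added */
    private array $codecs = [];

    public function codec(string $spec): self
    {
        // Reject inputs at configure time, before anything is compiled.
        if ($spec === '') {
            throw new InvalidArgumentException('Codec must not be empty.');
        }
        if (str_contains($spec, ';')) {
            throw new InvalidArgumentException('Codec must not contain a semicolon.');
        }
        // The spec is stored verbatim, so arguments stay inline, e.g. 'ZSTD(3)'.
        $this->codecs[] = $spec;

        return $this;
    }

    public function compileCodecClause(): string
    {
        // No codecs configured means no CODEC clause at all.
        return $this->codecs === []
            ? ''
            : 'CODEC(' . implode(', ', $this->codecs) . ')';
    }
}
```

With this sketch, `->codec('Delta(4)')->codec('LZ4')` compiles to `CODEC(Delta(4), LZ4)`, while `->codec('')` and `->codec('LZ4; DROP TABLE t')` both throw before any DDL is produced.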

SAMPLE BY table option

$schema->table('events')
    ->bigInteger('id')->primary()
    ->bigInteger('user_id')->unsigned()
    ->sampleBy('user_id')
    ->create();
// ... ENGINE = MergeTree() ORDER BY (`id`) SAMPLE BY user_id

SAMPLE BY enables approximate-query support
(SELECT ... SAMPLE k) and must be declared at table creation time.
Emitted after ORDER BY and before TTL / SETTINGS. Rejected on
engines that don't take an ORDER BY clause (Memory, Log,
TinyLog, StripeLog).
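
The clause ordering and the engine guard can be sketched as follows. The function `compileTableTailSketch()`, the constant name, and the `LogicException` are assumptions for illustration; the PR's compiler is not shown in this thread and may be structured differently.

```php
<?php
declare(strict_types=1);

// Engines the description lists as not taking an ORDER BY clause.
const ENGINES_WITHOUT_ORDER_BY = ['Memory', 'Log', 'TinyLog', 'StripeLog'];

/**
 * Hypothetical helper: assembles the clauses that follow the column list.
 *
 * @param list<string>          $orderBy  column names for ORDER BY
 * @param array<string, string> $settings setting name => value
 */
function compileTableTailSketch(
    string $engine,
    array $orderBy,
    ?string $sampleBy = null,
    ?string $ttl = null,
    array $settings = []
): string {
    $takesOrderBy = !in_array($engine, ENGINES_WITHOUT_ORDER_BY, true);
    if ($sampleBy !== null && !$takesOrderBy) {
        throw new LogicException("SAMPLE BY is not supported on the {$engine} engine.");
    }

    $parts = ["ENGINE = {$engine}()"];
    if ($takesOrderBy && $orderBy !== []) {
        $quoted = array_map(static fn (string $c): string => "`{$c}`", $orderBy);
        $parts[] = 'ORDER BY (' . implode(', ', $quoted) . ')';
    }
    // SAMPLE BY sits after ORDER BY and before TTL / SETTINGS.
    if ($sampleBy !== null) {
        $parts[] = "SAMPLE BY {$sampleBy}";
    }
    if ($ttl !== null) {
        $parts[] = "TTL {$ttl}";
    }
    if ($settings !== []) {
        $pairs = [];
        foreach ($settings as $key => $value) {
            $pairs[] = "{$key} = {$value}";
        }
        $parts[] = 'SETTINGS ' . implode(', ', $pairs);
    }

    return implode(' ', $parts);
}
```

For example, `compileTableTailSketch('MergeTree', ['id', 'user_id'], 'user_id')` yields ``ENGINE = MergeTree() ORDER BY (`id`, `user_id`) SAMPLE BY user_id``. The usage orders by both columns because ClickHouse expects the sampling expression to be part of the primary key.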

Why these specifically

The schema builder can already model the standard MergeTree shape, but
production ClickHouse schemas almost always reach for one or more of
these modifiers. Without them, users have to fall back to raw DDL,
which defeats the purpose of a typed builder.

The patches follow the same dialect pattern as the existing ttl,
engine, orderBy, settings, and skip-index features added in #6:
state lives on Column / Table, ClickHouse compiles it, and base
Schema / PostgreSQL / SQLite overrides throw
UnsupportedException so misuse on the wrong dialect is caught at
compile time.

Out of scope (planned follow-ups)

  • ClickHouse aggregate selectors (uniqExact, uniq, uniqCombined,
    uniqHLL12) on Builder — would let users express
    ClickHouse-native exact and approximate distinct-count aggregates
    without dropping to raw expressions.
  • Time-bucket helpers (toStartOfHour, toStartOfDay, toStartOfWeek,
    toStartOfMonth, toStartOfMinute) on Builder — for time-series
    rollups in SELECT and GROUP BY.

These are query-builder features rather than schema features, so a
separate PR keeps this one focused.

Tests

  • New ClickHouse schema tests in tests/Query/Schema/ClickHouseTest.php
    asserting exact DDL output for each feature, plus validation-error
    coverage (zero length, empty/semicolon codec, empty/semicolon SAMPLE
    BY, SAMPLE BY on a non-ORDER BY engine).
  • Cross-dialect throw tests in
    tests/Query/Schema/FluentBuilderTest.php covering LowCardinality,
    FixedString, column CODEC, and SAMPLE BY on MySQL / PostgreSQL /
    SQLite.

Test plan

  • composer test (5197 tests pass)
  • composer lint (Pint passes)
  • composer check (PHPStan max passes)

@github-actions

github-actions Bot commented May 7, 2026

📊 Coverage

| Metric | PR | Baseline | Δ |
|---|---|---|---|
| Lines | 91.85% (7138/7771) | 91.82% | +0.03% |
| Methods | 84.56% (1068/1263) | 84.46% | +0.10% |
| Classes | 65.84% (133/202) | 65.84% | +0.00% |

Full per-file breakdown in the job summary.

@greptile-apps
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR extends the ClickHouse schema builder with four production-oriented DDL features — LowCardinality(T), FixedString(N), column-level CODEC(...), and table-level SAMPLE BY — each following the existing dialect-isolation pattern where new state lives on Column\ClickHouse / Table\ClickHouse and the ClickHouse compiler reads it, while other dialects can't reach the new methods at all since they're only exposed on ClickHouse-specific builder subclasses.

  • LowCardinality and FixedString: The compiler handles the correct Nullable(LowCardinality(T)) wrapping order in both the FixedString early-return branch and the general-type branch. The fixedString() factory correctly throws ValidationException before appending the column to the table when length is invalid, keeping table state consistent.
  • CODEC: Multiple codec() calls accumulate into CODEC(c1, c2, ...), positioned after DEFAULT and before TTL/COMMENT in the column definition — matching ClickHouse's required DDL order. Empty strings and semicolons are rejected at configure time.
  • SAMPLE BY: Emitted after ORDER BY and before TTL/SETTINGS; throws UnsupportedException at compile time for non-ORDER BY engines, mirroring the pattern of the existing TTL and engine guards.

Confidence Score: 5/5

All four new schema features are correctly scoped to ClickHouse-specific builder types, validated at configure time, and compiled in the right DDL order — safe to merge.

The new LowCardinality, FixedString, CODEC, and SAMPLE BY features are self-contained in the ClickHouse builder layer. Wrapping orders, DDL clause ordering, and validation guards are all correct. Methods are only reachable through Column\ClickHouse / Table\ClickHouse, so cross-dialect misuse is prevented at the type level rather than by runtime branches. The test suite covers both happy-path DDL output and validation-error cases for each feature.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| src/Query/Schema/ClickHouse.php | Compiler updated to handle FixedString, LowCardinality, CODEC, and SAMPLE BY; wrapping order (Nullable outside LowCardinality) is correct and SAMPLE BY is positioned correctly after ORDER BY. |
| src/Query/Schema/Column/ClickHouse.php | Adds isLowCardinality, fixedStringLength, and codecs state with validation; codec() and asFixedString() correctly guard against empty/semicolon inputs and length < 1. |
| src/Query/Schema/Table/ClickHouse.php | Adds fixedString() factory method and sampleBy() with validation; fixedString correctly throws before adding the column to the list when length is invalid. |
| src/Query/Schema/Forwarder/ClickHouse.php | Adds fixedString() and sampleBy() forwarder methods that correctly delegate to the parent table; follows existing forwarder pattern. |
| tests/Query/Schema/ClickHouseTest.php | Good coverage of happy-path DDL output and validation errors for all four new features; CODEC+TTL+COMMENT ordering test is especially useful. |
| README.md | Documents all four new features with accurate DDL examples; also corrects an existing omission by adding Views and Databases to the interface list. |

Reviews (4): Last reviewed commit: "refactor(clickhouse): drop empty Feature..."

@lohanidamodar lohanidamodar requested a review from abnegate May 10, 2026 07:23
…ia Feature\OLAP

Adds four OLAP-shaped column and table modifiers to the ClickHouse
dialect, exposed through a new `Feature\OLAP` marker interface so the
methods are reachable only from the dialect's typed Column/Table
subclasses — not from `MySQL`, `PostgreSQL`, `SQLite`, or `MongoDB`
builders.

- `Column::lowCardinality()` wraps the column type in
  `LowCardinality(...)`. `Nullable` is applied outside to keep
  ClickHouse's required wrapping order.
- `Table::fixedString($name, $length)` (with a Column-chain forwarder)
  adds a `FixedString(N)` column for fixed-length values like ISO codes
  and hash digests.
- `Column::codec($spec)` accumulates one or more `CODEC(...)` entries on
  the column. Multiple calls produce `CODEC(c1, c2, ...)`.
- `Table::sampleBy($expression)` (with a Column-chain forwarder)
  registers a `SAMPLE BY` clause emitted between `ORDER BY` and `TTL` /
  `SETTINGS`. Rejected on engines that don't take an `ORDER BY` clause.

State for `isLowCardinality`, `codecs`, and `sampleBy` lives on
`Column\ClickHouse` / `Table\ClickHouse`, so non-OLAP dialects don't
expose the methods at all and don't carry the state. The `FixedString`
`ColumnType` case is only produced via `Table\ClickHouse::fixedString()`;
other dialects' `compileColumnType()` declare a defensive
`UnsupportedException` branch to satisfy match exhaustiveness even
though the case is unreachable from their builders.
…AMPLE BY

Adds four sections to the ClickHouse Schema chapter covering the new
`Feature\OLAP` modifiers. The narrative makes clear that the methods
are dialect-scoped at the type level — calling them on `MySQL`,
`PostgreSQL`, `SQLite`, or `MongoDB` builders is a compile-time error,
not a runtime throw.

Also extends the ClickHouse "Supports the ... interfaces" line to list
`Views`, `Databases`, and `OLAP` alongside the existing entries.
@lohanidamodar lohanidamodar force-pushed the feat/clickhouse-schema-extras branch from a4c51b3 to 825c507 May 11, 2026 03:48
@lohanidamodar
Contributor Author

Thanks for the Feature-interface direction — refactored to use Feature\OLAP (matching the scaffolding added on main):

  • lowCardinality() and codec() now live on Column\ClickHouse; fixedString() and sampleBy() on Table\ClickHouse (with column-chain forwarders via Forwarder\ClickHouse).
  • Only Schema\ClickHouse implements Feature\OLAP; the methods don't exist on Column\MySQL / Column\PostgreSQL / Column\SQLite / Column\MongoDB (or their Table\X counterparts), so calling them is a compile-time error rather than a runtime UnsupportedException.
  • Greptile P1 (MongoDB silently dropping modifiers) is resolved by construction — the methods aren't reachable from Schema\MongoDB.
  • Modifier state (isLowCardinality, codecs, sampleBy) lives on the ClickHouse Column/Table subclasses, not on the base classes.
  • Generated DDL is unchanged from the prior PR revision.

Tests + lint + PHPStan green. Ready for re-review.

…rom global ColumnType

Removes `ColumnType::FixedString` from the cross-dialect enum. FixedString
state now lives on `Column\ClickHouse` (via `asFixedString()` /
`isFixedString()` / `$fixedStringLength`), and `Schema\ClickHouse::compileColumnType()`
reads that state to emit `FixedString(N)` DDL.

`Table\ClickHouse::fixedString()` now registers a `ColumnType::String` column
and tags it with the FixedString state, so the global enum carries no
ClickHouse-only cases and the other dialects (`MySQL`, `PostgreSQL`,
`SQLite`, `MongoDB`) no longer need `UnsupportedException` match branches —
their `compileColumnType()` methods are byte-identical to `main`.

`Feature\OLAP` remains a marker interface matching the dialect-shape pattern
(OLAP modifiers live on the column/table builder, not on `Schema`, so they
cannot be expressed as a Schema-level method contract); docblock updated to
explain why and to confirm the non-OLAP dialects are unchanged by construction.

Compiled DDL bytes for ClickHouse are unchanged; all 5175 tests pass; lint
and PHPStan max are clean.
@lohanidamodar
Contributor Author

Sorry, the earlier round wasn't a full fix — ColumnType::FixedString was still in the global enum, which forced the other dialects to handle/throw on it. Refactored properly this round (951783c):

  • ColumnType::FixedString removed from the global enum
  • FixedString state now lives on Column\ClickHouse (asFixedString() / isFixedString() / $fixedStringLength); Schema\ClickHouse::compileColumnType() reads that state to emit FixedString(N) DDL
  • MySQL / PostgreSQL / SQLite / MongoDB are byte-identical to main (no UnsupportedException branches added by this PR; git diff origin/main -- src/Query/Schema/MySQL.php … is empty)
  • Feature\OLAP stays a marker — OLAP modifiers (lowCardinality(), fixedString(), codec(), sampleBy()) live on the dialect's column/table builder, not on Schema, so they can't be expressed as Schema-level method signatures the way TableComments / Views / Databases are; docblock now explains that and confirms non-OLAP dialects are unaffected by construction

Other three features (LowCardinality, CODEC, SAMPLE BY) were already correctly scoped to Column\ClickHouse / Table\ClickHouse and unaffected. ClickHouse-side DDL output bytes are unchanged; all 5175 tests green, lint and PHPStan max clean.
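
The shape of this refactor can be sketched roughly as below. All class and function names (`ColumnSketch`, `ClickHouseColumnSketch`, `compileClickHouseTypeSketch()`) and the plain string type tags are hypothetical simplifications; the real code uses the `ColumnType` enum and the `Column\ClickHouse` / `Schema\ClickHouse` classes named above, whose internals are not shown in this thread.

```php
<?php
declare(strict_types=1);

// Hypothetical base column: carries only cross-dialect state.
class ColumnSketch
{
    public function __construct(
        public string $name,
        public string $type // stands in for the shared ColumnType enum
    ) {
    }
}

// Hypothetical ClickHouse column: the FixedString tag lives only here.
class ClickHouseColumnSketch extends ColumnSketch
{
    private ?int $fixedStringLength = null;

    public function asFixedString(int $length): self
    {
        if ($length < 1) {
            throw new InvalidArgumentException('FixedString length must be at least 1.');
        }
        $this->fixedStringLength = $length;

        return $this;
    }

    public function isFixedString(): bool
    {
        return $this->fixedStringLength !== null;
    }

    public function fixedStringLength(): ?int
    {
        return $this->fixedStringLength;
    }
}

// Only the ClickHouse compiler inspects the dialect-specific tag.
function compileClickHouseTypeSketch(ColumnSketch $column): string
{
    if ($column instanceof ClickHouseColumnSketch && $column->isFixedString()) {
        return 'FixedString(' . $column->fixedStringLength() . ')';
    }

    return match ($column->type) {
        'string' => 'String',
        'bigInteger' => 'Int64',
        default => throw new LogicException("Unhandled type {$column->type}."),
    };
}
```

Because the column is registered with the ordinary string type and only tagged on the ClickHouse subclass, compilers for other dialects in this sketch would never see a FixedString case and need no extra branch.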

The interface declared no methods, was inspected nowhere, and pulled no
weight at runtime or in the type system. Every sibling in `Feature/*`
declares Statement-returning method signatures, but OLAP modifiers are
intrinsic to the column/table builder shape and can't be expressed at the
Schema level.

Dialect-scoping is fully preserved by `Column\ClickHouse` / `Table\ClickHouse`
/ `Forwarder\ClickHouse` carrying the modifier methods natively — calling
them on a non-ClickHouse builder is a clean type-system error, not a
runtime exception.
@lohanidamodar
Contributor Author

Dropped the empty Feature\OLAP marker — every sibling in Feature/* declares Statement-returning methods, but the OLAP modifiers are intrinsic to the column/table builder shape and can't be expressed at the Schema level. The interface enforced nothing and was inspected nowhere, so it wasn't pulling weight.

Dialect-scoping is fully preserved by Column\ClickHouse / Table\ClickHouse / Forwarder\ClickHouse carrying the methods natively — non-ClickHouse dialects still get a clean type-system error (not a runtime exception) on attempted calls.
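
A minimal sketch of that type-level scoping, with hypothetical class names (`BaseColumnSketch`, `MySqlColumnSketch`, `OlapColumnSketch`) standing in for the real `Column\*` classes:

```php
<?php
declare(strict_types=1);

// Hypothetical shared base: no OLAP modifiers are declared here.
class BaseColumnSketch
{
    public function __construct(public string $name)
    {
    }
}

// Hypothetical non-ClickHouse column: inherits nothing OLAP-specific.
final class MySqlColumnSketch extends BaseColumnSketch
{
}

// Hypothetical ClickHouse column: declares the modifier natively.
final class OlapColumnSketch extends BaseColumnSketch
{
    private bool $lowCardinality = false;

    public function lowCardinality(): self
    {
        $this->lowCardinality = true;

        return $this;
    }

    public function isLowCardinality(): bool
    {
        return $this->lowCardinality;
    }
}

// The method is simply absent from the other dialect's class.
var_dump(method_exists(OlapColumnSketch::class, 'lowCardinality'));
var_dump(method_exists(MySqlColumnSketch::class, 'lowCardinality'));
```

In this sketch, `(new MySqlColumnSketch('status'))->lowCardinality()` is reported by a static analyser such as PHPStan as a call to an undefined method. If it runs anyway, PHP raises an `Error` for the undefined method rather than a library-level `UnsupportedException`.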

@lohanidamodar lohanidamodar merged commit 5c8bba8 into main May 11, 2026
7 checks passed
2 participants