stefareed/null-where


null-where

A SQL linter that catches queries that return wrong answers silently.


TL;DR

SQL has two failure modes that produce no errors. They return numbers that look plausible but describe a slightly different question than the one you asked, and nothing in the output tells you that.

Failure mode 1: NULL rows vanish from aggregates.

SELECT department, AVG(salary) FROM employees GROUP BY department;

If some employees have no salary on file, they're silently excluded. AVG(salary) computes the mean over employees with a recorded salary, not over all employees. If the missing values are non-random (new hires, contractors, part-time staff), the result is biased, and you have no way to tell from the output.
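The exclusion is easy to demonstrate with Python's built-in sqlite3 module. This is a minimal sketch; the table contents are invented for illustration:

```python
# Demonstrate that AVG() silently skips NULL rows, using an in-memory SQLite DB.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (department TEXT, salary REAL)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("eng", 100.0), ("eng", 80.0), ("eng", None)],  # one salary missing
)

avg, n_all, n_salaried = con.execute(
    "SELECT AVG(salary), COUNT(*), COUNT(salary) FROM employees"
).fetchone()

print(avg)         # 90.0 : the mean of the two recorded salaries, not of three employees
print(n_all)       # 3
print(n_salaried)  # 2
```

Reporting COUNT(*) and COUNT(salary) alongside the mean is what makes the shrunken denominator visible.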

Failure mode 2: Query definitions drift from what they measure.

CREATE VIEW active_products AS
    SELECT * FROM products WHERE status IN ('available', 'backordered');

When the product team adds a new status, 'pre-order', the view silently excludes it. Every report and dashboard built on active_products now understates inventory, but there is no error or warning, and the wrong number goes undetected.
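The drift can be reproduced in a few lines with sqlite3. The products and statuses below are invented for illustration:

```python
# Sketch of definition drift: a view's hardcoded status list silently
# excludes a status introduced after the view was created.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (name TEXT, status TEXT)")
con.execute(
    "CREATE VIEW active_products AS "
    "SELECT * FROM products WHERE status IN ('available', 'backordered')"
)
con.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("widget", "available"), ("gadget", "backordered")],
)
before = con.execute("SELECT COUNT(*) FROM active_products").fetchone()[0]

# Later, the product team introduces a status the view has never heard of.
con.execute("INSERT INTO products VALUES ('gizmo', 'pre-order')")
after = con.execute("SELECT COUNT(*) FROM active_products").fetchone()[0]

print(before, after)  # 2 2 : the new product is invisible, and no error is raised
```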

null-where catches both by analyzing your SQL and explaining not just what is wrong but why. It tells you what population your query actually describes, and where the name and the measurement have come apart.

from null_where import lint, report

print(report(lint("""
    SELECT department, AVG(salary)
    FROM employees
    GROUP BY department
""")))
Found 1 issue(s): 1 error(s), 0 warning(s), 0 info(s)
─────────────────────────────────────────────────────

[ERROR] ✗  SILENT_DENOMINATOR_SHIFT
   AVG(salary) present without both COUNT(*) and COUNT(salary).
   Fragment : AVG(salary)
   Why      : NULL values in 'salary' are silently excluded from AVG(). The
              result describes only the sub-population where salary is recorded —
              not the full department. If missingness is non-random, the mean
              is biased with no indication in the output.
   Fix      : COUNT(*) AS total_employees,
              COUNT(salary) AS employees_with_salary,
              COUNT(*) - COUNT(salary) AS employees_without_salary,
              AVG(salary) AS mean_salary_on_file,
              SUM(COALESCE(salary, 0)) / COUNT(*) AS mean_salary_lower_bound

Works on any SQL database, in any domain: HR queries, financial reports, product analytics, research pipelines. The failure modes are structural, not domain-specific.


Installation

pip install sqlglot
git clone https://github.com/yourname/null-where.git
cd null-where
pip install -e .

Or drop null_where/linter.py directly into your project. The only dependency is sqlglot.


Usage

Python

from null_where import lint, lint_file, report, Severity

# Check a query string
findings = lint("SELECT AVG(revenue) FROM orders WHERE region != 'EU' GROUP BY product")
print(report(findings))

# Check a .sql file
findings = lint_file("reports/monthly_summary.sql")

# One-line summary per finding
print(report(findings, verbose=False))

# Filter by severity
errors = [f for f in findings if f.severity == Severity.ERROR]

# Access fields programmatically
for f in findings:
    print(f.rule, f.severity.value, f.message)
    print(f.rationale)
    print(f.suggestion)

CLI

# Lint a query string
python -m null_where.linter "SELECT AVG(price) FROM products WHERE category != 'archived'"

# Lint a file
python -m null_where.linter queries/dashboard.sql

# Quiet mode: one line per finding
python -m null_where.linter --quiet queries/report.sql

# Specify SQL dialect (postgres, mysql, sqlite, bigquery, snowflake, ...)
python -m null_where.linter --dialect bigquery queries/warehouse.sql

Exit code 0 if no errors; exit code 1 if any ERRORs present — suitable for CI.


What it checks

SILENT_DENOMINATOR_SHIFT (ERROR)
  Problem : AVG(col) or SUM(col) without companion COUNT(*) and COUNT(col). NULL
            rows drop silently from the denominator, so the aggregate describes a
            biased subpopulation with no indication in the output.
  Fix     : Add COUNT(*), COUNT(col), and SUM(COALESCE(col, 0)) / COUNT(*)
            alongside every aggregate. Report both the non-null mean and the
            lower-bound mean.

NULL_UNSAFE_COMPARISON (WARNING)
  Problem : col != 'x' (or <>) returns UNKNOWN, not FALSE, when col is NULL, and
            WHERE discards UNKNOWN rows silently. Rows where col is NULL appear
            in neither col != 'x' nor col = 'x', breaking complementarity.
  Fix     : Use col IS DISTINCT FROM 'x', which is NULL-safe and always returns
            TRUE or FALSE.

PHANTOM_ROW_COUNT (ERROR)
  Problem : COUNT(*) after a LEFT JOIN counts phantom rows: when no right-side
            match exists, the JOIN produces a row of NULLs and COUNT(*) counts it
            as 1. A group with zero actual members reports count = 1.
  Fix     : Use COUNT(right_table.primary_key). COUNT(col) ignores NULLs, so
            unmatched rows correctly score 0.

HARDCODED_VALUE_LIST (WARNING)
  Problem : WHERE col IN ('a', 'b', 'c') freezes the definition at write time.
            When the domain expands (new statuses, categories, regions), existing
            queries silently exclude the additions.
  Fix     : JOIN against a reference table (WHERE ref.is_active = TRUE) so
            queries update automatically when the reference updates.

HARDCODED_THRESHOLD (WARNING)
  Problem : Numeric thresholds baked into queries (WHERE score >= 0.7, WHERE
            revenue > 10000). When standards change, every downstream query
            silently applies the old value.
  Fix     : Store thresholds in a config table or named CTE so they can be
            updated centrally.

CEILING_FILTER (WARNING)
  Problem : HAVING AVG(col) > 0.95 filters for near-perfect values and treats
            them as a clean signal. At ceiling, aggregates stop discriminating:
            the group may have saturated rather than genuinely improved.
  Fix     : Include STDDEV; low variance plus a high mean is the saturation
            signature. Return full results for human review rather than filtering
            at the query level.

COMPLEMENTARITY_TRAP (INFO)
  Problem : A threshold filter with no NULL handling. In-threshold and
            out-of-threshold counts will not sum to total rows when NULL values
            are present, so a rate computed this way overstates performance.
  Fix     : Use CASE WHEN to produce explicit in-threshold, out-of-threshold,
            and null counts, and verify that they sum to the total.

SELECT_STAR_SCHEMA_DRIFT (INFO)
  Problem : SELECT * in a view or saved query expands to the columns present at
            creation time, so schema changes alter what the query returns
            without touching its definition.
  Fix     : Name columns explicitly; schema changes then become visible errors
            rather than silent behavior changes.

Patterns

Make NULL gaps visible

The standard AVG(col) pattern hides how much of your data is missing. This version makes it explicit and works for any metric in any table:

SELECT
    group_col,
    COUNT(*)                                    AS total_rows,
    COUNT(measure_col)                          AS rows_with_value,
    COUNT(*) - COUNT(measure_col)               AS rows_without_value,
    ROUND(100.0 * COUNT(measure_col)
          / COUNT(*), 1)                        AS pct_coverage,
    AVG(measure_col)                            AS mean_where_present,
    SUM(COALESCE(measure_col, 0)) / COUNT(*)    AS mean_lower_bound
FROM your_table
GROUP BY group_col;

If pct_coverage drops between time periods or cohorts while the mean rises, the denominator shrank — investigate before reporting.

NULL-safe comparisons

-- ✗ Silently excludes rows where status IS NULL
WHERE status != 'inactive'

-- ✓ Includes rows where status IS NULL (they are distinct from 'inactive')
WHERE status IS DISTINCT FROM 'inactive'
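SQLite spells the NULL-safe comparison IS NOT (IS DISTINCT FROM is accepted as an alias from SQLite 3.39), but the behavior is the same. A minimal demonstration with invented data:

```python
# Compare the NULL-unsafe != with SQLite's NULL-safe IS NOT.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (status TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("active",), ("inactive",), (None,)])

# != returns UNKNOWN for the NULL row, and WHERE drops UNKNOWN rows.
unsafe = con.execute(
    "SELECT COUNT(*) FROM t WHERE status != 'inactive'"
).fetchone()[0]

# IS NOT always returns TRUE or FALSE, so the NULL row is kept.
safe = con.execute(
    "SELECT COUNT(*) FROM t WHERE status IS NOT 'inactive'"
).fetchone()[0]

print(unsafe)  # 1 : the NULL row is silently dropped
print(safe)    # 2 : the NULL row is kept (NULL is distinct from 'inactive')
```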

LEFT JOIN with correct counts

-- ✗ COUNT(*) counts the phantom NULL row from an unmatched LEFT JOIN
SELECT c.name, COUNT(*) AS member_count
FROM categories c
LEFT JOIN items i ON i.category_id = c.id
GROUP BY c.name;

-- ✓ COUNT(col) ignores the NULL row — empty categories correctly show 0
SELECT c.name, COUNT(i.id) AS member_count
FROM categories c
LEFT JOIN items i ON i.category_id = c.id
GROUP BY c.name;
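The phantom row is easy to reproduce with sqlite3. The categories and items below are invented for illustration:

```python
# A LEFT JOIN produces a row of NULLs for an unmatched category;
# COUNT(*) counts that phantom row, COUNT(i.id) does not.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category_id INTEGER)")
con.executemany("INSERT INTO categories VALUES (?, ?)", [(1, "books"), (2, "empty")])
con.execute("INSERT INTO items VALUES (1, 1)")  # only 'books' has an item

rows = dict(con.execute(
    "SELECT c.name, COUNT(*) FROM categories c "
    "LEFT JOIN items i ON i.category_id = c.id GROUP BY c.name"
).fetchall())
print(rows)  # {'books': 1, 'empty': 1} : the empty category wrongly reports 1

rows_fixed = dict(con.execute(
    "SELECT c.name, COUNT(i.id) FROM categories c "
    "LEFT JOIN items i ON i.category_id = c.id GROUP BY c.name"
).fetchall())
print(rows_fixed)  # {'books': 1, 'empty': 0} : COUNT(col) ignores the NULL row
```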

Reference tables instead of hardcoded lists

-- ✗ Frozen at write time — misses any values added later
WHERE status IN ('active', 'pending')

-- ✓ Updates automatically when the reference table updates
JOIN valid_statuses s ON t.status = s.status
WHERE s.is_current = TRUE

Named thresholds

-- ✗ Threshold baked in — silently stale after any standard change
WHERE conversion_rate >= 0.03

-- ✓ Threshold named and centralized
WITH thresholds AS (SELECT 0.03 AS min_conversion_rate)
SELECT ...
WHERE conversion_rate >= (SELECT min_conversion_rate FROM thresholds)

The explicit count template

Any time you're reporting a rate that should sum to 100% with its complement, make the three-way split explicit:

SELECT
    SUM(CASE WHEN col >= threshold THEN 1 ELSE 0 END) AS above_threshold,
    SUM(CASE WHEN col < threshold  THEN 1 ELSE 0 END) AS below_threshold,
    SUM(CASE WHEN col IS NULL      THEN 1 ELSE 0 END) AS no_value,
    COUNT(*)                                           AS total
FROM your_table;
-- above_threshold + below_threshold + no_value should equal total
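A quick sanity check of the template with sqlite3, using an invented scores table and a hypothetical 0.7 threshold:

```python
# Verify that the three-way CASE split accounts for every row, including NULLs.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (val REAL)")
con.executemany("INSERT INTO scores VALUES (?)", [(0.9,), (0.5,), (None,), (0.8,)])

above, below, missing, total = con.execute("""
    SELECT
        SUM(CASE WHEN val >= 0.7 THEN 1 ELSE 0 END),
        SUM(CASE WHEN val <  0.7 THEN 1 ELSE 0 END),
        SUM(CASE WHEN val IS NULL THEN 1 ELSE 0 END),
        COUNT(*)
    FROM scores
""").fetchone()

print(above, below, missing, total)  # 2 1 1 4 : every row is accounted for
assert above + below + missing == total
```

Note that the NULL row lands in neither of the first two buckets: both `val >= 0.7` and `val < 0.7` evaluate to UNKNOWN, so the CASE falls through to ELSE 0. Only the explicit IS NULL arm catches it.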

Why this happens: the conceptual foundation

The two failure modes null-where catches aren't bugs. They're built into SQL by design, and they reflect something real about the gap between what queries say and what they measure. Understanding why they happen is what makes the fixes stick and what lets you recognize new instances of the same problem that the linter doesn't cover.

Queries make two promises

Every SQL query makes two promises simultaneously, and most of the time they're the same promise:

  1. What it's called: the name of the view, the column aliases, the table name in FROM.
  2. What it actually picks out: the specific rows, computed under the specific conditions in the query body at the time it was written.

When you write SELECT AVG(salary) FROM employees, the name says "average salary of employees." The body says "average of the salary column, excluding NULLs, over the rows currently in this table." Most of the time those two things are the same, so nobody notices the gap.

The failure modes happen when they come apart.

NULL: when a row is there but its value isn't

NULL means "no value." Not zero, not empty string. It's absence. SQL lets NULL values exist in columns, but then does something surprising: aggregate functions (AVG, SUM, COUNT(col)) silently skip rows where the column is NULL and compute their result over the remaining rows only.

The query name still says "average salary." The computation is now over a different, smaller group: the employees who have a salary on file. If that group is representative of all employees, no harm done. If it isn't (if NULL salaries are more common for contractors, or new hires, or one particular department), the number is biased in a way you can't detect from the output.

This is why the fix is always to make the gap visible: report how many rows were included, how many were excluded, and what the lower-bound estimate looks like if you treat the excluded rows as zeros. You're not fixing the NULL, you're surfacing it so the reader can judge whether it matters.

Definition drift: when the name and the measurement diverge over time

The second failure mode is slower and subtler. A view or saved query is written at a particular moment, based on what the data looks like at that moment. The name captures the intent: active_customers, high_value_orders, current_inventory. The body captures the mechanics: WHERE status IN ('active', 'trial'), WHERE order_total > 500, WHERE discontinued = FALSE.

Over time, the world changes: new statuses get added, thresholds get revised, columns get renamed. The body of the query stays the same, and nobody touches it, because nothing broke. Meanwhile the gap between what the name promises and what the body delivers quietly widens.

The linter catches the patterns most likely to cause this: hardcoded lists of values that will grow, numeric thresholds that will shift, SELECT * that will expand silently when columns are added. The fix in each case is the same: express the intent in terms that can update when the world does, rather than baking a snapshot of today's world into the query body.

The names for these problems (optional reading)

These two failure modes have names in philosophy of language, because philosophers ran into them first.

The gap between a name and what it actually refers to is the distinction between sense and reference, worked out by the logician Gottlob Frege in the 1890s. "Average salary" is the sense (the description). The set of rows actually computed is the reference. In a working query they track each other. When NULLs silently narrow the denominator, the sense stays fixed while the reference quietly shrinks.

The gap between a description and the thing it describes is the de dicto / de re distinction. A de re claim is about the actual thing (whatever it turns out to be). A de dicto claim is about the description (the words, the conditions, the snapshot). WHERE status IN ('active', 'pending') is a de dicto commitment: a claim about those exact words, written at that exact time. The view name active_customers makes a de re promise: a claim about whatever "active" actually means. When those two diverge, that is, when the description stops tracking the reality, the query is technically correct and substantively wrong.

The linter flags the places where de dicto commitments are structurally likely to diverge from de re claims.


SQL ↔ FOL translation

This repo also includes a SQL ↔ First-Order Logic translator with Fregean semantic annotations. It makes the logical structure of a query explicit, expressing what the query actually quantifies over, what conditions it applies, and how its structure relates to what it claims to measure.

from sql_fol import sql_to_fol, fol_to_sql, fregean_analysis

# SQL → FOL
sql_to_fol("SELECT name FROM employees WHERE salary > 50000")
# → λname. ∃e. (employees(e) ∧ (salary > 50000))

sql_to_fol("SELECT DISTINCT region FROM orders WHERE total > 1000")
# → λregion. ∃!o. (orders(o) ∧ (total > 1000))

sql_to_fol("SELECT e.name, d.name FROM employees e, departments d WHERE e.dept_id = d.id")
# → λe.name,d.name. ∃e,d. (employees(e) ∧ departments(d) ∧ (e.dept_id = d.id))

# FOL → SQL
fol_to_sql("λname. ∃e. (Employees(e) ∧ (e.salary > 50000))")
# → SELECT name FROM Employees AS e WHERE e.salary > 50000;

# Fregean semantic annotations
for ann in fregean_analysis("SELECT name FROM employees WHERE salary > 50000"):
    print(f"[{ann['term']}]  {ann['gloss']}")
    print(f"  {ann['analysis']}")

The FOL translation makes the sense/reference gap concrete. The λ prefix (the SELECT list) is the sense, or how the result is named and framed. The body (the FROM and WHERE clauses) is the reference, or what the query actually picks out. When NULLs silently narrow what gets computed, the λ stays fixed while the body shrinks. When a view drifts from its definition, the λ is unchanged while the extension it picks out changes underneath it.

The correspondence table:

SQL                   FOL                               Fregean term
--------------------  --------------------------------  -----------------------------------------------
Table name            R(x)                              Begriff: an unsaturated concept; rows are its
                                                        Umfang (extension)
Row / tuple           bound variable x                  Gegenstand: an object falling under the concept
SELECT cols           λcols.                            λ-abstraction over the result; the sense / mode
                                                        of presentation
FROM R                ∃x. R(x)                          existential claim over the table's rows
WHERE P(x)            ∧ P(x)                            Sättigung: filling the concept's open argument
                                                        positions
Column predicate      sub-formula                       Merkmal: a sub-predicate contributing to the
                                                        concept's definition
INNER JOIN            R(x) ∧ S(x)                       shared variable binding two concepts
NOT EXISTS (...)      ¬∃x. φ(x)                         negated existential
DISTINCT              ∃!x                               unique existential
View V = query Q      same reference, different sense   same rows, different mode of presentation
NULL                  reference failure                 a term with sense but no value
                                                        (Bedeutungslosigkeit)

Project structure

null-where/
├── null_where/
│   ├── __init__.py          # Public API: lint, lint_file, report, Severity
│   └── linter.py            # Eight checks, sqlglot-based AST analysis
├── sql_fol/
│   ├── __init__.py          # Public API: sql_to_fol, fol_to_sql, fregean_analysis
│   ├── sql_to_fol.py        # Recursive-descent SQL parser, FOL emitter
│   ├── fol_to_sql.py        # FOL parser, SQL emitter
│   └── annotations.py       # Fregean semantic annotation engine
├── null_where_demo.ipynb    # Colab demo notebook
├── tests/
├── pyproject.toml
└── README.md

References

  • Codd, E.F. (1970). A Relational Model of Data for Large Shared Data Banks. CACM 13(6).
  • Codd, E.F. (1979). Extending the Database Relational Model to Capture More Meaning. TODS 4(4).
  • Frege, G. (1879). Begriffsschrift.
  • Frege, G. (1892). Über Sinn und Bedeutung.
  • Imielinski, T. & Lipski, W. (1984). Incomplete Information in Relational Databases. JACM 31(4).
  • van Fraassen, B. (1966). Singular Terms, Truth-Value Gaps, and Free Logic. Journal of Philosophy.
  • Heim, I. & Kratzer, A. (1998). Semantics in Generative Grammar.

License

MIT
