Add 'reflow' capability for configurable, black-like deterministic formatting #2436

NiallRees · 2022-01-23T21:05:33Z

Feature

Same functionality as black, but totally configurable. The 'reflow' command would produce a deterministic output for the same 'code_only' input, regardless of the non-code elements (whitespace, capitalisation, indentation).

Benefits:

Fully user configurable code style
Deterministic output for queries with the same choice and order of code segments
No need to maintain and create rules for formatting-only concerns, which is hard currently as we have to detect bad SQL, which has infinite possibilities. By letting the user decide what good SQL is, we only have to define how the SQL is rendered.
The implementation suggested is purely additive in relation to the parse tree, mitigating issues where formatting rules can interfere with each other.

Illustrative example

The user has a sql file with the following:

select a,b, 
c FROM schema.table TBL

The user has configured the following:

base:
  case: lower # render all base segments (BaseSegment) and children in lowercase
select_clause_elements:
  indent_from_parent: 1 # indents all of `select_clause_elements_segment` by one unit from parent segment (this grouping segment doesn't exist yet)
select_clause_element:
  pre_newline: true # adds a newline before `select_clause_element`
from_clause:
  pre_newline: true # adds a newline before `from_clause`

And so the output is:

select
    a,
    b,
    c
from schema.table tbl

How

We're already embedding Indent and Dedent into the dialects, this feature builds upon that to add all configuration and functionality to render valid SQL from code-only elements.

We parse the SQL as usual
We remove all non-code segments from the parse tree (output here is what our dialect test fixture ymls look like).
Using this parse tree, we render the 'reflowed' SQL.

We've talked for a while about the idea of separating rules by what they do. It seems we can clearly draw a line between:

Rules which care about query structure (choice and ordering of code segments)
Rules which just care about appearance (non-code segments, strictly whitespace, capitalisation and indentation).

This feature focuses strictly on the latter and would not add, remove or modify code segments beyond their capitalisation.

Implementation suggestions

Embed the rendering logic into the dialect grammars/segments for maximum maintainability and ease of contribution. This is deliberately moving away from the approach in L003 which is very difficult to maintain due to its aim of handling all indentation across every grammar and segment.
Unless a segment/grammar has unique rendering logic, it can inherit the logic and behaviour from its parent. For instance, we only need to define 'case' and pre_newline config at the top level segment.
Configuration precedence and inheritance:
- 1. User-specified configuration
- 1. 'Theme'-specified configuration (we could have a library of SQL styles which could be used as a base)
- 1. Segment specified configuration
- 1. Next closest parent segment configuration (e.g. SchemaReferenceSegment would use ObjectReferenceSegment's configuration if it did not have its own).
- Acceptance criteria for a V1 could be that we're just able to render valid SQL from code-only segments. We can add configurable style after that.
Not all formatting rules are as simple as 'always add newlines before select elements'. But, we could make segment-specific configurations. For example there could feasibly be a config for select_clause_element: pre_newline_when_multiple.

My current availability combined with the fact that there are many much better developers than me around here means I would really love to see someone/others take this on, I'm not precious about any of the above ideas!

The text was updated successfully, but these errors were encountered:

alanmcruickshank · 2022-07-14T16:07:53Z

So I'm going to take a go at this as my next biggie. Here's my rough plan of attack:

Step 1. Blunt reflow.

First step of implementing this I think will just be on the API (i.e. not on the CLI yet until this is more stable). The "blunt reflow" would be to remove all whitespace and newlines, then step through the file, adding a newline (and appropriate indent) for every Indent() and Dedent() segment, otherwise adding single whitespaces.

This will be implemented ideally as a method on BaseSegment so it can either be called on a while file, statement or even just a specific clause. There will importantly be a configurable limit on how much to recurse, so when applying a reflow, rules might choose to only reflow the layer where a fix was applied and not every child segment (and their children and so on).

This is totally deterministic, but ignores line lengths, delimiters and will look kind of silly. It demonstrates proof of concept for what we're trying to do though.

Step 2. Line break hint segments.

To facilitate appropriate splitting of lists (e.g. SELECT a,b,c,d,e,f,g) I suggest added a new MetaSegment (similar to Indent()), which hints at appropriate places to put newlines. LineBreak() is probably a parent class of Indent() in that all indents are possible line breaks but not visa versa. Like indents, these would be configurable so that we can account for things like trailing or leading commas, and ditto for operators.

Adding this capability into the blunt reflow just makes files even longer. Things still look silly, but this is the most extreme version of laying out a file. All still deterministic.

Step 3. Meta segment priority.

For all Indent(), Dedent() and LineBreak() segments, introduce configurable a priority value. Higher values suggest more likely places to put line breaks (with or without indents). One edit option would be to change priority values to be very high (perhaps a dummy value of 100), which effectively makes a newline compulsory. This is the method we would encourage for rules which detect a missing newline: they don't actually add the newline, they just make one compulsory and then trigger a reflow.

NOTE: at this stage, the priority isn't actually used, but we can set it.

Step 4. Progressively pack.

So I think this is where most of the "magic" happens:

We already have a mix line length.
We would now have a priority score for all potential places for newlines.

I'm not sure whether we first "blunt reflow" and then work backwards, or whether we remove all newlines and work forwards (probably the latter). Assuming the latter, we progressively add high priority newlines (and work down the priority order), until a segment (and all it's children) all fit within the max line length.

That would have the effect for example that we might have to reflow more aggressively in a SELECT clause, but less in a FROM clause within the same query (i.e. different priority line breaks would be activated), but that's ok based on the varying complexity of the content of those sections.

Alternatives/Options

I think one variation of this plan is that rather than having an explicit LineBreak segment, we enable some segments to be configured as a kind of explicit delimiter - similar to the suggestion above of a "newline after" or "newline before" config. This results in a much cleaner parse tree, and fewer scattered Meta segments - but it does increase complexity within an already crowded BaseSegment class.
Perhaps a trivial variation on the plan would be that for compulsory line breaks, there's an explicit compulsory flag which is settable by rules, rather than just setting the priority to a really high value (likely via a constant).
I haven't really accounted for configuration of the priority values - in this plan I'm assuming they're hard coded in the dialect (albeit with sensible defaults on things like the Delimited grammar). This presents an interesting tension between rules setting some priority values and the dialect setting the rest - which would result in a cli reflow command giving contradictory results to a fix command with all rules. That's not necessarily bad but does pose some interesting questions.

Would love feedback on this rough plan before I dive in.

barrywhart · 2022-07-14T16:26:15Z

The plan sounds good overall. I had to skim it just now, but one thought I've had about this is that the depth of a segment in the parse tree should have some influence on how lines are broken. All else being equal, my intuition is that it's better to break lines higher than lower in the tree, e.g. each function parameter on its own line is preferred to splitting up a computation for a single parameter, e.g. a + b * c .... Just a thought...

barrywhart · 2022-07-14T16:26:18Z

The plan sounds good overall. I had to skim it just now, but one thought I've had about this is that the depth of a segment in the parse tree should have some influence on how lines are broken. All else being equal, my intuition is that it's better to break lines higher than lower in the tree, e.g. each function parameter on its own line is preferred to splitting up a computation for a single parameter, e.g. a + b * c .... Just a thought...

NiallRees · 2022-07-14T16:58:38Z

Sounds great @alanmcruickshank . What are your thoughts on segment-level configurability of this, and do you think there are any considerations we should be wary of early on on to facilitate that?

Referring to this idea in the issue description:

base:
  case: lower # render all base segments (BaseSegment) and children in lowercase
select_clause_elements:
  indent_from_parent: 1 # indents all of `select_clause_elements_segment` by one unit from parent segment (this grouping segment doesn't exist yet)
select_clause_element:
  pre_newline: true # adds a newline before `select_clause_element`
from_clause:
  pre_newline: true # adds a newline before `from_clause`

Where these configs would then descend down to any segments which inherit from another. I'm then imagining a world where community members can codify their entire code style using this config but perhaps I'm getting too far ahead.

alanmcruickshank · 2022-07-14T20:29:46Z

Where these configs would then descend down to any segments which inherit from another. I'm then imagining a world where community members can codify their entire code style using this config but perhaps I'm getting too far ahead.

For configuration it would leverage the existing configuration options for turning some meta segments on and off (a la indented_joins, indented_ctes and other options in [sqlfluff:indentation]), but it doesn't explicitly support the more flexible approach that you're suggesting.

I don't think "normal" indentation is that different from team to team (beyond the examples that we've already covered) - what kind of use cases did you have in mind for a much more flexible approach?

alanmcruickshank · 2022-07-14T20:31:15Z

All else being equal, my intuition is that it's better to break lines higher than lower in the tree

Agreed @barrywhart - I'm not yet sure how tree depth and priority might interact, but I definitely agree with the intent.

NiallRees added the enhancement New feature or request label Jan 23, 2022

NiallRees mentioned this issue Feb 20, 2022

[WIP] Add reflow capability #2691

Closed

1 task

alanmcruickshank added this to To do in Major Release: 1.0.0 Mar 2, 2022

alanmcruickshank self-assigned this Jul 14, 2022

alanmcruickshank mentioned this issue Jul 26, 2022

Enhancement: Whitespace reflow utils master issue #3673

Closed

19 tasks

alanmcruickshank mentioned this issue Nov 5, 2022

2.0.0 Release Checklist #4030

Closed

12 tasks

alanmcruickshank mentioned this issue Mar 6, 2023

Add a sqlfluff format CLI command #4473

Merged

alanmcruickshank closed this as completed in #4473 Mar 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add 'reflow' capability for configurable, black-like deterministic formatting #2436

Add 'reflow' capability for configurable, black-like deterministic formatting #2436

NiallRees commented Jan 23, 2022 •

edited

alanmcruickshank commented Jul 14, 2022

barrywhart commented Jul 14, 2022

barrywhart commented Jul 14, 2022

NiallRees commented Jul 14, 2022

alanmcruickshank commented Jul 14, 2022

alanmcruickshank commented Jul 14, 2022

Add 'reflow' capability for configurable, black-like deterministic formatting #2436

Add 'reflow' capability for configurable, black-like deterministic formatting #2436

Comments

NiallRees commented Jan 23, 2022 • edited

Feature

Benefits:

Illustrative example

How

We've talked for a while about the idea of separating rules by what they do. It seems we can clearly draw a line between:

Implementation suggestions

alanmcruickshank commented Jul 14, 2022

Step 1. Blunt reflow.

Step 2. Line break hint segments.

Step 3. Meta segment priority.

Step 4. Progressively pack.

Alternatives/Options

barrywhart commented Jul 14, 2022

barrywhart commented Jul 14, 2022

NiallRees commented Jul 14, 2022

alanmcruickshank commented Jul 14, 2022

alanmcruickshank commented Jul 14, 2022

NiallRees commented Jan 23, 2022 •

edited