Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 'reflow' capability for configurable, black-like deterministic formatting #2436

Closed
NiallRees opened this issue Jan 23, 2022 · 6 comments · Fixed by #4473
Closed

Add 'reflow' capability for configurable, black-like deterministic formatting #2436

NiallRees opened this issue Jan 23, 2022 · 6 comments · Fixed by #4473
Assignees
Labels
enhancement New feature or request

Comments

@NiallRees
Copy link
Member

NiallRees commented Jan 23, 2022

Feature

Same functionality as black, but totally configurable. The 'reflow' command would produce a deterministic output for the same 'code_only' input, regardless of the non-code elements (whitespace, capitalisation, indentation).

Benefits:

  • Fully user configurable code style
  • Deterministic output for queries with the same choice and order of code segments
  • No need to maintain and create rules for formatting-only concerns, which is hard currently as we have to detect bad SQL, which has infinite possibilities. By letting the user decide what good SQL is, we only have to define how the SQL is rendered.
  • The implementation suggested is purely additive in relation to the parse tree, mitigating issues where formatting rules can interfere with each other.

Illustrative example

The user has a sql file with the following:

select a,b, 
c FROM schema.table TBL

The user has configured the following:

base:
  case: lower # render all base segments (BaseSegment) and children in lowercase
select_clause_elements:
  indent_from_parent: 1 # indents all of `select_clause_elements_segment` by one unit from parent segment (this grouping segment doesn't exist yet)
select_clause_element:
  pre_newline: true # adds a newline before `select_clause_element`
from_clause:
  pre_newline: true # adds a newline before `from_clause`

And so the output is:

select
    a,
    b,
    c
from schema.table tbl

How

We're already embedding Indent and Dedent into the dialects, this feature builds upon that to add all configuration and functionality to render valid SQL from code-only elements.

  1. We parse the SQL as usual
  2. We remove all non-code segments from the parse tree (output here is what our dialect test fixture ymls look like).
  3. Using this parse tree, we render the 'reflowed' SQL.

We've talked for a while about the idea of separating rules by what they do. It seems we can clearly draw a line between:

  1. Rules which care about query structure (choice and ordering of code segments)
  2. Rules which just care about appearance (non-code segments, strictly whitespace, capitalisation and indentation).

This feature focuses strictly on the latter and would not add, remove or modify code segments beyond their capitalisation.

Implementation suggestions

  • Embed the rendering logic into the dialect grammars/segments for maximum maintainability and ease of contribution. This is deliberately moving away from the approach in L003 which is very difficult to maintain due to its aim of handling all indentation across every grammar and segment.
  • Unless a segment/grammar has unique rendering logic, it can inherit the logic and behaviour from its parent. For instance, we only need to define 'case' and pre_newline config at the top level segment.
  • Configuration precedence and inheritance:
      1. User-specified configuration
      1. 'Theme'-specified configuration (we could have a library of SQL styles which could be used as a base)
      1. Segment specified configuration
      1. Next closest parent segment configuration (e.g. SchemaReferenceSegment would use ObjectReferenceSegment's configuration if it did not have its own).
    • Acceptance criteria for a V1 could be that we're just able to render valid SQL from code-only segments. We can add configurable style after that.
  • Not all formatting rules are as simple as 'always add newlines before select elements'. But, we could make segment-specific configurations. For example there could feasibly be a config for select_clause_element: pre_newline_when_multiple.

My current availability combined with the fact that there are many much better developers than me around here means I would really love to see someone/others take this on, I'm not precious about any of the above ideas!

@NiallRees NiallRees added the enhancement New feature or request label Jan 23, 2022
@alanmcruickshank alanmcruickshank self-assigned this Jul 14, 2022
@alanmcruickshank
Copy link
Member

So I'm going to take a go at this as my next biggie. Here's my rough plan of attack:

Step 1. Blunt reflow.

First step of implementing this I think will just be on the API (i.e. not on the CLI yet until this is more stable). The "blunt reflow" would be to remove all whitespace and newlines, then step through the file, adding a newline (and appropriate indent) for every Indent() and Dedent() segment, otherwise adding single whitespaces.

This will be implemented ideally as a method on BaseSegment so it can either be called on a while file, statement or even just a specific clause. There will importantly be a configurable limit on how much to recurse, so when applying a reflow, rules might choose to only reflow the layer where a fix was applied and not every child segment (and their children and so on).

This is totally deterministic, but ignores line lengths, delimiters and will look kind of silly. It demonstrates proof of concept for what we're trying to do though.

Step 2. Line break hint segments.

To facilitate appropriate splitting of lists (e.g. SELECT a,b,c,d,e,f,g) I suggest added a new MetaSegment (similar to Indent()), which hints at appropriate places to put newlines. LineBreak() is probably a parent class of Indent() in that all indents are possible line breaks but not visa versa. Like indents, these would be configurable so that we can account for things like trailing or leading commas, and ditto for operators.

Adding this capability into the blunt reflow just makes files even longer. Things still look silly, but this is the most extreme version of laying out a file. All still deterministic.

Step 3. Meta segment priority.

For all Indent(), Dedent() and LineBreak() segments, introduce configurable a priority value. Higher values suggest more likely places to put line breaks (with or without indents). One edit option would be to change priority values to be very high (perhaps a dummy value of 100), which effectively makes a newline compulsory. This is the method we would encourage for rules which detect a missing newline: they don't actually add the newline, they just make one compulsory and then trigger a reflow.

NOTE: at this stage, the priority isn't actually used, but we can set it.

Step 4. Progressively pack.

So I think this is where most of the "magic" happens:

  • We already have a mix line length.
  • We would now have a priority score for all potential places for newlines.

I'm not sure whether we first "blunt reflow" and then work backwards, or whether we remove all newlines and work forwards (probably the latter). Assuming the latter, we progressively add high priority newlines (and work down the priority order), until a segment (and all it's children) all fit within the max line length.

That would have the effect for example that we might have to reflow more aggressively in a SELECT clause, but less in a FROM clause within the same query (i.e. different priority line breaks would be activated), but that's ok based on the varying complexity of the content of those sections.

Alternatives/Options

  1. I think one variation of this plan is that rather than having an explicit LineBreak segment, we enable some segments to be configured as a kind of explicit delimiter - similar to the suggestion above of a "newline after" or "newline before" config. This results in a much cleaner parse tree, and fewer scattered Meta segments - but it does increase complexity within an already crowded BaseSegment class.
  2. Perhaps a trivial variation on the plan would be that for compulsory line breaks, there's an explicit compulsory flag which is settable by rules, rather than just setting the priority to a really high value (likely via a constant).
  3. I haven't really accounted for configuration of the priority values - in this plan I'm assuming they're hard coded in the dialect (albeit with sensible defaults on things like the Delimited grammar). This presents an interesting tension between rules setting some priority values and the dialect setting the rest - which would result in a cli reflow command giving contradictory results to a fix command with all rules. That's not necessarily bad but does pose some interesting questions.

Would love feedback on this rough plan before I dive in.

@barrywhart
Copy link
Member

The plan sounds good overall. I had to skim it just now, but one thought I've had about this is that the depth of a segment in the parse tree should have some influence on how lines are broken. All else being equal, my intuition is that it's better to break lines higher than lower in the tree, e.g. each function parameter on its own line is preferred to splitting up a computation for a single parameter, e.g. a + b * c .... Just a thought...

1 similar comment
@barrywhart
Copy link
Member

The plan sounds good overall. I had to skim it just now, but one thought I've had about this is that the depth of a segment in the parse tree should have some influence on how lines are broken. All else being equal, my intuition is that it's better to break lines higher than lower in the tree, e.g. each function parameter on its own line is preferred to splitting up a computation for a single parameter, e.g. a + b * c .... Just a thought...

@NiallRees
Copy link
Member Author

Sounds great @alanmcruickshank . What are your thoughts on segment-level configurability of this, and do you think there are any considerations we should be wary of early on on to facilitate that?

Referring to this idea in the issue description:

base:
  case: lower # render all base segments (BaseSegment) and children in lowercase
select_clause_elements:
  indent_from_parent: 1 # indents all of `select_clause_elements_segment` by one unit from parent segment (this grouping segment doesn't exist yet)
select_clause_element:
  pre_newline: true # adds a newline before `select_clause_element`
from_clause:
  pre_newline: true # adds a newline before `from_clause`

Where these configs would then descend down to any segments which inherit from another. I'm then imagining a world where community members can codify their entire code style using this config but perhaps I'm getting too far ahead.

@alanmcruickshank
Copy link
Member

Where these configs would then descend down to any segments which inherit from another. I'm then imagining a world where community members can codify their entire code style using this config but perhaps I'm getting too far ahead.

For configuration it would leverage the existing configuration options for turning some meta segments on and off (a la indented_joins, indented_ctes and other options in [sqlfluff:indentation]), but it doesn't explicitly support the more flexible approach that you're suggesting.

I don't think "normal" indentation is that different from team to team (beyond the examples that we've already covered) - what kind of use cases did you have in mind for a much more flexible approach?

@alanmcruickshank
Copy link
Member

All else being equal, my intuition is that it's better to break lines higher than lower in the tree

Agreed @barrywhart - I'm not yet sure how tree depth and priority might interact, but I definitely agree with the intent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Development

Successfully merging a pull request may close this issue.

3 participants