Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Jinja whitespace handling in rules #1647

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
019b68d
Issue 1437: Robust Jinja raw/templated slice mapping
Oct 14, 2021
925dfe8
Parametrize the test
Oct 14, 2021
2ca72e5
Add a basic (no whitespace control) test case
Oct 14, 2021
d212a7f
Now correctly handling block_start whitespace stripping
Oct 14, 2021
542a1c0
Add test case
Oct 14, 2021
66c1141
Enhance test case
Oct 14, 2021
ebc9e29
Handle variables with whitespace stripping, add a test for it
Oct 14, 2021
b2b3f42
Tidying, update test case
Oct 14, 2021
5d00b0d
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
barrywhart Oct 14, 2021
a19d393
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
barrywhart Oct 14, 2021
b38e7fa
Handle comments
Oct 14, 2021
f0a6d83
Merge branch 'bhart-issue_1437_robust_jinja_raw_templated_slice_mappi…
Oct 14, 2021
c31b4cd
Add test for issue 1608
Oct 14, 2021
8abfda4
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
barrywhart Oct 14, 2021
352e419
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
barrywhart Oct 14, 2021
6f18a25
Fix some things that were causing lint messages
Oct 14, 2021
c8f63c9
The test for this issue needs to run "fix"
Oct 14, 2021
6212dcf
Add missing file, update .gitignore
Oct 14, 2021
42de56b
"no cover"
Oct 14, 2021
abe5e2a
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
barrywhart Oct 14, 2021
651450c
Trigger build
Oct 14, 2021
52c88e5
Merge branch 'bhart-issue_1437_robust_jinja_raw_templated_slice_mappi…
Oct 14, 2021
b370d08
PR review, "no cover"
Oct 14, 2021
39cedc9
Add two more test cases, basic_data and strip_right_data
Oct 14, 2021
996d19b
PR review
Oct 14, 2021
332aba7
More comments on the algorithm
Oct 14, 2021
168b7d3
More comments
Oct 14, 2021
1928ad6
Handle right whitespace stripping similarly as left
Oct 15, 2021
07b0532
Test updates
Oct 15, 2021
c353999
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
barrywhart Oct 15, 2021
72917dd
Remove test case changes that were added in this PR and stopped worki…
Oct 15, 2021
d1cd2bc
Merge branch 'bhart-issue_1437_robust_jinja_raw_templated_slice_mappi…
Oct 15, 2021
9845311
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
barrywhart Oct 15, 2021
16371cc
Add "pragma no cover"
Oct 15, 2021
d5c92c8
Merge branch 'bhart-issue_1437_robust_jinja_raw_templated_slice_mappi…
Oct 15, 2021
2709aab
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
barrywhart Oct 15, 2021
e06b50e
Change how I'm doing "pragma: no cover" to be simpler/safer
Oct 15, 2021
31ea23e
Merge branch 'bhart-issue_1437_robust_jinja_raw_templated_slice_mappi…
Oct 15, 2021
7ad47e2
Merge branch 'main' into bhart-issue_1437_robust_jinja_raw_templated_…
tunetheweb Oct 16, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -42,3 +42,4 @@ test-reports
# Ignore conda environment.yml contributors might be using and direnv config
environment.yml
.envrc
**/*FIXED.sql
Expand Up @@ -439,7 +439,3 @@ def _unsafe_process(self, fname, in_str=None, config=None):
# No violations returned in this way.
[],
)

@classmethod
def _preprocess_template(cls, in_str: str) -> str:
return in_str
@@ -0,0 +1,3 @@
{% macro echo(colname) %}
{{colname}}
{% endmacro %}
@@ -0,0 +1,16 @@
{{ config(materialized='view') }}

with cte_example as (
select 1 as col_name
),

final as
(
select
col_name,
{{- echo('col_name') -}} as col_name2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, sqlfluff fix was corrupting this templated line of code.

from
cte_example
)

select * from final
@@ -0,0 +1,16 @@
{{ config(materialized='view') }}

with cte_example as (
select 1 as col_name
),

final as
(
select
col_name,
{{- echo('col_name') -}} as col_name2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line looks strange after fixing because the templated file itself is pretty strange -- here's the file after rendering, before any fixes. Note that sqlfluff fix was previously corrupting the file; the goal for now is to address the corruption issue.

Future issues & PRs may improve the appearance when fixing templated files that use whitespace control.

with cte_example as (
     select 1 as col_name
),

final as
(
    select
        col_name,
col_name
as col_name2
    from
        cte_example
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly the extra whitespace is EXACTLY the length of col_name. Is it possible you're not taking into account the length of the templated text?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this. The amount of whitespace is not affected by changing col_name to something longer. Rather, this is happening because of the echo() macro, defined as:

{% macro echo(colname) %}
{{colname}}
{% endmacro %}

This results in col_name having no indentation in the rendered file (see the expanded file above). SQLFluff is trying to fix the indentation by adding 8 spaces. It shouldn't do this, but this is a separate issue. (I see the exact same behavior if I remove whitespace stripping from the input file.)

from
cte_example
)

select * from final
18 changes: 18 additions & 0 deletions plugins/sqlfluff-templater-dbt/test/templater_test.py
Expand Up @@ -215,6 +215,24 @@ def test__dbt_templated_models_do_not_raise_lint_error(
assert len(violations) == 0


def test__dbt_templated_models_fix_does_not_corrupt_file(project_dir): # noqa: F811
"""Test fix for issue 1608. Previously "sqlfluff fix" corrupted the file."""
lntr = Linter(config=FluffConfig(configs=DBT_FLUFF_CONFIG))
lnt = lntr.lint_path(
os.path.join(project_dir, "models/my_new_project/issue_1608.sql"), fix=True
)
lnt.persist_changes(fixed_file_suffix="FIXED")
with open(
os.path.join(project_dir, "models/my_new_project/issue_1608.sql.after")
) as f:
comp_buff = f.read()
with open(
os.path.join(project_dir, "models/my_new_project/issue_1608FIXED.sql")
) as f:
fixed_buff = f.read()
assert fixed_buff == comp_buff


def test__templater_dbt_templating_absolute_path(
project_dir, dbt_templater # noqa: F811
):
Expand Down
12 changes: 6 additions & 6 deletions src/sqlfluff/core/linter/linted_file.py
Expand Up @@ -385,29 +385,29 @@ def fix_string(self) -> Tuple[Any, bool]:
continue
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After my changes, this section of code was no longer covered by tests. I am not familiar with this code and had not touched it in this PR, so for now I added # pragma: no cover.

I ran the full test suite in main to determine which test(s) were hitting this code, and there was only one: test/rules/std_fix_auto_test.py::test__std_fix_auto[ansi-017_lintresult_fixes_cannot_span_block_boundaries]. I checked my PR to see if the jinja.py changes were involved in this test, and they aren't (i.e. it executes the ifs in the new code, but skips over the bodies). This makes sense, because the test doesn't use whitespace stripping. I'd rather not go any farther down this rabbit hole, as it could be bottomless...

# We have a single occurrence of the thing we want to patch. This
# means we can use its position to place our patch.
new_source_slice = slice(
new_source_slice = slice( # pragma: no cover
enriched_patch.source_slice.start + positions[0],
enriched_patch.source_slice.start
+ positions[0]
+ len(enriched_patch.templated_str),
)
enriched_patch = EnrichedFixPatch(
enriched_patch = EnrichedFixPatch( # pragma: no cover
source_slice=new_source_slice,
templated_slice=enriched_patch.templated_slice,
patch_category=enriched_patch.patch_category,
fixed_raw=enriched_patch.fixed_raw,
templated_str=enriched_patch.templated_str,
source_str=enriched_patch.source_str,
)
linter_logger.debug(
linter_logger.debug( # pragma: no cover
" * Keeping Tricky Case. Positions: %s, New Slice: %s, Patch: %s",
positions,
new_source_slice,
enriched_patch,
)
filtered_source_patches.append(enriched_patch)
dedupe_buffer.append(enriched_patch.dedupe_tuple())
continue
filtered_source_patches.append(enriched_patch) # pragma: no cover
dedupe_buffer.append(enriched_patch.dedupe_tuple()) # pragma: no cover
continue # pragma: no cover

# Sort the patches before building up the file.
filtered_source_patches = sorted(
Expand Down
2 changes: 1 addition & 1 deletion src/sqlfluff/core/templaters/base.py
Expand Up @@ -310,7 +310,7 @@ def templated_slice_to_source_slice(
if ts_start_sf_start < len(self.sliced_file):
return self.sliced_file[1].source_slice
else:
return self.sliced_file[-1].source_slice
return self.sliced_file[-1].source_slice # pragma: no cover
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this gone out of cover? Is this still the right return value? Or should the first "We should never get here" clause include this (i.e. a >= instead of a >)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why -- I think this is one of those areas where there are no explicit tests, it's just tested implicitly by higher-level code. Presumably, updating the raw / templated mapping caused it to stop hitting this code. I can't say if it's no longer necessary or if there's just no test case that's hitting it anymore.

else:
start_slices = self.sliced_file[ts_start_sf_start:ts_start_sf_stop]
if ts_stop_sf_start == ts_stop_sf_stop: # pragma: no cover TODO?
Expand Down
79 changes: 60 additions & 19 deletions src/sqlfluff/core/templaters/jinja.py
Expand Up @@ -330,12 +330,52 @@ def _slice_template(cls, in_str: str) -> Iterator[RawFileSlice]:
}

# https://jinja.palletsprojects.com/en/2.11.x/api/#jinja2.Environment.lex
for _, elem_type, raw in env.lex(cls._preprocess_template(in_str)):
for _, elem_type, raw in env.lex(in_str):
if elem_type == "data":
yield RawFileSlice(raw, "literal", idx)
idx += len(raw)
continue
str_buff += raw

if elem_type.endswith("_begin"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a corresponding tag for skipped characters after the tag too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprisingly, no. To confirm this, I added two more test cases to jinja_test.py:

  • basic_data
  • `strip_right_data

I confirmed that test doesn't hit the Jinja lexer "strip" code. AFAICT, adding a - on the close tag does nothing -- the rendered code looks identical whether it's present or not. Maybe this is a misconception about Jinja?

Also note that in the lexer, there's a token type OptionalLStrip but no OptionalRStrip, and there's no similar-looking code for stripping on the other side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that's the case. If you put this SQL in a test.sql file:

SELECT
    {{ 'col1,' -}}
    col2
FROM
    table1

And run sqlfluff parse test.sql and then remove the negative sign and rerun you'll see it's different (it's missing the newline and whitespace before col2 as expected when the -}} is used).

So it definitely does something. So confused why it's not needed. Or is the fact you've not coded that the real reason for the strange "after" in above test case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed your point, @tunetheweb, and confirmed this in detail by logging lex() output.

No right strip:

lex: 'SELECT\n    ' | data
lex: '{{' | variable_begin
lex: ' ' | whitespace
lex: "'col1,'" | string
lex: ' ' | whitespace
lex: '}}' | variable_end
lex: '\n    col2\nFROM\n    table1\n' | data

Right strip:

lex: 'SELECT\n    ' | data
lex: '{{' | variable_begin
lex: ' ' | whitespace
lex: "'col1,'" | string
lex: ' ' | whitespace
lex: '-}}\n    ' | variable_end
lex: 'col2\nFROM\n    table1\n' | data

Very sloppy work on my part! I'll dig into lex() to see where that behavior is hidden (not strictly necessary, but I'll feel better understanding it) and update my PR and test cases to handle this properly. 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem, though, that with right stripping, no whitespace is missing from lex() output. Rather, the whitespace is shifted from the following token onto the _end token. Unsure at this point how this affects linting, but it doesn't appear to trigger the same file corruption behavior we see with left stripping, where the file slice offsets are wrong because the total length of lex output doesn't match the total length of the input.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial answer: Right stripping, is handled entirely in the tokenizer regexes, e.g.

(Pdb) m.re
re.compile('\\-\\}\\}\\s*|\\}\\}', re.MULTILINE|re.DOTALL)

Note the left vs right hand side of the | in the regex. The left side detects the - and consumes all the trailing whitespace. The right side handles the "no strip" case, which leaves the whitespace to be consumed by the next token.

I'm unsure what the "correct" behavior is here, but if right stripping needs to behave the same as left stripping, we'd want to detect when this happens and split the end token into two slices: the end token itself and a second token with the trailing whitespace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I added handling for right stripping. As noted above, no whitespace was missing from lex(), so this change was more about making left and right stripping result in similar "raw slices" in SQLFluff, not about avoiding file corruption.

I also updated a couple of right stripping test cases.

# When a "begin" tag (whether block, comment, or data) uses
# whitespace stripping
# (https://jinja.palletsprojects.com/en/3.0.x/templates/#whitespace-control),
# the Jinja lex() function handles this by discarding adjacent
# whitespace from in_str. For more insight, see the tokeniter()
# function in this file:
# https://github.com/pallets/jinja/blob/main/src/jinja2/lexer.py
# We want to detect and correct for this in order to:
# - Correctly update "idx" (if this is wrong, that's a
# potential DISASTER because lint fixes use this info to
# update the source file, and incorrect values often result in
# CORRUPTING the user's file so it's no longer valid SQL. :-O
# - Guarantee that the slices we return fully "cover" the
# contents of in_str.
#
# We detect skipped characters by looking ahead in in_str for
# the token just returned from lex(). The token text will either
# be at the current 'idx' position (if whitespace stripping did
# not occur) OR it'll be farther along in in_str, but we're
# GUARANTEED that lex() only skips over WHITESPACE; nothing else.

# Find the token returned. Did lex() skip over any characters?
num_chars_skipped = in_str.index(raw, idx) - idx
if num_chars_skipped:
# Yes. It skipped over some characters. Compute a string
# containing the skipped characters.
skipped_str = in_str[idx : idx + num_chars_skipped]

# Sanity check: Verify that Jinja only skips over
# WHITESPACE, never anything else.
if not skipped_str.isspace(): # pragma: no cover
templater_logger.warning(
"Jinja lex() skipped non-whitespace: %s", skipped_str
)
# Treat the skipped whitespace as a literal.
yield RawFileSlice(skipped_str, "literal", idx)
idx += num_chars_skipped
Comment on lines +363 to +377
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing!!! Seems so simple.


# raw_end and raw_begin behave a little differently in
# that the whole tag shows up in one go rather than getting
# parts of the tag at a time.
Expand All @@ -361,22 +401,23 @@ def _slice_template(cls, in_str: str) -> Iterator[RawFileSlice]:
block_type = "block_start"
if trimmed_content.split()[0] == "for":
block_subtype = "loop"
yield RawFileSlice(str_buff, block_type, idx, block_subtype)
idx += len(str_buff)
m = re.search(r"\s+$", raw, re.MULTILINE | re.DOTALL)
if raw.startswith("-") and m:
# Right whitespace was stripped. Split off the trailing
# whitespace into a separate slice. The desired behavior is
# to behave similarly as the left stripping case above.
# Note that the stakes are a bit different, because lex()
# hasn't *omitted* any characters from the strings it
# returns, it has simply grouped them differently than we
# want.
trailing_chars = len(m.group(0))
yield RawFileSlice(
str_buff[:-trailing_chars], block_type, idx, block_subtype
)
idx += len(str_buff) - trailing_chars
yield RawFileSlice(str_buff[-trailing_chars:], "literal", idx)
idx += trailing_chars
else:
yield RawFileSlice(str_buff, block_type, idx, block_subtype)
idx += len(str_buff)
str_buff = ""

@classmethod
def _preprocess_template(cls, in_str: str) -> str:
"""Does any preprocessing of the template required before expansion."""
# Using Jinja whitespace stripping (e.g. `{%-` or `-%}`) breaks the
# position markers between unlexed and lexed file. So let's ignore any
# request to do that before lexing, by replacing '-' with '+'
#
# Note: '+' is the default, so shouldn't really be needed but we
# explicitly state that to preserve the space for the missing '-' character
# so it looks the same.
in_str = in_str.replace("{%-", "{%+")
in_str = in_str.replace("-%}", "+%}")
in_str = in_str.replace("{#-", "{#+")
in_str = in_str.replace("-#}", "+#}")
return in_str