Refactor JinjaTracer: Split into two classes, break up `_slice_template()` function #2870

barrywhart · 2022-03-15T18:17:56Z

Brief summary of the change made

Several recent fixes to JinjaTracer have demonstrated that it's a large class that's tricky to maintain. Two major concerns:

The JinjaTracer._slice_template() function is 275 lines long with lots of intricate if conditions, it's hard to read and easy to break. This PR splits it up into a number of smaller functions, reducing it to about 100 lines.
JinjaTracer performs two related but fairly distinct tasks. Combining these in one class makes it hard to understand because it's unclear at times which fields are used for one task and which for both.
- Analyze a template to prepare for tracing.
- Trace the template.

In creating this PR, I was pretty careful to avoid making behavior changes to the code. This code has pretty good automated testing now, as well as internal sanity checking of the JinjaTracer output. So, in reviewing this PR, I suggest focusing more on coding style and readability than correctness or "what changed from before" (because the text changes are massive due to moving things around).

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
- .yml rule test cases in test/fixtures/rules/std_rule_cases.
- .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
- Full autofix test cases in test/fixtures/linter/autofix.
- Other.
Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.

codecov · 2022-03-15T18:30:03Z

Codecov Report

Merging #2870 (e9f9aa3) into main (c0f1983) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #2870   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          163       163           
  Lines        12600     12632   +32     
=========================================
+ Hits         12600     12632   +32

Impacted Files	Coverage Δ
src/sqlfluff/core/templaters/jinja.py	`100.00% <100.00%> (ø)`
src/sqlfluff/core/templaters/slicers/tracer.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0f1983...e9f9aa3. Read the comment docs.

barrywhart · 2022-03-15T19:03:08Z

requirements_dev.txt

@@ -22,6 +22,7 @@ types-chardet
 types-appdirs
 types-colorama
 types-pyyaml
+types-Jinja2


Unrelated to the rest of the PR, but I noticed we were missing this

barrywhart · 2022-03-15T19:03:40Z

src/sqlfluff/core/templaters/slicers/tracer.py

-        return RawSliceInfo(unique_alternate_id, alternate_code, [])
+        return self.make_raw_slice_info(unique_alternate_id, alternate_code, [])
+
+    def update_inside_set_or_macro(self, block_type, trimmed_parts):


Extracted this function from update_inside_set_or_macro().

barrywhart · 2022-03-15T19:05:26Z

src/sqlfluff/core/templaters/slicers/tracer.py

+            # Exiting a set/macro block.
+            self.inside_set_or_macro = False
+
+    def make_raw_slice_info(self, *args, **kwargs) -> RawSliceInfo:


Created this helper function to centralize/reduce the number of places we check inside_set_or_macro. This reduces the likelihood of future bugs, where we forget to check this flag.

barrywhart · 2022-03-15T19:06:04Z

src/sqlfluff/core/templaters/slicers/tracer.py

@@ -215,9 +247,9 @@ def _slice_template(self) -> List[RawFileSlice]:
        # https://jinja.palletsprojects.com/en/2.11.x/api/#jinja2.Environment.lex
        stack = []
        result = []
-        set_idx = None


The set_idx variable was moved and renamed as a new instance field on JinjaTracer: inside_set_or_macro.

barrywhart · 2022-03-15T21:24:45Z

Tagging @OTooleMichael for review. I don't think I can explicitly add him as a reviewer yet (probably a GitHub group needs updating).

tunetheweb · 2022-03-15T21:51:13Z

Tagging @OTooleMichael for review. I don't think I can explicitly add him as a reviewer yet (probably a GitHub group needs updating).

@alanmcruickshank sent you an invite couple of days back. Check your email (and spam folder) @OTooleMichael

OTooleMichael

LGTM:
I read the code linearly, commented as I went and asked questions.

OTooleMichael · 2022-03-16T07:13:12Z

src/sqlfluff/core/templaters/slicers/tracer.py

+    re_open_tag = regex.compile(r"^\s*({[{%])[\+\-]?\s*")
+    re_close_tag = regex.compile(r"\s*[\+\-]?([}%]})\s*$")


Why are these not full Regexes - ie starting with ^ and ending with $?

Also wondering if gathering the whitespace before a tag has influence on indents (I note later on that search is used vs the lexed array) perhaps some of these matches are forced by what the lexer returns?

The templated block is a whole string and not the individual components.

So {% if ... %} is all one templated string.

So we want ^{% for opening tag (simplified), and %}$ for closing tag (simplified) of that string, as @barrywhart has.

In the past, we received one token at a time from Jinja, then combined them into a single string (str_buff) once they became a full block, e.g. {% if a == 3 %}. Now we still build str_buff, but we also keep the parts separate in an array str_parts. The regexes remain the same for legacy reasons. I think it would work either way.

Mostly we want what the lexer returns, as the purpose of this class is to deduce exactly what Jinja is going to return for the "real" template, including tricky whitespace handling. That's been working pretty well for a while. If you look at handle_left_whitespace_stripping() and the code section (not function beginning with # Right whitespace was stripped., you'll see how some of the trickier bits work.

There is one case where Jinja returns the whole block at a time, and that's a {% raw %} ... {% endraw %} section.

OTooleMichael · 2022-03-16T07:15:44Z

src/sqlfluff/core/templaters/slicers/tracer.py

+                m_strip_right = regex.search(
+                    r"\s+$", raw, regex.MULTILINE | regex.DOTALL
+                )


What happening here ?

Not sure you need the DOTALL here (there is no .)?

We're trying to determine if a block uses right whitespace stripping, e.g.

SELECT {{ 'col1,' -}} col2

Note the -. This tells Jinja to discard any whitespace between the }} and the next bit of text. We need to account for this because it affects the template output.

Which goes to my next question further down.

Also still not sure we need the DOTALL here.

I can't remember why these flags were added. Maybe they were necessary back when we only had the whole block string, rather than the parts? Note that in dbt, it's common to do things like:

DOTALL actually means "allow the . pattern character to match newlines, and MULTILINE` means "allow multiline matches". https://www.codespeedy.com/re-dotall-in-python/

I'm hesitant to change this now. 😅 If the string were really long, I could see it slowing things down a bit, but this string is literally just one Jinja token -- should be very short.

Fair point. It's not changed in this PR and don't think it'll matter too much one way of another. So fine leaving it.

OTooleMichael · 2022-03-16T07:22:35Z

src/sqlfluff/core/templaters/slicers/tracer.py

+            # Handle a tag received in one go.
+            trimmed_content = str_buff[len(m_open.group(0)) : -len(m_close.group(0))]
+            trimmed_parts = trimmed_content.split()


Is there need for the If and the str_parts array at all, can the Else just be applied to both scenarios? (I guess its not expensive enough to matter about the splitting computation)

Are you asking if we don't need to keep the individual lex results as separate strings?

The reason I did that is that Jinja doesn't just split on whitespace, so by keeping things as Jinja returns them, the code is more robust. Here's a test case for why:

{% set col= "col1" %} SELECT {{ col }}

Note there's no space after col. Before switching to str_parts, we needed special handling since we might see any of:

{% set col= "col1" %} {% set col = "col1" %} {% set col ="col1" %}

This resulted in ugly, fragile code. Now it "just works"™️.

I'll add the above as a code comment.

I just meant why can the algo between lines 438 and 442 just be applied to both the if and the else.
the fn declaration says str_buff is non optional.

str_buff is required, but we prefer to use str_parts because it makes better use of Jinja lexing.

(Not sure I'm fully understanding, though.)

If you feel like tinkering with it, the tests in test/core/templaters/jinja_test.py are pretty thorough.

OTooleMichael · 2022-03-16T07:25:25Z

src/sqlfluff/core/templaters/slicers/tracer.py

+        """
+        # Find the token returned. Did lex() skip over any characters?
+        num_chars_skipped = self.raw_str.index(token, self.idx_raw) - self.idx_raw
+        if num_chars_skipped:


Personally Id prefer a guard if not num_chars_skipped: \ return

I'll do that. Previously, I had just mechanically pasted this code after pulling it out of a larger function. 👍

OTooleMichael · 2022-03-16T07:27:53Z

src/sqlfluff/core/templaters/slicers/tracer.py

+                    # to behave similarly as the left stripping case. Note that
+                    # the stakes are a bit lower here, because lex() hasn't
+                    # *omitted* any characters from the strings it returns,
+                    # it has simply grouped them differently than we want.
+                    trailing_chars = len(m_strip_right.group(0))
+                    self.raw_sliced.append(


Perhaps just throw this into a fn for ease of reading its nice to balance
handle_left_whitespace_stripping even if this one is easier

I looked at doing that a few different ways. It's definitely doable, but the function ended up taking something like 5 parameters, and I couldn't find a reasonable way to get that down. (I tried to limit the new functions to 3 or fewer parameters.) The logic here is not complex, but it depends on a lot of state, so it seemed like maybe it wasn't improving the code. Thoughts?

src/sqlfluff/core/templaters/slicers/tracer.py

tunetheweb

LGTM

Minor cleanup/refactor of JinjaTracer

048e10a

barrywhart marked this pull request as draft March 15, 2022 18:18

More refactoring

212a885

barrywhart commented Mar 15, 2022

View reviewed changes

Barry Hart added 7 commits March 15, 2022 15:41

More refactoring

e562925

More refactoring

8f4ec6f

Add more type annotations

e68510f

Extract a function track_block_end() out of update_next_slice_indices()

89de547

Make some functions static

b18c6a7

Split up _slice_template() some more

dc15b3f

More types and tidying

2170fb6

barrywhart changed the title ~~Minor cleanup/refactor of JinjaTracer~~ Refactor JinjaTracer: Simplify _slice_template() by splitting it up Mar 15, 2022

barrywhart changed the title ~~Refactor JinjaTracer: Simplify _slice_template() by splitting it up~~ JinjaTracer tech debt: Split _slice_template() into smaller functions Mar 15, 2022

barrywhart marked this pull request as ready for review March 15, 2022 21:18

barrywhart requested a review from tunetheweb March 15, 2022 21:18

Barry Hart added 5 commits March 15, 2022 23:35

RawSliceInfo: Provide default value for next_slice_indices

fd1e552

Use RawSliceInfo instance instead of alt_id/alt_code variables

4129044

Split JinjaTracer into two classes: JinjaAnalyzer and JinjaTracer

9171712

Make idx a field, rename as idx_raw

a642326

Tidying

5bdc8a8

barrywhart changed the title ~~JinjaTracer tech debt: Split _slice_template() into smaller functions~~ Refactor JinjaTracer: Split into two classes, break up long _slice_template() function Mar 16, 2022

barrywhart changed the title ~~Refactor JinjaTracer: Split into two classes, break up long _slice_template() function~~ Refactor JinjaTracer: Split into two classes, break up _slice_template() function Mar 16, 2022

barrywhart changed the title ~~Refactor JinjaTracer: Split into two classes, break up _slice_template() function~~ Refactor JinjaTracer: Split into two classes, break up "_slice_template()" function Mar 16, 2022

OTooleMichael approved these changes Mar 16, 2022

View reviewed changes

tunetheweb reviewed Mar 16, 2022

View reviewed changes

src/sqlfluff/core/templaters/slicers/tracer.py Outdated Show resolved Hide resolved

Barry Hart and others added 4 commits March 16, 2022 08:59

PR review

aadd5e8

PR review

0cd54db

Add explanatory comment about str_buff and str_parts

b324f59

Merge branch 'main' into bhart-jinja_tracer_tidying

e9f9aa3

tunetheweb approved these changes Mar 16, 2022

View reviewed changes

barrywhart merged commit 28df654 into sqlfluff:main Mar 16, 2022

tunetheweb changed the title ~~Refactor JinjaTracer: Split into two classes, break up "_slice_template()" function~~ Refactor JinjaTracer: Split into two classes, break up _slice_template() function Mar 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor JinjaTracer: Split into two classes, break up `_slice_template()` function #2870

Refactor JinjaTracer: Split into two classes, break up `_slice_template()` function #2870

barrywhart commented Mar 15, 2022 •

edited

codecov bot commented Mar 15, 2022 •

edited

barrywhart Mar 15, 2022

barrywhart Mar 15, 2022

barrywhart Mar 15, 2022

barrywhart Mar 15, 2022 •

edited

barrywhart commented Mar 15, 2022

tunetheweb commented Mar 15, 2022

OTooleMichael left a comment

OTooleMichael Mar 16, 2022

tunetheweb Mar 16, 2022

barrywhart Mar 16, 2022

OTooleMichael Mar 16, 2022

tunetheweb Mar 16, 2022

barrywhart Mar 16, 2022

tunetheweb Mar 16, 2022

barrywhart Mar 16, 2022

tunetheweb Mar 16, 2022 •

edited

OTooleMichael Mar 16, 2022

barrywhart Mar 16, 2022

barrywhart Mar 16, 2022

OTooleMichael Mar 16, 2022

barrywhart Mar 16, 2022

OTooleMichael Mar 16, 2022

barrywhart Mar 16, 2022

OTooleMichael Mar 16, 2022

barrywhart Mar 16, 2022

tunetheweb left a comment

		re_open_tag = regex.compile(r"^\s({[{%])[\+\-]?\s")
		re_close_tag = regex.compile(r"\s[\+\-]?([}%]})\s$")

Refactor JinjaTracer: Split into two classes, break up _slice_template() function #2870

Refactor JinjaTracer: Split into two classes, break up _slice_template() function #2870

Conversation

barrywhart commented Mar 15, 2022 • edited

Brief summary of the change made

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

codecov bot commented Mar 15, 2022 • edited

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

barrywhart Mar 15, 2022 • edited

Choose a reason for hiding this comment

barrywhart commented Mar 15, 2022

tunetheweb commented Mar 15, 2022

OTooleMichael left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tunetheweb Mar 16, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tunetheweb left a comment

Choose a reason for hiding this comment

Refactor JinjaTracer: Split into two classes, break up `_slice_template()` function #2870

Refactor JinjaTracer: Split into two classes, break up `_slice_template()` function #2870

barrywhart commented Mar 15, 2022 •

edited

codecov bot commented Mar 15, 2022 •

edited

barrywhart Mar 15, 2022 •

edited

tunetheweb Mar 16, 2022 •

edited