Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor JinjaTracer: Split into two classes, break up _slice_template() function #2870

Merged
merged 18 commits into from Mar 16, 2022

Conversation

barrywhart
Copy link
Member

@barrywhart barrywhart commented Mar 15, 2022

Brief summary of the change made

Several recent fixes to JinjaTracer have demonstrated that it's a large class that's tricky to maintain. Two major concerns:

  • The JinjaTracer._slice_template() function is 275 lines long with lots of intricate if conditions, it's hard to read and easy to break. This PR splits it up into a number of smaller functions, reducing it to about 100 lines.
  • JinjaTracer performs two related but fairly distinct tasks. Combining these in one class makes it hard to understand because it's unclear at times which fields are used for one task and which for both.
    • Analyze a template to prepare for tracing.
    • Trace the template.

In creating this PR, I was pretty careful to avoid making behavior changes to the code. This code has pretty good automated testing now, as well as internal sanity checking of the JinjaTracer output. So, in reviewing this PR, I suggest focusing more on coding style and readability than correctness or "what changed from before" (because the text changes are massive due to moving things around).

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@barrywhart barrywhart marked this pull request as draft March 15, 2022 18:18
@codecov
Copy link

codecov bot commented Mar 15, 2022

Codecov Report

Merging #2870 (e9f9aa3) into main (c0f1983) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #2870   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          163       163           
  Lines        12600     12632   +32     
=========================================
+ Hits         12600     12632   +32     
Impacted Files Coverage Δ
src/sqlfluff/core/templaters/jinja.py 100.00% <100.00%> (ø)
src/sqlfluff/core/templaters/slicers/tracer.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0f1983...e9f9aa3. Read the comment docs.

@@ -22,6 +22,7 @@ types-chardet
types-appdirs
types-colorama
types-pyyaml
types-Jinja2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the rest of the PR, but I noticed we were missing this

return RawSliceInfo(unique_alternate_id, alternate_code, [])
return self.make_raw_slice_info(unique_alternate_id, alternate_code, [])

def update_inside_set_or_macro(self, block_type, trimmed_parts):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted this function from update_inside_set_or_macro().

# Exiting a set/macro block.
self.inside_set_or_macro = False

def make_raw_slice_info(self, *args, **kwargs) -> RawSliceInfo:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created this helper function to centralize/reduce the number of places we check inside_set_or_macro. This reduces the likelihood of future bugs, where we forget to check this flag.

@@ -215,9 +247,9 @@ def _slice_template(self) -> List[RawFileSlice]:
# https://jinja.palletsprojects.com/en/2.11.x/api/#jinja2.Environment.lex
stack = []
result = []
set_idx = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The set_idx variable was moved and renamed as a new instance field on JinjaTracer: inside_set_or_macro.

@barrywhart barrywhart changed the title Minor cleanup/refactor of JinjaTracer Refactor JinjaTracer: Simplify _slice_template() by splitting it up Mar 15, 2022
@barrywhart barrywhart changed the title Refactor JinjaTracer: Simplify _slice_template() by splitting it up JinjaTracer tech debt: Split _slice_template() into smaller functions Mar 15, 2022
@barrywhart barrywhart marked this pull request as ready for review March 15, 2022 21:18
@barrywhart
Copy link
Member Author

Tagging @OTooleMichael for review. I don't think I can explicitly add him as a reviewer yet (probably a GitHub group needs updating).

@tunetheweb
Copy link
Member

Tagging @OTooleMichael for review. I don't think I can explicitly add him as a reviewer yet (probably a GitHub group needs updating).

@alanmcruickshank sent you an invite couple of days back. Check your email (and spam folder) @OTooleMichael

@barrywhart barrywhart changed the title JinjaTracer tech debt: Split _slice_template() into smaller functions Refactor JinjaTracer: Split into two classes, break up long _slice_template() function Mar 16, 2022
@barrywhart barrywhart changed the title Refactor JinjaTracer: Split into two classes, break up long _slice_template() function Refactor JinjaTracer: Split into two classes, break up _slice_template() function Mar 16, 2022
@barrywhart barrywhart changed the title Refactor JinjaTracer: Split into two classes, break up _slice_template() function Refactor JinjaTracer: Split into two classes, break up "_slice_template()" function Mar 16, 2022
Copy link
Contributor

@OTooleMichael OTooleMichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM:
I read the code linearly, commented as I went and asked questions.

Comment on lines +177 to +178
re_open_tag = regex.compile(r"^\s*({[{%])[\+\-]?\s*")
re_close_tag = regex.compile(r"\s*[\+\-]?([}%]})\s*$")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these not full Regexes - ie starting with ^ and ending with $?

Also wondering if gathering the whitespace before a tag has influence on indents (I note later on that search is used vs the lexed array) perhaps some of these matches are forced by what the lexer returns?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The templated block is a whole string and not the individual components.

So {% if ... %} is all one templated string.

So we want ^{% for opening tag (simplified), and %}$ for closing tag (simplified) of that string, as @barrywhart has.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past, we received one token at a time from Jinja, then combined them into a single string (str_buff) once they became a full block, e.g. {% if a == 3 %}. Now we still build str_buff, but we also keep the parts separate in an array str_parts. The regexes remain the same for legacy reasons. I think it would work either way.

Mostly we want what the lexer returns, as the purpose of this class is to deduce exactly what Jinja is going to return for the "real" template, including tricky whitespace handling. That's been working pretty well for a while. If you look at handle_left_whitespace_stripping() and the code section (not function beginning with # Right whitespace was stripped., you'll see how some of the trickier bits work.

There is one case where Jinja returns the whole block at a time, and that's a {% raw %} ... {% endraw %} section.

Comment on lines +311 to +313
m_strip_right = regex.search(
r"\s+$", raw, regex.MULTILINE | regex.DOTALL
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happening here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure you need the DOTALL here (there is no .)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're trying to determine if a block uses right whitespace stripping, e.g.

SELECT
  {{ 'col1,' -}}
  col2

Note the -. This tells Jinja to discard any whitespace between the }} and the next bit of text. We need to account for this because it affects the template output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which goes to my next question further down.

Also still not sure we need the DOTALL here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember why these flags were added. Maybe they were necessary back when we only had the whole block string, rather than the parts? Note that in dbt, it's common to do things like:

DOTALL actually means "allow the . pattern character to match newlines, and MULTILINE` means "allow multiline matches". https://www.codespeedy.com/re-dotall-in-python/

I'm hesitant to change this now. 😅 If the string were really long, I could see it slowing things down a bit, but this string is literally just one Jinja token -- should be very short.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. It's not changed in this PR and don't think it'll matter too much one way of another. So fine leaving it.

Comment on lines +439 to +441
# Handle a tag received in one go.
trimmed_content = str_buff[len(m_open.group(0)) : -len(m_close.group(0))]
trimmed_parts = trimmed_content.split()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there need for the If and the str_parts array at all, can the Else just be applied to both scenarios? (I guess its not expensive enough to matter about the splitting computation)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking if we don't need to keep the individual lex results as separate strings?

The reason I did that is that Jinja doesn't just split on whitespace, so by keeping things as Jinja returns them, the code is more robust. Here's a test case for why:

{% set col= "col1" %}
SELECT {{ col }}

Note there's no space after col. Before switching to str_parts, we needed special handling since we might see any of:

{% set col= "col1" %}
{% set col = "col1" %}
{% set col ="col1" %}

This resulted in ugly, fragile code. Now it "just works"™️.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add the above as a code comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just meant why can the algo between lines 438 and 442 just be applied to both the if and the else.
the fn declaration says str_buff is non optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str_buff is required, but we prefer to use str_parts because it makes better use of Jinja lexing.

(Not sure I'm fully understanding, though.)

If you feel like tinkering with it, the tests in test/core/templaters/jinja_test.py are pretty thorough.

"""
# Find the token returned. Did lex() skip over any characters?
num_chars_skipped = self.raw_str.index(token, self.idx_raw) - self.idx_raw
if num_chars_skipped:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally Id prefer a guard if not num_chars_skipped: \ return

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that. Previously, I had just mechanically pasted this code after pulling it out of a larger function. 👍

Comment on lines 317 to 322
# to behave similarly as the left stripping case. Note that
# the stakes are a bit lower here, because lex() hasn't
# *omitted* any characters from the strings it returns,
# it has simply grouped them differently than we want.
trailing_chars = len(m_strip_right.group(0))
self.raw_sliced.append(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just throw this into a fn for ease of reading its nice to balance
handle_left_whitespace_stripping even if this one is easier

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at doing that a few different ways. It's definitely doable, but the function ended up taking something like 5 parameters, and I couldn't find a reasonable way to get that down. (I tried to limit the new functions to 3 or fewer parameters.) The logic here is not complex, but it depends on a lot of state, so it seemed like maybe it wasn't improving the code. Thoughts?

Copy link
Member

@tunetheweb tunetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@barrywhart barrywhart merged commit 28df654 into sqlfluff:main Mar 16, 2022
@tunetheweb tunetheweb changed the title Refactor JinjaTracer: Split into two classes, break up "_slice_template()" function Refactor JinjaTracer: Split into two classes, break up _slice_template() function Mar 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants