Slow lint of ~1600 line generated query #1784

jdub · 2021-10-30T08:22:21Z

Expected Behaviour

Without my horrifying generated join/union:

$ time sqlfluff lint 
=== [dbt templater] Sorting Nodes...
=== [dbt templater] Compiling dbt project...
=== [dbt templater] Project Compiled.
... irrelevant lint output ...
All Finished 📜 🎉!

real    0m11.659s
user    0m11.038s
sys     0m0.571s

Observed Behaviour

With my horrifying generated join/union:

$ time sqlfluff lint 
=== [dbt templater] Sorting Nodes...
=== [dbt templater] Compiling dbt project...
=== [dbt templater] Project Compiled.
... irrelevant lint output ...
All Finished 📜 🎉!

real    7m46.565s
user    7m38.807s
sys     0m5.049s

Interesting to note: Though most of the time is spent processing prior to printing the lint messages, sqlfluff also spends quite a while doing… something… between printing lint messages and "All Finished".

Steps to Reproduce

This is a necessary evil, which we need not discuss. I have attached the dbt-compiled SQL and output of sqlfluff parse of a model that wraps this model in a CTE (thus the extra bits): stg_history__history.zip

with

unioned as (

{%- for instance in ['history01', 'history02', 'history03', 'history04'] %}

{# due to key collisions, we must join by shard first, then union the results #}
{%- for shard in range(32) %}
    select
        history.id || '_{{ instance }}_{{ shard }}' as history_id,
        history.*,
        profile.userid,
        profile.profileid
    from {{ source(instance, 'history') }} as history -- noqa: L031
    join {{ source(instance, 'profile') }} as profile on history.profileid = profile.id -- noqa: L031
        and history.`_TABLE_SUFFIX` = '{{ shard }}'
        and profile.`_TABLE_SUFFIX` = '{{ shard }}'
{% if not loop.last %}
    union all
{% endif -%}
{% endfor -%}{# shard #}

{%- if not loop.last %}
    union all
{% endif -%}
{% endfor -%}{# instance #}

)

select * from unioned

Note that the dbt source definitions for these tables include BigQuery wildcards, e.g.

sources:
  - name: history01
    database: bigquery-project
    schema: history01_public
    tables:
      - name: history
        identifier: history_*
      - name: profile
        identifier: profile_*

Version

Python 3.9.7
dbt 0.21.0
sqlfluff 0.7.1

Configuration

[sqlfluff]
dialect = bigquery
templater = dbt

# https://docs.sqlfluff.com/en/stable/rules.html
# L016: Line is too long
# L034: Use wildcards then simple targets before calculations and aggregates in select statements
exclude_rules = L016,L034

[sqlfluff:rules:L014]
extended_capitalisation_policy = lower

The text was updated successfully, but these errors were encountered:

tunetheweb · 2021-10-30T09:32:05Z

We have a fix for this about to be merged in #1744 . It would be good to know if that branch resolves your issue too.

jdub · 2021-10-30T09:50:39Z

I did a pip install git+https://github.com/WittierDinosaur/sqlfluff@performance-spike, and…

$ time sqlfluff lint
=== [dbt templater] Sorting Nodes...
=== [dbt templater] Compiling dbt project...
=== [dbt templater] Project Compiled.
... irrelevant lint output ...
All Finished 📜 🎉!

real    3m33.226s
user    3m27.087s
sys     0m3.934s

Slashed! Sure, it's still a lot more than 12 seconds, but that's an impressive improvement. 👍🏻

tunetheweb · 2021-10-30T09:53:20Z

Indeed - nice job by @WittierDinosaur !!

Hopefully get this out soon!

WittierDinosaur · 2021-10-30T11:20:01Z

@jdub Can you run it again, but with the --bench flag? It'd be useful to see the time breakdown between the parser and linter for a file of this size

jdub · 2021-10-30T15:42:38Z

Ah, didn't know about --bench! Using the #1744 branch…

… without awful model:

==== overall timings ====
Clock time:    15.03
=== templating ===
cnt:              14 sum:            0.13
min:            0.01 max:            0.03
avg:            0.01
=== lexing ===
cnt:              14 sum:            0.19
min:            0.00 max:            0.04
avg:            0.01
=== parsing ===
cnt:              14 sum:            3.77
min:            0.02 max:            1.22
avg:            0.27
=== linting ===
cnt:              14 sum:            7.60
min:            0.07 max:            1.58
avg:            0.54

… with awful model:

==== overall timings ====
Clock time:   231.27
=== templating ===
cnt:              16 sum:            0.22
min:            0.01 max:            0.07
avg:            0.01
=== lexing ===
cnt:              16 sum:            3.57
min:            0.00 max:            2.75
avg:            0.22
=== parsing ===
cnt:              16 sum:           44.15
min:            0.02 max:           20.96
avg:            2.76
=== linting ===
cnt:              16 sum:          180.90
min:            0.11 max:          106.40
avg:           11.31

tunetheweb · 2021-10-31T13:37:48Z

The fix has been merged and will be included in the next release so closing this issue.

jdub added the bug Something isn't working label Oct 30, 2021

tunetheweb added the performance label Oct 30, 2021

tunetheweb closed this as completed Oct 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow lint of ~1600 line generated query #1784

Slow lint of ~1600 line generated query #1784

jdub commented Oct 30, 2021

tunetheweb commented Oct 30, 2021

jdub commented Oct 30, 2021

tunetheweb commented Oct 30, 2021

WittierDinosaur commented Oct 30, 2021

jdub commented Oct 30, 2021

tunetheweb commented Oct 31, 2021

Slow lint of ~1600 line generated query #1784

Slow lint of ~1600 line generated query #1784

Comments

jdub commented Oct 30, 2021

Expected Behaviour

Observed Behaviour

Steps to Reproduce

Version

Configuration

tunetheweb commented Oct 30, 2021

jdub commented Oct 30, 2021

tunetheweb commented Oct 30, 2021

WittierDinosaur commented Oct 30, 2021

jdub commented Oct 30, 2021

tunetheweb commented Oct 31, 2021