Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow lint of ~1600 line generated query #1784

Closed
jdub opened this issue Oct 30, 2021 · 6 comments
Closed

Slow lint of ~1600 line generated query #1784

jdub opened this issue Oct 30, 2021 · 6 comments
Labels
bug Something isn't working performance

Comments

@jdub
Copy link

jdub commented Oct 30, 2021

Expected Behaviour

Without my horrifying generated join/union:

$ time sqlfluff lint 
=== [dbt templater] Sorting Nodes...
=== [dbt templater] Compiling dbt project...
=== [dbt templater] Project Compiled.
... irrelevant lint output ...
All Finished 📜 🎉!

real    0m11.659s
user    0m11.038s
sys     0m0.571s

Observed Behaviour

With my horrifying generated join/union:

$ time sqlfluff lint 
=== [dbt templater] Sorting Nodes...
=== [dbt templater] Compiling dbt project...
=== [dbt templater] Project Compiled.
... irrelevant lint output ...
All Finished 📜 🎉!

real    7m46.565s
user    7m38.807s
sys     0m5.049s

Interesting to note: Though most of the time is spent processing prior to printing the lint messages, sqlfluff also spends quite a while doing… something… between printing lint messages and "All Finished".

Steps to Reproduce

This is a necessary evil, which we need not discuss. I have attached the dbt-compiled SQL and output of sqlfluff parse of a model that wraps this model in a CTE (thus the extra bits): stg_history__history.zip

with

unioned as (

{%- for instance in ['history01', 'history02', 'history03', 'history04'] %}

{# due to key collisions, we must join by shard first, then union the results #}
{%- for shard in range(32) %}
    select
        history.id || '_{{ instance }}_{{ shard }}' as history_id,
        history.*,
        profile.userid,
        profile.profileid
    from {{ source(instance, 'history') }} as history -- noqa: L031
    join {{ source(instance, 'profile') }} as profile on history.profileid = profile.id -- noqa: L031
        and history.`_TABLE_SUFFIX` = '{{ shard }}'
        and profile.`_TABLE_SUFFIX` = '{{ shard }}'
{% if not loop.last %}
    union all
{% endif -%}
{% endfor -%}{# shard #}

{%- if not loop.last %}
    union all
{% endif -%}
{% endfor -%}{# instance #}

)

select * from unioned

Note that the dbt source definitions for these tables include BigQuery wildcards, e.g.

sources:
  - name: history01
    database: bigquery-project
    schema: history01_public
    tables:
      - name: history
        identifier: history_*
      - name: profile
        identifier: profile_*

Version

  • Python 3.9.7
  • dbt 0.21.0
  • sqlfluff 0.7.1

Configuration

[sqlfluff]
dialect = bigquery
templater = dbt

# https://docs.sqlfluff.com/en/stable/rules.html
# L016: Line is too long
# L034: Use wildcards then simple targets before calculations and aggregates in select statements
exclude_rules = L016,L034

[sqlfluff:rules:L014]
extended_capitalisation_policy = lower
@jdub jdub added the bug Something isn't working label Oct 30, 2021
@tunetheweb
Copy link
Member

We have a fix for this about to be merged in #1744 . It would be good to know if that branch resolves your issue too.

@jdub
Copy link
Author

jdub commented Oct 30, 2021

I did a pip install git+https://github.com/WittierDinosaur/sqlfluff@performance-spike, and…

$ time sqlfluff lint
=== [dbt templater] Sorting Nodes...
=== [dbt templater] Compiling dbt project...
=== [dbt templater] Project Compiled.
... irrelevant lint output ...
All Finished 📜 🎉!

real    3m33.226s
user    3m27.087s
sys     0m3.934s

Slashed! Sure, it's still a lot more than 12 seconds, but that's an impressive improvement. 👍🏻

@tunetheweb
Copy link
Member

Indeed - nice job by @WittierDinosaur !!

Hopefully get this out soon!

@WittierDinosaur
Copy link
Contributor

@jdub Can you run it again, but with the --bench flag? It'd be useful to see the time breakdown between the parser and linter for a file of this size

@jdub
Copy link
Author

jdub commented Oct 30, 2021

Ah, didn't know about --bench! Using the #1744 branch…

… without awful model:

==== overall timings ====
Clock time:    15.03
=== templating ===
cnt:              14 sum:            0.13
min:            0.01 max:            0.03
avg:            0.01
=== lexing ===
cnt:              14 sum:            0.19
min:            0.00 max:            0.04
avg:            0.01
=== parsing ===
cnt:              14 sum:            3.77
min:            0.02 max:            1.22
avg:            0.27
=== linting ===
cnt:              14 sum:            7.60
min:            0.07 max:            1.58
avg:            0.54

… with awful model:

==== overall timings ====
Clock time:   231.27
=== templating ===
cnt:              16 sum:            0.22
min:            0.01 max:            0.07
avg:            0.01
=== lexing ===
cnt:              16 sum:            3.57
min:            0.00 max:            2.75
avg:            0.22
=== parsing ===
cnt:              16 sum:           44.15
min:            0.02 max:           20.96
avg:            2.76
=== linting ===
cnt:              16 sum:          180.90
min:            0.11 max:          106.40
avg:           11.31

@tunetheweb
Copy link
Member

The fix has been merged and will be included in the next release so closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance
Projects
None yet
Development

No branches or pull requests

3 participants