Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbt templater mangles multi-byte characters after sqlfluff fix #3585

Closed
3 tasks done
gregoryroche opened this issue Jul 11, 2022 · 8 comments · Fixed by #4109
Closed
3 tasks done

dbt templater mangles multi-byte characters after sqlfluff fix #3585

gregoryroche opened this issue Jul 11, 2022 · 8 comments · Fixed by #4109
Labels
bug Something isn't working dbt Related to Data Build Tool

Comments

@gregoryroche
Copy link

gregoryroche commented Jul 11, 2022

Search before asking

  • I searched the issues and found no similar issues.

What Happened

Running sqlfluff fix with the dbt templater on a file which contains multi-byte characters sometimes causes those characters to be corrupted during fixing.

Here's a valid dbt model, saved in a file with UTF-8 text encoding: (note the Chinese characters in line 4, the line also has a trailing space)

SELECT *
FROM {{ ref('some_model') }} AS t
WHERE
    t.col_1 = '銷貨' 
    AND t.col_2 = 'Y'

Running this command, specifying utf-8 as the file encoding and jinja as the templater, formats the file correctly by removing the trailing whitespace in line 4: sqlfluff fix example.sql --encoding utf-8 --rules L001 --templater jinja

Running the same command, still with --encoding utf-8 but this time with --templater dbt, does remove the trailing whitespace but also causes the Chinese characters in the file to become mangled, as if the file had been opened with the incorrect file encoding:

SELECT *
FROM {{ ref('some_model') }} AS t
WHERE
    t.col_1 = '銷貨' -- mangled characters
    AND t.col_2 = 'Y'

Expected Behaviour

Using the dbt templater on this file should remove the trailing whitespace but not alter the appearance of multi-byte characters.

Observed Behaviour

Using the dbt templater on this file also corrupts the multi-byte characters in the line.

How to reproduce

  1. Create a bare-bones dbt project
  2. Save this snippet as an .sql file with UTF-8 encoding: (need to replace some_model with a model that actually exists or dbt compilation will fail)
SELECT *
FROM {{ ref('some_model') }} AS t
WHERE
    t.col_1 = '銷貨' 
    AND t.col_2 = 'Y'
  1. Run this command: sqlfluff fix example.sql --force --encoding utf-8 --rules L001 --dialect redshift --templater dbt
  2. Inspect the file, the Chinese characters have been corrupted to 銷貨.

Extra weirdness: I can only reproduce this issue when the FROM statement uses a Jinja template. If this line is replaced with a direct reference to a table (e.g. FROM information_schema.tables), the multi-byte characters are left untouched.

Dialect

redshift

Version

> sqlfluff --version
sqlfluff, version 1.1.0

> pip list
dbt-core                   1.1.1
dbt-extractor              0.4.1
dbt-postgres               1.1.1
dbt-redshift               1.1.0
sqlfluff-templater-dbt     1.1.0

Configuration

[sqlfluff]
templater = dbt
dialect = redshift
rules = L001, L004, L005, L006, L008, L038, L039
encoding = utf-8
indent_unit = 'space'
tab_space_size = 4

Are you willing to work on and submit a PR to address the issue?

  • Yes I am willing to submit a PR!

Code of Conduct

@gregoryroche gregoryroche added the bug Something isn't working label Jul 11, 2022
@gregoryroche gregoryroche changed the title sqlfluff fix with dbt templater mangles Chinese characters dbt templater mangles multi-byte characters after sqlfluff fix Jul 11, 2022
@gregoryroche gregoryroche changed the title dbt templater mangles multi-byte characters after sqlfluff fix dbt templater mangles multi-byte characters after sqlfluff fix Jul 11, 2022
@barrywhart
Copy link
Member

Please try explicitly specifying the encoding in .sqlfluff. The default is to guess, and it probably guessed wrong. If so, this is not a bug, it's just an incorrect guess.

@gregoryroche
Copy link
Author

gregoryroche commented Jul 11, 2022

Please try explicitly specifying the encoding in .sqlfluff. The default is to guess, and it probably guessed wrong. If so, this is not a bug, it's just an incorrect guess.

  1. I already specify the encoding in .sqlfluff (I copied the contents into the "Configuration" section above)
  2. This issue still occurs even if the file encoding is manually specified in the sqlfluff command, as I've used above:
    sqlfluff fix example.sql --force ***--encoding utf-8*** --rules L001 --dialect redshift --templater dbt. I've edited the main body of the original issue to make this more explicitly clear.

@barrywhart barrywhart added awaiting feedback Cannot continue investigating until more information is provided. and removed awaiting feedback Cannot continue investigating until more information is provided. labels Jul 11, 2022
@barrywhart
Copy link
Member

Thanks! We'll take a look. It's surprising this would happen, since the dbt templater inherits from the jinja templater. There must be something subtle happening here.

@gregoryroche
Copy link
Author

gregoryroche commented Jul 11, 2022

Thanks Barry, I've been looking into this further and I think it might be more to do with problems with Powershell/cmd than sqlfluff.

First I tried rerunning the same sqlfluff fix command as above, but with -vvvvv. This command was run in cmd under Windows 10. This command outputted so much text that the useful bits fell out of the buffer.

Then I tried running the same command but appending > output.txt to catch all the output. cmd still spammed a load of output to the console, but this time it was under a ===Logging error=== heading with messages complaining about the logger being unable to decode special characters. The sql file remained unchanged.

I then tried running the same command in Powershell, this also spammed a load of similar logging errors, and also didn't actually make any changes to the sql file.

This seemed super weird to me, as if the logging module was using some different encoding setting than the rest of the project. This made me suspect that my command line tools were defaulting to some encoding other than UTF-8 (I'm in Germany and Windows installations here default to code page 1252). So I followed the instructions at the top of this stack overflow post to set my command line tools to UTF-8 by default, rebooted, ran the same sqlfluff fix command as before in Powershell, and now the file's special characters are left unchanged, for both the jinja and dbt templater.

I'm still waiting for confirmation from my colleague that this fix works for him too.

@barrywhart
Copy link
Member

Character encoding and time zones are the dark underbelly of software. Argh.

@gregoryroche
Copy link
Author

gregoryroche commented Jul 12, 2022

Happy to confirm that this fix worked for my colleague too. It's still unclear to me why (before the fix) the behaviour would differ between the jinja and dbt templaters, perhaps there is still some underlying issue there to be fixed on sqlfluff's side, but for now we have a workaround that doesn't corrupt our files.

Perhaps it would be worth mentioning this behaviour somewhere in the documentation for the tool, to potentially save others some hassle in the future?

@barrywhart
Copy link
Member

What specific behavior do you mean? Feel free to create a PR, or update this issue and perhaps someone else will do it.

About dbt vs jinja, one possible cause is that with jinja, SQLFluff itself is reading and decoding the file, while with the dbt templater, dbt itself is doing that. They may not handle decoding the same way. It's possible, also, that they have their own "encoding" config option.

@tunetheweb tunetheweb added the dbt Related to Data Build Tool label Jul 13, 2022
@barrywhart barrywhart reopened this Nov 3, 2022
@barrywhart
Copy link
Member

Reopening this issue, as I may have been mistaken about dbt reading the file on behalf of SQLFluff. I think this may be an issue with the dbt templater, perhaps using its own naive code for reading the file rather than using the specified codec, autodetect, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dbt Related to Data Build Tool
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants