dbt templater mangles multi-byte characters after `sqlfluff fix` #3585

gregoryroche · 2022-07-11T09:34:28Z

Search before asking

I searched the issues and found no similar issues.

What Happened

Running sqlfluff fix with the dbt templater on a file which contains multi-byte characters sometimes causes those characters to be corrupted during fixing.

Here's a valid dbt model, saved in a file with UTF-8 text encoding: (note the Chinese characters in line 4, the line also has a trailing space)

SELECT *
FROM {{ ref('some_model') }} AS t
WHERE
    t.col_1 = '銷貨' 
    AND t.col_2 = 'Y'

Running this command, specifying utf-8 as the file encoding and jinja as the templater, formats the file correctly by removing the trailing whitespace in line 4: sqlfluff fix example.sql --encoding utf-8 --rules L001 --templater jinja

Running the same command, still with --encoding utf-8 but this time with --templater dbt, does remove the trailing whitespace but also causes the Chinese characters in the file to become mangled, as if the file had been opened with the incorrect file encoding:

SELECT *
FROM {{ ref('some_model') }} AS t
WHERE
    t.col_1 = 'éŠ·è²¨' -- mangled characters
    AND t.col_2 = 'Y'

Expected Behaviour

Using the dbt templater on this file should remove the trailing whitespace but not alter the appearance of multi-byte characters.

Observed Behaviour

Using the dbt templater on this file also corrupts the multi-byte characters in the line.

How to reproduce

Create a bare-bones dbt project
Save this snippet as an .sql file with UTF-8 encoding: (need to replace some_model with a model that actually exists or dbt compilation will fail)

SELECT *
FROM {{ ref('some_model') }} AS t
WHERE
    t.col_1 = '銷貨' 
    AND t.col_2 = 'Y'

Run this command: sqlfluff fix example.sql --force --encoding utf-8 --rules L001 --dialect redshift --templater dbt
Inspect the file, the Chinese characters have been corrupted to éŠ·è²¨.

Extra weirdness: I can only reproduce this issue when the FROM statement uses a Jinja template. If this line is replaced with a direct reference to a table (e.g. FROM information_schema.tables), the multi-byte characters are left untouched.

Dialect

redshift

Version

> sqlfluff --version
sqlfluff, version 1.1.0

> pip list
dbt-core                   1.1.1
dbt-extractor              0.4.1
dbt-postgres               1.1.1
dbt-redshift               1.1.0
sqlfluff-templater-dbt     1.1.0

Configuration

[sqlfluff]
templater = dbt
dialect = redshift
rules = L001, L004, L005, L006, L008, L038, L039
encoding = utf-8
indent_unit = 'space'
tab_space_size = 4

Are you willing to work on and submit a PR to address the issue?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

barrywhart · 2022-07-11T12:53:06Z

Please try explicitly specifying the encoding in .sqlfluff. The default is to guess, and it probably guessed wrong. If so, this is not a bug, it's just an incorrect guess.

gregoryroche · 2022-07-11T12:56:52Z

Please try explicitly specifying the encoding in .sqlfluff. The default is to guess, and it probably guessed wrong. If so, this is not a bug, it's just an incorrect guess.

I already specify the encoding in .sqlfluff (I copied the contents into the "Configuration" section above)
This issue still occurs even if the file encoding is manually specified in the sqlfluff command, as I've used above:
sqlfluff fix example.sql --force ***--encoding utf-8*** --rules L001 --dialect redshift --templater dbt. I've edited the main body of the original issue to make this more explicitly clear.

barrywhart · 2022-07-11T14:08:44Z

Thanks! We'll take a look. It's surprising this would happen, since the dbt templater inherits from the jinja templater. There must be something subtle happening here.

gregoryroche · 2022-07-11T14:47:33Z

Thanks Barry, I've been looking into this further and I think it might be more to do with problems with Powershell/cmd than sqlfluff.

First I tried rerunning the same sqlfluff fix command as above, but with -vvvvv. This command was run in cmd under Windows 10. This command outputted so much text that the useful bits fell out of the buffer.

Then I tried running the same command but appending > output.txt to catch all the output. cmd still spammed a load of output to the console, but this time it was under a ===Logging error=== heading with messages complaining about the logger being unable to decode special characters. The sql file remained unchanged.

I then tried running the same command in Powershell, this also spammed a load of similar logging errors, and also didn't actually make any changes to the sql file.

This seemed super weird to me, as if the logging module was using some different encoding setting than the rest of the project. This made me suspect that my command line tools were defaulting to some encoding other than UTF-8 (I'm in Germany and Windows installations here default to code page 1252). So I followed the instructions at the top of this stack overflow post to set my command line tools to UTF-8 by default, rebooted, ran the same sqlfluff fix command as before in Powershell, and now the file's special characters are left unchanged, for both the jinja and dbt templater.

I'm still waiting for confirmation from my colleague that this fix works for him too.

barrywhart · 2022-07-11T16:25:22Z

Character encoding and time zones are the dark underbelly of software. Argh.

gregoryroche · 2022-07-12T07:47:04Z

Happy to confirm that this fix worked for my colleague too. It's still unclear to me why (before the fix) the behaviour would differ between the jinja and dbt templaters, perhaps there is still some underlying issue there to be fixed on sqlfluff's side, but for now we have a workaround that doesn't corrupt our files.

Perhaps it would be worth mentioning this behaviour somewhere in the documentation for the tool, to potentially save others some hassle in the future?

barrywhart · 2022-07-12T20:41:20Z

What specific behavior do you mean? Feel free to create a PR, or update this issue and perhaps someone else will do it.

About dbt vs jinja, one possible cause is that with jinja, SQLFluff itself is reading and decoding the file, while with the dbt templater, dbt itself is doing that. They may not handle decoding the same way. It's possible, also, that they have their own "encoding" config option.

barrywhart · 2022-11-03T17:00:05Z

Reopening this issue, as I may have been mistaken about dbt reading the file on behalf of SQLFluff. I think this may be an issue with the dbt templater, perhaps using its own naive code for reading the file rather than using the specified codec, autodetect, etc.

gregoryroche added the bug Something isn't working label Jul 11, 2022

gregoryroche changed the title ~~sqlfluff fix with dbt templater mangles Chinese characters~~ dbt templater mangles multi-byte characters after sqlfluff fix Jul 11, 2022

gregoryroche changed the title ~~dbt templater mangles multi-byte characters after sqlfluff fix~~ dbt templater mangles multi-byte characters after sqlfluff fix Jul 11, 2022

barrywhart added awaiting feedback Cannot continue investigating until more information is provided. and removed awaiting feedback Cannot continue investigating until more information is provided. labels Jul 11, 2022

tunetheweb added the dbt Related to Data Build Tool label Jul 13, 2022

barrywhart closed this as completed Jul 19, 2022

barrywhart mentioned this issue Aug 11, 2022

Special characters get encoded wrong with dbt templater #3729

Open

3 tasks

barrywhart reopened this Nov 3, 2022

barrywhart mentioned this issue Nov 26, 2022

Bug fix: dbt templater ignores .sqlfluff file encoding on Windows #4109

Merged

1 task

barrywhart closed this as completed in #4109 Nov 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dbt templater mangles multi-byte characters after `sqlfluff fix` #3585

dbt templater mangles multi-byte characters after `sqlfluff fix` #3585

gregoryroche commented Jul 11, 2022 •

edited

barrywhart commented Jul 11, 2022

gregoryroche commented Jul 11, 2022 •

edited

barrywhart commented Jul 11, 2022

gregoryroche commented Jul 11, 2022 •

edited

barrywhart commented Jul 11, 2022

gregoryroche commented Jul 12, 2022 •

edited

barrywhart commented Jul 12, 2022

barrywhart commented Nov 3, 2022

dbt templater mangles multi-byte characters after sqlfluff fix #3585

dbt templater mangles multi-byte characters after sqlfluff fix #3585

Comments

gregoryroche commented Jul 11, 2022 • edited

Search before asking

What Happened

Expected Behaviour

Observed Behaviour

How to reproduce

Dialect

Version

Configuration

Are you willing to work on and submit a PR to address the issue?

Code of Conduct

barrywhart commented Jul 11, 2022

gregoryroche commented Jul 11, 2022 • edited

barrywhart commented Jul 11, 2022

gregoryroche commented Jul 11, 2022 • edited

barrywhart commented Jul 11, 2022

gregoryroche commented Jul 12, 2022 • edited

barrywhart commented Jul 12, 2022

barrywhart commented Nov 3, 2022

dbt templater mangles multi-byte characters after `sqlfluff fix` #3585

dbt templater mangles multi-byte characters after `sqlfluff fix` #3585

gregoryroche commented Jul 11, 2022 •

edited

gregoryroche commented Jul 11, 2022 •

edited

gregoryroche commented Jul 11, 2022 •

edited

gregoryroche commented Jul 12, 2022 •

edited