dbt templater mangles multi-byte characters after sqlfluff fix
#3585
Comments
Please try explicitly specifying the encoding in `.sqlfluff`. The default is to guess, and it probably guessed wrong. If so, this is not a bug; it's just an incorrect guess.
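For readers unfamiliar with what a wrong encoding guess looks like, here is a minimal Python sketch. It is an illustration only, not SQLFluff's detection code: it takes the two characters quoted in this issue (`銷貨`), encodes them as UTF-8, and decodes the bytes as Windows code page 1252, the kind of mismatch that a wrong guess would produce.

```python
# Illustration only: this is not SQLFluff's detection logic.
ORIGINAL = "銷貨"  # the multi-byte characters quoted in this issue


def decode_with_wrong_codec(text: str, wrong: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with a different codec."""
    return text.encode("utf-8").decode(wrong)


def undo_wrong_decode(mangled: str, wrong: str = "cp1252") -> str:
    """Reverse the mistake; this only works while no bytes have been lost."""
    return mangled.encode(wrong).decode("utf-8")


if __name__ == "__main__":
    mangled = decode_with_wrong_codec(ORIGINAL)
    print(mangled)                     # éŠ·è²¨
    print(undo_wrong_decode(mangled))  # 銷貨
```

Each three-byte UTF-8 sequence turns into three unrelated Latin characters, which matches the "opened with the incorrect file encoding" look described in the report.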
Thanks! We'll take a look. It's surprising this would happen, since the dbt templater inherits from the jinja templater. There must be something subtle happening here.
Thanks Barry, I've been looking into this further and I think it might have more to do with problems with PowerShell/cmd than with sqlfluff.

First I tried rerunning the same […]

Then I tried running the same command but appending […]

I then tried running the same command in PowerShell. This also spammed a load of similar logging errors and didn't actually make any changes to the SQL file. This seemed super weird to me, as if the logging module was using a different encoding setting than the rest of the project.

This made me suspect that my command line tools were defaulting to some encoding other than UTF-8 (I'm in Germany, and Windows installations here default to code page 1252). So I followed the instructions at the top of this Stack Overflow post to set my command line tools to UTF-8 by default, rebooted, ran the same […]

I'm still waiting for confirmation from my colleague that this fix works for him too.
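A quick way to check the suspicion described above is to ask Python which encodings it picked up from the console and locale. The sketch below uses only standard-library calls. The helper name `describe_encodings` is made up for this example and is not part of SQLFluff or dbt.

```python
import locale
import sys


def describe_encodings() -> dict:
    """Collect the encodings Python inferred from the current environment."""
    return {
        # Default used by open() when no encoding argument is passed.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing to the console (None if unavailable).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # True when Python's UTF-8 mode is switched on.
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

On a Windows machine set to a Western European locale, `preferred` would be expected to report `cp1252` unless UTF-8 has been enabled system-wide or through Python's UTF-8 mode.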
Character encoding and time zones are the dark underbelly of software. Argh.
Happy to confirm that this fix worked for my colleague too. It's still unclear to me why (before the fix) the behaviour would differ between the […]

Perhaps it would be worth mentioning this behaviour somewhere in the documentation for the tool, to potentially save others some hassle in the future?
What specific behavior do you mean? Feel free to create a PR, or update this issue and perhaps someone else will do it.

About dbt vs jinja, one possible cause is that with jinja, SQLFluff itself is reading and decoding the file, while with the dbt templater, dbt itself is doing that. They may not handle decoding the same way. It's possible, also, that they have their own "encoding" config option.
Reopening this issue, as I may have been mistaken about dbt reading the file on behalf of SQLFluff. I think this may be an issue with the dbt templater, perhaps using its own naive code for reading the file rather than using the specified codec, autodetect, etc.
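To make the "naive read" hypothesis concrete, here is a small Python sketch. It is not the dbt templater's actual code. The file contents, the helper `read_model`, and the choice of `cp1252` as the stand-in platform default are all assumptions made for illustration. It writes a UTF-8 file and reads it twice, once with the configured codec and once with a codec that a platform default might supply.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for a model file; the real model from the report is not reproduced here.
MODEL_TEXT = "SELECT 1 AS col -- 銷貨 \n"


def read_model(path: Path, encoding: str) -> str:
    """Read a file with an explicitly chosen codec."""
    return path.read_text(encoding=encoding)


def demo() -> tuple:
    """Return the file contents as read with the right and the wrong codec."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.sql"
        path.write_bytes(MODEL_TEXT.encode("utf-8"))
        configured = read_model(path, "utf-8")          # honours --encoding utf-8
        platform_default = read_model(path, "cp1252")   # simulates a Windows default
    return configured, platform_default


if __name__ == "__main__":
    good, bad = demo()
    print(repr(good))
    print(repr(bad))
```

If a code path ignored the configured encoding and fell back to the platform default, writing the second string back to disk would produce the kind of corruption reported here.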
What Happened
Running `sqlfluff fix` with the `dbt` templater on a file which contains multi-byte characters sometimes causes those characters to be corrupted during fixing.

Here's a valid dbt model, saved in a file with UTF-8 text encoding (note the Chinese characters in line 4; the line also has a trailing space):
Running this command, specifying `utf-8` as the file encoding and `jinja` as the templater, formats the file correctly by removing the trailing whitespace in line 4:

    sqlfluff fix example.sql --encoding utf-8 --rules L001 --templater jinja
Running the same command, still with `--encoding utf-8` but this time with `--templater dbt`, does remove the trailing whitespace but also causes the Chinese characters in the file to become mangled, as if the file had been opened with the incorrect file encoding:

Expected Behaviour
Using the `dbt` templater on this file should remove the trailing whitespace but not alter the appearance of multi-byte characters.

Observed Behaviour
Using the `dbt` templater on this file also corrupts the multi-byte characters in the line.

How to reproduce
[…] (replace `some_model` with a model that actually exists, or dbt compilation will fail)

    sqlfluff fix example.sql --force --encoding utf-8 --rules L001 --dialect redshift --templater dbt

[…] `銷貨`.

Extra weirdness: I can only reproduce this issue when the `FROM` statement uses a Jinja template. If this line is replaced with a direct reference to a table (e.g. `FROM information_schema.tables`), the multi-byte characters are left untouched.

Dialect
redshift
Version
Configuration