DISTKEY not supported #99

erikalfthan · 2021-01-04T15:09:43Z

Describe the bug
One of the main changes in redshift vs postgres is how index/keys are used. More specifically, one can specify a DISTKEY and SORTKEY for a table like

CREATE TABLE myschema.mytable (
col1 INT
,col2 VARCHAR(20)
)
DISTKEY(col1)
SORTKEY(col1, col2);

This can improve performance by helping redshift to keep joins on a single node, skip partitions etc

For my tests, I don't care about the performance, but I need the production SQL to look like this, and I want my tests to test actual code, not a similar but not the same variation.

It would make this mock library more complete and should probably be handled like UNLOAD etc. NB: I am a newbie using this library, so have haven't checked how this is handled.

Environment

Host OS
Using Windows, setting PYTEST_MOCK_RESOURCES_HOST=localhost (somehow my networking failed for host.docker.internal, but it could be Docker Desktop or Corporate Security)
Docker image if applicable
Postgres with default params as provided by pytest version 1.4.1
Python Version
Python 3.8.5 64-bit
Virtualenv/Pyenv etc.. if applicable
virtualenv, installing pytest-mock-resources with postgres binaries and redshift

(Also in requirements.txt before trying pytest-mock-resources
SQLAlchemy
psycopg2
sqlalchemy-redshift
pandas

To Reproduce
Steps to reproduce the behavior:

Create a test case executing sql

redshift = create_redshift_fixture()
def test_redshift_with_distkey(redshift):
with redshift.connect() as conn:
conn.execute('create table table1 ( col1 int, col2 varchar(20)) distkey(col1) sortkey(col1, col2);')

Run pytest
See error
Error is logged and test is aborted
(psycopg2.errors.SyntaxError) syntax error at or near "DISTKEY"

Expected behavior
Since DISTKEY and SORTKEY probably arent relevant in test cases, ignore / remove this part of SQL before executing in docker postgres

Actual Behavior
Described above - can't disclose much more anyways.

Additional context
Running simpler SQL works in my current setup

Seems related to #82

DanCardin · 2021-01-04T15:27:35Z

If it helps, the reason we haven't run into this ourselves is because we use sqlalchemy-redshift + alembic + sqlalchemy tables/models to model our databases. Alembic (with sqlalchemy-redshift) will produce the distkey/sortkeys if you put them in sqlalchemy's table_args (on a model), but pytest-mock-resources will directly use the models to create the tables and ignores them.

If that's not an acceptable workflow, the problem with this request is that you're executing raw SQL. While I'm not sure that i.e. a CreateTable command using sqlalchemy-redshift (with distkey set) would perform any better currently, it's at least something we could likely add robust support for because it's a structured single-purpose command that the redshift dialect supports at the sqlalchemy level.

With raw SQL, my impression is that sqlalchemy just sends your text verbatim, so there's much less we can do in a 100% correct way. We could, attempt somewhat hacky regex checks through the queries to parse out distkey/sortkey and either translate or remove them, but i wouldn't expect it to be particularly robust (for example if DISTKEY resided in some string)

erikalfthan · 2021-01-04T15:37:06Z

The performance is not gained on the CREATE TABLE statement, but can be huge on a big cluster if it ensures that subsequent JOINs never leaves a particular node.

My reason for trying this library was that it seems that no one are able to mock redshift :)

The project Im working on actually still uses Pentaho PDI with JDBC to drive most of the ETL, even if almost all functional code has left PDI, so unfortunately I can't use anything else than raw SQL and wanted to avoid the hacky regex solution. I didnt look into how you had implemented this before I tried to use it.

So, thanks for a quick reply, you can choose to close this as out of scope or similar. Maybe add a comment in the documentation that you dont support raw sql DISTKEY and SORTKEY where it is stated that you do support UNLOAD.

DanCardin · 2021-01-04T17:59:17Z

To be clear, I'm not necessarily against limited parsing the SQL statement, especially because it could likely be scoped to i.e. CREATE TABLE statements. And fyi, this is how we support COPY/UNLOAD (through a combination of specific sqlalchemy-redshift handling and manual parsing of COPY/UNLOAD statements and altering the query at runtime).

So ultimately, any support we add for raw sql statements for this is going to be doing the "hacky regex solution" (to essentially just remove them from the query string), which is honestly probably fine for most cases.

But as you've already noticed we don't support it today. If you were so inclined, I'd be happy to direct you to the relevant locations for adding support yourself, else we could certainly add support, but it wouldn't be as quick as my issue responses 😛.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DISTKEY not supported #99

DISTKEY not supported #99

erikalfthan commented Jan 4, 2021 •

edited

DanCardin commented Jan 4, 2021

erikalfthan commented Jan 4, 2021 •

edited

DanCardin commented Jan 4, 2021

DISTKEY not supported #99

DISTKEY not supported #99

Comments

erikalfthan commented Jan 4, 2021 • edited

DanCardin commented Jan 4, 2021

erikalfthan commented Jan 4, 2021 • edited

DanCardin commented Jan 4, 2021

erikalfthan commented Jan 4, 2021 •

edited

erikalfthan commented Jan 4, 2021 •

edited