Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISTKEY not supported #99

Open
erikalfthan opened this issue Jan 4, 2021 · 3 comments
Open

DISTKEY not supported #99

erikalfthan opened this issue Jan 4, 2021 · 3 comments

Comments

@erikalfthan
Copy link

erikalfthan commented Jan 4, 2021

Describe the bug
One of the main changes in redshift vs postgres is how index/keys are used. More specifically, one can specify a DISTKEY and SORTKEY for a table like

CREATE TABLE myschema.mytable (
col1 INT
,col2 VARCHAR(20)
)
DISTKEY(col1)
SORTKEY(col1, col2);

This can improve performance by helping redshift to keep joins on a single node, skip partitions etc

For my tests, I don't care about the performance, but I need the production SQL to look like this, and I want my tests to test actual code, not a similar but not the same variation.

It would make this mock library more complete and should probably be handled like UNLOAD etc. NB: I am a newbie using this library, so have haven't checked how this is handled.

Environment

  • Host OS
    Using Windows, setting PYTEST_MOCK_RESOURCES_HOST=localhost (somehow my networking failed for host.docker.internal, but it could be Docker Desktop or Corporate Security)

  • Docker image if applicable
    Postgres with default params as provided by pytest version 1.4.1

  • Python Version
    Python 3.8.5 64-bit

  • Virtualenv/Pyenv etc.. if applicable
    virtualenv, installing pytest-mock-resources with postgres binaries and redshift

(Also in requirements.txt before trying pytest-mock-resources
SQLAlchemy
psycopg2
sqlalchemy-redshift
pandas

To Reproduce
Steps to reproduce the behavior:

  1. Create a test case executing sql

redshift = create_redshift_fixture()
def test_redshift_with_distkey(redshift):
with redshift.connect() as conn:
conn.execute('create table table1 ( col1 int, col2 varchar(20)) distkey(col1) sortkey(col1, col2);')

  1. Run pytest
  2. See error
    Error is logged and test is aborted
    (psycopg2.errors.SyntaxError) syntax error at or near "DISTKEY"

Expected behavior
Since DISTKEY and SORTKEY probably arent relevant in test cases, ignore / remove this part of SQL before executing in docker postgres

Actual Behavior
Described above - can't disclose much more anyways.

Additional context
Running simpler SQL works in my current setup

Seems related to #82

@DanCardin
Copy link
Contributor

If it helps, the reason we haven't run into this ourselves is because we use sqlalchemy-redshift + alembic + sqlalchemy tables/models to model our databases. Alembic (with sqlalchemy-redshift) will produce the distkey/sortkeys if you put them in sqlalchemy's table_args (on a model), but pytest-mock-resources will directly use the models to create the tables and ignores them.

If that's not an acceptable workflow, the problem with this request is that you're executing raw SQL. While I'm not sure that i.e. a CreateTable command using sqlalchemy-redshift (with distkey set) would perform any better currently, it's at least something we could likely add robust support for because it's a structured single-purpose command that the redshift dialect supports at the sqlalchemy level.

With raw SQL, my impression is that sqlalchemy just sends your text verbatim, so there's much less we can do in a 100% correct way. We could, attempt somewhat hacky regex checks through the queries to parse out distkey/sortkey and either translate or remove them, but i wouldn't expect it to be particularly robust (for example if DISTKEY resided in some string)

@erikalfthan
Copy link
Author

erikalfthan commented Jan 4, 2021

The performance is not gained on the CREATE TABLE statement, but can be huge on a big cluster if it ensures that subsequent JOINs never leaves a particular node.

My reason for trying this library was that it seems that no one are able to mock redshift :)

The project Im working on actually still uses Pentaho PDI with JDBC to drive most of the ETL, even if almost all functional code has left PDI, so unfortunately I can't use anything else than raw SQL and wanted to avoid the hacky regex solution. I didnt look into how you had implemented this before I tried to use it.

So, thanks for a quick reply, you can choose to close this as out of scope or similar. Maybe add a comment in the documentation that you dont support raw sql DISTKEY and SORTKEY where it is stated that you do support UNLOAD.

@DanCardin
Copy link
Contributor

To be clear, I'm not necessarily against limited parsing the SQL statement, especially because it could likely be scoped to i.e. CREATE TABLE statements. And fyi, this is how we support COPY/UNLOAD (through a combination of specific sqlalchemy-redshift handling and manual parsing of COPY/UNLOAD statements and altering the query at runtime).

So ultimately, any support we add for raw sql statements for this is going to be doing the "hacky regex solution" (to essentially just remove them from the query string), which is honestly probably fine for most cases.

But as you've already noticed we don't support it today. If you were so inclined, I'd be happy to direct you to the relevant locations for adding support yourself, else we could certainly add support, but it wouldn't be as quick as my issue responses 😛.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants