Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add format and compression to COPY command #21

Merged
merged 1 commit into from Aug 27, 2015

Conversation

@cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Aug 24, 2015

replaces #1

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 24, 2015

@graingert when you get a chance can you close #1? thanks!

what's needed to merge this PR?

  1. use sa.text.
  2. add a test

anything else?

@graingert
graingert reviewed Aug 24, 2015
View changes
redshift_sqlalchemy/dialect.py Outdated
@@ -360,12 +363,13 @@ def visit_copy_command(element, compiler, **kw):
return """
COPY %(schema_name)s.%(table_name)s FROM '%(data_location)s'
CREDENTIALS 'aws_access_key_id=%(access_key)s;aws_secret_access_key=%(secret_key)s%(session_token)s'
CSV
%(format)s

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

can we do this without string formatting?

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

The quoted values should definitely be implemented as SQL parameters

@graingert
graingert reviewed Aug 24, 2015
View changes
redshift_sqlalchemy/dialect.py Outdated
@@ -350,7 +352,8 @@ def __init__(self, schema_name, table_name, data_location, access_key, secret_ke
self.access_key = access_key
self.secret_key = secret_key
self.session_token = session_token
self.options = options
self.options = options or {}
self.format = format

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

format allows SQL injection.

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

Maybe an interface like:

copy = CopyCommand('schema1', 't1', 's3://mybucket/data/listing/', 'cookies', 'cookies', format=CopyCommand.CSV)

With some sort of

if format not in [self.CSV, self.JSON]:
    raise SomeException()

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

i'm not sure what you mean ... how is this different from any other argument to the function?

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

well they all do don't they.

I guess CopyCommand should probably take a Table object instead.

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

oh i see, nevermind

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 24, 2015

needs a test, and needs some review for SQL injection possibilities

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 24, 2015

I know it's not your code, but I think this PR is a good excuse to clean this function up a bit.

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 24, 2015

@graingert how do you feel about adding enum34 as a dependency so that we can have something like this:

import enum


class Format(enum.Enum):
    CSV = 0
    JSON = 1


class Compression(enum.Enum):
    GZIP = 0
    LZOP = 1
    # more as needed
@graingert
Copy link
Collaborator

@graingert graingert commented Aug 24, 2015

Could do something like assert format in ['JSON', 'CSV'] rather than an enum

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 24, 2015

It doesn't appear that there's a way to use bindparams with the session_token parameter:

In [1]: sa.text(':secret_key:session_token').bindparams(secret_key='a', session_token='b')
ArgumentError: This text() construct doesn't define a bound parameter named 'secret_key'
@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 24, 2015

The format for credentials is:

aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>[;token=<temporary-session-token>]
@graingert
Copy link
Collaborator

@graingert graingert commented Aug 24, 2015

I think the credentials string will need to be formatted in python and bound as one single SQL parameter

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 24, 2015

fine by me

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 24, 2015

looks nice, could do with some tests though. Don't worry about uploading files to S3 to the real redshift instance for now. Just some simple compiles with verification of the text produced:

You can use http://docs.sqlalchemy.org/en/latest/faq/sqlexpressions.html#how-do-i-render-sql-expressions-as-strings-possibly-with-bound-parameters-inlined to do the compile including bound params.

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 24, 2015

i added test_format, test_compression, test_invalid_format and test_invalid_compression. What else do you want me to add?

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 24, 2015

Or rather, what else do you want me to test?

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 24, 2015

Ah sorry I didn't spot the tests.

@graingert
graingert reviewed Aug 24, 2015
View changes
redshift_sqlalchemy/dialect.py Outdated
class CopyCommand(Executable, ClauseElement):
''' Prepares a RedShift COPY statement
''' Prepares a Redshift COPY statement.

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

I prefer pep257 dosctrings:

"""
Prepares a Redshift COPY statement.

...
"""
@graingert
graingert reviewed Aug 24, 2015
View changes
redshift_sqlalchemy/dialect.py Outdated
creds_rx = re.compile(r"""
^aws_access_key_id=[A-Z0-9]{20};
aws_secret_access_key=[A-Za-z0-9/+=]{40}
(?:;token=[A-Za-z0-9/+=]+)?$ # TODO: regex for token

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

This is TO-DOne right?

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

I'm not sure. I couldn't find anything official or otherwise about a regex matching session tokens. I looked at a few and this matched the ones I looked at.

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

Can you change the comment from the TODO to one that discusses your findings instead?

@graingert
graingert reviewed Aug 24, 2015
View changes
tests/test_alembic_dialect.py Outdated
@@ -1,3 +1,6 @@
import pytest

pytest.importorskip('alembic')

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

maybe a good idea in another PR?

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

without this i can't run the tests

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

you should run the tests with tox -e py27 or tox -e py34 etc. It will automatically install the requirements for you

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

i don't want to use virtualenv i'm using conda

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

i don't think alembic should be required to run the tests as it isn't required to use the dialect

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

unless tox now works with conda ... last time i tried things exploded

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

can you move this import change to another PR?

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

sure

@graingert
graingert reviewed Aug 24, 2015
View changes
tests/test_copy_command.py Outdated
@pytest.fixture
def creds():
s = 'aws_access_key_id={0};aws_secret_access_key={1}'
return s.format(''.join(choice(ascii_uppercase + digits)

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

can you hard-code this instead?

@graingert
graingert reviewed Aug 24, 2015
View changes
tests/test_copy_command.py Outdated
import re

from random import choice
from string import ascii_uppercase, ascii_letters, digits

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

I generally prefer:

import string

string.ascii_uppercase
@graingert
graingert reviewed Aug 24, 2015
View changes
tests/test_copy_command.py Outdated

@pytest.fixture
def tbl():
return sa.Table('t1', sa.MetaData(), schema='schema1')

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

This isn't mutated so it's fine as a global variable

@graingert
graingert reviewed Aug 24, 2015
View changes
tests/test_copy_command.py Outdated
t = sa.Table('t1', sa.MetaData(), schema='schema1')
with pytest.raises(ValueError):
CopyCommand(t, 's3://bucket', creds,
format=';drop table bobby_tables;')

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

:D

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 24, 2015

This changes the interfaces (to something I'd prefer), so we'll have to get some consensus with other users of this code.

Another possibility is to rename this implementation as NewCopyCommand (name TBD) and adapt the old CopyCommand interface to call this NewCopyCommand

@graingert
graingert reviewed Aug 24, 2015
View changes
redshift_sqlalchemy/dialect.py Outdated
self.credentials = credentials
self.delimiter = delimiter
self.ignore_header = ignore_header
self.null = null

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

If I do a ctrl+F "self.null" on this page I don't find anything

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

that's because it's used in the visit_copy_command as element.null

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

😳 ah of course.

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 24, 2015

@jklukas @thisfred @bouk: Thoughts?

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 24, 2015

I had no idea so many people were developing this library! Awesome.

'aws_access_key_id=<access-key-id>;'
'aws_secret_access_key=<secret-access-key>'
'[;token=<temporary-session-token>]\ngot %r' %
credentials)

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

why did you change the way the credentials are passed in? This seems way more brittle

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

this isn't brittle, this simply fails fast with invalid data.

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

I'd probably prefer taking the params as arguments, doing the format then doing the regex validation.

Eg: (Don't quote me on this)

def __init__(self, ..., access_key_id, secret_access_key, token=None):
    credentials = 'aws_access_key_id={access_key};aws_secret_access_key={secret_access_key}'.format(
        access_key=access_key_id,
        secret_access_key=secret_access_key,
    )
    if token is not None:
        credentials += ';token=' + token
    if not creds_rx.match(credentials):
        ...fail...

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

But why not just have them as seperate arguments like before? That's definitely more robust instead of this regex stuff

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

fine by me to have this as separate arguments

@bouk
bouk reviewed Aug 24, 2015
View changes
redshift_sqlalchemy/dialect.py Outdated

if compression is not None and compression not in ['GZIP', 'LZOP']:
raise ValueError('"compression" parameter must be one of %s' %
['GZIP', 'LZOP'])

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

these should be constants on the class (same for formats)

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

why?

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

because they're constants and you're repeating them

@bouk
bouk reviewed Aug 24, 2015
View changes
redshift_sqlalchemy/dialect.py Outdated
type_=_UnquotedParameter)
),
literal_binds=True
)

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

What's the benefit of doing it this way? Seems like an abuse of an sqlalchemy feature, sqlalchemy doesn't do it this way internally.

I think keeping the normal string interpolation is better

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

The type_=sa.Integer|sa.String values should definitely parametrized.

For the other values I'm fine with .format() as long as they've been validated against a regex or constant

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

I'm heavily -1 on mixing approaches

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

@bouk What is being abused here?

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

The bindparams feature, sqlalchemy never does it this way internally and just does string interpolation

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

How so? All of the bind params are either checked for proper values or correctly quoted. Can you give an example with this PR of an injection that's still possible?

This comment has been minimized.

@graingert

graingert Aug 24, 2015
Collaborator

I mean to say using .format has the same security considerations as
bindparams when you disable quoting.
On 24 Aug 2015 23:24, "Phillip Cloud" notifications@github.com wrote:

In redshift_sqlalchemy/dialect.py
#21 (comment)
:

  •                     type_=_UnquotedParameter),
    
  •        sa.bindparam('empty_as_null',
    
  •                     value='EMPTYASNULL' if element.empty_as_null else '',
    
  •                     type_=_UnquotedParameter),
    
  •        sa.bindparam('blanks_as_null',
    
  •                     value='BLANKSASNULL' if element.blanks_as_null else '',
    
  •                     type_=_UnquotedParameter),
    
  •        sa.bindparam('format',
    
  •                     value=element.format,
    
  •                     type_=_UnquotedParameter),
    
  •        sa.bindparam('compression',
    
  •                     value=element.compression,
    
  •                     type_=_UnquotedParameter)
    
  •    ),
    
  •    literal_binds=True
    
  • )

How so? All of the bind params are either checked for proper values or
correctly quoted. Can you give an example with this PR of an injection
that's still possible?


Reply to this email directly or view it on GitHub
https://github.com/graingert/redshift_sqlalchemy/pull/21/files#r37812363
.

This comment has been minimized.

@cpcloud

cpcloud Aug 25, 2015
Author Contributor

Ok, so what would you guys like me to do?

This comment has been minimized.

@cpcloud

cpcloud Aug 25, 2015
Author Contributor

Happy to fully revert this and go the route of format if that is what's desired

This comment has been minimized.

@graingert

graingert Aug 25, 2015
Collaborator

Don't fully revert it! I think .format is better for baked in query text but make sure the type_=sa.Integer|sa.String params are still bindparams

@bouk
bouk reviewed Aug 24, 2015
View changes
redshift_sqlalchemy/dialect.py Outdated
return compiler.process(
sa.text(qs).bindparams(
sa.bindparam('schema',
value=element.table.schema or 'public',

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

you shouldn't assume 'public' here, it needs to run against the currently selected schema. It should just be left out if not specified

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

Is the currently selected schema not element.table.schema?

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

That's the schema that the table is in, the connection has a seperate schema that's currently selected

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

Or are you referring to the search_path

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

Is there a way to get the currently selected schema?

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

But that wasn't my question.

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

It's actually much more difficult to reconstruct the fully qualified table name than it is to just plop the current schema in there, if there's a way to do it

This comment has been minimized.

@cpcloud

cpcloud Aug 24, 2015
Author Contributor

actually, if the search path contains multiple schemata then it wouldn't work

This comment has been minimized.

@bouk

bouk Aug 24, 2015
Contributor

You don't need to reconstruct it, just don't specify a schema

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 26, 2015

Is there a place that we can chat? I feel like this is moving slower than it could.

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 26, 2015

sqla irc?

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 26, 2015

Yup my contact details are here https://graingert.co.uk/foaf.rdf
On 26 Aug 2015 21:19, "Phillip Cloud" notifications@github.com wrote:

sqla irc?


Reply to this email directly or view it on GitHub
#21 (comment)
.

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 26, 2015

i'm in the sqla irc

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 27, 2015

I still don't see why setting compiler.dialect._backslash_escapes to False and then resetting it back to the default for the dialect isn't the right solution here. No matter what the approach is (overriding compiler.render_literal_value, or a custom type), it will involve turning off _backslash_escapes and then turning it back to the default.

Can we please get this merged today. This is dragging on far too long.

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 27, 2015

it will involve turning off _backslash_escapes and then turning it back to the default.

I will not accept anything that breaks the thread-safety of the query compiler. Let's just use .format() here

@bouk
Copy link
Contributor

@bouk bouk commented Aug 27, 2015

Also, setting private variables is a no-no

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 27, 2015

Let's just use .format() here.

For everything, or just for null?

Also, setting private variables is a no-no

I'm aware of that.

I will not accept anything that breaks the thread-safety of the compiler.

Can you give an example of how this breaks thread safety? Is there no public state on the compiler that can be set by a user during compilation?

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 27, 2015

Just for null, call the parameter "dangerous_null_delimiter=None" and only add the "NULL AS " if it's not None.

Thread 1 Thread 2
disable slashes
text_query = compile(obj)
text_query = compile(obj)
enable slashes
text_query is not escaped!
@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 27, 2015

Nice example.

call the parameter "dangerous_null_delimiter=None"

is that a joke?

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 27, 2015

nope

@graingert graingert added this to the 1.0.0 milestone Aug 27, 2015
@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 27, 2015

Thanks, but that's unnecessary.

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 27, 2015

Cool looks like this is ready to go, can you clean up your commits (squash and rebase as appropriate)?

@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 27, 2015

Sure

@cpcloud cpcloud force-pushed the cpcloud:format-compression branch 2 times, most recently Aug 27, 2015
@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 27, 2015

@graingert Would you like me to squash further?

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 27, 2015

probably best for one commit here I think?

@cpcloud cpcloud force-pushed the cpcloud:format-compression branch to 529663e Aug 27, 2015
@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 27, 2015

done

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 27, 2015

@jklukas @thisfred @bouk: I'm going to merge this. If you've got any objections we can change them in another PR.

graingert added a commit that referenced this pull request Aug 27, 2015
Add format and compression to COPY command
@graingert graingert merged commit c302895 into sqlalchemy-redshift:master Aug 27, 2015
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@cpcloud cpcloud deleted the cpcloud:format-compression branch Aug 27, 2015
@cpcloud
Copy link
Contributor Author

@cpcloud cpcloud commented Aug 27, 2015

@graingert thanks for your patience.

@graingert
Copy link
Collaborator

@graingert graingert commented Aug 27, 2015

@graingert no, thank you for your patience, feature and improvements to the code!

haleemur pushed a commit to haleemur/redshift_sqlalchemy that referenced this pull request Sep 2, 2015
…sion

Add format and compression to COPY command
@jklukas jklukas mentioned this pull request Mar 18, 2019
4 of 4 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.