Column ops implementation first draft #1696

Merged
merged 22 commits into server-side-snowpark from server-side-snowpark-colops on Jun 12, 2024

Conversation

sfc-gh-heshah (Contributor):

Added draft implementations for most column expressions, excluding those without corresponding AST entities (ISIN, OVER, and string-expression-based functions missing IR entities).

@sfc-gh-oplaton left a comment:

Superficially, the changes look good, with one observation: The functionality in src/snowflake/snowpark/column.py is very repetitive, to the point where I probably would have missed anything broken.

I think it would make sense for at least some of the functionality (e.g. binary operators) to be further encapsulated into a higher-order function.

For example, __eq__ and __ne__ could become:

def __eq__(self, other):
    return self.bin_op_impl(lambda ast: ast.sp_column_equal_to, EqualTo, other)

def __ne__(self, other):
    return self.bin_op_impl(lambda ast: ast.sp_column_not_equal, NotEqualTo, other)

def bin_op_impl(self, property, comparator, other):
    ast = proto.SpColumnExpr()
    property(ast).lhs.CopyFrom(self._ast)
    Column._fill_ast(property(ast).rhs, other)
    return Column(comparator(self._expression, Column._to_expr(other)), ast=ast)

@sfc-gh-oplaton:
I think the change looks good. Thanks for the updated approach! Of course, I'd recommend getting additional input from Leonhard and/or Arthur, as I might have missed some of the finer points of the design.

@sfc-gh-heshah (Contributor Author):

Sounds good, thank you for the suggestion!

Originally I was thinking the change you mentioned should correspond to further abstracting the IR for SpColumnExpr operations as Arthur suggested on Slack, but I think this minimizes the code changes required if the AST definitions are changed.

Along those same lines I thought there were other repetitive AST building patterns that could be abstracted into a static Column._create_ast method, which I added. This should further reduce dependence on the exact protobuf definition, and minimize changes necessary in the code in case the AST is evolved further, but please let me know if it's a bit overboard.

ast = proto.SpColumnExpr()
if property is not None:
    for attr, value in assign_fields.items():
        setattr(property(ast), attr, value)
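Put together, a rough sketch of the full helper is below; the copy_messages/assign_fields handling is inferred from the call sites later in this diff, so treat the exact signature as an assumption:

@staticmethod
def _create_ast(
    property: Optional[Callable] = None,
    copy_messages: Optional[dict] = None,
    assign_fields: Optional[dict] = None,
) -> proto.SpColumnExpr:
    # Sketch only: build the expression, select its IR entity via `property`,
    # then copy submessages and assign scalar fields onto that entity.
    ast = proto.SpColumnExpr()
    if property is not None:
        for name, message in (copy_messages or {}).items():
            getattr(property(ast), name).CopyFrom(message)
        for attr, value in (assign_fields or {}).items():
            setattr(property(ast), attr, value)
    return ast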
@sfc-gh-oplaton May 30, 2024:

This isn't critical, but I wanted to make it explicit:

The point of using a callable to access object properties (lambda o: o.some_property) was to avoid using getattr/setattr, but they're equivalent; there's no strict technical reason why callables are superior to getattr/setattr, just a personal preference (callables look more like code than stringly-typed code; a very superficial preference).

That said, the callable pattern is more verbose at the call site, especially when using it repeatedly. (property = lambda ast: ast.sp_column_regexp, versus property = "sp_column_regexp",).

In summary, I think it would make the code less readable to have both the callable pattern and getattr/setattr. Pick one that you prefer (I don't feel strongly either way - callable looks more like code, but getattr/setattr will result in terser code) and be uniform. Uniformity matters a lot for ease of maintenance.
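Concretely, the two call-site styles under discussion (the string-based signature is hypothetical):

# Callable accessor:
self._bin_op_impl(lambda ast: ast.sp_column_equal_to, EqualTo, other)

# String accessor, resolved via getattr inside the helper:
self._bin_op_impl("sp_column_equal_to", EqualTo, other)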


Aside: I assume performance isn't a consideration here. There might be a difference between using a callable and getattr/setattr, but I didn't look into it, and I don't have an intuition here. If this is going to run billions of times, maybe writing a short benchmark and comparing would also help (if you can do it in <15 minutes, this doesn't need to become a side project of its own).

Contributor:

I don't think we need to be concerned with performance here. I would pick whichever code is easier to read/less to write. Having this function here is super useful throughout the code base!


If you're sure performance won't matter, I recommend using getattr/setattr everywhere. Way fewer lexical tokens at each call site.

@sfc-gh-heshah (Contributor Author) May 30, 2024:

Just wrote a short benchmark using timeit; it wasn't too much work:

import timeit

setup = """
import snowflake.snowpark._internal.proto.ast_pb2 as proto
test = proto.SpColumnExpr()
prop = lambda ast: ast.sp_column_apply__string
"""

getcode1 = """
t = prop(test)
"""

getcode2 = """
t = getattr(test, "sp_column_apply__string")
"""

print("Callable:\t", timeit.timeit(setup=setup, stmt=getcode1))
print("getattr:\t", timeit.timeit(setup=setup, stmt=getcode2))

setcode1 = """
test.sp_column_apply__string.field = "test"
"""

setcode2 = """
setattr(test.sp_column_apply__string, "field", "test")
"""

print(".field direct:\t", timeit.timeit(setup=setup, stmt=setcode1))
print("setattr:\t", timeit.timeit(setup=setup, stmt=setcode2))

The results I got for 1 million runs of each line were:

Callable:      0.04975545800000003
getattr:       0.040397291999999974
.field direct: 0.10505558299999995
setattr:       0.12402554099999996

Overall it seems getattr is actually faster than the callable for reads, while setting the field directly is faster than setattr for writes. I do think a refactor using getattr will be cleaner, so I will proceed with that.

@@ -226,6 +228,7 @@ class Column:
    This class has methods for the most frequently used column transformations and operators. Module :mod:`snowflake.snowpark.functions` defines many functions to transform columns.
    """

    # NOTE: For now assume Expression instances can be safely ignored when building AST
Contributor:

can you add an example that's problematic here? Not sure what to do to resolve this note.

@sfc-gh-heshah (Contributor Author) May 31, 2024:

This won't be an issue once all API functions that return Column instances also populate the ast parameter in the constructor.

For example, because the isin method currently does not build the AST before returning, df.filter(col("A").isin(1, 2, 3) & col("B")) would fail: the boolean operator & would try to construct an AST from that of the new col("A").isin(1, 2, 3) column, which we currently don't fill when the only argument provided to the Column constructor is an expr1 of type Expression.
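A minimal repro of that failure mode, assuming a DataFrame df with columns A and B:

from snowflake.snowpark.functions import col

# Works: col(...) populates the AST, so & can copy both operands.
df.filter((col("A") > 1) & (col("B") > 2))

# Fails today: isin returns a Column built only from an Expression, so its
# AST is empty and & has nothing to copy into the new expression's lhs.
df.filter(col("A").isin(1, 2, 3) & col("B"))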

Contributor:

great! let's just add what you explained as a comment.

Contributor:

can we add the example?

Contributor:

is there a TODO post phase 0?

if isinstance(expr1, str) and isinstance(expr2, str):
    if expr2 == "*":
        self._expression = Star([], df_alias=expr1)
    else:
        self._expression = UnresolvedAttribute(
            quote_name(expr2), df_alias=expr1
        )

Contributor:

What would the alias field here be? Is this for expr.as_(alias)? Can we maybe add an example of what would need to get captured here?

Contributor Author:

The alias field would be something that the user provides to the DataFrame.alias(self, name: str) method.

The Snowpark API then allows users to create columns that reference this DataFrame instance by its alias using a line like col(<df_alias>, <column_name>)

In the IR we'll need the alias to resolve which DataFrame instance the user is referring to, but there might be a better way to go about enabling this functionality.
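For example (assuming a session with two tables t1 and t2):

df1 = session.table("t1").alias("L")
df2 = session.table("t2").alias("R")

# col(<df_alias>, <column_name>) must resolve "L" and "R" back to df1/df2.
joined = df1.join(df2, col("L", "id") == col("R", "id"))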

Contributor:

prob. would be good to include your explanation as comment here

if expr1 == "*":
    self._ast.sp_column_sql_expr.sql = "*"
self._ast = Column._create_ast(
Contributor:

As soon as #1701 is in, can we add tests for this?

    copy_messages={"col": self._ast},
    assign_fields={"field": field},
)
return Column(SubfieldString(self._expression, field), ast=ast)
Contributor:

let's use the precommit to format everything :)

else:
    raise TypeError(f"Unexpected item type: {type(field)}")

# overload operators
def _bin_op_impl(self, property: Callable, operator: BinaryExpression, other: ColumnOrLiteral) -> "Column":
Contributor:

pls. add docstrings to this
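For example, something Google-style along these lines (the descriptions are a sketch):

def _bin_op_impl(
    self, property: Callable, operator: BinaryExpression, other: ColumnOrLiteral
) -> "Column":
    """Builds a Column for a binary operation, filling both the AST and the Expression.

    Args:
        property: Accessor selecting the operation's entity on SpColumnExpr.
        operator: Snowpark binary operator used to build the Expression.
        other: Right-hand operand; a Column's AST is copied, a literal is filled in.

    Returns:
        A new Column carrying the Expression and the populated AST.
    """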

 def __eq__(self, other: Union[ColumnOrLiteral, Expression]) -> "Column":
     """Equal to."""
-    right = Column._to_expr(other)
-    return Column(EqualTo(self._expression, right))
+    return self._bin_op_impl(lambda ast: ast.sp_column_equal_to, EqualTo, other)
Contributor:

big fan of the refactor here, so much cleaner!


 def __rtruediv__(self, other: Union[ColumnOrLiteral, Expression]) -> "Column":
-    return Column(Divide(Column._to_expr(other), self._expression))
+    return self._bin_op_rimpl(lambda ast: ast.sp_column_divide, Divide, other)
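(_bin_op_rimpl is presumably the reflected counterpart of _bin_op_impl with the operand order swapped; a sketch:)

def _bin_op_rimpl(self, property, operator, other):
    # Reflected variant: `other` becomes the left-hand side.
    ast = proto.SpColumnExpr()
    Column._fill_ast(property(ast).lhs, other)
    property(ast).rhs.CopyFrom(self._ast)
    return Column(operator(Column._to_expr(other), self._expression), ast=ast)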

def __mod__(self, other: Union[ColumnOrLiteral, Expression]) -> "Column":
    """Reminder."""
Contributor:

typo: Remainder


def __neg__(self) -> "Column":
    """Unary minus."""
    # TODO: Need SpColumnNeg IR entity
Contributor:

I feel like we should start filing JIRAs with all of these TODOs. @sfc-gh-azwiegincew shall we maybe create an epic ThinClient AST/IR translation and track those under it?

Collaborator:

Yes. I'll spend some quality time with Jira today.

    return Column(Not(self._expression))

def _cast(self, to: Union[str, DataType], try_: bool = False) -> "Column":
    # TODO: Update SpColumnCast IR entity with new field "try_", then uncomment
@sfc-gh-lspiegelberg (Contributor) May 30, 2024:

how difficult would this be? I would also be fine deferring this work to another PR. OK to have only simple column expr supported for now.

Contributor Author:

Not very difficult, I have already made the changes and compiled the new protobuf in the PR in the monorepo with the modifications necessary: https://github.com/snowflakedb/snowflake/pull/183083

However, I will get rid of these TODOs in favor of a flexible approach in the _create_ast method which can silently skip IR entities and fields or raise an error depending on a parameter.
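Roughly, at a call site (fail_on_missing and the entity name are placeholders, not the final API):

ast = Column._create_ast(
    property=lambda ast: ast.sp_column_try_cast,  # hypothetical IR entity
    fail_on_missing=False,  # skip silently while the IR catches up
)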

@@ -591,11 +699,18 @@ def regexp(self, pattern: ColumnOrLiteralStr) -> "Column":
        :meth:`rlike` is an alias of :meth:`regexp`.

        """
        # TODO: Should SpColumnRegexp.pattern be an Expr in the IR? Could be SpColumnExpr as in SpColumnLike?
Contributor:

There seem to be a few TODOs which require IR changes; maybe collect them in a doc and then go through them in a meeting? We prob. can resolve those quickly.

        for col in parse_positional_args_to_list(*cols)
    ],
)
prop = lambda ast: ast.sp_column_within_group
Contributor:

The code here looks long - can we add a comment to explain the logic?

def _named(self) -> NamedExpression:
    if isinstance(self._expression, NamedExpression):
        return self._expression
    else:
        return UnresolvedAlias(self._expression)

@staticmethod
def _create_ast(property: Optional[Callable] = None,
Contributor:

This looks like a super helpful function. Can we add a docstring to explain it? This seems to be used throughout - so a good idea to have a detailed docstring and also whichever asserts/checks are needed within it.

@classmethod
def _fill_ast(cls, ast: proto.SpColumnExpr, value: ColumnOrLiteral) -> None:
    """
    Copy from a Column object's AST, or copy a literal value into an AST expression.
Contributor:

please explain the params; use Google style (you can configure it in PyCharm).



def fill_ast(self, ast: proto.SpDataType) -> None:
    raise NotImplementedError(f"{self.__class__.__name__} does not have an equivalent SpDataType yet")
Contributor:

this is a base method I assume? Maybe let's not use NotImplementedError, but instead ValueError, and say what to implement.


Contributor Author:

There are some intermediary types like _AtomicType(DataType) that I think should not be required to implement fill_ast, so I will raise a ValueError instead.
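i.e., roughly:

def fill_ast(self, ast: proto.SpDataType) -> None:
    # Base/default: intermediate types such as _AtomicType never map to an
    # SpDataType, so this is a genuine error rather than missing work.
    raise ValueError(
        f"{self.__class__.__name__} does not have an equivalent SpDataType"
    )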

pass


# Ignoring pandas types in AST for now
Contributor:

you prob. can remove the comment, it will give a failure



class GeometryType(DataType):
"""Geometry data type. This maps to the GEOMETRY data type in Snowflake."""

# TODO: Create SpGeometryType IR enttiy, then uncomment
Contributor:

typo: enttiy -> entity

@@ -115,6 +121,11 @@ def __hash__(self):
        if self.length == StringType._MAX_LENGTH:
            return StringType().__hash__()
        return super().__hash__()

    def fill_ast(self, ast: proto.SpDataType) -> None:
        # TODO: Create SpStringType IR entity with field "length: Int", and update
Contributor:

instead of all the TODOs as comment, could also use NotImplementedError with a comment on what to do. You could also define an error message

ErrorMessage.missing_ir('SpStringType IR needs field "length: Int"')

that would help a lot!
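A rough sketch of such a helper (names as proposed above):

class ErrorMessage:
    @staticmethod
    def missing_ir(detail: str) -> NotImplementedError:
        return NotImplementedError(f"AST/IR support missing: {detail}")

# At a TODO site:
raise ErrorMessage.missing_ir('SpStringType IR needs field "length: Int"')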

@sfc-gh-heshah (Contributor Author) May 30, 2024:

I agree, raising an error on the client side would probably be the best approach to raise the alarm on any Snowpark API elements with no representation in the IR.

I was afraid of blocking the client from working while we evaluate the IR changes, but I can turn the TODOs into NotImplementedError if that is okay.

Please see the draft pull request (still setting up the protobuf compiler to make sure the new AST is valid) I created addressing all of the TODOs here: https://github.com/snowflakedb/snowflake/pull/183083

@sfc-gh-lspiegelberg (Contributor) left a comment:

Great work! Left a few comments, I think it would be nice to have some testing added here.



# TODO: Instead of regexp, grab assign statments using Python ast library
RE_SYMBOL_NAME = re.compile(r'^\s*([a-zA-Z_]\w*)\s*=.*$', re.DOTALL)
Contributor:

you prob. can move this constant within the get_symbol function
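For the TODO itself, an ast-based replacement might look like this (function name assumed):

import ast
from typing import Optional

def assign_target(source: str) -> Optional[str]:
    # Parse the source line and return the assigned name, if any.
    try:
        node = ast.parse(source.strip()).body[0]
    except (SyntaxError, IndexError):
        return None
    if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
        return node.targets[0].id
    return None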

ast.file = curr_frame.filename
ast.start_line = curr_frame.lineno

if sys.version_info[1] >= 11:
Contributor:

should we fill for the older Python versions with None or so?

Contributor Author:

The protobuf message created will default to None for these fields if not provided. Trying to assign them with None will raise an error due to the type mismatch.

@@ -363,6 +431,7 @@ def __round__(self, n=None):
    def __hash__(self):
        return hash(self._expression)

    # TODO: Implement AST generation for in_ / isin (relies on MultipleExpression and ScalarSubquery)

    Raises:
        ValueError: If the corresponding SpDataType IR entity is not available.
    """
    raise ValueError(
Contributor:

why not NotImplementedError?


 # Data types
 class NullType(DataType):
     """Represents a null type."""

-    pass
+    def fill_ast(self, ast: proto.SpDataType) -> None:
Contributor:

should these be private methods as well? _fill_ast?

@@ -1,2 +1,2 @@
#!/bin/bash
Contributor:

I think @sfc-gh-azwiegincew recently updated this, pls reflect!

Collaborator:

Yes, this file is now deleted.

Contributor:

pls delete this file

@sfc-gh-lspiegelberg (Contributor) left a comment:

some nits, great work! Let's get this in.

@sfc-gh-azwiegincew (Collaborator) left a comment:

Gorgeous tests! Just a few nits.

import snowflake.snowpark._internal.proto.ast_pb2 as proto

NoneType = type(None)
PYTHON_TO_AST_CONST_MAPPINGS = {
Collaborator:

empty line to separate 13 and 14

Collaborator:

I'm not a huge fan of PYTHON_TO_AST_CONST_MAPPINGS. It's not reused anywhere, and you have complications such as the bytearray case. I would suggest removing this and just adding the cases directly in your one function that uses this global.

@@ -85,6 +85,11 @@
    Unpivot,
    ViewType,
)
from snowflake.snowpark._internal.ast_utils import (
    setattr_if_not_none,
Collaborator:

sort

@@ -0,0 +1,20 @@
## TEST CASE
Collaborator:

empty line after this for symmetry with EXPECTED OUTPUT – in all the new tests

@@ -0,0 +1,20 @@
## TEST CASE
from snowflake.snowpark.functions import col
Collaborator:

I would suggest adding import snowflake.snowpark.functions in the test driver and using functions.col in your tests.

Collaborator:

Or because col is so special (i.e. common), adding that import as well (keeping .functions for other functions in the future).


@@ -23,29 +25,33 @@ class TestCase:

 def parse_file(file):
     """Parses a test case file."""
-    with open(file, "r", encoding="utf-8") as f:
+    with open(file, encoding="utf-8") as f:
Collaborator:
Why did you remove r?

    return curr_frame


# TODO: Instead of regexp, grab assign statments using Python ast library
Contributor:

can we add a JIRA?

@@ -11,582 +11,589 @@
_sym_db = _symbol_database.Default()

Contributor:

has this been updated with the most recent one from the server / mono repo branch?

Contributor Author:

Yes, it is the most recent one.

@@ -724,7 +724,7 @@ def ast_query(self, request_id__ast) -> Any:
    return self._conn._rest.request(
        f"/queries/v1/query-request?requestId={request_id}", req
    )
@sfc-gh-lspiegelberg (Contributor) Jun 12, 2024:

remove change here (use git restore -s origin/server-side-snowpark src/snowflake/snowpark/_internal/server_connection.py or so)

# I.e., add the following lines to ~/.ssh/config
# Host devvm
#     HostName sdp-devvm-<ldap user>

Contributor:

reuse same logic for devvm / workspace incl. example from other script here?

@sfc-gh-lspiegelberg (Contributor) left a comment:

one more pass and then this should be good to go

@@ -662,7 +662,6 @@ def get_result_query_id(self, plan: SnowflakePlan, **kwargs) -> str:
    result_set, _ = self.get_result_set(plan, to_iter=True, **kwargs)
    return result_set["sfqid"]


Contributor:

accident?

Contributor Author:

I think this was a side effect of running the precommit fix_lint


Args:
    property (str): IR entity representing the unary operation in the Column AST
    operator (UnaryExpression): Snowpark unary operator to buidl the Expression instance
Contributor:

Suggested change:
-    operator (UnaryExpression): Snowpark unary operator to buidl the Expression instance
+    operator (UnaryExpression): Snowpark unary operator to build the Expression instance

nit

@sfc-gh-vbudati (Contributor) left a comment:

Looks great! I think it should be good to merge once you delete the file Arthur and Leonhard mentioned - tests/ast/run-unparser.sh

sfc-gh-heshah merged commit bce97c0 into server-side-snowpark on Jun 12, 2024 (14 of 16 checks passed).
sfc-gh-heshah deleted the server-side-snowpark-colops branch June 12, 2024 17:52.
github-actions bot locked and limited conversation to collaborators Jun 12, 2024.