-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Basic select API working #1674
Basic select API working #1674
Conversation
src/snowflake/snowpark/column.py
Outdated
@@ -79,6 +79,7 @@ | |||
TimestampType, | |||
) | |||
from snowflake.snowpark.window import Window, WindowSpec | |||
import snowflake.snowpark._internal.proto.ast_pb2 as proto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be sorted.
tests/thin-client/steel-thread.py
Outdated
df = df.filter("STR LIKE '%e%'") | ||
df.show() | ||
df = session.table('test_table').filter("STR LIKE '%e%'") | ||
# col = df.col("STR") # TODO: determine whether this should create and assign statement or not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not. We have no visibility into the assignment of Python objects. We create assign statements in the AST that correspond to dataframe-typed values (df, col-expr, etc.).
tests/thin-client/steel-thread.py
Outdated
df = session.table('test_table') | ||
df = df.filter("STR LIKE '%e%'") | ||
df.show() | ||
df = session.table('test_table').filter("STR LIKE '%e%'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add this as a separate "test" or use cmdline args to choose the test.
src/snowflake/snowpark/dataframe.py
Outdated
@@ -517,6 +517,7 @@ def __init__( | |||
referenced in subsequent dataframe expressions. | |||
""" | |||
self._session = session | |||
self._ast_expr = ast_stmt.expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no reason to retain the AST expression.
src/snowflake/snowpark/dataframe.py
Outdated
stmt = self._session._ast_batch.assign() | ||
ast = stmt.expr | ||
ast.sp_dataframe_select__columns.df.sp_dataframe_ref.id.bitfield1 = self._ast_id | ||
ast.sp_dataframe_select__columns.variadic = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variadic
is true for some forms – the logic seems to be in parse_positional_args_to_list
. We have to capture this detail to correctly represent the unparsed form in query history.
src/snowflake/snowpark/dataframe.py
Outdated
names.append(Column(e)._named()) | ||
col = Column(e) | ||
names.append(col._named()) | ||
ast.sp_dataframe_select__columns.cols.append(col.ast) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inefficient. Since the user passes in a string in this case, you're free to construct a new column expr from scratch without the inefficient copying.
This will naturally get refactored due to the previous comment though.
If this were to result in cleaner code, feel free to reimplement the logic, since in Phase 1 we will remove a lot of the existing complexity (we're preserving it for now so that this codebase can be used for Phase 0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/snowflake/snowpark/column.py
Outdated
) | ||
# TODO: figure out why basic aliasing is breaking above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misinterpreted the goal of this case in the Column constructor. When two strings are provided to the constructor the first is meant to use an alias for the Dataframe created prior to using the Column constructor.
Just to clarify, is the SpColumnAlias
IR entity meant to capture statements of the form select colA as A
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. In that case you would have two nested expressions.
src/snowflake/snowpark/column.py
Outdated
) -> None: | ||
self._ast = proto.SpColumnExpr() if ast is None else ast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems wrong. You should defer the construction of the AST to the specific cases below.
src/snowflake/snowpark/column.py
Outdated
elif isinstance(expr1, Expression): | ||
self._expression = expr1 | ||
else: # pragma: no cover | ||
raise TypeError("Column constructor only accepts str or expression.") | ||
|
||
@property | ||
def ast(self) -> proto.SpColumnExpr: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary. This logic should happen at construction time for the appropriate case. We shouldn't be introducing imperative mutations with a hard-to-follow control flow in an API without mutable state like this one. This is why I'm suggesting having clear logic in the constructor that immediately constructs the required value without mutations spread across multiple branches and functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be okay to still store a constructed _ast
member and retrieve it later?
I'm facing an issue with deciding how to handle expressions of the form df.select(df.col("A"), col("B"))
, where the provided Column
instances are constructed before df.select
has an opportunity to construct the necessary SpColumnExpr
instance and populate the AST fields.
More generally any Dataframe API that accepts previously constructed Column instances will face this issue, hence the current solution. I agree with the point you made for line 1137 though as that is a clear opportunity to avoid copying from a potentially complex message into the Session AstBatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you're right and we have to store a pre-constructed value in _ast
. This property shouldn't be necessary though. Maybe a helper function to copy _ast
into whatever builder is passed in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, would the goal of the helper function be to replace protobuf.message.CopyFrom(col._ast)
and append(col._ast)
when constructing the AST one level up from the constructor?
I was thinking as long as _ast
is ensured to be an instance of SpColumnExpr
a helper function might not be necessary, unless there is another purpose I'm missing, please let me know!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Definitely better without a helper if the code seems simple.
tests/thin-client/steel-thread.py
Outdated
df = session.table('test_table').filter("STR LIKE '%e%'") | ||
# col = df.col("STR") # TODO: determine whether this should create and assign statement or not | ||
df2 = df.select("A", df.col("STR"), col("B"), df.STR, col("A") + col("B")) | ||
print(session._ast_batch._request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you attach the output to the PR description please? I wanted to double-check the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure, just added above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I don't think it's right to map df.col() to SpColumnSqlExpr. It should be the same as the others. It's possible that AST construct was checked in by mistake. I don't see any Column APIs that include SQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, currently I was using the SpColumnSqlExpr
as a backup for APIs which didn't generate the AST yet.
There is however a Column API that allows SQL text through an UnresolvedAttribute
Expression subclass I believe:
def _expr(cls, e: str) -> "Column": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the AST, our goal is to capture the original API call. This appears to be an implementation detail which we will be removing in stage 1.
AST output from
|
class AstBatch: | ||
def __init__(self, session): | ||
self._session = session | ||
self._id_gen = itertools.count(start=1) | ||
self._init_batch() | ||
# TODO: extended version from the branch snowpark-ir. | ||
|
||
def assign(self, symbol = None): | ||
def assign(self, symbol=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inconsistent with the rest of the code.
src/snowflake/snowpark/column.py
Outdated
) -> None: | ||
if ast is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done without a guard. It's better to rely on a known slot being set to None
rather than the absence of the key in the dictionary.
src/snowflake/snowpark/column.py
Outdated
if expr1 == "*": | ||
self._expression = Star([]) | ||
# TODO: Determine whether SpColumnSqlExpr is correct entity here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems correct.
src/snowflake/snowpark/dataframe.py
Outdated
stmt = self._session._ast_batch.assign() | ||
ast = stmt.expr | ||
ast.sp_dataframe_select__columns.df.sp_dataframe_ref.id.bitfield1 = self._ast_id | ||
ast.sp_dataframe_select__columns.variadic = (len(cols) > 1 or isinstance(cols[0], (list, tuple, set))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. If a single list is passed, we want variadic = False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG!
src/snowflake/snowpark/column.py
Outdated
self._ast.sp_column.name = col_name | ||
self._expression = UnresolvedAttribute(quote_name(expr1)) | ||
|
||
# some repitition here, but _expression logic will be eliminated eventually |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repetition
Column
s are created with no reference to the parentDataframe
in the current Snowpark APIproto
IR entity constructors directly)ast
parameter added toColumn
constructor to enableColumn
methods to create and push-down ASTsast
property added toColumn
class to enable backwards compatibility withcol._expression: Expression
usingSpColumnSqlExpr
as a backupCurrent test case output AST