Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic select API working #1674

Merged
merged 5 commits into from
May 28, 2024

Conversation

sfc-gh-heshah
Copy link
Contributor

@sfc-gh-heshah sfc-gh-heshah commented May 24, 2024

  • Columns are created with no reference to the parent Dataframe in the current Snowpark API
  • Following this pattern when generating the AST (invoking proto IR entity constructors directly)
  • ast parameter added to Column constructor to enable Column methods to create and push-down ASTs
  • ast property added to Column class to enable backwards compatibility with col._expression: Expression using SpColumnSqlExpr as a backup

Current test case output AST

client_version {
  major: 42
}
client_language {
  python_language {
    version {
      major: 3
      minor: 8
    }
  }
}
body {
  assign {
    uid: 1
    var_id {
      bitfield1: 1
    }
    expr {
      sp_table {
        table: "test_table"
      }
    }
  }
}
body {
  assign {
    uid: 2
    var_id {
      bitfield1: 2
    }
    expr {
      sp_dataframe_filter {
        df {
          sp_dataframe_ref {
            id {
              bitfield1: 1
            }
          }
        }
        condition {
          sp_column_sql_expr {
            sql: "STR LIKE \'%e%\'"
          }
        }
      }
    }
  }
}
body {
  assign {
    uid: 3
    var_id {
      bitfield1: 3
    }
    expr {
      sp_dataframe_select__columns {
        df {
          sp_dataframe_ref {
            id {
              bitfield1: 2
            }
          }
        }
        cols {
          sp_column {
            name: "\"A\""
          }
        }
        cols {
          sp_column_sql_expr {
            sql: "\"STR\""
          }
        }
        cols {
          sp_column {
            name: "\"B\""
          }
        }
        cols {
          sp_column_sql_expr {
            sql: "\"STR\""
          }
        }
        cols {
          sp_column_sql_expr {
            sql: "ADD(\"A\", \"B\")"
          }
        }
      }
    }
  }
}

@sfc-gh-heshah sfc-gh-heshah requested a review from a team as a code owner May 24, 2024 02:51
@sfc-gh-heshah sfc-gh-heshah requested review from sfc-gh-yixie, sfc-gh-yuwang, sfc-gh-jrose and sfc-gh-azwiegincew and removed request for a team May 24, 2024 02:51
@@ -79,6 +79,7 @@
TimestampType,
)
from snowflake.snowpark.window import Window, WindowSpec
import snowflake.snowpark._internal.proto.ast_pb2 as proto
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be sorted.

df = df.filter("STR LIKE '%e%'")
df.show()
df = session.table('test_table').filter("STR LIKE '%e%'")
# col = df.col("STR") # TODO: determine whether this should create and assign statement or not
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not. We have no visibility into the assignment of Python objects. We create assign statements in the AST that correspond to dataframe-typed values (df, col-expr, etc.).

df = session.table('test_table')
df = df.filter("STR LIKE '%e%'")
df.show()
df = session.table('test_table').filter("STR LIKE '%e%'")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add this as a separate "test" or use cmdline args to choose the test.

@@ -517,6 +517,7 @@ def __init__(
referenced in subsequent dataframe expressions.
"""
self._session = session
self._ast_expr = ast_stmt.expr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no reason to retain the AST expression.

stmt = self._session._ast_batch.assign()
ast = stmt.expr
ast.sp_dataframe_select__columns.df.sp_dataframe_ref.id.bitfield1 = self._ast_id
ast.sp_dataframe_select__columns.variadic = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variadic is true for some forms – the logic seems to be in parse_positional_args_to_list. We have to capture this detail to correctly represent the unparsed form in query history.

names.append(Column(e)._named())
col = Column(e)
names.append(col._named())
ast.sp_dataframe_select__columns.cols.append(col.ast)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inefficient. Since the user passes in a string in this case, you're free to construct a new column expr from scratch without the inefficient copying.

This will naturally get refactored due to the previous comment though.

If this were to result in cleaner code, feel free to reimplement the logic, since in Phase 1 we will remove a lot of the existing complexity (we're preserving it for now so that this codebase can be used for Phase 0).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

)
# TODO: figure out why basic aliasing is breaking above
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misinterpreted the goal of this case in the Column constructor. When two strings are provided to the constructor the first is meant to use an alias for the Dataframe created prior to using the Column constructor.

Just to clarify, is the SpColumnAlias IR entity meant to capture statements of the form select colA as A?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In that case you would have two nested expressions.

) -> None:
self._ast = proto.SpColumnExpr() if ast is None else ast
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic seems wrong. You should defer the construction of the AST to the specific cases below.

elif isinstance(expr1, Expression):
self._expression = expr1
else: # pragma: no cover
raise TypeError("Column constructor only accepts str or expression.")

@property
def ast(self) -> proto.SpColumnExpr:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary. This logic should happen at construction time for the appropriate case. We shouldn't be introducing imperative mutations with a hard-to-follow control flow in an API without mutable state like this one. This is why I'm suggesting having clear logic in the constructor that immediately constructs the required value without mutations spread across multiple branches and functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be okay to still store a constructed _ast member and retrieve it later?

I'm facing an issue with deciding how to handle expressions of the form df.select(df.col("A"), col("B")), where the provided Column instances are constructed before df.select has an opportunity to construct the necessary SpColumnExpr instance and populate the AST fields.

More generally any Dataframe API that accepts previously constructed Column instances will face this issue, hence the current solution. I agree with the point you made for line 1137 though as that is a clear opportunity to avoid copying from a potentially complex message into the Session AstBatch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you're right and we have to store a pre-constructed value in _ast. This property shouldn't be necessary though. Maybe a helper function to copy _ast into whatever builder is passed in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, would the goal of the helper function be to replace protobuf.message.CopyFrom(col._ast) and append(col._ast) when constructing the AST one level up from the constructor?

I was thinking as long as _ast is ensured to be an instance of SpColumnExpr a helper function might not be necessary, unless there is another purpose I'm missing, please let me know!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Definitely better without a helper if the code seems simple.

df = session.table('test_table').filter("STR LIKE '%e%'")
# col = df.col("STR") # TODO: determine whether this should create and assign statement or not
df2 = df.select("A", df.col("STR"), col("B"), df.STR, col("A") + col("B"))
print(session._ast_batch._request)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you attach the output to the PR description please? I wanted to double-check the result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure, just added above

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I don't think it's right to map df.col() to SpColumnSqlExpr. It should be the same as the others. It's possible that AST construct was checked in by mistake. I don't see any Column APIs that include SQL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, currently I was using the SpColumnSqlExpr as a backup for APIs which didn't generate the AST yet.

There is however a Column API that allows SQL text through an UnresolvedAttribute Expression subclass I believe:

def _expr(cls, e: str) -> "Column":

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the AST, our goal is to capture the original API call. This appears to be an implementation detail which we will be removing in stage 1.

@sfc-gh-heshah
Copy link
Contributor Author

sfc-gh-heshah commented May 25, 2024

AST output from tests/thin-client/select_outputs.py after changes:

df = session.table("test_table").filter("STR LIKE '%e%'")
df2 = df.select("A", df.col("STR"), col("B"), df.STR)

client_version {
  major: 42
}
client_language {
  python_language {
    version {
      major: 3
      minor: 8
    }
  }
}
body {
  assign {
    uid: 1
    var_id {
      bitfield1: 1
    }
    expr {
      sp_table {
        table: "test_table"
      }
    }
  }
}
body {
  assign {
    uid: 2
    var_id {
      bitfield1: 2
    }
    expr {
      sp_dataframe_filter {
        df {
          sp_dataframe_ref {
            id {
              bitfield1: 1
            }
          }
        }
        condition {
          sp_column_sql_expr {
            sql: "STR LIKE \'%e%\'"
          }
        }
      }
    }
  }
}
body {
  assign {
    uid: 3
    var_id {
      bitfield1: 3
    }
    expr {
      sp_dataframe_select__columns {
        df {
          sp_dataframe_ref {
            id {
              bitfield1: 2
            }
          }
        }
        cols {
          sp_column {
            name: "\"A\""
          }
        }
        cols {
          sp_dataframe_col {
            df {
              sp_dataframe_ref {
                id {
                  bitfield1: 2
                }
              }
            }
            col_name: "\"STR\""
          }
        }
        cols {
          sp_column {
            name: "\"B\""
          }
        }
        cols {
          sp_dataframe_col {
            df {
              sp_dataframe_ref {
                id {
                  bitfield1: 2
                }
              }
            }
            col_name: "\"STR\""
          }
        }
        variadic: true
      }
    }
  }
}

class AstBatch:
def __init__(self, session):
self._session = session
self._id_gen = itertools.count(start=1)
self._init_batch()
# TODO: extended version from the branch snowpark-ir.

def assign(self, symbol = None):
def assign(self, symbol=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inconsistent with the rest of the code.

) -> None:
if ast is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done without a guard. It's better to rely on a known slot being set to None rather than the absence of the key in the dictionary.

if expr1 == "*":
self._expression = Star([])
# TODO: Determine whether SpColumnSqlExpr is correct entity here
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems correct.

stmt = self._session._ast_batch.assign()
ast = stmt.expr
ast.sp_dataframe_select__columns.df.sp_dataframe_ref.id.bitfield1 = self._ast_id
ast.sp_dataframe_select__columns.variadic = (len(cols) > 1 or isinstance(cols[0], (list, tuple, set)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. If a single list is passed, we want variadic = False.

Copy link
Collaborator

@sfc-gh-azwiegincew sfc-gh-azwiegincew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG!

self._ast.sp_column.name = col_name
self._expression = UnresolvedAttribute(quote_name(expr1))

# some repitition here, but _expression logic will be eliminated eventually
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repetition

@sfc-gh-heshah sfc-gh-heshah merged commit 8de5d28 into server-side-snowpark May 28, 2024
11 of 13 checks passed
@sfc-gh-heshah sfc-gh-heshah deleted the server-side-snowpark-colexpr branch May 28, 2024 18:21
@github-actions github-actions bot locked and limited conversation to collaborators May 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants