DeltaTable Specifications #42

Merged
merged 54 commits into from
Sep 11, 2023

@mrmasterplan (Contributor) commented May 25, 2023

YouTube link with an introduction to this PR: https://youtu.be/X57AWD0OsZA

Please approve this PR if you think that

  • it breaks nothing in existing workflows
  • it might be useful to someone
  • it does not fundamentally violate the principles of spetlr (maybe we should write those down)

DeltaTableSpec

Abstract

The DeltaTableSpec class contains all information about a delta table that can be
given in a CREATE TABLE statement.

The class can be initialized in pure python, or by parsing a CREATE TABLE
statement. In addition, the class can read from the spark catalog all the
information needed to fully describe an existing table. With these two sources,
1. the specification in code and 2. the table on disk, the class can report where
the two agree and where they differ. Crucially, the class can formulate the
ALTER TABLE statements that are necessary to bring the table in spark into
alignment with the specification from code. This is its primary function.
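To make the idea of generating ALTER TABLE statements from a disagreement concrete, here is a minimal sketch. It does not use spetlr: `ToySpec` and `alter_statements` are hypothetical names invented for this illustration, and the sketch only handles columns that are missing from the table on disk. The real DeltaTableSpec may compare far more than this.

```python
from dataclasses import dataclass


@dataclass
class ToySpec:
    """Hypothetical stand-in for a table specification (not spetlr's class)."""

    name: str
    columns: dict  # column name -> SQL type, e.g. {"id": "int"}


def alter_statements(desired: ToySpec, actual: ToySpec) -> list:
    """Return ALTER TABLE statements that add the columns `actual` is missing."""
    missing = {
        col: sql_type
        for col, sql_type in desired.columns.items()
        if col not in actual.columns
    }
    if not missing:
        # The two descriptions already agree on their columns.
        return []
    cols = ", ".join(f"{col} {sql_type}" for col, sql_type in missing.items())
    return [f"ALTER TABLE {desired.name} ADD COLUMNS ({cols})"]


# Channel 1: the specification written in code.
from_code = ToySpec("mydb.tbl", {"id": "int", "name": "string"})
# Channel 2: what was (hypothetically) read back from the catalog.
from_disk = ToySpec("mydb.tbl", {"id": "int"})

statements = alter_statements(from_code, from_disk)
```

In this toy version the statements are only returned as strings; executing them against spark is left out on purpose.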

Introduction

Taking a step back from the mechanisms of spark, one could argue that there are
these competing statements that all describe a delta table to some degree:

  • A CREATE TABLE statement
  • A spark data frame (to be written to disk)
  • A delta table on a storage medium or in the spark catalog

To enable more dynamic analysis of their mutual (dis-)agreements, this set has
been extended with the DeltaTableSpec class, which can exist:

  • as python code: DeltaTableSpec(name="...", schema=...)
  • as a class instance in memory

The class has methods that enable going back and forth between each of these forms:

  • python code ↔ object instance: __init__ and repr(tbl) are guaranteed to
    be mutual inverses. The result of eval(repr(tbl)) compares equal to the
    original object.
  • sql code ↔ object instance:
    • DeltaTableSpec.from_sql(str) will create an instance from sql code
    • tbl.get_create_sql() will return a fully formed create statement,
      guaranteed to be the inverse of the above.
  • delta table ↔ object instance:
    • DeltaTableSpec.from_path(str) and DeltaTableSpec.from_name(str) will read
      all table details from spark.
    • tbl.make_storage_match() will execute the necessary sql statements to
      make the result of the from_name call compare equal to the specification in tbl

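The round-trip guarantees listed above can be illustrated with a small self-contained sketch. It does not import spetlr: `ToyTableSpec` is a hypothetical stand-in written for this example, its parser only understands the exact statement shape it emits itself, and the delta table round trip is left out because it needs a spark session.

```python
import re
from dataclasses import dataclass


@dataclass
class ToyTableSpec:
    """Hypothetical stand-in for DeltaTableSpec, for illustration only."""

    name: str
    columns: list  # list of (column name, SQL type) tuples

    def __repr__(self) -> str:
        # Written so that eval(repr(obj)) rebuilds an equal object.
        return f"ToyTableSpec(name={self.name!r}, columns={self.columns!r})"

    def get_create_sql(self) -> str:
        cols = ", ".join(f"{col} {sql_type}" for col, sql_type in self.columns)
        return f"CREATE TABLE {self.name} ({cols}) USING DELTA"

    @classmethod
    def from_sql(cls, sql: str) -> "ToyTableSpec":
        # Toy parser: accepts only the shape produced by get_create_sql()
        # and breaks on types that contain spaces or commas.
        match = re.fullmatch(r"CREATE TABLE (\S+) \((.*)\) USING DELTA", sql.strip())
        if match is None:
            raise ValueError("unsupported statement")
        columns = [tuple(part.split()) for part in match.group(2).split(", ")]
        return cls(name=match.group(1), columns=columns)


tbl = ToyTableSpec(name="mydb.tbl", columns=[("id", "int"), ("name", "string")])

# python code <-> object instance
python_round_trip = eval(repr(tbl))

# sql code <-> object instance
sql_round_trip = ToyTableSpec.from_sql(tbl.get_create_sql())
```

Both `python_round_trip` and `sql_round_trip` compare equal to `tbl`, which is the property the list above describes for the real class.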
Reference

For a detailed reference, please see the docstrings of each method on the class.


Documentation like the above is being produced, but I really want to get this out into people's hands after working on it for more than 4 months.

Simon Heisterkamp added 2 commits May 30, 2023 11:30
# Conflicts:
#	src/spetlr/configurator/configurator.py
#	src/spetlr/delta/delta_handle.py
#	tests/cluster/delta/test_delta_class.py
@mrmasterplan
Contributor Author

@LauJohansson I addressed all your comments. Thanks for the thorough review. I left the conversations unresolved; you can close them yourself if you agree.

@mrmasterplan mrmasterplan temporarily deployed to azure September 11, 2023 21:46 — with GitHub Actions Inactive
@mrmasterplan mrmasterplan merged commit 30fd7d4 into main Sep 11, 2023
2 checks passed
@mrmasterplan mrmasterplan deleted the feature/table-spec branch September 11, 2023 22:21