Adding local deltacat storage module #175

raghumdani · 2023-08-04T21:44:17Z

deltacat/storage/model/table_version.py

… and new makefile targets. (#202) * Logging memory consumed to validate worker estimation correctness (#142) * Logging memory consumed to validate worker estimation correctness [Description] This change will allow us to validate the correctness of the workers we estimated to run a compaction job reliably. * update tests * fix tests for python 3.7 as addClassCleanup is absent * addressed comments * Update deltacat/compute/compactor/compaction_session.py Co-authored-by: Patrick Ames <pdames@amazon.com> Signed-off-by: Raghavendra M Dani <raghumdani@gmail.com> * Update deltacat/utils/ray_utils/concurrency.py Co-authored-by: Patrick Ames <pdames@amazon.com> Signed-off-by: Raghavendra M Dani <raghumdani@gmail.com> * remove dependency on rebase_source_partition, simplify the repartition delta discovery * remove repartition_during_rebase argument, ping the source to compacted table * remove repartition rcf write, generalize the rcf write by supporting pre-specified url * addressed comments * logging cpu usage metrics * fix percentages * capture latency of retrieving cluster resources --------- Signed-off-by: Raghavendra M Dani <raghumdani@gmail.com> Co-authored-by: Patrick Ames <pdames@amazon.com> Co-authored-by: Jialin Liu <rootliu@amazon.com> * Capturing all the performance metrics in an audit (#146) * Capturing all the performance metrics in an audit [Description] It is important to capture audit automatically and analyze the datapoints from significant number of runs to recognize a pattern and come up with a formula to predict resource consumption. This would improve reliability and efficiency. [Motivation] #141 * fix unit tests * refactoring and captured new metrics * fix constructor args * fix assert statement * fix NoneType error * fix bug in skipping manifest * add comment and address comments * fix stage_delta * fixing deltacat method signature * capturing the head process peak memory usage and result size of each step * [skip download cold manifest] Add support for skipping download cold manifest entries during materialize step * Address review comments * Adding a data model to represent compact_partition function parameters (#151) * defined CompactPartitionParams class * fixed setuptools.setup name * updated CompactPartitionParams docstring to refer to function location * added compact_partition_from_request method that wraps compact_partition * fixed unit tests * fixed test_serialize_handles_sets unit test by asserting that two list contain the same elements instead of using == operator * fixed linting in test_compact_partition_params.py * Object store implementation to allow elastically increasing the object store memory (#149) * [skip untouched files]Disable copy by reference during backfill and rebase only compaction; Fix referenced manifest PyarrowWriteResult (#153) * add dd parallelism params (#154) Co-authored-by: Jialin Liu <rootliu@amazon.com> * Allow s3 client kwargs as argument of compact_partition (#155) * Allow s3 client kwargs as argument of compact_partition * Changing default as empty dict * add s3_client_kwargs to rcf * Bunping up version to 0.1.18b8 * Honor profile name in s3 client kwargs (#157) * Use kwargs as to determine the profile to use when creating the s3 client * bumped up version to 0.1.18b9 * Allow s3_client_kwargs to be passed into repartition (#158) * Allow s3_client_kwargs to be passed into repartition * Update compaction session s3_client_kwargs to default to None * Move s3_client_kwargs default setter to parent scope (#159) * Move s3_client_kwargs default setter to parent scope * Bump version to 0.1.18b11 * keep null row and remove dw_int64 column (#161) * keep null row and remove dw_int64 column * add assertation for extra column remove * remove column by name explicitly * use drop column * use list of str in drop columns --------- Co-authored-by: Jialin Liu <rootliu@amazon.com> * version bump to 0.1.18b12 (#164) * Cleaning up rehashing logic as it is a dead code as of now. (#166) * Cleaning up rehashing logic as it is a dead code as of now. [Motivation] #165 * fix version number * Fix stream position and support latest pyarrow (#168) [Motivation] #167 * Bumped version from 0.1.18b12 to 0.1.18b13 (#169) * bumped version from 0.1.18b12 to 0.1.18b13 * fixed current_node_name logic to handle ipv6 addresses * Add pytest benchmarking for Parquet reads (#160) * Add benchmarking scripts and README * Cleanup code and report visualizations * Enable 1RG tpch files * Fix naming from 1RG to 2RG * Update dev-requirements.txt Co-authored-by: Patrick Ames <smtp.pdames@gmail.com> Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com> * Update deltacat/benchmarking/conftest.py Co-authored-by: Patrick Ames <smtp.pdames@gmail.com> Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com> * Update deltacat/benchmarking/README.md Co-authored-by: Patrick Ames <smtp.pdames@gmail.com> Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com> * Lints and README * Use benchmark-requirements.txt instead of dev-requirements.txt * Lints --------- Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com> Co-authored-by: Jay Chia <jaychia94@gmail.com@users.noreply.github.com> Co-authored-by: Patrick Ames <smtp.pdames@gmail.com> * Polling EC2 Instance Metadata endpoint until HTTP 200 OK (#172) * block_until_instance_metadata_service_returns_success implemented * now just checking INSTANCE_METADATA_SERVICE_IPV4_URI * added unit testing * removed unneccesary logging * removed commented out print statement * fixed session.mount * removed unused args * now using tenacity for retrying * changed default stop strategy * removed unused import in TestBlockUntilInstanceMetadataServiceReturnsSuccess * retry_url->retrying_get, log_attempt_number now private * Adding local deltacat storage module (#175) * Adding local deltacat storage module * Using camel case dict keys for consistency * Support storing data in parquet and utsv bytes * Fix README.md to allow db_file_path arg * version bump from 0.1.18b13 to 0.1.18b14 (#179) * Now triggering publish-to-pypi on editing and creating a release (#180) * `compact_partition` incremental unit test (#188) * fixed commit_partition and list_deltas local deltacat storage bug * providing destination table with deltas * ds_mock_kwargs fixtures just consist of dict of db_file_path * working test_compact_partition_success unit test * flake8 issues * removed print statements * fixed assert ( manifest_records == len(compacted_table) * 3 working unit test cases * implemented test_compact_partition_incremental * commented out broken unit test * removed unused test files * added explanatory comment to block_until_instance_metadata_service_returns_success * reverting ray_utils/runtime change * fixed test_retrying_on_statuses_in_status_force_list * fixed test_retrying_on_statuses_in_status_force_list * added additional block_until_instance_metadata_service_returns_success unit test * additional fixtures * validation_callback_func_ -> validation_callback_func_kwargs * added use_prev_compacted key * added additional use_prev_compacted * fixed test_compaction_session unit tests * copied over working unit tests from dev/test_compact_partition_first_cut * fixed repartition unit tests * moved test_utils to itw own module * removed unused kwargs arg * paramtrizing records_per_compacted_file, hash_bucket_count * removed unused TODO * augmented CompactPartitionParams to include additional compact_partition kwargs * augmented CompactPartitionParams to include additional compact_partition kwargs * refactored testcases and setup to there own modules * added additional type hints * defaulting to empty dict safetly wherever deltacat_storage_kwargs is a param * revert change that no longer passed **list_deltas_kwargs to io.discover_deltas * additional type hints * no longer passing kwargs to dedupe + decimal-pk case * no longer passing kwargs to materialize * unpacking deltacat_storage_kwargs in deltacat_storage.stage_delta timed_invocation * no longer unpacking **list_deltas_kwargs when calling _execute_compaction_round * readded kwargs * added incremental timestamp-pk unit test * added docstring to offer_iso8601_timestamp_list * added tc for duplicate w sorting keyand multiple primary keys * removed deadcode * added additional comments to first incremental test case * fixture refactoring - compaction_artifacts_s3_bucket -> setup_compaction_artifacts_s3_bucket * parameterizing table version * # ds_mock_kwargs -> # teardown_local_deltacat_storage_db * tearing down db within test execution is now parameterized * added INCREMENTAL_DEPENDENT_TEST_CASES dependent test case * reversing order of sk in 10-incremental-decimal-pk-multi-dup * Switch botocore retry mode to adaptive from standard (#191) Co-authored-by: Kevin Yan <kevnya@amazon.com> * Merge phash_main into main branch (#195) * Adding support for reading table into ParquetFile (#163) * Adding support for reading just the parquet meta by leveraging ParquetFile [Description] Right now we read the entire table using read_parquet method of pyarrow. However, we lose the parquet metadata in the process and this will eagerly load all the data. In this commit we will use ParquetFile (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html) [Motivation] #162 * Using pyarrow.fs.S3FileSystem instead of s3fs * Adding s3 adaptive retries and validate encoding * Interface for merge v2 (#182) * Adding interface and definitions for merge step * fix tests and merge logs * Add hashed memcached client support (#173) * Adding support for reading table into ParquetFile (#163) * Adding support for reading just the parquet meta by leveraging ParquetFile [Description] Right now we read the entire table using read_parquet method of pyarrow. However, we lose the parquet metadata in the process and this will eagerly load all the data. In this commit we will use ParquetFile (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html) [Motivation] #162 * Using pyarrow.fs.S3FileSystem instead of s3fs * Adding s3 adaptive retries and validate encoding * Add hashed memcached client support * Adding support for reading table into ParquetFile (#163) * Adding support for reading just the parquet meta by leveraging ParquetFile [Description] Right now we read the entire table using read_parquet method of pyarrow. However, we lose the parquet metadata in the process and this will eagerly load all the data. In this commit we will use ParquetFile (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html) [Motivation] #162 * Using pyarrow.fs.S3FileSystem instead of s3fs * Adding s3 adaptive retries and validate encoding * use uid instead of ip address as key to hash algorithm --------- Co-authored-by: Raghavendra M Dani <draghave@amazon.com> * Implementing hash bucketing v2 (#178) * Implementing hash bucket v2 * Fix the assertion regarding hash buckets * Python 3.7 does not have doClassCleanups in super * Fix the memory issue with the hb index creation * Compaction session implementation for algo v2 (#187) * Compaction session implementation for algo v2 * Address comments * Added capability to measure instance minutes in a autoscaling cluster setting * Avoid configuring logger if ray is uninitialized * Add readme to run tests * Refactoring and fixing the num_cpus key in options * Resolve merge conflict and rebase from main * Adding additional optimization from POC (#194) * Adding additional optimization from POC * fix typo * Moved the compact_partition tests to top level module * Adding unit tests for parquet downloaders * fix typo * fix repartition session * Adding stack trace and passing config kwargs separate due to s3fs bug * fix the parquet reader * pass deltacat_storage_kwargs in repartition_session * addressed comments and extend tests to handle v2 --------- Co-authored-by: Zyiqin-Miranda <79943187+Zyiqin-Miranda@users.noreply.github.com> * Daft Native Reader for Parquet Content Types (#183) * Add pyarrow_to_daft_schema helper * added initial daft read for parquet * integrate io code with daft cast * suggestions * lint-fix EOF for requirements * lint-fixs EOF with pre-commit * bump version to 0.1.18b15 * Fix tuple issue with coerce_int96_timestamp_unit * Fix kwargs and Schema * Lint * using include column names instead of column names * Update to use new Daft Schema.from_pyarrow_schema method * Remove daft requirement from benchmark-requirements.txt * Factor out daft parquet reading code into seperate util file * add timing to daft.read_parquet, address feedback, add unit tests * run pre-commit * switch on columns and add local file * add schema override unit test * add schema override unit test * fix fstrings * thread column names through * downgrade to daft==0.1.12 * add support for partial row group downloads * add `reader_type` to pyarrow utils --------- Co-authored-by: Jay Chia <jaychia94@gmail.com@users.noreply.github.com> * [WIP] Read Iceberg to DeltaCAT Dataset (#131) * add read_iceberg and corresponding test structure * rollback requirements change * prepare for pyiceberg0.4.0 release. Add more tests * add missing requirements * add quick install command * add read_iceberg and corresponding test structure * rollback requirements change * prepare for pyiceberg0.4.0 release. Add more tests * add missing requirements * add quick install command * Fixes for Ray 2.X and add recommendations from review of PR #131. * fix format issue Signed-off-by: Rushan Jiang <rushanj@andrew.cmu.edu> * fix requirements list and setup Signed-off-by: Rushan Jiang <rushanj@andrew.cmu.edu> * fix the partitioning issue reported in the ray-project PR Signed-off-by: Rushan Jiang <rushanj@andrew.cmu.edu> * porting apache/iceberg#7768 to reduce the waiting time and increase stability when running integration test Signed-off-by: Rushan Jiang <rushanj@andrew.cmu.edu> --------- Signed-off-by: Rushan Jiang <rushanj@andrew.cmu.edu> Co-authored-by: Patrick Ames <pdames@amazon.com> * Add workaround for pydantic 2.0 incompatibility, add build & deploy to s3, and add AWS glue job runner. * Fix worker logging on AWS Glue, stop duplicate pip installs of DeltaCAT, linting. * Add regionalization, remove assumptions about high-level errors/responses from Glue. --------- Signed-off-by: Raghavendra M Dani <raghumdani@gmail.com> Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com> Signed-off-by: Rushan Jiang <rushanj@andrew.cmu.edu> Signed-off-by: Patrick Ames <pdames@amazon.com> Co-authored-by: jialin <valiantljk@gmail.com> Co-authored-by: Raghavendra M Dani <draghave@amazon.com> Co-authored-by: Jialin Liu <rootliu@amazon.com> Co-authored-by: Zyiqin-Miranda <79943187+Zyiqin-Miranda@users.noreply.github.com> Co-authored-by: zyiqin <zyiqin@amazon.com> Co-authored-by: pf <19919899+pfaraone@users.noreply.github.com> Co-authored-by: rkenmi <rkenmi@gmail.com> Co-authored-by: Jay Chia <17691182+jaychia@users.noreply.github.com> Co-authored-by: Jay Chia <jaychia94@gmail.com@users.noreply.github.com> Co-authored-by: Kevin Yan <43934572+yankevn@users.noreply.github.com> Co-authored-by: Kevin Yan <kevnya@amazon.com> Co-authored-by: Sammy Sidhu <samster25@users.noreply.github.com> Co-authored-by: Jonas(Rushan) Jiang <jonasjiang.dev@gmail.com>

Zyiqin-Miranda reviewed Aug 4, 2023

View reviewed changes

deltacat/storage/model/table_version.py Show resolved Hide resolved

raghumdani mentioned this pull request Aug 5, 2023

Implementing hash bucketing v2 #178

Merged

rkenmi reviewed Aug 7, 2023

View reviewed changes

deltacat/storage/model/table_version.py Outdated Show resolved Hide resolved

rkenmi approved these changes Aug 7, 2023

View reviewed changes

Zyiqin-Miranda approved these changes Aug 7, 2023

View reviewed changes

raghumdani added 4 commits August 7, 2023 10:45

Adding local deltacat storage module

0a54b3a

Using camel case dict keys for consistency

c13cd10

Support storing data in parquet and utsv bytes

997c3ff

Fix README.md to allow db_file_path arg

06dfc47

raghumdani force-pushed the sqlite_ds branch from 544fd9e to 06dfc47 Compare August 7, 2023 17:45

raghumdani merged commit c7073a1 into main Aug 7, 2023
5 checks passed

raghumdani deleted the sqlite_ds branch August 7, 2023 17:51

pfaraone mentioned this pull request Aug 15, 2023

compact_partition incremental unit test #188

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding local deltacat storage module #175

Adding local deltacat storage module #175

raghumdani commented Aug 4, 2023

Adding local deltacat storage module #175

Adding local deltacat storage module #175

Conversation

raghumdani commented Aug 4, 2023