# Working with Rivulet Datasets
This demo shows:
1. How to create a dataset from a Parquet file using DeltaCAT.
2. How to dynamically modify a dataset schema by adding new columns.
3. How to append new rows and update existing rows without altering the original data files.
4. How to query and read data from the updated dataset efficiently.

In [None]:
import deltacat as dc
import pathlib
import pyarrow as pa
import pyarrow.parquet as pq

### Step 1: Create a simple 3x3 Parquet file using pyarrow

In [None]:
parquet_file_path = pathlib.Path.cwd() / "contacts.parquet"
data = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
}
table = pa.Table.from_pydict(data)
pq.write_table(table, parquet_file_path)
print(f"Created Parquet file at: {parquet_file_path}")

### Step 2: Load the Parquet file into a Dataset

In [None]:
dataset = dc.Dataset.from_parquet(
    name="contacts",
    file_uri=parquet_file_path,
    metadata_uri=".",
    merge_keys="id"
)
print("Loaded dataset from Parquet file.")
dataset.print()

### Step 3: Add two new fields to the Dataset

In [None]:
dataset.add_fields([
    ("id", dc.Datatype.int64()),
    ("email", dc.Datatype.string()),
    ("is_active", dc.Datatype.bool())
], schema_name="updated_schema", merge_keys=["id"])
print("Added 'email' and 'is_active' fields to the dataset schema.")

### Step 4: Append two new records
DeltaCAT does not rewrite the existing Parquet file when new data arrives.
Instead, it stores additional data files alongside the original Parquet file(s),
and these can be joined with the originals on the merge key (`id`).
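
The idea of leaving the original file untouched and joining later writes when reading can be illustrated with a small, library-independent sketch. This is a conceptual model only, not DeltaCAT's implementation: `merge_on_read` and its arguments are hypothetical names, and plain lists of dicts stand in for the Parquet and Feather files.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def merge_on_read(layers: Iterable[List[Record]], merge_key: str) -> List[Record]:
    """Combine record layers by merge key; later layers override earlier ones.

    Hypothetical helper for illustration only; not part of deltacat.
    """
    merged: Dict[Any, Record] = {}
    for layer in layers:
        for record in layer:
            key = record[merge_key]
            # Start from what earlier layers said about this key, then
            # overlay the fields supplied by the current layer.
            merged.setdefault(key, {}).update(record)
    # Return rows ordered by merge key for a stable result.
    return [merged[key] for key in sorted(merged)]


# Stand-in for the original Parquet file (never modified).
base_layer = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
]
# Stand-in for a later write: one partial update and one brand-new row.
update_layer = [
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "name": "Charlie", "age": 35, "email": "charlie@example.com"},
]

rows = merge_on_read([base_layer, update_layer], merge_key="id")
for row in rows:
    print(row)
```

In this model, a row that never received an update (id=1) simply lacks the new column, while `base_layer` itself is left unchanged by the merge.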

In [None]:
dataset_writer = dataset.writer(file_format="feather", schema_name="updated_schema")

# Define new rows that use the expanded schema and write them.
new_rows = [
    {"id": 4, "name": "David",   "age": 40, "email": "david@example.com", "is_active": True},
    {"id": 5, "name": "Eve",     "age": 45, "email": "eve@example.com",   "is_active": False}
]
dataset_writer.write(new_rows)
print("Wrote 2 new rows (records) with expanded schema.")

# Fill in the new columns for existing rows, again without modifying the original Parquet file.
updates_for_existing_rows = [
    {"id": 3, "email": "charlie@example.com", "is_active": True},
    {"id": 2, "email": "bob@example.com",     "is_active": False},
    {"id": 1, "email": "alice@example.com",   "is_active": False}
]
dataset_writer.write(updates_for_existing_rows)
print("Updated existing rows (id=1,2,3) with new columns (email, is_active).")

# Flush the buffered dataset data and metadata to Feather files.
dataset_writer.flush()
print("Flushed all changes to the dataset.")

### Step 5: Read the merged data from the Dataset

In [None]:
print("\nFinal dataset (merged from Parquet + Feather):")
for record in dataset.scan().to_pydict():
    print(record)
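
A quick sanity check on the scan output can confirm that every record ended up with the new columns. The sketch below rests on assumptions: `find_incomplete_records` is a hypothetical helper, not part of deltacat, and `sample_records` is hand-written stand-in data in the shape the loop above prints, so the snippet runs without a dataset.

```python
from typing import Any, Dict, Iterable, List, Sequence


def find_incomplete_records(
    records: Iterable[Dict[str, Any]],
    required_fields: Sequence[str],
    key: str = "id",
) -> List[Any]:
    """Return the key of every record with a missing or None required field.

    Hypothetical helper for illustration only; not part of deltacat.
    """
    incomplete = []
    for record in records:
        # Compare against None explicitly so that False (a valid value
        # for is_active) is not mistaken for a missing field.
        if any(record.get(field) is None for field in required_fields):
            incomplete.append(record.get(key))
    return incomplete


# Stand-in data; the last record is deliberately incomplete and invented
# purely to exercise the check.
sample_records = [
    {"id": 1, "name": "Alice", "age": 25, "email": "alice@example.com", "is_active": False},
    {"id": 4, "name": "David", "age": 40, "email": "david@example.com", "is_active": True},
    {"id": 9, "name": "Placeholder", "age": 50, "email": None, "is_active": None},
]

missing = find_incomplete_records(sample_records, ["email", "is_active"])
print(missing)
```

With a real dataset, the same check could be run on `dataset.scan().to_pydict()`, assuming the scan yields dict-like records as in the loop above.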