![iceberg-logo](https://www.apache.org/logos/res/iceberg/iceberg.png)

### [Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!](https://tabular.io/blog/docker-spark-and-iceberg/)

In [30]:
from pyiceberg import __version__

__version__

'0.6.0'

# Write support

This notebook demonstrates writing to Iceberg tables using PyIceberg. First, connect to the [catalog](https://iceberg.apache.org/concepts/catalog/#iceberg-catalogs), which keeps track of the tables.

In [31]:
from pyiceberg.catalog import load_catalog

catalog = load_catalog('default')

# Loading data using Arrow

PyArrow loads a Parquet file into memory, and PyIceberg can then write that data to an Iceberg table.

In [32]:
import pyarrow.parquet as pq

df = pq.read_table("/home/iceberg/data/yellow_tripdata_2022-01.parquet")

df

pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
RatecodeID: double
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double
----
VendorID: [[1,1,2,2,2,...,1,2,2,2,2],[2,2,1,1,1,...,1,2,2,2,2],...,[1,1,1,1,1,...,2,1,2,2,2],[2,2,2,2,2,...,2,2,2,2,2]]
tpep_pickup_datetime: [[2022-01-01 00:35:40.000000,2022-01-01 00:33:43.000000,2022-01-01 00:53:21.000000,2022-01-01 00:25:21.000000,2022-01-01 00:36:48.000000,...,2022-01-03 09:30:15.000000,2022-01-03 09:14:58.000000,2022-01-03 09:27:22.000000,2022-01-03 09:41:29.000000,2022-01-03 09:07:37.000000],[2022-01-03 09:45:59.000000,2022-01-03 09:57:16.000000,2022-01-03 09:00:25.000000,2022-01-03 09:34:16.000000,2022-01-03 09:57:47.00

# Create an Iceberg table

Next, create the Iceberg table directly from the schema of the `pyarrow.Table`.

In [33]:
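from pyiceberg.exceptions import NoSuchTableError

table_name = "default.taxi_dataset"

try:
    # Drop the table if it already exists so the notebook can be re-run
    catalog.drop_table(table_name)
except NoSuchTableError:
    # Nothing to drop on the first run
    pass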

table = catalog.create_table(table_name, schema=df.schema)

table

taxi_dataset(
  1: VendorID: optional long,
  2: tpep_pickup_datetime: optional timestamp,
  3: tpep_dropoff_datetime: optional timestamp,
  4: passenger_count: optional double,
  5: trip_distance: optional double,
  6: RatecodeID: optional double,
  7: store_and_fwd_flag: optional string,
  8: PULocationID: optional long,
  9: DOLocationID: optional long,
  10: payment_type: optional long,
  11: fare_amount: optional double,
  12: extra: optional double,
  13: mta_tax: optional double,
  14: tip_amount: optional double,
  15: tolls_amount: optional double,
  16: improvement_surcharge: optional double,
  17: total_amount: optional double,
  18: congestion_surcharge: optional double,
  19: airport_fee: optional double
),
partition by: [],
sort order: [],
snapshot: null
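Comparing the Arrow schema printed earlier with the table definition above shows how the Arrow types were translated: `int64` became `long`, `timestamp[us]` became `timestamp`, and every column is optional. The sketch below reproduces that correspondence for the first seven columns. It is a lookup table built from the two outputs in this notebook, not PyIceberg's conversion code, and `ARROW_TO_ICEBERG` and `describe_iceberg_schema` are hypothetical names.

```python
# Arrow-to-Iceberg type names as they appear in the two outputs above.
# Illustrative only: this covers just the types used by the taxi dataset.
ARROW_TO_ICEBERG = {
    "int64": "long",
    "double": "double",
    "string": "string",
    "timestamp[us]": "timestamp",
}


def describe_iceberg_schema(arrow_fields):
    """Render (name, arrow_type) pairs the way the table repr lists them."""
    lines = []
    # Field IDs are assigned sequentially, starting at 1.
    for field_id, (name, arrow_type) in enumerate(arrow_fields, start=1):
        iceberg_type = ARROW_TO_ICEBERG[arrow_type]
        lines.append(f"{field_id}: {name}: optional {iceberg_type}")
    return lines


# The first seven columns of the taxi dataset, in order.
sample = [
    ("VendorID", "int64"),
    ("tpep_pickup_datetime", "timestamp[us]"),
    ("tpep_dropoff_datetime", "timestamp[us]"),
    ("passenger_count", "double"),
    ("trip_distance", "double"),
    ("RatecodeID", "double"),
    ("store_and_fwd_flag", "string"),
]

for line in describe_iceberg_schema(sample):
    print(line)
```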

# Write the data

Let's append the data to the table. Because the table is still empty, appending and overwriting give the same result. Afterwards, we query the table to confirm that the data is there.
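To make the difference between the two operations concrete, here is a toy model in plain Python. `ToyTable` is a hypothetical list-backed stand-in, not a PyIceberg class, and it assumes that an overwrite replaces the full contents of the table, as it does in this notebook.

```python
class ToyTable:
    """A list-backed stand-in for a table; hypothetical, for illustration only."""

    def __init__(self):
        self.rows = []

    def append(self, new_rows):
        # Keep the existing rows and add the new ones.
        self.rows = self.rows + list(new_rows)

    def overwrite(self, new_rows):
        # Discard the existing rows and keep only the new ones.
        self.rows = list(new_rows)


january = [{"VendorID": 1}, {"VendorID": 2}]
february = [{"VendorID": 2}]

appended, overwritten = ToyTable(), ToyTable()

# On an empty table both operations leave the same rows behind.
appended.append(january)
overwritten.overwrite(january)
same_when_empty = appended.rows == overwritten.rows

# On a non-empty table they diverge.
appended.append(february)
overwritten.overwrite(february)

print(same_when_empty, len(appended.rows), len(overwritten.rows))
```

The two tables only start to differ with the second write: the appended one holds both months, while the overwritten one holds only the latest.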

In [34]:
table.append(df)  # or table.overwrite(df)

assert len(table.scan().to_arrow()) == len(df)

table.scan().to_arrow()

pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
RatecodeID: double
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double
----
VendorID: [[1,1,2,2,2,...,2,2,2,2,2]]
tpep_pickup_datetime: [[2022-01-01 00:35:40.000000,2022-01-01 00:33:43.000000,2022-01-01 00:53:21.000000,2022-01-01 00:25:21.000000,2022-01-01 00:36:48.000000,...,2022-01-31 23:36:53.000000,2022-01-31 23:44:22.000000,2022-01-31 23:39:00.000000,2022-01-31 23:36:42.000000,2022-01-31 23:46:00.000000]]
tpep_dropoff_datetime: [[2022-01-01 00:53:29.000000,2022-01-01 00:42:07.000000,2022-01-01 01:02:19.000000,2022-01-01 00:35:23.000000,2022-01-01 01:14:20.000000,...,2022-01-31 23:42:51.000000,2022-01-31 23:55:01.0

In [35]:
str(table.current_snapshot())

'Operation.APPEND: id=5382422337398553932, schema_id=0'

# Append data

Let's append another month of data to the table.

In [36]:
df = pq.read_table("/home/iceberg/data/yellow_tripdata_2022-02.parquet")
table.append(df)

In [37]:
str(table.current_snapshot())

'Operation.APPEND: id=4731738526151278430, parent_id=5382422337398553932, schema_id=0'

# Feature generation

Suppose we want to train a model to determine which features contribute to the tip amount. `tip_per_mile` is a good target to train the model on. The table schema does not contain this column yet, so writing the data fails until we evolve the schema.
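Because `tip_per_mile` divides by `trip_distance`, any trip that records a distance of zero needs some thought before the column is used as a training target. The sketch below uses plain Python rather than `pyarrow.compute`, and returning `None` for such trips is an assumption made for illustration, not what this notebook does.

```python
def tip_per_mile(tip_amount, trip_distance):
    """Return tip divided by distance, or None when the ratio is undefined."""
    # Treat missing values and zero distances as "no usable ratio".
    if tip_amount is None or trip_distance is None or trip_distance == 0:
        return None
    return tip_amount / trip_distance


# (tip_amount, trip_distance) pairs; made-up values for illustration.
trips = [(3.0, 1.5), (0.0, 2.0), (2.5, 0.0), (None, 4.0)]
features = [tip_per_mile(tip, dist) for tip, dist in trips]
print(features)
```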

In [38]:
import pyarrow.compute as pc

df = table.scan().to_arrow()
df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"]))

try:
    table.overwrite(df)
except ValueError as e:
    print(f"Error: {e}")

Error: Table schema does not match schema used to create file: 
table:
VendorID: int64
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
RatecodeID: double
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double
tip_per_mile: double vs. 
file:
VendorID: int64
  -- field metadata --
  PARQUET:field_id: '1'
tpep_pickup_datetime: timestamp[us]
  -- field metadata --
  PARQUET:field_id: '2'
tpep_dropoff_datetime: timestamp[us]
  -- field metadata --
  PARQUET:field_id: '3'
passenger_count: double
  -- field metadata --
  PARQUET:field_id: '4'
trip_distance: double
  -- field metadata --
  PARQUET:field_id: '5'
RatecodeID: double
  -- field metadata --
  PARQUET:field_id: '6'
store_and_fwd_flag: string
  -

In [39]:
with table.update_schema() as upd:
    upd.union_by_name(df.schema)

print(str(table.schema()))

table {
  1: VendorID: optional long
  2: tpep_pickup_datetime: optional timestamp
  3: tpep_dropoff_datetime: optional timestamp
  4: passenger_count: optional double
  5: trip_distance: optional double
  6: RatecodeID: optional double
  7: store_and_fwd_flag: optional string
  8: PULocationID: optional long
  9: DOLocationID: optional long
  10: payment_type: optional long
  11: fare_amount: optional double
  12: extra: optional double
  13: mta_tax: optional double
  14: tip_amount: optional double
  15: tolls_amount: optional double
  16: improvement_surcharge: optional double
  17: total_amount: optional double
  18: congestion_surcharge: optional double
  19: airport_fee: optional double
  20: tip_per_mile: optional double
}
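The output above shows that `tip_per_mile` was added as field 20 while fields 1 to 19 kept their IDs. A simplified model of that behaviour, matching columns by name and giving new ones the next free ID, might look like the sketch below. It is not PyIceberg's implementation: the `union_by_name` function here is a hypothetical helper that ignores types, nesting, and required versus optional fields.

```python
def union_by_name(table_fields, incoming_names):
    """Return table_fields plus any incoming names it lacks, with fresh IDs.

    table_fields is a list of (field_id, name) pairs.
    """
    existing = {name for _, name in table_fields}
    # New columns continue after the highest ID already in use.
    next_id = max((field_id for field_id, _ in table_fields), default=0) + 1
    merged = list(table_fields)
    for name in incoming_names:
        if name not in existing:
            merged.append((next_id, name))
            existing.add(name)
            next_id += 1
    return merged


# The last three columns of the table, plus the new feature column.
table_fields = [(17, "total_amount"), (18, "congestion_surcharge"), (19, "airport_fee")]
incoming = ["total_amount", "congestion_surcharge", "airport_fee", "tip_per_mile"]

merged = union_by_name(table_fields, incoming)
print(merged[-1])
```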


In [40]:
table.overwrite(df)

table

taxi_dataset(
  1: VendorID: optional long,
  2: tpep_pickup_datetime: optional timestamp,
  3: tpep_dropoff_datetime: optional timestamp,
  4: passenger_count: optional double,
  5: trip_distance: optional double,
  6: RatecodeID: optional double,
  7: store_and_fwd_flag: optional string,
  8: PULocationID: optional long,
  9: DOLocationID: optional long,
  10: payment_type: optional long,
  11: fare_amount: optional double,
  12: extra: optional double,
  13: mta_tax: optional double,
  14: tip_amount: optional double,
  15: tolls_amount: optional double,
  16: improvement_surcharge: optional double,
  17: total_amount: optional double,
  18: congestion_surcharge: optional double,
  19: airport_fee: optional double,
  20: tip_per_mile: optional double
),
partition by: [],
sort order: [],
snapshot: Operation.OVERWRITE: id=7739515532572230599, parent_id=4731738526151278430, schema_id=1
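
The snapshots printed in this notebook form a chain: each `parent_id` points at the snapshot that was current before the write. The sketch below rebuilds that chain from the three IDs shown in the outputs. The `Snapshot` dataclass and `lineage` function are hypothetical helpers written for this example, not PyIceberg classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str
    parent_id: Optional[int] = None


# IDs and operations copied from the outputs printed in this notebook.
snapshots = [
    Snapshot(5382422337398553932, "append"),
    Snapshot(4731738526151278430, "append", parent_id=5382422337398553932),
    Snapshot(7739515532572230599, "overwrite", parent_id=4731738526151278430),
]


def lineage(snapshots, current_id):
    """Walk parent_id links from current_id back to the first snapshot."""
    by_id = {snap.snapshot_id: snap for snap in snapshots}
    chain = []
    next_id = current_id
    while next_id is not None:
        snap = by_id[next_id]
        chain.append(snap)
        next_id = snap.parent_id
    return chain


# Newest snapshot first, oldest last.
for snap in lineage(snapshots, 7739515532572230599):
    print(snap.operation, snap.snapshot_id)
```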