# :snowflake: Snowflake-Managed :snowflake: Iceberg Tables
## Creating & Managing Iceberg Tables on GCP
Author: [Prasanna Rajagopal](https://www.linkedin.com/in/prasannarajagopal/)

Created: **April, 2025.** 

## Steps to Create Snowflake-Managed Iceberg Tables in GCP
### 7 Easy Steps to Using [Apache Iceberg Tables in Snowflake](https://docs.snowflake.com/en/user-guide/tables-iceberg) 
#### - Create an IAM Role in GCP
#### - Create a Bucket in [Google Cloud Storage (GCS)](https://cloud.google.com/storage?hl=en)
#### - Create an [External Volume](https://docs.snowflake.com/en/user-guide/tables-iceberg#external-volume) in Snowflake pointing to the GCS bucket. 
#### - [Describe the External Volume](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-gcs#step-2-retrieve-the-gcs-service-account-for-your-snowflake-account) and Retrieve the GCP Service Account information for your Snowflake Account.
#### - [Assign the Snowflake Service Account](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-gcs#assign-the-custom-role-to-the-gcs-service-account) as a Principal to the storage bucket and the role in GCP.
#### - Test the External Volume to verify the privileges on the storage bucket.  
#### - Create Iceberg Tables.

## Create an IAM Role in GCP
### The role should have the following minimum privileges.  
```
storage.buckets.get
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.list 
```
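
To keep this permission list reproducible, it can be written to a role definition file. The sketch below is a minimal example that assumes the role-definition layout (`title`, `description`, `stage`, `includedPermissions`) accepted by `gcloud iam roles create --file`. The role title and description are hypothetical placeholders.

```python
import json

# Minimum permissions listed above for the Snowflake external volume.
REQUIRED_PERMISSIONS = [
    "storage.buckets.get",
    "storage.objects.create",
    "storage.objects.delete",
    "storage.objects.get",
    "storage.objects.list",
]


def build_role_definition(title, description):
    """Return a custom-role definition as a dict (assumed layout, see lead-in)."""
    return {
        "title": title,
        "description": description,
        "stage": "GA",
        "includedPermissions": sorted(REQUIRED_PERMISSIONS),
    }


# Hypothetical title and description; replace with your own naming.
role = build_role_definition(
    "Snowflake Iceberg Access",
    "Minimum GCS privileges for a Snowflake external volume",
)
print(json.dumps(role, indent=2))
```

The printed JSON can be saved to a file and passed to the `--file` option when creating the custom role.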

## Create the GCS Bucket
### Create the GCS bucket in the same region as your Snowflake account.  
### Enable Encryption
#### - Google-Managed Encryption Keys

## [Create the Snowflake External Volume](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume)
- An external volume is a named, account-level Snowflake object that you use to connect Snowflake to your external cloud storage for Iceberg tables. 
- An external volume stores an identity and access management (IAM) entity for your storage location. 
- Snowflake uses the IAM entity to securely connect to your storage for accessing table data, Iceberg metadata, and manifest files that store the table schema, partitions, and other metadata.

```SQL
CREATE EXTERNAL VOLUME iceberg_test_ext_vol
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'ext_us_east4'
        STORAGE_PROVIDER = 'GCS'
        STORAGE_BASE_URL = 'gcs://newgcpsnowbucket1/'
      )
    );
```

## [Retrieve The GCS Service Account For Your Snowflake Account](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-gcs#step-2-retrieve-the-gcs-service-account-for-your-snowflake-account)
## Describe the Snowflake External Volume
### Output from the describe command
```JSON
{
"NAME":"ext_us_east4",
"STORAGE_PROVIDER":"GCS",
"STORAGE_BASE_URL":"gcs://newgcpsnowbucket1/",
"STORAGE_ALLOWED_LOCATIONS":["gcs://newgcpsnowbucket1/*"],
"STORAGE_REGION":"US-EAST4",
"PRIVILEGES_VERIFIED":true,
"STORAGE_GCP_SERVICE_ACCOUNT":"service-account-id@project1-123456.iam.gserviceaccount.com",
"ENCRYPTION_TYPE":"NONE","ENCRYPTION_KMS_KEY_ID":""
}
```
#### Record the value of the STORAGE_GCP_SERVICE_ACCOUNT property in the output (for example, service-account-id@project1-123456.iam.gserviceaccount.com).
#### Snowflake provisions a single GCS service account for your entire Snowflake account. All GCS external volumes use that service account.
### Assign the Service Account to the Bucket in the Google Cloud Console
#### In the Google Cloud console, open Cloud Storage » Buckets.
#### Filter the list of buckets, and select the bucket that you specified when you created the external volume.
#### Select Permissions » View by principals, then select Grant access.
#### Under Add principals, paste the service account name from the DESC EXTERNAL VOLUME output.
#### Under Assign roles, select the custom IAM role that you created previously, then select Save.
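
The service account can also be extracted from the describe output with a script instead of copied by hand. The sketch below is a minimal example that parses the sample JSON shown earlier. The `serviceAccount:` prefix is the member format Google Cloud IAM uses for service accounts, and the function name is a hypothetical helper.

```python
import json

# Sample storage-location JSON copied from the DESC EXTERNAL VOLUME output above.
storage_location_json = """
{
"NAME":"ext_us_east4",
"STORAGE_PROVIDER":"GCS",
"STORAGE_BASE_URL":"gcs://newgcpsnowbucket1/",
"STORAGE_ALLOWED_LOCATIONS":["gcs://newgcpsnowbucket1/*"],
"STORAGE_REGION":"US-EAST4",
"PRIVILEGES_VERIFIED":true,
"STORAGE_GCP_SERVICE_ACCOUNT":"service-account-id@project1-123456.iam.gserviceaccount.com",
"ENCRYPTION_TYPE":"NONE","ENCRYPTION_KMS_KEY_ID":""
}
"""


def extract_service_account(property_value):
    """Return the STORAGE_GCP_SERVICE_ACCOUNT value from the JSON text."""
    location = json.loads(property_value)
    return location["STORAGE_GCP_SERVICE_ACCOUNT"]


service_account = extract_service_account(storage_location_json)
# IAM refers to a service account principal as "serviceAccount:<email>".
member = f"serviceAccount:{service_account}"
print(member)
```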




```SQL
DESC EXTERNAL VOLUME iceberg_test_ext_vol;
```

## Test the External Volume
- `SYSTEM$VERIFY_EXTERNAL_VOLUME` verifies the configuration of the specified external volume.

For external volumes with write access, Snowflake attempts the following additional operations to verify the configuration:

- Write a test file.

- Read the test file.

- List the files in the storage location.

- Delete the test file.
```SQL
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_test_ext_vol');
```
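
When verification fails, it helps to know which permission each step depends on. The sketch below is a simplified model, not Snowflake's implementation. It assumes each step maps to a single GCS permission from the custom role, and reports which steps a given role could not complete.

```python
# Each verification step paired with the GCS permission it relies on
# (an assumed one-to-one mapping, for illustration only).
VERIFICATION_STEPS = [
    ("write a test file", "storage.objects.create"),
    ("read the test file", "storage.objects.get"),
    ("list the files", "storage.objects.list"),
    ("delete the test file", "storage.objects.delete"),
]


def failing_steps(granted_permissions):
    """Return the verification steps that lack their required permission."""
    granted = set(granted_permissions)
    return [step for step, perm in VERIFICATION_STEPS if perm not in granted]


# Hypothetical role that was created without write or delete privileges.
read_only_role = [
    "storage.buckets.get",
    "storage.objects.get",
    "storage.objects.list",
]
print(failing_steps(read_only_role))
```

With the read-only role above, the write and delete steps are reported as failing, which points to the two permissions missing from the role.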

## Create the Iceberg Table in Snowflake

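```SQL
-- BASE_LOCATION is the directory under the external volume where Snowflake writes the table's data and metadata.
CREATE ICEBERG TABLE equity_info_itbl (SYMBOL VARCHAR, COMPANY_NAME VARCHAR)
   CATALOG = 'SNOWFLAKE'
   EXTERNAL_VOLUME = 'iceberg_test_ext_vol'
   BASE_LOCATION = 'equity_info_itbl';
```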

```SQL
CREATE ICEBERG TABLE equity_info_itbl2 (SYMBOL VARCHAR, COMPANY_NAME VARCHAR)
   CATALOG = 'SNOWFLAKE'
   EXTERNAL_VOLUME = 'iceberg_test_ext_vol'
   BASE_LOCATION = 'equity_info_itbl2';
```

```SQL
-- Snowflake creates the metadata directory and its files under BASE_LOCATION; no metadata file name is specified.
CREATE ICEBERG TABLE equity_prices_itbl (SYMBOL VARCHAR, COMPANY_NAME VARCHAR)
   CATALOG = 'SNOWFLAKE'
   EXTERNAL_VOLUME = 'iceberg_test_ext_vol'
   BASE_LOCATION = 'equity_prices_itbl';
```

```SQL
DESC TABLE equity_info_itbl;
```

```SQL
SHOW TABLES LIKE 'equity_info_itbl';
```

```SQL
INSERT INTO equity_info_itbl VALUES ('AAPL', 'APPLE Inc.');
```

```SQL
SELECT * FROM equity_info_itbl;
```

```SQL
INSERT INTO equity_info_itbl VALUES ('WMT', 'Walmart Inc.');
```

```SQL
INSERT INTO equity_info_itbl VALUES ('TGT', 'TARGET CORPORATION');
```

## Load Data Received in PARQUET File into Iceberg Table Managed by Snowflake
### Step 1: Create a File Format
### Step 2: Use the `COPY INTO <TABLE>` Statement
### Step 3: Create a Task with the COPY Statement (Step 2) to Ingest PARQUET files on a Schedule.  

## Create a File Format

```SQL
CREATE OR REPLACE FILE FORMAT my_parquet_file_format
  TYPE = PARQUET
  USE_VECTORIZED_SCANNER = TRUE;
```

## [`COPY INTO <TABLE>`](https://docs.snowflake.com/en/sql-reference/sql/copy-into-table)
```SQL 
COPY INTO customer_iceberg_ingest
  FROM @DEMODB.EQUITY_RESEARCH.PARQUET_FILES_INT_STG
  FILE_FORMAT = 'my_parquet_file_format'
  LOAD_MODE = ADD_FILES_COPY
  PURGE = TRUE
  MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;
```

```LOAD_MODE = ADD_FILES_COPY```
- ```LOAD_MODE``` determines how Snowflake loads the staged Parquet files into the Iceberg table, which makes it the key option for Iceberg ingestion. 
- Two ```LOAD_MODE``` Options
    - ```ADD_FILES_COPY```
    - ```FULL_INGEST```
- ```ADD_FILES_COPY``` 
    - Snowflake performs a server-side copy of the original Parquet files into the base location of the Iceberg table. 
    - It then registers the files to the table. 
    - This allows for cross-region or cross-cloud ingestion of raw Parquet files into Iceberg tables.
    - This mode adds the files to the Iceberg metadata without rewriting the data files. 
    - Because nothing is rewritten, it is generally faster than ```FULL_INGEST``` and can reduce ingestion cost and time. 
- ```FULL_INGEST``` 
    - Snowflake scans the files and rewrites the Parquet data under the base location of the Iceberg table. 
    - Use this option if you need to transform or convert the data before registering the files to your Iceberg table.
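
The difference between the two load modes can be shown with a toy model. The sketch below is not how Snowflake implements loading. It represents staged files as in-memory lists of rows, and the file names and the `data_0.parquet` output name are hypothetical. It only contrasts copying files unchanged with scanning and rewriting rows.

```python
def add_files_copy(staged_files, base_location):
    """Copy each staged file unchanged into the base location and register it."""
    return {f"{base_location}/{name}": list(rows) for name, rows in staged_files.items()}


def full_ingest(staged_files, base_location, transform=lambda row: row):
    """Scan every row, apply an optional transform, and rewrite the data as a new file."""
    rewritten = [transform(row) for rows in staged_files.values() for row in rows]
    return {f"{base_location}/data_0.parquet": rewritten}


# Two hypothetical staged Parquet files, each modeled as a list of rows.
staged = {
    "part-1.parquet": [("AAPL", "Apple Inc.")],
    "part-2.parquet": [("WMT", "Walmart Inc.")],
}

copied = add_files_copy(staged, "equity_info_itbl/data")
ingested = full_ingest(
    staged,
    "equity_info_itbl/data",
    transform=lambda row: (row[0], row[1].upper()),
)

print(sorted(copied))  # original file names are preserved
print(ingested)        # rows were transformed and written to a new file
```

In this model, `add_files_copy` keeps the original files and row contents, while `full_ingest` is the only path on which a transformation can be applied.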

```PURGE = TRUE```
- This option tells Snowflake to delete the files from the stage after they have been successfully loaded into the table.

```MATCH_BY_COLUMN_NAME = CASE_SENSITIVE```
- This option specifies how the columns in the data files are matched to the columns in the target table. 
- ```CASE_SENSITIVE``` means that the column names must match exactly, including the case.
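
The effect of case-sensitive matching can be previewed before running a load. The sketch below is a simplified model of name matching only, not Snowflake's loader. `match_columns` is a hypothetical helper that compares `CASE_SENSITIVE` with the `CASE_INSENSITIVE` setting of the same option.

```python
def match_columns(file_columns, table_columns, mode="CASE_SENSITIVE"):
    """Return the file columns that would match a table column under the given mode."""
    if mode == "CASE_SENSITIVE":
        table_names = set(table_columns)
        return [col for col in file_columns if col in table_names]
    if mode == "CASE_INSENSITIVE":
        table_names = {col.upper() for col in table_columns}
        return [col for col in file_columns if col.upper() in table_names]
    raise ValueError(f"unsupported mode: {mode}")


table_columns = ["SYMBOL", "COMPANY_NAME"]
# Hypothetical Parquet file whose second column is written in lowercase.
file_columns = ["SYMBOL", "company_name"]

print(match_columns(file_columns, table_columns))
print(match_columns(file_columns, table_columns, mode="CASE_INSENSITIVE"))
```

Under `CASE_SENSITIVE`, only `SYMBOL` matches in this example, so the lowercase `company_name` column would not map to `COMPANY_NAME`.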

```SQL
COPY INTO demodb.test.customer_iceberg_ingest_gcp
  FROM @PARQUET_FILES_INT_STG
  FILE_FORMAT = 'my_parquet_file_format'
  LOAD_MODE = ADD_FILES_COPY
  PURGE = TRUE
  MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;
```

## Create a Task With the COPY Command.
```SQL
CREATE OR REPLACE TASK customer_iceberg_ingest_task
  SCHEDULE = '1 MINUTES'
  WAREHOUSE = 'COMPUTE_WH'
  AS
    COPY INTO customer_iceberg_ingest
      FROM @DEMODB.EQUITY_RESEARCH.PARQUET_FILES_INT_STG
      FILE_FORMAT = 'my_parquet_file_format'
      LOAD_MODE = ADD_FILES_COPY
      PURGE = TRUE
      MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;
```
- Snowflake creates a task in a suspended state. Run `ALTER TASK customer_iceberg_ingest_task RESUME;` to start the schedule.



## SYSTEM$GET_ICEBERG_TABLE_INFORMATION
- Returns the location of the root metadata file and status of the latest snapshot for an Apache Iceberg table.

The SYSTEM$GET_ICEBERG_TABLE_INFORMATION function works differently according to table type:

- For an Iceberg table that uses Snowflake as the catalog, calling the function generates metadata for data manipulation language (DML) operations or other table updates that have occurred since Snowflake last generated metadata for the table.

- If there are no updates, the function returns the location of the latest metadata file, but does not generate new metadata.

- For an Iceberg table that is not managed by Snowflake, the function returns information about the latest refreshed snapshot.



```SQL
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('DEMODB.EQUITY_RESEARCH.equity_info_itbl');
```

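### `ALTER ICEBERG TABLE ... REFRESH` Works Only for Externally Managed Iceberg Tables

```SQL
-- Expected to fail here: equity_info_itbl is Snowflake-managed, and REFRESH applies only to externally managed Iceberg tables.
ALTER ICEBERG TABLE DEMODB.EQUITY_RESEARCH.equity_info_itbl REFRESH;
```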

## [Use Row-level Deletes](https://docs.snowflake.com/en/user-guide/tables-iceberg-manage#use-row-level-deletes)
- Iceberg provides two modes for configuring how compute engines handle row-level operations for externally managed tables. 
- Snowflake supports both of these modes.
### [Copy-on-write vs. merge-on-read](https://docs.snowflake.com/en/user-guide/tables-iceberg-manage#copy-on-write-vs-merge-on-read)
#### Copy-on-write
- **Copy-on-write is the default** in Snowflake
- This mode prioritizes read time and impacts write speed.
- When you perform an update, delete, or merge operation, your **compute engine rewrites the entire affected Parquet data file**. 
- This can result in **slow writes**, especially if you have **large data files**, but **doesn’t impact read time**.
#### Merge-on-read
- This mode **prioritizes write speed** and **slightly impacts read time**.
- When you perform an update, delete, or merge operation, your **compute engine creates a delete file that contains information about only the changed rows**.
- When you read from a table, your query engine merges delete files with data files. Merging can increase read time. 
- However, you can optimize read performance by scheduling regular compaction and table maintenance.
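
The trade-off between the two modes can be seen in a toy model of a row-level delete. The sketch below is not an Iceberg implementation. It represents data files as in-memory lists, uses hypothetical file names, and records deletes as (file, row position) pairs in the style of position deletes.

```python
def delete_copy_on_write(data_files, predicate):
    """Rewrite every data file that holds a matching row; other files are reused."""
    new_files = {}
    rewritten = 0
    for name, rows in data_files.items():
        if any(predicate(row) for row in rows):
            new_files[name + ".rewritten"] = [row for row in rows if not predicate(row)]
            rewritten += 1
        else:
            new_files[name] = rows
    return new_files, rewritten


def delete_merge_on_read(data_files, predicate):
    """Leave data files untouched and return a delete file of (file, position) pairs."""
    return [
        (name, pos)
        for name, rows in data_files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    ]


def read_merge_on_read(data_files, delete_file):
    """Merge the delete file with the data files at read time."""
    deleted = set(delete_file)
    return [
        row
        for name, rows in data_files.items()
        for pos, row in enumerate(rows)
        if (name, pos) not in deleted
    ]


data_files = {
    "f1.parquet": [("AAPL", "APPLE Inc."), ("WMT", "Walmart Inc.")],
    "f2.parquet": [("TGT", "TARGET CORPORATION")],
}


def is_wmt(row):
    return row[0] == "WMT"


cow_files, files_rewritten = delete_copy_on_write(data_files, is_wmt)
delete_file = delete_merge_on_read(data_files, is_wmt)

print(files_rewritten)  # copy-on-write pays at write time: one whole file is rewritten
print(delete_file)      # merge-on-read writes only the deleted row's position
print(read_merge_on_read(data_files, delete_file))  # readers apply the delete file
```

In this model, copy-on-write rewrites the whole of `f1.parquet` to remove one row, while merge-on-read writes a single entry and moves the filtering work to every later read.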

```SQL
DELETE FROM equity_info_itbl WHERE SYMBOL = 'WMT';
```