# :snowflake: Snowflake-Managed :snowflake: Iceberg Tables
## Creating & Managing Iceberg Tables on AWS Using Snowflake as the Iceberg Catalog
Author: [Prasanna Rajagopal](https://www.linkedin.com/in/prasannarajagopal/)

Created: **April, 2025.** 

## Optional Step:
### To see the embedded slide images in the image cells, create a stage and upload the images to that stage.
### Otherwise, skip this step and the associated image cells.
## Create a NOTEBOOK_IMAGES_STG Stage to Store the Images Displayed in Your Notebook
```
CREATE STAGE NOTEBOOK_IMAGES_STG
	DIRECTORY = ( ENABLE = true ) 
	ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );
```
### Copy the following image files into the stage:
- The image zip file (Images_Snowflake_Managed_Iceberg_Tables_On_AWS.zip) can be [found here](https://github.com/rrprasan/Finance/tree/main/Snowflake/Notebooks/Miscellaneous_Topics/Iceberg_Tables/Snowflake_Managed_Iceberg_Tables_On_AWS)
- Load the images into the following subdirectory in the stage:
```
@NOTEBOOK_IMAGES_STG/Snowflake_Managed_Iceberg_Tables_On_AWS/
```
- Apache_Iceberg_Open_Table_Format.png
- Comparing_Table_Formats_In_Snowflake.png
- Customer_Using_Apache_Iceberg_On_Snowflake.png
- External_Vol_Int_Iceberg_Tables.png
- Goldman_Sachs_Curation_Distribution.png
- Goldman_Sachs_Dynamic_Tables.png
- Snowflake_Lakehouse_Architecture_Polaris_Iceberg.png

In [None]:
CREATE STAGE NOTEBOOK_IMAGES_STG
	DIRECTORY = ( ENABLE = true ) 
	ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );

## :truck: Import Python Packages and Get Active Session to Snowflake :snowflake:

In [None]:
# Snowpark pandas API (modin) and the Snowpark pandas plugin for modin
import modin.pandas as spd
import snowflake.snowpark.modin.plugin
# Streamlit displays the staged images; matplotlib is available for plotting
import streamlit as st
import matplotlib.pyplot as plt

from snowflake.snowpark.context import get_active_session
# Get the active Snowpark session for this notebook
session = get_active_session()

In [None]:
image = session.file.get_stream("@NOTEBOOK_IMAGES_STG/Snowflake_Managed_Iceberg_Tables_On_AWS/Apache_Iceberg_Open_Table_Format.png" , decompress=False).read() 
# Display the image
st.image(image)

In [None]:
image = session.file.get_stream("@NOTEBOOK_IMAGES_STG/Snowflake_Managed_Iceberg_Tables_On_AWS/Comparing_Table_Formats_In_Snowflake.png" , decompress=False).read() 
# Display the image
st.image(image)

In [None]:
image = session.file.get_stream("@NOTEBOOK_IMAGES_STG/Snowflake_Managed_Iceberg_Tables_On_AWS/Snowflake_Lakehouse_Architecture_Polaris_Iceberg.png" , decompress=False).read() 
# Display the image
st.image(image)

In [None]:
image = session.file.get_stream("@NOTEBOOK_IMAGES_STG/Snowflake_Managed_Iceberg_Tables_On_AWS/Customer_Using_Apache_Iceberg_On_Snowflake.png" , decompress=False).read() 
# Display the image
st.image(image)

## Customers Using Apache Iceberg on Snowflake
### Customer Stories
- [Goldman Sachs](https://www.youtube.com/watch?v=b2cnGMJl2iU)
- [Allianz](https://www.youtube.com/watch?v=2y6y_gIkPpc)

## Snowflake With Iceberg Used By Goldman Sachs for Vendor Data

In [None]:
image = session.file.get_stream("@NOTEBOOK_IMAGES_STG/Snowflake_Managed_Iceberg_Tables_On_AWS/Goldman_Sachs_Curation_Distribution.png" , decompress=False).read() 
# Display the image
st.image(image)

## Goldman Sachs (GS) is a Fan & User of [Snowflake Dynamic Tables](https://docs.snowflake.com/en/user-guide/dynamic-tables-intro). 
### Source: GS at [Snowflake Data Cloud Summit](https://www.youtube.com/watch?v=b2cnGMJl2iU)
### Example Notebook: Volume Weighted Average Price Using [Dynamic Table Notebook](https://github.com/rrprasan/Finance/tree/main/Snowflake/Notebooks/Technical_Indicators/VWAP_Using_Dynamic_Tables)

In [None]:
image = session.file.get_stream("@NOTEBOOK_IMAGES_STG/Snowflake_Managed_Iceberg_Tables_On_AWS/Goldman_Sachs_Dynamic_Tables.png" , decompress=False).read() 
# Display the image
st.image(image)

In [None]:
image = session.file.get_stream("@NOTEBOOK_IMAGES_STG/Snowflake_Managed_Iceberg_Tables_On_AWS/External_Vol_Int_Iceberg_Tables.png" , decompress=False).read() 
# Display the image
st.image(image)

## [Prerequisites](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#prerequisites)
Before you configure an external volume, you need the following:

### An S3 storage bucket.

- To use Snowflake as the catalog, **the bucket must be in the same region that hosts your Snowflake account**.
- To use the external volume for **externally managed Iceberg tables**, all of **your table data and metadata files must be located in the bucket**.
- Snowflake **can’t support external volumes with S3 bucket names that contain dots** (for example, my.s3.bucket). S3 doesn’t support SSL for virtual-hosted-style buckets with dots in the name, and Snowflake uses virtual-hosted-style paths and HTTPS to access data in S3.
- To support data recovery, enable versioning for your external cloud storage location.
- Permissions in AWS to create and manage IAM policies and roles. If you aren’t an AWS administrator, ask your AWS administrator to perform these tasks.
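Two of the prerequisites above (no dots in the bucket name, and a bucket region that matches the Snowflake account region when Snowflake is the catalog) can be checked before any AWS or Snowflake object is created. The following is a minimal sketch; the function name `check_bucket_prerequisites` and the way the two regions are passed in are assumptions made for illustration, not part of any Snowflake or AWS API.

```python
from typing import List


def check_bucket_prerequisites(
    bucket_name: str, bucket_region: str, snowflake_region: str
) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # Bucket names that contain dots are not supported for external volumes.
    if "." in bucket_name:
        problems.append(f"bucket name '{bucket_name}' contains a dot")
    # With Snowflake as the catalog, the bucket and the account share a region.
    if bucket_region != snowflake_region:
        problems.append(
            f"bucket region '{bucket_region}' differs from "
            f"Snowflake region '{snowflake_region}'"
        )
    return problems


# Hypothetical bucket names and regions, used only to exercise the checks.
print(check_bucket_prerequisites("my-s3-bucket", "us-west-2", "us-west-2"))
print(check_bucket_prerequisites("my.s3.bucket", "us-east-1", "us-west-2"))
```

The sketch only compares strings; it does not call AWS to confirm that the bucket exists or that versioning is enabled.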

## Steps to Create Snowflake-Managed Iceberg Tables in AWS
### 8 Easy Steps to Using [Apache Iceberg Tables in Snowflake](https://docs.snowflake.com/en/user-guide/tables-iceberg) 
#### - Create an AWS [S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in AWS
#### - Create an AWS [IAM Policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create-console.html) in AWS
#### - Create an AWS [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html)
#### - Create an [External Volume](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-4-create-an-external-volume-in-snowflake) in Snowflake pointing to the AWS Storage Bucket. 
#### - [Describe the External Volume](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-5-retrieve-the-aws-iam-user-for-your-snowflake-account) and Retrieve the AWS IAM User for your Snowflake Account.
#### - [Grant Snowflake's IAM user permissions to access bucket objects](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-6-grant-the-iam-user-permissions-to-access-bucket-objects)
#### - Test the External Volume to verify the privileges on the storage bucket.  
#### - Create Iceberg Tables.


### Step 1: Create a Storage Bucket in AWS S3

### Step 2: [Create an AWS IAM Policy that Grants Access to Your S3 Location](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-1-create-an-iam-policy-that-grants-access-to-your-s3-location)

#### Policies to provide Snowflake with the required permissions to read and write data to your S3 location.
#### Replace ```<my_bucket>``` with your AWS S3 bucket name.  

```JSON
{
   "Version": "2012-10-17",
   "Statement": [
         {
            "Effect": "Allow",
            "Action": [
               "s3:PutObject",
               "s3:GetObject",
               "s3:GetObjectVersion",
               "s3:DeleteObject",
               "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<my_bucket>/*"
         },
         {
            "Effect": "Allow",
            "Action": [
               "s3:ListBucket",
               "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<my_bucket>",
            "Condition": {
               "StringLike": {
                     "s3:prefix": [
                        "*"
                     ]
               }
            }
         }
   ]
}
```
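If you prefer to generate the policy document rather than edit the placeholder by hand, the JSON above can be produced from the bucket name. This is a minimal sketch using only the standard library; `build_s3_access_policy` and the bucket name `my-iceberg-bucket` are hypothetical names chosen for illustration.

```python
import json


def build_s3_access_policy(bucket_name: str) -> dict:
    """Build the IAM policy shown above for the given bucket name."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level permissions apply to every key in the bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                ],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # Bucket-level permissions apply to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
                "Condition": {"StringLike": {"s3:prefix": ["*"]}},
            },
        ],
    }


policy = build_s3_access_policy("my-iceberg-bucket")
print(json.dumps(policy, indent=3))
```

The printed document can be pasted into the JSON editor when creating the IAM policy.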

### Step 3: [Create an AWS IAM Role](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-2-create-an-iam-role)

### Step 4: (Optional) [Grant privileges required for SSE-KMS encryption to the IAM role](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-3-grant-privileges-required-for-sse-kms-encryption-to-the-iam-role-optional)
#### Not required if you use server-side encryption with Amazon S3 managed keys (SSE-S3).

### Step 5: [Create an External Volume in Snowflake](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-4-create-an-external-volume-in-snowflake)
- An external volume is a named, account-level Snowflake object that you use to connect Snowflake to your external cloud storage for Iceberg tables. 
- An external volume stores an identity and access management (IAM) entity for your storage location. 
- Snowflake uses the IAM entity to securely connect to your storage for accessing table data, Iceberg metadata, and manifest files that store the table schema, partitions, and other metadata.
- A single external volume can support one or more Iceberg tables.


#### Replace ```<my_bucket>``` with your AWS S3 bucket name.
```SQL
CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'my-s3-us-west-2'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://<my_bucket>/'
            STORAGE_AWS_ROLE_ARN = '<arn:aws:iam::123456789012:role/myrole>'
            STORAGE_AWS_EXTERNAL_ID = 'iceberg_table_external_id'
         )
      )
      ALLOW_WRITES = TRUE;
```
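When the same statement has to be issued for several buckets or roles, the placeholders can be filled in programmatically. The sketch below renders the statement shown above as a string. The helper names `sql_literal` and `render_create_external_volume` are hypothetical, and the volume name is assumed to be a trusted identifier because it is inserted unquoted.

```python
def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def render_create_external_volume(
    volume_name: str,
    location_name: str,
    bucket: str,
    role_arn: str,
    external_id: str,
) -> str:
    """Render the CREATE EXTERNAL VOLUME statement for one S3 storage location."""
    return (
        f"CREATE OR REPLACE EXTERNAL VOLUME {volume_name}\n"
        "   STORAGE_LOCATIONS =\n"
        "      (\n"
        "         (\n"
        f"            NAME = {sql_literal(location_name)}\n"
        "            STORAGE_PROVIDER = 'S3'\n"
        f"            STORAGE_BASE_URL = {sql_literal('s3://' + bucket + '/')}\n"
        f"            STORAGE_AWS_ROLE_ARN = {sql_literal(role_arn)}\n"
        f"            STORAGE_AWS_EXTERNAL_ID = {sql_literal(external_id)}\n"
        "         )\n"
        "      )\n"
        "      ALLOW_WRITES = TRUE"
    )


# Placeholder values taken from the example above; replace them with your own.
statement = render_create_external_volume(
    "iceberg_external_volume",
    "my-s3-us-west-2",
    "my-bucket",
    "arn:aws:iam::123456789012:role/myrole",
    "iceberg_table_external_id",
)
print(statement)
```

In this notebook the rendered string could be passed to `session.sql(statement).collect()`, assuming the active role is allowed to create external volumes.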

## [Create enough external volumes for your use case](https://docs.snowflake.com/en/user-guide/tables-iceberg-best-practices#create-enough-external-volumes-for-your-use-case)

- Each external volume is associated with a particular active storage location, and a single external volume can support multiple Iceberg tables. 
- However, the number of external volumes you need depends on how you want to store, organize, and secure your table data.
- You can use a single external volume if you want the data and metadata for all of your Snowflake-Iceberg tables in subdirectories under the same storage location (for example, in the same S3 bucket). 
- To configure these directories for Snowflake-managed tables, see Data and metadata directories.
- Alternatively, you can create multiple external volumes to secure various storage locations differently. 
- For example, you might create the following external volumes:
    - A read-only external volume for externally managed Iceberg tables.
    - An external volume configured with read and write access for Snowflake-managed tables.
- You can’t drop or replace an external volume if one or more Iceberg tables are associated with the external volume.

### Step 6: [Retrieve the AWS IAM user for your Snowflake account](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-5-retrieve-the-aws-iam-user-for-your-snowflake-account)
1. Retrieve the ARN for the AWS IAM user that was created automatically for your Snowflake account using the DESCRIBE EXTERNAL VOLUME command. 
    - Specify the name of your external volume.
    - The following example describes an external volume named iceberg_external_volume.
```SQL
DESC EXTERNAL VOLUME iceberg_external_volume;
``` 
2. Record the value for the ```STORAGE_AWS_IAM_USER_ARN``` property, which is the AWS IAM user created for your Snowflake account.
    - For example:
``` arn:aws:iam::123456789001:user/abc1-b-self1234```
3. **Snowflake provisions a single IAM user for your entire Snowflake account.** 
    - All S3 external volumes in your account use that IAM user.



### Step 7: [Grant the IAM user permissions to access bucket objects](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-6-grant-the-iam-user-permissions-to-access-bucket-objects)
#### Trust Policy Document for the AWS IAM Role

Where:

- ```snowflake_user_arn``` is the STORAGE_AWS_IAM_USER_ARN value you recorded in the previous step using the ```DESC EXTERNAL VOLUME``` command.

- ```iceberg_table_external_id``` is your external ID. 
    - If you already specified an external ID when you created the role, and used the same ID to create your external volume, leave the value as-is. 
    - Otherwise, update sts:ExternalId with the value that you recorded.

```JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<snowflake_user_arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<iceberg_table_external_id>"
        }
      }
    }
  ]
}
```
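The trust policy above has only two values that change between accounts, so it can be generated from them. This is a minimal sketch; `build_trust_policy` is a hypothetical helper name, and the ARN used in the example is the sample value from Step 6, not a real IAM user.

```python
import json


def build_trust_policy(snowflake_user_arn: str, external_id: str) -> dict:
    """Build the trust policy that lets the Snowflake IAM user assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"AWS": snowflake_user_arn},
                "Action": "sts:AssumeRole",
                # The external ID must match the one set on the external volume.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


trust_policy = build_trust_policy(
    "arn:aws:iam::123456789001:user/abc1-b-self1234",
    "iceberg_table_external_id",
)
print(json.dumps(trust_policy, indent=2))
```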

### Step 8: [Verify Snowflake's Access to your AWS S3 Storage Bucket](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3#step-7-verify-storage-access) 
```SQL
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_external_volume');
```
#### Sample Output:
```
{
"success":true,
"storageLocationSelectionResult":"PASSED",
"storageLocationName":"my-s3-us-west-2",
"servicePrincipalProperties":
"STORAGE_AWS_IAM_USER_ARN: <Snowflake IAM User>; 
 STORAGE_AWS_EXTERNAL_ID: snowflake_managed_iceberg_table_external_id",
"location":"s3://snowflake-managed-iceberg-tables-aws-s3-bucket/",
"storageAccount":null,
"region":"us-west-2",
"writeResult":"PASSED",
"readResult":"PASSED",
"listResult":"PASSED",
"deleteResult":"PASSED",
"awsRoleArnValidationResult":"PASSED",
"azureGetUserDelegationKeyResult":"SKIPPED"
}
```
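The verification result is a JSON document, so the individual checks can be inspected in code instead of by eye. The sketch below lists every key ending in `Result` whose value is neither `PASSED` nor `SKIPPED`. Treating any other value as a failure is an assumption based on the sample output above, and `failed_checks` is a hypothetical helper name.

```python
import json
from typing import List


def failed_checks(verification_json: str) -> List[str]:
    """Return the names of checks that neither passed nor were skipped."""
    result = json.loads(verification_json)
    return sorted(
        key
        for key, value in result.items()
        if key.endswith("Result") and value not in ("PASSED", "SKIPPED")
    )


# Abbreviated, hand-written sample in the shape of the output above.
sample = json.dumps(
    {
        "success": False,
        "storageLocationSelectionResult": "PASSED",
        "writeResult": "PASSED",
        "readResult": "FAILED",
        "listResult": "PASSED",
        "deleteResult": "PASSED",
        "awsRoleArnValidationResult": "PASSED",
        "azureGetUserDelegationKeyResult": "SKIPPED",
    }
)
print(failed_checks(sample))
```

A non-empty list usually points back to the IAM policy from Step 2 or the trust policy from Step 7.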

In [None]:
CREATE OR REPLACE EXTERNAL VOLUME aws_us_west_2_s3_iceberg_ext_vol
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'my-s3-us-west-2'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://snowflake-managed-iceberg-tables-aws-s3-bucket/'
            STORAGE_AWS_ROLE_ARN = '<AWS ROLE ARN>'
            STORAGE_AWS_EXTERNAL_ID = 'snowflake_managed_iceberg_table_external_id'
         )
      )
      ALLOW_WRITES = TRUE;

In [None]:
DESC EXTERNAL VOLUME aws_us_west_2_s3_iceberg_ext_vol;

Result of the Describe External Volume Command:
```JSON
{
"NAME":"my-s3-us-west-2",
"STORAGE_PROVIDER":"S3",
"STORAGE_BASE_URL":"s3://snowflake-managed-iceberg-tables-aws-s3-bucket/",
"STORAGE_ALLOWED_LOCATIONS":["s3://snowflake-managed-iceberg-tables-aws-s3-bucket/*"],
"STORAGE_REGION":"us-west-2",
"PRIVILEGES_VERIFIED":true,
"STORAGE_AWS_ROLE_ARN":"<AWS ROLE ARN>",
"STORAGE_AWS_IAM_USER_ARN":"<Snowflake AWS User>",
"STORAGE_AWS_EXTERNAL_ID":"snowflake_managed_iceberg_table_external_id",
"ENCRYPTION_TYPE":"NONE",
"ENCRYPTION_KMS_KEY_ID":""
}
```
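The two values needed for the trust policy in Step 7 can be read out of this JSON directly. This is a minimal sketch that assumes the property value shown above has been copied into a Python string; `extract_trust_inputs` is a hypothetical helper name, and the ARN in the example is the sample value from Step 6.

```python
import json


def extract_trust_inputs(storage_location_json: str) -> dict:
    """Pull the IAM user ARN and external ID out of a storage location JSON."""
    props = json.loads(storage_location_json)
    return {
        "snowflake_user_arn": props["STORAGE_AWS_IAM_USER_ARN"],
        "external_id": props["STORAGE_AWS_EXTERNAL_ID"],
    }


# Abbreviated sample in the shape of the DESC EXTERNAL VOLUME result above.
sample_location = json.dumps(
    {
        "NAME": "my-s3-us-west-2",
        "STORAGE_PROVIDER": "S3",
        "STORAGE_AWS_IAM_USER_ARN": "arn:aws:iam::123456789001:user/abc1-b-self1234",
        "STORAGE_AWS_EXTERNAL_ID": "snowflake_managed_iceberg_table_external_id",
    }
)
print(extract_trust_inputs(sample_location))
```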

## Create Iceberg Table

In [None]:
CREATE OR REPLACE ICEBERG TABLE customer12_iceberg_ingestv12 (
  COL1 DOUBLE,
  COL2 VARCHAR,
  COL3 DOUBLE
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'aws_us_west_2_s3_iceberg_ext_vol'
  BASE_LOCATION = 'demodb/equity_research/customer12_iceberg_ingestv12';

In [None]:
INSERT INTO customer12_iceberg_ingestv12 VALUES (1.1, 'Text', 2.2);

## Set a Default Catalog For the Account or Session.  

In [None]:
ALTER ACCOUNT SET CATALOG = 'SNOWFLAKE';

In [None]:
CREATE OR REPLACE ICEBERG TABLE customer15_iceberg15_ingestv15 (
  COL1 DOUBLE,
  COL2 VARCHAR,
  COL3 DOUBLE
)
  EXTERNAL_VOLUME = 'aws_us_west_2_s3_iceberg_ext_vol'
  BASE_LOCATION = 'customer15_iceberg15_ingestv15';

## Unset Default Catalog For the Account

In [None]:
ALTER ACCOUNT UNSET CATALOG;

### The following ```CREATE ICEBERG TABLE``` statement will fail if the ```CATALOG``` is ```UNSET```.
### Fix the ```CREATE ICEBERG TABLE``` statement by adding ```CATALOG = 'SNOWFLAKE'```.

In [None]:
CREATE OR REPLACE ICEBERG TABLE customer16_iceberg_ingestv15 (
  COL1 DOUBLE,
  COL2 VARCHAR,
  COL3 DOUBLE
)
  EXTERNAL_VOLUME = 'aws_us_west_2_s3_iceberg_ext_vol'
  BASE_LOCATION = 'snowflake-managed-iceberg-tables-aws-s3-bucket/';

In [None]:
DESC TABLE customer_iceberg_ingest;

In [None]:
INSERT INTO customer_iceberg_ingest VALUES (1.2, 'Text1', 1.5);

In [None]:
SELECT * FROM customer_iceberg_ingest;

## Set a Default External Volume for a Database. 

In [None]:
ALTER DATABASE my_database_1
  SET EXTERNAL_VOLUME = 'my_s3_vol';

In [None]:
SELECT COUNT(*) FROM customer_iceberg_ingest;

## Load Data Received in PARQUET File into Iceberg Table Managed by Snowflake
### Step 1: Create a File Format
### Step 2: Use the ```COPY INTO <TABLE>``` Statement
### Step 3: Create a Task with the COPY Statement (Step 2) to Ingest PARQUET files on a Schedule.  

## Create a File Format

In [None]:
CREATE OR REPLACE FILE FORMAT my_parquet_file_format
  TYPE = PARQUET
  USE_VECTORIZED_SCANNER = TRUE;

## [COPY INTO <TABLE>](https://docs.snowflake.com/en/sql-reference/sql/copy-into-table)
```SQL 
COPY INTO customer_iceberg_ingest
  FROM @DEMODB.EQUITY_RESEARCH.PARQUET_FILES_INT_STG
  FILE_FORMAT = 'my_parquet_file_format'
  LOAD_MODE = ADD_FILES_COPY
  PURGE = TRUE
  MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;
```

```LOAD_MODE = ADD_FILES_COPY```
- This option is critical for Iceberg tables because it determines how the data is loaded into the table. 
- There are two ```LOAD_MODE``` options:
    - ```ADD_FILES_COPY```
    - ```FULL_INGEST```
- ```ADD_FILES_COPY``` 
    - Snowflake performs a server-side copy of the original Parquet files into the base location of the Iceberg table. 
    - It then registers the files to the table. 
    - This allows for cross-region or cross-cloud ingestion of raw Parquet files into Iceberg tables.
    - This mode adds the copied files to the Iceberg metadata without rewriting the data files. 
    - Because nothing is rewritten, it is generally faster than ```FULL_INGEST``` and can reduce data ingestion cost and time. 
- ```FULL_INGEST``` 
    - Snowflake scans the files and rewrites the Parquet data under the base location of the Iceberg table. 
    - Use this option if you need to transform or convert the data before registering the files to your Iceberg table.

```PURGE = TRUE```
- This option tells Snowflake to delete the files from the stage after they have been successfully loaded into the table.

```MATCH_BY_COLUMN_NAME = CASE_SENSITIVE```
- This option specifies how the columns in the data files are matched to the columns in the target table. 
- ```CASE_SENSITIVE``` means that the column names must match exactly, including the case.
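To make the effect of case-sensitive matching concrete, the sketch below pairs file columns with table columns in both a case-sensitive and a case-insensitive way. It only illustrates the matching rule; it is not how Snowflake implements ```MATCH_BY_COLUMN_NAME```, and the function name `match_columns` and the sample column names are assumptions.

```python
from typing import Dict, List, Optional


def match_columns(
    file_columns: List[str], table_columns: List[str], case_sensitive: bool = True
) -> Dict[str, Optional[str]]:
    """Map each table column to its matching file column, or None if unmatched."""
    if case_sensitive:
        lookup = {name: name for name in file_columns}
        return {col: lookup.get(col) for col in table_columns}
    # Case-insensitive: compare the upper-cased names instead.
    lookup = {name.upper(): name for name in file_columns}
    return {col: lookup.get(col.upper()) for col in table_columns}


file_cols = ["col1", "COL2", "COL3"]   # column names found in a Parquet file
table_cols = ["COL1", "COL2", "COL3"]  # column names of the target table

print(match_columns(file_cols, table_cols, case_sensitive=True))
print(match_columns(file_cols, table_cols, case_sensitive=False))
```

With case-sensitive matching, the lower-case `col1` in the file does not pair with `COL1` in the table, which is the situation to look for when a load leaves a column empty.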

In [None]:
COPY INTO customer_iceberg_ingest
  FROM @DEMODB.EQUITY_RESEARCH.PARQUET_FILES_INT_STG
  FILE_FORMAT = 'my_parquet_file_format'
  LOAD_MODE = ADD_FILES_COPY
  PURGE = TRUE
  MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;

## Create a Task With the COPY Command.
```SQL
CREATE OR REPLACE TASK customer_iceberg_ingest_task
  SCHEDULE = '1 MINUTES'
  WAREHOUSE = 'COMPUTE_WH'
  AS
    COPY INTO customer_iceberg_ingest
      FROM @DEMODB.EQUITY_RESEARCH.PARQUET_FILES_INT_STG
      FILE_FORMAT = 'my_parquet_file_format'
      LOAD_MODE = ADD_FILES_COPY
      PURGE = TRUE
      MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;
```
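A newly created task is suspended, so the schedule above does not start until the task is resumed with ```ALTER TASK ... RESUME```. The sketch below builds that statement for a given task name; `task_state_sql` is a hypothetical helper, and the task name is assumed to be a trusted identifier.

```python
def task_state_sql(task_name: str, resume: bool = True) -> str:
    """Build the ALTER TASK statement that resumes or suspends a task."""
    action = "RESUME" if resume else "SUSPEND"
    return f"ALTER TASK {task_name} {action}"


# Start the scheduled ingestion, and stop it again when it is no longer needed.
print(task_state_sql("customer_iceberg_ingest_task"))
print(task_state_sql("customer_iceberg_ingest_task", resume=False))
```

In this notebook the statement could be run with `session.sql(task_state_sql("customer_iceberg_ingest_task")).collect()`. Suspending the task when the demo is finished avoids running the warehouse every minute.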


In [None]:
CREATE OR REPLACE TASK customer_iceberg_ingest_task
  SCHEDULE = '1 MINUTES'
  WAREHOUSE = 'COMPUTE_WH'
  AS
    COPY INTO customer_iceberg_ingest
      FROM @DEMODB.EQUITY_RESEARCH.PARQUET_FILES_INT_STG
      FILE_FORMAT = 'my_parquet_file_format'
      LOAD_MODE = ADD_FILES_COPY
      PURGE = TRUE
      MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;

In [None]:
SELECT Count(*) FROM customer_iceberg_ingest;

## List the Iceberg Tables Created with an External Volume

In [None]:
SHOW ICEBERG TABLES;

In [None]:
set my_last_query_id = (SELECT QUERY_ID
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    WHERE QUERY_TEXT = 'SHOW ICEBERG TABLES;'
    ORDER BY START_TIME DESC
    LIMIT 1);

In [None]:
SELECT * FROM TABLE(
  RESULT_SCAN(
      $my_last_query_id
    )
  )
  WHERE "external_volume_name" = 'AWS_US_WEST_2_S3_ICEBERG_EXT_VOL';

## [Iceberg Catalog](https://docs.snowflake.com/en/user-guide/tables-iceberg#catalog)
- An Iceberg catalog enables a compute engine to manage and load Iceberg tables. 
- The catalog forms the first architectural layer in the Iceberg table specification and must support:
    - Storing the current metadata pointer for one or more Iceberg tables. A metadata pointer maps a table name to the location of that table’s current metadata file.
    - Performing atomic operations so that you can update the current metadata pointer for a table.
- Snowflake supports different catalog options. For example, you can use Snowflake as the Iceberg catalog, or use a catalog integration to connect Snowflake to an external Iceberg catalog.

[Apache Iceberg Catalog Documentation](https://iceberg.apache.org/terms/#catalog-implementations)
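The two catalog requirements described above (storing a metadata pointer per table, and updating it atomically) can be illustrated with a toy in-memory catalog. This is a conceptual sketch only: it is not how Snowflake or any real Iceberg catalog is implemented, and the class name `InMemoryCatalog`, the table name, and the metadata file names are all hypothetical.

```python
import threading
from typing import Dict, Optional


class InMemoryCatalog:
    """Toy catalog: maps a table name to its current metadata file location."""

    def __init__(self) -> None:
        self._pointers: Dict[str, str] = {}
        self._lock = threading.Lock()

    def current_metadata(self, table: str) -> Optional[str]:
        """Return the current metadata pointer, or None for an unknown table."""
        with self._lock:
            return self._pointers.get(table)

    def commit(self, table: str, expected: Optional[str], new: str) -> bool:
        """Swap the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


catalog = InMemoryCatalog()
print(catalog.commit("demo.tbl", None, "v1.metadata.json"))
print(catalog.commit("demo.tbl", None, "v2.metadata.json"))
print(catalog.commit("demo.tbl", "v1.metadata.json", "v2.metadata.json"))
print(catalog.current_metadata("demo.tbl"))
```

The second commit is rejected because it was prepared against a pointer that is no longer current. A writer in that position has to re-read the pointer and retry.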

## [Parquet Concepts](https://parquet.apache.org/docs/concepts/)

### Row group
- A logical horizontal partitioning of the data into rows. 
- There is no physical structure that is guaranteed for a row group. 
- A row group consists of a column chunk for each column in the dataset.
### Column chunk
- A chunk of the data for a particular column. 
- They live in a particular row group and are guaranteed to be contiguous in the file.
### Page 
- Column chunks are divided up into pages. 
- A page is conceptually an indivisible unit (in terms of compression and encoding). 
- There can be multiple page types which are interleaved in a column chunk.

### Hierarchy
- A **file** consists of one or more row groups. 
    - A **row group** contains **exactly one column chunk per column**. 
        - **Column chunks** contain **one or more pages.**
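The hierarchy above can be written down as a small data model. The sketch below is a conceptual illustration of the containment rules as stated in this section, not a Parquet reader; all class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    num_values: int


@dataclass
class ColumnChunk:
    column: str
    pages: List[Page]


@dataclass
class RowGroup:
    chunks: List[ColumnChunk]


@dataclass
class ParquetFile:
    columns: List[str]
    row_groups: List[RowGroup]

    def validate(self) -> bool:
        """Check the containment rules listed in the hierarchy above."""
        if not self.row_groups:
            return False  # a file consists of one or more row groups
        for group in self.row_groups:
            # Exactly one column chunk per column in every row group.
            if sorted(c.column for c in group.chunks) != sorted(self.columns):
                return False
            # Every column chunk contains one or more pages.
            if any(not c.pages for c in group.chunks):
                return False
        return True


good = ParquetFile(
    columns=["COL1", "COL2"],
    row_groups=[
        RowGroup([ColumnChunk("COL1", [Page(100)]), ColumnChunk("COL2", [Page(100)])])
    ],
)
missing_chunk = ParquetFile(
    columns=["COL1", "COL2"],
    row_groups=[RowGroup([ColumnChunk("COL1", [Page(100)])])],
)
print(good.validate(), missing_chunk.validate())
```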

## Recommended Parquet File Sizes For Apache Iceberg Tables on Snowflake
- Snowflake recommends **16MB files with a single row group.**
- For workloads requiring larger file or row group sizes, Snowflake supports 1GB Parquet files with up to 128MB row groups. 
- Your mileage may vary, so configure and test these sizes against your own workload.  
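The size guidance above can be turned into a quick check for files you plan to load. This is a minimal sketch with several assumptions: 1 MB is taken as 1024 * 1024 bytes, a file at or below 16MB with a single row group is treated as matching the recommendation, and the function name `classify_parquet_layout` and its labels are hypothetical.

```python
from typing import List

MB = 1024 * 1024  # assumption: binary megabytes


def classify_parquet_layout(file_bytes: int, row_group_bytes: List[int]) -> str:
    """Compare a file's size and row group sizes with the guidance above."""
    if file_bytes <= 16 * MB and len(row_group_bytes) == 1:
        return "recommended"
    if file_bytes <= 1024 * MB and all(s <= 128 * MB for s in row_group_bytes):
        return "supported"
    return "outside stated sizes"


print(classify_parquet_layout(16 * MB, [16 * MB]))
print(classify_parquet_layout(512 * MB, [128 * MB] * 4))
print(classify_parquet_layout(2048 * MB, [128 * MB] * 16))
```

The row group sizes have to come from the Parquet file footer. Reading them is left out here to keep the sketch free of third-party dependencies.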