# Securing access to External Tables / Files with Unity Catalog

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/external/uc-external-location-global.png?raw=true" style="float:right; margin-left:10px" width="600"/>

By default, Unity Catalog creates managed tables in your primary storage, providing secure table access for all your users.

In addition to these managed tables, you can manage access to external tables and files located in other cloud storage (S3/ADLS/GCS).

This gives you full data governance: you can store your main tables in the managed catalog/storage while securing access to specific cloud storage locations.


In [0]:
# TODO: replace with the URL of the bucket you want to use for your external location:
# s3://demonhid15-rootbucket-sm/external_vol/
external_bucket_url = "s3://demonhid15-rootbucket-sm/external_source_ext/"
dbutils.widgets.text("external_bucket_url", external_bucket_url)


## Working with External Locations

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/external/uc-external-location.png?raw=true" style="float:right; margin-left:10px" width="800"/>


Accessing external cloud storage is easily done using `External locations`.

This can be done in 3 simple steps:


1. First, create a Storage Credential. It contains the IAM role/SP required to access your cloud storage.
1. Create an External Location using your Storage Credential. It can be any cloud location (including a subfolder).
1. Finally, grant your users permissions on this External Location.

## 1/ Create the STORAGE CREDENTIAL

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/external/uc-external-location-1.png?raw=true" style="float:right; margin-left:10px" width="700px"/>

The first step is to create the `STORAGE CREDENTIAL`.

To do that, we'll use the Databricks Unity Catalog UI:

1. Open the Data Explorer in DBSQL
1. Select the "Storage Credential" menu
1. Click on "Create Credential"
1. Fill in your credential information: the name and the IAM role you will be using

Because this step requires admin privileges, the credential has already been created for you.


<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/external/uc-external-location-cred.png?raw=true" width="400"/>

In [0]:
%sql
-- For our demo, let's make sure all users can alter this storage credential:
-- ALTER STORAGE CREDENTIAL `field_demos_credential`  OWNER TO `account users`;

In [0]:
%sql
SHOW STORAGE CREDENTIALS 

In [0]:
%sql
-- DESCRIBE STORAGE CREDENTIAL `field_demos_credential`
DESCRIBE STORAGE CREDENTIAL `uc-upgrade-cred`

## 2/ Create the EXTERNAL LOCATION

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/external/uc-external-location-2.png?raw=true" style="float:right; margin-left:10px" width="700px"/>

We'll then create our `EXTERNAL LOCATION` using the following path:<br/>
`s3a://databricks-e2demofieldengwest/external_location/`

Note that you need to be an Account Admin to do that; it will fail with a permission error if you are not. But don't worry, the external location has already been created for you.

You can also update your location using SQL operations:
<br/>
```ALTER EXTERNAL LOCATION `xxxx`  RENAME TO `yyyy`; ```<br/>
```DROP EXTERNAL LOCATION IF EXISTS `xxxx`; ```

In [0]:
print(external_bucket_url)

In [0]:
%sql
-- Note: you need to be account ADMIN to run this and create the external location.
CREATE EXTERNAL LOCATION IF NOT EXISTS `field_demos_external_location`
  URL "s3://demonhid15-rootbucket-sm/customer_a_ext/"
  WITH (CREDENTIAL `uc-upgrade-cred`)
  COMMENT 'External Location for demos' ;


-- let's make everyone owner for the demo to be able to change the permissions easily. DO NOT do that for real usage.
ALTER EXTERNAL LOCATION `field_demos_external_location`  OWNER TO `account users`;

In [0]:
%sql
SHOW EXTERNAL LOCATIONS

In [0]:
%sql
DESCRIBE EXTERNAL LOCATION `field_demos_external_location`;

## 3/ GRANT permissions on the external location

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/external/uc-external-location-3.png?raw=true" style="float:right; margin-left:10px" width="700px"/>

All that's left is to GRANT permissions to our users or groups of users. In our demo, we'll grant access to all our users using `account users`.

We can set multiple permissions:

1. READ FILES to read data from the location
1. WRITE FILES to write data to the location
1. CREATE EXTERNAL TABLE to create external tables using this location (this privilege was named CREATE TABLE in earlier Unity Catalog releases)

To revoke your permissions, you can use ```REVOKE WRITE FILES ON EXTERNAL LOCATION `field_demos_external_location` FROM `account users`;```

In [0]:
%sql
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION `field_demos_external_location` TO `account users`;
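
The GRANT and REVOKE statements in this section share the same shape, so they can be generated from Python when several locations or groups are involved. The sketch below is only an illustration: the helper name `build_privilege_statement` is hypothetical, and it only builds the SQL text. Running it still requires a Databricks notebook where `spark` is available.

```python
def build_privilege_statement(action, privileges, location, principal):
    """Build a GRANT or REVOKE statement for a Unity Catalog external location.

    Hypothetical helper for illustration: it only formats the SQL string.
    """
    action = action.upper()
    if action not in ("GRANT", "REVOKE"):
        raise ValueError(f"Unsupported action: {action}")
    # GRANT ... TO <principal>, REVOKE ... FROM <principal>
    keyword = "TO" if action == "GRANT" else "FROM"
    privilege_list = ", ".join(p.upper() for p in privileges)
    return (
        f"{action} {privilege_list} ON EXTERNAL LOCATION `{location}` "
        f"{keyword} `{principal}`"
    )


grant_sql = build_privilege_statement(
    "GRANT", ["READ FILES", "WRITE FILES"],
    "field_demos_external_location", "account users",
)
revoke_sql = build_privilege_statement(
    "REVOKE", ["WRITE FILES"],
    "field_demos_external_location", "account users",
)
print(grant_sql)
print(revoke_sql)

# In a Databricks notebook you could then execute the statement with:
# spark.sql(grant_sql)
```

Keeping the statement construction separate from its execution makes it easy to review the generated SQL before applying it.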

## Accessing the data

That's all we have to do! Our users can now access the folder in SQL or Python:

In [0]:
%sql
-- Make sure you set this to your own external location  
LIST 's3://demonhid15-rootbucket-sm/external_source_ext/'

We can also write data using the SQL or Python API:

In [0]:
df = spark.createDataFrame([("UC", "is awesome"), ("Delta Sharing", "is magic")])
df.write.mode('overwrite').format('csv').save(f'{external_bucket_url}/test_write_table')

In [0]:
spark.read.csv(f'{external_bucket_url}/test_write_table').display()

Next, we'll create external tables over a large dataset:

- Create an external table on a large dataset
- Create an external table on a large dataset with partitioning
- Create an external table on a large dataset with liquid clustering

In [0]:
dbutils.fs.ls("/databricks-datasets")

In [0]:
df = spark.read.csv("dbfs:/databricks-datasets/flights/departuredelays.csv", header=True, inferSchema=True)
df.count()

In [0]:
# 1. Base path of the external location used for the exports
external_bucket_url = "s3://demonhid15-rootbucket-sm/customer_a_ext"

# 2. Load the source table into a DataFrame
print("Loading samples.tpch.lineitem dataset...")
try:
    lineitem_df = spark.table("samples.tpch.lineitem")
    print(f"Dataset loaded successfully. Column count: {len(lineitem_df.columns)}")
except Exception as e:
    print(f"Error loading table: {e}")
    raise



In [0]:
# 3. Define the full output path
output_path = f"{external_bucket_url}/lineitem_delta_export"

# 4. Write the DataFrame to the external location in Delta format
print(f"Writing data to Delta format at: {output_path}")

lineitem_df.write.mode('overwrite').format('delta').save(output_path)

print("---")
print("Success! The Lineitem data has been exported.")
print(f"Check the following directory for the Delta data files and transaction log: {output_path}")

In [0]:
partition_column = "l_shipdate"
output_path = f"{external_bucket_url}/lineitem_delta_partitioned_export"
print(f"\nWriting data to DELTA format, partitioned by '{partition_column}', at: {output_path}")

# .partitionBy(partition_column): Crucial for creating physical folders based on the column values
lineitem_df.write \
    .mode('overwrite') \
    .format('delta') \
    .partitionBy(partition_column) \
    .save(output_path)
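
To make the effect of `partitionBy` concrete, the sketch below shows the folder a given partition value would typically map to. It assumes the default Hive-style `column=value` directory naming. The helper name `partition_directory` and the date used are hypothetical, and the actual layout can differ (for example when Delta column mapping is enabled).

```python
def partition_directory(base_path, column, value):
    """Return the Hive-style folder expected for one partition value.

    Hypothetical helper: assumes the default `column=value` naming.
    """
    return f"{base_path.rstrip('/')}/{column}={value}"


example_dir = partition_directory(
    "s3://demonhid15-rootbucket-sm/customer_a_ext/lineitem_delta_partitioned_export",
    "l_shipdate",
    "1996-03-13",  # illustrative value only
)
print(example_dir)
```

Each distinct value of the partition column gets its own folder, so a column with many distinct values (such as a daily date) produces many folders.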

In [0]:
%sql
use catalog demonhid15;
use schema customer_a;
CREATE TABLE IF NOT EXISTS demonhid15.customer_a.lineitem_delta_ext
USING DELTA
LOCATION 's3://demonhid15-rootbucket-sm/customer_a_ext/lineitem_delta_export';

In [0]:
%sql
use catalog demonhid15;
use schema customer_a;
CREATE TABLE IF NOT EXISTS demonhid15.customer_a.lineitem_delta_partitioned_ext
USING DELTA
LOCATION 's3://demonhid15-rootbucket-sm/customer_a_ext/lineitem_delta_partitioned_export';
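
The two `CREATE TABLE` cells above hard-code the S3 paths. As a sketch, the same statement could be derived from the base URL used in the Python cells so the paths stay in sync. The helper name `build_external_table_ddl` and the variable `base_url` are hypothetical, and the code only builds the SQL text.

```python
def build_external_table_ddl(full_table_name, base_url, folder):
    """Build a CREATE TABLE statement registering an existing Delta folder.

    Hypothetical helper for illustration: it only formats the SQL string.
    """
    # Normalize the base URL so a trailing slash does not produce '//'
    location = f"{base_url.rstrip('/')}/{folder}"
    return (
        f"CREATE TABLE IF NOT EXISTS {full_table_name} "
        f"USING DELTA LOCATION '{location}'"
    )


base_url = "s3://demonhid15-rootbucket-sm/customer_a_ext"
ddl = build_external_table_ddl(
    "demonhid15.customer_a.lineitem_delta_ext", base_url, "lineitem_delta_export"
)
print(ddl)

# In a Databricks notebook you could then execute the statement with:
# spark.sql(ddl)
```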

In [0]:
%sql
SELECT COUNT(*) FROM demonhid15.customer_a.lineitem_delta_ext -- 29999795
union all 
SELECT COUNT(*) FROM demonhid15.customer_a.lineitem_delta_partitioned_ext -- 29999795

In [0]:
%sql
-- SELECT COUNT(*) FROM demonhid15.customer_a.lineitem_delta_ext -- 29999795

SELECT * FROM demonhid15.customer_a.lineitem_delta_partitioned_ext
limit 10;

[https://delta.io/blog/liquid-clustering/](https://delta.io/blog/liquid-clustering/)

Liquid clustering gives you flexibility. With liquid clustering enabled, you can redefine your clustering columns without having to rewrite any existing data. This allows your data layout to evolve in parallel with changing query patterns.

Note that liquid clustering is not compatible with partitioning and Z-ordering.
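
As a minimal sketch of the liquid clustering variant, the code below builds a `CREATE TABLE ... CLUSTER BY ... AS SELECT` statement for an external table. It assumes a runtime that supports liquid clustering. The helper name `build_clustered_ctas`, the table name `lineitem_delta_clustered_ext`, and the folder `lineitem_delta_clustered_export` are hypothetical and not part of the original demo.

```python
def build_clustered_ctas(full_table_name, cluster_columns, location, source_table):
    """Build a CTAS statement creating an external Delta table with liquid clustering.

    Hypothetical helper for illustration: it only formats the SQL string.
    """
    columns = ", ".join(cluster_columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {full_table_name} "
        f"USING DELTA CLUSTER BY ({columns}) "
        f"LOCATION '{location}' "
        f"AS SELECT * FROM {source_table}"
    )


clustered_sql = build_clustered_ctas(
    "demonhid15.customer_a.lineitem_delta_clustered_ext",  # hypothetical table name
    ["l_shipdate"],
    "s3://demonhid15-rootbucket-sm/customer_a_ext/lineitem_delta_clustered_export",  # hypothetical folder
    "samples.tpch.lineitem",
)
print(clustered_sql)

# In a Databricks notebook you could then execute the statement with:
# spark.sql(clustered_sql)
```

Because liquid clustering is not compatible with partitioning, this statement uses `CLUSTER BY` in place of `PARTITIONED BY` and writes to a separate folder from the partitioned export.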


[https://www.databricks.com/blog/introducing-predictive-optimization-statistics](https://www.databricks.com/blog/introducing-predictive-optimization-statistics)

**This requires Unity Catalog.**