# Connectors - Databricks Delta Lake

[YData SDK](https://ydata.ai/) integrates with Databricks Delta Lake, allowing you to connect to,
query, and manage your Delta Lake data. This section walks through the benefits,
setup, and usage of the Databricks connector within [ydata-sdk](https://pypi.org/project/ydata-sdk/).

### Benefits of Integration
Integrating ydata-sdk with Databricks offers several key benefits:

- **Enhanced Data Accessibility:** Seamlessly access and integrate previously siloed data.
- **Improved Data Quality:** Use YData's tools to enhance the quality of your data through data preparation and augmentation.
- **Scalability:** Leverage Databricks' robust infrastructure to scale data processing and AI workloads.
- **Streamlined Workflows:** Simplify data workflows with connectors and SDKs, reducing manual effort and potential errors.
- **Comprehensive Support:** Benefit from extensive documentation and support for both platforms, ensuring smooth integration and operation.


### Authenticate with your YData account

In [None]:
# Authenticate with your ydata-sdk token - https://dashboard.ydata.ai/
import os

os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'
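
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming you would rather be prompted when `YDATA_LICENSE_KEY` is not already set. The helper name `ensure_license_key` is hypothetical and is not part of ydata-sdk; it uses only the Python standard library.

```python
import os
from getpass import getpass


def ensure_license_key(env_var: str = "YDATA_LICENSE_KEY", prompt=getpass) -> str:
    """Return the key from the environment, prompting once if it is missing.

    Hypothetical helper for illustration only; not provided by ydata-sdk.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key never shows up in cell output
        key = prompt(f"Enter your {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} must not be empty")
        os.environ[env_var] = key
    return key
```

Calling `ensure_license_key()` once at the top of the notebook would take the place of the hard-coded assignment above.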

## Create your Delta Lake connector

In [None]:
# Add the connection details:
# - your Databricks host and access token
# - credentials for the staging object storage (AWS S3 or Azure Blob Storage)
# - the dataset details (catalog, schema, table, and warehouse names)
HOST = 'insert-databricks-host'
TOKEN = 'insert-token'

AWS_ACCESS_KEY = 'insert-aws-key'
SECRET_ACCESS_KEY = 'insert-secret'

CATALOG = 'catalog-name'
SCHEMA = 'schema-name'
TABLE = 'table-name'
WAREHOUSE = 'your-warehouse-name'
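
The same connection details can be kept out of the notebook entirely. The sketch below is one possible approach, assuming the values are exported as `DATABRICKS_*` environment variables. Those variable names, the `DatabricksSettings` class, and `load_settings` are all hypothetical and are not defined by ydata-sdk or Databricks. Storage credentials are left out for brevity and could be handled the same way.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DatabricksSettings:
    """Hypothetical container for the connection details used in this notebook."""
    host: str
    token: str
    catalog: str
    schema: str
    table: str
    warehouse: str


def load_settings(env=os.environ) -> DatabricksSettings:
    """Build the settings from DATABRICKS_* variables, failing fast on gaps."""
    # Map each field to an assumed variable name, e.g. host -> DATABRICKS_HOST
    names = {f.name: f"DATABRICKS_{f.name.upper()}" for f in fields(DatabricksSettings)}
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return DatabricksSettings(**{field: env[var] for field, var in names.items()})
```

With this in place, `settings = load_settings()` would supply `settings.host`, `settings.token`, and the other values in place of the constants above.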

In [None]:
# Assumes DatabricksLakehouse has already been imported from ydata-sdk
# The Databricks Lakehouse connector requires several inputs:
# - the Databricks host and access token
# - credentials for one of the supported object storages (AWS S3 or Azure Blob Storage)
# - the catalog and schema to work with
connector = DatabricksLakehouse(
    host=HOST,
    access_token=TOKEN,
    staging_credentials={'access_key_id': AWS_ACCESS_KEY,
                         'secret_access_key': SECRET_ACCESS_KEY},
    catalog=CATALOG,
    schema=SCHEMA,
    cloud='aws'
)

## Read from your Delta Lake

With the Delta Lake connector you can:
- Read all the data from a Delta Lake table
- Read a sample from a Delta Lake table
- Read the result of a query against a Delta Lake instance

In [4]:
# Define which warehouse to use for the operation
# This method reads all the records in the table
table = connector.get_table(table=TABLE,
                            warehouse=WAREHOUSE)
print(table)

INFO: 2024-06-11 22:01:11,445 Successfully opened session 01ef283e-1d15-1af3-98a1-f98419972ba9
INFO: 2024-06-11 22:01:12,055 Successfully opened session 01ef283e-1d7e-1f4a-b6fb-a89fe5ff57a8
Dataset

Shape: (5000, 13)
Schema:
         Column Variable type
0            id        string
1           age           int
2        gender           int
3        height           int
4        weight         float
5         ap_hi           int
6         ap_lo           int
7   cholesterol           int
8          gluc           int
9         smoke           int
10         alco           int
11       active           int
12       cardio           int




In [5]:
# Define which warehouse to use for the operation
# This method reads a sample whose number of records equals sample_size
table_sample = connector.get_table_sample(table=TABLE,
                                          warehouse=WAREHOUSE,
                                          sample_size=50)
print(table_sample)

Dataset

Shape: (50, 13)
Schema:
         Column Variable type
0            id        string
1           age           int
2        gender           int
3        height           int
4        weight         float
5         ap_hi           int
6         ap_lo           int
7   cholesterol           int
8          gluc           int
9         smoke           int
10         alco           int
11       active           int
12       cardio           int




In [11]:
# Run a SQL query on the chosen warehouse; the result is returned as a Dataset
query_output = connector.query("SELECT * FROM catalogname.schemaname.tablename;",
                               warehouse='insert-warehouse-name')
print(query_output)

INFO: 2024-06-11 22:08:56,902 Successfully opened session 01ef283f-328f-115a-a977-2da27926d77a
Dataset

Shape: (5000, 13)
Schema:
         Column Variable type
0            id        string
1           age           int
2        gender           int
3        height           int
4        weight         float
5         ap_hi           int
6         ap_lo           int
7   cholesterol           int
8          gluc           int
9         smoke           int
10         alco           int
11       active           int
12       cardio           int
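
The query above addresses the table by its three-part name, `catalog.schema.table`. When those names come from variables, or contain characters such as hyphens, it can help to build the statement programmatically. The sketch below assumes the Databricks SQL convention of delimiting identifiers with backticks and escaping an embedded backtick by doubling it. The helpers `quote_identifier` and `select_all` are hypothetical and are not part of the connector.

```python
from typing import Optional


def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any backtick it contains."""
    if not name:
        raise ValueError("identifier must not be empty")
    return "`" + name.replace("`", "``") + "`"


def select_all(catalog: str, schema: str, table: str, limit: Optional[int] = None) -> str:
    """Build a SELECT * statement for a three-part table name."""
    full_name = ".".join(quote_identifier(part) for part in (catalog, schema, table))
    sql = f"SELECT * FROM {full_name}"
    if limit is not None:
        # int() rejects anything that is not a plain number
        sql += f" LIMIT {int(limit)}"
    return sql
```

The resulting string could then be passed as the first argument to `connector.query`, as in the cell above.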




## Write to your Databricks Delta Lake
The Databricks Delta Lake connector can also write data back to a Delta Lake instance. It supports the following action:

- Write the data into a table

In [17]:
# Write the data to a new table called cardio_new in the configured schema
# if_exists lets you decide whether to append, replace, or fail when a table with the same name already exists in the schema
connector.write_table(data=query_output,
                      staging_path='s3://ydata-dev/regular/internet_sales/test.csv',
                      table='cardio_new',
                      warehouse='Starter Warehouse',
                      if_exists='fail')
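
To make the three `if_exists` options concrete, here is a toy in-memory model of how such policies typically behave. It is an illustration under stated assumptions, not the connector's implementation: the function `write_rows`, the dictionary standing in for the schema, and the exception type raised on `'fail'` are all hypothetical.

```python
VALID_POLICIES = {"append", "replace", "fail"}


def write_rows(tables: dict, name: str, rows: list, if_exists: str) -> None:
    """Toy model: 'tables' maps a table name to its list of rows."""
    if if_exists not in VALID_POLICIES:
        raise ValueError(f"if_exists must be one of {sorted(VALID_POLICIES)}")
    if name not in tables or if_exists == "replace":
        # New table, or an existing one whose contents are discarded
        tables[name] = list(rows)
    elif if_exists == "append":
        # Existing rows are kept and the new rows are added after them
        tables[name].extend(rows)
    else:
        # 'fail': leave the existing table untouched and report the clash
        raise ValueError(f"Table {name!r} already exists")
```

In this model, `'fail'` is the safest choice for a first write because it never modifies an existing table.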


