# ![Ray Databricks Connector](../images/databrick_connector_logo.png)
This user guide walks through the basics of reading and writing data with Ray and Databricks.

The Ray Databricks connector enables parallel reads from and writes to a Databricks SQL endpoint. The connector uses the Python DB API 2.0 specification as implemented by the Databricks SQL Connector for Python (the `databricks.sql` module).
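
For readers unfamiliar with DB API 2.0, the following is a minimal sketch of the connect / cursor / execute / fetch pattern that the specification defines. It uses the standard-library `sqlite3` module as a stand-in so that it runs anywhere; the table and its rows are made up for illustration, and the Databricks driver differs in connection arguments and parameter style.

```python
import sqlite3

# sqlite3 is used only as a locally available DB API 2.0 driver.
con = sqlite3.connect(":memory:")
try:
    cursor = con.cursor()
    cursor.execute("CREATE TABLE customer (c_name TEXT, c_acctbal REAL)")
    # executemany runs the same statement once per parameter tuple
    cursor.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [("a", -10.0), ("b", 25.5), ("c", 99.0)],
    )
    con.commit()

    cursor.execute(
        "SELECT c_name, c_acctbal FROM customer WHERE c_acctbal > ? ORDER BY c_name",
        (0,),
    )
    rows = cursor.fetchall()
    # cursor.description holds one entry per result column; item 0 is the name
    column_names = [column[0] for column in cursor.description]
finally:
    con.close()

print(column_names, rows)
```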

## Initialize Ray
Ray is initialized automatically with default settings the first time any Ray or Ray Datasets method is called. To control the configuration yourself, initialize Ray explicitly as shown below.

In [1]:
import ray, logging
logging.basicConfig(level=logging.ERROR) # only show errors

if not ray.is_initialized():
    ray.init()

2023-02-27 21:05:16,756	INFO worker.py:1242 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2023-02-27 21:05:17,074	INFO worker.py:1360 -- Connecting to existing Ray cluster at address: 10.0.36.75:9031...
2023-02-27 21:05:17,081	INFO worker.py:1548 -- Connected to Ray cluster. View the dashboard at https://console.anyscale.com/api/v2/sessions/ses_vnmb5jgl4z6q98h61dx25rccju/services?redirect_to=dashboard
2023-02-27 21:05:17,084	INFO packaging.py:330 -- Pushing file package 'gcs://_ray_pkg_de75729c0244af0b2679b245ffd05193.zip' (0.26MiB) to Ray cluster...
2023-02-27 21:05:17,087	INFO packaging.py:343 -- Successfully pushed file package 'gcs://_ray_pkg_de75729c0244af0b2679b245ffd05193.zip'.


## Connection properties
The Databricks connection properties must be provided to the data source when it is created. These properties are documented by Databricks.

Below is an example of loading properties from the environment, and filtering them by the 'DATABRICKS_' prefix.

In [2]:
import os

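connect_props = {
    # strip the prefix and lower-case the rest, e.g. DATABRICKS_HTTP_PATH -> http_path
    key.replace('DATABRICKS_', '', 1).lower(): value
    for key, value in os.environ.items() if key.startswith('DATABRICKS_')
}

# add catalog and schema to the connection properties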
connect_props = dict(
    catalog = 'samples',
    schema = 'tpch',
    **connect_props
)

print('Connection properties:')
print('\n'.join(connect_props.keys()))

Connection properties:
catalog
schema
server_hostname
access_token
http_path
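
The environment lookup above can also be wrapped in a small helper that fails early when a property is missing. This is only a sketch: `props_from_env` is a hypothetical name, the list of required keys is assumed from the output shown above, and the hostname, path, and token values are placeholders.

```python
def props_from_env(
    environ,
    prefix="DATABRICKS_",
    required=("server_hostname", "http_path", "access_token"),
):
    """Collect connection properties from variables that start with `prefix`."""
    props = {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }
    missing = [name for name in required if name not in props]
    if missing:
        raise KeyError(f"missing connection properties: {', '.join(missing)}")
    return props


# Placeholder values standing in for os.environ
example_env = {
    "DATABRICKS_SERVER_HOSTNAME": "example.cloud.databricks.com",
    "DATABRICKS_HTTP_PATH": "/sql/placeholder/path",
    "DATABRICKS_ACCESS_TOKEN": "not-a-real-token",
    "MY_DATABRICKS_NOTE": "ignored: the prefix is not at the start",
    "PATH": "/usr/bin",
}

props = props_from_env(example_env)
print(sorted(props))
```

In a real session the helper would be called as `props_from_env(os.environ)`.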


## Reading
Ray uses the Databricks Python API to read data in parallel into a Ray cluster. The resulting Ray dataset is composed of PyArrow tables that are spread across the Ray cluster to allow for the distributed operations required in machine learning.

![Databricks Read](../images/databricks_read.png)
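
To make the idea of a parallel read concrete, here is a rough sketch in which several tasks each fetch one key range over their own connection and return one block of rows. It is not the connector's implementation: `sqlite3` and a thread pool stand in for Databricks and Ray, and the table, the key ranges, and the `read_block` helper are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Build a small local table to stand in for a remote SQL table.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
with sqlite3.connect(db_path) as con:
    con.execute("CREATE TABLE customer (c_custkey INTEGER, c_acctbal REAL)")
    con.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(key, key * 1.5) for key in range(100)],
    )


def read_block(bounds):
    """Read one key range using a connection owned by this task."""
    low, high = bounds
    task_con = sqlite3.connect(db_path)
    try:
        cursor = task_con.execute(
            "SELECT c_custkey, c_acctbal FROM customer "
            "WHERE c_custkey >= ? AND c_custkey < ? ORDER BY c_custkey",
            (low, high),
        )
        return cursor.fetchall()
    finally:
        task_con.close()


# Four hypothetical read tasks, each covering 25 keys.
ranges = [(start, start + 25) for start in range(0, 100, 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    blocks = list(pool.map(read_block, ranges))

total_rows = sum(len(block) for block in blocks)
print(len(blocks), total_rows)
```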

### Read from tables
To read an entire table into a Ray cluster, use the Ray Data `read_databricks` method. The code below reads a sample table from a Databricks sample database.

In [3]:
from ray.data import read_databricks

# read the table, limiting to the first 10 customers
ds = read_databricks(connect_props, table='customer').limit(10)

# display the first 3 results
ds.limit(3).to_pandas()

Read progress: 100%|██████████| 1/1 [00:00<00:00,  2.29it/s]


| | c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
|---|---|---|---|---|---|---|---|---|
| 0 | 412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf | 21 | 31-421-403-4333 | 5358.33 | BUILDING | arefully blithely regular epi |
| 1 | 412446 | Customer#000412446 | 5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg | 20 | 30-487-949-7942 | 9441.59 | MACHINERY | sleep according to the fluffily even forges. f... |
| 2 | 412447 | Customer#000412447 | HC4ZT62gKPgrjr ceoaZgFOunlUogr7GO | 7 | 17-797-466-6308 | 7868.75 | AUTOMOBILE | aggle blithely among the carefully express excus |


### Read with a query
For more control over columns and rows read, as well as joining data from multiple tables, a query can be specified instead of a table name. 

In [4]:
QUERY = 'SELECT C_ACCTBAL, C_MKTSEGMENT FROM CUSTOMER WHERE C_ACCTBAL < 0 LIMIT 1000'

# read the result of the query
ds2 = read_databricks(connect_props, query=QUERY)

# display the first 3 results
ds2.limit(3).to_pandas()

Read progress: 100%|██████████| 1/1 [00:00<00:00,  2.80it/s]


| | C_ACCTBAL | C_MKTSEGMENT |
|---|---|---|
| 0 | -219.53 | BUILDING |
| 1 | -778.23 | AUTOMOBILE |
| 2 | -848.16 | BUILDING |


### Additional read parameters
When reading from Databricks, additional arguments of the underlying Python API are also available; they are passed through to the driver's `execute` method.

The code below uses the `parameters` argument to specify parameters to be used by Databricks when executing the query.

In [5]:
QUERY = 'SELECT C_ACCTBAL, C_MKTSEGMENT FROM CUSTOMER WHERE C_ACCTBAL > %(balance)i LIMIT 100'

ds3 = read_databricks(connect_props, query=QUERY, parameters={'balance':2.0})
ds3.limit(3).to_pandas()

Read progress: 100%|██████████| 1/1 [00:00<00:00,  4.67it/s]


| | C_ACCTBAL | C_MKTSEGMENT |
|---|---|---|
| 0 | 5358.33 | BUILDING |
| 1 | 9441.59 | MACHINERY |
| 2 | 7868.75 | AUTOMOBILE |
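
The `%(balance)i` placeholder in the query above follows Python's named %-formatting syntax. As a rough illustration only, the snippet below renders the same query with plain Python string formatting to show which value ends up in the statement. This is an assumption about the shape of the substitution, not a description of how the driver binds parameters; a real driver escapes or binds the values itself, so SQL should never be built from untrusted input this way.

```python
QUERY = (
    "SELECT C_ACCTBAL, C_MKTSEGMENT FROM CUSTOMER "
    "WHERE C_ACCTBAL > %(balance)i LIMIT 100"
)
parameters = {"balance": 2.0}

# %(name)i formats the named value as an integer, so 2.0 renders as 2
rendered = QUERY % parameters
print(rendered)

# %(name)s would keep the value's string form instead
rendered_as_string = "C_ACCTBAL > %(balance)s" % parameters
print(rendered_as_string)
```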


## Writing
The Ray Databricks connector uses the Databricks driver to write the partitions of a dataset in parallel. Each partition in the Ray dataset gets its own write task, and these tasks write in parallel to a cloud storage location. After the partitions are written, a table is created from the Parquet files.

![Databricks write](../images/databricks_write.png)
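
The two phases described above (parallel writes to a staging location, then table creation from the staged files) can be sketched with local stand-ins. This is an illustration under stated assumptions rather than the connector's code: a temporary directory replaces cloud storage, CSV replaces Parquet, `sqlite3` replaces Databricks, and the partitions and the `write_partition` helper are made up.

```python
import csv
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Three hypothetical partitions of a dataset
partitions = [
    [(1, "BUILDING"), (2, "MACHINERY")],
    [(3, "AUTOMOBILE")],
    [(4, "BUILDING"), (5, "FURNITURE")],
]

stage_dir = tempfile.mkdtemp()  # stands in for the cloud staging location


def write_partition(indexed_partition):
    """Write one partition to its own staged file and return the path."""
    index, rows = indexed_partition
    path = os.path.join(stage_dir, f"part-{index:05d}.csv")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return path


# Phase 1: one write task per partition, running in parallel
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    staged_files = list(pool.map(write_partition, enumerate(partitions)))

# Phase 2: create the table and load it from the staged files
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer2 (c_custkey INTEGER, c_mktsegment TEXT)")
for path in sorted(staged_files):
    with open(path, newline="") as handle:
        con.executemany("INSERT INTO customer2 VALUES (?, ?)", csv.reader(handle))
con.commit()
row_count = con.execute("SELECT COUNT(*) FROM customer2").fetchone()[0]
con.close()

print(len(staged_files), row_count)
```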

### Configure a metastore
Before running this code, a storage bucket that is accessible to both the Ray cluster and Databricks needs to be configured. To add a storage location to Unity Catalog, follow these [instructions](https://docs.databricks.com/data-governance/unity-catalog/get-started.html#cloud-tenant-setup-aws). You will need administrative privileges for your cloud account.

Since data will be written to Databricks, a metastore in which your user has create-table and write permissions must also exist. To create a metastore, follow these [instructions](https://docs.databricks.com/data-governance/unity-catalog/create-metastore.html). You will need administrative privileges in Databricks.

In [6]:
from databricks.sql import connect

write_connect_props = {
    **connect_props, 
    'catalog':'hive_metastore',
    'schema':'default'
}

# drop the destination table if it already exists
with connect(**write_connect_props) as con:
    with con.cursor() as cursor:
        cursor.execute('DROP TABLE IF EXISTS customer2')
        #cursor.execute('CREATE TABLE customer2 USING DELTA AS (SELECT * FROM samples.tpch.customer LIMIT 0)')
        

### Write to tables
To write a dataset into a database table, use the `write_databricks` method of the dataset object. Repartition the dataset before calling this method to set the number of write tasks.

The example below writes the previously read data into a new Databricks table and then reads it back.

In [7]:
# write the dataset to the table 
ds.write_databricks(
    write_connect_props, 
    table='hive_metastore.default.customer2',
    stage_uri='s3://egr-sydney-databricks/stage'
)

# read the new table
ds4 = read_databricks(write_connect_props, table='customer2')
ds4.limit(3).to_pandas()

2023-02-27 21:05:31,909	INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[write]
write: 100%|██████████| 1/1 [00:14<00:00, 14.76s/it]
2023-02-27 21:05:46,692	INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[write]
write: 100%|██████████| 1/1 [00:03<00:00,  3.73s/it]
Read progress: 100%|██████████| 3/3 [00:04<00:00,  1.49s/it]


| | c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
|---|---|---|---|---|---|---|---|---|
| 0 | 412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf | 21 | 31-421-403-4333 | 5358.33 | BUILDING | arefully blithely regular epi |
| 1 | 412446 | Customer#000412446 | 5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg | 20 | 30-487-949-7942 | 9441.59 | MACHINERY | sleep according to the fluffily even forges. f... |
| 2 | 412447 | Customer#000412447 | HC4ZT62gKPgrjr ceoaZgFOunlUogr7GO | 7 | 17-797-466-6308 | 7868.75 | AUTOMOBILE | aggle blithely among the carefully express excus |
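
Since the number of partitions sets the number of write tasks, the repartition step can be made explicit with a small wrapper. The sketch below assumes that `write_databricks` accepts the arguments used in the cell above and relies on the Ray dataset `repartition` method; `write_with_tasks` is a hypothetical helper name, not part of the connector.

```python
def write_with_tasks(ds, num_tasks, connect_props, table, stage_uri):
    """Repartition `ds` so that the write runs as `num_tasks` write tasks."""
    return ds.repartition(num_tasks).write_databricks(
        connect_props,
        table=table,
        stage_uri=stage_uri,
    )


# Example call, reusing the names from the cell above:
# write_with_tasks(
#     ds, 4, write_connect_props,
#     table='hive_metastore.default.customer2',
#     stage_uri='s3://egr-sydney-databricks/stage',
# )
```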


## Advanced Usage
If lower-level access to the Ray Databricks connector is needed, the underlying `DatabricksConnector` and `DatabricksDatasource` classes can be used directly.

### Databricks Connector
The `DatabricksConnector` class holds the connection properties and logic required to establish a connection with Databricks. Internally it calls the native Python driver API to read from and write to tables in parallel across the cluster. The datasource uses the DB API 2.0 `execute` and `executemany` methods to enable parallel reads and writes of data.

The connector is also a Python context manager, so `with` semantics define when a connection is established, when database operations are committed, and when the connection is closed.
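
The context-manager behaviour described here (open on entry, commit on success, close on exit) can be sketched as follows. `SketchConnector` is a hypothetical class backed by `sqlite3`; it mirrors the method names used in this guide but is not the `DatabricksConnector` implementation.

```python
import os
import sqlite3
import tempfile


class SketchConnector:
    """Hypothetical stand-in showing open / commit / close semantics."""

    def __init__(self, database):
        self.database = database
        self._con = None

    def open(self):
        self._con = sqlite3.connect(self.database)

    def commit(self):
        self._con.commit()

    def close(self):
        if self._con is not None:
            self._con.close()
            self._con = None

    def query(self, sql, parameters=()):
        return self._con.execute(sql, parameters).fetchall()

    def query_int(self, sql):
        return int(self.query(sql)[0][0])

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        try:
            # commit only when the block finished without an exception
            if exc_type is None:
                self.commit()
        finally:
            self.close()
        return False  # never swallow exceptions


db_path = os.path.join(tempfile.mkdtemp(), "sketch.db")

# First block: the changes are committed when the block exits cleanly
with SketchConnector(db_path) as sketch:
    sketch.query("CREATE TABLE customer (c_custkey INTEGER)")
    sketch.query("INSERT INTO customer VALUES (1), (2), (3)")

# Second block: a new connection sees the committed rows
with SketchConnector(db_path) as sketch:
    count = sketch.query_int("SELECT COUNT(*) FROM customer")

print(count)
```

Using two `with` blocks, as here, is one way to make each block its own transaction boundary.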

The code below will read from a sample table using the connector to manage the connection.

In [8]:
from ray.data.datasource import DatabricksConnector

# query the number of rows, using the connection context to
# manage transactions
with DatabricksConnector(**connect_props) as connector:
    count = connector.query_int('SELECT COUNT(*) FROM customer')

print(count)

750000


Alternatively, you can use `try` blocks with the connector's `open`, `commit` and `close` methods. 

In [9]:
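# create and open the connector before the try block so that
# `connector` is always defined when the finally clause runs
connector = DatabricksConnector(**write_connect_props)
connector.open()
try:
    count = connector.query_int('SELECT COUNT(*) FROM customer2')
finally:
    # always release the connection, even if the query fails
    connector.close()
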
print(count)

10


### Databricks Datasource
The Databricks datasource can be used with the Ray Data `read_datasource` and `write_datasource` methods to read from and write to databases using the distributed processing capabilities of Ray Data. The datasource uses the `DatabricksConnector` class internally.

Below is an example of creating the datasource from the previously defined connector and then using it to read.

In [10]:
from ray.data.datasource import DatabricksDatasource
from ray.data import read_datasource

# create a datasource from a connector
datasource = DatabricksDatasource(connector)

# use read_datasource to read
ds = read_datasource(datasource, table='customer2')
ds.limit(3).to_pandas()

Read progress: 100%|██████████| 3/3 [00:00<00:00,  8.80it/s]


| | c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
|---|---|---|---|---|---|---|---|---|
| 0 | 412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf | 21 | 31-421-403-4333 | 5358.33 | BUILDING | arefully blithely regular epi |
| 1 | 412446 | Customer#000412446 | 5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg | 20 | 30-487-949-7942 | 9441.59 | MACHINERY | sleep according to the fluffily even forges. f... |
| 2 | 412447 | Customer#000412447 | HC4ZT62gKPgrjr ceoaZgFOunlUogr7GO | 7 | 17-797-466-6308 | 7868.75 | AUTOMOBILE | aggle blithely among the carefully express excus |


### DML and DDL
The connector can also be used for any DDL or DML operations you would normally execute through the native Python DB API. These operations simply pass through to the underlying API.

The code below drops the table that was written earlier. When several dependent statements are issued, commit between them so that each DDL operation executes before the next one that depends on it. An alternative is to use two `with` blocks to define the transaction boundaries.

In [11]:
with connector as con:
    con.query('DROP TABLE IF EXISTS customer2')