![Snowflake connector](images/ray_snowflake.png)
# Working with Snowflake
This user guide walks through the basics of reading and writing data with the Ray Snowflake connector, and using the data for training and tuning an ML model.

## Connection properties
The Snowflake connection properties must be provided to the data source when it is created. The minimal required properties are `user`, `password`, `account` and `warehouse`. To authenticate with a key instead of a password, the key can be loaded from a file specified by the `private_key_file` property, or passed directly via the `private_key` property. If the key is password protected, the password can be given via the `pk_password` property. Optional properties such as `database` and `schema` can also be provided at construction, or be included in a fully qualified table name of the form `db.schema.table` when calling read or write operations with a table or subquery.

The example below loads the properties from environment variables, keeping only those with the `SNOWFLAKE_` prefix and stripping the prefix from each name.

In [1]:
import os
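# keep only variables that start with the prefix, then strip it and lowercase the name
prefix = 'SNOWFLAKE_'
env_connect_props = {
    key[len(prefix):].lower(): value
    for key, value in os.environ.items() if key.startswith(prefix)
}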
print('Environment connection properties:')
print('\n'.join(env_connect_props.keys()))

Environment connection properties:
account
private_key_file
pk_password
user
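
Before connecting, it can help to check locally that the properties are complete. The sketch below is illustrative only: `missing_connect_props` is a hypothetical helper, not part of the connector, and it assumes that `private_key_file` or `private_key` can stand in for `password`, as described above.

```python
# Hypothetical helper (not part of the Ray Snowflake connector).
REQUIRED = ('user', 'account', 'warehouse')
# Assumption: any one of these is enough to authenticate.
AUTH_OPTIONS = ('password', 'private_key_file', 'private_key')


def missing_connect_props(props):
    """Return the names of required connection properties that are absent."""
    missing = [name for name in REQUIRED if not props.get(name)]
    if not any(props.get(name) for name in AUTH_OPTIONS):
        missing.append('password or private_key_file or private_key')
    return missing


# Example values are placeholders, not real credentials.
example_props = {
    'user': 'ray_user',
    'account': 'my_account',
    'private_key_file': '/path/to/key.p8',
    'pk_password': 'secret',
}
print(missing_connect_props(example_props))  # ['warehouse']
```

In this example the environment supplies everything except the warehouse, which is why the read example later adds `warehouse` explicitly.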


## Reading from Snowflake
Ray uses Snowflake optimizations that allow query results to be read in parallel into a Ray cluster. The resulting Ray dataset is composed of Pandas DataFrames that are spread across the Ray cluster to allow for the distributed operations required in machine learning.

![Snowflake read table](images/snowflake_read_table.png)


### Read from tables
To read an entire table into a Ray cluster, use the Ray Data `read_snowflake` method. The code below reads a sample table from the Snowflake sample database.

In [2]:
from ray.data import read_snowflake

# add database, schema and warehouse to the connection properties
connect_props = dict(
    database='SNOWFLAKE_SAMPLE_DATA',
    schema='TPCH_SF1',
    warehouse='COMPUTE_WH',
    password='<your-password>',  # placeholder; avoid hard-coding real credentials
    **env_connect_props
)

# read the entire table
ds = read_snowflake(connect_props, table='CUSTOMER') 

# display the first 3 results
ds.limit(3).to_pandas()

2023-02-06 01:14:54,632	INFO worker.py:1242 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2023-02-06 01:14:55,011	INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 10.0.63.233:9031...
2023-02-06 01:14:55,228	INFO worker.py:1544 -- Connected to Ray cluster. View the dashboard at https://console.anyscale.com/api/v2/sessions/ses_vnmb5jgl4z6q98h61dx25rccju/services?redirect_to=dashboard
2023-02-06 01:14:55,233	INFO packaging.py:330 -- Pushing file package 'gcs://_ray_pkg_0768b49d820a3e94056f2f221c3eba12.zip' (0.38MiB) to Ray cluster...
2023-02-06 01:14:55,239	INFO packaging.py:343 -- Successfully pushed file package 'gcs://_ray_pkg_0768b49d820a3e94056f2f221c3eba12.zip'.
Read progress: 100%|██████████| 1/1 [00:00<00:00,  7.38it/s]
Read progress: 100%|██████████| 1/1 [00:00<00:00, 686.58it/s]


|   | C_CUSTKEY | C_NAME | C_ADDRESS | C_NATIONKEY | C_PHONE | C_ACCTBAL | C_MKTSEGMENT | C_COMMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 30001 | Customer#000030001 | Ui1b,3Q71CiLTJn4MbVp,,YCZARIaNTelfst | 4 | 14-526-204-4500 | 8848.47 | MACHINERY | frays wake blithely enticingly ironic asymptote |
| 1 | 30002 | Customer#000030002 | UVBoMtILkQu1J3v | 11 | 21-340-653-9800 | 5221.81 | MACHINERY | he slyly ironic pinto beans wake slyly above t... |
| 2 | 30003 | Customer#000030003 | CuGi9fwKn8JdR | 21 | 31-757-493-7525 | 3014.89 | BUILDING | e furiously alongside of the requests. evenly ... |
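
Because `database` and `schema` are set in the connection properties, the short table name `CUSTOMER` is sufficient here. The same table could instead be referenced by its fully qualified `db.schema.table` name. The helper below is a hypothetical illustration of how the two forms relate; it is not part of the connector API, and it assumes simple identifiers without quoted dots.

```python
def qualified_table_name(table, database=None, schema=None):
    """Build a db.schema.table name, leaving fully qualified names untouched.

    Hypothetical helper for illustration; assumes identifiers contain no dots.
    """
    parts = table.split('.')
    if len(parts) == 3:
        return table
    if len(parts) == 2:
        if database is None:
            raise ValueError('database is required for a schema.table name')
        return f'{database}.{table}'
    if database is None or schema is None:
        raise ValueError('database and schema are required for a bare table name')
    return f'{database}.{schema}.{table}'


print(qualified_table_name('CUSTOMER', 'SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1'))
# SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
```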


### Read with a query
For more control over the columns and rows that are read, or to join data from multiple tables, specify a query instead of a table name.

In [3]:
QUERY = 'SELECT C_ACCTBAL, C_MKTSEGMENT FROM CUSTOMER WHERE C_ACCTBAL < 0'

# read the result of the query
ds2 = read_snowflake(connect_props, query=QUERY)

# display the first 3 results
ds2.limit(3).to_pandas()

Read progress: 100%|██████████| 1/1 [00:00<00:00,  7.18it/s]
Read progress: 100%|██████████| 1/1 [00:00<00:00, 922.84it/s]


|   | C_ACCTBAL | C_MKTSEGMENT |
|---|---|---|
| 0 | -272.6 | BUILDING |
| 1 | -78.56 | AUTOMOBILE |
| 2 | -917.75 | FURNITURE |


### Additional read parameters
When reading from Snowflake, arguments of the underlying Python API are also available. The `timeout` and `params` arguments of the [cursor execute method](https://docs.snowflake.com/en/user-guide/python-connector-api.html#object-cursor) may be used.

The code below uses `params` to supply the value that Snowflake binds to the `?` placeholder when executing the query.

In [4]:
QUERY = 'SELECT C_ACCTBAL, C_MKTSEGMENT FROM CUSTOMER WHERE C_ACCTBAL > ?'

ds3 = read_snowflake(connect_props, query=QUERY, params=[1000], timeout=1000)
ds3.limit(3).to_pandas()

Read progress: 100%|██████████| 1/1 [00:00<00:00,  9.38it/s]
Read progress: 100%|██████████| 1/1 [00:00<00:00, 660.21it/s]


|   | C_ACCTBAL | C_MKTSEGMENT |
|---|---|---|
| 0 | 8848.47 | MACHINERY |
| 1 | 5221.81 | MACHINERY |
| 2 | 3014.89 | BUILDING |
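
The `?` placeholder follows the DB-API 2.0 `qmark` binding style, in which values are passed separately from the SQL text rather than formatted into it. As a local illustration that needs no Snowflake account, the sketch below runs the same kind of parameterized query against an in-memory SQLite database. SQLite is only a stand-in here, and the three rows are made-up sample values. Note that the Snowflake Python connector uses `pyformat` binding by default, so `?` placeholders assume that `qmark` binding is enabled; whether the Ray connector configures this automatically is not covered here.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in for Snowflake.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE CUSTOMER (C_ACCTBAL REAL, C_MKTSEGMENT TEXT)')
con.executemany(
    'INSERT INTO CUSTOMER VALUES (?, ?)',
    [(8848.47, 'MACHINERY'), (-272.6, 'BUILDING'), (3014.89, 'BUILDING')],
)

# The value 1000 is bound to the '?' placeholder by the driver,
# not interpolated into the SQL string.
query = (
    'SELECT C_ACCTBAL, C_MKTSEGMENT FROM CUSTOMER '
    'WHERE C_ACCTBAL > ? ORDER BY C_ACCTBAL'
)
rows = con.execute(query, [1000]).fetchall()
con.close()

print(rows)  # [(3014.89, 'BUILDING'), (8848.47, 'MACHINERY')]
```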


## Writing
The Ray Snowflake connector uses the Snowflake API to write data in parallel: each partition of the Ray dataset gets its own write task.
![Snowflake write table](images/snowflake_write_table.png)

### Write to tables
To write a dataset into a Snowflake table, use the `write_snowflake` method of the dataset object. Repartition the dataset to set the number of write tasks.

First, a new database and table need to be created using the Snowflake connector API.

In [5]:
from snowflake import connector

write_connect_props = {
    **connect_props, 
    'database':'RAY_SAMPLE', 
    'schema':'PUBLIC'
}
with connector.connect(**write_connect_props) as con:
    # create destination database
    con.cursor().execute(f'CREATE DATABASE IF NOT EXISTS RAY_SAMPLE')
    con.commit()
    
    # create destination table
    con.cursor().execute('''
        CREATE OR REPLACE TABLE CUSTOMER_COPY 
        LIKE SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
    ''')

The example below writes the previously read data into the new database table that was created above using the Snowflake Python API.

In [6]:
# write the dataset to the table 
ds.write_snowflake(
    write_connect_props, 
    table='CUSTOMER_COPY'
)

Read progress: 100%|██████████| 19/19 [00:04<00:00,  4.24it/s]
Write Progress: 100%|██████████| 19/19 [00:09<00:00,  1.97it/s]
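
The progress output above shows 19 write tasks, one per partition. Conceptually, the number of partitions determines the number of independent write tasks. The following is a simplified, standard-library-only illustration of that idea, not the connector's implementation: `split_into_partitions` and `write_partition` are hypothetical names, and a thread pool stands in for Ray tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # the first `extra` partitions receive one additional row
        end = start + size + (1 if i < extra else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def write_partition(partition):
    """Stand-in for one write task; returns the number of rows it 'wrote'."""
    return len(partition)


rows = list(range(10))
partitions = split_into_partitions(rows, 4)

# one task per partition, executed concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    written = list(pool.map(write_partition, partitions))

print(written)  # [3, 3, 2, 2]
```

With a Ray dataset, the corresponding step is to call `repartition` on the dataset before writing, as described at the start of this section.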


### Additional write parameters
When writing to Snowflake, the arguments of the native Snowflake [write_pandas](https://docs.snowflake.com/en/user-guide/python-connector-api.html#module-snowflake-connector-pandas-tools) method are also available. The following parameters may be useful:

- `auto_create_table`: When true, automatically creates a table with a column for each column in the passed-in DataFrame. The table is not created if it already exists.
- `overwrite`: When true, and if `auto_create_table` is true, the table is dropped. Otherwise, the table is truncated. In both cases the existing contents of the table are replaced with those of the passed-in Pandas DataFrame.
- `table_type`: The type of the table to be created. The supported table types include `temp`/`temporary` and `transient`. Empty means a permanent table, as per SQL convention.

In the example below, we use the `auto_create_table` parameter to create the output table before writing.

In [7]:
# write the dataset to the table, using an autocreated table
ds.write_snowflake(
    write_connect_props, 
    table='CUSTOMER_COPY_2',
    auto_create_table=True
)

Read progress: 100%|██████████| 19/19 [00:00<00:00, 3205.75it/s]
Write Progress: 100%|██████████| 19/19 [00:11<00:00,  1.66it/s]
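
The interaction between `overwrite` and `auto_create_table` described in the parameter list can be summarized as a small decision function. This is only a restatement of that description for an existing table, not code from the connector; `existing_table_action` is a hypothetical name, and the sketch assumes rows are appended when `overwrite` is false.

```python
def existing_table_action(overwrite, auto_create_table):
    """Describe what happens to an existing table before new rows are written.

    Hypothetical summary of the parameter descriptions above.
    """
    if not overwrite:
        # assumption: without overwrite, rows are added to the existing contents
        return 'append'
    if auto_create_table:
        return 'drop and recreate'
    return 'truncate'


for overwrite in (False, True):
    for auto_create in (False, True):
        action = existing_table_action(overwrite, auto_create)
        print(f'overwrite={overwrite}, auto_create_table={auto_create}: {action}')
```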


## Advanced Usage
If lower-level access to the Ray Snowflake connector is needed, the underlying `SnowflakeConnector` and `SnowflakeDatasource` can be used.

### Snowflake Connector
The `SnowflakeConnector` class holds the connection properties and logic required to establish a connection with Snowflake. Internally, it calls the native Snowflake Python API to read from and write to Snowflake tables in parallel across the cluster. The datasource uses the Snowflake Python API's optimized `read_batch` and `write_pandas` methods to enable parallel reads and writes of data.

The connector is also a Python context manager, and uses `with` semantics to define when a connection is established, when database operations are committed, and when the connection is closed.

The code below will read from a sample table using the connector to manage the connection.

In [8]:
from ray.data.datasource import SnowflakeConnector

# query the number of rows, using the connection context to
# manage transactions
with SnowflakeConnector(**connect_props) as con:
    count = con.query_int(f'SELECT COUNT(*) FROM CUSTOMER')

print(count)

150000


Alternatively, you can use `try` blocks with the connector's `open`, `commit` and `close` methods. 

In [9]:
connector = SnowflakeConnector(**connect_props)
try:
    connector.open()
    count = connector.query_int(f'SELECT COUNT(*) FROM CUSTOMER')
finally:
    connector.close()
    
print(count)

150000
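
To make the relationship between the `with` form and the explicit `open`, `commit` and `close` calls concrete, the sketch below implements the same pattern on top of the standard library's `sqlite3` module. `SketchConnector` is a hypothetical class written for this illustration. It is not the `SnowflakeConnector` source, and it assumes the behavior described above: open on entry, commit on successful exit, and always close.

```python
import sqlite3


class SketchConnector:
    """Hypothetical connector: opens on enter, commits on success, always closes."""

    def __init__(self, database):
        self.database = database
        self.connection = None

    def open(self):
        self.connection = sqlite3.connect(self.database)

    def commit(self):
        self.connection.commit()

    def close(self):
        if self.connection is not None:
            self.connection.close()
            self.connection = None

    def query_int(self, sql):
        """Run a query and return the first column of the first row as an int."""
        return int(self.connection.execute(sql).fetchone()[0])

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            # commit only when the block finished without an exception
            if exc_type is None:
                self.commit()
        finally:
            self.close()
        return False  # do not suppress exceptions raised in the block


with SketchConnector(':memory:') as con:
    con.connection.execute('CREATE TABLE CUSTOMER (C_CUSTKEY INTEGER)')
    con.connection.executemany('INSERT INTO CUSTOMER VALUES (?)', [(1,), (2,), (3,)])
    count = con.query_int('SELECT COUNT(*) FROM CUSTOMER')

print(count)  # 3
```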


### Snowflake Datasource
The Snowflake datasource can be used with the Ray Data `read_datasource` and `write_datasource` methods to read from and write to Snowflake databases using the distributed processing capabilities of Ray Data. The datasource uses a `SnowflakeConnector` class that is derived from the `DBAPI2Connector` class.

Below is an example of creating the datasource using the previously defined connection properties, and then using it to read and write.

In [10]:
from ray.data.datasource import SnowflakeDatasource
from ray.data import read_datasource

# create a datasource from a connector
datasource = SnowflakeDatasource(connector)

# use read_datasource to read
ds = read_datasource(
    datasource, 
    table='CUSTOMER'
)
 
ds.limit(3).to_pandas()

Read progress: 100%|██████████| 1/1 [00:00<00:00, 10.53it/s]
Read progress: 100%|██████████| 1/1 [00:00<00:00, 841.89it/s]


|   | C_CUSTKEY | C_NAME | C_ADDRESS | C_NATIONKEY | C_PHONE | C_ACCTBAL | C_MKTSEGMENT | C_COMMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 30001 | Customer#000030001 | Ui1b,3Q71CiLTJn4MbVp,,YCZARIaNTelfst | 4 | 14-526-204-4500 | 8848.47 | MACHINERY | frays wake blithely enticingly ironic asymptote |
| 1 | 30002 | Customer#000030002 | UVBoMtILkQu1J3v | 11 | 21-340-653-9800 | 5221.81 | MACHINERY | he slyly ironic pinto beans wake slyly above t... |
| 2 | 30003 | Customer#000030003 | CuGi9fwKn8JdR | 21 | 31-757-493-7525 | 3014.89 | BUILDING | e furiously alongside of the requests. evenly ... |


In [12]:
# use write_datasource to write
connector = SnowflakeConnector(**write_connect_props)
datasource = SnowflakeDatasource(connector)
ds.write_datasource(
    datasource, 
    table='CUSTOMER_3',
    auto_create_table=True
)

Read progress: 100%|██████████| 19/19 [00:00<00:00, 3238.58it/s]
Write Progress: 100%|██████████| 19/19 [00:10<00:00,  1.74it/s]


### DML and DDL
The connector can also be used for any DDL or DML operations you would normally execute through the Snowflake Python API. These operations just pass through to the underlying Snowflake API. 

The code below creates the objects needed for writing to tables. Note that a commit is issued between the queries so that the first DDL operation executes before the one that depends on it. An alternative is to use two `with` blocks to define the transaction boundaries.

In [14]:
with connector as con:
    con.query(f'CREATE DATABASE IF NOT EXISTS RAY')
    con.commit()
    con.query(f'''
        CREATE OR REPLACE TABLE RAY.PUBLIC.CUSTOMER_COPY
            LIKE SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
    ''')

### Pandas data mapping
The Snowflake Datasource converts Pandas data types using the Snowflake Python Connector API. Data mappings are available from the Snowflake [documentation](https://docs.snowflake.com/en/user-guide/python-connector-pandas.html#snowflake-to-pandas-data-mapping). 

The code below is an example of reading and writing all of the available data types.

In [15]:
with connector as con:
    con.query("""
        CREATE OR REPLACE TABLE SAMPLE_TABLE (
            ID INT,
            SAMPLE_NUMBER NUMBER(6,2),
            SAMPLE_DECIMAL DECIMAL(8,3),
            SAMPLE_FLOAT FLOAT,
            SAMPLE_VARCHAR VARCHAR,
            SAMPLE_BINARY BINARY,
            SAMPLE_INT INT,
            SAMPLE_DATE DATE,
            SAMPLE_TIME TIME,
            SAMPLE_TIMESTAMP_TZ TIMESTAMP_TZ,
            SAMPLE_TIMESTAMP_NTZ TIMESTAMP_NTZ,
            SAMPLE_TIMESTAMP_LTZ TIMESTAMP_LTZ,
            SAMPLE_GEOGRAPHY GEOGRAPHY,
            SAMPLE_VARIANT VARIANT,
            SAMPLE_ARRAY ARRAY,
            SAMPLE_OBJECT OBJECT
        )
    """)
    con.commit()
    con.query("""
        INSERT INTO SAMPLE_TABLE 
        VALUES (
            0,
            1111.11,
            22222.222,
            3.333333333,
            '4444444444',
            '01ffeeddaa',
            6666,
            TO_DATE('2007-07-07'),
            TO_TIME('08:00:00.000'),
            TO_TIMESTAMP_TZ('2009-07-08 08:00:00'),
            TO_TIMESTAMP_NTZ('2010-07-08 08:00:00.000'),
            TO_TIMESTAMP_LTZ('2011-07-08 08:00:00.000'),
            'POINT(-122.35 37.55)',
            NULL,
            NULL,
            NULL
        )
    """)
    con.query("""UPDATE SAMPLE_TABLE SET SAMPLE_VARIANT = to_variant(parse_json('{"key3": "value3", "key4": "value4"}'))""")
    con.query("UPDATE SAMPLE_TABLE SET SAMPLE_ARRAY = [1,'two',3,4]")
    con.query("UPDATE SAMPLE_TABLE SET SAMPLE_OBJECT = {'thirteen':13, 'zero':0}")

sample = read_snowflake(write_connect_props, table='SAMPLE_TABLE')
sample.to_pandas()

Read progress: 100%|██████████| 1/1 [00:00<00:00, 241.04it/s]


|   | ID | SAMPLE_NUMBER | SAMPLE_DECIMAL | SAMPLE_FLOAT | SAMPLE_VARCHAR | SAMPLE_BINARY | SAMPLE_INT | SAMPLE_DATE | SAMPLE_TIME | SAMPLE_TIMESTAMP_TZ | SAMPLE_TIMESTAMP_NTZ | SAMPLE_TIMESTAMP_LTZ | SAMPLE_GEOGRAPHY | SAMPLE_VARIANT | SAMPLE_ARRAY | SAMPLE_OBJECT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1111.11 | 22222.222 | 3.333333 | 4444444444 | `b'\x01\xff\xee\xdd\xaa'` | 6666 | 2007-07-07 | 08:00:00 | 2009-07-08 08:00:00-07:00 | 2010-07-08 08:00:00 | 2011-07-08 08:00:00-07:00 | `{\n "coordinates": [\n -122.35,\n 37.55...` | `{\n "key3": "value3",\n "key4": "value4"\n}` | `[\n 1,\n "two",\n 3,\n 4\n]` | `{\n "thirteen": 13,\n "zero": 0\n}` |


The below code writes the sample data back to Snowflake:

In [17]:
new_sample = sample.drop_columns(['SAMPLE_BINARY']) # binary column write does not work in Snowflake API
new_sample.write_snowflake(
    write_connect_props, 
    table='SAMPLE_TABLE_DEST', 
    auto_create_table=True
)
read_snowflake(
    write_connect_props, 
    table='SAMPLE_TABLE_DEST'
).to_pandas()

ValueError: The size in bytes of the block must be known: (ObjectRef(00ffffffffffffffffffffffffffffffffffffff0300000002000000), BlockMetadata(num_rows=1, size_bytes=None, schema=PandasBlockSchema(names=['ID', 'SAMPLE_NUMBER', 'SAMPLE_DECIMAL', 'SAMPLE_FLOAT', 'SAMPLE_VARCHAR', 'SAMPLE_BINARY', 'SAMPLE_INT', 'SAMPLE_DATE', 'SAMPLE_TIME', 'SAMPLE_TIMESTAMP_TZ', 'SAMPLE_TIMESTAMP_NTZ', 'SAMPLE_TIMESTAMP_LTZ', 'SAMPLE_GEOGRAPHY', 'SAMPLE_VARIANT', 'SAMPLE_ARRAY', 'SAMPLE_OBJECT'], types=[dtype('int64'), dtype('O'), dtype('O'), dtype('float64'), dtype('O'), dtype('O'), dtype('int64'), dtype('O'), dtype('O'), datetime64[ns, pytz.FixedOffset(-420)], dtype('<M8[ns]'), datetime64[ns, America/Los_Angeles], dtype('O'), dtype('O'), dtype('O'), dtype('O')]), input_files=[], exec_stats=None))