# Snowflake + Dask

This notebook shows how to load data from a Snowflake table or query into a Dask dataframe.

## Connect to Snowflake

See [README](README.md) for more details on how to set up the credentials file.

Fill in the variables below based on the Snowflake warehouse and schema you used when running `load-data.sql`:

In [None]:
WAREHOUSE = '<YOUR WAREHOUSE>'
SCHEMA = '<YOUR SCHEMA>'

In [None]:
import yaml
import snowflake.connector

# read the credentials file and close it once it has been parsed
with open('/home/jovyan/snowflake_creds.yml') as f:
    creds = yaml.full_load(f)

# get connection info
conn_info = {
    'warehouse': WAREHOUSE,
    'database': 'NYC_TAXI',
    'schema': SCHEMA,
    **creds,
}
conn = snowflake.connector.connect(**conn_info)

## Set up a query template

We need a query template with a bind variable so that Dask can issue multiple queries, each extracting one slice of the taxi data based on the `pickup_datetime` column. These slices become the partitions of the Dask dataframe. We use a [binding for the Snowflake query](https://docs.snowflake.com/en/user-guide/python-connector-example.html#binding-data) so that we can pass a different date value at execution time.

In [None]:
query = """
SELECT *
FROM taxi_yellow
WHERE
    date(pickup_datetime) = %s
"""

Validate the query with pandas by running it for a single day:

In [None]:
# bind values are passed as a sequence, one entry per %s in the query
cur = conn.cursor().execute(query, ('2019-01-01',))
df = cur.fetch_pandas_all()
len(df), df.memory_usage().sum() / 1e6  # memory size in MB
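The template defined above turns into many queries once it is paired with one bind value per day. Here is a minimal sketch of that fan-out; the helper name `daily_params` and the three-day range are illustrative assumptions rather than part of the original notebook, and nothing in it contacts Snowflake.

```python
from datetime import date, timedelta

# Same template as in the notebook: %s is the bind variable filled in at execution time.
QUERY_TEMPLATE = """
SELECT *
FROM taxi_yellow
WHERE
    date(pickup_datetime) = %s
"""


def daily_params(start, end):
    """Return one bind-parameter tuple per calendar day in [start, end]."""
    n_days = (end - start).days + 1
    return [((start + timedelta(days=i)).isoformat(),) for i in range(n_days)]


# Hypothetical three-day range; each tuple would drive one query and one partition.
params = daily_params(date(2019, 1, 1), date(2019, 1, 3))
for p in params:
    print(p)
```

Each tuple is what would be passed as the second argument of `execute()` alongside the template.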

## Initialize Dask cluster

In [None]:
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster
import time

n_workers = 3
cluster = SaturnCluster(n_workers=n_workers, scheduler_size='medium', worker_size='large', nthreads=2)
client = Client(cluster)
cluster

If you initialized your cluster from this notebook, it might take a few minutes for all of your nodes to become available. Run the cell below to block until all nodes are ready.

> **Pro tip:** Create and/or start your cluster from the "Dask" page in Saturn if you want to get a head start!

In [None]:
while len(client.scheduler_info()['workers']) < n_workers:
    print('Waiting for workers, got', len(client.scheduler_info()['workers']))
    time.sleep(30)
print('Done!')
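The loop above waits indefinitely if a worker never starts. As a sketch, a small polling helper with a timeout could look like the following. `wait_until` and its parameters are hypothetical names introduced here, and the `sleep` and `clock` arguments exist only so the helper can be exercised without real waiting.

```python
import time


def wait_until(predicate, timeout=600.0, interval=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True.

    Returns True once the predicate holds, or False if `timeout` seconds
    pass first.
    """
    deadline = clock() + timeout
    while not predicate():
        if clock() >= deadline:
            return False
        sleep(interval)
    return True


# Stand-in for a cluster whose workers arrive one poll at a time.
workers_seen = {"count": 0}


def three_workers_ready():
    workers_seen["count"] += 1
    return workers_seen["count"] >= 3


ready = wait_until(three_workers_ready, timeout=60, interval=0, sleep=lambda s: None)
print("ready:", ready)
```

In this notebook the helper could be called as `wait_until(lambda: len(client.scheduler_info()['workers']) >= n_workers)`.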

## Load larger data with Dask!

We set up a function with `dask.delayed`. `@delayed` is a decorator that makes a Python function lazy: calling the decorated function does not execute it, but returns a delayed object that represents the eventual return value and can be scheduled on the Dask cluster. `dask.dataframe.from_delayed` takes a list of these delayed objects and concatenates them into a Dask dataframe.

In [None]:
from dask import delayed
import dask.dataframe as dd

In [None]:
print(query)

In [None]:
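@delayed
def load(conn_info, query, day):
    # open a new connection inside the task so that each worker uses its own
    conn = snowflake.connector.connect(**conn_info)
    try:
        cur = conn.cursor().execute(query, (str(day),))
        return cur.fetch_pandas_all()
    finally:
        conn.close()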

In [None]:
out = load(conn_info, query, '2019-01-01')
out

We can call `compute()` to execute the function and see the output (in this case, a pandas dataframe).

In [None]:
type(out.compute())

Now, let's load more days using Dask! First, we want to pull a range of dates where we know data exists. We can run a quick Snowflake query for that.

In [None]:
date_query = """
SELECT
    DISTINCT(DATE(pickup_datetime)) as date
FROM taxi_yellow
WHERE
    -- compare on the date so that trips later in the day on Jan 31 are included
    DATE(pickup_datetime) BETWEEN '2019-01-01' AND '2019-01-31'
"""
dates_df = conn.cursor().execute(date_query).fetch_pandas_all()
dates = dates_df['DATE'].tolist()
dates[:5]

Then, we build up a list of delayed objects that call the `load()` function we created.

In [None]:
delayed_obs = [load(conn_info, query, day) for day in dates]
delayed_obs[:5]

Finally, create a Dask dataframe!

In [None]:
ddf = dd.from_delayed(delayed_obs)
ddf

Notice that the above command ran quickly. This is because Dask is lazy: it only executes the task graph when you perform certain actions, such as writing a file or getting the `len` of the dataframe.

In [None]:
len(ddf)

We can use `repartition()` to introduce more parallelism. This helps downstream processes execute faster by splitting the work across more cores.

In [None]:
ddf = ddf.repartition(npartitions=100)
ddf
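The value of 100 partitions above is a fixed choice. As a rough sketch of one way to derive a number instead, the hypothetical helper below aims for roughly 100 MB per partition while never going below the number of cores. The 100 MB target is a rule of thumb assumed here, not something this notebook prescribes, and the input numbers are invented.

```python
import math


def suggest_npartitions(total_bytes, n_cores, target_partition_bytes=100e6):
    """Suggest a partition count.

    Uses enough partitions to keep each one near the target size, but never
    fewer than the number of cores, so every core has work to do.
    """
    by_size = math.ceil(total_bytes / target_partition_bytes)
    return max(by_size, n_cores)


# Hypothetical numbers: 3 workers x 2 threads, and 2.5 GB of data in memory.
suggested = suggest_npartitions(total_bytes=2.5e9, n_cores=3 * 2)
print(suggested)
```

The result could then be passed as `ddf.repartition(npartitions=suggested)`.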

In [None]:
len(ddf)

The cell below will execute the Snowflake queries across the cluster, compute the row count and size of each partition in parallel, and then aggregate the results to present the row count and size of the entire Dask dataframe.

In [None]:
print(f'Num rows: {len(ddf)}, Size: {ddf.memory_usage(deep=True).sum().compute() / 1e6} MB')

The partitions in the Dask dataframe are pandas dataframes.

In [None]:
ddf_part = ddf.partitions[0].compute()
type(ddf_part)

If we plan on performing a lot of operations using this Dask dataframe (such as training a machine learning model), and the data will fit in the memory of the _cluster_, we should `persist()` the dataframe to perform all the loading up-front.

In [None]:
from dask.distributed import wait

ddf = ddf.persist()
_ = wait(ddf)
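Before persisting, it can help to check that the data is likely to fit in cluster memory. The sketch below is one hedged way to do that: `fits_in_cluster_memory` is a hypothetical helper, the 50% headroom is an arbitrary assumption, and it assumes each worker entry in `client.scheduler_info()['workers']` carries a `memory_limit` value in bytes. A hand-built dictionary stands in for a real cluster here.

```python
def fits_in_cluster_memory(data_bytes, scheduler_info, headroom=0.5):
    """Return True if data_bytes is at most `headroom` times the combined
    memory limits of all workers listed in scheduler_info."""
    total_limit = sum(w["memory_limit"] for w in scheduler_info["workers"].values())
    return data_bytes <= headroom * total_limit


# Hypothetical cluster: three workers with a 16 GB memory limit each.
fake_info = {
    "workers": {f"tcp://10.0.0.{i}:0": {"memory_limit": 16e9} for i in range(3)}
}

small_fits = fits_in_cluster_memory(2.5e9, fake_info)
large_fits = fits_in_cluster_memory(30e9, fake_info)
print(small_fits, large_fits)
```

With a live cluster, the size computed by `ddf.memory_usage(deep=True).sum().compute()` and the output of `client.scheduler_info()` could be passed in directly.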

The following cell should execute much faster than before, because all the data is already loaded into memory.

In [None]:
print(f'Num rows: {len(ddf)}, Size: {ddf.memory_usage(deep=True).sum().compute() / 1e6} MB')