# Snowflake + Pandas

How to load data from a Snowflake table or query into a pandas DataFrame.

## Connect to Snowflake

See [README](README.md) for more details on how to set up the credentials file.

In [None]:
import yaml
import snowflake.connector

# Read the credentials file; its keys are passed to connect() as keyword arguments
with open('/home/jovyan/snowflake_creds.yml') as f:
    creds = yaml.safe_load(f)

conn = snowflake.connector.connect(
    warehouse='COMPUTE_WH',
    database='NYC_TAXI',
    schema='PUBLIC',
    **creds,
)

## Run query

The [Snowflake connector for Python](https://docs.snowflake.com/en/user-guide/python-connector-pandas.html) has `fetch_pandas_all()` and `fetch_pandas_batches()` methods that utilize [Arrow](https://arrow.apache.org/) for fast data exchange.

In [None]:
query = """
SELECT *
FROM taxi_yellow
WHERE
    date_trunc('DAY', tpep_pickup_datetime) = '2020-01-01'
"""
cur = conn.cursor().execute(query)
df = cur.fetch_pandas_all()

In [None]:
df.head()

In [None]:
len(df), df.memory_usage().sum() / 1e6  # memory size in MB

`fetch_pandas_batches()` is useful when the full result doesn't fit in memory but the operations you need can be applied to individual batches. It returns a `generator` that you can loop over, yielding one pandas DataFrame per batch.

In [None]:
cur = conn.cursor().execute(query)
batches = cur.fetch_pandas_batches()
batches

In [None]:
for batch in batches:
    print(len(batch), batch.memory_usage().sum() / 1e6)
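
As a rough sketch of what "operations on individual batches" can look like, the cell below computes a column mean by keeping a running sum and count, so only one batch is in memory at a time. The helper `mean_over_batches`, the stand-in generator `fake_batches`, and the `TRIP_DISTANCE` column name are assumptions made for illustration; they are not part of the Snowflake connector or of this dataset's documented schema.

```python
import pandas as pd


def mean_over_batches(batches, column):
    """Mean of `column` across an iterable of DataFrames (hypothetical helper).

    Keeps a running sum and a running non-null count instead of
    concatenating the batches, so memory use stays at one batch.
    """
    total = 0.0
    count = 0
    for batch in batches:
        values = batch[column]
        total += values.sum()    # Series.sum() skips NaN by default
        count += values.count()  # Series.count() counts non-null values only
    return total / count if count else float("nan")


# Stand-in for cur.fetch_pandas_batches(): a generator of small DataFrames.
# The column name and values are made up for this example.
def fake_batches():
    yield pd.DataFrame({"TRIP_DISTANCE": [1.0, 2.0, 3.0]})
    yield pd.DataFrame({"TRIP_DISTANCE": [4.0, None]})


mean_trip_distance = mean_over_batches(fake_batches(), "TRIP_DISTANCE")
```

With a real cursor, the same helper could be called as `mean_over_batches(cur.fetch_pandas_batches(), column_name)`, where `column_name` is a column that actually exists in the query result. Note that a generator can only be consumed once, so re-run the query before iterating again.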

If the data and/or computation is just too big for pandas on a single node, that's when you move to Dask! Check out the [`snowflake-dask.ipynb`](snowflake-dask.ipynb) notebook for this.