# Snowflake + Pandas

<table>
    <tr>
        <td>
            <img src="../_img/snowflake.png" width="300">
        </td>
    </tr>
</table>

This tutorial describes how to connect to Snowflake, query data stored in a Snowflake table, and work with the results in `pandas`.

<hr>

## Connect to Snowflake

This example uses data stored in a Snowflake data warehouse that is managed by the team at Saturn Cloud. We've set up a read-only user for use in these examples. If you would like to access data stored in your own Snowflake account, see the [README](./README.md).

In [None]:
import os
import snowflake.connector

conn_info = {
    "account": os.environ["EXAMPLE_SNOWFLAKE_ACCOUNT"],
    "user": os.environ["EXAMPLE_SNOWFLAKE_USER"],
    "password": os.environ["EXAMPLE_SNOWFLAKE_PASSWORD"],
    "database": os.environ["TAXI_DATABASE"],
}
conn = snowflake.connector.connect(**conn_info)
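If one of these environment variables is unset, `os.environ[...]` raises a `KeyError` that names only the first missing variable. The following is a minimal sketch of a helper that reports every missing variable at once. The name `build_conn_info` is hypothetical and not part of the connector; the variable names are the ones used in the cell above.

```python
import os

# Maps each connect() keyword to the environment variable used in this notebook.
REQUIRED_VARS = {
    "account": "EXAMPLE_SNOWFLAKE_ACCOUNT",
    "user": "EXAMPLE_SNOWFLAKE_USER",
    "password": "EXAMPLE_SNOWFLAKE_PASSWORD",
    "database": "TAXI_DATABASE",
}


def build_conn_info(env=None):
    """Hypothetical helper: collect connection settings, failing loudly if any are unset."""
    env = os.environ if env is None else env
    # Treat empty strings as missing too.
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in REQUIRED_VARS.items()}
```

With the variables set, the result can be passed along as `snowflake.connector.connect(**build_conn_info())`.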

<hr>

## Run query

The [Snowflake Connector for Python](https://docs.snowflake.com/en/user-guide/python-connector-pandas.html) provides the cursor methods `fetch_pandas_all()` and `fetch_pandas_batches()`, which use [Arrow](https://arrow.apache.org/) for fast data exchange. `fetch_pandas_all()` returns the whole query result as a single `pandas` DataFrame.

In [None]:
query = """
SELECT *
FROM taxi_yellow
WHERE
    date_trunc('DAY', pickup_datetime) = '2020-01-01'
"""
cur = conn.cursor().execute(query)
df = cur.fetch_pandas_all()
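The query above hard-codes the date. One possible way to make the day a parameter is to bind it instead of formatting it into the SQL string. The sketch below assumes the connector's default `pyformat` parameter style (`%s` placeholders) and assumes `pickup_datetime` is a timestamp column, so that a half-open range selects the same rows as the `date_trunc` filter. `day_query` is a hypothetical helper name.

```python
import datetime


def day_query(day):
    """Hypothetical helper: return (sql, params) selecting trips picked up on `day`."""
    sql = """
    SELECT *
    FROM taxi_yellow
    WHERE pickup_datetime >= %s
      AND pickup_datetime < %s
    """
    next_day = day + datetime.timedelta(days=1)
    # Pass ISO date strings as bound values so they stay out of the SQL text.
    return sql, (day.isoformat(), next_day.isoformat())


sql, params = day_query(datetime.date(2020, 1, 1))
print(params)
```

With a live connection, this would be run as `conn.cursor().execute(sql, params)`.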

In [None]:
df.head()

In [None]:
len(df), df.memory_usage().sum() / 1e6  # memory size in MB
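By default, `memory_usage()` counts only the pointers held in `object` columns, so a result with many string values can look smaller than it really is. The sketch below uses a made-up frame (the column names are illustrative, not the taxi schema) to show the effect of `deep=True`:

```python
import pandas as pd

# Synthetic stand-in for a query result; not real taxi data.
sample = pd.DataFrame(
    {
        # Force object dtype so the strings are stored as Python objects.
        "vendor": pd.Series(["alpha", "beta", "gamma"] * 1000, dtype=object),
        "fare": [10.5, 7.25, 30.0] * 1000,
    }
)

shallow_mb = sample.memory_usage().sum() / 1e6
# deep=True also measures the Python string objects themselves.
deep_mb = sample.memory_usage(deep=True).sum() / 1e6
print(shallow_mb, deep_mb)
```

The deep measurement is slower on large frames, so it may be worth running it on a sample rather than the full result.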

`fetch_pandas_batches()` is useful when the full result doesn't fit in memory but the operations you need can be performed on individual batches. It returns a `generator` that you can loop over.

In [None]:
cur = conn.cursor().execute(query)
batches = cur.fetch_pandas_batches()
batches

In [None]:
for batch in batches:
    print(len(batch), batch.memory_usage().sum() / 1e6)
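The loop above only reports batch sizes. As a sketch of the kind of operation that works batch by batch, the helper below computes a mean without ever holding the full result in memory: it keeps a running sum and count. `batch_mean` is a hypothetical name, and the column passed to it is an assumption about the result's schema. It accepts any iterable of DataFrames, so it is shown here on synthetic batches.

```python
import pandas as pd


def batch_mean(batches, column):
    """Hypothetical helper: mean of `column` across an iterable of DataFrames."""
    total = 0.0
    count = 0
    for batch in batches:
        values = batch[column].dropna()  # ignore missing values
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


# Synthetic batches standing in for cur.fetch_pandas_batches().
fake_batches = [
    pd.DataFrame({"FARE": [10.0, 20.0]}),
    pd.DataFrame({"FARE": [30.0, None]}),
]
print(batch_mean(iter(fake_batches), "FARE"))
```

A generator can be consumed only once, so after the loop above `batches` is exhausted; re-run the query to get a fresh one before passing it to a helper like this.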

If the data and/or computation are just too big for pandas on a single node, that's when you move to Dask! Check out the [`snowflake-dask.ipynb`](snowflake-dask.ipynb) notebook for the Dask implementation of this example.

<hr>

## Next Steps

In this tutorial, you learned how to use Snowflake and `snowflake-connector-python` to execute SQL queries over large datasets, and how to read those query results into a `pandas` DataFrame.

If you want to work with large query results, or your post-processing would benefit from more parallelism than a single machine offers, consider reading the result into a Dask DataFrame instead. Try [this Dask + Snowflake notebook](./snowflake-dask.ipynb) to learn how to efficiently read Snowflake query results into Dask collections.

If you want to see how to combine the lessons from this notebook with common machine learning tasks like feature engineering and hyperparameter tuning, try [this scikit-learn notebook](../nyc-taxi-snowflake/hyperparameter-scikit.ipynb).

<hr>