# Hybrid Execution Build Verification Test

In this demo, we will show how you can develop a robust pandas pipelines at all data scales. You will see how pandas on Snowflake intelligently determines whether to execute queries locally with regular pandas or run directly in Snowflake. This allows you to rapidly iterate with your pandas workflows for testing and development on small datasets, while futureproofing your pipelines when you scale up to production data.

In [1]:
import snowflake.snowpark.modin.plugin
import modin.pandas as pd
import numpy as np
import datetime
import pandas as native_pd
from time import perf_counter
from snowflake.snowpark.session import Session; session = Session.builder.create()

Initiating login request with your identity provider. A browser window should have opened for you to complete the login. If you can't see it, check existing browser windows, or your OS settings. Press CTRL+C to abort and try again...
Going to open: https://snowbiz.okta.com/app/snowflake/exk8wfsfryJIn4IWZ2p7/sso/saml?SAMLRequest=jVJdc9owEPwrHvUZSzbOBDRAhoQQ3BKgfCQzeRO2TFTLktHJceivr8xHJ31Ipm%2Ba0%2B7t3u31bt4L6b1xA0KrPgp8gjyuEp0KteujzXrc6iAPLFMpk1rxPjpwQDeDHrBClnRY2Ve15PuKg%2FVcIwW0%2BeijyiiqGQigihUcqE3oavg4paFPKAPgxjo5dKakIJzWq7Ulxbiua79u%2B9rscEgIwaSLHaqBfEMfJMqvNUqjrU60vFDe3UyfSASYRI2EQziFxZl4K9RpBV%2BpbE8goJP1etFazFdr5A0v091pBVXBzYqbN5HwzXJ6MgDOwWo2f57MN6t7H5SuM8lynuiirKzr5rsXzniKpd4Jt6N41EdlLtLr%2FdJUP0iyZdFoej9rF%2Fn8YSz3w1%2B5nHRTwZ7yn8lDxm6tihLkPV0SDZtEY4CKx6rJ0boSCa9aJGoFV%2BuQUNKlUccnQfsFeSOXo1DMHpkXs43Frfjt69yyozlWlvivb8zf806dQWYO32MVxc8vYXmNATRuYkKnS6FHA2bwv%2FP38EfW%2Bdhmbv%2FxaKGlSA7eWJuC2c%2FjCfzgWBFpKztCKS%2BYkMM0NRzAxSSlru8MZ9bdtDUVR3hwUv33qgd%2FAA%3D%3D&RelayState=ver%3A

## Example 1: Working with small/inline-created dataframe is faster

In [2]:
us_holidays = [
    ("New Year's Day", "2025-01-01"),
    ("Martin Luther King Jr. Day", "2025-01-20"),
    ("Presidents' Day", "2025-02-17"),
    ("Memorial Day", "2025-05-26"),
    ("Juneteenth National Independence Day", "2025-06-19"),
    ("Independence Day", "2025-07-04"),
    ("Labor Day", "2025-09-01"),
    ("Columbus Day", "2025-10-13"),
    ("Veterans Day", "2025-11-11"),
    ("Thanksgiving Day", "2025-11-27"),
    ("Christmas Day", "2025-12-25")
]

# Create DataFrame
df_us_holidays = pd.DataFrame(us_holidays, columns=["Holiday", "Date"])

# Convert Date column to datetime
df_us_holidays["Date"] = pd.to_datetime(df_us_holidays["Date"])

In [3]:
assert df_us_holidays.get_backend() == 'Pandas'  # with auto, we should expect this to be local

In [4]:
# Add new columns for transformations
df_us_holidays["Day_of_Week"] = df_us_holidays["Date"].dt.day_name()
df_us_holidays["Month"] = df_us_holidays["Date"].dt.month_name()

In [5]:
df_us_holidays

Unnamed: 0,Holiday,Date,Day_of_Week,Month
0,New Year's Day,2025-01-01,Wednesday,January
1,Martin Luther King Jr. Day,2025-01-20,Monday,January
2,Presidents' Day,2025-02-17,Monday,February
3,Memorial Day,2025-05-26,Monday,May
4,Juneteenth National Independence Day,2025-06-19,Thursday,June
5,Independence Day,2025-07-04,Friday,July
6,Labor Day,2025-09-01,Monday,September
7,Columbus Day,2025-10-13,Monday,October
8,Veterans Day,2025-11-11,Tuesday,November
9,Thanksgiving Day,2025-11-27,Thursday,November


In [6]:
%%time
#Note that without auto-switching, this took 2.5 min
for index, row in df_us_holidays.iterrows():
    print(f"{row['Holiday']} falls on {row['Day_of_Week']}, {row['Month']} {row['Date'].day}, {row['Date'].year}.")

New Year's Day falls on Wednesday, January 1, 2025.
Martin Luther King Jr. Day falls on Monday, January 20, 2025.
Presidents' Day falls on Monday, February 17, 2025.
Memorial Day falls on Monday, May 26, 2025.
Juneteenth National Independence Day falls on Thursday, June 19, 2025.
Independence Day falls on Friday, July 4, 2025.
Labor Day falls on Monday, September 1, 2025.
Columbus Day falls on Monday, October 13, 2025.
Veterans Day falls on Tuesday, November 11, 2025.
Thanksgiving Day falls on Thursday, November 27, 2025.
Christmas Day falls on Thursday, December 25, 2025.
CPU times: user 89.8 ms, sys: 4.07 ms, total: 93.8 ms
Wall time: 92.4 ms


### 💡 Automatic switching speeds up loops/iterations on small data + inline creation of dataframes

## Example 2: When data is filtered the choice of engine changes

Run the following SQL to generate a synthetic dataset with 10M rows of transactions (from 2024-2025 current date)
```sql
CREATE OR REPLACE TABLE revenue_transactions (
    Transaction_ID STRING,
    Date DATE,
    Revenue FLOAT
);

SET num_days = (SELECT DATEDIFF(DAY, '2024-01-01', CURRENT_DATE));
INSERT INTO revenue_transactions (Transaction_ID, Date, Revenue)
SELECT
    UUID_STRING() AS Transaction_ID,
    DATEADD(DAY, UNIFORM(0, $num_days, RANDOM()), '2024-01-01') AS Date,
    UNIFORM(10, 1000, RANDOM()) AS Revenue
FROM TABLE(GENERATOR(ROWCOUNT => 10000000));
```

In [7]:
# Run the following to generate a synthetic dataset with 10M rows of transactions (from 2024-2025 current date)
session.sql('''
CREATE OR REPLACE TABLE revenue_transactions (
    Transaction_ID STRING,
    Date DATE,
    Revenue FLOAT
);''').collect()
session.sql('''SET num_days = (SELECT DATEDIFF(DAY, '2024-01-01', CURRENT_DATE));''').collect()
session.sql('''INSERT INTO revenue_transactions (Transaction_ID, Date, Revenue)
SELECT
    UUID_STRING() AS Transaction_ID,
    DATEADD(DAY, UNIFORM(0, $num_days, RANDOM()), '2024-01-01') AS Date,
    UNIFORM(10, 1000, RANDOM()) AS Revenue
FROM TABLE(GENERATOR(ROWCOUNT => 10000000));
''').collect()

[Row(number of rows inserted=10000000)]

In [8]:
df_transactions = pd.read_snowflake("REVENUE_TRANSACTIONS")

Switcheroo operation: read_snowflake cost to move to pandas: 1000 cost to stay: 250


In [9]:
print(f"The dataset size is {len(df_transactions)} and the data is located in {df_transactions.get_backend()}.")

The dataset size is 10000000 and the data is located in Snowflake.


Perform some operations on 10M rows with Snowflake

In [10]:
df_transactions["DATE"] = pd.to_datetime(df_transactions["DATE"])

In [11]:
%%time
df_transactions.groupby("DATE").sum()["REVENUE"]

Switcheroo operation: sum cost to move to pandas: 1000 cost to stay: 250
CPU times: user 15.4 ms, sys: 4.17 ms, total: 19.5 ms
Wall time: 132 ms


DATE
2024-01-01    10706581.0
2024-01-02    10840213.0
2024-01-03    10771250.0
2024-01-04    10569984.0
2024-01-05    10620351.0
                 ...    
2025-04-11    10741805.0
2025-04-12    10690266.0
2025-04-13    10750746.0
2025-04-14    10766078.0
2025-04-15    10740229.0
Freq: None, Name: REVENUE, Length: 471, dtype: float64

In [12]:
assert df_transactions.get_backend() == "Snowflake"

So far everything has been happening in Snowflake, since we are working with the full dataset (10M rows). 
Next, we demonstrate what happens when we filter the data down to a smaller dataset below our data size threshold for automatic engine switching. 
First, let's perform the filtering directly with pandas. 

In [13]:
df_transactions_filter1 = df_transactions[(df_transactions["DATE"] >= pd.Timestamp.today().date() - pd.Timedelta('7 days')) & (df_transactions["DATE"] < pd.Timestamp.today().date())]

In [14]:
assert df_transactions_filter1.get_backend() == "Snowflake"

In [15]:
print(f"Date range: {df_transactions_filter1['DATE'].min().date()} to {df_transactions_filter1['DATE'].max().date()}. Resulting dataset size: {len(df_transactions_filter1)}")

The current operation leads to materialization and can be slow if the data is large!


Date range: 2025-04-08 to 2025-04-14. Resulting dataset size: 148485


Now that we have a smaller dataframe, this happens in pandas.

In [18]:
%time
df_transactions_filter1 = df_transactions_filter1.groupby("DATE").sum()["REVENUE"]

CPU times: user 4 µs, sys: 2 µs, total: 6 µs
Wall time: 13.1 µs
Switcheroo operation: sum cost to move to pandas: 37 cost to stay: 750


Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

In [19]:
assert df_transactions_filter1.get_backend()=="Pandas"

We saw what happens when we filter with pandas. Now let's look at what happens if we perform filtering via SQL directly in the `read_snowflake` command, so the dataframe upon creation is small.

In [None]:
df_transactions_filter2 = pd.read_snowflake("SELECT * FROM revenue_transactions WHERE Date >= DATEADD( 'days', -7, current_date ) and Date < current_date")

In [None]:
assert df_transactions_filter2.get_backend()=="Pandas"

In [None]:
# Verify the result is same as above
print(f"Date range: {df_transactions_filter2['DATE'].min()} to {df_transactions_filter2['DATE'].max()}. Resulting dataset size: {len(df_transactions_filter2)}")

Once you are in pandas, you can still continue to perform the same operations: 

In [None]:
%time
df_transactions_filter2.groupby("DATE").sum()["REVENUE"]

In [None]:
assert df_transactions_filter2.get_backend() == 'Pandas'

### 💡 Automatic switching means that pandas work well for both small and large data

## Example 3: Combining small and large datasets in the same workflow

Soemtimes you are working with multiple dataframes of different sizes and you need to join them together, what happens in this scenario?
When two dataframes are joined and the two dataframe are coming from different engine, we automatically determine what is the most optimal way to move the data to minimize the cost of data movement.

Continuing with our `df_transactions` and `df_us_holidays` dataset.

In [None]:
print("Quick recap:")
print(f"- df_transactions is {len(df_transactions)} rows and the data is located in {df_transactions.get_backend()}.")
print(f"- df_us_holidays is {len(df_us_holidays)} rows and the data is located in {df_us_holidays.get_backend()}.")

In [None]:
df_transactions["DATE"] = pd.to_datetime(df_transactions["DATE"])

Since `df_us_holidays` is much smaller than `df_transactions`, we moved `df_us_holidays` to Snowflake where `df_transactions` is, to perform the operation.

In [None]:
combined = pd.merge(df_us_holidays, df_transactions, left_on="Date", right_on="DATE")

In [None]:
assert combined.get_backend() == "Snowflake"

### 💡 When we combine multiple dataframes running in different locations, pandas on Snowflake automatically determines where to move the data.

## Example 4: Performing custom `apply` on small dataset

apply is known to be slow in Snowpark pandas since it is implemented as UDF/UDTF, which often comes with a fixed startup time.
Here, we show an example of how performing `apply` on a small dataset is faster with local pandas. 

In this example, we want to forecast using last year's transaction data via a custom apply function. 

In [None]:
def forecast_revenue(df, start_date, end_date):
    # Filter data from last year
    df_filtered = df[(df["DATE"] >= start_date - pd.Timedelta(days=365)) & (df["DATE"] < start_date)]
    
    # Append future dates to daily_avg for prediction
    future_dates = pd.date_range(start=start_date, end=end_date, freq="D")
    df_future = pd.DataFrame({"DATE": future_dates})

    # Group by DATE and calculate the mean revenue
    daily_avg = df_filtered.groupby("DATE")["REVENUE"].mean().reset_index()
    daily_avg["DATE"] = daily_avg["DATE"].astype('datetime64[ns]')
    # Merge future dates with predicted revenue, filling missing values
    df_forecast = df_future.merge(daily_avg, on="DATE", how="left")
    import numpy as np
    # Fill missing predicted revenue with overall mean from last year
    df_forecast["PREDICTED_REVENUE"] = np.nan
    df_forecast["PREDICTED_REVENUE"].fillna(daily_avg["REVENUE"].mean(), inplace=True)
    df_forecast["PREDICTED_REVENUE"] = df_forecast["PREDICTED_REVENUE"].astype("float")
    return df_forecast

First, let's use the `forecast_revenue` function to get the forecast in the date range, based on last year's revenue numbers.

In [None]:
start_date = pd.Timestamp("2025-10-01")
end_date = pd.Timestamp("2025-10-31")
df_forecast = forecast_revenue(df_transactions, start_date, end_date)

The resulting dataframe is very small, since it is only the 1-month window we're performing forecast on, so the backend is running on pandas locally.

In [None]:
assert df_forecast.get_backend() == 'Pandas'

In [None]:
def adjust_for_holiday_weekend(row):
    # For national holidays, revenue down 5% since stores are closed. For weekends, revenue is up 5% due to increased activity.
    if row["DATE"].strftime('%Y-%m-%d') in list(df_us_holidays["Date"].dt.strftime('%Y-%m-%d')): 
        return row["PREDICTED_REVENUE"] * 0.95
    elif row["DATE"].weekday() == 5 or row["DATE"].weekday() == 6: #Saturday/Sundays
        return row["PREDICTED_REVENUE"] * 1.05
    return row["PREDICTED_REVENUE"]

Now if we run `apply` on this dataframe. It will be running with local pandas.

In [None]:
# Adjust for holidays using the apply function
df_forecast["PREDICTED_REVENUE"] = df_forecast.apply(adjust_for_holiday_weekend, axis=1)

In [None]:
assert df_forecast.get_backend() == 'Pandas'

In [2]:
import os
os.environ["HYBRID_DATA_SIZE_TRANSITION_POINT"] = "10"

In [3]:
df_small = pd.DataFrame({'a': [1]*20, 'b': [2]*20}).move_to('Snowflake')

Transferring data from Pandas to Snowflake ...:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
assert df_small.get_backend() == 'Snowflake'

In [5]:
df_small = df_small.head(15)
df_small = df_small.apply(lambda x : x + 1)

Switcheroo operation: apply cost to move to pandas: 875 cost to stay: 250


In [6]:
assert df_small.get_backend() == 'Snowflake'

In [7]:
df_small = df_small.head(5)
df_small = df_small.apply(lambda x : x + 1)

Switcheroo operation: apply cost to move to pandas: 125 cost to stay: 750


Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
assert df_small.get_backend() == 'Pandas'

In [9]:
os.environ["HYBRID_DATA_SIZE_TRANSITION_POINT"] = "1000000"