# Hybrid Execution Build Verification Test

In this demo, we show how you can develop robust pandas pipelines at any data scale. You will see how pandas on Snowflake automatically determines whether to execute queries locally with regular pandas or directly in Snowflake. This lets you iterate rapidly on your pandas workflows for testing and development on small datasets, while future-proofing your pipelines for when you scale up to production data.

In [1]:
import snowflake.snowpark.modin.plugin
import modin.pandas as pd
import numpy as np
import datetime
import pandas as native_pd
from time import perf_counter
from snowflake.snowpark.session import Session; session = Session.builder.create()


## read_csv loads into pandas

In [2]:
fruits = pd.read_csv('data.csv')
fruits.get_backend()
assert fruits.get_backend() == "Pandas"

In [3]:
fruits

,fruit,score
0,apple,1
1,orange,2
2,melon,3
3,grape,4
4,raisin,-1


In [4]:
df = pd.read_csv("s3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv")
assert df.get_backend() == "Pandas"

## inline data is in pandas

In [5]:
us_holidays = [
    ("New Year's Day", "2025-01-01"),
    ("Martin Luther King Jr. Day", "2025-01-20"),
    ("Presidents' Day", "2025-02-17"),
    ("Memorial Day", "2025-05-26"),
    ("Juneteenth National Independence Day", "2025-06-19"),
    ("Independence Day", "2025-07-04"),
    ("Labor Day", "2025-09-01"),
    ("Columbus Day", "2025-10-13"),
    ("Veterans Day", "2025-11-11"),
    ("Thanksgiving Day", "2025-11-27"),
    ("Christmas Day", "2025-12-25")
]

# Create DataFrame
df_us_holidays = pd.DataFrame(us_holidays, columns=["Holiday", "Date"])

# Convert Date column to datetime
df_us_holidays["Date"] = pd.to_datetime(df_us_holidays["Date"])

In [6]:
assert df_us_holidays.get_backend() == 'Pandas'  # with auto, we should expect this to be local

In [7]:
# Add new columns for transformations
df_us_holidays["Day_of_Week"] = df_us_holidays["Date"].dt.day_name()
df_us_holidays["Month"] = df_us_holidays["Date"].dt.month_name()

In [8]:
df_us_holidays

,Holiday,Date,Day_of_Week,Month
0,New Year's Day,2025-01-01,Wednesday,January
1,Martin Luther King Jr. Day,2025-01-20,Monday,January
2,Presidents' Day,2025-02-17,Monday,February
3,Memorial Day,2025-05-26,Monday,May
4,Juneteenth National Independence Day,2025-06-19,Thursday,June
5,Independence Day,2025-07-04,Friday,July
6,Labor Day,2025-09-01,Monday,September
7,Columbus Day,2025-10-13,Monday,October
8,Veterans Day,2025-11-11,Tuesday,November
9,Thanksgiving Day,2025-11-27,Thursday,November


In [9]:
pd.explain()

source,decision,api,index,candidate,mode,metric,value
fruits = pd.read_csv('data.csv'),Pandas,None.read_csv,4,Pandas,auto,move_to_cost,0.0
fruits = pd.read_csv('data.csv'),Pandas,None.read_csv,5,Pandas,auto,other_execute_cost,0.0
fruits = pd.read_csv('data.csv'),Pandas,None.read_csv,6,Pandas,auto,delta,-750.0
"df = pd.read_csv(""s3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv"")",Pandas,None.read_csv,12,Pandas,auto,move_to_cost,0.0
"df = pd.read_csv(""s3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv"")",Pandas,None.read_csv,13,Pandas,auto,other_execute_cost,0.0
"df = pd.read_csv(""s3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv"")",Pandas,None.read_csv,14,Pandas,auto,delta,-750.0
"df_us_holidays = pd.DataFrame(us_holidays, columns=[""Holiday"", ""Date""])",Pandas,DataFrame.__init__,20,Pandas,auto,move_to_cost,0.0
"df_us_holidays = pd.DataFrame(us_holidays, columns=[""Holiday"", ""Date""])",Pandas,DataFrame.__init__,21,Pandas,auto,other_execute_cost,0.0
"df_us_holidays = pd.DataFrame(us_holidays, columns=[""Holiday"", ""Date""])",Pandas,DataFrame.__init__,22,Pandas,auto,delta,-100.0


In [10]:
%%time
# Note: without auto-switching, this loop took 2.5 min
for index, row in df_us_holidays.iterrows():
    print(f"{row['Holiday']} falls on {row['Day_of_Week']}, {row['Month']} {row['Date'].day}, {row['Date'].year}.")

New Year's Day falls on Wednesday, January 1, 2025.
Martin Luther King Jr. Day falls on Monday, January 20, 2025.
Presidents' Day falls on Monday, February 17, 2025.
Memorial Day falls on Monday, May 26, 2025.
Juneteenth National Independence Day falls on Thursday, June 19, 2025.
Independence Day falls on Friday, July 4, 2025.
Labor Day falls on Monday, September 1, 2025.
Columbus Day falls on Monday, October 13, 2025.
Veterans Day falls on Tuesday, November 11, 2025.
Thanksgiving Day falls on Thursday, November 27, 2025.
Christmas Day falls on Thursday, December 25, 2025.
CPU times: user 142 ms, sys: 4.91 ms, total: 147 ms
Wall time: 145 ms


### 💡 Automatic switching speeds up loops and iteration over small data, as well as inline creation of dataframes

## Example 2: When data is filtered the choice of engine changes

Run the following SQL to generate a synthetic dataset of 10M transaction rows, with dates ranging from 2024-01-01 to the current date:
```sql
CREATE OR REPLACE TABLE revenue_transactions (
    Transaction_ID STRING,
    Date DATE,
    Revenue FLOAT
);

SET num_days = (SELECT DATEDIFF(DAY, '2024-01-01', CURRENT_DATE));
INSERT INTO revenue_transactions (Transaction_ID, Date, Revenue)
SELECT
    UUID_STRING() AS Transaction_ID,
    DATEADD(DAY, UNIFORM(0, $num_days, RANDOM()), '2024-01-01') AS Date,
    UNIFORM(10, 1000, RANDOM()) AS Revenue
FROM TABLE(GENERATOR(ROWCOUNT => 10000000));
```
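For quick local experimentation without a Snowflake session, a scaled-down dataset with the same three columns can be sketched in plain Python. This is only an illustration: `make_transactions` and its parameters are hypothetical names, not part of pandas on Snowflake, and `random.randint` merely stands in for Snowflake's `UNIFORM` with integer bounds.

```python
import random
import uuid
from datetime import date, timedelta


def make_transactions(n_rows, start=date(2024, 1, 1), end=None, seed=0):
    """Return n_rows dicts shaped like the revenue_transactions table.

    Hypothetical helper for local testing; it does not talk to Snowflake.
    """
    end = end or date.today()
    rng = random.Random(seed)          # seeded so repeated runs match
    num_days = (end - start).days      # same role as $num_days in the SQL
    rows = []
    for _ in range(n_rows):
        rows.append({
            "TRANSACTION_ID": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "DATE": start + timedelta(days=rng.randint(0, num_days)),
            "REVENUE": float(rng.randint(10, 1000)),
        })
    return rows


# 1,000 rows instead of 10M, with a fixed end date for reproducibility
sample = make_transactions(1000, end=date(2025, 5, 29))
print(len(sample), sample[0]["DATE"], sample[0]["REVENUE"])
```

The resulting list of dicts can be passed to a `DataFrame` constructor to exercise the same groupby and filter steps on a small scale.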

In [11]:
# Generate a synthetic dataset of 10M transaction rows, with dates ranging from 2024-01-01 to the current date
session.sql('''
CREATE OR REPLACE TABLE revenue_transactions (
    Transaction_ID STRING,
    Date DATE,
    Revenue FLOAT
);''').collect()
session.sql('''SET num_days = (SELECT DATEDIFF(DAY, '2024-01-01', CURRENT_DATE));''').collect()
session.sql('''INSERT INTO revenue_transactions (Transaction_ID, Date, Revenue)
SELECT
    UUID_STRING() AS Transaction_ID,
    DATEADD(DAY, UNIFORM(0, $num_days, RANDOM()), '2024-01-01') AS Date,
    UNIFORM(10, 1000, RANDOM()) AS Revenue
FROM TABLE(GENERATOR(ROWCOUNT => 10000000));
''').collect()

[Row(number of rows inserted=10000000)]

In [12]:
df_transactions = pd.read_snowflake("REVENUE_TRANSACTIONS")

In [13]:
print(f"The dataset size is {len(df_transactions)} and the data is located in {df_transactions.get_backend()}.")

The dataset size is 10000000 and the data is located in Snowflake.


In [14]:
pd.explain()

source,decision,api,index,candidate,mode,metric,value
fruits = pd.read_csv('data.csv'),Pandas,None.read_csv,4,Pandas,auto,move_to_cost,0.0
fruits = pd.read_csv('data.csv'),Pandas,None.read_csv,5,Pandas,auto,other_execute_cost,0.0
fruits = pd.read_csv('data.csv'),Pandas,None.read_csv,6,Pandas,auto,delta,-750.0
"df = pd.read_csv(""s3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv"")",Pandas,None.read_csv,12,Pandas,auto,move_to_cost,0.0
"df = pd.read_csv(""s3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv"")",Pandas,None.read_csv,13,Pandas,auto,other_execute_cost,0.0
"df = pd.read_csv(""s3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv"")",Pandas,None.read_csv,14,Pandas,auto,delta,-750.0
"df_us_holidays = pd.DataFrame(us_holidays, columns=[""Holiday"", ""Date""])",Pandas,DataFrame.__init__,20,Pandas,auto,move_to_cost,0.0
"df_us_holidays = pd.DataFrame(us_holidays, columns=[""Holiday"", ""Date""])",Pandas,DataFrame.__init__,21,Pandas,auto,other_execute_cost,0.0
"df_us_holidays = pd.DataFrame(us_holidays, columns=[""Holiday"", ""Date""])",Pandas,DataFrame.__init__,22,Pandas,auto,delta,-100.0
"df_transactions = pd.read_snowflake(""REVENUE_TRANSACTIONS"")",Pandas,DataFrame.__init__,28,Pandas,auto,move_to_cost,0.0


Perform some operations on the 10M rows in Snowflake:

In [15]:
df_transactions["DATE"] = pd.to_datetime(df_transactions["DATE"])

In [16]:
%%time
df_transactions.groupby("DATE").sum()["REVENUE"]

CPU times: user 64.1 ms, sys: 8.95 ms, total: 73.1 ms
Wall time: 215 ms


DATE
2024-01-01    9778454.0
2024-01-02    9797011.0
2024-01-03    9814284.0
2024-01-04    9774611.0
2024-01-05    9827905.0
                ...    
2025-05-25    9748948.0
2025-05-26    9827152.0
2025-05-27    9866154.0
2025-05-28    9837595.0
2025-05-29    9796763.0
Freq: None, Name: REVENUE, Length: 515, dtype: float64

In [17]:
assert df_transactions.get_backend() == "Snowflake"

In [18]:
pd.explain()

source,decision,api,index,candidate,mode,metric,value
"df_transactions[""DATE""] = pd.to_datetime(df_transactions[""DATE""])",Pandas,DataFrame.__init__,45,Pandas,auto,other_execute_cost,0.0
"df_transactions[""DATE""] = pd.to_datetime(df_transactions[""DATE""])",Pandas,DataFrame.__init__,46,Pandas,auto,delta,-1000.0
"df_transactions[""DATE""] = pd.to_datetime(df_transactions[""DATE""])",Pandas,Series.__init__,52,Pandas,auto,move_to_cost,0.0
"df_transactions[""DATE""] = pd.to_datetime(df_transactions[""DATE""])",Pandas,Series.__init__,53,Pandas,auto,other_execute_cost,0.0
"df_transactions[""DATE""] = pd.to_datetime(df_transactions[""DATE""])",Pandas,Series.__init__,54,Pandas,auto,delta,-1000.0
"df_transactions[""DATE""] = pd.to_datetime(df_transactions[""DATE""])",Pandas,Series.__init__,60,Pandas,auto,move_to_cost,0.0
"df_transactions[""DATE""] = pd.to_datetime(df_transactions[""DATE""])",Pandas,Series.__init__,61,Pandas,auto,other_execute_cost,0.0
"df_transactions[""DATE""] = pd.to_datetime(df_transactions[""DATE""])",Pandas,Series.__init__,62,Pandas,auto,delta,-1000.0
<unknown>,Snowflake,DataFrame.__init__,68,Pandas,auto,move_to_cost,0.0
<unknown>,Snowflake,DataFrame.__init__,69,Pandas,auto,other_execute_cost,1000.0


So far everything has been happening in Snowflake, since we are working with the full dataset (10M rows). 
Next, we demonstrate what happens when we filter the data down to a smaller dataset below our data size threshold for automatic engine switching. 
First, let's perform the filtering directly with pandas. 

In [19]:
df_transactions_filter1 = df_transactions[(df_transactions["DATE"] >= pd.Timestamp.today().date() - pd.Timedelta('7 days')) & (df_transactions["DATE"] < pd.Timestamp.today().date())]

In [20]:
assert df_transactions_filter1.get_backend() == "Snowflake"

In [21]:
print(f"Date range: {df_transactions_filter1['DATE'].min().date()} to {df_transactions_filter1['DATE'].max().date()}. Resulting dataset size: {len(df_transactions_filter1)}")



Date range: 2025-05-22 to 2025-05-28. Resulting dataset size: 135811


Now that the dataframe is smaller, the next operation (a groupby) runs in pandas.

In [22]:
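# Note: a bare `%time` line times only an empty statement, so the timings below do not cover the groupby.
# Use `%%time` as the first line of the cell to time the whole cell.
%time
df_transactions_filter1 = df_transactions_filter1.groupby("DATE").sum()["REVENUE"]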

CPU times: user 3 μs, sys: 1e+03 ns, total: 4 μs
Wall time: 9.3 μs


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.sum' with max estimated shape 7x2:   0%|     …

In [23]:
assert df_transactions_filter1.get_backend()=="Pandas"

We saw what happens when we filter with pandas. Now let's look at what happens if we filter with SQL directly in the `read_snowflake` call, so that the dataframe is already small when it is created.

In [24]:
df_transactions_filter2 = pd.read_snowflake("SELECT * FROM revenue_transactions WHERE Date >= DATEADD( 'days', -7, current_date ) and Date < current_date")

Transferring data from Snowflake to Pandas for 'None.read_snowflake' with max estimated shape 135811x3:   0%| …

In [25]:
assert df_transactions_filter2.get_backend()=="Pandas"

In [26]:
# Verify the result is the same as above
print(f"Date range: {df_transactions_filter2['DATE'].min()} to {df_transactions_filter2['DATE'].max()}. Resulting dataset size: {len(df_transactions_filter2)}")

Date range: 2025-05-22 to 2025-05-28. Resulting dataset size: 135811
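Both filters select the same half-open window: the seven full days before today, excluding today itself. A minimal sketch of that window logic in plain Python, assuming a run date of 2025-05-29 (which is what the output above implies); `trailing_window` and `in_window` are hypothetical helper names used only for illustration.

```python
from datetime import date, timedelta


def trailing_window(today, days=7):
    """Return (start, end) for the half-open window [start, end) before `today`."""
    return today - timedelta(days=days), today


def in_window(d, window):
    """True when start <= d < end, mirroring `Date >= ... AND Date < current_date`."""
    start, end = window
    return start <= d < end


window = trailing_window(date(2025, 5, 29))
# Last included day is one day before the exclusive end bound
print(window[0], "to", window[1] - timedelta(days=1))  # 2025-05-22 to 2025-05-28
```

Because the upper bound is exclusive, partial data for the current day never enters the result, which is why both approaches report the same row count.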


Once you are in pandas, you can still continue to perform the same operations: 

In [27]:
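# Note: a bare `%time` line times only an empty statement, so the timings below do not cover the groupby.
# Use `%%time` as the first line of the cell to time the whole cell.
%time
df_transactions_filter2.groupby("DATE").sum()["REVENUE"]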

CPU times: user 1 μs, sys: 0 ns, total: 1 μs
Wall time: 4.05 μs


DATE
2025-05-22    9897879.0
2025-05-23    9763143.0
2025-05-24    9740137.0
2025-05-25    9748948.0
2025-05-26    9827152.0
2025-05-27    9866154.0
2025-05-28    9837595.0
Name: REVENUE, dtype: float64

In [28]:
assert df_transactions_filter2.get_backend() == 'Pandas'

### 💡 Automatic switching means that pandas works well for both small and large data
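As a rough mental model only, the size-based part of the switch can be pictured as a row-count threshold. The sketch below is a toy, not the library's implementation: `choose_backend` is a hypothetical function, the strict `<` comparison is an assumption, and the 10,000,000 default simply mirrors the value this notebook later passes to `NativePandasMaxRows.put`. The real decision is made per operation from the cost metrics reported by `pd.explain()` (`move_to_cost`, `other_execute_cost`, `delta`), which is why the 135,811-row filtered dataframe stayed in Snowflake until the groupby ran.

```python
def choose_backend(estimated_rows, max_local_rows=10_000_000):
    """Toy rule: run locally only when the estimated result is below the row limit.

    Hypothetical illustration; pandas on Snowflake uses a richer cost model.
    """
    return "Pandas" if estimated_rows < max_local_rows else "Snowflake"


for rows in (7, 135_811, 10_000_000):
    print(rows, "->", choose_backend(rows))
```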

## Example 3: Combining small and large datasets in the same workflow

Sometimes you are working with multiple dataframes of different sizes and need to join them together. What happens in this scenario?
When two dataframes are joined and they come from different engines, pandas on Snowflake automatically determines how to move the data so that the cost of data movement is minimized.

We continue with our `df_transactions` and `df_us_holidays` datasets.

In [29]:
print("Quick recap:")
print(f"- df_transactions is {len(df_transactions)} rows and the data is located in {df_transactions.get_backend()}.")
print(f"- df_us_holidays is {len(df_us_holidays)} rows and the data is located in {df_us_holidays.get_backend()}.")

Quick recap:
- df_transactions is 10000000 rows and the data is located in Snowflake.
- df_us_holidays is 11 rows and the data is located in Pandas.


In [30]:
df_transactions["DATE"] = pd.to_datetime(df_transactions["DATE"])

Since `df_us_holidays` is much smaller than `df_transactions`, `df_us_holidays` is moved to Snowflake, where `df_transactions` resides, to perform the merge.

In [31]:
combined = pd.merge(df_us_holidays, df_transactions, left_on="Date", right_on="DATE")

Transferring data from Pandas to Snowflake for 'None.merge' with max estimated shape 11x4:   0%|          | 0/…

In [32]:
assert combined.get_backend() == "Snowflake"

### 💡 When we combine multiple dataframes running in different locations, pandas on Snowflake automatically determines where to move the data.
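The intuition behind the merge above, moving the smaller side to where the larger side already lives, can be sketched as a toy rule. This is an assumption-laden illustration: `plan_merge_move` is a hypothetical function that compares row counts only, whereas the library weighs estimated movement and execution costs.

```python
def plan_merge_move(left, right):
    """Decide which frame to move for a cross-backend merge.

    left/right are (name, backend, rows) tuples. Returns
    (frame_to_move, target_backend), or None when no move is needed.
    Hypothetical toy rule: move the frame with fewer rows.
    """
    (l_name, l_backend, l_rows), (r_name, r_backend, r_rows) = left, right
    if l_backend == r_backend:
        return None
    if l_rows <= r_rows:
        return l_name, r_backend
    return r_name, l_backend


plan = plan_merge_move(
    ("df_us_holidays", "Pandas", 11),
    ("df_transactions", "Snowflake", 10_000_000),
)
print(plan)  # ('df_us_holidays', 'Snowflake')
```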

## Example 4: Performing custom `apply` on small dataset

`apply` is known to be slow in Snowpark pandas because it is implemented as a UDF/UDTF, which often incurs a fixed startup time.
Here, we show an example of how performing `apply` on a small dataset is faster with local pandas.

In this example, we want to forecast using last year's transaction data via a custom apply function. 

In [33]:
def forecast_revenue(df, start_date, end_date):
    # Keep only the 365 days of data preceding start_date
    df_filtered = df[(df["DATE"] >= start_date - pd.Timedelta(days=365)) & (df["DATE"] < start_date)]
    # Build a frame with one row per future date to forecast
    future_dates = pd.date_range(start=start_date, end=end_date, freq="D")
    df_future = pd.DataFrame({"DATE": future_dates})

    # Group by DATE and calculate the mean revenue
    daily_avg = df_filtered.groupby("DATE")["REVENUE"].mean().reset_index()
    daily_avg["DATE"] = daily_avg["DATE"].astype('datetime64[ns]')
    # Left-join the future dates with last year's daily averages
    df_forecast = df_future.merge(daily_avg, on="DATE", how="left")
    # Predict the overall mean of last year's daily averages for every future date
    # (assign the result of fillna instead of using a chained inplace call)
    df_forecast["PREDICTED_REVENUE"] = np.nan
    df_forecast["PREDICTED_REVENUE"] = df_forecast["PREDICTED_REVENUE"].fillna(daily_avg["REVENUE"].mean())
    df_forecast["PREDICTED_REVENUE"] = df_forecast["PREDICTED_REVENUE"].astype("float")
    return df_forecast

First, let's use the `forecast_revenue` function to get the forecast in the date range, based on last year's revenue numbers.

In [34]:
start_date = pd.Timestamp("2025-10-01")
end_date = pd.Timestamp("2025-10-31")
df_forecast = forecast_revenue(df_transactions, start_date, end_date)

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.mean' with max estimated shape 241x1:   0%|  …

The resulting dataframe is very small, since it covers only the 1-month window we are forecasting, so it lives in local pandas.

In [35]:
assert df_forecast.get_backend() == 'Pandas'
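The arithmetic inside `forecast_revenue` is simple enough to check by hand: average the revenue per day over the preceding 365 days, then use the mean of those daily averages as the prediction for every future date. A minimal stdlib sketch of that calculation follows; `baseline_forecast` and the sample rows are made up for illustration and do not use pandas on Snowflake.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def baseline_forecast(rows, start_date, end_date):
    """rows: iterable of (date, revenue). Returns [(future_date, prediction), ...]."""
    window_start = start_date - timedelta(days=365)
    per_day = defaultdict(list)
    for d, revenue in rows:
        if window_start <= d < start_date:      # last year's data only
            per_day[d].append(revenue)
    daily_avg = [mean(values) for values in per_day.values()]
    overall = mean(daily_avg)                   # mean of the daily means
    n_days = (end_date - start_date).days + 1   # end date is inclusive
    return [(start_date + timedelta(days=i), overall) for i in range(n_days)]


rows = [
    (date(2025, 9, 1), 100.0),
    (date(2025, 9, 1), 200.0),   # daily mean for 2025-09-01 is 150
    (date(2025, 9, 2), 50.0),    # daily mean for 2025-09-02 is 50
    (date(2023, 1, 1), 9999.0),  # outside the 365-day window, ignored
]
forecast = baseline_forecast(rows, date(2025, 10, 1), date(2025, 10, 31))
print(len(forecast), forecast[0])
```

Every forecast row carries the same value here, which matches the notebook function: it fills `PREDICTED_REVENUE` with a single overall mean before the holiday and weekend adjustment is applied.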

In [36]:
def adjust_for_holiday_weekend(row):
    # For national holidays, revenue down 5% since stores are closed. For weekends, revenue is up 5% due to increased activity.
    if row["DATE"].strftime('%Y-%m-%d') in list(df_us_holidays["Date"].dt.strftime('%Y-%m-%d')): 
        return row["PREDICTED_REVENUE"] * 0.95
    elif row["DATE"].weekday() == 5 or row["DATE"].weekday() == 6: #Saturday/Sundays
        return row["PREDICTED_REVENUE"] * 1.05
    return row["PREDICTED_REVENUE"]

Now, if we run `apply` on this dataframe, it runs with local pandas.

In [37]:
# Adjust for holidays using the apply function
df_forecast["PREDICTED_REVENUE"] = df_forecast.apply(adjust_for_holiday_weekend, axis=1)

In [38]:
assert df_forecast.get_backend() == 'Pandas'
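The row-wise rule can also be checked in isolation with plain `datetime` objects. The sketch below restates the same logic under stated assumptions: `adjust` is a hypothetical stand-in for `adjust_for_holiday_weekend`, and `HOLIDAYS` holds just two dates copied from the holiday list earlier in this notebook.

```python
from datetime import date

# Columbus Day and Veterans Day 2025, taken from the us_holidays list
HOLIDAYS = {date(2025, 10, 13), date(2025, 11, 11)}


def adjust(day, predicted, holidays=HOLIDAYS):
    """Scale a prediction: -5% on holidays, +5% on weekends, unchanged otherwise."""
    if day in holidays:
        return predicted * 0.95
    if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return predicted * 1.05
    return predicted


for day in (date(2025, 10, 1), date(2025, 10, 4), date(2025, 10, 13)):
    print(day, day.strftime("%A"), adjust(day, 100.0))
```

Using a set of `date` objects makes the holiday lookup constant-time, whereas the notebook version rebuilds a list of formatted strings for every row; on a 31-row dataframe either approach is fast enough.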

In [39]:
#TODO: Waiting on environment settings
from modin.config.envvars import NativePandasMaxRows
NativePandasMaxRows.put(10)

In [40]:
df_small = pd.DataFrame({'a': [1]*20, 'b': [2]*20}).move_to('Snowflake')

Transferring data from Pandas to Snowflake with max estimated shape 20x2:   0%|          | 0/2 [00:00<?, ?it/s…

In [41]:
assert df_small.get_backend() == 'Snowflake'

In [42]:
df_small = df_small.apply(lambda x : x + 1)

In [43]:
assert df_small.get_backend() == 'Snowflake'

In [44]:
df_small = df_small.head(5)
df_small = df_small.apply(lambda x : x + 1)

Transferring data from Snowflake to Pandas for 'DataFrame.apply' with max estimated shape 5x2:   0%|          …

In [45]:
assert df_small.get_backend() == 'Pandas'

In [46]:
NativePandasMaxRows.put(10_000_000)

In [47]:
pd.explain(last=10)

source,decision,api,index,candidate,mode,metric,value
df_small = df_small.apply(lambda x : x + 1),Snowflake,DataFrame.apply,421,Pandas,auto,delta,250.0
df_small = df_small.apply(lambda x : x + 1),Snowflake,DataFrame.__init__,427,Pandas,auto,move_to_cost,0.0
df_small = df_small.apply(lambda x : x + 1),Snowflake,DataFrame.__init__,428,Pandas,auto,other_execute_cost,1000.0
df_small = df_small.apply(lambda x : x + 1),Snowflake,DataFrame.__init__,429,Pandas,auto,delta,0.0
df_small = df_small.head(5),Pandas,DataFrame.__init__,435,Pandas,auto,move_to_cost,0.0
df_small = df_small.head(5),Pandas,DataFrame.__init__,436,Pandas,auto,other_execute_cost,0.0
df_small = df_small.head(5),Pandas,DataFrame.__init__,437,Pandas,auto,delta,-1000.0
df_small = df_small.apply(lambda x : x + 1),Pandas,DataFrame.apply,443,Pandas,auto,move_to_cost,0.0
df_small = df_small.apply(lambda x : x + 1),Pandas,DataFrame.apply,444,Pandas,auto,other_execute_cost,0.0
df_small = df_small.apply(lambda x : x + 1),Pandas,DataFrame.apply,445,Pandas,auto,delta,-750.0


# Bug fix from bug bash, May 2, 2025: constructing a dataframe or series out of a Snowflake dataframe or series should not cause data to move ("Jonathan Shi passing a Series/DF to the DF constructor triggers surprising moves (in)")

In [48]:
input_df = pd.DataFrame(list(range(10))).move_to('snowflake')
assert input_df.get_backend() == 'Snowflake'
assert pd.DataFrame(input_df).get_backend() == 'Snowflake'
assert pd.DataFrame({'col0': input_df[0]}).get_backend() == 'Snowflake'
assert pd.DataFrame(input_df[0]).get_backend() == 'Snowflake'

pandas_df = pd.DataFrame(list(range(10)))
assert pandas_df.get_backend() == 'Pandas'
assert pd.DataFrame(pandas_df).get_backend() == 'Pandas'
assert pd.DataFrame({'col0': pandas_df[0]}).get_backend() == 'Pandas'
assert pd.DataFrame(pandas_df[0]).get_backend() == 'Pandas'



Transferring data from Pandas to Snowflake with max estimated shape 10x1:   0%|          | 0/2 [00:00<?, ?it/s…

# Test that post-op switch for groupby agg works

In [49]:
groupby_reductions = (
    # skip tail because of bug in upstream snowpark-python:
    # pd.DataFrame([[0, 1], [2, 3]]).groupby(0)[1].tail() fails with
    # some indexing error.
    # "tail",
    "var",
    "std",
    "sum",
    # "sem",
    "max",
    "mean",
    "min",
    "count",
    "nunique",
)

small_snow_df = pd.DataFrame([[0, 1], [2, 3]]).move_to('snowflake')

for operation in groupby_reductions:
    print(f"Testing {operation}")
    assert small_snow_df.get_backend() == 'Snowflake'    
    dataframe_groupby_result = getattr(small_snow_df.groupby(0), operation)()
    assert dataframe_groupby_result.get_backend() == 'Pandas'
    assert small_snow_df.get_backend() == 'Snowflake'
    series_groupby_result = getattr(small_snow_df.groupby(0)[1], operation)()
    assert series_groupby_result.get_backend() == 'Pandas'
    assert small_snow_df.get_backend() == 'Snowflake'


Transferring data from Pandas to Snowflake with max estimated shape 2x2:   0%|          | 0/2 [00:00<?, ?it/s]

Testing var


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.var' with max estimated shape 2x1:   0%|     …

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.var' with max estimated shape 2x1:   0%|     …

Testing std


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.std' with max estimated shape 2x1:   0%|     …

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.std' with max estimated shape 2x1:   0%|     …

Testing sum


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.sum' with max estimated shape 2x1:   0%|     …

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.sum' with max estimated shape 2x1:   0%|     …

Testing max


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.max' with max estimated shape 2x1:   0%|     …

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.max' with max estimated shape 2x1:   0%|     …

Testing mean


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.mean' with max estimated shape 2x1:   0%|    …

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.mean' with max estimated shape 2x1:   0%|    …

Testing min


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.min' with max estimated shape 2x1:   0%|     …

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.min' with max estimated shape 2x1:   0%|     …

Testing count


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.count' with max estimated shape 2x1:   0%|   …

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.count' with max estimated shape 2x1:   0%|   …

Testing nunique


Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.nunique' with max estimated shape 2x1:   0%| …

Transferring data from Snowflake to Pandas for 'DataFrameGroupBy.nunique' with max estimated shape 2x1:   0%| …

# df and series to_pandas() should be available on pandas backend: https://snowflakecomputing.atlassian.net/browse/SNOW-2106995

In [3]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
assert df.get_backend() == 'Pandas'
assert df.to_pandas().equals(df._to_pandas())
assert df['a'].to_pandas().equals(df['a']._to_pandas())