# pandas on Snowflake - Hybrid Execution Demo

In this demo, we will show how you can develop robust pandas pipelines at all data scales. You will see how pandas on Snowflake intelligently determines whether to execute queries locally with regular pandas or run them directly in Snowflake. This allows you to iterate rapidly on your pandas workflows for testing and development on small datasets, while future-proofing your pipelines for when you scale up to production data.

In [1]:
import snowflake.snowpark.modin.plugin
import modin.pandas as pd
import numpy as np
import datetime
import pandas as native_pd
from time import perf_counter
from snowflake.snowpark.session import Session

session = Session.builder.create()


## Generate Data Tables
First let's generate synthetic data. Note that this will take a while and only needs to be run once. 
```python
def generate_synthetic_table(N, name):  # Generate a synthetic table with N rows of transactions (dated from 2024-01-01 through the current date)
    session.sql(f'''
    CREATE OR REPLACE TABLE revenue_transactions_{name} (
        Transaction_ID STRING,
        Date DATE,
        Revenue FLOAT
    );''').collect()
    session.sql('''SET num_days = (SELECT DATEDIFF(DAY, '2024-01-01', CURRENT_DATE));''').collect()
    session.sql(f'''INSERT INTO revenue_transactions_{name} (Transaction_ID, Date, Revenue)
    SELECT
        UUID_STRING() AS Transaction_ID,
        DATEADD(DAY, UNIFORM(0, $num_days, RANDOM()), '2024-01-01') AS Date,
        UNIFORM(10, 1000, RANDOM()) AS Revenue
    FROM TABLE(GENERATOR(ROWCOUNT => {N}));
    ''').collect()

generate_synthetic_table(10000000, "10M")
generate_synthetic_table(100000000, "100M")
generate_synthetic_table(1000000000, "1B")
```

## Reading Data From Snowflake

There are two common approaches to reading the data into vanilla pandas.

1) Create a [Snowpark DataFrame](https://docs.snowflake.com/en/developer-guide/snowpark/python/working-with-dataframes#return-the-contents-of-a-dataframe-as-a-pandas-dataframe) and call [`to_pandas`](https://docs.snowflake.com/developer-guide/snowpark/reference/python/latest/snowpark/api/snowflake.snowpark.DataFrame.to_pandas) to export the results into a pandas DataFrame
```python
snowpark_df = session.table("REVENUE_TRANSACTIONS_10M")
native_pd_df = snowpark_df.to_pandas()
```

2) Use the [Snowflake Connector for Python](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-pandas) to query and export results from Snowflake into a pandas DataFrame using [`fetch_pandas_all`](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-api#fetch_pandas_all)

```python
# Create a cursor object
cur = session.connection.cursor()
# Execute a statement that will generate a result set
cur.execute("select * from REVENUE_TRANSACTIONS_10M")
# Fetch all the rows in a cursor and load them into a pandas DataFrame
native_pd_df = cur.fetch_pandas_all()
```

We will use the first approach below and measure the time these operations take. (Note: This may take several minutes!)
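The timing cells in this notebook all repeat the same start/stop pattern around `perf_counter`. As a minimal sketch, that pattern could be wrapped in a small context manager. The name `timed` and the `timings` dictionary are hypothetical helpers introduced here for illustration; they are not part of the notebook or of any Snowflake library.

```python
from contextlib import contextmanager
from time import perf_counter

@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} takes {elapsed:.2f} seconds")

# Stand-in workload; in the notebook this would be a table load or a groupby.
timings = {}
with timed("Summing 1M integers", timings):
    total = sum(range(1_000_000))
```

Recording the elapsed times in a dictionary makes it easy to compare runs afterwards without re-reading the printed output.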

Now let's try to load the 10M row table:

In [2]:
start_time = perf_counter()
table = session.table("REVENUE_TRANSACTIONS_10M")
pandas_df_10M = table.to_pandas()
end_time = perf_counter()
time_10M = end_time-start_time
print(f"Loading 10M rows to pandas dataframes takes {time_10M} seconds")

Loading 10M rows to pandas dataframes takes 25.496354874921963 seconds


Now let's try to load the 100M row table:

In [3]:
start_time = perf_counter()
table = session.table("REVENUE_TRANSACTIONS_100M")
pandas_df_100M = table.to_pandas()
end_time = perf_counter()
time_100M = end_time-start_time
print(f"Loading 100M rows to pandas dataframes takes {time_100M} seconds")

Loading 100M rows to pandas dataframes takes 292.1991420829436 seconds


Now if we try calling `to_pandas` on the 1 billion row table, we get an out-of-memory error that looks something like this:
```python
start_time = perf_counter()
table = session.table("REVENUE_TRANSACTIONS_1B")
pandas_df = table.to_pandas()  # fails here with an out-of-memory error
end_time = perf_counter()
time_1B = end_time-start_time
```
<img src="OOM.png" alt="out of memory" style="width: 500px;"/>
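A back-of-envelope estimate helps explain why the 1 billion row load runs out of memory. The sketch below is a rough approximation under stated assumptions, not a measurement: it assumes `TRANSACTION_ID` arrives as Python string objects, and that `DATE` and `REVENUE` take 8 bytes each. The actual in-memory layout produced by `to_pandas` may differ.

```python
import sys
import uuid

ROWS = 1_000_000_000

# Assumptions (not measured): IDs are stored as Python str objects referenced
# from an object column; DATE and REVENUE each occupy 8 bytes per row.
uuid_object_bytes = sys.getsizeof(str(uuid.uuid4()))  # one 36-character string object
pointer_bytes = 8   # reference held by the object column
revenue_bytes = 8   # float64
date_bytes = 8

bytes_per_row = uuid_object_bytes + pointer_bytes + revenue_bytes + date_bytes
estimate_gb = ROWS * bytes_per_row / 1e9
print(f"~{bytes_per_row} bytes/row, roughly {estimate_gb:.0f} GB for {ROWS:,} rows")
```

Even under these simplified assumptions, the estimate is far beyond the memory of a typical laptop or notebook kernel.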

Now let's try this with Snowpark pandas. We can read the table directly using the Snowpark pandas `read_snowflake` command. Snowpark pandas pushes the computation down to run on Snowflake, so we can operate on the data directly without pulling the full dataset into memory.

In [4]:
# Read data into a Snowpark pandas df 
start = perf_counter()
snowpandas_df_10M = pd.read_snowflake("REVENUE_TRANSACTIONS_10M")
end = perf_counter()
snowtime_10M = end - start
print(f"Loading 10M rows using Snowpark pandas takes {snowtime_10M} seconds")

Loading 10M rows using Snowpark pandas takes 2.2936937080230564 seconds


In [5]:
start = perf_counter()
snowpandas_df_100M = pd.read_snowflake("REVENUE_TRANSACTIONS_100M")
end = perf_counter()
snowtime_100M = end - start
print(f"Loading 100M rows using Snowpark pandas takes {snowtime_100M} seconds")

Loading 100M rows using Snowpark pandas takes 1.9097921670181677 seconds


In [6]:
start = perf_counter()
snowpandas_df_1B = pd.read_snowflake("REVENUE_TRANSACTIONS_1B")
end = perf_counter()
snowtime_1B = end - start
print(f"Loading 1B rows using Snowpark pandas takes {snowtime_1B} seconds")

Loading 1B rows using Snowpark pandas takes 3.130773166078143 seconds


As you can see, calling `read_snowflake` on data of any size takes no more than a few seconds, even for the 1 billion row dataset where pandas runs out of memory. We can also perform some operations on these dataframes, which finish in a few seconds.

In contrast, running `pandas_df_10M.groupby("DATE").sum()["REVENUE"]` in pandas just hangs, forcing us to kill the Python kernel after some time.

In [7]:
start = perf_counter()
snowpandas_groupby = snowpandas_df_10M.groupby("DATE").sum()["REVENUE"]
print(snowpandas_groupby)
end = perf_counter()
snowpandas_time_10M = end - start
print(f"Performing groupby operation on 10M rows using Snowpark pandas takes {snowpandas_time_10M} seconds")

DATE
2024-01-01    10636116.0
2024-01-02    10871886.0
2024-01-03    10821764.0
2024-01-04    10912870.0
2024-01-05    10809808.0
                 ...    
2025-04-06    10776615.0
2025-04-07    10938497.0
2025-04-08    10737509.0
2025-04-09    10777891.0
2025-04-10    10715432.0
Name: REVENUE, Length: 466, dtype: float64
Performing groupby operation on 10M rows using Snowpark pandas takes 2.017078666947782 seconds


In [8]:
start = perf_counter()
snowpandas_groupby = snowpandas_df_100M.groupby("DATE").sum()["REVENUE"]
print(snowpandas_groupby)
end = perf_counter()
snowpandas_time_100M = end - start
print(f"Performing groupby operation on 100M rows using Snowpark pandas takes {snowpandas_time_100M} seconds")

DATE
2024-01-01    108024176.0
2024-01-02    108655406.0
2024-01-03    108026482.0
2024-01-04    108137523.0
2024-01-05    108316254.0
                 ...     
2025-04-06    108543057.0
2025-04-07    108243461.0
2025-04-08    108517153.0
2025-04-09    108425654.0
2025-04-10    108194452.0
Name: REVENUE, Length: 466, dtype: float64
Performing groupby operation on 100M rows using Snowpark pandas takes 1.8401533330325037 seconds


In [9]:
start = perf_counter()
snowpandas_groupby = snowpandas_df_1B.groupby("DATE").sum()["REVENUE"]
print(snowpandas_groupby)
end = perf_counter()
snowpandas_time_1B = end - start
print(f"Performing groupby operation on 1B rows using Snowpark pandas takes {snowpandas_time_1B} seconds")

DATE
2024-01-01    1.085032e+09
2024-01-02    1.083547e+09
2024-01-03    1.085693e+09
2024-01-04    1.083782e+09
2024-01-05    1.084499e+09
                  ...     
2025-04-06    1.083556e+09
2025-04-07    1.083573e+09
2025-04-08    1.084559e+09
2025-04-09    1.083182e+09
2025-04-10    1.083395e+09
Name: REVENUE, Length: 466, dtype: float64
Performing groupby operation on 1B rows using Snowpark pandas takes 5.48939924989827 seconds


In summary, with Snowpark pandas we can easily work with datasets of 1 billion rows and run operations in a matter of a few seconds, while regular pandas runs into out-of-memory errors.
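One way to picture why `read_snowflake` returns quickly regardless of table size is lazy evaluation: operations are recorded and only turned into a query when a result is needed. The toy class below illustrates that idea only. `LazyTable` and its methods are hypothetical names, and the SQL it builds is illustrative, not what Snowpark pandas actually generates.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated table; not the Snowpark pandas implementation."""

    def __init__(self, name, steps=()):
        self.name = name
        self.steps = tuple(steps)

    def groupby_sum(self, key, value):
        # Record the operation instead of running it.
        return LazyTable(self.name, self.steps + (("groupby_sum", key, value),))

    def to_sql(self):
        # Only here is the recorded plan turned into a query string.
        query = f"SELECT * FROM {self.name}"
        for op, key, value in self.steps:
            if op == "groupby_sum":
                query = f"SELECT {key}, SUM({value}) AS {value} FROM ({query}) GROUP BY {key}"
        return query

table = LazyTable("REVENUE_TRANSACTIONS_1B")   # no data is read here
plan = table.groupby_sum("DATE", "REVENUE")    # still no data is read
print(plan.to_sql())
```

In this picture, only the small aggregated result ever needs to leave the warehouse, which is consistent with the groupby timings shown above.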

## But life isn't always about big data ... 

Oftentimes you need to start with small data for development and testing. Other times you may be working on projects with data across different scales or problems where your data scale evolves throughout your workflow.

pandas is highly optimized for small data scales, so pushing the computation down to Snowflake is not always the best choice for performance. In general, the performance improvement from Snowpark starts to show up once the data reaches millions of rows (tens of MB).

We recently introduced Hybrid Execution for Snowpark pandas to support working with data at all scales by automatically selecting the engine best suited for your workload. This means that you no longer have to toggle between what runs well in vanilla pandas and what runs well in Snowpark. You can focus on developing your Python code while we figure out the best place to run it.
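To make the idea of automatic engine selection concrete, here is a deliberately simplified sketch of a size-based rule. The threshold value and the `choose_backend` function are hypothetical and chosen only for illustration; the real switching logic is internal to pandas on Snowflake and also depends on where the data currently lives and on which operation is being run.

```python
# Hypothetical cutoff for illustration only; not the library's actual threshold.
ROW_THRESHOLD = 1_000_000

def choose_backend(row_count, threshold=ROW_THRESHOLD):
    """Return the engine that a simple size-based rule would pick."""
    return "Pandas" if row_count < threshold else "Snowflake"

for rows in (11, 150_000, 10_000_000):
    print(f"{rows:>10,} rows -> {choose_backend(rows)}")
```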

Let's take a look at some examples.

## Example 1: Working with small or inline-created dataframes is faster

In [10]:
us_holidays = [
    ("New Year's Day", "2025-01-01"),
    ("Martin Luther King Jr. Day", "2025-01-20"),
    ("Presidents' Day", "2025-02-17"),
    ("Memorial Day", "2025-05-26"),
    ("Juneteenth National Independence Day", "2025-06-19"),
    ("Independence Day", "2025-07-04"),
    ("Labor Day", "2025-09-01"),
    ("Columbus Day", "2025-10-13"),
    ("Veterans Day", "2025-11-11"),
    ("Thanksgiving Day", "2025-11-27"),
    ("Christmas Day", "2025-12-25")
]

# Create DataFrame
df_us_holidays = pd.DataFrame(us_holidays, columns=["Holiday", "Date"])

# Convert Date column to datetime
df_us_holidays["Date"] = pd.to_datetime(df_us_holidays["Date"])

In [11]:
assert df_us_holidays.get_backend() == 'Pandas'  # with auto, we should expect this to be local

In [12]:
# Add new columns for transformations
df_us_holidays["Day_of_Week"] = df_us_holidays["Date"].dt.day_name()
df_us_holidays["Month"] = df_us_holidays["Date"].dt.month_name()

In [13]:
df_us_holidays

| | Holiday | Date | Day_of_Week | Month |
|---|---|---|---|---|
| 0 | New Year's Day | 2025-01-01 | Wednesday | January |
| 1 | Martin Luther King Jr. Day | 2025-01-20 | Monday | January |
| 2 | Presidents' Day | 2025-02-17 | Monday | February |
| 3 | Memorial Day | 2025-05-26 | Monday | May |
| 4 | Juneteenth National Independence Day | 2025-06-19 | Thursday | June |
| 5 | Independence Day | 2025-07-04 | Friday | July |
| 6 | Labor Day | 2025-09-01 | Monday | September |
| 7 | Columbus Day | 2025-10-13 | Monday | October |
| 8 | Veterans Day | 2025-11-11 | Tuesday | November |
| 9 | Thanksgiving Day | 2025-11-27 | Thursday | November |


In [14]:
%%time
#Note that without auto-switching, this took 2.5 min
for index, row in df_us_holidays.iterrows():
    print(f"{row['Holiday']} falls on {row['Day_of_Week']}, {row['Month']} {row['Date'].day}, {row['Date'].year}.")

New Year's Day falls on Wednesday, January 1, 2025.
Martin Luther King Jr. Day falls on Monday, January 20, 2025.
Presidents' Day falls on Monday, February 17, 2025.
Memorial Day falls on Monday, May 26, 2025.
Juneteenth National Independence Day falls on Thursday, June 19, 2025.
Independence Day falls on Friday, July 4, 2025.
Labor Day falls on Monday, September 1, 2025.
Columbus Day falls on Monday, October 13, 2025.
Veterans Day falls on Tuesday, November 11, 2025.
Thanksgiving Day falls on Thursday, November 27, 2025.
Christmas Day falls on Thursday, December 25, 2025.
CPU times: user 78.7 ms, sys: 1.92 ms, total: 80.6 ms
Wall time: 82 ms


### 💡 Automatic switching speeds up loops/iterations on small data + inline creation of dataframes

## Example 2: When data is filtered the choice of engine changes

Run the following SQL to generate a synthetic dataset with 10M rows of transactions (dated from 2024-01-01 through the current date).
```sql
CREATE OR REPLACE TABLE revenue_transactions (
    Transaction_ID STRING,
    Date DATE,
    Revenue FLOAT
);

SET num_days = (SELECT DATEDIFF(DAY, '2024-01-01', CURRENT_DATE));
INSERT INTO revenue_transactions (Transaction_ID, Date, Revenue)
SELECT
    UUID_STRING() AS Transaction_ID,
    DATEADD(DAY, UNIFORM(0, $num_days, RANDOM()), '2024-01-01') AS Date,
    UNIFORM(10, 1000, RANDOM()) AS Revenue
FROM TABLE(GENERATOR(ROWCOUNT => 10000000));
```

In [15]:
# Run the following to generate a synthetic dataset with 10M rows of transactions (dated from 2024-01-01 through the current date)
session.sql('''
CREATE OR REPLACE TABLE revenue_transactions (
    Transaction_ID STRING,
    Date DATE,
    Revenue FLOAT
);''').collect()
session.sql('''SET num_days = (SELECT DATEDIFF(DAY, '2024-01-01', CURRENT_DATE));''').collect()
session.sql('''INSERT INTO revenue_transactions (Transaction_ID, Date, Revenue)
SELECT
    UUID_STRING() AS Transaction_ID,
    DATEADD(DAY, UNIFORM(0, $num_days, RANDOM()), '2024-01-01') AS Date,
    UNIFORM(10, 1000, RANDOM()) AS Revenue
FROM TABLE(GENERATOR(ROWCOUNT => 10000000));
''').collect()

[Row(number of rows inserted=10000000)]

In [16]:
df_transactions = pd.read_snowflake("REVENUE_TRANSACTIONS")

In [17]:
print(f"The dataset size is {len(df_transactions)} and the data is located in {df_transactions.get_backend()}.")

The dataset size is 10000000 and the data is located in Snowflake.


Perform some operations on 10M rows with Snowflake

In [18]:
df_transactions["DATE"] = pd.to_datetime(df_transactions["DATE"])

In [19]:
%%time
df_transactions.groupby("DATE").sum()["REVENUE"]

CPU times: user 24.2 ms, sys: 10.2 ms, total: 34.5 ms
Wall time: 202 ms


DATE
2024-01-01    10982720.0
2024-01-02    10646998.0
2024-01-03    10962135.0
2024-01-04    10812139.0
2024-01-05    10811396.0
                 ...    
2025-04-07    10750499.0
2025-04-08    10781859.0
2025-04-09    10743897.0
2025-04-10    10859036.0
2025-04-11    10856796.0
Freq: None, Name: REVENUE, Length: 467, dtype: float64

In [20]:
assert df_transactions.get_backend() == "Snowflake"

So far everything has been happening in Snowflake, since we are working with the full dataset (10M rows). 
Next, we demonstrate what happens when we filter the data down to a smaller dataset below our data size threshold for automatic engine switching. 
First, let's perform the filtering directly with pandas. 

In [21]:
df_transactions_filter1 = df_transactions[(df_transactions["DATE"] >= pd.Timestamp.today().date() - pd.Timedelta('7 days')) & (df_transactions["DATE"] < pd.Timestamp.today().date())]

In this case, since the data is already in Snowflake, it stays in Snowflake even after the filtering.

In [22]:
assert df_transactions_filter1.get_backend() == "Snowflake"

In [23]:
print(f"Date range: {df_transactions_filter1['DATE'].min().date()} to {df_transactions_filter1['DATE'].max().date()}. Resulting dataset size: {len(df_transactions_filter1)}")

The current operation leads to materialization and can be slow if the data is large!


Date range: 2025-04-03 to 2025-04-09. Resulting dataset size: 149559


Now that we have a smaller dataframe, the next operation runs in pandas.

In [24]:
%time
df_transactions_filter1.groupby("DATE").sum()["REVENUE"]

CPU times: user 4 μs, sys: 0 ns, total: 4 μs
Wall time: 13.1 μs


Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

DATE
2025-04-03    10795147.0
2025-04-04    10883893.0
2025-04-05    10862456.0
2025-04-06    10848465.0
2025-04-07    10750499.0
2025-04-08    10781859.0
2025-04-09    10743897.0
Freq: None, Name: REVENUE, dtype: float64

We saw what happens when we filter with pandas. Now let's look at what happens if we perform filtering via SQL directly in the `read_snowflake` command, so the dataframe upon creation is small.

In [25]:
df_transactions_filter2 = pd.read_snowflake("SELECT * FROM revenue_transactions WHERE Date >= DATEADD( 'days', -7, current_date ) and Date < current_date")

Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

In [26]:
assert df_transactions_filter2.get_backend()=="Pandas"

In [27]:
# Verify the result is same as above
print(f"Date range: {df_transactions_filter2['DATE'].min()} to {df_transactions_filter2['DATE'].max()}. Resulting dataset size: {len(df_transactions_filter2)}")

Date range: 2025-04-04 to 2025-04-10. Resulting dataset size: 149739


Once you are in pandas, you can still continue to perform the same operations: 

In [28]:
%time
df_transactions_filter2.groupby("DATE").sum()["REVENUE"]

CPU times: user 1e+03 ns, sys: 1 μs, total: 2 μs
Wall time: 4.77 μs


DATE
2025-04-04    10883893.0
2025-04-05    10862456.0
2025-04-06    10848465.0
2025-04-07    10750499.0
2025-04-08    10781859.0
2025-04-09    10743897.0
2025-04-10    10859036.0
Name: REVENUE, dtype: float64

### 💡 Automatic switching means that pandas works well for both small and large data

## Example 3: Combining small and large datasets in the same workflow

Sometimes you are working with multiple dataframes of different sizes and need to join them together. What happens in this scenario?
When two dataframes from different engines are joined, we automatically determine how to move the data so that the cost of data movement is minimized.

We continue with our `df_transactions` and `df_us_holidays` datasets.

In [29]:
print("Quick recap:")
print(f"- df_transactions is {len(df_transactions)} rows and the data is located in {df_transactions.get_backend()}.")
print(f"- df_us_holidays is {len(df_us_holidays)} rows and the data is located in {df_us_holidays.get_backend()}.")

Quick recap:
- df_transactions is 10000000 rows and the data is located in Snowflake.
- df_us_holidays is 11 rows and the data is located in Pandas.


In [30]:
df_transactions["DATE"] = pd.to_datetime(df_transactions["DATE"])

Since `df_us_holidays` is much smaller than `df_transactions`, we moved `df_us_holidays` to Snowflake where `df_transactions` is, to perform the operation.

In [31]:
combined = pd.merge(df_us_holidays, df_transactions, left_on="Date", right_on="DATE")

Transferring data from Pandas to Snowflake ...:   0%|          | 0/2 [00:00<?, ?it/s]

In [32]:
assert combined.get_backend() == "Snowflake"
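The outcome of this merge matches a simple rule of thumb: run the join where the larger input already lives, and move only the smaller one. The sketch below encodes that rule as a toy function. `plan_join` is a hypothetical name, and the rule is an assumption made for illustration; the actual decision in pandas on Snowflake may weigh other factors.

```python
def plan_join(left_rows, left_backend, right_rows, right_backend):
    """Toy rule: run the join where the larger input already lives.

    Returns the chosen backend and the number of rows that must be moved.
    """
    if left_backend == right_backend:
        return left_backend, 0
    if left_rows >= right_rows:
        return left_backend, right_rows
    return right_backend, left_rows

# Row counts and locations taken from the recap cell above.
target, moved = plan_join(11, "Pandas", 10_000_000, "Snowflake")
print(f"Join runs in {target}; {moved} rows are transferred")
```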

### 💡 When we combine multiple dataframes running in different locations, pandas on Snowflake automatically determines where to move the data.

## Example 4: Performing custom `apply` on small dataset

`apply` is known to be slow in Snowpark pandas since it is implemented as a UDF/UDTF, which often comes with a fixed startup time.
Here, we show an example of how performing `apply` on a small dataset is faster with local pandas.
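The effect of a fixed startup time can be sketched with a simple linear cost model: total time is the startup cost plus the per-row cost times the number of rows. All numbers below are made up for illustration and are not measurements of Snowflake UDFs or of local pandas; `apply_time` is a hypothetical helper.

```python
def apply_time(rows, startup_s, per_row_s):
    """Simple linear cost model: fixed startup plus per-row work."""
    return startup_s + rows * per_row_s

# Illustrative, made-up parameters (not measurements).
local = dict(startup_s=0.0, per_row_s=50e-6)   # no startup, slower per row
remote = dict(startup_s=5.0, per_row_s=1e-6)   # fixed startup, faster per row

for rows in (31, 10_000, 10_000_000):
    t_local = apply_time(rows, **local)
    t_remote = apply_time(rows, **remote)
    winner = "local" if t_local < t_remote else "remote"
    print(f"{rows:>10,} rows: local {t_local:.3f}s, remote {t_remote:.3f}s -> {winner}")
```

Under a model like this, the startup cost dominates on small inputs, so running locally wins until the row count is large enough for the per-row difference to matter.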

In this example, we want to forecast using last year's transaction data via a custom apply function. 

In [33]:
def forecast_revenue(df, start_date, end_date):
    # Filter data from last year
    df_filtered = df[(df["DATE"] >= start_date - pd.Timedelta(days=365)) & (df["DATE"] < start_date)]
    
    # Build the range of future dates to forecast
    future_dates = pd.date_range(start=start_date, end=end_date, freq="D")
    df_future = pd.DataFrame({"DATE": future_dates})

    # Group by DATE and calculate the mean revenue
    daily_avg = df_filtered.groupby("DATE")["REVENUE"].mean().reset_index()
    daily_avg["DATE"] = daily_avg["DATE"].astype('datetime64[ns]')
    # Merge future dates with last year's daily averages
    df_forecast = df_future.merge(daily_avg, on="DATE", how="left")
    # Predict every future date as the overall mean of last year's daily averages
    df_forecast["PREDICTED_REVENUE"] = np.nan
    df_forecast["PREDICTED_REVENUE"] = df_forecast["PREDICTED_REVENUE"].fillna(daily_avg["REVENUE"].mean())
    df_forecast["PREDICTED_REVENUE"] = df_forecast["PREDICTED_REVENUE"].astype("float")
    return df_forecast

First, let's use the `forecast_revenue` function to get the forecast in the date range, based on last year's revenue numbers.

In [34]:
start_date = pd.Timestamp("2025-10-01")
end_date = pd.Timestamp("2025-10-31")
df_forecast = forecast_revenue(df_transactions, start_date, end_date)

Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

The resulting dataframe is very small, since it only covers the 1-month window we are forecasting, so it is backed by pandas running locally.

In [35]:
assert df_forecast.get_backend() == 'Pandas'

In [36]:
def adjust_for_holiday_weekend(row):
    # For national holidays, revenue down 5% since stores are closed. For weekends, revenue is up 5% due to increased activity.
    if row["DATE"].strftime('%Y-%m-%d') in list(df_us_holidays["Date"].dt.strftime('%Y-%m-%d')): 
        return row["PREDICTED_REVENUE"] * 0.95
    elif row["DATE"].weekday() == 5 or row["DATE"].weekday() == 6: #Saturday/Sundays
        return row["PREDICTED_REVENUE"] * 1.05
    return row["PREDICTED_REVENUE"]

Now if we run `apply` on this dataframe, it runs with local pandas.

In [37]:
# Adjust for holidays using the apply function
df_forecast["PREDICTED_REVENUE"] = df_forecast.apply(adjust_for_holiday_weekend, axis=1)

In [38]:
assert df_forecast.get_backend() == 'Pandas'
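For comparison, the same holiday and weekend adjustment can often be written without a row-wise `apply` by using boolean masks. The sketch below uses native pandas on a small stand-in dataframe; the `forecast` and `holidays` objects and the `ADJUSTED_REVENUE` column are hypothetical and only mirror the shape of `df_forecast` and `df_us_holidays` from this notebook.

```python
import numpy as np
import pandas as native_pd

# Hypothetical stand-ins for df_forecast and the holiday dates.
forecast = native_pd.DataFrame({
    "DATE": native_pd.date_range("2025-10-10", "2025-10-14", freq="D"),
    "PREDICTED_REVENUE": 500.0,
})
holidays = native_pd.to_datetime(["2025-10-13"])  # Columbus Day 2025

is_holiday = forecast["DATE"].isin(holidays)
is_weekend = forecast["DATE"].dt.weekday >= 5  # 5 = Saturday, 6 = Sunday

# Same precedence as adjust_for_holiday_weekend: holiday first, then weekend.
factor = np.where(is_holiday, 0.95, np.where(is_weekend, 1.05, 1.0))
forecast["ADJUSTED_REVENUE"] = forecast["PREDICTED_REVENUE"] * factor
print(forecast)
```

Whether this vectorized form is preferable depends on the workload; for a 31-row forecast either approach is fast once the data is local.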

### 💡 `apply` on a small dataset is much faster with automatic switching, since it runs locally in pandas.

## Example 5: Seamless continuation with downstream code upon backend change

Finally, let's plot the resulting forecast as a visualization. Note that the data type of the dataframe stays the same (`modin.pandas.dataframe.DataFrame`) regardless of the choice of backend. This ensures compatibility and interoperability with any downstream code when the backend changes.

In [39]:
print(f"Altair takes in {type(df_forecast)} with {df_forecast.get_backend()} as backend, since we implement the dataframe interchange protocol")

Altair takes in <class 'modin.pandas.dataframe.DataFrame'> with Pandas as backend, since we implement the dataframe interchange protocol


In [40]:
import altair as alt
alt.data_transformers.disable_max_rows()

chart_predicted = alt.Chart(df_forecast).mark_line(color='blue').encode(
    x='monthdate(DATE):T',
    y=alt.Y('PREDICTED_REVENUE:Q',scale=alt.Scale(domain=[400, 600])),
    tooltip=['DATE', 'PREDICTED_REVENUE']
)
chart_predicted



In [41]:
df_transactions_filtered = df_transactions[
    (df_transactions["DATE"] >= start_date - pd.Timedelta(days=365)) &
    (df_transactions["DATE"] < end_date - pd.Timedelta(days=365))
]
df_transactions_filtered_groupby = df_transactions_filtered.groupby("DATE")["REVENUE"].mean().reset_index()

Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

In [42]:
print(f"Altair takes in {type(df_transactions_filtered_groupby)} with {df_transactions_filtered_groupby.get_backend()} as backend, since we implement the dataframe interchange protocol")

Altair takes in <class 'modin.pandas.dataframe.DataFrame'> with Pandas as backend, since we implement the dataframe interchange protocol


In [43]:
df_forecast_labeled = df_forecast.copy()
df_forecast_labeled['Label'] = 'Predicted Revenue'
df_forecast_labeled = df_forecast_labeled.rename(columns={'PREDICTED_REVENUE': 'Value'})
df_last_year_labeled = df_transactions_filtered_groupby.copy()
df_last_year_labeled['Label'] = 'Revenue'
df_last_year_labeled = df_last_year_labeled.rename(columns={'REVENUE': 'Value'})

# Combine
combined_df = pd.concat([
    df_forecast_labeled[['DATE', 'Value', 'Label']],
    df_last_year_labeled[['DATE', 'Value', 'Label']]
])

# Plot with Value on the Y axis, date on the X axis, and color based on Label
final_chart = alt.Chart(combined_df).mark_line().encode(
    y=alt.Y('Value:Q',scale=alt.Scale(domain=[400, 600])),
    x='monthdate(DATE):T',
    color=alt.Color('Label:N', legend=alt.Legend(title='Type')),
    tooltip=['DATE', 'Value', 'Label']
).properties(
    title='Revenue vs Predicted Revenue (by Value)'
)

final_chart



### 💡 Dataframe type stays the same independent of choice of backend