# Introduction to Snowpark pandas
The Snowpark pandas API lets you run your pandas code directly on your data in Snowflake. Built to replicate the functionality of pandas - including its data isolation and consistency guarantees - the Snowpark pandas API enables you to scale up your traditional pandas pipelines by changing just a few lines of code.

In today's demo, we'll show how you can get started with the Snowpark pandas API. We'll also see that the Snowpark pandas API is very similar to the native pandas API. The results in this notebook come from comparing 1) Snowpark pandas version 1.15.0a1 on a newly created Snowflake warehouse of size 2-XL to 2) pandas 2.2.1 running on macOS Sonoma 14.4.1 on a MacBook Pro with 64 GB memory and an Apple M2 Max CPU.

## Importing Snowpark pandas
Much like Snowpark, Snowpark pandas requires an active `Session` object to connect to your data in Snowflake. In the next cell, we'll be initializing a Session object, and importing both Snowpark pandas and native pandas, as `spd` and `pd` respectively.

In [1]:
import modin.pandas as spd
import snowflake.snowpark.modin.plugin
import pandas as pd
import json
from snowflake.snowpark import Session
# Add the directory three levels above the working directory (assumed to be
# the repository root) to sys.path so that tests.parameters can be imported
from pathlib import Path
import sys
connection_parameters_path = str(Path("__file__").absolute().parent.parent.parent.parent)
sys.path.append(connection_parameters_path)
from tests.parameters import CONNECTION_PARAMETERS

# Create Snowflake Session object
session = Session.builder.configs(CONNECTION_PARAMETERS).create()
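
The `tests.parameters` module used above is specific to the repository this notebook ships with. Outside that repository, one possible approach is to assemble the connection dictionary from environment variables. The sketch below is only an illustration: the environment variable names (for example `SNOWFLAKE_ACCOUNT`) and the helper `connection_parameters_from_env` are hypothetical and not part of Snowpark, while the dictionary keys are the usual Snowpark connection parameter names.

```python
import os
from typing import Mapping, Optional

# Hypothetical environment variable names; adjust them to your own setup.
ENV_TO_PARAMETER = {
    "SNOWFLAKE_ACCOUNT": "account",
    "SNOWFLAKE_USER": "user",
    "SNOWFLAKE_PASSWORD": "password",
    "SNOWFLAKE_ROLE": "role",
    "SNOWFLAKE_WAREHOUSE": "warehouse",
    "SNOWFLAKE_DATABASE": "database",
    "SNOWFLAKE_SCHEMA": "schema",
}
# Assumption: account and user are always needed; other keys depend on your auth setup.
REQUIRED = ("account", "user")


def connection_parameters_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build a connection dictionary from environment variables (illustrative helper)."""
    env = os.environ if env is None else env
    # Keep only the variables that are set to a non-empty value.
    params = {
        key: env[name] for name, key in ENV_TO_PARAMETER.items() if env.get(name)
    }
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise KeyError(f"Missing connection parameters: {', '.join(missing)}")
    return params


# With Snowpark installed, the session would then be created as in the cell above:
# session = Session.builder.configs(connection_parameters_from_env()).create()
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.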



## Getting Started - Reading Data from Snowflake
Today, we'll be analyzing some Stock Timeseries Data from Snowflake's Marketplace. The data is available courtesy of Cybersyn Inc., and can be found [here](https://app.snowflake.com/marketplace/listing/GZTSZAS2KF7/cybersyn-inc-financial-economic-essentials). Let's start by reading the `stock_price_timeseries` table into a DataFrame!

In [2]:
# Read data into a Snowpark pandas df 
from time import perf_counter
start = perf_counter()
spd_df = spd.read_snowflake("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.STOCK_PRICE_TIMESERIES")
end = perf_counter()
data_size = len(spd_df)
print(f"Took {end - start} seconds to read a table with {data_size} rows into Snowpark pandas!")
snow_time = end - start

Took 9.748766167000001 seconds to read a table with 66061030 rows into Snowpark pandas!


In [None]:
# Read data into a local native pandas df - recommended to kill this cell after waiting a few minutes!

from IPython import display
start = perf_counter()

# Create a cursor object.
cur = session.connection.cursor()

# Execute a statement that will generate a result set.
sql = "select * from FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.STOCK_PRICE_TIMESERIES"
cur.execute(sql)

# Fetch the result set from the cursor and deliver it as the pandas DataFrame.
native_pd_df = cur.fetch_pandas_all()
end = perf_counter()
print(f"Native pandas took {end - start} seconds to read the data!")

It takes much longer for native pandas to read the table into memory than for Snowpark pandas to read the table.
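
The cells in this notebook repeat the same `perf_counter` start/end pattern around each operation. As a minimal sketch, that pattern could be wrapped in a small context manager. The helper name `timed` and the `timings` dictionary are hypothetical, and the example times a plain Python computation so that it runs without a Snowflake connection.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the result, and optionally store it in `results`."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} took {elapsed:.3f} seconds")


timings = {}
with timed("Summing one million integers", timings):
    total = sum(range(1_000_000))
```

In this notebook, the same helper could wrap a call such as `spd.read_snowflake(...)`, with the stored timings compared afterwards.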

## Examine The Raw Data
Let's take a look at the data we're going to be working with.

In [14]:
spd_df.head(5)

| | TICKER | ASSET_CLASS | PRIMARY_EXCHANGE_CODE | PRIMARY_EXCHANGE_NAME | VARIABLE | VARIABLE_NAME | DATE | VALUE |
|---|---|---|---|---|---|---|---|---|
| 0 | INDF | ETF-Index Fund Shares | PSE | NYSE ARCA | pre-market_open | Pre-Market Open | 2024-05-08 | 35.965 |
| 1 | ATLC | Equity | NAS | NASDAQ CAPITAL MARKET | pre-market_open | Pre-Market Open | 2024-05-08 | 27.67 |
| 2 | AAPR | ETF-Index Fund Shares | BAT | BATS Z-EXCHANGE | pre-market_open | Pre-Market Open | 2024-05-08 | 24.65 |
| 3 | TEL | Equity | NYS | NEW YORK STOCK EXCHANGE | pre-market_open | Pre-Market Open | 2024-05-08 | 142.6 |
| 4 | CYTH | Equity | NAS | NASDAQ CAPITAL MARKET | pre-market_open | Pre-Market Open | 2024-05-08 | 1.53 |


## Filtering The Data
Let's take a look at some common data transformations - starting with filtering! We'll filter for stocks that are listed on the New York Stock Exchange.

In [5]:
start = perf_counter()
nyse_spd_df = spd_df[(spd_df['PRIMARY_EXCHANGE_CODE'] == 'NYS')]
repr(nyse_spd_df)
end = perf_counter()
print(f"Filtering for stocks belonging to the NYSE took {end - start} seconds in Snowpark pandas")

Filtering for stocks belonging to the NYSE took 1.480584958999998 seconds in Snowpark pandas


Let's try an even more granular filter: the Pre-Market Open of stocks that have the following tickers:
* GOOG (Alphabet, Inc.)
* MSFT (Microsoft)
* SNOW (Snowflake)

In [6]:
start = perf_counter()
filtered_spd_df = spd_df[((spd_df['TICKER'] == 'GOOG') | (spd_df['TICKER'] == 'MSFT') | (spd_df['TICKER'] == 'SNOW')) & (spd_df['VARIABLE_NAME'] == 'Pre-Market Open')]
repr(filtered_spd_df)
end = perf_counter()
print(f"Filtering for the Pre-Market Open price for the above stocks took {end - start} seconds in Snowpark pandas")

Filtering for the Pre-Market Open price for the above stocks took 1.0095136250000678 seconds in Snowpark pandas
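
When the list of tickers grows, chaining `==` comparisons with `|` becomes unwieldy. In pandas the same mask can be written more compactly with `Series.isin`. The sketch below uses native pandas on a tiny synthetic frame (made-up values, same column names) so that it runs without a Snowflake connection. It assumes that `isin` behaves the same way in your Snowpark pandas version, which is worth checking against its list of supported APIs.

```python
import pandas as pd

# Tiny synthetic frame with made-up values, using the same column names as the table above.
df = pd.DataFrame(
    {
        "TICKER": ["GOOG", "MSFT", "SNOW", "AAPL", "GOOG"],
        "VARIABLE_NAME": [
            "Pre-Market Open",
            "Pre-Market Open",
            "All-Day Low",
            "Pre-Market Open",
            "All-Day High",
        ],
        "VALUE": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

tickers = ["GOOG", "MSFT", "SNOW"]

# Chained comparisons, as in the cell above.
chained = df[
    ((df["TICKER"] == "GOOG") | (df["TICKER"] == "MSFT") | (df["TICKER"] == "SNOW"))
    & (df["VARIABLE_NAME"] == "Pre-Market Open")
]

# Equivalent mask built with isin.
compact = df[df["TICKER"].isin(tickers) & (df["VARIABLE_NAME"] == "Pre-Market Open")]

print(compact)
```

Both expressions select the same rows; the `isin` form only needs the `tickers` list to change when more symbols are added.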


## Reshaping the Data
Suppose we want to analyse how various stock prices perform over time. In that case, it may be more helpful to have the values of `VARIABLE_NAME` as columns, with the ticker and date as the index, rather than the current long format. We can accomplish this using the `pivot_table` API!

In [7]:
start = perf_counter()
reshape_df = spd_df.pivot_table(index=["TICKER", "DATE"], columns="VARIABLE_NAME", values="VALUE")
repr(reshape_df)
end = perf_counter()
print(f"Pivoting the DataFrame took {end - start} seconds in Snowpark pandas")

Pivoting the DataFrame took 4.195186291000027 seconds in Snowpark pandas


In [8]:
reshape_df.head(5)

| TICKER | DATE | All-Day High | All-Day Low | Nasdaq Volume | Post-Market Close | Pre-Market Open |
|---|---|---|---|---|---|---|
| A | 2018-05-01 | 66.35 | 65.5 | 439231.0 | 66.23 | 65.64 |
| A | 2018-05-02 | 66.86 | 65.81 | 316586.0 | 65.91 | 66.01 |
| A | 2018-05-03 | 66.46 | 64.85 | 407491.0 | 66.33 | 65.91 |
| A | 2018-05-04 | 67.25 | 65.63 | 269025.0 | 66.99 | 66.06 |
| A | 2018-05-07 | 67.98 | 67.08 | 263454.0 | 67.4 | 67.16 |
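
To make the long-to-wide reshaping concrete, here is a minimal sketch of the same `pivot_table` call in native pandas on a tiny synthetic frame. The tickers `AAA` and `BBB` and all values are made up for illustration; only the column names follow the table used in this notebook.

```python
import pandas as pd

# Synthetic long-format data with made-up values: one row per (ticker, date, variable).
long_df = pd.DataFrame(
    {
        "TICKER": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
        "DATE": pd.to_datetime(
            [
                "2024-01-02",
                "2024-01-02",
                "2024-01-03",
                "2024-01-03",
                "2024-01-02",
                "2024-01-02",
            ]
        ),
        "VARIABLE_NAME": ["All-Day High", "All-Day Low"] * 3,
        "VALUE": [10.5, 9.5, 11.0, 10.0, 20.5, 19.5],
    }
)

# Each distinct VARIABLE_NAME becomes a column; (TICKER, DATE) becomes the row index.
wide_df = long_df.pivot_table(
    index=["TICKER", "DATE"], columns="VARIABLE_NAME", values="VALUE"
)
print(wide_df)
```

In pandas, `pivot_table` aggregates with the mean by default, so duplicate (ticker, date, variable) combinations would be averaged rather than raising an error.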


## Transforming the Data
Now that we have reshaped the data, we can begin to apply some transformations. Let's start with the All-Day Low column: we can resample the data (here in 91-day bins, which approximate quarters) to look at the Quarterly Low for the `GOOG` ticker!

In [9]:
start = perf_counter()
resampled_spd_df_all_quarter_low = reshape_df["All-Day Low"]["GOOG"].resample("91D").min()
repr(resampled_spd_df_all_quarter_low)
end = perf_counter()
print(f"Resampling the DataFrame took {end - start} seconds in Snowpark pandas")

Resampling the DataFrame took 2.099826750000034 seconds in Snowpark pandas


In [10]:
resampled_spd_df_all_quarter_low

DATE
2018-05-01    1006.48
2018-07-31     995.58
2018-10-30     968.09
2019-01-29    1055.85
2019-04-30    1025.06
2019-07-30    1125.00
2019-10-29    1250.79
2020-01-28    1013.54
2020-04-28    1218.04
2020-07-28    1399.96
2020-10-27    1514.61
2021-01-26    1801.22
2021-04-27    2223.89
2021-07-27    2623.00
2021-10-26    2493.01
2022-01-25    2363.60
2022-04-26     107.01
2022-07-26      94.93
2022-10-25      83.00
2023-01-24      88.87
2023-04-25     101.66
2023-07-25     121.54
2023-10-24     121.47
2024-01-23     131.54
2024-04-23     152.77
Freq: None, Name: All-Day Low, dtype: float64
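
Note that `resample("91D")` groups rows into fixed 91-day bins rather than calendar quarters, which is why the labels above are 91 days apart starting from the first date. The following sketch shows that behaviour in native pandas on a synthetic daily series (the values are simply 0, 1, 2, ... and not real prices).

```python
import pandas as pd

# 200 synthetic daily values (0, 1, 2, ...) starting on 2024-01-01.
dates = pd.date_range("2024-01-01", periods=200, freq="D")
prices = pd.Series(range(200), index=dates, name="All-Day Low")

# Fixed 91-day bins, anchored at the first date in the series.
low_91d = prices.resample("91D").min()
print(low_91d)

# Each bin label is exactly 91 days after the previous one.
print(low_91d.index.to_series().diff().dropna().unique())
```

Because the values increase by one per day, the minimum of each bin is the value on the bin's first day, which makes the bin boundaries easy to read off.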

We can even take a look at the quarter-over-quarter fluctuation in prices using the `diff` API!

In [11]:
start = perf_counter()
q_o_q_resampled_spd_df_all_quarter_low = resampled_spd_df_all_quarter_low.diff()
repr(q_o_q_resampled_spd_df_all_quarter_low)
end = perf_counter()
print(f"Diffing the resampled data took {end - start} seconds in Snowpark pandas")

Diffing the resampled data took 0.6961409589999903 seconds in Snowpark pandas


In [12]:
q_o_q_resampled_spd_df_all_quarter_low

DATE
2018-05-01        NaN
2018-07-31     -10.90
2018-10-30     -27.49
2019-01-29      87.76
2019-04-30     -30.79
2019-07-30      99.94
2019-10-29     125.79
2020-01-28    -237.25
2020-04-28     204.50
2020-07-28     181.92
2020-10-27     114.65
2021-01-26     286.61
2021-04-27     422.67
2021-07-27     399.11
2021-10-26    -129.99
2022-01-25    -129.41
2022-04-26   -2256.59
2022-07-26     -12.08
2022-10-25     -11.93
2023-01-24       5.87
2023-04-25      12.79
2023-07-25      19.88
2023-10-24      -0.07
2024-01-23      10.07
2024-04-23      21.23
Freq: None, Name: All-Day Low, dtype: float64
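
As a quick check of what `diff` computes, the sketch below reproduces the first few rows in native pandas, using the first four values from the resampled output above. It also shows `pct_change` for relative changes; whether `pct_change` is available in your Snowpark pandas version is an assumption to verify.

```python
import pandas as pd

# First four values from the resampled output above.
lows = pd.Series(
    [1006.48, 995.58, 968.09, 1055.85],
    index=pd.to_datetime(["2018-05-01", "2018-07-31", "2018-10-30", "2019-01-29"]),
    name="All-Day Low",
)

# diff() subtracts the previous row from each row; the first row has no predecessor.
change = lows.diff()
same_as_shift = lows - lows.shift(1)

# pct_change() divides that difference by the previous value.
relative_change = lows.pct_change()

print(change.round(2))
print(relative_change.round(4))
```

The first entry of both results is `NaN` because there is no earlier period to compare against.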

In [13]:
display.Markdown(data="""## Conclusion\nAs we can see, Snowpark pandas replicates the pandas API while performing computations on large datasets that native pandas typically can't handle, all while keeping your data in Snowflake!""")

## Conclusion
As we can see, Snowpark pandas replicates the pandas API while performing computations on large datasets that native pandas typically can't handle, all while keeping your data in Snowflake!