# Introduction to Snowpark pandas
The Snowpark pandas API lets you run your pandas code directly on your data in Snowflake. Built to replicate the functionality of pandas, including its data isolation and consistency guarantees, the Snowpark pandas API lets you scale up your traditional pandas pipelines by changing just a few lines of code.

In today's demo, we'll take a look at how you can get started with the API, and see how similar it is to vanilla pandas.

## Importing Snowpark pandas
Much like Snowpark, Snowpark pandas requires an active `Session` object to connect to your data in Snowflake. In the next cell, we'll be initializing a Session object, and importing both Snowpark pandas and vanilla pandas, as `spd` and `pd` respectively.

In [None]:
import snowflake.snowpark.modin.pandas as spd
import pandas as pd
import json
from snowflake.snowpark.session import Session
# Create Snowflake Session object
# connection_parameters = json.load(open('creds.json'))
session = Session.builder.create()

In [None]:
# Run a trivial operation to give the new warehouse time to start.
spd.DataFrame([1])

## Getting Started - Reading Data from Snowflake
Today, we'll be analyzing some Stock Timeseries Data from Snowflake's Marketplace. The data is available courtesy of Cybersyn Inc., and can be found [here](https://app.snowflake.com/marketplace/listing/GZTSZAS2KF7/cybersyn-inc-financial-economic-essentials). Let's start by reading the `stock_price_timeseries` table into a DataFrame!

In [None]:
# Read data into a Snowpark pandas df 
from time import perf_counter
start = perf_counter()
spd_df = spd.read_snowflake("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.STOCK_PRICE_TIMESERIES")
end = perf_counter()
data_size = len(spd_df)
print(f"Took {end - start} seconds to read a table with {data_size} rows into Snowpark pandas!")
snow_time = end - start
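
The `start = perf_counter()` / `end = perf_counter()` pattern above repeats in every timed cell of this notebook. As a minimal sketch, the same measurement could be wrapped in a small context manager. The name `timed` and the `results` dictionary are hypothetical helpers introduced here for illustration; they are not part of Snowpark pandas or of the original notebook.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the duration, and optionally store it.

    `timed` is a hypothetical helper, not a Snowpark pandas API.
    """
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            # Keep the measurement so it can be compared later.
            results[label] = elapsed
        print(f"{label} took {elapsed:.3f} seconds")


# Usage with a stand-in workload; in the notebook this would wrap a read or a filter.
timings = {}
with timed("Summing a range", timings):
    total = sum(range(1_000_000))
```

The timed cells in this notebook call `repr(...)` or `len(...)` on the result inside the measured region, presumably so that the work is actually executed before the clock stops. The same call would belong inside the `with` block.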

In [None]:
# Read data into a local Vanilla pandas df - recommended to kill this cell after waiting a few minutes!
from IPython import display
start = perf_counter()

# Create a cursor object.
cur = session.connection.cursor()

# Execute a statement that will generate a result set.
sql = "select * from FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.STOCK_PRICE_TIMESERIES"
cur.execute(sql)

# Fetch the result set from the cursor and deliver it as the pandas DataFrame.
native_pd_df = cur.fetch_pandas_all()
end = perf_counter()
print(f"Vanilla pandas took {end - start} seconds to read the data!")

As we can see, pandas is not able to pull the data into memory even after being given a few minutes.

## Examine The Raw Data
Let's take a look at the data we're going to be working with.

In [None]:
spd_df.head(5).to_pandas()

In [None]:
native_pd_df.head(5)

## Filtering The Data
Let's take a look at some common data transformations - starting with filtering! Let's filter for stocks that are listed on the New York Stock Exchange!

In [None]:
start = perf_counter()
nyse_spd_df = spd_df[(spd_df['PRIMARY_EXCHANGE_CODE'] == 'NYS')]
repr(nyse_spd_df)
end = perf_counter()
print(f"Filtering for stocks belonging to the NYSE took {end - start} seconds in Snowpark pandas")

In [None]:
start = perf_counter()
nyse_native_df = native_pd_df[(native_pd_df['PRIMARY_EXCHANGE_CODE'] == 'NYS')]
repr(nyse_native_df)
end = perf_counter()
print(f"Filtering for stocks belonging to the NYSE took {end - start} seconds in native pandas")

Let's try an even more granular filter - let's filter for the Pre-Market Open of stocks that have the following tickers:
* GOOG (Alphabet, Inc.)
* MSFT (Microsoft)
* SNOW (Snowflake)

In [None]:
start = perf_counter()
filtered_spd_df = spd_df[((spd_df['TICKER'] == 'GOOG') | (spd_df['TICKER'] == 'MSFT') | (spd_df['TICKER'] == 'SNOW')) & (spd_df['VARIABLE_NAME'] == 'Pre-Market Open')]
repr(filtered_spd_df)
end = perf_counter()
print(f"Filtering for the Pre-Market Open price for the above stocks took {end - start} seconds in Snowpark pandas")

In [None]:
start = perf_counter()
filtered_native_df = native_pd_df[((native_pd_df['TICKER'] == 'GOOG') | (native_pd_df['TICKER'] == 'MSFT') | (native_pd_df['TICKER'] == 'SNOW')) & (native_pd_df['VARIABLE_NAME'] == 'Pre-Market Open')]
repr(filtered_native_df)
end = perf_counter()
print(f"Filtering for the Pre-Market Open price for the above stocks took {end - start} seconds in native pandas")
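
The chained `==` comparisons in the two cells above can also be written with `Series.isin`, which stays short as the ticker list grows. The sketch below uses vanilla pandas on a small invented table (the rows and values are made up for illustration). It assumes nothing about Snowpark pandas, so check that `isin` is supported in your version before relying on it there.

```python
import pandas as pd

# Toy stand-in for the stock table; all rows are invented for illustration.
df = pd.DataFrame({
    "TICKER": ["GOOG", "MSFT", "SNOW", "AAPL", "GOOG"],
    "VARIABLE_NAME": [
        "Pre-Market Open", "Pre-Market Open", "All-Day Low",
        "Pre-Market Open", "All-Day Low",
    ],
    "VALUE": [100.0, 200.0, 150.0, 180.0, 99.0],
})

tickers = ["GOOG", "MSFT", "SNOW"]

# The chained form used in the notebook.
chained = df[
    ((df["TICKER"] == "GOOG") | (df["TICKER"] == "MSFT") | (df["TICKER"] == "SNOW"))
    & (df["VARIABLE_NAME"] == "Pre-Market Open")
]

# The same filter expressed as a membership test.
with_isin = df[df["TICKER"].isin(tickers) & (df["VARIABLE_NAME"] == "Pre-Market Open")]

print(with_isin)
```

Both expressions build a boolean mask and select the same rows. Note the parentheses around each comparison: `&` and `|` bind more tightly than `==` in Python.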

## Reshaping the Data
Let's say we wanted to analyse the performance of various stock prices across time - in that case, it may be more helpful to have the values as columns, and the ticker name and date as the index - rather than the current encoding. We can accomplish this using the `pivot_table` API!

In [None]:
start = perf_counter()
reshape_df = spd_df.pivot_table(index=["TICKER", "DATE"], columns="VARIABLE_NAME", values="VALUE")
repr(reshape_df)
end = perf_counter()
print(f"Pivoting the DataFrame took {end - start} seconds in Snowpark pandas")

In [None]:
reshape_df.head(5).to_pandas()

In [None]:
start = perf_counter()
reshape_native_df = native_pd_df.pivot_table(index=["TICKER", "DATE"], columns="VARIABLE_NAME", values="VALUE")
repr(reshape_native_df)
end = perf_counter()
print(f"Pivoting the DataFrame took {end - start} seconds in native pandas")
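
To make the long-to-wide reshape concrete, here is a small vanilla pandas sketch of the same `pivot_table` call on four invented rows (the prices are made up, not taken from the Cybersyn data). Each distinct `VARIABLE_NAME` becomes a column, and each `(TICKER, DATE)` pair becomes one row of the result.

```python
import pandas as pd

# Long format: one row per (ticker, date, variable). Values are invented.
long_df = pd.DataFrame({
    "TICKER": ["GOOG", "GOOG", "MSFT", "MSFT"],
    "DATE": ["2024-01-02", "2024-01-02", "2024-01-02", "2024-01-02"],
    "VARIABLE_NAME": ["All-Day Low", "Pre-Market Open", "All-Day Low", "Pre-Market Open"],
    "VALUE": [137.0, 139.0, 366.0, 373.0],
})

# Wide format: one row per (ticker, date), one column per variable.
wide_df = long_df.pivot_table(index=["TICKER", "DATE"], columns="VARIABLE_NAME", values="VALUE")

print(wide_df)

# Selecting a column and then a ticker leaves a Series indexed by DATE.
goog_low = wide_df["All-Day Low"]["GOOG"]
```

In vanilla pandas, `pivot_table` aggregates with the mean by default, so if the same `(TICKER, DATE, VARIABLE_NAME)` combination appeared more than once, the values would be averaged rather than raising an error.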

## Transforming the Data
Now that we have reformatted the data, we can begin to apply some transformations. Let's start by taking a look at the All-Day Low column for the tickers above - we can resample the data to look at the Quarterly Low for the `GOOG` ticker!

In [None]:
start = perf_counter()
resampled_spd_df_all_quarter_low = reshape_df["All-Day Low"]["GOOG"].resample("91D").min()
repr(resampled_spd_df_all_quarter_low)
end = perf_counter()
print(f"Resampling the DataFrame took {end - start} seconds in Snowpark pandas")

In [None]:
resampled_spd_df_all_quarter_low

In [None]:
pandas_goog_frame = reshape_native_df["All-Day Low"]["GOOG"]
# Native pandas checks that the index is of a datetime type before resampling; Snowpark pandas does not.
pandas_goog_frame.index = pandas_goog_frame.index.astype('datetime64[ns]')
start = perf_counter()
resampled_native_df_all_quarter_low = pandas_goog_frame.resample("91D").min()
repr(resampled_native_df_all_quarter_low)
end = perf_counter()
print(f"Resampling the DataFrame took {end - start} seconds in native pandas")
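
The following vanilla pandas sketch shows what `resample("91D").min()` does on a tiny invented series. The four values and dates are made up for illustration, and `pd.to_datetime` is used here in place of the notebook's `astype('datetime64[ns]')` for the index conversion.

```python
import pandas as pd

# Invented daily lows for one ticker, indexed by date strings.
lows = pd.Series(
    [10.0, 8.0, 12.0, 11.0],
    index=["2023-01-01", "2023-02-15", "2023-04-02", "2023-05-10"],
)

# Vanilla pandas needs a datetime-like index before resampling.
lows.index = pd.to_datetime(lows.index)

# Group the observations into consecutive 91-day windows and keep the minimum of each.
quarterly_low = lows.resample("91D").min()

print(quarterly_low)
```

The first window here starts on the first date in the series and each window is 91 days long, so the bins approximate quarters rather than following calendar quarter boundaries.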

We can even take a look at the quarter-over-quarter fluctuation in prices using the `diff` API!

In [None]:
start = perf_counter()
q_o_q_resampled_spd_df_all_quarter_low = resampled_spd_df_all_quarter_low.diff()
repr(q_o_q_resampled_spd_df_all_quarter_low)
end = perf_counter()
print(f"Diffing the resampled data took {end - start} seconds in Snowpark pandas")

In [None]:
q_o_q_resampled_spd_df_all_quarter_low

In [None]:
start = perf_counter()
q_o_q_resampled_native_df_all_quarter_low = resampled_native_df_all_quarter_low.diff()
repr(q_o_q_resampled_native_df_all_quarter_low)
end = perf_counter()
print(f"Diffing the resampled data took {end - start} seconds in native pandas")

In [None]:
q_o_q_resampled_native_df_all_quarter_low
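
As a quick illustration of what `diff` computes, the vanilla pandas sketch below applies it to three invented quarterly lows (the numbers are made up). Each entry is the current value minus the previous one, and the first entry has no predecessor, so it is `NaN`.

```python
import pandas as pd

# Invented quarterly lows, one value per 91-day window.
quarterly_low = pd.Series(
    [8.0, 11.0, 9.0],
    index=pd.to_datetime(["2023-01-01", "2023-04-02", "2023-07-02"]),
)

# Quarter-over-quarter change: value[i] - value[i - 1].
q_o_q = quarterly_low.diff()

print(q_o_q)
```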

In [None]:
display.Markdown(data="""## Conclusion\nAs we can see, Snowpark pandas is able to replicate the pandas API while performing computations on large data sets that vanilla pandas typically cannot handle, all while keeping your data in Snowflake!""")