# Preamble

In [1]:
%%time

import snowflake.snowpark.modin.plugin
import modin.pandas as pd
import numpy as np
import datetime
import pandas as native_pd
from snowflake.snowpark.session import Session

session = Session.builder.create()

ValueError: Cannot register an extension with the reserved name __init__.

# Use case 1: working with a single small table

In this example, we read the Snowhouse table `SAMPLE_DATA.TPCH_SF1.CUSTOMER`, which has 150k rows and takes up 10.3 MB in Snowflake.

In [2]:
%%time

df = pd.read_snowflake("SAMPLE_DATA.TPCH_SF1.CUSTOMER")

Snapshot source table/view 'SAMPLE_DATA.TPCH_SF1.CUSTOMER' failed due to reason: `003029 (0A000): SQL compilation error:
Cannot clone from a table that was imported from a share.'. Data from source table/view 'SAMPLE_DATA.TPCH_SF1.CUSTOMER' is being copied into a new temporary table 'SNOWPARK_TEMP_TABLE_XBV4LQSS4K' for snapshotting. DataFrame creation might take some time.


AttributeError: 'DataFrame' object has no attribute '_query_compiler'

Just printing the data is visibly slow...

In [3]:
df

NameError: name 'df' is not defined

And certain complex transformations are very slow.

In [13]:
%%time

result = df.groupby('C_NATIONKEY').apply(lambda group: group.C_CUSTKEY.iloc[0] + group.C_CUSTKEY.mean())

CPU times: user 399 ms, sys: 72.3 ms, total: 471 ms
Wall time: 8.74 s


## Let's switch the backend to pandas!

We pay a one-time cost of a few seconds to load the data into memory.

In [14]:
%%time

df = df.move_to('Pandas')

Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 816 ms, sys: 181 ms, total: 997 ms
Wall time: 4.86 s


But now printing is extremely fast...

In [16]:
type(df.dtypes)

pandas.core.series.Series

In [12]:
df

Unnamed: 0,C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_MKTSEGMENT,C_COMMENT
0,30001,Customer#000030001,"Ui1b,3Q71CiLTJn4MbVp,,YCZARIaNTelfst",4,14-526-204-4500,8848.47,MACHINERY,frays wake blithely enticingly ironic asymptote
1,30002,Customer#000030002,UVBoMtILkQu1J3v,11,21-340-653-9800,5221.81,MACHINERY,he slyly ironic pinto beans wake slyly above t...
2,30003,Customer#000030003,CuGi9fwKn8JdR,21,31-757-493-7525,3014.89,BUILDING,e furiously alongside of the requests. evenly ...
3,30004,Customer#000030004,tkR93ReOnf9zYeO,23,33-870-136-4375,3308.55,AUTOMOBILE,ssly bold deposits. final req
4,30005,Customer#000030005,pvq4uDoD8pEwpAE01aesCtbD9WU8qmlsvoFav5,9,19-144-468-5416,-278.54,MACHINERY,ructions behind the pinto beans x-ra
...,...,...,...,...,...,...,...,...
149995,29996,Customer#000029996,BnZVGZiAgcEImNm9iD,7,17-536-308-8025,4035.17,FURNITURE,"ual instructions. bold, silent foxes nag blith..."
149996,29997,Customer#000029997,lTbDYXdQ74JctD UbRbXCqF2b8,9,19-631-777-4123,2015.90,HOUSEHOLD,eodolites detect slyly alongside of the quickl...
149997,29998,Customer#000029998,ZxxiuDruzi98CcymR,23,33-619-315-9722,-810.56,FURNITURE,xpress packages. accounts sleep carefully iron...
149998,29999,Customer#000029999,CuPA4UpgTCYiXrBrpiSO D,12,22-824-951-8333,3865.14,FURNITURE,eposits-- accounts haggle across the slyly per...


And so are most things we can imagine doing with the data.

In [13]:
%%time

result = df.groupby('C_NATIONKEY').apply(lambda group: group.C_CUSTKEY.iloc[0] + group.C_CUSTKEY.mean())
print(result)

C_NATIONKEY
0     104809.648945
1     104849.461925
2     105879.897816
3     104446.823588
4     104679.314262
5     104906.427251
6     105917.388361
7     104694.312288
8     105173.743462
9     105037.669210
10    105448.956232
11    104112.225725
12    104829.006557
13    106009.666832
14    104419.615654
15    105264.778247
16    104199.408269
17    105125.565188
18    104305.006972
19    105746.410656
20    104615.581809
21    105924.703395
22    105530.506581
23    105457.153718
24    104223.723884
dtype: float64
CPU times: user 47.4 ms, sys: 7.52 ms, total: 54.9 ms
Wall time: 51.3 ms
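
For this particular lambda, the same per-group value can also be computed with built-in aggregations instead of `apply`. The sketch below uses native pandas on a small made-up frame (only the column names match `CUSTOMER`; the values are illustrative). It assumes `C_CUSTKEY` has no nulls, because `first` skips nulls while `iloc[0]` does not. Whether this form is faster on the Snowflake backend was not measured here.

```python
import pandas as native_pd

# Tiny illustrative frame; the values are made up, only the column names match the table.
toy = native_pd.DataFrame(
    {
        "C_NATIONKEY": [0, 0, 1, 1, 1],
        "C_CUSTKEY": [10, 30, 5, 7, 9],
    }
)

# Same computation as the notebook's lambda, applied to the C_CUSTKEY column of each group:
# first key in the group plus the group mean.
via_apply = toy.groupby("C_NATIONKEY")["C_CUSTKEY"].apply(
    lambda keys: keys.iloc[0] + keys.mean()
)

# Equivalent built-in aggregations (assumes C_CUSTKEY has no nulls).
agg = toy.groupby("C_NATIONKEY")["C_CUSTKEY"].agg(["first", "mean"])
via_agg = agg["first"] + agg["mean"]

print(via_agg)
```

Both series are indexed by `C_NATIONKEY` and hold one float per nation.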


When we're done with our transformations, we can write the results to Snowflake.

In [14]:
%%time

snow_result = result.rename('result').set_backend('Snowflake')

Transferring data from Pandas to Snowflake ...:   0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 23.4 ms, sys: 7.8 ms, total: 31.2 ms
Wall time: 130 ms


In [15]:
%%time

snow_result.to_snowflake('TEMP.MVASHISHTHA.RESULT', if_exists='replace', index=False)

CPU times: user 19.8 ms, sys: 7.59 ms, total: 27.4 ms
Wall time: 924 ms


# Use case 2: Filtering out most of a large table, then using Python

## Setup

In this example, we read the table `SAMPLE_DATA.TPCH_SF10.LINEITEM`, which is 60M rows and 1.3 GB.

In [16]:
%%time

df = pd.read_snowflake("SAMPLE_DATA.TPCH_SF10.LINEITEM")

Snapshot source table/view 'SAMPLE_DATA.TPCH_SF10.LINEITEM' failed due to reason: `003029 (0A000): SQL compilation error:
Cannot clone from a table that was imported from a share.'. Data from source table/view 'SAMPLE_DATA.TPCH_SF10.LINEITEM' is being copied into a new temporary table 'SNOWPARK_TEMP_TABLE_XB61Q61XI6' for snapshotting. DataFrame creation might take some time.


CPU times: user 92.7 ms, sys: 28 ms, total: 121 ms
Wall time: 10.7 s


The data is large, and pulling it all into pandas would take about 2.5 minutes...

In [18]:
#df.set_backend('Pandas')

Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

Unnamed: 0,L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_LINENUMBER,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT
0,56224388,804093,54110,2,5.0,4985.25,0.04,0.08,N,O,1998-04-25,1998-05-30,1998-05-24,COLLECT COD,MAIL,inst the exp
1,56224388,387466,87467,3,13.0,20194.85,0.04,0.04,N,O,1998-03-29,1998-04-15,1998-04-12,DELIVER IN PERSON,REG AIR,"e regular, spe"
2,56224388,127861,27862,4,26.0,49110.36,0.09,0.02,N,O,1998-05-14,1998-05-25,1998-06-06,TAKE BACK RETURN,MAIL,ve the final ideas cajole slyly
3,56224388,955237,55238,5,38.0,49103.22,0.10,0.01,N,O,1998-03-08,1998-04-19,1998-03-25,NONE,AIR,sts sleep around the slow
4,56224389,16386,41387,1,26.0,33861.88,0.04,0.00,A,F,1995-05-17,1995-06-14,1995-05-18,COLLECT COD,MAIL,ideas. slyly
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59986047,59974884,889130,39147,3,24.0,26858.16,0.03,0.06,R,F,1993-01-10,1993-02-24,1993-01-18,DELIVER IN PERSON,RAIL,the carefully express fo
59986048,59974884,1596930,71976,4,3.0,6080.58,0.08,0.08,R,F,1993-02-14,1993-02-17,1993-03-15,COLLECT COD,AIR,ove the regularly ironic
59986049,59974884,1588342,13358,5,46.0,65792.42,0.08,0.06,R,F,1993-03-21,1993-03-04,1993-04-06,DELIVER IN PERSON,TRUCK,onic requests sleep acco
59986050,59974884,1249071,24108,6,44.0,44880.44,0.09,0.04,A,F,1993-02-07,1993-02-17,1993-02-20,DELIVER IN PERSON,TRUCK,about the ideas. bold


But we only need a sample of the data, so we keep a random 2% of the rows.

In [19]:
%%time

filtered = df.sample(frac=0.02)



CPU times: user 68.1 ms, sys: 22.8 ms, total: 90.9 ms
Wall time: 3.93 s
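
Without a seed, `sample` generally draws a different subset on each run. Here is a minimal native pandas sketch of a reproducible 2% draw. The frame is made up, and whether the Snowflake backend honors `random_state` in the same way is an assumption to verify before relying on it.

```python
import pandas as native_pd

# Made-up stand-in for LINEITEM: 1000 rows with a single key column.
toy = native_pd.DataFrame({"L_ORDERKEY": range(1000)})

# Fixing random_state makes the 2% draw repeatable in native pandas.
sampled = toy.sample(frac=0.02, random_state=0)
again = toy.sample(frac=0.02, random_state=0)

print(len(sampled))
print(sampled.index.equals(again.index))
```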


The sample is small enough that fetching it into pandas is now practical.

In [20]:
%%time

python_filtered = filtered.set_backend('PANDAS')

Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 3.4 s, sys: 864 ms, total: 4.27 s
Wall time: 10.7 s


Now we can do a complex operation on the filtered subset of the data.

In [21]:
%%time

python_filtered.groupby('L_SHIPMODE').apply(lambda group: group.L_ORDERKEY.iloc[0] + group.L_ORDERKEY.mean())

CPU times: user 234 ms, sys: 44 ms, total: 278 ms
Wall time: 276 ms


L_SHIPMODE
AIR        8.620138e+07
FOB        8.619488e+07
MAIL       8.621037e+07
RAIL       8.620723e+07
REG AIR    8.620581e+07
SHIP       8.621327e+07
TRUCK      8.620472e+07
dtype: float64

Doing the same operation on the filtered data in Snowflake takes much longer.

In [22]:
%%time

filtered.groupby('L_SHIPMODE').apply(lambda group: group.L_ORDERKEY.iloc[0] + group.L_ORDERKEY.mean())

CPU times: user 350 ms, sys: 99.9 ms, total: 450 ms
Wall time: 1min 13s


L_SHIPMODE
AIR        8.620138e+07
FOB        8.619488e+07
MAIL       8.621037e+07
RAIL       8.620723e+07
REG AIR    8.620581e+07
SHIP       8.621327e+07
TRUCK      8.620472e+07
dtype: float64
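
To put the two timings side by side, the arithmetic below adds the one-time transfer cost to the pandas run and compares it with the Snowflake run. The numbers are the single wall times shown in the cells above, so treat the ratios as a rough indication rather than a benchmark.

```python
# Wall times copied from the cells above, in seconds (single runs).
transfer_s = 10.7            # filtered.set_backend('PANDAS')
pandas_groupby_s = 0.276     # groupby/apply on the pandas backend
snowflake_groupby_s = 73.0   # groupby/apply on the Snowflake backend (1min 13s)

# Total cost of the pandas route includes moving the sample out of Snowflake.
pandas_total_s = transfer_s + pandas_groupby_s

speedup_compute_only = snowflake_groupby_s / pandas_groupby_s
speedup_with_transfer = snowflake_groupby_s / pandas_total_s

print(f"pandas route, including transfer: {pandas_total_s:.1f} s")
print(f"speedup, compute only: {speedup_compute_only:.0f}x")
print(f"speedup, including transfer: {speedup_with_transfer:.1f}x")
```

The transfer is paid once, so each further operation on `python_filtered` tilts the comparison further toward the pandas backend.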

In [2]:
# Use case 3: Intermingling engines

In [3]:
snow_df = pd.read_snowflake("SAMPLE_DATA.TPCH_SF10.LINEITEM")

Snapshot source table/view 'SAMPLE_DATA.TPCH_SF10.LINEITEM' failed due to reason: `003029 (0A000): SQL compilation error:
Cannot clone from a table that was imported from a share.'. Data from source table/view 'SAMPLE_DATA.TPCH_SF10.LINEITEM' is being copied into a new temporary table 'SNOWPARK_TEMP_TABLE_MFNFUL1KHJ' for snapshotting. DataFrame creation might take some time.


In [4]:
%%time

native_df = snow_df['L_SHIPMODE'].value_counts().move_to('Pandas')
native_df

Transferring data from Snowflake to Pandas ...:   0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 171 ms, sys: 78.9 ms, total: 250 ms
Wall time: 1.77 s


L_SHIPMODE
RAIL       8571844
SHIP       8571402
REG AIR    8570280
FOB        8569760
MAIL       8569053
TRUCK      8567549
AIR        8566164
Name: count, dtype: int64

In [5]:
%%time
native_df.sort_values().index[-1]

CPU times: user 5.67 ms, sys: 787 µs, total: 6.45 ms
Wall time: 6.35 ms


'RAIL'
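
Sorting the whole series just to read its last label works, but `idxmax` expresses the same lookup directly. A small native pandas sketch, using the counts printed above:

```python
import pandas as native_pd

# Counts copied from the value_counts() output above.
counts = native_pd.Series(
    {
        "RAIL": 8571844,
        "SHIP": 8571402,
        "REG AIR": 8570280,
        "FOB": 8569760,
        "MAIL": 8569053,
        "TRUCK": 8567549,
        "AIR": 8566164,
    },
    name="count",
)

# Label of the largest count, without sorting.
most_common = counts.idxmax()

# Same label as the sort-based lookup used in the notebook.
via_sort = counts.sort_values().index[-1]

print(most_common)
```

Because `value_counts()` already returns counts in descending order by default, `counts.index[0]` gives the same label here.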

In [6]:
snow_df[snow_df['L_SHIPMODE'] == native_df.sort_values().index[-1]]

Unnamed: 0,L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_LINENUMBER,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT
0,14350244,866930,16947,2,33.0,62597.37,0.00,0.02,N,O,1998-02-20,1997-12-28,1998-03-19,TAKE BACK RETURN,RAIL,le furiously across the fluffily
5,14350245,1530905,30906,3,42.0,81304.86,0.10,0.04,N,O,1997-07-20,1997-06-27,1997-08-17,NONE,RAIL,usly bold ideas cajole furiously. care
13,14350247,834511,59520,2,4.0,5781.88,0.03,0.05,A,F,1993-09-03,1993-07-16,1993-09-26,NONE,RAIL,"the special, even requests. frays integrat"
19,14350273,746189,96204,1,17.0,20997.55,0.07,0.00,N,O,1998-11-01,1998-08-30,1998-11-02,COLLECT COD,RAIL,usual instructions. furiously iron
21,14350274,1562055,87071,1,46.0,51381.08,0.07,0.03,R,F,1992-05-23,1992-05-21,1992-06-09,NONE,RAIL,slyly ironic pinto be
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59986007,49622465,1042163,17194,1,35.0,38678.85,0.03,0.08,N,O,1997-03-19,1997-02-04,1997-04-18,COLLECT COD,RAIL,n grouches. i
59986023,49622469,1348975,99002,1,29.0,58693.39,0.04,0.01,N,O,1996-06-27,1996-06-30,1996-07-10,COLLECT COD,RAIL,"is. final, ironic packages lose ca"
59986032,49622471,22394,72395,1,3.0,3949.17,0.08,0.02,N,O,1996-12-15,1996-12-22,1996-12-18,NONE,RAIL,ely stealthy a
59986039,49622498,355331,5338,2,6.0,8317.92,0.03,0.06,N,O,1997-11-02,1997-11-18,1997-11-17,NONE,RAIL,riously even somas hag


In [7]:
pd.concat([snow_df, native_df])

TypeError: _linear_row_cost_fn() takes 1 positional argument but 2 were given

In [8]:
snow_df.dtypes

SnowparkSQLException: (1304): 01bb61e7-080f-702e-0001-dd47a26ef3df: 002003 (42S02): SQL compilation error:
Object 'SNOWPARK_TEMP_TABLE_AVZFAMGLDDREADONLY' does not exist or not authorized.

In [9]:
native_df.dtypes

dtype('int64')