### Pandas vs. Snowpark DataFrames

#### 1. Snowpark DataFrames

Snowpark DataFrames are evaluated lazily: the operations are translated to SQL and executed on Snowflake's infrastructure, so the data does not leave the environment and the notebook avoids the client-side memory limits common in traditional pandas workflows. The size of the underlying dataset therefore affects warehouse compute rather than local memory. Only the small, aggregated final results are returned to the client.

##### 3.1 Import Snowflake Connector package

In [2]:
import snowflake.connector
import pandas as pd
from datetime import date
import time
import os
import json

##### 3.3 Get user credentials

In [4]:
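## creds.json format
# {
#     "username" : "xxxxxxxxx",
#     "password" : "xxxxxxxxxxx",
#     "account" : "test.xxxxxx",
#     "warehouse" : "TEST_WH_xxxx",
#     "database" : "TEST_xxxxxxx",
#     "schema" : "TEST_xxxxxxxx",
#     "role" : "ROLE_TEST_xxxxxxxxx"
# }
# Load the credentials used by the connection cells below
# (assumes creds.json sits in the notebook's working directory)
with open("creds.json") as f:
    creds = json.load(f)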
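
Because the connection cells read every value from a `creds` dictionary, it can help to validate the file before connecting. The following is a minimal sketch, not part of the original notebook: the helper name `load_creds` and the `REQUIRED_KEYS` tuple are hypothetical, and the key list is simply taken from the `creds.json` template above.

```python
import json

# Keys listed in the creds.json template (assumed to be the full required set).
REQUIRED_KEYS = (
    "username", "password", "account", "warehouse", "database", "schema", "role",
)


def load_creds(path="creds.json"):
    """Read the credentials file and fail early if any expected key is missing."""
    with open(path, encoding="utf-8") as f:
        creds = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"{path} is missing keys: {', '.join(missing)}")
    return creds


# Usage (hypothetical): creds = load_creds("creds.json")
```

Failing at load time gives a clearer message than a `KeyError` raised halfway through building the connection arguments.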

##### 3.4 Connect Snowflake using Python Connector

In [5]:
# Create the connector object and a cursor
ctx = snowflake.connector.connect(
    user=creds["username"],
    password=creds["password"],
    account=creds["account"],
    database=creds["database"],
    schema=creds["schema"],
    warehouse=creds["warehouse"],
    role=creds["role"],
)
cs = ctx.cursor()

#### 4. Snowpark Performance Evaluation

##### 4.2 Import Snowpark package

In [6]:
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import asc, desc, avg, sum, col, lit

##### 4.3 Connect Snowflake using Snowpark

In [7]:
connection_parameters = {
    "account": creds["account"],
    "user": creds["username"],
    "password": creds["password"],
    "database": creds["database"],
    "schema": creds["schema"],
    "warehouse": creds["warehouse"],
    "role": creds["role"],
}

## Create a Snowpark session
snpark_conn = Session.builder.configs(connection_parameters).getOrCreate()

#### 5. Pandas vs. Snowpark DataFrames

##### Create DataFrame

In [8]:
## Pandas
data = {"Name": ["John", "Anna", "Peter"], "Age": [28, 24, 35]}
df = pd.DataFrame(data)
print(df)

    Name  Age
0   John   28
1   Anna   24
2  Peter   35


In [9]:
## Snowpark
data = [("John", 28), ("Anna", 24), ("Peter", 35)]
snow_df = snpark_conn.create_dataframe(data, ["Name", "AGE"])
snow_df.show()

------------------
|"NAME"  |"AGE"  |
------------------
|John    |28     |
|Anna    |24     |
|Peter   |35     |
------------------



##### Selecting Columns

In [None]:
## Pandas
selected_df = df[["Name", "Age"]]
print(selected_df)

    Name  Age
0   John   28
1   Anna   24
2  Peter   35


In [None]:
## Snowpark
snow_selected_df = snow_df.select("Name", "Age")
snow_selected_df.show()

------------------
|"NAME"  |"AGE"  |
------------------
|John    |28     |
|Anna    |24     |
|Peter   |35     |
------------------



##### Renaming Columns

In [None]:
## Pandas
df_renamed = df.rename(columns={"Name": "First_Name"})
print(df_renamed)

  First_Name  Age
0       John   28
1       Anna   24
2      Peter   35


In [None]:
## Snowpark
snow_df_renamed = snow_df.with_column_renamed("Name", "First_name")
snow_df_renamed.show()

------------------------
|"FIRST_NAME"  |"AGE"  |
------------------------
|John          |28     |
|Anna          |24     |
|Peter         |35     |
------------------------



##### Sorting Data

In [None]:
## Pandas
df_sorted = df.sort_values("Age")
print(df_sorted)

    Name  Age
1   Anna   24
0   John   28
2  Peter   35


In [None]:
## Snowpark
snow_df_sorted = snow_df.sort("Age")
snow_df_sorted.show()

------------------
|"NAME"  |"AGE"  |
------------------
|Anna    |24     |
|John    |28     |
|Peter   |35     |
------------------



##### Filtering a DataFrame

In [None]:
## Pandas
df_filtered = df[df["Age"] > 25]
print(df_filtered)


    Name  Age
0   John   28
2  Peter   35


In [None]:
## Snowpark
snow_df_filtered = snow_df.filter(snow_df["AGE"] > 25)
snow_df_filtered.show()

------------------
|"NAME"  |"AGE"  |
------------------
|John    |28     |
|Peter   |35     |
------------------



##### Grouping Data

In [None]:
## Pandas
df_grouped = df.groupby("Age").size().reset_index(name="Counts")
print(df_grouped)


   Age  Counts
0   24       1
1   28       1
2   35       1


In [None]:
## Snowpark
snow_df_grouped = snow_df.groupBy("AGE").count()
snow_df_grouped.show()

-------------------
|"AGE"  |"COUNT"  |
-------------------
|28     |1        |
|24     |1        |
|35     |1        |
-------------------



##### Joining DataFrames

In [None]:
## Pandas
df1 = pd.DataFrame({"Name": ["John", "Anna"], "Age": [28, 24]})
df2 = pd.DataFrame({"Name": ["John", "Peter"], "EyeColor": ["BLUE","BLACK"]})
df = pd.merge(df1, df2, on=["Name"])
print(df)

   Name  Age EyeColor
0  John   28     BLUE


In [None]:
## Snowpark
snow_df1 = snpark_conn.createDataFrame([("John", 28), ("Anna", 24)], ["Name", "Age"])
snow_df2 = snpark_conn.createDataFrame([("John", "BLUE"), ("Peter", "BLACK")], ["Name", "EyeColor"])
# Join on NAME; the left-hand duplicate gets the _LEFT suffix and is dropped
snow_df = snow_df1.join(
    snow_df2, snow_df1.NAME == snow_df2.NAME, lsuffix="_LEFT"
).drop("NAME_LEFT")
snow_df.show()

-------------------------------
|"AGE"  |"NAME"  |"EYECOLOR"  |
-------------------------------
|28     |John    |BLUE        |
-------------------------------



##### New Column based on another Column (conditional)

In [None]:
## Pandas
# The comparison is vectorized and already yields a boolean column
df["Senior"] = df["Age"] > 60
print(df)

   Name  Age EyeColor  Senior
0  John   28     BLUE   False


In [None]:
## Snowpark
import snowflake.snowpark.functions as F
snow_df = snow_df.with_column(
    "Senior", F.when(snow_df.AGE > 60, True).otherwise(False)
)
snow_df.show()


------------------------------------------
|"AGE"  |"NAME"  |"EYECOLOR"  |"SENIOR"  |
------------------------------------------
|28     |John    |BLUE        |False     |
------------------------------------------



##### Update Rows

In [None]:
## Pandas
# Create DataFrame
df = pd.DataFrame(
                    {"Name": ["John", "Jane", "Emily", "Daniel"], "Age": [15, 22, 17, 28]}
                 )
# Update Age for John
df.loc[df["Name"] == "John", "Age"] = 16
print(df)

     Name  Age
0    John   16
1    Jane   22
2   Emily   17
3  Daniel   28


In [None]:
## Snowpark
# Create DataFrame
snow_df = snpark_conn.createDataFrame(
    [("John", 15), ("Jane", 22), ("Emily", 17), ("Daniel", 28)], ["Name", "Age"]
)
snow_df.filter(F.col("Name") == "John").show()
# Update Age for John; with_column returns a new DataFrame rather than modifying data in place
snow_df = snow_df.with_column(
    "Age", F.when(snow_df["Name"] == "John", 20).otherwise(snow_df["Age"])
).filter(F.col("Name") == "John")
snow_df.show()

------------------
|"NAME"  |"AGE"  |
------------------
|John    |15     |
------------------

------------------
|"NAME"  |"AGE"  |
------------------
|John    |20     |
------------------



##### Handling Missing Values

In [None]:
## Pandas
# Create DataFrame with missing values
df = pd.DataFrame(
    {
        "EyeColor": ["Blue", None, "Brown", None],
        "Name": ["John", "Jane", None, "Daniel"],
    }
)
print(df)
print()
# Drop rows with any missing values
df_no_missing = df.dropna()
# Fill missing values
df_filled = df.fillna({"EyeColor": "Unknown", "Name": "No Name"})
print(df_no_missing)
print()
print(df_filled)


  EyeColor    Name
0     Blue    John
1     None    Jane
2    Brown    None
3     None  Daniel

  EyeColor  Name
0     Blue  John

  EyeColor     Name
0     Blue     John
1  Unknown     Jane
2    Brown  No Name
3  Unknown   Daniel


In [None]:
## Snowpark
# Create DataFrame with missing values
snow_df = snpark_conn.createDataFrame(
    [("Brown", "John"), (None, "Jane"), ("Blue", None), (None, "Daniel")],
    ["EyeColor", "Name"],
)
snow_df.show()
# Drop rows with any missing values
snow_df_no_missing = snow_df.na.drop()
# Fill missing values
snow_df_filled = snow_df.na.fill({"EyeColor": "Unknown", "Name": "No Name"})
snow_df_no_missing.show()
snow_df_filled.show()


-----------------------
|"EYECOLOR"  |"NAME"  |
-----------------------
|Brown       |John    |
|NULL        |Jane    |
|Blue        |NULL    |
|NULL        |Daniel  |
-----------------------

-----------------------
|"EYECOLOR"  |"NAME"  |
-----------------------
|Brown       |John    |
-----------------------

-----------------------
|"EYECOLOR"  |"NAME"  |
-----------------------
|Brown       |John    |
|Unknown     |Jane    |
|Blue        |No Name |
|Unknown     |Daniel  |
-----------------------



##### Window Functions

In [None]:
## Pandas
df = pd.DataFrame(
    {
        "Date": pd.date_range(start="2023-01-01", periods=5),
        "Product": ["A", "B", "A", "B", "A"],
        "Sales": [100, 80, 230, 150, 175],
    }
)
# Two-row moving average of Sales within each product
df["Moving_Avg"] = df.groupby("Product")["Sales"].transform(
    lambda x: x.rolling(window=2).mean()
)
print(df)


        Date Product  Sales  Moving_Avg
0 2023-01-01       A    100         NaN
1 2023-01-02       B     80         NaN
2 2023-01-03       A    230       165.0
3 2023-01-04       B    150       115.0
4 2023-01-05       A    175       202.5


In [None]:
## Snowpark
import snowflake.snowpark.functions as F
from snowflake.snowpark import Window

snow_df_window = snpark_conn.createDataFrame(
    [
        ("2023-01-01", "A", 100), ("2023-01-02", "B", 80), ("2023-01-03", "A", 230),
        ("2023-01-04", "B", 150), ("2023-01-05", "A", 175),
    ],
    ["Date", "Product", "Sales"],
)
# Frame of up to two rows per product: the previous row and the current row
windowSpec = (
    Window.partitionBy("Product")
    .orderBy("Date")
    .rowsBetween(Window.currentRow - 1, Window.currentRow)
)
snow_df_window = snow_df_window.with_column(
    "Moving_Avg", F.avg(snow_df_window["Sales"]).over(windowSpec)
)
snow_df_window.show()

---------------------------------------------------
|"DATE"      |"PRODUCT"  |"SALES"  |"MOVING_AVG"  |
---------------------------------------------------
|2023-01-01  |A          |100      |100.000       |
|2023-01-03  |A          |230      |165.000       |
|2023-01-05  |A          |175      |202.500       |
|2023-01-02  |B          |80       |80.000        |
|2023-01-04  |B          |150      |115.000       |
---------------------------------------------------
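
The two outputs above differ in the first row of each product: pandas `rolling(window=2)` returns `NaN` until two rows are available, while the Snowpark window frame averages whatever rows fall inside it. As a minimal sketch, assuming the Snowpark behaviour is the one wanted, the pandas side could be aligned by passing `min_periods=1`:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.date_range(start="2023-01-01", periods=5),
        "Product": ["A", "B", "A", "B", "A"],
        "Sales": [100, 80, 230, 150, 175],
    }
)

# min_periods=1 lets the first row of each product average over a single value
# instead of producing NaN, which mirrors the Snowpark frame shown above.
df["Moving_Avg"] = df.groupby("Product")["Sales"].transform(
    lambda x: x.rolling(window=2, min_periods=1).mean()
)
print(df)
```

With this change the first rows for products A and B become 100.0 and 80.0, matching the Snowpark table.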



##### Apply Functions to multiple Columns

In [None]:
## Pandas
def add_ten(x):
    return x + 10

df = pd.DataFrame({"Age": [20, 25, 30, 35, 40], "Salary": [3000, 3500, 4000, 4500, 5000]})
# Apply add_ten to 'Age' and 'Salary'; apply() passes each column to the function
# (applymap is deprecated in recent pandas releases)
df[["Age", "Salary"]] = df[["Age", "Salary"]].apply(add_ten)
print(df)


   Age  Salary
0   30    3010
1   35    3510
2   40    4010
3   45    4510
4   50    5010


In [10]:
## Snowpark
snow_df = snpark_conn.createDataFrame(
    [(20, 3000), (25, 3500), (30, 4000), (35, 4500), (40, 5000)], ["Age","Salary"]
    )
snow_df.show()
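# Use select and a list comprehension to apply the change to multiple columns.
# Snowpark upper-cases unquoted column names, so match against "AGE" and "SALARY".
columns_to_change = ["AGE", "SALARY"]
snow_df_plus_ten = snow_df.select(
    *[(F.col(c) + 10).alias(c) if c in columns_to_change else F.col(c) for c in snow_df.columns]
)
# snow_df_plus_ten.show() would display every value increased by 10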


--------------------
|"AGE"  |"SALARY"  |
--------------------
|20     |3000      |
|25     |3500      |
|30     |4000      |
|35     |4500      |
|40     |5000      |
--------------------



##### User Defined Functions

In [None]:
## Pandas
# N/A: pandas has no server-side UDFs; Python functions run locally in the client process
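
Although pandas has no registered-UDF concept, the closest local analog is to apply a plain Python function to a column. The sketch below is an illustration only: the sample ages are made up for the example, and the function runs in the notebook process rather than inside Snowflake.

```python
import pandas as pd


def senior(age: int) -> bool:
    """Plain Python function; pandas calls it once per value, locally."""
    return age > 60


# Illustrative data (not from the notebook) so that one row exceeds the threshold
df = pd.DataFrame({"Age": [20, 45, 70]})
df["Senior"] = df["Age"].apply(senior)
print(df)
```

For a simple threshold like this, the vectorized form `df["Age"] > 60` gives the same column without a per-row function call.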

In [None]:
## Snowpark
# Register the UDF in the PUBLIC schema of the current database
snpark_conn.sql("use schema public").collect()

@F.udf(name="senior", replace=True)
def senior(age: int) -> bool:
    return age > 60

# The UDF executes inside Snowflake for each AGE value
snow_df = snow_df.with_column("Senior", senior(snow_df.AGE))
snow_df.show()