# Snowpark Basics HoL Part 2 - Joins and Views

## 2.1 Setup

### Imports

In [None]:
from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

import sys
import json
import pandas as pd
import numpy as np

# Make sure we do not get line breaks when doing show on wide dataframes
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

### Create Snowpark Session

In [None]:
with open('creds.json') as f:
    connection_parameters = json.load(f)
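
The cell above expects a `creds.json` file next to the notebook. As a rough sketch, the file could hold the usual Snowpark connection keys shown below. Every value here is a hypothetical placeholder, and the file is written to a temporary directory so that a real `creds.json` is not overwritten.

```python
import json
import os
import tempfile

# Hypothetical placeholder values; replace each with your own account details.
example_parameters = {
    "account": "<account_identifier>",
    "user": "<user_name>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

# Write to a temporary directory so this sketch never touches a real creds.json.
creds_path = os.path.join(tempfile.mkdtemp(), "creds.json")
with open(creds_path, "w") as f:
    json.dump(example_parameters, f, indent=2)

# Reading the file back mirrors what the notebook cell does.
with open(creds_path) as f:
    loaded_parameters = json.load(f)

print(sorted(loaded_parameters))
```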

In [None]:
session = Session.builder.configs(connection_parameters).create()
print(f"Current Database and schema: {session.get_fully_qualified_current_schema()}")
print(f"Current Warehouse: {session.get_current_warehouse()}")

### Snowpark DataFrames from Tables

In [None]:
# Creating a Snowpark DataFrame
snowpark_truck_df = session.table('TRUCK')
snowpark_header_df = session.table('ORDER_HEADER')
snowpark_detail_df = session.table('ORDER_DETAIL')
snowpark_location_df = session.table('LOCATION')
snowpark_menu_df = session.table('MENU')

print(f"ORDER_DETAIL rows: {snowpark_detail_df.count()}")


## 2.2 Joins

### Intro to Joins
Snowpark DataFrames can be joined in several ways. The default join type is an inner join; if no join columns are given, the result is a Cartesian product. The join columns and the join type can be specified as optional arguments.

```python
table1_df = session.table("<table 1 name>")
table2_df = session.table("<table 2 name>")

joined_df = table1_df.join(table2_df) # Cartesian product

```

If the join columns have the same name in both DataFrames, the column name can be passed as a string in the second argument.

```python
joined_df = table1_df.join(table2_df, "<some common column name>") 

```

A list of common join column names can also be passed.

```python
joined_df = table1_df.join(table2_df, ["join_col1_name", "join_col2_name"]) 

```
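
To see the difference between these forms without a Snowflake connection, here is a minimal sketch of the analogous SQL using Python's built-in `sqlite3` module. The two tiny tables and their contents are made up for illustration, and SQLite stands in for Snowflake only to show the join semantics: a join with no condition multiplies the row counts, while a join on a shared column name matches rows and returns that column once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE truck (truck_id INTEGER, primary_city TEXT);
    CREATE TABLE order_header (order_id INTEGER, truck_id INTEGER);
    INSERT INTO truck VALUES (1, 'Paris'), (2, 'London');
    INSERT INTO order_header VALUES (100, 1), (101, 1), (102, 2);
""")

# No join condition: every truck row is paired with every order row.
cross_rows = conn.execute("SELECT * FROM truck CROSS JOIN order_header").fetchall()
print(len(cross_rows))

# Join on the shared column name: rows are matched on truck_id.
using_cursor = conn.execute("SELECT * FROM truck JOIN order_header USING (truck_id)")
using_columns = [description[0] for description in using_cursor.description]
using_rows = using_cursor.fetchall()
print(using_columns)
print(len(using_rows))
```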


In [None]:
truck_df1 = snowpark_truck_df.select("TRUCK_ID", "MENU_TYPE_ID", "PRIMARY_CITY", "ISO_COUNTRY_CODE")
header_df1 = snowpark_header_df.select("ORDER_ID", "TRUCK_ID", "LOCATION_ID", "CUSTOMER_ID", "ORDER_TS")
truck_header_df = truck_df1.join(header_df1, "TRUCK_ID")
truck_header_df.queries

In [None]:
truck_header_df.limit(10).to_pandas()

Note that the duplicate join column is removed from the result set.

### Multi-Table Joins and Repeated Column Names
What if we want to join multiple tables? We can string joins together.

Let's reduce our MENU and ORDER_DETAIL columns, and then join them to the existing joined dataframe of TRUCK and ORDER_HEADER.

In [None]:
menu_df1 = snowpark_menu_df.drop("MENU_ITEM_HEALTH_METRICS_OBJ")
detail_df1 = snowpark_detail_df.drop("LINE_NUMBER","DISCOUNT_ID","ORDER_ITEM_DISCOUNT_AMOUNT")

combined_order_df = truck_header_df.join(detail_df1, "ORDER_ID").join(menu_df1,"MENU_ITEM_ID")
combined_order_df.show()

Because MENU_TYPE_ID appears in two tables but is NOT a join column, Snowpark includes both and generates a prefix for each.
One way to fix this is to ensure that column names in the original dataframes are unique before joining, perhaps by selecting explicitly and using aliases.

Another is to add a select to the joined dataframe definition. In this select you can refer to the original dataframe columns and alias them. This can get cumbersome if you want all the columns.

(Syntax note - Wrapping a longer command across multiple lines will work if the break is within brackets - otherwise use the backslash as shown below.)
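
The line-wrapping rule is plain Python rather than anything specific to Snowpark, so it can be sketched with ordinary string methods. The variable names here are illustrative only.

```python
raw = "  truck_id, menu_type_id  "

# Option 1: the line breaks sit inside brackets, so no continuation character is needed.
wrapped_in_brackets = (
    raw.strip()
    .upper()
    .split(", ")
)

# Option 2: no enclosing brackets, so each broken line must end with a backslash.
wrapped_with_backslash = raw.strip() \
    .upper() \
    .split(", ")

print(wrapped_in_brackets == wrapped_with_backslash)
```

Wrapping the whole chain in brackets is often the easier style to maintain, since a stray space after a backslash causes a syntax error.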

In [None]:
combined_order_df = truck_header_df.join(detail_df1, "ORDER_ID")\
     .join(menu_df1,"MENU_ITEM_ID")\
     .select("ORDER_ID","ORDER_DETAIL_ID",F.col("MENU_ITEM_ID"),truck_header_df["MENU_TYPE_ID"].alias("TRUCK_MENU_TYPE_ID"),
             menu_df1["MENU_TYPE_ID"].alias("MENU_MENU_TYPE_ID"))
combined_order_df.show()

Another approach uses the join parameters lsuffix and rsuffix, which append a suffix to overlapping column names from the left and right dataframes.

In [None]:
combined_order_df = truck_header_df.join(detail_df1, "ORDER_ID").join(menu_df1, "MENU_ITEM_ID", lsuffix="_TRUCK", rsuffix="_MENU")
combined_order_df.show()

### Other Joins
What if we want a left join? The join method takes the join type as an optional third argument, for example "left".

Let's get a list of locations in GB or FR and their aggregated sales on 2022-02-01 (including locations with no sales that day, which show as null).

In [None]:
location_df1 = snowpark_location_df.select("LOCATION_ID","LOCATION", "CITY", "ISO_COUNTRY_CODE")
location_df1 = location_df1.filter(F.col("ISO_COUNTRY_CODE").in_("GB","FR")).sort("LOCATION")
location_df1.show()

header_df1 = snowpark_header_df.select("LOCATION_ID","ORDER_AMOUNT","ORDER_TS")
header_df1 = header_df1.with_column("ORDER_DATE", F.to_date(F.col("ORDER_TS"))).drop("ORDER_TS")
header_df1 = header_df1.group_by(['LOCATION_ID','ORDER_DATE']).agg(F.sum('ORDER_AMOUNT').as_('TOTAL_ORDER_AMOUNT'))
header_df1 = header_df1.filter(F.col('ORDER_DATE') == '2022-02-01')
header_df1.show()

gbfrfeb1_df = location_df1.join(header_df1, "LOCATION_ID","left").sort("LOCATION")
gbfrfeb1_df.show()
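
The cell above filters the sales to a single date before the left join, and that order matters. As a hedged sketch of why, the same logic is expressed below in SQL with Python's built-in `sqlite3` module. The table names and data are invented for illustration. Filtering on a right-hand column after the join discards the unmatched rows that the left join was meant to keep.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (location_id INTEGER, location TEXT);
    CREATE TABLE daily_sales (location_id INTEGER, order_date TEXT, total REAL);
    INSERT INTO location VALUES (1, 'Hyde Park'), (2, 'Louvre');
    INSERT INTO daily_sales VALUES
        (1, '2022-02-01', 250.0),
        (1, '2022-02-02', 90.0),
        (2, '2022-02-02', 40.0);
""")

# Filter the sales first, then left join: every location survives.
filter_then_join = conn.execute("""
    SELECT l.location, s.total
    FROM location l
    LEFT JOIN (SELECT * FROM daily_sales WHERE order_date = '2022-02-01') s
      ON l.location_id = s.location_id
    ORDER BY l.location
""").fetchall()

# Join first, then filter on a right-hand column: the location without sales is dropped.
join_then_filter = conn.execute("""
    SELECT l.location, s.total
    FROM location l
    LEFT JOIN daily_sales s ON l.location_id = s.location_id
    WHERE s.order_date = '2022-02-01'
    ORDER BY l.location
""").fetchall()

print(filter_then_join)
print(join_then_filter)
```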

## 2.3 Tables and Views

We have a number of options to save our transformed dataframes within Snowflake. Let's take the dataframe above and see what we can do with it.

### Saving DataFrames as Tables

As we saw in the previous section, we can write the data out to a table.

In [None]:
gbfrfeb1_df.write.save_as_table(table_name='GBFRFEB01_TABLE', mode='overwrite')
session.table('GBFRFEB01_TABLE').show()

What if the table already exists? Then we can set mode = 'append'.  
<br>We can also set column_order = 'index', i.e. match columns in the order provided (default), or 'name' to match columns by name.
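
To illustrate the difference between matching by position and matching by name, here is a small sketch using Python's built-in `sqlite3` module with invented tables. SQLite does not enforce column types, so the positional insert succeeds and silently puts values in the wrong columns. In Snowflake a mismatch like this would more likely surface as an error, as the commented-out line in the next cell indicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (location TEXT, city TEXT, total REAL);
    CREATE TABLE incoming (location TEXT, total REAL, city TEXT);
    INSERT INTO incoming VALUES ('Hyde Park', 250.0, 'London');
""")

# Matching by position: the second incoming column (total) lands in target.city.
conn.execute("INSERT INTO target SELECT * FROM incoming")

# Matching by name: each value goes to the column with the same name.
conn.execute(
    "INSERT INTO target (location, total, city) "
    "SELECT location, total, city FROM incoming"
)

rows = conn.execute("SELECT location, city, total FROM target ORDER BY rowid").fetchall()
for row in rows:
    print(row)
```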

In [None]:
gbfrfeb1_df1 = gbfrfeb1_df.select("LOCATION_ID", "LOCATION", "ORDER_DATE", "TOTAL_ORDER_AMOUNT", "CITY", "ISO_COUNTRY_CODE")
# This will fail:
# gbfrfeb1_df1.write.save_as_table(table_name='GBFRFEB01_TABLE', mode='append')
# This should succeed:
gbfrfeb1_df1.write.save_as_table(table_name='GBFRFEB01_TABLE', mode='append', column_order = 'name')

session.table('GBFRFEB01_TABLE').show()

A further optional parameter, table_type, lets you create the table as 'temporary' or 'transient'.

### Saving DataFrames as Views

The dataframe query can also be turned into a view with create_or_replace_view. A temporary view can be created with create_or_replace_temp_view.  
Note that this method is unusual in that it doesn't require an action to be executed: it runs 'eagerly'.

In [None]:
gbfrfeb1_df.create_or_replace_view('GBFRFEB01_VIEW')
session.sql("SHOW VIEWS").show()

session.table('GBFRFEB01_VIEW').show()

## 2.X YOUR TURN!

Here is the challenge: You have been asked to analyse the numbers of different 'Beverage' items sold by location country for February 2022.  
You are to present the answers in two ways:  
by country, listing the most to least popular beverages  
by beverage, listing the top to bottom countries  

Then, save the data as a new table BEVERAGE202202

### Check out the data

Which columns in Location, Menu, Order Header and Order Detail will you need?  
You can optionally create simpler dataframes just to hold those columns.

### Select the right month from Order Headers and right category from Menu

The functions.year and functions.month methods extract those date parts from a date or timestamp.  
Conditions can be joined with & but must be enclosed in brackets.
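
The brackets are needed because Python's `&` binds more tightly than comparison operators such as `==`. The sketch below shows the effect with plain integers standing in for the year and month values. The same parsing applies to Snowpark column expressions, where the unbracketed form would typically fail or produce the wrong filter.

```python
year, month = 2022, 2

# Each comparison is evaluated first, then the two results are combined.
with_brackets = (year == 2022) & (month == 2)

# Without brackets Python reads this as: year == (2022 & month) == 2
without_brackets = year == 2022 & month == 2

print(with_brackets, without_brackets)
```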

### Join the tables
You need a 4-way join.

### Now aggregate
Sum quantities by location and item. Display in different sort orders.

### Save as table
Save the aggregated dataframe as the new table BEVERAGE202202.

In [None]:
session.close()