<img src="https://raw.githubusercontent.com/fugue-project/fugue/master/images/logo.svg" align="left" width="500"/>

# About this notebook

This notebook is a demonstration of FugueSQL prepared for Thinkful Data Analyst Bootcamp students. **FugueSQL is a language that allows SQL users to use in-memory data frameworks such as Pandas, Spark, and Dask with a SQL interface**. It has some differences from standard SQL that will be shown here.

FugueSQL aims to be more English-like and to provide a fun interface for data analysts to work with data in their tool of choice. The FugueSQL notebook extension allows users to write FugueSQL with syntax highlighting in Jupyter notebook cells.

Fugue also has a programming interface that is not covered in this notebook. The links to the repo and the Slack channel are listed below for anyone who is interested.

## Links 

Fugue is a pure abstraction layer that makes code portable across differing computing frameworks such as Pandas, Spark and Dask. It allows users to write code compatible across all 3 frameworks. It guarantees consistency regardless of scale and a unified framework for compute. All questions are welcome in the Slack channel.

[Fugue Repo](https://github.com/fugue-project/fugue)

[Fugue Slack](https://join.slack.com/t/fugue-project/shared_invite/zt-jl0pcahu-KdlSOgi~fP50TZWmNxdWYQ)

## Credits

Many of the plots and much of the EDA here are based on this notebook: https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart

## Installation

Note that this installation is optimized for Kaggle notebooks. `fuggle` is a library for Fugue on Kaggle notebooks. Installing Fugue for use outside Kaggle notebooks should just be `pip install fugue`. Join the Slack (listed above) if there are any questions.

In [None]:
%pip install fuggle

This provides syntax highlighting for FugueSQL cells and allows us to use the `%%fsql` magic.

In [None]:
from fuggle import setup
setup()

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time 

import warnings
warnings.filterwarnings('ignore')

## Preprocessing

This particular dataset comes as zip files, so we unzip them first in order to read them with Pandas/FugueSQL.

In [None]:
import zipfile
file_list = [
    '/kaggle/input/instacart-market-basket-analysis/aisles.csv.zip',
    '/kaggle/input/instacart-market-basket-analysis/orders.csv.zip',
    '/kaggle/input/instacart-market-basket-analysis/sample_submission.csv.zip',
    '/kaggle/input/instacart-market-basket-analysis/order_products__train.csv.zip',
    '/kaggle/input/instacart-market-basket-analysis/products.csv.zip',  
    '/kaggle/input/instacart-market-basket-analysis/order_products__prior.csv.zip',    
    '/kaggle/input/instacart-market-basket-analysis/departments.csv.zip']

for file_name in file_list:
    with zipfile.ZipFile(file=file_name) as target_zip:
        target_zip.extractall()

## Quick Experiments

In [None]:
import pandas as pd

df = pd.DataFrame({"date": ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05'],
                   "val": [1,2,3,4,5],
                   "val2": ["a","b","c","a","b"]})
df['date'] = pd.to_datetime(df['date'])

In [None]:
%%fsql
SELECT *
 FROM df
WHERE date < "2020-01-03"
PRINT

In [None]:
%%fsql
SELECT date, val,
CASE
    WHEN val = 1 THEN 'The quantity is 1'
    WHEN val = 2 THEN 'The quantity is 2'
    ELSE 'The quantity is greater than 2'
END AS valText
FROM df
PRINT

In [None]:
%%fsql
SELECT date, val, val2
FROM df
WHERE val2 IN ("a","b")
PRINT

In [None]:
mytuple = ("a", "b")

In [None]:
%%fsql
SELECT date, val, val2
FROM df
WHERE val2 IN {{mytuple}}
PRINT

# FugueSQL Syntax

Before using FugueSQL for data analysis, we'll go over some quick examples of the syntax. These will be put together for more complicated operations later. They also show the enhancements over standard SQL.

## Load and Save

FugueSQL allows users to load from csv/json/parquet files using Pandas, Spark and Dask under the hood. This means we can load in data, perform transformations on it, and then write out the results. This allows data analysts to work with data not in a database.

In [None]:
%%fsql
df = LOAD "/kaggle/working/aisles.csv" (header=TRUE, infer_schema=TRUE)

SELECT * FROM df
WHERE aisle_id = 3
PRINT
SAVE OVERWRITE "/kaggle/working/aisles-modified.csv"

Notice the variable assignment during the `LOAD` statement. Variable assignment is not limited to `LOAD` operations. It can also be used with a `SELECT` statement to create intermediate tables. All ANSI SQL keywords are available in FugueSQL.

## Groupby and Filtering

In [None]:
%%fsql
products = LOAD "/kaggle/working/products.csv" (header=TRUE, infer_schema=TRUE)
PRINT 5 ROWS

  SELECT department_id, COUNT(*) AS count
    FROM products
   WHERE department_id < 6
GROUP BY department_id
   PRINT

Before moving on to other FugueSQL commands, this is a good place to show what the equivalent Pandas syntax would be for the same operation. Note that `loc` is used to filter. We need to take care of resetting the index, and the renaming of the column is more verbose. In general, SQL is easier to read for some operations.

In [None]:
# Pandas implementation of previous
products = pd.read_csv("/kaggle/working/products.csv")
products['department_id'] = products['department_id'].astype(int)

products = products[['department_id']]\
    .loc[products['department_id'] < 6]\
    .value_counts()\
    .reset_index()\
    .rename(columns={0:'count'})

products.head()

More importantly, the Python code is tightly coupled to the Pandas framework. If the data becomes too big, we'd have to move to another framework like Spark or Dask, and this Pandas-written code would no longer be applicable. To show this, we'll implement the same logic in Spark to show how different the syntax is.

In [None]:
# Spark implementation of previous
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()
products = spark.read.format("csv").load("/kaggle/working/products.csv", header = True)

products = products.where("department_id < 6")\
    .groupBy("department_id")\
    .agg(sf.count(sf.lit(1)).alias("count"))

products.show()

**The Case for SQL as Grammar for Logic**

This is the motivation for Fugue. Can we decouple the expression of our logic from the framework we are using? Fugue achieves this by letting users specify the execution engine at runtime. We can run the same SQL code on Pandas, Spark, or Dask by simply changing one line of code to define our execution engine. This provides more robust code that is agnostic to the volume of the data we're operating on.

It is important to note that Fugue also has a Python abstraction layer similar to this SQL abstraction layer. They can also work together. We'll see hints of this later.

## Defining Schema for a DataFrame

Note that if we don't infer the schema, Pandas loads most columns as strings. We can use `ALTER COLUMNS` to change the column types. For DataFrames with a large number of columns, we recommend using `infer_schema` and then `ALTER COLUMNS` to ensure the correct types.

In [None]:
%%fsql
df = LOAD "/kaggle/working/aisles.csv" (header=TRUE)
PRINT 1 ROW
df = ALTER COLUMNS aisle_id:int, aisle:str FROM df
PRINT 1 ROW

Similarly, schema can be explicitly defined while loading in the CSV.

In [None]:
%%fsql
df = LOAD "/kaggle/working/aisles.csv" (header=TRUE) COLUMNS aisle_id:int, aisle:str
PRINT 1 ROWS
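
For comparison, here is a rough Pandas analogue of loading without type inference and then casting. It is a minimal sketch: the two-row CSV text is a made-up stand-in for `aisles.csv`, and `astype` is only an approximation of what `ALTER COLUMNS` does.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for aisles.csv
csv_text = "aisle_id,aisle\n1,prepared soups salads\n2,specialty cheeses\n"

# dtype=str mimics loading without schema inference: every value arrives as text
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(type(raw.loc[0, "aisle_id"]).__name__)

# Rough analogue of ALTER COLUMNS aisle_id:int: cast the text column to an integer type
typed = raw.astype({"aisle_id": "int32"})
print(typed.dtypes)
```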

## Passing DataFrames to fsql Cells

FugueSQL allows for Python interoperability. DataFrames defined outside `%%fsql` cells can be used inside them. In this example, we create a test DataFrame and use it in the FugueSQL code block that follows.

In [None]:
test = pd.read_csv("/kaggle/working/aisles.csv")
test['new_col'] = 1
test.head(3)

In [None]:
%%fsql
SELECT *
  FROM test
 PRINT 5 ROWS

## Passing DataFrames out of fsql Cells

DataFrames defined in fsql cells can be used in following cells or in native Python by using `YIELD DATAFRAME`. This holds the DataFrame in memory.

In [None]:
%%fsql
df = LOAD "/kaggle/working/aisles.csv" (header=TRUE)
SELECT * 
FROM df
WHERE aisle_id = '3'
YIELD DATAFRAME AS result
PRINT

In [None]:
# Printing dataframe from previous step
print(result.as_pandas().head())

In [None]:
%%fsql
-- This is available because of the previous YIELD
SELECT * FROM result
PRINT

## Jinja Templating

Sometimes a Python variable will be needed inside a SQL block. Think of dynamic lists used to filter values in a DataFrame. In this case, Jinja templating can be used to pass a variable into an fsql code block.

In [None]:
# This is a Python code block
cheese_aisle = 'specialty cheeses'

In [None]:
%%fsql
df = LOAD "/kaggle/working/aisles.csv" (header=TRUE)

SELECT *
FROM df WHERE aisle = '{{cheese_aisle}}'
PRINT
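
To see what the SQL looks like after substitution, the sketch below uses a hypothetical `render` helper that replaces `{{name}}` with `str(value)`. This is a plain-Python stand-in, not Jinja's or Fugue's actual implementation, and it assumes simple values are rendered through their string form.

```python
def render(template: str, **variables) -> str:
    """Hypothetical stand-in for a template engine: swap {{name}} for str(value)."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template


# A tuple renders with parentheses, which is why it can follow IN directly
sql_in = render("SELECT * FROM df WHERE val2 IN {{mytuple}}", mytuple=("a", "b"))

# A plain string renders without quotes, so the template supplies them
sql_eq = render(
    "SELECT * FROM df WHERE aisle = '{{cheese_aisle}}'",
    cheese_aisle="specialty cheeses",
)

# A one-element tuple keeps its trailing comma, which would not be valid SQL
sql_single = render("SELECT * FROM df WHERE val2 IN {{mytuple}}", mytuple=("a",))

print(sql_in)
print(sql_eq)
print(sql_single)
```

Under this assumption, a one-element tuple is the case to watch for when the filter list is built dynamically.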

## Anonymity and Inline

Anonymity is when the dataframe to perform the operation on is not specified. As a default, the output of the last operation will be used. This is a FugueSQL feature designed to simplify code. `PRINT` is an example of this. 

In [None]:
%%fsql
df = SELECT * FROM (LOAD "/kaggle/working/products.csv" (header=TRUE))
ALTER COLUMNS product_id:int, product_name:str, aisle_id:int, department_id:int
PRINT 5 ROWS

# Data Analysis

In [None]:
# Some plotting utility functions. These will be used in conjunction with SQL later
color = sns.color_palette()

def dow_countplot(df:pd.DataFrame) -> None:
    plt.figure(figsize=(12,8))
    sns.countplot(x=df['order_dow'], color=color[0])
    plt.ylabel('Count', fontsize=12)
    plt.xlabel('Day of week', fontsize=12)
    plt.title("Frequency of order by week day", fontsize=15)
    plt.show()
    
def hour_countplot(df:pd.DataFrame) -> None:
    plt.figure(figsize=(12,8))
    sns.countplot(x=df['order_hour_of_day'], color=color[1])
    plt.ylabel('Count', fontsize=12)
    plt.xlabel('Hour of Day', fontsize=12)
    plt.title("Frequency of order by hour of day", fontsize=15)
    plt.show()
    
def max_order_barplot(df:pd.DataFrame) -> None:
    plt.figure(figsize=(12,8))
    sns.barplot(x=df['n_orders'], y=df['count'], alpha=0.8, color=color[2])
    plt.ylabel('Number of Occurrences', fontsize=12)
    plt.xlabel('Maximum order number', fontsize=12)
    plt.title("Frequency of maximum order numbers", fontsize=15)
    plt.xticks(rotation='vertical')
    plt.show()
    
def days_since_prior_countplot(df:pd.DataFrame) -> None:
    plt.figure(figsize=(12,8))
    sns.countplot(x=df['days_since_prior_order'], color=color[3])
    plt.ylabel('Count', fontsize=12)
    plt.xlabel('Days since prior order', fontsize=12)
    plt.xticks(rotation='vertical')
    plt.title("Frequency distribution by days since prior order", fontsize=15)
    plt.show()

def top_products_barplot(df:pd.DataFrame) -> None:
    plt.figure(figsize=(12,8))
    sns.barplot(x=df['product_name'], y=df['count'], alpha=0.8, color=color[4])
    plt.ylabel('Number of Occurrences', fontsize=12)
    plt.xlabel('Product Name', fontsize=12)
    plt.title("Frequency of product orders (top 20)", fontsize=15)
    plt.xticks(rotation=45, ha="right")
    plt.show()
    
def top_aisles_barplot(df:pd.DataFrame) -> None:
    plt.figure(figsize=(12,8))
    sns.barplot(x=df['aisle'], y=df['count'], alpha=0.8, color=color[5])
    plt.ylabel('Number of Occurrences', fontsize=12)
    plt.xlabel('Aisle Name', fontsize=12)
    plt.title("Number of Occurrences of each aisle", fontsize=15)
    plt.xticks(rotation=45, ha="right")
    plt.show()
    
def department_pieplot(df:pd.DataFrame) -> None:
    plt.figure(figsize=(10,10))
    temp_series = df['department'].value_counts()
    labels = (np.array(temp_series.index))
    sizes = (np.array((temp_series / temp_series.sum())*100))
    plt.pie(sizes, labels=labels, 
            autopct='%1.1f%%', startangle=200)
    plt.title("Departments distribution", fontsize=15)
    plt.show()
    
    

## Loading the Orders and Order Products Table

In [None]:
%%fsql
SELECT * FROM (LOAD "/kaggle/working/orders.csv" (header=TRUE)
               COLUMNS order_id:int,user_id:int, eval_set:str, order_number:int, order_dow:int, order_hour_of_day:int, days_since_prior_order:double)
YIELD FILE AS orders
PRINT 10 ROWS

order_products = SELECT order_id, product_id, reordered
                 FROM (LOAD "/kaggle/working/order_products__prior.csv" (header=true, infer_schema=true))
order_products = ALTER COLUMNS reordered:int
YIELD FILE AS order_products
PRINT 10 ROWS

## Rowcount of order_products

This is the largest table, and we get the row count here to understand the volume of data we are dealing with. For larger datasets, users should consider using Spark or Dask as the backend to FugueSQL. The Kaggle kernel is a 4-core machine, but Pandas runs on only 1 core by default.

Using Spark or Dask allows us to parallelize the operations performed on the data.

The order_products table has approximately 32 million rows. Operations on it stretch the limits of Pandas.

In [None]:
%%fsql
-- This can be used because of YIELD FILE
SELECT COUNT(*) AS count
  FROM orders 
 PRINT
 
SELECT COUNT(*) AS count
  FROM order_products
 PRINT

## Missing Value Count

This is an example of an operation that is a lot more verbose to write in SQL. We can achieve the same thing by using a Python function and one line of Pandas code.

In [None]:
%%fsql
SELECT COUNT(*) - COUNT(order_id) AS order_id,
    COUNT(*) - COUNT(user_id) AS user_id,
    COUNT(*) - COUNT(eval_set) AS eval_set,
    COUNT(*) - COUNT(order_number) AS order_number,
    COUNT(*) - COUNT(order_dow) AS order_dow,
    COUNT(*) - COUNT(order_hour_of_day) AS order_hour_of_day,
    COUNT(*) - COUNT(days_since_prior_order) AS days_since_prior_order
FROM orders
PRINT

In [None]:
#schema: *
def null_count(df:pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame(df.isnull().sum(axis = 0)).T

In [None]:
%%fsql
TRANSFORM orders USING null_count
PRINT

TRANSFORM order_products USING null_count
PRINT
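
To make the shape of the `null_count` output concrete, here is the same one-liner applied to a tiny made-up DataFrame (the values are illustrative, not taken from the orders table). The result has one row, with one column per input column holding the number of missing values.

```python
import numpy as np
import pandas as pd


def null_count(df: pd.DataFrame) -> pd.DataFrame:
    # isnull().sum(axis=0) yields one count per column;
    # .T turns that column of counts into a single row
    return pd.DataFrame(df.isnull().sum(axis=0)).T


# Hypothetical toy data: two of the three gaps are missing
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "days_since_prior_order": [np.nan, 7.0, np.nan],
})

counts = null_count(toy)
print(counts)
```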

## Introduction to Distributed Computing



Before we go into further analysis, we'll first give a brief introduction to distributed computing.

There is an image in the Dask repo [issues](https://github.com/dask/dask/issues/4471) that clearly illustrates the distributed computing paradigm. In general, there is a client or master that takes care of the orchestration and final data collection. The client is responsible for scheduling tasks among workers.

Both Spark and Dask have local modes also where they use the cores available on the local machine. This means we can still take advantage of the additional processing without having a cluster available.

<img src="https://user-images.githubusercontent.com/11656932/62263986-bbba2f00-b3e3-11e9-9b5c-8446ba4efcf9.png" align="left" width="700"/>
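
The client/worker pattern described above can be imitated on a single machine with the standard library. This is only a sketch: threads stand in for workers, the main program plays the client, and `count_rows` is a made-up task. Spark and Dask add scheduling, fault tolerance, and data movement on top of this basic idea.

```python
from concurrent.futures import ThreadPoolExecutor


def count_rows(chunk):
    """Work done by one worker on its share of the data."""
    return len(chunk)


data = list(range(1000))
n_workers = 4
chunk_size = len(data) // n_workers

# The "client" splits the data into one chunk per worker
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Each worker processes its chunk independently
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partial_counts = list(pool.map(count_rows, chunks))

# The client collects the partial results and combines them
total = sum(partial_counts)
print(partial_counts, total)
```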

## Analysis

In [None]:
%%fsql
tempdf = SELECT user_id, MAX(order_number) AS n_orders
FROM orders
GROUP BY user_id
PRINT 2 ROWS

SELECT n_orders, COUNT(n_orders) AS count
FROM tempdf
GROUP BY n_orders
OUTPUT USING max_order_barplot

In [None]:
%%fsql
-- Frequency of orders by day of week
df = SELECT * FROM orders
OUTPUT USING dow_countplot
OUTPUT USING hour_countplot
OUTPUT USING days_since_prior_countplot

## Percentage of product orders that are reorders

Here we check what share of individual product orders are reorders. This tells us how often users buy products they have ordered before, as opposed to trying new ones.

In [None]:
%%fsql
tempdf = SELECT COUNT(*) AS total,
                SUM(CASE WHEN reordered = 1 THEN 1 ELSE 0 END) AS reordered
           FROM order_products
           
SELECT reordered / total * 100 AS percent_reordered
  FROM tempdf
 PRINT

## Orders with no re-ordered products

These are the situations where either the customer is buying products for the first time or they are buying an entirely new set of products. We only have data for the second order of a user onwards so we don't need to filter by the order_number.

We can do this by aggregating on the order_id and getting the MAX of the reordered columns. The average of the resulting binary column will be the percent with reorders. 1 minus this value will be the percentage without reorders.

In [None]:
%%fsql
tempdf = SELECT order_id, MAX(reordered) AS contains_reorder
           FROM order_products
       GROUP BY order_id

SELECT 100 - AVG(contains_reorder) * 100 AS pct_w_no_reorder
  FROM tempdf
 PRINT
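
The same two-step logic can be checked by hand on a few rows. The sketch below uses the standard library's `sqlite3` as a stand-in engine and a made-up six-row `order_products` table; the intermediate table is written as a subquery because SQLite has no FugueSQL-style variable assignment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_products (order_id INTEGER, product_id INTEGER, reordered INTEGER)"
)

# Hypothetical data: orders 1 and 3 contain a reorder, orders 2 and 4 do not
conn.executemany(
    "INSERT INTO order_products VALUES (?, ?, ?)",
    [(1, 10, 1), (1, 11, 0), (2, 10, 0), (2, 12, 0), (3, 13, 1), (4, 14, 0)],
)

query = """
SELECT 100 - AVG(contains_reorder) * 100 AS pct_w_no_reorder
  FROM (SELECT order_id, MAX(reordered) AS contains_reorder
          FROM order_products
      GROUP BY order_id)
"""
pct_w_no_reorder = conn.execute(query).fetchone()[0]
conn.close()

print(pct_w_no_reorder)
```

Two of the four toy orders have no reordered product, so the query reports half of the orders.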

## Other Tables

In [None]:
%%fsql
products = LOAD "/kaggle/working/products.csv" (header=TRUE, infer_schema=TRUE) YIELD FILE
PRINT 5 ROWS

aisles = LOAD "/kaggle/working/aisles.csv" (header=TRUE, infer_schema=TRUE) YIELD DATAFRAME
PRINT 5 ROWS

departments = LOAD "/kaggle/working/departments.csv" (header=TRUE, infer_schema=TRUE) YIELD DATAFRAME
PRINT 5 ROWS

## Dask to Handle Memory Spillover

The code snippet below had a lot of memory issues because we are joining all of the tables to the order_products table, which has 32 million rows. This will need some clever optimization to pull off in Pandas (converting dtypes or filtering columns before join). With Dask though, we can perform the join, have the operation spill over to disk, and then get the smaller result set.

As a rule of thumb, Pandas needs roughly 3x more RAM than the size of the data to run effectively. This means Dask will probably help your workflows much earlier than you expect. By default, Dask starts writing to disk when memory utilization reaches around 60-70%. This keeps Pandas operating effectively.
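
The Pandas-side optimization mentioned above (converting dtypes before a join) can be illustrated on synthetic data. This is a sketch with made-up columns, not a measurement on the Instacart tables; the target dtypes are assumptions that only hold when the values fit in the smaller types.

```python
import numpy as np
import pandas as pd

n_rows = 100_000

# Synthetic stand-in for two integer columns stored as 64-bit integers
df = pd.DataFrame({
    "reordered": np.zeros(n_rows, dtype="int64"),
    "product_id": np.arange(n_rows, dtype="int64"),
})
before = int(df.memory_usage(index=False).sum())

# Downcast: a 0/1 flag fits in int8, and these ids fit in int32
slim = df.astype({"reordered": "int8", "product_id": "int32"})
after = int(slim.memory_usage(index=False).sum())

print(f"before: {before} bytes, after: {after} bytes")
```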

In [None]:
%%fsql dask
-- Memory issues but dask solves it
result = SELECT order_id, aisle, product_name, department, reordered
           FROM order_products
     INNER JOIN products
             ON order_products.product_id = products.product_id
     INNER JOIN aisles
             ON products.aisle_id = aisles.aisle_id
     INNER JOIN departments
             ON departments.department_id = products.department_id
     SAVE OVERWRITE "/kaggle/working/result.parquet"

This is also the first place we observe how to change the execution engine in FugueSQL. All we have to do is specify it after the `%%fsql` cell magic. The corresponding SQL code will then run on that engine. If a DataFrame is available through `YIELD`, it will be converted under the hood.

## Parquet versus CSV

The previous operation was saved in a parquet file. Parquet is one of the most common file formats for distributed computing. There are a couple of advantages over CSVs. 

* Column based versus row based
* Compression (70% reduction in size)
* Optimization with Spark
* Schema
* Partition friendly
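
The first point, column based versus row based, can be pictured with plain Python containers. This is a conceptual sketch with made-up records, not how Parquet is implemented: the row layout has to visit every record to read one field, while the column layout keeps each field in its own list.

```python
# Row-oriented layout, similar in spirit to a CSV: one record per row
rows = [
    {"order_id": 1, "product_name": "Banana", "reordered": 1},
    {"order_id": 1, "product_name": "Milk", "reordered": 0},
    {"order_id": 2, "product_name": "Banana", "reordered": 1},
    {"order_id": 2, "product_name": "Eggs", "reordered": 1},
]

# Column-oriented layout: one list per field
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row layout: every record is touched even though only one field is needed
rate_from_rows = sum(row["reordered"] for row in rows) / len(rows)

# Column layout: only the 'reordered' list is read
rate_from_columns = sum(columns["reordered"]) / len(columns["reordered"])

print(rate_from_rows, rate_from_columns)
```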

## Top 20 Products

In [None]:
%%fsql
result = LOAD "/kaggle/working/result.parquet"

-- Top products
  SELECT product_name, COUNT(*) AS count
    FROM result
GROUP BY product_name
ORDER BY count DESC
   LIMIT 20
  OUTPUT USING top_products_barplot

-- Top aisles 
  SELECT aisle, COUNT(*) AS count
    FROM result
GROUP BY aisle
ORDER BY count DESC
   LIMIT 20
  OUTPUT USING top_aisles_barplot
 
-- Department pieplot
OUTPUT result USING department_pieplot
 

## Introduction to Partitions

In order to understand partitions, we can look at this image showing the way Dask scales Pandas. Each partition is a Pandas DataFrame. A Dask DataFrame is the collection of all of the Pandas DataFrames. Operations are done on each partition, and then aggregated back.

<img src="https://docs.dask.org/en/latest/_images/dask-dataframe.svg" align="left" width="400"/>
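
The idea of operating on each partition and then combining the results can be sketched in Pandas alone. The data and the partition size below are made up for illustration; Dask chooses and manages partitions itself.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "basket_size": [4, 6, 3, 5, 8, 2],
})

# Split the DataFrame into partitions of two rows; each one is itself a Pandas DataFrame
partition_size = 2
partitions = [df.iloc[i:i + partition_size] for i in range(0, len(df), partition_size)]

# Run the operation on every partition independently
partial_sums = [int(part["basket_size"].sum()) for part in partitions]

# Aggregate the partial results back into one answer
total = sum(partial_sums)
print(partial_sums, total)
```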

## [Reference on Partitions](https://blog.scottlogic.com/2018/03/22/apache-spark-performance.html) by Scott Logic

This reference has a lot of good images and explanations.

### Ideal Partitioning Strategy
![Partitioning](https://blog.scottlogic.com/mdebeneducci/assets/Ideal-Partitioning.png)
### Skewed Partitions
![Skewed Partitions](https://blog.scottlogic.com/mdebeneducci/assets/Skewed-Partitions.png)
### Inefficient Scheduling
![Inefficient Scheduling](https://blog.scottlogic.com/mdebeneducci/assets/Inefficient-Scheduling.png)
### Data Shuffling
![Shuffle](https://blog.scottlogic.com/mdebeneducci/assets/Shuffle-Diagram.png)

## Median Basket Size for Each Customer

Here we make a Python function to help us get the median `basket_size` for one specific user. The median `basket_size` of a `user_id` can be calculated without knowing the information of other `user_ids`. This is a good hint that we can do this on a per partition basis. The partition is a guide for parallelization. **Data that belong to the same partition will live inside the same executor**.

In [None]:
#schema: user_id:int, basket_size:int
def get_basket_size_median(df:pd.DataFrame) -> pd.DataFrame:
    # Select the column as a Series so that median() returns a single number
    return pd.DataFrame({'user_id': [df.iloc[0]['user_id']], 'basket_size' : [int(round(df['basket_size'].median()))]})
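
What "per partition" means here can be imitated in plain Pandas by calling the function once per `user_id` group and stacking the results. This is a sketch of the idea with made-up data and a local copy of the function, not a description of how Fugue executes `TRANSFORM ... PREPARTITION BY`.

```python
import pandas as pd


def get_basket_size_median(df: pd.DataFrame) -> pd.DataFrame:
    # Receives the rows of ONE user and returns a single summary row
    return pd.DataFrame({
        "user_id": [df.iloc[0]["user_id"]],
        "basket_size": [int(round(df["basket_size"].median()))],
    })


# Hypothetical baskets: user 1 has three orders, user 2 has two
baskets = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "basket_size": [2, 9, 4, 3, 5],
})

# Each group is processed without looking at any other user's rows
per_user = pd.concat(
    [get_basket_size_median(group) for _, group in baskets.groupby("user_id")],
    ignore_index=True,
)
print(per_user)
```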

In [None]:
# There are 206k users in the dataset. Let us downsample.
n = 50000

In [None]:
%%fsql
user_id = SELECT DISTINCT user_id FROM orders
user_id = SAMPLE {{n}} ROWS SEED 1 FROM user_id
YIELD FILE AS user_ids

**Production Note**

As a side note, some people are curious about how to move Fugue from notebooks to production. Fugue provides an `fsql` function that can be used from the programming interface. A user wraps their SQL query in a string and then passes it to `fsql`. Of course, we lose syntax highlighting because it's a Python string.

In [None]:
from fugue_sql import fsql

query = """
result = LOAD "/kaggle/working/result.parquet"

basket_size = SELECT order_id, COUNT(*) AS basket_size
                FROM result
            GROUP BY order_id 
  
order_id = SELECT orders.user_id, order_id 
             FROM orders
       INNER JOIN user_ids
               ON user_ids.user_id = orders.user_id
            
basket_size = SELECT user_id, order_id.order_id, basket_size
                FROM basket_size
          INNER JOIN order_id
                  ON basket_size.order_id = order_id.order_id
                  
PRINT 2 ROWS

TRANSFORM basket_size PREPARTITION BY user_id USING get_basket_size_median
PRINT 2 ROWS
"""

start = time.time()
fsql(query).run()
print(f"Operation took {time.time() - start} seconds")

Now we bring the same query into Spark by passing the engine into the `run` method. We will see the benefits of parallelizing the operation. Some of the performance gains are due to the optimizations of SparkSQL. There is more than a 4x speedup just from changing the execution engine.

In [None]:
start = time.time()
fsql(query).run("spark")
print(f"Operation took {time.time() - start} seconds")

## Top 5 Products for Each User

In this section we are interested in getting the top 5 products for each user. We also want to know how frequently they buy the products. Are they buying it every time they go into the store? Maybe this will tell us which customers are very predictable.

In [None]:
from fugue import FugueWorkflow
from fugue_spark import SparkExecutionEngine
from typing import List, Any, Iterable

# schema: user_id:int, product_name:str, count:int
def product_count(df:pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({'user_id': [df.iloc[0]['user_id']], 
                         'product_name': [df.iloc[0]['product_name']],
                         'count': [df.shape[0]]})

# schema: user_id:int, n_orders:int
def nth(df:Iterable[List[Any]], n) -> Iterable[List[Any]]:
    for row in df:
        if n==0:
            yield [row[0], row[1]]
            return
        n-=1

        
with FugueWorkflow(SparkExecutionEngine) as dag:
    product_orders = dag.load("/kaggle/working/result.parquet")
    
    order_id = dag.df(orders)
    users = dag.df(user_ids)
    
    # Filtering to our sampled users. Note the persist
    order_id = order_id.join(users, how="inner", on=["user_id"]).persist()
    
    # Count for each product by user_id
    tempdf = dag.df(product_orders).join(order_id, how="inner", on=["order_id"])
    tempdf = tempdf.partition(by=["user_id", "product_name"]).transform(product_count)

    # Join to tempdf2 which gets us the number of orders per user
    tempdf2 = order_id.partition(by=['user_id'], presort="order_id desc").transform(nth, params={"n":0})
    result = tempdf.join(tempdf2, how="inner", on=["user_id"])\
                    .partition(by=["user_id"], presort="count desc")\
                    .take(5)
    result.save("/kaggle/working/top_5_products.parquet", mode="overwrite")
    result.show(15)

In [None]:
%%fsql spark
product_orders = SELECT order_id, product_name FROM (LOAD "/kaggle/working/result.parquet")

-- Filtering to our sampled users
order_id = SELECT orders.user_id, order_id, order_number
             FROM orders
       INNER JOIN user_ids
               ON user_ids.user_id = orders.user_id PERSIST

-- Count for each product by user_id
tempdf = SELECT user_id, product_name, COUNT(*) AS count
           FROM product_orders
     INNER JOIN order_id
             ON product_orders.order_id = order_id.order_id
       GROUP BY user_id, product_name
    
-- Join to an inner select which gets us the number of orders per user
    SELECT tempdf.user_id, product_name, count, n_orders
      FROM tempdf
INNER JOIN (SELECT user_id, MAX(order_number) AS n_orders
            FROM order_id
            GROUP BY user_id) tempdf2
        ON tempdf.user_id = tempdf2.user_id
      TAKE 5 ROWS PREPARTITION BY user_id PRESORT count DESC  
      SAVE OVERWRITE "/kaggle/working/top_5_products.parquet"
     PRINT 15 ROWS
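
For readers more familiar with Pandas, the effect of `TAKE ... PREPARTITION BY user_id PRESORT count DESC` can be approximated with a sort followed by a grouped `head`. The sketch below uses made-up counts and takes the top 2 instead of the top 5 to keep the data small; it illustrates the result, not Fugue's execution.

```python
import pandas as pd

# Hypothetical per-user product counts
counts = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "product_name": ["Banana", "Milk", "Eggs", "Bread", "Banana"],
    "count": [5, 2, 9, 1, 4],
})

k = 2
top_k = (
    counts.sort_values("count", ascending=False)  # presort by count, largest first
    .groupby("user_id")                           # one partition per user
    .head(k)                                      # keep the first k rows of each partition
    .sort_values(["user_id", "count"], ascending=[True, False])
    .reset_index(drop=True)
)
print(top_k)
```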

## Persist and Lazy Evaluation

![DAG](https://www.edureka.co/community/?qa=blob&qa_blobid=12881994506202880144)

## Reorder Ratio Over Time

![Line of best fit](https://i.investopedia.com/content/video/line_of_best_fit_/lineofbestfit.png)

In [None]:
# schema: user_id:int, trend:double
def reorder_trend(df:pd.DataFrame) -> pd.DataFrame:
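    # Only fit a line when the user has more than 5 orders; otherwise report a flat trend
    if df.shape[0] > 5:
        # Slope of the least-squares line through reorder_rate over consecutive orders
        m, b = np.polyfit(list(range(df.shape[0])), df['reorder_rate'], 1)
        return pd.DataFrame({'user_id': [df.iloc[0]['user_id']], 'trend' : [m]})
    else:
        return pd.DataFrame({'user_id': [df.iloc[0]['user_id']], 'trend' : [0.0]})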
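
The trend is the slope of a straight line fitted through a user's reorder rates. A small worked example with made-up rates shows what `np.polyfit` with degree 1 returns: the slope first, then the intercept.

```python
import numpy as np

# Hypothetical reorder rates for four consecutive orders, rising by 0.2 each time
reorder_rate = [0.1, 0.3, 0.5, 0.7]
order_index = list(range(len(reorder_rate)))

# Degree-1 fit: m is the slope (the trend), b is the intercept
m, b = np.polyfit(order_index, reorder_rate, 1)
print(round(m, 6), round(b, 6))
```

A positive slope means the user reorders a growing share of each basket over time, and a negative slope means the opposite.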

In [None]:
%%fsql spark
product_orders = SELECT order_id, reordered FROM (LOAD "/kaggle/working/result.parquet")
order_id = SELECT orders.user_id, order_id, order_number
             FROM orders
       INNER JOIN user_ids
               ON user_ids.user_id = orders.user_id
               
    SELECT user_id, order_number, AVG(reordered) AS reorder_rate
      FROM product_orders
INNER JOIN order_id
        ON product_orders.order_id = order_id.order_id
  GROUP BY user_id, order_number
  TAKE 10 ROWS PREPARTITION BY user_id PRESORT order_number DESC
  PRINT 10 ROWS
  
trend = TRANSFORM PREPARTITION BY user_id PRESORT order_number ASC USING reorder_trend 
  PRINT 5 ROWS
  
  SELECT * FROM trend
  YIELD FILE AS trend_stats

Although not shown in this notebook, it is worth talking about PERSIST and lazy evaluation for those new to distributed computing. 
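
The idea can be sketched in plain Python. Here `load_rows` is a made-up stand-in for an expensive load, and the lambda stands in for a lazily evaluated plan; this illustrates the concept only and is not how Spark or Fugue implement it. Without persisting, every downstream use re-runs the upstream work; persisting materializes the result once.

```python
calls = {"load": 0}


def load_rows():
    """Pretend this is an expensive LOAD; it counts how often it runs."""
    calls["load"] += 1
    return [1, 2, 3, 4]


# Lazy: building the plan describes the computation but does not run it
plan = lambda: [x * 10 for x in load_rows()]
calls_after_planning = calls["load"]

# Two downstream consumers without persist: the upstream work runs twice
total = sum(plan())
largest = max(plan())
calls_without_persist = calls["load"]

# With persist: materialize once, then reuse the stored result
calls["load"] = 0
persisted = plan()
total_p = sum(persisted)
largest_p = max(persisted)
calls_with_persist = calls["load"]

print(calls_after_planning, calls_without_persist, calls_with_persist)
```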

In [None]:
%%fsql
SELECT * 
     FROM trend_stats
 ORDER BY trend DESC
 PRINT 10 ROWS
 
SELECT * 
     FROM trend_stats
 ORDER BY trend ASC
 PRINT 10 ROWS
