<img src="https://raw.githubusercontent.com/fugue-project/fugue/master/images/logo.svg" align="left" width="200"/>

<details>
<summary>About this notebook</summary>

This notebook is a demonstration of FugueSQL prepared for Thinkful Data Analyst Bootcamp students. **FugueSQL is a language that allows SQL users to work with in-memory data frameworks such as Pandas, Spark, and Dask through a SQL interface**. It has some differences from standard SQL that will be shown here.

FugueSQL aims to be more English-like and to provide a fun interface for data analysts to work with data in their tool of choice. The FugueSQL notebook extension lets users write FugueSQL with syntax highlighting in Jupyter notebook cells.

Fugue also has a programming interface that is not covered in this notebook. Links to the repo and the Slack channel are listed below for anyone interested.

## Links 

Fugue is a pure abstraction layer that makes code portable across computing frameworks such as Pandas, Spark, and Dask. It allows users to write code that is compatible with all three frameworks, guaranteeing consistent results regardless of scale and providing a unified interface for compute. All questions are welcome in the Slack channel.

[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://fugue-project.github.io/tutorials/) ⬅️ Open the tutorials

[![Homepage](https://img.shields.io/badge/fugue-source--code-red?logo=github)](https://github.com/fugue-project/fugue) ⬅️ Check out our source code

[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://join.slack.com/t/fugue-project/shared_invite/zt-jl0pcahu-KdlSOgi~fP50TZWmNxdWYQ) ⬅️ Chat with us on Slack

**Note:** Many of the plots and EDA steps here are based on [this notebook](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart) by [sudalairajkumar](https://github.com/SudalaiRajkumar).

</details>

# Setup

Install `FugueSQL` and `s3fs` (to access data stored on Amazon S3).

In [None]:
!pip install fugue[sql] s3fs

Import and run `setup()` to enable syntax highlighting for `FugueSQL` cells and the `%%fsql` magic.

In [1]:
from fugue_notebook import setup
setup()


# `FugueSQL` is `SQL` compliant 

In [11]:
%%fsql
data =  CREATE [["2020-01-01", 1, "a"],
               ["2020-01-02", 2, "b"],
               ["2020-01-03", 3, "c"],
               ["2020-01-04", 4, "a"],
               ["2020-01-05", 5, "b"]]
        SCHEMA date:datetime, val1:int, val2:str

YIELD DATAFRAME AS data 

## WHERE

In [13]:
%%fsql
SELECT *
 FROM data
WHERE date < "2020-01-03"
PRINT

Unnamed: 0,date,val1,val2
0,2020-01-01,1,a
1,2020-01-02,2,b


In [14]:
%%fsql
SELECT *
 FROM data
WHERE date BETWEEN "2020-01-03" AND "2020-01-05" 
PRINT 

Unnamed: 0,date,val1,val2
0,2020-01-03,3,c
1,2020-01-04,4,a
2,2020-01-05,5,b


In [23]:
%%fsql
SELECT date, val1, val2
FROM data
WHERE val2 IN ("a","b")
PRINT

Unnamed: 0,date,val1,val2
0,2020-01-01,1,a
1,2020-01-02,2,b
2,2020-01-04,4,a
3,2020-01-05,5,b


## GROUP BY

In [18]:
%%fsql
SELECT val2, SUM(val1) AS total
 FROM data
GROUP BY val2
PRINT 

Unnamed: 0,val2,total
0,a,5
1,b,7
2,c,3
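
Because the queries above are plain SQL, they can be cross-checked in any other SQL engine. The sketch below is a comparison under stated assumptions, not part of Fugue: it reruns the first `WHERE` query and the `GROUP BY` query with Python's built-in `sqlite3` module. Dates are stored as ISO text (SQLite has no dedicated datetime type), string literals use single quotes, and `ORDER BY` is added only to make the row order deterministic.

```python
import sqlite3

# The same five rows as the FugueSQL `data` dataframe; dates are kept as ISO
# strings so that text comparison matches chronological order.
rows = [
    ("2020-01-01", 1, "a"),
    ("2020-01-02", 2, "b"),
    ("2020-01-03", 3, "c"),
    ("2020-01-04", 4, "a"),
    ("2020-01-05", 5, "b"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (date TEXT, val1 INTEGER, val2 TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# Equivalent of the first WHERE cell.
where_rows = conn.execute(
    "SELECT * FROM data WHERE date < '2020-01-03' ORDER BY date"
).fetchall()

# Equivalent of the GROUP BY cell.
group_rows = conn.execute(
    "SELECT val2, SUM(val1) AS total FROM data GROUP BY val2 ORDER BY val2"
).fetchall()

print(where_rows)
print(group_rows)
conn.close()
```

The two result lists should contain the same rows as the `PRINT` outputs above, minus the index column that the notebook display adds.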


## CASE

In [21]:
%%fsql
SELECT date, val1,
CASE
    WHEN val1 = 1 THEN 'The quantity is 1'
    WHEN val1 = 2 THEN 'The quantity is 2'
    ELSE 'The quantity is greater than 2'
END AS val1_text
FROM data
PRINT

Unnamed: 0,date,val1,val1_text
0,2020-01-01,1,The quantity is 1
1,2020-01-02,2,The quantity is 2
2,2020-01-03,3,The quantity is greater than 2
3,2020-01-04,4,The quantity is greater than 2
4,2020-01-05,5,The quantity is greater than 2


# Superpowers

## Load & save data

`FugueSQL` enables users to work with data not stored in a database.

You can load CSV, JSON, or Parquet files stored locally or on remote file systems (`Amazon S3`, `Google Cloud Platform`, `Azure`, etc.).

We can load data, perform transformations on it, and then write out the results.

In [24]:
%%fsql
LOAD "s3://kaggle-data-instacart/aisles.csv" (header=true, infer_schema=TRUE)
SAVE OVERWRITE "/tmp/aisles.csv" (header=true)

In [33]:
%%fsql
aisles = LOAD "/tmp/aisles.csv" (header=TRUE, infer_schema=TRUE)
YIELD DATAFRAME AS aisles

In [34]:
%%fsql
SELECT * FROM aisles
WHERE aisle_id = 3
PRINT
SAVE OVERWRITE "/tmp/working/aisles-modified.csv"

Unnamed: 0,aisle_id,aisle
0,3,energy granola bars
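
To see what the `LOAD`, `SELECT`, and `SAVE` steps save you from writing, here is roughly the same round trip done by hand with Python's standard library. This is a sketch for comparison only, not how FugueSQL is implemented. The file lives in a temporary directory, only the aisle 3 row is taken from the output above, and the other two rows are placeholders.

```python
import csv
import tempfile
from pathlib import Path

# Illustrative rows only: aisle 3 matches the output above, the other two
# names are placeholders rather than real Instacart data.
sample_rows = [
    {"aisle_id": "1", "aisle": "placeholder aisle one"},
    {"aisle_id": "2", "aisle": "placeholder aisle two"},
    {"aisle_id": "3", "aisle": "energy granola bars"},
]

workdir = Path(tempfile.mkdtemp())
source_path = workdir / "aisles.csv"
target_path = workdir / "aisles-modified.csv"

# Stand-in for the file that LOAD would read.
with source_path.open("w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["aisle_id", "aisle"])
    writer.writeheader()
    writer.writerows(sample_rows)

# LOAD with header=true: the first line supplies the column names.
with source_path.open(newline="") as handle:
    aisles = list(csv.DictReader(handle))

# SELECT * FROM aisles WHERE aisle_id = 3 (CSV values are text until converted).
selected = [row for row in aisles if int(row["aisle_id"]) == 3]

# SAVE OVERWRITE: mode "w" replaces any existing file at the target path.
with target_path.open("w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["aisle_id", "aisle"])
    writer.writeheader()
    writer.writerows(selected)

print(target_path.read_text())
```

The hand-written version has to manage file handles, headers, and type conversion explicitly, whereas the FugueSQL cell expresses the same intent in four lines.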


## Ran out of memory?  No problem!

`FugueSQL` runs on `pandas` by default, which loads all of the data into memory.

`FugueSQL` can optionally run on `spark` or `dask` instead and use `memory spillover` to handle bigger-than-memory data.

> `pandas` also uses only a single core on your local machine, whereas `spark` and `dask` can use all available cores. As a result, large datasets can see substantial speed-ups.
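
The idea behind processing bigger-than-memory data can be illustrated without any framework: split the rows into partitions, aggregate each partition on its own, then combine the partial results. The sketch below is a conceptual illustration in plain Python, not how Spark or Dask are implemented. The helper names `iter_chunks` and `count_by_key` and the sample rows are made up for this example.

```python
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most `chunk_size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_by_key(rows, key_index, chunk_size=2):
    """Count rows per key while holding only one chunk in memory at a time."""
    totals = Counter()
    for chunk in iter_chunks(rows, chunk_size):
        # Aggregate a single partition.
        partial = Counter(row[key_index] for row in chunk)
        # Combine the partial result into the running totals.
        totals.update(partial)
    return dict(totals)


# Made-up (product_id, department_id) pairs for illustration only.
sample = [(1, 10), (2, 10), (3, 20), (4, 10), (5, 20)]
counts = count_by_key(iter(sample), key_index=1)
print(counts)
```

Distributed engines apply the same partition-then-combine pattern, with the added ability to run partitions on separate cores or machines and to spill intermediate data to disk.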

In [30]:
%%fsql
LOAD "s3://kaggle-data-instacart/products.csv" (header=true)
SAVE OVERWRITE "/tmp/products.csv" (header=true)
YIELD DATAFRAME AS products

In [36]:
%%fsql
LOAD "s3://kaggle-data-instacart/order_products__prior.csv" (header=true)
SAVE OVERWRITE "/tmp/order_products__prior.csv" (header=true)
YIELD DATAFRAME AS order_products

In [None]:
%%fsql
LOAD "s3://kaggle-data-instacart/departments.csv" (header=true)
SAVE OVERWRITE "/tmp/departments.csv" (header=true)
YIELD DATAFRAME AS departments

In [40]:
%%fsql
SELECT order_id, aisle, product_name, department, reordered FROM order_products
INNER JOIN products ON order_products.product_id = products.product_id
INNER JOIN aisles ON products.aisle_id = aisles.aisle_id
INNER JOIN departments ON departments.department_id = products.department_id
SAVE OVERWRITE "/tmp/working/result.parquet"

_3 _State.RUNNING -> _State.FAILED  "departments is not found in ['_0', '_1', '_2']"


KeyError: "departments is not found in ['_0', '_1', '_2']"

In [None]:
%%fsql dask
SELECT order_id, aisle, product_name, department, reordered FROM order_products
INNER JOIN products ON order_products.product_id = products.product_id
INNER JOIN aisles ON products.aisle_id = aisles.aisle_id
INNER JOIN departments ON departments.department_id = products.department_id
SAVE OVERWRITE "/tmp/working/result.parquet"

Unnamed: 0,department_id,count
0,1,4007
1,2,548
2,3,1516
3,4,1684
4,5,1054
