In [1]:
%load_ext raw_magic

# Denormalize JSON

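The JSON file `country_sales.json` contains sales records grouped by country: each country holds a nested list of sales, and each sale has an item ID, a cost, and a date.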

In [2]:
%buckets_register raw-tutorial

API error: S3 credentials already exists


In [3]:
%%query
read("s3://raw-tutorial/ipython-demos/country_sales.json")

country,sales,sales,sales
country,item,cost,date
CH,201,450,2017-05-01 12:23
CH,100,70,2017-05-01 13:01
CH,201,270,2017-05-01 13:54
CH,110,30,2017-05-01 14:01
CH,101,80,2017-05-01 16:22
CH,101,450,2017-05-02 01:03
US,200,210,2017-05-01 09:16
US,200,200,2017-05-01 09:58
US,210,320,2017-05-01 10:21
US,112,40,2017-05-01 11:01


Because the file contains nested data (each country holds a list of sales), we "denormalize" it into a regular flat table using a SQL extension: the FROM clause in the view below iterates over both `file` and the nested `cs.sales`. The data is also cleaned: each item ID is given an `"i#"` prefix and the `date` field is converted into a timestamp.


In [4]:
%%view country_sales
file := read("s3://raw-tutorial/ipython-demos/country_sales.json");

SELECT cs.country, "i#" + item AS item, cost,
     TO_TIMESTAMP(date, "yyyy-MM-dd HH:mm") AS date
   FROM cs IN file, cs.sales


View "country_sales" was replaced


In [5]:
%%query 
SELECT * FROM country_sales

country,item,cost,date
CH,i#201,450,2017-05-01 12:23:00
CH,i#100,70,2017-05-01 13:01:00
CH,i#201,270,2017-05-01 13:54:00
CH,i#110,30,2017-05-01 14:01:00
CH,i#101,80,2017-05-01 16:22:00
CH,i#101,450,2017-05-02 01:03:00
US,i#200,210,2017-05-01 09:16:00
US,i#200,200,2017-05-01 09:58:00
US,i#210,320,2017-05-01 10:21:00
US,i#112,40,2017-05-01 11:01:00
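
For readers who think in Python, the flattening performed by the view can be sketched with a nested loop. This is only an analogy, not how RAW executes the query. The in-memory `country_sales` list is a hypothetical excerpt whose nested layout is inferred from the two-level header in the first output, and `denormalize` is an illustrative helper name.

```python
from datetime import datetime

# Hypothetical excerpt of country_sales.json. The nested layout is an
# assumption inferred from the two-level header shown in the first output.
country_sales = [
    {"country": "CH", "sales": [
        {"item": 201, "cost": 450, "date": "2017-05-01 12:23"},
        {"item": 100, "cost": 70, "date": "2017-05-01 13:01"},
    ]},
    {"country": "US", "sales": [
        {"item": 200, "cost": 210, "date": "2017-05-01 09:16"},
    ]},
]


def denormalize(records):
    """Emit one flat row per nested sale, repeating the parent's country."""
    rows = []
    for cs in records:              # plays the role of "cs IN file"
        for sale in cs["sales"]:    # plays the role of "cs.sales"
            rows.append({
                "country": cs["country"],
                # str() covers both numeric and string item IDs
                "item": "i#" + str(sale["item"]),
                "cost": sale["cost"],
                "date": datetime.strptime(sale["date"], "%Y-%m-%d %H:%M"),
            })
    return rows


flat_sales = denormalize(country_sales)
for row in flat_sales:
    print(row["country"], row["item"], row["cost"], row["date"])
```

Each nested sale becomes its own row and the country value is repeated on every row, which is the shape the `country_sales` view returns.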


# Convert text file into table

The file `products.txt` is a plain text file containing product descriptions, one per line.

In [6]:
%%query
read("s3://raw-tutorial/ipython-demos/products.txt")

string
i#100: Monitor Samsung S24E450DL LED Monitor
i#101: Monitor Acer V276HLCbmdpx ZeroFrame Monitor
i#110: Monitor Dell UltraSharp U2414H
i#111: Monitor Dell UltraSharp U2715H
i#112: Monitor Dell Professional P2715Q 4K Monitor
i#200: Keyboard Microsoft Sculpt Ergonomic Keyboard
i#201: Keyboard Microsoft Sculpt Comfort Desktop
i#210: Keyboard Logitech Wireless Illuminated Keyboard


The text file is converted into a structured collection of records using the `PARSE AS` keyword, which matches each line against a regular expression; `INTO` then maps the captured groups (`_1`, `_2`, `_3`) to named fields. As a result, `items` now looks like a regular SQL table and can be queried as such.

In [7]:
%%view items
read("s3://raw-tutorial/ipython-demos/products.txt")
        PARSE AS r"""(i#\w+): (\w+) (.*)"""
        INTO (id: _1, category: _2, model: _3)

View "items" was replaced


In [8]:
%query items

id,category,model
i#100,Monitor,Samsung S24E450DL LED Monitor
i#101,Monitor,Acer V276HLCbmdpx ZeroFrame Monitor
i#110,Monitor,Dell UltraSharp U2414H
i#111,Monitor,Dell UltraSharp U2715H
i#112,Monitor,Dell Professional P2715Q 4K Monitor
i#200,Keyboard,Microsoft Sculpt Ergonomic Keyboard
i#201,Keyboard,Microsoft Sculpt Comfort Desktop
i#210,Keyboard,Logitech Wireless Illuminated Keyboard
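
The same line-to-record conversion can be approximated in Python with the standard `re` module. This is a rough sketch that reuses the regular expression from the view; the sample `lines`, the `parse_line` helper, and the choice to skip non-matching lines are assumptions made for illustration (how RAW treats non-matching lines is not shown here).

```python
import re

# Same pattern as in the view: item ID, a one-word category, then the model.
LINE_PATTERN = re.compile(r"(i#\w+): (\w+) (.*)")

# Two lines copied from products.txt as shown above.
lines = [
    "i#100: Monitor Samsung S24E450DL LED Monitor",
    "i#200: Keyboard Microsoft Sculpt Ergonomic Keyboard",
]


def parse_line(line):
    """Turn one text line into a record, or None if it does not match."""
    match = LINE_PATTERN.fullmatch(line)
    if match is None:
        return None
    item_id, category, model = match.groups()  # groups _1, _2, _3
    return {"id": item_id, "category": category, "model": model}


# Assumption: lines that do not match the pattern are simply dropped.
items = [rec for rec in map(parse_line, lines) if rec is not None]
for rec in items:
    print(rec["id"], rec["category"], rec["model"], sep=",")
```

Because `\w+` stops at the first space, the category is always the first word after the colon, and `(.*)` captures the rest of the line as the model.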


# Join JSON and Text file

Now that `country_sales` and `items` are regular tables, they can be joined. The query below joins a JSON file with a text file, both of which have been preprocessed using RAW queries.

**Note that no schemas were created, no data was explicitly loaded, and no separate ETL process or scripts were needed: these steps are all handled internally by RAW and are transparent to the user.**

In [9]:
%%query 

joinedView := SELECT *
    FROM country_sales, items
    WHERE item = id;

SELECT category, MIN(cost) AS min, AVG(cost) AS avg, MAX(cost) AS max
FROM joinedView
WHERE date < TIMESTAMP "2017-05-02 00:00:00"
GROUP BY category

category,min,avg,max
Monitor,30,55,80
Keyboard,200,290,450
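
The join, filter, and grouping steps of this query can be mirrored in plain Python. The sketch below uses only a small subset of the rows shown earlier, so its numbers are not meant to match the full query result; the variable names are illustrative and the dictionary lookup stands in for the `item = id` join condition.

```python
from datetime import datetime

# A few rows copied from the country_sales output above (not the full table).
sales = [
    ("CH", "i#201", 450, datetime(2017, 5, 1, 12, 23)),
    ("CH", "i#100", 70, datetime(2017, 5, 1, 13, 1)),
    ("CH", "i#101", 450, datetime(2017, 5, 2, 1, 3)),
    ("US", "i#200", 210, datetime(2017, 5, 1, 9, 16)),
    ("US", "i#112", 40, datetime(2017, 5, 1, 11, 1)),
]

# id -> category, copied from the items output above.
categories = {
    "i#100": "Monitor",
    "i#101": "Monitor",
    "i#112": "Monitor",
    "i#200": "Keyboard",
    "i#201": "Keyboard",
}

cutoff = datetime(2017, 5, 2)  # WHERE date < TIMESTAMP "2017-05-02 00:00:00"

costs_by_category = {}
for country, item, cost, date in sales:
    category = categories.get(item)  # join on item = id
    if category is None or date >= cutoff:
        continue  # no matching item, or outside the date filter
    costs_by_category.setdefault(category, []).append(cost)

# GROUP BY category with MIN, AVG, MAX
summary = {
    category: (min(costs), sum(costs) / len(costs), max(costs))
    for category, costs in costs_by_category.items()
}
for category, (lo, avg, hi) in summary.items():
    print(category, lo, avg, hi, sep=",")
```

The sale dated 2017-05-02 01:03 is excluded by the date filter, which is why its cost of 450 does not appear in the Monitor aggregates.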
