# Producing Complex Data

In the previous notebook we saw how to query complex data and started exploring RAW's data model.

In this notebook we continue this exploration by showing how RAW queries can also output complex data structures, which will be particularly useful when we see how to export RAW results into formats like JSON or XML.

In [8]:
%load_ext raw_magic

The raw_magic extension is already loaded. To reload it, use:
  %reload_ext raw_magic


## Collections in the output field of a SELECT

We start with the following query:

In [9]:
%%rql

SELECT *
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale

country,products,products
country,category,cost
CH,Keyboard,50
CH,Keyboard,70
CH,Monitor,450
US,Keyboard,20
US,Monitor,200


Recall that `products` is a "nested table".

Let's analyze the following query:

In [13]:
%%rql

SELECT sale.country, (SELECT p.cost FROM sale.products AS p) AS products_cost
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale

country,products_cost
CH,50
CH,70
CH,450
US,20
US,200


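This query returns two rows: one for `CH`, another for `US`. The tabular display above repeats the country once per nested value, which is why it shows more than two lines.

The first column is the country, and the second column is a list of the costs of the products sold in that country.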

How does this work?

The inner `SELECT` is `(SELECT p.cost FROM sale.products AS p)`.
We can think of this as a normal query over a table called `sale.products`.
The output of that query is a table with the cost of each product.

When we compose it in a single query:
```
SELECT sale.country, (SELECT p.cost FROM sale.products AS p) AS products_cost
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale
```
... then the result is a table, where the second column contains another table.

If we were to represent the *output* as JSON, it would look like:
```
[
  {"country": "CH",
   "products_cost": [50, 70, 450]},
  {"country": "US",
   "products_cost": [20, 200]}
]
```
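
For readers who think in Python, the same nesting can be sketched with plain lists and dictionaries. This is only an analogy, not how RAW executes the query: the `sales` list below is an in-memory stand-in whose shape is assumed from the query output shown earlier, and nothing is read from the tutorial URL.

```python
import json

# In-memory stand-in for sales.json; the shape is assumed from the output above.
sales = [
    {"country": "CH", "products": [
        {"category": "Keyboard", "cost": 50},
        {"category": "Keyboard", "cost": 70},
        {"category": "Monitor", "cost": 450},
    ]},
    {"country": "US", "products": [
        {"category": "Keyboard", "cost": 20},
        {"category": "Monitor", "cost": 200},
    ]},
]

# The outer comprehension plays the role of the outer SELECT over sales;
# the inner comprehension plays the role of the inner SELECT over sale.products.
result = [
    {"country": sale["country"],
     "products_cost": [p["cost"] for p in sale["products"]]}
    for sale in sales
]

print(json.dumps(result, indent=2))
```

The inner comprehension runs once per `sale`, which mirrors how the inner `SELECT` is evaluated against each row's own `products` table.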

To further demonstrate that `SELECT`s are just operations over collections of data, let's add a filter to the inner `SELECT`:

In [17]:
%%rql

SELECT sale.country, (SELECT p.cost FROM sale.products AS p WHERE p.cost > 60) AS products_cost_over_60
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale

country,products_cost_over_60
CH,70
CH,450
US,200


The inner `SELECT` now keeps only the products that cost more than 60.

We can even do aggregations:

In [16]:
%%rql

SELECT sale.country, (SELECT COUNT(*) FROM sale.products AS p WHERE p.cost > 60) AS number_products_cost_over_60
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale

country,number_products_cost_over_60
CH,2
US,1


This query counts the number of products in each country that cost over 60.
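
As a rough Python analogy (a sketch, not RAW's implementation), the filter becomes an `if` clause in the inner comprehension and the `COUNT(*)` becomes a count over the filtered items. The `sales` list is again an in-memory stand-in assumed from the output shown earlier.

```python
# In-memory stand-in for sales.json; the shape is assumed from the output above.
sales = [
    {"country": "CH", "products": [
        {"category": "Keyboard", "cost": 50},
        {"category": "Keyboard", "cost": 70},
        {"category": "Monitor", "cost": 450},
    ]},
    {"country": "US", "products": [
        {"category": "Keyboard", "cost": 20},
        {"category": "Monitor", "cost": 200},
    ]},
]

# Inner SELECT with WHERE p.cost > 60: filter inside the nested comprehension.
over_60 = [
    {"country": s["country"],
     "products_cost_over_60": [p["cost"] for p in s["products"] if p["cost"] > 60]}
    for s in sales
]

# Inner SELECT COUNT(*) with the same filter: one number per country.
counts = [
    {"country": s["country"],
     "number_products_cost_over_60": sum(1 for p in s["products"] if p["cost"] > 60)}
    for s in sales
]

print(over_60)
print(counts)
```

Note that the filtered version still yields a collection per country, while the aggregated version yields a single value per country.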

## Extensions to GROUP BY

We start with a traditional aggregation in SQL.

In [47]:
%%rql

SELECT Country, COUNT(*)
FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")
GROUP BY Country
LIMIT 2

Country,_2
Afghanistan,21
Albania,1


This query lists the number of airports per country.

In RAW, however, `GROUP BY` produces "groups" that can themselves be queried.

When the `GROUP BY` keyword is used, the `*` is bound to the group.

To query the entire "group" for a given country - i.e. the airports in each country - we can do:

In [48]:
%%rql

SELECT Country, *
FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")
GROUP BY Country
LIMIT 2

Country,_2,_2,_2,_2,_2,_2,_2,_2,_2,_2,_2,_2
Country,AirportID,Name,City,Country,IATA_FAA,ICAO,Latitude,Longitude,Altitude,Timezone,DST,TZ
Afghanistan,2048,Herat,Herat,Afghanistan,HEA,OAHR,34.210017,62.2283,3206,4.5,U,Asia/Kabul
Afghanistan,2049,Jalalabad,Jalalabad,Afghanistan,JAA,OAJL,34.399842,70.498625,1814,4.5,U,Asia/Kabul
Afghanistan,2050,Kabul Intl,Kabul,Afghanistan,KBL,OAKB,34.565853,69.212328,5877,4.5,U,Asia/Kabul
Afghanistan,2051,Kandahar,Kandahar,Afghanistan,KDH,OAKN,31.505756,65.847822,3337,4.5,U,Asia/Kabul
Afghanistan,2052,Maimana,Maimama,Afghanistan,MMZ,OAMN,35.930789,64.760917,2743,4.5,U,Asia/Kabul
Afghanistan,2053,Mazar I Sharif,Mazar-i-sharif,Afghanistan,MZR,OAMS,36.706914,67.209678,1284,4.5,U,Asia/Kabul
Afghanistan,2054,Shindand,Shindand,Afghanistan,,OASD,33.391331,62.260975,3773,4.5,U,Asia/Kabul
Afghanistan,2055,Sheberghan,Sheberghan,Afghanistan,,OASG,36.750783,65.913164,1053,4.5,U,Asia/Kabul
Afghanistan,2056,Konduz,Kunduz,Afghanistan,UND,OAUZ,36.665111,68.910833,1457,4.5,U,Asia/Kabul
Afghanistan,5922,Faizabad Airport,Faizabad,Afghanistan,FBD,OAFZ,37.1211,70.5181,3872,4.5,U,Asia/Kabul


The `*` is a "nested table" containing all rows in the group defined by the `GROUP BY` clause.

In this example the `*` is all the airports in a given `Country`, since the query does `GROUP BY Country`.
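
The idea that each group is itself a table can be sketched in Python, purely as an analogy. The first three rows below are copied from the output above and reduced to three columns; the `Examplia` row is a made-up placeholder added only so that there is a second group.

```python
from collections import defaultdict

# Rows 1-3 come from the output above (three columns only).
# The "Examplia" row is a hypothetical placeholder, not real data.
airports = [
    {"Country": "Afghanistan", "Name": "Herat", "City": "Herat"},
    {"Country": "Afghanistan", "Name": "Jalalabad", "City": "Jalalabad"},
    {"Country": "Afghanistan", "Name": "Kabul Intl", "City": "Kabul"},
    {"Country": "Examplia", "Name": "Example Field", "City": "Sample City"},
]

# GROUP BY Country: collect the rows of each country into its own list.
groups = defaultdict(list)
for row in airports:
    groups[row["Country"]].append(row)

# SELECT Country, COUNT(*): the aggregate is computed over the group.
counts = [{"Country": c, "_2": len(g)} for c, g in groups.items()]

# SELECT Country, *: the group itself is returned as a nested table.
nested = [{"Country": c, "_2": g} for c, g in groups.items()]

print(counts)
print(nested[0]["_2"])
```

In both cases `*` stands for the same thing, the list of rows in the group; `COUNT(*)` simply reduces that list to a number.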

Since `*` is a table, we can query it as usual:

In [50]:
%%rql

SELECT Country, (SELECT Name, City FROM *)
FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")
GROUP BY Country
LIMIT 2

Country,_2,_2
Country,Name,City
Afghanistan,Herat,Herat
Afghanistan,Jalalabad,Jalalabad
Afghanistan,Kabul Intl,Kabul
Afghanistan,Kandahar,Kandahar
Afghanistan,Maimana,Maimama
Afghanistan,Mazar I Sharif,Mazar-i-sharif
Afghanistan,Shindand,Shindand
Afghanistan,Sheberghan,Sheberghan
Afghanistan,Konduz,Kunduz
Afghanistan,Faizabad Airport,Faizabad


... or even ...

In [55]:
%%rql

SELECT Country, (SELECT City, COUNT(*) FROM * GROUP BY City)
FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")
GROUP BY Country
LIMIT 2

Country,_2,_2
Country,City,_2
Afghanistan,Shank,1
Afghanistan,Camp Bastion,1
Afghanistan,Jalalabad,1
Afghanistan,Chaghcharan,1
Afghanistan,Tarin Kowt,1
Afghanistan,Sheberghan,1
Afghanistan,Kunduz,1
Afghanistan,Kabul,2
Afghanistan,Shindand,1
Afghanistan,Sharan,1


This groups the airports by Country, and then by City.

The `COUNT(*)` in the inner `SELECT` refers to the groups created by `GROUP BY City`.
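
A Python analogy for the two nested queries above might look as follows. It is a sketch under stated assumptions: two rows are taken from the earlier output, and the `Placeholder Airfield` row is hypothetical, added only so that one city has more than one airport.

```python
from collections import Counter, defaultdict

airports = [
    {"Country": "Afghanistan", "Name": "Herat", "City": "Herat"},
    {"Country": "Afghanistan", "Name": "Kabul Intl", "City": "Kabul"},
    # Hypothetical row, not real data: gives Kabul a second airport.
    {"Country": "Afghanistan", "Name": "Placeholder Airfield", "City": "Kabul"},
]

# Outer GROUP BY Country.
by_country = defaultdict(list)
for row in airports:
    by_country[row["Country"]].append(row)

# (SELECT Name, City FROM *): project two columns out of each group.
names_and_cities = [
    {"Country": country,
     "_2": [{"Name": r["Name"], "City": r["City"]} for r in group]}
    for country, group in by_country.items()
]

# (SELECT City, COUNT(*) FROM * GROUP BY City): regroup each group by city.
# Counter plays the role of the inner GROUP BY plus COUNT(*).
city_counts = [
    {"Country": country,
     "_2": [{"City": city, "_2": n}
            for city, n in Counter(r["City"] for r in group).items()]}
    for country, group in by_country.items()
]

print(names_and_cities)
print(city_counts)
```

The inner count is taken over the rows of one city within one country, never over the whole country.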

## Top-Level Collections

Let's look in more detail at the output of the following queries:

In [68]:
%%rql

SELECT country AS name FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json")

name
CH
US


This query returns a list of countries. Each row in the output has the column `name`.

If we were to visualize the output as JSON, it would be:
```
[
  {"name": "CH"},
  {"name": "US"}
]
```

Now the following query:

In [70]:
%%rql

SELECT country FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json")

string
CH
US


... appears similar, but note that the `AS` alias is omitted.

The output is different: each row is in fact a string. There is no record.

If we were to visualize the output as JSON, it would be:

```
  ["CH", "US"]
```

We can confirm this by asking for the output type of the query, with the RAW Jupyter magic `%%query_validate`.

In [78]:
%%query_validate

SELECT country AS name FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json")

collection(record(name: string))


In [79]:
%%query_validate

SELECT country FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json")

collection(string)


Note that the first example returns `collection(record(name: string))`, which is RAW's type representation for a collection of records, each with a single field `name` of type string.

The second returns `collection(string)`, which is RAW's type representation for a collection of strings.
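
In Python terms, the difference between the two types is the difference between a list of dictionaries and a list of strings. The sketch below is only an analogy; `sales` is an in-memory stand-in that carries just the `country` field used by these queries.

```python
import json

# Stand-in for sales.json, reduced to the one field these queries use.
sales = [{"country": "CH"}, {"country": "US"}]

# SELECT country AS name ...  ~ collection(record(name: string))
with_alias = [{"name": s["country"]} for s in sales]

# SELECT country ...          ~ collection(string)
without_alias = [s["country"] for s in sales]

print(json.dumps(with_alias))     # [{"name": "CH"}, {"name": "US"}]
print(json.dumps(without_alias))  # ["CH", "US"]
```

Code that consumes the result has to match the shape: `row["name"]` works on the first list, while each item of the second list is already the string.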

## Top-Level Records 

The syntax `(field1: "One", field2: 1)` is used to create a record with two fields: `field1`, a string with value `"One"`, and `field2`, an integer with value 1.

Collections and Records can be nested in RAW, so the following is a valid query:

In [63]:
%%rql

(
    Countries: (SELECT DISTINCT Country FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")),
    Number_Of_Airports: (SELECT COUNT(*) FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv"))
)

Countries,Number_Of_Airports
Afghanistan,8107
Albania,8107
Angola,8107
Antarctica,8107
Australia,8107
Barbados,8107
Belarus,8107
Benin,8107
Bermuda,8107
Bhutan,8107


We can again use `%%query_validate` to see the output type:

In [80]:
%%query_validate


(
    Countries: (SELECT DISTINCT Country FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")),
    Number_Of_Airports: (SELECT COUNT(*) FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv"))
)

record(Countries: collection(string), Number_Of_Airports: long)


This confirms that the output of this query is a record with two fields: `Countries`, a collection of strings, and `Number_Of_Airports`, a long.
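
As a loose Python analogy, a top-level record corresponds to a single dictionary rather than a list of rows. The sketch assumes a tiny in-memory `airports` list: the Afghan rows come from the earlier output, and the `Examplia` row is a hypothetical placeholder.

```python
import json

airports = [
    {"Country": "Afghanistan", "Name": "Herat"},
    {"Country": "Afghanistan", "Name": "Kabul Intl"},
    # Hypothetical row, not real data: provides a second country.
    {"Country": "Examplia", "Name": "Example Field"},
]

result = {
    # SELECT DISTINCT Country: dict.fromkeys drops duplicates, keeping first-seen order.
    "Countries": list(dict.fromkeys(r["Country"] for r in airports)),
    # SELECT COUNT(*): a single number, not a collection.
    "Number_Of_Airports": len(airports),
}

print(json.dumps(result, indent=2))
```

Exported as JSON, such a result is one object whose `Countries` field is an array and whose `Number_Of_Airports` field is a number, which matches the type `record(Countries: collection(string), Number_Of_Airports: long)`.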

**Next:** [Advanced Data Discovery](Advanced%20Data%20Discovery.ipynb)