# OLAP Cube Operations

All the database tables in this notebook are based on public database samples and transformations.
- `Sakila` is a sample database created by *MySQL*: [Link](https://dev.mysql.com/doc/sakila/en/).
- The *Postgres* version of it is called `Pagila`: [Link](https://github.com/devrimgunduz/pagila).

## Connect to the local database where Pagila is loaded

There is no need to run the cell below if `pagila_star` has already been created and populated.

In [1]:
# !PGPASSWORD=root createdb -h localhost -U root pagila_star
# !PGPASSWORD=root psql -q -h localhost -U root -d pagila_star -f ../pagila_data/pagila-star.sql

In [2]:
import sql
%load_ext sql

DB_ENDPOINT = "localhost"
DB_NAME = 'pagila_star'
DB_USER = 'root'
DB_PASSWORD = 'root'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_ENDPOINT}:{DB_PORT}/{DB_NAME}"

print(conn_string)
%sql $conn_string

postgresql://root:root@localhost:5432/pagila_star


## Star Schema

<img src="../images/pagila-star.png" width=50%>

## Roll-Up

- Stepping up the level of aggregation to a larger grouping.
- e.g. revenue per `city` is summed up to revenue per `country`.
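
As a rough illustration of the idea outside SQL, the sketch below sums the same fact rows at two levels of a hierarchy. The rows, the values, and the `roll_up` helper are hypothetical and are not taken from `pagila_star`.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, sales_amount). Values are made up.
sales = [
    ("Lancaster", "United States", 10.0),
    ("Bellevue", "United States", 5.0),
    ("Dhaka", "Bangladesh", 7.5),
]


def roll_up(rows, key_index):
    """Sum sales_amount grouped by the column at position key_index."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


revenue_by_city = roll_up(sales, 0)     # finer level: one total per city
revenue_by_country = roll_up(sales, 1)  # rolled up: one total per country
print(revenue_by_country)
```

Rolling up never changes the grand total; it only merges cells of the finer level into fewer, larger cells.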

**EXAMPLE**  
**Write a query that calculates revenue (sales_amount) by day, rating, and country. Sort the data by revenue in descending order, and limit the data to the top 10 results.**

In [3]:
%%time
%%sql
SELECT dd.day, dm.rating, dc.country, SUM(fs.sales_amount) AS revenue
FROM dimdate dd, dimmovie dm, dimcustomer dc, factsales fs
WHERE fs.date_key = dd.date_key
AND fs.movie_key = dm.movie_key
AND fs.customer_key = dc.customer_key
GROUP BY 1, 2, 3
ORDER BY 4 DESC
LIMIT 10;

 * postgresql://root:***@localhost:5432/pagila_star
10 rows affected.
CPU times: user 2.81 ms, sys: 0 ns, total: 2.81 ms
Wall time: 14.7 ms


day,rating,country,revenue
30,G,China,169.67
30,PG,India,156.67
30,NC-17,India,153.64
30,PG-13,China,146.67
30,R,China,145.66
30,R,India,143.68
30,G,India,137.67
18,NC-17,India,135.75
30,PG,China,131.72
21,PG-13,India,128.74


## Drill-down
- Breaking up one of the dimensions into a lower (finer) level.
- e.g. `country` is broken up into `district`, and `district` into `city`.

**EXAMPLE**  
**Write a query that calculates revenue (sales_amount) by day, rating, and district. Sort the data by revenue in descending order, and limit the data to the top 10 results.**

In [4]:
%%time
%%sql
SELECT dd.day, dm.rating, dc.district, SUM(fs.sales_amount) AS revenue
FROM dimdate dd, dimmovie dm, dimcustomer dc, factsales fs
WHERE fs.date_key = dd.date_key
AND fs.movie_key = dm.movie_key
AND fs.customer_key = dc.customer_key
GROUP BY 1, 2, 3
ORDER BY 4 DESC
LIMIT 10;

 * postgresql://root:***@localhost:5432/pagila_star
10 rows affected.
CPU times: user 2.42 ms, sys: 0 ns, total: 2.42 ms
Wall time: 17.7 ms


day,rating,district,revenue
30,PG-13,Southern Tagalog,53.88
30,G,Inner Mongolia,38.93
30,G,Shandong,36.93
30,NC-17,West Bengali,36.92
17,PG-13,Shandong,34.95
1,PG,California,32.94
18,NC-17,So Paulo,32.93
21,R,So Paulo,31.93
30,NC-17,Buenos Aires,31.93
30,PG,Southern Tagalog,30.94


**Write a query that calculates revenue (sales_amount) by day, rating, and city. Sort the data by revenue in descending order, and limit the data to the top 10 results.**

In [5]:
%%time
%%sql
SELECT dd.day, dm.rating, dc.city, SUM(fs.sales_amount) AS revenue
FROM dimdate dd, dimmovie dm, dimcustomer dc, factsales fs
WHERE fs.date_key = dd.date_key
AND fs.movie_key = dm.movie_key
AND fs.customer_key = dc.customer_key
GROUP BY 1, 2, 3
ORDER BY 4 DESC
LIMIT 10;

 * postgresql://root:***@localhost:5432/pagila_star
10 rows affected.
CPU times: user 2.41 ms, sys: 0 ns, total: 2.41 ms
Wall time: 17.9 ms


day,rating,city,revenue
30,G,San Bernardino,24.97
30,NC-17,Apeldoorn,23.95
21,NC-17,Belm,22.97
28,R,Mwanza,21.97
21,G,Citt del Vaticano,21.97
30,PG-13,Zanzibar,21.97
17,G,Rajkot,19.97
1,R,Qomsheh,19.97
22,R,Yangor,19.97
28,PG-13,Dhaka,19.97


## Slice
Slicing reduces the dimensionality of a cube by 1 (e.g. from 3 dimensions to 2) by fixing one of the dimensions to a single value. In the example above, we have a 3-dimensional cube on day, rating, and city.
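
A minimal sketch of slicing on an in-memory cube, assuming cells keyed by `(day, rating, city)`. The cell values and the `slice_cube` helper are hypothetical. Unlike the SQL query below, which still returns the (now constant) `rating` column, this sketch drops the fixed dimension from the key to make the loss of one dimension visible.

```python
# Hypothetical cube cells keyed by (day, rating, city); values are made-up revenue.
cube = {
    (1, "PG", "Bellevue"): 0.99,
    (30, "PG-13", "Bellevue"): 3.99,
    (30, "PG-13", "Lancaster"): 2.99,
    (30, "R", "Lancaster"): 4.99,
}


def slice_cube(cells, rating):
    """Fix the rating dimension to one value and drop it from the key."""
    return {
        (day, city): revenue
        for (day, cell_rating, city), revenue in cells.items()
        if cell_rating == rating
    }


pg13_slice = slice_cube(cube, "PG-13")
print(pg13_slice)
```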

**EXAMPLE**  
**Write a query that reduces the dimensionality of the above example by limiting the results to only include movies with a rating of "PG-13". Again, sort by revenue in descending order and limit to the first 10 rows.**

In [6]:
%%time
%%sql
SELECT dd.day, dm.rating, dc.city, SUM(fs.sales_amount) AS revenue
FROM dimdate dd, dimmovie dm, dimcustomer dc, factsales fs
WHERE fs.date_key = dd.date_key
AND fs.movie_key = dm.movie_key
AND fs.customer_key = dc.customer_key
AND dm.rating = 'PG-13'
GROUP BY 1, 2, 3
ORDER BY 4 DESC
LIMIT 10;

 * postgresql://root:***@localhost:5432/pagila_star
10 rows affected.
CPU times: user 2.99 ms, sys: 0 ns, total: 2.99 ms
Wall time: 8.45 ms


day,rating,city,revenue
30,PG-13,Zanzibar,21.97
28,PG-13,Dhaka,19.97
30,PG-13,Osmaniye,18.97
29,PG-13,Shimoga,18.97
21,PG-13,Asuncin,18.95
30,PG-13,Nagareyama,17.98
21,PG-13,Parbhani,17.98
20,PG-13,Baha Blanca,17.98
30,PG-13,Tanauan,17.96
17,PG-13,Ikerre,17.95


## Dice
Dicing is creating a subcube with the same dimensionality but fewer values for two or more dimensions. 
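
A minimal sketch of dicing on an in-memory cube, assuming cells keyed by `(day, rating, city)`. The cell values and the `dice_cube` helper are hypothetical. Each dimension is restricted to a set of allowed values, and the keys keep all three dimensions.

```python
# Hypothetical cube cells keyed by (day, rating, city); values are made-up revenue.
cube = {
    (1, "PG", "Bellevue"): 0.99,
    (15, "PG-13", "Bellevue"): 1.98,
    (30, "PG", "Lancaster"): 12.98,
    (30, "R", "Lancaster"): 4.99,
    (2, "PG", "Dhaka"): 6.99,
}


def dice_cube(cells, days, ratings, cities):
    """Keep only the cells whose coordinates fall inside the allowed sets."""
    return {
        (day, rating, city): revenue
        for (day, rating, city), revenue in cells.items()
        if day in days and rating in ratings and city in cities
    }


subcube = dice_cube(
    cube,
    days={1, 15, 30},
    ratings={"PG", "PG-13"},
    cities={"Bellevue", "Lancaster"},
)
print(subcube)
```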

**EXAMPLE**  
**Write a query to create a subcube of the initial cube that includes only movies with:**
* **a rating of PG or PG-13,**
* **a city of Bellevue or Lancaster,**
* **a day equal to 1, 15, or 30.**

In [7]:
%%time
%%sql
SELECT dd.day, dm.rating, dc.city, SUM(fs.sales_amount) AS revenue
FROM dimdate dd, dimmovie dm, dimcustomer dc, factsales fs
WHERE fs.date_key = dd.date_key
AND fs.movie_key = dm.movie_key
AND fs.customer_key = dc.customer_key
AND dm.rating IN ('PG', 'PG-13')
AND dc.city IN ('Bellevue', 'Lancaster')
AND dd.day IN (1, 15, 30)
GROUP BY 1, 2, 3
ORDER BY 4 DESC
LIMIT 10;

 * postgresql://root:***@localhost:5432/pagila_star
6 rows affected.
CPU times: user 2.25 ms, sys: 0 ns, total: 2.25 ms
Wall time: 3.28 ms


day,rating,city,revenue
30,PG,Lancaster,12.98
1,PG-13,Lancaster,5.99
30,PG-13,Bellevue,3.99
30,PG-13,Lancaster,2.99
15,PG-13,Bellevue,1.98
1,PG,Bellevue,0.99


## Grouping sets ()
- With 3 dimensions, you often want to aggregate a fact at every combination of them:
    - by nothing (total)
    - then by the 1st dimension
    - then by the 2nd 
    - then by the 3rd 
    - then by the 1st and 2nd
    - then by the 2nd and 3rd
    - then by the 1st and 3rd
    - then by the 1st and 2nd and 3rd
    
- Since this is very common, and every one of these aggregations reads the whole fact table anyway, SQL offers a more concise way to request them all in a single query: the `GROUPING SETS` clause of `GROUP BY`.
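
The combinations listed above are simply every subset of the dimensions. The sketch below enumerates those subsets and fills all of them in a single pass over some fact rows. It only illustrates the idea and is not a description of how Postgres executes `GROUPING SETS`; the rows and the helper names `all_grouping_sets` and `aggregate` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows already joined to their dimensions; values are made up.
rows = [
    {"month": 1, "country": "Australia", "sales_amount": 2.0},
    {"month": 1, "country": "Canada", "sales_amount": 3.0},
    {"month": 2, "country": "Canada", "sales_amount": 5.0},
]


def all_grouping_sets(dimensions):
    """Every subset of the dimensions, from () up to the full tuple."""
    return [
        subset
        for size in range(len(dimensions) + 1)
        for subset in combinations(dimensions, size)
    ]


def aggregate(rows, dimensions, grouping_sets):
    """One pass over the rows, updating a running total per grouping set.

    Dimensions that are not part of a grouping set appear as None in the key,
    mirroring the empty cells in the SQL output.
    """
    totals = defaultdict(float)
    for row in rows:
        for dims in grouping_sets:
            key = tuple(row[d] if d in dims else None for d in dimensions)
            totals[key] += row["sales_amount"]
    return dict(totals)


dimensions = ("month", "country")
grouping_sets = all_grouping_sets(dimensions)
result = aggregate(rows, dimensions, grouping_sets)
for key, revenue in result.items():
    print(key, revenue)
```

With 2 dimensions there are 4 grouping sets, and with 3 dimensions there are 8, which matches the list above.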

### Total revenue
**Write a query that calculates total revenue (sales_amount).**

In [8]:
%%sql
SELECT SUM(fs.sales_amount) AS revenue
FROM factsales AS fs;

 * postgresql://root:***@localhost:5432/pagila_star
1 rows affected.


revenue
67416.51


### Revenue by country
**Write a query that calculates total revenue (sales_amount) by country.**

In [9]:
%%sql
SELECT ds.country, SUM(fs.sales_amount) AS revenue
FROM factsales fs, dimstore ds
WHERE fs.store_key = ds.store_key
GROUP BY ds.country
LIMIT 5;

 * postgresql://root:***@localhost:5432/pagila_star
2 rows affected.


country,revenue
Canada,33689.74
Australia,33726.77


### Revenue by month
**Write a query that calculates total revenue (sales_amount) by month.**

In [10]:
%%sql
SELECT dd.month, SUM(fs.sales_amount) AS revenue
FROM factsales fs, dimdate dd
WHERE fs.date_key = dd.date_key
GROUP BY month
ORDER BY month ASC
LIMIT 5;

 * postgresql://root:***@localhost:5432/pagila_star
5 rows affected.


month,revenue
1,4824.43
2,9631.88
3,23886.56
4,28559.46
5,514.18


### Revenue by month and country
**Write a query that calculates total revenue (sales_amount) by month and country. Sort the data by month and country in ascending order, then by revenue in descending order.**

In [11]:
%%sql
SELECT dd.month, ds.country, SUM(fs.sales_amount) AS revenue
FROM factsales fs, dimdate dd, dimstore ds
WHERE fs.date_key = dd.date_key
AND fs.store_key = ds.store_key
GROUP BY dd.month, ds.country
ORDER BY dd.month ASC, ds.country ASC, revenue DESC
LIMIT 5;

 * postgresql://root:***@localhost:5432/pagila_star
5 rows affected.


month,country,revenue
1,Australia,2364.19
1,Canada,2460.24
2,Australia,4895.1
2,Canada,4736.78
3,Australia,12060.33


### Revenue total, by month, by country, by month and country all in one shot
**Write a query that calculates total revenue at the various grouping levels done above (total, by month, by country, by month & country) all at once using the `GROUPING SETS` clause.**

In [12]:
%%sql
SELECT dd.month, ds.country, SUM(fs.sales_amount) AS revenue
FROM factsales fs, dimdate dd, dimstore ds
WHERE fs.date_key = dd.date_key
AND fs.store_key = ds.store_key
GROUP BY GROUPING SETS (
    (),
    (dd.month),
    (ds.country),
    (dd.month, ds.country)
)
ORDER BY dd.month, ds.country;

 * postgresql://root:***@localhost:5432/pagila_star
18 rows affected.


month,country,revenue
1.0,Australia,2364.19
1.0,Canada,2460.24
1.0,,4824.43
2.0,Australia,4895.1
2.0,Canada,4736.78
2.0,,9631.88
3.0,Australia,12060.33
3.0,Canada,11826.23
3.0,,23886.56
4.0,Australia,14136.07


## Cube ()
**Write a query that calculates total revenue at the various grouping levels done above (total, by month, by country, by month & country) all at once using the `CUBE` clause.**

In [13]:
%%time
%%sql
SELECT dd.month, ds.country, SUM(fs.sales_amount) AS revenue
FROM factsales fs, dimdate dd, dimstore ds
WHERE fs.date_key = dd.date_key
AND fs.store_key = ds.store_key
GROUP BY CUBE (dd.month, ds.country)
ORDER BY dd.month, ds.country;

 * postgresql://root:***@localhost:5432/pagila_star
18 rows affected.
CPU times: user 1.78 ms, sys: 186 µs, total: 1.97 ms
Wall time: 10.5 ms


month,country,revenue
1.0,Australia,2364.19
1.0,Canada,2460.24
1.0,,4824.43
2.0,Australia,4895.1
2.0,Canada,4736.78
2.0,,9631.88
3.0,Australia,12060.33
3.0,Canada,11826.23
3.0,,23886.56
4.0,Australia,14136.07


## Naive way
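The naive way to create the same table as above is to write one query per grouping level and combine them with `UNION ALL`. `GROUPING SETS` and `CUBE` give queries that are shorter to write and easier to read, and they usually run faster because the fact table does not have to be scanned once per grouping level (in this notebook, a wall time of 10.5 ms for `CUBE` versus 17.1 ms for the `UNION ALL` version).

**Write a query that calculates total revenue at the various grouping levels done above (total, by month, by country, by month & country) all at once using the naive `UNION ALL` approach.**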

In [14]:
%%time
%%sql
SELECT NULL AS month, NULL AS country, SUM(sales_amount) AS revenue
FROM factSales
UNION ALL
SELECT NULL, dimStore.country, SUM(sales_amount) AS revenue
FROM factSales
JOIN dimStore ON (dimStore.store_key = factSales.store_key)
GROUP BY dimStore.country
UNION ALL
SELECT CAST(dimDate.month AS text), NULL, SUM(sales_amount) AS revenue
FROM factSales
JOIN dimDate ON (dimDate.date_key = factSales.date_key)
GROUP BY dimDate.month
UNION ALL
SELECT CAST(dimDate.month AS text), dimStore.country, SUM(sales_amount) AS revenue
FROM factSales
JOIN dimDate ON (dimDate.date_key = factSales.date_key)
JOIN dimStore ON (dimStore.store_key = factSales.store_key)
GROUP BY dimDate.month, dimStore.country;

 * postgresql://root:***@localhost:5432/pagila_star
18 rows affected.
CPU times: user 2.98 ms, sys: 0 ns, total: 2.98 ms
Wall time: 17.1 ms


month,country,revenue
,,67416.51
,Canada,33689.74
,Australia,33726.77
4.0,,28559.46
3.0,,23886.56
1.0,,4824.43
5.0,,514.18
2.0,,9631.88
1.0,Australia,2364.19
5.0,Canada,243.1
