# Exercise 02 -  OLAP Cubes - Solution

All the database tables in this demo are based on public sample databases and transformations of them:
- `Sakila` is a sample database created by `MySQL` [Link](https://dev.mysql.com/doc/sakila/en/sakila-structure.html)
- The PostgreSQL version of it is called `Pagila` [Link](https://github.com/devrimgunduz/pagila)
- The fact and dimension table design is based on O'Reilly's public dimensional modelling tutorial schema [Link](http://archive.oreilly.com/oreillyschool/courses/dba3/index.html)

In [1]:
!PGPASSWORD= createdb -h 127.0.0.1 -U huyenvtk1 pagila_star
!PGPASSWORD= psql -q -h 127.0.0.1 -U huyenvtk1 -d pagila_star -f Data/pagila-star.sql

 set_config 
------------
 
(1 row)

 setval 
--------
    200
(1 row)

 setval 
--------
    605
(1 row)

 setval 
--------
     16
(1 row)

 setval 
--------
    600
(1 row)

 setval 
--------
    109
(1 row)

 setval 
--------
    599
(1 row)

 setval 
--------
      1
(1 row)

 setval 
--------
      1
(1 row)

 setval 
--------
      1
(1 row)

 setval 
--------
      1
(1 row)

 setval 
--------
  64196
(1 row)

 setval 
--------
   1000
(1 row)

 setval 
--------
   4581
(1 row)

 setval 
--------
      6
(1 row)

 setval 
--------
  32098
(1 row)

 setval 
--------
  16049
(1 row)

 setval 
--------
      2
(1 row)

 setval 
--------
      2
(1 row)



In [2]:
%load_ext sql
import sql

# STEP1 : Connect to the local database where Pagila is loaded

In [3]:
DB_ENDPOINT = "127.0.0.1"
DB = 'pagila'
DB_USER = 'huyenvtk1'
DB_PASSWORD = ''
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)


postgresql://huyenvtk1:@127.0.0.1:5432/pagila
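The connection string above works because the password is empty. If a password contained characters such as `@`, `:` or `/`, they could be misread as URL delimiters. The following is a minimal sketch of one way to guard against that, assuming made-up credentials (`demo_user` and the password shown are hypothetical) and the standard-library `urllib.parse.quote_plus`:

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only
user = "demo_user"
password = "p@ss:word/1"

# Percent-encode the password so '@', ':' and '/' are not read as URL delimiters
safe_conn_string = "postgresql://{}:{}@{}:{}/{}".format(
    user, quote_plus(password), "127.0.0.1", "5432", "pagila"
)

print(safe_conn_string)
```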


In [4]:
%sql $conn_string

'Connected: huyenvtk1@pagila'

# STEP2 :  Star Schema

<img src="pagila-star.png" width="50%"/>

# Start with a simple cube

In [5]:
%%time
%%sql
SELECT dimDate.day,dimMovie.rating, dimCustomer.city, sum(sales_amount) as revenue
FROM factSales 
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate      on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer  on (dimCustomer.customer_key = factSales.customer_key)
group by (dimDate.day, dimMovie.rating, dimCustomer.city)
order by revenue desc
limit  20;

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
20 rows affected.
CPU times: user 2.54 ms, sys: 1.12 ms, total: 3.65 ms
Wall time: 74.1 ms


day,rating,city,revenue
30,G,San Bernardino,99.88
30,NC-17,Apeldoorn,95.8
21,NC-17,Belm,91.88
28,R,Mwanza,87.88
30,PG-13,Zanzibar,87.88
21,G,Citt del Vaticano,87.88
22,R,Yangor,79.88
28,PG-13,Dhaka,79.88
1,R,Qomsheh,79.88
17,G,Rajkot,79.88


## Slicing

- Slicing reduces the dimensionality of a cube by 1 (e.g. from 3 dimensions to 2) by fixing one of the dimensions to a single value
- In the following example we have a 3-dimensional cube on day, rating, and city
- In the example below `rating` is fixed to "PG-13", which reduces the dimensionality to 2 (day and city)
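Slicing can also be tried without the Pagila database. The sketch below is only an illustration, assuming Python's built-in `sqlite3` module and a made-up `sales` table (the table name, columns, and values are hypothetical and not part of Pagila). It fixes `rating` to one value and aggregates over the two remaining dimensions:

```python
import sqlite3

# Hypothetical toy fact table, NOT the Pagila data: (day, rating, city, amount)
rows = [
    (1, "PG-13", "Dhaka", 4.0),
    (1, "PG-13", "Dhaka", 2.0),
    (1, "G", "Dhaka", 5.0),
    (2, "PG-13", "Belm", 3.0),
    (2, "R", "Belm", 7.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, rating TEXT, city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Slice: fix rating to a single value; day and city remain as dimensions
slice_rows = conn.execute(
    """
    SELECT day, city, SUM(amount) AS revenue
    FROM sales
    WHERE rating = 'PG-13'
    GROUP BY day, city
    ORDER BY revenue DESC
    """
).fetchall()

print(slice_rows)
```

Because `rating` has a single value in every remaining row, it carries no information any more, and the result is effectively a 2-dimensional table.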

In [6]:
%%time
%%sql
SELECT dimDate.day,dimMovie.rating, dimCustomer.city, sum(sales_amount) as revenue
FROM factSales
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
WHERE dimMovie.rating = 'PG-13'
GROUP by (dimDate.day, dimCustomer.city, dimMovie.rating)
ORDER by revenue desc
LIMIT  20;

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
20 rows affected.
CPU times: user 1.9 ms, sys: 895 µs, total: 2.79 ms
Wall time: 61.8 ms


day,rating,city,revenue
30,PG-13,Zanzibar,87.88
28,PG-13,Dhaka,79.88
29,PG-13,Shimoga,75.88
30,PG-13,Osmaniye,75.88
21,PG-13,Asuncin,75.8
20,PG-13,Baha Blanca,71.92
21,PG-13,Parbhani,71.92
30,PG-13,Nagareyama,71.92
30,PG-13,Tanauan,71.84
17,PG-13,Ikerre,71.8


## Dicing
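 - Dicing creates a subcube with the same dimensionality but fewer values for 2 or more dimensions
 - e.g. `rating` restricted to PG-13 and PG, `city` to Bellevue and Lancaster, and `day` to 1, 15, and 30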
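As a rough illustration of dicing, the sketch below restricts two dimensions to subsets of their values while keeping all three dimensions in the result. It assumes Python's built-in `sqlite3` module and a hypothetical `sales` table with invented rows, not the Pagila data:

```python
import sqlite3

# Hypothetical toy fact table, NOT the Pagila data: (day, rating, city, amount)
rows = [
    (1, "PG-13", "Bellevue", 4.0),
    (1, "PG", "Lancaster", 2.0),
    (15, "G", "Bellevue", 5.0),
    (15, "PG", "Dhaka", 3.0),
    (30, "PG-13", "Lancaster", 7.0),
    (30, "PG-13", "Lancaster", 1.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, rating TEXT, city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Dice: restrict rating and city to subsets; all three dimensions stay in the result
dice_rows = conn.execute(
    """
    SELECT day, rating, city, SUM(amount) AS revenue
    FROM sales
    WHERE rating IN ('PG-13', 'PG')
      AND city IN ('Bellevue', 'Lancaster')
    GROUP BY day, rating, city
    ORDER BY revenue DESC
    """
).fetchall()

print(dice_rows)
```

Unlike a slice, the filtered dimensions still take more than one value, so the number of dimensions is unchanged and only the cube gets smaller.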

In [7]:
%%time
%%sql
SELECT dimDate.day,dimMovie.rating, dimCustomer.city, sum(sales_amount) as revenue
FROM factSales
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
WHERE dimMovie.rating in ('PG-13', 'PG')
AND dimCustomer.city in ('Bellevue', 'Lancaster')
AND dimDate.day in ('1', '15', '30')
GROUP by (dimDate.day, dimCustomer.city, dimMovie.rating)
ORDER by revenue desc
LIMIT  20;

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
6 rows affected.
CPU times: user 2.84 ms, sys: 1.14 ms, total: 3.98 ms
Wall time: 12 ms


day,rating,city,revenue
30,PG,Lancaster,51.92
1,PG-13,Lancaster,23.96
30,PG-13,Bellevue,15.96
30,PG-13,Lancaster,11.96
15,PG-13,Bellevue,7.92
1,PG,Bellevue,3.96


## Roll-up
- Stepping up the level of aggregation to a larger grouping
- e.g. `city` is summed up to `country`
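Conceptually, a roll-up replaces each value by its parent in the hierarchy and aggregates again. A minimal plain-Python sketch of that idea, using an invented city-to-country mapping and invented revenue figures (none of it taken from Pagila):

```python
from collections import defaultdict

# Hypothetical toy data, NOT the Pagila data
city_to_country = {"Dhaka": "Bangladesh", "Rajkot": "India", "Shimoga": "India"}
city_revenue = [("Dhaka", 10.0), ("Rajkot", 4.0), ("Shimoga", 6.0)]

# Roll-up: replace each city by its parent country, then re-aggregate
country_revenue = defaultdict(float)
for city, revenue in city_revenue:
    country_revenue[city_to_country[city]] += revenue

print(dict(country_revenue))
```

In the star schema the mapping does not need a separate lookup, because `dimCustomer` already stores `city` and `country` side by side; grouping by the coarser column is enough.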

In [8]:
%%time
%%sql
SELECT dimDate.day,dimMovie.rating, dimCustomer.country, sum(sales_amount) as revenue
FROM factSales
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
GROUP by (dimDate.day,  dimMovie.rating, dimCustomer.country)
ORDER by revenue desc
LIMIT  20;

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
20 rows affected.
CPU times: user 2.66 ms, sys: 1.41 ms, total: 4.07 ms
Wall time: 76.5 ms


day,rating,country,revenue
30,G,China,678.68
30,PG,India,626.68
30,NC-17,India,614.56
30,PG-13,China,586.68
30,R,China,582.64
30,R,India,574.72
30,G,India,550.68
18,NC-17,India,543.0
30,PG,China,526.88
21,PG-13,India,514.96


## Drill-down
- Breaking up one of the dimensions to a lower level.
- e.g. `city` is broken up into `district`
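Drill-down is the inverse of roll-up: a finer level is added to the grouping key, so each coarse row splits into several detailed rows. The sketch below illustrates this with Python's built-in `sqlite3` module and a hypothetical `sales` table. The two-level hierarchy used here (`country` above `city`) and all values are assumptions chosen for the toy data:

```python
import sqlite3

# Hypothetical toy fact table, NOT the Pagila data: (country, city, amount)
rows = [
    ("India", "Rajkot", 4.0),
    ("India", "Shimoga", 6.0),
    ("India", "Rajkot", 1.0),
    ("Bangladesh", "Dhaka", 10.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Coarse level: one row per country
by_country = conn.execute(
    "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
).fetchall()

# Drill-down: add the finer level to the grouping key
by_city = conn.execute(
    "SELECT country, city, SUM(amount) FROM sales "
    "GROUP BY country, city ORDER BY country, city"
).fetchall()

print(by_country)
print(by_city)
```

The detailed rows for a country always add up to that country's coarse total, which is a handy sanity check when moving between levels.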

In [9]:
%%time
%%sql
SELECT dimDate.day,dimMovie.rating, dimCustomer.district, sum(sales_amount) as revenue
FROM factSales
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
GROUP by (dimDate.day, dimCustomer.district, dimMovie.rating)
ORDER by revenue desc
LIMIT  20;

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
20 rows affected.
CPU times: user 2.32 ms, sys: 1.26 ms, total: 3.58 ms
Wall time: 66.8 ms


day,rating,district,revenue
30,PG-13,Southern Tagalog,215.52
30,G,Inner Mongolia,155.72
30,G,Shandong,147.72
30,NC-17,West Bengali,147.68
17,PG-13,Shandong,139.8
1,PG,California,131.76
18,NC-17,So Paulo,131.72
21,R,So Paulo,127.72
30,NC-17,Buenos Aires,127.72
30,PG,Southern Tagalog,123.76


# Grouping Sets
- It happens a lot that, for 3 dimensions, you want to aggregate a fact:
    - by nothing (total)
    - then by the 1st dimension
    - then by the 2nd 
    - then by the 3rd 
    - then by the 1st and 2nd
    - then by the 2nd and 3rd
    - then by the 1st and 3rd
    - then by the 1st and 2nd and 3rd
    
- Since this is very common, and in all cases we are iterating through the whole fact table anyhow, there is a more clever way to do that using the SQL grouping clause `GROUPING SETS`
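The eight aggregations listed above are exactly the subsets of the three dimensions. A short sketch that enumerates them with `itertools.combinations` (the dimension names are placeholders chosen for illustration):

```python
from itertools import combinations

dims = ["day", "rating", "city"]  # placeholder dimension names

# Every subset of the dimensions, from the empty set (grand total) to all three
grouping_sets = [
    combo for size in range(len(dims) + 1) for combo in combinations(dims, size)
]

for grouping_set in grouping_sets:
    print(grouping_set)
```

With `n` dimensions there are `2**n` such subsets, so the number of separate queries grows quickly if each grouping is written by hand.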

## total revenue

In [10]:
%%sql
SELECT sum(sales_amount) as revenue
FROM factSales

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
1 rows affected.


revenue
269666.04


## revenue by country

In [11]:
%%sql
SELECT dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by  dimStore.country
order by dimStore.country, revenue desc;

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
2 rows affected.


country,revenue
Australia,134907.08
Canada,134758.96


## revenue by month

In [12]:
%%sql
SELECT dimDate.month,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
GROUP by dimDate.month
order by dimDate.month, revenue desc;

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
5 rows affected.


month,revenue
1,19297.72
2,38527.52
3,95546.24
4,114237.84
5,2056.72


## revenue by month & country

In [13]:
%%sql
SELECT dimDate.month,dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by (dimDate.month, dimStore.country)
order by dimDate.month, dimStore.country, revenue desc;

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
10 rows affected.


month,country,revenue
1,Australia,9456.76
1,Canada,9840.96
2,Australia,19580.4
2,Canada,18947.12
3,Australia,48241.32
3,Canada,47304.92
4,Australia,56544.28
4,Canada,57693.56
5,Australia,1084.32
5,Canada,972.4


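## revenue total, by month, by country, by month & country, all in one shot
- Watch for the `None` values in the result: a `None` in a column means the row is a subtotal aggregated over all values of that column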
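To see where the `None` values come from, the following plain-Python sketch mimics what a grouping-sets aggregation does: for each grouping set, every dimension that is not grouped on is replaced by `None` before summing. The rows and the helper `aggregate` are hypothetical and only meant as an illustration, not as a description of how PostgreSQL implements it:

```python
from collections import defaultdict

# Hypothetical toy rows, NOT the Pagila data: (month, country, amount)
sales = [
    (1, "Australia", 10.0),
    (1, "Canada", 5.0),
    (2, "Australia", 2.0),
]

def aggregate(rows, keep_month, keep_country):
    """Sum amounts, replacing each dimension that is not grouped on with None."""
    totals = defaultdict(float)
    for month, country, amount in rows:
        key = (month if keep_month else None, country if keep_country else None)
        totals[key] += amount
    return dict(totals)

# The four grouping sets: (), (month), (country), (month, country)
result = {}
for keep_month, keep_country in [(False, False), (True, False), (False, True), (True, True)]:
    result.update(aggregate(sales, keep_month, keep_country))

for key, revenue in result.items():
    print(key, revenue)
```

If a dimension column can itself contain real NULLs, a subtotal row and a genuine NULL group look the same; PostgreSQL's `GROUPING()` function can be used to tell them apart.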

In [14]:
%%time
%%sql
SELECT dimDate.month,dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate  on (dimDate.date_key  = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by grouping sets ((), dimDate.month,  dimStore.country, (dimDate.month,  dimStore.country));


 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
18 rows affected.
CPU times: user 2.15 ms, sys: 1.21 ms, total: 3.36 ms
Wall time: 64.1 ms


month,country,revenue
1.0,Australia,9456.76
1.0,Canada,9840.96
1.0,,19297.72
2.0,Australia,19580.4
2.0,Canada,18947.12
2.0,,38527.52
3.0,Australia,48241.32
3.0,Canada,47304.92
3.0,,95546.24
4.0,Australia,56544.28


# CUBE 
- `GROUP BY CUBE (dim1, dim2, ...)` produces all combinations of different lengths in one go.
- This result could be materialized in a view and queried, which would save lots of repetitive aggregations
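The idea of computing the cube once and querying it afterwards can be sketched as follows. This is a toy stand-in, assuming Python's built-in `sqlite3` module: SQLite offers neither `CUBE` nor materialized views, so the four groupings are spelled out with `UNION ALL` and stored in an ordinary table. The tables `sales` and `revenue_cube` and their rows are hypothetical; in PostgreSQL the analogous step would be `CREATE MATERIALIZED VIEW` over the `CUBE` query.

```python
import sqlite3

# Hypothetical toy fact table, NOT the Pagila data: (month, country, amount)
rows = [(1, "Australia", 10.0), (1, "Canada", 5.0), (2, "Australia", 2.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, country TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Compute all four groupings once and store them (stand-in for a materialized view)
conn.execute(
    """
    CREATE TABLE revenue_cube AS
    SELECT NULL AS month, NULL AS country, SUM(amount) AS revenue FROM sales
    UNION ALL
    SELECT month, NULL, SUM(amount) FROM sales GROUP BY month
    UNION ALL
    SELECT NULL, country, SUM(amount) FROM sales GROUP BY country
    UNION ALL
    SELECT month, country, SUM(amount) FROM sales GROUP BY month, country
    """
)

# Later questions become lookups in the small precomputed table
canada_total = conn.execute(
    "SELECT revenue FROM revenue_cube WHERE month IS NULL AND country = 'Canada'"
).fetchone()[0]
grand_total = conn.execute(
    "SELECT revenue FROM revenue_cube WHERE month IS NULL AND country IS NULL"
).fetchone()[0]
cube_size = conn.execute("SELECT COUNT(*) FROM revenue_cube").fetchone()[0]

print(canada_total, grand_total, cube_size)
```

The precomputed table has to be rebuilt or refreshed when the fact table changes, which is the usual trade-off of materializing aggregates.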

```SQL
SELECT dimDate.month,dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate  on (dimDate.date_key   = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by cube(dimDate.month,  dimStore.country);
```


In [15]:
%%time
%%sql
SELECT dimDate.month,dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by cube(dimDate.month,  dimStore.country);

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
18 rows affected.
CPU times: user 1.99 ms, sys: 1.24 ms, total: 3.22 ms
Wall time: 64.6 ms


month,country,revenue
1.0,Australia,9456.76
1.0,Canada,9840.96
1.0,,19297.72
2.0,Australia,19580.4
2.0,Canada,18947.12
2.0,,38527.52
3.0,Australia,48241.32
3.0,Canada,47304.92
3.0,,95546.24
4.0,Australia,56544.28


In [17]:
%%time
%%sql
SELECT dimDate.month,dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by (dimDate.month,  dimStore.country);

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
10 rows affected.
CPU times: user 1.97 ms, sys: 3.53 ms, total: 5.5 ms
Wall time: 55.8 ms


month,country,revenue
1,Australia,9456.76
1,Canada,9840.96
2,Australia,19580.4
2,Canada,18947.12
3,Australia,48241.32
3,Canada,47304.92
4,Australia,56544.28
4,Canada,57693.56
5,Australia,1084.32
5,Canada,972.4


## revenue total, by month, by country, by month & country, all in one shot, the naive way

In [16]:
%%time
%%sql
SELECT  NULL as month, NULL as country, sum(sales_amount) as revenue
FROM factSales
    UNION all 
SELECT NULL, dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by  dimStore.country
    UNION all 
SELECT cast(dimDate.month as text) , NULL, sum(sales_amount) as revenue
FROM factSales
JOIN dimDate on (dimDate.date_key = factSales.date_key)
GROUP by dimDate.month
    UNION all
SELECT cast(dimDate.month as text),dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by (dimDate.month, dimStore.country)

 * postgresql://huyenvtk1:***@127.0.0.1:5432/pagila
18 rows affected.
CPU times: user 3.05 ms, sys: 1.08 ms, total: 4.13 ms
Wall time: 61.6 ms


month,country,revenue
,,269666.04
,Canada,134758.96
,Australia,134907.08
3.0,,95546.24
5.0,,2056.72
4.0,,114237.84
2.0,,38527.52
1.0,,19297.72
1.0,Australia,9456.76
1.0,Canada,9840.96
