## Set up the connection

In [1]:
import boto3
import json
import psycopg2

In [3]:
def get_secret(secret_name, region_name="us-east-1"):
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name)
    get_secret_value_response = client.get_secret_value(SecretId=secret_name)
    get_secret_value_response = json.loads(get_secret_value_response['SecretString'])
    return get_secret_value_response

creds = get_secret("wysde")
USERNAME = creds["RDS_POSTGRES_USERNAME"]
PASSWORD = creds["RDS_POSTGRES_PASSWORD"]
HOST = creds["RDS_POSTGRES_HOST"]
DATABASE = 'pagila'

conn_str = 'postgresql://{0}:{1}@{2}/{3}'.format(USERNAME, PASSWORD, HOST, DATABASE)

%config SqlMagic.autopandas=True
%config SqlMagic.displaycon=False
%config SqlMagic.feedback=False
%config SqlMagic.displaylimit=5
%reload_ext sql
%sql {conn_str}
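
If the password returned by Secrets Manager contains characters such as `@`, `/`, or `:`, the plain `format` call above can produce a URL that is parsed incorrectly. A minimal sketch of percent-encoding the credentials first with `urllib.parse.quote_plus` from the standard library; the `build_conn_str` helper and the sample values are hypothetical and not part of the original notebook:

```python
from urllib.parse import quote_plus

def build_conn_str(username, password, host, database):
    """Build a postgresql:// URL with percent-encoded credentials."""
    return "postgresql://{0}:{1}@{2}/{3}".format(
        quote_plus(username), quote_plus(password), host, database
    )

# Hypothetical values for illustration only.
example = build_conn_str("analyst", "p@ss/word", "db.example.com", "pagila")
print(example)
```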

## Slicing and Dicing

Start with a simple cube: Write a query that calculates the revenue (sales_amount) by day, rating, and city. Remember to join with the appropriate dimension tables to replace the keys with the dimension labels. Sort by revenue in descending order and limit to the first 20 rows. The first few rows of your output should match the table below.

| day | rating | city           | revenue |
|-----|--------|----------------|---------|
| 30  | G      | San Bernardino | 24.97   |
| 30  | NC-17  | Apeldoorn      | 23.95   |
| 21  | NC-17  | Belm           | 22.97   |
| 30  | PG-13  | Zanzibar       | 21.97   |
| 28  | R      | Mwanza         | 21.97   |

In [4]:
%%time
%%sql
SELECT dimdate.day AS day,
    dimmovie.rating AS rating,
    dimcustomer.city AS city,
    SUM(factsales.sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
    JOIN dimmovie ON (factsales.movie_key = dimmovie.movie_key)
    JOIN dimstore ON (factsales.store_key = dimstore.store_key)
    JOIN dimcustomer ON (factsales.customer_key = dimcustomer.customer_key)
GROUP BY dimdate.day,
    dimmovie.rating,
    dimcustomer.city
ORDER BY revenue DESC
LIMIT 5

CPU times: user 10.7 ms, sys: 12.8 ms, total: 23.4 ms
Wall time: 1.09 s


Unnamed: 0,day,rating,city,revenue
0,30,G,San Bernardino,24.97
1,21,NC-17,Belm,22.97
2,30,PG-13,Zanzibar,21.97
3,21,G,Città del Vaticano,21.97
4,28,R,Mwanza,21.97


### Slicing

Slicing reduces the dimensionality of a cube by one (for example, from 3 dimensions to 2) by fixing one of the dimensions to a single value. In the example above, we have a 3-dimensional cube on day, rating, and city.
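
To make the idea concrete without touching the RDS instance, here is a minimal sketch that slices a toy cube held in an in-memory SQLite database. The `sales` table and its rows are made up for illustration and are not the Pagila star schema:

```python
import sqlite3

# Toy fact rows (day, rating, city, sales_amount); values are made up.
rows = [
    (30, "PG-13", "Zanzibar", 5.0),
    (30, "PG-13", "Zanzibar", 2.0),
    (30, "G", "Belm", 4.0),
    (21, "PG-13", "Belm", 3.0),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sales (day INTEGER, rating TEXT, city TEXT, sales_amount REAL)"
)
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Slice: fix rating to one value, leaving a 2-dimensional (day, city) cube.
slice_rows = con.execute(
    """
    SELECT day, city, SUM(sales_amount) AS revenue
    FROM sales
    WHERE rating = 'PG-13'
    GROUP BY day, city
    ORDER BY revenue DESC, day, city
    """
).fetchall()
print(slice_rows)
```

Because `rating` is pinned to a single value, it no longer needs to appear in the `GROUP BY`; the result varies only along day and city.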

TODO: Write a query that reduces the dimensionality of the above example by limiting the results to only include movies with a `rating` of "PG-13". Again, sort by revenue in descending order and limit to the first 20 rows. The first few rows of your output should match the table below. 

In [5]:
%%time
%%sql
SELECT dimdate.day AS day,
    dimmovie.rating AS rating,
    dimcustomer.city AS city,
    SUM(factsales.sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
    JOIN dimmovie ON (factsales.movie_key = dimmovie.movie_key)
    JOIN dimstore ON (factsales.store_key = dimstore.store_key)
    JOIN dimcustomer ON (factsales.customer_key = dimcustomer.customer_key)
WHERE dimmovie.rating = 'PG-13'
GROUP BY dimdate.day,
    dimmovie.rating,
    dimcustomer.city
ORDER BY revenue DESC
LIMIT 5

CPU times: user 9.02 ms, sys: 2.64 ms, total: 11.7 ms
Wall time: 653 ms


Unnamed: 0,day,rating,city,revenue
0,30,PG-13,Zanzibar,21.97
1,30,PG-13,Osmaniye,18.97
2,21,PG-13,Asunción,18.95
3,21,PG-13,Parbhani,17.98
4,30,PG-13,Tanauan,17.96


### Dicing

Dicing is creating a subcube with the same dimensionality but fewer values for two or more dimensions. 
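
As a rough illustration, the sketch below dices a toy cube in an in-memory SQLite database: all three dimensions stay in the result, but each is restricted to a few values. The `sales` table and its rows are made up and only mimic the shape of the exercise:

```python
import sqlite3

# Toy fact rows (day, rating, city, sales_amount); values are made up.
rows = [
    (1, "PG", "Lancaster", 2.0),
    (15, "PG-13", "Bellevue", 1.5),
    (30, "PG", "Lancaster", 4.0),
    (30, "PG", "Lancaster", 1.0),
    (30, "R", "Lancaster", 9.0),   # dropped: rating not in the dice
    (30, "PG", "Zanzibar", 8.0),   # dropped: city not in the dice
    (2, "PG", "Bellevue", 7.0),    # dropped: day not in the dice
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sales (day INTEGER, rating TEXT, city TEXT, sales_amount REAL)"
)
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Dice: keep all three dimensions, but restrict each to a subset of values.
dice_rows = con.execute(
    """
    SELECT day, rating, city, SUM(sales_amount) AS revenue
    FROM sales
    WHERE rating IN ('PG', 'PG-13')
        AND city IN ('Bellevue', 'Lancaster')
        AND day IN (1, 15, 30)
    GROUP BY day, rating, city
    ORDER BY revenue DESC
    """
).fetchall()
print(dice_rows)
```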

TODO: Write a query to create a subcube of the initial cube that includes movies with:
* ratings of PG or PG-13
* in the city of Bellevue or Lancaster
* day equal to 1, 15, or 30

The first few rows of your output should match the table below. 

| day | rating | city      | revenue |
|-----|--------|-----------|---------|
| 30  | PG     | Lancaster | 12.98   |
| 1   | PG-13  | Lancaster | 5.99    |
| 30  | PG-13  | Bellevue  | 3.99    |
| 30  | PG-13  | Lancaster | 2.99    |
| 15  | PG-13  | Bellevue  | 1.98    |

In [6]:
%%time
%%sql
SELECT dimdate.day AS day,
    dimmovie.rating AS rating,
    dimcustomer.city AS city,
    SUM(factsales.sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
    JOIN dimmovie ON (factsales.movie_key = dimmovie.movie_key)
    JOIN dimcustomer ON (factsales.customer_key = dimcustomer.customer_key)
WHERE dimmovie.rating IN ('PG', 'PG-13')
    AND dimcustomer.city IN ('Bellevue', 'Lancaster')
    AND dimdate.day IN (1, 15, 30)
GROUP BY dimdate.day,
    dimmovie.rating,
    dimcustomer.city
ORDER BY revenue DESC
LIMIT 5

CPU times: user 8.45 ms, sys: 2.37 ms, total: 10.8 ms
Wall time: 719 ms


Unnamed: 0,day,rating,city,revenue
0,30,PG,Lancaster,12.98
1,1,PG-13,Lancaster,5.99
2,30,PG-13,Bellevue,3.99
3,30,PG-13,Lancaster,2.99
4,15,PG-13,Bellevue,1.98


## Roll Up and Drill Down

### Roll-up

- Stepping up the level of aggregation to a larger grouping
- e.g. `city` is summed up as `country`
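
A minimal sketch of a roll-up in plain Python, assuming a hypothetical city-to-country mapping and made-up revenue figures (in the notebook, both levels come from `dimcustomer`):

```python
from collections import defaultdict

# Hypothetical city -> country hierarchy and toy (city, revenue) rows.
city_to_country = {"Shanghai": "China", "Beijing": "China", "Mumbai": "India"}
city_revenue = [("Shanghai", 10.0), ("Beijing", 5.0), ("Mumbai", 7.0)]

# Roll up: replace each city with its parent country and sum.
country_revenue = defaultdict(float)
for city, revenue in city_revenue:
    country_revenue[city_to_country[city]] += revenue

print(dict(country_revenue))
```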

TODO: Write a query that calculates revenue (sales_amount) by day, rating, and country. Sort the data by revenue in descending order, and limit the data to the top 20 results. The first few rows of your output should match the table below.

| day | rating | country | revenue |
|-----|--------|---------|---------|
| 30  | G      | China   | 169.67  |
| 30  | PG     | India   | 156.67  |
| 30  | NC-17  | India   | 153.64  |
| 30  | PG-13  | China   | 146.67  |
| 30  | R      | China   | 145.66  |

In [7]:
%%time
%%sql
SELECT dimdate.day AS day,
    dimmovie.rating AS rating,
    dimcustomer.country AS country,
    SUM(factsales.sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
    JOIN dimmovie ON (factsales.movie_key = dimmovie.movie_key)
    JOIN dimstore ON (factsales.store_key = dimstore.store_key)
    JOIN dimcustomer ON (factsales.customer_key = dimcustomer.customer_key)
GROUP BY dimdate.day,
    dimmovie.rating,
    dimcustomer.country
ORDER BY revenue DESC
LIMIT 5

CPU times: user 8.11 ms, sys: 2.44 ms, total: 10.6 ms
Wall time: 912 ms


Unnamed: 0,day,rating,country,revenue
0,30,G,China,153.71
1,30,NC-17,India,136.69
2,30,R,China,134.68
3,30,PG,India,132.75
4,30,R,India,132.71


### Drill-down

- Breaking up one of the dimensions into a lower level
- e.g. `city` is broken up into `districts`
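
The sketch below illustrates drill-down on a different, hypothetical hierarchy (month to day) with made-up rows: the same facts are aggregated at a coarse level and at a finer level, and each coarse cell equals the sum of the finer cells beneath it.

```python
from collections import defaultdict

# Toy (month, day, sales_amount) rows; values are made up.
sales = [(3, 1, 4.0), (3, 1, 1.0), (3, 15, 2.0), (4, 30, 6.0)]

by_month = defaultdict(float)      # coarse level
by_month_day = defaultdict(float)  # finer level after drilling down
for month, day, amount in sales:
    by_month[month] += amount
    by_month_day[(month, day)] += amount

# Each coarse cell is the sum of the finer cells beneath it.
for month, total in by_month.items():
    detail = sum(v for (m, _), v in by_month_day.items() if m == month)
    assert detail == total

print(dict(by_month))
print(dict(by_month_day))
```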

TODO: Write a query that calculates revenue (sales_amount) by day, rating, and district. Sort the data by revenue in descending order, and limit the data to the top 20 results. The first few rows of your output should match the table below.

| day | rating | district         | revenue |
|-----|--------|------------------|---------|
| 30  | PG-13  | Southern Tagalog | 53.88   |
| 30  | G      | Inner Mongolia   | 38.93   |
| 30  | G      | Shandong         | 36.93   |
| 30  | NC-17  | West Bengali     | 36.92   |
| 17  | PG-13  | Shandong         | 34.95   |

In [8]:
%%time
%%sql
SELECT dimdate.day AS day,
    dimmovie.rating AS rating,
    dimcustomer.district AS district,
    SUM(factsales.sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
    JOIN dimmovie ON (factsales.movie_key = dimmovie.movie_key)
    JOIN dimstore ON (factsales.store_key = dimstore.store_key)
    JOIN dimcustomer ON (factsales.customer_key = dimcustomer.customer_key)
GROUP BY dimdate.day,
    dimmovie.rating,
    dimcustomer.district
ORDER BY revenue DESC
LIMIT 5

CPU times: user 8.26 ms, sys: 2.05 ms, total: 10.3 ms
Wall time: 587 ms


Unnamed: 0,day,rating,district,revenue
0,30,PG-13,Southern Tagalog,48.89
1,30,G,Inner Mongolia,38.93
2,30,G,Shandong,36.93
3,17,PG-13,Shandong,34.95
4,30,NC-17,West Bengali,33.93


## Grouping Sets

- It often happens that, for 3 dimensions, you want to aggregate a fact:
    - by nothing (total)
    - then by the 1st dimension
    - then by the 2nd 
    - then by the 3rd 
    - then by the 1st and 2nd
    - then by the 2nd and 3rd
    - then by the 1st and 3rd
    - then by the 1st and 2nd and 3rd
    
- Since this is so common, and every one of these aggregations scans the whole fact table anyway, SQL provides a more efficient way to compute them together: the `GROUPING SETS` clause of `GROUP BY`.
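
The eight groupings listed above are simply every subset of the three dimensions. A minimal sketch in plain Python that enumerates them and fills all of them in a single pass over some toy facts; the dimension names and rows are assumptions for illustration, and this is not how PostgreSQL implements the clause internally:

```python
from collections import defaultdict
from itertools import combinations

dims = ("month", "country", "rating")

# All 2**3 = 8 grouping sets, from () (grand total) up to all three dimensions.
grouping_sets = [c for r in range(len(dims) + 1) for c in combinations(dims, r)]

# Toy fact rows; values are made up.
facts = [
    {"month": 1, "country": "Canada", "rating": "G", "sales_amount": 2.0},
    {"month": 1, "country": "Australia", "rating": "G", "sales_amount": 3.0},
    {"month": 2, "country": "Canada", "rating": "R", "sales_amount": 4.0},
]

# One pass over the facts updates every grouping set at once.
totals = defaultdict(float)
for row in facts:
    for gs in grouping_sets:
        key = (gs, tuple(row[d] for d in gs))
        totals[key] += row["sales_amount"]

print(len(grouping_sets))
print(totals[((), ())])                       # grand total
print(totals[(("country",), ("Canada",))])    # total for one country
```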

### Total Revenue

TODO: Write a query that calculates total revenue (sales_amount)

In [9]:
%%sql
SELECT SUM(sales_amount) AS revenue
FROM factsales

Unnamed: 0,revenue
0,61312.04


### Revenue by Country

TODO: Write a query that calculates total revenue (sales_amount) by country

In [10]:
%%sql
SELECT dimstore.country AS country,
    SUM(sales_amount) AS revenue
FROM factsales
    JOIN dimstore ON (factsales.store_key = dimstore.store_key)
GROUP BY dimstore.country
ORDER BY revenue DESC
LIMIT 5

Unnamed: 0,country,revenue
0,Australia,30683.13
1,Canada,30628.91


### Revenue by Month

TODO: Write a query that calculates total revenue (sales_amount) by month

In [11]:
%%sql
SELECT dimdate.month AS month,
    SUM(sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
GROUP BY dimdate.month
LIMIT 5

Unnamed: 0,month,revenue
0,3,23886.56
1,5,514.18
2,4,28559.46
3,2,8351.84


### Revenue by Month & Country

TODO: Write a query that calculates total revenue (sales_amount) by month and country. Sort the data by month, country, and revenue in descending order. The first few rows of your output should match the table below.

| month | country   | revenue  |
|-------|-----------|----------|
| 1     | Australia | 2364.19  |
| 1     | Canada    | 2460.24  |
| 2     | Australia | 4895.10  |
| 2     | Canada    | 4736.78  |
| 3     | Australia | 12060.33 |

In [12]:
%%sql
SELECT dimdate.month AS month,
    dimstore.country AS country,
    SUM(sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
    JOIN dimstore ON (factsales.store_key = dimstore.store_key)
GROUP BY dimdate.month,
    dimstore.country
ORDER BY month,
    country,
    revenue DESC
LIMIT 5

Unnamed: 0,month,country,revenue
0,2,Australia,4215.65
1,2,Canada,4136.19
2,3,Australia,12060.33
3,3,Canada,11826.23
4,4,Australia,14136.07


### Revenue Total, by Month, by Country, by Month & Country All in one shot

TODO: Write a query that calculates total revenue at the various grouping levels done above (total, by month, by country, by month & country) all at once using the `GROUPING SETS` clause. Your output should match the table below.

| month | country   | revenue  |
|-------|-----------|----------|
| 1     | Australia | 2364.19  |
| 1     | Canada    | 2460.24  |
| 1     | None      | 4824.43  |
| 2     | Australia | 4895.10  |
| 2     | Canada    | 4736.78  |
| 2     | None      | 9631.88  |
| 3     | Australia | 12060.33 |
| 3     | Canada    | 11826.23 |
| 3     | None      | 23886.56 |
| 4     | Australia | 14136.07 |
| 4     | Canada    | 14423.39 |
| 4     | None      | 28559.46 |
| 5     | Australia | 271.08   |
| 5     | Canada    | 243.10   |
| 5     | None      | 514.18   |
| None  | None      | 67416.51 |
| None  | Australia | 33726.77 |
| None  | Canada    | 33689.74 |

In [13]:
%%sql
SELECT dimdate.month AS month,
    dimstore.country AS country,
    SUM(sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
    JOIN dimstore ON (factsales.store_key = dimstore.store_key)
GROUP BY GROUPING SETS (
        (),
        dimdate.month,
        dimstore.country,
        (dimdate.month, dimstore.country)
    )
ORDER BY month,
    country,
    revenue DESC
LIMIT 20

Unnamed: 0,month,country,revenue
0,2.0,Australia,4215.65
1,2.0,Canada,4136.19
2,2.0,,8351.84
3,3.0,Australia,12060.33
4,3.0,Canada,11826.23
5,3.0,,23886.56
6,4.0,Australia,14136.07
7,4.0,Canada,14423.39
8,4.0,,28559.46
9,5.0,Australia,271.08


## CUBE

- `GROUP BY CUBE (dim1, dim2, ...)` produces all combinations of the dimensions, of every length, in one go.
- The result could be stored in a materialized view and queried, which would save many repetitive aggregations.
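
Put differently, `CUBE` is shorthand for the grouping sets made of every subset of the listed dimensions. A small sketch that expands a cube in plain Python and checks it against the four sets written out by hand in the grouping sets exercise; the `cube` helper is hypothetical and exists only for illustration:

```python
from itertools import combinations

def cube(*dims):
    """Return every subset of dims, i.e. the grouping sets CUBE expands to."""
    return [set(c) for r in range(len(dims) + 1) for c in combinations(dims, r)]

expanded = cube("month", "country")
explicit = [set(), {"month"}, {"country"}, {"month", "country"}]

# CUBE(month, country) covers exactly the four hand-written grouping sets.
assert all(gs in expanded for gs in explicit) and len(expanded) == len(explicit)
print(expanded)
```

With n dimensions the expansion has 2**n grouping sets, so a cube over many dimensions grows quickly.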

TODO: Write a query that calculates the various levels of aggregation done in the grouping sets exercise (total, by month, by country, by month & country) using the `CUBE` clause. Your output should match the table below.

| month | country   | revenue  |
|-------|-----------|----------|
| 1     | Australia | 2364.19  |
| 1     | Canada    | 2460.24  |
| 1     | None      | 4824.43  |
| 2     | Australia | 4895.10  |
| 2     | Canada    | 4736.78  |
| 2     | None      | 9631.88  |
| 3     | Australia | 12060.33 |
| 3     | Canada    | 11826.23 |
| 3     | None      | 23886.56 |
| 4     | Australia | 14136.07 |
| 4     | Canada    | 14423.39 |
| 4     | None      | 28559.46 |
| 5     | Australia | 271.08   |
| 5     | Canada    | 243.10   |
| 5     | None      | 514.18   |
| None  | None      | 67416.51 |
| None  | Australia | 33726.77 |
| None  | Canada    | 33689.74 |

In [14]:
%%time
%%sql
SELECT dimdate.month AS month,
    dimstore.country AS country,
    SUM(sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (factsales.date_key = dimdate.date_key)
    JOIN dimstore ON (factsales.store_key = dimstore.store_key)
GROUP BY CUBE(dimdate.month, dimstore.country)
ORDER BY month,
    country,
    revenue DESC
LIMIT 20

CPU times: user 8.31 ms, sys: 2.85 ms, total: 11.2 ms
Wall time: 692 ms


Unnamed: 0,month,country,revenue
0,2.0,Australia,4215.65
1,2.0,Canada,4136.19
2,2.0,,8351.84
3,3.0,Australia,12060.33
4,3.0,Canada,11826.23
5,3.0,,23886.56
6,4.0,Australia,14136.07
7,4.0,Canada,14423.39
8,4.0,,28559.46
9,5.0,Australia,271.08


### Revenue Total, by Month, by Country, by Month & Country All in one shot, NAIVE way

The naive way to create the same table as above is to write several queries and UNION them together. Grouping sets and cubes produce queries that are shorter to write, easier to read, and more performant. Run the naive query below and compare the time it takes to run to the time it takes the cube query to run.
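
A single `%%time` reading against a remote database is noisy, since network latency can dominate the wall time. One hedged way to compare the two queries more fairly is to run each several times and take the median. The `median_wall_time` helper below is a hypothetical sketch; the stand-in workload would be replaced by a function that actually executes the query:

```python
import statistics
import time

def median_wall_time(run_query, repeats=5):
    """Return the median wall-clock seconds over several calls to run_query."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in workload; in the notebook this would execute the SQL query instead.
elapsed = median_wall_time(lambda: sum(range(10_000)))
print(f"{elapsed:.6f} s")
```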

In [16]:
%%time
%%sql
SELECT NULL AS month,
    NULL AS country,
    SUM(sales_amount) AS revenue
FROM factsales
UNION ALL
SELECT NULL,
    dimstore.country,
    SUM(sales_amount) AS revenue
FROM factsales
    JOIN dimstore ON (dimstore.store_key = factsales.store_key)
GROUP BY dimstore.country
UNION ALL
SELECT CAST(dimdate.month AS text),
    NULL,
    SUM(sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (dimdate.date_key = factsales.date_key)
GROUP BY dimdate.month
UNION ALL
SELECT CAST(dimdate.month AS text),
    dimstore.country,
    SUM(sales_amount) AS revenue
FROM factsales
    JOIN dimdate ON (dimdate.date_key = factsales.date_key)
    JOIN dimstore ON (dimstore.store_key = factsales.store_key)
GROUP BY (dimdate.month, dimstore.country)
ORDER BY month,
    country,
    revenue DESC
LIMIT 20

CPU times: user 9.23 ms, sys: 2.03 ms, total: 11.3 ms
Wall time: 966 ms


Unnamed: 0,month,country,revenue
0,2.0,Australia,4215.65
1,2.0,Canada,4136.19
2,2.0,,8351.84
3,3.0,Australia,12060.33
4,3.0,Canada,11826.23
5,3.0,,23886.56
6,4.0,Australia,14136.07
7,4.0,Canada,14423.39
8,4.0,,28559.46
9,5.0,Australia,271.08
