# Exercise 1 - Sakila Star Schema & ETL

All the database tables in this demo are based on public database samples and transformations:
- `Sakila` is a sample database created by `MySQL` [Link](https://dev.mysql.com/doc/sakila/en/sakila-structure.html)
- The PostgreSQL version of it is called `Pagila` [Link](https://github.com/devrimgunduz/pagila)
- The fact and dimension table design is based on O'Reilly's public dimensional modelling tutorial schema [Link](http://archive.oreilly.com/oreillyschool/courses/dba3/index.html)

# STEP0: Using ipython-sql

- Load ipython-sql: `%load_ext sql`

- To execute SQL queries, write one of the following at the top of your cell:
    - `%sql`
        - For a one-liner SQL query
        - You can access a Python variable using `$`
    - `%%sql`
        - For a multi-line SQL query
        - You can **NOT** access a Python variable using `$`


- Passing a connection string such as `postgresql://postgres:postgres@db:5432/pagila` to `%sql` connects to the database
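
To make the anatomy of that connection string concrete, the following sketch splits it into its parts with Python's standard `urllib.parse`. It is only an illustration of the URL layout; ipython-sql does not require this step.

```python
from urllib.parse import urlsplit

# Example connection string from the text above
conn_url = "postgresql://postgres:postgres@db:5432/pagila"

parts = urlsplit(conn_url)

# Layout: scheme://username:password@host:port/database
print("dialect :", parts.scheme)
print("user    :", parts.username)
print("password:", parts.password)
print("host    :", parts.hostname)
print("port    :", parts.port)
print("database:", parts.path.lstrip("/"))
```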


# STEP1: Connect to the local database where Pagila is loaded

## 1.1 Create the pagila db and fill it with data
- Adding `!` at the beginning of a line in a Jupyter cell runs that line as a shell command, i.e. we are not running Python code; we are running the `createdb` and `psql` PostgreSQL command-line utilities

In [1]:
!PGPASSWORD=student createdb -h 127.0.0.1 -p 5433 -U student pagila
!PGPASSWORD=student psql -q -h 127.0.0.1 -p 5433 -U student -d pagila -f Data/pagila-schema.sql
!PGPASSWORD=student psql -q -h 127.0.0.1 -p 5433 -U student -d pagila -f Data/pagila-data.sql

psql:Data/pagila-schema.sql:22: ERROR:  language "plpgsql" already exists
 setval 
--------
    200
(1 row)

 setval 
--------
     16
(1 row)

 setval 
--------
   1000
(1 row)

 setval 
--------
    605
(1 row)

 setval 
--------
    600
(1 row)

 setval 
--------
    109
(1 row)

 setval 
--------
    599
(1 row)

 setval 
--------
   4581
(1 row)

 setval 
--------
      6
(1 row)

 setval 
--------
  32098
(1 row)

 setval 
--------
  16049
(1 row)

 setval 
--------
      2
(1 row)

 setval 
--------
      2
(1 row)
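
The same three shell commands could also be issued from Python rather than with `!`. The sketch below is one hedged way to do that with `subprocess`. The helper names `pg_env`, `load_commands`, and `run_all` are hypothetical, and the host, port, user, and file paths are copied from the cell above. The block only defines the helpers; nothing touches a database until `run_all()` is called.

```python
import os
import subprocess

# Values copied from the shell commands in the cell above
HOST, PORT, USER, DB = "127.0.0.1", "5433", "student", "pagila"


def pg_env(password="student"):
    """Return a copy of the current environment with PGPASSWORD set."""
    env = dict(os.environ)
    env["PGPASSWORD"] = password
    return env


def load_commands(sql_files=("Data/pagila-schema.sql", "Data/pagila-data.sql")):
    """Build the createdb and psql argument lists, in execution order."""
    common = ["-h", HOST, "-p", PORT, "-U", USER]
    commands = [["createdb", *common, DB]]
    for path in sql_files:
        commands.append(["psql", "-q", *common, "-d", DB, "-f", path])
    return commands


def run_all():
    """Run each command in order; a non-zero exit raises CalledProcessError."""
    for cmd in load_commands():
        subprocess.run(cmd, env=pg_env(), check=True)
```

Passing the arguments as a list avoids shell quoting issues, and setting `PGPASSWORD` through `env` keeps the password out of the command line itself.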



## 1.2 Connect to the newly created db

In [1]:
%load_ext sql

In [2]:
DB_ENDPOINT = "127.0.0.1"
DB = 'pagila'
DB_USER = 'student'
DB_PASSWORD = 'student'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)


postgresql://student:student@127.0.0.1:5432/pagila


In [3]:
%sql $conn_string

'Connected: student@pagila'
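
If the password ever contains characters such as `@` or `/`, plain string formatting would produce an ambiguous URL. Below is a minimal sketch of one way to guard against that, using `urllib.parse.quote_plus` from the standard library. The helper name `make_conn_string` is hypothetical and not part of ipython-sql.

```python
from urllib.parse import quote_plus


def make_conn_string(user, password, host, port, database):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# With the values used in this notebook the result is unchanged
plain = make_conn_string("student", "student", "127.0.0.1", "5432", "pagila")
print(plain)

# A password containing '@' is escaped as %40 instead of breaking the URL
escaped = make_conn_string("student", "p@ss", "127.0.0.1", "5432", "pagila")
print(escaped)
```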

# STEP2: Explore the 3NF Schema


[<img src="pagila-3nf.png" width="750"/>](pagila-3nf.png)

## 2.1 How much? What data sizes are we looking at?

In [4]:
nStores = %sql select count(*) from store;
nFilms = %sql select count(*) from film;
nCustomers = %sql select count(*) from customer;
nRentals = %sql select count(*) from rental;
nPayment = %sql select count(*) from payment;
nStaff = %sql select count(*) from staff;
nCity = %sql select count(*) from city;
nCountry = %sql select count(*) from country;


print("Films\t\t=", nFilms[0][0])
print("Customers\t=", nCustomers[0][0])
print("Rentals\t\t=", nRentals[0][0])
print("Payment\t\t=", nPayment[0][0])
print("Staff\t\t=", nStaff[0][0])
print("Stores\t\t=", nStores[0][0])
print("Cities\t\t=", nCity[0][0])
print("Country\t\t=", nCountry[0][0])

 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.
Films		= 1000
Customers	= 599
Rentals		= 16044
Payment		= 16049
Staff		= 2
Stores		= 2
Cities		= 600
Country		= 109
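
The repeated `%sql` lines above follow one pattern: run `select count(*)` once per table. As a sketch of that pattern in plain Python, the function below loops over table names using any DB-API connection. To stay runnable without a PostgreSQL server, it is demonstrated on an in-memory `sqlite3` database with two toy tables. The toy data and the `count_rows` helper are assumptions for illustration, not part of Pagila.

```python
import sqlite3


def count_rows(conn, tables):
    """Return {table: row count} for each table name in `tables`.

    Table names cannot be bound as query parameters, so they are
    interpolated into the SQL text; only pass names you trust.
    """
    counts = {}
    cur = conn.cursor()
    for table in tables:
        cur.execute(f"SELECT count(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts


# Stand-in database with toy data (NOT the Pagila contents)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (store_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE staff (staff_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO store VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO staff VALUES (?)", [(1,), (2,), (3,)])

for table, n in count_rows(conn, ["store", "staff"]).items():
    print(f"{table}\t= {n}")
```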


## 2.2 When? What time period are we talking about?

In [5]:
%%sql 
select min(payment_date) as start, max(payment_date) as end from payment;

 * postgresql://student:***@127.0.0.1:5432/pagila
1 rows affected.


start,end
2007-01-24 21:21:56.996577,2007-05-14 13:44:29.996577
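
To put a number on that period, the two timestamps from the result above can be subtracted with Python's `datetime`. This small sketch hard-codes the values shown in the output rather than querying the database.

```python
from datetime import datetime

# Timestamps copied from the query result above
start = datetime.fromisoformat("2007-01-24 21:21:56.996577")
end = datetime.fromisoformat("2007-05-14 13:44:29.996577")

span = end - start
print(span)       # 109 days, 16:22:33
print(span.days)  # 109
```

The payment data therefore covers a little under four months of 2007.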


In [12]:
%%sql
select city_id, city from city;

 * postgresql://student:***@127.0.0.1:5432/pagila
600 rows affected.


city_id,city
1,A Corua (La Corua)
2,Abha
3,Abu Dhabi
4,Acua
5,Adana
6,Addis Abeba
7,Aden
8,Adoni
9,Ahmadnagar
10,Akishima


## 2.3 Where? Where do events in this database occur?

In [13]:
%%sql
SELECT district, count(city_id) as n
FROM address
GROUP BY district
ORDER BY n desc
LIMIT 10;


 * postgresql://student:***@127.0.0.1:5432/pagila
10 rows affected.


district,n
Buenos Aires,10
California,9
Shandong,9
West Bengali,9
So Paulo,8
Uttar Pradesh,8
Maharashtra,7
England,7
Southern Tagalog,6
Punjab,5
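
When tallying addresses per district, note that `count(city_id)` counts rows while `sum(city_id)` adds up the id values, which is not a meaningful quantity. The sketch below shows the difference on a toy `address` table in an in-memory `sqlite3` database. The rows are made up for illustration and are not Pagila data.

```python
import sqlite3

# Toy stand-in for the address table (invented rows, NOT Pagila data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (district TEXT, city_id INTEGER)")
conn.executemany(
    "INSERT INTO address VALUES (?, ?)",
    [("England", 500), ("England", 300),
     ("Punjab", 10), ("Punjab", 20), ("Punjab", 30)],
)

# Compare the two aggregates side by side, ranked by the row count
query = """
    SELECT district, count(city_id) AS n_rows, sum(city_id) AS id_total
    FROM address
    GROUP BY district
    ORDER BY n_rows DESC
"""
rows = conn.execute(query).fetchall()
for district, n_rows, id_total in rows:
    print(district, n_rows, id_total)
```

In this toy data Punjab has more addresses (3 versus 2) even though England has the larger id total, so ordering by the sum would rank the districts incorrectly.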