# SQL in Python & Scheduling - Day 3: Afternoon Lecture - December 17, 2020

### Topics to Cover:
- RDBMS Review
- SQL query overview
- Connecting Databases to Python
- Scheduler

## Why databases?

* Organized way to store data and relate data to each other
* Enables data governance (e.g., availability, usability, integrity and security of data)
* Efficient retrieval of structured data
* Can store lots of data

_"A company cannot be managed with Excel spreadsheets"_

## Why SQL?

* A lot of data is still stored in SQL databases
* Structured Query Language (SQL) is still the most effective tool to investigate, filter, slice and dice your data

## NoSQL databases
* NoSQL databases (e.g., MongoDB) also exist
* Want to learn more? https://www.mongodb.com/nosql-explained

## Commonly used database schemas

* How do tables in a database relate to each other?
* There are some commonly used design principles
* We will walk through **Star** and **Snowflake** schemas

**Resources:**
* https://www.guru99.com/star-snowflake-data-warehousing.html
* https://en.wikipedia.org/wiki/Snowflake_schema
* https://en.wikipedia.org/wiki/Star_schema

### Star Schema

<img src="star-schema.png" width=300>

* Every dimension represented by only one dimension table
* Dimension table contains set of attributes
* Dimension tables are joined to the fact table using a foreign key
* Dimension tables **are not** joined to each other
* Fact table contains key and measure
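
The bullets above can be sketched with Python's built-in `sqlite3` module. This is only a toy illustration: the table names (`dim_bar`, `dim_drink`, `fact_sales`), the `city` column, and all rows are made up for the example and are not part of the lecture database.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- One dimension table per dimension, each holding a set of attributes.
CREATE TABLE dim_bar   (bar_id   INTEGER PRIMARY KEY, bar_name   TEXT, city TEXT);
CREATE TABLE dim_drink (drink_id INTEGER PRIMARY KEY, drink_name TEXT, type TEXT);

-- Fact table: foreign keys to each dimension plus the measure (quantity).
CREATE TABLE fact_sales (
    bar_id   INTEGER REFERENCES dim_bar(bar_id),
    drink_id INTEGER REFERENCES dim_drink(drink_id),
    quantity INTEGER
);

INSERT INTO dim_bar    VALUES (1, 'bar 9', 'Toronto'), (2, 'bar 19', 'Ottawa');
INSERT INTO dim_drink  VALUES (1, 'drink 4', 'beer'), (2, 'drink 44', 'wine');
INSERT INTO fact_sales VALUES (1, 1, 3), (1, 2, 1), (2, 1, 4);
""")

# Dimensions are joined to the fact table only, never to each other.
rows = conn.execute("""
    SELECT b.bar_name, d.type, SUM(f.quantity) AS total
    FROM fact_sales f
    JOIN dim_bar b   ON b.bar_id = f.bar_id
    JOIN dim_drink d ON d.drink_id = f.drink_id
    GROUP BY b.bar_name, d.type
    ORDER BY b.bar_name, d.type
""").fetchall()
print(rows)
conn.close()
```

Every query fans out from `fact_sales`, which is what gives the schema its star shape.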

### Snowflake Schema

<img src="snowflake-schema.png" width=700>

* Extension of the star schema by adding extra dimensions
* Dimension tables are "normalized", splitting data into additional tables
* What does it mean to [normalize data](https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description#:~:text=Normalization%20is%20the%20process%20of,eliminating%20redundancy%20and%20inconsistent%20dependency.)?
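
To make "normalized" concrete, here is a small sketch using `sqlite3`. The tables (`dim_country`, `dim_bar`, `fact_sales`) and their rows are invented for illustration: the country name is moved out of the bar dimension into its own lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Lookup table produced by normalizing the bar dimension.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country_name TEXT);

-- The dimension now stores a key instead of repeating the country name.
CREATE TABLE dim_bar (
    bar_id     INTEGER PRIMARY KEY,
    bar_name   TEXT,
    country_id INTEGER REFERENCES dim_country(country_id)
);

CREATE TABLE fact_sales (bar_id INTEGER REFERENCES dim_bar(bar_id), quantity INTEGER);

INSERT INTO dim_country VALUES (1, 'Canada'), (2, 'Mexico');
INSERT INTO dim_bar     VALUES (1, 'bar 9', 1), (2, 'bar 19', 1), (3, 'bar 3', 2);
INSERT INTO fact_sales  VALUES (1, 3), (2, 4), (3, 5);
""")

# Reaching the country name takes one more join than in a star schema.
per_country = conn.execute("""
    SELECT c.country_name, SUM(f.quantity) AS total
    FROM fact_sales f
    JOIN dim_bar b     ON b.bar_id = f.bar_id
    JOIN dim_country c ON c.country_id = b.country_id
    GROUP BY c.country_name
    ORDER BY c.country_name
""").fetchall()
print(per_country)
conn.close()
```

The extra join is the trade-off: less repeated data in the dimension, but more tables to traverse per query.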

## Schema comparison

**Star schema:**
* Easy to understand
* Dimension tables are not normalized - country ID does not have country lookup table
* Widely supported by BI tools

**Snowflake schema:**
* Easy to implement and "grow"
* Lower query performance with multiple tables
* More difficult to maintain with more lookup tables


### RDBMS Vendors

* **Closed source** (i.e., you have to pay)
    * Vendors: Oracle, SQL Server (Microsoft), IBM DB2, Microsoft Access - local small databases
    * Could come with integrations and services that make things easier  
    
<br>

* **Open source** (i.e., free)
    * MySQL, PostgreSQL, SQLite, MariaDB
    * Good developer community makes these great options
    * [This website](https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems#:~:text=SQLite%20is%20a%20self%2Dcontained,even%20in%20low%2Dmemory%20environments.) offers a good comparison of open source systems options.
    
## Why SQLite?

* Not directly comparable to client/server SQL database engines such as MySQL, Oracle, PostgreSQL, or SQL Server
* Used as on-disk file format for desktop applications
* Great to learn on to get a hang of SQL

For more information: https://www.sqlite.org/whentouse.html

## Why PostgreSQL?

* Created by scientists from the University of California at Berkeley
* Open source nature makes it easy to upgrade or extend
* High compliance to the SQL standard
* Offers a large (and growing) number of functions that help programmers create new applications, administrators protect data integrity, and developers build resilient and secure environments
* Can easily run it on Windows, Mac OS X, and almost all Linux and Unix distributions
* MySQL would be a good choice too
* We will use `psql`

## Make sure everyone is set up before break...

1. Install PostgreSQL: https://www.datacamp.com/community/tutorials/installing-postgresql-windows-macosx
2. From your terminal: `pip install psycopg2`
3. Open a Jupyter notebook and run `import psycopg2` to make sure it works

To get a database going: 
1. Open the psql shell
2. Log in (press Enter at every prompt if you kept the defaults)
3. **Set a password (REMEMBER THIS!)**
4. `\l` to see what databases already exist
5. `CREATE DATABASE drinks;`
6. `\l` to see that it was created

## Connecting Databases to Python 

### POSTGRES

In [28]:
# make necessary imports and set up environmental variables
import os
import psycopg2
import pandas as pd

postgres_pwd = os.environ.get('POSTGRES_PWD')  # set POSTGRES_PWD to your own password before starting Jupyter

In [31]:
con = psycopg2.connect(database='drinks', user='postgres', password=postgres_pwd,
                       host='127.0.0.1', port='5432')  # This should work if you left everything as default

In [32]:
#Initiate Cursor
query = con.cursor()



In [None]:
# RUN THIS ONLY ONCE TO POPULATE YOUR DRINKS DATABASE
# COMMENT IT OUT ONCE DONE

#query.execute(open('drinks.sql','r').read())

In [48]:
# execute a simple query
query.execute('''
SELECT * 
FROM orders 
LIMIT 5;''')

response = query.fetchall()

for row in response:
    print(row)

('person 1', '2016-10-16', 'bar 9', 'drink 4', 1)
('person 1', '2016-10-16', 'bar 9', 'drink 44', 1)
('person 1', '2016-10-22', 'bar 19', 'drink 1', 4)
('person 1', '2016-10-22', 'bar 19', 'drink 9', 1)
('person 1', '2016-10-22', 'bar 19', 'drink 42', 2)


In [49]:
pd.DataFrame(response)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | person 1 | 2016-10-16 | bar 9 | drink 4 | 1 |
| 1 | person 1 | 2016-10-16 | bar 9 | drink 44 | 1 |
| 2 | person 1 | 2016-10-22 | bar 19 | drink 1 | 4 |
| 3 | person 1 | 2016-10-22 | bar 19 | drink 9 | 1 |
| 4 | person 1 | 2016-10-22 | bar 19 | drink 42 | 2 |


In [50]:
df = pd.read_sql('SELECT * FROM orders LIMIT 5;', con)

In [36]:
df

| | person | date | bar | drink_id | quantity |
|---|---|---|---|---|---|
| 0 | person 1 | 2016-10-16 | bar 9 | drink 4 | 1 |
| 1 | person 1 | 2016-10-16 | bar 9 | drink 44 | 1 |
| 2 | person 1 | 2016-10-22 | bar 19 | drink 1 | 4 |
| 3 | person 1 | 2016-10-22 | bar 19 | drink 9 | 1 |
| 4 | person 1 | 2016-10-22 | bar 19 | drink 42 | 2 |


In [None]:
#When done, close the cursor and then the connection
#query.close()
#con.close()

## Quick recap for SQL

The basic format is
<img src='select-statement.png' width=200>

What would the following tell us with our database?
```
SELECT *
FROM orders
WHERE bar='bar 9'
```
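
A runnable sketch of that query, assuming a toy `orders` table built in an in-memory SQLite database (the three rows are copied from the sample output earlier in the lecture). The value is passed as a parameter instead of being pasted into the SQL string. `sqlite3` uses `?` as its placeholder, while `psycopg2` uses `%s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (person TEXT, date TEXT, bar TEXT, drink_id TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        ("person 1", "2016-10-16", "bar 9", "drink 4", 1),
        ("person 1", "2016-10-16", "bar 9", "drink 44", 1),
        ("person 1", "2016-10-22", "bar 19", "drink 1", 4),
    ],
)

bar_name = "bar 9"
# The ? placeholder lets the driver handle quoting of the value.
bar_9_orders = conn.execute(
    "SELECT * FROM orders WHERE bar = ?", (bar_name,)
).fetchall()
for row in bar_9_orders:
    print(row)
conn.close()
```

Only the orders placed at `bar 9` come back, with every column included because of `SELECT *`.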

---
Reminder - our database:

<img src='our-db.png' width=400 align='left'>

## Other commands

Use a cheat sheet!
<img src='sql-cheat-sheet.png' width=700>

## More on joins

- `(INNER) JOIN`: Returns records that have matching values in both tables
- `LEFT (OUTER) JOIN`: Returns all records from the left table, and the matched records from the right table
- `RIGHT (OUTER) JOIN`: Returns all records from the right table, and the matched records from the left table
- `FULL (OUTER) JOIN`: Returns all records when there is a match in either left or right table

<img src='joins.png'>
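
The difference between `INNER JOIN` and `LEFT JOIN` is easiest to see when one table has a row with no match. The sketch below uses made-up rows in two small tables named after the lecture's `orders` and `drink_info`. It sticks to the two join types that every SQLite version supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drink_info (drink_id TEXT PRIMARY KEY, type TEXT);
CREATE TABLE orders (person TEXT, drink_id TEXT, quantity INTEGER);

INSERT INTO drink_info VALUES ('drink 1', 'beer'), ('drink 2', 'wine');
-- 'drink 99' has no row in drink_info on purpose.
INSERT INTO orders VALUES ('person 1', 'drink 1', 2), ('person 2', 'drink 99', 1);
""")

# INNER JOIN keeps only orders whose drink_id exists in drink_info.
inner_rows = conn.execute("""
    SELECT o.person, o.drink_id, d.type
    FROM orders o
    JOIN drink_info d ON d.drink_id = o.drink_id
    ORDER BY o.person
""").fetchall()

# LEFT JOIN keeps every order and fills missing drink columns with NULL.
left_rows = conn.execute("""
    SELECT o.person, o.drink_id, d.type
    FROM orders o
    LEFT JOIN drink_info d ON d.drink_id = o.drink_id
    ORDER BY o.person
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The unmatched order disappears from the inner join and shows up in the left join with `None` (SQL `NULL`) as its type.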

## Let's work through some problems

### Question 1:
Get the bar name and average price of each bar

What do you think? What table do we need? What calculation do we need?

---
Reminder - our database:

<img src='our-db.png' width=400 align='left'>

In [38]:
query.execute(
"""
SELECT bar, AVG(price) AS avg_price
FROM has_on_menu
GROUP BY bar
""")

response = query.fetchall()

pd.DataFrame(response)

| | 0 | 1 |
|---|---|---|
| 0 | bar 14 | 93.438857 |
| 1 | bar 8 | 67.919044 |
| 2 | bar 16 | 121.364365 |
| 3 | bar 15 | 90.993793 |
| 4 | bar 12 | 152.071032 |
| 5 | bar 6 | 27.682568 |
| 6 | bar 17 | 107.329703 |
| 7 | bar 9 | 52.381706 |
| 8 | bar 7 | 61.567487 |
| 9 | bar 18 | 105.214779 |


In [39]:
print(response)

[('bar 14', 93.43885736465454), ('bar 8', 67.91904382705688), ('bar 16', 121.36436504125595), ('bar 15', 90.99379272460938), ('bar 12', 152.07103157043457), ('bar 6', 27.682567596435547), ('bar 17', 107.32970309257507), ('bar 9', 52.38170552253723), ('bar 7', 61.56748652458191), ('bar 18', 105.2147786617279), ('bar 11', 31.56255849202474), ('bar 19', 18.22443421681722), ('bar 3', 129.80951288768225), ('bar 5', 100.60385791460673), ('bar 20', 86.51315768559773), ('bar 10', 60.57263496943882), ('bar 1', 83.67851856776646), ('bar 13', 150.56043901443482), ('bar 4', 49.25097733736038), ('bar 2', 60.25870447158813)]


In [40]:
colnames = [desc[0] for desc in query.description]
print(colnames)

['bar', 'avg_price']
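
The column names from `cursor.description` can be paired with the rows so the result is labelled. The sketch below shows the idea with `sqlite3` and a tiny, made-up `has_on_menu` table. The same list comprehension works on a `psycopg2` cursor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE has_on_menu (bar TEXT, drink_id TEXT, price REAL);
INSERT INTO has_on_menu VALUES
    ('bar 1', 'drink 1', 10.0),
    ('bar 1', 'drink 2', 20.0),
    ('bar 2', 'drink 1', 5.0);
""")

cur = conn.cursor()
cur.execute("""
    SELECT bar, AVG(price) AS avg_price
    FROM has_on_menu
    GROUP BY bar
    ORDER BY bar
""")

# The first item of each description entry is the column name.
colnames = [desc[0] for desc in cur.description]
# Pair every row with the column names to get one dict per row.
records = [dict(zip(colnames, row)) for row in cur.fetchall()]
print(records)
conn.close()
```

With pandas, the equivalent is `pd.DataFrame(response, columns=colnames)`, which replaces the default `0, 1` headers with `bar` and `avg_price`.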


### Question 2:
Get the bars with the top 5 average prices.

What do you think? How can we adapt the code we did before?

---
Reminder - our database:

<img src='our-db.png' width=400 align='left'>

In [41]:
query.execute(
"""
SELECT bar, AVG(price) AS avg_price
FROM has_on_menu
GROUP BY bar
ORDER BY avg_price DESC
LIMIT 5
""")

response = query.fetchall()

pd.DataFrame(response)

| | 0 | 1 |
|---|---|---|
| 0 | bar 12 | 152.071032 |
| 1 | bar 13 | 150.560439 |
| 2 | bar 3 | 129.809513 |
| 3 | bar 16 | 121.364365 |
| 4 | bar 17 | 107.329703 |


In [42]:
colnames = [desc[0] for desc in query.description]
print(colnames)

['bar', 'avg_price']


### Question 3:
Which bar sells the cheapest drink? Which drink and what's the price?

---
Reminder - our database:

<img src='our-db.png' width=400 align='left'>

In [43]:
query.execute(
"""
SELECT bar, drink_id, price
FROM has_on_menu
ORDER BY price ASC
LIMIT 1;
""")

response = query.fetchall()

pd.DataFrame(response)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | bar 18 | drink 43 | 3.477886 |


### Question 4:
What is the number of beers sold by each bar?

(If you interpret it to be "sold" as in available for sale, then you need `has_on_menu`; if you interpret "sold" as in a sale was made, then you need `orders`)

Hint: you need two tables here

---
Reminder - our database:

<img src='our-db.png' width=400 align='left'>

In [44]:
query.execute(
"""
SELECT orders.bar, SUM(orders.quantity) AS beers_sold
FROM orders
JOIN drink_info ON drink_info.drink_id = orders.drink_id
WHERE drink_info.type = 'beer'
GROUP BY orders.bar
ORDER BY beers_sold DESC
"""
)

response = query.fetchall()

pd.DataFrame(response)

| | 0 | 1 |
|---|---|---|
| 0 | bar 20 | 176 |
| 1 | bar 5 | 111 |
| 2 | bar 3 | 108 |
| 3 | bar 11 | 80 |
| 4 | bar 2 | 79 |
| 5 | bar 10 | 66 |
| 6 | bar 17 | 36 |


### Challenge question

For each person, find the bar they visit, and the type(s) and price(s) of the drink(s) they drink during those visits.

---
Reminder - our database:

<img src='our-db.png' width=400 align='left'>

In [45]:
query.execute(
"""
SELECT o.person, o.bar, d.type, h.price
    FROM orders o
    JOIN has_on_menu h ON (o.drink_id = h.drink_id AND o.bar = h.bar) 
    JOIN drink_info d ON o.drink_id = d.drink_id;
""")

response = query.fetchall()

pd.DataFrame(response)

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | person 1 | bar 9 | cocktail | 26.487211 |
| 1 | person 1 | bar 9 | vodka | 99.061460 |
| 2 | person 1 | bar 19 | cocktail | 11.775652 |
| 3 | person 1 | bar 19 | rum | 18.195812 |
| 4 | person 1 | bar 19 | whisky | 34.507183 |
| ... | ... | ... | ... | ... |
| 1605 | person 100 | bar 2 | cocktail | 18.345350 |
| 1606 | person 100 | bar 2 | beer | 7.545805 |
| 1607 | person 100 | bar 2 | rum | 6.449759 |
| 1608 | person 100 | bar 15 | wine | 144.248120 |


In [26]:
#don't forget to close your query
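
One way to make sure cursors and connections are always closed, even when a query raises, is `contextlib.closing` from the standard library. It works with any object that has a `close()` method. The sketch uses an in-memory SQLite database so it runs anywhere. Applying the same pattern to the `psycopg2` objects above is an assumption that relies on their `close()` methods.

```python
import sqlite3
from contextlib import closing

# closing(...) calls .close() on exit, whether or not an exception occurred.
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT 1 + 1")
        answer = cur.fetchone()[0]

print(answer)
```

After the outer `with` block ends, the connection is closed and further queries on it raise an error.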

## SQLITE

In [25]:
#Import sqlite3, part of the Python standard library
import sqlite3

#conn = sqlite3.connect(':memory:') # memory only

#Connect to database
#Must be same directory
#conn = sqlite3.connect("flights.db")  #pass in Database name, builds a new one if not exists

conn = sqlite3.connect("chinook.db") 

**Chinook Database Schema**

<img src = 'chinook_schema.png'>

In [None]:
#Next we need to create a cursor
# a cursor is a little thing that tells the database what to do
cur = conn.cursor() #create cursor instance

In [None]:
#cur.execute("select * from airlines limit 5;")
cur.execute('''
SELECT * FROM employees
LIMIT 10
''')
results = cur.fetchall()

In [None]:
for row in results:
    print(row)

In [None]:
pd.DataFrame(results)
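
By default `sqlite3` returns plain tuples, so columns are addressed by position. Setting `row_factory = sqlite3.Row` lets you address them by name instead. The sketch below builds a small stand-in `employees` table in memory. Its three columns and two rows are toy data, not the full Chinook table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows now support access by column name

conn.execute(
    "CREATE TABLE employees (EmployeeId INTEGER PRIMARY KEY, LastName TEXT, Title TEXT)"
)
conn.executemany(
    "INSERT INTO employees (LastName, Title) VALUES (?, ?)",
    [("Adams", "General Manager"), ("Edwards", "Sales Manager")],
)

cur = conn.cursor()
cur.execute("SELECT * FROM employees ORDER BY EmployeeId LIMIT 10")
results = cur.fetchall()

last_names = [row["LastName"] for row in results]
columns = results[0].keys()  # column names of the result set
print(last_names)
print(columns)

cur.close()
conn.close()
```

Set `row_factory` before creating the cursor, since cursors pick up the setting when they are created.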

#### One Last Thing

Check out **Turbodbc**, another Python module for accessing relational databases through ODBC. Its structure and usage are similar to the two approaches we used above.

Supports
1. MySQL
2. PostgreSQL

https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html

## Python Scheduling


- How to automatically schedule tasks in Python


```cmd
$ pip install schedule
```

Documentation: https://schedule.readthedocs.io/en/stable/


Great FAQs: https://schedule.readthedocs.io/en/stable/faq.html

**Why do we schedule jobs?**

- Check status of a site?
- Pull the latest update from a database with a query? 
- Automating repetitive tasks 

In [1]:
# essential imports
import schedule
import time

In [2]:
#define a job


def job():
    print("This is a job, and the job is done.")
    
    
def job_2():
    print('This is the second job.')
    
    
def job_that_executes_once():
    #stuff and code
    return schedule.CancelJob

In [None]:
schedule.every(5).seconds.do(job)
schedule.every(10).seconds.do(job_2)
schedule.every().day.at("10:00").do(job)
schedule.every().wednesday.at("01:00").do(job_2)
schedule.every(5).to(10).seconds.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)

Every 1 day at 22:30:00 do job_that_executes_once() (last run: [never], next run: 2020-12-17 22:30:00)
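
To see what a `run_pending()` / `sleep` loop is doing conceptually, here is a standard-library-only sketch. It is not how the `schedule` package is implemented: `IntervalJob` and this `run_pending` function are hypothetical helpers written for illustration. The loop is bounded by a deadline so the cell finishes, unlike `while True`.

```python
import time


class IntervalJob:
    """A job that becomes due every `interval` seconds (hypothetical helper)."""

    def __init__(self, interval, func):
        self.interval = interval
        self.func = func
        self.next_run = time.monotonic() + interval

    def run_if_due(self, now):
        # Run the function only once its scheduled time has passed.
        if now >= self.next_run:
            self.func()
            self.next_run = now + self.interval
            return True
        return False


def run_pending(jobs):
    """Run every job whose next_run time has passed; return how many ran."""
    now = time.monotonic()
    return sum(job.run_if_due(now) for job in jobs)


calls = []
jobs = [IntervalJob(0.05, lambda: calls.append("tick"))]

# Bounded loop so the example terminates, unlike `while True`.
deadline = time.monotonic() + 0.3
while time.monotonic() < deadline:
    run_pending(jobs)
    time.sleep(0.01)

print(len(calls))
```

The short `sleep` keeps the loop from spinning the CPU. The trade-off is that a job can fire up to one sleep interval late.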

In [None]:
def greet(name):
    print('Hello {}'.format(name))

schedule.every().day.do(greet, 'Andrea').tag('daily-tasks', 'friend')
schedule.every().hour.do(greet, 'John').tag('hourly-tasks', 'friend')
schedule.every().hour.do(greet, 'Monica').tag('hourly-tasks', 'customer')
schedule.every().day.do(greet, 'Derek').tag('daily-tasks', 'guest')

schedule.clear('daily-tasks')


In [46]:
def my_job():
    # This job will execute every 5 to 10 seconds.
    print('Foo')

schedule.every(5).to(10).seconds.do(my_job)

Every 5 to 10 seconds do my_job() (last run: [never], next run: 2020-12-17 12:11:21)

In [5]:
import requests

#url = "https://api.coindesk.com/v1/bpi/currentprice.json"
#page = requests.get(url)
#data = page.json()

In [None]:
data

In [None]:
data['bpi']['USD']['rate']
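
The recorded output later in this notebook prints rates such as `22,864.7383`, which suggests the `rate` field is a string with a thousands separator. A minimal sketch for turning it into a number and logging it with a timestamp follows. `parse_rate`, `record_price`, `price_log`, and the `sample` payload are all illustrative names, and the payload mimics only the keys used above.

```python
from datetime import datetime, timezone


def parse_rate(rate_text):
    """Convert a rate string with thousands separators to a float."""
    return float(rate_text.replace(",", ""))


# Payload shaped like the keys used above; the value is copied from the sample output.
sample = {"bpi": {"USD": {"rate": "22,864.7383"}}}

price_log = []  # (UTC timestamp, price) pairs


def record_price(payload):
    """Parse the USD rate from a payload and append it to price_log."""
    price = parse_rate(payload["bpi"]["USD"]["rate"])
    price_log.append((datetime.now(timezone.utc), price))
    return price


latest = record_price(sample)
print(latest)
```

Storing floats with timestamps makes the collected prices ready for plotting or for loading into a DataFrame.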

In [3]:
bitcoin_price = [] 

In [52]:
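import requests
import schedule
import time

bitcoin_price = []  # defined in this cell so fetch_bitcoin can always append to it


def fetch_bitcoin():
    url = "https://api.coindesk.com/v1/bpi/currentprice.json"
    page = requests.get(url, timeout=10)  # don't let a stalled request block the scheduler
    data = page.json()
    print('Getting latest Bitcoin Price')
    print(data['bpi']['USD']['rate'])
    bitcoin_price.append(data['bpi']['USD']['rate'])

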
schedule.every(15).seconds.do(fetch_bitcoin)


while True:
    schedule.run_pending()
    time.sleep(1)

getting some bitcoin price
22,864.7383
getting some bitcoin price
22,864.7383
Foo
getting some bitcoin price
22,864.7383
getting some bitcoin price
22,864.7383
Foo
getting some bitcoin price
22,864.1746
getting some bitcoin price
22,864.1746
Getting latest Bitcoin Price
22,864.1746


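NameError: name 'bitcoin_price' is not defined

This error occurs when `bitcoin_price` has not been defined in the current kernel session: the list must exist before `fetch_bitcoin` first runs.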