<a href="https://colab.research.google.com/github/vlx300/kb_colab/blob/master/Python_SQLite3_Example_Data_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Data Engineering** 
##**Relational vs Non-Relational Databases**

![alt text](https://ulriklyngs.com/sites/default/files/styles/final-blog-image-style/public/desat_chalkb.jpg?itok=7gumzPT8)

In [0]:
import sqlite3

In [0]:
db = sqlite3.connect(':memory:')  # using in memory database
cur = db.cursor()

##**Create three tables**

1.   **Customer**: This table contains the primary key, as well as the customer's first and last names.
2.   **Item**: This table contains the primary key, the item name, and the item price.
3.   **BoughtItem**: This table contains the order number and the price paid, and it references the primary keys of the Customer and Item tables.



In [3]:
cur.execute('''CREATE TABLE IF NOT EXISTS customer (
  id integer PRIMARY KEY,
  firstname varchar(255),
  lastname varchar(255) )''')
cur.execute('''CREATE TABLE IF NOT EXISTS Item (
  id integer PRIMARY KEY,
  title varchar(255),
  price decimal )''')
cur.execute('''CREATE TABLE IF NOT EXISTS BoughtItem (
  ordernumber integer PRIMARY KEY,
  customerid integer,
  itemid integer, 
  price decimal,
  CONSTRAINT customerid
      FOREIGN KEY (customerid) REFERENCES Customer(id),
  CONSTRAINT itemid 
      FOREIGN KEY (itemid) REFERENCES Item(id) )''')

<sqlite3.Cursor at 0x7f266b045730>
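One caveat about the FOREIGN KEY constraints declared above: in most SQLite builds they are parsed but not enforced unless foreign key support is switched on for the connection. The following is a minimal sketch of that behavior, assuming a separate in-memory database and trimmed-down versions of the tables (the names `fk_db` and `fk_rejected` are illustrative only).

```python
import sqlite3

# Separate in-memory database so this sketch does not touch the tables above.
fk_db = sqlite3.connect(':memory:')
fk_db.execute('PRAGMA foreign_keys = ON')  # enforcement is off by default in most builds

fk_db.execute('CREATE TABLE Customer (id integer PRIMARY KEY, firstname varchar(255))')
fk_db.execute('''CREATE TABLE BoughtItem (
  ordernumber integer PRIMARY KEY,
  customerid integer,
  FOREIGN KEY (customerid) REFERENCES Customer(id) )''')

fk_db.execute("INSERT INTO Customer(firstname) VALUES ('Bob')")
fk_db.execute('INSERT INTO BoughtItem(customerid) VALUES (1)')  # valid: customer 1 exists

try:
    # Customer 99 does not exist, so this insert should be rejected.
    fk_db.execute('INSERT INTO BoughtItem(customerid) VALUES (99)')
    fk_rejected = False
except sqlite3.IntegrityError as err:
    fk_rejected = True
    print('Rejected:', err)
```

Without the PRAGMA line, the second insert would normally succeed and leave an order pointing at a customer that does not exist.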

You passed queries to cur.execute() to create your three tables. Now let's populate them with data.

In [4]:
cur.execute('''INSERT INTO Customer(firstname, lastname)
                VALUES ('Bob', 'Adams'),
                ('Amy', 'Smith'),
                ('Rob', 'Bennet');''')
cur.execute('''INSERT INTO Item(title, price)
                VALUES ('USB', 10.2),
                ('Mouse', 12.23),
                ('Monitor', 199.99);''')
cur.execute('''INSERT INTO BoughtItem(customerid, itemid, price)
                VALUES (1, 1, 10.2),
                (1, 2, 12.23),
                (1, 3, 199.99),
                (2, 3, 180.00),
                (3, 2, 11.23);''')  # Discounted Price


<sqlite3.Cursor at 0x7f266b045730>

Now that each table holds a few records, you can use this data to answer a few questions.
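The inserts above embed literal values directly in the SQL string. When the values come from Python variables, placeholders are usually the safer choice. Here is a minimal sketch using `executemany` with `?` placeholders, assuming a separate in-memory database and a copy of the Item table (`demo_db`, `demo_cur`, and `new_items` are illustrative names).

```python
import sqlite3

# Separate in-memory database holding a copy of the Item table.
demo_db = sqlite3.connect(':memory:')
demo_cur = demo_db.cursor()
demo_cur.execute('''CREATE TABLE Item (
  id integer PRIMARY KEY,
  title varchar(255),
  price decimal )''')

# Each tuple supplies the values for one row; sqlite3 binds them to the ? marks.
new_items = [('USB', 10.2), ('Mouse', 12.23), ('Monitor', 199.99)]
demo_cur.executemany('INSERT INTO Item(title, price) VALUES (?, ?)', new_items)
demo_db.commit()

# Read the rows back to confirm what was stored.
demo_cur.execute('SELECT id, title, price FROM Item ORDER BY id')
inserted_rows = demo_cur.fetchall()
print(inserted_rows)
```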

#**SQL Aggregation Functions**

Aggregation functions perform mathematical operations on a result set. The most common are **AVG, COUNT, MIN, MAX and SUM**. You will often need a **GROUP BY** or **HAVING** clause to complement these aggregations. Let's use **AVG** as an example (see below):

*AVG computes the mean of a given result set.*

In [5]:
cur.execute('''SELECT itemid, AVG(price) FROM Boughtitem GROUP BY itemid''')
print(cur.fetchall())

[(1, 10.2), (2, 11.73), (3, 189.995)]


Here you have retrieved the average price for each of the items bought in your database. You can see that the item with item id 1 has an average price of $10.20.

Let's make this easier to understand by displaying the item name instead of the item id.

In [6]:
cur.execute('''SELECT item.title, AVG(boughtitem.price) FROM BoughtItem as boughtitem
            INNER JOIN Item as item on (item.id = boughtitem.itemid)
            GROUP BY boughtitem.itemid''')

print(cur.fetchall())

[('USB', 10.2), ('Mouse', 11.73), ('Monitor', 189.995)]


Another useful aggregation function is **SUM**. You can use this function to display the total amount of money each customer spent (*see below*).

In [7]:
cur.execute('''SELECT customer.firstname, SUM(boughtitem.price) FROM BoughtItem as BoughtItem
            INNER JOIN Customer as Customer on (Customer.id = boughtitem.customerid)
            GROUP BY customer.firstname''')

print(cur.fetchall())

[('Amy', 180), ('Bob', 222.42000000000002), ('Rob', 11.23)]
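The examples above cover AVG and SUM with GROUP BY. As a rough sketch of the remaining pieces mentioned earlier, the block below uses COUNT, MIN, and MAX together with a HAVING clause, and applies SQLite's ROUND function to tidy the floating-point noise visible in Bob's total. It assumes a separate in-memory database loaded with the same BoughtItem rows (`agg_db`, `agg_cur`, `repeat_items`, and `bob_total` are illustrative names).

```python
import sqlite3

# Separate in-memory database with the same BoughtItem rows as above.
agg_db = sqlite3.connect(':memory:')
agg_cur = agg_db.cursor()
agg_cur.execute('''CREATE TABLE BoughtItem (
  ordernumber integer PRIMARY KEY,
  customerid integer,
  itemid integer,
  price decimal )''')
agg_cur.executemany(
    'INSERT INTO BoughtItem(customerid, itemid, price) VALUES (?, ?, ?)',
    [(1, 1, 10.2), (1, 2, 12.23), (1, 3, 199.99), (2, 3, 180.00), (3, 2, 11.23)])

# COUNT, MIN and MAX per item; HAVING keeps only items bought more than once.
agg_cur.execute('''SELECT itemid, COUNT(*), MIN(price), MAX(price)
                   FROM BoughtItem
                   GROUP BY itemid
                   HAVING COUNT(*) > 1
                   ORDER BY itemid''')
repeat_items = agg_cur.fetchall()
print(repeat_items)

# ROUND trims the total for customer 1 (Bob) to two decimal places.
agg_cur.execute('''SELECT ROUND(SUM(price), 2) FROM BoughtItem
                   WHERE customerid = 1''')
bob_total = agg_cur.fetchone()[0]
print(bob_total)
```

WHERE filters rows before grouping, while HAVING filters the groups after the aggregates are computed, which is why the `COUNT(*) > 1` condition belongs in HAVING.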


#**Speeding up SQL Queries**

Query speed depends on various factors, but it is mostly affected by how many of each of the following are involved:

*   **Joins**
*   **Aggregations**
*   **Traversals**
*   **Records**

The greater the number of joins, the higher the complexity and the larger the number of table traversals. Multiple joins are quite expensive to perform on several thousand records involving several tables because *the database needs to cache the intermediate result*! At this point most people start thinking about increasing the database memory size. Hmmm

![alt text](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRc8ZoZdP57kkw--eYUpsNhp-DnJOs7fblaUIgnFd4uMORRB9sATg&s)

Speed is also affected by whether indices are present in the database. They are extremely important because they let you quickly search through a table to find a match for some column.

Indices keep the records sorted at the cost of higher insert time and some extra storage. Multiple columns can be combined to create a single index. *Example: 'date' and 'price' columns can be combined when a query filters on both conditions.*
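The combined index described above can be sketched as follows. The tables created earlier have no date column, so this example assumes a hypothetical `Purchase` table with `orderdate` and `price` columns, and a hypothetical index name `idx_purchase_date_price`. The query plan is then checked to see whether SQLite chooses the index.

```python
import sqlite3

idx_db = sqlite3.connect(':memory:')
idx_cur = idx_db.cursor()

# Hypothetical table with a date column; not one of the tables created earlier.
idx_cur.execute('''CREATE TABLE Purchase (
  ordernumber integer PRIMARY KEY,
  orderdate text,
  price decimal )''')

# One index built over both columns that the query filters on.
idx_cur.execute('CREATE INDEX idx_purchase_date_price ON Purchase(orderdate, price)')

idx_cur.execute('''EXPLAIN QUERY PLAN
                   SELECT ordernumber FROM Purchase
                   WHERE orderdate = '2020-01-15' AND price > 100''')
# The last column of each plan row is the human-readable description.
plan_details = [row[-1] for row in idx_cur.fetchall()]
print(plan_details)
```

The exact wording of the plan varies between SQLite versions, but the description should name the index rather than report a full table scan.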



#**Debugging SQL queries**

Most databases include a query plan facility that describes the steps the database takes to execute a query. In SQLite you can enable this by adding EXPLAIN QUERY PLAN in front of a SELECT statement.

In [9]:
cur.execute('''EXPLAIN QUERY PLAN SELECT customer.firstname, item.title,
                item.price, boughtitem.price FROM BoughtItem as boughtitem
                INNER JOIN Customer as customer on (customer.id = boughtitem.customerid)
                INNER JOIN  Item as item on (item.id = boughtitem.itemid)''')

print(cur.fetchall())

[(0, 0, 0, 'SCAN TABLE BoughtItem AS boughtitem'), (0, 1, 1, 'SEARCH TABLE Customer AS customer USING INTEGER PRIMARY KEY (rowid=?)'), (0, 2, 2, 'SEARCH TABLE Item AS item USING INTEGER PRIMARY KEY (rowid=?)')]


The query lists the customer's first name, the item title, the original price, and the price paid for every bought item.

**SQL Query Plan**

1.   SCAN TABLE BoughtItem AS boughtitem
2.   SEARCH TABLE Customer AS customer USING INTEGER PRIMARY KEY (rowid=?)
3.   SEARCH TABLE Item AS item USING INTEGER PRIMARY KEY (rowid=?)

**NOTE:** The fetch call in the Python code returns only the explanation, **NOT THE RESULTS!**
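To see the difference, the same SELECT can be run twice, once with the EXPLAIN QUERY PLAN prefix and once without. This is a minimal sketch, assuming a separate in-memory database rebuilt with the same three tables and rows. The names `plan_db`, `plan_cur`, `query`, `plan_rows`, and `result_rows` are illustrative, and the ORDER BY clause is added here only to make the row order predictable.

```python
import sqlite3

plan_db = sqlite3.connect(':memory:')
plan_cur = plan_db.cursor()

# Rebuild the three tables and their rows in one script.
plan_cur.executescript('''
CREATE TABLE Customer (id integer PRIMARY KEY, firstname varchar(255), lastname varchar(255));
CREATE TABLE Item (id integer PRIMARY KEY, title varchar(255), price decimal);
CREATE TABLE BoughtItem (ordernumber integer PRIMARY KEY, customerid integer,
                         itemid integer, price decimal);
INSERT INTO Customer(firstname, lastname)
    VALUES ('Bob', 'Adams'), ('Amy', 'Smith'), ('Rob', 'Bennet');
INSERT INTO Item(title, price)
    VALUES ('USB', 10.2), ('Mouse', 12.23), ('Monitor', 199.99);
INSERT INTO BoughtItem(customerid, itemid, price)
    VALUES (1, 1, 10.2), (1, 2, 12.23), (1, 3, 199.99), (2, 3, 180.00), (3, 2, 11.23);
''')

query = '''SELECT customer.firstname, item.title, item.price, boughtitem.price
           FROM BoughtItem AS boughtitem
           INNER JOIN Customer AS customer ON (customer.id = boughtitem.customerid)
           INNER JOIN Item AS item ON (item.id = boughtitem.itemid)
           ORDER BY boughtitem.ordernumber'''

# With the prefix, fetchall() returns the plan steps only.
plan_cur.execute('EXPLAIN QUERY PLAN ' + query)
plan_rows = plan_cur.fetchall()

# Without the prefix, fetchall() returns the joined rows themselves.
plan_cur.execute(query)
result_rows = plan_cur.fetchall()

print(plan_rows)
print(result_rows)
```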