# Lecture 7 - Intro to databases

### Contents:
* Databases - dockerized versions
* DataTypes
* Tables
* Joins
* Python - SQLAlchemy
* SQL + Pandas implementation!



## Why do I need it?

* Persistence of data
* CSV files might not be suitable anymore:
    * No data sanitization
    * Cannot be shared between clients (e.g. continually downloading data from multiple sources into a single file)
    * Files can get corrupted or inconsistent, have no security, are easily deleted, etc.
    * What if something happens during a write?
    * Parallel writing
    * Speed of writing/reading

* Lookup in the dataset!

## Relational databases

* optimize storage -> use normalized data - discover relations using joins
* designed on ACID principle - Atomicity, Consistency, Isolation, Durability
* store huge data 
* read it very fast - depending on the design
* Many different applications!
    * Business
    * Web-servers
    * Big data
    
* Protected access with username / password, vpns, etc.
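
Atomicity is the easiest of the ACID properties to see in action. A minimal sketch, assuming Python's built-in `sqlite3` as a stand-in for a full database server; the two tiny tables, the `store_ticker` helper and all values are made up for illustration. Either both rows for a ticker are stored, or neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in database
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE TABLE financials (ticker TEXT PRIMARY KEY, shares INTEGER NOT NULL)")


def store_ticker(conn, ticker, name, shares):
    """Write both rows in one transaction (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO company VALUES (?, ?)", (ticker, name))
            conn.execute("INSERT INTO financials VALUES (?, ?)", (ticker, shares))
        return True
    except sqlite3.IntegrityError:
        return False


stored_ok = store_ticker(conn, "AAA", "Alpha Corp", 1000)
# shares=None violates NOT NULL, so the company row is rolled back as well
stored_bad = store_ticker(conn, "BBB", "Broken Corp", None)

companies = [row[0] for row in conn.execute("SELECT ticker FROM company ORDER BY ticker")]
print(stored_ok, stored_bad, companies)
```

The second call fails on the `financials` insert, and the already executed `company` insert disappears with it.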

## SQL
*Structured Query Language*
* Human readable
* Different implementations
    * engines: SQLite, MySQL, Oracle, PostgreSQL
* SQL is only a language
* Data are stored in *Tables* 
* Connected via *Relations*
* NoSQL alternatives (MongoDB, CouchDB, DynamoDB) optimize for access speed rather than storage (storage is cheap now); they are typically async and scalable

## How to use it? 
* Command-line
* Python drivers
* Programming interface
* GUI Interface - [DBeaver](https://dbeaver.io/)
* Integration with existing software - MS Office, etc

### Database Layers
![alt text](sql_struktura.png "sql structures")


### Tables
![alt text](stock-db.png "Our DB")


### Data Types
depends on specific application
* numeric
    * INT, INTEGER, REAL, FLOAT, DOUBLE etc.
* strings
    * STRING, TEXT, VARCHAR
* more specialized
    * DATE, TIME etc.
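
A small sketch of how declared column types map to Python values, assuming the built-in `sqlite3` module as a stand-in engine (the `demo` table and its values are made up). Note that SQLite is more permissive about types than PostgreSQL, so this only illustrates the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (n INTEGER, x REAL, s TEXT)")
conn.execute("INSERT INTO demo VALUES (?, ?, ?)", (42, 3.5, "MSFT"))

# typeof() is an SQLite function reporting the storage class of a value
row = conn.execute("SELECT n, x, s, typeof(n), typeof(x), typeof(s) FROM demo").fetchone()
python_types = [type(value).__name__ for value in row[:3]]
sql_types = list(row[3:])
print(python_types, sql_types)
```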


## My problem - I want to keep data about stocks for analysis

* Would I always need to download data which does not change?
* Run different queries - analysis
* More stocks can be added any day
* Keep format

In [None]:
#!pip install yfinance

import yfinance as yf

msft = yf.Ticker("MSFT")

In [None]:
# the data look like this, so what do I want to keep, and how?
msft.info


## Let's create a database for this data

* Where to store it?
  * memory: fast, but lost once the process exits
  * personal computer: why not, but the data can be lost, and performance is uncertain
  * cloud server (SaaS), e.g. https://aws.amazon.com/rds/postgresql/ - if you want to learn this, drop us an email, we might add it to the course


* Demo - a PostgreSQL server instance running in Docker on your computer - quick to start using, no installation etc.
  * ! Docker stores the data together with the container - if you remove the container (and its volume), you lose the data!
  * It is possible to keep the data in a specific directory on your machine, which makes it persistent - if you really, really need persistence of the data, get the cloud server, or read the manual at https://hub.docker.com/_/postgres
  

### which docker image? 
https://hub.docker.com/_/postgres
    
docker allows me to easily specify versions, size of image and other things!

### if you have docker running on your machine, you can easily start the database from terminal / command line

* I am running the latest postgres 12 based on Alpine Linux (the base system does not really matter inside Docker, but Alpine is slim)
* Specify the container name with `--name your_name` - then I can stop it with `docker stop your_name` and start it again with `docker start your_name`
   * if not supplied, Docker generates a funny name from an adjective and a scientist, like 'crazy_einstein' etc.
* specify env variables which customize the DB (password and postgres user)
* specify on which port I can access the DB with `-p 5432:5432` - inside the container postgres runs on its default port 5432, and I want to reach it through port 5432 on my own machine, since nothing else is running there.

* recommended: add `-d` so it runs in the background

* access logs `docker logs stock-db`


`docker run -d --name stock-db -e POSTGRES_PASSWORD=iesFTW -e POSTGRES_USER=honza -p 5432:5432 postgres:12-alpine`

### how to connect?

* Now we have a server - we need a client (just as a web server needs a browser or `requests`)
   * Not a bad idea to get familiar with the command-line client `psql` - on macOS `brew install libpq`
   * GUI clients - multiplatform https://dbeaver.io/ and others - on macOS `brew install --cask dbeaver-community`
 
* terminal connect:
   * `psql -h localhost -U honza postgres` and enter the password `iesFTW`
   * the default database name is `postgres`, that's the last parameter. You can customize it with docker
   * without that parameter, `psql` would connect you to a database with the same name as the user (`jansila` in my case), so do not get confused here
 
* DBeaver as shown in video

In [None]:
#save some data - design a database

# tables - company, financials, prices
# each has own purpose

sql_create_company = """CREATE TABLE company (
    ticker VARCHAR(5) PRIMARY KEY, --max length of a ticker is 5
    name TEXT NOT NULL, -- cannot be empty
    sector TEXT,
    state TEXT,
    summary TEXT)
"""


sql_create_financials = """CREATE TABLE financials (
    ticker VARCHAR(5) PRIMARY KEY, -- in more advanced designs, we would create this as foreign key! only one observation per ticker
    shares BIGINT,
    div_yield REAL,
    beta REAL
)"""

sql_create_prices = """CREATE TABLE IF NOT EXISTS prices (
    ticker VARCHAR(5),
    ts date not null,
    price REAL,
    volume BIGINT -- number of shares traded
    )
    """

In [None]:
# lets connect
import psycopg2
connection = psycopg2.connect("dbname='postgres' user='honza' host='localhost' password='iesFTW'")
connection.autocommit = True
# in order to work with the DB, we need a cursor 

cursor = connection.cursor()

for sql_statement in [sql_create_company, sql_create_financials, sql_create_prices]:
    cursor.execute(sql_statement)
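
The comment in `financials` mentions that a more advanced design would make `ticker` a foreign key. A minimal sketch of what that buys us, assuming the built-in `sqlite3` module as a stand-in for the Postgres server (the `REFERENCES` clause reads the same in PostgreSQL; the inserted rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to
conn.execute("CREATE TABLE company (ticker VARCHAR(5) PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE financials (
    ticker VARCHAR(5) PRIMARY KEY REFERENCES company (ticker), -- must exist in company
    shares BIGINT
)""")

conn.execute("INSERT INTO company VALUES ('AAA', 'Alpha Corp')")
conn.execute("INSERT INTO financials VALUES ('AAA', 1000)")  # fine, AAA is a known company

try:
    conn.execute("INSERT INTO financials VALUES ('NOPE', 1000)")  # no such company
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

With the constraint in place, the database itself refuses financial data for a ticker that has no row in `company`.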

In [None]:
def write_company_data(cursor, ticker, td):
    cursor.execute("INSERT INTO company (ticker, name, sector, state, summary) VALUES (%s, %s, %s, %s, %s)", 
                       (ticker, td['shortName'], td['sector'], td['state'], td['longBusinessSummary'])
                  )
def write_financial_data(cursor, ticker, td):
    cursor.execute("INSERT INTO financials (ticker, shares, div_yield, beta) VALUES (%s, %s, %s, %s)", 
                       (ticker, td['floatShares'], td['dividendYield'],td['beta'])
                  )


def write_prices(cursor, ticker, data):
    # iterrows() yields (index, row) pairs; the index is the trading date
    for ts, row in data.iterrows():
        cursor.execute("INSERT INTO prices (ticker, ts, price, volume) VALUES (%s, %s, %s, %s)", 
                       (ticker, ts, row['Close'], row['Volume'])
                  )
        
## add some data in the db
tickers = ['MSFT', 'FB','GOOG','GS','INTC', 'AAL', 'AAPL']

#yf api https://aroussi.com/post/python-yahoo-finance

for ticker in tickers: 
    td = yf.Ticker(ticker)
    write_company_data(cursor, ticker, td.info)
    write_financial_data(cursor, ticker, td.info)
    write_prices(cursor, ticker, td.history('ytd'))
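
The price rows can also be handed to the driver in one call instead of a Python loop. A sketch assuming the built-in `sqlite3` module as a stand-in (its placeholder is `?`; psycopg2 cursors offer `executemany` as well, with `%s` placeholders). The `write_prices_bulk` helper and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ticker TEXT, ts TEXT NOT NULL, price REAL, volume INTEGER)")

# made-up sample rows in the shape (date, close, volume)
history = [("2020-01-02", 10.0, 100), ("2020-01-03", 10.5, 150), ("2020-01-06", 9.8, 120)]


def write_prices_bulk(conn, ticker, rows):
    """Insert all rows of one ticker in a single transaction (hypothetical helper)."""
    params = [(ticker, ts, close, vol) for ts, close, vol in rows]
    with conn:  # commit once at the end, roll back everything on error
        conn.executemany(
            "INSERT INTO prices (ticker, ts, price, volume) VALUES (?, ?, ?, ?)", params
        )
    return len(params)


n_written = write_prices_bulk(conn, "DEMO", history)
n_stored = conn.execute("SELECT COUNT(*) FROM prices WHERE ticker = 'DEMO'").fetchone()[0]
print(n_written, n_stored)
```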
    
    


In [None]:
msft.history()

In [None]:
#now the data are in persistent storage

In [None]:
#check all companies we downloaded

cursor.execute("SELECT ticker, name, sector from company;")

for row in cursor.fetchall():  # cursor.fetchone() would return a single row instead
    print(f'downloaded {row[1]} that operates in {row[2]} and has ticker: {row[0]}')

In [None]:
cursor.execute("SELECT ticker, name, sector from company;")

cursor.fetchone()

In [None]:
cursor.fetchone()

In [None]:
#or iterate 
cursor.execute("SELECT ticker, name, sector from company;")

for row in cursor:
    print(row)
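
The fetch methods all consume the same result set, so each call continues where the previous one stopped. A sketch assuming the built-in `sqlite3` module, whose cursors follow the same Python DB-API as psycopg2 (the table content is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO company VALUES (?, ?)",
    [("AAA", "Alpha"), ("BBB", "Beta"), ("CCC", "Gamma"), ("DDD", "Delta")],
)

cursor = conn.cursor()
cursor.execute("SELECT ticker FROM company ORDER BY ticker")

first = cursor.fetchone()       # one row as a tuple
next_two = cursor.fetchmany(2)  # list with the following two rows
rest = cursor.fetchall()        # list with whatever is left
exhausted = cursor.fetchone()   # nothing left to read
print(first, next_two, rest, exhausted)
```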


In [None]:
#check all technology companies we downloaded

#case 1: the value is written directly into the SQL string
cursor.execute("SELECT ticker, name from company WHERE sector = 'Technology';")
for row in cursor.fetchall():
    print(f'downloaded {row[1]} and has ticker: {row[0]}')

print('-----')
#case 2: the value is passed separately as a query parameter
cursor.execute("SELECT ticker, name from company where sector = %s;", ('Technology', )) #input needs to be a tuple!
for row in cursor.fetchall():
    print(f'downloaded a tech company {row[1]} and has ticker: {row[0]}')
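
Why bother with the parameter form of case 2? A sketch of what can go wrong when a value coming from outside is pasted into the SQL text, assuming the built-in `sqlite3` module as a stand-in (its placeholder is `?` instead of `%s`; the table content and the hostile input are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY, sector TEXT)")
conn.executemany(
    "INSERT INTO company VALUES (?, ?)",
    [("AAA", "Technology"), ("BBB", "Financial Services")],
)

user_input = "nothing' OR '1'='1"  # hostile value posing as a sector name

# unsafe: the input becomes part of the SQL text and changes the condition
unsafe_sql = f"SELECT ticker FROM company WHERE sector = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# safe: the driver passes the input purely as a value
safe_rows = conn.execute(
    "SELECT ticker FROM company WHERE sector = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The formatted query returns every company, while the parameterized one correctly finds no sector with that odd name.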

In [None]:
# joins
# just like in pandas

cursor.execute("""SELECT comp.ticker, comp.sector, fin.shares 
                    from company as comp 
                        join financials as fin 
                            on fin.ticker=comp.ticker
                ;""")
for row in cursor.fetchall():
    print(f'ticker {row[0]} in sector {row[1]} with {row[2]} shares outstanding')

### JOINS 

* connecting tables - relations!

<img src='https://www.dofactory.com/Images/sql-joins.png'/>

### Inner
* most common - give me the match!
* when you see match, keep it, otherwise drop it.

### Left 
* INNER + rows from LEFT with no match in the RIGHT

In [None]:
cursor.execute("""SELECT comp.ticker, comp.sector, fin.shares 
                    from company as comp 
                       left join financials as fin 
                    on fin.ticker=comp.ticker;""")
for row in cursor.fetchall():
    print(f'ticker {row[0]} in sector {row[1]} with {row[2]} shares outstanding')
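
In our database every company also has financials, so INNER and LEFT joins return the same rows. A sketch with made-up data where one company has no financials, assuming the built-in `sqlite3` module as a stand-in for the Postgres server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY, sector TEXT)")
conn.execute("CREATE TABLE financials (ticker TEXT PRIMARY KEY, shares INTEGER)")
conn.executemany(
    "INSERT INTO company VALUES (?, ?)",
    [("AAA", "Technology"), ("BBB", "Technology"), ("CCC", "Industrials")],
)
conn.executemany("INSERT INTO financials VALUES (?, ?)", [("AAA", 100), ("BBB", 200)])

inner_rows = conn.execute("""SELECT comp.ticker, fin.shares
                             FROM company AS comp
                             JOIN financials AS fin ON fin.ticker = comp.ticker
                             ORDER BY comp.ticker""").fetchall()

left_rows = conn.execute("""SELECT comp.ticker, fin.shares
                            FROM company AS comp
                            LEFT JOIN financials AS fin ON fin.ticker = comp.ticker
                            ORDER BY comp.ticker""").fetchall()

print(inner_rows)  # CCC is dropped, it has no match
print(left_rows)   # CCC is kept, shares is None (NULL)
```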

In [None]:
from sqlalchemy import create_engine
import pandas as pd

#                 connect as driver://username:password@host:port/database
engine = create_engine('postgresql://honza:iesFTW@localhost:5432/postgres')
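
The connection URL can also be assembled from its parts, which helps when the password contains characters such as `@` or `/` that would otherwise break the URL. A sketch under that assumption; `make_db_url` is a hypothetical helper, not part of SQLAlchemy, and the second password is made up.

```python
from urllib.parse import quote_plus


def make_db_url(user, password, host="localhost", port=5432,
                database="postgres", driver="postgresql"):
    """Assemble driver://username:password@host:port/database (hypothetical helper)."""
    # quote_plus escapes characters that have a special meaning inside a URL
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


url = make_db_url("honza", "iesFTW")
tricky_url = make_db_url("honza", "p@ss/word")
print(url)
print(tricky_url)
```

The resulting string can be passed to `create_engine` in place of the literal URL.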

In [None]:
#pandas + psycopg2 (recent pandas versions warn about raw DBAPI connections and recommend SQLAlchemy)

pd.read_sql_query("""SELECT comp.ticker, comp.sector, fin.shares 
                    from company as comp 
                       left join financials as fin 
                    on fin.ticker=comp.ticker;""", connection, index_col='ticker')

In [None]:
#pandas + sqlalchemy
pd.read_sql_query(
"""SELECT comp.ticker, comp.sector, fin.shares 
                    from company as comp 
                       left join financials as fin 
                    on fin.ticker=comp.ticker;"""
    ,con=engine,index_col='ticker')
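
Pandas can write to the database too, not only read from it. A round-trip sketch assuming an in-memory SQLite connection as a stand-in (with Postgres, the SQLAlchemy `engine` would be passed instead, and the placeholder style depends on the driver); the data frame content is made up.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

prices = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB"],
    "ts": ["2020-01-02", "2020-01-03", "2020-01-02"],
    "price": [10.0, 10.5, 20.0],
})

# DataFrame -> table; index=False keeps the pandas index out of the table
prices.to_sql("prices", conn, index=False)

# table -> DataFrame, with the ticker passed as a query parameter
aaa = pd.read_sql_query(
    "SELECT ts, price FROM prices WHERE ticker = ? ORDER BY ts",
    conn, params=("AAA",), index_col="ts",
)
print(aaa)
```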


In [None]:
#multiple joins with WHERE clause!

pd.read_sql_query("""SELECT comp.ticker, fin.shares, px.price as lprice
                    from company as comp 
                        join financials as fin
                            on fin.ticker=comp.ticker
                        join prices as px
                            on px.ticker=comp.ticker
                        WHERE px.ts='2020-04-21'
                  """,connection, index_col='ticker')

In [None]:
#algebra within a query

pd.read_sql_query("""SELECT comp.ticker, fin.shares, px.price as lprice, fin.shares*px.price/1e9 as mktcap_in_billions
                    from company as comp 
                        join financials as fin
                            on fin.ticker=comp.ticker
                        join prices as px
                            on px.ticker=comp.ticker
                        where px.ts='2020-01-02'
                  """,connection, index_col='ticker')

In [None]:
#all prices and calculated market caps

pd.read_sql_query("""SELECT comp.ticker, px.price as lprice, px.ts, fin.shares*px.price/1e9 as mktcap_in_billions
                    from company as comp 
                        join financials as fin
                            on fin.ticker=comp.ticker
                        join prices as px
                            on px.ticker=comp.ticker;
                  """,connection, index_col='ticker')

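The dates in the queries above are written directly into the SQL text. A sketch of the same three-table join with the date passed as a parameter and the market cap computed inside the query, assuming the built-in `sqlite3` module as a stand-in and made-up numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE financials (ticker TEXT PRIMARY KEY, shares INTEGER)")
conn.execute("CREATE TABLE prices (ticker TEXT, ts TEXT, price REAL)")
conn.execute("INSERT INTO company VALUES ('AAA')")
conn.execute("INSERT INTO financials VALUES ('AAA', 2000000000)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [("AAA", "2020-01-02", 50.0), ("AAA", "2020-01-03", 55.0)],
)

rows = conn.execute("""
    SELECT comp.ticker, px.ts, fin.shares * px.price / 1e9 AS mktcap_in_billions
    FROM company AS comp
        JOIN financials AS fin ON fin.ticker = comp.ticker
        JOIN prices AS px ON px.ticker = comp.ticker
    WHERE px.ts = ?
    """, ("2020-01-03",)).fetchall()
print(rows)
```

Two billion shares at a price of 55 give a market cap of 110 billion, and only the row for the requested date is returned.
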
In [179]:
#create turnover variable

cursor.execute("""
                ALTER TABLE prices 
             ADD COLUMN IF NOT EXISTS turnover REAL;
             """)

cursor.execute("UPDATE prices SET turnover = volume*price")

pd.read_sql_query("""SELECT * from prices WHERE ticker='AAPL';
                  """,connection, index_col='ticker')

| ticker | ts | price | volume | turnover |
|---|---|---|---|---|
| AAPL | 2020-01-02 | 299.64 | 33870100 | 1.014884e+10 |
| AAPL | 2020-01-03 | 296.73 | 36580700 | 1.085459e+10 |
| AAPL | 2020-01-06 | 299.09 | 29596800 | 8.852107e+09 |
| AAPL | 2020-01-07 | 297.68 | 27218000 | 8.102254e+09 |
| AAPL | 2020-01-08 | 302.47 | 33019800 | 9.987499e+09 |
| ... | ... | ... | ... | ... |
| AAPL | 2020-04-16 | 286.69 | 39281300 | 1.126156e+10 |
| AAPL | 2020-04-17 | 282.80 | 53812500 | 1.521817e+10 |
| AAPL | 2020-04-20 | 276.93 | 32503800 | 9.001277e+09 |
| AAPL | 2020-04-21 | 268.37 | 45189800 | 1.212759e+10 |
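
The same ALTER + UPDATE step can be tried without the server. A sketch assuming the built-in `sqlite3` module as a stand-in, using the first AAPL row from the output above (SQLite does not accept `IF NOT EXISTS` in `ADD COLUMN`, so it is left out here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ticker TEXT, ts TEXT, price REAL, volume INTEGER)")
conn.execute("INSERT INTO prices VALUES ('AAPL', '2020-01-02', 299.64, 33870100)")

# add the new column, then fill it for every existing row in one statement
conn.execute("ALTER TABLE prices ADD COLUMN turnover REAL")
conn.execute("UPDATE prices SET turnover = volume * price")
conn.commit()

turnover = conn.execute(
    "SELECT turnover FROM prices WHERE ticker = 'AAPL'"
).fetchone()[0]
print(f"{turnover:.6e}")
```

One thing to keep in mind: in PostgreSQL, `REAL` is a single-precision type with roughly 6 significant decimal digits, so for values this large `DOUBLE PRECISION` or `NUMERIC` may be the safer choice.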
