## Midterm - any questions?

# Lecture 8 - Intro to databases

### Contents:
* Databases - dockerized versions
* DataTypes
* Tables
* Joins
* Python - SQLAlchemy
* SQL + Pandas implementation!



## Why do I need it?

* Persistence of data
* CSV files may no longer be suitable:
    * No data sanitization
    * Hard to share between clients (imagine continually downloading data from multiple sources and merging it into a single file)
    * Files can get corrupted or inconsistent, have no security, and are easily deleted
    * What if something fails during a write?
    * No safe parallel writing
    * Speed of writing/reading

* Lookup in the dataset!

## Relational databases

* optimize storage -> use normalized data and recover the relations with joins
* designed around the ACID principles - Atomicity, Consistency, Isolation, Durability
* store huge amounts of data
* read it very fast - depending on the design
* Many different applications!
    * Business
    * Web-servers
    * Big data
    
* Protected access with username / password, vpns, etc.
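
To make the "A" in ACID concrete, here is a minimal sketch of an all-or-nothing write. It uses Python's built-in `sqlite3` with an in-memory database purely so that it runs without a server (an assumption for portability, not the setup used later in this lecture); the `insert_batch` helper and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for a real database server (assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ticker TEXT NOT NULL, price REAL NOT NULL)")
conn.commit()


def insert_batch(conn, rows):
    """Insert all rows or none of them (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO prices (ticker, price) VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


ok = insert_batch(conn, [("MSFT", 158.9), ("AAPL", 73.8)])
# The second row violates NOT NULL, so the whole batch is rolled back.
failed = insert_batch(conn, [("GOOG", 1216.3), ("FB", None)])

row_count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(ok, failed, row_count)
```

The failed batch leaves no partial data behind, which is exactly what a plain CSV file cannot guarantee when a write is interrupted.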

## SQL
*Structured Query Language*
* Human readable
* Different implementations
    * engines: SQLite, MySQL, Oracle, PostgreSQL
* SQL is only a language
* Data are stored in *Tables* 
* Connected via *Relations*
* NoSQL alternatives (MongoDB, CouchDB, DynamoDB) optimize access speed instead of storage (storage is cheap now) and are built to be asynchronous and scalable

## How to use it? 
* Command-line
* Python drivers
* Programming interface
* GUI Interface - [DBeaver](https://dbeaver.io/)
* Integration with existing software - MS Office, etc

### Database Layers
![alt text](sql_struktura.png "sql structures")


### Tables
![alt text](stock-db.png "Our DB")


### Data Types
The available types depend on the specific engine and application
* numeric
    * INT, INTEGER, REAL, FLOAT, DOUBLE etc.
* strings
    * STRING, TEXT, VARCHAR
* more specialized
    * DATE, TIME etc.


## My problem - I want to keep data about stocks for analysis

* Would I always need to download data which does not change?
* Run different queries - analysis
* More stocks can be added any day
* Keep format

In [21]:
# !pip install yfinance

import yfinance as yf

msft = yf.Ticker("MSFT")

In [None]:
# the data looks like this - what do I want to keep, and how?
msft.info


## Let's create a database for this data

* Where to store it?
  * memory: fast, but lost once the program exits
  * personal computer - why not, but it can be lost, and performance is questionable
  * cloud server - SaaS - https://aws.amazon.com/rds/postgresql/ - if you want to learn this, drop us an email, we might add it to the course


* Demo - postgresql server instance running in Docker on your computer - quick to start using, no installation etc.
  * ! Docker will create a container where the data will be stored - if you remove the container, you effectively lose the data!
  * It is possible to keep the data in a specific directory on your machine and thus make it persistent - if you really need persistence of the data, get the cloud server, or read the manual https://hub.docker.com/_/postgres
  

### which docker image? 
https://hub.docker.com/_/postgres
    
docker allows me to easily specify versions, size of image and other things!

### if you have docker running on your machine, you can easily start the database from terminal / command line

* I am running the latest Postgres 12 image based on Alpine Linux (the base system does not matter much inside Docker, but Alpine is slim)
* Specify the container name with `--name your-name` - I can then stop it with `docker stop your-name` and start it again with `docker start your-name`
   * if not supplied, Docker generates a name from a random adjective and a scientist's surname, like 'crazy_einstein'
* specify env variables which will customize the DB (password and postgres user)
* specify the port on which I can access the DB with `-p 5432:5432` (host:container) - inside the container Postgres runs on its default port 5432, and I reach it through my own port 5432, since nothing else is running there.

* I recommend adding `-d` so it runs in the background

* access logs `docker logs stock-db`


`docker run -d --name stock-db -e POSTGRES_PASSWORD=iesFTW -e POSTGRES_USER=honza -p 5432:5432 postgres:12-alpine`

### how to connect?

* Now we have a server - we need a client (like with requests - browser)
   * Not a bad idea to get familiar with the command-line `psql` client - on macOS `brew install libpq`
   * GUI clients - multiplatform https://dbeaver.io/ and others - on macOS `brew install --cask dbeaver-community`
 
* terminal connect:
   * `psql -h localhost -U honza postgres` and enter the password `iesFTW`
   * the last parameter is the database name; the default is `postgres`. You can customize it with Docker
   * by default `psql` would connect you to a database named after the user (`jansila` in my case), so do not get confused here
 
* DBeaver as shown in video

In [26]:
#save some data - design a database

# tables - company, financials, prices
# each has own purpose

sql_create_company = """CREATE TABLE IF NOT EXISTS company (
                            ticker VARCHAR(5) PRIMARY KEY, --max length of a ticker is 5
                            name TEXT NOT NULL, -- cannot be empty
                            sector TEXT,
                            state TEXT,
                            summary TEXT)
"""


sql_create_financials = """CREATE TABLE IF NOT EXISTS financials (
    ticker VARCHAR(5) PRIMARY KEY, -- only one observation per ticker; in more advanced designs this would also be a foreign key
    shares BIGINT,
    div_yield REAL,
    beta REAL
)"""

sql_create_prices = """CREATE TABLE IF NOT EXISTS prices (
    ticker VARCHAR(5),
    ts DATE NOT NULL,
    price REAL,
    volume BIGINT -- number of shares traded
    )
    """

In [None]:
# !pip install psycopg2
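
The comment in the `financials` definition mentions that a more advanced design would use a foreign key. A minimal sketch of what that buys you is below. It runs on the built-in `sqlite3` with an in-memory database so it needs no server (an assumption for portability; SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, whereas PostgreSQL enforces them by default). The tables are trimmed-down versions of the ones above and `'XXXX'` is a made-up ticker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE company (ticker VARCHAR(5) PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE financials (
    ticker VARCHAR(5) PRIMARY KEY REFERENCES company (ticker), -- must exist in company
    shares BIGINT)""")

conn.execute("INSERT INTO company VALUES ('MSFT', 'Microsoft Corporation')")
conn.execute("INSERT INTO financials VALUES ('MSFT', 7449961518)")  # parent row exists

try:
    conn.execute("INSERT INTO financials VALUES ('XXXX', 1)")  # no such company
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

With the constraint in place, the database itself refuses financial data for a company it does not know about, instead of relying on the loading script to get the order right.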

In [27]:
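# let's connect
import psycopg2 # PostgreSQL only

# host='db' is the database host as seen from where this notebook runs; when connecting from the machine that published port 5432, use 'localhost'
connection = psycopg2.connect("dbname='postgres' user='honza' host='db' password='iesFTW'") 
connection.autocommit = True # every statement is committed immediately, no explicit connection.commit() needed
# in order to work with the DB, we need a cursor 

cursor = connection.cursor() # this talks to the DB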

for sql_statement in [sql_create_company, sql_create_financials, sql_create_prices]:
    cursor.execute(sql_statement)

In [30]:
def write_company_data(cursor, ticker, td):
    cursor.execute("INSERT INTO company (ticker, name, sector, state, summary) VALUES (%s, %s, %s, %s, %s)", 
                       (ticker, td['shortName'], td['sector'], td['state'], td['longBusinessSummary'])
                  )
def write_financial_data(cursor, ticker, td):
    cursor.execute("INSERT INTO financials (ticker, shares, div_yield, beta) VALUES (%s, %s, %s, %s)", 
                       (ticker, td['floatShares'], td['dividendYield'],td['beta'])
                  )

def write_prices(cursor, ticker, data):
    # data is the price history DataFrame, indexed by date
    for ts, row in data.iterrows():
        cursor.execute("INSERT INTO prices (ticker, ts, price, volume) VALUES (%s, %s, %s, %s)", 
                       (ticker, ts, row['Close'], row['Volume'])
                  )
        
## add some data in the db
tickers = ['MSFT', 'FB','GOOG','GS','INTC', 'AAL', 'AAPL']

#yf api https://aroussi.com/post/python-yahoo-finance

for ticker in tickers: 
    td = yf.Ticker(ticker)
    write_company_data(cursor, ticker, td.info)
    write_financial_data(cursor, ticker, td.info)
    write_prices(cursor, ticker, td.history('ytd'))
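
The loop in `write_prices` sends one `INSERT` per row. As a hedged alternative, the DB-API also offers `executemany`, which takes the statement once and a list of parameter tuples. The sketch below shows the pattern on the built-in `sqlite3` (an assumption so that it runs without a server; `sqlite3` uses `?` placeholders where `psycopg2` uses `%s`). The `price_rows` helper and the sample records are hypothetical and made up for illustration, not downloaded data.

```python
import sqlite3
from datetime import date


def price_rows(ticker, history):
    """Turn (date, close, volume) records into tuples matching the prices columns."""
    return [(ticker, ts.isoformat(), float(close), int(volume))
            for ts, close, volume in history]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ticker VARCHAR(5), ts DATE NOT NULL, price REAL, volume BIGINT)")

# Hand-written sample records (not real market data).
sample = [(date(2020, 1, 2), 10.0, 1000), (date(2020, 1, 3), 10.5, 1200)]

with conn:  # one transaction for the whole batch
    conn.executemany(
        "INSERT INTO prices (ticker, ts, price, volume) VALUES (?, ?, ?, ?)",
        price_rows("DEMO", sample),
    )

inserted = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(inserted)
```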
    
    


In [40]:
#check all companies we downloaded

cursor.execute("SELECT ticker, name, sector FROM company;")

# for row in cursor.fetchall(): #cursor.fetchone(), 

#     print(f'downloaded {row[1]} that operates in {row[2]} and has ticker: {row[0]}')

In [41]:
cursor.fetchall()

[('MSFT', 'Microsoft Corporation', 'Technology'),
 ('FB', 'Facebook, Inc.', 'Communication Services'),
 ('GOOG', 'Alphabet Inc.', 'Communication Services'),
 ('GS', 'Goldman Sachs Group, Inc. (The)', 'Financial Services'),
 ('INTC', 'Intel Corporation', 'Technology'),
 ('AAL', 'American Airlines Group, Inc.', 'Industrials'),
 ('AAPL', 'Apple Inc.', 'Technology')]

In [None]:
cursor.execute("SELECT ticker, name, sector from company;")

cursor.fetchone()

In [None]:
cursor.fetchone()
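
A cursor hands out each row of a result only once, which is why repeated `fetchone()` calls walk through the result. The sketch below shows this on the built-in `sqlite3` (an assumption so that it runs without the PostgreSQL server; both drivers follow the same DB-API for `fetchone` and `fetchall`). The three-row table is a trimmed-down stand-in for `company`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY, sector TEXT)")
conn.executemany(
    "INSERT INTO company VALUES (?, ?)",
    [("MSFT", "Technology"), ("GS", "Financial Services"), ("AAL", "Industrials")],
)

cursor = conn.cursor()
cursor.execute("SELECT ticker FROM company ORDER BY ticker")

first = cursor.fetchone()      # one row, returned as a tuple
rest = cursor.fetchall()       # everything that has not been fetched yet
exhausted = cursor.fetchone()  # nothing left, so this is None

print(first, rest, exhausted)
```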

In [42]:
#or iterate 
cursor.execute("SELECT ticker, name, sector from company;")

for row in cursor:
    print(row)


('MSFT', 'Microsoft Corporation', 'Technology')
('FB', 'Facebook, Inc.', 'Communication Services')
('GOOG', 'Alphabet Inc.', 'Communication Services')
('GS', 'Goldman Sachs Group, Inc. (The)', 'Financial Services')
('INTC', 'Intel Corporation', 'Technology')
('AAL', 'American Airlines Group, Inc.', 'Industrials')
('AAPL', 'Apple Inc.', 'Technology')


In [46]:
#check all technology companies we downloaded

#case 1
cursor.execute("SELECT ticker, name from company WHERE sector = 'Technology';")
for row in cursor.fetchall():
    print(f'downloaded {row[1]} and has ticker: {row[0]}')

print('-----')
#check all technology companies we downloaded
#case 2
cursor.execute("SELECT ticker, name from company where sector = %s;", ('Technology', )) #input needs to be a tuple!
for row in cursor.fetchall():
    print(f'downloaded a tech company {row[1]} and has ticker: {row[0]}')
    
    
    
print('-----')
#check all Tech and Industrial companies we downloaded
#case 3
for industry in [('Technology',), ('Industrials',)]:

    cursor.execute("SELECT ticker, name from company where sector = %s;", industry) #input needs to be a tuple!
    for row in cursor.fetchall():
        print(f'downloaded a {industry[0]} company {row[1]} and has ticker: {row[0]}')

downloaded Microsoft Corporation and has ticker: MSFT
downloaded Intel Corporation and has ticker: INTC
downloaded Apple Inc. and has ticker: AAPL
-----
downloaded a tech company Microsoft Corporation and has ticker: MSFT
downloaded a tech company Intel Corporation and has ticker: INTC
downloaded a tech company Apple Inc. and has ticker: AAPL
-----
downloaded a Technology company Microsoft Corporation and has ticker: MSFT
downloaded a Technology company Intel Corporation and has ticker: INTC
downloaded a Technology company Apple Inc. and has ticker: AAPL
downloaded a Industrials company American Airlines Group, Inc. and has ticker: AAL


In [47]:
# joins
# just like in pandas

cursor.execute("""SELECT comp.ticker, comp.sector, fin.shares 
                    from company as comp 
                        join financials as fin 
                    on fin.ticker=comp.ticker
                ;""")
for row in cursor.fetchall():
    print(f'ticker {row[0]} in sector {row[1]} with {row[2]} shares outstanding')

ticker MSFT in sector Technology with 7449961518 shares outstanding
ticker FB in sector Communication Services with 2391882253 shares outstanding
ticker GOOG in sector Communication Services with 605509742 shares outstanding
ticker GS in sector Financial Services with 354142142 shares outstanding
ticker INTC in sector Technology with 4094393760 shares outstanding
ticker AAL in sector Industrials with 502828288 shares outstanding
ticker AAPL in sector Technology with 16984630180 shares outstanding


### JOINS 

* connecting tables - relations!

<img src='https://4.bp.blogspot.com/-_HsHikmChBI/VmQGJjLKgyI/AAAAAAAAEPw/JaLnV0bsbEo/s1600/sql%2Bjoins%2Bguide%2Band%2Bsyntax.jpg' width="800" />

### Inner
* most common - give me the match!
* when you see match, keep it, otherwise drop it.

### Left 
* INNER + rows from LEFT with no match in the RIGHT

In [48]:
cursor.execute("""SELECT comp.ticker, comp.sector, fin.shares 
                    from company as comp 
                       left join financials as fin 
                    on fin.ticker=comp.ticker;""")
for row in cursor.fetchall():
    print(f'ticker {row[0]} in sector {row[1]} with {row[2]} shares outstanding')

ticker MSFT in sector Technology with 7449961518 shares outstanding
ticker FB in sector Communication Services with 2391882253 shares outstanding
ticker GOOG in sector Communication Services with 605509742 shares outstanding
ticker INTC in sector Technology with 4094393760 shares outstanding
ticker AAL in sector Industrials with 502828288 shares outstanding
ticker AAPL in sector Technology with 16984630180 shares outstanding
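
To see where INNER and LEFT joins differ, here is a small self-contained sketch in which one company has no row in `financials`. It uses the built-in `sqlite3` with an in-memory database (an assumption so that it runs without a server), and the tickers `AAA` and `BBB` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY, sector TEXT)")
conn.execute("CREATE TABLE financials (ticker TEXT PRIMARY KEY, shares INTEGER)")
conn.executemany("INSERT INTO company VALUES (?, ?)",
                 [("AAA", "Technology"), ("BBB", "Industrials")])
conn.execute("INSERT INTO financials VALUES ('AAA', 100)")  # BBB has no financials row

inner_rows = conn.execute("""SELECT comp.ticker, fin.shares
                             FROM company AS comp
                             JOIN financials AS fin ON fin.ticker = comp.ticker
                             ORDER BY comp.ticker""").fetchall()

left_rows = conn.execute("""SELECT comp.ticker, fin.shares
                            FROM company AS comp
                            LEFT JOIN financials AS fin ON fin.ticker = comp.ticker
                            ORDER BY comp.ticker""").fetchall()

print(inner_rows)  # only the matched company
print(left_rows)   # the unmatched company is kept, with None for shares
```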


In [50]:
# !pip install sqlalchemy



In [54]:
from sqlalchemy import create_engine
import pandas as pd

#                 connect as driver://username:password@host:port/database
engine = create_engine('postgresql://honza:iesFTW@db:5432/postgres') # plays a role similar to the psycopg2 connection object
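
The connection URL packs the same pieces as the psycopg2 connection string into one value. As a quick illustration, the sketch below takes the URL apart with the standard library's `urllib.parse.urlsplit`; this is only a way to inspect the format, not how SQLAlchemy itself parses it, and the `settings` dictionary is a made-up name.

```python
from urllib.parse import urlsplit

url = "postgresql://honza:iesFTW@db:5432/postgres"
parts = urlsplit(url)

# Map each URL component to the setting it corresponds to.
settings = {
    "driver": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "dbname": parts.path.lstrip("/"),
}
print(settings)
```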

In [55]:
#pandas + psycopg2

pd.read_sql_query("""SELECT comp.ticker, comp.sector, fin.shares 
                    from company as comp 
                       left join financials as fin 
                    on fin.ticker=comp.ticker;""", connection, index_col='ticker')

| ticker | sector | shares |
|---|---|---|
| MSFT | Technology | 7449961518 |
| FB | Communication Services | 2391882253 |
| GOOG | Communication Services | 605509742 |
| INTC | Technology | 4094393760 |
| AAL | Industrials | 502828288 |
| AAPL | Technology | 16984630180 |


In [56]:
#pandas + sqlalchemy
pd.read_sql_query(
"""SELECT comp.ticker, comp.sector, fin.shares 
                    from company as comp 
                       left join financials as fin 
                    on fin.ticker=comp.ticker;"""
    ,con=engine,index_col='ticker')


| ticker | sector | shares |
|---|---|---|
| MSFT | Technology | 7449961518 |
| FB | Communication Services | 2391882253 |
| GOOG | Communication Services | 605509742 |
| INTC | Technology | 4094393760 |
| AAL | Industrials | 502828288 |
| AAPL | Technology | 16984630180 |


In [60]:
#multiple joins with WHERE clause!

pd.read_sql_query("""SELECT comp.ticker, fin.shares,fin.div_yield, px.price as lprice
                    from company as comp 
                        join financials as fin
                            on fin.ticker=comp.ticker
                        join prices as px
                            on px.ticker=comp.ticker
                        WHERE px.ts='2020-04-21'
                  """,connection, index_col='ticker')

| ticker | shares | div_yield | lprice |
|---|---|---|---|
| MSFT | 7449961518 | 0.0107 | 166.5144 |
| FB | 2391882253 |  | 170.8 |
| GOOG | 605509742 |  | 1216.34 |
| INTC | 4094393760 | 0.0287 | 55.263416 |
| AAL | 502828288 |  | 11.0 |
| AAPL | 16984630180 | 0.0072 | 66.13444 |


In [None]:
#algebra within a query

pd.read_sql_query("""SELECT comp.ticker, fin.shares, px.price as lprice, fin.shares*px.price/1e9 as mktcap_in_billions
                    from company as comp 
                        join financials as fin
                            on fin.ticker=comp.ticker
                        join prices as px
                            on px.ticker=comp.ticker
                        where px.ts='2020-01-02'
                  """,connection, index_col='ticker')

In [61]:
#all prices and calculated market caps

pd.read_sql_query("""SELECT comp.ticker, px.price as lprice, px.ts, fin.shares*px.price/1e9 as mktcap_in_billions
                    from company as comp 
                        join financials as fin
                            on fin.ticker=comp.ticker
                        join prices as px
                            on px.ticker=comp.ticker;
                  """,connection, index_col='ticker')

| ticker | lprice | ts | mktcap_in_billions |
|---|---|---|---|
| MSFT | 158.93628 | 2020-01-02 | 1184.069165 |
| MSFT | 156.95726 | 2020-01-03 | 1169.325548 |
| MSFT | 157.36296 | 2020-01-06 | 1172.348002 |
| MSFT | 155.92818 | 2020-01-07 | 1161.658917 |
| MSFT | 158.41183 | 2020-01-08 | 1180.162073 |
| ... | ... | ... | ... |
| AAPL | 118.03000 | 2020-11-18 | 2004.695879 |
| AAPL | 118.64000 | 2020-11-19 | 2015.056514 |
| AAPL | 117.34000 | 2020-11-20 | 1992.976443 |
| AAPL | 113.85000 | 2020-11-23 | 1933.700120 |
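
The market-cap arithmetic in the queries above is evaluated by the database for every joined row. A minimal sketch with round, made-up numbers makes the calculation easy to check by hand. It uses the built-in `sqlite3` with an in-memory database (an assumption so that it runs without a server), and the `DEMO` ticker is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE financials (ticker TEXT PRIMARY KEY, shares INTEGER)")
conn.execute("CREATE TABLE prices (ticker TEXT, ts TEXT, price REAL)")

# Made-up values: 2 billion shares at prices of 50 and 55.
conn.execute("INSERT INTO financials VALUES ('DEMO', 2000000000)")
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)",
                 [("DEMO", "2020-01-02", 50.0), ("DEMO", "2020-01-03", 55.0)])

rows = conn.execute("""SELECT px.ts, fin.shares * px.price / 1e9 AS mktcap_in_billions
                       FROM financials AS fin
                       JOIN prices AS px ON px.ticker = fin.ticker
                       ORDER BY px.ts""").fetchall()
print(rows)
```

Each price row is matched with the single `financials` row for its ticker, so the one share count is reused for every date.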


In [64]:
#create turnover variable

cursor.execute("""
                ALTER TABLE prices 
             ADD COLUMN IF NOT EXISTS turnover REAL;
             """)

cursor.execute("UPDATE prices SET turnover = volume*price")

pd.read_sql_query("""SELECT * from prices WHERE ticker='AAPL';
                  """,connection, index_col='ticker')

| ticker | ts | price | volume | turnover |
|---|---|---|---|---|
| AAPL | 2020-01-02 | 73.840040 | 135480400 | 1.000388e+10 |
| AAPL | 2020-01-03 | 73.122154 | 146322800 | 1.069944e+10 |
| AAPL | 2020-01-06 | 73.704820 | 118387200 | 8.725707e+09 |
| AAPL | 2020-01-07 | 73.358185 | 108872000 | 7.986652e+09 |
| AAPL | 2020-01-08 | 74.538240 | 132079200 | 9.844951e+09 |
| ... | ... | ... | ... | ... |
| AAPL | 2020-11-18 | 118.030000 | 76322100 | 9.008297e+09 |
| AAPL | 2020-11-19 | 118.640000 | 74113000 | 8.792766e+09 |
| AAPL | 2020-11-20 | 117.340000 | 73391400 | 8.611747e+09 |
| AAPL | 2020-11-23 | 113.850000 | 127126400 | 1.447334e+10 |


# SANITIZE YOUR INPUTS

<Img src='https://imgs.xkcd.com/comics/exploits_of_a_mom.png' />
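
What the comic warns about can be reproduced in a few lines. The sketch below contrasts building the SQL text with an f-string against passing the value as a parameter. It uses the built-in `sqlite3` with an in-memory database (an assumption so that it runs without a server; `sqlite3` writes placeholders as `?` where `psycopg2` uses `%s`), and the hostile `user_input` string is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY, sector TEXT)")
conn.executemany("INSERT INTO company VALUES (?, ?)",
                 [("MSFT", "Technology"), ("GS", "Financial Services")])

user_input = "nothing' OR '1'='1"  # hostile value pretending to be a sector name

# Unsafe: the input becomes part of the SQL text and changes the query's logic.
unsafe_sql = f"SELECT ticker FROM company WHERE sector = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the value travels separately from the SQL text and is only compared as data.
safe_rows = conn.execute(
    "SELECT ticker FROM company WHERE sector = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The unsafe query returns every company because the injected `OR '1'='1'` is always true; the parameterized query returns nothing because no sector has that literal name.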