## Create a database

In [1]:
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import pandas as pd

In [2]:
# Define a database name (we're using a dataset on births, so we'll call it birth_db)
# Set your postgres username
dbname = 'netgen_db'
username = 'xingliu' # change this to your username

In [3]:
## 'engine' is a connection to a database
## Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine('postgres://%s@localhost/%s'%(username,dbname))
print(engine.url)

postgres://xingliu@localhost/netgen_db


In [4]:
## create a database (if it doesn't exist)
if not database_exists(engine.url):
    create_database(engine.url)
print(database_exists(engine.url))

True


In [5]:
plant50187 = pd.read_csv('plantid54268.csv', parse_dates = ['time_stamp'])
plant50187['plant_id'] = 50187
plant50187 = plant50187[['plant_id', 'time_stamp', 'net_gen']]

In [6]:
plant3845 = pd.read_csv('plantid3845.csv', parse_dates = ['time_stamp'])
plant3845['plant_id'] = 3845
plant3845 = plant3845[['plant_id', 'time_stamp', 'net_gen']]

In [7]:
plant54268 = pd.read_csv('plantid54268.csv', parse_dates = ['time_stamp'])
plant54268['plant_id'] = 54268
plant54268 = plant54268[['plant_id', 'time_stamp', 'net_gen']]

In [8]:
plant_netgen = pd.concat([plant50187, plant3845, plant54268], axis = 0)

In [9]:
## insert data into database from Python (proof of concept - this won't be useful for big data, of course)
plant_netgen.to_sql('netgen_table', engine, index = False, if_exists='replace')

In [10]:
plant_netgen.head()

Unnamed: 0,plant_id,time_stamp,net_gen
0,50187,2007-01-31,90470.93
1,50187,2007-02-28,85858.58
2,50187,2007-03-31,74914.07
3,50187,2007-04-30,58088.45
4,50187,2007-05-31,70801.27


The above line (to_sql) is doing a lot of heavy lifting.  It's reading a dataframe, it's creating a table, and adding the data to the table.  So ** SQLAlchemy is quite useful! **

## Working with PostgresSQL without Python

**Open up the PostgreSQL app, click on the "Open psql" button in the bottom right corner, ** <br>

or alternatively type <br>

    psql -h localhost

into the command line  

**Connect to the "birth_db" database we created**

    \c birth_db

**You should see something like the following**

`You are now connected to database "birth_db" as user "rockson".`


**Then try the following query:**

    SELECT * FROM birth_data_table;
    
Note that the semi-colon indicates an end-of-statement.

### You can see the table we created!  But it's kinda ugly and hard to read.

Try a few other sample queries.  Before you type in each one, ask yourself what you think the output will look like:

`SELECT * FROM birth_data_table WHERE infant_sex='M';`

`SELECT COUNT(infant_sex) FROM birth_data_table WHERE infant_sex='M';`

`SELECT COUNT(gestation_weeks), infant_sex FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks, infant_sex;`

`SELECT gestation_weeks, COUNT(gestation_weeks) FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks;`

All the above queries run, but they are difficult to visually inspect in the Postgres terminal.

## Working with PostgreSQL in Python

In [11]:
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(database = dbname, user = username)

# query:
sql_query = """
SELECT * FROM netgen_table WHERE plant_id=50187;
"""
netgen_from_sql = pd.read_sql_query(sql_query,con)
netgen_from_sql.head()

Unnamed: 0,plant_id,time_stamp,net_gen
0,50187,2007-01-31,90470.93
1,50187,2007-02-28,85858.58
2,50187,2007-03-31,74914.07
3,50187,2007-04-30,58088.45
4,50187,2007-05-31,70801.27


Once the data has been pulled into python, we can leverage pandas methods to work with the data.