# Sydney Livability Group Assignment

## To Do:
- Clean tables. Make sure data is valid for when we upload it to the Sydney_Livability schema. 
- Figure out how to upload the zipped folders of data (non-csv data) into the notebook. 
- Identify the Primary and Foreign Keys for each table.
- Upload the tables into the schema.
- Figure out how to share the schema. Something to do with adding other people's credentials to the credentials.json file.
- Identify other databases we want to use from the website provided.
- Identify other stakeholders.
- Canvas "Quiz" on stakeholders and additional data: due Week 11 Friday. 
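
For the zipped, non-CSV folders mentioned in the list above, one possible approach is Python's standard `zipfile` module. This is only a sketch under assumptions: the archive name `example_data.zip` and the file inside it are hypothetical and are created on the fly so the snippet runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        names = zf.namelist()
    # Skip directory entries, which end with a slash.
    return sorted(dest / n for n in names if not n.endswith("/"))

# Build a tiny hypothetical archive so the example is self-contained.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "example_data.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("example_data/readme.txt", "placeholder contents")

extracted = extract_zip(archive, workdir / "unzipped")
print([p.name for p in extracted])
```

Once extracted, the files can be opened with whichever reader suits their format.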

## Notes
- Accessing a PostgreSQL database from Python requires the psycopg2 and sqlalchemy modules. pandas is also required.
- You need a credentials.json file (named `Credentials.json` in the code below) in the same folder as the notebook to store database credentials. This also allows us to share notebooks between users without security concerns, and to store multiple credentials without greatly modifying the notebook.
- We have to use the public schema, as it is the only schema with PostGIS installed, so we do not create our own schema. We will try to figure out how to share tables on the public schema.
- The connection code below is from the Week 4 Tutorial.

## Connect to the database:

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy import MetaData


import psycopg2
import psycopg2.extras
import json
import os

credentials = "Credentials.json"

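def pgconnect(credential_filepath, db_schema="public"):
    """Open a SQLAlchemy engine and connection using a JSON credentials file.

    Returns (db, conn), or (None, None) if the connection fails.
    db_schema is currently unused; the search path is set in a later cell.
    """
    with open(credential_filepath) as f:
        db_conn_dict = json.load(f)
        host       = db_conn_dict['host']
        db_user    = db_conn_dict['user']
        db_pw      = db_conn_dict['password']
        # The database name is taken from the user name.
        default_db = db_conn_dict['user']
        try:
            db = create_engine('postgresql+psycopg2://'+db_user+':'+db_pw+'@'+host+'/'+default_db, echo=False)
            conn = db.connect()
            print('Connected successfully.')
        except Exception as e:
            print("Unable to connect to the database.")
            print(e)
            db, conn = None, None
        return db,conn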

In [25]:
db, conn = pgconnect(credentials)

Connected successfully.
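
`pgconnect` expects the credentials file to be a JSON object with `host`, `user`, and `password` keys. A minimal sketch of writing and reading such a file follows. Every value here is a placeholder, and the file is written to a temporary directory rather than over the real `Credentials.json`.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values only; replace them with the details for your own server.
example_credentials = {
    "host": "localhost",
    "user": "your_username",
    "password": "your_password",
}

# Write to a temporary location so the real credentials file is untouched.
cred_path = Path(tempfile.mkdtemp()) / "Credentials.json"
cred_path.write_text(json.dumps(example_credentials, indent=4))

with open(cred_path) as f:
    loaded = json.load(f)

# pgconnect reads these three keys (the user name doubles as the database name).
missing = [k for k in ("host", "user", "password") if k not in loaded]
print("Missing keys:", missing)
```

Checking for missing keys before calling `pgconnect` gives a clearer message than the `KeyError` it would otherwise raise.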


## Exploring the Schema on PGAdmin:

In [10]:
from sqlalchemy import inspect
inspect(db).get_schema_names()

['information_schema', 'nswfuel', 'prices', 'public']

Inspect a specific schema:

In [11]:
inspect(db).get_table_names(schema='nswfuel')
inspect(db).get_columns('observations', schema='nswfuel')

[{'name': 'servicestation',
  'type': INTEGER(),
  'nullable': False,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'observationno',
  'type': INTEGER(),
  'nullable': False,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'pricedate',
  'type': DATE(),
  'nullable': True,
  'default': None,
  'autoincrement': False,
  'comment': None},
 {'name': 'pricetime',
  'type': TIME(),
  'nullable': True,
  'default': None,
  'autoincrement': False,
  'comment': None}]

## Create a new schema!


The cell below sets the search path, which determines the schema that all of your queries look at. It currently points to `public`. Pointing it at a sydney_livability schema instead won't return anything useful until you have a sydney_livability schema in your PGAdmin Server. We first need to figure out how to share a server with others.

In [12]:
conn.execute("set search_path to public")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fe3d0183790>

Defining helper function: Query

In [13]:
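def query(conn, sqlcmd, args=None, df=True):
    """Run sqlcmd on conn; return a DataFrame if df is True, otherwise the fetched rows.

    With df=False, a single-row result is returned as that row rather than a list.
    On error, the message is printed and an empty DataFrame (or None) is returned.
    """
    result = pd.DataFrame() if df else None
    try:
        if df:
            result = pd.read_sql_query(sqlcmd, conn, params=args)
        else:
            result = conn.execute(sqlcmd, args).fetchall()
            result = result[0] if len(result) == 1 else result
    except Exception as e:
        print("Error encountered: ", e, sep='\n')
    return result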

In [None]:
# To switch off converting everything into a pandas dataframe use df = False like so:
# query(conn, "select count(*) from Fuel", df=False)

## Exploring tables provided

In [14]:
nhdata = pd.read_csv('Neighbourhoods.csv')
nhdata.head()

| Unnamed: 0.1 | Unnamed: 0 | area_id | area_name | land_area | population | number_of_dwellings | number_of_businesses | median_annual_household_income | avg_monthly_rent | 0-4 | 5-9 | 10-14 | 15-19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 102011028 | Avoca Beach - Copacabana | 643.8 | 7590 | 2325 | 738.0 | 46996.0 | 1906.0 | 467 | 583 | 604 | 560 |
| 1 | 1 | 102011029 | Box Head - MacMasters Beach | 3208.6 | 10986 | 3847 | 907.0 | 42621.0 | 1682.0 | 586 | 696 | 661 | 692 |
| 2 | 2 | 102011030 | Calga - Kulnura | 76795.1 | 4841 | 1575 | 1102.0 | 42105.0 | 1182.0 | 220 | 254 | 304 | 320 |
| 3 | 3 | 102011031 | Erina - Green Point | 3379.3 | 14237 | 4450 | 1666.0 | 43481.0 | 1595.0 | 695 | 778 | 916 | 838 |
| 4 | 4 | 102011032 | Gosford - Springfield | 1691.2 | 19385 | 6373 | 2126.0 | 45972.0 | 1382.0 | 1200 | 1079 | 963 | 977 |


In [15]:
business_stats_data = pd.read_csv('BusinessStats.csv')
business_stats_data.head()

| Unnamed: 0 | area_id | area_name | number_of_businesses | accommodation_and_food_services | retail_trade | agriculture_forestry_and_fishing | health_care_and_social_assistance | public_administration_and_safety | transport_postal_and_warehousing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 101021007 | Braidwood | 629 | 26 | 27 | 280 | 11 | 0 | 35 |
| 1 | 101021008 | Karabar | 326 | 7 | 10 | 8 | 11 | 0 | 43 |
| 2 | 101021009 | Queanbeyan | 724 | 52 | 47 | 11 | 56 | 3 | 77 |
| 3 | 101021010 | Queanbeyan - East | 580 | 16 | 23 | 4 | 12 | 0 | 57 |
| 4 | 101021011 | Queanbeyan Region | 1642 | 39 | 63 | 292 | 34 | 7 | 81 |


## Clean data provided

In [31]:
# Check for 0 or negative values
# Check for null values
# Check pandas is interpreting the right values.

     Unnamed: 0    area_id                    area_name   land_area  \
0             0  102011028     Avoca Beach - Copacabana    643.8000   
1             1  102011029  Box Head - MacMasters Beach   3208.6000   
2             2  102011030              Calga - Kulnura  76795.1000   
3             3  102011031          Erina - Green Point   3379.3000   
4             4  102011032        Gosford - Springfield   1691.2000   
..          ...        ...                          ...         ...   
317         317  106011109              Cessnock Region   1570.4341   
318         318  106011113             Singleton Region   4067.2349   
319         319  111021218        Morisset - Cooranbong    330.5208   
320         320  114021285         Hill Top - Colo Vale    174.3752   
321         321  114021289           Southern Highlands   1409.7013   

    population number_of_dwellings  number_of_businesses  \
0         7590                2325                 738.0   
1        10986             
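
The three checks listed in the comments above could be sketched as follows. The small frame is illustrative only: the first two rows are copied from the `Neighbourhoods.csv` preview, while the third row is invented to show what each check catches. Storing `population` as text with a thousands separator is also an assumption about how the raw file might look.

```python
import pandas as pd

sample = pd.DataFrame({
    "area_id": [102011028, 102011029, 999999999],
    "area_name": ["Avoca Beach - Copacabana", "Box Head - MacMasters Beach", "Invented Area"],
    "land_area": [643.8, 3208.6, -1.0],        # the negative value is invented
    "population": ["7590", "10,986", None],    # text values and one missing value
})

# Check pandas is interpreting the right values: text columns are not numeric dtypes.
print(sample.dtypes)

# Strip thousands separators, then convert; anything unparseable becomes NaN.
sample["population"] = pd.to_numeric(
    sample["population"].str.replace(",", "", regex=False), errors="coerce"
)

# Check for null values in each column.
null_counts = sample.isnull().sum()

# Check for 0 or negative values in the numeric measure columns.
numeric_cols = ["land_area", "population"]
non_positive = sample[(sample[numeric_cols] <= 0).any(axis=1)]

print(null_counts)
print(non_positive)
```

Whether flagged rows should be dropped, corrected, or kept as nulls is a decision for the group; the sketch only surfaces them.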

## Load data into new database:

In [29]:
# ALREADY EXECUTED
# Create the Neighbourhoods table:
conn.execute("""
CREATE TABLE Neighbourhoods(
    area_id INTEGER NOT NULL PRIMARY KEY,
    area_name VARCHAR(50),
    land_area FLOAT8,
    population INTEGER,
    number_of_dwellings INTEGER,
    number_of_businesses FLOAT8,
    median_annual_household_income FLOAT8,
    avg_monthly_rent FLOAT8,
    child0_4 INTEGER,
    child5_9 INTEGER,
    child10_14 INTEGER,
    child15_19 INTEGER
    )""")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fe3a8aafcd0>

In [5]:
# DO NOT RUN YET. FIRST MUST CLEAN THE DATA SO IT WILL UPLOAD CORRECTLY. 
# nhdata.to_sql("neighbourhoods", con=conn, if_exists='append', index=False)
# query(conn, "select * from neighbourhoods")

NameError: name 'query' is not defined
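
Before the commented-out `to_sql` call above can work, the DataFrame's columns need to match the `Neighbourhoods` table. The CSV preview shows leftover `Unnamed` index columns and age columns named `0-4`, `5-9`, `10-14`, and `15-19`, whereas the table uses `child0_4` through `child15_19`. A sketch of that alignment follows, using one row from the preview as sample data. Mapping the age columns to the `child…` names is an assumption based on their order.

```python
import pandas as pd

# One row copied from the Neighbourhoods.csv preview, for illustration.
nh_sample = pd.DataFrame([{
    "Unnamed: 0.1": 0, "Unnamed: 0": 0, "area_id": 102011028,
    "area_name": "Avoca Beach - Copacabana", "land_area": 643.8,
    "population": 7590, "number_of_dwellings": 2325,
    "number_of_businesses": 738.0, "median_annual_household_income": 46996.0,
    "avg_monthly_rent": 1906.0, "0-4": 467, "5-9": 583, "10-14": 604, "15-19": 560,
}])

# Assumed mapping from CSV age-band headers to the table's column names.
rename_map = {
    "0-4": "child0_4",
    "5-9": "child5_9",
    "10-14": "child10_14",
    "15-19": "child15_19",
}

def align_to_table(df):
    """Drop leftover index columns and rename the age columns to the table's names."""
    leftover = [c for c in df.columns if c.startswith("Unnamed")]
    return df.drop(columns=leftover).rename(columns=rename_map)

nh_ready = align_to_table(nh_sample)
print(list(nh_ready.columns))
```

The same function could then be applied to the full `nhdata` frame once the cleaning step is done.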

In [30]:
# DO NOT RUN YET. NEED TO CLEAN DATA FIRST
# Create the BusinessStats table:
conn.execute("""
CREATE TABLE BusinessStats(
    area_id INTEGER NOT NULL REFERENCES Neighbourhoods(area_id),
    area_name VARCHAR(50) PRIMARY KEY,
    number_of_businesses INTEGER,
    accommodation_and_food_services INTEGER,
    retail_trade INTEGER,
    agriculture_forestry_and_fishing INTEGER,
    health_care_and_social_assistance INTEGER,
    public_administration_and_safety INTEGER,
    transport_postal_and_warehousing INTEGER
    
    )""")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fe401f79580>
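
Since `BusinessStats.area_id` references `Neighbourhoods(area_id)`, PostgreSQL will reject any business row whose `area_id` is missing from `Neighbourhoods`, and a primary key column must be unique and non-null. A possible pre-upload check in pandas is sketched below. The `area_id` values are taken from the two previews above, with one shared ID added purely for illustration, so the result says nothing about the full files.

```python
import pandas as pd

nh_ids = pd.DataFrame({"area_id": [102011028, 102011029, 102011030]})
bs = pd.DataFrame({
    "area_id": [101021007, 101021008, 102011028],  # last value added for illustration
    "area_name": ["Braidwood", "Karabar", "Avoca Beach - Copacabana"],
})

def is_valid_key(df, column):
    """A candidate primary key must have no nulls and no duplicates."""
    return bool(df[column].notna().all() and df[column].is_unique)

# Rows whose area_id has no match in Neighbourhoods would violate the foreign key.
orphans = bs[~bs["area_id"].isin(nh_ids["area_id"])]

print(is_valid_key(nh_ids, "area_id"))
print(is_valid_key(bs, "area_name"))
print(orphans)
```

Running the same checks on the full frames would show whether the chosen primary and foreign keys hold before any rows are inserted.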

In [27]:
# DELETE TABLE:
# conn.execute("""
# DROP TABLE IF EXISTS neighbourhoods;
# """)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fe3f0362f70>