Create a SQLAlchemy engine and connect it to a local Postgres database.

The database first needs to be created with Docker Compose, using a file called [`docker-compose.yaml`](docker-compose.yaml):
```yaml
services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_green_taxi
    volumes:
      - "/home/sam/git/data-engineering-zoomcamp-notes/Green_Taxi/postgres:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"
```
This is then run with the command `docker-compose up` (add `-d` to run in detached mode); `docker-compose down` stops and removes the containers.

In [39]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_green_taxi')
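
The connection URL carries the same values that `docker-compose.yaml` sets (user, password, port, database name). As a minimal sketch using only the standard library, the hypothetical helper `build_pg_url` below (not part of SQLAlchemy or of the original notebook) assembles the string from its parts and percent-encodes the credentials:

```python
from urllib.parse import quote, urlsplit


def build_pg_url(user, password, host, port, database):
    """Assemble a postgresql:// URL (hypothetical helper, for illustration only)."""
    # Percent-encode the credentials so characters like '@' or ':' cannot break the URL
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Values taken from the docker-compose.yaml above
url = build_pg_url("root", "root", "localhost", 5432, "ny_green_taxi")
print(url)

# Splitting the URL again shows which part is which
parts = urlsplit(url)
print(parts.username, parts.hostname, parts.port, parts.path.lstrip("/"))
```

The resulting string can be passed to `create_engine` in place of the hard-coded literal.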

Creating the schema is best done with the engine passed in as the connection, so that pandas generates the DDL in the SQL dialect of the target database (here PostgreSQL).

In [40]:
schema = pd.io.sql.get_schema(df, name='green_taxi_data', con=engine)
print(schema)


CREATE TABLE green_taxi_data (
	"VendorID" BIGINT, 
	lpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE, 
	lpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE, 
	store_and_fwd_flag TEXT, 
	"RatecodeID" BIGINT, 
	"PULocationID" BIGINT, 
	"DOLocationID" BIGINT, 
	passenger_count BIGINT, 
	trip_distance FLOAT(53), 
	fare_amount FLOAT(53), 
	extra FLOAT(53), 
	mta_tax FLOAT(53), 
	tip_amount FLOAT(53), 
	tolls_amount FLOAT(53), 
	ehail_fee FLOAT(53), 
	improvement_surcharge FLOAT(53), 
	total_amount FLOAT(53), 
	payment_type BIGINT, 
	trip_type BIGINT, 
	congestion_surcharge FLOAT(53)
)
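
To see why the connection matters, the sketch below calls `get_schema` on a small sample frame without passing `con`. As far as I can tell, pandas then falls back to its built-in SQLite type mapping, so the type names differ from the PostgreSQL output above. The sample values are made up for illustration.

```python
import pandas as pd

# A tiny made-up frame with one column of each kind seen in the taxi data
sample = pd.DataFrame({
    "VendorID": [2],
    "lpep_pickup_datetime": pd.to_datetime(["2019-01-01 00:10:16"]),
    "trip_distance": [0.86],
    "store_and_fwd_flag": ["N"],
})

# No `con` argument: pandas has no engine to ask for a dialect
generic_ddl = pd.io.sql.get_schema(sample, name="green_taxi_data")
print(generic_ddl)
```

Comparing this with the engine-backed output shows, for example, that the float column is no longer declared as `FLOAT(53)`.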




Uploading in chunks:

In [41]:
df_iter = pd.read_csv('green_tripdata_2019-01.csv', iterator=True, chunksize=100_000)
df = next(df_iter)
len(df)

100000

In [42]:
df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
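
An alternative to converting the columns after loading is to let `read_csv` parse them via its `parse_dates` parameter, which should also apply to each chunk when `chunksize` is set. A minimal sketch on two made-up rows standing in for the real file:

```python
import io

import pandas as pd

# Two made-up rows standing in for green_tripdata_2019-01.csv
csv_text = (
    "VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,trip_distance\n"
    "2,2019-01-01 00:10:16,2019-01-01 00:16:32,0.86\n"
    "2,2019-01-01 00:27:11,2019-01-01 00:31:38,0.66\n"
)

# parse_dates converts the named columns while reading
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["lpep_pickup_datetime", "lpep_dropoff_datetime"],
)
print(sample.dtypes)
```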

Create the table from the column headers only (`head(n=0)` writes no rows):

In [43]:
df.head(n=0).to_sql(name='green_taxi_data', con=engine, if_exists='replace')

0

Upload the first 100,000 rows, then the remaining chunks:

In [44]:
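# Insert the first chunk, which was already read and converted above
df.to_sql(name='green_taxi_data', con=engine, if_exists='append')
n = 100_000  # running count, assuming every chunk is full

while True:
    try:
        df = next(df_iter)
    except StopIteration:
        print("finished upload")
        break

    # Parse the timestamp columns, as for the first chunk
    df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
    df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)

    df.to_sql(name='green_taxi_data', con=engine, if_exists='append')
    n += 100_000  # overcounts on the final, partial chunk
    print(f'inserted chunk {n}')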

inserted chunk 200000
inserted chunk 300000
inserted chunk 400000
inserted chunk 500000
inserted chunk 600000
inserted chunk 700000
finished upload
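
The same loop can be written as a plain `for` over the chunk iterator, counting the rows actually read so the final partial chunk is reported exactly. The sketch below is a stand-in, not the notebook's code: it uses an in-memory SQLite connection instead of the Postgres engine, 25 made-up rows instead of the real CSV, and leaves out the timestamp parsing for brevity.

```python
import io
import sqlite3

import pandas as pd

# 25 made-up rows standing in for the real CSV
rows = "\n".join(f"{i},{i * 0.5}" for i in range(25))
csv_text = "trip_id,trip_distance\n" + rows + "\n"

# In-memory SQLite stands in for the Postgres engine (assumption for this sketch)
con = sqlite3.connect(":memory:")

total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    # if_exists="append" creates the table on the first chunk and adds to it afterwards
    chunk.to_sql("green_taxi_data", con=con, if_exists="append", index=False)
    total += len(chunk)  # rows actually read, so the last partial chunk is exact
    print(f"inserted {total} rows so far")

# Check that the table holds as many rows as were read
stored = con.execute("SELECT COUNT(*) FROM green_taxi_data").fetchone()[0]
print(stored)
```

Passing `index=False` is a choice made for this sketch: `to_sql` defaults to `index=True`, which writes the DataFrame index as an extra column.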


One of the questions requires a separate lookup file, which is uploaded here. It is small enough to load in one go, so it does not need to be read in chunks.

In [48]:
df = pd.read_csv('taxi_zone_lookup.csv')
df.head(n=0).to_sql(name='taxi_zone_lookup', con=engine, if_exists='replace')
df.to_sql(name='taxi_zone_lookup', con=engine, if_exists='append')

265