# Data ingestion

**Libraries & Imports**

In [1]:
from time import time
import datetime
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

from warnings import simplefilter
simplefilter('ignore')

In [2]:
# Safety check: pandas version
# pd.__version__

## Prepare dataset for ingestion

In [3]:
# Load Dataset from parquet file
df = pd.read_parquet('../yellow_tripdata_2021-01.parquet')

In [4]:
df.shape

(1369769, 19)

In [5]:
# Check df dtypes
df.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
dtype: object

**Because the _parquet_ format stores each column's data type, the `dtypes` attribute already shows correctly parsed types.**

Even so, we will convert this data from **_.parquet_** to **_.csv_**. That will allow us to use `pandas.read_csv` with the argument `iterator = True` later, which is useful because we will be loading the data into our database in batches/chunks.
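
As a minimal sketch of what chunked reading looks like, the example below reads a tiny in-memory CSV in place of the real trip file. The rows, the `trip_id` column, and the chunk size of 4 are made up for illustration. It also shows that a CSV carries no dtype information, so the timestamp column does not come back as a datetime column unless it is converted.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real trip file (made-up rows).
csv_text = "trip_id,tpep_pickup_datetime\n" + "".join(
    f"{i},2021-01-01 00:{i:02d}:00\n" for i in range(10)
)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True, chunksize=4)

chunk_sizes = []
datetime_parsed = []
for chunk in reader:
    chunk_sizes.append(len(chunk))
    # Record whether pandas treated the timestamp column as datetime.
    datetime_parsed.append(
        pd.api.types.is_datetime64_any_dtype(chunk["tpep_pickup_datetime"])
    )

print(chunk_sizes)      # number of rows in each chunk
print(datetime_parsed)  # whether each chunk's column is datetime-typed
```

Every chunk is an ordinary `DataFrame`, so any per-chunk cleaning, such as `pd.to_datetime`, can be applied before the chunk is written to the database.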

In [6]:
# From parquet to CSV
# df.to_csv('../yellow_tripdata_2021-01.csv', header=True, index=False)

The IO module in pandas allows us to get the correct SQL syntax for creating a table exactly as we need it.

Let's make use of this convenient tool:

In [7]:
# Get SQL command to create table from pandas DataFrame
print(pd.io.sql.get_schema(df, name = 'yellow_taxi_data'))

CREATE TABLE "yellow_taxi_data" (
"VendorID" INTEGER,
  "tpep_pickup_datetime" TIMESTAMP,
  "tpep_dropoff_datetime" TIMESTAMP,
  "passenger_count" REAL,
  "trip_distance" REAL,
  "RatecodeID" REAL,
  "store_and_fwd_flag" TEXT,
  "PULocationID" INTEGER,
  "DOLocationID" INTEGER,
  "payment_type" INTEGER,
  "fare_amount" REAL,
  "extra" REAL,
  "mta_tax" REAL,
  "tip_amount" REAL,
  "tolls_amount" REAL,
  "improvement_surcharge" REAL,
  "total_amount" REAL,
  "congestion_surcharge" REAL,
  "airport_fee" REAL
)


**We can also get more specific SQL code according to the database tool we are using.** In this case, we are using **PostgreSQL**.

For that, we first need to create a connection to the database and pass that connection to the `pandas.io` module.

In [8]:
# Create postgres connection
# Docker engine & postgres container running

# Create engine
engine = create_engine('postgresql://root:root@localhost:5432/nyc_taxi')
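
The connection string above follows the `postgresql://user:password@host:port/database` pattern. As a hedged sketch, the same string could be assembled from separate parts so that credentials do not have to be hard-coded. The helper `build_postgres_url` and the `PG_*` environment variable names are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Assemble a `postgresql://user:password@host:port/database` URL."""
    # quote_plus escapes characters such as '@' or ':' that would
    # otherwise break the URL when they appear in the credentials.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# PG_USER, PG_PASSWORD, ... are hypothetical variable names; the defaults
# mirror the values hard-coded in the cell above.
url = build_postgres_url(
    user=os.environ.get("PG_USER", "root"),
    password=os.environ.get("PG_PASSWORD", "root"),
    host=os.environ.get("PG_HOST", "localhost"),
    port=os.environ.get("PG_PORT", "5432"),
    database=os.environ.get("PG_DATABASE", "nyc_taxi"),
)

# The resulting string can be passed to sqlalchemy.create_engine(url).
print(url)
```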

Finally, let's see the SQL syntax specific for **PostgreSQL**:

In [9]:
# Create SQL code for initiating schema
print(pd.io.sql.get_schema(df, name = 'yellow_taxi_data', con = engine))

CREATE TABLE yellow_taxi_data (
	"VendorID" BIGINT, 
	tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE, 
	tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE, 
	passenger_count FLOAT(53), 
	trip_distance FLOAT(53), 
	"RatecodeID" FLOAT(53), 
	store_and_fwd_flag TEXT, 
	"PULocationID" BIGINT, 
	"DOLocationID" BIGINT, 
	payment_type BIGINT, 
	fare_amount FLOAT(53), 
	extra FLOAT(53), 
	mta_tax FLOAT(53), 
	tip_amount FLOAT(53), 
	tolls_amount FLOAT(53), 
	improvement_surcharge FLOAT(53), 
	total_amount FLOAT(53), 
	congestion_surcharge FLOAT(53), 
	airport_fee FLOAT(53)
)




We can now use this SQL statement in a script to create the table within our database.
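
As a rough illustration of that route, the sketch below generates the `CREATE TABLE` statement with `get_schema` and executes it. To keep it self-contained, it assumes an in-memory SQLite database instead of the PostgreSQL container, and the sample frame and the table name `yellow_taxi_sample` are made up.

```python
import sqlite3

import pandas as pd

# Small made-up frame with a few of the trip columns.
sample = pd.DataFrame(
    {
        "VendorID": [1, 2],
        "trip_distance": [1.5, 2.0],
        "store_and_fwd_flag": ["N", "Y"],
    }
)

# In-memory SQLite database standing in for the real PostgreSQL server.
conn = sqlite3.connect(":memory:")

# Generate the DDL for this connection and run it as a plain SQL statement.
ddl = pd.io.sql.get_schema(sample, name="yellow_taxi_sample", con=conn)
conn.execute(ddl)

# Inspect the result: the table exists with the expected columns and no rows.
created_columns = [
    row[1] for row in conn.execute("PRAGMA table_info(yellow_taxi_sample)")
]
row_count = conn.execute("SELECT COUNT(*) FROM yellow_taxi_sample").fetchone()[0]

print(created_columns)
print(row_count)
```

Note that `get_schema` only produces the statement text. Nothing is created in the database until the statement is executed.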

But there is an even simpler way to do the same task: using `pandas.DataFrame.to_sql`!

NOTE: We will be using the same connection engine created earlier in this code.

## Create table w/ DataFrame metadata

In [10]:
# List columns as a pandas.DataFrame
df.head(0)

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee


In [11]:
# Create table in the DB with the pandas .to_sql method
df.head(n=0).to_sql(name = 'yellow_taxi_data', con=engine, if_exists='replace')
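
To see what this call produces, here is a small sketch under the assumption of an in-memory SQLite database rather than the PostgreSQL container. The sample frame and the table name `yellow_taxi_sample` are made up. Writing a zero-row frame creates the table structure without inserting any data.

```python
import sqlite3

import pandas as pd

# Made-up frame with two of the trip columns.
sample = pd.DataFrame({"VendorID": [1, 2], "fare_amount": [8.0, 16.5]})

# In-memory SQLite database standing in for the real PostgreSQL server.
conn = sqlite3.connect(":memory:")

# head(0) keeps the column names and dtypes but drops every row.
sample.head(0).to_sql(name="yellow_taxi_sample", con=conn, if_exists="replace")

columns = [row[1] for row in conn.execute("PRAGMA table_info(yellow_taxi_sample)")]
n_rows = conn.execute("SELECT COUNT(*) FROM yellow_taxi_sample").fetchone()[0]

print(columns)  # includes an extra "index" column
print(n_rows)
```

With the default `index=True`, `to_sql` also writes the DataFrame index as a column named `index`. Passing `index=False` would omit it, as long as the same choice is made for every later append to the table.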

In [12]:
# %time df.to_sql(name = 'yellow_taxi_data', con=engine, if_exists='append')

## Ingest Data into the database

### `pandas.DataFrame` iterator

With an iterator of `pandas.DataFrame` chunks we can upload data to our database in batches/chunks.

That's usually the way to go for tables with many, many records.
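
The pattern can be sketched end to end on toy data. This example assumes an in-memory SQLite database and a ten-row in-memory CSV in place of the real file and the PostgreSQL container. The table name `yellow_taxi_sample` and the chunk size of 4 are made up.

```python
import io
import math
import sqlite3

import pandas as pd

# Ten made-up rows standing in for the real CSV file.
total_rows, chunksize = 10, 4
csv_text = "VendorID,tpep_pickup_datetime\n" + "".join(
    f"{i % 2 + 1},2021-01-01 00:{i:02d}:00\n" for i in range(total_rows)
)

# Number of chunks needed to cover every row (the last one may be smaller).
n_chunks_total = math.ceil(total_rows / chunksize)

# In-memory SQLite database standing in for the real PostgreSQL server.
conn = sqlite3.connect(":memory:")

chunks = pd.read_csv(io.StringIO(csv_text), chunksize=chunksize)
for i, chunk in enumerate(chunks, start=1):
    # CSV stores timestamps as text, so convert them before writing.
    chunk["tpep_pickup_datetime"] = pd.to_datetime(chunk["tpep_pickup_datetime"])
    # Append this chunk; the table is created on the first iteration.
    chunk.to_sql(name="yellow_taxi_sample", con=conn, if_exists="append", index=False)
    print(f"chunk {i}/{n_chunks_total}: {len(chunk)} rows")

# Verify that every row arrived.
ingested_rows = conn.execute("SELECT COUNT(*) FROM yellow_taxi_sample").fetchone()[0]
print(ingested_rows)
```

Comparing the final row count against the source row count is a cheap sanity check after a chunked load.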

In [13]:
# Create the DataFrame iterator
df_iter = pd.read_csv('../yellow_tripdata_2021-01.csv', iterator = True, chunksize=100_000)

In [14]:
chunksize = 100_000
# Total number of chunks, used for progress reporting
n_chunks_total = int(np.ceil(df.shape[0] / chunksize))

# Uncomment to reset db
# df.head(n=0).to_sql(name = 'yellow_taxi_data', con=engine, if_exists='replace')

for i, df_chunk in enumerate(df_iter):
    
    t_start = time()
    
    # Change date columns to datetime type objects
    df_chunk['tpep_pickup_datetime'] = pd.to_datetime(df_chunk['tpep_pickup_datetime'])
    df_chunk['tpep_dropoff_datetime'] = pd.to_datetime(df_chunk['tpep_dropoff_datetime'])
    
    # Append chunk to the database
    df_chunk.to_sql(name = 'yellow_taxi_data', con=engine, if_exists='append')
    
    t_end = time()
    
    print(f'{datetime.datetime.now()} - [ Chunk {(i + 1):02d}/{n_chunks_total} ] - Chunk ingested into database in {round((t_end - t_start), 3)} seconds')

2022-08-07 16:21:27.865335 - [ Chunk 01/14 ] - Chunk ingested into database in 14.021 seconds
2022-08-07 16:21:41.943959 - [ Chunk 02/14 ] - Chunk ingested into database in 13.942 seconds
2022-08-07 16:21:56.107810 - [ Chunk 03/14 ] - Chunk ingested into database in 14.03 seconds
2022-08-07 16:22:10.792055 - [ Chunk 04/14 ] - Chunk ingested into database in 14.554 seconds
2022-08-07 16:22:25.367912 - [ Chunk 05/14 ] - Chunk ingested into database in 14.453 seconds
2022-08-07 16:22:40.166606 - [ Chunk 06/14 ] - Chunk ingested into database in 14.674 seconds
2022-08-07 16:22:57.108284 - [ Chunk 07/14 ] - Chunk ingested into database in 16.812 seconds
2022-08-07 16:23:13.044763 - [ Chunk 08/14 ] - Chunk ingested into database in 15.812 seconds
2022-08-07 16:23:28.679159 - [ Chunk 09/14 ] - Chunk ingested into database in 15.511 seconds
2022-08-07 16:23:45.372157 - [ Chunk 10/14 ] - Chunk ingested into database in 16.566 seconds
2022-08-07 16:24:07.106220 - [ Chunk 11/14 ] - Chunk ingested