# Taxi Data Ingestion and Query Optimisation

## Introduction

This demo has two goals:

1. Load the New York Yellow Taxi trip data into Vast S3 and Vast DB; it is a useful dataset for big data demonstrations.
2. Compare the performance of the same query run against Vast DB and against the Parquet files in Vast S3.
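
The comparison in the second goal can be sketched as a small timing harness. This is a minimal illustration and not the code used in the notebooks: `time_query` and the two placeholder workloads are hypothetical names, and the placeholders stand in for the real Hive and Vast DB query calls.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run `run_query` several times and summarise wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Hypothetical placeholders: swap in the callables that execute the
# same SQL statement through Hive (Parquet on S3) and through Vast DB.
def hive_query_placeholder() -> int:
    return sum(range(100_000))


def vastdb_query_placeholder() -> int:
    return sum(range(10_000))


if __name__ == "__main__":
    for label, query in [
        ("hive", hive_query_placeholder),
        ("vastdb", vastdb_query_placeholder),
    ]:
        print(label, time_query(query))
```

Reporting the median over several runs, rather than a single measurement, reduces the influence of cold caches and one-off stalls on the comparison.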

## Instructions

- Ensure you are familiar with the [Setup Instructions](https://github.com/snowch/vast-docker-compose-examples/blob/main/demos/SETUP_INSTRUCTIONS.md) and have cloned and started the containers.
- Work through the notebooks in order, following the instructions in each:
  - [Part 1 - Ingest to S3](./yellow_taxi_data_pt1_ingest_to_s3.ipynb): Download Yellow Taxi Parquet files, restructure the data where necessary, and load to Vast S3.
  - [Part 2 - Ingest to Vast DB](./yellow_taxi_data_pt2_ingest_to_vastdb.ipynb): Ingest data from S3 to Vast DB using Bulk Import.
  - [Part 3 - Query with Hive](./yellow_taxi_data_pt3_hive_table.ipynb): Set up a Hive table on the Parquet files in S3 and verify that it can be queried.
  - [Part 4 - Query with Vast DB](./yellow_taxi_data_pt4_vastdb_query.ipynb): Query the data in Vast DB.
  - [Part 5 - Load Verification](./yellow_taxi_data_pt5_verification.ipynb): Verify that the same data has been loaded into S3 and Vast DB.
  - [Part 6 - Performance Comparison](./yellow_taxi_data_pt6_performance.ipynb): Using a simple query, compare the performance of Hive and Vast DB.
  - [Appendix - Schema Drift Analysis](./yellow_taxi_data_ptA_schema_duckdb.ipynb): (Optional) Fetch the Parquet schemas and analyse how the schema has drifted (evolved) over time.
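
To give a feel for the schema drift analysis in the appendix, the sketch below compares the schemas of consecutive files and reports added, removed, and retyped columns. It is an illustration under stated assumptions, not the notebook's implementation: it assumes the schemas have already been extracted into plain dictionaries (the notebook fetches them from the Parquet files), and `diff_schemas`, `schema_drift`, the file names, and the column names are all hypothetical.

```python
from typing import Dict, List, Tuple

Schema = Dict[str, str]  # column name -> type name


def diff_schemas(old: Schema, new: Schema) -> Dict[str, list]:
    """Report columns added, removed, or changed in type between two schemas."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(
        (name, old[name], new[name])
        for name in set(old) & set(new)
        if old[name] != new[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def schema_drift(
    schemas_by_file: Dict[str, Schema],
) -> List[Tuple[str, str, Dict[str, list]]]:
    """Compare each file's schema with the previous one, with files sorted by name."""
    names = sorted(schemas_by_file)
    drift = []
    for prev, curr in zip(names, names[1:]):
        changes = diff_schemas(schemas_by_file[prev], schemas_by_file[curr])
        if any(changes.values()):  # keep only pairs where something changed
            drift.append((prev, curr, changes))
    return drift


if __name__ == "__main__":
    # Made-up schemas for illustration only; not taken from the taxi dataset.
    example = {
        "example_01.parquet": {"col_a": "int64", "col_b": "double"},
        "example_02.parquet": {"col_a": "int32", "col_c": "string"},
    }
    for prev, curr, changes in schema_drift(example):
        print(f"{prev} -> {curr}: {changes}")
```

Sorting by file name assumes the names sort chronologically; if they do not, sort on whatever date key the files carry instead.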
