Creating an S3 data lake with a PySpark ETL pipeline.
The first step is to use pandas to extract only the required columns from the raw data and to write the files to the data lake in Parquet format.
Queries to be supported: the lines where the gap between expected and actual arrival time is long.
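One way to sketch that delay query, shown here with pandas for brevity (the PySpark DataFrame API supports the same subtraction-and-filter pattern); the column names and the 10-minute threshold are assumptions, not from the source:

```python
import pandas as pd

def lines_with_long_delays(
    df: pd.DataFrame,
    threshold: pd.Timedelta = pd.Timedelta(minutes=10),  # assumed cutoff
) -> pd.DataFrame:
    """Return rows whose actual arrival trails the expected one by more than threshold."""
    delay = df["actual_arrival"] - df["expected_arrival"]
    return df[delay > threshold]
```

Because the data lake stores Parquet, a PySpark version of this query would only need to scan the two timestamp columns, which is exactly the access pattern the column-pruned extraction step above is designed for.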