This repository contains the code for the assessment for a data engineer position. It is an ETL pipeline written in PySpark that ingests a dataset from [Kaggle](https://www.kaggle.com/rohitsahoo/sales-forecasting).
To run it, build the image and then start the container with the following directory structure mounted at the `/dataLake` path (see the example after the layout below):
- dataLake
  - landing
  - raw
  - consumption
  - curated
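
For example, assuming the directory tree above lives at `./dataLake` on the host and the image is tagged `sales-etl` (both names are illustrative, not defined by this repository), the build and run could look roughly like this:

```bash
# Create the expected data lake layout on the host
mkdir -p dataLake/{landing,raw,consumption,curated}

# Build the image; the tag "sales-etl" is a placeholder
docker build -t sales-etl .

# Start the container with the data lake mounted at /dataLake
docker run -it -v "$(pwd)/dataLake:/dataLake" sales-etl
```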
As no requirements were provided about operations, the simplest case is assumed: a daily process of one file. One file at a time should be placed in the landing zone, after which the pipeline is ready to run. The container is launched with the bash entrypoint, so the main script has to be called explicitly via `python3 main.py` (a sketch of the daily cycle follows below).
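
A minimal sketch of that daily cycle, assuming the container was started as shown above and the day's input file is called `train.csv` (the actual file name from the Kaggle dataset may differ):

```bash
# On the host: drop the day's file into the landing zone
cp train.csv dataLake/landing/

# Inside the running container: trigger the pipeline
python3 main.py
```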