vledouts/HemaAssessment

HemaAssessment

This repository contains the code for an assessment for a data engineer position. It is an ETL pipeline written in PySpark that ingests a dataset from [Kaggle](https://www.kaggle.com/rohitsahoo/sales-forecasting).

To run it, build the image, then run the container with the following directory structure mounted at the /dataLake path:

  • dataLake:
    • landing
    • raw
    • consumption
    • curated
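
The layout above can be prepared on the host before mounting it. This is a minimal sketch, assuming Python 3 is available on the host; the helper name `create_data_lake` is illustrative and is not part of this repository.

```python
from pathlib import Path

# Zone names taken from the directory structure listed above.
ZONES = ("landing", "raw", "consumption", "curated")


def create_data_lake(root):
    """Create the root directory and one subdirectory per zone.

    Hypothetical helper for illustration, not part of the repository.
    Returns the zone paths in the order of ZONES.
    """
    root_path = Path(root)
    created = []
    for zone in ZONES:
        zone_path = root_path / zone
        # parents=True creates the root if needed; exist_ok=True makes
        # the call safe to repeat on an existing layout.
        zone_path.mkdir(parents=True, exist_ok=True)
        created.append(zone_path)
    return created
```

Calling `create_data_lake("dataLake")` in the current directory produces a `dataLake` folder with the four zones, which can then be mounted at /dataLake when the container is started.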

As no operational requirements were provided, the simplest case is assumed: a daily process that handles one file. Put one file at a time in the landing zone; the pipeline is then ready to run. The container is launched with a bash entrypoint, so you need to call the main script yourself via "python3 main.py".
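
Because the pipeline assumes a single file in the landing zone per run, it can help to check that assumption before calling the main script. This is only a sketch of such a check; the function name `find_single_landing_file` is hypothetical, and the repository's main.py may handle this case differently.

```python
from pathlib import Path


def find_single_landing_file(landing_dir):
    """Return the only regular file in landing_dir.

    Hypothetical pre-flight check, not part of the repository.
    Raises ValueError when the directory holds zero or several files.
    """
    files = sorted(p for p in Path(landing_dir).iterdir() if p.is_file())
    if len(files) != 1:
        raise ValueError(
            f"expected exactly one file in {landing_dir}, found {len(files)}"
        )
    return files[0]
```

Inside the container, the landing zone would be at /dataLake/landing given the mount described above, so the check could be run as `find_single_landing_file("/dataLake/landing")` before "python3 main.py".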
