Data Engineering Code Challenge (PySpark)

Objective:

The purpose of this challenge is to evaluate your skills in data engineering, particularly in working with PySpark for data processing and transformation tasks. This challenge mimics real-world scenarios you might encounter in the role.

Technical information:

  • Use Python and PySpark [https://spark.apache.org/docs/latest/api/python/getting_started/install.html]
  • Use chispa (https://github.com/MrPowers/chispa) for the PySpark tests; the application needs to have tests (see the test sketch after this list).
  • Do not use notebooks (e.g. Jupyter) for the final delivery. They are fine for interactive work and prototyping, but the submitted solution should not use them.
  • The application should have an automated build pipeline using GitHub Actions.
  • Follow best practices by using descriptive commit messages.
  • Use logging to show what the application is doing and avoid print statements in the application code. You may use prints in the tests for your own debugging, but remove them from the final version.
  • Use type hints and docstrings as much as possible to enhance code documentation and readability.
  • Consider using linters and code formatters, and run them in the Continuous Integration step of your automated build.
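
To illustrate the chispa requirement, here is a minimal test sketch assuming pytest; the module `transformations` and the function `add_price_category` are hypothetical placeholders for your own code.

    import pytest
    from pyspark.sql import SparkSession
    from chispa import assert_df_equality

    from transformations import add_price_category  # hypothetical module/function


    @pytest.fixture(scope="session")
    def spark():
        # Local SparkSession shared across the test session.
        return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


    def test_add_price_category(spark):
        source_df = spark.createDataFrame(
            [(1, 10.0), (2, 50.0), (3, 150.0)], ["transaction_id", "price"]
        )
        expected_df = spark.createDataFrame(
            [(1, 10.0, "Low"), (2, 50.0, "Medium"), (3, 150.0, "High")],
            ["transaction_id", "price", "price_category"],
        )
        # chispa compares schemas and rows and prints a readable diff on failure.
        assert_df_equality(add_price_category(source_df), expected_df, ignore_nullable=True)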

Background:

You work for a company that analyzes retail sales data from various stores. Your task is to process raw transactional data, generate insights, and save the transformed data for further analysis.

The raw data consists of:

  • A Sales Transactions dataset with information about products sold, quantities, and prices.
  • A Products dataset containing product details.
  • A Stores dataset with details about stores.

Data Description

  1. Sales Transactions Dataset (sales.csv):
    • transaction_id: Unique identifier for the transaction.
    • store_id: ID of the store where the transaction occurred.
    • product_id: ID of the product sold.
    • quantity: Number of units sold.
    • transaction_date: Date of the transaction (format: YYYY-MM-DD).
    • price: Price per unit.
  2. Products Dataset (products.csv):
    • product_id: Unique identifier for the product.
    • product_name: Name of the product.
    • category: Category of the product.
  3. Stores Dataset (stores.csv):
    • store_id: Unique identifier for the store.
    • store_name: Name of the store.
    • location: Location of the store.
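
One way to enforce the column types is to define explicit schemas up front. The following is a sketch derived from the descriptions above; the concrete types (e.g. StringType for the IDs, DoubleType for price) are assumptions to adjust against the real files.

    from pyspark.sql.types import (
        DateType, DoubleType, IntegerType, StringType, StructField, StructType
    )

    SALES_SCHEMA = StructType([
        StructField("transaction_id", StringType(), nullable=False),
        StructField("store_id", StringType(), nullable=False),
        StructField("product_id", StringType(), nullable=False),
        StructField("quantity", IntegerType(), nullable=True),
        StructField("transaction_date", DateType(), nullable=True),
        StructField("price", DoubleType(), nullable=True),
    ])

    PRODUCTS_SCHEMA = StructType([
        StructField("product_id", StringType(), nullable=False),
        StructField("product_name", StringType(), nullable=True),
        StructField("category", StringType(), nullable=True),
    ])

    STORES_SCHEMA = StructType([
        StructField("store_id", StringType(), nullable=False),
        StructField("store_name", StringType(), nullable=True),
        StructField("location", StringType(), nullable=True),
    ])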

Tasks

You need to perform the following tasks using PySpark (illustrative sketches for each part follow the list):

  1. Part 1: Data Preparation
  • Load the three datasets (sales.csv, products.csv, and stores.csv) into PySpark DataFrames.
  • Perform basic data validation:
    • Check for missing or null values as well as inconsistencies in the data format.
    • Identify and handle duplicates, if any.
    • Enforce the appropriate data types for all columns.
  2. Part 2: Data Transformation

    Sales Aggregation:

    • Calculate the total revenue for each store (store_id) and each product category.
    • Output: DataFrame with store_id, category, and total_revenue.

    Monthly Sales Insights:

    • Calculate the total quantity sold for each product category, grouped by month.
    • Output: DataFrame with year, month, category, and total_quantity_sold.

    Enrich Data:

    • Combine the sales, products, and stores datasets into a single enriched dataset with the following columns: transaction_id, store_name, location, product_name, category, quantity, transaction_date, and price.

    PySpark UDF:

    • Implement a PySpark UDF to categorize products based on the following price ranges:

      Low: price < 20
      Medium: 20 <= price <= 100
      High: price > 100
      
    • Add a column price_category to the enriched dataset and save it as an additional output.

  3. Part 3: Data Export

    • Save the enriched dataset from Part 2, Task 3, in Parquet format, partitioned by category and transaction_date.
    • Save the store_id-level revenue insights (from Part 2, Task 1) in CSV format.
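
The sketches below illustrate one possible shape for each part; function names, schemas, and paths are placeholders rather than a prescribed design. For Part 1, loading and basic validation might look like this (reusing explicit schemas such as the ones sketched under Data Description):

    import logging

    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType

    logger = logging.getLogger(__name__)


    def load_csv(spark: SparkSession, path: str, schema: StructType) -> DataFrame:
        """Load a CSV file with an explicit schema instead of inferring types."""
        return spark.read.csv(path, header=True, schema=schema, mode="PERMISSIVE")


    def validate(df: DataFrame, key_columns: list[str]) -> DataFrame:
        """Log per-column null counts and drop duplicate keys."""
        null_counts = df.select(
            [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
        ).first().asDict()
        logger.info("Null counts per column: %s", null_counts)
        deduplicated = df.dropDuplicates(key_columns)
        logger.info("Dropped %d duplicate rows", df.count() - deduplicated.count())
        return deduplicated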
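
A sketch of the two Part 2 aggregations (revenue per store and category, monthly quantity per category):

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F


    def revenue_per_store_and_category(sales: DataFrame, products: DataFrame) -> DataFrame:
        """Total revenue per store_id and product category."""
        return (
            sales.join(products, on="product_id", how="inner")
            .withColumn("revenue", F.col("quantity") * F.col("price"))
            .groupBy("store_id", "category")
            .agg(F.sum("revenue").alias("total_revenue"))
        )


    def monthly_quantity_per_category(sales: DataFrame, products: DataFrame) -> DataFrame:
        """Total quantity sold per category, grouped by year and month."""
        return (
            sales.join(products, on="product_id", how="inner")
            .withColumn("year", F.year("transaction_date"))
            .withColumn("month", F.month("transaction_date"))
            .groupBy("year", "month", "category")
            .agg(F.sum("quantity").alias("total_quantity_sold"))
        )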
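
A sketch of the Part 2 enrichment join plus the required price-category UDF:

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType


    def categorize_price(price: float) -> str:
        """Bucket a unit price into the ranges defined above."""
        if price is None:
            return None
        if price < 20:
            return "Low"
        if price <= 100:
            return "Medium"
        return "High"


    price_category_udf = F.udf(categorize_price, StringType())


    def enrich(sales: DataFrame, products: DataFrame, stores: DataFrame) -> DataFrame:
        """Join the three datasets and add the price_category column."""
        return (
            sales.join(products, on="product_id", how="left")
            .join(stores, on="store_id", how="left")
            .select(
                "transaction_id", "store_name", "location", "product_name",
                "category", "quantity", "transaction_date", "price",
            )
            .withColumn("price_category", price_category_udf(F.col("price")))
        )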
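
A sketch of the Part 3 export; the output directory is a placeholder:

    from pyspark.sql import DataFrame


    def export(enriched: DataFrame, revenue: DataFrame, output_dir: str) -> None:
        """Write the enriched dataset as partitioned Parquet and the revenue insights as CSV."""
        (
            enriched.write.mode("overwrite")
            .partitionBy("category", "transaction_date")
            .parquet(f"{output_dir}/enriched")
        )
        revenue.write.mode("overwrite").csv(f"{output_dir}/revenue_per_store_and_category", header=True)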

Deliverables

  • Your solution should be uploaded and submitted using this repo as the base (downloading, cloning, or forking it is not allowed).

  • Your solution should follow good project architecture.

  • Output files:

    • Enriched dataset in Parquet format.
    • Revenue insights in CSV format.
    • A documentation file in Markdown describing your project, including:
      • Your approach to the problem.
      • Any assumptions or decisions made.
      • Steps to run the code and reproduce results.
  • Submission Guidelines

    • Once you finish the challenge, ping us via the contact email we provided.

Evaluation Criteria

  • Correctness and completeness of the solution.
  • Efficient use of PySpark for data processing.
  • Code quality (readability, comments, modularity, good practices).
  • Adherence to the deliverable format and instructions.
  • Handling of edge cases and data validation.
