Welcome to the Pri-Med Data Engineering coding challenge! This exercise is designed to assess how you handle real-world data quality issues, along with your data modeling, SQL, and documentation skills.
Estimated Time: 1-2 hours
- Click the "Use this template" button at the top of this GitHub repository
- Select "Create a new repository"
- Set the repository name to:
data-engineering-coding-challenge-<your-username>
- Example: data-engineering-coding-challenge-jdoe (where jdoe is your username)
- Important: Set the repository to Public so we can review your submission
- Click "Create repository"
You now have your own copy of the challenge to work in!
git clone https://github.com/<your-org>/data-engineering-coding-challenge-<your-username>.git
cd data-engineering-coding-challenge-<your-username>
Note: If you have issues creating a templated repo, you can download and send the finished work as a .zip to brose@pri-med.com
You will be working with a simulated e-commerce dataset that contains common data quality issues you'd encounter in production environments. Your task is to clean, transform, and model this data in Snowflake, then answer business questions using SQL queries.
Snowflake Account: https://ewqtnsz-coding_challenge.snowflakecomputing.com
You have been provided with credentials to access a dedicated Snowflake account. Your environment includes:
- READ-ONLY Access: the SOURCES.TEST_DATA database and schema containing the raw source data
- FULL Access: a personal database (e.g., JDOE if your name is Jane Doe) where you can create schemas, tables, views, etc.
Navigate to Snowflake and create a new worksheet or workspace:
https://ewqtnsz-coding_challenge.snowflakecomputing.com
The source data has been pre-loaded into your Snowflake account in the SOURCES.TEST_DATA schema. The data represents an e-commerce platform with the following tables:
You can access the source data with:
SELECT * FROM SOURCES.TEST_DATA.ORDERS;
SELECT * FROM SOURCES.TEST_DATA.CUSTOMERS;
SELECT * FROM SOURCES.TEST_DATA.PRODUCTS;
SELECT * FROM SOURCES.TEST_DATA.ORDER_ITEMS;
Your solution should handle these real-world data problems:
- Missing Values: NULL, empty strings, and blank fields
- Duplicate Records: Exact duplicates that need deduplication
- Inconsistent Date Formats: Multiple date formats across and within files, e.g. 2024-01-15, 01/16/2024, Jan 20 2024, 2024/01/18 (see the cleaning sketch after this list)
- Inconsistent Formatting:
- Email addresses in different cases
- Phone numbers in various formats
- Data Type Issues: Numbers stored as strings, NULLs represented as text
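To make the expectations concrete, here is a minimal cleaning sketch in Snowflake SQL. The column names (raw_order_date, email, phone, amount) are assumptions for illustration only; adjust them to the actual columns you find in SOURCES.TEST_DATA.
-- Sketch only: assumed column names; adapt to the real source schema.
SELECT
    COALESCE(
        TRY_TO_DATE(raw_order_date, 'YYYY-MM-DD'),
        TRY_TO_DATE(raw_order_date, 'MM/DD/YYYY'),
        TRY_TO_DATE(raw_order_date, 'YYYY/MM/DD'),
        TRY_TO_DATE(raw_order_date, 'MON DD YYYY')
    )                                        AS order_date,    -- normalize mixed date formats
    LOWER(TRIM(email))                       AS email,         -- consistent email casing
    REGEXP_REPLACE(phone, '[^0-9]', '')      AS phone_digits,  -- keep digits only
    TRY_TO_NUMBER(NULLIF(TRIM(amount), ''))  AS amount         -- numbers stored as strings or empty text
FROM SOURCES.TEST_DATA.ORDERS;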
Start by exploring the source data to understand the quality issues:
- Query the SOURCES.TEST_DATA tables to understand the data structure
- Identify and document data quality issues (duplicates, nulls, format inconsistencies)
- Create a brief data quality assessment summary
Deliverable: Documentation of issues found and your planned approach
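A lightweight profiling query can back the assessment with concrete counts. The sketch below assumes CUSTOMERS has customer_id and email columns; treat it as a starting point rather than a required approach.
-- Sketch only: assumed columns on SOURCES.TEST_DATA.CUSTOMERS.
SELECT
    COUNT(*)                                           AS total_rows,
    COUNT(*) - COUNT(DISTINCT customer_id)             AS duplicate_or_null_ids,    -- rows beyond one per distinct id
    SUM(IFF(email IS NULL OR TRIM(email) = '', 1, 0))  AS missing_emails,           -- NULLs and blank strings
    COUNT(DISTINCT LOWER(TRIM(email)))                 AS distinct_emails_normalized
FROM SOURCES.TEST_DATA.CUSTOMERS;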
Design and implement dimensional and/or fact model(s) directly from the source data in SOURCES.TEST_DATA.
Build your models in your personal database, applying transformations as you create the tables:
- Handle duplicates - Use DISTINCT, GROUP BY, or your own logic to handle duplicate records
- Standardize dates - Convert date fields to proper DATE type with consistent formatting
- Clean formatting - Standardize emails (lowercase), phone numbers, etc. in your CREATE statements
- Handle NULLs - Decide how to handle missing/empty values for your model
- Proper data types - Ensure numeric fields are correct types
Suggested dimensional model:
- Dimension tables (e.g., dim_customers, dim_products, dim_date)
- Fact table(s) (e.g., fact_orders with line-item grain)
- Use simple joins and aggregations - keep it straightforward!
Deliverable: SQL scripts creating dimensions and facts with transformations applied
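As one possible shape (not a prescribed solution), a cleaned dimension could be created roughly as below. The target database/schema (MYDB.ANALYTICS) and the source columns are assumptions; SELECT DISTINCT only removes exact duplicates after normalization, so you may prefer your own deduplication logic.
-- Sketch only: assumed target and column names.
CREATE OR REPLACE TABLE MYDB.ANALYTICS.DIM_CUSTOMERS AS
SELECT DISTINCT
    customer_id,
    LOWER(TRIM(email))                                    AS email,          -- standardized casing
    REGEXP_REPLACE(phone, '[^0-9]', '')                   AS phone_digits,   -- standardized phone format
    COALESCE(NULLIF(TRIM(customer_name), ''), 'UNKNOWN')  AS customer_name   -- explicit handling of blanks
FROM SOURCES.TEST_DATA.CUSTOMERS;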
Write SQL queries to answer these business questions:
- Total Revenue by Month - Show total revenue for each month in 2024
- Top 5 Customers by Revenue - Which customers have spent the most?
- Product Performance - List products with their total quantity sold and revenue, ordered by revenue
- Order Status Analysis - Show count and total value of orders by status (completed, pending, cancelled, shipped)
Deliverable: SQL queries with results and brief explanations
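For reference, the first question might look something like the query below, assuming a fact table in your personal database with an order_date and a line-level revenue column (names here are illustrative).
-- Sketch only: assumed fact table and column names.
SELECT
    DATE_TRUNC('month', order_date)  AS order_month,
    SUM(line_revenue)                AS total_revenue
FROM MYDB.ANALYTICS.FACT_ORDERS
WHERE YEAR(order_date) = 2024
GROUP BY 1
ORDER BY 1;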
- Please document your work in a clear, organized manner in your GitHub repository under the sql/ directory.
- Create a .sql file for each model that will live in your personal database.schema.
- Create the tables in Snowflake under your personal database and appropriate schema(s).
- Feel free to include additional scripts as well (e.g., validation queries, reports, etc.).
Note: If you are having issues with GitHub, send a .zip of the repo and all included files to brose@pri-med.com
Data Quality Report (optional/bonus):
- Summary of data quality issues found with specific counts
- Counts of duplicates, nulls, format inconsistencies per table
- How you resolved each issue
- Commit and push all your work to your repository:
git add .
git commit -m "Complete data engineering coding challenge"
git push origin main
- Ensure your repository is Public so we can review it
- Send us the link to your repository:
https://github.com/<your-org>/data-engineering-coding-challenge-<your-username>
- Your work in Snowflake should remain in your personal database (e.g., JDOE) for our review
- Start by exploring the data - run some SELECT queries to see what you're working with
- Don't overthink it - handle data quality issues directly in your dimensional model CREATE statements
- Use simple SQL - DISTINCT, GROUP BY, COALESCE, etc. No need for complex window functions
- Document your assumptions in comments or SOLUTION.md
- Your dimensional model doesn't need to be perfect, but explain your choices
- Focus on demonstrating your thought process
If you have questions about the challenge requirements or run into technical issues with Snowflake access, please reach out to brose@pri-med.com
Good luck! We're excited to see your approach to solving real-world data engineering problems.