I'm a Cloud Data Engineer focused on building scalable, reliable, and cost-efficient cloud data platforms.
I specialize in turning raw, messy, multi-source data into trusted analytics layers and ML-ready pipelines
through a mix of modern ELT, streaming systems, and strong distributed systems fundamentals.
MSCS @ Northeastern University (2022–2024)
Focus: Cloud-Native Data Engineering – streaming (Kafka/Flink), orchestration (Airflow/Dagster), modeling (dbt)
Connect: GitHub: wyang10 • LinkedIn: linkedin.com/in/awhy
- Cloud Data Systems: Airflow • dbt • Snowflake • BigQuery • Terraform
- Streaming Architecture: Kafka/Flink • stateful processing • exactly-once pipelines
- Distributed Systems: idempotency • back-pressure • partitioning strategies
- ELT/ETL Optimization: incremental models • data quality • orchestration best practices
- Feature Engineering: online/offline store design • feature pipelines
Data Engineer – LumiereX (Jan 2025 – Present)
Built core ELT frameworks, improved data quality layers, and optimized Spark jobs for cost/performance.
Designed cloud-native data pipelines supporting analytics and ML-driven decisions.
Software Engineer Intern – VisionX (Jan 2024 – Jul 2024)
Implemented scalable ingestion APIs, automated batch ETL workflows,
and contributed to the design of ML feature extraction pipelines.
airflow_dbt_demo
A production-ready ELT and data quality framework built on Airflow, dbt, Great Expectations, and CI/CD.
It combines data ingestion, transformation, testing, and lineage in a single reproducible, orchestrated workflow.
End-to-End, Reproducible ML Pipeline
A modular, production-style ML system for predicting in-hospital mortality. It goes from raw CSV → cleaned features → baseline models → reproducible CLI pipeline, with optional SMOTE to address severe class imbalance.
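
To illustrate the class-imbalance step, here is a minimal SMOTE-style sketch in plain Python. It is not the project's code, which presumably relies on a library implementation. The function name `smote_like_oversample`, its parameters, and the toy rows are hypothetical. The idea is to create synthetic minority-class rows by interpolating between a row and one of its nearest minority neighbors.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbors.

    Simplified illustration only: numeric features, brute-force neighbor search.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)          # seeded so reruns give identical output
    k = min(k, len(minority) - 1)      # cannot have more neighbors than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbor.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


if __name__ == "__main__":
    # Hypothetical minority-class feature rows (e.g. the rare positive label).
    minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
    new_rows = smote_like_oversample(minority_rows, n_new=6)
    print(len(new_rows), "synthetic rows")
```

Oversampling of this kind should be applied to the training split only, so that synthetic rows never leak into validation or test data.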
- I design modular, observable pipelines that are easy to test, debug, and scale.
- I prioritize trade-offs that maximize team velocity, reliability, and cloud spend efficiency.
- I enjoy collaborations involving data modeling, pipeline quality, and distributed system design.
Languages & Tools
Python (Pandas, PySpark) • SQL • Java • Scala • Bash
Cloud & Orchestration
GCP (BigQuery, Dataflow) • AWS (S3, EMR, Glue, Lambda) • Airflow • Dagster • dbt • Docker
Kubernetes • GitHub Actions • Terraform
Big Data & Storage
Spark • Flink • Kafka • Databricks • Hive • HDFS
Snowflake • Delta Lake • Parquet • dimensional modeling
Data Quality & CI/CD
Great Expectations • dbt tests • automated lineage • monitoring
