Audrey Yang wyang10

Hi there 👋 I’m Audrey~ 🚀

About Me 🌱

I'm a Cloud Data Engineer focused on building scalable, reliable, and cost-efficient cloud data platforms.
I specialize in turning raw, messy, multi-source data into trusted analytics layers and ML-ready pipelines
through a mix of modern ELT, streaming systems, and strong distributed systems fundamentals.

Quick Pitch 💬

🎓 MSCS @ Northeastern University (2022–2024)
☁️ Focus: Cloud-native data engineering, including streaming (Kafka/Flink), orchestration (Airflow/Dagster), and modeling (dbt)
🔗 Connect: GitHub: wyang10 • LinkedIn: linkedin.com/in/awhy

Highlights 💡

Cloud Data Systems: Airflow • dbt • Snowflake • BigQuery • Terraform
Streaming Architecture: Kafka/Flink • stateful processing • exactly-once pipelines
Distributed Systems: idempotency • back-pressure • partitioning strategies
ELT/ETL Optimization: incremental models • data quality • orchestration best practices
Feature Engineering: online/offline store design • feature pipelines

Experience 🧩

Data Engineer — LumiereX (Jan 2025 – Present)
Built core ELT frameworks, improved data quality layers, and optimized Spark jobs for cost/performance.
Designed cloud-native data pipelines supporting analytics and ML-driven decisions.

Software Engineer Intern — VisionX (Jan 2024 – Jul 2024)
Implemented scalable ingestion APIs, automated batch ETL workflows,
and contributed to the design of ML feature extraction pipelines.

Featured Projects 👨‍💻

🔷 airflow_dbt_demo

A production-ready ELT & Data Quality Framework using Airflow + dbt + Great Expectations + CI/CD.
Brings data ingestion, transformation, testing, and lineage together in one reproducible, orchestrated workflow.

🔷 Smote-Heart-Attack-ML

End-to-end, reproducible ML pipeline: a modular, production-style ML system for predicting in-hospital mortality. It goes from raw CSV → cleaned features → baseline models → reproducible CLI pipeline, with optional SMOTE to address severe class imbalance.
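
For readers new to SMOTE, here is a minimal sketch of the core idea: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority-class neighbors. This is a simplified, standard-library illustration and not the code from the repository; the function name `smote_oversample`, its parameters, and the toy data are all hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority-class neighbors (simplified)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical toy data: four minority-class points with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_oversample(minority, n_new=6)
```

In a real pipeline, oversampling like this would normally be applied to the training split only, so that synthetic points never leak into evaluation data.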

How I Work 👯

I design modular, observable pipelines that are easy to test, debug, and scale.
When weighing trade-offs, I prioritize team velocity, reliability, and cloud cost efficiency.
I enjoy collaborations involving data modeling, pipeline quality, and distributed system design.

Core Skills ⚡

Languages & Tools
Python (Pandas, PySpark) • SQL • Java • Scala • Bash

Cloud & Orchestration
GCP (BigQuery, Dataflow) • AWS (S3, EMR, Glue, Lambda) • Airflow • Dagster • dbt • Docker •
Kubernetes • GitHub Actions • Terraform

Big Data & Storage
Spark • Flink • Kafka • Databricks • Hive • HDFS
Snowflake • Delta Lake • Parquet • dimensional modeling

Data Quality & CI/CD
Great Expectations • dbt tests • automated lineage • monitoring
