dataengineering

An open-source project for building robust data pipelines and scalable software infrastructure with industry-standard tools that developers favor. The pipelines are field-tested on farms across Sumatra, Indonesia, to ensure real-world applicability and resilience.

portfolio computer-vision django-application software-engineering traefik stem airflow-docker dataengineering django-docker ultralytics

Updated May 27, 2024
Python

airscholar / RealtimeStreamingEngineering

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP sockets, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers each stage: data acquisition, processing, sentiment analysis with ChatGPT, publishing to a Kafka topic, and connecting to Elasticsearch.

elasticsearch kafka apache-spark tcp-socket dataengineering openai-api chatgpt

Updated Jan 4, 2024
Python

WaylonWalker / kedro-static-viz

A Kedro CLI plugin for generating a static Kedro-Viz site (HTML, CSS, JS) that can be deployed on many serverless tools.

python data datapipeline dataengineering kedro kedro-plugin

Updated Jan 6, 2023
Python

Wittline / pyspark-on-aws-emr

The goal of this project is to offer a ready-to-use AWS EMR template built on Spot Fleet and On-Demand Instances, so you can focus on writing PySpark code.

python aws big-data spark aws-emr pyspark dataengineering big-data-analytics ec2-spot emr-cluster wordcloud-generator ec2-spot-instances

Updated Jun 13, 2022
Python

judeleonard / Prescriber-ETL-data-pipeline

An end-to-end ETL data pipeline that uses PySpark parallel processing to handle about 25 million rows of data from a SaaS application, with Apache Airflow as the orchestration tool and several data warehouse technologies. Apache Superset connects to the data warehouse to generate BI dashboards for weekly reports.

airflow pyspark datawarehouse airflow-docker dataengineering amazon-s3 posgresql azure-blob-storage etl-pipeline apache-superset bi-dashboards

Updated Dec 7, 2022
Python

Here are 259 public repositories matching this topic...

datafold / data-diff

TobikoData / sqlmesh

grai-io / grai-core

kevinheavey / modern-polars

mehd-io / pypi-duck-flow

awslabs / aws-orbit-workbench

josephmachado / beginner_de_project_stream

minhadona / data_engineer_interview_challenges

kislerdm / data-engineering-interviews

prodmodel / prodmodel

abhishek-ch / data-machinelearning-the-boring-way

franloza / coches-net-dashboard

danielsaban / data-scraping-sofascore

sarahmk125 / airflow-docker-metrics

josephmachado / socialetl

mikestack15 / orangutan-stem

airscholar / RealtimeStreamingEngineering

WaylonWalker / kedro-static-viz

Wittline / pyspark-on-aws-emr

judeleonard / Prescriber-ETL-data-pipeline
