theotheo/awesome-data-engineering

 
 

Awesome Data Engineering

A curated list of data engineering tools for software developers

Table of contents

  1. [Databases](#databases)
  2. Ingestion
  3. [File System](#file-system)
  4. File Format
  5. Stream Processing
  6. [Batch Processing](#batch-processing)
  7. [Charts and Dashboards](#charts-and-dashboards)
  8. [Frameworks](#frameworks)
  9. Datasets
  10. [Monitoring](#monitoring)
  11. Docker

Databases

Data Ingestion

  • [Kafka](http://kafka.apache.org/) Publish-subscribe messaging rethought as a distributed commit log.
  • [AWS Kinesis](http://aws.amazon.com/kinesis/) A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
  • RabbitMQ Robust messaging for applications.
  • FluentD An open source data collector for a unified logging layer.
  • Apache Sqoop A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Luigi Python module that helps you build complex pipelines of batch jobs.

File System

  • [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
  • [AWS S3](http://aws.amazon.com/s3/)
  • [Tachyon](http://tachyon-project.org/) Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.
  • CEPH Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability
  • OrangeFS Orange File System is a branch of the Parallel Virtual File System
  • SnackFS SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built over Cassandra
  • GlusterFS Gluster Filesystem
  • XtreemFS Fault-tolerant distributed file system for all storage needs.
  • SeaweedFS Seaweed-FS is a simple and highly scalable distributed file system. It has two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS chooses to implement only a key~file mapping. By analogy with "NoSQL", you can call it "NoFS".

File Format

  • Apache Avro Apache Avro™ is a data serialization system
  • Apache Parquet Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
    • Snappy A fast compressor/decompressor. Used with Parquet
    • PigZ A parallel implementation of gzip for modern multi-processor, multi-core machines
  • Apache Thrift The Apache Thrift software framework, for scalable cross-language services development
  • ProtoBuf Protocol Buffers - Google's data interchange format
  • SequenceFile SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats
  • Kryo Kryo is a fast and efficient object graph serialization framework for Java
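
To make the "flat file of binary key/value pairs" idea from the SequenceFile entry concrete, here is a minimal sketch in Python. It is only an illustration of length-prefixed key/value records; it is not the actual Hadoop SequenceFile layout (which also has a header, sync markers, and optional compression), and the function names are hypothetical.

```python
import io
import struct

# Each record is: 4-byte key length, 4-byte value length (big-endian),
# followed by the raw key bytes and the raw value bytes.
HEADER = struct.Struct(">II")


def write_records(stream, records):
    """Append (key, value) byte pairs to a binary stream."""
    for key, value in records:
        stream.write(HEADER.pack(len(key), len(value)))
        stream.write(key)
        stream.write(value)


def read_records(stream):
    """Yield (key, value) byte pairs until the stream is exhausted."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise ValueError("truncated record header")
        key_len, value_len = HEADER.unpack(header)
        key = stream.read(key_len)
        value = stream.read(value_len)
        if len(key) != key_len or len(value) != value_len:
            raise ValueError("truncated record body")
        yield key, value


def roundtrip(records):
    """Write records to an in-memory buffer and read them back."""
    buffer = io.BytesIO()
    write_records(buffer, records)
    buffer.seek(0)
    return list(read_records(buffer))
```

Because every record carries its own lengths, a reader can scan the file sequentially without a separate index, which is the property that makes this kind of layout convenient as batch job input and output.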

Stream Processing

  • Spark Streaming Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
  • Apache Flink Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
  • Apache Storm Apache Storm is a free and open source distributed realtime computation system
  • Apache Samza Apache Samza is a distributed stream processing framework
  • Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data
  • VoltDB

Batch Processing

Charts and Dashboards

  • [Highcharts](http://www.highcharts.com/) A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
  • ZingChart Fast JavaScript charts for any data set.
  • C3.js D3-based reusable chart library.
  • [D3.js](http://d3js.org/) A JavaScript library for manipulating documents based on data.
    • [D3Plus](http://d3plus.org) D3's simpler, easier to use cousin. Mostly predefined templates that you can just plug data into.
  • SmoothieCharts A JavaScript Charting Library for Streaming Data.
  • PyXley Python helpers for building dashboards using Flask and React

Frameworks

  • [Luigi](https://github.com/spotify/luigi) Luigi is a Python module that helps you build complex pipelines of batch jobs.
    • CronQ An application cron-like system. Used with Luigi.
  • [Cascading](http://www.cascading.org/) Java based application development platform.
  • [Airflow](https://github.com/airbnb/airflow) Airflow is a system to programmatically author, schedule and monitor data pipelines.
  • [Azkaban](https://azkaban.github.io/) Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.
  • Oozie Oozie is a workflow scheduler system to manage Apache Hadoop jobs
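
The schedulers above all resolve job ordering from declared dependencies. The sketch below shows that general idea with Python's standard-library `graphlib` (Python 3.9+); it does not use the API of Luigi, Airflow, Azkaban, or Oozie, and the job names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return one valid run order for jobs.

    `dependencies` maps each job name to the set of jobs that must
    finish before it can start. A cycle raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: each job depends on the one before it.
JOBS = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

if __name__ == "__main__":
    for job in execution_order(JOBS):
        print(job)
```

A real scheduler adds retries, scheduling, and state tracking on top of this ordering step, but a topological sort of the dependency graph is a reasonable mental model for what "resolving the ordering" means.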

ELK (Elasticsearch, Logstash, Kibana)

  • docker-logstash A highly configurable logstash (1.4.4) docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).
  • elasticsearch-jdbc JDBC importer for Elasticsearch
  • ZomboDB Postgres Extension that allows creating an index backed by Elasticsearch

Docker

  • Gockerize Package golang service into minimal docker containers
  • Flocker Easily manage Docker containers & their data
  • Rancher RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers
  • Kontena Application Containers for Masses
  • Weave Weaving Docker containers into applications http://weave.works
  • Zodiac A lightweight tool for easy deployment and rollback of dockerized applications
  • cAdvisor Analyzes resource usage and performance characteristics of running containers
  • Micro S3 persistence Docker microservice for saving/restoring volume data to S3
  • Dockup Docker image to backup/restore your Docker container volumes to AWS S3 http://tutum.co

Datasets

Realtime

  • Instagram Realtime Real-time photo updates provide your application with instant notifications of new photos as they are posted on Instagram.
  • Twitter Realtime The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data.
  • Firebase Realtime Airport delays, Parking, Cryptocurrencies, Earthquakes, Transit, Weather

Data Dumps

Cheers to The Data Engineering Ecosystem: An Interactive Map

Inspired by the awesome list. Created by Insight Data Engineering fellows.

Monitoring

Prometheus

  • Prometheus.io An open-source service monitoring system and time series database
  • HAProxy Exporter Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption

License

CC0

To the extent possible under law, Igor Barinov has waived all copyright and related or neighboring rights to this work.
