- SQL on Hadoop
- Data Management
- Workflow, Lifecycle and Governance
- Data Ingestion and Integration
- Libraries and Tools
- Realtime Data Processing
- Distributed Computing and Programming
- Packaging, Provisioning and Monitoring
- Machine learning and Big Data analytics
- Other Awesome Lists
- Apache Hadoop - Apache Hadoop
- Apache Tez
- SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
- dumbo - Python module that allows you to easily write and run Hadoop programs.
- hadoopy - Python MapReduce library written in Cython.
- mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
- pydoop - Pydoop is a package that provides a Python API for Hadoop.
- hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
- White Elephant - Hadoop log aggregator and dashboard
- Kiji Project
- Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
- Kylin - Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
- Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
- Apache Ignite - Distributed in-memory platform
- Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
- Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
- mpich2-yarn - Running MPICH2 on YARN
Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.
- Apache HBase - Apache HBase
- Apache Phoenix - A SQL skin over HBase
- happybase - A developer-friendly Python library to interact with Apache HBase.
- Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
- Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
- hindex - Secondary Index for HBase
- Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
- OpenTSDB - The Scalable Time Series Database
- Apache Cassandra
SQL on Hadoop
- Apache Hive
- Hive Plugins
- https://github.com/kevinweil/elephant-bird - Twitter
- https://github.com/markgrover/hive-translate (PostgreSQL translate())
- https://github.com/myui/hivemall (Machine Learning UDF/UDAF/UDTF)
- https://github.com/edwardcapriolo/hive-geoip (GeoIP UDF)
- Storage Handler
- Libraries and tools
- shib - WebUI for query engines: Hive and Presto
- clive - Clojure library for interacting with Hive via Thrift
- https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
- PyHive - Python interface to Hive and Presto
- HiveRunner - An open source unit test framework for Hadoop Hive queries, based on JUnit4
- Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers.
- Hive_test - Unit test framework for Hive and hive-service
- Cloudera Impala
- Apache Tajo - Data warehouse system for Apache Hadoop
- Apache Drill
- Apache Calcite - A Dynamic Data Management Framework
Workflow, Lifecycle and Governance
- Apache Oozie - Apache Oozie
- Apache Falcon - Data management and processing platform
- Apache NiFi - A dataflow system
Data Ingestion and Integration
- Apache Flume - Apache Flume
- Flume Plugins
- Suro - Netflix's distributed Data Pipeline
- Apache Sqoop - Apache Sqoop
- Apache Kafka - Apache Kafka
- Apache Pig - Apache Pig
- Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
- vahara - Machine learning and natural language processing with Apache Pig
- packetpig - Open Source Big Data Security Analytics
- akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
- seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
- Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Libraries and Tools
- Kite Software Development Kit - A set of libraries, tools, examples, and documentation
- gohadoop - Native Go clients for Apache Hadoop YARN.
- Hue - A Web interface for analyzing data with Apache Hadoop.
- Jumbune - Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs.
- Apache Thrift
- Apache Avro - Apache Avro is a data serialization system.
- Elephant Bird - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
- Spring for Apache Hadoop
- hdfs - A native Go client for HDFS
Realtime Data Processing
Distributed Computing and Programming
- Apache Spark
- Apache Crunch
- Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Packaging, Provisioning and Monitoring
- Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
- Apache Ambari - Apache Ambari
- Ganglia Monitoring System
- ankush - A big data cluster management tool that creates and manages clusters of different technologies.
- Apache ZooKeeper - Apache ZooKeeper
- Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
- Buildoop - Hadoop Ecosystem Builder
- Deploop - The Hadoop Deploy System
- Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
- inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
- Apache Solr
- SenseiDB - Open-source, distributed, realtime, semi-structured database
- Banana - Kibana port for Apache Solr
- Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Apache Sentry - An authorization module for Hadoop
- Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
- Big Data Benchmark
- hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
Machine learning and Big Data analytics
- Apache Mahout
- Cloudera Oryx - The Oryx open source project provides simple, real-time large-scale machine learning / predictive analytics infrastructure.
- MLlib - MLlib is Apache Spark's scalable machine learning library.
- R - R is a free software environment for statistical computing and graphics.
- RHive - RHive is an R extension facilitating distributed computing via Apache Hive.
- Apache Lens
Various resources, such as books, websites and articles.
Useful websites and articles
- Hadoop Weekly
- The Hadoop Ecosystem Table
- Hadoop 1.x vs 2
- Apache Hadoop YARN: Yet Another Resource Negotiator
- Introducing Apache Hadoop YARN
- Apache Hadoop YARN - Background and an Overview
- Apache Hadoop YARN - Concepts and Applications
- Apache Hadoop YARN - ResourceManager
- Apache Hadoop YARN - NodeManager
- Migrating to MapReduce 2 on YARN (For Users)
- Migrating to MapReduce 2 on YARN (For Operators)
- Hadoop and Big Data: Use Cases at Salesforce.com
- All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.
- What is Bigtop, and Why Should You Care?
- Hadoop - Distributions and Commercial Support
- Ganglia configuration for a small Hadoop cluster and some troubleshooting
- Hadoop illuminated - Open Source Hadoop Book
- NoSQL Database
- 10 Best Practices for Apache Hive
- Hadoop Operations at Scale
- AWS BigData Blog
- Hadoop 24/7
- An example Apache Hadoop Yarn upgrade
- Apache Hadoop In Theory And Practice
- Hadoop Operations at LinkedIn
- Hadoop Performance at LinkedIn
- Docker based Hadoop provisioning
- Hadoop: The Definitive Guide
- Hadoop Operations
- Apache Hadoop Yarn
- HBase: The Definitive Guide
- Programming Pig
- Programming Hive
- Hadoop in Practice, Second Edition
- Hadoop in Action, Second Edition
Other Awesome Lists
Other amazingly awesome lists can be found in the awesome-awesomeness list.