An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, plus APIs for several languages
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
State of the Art Natural Language Processing
Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Smart Automation Tool for building modern Data Lakes and Data Pipelines
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Simple and Distributed Machine Learning
An open protocol for secure data sharing
A Spark accelerator framework that enables secondary indexes over remote data stores.
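The core idea behind a secondary index can be sketched in plain Scala: build an inverted mapping from a column value to row identifiers, so a lookup avoids a full scan of the remote store. This is a minimal illustration of the concept only; the names (`Row`, `buildIndex`, `lookup`) are hypothetical and not the framework's API.

```scala
object SecondaryIndexSketch {
  // Hypothetical row model: a primary key plus one indexed column.
  case class Row(id: Long, city: String)

  // Build an inverted index: column value -> ids of matching rows.
  def buildIndex(rows: Seq[Row]): Map[String, Seq[Long]] =
    rows.groupBy(_.city).view.mapValues(_.map(_.id)).toMap

  // A lookup consults the index instead of scanning every row.
  def lookup(index: Map[String, Seq[Long]], city: String): Seq[Long] =
    index.getOrElse(city, Seq.empty)
}
```

In a real accelerator the index itself would live alongside (or apart from) the remote data and be consulted by the query planner; the in-memory `Map` here only stands in for that structure.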
A Spark plugin for reading and writing Excel files
Resilient data pipeline framework running on Apache Spark
A distributed graph computing platform that enables simple visual analysis of large-scale relational data.
A library to transform Scala product types and schemas from different systems into other schemas. Any implemented type automatically gets methods to convert it into the rest of the types, and vice versa; e.g., a Spark schema can be transformed into a BigQuery table.
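A common way such cross-system conversions are wired up in Scala is the typeclass pattern: an implicit instance witnesses that one schema type converts into another, and an extension method exposes it uniformly. The sketch below assumes hypothetical field models and names (`SparkField`, `BigQueryField`, `SchemaConverter`); it illustrates the pattern, not the library's actual API.

```scala
object SchemaSketch {
  // Hypothetical single-field models for two systems.
  case class SparkField(name: String, dataType: String)
  case class BigQueryField(name: String, bqType: String)

  // Typeclass: evidence that A can be converted into B.
  trait SchemaConverter[A, B] { def convert(a: A): B }

  object SchemaConverter {
    def apply[A, B](implicit c: SchemaConverter[A, B]): SchemaConverter[A, B] = c

    // One instance: map Spark type names onto BigQuery type names.
    implicit val sparkToBigQuery: SchemaConverter[SparkField, BigQueryField] =
      (f: SparkField) =>
        BigQueryField(f.name, f.dataType match {
          case "StringType"  => "STRING"
          case "IntegerType" => "INT64"
          case other         => other.toUpperCase
        })
  }

  // Any type with an instance automatically gets a `to[B]` method.
  implicit class ConvertOps[A](private val a: A) extends AnyVal {
    def to[B](implicit c: SchemaConverter[A, B]): B = c.convert(a)
  }
}
```

Adding a new system then means writing one `SchemaConverter` instance per direction; every existing type picks up the conversion method for free, which matches the "any implemented type automatically gets methods" behavior described above.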
Created by Matei Zaharia
Released May 26, 2014