Open source platform for the machine learning lifecycle
Apache Spark is an open-source, general-purpose distributed cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
PySpark + Scikit-learn = Sparkit-learn
(Deprecated) Scikit-learn integration package for Apache Spark
A command-line tool for launching Apache Spark clusters.
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
PySpark methods to enhance developer productivity 📣 👯 🎉
A boilerplate for writing PySpark Jobs
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Train and run PyTorch models on Apache Spark.
Easy-to-use library to bring TensorFlow to Apache Spark.
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
Apache (Py)Spark type annotations (stub files).
Apache Spark 3 - Structured Streaming Course Material
Dataproc templates and pipelines for solving simple in-cloud data tasks
Code for "Efficient Data Processing in Spark" Course
Real-Time Financial Market Data Processing and Prediction application
Astronomy Broker based on Apache Spark
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
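Several entries above center on Spark's core programming model: lazy transformations (map, filter) recorded against a distributed collection, executed only when an action (collect, reduce) runs — the same interface the pure-Python RDD implementation listed above mimics. A minimal single-process sketch of that idea, assuming a hypothetical toy class `MiniRDD` (not PySpark's actual API):

```python
from functools import reduce as _reduce

class MiniRDD:
    """Toy, in-memory stand-in for Spark's RDD interface: transformations
    are recorded lazily and only evaluated when an action is called."""

    def __init__(self, data, ops=None):
        self._data = list(data)
        self._ops = ops or []

    def map(self, fn):
        # Transformation: lazy — just record the operation.
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        # Transformation: lazy — just record the operation.
        return MiniRDD(self._data, self._ops + [("filter", fn)])

    def _evaluate(self):
        # Replay the recorded pipeline over the data.
        out = self._data
        for kind, fn in self._ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

    def collect(self):
        # Action: materialize the results.
        return self._evaluate()

    def reduce(self, fn):
        # Action: fold the evaluated elements into a single value.
        return _reduce(fn, self._evaluate())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())                   # [0, 4, 16, 36, 64]
print(rdd.reduce(lambda a, b: a + b))  # 120
```

In real Spark the data is partitioned across executors and each transformation runs in parallel on every partition; this sketch only illustrates the lazy-pipeline shape of the API, not the distribution or fault tolerance.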
Created by Matei Zaharia
Released May 26, 2014