Building a best-practice Apache Spark working environment for robust data pipelines
Updated Apr 1, 2023 - Python
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
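The data-parallel model Spark provides can be illustrated without a cluster. The sketch below is a minimal pure-Python analogue of a map/reduce word count; the input list and variable names are hypothetical, and in real Spark the lines would live in an RDD or DataFrame partitioned across workers.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: in Spark these lines would be an RDD partitioned
# across the cluster; plain Python shows the same map/reduce flow.
lines = ["spark makes pipelines", "pipelines need spark"]

# "map" step: count words per line (done per-partition in Spark).
mapped = [Counter(line.split()) for line in lines]

# "reduce" step: merge the partial counts (a shuffle plus reduce in Spark).
word_counts = reduce(lambda a, b: a + b, mapped)

print(word_counts["spark"])  # total occurrences across all lines
```

Because each map step touches only its own line and the reduce is associative, Spark can run the same pattern on many machines with fault tolerance.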
Real-time analysis pipeline
A forecasting project based on Apache Spark and implemented with a Naive Bayes classifier.
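The project's own code isn't shown here; as a rough sketch of the model it names, below is a from-scratch Naive Bayes classifier with add-one smoothing. The toy weather-to-demand data and function names are illustrative assumptions, not the project's API.

```python
from collections import defaultdict
import math

def train_nb(samples):
    """Fit class priors and per-class feature counts from (features, label) pairs."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for features, label in samples:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocab.add(f)
    return class_counts, feature_counts, vocab, sum(class_counts.values())

def predict_nb(model, features):
    """Pick the label maximizing log P(label) + sum of log P(feature | label)."""
    class_counts, feature_counts, vocab, total = model
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(feature_counts[label].values()) + len(vocab)  # Laplace smoothing
        for f in features:
            score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: weather features -> demand label.
data = [(["sunny", "warm"], "high"), (["rainy", "cold"], "low"),
        (["sunny", "cold"], "high"), (["rainy", "warm"], "low")]
model = train_nb(data)
print(predict_nb(model, ["sunny", "warm"]))  # -> "high"
```

At Spark scale the same counting step is a distributed aggregation, which is why Naive Bayes parallelizes well.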
Scalable Book Recommender System - Apache Spark ML Lib
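Spark MLlib's recommender is typically ALS matrix factorization; as a compact stand-in for the collaborative-filtering idea, here is a pure-Python user-based cosine-similarity sketch. The rating dictionary and book titles are made-up examples, not data from the project.

```python
import math

# Hypothetical user -> book -> rating data; MLlib would fit ALS on a DataFrame.
ratings = {
    "alice": {"dune": 5, "hobbit": 4},
    "bob":   {"dune": 4, "hobbit": 5, "neuromancer": 3},
    "carol": {"hobbit": 2, "neuromancer": 5},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[b] * v[b] for b in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(user, k=1):
    """Score unseen books by similarity-weighted ratings from other users."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for book, r in their.items():
            if book not in ratings[user]:
                scores[book] = scores.get(book, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # alice's top unseen book
```

ALS scales the same intuition to millions of users by learning low-rank user and item factors instead of comparing raw rating vectors.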
Heart disease classification with data mining (Zeppelin notebook)
Data Lake with Spark
ML model deployment app I contributed to via MLH Fellowship
Data Engineering Capstone Project: ETL pipelines and data warehouse development, including a business challenge that requires building a data platform for retailer data analytics.
Real-time user streaming data pipeline
Pinterest's experiment analytics data pipeline, which runs thousands of experiments per day and crunches billions of data points to provide valuable insights to improve the product.
My Apache Spark images for a wide set of applications
Example project implementing best practices and testing for PySpark data pipelines.
Graph coloring example using GraphFrames of Apache Spark framework
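GraphFrames itself isn't reproduced here; as a sketch of the underlying problem, below is a plain-Python greedy graph coloring over an edge list. The edge list and function name are illustrative assumptions, and greedy coloring is a heuristic rather than GraphFrames' distributed approach.

```python
def greedy_coloring(edges):
    """Give each vertex the smallest color unused by its already-colored neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    colors = {}
    for vertex in sorted(adj):  # fixed visiting order keeps the result deterministic
        used = {colors[n] for n in adj[vertex] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[vertex] = color
    return colors

# A 4-cycle is 2-colorable: opposite corners can share a color.
print(greedy_coloring([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]))
```

A distributed version has to resolve coloring conflicts between partitions, which is where Spark's iterative message passing comes in.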
This project demonstrates the workflow of a Data Engineer. It utilizes the Google Cloud Platform and Google Colab as the main tools.
Big Data technologies are software tools for analyzing, processing, and extracting value from data sets so large and complex that traditional data-management tools cannot handle them.
Created by Matei Zaharia
Released May 26, 2014