Developing a Flow with EMR and Airflow
Updated Aug 7, 2024 - Python
This project provides a detailed overview of building an automated data engineering pipeline using Airflow, AWS services, Spark, Snowflake, and Tableau.
This project demonstrates data cleaning and processing with Apache Spark and Apache Flink, both locally and on AWS EMR.
👷🌇 Set up and build a big data processing pipeline with Apache Spark and 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), using Terraform to provision the infrastructure and Airflow to automate workflows 🥊
Automate Amazon EMR clusters using Lambda for streamlined and scalable data processing workflows. Unlock the full potential of your data pipeline with LambdaEMR Automator.
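A Lambda-driven EMR launch like the one described above can be sketched as below. This is a minimal illustration, not the project's actual code: the cluster name, instance types, release label, and event keys are all assumptions, while `run_job_flow` and its request fields are the real boto3 EMR API.

```python
import json

def build_job_flow_request(name, log_uri, script_s3_uri):
    """Assemble a run_job_flow request for a transient EMR cluster that
    runs one Spark step and then terminates itself."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",           # assumed release
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # No keep-alive: the cluster shuts down after the step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def lambda_handler(event, context):
    # boto3 ships with the Lambda runtime; imported here so the request
    # builder above stays testable without AWS credentials.
    import boto3
    emr = boto3.client("emr")
    request = build_job_flow_request(
        name=event.get("name", "lambda-emr-demo"),   # hypothetical default
        log_uri=event["log_uri"],
        script_s3_uri=event["script_s3_uri"],
    )
    response = emr.run_job_flow(**request)
    return {"statusCode": 200,
            "body": json.dumps({"JobFlowId": response["JobFlowId"]})}
```

Wiring this handler to an S3 event or a schedule is what makes the automation "hands-off": new data arrives, a cluster spins up, processes it, and disappears.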
A robust data pipeline leveraging Amazon EMR and PySpark, orchestrated seamlessly with Apache Airflow for efficient batch processing
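For an Airflow-orchestrated EMR batch pipeline like this one, the core artifacts are the cluster spec and step list handed to the Amazon provider's operators (`EmrCreateJobFlowOperator`, `EmrAddStepsOperator`, `EmrStepSensor`). A sketch of those dictionaries follows; bucket names, script paths, and instance sizing are placeholders, not values from this project.

```python
# Cluster spec for EmrCreateJobFlowOperator(job_flow_overrides=...).
JOB_FLOW_OVERRIDES = {
    "Name": "airflow-batch-emr",            # hypothetical cluster name
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Keep the cluster alive; a terminate task in the DAG ends it.
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

def spark_step(name: str, script: str, *args: str) -> dict:
    """One EMR step that spark-submits a PySpark script from S3, in the
    shape EmrAddStepsOperator(steps=...) expects."""
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script, *args],
        },
    }

SPARK_STEPS = [
    spark_step("clean", "s3://my-bucket/jobs/clean.py"),
    spark_step("aggregate", "s3://my-bucket/jobs/aggregate.py",
               "--output", "s3://my-bucket/marts/"),
]
```

In the DAG itself, an `EmrStepSensor` would poll each step's state before the terminate task runs, giving the pipeline its batch cadence.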
This project demonstrates the use of Amazon Elastic Map Reduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.
Elastic Data Factory
Performed business operations using big data technologies: AWS EMR, AWS RDS (MySQL), Hadoop, Apache Sqoop, Apache HBase, and MapReduce.
Guide: Executing a Python script on AWS EMR for big data analysis.
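The usual way to run a Python script on a live EMR cluster is to submit it as a step via `aws emr add-steps`. The helper below builds that CLI invocation; the cluster ID and S3 path are hypothetical, while the `add-steps` flags and the `Type=Spark` step syntax are the real AWS CLI interface.

```python
import shlex

def add_steps_command(cluster_id: str, script_s3_uri: str) -> list:
    """Build the `aws emr add-steps` invocation that spark-submits a
    Python script already uploaded to S3."""
    step = (
        "Type=Spark,Name=pyspark-analysis,ActionOnFailure=CONTINUE,"
        f"Args=[--deploy-mode,cluster,{script_s3_uri}]"
    )
    return ["aws", "emr", "add-steps",
            "--cluster-id", cluster_id,
            "--steps", step]

# Hypothetical cluster ID and bucket, for illustration only.
cmd = add_steps_command("j-ABC123DEF456", "s3://my-bucket/analysis.py")
print(shlex.join(cmd))
```

Building the command as a list and joining with `shlex` keeps the bracketed `Args=[...]` value intact when the command is pasted into a shell.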
An end-to-end data pipeline for building a data lake and supporting reports using Apache Spark.
Implementation of the Random Forest algorithm using PySpark on AWS to classify wines, with deployment in a Docker container.
Generic Python library that provisions EMR clusters from YAML config files (Configuration as Code).
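The configuration-as-code idea boils down to translating a small declarative file into boto3's `run_job_flow` arguments. A sketch, assuming the config has already been parsed (e.g. with `yaml.safe_load`) and assuming invented keys like `master_type` and `core_type` — the real library's schema will differ:

```python
def config_to_job_flow(cfg: dict) -> dict:
    """Translate a declarative cluster config (parsed from YAML) into
    boto3 run_job_flow keyword arguments."""
    return {
        "Name": cfg["name"],
        "ReleaseLabel": cfg.get("release", "emr-6.15.0"),
        "Applications": [{"Name": app}
                         for app in cfg.get("applications", ["Spark"])],
        "Instances": {
            "MasterInstanceType": cfg["master_type"],
            "SlaveInstanceType": cfg["core_type"],
            "InstanceCount": cfg.get("instance_count", 3),
            "KeepJobFlowAliveWhenNoSteps": cfg.get("keep_alive", False),
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

The payoff is that cluster changes become a YAML diff under code review rather than a console click, which is what "Configuration as Code" promises.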
Database Schema & ETL pipeline for Song Play Analysis | Bosch AI Talent Accelerator Scholarship Program
The goal of this project is to offer a ready-to-use AWS EMR template built on Spot Fleet and On-Demand Instances, so you can just focus on writing PySpark code.
Data pipeline for analyzing the effects of economic indicators on cryptocurrencies
Orchestrating Cloud ETL Workloads
Load data from S3, process it into analytics tables using Spark, and load them back into S3. The Spark job is deployed on an AWS EMR cluster.
Implements a data lake using S3 and Spark on an EMR cluster from an AWS Cloud9 environment, with an ETL pipeline that extracts data from S3, processes it with Spark, and loads it back into S3 as a set of dimensional tables.
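The dimensional-table step in such a pipeline is essentially a projection plus de-duplication. The plain-Python sketch below illustrates the logic with made-up song-play records; in the real job the same shape would be a Spark `df.select(*columns).dropDuplicates([key])` followed by a partitioned Parquet write back to S3.

```python
def dimension_table(rows, key, columns):
    """Project `columns` from raw event rows and de-duplicate on `key`,
    mirroring select(...).dropDuplicates([key]) in the Spark job."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append({c: row[c] for c in columns})
    return out

# Invented sample events: two plays of the same song, one of another.
events = [
    {"song_id": "S1", "title": "A", "artist_id": "AR1", "ts": 1},
    {"song_id": "S1", "title": "A", "artist_id": "AR1", "ts": 2},
    {"song_id": "S2", "title": "B", "artist_id": "AR2", "ts": 3},
]
songs = dimension_table(events, "song_id", ["song_id", "title", "artist_id"])
```

The fact table keeps every event row, while each dimension keeps one row per entity — that split is what makes the S3 output queryable as a star schema.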