Streaming data pipeline in AWS
Updated May 2, 2022 - Python
A data pipeline that performs ETL into AWS Redshift using Spark, orchestrated by Apache Airflow.
Banking Data Warehouse Pipeline
The goal of this repository is to provide clear examples of AWS CLI commands together with the AWS CDK for easily creating AWS services and resources.
Remove duplicate entries from a Redshift cluster.
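Amazon Redshift declares but does not enforce PRIMARY KEY or UNIQUE constraints, so duplicates are typically removed by rewriting the table with a ROW_NUMBER() window function. A minimal sketch of generating that SQL from Python; the table and column names (`events`, `event_id`, `loaded_at`) are hypothetical placeholders, not this repository's actual schema:

```python
def dedupe_sql(table, key_cols, order_col):
    """Build SQL that rebuilds `table` keeping one row per key.

    Keeps the most recent row per key (by `order_col`), since Redshift
    will not reject duplicate keys on its own. Names are illustrative.
    """
    keys = ", ".join(key_cols)
    return (
        f"CREATE TABLE {table}_dedup AS "
        f"SELECT * FROM ("
        f"SELECT *, ROW_NUMBER() OVER ("
        f"PARTITION BY {keys} ORDER BY {order_col} DESC) AS rn "
        f"FROM {table}) "
        f"WHERE rn = 1;"
    )
```

After validating the new table, it would be swapped in with ALTER TABLE ... RENAME, and the helper column rn dropped.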
A batch-processing data pipeline using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform and orchestrated from locally hosted Airflow containers. The end product is a Superset dashboard and a Postgres database hosted on an EC2 instance (currently powered down).
The goal of this project is to build a data pipeline that gathers real-time carpark lot availability and weather datasets from Data.gov.sg. The data are extracted via API and stored in an S3 bucket before being ingested into the data warehouse.
Load data from the Million Song Dataset into AWS Redshift.
Udacity Data Engineering Nanodegree Project #3.
ETL pipeline with AWS Redshift orchestrated with Airflow
Udacity Data Engineering Nanodegree Program - my submission for the Data Pipelines project.
Used AWS Glue to perform ETL operations and load the resulting data into AWS Redshift. In the second phase, used AWS CloudWatch rules and Lambda to run the Glue jobs automatically.
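In this pattern, a CloudWatch (EventBridge) rule invokes a Lambda function, which starts the Glue job through the Glue start_job_run API. A minimal sketch, assuming hypothetical job names and argument keys (the real deployment would carry these in the event payload or environment variables):

```python
def build_job_run_args(event):
    """Translate a CloudWatch rule event into start_job_run kwargs.

    The default job name and the --trigger_time argument key are
    illustrative placeholders, not this project's actual values.
    """
    return {
        "JobName": event.get("glue_job_name", "redshift-etl-job"),
        "Arguments": {"--trigger_time": event.get("time", "")},
    }

def lambda_handler(event, context):
    """Entry point wired to the CloudWatch rule."""
    # boto3 is imported lazily; it is preinstalled in the Lambda runtime.
    import boto3
    glue = boto3.client("glue")
    return glue.start_job_run(**build_job_run_args(event))
```

Keeping the event-to-arguments mapping in a separate pure function makes it testable without AWS credentials.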
Data Pipeline Analytics Platform is a generic end-to-end big data pipeline. It involves the following tech stack: AWS S3, AWS Redshift, AWS EMR, Apache Spark, and Apache Airflow.
Data pipelines created and monitored with Airflow to feed data into Redshift.