Scan databases and data warehouses for PII data. Tag tables and columns in data catalogs like Amundsen and Datahub
Redshift Python Connector. It supports the Python Database API Specification v2.0 (PEP 249).
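Because the driver follows PEP 249, queries use the familiar connect/cursor/fetch pattern. A minimal sketch, assuming `redshift_connector` is installed and connection credentials are supplied by the caller (the helper names here are illustrative, not part of the driver's API):

```python
# Sketch of a row-count query through redshift_connector's DB-API 2.0 interface.

def build_count_query(table: str) -> str:
    """Build a simple row-count query; the table name is assumed validated."""
    return f"SELECT COUNT(*) FROM {table}"

def count_rows(table: str, **conn_kwargs) -> int:
    """Run the count against Redshift; conn_kwargs holds host, database,
    user, password, etc. Imported lazily so the sketch can be read
    without the driver installed."""
    import redshift_connector

    conn = redshift_connector.connect(**conn_kwargs)
    try:
        cursor = conn.cursor()          # standard PEP 249 cursor
        cursor.execute(build_count_query(table))
        return cursor.fetchone()[0]
    finally:
        conn.close()
```

Any PEP 249-compliant tooling (e.g. generic DB-API wrappers) can drive this connection the same way it would drive `sqlite3` or `psycopg2`.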
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
Developed a data pipeline to automate data warehouse ETL by building custom Airflow operators that handle the extraction, transformation, validation, and loading of data from S3 -> Redshift -> S3
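The core of an S3-to-Redshift operator like these is usually a generated COPY statement. A minimal sketch of such a builder; the table, bucket, and IAM role values in the usage comment are placeholders, not the repository's actual configuration:

```python
def copy_statement(table: str, s3_path: str, iam_role: str,
                   json_path: str = "auto") -> str:
    """Build a Redshift COPY statement for JSON data staged in S3.

    iam_role is the ARN of a role that Redshift can assume to read
    the bucket; json_path is either 'auto' or an s3:// JSONPaths file.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS JSON '{json_path}';"
    )

# Example (placeholder names):
# copy_statement("staging_events", "s3://my-bucket/log_data",
#                "arn:aws:iam::123456789012:role/redshift-copy")
```

Inside a custom operator, the `execute` method would render this statement and run it through a Redshift connection supplied by an Airflow hook.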
🔄 🏃 EtLT of my own Strava data using the Strava API, MySQL, Python, S3, Redshift, and Airflow
The project grew out of an interest in Data Engineering and ETL pipelines, and provided a good opportunity to develop skills and experience with a range of tools. As such, it is more complex than strictly required, utilising dbt, Airflow, Docker, and cloud-based storage.
A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via Github Actions.
A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from locally hosted Airflow containers. The end product is a Superset dashboard and a Postgres database, hosted on an EC2 instance at this address (powered down):
Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark
This project provides valuable customer sentiment insights for Zomato by tracking and analyzing tweets related to their brand and services.
The goal of this repository is to provide clear examples of AWS CLI commands together with the AWS CDK to easily create AWS services and resources
Project 3 - Data Engineering Nanodegree
Project 5 - Data Engineering Nanodegree
Redshift script to create a MANIFEST file recursively
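A COPY manifest is just a JSON document listing S3 URLs, so the recursive part is walking the bucket prefix. A minimal sketch, assuming `boto3` for the listing; the bucket and prefix names are placeholders:

```python
import json

def list_keys_recursively(bucket: str, prefix: str) -> list[str]:
    """Recursively list all object keys under an S3 prefix.
    Imported lazily so the sketch can be read without boto3 installed."""
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def build_manifest(bucket: str, keys: list[str], mandatory: bool = True) -> str:
    """Render the keys as a Redshift COPY manifest JSON document."""
    entries = [
        {"url": f"s3://{bucket}/{key}", "mandatory": mandatory}
        for key in keys
    ]
    return json.dumps({"entries": entries}, indent=2)
```

With `mandatory: true`, COPY fails if any listed file is missing, which is usually what you want for batch loads.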
Remove duplicate entries from a Redshift cluster
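Since Redshift does not enforce primary keys, deduplication is commonly done by staging a `SELECT DISTINCT` copy and swapping the rows back. A minimal sketch that generates the statements; the staging-table suffix is a hypothetical naming choice:

```python
def dedup_statements(table: str) -> list[str]:
    """Generate SQL to deduplicate a Redshift table via a staging copy.

    Run all four statements inside a single transaction so a failure
    between DELETE and INSERT cannot leave the table empty.
    """
    stage = f"{table}_dedup_stage"  # hypothetical staging-table name
    return [
        f"CREATE TEMP TABLE {stage} AS SELECT DISTINCT * FROM {table};",
        f"DELETE FROM {table};",
        f"INSERT INTO {table} SELECT * FROM {stage};",
        f"DROP TABLE {stage};",
    ]
```

`DELETE FROM` is used rather than `TRUNCATE` because `TRUNCATE` commits immediately in Redshift and would break the transactional safety net.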
This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and services including Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, Amazon Athena, and Amazon Redshift.
A Data Warehousing project for retail sales using dimension modelling best practices with SCD type 2 on AWS Redshift. Utilizing AWS Lambda, Glue Workflows and Python Shell jobs to create and automate an ELT pipeline where batch data coming into S3 is loaded onto Redshift and necessary transformations are performed to meet requirements.
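SCD Type 2 keeps full attribute history by closing the current row and inserting a new version when a tracked attribute changes. A minimal in-memory sketch of that merge logic, with assumed bookkeeping columns `valid_from`/`valid_to`/`is_current` and business key `id` (in the actual project this would be expressed as SQL in a Glue/Python Shell job):

```python
from datetime import date

def apply_scd2(history: list[dict], incoming: dict, today: date) -> list[dict]:
    """Apply one incoming record to a dimension's history (SCD Type 2).

    Each history row is a dict with the business key 'id', tracked
    attributes, and valid_from/valid_to/is_current bookkeeping columns.
    """
    tracked = [k for k in incoming if k != "id"]
    out, changed, matched = [], False, False
    for row in history:
        if row["is_current"] and row["id"] == incoming["id"]:
            matched = True
            if any(row[k] != incoming[k] for k in tracked):
                # Attribute changed: close the old version as of today.
                out.append(dict(row, valid_to=today, is_current=False))
                changed = True
            else:
                out.append(row)  # no change, keep current row open
        else:
            out.append(row)
    if changed or not matched:
        # Open a new current version for a changed or brand-new key.
        out.append(dict(incoming, valid_from=today, valid_to=None,
                        is_current=True))
    return out
```

An unchanged record leaves the history untouched, so reruns of the same batch are idempotent.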
Udacity Data Engineering Nanodegree Project #3.
Data Pipeline Analytics Platform is an end-to-end generic Big Data pipeline. It involves the following tech stack: AWS S3, AWS Redshift, AWS EMR Cluster, Apache Spark, Apache Airflow.
Building ETL pipelines to migrate music JSON data/metadata files (semi-structured data) into a relational database stored in an AWS Redshift cluster
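The transform step in pipelines like this flattens each JSON record into a tuple matching a table's columns. A minimal sketch over newline-delimited JSON; the field names mirror a typical song-metadata schema and are assumptions, not the repository's actual layout:

```python
import json

def songs_to_rows(json_lines: str) -> list[tuple]:
    """Flatten newline-delimited song JSON into rows for a songs table.

    Assumed fields: song_id, title, artist_id, year, duration.
    Optional fields fall back to None so a sparse record still loads.
    """
    rows = []
    for line in json_lines.strip().splitlines():
        rec = json.loads(line)
        rows.append((
            rec["song_id"],
            rec["title"],
            rec["artist_id"],
            rec.get("year"),
            rec.get("duration"),
        ))
    return rows
```

The resulting tuples can be written to staged CSV/JSON in S3 and loaded into Redshift with COPY, or inserted directly via a DB-API cursor's `executemany`.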