Project: Data Pipeline

This project builds an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables. This allows data scientists to continue finding insights from the data stored in the data warehouse.

In this project we use two Amazon Web Services resources: S3 for storage and Redshift as the data warehouse.

ETL

Data pipeline design: at a high level, the pipeline performs the following tasks (a code sketch follows the list).

  1. Extract data from multiple S3 locations.
  2. Load the data into a Redshift cluster.
  3. Transform the data into a star schema.
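
As a minimal sketch, the three steps above might look like the following, assuming a psycopg2 connection to the cluster; the bucket path, IAM role, and table/column names are hypothetical placeholders, and the real statements live in sql_queries.py.

```python
import psycopg2

# Hypothetical statements illustrating the three stages; the project's real
# queries (and table/bucket names) are defined in sql_queries.py.
COPY_STAGING_EVENTS = """
    COPY staging_events
    FROM 's3://example-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/exampleRedshiftRole'
    FORMAT AS JSON 'auto' REGION 'us-west-2';
"""

INSERT_FACT_TABLE = """
    INSERT INTO songplays (start_time, user_id, song_id, artist_id)
    SELECT e.ts, e.user_id, s.song_id, s.artist_id
    FROM staging_events e
    JOIN staging_songs s ON e.song_title = s.title;
"""

def run_pipeline(conn):
    """Steps 1-2: COPY extracts from S3 into staging; step 3: INSERT builds the star schema."""
    with conn.cursor() as cur:
        cur.execute(COPY_STAGING_EVENTS)
        cur.execute(INSERT_FACT_TABLE)
    conn.commit()
```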

Data Warehouse Schema Definition

This is the schema of the staging database: see the database_schema diagram in /img.

This is the schema of the data warehouse (star schema): see the data_warehouse_schema diagram in /img.
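
To make the star-schema idea concrete, here is a hypothetical fact table and dimension table written in the style of sql_queries.py; the names and columns are illustrative, not the project's actual schema.

```python
# Hypothetical DDL in the style of sql_queries.py. In a star schema, a central
# fact table references the surrounding dimension tables by key.
CREATE_DIM_USERS = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        level      VARCHAR
    );
"""

CREATE_FACT_SONGPLAYS = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT REFERENCES users (user_id),  -- fact row keyed to a dimension
        song_id     VARCHAR,
        artist_id   VARCHAR
    );
"""
```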

Project structure

The structure is:

  • create_tables.py - This script drops old tables (if they exist) and re-creates them; see the sketch after this list.
  • etl.py - This script orchestrates the ETL process.
  • sql_queries.py - All the SQL transformations are defined here.
  • /img - Directory with images that are used in this markdown document
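
A minimal sketch of how create_tables.py might implement the drop/re-create step, assuming sql_queries.py exposes drop_table_queries and create_table_queries lists (the exact names are assumptions):

```python
import psycopg2
from sql_queries import drop_table_queries, create_table_queries  # assumed names

def drop_tables(cur, conn):
    """Drop old tables (if they exist) so each run starts from a clean slate."""
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()

def create_tables(cur, conn):
    """Re-create the staging and star-schema tables."""
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()
```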

We also need an extra file, named dhw.cfg, with the credentials and information about the AWS resources.
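
Since dhw.cfg is presumably an INI-style file, it can be read with Python's standard configparser; the section and key names below are assumptions, so match them to the actual file.

```python
import configparser

config = configparser.ConfigParser()
config.read('dhw.cfg')

# Assumed layout, e.g. [CLUSTER] host/db_name/db_user/db_password/db_port,
# [IAM_ROLE] arn, [S3] log_data/song_data; adjust to the real file.
host = config.get('CLUSTER', 'HOST')
iam_role_arn = config.get('IAM_ROLE', 'ARN')
```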
