-
Steps taken for the project
- Download the dataset from Kaggle and the open data portal
- Store raw CSV and JSON data in S3 buckets
- OD_*.csv trip files, for example OD_2014.csv
- stations.json
- Create a data model and schema design for ETL
- Excel spreadsheet that lists tables, columns and data types
- Dimensional model showing trips, station information, station status and calendar, with their relationships
- Code logic to ingest API data, clean it, convert it to CSV and load it into S3 buckets
- Python scripts used as tasks by Airflow; a cleaning/Parquet-conversion task is sketched below
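A minimal sketch of such a task, assuming pandas (with pyarrow) and boto3; the bucket, key and column names are illustrative placeholders, not the project's actual values:

```python
import io

import boto3
import pandas as pd


def clean_and_convert_to_parquet(source_key: str, target_key: str,
                                 bucket: str = "bixi-data-lake") -> None:
    """Read a raw trips CSV from S3, apply basic cleaning and write it back as Parquet."""
    s3 = boto3.client("s3")

    # Read the raw CSV (e.g. OD_2014.csv) straight from S3.
    obj = s3.get_object(Bucket=bucket, Key=source_key)
    df = pd.read_csv(obj["Body"])

    # Basic cleaning: parse timestamps, drop rows missing key fields,
    # and discard non-positive durations.
    df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")
    df["end_date"] = pd.to_datetime(df["end_date"], errors="coerce")
    df = df.dropna(subset=["start_date", "end_date", "start_station_code"])
    df = df[df["duration_sec"] > 0]

    # Write Parquet to an in-memory buffer and upload to the processed prefix.
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)  # requires pyarrow or fastparquet
    s3.put_object(Bucket=bucket, Key=target_key, Body=buffer.getvalue())
```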
- Manual steps to create Glue Crawlers that build the Glue Data Catalog
- Create scripts for an external schema in Redshift that points to the Glue Data Catalog
- Create scripts for dimension and fact tables in Redshift adhering to the data model and schema
- Write COPY statements to populate the fact and dimension tables on a schedule (SQL sketch below)
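A sketch of the Redshift-side SQL, kept as Python string constants the way the project's scripts could hold them; the IAM role ARNs, catalog database, bucket and table names are placeholders:

```python
import psycopg2

# ARNs, database, bucket and table names below are placeholders.
CREATE_EXTERNAL_SCHEMA = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
FROM DATA CATALOG
DATABASE 'bixi_catalog'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

COPY_DIM_STATION = """
COPY dim_station
FROM 's3://bixi-data-lake/processed/stations/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS PARQUET;
"""


def run_statements(conn_params: dict) -> None:
    """Execute the external-schema and COPY statements against Redshift."""
    conn = psycopg2.connect(**conn_params)
    conn.autocommit = True  # some Redshift DDL cannot run inside a transaction block
    try:
        with conn.cursor() as cur:
            for statement in (CREATE_EXTERNAL_SCHEMA, COPY_DIM_STATION):
                cur.execute(statement)
    finally:
        conn.close()
```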
- Automate the process using Airflow (DAG sketch after this list)
- Create the DAG script
- Create task scripts for data cleaning and Parquet conversion
- Create a task script to execute ETL operations on Redshift
- Other helper functions
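A sketch of how the DAG could wire these pieces together, assuming Airflow 2.x, hypothetical Variable names (bixi_schedule, bixi_start_date, bixi_catchup) and that the task functions sketched above live in a hypothetical tasks package:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Hypothetical modules holding the task functions sketched earlier.
from tasks.clean_trips import clean_and_convert_to_parquet
from tasks.redshift_etl import run_statements

# Schedule, start date and catchup are read from Airflow Variables so they can be
# changed from the web UI (historic back-fill vs incremental runs) without code changes.
schedule = Variable.get("bixi_schedule", default_var="@daily")
start_date = datetime.fromisoformat(Variable.get("bixi_start_date", default_var="2014-01-01"))
catchup = Variable.get("bixi_catchup", default_var="False") == "True"

with DAG(
    dag_id="bixi_trips_etl",
    schedule_interval=schedule,
    start_date=start_date,
    catchup=catchup,
) as dag:

    clean_trips = PythonOperator(
        task_id="clean_and_convert_trips",
        python_callable=clean_and_convert_to_parquet,
        op_kwargs={
            # Illustrative keys; in practice these would be templated per run.
            "source_key": "raw/trips/OD_2014.csv",
            "target_key": "processed/trips/2014/trips.parquet",
        },
    )

    load_redshift = PythonOperator(
        task_id="load_redshift",
        python_callable=run_statements,
        op_kwargs={"conn_params": {}},  # real credentials come from Airflow connections
    )

    clean_trips >> load_redshift
```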
-
Purpose of Final Data Model
- Data model to report on trip durations sliced by month/period of day/year
- Performance comparison of stations
- YoY growth of trips
- Scenario: if the data were increased by 100x
- Convert data to Parquet (a columnar format that saves storage and improves query performance)
- Improve ETL query performance by using distribution keys and sort keys on the trips fact table
- Use DISTSTYLE ALL for the dimension tables to improve join performance (see the DDL sketch under Data Model below)
- Increase Airflow cluster memory
- Scenario: if the pipelines had to run daily by 7 am
- Use an Airflow Variable for the schedule so it can be set directly from the Airflow web interface without changing the code (as in the DAG sketch above)
- Scenario: if the database needed to be accessed by 100 people
- Use Redshift workload management (WLM) to improve query queuing
-
Choice of tools and tech
- Jupyter Notebook for proof of concept
- Airflow for job monitoring and orchestration
- Requires creating DAGs that use Airflow Variables for the following (see the DAG sketch in the Steps section above)
- Start Date
- Historic/incremental DAG runs
- Schedule
- Catchup
- Python scripts for Airflow tasks
- S3 as the data lake, with CSV/JSON data transformed to Parquet format
- Redshift as the data warehouse for reporting purposes
- Sublime Text as the script editor
-
Data Model
-
Star schema approach with the following tables (DDL sketch below)
Fact Table
- fact_trips (data for all the trips)
Dimension Tables
- dim_station (data for general station info)
- dim_calendar (calendar data for day, month, year, holidays, week)
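A DDL sketch of these tables with the distribution and sort key choices described in the 100x scenario above; the column lists are illustrative placeholders and the authoritative definitions live in the data dictionary:

```python
# Illustrative Redshift DDL; column definitions are placeholders.

CREATE_FACT_TRIPS = """
CREATE TABLE IF NOT EXISTS fact_trips (
    trip_id            BIGINT IDENTITY(0, 1),
    start_date         TIMESTAMP NOT NULL,
    end_date           TIMESTAMP,
    start_station_code INTEGER NOT NULL,
    end_station_code   INTEGER,
    duration_sec       INTEGER,
    is_member          BOOLEAN
)
DISTKEY (start_station_code)   -- co-locate trips with their station for joins
SORTKEY (start_date);          -- reports slice by month/period of day/year
"""

CREATE_DIM_STATION = """
CREATE TABLE IF NOT EXISTS dim_station (
    station_code INTEGER PRIMARY KEY,
    name         VARCHAR(256),
    latitude     DOUBLE PRECISION,
    longitude    DOUBLE PRECISION
)
DISTSTYLE ALL;                 -- small table, replicated to every node for cheap joins
"""

CREATE_DIM_CALENDAR = """
CREATE TABLE IF NOT EXISTS dim_calendar (
    calendar_date DATE PRIMARY KEY,
    day           SMALLINT,
    week          SMALLINT,
    month         SMALLINT,
    year          SMALLINT,
    is_holiday    BOOLEAN
)
DISTSTYLE ALL;
"""
```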
-
- Clean coding standards
- Use of PEP 8 coding standards for Python code
- Use proper syntax and formatting for SQL
- Modular
- Create functions wherever possible in the Python scripts
- Use common table expressions (CTEs) and temp tables for ETL (see the sketch below)
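A sketch of a CTE-based load from the external (Spectrum) schema into the fact table; the schema, table and column names are placeholders:

```python
# Placeholder names throughout; the external table is assumed to be spectrum.trips.
INSERT_FACT_TRIPS = """
INSERT INTO fact_trips (start_date, end_date, start_station_code,
                        end_station_code, duration_sec, is_member)
WITH deduped_trips AS (
    SELECT start_date,
           end_date,
           start_station_code,
           end_station_code,
           duration_sec,
           is_member,
           ROW_NUMBER() OVER (
               PARTITION BY start_date, start_station_code, end_date, end_station_code
               ORDER BY duration_sec
           ) AS rn
    FROM spectrum.trips
)
SELECT start_date, end_date, start_station_code,
       end_station_code, duration_sec, is_member
FROM deduped_trips
WHERE rn = 1;
"""
```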
- At least two data quality checks
- Count of records is always greater than 0
- Null counts (see the sketch below)
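A sketch of the two checks as task callables, assuming the Airflow Postgres provider hook and a connection named redshift:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_row_count(table: str, redshift_conn_id: str = "redshift") -> None:
    """Fail the task if the table is empty."""
    hook = PostgresHook(postgres_conn_id=redshift_conn_id)
    count = hook.get_first(f"SELECT COUNT(*) FROM {table};")[0]
    if count == 0:
        raise ValueError(f"Data quality check failed: {table} contains no rows")


def check_nulls(table: str, column: str, redshift_conn_id: str = "redshift") -> None:
    """Fail the task if a key column contains NULLs."""
    hook = PostgresHook(postgres_conn_id=redshift_conn_id)
    null_count = hook.get_first(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL;")[0]
    if null_count > 0:
        raise ValueError(f"Data quality check failed: {table}.{column} has {null_count} NULLs")
```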
- The ETL process results in the data model outlined in the write-up
- A data dictionary of the final data model is included
- The data model is appropriate for the identified purpose
- At least 2 data sources
- Stations JSON (from Kaggle)
- Trips CSV
- Stations API (from open data)
- More than a million rows of data
- The trips CSVs include over 3.2 million rows per year
- Pipeline schedule frequency options (set via an Airflow Variable)
- Hourly
- Daily (default)
- Monthly
- Airflow will need S3 and Redshift credentials to access and process data
- S3 access keys and Redshift data warehouse credentials can be stored securely as Airflow connections (hook sketch below)
- Airflow encrypts stored passwords internally, making this safe
- An alternative option is to use AWS Secrets Manager
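A sketch of how tasks could pick up those credentials through Airflow hooks; the connection IDs (aws_credentials, redshift) are assumptions created beforehand in the Airflow UI, and the import paths assume Airflow 2.x with the Amazon and Postgres providers installed:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook


def upload_to_s3(filename: str, key: str, bucket: str) -> None:
    # The S3 access keys live in the "aws_credentials" connection, not in the code.
    S3Hook(aws_conn_id="aws_credentials").load_file(
        filename=filename, key=key, bucket_name=bucket, replace=True
    )


def run_redshift_sql(sql: str) -> None:
    # The Redshift credentials live in the "redshift" connection.
    PostgresHook(postgres_conn_id="redshift").run(sql)
```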