Skip to content

shubhamparulekar/sql_data_warehouse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SQL Data Warehouse and Analytics Solution

This repository presents a comprehensive data warehousing and analytics solution, demonstrating the end-to-end process of building a robust data warehouse and generating actionable insights. The project adheres to industry best practices in data engineering and analytics, showcasing a structured approach to data management and reporting.


Overview

The project aims to transform raw, disparate data into a structured and analyzable format through a multi-layered data warehousing approach. It covers data ingestion, cleansing, transformation, and presentation, culminating in a ready-to-use dataset for business intelligence and reporting.


Features

  • Multi-Layered Data Architecture: Implements Bronze, Silver, and Gold layers for progressive data refinement.
    • Bronze Layer: Raw, immutable data directly ingested from source systems.
    • Silver Layer: Cleaned, conformed, and integrated data, ready for initial analysis.
    • Gold Layer: Business-ready data structured into dimension and fact tables for analytical and reporting use cases.
  • Data Transformation Pipelines: Scripts for ETL/ELT processes to move data between layers.
  • Data Catalog: Comprehensive documentation of all tables and columns across the data warehouse layers.
  • Actionable Insights: Foundation for generating reports and dashboards to support business decisions.
  • Scalable Design: Designed with scalability in mind to accommodate growing data volumes.

Data Architecture

image

The data architecture follows a medallion architecture pattern, consisting of three distinct layers:

  1. Bronze Layer:

    • Purpose: Ingests raw data from various source systems (e.g., CRM, ERP) without any transformations. It serves as a historical archive.
    • Characteristics: Immutable, historical record of source data.
    • Files: bronze_crm_cust_info.csv, bronze_crm_prd_info.csv, bronze_crm_sales_details.csv, bronze_erp_cust_az12.csv, bronze_erp_loc_a101.csv, bronze_erp_px_cat_g1v2.csv
  2. Silver Layer:

    • Purpose: Cleanses, standardizes, and integrates data from the Bronze layer. It addresses data quality issues and applies basic transformations, providing a "single source of truth."
    • Characteristics: Conformed, integrated, and ready for initial analysis.
    • Files: silver_crm_cust_info.csv, silver_crm_prd_info.csv, silver_crm_sales_details.csv, silver_erp_cust_az12.csv, silver_erp_loc_a101.csv, silver_erp_px_cat_g1v2.csv
  3. Gold Layer:

    • Purpose: Presents highly refined, aggregated, and business-centric data. It's optimized for analytical queries and reporting, often structured for specific business needs.
    • Characteristics: Typically uses a star schema or snowflake schema, consisting of fact and dimension tables.
    • Files: gold_dim_customers.csv, gold_dim_products.csv, gold_fact_sales.csv

Project Structure

The repository is organized into the following directories:

sql_data_warehouse/
├── datasets/                 # Contains raw and processed CSV data files for each layer.
│   ├── bronze/
│   ├── silver/
│   └── gold/
├── docs/                     # Documentation files, including the data catalog.
├── outputs/                  # Stores generated reports, dashboards, or analytical results.
├── scripts/                  # ETL/ELT scripts for data processing and transformation.
├── tests/                    # Unit and integration tests for data quality and pipeline validation.
└── README.md                 # This documentation file.

Getting Started

To set up and run this project locally, follow these steps:

Prerequisites

  • A SQL database system (e.g., PostgreSQL, MySQL, SQL Server): If you plan to load this data into a database, you'll need a running instance.

Installation

  1. Clone the repository:
    git clone [https://github.com/shubhamparulekar/sql_data_warehouse.git](https://github.com/shubhamparulekar/sql_data_warehouse.git)
    cd sql_data_warehouse

Usage

The scripts/ directory contains the core logic for data processing. You can run these scripts sequentially to build the data warehouse layers.

  1. Data Ingestion (Bronze Layer):

    • Ensure your raw CSV files are placed in the datasets/bronze/ directory.
    • (If applicable) Run scripts that ingest data into your initial staging area or database.
  2. Data Cleansing and Transformation (Silver Layer):

    • Execute scripts that process data from the Bronze layer, apply cleansing rules, and store the refined results in datasets/silver/.
  3. Data Modeling and Aggregation (Gold Layer):

    • Run scripts that transform Silver layer data into the final dimension and fact tables, storing them in datasets/gold/ or loading them into your target analytical database.

Contributing

Contributions are welcome! If you have suggestions for improvements, new features, or bug fixes, please feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License. For full details, please refer to the LICENSE file within the repository (if applicable).

Acknowledgements

This readme.md was generated using Gemini

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages