Fashion-Campus-Orders

Data Engineering Zoomcamp 2024 project. An end-to-end data pipeline for the Indonesian e-commerce website Fashion Campus.

Data is extracted from Kaggle - https://www.kaggle.com/datasets/latifahhukma/fashion-campus/data?select=transactions.csv

About the data: Fashion Campus, an e-commerce fashion company targeting "Indonesian Young Urbans" (young people aged 15-35), was established in Indonesia in early 2016. The company offers a catalog of local and international brands popular among young people. Since the data is static, the pipeline runs as a one-time process. The dataset contains four CSV files:

  1. Clickstream
  2. Transactions
  3. Product
  4. Customer

Goal

Develop a data architecture on Google Cloud Platform from the raw Fashion Campus data. The data is extracted from Kaggle; initial data ingestion and workflow orchestration are done through Mage. The final ETL pipeline is developed in dbt. Once the data is stored in the warehouse (BigQuery), business visualizations are built in Looker.

Data Architecture of the project:

[Project data architecture diagram]

Tools and Steps

  1. Cloud:

    • Google Cloud Platform (GCP)
    • All storage, warehousing and dashboarding for this project run on GCP.

  2. Data Ingestion (batch):

    • Mage
    • Batch data ingestion is done through Mage, which makes it easy to handle large volumes of data; the data is written to the data lake in batches (a minimal ingestion sketch is shown after this list).
  3. Data Lake:

    • Google Cloud Storage
    • Once the data has been ingested and processed by Mage, it is stored in Google Cloud Storage, where it is readily accessible for further processing.
  4. Data Transformations and Processing:

    • dbt

    • dbt is used to develop the ETL: staging models are built for each source file and then joined into a fact table.

    • Further dimension models are created as required, and the data is pushed into the data warehouse in batches (a sketch of running the dbt step is shown after this list).

      [dbt data architecture / model lineage diagram]

  5. Data Warehousing:

    • Google BigQuery
    • Data from both the dev and prod environments is stored in BigQuery, which makes ad hoc SQL querying straightforward and also serves the data for visualization in Looker (an example query is shown after this list).
  6. Dashboarding:

    • Looker
    • Business dashboards are built in Looker on top of the BigQuery warehouse.

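The ingestion step (items 2-3) can be sketched as a pair of Mage-style blocks. This is a minimal illustration rather than the exact project code: the bucket name, local file paths and the decorator imports are assumptions, and writing `gs://` paths with pandas requires the `gcsfs` and `pyarrow` packages.

```python
import pandas as pd

# Mage block decorators (in a real Mage project, the loader and the exporter
# would typically live in separate pipeline blocks).
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

FILES = ['clickstream', 'transactions', 'product', 'customer']  # the 4 Kaggle CSVs
BUCKET = 'fashion-campus-datalake'  # hypothetical GCS bucket name


@data_loader
def load_kaggle_csvs(**kwargs) -> dict:
    """Read the locally downloaded Kaggle CSVs into DataFrames."""
    return {name: pd.read_csv(f'data/{name}.csv') for name in FILES}


@data_exporter
def export_to_gcs(frames: dict, **kwargs) -> None:
    """Write each table to the data lake as Parquet, one object per file."""
    for name, df in frames.items():
        df.to_parquet(f'gs://{BUCKET}/raw/{name}.parquet', index=False)
```
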
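The dbt step (item 4) is developed and run against two environments. Below is a minimal sketch of kicking off the runs from Python; it assumes the dbt CLI is installed and that profiles.yml defines `dev` and `prod` targets pointing at BigQuery (the target names simply mirror the environments mentioned above).

```python
import subprocess


def run_dbt(target: str) -> None:
    """Build the staging, fact and dimension models and run their tests."""
    subprocess.run(['dbt', 'build', '--target', target], check=True)


if __name__ == '__main__':
    run_dbt('dev')   # develop and test against the dev dataset
    run_dbt('prod')  # then materialize the models in the prod dataset
```
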
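Once the models land in BigQuery (item 5), ad hoc analysis is a short script away. A minimal sketch, assuming the `google-cloud-bigquery` client library, application-default credentials, and a hypothetical `fact_orders` table with a `payment_method` column:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Hypothetical project, dataset, table and column names -- adjust to the
# actual dbt output in BigQuery.
query = """
    SELECT payment_method, COUNT(*) AS orders
    FROM `my-gcp-project.fashion_campus.fact_orders`
    GROUP BY payment_method
    ORDER BY orders DESC
"""

for row in client.query(query).result():
    print(row.payment_method, row.orders)
```
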
Future Scope of the project

  1. Creating a CI/CD pipeline for dbt so that model changes can be tested and merged easily through Git.
  2. Developing further clickstream visualizations to help retain customers.
  3. Developing further dimensions in the ETL architecture to generate more specialized data.
