# 📦 Setup Mounts And Paths
### 🧰 Importing project helper functions

In this project, we've organized common setup and data generation logic into a separate **`helpers`** module.  
By importing these functions, we can keep the notebook clean and focus on the **data engineering workflow**, rather than repeating boilerplate code.

- `setup_schemas` – creates and configures the database schemas used in this project.  
- `setup_volumes` – sets up Unity Catalog volumes where data files will be stored.  
- `generate_user_data` – generates realistic synthetic user data.  
- `generate_product_data` – generates product catalog data with daily updates.  
- `generate_sales_data` – generates transactional sales data with inserts, updates, and deletes.

This modular approach makes the pipeline easier to **maintain**, **reuse**, and **extend**.

In [0]:
from helpers import setup_schemas, setup_volumes, generate_batch_files as gen

### 🏗️ Creating user schemas across environments

This command initializes the **database schemas** (namespaces) that will structure our data in **Unity Catalog**.

The function `setup_schemas.create_user_schemas()`:

- Loops through the environments: `dev` and `prd`.  
- Uses the catalog for each environment (e.g., `capstone_dev`, `capstone_prd`).  
- Creates the schemas (if they don't already exist) for each data layer:
  - `bronze` – minimally processed data.  
  - `silver` – cleaned and standardized datasets.  
  - `gold` – curated datasets ready for analytics.

By creating these schemas up front, we establish a **consistent data architecture** for the entire project.
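
The loop described above can be sketched as follows. This is a minimal illustration, not the actual body of `setup_schemas.create_user_schemas()`: the function name `build_schema_statements` and the `catalog_prefix` parameter are assumptions made for the example. It only builds the SQL strings, so it runs anywhere; on Databricks each statement would typically be passed to `spark.sql(...)`.

```python
# Hypothetical sketch: names mirror the description above, not the real helper.
ENVIRONMENTS = ["dev", "prd"]
LAYERS = ["bronze", "silver", "gold"]


def build_schema_statements(catalog_prefix="capstone"):
    """Return one CREATE SCHEMA statement per environment and data layer."""
    statements = []
    for env in ENVIRONMENTS:
        catalog = f"{catalog_prefix}_{env}"  # e.g. capstone_dev
        for layer in LAYERS:
            # IF NOT EXISTS makes the statement safe to re-run.
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{layer}")
    return statements


for statement in build_schema_statements():
    print(statement)  # on Databricks: spark.sql(statement)
```

Because every statement uses `IF NOT EXISTS`, re-running the cell does not fail when the schemas are already in place.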


In [0]:
setup_schemas.create_user_schemas()

### 🗂️ Creating Unity Catalog Volumes

Once our schemas are set up, we need **volumes** to store the raw data files for each entity in the project.

The function `setup_volumes.create_volumes()`:

- Iterates through the environments: `dev` and `prd`.  
- Targets the `bronze` schema in each environment (e.g., `capstone_dev.bronze`).  
- Creates volumes for each purpose:
  - `raw_files` – stores generated and raw files.  
  - `checkpoint_files` – stores checkpoint files.  
  - `schema_files` – stores schema files.
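
As a rough sketch of the steps above (not the real implementation of `setup_volumes.create_volumes()`), the volume statements and the resulting file paths could be built like this. The helper names `build_volume_statements` and `volume_path` are hypothetical; the `/Volumes/<catalog>/<schema>/<volume>` layout is the standard Unity Catalog path convention.

```python
# Hypothetical sketch: helper names are illustrative, not part of the project.
ENVIRONMENTS = ["dev", "prd"]
VOLUMES = ["raw_files", "checkpoint_files", "schema_files"]


def build_volume_statements(catalog_prefix="capstone", schema="bronze"):
    """Return one CREATE VOLUME statement per environment and volume."""
    return [
        f"CREATE VOLUME IF NOT EXISTS {catalog_prefix}_{env}.{schema}.{volume}"
        for env in ENVIRONMENTS
        for volume in VOLUMES
    ]


def volume_path(env, volume, catalog_prefix="capstone", schema="bronze"):
    """Return the path under which files in a Unity Catalog volume are addressed."""
    return f"/Volumes/{catalog_prefix}_{env}/{schema}/{volume}"


for statement in build_volume_statements():
    print(statement)  # on Databricks: spark.sql(statement)

print(volume_path("dev", "raw_files"))
```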


In [0]:
setup_volumes.create_volumes()

### 👤 Generating synthetic user data

Now that the schemas and volumes are set up, we can generate the **user dataset** for the project.

The function `generate_user_data.main()`:

- Creates **synthetic user records** with realistic attributes (name, email, phone, etc.).  
- Writes the generated data to the **`user` volume** in the specified environment.  
- We run it for both environments:
  - `env="dev"` – for development/testing purposes.  
  - `env="prd"` – for production-like datasets.

This step populates the **raw layer** of the Lakehouse with user data, ready for downstream processing and analytics.
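
To make the idea of synthetic user records concrete, here is a small standard-library sketch. It is not the project's generator: the function `generate_users`, the name lists, and the field layout are all assumptions for illustration, and a real generator would likely draw on a much larger pool of values.

```python
import random

# Illustrative value pools; the real helper presumably uses richer data.
FIRST_NAMES = ["Ana", "Bruno", "Carla", "Diego"]
LAST_NAMES = ["Silva", "Costa", "Souza", "Lima"]


def generate_users(count, seed=42):
    """Return `count` synthetic user records; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    users = []
    for user_id in range(1, count + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        users.append(
            {
                "user_id": user_id,
                "name": f"{first} {last}",
                # The id suffix keeps emails unique even when names repeat.
                "email": f"{first}.{last}{user_id}@example.com".lower(),
                "phone": f"555-{rng.randint(0, 9999):04d}",
            }
        )
    return users


for user in generate_users(3):
    print(user)
```

Seeding the generator means `dev` and `prd` can receive identical or deliberately different datasets simply by choosing the seed.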


In [0]:
gen.copy_base_files("users", "dev")
gen.copy_base_files("users", "prd")

### 🛍️ Generating product catalog data

After generating user data, we create the **product dataset** for the project.

The function `generate_product_data.main()`:

- Generates a **synthetic product catalog** with attributes such as:
  - Product ID
  - Name
  - Category
  - Price
  - Daily updates to simulate realistic changes  
- Saves the generated data to the **`product` volume** in the specified environment.  
- Executed for both environments:
  - `env="dev"` – for development and testing workflows.  
  - `env="prd"` – for production-like datasets.

This step populates the **raw layer** of the Lakehouse with product data, providing a foundation for sales and event generation downstream.


In [0]:
gen.copy_base_files("products", "dev")
gen.copy_base_files("products", "prd")

### 💰 Generating sales data

This script creates synthetic **sales transactions** for the project, including inserts, updates, and deletes.  

- First, it generates a **historical snapshot** of sales for the base date.  
- Then, it creates **daily changes** over multiple days, simulating realistic updates and new transactions.  
- All data is saved directly to the **`sales` volume** in the specified environment (`dev` or `prd`) using an **in-memory buffer**.
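
The snapshot-then-daily-changes pattern can be sketched with the standard library alone. Everything here is an assumption made for illustration: the column names, the `operation` flag, the base date of 2024-01-01, and the helper names are not taken from the project. Each batch is serialized to an in-memory `io.StringIO` buffer, which is where a real script would hand the content to the volume writer instead of creating a local temporary file.

```python
import csv
import io
import random
from datetime import date, timedelta

FIELDS = ["sale_id", "sale_date", "amount", "operation"]  # assumed column layout


def base_snapshot(base_date, count, rng):
    """Historical snapshot: every row is an insert dated on the base date."""
    return [
        {"sale_id": i, "sale_date": base_date.isoformat(),
         "amount": round(rng.uniform(5, 500), 2), "operation": "insert"}
        for i in range(1, count + 1)
    ]


def daily_changes(live_ids, day, next_id, rng):
    """One day's batch: one update, one delete, and one new insert."""
    updated_id, deleted_id = rng.sample(sorted(live_ids), 2)
    stamp = day.isoformat()
    return [
        {"sale_id": updated_id, "sale_date": stamp,
         "amount": round(rng.uniform(5, 500), 2), "operation": "update"},
        {"sale_id": deleted_id, "sale_date": stamp, "amount": "", "operation": "delete"},
        {"sale_id": next_id, "sale_date": stamp,
         "amount": round(rng.uniform(5, 500), 2), "operation": "insert"},
    ]


def to_csv_buffer(rows):
    """Serialize rows to CSV in memory rather than to a local file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


rng = random.Random(7)
base_date = date(2024, 1, 1)  # assumed base date
snapshot = base_snapshot(base_date, 5, rng)
live_ids = {row["sale_id"] for row in snapshot}
next_id = len(snapshot) + 1

batches = {base_date.isoformat(): to_csv_buffer(snapshot)}
for offset in range(1, 4):
    day = base_date + timedelta(days=offset)
    changes = daily_changes(live_ids, day, next_id, rng)
    live_ids.discard(changes[1]["sale_id"])  # deleted rows leave the live set
    live_ids.add(next_id)                    # new inserts join it
    next_id += 1
    batches[day.isoformat()] = to_csv_buffer(changes)

for day_key, buffer in batches.items():
    print(day_key, len(buffer.getvalue().splitlines()) - 1, "rows")
```

Tracking `live_ids` keeps the simulation consistent: a sale that was deleted on one day is never updated or deleted again on a later day.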


In [0]:
gen.copy_base_files("sales", "dev")
gen.copy_base_files("sales", "prd")