This solution accelerator notebook is available at [Databricks Industry Solutions](https://github.com/databricks-industry-solutions/).

# Prepare the Dataset

This notebook generates a synthetic three-tier supply chain network along with operational data, which will be used in later notebooks for stress testing. We will review the properties of the supply chain network and the requirements for the operational data.

## Cluster Configuration
This notebook was tested on the following Databricks cluster configuration:
- **Databricks Runtime Version:** 16.4 LTS ML (includes Apache Spark 3.5.2, Scala 2.12)
- **Single Node** 
    - Azure: Standard_DS4_v2 (28 GB Memory, 8 Cores)
    - AWS: m5d.2xlarge (32 GB Memory, 8 Cores)
- **Photon Acceleration:** Disabled (Photon boosts Apache Spark workloads; not all ML workloads will see an improvement)

In [0]:
%pip install -r ./requirements.txt --quiet
dbutils.library.restartPython()

In [0]:
import random
import numpy as np
import pandas as pd
import scripts.utils as utils

## Generate Supply Chain Network and Data

In the real world, the supply chain network already exists, so the primary task is to collect data and map the network. However, in this solution accelerator, we generate a synthetic dataset for the entire supply chain. This allows us to control the setup and better understand the methodology in depth.

Let’s generate a three-tier supply chain network consisting of sub-suppliers (Tier 3), direct suppliers (Tier 2), and finished-goods plants (Tier 1). The utility function `utils.generate_data` outputs the network topology (directed edges) and key operational parameters, including inventory, capacity, demand, and profit margins.

In [0]:
# Generate a synthetic 3-tier supply chain network dataset for optimization
# N1: number of product (Tier 1) nodes
# N2: number of direct supplier (Tier 2) nodes
# N3: number of sub-supplier (Tier 3) nodes
dataset = utils.generate_data(N1=5, N2=10, N3=20) # DO NOT CHANGE

Let's visualize the network.

In [0]:
# Visualizes the 3-tier network
utils.visualize_network(dataset)

- Product (finished-goods plants: ●) at the top
- Tier 2 (direct suppliers: ■) in the middle
- Tier 3 (sub-suppliers: ▲) at the bottom
- Nodes with the same color produce and supply the same `material_type`.
- Grey is used for Product nodes that have no material-type code.
- Edges run from bottom to top, illustrating the flow of materials (Tier 3 ➜ Tier 2 ➜ Product).
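
The layering described above can be checked on a small example. The sketch below is only an illustration: the node IDs, the edge list, and the helper `edges_follow_tiers` are hypothetical and are not produced by `utils.generate_data` or `utils.visualize_network`. It assumes edges are `(source, target)` tuples and verifies that each edge moves exactly one tier toward the Product layer.

```python
# Toy three-tier network. Node IDs and edges are illustrative only
# (assumed for this sketch, not the output of utils.generate_data).
tier1 = ["P1", "P2"]                        # Product nodes
tier2 = ["S2_1", "S2_2", "S2_3"]            # direct suppliers
tier3 = ["S3_1", "S3_2", "S3_3", "S3_4"]    # sub-suppliers

edges = [
    ("S3_1", "S2_1"), ("S3_2", "S2_1"),
    ("S3_3", "S2_2"), ("S3_4", "S2_3"),
    ("S2_1", "P1"), ("S2_2", "P1"),
    ("S2_2", "P2"), ("S2_3", "P2"),
]

# Map every node to its tier number.
tier_of = {node: 1 for node in tier1}
tier_of.update({node: 2 for node in tier2})
tier_of.update({node: 3 for node in tier3})


def edges_follow_tiers(edges, tier_of):
    """Return True if every edge goes exactly one tier downstream (3 -> 2 or 2 -> 1)."""
    return all(tier_of[src] - tier_of[dst] == 1 for src, dst in edges)


is_layered = edges_follow_tiers(edges, tier_of)
print(is_layered)
```

A check like this can be useful when mapping a real network, where an edge that skips a tier or points upstream usually signals a data-collection issue.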

## Operational Data

The operational data required to run stress tests on your supply chain network depends largely on how the optimization problem is formulated, that is, on its objective function and constraints. In this solution accelerator, we follow the formulation presented in this [paper](https://dspace.mit.edu/handle/1721.1/101782). The table below outlines the data elements used in this approach.

For more details on the problem formulation, variable definitions, and key assumptions, refer to the paper or the notebook `04_appendix`.

| Variable                     | What it represents                                                                                         |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------- |
| **tier1 / tier2 / tier3**    | Lists of node IDs in each tier.                                                                            |
| **edges**                    | Directed links `(source, target)` showing which node supplies which.                                       |
| **material\_type**           | List of all material types.                                                                                |
| **supplier\_material\_type** | Material type each supplier produces and supplies.                                                         |
| **f**                        | Profit margin for each Tier 1 node’s finished product.                                                     |
| **s**                        | On-hand inventory units at every node.                                                                     |
| **d**                        | Demand per time unit for Tier 1 products.                                                                  |
| **c**                        | Production capacity per time unit at each node.                                                            |
| **r**                        | Number of units of material type k required to make one unit of node j.                                    |
| **N\_minus**                 | For each node j (Tier 1 or 2), the set of material types it requires.                                      |
| **N\_plus**                  | For each supplier i (Tier 2 or 3), the set of downstream nodes j it feeds.                                 |
| **P**                        | For each `(j, material_part)` pair, a list of upstream suppliers i that provide it (multi-sourcing view).  |
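
To make the relationship between `edges`, `supplier_material_type`, `N_plus`, `N_minus`, and `P` more concrete, here is a minimal sketch that derives the three lookup structures from a toy edge list. The node IDs, material codes, and container types (sets and lists inside dictionaries) are assumptions made for illustration; the structures returned by `utils.generate_data` may be organized differently.

```python
from collections import defaultdict

# Toy inputs. All names and values are hypothetical.
edges = [
    ("T3_a", "T2_a"), ("T3_b", "T2_a"),   # two sub-suppliers feed T2_a
    ("T2_a", "T1_a"), ("T2_b", "T1_a"),   # two direct suppliers feed T1_a
]
supplier_material_type = {
    "T3_a": "m1", "T3_b": "m1",
    "T2_a": "m2", "T2_b": "m3",
}

N_plus = defaultdict(set)    # supplier i -> downstream nodes j it feeds
N_minus = defaultdict(set)   # node j -> material types it requires
P = defaultdict(list)        # (node j, material type) -> suppliers i providing it

for i, j in edges:
    k = supplier_material_type[i]   # material type that supplier i provides
    N_plus[i].add(j)
    N_minus[j].add(k)
    P[(j, k)].append(i)

for key, suppliers in P.items():
    print(key, suppliers)
```

In this toy network, `T2_a` sources material `m1` from both `T3_a` and `T3_b`, so `P[("T2_a", "m1")]` holds two suppliers, which is the multi-sourcing view mentioned in the table.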

Let's take a peek at the `dataset` dictionary that contains all the data we need.

In [0]:
# Print each entry of the dataset dictionary, separated by blank lines
for key, value in dataset.items():
    print(f"{key}: {value}\n")
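
For larger networks, printing every value in full can become hard to read. As an optional alternative, the sketch below reports only the type and size of each entry. The helper `summarize` and the toy dictionary are hypothetical and not part of `scripts.utils`; the helper works on any dictionary and makes no assumption about the exact keys in `dataset`.

```python
def summarize(data):
    """Return {key: (type name, size or None)} for a dictionary of mixed values."""
    summary = {}
    for key, value in data.items():
        # Containers report their length; scalars have no size.
        size = len(value) if hasattr(value, "__len__") else None
        summary[key] = (type(value).__name__, size)
    return summary


# Toy dictionary standing in for the real dataset (values are illustrative).
toy = {
    "tier1": ["T1_a"],
    "edges": [("T2_a", "T1_a"), ("T2_b", "T1_a")],
    "d": {"T1_a": 10},
}

toy_summary = summarize(toy)
for key, (kind, size) in toy_summary.items():
    print(f"{key}: {kind}, size={size}")
```

Calling `summarize(dataset)` in the same way should give a compact overview before inspecting individual entries.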

## Wrap Up

In this notebook, we generated a synthetic three-tier supply chain network along with the corresponding operational data. We also reviewed the structure of the network and the key requirements for the data.

In the next notebook, `02_stress_testing (small network)`, we will run multiple stress tests on the small network constructed here.

&copy; 2025 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source].  All included or referenced third party libraries are subject to the licenses set forth below.

| library | description | license | source |
|---------|-------------|---------|--------|
| pyomo | An object-oriented algebraic modeling language in Python for structured optimization problems | BSD | https://pypi.org/project/pyomo/ |
| highspy | Linear optimization solver (HiGHS) | MIT | https://pypi.org/project/highspy/ |