<div style="display: flex; gap: 10px;">
  <img src="../images/HOOPS_AI.jpg" style="width: 20%;">

# Training your custom HOOPS Embeddings Model

> **Purpose**: This document is for **Data Scientists** who want to **Train custom HOOPS Embedding models**. 

## Overview

The `EmbeddingFlowModel` is a specialized FlowModel implementation for **training** shape-embedding models from CAD data using contrastive learning. It enables data scientists to train custom HOOPS Embeddings models on their own CAD datasets.

### Training → Production Workflow

1. **Train** a custom HOOPS Embeddings model using `EmbeddingFlowModel` + `FlowTrainer` (this document)
2. **Register** the trained model with `HOOPSEmbeddings.register_model()` 
3. **Deploy** for production use via `HOOPSEmbeddings` API (see notebook HOOPS_embeddings_cad_search_fabwave for an example)

### When to Train Custom Models

- Your CAD parts have unique geometric characteristics not captured by pre-trained models
- You need domain-specific embeddings (e.g., specific industry, manufacturing process)
- You have a large proprietary dataset to learn from
- You want to optimize embedding dimensions for your use case

**Note**: HOOPS AI provides a pre-trained model (e.g., `ts3d_1M_dual_v1`) that can be used directly. It was trained on a large dataset of nearly 1M parts from **public datasets (ABC, fabwave, etc.)**. See the [production guide](../Embeddings%20&%20Similarity/embeddings_and_retrieval_guide.md) for how to use it.

## Key Features

- **Contrastive Learning**: Learns shape representations by distinguishing between similar and dissimilar CAD geometries
- **Flexible Architecture**: Configurable embedding dimensions, projection layers, and training parameters
- **Unsupervised Training**: No labels required per CAD file - learns from geometric structure alone 
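
To make the contrastive idea concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is an illustration only: this document does not state which contrastive objective `EmbeddingFlowModel` uses or how positive pairs are built, so the function name `info_nce_loss`, the slightly perturbed "second view", and the toy sizes are all assumptions.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss: row i of `anchors` should match row i of `positives`."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of this anchor to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability assigned to the true match.
        total += log_denominator - logits[i]
    return total / len(a)


random.seed(0)
# Hypothetical embeddings: 8 shapes, 16 dimensions each.
shapes = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# A second, slightly perturbed view of each shape (assumed way to form positives).
views = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in shapes]

loss_aligned = info_nce_loss(shapes, views)
loss_shuffled = info_nce_loss(shapes, views[::-1])  # every pair is now mismatched
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

Correctly paired views produce a much lower loss than mismatched ones. A contrastive objective of this kind rewards the model for pulling matching geometries together and pushing the rest apart, without needing labels.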

In [1]:
import hoops_ai
import os

hoops_ai.set_license(hoops_ai.use_test_license(), validate=False)

ℹ️ Using TEST LICENSE (expires February 8th, 2026 - 12 days remaining)
   For production use, obtain your own license from Tech Soft 3D
HOOPS AI version :  1.0.0-b2dev9 



In [2]:
import pathlib
import hoops_ai
from hoops_ai.dataset import DatasetLoader
from hoops_ai.ml.EXPERIMENTAL import FlowTrainer

In [3]:
# We define our tasks in a separate file for multiprocessing compatibility
from scripts.cad_tasks_embeddings import EmbeddingModel, flows_outputdir, get_flow_name, gather_cad_files, encode_data_for_ml_training

HOOPS AI version :  1.0.0-b2dev9 



## Constructor Parameters

### Essential Training Parameters

#### `emb_dim` (int, default: 1024)
The dimensionality of the learned embeddings. This determines the size of the vector representation for each CAD shape.
- Higher dimensions can capture more detailed features but increase computational cost
- Typical values: 512, 1024, 2048
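
As a rough guide to the cost side of this trade-off, the sketch below estimates the raw storage needed for an embedding index at each typical `emb_dim`. It assumes embeddings are stored as 32-bit floats, which this document does not confirm. `index_size_mib` is a hypothetical helper, and 4346 is the number of files in this tutorial's dataset.

```python
def index_size_mib(num_parts, emb_dim, bytes_per_value=4):
    """Raw size in MiB of num_parts embeddings (assumes float32 storage by default)."""
    return num_parts * emb_dim * bytes_per_value / 2**20


# Compare the typical dimensions for a dataset the size of this tutorial's.
for dim in (512, 1024, 2048):
    print(f"emb_dim={dim:5d}: {index_size_mib(4346, dim):6.1f} MiB")
```

Storage grows linearly with `emb_dim`, so doubling the dimension doubles the index size and the per-query similarity cost.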

#### `lr` (float, default: 3e-4)
Learning rate for the optimizer during training.
- Controls the step size for gradient descent updates
- May need adjustment based on batch size and dataset characteristics

### Temperature Parameters

These parameters control the contrastive loss function's sensitivity to similarities:

#### `temp_init` (float, default: 0.05)
Initial temperature value for the contrastive loss.
- Lower values make the model more discriminative
- Higher values create softer similarities

#### `temp_min` (float, default: 0.01)
Minimum allowed temperature during training.

#### `temp_max` (float, default: 0.20)
Maximum allowed temperature during training.
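
The effect of the temperature can be seen with a small softmax experiment. This sketch assumes the temperature divides the similarity scores before a softmax and is clamped to `[temp_min, temp_max]`. That is a common design, but this document does not spell out how `EmbeddingFlowModel` applies these values. `clamp_temperature` and `softmax_with_temperature` are hypothetical helpers.

```python
import math


def clamp_temperature(temp, temp_min=0.01, temp_max=0.20):
    """Keep the temperature inside the allowed range (assumed behaviour)."""
    return max(temp_min, min(temp_max, temp))


def softmax_with_temperature(similarities, temp):
    """Turn similarity scores into probabilities; lower temp gives sharper output."""
    scaled = [s / temp for s in similarities]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy cosine similarities of one query against three candidates.
similarities = [0.9, 0.7, 0.1]

sharp = softmax_with_temperature(similarities, clamp_temperature(0.001))  # clamped up to temp_min
default = softmax_with_temperature(similarities, clamp_temperature(0.05))  # temp_init
soft = softmax_with_temperature(similarities, clamp_temperature(5.0))  # clamped down to temp_max

for name, probs in (("temp_min", sharp), ("temp_init", default), ("temp_max", soft)):
    print(name, [round(p, 3) for p in probs])
```

At the lower bound almost all probability mass goes to the best match, while at the upper bound the runner-up keeps a noticeable share. This is the "more discriminative" versus "softer" behaviour described above.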

## Data Processing Pipeline

Training an `EmbeddingFlowModel` requires a preprocessing pipeline that gathers CAD files and extracts the CAD data needed for training. Here's a complete example using the FlowManager decorators:

**Task 1 - Extract**: Uses `@flowtask.extract` to gather CAD files from local storage using `CADFileRetriever`. Supports multiple CAD formats and parallel processing.

**Task 2 - Prepare data for Embeddings Training**: Uses the `@flowtask.transform` decorator, which automatically initializes and provides optimized data storage and handles the files in parallel.
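
The two tasks follow a generic extract-then-transform pattern. The sketch below shows that pattern with the Python standard library only. It deliberately does not use the HOOPS AI decorators or `CADFileRetriever`, whose signatures are not shown in this document. The names `gather_files`, `encode_file`, `run_pipeline`, and the suffix list are hypothetical, and the transform step is a placeholder that records file metadata instead of encoding geometry.

```python
import pathlib
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Illustrative subset of CAD suffixes (assumption, not the library's supported list).
CAD_SUFFIXES = {".step", ".stp", ".iges", ".igs"}


def gather_files(datasources):
    """Extract step: collect CAD file paths from a list of directories."""
    found = []
    for source in datasources:
        for path in sorted(pathlib.Path(source).rglob("*")):
            if path.is_file() and path.suffix.lower() in CAD_SUFFIXES:
                found.append(path)
    return found


def encode_file(path):
    """Transform step: placeholder that only records the file name and size."""
    return {"name": path.name, "size_bytes": path.stat().st_size}


def run_pipeline(datasources, max_workers=4):
    """Run the extract step once, then the transform step on each file in parallel."""
    files = gather_files(datasources)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(encode_file, files))


# Demo on a throwaway directory containing one CAD file and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "gear.step").write_text("dummy")
    (root / "notes.txt").write_text("not a CAD file")
    records = run_pipeline([root])

print(records)
```

Threads keep the sketch notebook-friendly. A process-based pool would instead require the task functions to live in an importable module, which is consistent with this tutorial importing its tasks from a separate script.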

In [4]:
datasources_dir = pathlib.Path.cwd().parent.joinpath("packages","cadfiles","fabwave")
print(datasources_dir)

C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\packages\cadfiles\fabwave


### ETL pipeline: preparing the data to be used as ML input

In [5]:
# Create and run the Data Flow
flow_name = get_flow_name() 

cad_flow = hoops_ai.create_flow(
    name=flow_name,
    tasks=[gather_cad_files, encode_data_for_ml_training],  # Imported from cad_tasks_embeddings.py
    max_workers=40,  
    flows_outputdir=str(flows_outputdir),
    ml_task="custom HOOPS Embeddings model Demo",
    auto_dataset_export=True,  # Enable automatic dataset merging
    # debug=True,  # Uncomment to enable debugging
    export_visualization=False
)

# Run the flow to process all files
flow_output, output_dict, flow_file = cad_flow.process(inputs={'cad_datasources': [str(datasources_dir)]}, clean_ouput_dir=False)

# Display results
print("\n" + "="*70)
print("FLOW EXECUTION COMPLETED SUCCESSFULLY")
print("="*70)
print(f"\nDataset files created:")
print(f"  Main dataset: {output_dict.get('flow_data', 'N/A')}")
print(f"  Info dataset: {output_dict.get('flow_info', 'N/A')}")
print(f"  Attributes: {output_dict.get('flow_attributes', 'N/A')}")
print(f"  Flow file: {flow_file}")
print(f"\nTotal processing time: {output_dict.get('Duration [seconds]', {}).get('total', 0):.2f} seconds")
print(f"Files processed: {output_dict.get('file_count', 0)}")

|INFO| FLOW | ######### Flow 'HOOPS Embedding Training' start #######
|INFO| FLOW | 
Flow Execution Summary
|INFO| FLOW | Task 1: Gather CAD files from datasources
|INFO| FLOW |     Inputs : cad_datasources
|INFO| FLOW |     Outputs: cad_dataset
|INFO| FLOW | Task 2: Extracting CAD ML-input for EmbeddingModel
|INFO| FLOW |     Inputs : cad_dataset
|INFO| FLOW |     Outputs: cad_files_encoded
|INFO| FLOW | Task 3: AutoDatasetExportTask
|INFO| FLOW |     Inputs : cad_files_encoded
|INFO| FLOW |     Outputs: encoded_dataset, encoded_dataset_info, encoded_dataset_attribs
|INFO| FLOW | 
Task Dependencies:
|INFO| FLOW | Gather CAD files from datasources has no dependencies.
|INFO| FLOW | Gather CAD files from datasources --> Extracting CAD ML-input for EmbeddingModel
|INFO| FLOW | Extracting CAD ML-input for EmbeddingModel --> AutoDatasetExportTask

|INFO| FLOW | Executing ParallelTask 'Gather CAD files from datasources' with 1 items.


DATA INGESTION:   0%|                                                                            | 0/1 [00:00<…

|INFO| FLOW | Executing ParallelTask 'Extracting CAD ML-input for EmbeddingModel' with 4572 items.


DATA TRANSFORMATION:   0%|                                                                    | 0/4572 [00:00<…

|INFO| FLOW | Executing SequentialTask 'AutoDatasetExportTask'.
[DatasetMerger] Using streaming merge into temporary directory store for large dataset...


DATA STORING/LOADING:   0%|          | 0/4546 [00:00<?, ?files/s]

|INFO| FLOW | Auto dataset export completed in 217.07 seconds
|INFO| FLOW | Time taken: 269.76 seconds
|INFO| FLOW | ######### Flow 'HOOPS Embedding Training' end ######

FLOW EXECUTION COMPLETED SUCCESSFULLY

Dataset files created:
  Main dataset: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\HOOPS Embedding Training\HOOPS Embedding Training.dataset
  Info dataset: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\HOOPS Embedding Training\HOOPS Embedding Training.infoset
  Attributes: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\HOOPS Embedding Training\HOOPS Embedding Training.attribset
  Flow file: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out/flows/HOOPS Embedding Training/HOOPS Embedding Training.flow

Total processing time: 269.76 seconds
Files processed: 4572


In [6]:
from hoops_ai.dataset import DatasetExplorer

explorer = DatasetExplorer(flow_output_file=str(flow_file))
explorer.print_table_of_contents()

[DatasetExplorer] Default local cluster started: <Client: 'tcp://127.0.0.1:58863' processes=1 threads=16, memory=7.45 GiB>


Processing file info:   0%|          | 0/4346 [00:00<?, ?it/s]


--- Dataset Table of Contents ---

EDGES_GROUP:
  EDGE_CONVEXITIES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_DIHEDRAL_ANGLES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_INDICES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_LENGTHS_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_TYPES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065
  EDGE_U_GRIDS_DATA: Shape: (337065, 10, 6), Dims: ('edge', 'dim_x', 'component'), Size: 20223900
  FILE_ID_CODE_EDGES_DATA: Shape: (337065,), Dims: ('edge',), Size: 337065

FACES_GROUP:
  FACE_AREAS_DATA: Shape: (130923,), Dims: ('face',), Size: 130923
  FACE_DISCRETIZATION_DATA: Shape: (130923, 100, 7), Dims: ('face', 'sample', 'component'), Size: 91646100
  FACE_INDICES_DATA: Shape: (130923,), Dims: ('face',), Size: 130923
  FACE_LOOPS_DATA: Shape: (130923,), Dims: ('face',), Size: 130923
  FACE_TYPES_DATA: Shape: (130923,), Dims: ('face',), Size: 130923
  FILE_ID_CODE_FACES_DATA: Shape

### Running the training

In [7]:
flow_name = get_flow_name() 
flow_root_dir = flows_outputdir.joinpath("flows", flow_name)
print(flow_root_dir)

myFlow_info        = str(flow_root_dir.joinpath(f"{flow_name}.infoset"))
myFlow_dataset     = str(flow_root_dir.joinpath(f"{flow_name}.dataset"))

C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\HOOPS Embedding Training


In [8]:
# Load the already encoded dataset and perform the split
cadflowdataset = DatasetLoader(merged_store_path = myFlow_dataset, parquet_file_path=myFlow_info)
cadflowdataset.split(key='face_types', group="faces",train=0.8, validation=0.1, test=0.1)

[DatasetExplorer] Default local cluster started: <Client: 'tcp://127.0.0.1:58882' processes=1 threads=16, memory=7.45 GiB>


Processing file info:   0%|          | 0/4346 [00:00<?, ?it/s]

DEBUG: Successfully built file lists with 4346 files out of 4346 original file codes

DATASET STRUCTURE OVERVIEW

Group: edges
------------------------------
  edge_convexities: (337065,) (int32)
  edge_dihedral_angles: (337065,) (float32)
  edge_indices: (337065,) (int32)
  edge_lengths: (337065,) (float32)
  edge_types: (337065,) (int32)
  edge_u_grids: (337065, 10, 6) (float32)
  file_id_code_edges: (337065,) (int64)

Group: faces
------------------------------
  face_areas: (130923,) (float32)
  face_discretization: (130923, 100, 7) (float32)
  face_indices: (130923,) (int32)
  face_loops: (130923,) (int32)
  face_types: (130923,) (int32)
  file_id_code_faces: (130923,) (int64)

Group: graph
------------------------------
  edges_destination: (337065,) (int32)
  edges_source: (337065,) (int32)
  file_id_code_graph: (337065,) (int64)
  num_nodes: (337065,) (int64)

Dataset split by face_types: Train=3472, Validation=438, Test=436


(3472, 438, 436)
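
The split is requested as 80/10/10, but the counts returned above are whole files, so the realized proportions differ slightly from the request. A quick check of the actual percentages, using only the counts printed above:

```python
# Counts returned by the split in this tutorial's run.
split_counts = {"train": 3472, "validation": 438, "test": 436}

total = sum(split_counts.values())
fractions = {name: 100 * count / total for name, count in split_counts.items()}

for name, percent in fractions.items():
    print(f"{name:<10}: {split_counts[name]:5d} files ({percent:.2f}%)")
print(f"{'total':<10}: {total:5d} files")
```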

#### Define the trainer that will run the training job

In [9]:
flow_trainer = FlowTrainer(

    flowmodel       = EmbeddingModel, # imported from cad_tasks_embeddings.py
    datasetLoader   = cadflowdataset,
    experiment_name = "HOOPS_AI_train",
    result_dir      = flow_root_dir,
    accelerator     = 'gpu',
    devices         = [1],  # GPU indices to use; set to [0] on a single-GPU machine
    max_epochs      = 10,
    batch_size      = 64
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


HOOPS Embedding Model


In [None]:
trained_model_path = flow_trainer.train()
print(f"Training finished. Model checkpoint saved in {trained_model_path}")


-----------------------------------------------------------------------------------
HOOPS Embedding Model - TRAINING STEP
-----------------------------------------------------------------------------------
Training batch size               : 64
Adjusted learning rate (for batch): 0.002

Train set contains                : 3472 samples (79.89%)
Validation set contains           : 438 samples (10.08%)
Test set contains                 : 436 samples (10.03%)
Total samples                     : 4346
Max Epoch                         : 10

The trained model: C:\Users\LuisSalazar\Documents\MAIN\MLProject\repo\HOOPS-AI-tutorials\notebooks\out\flows\HOOPS Embedding Training\ml_output\HOOPS_AI_train\0127\000830\best.ckpt

To monitor the logs, run:
tensorboard --logdir results/HOOPS_AI_train/0127/000830
-----------------------------------------------------------------------------------
        


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


In [None]:
flow_trainer.test(trained_model_path)
print("Testing finished.")

## Inference

The output of an `EmbeddingFlowModel` is an embedding: a list of float values that is hard to interpret directly but encodes a learned representation of your CAD file.

Here, we use `FlowInference` to compute the embedding for a new file.


In [None]:
from hoops_ai.ml.EXPERIMENTAL import FlowInference
from hoops_ai.ml.EXPERIMENTAL import EmbeddingFlowModel

from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.insights import CADViewer

# Initialize CAD loader (needed for ML inference later)
loader = HOOPSLoader()

inference_model = FlowInference(cad_loader = loader, flowmodel = EmbeddingFlowModel(result_dir=flow_root_dir))
inference_model.load_from_checkpoint(trained_model_path)

In [None]:
test_files = pathlib.Path.cwd().parent.joinpath("packages")
cad_file_test = str(test_files.joinpath("cadfiles","gear_fabwave.step"))

ml_input = inference_model.preprocess(cad_file_test)    
predictions = inference_model.predict_and_postprocess(ml_input)
print(predictions)

In [None]:
predictions.shape

Unlike the other two ML models in this library, inference with the embeddings model needs to be complemented with a vector store.
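
To illustrate what a vector store adds, here is a minimal in-memory sketch that ranks stored embeddings by cosine similarity to a query. It is not the HOOPS AI API: `InMemoryVectorStore` is a hypothetical class, and the 3-dimensional vectors and part names are toy stand-ins for real embeddings, which would have `emb_dim` components.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Toy vector store: keeps embeddings in a list and scans them all on each query."""

    def __init__(self):
        self._items = []

    def add(self, key, embedding):
        self._items.append((key, list(embedding)))

    def query(self, embedding, top_k=3):
        """Return the top_k (key, similarity) pairs, most similar first."""
        scored = [(key, cosine_similarity(embedding, vec)) for key, vec in self._items]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add("gear_a", [0.9, 0.1, 0.0])
store.add("gear_b", [0.8, 0.2, 0.1])
store.add("bracket", [0.0, 0.1, 0.9])

# Query with a toy embedding that points in roughly the same direction as the gears.
results = store.query([1.0, 0.0, 0.0], top_k=2)
print(results)
```

A linear scan like this is fine for a few thousand parts. Larger collections usually call for a dedicated vector database with approximate nearest-neighbour indexing.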

This tutorial ends here. For further details, see the notebook HOOPS EMBEDDINGS for CAD SEARCH.

## Registering Your Trained Model for Production

Once training is complete, register your custom model with `HOOPSEmbeddings.register_model()` to use it in production (see the notebook HOOPS EMBEDDINGS CAD SEARCH).

**Next Steps**:
- Using your registered model for similarity search
- Indexing embeddings in vector databases
- Querying for similar parts in production