Model Tools

A comprehensive toolkit for extracting, converting, and validating machine learning model weights and tokenizers.

Python 3.8+ License: BSD

Overview

Model Tools provides a unified interface for working with transformer models, particularly for preparing them for deployment in custom inference engines. It supports extraction from Hugging Face models, format conversion, validation, and comparison operations.

Features

  • 🔧 Model Extraction - Extract weights and tokenizers from Hugging Face models or local directories
  • 🔄 Format Conversion - Convert between safetensors and binary formats with precision control
  • ⚖️ Weight Comparison - Compare extracted weights with reference models for accuracy verification
  • ✅ Comprehensive Validation - Verify model integrity, weights, and configurations
  • 📦 Model Download - Download models from Hugging Face Hub with resume support
  • 🎯 Tokenizer Management - Extract, validate, and analyze tokenizer files
  • 🔢 Quantization Support - Handle quantized model weights (INT8, FP16)
  • ⚙️ Configurable - Extensive configuration options via JSON

Installation

Requirements

  • Python >= 3.8
  • PyTorch
  • Transformers
  • Hugging Face Hub

Install Dependencies

pip install torch transformers huggingface_hub safetensors numpy

Setup

git clone https://github.com/stealthwater/model_tools.git
cd model_tools
pip install -e .

Quick Start

Extract a Model

python master_toolkit.py extract microsoft/DialoGPT-small

Extract to Custom Directory

python master_toolkit.py extract microsoft/DialoGPT-small --output models/my_model

Compare Weights

python master_toolkit.py compare models/my_model/weights.bin microsoft/DialoGPT-small

Convert Formats

python master_toolkit.py convert input/ output/ --format-type safetensors_to_bin

Usage

Command-Line Interface

# Extract model
python master_toolkit.py extract <model_name> [--output OUTPUT_DIR]

# Compare weights
python master_toolkit.py compare <weights_file> <reference_model> [--comparison-type {hf,file}]

# Convert format
python master_toolkit.py convert <input> <output> --format-type {safetensors_to_bin,bin_to_safetensors,precision_conversion}

# Verify model
python master_toolkit.py verify <path> --type {weights,tokenizer,extraction,conversion}

# Download model
python master_toolkit.py download <model_name> [--output-dir OUTPUT_DIR]

# Tokenizer operations
python master_toolkit.py tokenizer <model_name> --action {extract,validate,analyze}

Python API

from model_tools.operations.extract_operation import ExtractOperation

# Initialize operation
extractor = ExtractOperation()

# Extract model
result = extractor.execute(
    model_name='microsoft/DialoGPT-small',
    output_dir='./models/my_model'
)

if result.success:
    print(f"Extracted {result.data['num_tensors']} tensors")
    print(f"Model size: {result.data['total_size_gb']:.2f} GB")

Configuration

Configure the toolkit via config.json:

{
  "default_directories": {
    "input_dir": "models/input",
    "output_dir": "models/output",
    "temp_dir": "models/temp",
    "cache_dir": "models/cache"
  },
  "formats": {
    "precision": "fp32",
    "weight_format": "bin",
    "preserve_quantized": true
  },
  "validation_settings": {
    "auto_validate_weights": true,
    "auto_compare_with_hf": true,
    "validation_tolerance": 1e-5
  },
  "performance": {
    "use_gpu": true,
    "batch_size": 1
  }
}
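To illustrate how settings like these are typically consumed, here is a minimal, hypothetical loader that merges a user config over defaults. The key names mirror the config.json above, but the loader itself is a sketch, not the toolkit's actual code:

```python
import io
import json

# Defaults for a subset of the keys shown in config.json above (illustrative).
DEFAULTS = {
    "formats": {"precision": "fp32", "weight_format": "bin"},
    "validation_settings": {"validation_tolerance": 1e-5},
}

def load_config(fp):
    """Merge a user-supplied JSON config over the defaults, per section."""
    user = json.load(fp)
    return {section: {**values, **user.get(section, {})}
            for section, values in DEFAULTS.items()}

# A user config that overrides only the precision; everything else falls
# back to the defaults.
cfg = load_config(io.StringIO('{"formats": {"precision": "fp16"}}'))
print(cfg["formats"]["precision"], cfg["formats"]["weight_format"])
# fp16 bin
```

Per-section merging like this lets a user config stay short: only the keys that differ from the defaults need to appear in the file.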

Operations

Extract Operation

Extracts model weights and tokenizers from Hugging Face models or local directories.

Features:

  • Automatic model downloading
  • Local model detection
  • Tokenizer extraction
  • Automatic validation
  • Comparison with source model

Compare Operation

Compares extracted weights with reference models.

Metrics:

  • Exact tensor matching
  • Shape validation
  • Value difference statistics
  • Configuration comparison
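The value-difference statistics can be sketched as follows. This is an illustrative helper (the function name and return keys are assumptions, not the toolkit's API), using a max/mean absolute-difference check against a tolerance like the `validation_tolerance` in config.json:

```python
import numpy as np

def diff_stats(reference, extracted, tolerance=1e-5):
    """Hypothetical value-difference statistics between two weight tensors."""
    if reference.shape != extracted.shape:
        return {"match": False, "reason": "shape mismatch"}
    # Compare in float64 so the comparison itself adds no rounding error.
    diff = np.abs(reference.astype(np.float64) - extracted.astype(np.float64))
    return {
        "match": bool(diff.max() <= tolerance),
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
    }

reference = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
extracted = (reference.astype(np.float64) + 1e-7).astype(np.float32)
stats = diff_stats(reference, extracted)
print(stats["match"])  # True: the perturbation is below the tolerance
```

Checking shapes before values matters: a shape mismatch usually indicates a format or architecture problem, not a numerical one, and comparing values element-wise in that case would be meaningless.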

Convert Operation

Converts between model formats and precisions.

Supported Conversions:

  • Safetensors → Binary
  • Binary → Safetensors
  • FP32 → FP16
  • FP32 → INT8
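A precision conversion such as FP32 → FP16 is, at its core, a dtype downcast plus a check on the rounding error it introduces. A minimal sketch (the helper name is an assumption, not the toolkit's API):

```python
import numpy as np

def to_fp16(weights):
    """Downcast a dict of FP32 tensors to FP16 and report the max rounding error."""
    converted = {name: w.astype(np.float16) for name, w in weights.items()}
    max_err = max(
        float(np.abs(w.astype(np.float64)
                     - converted[name].astype(np.float64)).max())
        for name, w in weights.items()
    )
    return converted, max_err

weights = {"layer.0.weight": np.linspace(-1.0, 1.0, 8, dtype=np.float32)}
fp16_weights, max_err = to_fp16(weights)
print(fp16_weights["layer.0.weight"].dtype, max_err)
```

For values in [-1, 1], FP16's 10-bit mantissa keeps the rounding error below about 5e-4, which is why FP16 is usually a safe storage precision for normalized transformer weights.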

Verify Operation

Validates model files and extractions.

Checks:

  • File format validation
  • Tensor value validation (NaN/Inf detection)
  • Configuration validation
  • Tokenizer file validation
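The NaN/Inf detection step can be sketched like this; the function name and return format are illustrative, not the toolkit's actual interface:

```python
import numpy as np

def validate_tensor(name, values):
    """Hypothetical integrity check: flag NaN or Inf values in a weight tensor."""
    problems = []
    if np.isnan(values).any():
        problems.append("contains NaN")
    if np.isinf(values).any():
        problems.append("contains Inf")
    return (name, problems)

good = np.array([0.1, -0.2, 0.3], dtype=np.float32)
bad = np.array([0.1, np.nan, np.inf], dtype=np.float32)
print(validate_tensor("ok.weight", good))   # ('ok.weight', [])
print(validate_tensor("bad.weight", bad))   # ('bad.weight', ['contains NaN', 'contains Inf'])
```

Catching NaN/Inf at extraction time is cheap insurance: a single corrupted tensor will otherwise surface much later as silent garbage output from the inference engine.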

Download Operation

Downloads models from Hugging Face Hub.

Features:

  • Resume support
  • Progress tracking
  • Mirror URL support
  • Timeout handling


Project Structure

model_tools/
├── operations/          # Core operations
│   ├── extract_operation.py
│   ├── compare_operation.py
│   ├── convert_operation.py
│   ├── verify_operation.py
│   ├── tokenizer_operation.py
│   └── download_operation.py
├── utils/              # Utility modules
│   ├── file_utils.py
│   ├── tensor_utils.py
│   ├── validation_utils.py
│   ├── logging_utils.py
│   └── download_utils.py
├── config_loader.py    # Configuration management
├── master_toolkit.py   # CLI entry point
└── config.json         # Configuration file

Binary Format Specification

The toolkit uses a custom binary format for efficient weight storage:

Header (24 bytes):
  - vocab_size (4 bytes)
  - hidden_dim (4 bytes)
  - num_layers (4 bytes)
  - num_heads (4 bytes)
  - ff_dim (4 bytes)
  - max_seq_len (4 bytes)

Tensor count (4 bytes): number of tensors that follow

For each tensor:
  - name_length (4 bytes)
  - name (variable)
  - num_dimensions (4 bytes)
  - shape (4 bytes × num_dimensions)
  - data_size (8 bytes)
  - data (variable, float32)
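The layout above can be exercised with Python's stdlib struct module. The sketch below writes a header and one tensor, then reads them back; it assumes little-endian byte order and unsigned integer fields, which the specification does not state explicitly:

```python
import io
import struct

def write_header(buf, cfg):
    # Header: six 4-byte unsigned ints, 24 bytes total (little-endian assumed).
    buf.write(struct.pack("<6I", cfg["vocab_size"], cfg["hidden_dim"],
                          cfg["num_layers"], cfg["num_heads"],
                          cfg["ff_dim"], cfg["max_seq_len"]))

def write_tensor(buf, name, shape, values):
    encoded = name.encode("utf-8")
    buf.write(struct.pack("<I", len(encoded)))          # name_length
    buf.write(encoded)                                  # name
    buf.write(struct.pack("<I", len(shape)))            # num_dimensions
    buf.write(struct.pack(f"<{len(shape)}I", *shape))   # shape
    data = struct.pack(f"<{len(values)}f", *values)     # float32 payload
    buf.write(struct.pack("<Q", len(data)))             # data_size (8 bytes)
    buf.write(data)

buf = io.BytesIO()
cfg = {"vocab_size": 50257, "hidden_dim": 768, "num_layers": 12,
       "num_heads": 12, "ff_dim": 3072, "max_seq_len": 1024}
write_header(buf, cfg)
buf.write(struct.pack("<I", 1))                         # tensor count
write_tensor(buf, "wte.weight", (2, 2), [1.0, 2.0, 3.0, 4.0])

# Read it back, field by field, in the same order.
buf.seek(0)
header = struct.unpack("<6I", buf.read(24))
(count,) = struct.unpack("<I", buf.read(4))
(name_len,) = struct.unpack("<I", buf.read(4))
name = buf.read(name_len).decode("utf-8")
(ndim,) = struct.unpack("<I", buf.read(4))
shape = struct.unpack(f"<{ndim}I", buf.read(4 * ndim))
(data_size,) = struct.unpack("<Q", buf.read(8))
values = struct.unpack(f"<{data_size // 4}f", buf.read(data_size))
print(name, shape, values)  # wte.weight (2, 2) (1.0, 2.0, 3.0, 4.0)
```

Because each tensor record carries its own name, shape, and data size, a reader can skip tensors it does not need without parsing their payloads.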

Examples

Extract and Validate

# Extract model with automatic validation
python master_toolkit.py extract Qwen/Qwen3-1.7B --output models/qwen3

# Results show:
# - ✅ 311 tensors extracted
# - ✅ All weights validated
# - ✅ 100% match with source model

Batch Conversion

# Convert multiple safetensors files to binary
python master_toolkit.py convert models/input/ models/output/ \
    --format-type safetensors_to_bin \
    --precision fp16

Detailed Comparison

# Compare with statistical analysis
python master_toolkit.py compare \
    models/extracted/weights.bin \
    microsoft/DialoGPT-small \
    --comparison-type hf \
    --detailed

Troubleshooting

Common Issues

Model not found:

# Ensure model name is correct
python master_toolkit.py download <model_name> --output-dir models/input

Shape mismatch errors:

  • Check that the model architecture is supported
  • Verify the binary format matches the reader

Out of memory:

  • Use --precision fp16 for smaller memory footprint
  • Adjust batch_size in config.json

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the BSD License - see the LICENSE file for details.


Support

For issues, questions, or suggestions:

  • Open an issue on GitHub Issues
  • Check existing documentation in the docs/ directory

Made with ❤️ for the ML community
