# SOWLv2: Text-Prompted Object Segmentation from Video


SOWLv2 (SegmentedOWLv2) is a powerful command-line tool for text-prompted object segmentation. It seamlessly integrates Google's OWLv2 open-vocabulary object detector with Meta's SAM 2 (Segment Anything Model 2) to precisely segment objects in images, image sequences (frames), or videos based on natural language descriptions.

Given one or more text prompts (e.g., "a red bicycle", or "cat" "dog") and an input source, SOWLv2 will:

1. Utilize OWLv2 to detect bounding boxes for objects matching the text prompt(s), based on the principles from the paper *Scaling Open-Vocabulary Object Detection*.
2. Employ SAM 2 to generate detailed segmentation masks for each detected object, leveraging techniques from the paper *SAM 2: Segment Anything in Images and Videos*.
3. Save both binary segmentation masks (foreground vs. background) and overlay images (the original image with masks visually overlaid) to a specified output directory.

*Multilabel output showcase*

## ✨ Key Features

- **Text-Prompted Segmentation**: Identify and segment objects using free-form text descriptions.
- **Multi-Object Detection**: Provide multiple text prompts simultaneously to detect different classes of objects in a single run.
- **State-of-the-Art Models**: Leverages the power of Google's OWLv2 and Meta's SAM 2.
- **Versatile Input**: Supports single images, directories of frames, and video files.
- **Comprehensive Output**: Generates both binary masks for programmatic use and visual overlays for inspection.
- **Customizable**: Select specific OWLv2 and SAM 2 model variants, detection thresholds, and video processing parameters.
- **Easy Installation**: Installable via pip directly from the Git repository.
- **GPU Acceleration**: Automatically uses CUDA-enabled GPUs if available, with a fallback to CPU (see the one-line sketch after this list).
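
For reference, the GPU fallback follows the standard PyTorch device-selection pattern; a one-line sketch (SOWLv2's internal logic may differ):

```python
import torch

# Prefer a CUDA-enabled GPU when present, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
```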

## πŸš€ Quick Start & Demo

Explore SOWLv2's capabilities interactively with our Google Colab notebook (linked at the end of this README).

The notebook provides a step-by-step demonstration for all supported input types (images, frames, videos), including examples with multiple prompts.

πŸ› οΈ Installation

SOWLv2 can be installed directly from this Git repository using pip. Ensure you have Python 3.10+ and pip installed.

```bash
pip install git+https://github.com/bladeszasza/SOWLv2.git
```

This command will also install all necessary dependencies, including `transformers`, `sam2`, `opencv-python`, `torch`, and others.

Alternatively, you can clone the repository and install from `requirements.txt` or `setup.py`:

```bash
git clone https://github.com/bladeszasza/SOWLv2.git
cd SOWLv2
pip install -r requirements.txt
# or
# python setup.py install
```

## βš™οΈ Usage

Once installed, the `sowlv2-detect` command-line tool will be available.

### Basic Command Structure:

To detect a single type of object:
```bash
sowlv2-detect --prompt "your text prompt" --input <path_to_input> --output <path_to_output_dir> [options]
```

To detect multiple types of objects in one go:

```bash
sowlv2-detect --prompt "prompt one" "prompt two" "another object" --input <path_to_input> --output <path_to_output_dir> [options]
```

**Note:** If a single prompt contains spaces, enclose it in quotes (e.g., `"a red bicycle"`). When providing multiple prompts, each prompt is a separate quoted string.

### Command-Line Options:

| Option | Description | Default |
|---|---|---|
| `--prompt` | (Required) One or more text queries for object detection (e.g., `"cat"`, or `"dog" "person" "a red car"`). | None |
| `--input` | (Required) Path to the input: a single image file, a directory of image frames, or a video file. | None |
| `--output` | Directory where outputs (masks and overlays) will be saved. Created if it doesn't exist. | `output/` |
| `--owl-model` | (Optional) OWLv2 model name from the Hugging Face Model Hub. | `google/owlv2-base-patch16-ensemble` |
| `--sam-model` | (Optional) SAM 2 model name from the Hugging Face Model Hub. | `facebook/sam2.1-hiera-small` |
| `--threshold` | (Optional) Detection confidence threshold for OWLv2 (a float between 0 and 1). | `0.1` |
| `--fps` | (Optional) Frame sampling rate (frames per second) for video inputs. | `24` |
| `--device` | (Optional) Compute device (`"cuda"` or `"cpu"`). | Auto-detects GPU, else CPU |
| `--no-merged` | (Optional) Disables merged mode. Merged mode (all masks combined into a single output image/video) is enabled by default. | Enabled |
| `--no-binary` | (Optional) Disables binary mask generation. Binary mask output is enabled by default. | Enabled |
| `--no-overlay` | (Optional) Disables overlay image generation. Overlay output (original image with masks) is enabled by default. | Enabled |
| `--config` | (Optional) Path to a YAML configuration file to specify arguments (see Configuration). Prompts can also be a list in YAML. | None |

### Examples:

1. Segment "dogs" in a single image:

   ```bash
   sowlv2-detect --prompt "dog" --input path/to/your/dog_image.jpg --output results/dog_segmentation/
   ```

2. Segment "cat" and "duck" in a folder of image frames:

   ```bash
   sowlv2-detect --prompt "cat" "duck" --input path/to/your/frames_folder/ --output results/cat_duck_frames_segmentation/
   ```

3. Segment a specific object like "a red car" in a video, sampling at 10 FPS:

   ```bash
   sowlv2-detect --prompt "a red car" --input path/to/your/video.mp4 --output results/red_car_video_segmentation/ --fps 10
   ```

4. Use larger models to detect "cats" and "dogs" for potentially higher accuracy (requires more GPU VRAM):

   ```bash
   sowlv2-detect --prompt "cat" "dog" --input image_with_pets.jpg --output results_pets_large/ \
                 --owl-model "google/owlv2-large-patch14-ensemble" \
                 --sam-model "facebook/sam2.1-hiera-large"
   ```

### Output Structure:

The tool saves results in the specified output directory with the following structure:

```text
output_dir/
β”œβ”€β”€ binary/                 # Binary mask images/videos
β”‚   β”œβ”€β”€ merged/             # Merged binary masks (all objects combined)
β”‚   β”‚   β”œβ”€β”€ 000001_merged_mask.png
β”‚   β”‚   β”œβ”€β”€ 000002_merged_mask.png
β”‚   β”‚   └── ...
β”‚   └── frames/             # Individual binary masks per object
β”‚       β”œβ”€β”€ 000001_obj1_cat_mask.png
β”‚       β”œβ”€β”€ 000001_obj2_dog_mask.png
β”‚       β”œβ”€β”€ 000002_obj1_cat_mask.png
β”‚       └── ...
β”œβ”€β”€ overlay/                # RGB overlay images/videos
β”‚   β”œβ”€β”€ merged/             # Merged overlays (all objects combined)
β”‚   β”‚   β”œβ”€β”€ 000001_merged_overlay.png
β”‚   β”‚   β”œβ”€β”€ 000002_merged_overlay.png
β”‚   β”‚   └── ...
β”‚   └── frames/             # Individual overlays per object
β”‚       β”œβ”€β”€ 000001_obj1_cat_overlay.png
β”‚       β”œβ”€β”€ 000001_obj2_dog_overlay.png
β”‚       β”œβ”€β”€ 000002_obj1_cat_overlay.png
β”‚       └── ...
└── video/                  # Generated videos (for video input)
    β”œβ”€β”€ binary/             # Binary mask videos
    β”‚   β”œβ”€β”€ merged_mask.mp4     # Merged binary mask video
    β”‚   β”œβ”€β”€ obj1_cat_mask.mp4   # Individual object videos
    β”‚   └── obj2_dog_mask.mp4
    └── overlay/            # Overlay videos
        β”œβ”€β”€ merged_overlay.mp4  # Merged overlay video
        β”œβ”€β”€ obj1_cat_overlay.mp4
        └── obj2_dog_overlay.mp4
```
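
Because the binary masks are plain grayscale PNGs, they are straightforward to consume programmatically. Below is a minimal loading sketch; the `output_dir` path and file layout are taken from the example tree above:

```python
from pathlib import Path

import numpy as np
from PIL import Image

output_dir = Path("output_dir")  # illustrative path from the tree above

# Collect every merged binary mask as a boolean array, keyed by frame number.
masks = {}
for mask_path in sorted((output_dir / "binary" / "merged").glob("*_merged_mask.png")):
    frame_num = mask_path.name.split("_")[0]                   # e.g. "000001"
    masks[frame_num] = np.array(Image.open(mask_path).convert("L")) > 0

print(f"Loaded {len(masks)} merged masks")
```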

### File Naming Convention:

For each detected object instance, SOWLv2 generates files using the following patterns:

**Individual object files:**

- Binary masks: `{frame_num}_obj{obj_id}_{prompt}_mask.png` (e.g., `000001_obj1_cat_mask.png`)
- Overlay images: `{frame_num}_obj{obj_id}_{prompt}_overlay.png` (e.g., `000001_obj1_cat_overlay.png`)

**Merged files (all objects combined):**

- Binary masks: `{frame_num}_merged_mask.png` (e.g., `000001_merged_mask.png`)
- Overlay images: `{frame_num}_merged_overlay.png` (e.g., `000001_merged_overlay.png`)

**Video files:**

- Individual object videos: `obj{obj_id}_{prompt}_mask.mp4` / `obj{obj_id}_{prompt}_overlay.mp4`
- Merged videos: `merged_mask.mp4` / `merged_overlay.mp4`

Objects are numbered sequentially (`obj1`, `obj2`, etc.) in the order they are detected by OWLv2, regardless of which text prompt they matched. Frame numbers use 6-digit zero-padding (`000001`, `000002`, etc.).

SOWLv2 automatically assigns a unique color to each detected object class, making it easy to visually distinguish different object types in the output overlays and merged results.
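
The patterns above are regular enough to parse back out of a results directory. A sketch using a hypothetical `parse_mask_name` helper (not part of the package):

```python
import re

# Individual object masks follow {frame:06d}_obj{obj_id}_{prompt}_mask.png
MASK_NAME = re.compile(r"^(?P<frame>\d{6})_obj(?P<obj_id>\d+)_(?P<prompt>.+)_mask\.png$")

def parse_mask_name(filename: str):
    """Return (frame number, object id, prompt) for a SOWLv2 mask filename."""
    match = MASK_NAME.match(filename)
    if match is None:
        return None
    return int(match["frame"]), int(match["obj_id"]), match["prompt"]

print(parse_mask_name("000001_obj1_cat_mask.png"))  # -> (1, 1, 'cat')
```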

### Configuration File (Optional):

You can use a YAML configuration file to specify arguments, which is useful for managing complex settings or reproducing experiments. The `prompt` field in the YAML file can also be a list of strings.

Example `config.yaml` with multiple prompts:

```yaml
prompt:
  - "a pedestrian crossing the street"
  - "traffic light"
  - "bus"
input: "data/street_scene.mp4"
output: "results/street_multi_object_video"
owl-model: "google/owlv2-base-patch16-ensemble"
sam-model: "facebook/sam2.1-hiera-small"
threshold: 0.15
fps: 15
device: "cuda"
```

Run with config:

```bash
sowlv2-detect --config config.yaml
```

**Note:** Command-line arguments override values in the config file when both are provided. If `--prompt` is given on the command line, it overrides the prompt list in the config file.
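
Conceptually, this precedence is a dictionary merge in which explicit command-line values win. A hedged sketch of the rule (`merge_args` is a hypothetical helper, not the package's actual code):

```python
import yaml

def merge_args(cli_args: dict, config_path: str) -> dict:
    """YAML config supplies defaults; CLI arguments that were actually set override them."""
    with open(config_path, encoding="utf-8") as f:
        merged = yaml.safe_load(f) or {}
    merged.update({key: value for key, value in cli_args.items() if value is not None})
    return merged

# --prompt given on the command line replaces the prompt list from config.yaml:
print(merge_args({"prompt": ["dog"], "fps": None}, "config.yaml"))
```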

## 🧠 How It Works

SOWLv2 runs a two-stage model pipeline, followed by an output step:

1. **OWLv2 Detection**: The input image/frame is processed by the specified OWLv2 model (default: `google/owlv2-base-patch16-ensemble`) via the Hugging Face `transformers` library. Based on the provided text prompt(s), OWLv2 identifies relevant objects and outputs their bounding boxes, confidence scores, and the specific prompt that matched each detection.

2. **SAM 2 Segmentation**: For each bounding box above the confidence threshold, the SAM 2 model (default: `facebook/sam2.1-hiera-small`) takes the original image and the box as prompts and generates a precise segmentation mask for the object within that box. For videos, SAM 2's video-specific capabilities propagate masks across frames.

3. **Output Generation**: The binary masks are saved as grayscale PNG files and are also overlaid onto the original images/frames to create visually inspectable results. For video inputs, the per-frame outputs are additionally compiled into per-object mask videos and overlay videos.
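
For orientation, the two model stages map onto public APIs roughly as follows. This is a minimal single-image sketch assuming the Hugging Face `transformers` OWLv2 classes and the `sam2` package's `SAM2ImagePredictor`; it illustrates the idea and is not SOWLv2's actual implementation:

```python
import numpy as np
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: OWLv2 proposes boxes for the text prompts.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
detector = Owlv2ForObjectDetection.from_pretrained(
    "google/owlv2-base-patch16-ensemble"
).to(device)

image = Image.open("image_with_pets.jpg").convert("RGB")
inputs = processor(text=[["cat", "dog"]], images=image, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = detector(**inputs)
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

# Stage 2: SAM 2 turns each surviving box into a pixel-accurate mask.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-small")
predictor.set_image(np.array(image))
for box in detections["boxes"]:
    masks, _, _ = predictor.predict(box=box.cpu().numpy(), multimask_output=False)
    binary_mask = (masks[0] > 0).astype(np.uint8) * 255  # grayscale, PNG-ready
```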

## πŸ“¦ Dependencies

SOWLv2 relies on the following major Python packages:

- `torch` (PyTorch)
- `transformers` (for OWLv2 models)
- `sam2` (Meta's SAM 2 package)
- `opencv-python` (for image and video processing)
- `numpy`, `Pillow`, `pyyaml`, `huggingface_hub`

These dependencies are listed in `setup.py` and `requirements.txt` and will be installed automatically when using pip.

## πŸ“œ License

This project is licensed under the Apache 2.0 License. See the LICENSE file for more details.

## 🀝 Contributing

Contributions are welcome! To contribute, fork the repository, make your changes, and submit a pull request. For major changes or new features, please open an issue first to discuss what you would like to change.

πŸ™ Acknowledgments

SOWLv2 builds upon the incredible work of the researchers and developers behind Google's OWLv2, Meta's SAM 2, and the Hugging Face ecosystem. We extend our sincere gratitude to the authors and maintainers for open-sourcing their code and models.


🌟 Developed by Csaba BolyΓ²s πŸš€

Connect with me: πŸ”— LinkedIn | πŸ““ Google Colab Demo