CLI for semantic search on your computer. Searches text files and identifies the most relevant chunks to your query.

do-me/sff

SemanticFileFinder (sff)

crates.io License: MIT OR Apache-2.0 GitHub stars

sff (SemanticFileFinder) is a command-line tool that rapidly searches for files in a given directory based on the semantic meaning of your query. It leverages sentence embeddings through model2vec-rs to understand content, not just keywords. It reads .txt, .md, and .mdx files, chunks their content, and ranks them by similarity to find the most relevant text snippets.
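The chunk-and-rank idea described above can be sketched roughly as follows. This is a minimal illustration, not sff's actual implementation: the function names (`chunk_words`, `cosine_similarity`, `top_matches`) are hypothetical, the embeddings are hand-written toy vectors rather than model2vec-rs output, and cosine similarity is assumed as the similarity measure.

```rust
/// Split text into chunks of at most `words_per_chunk` whitespace-separated words.
/// (Hypothetical helper; sff's real chunker may differ.)
fn chunk_words(text: &str, words_per_chunk: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(words_per_chunk.max(1)) // guard against a chunk size of 0
        .map(|w| w.join(" "))
        .collect()
}

/// Cosine similarity between two vectors; returns 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return (chunk index, score) for the `limit` chunks most similar to the query, best first.
fn top_matches(query: &[f32], chunk_embeddings: &[Vec<f32>], limit: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = chunk_embeddings
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine_similarity(query, e)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // descending by score
    scored.truncate(limit);
    scored
}

fn main() {
    println!("{:?}", chunk_words("one two three four five", 2));

    // Toy 2-dimensional "embeddings"; a real run would get these from the model.
    let query = vec![1.0, 0.0];
    let embeddings = vec![vec![0.0, 1.0], vec![1.0, 0.0], vec![1.0, 1.0]];
    for (idx, score) in top_matches(&query, &embeddings, 2) {
        println!("chunk {idx}: {score:.3}");
    }
}
```

Ranking chunks rather than whole files is what lets the tool point at a specific snippet instead of only naming the file.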

Installation & Quick Start

Once sff is published on crates.io, you can install it using Cargo:

cargo install sff
sff "project ideas for rust"

Ensure ~/.cargo/bin is in your system's PATH. By default, sff searches the current working directory (equivalent to --path .).

I use this tool myself to scan my personal notes. In the past these were simple .txt files in a folder until I migrated everything to iCloud + Obsidian. Here is some sample output from some random notes:

[Screenshot: sample sff output for my notes]

Performance

tl;dr: under 250 ms for English-only models on ~2,500 files and 10k chunks (20 words per chunk) on an M3 Max. If you need the best possible results and good multilingual retrieval, use minishlab/potion-multilingual-128M. Otherwise, stick with the default, minishlab/potion-retrieval-32M. Keep an eye on new model2vec models here: https://huggingface.co/minishlab.

| Command | Model | Query | Files | Chunks | Time (ms) |
|---|---|---|---|---|---|
| sff -m "minishlab/potion-base-8M" "javascript" | potion-base-8M | javascript | 2537 | 10000 | 209.34 |
| sff -m "minishlab/potion-retrieval-32M" "javascript" | potion-retrieval-32M | javascript | 2537 | 10000 | 249.95 |
| sff -m "minishlab/potion-multilingual-128M" "javascript" | potion-multilingual-128M | javascript | 2537 | 10000 | 1001.69 |

Features

  • Semantic Search: Finds files based on meaning, not just exact keyword matches.
  • Supported Files: Scans .txt, .md, and .mdx files.
  • Content Chunking: Breaks down documents into smaller, manageable chunks for precise matching.
  • Embedding Powered: Uses model2vec-rs to generate text embeddings. Models are typically downloaded from Hugging Face Hub.
  • Fast & Parallelized: Utilizes Rayon for parallel processing of file discovery, embedding generation, and similarity calculation.
  • Customizable:
    • Specify search directory.
    • Define your semantic query.
    • Choose the embedding model (Hugging Face Hub or local path).
    • Limit the number of results.
    • Enable recursive search through subdirectories.
  • Verbose Mode: Offers detailed timing information for performance analysis.
  • Clickable File Paths: Output paths are formatted for easy opening in most terminals.
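As a rough illustration of the file discovery step listed above (supported extensions plus optional recursion), here is a standard-library-only sketch. It is not taken from sff's source: `is_supported` and `collect_files` are hypothetical names, the extension match is assumed to be case-sensitive, and the real tool parallelizes this work with Rayon, which this sketch does not.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True if the path has one of the extensions the README lists (.txt, .md, .mdx).
/// Assumption: matching is case-sensitive; the real tool may behave differently.
fn is_supported(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|ext| ext.to_str()),
        Some("txt" | "md" | "mdx")
    )
}

/// Collect supported files in `dir`, descending into subdirectories only when
/// `recursive` is true (mirroring the -r flag). Note: symlinked directories are
/// followed, so a symlink cycle would recurse indefinitely in this sketch.
fn collect_files(dir: &Path, recursive: bool, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            if recursive {
                collect_files(&path, recursive, out)?;
            }
        } else if is_supported(&path) {
            out.push(path);
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut files = Vec::new();
    // Non-recursive scan of the current directory, like running sff without -r.
    collect_files(Path::new("."), false, &mut files)?;
    for file in &files {
        println!("{}", file.display());
    }
    Ok(())
}
```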

Usage

The basic command structure is:

sff [OPTIONS] <QUERY>...

Examples:

  • Search in the current directory for "machine learning techniques":

    sff "machine learning techniques"
  • Search recursively in ~/Documents/notes for "project ideas for rust":

    sff -p ~/Documents/notes -r "project ideas for rust"
  • Use a different model and limit results to 5:

    sff -m "minishlab/potion-multilingual-128M" -l 5 "benefits of parallel computing"

All Options:

You can view all available options with sff --help:

sff: Fast semantic file finder

Usage: sff [OPTIONS] <QUERY>...

Arguments:
  <QUERY>...
          The semantic search query

Options:
  -p, --path <PATH>
          The directory to search in
          [default: .]

  -m, --model <MODEL>
          Model to use for embeddings, from Hugging Face Hub or local path
          [default: minishlab/potion-retrieval-32M]

  -l, --limit <LIMIT>
          Number of top results to display
          [default: 10]

  -r, --recursive
          Search recursively through all subdirectories

  -v, --verbose
          Enable verbose mode to print detailed timings for nerds

  -h, --help
          Print help (see more with '--help')

  -V, --version
          Print version

Models

sff uses model2vec-rs, which typically downloads models from the Hugging Face Hub. The default model is minishlab/potion-retrieval-32M. You can specify any compatible Model2Vec model (a static embedding model distilled from a sentence transformer) available on the Hub, or a local path to a model. The first time you use a new model, it will be downloaded, which might take some time.

Roadmap

Missing Args

  • batch size: currently, 128 texts of 20 words each are embedded per batch
  • filetypes: currently only .txt, .md, and .mdx are supported, but this should be customizable via args
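To make the batch-size item concrete, here is a minimal sketch of embedding texts in fixed-size batches with a configurable size. Everything in it is an assumption for illustration: `embed_batch` is a hypothetical stand-in that returns a one-dimensional dummy vector (the word count) instead of calling a real model, and `embed_in_batches` is not a function from sff or model2vec-rs.

```rust
/// Hypothetical stand-in for a model call: returns one dummy vector per text
/// (just the word count), so the sketch runs without downloading a model.
fn embed_batch(texts: &[String]) -> Vec<Vec<f32>> {
    texts
        .iter()
        .map(|t| vec![t.split_whitespace().count() as f32])
        .collect()
}

/// Embed all texts, passing at most `batch_size` texts to the model at a time.
/// Output order matches input order.
fn embed_in_batches(texts: &[String], batch_size: usize) -> Vec<Vec<f32>> {
    let mut all = Vec::with_capacity(texts.len());
    for batch in texts.chunks(batch_size.max(1)) {
        all.extend(embed_batch(batch));
    }
    all
}

fn main() {
    let texts: Vec<String> = (0..300).map(|i| format!("chunk number {i}")).collect();
    let batch_size = 128; // the value the README says is currently fixed
    let embeddings = embed_in_batches(&texts, batch_size);
    let batches = (texts.len() + batch_size - 1) / batch_size;
    println!(
        "{} texts -> {} embeddings in {} batches",
        texts.len(),
        embeddings.len(),
        batches
    );
}
```

Exposing the batch size as an argument would let users trade memory use against throughput on their own hardware.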

Chunker Options

Output Options

PRs are always welcome!

FAQ

macOS: Search folders in iCloud

If you want to search a folder in iCloud (e.g. your Obsidian vault), you need to grant Full Disk Access to your terminal application (e.g. iTerm2) in the system settings (Privacy & Security > Full Disk Access).

Restart the terminal and the problem should be fixed.

License

  • MIT

Built by Dominik Weckmüller. If you like semantic search, check out my other work on GitHub, e.g. SemanticFinder!
