Use natural-language prompts to interact with Meta's Segment Anything Model (SAM)! Meta trained and studied text prompting but chose not to release it as part of their demo, so I figured I would try my hand at it. This extension is of course not the text encoder they used. It is a layer on top of SAM that calls a backend built with OpenAI's CLIP and leans heavily on kevinzakka's implementation of Gradient-weighted Class Activation Mapping (Grad-CAM).
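To give a rough idea of how a Grad-CAM heatmap can drive SAM, here is a minimal sketch of one plausible step: taking the hottest cell of a low-resolution saliency map and turning it into a pixel coordinate that could be used as a point prompt. This is an illustration, not the backend's actual code. The function name `heatmap_to_point`, the peak-picking strategy, and the 7x7 map size in the usage note are all assumptions.

```python
from typing import Tuple

import numpy as np


def heatmap_to_point(heatmap: np.ndarray, image_width: int, image_height: int) -> Tuple[int, int]:
    """Map the hottest cell of a 2-D saliency map to (x, y) image pixel coordinates.

    Hypothetical helper: assumes the heatmap covers the whole image on a regular grid.
    """
    if heatmap.ndim != 2 or heatmap.size == 0:
        raise ValueError("heatmap must be a non-empty 2-D array")

    rows, cols = heatmap.shape
    # Locate the single strongest activation in the map.
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)

    # Use the centre of the winning cell so the point is not biased
    # toward the cell's top-left corner.
    x = int((col + 0.5) * image_width / cols)
    y = int((row + 0.5) * image_height / rows)
    return x, y
```

For example, with a 7x7 map over a 224x224 image, each cell covers 32x32 pixels, so a single point can easily land just outside a small object.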
This is just for fun, and result quality varies. Some failure modes are explainable (the gradient mapping is imprecise, so some small objects are near misses), and others are just way off. I also tried grabbing all masks and filtering them against the search term, but that was slower and noisier. Maybe it'll work better later with some more patience.
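For the curious, the mask-filtering idea can be sketched as ranking candidate masks by cosine similarity between a text embedding and one image embedding per mask. This is a hedged sketch, not the code I tried: `rank_masks_by_similarity` is a hypothetical name, and the embeddings are assumed to come from CLIP's text and image encoders, which are not shown here.

```python
import numpy as np


def rank_masks_by_similarity(text_embedding: np.ndarray, mask_embeddings: np.ndarray):
    """Rank masks by cosine similarity to the search term.

    text_embedding: shape (d,), assumed to be the embedding of the search term.
    mask_embeddings: shape (n, d), assumed to be one embedding per masked crop.
    Returns (order, scores), where order lists mask indices from best to worst.
    """
    # Normalise both sides so the dot product equals cosine similarity.
    text = text_embedding / np.linalg.norm(text_embedding)
    masks = mask_embeddings / np.linalg.norm(mask_embeddings, axis=1, keepdims=True)

    scores = masks @ text
    order = np.argsort(-scores)  # descending similarity
    return order, scores
```

Under these assumptions every candidate mask needs its own image embedding, so the cost grows with the number of masks.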
You can install the extension from the Chrome Web Store here.
If you prefer to run the extension locally from this repo, see the following instructions:
To be added later; see here for now.