CoLSurgical

Abstract

Surgical triplet recognition aims to recognize surgical activities as triplets (i.e., <instrument, verb, target>), which provide fine-grained information essential for surgical scene understanding. Existing methods for surgical triplet recognition rely on compositional methods that recognize the instrument, verb, and target simultaneously. In contrast, our method, called chain-of-look prompting, casts the problem of surgical triplet recognition as visual prompt generation from large-scale vision-language (VL) models, and explicitly decomposes the task into a series of video reasoning processes. Chain-of-look prompting is inspired by: (1) chain-of-thought prompting in natural language processing, which divides a problem into a sequence of intermediate reasoning steps; and (2) the inter-dependency between motion and visual appearance in the human vision system. Since surgical activities are conveyed by the actions of physicians, we regard the verbs as the carrier of semantics in surgical endoscopic videos. Additionally, we utilize the BioMed large language model to calibrate the generated visual prompt features for surgical scenarios. Our approach captures the visual reasoning processes underlying surgical activities and outperforms state-of-the-art methods on the largest surgical triplet recognition dataset, CholecT50.
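To make the idea of decomposing triplet recognition into a chain of steps concrete, here is a minimal sketch in plain Python. It is not the code in this repository: the function names (`chained_recognition`, `toy_verb_head`, and so on), the verb-first ordering, the toy score tables, and the class labels are all assumptions chosen for illustration. It only shows how each step can condition on the result of the previous one instead of predicting all three components at once.

```python
from typing import Callable, Dict, NamedTuple

Scores = Dict[str, float]
Features = Dict[str, float]


class Triplet(NamedTuple):
    instrument: str
    verb: str
    target: str


def argmax(scores: Scores) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def chained_recognition(
    features: Features,
    verb_head: Callable[[Features], Scores],
    instrument_head: Callable[[Features, str], Scores],
    target_head: Callable[[Features, str, str], Scores],
) -> Triplet:
    """Predict a triplet step by step (assumed order: verb, instrument, target)."""
    verb = argmax(verb_head(features))
    instrument = argmax(instrument_head(features, verb))
    target = argmax(target_head(features, verb, instrument))
    return Triplet(instrument=instrument, verb=verb, target=target)


# Hypothetical stand-ins for learned heads; the numbers are made up.
def toy_verb_head(features: Features) -> Scores:
    motion = features["motion"]
    return {"grasp": motion, "cut": 1.0 - motion}


def toy_instrument_head(features: Features, verb: str) -> Scores:
    table = {
        "grasp": {"grasper": 0.9, "scissors": 0.1},
        "cut": {"grasper": 0.2, "scissors": 0.8},
    }
    return table[verb]


def toy_target_head(features: Features, verb: str, instrument: str) -> Scores:
    if (verb, instrument) == ("grasp", "grasper"):
        return {"gallbladder": 0.7, "cystic_duct": 0.3}
    return {"gallbladder": 0.4, "cystic_duct": 0.6}


if __name__ == "__main__":
    result = chained_recognition(
        {"motion": 0.8}, toy_verb_head, toy_instrument_head, toy_target_head
    )
    print(result)
```

In a real system each head would be a learned module operating on video features; the lookup tables here only stand in for that so the control flow can be run and inspected.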

Folders and files: data, models, utils, README.md, main.py


southnx/CoLSurgical
