A project that gives ChatGPT Vision spatial awareness and the ability to give accurate screen coordinates.

quinny1187/GridGPT

GridGPT

GridGPT is a class you can use in your ChatGPT Vision API projects so that ChatGPT with vision can finally give you accurate coordinates when you ask it to locate specific objects. This enables things like letting ChatGPT Vision control your mouse through the GUI and click on things. There is some extra work to do to add the translation layer from grid cell to actual button click, but that is pretty easy to solve.
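The translation layer mentioned above could look roughly like the sketch below. It is not part of GridGPT: the function name `cell_center` is hypothetical, and it assumes cells are labelled with integers in row-major order starting at 0, which may differ from the labels the class actually draws.

```python
import math


def cell_center(label, image_width, image_height, cell_size):
    """Return the (x, y) pixel at the centre of a grid cell.

    Assumption: cells are numbered row by row, left to right, starting at 0.
    """
    columns = math.ceil(image_width / cell_size)
    rows = math.ceil(image_height / cell_size)
    if not 0 <= label < columns * rows:
        raise ValueError(f"label {label} is outside the {columns}x{rows} grid")

    row, col = divmod(label, columns)
    left = col * cell_size
    top = row * cell_size
    # Cells on the right and bottom edges may be smaller than cell_size.
    right = min(left + cell_size, image_width)
    bottom = min(top + cell_size, image_height)
    return (left + right) // 2, (top + bottom) // 2
```

The returned point can then be handed to a mouse automation library, for example `pyautogui.click(x, y)`, provided the image was captured at the same resolution as the screen.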

Example 1: Small image, 50 pixel cell size, asking for a single grid cell (image)

Example 2: Large 4K image, 100 pixel cell size, asking for a group of grid cells (image)

Example 3: Full prompt example. First message in a new chat with ChatGPT, and Vision nails it (image)

How it works

At runtime it takes your image and uses Pillow to prepare it for ChatGPT Vision:

  1. Lay down cells with a transparent white background. You can modify the intensity of this in the code.
  2. Draw the grid lines according to the cell size. The grid should cover the entire picture even if the image dimensions are not an exact multiple of the cell size.
  3. Add a transparent text identifier to each cell so ChatGPT can tell the cells apart.
  4. Take the output file and send it to ChatGPT Vision along with the included prompt.txt (it requires two modifications).
  5. After some fine-tuning of the parameters, ChatGPT will be able to tell you exactly which grid cells to click on for the object you are looking for.
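Steps 1 to 3 can be sketched with Pillow as follows. This is a minimal illustration, not the GridGPT class itself: the function name `draw_grid`, its parameters, the default values, and the row-major integer labels are all assumptions.

```python
import math

from PIL import Image, ImageDraw, ImageFont


def draw_grid(image, cell_size=100, wash_alpha=60, label_alpha=160):
    """Overlay a labelled grid on an image; return (result, columns, rows)."""
    base = image.convert("RGBA")
    width, height = base.size

    # Step 1: a semi-transparent white wash; wash_alpha sets its intensity.
    overlay = Image.new("RGBA", base.size, (255, 255, 255, wash_alpha))
    draw = ImageDraw.Draw(overlay)
    font = ImageFont.load_default()

    # Rounding up makes the grid cover the whole picture even when the
    # image size is not an exact multiple of cell_size.
    columns = math.ceil(width / cell_size)
    rows = math.ceil(height / cell_size)

    for row in range(rows):
        for col in range(columns):
            left, top = col * cell_size, row * cell_size
            right = min(left + cell_size, width) - 1
            bottom = min(top + cell_size, height) - 1
            # Step 2: cell border.
            draw.rectangle([left, top, right, bottom], outline=(0, 0, 0, 255))
            # Step 3: semi-transparent identifier (assumed row-major index).
            draw.text(
                (left + 4, top + 4),
                str(row * columns + col),
                fill=(0, 0, 0, label_alpha),
                font=font,
            )

    return Image.alpha_composite(base, overlay), columns, rows
```

Saving the result with `result.convert("RGB").save("grid.png")` gives a file that can be uploaded alongside the prompt in step 4. The default font is small, so larger cell sizes or a bigger font may be needed for the labels to stay readable.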
