Releases: reworkd/tarsier
v0.6.0 - Microsoft OCR Support
Highlights
- Added support for the Azure OCR service; previously the only provider was AWS
- Improved positioning of text chunks and fonts
What's Changed
- Check rectangle by @asim-shrestha in #19
- Absolutely position some tags by @asim-shrestha in #24
- Snapshot by @asim-shrestha in #40
- Consistent font sizes by @asim-shrestha in #42
- Spaces instead of tabs by @asim-shrestha in #43
- Fix absolute positioning to be left of element instead of on top by @asim-shrestha in #45
- Group text chunks to fix paragraphs/sentence spacing by @asim-shrestha in #46
- Ignore all descendants of interactable elements by @asim-shrestha in #50
- Add support for MS Azure Vision OCR by @ml5ah in #85
New Contributors
Full Changelog: v0.5.0...v0.6.0
v0.5.0 - Multiple Tag Types
What's Changed
- Tag interfering with Xpath fix by @KhoomeiK in #14
- Bump mypy from 1.7.0 to 1.7.1 by @dependabot in #13
- fixed leaf text tagging by @KhoomeiK in #16
- Tagging improvements by @KhoomeiK in #18
New Contributors
- @KhoomeiK made their first contribution in #14
- @dependabot made their first contribution in #13
Full Changelog: v0.4.0...v0.5.0
v0.4.0 - Improved Tagging
What's Changed
- Fix readme citation link by @Krupskis in #3
- Fix Citation Repository URL in Readme by @debanjum in #4
- Remove Annotations and Tag All text elements (optionally) by @awtkns in #8
- Make spans have red background with white text by @awtkns in #9
New Contributors
- @Krupskis made their first contribution in #3
- @debanjum made their first contribution in #4
- @awtkns made their first contribution in #8
Full Changelog: v0.3.1...v0.4.0
v0.3.1 - Initial Release
Vision utilities for web interaction agents
Main site • Twitter • Discord
Announcing Tarsier
If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:
- How do you map LLM responses back into web elements?
- How can you mark up a page for an LLM to better understand its action space?
- How do you feed a "screenshot" to a text-only LLM?
At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects.
Because of this, we're now open-sourcing this simple utility library for multimodal web agents... Tarsier!
The video below demonstrates Tarsier usage by feeding a page snapshot into a LangChain agent and letting it take actions.
tarsier.mp4
How does it work?
Tarsier works by visually "tagging" interactable elements on a page via brackets + an id such as [1].
In doing this, we provide a mapping between elements and ids for GPT-4(V) to take actions upon.
We define interactable elements as buttons, links, or input fields that are visible on the page.
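To make the element-to-id mapping concrete, here is a minimal sketch of how an agent might turn a model's reply back into an element reference. The action format ("CLICK [1]"), the `resolve_action` helper, and the example XPaths are all assumptions made for illustration; Tarsier itself only supplies the mapping from tag ids to XPaths.

```python
import re
from typing import Dict, Tuple

# Hypothetical action format: an uppercase verb followed by a bracketed tag id,
# e.g. "CLICK [3]". This format is an assumption, not something Tarsier defines.
ACTION_PATTERN = re.compile(r"\b([A-Z]+)\s*\[(\d+)\]")


def resolve_action(llm_response: str, tag_to_xpath: Dict[int, str]) -> Tuple[str, str]:
    """Return (verb, xpath) for the first tagged action found in an LLM response."""
    match = ACTION_PATTERN.search(llm_response)
    if match is None:
        raise ValueError(f"No tagged action found in: {llm_response!r}")
    verb, tag_id = match.group(1), int(match.group(2))
    if tag_id not in tag_to_xpath:
        # The model referenced a tag that was never placed on the page.
        raise KeyError(f"Unknown tag [{tag_id}]")
    return verb, tag_to_xpath[tag_id]


# Made-up mapping, shaped like a dictionary of tag ids to XPaths.
example_mapping = {1: "//a[@id='login']", 2: "//input[@name='q']"}
print(resolve_action("I will CLICK [1] to sign in.", example_mapping))
```

Rejecting unknown tag ids up front keeps a hallucinated id from turning into a confusing selector error later on.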
Tarsier can also provide a textual representation of the page. This means that Tarsier enables deeper interaction even for LLMs that are not multimodal.
This is important to note given the performance issues of existing vision-language models.
Tarsier also provides OCR utils to convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
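As a rough illustration of what a "whitespace-structured string" can mean, the sketch below places OCR words on a character grid based on their pixel positions. It is a simplified stand-in, not Tarsier's actual algorithm: the `(text, x, y)` input shape, the `layout_words` name, and the default cell sizes are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# Each OCR word is (text, x, y) in pixels. This input shape is a hypothetical
# simplification of what an OCR service returns.
Word = Tuple[str, float, float]


def layout_words(words: List[Word], char_width: float = 10.0, line_height: float = 20.0) -> str:
    """Place words on a character grid so that spacing mirrors the screenshot."""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        row = int(y // line_height)  # which text line the word falls on
        col = int(x // char_width)   # which character column it starts at
        rows.setdefault(row, []).append((col, text))

    if not rows:
        return ""

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and col <= len(line):
                # Words would touch or overlap: keep a single separating space.
                line += " "
            else:
                # Pad with spaces up to the word's column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


print(layout_words([("Hacker", 0, 0), ("News", 70, 0), ("login", 300, 0), ("[1]", 0, 40)]))
```

Empty rows are kept as blank lines, so vertical gaps in the screenshot survive in the text as well.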
Usage
Visit our cookbook for agent examples using Tarsier:
- An autonomous LangChain web agent π¦βοΈ
- An autonomous LlamaIndex web agent π¦
Otherwise, basic Tarsier usage might look like the following:
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService


async def main():
    # Placeholder: fill in your Google Cloud service account credentials
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
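Once you have the tag-to-XPath mapping, acting on a tagged element could look like the sketch below. The `click_tag` helper is hypothetical, and the sketch assumes the mapping is keyed by integer tag ids; `page` is expected to be a Playwright async `Page`, whose `locator` method accepts selectors prefixed with `xpath=`.

```python
from typing import Dict


async def click_tag(page, tag_to_xpath: Dict[int, str], tag_id: int) -> None:
    """Click the element that a given tag id refers to (hypothetical helper)."""
    xpath = tag_to_xpath[tag_id]
    # Playwright treats selectors prefixed with "xpath=" as XPath expressions.
    await page.locator(f"xpath={xpath}").click()
```

Inside `main()` above, this would be called as `await click_tag(page, tag_to_xpath, 1)` after `page_to_text` returns.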
Supported OCR Services
- Google Cloud Vision
- Amazon Textract (Coming Soon)
- Microsoft Azure Computer Vision (Coming Soon)
Special shoutout to @KhoomeiK for making this happen!