13 Jun 00:48

awtkns

ff112bf

v0.6.0 - Microsoft OCR Support

Highlights 🔥

Added support for the Microsoft Azure OCR service; previously the only supported provider was Google Cloud Vision
Improved positioning of text chunks and fonts

What's Changed 👀

🍌 Check rectangle by @asim-shrestha in #19
🍌 Absolutely position some tags by @asim-shrestha in #24
📸 Snapshot by @asim-shrestha in #40
🍌 Consistent font sizes by @asim-shrestha in #42
🍌 Spaces instead of tabs by @asim-shrestha in #43
✨ Fix absolute positioning to be left of element instead of on top by @asim-shrestha in #45
✨ Group text chunks to fix paragraphs/sentence spacing by @asim-shrestha in #46
🚫 ignore all descendants of interactable elements by @asim-shrestha in #50
🔎 Add support for MS Azure Vision OCR by @ml5ah in #85

New Contributors ❤️

@hargup made their first contribution in #33
@ml5ah made their first contribution in #85

Full Changelog: v0.5.0...v0.6.0

Contributors

hargup, asim-shrestha, and ml5ah

05 Dec 18:07

awtkns

v0.5.0

b488eda

v0.5.0 - Multiple Tag Types

What's Changed

Tag interfering with Xpath fix by @KhoomeiK in #14
Bump mypy from 1.7.0 to 1.7.1 by @dependabot in #13
fixed leaf text tagging by @KhoomeiK in #16
Tagging improvements by @KhoomeiK in #18

New Contributors

@KhoomeiK made their first contribution in #14
@dependabot made their first contribution in #13

Full Changelog: v0.4.0...v0.5.0

Contributors

dependabot and KhoomeiK

15 Nov 05:52

awtkns

v0.4.0

ae5a749

v0.4.0 - Improved Tagging

🎉 What's Changed

✍️ Fix readme citation link by @Krupskis in #3
✍️ Fix Citation Repository URL in Readme by @debanjum in #4
🚀 Remove Annotations and Tag All text elements (optionally) by @awtkns in #8
🆑 Make spans have red background with white text by @awtkns in #9

👀 New Contributors

@Krupskis made their first contribution in #3
@debanjum made their first contribution in #4
@awtkns made their first contribution in #8

Full Changelog: v0.3.1...v0.4.0

Contributors

debanjum, awtkns, and Krupskis

11 Nov 19:52

awtkns

v0.3.1

7a0006f

v0.3.1 - Initial Release

🙈 Vision utilities for web interaction agents 🙈

🔗 Main site • 🐦 Twitter • 📢 Discord

Announcing Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

How do you map LLM responses back into web elements?
How can you mark up a page for an LLM to better understand its action space?
How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects.
Because of this, we're now open-sourcing this simple utility library for multimodal web agents... Tarsier!
The video below demonstrates Tarsier usage by feeding a page snapshot into a LangChain agent and letting it take actions.

tarsier.mp4

How does it work?

Tarsier works by visually "tagging" interactable elements on a page via brackets + an id such as [1].
In doing this, we provide a mapping between elements and ids for GPT-4(V) to take actions upon.
We define interactable elements as buttons, links, or input fields that are visible on the page.
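
The id-to-element mapping described above can be illustrated with a minimal sketch. It assumes the mapping is a plain dictionary from integer ids to XPaths and that the LLM refers to elements with the same bracketed ids; the sample XPaths and the `resolve_tags` helper are hypothetical and not part of Tarsier.

```python
import re

# Hypothetical mapping of tag ids to XPaths, as an agent might hold it in memory.
tag_to_xpath = {
    1: "//a[@id='login']",
    2: "//input[@name='q']",
    3: "//button[@type='submit']",
}

# Matches bracketed numeric ids such as [1] or [42].
TAG_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_tags(llm_response: str, mapping: dict[int, str]) -> list[str]:
    """Return the XPath for every bracketed id in the response that the mapping knows."""
    xpaths = []
    for match in TAG_PATTERN.finditer(llm_response):
        tag_id = int(match.group(1))
        if tag_id in mapping:  # skip ids that were never tagged on the page
            xpaths.append(mapping[tag_id])
    return xpaths


print(resolve_tags("Type the query into [2], then click [3].", tag_to_xpath))
```

Skipping unknown ids rather than raising keeps the sketch tolerant of hallucinated tags; a real agent may prefer to surface them as errors.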

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for non-multimodal LLMs.
This matters because existing vision-language models still have performance issues.
To that end, Tarsier provides OCR utils that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
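
To make the idea of a whitespace-structured string concrete, here is a rough sketch of one way to lay out OCR words on a character grid. It is not Tarsier's actual algorithm: the `Word` type, the `layout_words` function, and the fixed `char_width` and `line_height` values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge in pixels
    y: float  # top edge in pixels


def layout_words(words: list[Word], char_width: float = 10.0, line_height: float = 20.0) -> str:
    """Place each word on a character grid derived from its pixel position."""
    # Bucket words into text rows by their vertical position.
    rows: dict[int, list[Word]] = {}
    for word in words:
        rows.setdefault(int(word.y // line_height), []).append(word)

    if not rows:
        return ""

    lines = []
    for row_index in range(max(rows) + 1):
        line = ""
        for word in sorted(rows.get(row_index, []), key=lambda w: w.x):
            column = int(word.x // char_width)
            # Keep at least one space between words that would otherwise collide.
            if line and column <= len(line):
                column = len(line) + 1
            line = line.ljust(column) + word.text
        lines.append(line)  # rows with no words become blank lines
    return "\n".join(lines)


sample = [Word("Login", 0, 0), Word("Search", 200, 0), Word("Submit", 50, 45)]
print(layout_words(sample))
```

Horizontal padding and blank lines carry the page layout, so a text-only model still sees which words sit side by side and which are separated vertically.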

Usage

Visit our cookbook for agent examples using Tarsier:

An autonomous LangChain web agent 🦜⛓️
An autonomous LlamaIndex web agent 🦙

Otherwise, basic Tarsier usage might look like the following:

import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    google_cloud_credentials = {}  # Fill in your Google Cloud credentials

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
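
Once a tag has been chosen, the mapping returned by `page_to_text` can be used to act on the page. The sketch below assumes the mapping is keyed by integer ids and relies on Playwright's `xpath=` selector prefix; `click_tag` and `RecordingPage` are hypothetical helpers, and `RecordingPage` is only a stand-in so the snippet runs without a browser.

```python
import asyncio


async def click_tag(page, tag_to_xpath: dict[int, str], tag_id: int) -> str:
    """Click the element that a tag id refers to and return the XPath used."""
    if tag_id not in tag_to_xpath:
        raise KeyError(f"Tag [{tag_id}] is not present on the page")
    xpath = tag_to_xpath[tag_id]
    # Playwright accepts an explicit "xpath=" selector prefix.
    await page.locator(f"xpath={xpath}").click()
    return xpath


class RecordingPage:
    """Stand-in for a Playwright page that records selectors instead of clicking."""

    def __init__(self):
        self.clicked = []

    def locator(self, selector):
        page = self

        class _Locator:
            async def click(self):
                page.clicked.append(selector)

        return _Locator()


demo_page = RecordingPage()
asyncio.run(click_tag(demo_page, {3: "//button[@type='submit']"}, 3))
print(demo_page.clicked)
```

With a real Playwright page, the same call would be `await click_tag(page, tag_to_xpath, 3)` inside `main()`.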

Supported OCR Services

Google Cloud Vision
Amazon Textract (Coming Soon)
Microsoft Azure Computer Vision (Coming Soon)

Special shoutout to @KhoomeiK for making this happen! ❤️

Contributors

KhoomeiK

Releases: reworkd/tarsier