# Tutorial: Building a Browser Use Agent From Scratch and with Magentic-UI - Level 3



You might have seen cool video demos online of AI agents taking control of a computer or a browser to perform tasks. This is a new category of agents referred to as Computer-Use-Agents (CUA) or Browser-Use-Agents (BUA). Examples include [OpenAI's Operator](https://openai.com/index/introducing-operator/), [Claude Computer Use Model](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool), [AutoGen's MultimodalWebSurfer](https://microsoft.github.io/autogen/stable/reference/python/autogen_ext.agents.web_surfer.html), [Adept AI](https://www.adept.ai/blog/act-1), [Google's Project Mariner](https://deepmind.google/models/project-mariner/) and [Browser-Use](https://github.com/browser-use/browser-use/tree/main), among many others.


## What is a Computer Use Agent?

**Definition**: A computer or browser use agent is an agent that, given a task such as "order a shawarma sandwich from BestShawarma for pickup now", can programmatically control a computer or browser to complete the task autonomously. By "control a browser" we mean interacting with the browser the way a human would: clicking on buttons, typing in fields, scrolling and so on. Note that a tool-use language model agent could complete this food ordering task if it had access to the restaurant's API, for instance. That would not make it a CUA, because it is not _interacting_ with the browser to complete the task.

To make this distinction clearer, here is another example task.
Suppose we wanted to find the list of available Airbnbs in Miami from 6/18 to 6/20 for 2 guests.

![airbnb_sc.png](airbnb_sc.png)

How would a browser use agent solve this task?

- **Step 1:** Visit airbnb.com
- **Step 2:** Type "Miami" in the "Where" input box
- **Step 3:** Select "6/18" in the "Check in" date box
- **Step 4:** Select "6/20" in the "Check out" date box
- **Step 5:** Click on the "Who" button
- **Step 6:** Click "+" twice to add two guests
- **Step 7:** Click "Search" button
- **Step 8:** Summarize and extract listings from the webpage

On the other hand, suppose we had an API for Airbnb that looks like: `find_listings(location, check_in, check_out, guests)`

Then a tool-call agent would only need to generate the tool call `find_listings("Miami", "6/18", "6/20", 2)` and read out the result.
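To make the tool-call path concrete, here is a minimal sketch. The `find_listings` function, its listing data, and the `dispatch` helper are all hypothetical stand-ins written for illustration (this is not a real Airbnb API). The JSON string plays the role of the tool call generated by the model.

```python
import json

# Hypothetical stand-in for a listings API; the data below is made up for illustration.
def find_listings(location: str, check_in: str, check_out: str, guests: int) -> list:
    fake_db = [
        {"name": "Beach studio", "location": "Miami", "max_guests": 2},
        {"name": "Family house", "location": "Miami", "max_guests": 6},
        {"name": "Loft", "location": "Austin", "max_guests": 2},
    ]
    # Dates are accepted but ignored in this toy version.
    return [
        item for item in fake_db
        if item["location"] == location and item["max_guests"] >= guests
    ]

# Registry of tools the agent is allowed to call.
TOOLS = {"find_listings": find_listings}

def dispatch(tool_call_json: str):
    """Run the tool named in a JSON tool call with the given arguments."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

# What the model might emit for the Miami task.
tool_call = json.dumps({
    "name": "find_listings",
    "arguments": {"location": "Miami", "check_in": "6/18", "check_out": "6/20", "guests": 2},
})
listings = dispatch(tool_call)
print([item["name"] for item in listings])
```

A single call replaces all eight browser steps, which is why the API route is so much simpler when it exists.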

Clearly, if we had an API for every website and everything on the computer, it would be much simpler to perform this task. _But that is not the case currently_: many interfaces on the web cannot be accessed through an API, so the only option is to interact with the website directly. While future interfaces might become more directly accessible to agents via APIs and MCP servers, for now we need to manipulate websites directly.

## What Does This Tutorial Cover?

In this tutorial, we will cover how to build a basic browser-use agent. The goal is to demystify such agents and show how we can build a simple version of them. The only thing we need is access to a large language model (LLM) that can perform tool calling or produce structured JSON outputs (GPT-4o, Qwen2.5-VL, Llama 3.1, ...). The LLM does not need to be vision capable, but a model that can take image input would improve performance significantly. The LLM also does not need to have been trained for browser use: out-of-the-box LLMs can be turned into semi-capable browser-use agents by following the recipe in this tutorial. At the end of the tutorial we will discuss further directions.
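As a rough illustration of what "structured JSON outputs" means in practice, the sketch below validates a JSON reply from a model and turns it into a browser action. The action vocabulary (`click`, `type`, `scroll`), the field names, and the `parse_action` helper are assumptions made for this example, not the format used by any particular library.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary: each action name maps to its required argument names.
ALLOWED_ACTIONS = {
    "click": {"target_id"},
    "type": {"target_id", "text"},
    "scroll": {"direction"},
}

@dataclass
class BrowserAction:
    name: str
    args: dict

def parse_action(model_output: str) -> BrowserAction:
    """Parse a JSON action emitted by the model and check its arguments."""
    data = json.loads(model_output)
    name = data.get("action")
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    args = data.get("args", {})
    missing = ALLOWED_ACTIONS[name] - set(args)
    if missing:
        raise ValueError(f"missing arguments for {name}: {sorted(missing)}")
    return BrowserAction(name=name, args=args)

# Example reply a model might produce for "Type Miami in the Where box".
action = parse_action('{"action": "type", "args": {"target_id": 12, "text": "Miami"}}')
print(action)
```

Validating the reply before acting on it matters because models occasionally emit unknown actions or forget arguments, and it is cheaper to reject those than to let them reach the browser.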

We will cover three levels of building your browser use agent:

- Level 1: From scratch using only the `playwright` python package.
- Level 2: Using helpers from the `magentic-ui` package which simplifies building your agent.
- Level 3: Using the WebSurfer Agent from the `magentic-ui` package directly.

This is Level 3.

# Tutorial Prerequisites




To run this tutorial you will need Python 3.10 or later and the `magentic-ui` package. [Magentic-UI](https://github.com/microsoft/magentic-ui/tree/main) is a research prototype from Microsoft of a human-centered agentic interface. In this tutorial we will use utilities and helpers from that package without using the Magentic-UI application itself.

We recommend using a virtual environment to avoid conflicts with other packages.

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install magentic-ui
```

Alternatively, if you use [`uv`](https://docs.astral.sh/uv/getting-started/installation/) for dependency management, you can install Magentic-UI with:

```bash
uv venv --python=3.12 .venv
. .venv/bin/activate
uv pip install magentic-ui
```

We also need to install the browser that our agent will control with Playwright:

```bash
playwright install --with-deps chromium
```

The other thing you need to set up is your LLM. The easiest way to follow this tutorial is to obtain an OpenAI API key and set it as an environment variable:

```bash
export OPENAI_API_KEY=<YOUR API KEY>
```

You can also use any open source model with [Ollama](https://ollama.com/) if you have a capable GPU at your disposal. This tutorial series covers both OpenAI and Ollama.
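For the Ollama route, one possible setup is to point an OpenAI-compatible client at the local Ollama server. The sketch below only builds the keyword arguments. The helper name `ollama_client_kwargs`, the default model tag, and the capability flags in `model_info` are assumptions that you should adjust to the model you have actually pulled.

```python
def ollama_client_kwargs(
    model: str = "qwen2.5vl:7b",  # assumed model tag; use whichever model you pulled
    host: str = "http://localhost:11434",  # Ollama's default local address
) -> dict:
    """Build keyword arguments for an OpenAI-compatible client pointed at Ollama."""
    return {
        "model": model,
        "base_url": f"{host}/v1",  # Ollama's OpenAI-compatible endpoint
        "api_key": "ollama",  # placeholder value; a local server does not check it
        "model_info": {  # capability flags are assumptions: adjust for your model
            "vision": True,
            "function_calling": True,
            "json_output": True,
            "structured_output": False,
            "family": "unknown",
        },
    }

kwargs = ollama_client_kwargs()
print(kwargs["base_url"])
```

With `autogen-ext` installed, these arguments could then be passed along as `OpenAIChatCompletionClient(**kwargs)` in place of the OpenAI configuration.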

# Level 3: Using the WebSurfer Agent from Magentic-UI

We have a reference implementation of a capable browser use agent in Magentic-UI, which we call the `WebSurfer` agent. This section shows how to use it.



`WebSurfer` is an AutoGen AgentChat agent built using the tools we have seen previously to complete actions autonomously on the web. We have spent a lot of time fixing the many edge cases that arise on the web to arrive at a more reliable (but not perfect) browser use agent.
This agent builds on the [`MultimodalWebSurfer`](https://microsoft.github.io/autogen/stable/reference/python/autogen_ext.agents.web_surfer.html) agent from AutoGen that we previously developed.

Let's see now how to use it!

```python
from autogen_ext.models.openai import OpenAIChatCompletionClient
from magentic_ui.agents import WebSurfer
from magentic_ui.tools.playwright import (
    LocalPlaywrightBrowser,
)

# Top-level `await` works in a notebook; in a script, wrap this code in an
# async function and run it with asyncio.run().

browser = LocalPlaywrightBrowser(headless=False)

model_client = OpenAIChatCompletionClient(model="gpt-4o")

web_surfer = WebSurfer(
    name="web_surfer",
    model_client=model_client,  # Use any client from AutoGen!
    animate_actions=True,  # Set to True if you want to see the actions being animated!
    max_actions_per_step=10,  # Maximum number of actions to perform before returning
    downloads_folder="debug",  # Where to save downloads
    debug_dir="debug",  # Where to save debug files and screenshots
    to_save_screenshots=False,  # Set to True if you want to save screenshots of the actions
    browser=browser,  # Use any browser from Magentic-UI!
    multiple_tools_per_call=False,  # Set to True if you want to use multiple tools per call
    json_model_output=False,  # Set to True if your model does not support tool calling
)
await web_surfer.lazy_init()

task = "find the open issues assigned to husseinmozannar on the microsoft/magentic-ui repo on github"
try:
    messages = []
    # Stream the agent's messages as they are produced.
    async for message in web_surfer.run_stream(task=task):
        messages.append(message)
        print(message)
    print("########################################################")
    print("Final answer:")
    # The last streamed item holds the full message list; the second-to-last
    # message in that list carries the answer.
    print(messages[-1].messages[-2].content)
finally:
    await web_surfer.close()
```


```text
source='user' models_usage=None metadata={} content='find the open issues assigned to husseinmozannar on the microsoft/magentic-ui repo on github' type='TextMessage'
source='web_surfer' models_usage=RequestUsage(prompt_tokens=4858, completion_tokens=62) metadata={} content="I'll search for the open issues assigned to husseinmozannar on the microsoft/magentic-ui GitHub repository." type='TextMessage'
messages=[TextMessage(source='user', models_usage=None, metadata={}, content='find the open issues assigned to husseinmozannar on the microsoft/magentic-ui repo on github', type='TextMessage'), TextMessage(source='web_surfer', models_usage=RequestUsage(prompt_tokens=4858, completion_tokens=62), metadata={}, content="I'll search for the open issues assigned to husseinmozannar on the microsoft/magentic-ui GitHub repository.", type='TextMessage')] stop_reason=None
source='web_surfer' models_usage=RequestUsage(prompt_tokens=4858, completion_tokens=62) metadata={'internal': 'no', 'type': 'browse

2025-06-15 18:22:17.142 | INFO     | magentic_ui.agents.web_surfer._web_surfer:close:439 - Closing WebSurfer...

source='web_surfer' models_usage=RequestUsage(prompt_tokens=24988, completion_tokens=286) metadata={'internal': 'yes'} content=['\n\nThe actions the websurfer performed are the following.\n\n Action: web_search( {"explanation": "I\'ll search for the open issues assigned to husseinmozannar on the microsoft/magentic-ui GitHub repository.", "query": "open issues assigned to husseinmozannar microsoft/magentic-ui site:github.com", "require_approval": false} )\nObservation: I typed \'open issues assigned to husseinmozannar microsoft/magentic-ui site:github.com\' into the browser search bar.\n\n\n\n Action: click( {"explanation": "I\'ll click on the link to the GitHub issue page to find more details about the issues assigned to husseinmozannar.", "target_id": 28, "require_approval": false} )\nObservation: I clicked \'GitHub · Where software is built\'.\n\n\n\n Action: stop_action( {"explanation": "I found the open issues assigned to husseinmozannar on the microsoft/magentic-ui GitHub reposito
```
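The indexing `messages[-1].messages[-2]` in the example can look opaque, so here is a small sketch of the stream's shape using stand-in classes. `FakeMessage`, `FakeTaskResult`, `FakeAgent`, and `run_and_collect` are hypothetical names written for this illustration. They only mimic the pattern visible in the output above, where individual messages are streamed first and a final result object carries the whole message list.

```python
import asyncio
from dataclasses import dataclass, field

# Stand-ins for the message and result types, for illustration only.
@dataclass
class FakeMessage:
    source: str
    content: str

@dataclass
class FakeTaskResult:
    messages: list = field(default_factory=list)

class FakeAgent:
    """Mimics the stream's shape: individual messages, then a final result."""

    async def run_stream(self, task: str):
        history = [
            FakeMessage("user", task),
            FakeMessage("web_surfer", "final answer text"),
            FakeMessage("web_surfer", "internal action log"),
        ]
        for message in history:
            yield message
        yield FakeTaskResult(messages=history)

async def run_and_collect(agent, task: str) -> str:
    """Consume the stream and return the answer message's content."""
    items = [item async for item in agent.run_stream(task=task)]
    result = items[-1]  # last streamed item carries the full message list
    return result.messages[-2].content  # same indexing as the example above

# Outside a notebook there is no top-level await, so use asyncio.run().
answer = asyncio.run(run_and_collect(FakeAgent(), "demo task"))
print(answer)
```

Wrapping the loop in a helper like this also gives you a single place to add logging or a timeout later.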

We encourage you to experiment with the sample code file [sample_web_surfer.py](https://github.com/microsoft/magentic-ui/blob/main/samples/sample_web_surfer.py) and to try the Magentic-UI application, which provides a web UI for interacting with the WebSurfer agent, launching multiple tasks in parallel, and more!

Just run:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install magentic-ui
# export OPENAI_API_KEY=<YOUR API KEY>
magentic ui --port 8081
```
See [https://github.com/microsoft/magentic-ui](https://github.com/microsoft/magentic-ui) for the full instructions.

![../img/magenticui_running.png](../img/magenticui_running.png)

# What's next?





## Evaluation

The first thing you might be curious about is how well the WebSurfer agent performs.

In Magentic-UI, we have built a small evaluation library, [magentic-ui/eval](https://github.com/microsoft/magentic-ui/tree/main/src/magentic_ui/eval), that implements popular browser-use benchmarks and makes it easy to run evals. We plan to keep building on this library and to publish a tutorial on how to use it.

Magentic-UI has been tested against several benchmarks when running with o4-mini: the [GAIA](https://huggingface.co/datasets/gaia-benchmark/GAIA) test set (42.52%), which assesses general AI assistants across reasoning, tool use, and web interaction tasks; the [AssistantBench](https://huggingface.co/AssistantBench) test set (27.60%), focusing on realistic, time-consuming web tasks; [WebVoyager](https://github.com/MinorJerry/WebVoyager) (82.2%), measuring end-to-end web navigation in real-world scenarios; and [WebGames](https://webgames.convergence.ai/) (45.5%), evaluating general-purpose web-browsing agents through interactive challenges.
To reproduce these experimental results, please see the following [instructions](experiments/README.md).

For reference, the current SOTA on WebVoyager is the [browser-use library](https://browser-use.com/posts/sota-technical-report) using GPT-4o, which achieves 89%. Note that the WebVoyager evaluation is not consistent across different systems, as it relies on a mix of LLM-as-a-judge evaluation and human evaluation.


## Limitations

Using the Set-Of-Mark (SoM) approach to build the browser use agent has many limitations (note that both Magentic-UI and the Browser Use library use SoM). For instance, our agent will fail on any task that requires understanding coordinates on the screen.
Examples:

- dragging an element from position A to position B
- drawing on the screen
- playing web games

Moreover, the approach does not generalize to Computer Use tasks where there is no DOM from which to obtain element coordinates. For those, we need a model that can click on specific coordinates rather than on element IDs. The [UI-TARS](https://github.com/bytedance/UI-TARS) models have this ability, as does the latest [computer-use-preview](https://platform.openai.com/docs/guides/tools-computer-use) model from OpenAI. Another approach is to replace the DOM with a grounding or parsing model such as [OmniParser](https://microsoft.github.io/OmniParser/), which obtains element IDs from any GUI, combined with a tool-calling LLM.
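The indirection behind this limitation can be sketched in a few lines. In the example below, the model only ever names an element ID, and the system resolves that ID to screen coordinates using bounding boxes taken from the DOM. The `Rect` class, the `marks` table, and `click_point` are hypothetical names for illustration and are not Magentic-UI's implementation.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float

    def center(self) -> tuple:
        """Return the midpoint of the box, where a click would land."""
        return (self.x + self.width / 2, self.y + self.height / 2)

# Hypothetical set of marks: element ID -> bounding box obtained from the DOM.
marks = {
    28: Rect(x=100, y=40, width=200, height=20),
    31: Rect(x=100, y=80, width=50, height=30),
}

def click_point(target_id: int) -> tuple:
    """Resolve a model-chosen element ID to screen coordinates."""
    if target_id not in marks:
        raise KeyError(f"no DOM element with id {target_id}")
    return marks[target_id].center()

print(click_point(28))
```

The model can only reach points that appear in the table. A drag to an arbitrary position, a stroke on a canvas, or any interface without a DOM has no entry to look up, which is where coordinate-predicting models come in.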


Another limitation is that these agents are not *real-time*, so tasks such as video understanding or playing games become almost impossible natively, as there is a delay of multiple seconds between agent actions.

## Safety

Current LLMs are still very prone to adversarial attacks on the web. See these papers for how bad things can get with current models, even those tuned directly for CUA:

- [Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks](https://arxiv.org/html/2502.08586v1)
- [RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments](https://osu-nlp-group.github.io/RedTeamCUA/)

We recommend building guardrails into the agent that allow a human to approve actions when needed. We call such guardrails "ActionGuard" in Magentic-UI, and they allow you to define heuristics, in addition to LLM judgment, for when actions might need human approval.
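To give a feel for how heuristics and an LLM judgment can be combined, here is a deliberately crude sketch. It is not the ActionGuard API: the function `requires_approval`, the keyword list, the set of safe actions, and the `llm_judge` callback are all assumptions made for this example.

```python
from typing import Callable, Optional

# Hypothetical heuristics; the names and keyword list are illustrative only.
IRREVERSIBLE_KEYWORDS = ("buy", "purchase", "pay", "delete", "submit", "send")
ALWAYS_SAFE_ACTIONS = {"scroll_down", "scroll_up", "read_page"}

def requires_approval(
    action: str,
    explanation: str,
    llm_judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Decide whether a proposed action should wait for human approval."""
    # Heuristic 1: read-only actions never need approval.
    if action in ALWAYS_SAFE_ACTIONS:
        return False
    # Heuristic 2: crude substring match for possibly irreversible intent.
    text = explanation.lower()
    if any(word in text for word in IRREVERSIBLE_KEYWORDS):
        return True
    # Fallback: defer to an LLM judgment if one is provided.
    if llm_judge is not None:
        return llm_judge(action, explanation)
    return False

print(requires_approval("click", "Click the Pay now button"))
print(requires_approval("click", "Open the issues tab"))
```

Running cheap heuristics first keeps the LLM judge out of the loop for obviously safe or obviously risky actions, at the cost of occasional false positives from the substring match.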




If you've made it this far, I really appreciate you taking the time to read, and I hope you've enjoyed following along!