# Create MCP Agent using OpenVINO and Qwen-Agent

MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.

MCP helps you build agents and complex workflows on top of LLMs. LLMs frequently need to integrate with data and tools, and MCP provides:

- A growing list of pre-built integrations that your LLM can directly plug into
- The flexibility to switch between LLM providers and vendors
- Best practices for securing your data within your infrastructure

![Image](https://github.com/user-attachments/assets/dfe1aa42-cae9-4356-be81-f010462d78a8)

[Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) is a framework for developing LLM applications based on the instruction following, tool usage, planning, and memory capabilities of Qwen. It also comes with example applications such as Browser Assistant, Code Interpreter, and Custom Assistant.

This notebook explores how to create an MCP Agent step by step using OpenVINO and Qwen-Agent.

#### Table of contents:

- [Prerequisites](#Prerequisites)
- [Select device for inference](#Select-device-for-inference)
- [Select model for inference](#Select-model-for-inference)
- [Convert model using Optimum-CLI tool](#Convert-model-using-Optimum-CLI-tool)
    - [Weights Compression using Optimum-CLI](#Weights-Compression-using-Optimum-CLI)
- [Configure MCP servers](#Configure-MCP-servers)
- [Create an Agent](#Create-An-Agent)
- [Interactive Demo](#Interactive-Demo)

### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).




## Prerequisites

[back to top ⬆️](#Table-of-contents:)


In [None]:
import os
from pathlib import Path
import requests

os.environ["GIT_CLONE_PROTECTION_ACTIVE"] = "false"

%pip install -Uq pip
%pip uninstall -q -y optimum optimum-intel
%pip install --pre -Uq openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu \
"torch>=2.1" "datasets<4.0.0" "accelerate" "transformers>=4.51.0" "mcp-server-time" "mcp-server-fetch" \
"pydantic==2.9.2" "pydantic-core==2.23.4" "gradio>=5.0.0" "gradio-client==1.4.0" "modelscope_studio==1.0.0-beta.8"
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu \
"git+https://github.com/huggingface/optimum-intel.git"
%pip install -q "git+https://github.com/openvinotoolkit/nncf.git"
    
utility_files = ["notebook_utils.py", "cmd_helper.py"]

for utility in utility_files:
    local_path = Path(utility)
    if not local_path.exists():
        r = requests.get(
            url=f"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/{local_path.name}",
        )
        with local_path.open("w") as f:
            f.write(r.text)

# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry

collect_telemetry("llm-agent-mcp.ipynb")

Note: you may need to restart the kernel to use updated packages.


In [None]:
from cmd_helper import clone_repo

clone_repo("https://github.com/openvino-dev-samples/Qwen-Agent.git", revision="ov-genai")

%pip install -q -e ./Qwen-Agent/"[gui,code_interpreter,mcp]"

## Select device for inference
[back to top ⬆️](#Table-of-contents:)

In [None]:
from notebook_utils import device_widget

device = device_widget(default="CPU")

device

## Select model for inference
[back to top ⬆️](#Table-of-contents:)

Large Language Models (LLMs) are a core component of an agent. In this example, we demonstrate how to create an OpenVINO LLM in the Qwen-Agent framework. Since Qwen3 supports function calling during text generation, we select `Qwen/Qwen3-8B` as the LLM in the agent pipeline.

* **Qwen/Qwen3-8B** - Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support. [Model Card](https://huggingface.co/Qwen/Qwen3-8B)
* **Qwen/Qwen3-4B** - Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support. [Model Card](https://huggingface.co/Qwen/Qwen3-4B)


[Weight compression](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html) is a technique for enhancing the efficiency of models, especially those with large memory requirements. This method reduces the model's memory footprint, a crucial factor for Large Language Models (LLMs). We provide several options for model weight compression:

* **FP16** reduces the model's binary size on disk by using `save_model` with weight compression to FP16 precision enabled. This approach is available in OpenVINO out of the box and is the default behavior.
* **INT8** is an 8-bit weight-only quantization provided by [NNCF](https://github.com/openvinotoolkit/nncf). This method compresses weights to an 8-bit integer data type, which balances model size reduction and accuracy, making it a versatile option for a broad range of applications.
* **INT4** is a 4-bit weight-only quantization provided by [NNCF](https://github.com/openvinotoolkit/nncf). It quantizes weights to an unsigned 4-bit integer, either symmetrically around a fixed zero point of eight (i.e., the midpoint between zero and 15) in the case of **symmetric quantization**, or asymmetrically with a non-fixed zero point in the case of **asymmetric quantization**. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality. INT4 is ideal for situations where speed is prioritized and a small loss of accuracy is acceptable.
* **INT4 AWQ** is a 4-bit activation-aware weight quantization. [Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves the generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We will use the `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.
* **INT4 NPU-friendly** is a 4-bit channel-wise quantization. This approach is [recommended](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) for LLM inference using NPU.
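
To make the difference between the symmetric and asymmetric INT4 schemes more concrete, here is a minimal pure-Python sketch of 4-bit weight quantization for a single group of weights. It is an illustration only and not NNCF's actual implementation: the function names, the scaling choices, and the sample weights are all assumptions made for this example.

```python
def quantize_symmetric(weights):
    """Map floats to unsigned 4-bit codes (0..15) around a fixed zero point of 8."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0  # codes 1..15 cover [-max_abs, +max_abs]
    codes = [min(15, max(0, round(w / scale) + 8)) for w in weights]
    return codes, scale, 8


def quantize_asymmetric(weights):
    """Map floats to unsigned 4-bit codes using a zero point derived from the data."""
    lo = min(min(weights), 0.0)  # keep 0.0 representable
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 15.0 or 1.0
    zero_point = round(-lo / scale)
    codes = [min(15, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate float weights from 4-bit codes."""
    return [(c - zero_point) * scale for c in codes]


def max_error(weights, quantizer):
    """Largest absolute difference between original and restored weights."""
    codes, scale, zero_point = quantizer(weights)
    restored = dequantize(codes, scale, zero_point)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical, deliberately skewed weights (mostly positive).
weights = [0.02, 0.10, 0.35, 0.50, 0.75, 0.90, -0.05, 0.60]

print("symmetric  max error:", max_error(weights, quantize_symmetric))
print("asymmetric max error:", max_error(weights, quantize_asymmetric))
```

For skewed weights like these, the asymmetric variant spends its 16 levels on the range the data actually occupies, so its reconstruction error is smaller. For weights centered on zero the two schemes behave similarly.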

In [None]:
from llm_config import get_llm_selection_widget

form, lang, model_id_widget, compression_variant, use_preconverted = get_llm_selection_widget(device=device.value)

form

In [None]:
model_configuration = model_id_widget.value
model_id = model_id_widget.label
print(f"Selected model {model_id} with {compression_variant.value} compression")

## Convert model using Optimum-CLI tool
[back to top ⬆️](#Table-of-contents:)

🤗 [Optimum Intel](https://huggingface.co/docs/optimum/intel/index) is the interface between the 🤗 [Transformers](https://huggingface.co/docs/transformers/index) and [Diffusers](https://huggingface.co/docs/diffusers/index) libraries and OpenVINO to accelerate end-to-end pipelines on Intel architectures. It provides an easy-to-use CLI for exporting models to the [OpenVINO Intermediate Representation (IR)](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format.

<details>
  <summary><b>Click here to read more about Optimum CLI usage</b></summary>

The command below demonstrates a basic model export with `optimum-cli`:

```
optimum-cli export openvino --model <model_id_or_path> --task <task> <out_dir>
```

where the `--model` argument is a model id from the HuggingFace Hub or a local directory with the model (saved using the `.save_pretrained` method), and `--task` is one of the [supported tasks](https://huggingface.co/docs/optimum/exporters/task_manager) that the exported model should solve. For LLMs it is recommended to use `text-generation-with-past`. If model initialization requires remote code, the `--trust-remote-code` flag should also be passed.
</details>

### Weights Compression using Optimum-CLI
[back to top ⬆️](#Table-of-contents:)

You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when exporting your model with the CLI. 
<details>
  <summary><b>Click here to read more about weights compression with Optimum CLI</b></summary>

Set `--weight-format` to fp16, int8 or int4, respectively. This type of optimization reduces the memory footprint and inference latency.
By default, the quantization scheme for int8/int4 is [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization); to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization), add `--sym`.

For INT4 quantization you can also specify the following arguments:
- The `--group-size` parameter defines the group size to use for quantization; -1 results in per-column quantization.
- The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, 90% of the layers will be quantized to int4 while 10% will be quantized to int8.

Smaller group_size and ratio values usually improve accuracy at the expense of model size and inference latency.
You can additionally apply AWQ during model export with INT4 precision by using the `--awq` flag and providing a dataset name with the `--dataset` parameter (e.g. `--dataset wikitext2`).

>**Note**: Applying AWQ requires significant memory and time.

>**Note**: It is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
</details>
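
As a rough illustration of how the flags described above combine, the following sketch assembles an `optimum-cli export openvino` command line in Python without running it. The helper `build_export_command`, the output directory name, and the example `group_size` and `ratio` values are hypothetical and only chosen for demonstration. The notebook itself performs the conversion through `convert_and_compress_model`.

```python
import shlex


def build_export_command(model_id, out_dir, weight_format="int4", sym=False,
                         group_size=None, ratio=None, awq=False, dataset=None,
                         trust_remote_code=False):
    """Assemble (but do not run) an optimum-cli export command as an argument list."""
    cmd = [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--task", "text-generation-with-past",  # recommended task for LLMs
        "--weight-format", weight_format,
    ]
    # Symmetric quantization only applies to the integer formats.
    if sym and weight_format in ("int8", "int4"):
        cmd.append("--sym")
    # Group size, ratio and AWQ are INT4-specific options.
    if weight_format == "int4":
        if group_size is not None:
            cmd += ["--group-size", str(group_size)]
        if ratio is not None:
            cmd += ["--ratio", str(ratio)]
        if awq:
            if dataset is None:
                raise ValueError("AWQ needs a calibration dataset, e.g. dataset='wikitext2'")
            cmd += ["--awq", "--dataset", dataset]
    if trust_remote_code:
        cmd.append("--trust-remote-code")
    cmd.append(out_dir)
    return cmd


# Illustrative values only; pick group size and ratio based on your accuracy needs.
command = build_export_command(
    "Qwen/Qwen3-8B", "qwen3-8b-int4-ov",
    group_size=128, ratio=0.8, awq=True, dataset="wikitext2",
)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues if you decide to execute it.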

In [None]:
from llm_config import convert_and_compress_model

model_dir = convert_and_compress_model(model_id, model_configuration, compression_variant.value, use_preconverted=use_preconverted.value)

## Configure MCP servers

[back to top ⬆️](#Table-of-contents:)

An MCP server can be configured in an [MCP client](https://github.com/modelcontextprotocol/servers?tab=readme-ov-file#using-an-mcp-client). The MCP server configuration can be selected from the [public MCP servers list](https://github.com/punkpeye/awesome-mcp-servers), or taken from your own customized MCP server.
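
To give a feel for what such a configuration looks like, here is a small sketch that uses the same `mcpServers` layout as the agent scripts in this notebook, with one entry per server holding a `command` and its `args`. The `validate_mcp_config` helper is hypothetical and only performs basic shape checks. It is not part of Qwen-Agent or the MCP SDK.

```python
import json

# Same layout as the tool configuration passed to the agent in this notebook.
mcp_config = {
    "mcpServers": {
        "time": {
            "command": "python",
            "args": ["-m", "mcp_server_time", "--local-timezone=Asia/Shanghai"],
        },
        "fetch": {
            "command": "python",
            "args": ["-m", "mcp_server_fetch"],
        },
    }
}


def validate_mcp_config(config):
    """Hypothetical helper: check the basic shape and return the sorted server names."""
    servers = config.get("mcpServers")
    if not isinstance(servers, dict) or not servers:
        raise ValueError("'mcpServers' must be a non-empty mapping")
    for name, entry in servers.items():
        if not isinstance(entry, dict) or not isinstance(entry.get("command"), str):
            raise ValueError(f"server '{name}' needs a string 'command'")
        args = entry.get("args", [])
        if not isinstance(args, list) or not all(isinstance(a, str) for a in args):
            raise ValueError(f"server '{name}' needs 'args' as a list of strings")
    return sorted(servers)


# Round-trip through JSON, as you would when keeping the configuration in a file.
config_text = json.dumps(mcp_config, indent=2)
server_names = validate_mcp_config(json.loads(config_text))
print(server_names)
```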

## Create An Agent

[back to top ⬆️](#Table-of-contents:)

Function calling allows a model to detect when one or more tools should be called and respond with the inputs that should be passed to those tools. In an API call, you can describe tools and have the model intelligently choose to output a structured object like JSON containing arguments to call these tools. The goal of tools APIs is to more reliably return valid and useful tool calls than what can be done using a generic text completion or chat API.

We can take advantage of this structured output, combined with the fact that you can bind multiple tools to a tool calling chat model and allow the model to choose which one to call, to create an agent that repeatedly calls tools and receives results until a query is resolved.
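
The loop described above can be sketched in a few lines of plain Python. This is not how Qwen-Agent implements it: `fake_model`, the toy `get_current_time` tool, and the message layout are hypothetical stand-ins chosen so that the example runs without a real LLM.

```python
import json


def get_current_time(timezone_name):
    """Toy tool: returns a fixed, clearly fake timestamp so the example is deterministic."""
    return {"timezone": timezone_name, "datetime": "2025-01-01T00:00:00+00:00"}


TOOLS = {"get_current_time": get_current_time}


def fake_model(messages):
    """Stand-in for a tool-calling LLM: request a tool first, then answer from its result."""
    last = messages[-1]
    if last["role"] == "user":
        # Structured output: tool name plus JSON-encoded arguments.
        return {
            "role": "assistant",
            "tool_call": {
                "name": "get_current_time",
                "arguments": json.dumps({"timezone_name": "UTC"}),
            },
        }
    result = json.loads(last["content"])
    return {
        "role": "assistant",
        "content": f"The current time in {result['timezone']} is {result['datetime']}.",
    }


def run_agent(user_query, max_steps=5):
    """Call the model repeatedly, executing requested tools, until it gives a final answer."""
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # no tool requested: the query is resolved
        tool = TOOLS[call["name"]]
        result = tool(**json.loads(call["arguments"]))
        # Feed the tool result back so the model can use it in the next step.
        messages.append({"role": "tool", "name": call["name"], "content": json.dumps(result)})
    raise RuntimeError("Agent did not finish within max_steps")


answer = run_agent("What time is it?")
print(answer)
```

The `max_steps` limit is a simple safeguard against a model that keeps requesting tools without ever producing a final answer.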

OpenVINO has been integrated into the `Qwen-Agent` framework. You can use the following method to create an OpenVINO-based LLM for a `Qwen-Agent` pipeline.
Qwen-Agent offers a generic agent class, `Assistant`, which can handle the majority of single-agent tasks when instantiated directly. Its features:

- It supports role-playing.
- It provides automatic planning and tool-calling abilities.
- RAG (Retrieval-Augmented Generation): it accepts document input and can use an integrated RAG strategy to parse the documents.

Since the MCP server examples in this notebook depend on remote resources, please make sure your system is connected to the internet.

In [None]:
%%writefile mcp_test.py

import argparse
import openvino.properties as props
import openvino.properties.hint as hints
import openvino.properties.streams as streams
from qwen_agent.agents import Assistant


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_dir',
                        required=True,
                        type=str,
                        help='Required. model path')
    parser.add_argument('-d',
                        '--device',
                        default='CPU',
                        required=False,
                        type=str,
                        help='Optional. Device for inference (default: CPU)')
    args = parser.parse_args()

    tools = [
        {
            'mcpServers': {  # You can specify the MCP configuration file
                'time': {
                    'command': 'python',
                    'args': ['-m', 'mcp_server_time', '--local-timezone=Asia/Shanghai']
                },
                'fetch': {
                    'command': 'python',
                    'args': ['-m', 'mcp_server_fetch']
                }
            }
        },
        'code_interpreter',  # Built-in tools
    ]


    llm_cfg = {
        "ov_model_dir": args.model_dir,
        "model_type": "openvino-genai",
        "device": args.device,
        "disable_thinking": True,
        "chat_mode": True,
        "genai_chat_template":"{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
    }

    bot = Assistant(llm=llm_cfg,
                    system_message="/no_think ",
                    function_list=tools,
                    name='Qwen3 Tool-calling Demo',
                    description="I'm a demo using the Qwen3 tool calling. Welcome to add and play with your own tools!")

    messages = [{'role': 'user', 'content': 'What time is it?'}]
    response_plain_text = ''
    for response in bot.run(messages=messages):
        pass
    print(response)

Overwriting mcp_test.py


In [None]:
!python mcp_test.py --model_dir {str(model_dir)} --device {device.value}

2025-05-28 17:01:10,144 - mcp_manager.py - 122 - INFO - Initializing MCP tools from mcp servers: ['time', 'fetch']
2025-05-28 17:01:10,159 - mcp_manager.py - 340 - INFO - Initializing a MCP stdio_client, if this takes forever, please check the config of this mcp server: time
2025-05-28 17:01:27,661 - mcp_manager.py - 350 - INFO - No list resources: Method not found
2025-05-28 17:01:27,670 - mcp_manager.py - 340 - INFO - Initializing a MCP stdio_client, if this takes forever, please check the config of this mcp server: fetch
Downloading lxml (3.6MiB)
 Downloading lxml
Installed 36 packages in 181ms
2025-05-28 17:01:32,959 - mcp_manager.py - 350 - INFO - No list resources: Method not found
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2025-05-28 17:01:44,609 - mcp_manager.py - 277 - INFO - There are still ta

<think>

</think>
[TOOL_CALL] time-get_current_time
{"timezone": "Asia/Shanghai"}
[TOOL_RESPONSE] time-get_current_time
{
  "timezone": "Asia/Shanghai",
  "datetime": "2025-05-28T17:01:38+08:00",
  "is_dst": false
}
<think>

</think>

The current time in Asia/Shanghai is 2025-05-28T17:01:38+08:00.
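
The output above still contains empty `<think>` ... `</think>` blocks even though thinking is disabled. If you only want to display the final answer, one option is to strip these blocks with a regular expression. The helper below is a hypothetical sketch, not part of Qwen-Agent, and it assumes the response has already been converted to plain text.

```python
import re

# Matches a <think>...</think> block (possibly empty or multi-line) and trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think_blocks(text):
    """Remove all <think>...</think> blocks and surrounding whitespace from a response."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\n\n</think>\n\nThe current time in Asia/Shanghai is 2025-05-28T17:01:38+08:00."
clean = strip_think_blocks(raw)
print(clean)
```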


## Interactive Demo

[back to top ⬆️](#Table-of-contents:)

Let's create an interactive agent using [Gradio](https://www.gradio.app/).

In [7]:
from pathlib import Path
from PIL import Image
import requests

openvino_logo = "openvino_logo.png"
openvino_logo_url = "https://cdn-avatars.huggingface.co/v1/production/uploads/1671615670447-6346651be2dcb5422bcd13dd.png"

if not Path(openvino_logo).exists():
    image = Image.open(requests.get(openvino_logo_url, stream=True).raw)
    image.save(openvino_logo)

In [None]:
%%writefile mcp_demo.py

import argparse
import openvino.properties as props
import openvino.properties.hint as hints
import openvino.properties.streams as streams
from qwen_agent.agents import Assistant
from gradio_helper import OpenVINOUI


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_dir',
                        required=True,
                        type=str,
                        help='Required. model path')
    parser.add_argument('-d',
                        '--device',
                        default='CPU',
                        required=False,
                        type=str,
                        help='Optional. Device for inference (default: CPU)')
    args = parser.parse_args()

    tools = [
        {
            'mcpServers': {  # You can specify the MCP configuration file
                'time': {
                    'command': 'python',
                    'args': ['-m', 'mcp_server_time', '--local-timezone=Asia/Shanghai']
                },
                'fetch': {
                    'command': 'python',
                    'args': ['-m', 'mcp_server_fetch']
                },
            }
        },
        'code_interpreter',  # Built-in tools
    ]


    llm_cfg = {
        "ov_model_dir": args.model_dir,
        "model_type": "openvino-genai",
        "device": args.device,
        "chat_mode": True,
        "disable_thinking": True,
        "genai_chat_template":"{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
    }

    bot = Assistant(llm=llm_cfg,
                    system_message="/no_think ",
                    function_list=tools,
                    name='OpenVINO MCP Demo',
                    description="I'm a demo using the Qwen3 tool calling. Welcome to add and play with your own tools!")

    chatbot_config = {
        'prompt.suggestions': [
            'What time is it?',
            "Convert the time in Shanghai to New York time"
        ],
        'agent.avatar': "openvino_logo.png",
        'input.placeholder': "Type your message here...",
    }

    demo = OpenVINOUI(
        bot,
        chatbot_config=chatbot_config,
    )
    demo.run(server_port=7860)

Overwriting mcp_demo.py


In [None]:
!python mcp_demo.py --model_dir {str(model_dir)} --device {device.value}

Now you can visit [http://127.0.0.1:7860](http://127.0.0.1:7860) to try this demo.
If you are launching the demo remotely, specify `server_name` and `server_port`, for example:

`demo.run(server_name='your server name', server_port='server port in int')`

To kill the demo process, you can run one of the following commands.

On *Windows*:

`!for /f "tokens=5" %a in ('netstat -aon ^| findstr ":7860 "') do taskkill /f /pid %a`

On *Linux*:

`!kill -9 $(lsof -t -i :7860)`
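
Before or after running these commands, it can help to check whether something is still listening on the demo port. The following is a small sketch using only the Python standard library. The helper name `is_port_in_use` is hypothetical, and it only tests whether a TCP connection to the port succeeds, so it does not tell you which process owns the port.

```python
import socket


def is_port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds, i.e. something is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# 7860 is the port used by demo.run(server_port=7860) in this notebook.
print("Demo port in use:", is_port_in_use(7860))
```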