# Working with the Loong environment

You can also check out this cookbook in Google Colab [here](https://colab.research.google.com/github/camel-ai/loong/blob/main/cookbooks/env_with_generator.ipynb).

<div class="align-center">
  <a href="https://www.camel-ai.org/"><img src="https://i.postimg.cc/KzQ5rfBC/button.png" width="150"></a>
  <a href="https://discord.camel-ai.org"><img src="https://i.postimg.cc/L4wPdG9N/join-2.png" width="150"></a>

⭐ <i>Star us on [*GitHub*](https://github.com/camel-ai/camel), join our [*Discord*](https://discord.camel-ai.org) or follow our [*X*](https://x.com/camelaiorg)</i>
</div>

In [1]:
# Optional: Install camel if you don't have it
!pip install "git+https://github.com/camel-ai/camel.git@bec98152d3df3dd1731b78208608b4a9438a010e#egg=camel-ai[huggingface, data_tools]"

DEPRECATION: git+https://github.com/camel-ai/camel.git@bec98152d3df3dd1731b78208608b4a9438a010e#egg=camel-ai[huggingface, data_tools] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617
Collecting camel-ai (from camel-ai[data_tools,huggingface])
  Using cached camel_ai-0.2.38-py3-none-any.whl


The Loong *environment* is a unified interface for synthetic data generation, RL training, and benchmarking agents. It integrates the primitives that we implemented at CAMEL behind a single interface for developers and researchers. In this cookbook, we will explain how to initialize a *Single Step Environment* to generate synthetic data. More cookbooks about RL training and how to customize the environment are coming soon.

This type of environment is called a *single step* environment because the agent takes only one step. It gets a question sampled from the dataset (the initial state / observation) and then answers. The answer is then scored according to the reward function. Recently, rule-based reward functions, i.e. functions without any learnable parameters, have been successfully used to do RL with LLMs as the policy.
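To make the single-step idea concrete, here is a minimal sketch in plain Python. `ToySingleStepEnv` and `ToyObservation` are hypothetical names made up for illustration; this is not the CAMEL `SingleStepEnv`, and the exact string match stands in for whatever rule-based check a real verifier performs.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyObservation:
    question: str


class ToySingleStepEnv:
    """Hypothetical stand-in for a single-step environment (not the CAMEL class)."""

    ACCURACY_REWARD = 10.0

    def __init__(self, dataset, seed=None):
        self.dataset = dataset  # list of (question, reference_answer) pairs
        self.rng = random.Random(seed)
        self.current = None

    def reset(self):
        # Sample one question; it is the initial (and only) observation.
        self.current = self.rng.choice(self.dataset)
        return ToyObservation(question=self.current[0])

    def step(self, answer):
        # Rule-based reward: a fixed comparison with no learnable parameters.
        if self.current is None:
            raise RuntimeError("call reset() before step()")
        correct = answer.strip() == self.current[1]
        reward = self.ACCURACY_REWARD if correct else 0.0
        self.current = None
        done = True  # the episode always ends after one step
        return reward, done


toy_env = ToySingleStepEnv([("What is 2 + 3?", "5")], seed=0)
toy_obs = toy_env.reset()
toy_reward, toy_done = toy_env.step(" 5 ")
print(toy_obs.question, toy_reward, toy_done)
```

Every episode here is exactly one `reset` followed by one `step`, which is the property the name *single step* refers to.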

Since many RL algorithms (such as GRPO) need multiple rollouts at each step, batching is important to guarantee concurrency / parallelism. Our `SingleStepEnv` supports batching (both `reset` and `step`), but for the sake of simplicity, we will not use batching for this cookbook. We will soon release another cookbook dedicated to batching.
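As a rough illustration of why GRPO-style algorithms want several rollouts per question, the sketch below normalizes the rewards of a group of rollouts against the group's own mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for this example; they are not taken from the Loong or CAMEL code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of rollouts that answer the same question.

    Uses the population standard deviation; eps avoids division by zero
    when every rollout in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question, scored 10 (correct) or 0 (wrong).
rollout_rewards = [10.0, 0.0, 10.0, 10.0]
advantages = group_relative_advantages(rollout_rewards)
print(advantages)
```

With only one rollout per question there is no group to compare against, which is one reason batched `reset` and `step` calls matter for this family of algorithms.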

First, we have to load a dataset from which we will sample questions. The dataset can be either a `StaticDataset`, which is finite, or a `BaseGenerator`, which is an infinite supply of question-answer pairs that are synthetically generated in a way that depends on the implementation. A `BaseGenerator` needs a *seed dataset* to start its generative process. Each generator uses the seed dataset it was initialized with to generate new data.

In this cookbook, we will use the `FewShotGenerator`, which will generate new data points by doing simple few-shot prompting, using random data points from the seed dataset as examples.
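The few-shot idea can be sketched without any model call: sample a few seed examples at random and format them into a prompt. `build_few_shot_prompt`, its parameters, and the prompt wording are hypothetical and only illustrate the technique; the actual prompt used by `FewShotGenerator` may look quite different.

```python
import random


def build_few_shot_prompt(seed_examples, k=3, rng=None):
    """Format k randomly chosen (question, answer) pairs as few-shot examples."""
    rng = rng or random.Random()
    shots = rng.sample(seed_examples, k=min(k, len(seed_examples)))
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append("Now write one new problem in the same style, with its answer.")
    return "\n\n".join(parts)


toy_seeds = [
    ("How many edges does a path graph with 4 vertices have?", "3"),
    ("How many edges does a complete graph with 4 vertices have?", "6"),
    ("How many edges does a cycle graph with 5 vertices have?", "5"),
]
few_shot_prompt = build_few_shot_prompt(toy_seeds, k=2, rng=random.Random(0))
print(few_shot_prompt)
```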

A seed dataset is naturally a kind of `StaticDataset`, so let's initialize ours as one.

In [4]:
from camel.datasets import StaticDataset

from datasets import load_dataset

dataset = load_dataset("camel-ai/loong", split="graph_discrete_math")

seed_dataset = StaticDataset(dataset)


In [5]:
example = seed_dataset[0]

print(f"Question: {example.question}")
print(f"Final Answer: {example.final_answer}")

Question: Given an undirected path graph with 10 vertices, what is the largest independent node set and the list of maximal cliques that can be obtained by repeatedly removing cliques from the graph? Return the result as a 2-tuple, i.e., (largest_independent_node_set, list_of_maximal_cliques), where the first element is a set of nodes in sorted order and the second element is a list of maximal cliques in sorted order (each clique is represented as a set of nodes).
Final Answer: ({0, 2, 4, 6, 9}, [{0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}])


The `FewShotGenerator` needs a Python interpreter to compute a synthetic answer (pseudo ground truth) from the code it generated. For this, let's define a `PythonVerifier`.

Note: We will soon use dedicated CAMEL-based code interpreters instead of repurposing our Python verifier for this.
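Conceptually, computing a pseudo ground truth means executing the generated code and capturing what it prints. The sketch below does this with a plain subprocess from the standard library. `run_generated_code` is a hypothetical helper, not part of CAMEL, and it offers no sandboxing, so it should only be run on code you trust.

```python
import subprocess
import sys


def run_generated_code(code, timeout=10.0):
    """Run a code string in a fresh Python process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()


# Hypothetical generated solution: every node of a complete graph has eccentricity 1.
generated_code = "print({node: 1 for node in range(5)})"
pseudo_answer = run_generated_code(generated_code)
print(pseudo_answer)
```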

In [6]:
from camel.verifiers import PythonVerifier
from camel.agents import ChatAgent
from camel.extractors import BaseExtractor, BoxedStrategy

interpreter = PythonVerifier(required_packages=["numpy", "networkx"])
await interpreter.setup(uv=True)

Lastly, we need a model backend for the generation agent. Let's use the `ModelFactory` to create one.

Note: We use GPT-4o mini as a default here, hence we load our OpenAI API key. Feel free to use other models!

In [8]:
import os
from getpass import getpass

openai_api_key = getpass('Enter your API key: ')
os.environ["OPENAI_API_KEY"] = openai_api_key

Enter your API key: ··········


In [9]:
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from camel.configs import ChatGPTConfig
from camel.datasets import FewShotGenerator

model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI,
    model_type=ModelType.GPT_4O_MINI,
    model_config_dict=ChatGPTConfig().as_dict(),
)

# Note: When the generator is exhausted, it will create 20 new datapoints by default
# To save money on the API, let's set this number to 2 instead, so we don't generate more than we need.
generator = FewShotGenerator(
    buffer=2, seed_dataset=seed_dataset, verifier=interpreter, model=model
)
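The buffering behaviour described in the comment above can be pictured as a queue that is refilled in chunks whenever it runs empty. `ToyBufferedGenerator` is a hypothetical illustration of that idea only; it says nothing about how `FewShotGenerator` is implemented internally.

```python
import itertools
from collections import deque


class ToyBufferedGenerator:
    """Hands out datapoints one at a time, refilling `buffer` items when empty."""

    def __init__(self, make_datapoint, buffer=20):
        self.make_datapoint = make_datapoint  # stands in for an expensive LLM call
        self.buffer = buffer
        self.queue = deque()
        self.refills = 0

    def sample(self):
        if not self.queue:
            # Each refill costs `buffer` generation calls, needed or not.
            self.queue.extend(self.make_datapoint() for _ in range(self.buffer))
            self.refills += 1
        return self.queue.popleft()


counter = itertools.count()
toy_generator = ToyBufferedGenerator(lambda: next(counter), buffer=2)
toy_samples = [toy_generator.sample() for _ in range(5)]
print(toy_samples, toy_generator.refills)
```

A smaller buffer means fewer datapoints are generated ahead of time, which is the cost trade-off the comment refers to.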

Next, let's create a verifier that extracts the content inside a `\boxed{...}` from the LLM response and compares it semantically to the reference answer.

Since we want Loong to be flexible, we built a dedicated extraction module that defines how to parse the LLM response and extract the portion that we want to compare. A dedicated cookbook on how to use it is coming soon.
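To give a feel for what such an extraction step has to do, here is a minimal sketch that pulls out the content of the last `\boxed{...}` in a response by matching braces. `extract_boxed` is a hypothetical helper written for this example and is not the logic of `BoxedStrategy`.

```python
def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


toy_response = r"Each node has eccentricity 1, so \boxed{{0: 1, 1: 1, 2: 1}}"
print(extract_boxed(toy_response))
```

Brace counting matters because answers such as dictionaries and sets contain braces of their own, which a naive regular expression would cut short.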

In [10]:
from camel.verifiers import PythonVerifier
from camel.agents import ChatAgent
from camel.extractors import BaseExtractor, BoxedStrategy

# Initialize extractor
extractor = BaseExtractor([[BoxedStrategy()]])
await extractor.setup()


verifier = PythonVerifier(extractor=extractor, required_packages=["numpy", "networkx"])
await verifier.setup(uv=True)

Now that our generator and verifier are all set up, let's create a `SingleStepEnv` with them.

We can then call `env.reset()` to sample a question from the underlying generator, which is returned as an observation. We can then feed this observation into the agent.

In [11]:
from camel.environments import SingleStepEnv

env = SingleStepEnv(generator, verifier)
await env.setup()

obs = await env.reset(seed=42)

print(f"Observation: {obs}")



Observation: question='In a complete graph with 5 vertices (node 0, 1, 2, 3, 4), what are the eccentricities of each node? Return the eccentricities as a dictionary with the node as the key and its eccentricity as the value.' context={} metadata={}


The agent then processes this observation and selects an action, which is passed to the environment through the `step` function. More specifically, the action is passed to the verifier, which returns a reward based on whether the LLM response and the reference answer agree.

Let's first define a CAMEL agent and feed it the observation. Afterwards, we use the `step` function of the environment to get a reward.

In [30]:
from camel.environments.models import Action

agent = ChatAgent(model=model)

USER_PROMPT = r"""
You are an agent designed to answer mathematical questions with clarity and precision. Your task is to provide a step-by-step explanation for
any mathematical problem posed by the user, ensuring the response is easy to follow. Adhere to these guidelines:
Analyze the mathematical question carefully and break down the solution process into clear, logical steps.
Use natural language to explain each step, incorporating LaTeX notation (e.g., $x + 2$)
for mathematical expressions when helpful. Conclude your response with the final answer enclosed
in a LaTeX \boxed{} environment (e.g., \boxed{5}).
Place this at the end of your explanation as a standalone statement.
It should be a Python expression, for example "[1, 2, 3]" for a list.

The question you should answer is: """

response = agent.step(USER_PROMPT + obs.question).msgs[0].content

action = Action(llm_response=response)
next_obs, reward, done, info = await env.step(action)

agent.reset()

print(f"Is the episode done?: {done}")
print(f"Next Observation: {next_obs}")
print(f"Reward: {reward}")
print(f"Info: {info}")

Is the episode done?: True
Next Observation: question='Episode ended. This is just a placeholder.' context={} metadata=None
Reward: 10.0
Info: {'proposed_solution': 'To find the eccentricities of each node in a complete graph with 5 vertices, we need to first understand the definition of eccentricity in graph theory.\n\n**Step 1: Understanding Graph Basics**\n\nA complete graph, denoted as \\( K_n \\), is a graph where every pair of distinct vertices is connected by a unique edge. For \\( n = 5 \\), the graph \\( K_5 \\) has 5 vertices: 0, 1, 2, 3, and 4. \n\n**Step 2: Definition of Eccentricity**\n\nThe eccentricity \\( e(v) \\) of a vertex \\( v \\) is defined as the greatest distance from \\( v \\) to any other vertex in the graph. In mathematical terms, if \\( d(u, v) \\) is the distance (shortest path length) between vertices \\( u \\) and \\( v \\), then:\n\\[\ne(v) = \\max_{u \\in V} d(u, v)\n\\]\nwhere \\( V \\) is the set of all vertices in the graph.\n\n**Step 3: Distance in 

As you can see, the `step` function follows the convention familiar from Gym: it returns the next observation, the reward, a done flag, and an info dict.

`done` is always `True` in this case, since we are in a *single step* environment. This is also the reason that `next_obs` is a placeholder observation, as there is no next observation for this episode.

`reward` is the **total** reward for this action. By default, we only use an *accuracy reward*: $0$ if the verifier finds that the LLM response and the synthetic answer differ, and $10$ otherwise. The accuracy reward can be set to any number by overriding the corresponding attribute of the environment (e.g. `env.ACCURACY_REWARD = 1`). We will soon add a cookbook that shows how to create a custom reward by extending `SingleStepEnv`.
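One simple way to picture an accuracy reward on Python-expression answers is to parse both strings as literals and compare the resulting values. The sketch below assumes this approach for illustration; `accuracy_reward` is a hypothetical function, and the real verifier may compare answers differently.

```python
import ast

ACCURACY_REWARD = 10.0  # the default value mentioned above


def accuracy_reward(extracted, reference):
    """Return the full reward if both strings denote equal Python values."""
    try:
        same = ast.literal_eval(extracted) == ast.literal_eval(reference)
    except (ValueError, SyntaxError, TypeError):
        same = False  # unparsable answers count as wrong
    return ACCURACY_REWARD if same else 0.0


print(accuracy_reward("{0: 1, 1: 1}", "{1: 1, 0: 1}"))
print(accuracy_reward("[1, 2]", "[2, 1]"))
```

Comparing parsed values instead of raw strings means that formatting differences such as key order in a dictionary or extra whitespace do not affect the reward, while a differently ordered list still counts as a different answer.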

Finally, `info` is a dict containing detailed information about the specific interaction with the environment. For example, it contains a dict listing the different components of the total reward.

### Environment Loop

Let's look at what this looks like in a typical loop.

In [32]:
for i in range(4):
    obs = await env.reset()
    response = agent.step(USER_PROMPT + obs.question).msgs[0].content
    action = Action(llm_response=response)
    next_obs, reward, done, info = await env.step(action)
    print(f"Reward at step {i}: {reward}")
    agent.reset()  # to clear context window

Reward at step 0: 10.0
Reward at step 1: 0.0
Reward at step 2: 10.0
Reward at step 3: 10.0
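When the loop is used for benchmarking, the per-episode rewards can be summarized as an accuracy. The helper below is a hypothetical sketch that assumes the default accuracy reward of 10 and uses the four rewards printed above as input.

```python
def accuracy_from_rewards(rewards, full_reward=10.0):
    """Fraction of episodes that received the full accuracy reward."""
    if not rewards:
        raise ValueError("no rewards collected")
    solved = sum(1 for r in rewards if r == full_reward)
    return solved / len(rewards)


loop_rewards = [10.0, 0.0, 10.0, 10.0]
loop_accuracy = accuracy_from_rewards(loop_rewards)
print(f"Accuracy: {loop_accuracy:.2f}")  # Accuracy: 0.75
```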


In practice, the agent's model backend would point at an inference engine like vLLM or SGLang. After each `step` call (or batch of calls), we would feed the reward for the action into a training framework like *veRL* or *Hugging Face TRL*. The framework would update the model that the agent's backend points to, so that after every training step the newest iteration of the model is used to choose an action.

For a single-step task, this setup may seem like overkill. We created this framework to unify RL training because, in the near future, we will explore multi-step environments like Chess, Go and GAIA. We will work in the coming weeks to make it more efficient and release educational material on how to use it.

We are looking forward to seeing what you build using the Loong environment! Feel free to share it with us on 𝕏.