17 changes: 17 additions & 0 deletions macros/ai/chat-comp-vs-responses-api.mdx
@@ -0,0 +1,17 @@
---
macro: chat-comp-vs-responses-api
---

Both the [Chat Completions API](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) and the [Responses API](https://www.scaleway.com/en/developers/api/generative-apis/#path-responses-beta-create-a-response) are OpenAI-compatible REST APIs that can be used for generating and manipulating conversations. The Chat Completions API is focused on generating conversational responses, while the Responses API is a more general REST API for chat, structured outputs, tool use, and multimodal inputs.

The **Chat Completions** API was released in 2023 and is an industry standard for building AI applications, designed specifically for handling multi-turn conversations. It is stateless, but allows users to manage conversation history by appending each new message to the ongoing conversation. Messages in the conversation can include text, images, and audio extracts. The API supports `function` tool-calling, allowing developers to define functions that the model can choose to call. When the model does so, it returns the function name and arguments, which the developer's code must execute and feed back into the conversation.
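
For illustration, here is a minimal `function` tool-calling sketch against the Chat Completions API. The `get_weather` function, its schema, and the prompt are hypothetical placeholders, and the model may also answer directly instead of calling the tool:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>"                 # Your unique API key from Scaleway
)

# A hypothetical function schema the model may choose to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is the weather like in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    # Your code must execute the function, then feed its result back into the conversation
    print(call.function.name, call.function.arguments)
```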

The **Responses** API was released in 2025 and is designed to combine the simplicity of Chat Completions with the ability to carry out more agentic tasks and reasoning. It supports statefulness, maintaining context without needing to resend the entire conversation history. It also offers built-in tools (e.g. web or file search) that the model can execute itself while generating a response.

<Message type="note">
Scaleway's support for the Responses API is currently at beta stage. Support for the full feature set will be added incrementally: currently, statefulness and tools other than `function` calling are not supported.
</Message>
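
As a purely illustrative sketch of what statefulness means (it is not yet supported on Scaleway, per the note above), the OpenAI-style `previous_response_id` parameter chains a follow-up onto an earlier response without resending the history:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>"                 # Your unique API key from Scaleway
)

first = client.responses.create(
    model="gpt-oss-120b",
    input="Name three green energy sources."
)

# Sketch only: server-side chaining is not yet supported on Scaleway's beta
follow_up = client.responses.create(
    model="gpt-oss-120b",
    input="Which of those is cheapest?",
    previous_response_id=first.id
)
print(follow_up.output[-1].content[0].text)
```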

Most models supported by Generative APIs can be used with both the Chat Completions and Responses APIs. For the **`gpt-oss-120b`** model, use of the Responses API is recommended, as it allows you to access all of its features, especially tool-calling.

For full details on the differences between these APIs, see the [official OpenAI documentation](https://platform.openai.com/docs/guides/migrate-to-responses).
2 changes: 1 addition & 1 deletion pages/generative-apis/faq.mdx
@@ -40,7 +40,7 @@ No, you cannot increase maximum output tokens above [limits for each model](/ge
These limits are in place to protect you against:
- Long generation which may be ended by an HTTP timeout. Limits are designed to ensure a model will send its HTTP response in less than 5 minutes.
- Uncontrolled billing, as several models are known to be able to enter infinite generation loops (specific prompts can make the model generate the same sentence over and over, without stopping at all).
If you require higher maximum output tokens, you can use [Managed Inference](https://console.scaleway.com/inference/deployments) where these limts do not apply (as your bill will be limited by the size of your deployment).
If you require higher maximum output tokens, you can use [Managed Inference](https://console.scaleway.com/inference/deployments) where these limits do not apply (as your bill will be limited by the size of your deployment).

### Can I use OpenAI libraries and APIs with Scaleway's Generative APIs?
Yes, Scaleway's Generative APIs are designed to be compatible with OpenAI libraries and SDKs, including the OpenAI Python client library and LangChain SDKs. This allows for seamless integration with existing workflows.
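
For example, pointing the official OpenAI Python client at Scaleway's endpoint is all that is required (a minimal sketch; substitute your own API key):

```python
from openai import OpenAI

# The standard OpenAI client, pointed at Scaleway's Generative APIs endpoint
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>"                 # Your unique API key from Scaleway
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```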
212 changes: 149 additions & 63 deletions pages/generative-apis/how-to/query-language-models.mdx
@@ -3,17 +3,17 @@ title: How to query language models
description: Learn how to interact with powerful language models using Scaleway's Generative APIs service.
tags: generative-apis ai-data language-models chat-completions-api
dates:
validation: 2025-05-12
validation: 2025-08-22
posted: 2024-08-28
---
import Requirements from '@macros/iam/requirements.mdx'

import ChatCompVsResponsesApi from '@macros/ai/chat-comp-vs-responses-api.mdx'

Scaleway's Generative APIs service allows users to interact with powerful language models hosted on the platform.

There are several ways to interact with language models:
- The Scaleway [console](https://console.scaleway.com) provides a complete [playground](/generative-apis/how-to/query-language-models/#accessing-the-playground), allowing you to test models, adjust parameters, and observe how these changes affect the output in real time.
- Via the [Chat API](/generative-apis/how-to/query-language-models/#querying-language-models-via-api)
- Via the [Chat Completions API](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) or the [Responses API](https://www.scaleway.com/en/developers/api/generative-apis/#path-responses-beta-create-a-response)

<Requirements />

@@ -39,10 +39,12 @@ The web playground displays.

## Querying language models via API

The [Chat API](/generative-apis/api-cli/using-chat-api/) is an OpenAI-compatible REST API for generating and manipulating conversations.

You can query the models programmatically using your favorite tools or languages.
In the following example, we will use the OpenAI Python client.
In the example that follows, we will use the OpenAI Python client.

### Chat Completions API or Responses API?

<ChatCompVsResponsesApi />

### Installing the OpenAI SDK

@@ -68,48 +70,97 @@ client = OpenAI(

### Generating a chat completion

You can now create a chat completion, for example with the `llama-3.1-8b-instruct` model:
You can now create a chat completion using either the Chat Completions or Responses API, as shown in the following examples:

```python
# Create a chat completion using the 'llama-3.1-8b-instruct' model
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Describe a futuristic city with advanced technology and green energy solutions."}],
    temperature=0.2,  # Adjusts creativity
    max_tokens=100,   # Limits the length of the output
    top_p=0.7         # Controls diversity through nucleus sampling. You usually only need to use temperature.
)

# Print the generated response
print(response.choices[0].message.content)
```

<Tabs id="generating-chat-completion">

<TabsTab label="Chat Completions API">

```python
# Create a chat completion using the 'llama-3.1-8b-instruct' model
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Describe a futuristic city with advanced technology and green energy solutions."}],
    temperature=0.2,            # Adjusts creativity
    max_completion_tokens=100,  # Limits the length of the output
    top_p=0.7                   # Controls diversity through nucleus sampling. You usually only need to use temperature.
)

# Print the generated response
print(response.choices[0].message.content)
```

This code sends a message to the model and returns an answer based on your input. The `temperature`, `max_completion_tokens`, and `top_p` parameters control the response's creativity, length, and diversity, respectively.
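
To check whether the output was truncated by `max_completion_tokens`, you can also inspect the response metadata. A small sketch reusing `response` from above (field names follow the OpenAI schema):

```python
# Token accounting and the reason generation stopped
print(response.usage.prompt_tokens, response.usage.completion_tokens)
print(response.choices[0].finish_reason)  # "length" means max_completion_tokens was reached
```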

</TabsTab>

<TabsTab label="Responses API (Beta)">

```python
# Create a model response using the 'gpt-oss-120b' model
response = client.responses.create(
    model="gpt-oss-120b",
    input=[{"role": "user", "content": "Briefly describe a futuristic city with advanced technology and green energy solutions."}],
    temperature=0.2,        # Adjusts creativity
    max_output_tokens=100,  # Limits the length of the output
    top_p=0.7               # Controls diversity through nucleus sampling. You usually only need to use temperature.
)

# Print the generated response. Here, the last output message will contain the final content.
# Previous outputs will contain reasoning content.
print(response.output[-1].content[0].text)
```
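
Since reasoning models interleave reasoning items with the final message, you can also walk `response.output` and pick items by type. A small sketch reusing `response` from above (item types follow the OpenAI Responses schema):

```python
# Separate reasoning items from the final answer
for item in response.output:
    if item.type == "message":
        print("Final answer:", item.content[0].text)
    elif item.type == "reasoning":
        print("(reasoning item received)")
```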
</TabsTab>
</Tabs>

This code sends a message to the model and returns an answer based on your input. The `temperature`, `max_tokens`, and `top_p` parameters control the response's creativity, length, and diversity, respectively.

A conversation style may include a default system prompt. You can set this prompt by giving the first message the `system` role. For example:

```python
[
    {
        "role": "system",
        "content": "You are Xavier Niel."
    },
    {
        "role": "user",
        "content": "Hello, what is your name?"
    }
]
```

Adding such a system prompt can also help resolve issues if you receive responses such as `I'm not sure what tools are available to me. Can you please provide a library of tools that I can use to generate a response?`.
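
Because the Chat Completions API is stateless, the conversation history (including any system prompt) must be resent on every request. A minimal multi-turn sketch, reusing the `client` configured above:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Suggest a name for a solar-powered delivery drone."},
]

first = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=messages,
)

# Append the assistant's reply, then the follow-up, to carry the context forward
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Make it sound more futuristic."})

second = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=messages,
)
print(second.choices[0].message.content)
```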

### Model parameters and their effects

The following parameters will influence the output of the model:

- **`messages`**: A list of message objects that represent the conversation history. Each message should have a `role` (e.g., "system", "user", "assistant") and `content`.
- **`temperature`**: Controls the output's randomness. Lower values (e.g., 0.2) make the output more deterministic, while higher values (e.g., 0.8) make it more creative.
- **`max_tokens`**: The maximum number of tokens (words or parts of words) in the generated output.
- **`top_p`**: Recommended for advanced use cases only. You usually only need to use temperature. `top_p` controls the diversity of the output, using nucleus sampling, where the model considers the tokens with top probabilities until the cumulative probability reaches `top_p`.
- **`stop`**: A string or list of strings where the model will stop generating further tokens. This is useful for controlling the end of the output.
<Tabs id="model-params">

<TabsTab label="Chat Completions API">

- **`messages`**: A list of message objects that represent the conversation history. Each message should have a `role` (e.g., "system", "user", "assistant") and `content`.
- **`temperature`**: Controls the output's randomness. Lower values (e.g., 0.2) make the output more deterministic, while higher values (e.g., 0.8) make it more creative.
- **`max_completion_tokens`**: The maximum number of tokens (words or parts of words) in the generated output.
- **`top_p`**: Recommended for advanced use cases only. You usually only need to use temperature. `top_p` controls the diversity of the output, using nucleus sampling, where the model considers the tokens with top probabilities until the cumulative probability reaches `top_p`.
- **`stop`**: A string or list of strings where the model will stop generating further tokens. This is useful for controlling the end of the output, as shown in the sketch below.
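
A minimal sketch of `stop` in practice, reusing the `client` configured above (the prompt and stop string are illustrative):

```python
# Halt generation as soon as the model starts a second numbered item
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "List slogan ideas for a green city, numbered."}],
    stop=["\n2."],
)
print(response.choices[0].message.content)
```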

See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) for a full list of all available parameters.

</TabsTab>

<TabsTab label="Responses API (Beta)">

- **`input`**: A single text string, or an array of string/multi-modal inputs to provide to the model to generate a response. When using the array option, you can define a `role` and a list of `content` inputs of different types (text, files, images, etc.), as in the sketch below this list.
- **`max_output_tokens`**: The maximum number of output tokens that can be generated for a completion. A different default maximum value is enforced for each model, to avoid edge cases where tokens are generated indefinitely.
- **`temperature`**: Controls the output's randomness. Lower values (e.g., 0.2) make the output more deterministic, while higher values (e.g., 0.8) make it more creative.
- **`top_p`**: Recommended for advanced use cases only. You usually only need to use temperature. `top_p` controls the diversity of the output, using nucleus sampling, where the model considers the tokens with top probabilities until the cumulative probability reaches `top_p`.
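
A minimal sketch of the array form of `input`, reusing the `client` configured above (content part types follow the OpenAI Responses schema):

```python
response = client.responses.create(
    model="gpt-oss-120b",
    input=[
        {"role": "system", "content": "Answer in one sentence."},
        {"role": "user", "content": [
            {"type": "input_text", "text": "What makes a city energy-efficient?"},
        ]},
    ],
    max_output_tokens=100,
)
print(response.output[-1].content[0].text)
```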

See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-responses-beta-create-a-response) for a full list of all available parameters.

</TabsTab>
</Tabs>

<Message type="warning">
If you encounter an error such as "Forbidden 403", refer to the [API documentation](/generative-apis/api-cli/understanding-errors) for troubleshooting tips.
@@ -118,7 +169,8 @@ The following parameters will influence the output of the model:
## Streaming

By default, the outputs are returned to the client only after the generation process is complete. However, a common alternative is to stream the results back to the client as they are generated. This is particularly useful in chat applications, where it allows the client to view the results incrementally as each token is produced.
Following is an example using the chat completions API:

The following is an example using the Chat Completions API, but the `stream` parameter can be set in the same way with the Responses API.

```python
from openai import OpenAI
@@ -145,28 +197,62 @@ for chunk in response:

The service also supports asynchronous mode for any chat completion.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.scaleway.ai/v1", # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>" # Your unique API key from Scaleway
)

async def main():
    stream = await client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{
            "role": "user",
            "content": "Sing me a song",
        }],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(main())
```
<Tabs id="async">

<TabsTab label="Chat Completions API">

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.scaleway.ai/v1", # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>" # Your unique API key from Scaleway
)

async def main():
    stream = await client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{
            "role": "user",
            "content": "Sing me a song",
        }],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(main())
```
</TabsTab>
<TabsTab label="Responses API (Beta)">
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.scaleway.ai/v1", # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>" # Your unique API key from Scaleway
)

async def main():
    stream = await client.responses.create(
        model="gpt-oss-120b",
        input=[{
            "role": "user",
            "content": "Sing me a song"
        }],
        stream=True,
    )
    async for event in stream:
        if event.type == "response.output_text.delta":
            print(event.delta, end="")
        elif event.type == "response.completed":
            break

asyncio.run(main())
```
</TabsTab>
</Tabs>