## Launch A Server

Launch the server with a reasoning model (Qwen3-4B) and the matching reasoning parser.

In [1]:


from sglang import separate_reasoning, assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint, set_default_backend
from sglang.srt.utils import load_image
from sglang.test.test_utils import is_in_ci
from sglang.utils import print_highlight, terminate_process, wait_for_server


if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd


server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path Qwen/Qwen3-4B --reasoning-parser qwen3 --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
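If the launch command needs to vary (a different model or parser, for instance), one option is to assemble it from parts instead of editing the string by hand. The sketch below is illustrative only: `build_launch_cmd` is a hypothetical helper, not part of SGLang, and it uses only the flags that appear in the command above.

```python
import shlex


def build_launch_cmd(model_path: str, reasoning_parser: str, host: str = "0.0.0.0") -> str:
    """Assemble the server launch command (hypothetical helper, not an SGLang API)."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--reasoning-parser", reasoning_parser,
        "--host", host,
    ]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


cmd = build_launch_cmd("Qwen/Qwen3-4B", "qwen3")
print(cmd)
```

The resulting string can be passed to `launch_server_cmd` in place of the literal used above.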



[2025-05-05 17:10:15] server_args=ServerArgs(model_path='Qwen/Qwen3-4B', tokenizer_path='Qwen/Qwen3-4B', tokenizer_mode='auto', skip_tokenizer_init=False, enable_tokenizer_batch_encode=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen3-4B', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=32857, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=569616803, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None,

Set the default backend. Note that you can set `chat_template_name` in `RuntimeEndpoint`.

In [2]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}", chat_template_name="qwen"))

[2025-05-05 17:10:36] INFO:     127.0.0.1:43444 - "GET /get_model_info HTTP/1.1" 200 OK


Let's start with a basic question-answering task and see how the reasoning content is generated.

In [3]:
@function
def basic_qa(s, question):
    s += system("You are a helpful assistant that can answer questions.")
    s += user(question)
    s += assistant_begin()
    s += gen("answer", max_tokens=512)
    s += assistant_end()

state = basic_qa("List 3 countries and their capitals.")
print_highlight(state["answer"])

[2025-05-05 17:10:36] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-05 17:10:36] Decode batch. #running-req: 1, #token: 64, token usage: 0.00, gen throughput (token/s): 6.34, #queue-req: 0
[2025-05-05 17:10:37] Decode batch. #running-req: 1, #token: 104, token usage: 0.00, gen throughput (token/s): 82.06, #queue-req: 0
[2025-05-05 17:10:37] Decode batch. #running-req: 1, #token: 144, token usage: 0.00, gen throughput (token/s): 81.59, #queue-req: 0
[2025-05-05 17:10:38] Decode batch. #running-req: 1, #token: 184, token usage: 0.00, gen throughput (token/s): 81.15, #queue-req: 0
[2025-05-05 17:10:38] Decode batch. #running-req: 1, #token: 224, token usage: 0.00, gen throughput (token/s): 80.91, #queue-req: 0
[2025-05-05 17:10:39] Decode batch. #running-req: 1, #token: 264, token usage: 0.00, gen throughput (token/s): 80.59, #queue-req: 0
[2025-05-05 17:10:39] Decode batch. #running-req: 1, #token: 304, token usag

With `separate_reasoning`, you can move the reasoning content to `{param_name}_reasoning_content` in the state.
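Conceptually, the separation splits the model's raw output at its thinking delimiters and stores the two parts under two keys. The following is a minimal sketch in plain Python, assuming Qwen3-style `<think>...</think>` tags. `split_reasoning` is a hypothetical helper for illustration and is not SGLang's actual parser.

```python
def split_reasoning(text: str, name: str = "answer") -> dict:
    """Split raw output into answer and reasoning (illustrative, not SGLang's parser)."""
    open_tag, close_tag = "<think>", "</think>"
    reasoning, content = "", text
    if close_tag in text:
        # Everything before the closing tag is reasoning; the rest is the answer.
        head, _, tail = text.partition(close_tag)
        reasoning = head.replace(open_tag, "", 1).strip()
        content = tail.strip()
    # Mirror the `{param_name}_reasoning_content` naming convention.
    return {name: content, f"{name}_reasoning_content": reasoning}


raw = "<think>Paris is the capital of France.</think>France - Paris"
parts = split_reasoning(raw)
print(parts["answer_reasoning_content"])
print(parts["answer"])
```

When the output contains no closing tag, the sketch treats the whole text as the answer and leaves the reasoning empty.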

In [4]:

@function
def basic_qa_separate_reasoning(s, question):
    s += system("You are a helpful assistant that can answer questions.")
    s += user(question)
    s += assistant_begin()
    s += separate_reasoning(
        gen("answer", max_tokens=512),
        model_type="qwen3"
    )
    s += assistant_end()

reasoning_state = basic_qa_separate_reasoning("List 3 countries and their capitals.")
print_highlight(reasoning_state.stream_executor.variable_event.keys())
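# Use single quotes for the keys: reusing double quotes inside an f-string is a syntax error before Python 3.12.
print_highlight(f"\nSeparated Reasoning Content:\n{reasoning_state['answer_reasoning_content']}")
print_highlight(f"\n\nContent:\n{reasoning_state['answer']}")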
print_highlight(f"\n\nMessages:\n{reasoning_state.messages()[-1]}")

dict_keys(['answer', 'answer_reasoning_content'])
[2025-05-05 17:10:40] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 30, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-05 17:10:40,722 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-05 17:10:40,743 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-05-05 17:10:40] Decode batch. #running-req: 1, #token: 44, token usage: 0.00, gen throughput (token/s): 73.25, #queue-req: 0
[2025-05-05 17:10:41] Decode batch. #running-req: 1, #token: 84, token usage: 0.00, gen throughput (token/s): 82.31, #queue-req: 0
[2025-05-05 17:10:41] Decode batch. #running-req: 1, #token: 124, token usage: 0.00, gen throughput (token/s): 81.89, #queue-req: 0
[2025-05-05 17:10:42] Decode batch. #running-req: 1, #token: 164, token usage: 0.00, gen throughput (token/s): 81.34, #queue-req: 0
[2025-05-05 17:10:42] Decode batch. #running-req: 1, #token: 204, token usage: 0.00, gen throughput (token/s): 81.05, #queue-req: 0
[

`separate_reasoning` can also be used in multi-turn conversations.

In [5]:
@function
def multi_turn_qa(s):
    s += system("You are a helpful assistant that can answer questions.")
    s += user("Please give me a list of 3 countries and their capitals.")
    s += assistant(separate_reasoning(gen("first_answer", max_tokens=512), model_type="qwen3"))
    s += user("Please give me another list of 3 countries and their capitals.")
    s += assistant(separate_reasoning(gen("second_answer", max_tokens=512), model_type="qwen3"))
    return s


reasoning_state = multi_turn_qa()
print_highlight(f"\n\nfirst_answer:\n{reasoning_state['first_answer']}")
print_highlight(f"\n\nfirst_answer_reasoning_content:\n{reasoning_state['first_answer_reasoning_content']}")
print_highlight(f"\n\nsecond_answer:\n{reasoning_state['second_answer']}")
print_highlight(f"\n\nsecond_answer_reasoning_content:\n{reasoning_state['second_answer_reasoning_content']}")
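Because each generation's reasoning is stored under `{param_name}_reasoning_content`, the answer and reasoning pairs can also be collected in a loop instead of one print call per key. This is a small sketch, assuming the state supports `state[key]` lookups as used above. `collect_turns` is a hypothetical helper, and the dictionary stands in for a real state so that the snippet runs without a server.

```python
def collect_turns(state, names):
    """Return (name, answer, reasoning) tuples for each generation variable."""
    return [(n, state[n], state[f"{n}_reasoning_content"]) for n in names]


# Stand-in for a real state object; values are placeholders, not model output.
fake_state = {
    "first_answer": "A1",
    "first_answer_reasoning_content": "R1",
    "second_answer": "A2",
    "second_answer_reasoning_content": "R2",
}
turns = collect_turns(fake_state, ["first_answer", "second_answer"])
for name, answer, reasoning in turns:
    print(name, answer, reasoning)
```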

[2025-05-05 17:10:44] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 18, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-05 17:10:45] Decode batch. #running-req: 1, #token: 74, token usage: 0.00, gen throughput (token/s): 77.84, #queue-req: 0
[2025-05-05 17:10:45] Decode batch. #running-req: 1, #token: 114, token usage: 0.00, gen throughput (token/s): 81.93, #queue-req: 0
[2025-05-05 17:10:46] Decode batch. #running-req: 1, #token: 154, token usage: 0.00, gen throughput (token/s): 81.38, #queue-req: 0
[2025-05-05 17:10:46] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 81.06, #queue-req: 0
[2025-05-05 17:10:47] Decode batch. #running-req: 1, #token: 234, token usage: 0.00, gen throughput (token/s): 80.80, #queue-req: 0
[2025-05-05 17:10:47] Decode batch. #running-req: 1, #token: 274, token usage: 0.00, gen throughput (token/s): 80.44, #queue-req: 0
[2025-05-05 17:10:47] INFO:     127.0.0.1:52600 - "POST /generate HTTP/1.1

## Using Qwen 3's No-Thinking Mode

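`separate_reasoning` is particularly useful when combined with Qwen 3's advanced usage options, such as the `/no_think` soft switch. Appending `/no_think` to the prompt asks the model to skip its thinking phase, so the separated reasoning content comes back empty.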

[Qwen 3's advanced usages](https://qwenlm.github.io/blog/qwen3/#advanced-usages)


In [6]:
reasoning_state = basic_qa_separate_reasoning("List 3 countries and their capitals. /no_think")
print("Reasoning Content:\n", reasoning_state["answer_reasoning_content"])
print("Content:\n", reasoning_state["answer"])

[2025-05-05 17:10:51] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 26, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-05 17:10:51] Decode batch. #running-req: 1, #token: 57, token usage: 0.00, gen throughput (token/s): 76.85, #queue-req: 0
[2025-05-05 17:10:51] INFO:     127.0.0.1:52612 - "POST /generate HTTP/1.1" 200 OK
Reasoning Content:
 
Content:
 1. **France** - Paris  
2. **Japan** - Tokyo  
3. **Canada** - Ottawa
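The soft switch is just a suffix on the user turn, so it can be toggled per request. A minimal sketch follows. `with_thinking_switch` is a hypothetical helper, and it assumes the `/think` and `/no_think` tags described in the Qwen 3 post linked above.

```python
def with_thinking_switch(question: str, think: bool) -> str:
    """Append Qwen 3's soft-switch tag to a user prompt (hypothetical helper)."""
    tag = "/think" if think else "/no_think"
    return f"{question.rstrip()} {tag}"


prompt = with_thinking_switch("List 3 countries and their capitals.", think=False)
print(prompt)
```

The returned string can be passed to `basic_qa_separate_reasoning` in the same way as the hand-written prompt above.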


`separate_reasoning` can also be combined with regex-constrained generation, where the `regex` argument of `gen` restricts the answer to strings that match a pattern.

In [8]:
@function
def regular_expression_gen(s):
    s += user("What is the IP address of the Google DNS servers? just provide the answer")
    s += assistant(
        separate_reasoning(
            gen(
                "answer",
                temperature=0,
                # Note: the "." separators are unescaped, so each matches any character, not only a literal dot.
                regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
                max_tokens=512,
            ),
            model_type="qwen3",
        ),
    )


reasoning_state = regular_expression_gen()
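The pattern above leaves the `.` separators unescaped, so each one matches any character and strings that are not dotted quads can still satisfy the constraint. That is consistent with the `2023-10-05` answer printed later in this section. The sketch below checks this with Python's `re` module, which is an assumption made for illustration: the server's constrained-decoding engine may differ from `re` in details. `loose` reproduces the pattern as written, and `strict` is a suggested variant with escaped dots.

```python
import re

OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d\d?)"
loose = re.compile(rf"({OCTET}.){{3}}{OCTET}")    # '.' matches any character
strict = re.compile(rf"({OCTET}\.){{3}}{OCTET}")  # '\.' matches a literal dot only

for candidate in ["8.8.8.8", "2023-10-05"]:
    # Print whether each pattern accepts the whole candidate string.
    print(candidate, bool(loose.fullmatch(candidate)), bool(strict.fullmatch(candidate)))
```

Both patterns accept `8.8.8.8`, but only the loose one accepts `2023-10-05`. Escaping the dots is one way to keep the constrained answer in IP-address form.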

[2025-05-05 17:11:17] Prefill batch. #new-seq: 1, #new-token: 26, #cached-token: 8, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-05 17:11:17] Decode batch. #running-req: 1, #token: 68, token usage: 0.00, gen throughput (token/s): 1.54, #queue-req: 0
[2025-05-05 17:11:18] Decode batch. #running-req: 1, #token: 108, token usage: 0.00, gen throughput (token/s): 83.08, #queue-req: 0
[2025-05-05 17:11:18] Decode batch. #running-req: 1, #token: 148, token usage: 0.00, gen throughput (token/s): 82.53, #queue-req: 0
[2025-05-05 17:11:19] Decode batch. #running-req: 1, #token: 188, token usage: 0.00, gen throughput (token/s): 82.11, #queue-req: 0
[2025-05-05 17:11:19] Decode batch. #running-req: 1, #token: 228, token usage: 0.00, gen throughput (token/s): 81.86, #queue-req: 0
[2025-05-05 17:11:20] Decode batch. #running-req: 1, #token: 268, token usage: 0.00, gen throughput (token/s): 81.53, #queue-req: 0
[2025-05-05 17:11:20] Decode batch. #running-req: 1, #token: 308, token usag

In [12]:
print_highlight(f"Answer:\n{reasoning_state['answer']}")
print_highlight(f"\n\nReasoning Content:\n{reasoning_state['answer_reasoning_content']}")


Answer:
2023-10-05


Reasoning Content:
Okay, the user is asking for the IP addresses of Google's DNS servers. Let me recall what I know about DNS servers. Google provides two public DNS servers, right? They're commonly used for their reliability and speed.

I think the primary one is 8.8.8.8. Wait, isn't there another one? Oh yeah, 8.8.4.4. Those are the two main ones. Let me make sure I'm not mixing them up with other providers. For example, Cloudflare uses 1.1.1.1 and 1.0.0.1. But Google's are definitely 8.8.8.8 and 8.8.4.4. 

I should check if there are any other IP addresses, but I don't think so. They have two main ones. The user might be looking to set up their DNS settings, so providing both is important. Also, maybe mention that they're both in the same range, which is 8.8.0.0/14. But the user just asked for the IP addresses, so maybe just list them. 

Wait, the user said "just provide the answer," so maybe they don't need extra info. But to be thorough, I should confirm that 