- Model

In [None]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "2"

from vllm import LLM, SamplingParams, RequestOutput
llm = LLM(
    model="/root/project/models/DeepSeek-R1-Distill-Llama-8B",
    task="generate",
    enable_chunked_prefill=False,
    max_model_len=16384,
    max_seq_len_to_capture=16384,
    gpu_memory_utilization=0.9
)

tokenizer = llm.get_tokenizer()

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/root/project/models/QwQ-32B",
    use_fast=True,
    trust_remote_code=True,   # often required for LLaMA / custom tokenizers
)

# tokenizer.vocab_size

- Sampling

In [None]:

sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=16384
)

In [None]:
# Sampling parameters
sampling_params = SamplingParams(
    max_tokens=16,
    top_p=0.99,
    logprobs=5
)
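
The cell above sets `top_p=0.99`, which restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.99. The snippet below is a minimal pure-Python sketch of that idea for a single decoding step. It is not vLLM's implementation, and the function name `top_p_filter` and the example logits are hypothetical, chosen only for illustration.

```python
import math


def top_p_filter(logits, top_p, temperature=1.0):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, and return {token_index: renormalized_probability}.

    Illustrative sketch only; real samplers work on batched tensors.
    """
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least probable until the mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Hypothetical logits for a four-token vocabulary.
example_logits = [2.0, 1.0, 0.0, -5.0]
nucleus = top_p_filter(example_logits, top_p=0.99)
print(sorted(nucleus))
```

With a threshold this close to 1, only the extreme tail of the distribution is cut, so the setting mostly guards against sampling very unlikely tokens.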

In [None]:
sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=16384
)

- Prompt

In [None]:
prompt = "WARNING 12-15 09:42:07 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer."

In [None]:
prompt = 'WARNING 12-15 09:42:07 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer. [WARNING]\n\nWait, so this warning is saying that FlashInter is not available, so it\'s falling back to PyTorch\'s native implementation. So, maybe my code is okay. But the user mentioned that they get this warning and their output is incorrect. So perhaps, the problem is not with the code, but with the fact that they can\'t use FlashInfer, so the code is working, but giving a warning. Alternatively, maybe I should look into why the sampling isn\'t working as expected.\n\nAnother thing is that the user mentioned that the output is not correct. So perhaps the sampling isn\'t matching expectations. Maybe I should consider how the code is handling the tokenization and model loading.\n\nWait, in their code, they have:\n\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\ntokenizer = AutoTokenizer.from_pretrained("microsoft/llama2-70b")\n\nmodel = AutoModelForCausalLM.from_pretrained("microsoft/llama2-70b")\n\nBut I know that for LLaMA 70B, the AutoTokenizer may not have the "llama2-70b" model, but perhaps "llama2-70b" is a custom model. Maybe they should use the model name correctly.\n\nAlternatively, perhaps the issue is that the model is not loaded correctly due to version incompatibility or incorrect model name.\n\nWait, looking it up, the LLaMA 70B model is available on Hugging Face. So the code should be correct in that sense.\n\nBut perhaps the issue is that the auto-sampler is not working as expected. Maybe I should test the code.\n\nAlternatively, perhaps the user is using older version of PyTorch, and the topk and topp sampling is not supported. 
But the code seems to be written for PyTorch >= 1.13, which should support it.\n\nAnother thought: the user may have a custom model where the AutoTokenizer and AutoModelForCausalLM are not properly handling the model, perhaps due to missing dependencies or model compatibility.\n\nAlternatively, perhaps the model\'s tokenizer and the model are from different sources, so when using AutoTokenizer.from_pretrained and AutoModelForCausalLM.from_pretrained, they might not be properly aligned.\n\nWait, but the same model name is used for both, so that should be okay.\n\nAlternatively, maybe the model\'s tokenization is not handled correctly because the model expects a certain tokenization, and perhaps the AutoTokenizer is not using that.\n\nWait, but the AutoTokenizer should automatically handle the tokenization based on the model\'s requirements.\n\nHmm, perhaps the problem is that the user is trying to sample from the model, but the model expects a specific input format, and the code is not generating the samples correctly.\n\nWait, in the code, after getting the tokenizer, they get the model, and then perform the sampling.\n\nAnother thought: perhaps the model is using a different device, like CPU, which is slower, and the user expects it to run faster on GPU. But the code doesn\'t specify the device, so it might be using CPU by default.\n\nAlternatively, perhaps the model is not loaded correctly because of missing dependencies. For example, LLaMA 70B requires certain datasets or libraries that are not installed.\n\nWait, but the code is just using AutoTokenizer and AutoModelForCausalLM, so it should automatically download and install any required dependencies if not present.\n\nAlternatively, perhaps the user is encountering an error when running the code, but the error is not being displayed, just the warning about FlashInfer.\n\nWait, looking back at the user\'s message, they say that they get the warning about FlashInfer and the output is incorrect. 
So maybe the sampling is not working as intended, leading to incorrect outputs.\n\nAlternatively, perhaps the user\'s code is correct, but their dataset or setup is incorrect, leading to the wrong outputs.\n\nWait, perhaps the issue is with the tokenization. In the code, they are using the default tokenizer, which may not be handling certain edge cases correctly.\n\nAlternatively, perhaps the model is from a different manufacturer, and the AutoTokenizer and AutoModelForCausalLM are not compatible.\n\nWait, but the model is from Microsoft/Llama, so it should be compatible.\n\nAnother angle: perhaps the user is trying to sample from a model that has multiple versions, and the code is loading a different version than expected.\n\nAlternatively, perhaps the code is correct, but the user expects a specific output format, and the sampling is not aligning with that.\n\nAlternatively, perhaps the problem is that the user is using a different version of PyTorch, and the top-k sampling function has changed.\n\nWait, the code uses:\n\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\nBut the AutoTokenizer and AutoModelForCausalLM are part of the transformers library, so if the user\'s version is outdated, this might cause issues.\n\nWait, the user\'s code is from a file called topk_topp_sampler.py, and in line 58, they get a warning about FlashInfer not being available and falling back to PyTorch\'s implementation.\n\nSo perhaps the issue is that the user\'s PyTorch version is newer, and the topk and topp functions have changed, leading to incorrect behavior.\n\nAlternatively, perhaps the code is correct, but the user is using a model that is not compatible with the top-k sampling, perhaps because it\'s not a language model that can be sampled from.\n\nAlternatively, perhaps the issue is that the code is not setting the correct parameters for the sampling, leading to incorrect outputs.\n\nAnother thought: perhaps the code is correct, but the user is not 
providing the correct prompts or inputs, leading to incorrect outputs.\n\nAlternatively, perhaps the code is correct, but the user is expecting the outputs to be in a specific format, and the code is not matching that.\n\nWait, perhaps the user is expecting the code to perform top-p sampling, but it\'s actually doing top-k sampling, or vice versa.\n\nLooking at the code, in topk_topp_sampler.py, the code is using:\n\ntokenizer = AutoTokenizer.from_pretrained("microsoft/llama2-70b")\n\nmodel = AutoModelForCausalLM.from_pretrained("microsoft/llama2-70b")\n\nThen, in the sampling function:\n\ndef topk_sample(tokenizer, model, k=300, p=1.0):\n\n    # ... code ...\n\nSo the function takes k and p as parameters.\n\nBut the user mentioned that the output is incorrect, so perhaps the parameters are not set correctly.\n\nAlternatively, perhaps the code is using the wrong sampling function, leading to incorrect outputs.\n\nWait, perhaps the code is correct, but the user is not using it correctly. For example, perhaps the function is called with the wrong k and p values.\n\nAlternatively, perhaps the code is not handling the tokens correctly after sampling, leading to incorrect outputs.\n\nAnother thought: perhaps the model\'s tokenizer is not compatible with the model, causing issues during sampling.\n\nWait, but the tokenizer is loaded from the same model, so they should be compatible.\n\nAlternatively, perhaps the model\'s tokenizer expects a different tokenization process, and the AutoTokenizer is not handling that correctly.\n\nAlternatively, perhaps the code is correct, but the user\'s setup is causing the model to run slowly, leading to incorrect behavior.\n\nAlternatively, perhaps the issue is that the model is not being properly fine-tuned or is in eval mode, affecting the sampling.\n\nWait, in the code, after loading the model, it\'s not being set to training mode. 
So perhaps the model is in eval mode, which could affect the sampling.\n\nAlternatively, perhaps the code is correct, but the user is not providing the correct seed for reproducibility, leading to different outputs.\n\nAlternatively, perhaps the issue is that the code is using the wrong device, leading to incorrect computations.\n\nWait, the code doesn\'t specify the device, so it might be using CPU by default, which is slower, but not necessarily incorrect unless the user expects GPU usage.\n\nAlternatively, perhaps the user\'s model is not loaded into memory correctly, leading to issues during sampling.\n\nWait, perhaps the model is too large, and the code is not handling the memory correctly, causing the model to not load properly.\n\nBut the model is LLaMA 70B, which is about 7B parameters, so it\'s a large model. Using it in a sampling function might be memory-intensive, especially on a GPU with limited VRAM.\n\nAlternatively, perhaps the code is correct, but the user is facing issues due to hardware limitations.\n\nAnother angle: perhaps the user is using a version of PyTorch that doesn\'t support the top-k and top-p sampling functions in the way they expect.\n\nAlternatively, perhaps the code is correct, but the user\'s expectations about the output are incorrect, leading to perceived issues.\n\nAlternatively, perhaps the code is correct, but the user is not providing the correct input (e.g., the prompt is empty or incorrect), leading to nonsensical outputs.\n\nWait, in the code, the user is passing a prompt and temperature, but if the prompt is empty or not properly formatted, the outputs might be unexpected.\n\nAlternatively, perhaps the code is correct, but the user is not handling the tokens correctly after sampling, perhaps not removing the special tokens.\n\nAnother thought: perhaps the code is correct, but the user is using a custom model where the AutoTokenizer and AutoModelForCausalLM are not properly aligned, causing issues during 
sampling.\n\nAlternatively, perhaps the issue is that the AutoTokenizer is not returning the expected tokens, leading to incorrect sampling.\n\nAlternatively, perhaps the code is correct, but the user is using a model that has multiple possible outputs, and the sampling is not aligning with their expectations.\n\nAnother angle: perhaps the user is using a model that requires specific input formats, such as being a causal model, and the code is not handling that correctly.\n\nWait, the model is AutoModelForCausalLM, which is designed for causal language models, so the code should be compatible.\n\nAlternatively, perhaps the user is encountering an error that is not being properly reported, leading them to think the output is incorrect.\n\nWait, the user mentioned that the output is incorrect, but didn\'t provide the actual output. So it\'s hard to tell without seeing the output.\n\nAlternatively, perhaps the code is correct, but the user is not using it correctly, leading to incorrect outputs.\n\nAlternatively, perhaps the code is correct, but the user is facing issues due to their environment, such as missing dependencies or library version mismatches.\n\nWait, perhaps the user is using an older version of PyTorch, and the topk and topp functions are not available, leading to errors.\n\nBut the code seems to be using the built-in functions, so as long as the PyTorch version is >=1.13, it should be fine.\n\nAlternatively, perhaps the user is using a version of the model that doesn\'t support top-k sampling, but I think LLaMA 70B should support it.\n\nAlternatively, perhaps the code is correct, but the user is not providing the correct number of samples, leading to incorrect outputs.\n\nAlternatively, perhaps the code is correct, but the user is not handling the temperature correctly, leading to less diverse samples than expected.\n\nAlternatively, perhaps the code is correct, but the user is not using the correct prompts, leading to incorrect outputs.\n\nAnother 
thought: perhaps the code is correct, but the user is not properly handling the tokenization step, such as not removing the [unused] tokens.\n\nAlternatively, perhaps the code is correct, but the user is not handling the probability distribution correctly during sampling.\n\nAlternatively, perhaps the code is correct, but the user is not using it in a loop, leading to incorrect outputs.\n\nAlternatively, perhaps the issue is that the code is not handling the model\'s tokenizer correctly, causing the tokens to be improperly generated.\n\nWait, in the code, the tokenizer is used as tokenizer.tokenize(prompt), but perhaps the prompt needs to be tokenized differently.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s tokenizer correctly, perhaps not using the correct padding or truncation.\n\nAlternatively, perhaps the code is correct, but the user is not using the model\'s tokenizer, leading to incorrect token sequences.\n\nAnother angle: perhaps the user is using a model that requires specific input formats, such as being a translation model, but the code is designed for a language model.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s output correctly, such as not using the correct output spaces or not removing the special tokens.\n\nAlternatively, perhaps the code is correct, but the user is not properly handling the probability distribution during sampling, leading to incorrect token selection.\n\nAlternatively, perhaps the issue is that the code is not handling the temperature correctly, leading to incorrect sampling.\n\nWait, the code uses temperature=0.7, but if the temperature is set too high or too low, it might affect the sampling.\n\nBut the user didn\'t mention the temperature, so perhaps that\'s not the issue.\n\nAlternatively, perhaps the code is correct, but the user is not setting the correct random seed, leading to non-reproducible outputs.\n\nAlternatively, perhaps the 
code is correct, but the user is not handling the tokens correctly after sampling, such as not converting them back to strings or not removing the [unused] tokens.\n\nAlternatively, perhaps the code is correct, but the user is not using the tokenized tokens correctly, leading to incorrect model inputs.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s maximum sequence length, leading to truncated outputs.\n\nWait, in the code, when generating samples, it\'s possible that the model truncates the input sequences, leading to incorrect outputs.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s truncation correctly, leading to improper sampling.\n\nAnother thought: perhaps the code is correct, but the user is not using the correct device, leading to slower performance or incorrect computations.\n\nAlternatively, perhaps the issue is that the user is not using the correct model, such as using a smaller model instead of 70B.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s dependencies correctly, such as not installing all the required datasets.\n\nAlternatively, perhaps the issue is that the user is running the code on a CPU, which is too slow, leading to incorrect outputs due to timeouts.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s hyperparameters correctly, such as not setting the correct batch size or other parameters.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the probabilities correctly during the sampling process, leading to incorrect token selections.\n\nAnother angle: perhaps the code is correct, but the user is not using the correct sampling method, such as using top-p instead of top-k or vice versa, leading to incorrect outputs.\n\nAlternatively, perhaps the code is correct, but the user is not handling the temperature correctly, leading to less 
diverse or less coherent samples.\n\nAlternatively, perhaps the code is correct, but the user is not handling the tokens correctly, such as not using the correct tokenization scheme or not removing the [unused] tokens.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s output correctly, such as not using the correct output space or not decoding the tokens properly.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAnother thought: perhaps the user is not providing the correct device, and the model is running on a GPU, but the code is not properly handling the device, leading to incorrect computations.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s gradients correctly, leading to incorrect outputs.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not using it in the intended way, such as not providing the correct prompts or not using the function as intended.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s maximum sequence length, leading to truncated outputs.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s tokenizer correctly, perhaps not using the correct padding or truncation.\n\nAnother angle: perhaps the user is not using the correct model name, leading to the wrong model being loaded, which in turn causes incorrect outputs.\n\nWait, in the code, the model is loaded as AutoModelForCausalLM.from_pretrained("microsoft/llama2-70b"). But perhaps the correct model name is different.\n\nWait, I should check what the correct model name is for LLaMA 70B. 
I recall that on Hugging Face, it\'s listed as "microsoft/llama2-70b". So the code seems correct.\n\nAlternatively, perhaps the user is using a different version of the model, such as "microsoft/llama2-7b", leading to incorrect behavior.\n\nAlternatively, perhaps the user is using a model that is not LLaMA, but another model with a similar name, leading to confusion.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s tokenizer correctly, causing the tokenization to fail.\n\nAlternatively, perhaps the code is correct, but the user is not handling the tokenization correctly, such as not removing the [unused] tokens, leading to incorrect outputs.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAnother thought: perhaps the user is not using the correct temperature, leading to incorrect sampling. For example, if temperature is set to 0, it might sample the first token every time, leading to repetitive outputs.\n\nBut the code sets temperature=0.7, which is reasonable.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s tokenizer correctly, perhaps not using the correct padding or truncation.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the issue is that the code is correct, but 
the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nAlternatively, perhaps the issue is that the code is correct, but the user is not handling the model\'s tokenizer correctly, leading to incorrect token sequences.\n\nWait, perhaps I\'m going in circles here. Let me try to summarize.\n\nThe user is running the code and getting a warning about FlashInfer not being available, which is fine, as the code is falling back to PyTorch\'s implementation. But the output is incorrect. The code seems to correctly load the tokenizer and model. The function to perform the top-k sampling is implemented, but perhaps there\'s an error in that function.\n\nLooking at the code, perhaps the issue is in how the tokens are being handled after sampling. 
For example, the function returns token_ids, but perhaps it needs to be decoded into text.\n\nAlternatively, perhaps the function is not handling the probabilities correctly, leading to incorrect token selection.\n\nAnother possibility is that the function is not properly handling the case when k is 0 or when p is 0, but the code seems to handle that.\n\nAlternatively, perhaps the function is not properly handling the case when the prompt is empty or when the tokenizer returns an empty list of tokens.\n\nAlternatively, perhaps the function is not properly handling the case when the model\'s output is not compatible with the tokens generated.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s maximum sequence length, leading to truncated outputs.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s 
tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nAlternatively, perhaps the function is not properly handling the model\'s tokenizer, leading to incorrect token sequences.\n\nWait, perhaps I\'m overcomplicating this. Let me think of possible steps the user can take to debug this.\n\n1. Check if the model is loaded correctly. Perhaps the model name is incorrect or the tokenizer is not compatible.\n\n2. Verify that the sampling function is implemented correctly, especially the part where tokens are sampled.\n\n3. Check if the model\'s tokenizer is being used correctly, including any required parameters like max_length or padding.\n\n4. Ensure that the model is in the correct mode (eval or training) and that it\'s properly initialized.\n\n5. Check if the sampling function is returning the correct token sequences, perhaps by printing the output or using a debugger.\n\n6. Verify that the sampling function is handling edge cases, such as when the prompt is empty or when the tokenizer returns an empty list.\n\n7. Check if the model\'s device is correctly set, especially if using GPU.\n\n8. Ensure that the model\'s dependencies are correctly installed and that there are no missing libraries.\n\n9. Verify that the code is correctly handling the temperature and other parameters.\n\n10. Check if the code is correctly decoding the tokens back into text, if necessary.\n\nAlternatively, perhaps the code is correct, and the user is encountering a different issue, such as a hardware problem or incorrect usage of the function.\n\nIn conclusion, the code seems to be correctly written, but the user is experiencing issues with the output. 
Possible causes could be related to model loading, tokenizer handling, or sampling function implementation. The user should check each of these areas to identify and resolve the problem.\n'

In [None]:
prompt = "<｜begin▁of▁sentence｜><｜User｜>John packs M&M in small bags to eat.  He buys 3 large bags weighing 10 ounces each.  If an ounce of M&M has 30 M&M in it how many small bags can he make if he puts 10 in each small bag? There is a trap in the question, you won't figure out the correct answer at once.<｜Assistant｜><think>\nOkay, so John is packing M&M's into small bags. Hmm, let me try to figure out the problem step by step. The problem says he buys 3 large bags, each weighing 10 ounces. Each ounce of M&M's has 30 M&M's. He wants to put 10 M&M's in each small bag. I need to find out how many small bags he can make. \n\nFirst, let's figure out how many M&M's are in each large bag. The large bags each weigh 10 ounces. Since each ounce has 30 M&M's, then 10 ounces should have 10 times 30 M&M's. Let me write that down: 10 ounces * 30 M&M's per ounce. That equals 300 M&M's in each large bag. \n\nWait, but he has 3 large bags. So, the total number of M&M's he has is 3 large bags * 300 M&M's per bag. Let me calculate that: 3 * 300 is 900. So, he has 900 M&M's in total.\n\nNow, he wants to put 10 M&M's in each small bag. To find out how many small bags he can make, I need to divide the total number of M&M's by the number he puts in each small bag. So, 900 M&M's divided by 10 M&M's per small bag. That equals 90. So, he can make 90 small bags.\n\nWait, hold on a second. The problem mentions there's a trap in the question, so I must have made a mistake. Let me go through it again.\n\nHe buys 3 large bags weighing 10 ounces each. So, each large bag is 10 ounces. Each ounce has 30 M&M's. So, in each large bag, the weight is 10 ounces, meaning 10 * 30 = 300 M&M's. 3 large bags would be 3 * 300 = 900 M&M's. Then, each small bag holds 10 M&M's, so 900 / 10 = 90 small bags.\n\nHmm, that seems correct. But the problem says there's a trap. Maybe I misunderstood the ounces part. Let me think again.\n\nWait, maybe the weight is per bag, not per ounce. 
Wait, no, the problem says \"an ounce of M&M has 30 M&M.\" So, the number of M&M's per ounce is 30. So, 10 ounces would have 10 * 30 = 300.\n\nSo, 3 bags would be 3 * 300 = 900, as I thought. So, 900 M&M's divided by 10 per small bag is 90.\n\nIs the trap something else? Maybe the small bags themselves have a weight? Or is it about the number of M&M's per small bag? Let me reread the question.\n\n\"He buys 3 large bags weighing 10 ounces each. If an ounce of M&M has 30 M&M in it how many small bags can he make if he puts 10 in each small bag?\" \n\nWait, maybe it's a trick with the ounces. Maybe the large bags are 10 ounces each, but when he puts them into small bags, does the small bag's weight matter? It doesn't specify the weight of the small bags, so maybe it's not about that.\n\nAlternatively, maybe the confusion is between ounces and M&M's. Let me consider that. If each ounce has 30 M&M's, then each M&M is 1/30 of an ounce. So, 10 M&M's would be 10*(1/30) = 1/3 of an ounce. So, each small bag is 1/3 of an ounce.\n\n"
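
As a sanity check on the reasoning trace in the prompt above, the arithmetic it walks through can be reproduced directly. This is only a sketch: the variable names are hypothetical, and the numbers are the ones stated in the question.

```python
from fractions import Fraction

# Quantities as stated in the question text.
large_bags = 3
ounces_per_large_bag = 10
pieces_per_ounce = 30
pieces_per_small_bag = 10

# Total pieces across all large bags, then the number of small bags.
total_pieces = large_bags * ounces_per_large_bag * pieces_per_ounce
small_bags = total_pieces // pieces_per_small_bag

# Weight of one small bag, as derived at the end of the trace.
ounces_per_small_bag = Fraction(pieces_per_small_bag, pieces_per_ounce)

print(total_pieces, small_bags, ounces_per_small_bag)
```

The trace reaches the same total and the same bag count on its first pass; the later paragraphs revisit the result without changing it.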

In [None]:
overthinking = "Wait, maybe the weight is per bag, not per ounce. Wait, no, the problem says \"an ounce of M&M has 30 M&M.\" So, the number of M&M's per ounce is 30. So, 10 ounces would have 10 * 30 = 300.\n\nSo, 3 bags would be 3 * 300 = 900, as I thought. So, 900 M&M's divided by 10 per small bag is 90.\n\nIs the trap something else? Maybe the small bags themselves have a weight? Or is it about the number of M&M's per small bag? Let me reread the question.\n\n\"He buys 3 large bags weighing 10 ounces each. If an ounce of M&M has 30 M&M in it how many small bags can he make if he puts 10 in each small bag?\" \n\nWait, maybe it's a trick with the ounces. Maybe the large bags are 10 ounces each, but when he puts them into small bags, does the small bag's weight matter? It doesn't specify the weight of the small bags, so maybe it's not about that.\n\nAlternatively, maybe the confusion is between ounces and M&M's. Let me consider that. If each ounce has 30 M&M's, then each M&M is 1/30 of an ounce. So, 10 M&M's would be 10*(1/30) = 1/3 of an ounce. So, each small bag is 1/3 of an ounce.\n\n"

rewrite = "But, maybe the weight is per bag, not per ounce. Wait, no, the problem says \"an ounce of M&M has 30 M&M.\" So, the number of M&M's per ounce is 30. So, 10 ounces would have 10 * 30 = 300.\n\nTherefore, 3 bags would be 3 * 300 = 900, as I thought. So, 900 M&M's divided by 10 per small bag is 90.\n\nIs the trap something else? Maybe the small bags themselves have a weight? Or is it about the number of M&M's per small bag? Let me go back and check my steps.\n\n\"He buys 3 large bags weighing 10 ounces each. If an ounce of M&M has 30 M&M in it how many small bags can he make if he puts 10 in each small bag?\" \n\nBut, maybe it's a trick with the ounces. Maybe the large bags are 10 ounces each, but when he puts them into small bags, does the small bag's weight matter? It doesn't specify the weight of the small bags, so maybe it's better to think in terms of unsigned addition.\n\nOr, maybe the confusion is between ounces and M&M's. Let me try once again. If each ounce has 30 M&M's, then each M&M is 1/30 of an ounce. So, 10 M&M's would be 10*(1/30) = 1/3 of an ounce. So, each small bag is 1/3 of an ounce.\n\n"

prompt = overthinking + rewrite
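
Because `overthinking` and `rewrite` differ only in a handful of phrases, a word-level diff can make the substitutions easy to inspect. The sketch below uses the standard-library `difflib`; the helper name `changed_words` and the two shortened sample strings are hypothetical stand-ins for the full variables above.

```python
import difflib


def changed_words(before, after):
    """Return (old_phrase, new_phrase) pairs for every span that differs
    between two texts, compared word by word."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Shortened excerpts standing in for `overthinking` and `rewrite`.
before = "Wait, maybe the weight is per bag, not per ounce. So, 3 bags would be 900."
after = "But, maybe the weight is per bag, not per ounce. Therefore, 3 bags would be 900."

for old, new in changed_words(before, after):
    print(f"{old!r} -> {new!r}")
```

Calling `changed_words(overthinking, rewrite)` on the full strings would list each edited phrase in order.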

In [None]:
prompt = "Uriah needs to remove a total of 15 pounds from his book bag. The comic books weigh 1/4 pound each, and the toys weigh 1/2 pound each. He removes 30 comic books, and we need to find out how many toys he needs to remove.\n\nFirst, calculate the total weight of the comic books he removes. Since each comic book weighs 1/4 pound, 30 comic books would weigh:\n\n30 * 1/4 = 7.5 pounds\n\nNext, subtract this weight from the total weight that needs to be removed:\n\n15 pounds - 7.5 pounds = 7.5 pounds\n\nNow, determine how many toys correspond to 7.5 pounds. Each toy weighs 1/2 pound, so the number of toys he needs to remove is:\n\n7.5 / (1/2) = 15 pounds\n\nWait, perhaps the question is that he needs to remove a total of 15 pounds, and he removes 30 comic books. How many toys does he need to remove? Let's check the calculations again.\n\n30 comic books: 30 * 1/4 = 7.5 pounds.\n\nTotal to remove: 15 pounds.\n\nToys needed: 15 - 7.5 = 7.5 pounds.\n\nEach toy is 1/2 pound, so 7.5 / (1/2) = 7.5 * 2 = 15.\n\nWait, so 15 toys. But the user says the correct answer is 7. That's inconsistent. Maybe the initial problem was to remove 7.5 pounds, not 15? Or perhaps the total weight is 15, but he's removing 30 comic books which is 7.5 pounds, so toys to remove 7.5 pounds, which is 15 toys. But the correct answer is 7, so perhaps the weight of toys is different?\n\nWait, perhaps the weight of toys is 1/7 pound each? No, the user said toys are 1/2 pound each. Hmm.\n\nWait, maybe the total weight is not 15, but 8.5 pounds? No, the user says 15 pounds total.\n\nWait, maybe I misread the question. Let me read again:\n\n\"Uriah's book bag is getting too heavy for him. He needs to remove 15 pounds from it. His comic books weigh 1/4 pound each and his toys weigh 1/2 pound each. If he removes 30 comic books, how many toys does he need to remove? Why the correct answer is 7?\"\n\nWait, so he needs to remove 15 pounds. 
He removes 30 comic books, which are each 1/4 pound, so 30*(1/4) = 7.5 pounds. So he has removed 7.5 pounds, so he needs to remove 15 -7.5 =7.5 pounds more. Each toy is 1/2 pound, so 7.5 / (1/2) = 15 toys. But the correct answer is 7. So why is the answer 7? Maybe the total weight is 15, but each toy is 1 pound? Or perhaps the question is different.\n\nWait, maybe the question is that he needs to remove a total of 8.5 pounds? Because 7.5 +7.5 is 15, but 7.5 +7.5 is 15. So the answer is 15. But the user says the correct answer is 7. Maybe the user meant 8.5 pounds to remove? 15 is actually 30 comic books (7.5) and 15 toys (7.5). Hmm.\n\nWait, perhaps the initial problem was to remove 15 pounds, and he removes 30 comic books, so how many toys he needs to remove? If 30 comic books are 7.5, and he needs to remove 15, then he needs to remove 15 -7.5 =7.5 pounds from toys. Each toy is 0.5, so 7.5 /0.5=15. So 15 toys. But the correct answer is 7, so maybe the user meant 15 pounds in total, but he already removed 30 comic books which is 7.5, so he needs to remove 15-7.5=7.5 pounds in toys. 7.5 /0.5=15. So why is the correct answer 7?\n\nWait, maybe the question is that he needs to remove 15 pounds, and he removes 30 comic books, which is 7.5, so he needs to remove 15 -7.5=7.5 pounds, which is 7.5/(0.5)=15 toys. But why is the correct answer 7? Wait, maybe 7.5 divided by 1 pound? But the toys are 0.5 pounds each.\n\nWait, perhaps there's a mistake in the original problem. Or perhaps the user is misstating the weights. Alternatively, maybe the user is confusing pounds with ounces or something else. Or perhaps the weight of toys is 1/7 pound each.\n\nWait, 7 toys would be 7*(1/2)=3.5 pounds. So if he needs to remove 15-7.5=7.5 pounds, but 7 toys only remove 3.5 pounds, that doesn't add up.\n\nAlternatively, maybe the total weight to remove is 15 pounds, which is split between comic books and toys. So 30 comic books =7.5 pounds. 
So the remaining 15-7.5=7.5 pounds must come from toys. So toys:7.5 /0.5=15. So 15 toys.\n\nBut the user says the correct answer is 7. So perhaps the user made a mistake in the question. Alternatively, maybe the total weight to remove is only 8.5 pounds? 30 comic books=7.5, so he needs to remove 8.5-7.5=1 pound, which would require 2 toys (1 pound). So 2 toys. Still not 7.\n\nAlternatively, maybe the total weight is 15 pounds, but the user wants to remove 15 pounds, and he removes 30 comic books, which is 7.5, so he needs to remove 15-7.5=7.5 pounds. 7.5 pounds is 15 toys. But if the user says the answer is 7, there must be a different way.\n\nWait, perhaps the user meant that he needs to remove 15 pounds, and he removes 30 comic books, which is 7.5, so the remaining 7.5 pounds needs to be removed by toys. So 7.5 /0.5=15 toys. So 15 toys. But the user says the correct answer is 7. So perhaps the user made a mistake.\n\nAlternatively, perhaps the user meant that the total weight is 15 pounds, and he removes 30 comic books, which is 7.5, so he needs to remove 15-7.5=7.5 pounds from toys. So 7.5 /0.5=15 toys. So why is the answer 7?\n\nWait, perhaps the user meant that he needs to remove 15 pounds, and he removes 30 comic books, which is 7.5, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15 toys. So 15 toys. But the correct answer is 7, so perhaps the user meant that the total weight is 15 pounds, and he removes 30 comic books (7.5), so he needs to remove 7.5 pounds, which is 15 toys. Still, 15 is correct.\n\nWait, maybe the total weight is 15 pounds, and he removes 30 comic books, which is 7.5, so he needs to remove 15-7.5=7.5 pounds, which is 15 toys. But the user says the correct answer is 7, so perhaps the user is wrong.\n\nAlternatively, maybe the user is considering that 30 comic books is 30*(1/4)=7.5, so he needs to remove 15-7.5=7.5 pounds, which is 7.5/(1/2)=15 toys. So 15 toys. 
But the user says 7, so maybe the user made a mistake.\n\nAlternatively, perhaps the user is considering that the total weight is 15 pounds, and he removes 30 comic books (7.5), so he needs to remove 15-7.5=7.5 pounds, which is 7.5 divided by (1/2), which is 15. But the user says 7, so perhaps the user is wrong.\n\nAlternatively, maybe the user is considering that 30 comic books is 7.5, and he needs to remove 15 pounds, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15. But the user says 7, so maybe the user is wrong.\n\nAlternatively, perhaps the user is considering that 30 comic books is 7.5, so he needs to remove 15-7.5=7.5, which is 7.5 divided by (1/2)=15, but the user says 7, so maybe the user made a mistake.\n\nAlternatively, maybe the user is considering that 30 comic books is 30*(1/4)=7.5, and he needs to remove 15 pounds, so 15-7.5=7.5 pounds, which is 7.5 / (1/2)=15 toys. So the correct answer is 15, but the user says 7, so the user is wrong.\n\nAlternatively, perhaps the user is considering that 30 comic books is 7.5 pounds, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15 toys. But the user says 7, so perhaps the user is wrong.\n\nAlternatively, maybe the user is considering that 30 comic books is 7.5, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15 toys. So the correct answer is 15, but the user says 7, so the user made a mistake.\n\nAlternatively, perhaps the user is considering that 30 comic books is 7.5, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15 toys. So the correct answer is 15, but the user says 7. So the user is wrong.\n\nAlternatively, perhaps the user is considering that 30 comic books is 7.5, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15 toys. So the correct answer is 15, but the user says 7. 
So perhaps the user made a mistake.\n\nAlternatively, perhaps the user is considering that the total weight is 15 pounds, and he removes 30 comic books (7.5), so he needs to remove 15-7.5=7.5 pounds, which is 15 toys. So the correct answer is 15, but the user says 7, so the user is wrong.\n\nAlternatively, maybe the user is considering that 30 comic books is 7.5, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15 toys. So the correct answer is 15, but the user says 7. So the user made a mistake.\n\nAlternatively, perhaps the user is considering that 30 comic books is 7.5, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15 toys. So the correct answer is 15, but the user says 7, so the user is wrong.\n\nAlternatively, perhaps the user is considering that 30 comic books is 7.5, so he needs to remove 15-7.5=7.5, which is 7.5 / (1/2)=15 toys. So the correct answer is 15, but the user says 7, so the user made a mistake.\n\n"

In [None]:
prompt = "Okay, so John is packing M&M's into small bags, right? He has 3 large bags, each weighing 10 ounces. Hmm, and each ounce of M&M has 30 M&M in it. He wants to put 10 M&M in each small bag. The question is, how many small bags can he make? And the answer is supposed to be 63. Hmm, okay, let's break this down step by step.\n\nFirst off, let me get this straight. He has 3 large bags, each 10 ounces. So the total weight of all the M&M's he has is 3 times 10 ounces. Let me calculate that. 3 times 10 is 30 ounces. So, a total of 30 ounces of M&M's. Got that part.\n\nNow, each ounce of M&M has 30 M&M in it. So, if he has 30 ounces, how many M&M does that add up to? That would be the number of ounces multiplied by the number of M&M per ounce. So that’s 30 times 30. Hmm, 30 times 30 is 900. So, a total of 900 M&M's.\n\nJohn is putting 10 M&M in each small bag. So, to find out how many small bags he can make, I need to divide the total number of M&M's by the number he puts in each bag. That makes sense. So, 900 divided by 10. Well, 900 divided by 10 is 90. So, that suggests he can make 90 small bags. Wait, but the answer is supposed to be 63. Hmm, so I must be missing something here. Maybe I misunderstood the problem?\n\nLet me read the problem again: \"He buys 3 large bags weighing 10 ounces each. If an ounce of M&M has 30 M&M in it how many small bags can he make if he puts 10 in each small bag?\" So, he buys 3 large bags, each 10 ounces. Wait a second, so he has 3 large bags, each weighing 10 ounces. So, is the total weight 30 ounces, or is each large bag 10 ounces, so he has 3 of them? Wait, maybe the wording is confusing.\n\nWait, possibly, is he buying each large bag as 10 ounces, so he has 3 large bags, each weighing 10 ounces? So, 3 bags times 10 ounces each is 30 ounces. So, the total number of M&M is 30 ounces times 30 M&M per ounce, so that's definitely 900.\n\nSo, why is the answer 63? Because 30 divided by somewhere. 
Wait, maybe I read the question wrong. Let me check: does he have 3 large bags, each weighing 10 ounces? So, 3 times 10 is 30 ounces. Then, an ounce of M&M has 30 M&M, so 30 ounces times 30 M&M is 900. Divided by 10 M&M per bag, so 900 divided by 10 is 90. But answer is 63.\n\nWait, perhaps I misread the problem. It says \"He buys 3 large bags weighing 10 ounces each.\" Maybe that's the total weight of M&M, each large bag is 10 ounces, so 30 ounces total. But each ounce of M&M is 30 M&M, so 30 ounces is 900 M&M. Then, if he puts 10 per small bag, 900 divided by 10 is 90. So, 90 small bags. But the answer is supposed to be 63, so there must be another factor here.\n\nAlternatively, perhaps each large bag is 10 ounces of M&M, but he can't use all of them. Maybe each bag can only hold a certain number?\n\nWait, maybe it's a constraint on how many M&M can per large bag. For example, if a large bag can only fit so many ounces or M&M, but the problem doesn't specify. Hmm.\n\nWait, maybe he only has 3 large bags, so each large bag is 10 ounces, which he can reshape into some number of small bags.\n\nAlternatively, if each M&M is 1/30 of an ounce, then 3 large bags, 10 ounces each. Hmm, maybe he is limited by how many small bags he can make without exceeding the total weight he can carry.\n\nWait, maybe I need to think of the problem differently. Perhaps each small bag is 10 ounces? But the problem says he puts 10 in each small bag, not that the small bag weighs 10 ounces.\n\nWait, the problem is a bit ambiguous. Let me read it again.\n\nHe buys 3 large bags weighing 10 ounces each. So, each of those large bags is 10 ounces. Now, \"if an ounce of M&M has 30 M&M in it.\" So, an ounce has 30 M&M. So, then how many small bags can he make if he puts 10 in each small bag.\n\nWait, so perhaps 1 ounce = 30 M&M, so each small bag which has 10 M&M would be 10/30 ounces. So, each small bag is 10/30 ounces. So, 1 small bag is 1/3 of an ounce. Then, each large bag is 10 ounces. 
So, he can make 10 / (1/3) = 30 small bags from each large bag. So, he has 3 large bags, so 3 times 30 is 90 small bags. Again, that gets us 90.\n\nAlternatively, maybe it's the other way around. If an M&M per ounce is 30 M&M per ounce, so an M&M is 1/30 of an ounce in weight. So, putting 10 M&M in a small bag is 10*(1/30) = 1/3 of an ounce per small bag. Therefore, each large 10-ounce bag can make 10 divided by 1/3 = 30 small bags. Therefore, 3 large bags make 90 small bags. So 90 is the answer.\n\nAlternatively, maybe it's 10 M&M per small bag, so each small bag is 10 M&M, which is 10/30 = 1/3 of an ounce. Therefore, each 10-ounce large bag can make (10)/(1/3) = 30 small bags. Therefore, 3 large bags make 90 in total.\n\nAlternatively, perhaps it's 10 M&M per small bag, but M&M's are 1/30 of an ounce each, so each small bag is 1/3 of an ounce. Therefore, each 10-ounce large bag can make (10)/(1/3) = 30 small bags. Therefore, 3 large bags make 90 in total.\n\nAlternatively, maybe it's 10 M&M per small bag, but M&M's are 1/30 of an ounce, so each small bag is 1/3 of an ounce. Therefore, each 10-ounce large bag can make (10)/(1/3) = 30 small bags. Therefore, 3 large bags make 90 in total.\n\nAlternatively, maybe it's 10 M&M per small bag, but M&M's are 1/30 of an ounce, so each small bag is 1/3 of an ounce. Therefore, each 10-ounce large bag can make (10)/(1/3) = 30 small bags. Therefore, 3 large bags make 90 in total.\n\nAlternatively, maybe it's 10 M&M per small bag, but M&M's are 1/30 of an ounce, so each small bag is 1/3 of an ounce. Therefore, each 10-ounce large bag can make (10)/(1/3) = 30 small bags. Therefore, 3 large bags make 90 in total.\n\n"

In [None]:
prompt = tokenizer.apply_chat_template([{"role":"user","content":prompt}], tokenize=False, add_generation_prompt=True)

- Generate

In [None]:
outputs: list[RequestOutput] = llm.generate(
    prompt,
    sampling_params
)

completion = outputs[0].outputs[0]
print(prompt + completion.text)

- Log

In [None]:
id = 2

import re

KV_PATTERN = re.compile(r"(\S+?)=([^\s]+)")

def parse_kv_section(line: str) -> dict:
    return dict(KV_PATTERN.findall(line))

def extract_tokens_with_kv_filters(
    log_file: str,
    *,
    num_min=None,
    num_max=None,
    filters: dict[str, str | int] = None,
) -> list[int]:
    results = []
    filters = filters or {}

    with open(log_file, "r", encoding="utf-8") as f:
        for line in f:
            kv = dict(KV_PATTERN.findall(line))
            if "num" not in kv or "token_id" not in kv:
                continue

            num = int(kv["num"])
            if num_min is not None and num < num_min:
                continue
            if num_max is not None and num > num_max:
                continue

            # Match the dynamic filter fields
            ok = True
            for k, v in filters.items():
                if k not in kv:
                    ok = False
                    break
                if str(kv[k]) != str(v):
                    ok = False
                    break

            if not ok:
                continue

            results.append((num, int(kv["token_id"])))

    results.sort(key=lambda x: x[0])
    token_ids = [t[1] for t in results]
    return token_ids

additional_token_ids = extract_tokens_with_kv_filters("/root/project/dos/log/QwQ-32B.log", filters={"dataset":"gsm8k", "id":id, "stage":"entropy_descent_sampling"})
additional_prompt = tokenizer.decode(additional_token_ids)
# prompt += additional_prompt
print(additional_prompt)
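
The key=value parsing and filter matching used above can be checked on a single line before running it over a whole log. This is a minimal sketch, assuming a log line shaped like `sample_line` below; that line and the `matches` helper are hypothetical, since the real log format is not shown in this notebook.

```python
import re

KV_PATTERN = re.compile(r"(\S+?)=([^\s]+)")

# Hypothetical log line; the field names mirror the filters used above.
sample_line = "dataset=gsm8k id=2 stage=entropy_descent_sampling num=7 token_id=1234"

# findall returns (key, value) pairs; every value is a string at this point.
kv = dict(KV_PATTERN.findall(sample_line))
print(kv)

def matches(kv: dict, filters: dict) -> bool:
    """True if every filter key is present and equal after str() conversion."""
    return all(k in kv and str(kv[k]) == str(v) for k, v in filters.items())

print(matches(kv, {"dataset": "gsm8k", "id": 2}))
print(matches(kv, {"stage": "something_else"}))
```

Comparing through `str()` is what lets an integer filter such as `id=2` match the string `"2"` read from the log.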

- Components

In [None]:
def loop_check(text: str, n: int) -> bool:
    """
    Check whether the output has fallen into a loop.

    :param text: output text
    :type text: str
    :param n: multiplier for the single-character repetition threshold (10 * n)
    :type n: int
    :return: True if a multi-line or one-line loop is detected
    :rtype: bool
    """

    # multi-line check
    multi_line_loop = False
    steps = text.split("\n\n")[:-1]
    if steps:  # text without any "\n\n" has no complete step to test
        test_step = steps[-1]
        rev = steps[-2::-1]
        if test_step in rev:
            # Walk backwards to the nearest earlier step identical to test_step
            idx_rev = rev.index(test_step)
            # Convert to a forward index
            same_idx = len(rev) - 1 - idx_rev # index of the repeated step
            test_idx = len(steps) - 1 # index of the last step
            distance = test_idx - same_idx
            if same_idx > distance: # backtracking over the repeated pattern stays in bounds
                if distance > 1: # check that the whole pattern matches
                    if all(steps[index] == steps[index - distance] for index in range(same_idx, test_idx)):
                        multi_line_loop = True
                else: # adjacent repeats may not be a loop; require at least 10 repetitions
                    if test_idx >= 10 and all(steps[index] == steps[test_idx] for index in range(test_idx - 10, test_idx)):
                        multi_line_loop = True

    # one-line check
    one_line_loop = False
    last_step = text.split("\n\n")[-1]
    if len(last_step) > 100:
        test_str = last_step[-1]
        rev = last_step[-2::-1]
        if test_str in rev:
            idx_rev = rev.index(test_str)
            same_idx = len(rev) - 1 - idx_rev # index of the repeated character
            test_idx = len(last_step) - 1 # index of the last character
            distance = test_idx - same_idx
            if same_idx > distance: # backtracking over the repeated pattern stays in bounds
                if distance > 1: # at least 10 repetitions of the pattern
                    if test_idx >= 10 * distance and all(last_step[index] == last_step[index - distance] for index in range(same_idx, test_idx)) and all(last_step[index] == last_step[test_idx] for index in range(test_idx - 10*distance, test_idx, distance)):
                        one_line_loop = True
                else: # at least 10 * n repetitions of the same character
                    if test_idx >= 10 * n and all(last_step[index] == last_step[test_idx] for index in range(test_idx - 10 * n, test_idx)):
                        one_line_loop = True

    return multi_line_loop or one_line_loop

# loop_check(prompt, 5)

In [None]:
def over_reflect_trim(text: str, option: str | None, answer: str) -> bool | tuple[str, str]:
    text = text.split("<think>\n")[1]
    steps = text.split("\n\n")
    think = False
    reflect = False

    # think identify
    if option is None:
        for i, step in enumerate(steps):
            if "</think>" in step:
                return False
            elif reflect and (f" {answer} " in step or f" {answer}." in step or f" {answer}," in step or f" {answer}\n" in step) and i > reflect_done_idx + 1:
                return "\n\n".join(steps[:i+1]), steps[i+1]
            elif think and (f" {answer} " in step or f" {answer}." in step or f" {answer}," in step or f" {answer}\n" in step) and i > think_done_idx + 1:
                reflect = True # reflect done
                reflect_done_idx = i
            elif (f" {answer} " in step or f" {answer}." in step or f" {answer}," in step or f" {answer}\n" in step) and i > 1:
                think = True # think done
                think_done_idx = i

    else:
        for i, step in enumerate(steps):
            if "</think>" in step:
                return False
            if (f" {answer} " in step or f" {option}. " in step or f" {option} " in step) and i > 1:
                think = True # think done
                think_done_idx = i
            if think and (f" {answer} " in step or f" {option}. " in step or f" {option} " in step) and i > think_done_idx + 1:
                reflect = True # reflect done
                reflect_done_idx = i
            if reflect and (f" {answer} " in step or f" {option}. " in step or f" {option} " in step) and i > reflect_done_idx + 1:
                return "\n\n".join(steps[:i+1]), steps[i+1]

    return False

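def thinking_loop_trim(text: str, token: str) -> str:
    steps = text.split("\n\n")
    all_steps = steps[:-1]
    test_step = all_steps[-1]
    rev = all_steps[::-1]
    idx_1 = 0  # fall back to the first step if no earlier step starts with `token`
    # Walk backwards to the nearest step that differs from the last one and starts with `token`
    for idx, step in enumerate(rev):
        if step != test_step and step.startswith(token):
            idx_1 = len(rev) - 1 - idx
            break
    if steps[-1] == "":
        # The text ends on a complete step: cut at the first occurrence of the repeated step
        idx_2 = len(all_steps) - 1
        for idx, step in enumerate(all_steps):
            if step == test_step:
                idx_2 = idx
                break

        return "\n\n".join(all_steps[idx_1:idx_2+1])
    else:
        return "\n\n".join(steps[idx_1:])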
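
`over_reflect_trim` decides that a step "mentions the answer" by testing several literal variants (`" {answer} "`, `" {answer}."`, and so on). A regex with boundary checks is one possible alternative; the sketch below is an assumption about how that could look, with hypothetical helper names, and it is deliberately not identical to the checks above (for example, it also accepts an answer at the very end of a step and rejects `7` inside `7.5`).

```python
import re

def mentions_answer(step: str, answer: str) -> bool:
    """True if `answer` appears in `step` as a standalone number or word."""
    # Not preceded by a word character or '.', not followed by a word
    # character or by a decimal part such as ".5".
    pattern = rf"(?<![\w.]){re.escape(answer)}(?!\w|\.\d)"
    return re.search(pattern, step) is not None

def mention_indices(steps: list[str], answer: str) -> list[int]:
    """Indices of the steps that mention the answer."""
    return [i for i, step in enumerate(steps) if mentions_answer(step, answer)]

steps = [
    "30 comic books weigh 7.5 pounds.",
    "So he needs 15 toys.",
    "Wait, 150 is not right.",
    "Yes, the answer is 15",
]
print(mention_indices(steps, "15"))
```

The first, second, and third mentions found this way would correspond to the "think done", "reflect done", and trim points tracked by the flags in the function above.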

- DeepSeek API

In [None]:
from openai import OpenAI
import json

with open('../config.json', 'r') as f:
    config = json.load(f)
api_key = config["api_key"]
reasoning_model = config["models"][5]
chat_model = config["models"][6]

client = OpenAI(api_key=api_key, base_url="https://api.deepseek.com/beta")

with open("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/gsm8k_0_entropy_descent.txt", "r") as f:
    text = f.read()

content = text.split("<｜begin▁of▁sentence｜><｜User｜>")[1].split("<｜Assistant｜><think>")[0]
output = text.split("<｜Assistant｜><think>")[1].split("\n\n")[0] + "\n\n"

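# Named separately so it does not overwrite the vLLM SamplingParams object used by llm.generate
api_sampling_params = {
    "temperature": 1,
    "max_tokens": 10,
    "logprobs": 20
}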

messages = [{"role":"user", "content":content}]

over_reflect_prompt = [
    {"role":"user", "content":content},
    # {"role":"assistant", "content":"<think>\n"+output, "reasoning_content":"", "prefix":True}
    {"role":"assistant", "content":"", "reasoning_content":output, "prefix":True}
]

response = client.chat.completions.create(
    model=reasoning_model,
    messages=over_reflect_prompt,
    max_tokens=10
)

# candidate_token_logprobs = [{top_logprob.token:top_logprob.logprob} for top_logprob in response.choices[0].logprobs.content[0].top_logprobs[:5]]

# print(response.choices[0].message.content)
print(response.choices[0].message.reasoning_content)

In [None]:
import re
from typing import Iterator

# Identifies the start of a log record (timestamp followed by a pipe)
TS_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \|")

def iter_log_records(path: str) -> Iterator[str]:
    """
    Iterate over logical log records rather than physical lines.
    A record that spans several lines is merged into one string.
    """
    current = None

    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if TS_PATTERN.match(line):
                # Start a new record
                if current is not None:
                    yield current
                current = line.rstrip("\n")
            else:
                # Continuation line: append it to the current record
                if current is None:
                    # Orphan line before the first record; skip it
                    continue
                current += "\n" + line.rstrip("\n")

        # End of file: emit the last record
        if current is not None:
            yield current

from typing import Optional, Dict, Tuple

# Parses ordinary key=value pairs (excluding the token segment)
KV_PATTERN = re.compile(r"(\S+?)=([^\s]+)")

# Captures the token string between token=... and prob/entropy=...
TOKEN_PATTERN = re.compile(
    r"token=(?P<token>.*?)(?= prob/entropy=|$)",
    re.DOTALL,
)

def parse_log_line(line: str) -> Tuple[Dict[str, str], Optional[str]]:
    """
    Take one logical log record (multi-line already merged) and return (kv_dict, token_str).
    - kv_dict: every key=value except token (dataset/id/stage/state/num/prob/entropy, etc.)
    - token_str: the raw string between token= and prob/entropy= (may contain spaces or newlines)
    """
    token_str = None

    m_tok = TOKEN_PATTERN.search(line)
    if m_tok:
        token_str = m_tok.group("token")
        start, end = m_tok.span()
        cleaned = line[:start] + line[end:]   # drop the token=... segment, then parse the remaining key=value pairs
    else:
        cleaned = line

    kv = dict(KV_PATTERN.findall(cleaned))
    return kv, token_str

from typing import List, Tuple

def extract_token_strings_with_filters(
    log_file: str,
    *,
    num_min: Optional[int] = None,
    num_max: Optional[int] = None,
    dataset: Optional[str] = None,
    id_: Optional[int] = None,
    stage: Optional[str] = None,
    state: Optional[str] = None,
    strip_token: bool = False,
) -> List[Tuple[int, str]]:
    """
    Extract token strings from the log (multi-line tokens supported) and filter them.
    Returns [(num, token_str), ...] in the order they appear in the log.
    """
    results: List[Tuple[int, str]] = []

    for record in iter_log_records(log_file):
        kv, token = parse_log_line(record)
        if token is None:
            continue

        # Parse num
        if "num" not in kv:
            continue
        try:
            num = int(kv["num"])
        except ValueError:
            continue

        # Filter by num range
        if num_min is not None and num < num_min:
            continue
        if num_max is not None and num > num_max:
            continue

        # Filter by the remaining fields
        if dataset is not None and kv.get("dataset") != dataset:
            continue

        if id_ is not None:
            try:
                if int(kv.get("id", -1)) != id_:
                    continue
            except ValueError:
                continue

        if stage is not None and kv.get("stage") != stage:
            continue

        if state is not None and kv.get("state") != state:
            continue

        token_use = token.strip() if strip_token else token
        results.append((num, token_use))

    return results

pairs = extract_token_strings_with_filters(
    "/root/project/dos/log/DeepSeek.log",
    dataset="gsm8k",
    id_=18,
    stage="entropy_descent_sampling",
    strip_token=False,   # keep False to preserve each token's original leading/trailing whitespace
)

# Sort by num to recover the full token sequence
pairs_sorted = sorted(pairs, key=lambda x: x[0])
tokens = [t for _, t in pairs_sorted]

# If these tokens are the text pieces emitted by the tokenizer, they can be concatenated directly:
generated_text = "".join(tokens)
print(generated_text)
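
The reason records are merged before parsing is that a token can itself contain newlines, which splits one log entry over several physical lines. The sketch below replays that logic on an in-memory sample; the sample log content is hypothetical, and `group_records` is a stand-in for `iter_log_records` that reads from any iterable of lines instead of a file path.

```python
import io
import re
from typing import Iterable, Iterator

TS_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \|")
TOKEN_PATTERN = re.compile(r"token=(?P<token>.*?)(?= prob/entropy=|$)", re.DOTALL)

def group_records(lines: Iterable[str]) -> Iterator[str]:
    """Merge physical lines into logical records that start with a timestamp."""
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if TS_PATTERN.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None:
            # Continuation of the current record (orphan lines are skipped)
            current += "\n" + line
    if current is not None:
        yield current

# Hypothetical log: the second record logs a token made of two newlines.
sample = io.StringIO(
    "orphan line\n"
    "2025-01-01 10:00:00,123 | num=1 token=Hello prob/entropy=0.5\n"
    "2025-01-01 10:00:01,456 | num=2 token=\n"
    "\n"
    " prob/entropy=1.25\n"
)

records = list(group_records(sample))
tokens = [TOKEN_PATTERN.search(r).group("token") for r in records]
print(len(records), [repr(t) for t in tokens])
```

Parsing line by line would lose the second token here, because `token=` and `prob/entropy=` end up on different physical lines.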

In [None]:
import re
from collections import defaultdict
import matplotlib.pyplot as plt

log_file = "/root/project/dos/log/DeepSeek.log"  # change to your log path

# Parses: dataset=... id=... num=... prob/entropy=...
pattern = re.compile(
    r"dataset=(?P<dataset>\S+)\s+id=(?P<id>\d+).*?num=(?P<num>\d+).*?prob/entropy=(?P<pe>[0-9.eE+-]+)"
)

# {(dataset, id): [(num, pe), ...]}
series = defaultdict(list)

with open(log_file, "r", encoding="utf-8") as f:
    for line in f:
        m = pattern.search(line)
        if not m:
            continue
        key = (m.group("dataset"), int(m.group("id")))
        num = int(m.group("num"))
        pe = float(m.group("pe"))
        series[key].append((num, pe))

# For each curve: sort by num; if a num appears more than once, keep the last occurrence
def sort_and_dedup(points):
    points = sorted(points, key=lambda x: x[0])
    last = {}
    for n, v in points:
        last[n] = v
    nums_sorted = sorted(last.keys())
    vals_sorted = [last[n] for n in nums_sorted]
    return nums_sorted, vals_sorted

# ====== Plot (linear axes) ======
plt.figure(figsize=(10, 6))

for (dataset, sid), points in sorted(series.items(), key=lambda kv: (kv[0][0], kv[0][1])):
    xs, ys = sort_and_dedup(points)
    plt.plot(xs, ys, marker="o", linewidth=1.5, label=f"{dataset}, id={sid}")

plt.xlabel("num")
plt.ylabel("prob / entropy")
plt.title("prob/entropy vs num (grouped by dataset, id)")
plt.grid(True)
plt.legend(loc="best", fontsize=9)
plt.tight_layout()
plt.show()

# ====== Optional: a second plot with a log y-axis (better suited to values that saturate around 1e10) ======
plt.figure(figsize=(10, 6))

for (dataset, sid), points in sorted(series.items(), key=lambda kv: (kv[0][0], kv[0][1])):
    xs, ys = sort_and_dedup(points)
    plt.semilogy(xs, ys, marker="o", linewidth=1.5, label=f"{dataset}, id={sid}")

plt.xlabel("num")
plt.ylabel("prob / entropy (log scale)")
plt.title("prob/entropy vs num (log scale, grouped by dataset, id)")
plt.grid(True, which="both")
plt.legend(loc="best", fontsize=9)
plt.tight_layout()
plt.show()

# ====== Optional: save ======
# plt.savefig("prob_entropy_grouped.png", dpi=200, bbox_inches="tight")
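
The "keep the last occurrence" rule in `sort_and_dedup` matters when the same `num` is logged more than once, for instance after a rerun. A small self-contained sketch of that rule on made-up points (the values are hypothetical):

```python
def sort_and_dedup(points):
    """Sort by num; when a num repeats, the value seen last in the input wins."""
    last = {}
    for n, v in points:   # later entries overwrite earlier ones
        last[n] = v
    xs = sorted(last)
    return xs, [last[n] for n in xs]

# num=2 appears twice; the second value (0.9) should be the one plotted.
points = [(2, 0.5), (1, 0.1), (2, 0.9)]
xs, ys = sort_and_dedup(points)
print(xs, ys)
```

This version skips the initial sort of the input. A stable sort by `num` keeps equal keys in input order, so the cell above gives the same result.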


In [None]:
TARGET_ID = 2

plt.figure(figsize=(8, 5))

for (dataset, sid), points in series.items():
    if sid != TARGET_ID:
        continue

    xs, ys = sort_and_dedup(points)
    plt.plot(xs, ys, marker="o", label=f"{dataset}, id={sid}")

plt.xlabel("num")
plt.ylabel("prob / entropy")
plt.title(f"prob/entropy vs num (id={TARGET_ID})")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
TARGET_DATASET = "gsm8k"
TARGET_ID = 0

key = (TARGET_DATASET, TARGET_ID)

if key not in series:
    raise ValueError(f"{key} not found in logs")

xs, ys = sort_and_dedup(series[key])

plt.figure(figsize=(12, 5))

plt.semilogy(xs, ys, marker="o")
plt.xlabel("num")
plt.ylabel("prob / entropy (log scale)")
plt.title(f"log-scale prob/entropy ({TARGET_DATASET}, id={TARGET_ID})")
plt.grid(True, which="both")
plt.show()

In [None]:
import json
import glob

files = glob.glob(f"/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/*_entropy_descent.txt")
for file in files:
    with open(file, "r") as f:
        text = f.read()
        loop = loop_check(text,5)
        if loop:
            stop_reason = "loop"
        else:
            if text.endswith("</think>"):
                stop_reason = "eos"
            else:
                stop_reason = "max_length"

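    # removesuffix drops exactly ".txt"; str.strip would remove any trailing '.', 't', 'x' characters
    with open(file.removesuffix(".txt")+".json", "w") as f: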
        f.write(json.dumps({
            "loop": loop,
            "stop_reason": stop_reason,
            "prompt": text
        }))
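
Deriving the `.json` path from the `.txt` path is easy to get subtly wrong, because `str.strip` removes a set of characters from both ends rather than a suffix. A short sketch of the difference, using hypothetical file names and a hypothetical `to_json_path` helper:

```python
import os

def to_json_path(path: str) -> str:
    """Replace the file extension with .json."""
    root, _ext = os.path.splitext(path)
    return root + ".json"

name = "text.txt"  # hypothetical file name that exposes the problem

# strip(".txt") treats its argument as the character set {'.', 't', 'x'}
print(repr(name.strip(".txt")))
# removesuffix (Python 3.9+) removes the literal suffix only
print(repr(name.removesuffix(".txt")))
print(to_json_path("/tmp/exp/gsm8k_0_entropy_descent.txt"))
```

For names ending in `..._entropy_descent.txt` the character-set behaviour happens to eat the final `t` of `descent` as well, which is why appending `"t.json"` appears to work.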

**Length ratio**

In [None]:
import json

with open("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/gsm8k_results.jsonl", "r") as f:
    results = [json.loads(line) for line in f.readlines()]

ids = set()
for result in results:
    if result["loop"] == True:
        ids.add(result["id"])

In [None]:
import glob
import json

with open("/root/project/dos/dataset/ours/gsm8k_sample.jsonl", "r") as f:
    gsm8k_dataset: list[dict[str, list[str]] | str | int] = [json.loads(line) for line in f.readlines()]

for i, data in enumerate(gsm8k_dataset):
    gsm8k_dataset[i] = {
        "id": i,
        "question": data["question"],
        "options": None,
        "option": None,
        "answer": data["answer"].split("\n#### ")[1],
        "incorrect": []
    }

paths = glob.glob("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/*entropy*.json")

data = gsm8k_dataset[4]
answer = data["answer"]

with open(paths[4],"r") as f:
    thinking = json.loads(f.read())["prompt"]

over_reflect, step = over_reflect_trim(thinking, None, answer)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(step))
token = tokenizer.convert_tokens_to_string(tokens[:3])
thinking_loop = thinking_loop_trim(thinking, token)
prompt = over_reflect + "\n\n" + thinking_loop + "\n\n"
print(f"Token: {token}\n")
print(prompt+"\n")

In [None]:
with open("/root/project/dos/dataset/ours/gsm8k_sample.jsonl", "r") as f:
    gsm8k_dataset: list[dict[str, list[str]] | str | int] = [json.loads(line) for line in f.readlines()]

for i, data in enumerate(gsm8k_dataset):
    gsm8k_dataset[i] = {
        "id": i,
        "question": data["question"],
        "options": None,
        "option": None,
        "answer": data["answer"].split("\n#### ")[1],
        "incorrect": []
    }

length = 0
for id, data in enumerate(gsm8k_dataset):
    if id not in ids:
        continue

    content = data["question"]
    print(f"Question: {content}\n")
    prompt = tokenizer.apply_chat_template(
        [{"role":"user", "content":content}],
        tokenize=False,
        add_generation_prompt=True
    )

    outputs: list[RequestOutput] = llm.generate(
        prompt,
        sampling_params
    )

    output = outputs[0]
    text = output.outputs[0].text
    print(f"Assistant: {text}\n")
    length += len(output.outputs[0].token_ids)

print(f"avg length: {length / max(len(ids), 1)}")

In [None]:
import os

# Create an empty prompts file if it does not exist yet
if not os.path.exists("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/promtps.jsonl"):
    with open("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/promtps.jsonl", "w") as f:
        pass

for result in results:
# result = results[0]
    if result["loop"] and result["temperature"] == 0:
        id = result["id"]
        data = gsm8k_dataset[id]
        answer = data["answer"]
        content = result["content"]

        over_reflect, step = over_reflect_trim(content, None, answer)
        token = tokenizer.convert_ids_to_tokens(tokenizer.encode(step))[0]
        thinking_loop = thinking_loop_trim(content, token)
        prompt = over_reflect + "\n\n" + thinking_loop + "\n\n"
        with open("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/promtps.jsonl", "a") as f:
            f.write(json.dumps({"id":id, "token": token, "prompt": prompt})+"\n")

In [None]:
with open("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/promtps.jsonl", "r") as f:
    prompts = [json.loads(line) for line in f.readlines()]

length = 0
for data in prompts:
    id = data["id"]
    prompt = tokenizer.apply_chat_template(
        [{"role":"user", "content":data["prompt"]}],
        tokenize=False,
        add_generation_prompt=True
    )

    outputs: list[RequestOutput] = llm.generate(
        prompt,
        sampling_params
    )

    output = outputs[0]
    text = output.outputs[0].text
    loop = loop_check(text, 5)
    length += len(output.outputs[0].token_ids)
    with open("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/gsm8k_results.jsonl", "a") as f:
        f.write(json.dumps({"id": id, "loop": loop, "temperature": 0.7, "output": len(output.outputs[0].token_ids), "trim": True})+"\n")

print(f"avg length: {length/len(prompts)}")

In [None]:
with open("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/promtps.jsonl", "r") as f:
    prompts = [json.loads(line) for line in f.readlines()]

length = 0
for data in prompts:
    id = data["id"]
    prompt = tokenizer.apply_chat_template(
        [{"role":"user", "content":data["prompt"]}],
        tokenize=False,
        add_generation_prompt=True
    )

    outputs: list[RequestOutput] = llm.generate(
        prompt,
        sampling_params
    )

    output = outputs[0]
    text = output.outputs[0].text
    loop = loop_check(text, 5)
    length += len(output.outputs[0].token_ids)
    with open("/root/project/dos/exp/DeepSeek-R1-Distill-Llama-8B/gsm8k_results.jsonl", "a") as f:
        f.write(json.dumps({"id": id, "loop": loop, "temperature": 0, "output": len(output.outputs[0].token_ids), "trim": True})+"\n")

print(f"avg length: {length/len(prompts)}")
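
Once both runs have appended to the results file, the per-setting averages can be computed from the JSONL records instead of from the running `length` counter. The sketch below assumes records with the fields written above (`id`, `loop`, `temperature`, `output`, `trim`); the sample records and the `summarize` helper are hypothetical, and older records in the same file may carry different fields.

```python
import json
from collections import defaultdict

# Hypothetical records in the shape appended to the results file above.
lines = [
    '{"id": 0, "loop": true, "temperature": 0, "output": 100, "trim": true}',
    '{"id": 1, "loop": false, "temperature": 0, "output": 300, "trim": true}',
    '{"id": 0, "loop": false, "temperature": 0.7, "output": 50, "trim": true}',
]
records = [json.loads(line) for line in lines]

def summarize(records):
    """Group by (temperature, trim) and report count, mean length and loop rate."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["temperature"], r.get("trim", False))].append(r)
    summary = {}
    for key, rs in groups.items():
        summary[key] = {
            "n": len(rs),
            "avg_output": sum(r["output"] for r in rs) / len(rs),
            "loop_rate": sum(bool(r["loop"]) for r in rs) / len(rs),
        }
    return summary

summary = summarize(records)
for key, stats in sorted(summary.items()):
    print(key, stats)
```

Dividing one group's `avg_output` by another's gives a length ratio between settings, provided both groups cover the same question ids.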