In [3]:
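import openai  # legacy (pre-1.0) SDK interface; openai.ChatCompletion was removed in openai>=1.0

# Assumes an OpenAI-compatible server hosting this model is already configured,
# e.g. through the OPENAI_API_BASE and OPENAI_API_KEY environment variables.
# Create a chat completion.
completion = openai.ChatCompletion.create(
    model='vicuna-13b-v1.5',
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
# Print the text of the first returned choice.
print(completion.choices[0].message.content)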

Hello! You may refer to me as Vicuna, a language model meticulously developed by the researchers at Large Model Systems Organization (LMSYS).


In [6]:
import guidance

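# Clear guidance's cache of earlier completions so the prompts below are sent to the model again.
guidance.llms.OpenAI.cache.clear()
# chat_mode=True selects the chat format; encoding_name names the tiktoken encoding to use
# for this non-OpenAI model.
llm = guidance.llms.OpenAI('chatglm2-6b', encoding_name='cl100k_base', chat_mode=True)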

In [2]:
title = "SELF-CONSISTENCY IMPROVES CHAIN OF THOUGHT REASONING IN LANGUAGE MODELS"

context = """
1 INTRODUCTION
Although language models have demonstrated remarkable success across a range of NLP tasks, their
ability to demonstrate reasoning is often seen as a limitation, which cannot be overcome solely by
increasing model scale (Rae et al., 2021; BIG-bench collaboration, 2021, inter alia). In an effort
to address this shortcoming, Wei et al. (2022) have proposed chain-of-thought prompting, where
a language model is prompted to generate a series of short sentences that mimic the reasoning
process a person might employ in solving a task. For example, given the question “If there are 3
cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?”, instead
of directly responding with “5”, a language model would be prompted to respond with the entire
chain-of-thought: “There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 +
2 = 5 cars. The answer is 5.”. It has been observed that chain-of-thought prompting significantly
improves model performance across a variety of multi-step reasoning tasks (Wei et al., 2022).
In this paper, we introduce a novel decoding strategy called self-consistency to replace the greedy
decoding strategy used in chain-of-thought prompting (Wei et al., 2022), that further improves
language models’ reasoning performance by a significant margin. Self-consistency leverages the
intuition that complex reasoning tasks typically admit multiple reasoning paths that reach a correct
answer (Stanovich & West, 2000). The more that deliberate thinking and analysis is required for a
problem (Evans, 2010), the greater the diversity of reasoning paths that can recover the answer.
Figure 1 illustrates the self-consistency method with an example. We first prompt the language model
with chain-of-thought prompting, then instead of greedily decoding the optimal reasoning path, we
propose a “sample-and-marginalize” decoding procedure: we first sample from the language model’s
decoder to generate a diverse set of reasoning paths; each reasoning path might lead to a different
final answer, so we determine the optimal answer by marginalizing out the sampled reasoning paths
to find the most consistent answer in the final answer set. Such an approach is analogous to the
human experience that if multiple different ways of thinking lead to the same answer, one has greater
confidence that the final answer is correct. Compared to other decoding methods, self-consistency
avoids the repetitiveness and local-optimality that plague greedy decoding, while mitigating the
stochasticity of a single sampled generation.
arXiv:2203.11171v4 [cs.CL] 7 Mar 2023
Published as a conference paper at ICLR 2023
[Figure 1 (diagram). Prompt given to the language model:
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars.
The answer is 5.
…
Q: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins
for her friends every day with four. She sells the remainder for $2 per egg. How much does she make
every day?
A:
Greedy decode: This means she uses 3 + 4 = 7 eggs every day. She sells the remainder for $2 per egg,
so in total she sells 7 * $2 = $14 per day. The answer is $14.
Sample a diverse set of reasoning paths:
Output 1: She has 16 - 3 - 4 = 9 eggs left. So she makes $2 * 9 = $18 per day. The answer is $18.
Output 2: This means she sells the remainder for $2 * (16 - 4 - 3) = $26 per day. The answer is $26.
Output 3: She eats 3 for breakfast, so she has 16 - 3 = 13 left. Then she bakes muffins, so she
has 13 - 4 = 9 eggs left. So she has 9 eggs * $2 = $18. The answer is $18.
Marginalize out reasoning paths to aggregate final answers: The answer is $18.]
Figure 1: The self-consistency method contains three steps: (1) prompt a language model using
chain-of-thought (CoT) prompting; (2) replace the “greedy decode” in CoT prompting by sampling
from the language model’s decoder to generate a diverse set of reasoning paths; and (3) marginalize
out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.
Self-consistency is far simpler than prior approaches that either train an additional verifier (Cobbe
et al., 2021) or train a re-ranker given additional human annotations to improve generation quality
(Thoppilan et al., 2022). Instead, self-consistency is entirely unsupervised, works off-the-shelf with
pre-trained language models, requires no additional human annotation, and avoids any additional
training, auxiliary models or fine-tuning. Self-consistency also differs from a typical ensemble
approach where multiple models are trained and the outputs from each model are aggregated; it acts
more like a “self-ensemble” that works on top of a single language model.
We evaluate self-consistency on a wide range of arithmetic and commonsense reasoning tasks over
four language models with varying scales: the public UL2-20B (Tay et al., 2022) and GPT-3-175B
(Brown et al., 2020), and two densely-activated decoder-only language models: LaMDA-137B
(Thoppilan et al., 2022) and PaLM-540B (Chowdhery et al., 2022). On all four language models,
self-consistency improves over chain-of-thought prompting by a striking margin across all tasks. In
particular, when used with PaLM-540B or GPT-3, self-consistency achieves new state-of-the-art levels
of performance across arithmetic reasoning tasks, including GSM8K (Cobbe et al., 2021) (+17.9%
absolute accuracy gains), SVAMP (Patel et al., 2021) (+11.0%), AQuA (Ling et al., 2017) (+12.2%),
and across commonsense reasoning tasks such as StrategyQA (Geva et al., 2021) (+6.4%) and
ARC-challenge (Clark et al., 2018) (+3.9%). In additional experiments, we show self-consistency can
robustly boost performance on NLP tasks where adding a chain-of-thought might hurt performance
compared to standard prompting (Ye & Durrett, 2022). We also show self-consistency significantly
outperforms sample-and-rank, beam search, ensemble-based approaches, and is robust to sampling
strategies and imperfect prompts.
2 SELF-CONSISTENCY OVER DIVERSE REASONING PATHS
A salient aspect of humanity is that people think differently. It is natural to suppose that in tasks
requiring deliberate thinking, there are likely several ways to attack the problem. We propose that
such a process can be simulated in language models via sampling from the language model’s decoder.
For instance, as shown in Figure 1, a model can generate several plausible responses to a math
question that all arrive at the same correct answer (Outputs 1 and 3). Since language models are not
perfect reasoners, the model might also produce an incorrect reasoning path or make a mistake in
one of the reasoning steps (e.g., in Output 2), but such solutions are less likely to arrive at the same
answer. That is, we hypothesize that correct reasoning processes, even if they are diverse, tend to
have greater agreement in their final answer than incorrect processes.
We leverage this intuition by proposing the following self-consistency method. First, a language
model is prompted with a set of manually written chain-of-thought exemplars (Wei et al., 2022). Next,
we sample a set of candidate outputs from the language model’s decoder, generating a diverse set of
candidate reasoning paths. Self-consistency is compatible with most existing sampling algorithms,
including temperature sampling (Ackley et al., 1985; Ficler & Goldberg, 2017), top-k sampling (Fan
et al., 2018; Holtzman et al., 2018; Radford et al., 2019), and nucleus sampling (Holtzman et al.,
2020). Finally, we aggregate the answers by marginalizing out the sampled reasoning paths and
choosing the answer that is the most consistent among the generated answers.
Table 1: Accuracy comparison of different answer aggregation strategies on PaLM-540B.
Aggregation strategy            GSM8K       MultiArith  AQuA        SVAMP       CSQA        ARC-c
Greedy decode                   56.5        94.7        35.8        79.0        79.0        85.2
Weighted avg (unnormalized)     56.3 ± 0.0  90.5 ± 0.0  35.8 ± 0.0  73.0 ± 0.0  74.8 ± 0.0  82.3 ± 0.0
Weighted avg (normalized)       22.1 ± 0.0  59.7 ± 0.0  15.7 ± 0.0  40.5 ± 0.0  52.1 ± 0.0  51.7 ± 0.0
Weighted sum (unnormalized)     59.9 ± 0.0  92.2 ± 0.0  38.2 ± 0.0  76.2 ± 0.0  76.2 ± 0.0  83.5 ± 0.0
Weighted sum (normalized)       74.1 ± 0.0  99.3 ± 0.0  48.0 ± 0.0  86.8 ± 0.0  80.7 ± 0.0  88.7 ± 0.0
Unweighted sum (majority vote)  74.4 ± 0.1  99.3 ± 0.0  48.3 ± 0.5  86.6 ± 0.1  80.7 ± 0.1  88.7 ± 0.1
In more detail, assume the generated answers a_i are from a fixed answer set, a_i ∈ A, where
i = 1, ..., m indexes the m candidate outputs sampled from the decoder. Given a prompt and a
question, self-consistency introduces an additional latent variable r_i, which is a sequence of tokens
representing the reasoning path in the i-th output, then couples the generation of (r_i, a_i) where
r_i → a_i, i.e., generating a reasoning path r_i is optional and only used to reach the final answer a_i.
As an example, consider Output 3 from Figure 1: the first few sentences “She eats 3 for breakfast ... So
she has 9 eggs * $2 = $18.” constitute r_i, while the answer 18 from the last sentence, “The answer
is $18”, is parsed as a_i (see footnote 1). After sampling multiple (r_i, a_i) from the model’s decoder,
self-consistency applies a marginalization over r_i by taking a majority vote over a_i, i.e.,
arg max_a Σ_{i=1}^{m} 1(a_i = a),
which we define as the most “consistent” answer among the final answer set.
In Table 1, we show the test accuracy over a set of reasoning tasks by using different answer
aggregation strategies. In addition to majority vote, one can also weight each (r_i, a_i) by
P(r_i, a_i | prompt, question) when aggregating the answers. Note that to compute
P(r_i, a_i | prompt, question), we can either take the unnormalized probability of the model
generating (r_i, a_i) given (prompt, question), or we can normalize the conditional probability by
the output length (Brown et al., 2020), i.e.,
P(r_i, a_i | prompt, question) = exp( (1/K) Σ_{k=1}^{K} log P(t_k | prompt, question, t_1, ..., t_{k-1}) ),   (1)
where log P(t_k | prompt, question, t_1, ..., t_{k-1}) is the log probability of generating the k-th token
t_k in (r_i, a_i) conditioned on the previous tokens, and K is the total number of tokens in (r_i, a_i).
In Table 1, we show that taking the “unweighted sum”, i.e., taking a majority vote directly over a_i,
yields accuracy very similar to aggregating using the “normalized weighted sum”. We took a closer
look at the model’s output probabilities and found this is because for each (r_i, a_i), the normalized
conditional probabilities P(r_i, a_i | prompt, question) are quite close to each other, i.e., the language
model regards those generations as “similarly likely” (see footnote 2). Additionally, when aggregating
the answers, the results in Table 1 show that the “normalized” weighted sum (i.e., Equation 1) yields a
much higher accuracy compared to its unnormalized counterpart. For completeness, in Table 1 we also
report the results by taking a “weighted average”, i.e., each a gets a score of its weighted sum divided
by Σ_{i=1}^{m} 1(a_i = a), which results in a much worse performance.
Self-consistency explores an interesting space between open-ended text generation and optimal
text generation with a fixed answer. Reasoning tasks typically have fixed answers, which is why
researchers have generally considered greedy decoding approaches (Radford et al., 2019; Wei et al.,
2022; Chowdhery et al., 2022). However, we have found that even when the desired answer is fixed,
introducing diversity in the reasoning processes can be highly beneficial; therefore we leverage
sampling, as commonly used for open-ended text generation (Radford et al., 2019; Brown et al., 2020;
Thoppilan et al., 2022), to achieve this goal. One should note that self-consistency can be applied
only to problems where the final answer is from a fixed answer set, but in principle this approach can
be extended to open-text generation problems if a good metric of consistency can be defined between
multiple generations, e.g., whether two answers agree or contradict each other.
Footnote 1: The parser is task dependent. For arithmetic reasoning, we parse the first numerical part
as the final answer after the model generates “The answer is ”. For commonsense reasoning, we parse
the full string answer as the final answer after the model generates “The answer is ”. Most generated
outputs have a consistent format of “{Reasoning paths}. The answer is X.” if we prompt the language
model in this format.
Footnote 2: This also means that the language model is not well calibrated and thus cannot distinguish
well between correct solutions and wrong solutions, which also explains why additional re-rankers were
trained to better judge the quality of the solutions in previous work (Cobbe et al., 2021; Thoppilan
et al., 2022).

4 RELATED WORK
Reasoning in language models. Language models are known to struggle in Type 2 tasks, such as
arithmetic, logical and commonsense reasoning (Evans, 2010). Previous work has primarily focused
on specialized approaches for improving reasoning (Andor et al., 2019; Ran et al., 2019; Geva et al.,
2020; Piękos et al., 2021). Compared to prior work, self-consistency is applicable to a wide range of
reasoning tasks without any additional supervision or fine-tuning, while still substantially improving
the performance of the chain-of-thought prompting approach proposed in Wei et al. (2022).
Sampling and re-ranking in language models. Multiple decoding strategies for language models
have been proposed in the literature, e.g., temperature sampling (Ackley et al., 1985; Ficler &
Goldberg, 2017), top-k sampling (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019),
nucleus sampling (Holtzman et al., 2020), minimum Bayes risk decoding (Eikema & Aziz, 2020; Shi
et al., 2022), and typical decoding (Meister et al., 2022). Other work has sought to explicitly promote
diversity in the decoding process (Batra et al., 2012; Li et al., 2016; Vijayakumar et al., 2018).
Re-ranking is another common approach to improve generation quality in language models (Adiwardana et al., 2020; Shen et al., 2021). Thoppilan et al. (2022) collect additional human annotations
to train a re-ranker for response filtering. Cobbe et al. (2021) train a “verifier” to re-rank generated
solutions, which substantially improves the solve rate on math tasks compared to just fine-tuning the
language model. Elazar et al. (2021) improve the consistency of factual knowledge extraction by
extending pre-training with an additional consistency loss. All these methods require either training
an additional re-ranker or collecting additional human annotation, while self-consistency requires no
additional training, fine-tuning, or extra data collection.
Extract reasoning paths. Some previous work has considered task-specific approaches for identifying reasoning paths, such as constructing semantic graphs (Xu et al., 2021a), learning an RNN
to retrieve reasoning paths over the Wikipedia graph (Asai et al., 2020), fine-tuning with human
annotated reasoning paths on math problems (Cobbe et al., 2021), or training an extractor with
heuristic-based pseudo reasoning paths (Chen et al., 2019). More recently, the importance of diversity in the reasoning processes has been noticed, but only leveraged via task-specific training,
either through an additional QA model over extracted reasoning paths (Chen et al., 2019), or by the
introduction of latent variables in a commonsense knowledge graph (Yu et al., 2022). Compared to
these approaches, self-consistency is far simpler and requires no additional training. The approach
we propose simply couples the generation of reasoning paths and a final answer by sampling from
the decoder, using aggregation to recover the most consistent answer without additional modules.
Consistency in language models. Some prior work has shown that language models can suffer
from inconsistency in conversation (Adiwardana et al., 2020), explanation generation (Camburu et al.,
2020), and factual knowledge extraction (Elazar et al., 2021). Welleck et al. (2020) use “consistency”
to refer to generating an infinite-length sequence in recurrent language models. Nye et al. (2021)
improve the logical consistency of samples from a System 1 model by adding a System 2-inspired
logical reasoning module. In this paper we focus on a slightly different notion of “consistency”, i.e.,
utilizing answer consistency among diverse reasoning paths to improve accuracy.
5 CONCLUSION AND DISCUSSION
We introduced a simple yet effective method called self-consistency, and observed that it significantly
improves accuracy in a range of arithmetic and commonsense reasoning tasks, across four large
language models with varying scales. Beyond accuracy gains, self-consistency is also useful for
collecting rationales when performing reasoning tasks with language models, and for providing
uncertainty estimates and improved calibration of language model outputs.
One limitation of self-consistency is that it incurs more computation cost. In practice people can try a
small number of paths (e.g., 5 or 10) as a starting point to realize most of the gains while not incurring
too much cost, as in most cases the performance saturates quickly (Figure 2). As part of future work,
one could use self-consistency to generate better supervised data to fine-tune the model, such that the
model can give more accurate predictions in a single inference run after fine-tuning. In addition, we
observed that language models can sometimes generate incorrect or nonsensical reasoning paths (e.g.,
the StrategyQA example in Table 4, where the two population numbers are not exactly correct), and further
work is needed to better ground models’ rationale generations.
REPRODUCIBILITY STATEMENT
In experiments, we included four different language models with varying scales. Two of them are
public models: UL2 is a completely open-sourced model with model checkpoints available at
https://github.com/google-research/google-research/tree/master/ul2; GPT-3 is
also a public model with public API available at https://openai.com/api/. For GPT-3,
we have included two public engines (“code-davinci-001” and “code-davinci-002”) to further aid
reproducibility, as Codex is currently free so anyone can reproduce the results. In addition, as our
results make use of LaMDA-137B and PaLM-540B that are not publicly available, we provide the
exact input prompts for all tasks in Appendix A.3 (and note that we do not perform any finetuning
and only apply prompting to off-the-shelf language models).
ETHICS STATEMENT
As we stated in the discussion, language models can sometimes generate nonsensical or non-factual
reasoning paths, so one should use language models’ outputs with extra caution. We deal with
reasoning tasks mostly and the generated rationales are only used for inspecting how a model reaches
its answer. One could potentially use the generated rationales to further check why the model makes
certain mistakes or whether the model contains any biases when performing a certain task. For
language models in real-world use, further work is needed to better ground models’ predictions and
improve models’ factuality and safety, to ensure the models do not cause harm to users.
"""
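
The aggregation step described in the excerpt above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`parse_answer`, `majority_vote`, `normalized_probability`, `weighted_vote`) and the regular expression are hypothetical, the parser only covers the arithmetic case ("the first numerical part" after "The answer is"), and the token log probabilities in the usage lines are made-up values.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical parser for the arithmetic case: first number after "The answer is".
ANSWER_PATTERN = re.compile(r"The answer is \$?(-?\d+(?:\.\d+)?)")


def parse_answer(output):
    """Return the parsed answer a_i as a string, or None if no answer is found."""
    match = ANSWER_PATTERN.search(output)
    return match.group(1) if match else None


def majority_vote(outputs):
    """Unweighted sum: arg max_a sum_i 1(a_i = a) over the parsed answers."""
    answers = [a for a in map(parse_answer, outputs) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def normalized_probability(token_logprobs):
    """Equation 1: exp of the mean per-token log probability of (r_i, a_i)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def weighted_vote(outputs, logprobs_per_output):
    """Normalized weighted sum: each answer accumulates its Equation 1 scores."""
    scores = defaultdict(float)
    for output, token_logprobs in zip(outputs, logprobs_per_output):
        answer = parse_answer(output)
        if answer is not None:
            scores[answer] += normalized_probability(token_logprobs)
    return max(scores, key=scores.get) if scores else None


# The three sampled outputs from Figure 1 of the excerpt.
sampled_outputs = [
    "She has 16 - 3 - 4 = 9 eggs left. So she makes $2 * 9 = $18 per day. "
    "The answer is $18.",
    "This means she sells the remainder for $2 * (16 - 4 - 3) = $26 per day. "
    "The answer is $26.",
    "She eats 3 for breakfast, so she has 16 - 3 = 13 left. Then she bakes muffins, "
    "so she has 13 - 4 = 9 eggs left. So she has 9 eggs * $2 = $18. The answer is $18.",
]
# Made-up per-token log probabilities, for illustration only.
example_logprobs = [[-0.1, -0.2], [-0.1], [-0.3, -0.1]]

print(majority_vote(sampled_outputs))
print(weighted_vote(sampled_outputs, example_logprobs))
```

Both calls select "18" for this example, since two of the three paths agree. Answers are compared as strings here, so "18" and "18.0" would count as different answers; a real parser would likely normalize them first.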

In [7]:
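program = guidance("""{{#system~}}
You are a highly experienced professor who answers academic questions for students (the user).
First extract and output the key information from each paragraph; for each piece of key information, cite at least 3 supporting arguments from the original text. Be careful not to omit any of the points.
{{~/system}}

{{#user~}}
Professor, please help me summarize this passage and list its main contributions.

Title: {{title}}
Text: {{context}}
{{~/user}}

{{#assistant~}}
{{~gen '关键信息' temperature=0.95 top_p=0.8 max_new_tokens=1024}}
{{~/assistant}}

{{! Next, summarize based on the key information extracted above }}

{{#user~}}
Based on this key information, what are the main contributions and improvements of this article? Summarize the knowledge in plain, easy-to-understand language and point out the key points. Thank you.
{{~/user}}

{{#assistant~}}
{{gen '总结' temperature=0.95 top_p=0.8}}
{{~/assistant}}""", llm=llm)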

executed_program = program(title=title, context=context)
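
The template above stores its two generations under the names `关键信息` (key information) and `总结` (summary). A small helper for collecting them might look like the sketch below. It assumes the legacy guidance behavior in which an executed program supports lookup by variable name (`executed_program["name"]`); `collect_outputs` is a hypothetical helper, and the dictionary here is only a stand-in so the block runs without a model server.

```python
def collect_outputs(executed, names=("关键信息", "总结")):
    """Gather the named generations into a plain dict.

    `executed` can be any object that supports lookup by variable name,
    which is assumed here to include the executed guidance program.
    """
    return {name: executed[name] for name in names}


# Stand-in for the executed program; the values are placeholders, not model output.
stand_in = {"关键信息": "<key information>", "总结": "<summary>", "title": "<title>"}

for name, text in collect_outputs(stand_in).items():
    print(f"{name}: {text}")
```

With a live session, `collect_outputs(executed_program)` would return the same two entries filled with the model's text, under the lookup assumption stated above.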