# Large Language Models: Reasoning Ability

# 1. Introduction

In recent years, generative AI models have emerged as groundbreaking innovations in the field of artificial intelligence, transforming domains such as natural language processing, image generation, and interactive decision-making. Large Language Models (LLMs), such as GPT-3 (Generative Pre-trained Transformer) <a class='anchor' id='ref_brown2020language'></a> [[1]](#brown2020language) and BART (Bidirectional and Auto-Regressive Transformers) <a class='anchor' id='ref_lewis2019Bart'></a> [[2]](#lewis2019Bart), are models pre-trained for text generation and have demonstrated remarkable capabilities in understanding and generating human-like text.

However, as these increasingly sophisticated and powerful LLMs become more prevalent, it becomes crucial to address the question of reasoning ability in their behavior. Unlike humans, LLMs do not possess a mind of their own to think or reason independently. This lack of inherent reasoning capabilities can lead to challenges such as hallucinations or the generation of text that lacks logical coherence. Reasoning refers to the ability of LLMs to produce outputs that align with human values, ethical considerations, and logical reasoning. Ensuring reasoning abilities is of utmost importance to build trustworthy and responsible AI systems that can be deployed across diverse applications, including virtual assistants, chatbots, and automated decision-making systems.

In this essay, we will delve into the insights gained over the past two years of working and experimenting with LLM reasoning. We will explore the advancements made through chain-of-thought prompting, tree-of-thought frameworks, linguistic feedback reinforcement, interleaved reasoning and action, and handling complex mathematical reasoning. By analyzing related papers and their findings, we aim to provide a comprehensive overview of the architecture and progress of LLM reasoning, highlighting the lessons learned and the way forward in ensuring ethical and reliable AI systems.

# 2. Overview of LLM Reasoning

Large language model (LLM) reasoning, as mentioned earlier, is a critical aspect in the development of responsible AI systems, as it ensures that LLMs produce outputs that are aligned with human values, ethical considerations, and logical reasoning. It serves as a safeguard against potential risks, biases, and harmful behavior that can arise from LLMs, promoting transparency, fairness, and trustworthiness in their applications.

Ensuring reasoning in LLMs poses several challenges and has far-reaching implications. One of the main challenges stems from the vast amount of training data used by LLMs, which is typically sourced from the internet and may contain biases, misinformation, or inappropriate content. As a result, LLMs may inadvertently perpetuate biases or generate misleading or harmful outputs. Additionally, LLMs may lack coherent reasoning capabilities or struggle with handling complex tasks effectively, which can limit their usefulness and reliability in real-world applications.

Here, we will focus only on the reasoning challenges, not the ethical ones.

To tackle these challenges, it is important to understand different types of reasoning:

- **Inductive Reasoning**: is a type of reasoning where a conclusion is drawn based on observations or evidence. It involves generalizing from specific instances to make a broader generalization or prediction. Inductive reasoning allows for the conclusion to be likely but not necessarily certain. It relies on the idea that if something is true for a particular set of cases, it is likely to be true for similar cases. For example, observing that every bird seen so far has wings and concluding that all birds are likely to have wings.

- **Deductive Reasoning**: is a type of reasoning where a conclusion is drawn based on the truth of the premises. It follows a logical process where the conclusion necessarily follows from the premises. If the premises are true, then the conclusion must also be true. Deductive reasoning is often used in mathematics and formal logic. An example would be the premise that all mammals have kidneys, the premise that all whales are mammals, and the conclusion that all whales have kidneys.

- **Abductive Reasoning**: is a type of reasoning where a conclusion is drawn based on the best explanation for a given set of observations. It involves considering different hypotheses and selecting the most likely or best explanation based on the available evidence. Abductive reasoning is used to make educated guesses or hypotheses when faced with incomplete or uncertain information. For example, observing a car that cannot start and a puddle of liquid under the engine, and concluding that the most likely explanation is a leak in the radiator.

- **Formal Reasoning**: is a systematic and logical process that follows a set of rules and principles. It is characterized by its structured and rigorous approach, often used in disciplines like mathematics, formal logic, and computer science. Formal reasoning relies on deductive logic and mathematical proofs to arrive at valid conclusions. It involves applying established rules and principles to solve problems and make deductions.

- **Informal Reasoning**: is a less structured approach to reasoning that relies on intuition, experience, and common sense. It is used in everyday life situations where strict formal rules may not apply. Informal reasoning allows for more flexibility and open-ended thinking. It often involves making decisions or drawing conclusions based on personal experiences, heuristics, and contextual factors. Informal reasoning is more adaptable but may also be less reliable compared to formal reasoning.

By understanding and addressing these different types of reasoning, we can enhance the reasoning capabilities of LLMs.

# 3. Chain-of-Thought Prompting: Enhancing Complex Reasoning

Chain-of-thought prompting <a class='anchor' id='ref_wei2023chainofthought'></a> [[3]](#wei2023chainofthought) is a powerful technique that has shown promising results in improving the reasoning abilities of large language models. This approach involves generating a series of intermediate reasoning steps, known as a chain of thought, to guide the LLM in performing complex reasoning tasks. By providing exemplars of chain-of-thought demonstrations as prompts, LLMs can naturally develop reasoning abilities.

![Chain-of-thought.png](attachment:aa62fa97-7c24-4341-ba72-017b36ff7dc7.png)

*Figure 1: Chain of thought (Source: https://arxiv.org/pdf/2201.11903.pdf)*
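
The prompt format described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the `Exemplar` class, the `build_cot_prompt` helper, the `Q:`/`A:` layout, and the sample questions are all assumptions made for this example, and the resulting string would still need to be sent to an actual LLM.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    chain_of_thought: str  # intermediate reasoning steps, written out in prose
    answer: str


def build_cot_prompt(exemplars: list[Exemplar], test_question: str) -> str:
    """Concatenate worked exemplars, then leave the final answer slot open."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The trailing "A:" invites the model to continue with its own chain of thought.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar (hypothetical data for this sketch).
exemplars = [
    Exemplar(
        question="Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
                 "How many tennis balls does he have now?",
        chain_of_thought="Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
                         "5 + 6 = 11.",
        answer="11",
    ),
]
prompt = build_cot_prompt(
    exemplars, "A baker has 12 eggs and uses 3 per cake. How many cakes can she bake?"
)
print(prompt)
```

In practice several exemplars would be stacked in the same way (the paper reports results with eight), and the only difference from standard few-shot prompting is that each answer spells out its reasoning before the final result.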

The paper exploring chain-of-thought prompting ran experiments on three large language models and found that the technique improves performance across arithmetic, commonsense, and symbolic reasoning tasks. For instance, prompting a PaLM 540B model with just eight chain-of-thought exemplars achieved state-of-the-art accuracy on the GSM8K benchmark of math word problems. This performance surpassed even a finetuned GPT-3 model with a verifier, highlighting the effectiveness of chain-of-thought prompting in enhancing complex reasoning capabilities in LLMs.

The success of chain-of-thought prompting in achieving state-of-the-art performance is a testament to its ability to enable LLMs to reason through a series of intermediate steps. By breaking down complex tasks into smaller reasoning steps, LLMs can better comprehend the underlying logic and make more informed decisions. This approach not only improves the overall reasoning capabilities of LLMs but also enhances their performance in a wide range of tasks that require arithmetic, commonsense, and symbolic reasoning. The findings from this paper underscore the significance of chain-of-thought prompting as a valuable technique for advancing the field of large language models and promoting complex reasoning in AI systems.

# 4. Tree-of-Thought Framework: Trial-and-Error Problem Solving

The Tree-of-Thought (ToT) <a class='anchor' id='ref_long2023large'></a> [[4]](#long2023large) <a class='anchor' id='ref_yao2023tree'></a> [[5]](#yao2023tree) framework is an innovative approach aimed at improving the problem-solving capabilities of LLMs by emulating the human mind's trial-and-error approach to complex reasoning tasks. This framework incorporates additional modules into LLMs to facilitate multi-round conversations and backtracking, allowing for a tree-like thought process similar to how humans explore solution spaces.

![tree-of-thought.jpg](attachment:95fc5372-92e3-431b-9e52-bbb136a779bf.jpg)

*Figure 2: Tree of Thought (Source: https://arxiv.org/pdf/2305.08291.pdf)*

To implement the ToT framework, several modules are added to the LLM architecture. These include a prompter agent, a checker module, a memory module, and a ToT controller. The prompter agent interacts with the LLM and guides the problem-solving process by providing prompts and receiving responses. The checker module evaluates the correctness of intermediate steps and provides feedback. The memory module records the conversation and state history, enabling the system to backtrack to previous steps and explore alternative paths. Finally, the ToT controller manages the flow of information and decision-making within the framework.
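
The interplay between these modules can be sketched on a toy problem. The code below is only an illustration under strong assumptions: `propose` is a hard-coded stand-in for the prompter agent and the LLM, `check` plays the role of the checker module, the `memory` stack plays the role of the memory module, and `tot_solve` acts as a simple depth-first controller. The toy task (pick distinct digits from 1 to 4 that sum to a target) is hypothetical and far simpler than Sudoku.

```python
from typing import Optional

State = tuple[int, ...]  # a partial solution: the digits chosen so far


def propose(state: State) -> list[int]:
    """Stand-in for the prompter + LLM: suggest candidate next steps."""
    return [d for d in range(1, 5) if d not in state]


def check(state: State, target: int, length: int) -> str:
    """Checker: classify a partial solution as 'solved', 'valid', or 'invalid'."""
    total = sum(state)
    if len(state) == length:
        return "solved" if total == target else "invalid"
    return "valid" if total < target else "invalid"


def tot_solve(target: int, length: int, max_steps: int = 1000) -> Optional[State]:
    """Controller: explore depth-first, backtracking through the memory stack."""
    # Each memory entry stores a state together with the moves not yet tried from it.
    memory: list[tuple[State, list[int]]] = [((), propose(()))]
    for _ in range(max_steps):
        if not memory:
            return None  # every branch was explored and rejected
        state, untried = memory[-1]
        if not untried:
            memory.pop()  # dead end: backtrack to the parent state
            continue
        candidate = state + (untried.pop(0),)
        verdict = check(candidate, target, length)
        if verdict == "solved":
            return candidate
        if verdict == "valid":
            memory.append((candidate, propose(candidate)))
    return None


solution = tot_solve(target=9, length=3)
print(solution)
```

A real controller could use a learned or rule-based policy to decide how far to backtrack; the fixed depth-first strategy here is the simplest possible choice.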

The ToT framework was evaluated through an implementation of a ToT-based solver for Sudoku puzzles. Experimental results showed that the framework increases the success rate of Sudoku solving. By leveraging the tree-like thought process and the ability to backtrack, the ToT-based solver exhibited improved problem-solving capabilities. The incorporation of the additional modules allowed the system to explore different directions in the problem-solving process, leading to enhanced performance.

The success of the ToT framework in improving the problem-solving capabilities of LLMs signifies its potential in addressing complex reasoning tasks. By emulating the trial-and-error approach, LLMs equipped with the ToT framework can effectively navigate through solution spaces, explore alternative paths, and achieve better results. This framework opens up new possibilities for LLMs to tackle a wide range of problem-solving tasks and paves the way for the development of more advanced and efficient AI systems.

# 5. Reflexion: Reinforcement through Linguistic Feedback

The Reflexion framework <a class='anchor' id='ref_shin2023reflexion'></a> [[6]](#shinn2023reflexion) introduces a novel approach to reinforcing large language models by leveraging linguistic feedback instead of traditional weight updates through reinforcement learning. In the Reflexion framework, LLMs engage in verbal reflection on task feedback signals and maintain their own reflective text in an episodic memory buffer. This approach aims to induce better decision-making in subsequent trials by leveraging the accumulated knowledge and insights gained through linguistic feedback.

One of the key strengths of the Reflexion framework lies in its flexibility in incorporating various types and sources of feedback signals. It can handle scalar values or free-form language as feedback signals, and these signals can come from external sources or be internally simulated. This flexibility allows the framework to adapt to different task requirements and utilize feedback in a way that maximizes performance and learning.

![RL.jpg](attachment:ec220bb7-a8b7-4aaf-bf0b-aafe9e379508.jpg)

*Figure 3: Reflexion works on decision-making, programming, and reasoning tasks. (Source: https://arxiv.org/pdf/2303.11366.pdf)*
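
The trial, feedback, and reflection cycle described above can be sketched as a small loop. Everything in this example is a hypothetical stand-in: `actor` replaces the LLM that produces an attempt, `evaluator` replaces the feedback source (here a single unit test returning free-form text), and `reflect` replaces the LLM-written self-reflection. No model weights are updated; only the episodic memory changes between trials.

```python
import operator
from typing import Callable

Attempt = Callable[[int, int], int]


def actor(memory: list[str]) -> Attempt:
    """Stand-in for the LLM policy: its behaviour depends on past reflections."""
    if any("use addition" in note for note in memory):
        return operator.add
    return operator.sub  # first, flawed attempt


def evaluator(attempt: Attempt) -> tuple[bool, str]:
    """Feedback signal: a unit test that returns free-form text on failure."""
    got = attempt(2, 3)
    if got == 5:
        return True, "all tests passed"
    return False, f"expected 5 for inputs (2, 3) but got {got}"


def reflect(feedback: str) -> str:
    """Stand-in for the verbal self-reflection normally written by the LLM."""
    return f"Previous attempt failed ({feedback}); use addition, not subtraction."


def reflexion_loop(max_trials: int = 3):
    memory: list[str] = []  # episodic buffer of reflective text
    for trial in range(1, max_trials + 1):
        attempt = actor(memory)
        success, feedback = evaluator(attempt)
        if success:
            return attempt, trial, memory
        # Store the reflection so the next trial can condition on it.
        memory.append(reflect(feedback))
    return None, max_trials, memory


result, trials, memory = reflexion_loop()
print(trials, memory)
```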

Experimental results have shown that Reflexion outperforms traditional reinforcement learning methods across diverse tasks. For example, in sequential decision-making, coding, and language reasoning tasks, Reflexion achieved significant improvements over baseline agents. Notably, it achieved a remarkable 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art performance achieved by GPT-4 with an accuracy of 80%. These results demonstrate the efficacy of the Reflexion framework in leveraging linguistic feedback to enhance decision-making and improve performance in a range of tasks.

The Reflexion framework opens up new possibilities for training and reinforcing LLMs without relying solely on weight updates. By leveraging linguistic feedback and maintaining an episodic memory buffer, LLMs can effectively learn from their own reflective text and incorporate insights gained from previous experiences. This approach not only enhances the performance of LLMs but also provides a more interpretable and trustworthy decision-making process. The Reflexion framework exemplifies the potential of linguistic feedback in advancing the capabilities of LLMs and contributes to the development of responsible and effective AI systems.

# 6. ReAct: Interleaving Reasoning and Action

The ReAct approach <a class='anchor' id='ref_yao2023react'></a> [[7]](#yao2023react) aims to integrate the generation of reasoning traces and task-specific actions in large language models to enhance their overall performance and capabilities. Unlike previous studies that have primarily focused on reasoning or acting as separate components, ReAct explores the synergistic benefits of interleaving both aspects in LLMs.

By interleaving reasoning traces and task-specific actions, ReAct allows LLMs to induce, track, and update action plans based on the reasoning process. This integration enables the model to handle exceptions, gather additional information from external sources such as knowledge bases or environments, and make informed decisions. Reasoning traces provide a valuable framework for the LLM to understand and update action plans, while actions allow the model to effectively interface with the external world.

![react.png](attachment:a727eed2-0f14-460c-88c9-b2abfeeef2c5.png)

*Figure 4: ReAct example (Source: https://arxiv.org/pdf/2210.03629.pdf)*
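
The interleaving of thoughts, actions, and observations can be sketched as a loop. This is a toy illustration under explicit assumptions: `policy` is a scripted stand-in for the LLM, `KNOWLEDGE_BASE` is a made-up dictionary standing in for an external source such as a Wikipedia API, and the entities in it are fictional.

```python
# Toy stand-in for an external knowledge source (fictional entries).
KNOWLEDGE_BASE = {
    "author of The Example Book": "Jane Doe",
    "birthplace of Jane Doe": "Springfield",
}


def policy(question: str, observations: list[str]) -> tuple[str, str, str]:
    """Scripted stand-in for the LLM: returns (thought, action, argument)."""
    if not observations:
        return ("I need to find the author first.", "lookup",
                "author of The Example Book")
    if len(observations) == 1:
        author = observations[0]
        return (f"The author is {author}; now I need the birthplace.", "lookup",
                f"birthplace of {author}")
    return ("I have enough information to answer.", "finish", observations[-1])


def react(question: str, max_steps: int = 5):
    observations: list[str] = []
    trace: list[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, observations)
        trace.append(f"Thought: {thought}")
        trace.append(f"Action: {action}[{arg}]")
        if action == "finish":
            return arg, trace
        # Acting on the environment yields an observation that feeds the next thought.
        obs = KNOWLEDGE_BASE.get(arg, "no result")
        observations.append(obs)
        trace.append(f"Observation: {obs}")
    return None, trace


answer, trace = react("Where was the author of The Example Book born?")
print("\n".join(trace))
```

The printed trace is what makes the trajectory inspectable: each action is preceded by the reasoning that motivated it and followed by the evidence it returned.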

Experimental results have demonstrated the effectiveness of the ReAct approach across a diverse set of language and decision-making tasks. Compared to state-of-the-art baselines, ReAct performs better while also improving interpretability and trustworthiness. For instance, in question answering tasks like HotpotQA and fact verification tasks like FEVER, ReAct overcomes issues such as hallucination and error propagation that are common in chain-of-thought reasoning methods. By interacting with a simple Wikipedia API, ReAct generates human-like task-solving trajectories that are more interpretable than baseline methods without reasoning traces.

Furthermore, ReAct excels in interactive decision-making benchmarks such as ALFWorld and WebShop. It outperforms imitation and reinforcement learning methods by a significant margin, even when prompted with just one or two in-context examples. These results highlight the superiority of the ReAct approach in combining reasoning and action components, demonstrating its potential to enhance the interpretability and trustworthiness of LLMs.

The ReAct approach opens up avenues for developing more robust and capable LLMs that can effectively reason, plan actions, and interact with external sources to solve complex language and decision-making tasks. By integrating reasoning and action in an interleaved manner, ReAct not only improves the performance of LLMs but also enhances their human interpretability and trustworthiness. The findings from applying ReAct to various tasks provide valuable insights into the benefits of combining reasoning and action components, paving the way for further advancements in the field of LLM research and development.

# 7. Tabular Math Word Problems: Handling Complex Mathematical Reasoning

The Tabular Math Word Problems (TabMWP) dataset is designed to evaluate the abilities of large pre-trained language models in handling complex mathematical reasoning tasks that involve both textual and tabular data. TabMWP presents a unique challenge as it requires LLMs to reason over heterogeneous information sources. Each question in the dataset is aligned with a tabular context, presented as an image, semi-structured text, and a structured table. The dataset contains grade-level problems that necessitate multi-step mathematical reasoning processes.

![math-dataset.png](attachment:af30caa7-188e-487b-9d4e-0978b3687093.png)

*Figure 5: Two examples from the TabMWP dataset (Source: https://arxiv.org/pdf/2209.14610.pdf)*

In the evaluation of pre-trained models on TabMWP, it has been observed that traditional few-shot approaches, such as few-shot GPT-3, face difficulties due to their reliance on in-context examples. The performance of few-shot GPT-3 on complex problems like TabMWP tends to be unstable and can degrade to near chance levels. This instability highlights the need for novel approaches that can effectively select relevant in-context examples to improve model performance.

To address this challenge, a novel approach called PromptPG <a class='anchor' id='ref_lu2023dynamic'></a> [[8]](#lu2023dynamic) has been proposed. PromptPG utilizes policy gradient techniques to learn the selection of in-context examples from a small amount of training data. By constructing appropriate prompts based on the selected in-context examples, PromptPG improves the accuracy of LLMs on TabMWP. Experimental results have shown that PromptPG outperforms the best baseline method by a significant margin, achieving higher accuracy and reducing prediction variance compared to random selection of in-context examples.

![prompt-GPT.png](attachment:a23da405-1649-4200-b79f-fc6dde29fa36.png)

*Figure 6: PromptPG (Source: https://arxiv.org/pdf/2209.14610.pdf)*
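
The idea of learning which in-context example to select with a policy gradient can be sketched with a deliberately simplified setup. This is not PromptPG itself: the policy here is a plain softmax over three candidate examples that ignores the test problem, and `simulated_reward` is a hypothetical stand-in for querying an LLM and checking its answer. The per-example success probabilities and the +1/-1 reward are assumptions made for the sketch.

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulated_reward(example_idx: int, rng: random.Random) -> float:
    """Hypothetical stand-in for 'did the LLM answer correctly with this example?'."""
    helpfulness = [0.2, 0.9, 0.4]  # assumed success probability per candidate
    return 1.0 if rng.random() < helpfulness[example_idx] else -1.0


def train_selector(n_candidates: int = 3, steps: int = 2000,
                   lr: float = 0.1, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    logits = [0.0] * n_candidates
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one in-context example from the current policy.
        choice = rng.choices(range(n_candidates), weights=probs)[0]
        reward = simulated_reward(choice, rng)
        # REINFORCE: d/d logit_j of log pi(choice) = 1[j == choice] - probs[j]
        for j in range(n_candidates):
            grad = (1.0 if j == choice else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)


probs = train_selector()
print([round(p, 3) for p in probs])
```

After training, most of the probability mass should sit on the candidate that most often leads to a correct answer, which is the behaviour a learned selector is meant to have compared with random selection.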

The findings from TabMWP and the success of the PromptPG approach highlight the importance of developing strategies to handle complex mathematical reasoning in LLMs. The dataset serves as a benchmark for evaluating LLMs' ability to reason over textual and tabular information, simulating real-world scenarios where mathematical reasoning is required. The novel PromptPG approach provides insights into effective selection strategies for in-context examples, improving the performance and stability of LLMs on complex mathematical reasoning tasks. These advancements contribute to the development of LLMs that can handle diverse problem-solving scenarios involving mathematical reasoning and heterogeneous information sources.

# 8. Multi-Chain Reasoning: Meta-Reasoning for QA

Multi-Chain Reasoning (MCR) <a class='anchor' id='ref_yoran2023answering'></a> [[9]](#yoran2023answering) is an approach that focuses on meta-reasoning over multiple chains of thought in the context of question-answering (QA) tasks. MCR aims to enhance the reasoning capabilities of large language models (LLMs) by explicitly modeling and reasoning over different chains of thought, enabling a more comprehensive exploration of the underlying information and relationships within a given QA context.

![answering.png](attachment:1756f388-f871-4f26-83f6-bba8571a9571.png)

*Figure 7: An overview of MCR, given a question from the FERMI dataset. (Source: https://arxiv.org/pdf/2304.13007.pdf)*

Through the application of MCR, significant improvements have been achieved in multi-hop QA tasks. By considering multiple chains of thought, MCR enables LLMs to generate more accurate and informative answers. Additionally, MCR provides higher-quality explanations by capturing the reasoning processes that lead to the final answer. These explanations can offer insights into the intermediate steps and reasoning paths taken by the model, enhancing transparency and interpretability.

One of the key contributions of MCR is its focus on considering the relations between intermediate steps and generating unified explanations. By explicitly modeling the relations between different reasoning steps, MCR enables LLMs to generate coherent and consistent explanations that provide a holistic view of the reasoning process. This not only enhances the interpretability of the model's decision-making but also promotes a more reliable and trustworthy AI system.
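
One way to picture meta-reasoning over several chains is to gather the intermediate steps of every chain into a single context that a second model call then reads. The sketch below only builds that context. The `build_meta_prompt` helper, the instruction wording, and the fictional chains are assumptions for illustration, not the prompt used in the paper.

```python
def build_meta_prompt(question: str, chains: list[list[str]]) -> str:
    """Merge the intermediate steps of several chains into one meta-reasoner prompt."""
    lines = [
        "Use the evidence gathered by the reasoning attempts below to answer "
        "the question, and explain which steps support the answer."
    ]
    for i, steps in enumerate(chains, start=1):
        lines.append(f"Chain {i}:")
        lines.extend(f"  - {step}" for step in steps)
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Fictional chains: each one surfaces a different piece of evidence.
chains = [
    ["Who wrote The Example Book? Jane Doe.",
     "Where was Jane Doe born? Springfield."],
    ["Who wrote The Example Book? Jane Doe.",
     "Which country is Springfield in? Exampleland."],
]
meta_prompt = build_meta_prompt(
    "In which country was the author of The Example Book born?", chains
)
print(meta_prompt)
```

Neither chain alone contains everything needed here, which is the situation where reading across chains is more useful than picking one chain's answer.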

The advancements brought by MCR in meta-reasoning for QA tasks highlight the importance of considering multiple chains of thought and reasoning paths in LLMs. By exploring different avenues of reasoning and generating unified explanations, MCR improves the model's ability to handle complex QA scenarios, where information is distributed across multiple pieces of evidence or multiple steps are required to arrive at the correct answer. This not only enhances the performance of LLMs in QA tasks but also contributes to the development of AI systems that provide transparent and reliable reasoning capabilities.

# 9. Complexity-Based Prompting: Effective Example Selection

Complexity-based prompting <a class='anchor' id='ref_fu2023complexity'></a> [[10]](#fu2023complexity) is a straightforward yet highly effective scheme for selecting reasoning examples when prompting large language models (LLMs). Rather than picking in-context exemplars arbitrarily, this approach favors exemplars whose chains of thought contain more reasoning steps. Exposing the model to these more complex demonstrations leads to substantial improvements on multi-step reasoning tasks.

![complexity-prompting.png](attachment:265bc78f-5655-41c3-bd27-1c6a94419520.png)

*Figure 8: A: Chains of thought (in blue) are intermediate reasoning steps towards a final answer. The input of CoT prompting is a stack of a few (often 8) CoT cases before a test question. Then the language model will continue generating an output CoT for the test question. B: Chains of harder reasoning complexity are chains with more reasoning steps (9 steps in this case, vs. only 2 steps in subfigure A). C: During decoding, we sample N reasoning chains from the language model (N = 5 here), and take the majority answer over the K (K = 3 here) most complex generated chains. (Source: https://arxiv.org/pdf/2210.00720.pdf)*

The application of complexity-based prompting has shown remarkable results in multi-step reasoning tasks. Prompts with higher complexity, involving more intricate and nuanced reasoning steps, have been found to significantly enhance the performance of LLMs. This demonstrates the importance of providing LLMs with challenging examples that require complex reasoning skills, enabling them to learn and generalize effectively.

Moreover, the complexity-based criterion has been extended beyond example selection (choosing inputs) to the decoding process (choosing outputs). During decoding, several reasoning chains are sampled from the model, and the final answer is the majority answer among the most complex chains rather than among all of them. This extension has further contributed to the state-of-the-art performance of LLMs on various benchmarks, highlighting the effectiveness of complexity-based prompting in improving the quality of reasoning.
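
Both uses of the complexity criterion can be sketched briefly. The code assumes that complexity is approximated by counting non-empty lines in a chain (one line per reasoning step); the helper names and the sample chains are hypothetical, and the sampled chains would normally come from an LLM.

```python
from collections import Counter


def complexity(chain: str) -> int:
    """Approximate reasoning complexity by the number of steps (non-empty lines)."""
    return sum(1 for line in chain.splitlines() if line.strip())


def select_exemplars(pool: list[str], n: int) -> list[str]:
    """Input side: keep the n exemplar chains with the most reasoning steps."""
    return sorted(pool, key=complexity, reverse=True)[:n]


def complexity_vote(samples: list[tuple[str, str]], k: int) -> str:
    """Output side: majority answer over the k most complex sampled chains."""
    top_k = sorted(samples, key=lambda s: complexity(s[0]), reverse=True)[:k]
    return Counter(answer for _, answer in top_k).most_common(1)[0][0]


# N = 5 sampled (chain, answer) pairs; the chain text is placeholder content.
samples = [
    ("step\nstep\nstep\nstep", "18"),
    ("step", "20"),
    ("step\nstep\nstep", "18"),
    ("step\nstep", "20"),
    ("step\nstep\nstep\nstep\nstep", "16"),
]
vote = complexity_vote(samples, k=3)
print(vote)
```

With these placeholder samples, a plain majority over all five chains would be tied between two answers, while restricting the vote to the three longest chains yields a single winner.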

Overall, complexity-based prompting serves as a valuable approach for selecting reasoning examples and enhancing the reasoning abilities of LLMs. By exposing LLMs to prompts with varying levels of complexity, this scheme enables the models to learn and generalize complex reasoning patterns. Furthermore, extending the complexity-based criteria to the decoding process enhances the overall performance of LLMs, making them more capable of handling diverse reasoning tasks and producing high-quality responses.

# 10. Conclusion

In conclusion, Large Language Models (LLMs) have transformed various domains by enabling powerful language generation and reasoning capabilities. However, ensuring sound reasoning in LLMs is of utmost importance to promote responsible and ethical AI systems. The challenges and implications associated with LLM reasoning highlight the need for community collaboration and experimentation to advance this field.

Through approaches like chain-of-thought prompting, the Tree-of-Thought framework, Reflexion, ReAct, PromptPG, multi-chain reasoning, and complexity-based prompting, significant progress has been made in enhancing reasoning abilities, problem-solving capabilities, reinforcement learning, and mathematical reasoning in LLMs. These approaches have demonstrated empirical gains, state-of-the-art performance, and improvements in diverse tasks.

Looking ahead, there are exciting prospects for further research in the field of LLM reasoning. Exploring novel techniques to enhance interpretability, explainability, and trustworthiness of LLMs remains a key focus. Additionally, investigating ways to address biases, ethical considerations, and the impact of LLM-generated content on society will be crucial for developing responsible AI systems. Collaborative efforts between researchers, practitioners, and policymakers will play a vital role in shaping the future direction of LLMs, ensuring their responsible deployment, and maximizing their benefits for society.

In summary, while LLMs have shown tremendous potential, the journey towards building reliable reasoning into these models is ongoing. Continued research, experimentation, and community collaboration will be instrumental in addressing the challenges and realizing the full potential of LLMs while ensuring their ethical and reliable use in various domains.

# 11. References

[1] <a class='anchor' id='brown2020language'></a> T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners, 2020.

[2] <a class='anchor' id='lewis2019Bart'></a> M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019.

[3] <a class='anchor' id='wei2023chainofthought'></a> J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.

[4] <a class='anchor' id='long2023large'></a> J. Long. Large language model guided tree-of-thought, 2023.

[5] <a class='anchor' id='yao2023tree'></a> S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023.

[6] <a class='anchor' id='shinn2023reflexion'></a> N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.

[7] <a class='anchor' id='yao2023react'></a> S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models, 2023.

[8] <a class='anchor' id='lu2023dynamic'></a> P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpurohit, P. Clark, and A. Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning, 2023.

[9] <a class='anchor' id='yoran2023answering'></a> O. Yoran, T. Wolfson, B. Bogin, U. Katz, D. Deutch, and J. Berant. Answering questions by meta-reasoning over multiple chains of thought, 2023.

[10] <a class='anchor' id='fu2023complexity'></a> Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot. Complexity-based prompting for multi-step reasoning, 2023.

```python
# Submission
import pandas as pd

# Load the competition's sample submission and fill in the second column (index 1).
df = pd.read_csv('/kaggle/input/2023-kaggle-ai-report/sample_submission.csv')
df.iloc[0, 1] = 'Text Data'  # essay category
df.iloc[1, 1] = 'https://www.kaggle.com/code/flaussy/large-language-models-reasoning-ability'  # essay URL
# URLs of the three peer-feedback comment threads
df.iloc[2, 1] = 'https://www.kaggle.com/code/abireltaief/contemporary-large-language-models-llms/comments'
df.iloc[3, 1] = 'https://www.kaggle.com/code/jayitabhattacharyya/building-llms-from-scratch-generative-ai-report/comments'
df.iloc[4, 1] = 'https://www.kaggle.com/code/narendra143/do-you-know-large-language-models/comments'

# Write the submission file and preview it.
df.to_csv('submission.csv', index=False)
df.head()
```

| | type | value |
|---|---|---|
| 0 | essay_category | Text Data |
| 1 | essay_url | https://www.kaggle.com/code/flaussy/large-lang... |
| 2 | feedback1_url | https://www.kaggle.com/code/abireltaief/contem... |
| 3 | feedback2_url | https://www.kaggle.com/code/jayitabhattacharyy... |
| 4 | feedback3_url | https://www.kaggle.com/code/narendra143/do-you... |
