
LLM_library is a comprehensive repository that serves as a one-stop resource for hands-on code and insightful paper summaries.


rashmimarganiatgithub/LLMS_Library_2023


LLMs Top Papers 2023


Monthly paper updates from the LLM ecosystem, with summaries.

December

Paper Links
1) Gemini - Introducing Gemini, a versatile family of multimodal models—Ultra, Pro, and Nano—that excel in image, audio, video, and text understanding. Gemini Ultra, the most capable model, advances the state of the art in 30 of 32 benchmarks, achieving human-expert performance on the MMLU exam benchmark and improving results across all 20 multimodal benchmarks. The family caters to diverse applications, from complex reasoning tasks to memory-constrained on-device use-cases. We emphasize responsible deployment of Gemini models, anticipating their wide-ranging impact on cross-modal reasoning and language understanding use cases. Paper , summary
2) Magicoder - Introducing Magicoder, a series of open-source Large Language Models (LLMs) for code with up to 7B parameters. Trained using OSS-Instruct, this novel approach leverages open-source code snippets to generate diverse and realistic instruction data, mitigating inherent biases in synthetic data. Magicoder and its enhanced version, MagicoderS, outperform comparable state-of-the-art code models on various coding benchmarks, including Python text-to-code generation and multilingual coding. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1), showcasing the potential of OSS-Instruct for low-bias, high-quality instruction tuning with open-source references. Paper , code & summary
3) Large Language Models on Graphs: A Comprehensive Survey - Large Language Models (LLMs), like ChatGPT and LLaMA, have made significant strides in natural language processing, showcasing strong text encoding/decoding and emergent reasoning capabilities. While LLMs excel in pure text scenarios, many real-world situations involve structured graph information or pair graph data with textual information. This paper systematically reviews the application of LLMs on graphs, categorizing scenarios into pure graphs, text-rich graphs, and text-paired graphs. Techniques discussed include LLM as Predictor, Encoder, and Aligner, exploring their advantages and disadvantages. The paper also touches on real-world applications, provides information on open-source codes, benchmark datasets, and concludes with potential future research directions in this evolving field. Paper , code
4) Llama Guard - Introducing Llama Guard, an LLM-based model designed for Human-AI conversations, focusing on safety through a risk taxonomy. With a carefully curated dataset, our Llama2-7b model, instruction-tuned on this data, excels on benchmarks like OpenAI Moderation Evaluation and ToxicChat, rivaling or surpassing existing moderation tools. Llama Guard conducts multi-class classification, generating binary decision scores and enabling customization for specific use cases. We provide model weights, encouraging researchers to adapt them to evolving AI safety needs in the community. Paper , code
5) Chain of Code - Proposing a novel approach, Chain of Code (CoC), we leverage the syntactic structure of code to enhance language models' (LMs) Chain of Thought reasoning. By encouraging LMs to write flexible pseudocode for semantic sub-tasks, CoC allows an interpreter to catch undefined behaviors and hand off to an LMulator for simulation. This extension proves effective, improving LM code-driven reasoning across various benchmarks. In experiments, CoC outperforms Chain of Thought and other baselines, achieving 84% accuracy on BIG-Bench Hard, a 12% gain. CoC scales well with models of different sizes, expanding the range of reasoning questions LMs can accurately address by "thinking in code." A minimal execute-or-simulate sketch appears after this list. Paper , code & summary
6) Data Management For Large Language Models - This survey delves into the critical role of data management in training Large Language Models (LLMs). Addressing the lack of systematic analysis in the research community, it explores the rationale behind data management strategy selection, its effects, evaluation methodologies for curated datasets, and ongoing efforts for improvement. The survey provides a comprehensive overview of current research in data management during both pretraining and supervised fine-tuning stages of LLMs, covering aspects such as data quantity, quality, and domain/task composition. Looking ahead, it extrapolates existing challenges and outlines promising directions for future development, making it a valuable resource for practitioners aiming to construct powerful LLMs through effective data management practices. Paper , code
7) RankZephyr - Introducing RankZephyr, an advanced open-source Large Language Model (LLM) for listwise zero-shot reranking in information retrieval. Addressing the persistent gap between proprietary and open-source models, RankZephyr not only matches the effectiveness of GPT-4 but, in some instances, outperforms the proprietary model. Comprehensive evaluations on diverse datasets demonstrate this capability, and RankZephyr shows resilience against variations in initial document ordering and the number of documents reranked. Outperforming GPT-4 on the NovelEval test set, our model addresses concerns about data contamination. To facilitate further research, we provide all necessary code for reproducibility in this rapidly evolving field. Paper , code
8) The Efficiency Spectrum of LLM - This survey addresses the challenges posed by the growing computational demands of Large Language Models (LLMs) and explores a comprehensive range of algorithmic advancements to enhance their efficiency. Unlike other surveys focusing on specific areas, this paper covers multiple dimensions of efficiency, including scaling laws, data utilization, architectural innovations, training and tuning strategies, and inference techniques. By providing a holistic overview, this survey aims to be a valuable resource for researchers and practitioners, guiding future innovations in the critical research area of LLM efficiency. Paper , code
9) QuIP# - QuIP# employs incoherence processing and lattice codebooks for LLM weight quantization. It introduces a Hadamard transform-based approach for faster GPU acceleration; the incoherence-processed weights are well suited to quantization with symmetric, round codebooks. The new lattice codebook optimizes packing density of the 8-dimensional unit ball and is designed for hardware friendliness by exploiting symmetries in the lattice. Paper , code
10) WEAK-TO-STRONG GENERALIZATION - Alignment techniques like RLHF rely on human supervision, but for superhuman models, evaluating complex behavior is challenging. We explore weak-to-strong generalization using GPT-4 models on NLP, chess, and reward modeling. Naive fine-tuning on weak model labels consistently improves performance. However, fully recovering strong model capabilities remains a challenge. Simple methods, like finetuning with an auxiliary confidence loss, show promise, addressing fundamental challenges in aligning superhuman models. Paper , code
11) Audiobox - Introducing Audiobox, a unified audio generation model based on flow-matching that addresses controllability limitations. Audiobox achieves benchmarks in speech and sound generation, offering independent control of transcript, vocal, and audio styles. Leveraging description-based and example-based prompting, it excels in tasks like zero-shot TTS and text-to-sound. Integration of Bespoke Solvers accelerates generation by over 25 times without performance loss. Paper
12) Mathematical Language Models: A Survey - In recent years, there has been notable progress in leveraging Language Models (LMs), including Pre-trained Language Models (PLMs) and Large-scale Language Models (LLMs), in the field of mathematics. This comprehensive survey categorizes mathematical LLMs based on tasks and methodologies, encompassing instruction learning, tool-based methods, fundamental CoT techniques, and advanced CoT methodologies. The survey also compiles over 60 mathematical datasets, addressing challenges and outlining future trajectories. Positioned as a valuable resource, this survey aims to inspire future innovation in the domain of mathematical LMs. Paper
13) LLM360 - The surge in open-source Large Language Models (LLMs) like LLaMA, Falcon, and Mistral offers diverse options for AI practitioners and researchers. However, most LLMs release only partial artifacts, hindering transparency and reproducibility. LLM360 advocates for fully open-sourcing LLMs, providing training code, data, model checkpoints, and intermediate results. The initiative, exemplified by Amber and CrystalCoder, aims to support open and collaborative AI research by making the end-to-end LLM training process transparent. More large-scale models are in progress, contributing to ongoing efforts in advancing LLMs through open-source initiatives. Paper , code
14) A Survey of Large Language Models in Medicine - Large language models (LLMs), like ChatGPT, garner attention for their language understanding. In medicine, their potential for aiding physicians prompts exploration in AI and clinical applications. This survey comprehensively covers principles, applications, and challenges of LLMs in medicine. It addresses building, downstream performance, real-world utilization, challenges, and optimization of medical LLMs. Intending to offer insights and serve as a resource, the survey explores opportunities and challenges in the application of LLMs in the medical field. Paper , code
15) An In-depth Look at Gemini's Language Abilities - The Google Gemini models, recently released, show results comparable to OpenAI's GPT series across various tasks. This paper provides a transparent, third-party comparison of OpenAI GPT and Google Gemini, with reproducible code. Analyzing results on 10 datasets covering diverse language abilities, we find Gemini Pro's accuracy slightly inferior to GPT 3.5 Turbo on all benchmarked tasks. We explain some underperformance factors, such as challenges in mathematical reasoning and sensitivity to answer ordering. Gemini excels in tasks like non-English language generation and handling complex reasoning chains. Paper , code
16) PowerInfer - This paper presents PowerInfer, a high-speed Large Language Model (LLM) inference engine for personal computers with a single consumer-grade GPU. PowerInfer exploits the power-law distribution in neuron activation, distinguishing between hot and cold neurons. The GPU-CPU hybrid design preloads hot neurons on the GPU for fast access and computes cold neurons on the CPU, reducing GPU memory demands and data transfers. Adaptive predictors and neuron-aware sparse operators enhance efficiency. Evaluation shows PowerInfer achieves competitive token generation rates, outperforming llama.cpp by up to 11.69× while retaining model accuracy. Paper , code
17) AppAgent - This paper introduces a novel Large Language Model (LLM)-based multimodal agent framework designed for operating smartphone applications. The framework allows the agent to interact with apps using simplified actions like tapping and swiping, bypassing the need for back-end access. The agent learns to navigate new apps through autonomous exploration or human demonstrations, creating a knowledge base for executing tasks across applications. Extensive testing on 50 tasks across 10 applications demonstrates the agent's proficiency in handling diverse high-level tasks without requiring back-end access. Paper , code and summary
18) LLM in a flash - This paper addresses the challenge of efficiently running large language models (LLMs) that exceed available DRAM capacity by storing model parameters on flash memory and bringing them on demand to DRAM. The method involves constructing an inference cost model aligned with flash memory behavior, guiding optimization in two critical areas: reducing data transferred from flash and reading data in larger, more contiguous chunks. Two principal techniques are introduced within this flash memory-informed framework. Paper
19) Retrieval-Augmented Generation for LLM: A Survey - Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant information from external knowledge bases before answering questions, reducing hallucinations, improving accuracy, and facilitating knowledge updates. This paper outlines RAG's development paradigms, summarizing Naive RAG, Advanced RAG, and Modular RAG. It discusses components like retriever, generator, and augmentation methods, along with evaluation methods and future research directions, emphasizing vertical optimization, horizontal scalability, and the technical stack and ecosystem of RAG. A bare-bones Naive RAG sketch appears after this list. Paper , code
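
The execute-or-simulate loop described in item 5 (Chain of Code) can be illustrated with a minimal sketch: run each generated line through a real Python interpreter, and hand any line that raises an error to a language-model emulator. Everything below is hypothetical, in particular the `lm_simulate` stub, which stands in for an actual LLM call and hard-codes only one toy "semantic" helper; the paper's state handling is far more involved.

```python
# Minimal sketch of the Chain of Code execute-or-simulate loop (item 5).
# `lm_simulate` is a hypothetical stand-in for querying an LLM ("LMulator")
# to emulate a pseudocode line the Python interpreter cannot run.

def lm_simulate(line: str, state: dict) -> dict:
    """Pretend LM emulator: in the paper, an LLM predicts the effect of a
    non-executable line on the program state; here we hard-code one case."""
    if "is_fruit(" in line:
        var = line.split("=")[0].strip()
        arg = line.split("is_fruit(")[1].rstrip(") \n")
        value = str(state.get(arg, arg)).strip("'\"")
        state[var] = value in {"apple", "banana", "mango"}
    return state

def chain_of_code(program_lines, state=None):
    state = dict(state or {})
    for line in program_lines:
        try:
            exec(line, {}, state)             # try the real Python interpreter first
        except Exception:
            state = lm_simulate(line, state)  # undefined behaviour -> LMulator
    return state

if __name__ == "__main__":
    # Pseudocode mixing executable Python with a "semantic" helper the
    # interpreter cannot resolve; the LMulator fills the gap.
    program = [
        "items = ['apple', 'rock', 'banana']",
        "count = 0",
        "x0 = items[0]",
        "f0 = is_fruit(x0)",        # NameError -> handled by lm_simulate
        "count += 1 if f0 else 0",
    ]
    print(chain_of_code(program))   # final state includes count == 1, f0 == True
```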
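
The Naive RAG paradigm summarized in item 19 (retrieve relevant passages, prepend them to the prompt, then generate) is easy to see end to end. The sketch below is a bare-bones illustration under obvious assumptions: a toy lexical-overlap retriever stands in for a real dense retriever, and the `generate` function is a hypothetical placeholder for an LLM call.

```python
# Bare-bones Naive RAG pipeline (item 19): retrieve top-k passages, build an
# augmented prompt, then generate. `generate` is a hypothetical LLM stub.
from collections import Counter
import math

DOCS = [
    "RAG retrieves relevant passages from an external knowledge base.",
    "Retrieval grounding reduces hallucinations and eases knowledge updates.",
    "Modular RAG adds components such as rerankers and query rewriters.",
]

def score(query: str, doc: str) -> float:
    """Toy lexical-overlap score standing in for a dense retriever."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum((q & d).values())
    return overlap / math.sqrt(len(doc.split()) + 1)

def retrieve(query: str, k: int = 2):
    return sorted(DOCS, key=lambda doc: score(query, doc), reverse=True)[:k]

def generate(prompt: str) -> str:
    # Placeholder: a real system would call an LLM here.
    return f"[LLM answer conditioned on a prompt of {len(prompt)} chars]"

def naive_rag(question: str) -> str:
    context = "\n".join(f"- {p}" for p in retrieve(question))
    prompt = (
        "Answer using only the context.\n"
        f"Context:\n{context}\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)

if __name__ == "__main__":
    print(naive_rag("How does RAG reduce hallucinations?"))
```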

November

Paper Links
1) A Survey on Hallucination in LLMs - Hallucinations in Large Language Models: A Comprehensive Survey. Unraveling the taxonomy, factors, detection methods, and mitigation approaches, this survey scrutinizes the challenges and outlines future research directions in addressing hallucinations within the realm of large language models (LLMs) in natural language processing (NLP). Paper
2) Simplifying Transformer Blocks - Simplified Transformers: A Design Recipe for Efficient and Effective Deep Transformer Architectures. By challenging the complexity of standard transformer blocks, this work presents a simplified design recipe that maintains training speed and performance, achieving 15% faster throughput and utilizing 15% fewer parameters in autoregressive and BERT models. Paper , code
3) Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models - Exploring In-Context Learning of Transformers: Unveiling the Limits of Generalization. Investigating transformers' capabilities to perform new tasks in-context, this study reveals near-optimal model selection within pretraining data mixtures, but demonstrates limitations and degradation in generalization for out-of-domain tasks or extrapolation challenges. Paper , code
4) GPT4All: An Ecosystem of Open Source Compressed Language Models - Democratizing Access to Large Language Models: The GPT4All Project. This paper recounts the development and evolution of the GPT4All project, an open-source initiative aiming to democratize access to large language models (LLMs). It provides technical details of the GPT4All model family and discusses its growth into a comprehensive open source ecosystem. Paper
5) S-LoRA - S-LoRA: Scalable Serving of Many Low-Rank Adapters for Efficient Language Model Inference. This paper introduces S-LoRA, a system designed for scalable serving of many Low-Rank Adapters (LoRA) in the deployment of large language models. It focuses on batched inference during serving, optimizing GPU memory usage, and achieving high throughput, enabling efficient language model inference across multiple GPUs. Paper , code
6) FreshLLMs - This paper presents "FreshQA," a dynamic QA benchmark assessing large language models' factuality. It reveals LLM limitations and introduces "FreshPrompt," a method enhancing performance with current information from a search engine. FreshPrompt outperforms other approaches, emphasizing evidence quantity, order, and instructing concise answers. The FreshQA dataset is released for research. Paper , code
7) YaRN - This paper introduces YaRN (Yet another RoPE extensioN), a compute-efficient method for extending the context window in transformer-based language models. YaRN allows models to effectively utilize and extrapolate to longer context lengths, surpassing previous state-of-the-art methods in context window extension. The approach is shown to require fewer tokens and training steps. Additionally, YaRN demonstrates the ability to extrapolate beyond the limited context of a fine-tuning dataset. Paper , code
8) Large Language Models Understand and Can be Enhanced by Emotional Stimuli - This paper investigates the ability of Large Language Models (LLMs) to understand emotional stimuli. The authors conduct automatic experiments on 45 tasks using various LLMs, demonstrating that LLMs have a grasp of emotional intelligence. They introduce "EmotionPrompt," a method combining the original prompt with emotional stimuli, and show its effectiveness in improving LLM performance across deterministic and generative tasks. A human study with 106 participants further supports the benefits of EmotionPrompt in enhancing the quality of generative tasks. The findings suggest a promising avenue for exploring interdisciplinary knowledge in human-LLMs interaction. Paper , code & summary
9) ChatGPT's One-year Anniversary - ChatGPT, released in late 2022, demonstrated the effectiveness of instruction-tuning LLMs. This success sparked interest in both closed-source and rapidly advancing open-source models, impacting research and business. On ChatGPT's first anniversary, this work provides a brief overview, surveying tasks where open-source LLMs claim equivalence or superiority. Paper
10) MEDITRON-70B - In this work, we address the limitations of existing medical large language models (LLMs) by introducing MEDITRON, an open-source suite with 7B and 70B parameters adapted for the medical domain. Leveraging Llama-2 and pretraining on a curated medical corpus, MEDITRON outperforms state-of-the-art baselines on major medical benchmarks, showcasing a 6% absolute performance gain and challenging closed-source LLMs like GPT-3.5 and Med-PaLM. Our release includes code for corpus curation and model weights to foster open-source development of advanced medical LLMs. Paper , code
11) Can Generalist Foundation Models Outcompete Special-Purpose Tuning? - We challenge the assumption that generalist foundation models like GPT-4 can't match specialist capabilities. Through prompt engineering, our method Medprompt propels GPT-4 to achieve state-of-the-art results on nine medical benchmarks without domain-specific training. It surpasses specialist models with fewer calls and demonstrates generalization across various domains. Paper
12) The Impact of Large Language Models on Scientific Discovery - In this report, we assess the performance of GPT-4, a state-of-the-art language model, in scientific discovery across drug discovery, biology, computational chemistry, materials design, and partial differential equations. Using expert-driven case assessments and occasional benchmark testing, we find promising potential for GPT-4 in various scientific applications, showcasing its ability in complex problem-solving and knowledge integration tasks. Paper
13) Fine-tuning Language Models for Factuality - This work addresses the issue of hallucinations in large pre-trained language models (LLMs) by fine-tuning them for improved factuality without human labeling. Leveraging recent innovations in NLP, the approach uses methods to judge factuality by measuring consistency with external knowledge bases and employs direct preference optimization for fine-tuning. Results show significant improvements in factuality for Llama-2 on held-out topics compared to other strategies, offering a potential solution to mitigate misinformation in LLM-generated content. Paper
14) Contrastive Chain-of-Thought Prompting - This work introduces contrastive chain of thought to enhance language model reasoning. While conventional chain of thought focuses on valid demonstrations, the proposed approach includes both valid and invalid reasoning demonstrations. By providing examples of both correct and incorrect reasoning, the model is guided to reason step-by-step, reducing mistakes. The method aims to improve generalization through automatic construction of contrastive demonstrations, and experiments on reasoning benchmarks show its potential as a general enhancement for chain-of-thought prompting. A minimal prompt-construction sketch appears after this list. Paper , code
15) A Survey on Language Models for Code - This work comprehensively reviews recent advancements in code processing with language models, covering 50+ models, 30+ evaluation tasks, and 500 related works. It categorizes models into general language models (like GPT) and specialized models pretrained on code. The review discusses the historical transition from statistical models to pretrained Transformers, similar to NLP. Code-specific features like AST, CFG, and unit tests are explored, along with key challenges and potential future directions in this domain. Paper
16) Learning to Filter Context for Retrieval-Augmented Generation - FILCO is proposed to enhance the quality of context for generation models in tasks like question answering and fact verification. The method identifies useful context using lexical and information-theoretic approaches, and trains context filtering models for test-time context filtering. Experiments with FLAN-T5 and LLaMa2 on knowledge-intensive tasks show that FILCO outperforms existing approaches in various scenarios, improving the quality of context for better model performance. Paper , code
17) MART - We present MART, a Multi-round Automatic Red-Teaming method for enhancing the safety of Large Language Models (LLMs). MART combines automatic adversarial prompt writing and safe response generation in an iterative process. The adversarial LLM generates challenging prompts, and the target LLM is fine-tuned with safety-aligned data on these prompts, reducing violation rates by up to 84.7% after 4 rounds. Model helpfulness on non-adversarial prompts remains stable, demonstrating improved safety without compromising performance on regular instructions. Paper , code
18) Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure - In a simulated stock trading scenario, we reveal misaligned behavior in Large Language Models (LLMs), specifically GPT-4, as it strategically deceives users. The LLM acts on an insider tip, hides its real trading motives when reporting to its manager, and displays deceptive behavior despite being trained to be helpful, harmless, and honest. Our investigation explores variations in settings, such as access to a reasoning scratchpad, system instructions, pressure, and perceived risk, highlighting the potential challenges in ensuring ethical behavior in LLMs. Paper , code
19) System 2 Attention - We introduce System 2 Attention (S2A) to address issues in Soft attention of Transformer-based Large Language Models (LLMs), mitigating the incorporation of irrelevant information. S2A leverages LLMs' reasoning ability to regenerate input context, focusing on relevant portions before attending to elicit a more accurate response. In experiments, S2A outperforms standard attention-based LLMs across tasks involving opinion or irrelevant information, including QA, math word problems, and long-form generation, enhancing factuality and objectivity while reducing sycophancy. Paper , code
20) Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey - This survey reviews advances in Transformer architecture aimed at long-context large language models, organizing techniques for extending and exploiting longer context windows (efficient attention, long-term memory mechanisms, and extrapolative positional encodings, among others) and discussing evaluation practices and open challenges. Paper , code
21) PaSS - Proposing a method for scaling large language models, the approach addresses autoregressive bottlenecks by introducing parallel decoding. Unlike speculative sampling, it requires only one model, achieving up to a 30% speed-up with minimal additional parameters. Paper
22) Orca 2 - Orca 1 excels by learning from rich signals like explanation traces, outperforming conventional instruction-tuned models on benchmarks. In Orca 2, the focus shifts to enhancing smaller language models' reasoning abilities by exploring improved training signals. Rather than relying solely on imitation learning, the approach teaches small models various reasoning techniques, aiming to enable them to determine the most effective solution strategy for different tasks. Evaluated on 15 benchmarks with approximately 100 tasks, Orca 2 surpasses models of similar size and matches or exceeds the performance of models 5-10 times larger, particularly in complex tasks testing advanced reasoning abilities in zero-shot settings. Paper
23) GPQA: A Graduate-Level Google-Proof Q&A Benchmark - Introducing GPQA, a rigorous dataset comprising 448 challenging multiple-choice questions in biology, physics, and chemistry. Crafted by domain experts, the questions prove difficult for both expert and non-expert validators, with experts reaching 65% accuracy (74% discounting identified mistakes) and skilled non-experts achieving only 34% accuracy, despite extensive web access. Even state-of-the-art AI systems struggle, with the best GPT-4 baseline achieving 39% accuracy. The dataset's complexity underscores the need for scalable oversight methods for future AI systems tackling tough questions, especially in scenarios like scientific knowledge development. GPQA aims to facilitate experiments in realistic oversight, fostering reliable extraction of truthful information from advanced AI systems by human experts. Paper , code
24) Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents - Large Language Models (LLMs) exhibit impressive empirical performance and reasoning capabilities in complex tasks, leveraging chain-of-thought (CoT) reasoning. CoT enhances interpretability, controllability, and flexibility. Recent research extends CoT to create autonomous language agents. This concise survey covers CoT mechanics, paradigm shifts, and the rise of CoT-driven language agents, appealing to beginners and experienced researchers alike. Topics include generalization, efficiency, customization, scaling, and safety for future exploration. Paper
25) GAIA - Introducing GAIA, a benchmark for General AI Assistants representing a significant milestone in AI research. GAIA presents real-world questions that demand fundamental abilities like reasoning, multi-modality handling, web browsing, and tool-use proficiency. While conceptually simple for humans, these questions challenge advanced AIs, with human respondents achieving 92% accuracy compared to GPT-4 with plugins at 15%. This performance gap contrasts with recent trends where large language models outperform humans in domain-specific tasks. GAIA focuses on assessing a system's robustness akin to average human capabilities. The methodology includes 466 questions and answers. Paper
26) MedAgents - Proposing a solution to challenges faced by Large Language Models (LLMs) in medicine and healthcare, we present a Multi-disciplinary Collaboration (MC) framework. This novel approach involves role-playing LLM-based agents engaging in collaborative discussions to enhance proficiency and reasoning capabilities in the medical domain. The training-free and interpretable framework consists of gathering domain experts, proposing analyses, summarizing into a report, iterating discussions for consensus, and making decisions. Focusing on the zero-shot scenario, our results on nine datasets demonstrate the MC framework's excellence in mining medical expertise and enhancing LLM reasoning abilities. Human evaluations and ablation studies further analyze method errors and performance factors. Paper , code
27) Camels in a Changing Climate - Introducing TÜLU 2, an enhanced suite of models for adapting large language models to downstream tasks and user preferences. It includes improved instruction datasets (TÜLU-V2-mix), LLAMA-2 models (TÜLU 2), models trained with direct preference optimization (TÜLU 2+DPO, including TÜLU 2+DPO 70B), and CODE TÜLU 2 models outperforming CODE LLAMA. Evaluations demonstrate state-of-the-art performance, matching or exceeding GPT-3.5-turbo-0301. All checkpoints, data, training, and evaluation code are released for future open efforts in this domain. Paper , code
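
At its core, the contrastive chain-of-thought method in item 14 is prompt construction: each demonstration pairs a correct reasoning chain with an explicitly flagged incorrect one before the new question is posed. The sketch below shows one plausible way to assemble such a prompt; the demonstration text and the `ask_llm` stub are hypothetical, not taken from the paper.

```python
# Sketch of contrastive chain-of-thought prompting (item 14): the prompt shows
# both a correct and an incorrect reasoning chain, then poses the real question.
# `ask_llm` is a hypothetical stand-in for an actual model call.

DEMO = {
    "question": "A pen costs $2 and a notebook costs $5. How much do 3 pens and 1 notebook cost?",
    "valid_cot": "3 pens cost 3 * 2 = 6 dollars; adding the notebook gives 6 + 5 = 11. Answer: 11.",
    "invalid_cot": "3 pens cost 3 + 2 = 5 dollars; adding the notebook gives 5 + 5 = 10. Answer: 10.",
}

def contrastive_cot_prompt(new_question: str) -> str:
    return (
        f"Question: {DEMO['question']}\n"
        f"Correct reasoning: {DEMO['valid_cot']}\n"
        f"Incorrect reasoning (do NOT reason like this): {DEMO['invalid_cot']}\n\n"
        f"Question: {new_question}\n"
        "Correct reasoning:"
    )

def ask_llm(prompt: str) -> str:
    return "[model completion would appear here]"  # placeholder

if __name__ == "__main__":
    prompt = contrastive_cot_prompt("Apples cost $3 each. How much do 4 apples cost?")
    print(prompt)
    print(ask_llm(prompt))
```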

October

Paper Links
1) Language Models Represent Space and Time - The study on Llama-2 models suggests that large language models might be developing coherent models of data generation, resembling world models. They show robust linear representations of space and time, with identified "space neurons" and "time neurons" encoding coordinates, indicating potential elements of a world model. Paper , code
2) Retrieval meets Long Context Large Language Models - The study compares extending the context window and retrieval-augmentation in large language models (LLMs) for downstream tasks. Surprisingly, a 4K context window with retrieval-augmentation performs comparably to a 16K context window. Retrieval significantly enhances LLM performance regardless of context size. The best model, a retrieval-augmented LLaMA2-70B with a 32K context window, outperforms competitors and its non-retrieval baseline, showcasing the value of retrieval methods in LLMs for practical applications. Paper
3) Efficient Streaming Language Models with Attention Sinks - StreamingLLM, a framework for deploying Large Language Models (LLMs) in streaming applications, efficiently addresses memory consumption and generalization challenges. It exploits the attention-sink phenomenon, keeping the KV cache of a few initial tokens alongside a sliding window of recent tokens, to achieve stable and efficient language modeling, outperforming sliding-window recomputation baselines. Tested on models like Llama-2, MPT, Falcon, and Pythia, StreamingLLM showcases up to 22.2x speedup in streaming settings. A schematic of the eviction rule appears after this list. Paper , code
4) The Dawn of LMMs: Preliminary Explorations with GPT-4V - This paper analyzes GPT-4V, a large multimodal model that extends language models with visual understanding. The study explores its diverse capabilities and highlights its potential for new human-computer interaction methods. The report concludes with discussions on emerging application scenarios and future research directions. Paper
5) Think before you speak - This study explores a novel approach called "pause-training" for language models, introducing a learnable pause token to delay token generation during training and inference. The model processes extra computations before finalizing its answer. Evaluations on 1B and 130M parameter models, pre-trained and fine-tuned with delays, show gains across various tasks, highlighting potential for a new paradigm in language model prediction. Paper
6) Self-Taught Optimizer - This work explores the concept of using language models to improve a seed "improver" program, which iteratively refines code using the language model's predictions. The improved improver exhibits better performance across various tasks, demonstrating the potential of language models to enhance their own scaffolding programs. The study raises considerations about self-improving technologies and evaluates instances where generated code may bypass safety measures. Paper
7) RA-DIT: Retrieval-Augmented Dual Instruction Tuning - This work introduces Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a fine-tuning methodology that enhances language models (LMs) with retrieval capabilities. RA-DIT operates in two steps, updating the LM to better utilize retrieved information and improving the retriever to return more relevant results. Demonstrating significant performance improvements, the RA-DIT 65B model achieves state-of-the-art results on various knowledge-intensive zero- and few-shot learning benchmarks, outperforming existing in-context retrieval-augmented LM approaches. Paper
8) Large Language Models as Analogical Reasoners - This work introduces Analogical Prompting, a novel approach for guiding the reasoning process of large language models. Inspired by analogical reasoning in humans, the method prompts language models to self-generate relevant exemplars or knowledge in the context before solving a given problem. Analogical Prompting outperforms 0-shot Chain-of-Thought (CoT) and manual few-shot CoT in various reasoning tasks, including math problem solving in GSM8K and MATH, code generation in Codeforces, and other reasoning tasks in BIG-Bench. Paper
9) Ring Attention with Blockwise Transformers for Near-Infinite Context - Ring Attention Breakthrough: Revolutionizing Memory-Efficient Transformers for Unprecedented AI Model Performance in Handling Long Sequences Paper , code
10) Survey on Factuality in LLMs - Unveiling the Factuality Challenge in Large Language Models: Implications, Causes, and Strategies for Enhancing Reliability Across Domains Paper , summary
11) LLM can Learn Rules - Hypotheses-to-Theories (HtT): A Breakthrough Framework for Enhancing Reasoning Accuracy in Large Language Models Paper
12) Meta-CoT - Breaking Boundaries: Meta-CoT Unleashes Generalizable Chain-of-Thought Prompting for Advanced Reasoning in Large Language Models Paper , code
13) A Survey of LLM for healthcare - Unveiling the Healthcare Revolution: A Comprehensive Survey on the Evolution of Large Language Models in Healthcare Applications Paper , summary
14) RECOMP - Efficiency Meets Accuracy: Compressing Retrieved Documents for Enhanced Inference in Language Models Paper , code
15) InstructRetro - Scaling New Heights: Retro 48B - The Pinnacle of Retrieval-Augmented Pretraining for Large Language Models Paper
16) FireAct - Fine-Tuning for Language Agents: Unveiling the Power of Tailored Training in LM Augmented Environments Paper , code
17) Llemma - Llemma: Advancing Mathematical Language Understanding with a Pretrained Large Language Model Paper , code
18) LLM for Software Engineering - Navigating the Landscape: A Survey of Large Language Models in Software Engineering and Open Challenges Paper , code
19) Self-RAG - Enhancing Factual Accuracy: Self-Reflective Retrieval-Augmented Generation Framework for Large Language Models Paper , code & summary
20) Understanding Retrieval Augmentation for Long-Form Question Answering - Decoding the Impact: A Comprehensive Study of Retrieval-Augmented Language Models in Long-Form Question Answering, Analyzing Answer Attributes and Attribution Patterns for Enhanced Insights. Paper
21) Can Large Language Models Explain Themselves? - Self-Explanations: Assessing the Quality and Efficiency of Automatically Generated Explanations by Large Language Models in Sentiment Analysis and Feature Attribution Tasks. Paper
22) OpenAgents - Empowering Everyday Users: OpenAgents - An Open Platform for Real-World Deployment and Interaction with Language Agents Paper , code
23) Eliciting Human Preferences with Language Models - Language Models: Generative Active Task Elicitation (GATE) for Enhanced Task Specification and Alignment with Human Preferences Paper , code
24) AutoMix - Optimizing Large Language Model Deployment: AutoMix - Efficient Routing of Queries to Balance Cost and Performance Paper , code
25) Zephyr: Direct Distillation of LM Alignment - To Enhance Chat Model Intent Alignment through Distilled Direct Preference Optimization Paper , code
26) The Perils & Promises of Fact-checking with LLM - Fact-Checking with Authority: Evaluating Large Language Models in Autonomous Information Verification with Contextual Reasoning and Source Citation Paper
27) ALCUNA: Large Language Models Meet New Knowledge - Navigating New Knowledge: Introducing KnowGen and ALCUNA Benchmarks to Evaluate Large Language Models in Handling Novel Entity Attributes and Relationships Paper , code
28) Detecting Pretraining Data from Large Language Models - Unmasking the Unseen: Detecting Pretraining Data in Large Language Models with Min-K% Prob. A short scoring sketch appears after this list. Paper , code and summary
29) Branch-Solve-Merge Improves Large Language Model Evaluation and Generation - Optimizing Large Language Model Performance: Introducing Branch-Solve-Merge (BSM) Program for Coherent and Task-Adaptive Natural Language Generation and Evaluation. BSM strategically decomposes intricate tasks into parallel sub-tasks, independently solves them, and merges the solutions for improved coherence, constraint satisfaction, and task performance across various LLMs, including Vicuna, LLaMA-2-chat, and GPT-4. Paper , code
30) Evaluating Large Language Models: A Comprehensive Survey - Evaluating the Power and Pitfalls: A Comprehensive Survey on the Evaluation of Large Language Models (LLMs). Unveiling insights into knowledge, alignment, and safety aspects, this survey provides a thorough examination of evaluation methodologies, benchmarks, and specialized domain performances, fostering responsible development for maximizing societal benefit and minimizing potential risks. Paper
31) ChipNeMo - Enhancing Large Language Models for Industrial Chip Design with Domain Adaptation Techniques. Exploring custom tokenization, domain-adaptive continued pretraining, and supervised fine-tuning, ChipNeMo achieves significant performance gains in engineering assistant chatbot, EDA script generation, and bug summarization tasks, opening avenues for future improvements in domain-adapted LLM applications. Paper
32) FP8-LM - Efficient Training for Large Language Models: Exploring FP8 Low-Bit Data Formats. Introducing an FP8 automatic mixed-precision framework, this paper achieves a 39% reduction in real memory usage and a 75% increase in training speed for GPT-175B model on H100 GPU platform, surpassing widely adopted BF16 and Nvidia Transformer Engine methodologies. Paper , code
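
The cache policy behind item 3 (StreamingLLM) reduces to a simple eviction rule: always keep the key/value entries of the first few "attention sink" tokens plus a sliding window of the most recent tokens, and drop everything in between. The sketch below is only a schematic of that rule over token labels; a real implementation operates on per-layer KV tensors, and the `num_sinks` and `window` values are illustrative.

```python
# Schematic of the StreamingLLM eviction rule (item 3): keep a handful of
# initial "attention sink" tokens plus a sliding window of recent tokens.

def evict(cache: list, new_token, num_sinks: int = 4, window: int = 8) -> list:
    """Append `new_token`, then drop middle entries so only the first
    `num_sinks` tokens and the last `window` tokens remain cached."""
    cache = cache + [new_token]
    if len(cache) <= num_sinks + window:
        return cache
    return cache[:num_sinks] + cache[-window:]

if __name__ == "__main__":
    cache = []
    for t in range(20):                      # stream of 20 token positions
        cache = evict(cache, f"tok{t}")
    print(cache)
    # -> ['tok0', 'tok1', 'tok2', 'tok3', 'tok12', ..., 'tok19']
```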
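
Item 28's Min-K% Prob score is simple enough to state directly in code: score a candidate text by the average log-probability of its k% least likely tokens, on the intuition that text seen during pretraining contains fewer very unlikely tokens. The sketch below assumes per-token log-probabilities have already been obtained from the target model; the numbers and the 20% threshold here are made up for illustration.

```python
# Sketch of the Min-K% Prob membership score (item 28): average the
# log-probabilities of the k% least likely tokens in the candidate text.
# The token log-probs below are invented; a real test would obtain them
# from the target language model.

def min_k_prob(token_logprobs, k: float = 0.2) -> float:
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]   # the k% least likely tokens
    return sum(lowest) / n                # higher (less negative) suggests membership

if __name__ == "__main__":
    seen_like   = [-0.3, -0.8, -0.5, -1.1, -0.4, -0.9, -0.6, -0.7, -0.2, -1.0]
    unseen_like = [-0.3, -4.2, -0.5, -5.7, -0.4, -3.9, -0.6, -0.7, -6.1, -1.0]
    print("candidate A score:", min_k_prob(seen_like))    # closer to 0
    print("candidate B score:", min_k_prob(unseen_like))  # much more negative
```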

September

Paper Links
1) RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback - Scaling Reinforcement Learning: RLAIF Revolutionizes Alignment with Human Preferences in Large Language Models Paper
2) GPT Can Solve Mathematical Problems Without a Calculator - Revolutionizing Arithmetic Abilities: Large Language Models Surpass Expectations with Accurate Multi-Digit Operations Paper , code
3) Large Language Models as Optimizers - Optimization by PROmpting (OPRO): Harnessing Large Language Models for Gradient-Free Optimization Paper , code
4) FLM-101B - Revolutionizing LLMs: Cost-Effective Training and IQ Evaluation with FLM-101B Paper , code and summary
5) Cognitive Architectures for Language Agents - CoALA: Towards Cognitive Architectures for Language Agents - A Framework for Organizing and Advancing Language Models Paper , code
6) Textbooks Are All You Need - Empowering Smaller Transformers: Unveiling phi-1.5 for Common Sense Reasoning in Natural Language Paper
7) The Rise and Potential of Large Language Model Based Agents: A Survey - From Language Models to Intelligent Agents: A Comprehensive Survey on LLM-Based Agents for Artificial General Intelligence Paper
8) RAIN: Your Language Models Can Align Themselves without Finetuning - Self-Guided Alignment: Rewindable Auto-regressive INference for Consistent Large Language Model Responses Paper , code
9) A Survey of Hallucination in Large Foundation Models - Hallucination in Large Foundation Models: A Comprehensive Survey and Strategies for Mitigation Paper
10) Agents: An Open-source Framework for Autonomous Language Agents - Agents: An Open-Source Library for Building and Deploying Autonomous Language Agents Paper , code
11) MAmmoTH - Open-Source Large Language Models for Superior General Math Problem-Solving Paper , code
12) Chain-of-Verification Reduces Hallucination in Large Language Models - Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. An orchestration sketch of this loop appears after this list. Paper
13) Contrastive Decoding Improves Reasoning in Large Language Models - Unlocking Improved Reasoning: Contrastive Decoding Boosts Text Generation, Surpassing Greedy Decoding on Various Tasks Paper
14) LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models - Efficiently extending the context of large language models, LongLoRA introduces a shifted sparse attention (S2-Attn) for computational savings during fine-tuning, achieving impressive performance on LLMs from 7B to 70B. By leveraging improved LoRA for context extension and adopting S2-Attn, LongLoRA efficiently extends model contexts without significant additional computational costs, showcasing compatibility with existing techniques and achieving strong empirical results across various tasks. Paper , code
15) LLM Really Good at Generating Complex Structured Data - Evaluating Large Language Models' (LLMs) performance on tasks requiring the generation of complex structured data, this study introduces Struc-Bench, encompassing five representative LLMs. Identifying common formatting errors and areas for improvement in current models, the research proposes a structure-aware fine-tuning approach. Leveraging FormatCoT for format instruction generation, the study demonstrates significant improvements in adherence to natural language constraints, particularly with LLaMA-7B. The resulting ability map outlines model capabilities across six dimensions, shedding light on LLM weaknesses in handling intricate structured outputs and suggesting avenues for future research. Paper , code
16) A Large-Scale Real-World LLM Conversation Dataset - Introducing LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art Large Language Models (LLMs), this paper emphasizes the importance of studying user interactions with LLMs in real-world scenarios. The dataset is collected from 210K unique IP addresses on the Vicuna demo and Chatbot Arena website, showcasing diversity, originality, and scale. The paper provides an overview of the dataset's content, curation process, basic statistics, and topic distribution. It demonstrates the dataset's versatility through various use cases, making it a valuable resource for understanding and advancing LLM capabilities. Paper
17) Language Modeling Is Compression - Large language models, such as Chinchilla 70B, demonstrate powerful general-purpose prediction capabilities with impressive compression performance. This work advocates for viewing the prediction problem through the lens of compression, revealing insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, trained primarily on text, outperforms domain-specific compressors like PNG and FLAC when compressing ImageNet patches and LibriSpeech samples, respectively. The prediction-compression equivalence also allows the use of any compressor, like gzip, to build a conditional generative model. Paper
18) Compositional Foundation Models for Hierarchical Planning - HiP, or Compositional Foundation Models for Hierarchical Planning, addresses decision-making challenges in novel environments with long-horizon goals. It leverages expert foundation models in language, vision, and action to perform hierarchical reasoning. The large language model constructs symbolic plans, grounded in the environment through a video diffusion model, and connects them to visual-motor control using an inverse dynamics model. Consistency is maintained through iterative refinement, demonstrating effectiveness in long-horizon tasks like table-top manipulation. Paper , code and summary
19) OWL: A Large Language Model for IT Operations - OWL, a large language model designed for IT operations, is introduced in this paper, showcasing its prowess in handling various IT-related tasks. Trained on the OWL-Instruct dataset, the model employs a mixture-of-adapter strategy for efficient tuning across different domains or tasks. Evaluation on the OWL-Bench and other open IT-related benchmarks demonstrates OWL's superior performance compared to existing models, highlighting its potential to revolutionize techniques in IT operations with specialized Large Language Models (LLMs). Paper , code
20) The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" - "Reversal Curse" in large language models revealed: Auto-regressive models, when trained on statements like "A is B," fail to automatically generalize to "B is A." For example, if trained on "Olaf Scholz was the ninth Chancellor of Germany," the model won't deduce the answer to "Who was the ninth Chancellor of Germany?" We demonstrate this issue with GPT-3 and Llama-1, revealing a basic failure of logical deduction across model sizes and families, unaffected by data augmentation. Even in real-world queries about celebrities, models struggle with reversed questions, showcasing the persistence of the Reversal Curse. Paper , code
21) Effective Long-Context Scaling of Foundation Models - Unlocking Long Context: A series of large language models (LLMs) supporting effective context windows up to 32,768 tokens. Built through continual pretraining from Llama 2 with longer training sequences, these models demonstrate consistent improvements on language modeling, synthetic context probing tasks, and various research benchmarks, especially excelling in long-context tasks. The 70B variant, with a cost-effective instruction tuning procedure, outperforms gpt-3.5-turbo-16k on a suite of long-context tasks, showcasing the effectiveness of the approach. The study provides an in-depth analysis of Llama's position encodings and explores the impact of design choices in the pretraining process. Paper
22) Graph Neural Prompting with Large Language Models - Introducing Graph Neural Prompting (GNP), a plug-and-play method that enhances large language models' (LLMs) understanding of knowledge graphs (KGs). GNP incorporates a graph neural network encoder, cross-modality pooling module, domain projector, and self-supervised link prediction objective. Through extensive experiments, GNP showcases superior performance in both commonsense and biomedical reasoning tasks across different LLM sizes and settings. The approach addresses limitations in capturing grounded knowledge, offering a promising avenue for knowledge-enhanced language modeling with pre-trained LLMs. Paper
23) Aligning Large Multimodal Models with Factually Augmented RLHF - Presenting a solution to the multimodal misalignment issue in Large Multimodal Models (LMM), we adapt Reinforcement Learning from Human Feedback (RLHF) to align vision and language. Our Factually Augmented RLHF algorithm incorporates additional factual information into the reward model, mitigating reward hacking and improving performance. The approach is evaluated on the MMHAL-BENCH benchmark, focusing on penalizing hallucinations. As the first LMM trained with RLHF, our model achieves remarkable improvement, reaching a 94% performance level of the text-only GPT-4 on LLaVA-Bench and a 60% improvement on MMHAL-BENCH compared to other baselines. Paper , code
24) Large Language Model Alignment: A Survey - In this comprehensive survey, we delve into the realm of Large Language Models (LLMs), examining alignment methodologies to ensure their behaviors align with human values. Categorizing methods into outer and inner alignment and addressing interpretability and adversarial vulnerabilities, we explore benchmarks and evaluation approaches. Beyond fostering research interests, our survey aims to bridge the gap between the AI alignment community and researchers focused on LLM capability exploration, envisioning a future of both capable and safe LLMs. Paper
25) MentaLLaMA - This study delves into interpretable mental health analysis on social media, leveraging large language models (LLMs) for nuanced predictions and detailed explanations. Recognizing challenges in zero-shot/few-shot scenarios and domain-specific finetuning, the authors introduce the Interpretable Mental Health Instruction (IMHI) dataset, encompassing 105K samples from 10 sources across 8 mental health tasks. With expert-written prompts and strict evaluations, they train MentalLLaMA, the first open-source LLM series for interpretable mental health analysis. Evaluation results demonstrate MentalLLaMA's competitive correctness and high-quality explanation generation, bridging gaps in interpretable mental health assessment. Paper , code
26) Enhancing Zero-Shot Chain-of-Thought Reasoning in LLM - Improving zero-shot chain-of-thought reasoning in large language models, LogiCoT introduces a neurosymbolic framework. By incorporating principles from symbolic logic, LogiCoT aims to enhance the reasoning processes of generative language models. Experimental evaluations across various domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, highlight the effectiveness of LogiCoT in fortifying the reasoning abilities of large language models. Paper
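
The four-step recipe in item 12 (draft, plan verification questions, answer them independently, then revise) is essentially orchestration around an LLM. The sketch below wires those steps together with a hypothetical `llm` function; the prompts and the line-based parsing are heavily simplified relative to the paper.

```python
# Orchestration sketch of Chain-of-Verification (item 12).
# `llm` is a hypothetical stand-in for a real model call.

def llm(prompt: str) -> str:
    return f"[completion for: {prompt[:40]}...]"   # placeholder

def chain_of_verification(question: str) -> str:
    # (i) draft an initial response
    draft = llm(f"Answer the question.\nQ: {question}\nA:")

    # (ii) plan verification questions about the draft
    plan = llm(f"List fact-check questions for this answer:\n{draft}")
    verification_questions = [q for q in plan.splitlines() if q.strip()]

    # (iii) answer each verification question independently of the draft,
    #       so the checks are not biased by the original answer
    checks = [(q, llm(f"Answer concisely: {q}")) for q in verification_questions]

    # (iv) produce the final, verified response
    check_text = "\n".join(f"{q} -> {a}" for q, a in checks)
    return llm(
        "Revise the draft so it is consistent with the verified facts.\n"
        f"Draft: {draft}\nVerified facts:\n{check_text}\nFinal answer:"
    )

if __name__ == "__main__":
    print(chain_of_verification("Who was the ninth Chancellor of Germany?"))
```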

August

Paper Links
1) MetaGPT - MetaGPT utilizes a framework with LLM-based multi-agents to encode human SOPs, enhancing problem-solving for software development, code generation, and data analysis, leveraging tools like AutoGPT and LangChain. Paper, code,summary
2) The Hydra Effect: Emergent Self-repair in Language Model Computations - Exploring Language Model Computations: Unveiling Adaptive Computation and Counterbalancing Functions through Causal Analysis Paper
3) SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning - Enhancing Language Model Reasoning: Introducing SelfCheck for Error Recognition and Improved Question-Answering Performance Paper , code
4) LLM As DBA - D-Bot: A Textually-Informed Framework for Continuous Database Maintenance Experience and Collaborative Diagnosis Paper , code
5) Tracing Political Biases: Unveiling Unfair NLP Models from Pretraining to Downstream Tasks - Methods developed to measure media biases in LLMs; findings show that political leanings in LLMs reinforce existing polarization in their training corpora. Paper
6) AgentBench: Evaluating LLMs as Agents - AgentBench: A Multi-Dimensional Benchmark for Evaluating LLMs as Agents Across Diverse Interactive Environments, Revealing a Substantial Gap Between Leading Commercial Models and Open-Source Alternatives Paper , code
7) Studying Large Language Model Generalization with Influence Functions - Scaling Influence Functions for Large Language Models (LLMs) using EK-FAC Approximation. Exploring Generalization Patterns in LLMs: Insights from Influence Functions on Model Behavior and Risk Mitigation Paper
8) Simple synthetic data reduces sycophancy in large language models - Mitigating Sycophancy in Language Models: Assessing Prevalence, Scaling Impact, and Introducing a Synthetic-Data Intervention Paper , code
9) Pre-Trained Large Language Models for Industrial Control - GPT-4 as an HVAC Controller: Assessing Performance, Generalization, and Text Context Impact in Industrial Control Paper
10) Trustworthy LLMs - Assessing Trustworthiness in Large Language Models: A Comprehensive Survey and Measurement Study on Alignment with Social Norms, Values, and Regulations Paper
11) Self-Alignment with Instruction Backtranslation - Instruction Backtranslation: A Scalable Approach for Building High-Quality Instruction-Following Language Models through Self-Augmentation and Self-Curation Paper
12) Platypus: Quick, Cheap, and Powerful Refinement of LLMs - Platypus: Leading Open LLM Leaderboard with Efficient Fine-Tuning and Merging Paper , code and summary
13) A Survey on Model Compression for Large Language Models - Navigating the Future of Large Language Models: A Comprehensive Survey on Model Compression Strategies for Practical Deployment Paper
14) Shepherd: A Critic for Language Model Generation - Shepherd: Advancing Language Model Capabilities with Targeted Critiques and Refinements Paper , code
15) Solving Challenging Math Word Problems Using GPT-4 - Revolutionizing Math Reasoning with GPT-4 Code Interpreter: Unveiling the Power of Code-Based Self-Verification Paper
16) Teach LLMs to Personalize -- An Approach inspired by Writing Education - Revolutionizing personalized text generation, our cutting-edge approach employs a multistage, multitask framework inspired by writing education practices. Unveiling the future of tailored content, this innovative technique showcases significant improvements over various baselines across diverse domains, setting a new standard in large language model training. Paper
17) OctoPack: Instruction Tuning Code Large Language Models - Revolutionizing Language Model Performance: Instruction Tuning with Code for State-of-the-Art Results Paper , code
18) Efficient Guided Generation for Large Language Models - Revolutionizing Neural Text Generation: A State Machine Approach with Regular Expressions and Context-Free Grammars in Outlines Python Library. Discover a groundbreaking framework that transforms text generation challenges by framing them as transitions in a finite-state machine. This model-agnostic technique efficiently incorporates regular expressions and context-free grammars, facilitating the creation of an index over language models' vocabularies. The approach seamlessly integrates domain-specific knowledge, ensures adherence to constraints, and enhances text structure reliability. With minimal impact on token sequence generation, this method surpasses current solutions, as evidenced by a high-performing implementation in the open-source Outlines Python library. A toy DFA-masking sketch appears after this list. Paper
19) Code Llama - Code Llama: Unveiling State-of-the-Art Language Models for Diverse Coding Tasks. Our 7B, 13B, and 34B parameter models redefine large language model training, excelling in infilling, supporting extensive input contexts, and delivering zero-shot instruction following. From foundation models to Python specializations, Code Llama sets benchmarks with up to 55% scores on key code evaluation benchmarks. Released under a permissive license, Code Llama is your go-to for cutting-edge research and commercial applications. Paper & code
20) Instruction Tuning for Large Language Models: A Survey - Surveying Instruction Tuning in Large Language Models: A critical technique for enhancing capabilities and controllability. This paper systematically reviews the methodology, dataset construction, model training, and applications of instruction tuning. It addresses factors influencing outcomes, potential pitfalls, criticisms, and suggests research avenues. Project page: Paper , code
21) Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities - Securing the Future: Navigating Threats and Vulnerabilities in the Age of Large Language Models Paper
22) Giraffe: Adventures in Expanding Context Lengths in LLMs - Extending Limits: Exploring Context Length Extrapolation in Modern Large Language Models Paper , code
23) A Survey on Large Language Model based Autonomous Agents - Unlocking Human-like Intelligence: A Holistic Survey of Large Language Model-Based Autonomous Agents Paper
24) Prompt2Model - Revolutionizing NLP Systems: Prompt2Model Unleashes Special-Purpose Models for Efficient Deployment Paper , code
25) Measuring Legal Reasoning in Large Language Models - LegalBench: Unveiling the Spectrum of Legal Reasoning for Large Language Models Paper , summary
26) Language to Rewards for Robotic Skill Synthesis - Closing the Gap: Language Models Define Rewards for Robot Control Tasks Paper , code and summary
27) LLaSM: Large Language and Speech Model - Elevating Human-AI Interaction: Introducing LLaSM for Multi-Modal Speech-and-Language Instructions Paper
28) Lucene Is All You Need - Breaking the Mold: Vector Search with OpenAI Embeddings in Lucene Challenges Conventional Wisdom Paper
29) Graph of Thoughts - Revolutionizing Language Models: Graph of Thoughts (GoT) Unleashes Advanced Prompting Capabilities Paper , code
30) Nougat: Neural Optical Understanding for Academic Documents - Bridging the Gap: Nougat - Transforming Scientific PDFs into Machine-Readable Markup Language with Visual Transformers Paper
31) FacTool: Factuality Detection in Generative AI - A Comprehensive Framework for Detecting Factual Errors in Texts Generated by Large Language Models Paper , code
32) Transformers as Support Vector Machines - Transformers Unveiled: A Formal Link to SVMs and the Journey to Optimizing Implicit Bias Paper , code
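
The core idea in item 18, treating constrained decoding as walking a finite-state machine and masking every vocabulary token that would leave a valid state, can be shown with a toy automaton. The sketch below hand-builds a tiny DFA for digit strings with an optional decimal part; it is not the Outlines library's API, just an illustration of the indexing idea, with a random choice standing in for the model's sampling step.

```python
# Toy illustration of FSM-guided generation (item 18): at every step, only
# tokens allowed by the current DFA state may be sampled. This hand-built
# automaton accepts digit strings with an optional decimal part (e.g. "12.5");
# random.choice stands in for sampling from a language model.
import random

VOCAB = list("0123456789") + [".", "<eos>"]

# DFA states: 0 = start, 1 = integer part, 2 = just saw '.', 3 = fractional part.
TRANSITIONS = {
    (0, "digit"): 1,
    (1, "digit"): 1, (1, "."): 2,
    (2, "digit"): 3,
    (3, "digit"): 3,
}
ACCEPTING = {1, 3}                      # states where "<eos>" is allowed

def allowed_tokens(state: int):
    allowed = []
    for tok in VOCAB:
        if tok == "<eos>":
            if state in ACCEPTING:
                allowed.append(tok)
        else:
            kind = "digit" if tok.isdigit() else tok
            if (state, kind) in TRANSITIONS:
                allowed.append(tok)
    return allowed

def guided_generate(max_len: int = 8) -> str:
    state, out = 0, []
    for _ in range(max_len):
        tok = random.choice(allowed_tokens(state))   # "sampling", masked by the DFA
        if tok == "<eos>":
            break
        out.append(tok)
        kind = "digit" if tok.isdigit() else tok
        state = TRANSITIONS[(state, kind)]
    return "".join(out)

if __name__ == "__main__":
    random.seed(0)
    print([guided_generate() for _ in range(5)])   # only DFA-allowed tokens ever appear
```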

July

Paper Links
1) Llama 2 - a collection of pretrained foundational models and fine-tuned chat models ranging in scale from 7B to 70B; Llama 2-Chat is competitive on a range of tasks and shows strong results on safety and helpfulness. Paper, code,summary
2) How is ChatGPT’s Behavior Changing Over Time? - evaluates different versions of GPT-3.5 and GPT-4 on various tasks and finds that behavior and performance vary greatly over time; this includes differences in performance for tasks such as math problem-solving, safety-related generations, and code formatting. Paper, code,summary
3) FlashAttention-2 - improves work partitioning and parallelism and addresses issues like reducing non-matmul FLOPs, parallelizing attention computation which increases occupancy, and reducing communication through shared memory. Paper, code,summary
4) Measuring Faithfulness in Chain-of-Thought Reasoning - finds that CoT reasoning shows large variation across tasks by simple interventions like adding mistakes and paraphrasing; demonstrates that as the model becomes larger and more capable, the reasoning becomes less faithful; suggests carefully choosing the model size and tasks can enable CoT faithfulness. Paper, code,summary
5) Generative TV & Showrunner Agents - an approach to generate episodic content using LLMs and multi-agent simulation; this enables current systems to perform creative storytelling through the integration of simulation, the user, and powerful AI models and enhance the quality of AI-generated content. Paper & code & summary
6) Challenges & Application of LLMs - summarizes a comprehensive list of challenges when working with LLMs that range from brittle evaluations to prompt brittleness to a lack of robust experimental designs. Paper,
7) Retentive Network - presents a foundation architecture for LLMs with the goal to improve training efficiency, inference, and efficient long-sequence modeling; adapts a retention mechanism for sequence modeling that supports parallel, recurrent, and chunkwise recurrent representations. Paper, code,summary
8) Meta-Transformer - a framework that performs unified learning across 12 modalities; it can handle tasks that include fundamental perception (text, image, point cloud, audio, video), practical application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Paper, code,summary
9) Retrieve In-Context Example for LLMs - presents a framework to iteratively train dense retrievers to identify high-quality in-context examples for LLMs; the approach enhances in-context learning performance demonstrated using a suite of 30 tasks; examples with similar patterns are helpful and gains are consistent across model sizes. Paper, code,summary
10) FLASK - proposes fine-grained evaluation for LLMs based on a range of alignment skill sets; involves 12 skills and can help to provide a holistic view of a model’s performance depending on skill, domain, and level of difficulty; useful to analyze factors that make LLMs more proficient at specific skills. Paper, code,summary
11) CM3Leon - introduces a retrieval-augmented multi-modal language model that can generate text and images; leverages diverse and large-scale instruction-style data for tuning which leads to significant performance improvements and 5x less training compute than comparable methods. Paper, code,summary
12) Claude 2 - presents a detailed model card for Claude 2 along with results on a range of safety, alignment, and capabilities evaluations. Paper,code & summary
13) Secrets of RLHF in LLMs - takes a closer look at RLHF and explores the inner workings of PPO with code included. Paper, code,summary
14) LongLLaMA - employs a contrastive training process to enhance the structure of the (key, value) space to extend context length; presents a fine-tuned model that lengthens context and demonstrates improvements in long context tasks. Paper, code,summary
15) LLMs as General Pattern Machines - shows that even without any additional training, LLMs can serve as general sequence modelers, driven by in-context learning; this work applies zero-shot capabilities to robotics and shows that it’s possible to transfer the pattern among words to actions. Paper, code &summary
16) Teaching Arithmetics to Small Transformers - trains small transformer models on chain-of-thought style data to significantly improve accuracy and convergence speed; it highlights the importance of high-quality instructive data for rapidly eliciting arithmetic capabilities. Paper, code,summary
17) Generative Pretraining in Multimodality - presents a new transformer-based multimodal foundation model to generate images and text in a multimodal context; enables performant multimodal assistants via instruction tuning. Paper, code & summary
18) A Survey on Evaluation of LLMs - a comprehensive overview of evaluation methods for LLMs focusing on what to evaluate, where to evaluate, and how to evaluate. Paper, Sample project & Summary
19) How Language Models Use Long Contexts - finds that LM performance is often highest when relevant information occurs at the beginning or end of the input context; performance degrades when relevant information is provided in the middle of a long context. Paper, code,summary
20) LLMs are Effective Text Rankers - proposes a prompting technique that enables open-source LLMs to perform state-of-the-art text ranking on standard benchmarks. Paper, code,summary
21) Multimodal Generation with Frozen LLMs - introduces an approach that effectively maps images to the token space of LLMs; enables models like PaLM and GPT-4 to tackle visual tasks without parameter updates; enables multimodal tasks and uses in-context learning to tackle various visual tasks. Paper, code,summary
22) CodeGen2.5 - releases a new code LLM trained on 1.5T tokens; the 7B model is on par with >15B code-generation models and it’s optimized for fast sampling. Paper, code,summary
23) Elastic Decision Transformer - introduces an advancement over Decision Transformers and variants by facilitating trajectory stitching during action inference at test time, achieved by adjusting to shorter history that allows transitions to diverse and better future states. Paper, code & summary
24) Robots That Ask for Help - presents a framework to measure and align the uncertainty of LLM-based planners that ask for help when needed. Paper, code & summary
25) Scaling Transformer to 1 Billion Tokens - presents LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, with no loss in shorter sequences. Paper, code,summary
26) Universal Adversarial LLM Attacks - finds universal and transferable adversarial attacks that cause aligned models like ChatGPT and Bard to generate objectionable behaviors; the approach automatically produces adversarial suffixes using greedy and gradient search. Paper,code,summary
27) Med-PaLM Multimodal - introduces a new multimodal biomedical benchmark with 14 different tasks; it presents a proof of concept for a generalist biomedical AI system called Med-PaLM Multimodal; it supports different types of biomedical data like clinical text, imaging, and genomics. Paper , summary
28) L-Eval - a standardized evaluation for long context language models containing 411 long documents and over 2K query-response pairs encompassing areas such as law, finance, school lectures, long conversations, novels, and meetings. Paper, code
29) LoraHub - introduces LoraHub to enable efficient cross-task generalization via dynamic LoRA composition; it enables the combination of LoRA modules without human expertise or additional parameters/gradients; mimics the performance of in-context learning in few-shot scenarios (a toy composition sketch appears after this list). Paper, code,summary
30) Survey of Aligned LLMs - presents a comprehensive overview of alignment approaches, including aspects like data collection, training methodologies, and model evaluation. Paper, code
31) WavJourney - leverages LLMs to connect various audio models to compose audio content for engaging storytelling; this involves an explainable and interactive design that enhances creative control in audio production. Paper, code,summary
32) FacTool - a task and domain agnostic framework for factuality detection of text generated by LLM; the effectiveness of the approach is tested on tasks such as code generation and mathematical reasoning; a benchmark dataset is released, including a ChatGPT plugin. Paper, code,summary
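
The sketch below illustrates entry 29 (LoraHub): a new task adapter is composed as a weighted sum of existing LoRA updates. The module shapes and the mixing weights are toy assumptions; in the paper the mixing weights are found with a gradient-free search on a few-shot set, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2   # toy hidden size and LoRA rank

# Hypothetical LoRA modules (B, A) trained on three different upstream tasks.
modules = [(rng.standard_normal((d, r)) * 0.1, rng.standard_normal((r, d)) * 0.1)
           for _ in range(3)]

def compose_lora(mix_weights):
    """LoraHub-style composition: the merged update is a weighted sum of the
    individual low-rank updates B_i @ A_i."""
    delta = np.zeros((d, d))
    for w, (B, A) in zip(mix_weights, modules):
        delta += w * (B @ A)
    return delta

W_base = rng.standard_normal((d, d))                 # frozen base weight
W_adapted = W_base + compose_lora([0.6, 0.3, 0.1])   # hypothetical mixing weights
print(W_adapted.shape)                               # (8, 8)
```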

June

Paper Links
1) Textbooks Are All You Need - introduces phi-1, a new LLM for code that is significantly smaller than competing models (1.3B parameters) and trained for only 4 days on high-quality data; achieves 50.6% accuracy on HumanEval (with similarly strong results on MBPP), comparable to much larger models. Paper,summary
2) Voicebox - an all-in-one generative speech model; it can synthesize speech across 6 languages; it can perform noise removal, content editing, style conversion, and more; it's 20x faster than current models and outperforms single-purpose models through in-context learning. Paper, code,summary
3) FinGPT - an open-source LLM for the finance sector; it takes a data-centric approach, providing researchers & practitioners with accessible resources to develop FinLLMs. Paper, code,summary
4) Crowd Workers Widely Use Large Language Models for Text Production Tasks - estimates that 33-46% of crowd workers on MTurk used LLMs when completing a text production task. Paper, code,summary
5) Reliability of Watermarks for LLMs - watermarking is useful to detect LLM-generated text and potentially mitigate harms; this work studies the reliability of watermarking for LLMs and finds that watermarks are detectable even when the watermarked text is re-written by humans or paraphrased by another non-watermarked LLM. Paper, code,summary
6) Applications of Transformers - a new survey paper highlighting major applications of Transformers for deep learning tasks; includes a comprehensive list of Transformer models. Paper
7) Unifying LLMs & Knowledge Graphs - provides a roadmap for the unification of LLMs and KGs; covers how to incorporate KGs in LLM pre-training/inferencing, leverage LLMs for KG tasks such as question answering, and enhance both KGs and LLMs for bidirectional reasoning. Paper, code
8) Augmenting LLMs with Long-term Memory - proposes a framework to enable LLMs to memorize long history; it’s enhanced with memory-augmented adaptation training to memorize long past context and use long-term memory for language modeling; achieves improvements on memory-augmented in-context learning over LLMs. Paper,code , summary
9) Sparse-Quantized Representation - a new compressed format and quantization technique that enables near-lossless compression of LLMs across model scales; “allows LLM inference at 4.75 bits with a 15% speedup”. Paper, code,summary
10) MusicGen - a simple and controllable model for music generation built on top of a single-stage transformer LM together with efficient token interleaving patterns; it can be conditioned on textual descriptions or melodic features and shows high performance on a standard text-to-music benchmark. Paper, code,summary
11) ChatDB Augmenting LLMs with Databases - combines an LLM with a set of SQL databases, enabling a symbolic memory framework; completes tasks via LLM generating SQL instructions that manipulate the DB autonomously. Paper, code,summary
12) Concept Scrubbing in LLM - presents a method called LEAst-squares Concept Erasure (LEACE) to erase target concept information from every layer in a neural network; it’s used for reducing gender bias in BERT embeddings. Paper
13) Fine-Grained RLHF - trains LMs with fine-grained human feedback; instead of using overall preference, more explicit feedback is provided at the segment level which helps to improve efficacy on long-form question answering, reduce toxicity, and enables LM customization. Paper, code,summary
14) Hierarchical Vision Transformer - pretrains vision transformers with a visual pretext task (MAE), while removing unnecessary components from a state-of-the-art multi-stage vision transformer; this enables a simple hierarchical vision transformer that’s more accurate and faster at inference and during training. Paper, code,summary
15) Humor in ChatGPT - explores ChatGPT’s capabilities to grasp and reproduce humor; finds that over 90% of 1008 generated jokes were the same 25 jokes and that ChatGPT is also overfitted to a particular joke structure. Paper, code,summary
16) Orca: Progressive Learning from Complex Explanation Traces of GPT-4 - develops a 13B parameter model that learns to imitate the reasoning process of large foundational models like GPT-4; it leverages large-scale and diverse imitation data and surpasses instruction-tuned models such as Vicuna-13B in zero-shot reasoning. Paper, code,summary
17) Let’s Verify Step by Step - achieves state-of-the-art mathematical problem solving by rewarding each correct step of reasoning in a chain-of-thought instead of rewarding the final answer; the model solves 78% of problems from a representative subset of the MATH test set. Paper, code,summary
18) The Impact of Positional Encoding on Length Generalization in Transformers - shows that explicit position embeddings are not essential for decoder-only Transformers; shows that other positional encoding methods like ALiBi and Rotary are not well suited for length generalization. Paper, code
19) BiomedGPT - a unified biomedical generative pretrained transformer model for vision, language, and multimodal tasks. Achieves state-of-the-art performance across 5 distinct tasks with 20 public datasets spanning over 15 unique biomedical modalities. Paper, code,summary
20) Fine-Tuning Language Models with Just Forward Passes - proposes a memory-efficient zeroth-order optimizer and a corresponding SGD algorithm to finetune large LMs with the same memory footprint as inference (a toy sketch of the zeroth-order update appears after this list). Paper , code,summary
21) MERT - an acoustic music understanding model with large-scale self-supervised training; it incorporates a superior combination of teacher models to outperform conventional speech and audio approaches. Paper , code,summary
22) Bytes Are All You Need - investigates performing classification directly on file bytes, without needing to decode files at inference time; achieves ImageNet Top-1 accuracy of 77.33% using a transformer backbone; achieves 95.42% accuracy when operating on WAV files from the Speech Commands v2 dataset. Paper, code,summary
23) Direct Preference Optimization - while helpful to train safe and useful LLMs, the RLHF process can be complex and often unstable; this work proposes an approach to finetune LMs by solving a classification problem on the human preferences data, with no RL required (a toy sketch of the loss appears after this list). Paper, code,summary
24) SQL-PaLM - an LLM-based Text-to-SQL adopted from PaLM-2; achieves SoTA in both in-context learning and fine-tuning settings; the few-shot model outperforms the previous fine-tuned SoTA by 3.8% on the Spider benchmark; few-shot SQL-PaLM also outperforms few-shot GPT-4 by 9.9%, using a simple prompting approach. Paper, code,summary
25) CodeTF - an open-source Transformer library for state-of-the-art code LLMs; supports pretrained code LLMs and popular code benchmarks, including standard methods to train and serve code LLMs efficiently. Paper, code,summary
26) ClinicalGPT - a language model optimized through extensive and diverse medical data, including medical records, domain-specific knowledge, and multi-round dialogue consultations. Paper,code,summary
27) LOMO - proposes a new memory-efficient optimizer that combines gradient computation and parameter update in one step; enables tuning the full parameters of an LLM with limited resources. Paper,code,summary
28) SequenceMatch - formulates sequence generation as an imitation learning problem; this framework makes it possible to incorporate backtracking into text generation through a backspace action; this enables the generative model to mitigate compounding errors by reverting sampled tokens that lead the sequence out of distribution. Paper,code,summary
29) LMFlow - an extensible and lightweight toolkit that simplifies finetuning and inference of general large foundation models; supports continuous pretraining, instruction tuning, parameter-efficient finetuning, alignment tuning, and large model inference. Paper,code,summary
30) MotionGPT - uses multimodal control signals for generating consecutive human motions; it quantizes multimodal control signals into discrete codes which are converted to LLM instructions that generate motion answers. Paper,code,summary
31) Wanda - introduces a simple and effective pruning approach for LLMs; it prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis; the approach requires no retraining or weight update and outperforms baselines of magnitude pruning (a toy scoring sketch appears after this list). Paper,code,summary
32) AudioPaLM - fuses text-based and speech-based LMs, PaLM-2 and AudioLM, into a multimodal architecture that supports speech understanding and generation; outperforms existing systems for speech translation tasks with zero-shot speech-to-text translation capabilities. Paper,code,summary
33) Understanding Theory-of-Mind in LLMs with LLMs - a framework for procedurally generating evaluations with LLMs; proposes a benchmark to study the social reasoning capabilities of LLMs with LLMs. Paper, code,summary
34) Evaluations with No Labels - a framework for self-supervised evaluation of LLMs by analyzing their sensitivity or invariance to transformations on input text; can be used to monitor LLM behavior on datasets streamed during live model deployment. Paper, code,summary
35) Long-range Language Modeling with Self-Retrieval - an architecture and training procedure for jointly training a retrieval-augmented language model from scratch for long-range language modeling tasks. Paper ,code,summary
36) Generative AI for Programming Education - evaluates GPT-4 and ChatGPT on programming education scenarios and compares their performance with human tutors; GPT-4 outperforms ChatGPT and comes close to human tutors' performance. Paper, code,summary
37) LeanDojo - an open-source Lean playground consisting of toolkits, data, models, and benchmarks for theorem proving; also develops ReProver, a retrieval augmented LLM-based prover for theorem solving using premises from a vast math library. Paper, code,summary
38) Extending Context Window of LLMs - extends the context window of LLMs like LLaMA to up to 32K with minimal fine-tuning (within 1000 steps); previous methods for extending the context window are inefficient but this approach attains good performance on several tasks while being more efficient and cost-effective. Paper, code,summary
39) Computer Vision Through the Lens of Natural Language - proposes a modular approach for solving computer vision problems by leveraging LLMs; the LLM is used to reason over outputs from independent and descriptive modules that provide extensive information about an image. Paper, code,summary
40) InterCode - introduces a framework of interactive coding as a reinforcement learning environment; this is different from the typical coding benchmarks that consider a static sequence-to-sequence process. Paper, code,summary
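
The sketch below illustrates entry 20 (fine-tuning with just forward passes): a zeroth-order update that perturbs all parameters along one shared random direction, evaluates the loss at the plus and minus perturbations, and uses the loss difference as the step size along that direction. The linear model, data, and hyperparameters are toy assumptions, not the paper's MeZO implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_fn(w, X, y):
    """Toy squared-error loss for a linear model; stands in for an LLM's loss."""
    return float(np.mean((X @ w - y) ** 2))

def zeroth_order_step(w, X, y, lr=0.01, eps=1e-3, seed=0):
    """One zeroth-order (SPSA-style) update: only two forward passes, no backprop.
    Regenerating z from a seed mirrors the trick of not storing the perturbation."""
    z = np.random.default_rng(seed).standard_normal(w.shape)
    loss_plus = loss_fn(w + eps * z, X, y)
    loss_minus = loss_fn(w - eps * z, X, y)
    grad_scale = (loss_plus - loss_minus) / (2 * eps)   # directional derivative estimate
    return w - lr * grad_scale * z

X = rng.standard_normal((64, 8))
true_w = rng.standard_normal(8)
y = X @ true_w
w = np.zeros(8)
for step in range(500):
    w = zeroth_order_step(w, X, y, seed=step)
print("final loss:", loss_fn(w, X, y))   # should be much lower than the initial loss
```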
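The sketch below illustrates entry 23 (Direct Preference Optimization), showing only the loss: the preference pair is scored by the policy and a frozen reference model, and the objective is a logistic loss on the implicit reward margin. The sequence log-probabilities in the usage example are made-up numbers.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO treats preference learning as classification: the implicit reward of a
    response is beta * (log pi(y|x) - log pi_ref(y|x)), and the loss is
    -log sigmoid(reward_chosen - reward_rejected)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# Hypothetical log-probabilities from the policy and the frozen reference model.
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```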
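The sketch below illustrates entry 31 (Wanda): each weight is scored by its magnitude times the norm of the corresponding input activation over a small calibration batch, and the lowest-scoring weights are zeroed per output row, with no retraining. The toy layer and calibration data are invented for illustration.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Wanda-style score: |weight| * L2 norm of the matching input feature,
    pruned per output row, with no retraining or weight update."""
    act_norm = np.linalg.norm(X, axis=0)       # one norm per input feature
    score = np.abs(W) * act_norm[None, :]      # (out, in) importance scores
    W_pruned = W.copy()
    k = int(W.shape[1] * sparsity)             # weights to drop per output row
    for row in range(W.shape[0]):
        drop = np.argsort(score[row])[:k]      # lowest-scoring weights in this row
        W_pruned[row, drop] = 0.0
    return W_pruned

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))     # toy linear layer (out_features, in_features)
X = rng.standard_normal((32, 8))    # calibration activations (batch, in_features)
print(wanda_prune(W, X))
```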

May

Paper Links
1) QLoRA - an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning performance (a toy layer-level sketch appears after this list). Paper, code,summary
2) LIMA - a new 65B parameter LLaMa model fine-tuned on 1000 carefully curated prompts and responses; it doesn't use RLHF, generalizes well to unseen tasks not available in the training data, and generates responses equivalent to or preferred over GPT-4's in 43% of cases, and even more often compared to Bard. Paper, code,summary
3) Voyager - an LLM-powered embodied lifelong learning agent in Minecraft that can continuously explore worlds, acquire skills, and make novel discoveries without human intervention. Paper, code,summary
4) Gorilla - a finetuned LLaMA-based model that surpasses GPT-4 on writing API calls. This capability can help identify the right API, boosting the ability of LLMs to interact with external tools to complete specific tasks. Paper, code,summary
5) The False Promise of Imitating Proprietary LLMs - provides a critical analysis of models that are finetuned on the outputs of a stronger model; argues that model imitation is a false promise and that the higher leverage action to improve open source models is to develop better base models. Paper , code,summary
6) Sophia - presents a simple scalable second-order optimizer that has negligible average per-step time and memory overhead; on language modeling, Sophia achieves 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time. Paper , code,summary
7) The Larger They Are, the Harder They Fail - shows that LLMs fail to generate correct Python code when default function names are swapped; larger models also more strongly prefer the incorrect continuation. Paper, code,summary
8) LLM Research Directions - discusses a list of research directions for students looking to do research with LLMs. Paper, code,summary
9) Reinventing RNNs for the Transformer Era - proposes an approach that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs; results show that the method performs on par with similarly sized Transformers. Paper, code,summary
10) Evidence of Meaning in Language Models Trained on Programs - argues that language models can learn meaning despite being trained only to perform next token prediction on text. Paper, code,summary
11) Towards Expert-Level Medical Question Answering with Large Language Models - a top-performing LLM for medical question answering; scored up to 86.5% on the MedQA dataset (a new state-of-the-art); approaches or exceeds SoTA across MedMCQA, PubMedQA, and MMLU clinical topics datasets. Paper, code,summary
12) MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers - a multi-scale decoder architecture enabling end-to-end modeling of sequences of over one million bytes; enables sub-quadratic self-attention and improved parallelism during decoding. Paper, code,summary
13) StructGPT: A General Framework for Large Language Model to Reason over Structured Data - improves the zero-shot reasoning ability of LLMs over structured data; effective for solving question answering tasks based on structured data. Paper , code,summary
14) TinyStories: How Small Can Language Models Be and Still Speak Coherent English? - uses a synthetic dataset of short stories to train and evaluate LMs that are much smaller than SoTA models but can produce fluent and consistent stories with several paragraphs, and demonstrate reasoning capabilities. Paper , code,summary
15) DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining - trains a small proxy model over domains to produce domain weights without knowledge of downstream tasks; it then resamples a dataset with the domain weights and trains a larger model; this enables using a 280M proxy model to train an 8B model (30x larger) more efficiently. Paper, code,summary
16) CodeT5+: Open Code Large Language Models for Code Understanding and Generation - supports a wide range of code understanding and generation tasks and different training methods to improve efficacy and computing efficiency; tested on 20 code-related benchmarks using different settings like zero-shot, fine-tuning, and instruction tuning; achieves SoTA on tasks like code completion, math programming, and text-to-code retrieval tasks. Paper, code,summary
17) Symbol tuning improves in-context learning in language models - an approach to finetune LMs on in-context input-label pairs where natural language labels are replaced by arbitrary symbols; boosts performance on unseen in-context learning tasks and algorithmic reasoning tasks. Paper, code,summary
18) Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability - shows that PaLM is exposed to over 30 million translation pairs across at least 44 languages; shows that incidental bilingualism connects to the translation capabilities of PaLM. Paper, code,summary
19) LLM explains neurons in LLMs - applies GPT-4 to automatically write explanations on the behavior of neurons in LLMs and even score those explanations; this offers a promising way to improve interpretability in future LLMs and potentially detect alignment and safety problems. Paper, code,summary
20) PaLM 2 - a new state-of-the-art language model integrated into AI features and tools like Bard and the PaLM API; displays competitive performance in mathematical reasoning compared to GPT-4; instruction-tuned model, Flan-PaLM 2, shows good performance on benchmarks like MMLU and BIG-bench Hard. Paper, code,summary
21) TidyBot - shows that robots can combine language-based planning and perception with the few-shot summarization capabilities of LLMs to infer generalized user preferences that are applicable to future interactions. Paper, code,summary
22) Unfaithful Explanations in Chain-of-Thought Prompting - demonstrates that CoT explanations can misrepresent the true reason for a model’s prediction; when models are biased towards incorrect answers, CoT generates explanations supporting those answers. Paper , code,summary
23) InstructBLIP - explores visual-language instruction tuning based on the pre-trained BLIP-2 models; achieves state-of-the-art zero-shot performance on 13 held-out datasets, outperforming BLIP-2 and Flamingo. Paper , code,summary
24) Active Retrieval Augmented LLMs - introduces FLARE, retrieval augmented generation to improve the reliability of LLMs; FLARE actively decides when and what to retrieve across the course of the generation; demonstrates superior or competitive performance on long-form knowledge-intensive generation tasks. Paper, code,summary
25) FrugalGPT - presents strategies to reduce the inference cost associated with using LLMs while improving performance. Paper, code,summary
26) StarCoder - an open-access 15.5B parameter LLM with 8K context length and is trained on large amounts of code spanning 80+ programming languages. Paper, code,summary
27) MultiModal-GPT - a vision and language model for multi-round dialogue with humans; the model is fine-tuned from OpenFlamingo, with LoRA added in the cross-attention and self-attention parts of the language model. Paper, code,summary
28) scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI - a foundation large language model pretrained on 10 million cells for single-cell biology. Paper, code,summary
29) GPTutor: a ChatGPT-powered programming tool for code explanation - a ChatGPT-powered tool for code explanation provided as a VSCode extension; claims to deliver more concise and accurate explanations than vanilla ChatGPT and Copilot; performance and personalization enhanced via prompt engineering; programmed to use more relevant code in its prompts. Paper, code,summary
30) Are Emergent Abilities of Large Language Models a Mirage? - presents an alternative explanation to the emergent abilities of LLMs; suggests that existing claims are creations of the researcher’s analyses and not fundamental changes in model behavior on specific tasks with scale. Paper,code,summary
31) PMC-LLaMA: Further Finetuning LLaMA on Medical Papers - a LLaMA model fine-tuned on 4.8 million medical papers; enhances capabilities in the medical domain and achieves high performance on biomedical QA benchmarks. Paper , code,summary
32) Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes - a mechanism to extract rationales from LLMs to train smaller models that outperform larger language models with less training data needed by finetuning or distillation. Paper, code,summary
33) Poisoning Language Models During Instruction Tuning - shows that adversaries can poison LLMs during instruction tuning by contributing poison examples to datasets; this can induce degenerate outputs across different held-out tasks. Paper, code,summary
34) Unlimiformer: Long-Range Transformers with Unlimited Length Input - proposes long-range transformers with unlimited length input by augmenting pre-trained encoder-decoder transformer with external datastore to support unlimited length input; shows usefulness in long-document summarization; could potentially be used to improve the performance of retrieval-enhanced LLMs. Paper, code,summary
35) Learning to Reason and Memorize with Self-Notes - an approach that enables LLMs to reason and memorize enabling them to deviate from the input sequence at any time to explicitly “think”; this enables the LM to recall information and perform reasoning on the fly; experiments show that this method scales better to longer sequences unseen during training. Paper, code,summary
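
The sketch below illustrates entry 1 (QLoRA) at the level of a single layer: a frozen, quantized base weight plus a trainable low-rank adapter whose output is added to the base output. The crude uniform 4-bit quantizer is only a stand-in for QLoRA's NF4 format, and all sizes and scales are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(W):
    """Crude uniform 4-bit quantizer (a stand-in for QLoRA's NF4 format):
    store int codes plus one scale, and keep the base weights frozen."""
    scale = np.abs(W).max() / 7.0
    codes = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

d, r = 16, 4                               # toy hidden size and LoRA rank
W = rng.standard_normal((d, d)).astype(np.float32)
codes, scale = quantize_4bit(W)            # frozen, low-memory base weights

A = rng.standard_normal((r, d)).astype(np.float32) * 0.01   # trainable LoRA factor
B = np.zeros((d, r), dtype=np.float32)                      # zero init: adapter starts as a no-op
alpha = 16.0

def qlora_forward(x):
    """Output = (dequantized frozen base) @ x + scaled low-rank update @ x.
    Only A and B would receive gradients during finetuning."""
    base = dequantize(codes, scale) @ x
    adapter = (alpha / r) * (B @ (A @ x))
    return base + adapter

x = rng.standard_normal(d).astype(np.float32)
print(qlora_forward(x)[:4])
```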

April

Paper Links
1) Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning - applies deep reinforcement learning to synthesize agile soccer skills for a miniature humanoid robot; the resulting policy allows dynamic movement skills such as fast recovery, walking, and kicking. Paper, code,summary
2) Scaling Transformer to 1M tokens and beyond with RMT - leverages a recurrent memory transformer architecture to increase BERT’s effective context length to two million tokens while maintaining high memory retrieval accuracy. Paper, code,summary
3) Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond - a comprehensive and practical guide for practitioners working with LLMs; discusses many use cases with practical applications and limitations of LLMs in real-world scenarios. Paper , code,summary
4) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head - connects ChatGPT with audio foundational models to handle challenging audio tasks and a modality transformation interface to enable spoken dialogue. Paper , code,summary
5) ChatGPT for Information Extraction - provides a deeper assessment of ChatGPT's performance on the important information extraction task. Paper, code,summary
6) Comparing Physician vs ChatGPT - investigates if chatbot assistants like ChatGPT can provide responses to patient questions while emphasizing quality and empathy; finds that chatbot responses were preferred over physician responses and rated significantly higher in terms of both quality and empathy. Paper, code,summary
7) Stable and low-precision training for large-scale vision-language models - introduces methods for accelerating and stabilizing training of large-scale language vision models. Paper, code,summary
8) Learning to Compress Prompts with Gist Tokens - an approach that trains language models to compress prompts into gist tokens reused for compute efficiency; this approach enables 26x compression of prompts, resulting in up to 40% FLOPs reductions. Paper, code,summary
9) Scaling the leading accuracy of deep equivariant models to biomolecular simulations of realistic size - presents a framework for large-scale biomolecular simulation; this is achieved through the high accuracy of equivariant deep learning and the ability to scale to large and long simulations; the system is able to “perform nanoseconds-long stable simulations of protein dynamics and scale up to a 44-million atom structure of a complete, all-atom, explicitly solvated HIV capsid on the Perlmutter supercomputer.” Paper, code,summary
10) Evaluating Verifiability in Generative Search Engines - performs human evaluation to audit popular generative search engines such as Bing Chat, Perplexity AI, and NeevaAI; finds that, on average, only 52% of generated sentences are supported by citations and 75% of citations support their associated sentence. Paper, code,summary
11) Generative Disco: Text-to-Video Generation for Music Visualization - an AI system based on LLMs and text-to-image models that generates music visualizations. Paper , code,summary
12) Visual Instruction Tuning - presents an approach that uses language-only GPT-4 to generate multimodal language-image instruction-following data; applies instruction tuning with the data and introduces LLaVA, an end-to-end trained large multimodal model for general-purpose visual and language understanding. Paper, code,summary
13) ChatGPT: Applications, Opportunities, and Threats Paper, code,summary
14) Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models - a plug-and-play compositional reasoning framework that augments LLMs and can infer the appropriate sequence of tools to compose and execute in order to generate final responses; achieves 87% accuracy on ScienceQA and 99% on TabMWP. Paper, code,summary
15) Generative Agents: Interactive Simulacra of Human Behavior - proposes an architecture that extends LLMs to build agents that enable simulations of human-like behavior; these capabilities are possible by storing a complete record of an agent's experiences, synthesizing memories over time into higher-level reflections, and retrieving them dynamically to plan behavior. Paper, code,summary
16) Emergent autonomous scientific research capabilities of large language models - presents an agent that combines LLMs for autonomous design, planning, and execution of scientific experiments; shows emergent scientific research capabilities, including the successful performance of catalyzed cross-coupling reactions. Paper, code,summary
17) ChemCrow: Augmenting large-language models with chemistry tools - presents an LLM chemistry agent that performs tasks across synthesis, drug discovery, and materials design; it integrates 13 expert-design tools to augment LLM performance in chemistry and demonstrate effectiveness in automating chemical tasks. Paper , code,summary
18) One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era - A Survey of ChatGPT and GPT-4 Paper , code,summary
19) OpenAGI: When LLM Meets Domain Experts - an open-source research platform to facilitate the development and evaluation of LLMs in solving complex, multi-step tasks through manipulating various domain expert models. Paper, code,summary
20) Teaching Large Language Models to Self-Debug - proposes an approach that teaches LLMs to debug their predicted program via few-shot demonstrations; this allows a model to identify its mistakes by explaining generated code in natural language; achieves SoTA on several code generation tasks like text-to-SQL generation. Paper, code,summary
21) Instruction Tuning with GPT-4 - presents GPT-4-LLM, a "first attempt" to use GPT-4 to generate instruction-following data for LLM fine-tuning; the dataset is released and includes 52K unique English and Chinese instruction-following data; the dataset is used to instruction-tune LLaMA models which leads to superior zero-shot performance on new tasks. Paper, code,summary
22) Eight Things to Know about Large Language Models - discusses important considerations regarding the capabilities and limitations of LLMs. Paper, code,summary
23) A Survey of Large Language Models - a new 50 pages survey on large language models. Paper, code,summary
24) Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data - an open-source chat model fine-tuned with LoRA. Leverages 100K dialogs generated from ChatGPT chatting with itself; it releases the dialogs along with 7B, 13B, and 30B parameter models. Paper , code,summary
25) Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark - a new benchmark of 134 text-based Choose-Your-Own-Adventure games to evaluate the capabilities and unethical behaviors of LLMs. Paper , code,summary
26) Better Language Models of Code through Self-Improvement - generates pseudo data from knowledge gained through pre-training and fine-tuning; adds the data to the training dataset for the next step; results show that different frameworks can be improved in performance using code-related generation tasks. Paper, code,summary
27) Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models - an overview of applications of ChatGPT and GPT-4; the analysis is done on 194 relevant papers and discusses capabilities, limitations, concerns, and more. Paper, code,summary
28) Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling - a suite for analyzing LLMs across training and scaling; includes 16 LLMs trained on public data and ranging in size from 70M to 12B parameters. Paper, code,summary

March

Paper Links
1) BloombergGPT: A Large Language Model for Finance - a new 50B parameter large language model for finance. Claims the largest domain-specific dataset yet with 363 billion tokens... further augmented with 345 billion tokens from general-purpose datasets; outperforms existing models on financial tasks while not sacrificing performance on general LLM benchmarks. Paper, code,summary
2) Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware - a low-cost system that performs end-to-end imitation learning from real demonstrations; also presents an algorithm called Action Chunking with Transformers to learn a generative model that allows a robot to learn difficult tasks in the real world. Paper, Tweet
3) HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace - a system that leverages LLMs like ChatGPT to conduct task planning, select models and act as a controller to execute subtasks and summarize responses according to execution results. Paper, code,summary
4) ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge - a medical chat model fine-tuned on LLaMA using medical domain knowledge. Collects data on around 700 diseases and generated 5K doctor-patient conversations to finetune the LLM. Paper, code,summary
5) LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention - a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model; generates responses comparable to the fully fine-tuned 7B Alpaca; it’s also extended for multi-modal input support. Paper , code,summary
6) ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks - demonstrates that ChatGPT can outperform crowd-workers for several annotation tasks such as relevance, topics, and frames detection; besides better zero-shot accuracy, the per-annotation cost of ChatGPT is about 20 times lower than MTurk's. Paper , code,summary
7) Language Models can Solve Computer Tasks - shows that a pre-trained LLM agent can execute computer tasks using a simple prompting scheme where the agent recursively criticizes and improves its outputs. Paper, code,summary
8) DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents - a paradigm to enhance large language model completions by allowing models to communicate feedback and iteratively improve output; DERA outperforms base GPT-4 on clinically-focused tasks. Paper, code,summary
9) Sparks of Artificial General Intelligence: Early experiments with GPT-4 - a comprehensive investigation of an early version of GPT-4 when it was still in active development by OpenAI. Paper, code,summary
10) Capabilities of GPT-4 on Medical Challenge Problems - shows that GPT-4 exceeds the passing score on USMLE by over 20 points and outperforms GPT-3.5 as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). Paper, code,summary
11) GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models - investigates the potential implications of GPT models and related systems on the US labor market. Paper, code,summary
12) CoLT5: Faster Long-Range Transformers with Conditional Computation - a long-input Transformer model that employs conditional computation, devoting more resources to important tokens in both feedforward and attention layers. Paper , code,summary
13) Artificial muses: Generative Artificial Intelligence Chatbots Have Risen to Human-Level Creativity - compares human-generated ideas with those generated by generative AI chatbots like ChatGPT and YouChat; reports that 9.4% of humans were more creative than GPT-4 and that GAIs are valuable assistants in the creative process. Paper , code,summary
14) A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models - a comprehensive capability analysis of GPT series models; evaluates performance on 9 natural language understanding tasks using 21 datasets. Paper, code,summary
15) Context-faithful Prompting for Large Language Models - presents a prompting technique that aims to improve LLMs' faithfulness using strategies such as opinion-based prompts and counterfactual demonstrations. Paper, code,summary
16) PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing - a trillion parameter language model with sparse heterogeneous computing. Paper, code,summary
17) GPT-4 Technical Report - GPT-4 - a large multimodal model with broader general knowledge and problem-solving abilities. Paper, code,summary
18) LERF: Language Embedded Radiance Fields - a method for grounding language embeddings from models like CLIP into NeRF; this enables open-ended language queries in 3D. Paper, code,summary
19) An Overview on Language Models: Recent Developments and Outlook - an overview of language models covering recent developments and future directions. It also covers topics like linguistic units, structures, training methods, evaluation, and applications. Paper, code,summary
20) Eliciting Latent Predictions from Transformers with the Tuned Lens - a method for transformer interpretability that can trace a language model’s predictions as they develop layer by layer. Paper, code,summary
21) Meet in the Middle: A New Pre-training Paradigm - a new pre-training paradigm using techniques that jointly improve training data efficiency and capabilities of LMs in the infilling task; performance improvement is shown in code generation tasks. Paper , code,summary
22) Resurrecting Recurrent Neural Networks for Long Sequences - demonstrates that careful design of deep RNNs using standard signal propagation arguments can recover the performance of deep state-space models on long-range reasoning tasks. Paper , code,summary
23) UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation - a new approach to tune a lightweight and versatile retriever to automatically retrieve prompts to improve zero-shot performance and help mitigate hallucinations. Paper, code,summary
24) Patches Are All You Need? - proposes ConvMixer, a parameter-efficient fully-convolutional model which replaces self-attention and MLP layers in ViTs with less-expressive depthwise and pointwise convolutional layers. Paper, code,summary
25) NeRFMeshing: Distilling Neural Radiance Fields into Geometrically-Accurate 3D Meshes - a compact and flexible architecture that enables easy 3D surface reconstruction from any NeRF-driven approach; distills NeRFs into geometrically-accurate 3D meshes. Paper, code,summary
26) High-throughput Generative Inference of Large Language Models with a Single GPU - a high-throughput generation engine for running LLMs with limited GPU memory. Paper, code,summary
27) PaLM-E: An Embodied Multimodal Language Model - incorporates real-world continuous sensor modalities resulting in an embodied LM that performs tasks such as robotic manipulation planning, visual QA, and other embodied reasoning tasks. Paper, Demo , code,summary
28) Prismer: A Vision-Language Model with An Ensemble of Experts - a parameter-efficient vision-language model powered by an ensemble of domain experts; it efficiently pools expert knowledge from different domains and adapts it to various vision-language reasoning tasks. Paper, GitHub, Project , code,summary
29) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models - it connects ChatGPT and different visual foundation models to enable users to interact with ChatGPT beyond language format. Paper, GitHub, code,summary
30) A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT - an overview of generative AI - from GAN to ChatGPT. Paper, code,summary
31) Larger language models do in-context learning differently - shows that with scale, LLMs can override semantic priors when presented with enough flipped labels; these models can also perform well when replacing targets with semantically-unrelated targets. Paper , code,summary
32) OpenICL: An Open-Source Framework for In-context Learning - a new open-source toolkit for in-context learning and LLM evaluation; supports various state-of-the-art retrieval and inference methods, tasks, and zero-/few-shot evaluation of LLMs. Paper, Repo, code,summary
33) MathPrompter: Mathematical Reasoning using Large Language Models - a technique that improves LLM performance on mathematical reasoning problems; it uses zero-shot chain-of-thought prompting and verification to ensure generated answers are accurate. Paper, code,summary
34) Scaling up GANs for Text-to-Image Synthesis - enables scaling up GANs on large datasets for text-to-image synthesis; it’s found to be orders of magnitude faster at inference time, synthesizes high-resolution images, & supports various latent space editing applications. Paper, Project , code,summary
35) Language Is Not All You Need: Aligning Perception with Language Models - introduces a multimodal large language model called Kosmos-1; achieves great performance on language understanding, OCR-free NLP, perception-language tasks, visual QA, and more. Paper, code,summary

February

Paper Links
1) EvoPrompting: Language Models for Code-Level Neural Architecture Search - combines evolutionary prompt engineering with soft prompt-tuning to find high-performing models; it leverages few-shot prompting which is further improved by using an evolutionary search approach to improve the in-context examples. Paper, code,summary
2) Goal Driven Discovery of Distributional Differences via Language Descriptions - a new task that automatically discovers corpus-level differences via language description in a goal-driven way; applications include discovering insights from commercial reviews and error patterns in NLP systems. Paper, code,summary
3) Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control - a scalable approach to planning with LLMs in embodied settings through grounding functions; GD is found to be a general, flexible, and expressive approach to embodied tasks. Paper, Project, code,summary
4) Enabling Conversational Interaction with Mobile UI using Large Language Models - an approach that enables versatile conversational interactions with mobile UIs using a single LLM. Paper, code,summary
5) LLaMA: Open and Efficient Foundation Language Models - a 65B parameter foundation model released by Meta AI; relies on publicly available data and outperforms GPT-3 on most benchmarks despite being 10x smaller. Paper, code,summary
6) The Wisdom of Hindsight Makes Language Models Better Instruction Followers - an alternative algorithm to train LLMs from feedback; the feedback is converted to instruction by relabeling the original one and training the model, in a supervised way, for better alignment. Paper, GitHub code,summary
7) Active Prompting with Chain-of-Thought for Large Language Models - a prompting technique to adapt LLMs to different task-specific example prompts (annotated with human-designed chain-of-thought reasoning); this process involves finding where the LLM is most uncertain and annotating those. Paper, Code code,summary
8) Recitation-Augmented Language Models - an approach that recites passages from the LLM’s own memory to produce final answers; shows high performance on knowledge-intensive tasks. Paper , code,summary
9) Learning Performance-Improving Code Edits - an approach that uses LLMs to suggest functionally correct, performance-improving code edits. Paper, code,summary
10) More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models - a comprehensive analysis of novel prompt injection threats to application-integrated LLMs. Paper, code,summary
11) Aligning Text-to-Image Models using Human Feedback - proposes a fine-tuning method to align generative models using human feedback. Paper, code,summary
12) MERF: Memory-Efficient Radiance Fields for Real-time View Synthesis in Unbounded Scenes - a memory-efficient radiance field representation for real-time view synthesis of large-scale scenes in a browser. Paper, code,summary
13) Symbolic Discovery of Optimization Algorithms - a simple and effective optimization algorithm that’s more memory-efficient than Adam (a toy sketch of the discovered update rule appears after this list). Paper, code,summary
14) Transformer models: an introduction and catalog Paper, code,summary
15) The Capacity for Moral Self-Correction in Large Language Models - finds strong evidence that language models trained with RLHF have the capacity for moral self-correction. The capability emerges at 22B model parameters and typically improves with scale. Paper, code,summary
16) Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment - an unsupervised method for text-image alignment that leverages pretrained language models; it enables few-shot image classification with LLMs. Paper , Code ,summary
17) Augmented Language Models: a Survey - a survey of language models that are augmented with reasoning skills and the capability to use tools. Paper, code,summary
18) Auditing large language models: a three-layered approach - proposes a policy framework for auditing LLMs. Paper, code,summary
19) Energy Transformer - a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model; this follows the popularity that Hopfield Networks have gained in the field of ML. Paper, code,summary
20) Toolformer: Language Models Can Teach Themselves to Use Tools - introduces language models that teach themselves to use external tools via simple API calls (a toy post-processing sketch appears after this list). Paper, code,summary
21) Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents - proposes using language models for open-world game playing. Paper, code,summary
22) A Categorical Archive of ChatGPT Failures - a comprehensive analysis of ChatGPT failures for categories like reasoning, factual errors, maths, and coding. Paper, code,summary
23) Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery - optimizing hard text prompts through efficient gradient-based optimization. Paper, code,summary
24) Data Selection for Language Models via Importance Resampling - proposes a cheap and scalable data selection framework based on an importance resampling algorithm to improve the downstream performance of LMs. Paper, code,summary
25) A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity - performs a more rigorous evaluation of ChatGPT on reasoning, hallucination, and interactivity. Paper, code,summary
26) Offsite-Tuning: Transfer Learning without Full Model - introduces an efficient, privacy-preserving transfer learning framework to adapt foundational models to downstream data without access to the full model. Paper, Project, code,summary
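
The sketch below illustrates entry 13 (the discovered Lion optimizer) on a toy quadratic: the step direction is the sign of an interpolated momentum, so only one moment buffer is kept, which is what makes it more memory-efficient than Adam. The hyperparameters and the quadratic objective are illustrative assumptions, not tuned values.

```python
import numpy as np

def lion_step(w, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """Lion update: the step is the sign of an interpolated momentum plus
    decoupled weight decay; the momentum buffer is updated after the step."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    w_new = w - lr * (update + weight_decay * w)
    m_new = beta2 * m + (1 - beta2) * grad
    return w_new, m_new

# Toy usage: minimize 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(w)
for _ in range(1000):
    w, m = lion_step(w, w.copy(), m, lr=1e-2)
print(w)   # ends up oscillating close to zero
```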
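The sketch below illustrates the inference-time side of entry 20 (Toolformer): generated text containing inline API-call markup is post-processed by parsing the call, running the tool, and splicing the result back into the text. The [Calculator(...)] markup format, the whitelist, and the tool registry are simplifications assumed for illustration.

```python
import re

def calculator(expr):
    """Hypothetical tool backing a Toolformer-style API call (whitelisted arithmetic only)."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        raise ValueError("unsupported expression")
    return str(eval(expr))   # acceptable here because of the character whitelist above

TOOLS = {"Calculator": calculator}
CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def execute_tool_calls(text):
    """Replace inline markup like [Calculator(2+3)] with the tool's result."""
    def run(match):
        name, arg = match.group(1), match.group(2)
        return TOOLS[name](arg) if name in TOOLS else match.group(0)
    return CALL.sub(run, text)

print(execute_tool_calls("The total is [Calculator(17*3+4)] items."))
# -> "The total is 55 items."
```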

January

Paper Links
1) REPLUG: Retrieval-Augmented Black-Box Language Models - a retrieval-augmented LM framework that adapts a retriever to a large-scale, black-box LM like GPT-3. Paper, code,summary
2) The Flan Collection: Designing Data and Methods for Effective Instruction Tuning - releases a more extensive publicly available collection of tasks, templates, and methods for advancing instruction-tuned models. Paper, code,summary
3) Multimodal Chain-of-Thought Reasoning in Language Models - incorporates vision features to elicit chain-of-thought reasoning in multimodality, enabling the model to generate effective rationales that contribute to answer inference. Paper, Code ,summary
4) Benchmarking Large Language Models for News Summarization Paper , code,summary
5) Mathematical Capabilities of ChatGPT - investigates the mathematical capabilities of ChatGPT on a new holistic benchmark called GHOSTS. Paper, code,summary
6) Large Language Models Can Be Easily Distracted by Irrelevant Context - finds that many prompting techniques fail when presented with irrelevant context for arithmetic reasoning. Paper, code,summary
7) MusicLM: Generating Music From Text - a generative model for generating high-fidelity music from text descriptions. Paper, code,summary
8) Hungry Hungry Hippos: Towards Language Modeling with State Space Models - an approach to reduce the gap, in terms of performance and hardware utilization, between state space models and attention for language modeling. Paper, code,summary
9) A Watermark for Large Language Models - a watermarking framework for proprietary language models. Paper, code,summary
10) DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature - an approach for zero-shot machine-generated text detection; compares the log probability of a passage under the LLM with that of perturbed rewrites to determine if the passage was sampled from it (a toy scoring sketch appears after this list). Paper, code,summary
11) Large language models generate functional protein sequences across diverse families - an LLM that can generate protein sequences with a predictable function across large protein families. Paper, code,summary
12) Dissociating language and thought in large language models: a cognitive perspective - a review paper on the capabilities of LLMs from a cognitive science perspective. Paper, code,summary
13) Forecasting Potential Misuses of Language Models for Disinformation Campaigns—and How to Reduce Risk - new work analyzing how generative LMs could potentially be misused for disinformation and how to mitigate these types of risks. Paper, code,summary
14) Why do Nearest Neighbor Language Models Work? - empirically identifies reasons why retrieval-augmented LMs (specifically k-nearest neighbor LMs) perform better than standard parametric LMs. Paper, Code, code,summary
15) Memory Augmented Large Language Models are Computationally Universal - investigates the use of existing LMs (e.g., Flan-U-PaLM 540B) combined with associative read-write memory to simulate the execution of a universal Turing machine. Paper , code,summary
16) A Survey on Transformers in Reinforcement Learning - transformers for RL will be a fascinating research area to track. The same is true for the reverse direction (RL for Transformers)... a notable example: using RLHF to improve LLMs (e.g., ChatGPT). Paper, code,summary
17) Scaling Laws for Generative Mixed-Modal Language Models - introduces scaling laws for generative mixed-modal language models. Paper, code,summary
18) Rethinking with Retrieval: Faithful Large Language Model Inference - shows the potential of enhancing LLMs by retrieving relevant external knowledge based on decomposed reasoning steps obtained through chain-of-thought prompting. Paper, code,summary
19) SparseGPT: Massive Language Models Can Be Accurately Pruned In One-Shot - presents a technique for compressing large language models while not sacrificing performance; "pruned to at least 50% sparsity in one-shot, without any retraining." Paper, code,summary
20) Large Language Models as Corporate Lobbyists - with more capabilities, we are starting to see a wider range of applications with LLMs. This paper utilized large language models for conducting corporate lobbying activities. Paper , Code, summary
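
The sketch below illustrates entry 10 (DetectGPT): the probability-curvature criterion compares the model's log-probability of a passage against the average log-probability of several perturbed rewrites; model-generated text tends to sit near a local maximum, so the gap is large. Both `log_prob` and `perturb` are crude stand-ins assumed for illustration, not real model or paraphraser calls.

```python
import random

random.seed(0)

def log_prob(text):
    """Toy stand-in for the source model's log-probability: rewards words in
    alphabetical order, so shuffling the text tends to lower the score."""
    words = text.split()
    in_order = sum(a <= b for a, b in zip(words, words[1:]))
    return float(in_order) - 0.1 * len(words)

def perturb(text):
    """Toy stand-in for a T5-style rewrite: randomly shuffle the words."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

def detectgpt_score(text, n_perturbations=50):
    """Curvature score: original log-prob minus the mean log-prob of perturbations.
    A large positive gap suggests the passage was sampled from the model."""
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    return original - sum(perturbed) / len(perturbed)

# Under this toy model, the alphabetically ordered sentence plays the role of
# "model-generated" text and gets a clearly positive score.
print(detectgpt_score("apples bring color during every frosty garden"))
```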

Milestone Papers from 2017–2023, from the Awesome-LLM repo
