# LAB: Large-Scale Alignment for ChatBots

https://arxiv.org/abs/2403.01081

- taxonomy-guided synthetic data generation process and a multi-phase tuning framework
- scalable, cost-effective solution for enhancing LLM capabilities and instruction-following behaviors without the drawbacks of catastrophic forgetting

## 1. Introduction

(i) a taxonomy-guided synthetic data generation method and quality assurance process 
    
=> that yields a highly diverse and high-quality instruction dataset

=> without resorting to the use of proprietary LLMs like GPT-4 or substantial human curation
    
(ii) a novel multi-phase training framework and unconventional tuning regime

=> that allows for adding new knowledge and instruction-following abilities into pre-trained LLMs 

=> without suffering from catastrophic forgetting

## 2. Related work

Concurrent work, GLAN (Li et al., 2024), employs a semi-automatic approach to synthetic data generation that uses a human-curated taxonomy to generate instruction tuning data from a teacher model

However, GLAN cannot be used to generate synthetic data from domains that are not captured in the teacher model’s support

As such, like many other synthetic data generation approaches, GLAN has to rely on a large proprietary model (GPT-4). 

This poses complicated questions about the usability of generated data (especially for commercial purposes) since the terms of use of proprietary models typically forbid using the model to improve other models.

## 3. Methodology

(i) a taxonomy to enable data curation (section 3.1) as well as guide the synthetic data generator (section 3.2)

=> serves the purpose of ensuring high diversity and quality in the synthetically generated instruction-tuning dataset

(ii) a multi-phased instruction-tuning method with replay buffers to enable large-scale alignment tuning (section 3.3)

=> ensures training stability and prevents catastrophic forgetting.

### 3.1 Taxonomy

a taxonomy that hierarchically classifies the data samples into smaller task groups

At a high level, the taxonomy has three main branches: knowledge, foundational skills, and compositional skills

Each of these branches is further split into more granular levels where the tasks are defined in the leaf nodes and exemplified by providing manually written instruction-response pairs

=> This allows for easily identifying missing tasks in the target LLM and other tasks of interest and adding them to the training data pool

New tasks are added to the taxonomy by creating a leaf node under the appropriate branch and attaching 1–3 examples
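
As a rough illustration of the structure described above, the taxonomy can be modelled as a tree whose leaves hold seed instruction-response pairs. This is a minimal sketch, not the paper's implementation: the class `TaxonomyNode`, the helpers `add_task` and `leaves`, and the example path are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One node of the taxonomy; leaves carry seed instruction-response pairs."""
    name: str
    children: dict = field(default_factory=dict)
    examples: list = field(default_factory=list)  # (instruction, response) pairs

    def is_leaf(self) -> bool:
        return not self.children


def add_task(root, path, examples):
    """Create (or reuse) the nodes along `path` and attach seed examples to the leaf."""
    if not 1 <= len(examples) <= 3:
        raise ValueError("these notes describe attaching 1-3 examples per new leaf")
    node = root
    for part in path:
        node = node.children.setdefault(part, TaxonomyNode(part))
    node.examples.extend(examples)
    return node


def leaves(node, prefix=()):
    """Yield (path, leaf) for every leaf under `node`."""
    path = prefix + (node.name,)
    if node.is_leaf():
        yield path, node
    for child in node.children.values():
        yield from leaves(child, path)


# The three top-level branches named in the notes.
root = TaxonomyNode("root")
for branch in ("knowledge", "foundational_skills", "compositional_skills"):
    root.children[branch] = TaxonomyNode(branch)

# Hypothetical new task: one leaf with a single hand-written example.
add_task(
    root,
    ["compositional_skills", "writing", "formal_email"],
    [("Write a short email announcing quarterly results.", "Dear all, ...")],
)
leaf_paths = ["/".join(p) for p, _ in leaves(root)]
```

Walking `leaves(root)` is also a simple way to spot branches that still have no examples, which is the "identify missing tasks" use mentioned above.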

#### Knowledge

first divided based on document types like textbooks, technical manuals, etc.

further divided into various domains like finance, statistics, etc.

Each domain has a collection of documents and a sample set of domain-specific questions and answers.

=> This organization allows for better control over the licensing of text documents

=> only the documents with permissible licenses are selected for synthetic data generation, excluding knowledge sources that lack proper licensing, reinforcing the integrity of our knowledge-generation processes

#### Foundational skills

We identify mathematics, coding, linguistic ability and reasoning as foundational skills 

=> that the model requires to prime itself for better knowledge acquisition and build further complex and compositional skills

To teach the model foundational skills, we employ publicly available datasets
- The flan collection: Designing data and methods for effective instruction tuning, 2023
- Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023
- Learning to mine aligned code and natural language pairs from stack overflow ACM, 2018
- Musique: Multihop questions via single-hop question composition, 2022

#### Compositional skills

tasks that require a combination of knowledge and foundational skills, synergistically, to answer complex queries from users

Example
- the model’s ability to write a company-wide email sharing insights about the company’s performance last quarter and guidance for the upcoming year
- would require the model to understand the financial aspects of revenue, profit and loss,
- have the skills to do basic arithmetic,
- and also have the skills to compose a formal email.

### 3.2 TAXONOMY-DRIVEN SYNTHETIC DATA GENERATOR

The small number of manually curated data samples, embedded in the leaf nodes of the taxonomy, can be directly used for instruction tuning of the chatbot

=> however, the model may still perform poorly. Prior work (Li et al., 2023) has shown that typically, a large amount of high-quality instruction data is required for improving the instruction-following performance of LLMs

It is possible to leverage existing synthetic data generators (SDGs) like Wang et al. (2023) and Taori et al. (2023) to use the embedded examples and generate a lot more instruction data synthetically using teacher LLMs

=> But such distillation-based SDGs tend to over-sample from the dominant modes of the teacher model and thus lack diversity and quality in the generated data (Gudibande et al., 2023)

this limitation is attributed to the random selection of examples from the pool of seed samples

=> with random selection, the examples used to prompt the teacher model at each time are an “average” of the seed pool i.e. they do not focus on any specific task.

=> This lack of focus tends to encourage the teacher model to generate more synthetic data from its dominant modes and ignore the long tail of interesting tasks

we replace the random sampling in existing SDGs with a taxonomy-driven approach to guide the sampling of synthetic data

=> enabling targeted coverage of the support of the teacher model distribution around the individual leaf nodes of the taxonomy. 

With taxonomy-driven sampling, since only the examples within each leaf node are used when sampling for the corresponding task, each task is guaranteed to be well represented in the prompts

=> taxonomy-driven sampling produces a diverse set of synthetic data and hence improves the data used to train the student model across the task domain
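
The contrast between random and taxonomy-driven seed selection can be sketched as follows. This is an illustration under assumptions: the seed pool, the leaf names, and both helper functions are made up for the example and are not taken from the paper.

```python
import random

# Hypothetical seed pool: leaf-node name -> seed questions (made-up data).
seed_pool = {
    "arithmetic": ["What is 17 + 25?", "Compute 9 * 12."],
    "formal_email": ["Write an email announcing a team offsite."],
    "rhyming_poetry": ["Write a four-line rhyming poem about rain."],
}


def random_prompts(pool, num_prompts, k, rng):
    """Baseline: each prompt mixes k seeds drawn from the whole pool."""
    flat = [(leaf, q) for leaf, qs in pool.items() for q in qs]
    return [rng.sample(flat, k) for _ in range(num_prompts)]


def taxonomy_prompts(pool, per_leaf, k, rng):
    """Taxonomy-driven: every leaf gets its own prompts, built only from its own seeds."""
    prompts = []
    for leaf, questions in pool.items():
        for _ in range(per_leaf):
            chosen = rng.sample(questions, min(k, len(questions)))
            prompts.append([(leaf, q) for q in chosen])
    return prompts


rng = random.Random(0)
guided = taxonomy_prompts(seed_pool, per_leaf=2, k=2, rng=rng)
covered = {leaf for prompt in guided for leaf, _ in prompt}   # leaves that got a prompt
mixed = random_prompts(seed_pool, num_prompts=6, k=2, rng=rng)
```

With the guided variant every leaf appears in exactly `per_leaf` prompts, and no prompt mixes seeds from different tasks. The random baseline offers neither guarantee.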

we now introduce two new synthetic data generation (SDG) methods in LAB that leverage the taxonomy to guide the data generation process

1. The first one is targeted for skills generation and uses the handful of task examples in the leaf nodes to generate a lot more using the open-source Mixtral-8x7B model

2. The second one is targeted at knowledge generation. While it still uses the Mixtral-8x7B model, unlike prior works, it does not rely on the knowledge stored in the teacher model.

#### SKILL GENERATION

Skills-SDG uses four prompt templates, one for each of the four stages of data generation described below

Each template has its own set of principles and instructions that control the role of the teacher model (generator vs evaluator) and guide the generation/evaluation process.

##### 1. Instruction generation

the teacher model acts as a question generator

using a specialized prompt to leverage its knowledge and create diverse questions

By iterating through each leaf node of a taxonomy, the teacher generates queries that adhere to specific principles and thoroughly explore the targeted domain

=> enhancing the comprehensiveness of the generated content

**Instruction Generator prompt template**
    
```
 You are asked to come up with a set of {num samples} diverse questions on {task}.
 Please follow these guiding principles when generating responses:
 * Use proper grammar and punctuation.
 * Always generate safe and respectful content. Do not generate content that is harmful, abusive, or offensive.
 * Always generate content that is factually accurate and relevant to the prompt.
 * The questions should be clear and human-like.
 * The questions should be diverse and cover a wide range of topics.
 * The questions should not be template-based or generic, it should be very diverse.
 * Simply return the questions, do not return any answers or explanations.
 * Strictly adhere to the prompt and generate responses in the same style and format as the example.
 To better assist you with this task, here is an example:
 ### Question:
 1. {icl question}
 Now generate {num samples} such questions, remember to follow the principles mentioned above and use the same format as the examples. Remember to use the same style and format as the example above. Return your responses in the format of [### Question [question number]: [question]]
```
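
A small sketch of how such a template might be filled for one leaf node and how the teacher's reply could be parsed. The abbreviated `TEMPLATE`, the helper names, and the fake completion are assumptions for illustration. The actual call to the teacher model is omitted.

```python
import re

# Abbreviated stand-in for the full template above; placeholders use the same names.
TEMPLATE = (
    "You are asked to come up with a set of {num samples} diverse questions on {task}.\n"
    "To better assist you with this task, here is an example:\n"
    "### Question:\n"
    "1. {icl question}\n"
)


def fill_template(template, num_samples, task, icl_question):
    """Substitute placeholders with str.replace, since the names contain spaces."""
    return (
        template.replace("{num samples}", str(num_samples))
        .replace("{task}", task)
        .replace("{icl question}", icl_question)
    )


# Matches lines shaped like "### Question 3: some question text".
QUESTION_RE = re.compile(r"^###\s*Question\s*(\d+)\s*:\s*(.+)$", re.MULTILINE)


def parse_questions(completion):
    """Extract the question text from each '### Question N: ...' line."""
    return [m.group(2).strip() for m in QUESTION_RE.finditer(completion)]


prompt = fill_template(TEMPLATE, 2, "basic arithmetic", "What is 17 + 25?")
# Made-up teacher output in the format the template asks for.
fake_completion = "### Question 1: What is 9 * 12?\n### Question 2: What is 100 / 4?\n"
questions = parse_questions(fake_completion)
```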

##### 2. Evaluating synthetic instruction

the teacher model assumes the role of an instruction evaluator

the teacher model uses targeted prompts to filter out questions that don’t meet predefined principles, including:
- relevance to the domain, 
- potential harm,
- or questions beyond a language model’s answering capabilities

=> This ensures that only high-quality, contextually appropriate questions move forward in the process.

##### 3. Generating responses

The teacher model, functioning as a response generator in this stage

adopts dual personas for precision and creativity, guided by distinct prompts.

=> This tailored approach helps to generate both creative responses for domains like writing and role-play, and precise answers for STEM and data extraction

=> aligning the response style to human expectations through principles and seed examples in the leaf nodes

##### 4. Evaluating the synthetic instruction-response pair

rigorous process to filter and select high-quality instruction and response pairs

Using a 3-point rating system, the teacher model evaluates each sample, filtering out those that are incorrect, irrelevant, or deviate from the provided principles

=> ensuring the training dataset’s quality and relevance are enhanced for the student model

**Instruction-response Evaluation template**

```
 Please act as an impartial judge and evaluate the quality of the answer provided by an AI assistant to the questions displayed below. Evaluate whether or not the answer is a good example of how AI Assistant should respond to the user’s instruction. Please assign a score using the following 3-point scale:
 1: It means the answer is incorrect, irrelevant, unsafe or provides incomplete and garbage information. For instance, the answer may be factually wrong, off-topic, or filled with irrelevant content that doesn’t address the user’s question or it could be incomplete and hanging. It may also include any harmful, unethical, racist, sexist, explicit, offensive, toxic, dangerous, or illegal content.
 2: It means the answer provides the correct answer, but it is brief and to the point without explanations. While it directly answers the user’s question, it lacks additional context or in-depth explanations.
 3: It means the answer is a perfect answer from an AI Assistant. It intentionally addresses the user’s question with a comprehensive and detailed explanation. It demonstrates expert knowledge in the area, is very well written, logical, easy to follow, engaging, and insightful. And the answer is safe and does not include any harmful content.
 Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the answer on a scale of 1 to 3 as mentioned above. Please use the following examples as a reference for your evaluation.
```
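
The filtering step that follows the judge prompt could look roughly like this. Two things are assumptions here, not facts from the paper: that the judge ends its reply with a line such as `Rating: 3`, and that only pairs meeting `min_score` are kept (the notes do not state the cut-off).

```python
import re

# Assumed output convention: the judge writes "Rating: N" with N in {1, 2, 3}.
RATING_RE = re.compile(r"Rating:\s*([123])\b")


def extract_rating(judgement):
    """Return the last 'Rating: N' found in the judge's text, or None if absent."""
    found = RATING_RE.findall(judgement)
    return int(found[-1]) if found else None


def filter_pairs(pairs, judgements, min_score=3):
    """Keep instruction-response pairs whose rating is at least `min_score`.

    Pairs with no parsable rating are dropped rather than guessed at.
    """
    kept = []
    for pair, judgement in zip(pairs, judgements):
        score = extract_rating(judgement)
        if score is not None and score >= min_score:
            kept.append(pair)
    return kept


pairs = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3")]
judgements = [
    "The answer is thorough and correct. Rating: 3",
    "Correct but terse. Rating: 2",
    "The judge forgot to give a score.",
]
kept = filter_pairs(pairs, judgements)
```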

#### KNOWLEDGE-GENERATION

Synthetic data generators are inherently limited by the knowledge and capabilities of the teacher model

=> This is one of the main reasons why most successful SDG methods depend on the GPT-4 model, which presumably has the highest coverage of knowledge and skills

However, there are many domains that no open/proprietary model is trained on 

=> and hence cannot work as a teacher model using existing SDG methods

To address this limitation, we devised a new SDG pipeline for generating instruction data on domains that the teacher model has not been trained on. We call it knowledge-SDG

1. Similar to the process of skills generation, knowledge-SDG uses the curator-provided examples embedded in the leaf nodes of the knowledge branch of the taxonomy

2. But additionally, the teacher model is provided a knowledge source in the form of documents, manuals, and books on the target subject 

=> to ground the generated instruction data into a reliable source 

=> thus avoiding dependence on the internal knowledge base of a teacher model, which may struggle with specialized domains and could lead to inaccuracies or hallucinations especially on highly specialized, technical domains.

To ensure that the generated answers remain faithful to the content of the source material, the teacher model is repurposed as an evaluator, as in skills-SDG, and validates that the generated responses are grounded in and faithful to the source documents
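
To make the grounding idea concrete, here is a minimal sketch of how a source document might be split into chunks and paired with a leaf-node example in the generation prompt. The chunk size, the prompt wording, and both function names are hypothetical; the paper's actual knowledge-SDG prompts are not reproduced in these notes.

```python
def chunk_document(text, max_words=200):
    """Split a document into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_grounded_prompt(chunk, example_question, example_answer, num_questions=3):
    """Ask the teacher to write Q&A pairs answerable from `chunk` alone (assumed wording)."""
    return (
        f"Using only the document below, write {num_questions} question-answer pairs.\n"
        "Every answer must be supported by the document.\n\n"
        f"Document:\n{chunk}\n\n"
        f"Example question: {example_question}\n"
        f"Example answer: {example_answer}\n"
    )


# Made-up source text standing in for a licensed manual or textbook.
document = "Gross margin is revenue minus cost of goods sold divided by revenue. " * 5
chunks = chunk_document(document, max_words=24)
prompts = [
    build_grounded_prompt(c, "What is gross margin?", "Revenue minus COGS, over revenue.")
    for c in chunks
]
```

The same chunk can later be handed to the evaluator prompt so that faithfulness is judged against the exact text the answer was generated from.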

### 3.3 MULTI-PHASE TRAINING

LAB training happens in two phases, knowledge tuning, followed by skills tuning.

In the knowledge-tuning phase, the model is trained on samples from the knowledge and foundational skills branches of the taxonomy. 

This phase, in turn, is carried out in two steps. We split the data under the knowledge and foundational skills branches into two buckets based on the response length.

Then we first train the model on the samples with short responses before moving on to training on samples with long responses. 

=> Similar to prior work, our empirical results also suggest that this two-step approach to knowledge-tuning improves model performance.
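
The bucketing by response length can be sketched in a few lines. The threshold `max_short_words` and the use of whitespace-separated word counts are assumptions; the notes do not say how length is measured or where the cut falls.

```python
def split_by_response_length(samples, max_short_words=50):
    """Split samples into (short, long) buckets by response word count."""
    short, long_ = [], []
    for sample in samples:
        n_words = len(sample["response"].split())
        (short if n_words <= max_short_words else long_).append(sample)
    return short, long_


# Made-up samples for illustration.
samples = [
    {"instruction": "Capital of France?", "response": "Paris."},
    {"instruction": "Explain compound interest.", "response": "word " * 120},
    {"instruction": "2 + 2?", "response": "4"},
]
short_bucket, long_bucket = split_by_response_length(samples)
# Step 1 of knowledge tuning would train on short_bucket, step 2 on long_bucket.
```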

After knowledge tuning, we start the skills-tuning phase where the best model checkpoint from the knowledge-tuning phase is trained on the compositional skills branch of the taxonomy.

=> Our empirical findings indicate that starting with knowledge and foundational skills training, before progressing to compositional skills, leads to significantly better benchmark performance

a replay buffer of the data from the knowledge-tuning phase is employed

=> to address the challenge of catastrophic forgetting when training in two distinct phases

For selecting the best model checkpoint during intermediate phases, we rely on the MMLU benchmark during the knowledge-tuning phase

and the MT-bench during the skills-tuning phase. 

Summary
- Knowledge Tuning 1 - Knowledge (short)
- Knowledge Tuning 2 - Knowledge (long) & Foundational skills + replay of KT1 data
- Skill Tuning - Compositional skills + replay of KT1 & KT2 data
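
The schedule in the summary can be expressed as a small data-mixing routine. This is a sketch under assumptions: the notes do not say how much earlier data is replayed, so `replay_fraction` is a hypothetical knob, and the tiny string datasets are placeholders.

```python
import random


def with_replay(new_data, earlier_phases, replay_fraction, rng):
    """Mix a random subset of each earlier phase's data into the new phase's data."""
    mixed = list(new_data)
    for old in earlier_phases:
        k = round(len(old) * replay_fraction)
        mixed.extend(rng.sample(old, k))
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)
# Placeholder datasets: short knowledge, long knowledge + foundational, compositional.
knowledge_short = [f"ks{i}" for i in range(10)]
kt2_new = [f"kl{i}" for i in range(6)] + [f"fs{i}" for i in range(4)]
compositional = [f"cs{i}" for i in range(8)]

kt1_data = list(knowledge_short)                                    # KT1: no replay
kt2_data = with_replay(kt2_new, [knowledge_short], 0.5, rng)        # KT2: replay KT1
st_data = with_replay(compositional, [knowledge_short, kt2_new], 0.5, rng)  # ST: replay KT1 & KT2
```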

In our training process, we consciously avoid overtraining

=> Despite the possibility of achieving higher scores on intermediate benchmarks, we have found that selecting checkpoints from earlier stages of training results in more reliable and generalizable model performance. 
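
One way to read the checkpoint-selection remarks above is "prefer an earlier checkpoint when it scores nearly as well as the best one". The sketch below encodes that reading. The `tolerance` parameter and the scores are invented for illustration; the paper's exact selection rule is not given in these notes.

```python
def pick_checkpoint(scores, tolerance=0.0):
    """Return the earliest checkpoint whose score is within `tolerance` of the best.

    `scores` maps training step -> benchmark score (MMLU for knowledge tuning,
    MT-Bench for skills tuning). With tolerance=0 this is a plain argmax that
    breaks ties in favour of the earlier step.
    """
    if not scores:
        raise ValueError("no checkpoints to choose from")
    best = max(scores.values())
    eligible = [step for step, score in scores.items() if score >= best - tolerance]
    return min(eligible)


mmlu_by_step = {500: 63.1, 1000: 64.4, 1500: 64.5, 2000: 64.2}  # made-up numbers
plain_best = pick_checkpoint(mmlu_by_step)                  # highest score
early_best = pick_checkpoint(mmlu_by_step, tolerance=0.2)   # earliest near-best
```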

We employ small learning rates with an extended warm-up period, specifically 2 × 10⁻⁵ for Llama-based models and 1 × 10⁻⁶ for Mistral-based models, each beginning with a linear warm-up.

=> This strategy is hypothesized to aid the model in transitioning from broad dataset-wide learning to more focused, task-specific adjustments.

we utilize a large effective batch size of 3840, achieved through gradient accumulation

=> to enhance stability across the diverse range of tasks being learned concurrently. 

=> Our findings suggest that using cosine decay on learning rates during intermediate phases can destabilize subsequent training stages, likely due to the learning rate’s reduction to near zero, narrowing the loss landscape and complicating the integration of new phase gradients
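
A schedule consistent with these remarks is a linear warm-up followed by a flat learning rate, so that the rate never decays toward zero before the next phase. Holding the rate constant after warm-up is an assumption made for this sketch; the peak value is the Mistral-based rate quoted above and the 800 warm-up steps come from the MERLINITE-7B settings in these notes.

```python
def lr_at(step, peak_lr, warmup_steps):
    """Learning rate for a 0-indexed optimizer step: linear ramp, then constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    return peak_lr * ((step + 1) / warmup_steps)


PEAK_LR = 1e-6      # Mistral-based value quoted above
WARMUP_STEPS = 800  # warm-up length listed for MERLINITE-7B

# Sample the schedule at the first step, mid warm-up, end of warm-up, and much later.
schedule = [lr_at(s, PEAK_LR, WARMUP_STEPS) for s in (0, 399, 799, 5000)]
```

Because the rate stays at `PEAK_LR` once warm-up ends, a later phase resumes from a non-vanishing learning rate, which is the property the cosine-decay remark is concerned with.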

Summary MERLINITE-7B: 
- LEARNING RATE = 1E-6, 
- BATCH SIZE = 3840, 
- CONTEXT LENGTH = 2048 for KT & 4096 for ST, 
- #SAMPLES = 630K for KT1 & 230K for KT2 & 550K for ST,
- #WARM-UP = 800, 
- #EPOCHS = 4 for KT1 & 4 for KT2 & 7 for ST
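
For the effective batch size of 3840, the number of gradient-accumulation steps follows from the per-device batch and the device count. The hardware figures below (8 samples per device, 32 devices) are hypothetical and chosen only to show the arithmetic.

```python
def accumulation_steps(effective_batch, micro_batch, num_devices):
    """Number of micro-batches to accumulate before each optimizer step."""
    per_forward = micro_batch * num_devices
    if effective_batch % per_forward != 0:
        raise ValueError("effective batch must be a multiple of micro_batch * num_devices")
    return effective_batch // per_forward


EFFECTIVE_BATCH = 3840  # from the summary above
# Hypothetical setup: 8 samples per device on 32 devices.
steps = accumulation_steps(EFFECTIVE_BATCH, micro_batch=8, num_devices=32)
```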

## 4. RESULTS

we implemented the LAB method on two distinct open models, LLAMA-2-13B and MISTRAL-7B

utilizing MIXTRAL-8X7B-INSTRUCT V0.1 as the teacher model.

we employed a taxonomy consisting of numerous leaf nodes to produce a dataset comprising 1.2 million samples, divided almost evenly between knowledge-based (617k) and skill-based (588k) samples

in terms of MT-Bench, LABRADORITE-13B performs better than the current best model fine-tuned on LLAMA-2-13B and MERLINITE-7B performs better than the current best model fine-tuned on MISTRAL-7B, achieving state-of-the-art performance in terms of chatbot capability.

Importantly, our training method ensures that the model is not only good at multi-turn conversation but also maintains its knowledge and reasoning capability, as shown by the overall superior performance in the rest of the metrics.

Besides, unlike those top models that use GPT-4 as the teacher model, we achieve this performance using the open-weights MIXTRAL-8X7B-INSTRUCT-V0.1, which is a relatively weaker teacher model at orders of magnitude less cost.