# InstructLab

Website: https://instructlab.ai/

InstructLab's model-agnostic technology gives model upstreams with sufficient infrastructure resources the ability to create regular builds of their open source licensed models not by rebuilding and retraining the entire model but by composing new skills into it.

GitHub: https://github.com/instructlab

- instructlab: Command-line interface. Use this to chat with the model or train the model (training consumes the taxonomy data) - Apache 2.0
- training: Standalone DeepSpeed training implementation (backend) - Apache 2.0

- taxonomy: Taxonomy tree that will allow you to create models tuned with your data - Apache 2.0
- schema: JSON schema for Taxonomy YAML

- community: InstructLab Community wide collaboration space including contributing, security, code of conduct, etc
- instructlab-bot: GitHub bot to assist with the taxonomy contribution workflow

Hugging Face: https://huggingface.co/instructlab

4 models
- instructlab/granite-7b-lab
- instructlab/granite-7b-lab-GGUF
- instructlab/merlinite-7b-lab-GGUF
- instructlab/merlinite-7b-lab

License: Apache 2.0

## README

https://github.com/instructlab/community/blob/main/README.md

## InstructLab CLI

https://github.com/instructlab/instructlab/blob/main/README.md

## Taxonomy

https://github.com/instructlab/taxonomy/blob/main/README.md

## Project FAQ

https://github.com/instructlab/community/blob/main/FAQ.md

> What is LAB?

LAB (Large-scale Alignment for chatBots) is a novel synthetic data-based alignment tuning method for LLMs from IBM Research. It consists of three components:

- A taxonomy-driven data curation process
- A large-scale synthetic data generator
- Multi-phase training with replay buffers

The LAB approach allows incrementally adding new knowledge and skills to an already pre-trained model without catastrophic forgetting.

> What large language models (LLMs) am I contributing to through the InstructLab project?

Contributions to the InstructLab project fine-tune Merlinite-7b or Granite-7b, both open source licensed LLMs. Contributors have direct access to the model they are improving through Hugging Face.

> What is a “skill”?

In the context of InstructLab, a skill is a capability domain submitted by a contributor intending to train the AI model on the submitted information. In other words, when you submit a skill, you teach the AI model how to do something.

InstructLab skills are broken down into two main categories:

- Compositional skills. Compositional or performative skills allow AI models to perform specific tasks or functions. With InstructLab, there are two types of compositional skills:
  - Freeform compositional skills are performative skills that do not require additional context. For example, to train an AI model to write a poem, you would provide examples of poems.
  - Grounded compositional skills are performative skills that require additional context. One example is how an AI model reads the value of a cell in a table layout. To create the grounded skill to read a table formatted in Markdown, the additional context might be an example table layout.
- Foundational skills. Foundational skills are skills like math, reasoning, and coding.

**Note: Foundational skills are not currently being accepted.**
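
The difference between freeform and grounded compositional skills can be sketched as a small data structure. This is only an illustration: the `SeedExample` class and its field names are hypothetical and are not taken from the InstructLab schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeedExample:
    """One hypothetical question/answer pair for a compositional skill."""

    question: str
    answer: str
    context: Optional[str] = None  # assumption: only grounded skills carry context

    @property
    def kind(self) -> str:
        # Grounded skills require additional context; freeform skills do not.
        return "grounded" if self.context else "freeform"


# Freeform: the example stands on its own.
poem = SeedExample(
    question="Write a two-line poem about autumn.",
    answer="Leaves drift down in amber light,\nthe year exhales before the night.",
)

# Grounded: the Markdown table is the additional context the model must read.
table = "| Fruit | Price |\n|-------|-------|\n| Apple | 3 |\n| Pear | 5 |"
lookup = SeedExample(
    question="What is the price of a pear?",
    answer="5",
    context=table,
)

print(poem.kind)
print(lookup.kind)
```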

> What is “knowledge”?

Knowledge consists of data and facts. When creating knowledge for an AI model, you are providing it with additional data and information to answer questions more accurately. Whereas skills are the information that trains an AI model on how to do something, knowledge is based on the AI model’s ability to answer questions that involve facts, data, or references.

> Is the project looking for certain types of skill contributions?

Currently, InstructLab only accepts compositional (freeform and grounded) skills and knowledge. However, any type of freeform or grounded skill can be submitted. Some skills might not be added to the taxonomy repository for reasons such as duplication, submitting a skill that the model already does well, or submitting a controversial skill.

Foundational skills are not currently being accepted.

For a list of accepted skills, see Accepted Skills.

https://github.com/instructlab/taxonomy/blob/main/docs/SKILLS_GUIDE.md#accepted-skills

For a list of skills to avoid, see Skills to Avoid.

https://github.com/instructlab/taxonomy/blob/main/docs/SKILLS_GUIDE.md#skills-to-avoid

> What are the acceptance criteria for a knowledge submission?

Requirements for knowledge submissions can be found in the Getting Started with Knowledge Contributions guide.

https://github.com/instructlab/taxonomy?tab=readme-ov-file#getting-started-with-knowledge-contributions

> Do submissions to the project require a contributor license agreement of some kind?

The InstructLab project follows the same approach (the Developer's Certificate of Origin 1.1 (DCO)) that the Linux Kernel community uses to manage code contributions. Unless the file says otherwise for this project, the relevant open source license is the Apache License, Version 2.0. When submitting a patch for review, you must include a sign-off statement in the commit message. See the "Legal" section of the Contributing document.

You can find more information about useful tools for managing DCO sign-off in our Community Contributions Guide.

https://github.com/instructlab/community/blob/main/CONTRIBUTING.md#developer-certificate-of-origin-dco
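
A DCO sign-off is a `Signed-off-by:` trailer line in the commit message, which `git commit -s` adds for you. As a minimal sketch, a local check for that line might look like the following; the function name and the regular expression are illustrative assumptions, not the project's own tooling.

```python
import re

# A sign-off trailer has the form:
#   Signed-off-by: Full Name <email@example.com>
# This pattern is a simplified assumption, not a full email validator.
SIGN_OFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_sign_off(commit_message: str) -> bool:
    """Return True if the commit message contains a sign-off line."""
    return SIGN_OFF.search(commit_message) is not None


signed = "Add table-reading skill\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
unsigned = "Add table-reading skill\n"

print(has_sign_off(signed))
print(has_sign_off(unsigned))
```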

> How can I submit a skill or knowledge?

For information about submitting a skill after you have identified a gap, see the Ways to contribute guide.

https://github.com/instructlab/taxonomy?tab=readme-ov-file#ways-to-contribute

> What happens after you submit a pull request?

After a pull request is submitted, both the Taxonomy Triage team and the Taxonomy Approvers team review it to ensure that the submission is relevant, actionable, and has all of the required information needed to be a valuable addition to the AI model. Triagers might provide feedback and use labels to manage the state of the submitted pull request. Triagers also might provide informative feedback and helpful comments to improve the submission. After the pull request is approved, a Taxonomy Approver merges the skill.

More information regarding basic review questions, subjective review questions, labels, and the reasons for approval, further review requirements, or rejection can be found on the Triaging contributions page of the GitHub repository.

https://github.com/instructlab/taxonomy/blob/main/docs/triaging/triaging-contributions.md

> How are submissions reviewed?

For code review, the project maintainers use LGTM (Looks Good to Me) in comments on the code review to indicate acceptance. A change requires LGTMs from two of the maintainers.

For skills and knowledge PRs, your PR will be checked to ensure it is relevant, actionable, and has all the information necessary for the approval team to review and merge the PR. The Triage team will use labels to manage the state and action of PRs as well as provide feedback to contributors based upon the following review guidelines:

- Does the PR have the pull request template information filled out?
- Did all the PR checks pass?
- Does the skill have three or more examples?
- Are the YAML fields correct?
- Is the content free of PII?
- Does this content include anything documented in the project's Avoid these Topics guidelines?
- Does it adhere to the Code of Conduct guidelines?
- Was a response clearly generated by the LLM?
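
Some of the mechanical items in this checklist can be checked before you open a PR. The sketch below assumes a skill has already been loaded into a plain dictionary with a `seed_examples` list whose entries have `question` and `answer` keys; those names are assumptions for illustration, so check them against the current taxonomy schema.

```python
REQUIRED_EXAMPLE_FIELDS = ("question", "answer")  # assumed field names
MIN_EXAMPLES = 3  # "three or more examples" from the review guidelines


def precheck_skill(skill: dict) -> list[str]:
    """Return a list of problems; an empty list means the precheck passed."""
    problems = []
    examples = skill.get("seed_examples", [])
    if len(examples) < MIN_EXAMPLES:
        problems.append(
            f"needs at least {MIN_EXAMPLES} examples, found {len(examples)}"
        )
    for i, example in enumerate(examples, start=1):
        for field in REQUIRED_EXAMPLE_FIELDS:
            # Treat a missing key and an empty string the same way.
            if not str(example.get(field, "")).strip():
                problems.append(f"example {i} is missing '{field}'")
    return problems


draft = {
    "seed_examples": [
        {"question": "Q1", "answer": "A1"},
        {"question": "Q2", "answer": ""},
    ]
}
for problem in precheck_skill(draft):
    print(problem)
```

A check like this only covers the countable items; relevance, PII, and Code of Conduct questions still need human review.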

> How long will it take for my pull request to be reviewed?

Due to the large number of contributions currently being received, it is difficult to provide an exact timeline for reviewing your pull request.

> If my pull request is accepted, how long will it take for my changes to appear in the next model update?

After a pull request is accepted, the changes are regularly incorporated into InstructLab.

## Skills & Knowledge guide

https://github.com/instructlab/taxonomy/blob/main/docs/SKILLS_GUIDE.md#skills-guide

https://github.com/instructlab/taxonomy?tab=readme-ov-file#getting-started-with-knowledge-contributions

### Accepted Skills

https://github.com/instructlab/taxonomy/blob/main/docs/SKILLS_GUIDE.md#accepted-skills

- Creative Writing / Poetics
- Learning to Format Information
- Table Analysis and Processing
- Qualitative Inference and Chain-of-Thought Reasoning
- Word Problems
- Trust and Safety
- Searching, Extraction and Summarization
- Complex Rulesets and Games
- Writing Style and Personalities
- Instruction-Following Behavior

### What are the acceptance criteria for a skills submission?

Skills should seek to add capabilities or a knowledge domain to the AI model; in other words, a skills submission should teach the AI model how to do something instead of providing information about something. A good skills submission might address something that the AI model does poorly and seek to enhance its ability to execute that capability better. For a list of commonly accepted skills, see Accepted Skills.

Skills submissions that are unlikely to be accepted include submitting a knowledge request instead of a skills request, submitting a skill that the model already does well, submitting a controversial skill, or submitting skills that do not execute pure math or coding. For a list of skills to avoid submitting, see Skills to Avoid.

### Skills to Avoid

https://github.com/instructlab/taxonomy/blob/main/docs/SKILLS_GUIDE.md#skills-to-avoid

There are several types of skills that we don't expect this procedure to improve. Most skills in these categories will be rejected.

- Math
- Real world knowledge-based skills: Unless a skill can be framed as a "grounded skill", where the user is expected to provide context, it belongs with knowledge contributions, which are a separate part of the taxonomy. Skills shouldn't expect the model to come up with its own facts, but should instead assemble the facts provided.
- Red Teaming: Adversarial questions and answers will be rejected at this time.
- Small Changes to Original Response: If the original LLM response is pretty close, but it's not responding to your exact expectations, a skill is not the right way to solve that problem.

Avoid These Topics

- PII (personally identifiable information) or any content invasive of individual privacy rights
- Violence including self-harm
- Cyber Bullying
- Internal documentation or other material that is confidential to your employer or organization, e.g. trade secrets
- Discrimination
- Religion
  - Facts such as, "Christianity is, according to the 2011 census, the fifth most practiced religion in Nepal, with 375,699 adherents, or 1.4% of the population", are fine as a knowledge contribution.
  - Advocating in favor of or against any religious faith is not acceptable.
- Medical or health information
  - Facts such as, "In mammals, pulmonary ventilation occurs via inhalation (breathing)," are fine as a knowledge contribution. 
  - Tailored medical/health advice is not acceptable.
- Financial information
  - Facts such as "laissez-faire economics ... argues that market forces alone should drive the economy and that governments should refrain from direct intervention in or moderation of the economic system," are fine as a knowledge contribution. 
  - Tailored financial advice is not acceptable.
- Legal settlements/mitigations
- Gender Bias
- Hostile Language, threats, slurs, derogatory or insensitive jokes or comments
- Profanity
- Pornography and sexually explicit or suggestive content
- Any contributions that would allow for automated decision making that affects an individual's rights or well-being, e.g. social scoring
- Any contributions that engage in political campaigning or lobbying

We are also not accepting submissions of the following content:
- Jokes
- Poems
- Code
  - Anything code-related that can be traced back to code for a computer. This is not limited to sed or bash; it also covers YAML for OpenShift or Kubernetes, Python snippets, and Java suggestions.
  - There are specific models focused on this space and this isn't for this model for the time being.
- "Guard Rails" for AI
  - We expect our upstream engineering team to create these types of skills and safeguards. We appreciate our community wanting to help with this, but there are underlying engineering decisions, and taking this from the community may conflict with these.

On jokes and poems: we received so many at the beginning, and with jokes being "in the eye of the beholder" and puns requiring nuance for native English speakers, we realized we were possibly unconsciously biasing our model. We have discovered that working with both topics has its own challenges, and our attempts to find consensus on something generalized were unsuccessful.

### Building Your LLM Intuition

LLMs have inherent limitations that make certain tasks extremely difficult, like doing math problems. They're great at other tasks, like creative writing. And they could be better at things like logical reasoning.

Consider these when you're generating skills. Skills in the first and second categories are welcomed. Skills in the third category are usually borderline and may be rejected.

#### LLMs are great at

Skills in this category are welcomed, as refining these abilities helps us get better at the kinds of tasks where LLMs can excel.

For these, however, it's common for LLMs to already have excellent performance. Try 3-5 examples in lab chat to confirm a deficit in the model before you build your submission, and share the examples in your Pull Request (PR).

- Brainstorming
- Creativity
- Connecting information
- Cross-lingual behavior

#### LLMs need help with

Skills in this category are welcomed, since these sorts of topics are very difficult for the model to get right. Try several examples to understand the nuances of the model's ability to do these sorts of tasks, and consider using corrections to the results you get in your tuning process.

- Chains of reasoning
- Analysis
- Story plots
- Reassembling information
- Effective and succinct summaries

#### LLMs are not so great at

Skills in this category are ways in which LLMs struggle, and may always struggle. Solving math and computation problems via probability on natural language queries is probably not the best way to solve them. That said, improving some of these foundational skills may be something this work tackles in the future, but not at this time.

Most skill submissions in these categories are likely to be rejected.

For hallucinations in particular, trying to solve this with a skill is unlikely to work. Instead, consider contributing to the Knowledge taxonomy when it opens to improve the model's understanding of facts.

- Math
- Computation
- "Turing-complete" type tasks
- Generating only true real-world information (they're prone to hallucinations)

### Getting Started with Knowledge Contributions

https://github.com/instructlab/taxonomy?tab=readme-ov-file#getting-started-with-knowledge-contributions

While skills are foundational or performative, knowledge is based more on answering questions that involve facts, data, or references.

Knowledge in the taxonomy tree consists of a few more elements than skills:
- Each knowledge node in the tree has a qna.yaml, similar to the format of the qna.yaml for skills.
- **Knowledge submissions require you to create a Git repository** (for example, on GitHub) that contains the markdown files of your knowledge contributions. These contributions in your repository must use the markdown (.md) format.
- The qna.yaml includes parameters that contain information from your repository.

Guidelines for Knowledge contributions
- Submit the most up-to-date version of the document
- All submissions must be text; images will be ignored
- Do not use tables in your markdown freeform contribution

Important
- There is a limit to how much content can exist in the question/answer pairs for the model to process. 
- Due to this, only add a maximum of around 2300 words to your question and answer seed example pairs in the qna.yaml file.
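
As a rough way to stay under this budget, you could count the words in your seed examples before submitting. The sketch below assumes each pair is a dictionary with `question` and `answer` keys and that "words" means whitespace-separated tokens; both are assumptions, since the guideline does not say how words are counted.

```python
WORD_LIMIT = 2300  # approximate maximum from the guideline above


def seed_word_count(seed_examples: list[dict]) -> int:
    """Count whitespace-separated words across all question/answer pairs."""
    return sum(
        len(example.get(key, "").split())
        for example in seed_examples
        for key in ("question", "answer")
    )


pairs = [
    {"question": "Who won Best Picture?", "answer": "See the results document."},
]
total = seed_word_count(pairs)
print(total, "words;", "within" if total <= WORD_LIMIT else "over", "the limit")
```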

Important
- Upon release, the taxonomy repository is **only accepting contributions from Wikipedia** and is capped at 50 contributions. 
- If you want to add knowledge to the taxonomy repository, please fill out this [InstructLab Knowledge Submission Registration form](https://docs.google.com/forms/d/1VWJ_XPwH3gBTIXCabpWc0I5pjWIlXETMSFKXc8fpgkA/viewform?edit_requested=true) and await acceptance! Please do not add contributions if you do not receive the confirmation email. Thank you!

The knowledge example in the taxonomy README references one markdown file: oscars2024_results.md. You can also add multiple files for knowledge contributions.

What might these markdown files look like? They can be freeform; the taxonomy README shows a snippet of what oscars2024_results.md might look like in your Git repository.

You can organize the knowledge markdown files in your repository however you want. You just need to ensure the YAML is pointing to the correct file.

Example attribution.txt file

```
Title of work: 96th Academy Awards
Link to work: https://en.wikipedia.org/wiki/96th_Academy_Awards
License of the work: CC-BY-SA-4.0
Creator names: Wikipedia Authors
```

For more information on what to include in your attribution.txt file, see For your attribution.txt file in CONTRIBUTING.md.

https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md#for-your-attributiontxt-file

An important part of contributing to the InstructLab project is citing your sources of information. This comes in the form of your attribution.txt that you add to the pull requests. Almost all instances of attribution can be covered by the parameters required for Creative Commons Attribution licenses. Some parameters are as follows:
- Title of work
- Link to work: Include link to a specific revision where possible
- License of the work: Include an SPDX identifier where possible
- Creator names
- Copyright information
- Modification information: Indicate if work was itself derived from another openly licensed work
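
Because attribution.txt uses simple `Key: value` lines, it is easy to check locally before opening a pull request. The following is a minimal sketch; treating the four keys from the example file as the expected set is an assumption, and the helper names are hypothetical.

```python
# Keys taken from the example attribution.txt; treating them as the
# expected minimum is an assumption made for this sketch.
EXPECTED_KEYS = (
    "Title of work",
    "Link to work",
    "License of the work",
    "Creator names",
)


def parse_attribution(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def missing_keys(fields: dict[str, str]) -> list[str]:
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not fields.get(key)]


example = """\
Title of work: 96th Academy Awards
Link to work: https://en.wikipedia.org/wiki/96th_Academy_Awards
License of the work: CC-BY-SA-4.0
Creator names: Wikipedia Authors
"""
fields = parse_attribution(example)
print(fields["License of the work"])
print(missing_keys(fields))
```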

You can also see this citation style in the Data sources documentation

https://github.com/instructlab/community/blob/main/docs/DataSources.md

Note
- Due to the higher volume, it will naturally take longer to receive acceptance for a knowledge contribution pull request than for a skill pull request. 
- Smaller pull requests are simpler and require less time and effort to review.

## Community Collaboration 

https://github.com/instructlab/community/blob/main/Collaboration.md

**Project Slack workspace**

https://github.com/instructlab/community/blob/main/InstructLabSlackGuide.md

For real-time chat discussions, please join our InstructLab Slack workspace.

Slack history is deleted after 90 days, so for conversations that should be preserved for a longer period, use the project mailing lists.

If you want to add feedback or think there is a "large issue" to discuss, a mailing list or a specific repository issue tracker is a good place to have the conversation rather than Slack. If you are unsure of where to comment, users@instructlab.ai is the best place to start.

**Email lists**

https://github.com/instructlab/community/blob/main/Collaboration.md#email-lists

Subscription requires a Google account. To join a list, click the list name in the table on the email lists page linked above to visit the list subscription page.

**Project meetings**

InstructLab project calendar

https://calendar.google.com/calendar/embed?src=c_23c2f092cd6d147c45a9d2b79f815232d6c3e550b56c3b49da24c4b5d2090e8f%40group.calendar.google.com

We host weekly community meetings each Tuesday at 14:00 UTC.

We have two dedicated Office Hours slots each Thursday so we're able to meet with folks across different time zones. See the InstructLab project calendar to select which time works best for you.

We host daily Triage Team stand-up meetings at 18:30 UTC. In this meeting, triagers discuss possible issues or successes with the different PRs put into the https://github.com/instructlab/taxonomy repo. If you have questions or ideas, we have an open-door policy and would love for you to join us.

**GitHub & Hugging Face**

We are using the GitHub discussion boards in each repo for cases where we need to document things quickly but ephemerally, such as working together as a community to squash a nasty bug. In that case, a link to the appropriate discussion board post will be sent to the relevant project mailing lists so folks can follow along on GitHub. 

Rather than use the discussion boards to discuss proposals for enhancements or to request help with using InstructLab, please reach out on the project email lists or Slack.

We regularly post model builds on the project's Hugging Face page.

**Social Media**

LinkedIn page: https://www.linkedin.com/company/instructlab

X (Twitter): https://twitter.com/instructlab

YouTube channel: https://www.youtube.com/@InstructLab

**Submitting content**

Have you made a video tutorial, how-to document, or other content that would be helpful to folks involved in the InstructLab community? Thank you!

We would love to help you share it. Please file an issue in the Community Repo or send a note to the community email list to let us know about what you have created.

## Contributing

https://github.com/instructlab/community/blob/main/CONTRIBUTING.md

When you're ready to start contributing, you can follow the Getting started guide.

https://github.com/instructlab/community/blob/main/README.md#getting-started-with-the-instructlab-project-workstreams

This guide shows you how to:
- Install the ilab CLI.
- Deploy the LLM locally.
- Add skills or knowledge and train the local LLM with your data.
- Create a pull request and add your information to the InstructLab taxonomy.
- Get reviews on your pull requests.

## Code of Conduct

https://github.com/instructlab/community/blob/main/CODE_OF_CONDUCT.md

We as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone.

Examples of behavior that contributes to creating a positive environment [...]

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

## Project Governance

https://github.com/instructlab/community/blob/main/governance.md

The InstructLab Project has a two-level governance structure with an Oversight Committee and Project Maintainers.

https://github.com/instructlab/community/blob/main/MAINTAINERS.md

InstructLab community Slack channel

https://github.com/instructlab/community/blob/main/InstructLabSlackGuide.md

Announce mailing list

https://github.com/instructlab/community/blob/main/Collaboration.md#email-lists

Advancement to the project Maintainer position, removal or stepping down, and duties are detailed in the Contributor Roles. 

https://github.com/instructlab/community/blob/main/CONTRIBUTOR_ROLES.md

- Member: Active contributor in the community / Multiple contributions and sponsored by 2 Maintainers or Reviewers / InstructLab GitHub org member
- Triager: Triaging issues and PRs / History of issue and PR triage and sponsored by 2 Maintainers / InstructLab GitHub Triage team member
- Maintainer: Sets direction and priorities for a project / Demonstrated responsibility and excellent technical judgement, nominated and approved by Maintainers team / MAINTAINERS file Maintainer entry

## Security

https://github.com/instructlab/.github/blob/main/SECURITY.md

Reporting a Vulnerability: Please DO NOT report the issue publicly via the GitHub issue tracker, Slack Workspace, etc. Instead, send an email with as many details as possible to security-reporting@instructlab.ai. This is a private mailing list for the security team.

Security Vulnerability Response: Each report is acknowledged and analyzed by the core maintainers within 3 working days. After the initial reply to your report, the security team will keep you informed of the progress towards a fix and full announcement, and may ask for additional information or guidance.