In [39]:
from galactic import GalacticDataset

# IMPORTANT: Galactic calls APIs with async coroutines. In a notebook, where an event loop is already running, you need these two lines or those calls will fail.
import nest_asyncio
nest_asyncio.apply()
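To see why those two lines matter, here is a minimal sketch using only the standard library. It assumes the failure is the usual one: the notebook already runs an event loop, and `asyncio.run()` refuses to start a second one inside it. The `call_api` and `notebook_cell` names are hypothetical stand-ins, not part of Galactic.

```python
import asyncio


async def call_api():
    # Hypothetical stand-in for an async API request.
    await asyncio.sleep(0)
    return "ok"


async def notebook_cell():
    # We are now inside a running event loop, just like code in a Jupyter cell.
    coro = call_api()
    try:
        asyncio.run(coro)  # a nested run is refused by plain asyncio
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed, which is why it has to run before any Galactic method that talks to an API.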

We're going to demonstrate how to use the AI labeling and distillation features to clean and preprocess the OpenHermes instruction-tuning dataset. OpenHermes, collected by [Teknium](https://github.com/teknium1), is a great dataset for fine-tuning LLMs, but with over 240,000 examples it's big! We'll use Galactic's tagging, labeling, and filtering features to reduce it to a more manageable size, keeping only the topics we're interested in.

In [9]:
ds = GalacticDataset.from_hugging_face(
    "teknium/openhermes",
    split="train",
)

Map:   0%|          | 0/242831 [00:00<?, ? examples/s]

In [10]:
ds.column_names, len(ds)

(['input', 'instruction', 'output', '__id'], 242831)

First, let's do some basic stuff: We'll detect the language of each instruction, and then we'll filter out all the non-English instructions.

In [12]:
ds = ds.detect_language(
    field="instruction"
).filter(
    lambda x: x["__language"] == "en"
)
len(ds)



Map:   0%|          | 0/242239 [00:00<?, ? examples/s]

INFO: Detected language in field instruction, added language metadata to '__language'.


Filter:   0%|          | 0/242239 [00:00<?, ? examples/s]

242239

Next, we'll count the tokens in each instruction and output, and filter out examples with very short instructions or short outputs. We're interested in outputs where the model has to maintain fluency over a longer response, not just provide a short answer. Counting tokens with a tokenizer takes a while. If you don't provide a tokenizer, Galactic will just count bytes instead.

In [13]:
ds = ds.count_tokens(
    fields=["instruction", "output"],
    tokenizer="TaylorAI/Flash-Llama-13B"  # same tokenizer as Meta's Llama, but doesn't require logging in to Hugging Face
).filter(
    lambda x: x["__token_count__instruction"] > 10 and x["__token_count__output"] > 225
)
len(ds)

Map:   0%|          | 0/242239 [00:00<?, ? examples/s]

INFO: Counted tokens in fields: ['instruction', 'output'], added metadata to __token_count__


Filter:   0%|          | 0/242239 [00:00<?, ? examples/s]

110355
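The byte-count fallback mentioned above is easy to picture. A minimal sketch, assuming the fallback simply measures the UTF-8 length of each field; the `count_bytes` helper, the sample rows, and the reuse of the same thresholds are all hypothetical, and Galactic's internals may differ.

```python
def count_bytes(text: str) -> int:
    """Length of the text in UTF-8 bytes, a cheap stand-in for a token count."""
    return len(text.encode("utf-8"))


# Hypothetical rows shaped like the OpenHermes examples.
rows = [
    {"instruction": "Add 2 and 2.", "output": "4"},
    {"instruction": "Explain how binary search works.", "output": "Binary search " * 40},
]

# A token usually spans several bytes, so byte thresholds are not
# interchangeable with the token thresholds used in the notebook.
kept = [
    row
    for row in rows
    if count_bytes(row["instruction"]) > 10 and count_bytes(row["output"]) > 225
]
print(len(kept), "of", len(rows), "rows kept")
```

If you rely on byte counts, it is worth rescaling the cutoffs rather than reusing the token-based ones as this sketch does.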

Now that we've significantly reduced the size of the dataset, we can start doing stuff that takes a little longer. To start out, we'll scan all the fields for PII and remove examples that contain it.

In [14]:
ds = ds.detect_pii(
    fields=["instruction", "input", "output"],
).filter(
    lambda x: not x["__pii__any"]
)
len(ds)

Map:   0%|          | 0/110355 [00:00<?, ? examples/s]

INFO: Detected PII in fields: ['instruction', 'input', 'output']; added __pii__email, __pii__phone, __pii__credential, and __pii__any metadata.


Filter:   0%|          | 0/110355 [00:00<?, ? examples/s]

107639
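To give a feel for what a PII scan like this does, here is a rough regex-based sketch. The patterns and the `detect_pii` helper below are simplified assumptions, not Galactic's actual detectors, and they cover only emails and US-style phone numbers (the credential check reported in the log above is left out).

```python
import re

# Deliberately simple patterns; real detectors handle many more formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def detect_pii(example: dict, fields: list) -> dict:
    """Return a copy of the example with boolean PII flags added."""
    text = " ".join(example.get(field) or "" for field in fields)
    flags = {
        "__pii__email": bool(EMAIL_RE.search(text)),
        "__pii__phone": bool(PHONE_RE.search(text)),
    }
    flags["__pii__any"] = any(flags.values())
    return {**example, **flags}


fields = ["instruction", "input", "output"]
examples = [
    {"instruction": "Email me at jane@example.com", "input": "", "output": "Sure."},
    {"instruction": "Solve 3x + 5 = 20.", "input": "", "output": "x = 5"},
]
flagged = [detect_pii(example, fields) for example in examples]
clean = [example for example in flagged if not example["__pii__any"]]
```

Pattern-based checks like these produce false positives (for example, digit runs in math problems), so it can be worth spot-checking a few removed rows.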

Next, let's think about the kinds of instructions we want our fine-tuning dataset to contain. OpenHermes includes instructions both for problem-solving and for more creative uses like writing and roleplay. Let's imagine we're only interested in problem-solving (math and programming). We'll use Galactic's AI classifier feature to automatically label topics with GPT-3.5-turbo on a fraction of the data. Then, we'll distill those labels into a fast classifier that can cheaply and tractably classify the entire dataset!

In [15]:
# set the api key and rate limits (check your account to see what your rate limits for GPT-3.5-turbo are)
# Galactic will automatically use the 16k model for longer sequences, and the 4k model for shorter sequences.
ds.set_openai_key(
    "[...]"
)
ds.set_rate_limits(
    max_tokens_per_minute=350_000,
    max_requests_per_minute=4_000
)
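These two limits put a floor on how long any labeling job can take. A back-of-the-envelope sketch, assuming a hypothetical average of 300 tokens per request (prompt plus completion) and ignoring network latency and retries; `min_minutes` is an illustrative helper, not a Galactic function.

```python
def min_minutes(n_requests, tokens_per_request, max_requests_per_minute, max_tokens_per_minute):
    """Lower bound on wall-clock minutes implied by the two rate limits."""
    by_requests = n_requests / max_requests_per_minute
    by_tokens = n_requests * tokens_per_request / max_tokens_per_minute
    # Whichever limit is hit first determines the floor.
    return max(by_requests, by_tokens)


estimate = min_minutes(
    n_requests=5_000,
    tokens_per_request=300,  # assumed average, not measured
    max_requests_per_minute=4_000,
    max_tokens_per_minute=350_000,
)
print(f"at least {estimate:.1f} minutes")
```

Real runs take longer than this floor, but the calculation shows which limit binds: with prompts of this assumed size it is the token budget, not the request budget.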

In [16]:
classes = {
    "programming": "Request involves a programming task, including software, web development, data science, or machine learning.",
    "math": "Request involves a math task, including algebra, calculus, geometry, or statistics.",
    "world_knowledge": "Request involves a question about the world, including science, history, geography, or politics.",
    "creative": "Request involves a creative task, including writing, art, music, video, or storytelling.",
    "roleplay": "Request asks the model to play the role a character, historical person, or existing person.",
    "other": "Request does not fall into any of the above categories.",
}

labeled_subset = ds.select(range(5000)).ai_classifier(
    "topic",
    field="instruction",
    classes=classes,
    prompt=None, # use the default prompt, which just provides the class definitions and asks the model to classify the text
    backend="openai"
)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

INFO: Example prompt: Classify the provided text into one of the following classes:

- programming: Request involves a programming task, including software, web development, data science, or machine learning.
- math: Request involves a math task, including algebra, calculus, geometry, or statistics.
- world_knowledge: Request involves a question about the world, including science, history, geography, or politics.
- creative: Request involves a creative task, including writing, art, music, video, or storytelling.
- roleplay: Request asks the model to play the role a character, historical person, or existing person.
- other: Request does not fall into any of the above categories.

---

Text: Famous inventors and their inventions: Identify five well-known inventors and briefly describe one of their most significant inventions.

---

Class:


  0%|          | 0/5000 [00:00<?, ?it/s]

INFO: Parallel processing complete.


Flattening the indices:   0%|          | 0/5000 [00:00<?, ? examples/s]

As you can see below, all the outputs are valid classes. That's because Galactic uses a logit bias trick to force the API model to output a valid class, so you're guaranteed to get a result for every example. It's a good idea to include "other" as an option so the model has an escape hatch if none of the classes seem to fit.

In [17]:
from collections import Counter
Counter(labeled_subset["topic"])

Counter({'math': 1742,
         'programming': 1617,
         'world_knowledge': 985,
         'creative': 552,
         'other': 96,
         'roleplay': 8})
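The logit bias trick can be illustrated without calling any API. OpenAI's chat API accepts a `logit_bias` map from token IDs to values between -100 and 100, and a large positive bias on the tokens for the class names makes any other output effectively impossible. The sketch below is a toy model of that idea, assuming each class name is a single token and using made-up logits; how Galactic builds its bias map is not shown in this notebook.

```python
def pick_token(logits, logit_bias=None):
    """One greedy decoding step: add per-token biases, then take the argmax."""
    logit_bias = logit_bias or {}
    biased = {token: score + logit_bias.get(token, 0.0) for token, score in logits.items()}
    return max(biased, key=biased.get)


classes = ["programming", "math", "world_knowledge", "creative", "roleplay", "other"]

# Made-up next-token logits: left alone, the model would start a chatty sentence.
logits = {"Sure": 4.2, "The": 3.9, "math": 1.0, "programming": 0.5, "other": -2.0}

# Bias every class token upward by the maximum amount.
bias = {name: 100.0 for name in classes}

unbiased = pick_token(logits)
forced = pick_token(logits, bias)
print(unbiased, "->", forced)
```

The bias does not change which class the model prefers among the valid options; it only removes everything else from contention.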

Even labeling a few thousand examples with OpenAI took a really long time, and I have higher rate limits than most people. Luckily, Galactic is designed to help you label far more data than is practical to send through the OpenAI API, by distilling the labels into a fast classifier. One option is to train a linear model on top of embeddings, but computing embeddings also takes a while (especially locally), so that's only recommended if you want to embed everything anyway. A faster option is to train a FastText model. It only takes a few minutes to train, and it can classify any text even if we haven't computed embeddings yet.

In [30]:
labeled_subset.distill_fasttext(
    model_name="topic_classifier",
    save_dir="../local/fasttext_models",
    input_field="instruction",
    label_field="topic",
    target_model_size="1M", # constraining model size
)

Progress: 100.0% Trials:   22 Best score:  0.897704 ETA:   0h 0m 0s
Training again with best arguments
Read 0M words
Number of words:  12097
Number of labels: 6
Progress: 100.0% words/sec/thread:   94656 lr:  0.000000 avg.loss:  0.661008 ETA:   0h 0m 0s
Progress: 100.0% words/sec/thread:  281997 lr:  0.000000 avg.loss:  0.221580 ETA:   0h 0m 0s
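For readers curious what a distillation step like this involves, here is a rough sketch using the `fasttext` Python package directly. It assumes the labeled examples are written out in fastText's `__label__<class> <text>` format and that a held-out validation file drives autotuning under a size constraint; the helper names and file paths are hypothetical, and Galactic's own implementation may differ.

```python
import random


def to_fasttext_line(label: str, text: str) -> str:
    """Format one example for fastText supervised training."""
    # fastText reads one example per line, so collapse newlines and extra spaces.
    return f"__label__{label} {' '.join(text.split())}"


def split_lines(lines, valid_fraction=0.1, seed=0):
    """Shuffle and split into (train, validation) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def train(train_path: str, valid_path: str, out_path: str) -> None:
    """Autotune a size-constrained classifier (requires `pip install fasttext`)."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.train_supervised(
        input=train_path,
        autotuneValidationFile=valid_path,
        autotuneModelSize="1M",
    )
    model.save_model(out_path)


lines = [to_fasttext_line("math", f"Solve\n problem {i}") for i in range(10)]
train_lines, valid_lines = split_lines(lines)
```

Writing `train_lines` and `valid_lines` to two text files and passing their paths to `train` would produce a compressed model file comparable in spirit to the one saved above.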


Now, we can use the model we just trained to predict topics for all 100k+ examples, in less time than it took for OpenAI to label 5000.

In [41]:
ds = ds.fasttext_classifier(
    ds,
    new_column="topic",
    model_path="../local/fasttext_models/topic_classifier.ftz",
    field="instruction"
)



Map:   0%|          | 0/107639 [00:00<?, ? examples/s]

INFO: Applied classifier to field instruction, added result to topic.


In [43]:
Counter(ds["topic"])

Counter({'math': 47069,
         'world_knowledge': 32435,
         'programming': 19150,
         'creative': 8880,
         'roleplay': 103,
         'other': 2})

In [47]:
ds = ds.filter(
    lambda x: x["topic"] in ["math", "programming"]
)
len(ds)

Filter:   0%|          | 0/66219 [00:00<?, ? examples/s]

66219
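The numbers above can be cross-checked with a little arithmetic, using only the counts printed in this notebook. The sketch compares the class shares in the 5,000 OpenAI-labeled examples with the shares predicted by the distilled classifier, and confirms that math plus programming accounts for the rows kept by the filter.

```python
# Counts copied from the two Counter outputs above.
subset_counts = {
    "math": 1742, "programming": 1617, "world_knowledge": 985,
    "creative": 552, "other": 96, "roleplay": 8,
}
full_counts = {
    "math": 47069, "world_knowledge": 32435, "programming": 19150,
    "creative": 8880, "roleplay": 103, "other": 2,
}


def shares(counts):
    """Convert raw counts to fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


subset_shares = shares(subset_counts)
full_shares = shares(full_counts)
for label in subset_counts:
    print(f"{label:16s} labeled {subset_shares[label]:6.1%}   predicted {full_shares[label]:6.1%}")

kept = full_counts["math"] + full_counts["programming"]
print("rows kept by the filter:", kept)
```

The labeled subset was the first 5,000 rows rather than a random sample, so differences between the two distributions may reflect dataset ordering as well as classifier error. The near disappearance of "other" in the predictions suggests the small classifier rarely picks rare classes, which is worth keeping in mind if a rare class matters to you.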

In [46]:
ds

  input                                        instruction  \
0        A hotel chain wants to optimize its pricing st...   
1        A zoo wants to expand its facilities by adding...   
2        Implement a Scala program that reads a JSON fi...   
3        A hospital wants to improve its emergency resp...   
4        Write a Python script that connects to a Maria...   
5        Solve the heat equation u_t = k * u_xx on the ...   
6        Determine the critical points of the function ...   
7        Write an OCaml function that finds the longest...   
8        Create a Node.js Express application with two ...   
9        Find the orthogonal projection of vector (1,2,...   

                                              output  __id __language  \
0  To develop a dynamic pricing model for the hot...     5         en   
1  Step 1: Calculate the total cost of expansion ...     9         en   
2  To implement this program, you can use the fol...    15         en   
3  Let x be the number of

In [45]:
ds.save("../local/hermes_problem_solving.jsonl")