# Project 1: getting started

Let's get started with installing the dependencies.

In [None]:
!pip install "distilabel[hf-inference-endpoints, openai]" "model2vec" "semhash" -U -q

## Working with Hugging Face

Let's first [get our token](https://huggingface.co/settings/tokens) and then log in.

In [None]:
from huggingface_hub import login

login()

We will use the [`fka/awesome-chatgpt-prompts`](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts) dataset. This dataset holds a pretty neat collection of prompts to use with language models.

In [None]:
from datasets import load_dataset

ds = load_dataset("fka/awesome-chatgpt-prompts")
ds

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 203
    })
})

In [None]:
ds["train"].features

{'act': Value(dtype='string', id=None),
 'prompt': Value(dtype='string', id=None)}

We can then do some cool operations with `map`, for example adding a new column to every row.

In [None]:
def do_cool_things(row):
    row["act_prompt"] = row["act"] + row["prompt"]
    return row  # map only keeps changes that the function returns

ds.map(do_cool_things)

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt', 'act_prompt'],
        num_rows: 203
    })
})

In [None]:
ds['train'][0]

{'act': 'An Ethereum Developer',
 'prompt': 'Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation.'}

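We can also do cool batch operations by passing `batched=True`, in which case the function receives a dictionary of column lists instead of a single row.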

In [None]:
def do_cool_things(batch):
    row_act_prompts = []
    for act, prompt in zip(batch["act"], batch["prompt"]):
        row_act_prompts.append(act + prompt)
    batch["act_prompt"] = row_act_prompts
    return batch  # map only keeps changes that the function returns

ds.map(do_cool_things, batched=True)

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt', 'act_prompt'],
        num_rows: 203
    })
})

In [None]:
ds['train'][0]

{'act': 'An Ethereum Developer',
 'prompt': 'Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation.'}
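Note that the row printed above has no `act_prompt` key: `map` returns a new dataset and leaves `ds` unchanged, so the result has to be assigned to a variable if we want to keep it. The sketch below mimics the per-row and batched calling conventions in plain Python, without the `datasets` library. `map_rows` and `map_batches` are hypothetical helpers written for this illustration, and the two rows are toy data.

```python
# Toy stand-ins for the two map conventions (illustration only, not the datasets API).
rows = [
    {"act": "Linux Terminal", "prompt": "I want you to act as a linux terminal."},
    {"act": "Poet", "prompt": "I want you to act as a poet."},
]


def map_rows(rows, fn):
    """Apply fn to a copy of each row and collect the returned rows."""
    return [fn(dict(row)) for row in rows]


def map_batches(rows, fn, batch_size=1000):
    """Apply fn to dict-of-lists batches, then turn the result back into rows."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        keys = list(result)
        out.extend(dict(zip(keys, values)) for values in zip(*result.values()))
    return out


def add_act_prompt(row):
    row["act_prompt"] = row["act"] + row["prompt"]
    return row  # the updated row must be returned


def add_act_prompt_batched(batch):
    batch["act_prompt"] = [a + p for a, p in zip(batch["act"], batch["prompt"])]
    return batch  # the updated batch must be returned


per_row = map_rows(rows, add_act_prompt)
per_batch = map_batches(rows, add_act_prompt_batched, batch_size=1)
print(per_row == per_batch)     # both conventions produce the same rows
print("act_prompt" in rows[0])  # the original rows are untouched
```

With the real dataset, the equivalent of keeping the result would be `ds = ds.map(do_cool_things)`.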

## Calling HF LLMs

We can then search for a model on Hugging Face and start calling LLMs. Let's use [`meta-llama/Llama-3.2-3B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), and find and use the snippet for calling API endpoints.


In [None]:
from openai import OpenAI
from huggingface_hub import get_token

client = OpenAI(
    base_url="https://router.huggingface.co/hf-inference/v1",
    api_key=get_token(),
)

messages = [
    {
        "role": "user",
        "content": "What is the capital of France?",
    }
]

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=messages,
    max_tokens=500,
)

print(completion.choices[0].message)

ChatCompletionMessage(content='The capital of France is Paris.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None)


## HF LLMs within distilabel

We can then use a prompt from this dataset to call [LLMs with distilabel](https://distilabel.argilla.io/latest/components-gallery/llms/). Let's see how we can use the [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/#dedicated-inference-endpoints-or-tgi).

In [None]:
from distilabel.models.llms.huggingface import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    base_url="https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-3B-Instruct"
)

llm.load()
llm.generate_outputs(inputs=[[
    {"role": "user", "content": "What is the capital of France?"}
]])

[{'generations': ['The capital of France is Paris.'],
  'statistics': {'input_tokens': [42], 'output_tokens': [8]}}]
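The question above is hard-coded. To send a prompt from the dataset instead, a row first has to be turned into the chat-message format. A minimal sketch, assuming the `act` and `prompt` columns shown earlier; `row_to_messages` is a hypothetical helper, and the sample row is a shortened stand-in for a real dataset row.

```python
def row_to_messages(row):
    """Wrap a dataset row's prompt in the list-of-messages chat format."""
    return [{"role": "user", "content": row["prompt"]}]


# Shortened stand-in for a row such as ds["train"][0].
row = {
    "act": "An Ethereum Developer",
    "prompt": "Imagine you are an experienced Ethereum developer ...",
}

# One conversation per list entry, matching the `inputs` argument used above.
inputs = [row_to_messages(row)]
print(inputs)
```

The resulting `inputs` list has the same shape as the one passed to `llm.generate_outputs` in the cell above.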

## LLMs with prompt templates

We can also use these LLMs along with prompt templates. In distilabel, prompt templates are called [tasks](https://distilabel.argilla.io/latest/components-gallery/tasks/). We've already discussed the [EvolInstruct](https://distilabel.argilla.io/latest/components-gallery/tasks/evolinstruct/), [SelfInstruct](https://distilabel.argilla.io/latest/components-gallery/tasks/selfinstruct/) and [Magpie](https://distilabel.argilla.io/latest/components-gallery/tasks/magpie/) templates. Let's try one of them now.

In [None]:
from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM

# initialise the LLM
llm = InferenceEndpointsLLM(
    base_url="https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-3B-Instruct"
)

# wrap the LLM in the EvolInstruct task (the prompt template)
evol_instruct = EvolInstruct(
    llm=llm,
    num_evolutions=1,
)
evol_instruct.load()

result = next(
    evol_instruct.process([
        {"instruction": "What is the capital of France?"}
    ])
)
result



[{'instruction': 'What is the capital of France?',
  'evolved_instruction': "Write a comprehensive response outlining the capital of an independent European country that shares the longest land border with France, considering the city's population density, cultural significance, and historical ties to France, while also acknowledging the complexities of territorial claims and linguistic variations within the region.",
  'model_name': 'https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-3B-Instruct',
  'distilabel_metadata': {'statistics_instruction_evol_instruct_0': {'input_tokens': [264],
    'output_tokens': [53]}}}]

Note that the Magpie template is slightly different: it lets you define some additional parameters during model initialisation, such as `tokenizer_id`, `magpie_pre_query_template` and `use_magpie_template`.
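The idea behind the pre-query template, roughly: instead of sending a full user message, the model only receives the chat markup that opens a user turn, so the text it generates is itself an instruction. The sketch below builds such an input string in plain Python. It assumes Llama 3 style chat tokens, and `build_magpie_input` is a hypothetical helper for illustration, not distilabel's implementation.

```python
# Assumption: Llama 3 style chat markup; other model families use different tokens.
BEGIN = "<|begin_of_text|>"
USER_HEADER = "<|start_header_id|>user<|end_header_id|>\n\n"


def build_magpie_input(system_prompt=None):
    """Return chat markup that stops right where the user message would start."""
    text = BEGIN
    if system_prompt is not None:
        text += (
            "<|start_header_id|>system<|end_header_id|>\n\n"
            f"{system_prompt}<|eot_id|>"
        )
    # No user message is appended: the model is left to write it.
    return text + USER_HEADER


print(repr(build_magpie_input()))
print(repr(build_magpie_input("You are a helpful maths tutor.")))
```

An optional system prompt is one way to steer the topic of the generated instructions.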

## Using LLMs in a distilabel pipeline

Normally, distilabel works with [pipelines](https://distilabel.argilla.io/latest/sections/getting_started/quickstart/#define-a-custom-pipeline). We can use these to define custom synthetic data flows.

In [None]:
from huggingface_hub import whoami

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import EvolInstruct

with Pipeline() as pipeline:  # inside a pipeline, `load()` is called on the components for you
    loader = LoadDataFromHub(
        repo_id="distilabel-internal-testing/instruction-dataset-mini",
        split="test",
        num_examples=10
    )
    evol_instruct = EvolInstruct(
        llm=InferenceEndpointsLLM(
            base_url="https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-3B-Instruct"
        ),
        num_evolutions=1,
        input_mappings={"instruction": "prompt"},  # ensure correct column mapping
    )
    loader.connect(evol_instruct)  # connect to determine flow of data
    # alternatively, use the >> operator: loader >> evol_instruct


distiset = pipeline.run()
# to set caching explicitly: distiset = pipeline.run(use_cache=True)
distiset

Distiset({
    default: DatasetDict({
        train: Dataset({
            features: ['prompt', 'completion', 'meta', 'evolved_instruction', 'model_name', 'distilabel_metadata'],
            num_rows: 10
        })
    })
})
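The output keeps the original `prompt` column and has no `instruction` column, because `input_mappings` only tells the task which column to read its `instruction` input from. A rough sketch of that idea in plain Python; `apply_input_mappings` is a hypothetical helper, not how distilabel implements it, and the row is a shortened toy example.

```python
def apply_input_mappings(row, input_mappings):
    """Build the task's view of a row: task input name -> value of the mapped column."""
    return {task_input: row[column] for task_input, column in input_mappings.items()}


# Shortened toy row with the same column names as the dataset above.
row = {"prompt": "Arianna has 12 chocolates more than Danny.", "completion": "..."}

task_inputs = apply_input_mappings(row, {"instruction": "prompt"})
print(task_inputs)  # the task sees the prompt under the name it expects
print(row)          # the dataset row itself keeps its column names
```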

In [None]:
distiset["default"]["train"][0]

{'prompt': 'Arianna has 12 chocolates more than Danny. Danny has 6 chocolates more than Robbie. Arianna has twice as many chocolates as Robbie has. How many chocolates does Danny have?',
 'completion': 'Denote the number of chocolates each person has by the letter of their first name. We know that\nA = D + 12\nD = R + 6\nA = 2 * R\n\nThus, A = (R + 6) + 12 = R + 18\nSince also A = 2 * R, this means 2 * R = R + 18\nHence R = 18\nHence D = 18 + 6 = 24',
 'meta': {'category': 'Question Answering',
  'completion': 'Denote the number of chocolates each person has by the letter of their first name. We know that\nA = D + 12\nD = R + 6\nA = 2 * R\n\nThus, A = (R + 6) + 12 = R + 18\nSince also A = 2 * R, this means 2 * R = R + 18\nHence R = 18\nHence D = 18 + 6 = 24',
  'id': 0,
  'input': None,
  'motivation_app': None,
  'prompt': 'Arianna has 12 chocolates more than Danny. Danny has 6 chocolates more than Robbie. Arianna has twice as many chocolates as Robbie has. How many chocolates does Da

In [None]:
distiset.push_to_hub("uplimit/example-dataset")

## Explore your data

There is [an integration with Nomic AI](https://huggingface.co/blog/MaxNomic/explore-any-hugging-face-dataset-with-nomic-atlas) that lets you explore, curate and vector-search any Hugging Face dataset with Nomic Atlas. Additionally, you could use something like [Argilla](https://huggingface.co/blog/argilla-ui-hub) for a more fine-grained analysis.


Let's start with exploring the data in [Nomic AI](https://atlas.nomic.ai/data/davidmberenstein/distilabel-intel-orca-dpo-pairs/map/58350d76-78cf-4383-ad65-2d4f562dabcf).

## Deduplicate your data

The [Dataset Tools organisation on Hugging Face](https://huggingface.co/collections/Dataset-Tools/models-for-dataset-curation-673c647d85be6398f9ba23d3) holds collections of tools and models to explore data or do feature engineering. For example, there is a really fast embedder based on [Model2Vec](https://github.com/MinishLab/model2vec) that can be used to deduplicate data based on semantic overlap using [semhash](https://github.com/MinishLab/semhash/tree/main/semhash).


In [None]:
from datasets import load_dataset
from semhash import SemHash


ds = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

semhash = SemHash.from_records(records=ds["input"])

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate(threshold=0.8).deduplicated
print(f"Original dataset: {len(ds)}. Filtered dataset: {len(deduplicated_texts)}. Percentage left: {len(deduplicated_texts)/len(ds)}")

Original dataset: 12859. Filtered dataset: 10457. Percentage left: 0.8132047593125438
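To get a feel for what the `threshold` does, here is a brute-force sketch of threshold-based semantic deduplication in plain Python. It is an illustration under stated assumptions, not semhash's implementation: the 2-d vectors are made-up stand-ins for real embeddings, `greedy_deduplicate` is a hypothetical helper, and treating a cosine similarity at or above the threshold as a duplicate is my assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_deduplicate(texts, vectors, threshold=0.8):
    """Keep a text only if it is not too similar to any text kept so far."""
    kept_texts, kept_vectors = [], []
    for text, vec in zip(texts, vectors):
        if all(cosine(vec, kept) < threshold for kept in kept_vectors):
            kept_texts.append(text)
            kept_vectors.append(vec)
    return kept_texts


texts = [
    "What is the capital of France?",
    "Which city is France's capital?",
    "How do I sort a list in Python?",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up embeddings

deduplicated = greedy_deduplicate(texts, vectors)
print(deduplicated)
```

Lowering the threshold removes more texts, because looser matches start to count as duplicates. This pairwise loop scales quadratically, so it is only suitable for small examples.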


## Filter data on quality

Similarly, there are models and tools to determine the quality of your texts. A model we could use is the [`HuggingFaceFW/fineweb-edu-classifier`](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) model for educational quality. Alternatively, [Text Descriptives](https://github.com/HLasse/TextDescriptives) is another Python library you can explore for calculating a large variety of quality metrics from text.

In [None]:
from transformers import pipeline
import pandas as pd
from datasets import load_dataset


ds = load_dataset(
    path="argilla/distilabel-intel-orca-dpo-pairs",
    split="train"
)

pipe = pipeline(
    task="text-classification",
    model="HuggingFaceFW/fineweb-edu-classifier"
)

quality_predictions = pipe(ds["input"], truncation=True, verbose=True)
#quality_predictions = pipe(ds["text"], truncation=True, verbose=True)


quality_scores = [i["score"] for i in quality_predictions]

df = pd.DataFrame.from_dict(
    {
        "text": ds["input"],  #"text": ds["text"]
        "quality": quality_scores
    }
)
p_to_keep = 0.8
min_score = 0.8
df.sort_values(by="quality", ascending=False, inplace=True)
df = df.head(int(len(df)*p_to_keep))
df = df[df["quality"] > min_score]
print(f"Original dataset: {len(ds)}. Filtered dataset: {len(df)}. Percentage left: {len(df)/len(ds)}")

Device set to use cuda:0


Original dataset: 12859. Filtered dataset: 9466. Percentage left: 0.7361381133836223
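The cell above applies two filters in sequence: first it keeps the top `p_to_keep` fraction of texts by score, then it drops anything at or below `min_score`. The same logic in plain Python, on made-up scores; `filter_by_quality` is a hypothetical helper for illustration.

```python
def filter_by_quality(scored_texts, p_to_keep=0.8, min_score=0.8):
    """Keep the top fraction by score, then drop anything at or below min_score."""
    ranked = sorted(scored_texts, key=lambda pair: pair[1], reverse=True)
    top = ranked[: int(len(ranked) * p_to_keep)]
    return [(text, score) for text, score in top if score > min_score]


# Made-up (text, score) pairs.
scored = [("a", 0.95), ("b", 0.85), ("c", 0.75), ("d", 0.60), ("e", 0.99)]

kept = filter_by_quality(scored)
print(kept)
```

Here `d` falls outside the top 80 percent, while `c` makes that cut but is removed by the minimum score.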


## Upload and edit our dataset

We can then push our dataset to Hugging Face and [create a nice dataset card](https://huggingface.co/datasets/uplimit/uplimit-synthetic-data-week-1-basic).