# Prepare Datasets

## Introduction
In this notebook, we prepare the datasets used to benchmark different OpenAI embedding models with Binary Quantization. The benchmark uses a 100K-row sample of the DBPedia dataset.

## Approach
We will use the following approach to prepare the datasets:
1. Load the dataset and draw the 100K sample
2. Sanitize and prepare the text
3. Compute the embeddings and save them back to the dataset

In [2]:
from typing import List

import loguru
from datasets import load_dataset
from datasets.exceptions import DatasetNotFoundError
from dotenv import load_dotenv
from openai import OpenAI
from tqdm import tqdm

load_dotenv()  # take environment variables from .env.

logger = loguru.logger
logger.add("logs.log", format="{time} {level} {message}", level="INFO")

1

In [2]:
client = OpenAI()
bs = 1000  # Batch size

In [3]:
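def sanitize(text: str) -> str:
    # Replace newlines and tabs with spaces so each entry is a single line
    text = text.replace("\n", " ")
    text = text.replace("\t", " ")
    text = text.strip()
    # Fall back to a single space so that no entry is an empty string
    if not text:
        return " "
    return text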


def prepare_dataset(dataset_name: str = "KShivendu/dbpedia-entities-openai-1M"):
    # Use the argument instead of a hard-coded name
    dataset = load_dataset(dataset_name, split="train")
    # Fixed seed so the 100K sample is reproducible
    dataset = dataset.shuffle(seed=42)
    dataset = dataset.select(range(100000))
    dataset = dataset.map(
        lambda x: {"combined_text": sanitize(f"{x['title']}\n{x['text']}")}
    )
    combined_text = dataset["combined_text"]

    return dataset, combined_text


dataset, combined_text = prepare_dataset()

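To make the effect of `sanitize` concrete, here is a small standalone sketch. It repeats the function from the cell above and applies it to a few made-up strings; the sample inputs are illustrative and are not taken from DBPedia.

```python
def sanitize(text: str) -> str:
    # Same logic as the notebook cell above
    text = text.replace("\n", " ")
    text = text.replace("\t", " ")
    text = text.strip()
    if len(text) <= 0:
        return " "
    return text


# Hypothetical inputs: a title/text pair joined by a newline,
# a padded string, and a whitespace-only string
samples = ["Alan Turing\nEnglish mathematician.", "\t Qdrant \n", "\n\t"]
cleaned = [sanitize(s) for s in samples]
print(cleaned)
```

Whitespace-only input comes back as a single space rather than an empty string, so every row still has some text to embed.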

In [4]:
def get_embedding(texts: List[str], model: str, dimensions: int):
    return client.embeddings.create(input=texts, model=model, dimensions=dimensions)

In [5]:
def create_embeddings(
    bs: int, combined_text: List[str], MODEL_NAME: str, DIMENSIONS: int
):
    """
    Create embeddings for a list of texts with a specified OpenAI model,
    sending the texts to the API in batches.

    Parameters:
    - bs (int): The batch size, i.e. the number of texts sent per request.
    - combined_text (List[str]): The sanitized texts to embed.
    - MODEL_NAME (str): The name of the model to use for generating embeddings.
    - DIMENSIONS (int): The number of dimensions for the embeddings.

    Returns:
    - response_objects (list): One API response object per batch, each containing
      the embeddings for that batch.

    """
    response_objects = []
    for i in tqdm(range(0, len(combined_text), bs)):
        response_objects.append(
            get_embedding(combined_text[i : i + bs], MODEL_NAME, DIMENSIONS)
        )
    return response_objects
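The batching in `create_embeddings` can be checked without calling the API. The sketch below uses the same `range(0, len(...), bs)` slicing; `embed_in_batches` and `fake_embed_batch` are hypothetical names introduced only for this illustration, and the fake function stands in for the real OpenAI request.

```python
from typing import Callable, List


def embed_in_batches(
    texts: List[str],
    bs: int,
    embed_batch: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Call embed_batch on consecutive slices of at most bs texts."""
    vectors: List[List[float]] = []
    for i in range(0, len(texts), bs):
        vectors.extend(embed_batch(texts[i : i + bs]))
    return vectors


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for the API call: one 1-d "vector" per text
    return [[float(len(t))] for t in batch]


texts = [f"doc {n}" for n in range(7)]
batch_sizes = [len(texts[i : i + 3]) for i in range(0, len(texts), 3)]
vectors = embed_in_batches(texts, 3, fake_embed_batch)
print(batch_sizes, len(vectors))
```

The last batch is simply shorter when the number of texts is not a multiple of the batch size. With 100,000 texts and `bs = 1000`, the same slicing yields 100 requests.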

## Putting it all together

In [13]:
dataset_combinations = [
    # {
    #     "model_name": "text-embedding-3-large",
    #     "dimensions": 3072,
    # },
    # {
    #     "model_name": "text-embedding-3-large",
    #     "dimensions": 1024,
    # },
    # {
    #     "model_name": "text-embedding-3-large",
    #     "dimensions": 1536,
    # },
    # {
    #     "model_name": "text-embedding-3-small",
    #     "dimensions": 512,
    # },
    # {
    #     "model_name": "text-embedding-3-small",
    #     "dimensions": 1024,
    # },
    {
        "model_name": "text-embedding-3-small",
        "dimensions": 1536,
    },
]

In [7]:
for combination in dataset_combinations:
    MODEL_NAME, DIMENSIONS = combination["model_name"], combination["dimensions"]
    DATASET_NAME = f"Qdrant/dbpedia-entities-openai3-{MODEL_NAME}-{DIMENSIONS}-100K"
    logger.info(f"Working on {DATASET_NAME}")
    # For datasets that already exist on the Hub, normalize the columns
    dataset = load_dataset(DATASET_NAME, split="train")
    # Drop the "openai" column if it is present
    if "openai" in dataset.column_names:
        dataset = dataset.remove_columns("openai")
    # Rename the generic column to the "<model_name>-<dimensions>-embedding" format
    if "embedding" in dataset.column_names:
        dataset = dataset.rename_column(
            "embedding", f"{MODEL_NAME}-{DIMENSIONS}-embedding"
        )
    # The intermediate text column is not needed in the published dataset
    if "combined_text" in dataset.column_names:
        dataset = dataset.remove_columns("combined_text")
    dataset.push_to_hub(DATASET_NAME)

2024-02-06 01:27:38.134 | INFO     | __main__:<module>:4 - Working on Qdrant/dbpedia-entities-openai3-text-embedding-3-large-3072-100K
2024-02-06 01:28:05.935 | INFO     | __main__:<module>:4 - Working on Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1024-100K
2024-02-06 01:29:07.920 | INFO     | __main__:<module>:4 - Working on Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-100K
2024-02-06 01:29:43.130 | INFO     | __main__:<module>:4 - Working on Qdrant/dbpedia-entities-openai3-text-embedding-3-small-512-100K
2024-02-06 01:36:15.236 | INFO     | __main__:<module>:4 - Working on Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1024-100K
2024-02-06 01:37:05.522 | INFO     | __main__:<module>:4 - Working on Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K

EmptyDatasetError: The directory at hf://datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K@7e82935b3d4b60d1f14f456025cbe8b934740a73 doesn't contain any data files
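The repository and column names in the loop above follow one pattern per model and dimension pair. The sketch below spells that pattern out; `dataset_repo_name` and `embedding_column_name` are hypothetical helper names that the notebook does not define, and the format strings are copied from the cell.

```python
def dataset_repo_name(model_name: str, dimensions: int) -> str:
    # Same f-string as DATASET_NAME in the notebook
    return f"Qdrant/dbpedia-entities-openai3-{model_name}-{dimensions}-100K"


def embedding_column_name(model_name: str, dimensions: int) -> str:
    # Same f-string as the renamed embedding column in the notebook
    return f"{model_name}-{dimensions}-embedding"


combination = {"model_name": "text-embedding-3-small", "dimensions": 1536}
repo = dataset_repo_name(combination["model_name"], combination["dimensions"])
column = embedding_column_name(combination["model_name"], combination["dimensions"])
print(repo)
print(column)
```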

In [14]:
# If a dataset does not exist on the Hub yet, create its embeddings and upload it
base_dataset, combined_text = prepare_dataset()
for combination in dataset_combinations:
    MODEL_NAME, DIMENSIONS = combination["model_name"], combination["dimensions"]
    DATASET_NAME = f"Qdrant/dbpedia-entities-openai3-{MODEL_NAME}-{DIMENSIONS}-100K"
    logger.info(f"Checking on {DATASET_NAME}")
    try:
        load_dataset(DATASET_NAME, split="train")  # existence check only
    except DatasetNotFoundError:
        logger.info(
            f"Creating embeddings for {MODEL_NAME} with {DIMENSIONS} dimensions"
        )
        response_objects = create_embeddings(
            bs,
            combined_text=combined_text,
            MODEL_NAME=MODEL_NAME,
            DIMENSIONS=DIMENSIONS,
        )
        embedding_responses = [r.data for r in response_objects]
        embedding_objects = [
            item for sublist in embedding_responses for item in sublist
        ]
        embeddings = [e.embedding for e in embedding_objects]
        logger.info(f"Embeddings created for {MODEL_NAME} with {DIMENSIONS} dimensions")
        dataset = base_dataset.add_column(f"{MODEL_NAME}-{DIMENSIONS}-embedding", embeddings)
        logger.info(f"{len(embeddings)} Embeddings added to the dataset")
        dataset.push_to_hub(DATASET_NAME)

2024-02-06 02:02:59.784 | INFO     | __main__:<module>:6 - Checking on Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K
2024-02-06 02:03:00.122 | INFO     | __main__:<module>:10 - Creating embeddings for text-embedding-3-small with 1536 dimensions
100%|██████████| 100/100 [08:56<00:00,  5.36s/it]
2024-02-06 02:11:56.219 | INFO     | __main__:<module>:24 - Embeddings created for text-embedding-3-small with 1536 dimensions
2024-02-06 02:13:26.579 | INFO     | __main__:<module>:26 - 100000 Embeddings added to the dataset
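The step that turns the per-batch responses into one flat list of vectors can be tried offline. In the sketch below, `FakeEmbedding` and `FakeResponse` are hypothetical stand-ins that mimic only the `data`, `index`, and `embedding` fields of the OpenAI response objects. Sorting each batch by `index` is an extra defensive step that the notebook itself does not perform.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEmbedding:
    # Stand-in for one item of response.data
    index: int
    embedding: List[float]


@dataclass
class FakeResponse:
    # Stand-in for one embeddings API response (one batch)
    data: List[FakeEmbedding]


def flatten_embeddings(responses: List[FakeResponse]) -> List[List[float]]:
    """Concatenate the vectors of all batches, keeping the input order."""
    vectors: List[List[float]] = []
    for response in responses:
        # Assumption: order items by their position within the batch
        for item in sorted(response.data, key=lambda e: e.index):
            vectors.append(item.embedding)
    return vectors


responses = [
    FakeResponse([FakeEmbedding(1, [0.3, 0.4]), FakeEmbedding(0, [0.1, 0.2])]),
    FakeResponse([FakeEmbedding(0, [0.5, 0.6])]),
]
vectors = flatten_embeddings(responses)
print(vectors)
```

The flattened list must have exactly one vector per dataset row, in row order, before it is passed to `add_column`.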