## Imports

In [1]:
import re
from typing import Dict

from datasets import load_dataset, Dataset, DatasetDict, load_from_disk, concatenate_datasets
from huggingface_hub import notebook_login

## Global variables

In [2]:
SAVE_GATH_BAIZE = True
SAVE_GLAIVE_FC_V2 = True
MERGE_BOTH = True
FINAL = True

## Dataset Names

In [3]:
dataset_name, dataset_name2 = "gathnex/Gath_baize", "glaiveai/glaive-function-calling-v2"

## Gath_baize

In [4]:
# Load the Gath_baize dataset
dataset = load_dataset(dataset_name, split="train")

In [5]:
print(dataset)
dataset["chat_sample"][1]

Dataset({
    features: ['chat_sample', 'dataset_origin'],
    num_rows: 210311
})


'The conversation between Human and AI assisatance named Gathnex [INST] I would like to know how to find the path with the maximum minimal weight in a graph.\n[/INST] That\'s a great question. To find the path with the maximum minimal weight in a graph, you can use the algorithm called "Floyd–Warshall algorithm". Have you heard about it before?\n[INST] Yes, I have. Can you explain how it works?\n[/INST] Sure. The Floyd–Warshall algorithm is a dynamic programming algorithm that finds the shortest paths in a weighted graph with positive or negative edge weights. It does this by repeatedly improving an estimate of the shortest path between two points, until the estimate is optimal.\n[INST] How does it guarantee to find the maximum minimal weight path?\n[/INST] The algorithm works by computing a matrix of all the possible pairwise shortest paths. From that matrix, we can pick the path with the largest minimum weight. That path will be the maximum minimal weight path.\n[INST] That sounds in

### Function to split System and chat

In [6]:
# Compile the regex once at module level so it is not rebuilt on every call
TURN_REGEX = re.compile(
    r"(?P<system>^.+?)(?=\[INST\])|\[INST\](?P<user>.+?)\[\/INST\]|(?<=\[\/INST\])(?P<assistant>.+?)(?=\[INST\]|$)",
    re.DOTALL,
)

def convert_row(row: str) -> Dict[str, str]:
    """Split a raw Gath_baize sample into a system prompt and a USER/ASSISTANT chat string."""
    system_prompt = ""
    conversation = ""
    # Each match fills exactly one of the named groups: system, user or assistant
    for match in TURN_REGEX.finditer(row):
        if match.group("system"):
            # The original prefix is discarded and replaced by a fixed system prompt
            system_prompt = "SYSTEM: The conversation between Human and AI assistant"
        if match.group("user"):
            user = match.group("user").strip()
            conversation += f"USER: {user}\n\n\n"
        if match.group("assistant"):
            assistant = match.group("assistant").strip()
            conversation += f"ASSISTANT: {assistant} <|endoftext|>\n\n\n"

    return {'system': system_prompt, 'chat': conversation}

In [7]:
convert_row(dataset["chat_sample"][1])

{'system': 'SYSTEM: The conversation between Human and AI assistant',
 'chat': 'USER: I would like to know how to find the path with the maximum minimal weight in a graph.\n\n\nASSISTANT: That\'s a great question. To find the path with the maximum minimal weight in a graph, you can use the algorithm called "Floyd–Warshall algorithm". Have you heard about it before? <|endoftext|>\n\n\nUSER: Yes, I have. Can you explain how it works?\n\n\nASSISTANT: Sure. The Floyd–Warshall algorithm is a dynamic programming algorithm that finds the shortest paths in a weighted graph with positive or negative edge weights. It does this by repeatedly improving an estimate of the shortest path between two points, until the estimate is optimal. <|endoftext|>\n\n\nUSER: How does it guarantee to find the maximum minimal weight path?\n\n\nASSISTANT: The algorithm works by computing a matrix of all the possible pairwise shortest paths. From that matrix, we can pick the path with the largest minimum weight. That
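To make the three-way alternation easier to follow, here is a minimal sketch that runs the same pattern on a short, hypothetical two-turn string (the sample text and the name `TURN_RE` are illustrative, not taken from the dataset). It relies on `match.lastgroup` to report which named group matched on each iteration.

```python
import re

# Same alternation as in convert_row: system prefix | user turn | assistant turn.
TURN_RE = re.compile(
    r"(?P<system>^.+?)(?=\[INST\])"
    r"|\[INST\](?P<user>.+?)\[/INST\]"
    r"|(?<=\[/INST\])(?P<assistant>.+?)(?=\[INST\]|$)",
    re.DOTALL,
)

# Hypothetical sample, written to mimic the [INST] ... [/INST] layout.
sample = "Intro text [INST] Hi\n[/INST] Hello!\n[INST] Bye\n[/INST] See you."

# Exactly one named group participates in each match, so lastgroup names the role.
turns = [(m.lastgroup, m.group(m.lastgroup).strip()) for m in TURN_RE.finditer(sample)]

for role, text in turns:
    print(f"{role:>9}: {text}")
```

Everything before the first `[INST]` lands in `system`, text between `[INST]` and `[/INST]` lands in `user`, and text after `[/INST]` up to the next `[INST]` (or the end of the string) lands in `assistant`.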

### Splitting System and Chat

In [8]:
if SAVE_GATH_BAIZE:
    # Use .map() to apply convert_row to each row (one example per call).
    # batched=True would need a function that accepts lists of rows, which convert_row does not.
    updated_dataset = dataset.map(lambda example: convert_row(example['chat_sample']), remove_columns=['chat_sample'])
    
    print(updated_dataset)

Map:   0%|          | 0/210311 [00:00<?, ? examples/s]

Dataset({
    features: ['dataset_origin', 'system', 'chat'],
    num_rows: 210311
})


In [9]:
if SAVE_GATH_BAIZE:
    print(updated_dataset[0])

{'dataset_origin': 'alpaca', 'system': 'SYSTEM: The conversation between Human and AI assistant', 'chat': 'USER: Generate a headline given a content block.\nThe Sony Playstation 5 is the latest version of the console. It has improved graphics and faster processing power.\n\n\nASSISTANT: Experience Amazing Graphics and Speed with the New Sony Playstation 5 <|endoftext|>\n\n\n'}
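The `.map()` call above processes one row per call. If batched processing is wanted, the function has to accept a dict of lists and return a dict of lists. The following is a sketch under that assumption; `convert_batch`, `TURN_RE` and `SYSTEM_PROMPT` are hypothetical names introduced here, and the commented `.map(..., batched=True)` line has not been run against the full dataset.

```python
import re
from typing import Dict, List

TURN_RE = re.compile(
    r"(?P<system>^.+?)(?=\[INST\])"
    r"|\[INST\](?P<user>.+?)\[/INST\]"
    r"|(?<=\[/INST\])(?P<assistant>.+?)(?=\[INST\]|$)",
    re.DOTALL,
)
SYSTEM_PROMPT = "SYSTEM: The conversation between Human and AI assistant"


def convert_batch(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Batched counterpart of convert_row: one output entry per input row."""
    systems, chats = [], []
    for row in batch["chat_sample"]:
        system_prompt, chat = "", ""
        for m in TURN_RE.finditer(row):
            if m.group("system"):
                system_prompt = SYSTEM_PROMPT
            elif m.group("user"):
                chat += f"USER: {m.group('user').strip()}\n\n\n"
            elif m.group("assistant"):
                chat += f"ASSISTANT: {m.group('assistant').strip()} <|endoftext|>\n\n\n"
        systems.append(system_prompt)
        chats.append(chat)
    return {"system": systems, "chat": chats}


# Hypothetical usage in the notebook (assumes `dataset` is loaded as above):
# updated_dataset = dataset.map(convert_batch, batched=True, remove_columns=["chat_sample"])

demo = convert_batch({"chat_sample": ["Intro [INST] Hi [/INST] Hello", "[INST] Q [/INST] A"]})
print(demo)
```

The second demo row has no text before its first `[INST]`, so its system prompt stays empty, which matches how `convert_row` behaves on such rows.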


### Saving new dataset

In [10]:
# Specify the path where you want to save the dataset
save_path = './dataset/hibo-gath_baize'

In [11]:
if SAVE_GATH_BAIZE:
    # Save the dataset to disk
    updated_dataset.save_to_disk(save_path)
    
    print(f"Dataset saved to {save_path}")

Saving the dataset (0/1 shards):   0%|          | 0/210311 [00:00<?, ? examples/s]

Dataset saved to ./dataset/hibo-gath_baize


### Reloading new dataset

In [12]:
# Load the dataset from disk
reloaded_dataset = load_from_disk(save_path)

print(reloaded_dataset)
reloaded_dataset[0]

Dataset({
    features: ['dataset_origin', 'system', 'chat'],
    num_rows: 210311
})


{'dataset_origin': 'alpaca',
 'system': 'SYSTEM: The conversation between Human and AI assistant',
 'chat': 'USER: Generate a headline given a content block.\nThe Sony Playstation 5 is the latest version of the console. It has improved graphics and faster processing power.\n\n\nASSISTANT: Experience Amazing Graphics and Speed with the New Sony Playstation 5 <|endoftext|>\n\n\n'}

## glaive-function-calling-v2

In [13]:
# Load the glaive-function-calling-v2 dataset
dataset2 = load_dataset(dataset_name2, split="train")

In [14]:
print(dataset2)
dataset2[0]

Dataset({
    features: ['chat', 'system'],
    num_rows: 112960
})


{'chat': "USER: Can you book a flight for me from New York to London?\n\n\nASSISTANT: I'm sorry, but I don't have the capability to book flights. My current function allows me to get the exchange rate between two currencies. If you need help with that, feel free to ask! <|endoftext|>\n\n\n",
 'system': 'SYSTEM: You are a helpful assistant with access to the following functions. Use them if required -\n{\n    "name": "get_exchange_rate",\n    "description": "Get the exchange rate between two currencies",\n    "parameters": {\n        "type": "object",\n        "properties": {\n            "base_currency": {\n                "type": "string",\n                "description": "The currency to convert from"\n            },\n            "target_currency": {\n                "type": "string",\n                "description": "The currency to convert to"\n            }\n        },\n        "required": [\n            "base_currency",\n            "target_currency"\n        ]\n    }\n}\n'}

In [15]:
if SAVE_GLAIVE_FC_V2:
    # Add a constant 'dataset_origin' column via .map() (one example per call)
    updated_dataset = dataset2.map(lambda example: {'dataset_origin': 'glaiveai/glaive-function-calling-v2'}, batched=False)

In [16]:
if SAVE_GLAIVE_FC_V2:
    print(updated_dataset[0])

{'chat': "USER: Can you book a flight for me from New York to London?\n\n\nASSISTANT: I'm sorry, but I don't have the capability to book flights. My current function allows me to get the exchange rate between two currencies. If you need help with that, feel free to ask! <|endoftext|>\n\n\n", 'system': 'SYSTEM: You are a helpful assistant with access to the following functions. Use them if required -\n{\n    "name": "get_exchange_rate",\n    "description": "Get the exchange rate between two currencies",\n    "parameters": {\n        "type": "object",\n        "properties": {\n            "base_currency": {\n                "type": "string",\n                "description": "The currency to convert from"\n            },\n            "target_currency": {\n                "type": "string",\n                "description": "The currency to convert to"\n            }\n        },\n        "required": [\n            "base_currency",\n            "target_currency"\n        ]\n    }\n}\n', 'datase

### Saving new dataset

In [17]:
# Specify the path where you want to save the dataset
save_path = './dataset/hibo-glaive-function-calling'

In [18]:
if SAVE_GLAIVE_FC_V2:
    # Save the dataset to disk
    updated_dataset.save_to_disk(save_path)
    
    print(f"Dataset saved to {save_path}")

Saving the dataset (0/1 shards):   0%|          | 0/112960 [00:00<?, ? examples/s]

Dataset saved to ./dataset/hibo-glaive-function-calling


### Reloading new dataset

In [19]:
# Load the dataset from disk
reloaded_dataset2 = load_from_disk(save_path)

print(reloaded_dataset2)
reloaded_dataset2[0]

Dataset({
    features: ['chat', 'system', 'dataset_origin'],
    num_rows: 112960
})


{'chat': "USER: Can you book a flight for me from New York to London?\n\n\nASSISTANT: I'm sorry, but I don't have the capability to book flights. My current function allows me to get the exchange rate between two currencies. If you need help with that, feel free to ask! <|endoftext|>\n\n\n",
 'system': 'SYSTEM: You are a helpful assistant with access to the following functions. Use them if required -\n{\n    "name": "get_exchange_rate",\n    "description": "Get the exchange rate between two currencies",\n    "parameters": {\n        "type": "object",\n        "properties": {\n            "base_currency": {\n                "type": "string",\n                "description": "The currency to convert from"\n            },\n            "target_currency": {\n                "type": "string",\n                "description": "The currency to convert to"\n            }\n        },\n        "required": [\n            "base_currency",\n            "target_currency"\n        ]\n    }\n}\n',
 'data

## hibo-function-calling-v1

### Merge both

In [20]:
print(reloaded_dataset['system'][0])

SYSTEM: The conversation between Human and AI assistant


In [21]:
print(reloaded_dataset2['system'][1])

SYSTEM: You are a helpful assistant with access to the following functions. Use them if required -
{
    "name": "get_news_headlines",
    "description": "Get the latest news headlines",
    "parameters": {
        "type": "object",
        "properties": {
            "country": {
                "type": "string",
                "description": "The country for which to fetch news"
            }
        },
        "required": [
            "country"
        ]
    }
}



In [22]:
if MERGE_BOTH:
    # Concatenate the two reloaded datasets; both expose the columns 'dataset_origin', 'system' and 'chat'
    merged_dataset = concatenate_datasets([reloaded_dataset, reloaded_dataset2])
    
    print(merged_dataset)

Dataset({
    features: ['dataset_origin', 'system', 'chat'],
    num_rows: 323271
})
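A quick sanity check on the merge can be sketched as follows: the two inputs should expose the same column names (their order differs in the outputs above), and the merged row count should equal the sum of the two inputs. `check_merge` is a hypothetical helper; in the notebook its arguments could come from each dataset's `column_names` and `num_rows`.

```python
from typing import Sequence


def check_merge(cols_a: Sequence[str], cols_b: Sequence[str],
                rows_a: int, rows_b: int, rows_merged: int) -> None:
    """Raise ValueError if column names differ or the row counts do not add up."""
    if set(cols_a) != set(cols_b):
        raise ValueError(f"Column mismatch: {sorted(set(cols_a) ^ set(cols_b))}")
    if rows_a + rows_b != rows_merged:
        raise ValueError(f"Expected {rows_a + rows_b} rows, got {rows_merged}")


# Values copied from the printed outputs in this notebook.
check_merge(
    ["dataset_origin", "system", "chat"],
    ["chat", "system", "dataset_origin"],
    210311, 112960, 323271,
)
print("merge looks consistent")
```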


### Saving merged dataset

In [23]:
# Specify the path where you want to save the dataset
save_path = './dataset/hibo-function-calling-merge'

In [24]:
if MERGE_BOTH:
    # Save the dataset to disk
    merged_dataset.save_to_disk(save_path)
    
    print(f"Dataset saved to {save_path}")

Saving the dataset (0/1 shards):   0%|          | 0/323271 [00:00<?, ? examples/s]

Dataset saved to ./dataset/hibo-function-calling-merge


### Reloading new dataset

In [25]:
# Load the dataset from disk
reloaded_dataset3 = load_from_disk(save_path)
print(reloaded_dataset3)

Dataset({
    features: ['dataset_origin', 'system', 'chat'],
    num_rows: 323271
})


In [26]:
reloaded_dataset3[0]

{'dataset_origin': 'alpaca',
 'system': 'SYSTEM: The conversation between Human and AI assistant',
 'chat': 'USER: Generate a headline given a content block.\nThe Sony Playstation 5 is the latest version of the console. It has improved graphics and faster processing power.\n\n\nASSISTANT: Experience Amazing Graphics and Speed with the New Sony Playstation 5 <|endoftext|>\n\n\n'}
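For inspecting merged rows, it can help to split a `chat` string back into individual turns. The sketch below assumes every turn starts at the beginning of a line with `USER:` or `ASSISTANT:` and that `<|endoftext|>` only appears as an end-of-turn marker; any other role prefix that may occur in the data is not handled. `split_turns` and `ROLE_RE` are hypothetical names.

```python
import re
from typing import List, Tuple

# A role marker at the start of a line, followed by optional whitespace.
ROLE_RE = re.compile(r"^(USER|ASSISTANT):\s*", re.MULTILINE)


def split_turns(chat: str) -> List[Tuple[str, str]]:
    """Return (role, text) pairs in order of appearance."""
    # re.split with a capturing group yields ['', role1, text1, role2, text2, ...]
    parts = ROLE_RE.split(chat)
    turns = []
    for role, text in zip(parts[1::2], parts[2::2]):
        turns.append((role, text.replace("<|endoftext|>", "").strip()))
    return turns


# Hypothetical sample in the same layout as the 'chat' column.
example_chat = "USER: Hi\nthere\n\n\nASSISTANT: Hello <|endoftext|>\n\n\n"
print(split_turns(example_chat))
```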

In [27]:
reloaded_dataset3[300000]

{'dataset_origin': 'glaiveai/glaive-function-calling-v2',
 'system': 'SYSTEM: You are a helpful assistant, with no access to external functions.\n\n',
 'chat': 'USER: How can the efficiency of an algorithm be measured in terms of time complexity? Can you provide an example in [Java code] and calculate its Big O notation? \n[Formula]: \nT(n) = O(f(n))\n[Java code]:\nint sum = 0;\nfor (int i = 0; i < n; i++) {\n for (int j = 0; j < n; j++) {\n sum++;\n }\n}\nSystem.out.println(sum);\n[Formula calculation]:\nT(n) = O(n^2)\n\nASSISTANT: The efficiency of an algorithm can be measured in terms of time complexity by analyzing the amount of time it takes to run the algorithm as the input size grows larger. This is typically done by calculating the Big O notation of the algorithm, which provides an upper bound on the time complexity in terms of the input size.\nIn the example Java code you provided, the algorithm is calculating the sum of all possible pairs of integers from 0 to n-1. The outer 

## Split train

In [28]:
dataset_dict = DatasetDict({
    'train': reloaded_dataset3
})

In [29]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['dataset_origin', 'system', 'chat'],
        num_rows: 323271
    })
})

In [30]:
# Specify the path where you want to save the dataset
save_path = './dataset/hibo-function-calling-v1'

In [31]:
if FINAL:
    # Save the dataset to disk
    dataset_dict.save_to_disk(save_path)
    
    print(f"Dataset saved to {save_path}")

Saving the dataset (0/1 shards):   0%|          | 0/323271 [00:00<?, ? examples/s]

Dataset saved to ./dataset/hibo-function-calling-v1


In [32]:
if FINAL:
    dataset_dict.push_to_hub("thibaud-perrin/hibo-function-calling-v1")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/324 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/4.03k [00:00<?, ?B/s]

