# TDE-v2 Dataset Cookbook

This cookbook prepares the TDE-v2 dataset for the Chat-Transform project. It refactors the original dataset from Yeye-He's [repo](https://github.com/Yeye-He/Transform-Data-by-Example/tree/master/Benchmark) into the `Alpaca` fine-tuning format, which is compatible with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).


In [1]:
import os
import pandas as pd
import json

## Load and Check

In [2]:
data_directory = '../data/TDE-v2/'
data_frames = []

# Load all JSON files in the directory and convert to DataFrame
for subdir, _, files in os.walk(data_directory):
    for file in files:
        if file.endswith('.json'):
            file_path = os.path.join(subdir, file)
            with open(file_path, 'r', encoding='utf-8') as f:  # explicit encoding: the data contains non-ASCII characters
                df = pd.json_normalize(json.load(f),
                                    meta=[['context', 'input'], ['context', 'output']])
                
                test_fp = file_path.replace(data_directory, "")
                df['test_path'] = os.path.splitext(test_fp)[0]  # remove file extension
                data_frames.append(df)

# Combine all DataFrames into a single DataFrame
df_all = pd.concat(data_frames, ignore_index=True)

# split the function from field 'instruction'
df_all['function'] = df_all['instruction'].apply(lambda x: x.split(':')[0].strip())

# check the shape of the dataframe
df_all.shape

(226, 7)
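As a small illustration of the prefix split used above, the sketch below extracts the API-style function name from an instruction string. It uses `str.partition` instead of `split`, which is an alternative chosen here for illustration rather than the notebook's own code; the helper name `function_name` and the second example string are hypothetical.

```python
def function_name(instruction: str) -> str:
    """Return the text before the first colon, e.g. 'format()'."""
    head, _sep, _rest = instruction.partition(":")
    return head.strip()


# Only the first colon matters, so colons inside the description are harmless.
print(function_name("format(): convert YYYYMMDD to MM-DD-YYYY"))
print(function_name("extract(): keep the part after 'id:'"))
```

Both approaches give the same prefix; `partition` simply avoids building a list of every colon-separated piece.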

In [3]:
df_all.head()

| | chat | instruction | tuples | context.input | context.output | test_path | function |
|---|---|---|---|---|---|---|---|
| 0 | Transform first and last names into username | format(): Combine first letter of first name w... | `[{'input': 'john	smith', 'output': 'jsmith'}, ...` | First and last names separated by a tab charac... | Username created by combining the first letter... | benchmark-FF-Trifacta-GoogleRefine/example_fil... | format() |
| 1 | Extract name of actor/actress | extract(): extract name of actor/actress from ... | `[{'input': '* '''1953 [[Meena Kumari]] – ''[[B...` | text format containing movie details | extracted name of actor/actress | benchmark-FF-Trifacta-GoogleRefine/example_fil... | extract() |
| 2 | Extract character names from wiki-style list | extract(): Extract the character name from the... | `[{'input': '* '''1953 [[Meena Kumari]] – ''[[B...` | Wiki-style list item containing actor and char... | Character name extracted from the input | benchmark-FF-Trifacta-GoogleRefine/example_fil... | extract() |
| 3 | format date of birth | format(): convert YYYYMMDD to MM-DD-YYYY | `[{'input': '19610223', 'output': '02-23-1961'}...` | date in YYYYMMDD format | formatted date in MM-DD-YYYY | benchmark-FF-Trifacta-GoogleRefine/example_fil... | format() |
| 4 | Extract and format movie titles | transform(): Extract the movie title from the ... | `[{'input': '* '''1953 [[Meena Kumari]] – ''[[B...` | String containing movie information with title... | Lowercase movie title | benchmark-FF-Trifacta-GoogleRefine/example_fil... | transform() |


> TDE-v2 dataset Data Dictionary

- The `chat` field holds the user's chat message to the AI assistant, i.e. a natural-language description of the desired transformation. It is the expected input to the fine-tuned LM.

- The `instruction` field holds a coding instruction prefixed with an API-style `function` name that specifies the transformation type. It is the expected target output of the fine-tuned LM.

- The `tuples` field holds a list of transformation-pairs, which serve as additional few-shot examples that help the fine-tuned LM understand the user's intention. Each transformation-pair is a dictionary with an `input` and an `output` key. Descriptions of the input and output sides of these pairs are stored in the `context.input` and `context.output` fields, respectively.

- `test_path` refers to the original path of the test case in the TDE dataset.
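To make the data dictionary concrete, here is a minimal sketch of how one record with a nested `context` object flattens into the dotted `context.input` and `context.output` columns. The nested layout of the raw record is an assumption inferred from the column names; the values are taken from the date-formatting row in the preview above.

```python
import pandas as pd

# Assumed shape of one raw record; field names follow the data dictionary.
record = {
    "chat": "format date of birth",
    "instruction": "format(): convert YYYYMMDD to MM-DD-YYYY",
    "tuples": [
        {"input": "19610223", "output": "02-23-1961"},
    ],
    "context": {
        "input": "date in YYYYMMDD format",
        "output": "formatted date in MM-DD-YYYY",
    },
}

# json_normalize flattens nested dicts into dotted column names
# and leaves the list of transformation-pairs untouched.
flat = pd.json_normalize(record)
print(sorted(flat.columns))
print(flat.loc[0, "tuples"][0]["output"])
```

The `tuples` cell stays a plain Python list, which is why `len` can be applied to it directly when counting transformation-pairs.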



In [4]:
# check duplicates of instruction, and sort it
df_all[df_all.duplicated(subset=['instruction'], keep=False)].sort_values(by='instruction')


| | chat | instruction | tuples | context.input | context.output | test_path | function |
|---|---|---|---|---|---|---|---|


In [5]:
# show distribution of function
df_all['function'].value_counts()


extract()             57
unit_convert()        48
format()              45
domain_calculate()    34
transform()           25
domain_map()          17
Name: function, dtype: int64

In [6]:
# count the number of transformation-pairs in tuples
df_all['num_tuples'] = df_all['tuples'].apply(len)
print(df_all['num_tuples'].describe())

print("Total number of tuples: ", df_all['num_tuples'].sum())

count    226.000000
mean       5.575221
std        2.232309
min        4.000000
25%        5.000000
50%        5.000000
75%        5.000000
max       25.000000
Name: num_tuples, dtype: float64
Total number of tuples:  1260


## Refactor to Alpaca Format

The fine-tuned LM serves as an AI assistant that translates user chat into the target instruction (chat-to-instruction), with the help of the transformation-pairs stored in the `tuples` field and the context information stored in the `context.input` and `context.output` fields.

In [7]:
df_all.columns

Index(['chat', 'instruction', 'tuples', 'context.input', 'context.output',
       'test_path', 'function', 'num_tuples'],
      dtype='object')

### Supervised Fine-Tuning Dataset

Here we craft the fine-tuning dataset in Alpaca format. For the dataset above, the *dataset description* in the `dataset_info.json` file of `LLaMA-Factory` should be one of the following options, which differ only in which columns are mapped to `prompt` and `query`:

1. `tde-o1`: chat-only

```json
"tde-o1": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "chat", // user chat
    "query": "", // no human input
    "response": "instruction", // model response
    "system": ...
  }
}
```

2. `tde-o2`: example-only

```json
"tde-o2": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "", // without user chat
    "query": "train_pairs", // provide transformation-pairs
    "response": "instruction", // model response
    "system": ...
  }
}
```

3. `tde-o3`: context-only

```json
"tde-o3": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "", // without user chat
    "query": "ctx", // provide context
    "response": "instruction", // model response
    "system": ...
  }
}
```

4. `tde-o4`: chat + example

```json
"tde-o4": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "chat", // user chat
    "query": "train_pairs", // provide transformation-pairs
    "response": "instruction", // model response
    "system": ...
  }
}
```

5. `tde-o5`: chat + context

```json
"tde-o5": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "chat", // user chat
    "query": "ctx", // provide context
    "response": "instruction", // model response
    "system": ...
  }
}
```

6. `tde-o6`: example + context

```json
"tde-o6": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "", // without user chat
    "query": "ctx_t_pairs", // provide context + examples
    "response": "instruction", // model response
    "system": ...
  }
}
```

7. `tde-o7`: chat + example + context

```json
"tde-o7": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "chat", // user chat
    "query": "ctx_t_pairs", // provide context + examples
    "response": "instruction", // model response
    "system": ...
  }
}
```
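The seven options differ only in the `prompt` and `query` columns, so the entries can also be generated rather than typed by hand. The sketch below is one way to do that; the names `OPTIONS` and `build_dataset_info` are hypothetical, and the `system` mapping is deliberately left out because it is elided in the descriptions above.

```python
import json

# (prompt column, query column) per option, copied from the list above.
OPTIONS = {
    "tde-o1": ("chat", ""),             # chat-only
    "tde-o2": ("", "train_pairs"),      # example-only
    "tde-o3": ("", "ctx"),              # context-only
    "tde-o4": ("chat", "train_pairs"),  # chat + example
    "tde-o5": ("chat", "ctx"),          # chat + context
    "tde-o6": ("", "ctx_t_pairs"),      # example + context
    "tde-o7": ("chat", "ctx_t_pairs"),  # chat + example + context
}


def build_dataset_info(options, file_name="tde-alpaca.json"):
    """Build one dataset description per option (without the 'system' mapping)."""
    return {
        name: {
            "file_name": file_name,
            "columns": {
                "prompt": prompt,
                "query": query,
                "response": "instruction",
            },
        }
        for name, (prompt, query) in options.items()
    }


dataset_info = build_dataset_info(OPTIONS)
print(json.dumps(dataset_info["tde-o7"], indent=2))
```

Unlike the annotated snippets above, the generated output contains no `//` comments, so it is valid JSON that can be merged into `dataset_info.json` once the `system` mapping is added.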




In [8]:
# show a sample of the dataframe
df_all.iloc[0]


chat                   Transform first and last names into username
instruction       format(): Combine first letter of first name w...
tuples            [{'input': 'john	smith', 'output': 'jsmith'}, ...
context.input     First and last names separated by a tab charac...
context.output    Username created by combining the first letter...
test_path         benchmark-FF-Trifacta-GoogleRefine/example_fil...
function                                                   format()
num_tuples                                                        5
Name: 0, dtype: object

The `chat` and `instruction` fields remain the same. We refactor the `tuples` field into the `train_pairs` and `test_pairs` fields, merge `context.input` and `context.output` into a single `ctx` field, and concatenate `ctx` and `train_pairs` into `ctx_t_pairs`.

In [13]:
# refactor columns based on Alpaca format
df_alpaca = df_all.copy()

# set system prompt
df_alpaca['system'] = "You are an AI assistant that translates user chat into the target instruction."

# context information
df_alpaca['ctx'] = df_alpaca.apply(lambda x: 'Context information:\n' +
                                   'Input: ' + x['context.input'] + '\n' + 
                                   'Output: ' + x['context.output'],axis=1)

# transformation pairs: the first 3 become training examples, all remaining ones become test examples
df_alpaca['train_pairs'] = df_alpaca['tuples'].apply(lambda x: 'Examples:\n' +
                                               '\n'.join([f'Input: {t["input"]}\nOutput: {t["output"]}' for t in x[:3]]))

df_alpaca['test_pairs'] = df_alpaca['tuples'].apply(lambda x: 'Examples:\n' +
                                               '\n'.join([f'Input: {t["input"]}\nOutput: {t["output"]}' for t in x[3:]]))


# ctx + train_pairs
df_alpaca['ctx_t_pairs'] = df_alpaca['ctx'] + '\n\n' + df_alpaca['train_pairs']

# slice columns
df_alpaca = df_alpaca[['test_path', 'system', 'chat', 'instruction', 'ctx', 'train_pairs', 'ctx_t_pairs', 'test_pairs']]


df_alpaca.head()


| | test_path | system | chat | instruction | ctx | train_pairs | ctx_t_pairs | test_pairs |
|---|---|---|---|---|---|---|---|---|
| 0 | benchmark-FF-Trifacta-GoogleRefine/example_fil... | You are an AI assistant that translates user c... | Transform first and last names into username | format(): Combine first letter of first name w... | `Context information:\nInput: First and last na...` | `Examples:\nInput: john\tsmith\nOutput: jsmith\...` | `Context information:\nInput: First and last na...` | `Examples:\nInput: alice\tbob\nOutput: abob\nIn...` |
| 1 | benchmark-FF-Trifacta-GoogleRefine/example_fil... | You are an AI assistant that translates user c... | Extract name of actor/actress | extract(): extract name of actor/actress from ... | `Context information:\nInput: text format conta...` | `Examples:\nInput: * '''1953 [[Meena Kumari]] –...` | `Context information:\nInput: text format conta...` | `Examples:\nInput: ** [[Geeta Bali]] – ''[[Vach...` |
| 2 | benchmark-FF-Trifacta-GoogleRefine/example_fil... | You are an AI assistant that translates user c... | Extract character names from wiki-style list | extract(): Extract the character name from the... | `Context information:\nInput: Wiki-style list i...` | `Examples:\nInput: * '''1953 [[Meena Kumari]] –...` | `Context information:\nInput: Wiki-style list i...` | `Examples:\nInput: ** [[Geeta Bali]] – ''[[Vach...` |
| 3 | benchmark-FF-Trifacta-GoogleRefine/example_fil... | You are an AI assistant that translates user c... | format date of birth | format(): convert YYYYMMDD to MM-DD-YYYY | `Context information:\nInput: date in YYYYMMDD ...` | `Examples:\nInput: 19610223\nOutput: 02-23-1961...` | `Context information:\nInput: date in YYYYMMDD ...` | `Examples:\nInput: 19221213\nOutput: 12-13-1922...` |
| 4 | benchmark-FF-Trifacta-GoogleRefine/example_fil... | You are an AI assistant that translates user c... | Extract and format movie titles | transform(): Extract the movie title from the ... | `Context information:\nInput: String containing...` | `Examples:\nInput: * '''1953 [[Meena Kumari]] –...` | `Context information:\nInput: String containing...` | `Examples:\nInput: ** [[Geeta Bali]] – ''[[Vach...` |


In [14]:
# Print all details of the first row in df_alpaca
print("Chat:")
print(df_alpaca.iloc[0]['chat'])
print("\nInstruction:")
print(df_alpaca.iloc[0]['instruction'])
print("\nTest Path:")
print(df_alpaca.iloc[0]['test_path'])
print("\nSystem:")
print(df_alpaca.iloc[0]['system'])
print("\nContext and Pairs:")
print(df_alpaca.iloc[0]['ctx_t_pairs'])

Chat:
Transform first and last names into username

Instruction:
format(): Combine first letter of first name with full last name to create username

Test Path:
benchmark-FF-Trifacta-GoogleRefine/example_file_name

System:
You are an AI assistant that translates user chat into the target instruction.

Context and Pairs:
Context information:
Input: First and last names separated by a tab character.
Output: Username created by combining the first letter of the first name with the full last name.

Examples:
Input: john	smith
Output: jsmith
Input: adam	williams
Output: awilliams
Input: james	johnson
Output: jjohnson
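To show how the exported columns relate to the usual Alpaca keys, the sketch below maps one row to an `instruction`/`input`/`output` record for a chosen `prompt`/`query` combination. This is an illustration of the column mapping only, not LLaMA-Factory's own code; the helper `to_alpaca_record` is hypothetical, and the `system` field is omitted. Note the naming clash: the dataset's `instruction` column is the model response, so it lands in the Alpaca `output` key.

```python
def to_alpaca_record(row: dict, prompt_col: str, query_col: str) -> dict:
    """Map one row to Alpaca-style keys; an empty column name yields an empty string."""
    return {
        "instruction": row[prompt_col] if prompt_col else "",
        "input": row[query_col] if query_col else "",
        "output": row["instruction"],  # the dataset's target instruction
    }


# Illustrative row using the column names produced by the refactoring cell.
row = {
    "chat": "format date of birth",
    "instruction": "format(): convert YYYYMMDD to MM-DD-YYYY",
    "ctx": "Context information:\n"
           "Input: date in YYYYMMDD format\n"
           "Output: formatted date in MM-DD-YYYY",
}

# Corresponds to the `tde-o5` combination (chat + context).
record = to_alpaca_record(row, prompt_col="chat", query_col="ctx")
print(record["instruction"])
print(record["input"])
print(record["output"])
```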


In [15]:
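# export to JSON (create the output directory first, in case it does not exist yet)
os.makedirs('../data/TDE-alpaca', exist_ok=True)
df_alpaca.to_json('../data/TDE-alpaca/tde-alpaca.json', orient='records', indent=4, force_ascii=False)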
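A quick way to sanity-check the export is to write with the same `to_json` arguments and read the file back with the standard `json` module. The sketch below does this on a one-row toy frame in a temporary directory, so it does not touch the real output path; the toy values are illustrative.

```python
import json
import os
import tempfile

import pandas as pd

# Toy frame with a multi-line field, mimicking the exported columns.
toy = pd.DataFrame([
    {
        "chat": "format date of birth",
        "instruction": "format(): convert YYYYMMDD to MM-DD-YYYY",
        "train_pairs": "Examples:\nInput: 19610223\nOutput: 02-23-1961",
    },
])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "tde-alpaca.json")
    # Same arguments as the export cell: one JSON object per row.
    toy.to_json(out_path, orient="records", indent=4, force_ascii=False)
    with open(out_path, "r", encoding="utf-8") as f:
        records = json.load(f)

print(len(records), sorted(records[0]))
```

With `orient='records'` the file is a JSON list of objects, one per row, keyed by the column names that the `dataset_info.json` entries refer to.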