# TDE-v2 Dataset Cookbook

This cookbook prepares the TDE-v2 dataset for the Chat-Transform project. It refactors the original dataset from Yeye-He's [repo](https://github.com/Yeye-He/Transform-Data-by-Example/tree/master/Benchmark) into the `Alpaca` fine-tuning format, which is compatible with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).


In [1]:
import os
import pandas as pd
import json

## Load and Check

In [9]:
data_directory = '../data/TDE-v2/'
data_frames = []

# Load all JSON files in the directory and convert to DataFrame
for subdir, _, files in os.walk(data_directory):
    for file in files:
        if file.endswith('.json'):
            file_path = os.path.join(subdir, file)
            with open(file_path, 'r', encoding='utf-8') as f:  # explicit encoding: the data contains accented characters
                df = pd.json_normalize(json.load(f),
                                       meta=[['context', 'input'], ['context', 'output']])
                
                test_fp = file_path.replace(data_directory, "")
                df['test_path'] = os.path.splitext(test_fp)[0]  # remove file extension
                data_frames.append(df)

# Combine all DataFrames into a single DataFrame
df_all = pd.concat(data_frames, ignore_index=True)

# split the function from field 'instruction'
df_all['function'] = df_all['instruction'].apply(lambda x: x.split(':')[0].strip())

# check the shape of the dataframe
df_all.shape

(226, 7)

In [10]:
df_all.head()

| | chat | instruction | tuples | context.input | context.output | test_path | function |
|---|---|---|---|---|---|---|---|
| 0 | Transform first and last names into username | format(): Combine first letter of first name w... | `[{'input': 'john\tsmith', 'output': 'jsmith'}, ...` | First and last names separated by a tab charac... | Username created by combining the first letter... | benchmark-FF-Trifacta-GoogleRefine/example_fil... | format() |
| 1 | Extract name of actor/actress | extract(): extract name of actor/actress from ... | `[{'input': '* '''1953 [[Meena Kumari]] – ''[[B...` | text format containing movie details | extracted name of actor/actress | benchmark-FF-Trifacta-GoogleRefine/example_fil... | extract() |
| 2 | Extract character names from wiki-style list | extract(): Extract the character name from the... | `[{'input': '* '''1953 [[Meena Kumari]] – ''[[B...` | Wiki-style list item containing actor and char... | Character name extracted from the input | benchmark-FF-Trifacta-GoogleRefine/example_fil... | extract() |
| 3 | format date of birth | format(): convert YYYYMMDD to MM-DD-YYYY | `[{'input': '19610223', 'output': '02-23-1961'}...` | date in YYYYMMDD format | formatted date in MM-DD-YYYY | benchmark-FF-Trifacta-GoogleRefine/example_fil... | format() |
| 4 | Extract and format movie titles | transform(): Extract the movie title from the ... | `[{'input': '* '''1953 [[Meena Kumari]] – ''[[B...` | String containing movie information with title... | Lowercase movie title | benchmark-FF-Trifacta-GoogleRefine/example_fil... | transform() |


> TDE-v2 dataset Data Dictionary

- The `chat` field represents a conversation between a user and an AI assistant. It is expected to be the input to the fine-tuned LM.

- The `instruction` field represents a coding instruction with an API-style `function` name that specifies the transformation type. It is expected to be the target output from the fine-tuned LM.

- The `tuples` field holds a list of transformation pairs, which serve as additional few-shot examples that help the fine-tuned LM understand the user's intent. Each transformation pair is a dict with an `input` key and an `output` key. Descriptions of the input and output sides are stored in the `context.input` and `context.output` fields, respectively.

- `test_path` refers to the original path of the test case in the TDE dataset.
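
To make the data dictionary concrete, here is a minimal sketch of one record written as a Python dict. The values are copied from row 3 of the `df_all.head()` preview; only the first transformation pair is shown because the preview truncates the rest. The nested `context` layout is an assumption inferred from the flattened `context.input` / `context.output` column names, and the variable names are illustrative.

```python
# Illustrative record; values copied from row 3 of the preview.
record = {
    "chat": "format date of birth",
    "instruction": "format(): convert YYYYMMDD to MM-DD-YYYY",
    # the preview truncates the list, so only the first pair is shown
    "tuples": [{"input": "19610223", "output": "02-23-1961"}],
    # assumed nesting, inferred from the flattened column names
    "context": {
        "input": "date in YYYYMMDD format",
        "output": "formatted date in MM-DD-YYYY",
    },
}

# Same split as the loading cell: the function name is the text before the first colon.
function_name = record["instruction"].split(":")[0].strip()
# Everything after the first colon is the free-text description.
description = record["instruction"].split(":", 1)[1].strip()

print(function_name)
print(description)
```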



In [12]:
# check for duplicated instructions and sort them by instruction
df_all[df_all.duplicated(subset=['instruction'], keep=False)].sort_values(by='instruction')


| | chat | instruction | tuples | context.input | context.output | test_path | function |
|---|---|---|---|---|---|---|---|
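
The empty result means every `instruction` value is unique. To illustrate what `keep=False` does when duplicates exist, the sketch below runs the same call on a made-up three-row frame (`toy` and its values are hypothetical, not TDE data).

```python
import pandas as pd

# Made-up frame with one duplicated instruction (rows 0 and 2).
toy = pd.DataFrame({
    "chat": ["a", "b", "c"],
    "instruction": ["format(): x", "extract(): y", "format(): x"],
})

# keep=False flags every member of a duplicate group, not only the later copies.
dupes = toy[toy.duplicated(subset=["instruction"], keep=False)]
print(dupes.index.tolist())

# A compact yes/no check that could serve as an assertion in a data pipeline.
has_duplicates = bool(toy["instruction"].duplicated().any())
print(has_duplicates)
```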


In [13]:
# show distribution of function
df_all['function'].value_counts()


extract()             57
unit_convert()        48
format()              45
domain_calculate()    34
transform()           25
domain_map()          17
Name: function, dtype: int64

In [14]:
# count the number of transformation-pairs in tuples
df_all['num_tuples'] = df_all['tuples'].apply(len)
print(df_all['num_tuples'].describe())

print("Total number of tuples: ", df_all['num_tuples'].sum())

count    226.000000
mean       5.575221
std        2.232309
min        4.000000
25%        5.000000
50%        5.000000
75%        5.000000
max       25.000000
Name: num_tuples, dtype: float64
Total number of tuples:  1260


In [15]:

# rearrange columns
df_all = df_all[['test_path', 'chat', 'instruction', 'context.input', 'context.output', 'tuples']]

# sort by test_path (reassigning instead of inplace=True avoids a possible SettingWithCopyWarning)
df_all = df_all.sort_values(by='test_path')


In [16]:
# export to CSV
df_all.to_csv('../data/TDE-v2/tde-v2-all.csv', index=False)
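
One thing to keep in mind with the CSV export: the `tuples` column holds Python lists, which `to_csv` writes as their string representation. The sketch below, using a one-row made-up frame and an in-memory buffer instead of the real file, shows one way to turn that string back into a list with `ast.literal_eval` after reading.

```python
import ast
import io

import pandas as pd

# One-row stand-in for df_all; the values follow the preview of row 3.
toy = pd.DataFrame({
    "chat": ["format date of birth"],
    "tuples": [[{"input": "19610223", "output": "02-23-1961"}]],
})

# Write to an in-memory buffer rather than ../data/TDE-v2/tde-v2-all.csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
# After the round trip the column holds strings, not lists.
was_string = isinstance(restored.loc[0, "tuples"], str)

# literal_eval safely parses the Python-literal string back into a list of dicts.
restored["tuples"] = restored["tuples"].apply(ast.literal_eval)
print(restored.loc[0, "tuples"][0]["output"])
```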

## Refactor to Alpaca Format

The fine-tuned LM serves as an AI assistant that translates a user chat into the target instruction (chat-to-instruction), with the help of the transformation pairs stored in the `tuples` field and the context information stored in the `context.input` and `context.output` fields.

In [11]:
df_all.columns

Index(['chat', 'instruction', 'tuples', 'context.input', 'context.output',
       'test_path', 'function'],
      dtype='object')

> The number of training pairs taken from the `tuples` field is a hyperparameter to be tuned.
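
A minimal sketch of treating that count as a parameter rather than a constant. The helper names `format_pairs` and `split_pairs` and the argument `k` are hypothetical; the pair layout mirrors the `Input:` / `Output:` text used in this notebook, and the example pairs are made up.

```python
def format_pairs(pairs):
    """Render a list of {'input', 'output'} dicts as alternating Input/Output lines."""
    return "\n".join(f'Input: {p["input"]}\nOutput: {p["output"]}' for p in pairs)


def split_pairs(tuples, k=3):
    """Return (train_text, test_text): the first k pairs and the remaining pairs."""
    return format_pairs(tuples[:k]), format_pairs(tuples[k:])


# Five made-up pairs, so k=3 leaves two pairs for testing.
example = [{"input": str(i), "output": str(i * 2)} for i in range(1, 6)]
train_text, test_text = split_pairs(example, k=3)

print(train_text)
print("---")
print(test_text)
```

With a helper like this, sweeping `k` over a few values (for example 1 to 4, given that every record has at least 4 pairs) would only require changing one argument.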


In [30]:
# refactor columns based on Alpaca format
df_alpaca = df_all.copy()

df_alpaca['chat_ctx'] = df_alpaca.apply(
    lambda x: x['chat'] + '\n' +
              'Input: ' + x['context.input'] + '\n' + 
              'Output: ' + x['context.output'],
    axis=1)

# keep only the first 3 transformation-pairs for training, in line with the paper's setting;
# the remaining pairs are held out as test pairs
split_index = 3

def format_pairs(pairs):
    # render each pair as an "Input: ..." line followed by an "Output: ..." line
    return '\n'.join(f'Input: {t["input"]}\nOutput: {t["output"]}' for t in pairs)

df_alpaca['train_pairs'] = df_alpaca['tuples'].apply(lambda x: format_pairs(x[:split_index]))

df_alpaca['test_pairs'] = df_alpaca['tuples'].apply(lambda x: format_pairs(x[split_index:]))


# drop the original columns
df_alpaca.drop(columns=['context.input', 'context.output', 'tuples', 'test_path'], inplace=True)

df_alpaca.head()


| | chat | instruction | chat_ctx | train_pairs | test_pairs |
|---|---|---|---|---|---|
| 40 | normalize accented string | format(): normalize string by replacing accent... | `normalize accented string\nInput: string with ...` | `Input: áéíóú\nOutput: aeiou\nInput: aeiou\nOut...` | `Input: aeío\nOutput: aeio\nInput: aeíouíóúz\nO...` |
| 16 | normalize acronyms | format(): map full phrases to their acronyms. | `normalize acronyms\nInput: full phrases with a...` | `Input: association computing machinery\nOutput...` | `Input: special interest group information retr...` |
| 19 | normalize movie titles | transform(): normalize movie titles by removin... | `normalize movie titles\nInput: movie titles wi...` | `Input: Harry Potter 4 aka Harry Potter and the...` | `Input: The Hunger Games 3 aka the hunger games...` |
| 5 | parse the house area from the description | extract(): extract the house area from the des... | `parse the house area from the description\nInp...` | `Input: Mar 18 Beautiful One bedroom Available....` | `Input: Mar 1 Lake Washington, Bellevue $1234 /...` |
| 35 | parse rental price from Craigslist listings | extract(): extract the rental price from each ... | `parse rental price from Craigslist listings\nI...` | `Input: Mar 18 Beautiful One bedroom Available....` | `Input: Mar 1 Lake Washington, Bellevue $1234 /...` |


### Supervised Fine-Tuning Dataset

Given the dataset above, the *dataset description* in LLaMA-Factory's `dataset_info.json` should be one of the following four options:

1. `tde-o1`: chat-only. `prompt` is the user chat without context, `query` is empty, and `response` is the target instruction.

```json
"tde-o1": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "chat",
    "query": "",
    "response": "instruction"
  }
}
```

2. `tde-o2`: chat + transformation-pairs. `prompt` is the user chat, `query` holds the training transformation-pairs (the `train_pairs` column of the exported file), and `response` is the target instruction.

```json
"tde-o2": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "chat",
    "query": "train_pairs",
    "response": "instruction"
  }
}
```

3. `tde-o3`: chat + context. `prompt` is the user chat followed by the context, `query` is empty, and `response` is the target instruction.

```json
"tde-o3": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "chat_ctx",
    "query": "",
    "response": "instruction"
  }
}
```

4. `tde-o4`: chat + context + transformation-pairs. `prompt` is the user chat followed by the context, `query` holds the training transformation-pairs (the `train_pairs` column of the exported file), and `response` is the target instruction.

```json
"tde-o4": {
  "file_name": "tde-alpaca.json",
  "columns": {
    "prompt": "chat_ctx",
    "query": "train_pairs",
    "response": "instruction"
  }
}
```
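
Because `dataset_info.json` has to be strict JSON (no comments, no trailing commas), one option is to generate the four entries programmatically and let `json.dumps` handle the syntax. This is only a sketch: `build_dataset_info` and its `pairs_column` argument are hypothetical names, and the default of `train_pairs` assumes the column written by the refactoring cell in this notebook; adjust it if your file uses a different column name.

```python
import json


def build_dataset_info(file_name="tde-alpaca.json", pairs_column="train_pairs"):
    """Build the four TDE dataset descriptions as a plain dict."""
    # (prompt column, query column) for each variant
    variants = {
        "tde-o1": ("chat", ""),                # chat-only
        "tde-o2": ("chat", pairs_column),      # chat + transformation-pairs
        "tde-o3": ("chat_ctx", ""),            # chat + context
        "tde-o4": ("chat_ctx", pairs_column),  # chat + context + transformation-pairs
    }
    return {
        name: {
            "file_name": file_name,
            "columns": {
                "prompt": prompt,
                "query": query,
                "response": "instruction",
            },
        }
        for name, (prompt, query) in variants.items()
    }


info = build_dataset_info()
# The printed text can be merged into an existing dataset_info.json.
print(json.dumps(info, indent=2))
```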

In [31]:
# export to JSON
df_alpaca.to_json('../data/TDE-v2/tde-alpaca.json', orient='records', indent=4, force_ascii=False)
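
As a quick sanity check on the export settings, the sketch below writes a one-row made-up frame with the same `to_json` arguments to a temporary file and reads it back. The frame contents and the `toy-alpaca.json` file name are illustrative only; the point is that `orient='records'` yields a JSON list of objects and `force_ascii=False` keeps accented characters readable in the file.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up single row; not taken verbatim from the dataset.
toy = pd.DataFrame({
    "chat": ["normalize accented string"],
    "instruction": ["format(): placeholder instruction"],
    "train_pairs": ["Input: áéíóú\nOutput: aeiou"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-alpaca.json")
    # Same arguments as the export cell above.
    toy.to_json(path, orient="records", indent=4, force_ascii=False)

    with open(path, encoding="utf-8") as f:
        raw = f.read()

# orient='records' produces a list with one object per row.
records = json.loads(raw)
print(len(records), sorted(records[0].keys()))
```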