<a href="https://colab.research.google.com/github/wsy258-strar/docta_tech_assessment_data_engineer/blob/main/docta_tech_assessment_data_engineer_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Take Home Test: Reformat a Public Dataset for LLM Training

### Objective

The goal of this task is to prepare public datasets for more effective use in training and fine-tuning Large Language Models (LLMs). You are required to reformat a specific subset of a public dataset into a structured, consistent format that is easier to use.

### Detailed Instructions

#### 1. Dataset Selection and Preparation

- **Dataset:** You are assigned the `Headline` subset of the [AdaptLLM/finance-tasks](https://huggingface.co/datasets/AdaptLLM/finance-tasks) dataset.

- **Task Description:** Each entry in the `input` column contains multiple "Yes" or "No" questions alongside their respective answers. Your task is to:

  - Develop a Python script to parse and separate each question and its answer from the entry.
  - Save each question-answer pair in a structured JSON format as follows:
    ```json
    {
      "id": "<unique_identifier>",
      "Question": "<question_text>",
      "Answer": "<answer_text>"
    }
    ```

  - You are encouraged to introduce additional attributes if needed to preserve the integrity and completeness of the information. Adding relevant tag information is strongly recommended.
- **Automation Requirement:** The task must be completed using Python. Manual editing or data manipulation is strictly prohibited. Your script should efficiently handle variations in data format within the column.
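
As a minimal sketch of the target format, the snippet below builds one record and serializes it. The `make_record` helper and the `headline` and `tags` fields are hypothetical additions, not part of the required schema; they only illustrate how extra attributes might preserve the context that a bare question loses.

```python
import json

def make_record(source_id, position, headline, question, answer):
    """Build one question-answer record in the required JSON shape.

    `headline` and `tags` are hypothetical extra attributes, not mandated by the task.
    """
    return {
        "id": f"{source_id}-{position}",
        "Question": question,
        "Answer": answer,
        "headline": headline,
        "tags": ["finance", "headline", "yes-no"],
    }

record = make_record(
    0, 1,
    "gold futures add to gains after adp data",
    "Does the news headline talk about price?",
    "Yes",
)
print(json.dumps(record, indent=2))
```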

#### 2. Deliverables

- **Reformatted Dataset:** Provide the schema of the final format you adopted for saving the results.
- **Transformation Code:** Submit the complete code used for converting the dataset into the designated format.
- **Statistics:** Report the total number of question-answer pairs extracted from the dataset.
- **Performance Metrics:** Document the time taken to complete the dataset cleanup and transformation process.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install datasets


In [3]:
from datasets import load_dataset

dataset = load_dataset("AdaptLLM/finance-tasks",'Headline')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
dataset = dataset['test']

In [None]:
len(dataset)

20547

In [None]:
dataset['input'][0]

'Headline: "Gold falls to Rs 30,800; silver down at Rs 41,200 per kg" Now answer this question: Does the news headline talk about price in the past? Yes\n\nHeadline: "gold futures add to gains after adp data" Now answer this question: Does the news headline talk about price? Yes\n\nHeadline: "Gold holds on to modest loss after data" Now answer this question: Does the news headline talk about price in the future? No\n\nHeadline: "spot gold quoted at $417.50, down 20c from new york" Now answer this question: Does the news headline talk about a general event (apart from prices) in the past? No\n\nHeadline: "gold hits new record high at $1,036.20 an ounce" Now answer this question: Does the news headline compare gold with any other asset? No\n\nHeadline: "gold may hit rs 31,500, but pullback rally may not sustain for long: experts" Now answer this question: Does the news headline talk about price?'

In [25]:
dataset['input'][20305]

'Given the headline "gold prices gain in asia with chinese new year trade in focus", what is the answer to the question "Does the news headline talk about a general event (apart from prices) in the past?" No\n\nGiven the headline "Gold soars to highest since March 2014 amid flight to safety", what is the answer to the question "Does the news headline talk about a general event (apart from prices) in the past?" No\n\nGiven the headline "gold futures down 10 sessions in a row", what is the answer to the question "Does the news headline compare gold with any other asset?" No\n\nGiven the headline "Gold prices to trade volatile: Angel Commodities", what is the answer to the question "Does the news headline talk about price going up?" No\n\nGiven the headline "gold gains in asia, copper jumps as china pmi surveys mixed", what is the answer to the question "Does the news headline talk about a general event (apart from prices) in the past?" No\n\nGiven the headline "gold futures mark highest 
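
The two sample entries above use different prompt templates. As a rough sketch, each template could be matched with its own regular expression so that the headline is kept alongside the question and answer. The names `TEMPLATES` and `parse_segment` are hypothetical, and the patterns cover only the two templates shown here; the dataset contains other variants that this sketch does not handle.

```python
import re

# Patterns for the two prompt templates visible in the samples.
# Assumption: other templates in the dataset are NOT covered here.
TEMPLATES = [
    re.compile(
        r'^Headline: "(?P<headline>.+)" Now answer this question: '
        r'(?P<question>.+?\?)\s*(?P<answer>Yes|No)?$',
        re.DOTALL,
    ),
    re.compile(
        r'^Given the headline "(?P<headline>.+)", what is the answer to the '
        r'question "(?P<question>.+?\?)"\s*(?P<answer>Yes|No)?$',
        re.DOTALL,
    ),
]

def parse_segment(segment):
    """Return headline, question, and answer for one segment, or None if no template matches."""
    segment = segment.strip()
    for pattern in TEMPLATES:
        match = pattern.match(segment)
        if match:
            fields = match.groupdict()
            # The final question of an entry has no answer in the text
            fields["answer"] = fields["answer"] or ""
            return fields
    return None

sample = ('Headline: "gold futures add to gains after adp data" '
          'Now answer this question: Does the news headline talk about price? Yes')
print(parse_segment(sample))
```

Returning `None` for unmatched segments makes it possible to count and inspect the inputs that need an additional pattern, rather than silently emitting a malformed record.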

In [45]:
import time
import json
import uuid
import re

# # Test input
# dataset = [
#     {
#         "gold_index": 1,
#         "class_id": 0,
#         "input": 'Headline: "Gold falls to Rs 30,800; silver down at Rs 41,200 per kg" Now answer this question: Does the news headline talk about price in the past? Yes\n\nHeadline: "gold futures add to gains after adp data" Now answer this question: Does the news headline talk about price? Yes\n\nHeadline: "Gold holds on to modest loss after data" Now answer this question: Does the news headline talk about price in the future? No\n\nHeadline: "spot gold quoted at $417.50, down 20c from new york" Now answer this question: Does the news headline talk about a general event (apart from prices) in the past? No\n\nHeadline: "gold hits new record high at $1,036.20 an ounce" Now answer this question: Does the news headline compare gold with any other asset? No\n\nHeadline: "gold may hit rs 31,500, but pullback rally may not sustain for long: experts" Now answer this question: Does the news headline talk about price?',
#         "id": 0,
#         "options": ["No", "Yes"]
#     }
# ]
def clean_answer(answer):
    match = re.search(r'\b(?:Yes|No)\b', answer, re.IGNORECASE)

    return match.group(0) if match else ''

# Parsing function
def parse_questions_answers(record):
    """Split one dataset record into question-answer dicts.

    Returns the list of dicts and the number of pairs found.
    """
    structured_data = []

    # Prompt decorations that some templates append to the question
    noise = [
        "No or Yes?", "Yes or No?",
        "Options: - Yes - No", "Options: - No - Yes",
        "Options:\n- Yes\n- No", "Options:\n- No\n- Yes",
    ]
    # "\nAnswer: " and "Answer:" are left in place; clean_answer() ignores them

    # Question-answer pairs are separated by a blank line
    segments = record['input'].split('\n\n')
    for count, segment in enumerate(segments, start=1):
        for phrase in noise:
            segment = segment.replace(phrase, "")
        segment = segment.strip()

        # Fallback: everything before the first '?' is the question
        question = segment.split('?')[0]
        par = segment.split('Does')
        if len(par) > 1:
            question = "Does " + par[1].strip().split('?')[0] + '?'

        # The answer follows the last '?'; it stays an empty string when absent
        answer = ''
        parts = segment.split('?')
        if len(parts) > 1:
            answer = clean_answer(parts[-1].strip())

        # Unique identifier: source record id plus position within the record
        # unique_id = str(uuid.uuid4())
        unique_id = str(record['id']) + "-" + str(count)

        structured_data.append({
            "id": unique_id,
            "Question": question,
            "Answer": answer
        })
    return structured_data, len(structured_data)

# Record the start time
start_time = time.time()

# Run the parser over every record
structured_dataset=[]
total_count = 0  # running total of question-answer pairs
for i in range(len(dataset)):
    structured_data, count = parse_questions_answers(dataset[i])
    structured_dataset += structured_data
    total_count += count
# structured_dataset += parse_questions_answers(dataset[20305])
# print(structured_dataset)
end_time = time.time()

# Save the structured data as a JSON file
with open('structured_questions_answers.json', 'w') as outfile:
    json.dump(structured_dataset, outfile, indent=2)

# Verify: print only the first few records, because dumping the whole list
# exceeds the notebook's IOPub data rate limit
print(json.dumps(structured_dataset[:5], indent=2))



In [46]:
# Print the total number of question-answer pairs
print(f"Total number of question-answer pairs extracted: {total_count}")

Total number of question-answer pairs extracted: 123286
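
This total counts every segment the parser emits, including the final question of each entry, which carries no answer in the `input` text (see the first sample). A breakdown by answer value makes that visible. The sketch below uses a hypothetical `answer_breakdown` helper and made-up sample records for illustration; it is not the real output.

```python
from collections import Counter

def answer_breakdown(records):
    """Count records per answer value; empty answers are reported as '<missing>'."""
    return Counter(record["Answer"] or "<missing>" for record in records)

# Hypothetical records for illustration only, not taken from the real output.
sample_records = [
    {"id": "0-1", "Question": "Does the news headline talk about price in the past?", "Answer": "Yes"},
    {"id": "0-2", "Question": "Does the news headline talk about price in the future?", "Answer": "No"},
    {"id": "0-3", "Question": "Does the news headline talk about price?", "Answer": ""},
]
print(answer_breakdown(sample_records))
```

Running the same function on `structured_dataset` would show how many of the extracted pairs actually have a "Yes" or "No" answer.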


In [47]:
# Print the time taken to clean and transform the dataset
print(f"Time taken to clean and transform the dataset: {end_time - start_time} seconds")

Time taken to clean and transform the dataset: 3.7689807415008545 seconds
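
`time.time()` reads the wall clock, which can shift if the system clock is adjusted during the run. For measuring a short duration, `time.perf_counter()` is the usual choice. A minimal sketch follows, using a hypothetical `timed` helper and a stand-in workload instead of the real parsing loop.

```python
import time

def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace with the real parsing loop.
result, elapsed = timed(sum, range(1_000_000))
print(f"Elapsed: {elapsed:.3f} seconds")
```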
