
## Take Home Test: Reformat a Public Dataset for LLM Training

### Objective

The goal of this task is to prepare public datasets for more effective use in training and fine-tuning Large Language Models (LLMs). You are required to reformat a specific subset of a public dataset into a structured, consistent format to facilitate its usability.

### Detailed Instructions

#### 1. Dataset Selection and Preparation

- **Dataset:** You are assigned the `Headline` subset of the [AdaptLLM/finance-tasks](https://huggingface.co/datasets/AdaptLLM/finance-tasks) dataset.

- **Task Description:** Each entry in the `input` column contains multiple "Yes" or "No" questions alongside their respective answers. Your task is to:

  - Develop a Python script to parse and separate each question and its answer from the entry.
  - Save each question-answer pair in a structured JSON format as follows:
    ```json
    {
      "id": "<unique_identifier>",
      "Question": "<question_text>",
      "Answer": "<answer_text>"
    }
    ```

  - You are encouraged to introduce additional attributes if needed to preserve the integrity and completeness of the information. Adding relevant tag information is strongly recommended.
- **Automation Requirement:** The task must be completed using Python. Manual editing or data manipulation is strictly prohibited. Your script should efficiently handle variations in data format within the column.

#### 2. Deliverables

- **Reformatted Dataset:** Provide the schema of the final format you adopted for saving the results.
- **Transformation Code:** Submit the complete code used for converting the dataset into the designated format.
- **Statistics:** Report the total number of question-answer pairs extracted from the dataset.
- **Performance Metrics:** Document the time taken to complete the dataset cleanup and transformation process.


In [19]:
%pip install datasets pandas




In [20]:
from datasets import load_dataset
import pandas as pd

# Load the dataset, selecting the 'test' split of the 'Headline' subset
dataset = load_dataset("AdaptLLM/finance-tasks", name='Headline', split='test')

# Convert the dataset to a pandas DataFrame for easier processing
df = pd.DataFrame(dataset)
print(df.head())


                                               input  class_id    options  id  \
0  Headline: "Gold falls to Rs 30,800; silver dow...         0  [No, Yes]   0   
1  Headline: february gold rallies to intraday hi...         7  [No, Yes]   1   
2  Please answer a question about the following h...         5  [No, Yes]   2   
3  Read this headline: "gold closes lower as doll...         3  [No, Yes]   3   
4  gold adds $42, or 2.4%, to trade at $1,833.30/...         1  [No, Yes]   4   

   gold_index  
0           1  
1           0  
2           0  
3           1  
4           0  
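
The preview shows an `options` list and a `gold_index` column for every row. A minimal sketch of how the labelled answer could be recovered, assuming `gold_index` is the position of the correct label inside `options` (the helper name `gold_answer` is hypothetical, and the two rows are copied from the preview above):

```python
# Two rows copied from the DataFrame preview (only the columns needed here).
rows = [
    {"id": 0, "class_id": 0, "options": ["No", "Yes"], "gold_index": 1},
    {"id": 1, "class_id": 7, "options": ["No", "Yes"], "gold_index": 0},
]

def gold_answer(row):
    """Return the label that gold_index points to (assumed to index options)."""
    return row["options"][row["gold_index"]]

answers = [gold_answer(row) for row in rows]
print(answers)  # ['Yes', 'No']
```

Under that assumption, the answer to the last question in each `input` comes from this lookup rather than from the prompt text.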


In [21]:
import json

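def process_row(row):
    """
    Build question-answer pairs for a single row.

    Note: the text in `input` is read but not parsed; one templated
    question is generated per entry in `options`.
    """
    headline = row['input']  # full prompt text for this row (currently unused)
    options = row['options']  # already a list, so use it directly
    answers = ['Yes', 'No']  # assumes the options are "Yes"/"No"; adjust if the data differs

    # Make sure the number of options matches the number of answers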
    if len(options) != len(answers):
        raise ValueError(f"Options and answers length mismatch: {len(options)} options, {len(answers)} answers")

    data = []
    for option, answer in zip(options, answers):
        data.append({
            "Question": f"Does the headline relate to '{option}'?",
            "Answer": answer
        })
    return data

# Collect the reformatted records
reformatted_data = []
for idx, row in df.iterrows():
    question_answer_pairs = process_row(row)
    for i, pair in enumerate(question_answer_pairs):
        reformatted_data.append({
            "id": f"{idx}_{i}",
            "Question": pair["Question"],
            "Answer": pair["Answer"]
        })

# Save as a JSON file
with open('/content/reformatted_data.json', 'w') as f:
    json.dump(reformatted_data, f, indent=4)

# Print the total count
print(f"Total question-answer pairs: {len(reformatted_data)}")


Total question-answer pairs: 41094
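
The cell above builds its questions from `options` rather than from the text in `input`, so the count equals two pairs per row. A rough sketch of how the questions and answers might instead be split out of the prompt text is shown below. It rests on several assumptions that should be checked against the real data: that the examples inside one `input` are separated by blank lines, that each example ends with a question followed by `Yes` or `No`, and that the last question carries no answer in the text. The sample string, the `QA_BLOCK` pattern, the `split_qa_pairs` helper, and the `Context` field are all made up for illustration.

```python
import re

# Hypothetical pattern: everything up to the last sentence ending in "?"
# is context; an optional Yes/No label may follow the question.
QA_BLOCK = re.compile(
    r'^(?P<context>.*?)(?P<question>[^.:"!?]*\?)\s*(?P<answer>Yes|No)?\s*$',
    re.DOTALL,
)

def split_qa_pairs(text, final_answer=None):
    """Split one prompt into question-answer records.

    Blocks are assumed to be separated by blank lines. A block without an
    inline answer receives `final_answer` (for example the gold label).
    """
    pairs = []
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]
    for block in blocks:
        match = QA_BLOCK.match(block)
        if match is None:
            continue  # block does not look like a question; skip it
        pairs.append({
            "Context": match.group("context").strip(),
            "Question": match.group("question").strip(),
            "Answer": match.group("answer") or final_answer,
        })
    return pairs

# Invented sample text, not taken from the dataset.
sample = (
    'Headline: "Gold falls to Rs 30,800; silver down" '
    'Now answer this question: Does the headline talk about price? Yes\n\n'
    'Headline: "gold futures edge up" '
    'Does the news headline talk about price going down? No\n\n'
    'Headline: "gold prices slip on profit booking" '
    'Does the headline compare gold with any other asset?'
)

pairs = split_qa_pairs(sample, final_answer="Yes")
for pair in pairs:
    print(pair["Question"], "->", pair["Answer"])
```

Keeping the surrounding headline in a separate `Context` field is one way to add an extra attribute without losing information from the original entry. Questions that contain a period or colon would be truncated by this pattern, so the real prompts may need a more tolerant rule.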


In [22]:
import time

start_time = time.time()

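# Run the data-processing code here (nothing runs in this cell, so the
# reported time covers only the two timer calls)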

end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")


Time taken: 4.982948303222656e-05 seconds
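
The time reported above is in the tens of microseconds because no processing runs between the two timer calls. One way to measure the transformation itself is to wrap it in a function and time the call. The sketch below uses `time.perf_counter`, which is intended for measuring short durations; `timed` and `build_pairs` are hypothetical names, and `build_pairs` is only a stand-in for the real transformation loop.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def build_pairs(rows):
    """Stand-in for the real transformation: one record per (question, answer)."""
    return [
        {"id": f"{i}_0", "Question": question, "Answer": answer}
        for i, (question, answer) in enumerate(rows)
    ]

# Invented workload, only to make the example runnable.
rows = [("Does the headline talk about price?", "Yes")] * 1000

result, elapsed = timed(build_pairs, rows)
print(f"{len(result)} pairs in {elapsed:.6f} seconds")
```

In the notebook, the loop that fills `reformatted_data` and the JSON dump would take the place of `build_pairs`.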
