## Implementing DSFT Technique

From this paper: https://arxiv.org/pdf/2412.18925

We synthesize 20K SFT data points $D_{SFT} = \{(x, \hat{e}, \hat{y})\}$ from the
verifiable problem set $D = \{(x, y^*)\}$ using **GPT-4o**. $D_{SFT}$ is used to fine-tune LLMs to generate a
complex CoT $\hat{e}$ followed by a formal response $\hat{y}$. This fine-tuning process teaches the model to think
before answering, encouraging a Stream-of-Search (SoS) [23] way where the model deeply explores
and refines its reasoning before answering.
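To make the notation concrete, here is a minimal sketch of how one $(x, \hat{e}, \hat{y})$ record could be held and flattened into a single training target. The class name `DSFTExample`, the method `to_training_text`, and the section markers are assumptions made for illustration; the quoted passage does not specify a serialization format.

```python
from dataclasses import dataclass


@dataclass
class DSFTExample:
    """One (x, e_hat, y_hat) tuple; the class and field names are hypothetical."""
    x: str      # the verifiable question
    e_hat: str  # the complex chain of thought
    y_hat: str  # the formal response given after thinking

    def to_training_text(self) -> str:
        # Assumed layout: the reasoning comes first, then the answer.
        return f"## Thinking\n\n{self.e_hat}\n\n## Final Response\n\n{self.y_hat}"


example = DSFTExample(
    x="What is the most likely cause of her gastrointestinal blood loss?",
    e_hat="Naproxen is an NSAID; NSAIDs can damage the gastric mucosa and cause bleeding ulcers.",
    y_hat="Gastric ulcer",
)
print(example.to_training_text())
```

The point of the layout is only that the model is trained to emit its reasoning before its answer; any delimiter that separates the two parts would serve.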

### Imports

In [1]:
import pandas as pd
import numpy as np
import openai
from tqdm import tqdm
import json
import re
import random
import csv
import os

In [28]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Testing on One Case

In [2]:
df = pd.read_json("hf://datasets/FreedomIntelligence/medical-o1-verifiable-problem/medical_o1_verifiable_problem.json")
df.head()

| | Open-ended Verifiable Question | Ground-True Answer |
|---|---|---|
| 0 | An 88-year-old woman with osteoarthritis is ex... | Gastric ulcer |
| 1 | In the context of disseminated intravascular c... | Fibrin degradation products |
| 2 | In a 3-year-old boy with severe diarrhea, vomi... | Double-stranded, icosahedral, non-enveloped |
| 3 | Based on the chest radiograph and abdominal CT... | Hydatid Cyst |
| 4 | What is one potential side effect that is not ... | Anaphylaxis |


In [3]:
len(df)

40644

In [4]:
df["Open-ended Verifiable Question"][0]

'An 88-year-old woman with osteoarthritis is experiencing mild epigastric discomfort and has vomited material resembling coffee grounds multiple times. Considering her use of naproxen, what is the most likely cause of her gastrointestinal blood loss?'

### Loading Model

In [5]:
def clean_json_output(raw_text):
    return raw_text.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
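A quick offline check of what this helper does: it strips an optional Markdown code fence so the remainder can be passed to `json.loads`. The sketch below repeats the same logic, with the fence built from a constant (`FENCE`, a name introduced here) purely so that the example itself contains no literal fence. Note that `removeprefix` and `removesuffix` require Python 3.9 or later.

```python
import json

FENCE = "`" * 3  # three backticks, assembled indirectly for readability


def clean_json_output(raw_text):
    # Same steps as the notebook helper: trim, drop an opening fence, drop a closing fence.
    return (
        raw_text.strip()
        .removeprefix(FENCE + "json")
        .removeprefix(FENCE)
        .removesuffix(FENCE)
        .strip()
    )


# A made-up model reply, once wrapped in a fence and once bare.
fenced = FENCE + 'json\n{"Final Conclusion": "Gastric ulcer"}\n' + FENCE
bare = '{"Final Conclusion": "Gastric ulcer"}'

parsed_fenced = json.loads(clean_json_output(fenced))
parsed_bare = json.loads(clean_json_output(bare))
print(parsed_fenced == parsed_bare)
```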

In [6]:
x = df.loc[0, "Open-ended Verifiable Question"]
y_star = df.loc[0, "Ground-True Answer"]

# Read the key from the environment; never hard-code API keys in a notebook.
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

In [7]:
print('x:', x)
print('y_star:', y_star)

x: An 88-year-old woman with osteoarthritis is experiencing mild epigastric discomfort and has vomited material resembling coffee grounds multiple times. Considering her use of naproxen, what is the most likely cause of her gastrointestinal blood loss?
y_star: Gastric ulcer


#### Using prompts from Appendix D + E of paper mentioned above
Initial question

In [8]:
init_prompt = f"""<question>
{x}
</question>
Please respond to the above question <question> using the Chain of Thought (CoT) reasoning method.
Your response should consist of multiple steps, each of which includes three types of actions: **"Inner
Thinking"**, **"Final Conclusion"**, and **"Verification"**:
- **’Inner Thinking’**: This is the step where thinking is done. Note that multiple ’Inner Thinking’ steps are
required to describe thorough reasoning. Each step should first generate a brief title.
- **’Final Conclusion’**: At this stage, you summarize the correct reasoning from previous ’Inner Thinking’
steps and provide the final answer. No title is required here.
- **’Verification’**: At this stage, you verify the conclusion from the "Final Conclusion" step. If the
conclusion holds, end the process. If not, return to "Inner Thinking" for further reasoning. No title is required
here.
The output format must strictly follow the JSON structure
"""

In [9]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": init_prompt}],
    temperature=0.7,
)

content = response.choices[0].message.content
cleaned = clean_json_output(content)

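try:
    result = json.loads(cleaned)
except json.JSONDecodeError:
    # Fall back to parsing a Python literal (e.g. a single-quoted dict).
    # Unlike eval, ast.literal_eval cannot execute model-generated code.
    import ast
    result = ast.literal_eval(cleaned)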

result

{'Inner Thinking': [{'title': 'Understanding the Symptoms',
   'content': 'The patient is experiencing epigastric discomfort and has vomited material resembling coffee grounds. These symptoms are indicative of an upper gastrointestinal bleed. Coffee ground emesis suggests that there is bleeding in the stomach or proximal duodenum where the blood has been partially digested.'},
  {'title': "Considering Patient's Medication",
   'content': 'The patient has been using naproxen, which is a nonsteroidal anti-inflammatory drug (NSAID). NSAIDs are known to cause gastric mucosal damage which can lead to ulcers and gastrointestinal bleeding.'},
  {'title': 'Linking NSAID Use to GI Bleeding',
   'content': 'Naproxen, like other NSAIDs, inhibits the production of prostaglandins that protect the gastric mucosa. This inhibition can lead to the development of gastric ulcers, which can bleed and cause the symptoms observed in this patient.'}],
 'Final Conclusion': "The most likely cause of this patie

In [12]:
#print("Inner Thinking:", result['steps'][0])
#print("Generated Answer:", result['steps'])
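The parsed output shown above is keyed by `'Inner Thinking'` (a list of title/content steps) and `'Final Conclusion'`, not by `'steps'`. A small sketch for printing such a dict in readable form follows; `render_cot` is a hypothetical helper, the sample dict is abbreviated and made up, and since the prompt does not pin down a schema, real replies may be shaped differently.

```python
def render_cot(result):
    """Flatten a parsed CoT dict into numbered, human-readable lines."""
    lines = []
    for i, step in enumerate(result.get("Inner Thinking", []), start=1):
        lines.append(f"Step {i}: {step['title']}")
        lines.append(step["content"])
    lines.append("Final Conclusion: " + result.get("Final Conclusion", ""))
    return "\n".join(lines)


# Abbreviated, illustrative data in the same shape as the output above.
sample = {
    "Inner Thinking": [
        {"title": "Understanding the Symptoms",
         "content": "Coffee-ground emesis suggests an upper gastrointestinal bleed."},
        {"title": "Considering Patient's Medication",
         "content": "Naproxen is an NSAID that can damage the gastric mucosa."},
    ],
    "Final Conclusion": "Gastric ulcer",
}
print(render_cot(sample))
```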

Verification

In [13]:
verif_prompt = f"""<Model Response>
{content}
</Model Response>
<Reference Answer>
{y_star}
</Reference Answer>
You are provided with a model-generated response (<Model Response>) and a reference answer (<Reference
Answer>). Compare the model response with the reference answer and determine its correctness. Your task
is to simply output "True" if the response is correct, and "False" otherwise"""

In [14]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": verif_prompt}],
    temperature=0.7,
)

content = response.choices[0].message.content
cleaned = clean_json_output(content)

try:
    result = json.loads(cleaned)
except json.JSONDecodeError:
    # "True"/"False" are Python literals, not JSON; parse them without eval.
    import ast
    result = ast.literal_eval(cleaned)

result

True
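The verifier is asked for a bare "True" or "False", but a chat model may add quotes, a trailing period, or extra words. A more defensive way to read the verdict could look like the sketch below; `parse_verdict` is a hypothetical helper, not part of the paper or of any library, and returning `None` for ambiguous replies is a design choice made here so the caller can decide whether to retry.

```python
def parse_verdict(text):
    """Map a verifier reply to True/False, or None when it is ambiguous."""
    # Trim whitespace, then surrounding quotes, backticks, and periods.
    cleaned = text.strip().strip("\"'`.").lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None


for reply in ["True", ' "False". ', "The response is correct."]:
    print(repr(reply), "->", parse_verdict(reply))
```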

### Defining Prompt Templates

In [15]:
init_prompt = """<question>
{x}
</question>
Please respond to the above question <question> using the Chain of Thought (CoT) reasoning method.
Your response should consist of multiple steps, each of which includes three types of actions: **"Inner
Thinking"**, **"Final Conclusion"**, and **"Verification"**:
- **’Inner Thinking’**: This is the step where thinking is done. Note that multiple ’Inner Thinking’ steps are
required to describe thorough reasoning. Each step should first generate a brief title.
- **’Final Conclusion’**: At this stage, you summarize the correct reasoning from previous ’Inner Thinking’
steps and provide the final answer. No title is required here.
- **’Verification’**: At this stage, you verify the conclusion from the "Final Conclusion" step. If the
conclusion holds, end the process. If not, return to "Inner Thinking" for further reasoning. No title is required
here.
The output format must strictly follow the JSON structure
"""

In [16]:
verif_prompt = """<Model Response>
{response}
</Model Response>
<Reference Answer>
{y_star}
</Reference Answer>
You are provided with a model-generated response (<Model Response>) and a reference answer (<Reference
Answer>). Compare the model response with the reference answer and determine its correctness. Your task
is to simply output "True" if the response is correct, and "False" otherwise"""

Exploring New Paths

In [17]:
exploring_prompt = """<question>
{x}
</question>
<previous reasoning>
{response}
</previous reasoning>
<response requirements>
Your response must include the following steps, each composed of three types of actions: **"Inner
Thinking"**, **"Final Conclusion"**, and **"Verification"**:
1. **Inner Thinking**: Break down the reasoning process into multiple concise steps. Each step should start
with a brief title to clarify its purpose.
2. **Final Conclusion**: Summarize the correct reasoning from all previous ’Inner Thinking’ steps and
provide the final answer. No title is needed for this section.
3. **Verification**: Verify the accuracy of the "Final Conclusion". If it holds, conclude the process.
Otherwise, return to "Inner Thinking" for further refinement.
</response requirements>
<question> represents the question to be answered, and <previous reasoning> contains your prior reasoning.
Your task is to continue from the current ’Verification’ step. I have manually reviewed the reasoning and
determined that the **Final Conclusion** is false. Your ’Verification’ results must align with mine. Proceed
to refine the reasoning by exploring new approaches to solving this problem and construct a new Final
Conclusion.
"""

Backtracking

In [18]:
backtracking_prompt = """<question>
{x}
</question>
<previous reasoning>
{response}
</previous reasoning>
<response requirements>
Your response must include the following steps, each composed of three types of actions: **"Inner
Thinking"**, **"Final Conclusion"**, and **"Verification"**:
1. **Inner Thinking**: Break down the reasoning process into multiple concise steps. Each step should start
with a brief title to clarify its purpose.
2. **Final Conclusion**: Summarize the correct reasoning from all previous ’Inner Thinking’ steps and
provide the final answer. No title is needed for this section.
3. **Verification**: Verify the accuracy of the "Final Conclusion". If it holds, conclude the process.
Otherwise, return to "Inner Thinking" for further refinement.
</response requirements>
<question> represents the question to be answered, and <previous reasoning> contains your prior reasoning.
Your task is to continue from the current ’Verification’ step. I have manually reviewed the reasoning and
determined that the **Final Conclusion** is false. Your ’Verification’ results must align with mine. Proceed
to refine the reasoning using **backtracking** to revisit earlier points of reasoning and construct a new Final
Conclusion.
"""

Verifications

In [19]:
verif_search_prompt = """<question>
{x}
</question>
<previous reasoning>
{response}
</previous reasoning>
<response requirements>
Your response must include the following steps, each composed of three types of actions: **"Inner
Thinking"**, **"Final Conclusion"**, and **"Verification"**:
1. **Inner Thinking**: Break down the reasoning process into multiple concise steps. Each step should start
with a brief title to clarify its purpose.
2. **Final Conclusion**: Summarize the correct reasoning from all previous ’Inner Thinking’ steps and
provide the final answer. No title is needed for this section.
3. **Verification**: Verify the accuracy of the "Final Conclusion". If it holds, conclude the process.
Otherwise, return to "Inner Thinking" for further refinement.
</response requirements>
<question> represents the question to be answered, and <previous reasoning> contains your prior reasoning.
Your task is to continue from the current ’Verification’ step. I have manually reviewed the reasoning and
determined that the **Final Conclusion** is false. Your ’Verification’ results must align with mine. Proceed
to refine the reasoning by conducting a thorough **validation** process to ensure validity and construct a
new Final Conclusion.
"""

Corrections

In [20]:
corr_prompt = """<question>
{x}
</question>
<previous reasoning>
{response}
</previous reasoning>
<response requirements>
Your response must include the following steps, each composed of three types of actions: **"Inner
Thinking"**, **"Final Conclusion"**, and **"Verification"**:
1. **Inner Thinking**: Break down the reasoning process into multiple concise steps. Each step should start
with a brief title to clarify its purpose.
2. **Final Conclusion**: Summarize the correct reasoning from all previous ’Inner Thinking’ steps and
provide the final answer. No title is needed for this section.
3. **Verification**: Verify the accuracy of the "Final Conclusion". If it holds, conclude the process.
Otherwise, return to "Inner Thinking" for further refinement.
</response requirements>
<question> represents the question to be answered, and <previous reasoning> contains your prior reasoning.
Your task is to continue from the current ’Verification’ step. I have manually reviewed the reasoning and
determined that the **Final Conclusion** is false. Your ’Verification’ results must align with mine. Proceed
to refine the reasoning by making precise **corrections** to address prior flaws and construct a new Final
Conclusion.
"""

### Function Form
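The four search templates above are identical except for the clause in their final sentence that names the strategy. One way to avoid maintaining four near-copies is to keep that clause in a lookup table and fill it into a shared sentence. This is a sketch only: `SEARCH_INSTRUCTIONS`, `SEARCH_TAIL`, and `build_search_tail` are names introduced here, and the rest of the shared template is left out for brevity.

```python
# Strategy name -> the clause that distinguishes each search prompt.
SEARCH_INSTRUCTIONS = {
    "exploring": "by exploring new approaches to solving this problem",
    "backtracking": "using **backtracking** to revisit earlier points of reasoning",
    "verification": "by conducting a thorough **validation** process to ensure validity",
    "correction": "by making precise **corrections** to address prior flaws",
}

# The final sentence shared by all four prompts, with a slot for the clause.
SEARCH_TAIL = (
    "Proceed to refine the reasoning {instruction} "
    "and construct a new Final Conclusion."
)


def build_search_tail(strategy):
    """Return the closing sentence for the given search strategy."""
    return SEARCH_TAIL.format(instruction=SEARCH_INSTRUCTIONS[strategy])


for name in SEARCH_INSTRUCTIONS:
    print(f"{name}: {build_search_tail(name)}")
```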

In [21]:
def clean_json_output(raw_text):
    return raw_text.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()

In [22]:
def run_gpt4o(prompt):

  response = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user", "content": prompt}],
      temperature=0.7,
  )

  return response
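A loop over hundreds of questions will occasionally hit a transient API failure, and a single uncaught exception would stop the whole run. One common mitigation is to wrap each call in a retry with exponential backoff. The sketch below is generic: `call_with_retry` is a hypothetical helper, it catches every `Exception` only for brevity (real code would likely narrow this to the client's rate-limit and connection errors), and the `sleep` parameter exists so the example can run without actually waiting.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []  # record the requested delays instead of sleeping
outcome = call_with_retry(flaky, sleep=delays.append)
print(outcome, delays)
```

In the notebook this could be used as `call_with_retry(lambda: run_gpt4o(prompt))`.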

In [23]:
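def clean_result(response):
  """Parse the model reply as JSON, falling back to a Python literal such as True/False."""
  import ast

  content = response.choices[0].message.content
  cleaned = clean_json_output(content)

  try:
      result = json.loads(cleaned)
  except json.JSONDecodeError:
      # literal_eval only parses literals; unlike eval it cannot run model-generated code.
      result = ast.literal_eval(cleaned)

  return result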

In [25]:
def dsft_approach(x, y_star):
  """Generate a CoT for x, then try up to 3 search strategies until the verifier accepts it."""

  prompts = []

  prompt = init_prompt.format(x=x)
  prompts.append(prompt)
  # Keep only the generated text; formatting the whole ChatCompletion object
  # into the next prompt would inject its repr rather than the reasoning.
  response = run_gpt4o(prompt).choices[0].message.content

  v_prompt = verif_prompt.format(response=response, y_star=y_star)
  clean_res = clean_result(run_gpt4o(v_prompt))

  if clean_res is False:

    # Pick 3 of the 4 search strategies in random order.
    searching_prompts = random.sample(
        [exploring_prompt, backtracking_prompt, verif_search_prompt, corr_prompt], 3)

    for search_prompt in searching_prompts:

      prompt = search_prompt.format(x=x, response=response)
      prompts.append(prompt)
      response = run_gpt4o(prompt).choices[0].message.content

      v_prompt = verif_prompt.format(response=response, y_star=y_star)
      clean_res = clean_result(run_gpt4o(v_prompt))

      # Stop as soon as the verifier accepts the new conclusion.
      if clean_res is True:
        break

  return response, prompts, clean_res
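Because every call here costs API credits, it can help to exercise the control flow offline first. The sketch below reproduces the same search pattern (one initial attempt, then up to three randomly chosen strategies until the verifier accepts) with stub functions in place of GPT-4o. All names are hypothetical and the stubs are deliberately trivial.

```python
import random


def search_until_verified(generate, verify, strategies, max_tries=3, rng=random):
    """Initial attempt, then up to max_tries distinct strategies in random order."""
    answer = generate("init", None)
    trace = ["init"]
    if verify(answer):
        return answer, trace, True
    for strategy in rng.sample(strategies, max_tries):
        answer = generate(strategy, answer)
        trace.append(strategy)
        if verify(answer):
            return answer, trace, True
    return answer, trace, False


# Stubs: only the "backtracking" strategy produces the reference answer.
def fake_generate(strategy, previous):
    return "Gastric ulcer" if strategy == "backtracking" else "Duodenal ulcer"


def fake_verify(answer):
    return answer == "Gastric ulcer"


strategies = ["exploring", "backtracking", "verification", "correction"]
answer, trace, ok = search_until_verified(
    fake_generate, fake_verify, strategies, rng=random.Random(0)
)
print(trace, ok)
```

Since only three of the four strategies are sampled, the stubbed run fails whenever "backtracking" is the one left out, which mirrors the fact that the real pipeline can also end without a verified answer.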

In [30]:
results = []

csv_path = '/content/drive/MyDrive/dsft_gpt4o_results.csv'
fieldnames = ['Answer', 'Prompts', 'post_ft_results']

if not os.path.exists(csv_path):
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()

for i in range(300):

  print(i)

  x = df.loc[i, "Open-ended Verifiable Question"]
  y_star = df.loc[i, "Ground-True Answer"]

  answer, prompts, res = dsft_approach(x, y_star)

  with open(csv_path, 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([answer, prompts, res])

  results.append({
        "Answer": answer,
        "Prompts": prompts,
        "post_ft_results": res
    })

0
1
2
...
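Since rows are appended to the CSV one at a time, an interrupted run can in principle be resumed from the number of rows already on disk rather than restarted from index 0. A minimal sketch of that bookkeeping follows; `rows_already_written` is a hypothetical helper, and the demo writes to a temporary file rather than to the Drive path used above.

```python
import csv
import os
import tempfile


def rows_already_written(csv_path):
    """Count data rows (excluding the header) in an existing results file."""
    if not os.path.exists(csv_path):
        return 0
    with open(csv_path, newline="") as f:
        # csv.reader counts records, so quoted multi-line fields stay one row.
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)


# Demo on a throwaway file: a header plus two rows, one with an embedded newline.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.csv")
with open(demo_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Answer", "Prompts", "post_ft_results"])
    writer.writerow(["Gastric ulcer", "['p0']", True])
    writer.writerow(["line one\nline two", "['p0', 'p1']", False])

start = rows_already_written(demo_path)
print(start)
```

The main loop could then iterate over `range(start, 300)` so that completed questions are not sent to the API again.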

In [None]:
# results_df = pd.DataFrame(results)

# results_df['num_prompts'] = results_df['Prompts'].apply(len)
# results_df["baseline_results"] = (results_df['num_prompts'] == 1).astype(int)
# results_df['post_ft_results'] = results_df['post_ft_results'].astype(int)

# results_df.head(10)
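The commented-out analysis treats a question as solved at baseline when only one prompt was needed, and compares that with the final verifier result. The same two rates can be computed with the standard library alone, as in the sketch below. `summarize` is a hypothetical helper and the `toy` records are made up to show the shape of `results`; they are not output from the run above.

```python
def summarize(results):
    """Share answered correctly on the first try vs. after search."""
    n = len(results)
    if n == 0:
        return {"n": 0, "first_try_rate": 0.0, "verified_rate": 0.0}
    # One prompt means the initial answer passed verification without any search.
    first_try = sum(1 for r in results if len(r["Prompts"]) == 1)
    verified = sum(1 for r in results if r["post_ft_results"] is True)
    return {"n": n, "first_try_rate": first_try / n, "verified_rate": verified / n}


# Illustrative records only.
toy = [
    {"Prompts": ["p0"], "post_ft_results": True},
    {"Prompts": ["p0", "p1"], "post_ft_results": True},
    {"Prompts": ["p0", "p1", "p2", "p3"], "post_ft_results": False},
    {"Prompts": ["p0"], "post_ft_results": True},
]
stats = summarize(toy)
print(stats)
```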