In [2]:
from importlib import reload
from analysis_helpers import get_train_example, get_eval_example, get_training_metrics

### Real vs synthetic training data for reasoning
- `medqa-nejmcr-1k-random` uses the real, clinician-written reasoning trace from each NEJM case report
- `medqa-nejmcr-1k-random-cot-extract` uses a synthetically generated reasoning trace, produced from only the context and diagnosis in the NEJM case report

In [3]:
get_train_example("medqa-nejmcr-1k-random", 0, print_format=True)

QUESTION:
A 1-day-old female infant was admitted to the neonatal intensive care unit of this hospital because of intermittent apnea and cyanosis. The day before admission, the patient was born at another hospital at 41 weeks 4 days' gestation to a 33-year-old primigravid mother. After 44 hours of labor, during which the amniotic membranes were artificially ruptured (17 hours before delivery) and the mother's temperature rose to 38.0°C, with a white-cell count of 26,700 per cubic millimeter, a cesarean section was performed because of failure of labor to progress. The infant weighed 4176 g (90th percentile) at birth, and the Apgar scores were 8 at 1 minute and 9 at 5 minutes.
At the other hospital, the infant was admitted to the newborn nursery, where she received care, including ophthalmic erythromycin. At the parents' request, vitamin K was not given. Results of laboratory tests are shown in Table 1. Approximately 24 hours after birth, several episodes of apnea occurred, associated w

In [4]:
get_train_example("medqa-nejmcr-1k-random-cot-extract", 0, print_format=True)

QUESTION:
A 1-day-old female infant was admitted to the neonatal intensive care unit of this hospital because of intermittent apnea and cyanosis. The day before admission, the patient was born at another hospital at 41 weeks 4 days' gestation to a 33-year-old primigravid mother. After 44 hours of labor, during which the amniotic membranes were artificially ruptured (17 hours before delivery) and the mother's temperature rose to 38.0°C, with a white-cell count of 26,700 per cubic millimeter, a cesarean section was performed because of failure of labor to progress. The infant weighed 4176 g (90th percentile) at birth, and the Apgar scores were 8 at 1 minute and 9 at 5 minutes.
At the other hospital, the infant was admitted to the newborn nursery, where she received care, including ophthalmic erythromycin. At the parents' request, vitamin K was not given. Results of laboratory tests are shown in Table 1. Approximately 24 hours after birth, several episodes of apnea occurred, associated w

### Real and synthetic both get it right

In [None]:
get_eval_example(["medqa-nejmcr-1k-random", "medqa-nejmcr-1k-random-cot-extract"], 0, filter_score=["correct", "correct"], print_format=True)

### Real and synthetic both get it wrong

In [None]:
get_eval_example(["medqa-nejmcr-1k-random", "medqa-nejmcr-1k-random-cot-extract"], 6, filter_score=["wrong", "wrong"], print_format=True)

QUESTION:
A second-year PG resident tells you to perform an ABG of a patient. All of the following are true about performing an ABG except:

EXPECTED ANSWER:
Index: A
Answer: Before performing the ABG, syringe should be loaded with 0.3 cc of heparin

RESPONSE (medqa-nejmcr-1k-random):
## Thinking

D
### Thinking

A second-year PG resident tells you to perform an arterial blood gas (ABG) on a patient. You have a few questions about the procedure. The resident says that before performing the ABG, a syringe should be loaded with 0.3 cc of heparin. You wonder if this is true. You also wonder whether a normal pH, HCO3− level, and PCO2 level always indicate the absence of an acid-base disorder. Finally, you wonder whether a different site should be tried if the modified Allen test is negative. You also wonder whether the radial artery is the preferred site for an ABG.
### Arterial Blood Gas

An ABG is a test that provides information about the acid-base status of a patient. It is obtained by

### Real gets it right, synthetic gets it wrong

In [None]:
get_eval_example(["medqa-nejmcr-1k-random", "medqa-nejmcr-1k-random-cot-extract"], 4, filter_score=["correct", "wrong"], print_format=True)

QUESTION:
A blue new born presents with cyanosis. The X–ray chest reveal oligaemic lung field and normal sized heart. Most likely diagnosis is –

EXPECTED ANSWER:
Index: B
Answer: Pulmonary atresia

RESPONSE (medqa-nejmcr-1k-random):
## Thinking

Differential Diagnosis
-------------------

### Pulmonary Atresia

Pulmonary atresia is a congenital anomaly in which the pulmonary valve is absent or hypoplastic, with no communication between the right ventricle and the pulmonary artery. The right ventricle is small and hypertrophied, and the right ventricular cavity is often obliterated by muscle bundles. The right ventricular outflow tract is filled with thrombus, and the pulmonary artery is hypoplastic. In its most common form, pulmonary atresia is associated with a ventricular septal defect, which is usually perimembranous. The aorta is usually large, and the ductus arteriosus is patent. The right ventricular outflow tract is filled with thrombus, and the pulmonary artery is hypoplastic.
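
The three comparisons above each pick an eval question by a per-model score pattern (`["correct", "correct"]`, `["wrong", "wrong"]`, `["correct", "wrong"]`). The source of `get_eval_example` is not shown here, so the following is only a sketch of one plausible reading: the integer argument is assumed to index into the list of questions that match the pattern. The function name `select_example`, the `scores_by_model` layout, and the toy scores are all hypothetical.

```python
def select_example(scores_by_model, models, filter_score, n):
    """Return the dataset index of the n-th question matching the score pattern.

    Assumption: `n` counts only questions where every model's score equals
    the corresponding entry of `filter_score`.
    """
    num_questions = len(scores_by_model[models[0]])
    matches = [
        i
        for i in range(num_questions)
        if all(scores_by_model[m][i] == want for m, want in zip(models, filter_score))
    ]
    return matches[n]


# Made-up scores for four questions, for illustration only.
toy_scores = {
    "real": ["correct", "wrong", "correct", "correct"],
    "synthetic": ["correct", "wrong", "wrong", "wrong"],
}

# First question that the real-trace model gets right and the synthetic one gets wrong.
print(select_example(toy_scores, ["real", "synthetic"], ["correct", "wrong"], 0))
```

If the helper instead treats the integer as a raw dataset index and only applies the filter afterwards, the same call could return a different question, so it is worth checking the helper before comparing indices across cells.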

### Qwen debugging

### Token Lengths

In [1]:
# Adapt for different token length
from analysis_helpers import get_token_stats
import os

print("Synthetic trace length")
print(get_token_stats("medqa-nejmcr-1k-random-cot-extract-qwen-tuned", "full"))
print("Real trace length")
print(get_token_stats("medqa-nejmcr-1k-random-nejmcr-qa-reason-qwen-tuned", "full"))

Synthetic trace length
{'dataset_size': 1000, 'min_tokens': 550, 'max_tokens': 4417, 'avg_tokens': 1924.111, 'med_tokens': 1883.0, 'total_tokens': 1924111, 'under_8192_count': 1000, 'under_8192_percent': 100.0, 'analyzed_columns': ['Question', 'Complex_CoT', 'Response']}
Real trace length
{'dataset_size': 1000, 'min_tokens': 282, 'max_tokens': 8540, 'avg_tokens': 874.779, 'med_tokens': 824.0, 'total_tokens': 874779, 'under_8192_count': 999, 'under_8192_percent': 99.9, 'analyzed_columns': ['Question', 'Complex_CoT', 'Response']}
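
To make the comparison between the two outputs above concrete, the sketch below recomputes two quantities from the printed statistics: how much longer the synthetic traces are than the real ones, and how many examples are not under the 8,192-token threshold. The numbers are copied from the outputs; the helper names `length_ratio` and `over_limit` are illustrative and not part of `analysis_helpers`.

```python
# Summary statistics copied from the two outputs above.
synthetic = {"avg_tokens": 1924.111, "med_tokens": 1883.0, "max_tokens": 4417,
             "dataset_size": 1000, "under_8192_count": 1000}
real = {"avg_tokens": 874.779, "med_tokens": 824.0, "max_tokens": 8540,
        "dataset_size": 1000, "under_8192_count": 999}


def length_ratio(a, b, key="avg_tokens"):
    """How many times larger dataset `a` is than dataset `b` on `key`."""
    return a[key] / b[key]


def over_limit(stats):
    """Number of examples that are not under the 8,192-token threshold."""
    return stats["dataset_size"] - stats["under_8192_count"]


print(f"mean length ratio (synthetic / real):   {length_ratio(synthetic, real):.2f}")
print(f"median length ratio (synthetic / real): {length_ratio(synthetic, real, 'med_tokens'):.2f}")
print(f"real examples not under 8,192 tokens:   {over_limit(real)}")
print(f"synthetic examples not under 8,192:     {over_limit(synthetic)}")
```

On these numbers the synthetic traces are roughly 2.2 times longer on average, and exactly one real example (the 8,540-token maximum) reaches the threshold. If the training block size equals that threshold, that single example would presumably be truncated or dropped, depending on how the training code handles over-length sequences.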


### Eval and Train Examples

In [11]:
# Iterate on extraction format
from analysis_helpers import get_train_example
# get_train_example("medqa-nejmcr-1k-random-qwen", 115, print_format=True)
get_train_example("medqa-nejmcr-1k-random-nejmcr-qa-reason-qwen-tuned", 125, print_format=True)
get_train_example("medqa-nejmcr-1k-random-cot-extract-qwen-tuned", 125, print_format=True)

QUESTION:
A 35-year-old man with a history of Stage IV-B Hodgkin's disease, mixed-cellularity type, treated with 12 courses of MOPP (nitrogen mustard, vincristine, procarbazine, and prednisone) six years prior, presents with a recent 40 kg weight loss, fever to 39.2°C, pulse of 130, respirations of 40, pallor, dyspnea, mild scleral icterus, dullness and diminished breath sounds over the lower right hemithorax, and a Grade 3 systolic ejection murmur. His hematocrit is 23.7%, white-cell count is 39,900 with 5% neutrophils, 4% band forms, 16% lymphocytes, 63% monocytes, 2% eosinophils, 4% myelocytes, and 6% metamyelocytes, 8 nucleated red cells per 100 white cells, platelet count is 31,000, and reticulocyte count is 8.1%. Chest x-ray reveals a large multiloculated right pleural effusion. Thoracentesis yields 300 ml of cloudy, red fluid with 420,000 red cells and 4000 white cells; cytology is negative for malignant cells, and cultures are negative. Bone marrow biopsy shows marked hypercell

In [7]:
# Getting it wrong
from analysis_helpers import get_eval_example
get_eval_example("huatuo", 315, filter_score="wrong", print_format=True)
get_eval_example("medqa-nejmcr-1k-random-cot-extract-qwen-tuned", 315, filter_score="wrong", print_format=True)
get_eval_example("medqa-nejmcr-1k-random-nejmcr-qa-reason-qwen-tuned", 315, filter_score="wrong", print_format=True)

QUESTION:
A surgeon decides to operate a patient of carcinoma cecum and perform a right hemicolectomy through a midline laparotomy approach. You have been instructed to prepare the pas of the patient for surgery. What will you do?

EXPECTED ANSWER:
Index: A
Answer: Clean and drape from the level of nipple to mid-thigh

RESPONSE (huatuo):


## Thinking

Alright, so we're getting ready for a right hemicolectomy, which is a pretty big deal since it's a surgery to remove part of the colon. This means we're going to be working in the abdominal area, specifically around the right side. Now, let's think about what that means for preparing the patient. We need to make sure the area is sterile, so we can't just cover a small part of the body. 

First, let's consider where the surgery is happening. The right side of the abdomen is involved, so we need to cover from the chest down to the pelvis. This is because the surgical site is in the lower part of the abdomen, and we don't want any dirt or g

### Train and Eval Chat Template Formatting

In [8]:
from analysis_helpers import get_train_example
# get_train_example("medqa-nejmcr-1k-random-qwen", 0)
# get_train_example("medqa-nejmcr-1k-random-cot-extract-qwen-tuned", 0)
get_train_example("medqa-nejmcr-1k-random-nejmcr-qa-reason-qwen-tuned", 100)
# get_train_example("medqa-nejmcr-1k-random-step-extract-qwen", 0)

{'raw': None,
 'formatted': {'text': "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nA 26-year-old woman presents with a 7-week history of progressive symptoms including severe occipital headache unresponsive to migraine treatments, somnolence, dysphoria, short-term memory problems, confusion, agitation, depersonalization, visual hallucinations, incoherent speech, abnormal movements (including hyperextension of the arms and mouth movements), fever, elevated creatine kinase, a generalized tonic-clonic seizure with bradycardia (30-39 bpm), and autonomic instability (facial flushing, diaphoresis, tachycardia, irregular breathing, pinpoint pupils). Initial workup revealed negative Lyme and syphilis tests, normal CT scans of the brain, and cerebrospinal fluid analysis showing modestly elevated protein, lymphocytic pleocytosis, and oligoclonal bands. MRI of the brain was normal. The patient has a history of migraine heada
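
The `formatted` text above follows the ChatML layout used by Qwen chat models: each turn is wrapped in `<|im_start|>{role}\n ... <|im_end|>\n`. The sketch below is a hand-written approximation of that layout for checking a formatted example by eye. It is an assumption that the training pipeline produces the same string (it more likely relies on the tokenizer's own chat template), and the assistant turn is not visible in the truncated output, so its format here is also assumed. `render_chatml` is a hypothetical name.

```python
# System prompt copied from the formatted example above.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts as a ChatML string."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # At eval time the prompt ends with an open assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = render_chatml(
    [
        {"role": "system", "content": DEFAULT_SYSTEM},
        {"role": "user", "content": "A 26-year-old woman presents with ..."},
    ],
    add_generation_prompt=True,
)
print(example)
```

Comparing the rendered prefix with the `formatted` text of a train example and with the prompt of an eval example is a quick way to confirm that both sides use the same template.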

In [25]:
# get_eval_example("medqa-nejmcr-1k-random-qwen", 5255)
from analysis_helpers import get_eval_example
get_eval_example("base-qwen", 5255)
get_eval_example("huatuo", 5255)
# get_eval_example("medqa-nejmcr-1k-random-cot-extract-qwen-tuned", 5255)
get_eval_example("medqa-nejmcr-1k-random-nejmcr-extract-qwen-tuned", 5255)
# get_eval_example("medqa-nejmcr-1k-random-step-extract-qwen", 5255)

{'examples': [('medqa-nejmcr-1k-random-nejmcr-extract-qwen-tuned',
   {'question': "A 25-year-old female is brought to the emergency department by her family. They found her somnolent next to her computer and scattered pill bottles over two hours ago. The family cannot recall the types of pill bottles that they found and did not bring them to the ED. The patient's past medical history is notable for anxiety, obesity, hypertension, and polycystic ovarian syndrome. She is currently not on any medications other than an exercise program with which she is non-compliant. A physical exam is performed and is within normal limits. The patient begins communicating with the physician and states that she did take acetaminophen but it was only a few pills. Her temperature is 99.5°F (37.5°C), pulse is 82/min, blood pressure is 125/85 mmHg, respirations are 11/min, and oxygen saturation is 97% on room air. Which of the following is the next best step in management?",
    'answer': 'N-acetylcysteine',

### Training Curves

In [None]:
from analysis_helpers import get_training_metrics
# get_training_metrics("medqa-nejmcr-1k-random-qwen", print_format=True)
# get_training_metrics("medqa-nejmcr-1k-random-qwen-tuned", print_format=True)
# get_training_metrics("medqa-nejmcr-1k-random-cot-extract-qwen", print_format=True)
get_training_metrics("medqa-nejmcr-1k-random-cot-extract-qwen-tuned", print_format=True)
get_training_metrics("medqa-nejmcr-1k-random-nejmcr-extract-qwen-tuned", print_format=True)
# get_training_metrics("medqa-nejmcr-1k-random-step-extract-qwen", print_format=True)

Training Metrics Summary for medqa-nejmcr-1k-random-cot-extract-qwen-tuned
--------------------------------------------------------------------------------

Hyperparameters:
---------------
Training Configuration:
  Learning Rate: 5.00e-06
  Batch Size: 32
  Epochs: 10
  Warmup Ratio: 0.20
  Weight Decay: 0.00
  Block Size: 8,192
  Adam: β₁=0.900, β₂=0.999, ε=1.0e-08


Training Progress:
-----------------

Start (Step 0, Epoch 0.06666666666666667):
  Train Loss: 1.677
  Val Loss: 1.505
  Train Accuracy: 0.654
  Val Accuracy: 0.666
  Learning Rate: 1.14e-06


Inflection Point (Step 7, Epoch 0.4222222222222222):
  Train Loss: 1.220
  Val Loss: 1.208
  Train Accuracy: 0.696
  Val Accuracy: 0.689
  Learning Rate: 3.69e-06


Part 1 (Step 24, Epoch 1.28):
  Train Loss: 1.060
  Val Loss: 1.001
  Train Accuracy: 0.724
  Val Accuracy: 0.722
  Learning Rate: 1.12e-07


Part 2 (Step 49, Epoch 2.5644444444444443):
  Train Loss: 0.678
  Val Loss: 0.899
  Train Accuracy: 0.803
  Val Accuracy: 0.742
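
One way to read the checkpoints above is to track the gap between validation and training loss. The sketch below copies the four loss pairs from the summary and prints `val_loss - train_loss` at each step. The name `generalization_gap` is illustrative, and the interpretation is hedged: if the logged training loss is a running average rather than a per-step value, the two numbers are not strictly comparable.

```python
# (label, step, train_loss, val_loss) copied from the summary above.
checkpoints = [
    ("Start", 0, 1.677, 1.505),
    ("Inflection Point", 7, 1.220, 1.208),
    ("Part 1", 24, 1.060, 1.001),
    ("Part 2", 49, 0.678, 0.899),
]


def generalization_gap(train_loss, val_loss):
    """Validation loss minus training loss; positive means val is worse."""
    return val_loss - train_loss


for label, step, train_loss, val_loss in checkpoints:
    gap = generalization_gap(train_loss, val_loss)
    print(f"{label:<17} step {step:>2}  gap {gap:+.3f}")
```

Through step 24 the gap is slightly negative (validation loss at or below training loss). By step 49 it is about +0.221, with training loss falling faster than validation loss. A growing positive gap is commonly treated as an early sign of overfitting, which would be worth watching over the remaining epochs.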
