<a href="https://colab.research.google.com/github/vifirsanova/phat-llm/blob/main/model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

0. **Download and Prepare the Data:**
   - Transcribe a set of audio recordings with OpenAI Whisper
   - Annotate the audio files with IPA via GPT-4o
   - Collect speech samples annotated in Praat and ELAN

1. **Load the Pre-trained Model:**

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_name = "openai/whisper-base"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

2. **Add LoRA Adapters:**

In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,  # Rank of the low-rank approximation
    lora_alpha=32,  # Scaling factor
    lora_dropout=0.1,  # Dropout probability
    target_modules=["q_proj", "v_proj"],  # Target modules to apply LoRA
)

model = get_peft_model(model, lora_config)
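
To get a feel for how small the adapters are, the number of trainable LoRA parameters can be estimated by hand. The sketch below is an illustration rather than part of the pipeline. It assumes the published `whisper-base` configuration (hidden size 512, 6 encoder layers, 6 decoder layers) and that each decoder layer has both self-attention and cross-attention, each with its own `q_proj` and `v_proj`. The function name is hypothetical.

```python
def lora_trainable_params(d_model, r, encoder_layers, decoder_layers, targets_per_attention=2):
    """Estimate LoRA parameters added to the q_proj/v_proj layers of a Whisper-style model."""
    # Each adapted d_model x d_model linear layer gets two matrices:
    # A with shape (r, d_model) and B with shape (d_model, r).
    params_per_module = r * (d_model + d_model)
    # Encoder layers have one attention block (self-attention).
    encoder_modules = encoder_layers * targets_per_attention
    # Decoder layers have two attention blocks (self-attention and cross-attention).
    decoder_modules = decoder_layers * 2 * targets_per_attention
    return params_per_module * (encoder_modules + decoder_modules)


# Assumed whisper-base dimensions; r matches the LoraConfig above.
n_params = lora_trainable_params(d_model=512, r=16, encoder_layers=6, decoder_layers=6)
scale = 32 / 16  # lora_alpha / r, the factor applied to the low-rank update
print(n_params, scale)
```

Under these assumptions the estimate is a few hundred thousand parameters, a small fraction of the base model. Calling `model.print_trainable_parameters()` on the PEFT-wrapped model reports the actual figure.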

3. **Prepare the Training Data:**

In [None]:
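def preprocess_function(examples):
    # Each entry of examples["audio"] is assumed to be a dict with an "array" key,
    # as produced by the `datasets` Audio feature, sampled at 16 kHz
    audio_arrays = [audio["array"] for audio in examples["audio"]]
    features = processor.feature_extractor(audio_arrays, sampling_rate=16000)
    # Tokenize the target text; padding is deferred to the data collator
    labels = processor.tokenizer(examples["text"]).input_ids
    return {"input_features": features["input_features"], "labels": labels}

# `dataset` is assumed to be a DatasetDict with "audio" and "text" columns loaded in step 0
train_dataset = dataset["train"].map(preprocess_function, batched=True)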
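
The `sampling_rate=16000` argument matters because Whisper's feature extractor turns every clip into a fixed-size log-mel spectrogram. A minimal sketch of the arithmetic follows. The constants are assumptions taken from the default Whisper feature extractor settings for `whisper-base` (30-second window, hop length of 160 samples, 80 mel bins), and the function name is hypothetical.

```python
SAMPLING_RATE = 16_000  # samples per second expected by Whisper
HOP_LENGTH = 160        # samples between consecutive spectrogram frames (assumed default)
CHUNK_SECONDS = 30      # clips are padded or truncated to this length (assumed default)
N_MELS = 80             # mel bins for whisper-base (assumed default)


def whisper_feature_shape(sampling_rate=SAMPLING_RATE, hop_length=HOP_LENGTH,
                          chunk_seconds=CHUNK_SECONDS, n_mels=N_MELS):
    """Return the (mel bins, frames) shape of one `input_features` entry."""
    n_samples = chunk_seconds * sampling_rate
    n_frames = n_samples // hop_length
    return (n_mels, n_frames)


print(whisper_feature_shape())
```

Audio recorded at a different rate should be resampled to 16 kHz before preprocessing. Otherwise the frame arithmetic above no longer corresponds to 30 seconds of speech.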

4. **Train the Model:**

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

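training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # Named `eval_strategy` in recent transformers releases
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=2,
    save_strategy="epoch",
    fp16=True,  # Requires a CUDA GPU
    learning_rate=5e-5,
    remove_unused_columns=False,  # The PEFT wrapper hides forward()'s signature from the Trainer
    label_names=["labels"],  # Tell the Trainer which key holds the targets
)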

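def collate_fn(features):
    # Batch the log-mel features and the label ids separately, since they pad differently
    batch = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features], return_tensors="pt"
    )
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    # Replace padding positions with -100 so the loss ignores them
    batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
    return batch

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # The validation split needs the same preprocessing as the training split
    eval_dataset=dataset["validation"].map(preprocess_function, batched=True),
    data_collator=collate_fn,  # The processor itself is not a data collator
)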

trainer.train()

5. **Evaluate the Model:**

In [None]:
eval_results = trainer.evaluate()
eval_results
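
Without a `compute_metrics` function, `trainer.evaluate()` reports only the loss. For speech recognition, word error rate (WER) is a common complement. Below is a minimal pure-Python sketch of WER based on word-level edit distance. It is an illustration with a hypothetical function name, not the metric used by the author. For IPA output, the same idea can be applied to symbols instead of words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the bat sat"))
```

To use it during evaluation, the predicted token ids would first need to be decoded to text, for example with `processor.batch_decode`.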

6. **Inference:**

In [None]:
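# `load_and_preprocess_dataset` and `fine_tune_model` are helper functions
# that are not defined in this notebook and must be provided separately
from transformers import Wav2Vec2ForCTC, Wav2Vec2ForSequenceClassification

# 1. Phonetic Transcription (IPA Symbols)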
dataset_name = "your_ipa_transcription_dataset"
ipa_dataset = load_and_preprocess_dataset(dataset_name, "ipa")
ipa_model = fine_tune_model(ipa_dataset, "ipa", Wav2Vec2ForCTC, "./results/ipa", 10, 8, 5e-5)

# 2. Prosody Analysis
dataset_name = "your_prosody_dataset"
prosody_dataset = load_and_preprocess_dataset(dataset_name, "prosody")
prosody_model = fine_tune_model(prosody_dataset, "prosody", Wav2Vec2ForSequenceClassification, "./results/prosody", 10, 8, 5e-5)

# 3. Non-Verbal Marker Annotation
dataset_name = "your_non_verbal_dataset"
non_verbal_dataset = load_and_preprocess_dataset(dataset_name, "non_verbal")
non_verbal_model = fine_tune_model(non_verbal_dataset, "non_verbal", Wav2Vec2ForSequenceClassification, "./results/non_verbal", 10, 8, 5e-5)

7. **Convert to XML:** Use prompt-tuning to convert the resulting annotations to XML.
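
The notebook does not specify the XML schema, so the sketch below only illustrates what a target document might look like. It builds the XML deterministically with the standard library instead of prompt-tuning, which is a swapped-in technique for illustration. Every element name, attribute name, and dictionary key here is hypothetical.

```python
import xml.etree.ElementTree as ET


def annotation_to_xml(annotation: dict) -> str:
    """Serialize one annotated utterance to an XML string (hypothetical schema)."""
    root = ET.Element("utterance", id=str(annotation["id"]))
    ET.SubElement(root, "text").text = annotation["text"]
    ET.SubElement(root, "ipa").text = annotation["ipa"]
    ET.SubElement(root, "prosody", label=annotation["prosody"])
    # One <marker> element per non-verbal event
    for marker in annotation["non_verbal"]:
        ET.SubElement(root, "marker", type=marker)
    return ET.tostring(root, encoding="unicode")


example = {
    "id": 1,
    "text": "hello",
    "ipa": "həˈloʊ",
    "prosody": "falling",
    "non_verbal": ["laugh"],
}
print(annotation_to_xml(example))
```

A deterministic serializer like this can also serve as a reference for checking that XML produced by a prompt-tuned model is well-formed and carries the same fields.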