Add support for Phi-4-multimodal

### Search before asking

- [x] I have searched the Multimodal Maestro [issues](https://github.com/roboflow/multimodal-maestro/issues) and found no similar feature requests.


### Question

Please add support for Phi-4-multimodal. I want to train a single model that combines speech, image, and text.

### Additional

_No response_