Important Notice:
This project is intentionally kept lean and simple, providing a hands-on experience with multimodal indexing, RAG search, and AI reasoning techniques. While not intended for production use, it serves as a powerful starting point for exploring how these techniques can unlock new possibilities in building smarter, more context-aware applications.
Voice Integration:
The real-time voice (STT/TTS) feature is a work in progress. Stay tuned for updates as new capabilities are added and improved!
The Field Service Assistant is a proof-of-concept application designed to revolutionize how field technicians access information and receive guidance in challenging industrial environments. By combining voice commands, image analysis, and multimodal retrieval capabilities, this system provides hands-free, contextual assistance when technicians need it most.
Key Features:
- 🗣️ Voice Interface: (Work-In-Progress) Hands-free speech-to-text input and text-to-speech responses using Azure OpenAI GPT-4o-mini-transcribe and GPT-4o-mini-tts
- 📸 Image Analysis: Submit photos of equipment for visual diagnostics and part identification
- 📚 Multimodal RAG: Retrieve relevant information from manuals, guides, and documentation based on both text and visual inputs
- 🧠 Agentic Retrieval: Smart, context-aware search that understands the technician's environment and needs
- ⚡ Real-time Responses: Optimized for low-latency, high-accuracy guidance in industrial settings
In fast-paced, noisy factory environments, field technicians often face complex issues that require digging through manuals, calling supervisors, or cross-referencing part images—all while trying to fix a machine. Switching between systems or screens wastes time and disrupts the repair process.
The Field Service Assistant solves this by allowing technicians to:
- Ask questions using natural voice commands while keeping hands free for repair work
- Submit photos of equipment or components for immediate identification and guidance
- Receive contextual instructions, part numbers, and repair procedures through voice feedback
- Access technical documentation instantly without manual searching
The application is built on a modern architecture leveraging Azure's AI services:
- Frontend: React-based UI with voice recording capabilities and image upload
- Backend: (Work-In-Progress) API service with WebSocket streaming for real-time voice interaction
- Search Layer: Azure AI Search with multimodal indexing of manuals and documentation
- AI Services:
  - Azure OpenAI GPT-4o for conversational responses
  - GPT-4o-mini-transcribe for speech-to-text conversion
  - GPT-4o-mini-tts for text-to-speech output
  - Azure Document Intelligence for processing technical documentation
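For a flavor of how these services are invoked, the sketch below shows a single speech-to-text call against a GPT-4o-mini-transcribe deployment using the OpenAI Python SDK; the environment variable names, API version, and deployment name are assumptions for illustration, not values from this repository. The diagrams that follow show the overall component layout and the voice round-trip.

```python
# Hypothetical one-shot transcription call. The actual app streams audio
# over a WebSocket instead (see the sequence diagram below).
import os

from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2025-03-01-preview",  # assumed; use a version your resource supports
)

with open("question.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # your deployment name
        file=audio_file,
    )

print(result.text)
```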
```mermaid
graph TB
    subgraph "Frontend"
        UI[React UI]
        Voice[Voice Control<br/>Component]
        Audio[Audio Processing]
    end

    subgraph "WebSocket Layer"
        WS[WebSocket Server<br/>/voice endpoint]
    end

    subgraph "Backend Services"
        API[Python API<br/>Service]
        RAG[RAG Base<br/>Component]
        Search[Search Service]
    end

    subgraph "Azure AI Services"
        STT[Azure OpenAI<br/>GPT-4o-mini-transcribe]
        TTS[Azure OpenAI<br/>GPT-4o-mini-tts]
        GPT[Azure OpenAI<br/>GPT-4o]
        AIS[Azure AI Search<br/>Multimodal Index]
        DI[Azure Document<br/>Intelligence]
    end

    subgraph "Storage"
        Blob[Azure Blob Storage<br/>Images & Documents]
    end

    %% User interactions
    User((Field Technician))
    User -->|Voice Input| Voice
    User -->|Image Upload| UI
    UI -->|Display Results| User
    Audio -->|Audio Output| User

    %% Frontend connections
    Voice <-->|Audio Stream| WS
    UI <-->|Chat Messages| API
    Voice --> Audio

    %% WebSocket connections
    WS <-->|Audio Data| STT
    WS <-->|Text Data| TTS

    %% Backend connections
    API --> RAG
    RAG --> Search
    Search --> AIS
    RAG --> GPT
    API --> DI

    %% Azure connections
    AIS <--> Blob
    DI --> Blob

    %% Data flow annotations
    WS -.->|Transcribed Text| API
    API -.->|Response Text| WS
```
```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant WebSocket
    participant STT
    participant Backend
    participant TTS

    User->>Frontend: Speak Question
    Frontend->>WebSocket: Audio Stream (PCM16)
    WebSocket->>STT: Forward Audio
    STT->>WebSocket: Transcript Delta
    WebSocket->>Frontend: Transcript Update
    Frontend->>Backend: Complete Transcript
    Backend->>Backend: Process with RAG
    Backend->>WebSocket: Response Text
    WebSocket->>TTS: Generate Speech
    TTS->>WebSocket: Audio Chunks
    WebSocket->>Frontend: Audio Stream
    Frontend->>User: Play Response
```
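Server-side, the loop implied by this sequence could be sketched as below, assuming a FastAPI backend; the helper functions are placeholders for the GPT-4o-mini-transcribe, RAG, and GPT-4o-mini-tts calls, not the repo's actual API:

```python
# Minimal sketch of a /voice WebSocket handler; all helpers are stubs.
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def transcribe_chunk(pcm16: bytes) -> str:
    """Placeholder: forward PCM16 audio to GPT-4o-mini-transcribe."""
    return ""

async def answer_with_rag(question: str) -> str:
    """Placeholder: ground the question via Azure AI Search, answer with GPT-4o."""
    return "See manual section 4.2."

async def synthesize(text: str):
    """Placeholder: stream GPT-4o-mini-tts audio chunks."""
    yield b""

@app.websocket("/voice")
async def voice(ws: WebSocket):
    await ws.accept()
    transcript = ""
    while True:
        msg = await ws.receive()
        if msg["type"] == "websocket.disconnect":
            break
        if msg.get("bytes"):                          # PCM16 audio from the browser
            delta = await transcribe_chunk(msg["bytes"])
            if delta:
                transcript += delta
                await ws.send_json({"type": "transcript.delta", "text": delta})
        elif msg.get("text") == "end_of_utterance":   # client marks the turn as done
            answer = await answer_with_rag(transcript)
            await ws.send_json({"type": "response.text", "text": answer})
            async for chunk in synthesize(answer):    # stream audio back as generated
                await ws.send_bytes(chunk)
            transcript = ""
```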
Prerequisites:
- Azure account with subscription
- Azure AI Studio access
- Python 3.12+
- Node.js 18+
- Azure Developer CLI (azd)
- Clone the repository

  ```bash
  git clone https://github.com/your-org/field-service-assistant.git
  cd field-service-assistant
  ```

- Provision Azure resources

  ```bash
  az login --use-device-code
  azd auth login
  azd env new <YOUR_ENVIRONMENT_NAME>
  azd env set AZURE_PRINCIPAL_ID $(az ad signed-in-user show --query id -o tsv)
  azd up
  ```

- Start the application locally

  ```bash
  src/start.sh
  ```
For detailed setup instructions, see the Azure AI Search Multimodal RAG Demo documentation.
This project implements a complete voice interface using WebSockets for bidirectional audio streaming. See the Voice Integration Plan for implementation details.
You can easily index your own technical documentation:
- Place PDF documents in the /data folder
- Run the indexing script:

  ```powershell
  scripts/prepdocs.ps1
  ```
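Conceptually, an indexing pass like this does two things: extract content from each PDF with Document Intelligence, then push the chunks into the search index. Below is a minimal text-only sketch using the Azure SDKs; the index name and field names are assumptions, image extraction and vectorization are omitted, and SDK signatures vary slightly across versions:

```python
# Hedged sketch of a prepdocs-style ingestion step, not the actual script.
import os

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

di = DocumentIntelligenceClient(
    endpoint=os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["DOCUMENTINTELLIGENCE_KEY"]),
)
search = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="field-service-index",  # assumed
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

# Extract layout (paragraph-level text) from one manual.
with open("data/pump-manual.pdf", "rb") as f:
    layout = di.begin_analyze_document("prebuilt-layout", body=f).result()

# Upload each paragraph as a searchable chunk (assumed id/content/source schema).
docs = [
    {"id": f"pump-manual-{i}", "content": p.content, "source": "pump-manual.pdf"}
    for i, p in enumerate(layout.paragraphs or [])
]
search.upload_documents(documents=docs)
```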
The application leverages the following Azure services:
- Azure AI Search - For multimodal indexing and retrieval
- Azure AI Document Intelligence - For processing technical documentation
- Azure OpenAI Service - For LLM-powered responses and voice processing
- Azure Blob Storage - For storing extracted images and data
- Azure App Service - For hosting the application
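The first and third of these compose in the usual RAG pattern: fetch grounding passages from Azure AI Search, then have GPT-4o answer strictly from them. A minimal sketch, again with assumed index, field, and deployment names:

```python
# Retrieve-then-generate sketch; schema and deployment names are assumptions.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

search = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="field-service-index",  # assumed
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)
llm = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

question = "How do I reset the pressure sensor on the conveyor?"
hits = search.search(search_text=question, top=3)
context = "\n\n".join(doc["content"] for doc in hits)  # assumed 'content' field

answer = llm.chat.completions.create(
    model="gpt-4o",  # your deployment name
    messages=[
        {"role": "system", "content": "Answer only from the provided manual excerpts."},
        {"role": "user", "content": f"{question}\n\nExcerpts:\n{context}"},
    ],
)
print(answer.choices[0].message.content)
```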
This repository contains the complete implementation demonstrated in the "Multimodal RAG for Field Technicians" workshop, including:
- Architecture reference for building multimodal RAG systems
- Complete voice interface implementation with Azure OpenAI
- Multimodal index creation using Azure AI Search
- Hands-free interaction patterns for industrial environments
The system demonstrates the following end-to-end scenarios:
- Submitting a voice question about equipment repair
- Uploading an image of a component with a voice description (sketched below)
- Receiving contextual repair instructions via voice
- Retrieving specific technical documentation with visual citation
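The image-plus-description scenario, for example, maps onto a single multimodal chat completion: the photo travels as a base64 data URL alongside the transcribed question. A sketch with assumed file and deployment names:

```python
# Multimodal request sketch: one user message carrying text plus an image.
import base64
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

with open("component.jpg", "rb") as f:  # photo taken by the technician
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What part is this, and how do I replace it?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)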
This proof-of-concept follows several key principles for industrial field service applications:
- Latency Optimization: Streaming responses for immediate feedback (see the sketch after this list)
- Error Resilience: Graceful fallbacks for poor network conditions
- Context Preservation: Maintaining conversation history for continuous assistance
- Multimodal Fusion: Combining insights from both text and visual sources
- Security: Backend-driven approach for credential security
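The latency principle in particular comes down to streaming: surfacing tokens as the model emits them rather than waiting for the complete answer. A minimal sketch of that pattern:

```python
# Streamed chat completion: print tokens as they arrive.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

stream = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    messages=[{"role": "user", "content": "Steps to replace the drive belt?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```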
Future enhancements could include:
- Offline mode for limited connectivity environments
- Custom voice commands for common repair procedures
- Integration with AR/VR headsets for visual overlay guidance
- Real-time performance metrics and diagnostics
- Integration with equipment IoT sensors for predictive assistance
This project builds upon the Azure AI Search Multimodal RAG Demo with additional voice capabilities (Work-In-Progress).
- 📄 Introducing Agentic Retrieval in Azure AI Search - Learn about the next generation of intelligent search with agentic retrieval capabilities
- 📚 Multimodal Search Overview - Official Microsoft documentation on multimodal search in Azure AI
- 🔍 From Diagrams to Dialogue: New Multimodal Functionality in Azure AI Search - How Azure AI unlocks insights from visual content
- 🖼️ Get Started with Image Search in Azure Portal - Tutorial for implementing image search capabilities
- 🎤 GPT-4o-mini-transcribe WebSocket Sample - Real-time speech transcription implementation using Azure OpenAI's GPT-4o-mini-transcribe
- 🔊 GPT-4o-mini-tts Async Streaming Sample - Text-to-speech implementation with Azure OpenAI's GPT-4o-mini-tts
