VisionTTS is a mobile app and smart glasses designed to help people with visual disabilities access everyday visual information.
- Scene Describing: describes indoor and outdoor environments through audio feedback in Kinyarwanda, helping individuals with visual disabilities understand their surroundings and navigate.
- Text Reading: reads text, such as printed documents and books, through audio feedback in Kinyarwanda.
This is a demo of the VisionTTS mobile app. The demo also simulates how VisionTTS Glass should work.
| Feature | Average Response Time | Quality | Comments |
|---|---|---|---|
| Scene Describing | 31.5 seconds | ⭐ ⭐ ⭐ ⭐ | |
| Text Reading | 43 seconds | ⭐ ⭐ ⭐ | Works only on printed documents |
VisionTTS/
├── README.md                 # Project documentation
├── app/                      # PWA frontend application (JS, CSS, HTML)
│   ├── app.js                # Main application logic
│   ├── sw.js                 # Service worker for offline support
│   ├── index.html            # Frontend entry point
│   └── icons/                # App icons
├── screenshots/              # Documentation screenshots
├── textToSpeech/             # Kinyarwanda Text-to-Speech service
│   └── KinyarwandaTTSFemaleVoice/
│       ├── api/              # Python API for TTS
│       ├── model/            # TTS model files
│       └── ui/               # TTS testing UI
├── translation/              # Mbaza NLP translation service
│   └── mbazaNLP/
│       └── Quantized_Nllb_Finetuned_Edu_En_Kin_8bit/
│           ├── api/          # Python API for translation
│           └── model/        # NLLB translation model files
└── vlm/                      # Vision Language Model (VLM) service
    ├── app.py                # FastAPI server for scene understanding
    └── api.rest              # API testing requests
- Capture the Scene: the mobile camera takes a picture of what is in front of the user, such as a sign, a book, or the surrounding environment.
- Understand What's Seen: the backend runs an AI model (qwen3-vl:2b-instruct-q4_K_M) that analyzes the image and works out what is happening in the scene.
- Translate into Kinyarwanda: because the AI model describes images in English, the system automatically translates the description into Kinyarwanda using a translation model (Quantized_Nllb_Finetuned_Health_En_Kin_8bit_v2).
- Convert to Speech: the translated description is turned into voice audio using a Kinyarwanda speech model (KinyarwandaTTS_female_voice). The user hears the description through the phone speakers.
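
The four steps above can be sketched as a small Python pipeline. This is only an illustrative outline, not the project's actual code: `build_vlm_request`, `run_pipeline`, `PipelineResult`, the payload shape, and the three stub functions are all hypothetical names introduced here, and the stubs stand in for the real VLM, translation, and TTS services. Recording per-stage timings is one possible way to see which step contributes most to the overall response time.

```python
import base64
import time
from dataclasses import dataclass, field
from typing import Callable, Dict

# Model name taken from the README; everything else below is an assumption.
VLM_MODEL = "qwen3-vl:2b-instruct-q4_K_M"


def build_vlm_request(image_bytes: bytes, prompt: str) -> dict:
    """Package an image and a prompt as a JSON-serializable payload (hypothetical shape)."""
    return {
        "model": VLM_MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }


@dataclass
class PipelineResult:
    english: str
    kinyarwanda: str
    audio: bytes
    timings: Dict[str, float] = field(default_factory=dict)


def run_pipeline(
    image_bytes: bytes,
    describe: Callable[[dict], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    prompt: str = "Describe this scene briefly.",
) -> PipelineResult:
    """Run capture -> describe -> translate -> speak, timing each stage in seconds."""
    timings: Dict[str, float] = {}

    def timed(name, fn, arg):
        start = time.perf_counter()
        out = fn(arg)
        timings[name] = time.perf_counter() - start
        return out

    request = build_vlm_request(image_bytes, prompt)
    english = timed("describe", describe, request)
    kinyarwanda = timed("translate", translate, english)
    audio = timed("synthesize", synthesize, kinyarwanda)
    return PipelineResult(english, kinyarwanda, audio, timings)


# Stub stages standing in for the real services (no real models are called).
def fake_describe(request: dict) -> str:
    return "A door is in front of you."


def fake_translate(text: str) -> str:
    return f"[rw] {text}"  # placeholder marker, not a real translation


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder bytes, not real audio


result = run_pipeline(b"fake-image-bytes", fake_describe, fake_translate, fake_synthesize)
print(result.kinyarwanda)
print(sorted(result.timings))
```

Passing the stages in as callables keeps the sketch runnable without any model installed. In a real deployment each stub would be replaced by a call to the corresponding service under `vlm/`, `translation/`, and `textToSpeech/`.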
The software development and design phase is finished.
The next step is assembling the glass hardware and connecting it to the software.
Pocket holder for Raspberry Pi
ESP32 holder
Pi Camera Module 3 holder



