zakk-io/VisionTTS

VisionTTS: The gift of vision

VisionTTS is a mobile app and a pair of smart glasses designed to help people with visual disabilities access everyday visual information.

What does VisionTTS do?

  • Scene Describing: describes indoor and outdoor environments through audio feedback in Kinyarwanda, helping people with visual disabilities understand their surroundings and navigate them.
  • Text Reading: reads text such as printed documents and books aloud through audio feedback in Kinyarwanda.

Demo

This is a demo of the VisionTTS mobile app. It also simulates how the VisionTTS Glass should work.

Watch the demo

Performance

Metric | Average Response Time | Quality | Comments
Scene Describing | 31.5 seconds | ⭐⭐⭐⭐ |
Text Reading | 43 seconds | ⭐⭐⭐ | Works only on printed documents
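
The README does not say how these averages were measured. One plausible way to reproduce such figures is to time several full requests and take the mean. The sketch below does that. `average_response_time` and `fake_request` are hypothetical names, and the stand-in request only sleeps briefly instead of calling a real model.

```python
import time
from statistics import mean
from typing import Callable


def average_response_time(request: Callable[[], object], runs: int = 5) -> float:
    """Return the mean wall-clock duration, in seconds, of `runs` calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        request()  # one full request, e.g. image in, audio out
        durations.append(time.perf_counter() - start)
    return mean(durations)


def fake_request() -> None:
    """Hypothetical stand-in for a real Scene Describing request."""
    time.sleep(0.01)


if __name__ == "__main__":
    print(f"average: {average_response_time(fake_request):.3f} s")
```

To measure the real system, replace `fake_request` with a function that sends an image and waits for the audio to come back.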

Project Structure

VisionTTS/
├── README.md                   # Project documentation
├── app/                        # PWA frontend application (JS, CSS, HTML)
│   ├── app.js                  # Main application logic
│   ├── sw.js                   # Service worker for offline support
│   ├── index.html              # Frontend entry point
│   └── icons/                  # App icons
├── screenshots/                # Documentation screenshots
├── textToSpeech/               # Kinyarwanda Text-to-Speech service
│   └── KinyarwandaTTSFemaleVoice/
│       ├── api/                # Python API for TTS
│       ├── model/              # TTS model files
│       └── ui/                 # TTS testing UI
├── translation/                # Mbaza NLP translation service
│   └── mbazaNLP/
│       └── Quantized_Nllb_Finetuned_Edu_En_Kin_8bit/
│           ├── api/            # Python API for translation
│           └── model/          # NLLB translation model files
└── vlm/                        # Vision Language Model (VLM) service
    ├── app.py                  # FastAPI server for scene understanding
    └── api.rest                # API testing requests
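
The tree lists vlm/app.py as the FastAPI server for scene understanding, but the README does not show how it talks to the model. The sketch below builds one possible request body. It assumes the model named under "How it works" (qwen3-vl:2b-instruct-q4_K_M) is served by a local Ollama instance and called through Ollama's /api/generate endpoint. The tag format suggests Ollama, but the README does not confirm it. The function name and the prompt are hypothetical.

```python
import base64
import json

# Model tag taken from the README; serving it through Ollama is an assumption.
MODEL = "qwen3-vl:2b-instruct-q4_K_M"


def build_generate_payload(image_bytes: bytes, prompt: str) -> str:
    """Build a JSON body for Ollama's POST /api/generate (assumed backend)."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        # Ollama expects images as a list of base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete response, not chunks
    }
    return json.dumps(payload)


if __name__ == "__main__":
    body = build_generate_payload(b"fake image bytes", "Describe this scene.")
    print(body[:80])
```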

How it works

  1. Capture the Scene: The mobile camera takes a picture of what is in front of the user, such as a sign, a book, or the surrounding environment.
  2. Understand What’s Seen: The backend runs an AI model (qwen3-vl:2b-instruct-q4_K_M) that analyzes the image and describes what is happening in the scene.
  3. Translate into Kinyarwanda: Since the AI model describes images in English, the system automatically translates the description into Kinyarwanda using a translation model (Quantized_Nllb_Finetuned_Health_En_Kin_8bit_v2).
  4. Convert to Speech: The translated description is turned into voice audio using a Kinyarwanda speech model (KinyarwandaTTS_female_voice). The user hears the description through the phone's speakers.
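
The steps above chain three model stages behind the camera capture. The sketch below shows that flow under stated assumptions: the `Pipeline` class and the `fake_*` functions are hypothetical, and the stand-ins replace the real vision, translation, and speech models so the example runs on its own. It is not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Chains the stages from the list above; each stage is injected."""
    describe: Callable[[bytes], str]    # image -> English description
    translate: Callable[[str], str]     # English -> Kinyarwanda
    synthesize: Callable[[str], bytes]  # Kinyarwanda text -> audio bytes

    def run(self, image: bytes) -> bytes:
        english = self.describe(image)
        kinyarwanda = self.translate(english)
        return self.synthesize(kinyarwanda)


# Hypothetical stand-ins so the sketch runs without any model.
def fake_describe(image: bytes) -> str:
    return f"an image of {len(image)} bytes"


def fake_translate(text: str) -> str:
    return f"[rw] {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    pipeline = Pipeline(fake_describe, fake_translate, fake_synthesize)
    print(pipeline.run(b"\x00\x01\x02"))  # b'[rw] an image of 3 bytes'
```

Passing the stages in as callables keeps each service (vlm, translation, textToSpeech) replaceable and lets the flow be tested without loading any model.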

What's next?

The software development and design phase is finished.

Next is assembling the glass hardware and connecting it with the software.

Pocket holder for Raspberry Pi

ESP32 holder

Pi Camera Module 3 holder
