VisionTTS: The gift of vision

VisionTTS is a mobile app and smart glasses designed to help people with visual disabilities access everyday visual information.

What does VisionTTS do?

Scene Describing: describes indoor and outdoor environments through audio feedback in Kinyarwanda, helping individuals with visual disabilities understand their surroundings and navigate.
Text Reading: reads text such as printed documents and books aloud through audio feedback in Kinyarwanda.

Demo

This is a demo of the VisionTTS mobile app. The demo also simulates how VisionTTS Glass should work.

Performance

Feature	Average Response Time	Quality	Comments
Scene Describing	31.5 seconds	⭐ ⭐ ⭐ ⭐	
Text Reading	43 seconds	⭐ ⭐ ⭐	Works only on printed documents
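
The README does not say how the average response times above were measured. As a minimal sketch, assuming the figure is the mean wall-clock time of several end-to-end requests, a measurement helper could look like the following. The names `average_response_time` and `fake_request` are hypothetical and are not part of the VisionTTS code.

```python
import time
from statistics import mean
from typing import Callable


def average_response_time(task: Callable[[], object], runs: int = 5) -> float:
    """Return the mean wall-clock seconds taken by `task` over `runs` calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Hypothetical stand-in for one end-to-end request (image in, audio out).
def fake_request() -> None:
    time.sleep(0.01)


avg = average_response_time(fake_request, runs=3)
print(f"average response time: {avg:.3f} s")
```

In a real measurement, `fake_request` would be replaced by a call that sends an image to the backend and waits for the audio reply.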

Project Structure

VisionTTS/
├── README.md                   # Project documentation
├── app/                        # PWA frontend application (JS, CSS, HTML)
│   ├── app.js                  # Main application logic
│   ├── sw.js                   # Service worker for offline support
│   ├── index.html              # Frontend entry point
│   └── icons/                  # App icons
├── screenshots/                # Documentation screenshots
├── textToSpeech/               # Kinyarwanda Text-to-Speech service
│   └── KinyarwandaTTSFemaleVoice/
│       ├── api/                # Python API for TTS
│       ├── model/              # TTS model files
│       └── ui/                 # TTS testing UI
├── translation/                # Mbaza NLP translation service
│   └── mbazaNLP/
│       └── Quantized_Nllb_Finetuned_Edu_En_Kin_8bit/
│           ├── api/            # Python API for translation
│           └── model/          # NLLB translation model files
└── vlm/                        # Vision Language Model (VLM) service
    ├── app.py                  # FastAPI server for scene understanding
    └── api.rest                # API testing requests

How it works

Capture the Scene: The mobile camera takes a picture of what is in front of the user, such as a sign, a book, or the surrounding environment.
Understand What’s Seen: The backend runs an AI model (qwen3-vl:2b-instruct-q4_K_M) that analyzes the image and produces a description of what is happening in the scene.
Translate into Kinyarwanda: Since the AI model describes images in English, the system automatically translates the description into Kinyarwanda using a translation model (Quantized_Nllb_Finetuned_Health_En_Kin_8bit_v2).
Convert to Speech: The translated description is turned into voice audio using a Kinyarwanda speech model (KinyarwandaTTS_female_voice). The user hears the description through the phone speakers.
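
The four steps above form a simple chain: image, English description, Kinyarwanda text, audio. The sketch below illustrates only that chaining and is not the project's actual code. The `Pipeline` class and the `fake_*` stages are hypothetical stand-ins, so the example runs without any model installed. In the real system each stage would call the corresponding service (VLM, translation, TTS).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Chains the three backend stages; each stage is injected as a callable."""

    describe_image: Callable[[bytes], str]     # image bytes -> English description
    translate_en_to_kin: Callable[[str], str]  # English -> Kinyarwanda text
    synthesize_speech: Callable[[str], bytes]  # Kinyarwanda text -> audio bytes

    def run(self, image: bytes) -> bytes:
        if not image:
            raise ValueError("empty image")
        english = self.describe_image(image)
        kinyarwanda = self.translate_en_to_kin(english)
        return self.synthesize_speech(kinyarwanda)


# Hypothetical stand-in stages (assumptions, not the real models).
def fake_vlm(image: bytes) -> str:
    return "a book on a table"


def fake_translate(text: str) -> str:
    return f"[kin] {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = Pipeline(fake_vlm, fake_translate, fake_tts)
audio = pipeline.run(b"captured-image-bytes")
```

Passing the stages in as callables keeps each model replaceable and lets the chain be tested without loading any of them.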

What's next?

The software development and design phase is finished.

The next step is assembling the glass hardware and connecting it to the software.

Pocket holder for Raspberry Pi

ESP32 holder

Pi Camera Module 3 holder

