A lightweight conversational assistant that allows users to interact using both speech and text. This system utilizes state-of-the-art models for speech recognition and text-to-speech (TTS) generation, including Whisper for transcription and Coqui TTS for speech synthesis. The project is built using Django and supports additional models and APIs via Hugging Face.
- Speech-to-Text: Uses Whisper for accurate speech recognition across 30+ languages.
- Text-to-Speech: Powered by Coqui TTS for multilingual voice synthesis.
- Microphone Input: Supports real-time speech input via microphone.
- Image Generation: Integrated with open-source models and Hugging Face API for generating images from text.
- Django Web Interface: Accessible through a local server for easy interaction.
View the screen recording of the project.
NB on secrets: the included secrets will be removed after one week; after that, go to Hugging Face and get your own API key.
File path: src/src/secrets/secrets.txt
Ensure you have the following installed:
- Python 3.x
- Django
- FFmpeg
- pip
- Other dependencies listed in requirements.txt
You have two options for running this application: Docker or Local Setup. Choose the method that best suits your environment.
If you have Docker and WSL installed, you can run the application in a Docker container. This method simplifies dependencies and environment setup.
- Build Docker: Ensure you have Docker installed and configured, then build your Docker container:
docker-compose up --build
You can stop and start the services at any point (docker-compose stop / docker-compose start); don't rebuild the container unless it is necessary. To add FFmpeg, install it manually inside the running Docker container without rebuilding it. Here's how to do that, step by step:
- Access the Running Container: First, get a shell into your running Django container with the following command:
docker-compose exec django /bin/bash
This command opens a bash shell in the django container.
- Install ffmpeg Manually: Once inside the container, install ffmpeg using apt-get. Run the following commands:
apt-get update
apt-get install -y ffmpeg
- Verify the Installation: After the installation is complete, verify that ffmpeg is installed correctly by running:
ffmpeg -version
This should display the version of ffmpeg that was installed.
- Exit the Container: To exit the container, type
exit
- Start the Services:
docker-compose up -d
- Check Logs (Optional):
docker-compose logs -f django
- Stop Services:
docker-compose down
- Access the Interface: Open your browser and go to http://localhost:8000/api/interface/ to interact with the assistant.
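As an optional sanity check from the command line, the sketch below polls the interface URL; it assumes the requests package is available in your local Python environment (it is not required for normal browser use).

```python
# Optional sanity check: confirm the Django service answers on the expected URL.
# Assumes the `requests` package is installed locally; adjust the URL if you changed the port.
import requests

response = requests.get("http://localhost:8000/api/interface/")
print(response.status_code)  # expect 200 once the services are up
```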
- Clone the Repository:
git clone https://github.com/Josewathome/speech-text-and-text-speech.git
cd speech-text-and-text-speech
- Create a Virtual Environment (Optional but Recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate.bat
- Install Requirements (using WSL or Linux):
pip install -r requirements.txt
If using Windows, install the requirements manually as described in Windows_Requirements.docx.
- Set Up the Django Application: Make and apply migrations for the Django project.
python manage.py makemigrations text
python manage.py migrate text
- Run the Django Server: Start the development server.
python manage.py runserver
- Access the Interface: Open your browser and go to http://localhost:8000/api/interface/.
The system uses OpenAI's Whisper model for speech-to-text functionality. To use Whisper, follow the steps below; a minimal transcription sketch appears after them.
- Install Whisper:
pip install openai-whisper
- Install FFmpeg: Whisper requires FFmpeg for processing audio files. Follow the instructions based on your operating system:
- Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
- macOS (Homebrew):
brew install ffmpeg
- Windows: Download FFmpeg from the official website and add it to your system PATH.
- Additional Dependencies: You may also need to install additional dependencies:
pip install setuptools-rust
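Once Whisper and FFmpeg are installed, transcription takes only a few lines. The sketch below is a minimal example, assuming a local audio file named sample.wav; the "base" model size is just one choice, not necessarily what this project uses internally.

```python
# Minimal sketch: transcribe an audio file with the locally installed Whisper model.
# "base" and "sample.wav" are example choices, not the project's fixed configuration.
import whisper

model = whisper.load_model("base")       # downloads the model weights on first use
result = model.transcribe("sample.wav")  # FFmpeg must be on PATH for audio decoding
print(result["text"])
```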
Coqui TTS is used for synthesizing speech in multiple languages. To set it up, follow the steps below; a short synthesis sketch comes after them.
- Install Coqui TTS:
pip install TTS
- For more details about customizing TTS models, refer to the Coqui TTS GitHub page.
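A minimal synthesis example is sketched below; the model name is an illustrative single-speaker English model, not necessarily the one this project loads.

```python
# Minimal sketch: synthesize speech to a WAV file with Coqui TTS.
# The model name is an example; see the Coqui TTS docs for multilingual alternatives.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello, how can I help you today?", file_path="reply.wav")
```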
- Start the Django Server:
python manage.py runserver
- Access the Application: Open your browser and go to http://localhost:8000/api/interface/ to interact with the assistant.
- The Whisper model is used to transcribe spoken language into text. You can use the microphone input to provide speech commands to the assistant.
- Coqui TTS will convert the generated text response back into speech. This enables natural conversations with the assistant.
- You can generate images based on text input using models hosted locally or via Hugging Face's API.
If you prefer using OpenAI's Whisper API for faster transcription, you can integrate it into your setup. The API supports multiple formats, including m4a, mp3, and wav, is priced at $0.006 per minute of transcription, and handles over 30 languages.
- Sign Up for API Access: You can obtain your API key from the OpenAI website.
- Modify Settings: Update the settings in your Django app to use the Whisper API instead of the local model for transcription (see the sketch below).
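For reference, a minimal transcription request is sketched below, assuming the current openai Python SDK (pip install openai) and an OPENAI_API_KEY environment variable; the file name is a placeholder.

```python
# Minimal sketch: transcribe an audio file via OpenAI's hosted Whisper API.
# Assumes the `openai` package (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("speech.m4a", "rb") as audio_file:  # "speech.m4a" is a placeholder path
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
print(transcript.text)
```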
Some of the models for image generation are integrated via Hugging Face API keys. Follow these steps to set it up:
- Sign Up for Hugging Face: Visit Hugging Face to create an account.
- Obtain API Key: Once registered, get your API key from your Hugging Face account settings.
- Configure the API Key in Django: Set the Hugging Face API key in your environment or directly in the Django configuration file to enable API-based model access (file path: src/src/secrets/secrets.txt); a minimal request sketch follows.
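As an illustration, the sketch below reads the key from an environment variable and calls the Hugging Face Inference API for text-to-image generation. The variable name HUGGINGFACE_API_KEY and the model are assumptions; match them to your own configuration, or read the key from the secrets file instead.

```python
# Minimal sketch: generate an image from text via the Hugging Face Inference API.
# HUGGINGFACE_API_KEY and the model name are assumptions; adapt them to your setup.
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/stabilityai/stable-diffusion-2"
headers = {"Authorization": f"Bearer {os.environ['HUGGINGFACE_API_KEY']}"}

response = requests.post(API_URL, headers=headers, json={"inputs": "a watercolor painting of a cat"})
response.raise_for_status()
with open("generated.png", "wb") as f:
    f.write(response.content)  # the endpoint returns raw image bytes on success
```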
We welcome contributions to improve the system. To contribute:
- Fork the repository.
- Create a new feature branch (git checkout -b feature-branch).
- Commit your changes (git commit -m 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Open a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.