### **Introduction to the Multimodal Translator Capstone**

Welcome to the Multimodal Translator Capstone Project! This unique project integrates advanced language translation, text-to-speech, and image generation into one seamless workflow.

#### **What You'll Learn**
- **Multilingual Translation**: Translate text between dozens of languages using `deep_translator`'s `GoogleTranslator`.
- **Text-to-Speech**: Convert text into lifelike narration for enhanced accessibility and engagement.
- **Contextual Imaging**: Generate relevant images based on translated content to provide visual context.
- **Multimodal Integration**: Combine components that process text, audio, and images into a single end-to-end workflow.

#### **Who Is This For?**
- Developers with a foundation in Python looking to expand their AI skillset.
- Enthusiasts aiming to integrate translation, audio, and image generation into real projects.
- Anyone interested in harnessing state-of-the-art multimodal techniques.

#### **Outcome**
By the end of this capstone, you’ll have a fully functional Multimodal Translator that can translate text, generate audio pronunciation, and produce context-relevant images—showcasing your ability to build integrated AI solutions.

**Step 1: Environment Setup**  
- Imports the required libraries: `GoogleTranslator` from `deep_translator`, plus `OpenAIAudioTTS`, `DeepInfraImgGenModel`, and the `base64_to_file_path` helper from `swarmauri`.  
- Retrieves API keys from environment variables.  

In [None]:
import os
from deep_translator import GoogleTranslator
from swarmauri.llms.concrete.OpenAIAudioTTS import OpenAIAudioTTS
from swarmauri.llms.concrete.DeepInfraImgGenModel import DeepInfraImgGenModel
from swarmauri.utils.base64_to_file_path import base64_to_file_path

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
DEEPINFRA_API_KEY = os.getenv("DEEPINFRA_API_KEY")
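
`os.getenv` returns `None` when a variable is unset, so a missing key would only surface later as a failed API call. A minimal sketch of an earlier check, assuming a hypothetical helper named `require_env` (it is not part of this notebook or of swarmauri):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With a helper like this, `OPENAI_API_KEY = require_env("OPENAI_API_KEY")` would fail at setup time rather than on the first request.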

**Step 2: Supported Languages Dictionary**  
- Defines a dictionary mapping language codes to language names.

In [2]:
SUPPORTED_LANGUAGES = {
    'af': 'Afrikaans',
    'ar': 'Arabic',
    'hy': 'Armenian',
    'az': 'Azerbaijani',
    'be': 'Belarusian',
    'bs': 'Bosnian',
    'bg': 'Bulgarian',
    'ca': 'Catalan',
    'zh-CN': 'Chinese',  # Simplified Chinese (deep_translator lists this code as 'zh-CN')
    'hr': 'Croatian',
    'cs': 'Czech',
    'da': 'Danish',
    'nl': 'Dutch',
    'en': 'English',
    'et': 'Estonian',
    'fi': 'Finnish',
    'fr': 'French',
    'gl': 'Galician',
    'de': 'German',
    'el': 'Greek',
    'he': 'Hebrew',
    'hi': 'Hindi',
    'hu': 'Hungarian',
    'is': 'Icelandic',
    'id': 'Indonesian',
    'it': 'Italian',
    'ja': 'Japanese',
    'kn': 'Kannada',
    'kk': 'Kazakh',
    'ko': 'Korean',
    'lv': 'Latvian',
    'lt': 'Lithuanian',
    'mk': 'Macedonian',
    'ms': 'Malay',
    'mr': 'Marathi',
    'mi': 'Maori',
    'ne': 'Nepali',
    'no': 'Norwegian',
    'fa': 'Persian',
    'pl': 'Polish',
    'pt': 'Portuguese',
    'ro': 'Romanian',
    'ru': 'Russian',
    'sr': 'Serbian',
    'sk': 'Slovak',
    'sl': 'Slovenian',
    'es': 'Spanish',
    'sw': 'Swahili',
    'sv': 'Swedish',
    'tl': 'Tagalog',
    'ta': 'Tamil',
    'th': 'Thai',
    'tr': 'Turkish',
    'uk': 'Ukrainian',
    'ur': 'Urdu',
    'vi': 'Vietnamese',
    'cy': 'Welsh'
}


**Step 3: Text Translation**  
- Defines `translate_text` function using GoogleTranslator.  
- Validates input and returns either the translation or an error message.  

In [3]:
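def translate_text(text, target_lang):
    """Translate `text` into `target_lang`; return an error message string on failure."""
    # Reject empty or non-string input before calling the translator
    if not text or not isinstance(text, str):
        return "Error: Invalid input text"

    try:
        # source="auto" lets the translator detect the input language
        translator = GoogleTranslator(source="auto", target=target_lang)
        return translator.translate(text)
    except Exception as e:
        return f"Translation error: {e}"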
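
Because `translate_text` returns error messages as ordinary strings, a caller cannot tell a failed translation from a successful one without inspecting the text. One possible alternative is to return a small result object instead. The sketch below is only an illustration: `TranslationResult`, `translate_checked`, and the `translate_fn` parameter are hypothetical names, and a stand-in backend replaces the real translator so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TranslationResult:
    """Hypothetical result type: `text` is set on success, `error` on failure."""
    text: Optional[str] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def translate_checked(
    text: str,
    target_lang: str,
    translate_fn: Callable[[str, str], str],
) -> TranslationResult:
    """Call `translate_fn(text, target_lang)` and wrap the outcome.

    `translate_fn` stands in for any translation backend; it is an
    assumption of this sketch, not part of deep_translator.
    """
    if not text or not isinstance(text, str):
        return TranslationResult(error="Invalid input text")
    try:
        return TranslationResult(text=translate_fn(text, target_lang))
    except Exception as exc:
        return TranslationResult(error=f"Translation error: {exc}")


# Stand-in backend that fakes a translation, so no network call is needed
def fake_backend(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


result = translate_checked("hello", "es", fake_backend)
print(result.ok, result.text)
```

A caller can then branch on `result.ok` before passing the text on to audio or image generation.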


**Step 4: Pronunciation Generation**  
- Defines `get_pronunciation` function using OpenAIAudioTTS.  
- Generates an audio file for the provided text.  

In [4]:
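def get_pronunciation(text, audio_path):
    """Generate speech for `text`; return the audio file path, or an error message string on failure."""
    try:
        tts_model = OpenAIAudioTTS(api_key=OPENAI_API_KEY)
        # predict() returns the path of the generated audio file
        audio_file_path = tts_model.predict(text=text, audio_path=audio_path)
        return audio_file_path
    except Exception as e:
        return f"Audio generation error: {e}"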

**Step 5: Contextual Image**  
- Defines `get_contextual_image` to generate a base64-encoded image from text. 

In [5]:
def get_contextual_image(query):
    """Generate an image for `query`; return it as a base64 string, or an error message string on failure."""
    try:
        image_gen_model = DeepInfraImgGenModel(api_key=DEEPINFRA_API_KEY, name="stabilityai/stable-diffusion-2-1")
        img = image_gen_model.generate_image_base64(query)
        return img
    except Exception as e:
        return f"Image generation error: {e}"
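
`get_contextual_image` returns the image as a base64 string rather than as a file. Step 7 hands that string to swarmauri's `base64_to_file_path`; the sketch below shows the general idea using only the standard library. It is not swarmauri's implementation, and `save_base64_image` is a hypothetical name.

```python
import base64
from pathlib import Path


def save_base64_image(b64_data: str, file_path: str) -> str:
    """Decode a base64 string and write the resulting bytes to `file_path`.

    Hypothetical stand-in for illustration; returns the path it wrote to.
    """
    image_bytes = base64.b64decode(b64_data)
    Path(file_path).write_bytes(image_bytes)
    return file_path
```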


**Step 6: Get Language Code**  
- Defines `get_language_code`, which looks up the code for a language name (case-insensitive) and returns `None` if the name is not in `SUPPORTED_LANGUAGES`.

In [6]:
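from typing import Optional

def get_language_code(language_name: str) -> Optional[str]:
    # Case-insensitive lookup of a language name; returns None if it is not supported
    language_name = language_name.lower()
    for code, name in SUPPORTED_LANGUAGES.items():
        if name.lower() == language_name:
            return code
    return None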
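
The loop above scans the dictionary on every call. With this many languages the cost is negligible, but a reverse mapping built once gives the same result with a single lookup. A sketch using a three-entry sample of the dictionary; `SAMPLE_LANGUAGES`, `NAME_TO_CODE`, and `lookup_language_code` are hypothetical names:

```python
from typing import Optional

# Small sample standing in for the notebook's SUPPORTED_LANGUAGES
SAMPLE_LANGUAGES = {
    'en': 'English',
    'fr': 'French',
    'es': 'Spanish',
}

# Reverse mapping from lower-cased language name to code, built once
NAME_TO_CODE = {name.lower(): code for code, name in SAMPLE_LANGUAGES.items()}


def lookup_language_code(language_name: str) -> Optional[str]:
    """Return the code for `language_name` (case-insensitive), or None if unknown."""
    return NAME_TO_CODE.get(language_name.lower())


print(lookup_language_code("Spanish"))  # es
print(lookup_language_code("Klingon"))  # None
```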

**Step 7: Multimodal Translation**  
- Combines text translation, audio generation, and image creation into one function.  

In [7]:
# Define the function that will handle the translation and audio/image generation
def multimodal_translator(target_lang, text_input, source_audio_path="source_audio.mp3", translated_audio_path="translated_audio.mp3"):
    # Map the dropdown's language name (e.g. "Spanish") to its code (e.g. "es")
    target_lang_code = get_language_code(target_lang)

    translated_text = translate_text(text_input, target_lang_code)

    # Narrate both the original and the translated text
    source_audio = get_pronunciation(text_input, source_audio_path)
    translated_audio = get_pronunciation(translated_text, translated_audio_path)

    # The image model returns base64 data, which is written to a file for display
    image_b64 = get_contextual_image(translated_text)
    image = base64_to_file_path(image_b64, "image.png")

    return source_audio, translated_audio, image
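
The default audio paths (`source_audio.mp3`, `translated_audio.mp3`) and the hard-coded `image.png` are reused on every call, so each request overwrites the files from the previous one. If that matters, for example when several users share one running app, one option is to give each request its own directory. A sketch using the standard library; `make_output_paths` is a hypothetical helper:

```python
import os
import tempfile


def make_output_paths() -> dict:
    """Create a fresh temporary directory and return per-request output paths.

    Hypothetical helper; the file names mirror the ones used in this notebook.
    """
    out_dir = tempfile.mkdtemp(prefix="multimodal_")
    return {
        "source_audio": os.path.join(out_dir, "source_audio.mp3"),
        "translated_audio": os.path.join(out_dir, "translated_audio.mp3"),
        "image": os.path.join(out_dir, "image.png"),
    }
```

The first two paths could be passed as `source_audio_path` and `translated_audio_path`. Using the third would require changing the `"image.png"` argument inside `multimodal_translator`. The temporary directories are not cleaned up automatically.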



**Step 8: Gradio UI**  
- Creates a Gradio interface for user input.  
- Displays the source audio, the translated audio, and the generated image.

In [9]:
import gradio as gr

# Create the Gradio interface
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            # Target Language Dropdown
            target_lang = gr.Dropdown(
                label="Target Language",
                choices=list(SUPPORTED_LANGUAGES.values()),
                value="Spanish"
            )
            # Text Input
            text_input = gr.Textbox(label="Text Input")
            # Submit Button
            submit_btn = gr.Button("Translate and Generate")
        
        with gr.Column():
            # Source Audio Output
            source_audio = gr.Audio(label="Source Audio", type="filepath")
            # Translated Audio Output
            translated_audio = gr.Audio(label="Translated Audio", type="filepath")
            # Image Output
            image_output = gr.Image(label="Generated Image", type="filepath")
    
    # Connect the button click to the function
    submit_btn.click(
        fn=multimodal_translator,
        inputs=[target_lang, text_input],
        outputs=[source_audio, translated_audio, image_output]
    )

# Launch the interface
if __name__ == "__main__":
    demo.launch(show_error=True)

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
