# Multimodal Chatbot System

### Purpose
This notebook combines text, image, and audio models into a single multimodal chatbot and checks each capability with an example. The chatbot supports text generation, image generation, audio generation (text-to-speech), and computer vision (image description).


### Project Setup
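1. Load the necessary libraries.
2. Ensure the API keys (`DEEPINFRA_API_KEY`, `IMGBB_API_KEY`, `FAL_KEY`, and `OPENAI_API_KEY`) are set as environment variables, for example in a `.env` file read by `load_dotenv()`.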
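
One way to catch a missing key before any model is created is a small check like the one below. This is a minimal sketch and not part of the original notebook: the helper name `missing_api_keys` is hypothetical, and the key names are the four that the setup cell reads with `os.getenv`.

```python
import os

# Environment variable names read by the setup cell of this notebook
REQUIRED_KEYS = ("DEEPINFRA_API_KEY", "IMGBB_API_KEY", "FAL_KEY", "OPENAI_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the names in `required` that are unset or empty in `env`.

    `missing_api_keys` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print(f"Missing API keys: {', '.join(missing)}")
    else:
        print("All API keys are set.")
```

Running it after `load_dotenv()` would report which keys still need to be added to the `.env` file.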

In [1]:
# Import necessary libraries
import os
from dotenv import load_dotenv
from pathlib import Path
from swarmauri.utils.base64_to_img_url import base64_to_img_url
from swarmauri.llms.concrete.FalAIVisionModel import FalAIVisionModel
from swarmauri.llms.concrete.OpenAIAudioTTS import OpenAIAudioTTS
from swarmauri.llms.concrete.OpenAIModel import OpenAIModel as LLM
from swarmauri.conversations.concrete.Conversation import Conversation
from swarmauri.messages.concrete.HumanMessage import HumanMessage
from swarmauri.llms.concrete.DeepInfraImgGenModel import DeepInfraImgGenModel

# Load environment variables
load_dotenv()

# Fetch API keys
DEEPINFRA_API_KEY = os.getenv("DEEPINFRA_API_KEY")
IMGBB_API_KEY = os.getenv("IMGBB_API_KEY")
FAL_API_KEY = os.getenv("FAL_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Initialize models
llm = LLM(api_key=OPENAI_API_KEY)
llm_img_gen = DeepInfraImgGenModel(api_key=DEEPINFRA_API_KEY)
# The vision model is optional: it is only created when FAL_KEY is set
falai_vision_model = FalAIVisionModel(api_key=FAL_API_KEY) if FAL_API_KEY else None
tts = OpenAIAudioTTS(api_key=OPENAI_API_KEY)

# Output directory for audio
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

## The `multimodal_chatbot` Function

This function is the single entry point for all tasks. It routes each request to the matching model based on `mode`: `text` for text generation, `image_gen` for image generation, `audio` for text-to-speech, and `vision` for image analysis.

In [2]:
# Function to handle text, image, and audio inputs and outputs
def multimodal_chatbot(input_data, mode="text", voice="alloy", model="tts-1", output_filename="output.mp3"):
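    """
    Route a request to the text, image, audio, or vision model.

    Parameters:
    - input_data (str): Prompt text, or an image URL when mode is 'vision'.
    - mode (str): The type of task ('text', 'image_gen', 'audio', 'vision'). Default is 'text'.
    - voice (str): Voice for TTS; used only in 'audio' mode. Default is 'alloy'.
    - model (str): TTS model; used only in 'audio' mode. Default is 'tts-1'.
    - output_filename (str): Filename for the generated audio, saved under output_dir. Default is 'output.mp3'.

    Returns:
    - The text response, image URL, audio file path, or image analysis result,
      depending on mode; None if the image upload or image analysis fails.

    Raises:
    - ValueError: If mode is not one of the supported values.
    """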
    if mode == "text":
        # Generate text response
        conversation = Conversation()
        conversation.add_message(HumanMessage(content=input_data))
        llm.predict(conversation=conversation)
        response = conversation.get_last().content
        print(f"Text Response: {response}")
        return response
    
    elif mode == "image_gen":
        # Ask the LLM to expand the prompt, then generate an image from its reply
        conversation = Conversation()
        conversation.add_message(HumanMessage(content=input_data))
        llm.predict(conversation=conversation)
        detailed_description = conversation.get_last().content
        # The image model returns base64 data, which is uploaded below to obtain a URL
        image_base64 = llm_img_gen.generate_image_base64(detailed_description)
        try:
            image_url = base64_to_img_url(image_base64, IMGBB_API_KEY)
            print(f"Generated Image URL: {image_url}")
            return image_url
        except Exception as e:
            print(f"Error uploading the image: {e}")
            return None
    
    elif mode == "audio":
        # Generate audio from text
        tts.name = model
        tts.voice = voice
        output_path = output_dir / output_filename
        tts.predict(text=input_data, audio_path=str(output_path))
        print(f"Generated audio saved to: {output_path}")
        return str(output_path)
    
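    elif mode == "vision":
        # Analyze image using computer vision
        if falai_vision_model is None:
            # The vision model is only created when FAL_KEY is set
            print("Error processing image: FAL_KEY is not set, so the vision model is unavailable.")
            return None
        try:
            result = falai_vision_model.process_image(image_url=input_data, prompt="Describe the content of this image.")
            print(f"Image Analysis Result: {result}")
            return result
        except Exception as e:
            print(f"Error processing image: {e}")
            return None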

    else:
        raise ValueError("Invalid mode. Choose from 'text', 'image_gen', 'audio', or 'vision'.")
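
The `if`/`elif` chain above could also be written as a dispatch table, which keeps the list of valid modes in one place. The sketch below only illustrates that pattern and is not the notebook's implementation: `HANDLERS`, `dispatch`, and the four `handle_*` functions are hypothetical names, and the handlers return placeholder strings instead of calling the real models.

```python
from typing import Callable, Dict


# Stand-in handlers: placeholders for the real model calls, for illustration only
def handle_text(input_data: str) -> str:
    return f"text:{input_data}"


def handle_image_gen(input_data: str) -> str:
    return f"image_gen:{input_data}"


def handle_audio(input_data: str) -> str:
    return f"audio:{input_data}"


def handle_vision(input_data: str) -> str:
    return f"vision:{input_data}"


# Map each mode name to the function that serves it
HANDLERS: Dict[str, Callable[[str], str]] = {
    "text": handle_text,
    "image_gen": handle_image_gen,
    "audio": handle_audio,
    "vision": handle_vision,
}


def dispatch(input_data: str, mode: str = "text") -> str:
    """Look up the handler for `mode` and call it with `input_data`."""
    try:
        handler = HANDLERS[mode]
    except KeyError:
        valid = ", ".join(repr(name) for name in HANDLERS)
        raise ValueError(f"Invalid mode {mode!r}. Choose from {valid}.") from None
    return handler(input_data)
```

With this layout, adding a new mode means adding one function and one dictionary entry, and the error message lists the valid modes without being updated by hand.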


## Examples

In [3]:
# 1. Text Generation
example_prompt = "Tell me a short story about a scientist discovering new planets."
text_response = multimodal_chatbot(input_data=example_prompt, mode="text")

Text Response: Dr. Emily Roberts had always been passionate about astronomy. She spent countless nights gazing up at the stars, dreaming of the mysteries that lay beyond our solar system. So when she was given the opportunity to lead a team of researchers on a mission to discover new planets, she jumped at the chance.

Equipped with the latest technology and a team of brilliant scientists, Dr. Roberts set out on their journey into the vast unknown. For months, they scanned the skies, analyzing data and searching for any signs of distant planets.

And then, one fateful night, they made a groundbreaking discovery. Hidden among the sea of stars was a new solar system, unlike anything they had ever seen before. It was home to not just one, but three habitable planets, each with the potential to support life.

Dr. Roberts and her team were elated. They had unlocked a new chapter in the history of space exploration, and their names would go down in the annals of science. But more than that, 

In [4]:
# 2. Image Generation
image_prompt = "A futuristic city skyline at night with neon lights and flying cars."
image_url = multimodal_chatbot(input_data=image_prompt, mode="image_gen")

Generated Image URL: https://i.ibb.co/CMGBvtf/c6972ed60730.jpg


In [5]:
# 3. Audio Generation
sample_text = "Welcome to the text-to-speech demonstration."
audio_path = multimodal_chatbot(input_data=sample_text, mode="audio", voice="shimmer", output_filename="sample_output.mp3")

Generated audio saved to: output\sample_output.mp3
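
The backslash in the saved path above reflects that this run was on Windows: `pathlib` joins path segments with the separator of the platform it runs on. The sketch below shows the difference using the pure path classes, which do not touch the file system; the variable names are for illustration only.

```python
from pathlib import PurePosixPath, PureWindowsPath

filename = "sample_output.mp3"

# The same join produces a different separator per platform flavour
windows_path = PureWindowsPath("output") / filename
posix_path = PurePosixPath("output") / filename

print(str(windows_path))        # output\sample_output.mp3
print(str(posix_path))          # output/sample_output.mp3
print(windows_path.as_posix())  # output/sample_output.mp3
```

If the returned path needs to look the same on every platform, for example in logs, `as_posix()` gives the forward-slash form.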


In [6]:
# 4. Computer Vision
image_url = "https://llava-vl.github.io/static/images/monalisa.jpg"
vision_response = multimodal_chatbot(input_data=image_url, mode="vision")

Image Analysis Result: The image you've provided is the famous painting known as the Mona Lisa. It is a portrait of a woman with a serene and enigmatic expression. The background features a distant, imaginary landscape, which is characteristic of Leonardo da Vinci's style. The painting is renowned for its use of


# Notebook Metadata

In [7]:
import platform
import sys
from datetime import datetime

# Display author information
author_name = "Huzaifa Irshad"
github_username = "irshadhuzaifa"

print(f"Author: {author_name}")
print(f"GitHub Username: {github_username}")

# Last modified datetime (file's metadata)
notebook_file = "Notebook_02_Performance_Optimization.ipynb"
try:
    last_modified_time = os.path.getmtime(notebook_file)
    last_modified_datetime = datetime.fromtimestamp(last_modified_time)
    print(f"Last Modified: {last_modified_datetime}")
except Exception as e:
    print(f"Could not retrieve last modified datetime: {e}")

# Display platform, Python version, and Swarmauri version
print(f"Platform: {platform.system()} {platform.release()}")
print(f"Python Version: {sys.version}")

import swarmauri

try:
    version = swarmauri.__version__
except AttributeError:
    # Fallback when the package does not expose __version__
    version = "0.5.1"

print(f"Swarmauri Version: {version}")

Author: Huzaifa Irshad
GitHub Username: irshadhuzaifa
Last Modified: 2024-11-07 12:58:51.580808
Platform: Windows 11
Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
Swarmauri Version: 0.5.1
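
When a package does not expose `__version__`, the standard library's `importlib.metadata` can usually read the version from the installed distribution instead of relying on a hard-coded value. The sketch below is one way to do that: `installed_version` is a hypothetical helper, and it assumes the distribution is installed under the name passed in (for this notebook, presumably `swarmauri`).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str, default: str = "unknown") -> str:
    """Return the installed version of `package`, or `default` if it is not installed.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return default


if __name__ == "__main__":
    # Assumes the distribution name is "swarmauri"
    print(f"Swarmauri Version: {installed_version('swarmauri')}")
```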
