Cell 1: Import necessary libraries

In [1]:
# Importing necessary libraries for object detection, text-to-speech, image processing, and Gradio interface
import os
from PIL import Image
import gradio as gr
from transformers import pipeline


Cell 2: Define the object detection pipeline and the prediction function

In [3]:
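# Assumes two names are defined before this cell runs (neither is created in this notebook):
#   - 'od_pipe': a pre-trained object detection pipeline from Hugging Face Transformers,
#     e.g. built with pipeline("object-detection", model=...).
#   - 'render_results_in_image': a helper that draws the detections onto the image.
# 'get_pipeline_prediction' runs the pipeline on an input image and returns the
# image with the detected objects drawn on it.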

def get_pipeline_prediction(pil_image):
    # Run object detection pipeline on the input image
    pipeline_output = od_pipe(pil_image)
    
    # Process and render the detection results in the input image
    processed_image = render_results_in_image(pil_image, pipeline_output)
    
    # Return the image with predicted objects drawn on it
    return processed_image
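
Since 'render_results_in_image' is not defined in this notebook, here is a minimal sketch of what such a helper might look like. It assumes the usual output format of the Transformers object-detection pipeline (a list of dicts with "score", "label", and a "box" dict holding xmin/ymin/xmax/ymax). The name 'render_detections_sketch' and the 'min_score' parameter are hypothetical and are not the original helper.

```python
from PIL import Image, ImageDraw

def render_detections_sketch(pil_image, pipeline_output, min_score=0.0):
    """Hypothetical stand-in for 'render_results_in_image' (not the original helper)."""
    # convert() returns a new image, so the caller's image is left untouched
    annotated = pil_image.convert("RGB")
    draw = ImageDraw.Draw(annotated)
    for detection in pipeline_output:
        # Skip low-confidence detections
        if detection["score"] < min_score:
            continue
        box = detection["box"]
        corners = (box["xmin"], box["ymin"], box["xmax"], box["ymax"])
        draw.rectangle(corners, outline="red", width=2)
        # Write the label and score just inside the top-left corner of the box
        caption = f'{detection["label"]} {detection["score"]:.2f}'
        draw.text((box["xmin"] + 3, box["ymin"] + 3), caption, fill="red")
    return annotated

# Hand-written example in the pipeline's output format (not real model output)
example_output = [
    {"score": 0.98, "label": "cat",
     "box": {"xmin": 10, "ymin": 10, "xmax": 60, "ymax": 50}},
]
blank = Image.new("RGB", (100, 80), "white")
annotated = render_detections_sketch(blank, example_output)
```

With a helper of this shape, 'get_pipeline_prediction' only has to pass the pipeline output through unchanged, as it does above.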


Cell 3: Create the Gradio interface for object detection

In [7]:
# Creating a Gradio interface for the object detection app.
# This interface takes an image as input, processes it through the object detection pipeline,
# and displays the output image with detected objects.

demo = gr.Interface(
    fn=get_pipeline_prediction,        # Function to call for processing
    inputs=gr.Image(label="Input image", type="pil"),   # Input type: Image
    outputs=gr.Image(label="Output image with predicted instances", type="pil")  # Output type: Image
)

# 'share=True' creates a shareable public link to the app.
# The port is read from the PORT1 environment variable. Indexing os.environ['PORT1']
# directly raises KeyError: 'PORT1' when the variable is unset, so fall back to
# server_port=None, which lets Gradio pick a free port itself.
port = os.environ.get("PORT1")
demo.launch(share=True, server_port=int(port) if port else None)


Cell 4: Close the Gradio app

In [10]:
# Once you are done with the demo, you can close the Gradio interface.
demo.close()


Cell 5: Define the text summarization and TTS generation functions


In [13]:
# Assumes 'summarize_predictions_natural_language' is a helper defined elsewhere
# that takes the object detection pipeline output and summarizes it in natural language.

# Load the text-to-speech pipeline once (local copy of the Kakao Enterprise VITS model)
# so that the model is not reloaded on every call.
tts_pipe = pipeline("text-to-speech", model="./models/kakao-enterprise/vits-ljs")

def generate_narration(pipeline_output):
    # Generate a natural language description from the object detection output
    text = summarize_predictions_natural_language(pipeline_output)

    # Generate speech from the description; the result is a dict with
    # "audio" (waveform) and "sampling_rate" keys
    narrated_text = tts_pipe(text)

    # Return the TTS output (audio plus sampling rate)
    return narrated_text
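
The helper 'summarize_predictions_natural_language' is not defined in this notebook. As a rough illustration of the idea, the sketch below counts the detected labels and joins them into one sentence. The function name 'summarize_detections_sketch', the wording of the sentence, and the naive "add an s" pluralization are all assumptions made for this example, not the original helper.

```python
from collections import Counter

def summarize_detections_sketch(pipeline_output):
    """Hypothetical stand-in for 'summarize_predictions_natural_language'."""
    # Count how many times each label was detected (insertion order is kept)
    counts = Counter(detection["label"] for detection in pipeline_output)
    if not counts:
        return "No objects were detected in this image."
    # Naive pluralization: append "s" when a label occurs more than once
    phrases = [
        f"{n} {label}s" if n > 1 else f"one {label}"
        for label, n in counts.items()
    ]
    if len(phrases) == 1:
        listing = phrases[0]
    else:
        listing = ", ".join(phrases[:-1]) + " and " + phrases[-1]
    return f"This image contains {listing}."

# Hand-written detections in the pipeline's output format (not real model output)
example_output = [
    {"score": 0.99, "label": "cat", "box": {"xmin": 0, "ymin": 0, "xmax": 5, "ymax": 5}},
    {"score": 0.97, "label": "cat", "box": {"xmin": 6, "ymin": 0, "xmax": 9, "ymax": 5}},
    {"score": 0.91, "label": "remote", "box": {"xmin": 2, "ymin": 6, "xmax": 4, "ymax": 8}},
]
summary = summarize_detections_sketch(example_output)
empty_summary = summarize_detections_sketch([])
```

The resulting string is what would be handed to the text-to-speech pipeline in 'generate_narration'.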


Cell 6: Play the generated audio from text-to-speech

In [16]:
# Function to play the generated audio using IPython's Audio class.
# It accepts the TTS pipeline output (a dict with "audio" and "sampling_rate")
# and returns a player widget that the notebook displays.

from IPython.display import Audio as IPythonAudio

def play_generated_audio(narrated_text):
    # Play the audio output from the TTS model, with the appropriate sampling rate
    return IPythonAudio(narrated_text["audio"][0], rate=narrated_text["sampling_rate"])
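
Outside a notebook there is no IPython player, so one option is to write the waveform to a WAV file instead. The sketch below uses only the standard library and assumes the samples are floats in the range [-1.0, 1.0], as would be passed from narrated_text["audio"][0]. The function name 'save_narration_as_wav', the file name, and the synthetic 440 Hz tone that stands in for real TTS output are all illustrative.

```python
import math
import os
import struct
import tempfile
import wave

def save_narration_as_wav(samples, sampling_rate, path):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)          # mono
        wav_file.setsampwidth(2)          # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))

# One second of a 440 Hz tone stands in for the TTS waveform (illustrative values)
rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
wav_path = os.path.join(tempfile.gettempdir(), "narration_sketch.wav")
save_narration_as_wav(tone, rate, wav_path)

# Read the file back to confirm what was written
with wave.open(wav_path, "rb") as check:
    frame_rate, frame_count = check.getframerate(), check.getnframes()
```

With real TTS output, the call would look like save_narration_as_wav(narrated_text["audio"][0], narrated_text["sampling_rate"], "narration.wav").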


Cell 7: Combine Object Detection and TTS

In [19]:
# Combining object detection and TTS models. After detecting objects in the image,
# the output is summarized in natural language and converted to speech.

def detect_and_narrate(pil_image):
    # Step 1: Perform object detection
    pipeline_output = od_pipe(pil_image)
    
    # Step 2: Generate the processed image with detection results
    processed_image = render_results_in_image(pil_image, pipeline_output)
    
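    # Step 3: Describe the detected objects in natural language and synthesize speech
    narration = generate_narration(pipeline_output)

    # gr.Audio expects a (sampling_rate, waveform) tuple or a file path, not the raw
    # TTS dict, so unpack it the same way play_generated_audio does
    audio = (narration["sampling_rate"], narration["audio"][0])

    # Return the processed image and the narrated audio
    return processed_image, audio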


Cell 8: Gradio Interface for Object Detection + TTS

In [22]:
# Create a Gradio interface for the combined object detection and TTS pipeline.
# The app will return both the processed image and the narrated description as audio.

demo_with_tts = gr.Interface(
    fn=detect_and_narrate,  # Function to call for processing the image and generating narration
    inputs=gr.Image(label="Input image", type="pil"),  # Input type: Image
    outputs=[gr.Image(label="Processed image with detected objects", type="pil"),  # Output type: Processed image
             gr.Audio(label="Audio narration of objects")]  # Output type: Audio (narrated description)
)

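# Launch the app with a shareable link. Indexing os.environ['PORT1'] directly raises
# KeyError: 'PORT1' when the variable is unset, so fall back to server_port=None,
# which lets Gradio pick a free port itself.
port = os.environ.get("PORT1")
demo_with_tts.launch(share=True, server_port=int(port) if port else None)
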

Cell 9: Close the Gradio app when done

In [None]:
# After using the Gradio app, it can be closed to free up resources.
demo_with_tts.close()


Explanation of the process:
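Object Detection: The get_pipeline_prediction function runs the object detection pipeline (od_pipe) on an input image and returns the image with the detected objects drawn on it.

Text Summarization: Inside generate_narration, the helper summarize_predictions_natural_language turns the detection output into a natural language description of the detected objects.

Text-to-Speech: generate_narration then converts that description into speech with a pre-trained text-to-speech pipeline (the Kakao Enterprise VITS model), which returns the waveform together with its sampling rate.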

Gradio Interface: The demo_with_tts Gradio interface allows users to upload an image, see the detected objects, and hear a narrated description of what is in the image.

