<a href="https://colab.research.google.com/github/sayakpaul/GCP-ML-API-Demos/blob/master/Video_Intelligence_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook presents a small demo that combines the [Video Intelligence](https://cloud.google.com/video-intelligence) and [Text-to-Speech](https://cloud.google.com/text-to-speech) APIs offered by GCP. The workflow of the demo is shown below:

<div align="center"><img src="https://i.ibb.co/DbT89pv/image.png"></img></div>

This demo requires a billing-enabled GCP project with the Video Intelligence and Text-to-Speech APIs enabled. You also need a GCP credentials key in `json` format (see [here](https://cloud.google.com/docs/authentication/getting-started)). I followed the official samples and tutorials of the APIs (available at the links above) to develop this demo.

A potential extension of this demo could help blind people navigate when they are outdoors. I developed the demo with this in mind, which is why you won't see any visual annotations.

Thanks to the [GDE program](https://developers.google.com/programs/experts/) for providing the GCP credits that made this demo possible.

<div align="center"><img src="https://i.ibb.co/ZXtwJjV/Webp-net-resizeimage.png" width="100" height="100"></img></div>

In [None]:
#@title Upload your GCP credentials key to Colab
from google.colab import files
files.upload()

In [None]:
#@title Install Python client libraries
!pip install --upgrade google-cloud-videointelligence
!pip install --upgrade google-cloud-texttospeech

In [None]:
#@title Set the path to GCP credentials key
import os
# Replace the file name below with the name of the key file you uploaded above.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/content/fast-ai-exploration-f32c198aac7e.json'
!echo $GOOGLE_APPLICATION_CREDENTIALS
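
Before calling the APIs, it can help to confirm that the uploaded file really looks like a service-account key. The following is a minimal sketch, not part of the original demo: the helper name `check_service_account_key` and the list of required fields are assumptions made for illustration, based on the fields a service-account key file normally contains.

```python
import json
from pathlib import Path

# Fields a service-account key file normally contains (assumed minimal set).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty if none)."""
    key_path = Path(path)
    if not key_path.is_file():
        return ["file not found: {}".format(path)]

    try:
        data = json.loads(key_path.read_text())
    except json.JSONDecodeError as exc:
        return ["not valid JSON: {}".format(exc)]

    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    # Report every missing field so the user can fix them in one go.
    problems = [
        "missing field: {}".format(field)
        for field in REQUIRED_FIELDS
        if field not in data
    ]
    if data.get("type") != "service_account":
        problems.append("'type' is not 'service_account'")
    return problems
```

In this notebook it could be called as `check_service_account_key(os.environ['GOOGLE_APPLICATION_CREDENTIALS'])`; an empty list means no problems were detected by these basic checks.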

In [None]:
#@title Imports
from google.cloud import videointelligence
from google.cloud import texttospeech
from IPython.display import Audio

In [None]:
#@title Utility function for label detection
#@markdown Courtesy: https://cloud.google.com/video-intelligence/docs/analyze-labels#annotating_a_file_on
def label_video_gcs(gcs_path):
    """ Detects labels given a GCS path. """

    video_client = videointelligence.VideoIntelligenceServiceClient()
    # In google-cloud-videointelligence 2.x the enums and message types live at
    # the top level of the package, and request fields are passed via `request`.
    features = [videointelligence.Feature.LABEL_DETECTION]

    mode = videointelligence.LabelDetectionMode.SHOT_AND_FRAME_MODE
    config = videointelligence.LabelDetectionConfig(label_detection_mode=mode)
    context = videointelligence.VideoContext(label_detection_config=config)

    operation = video_client.annotate_video(
        request={"input_uri": gcs_path, "features": features, "video_context": context}
    )
    print("\nProcessing video for label annotations:")
    result = operation.result(timeout=180)
    print("\nFinished processing.")

    # Process video/segment level label annotations
    # Get the first response, since we sent only one video.
    segment_labels = result.annotation_results[0].segment_label_annotations
    video_labels = []
    for segment_label in segment_labels:
        print("Video label description: {}".format(segment_label.entity.description))
        video_labels.append(segment_label.entity.description)
    
    video_labels = ", ".join(video_labels)
    return "I see " + video_labels
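
The function above speaks every segment label it receives. Each segment label in the API response also carries a confidence score on its segments, so one possible refinement is to keep only confident labels and to handle the case where nothing is left. The sketch below works on plain `(description, confidence)` pairs rather than on the API response itself; the helper name `summarize_labels`, the `0.5` threshold, and the fallback sentence are assumptions chosen for illustration.

```python
def summarize_labels(scored_labels, min_confidence=0.5):
    """Build an "I see ..." sentence from (description, confidence) pairs.

    Labels below `min_confidence` are dropped and duplicates are removed
    while preserving the original order.
    """
    kept = []
    for description, confidence in scored_labels:
        if confidence >= min_confidence and description not in kept:
            kept.append(description)

    if not kept:
        # Hypothetical fallback so the listener never hears a bare "I see".
        return "I could not recognize anything in this video."
    return "I see " + ", ".join(kept)


# Confidence values here are made up for the example.
print(summarize_labels([("street", 0.9), ("sidewalk", 0.4), ("street", 0.8)]))
```

The pairs could be collected inside the loop of `label_video_gcs` from `segment_label.entity.description` and the `confidence` of each entry in `segment_label.segments`.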

In [None]:
#@title Utility function for logo recognition
#@markdown Courtesy: https://cloud.google.com/video-intelligence/docs/logo-recognition#annotate_a_video_in
def detect_logo_gcs(gcs_path):
    """ Detects logos given a GCS path. """

    client = videointelligence.VideoIntelligenceServiceClient()
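    # In google-cloud-videointelligence 2.x the enums live at the top level of
    # the package, and request fields are passed via `request`.
    features = [videointelligence.Feature.LOGO_RECOGNITION]
    operation = client.annotate_video(
        request={"input_uri": gcs_path, "features": features}
    )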

    print("\nProcessing video for logo detection:")
    response = operation.result(timeout=180)
    print("\nFinished processing.")

    # Get the first response, since we sent only one video.
    annotation_result = response.annotation_results[0]

    # Annotations for list of logos detected, tracked and recognized in video.
    if len(annotation_result.logo_recognition_annotations) > 0:
        logos = []
        for logo_recognition_annotation in annotation_result.logo_recognition_annotations:
            entity = logo_recognition_annotation.entity

            # Keep the human-readable description of the recognized logo.
            logos.append(entity.description)
            print("Description : {}".format(entity.description))
        logos = ", ".join(logos)
        return "I see logos of " + logos
    
    else:
        return "No logos found!"

In [None]:
#@title Detect labels and logos
#@markdown Provide the GCS path of a video or select one from the dropdown.
GCS_PATH = "gs://video-api-storage/sample_video.mp4" #@param ["gs://video-api-storage/sample_video.mp4", "gs://video-api-storage/massachusetts.mp4", "gs://video-api-storage/toronto.mp4"] {allow-input: true}
labels = label_video_gcs(GCS_PATH)
logos = detect_logo_gcs(GCS_PATH)


Processing video for label annotations:

Finished processing.
Video label description: sidewalk
Video label description: street
Video label description: public space
Video label description: pedestrian

Processing video for logo detection:

Finished processing.
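
In the run above no logos were printed, so `logos` holds the string `"No logos found!"`, which would be spoken as-is in a separate audio clip. One option is to merge both results into a single announcement and skip the logo part when nothing was found. This is a minimal sketch; `build_announcement` is a hypothetical helper, not part of the original demo.

```python
def build_announcement(labels_sentence, logos_sentence,
                       no_logo_marker="No logos found!"):
    """Merge the label and logo sentences into one announcement.

    The logo sentence is skipped when it equals `no_logo_marker`, the
    string returned by `detect_logo_gcs` when no logo is recognized.
    """
    sentences = [labels_sentence.strip()]
    if logos_sentence != no_logo_marker:
        sentences.append(logos_sentence.strip())

    # Drop empty sentences and trailing periods before re-joining.
    sentences = [s.rstrip(".") for s in sentences if s]
    if not sentences:
        return ""
    return ". ".join(sentences) + "."


print(build_announcement(
    "I see sidewalk, street, public space, pedestrian", "No logos found!"
))
```

The combined string can then be passed through `text_to_ssml` and `ssml_to_audio` once, which also means a single Text-to-Speech request instead of two.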


In [None]:
#@title Utility functions for generating SSML and audio
#@markdown Courtesy: https://cloud.google.com/text-to-speech/docs/ssml-tutorial
import html

def text_to_ssml(sentence):
    # Generates SSML text from plaintext.
    # Given a plaintext sentence, this function wraps it in <speak> tags so
    # that it can be passed to the Text-to-Speech API as SSML. Special
    # characters such as "&", "<" and ">" are escaped first so that they
    # do not interfere with SSML commands.
    #
    # Args:
    # sentence: plaintext sentence
    #
    # Returns:
    # A string of SSML text based on plaintext input

    # Escape special characters, then wrap the sentence in SSML tags
    escaped_sentence = html.escape(sentence)
    ssml = "<speak>{}</speak>".format(escaped_sentence)

    # Return the SSML string
    return ssml

def ssml_to_audio(ssml_text, outfile="sample_audio.mp3"):
    # Generates an audio file from SSML text.
    #
    # Given a string of SSML text and an output file name, this function
    # calls the Text-to-Speech API. The API returns a synthetic audio
    # version of the text, formatted according to the SSML commands. This
    # function saves the synthetic audio to the designated output file.
    #
    # Args:
    # ssml_text: string of SSML text
    # outfile: string name of file under which to save audio output
    #
    # Returns:
    # The name of the file the audio was written to

    # Instantiates a client
    client = texttospeech.TextToSpeechClient()

    # Sets the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(ssml=ssml_text)

    # Builds the voice request, selects the language code ("en-US") and
    # the SSML voice gender ("MALE")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.MALE
    )

    # Selects the type of audio file to return
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Performs the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    # Writes the synthetic audio to the output file.
    with open(outfile, "wb") as out:
        out.write(response.audio_content)
        print("Audio content written to file " + outfile)

    return str(outfile)
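
Because the labels are spoken as one comma-separated sentence, a short pause after each item may make the audio easier to follow. The sketch below builds SSML with a `<break>` tag after every item and escapes special characters; the helper name `items_to_ssml` and the `300` ms default are assumptions for illustration, not part of the original demo.

```python
import html


def items_to_ssml(intro, items, pause_ms=300):
    """Wrap an intro phrase and a list of items in SSML.

    A pause of `pause_ms` milliseconds is inserted after each item, and
    characters such as "&" are escaped so they remain valid inside SSML.
    """
    parts = [html.escape(intro)]
    for item in items:
        parts.append(
            '{}<break time="{}ms"/>'.format(html.escape(item), pause_ms)
        )
    return "<speak>{}</speak>".format(" ".join(parts))


print(items_to_ssml("I see", ["sidewalk", "street", "public space"]))
```

The returned string can be handed directly to `ssml_to_audio`, since that function already expects SSML input.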

In [None]:
#@title Generate audio for labels
ssml = text_to_ssml(labels)
audio_filename = ssml_to_audio(ssml, "labels.mp3")
Audio(filename=audio_filename, autoplay=True)

Audio content written to file labels.mp3


In [None]:
#@title Generate audio for logos
ssml = text_to_ssml(logos)
audio_filename = ssml_to_audio(ssml, "logos.mp3")
Audio(filename=audio_filename, autoplay=True)

Audio content written to file logos.mp3
