# Use the TTS Avatar Service to Create Engaging Videos
This notebook guides you through creating avatar-based videos using the Azure TTS Avatar API and your own scripts. You can use these videos for purposes such as education or communication with customers and partners.

First, we'll walk you through creating the script for your avatar using the 'Audio Content Creation' tool. Next, we'll use the Azure TTS Avatar service to turn your script into a video with a custom avatar.

We'll then show you how to combine the avatar video with a content image or video created using PowerPoint. To finalize your video, you can use the FFmpeg command line tool for simplicity or the Clipchamp video editor for more advanced options.

__Additional resources:__

- Available neural voices and languages: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/language-support?tabs=tts
- SSML overview: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup
- Export PowerPoint slides as 1920x1080 resolution PNGs: https://youtu.be/Cv7vGce25rs
- FFmpeg command line tool for post processing of videos: https://ffmpeg.org/download.html
- Free video editing tool Clipchamp: https://clipchamp.com/

__Prerequisites:__  
You need an Azure subscription with a Speech resource to use the service. Add `SPEECH_SERVICE_REGION` and `SPEECH_SERVICE_API_KEY` to the `.env-avatar-video` file in this folder. You can find both values in the Azure portal under 'Keys and Endpoint' in the Speech resource.
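
As a quick sanity check before running the setup cells, the sketch below reports which of the two settings are missing from a set of environment values. The helper name `missing_settings` is hypothetical and not part of this notebook or any Azure library, and the region value is only a placeholder.

```python
import os

REQUIRED_SETTINGS = ("SPEECH_SERVICE_REGION", "SPEECH_SERVICE_API_KEY")


def missing_settings(env=None):
    """Return the required setting names that are absent or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an explicit dictionary instead of the real environment
print(missing_settings({"SPEECH_SERVICE_REGION": "westeurope"}))
```

Calling `missing_settings()` without arguments after `load_dotenv(...)` checks the values that were actually loaded.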

In [None]:
%pip install python-dotenv --quiet

## Setup

In [4]:
import requests
import json
import time
import os
from dotenv import load_dotenv

In [5]:
if not load_dotenv('./.env-avatar-video'):
    raise Exception("env file not found")

service_region = os.getenv("SPEECH_SERVICE_REGION")
subscription_key = os.getenv("SPEECH_SERVICE_API_KEY")
url_base = f"https://{service_region}.api.cognitive.microsoft.com"

In [6]:
project_folder = './my-project'  # your project folder
os.makedirs(project_folder, exist_ok=True)

texttype = 'Plaintext'  # SSML or Plaintext
ssml_path = os.path.join(project_folder, 'ssml.txt')  # if your avatar text input is in SSML format
plaintext_path = os.path.join(project_folder, 'plaintext.txt')  # if your avatar text input is plaintext format


In [7]:
# Helper functions

def download_file(url, local_path):
    """
    Download a file from a given URL to a local path. This function streams the file from the URL and writes it in chunks to the local
    file system. This allows it to handle large files that might not fit in memory.

    Parameters:
    url (str): The URL of the file to download.
    local_path (str): The local path where the file should be saved.

    Returns:
    str: The local path to the downloaded file.
    """
    with requests.get(url, stream=True) as r:
        r.raise_for_status()

        # Extract filename from URL
        filename = url.split("/")[-1].split("?")[0]

        local_filename = os.path.join(local_path, filename)

        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                if chunk:
                    f.write(chunk)

    return local_filename
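
The helper above derives the filename by splitting the URL string, which can pick the wrong segment if the query string itself contains a `/`. As a hedged alternative, the same step can be sketched with the standard library's URL parser. The function name `filename_from_url` and the example URL are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))


# Placeholder URL with a made-up query string that contains a slash
print(filename_from_url("https://example.com/results/0001.mp4?sig=abc/123"))
```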

## Provide text input for the avatar
Your avatar's speaking text can be provided as a plain text file, `plaintext.txt`, or in the Speech Synthesis Markup Language (SSML) format using the file `ssml.txt`. SSML provides more flexibility to adjust neural voices, control pronunciation, and add gestures to your avatar. You can use the __Audio Content Creation__ tool in the [Speech Studio](https://speech.microsoft.com/portal) to easily generate SSML for various neural voices.

<img src="./media/audio-content-creation.png" alt="drawing" style="width:800px;"/>

Switch to the SSML view when you are satisfied with the results and copy the SSML content into the `ssml.txt` file of your project folder before you execute the next cell.
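
If you start from plain text and want SSML without opening Speech Studio, a minimal wrapper could look like the sketch below. The function name `plaintext_to_ssml` is hypothetical. The namespace and default voice name are taken from the SSML example in this notebook, and the text is XML-escaped so that characters such as `&` and `<` do not break the markup.

```python
from xml.sax.saxutils import escape, quoteattr


def plaintext_to_ssml(text, voice="en-US-JennyMultilingualNeural", lang="en-US"):
    """Wrap plain text in a minimal <speak>/<voice> SSML envelope."""
    return (
        '<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>{escape(text)}</voice>'
        '</speak>'
    )


print(plaintext_to_ssml("Tips & tricks for <new> users"))
```

The result can be saved as `ssml.txt` in the project folder and then refined by hand.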

## Working with SSML directly
Alternatively, you can create or edit your SSML file manually. Here is an example that illustrates a few customization options. Select the `ssml-example` project folder to try it out.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
  <voice name="en-US-JennyMultilingualNeural">
    This example shows what you can do with SSML to let your avatar speak in various languages and how to add effects. 
    We are using the voice Jenny Multilingual, which is able to speak in several languages.
    <lang xml:lang="fr-FR">Voici le talent linguistique Jenny parlant français. Ma voix est bien reconnaissable.</lang>
  </voice>
  <voice name="en-US-JennyNeural">
    I can speak in 15 different styles.
    <s />
    <mstts:express-as style="shouting">If you want a shouting Avatar, no problem for me!</mstts:express-as>
    <s />
    <mstts:express-as style="whispering">Or what about some whispered dialog?</mstts:express-as>
    You can insert a pause <break strength="medium" /> in my speech with a break tag.
    Use a phoneme to pronounce specific words correctly like OpenAI's <phoneme alphabet="ipa" ph="ˈdɑli">DALLE</phoneme> model. Without the phoneme, the neural voice would say DALLE, which is not correct.
    Adjust the speaking speed <prosody rate="+30.00%">so I can talk really fast if that's what you want</prosody><prosody rate="-40.00%"> or talk quite slowly.</prosody>
    <prosody pitch="+10.00%">Feel free to adjust the pitch for more highness of the sound.</prosody><prosody pitch="-20.00%"> Or the opposite if you like that better.</prosody>
    <prosody volume="-80.00%">Finally, here is how you reduce the volume.</prosody>
    <prosody volume="+40.00%">Or make me speak louder.</prosody>
    We hope that these examples help you to customize your avatar's communication for more engaging experiences.
    <s />
    <mstts:express-as style="friendly">Have fun!</mstts:express-as>
    <s />
  </voice>
</speak>
```
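
When editing SSML by hand, it is easy to leave a tag unclosed. The sketch below checks that a document is well-formed XML and lists the voices it uses. It does not check the Speech service's own rules (for example, which styles a voice supports). The function name `voices_in_ssml` is hypothetical.

```python
import xml.etree.ElementTree as ET

SSML_NS = "{http://www.w3.org/2001/10/synthesis}"


def voices_in_ssml(ssml_text):
    """Parse SSML and return the voice names it uses, in document order.

    Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
    """
    root = ET.fromstring(ssml_text)
    return [voice.get("name") for voice in root.iter(f"{SSML_NS}voice")]


sample = (
    '<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">'
    '<voice name="en-US-JennyNeural">Hello</voice>'
    '</speak>'
)
print(voices_in_ssml(sample))
```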

In [8]:
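# Load the speaking script that matches the selected text type.
# Assumption: the script files are saved as UTF-8 (needed for IPA and accented characters).
if texttype == 'SSML':
    with open(ssml_path, 'r', encoding='utf-8') as file:
        content = file.read()
elif texttype == 'Plaintext':
    with open(plaintext_path, 'r', encoding='utf-8') as file:
        content = file.read()
else:
    # Fail early; otherwise `content` would be undefined in the following cells
    raise ValueError(f'texttype must be either "SSML" or "Plaintext", got "{texttype}" instead.')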

In [None]:
content

## Generate avatar video
You can specify the avatar character, style, and further settings below. Note that video synthesis takes a few minutes, depending on the length of your speaking script.
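
The FFmpeg examples later in this notebook expect an avatar video with a transparent background. Going by the comments in the configuration cell (webm and vp9 are required for transparency, and the background color can be set to 'transparent'), such a configuration might look like the sketch below. The helper name `transparent_avatar_config` is hypothetical; the keys mirror the `avatarConfig` used in this notebook.

```python
def transparent_avatar_config(character="Lisa", style="casual-sitting"):
    """Build an avatarConfig dict for a prebuilt avatar on a transparent background."""
    return {
        "customized": False,
        "talkingAvatarCharacter": character,
        "talkingAvatarStyle": style,
        "videoFormat": "webm",  # webm is required for a transparent background
        "videoCodec": "vp9",    # vp9 is required for a transparent background
        "subtitleType": "soft_embedded",
        "backgroundColor": "transparent",
    }


config = transparent_avatar_config()
print(config["videoFormat"], config["videoCodec"], config["backgroundColor"])
```

To use it, assign the result to `payload['avatarConfig']` before submitting the job.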

In [None]:
# Set to True for a custom avatar, False for a prebuilt avatar

customized = False


payload = {
    'synthesisConfig': {
        "voice": 'en-US-JennyMultilingualNeural',
    },
    # Replace with your custom voice name and deployment ID if you want to use custom voice.
    # Multiple voices are supported, the mixture of custom voices and platform voices is allowed.
    # Invalid voice name or deployment ID will be rejected.
    'customVoices': {
        # "YOUR_CUSTOM_VOICE_NAME": "YOUR_CUSTOM_VOICE_ID"
    },
    "inputKind": texttype,
    "inputs": [
        {
            "content": content,
        },
    ],
}


if customized:
    payload['avatarConfig'] = {
        "customized": customized,
        "talkingAvatarCharacter": 'Lisa-casual-sitting',  # talking avatar character
        "videoFormat": "mp4",  # mp4 or webm, webm is required for transparent background
        "videoCodec": "h264",  # hevc, h264 or vp9, vp9 is required for transparent background; default is hevc
        "subtitleType": "soft_embedded",
        "backgroundColor": "#FFFFFFFF", # background color in RGBA format, default is white; can be set to 'transparent' for transparent background
        # "backgroundImage": "https://samples-files.com/samples/Images/jpg/1920-1080-sample.jpg", # background image URL, only support https, either backgroundImage or backgroundColor can be set
    }
else:
    payload['avatarConfig'] = {
        "customized": customized,
        "talkingAvatarCharacter": 'Lisa',  # talking avatar character
        "talkingAvatarStyle": 'casual-sitting',  # talking avatar style, required for prebuilt avatar, optional for custom avatar
        "videoFormat": "mp4",  # mp4 or webm, webm is required for transparent background
        "videoCodec": "h264",  # hevc, h264 or vp9, vp9 is required for transparent background; default is hevc
        "subtitleType": "soft_embedded",
        "backgroundColor": "#FFFFFFFF", # background color in RGBA format, default is white; can be set to 'transparent' for transparent background
        # "backgroundImage": "https://samples-files.com/samples/Images/jpg/1920-1080-sample.jpg", # background image URL, only support https, either backgroundImage or backgroundColor can be set
    }


payload

In [13]:
import json
import logging
import os
import sys
import time
import uuid
import requests

logging.basicConfig(stream=sys.stdout, level=logging.INFO,  # set to logging.DEBUG for verbose output
        format="[%(asctime)s] %(message)s", datefmt="%m/%d/%Y %I:%M:%S %p %Z")
logger = logging.getLogger(__name__)



API_VERSION = "2024-04-15-preview"


def _create_job_id():
    # the job ID must be unique in current speech resource
    # you can use a GUID or a self-increasing number
    return str(uuid.uuid4())  # return a string to match the job_id type hints


def _authenticate():
    return {'Ocp-Apim-Subscription-Key': subscription_key}


def submit_synthesis(job_id: str):
    """
    Submits a batch avatar synthesis job to the specified URL.
    Args:
        job_id (str): The ID of the job to be submitted.
    Returns:
        bool: True if the job was submitted successfully, False otherwise.
    Logs:
        - Info: When the job is submitted successfully, including the job ID.
        - Error: When the job submission fails, including the status code and error message.
    """
    url = f'{url_base}/avatar/batchsyntheses/{job_id}?api-version={API_VERSION}'
    header = {
        'Content-Type': 'application/json'
    }
    header.update(_authenticate())
    

    response = requests.put(url, json.dumps(payload), headers=header)
    if response.status_code < 400:
        logger.info('Batch avatar synthesis job submitted successfully')
        logger.info(f'Job ID: {response.json()["id"]}')
        return True
    else:
        logger.error(f'Failed to submit batch avatar synthesis job: [{response.status_code}], {response.text}')
        return False


def get_synthesis(job_id):
    """
    Retrieve the status and result of a batch synthesis job.

    This function sends a GET request to the batch synthesis endpoint to check the status of a job.
    If the job is successful, it downloads the resulting video file to the specified project folder.

    Args:
        job_id (str): The unique identifier of the batch synthesis job.

    Returns:
        str: The status of the batch synthesis job. Possible values include 'Succeeded', 'Failed', etc.

    Logs:
        - Debug: When the batch synthesis job is successfully retrieved.
        - Info: When the batch synthesis job succeeds and the video is downloaded.
        - Error: When the request to retrieve the batch synthesis job fails.
    """
    url = f'{url_base}/avatar/batchsyntheses/{job_id}?api-version={API_VERSION}'
    header = _authenticate()

    response = requests.get(url, headers=header)
    if response.status_code < 400:
        logger.debug('Get batch synthesis job successfully')
        logger.debug(response.json())
        if response.json()['status'] == 'Succeeded':
            logger.info(f'Batch synthesis job succeeded, download URL: {response.json()["outputs"]["result"]}')
            download_file(response.json()["outputs"]["result"], project_folder)
            logger.info(f'Video downloaded to {project_folder}')
        return response.json()['status']
    else:
        logger.error(f'Failed to get batch synthesis job: {response.text}')


def list_synthesis_jobs(skip: int = 0, max_page_size: int = 100):
    """List all batch synthesis jobs in the subscription"""
    url = f'{url_base}/avatar/batchsyntheses?api-version={API_VERSION}&skip={skip}&maxpagesize={max_page_size}'
    header = _authenticate()

    response = requests.get(url, headers=header)
    if response.status_code < 400:
        logger.info(f'List batch synthesis jobs successfully, got {len(response.json()["values"])} jobs')
        logger.info(response.json())
    else:
        logger.error(f'Failed to list batch synthesis jobs: {response.text}')

In [None]:
job_id = _create_job_id()
if submit_synthesis(job_id):
    while True:
        status = get_synthesis(job_id)
        if status == 'Succeeded':
            logger.info('batch avatar synthesis job succeeded')
            break
        elif status == 'Failed':
            logger.error('batch avatar synthesis job failed')
            break
        else:
            logger.info(f'batch avatar synthesis job is still running, status [{status}]')
            time.sleep(5)
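
The loop above keeps polling until the job succeeds or fails, so it never ends if the status request keeps failing (in which case `get_synthesis` returns `None`). A bounded variant can be sketched as follows. The function name `poll_until_done` and the `'TimedOut'` return value are hypothetical and not part of the service API. The status sequence is simulated so the example runs without a Speech resource.

```python
import time


def poll_until_done(get_status, interval_s=5.0, timeout_s=1800.0, sleep=time.sleep):
    """Call get_status() until it returns 'Succeeded' or 'Failed', or time runs out.

    Returns the final status, or 'TimedOut' if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        if time.monotonic() >= deadline:
            return "TimedOut"
        sleep(interval_s)


# Simulated status sequence standing in for repeated get_synthesis(job_id) calls
fake_statuses = iter(["NotStarted", "Running", None, "Succeeded"])
print(poll_until_done(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

In the notebook, this would be called as `poll_until_done(lambda: get_synthesis(job_id))`.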

## Add content to your avatar video
In addition to the avatar video, we require another asset to represent the content we wish to display alongside the avatar. This could be either high-resolution images or a separate video, depending on your specific needs. 

<img src="./media/content-avatar.png" alt="drawing" style="width:1200px;"/>

PowerPoint is an excellent tool for creating this supplemental content, and here's how you can do it:

1. __Export slides as high-resolution images__: PowerPoint allows you to export slides as images. For optimal results, we recommend exporting the slides as 1920x1080 PNG files. You can follow this brief tutorial on how to do so: [Export slides as 1920x1080 PNGs](https://youtu.be/Cv7vGce25rs).

2. Alternatively, __export the entire presentation as an MP4 video__: If you prefer a video over static images, PowerPoint can export your whole animated presentation as an MP4 file. Here's how you can do it:  
Go to __File__ > __Export__ > __Create a Video__ in PowerPoint. Use slide transition and animation durations to adjust timings. The best way to align avatar and content video timing is to use a video editor (see Option 2).

> **Tip:** For smooth synchronization, match the PowerPoint slide transition durations with the timestamps from your avatar video. This preemptive alignment minimizes the need for later timing adjustments.
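
To turn timestamps into slide timings, subtract each slide's start time from the next one. The sketch below does this for a list of cue points. The function name `slide_durations` and the example numbers are made up for illustration.

```python
def slide_durations(cue_times, video_length):
    """Convert slide start times (seconds) into per-slide display durations.

    cue_times: ascending start time of each slide within the avatar video.
    video_length: total length of the avatar video in seconds.
    """
    if not cue_times:
        return []
    if list(cue_times) != sorted(cue_times) or cue_times[-1] > video_length:
        raise ValueError("cue_times must be ascending and within the video length")
    ends = list(cue_times[1:]) + [video_length]
    return [round(end - start, 3) for start, end in zip(cue_times, ends)]


# Made-up cue points: slides start at 0 s, 12.5 s, and 31 s in a 45 s avatar video
print(slide_durations([0, 12.5, 31], 45))
```

Each resulting value is the time a slide should stay on screen before the transition to the next one.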

Lastly, there are various methods available for integrating the avatar video and the content assets to produce your final video output.

### Option 1: Use the FFmpeg command line 

Install FFmpeg on a Linux system (see the [FFmpeg website](https://ffmpeg.org/download.html) for Windows and Mac options):
```bash
sudo apt update
sudo apt install ffmpeg
```
#### Examples
Start your terminal and navigate to the project folder.

>Add a content image as background to transparent avatar video:
>
>```bash
>ffmpeg -i content.png -vcodec libvpx-vp9 -i 0001.webm -filter_complex "overlay=(main_w-overlay_w)/2:(main_h-overlay_h)/2" -map 1:a output.mp4
>```

>Add a content video as background to transparent avatar video:
>
>```bash
>ffmpeg -i content.mp4 -vcodec libvpx-vp9 -i 0001.webm -filter_complex "overlay=(main_w-overlay_w)/2:(main_h-overlay_h)/2" -map 1:a output.mp4
>```

>Add a content video and an audio background music file with reduced volume:
>```bash
>ffmpeg -i content.mp4 -vcodec libvpx-vp9 -i 0001.webm -i background.wav -filter_complex "[2:a]volume=0.3[bg]; [1:a][bg]amix=inputs=2:duration=first[a]; overlay=(main_w-overlay_w)/2:(main_h-overlay_h)/2[v]" -map "[v]" -map "[a]" output.mp4
>```

>Crop avatar and move it to the right:
>```bash
>ffmpeg -i content.mp4 -vcodec libvpx-vp9 -i 0001.webm -filter_complex "[1:v]crop=440:1042:740:38[webm];[0:v][webm]overlay=W-w-160:38[outv]" -map "[outv]" -map 1:a output.mp4
>```
>The command above is tuned to the default size of the Lisa avatar in the "technical standing" style. The general pattern for cropping and repositioning is as follows:
>
>```bash
>ffmpeg -i content.mp4 -vcodec libvpx-vp9 -i 0001.webm -filter_complex "[1:v]crop=w:h:x:y[webm];[0:v][webm]overlay=W-w-10:10[outv]" -map "[outv]" -map 1:a output.mp4
>```
>- __[1:v]crop=w:h:x:y[webm]__ is the cropping filter. Replace w, h, x, and y with the width, height, and the x, y coordinates of the top-left corner of the crop rectangle.
>- __[0:v][webm]overlay=W-w-10:10[outv]__ is the overlay filter. The overlay is positioned 10 pixels from the right edge and 10 pixels from the top.
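
If you want to try several crop rectangles or margins, assembling the command in Python avoids editing the filter string by hand. The sketch below rebuilds the crop-and-overlay pattern from the example above. The function name `build_overlay_command` is hypothetical, and the file names are the ones used in this notebook's FFmpeg examples.

```python
import shlex


def build_overlay_command(content, avatar, crop, margin_right=10, margin_top=10,
                          output="output.mp4"):
    """Return the FFmpeg argument list for cropping the avatar and placing it top-right.

    crop is a (w, h, x, y) tuple: crop width, height, and top-left corner.
    """
    w, h, x, y = crop
    filter_graph = (
        f"[1:v]crop={w}:{h}:{x}:{y}[webm];"
        f"[0:v][webm]overlay=W-w-{margin_right}:{margin_top}[outv]"
    )
    return [
        "ffmpeg", "-i", content,
        "-vcodec", "libvpx-vp9", "-i", avatar,
        "-filter_complex", filter_graph,
        "-map", "[outv]", "-map", "1:a",
        output,
    ]


cmd = build_overlay_command("content.mp4", "0001.webm", (440, 1042, 740, 38),
                            margin_right=160, margin_top=38)
print(shlex.join(cmd))  # shell-quoted command line for inspection
```

The list can be passed to `subprocess.run(cmd, check=True)` to run FFmpeg from the notebook, assuming `ffmpeg` is on the `PATH`.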



### Option 2: Use a video editing tool

Using a video editing tool provides more intuitive and flexible options for generating your final output video. The following screenshot illustrates how to use the free edition of Microsoft Clipchamp.

<img src="./media/Clipchamp.png" alt="drawing" style="width:800px;"/>

You can multi-select the required files in Windows Explorer and select "Edit with Clipchamp" in the context menu. In our example, we have selected the avatar video `0001.webm`, the content video `content.mp4`, and an audio file for background music `background.wav`.

Ensure that you add the content as tracks in the video editor, positioning the avatar track with the transparent background on top. Then, align the timing of your tracks using the video cutting option. You can also use the tool to reposition your avatar, generate captions for subtitles, and add further content and special effects.
Lastly, use the export button to create your final video.

