
Whisper, Stable Diffusion on U-Net, and ChatGPT AI models bundled in a Unity project, with the locally run models powered by ONNX Runtime. Together they transcribe a podcast's audio to text and generate contextual images tied to the transcribed text.


Project

Watching a static sound wave or unrelated images while a podcast plays on a platform that supports video and images can be dull.

We propose asking a trio of AI models, bundled in a Unity project, to transcribe the audio to text and generate contextual images closely tied to the transcribed text.

diagram-flow

 

We run two of the AI models locally, Whisper-Tiny and Stable Diffusion with its U-Net architecture; we access a third, ChatGPT, remotely via its API.

diagram-flow

 

In a Unity scene we loop the AI models over each podcast audio section to generate the contextual images.

Talkomic-tecshift-image-gen-speed.30fps.mp4

 

Watch The Trailer🎬

Talkomic_trailer.30fps.mp4

 

Project Motivation

I am new to AI and keen to tinker and learn!💥 The prototype is a good starting point as a proof of concept to test the ability of AI models to help audio media extend its reach.

 

Proof of Concept Results: Talkomic App Prototype - "A Chat into Images"

Special thanks to Jason Gauci, co-host of the Programming Throwdown podcast, whose idea shared on the show served as the inspiration for this prototype.

I am thrilled and truly grateful to Maurizio Raffone at the Tech Shift F9 Podcast for trusting me to run a proof of concept of the Talkomic app prototype with the audio file of a fantastic episode of the podcast.

 

Get Crisper Images with the ESRGAN AI Model

Finally, once the models have generated all the images, we upscale them from 512×512 to a crisper 2048×2048 resolution with the Real-ESRGAN AI model. Suggested implementation steps are in our blog.

512×512 image | 2048×2048 upscaled image

 

Project's Blog Post

This is a prototype repo for a proof of concept. Read the Talkomic app blog for the suggested steps to build the project in Unity:

 

Unity Project Features and Setup

  • This project has been updated to build for Windows. See the "Update: Windows Build Functionality Complete" section below.

  • The AI models in the Unity project of this repo are powered by Microsoft's cross-platform ONNX Runtime; a minimal sketch of loading and running a model with the ONNX Runtime C# API follows.
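
A minimal sketch, assuming the Microsoft.ML.OnnxRuntime C# package; the model path, input name, and tensor shape below are illustrative, not the repo's exact values:

```csharp
// Minimal sketch of loading and running an ONNX model with Microsoft.ML.OnnxRuntime.
// The model path, input name, and shape are illustrative, not the repo's exact values.
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using UnityEngine;

public static class OnnxRuntimeExample
{
    public static void RunOnce()
    {
        // Hypothetical model location under StreamingAssets.
        string modelPath = Application.streamingAssetsPath + "/Models/unet/model.onnx";

        using (var session = new InferenceSession(modelPath))
        {
            // The input name and shape must match the model's metadata (see session.InputMetadata).
            var input = new DenseTensor<float>(new[] { 1, 4, 64, 64 });
            var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("sample", input) };

            using (var results = session.Run(inputs))
            {
                var output = results.First().AsTensor<float>();
                Debug.Log($"Output tensor length: {output.Length}");
            }
        }
    }
}
```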

  • Native DLLs (ONNX Runtime, NAudio, etc.) required: add the following packages to the Visual Studio solution (tested with VS2022 v17.7.3) and copy the DLLs into Unity's Assets/Plugins directory.

native-dlls native-dlls_vs2022
  • Clone and save the weights.pb weights file into Assets/StreamingAssets/Models/unet/ . This step is also required for this repo's Release package (the file is too large to include). If the model download is unavailable, try here.

  • Podcast Audio Section List Required: In script TalkomicManager.cs, at GenerateSummaryAndTimesAudioQueueAndDirectories(), create a list entry for each section in the podcast audio with the section_name and its start time in minutes:seconds (a hypothetical sketch follows the snapshot below).

    • Unity will generate an output directory for each section and save the transcribed text, the ChatGPT image description, and the generated images for that section.

      output-snapshot
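
As a rough illustration, the list created in GenerateSummaryAndTimesAudioQueueAndDirectories() might look something like the following; the type and field names are hypothetical, not the exact ones used in TalkomicManager.cs:

```csharp
// Hypothetical sketch of a podcast section list (section_name + start time);
// the actual data structure in TalkomicManager.cs may differ.
using System.Collections.Generic;

public struct PodcastSection
{
    public string SectionName;  // also used to name the section's output directory
    public string StartTime;    // "minutes:seconds", e.g. "12:45"
}

public static class SectionListExample
{
    public static List<PodcastSection> BuildSectionList()
    {
        return new List<PodcastSection>
        {
            new PodcastSection { SectionName = "intro",       StartTime = "0:00"  },
            new PodcastSection { SectionName = "guest-intro", StartTime = "2:30"  },
            new PodcastSection { SectionName = "main-topic",  StartTime = "10:15" },
        };
    }
}
```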
  • Podcast Audio Chunks: The Whisper model is designed to work on audio samples of up to 30 s in duration. Hence we split each section's audio into chunks of at most 30 seconds and load them as a queue into Whisper-Tiny for each podcast section (a minimal chunking sketch follows the image below).

    audio-chunks
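
A minimal chunking sketch, assuming NAudio's WaveFileReader/WaveFileWriter and a PCM WAV input; the repo's actual chunking code may differ:

```csharp
// Minimal sketch: split a PCM WAV file into chunks of at most 30 seconds with NAudio.
// Output naming and the lack of error handling are illustrative only.
using System.Collections.Generic;
using System.IO;
using NAudio.Wave;

public static class AudioChunkingExample
{
    public static List<string> ChunkWav(string inputPath, string outputDir, int chunkSeconds = 30)
    {
        var chunkPaths = new List<string>();
        using (var reader = new WaveFileReader(inputPath))
        {
            int bytesPerChunk = reader.WaveFormat.AverageBytesPerSecond * chunkSeconds;
            var buffer = new byte[bytesPerChunk];
            int chunkIndex = 0;
            int bytesRead;
            while ((bytesRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                string chunkPath = Path.Combine(outputDir, $"chunk_{chunkIndex++}.wav");
                using (var writer = new WaveFileWriter(chunkPath, reader.WaveFormat))
                {
                    writer.Write(buffer, 0, bytesRead);
                }
                chunkPaths.Add(chunkPath);
            }
        }
        return chunkPaths;
    }
}
```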
  • AI Generated Images: Shown in the scene along with the transcribed text and the ChatGPT image description.

    N.B. - A black image is most likely caused by the not-safe-for-work filter being triggered.

    scene-progress
  • Scene Control Input Variables:

    • Script: TalkomicManager.cs:

      • pathToAudioFile: full path to the podcast audio file. The audio file must be in sync with the list of section names and start times created in coroutine GenerateSummaryAndTimesAudioQueueAndDirectories(). For example, in the case of the Tech Shift F9 E8 podcast, the sections were broken out by the host as shown below, along with their start times.

        image

        You'd need to create these sections for your podcast file. Unity will chunk the audio for each section into WAV files of at most 30 seconds. The chunks belonging to a section are transcribed in sequence and stitched back together, and each transcribed section is then sent to ChatGPT to generate a description of an image for that section's text.

        image
      • custom_chatgpt_pre_prompt: Text prepended to the transcribed message sent to ChatGPT to guide ChatGPT's response.

      • custom_diffuser_pre_prompt: Text prepended to the Stable Diffusion prompt for the image to be created. It guides the Stable Diffusion result.

      • limitChatGPTResponseWordCount: Trims the prompt sent to Stable Diffusion to 50 words to avoid the prompt-length limit exception.

      • maxChatgptRequestedResponseWords: Maximum number of words ChatGPT is asked to respond with.

      • numStableDiffusionImages: Number of images generated from a single ChatGPT image-description prompt.

      • steps: Number of Stable Diffusion denoising steps.

      • ClassifierFreeGuidanceScaleValue: Stable Diffusion classifier-free guidance scale.
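
As a rough, hypothetical view, these inputs could be exposed as serialized fields on the manager component; the actual declarations and default values in TalkomicManager.cs may differ:

```csharp
// Hypothetical illustration of the inspector-exposed control variables described above;
// the class name and default values are placeholders, not the repo's settings.
using UnityEngine;

public class TalkomicManagerFieldsSketch : MonoBehaviour
{
    [SerializeField] private string pathToAudioFile;                  // full path to the podcast audio file
    [SerializeField] private string custom_chatgpt_pre_prompt;       // prepended to the transcription sent to ChatGPT
    [SerializeField] private string custom_diffuser_pre_prompt;      // prepended to the Stable Diffusion prompt
    [SerializeField] private int limitChatGPTResponseWordCount = 50; // trims the diffusion prompt word count
    [SerializeField] private int maxChatgptRequestedResponseWords;   // max words requested from ChatGPT
    [SerializeField] private int numStableDiffusionImages = 1;       // images per ChatGPT description
    [SerializeField] private int steps = 20;                         // Stable Diffusion denoising steps
    [SerializeField] private float ClassifierFreeGuidanceScaleValue = 7.5f; // classifier-free guidance scale
}
```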

    • ChatGPT Scriptable Object for API Credentials and Request Arguments: Script ChatgptCreds.cs (a hypothetical sketch of the asset follows this list).

      • Create the scriptable object and add it as a property of the RunChatgpt.cs component on the Hierarchy object "RunChatGPT"

        scriptable-object-snap
      • Enter credentials and request arguments

        scriptable-credentials-example
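
A hypothetical sketch of what the ChatgptCreds scriptable object might contain; the actual field names and request arguments in ChatgptCreds.cs may differ:

```csharp
// Hypothetical sketch of a credentials/request-arguments ScriptableObject;
// field names and defaults are illustrative, not necessarily those in ChatgptCreds.cs.
using UnityEngine;

[CreateAssetMenu(fileName = "ChatgptCreds", menuName = "Talkomic/ChatGPT Credentials")]
public class ChatgptCredsSketch : ScriptableObject
{
    [Tooltip("OpenAI API key used by RunChatgpt.cs")]
    public string apiKey;

    [Tooltip("Chat model requested, e.g. gpt-3.5-turbo")]
    public string model = "gpt-3.5-turbo";

    [Tooltip("Sampling temperature for the chat completion request")]
    public float temperature = 0.7f;
}
```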

Update: Windows Build Functionality Complete

Key Additions:

  • The audio file is loaded from the UI

  • The project is a work in progress - audio sections still need to be entered manually before the build. For a working demo, load the audio sample sampleaudio.wav, which corresponds to the audio section entered in TalkomicManager.cs:

    image
  • ChatGPT credentials are entered in the UI at runtime

  • Windows runtime video demo:

Win-Demo-Talkomic.mp4

Prototype Software

Unity version: Unity 2021.3.26f1.

This prototype has been tested on the Unity Editor and Windows 11 build.

Tested on a Windows 11 system with 64GB RAM, an NVIDIA GeForce RTX 3090 Ti 24GB GPU, and a 12th Gen Intel i9-12900K CPU, 3400MHz, 16 cores.

 

License

This project is licensed under the MIT License. See LICENSE.txt for more information.

 

Thank you

Special thanks to Jason Gauci, co-host of the Programming Throwdown podcast, whose idea shared on the show served as the inspiration for this prototype.

We also thank @sd-akashic, @Haoming02, and @Microsoft for helping us better understand the ONNX Runtime implementation in Unity,

and ai-forever for the Real-ESRGAN git repo,

and yasirkula's Simple File Browser.

 

If you find this helpful you can buy me a coffee :)

 

Buy Me A Coffee
