add speech conversational llm post
lepisma committed May 9, 2024
1 parent 121824a commit ecbb540
Showing 4 changed files with 176 additions and 0 deletions.
67 changes: 67 additions & 0 deletions _posts/2024-05-09-speech-conversational-llms.md
@@ -0,0 +1,67 @@
---
title: Speech LLMs for Conversations
date: 2024-05-09
tags: [llm, speech, conversations]
categories: [Machine Learning]
layout: post
authors: [Shangeth, lepisma]
latex: True
---

With LLMs, building conversational systems has become easier. You no longer need
to focus on the low-level details of categorizing semantics and designing
responses. Instead, you can concentrate on controlling high-level behaviors via
an LLM. Most of the world is moving in this direction, with products combining
vendor ASR, LLM, and TTS components and some dialog management stitched in
between. While this is going to be the norm soon, we want to keep exploring the
areas where the next set of quality improvements will come from.
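
As a rough sketch of the cascaded setup described above, one turn passes through ASR, an LLM, and TTS in sequence, with a little dialog state kept in between. Every name here (`CascadedAgent`, the `asr`/`llm`/`tts` callables, the history format) is a hypothetical placeholder and not any vendor's API; the stubs only show where each component sits.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CascadedAgent:
    """Hypothetical cascade: audio -> ASR -> LLM -> TTS -> audio."""
    asr: Callable[[bytes], str]        # speech to text
    llm: Callable[[List[str]], str]    # conversation history to reply text
    tts: Callable[[str], bytes]        # reply text to speech
    history: List[str] = field(default_factory=list)

    def turn(self, user_audio: bytes) -> bytes:
        # Only the transcript reaches the LLM; tone, pauses, and other
        # non-textual cues in the audio are dropped at this step.
        transcript = self.asr(user_audio)
        self.history.append(f"user: {transcript}")  # minimal dialog state
        reply = self.llm(self.history)
        self.history.append(f"agent: {reply}")
        return self.tts(reply)


# Stub components so the sketch runs without any external service.
agent = CascadedAgent(
    asr=lambda audio: audio.decode("utf-8"),
    llm=lambda history: f"You said: {history[-1].split(': ', 1)[1]}",
    tts=lambda text: text.encode("utf-8"),
)
reply_audio = agent.turn(b"May I ask who is calling?")
```

The point of the sketch is the narrow interface in `turn`: whatever the ASR step does not write into the transcript never reaches the LLM.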

[Earlier](/speech-first-conversational-ai-revisited/) we discussed how spoken
conversations are richer than pure text and how the gap would not be bridged by
LLMs working purely on transcriptions. In one of our recent experiments, we
built an efficient multi-modal LLM that takes speech directly to provide a
better conversational experience. For production usage, the constraint is that
this should happen without losing the flexibility that a text-only LLM gives
you around writing prompts, making changes, evaluating, and debugging.

Below is a conversation with our recent in-house Speech LLM based conversational
system. Notice that, because of the extra information in speech, some micro
personalizations can happen, like the use of gendered pronouns[^1]. Transcription
errors also have a lower impact and, in general, the system responds better to
non-speech signals. With access to both speech and text domains, the model
allows for more fluent turn-taking, though this is not demonstrated in the
current conversation. In addition, our approach reduces the combined model size
(<2B) for going from speech to response, leading to lower compute latency
compared to larger systems.

<style>
.webvtt-player .media {
display: unset;
}

.webvtt-player .container {
width: unset;
}

.webvtt-player {
font-family: sans-serif;
font-size: 0.8em;
}
</style>

<div id="webvtt-player"
data-audio="../assets/audios/posts/speech-conversational-llms/audio.m4a"
data-transcript="../assets/audios/posts/speech-conversational-llms/transcript.vtt"
data-metadata="../assets/audios/posts/speech-conversational-llms/metadata.vtt"></div>

<script src="https://umd-mith.github.io/webvtt-player/webvtt-player.js"></script>
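
The `data-metadata` track used by the player carries one JSON payload per cue, and in this post's `metadata.vtt` the `keywords` field holds per-turn labels such as `female, middle-aged, neutral, america`. Below is a minimal sketch of reading those labels, assuming cues are separated by blank lines and each payload sits on a single line. `parse_metadata_cues` is a hypothetical helper and not part of webvtt-player, and the sample payloads are trimmed to the `keywords` field only.

```python
import json
import re
from typing import Dict, List

# Matches a cue timing line such as "00:00:04.940 --> 00:00:06.880".
CUE_TIMING = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})$")


def parse_metadata_cues(vtt_text: str) -> List[Dict]:
    """Return start, end, and keyword labels for each metadata cue."""
    cues = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line)
            if match and i + 1 < len(lines):
                # The line after the timing is assumed to be the JSON payload.
                payload = json.loads(lines[i + 1])
                labels = [k.strip() for k in payload.get("keywords", "").split(",") if k.strip()]
                cues.append({"start": match.group(1), "end": match.group(2), "keywords": labels})
    return cues


# Trimmed sample in the shape of the post's metadata track.
sample = """WEBVTT

00:00:04.940 --> 00:00:06.880
{"keywords": "female, middle-aged, neutral, america"}

00:00:29.080 --> 00:00:33.400
{"keywords": "male, middle-aged, neutral, america"}
"""

cues = parse_metadata_cues(sample)
```

A reader like this makes it easy to line up the labels with transcript cues by timestamp when reviewing a call.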

The model above doesn't yet control speech synthesis beyond the textual markers
it can generate, but that's something we plan to add soon (you might have
noticed erratic pitch shifts in the call above, since TTS vendors don't
contextualize based on the conversation so far). Stay tuned for more details on
how we take this and similar research areas forward.

[^1]: Of course, concerns around the accuracy of paralinguistic predictions are
extremely important when taking something like this to production.
Binary file not shown.
34 changes: 34 additions & 0 deletions assets/audios/posts/speech-conversational-llms/metadata.vtt
@@ -0,0 +1,34 @@
WEBVTT

00:00:04.940 --> 00:00:06.880
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "female, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:00:17.530 --> 00:00:20.840
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "female, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:00:29.080 --> 00:00:33.400
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:00:47.420 --> 00:00:52.020
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:01:07.190 --> 00:01:11.510
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, oceania", "title_alt": "", "subjects_alt": ""}

00:01:20.900 --> 00:01:22.530
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:01:32.440 --> 00:01:37.700
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:01:44.770 --> 00:01:48.200
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:01:52.680 --> 00:01:53.790
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:02:01.050 --> 00:02:09.710
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}

00:02:33.770 --> 00:02:36.590
{"keywords_alt": "", "gpspoints": {"gps_zoom": "", "gps_text_alt": "", "gps_text": "", "": ""}, "synopsis": "", "subjects": "", "hyperlinks": {"hyperlink_text_alt": "", "hyperlink_text": "", "hyperlink": ""}, "synopsis_alt": "", "title": "", "keywords": "male, middle-aged, neutral, america", "title_alt": "", "subjects_alt": ""}
75 changes: 75 additions & 0 deletions assets/audios/posts/speech-conversational-llms/transcript.vtt
@@ -0,0 +1,75 @@
WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:03.040
<Speaker 1>Hello, am I speaking with Tom Williams?

00:00:04.940 --> 00:00:06.880
<Speaker 2>May I ask who is calling?

00:00:07.740 --> 00:00:15.750
<Speaker 1>Hello Ma'am. This is the digital agent of FirstCar Finance on a recorded line. Could I please speak with Tom regarding an important financial matter?

00:00:17.530 --> 00:00:20.840
<Speaker 2>Um, hold on. I'll give him the phone.

00:00:21.760 --> 00:00:23.280
<Speaker 1>Sure Ma'am. No problem.

00:00:29.080 --> 00:00:33.400
<Speaker 3>Yeah yeah, uh, it's Tom. And um who is this?

00:00:35.240 --> 00:00:45.830
<Speaker 1>Hi, Tom. This is the digital agent of FirstCar Finance on a recorded line. For safety, could you please confirm if 895 East Elm Street, Springfield, Illinois is your address?

00:00:47.420 --> 00:00:52.020
<Speaker 3>Uh uh yeah, yeah. That'll be my address.

00:00:53.420 --> 00:01:05.870
<Speaker 1>Thank you Tom, for confirming. I am giving you a call about the 2017 Honda Civic under your name. You have an overdue payment of $1345.75. Could you clear this today?

00:01:07.190 --> 00:01:11.510
<Speaker 3>Um, I was planning to pay this, but I can't pay that all at once.

00:01:12.750 --> 00:01:19.290
<Speaker 1>I understand, Tom. How does a partial payment of $1700 sound? You can pay the remaining amount later.

00:01:20.900 --> 00:01:22.530
<Speaker 3>Yeah, I can do that.

00:01:23.490 --> 00:01:31.060
<Speaker 1>Great. Tom, how would you like to do this? You can pay this amount via an SMS link or we can connect you to a human agent over the phone.

00:01:32.440 --> 00:01:37.700
<Speaker 3>Uh I guess a payment link would be yeah I guess easier.

00:01:38.830 --> 00:01:43.700
<Speaker 1>Alright. Can I send it to your mobile number ending in 7985?

00:01:44.770 --> 00:01:48.200
<Speaker 3>Yeah, that should um yeah, that should work just fine.

00:01:49.120 --> 00:01:51.980
<Speaker 1>Thank you Tom. I am sending the message right away.

00:01:52.680 --> 00:01:53.790
<Speaker 3>Yeah okay.

00:01:54.750 --> 00:02:00.130
<Speaker 1>Brilliant. I’ve sent the payment link to your phone. Could you please let me know once you've received it?

00:02:01.050 --> 00:02:09.710
<Speaker 3>Um, yeah, just got it. Please give me a minute to complete the payment.

00:02:10.950 --> 00:02:18.730
<Speaker 1>Of course, Tom. Take your time and hit the pound key once you've completed the payment. If you come across any issues, hit the star key.

00:02:25.660 --> 00:02:32.710
<Speaker 1>Thank you, Tom. We will verify your payment on our end. Please reach out if you need further assistance. Have a nice day.

00:02:33.770 --> 00:02:36.590
<Speaker 3>Okay, thanks. Goodbye.

00:02:37.530 --> 00:02:38.590
<Speaker 1>Goodbye Tom.
