<?xml version='1.0' encoding='UTF-8'?>
<html dir="ltr" about="" property="dcterms:language" content="en"
xmlns="http://www.w3.org/1999/xhtml"
prefix="bibo: http://purl.org/ontology/bibo/" typeof="bibo:Document">
<head>
<title>Intelligent Personal Assistant Interfaces</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<link href="../cg-draft.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div class="head">
<p>
<a href="http://www.w3.org/"> <img width="72"
height="48" src="http://www.w3.org/Icons/w3c_home"
alt="W3C" /></a>
</p>
<h1 property="dcterms:title" class="title" id="title">Intelligent
Personal Assistant Architecture</h1>
<h2 property="bibo:subtitle" id="subtitle">Intelligent
Personal Assistant Interfaces</h2>
<dl>
<dt>Latest version</dt>
<dd>
Last modified: April 03, 2024 <a
href="https://github.com/w3c/voiceinteraction/blob/master/voice%20interaction%20drafts/paInterfaces/paInterfaces.htm">https://github.com/w3c/voiceinteraction/blob/master/voice%20interaction%20drafts/paInterfaces/paInterfaces.htm</a>
(GitHub repository)<br /> <a
href="https://w3c.github.io/voiceinteraction/voice%20interaction%20drafts/paInterfaces/paInterfaces.htm">HTML
rendered version</a>
</dd>
<dt>Editor</dt>
<dd>
Dirk Schnelle-Walka<br /> Deborah Dahl, Conversational
Technologies
</dd>
</dl>
<p class="copyright">
Copyright © 2022-2024 the Contributors to the Voice
Interaction Community Group, published by the <a
href="http://www.w3.org/community/voiceinteraction/">Voice
Interaction Community Group</a> under the <a
href="https://www.w3.org/community/about/agreements/cla/">W3C
Community Contributor License Agreement (CLA)</a>. A
human-readable <a
href="http://www.w3.org/community/about/agreements/cla-deed/">summary</a>
is available.
</p>
<hr />
</div>
<h2 id="abstract">Abstract</h2>
<p>
This document details the general architecture of Intelligent
Personal Assistants as described in <a
href="https://w3c.github.io/voiceinteraction/voice%20interaction%20drafts/paArchitecture/paArchitecture-1-3.htm">Architecture
and Potential for Standardization Version 1.3</a> with regard to
interface definitions. The architectural descriptions focus on
intent-based, voice-based personal assistants and chatbots.
Current intent-less LLM chatbots may have other interface needs.
</p>
<h2>Status of This Document</h2>
<p>
<em>This specification was published by the <a
href="http://www.w3.org/community/voiceinteraction/">Voice
Interaction Community Group</a>. It is not a W3C Standard
nor is it on the W3C Standards Track. Please note that under
the <a
href="http://www.w3.org/community/about/agreements/cla/">W3C
Community Contributor License Agreement (CLA)</a> there is a
limited opt-out and other conditions apply. Learn more about
<a href="http://www.w3.org/community/">W3C Community and
Business Groups</a>.
</em>
</p>
<h2 class="introductory">Table of Contents</h2>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#problemstatement">Problem Statement</a></li>
<li><a href="#architecture">Architecture</a></li>
<li><a href="#highlevelinterfaces">High Level
Interfaces</a></li>
<li><a href="#lowlevelinterfaces">Low Level Interfaces</a></li>
</ol>
<!-- OddPage -->
<h1 id="introduction">
<span class="secno">1. </span>Introduction
</h1>
<p>Intelligent Personal Assistants (IPAs) are now available in
our daily lives through our smart phones. Apple’s Siri, Google
Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more
are helping us with various tasks, like shopping, playing music,
setting a schedule, sending messages, and offering answers to
simple questions. Additionally, we equip our households with
smart speakers like Amazon’s Alexa or Google Home which are
available without the need to pick up a dedicated device for
these sorts of tasks and which can even control household
appliances in our homes. As of today, there is no
interoperability among the available IPA providers, and
especially for exchanging learned user behaviors it is unlikely
to emerge at all.</p>
<p>Furthermore, in addition to these general-purpose assistants,
there are also specialized virtual assistants which are able to
provide their users with in-depth information which is specific
to an enterprise, government agency, school, or other
organization. They may also have the ability to perform
transactions on behalf of their users, such as purchasing items,
paying bills, or making reservations. Because of the breadth of
possibilities for these specialized assistants, it is imperative
that they be able to interoperate with the general-purpose
assistants. Without this kind of interoperability, enterprise
developers will need to re-implement their intelligent
assistants for each major generic platform.</p>
<p>
This document is the second step in our strategy for IPA
standardization. It is based on a general architecture of IPAs
described in <a
href="https://w3c.github.io/voiceinteraction/voice%20interaction%20drafts/paArchitecture/paArchitecture-1-3.htm">Architecture
and Potential for Standardization Version 1.3</a> which aims at
exploring the potential areas for standardization. It focuses on
voice as the major input modality. We believe it will be of
value not only to developers, but to many of the constituencies
within the intelligent personal assistant ecosystem. Enterprise
decision-makers, strategists and consultants, and entrepreneurs
may study this work to learn of best practices and seek
adjacencies for creation or investment. The overall concept is
not restricted to voice but also covers purely text-based
interactions with so-called chatbots as well as interaction
using multiple modalities. Conceptually, the authors also define
executing actions in the user's environment, like turning on the
light, as a modality. This means that components that deal with
speech recognition, natural language understanding or speech
synthesis will not necessarily be available in these
deployments. In case of chatbots, speech components will be
omitted. In case of multimodal interaction, interaction
modalities may be extended by components to recognize input from
the respective modality, transform it into something meaningful
and vice-versa to generate output in one or more modalities.
Some modalities may be used as output-only, like turning on the
light, while other modalities may be used as input-only, like
touch.
</p>
<p>
In this second step we describe the interfaces of the general
architecture of IPAs in <a
href="https://w3c.github.io/voiceinteraction/voice%20interaction%20drafts/paArchitecture/paArchitecture-1-3.htm">Architecture
and Potential for Standardization Version 1.3</a>.
</p>
<p>
In order to cope with <a href="#usecases">use cases</a> such as
those described below, an IPA follows the general design
concepts of a voice user interface, as shown in the figures in
the <a href="#architecture">architecture</a> section.
</p>
<p>
Interfaces are described with the help of <a
href="https://www.omg.org/spec/UML/">UML diagrams</a>. We
expect the reader to be familiar with that notation, although
most concepts are easy to understand and do not require in-depth
knowledge. The main diagram types used in this document are <a
href="https://sparxsystems.com/resources/tutorials/uml2/component-diagram.html">component
diagrams</a> and <a
href="https://sparxsystems.com/resources/tutorials/uml2/sequence-diagram.html">sequence
diagrams</a>. The UML diagrams are provided as Enterprise
Architect Model <a href="pa-architecture.EAP">pa-architecture.EAP</a>.
They can be viewed with the free-of-charge tool <a
href="https://www.sparxsystems.eu/enterprise-architect/ea-lite-edition/">EA
Lite</a>.
</p>
<h1 id="problem statement">
<span class="secno">2. </span>Problem Statement
</h1>
<h2 id="usecases">
<span class="secno">2.1 </span>Use Cases
</h2>
<p>This section describes potential usages of IPAs that will be
used later in the document to illustrate the usage of the
specified interfaces.</p>
<h3>
<span class="secno">2.1.1 </span>Weather Information
</h3>
<p>A user located in Berlin, Germany, is planning to visit her
friend a few kilometers away, the next day. As she considers
taking the bike, she asks the IPA for weather conditions.</p>
<h3>
<span class="secno">2.1.2 </span>Flight Reservation
</h3>
<p>A user located in Berlin, Germany, would like to plan a trip
to an international conference and wants to book a flight to
the conference in San Francisco. Therefore, she approaches the
IPA to help her with booking the flight.</p>
<h1 id="architecture">
<span class="secno">3. </span>Architecture
</h1>
<h2 id="architectur-principle">
<span class="secno">3.1 </span><span><font
face="Segoe UI">Architectural Principle</font></span>
</h2>
<p>
The architecture described in this document follows the <a
href="https://web.archive.org/web/20150906155800/http:/www.objectmentor.com/resources/articles/Principles_and_Patterns.pdf">SOLID
principles</a> introduced by Robert C. Martin to arrive at a
scalable, understandable, and reusable software solution.
</p>
<dl>
<dt>Single responsibility principle</dt>
<dd>The components should have only one clearly-defined
responsibility.</dd>
<dt>Open closed principle</dt>
<dd>Components should be open for extension, but closed for
modification.</dd>
<dt>Liskov substitution principle</dt>
<dd>Components may be replaced without impacts onto the
basic system behavior.</dd>
<dt>Interface segregation principle</dt>
<dd>Many specific interfaces are better than one
general-purpose interface.</dd>
<dt>Dependency inversion principle</dt>
<dd>High-level components should not depend on low-level
components. Both should depend on their interfaces.</dd>
</dl>
<p>This architecture aims at following both a traditional
partitioning of conversational systems, with separate components
for speech recognition, natural language understanding, dialog
management, natural language generation, and audio output
(audio files or text-to-speech), and newer LLM (Large
Language Model) based approaches. This architecture does not
rule out combining some of these components in specific systems.</p>
<h2 id="main-use-cases">
<span class="secno">3.2 </span>Main Use Cases
</h2>
<p>Among others, the following popular high-level use cases
for IPAs are to be supported:</p>
<ol>
<li>Question Answering or Information Retrieval</li>
<li>Executing local and/or remote services to accomplish
tasks</li>
</ol>
<p>This is supported by a flexible architecture that supports
dynamically adding local and remote services or knowledge
sources such as data providers. Moreover, it is possible to
include other IPAs, with the same architecture, and forward
requests to them, similar to the principle of a Russian doll
(omitting the Client Layer). All this describes the capabilities
of the IPA. These extensions may be selected from a standardized
marketplace. For the remainder of this document, we consider an
IPA that is extensible via such a marketplace.</p>
<p>The following table lists the IPA main use cases and related
examples that are used in this document:</p>
<table>
<tr>
<th>Main Use Case</th>
<th>Example</th>
</tr>
<tr>
<td>Question Answering or Information Retrieval</td>
<td>Weather information</td>
</tr>
<tr>
<td>Executing local and/or remote services to
accomplish tasks</td>
<td>Flight reservation</td>
</tr>
</table>
<p>These main use cases are shown in the following figure:</p>
<img src="Main-IPA-Use-Cases.svg" alt="Main IPA Use Cases"
style="width: 40%; height: auto;" />
<p>Not all components may be needed for actual implementations;
some may be omitted completely. Especially, LLM-based
architectures may combine the functionality of multiple
components into only one or a few components. However, we note
them here to provide a more complete picture.</p>
<p>The architecture comprises three layers that are detailed in
the following sections:</p>
<ol>
<li><a href="#clientlayer">Client Layer</a></li>
<li><a href="#dialoglayer">Dialog Layer</a></li>
<li><a href="#datalayer">External Data / Services / IPA
Providers</a></li>
</ol>
<p>Actual implementations may want to distinguish more or fewer
layers than these. The assignment of components to layers is not
considered to be strict, so that some of the components may be
shifted to other layers as needed. This partitioning reflects
what the Community Group regards as ideal and shows the
intended separation of concerns.</p>
<img src="IPA-Major-Components.svg" alt="IPA Major Components"
style="width: 50%; height: auto;" />
<p>These components are assigned to the
packages shown below.</p>
<img src="IPA-Package-Hierarchy.svg" alt="IPA Package Hierarchy"
style="width: 50%; height: auto;" />
<h1 id="highlevelinterfaces">
<span class="secno">4. </span>High Level Interfaces
</h1>
<p>
This section details the interfaces from the figure shown in the
<a href="#architecture">architecture</a>. The interfaces are
described with the following attributes
</p>
<dl>
<dt>name</dt>
<dd>Name of the attribute</dd>
<dt>type</dt>
<dd>Hint whether this attribute is a single data item or a
category. The exact data types of the attributes are left
open for now. A category may contain other categories or data
items.</dd>
<dt>description</dt>
<dd>A short description to illustrate the purpose of this
attribute.</dd>
<dt>required</dt>
<dd>Flag indicating whether this attribute is required in this
interface.</dd>
</dl>
<p>A typical flow for the high level interfaces is shown in the
following figure.</p>
<img src="Major-Components-Interaction.svg"
alt="IPA Major Components Interaction"
style="width: 100%; height: auto;" />
<p>This sequence supports the major use cases stated
<a href="#main-use-cases">above</a>.</p>
<h2 id="if-clientinput">
<span class="secno">4.1 </span>Interface Client Input
</h2>
<p>
This interface describes the data that is sent from the <a
href="#ipaclient">IPA Client</a> to the <a
href="#ipaservice">IPA Service</a>. The following table
details the data that should be considered for this interface in
the method <b>processInput</b>.
</p>
<table>
<tr>
<th>name</th>
<th>type</th>
<th>description</th>
<th>required</th>
</tr>
<tr>
<td>session id</td>
<td>data item</td>
<td>unique identifier of the session</td>
<td>yes, if obtained</td>
</tr>
<tr>
<td>request id</td>
<td>data item</td>
<td>unique identifier of the request within a session</td>
<td>yes</td>
</tr>
<tr>
<td>audio data</td>
<td>data item</td>
<td>encoded or raw audio data</td>
<td>yes</td>
</tr>
<tr>
<td>multimodal input</td>
<td>category</td>
<td>input that has been received from modality
recognizers, e.g., text, gestures, pen input, ...</td>
<td>no</td>
</tr>
<tr>
<td>meta data</td>
<td>category</td>
<td>data augmenting the request, e.g., user
identification, timestamp, location, ...</td>
<td>no</td>
</tr>
</table>
<p>
The <b>session id</b> can be created by the <a
href="#ipaservice">IPA Service</a>. In case a session id is
provided, it must be used for subsequent calls.
</p>
<p>
The <a href="#ipaclient">IPA Client</a> maintains <b>request
id</b> for each request that is being sent via this interface.
These ids must be unique within a session.
</p>
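<p>
As a minimal sketch of this handshake, a first call to
<b>processInput</b> may omit the session id:
</p>
<pre>
{
    "requestId": "1",
    "audio": {
        ...
    }
}</pre>
<p>
The <a href="#ipaservice">IPA Service</a> may then create a
session id and return it in the <b>ClientResponse</b>, and the
<a href="#ipaclient">IPA Client</a> echoes it in all subsequent
requests of the session. These snippets follow the illustrative,
non-normative JSON format used throughout this document.
</p>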
<p>
<b>Audio data</b> can be delivered mainly in two ways:
</p>
<ol>
<li>Endpointed audio data</li>
<li>Streamed audio data</li>
</ol>
<p>
For endpointed audio data the <a href="#ipaclient">IPA
Client</a> determines the end of speech, e.g., with the help of
voice activity detection. In this case only that portion of
audio is sent that contains the potential spoken user input. In
terms of user experience this means that processing of the user
input can only happen <em>after</em> the end of speech has
been detected.
</p>
<p>
For streamed audio data, the <a href="#ipaclient">IPA Client</a>
starts sending audio data as soon as it has been detected that
the user is speaking to the system with the help of the <a
href="#clientactivtionstrategy">Client Activation
Strategy</a>. In terms of user experience this means that
processing of the user input can happen <em>while</em> the user is
speaking.
</p>
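<p>
A streamed transmission could, for instance, split the audio
into a sequence of chunks. The following sketch is purely
illustrative; fields like <b>sequence</b> and <b>final</b> are
hypothetical and not defined by this specification.
</p>
<pre>
{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "43",
    "audio": {
        "type": "Streamed",
        "sequence": 7,
        "final": false,
        "data": "ZmhhcGh2cGF3aGZwYWhuZ...",
        "encoding": "PCM-16BIT"
    }
}</pre>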
<p>An audio codec may be used, e.g., to reduce the amount of
data to be transferred. The selection of the codec is not part
of this specification.</p>
<p>
Optionally, <b>multimodal input</b> can be transferred that has
been captured as input from a specific modality recognizer.
This covers all modalities other than audio, e.g., text for a
chatbot, or gestures.
</p>
<p>
Optionally, <b>meta data</b> may be transferred augmenting the
input. Examples of such data include user identification,
timestamp and location.
</p>
<p>
The <a href="#ipaservice">IPA Service</a> may maintain a <b>session
id</b>, e.g., to serve multiple clients and allow them to be
distinguished.
</p>
<p>
As a return value this interface describes the data that is sent
from the <a href="#ipaservice">IPA Service</a> to the <a
href="#ipaclient">IPA Client</a>. The following table
details the data that should be considered for this interface in
the <b>ClientResponse</b>.
</p>
<table>
<tr>
<th>name</th>
<th>type</th>
<th>description</th>
<th>required</th>
</tr>
<tr>
<td>session id</td>
<td>data item</td>
<td>unique identifier of the session</td>
<td>yes, if obtained</td>
</tr>
<tr>
<td>request id</td>
<td>data item</td>
<td>unique identifier of the request within a session</td>
<td>yes</td>
</tr>
<tr>
<td>audio data</td>
<td>data item</td>
<td>encoded or raw audio data</td>
<td>yes</td>
</tr>
<tr>
<td>multimodal output</td>
<td>category</td>
<td>output that has been received from modality
synthesizers, e.g., text, command to execute an
observable action, ...</td>
<td>no</td>
</tr>
</table>
<p>
In case the parameter <b>multimodal output</b> contains commands
to be executed, they are expected to follow the specification of
the <a href="#if-servicecall">Interface Service Call.</a>
</p>
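<p>
As an illustration, a <b>ClientResponse</b> whose multimodal
output carries such a command might look as follows, using the
illustrative JSON format introduced below. The service id
<em>light-switch</em> and its parameters are hypothetical and
only meant to show the expected shape.
</p>
<pre>
{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "42",
    "audio": {
        ...
    },
    "multimodal": {
        "command": {
            "serviceId": "light-switch",
            "parameters": {
                "state": "on"
            }
        }
    }
}</pre>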
<p>The following sections provide examples using the JSON
format to illustrate the interfaces. JSON is chosen only because
it is easy to understand and read. This specification does not
make any assumptions about the underlying programming languages
or data formats. The examples are just meant to illustrate how
responses may be generated with the provided data. It is not
required that implementations follow exactly the described
behavior.</p>
<h3 id="if-clientinput-weather-example">
<span class="secno">4.1.2 </span>Example Weather Information for
Interface Client Input
</h3>
<p>
The following request to <b>processInput</b> sends endpointed
audio data with the user's current location to query for
tomorrow's weather with the utterance <em>What will the
weather be like tomorrow</em>.</p>
<pre>
{
"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
"requestId": "42",
"audio": {
"type": "Endpointed",
"data": "ZmhhcGh2cGF3aGZwYWhuZ...zI0MDc4NDY1NiB5dGhvaGF3",
"encoding": "PCM-16BIT"
},
"multimodal": {
"location": {
"latitude": 52.51846213843821,
"longitude": 13.378722525448833
}
...
},
"meta": {
"timestamp": "2022-12-01T18:45:00.000Z"
...
}
}</pre>
<p>In this example endpointed audio data is transferred as a
value. There are other ways to send the audio data to the IPA,
e.g., as a reference. This way is chosen as it is easier to
illustrate the usage.</p>
<p>
In return the IPA may send back the following response <em>Tomorrow
there will be snow showers in Berlin with temperatures
between 0 and -1 degrees</em> via <b>ClientResponse</b> to the
Client.</p>
<pre>
{
"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
"requestId": "42",
"audio": {
"type": "Endpointed",
"data": "Uvrs4hcGh2cGF3aGZwYWhuZ...vI0MDc4DGY1NiB5dGhvaRD2",
"encoding": "PCM-16BIT"
},
"multimodal": {
"text": "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees."
...
},
"meta": {
...
}
}</pre>
<h3 id="if-clientinput-flight-example">
<span class="secno">4.1.3 </span>Example Flight Reservation for
Interface Client Input
</h3>
<p>
The following request to <b>processInput</b> sends endpointed
audio data with the user's current location to book a flight
with the utterance <em>I want to fly to San Francisco</em>.</p>
<pre>
{
"sessionId": "0c27895c-644d-11ed-81ce-0242ac120002",
"requestId": "15",
"audio": {
"type": "Endpointed",
"data": "ZmhhcGh2cGF3aGZwYWhuZ...zI0MDc4NDY1NiB5dGhvaGF3",
"encoding": "PCM-16BIT"
},
"multimodal": {
"location": {
"latitude": 52.51846213843821,
"longitude": 13.378722525448833
}
...
},
"meta": {
"timestamp": "2022-11-14T19:50:00.000Z"
...
}
}</pre>
<p>
In return the IPA may send back the following response <em>When
do you want to fly from Berlin to San Francisco?</em> via <b>ClientResponse</b>
to the Client.</p>
<pre>
{
"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
"requestId": "42",
"audio": {
"type": "Endpointed",
"data": "Uvrs4hcGh2cGF3aGZwYWhuZ...vI0MDc4DGY1NiB5dGhvaRD2",
"encoding": "PCM-16BIT"
},
"multimodal": {
"text": "When do you want to fly from Berlin to San Francisco?"
...
},
"meta": {
...
}
}</pre>
<h2 id="if-externalclientinput">
<span class="secno">4.2 </span>External Client Input
</h2>
<p>
This interface describes the data that is sent from the <a
href="#providerselectionservice">Provider Selection
Service</a>. The input is a copy of the data that is sent from
the <a href="#ipaclient">IPA Client</a> to the <a
href="#ipaservice">IPA Service</a> in <a
href="#if-clientinput">Interface Client Input</a> via the method
<b>processInput</b>. This interface mainly differs in the return
value.
</p>
<p>
As a return value this interface describes the data that is sent
from the <a href="#providerselectionservice">Provider
Selection Service</a> to the <a href="#nlu">NLU</a> and <a
href="#dialogmanagement">Dialog Management</a>. The
following table details the data that should be considered for
this interface in the <b>ExternalClientResponse</b>.
</p>
<table>
<tr>
<th>name</th>
<th>type</th>
<th>description</th>
<th>required</th>
</tr>
<tr>
<td>session id</td>
<td>data item</td>
<td>unique identifier of the session</td>
<td>yes, if required by the IPA</td>
</tr>
<tr>
<td>request id</td>
<td>data item</td>
<td>unique identifier of the request within a session</td>
<td>yes</td>
</tr>
<tr>
<td>call result</td>
<td>data item</td>
<td>success or failure</td>
<td>yes</td>
</tr>
<tr>
<td>multimodal output</td>
<td>category</td>
<td>output that has been received from an external IPA</td>
<td>yes, if no interpretation is provided and no error
occurred</td>
</tr>
<tr>
<td>interpretation</td>
<td>category</td>
<td>meaning as intents and associated entities</td>
<td>yes, if no multimodal output is provided and no
error occurred</td>
</tr>
<tr>
<td>error</td>
<td>category</td>
<td>error as detailed in section <a
href="#errorhandling">Error Handling</a></td>
<td>yes, if an error during execution is observed</td>
</tr>
</table>
<p>
The parameters <b>session id</b> and <b>request
id</b> are copies of the data received from the <a
href="#if-clientinput">Interface Client Input</a>.
</p>
<p>This call is optional, depending on whether external IPAs
are used.</p>
<p>Depending on the capabilities of the external IPA, the return
value may be one of the following options:</p>
<ul>
<li>multimodal output</li>
<li>interpretation</li>
</ul>
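<p>
For instance, an external IPA that does not expose an
interpretation may return rendered output directly. The
following sketch of such an <b>ExternalClientResponse</b> is
purely illustrative.
</p>
<pre>
{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "42",
    "callResult": "success",
    "multimodal": {
        "text": "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees."
    }
}</pre>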
<p>
The category <b>interpretation</b> may be one of the following
options, depending on the capabilities of the external IPA:
</p>
<ul>
<li>single-intent, i.e. provide one intent in a single
utterance</li>
<li>multi-intent, i.e. provide multiple intents in a
single utterance</li>
</ul>
<p>
With <b>single-intent</b> the user provides a single intent per
utterance. An example for single-intent is <em>"Book a
flight to San Francisco for tomorrow morning."</em> The single
intent is here book-flight. With <b>multi-intent</b> the user
provides multiple intents in a single utterance. An example for
multi-intent is <em>"How is the weather in San Francisco
and book a flight for tomorrow morning."</em> Provided intents
are check-weather and book-flight. In this case the IPA needs to
determine the order of intent execution based on the structure
of the utterance. If the intents are not executed in parallel,
the IPA will trigger the next intent in the identified order.
</p>
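<p>
A multi-intent interpretation could, for example, be represented
as a list with one entry per detected intent, assuming the same
interpretation structure as for single-intent. The following
fragment is purely illustrative.
</p>
<pre>
"interpretation": [
    {
        "intent": "check-weather",
        "intentConfidence": 0.91,
        "entities": [
            {
                "location": "San Francisco",
                "entityConfidence": 0.95
            }
        ]
    },
    {
        "intent": "book-flight",
        "intentConfidence": 0.88,
        "entities": [
            {
                "destination": "San Francisco",
                "entityConfidence": 0.95
            },
            {
                "date": "2022-12-02",
                "entityConfidence": 0.9
            }
        ]
    }
]</pre>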
<p>
As multi-intent is not very common in today's IPAs, the focus
for now is on single-intent, as detailed in the following table:
</p>
<table style="width: 100%">
<tr>
<th colspan="3">name</th>
<th>data type</th>
<th>description</th>
<th>required</th>
</tr>
<tr>
<td colspan="3">interpretation</td>
<td>list</td>
<td>list of meaning as intents and associated entities</td>
<td>yes</td>
</tr>
<tr>
<td style="width: 20px;"></td>
<td colspan="2">intent</td>
<td>string</td>
<td>group of utterances with similar meaning</td>
<td>yes</td>
</tr>
<tr>
<td style="width: 20px;"></td>
<td colspan="2">intent confidence</td>
<td>float</td>
<td>confidence value for the intent in the range [0,1]</td>
<td>no</td>
</tr>
<tr>
<td style="width: 20px;"></td>
<td colspan="2">entities</td>
<td>list</td>
<td>list of entities associated to the intent</td>
<td>no</td>
</tr>
<tr>
<td style="width: 20px;"></td>
<td style="width: 20px;"></td>
<td>name of the entity</td>
<td>string</td>
<td>additional information to the intent</td>
<td>no</td>
</tr>
<tr>
<td style="width: 20px;"></td>
<td style="width: 20px;"></td>
<td>entity confidence</td>
<td>float</td>
<td>confidence value for the entity in the range [0,1]</td>
<td>no</td>
</tr>
</table>
<h3 id="if-externalclientinput-example-weather">
<span class="secno">4.2.1 </span>Example Weather Information for
Interface External Client Input
</h3>
<p>
The following request to <b>processInput</b> is a copy of <a
href="#if-clientinput-weather-example">Example Weather
Information for Interface Client Input</a>.
</p>
<p>
In return the external IPA may send back the following
response via <b>ExternalClientResponse</b> to the Dialog.
</p>
<pre>
{
"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
"requestId": "42",
"callResult": "success",
"interpretation": [
{
"intent": "check-weather",
"intentConfidence": 0.9,
"entities": [
{
"location": "Berlin",
"entityConfidence": 1.0
},
{
"date": "2022-12-02",
"entityConfidence": 0.94
}
]
},
...
]
}</pre>
<p>
The external speech recognizer converts the obtained audio into
text like <em>What will the weather be like tomorrow</em>. The NLU
then extracts the following from that decoded utterance, other
multimodal input and metadata.
</p>
<ul>
<li>intent: check-weather from, e.g., utterance part <em>What
will the weather…</em></li>
<li>entity: date from utterance part <em>…tomorrow…</em></li>
<li>entity: location, e.g., from the multimodal input of
location</li>
</ul>
<p>This is illustrated in the following figure.</p>
<img src="processInputWeather.svg"
alt="Processing Input of the check weather example"
style="width: 40%; height: auto;" />
<h3 id="if-externalclientinputexample-flight">
<span class="secno">4.2.2 </span> Example Flight Reservation for
Interface External Client Input
</h3>
<p>
The following request to <b>processInput</b> is a copy of <a
href="#if-clientinput-flight-example">Example Flight
Reservation for Interface Client Input</a>.
</p>
<p>
In return the IPA may send back the following response <em>When
do you want to fly from Berlin to San Francisco?</em> via <b>ClientResponse</b>
to the Client. In this case, empty entities, like <em>date</em>
indicate that there are still slots to be filled and no service
call can be made right now.</p>
<pre>
{
"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
"requestId": "42",
"callResult": "success",
"interpretation": [
{
"intent": "book-flight",
"intentConfidence": 0.87,
"entities": [
{
"origin": "Berlin",
"entityConfidence": 1.0
},
{
"destination": "San Francisco",
"entityConfidence": 0.827
},
{
"date": "",
},
...
]
},
...
]
}</pre>
<p>
The external speech recognizer converts the obtained audio into
text like <em>I want to fly to San Francisco</em>. The NLU then
extracts the following from that decoded utterance, other
multimodal input and metadata.</p>
<ul>
<li>intent: book-flight from, e.g., utterance part <em>I
want to fly…</em></li>
<li>entity: location from utterance part <em>…San
Francisco…</em></li>
<li>entity: location, e.g., from the multimodal input of
location</li>
</ul>
<p>
This is illustrated in the following figure. <img
src="processFlightReservation.svg"
alt="Processing Input of the flight reservation example"
style="width: 40%; height: auto;" />
</p>
<p>
Further steps will be needed to convert both location entities
to <em>origin</em> and <em>destination</em> in the actual reply.
This may be either done by the flight reservation IPA directly
or by calling external services beforehand to determine the
nearest airports from these locations.
</p>
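<p>
After such a conversion the entities might, for example, carry
the resolved airports. The airport codes below are hypothetical
illustrations of the result of this step.
</p>
<pre>
"entities": [
    {
        "origin": "BER",
        "entityConfidence": 1.0
    },
    {
        "destination": "SFO",
        "entityConfidence": 0.827
    },
    ...
]</pre>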
<h2 id="if-servicecall">
<span class="secno">4.3 </span>External Service Call
</h2>
<p>
This interface describes the data that is sent from the <a
href="#dialog">Dialog</a> to the <a
href="#providerselectionservice">Provider Selection
Service</a>. The following table details the data that should be
considered for this interface in the method <b>callService</b>.
</p>
<table>
<tr>
<th>name</th>
<th>type</th>
<th>description</th>
<th>required</th>
</tr>
<tr>
<td>session id</td>
<td>data item</td>
<td>unique identifier of the session</td>
<td>yes, if required by the IPA</td>
</tr>
<tr>
<td>request id</td>
<td>data item</td>
<td>unique identifier of the request within a session</td>
<td>yes</td>
</tr>
<tr>
<td>service id</td>
<td>data item</td>
<td>id of the service to be executed</td>
<td>yes</td>
</tr>
<tr>
<td>parameters</td>
<td>data item</td>
<td>Parameters to the service call</td>
<td>no</td>
</tr>
</table>
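<p>
As an illustration, a <b>callService</b> request for the flight
reservation example might look as follows. The service id
<em>book-flight</em> and the parameter names are hypothetical.
</p>
<pre>
{
    "sessionId": "0c27895c-644d-11ed-81ce-0242ac120002",
    "requestId": "16",
    "serviceId": "book-flight",
    "parameters": {
        "origin": "BER",
        "destination": "SFO",
        "date": "2022-12-02"
    }
}</pre>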
<p>
As a return value the result of this call is sent back in the
<b>ClientResponse</b>.
</p>
<table>
<tr>
<th>name</th>
<th>type</th>
<th>description</th>
<th>required</th>
</tr>
<tr>
<td>session id</td>
<td>data item</td>