## LMQL-based Constraint-Guided Decoding for Understanding Human Speech in Embodied Contexts

The core idea is to encode speech-act theoretic notions in constraint-guided decoding. The key challenges are:

1. Extracting components -- informational content (incl. what objects are they talking about), human intentions for what they want the agent to do with the information
2. The terms being extracted are groundable in the robotic action/perception repertoire (collectively, "robot capabilities") -- this is what it means for a robot to "understand".
3. Linking terms with variables -- that is in the command "pick up the blue ball", the pickup(), blue() and ball() and DEFINITE() are all refering to the same domain entity -- ball -- which may or may not exist in the real world, but needs a shared identifier for the robot to properly satisfy perceptual constraints

In [2]:
# Imports

import lmql
import asyncio

In [None]:
schema = ```
        referents: [
            {"text": "m3 screw", 
            "type": "physobj",
            "variable_name": "VAR0",
            "cognitive status": "ACTIVATED",
            "role": "central"},
            
            {"text": "evan", 
            "type": "agent",
            "variable_name": "VAR1",
            "cognitive status": "FAMILIAR",
            "role": "supplemental"}
            ]
            
        descriptors = [
            {"text": "m3 screw", 
            "name": "m3",
            "arguments": ["VAR0"] },
            
            {"text": "evan", 
            "name": "NONE",
            "arguments": [] }
            ]
        
        intention: {
            "speech_act": "wantBel",
            "proposition":
                {"text": belonging",
                "type": "concept",
                "arguments": ["VAR0", "VAR1"]}
            }
               
            
```

In [17]:

@lmql.query
async def get_intent(utterance):
    '''
    argmax 
        "Utterances have three intents: (1) want, (2) wantBel and (3) itk. A 'want' is an imperative statement or a request by the speaker to have the listener do an action or stop doing an action. An 'itk' is a 'wh' or 'yes/no' query (what, why, when, where, who) or request from a speaker for more information from the listener about the listeners knowledge, beliefs or perceptions. A 'wantBel' (note the uppercase B) is a statement of fact or opinion that the speaker conveys to a listener and  expects to listener to come to believe." 
        "Based on these definitions of intents, when a listener hears a speaker say '{utterance}', the listener comes to believe that the speaker's intent is [INTENT]."
    from
        "openai/text-davinci-003" 
    where
        INTENT in ["want", "wantBel", "itk"]
    '''
    
intent = (await get_intent("can you pick up the ball"))[0].variables['INTENT']

{'INTENT': 'want'}

In [30]:
@lmql.query
async def get_central_referent(utterance):
    '''
    argmax 
        "The central item or referent (which could be a single thing or a collection of things) that is being referred to in the utterance: {utterance} is [CENTRAL_REFERENT]. Remember, the central referent is a thing or object, not an action or descriptor.It is meant to capture the central real world item being referenced in the utterance." 
    from
        "openai/text-davinci-003" 
    '''
    
central_referent = (await get_central_referent("can you pick up the ball"))[0].variables['CENTRAL_REFERENT'].strip()
print("Central Referent: ", central_referent)

Central Referent:  the ball.


In [21]:


@lmql.query
async def parse(utterance):
    '''
    argmax 
        "Utterances have three intents: (1) want, (2) wantBel and (3) itk. A 'want' is an imperative statement or a request by the speaker to have the listener do an action or stop doing an action. An 'itk' is a 'wh' or 'yes/no' query (what, why, when, where, who) or request from a speaker for more information from the listener about the listeners knowledge, beliefs or perceptions. A 'wantBel' (note the uppercase B) is a statement of fact or opinion that the speaker conveys to a listener and  expects to listener to come to believe." 
        "Based on these definitions of intents, when a listener hears a speaker say '{utterance}', the listener comes to believe that the speaker's intent is [INTENT]."
        "The central item (which could be a single thing or a collection of things) that is a real world item and that is being referred to in the utterance is [CENTRAL_REFERENT]"
        "This [CENTRAL_REFERENT] is of a [CENTRAL_REFERENT_TYPE]. For example in the utterance 'pick up the lemon on the table', the central referent or item is the 'lemon' and it is of type 'physobj' because it is a physical object"
        "The supplemental referents or items or other items (which could be a single thing or a collection of things) not including the [CENTRAL_REFERENT] that is mentioned in the utterance include [SUPPLEMENTAL_REFERENTS]."
        "Each of these supplemental referents, as a [SUPPLEMENTAL_REFERENT_TYPES]." 
        "Based on the fact that the intent is [INTENT], if the intent is a 'want' then the [CPC_TYPE] is an 'action', otherwise it is a 'concept'. Now, from the utterance we know that the core propositional content is a [CPC]." 
        " If the type of cpc is an 'action', then the core propositional content (or cpc) is the action that is being performed on the central referent. If the type of cpc is a 'concept', then the core propositional content (or cpc) is a concept that is being associated with the central referent."
        
    from
        "openai/text-davinci-003" 
    where        "Parse the input utterance: {utterance}."
        INTENT in ["want", "wantBel", "itk"] and CENTRAL_REFERENT_TYPE in ["physobj", "location"] and SUPPLEMENTAL_REFERENT_TYPES in ["physobj", "location"] and CPC_TYPE in ["action", "concept"]
    '''
    
(await parse("can you pick up the ball"))[0].variables



{'INTENT': '',
 'CENTRAL_REFERENT': '',
 'CENTRAL_REFERENT_TYPE': '',
 'SUPPLEMENTAL_REFERENTS': '',
 'SUPPLEMENTAL_REFERENT_TYPES': '',
 'CPC_TYPE': '',
 'CPC': ''}

In [41]:
@lmql.query
async def get_parse_json(utterance):
    '''
    argmax 
    
        "Extract pragmatic and semantic parse of the utterance: {utterance}."
        "Intent: Utterances have three intents: (1) want, (2) wantBel and (3) itk. A 'want' is an imperative statement or a request by the speaker to have the listener do an action or stop doing an action. An 'itk' is a 'wh' or 'yes/no' query (what, why, when, where, who) or request from a speaker for more information from the listener about the listeners knowledge, beliefs or perceptions. A 'wantBel' (note the uppercase B) is a statement of fact or opinion that the speaker conveys to a listener and  expects to listener to come to believe." 
        "central referent: The central item (which could be a single thing or a collection of things) that is a real world item and that is being referred to in the utterance" 
        "supplemental referents: These are also items that are not the central referent that are being mentioned in the utterance."
        "proposition: The core propositional content (single term) that is either the central action being mentioned or the central concept being mentioned. Remember, if the intent is a 'want' then the propsition should be some action, otherwise it should be a concept."
        """
        {{
          "intent": [INTENT],
          "central_referent": [CENTRAL_REF],
          "supplemental_referent": [SUPPLEMENTAL_REFERENT],
          "proposition": [CPC]
        }}
        """
    from
        "openai/text-davinci-003" 
    where
        INTENT in ["want", "wantBel", "itk"] and STOPS_AT(SUPPLEMENTAL_REFERENT, "\n") and STOPS_AT(CENTRAL_REF, "\n")
    '''
    
(await get_parse_json("pick up the blue ball"))[0].variables

{'INTENT': 'wantBel',
 'CENTRAL_REF': ' "blue ball",\n',
 'SUPPLEMENTAL_REFERENT': ' None,\n',
 'CPC': ' "pick up"\n}'}