<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/cbds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/cbds.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

**Chatbots & Dialogue Systems**

- 📝 SALP chapter 15

## Introduction
- `Conversation` is a foundational aspect of language, learned early and used widely, driving the design of `interactive programs` such as 
  - `frame-based` dialogue systems for `structured tasks` 
  - `chatbots` for more flexible, `unstructured` conversations
  - many modern systems blend both, as seen in tools like Siri and ChatGPT.

- The early chatbot ELIZA, designed to simulate a therapist, used `simple pattern matching with regular expressions`, creating responses that seemed personalized and led users to feel emotionally connected.
  - Its responses were crafted based on `specific keywords`, allowing the system to adapt replies to certain words for more engaging interactions, sometimes using `general responses` when no keywords were matched.
  - It demonstrated that users could become emotionally involved, often treating the system as a human, which led to behaviors typical of human interactions, like sharing personal issues.

- `Emotional attachment` to chatbots raises privacy concerns; 
  - users tend to disclose private information more freely, increasing risks, especially when the chatbot seems more human-like.
  - Privacy and emotional impact require careful consideration when designing and deploying chatbots,
    - as they may inadvertently influence users' emotional well-being and cognitive states.
- Chatbot usage, especially in `sensitive areas`, may need oversight, such as Institutional Review Board (IRB) approval, to ensure `ethical interaction` with human participants.
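
The keyword-and-pattern approach attributed to ELIZA above can be illustrated with a minimal sketch. The rules, templates, and the `respond` function are hypothetical and are not ELIZA's actual script; the sketch also skips the pronoun swapping ("my" to "your") that a fuller version would need.

```python
import re

# Hypothetical keyword rules: (pattern, response template).
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."  # general response when no keyword matches


def respond(utterance: str) -> str:
    """Return the response of the first matching rule, else the fallback."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Strip trailing punctuation from the captured text before reusing it.
            captured = match.group(1).rstrip(".!?")
            return template.format(captured)
    return FALLBACK


print(respond("I am sad."))             # Why do you say you are sad?
print(respond("The weather is nice."))  # Please go on.
```

Reusing the user's own words in the reply is what makes such canned templates feel personalized.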

## Properties of Human Conversation
- 🍎 A short dialog between Alice and Mr. Rabbit 
  ```text
  Alice: "Mr. Rabbit, why are you always in such a rush? Do you ever just... stop to smell the roses?"
  Rabbit: "Smell the roses? I barely have time to smell the *carrots*! Honestly, I’m late for everything!"
  Alice: "Well, at least you’re never early enough to make a hare-brained decision!" 
  Rabbit: 🙂 "Very funny, Alice. But with these ears, I’ve heard them all!" 
  ```
- `Conversation Dynamics`
   - Human conversation is a complex, joint activity requiring mutual understanding.
   - Conversations are structured in `turns`, 
     - where each speaker contributes sequentially; 
     - `turn-taking` is essential for natural interaction, as speakers need to know when to start and stop talking.
   - Dialogue systems must detect when a user has finished speaking to respond accurately, 
     - a task known as `endpoint detection`.

- `Speech Acts`
   - Each utterance in a conversation serves as a specific `speech act`, 
     - such as answering, requesting, or acknowledging.
   - Speech acts are categorized into 
     - `Constatives` (statements), `Directives` (requests), `Commissives` (promises), and `Acknowledgments` (thanks), 
     - each reflecting the `speaker's intent`.
   - Recognizing speech acts helps dialogue systems interpret user intentions and respond appropriately, 
     - such as answering questions or confirming details.
- `Grounding`
   - *Grounding* is the process of confirming mutual understanding, 
     - often through `repeating or affirming` statements.
     - Examples include saying "OK" or repeating key details, 
       - to establish mutual understanding and build common ground.
   - Dialogue systems need grounding mechanisms to signal what they have understood and to keep the conversation flowing naturally.

- `Dialogue Structure and Subdialogues`
   - Conversations often follow structured patterns called `adjacency pairs`, 
     - like a `question and answer` or a `proposal and acceptance/rejection`.
   - `Subdialogues` or side sequences, such as clarifications, 
     - can temporarily shift focus and require the system to manage interruptions and resume the main conversation.
   - `Presequences`, or preliminary questions, set the stage for requests, 
     - like asking if the system can make reservations before making one.

- `Initiative`
   - *Initiative* in conversation can be held by one participant or shared; 
   - `mixed initiative` allows both parties to ask and answer questions, 
     - typical in human-human dialogue.
   - Dialogue systems with full mixed initiative are challenging to build; 
     - many rely on either `system-initiative` (system-led) or `user-initiative` (user-led) approaches.

- `Inference and Implicature`
   - *Inference* and *implicature* enable systems to `deduce unstated information` from context, 
     - such as deducing travel dates based on a meeting time.
   - Systems need to `interpret relevance and draw conclusions` beyond literal statements, 
     - a process crucial for `understanding implicit information` in human conversation.

## Frame-Based Dialogue Systems
- `Task-based dialogue` systems assist users with `specific tasks`, such as travel reservations, 
  - by using `frames`: knowledge structures with `slots` that capture task details; the set of frames forms a `domain ontology`.
- ![Architecture of a dialogue-state system for task-oriented dialogue](./images/chat/diag.png)
- The dialogue-state architecture, a common frame-based structure, includes `six components`, with four key components covered here and speech recognition/synthesis introduced later.

### Frames and Slot Filling
- Task-based dialogue systems use `frames with slots` to gather necessary details for tasks like booking a hotel or setting an alarm.
  - A frame-based system's goal is to `fill these slots` based on user input, 
    - using `questions` to clarify slot details.
    - Simple systems use `pre-written questions`, while advanced systems `generate questions` dynamically.
  - Slot fillers are constrained to specific semantic types, such as city or date.
  
  | **Slot**            | **Type** | **Example Question**                      |
  |---------------------|----------|--------------------------------------------|
  | ORIGIN CITY         | city     | "From what city are you leaving?"          |
  | DESTINATION CITY    | city     | "Where are you going?"                     |
  | DEPARTURE TIME      | time     | "When would you like to leave?"            |
  | DEPARTURE DATE      | date     | "What day would you like to leave?"        |
  | ARRIVAL TIME        | time     | "When do you want to arrive?"              |
  | ARRIVAL DATE        | date     | "What day would you like to arrive?"       |  

  - `Multiple frames` may be required for `different domains`, 
    - and the system must identify which frame and slot to use for each input.
- `Three key tasks in slot filling` are 
  - domain classification, 
  - intent determination, and 
  - filling slots based on user input.
  - 🍎 “Show me morning flights from Boston to San Francisco on Tuesday” 
    - fills slots for origin, destination, and time within the air-travel domain.
- `Handwritten rules or machine learning methods` are used for slot-filling, 
  - with `regular expressions` for simpler tasks like setting an alarm.
  - Most modern systems rely on supervised machine learning, 
    - using labeled examples for domain, intent, and slot-filling.
- `BIO tagging` is often employed, where each word is tagged as `beginning (B), inside (I), or outside (O)` a slot label.
  - Slot-filling architecture uses a language model encoder, feedforward layer, and softmax output to assign BIO tags.
  - ![Slot-filling architecture](./images/chat/fillslot.png)
- Synonyms or codes (e.g., “San Francisco” to “SFO”) are normalized using dictionaries, 
  - with a mix of rules and machine learning used in practical applications.
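
As a rough illustration of the last two steps above, the sketch below turns a BIO-tagged sentence into slot fillers and then normalizes them with a dictionary. The tag sequence is written by hand to stand in for a trained tagger's output, and the slot labels, the `NORMALIZE` dictionary, and both function names are assumptions made for this example.

```python
# Hypothetical normalization dictionary mapping surface forms to canonical codes.
NORMALIZE = {"san francisco": "SFO", "boston": "BOS"}


def bio_to_slots(tokens, tags):
    """Group B-/I- tagged tokens into a {slot_label: text} dictionary."""
    slots = {}
    current_label, current_words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label:
                slots[current_label] = " ".join(current_words)
            current_label, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_words.append(token)
        else:  # "O" or an inconsistent I- tag closes the open span
            if current_label:
                slots[current_label] = " ".join(current_words)
            current_label, current_words = None, []
    if current_label:
        slots[current_label] = " ".join(current_words)
    return slots


def normalize(slots):
    """Replace known surface forms with their canonical code."""
    return {label: NORMALIZE.get(text.lower(), text) for label, text in slots.items()}


tokens = "Show me morning flights from Boston to San Francisco on Tuesday".split()
# Hand-written tags standing in for model predictions (one tag per token).
tags = ["O", "O", "B-DEPART_TIME", "O", "O", "B-ORIGIN", "O",
        "B-DEST", "I-DEST", "O", "B-DEPART_DATE"]

slots = bio_to_slots(tokens, tags)
print(normalize(slots))
```

The `I-DEST` tag is what lets "San Francisco" be recovered as a single two-word filler.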

### Evaluating Task-Based Dialogue
- Task-based systems are evaluated based on 
  - `task success rate`: correctly completing tasks like booking flights,
  - `slot error rate`: the percentage of slots filled incorrectly.

- 🍎 Given sentence `Make an appointment with Chris at 10:30 in Gates 104`, and extracted slot structure:

  | **Slot** | **Filler**    |
  |----------|---------------|
  | PERSON   | Chris         |
  | TIME     | 11:30 a.m.    |
  | ROOM     | Gates 104     |

  - The extracted structure has a slot error rate of 1/3, since one of the three slots (TIME) is wrong.

- Additional metrics include 
  - precision, recall, F-score, 
  - efficiency costs, such as dialogue length in seconds or turns.
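
The appointment example can be checked with a small calculation. This is a sketch under the assumption that slot error rate counts substituted, deleted, and inserted slots and divides by the number of reference slots; the function name and the normalized reference time `"10:30 a.m."` are choices made for this illustration.

```python
def slot_error_rate(reference, hypothesis):
    """Wrong, missing, or spurious slots divided by the number of reference slots."""
    errors = 0
    for slot, value in reference.items():
        if hypothesis.get(slot) != value:
            errors += 1  # substituted or deleted slot
    errors += sum(1 for slot in hypothesis if slot not in reference)  # inserted slot
    return errors / len(reference)


reference = {"PERSON": "Chris", "TIME": "10:30 a.m.", "ROOM": "Gates 104"}
hypothesis = {"PERSON": "Chris", "TIME": "11:30 a.m.", "ROOM": "Gates 104"}

print(slot_error_rate(reference, hypothesis))  # one wrong slot out of three
```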

## Dialogue Acts and Dialogue State
- More complex task-based dialogue systems use `dialogue acts` and `dialogue states` 
  - to handle confirmations, clarifications, and nuanced interactions with users.

### Dialogue Acts
- Dialogue acts, which extend speech acts, are used to `structure interactions` 
  - by defining specific functions like `confirming, requesting, or providing information`, 
  - tailored for particular dialogue tasks.
  - 🍎 [Dialogue acts used by the HIS restaurant recommendation system](https://hal.science/hal-00598186/document)

    | **Tag**                  | **Sys** | **User** | **Description**                                               |
    |--------------------------|---------|----------|---------------------------------------------------------------|
    | HELLO(a = x, b = y, ...) | ✔       | ✔        | Open a dialogue and give info a = x, b = y, ...               |
    | INFORM(a = x, b = y, ...) | ✔       | ✔        | Give info a = x, b = y, ...                                   |
    | REQUEST(a, b = x, ...)   |  ✔      | ✔        | Request value for a given b = x, ...                          |
    | REQALTS(a = x, ...)      |  ❌      |  ✔        | Request alternative with a = x, ...                           |
    | CONFIRM(a = x, b = y, ...) | ✔       | ✔        | Explicitly confirm a = x, b = y, ...                          |
    | CONFREQ(a = x, ..., d)   | ✔       |   ❌       | Implicitly confirm a = x, ... and request value of d          |
    | SELECT(a = x, a = y)     | ✔       |  ❌        | Ask user to select either a = x or a = y                      |
    | AFFIRM(a = x, b = y, ...) | ✔       | ✔        | Affirm and give further info a = x, b = y, ...                |
    | NEGATE(a = x)            |  ❌       | ✔        | Negate and give corrected value a = x                         |
    | DENY(a = x)              | ❌       | ✔        | Deny that a = x                                               |
    | BYE()                    | ✔       | ✔        | Close a dialogue                                              |

- A sample tagset for [the HIS System](https://hal.science/hal-00598186/document) includes 
  - acts like INFORM, CONFIRM, and REQUEST to handle user needs,
  - as shown in a HIS system example where users confirm preferences, inquire about specifics, and close dialogues.

    | **Utterance**                                        | **Dialogue Act**                                       |
    |------------------------------------------------------|--------------------------------------------------------|
    | U: Hi, I am looking for somewhere to eat.            | hello(task = find, type = restaurant)                   |
    | S: You are looking for a restaurant. What type of food do you like? | confreq(type = restaurant, food)         |
    | U: I’d like an Italian near the museum.              | inform(food = Italian, near = museum)                   |
    | S: Roma is a nice Italian restaurant near the museum. | inform(name = "Roma", type = restaurant, food = Italian, near = museum) |
    | U: Is it reasonably priced?                          | confirm(pricerange = moderate)                          |
    | S: Yes, Roma is in the moderate price range.         | affirm(name = "Roma", pricerange = moderate)            |
    | U: What is the phone number?                         | request(phone)                                          |
    | S: The number of Roma is 385456.                     | inform(name = "Roma", phone = "385456")                 |
    | U: Ok, thank you goodbye.                            | bye()                                                   |
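
Dialogue acts written as in the table above can be held in a small data structure. The sketch below parses that notation into an act name plus slot-value pairs; the `DialogueAct` class and `parse_act` function are hypothetical helpers for illustration, and the parser assumes slot values contain no commas.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DialogueAct:
    act: str
    # slot -> value; the value is None when a slot is requested without a value
    slots: dict = field(default_factory=dict)


ACT_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_act(text):
    """Parse notation such as 'inform(food = Italian, near = museum)'."""
    match = ACT_RE.match(text)
    if match is None:
        raise ValueError(f"not a dialogue act: {text!r}")
    name, args = match.groups()
    slots = {}
    for part in filter(None, (p.strip() for p in args.split(","))):
        slot, _, value = part.partition("=")
        slots[slot.strip()] = value.strip().strip('"') or None
    return DialogueAct(name.upper(), slots)


print(parse_act("inform(food = Italian, near = museum)"))
print(parse_act("confreq(type = restaurant, food)"))
print(parse_act("bye()"))
```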

### Dialogue State Tracking
- The dialogue-state tracker determines the `current state of the frame` and `the most recent user dialogue act`, 
  - summarizing all user constraints.
- Dialogue act detection involves `classifying the user's input sentence` using an encoder and an act classifier, 
  - with prior dialogue acts improving classification.
- Dialogue-act detection and slot-filling tasks are often performed together, 
  - as dialogue acts constrain slot values.
- The state tracker uses slot-filling output or a classifying model to track changes in slot values after each sentence.
- Detecting correction acts is essential, as users may rephrase or correct earlier utterances; 
  - corrections are harder to recognize because users often change how they speak, for example through `hyperarticulation`.
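
To make the idea of tracking the frame across turns concrete, here is a minimal rule-based sketch. Real trackers typically rely on slot-filling output or a trained classifier, as noted above; the `update_state` function and its handling of each act are simplifying assumptions for illustration.

```python
def update_state(state, act, slots):
    """Return a new frame state after applying one user dialogue act."""
    new_state = dict(state)
    if act in ("INFORM", "AFFIRM", "NEGATE"):
        # New or corrected values overwrite earlier ones.
        new_state.update(slots)
    elif act == "DENY":
        # Remove a value only if it is the one the user denied.
        for slot, value in slots.items():
            if new_state.get(slot) == value:
                del new_state[slot]
    return new_state


state = {}
state = update_state(state, "INFORM", {"food": "Italian", "near": "museum"})
state = update_state(state, "NEGATE", {"food": "French"})  # user corrects the cuisine
state = update_state(state, "DENY", {"near": "museum"})    # user retracts the location
print(state)
```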

### Dialogue Policy: Which act to generate
- Early frame-based systems followed a simple `dialogue policy`: 
  - ask questions until all slots are filled, 
  - then query the database and report back.
- A more advanced dialogue policy helps systems decide when to respond, 
  - ask for clarification, or take other actions, 
  - guiding the generation of dialogue acts.
- Systems often misrecognize words or meaning, 
  - so they use explicit or implicit confirmation acts to ensure shared understanding with the user.
  - Explicit confirmation acts allow users to easily correct misrecognitions, but they can be time-consuming and awkward,
    - whereas implicit confirmation is more efficient.
- Systems use `ASR (Automatic Speech Recognition) confidence levels` to decide 
  - when to confirm explicitly, implicitly, or reject based on transcription accuracy, 
  - with different thresholds for each action.
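
Both policies described in this section can be sketched in a few lines. The first function is the simple "ask until every slot is filled" policy; the second maps an ASR confidence score to a confirmation action. The threshold values, the action names, and the extra "no confirmation" band for very high confidence are assumptions for illustration; a deployed system would tune them on its own data.

```python
# Questions taken from the flight-frame table; the slot order is an assumption.
QUESTIONS = {
    "ORIGIN CITY": "From what city are you leaving?",
    "DESTINATION CITY": "Where are you going?",
    "DEPARTURE DATE": "What day would you like to leave?",
}


def next_system_act(frame):
    """Simple policy: ask about the first unfilled slot, else query the database."""
    for slot, question in QUESTIONS.items():
        if frame.get(slot) is None:
            return question
    return "QUERY_DATABASE"


# Hypothetical confidence thresholds.
REJECT_BELOW = 0.3
EXPLICIT_BELOW = 0.6
IMPLICIT_BELOW = 0.9


def confirmation_action(confidence):
    """Choose a confirmation strategy from an ASR confidence score in [0, 1]."""
    if confidence < REJECT_BELOW:
        return "reject"              # ask the user to repeat
    if confidence < EXPLICIT_BELOW:
        return "confirm_explicitly"  # "Do you want to leave from Boston?"
    if confidence < IMPLICIT_BELOW:
        return "confirm_implicitly"  # "When do you want to leave Boston?"
    return "no_confirmation"


print(next_system_act({"ORIGIN CITY": "Boston"}))
print(confirmation_action(0.5))
```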

### Natural language generation: Sentence Realization
- **Sentence realization** is the process of `generating the system's response to the user` after the content planner has chosen a dialogue act and its slots.
- The system uses **delexicalization** to generalize training examples, 
  - replacing specific slot values with generic tokens for flexibility in generating sentences.
- 🍎 [A restaurant recommendation system](https://www.isca-archive.org/interspeech_2017/nayak17_interspeech.html): 
  - The content planner selects a dialogue act and attributes (e.g., restaurant name, neighborhood, cuisine), 
  - and the sentence realizer generates different possible sentences based on these inputs.

  | Dialogue act and delexicalized realizations |
  |---------|
  | recommend(restaurant_name = Au Midi, neighborhood = midtown, cuisine = french) |
  | 1. `restaurant_name` is in `neighborhood` and serves `cuisine` food. |
  | 2. There is a `cuisine` restaurant in `neighborhood` called `restaurant_name`. |
- An **encoder-decoder model** is used to map frames (slots and fillers) to delexicalized sentences, 
  - which are later relexicalized with specific values.
  - ![An encoder decoder sentence realizer mapping slots/fillers to English](./images/chat/delex.png)
- The **encoder-decoder model** is trained on labeled dialogue corpora like MultiWOZ to improve sentence realization, 
  - enabling the system to generate varied responses.
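
Delexicalization and relexicalization amount to swapping slot values for placeholders and back. The sketch below uses plain string replacement and bracketed placeholders such as `[cuisine]`; both choices, along with the function names, are assumptions for illustration, and the encoder-decoder model itself is not shown.

```python
def delexicalize(sentence, slots):
    """Replace each slot value in the sentence with a [slot] placeholder."""
    for slot, value in slots.items():
        sentence = sentence.replace(value, f"[{slot}]")
    return sentence


def relexicalize(template, slots):
    """Fill each [slot] placeholder with its concrete value."""
    for slot, value in slots.items():
        template = template.replace(f"[{slot}]", value)
    return template


slots = {"restaurant_name": "Au Midi", "neighborhood": "Midtown", "cuisine": "French"}
sentence = "Au Midi is in Midtown and serves French food."

template = delexicalize(sentence, slots)
print(template)  # [restaurant_name] is in [neighborhood] and serves [cuisine] food.
print(relexicalize("There is a [cuisine] restaurant in [neighborhood] called [restaurant_name].", slots))
```

Because the template no longer mentions a particular restaurant, one training sentence can be reused for any other set of slot values.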

## Chatbots
- Chatbots evolved from early systems like ELIZA to neural models like ChatGPT, integrating NLP tasks.
- Recent neural chatbots are also used for functional applications such as question answering and machine translation.

### Training chatbots
- Chatbots are trained on `large language model data`, 
  - including web sources like Common Crawl, Wikipedia, and books.
  - Additional `dialogue datasets`, like [Topical-Chat](https://github.com/alexa/Topical-Chat) and [EMPATHETIC DIALOGUES](https://paperswithcode.com/dataset/empatheticdialogues), 
    - are often used to train chatbots with real conversations.
  - `Social media` data from platforms like Twitter, Reddit, and Weibo is also used, 
    - with posts treated as conversation starters and comments as replies.
  - Datasets from the web are `filtered for toxicity` using toxicity classifiers before being used for training.
- Chatbot models are typically trained using a `causal language model architecture (decoder-only)`, 
  - predicting each word based on previous words in a conversation.
  - ![Training a causal (decoder-only) language model for a chatbot.](./images/chat/casual.png)
- An alternative approach is to use an `encoder-decoder architecture`, 
  - where the encoder processes the entire conversation and the decoder generates the next turn.
  - ![an encoder-decoder language model for a chatbot](./images/chat/ed.png)
- Despite pretraining on dialogue data, 
  - further `fine-tuning` is often required to customize the chatbot for specific tasks.
  - Fine-tuning stages are essential for improving the chatbot's responses and aligning with desired conversational behaviors.
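
The causal (decoder-only) training setup can be pictured as turning a conversation into one token sequence and predicting each token from the tokens before it. The sketch below only builds those (context, next token) pairs; the whitespace tokenizer, the `<sep>` turn separator, and the function names are simplifying assumptions, and no model is trained here.

```python
def flatten(turns, sep="<sep>"):
    """Concatenate dialogue turns into one token list, marking turn boundaries."""
    tokens = []
    for turn in turns:
        tokens.extend(turn.split())  # naive whitespace tokenization
        tokens.append(sep)
    return tokens


def next_token_examples(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = flatten(["hi there", "hello"])
for context, target in next_token_examples(tokens):
    print(context, "->", target)
```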

### Fine Tuning for Quality and Safety
- Dialogue systems are fine-tuned using `labeled data` to improve the quality and safety of responses, 
  - ensuring sensible and interesting dialogue while `avoiding harmful suggestions`.
- Fine-tuning involves training the system with high-quality, safe dialogues, 
  - often using a `multi-task learning approach` for tasks like answering questions and following instructions.
- Additional discriminative data is used to downweight low-quality or harmful responses, 
  - with `human-labeled ratings` for safety and quality assigned to each system turn.
- A language model can classify the quality and safety of responses by 
  - generating a label (e.g., `SENSIBLE, INTERESTING, UNSAFE`) 
  - in a two-phase process: `generative and discriminative`.
- At inference time, the system generates responses and assigns `safety/quality labels` to filter out unsafe options, 
  - returning the `highest-ranking safe response` to the user.
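
The inference-time step described above (generate candidates, label them, filter, rank) might look like the following sketch. The candidate tuples, their labels, and the scores are made up for illustration; in a real system both would come from the fine-tuned model.

```python
def pick_response(candidates):
    """Return the highest-scoring candidate not labeled UNSAFE, or None if all are unsafe.

    Each candidate is a (text, labels, score) tuple.
    """
    safe = [c for c in candidates if "UNSAFE" not in c[1]]
    if not safe:
        return None
    return max(safe, key=lambda c: c[2])[0]


# Hypothetical candidates with model-assigned labels and ranking scores.
candidates = [
    ("You could try a short walk.", {"SENSIBLE", "INTERESTING"}, 0.82),
    ("Just ignore your doctor.", {"SENSIBLE", "UNSAFE"}, 0.95),
    ("OK.", {"SENSIBLE"}, 0.40),
]

print(pick_response(candidates))  # You could try a short walk.
```

Note that the unsafe candidate is dropped even though it has the highest score: filtering happens before ranking.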

### Learning to perform retrieval as part of responding
- Modern chatbots, like Sparrow, integrate `retrieval-based` components 
  - where a fake dialogue participant (e.g., Search Query) is used to query search engines for information.
- Chatbot prompts can include special participants (`Search Query and Search Results`) 
  - to guide the system in generating search queries and handling fact-based questions.
- Systems can be fine-tuned to trigger search queries by using labeled data, 
  - where labelers perform fact checks and create appropriate search queries for incorrect responses.
  - The chatbot then uses search results as context to refine its responses, similar to retrieval-based question-answering methods.
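
One way to picture the special participants is as extra speakers in the prompt text. In the sketch below, the prompt first ends with a `Search Query:` cue so the model can propose a query, and a second prompt then includes the query and results before cueing the chatbot's turn. The exact layout, the `Chatbot` speaker name, the search strings, and the `build_prompt` helper are assumptions for illustration, not the format of any particular system.

```python
def build_prompt(turns, next_speaker):
    """Render (speaker, text) turns as lines and cue the next speaker."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append(f"{next_speaker}:")
    return "\n".join(lines)


turns = [("User", "Who wrote Alice in Wonderland?")]

# Step 1: cue the model to write a search query.
query_prompt = build_prompt(turns, "Search Query")

# Step 2: add the (hypothetical) query and results, then cue the answer.
turns += [
    ("Search Query", "Alice in Wonderland author"),
    ("Search Results", "Alice's Adventures in Wonderland is an 1865 novel by Lewis Carroll."),
]
answer_prompt = build_prompt(turns, "Chatbot")
print(answer_prompt)
```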

### Evaluating Chatbots
- Chatbots are evaluated by 
  - `participants`: who chat with the bot
  - or `observers`: who read transcripts.
- Evaluations use [Likert scales](https://en.wikipedia.org/wiki/Likert_scale) to rate qualities like engagingness, fluency, and humanness.
  - Observer evaluations focus on turn coherence or overall conversation quality.
  - The ACUTE-EVAL method compares two systems side by side on qualities like engagingness and knowledgeability.
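
Both kinds of human evaluation reduce to simple aggregates. The sketch below averages Likert ratings per quality and computes a pairwise win rate of the kind a side-by-side comparison yields; all ratings and judgments are invented for illustration.

```python
from statistics import mean

# Hypothetical 1-5 Likert ratings from three participants.
ratings = {
    "engagingness": [4, 5, 3],
    "fluency": [5, 4, 4],
    "humanness": [3, 3, 4],
}
mean_ratings = {quality: mean(scores) for quality, scores in ratings.items()}


def win_rate(judgments, system):
    """Fraction of pairwise judgments in which `system` was preferred."""
    return sum(1 for winner in judgments if winner == system) / len(judgments)


# Hypothetical observer choices when asked which system is more engaging.
judgments = ["A", "B", "A", "A"]

print(mean_ratings["engagingness"])
print(win_rate(judgments, "A"))
```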

- 🏃 Practice from HuggingFace NLP
  - [Summarization](https://huggingface.co/learn/nlp-course/en/chapter7/5?fw=pt)
  - [Training a causal language model from scratch](https://huggingface.co/learn/nlp-course/en/chapter7/6?fw=pt)