<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Creating Chatbots and recognizing entities</div>

In this exercise, we will apply the text classification techniques studied previously to the creation of chatbots.

Building an agent capable of handling conversations is usually performed through the following steps:
1. We apply text classification techniques to the utterance typed by the user, in order to detect the **intent** of the user (e.g. "applying for a loan", "purchasing a product from a store", "asking about the weather", etc.)
2. In addition to detecting the intent, we extract **Named Entities** from the utterance (e.g. location names, currency amounts, etc.)
3. We use the extracted intents and entities to decide how the chatbot should reply
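
The three steps above can be sketched in plain Python before involving any framework. This is only a toy illustration: the keyword lists, the "capitalized word after *in*" rule and the canned replies are all hypothetical stand-ins for the trained components that a framework would provide.

```python
import re

# Hypothetical keyword lists standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "greet": {"hello", "hi", "bonjour", "salut"},
    "ask_time": {"time", "clock", "heure"},
}

# Hypothetical canned replies, one per intent (None means "not understood").
REPLIES = {
    "greet": "Hello!",
    "ask_time": "It is 10 o'clock in {location}.",
    None: "Sorry, I did not understand.",
}


def detect_intent(utterance):
    """Step 1: return the intent sharing the most keywords with the utterance."""
    tokens = set(re.findall(r"\w+", utterance.lower()))
    scores = {name: len(tokens & words) for name, words in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract_entities(utterance):
    """Step 2: naive rule, a capitalized word after 'in' is a location."""
    match = re.search(r"\bin ([A-Z]\w+)", utterance)
    return {"location": match.group(1)} if match else {}


def reply(utterance):
    """Step 3: combine the intent and the entities to choose the answer."""
    intent = detect_intent(utterance)
    entities = extract_entities(utterance)
    return REPLIES[intent].format(location=entities.get("location", "your city"))
```

Hand-written rules like these break down quickly, which is why the rest of this exercise relies on learned models for steps 1 and 2.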

In this exercise, we will use the open source chatbot framework **Rasa** to build the chatbot. It includes the package *Rasa NLU*, which performs NLU tasks such as entity extraction and intent classification, as well as *Rasa Core*, which manages the conversational aspects.

# 1. Using Rasa

Rasa runs code asynchronously. The following cell configures the notebook for Rasa use. See [the documentation of Rasa](https://rasa.com/docs/rasa/api/jupyter-notebooks/) for more information.


In [None]:
import nest_asyncio

nest_asyncio.apply()


# Rasa emits many warnings during execution. The lines below silence them, for readability.
# Comment them out to keep the warnings if desired.
import warnings
warnings.simplefilter('ignore')

Rasa chatbot project folders must follow a precise structure to function. The important files are:
1. A **config.yml** file, defining the language of the chatbot, the methods used for intent classification, etc.
2. A **domain.yml** file, defining the intents and entities covered by the chatbot, as well as the utterances and actions it will use. Generally, most of the engineering effort in a chatbot goes into this file.
3. An **nlu** file (Rasa uses json or md files), containing the training samples for intent classification and entity extraction.
4. A **stories** file, containing the different scenarios of interaction between the user and the bot. This file defines how bot actions are associated with intents and entities.
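
As a quick sanity check, a small helper can list which of these files are missing from a project folder. This is a minimal sketch: `config.yml`, `domain.yml` and the `data/` folder are the names used in this notebook, while the file names `nlu.md` and `stories.md` are assumptions that may differ in your project.

```python
from pathlib import Path

# Paths relative to the project root. config.yml, domain.yml and data/ appear
# in this notebook; the names nlu.md and stories.md are assumed.
EXPECTED_FILES = ["config.yml", "domain.yml", "data/nlu.md", "data/stories.md"]


def missing_project_files(project):
    """Return the expected files that are absent from the project folder."""
    root = Path(project)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means that the folder contains every expected file.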

When starting a new project, Rasa provides a useful function that prepares the folder structure:

In [None]:
from rasa.cli.scaffold import create_initial_project

project = "my_new_chatbot_project"
create_initial_project(project)

# 2. Building our first chatbot

In [None]:
#project location
project = "chatbot_projects/1_simple_clockbot/"

#path to config, domain, nlu and stories files
config = project+"config.yml"
domain = project+"domain.yml"
training_files = project+"data/"

#where to store the models
output = project+"models/"

Our first chatbot is meant to be a simple clockbot (a chatbot that gives the time). Let's go through the structure of the files to perform this task.

## 2.a. config file

```
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: fr
pipeline: supervised_embeddings
```

The chatbot is meant to be in French. The pipeline we use is the default Rasa one, **supervised_embeddings**.

In the text classification examples we saw in the previous exercise, we learnt individual vector representations of every word in our corpus, and used these as features for classification. We saw that this causes issues with out-of-vocabulary (OOV) words, and with model storage. The main modification to embeddings proposed by Rasa is to learn representations of intents as well as of words. This means that any input sequence can be judged in terms of its similarity with an intent, making classification easier and less costly. You can read more about this [here](https://medium.com/rasa-blog/supervised-word-vectors-from-scratch-in-rasa-nlu-6daf794efcd8).
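
To make the idea of "similarity with an intent" concrete, here is a rough sketch that places utterances and intents in the same vector space and ranks intents by cosine similarity. It is not Rasa's model: Rasa learns its embeddings with a neural network, whereas this sketch simply counts character trigrams and sums the vectors of the examples of each intent. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Map a text to a sparse vector of character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def intent_vectors(examples):
    """Build one vector per intent by summing the vectors of its examples."""
    vectors = {}
    for intent, sentences in examples.items():
        total = Counter()
        for sentence in sentences:
            total.update(embed(sentence))
        vectors[intent] = total
    return vectors


def rank_intents(text, vectors):
    """Rank intents by similarity with the input text, best first."""
    query = embed(text)
    scores = {intent: cosine(query, vec) for intent, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the comparison is made on sub-word units, a word never seen during training (such as a truncated "bonjou") can still land close to the right intent.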

## 2.b. domain file

```
intents:
  - greet
  - ask_time

actions:
- utter_greet
- utter_time

templates:
  utter_greet:
  - text: "Bien le bonjour !"
  utter_time:
  - text: "Il est 10h"
```

As introduced earlier, this first chatbot will:
- cover 2 intents (greeting and asking what time it is)
- perform only two possible actions (greet the user back, or give the time)

Both actions performed by the chatbot in this example are single utterances, defined within the file itself. It would be possible to declare more complex actions, run separately in an "action server" (see [the documentation](https://rasa.com/docs/rasa/core/actions/#actions)). As the objective of this exercise is not to define an action server, but rather to explore the machine learning capabilities of chatbots, we simplify the problem by creating a dumb clockbot stuck at 10 o'clock.

## 2.c. stories file

```
## greet path
* greet
  - utter_greet

## time_path_polite
* greet
  - utter_greet
* ask_time
  - utter_time
  - action_restart
```

We define only two main scenarios for interaction with this chatbot:
- the user greets the chatbot, but interacts no further
- the user greets the chatbot, then asks what time it is (the action_restart at the end of this path reinitializes the bot once it has given the time)

Note that we have not defined a scenario where the user would ask the time without greeting the bot first (we will cover this in section 3).
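
One way to picture what the stories encode is a lookup table from the sequence of intents seen so far to the next bot actions. The sketch below is a deliberate simplification with hypothetical helper names: Rasa Core trains policies on the stories rather than using a plain dictionary, but the table shows why an uncovered path has no memorized answer.

```python
# The two stories from the file above, as lists of (intent, actions) turns.
STORIES = {
    "greet path": [("greet", ["utter_greet"])],
    "time_path_polite": [
        ("greet", ["utter_greet"]),
        ("ask_time", ["utter_time", "action_restart"]),
    ],
}


def build_lookup(stories):
    """Map each sequence of intents seen in a story to the bot's next actions."""
    lookup = {}
    for turns in stories.values():
        history = ()
        for intent, actions in turns:
            history += (intent,)
            lookup[history] = actions
    return lookup


def next_actions(lookup, intent_history):
    """Return the memorized actions, or None when no story covers the history."""
    return lookup.get(tuple(intent_history))
```

With these two stories, the history `["greet", "ask_time"]` is covered, whereas `["ask_time"]` alone is not.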

## 2.d. nlu file

```
## intent:greet
- salut
- hello
- yo
- slt
- bonjour
- bonsoir

## intent:ask_time
- il est quelle heure ?
- quelle heure est-il
- t'as l'heure stp ?
```

This file contains examples for all the intents that we have defined for the chatbot. Note that we included some examples containing abbreviations, misspellings, etc. Even if NLU models are designed to be robust to misspellings when trained on correctly spelled utterances, it is good practice to include examples "from the real world" for each intent, and to include them as they are (without correcting misspellings).
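
When the training file grows, it helps to check how many examples each intent has. The following sketch parses the Markdown format shown above into a dictionary. The parser is a hypothetical helper written for this exercise, not part of Rasa, and it only handles the `## intent:` headers and `- ` bullet lines used here.

```python
def parse_nlu_markdown(text):
    """Group example utterances by intent from Markdown-style training data."""
    examples = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("## intent:"):
            current = line[len("## intent:"):].strip()
            examples[current] = []
        elif line.startswith("- ") and current is not None:
            examples[current].append(line[2:].strip())
    return examples


NLU_SAMPLE = """
## intent:greet
- salut
- slt

## intent:ask_time
- quelle heure est-il
- t'as l'heure stp ?
"""

examples = parse_nlu_markdown(NLU_SAMPLE)
# Count the examples available for each intent.
counts = {intent: len(sentences) for intent, sentences in examples.items()}
```

A strongly unbalanced count is usually a hint that one intent needs more real-world examples.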

## 2.e. Training the chatbot


In [None]:
import rasa

#training the model and saving the path where the model is stored
model_path = rasa.train(domain, config, [training_files], output)


## 2.f. Using the chatbot



In [None]:
from rasa.jupyter import chat
chat(model_path)

## 2.g. Checking the behaviour of the bot classifier

In this section, we will confirm the resilience of the bot to typing mistakes.
We first load the model into an agent.

In [None]:
from rasa.core.agent import Agent
agent=Agent.load(model_path)

Using the following code, we can then parse various messages and see how the NLU classifier ranks them.
We can check that:
- standard messages for the intents (e.g. "salut") get classified correctly, with very high confidence levels
- messages that vary quite extensively from the canonical question used during training can still get classified correctly, but with lower confidence levels (e.g. using sms-style "kel h?")
- if the messages differ too much from the samples used during training, the confidence level decreases dramatically.

**WARNING**: This example can be biased, as we are using only two intents (classes) with very different syntax (greetings are mostly single words, while more complex sentences are used to ask for the time).

In [None]:
await agent.parse_message_using_nlu_interpreter("salut")

In [None]:
await agent.parse_message_using_nlu_interpreter("kel h")

In [None]:
await agent.parse_message_using_nlu_interpreter("compte")
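
A common way to act on these confidence levels is to ignore predictions that fall below a threshold. The helper below is a sketch under two assumptions: that the parse result is a dictionary with an `intent` entry holding a `name` and a `confidence`, and that 0.7 is a sensible cut-off (the value is arbitrary). The two sample dictionaries are hand-written with made-up confidences, not real model outputs.

```python
def accepted_intent(parse_result, threshold=0.7):
    """Return the intent name, or None when the confidence is below threshold."""
    intent = parse_result.get("intent") or {}
    if intent.get("confidence", 0.0) >= threshold:
        return intent.get("name")
    return None


# Hand-written dictionaries with made-up confidences, not real model outputs.
confident = {"text": "salut", "intent": {"name": "greet", "confidence": 0.95}}
uncertain = {"text": "compte", "intent": {"name": "greet", "confidence": 0.40}}
```

Calling `accepted_intent` on the actual results of the cells above shows which messages the bot should treat as not understood.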

# 3. Adding different paths

In the previous example, we did not consider the case where a user asks what time it is without greeting the bot first. In this example, we adapt the scenarios of user interaction to handle those cases.

The main modification is located in the **stories** file, where the following story is added:
```
## time_path_impolite
* ask_time
  - utter_impolite
```

Note that an action (utterance) has been added to handle the situation where the bot gets offended because the user did not greet it. This is also reflected in the **domain.yml** file.

In [None]:
#project location
project = "chatbot_projects/2_obnoxious_clockbot/"

#path to config, domain, nlu and stories files
config = project+"config.yml"
domain = project+"domain.yml"
training_files = project+"data/"

#where to store the models
output = project+"models/"

Let's train our bot and run it!

In [None]:
#training the model and saving the path where the model is stored
model_path = rasa.train(domain, config, [training_files], output)

The two cells below run the polite scenario (greeting + asking for time) and the impolite one (asking for time directly).
We confirm here that the bot behaves differently in those cases.

In [None]:
agent=Agent.load(model_path)
print(await agent.handle_text("slt"))
print(await agent.handle_text("il est kel h ?"))

In [None]:
agent=Agent.load(model_path)
print(await agent.handle_text("il est kel h ?"))

# 4. Adding entities

In this example, we will add **Named Entity extraction** to the bot. 

Named Entities can be viewed as instances of linguistic classes. For instance, there is an infinite number of ways in which one might refer to a location. Some examples could be "Paris", "my house", "Japan", "the restaurant I like on Main Street", etc. From a Named Entity Recognition (NER) perspective, all these examples would be specific **values** (or synonyms) referring to the **entity** "location".

## 4.a. How are entities extracted by Rasa?

### Symbolic approach

The simplest way of approaching Named Entity Extraction is to define all synonyms of an entity manually. In this case, the designer of the bot lists every possible value of an entity by hand. The extraction of these entities is then performed by matching words in an utterance against the list of synonyms.

There are some cases where this mode of definition is appropriate. For instance, to define the entity "*means of payment*", there are only limited options (*cash*, *cheque*, *credit card*, *debit card*...). This approach, however, has two main issues:
- the maintenance of synonym lists can quickly become impossible (for instance when defining entities corresponding to person or location names)
- this approach is not resilient to spelling variants (e.g. typing "crdt card" instead of "credit card").

The second issue can be mitigated by introducing "flexible" or "fuzzy" matching for entities (i.e. considering that a word that differs by only one character from one of the synonyms is still a match). However, in many cases, this is still imperfect.
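
The fuzzy matching described above can be sketched with the Levenshtein edit distance, i.e. the minimal number of single-character insertions, deletions or substitutions needed to turn one string into another. The synonym list and the function names are hypothetical, and this is one possible definition of "fuzzy" among many.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical synonym list for the "means of payment" entity.
PAYMENT_SYNONYMS = ["cash", "cheque", "credit card", "debit card"]


def match_entity(text, synonyms, max_distance=1):
    """Return the first synonym within max_distance edits of text, else None."""
    text = text.lower()
    for synonym in synonyms:
        if edit_distance(text, synonym) <= max_distance:
            return synonym
    return None
```

With a tolerance of one edit, "cheqe" is matched to "cheque", but "crdt card" is two edits away from "credit card" and is missed. Raising the tolerance recovers it at the price of more false matches, which illustrates why this mitigation remains imperfect.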

### Statistical approach

Rasa relies on **extractors** for entities. For instance, the `CRFEntityExtractor` used by the default pipeline fits a Conditional Random Field on features of each word to predict whether that word belongs to an entity. In practice, this means that entities are detected using their own representation, as well as the representations of the words surrounding them.

In our case, using this approach means that giving enough examples of the form "Quelle heure est-il à Paris", "quelle heure est-il à Londres", etc. will enable the extractor to learn that in the pattern "quelle heure est-il à ....", the word following "à" is a "location".
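
To give an intuition of what "the word and its surroundings" means for such a model, the sketch below builds a feature dictionary for each token of a sentence. The feature set is hypothetical and is not the exact list used by Rasa. A CRF would learn weights over features of this kind, for example that a capitalized word preceded by "à" is likely to be a location.

```python
def token_features(tokens, i):
    """Describe token i by its own form and by its neighbours."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "suffix3": word[-3:].lower(),
        "previous": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


sentence = "quelle heure est-il à Toulouse".split()
features = [token_features(sentence, i) for i in range(len(sentence))]
```

Here the features of the last token record both its capitalization and the fact that it follows "à", which is the kind of contextual evidence that lets an extractor label a city it has never seen.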

## 4.b. Defining location entities for our chatbot

We propose to add the capacity to ask what the time is at a given location to our chatbot.

In [None]:
#project location
project = "chatbot_projects/3_obnoxious_clockbot_with_entities/"

#path to config, domain, nlu and stories files
config = project+"config.yml"
domain = project+"domain.yml"
training_files = project+"data/"

#where to store the models
output = project+"models/"

The initial modification is located in the **domain** file, where the entity type "*location*" is defined. We also use the value of this entity in the "*utter_time*" utterance.
```
entities:
  - location

[...]

  utter_time:
  - text: "Il est 10h à {location}"
```

The training data in the **nlu** file is also modified to reflect this change:

```
## intent:ask_time
- il est quelle heure à [New York](location)
- quelle heure est-il à [Paris](location)
- quelle heure est-il à [Séoul](location)
- t'as l'heure de [Tokyo](location) stp ?
- heure de [Lyon](location)
```

In the above examples, a sentence such as ```quelle heure est-il à [Paris](location)``` should be read as "*quelle heure est-il à Paris*", where Paris is an instance of the "location" entity.
More information about the format of training data can be found in [the documentation](https://rasa.com/docs/rasa/nlu/training-data-format/).
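
The annotation syntax can be decoded with a short regular expression. The sketch below turns an annotated example into its plain text plus a list of entities with character offsets. The helper name and the layout of the output dictionaries are assumptions chosen for illustration, not the loader that Rasa uses internally.

```python
import re

# Matches annotations of the form [value](entity_type).
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated(example):
    """Return the plain text and the entities with their character offsets."""
    plain = ""
    entities = []
    cursor = 0
    for match in ANNOTATION.finditer(example):
        # Copy the text located before the annotation unchanged.
        plain += example[cursor:match.start()]
        value, entity = match.group(1), match.group(2)
        entities.append({"start": len(plain), "end": len(plain) + len(value),
                         "value": value, "entity": entity})
        plain += value
        cursor = match.end()
    plain += example[cursor:]
    return plain, entities
```

For instance, `parse_annotated("t'as l'heure de [Tokyo](location) stp ?")` yields the plain sentence and a single "location" entity whose offsets point at "Tokyo".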

## 4.c. Training and testing the chatbot

In [None]:
#training the model and saving the path where the model is stored
model_path = rasa.train(domain, config, [training_files], output)

We then test the chatbot using a city name that was not present in the training set.

In [None]:
agent=Agent.load(model_path)
print(await agent.handle_text("salut"))
print(await agent.handle_text("il est quelle heure à Toulouse"))

On an individual utterance, we can check that the model has learned to recognize entities based on the sentence structure.

In [None]:
await agent.parse_message_using_nlu_interpreter("kel h Toulouse")

However, there is one major limitation: the example above no longer works when the city name is not capitalized (toulouse instead of Toulouse). Entity extractors traditionally use word shapes (e.g. patterns of lowercase or uppercase letters) as predictors, and in our training data nearly all city names are capitalized. One way to correct this would be to add training examples with varied capitalization.

In [None]:
await agent.parse_message_using_nlu_interpreter("kel h toulouse")
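
Adding examples with varied capitalization can be automated. The sketch below is a hypothetical helper that takes one annotated training example and returns it together with copies in which the entity values are written in lower and upper case. Whether this is enough to fix the "toulouse" case would have to be checked by retraining the model on the augmented data.

```python
import re

# Matches annotations of the form [value](entity_type).
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def capitalization_variants(example):
    """Return the example plus copies whose entity values are lower/upper case."""
    variants = [example]
    for transform in (str.lower, str.upper):
        changed = ANNOTATION.sub(
            lambda m: f"[{transform(m.group(1))}]({m.group(2)})", example)
        # Skip copies identical to an existing variant (e.g. no annotation).
        if changed not in variants:
            variants.append(changed)
    return variants
```

Only the annotated values are altered, so the rest of each sentence keeps the spelling found in the original data.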