
##Building a chatbot

In this notebook, I’m going to go over another application of a Seq2Seq model—a chatbot,
which is an NLP application with which you can have a conversation. We are
going to build a very simple yet functional chatbot using a Seq2Seq model and discuss
techniques and challenges in building intelligent agents.

To recap, two main types of
dialogue systems exist: **task-oriented and chatbots**. 

Task-oriented dialogue
systems are used to achieve specific goals, such as making a reservation at a
restaurant or obtaining some information, whereas chatbots are used to have conversations
with humans. Conversational technologies are currently a hot topic among NLP practitioners,
due to the success and proliferation of commercial conversational AI systems
such as Amazon Alexa, Apple Siri, and Google Assistant.

If you think of a conversation as a set of “turns” where the response is generated by
pattern matching against the previous utterance, this starts to look a lot like a typical
NLP problem. 

In particular, if you regard dialogues as a problem where an NLP system
is simply converting your question to its response, this is exactly where we can
apply the Seq2Seq models we covered in this chapter so far. We can treat the previous
(human’s) utterance as a foreign sentence and have the chatbot “translate” it into
another language. 

Even though both “languages” are English in this case, it is
common practice in NLP to treat the input and the output as two different languages
and apply a Seq2Seq model to them. Other tasks framed this way include summarization
(longer text to a shorter one) and grammatical error correction (text with errors to text without).
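
To make the “dialogue as translation” framing concrete, here is a minimal sketch of how a single dialogue could be turned into source/target pairs: each utterance becomes the “foreign sentence” and the following utterance becomes its “translation”. The helper name `turns_to_pairs` and the toy dialogue are my own illustration, not part of any library or of the dataset's tooling.

```python
from typing import List, Tuple


def turns_to_pairs(turns: List[str]) -> List[Tuple[str, str]]:
    """Pair every utterance with the utterance that follows it.

    Hypothetical helper for illustration: the first element of each pair
    plays the role of the source sentence, the second the target sentence.
    """
    return list(zip(turns, turns[1:]))


# A toy dialogue (already tokenized, one turn per element).
dialogue = [
    "Have you played in a band ?",
    "What type of band ?",
    "A rock and roll band .",
]

for source, target in turns_to_pairs(dialogue):
    # Tab-separated, one (utterance, response) pair per line.
    print(f"{source}\t{target}")
```

Note that with this pairing every turn except the first and last appears twice: once as a response and once as the utterance that prompts the next response.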

##Setup

In [1]:
!pip -q install fairseq

[?25l[K     |▏                               | 10 kB 20.2 MB/s eta 0:00:01[K     |▍                               | 20 kB 25.1 MB/s eta 0:00:01[K     |▋                               | 30 kB 27.6 MB/s eta 0:00:01[K     |▊                               | 40 kB 21.1 MB/s eta 0:00:01[K     |█                               | 51 kB 16.5 MB/s eta 0:00:01[K     |█▏                              | 61 kB 12.8 MB/s eta 0:00:01[K     |█▍                              | 71 kB 13.0 MB/s eta 0:00:01[K     |█▌                              | 81 kB 13.7 MB/s eta 0:00:01[K     |█▊                              | 92 kB 14.5 MB/s eta 0:00:01[K     |██                              | 102 kB 13.9 MB/s eta 0:00:01[K     |██▏                             | 112 kB 13.9 MB/s eta 0:00:01[K     |██▎                             | 122 kB 13.9 MB/s eta 0:00:01[K     |██▌                             | 133 kB 13.9 MB/s eta 0:00:01[K     |██▊                             | 143 kB 13.9 MB/s eta 0:

Let's download and unzip the dataset:

In [2]:
%%shell

mkdir -p data/chatbot
wget https://realworldnlpbook.s3.amazonaws.com/data/chatbot/selfdialog.zip
unzip selfdialog.zip -d data/chatbot

--2021-12-16 10:00:35--  https://realworldnlpbook.s3.amazonaws.com/data/chatbot/selfdialog.zip
Resolving realworldnlpbook.s3.amazonaws.com (realworldnlpbook.s3.amazonaws.com)... 52.216.200.211
Connecting to realworldnlpbook.s3.amazonaws.com (realworldnlpbook.s3.amazonaws.com)|52.216.200.211|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13668021 (13M) [application/zip]
Saving to: ‘selfdialog.zip’


2021-12-16 10:00:36 (13.2 MB/s) - ‘selfdialog.zip’ saved [13668021/13668021]

Archive:  selfdialog.zip
  inflating: data/chatbot/selfdialog.valid.tok.en  
  inflating: data/chatbot/selfdialog.valid.tok.fr  
  inflating: data/chatbot/selfdialog.train.tok.en  
  inflating: data/chatbot/selfdialog.train.tok.fr  




##Preparing a dataset

In this case study, we are going to use [The Self-dialogue Corpus](https://github.com/jfainberg/self_dialogue_corpus), a collection of 24,165 conversations. What’s special
about this dataset is that these conversations are not actual ones between two people,
but fictitious ones written by one person who plays both sides.

By collecting made-up conversations written by a single person, rather than real
conversations between two people, the Self-dialogue Corpus improves quality at half
the cost (because you need only one person instead of two!).

You can use the following combination of the paste command (to stitch files horizontally)
and the head command to peek at the beginning of the training portion.

In [3]:
!paste data/chatbot/selfdialog.train.tok.fr data/chatbot/selfdialog.train.tok.en | head

I 'm playing basketball this weekend , do you want to come along ?	No thanks , I rented several movies I want to stay home and watch .
Have you played in a band ?	What type of band ?
What type of band ?	A rock and roll band .
A rock and roll band .	Sure , I played in one for years .
Sure , I played in one for years .	No kidding ?
No kidding ?	I played in rock love love .
I played in rock love love .	You played local ?
You played local ?	Yes
Yes	Would you play again ?
Would you play again ?	Why ?


As you can see, each line consists of an utterance (on the left) and a response to it (on the right).

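Notice that this dataset has the same structure as the Spanish-English parallel
corpus from the translation example: two line-aligned files, one holding the source side and one holding the target side.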
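
If you prefer to inspect the pairs from Python rather than the shell, the same peek can be sketched as follows. The function name `peek_pairs` is hypothetical, and the snippet assumes the two files are line-aligned, as the `paste` output suggests; it only reads the files if they exist at the paths used earlier.

```python
from itertools import islice
from pathlib import Path


def peek_pairs(src_path, tgt_path, n=10):
    """Return the first n (utterance, response) pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        # zip reads both files in lockstep; islice stops after n pairs.
        return [(s.rstrip("\n"), t.rstrip("\n"))
                for s, t in islice(zip(src, tgt), n)]


src_file = Path("data/chatbot/selfdialog.train.tok.fr")
tgt_file = Path("data/chatbot/selfdialog.train.tok.en")

if src_file.exists() and tgt_file.exists():
    for utterance, response in peek_pairs(src_file, tgt_file, n=3):
        print(utterance, "->", response)
```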

The next step is to run the fairseq-preprocess
command to convert it to a binary format as follows:

In [4]:
!fairseq-preprocess \
  --source-lang fr \
  --target-lang en \
  --trainpref data/chatbot/selfdialog.train.tok \
  --validpref data/chatbot/selfdialog.valid.tok \
  --destdir data/chatbot-bin \
  --thresholdsrc 3 \
  --thresholdtgt 3

2021-12-16 10:00:44 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data/chatbot-bin', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='fr', srcdict=None, target_lang='en', task='translation', tensorboard_logdir=None, testpref=None, tgtdict=None, threshold_loss_scale=None, thresholdsrc=3, thresholdtgt=3, tokenizer=None, tpu=False, trainpref='dat

Again, this is similar to what we ran for the Spanish translator example. 

Just pay attention
to what you specify as the source language: we are using fr instead of es here, because the source-side files carry the `.fr` extension even though their content is English.
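
The `--thresholdsrc 3` and `--thresholdtgt 3` options tell the preprocessor to map words that appear fewer than three times to the unknown token. The sketch below illustrates that idea in plain Python on a toy corpus; it is not fairseq's implementation, and the helper names and the `<unk>` string are assumptions made for the example.

```python
from collections import Counter


def build_vocab(sentences, threshold):
    """Keep only tokens that occur at least `threshold` times."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, count in counts.items() if count >= threshold}


def apply_vocab(sentence, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with the unknown symbol."""
    return " ".join(tok if tok in vocab else unk for tok in sentence.split())


# Toy corpus; a threshold of 2 is used here only because the corpus is tiny.
corpus = ["What type of band ?", "A rock and roll band .", "What band ?"]
vocab = build_vocab(corpus, threshold=2)

print(apply_vocab("What type of band ?", vocab))  # What <unk> <unk> band ?
```

Dropping rare words this way keeps the vocabulary small, at the price of the model never being able to produce or distinguish those words.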

##Training and running a chatbot

Now that the training data for the chatbot is ready, let’s train a Seq2Seq model from
this data. 

You can invoke the fairseq-train command with almost identical parameters
to the last time, as shown next:

In [None]:
!fairseq-train \
  data/chatbot-bin \
  --arch lstm \
  --share-decoder-input-output-embed \
  --optimizer adam \
  --lr 1.0e-3 \
  --max-tokens 4096 \
  --save-dir data/chatbot-ckpt

As before, pay attention to how the validation loss changes every epoch. When I
tried this, the validation loss decreased for about five epochs but then started to slowly
creep back up. 

Feel free to stop the training command by pressing `Ctrl + C` after you observe the validation loss leveling out. Fairseq will automatically save the best model
(measured by the validation loss) to `checkpoint_best.pt`.
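
The “stop once the validation loss stops improving” rule of thumb can be written down as a small sketch. The loss values below are invented to mimic the pattern described above (improving for about five epochs, then creeping back up); they are not real training output, and the function names and the `patience` parameter are my own choices for illustration.

```python
def best_epoch(valid_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    return best_index + 1


def should_stop(valid_losses, patience=2):
    """True once the last `patience` epochs failed to beat the earlier best."""
    if len(valid_losses) <= patience:
        return False
    best_before = min(valid_losses[:-patience])
    return min(valid_losses[-patience:]) >= best_before


# Hypothetical per-epoch validation losses, for illustration only.
losses = [5.10, 4.60, 4.30, 4.20, 4.15, 4.18, 4.22]

print("best epoch:", best_epoch(losses))
print("stop now?", should_stop(losses, patience=2))
```

The best epoch here corresponds to the model that would be kept as `checkpoint_best.pt`, so interrupting training after the loss starts rising does not lose it.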

Finally, you can run the chatbot model by invoking the fairseq-interactive
command, as shown here:

In [None]:
!fairseq-interactive data/chatbot-bin \
  --path data/chatbot-ckpt/checkpoint_best.pt \
  --beam 5 \
  --source-lang fr \
  --target-lang en

2021-12-16 11:16:42 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='data/chatbot-bin', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=F

You can now type source sentences and have a conversation
with your chatbot by having it “translate” them into another language! Here’s part of
a conversation that I had with the model that I trained (I added boldface for clarity).

In this example, the conversation looks natural. Because the Self-dialogue Corpus is
built by restricting the set of possible conversation topics, the conversation is more
likely to go smoothly if you stay on such topics (movie, sports, music, and so on).

However, as soon as you start talking about unfamiliar topics, the chatbot loses its
confidence in its answers.

This is a well-known phenomenon—a simple Seq2Seq-based chatbot quickly regresses
to producing cookie-cutter answers such as “I don’t know” and “I’m not sure” whenever
asked about something it’s not familiar with. This has to do with the way we
trained this chatbot. Because we trained the model so that it minimizes the loss in the
training data, the best strategy it can take to reduce the loss is to produce something
applicable to as many input sentences as possible. Very generic phrases such as “I
don’t know” can be an answer for many questions, so it’s a great way to play it safe and
reduce the loss!
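
A toy calculation makes this argument tangible. The sketch below ignores the input entirely (a deliberate simplification; a real Seq2Seq model does condition on the input) and compares two strategies that each put 90% of their probability on a single reply. The response list and the 0.9 figure are made up for illustration. Because the generic reply is a valid answer to more training examples, betting on it gives the lower average negative log-likelihood.

```python
import math
from collections import Counter

# Invented training responses: the generic reply answers many questions.
responses = [
    "I don't know .",
    "I don't know .",
    "I don't know .",
    "A rock and roll band .",
    "Yes",
    "No kidding ?",
]


def peaked(favourite, support, mass=0.9):
    """Distribution putting `mass` on one reply and spreading the rest evenly."""
    other = (1.0 - mass) / (len(support) - 1)
    return {r: (mass if r == favourite else other) for r in support}


def avg_nll(dist, targets):
    """Average negative log-likelihood of the targets under `dist`."""
    return -sum(math.log(dist[t]) for t in targets) / len(targets)


support = sorted(set(responses))
loss_generic = avg_nll(peaked("I don't know .", support), responses)
loss_specific = avg_nll(peaked("Yes", support), responses)

print(f"always favour the generic reply: {loss_generic:.3f}")
print(f"always favour a specific reply:  {loss_specific:.3f}")
print("most frequent reply:", Counter(responses).most_common(1)[0][0])
```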