<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/6-sequence-to-sequence-models/building_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Building a chatbot

In this notebook, I’m going to go over another application of a Seq2Seq model—a chatbot,
which is an NLP application with which you can have a conversation. We are
going to build a very simple yet functional chatbot using a Seq2Seq model and discuss
techniques and challenges in building intelligent agents.

To recap, two main types of
dialogue systems exist: **task-oriented systems and chatbots**.

Task-oriented dialogue
systems are used to achieve specific goals, such as making a reservation at a
restaurant or obtaining some information, whereas chatbots are used to have conversations
with humans. Conversational technologies are currently a hot topic among NLP practitioners,
due to the success and proliferation of commercial conversational AI systems
such as Amazon Alexa, Apple Siri, and Google Assistant.

If you think of a conversation as a set of “turns” where the response is generated by
pattern matching against the previous utterance, this starts to look a lot like a typical
NLP problem. 

In particular, if you regard dialogues as a problem where an NLP system
is simply converting your question to its response, this is exactly where we can
apply the Seq2Seq models we covered in this chapter so far. We can treat the previous
(human’s) utterance as a foreign sentence and have the chatbot “translate” it into
another language. 

Even though both "languages" are English in this case, it is
common practice in NLP to treat the input and the output as two different languages
and apply a Seq2Seq model to them. Other examples include summarization (longer text to
a shorter one) and grammatical error correction (text with errors to text without).
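
To make the "translation" framing concrete, here is a minimal sketch of how one conversation could be turned into source/target pairs, with each turn paired with the turn that follows it. The helper name `make_pairs` and the sample turns are illustrative assumptions, not part of the dataset tooling.

```python
from typing import List, Tuple


def make_pairs(turns: List[str]) -> List[Tuple[str, str]]:
    """Pair each turn with the next one (hypothetical helper)."""
    # zip stops at the shorter input, so the last turn has no response.
    return list(zip(turns, turns[1:]))


# Made-up conversation, for illustration only.
conversation = [
    "How are you ?",
    "Fine , thanks .",
    "Glad to hear that .",
]

pairs = make_pairs(conversation)
for source, target in pairs:
    # Source and target are separated by a tab.
    print(f"{source}\t{target}")
```

Every turn except the last serves as a "source sentence", and every turn except the first serves as a "target sentence", so a conversation with n turns yields n - 1 training pairs.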

##Setup

In [1]:
!pip -q install fairseq



Let's download and unzip the dataset.

In [2]:
%%shell

mkdir -p data/chatbot
wget https://realworldnlpbook.s3.amazonaws.com/data/chatbot/selfdialog.zip
unzip selfdialog.zip -d data/chatbot

--2021-12-14 11:03:28--  https://realworldnlpbook.s3.amazonaws.com/data/chatbot/selfdialog.zip
Resolving realworldnlpbook.s3.amazonaws.com (realworldnlpbook.s3.amazonaws.com)... 52.217.168.233
Connecting to realworldnlpbook.s3.amazonaws.com (realworldnlpbook.s3.amazonaws.com)|52.217.168.233|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13668021 (13M) [application/zip]
Saving to: ‘selfdialog.zip’


2021-12-14 11:03:29 (36.9 MB/s) - ‘selfdialog.zip’ saved [13668021/13668021]

Archive:  selfdialog.zip
  inflating: data/chatbot/selfdialog.valid.tok.en  
  inflating: data/chatbot/selfdialog.valid.tok.fr  
  inflating: data/chatbot/selfdialog.train.tok.en  
  inflating: data/chatbot/selfdialog.train.tok.fr  




##Preparing a dataset

In this case study, we are going to use [The Self-dialogue Corpus](https://github.com/jfainberg/self_dialogue_corpus), a collection of 24,165 conversations. What’s special
about this dataset is that these conversations are not actual ones between two people,
but fictitious ones written by one person who plays both sides.

By collecting made-up conversations rather than real two-person ones, the Self-dialogue
Corpus improves the quality for half the cost (because you need only one person
instead of two).

You can use the following combination of the paste command (to stitch files horizontally)
and the head command to peek at the beginning of the training portion.

In [3]:
!paste data/chatbot/selfdialog.train.tok.fr data/chatbot/selfdialog.train.tok.en | head

I 'm playing basketball this weekend , do you want to come along ?	No thanks , I rented several movies I want to stay home and watch .
Have you played in a band ?	What type of band ?
What type of band ?	A rock and roll band .
A rock and roll band .	Sure , I played in one for years .
Sure , I played in one for years .	No kidding ?
No kidding ?	I played in rock love love .
I played in rock love love .	You played local ?
You played local ?	Yes
Yes	Would you play again ?
Would you play again ?	Why ?


As you can see, each line consists of an utterance (on the left) and a response to it (on the right).
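
If you prefer to stay in Python, the same peek can be sketched with `zip` over the two files. The function name `peek_parallel` is a hypothetical helper. The file paths are the ones extracted earlier in this notebook, and the demo is skipped if they are not present.

```python
from itertools import islice
from pathlib import Path


def peek_parallel(src_path, tgt_path, n=10):
    """Return the first n (utterance, response) pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        # zip reads both files in lockstep; islice stops after n pairs.
        return [(s.rstrip("\n"), t.rstrip("\n")) for s, t in islice(zip(src, tgt), n)]


src_file = "data/chatbot/selfdialog.train.tok.fr"  # utterances
tgt_file = "data/chatbot/selfdialog.train.tok.en"  # responses

if Path(src_file).exists() and Path(tgt_file).exists():
    for utterance, response in peek_parallel(src_file, tgt_file, n=3):
        print(f"{utterance}\t{response}")
```

This relies on the two files being aligned line by line, which is the same assumption the paste command makes.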

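Notice that this dataset has the same structure as the Spanish-English parallel
corpus from the translation example: two line-aligned files, one for the source side and one for the target side.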

The next step is to run the fairseq-preprocess
command to convert it to a binary format as follows:

In [4]:
!fairseq-preprocess \
  --source-lang fr \
  --target-lang en \
  --trainpref data/chatbot/selfdialog.train.tok \
  --validpref data/chatbot/selfdialog.valid.tok \
  --destdir data/chatbot-bin \
  --thresholdsrc 3 \
  --thresholdtgt 3

2021-12-14 11:09:33 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data/chatbot-bin', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='fr', srcdict=None, target_lang='en', task='translation', tensorboard_logdir=None, testpref=None, tgtdict=None, threshold_loss_scale=None, thresholdsrc=3, thresholdtgt=3, tokenizer=None, tpu=False, trainpref='dat

Again, this is similar to what we ran for the Spanish translator example.

Just pay attention
to what you specify as the source language: we are using fr instead of es here. The fr label only matches the file extension of the utterance side; the text itself is English.
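
The `--thresholdsrc 3` and `--thresholdtgt 3` options deserve a short illustration. As I understand them, words that appear fewer than that many times are mapped to an unknown-word symbol. The sketch below mimics that idea in plain Python; the helpers `build_vocab` and `apply_vocab`, the toy corpus, and the `<unk>` string are assumptions for illustration, and fairseq's real dictionary code does more (special symbols, padding of the vocabulary size, and so on).

```python
from collections import Counter


def build_vocab(sentences, threshold=3):
    """Keep tokens seen at least `threshold` times (hypothetical helper)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, count in counts.items() if count >= threshold}


def apply_vocab(sentence, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with the unknown symbol."""
    return " ".join(tok if tok in vocab else unk for tok in sentence.split())


# Toy corpus, already tokenized on whitespace like the .tok files.
corpus = ["Yes", "Yes , why ?", "Yes , why not ?", "why ?"]

vocab = build_vocab(corpus, threshold=3)
print(apply_vocab("Yes , why not ?", vocab))  # prints: Yes <unk> why <unk> ?
```

Here the comma (2 occurrences) and "not" (1 occurrence) fall below the threshold and are replaced, while "Yes", "why", and "?" (3 occurrences each) are kept. Dropping rare words this way keeps the vocabulary small at the cost of losing some detail.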

##Training and running a chatbot