<a href="https://colab.research.google.com/github/sishef/nlpworkshop/blob/main/3_TrainingWithRealConversations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook 3: Training With Real Conversations

By now it should be clear that hard-coding a response to every input a user could throw at our chatbot is impractical.

Thankfully, there are other ways. In this session, we'll take a **corpus** (a large, structured body of text) of historical conversations and use it to select responses to user input. This means our bot can learn from real conversations, rather than relying on every conversation path being written into our code.

In [None]:
# This command will install chatterbot-corpus, a library which contains a corpus of conversations in YAML format
# You can view these raw files in the chatterbot-corpus GitHub repo: https://github.com/gunthercox/chatterbot-corpus/tree/master/chatterbot_corpus/data/english
!pip install chatterbot-corpus

In [None]:
import chatterbot_corpus
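# safe_load parses plain YAML data and, unlike yaml.load in newer PyYAML releases, needs no Loader argument
from yaml import safe_load as load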
import inspect
import os

In a fairly simple approach, we will take all of these historical conversations, and build a lookup table of known inputs and replies.

In [None]:
# Create a dict of msg->response from the files in the corpus
def load_conversations_from_corpus_simple():
  # 1) Get the location of the corpus YAML files installed with the chatterbot corpus package
  data_path = os.path.join(os.path.dirname(inspect.getfile(chatterbot_corpus)), 'data/english')

  # 2) Build a list of conversations (each file holds the conversations for one topic)
  conversations = []
  for file in os.listdir(data_path):
    # Use a context manager so each file is closed once it has been parsed
    with open(os.path.join(data_path, file), 'r', encoding='utf-8') as f:
      convos = load(f)
    conversations += convos['conversations']

  print('This is an example of the format of the conversations:\n')
  print(conversations[0])
  print('\nEach item is a list - the first element is a question/user input. All other elements are possible responses to this input.')

  # 3) Build a dictionary of all the msg->[response] pairs in every conversation
  lookup = {}
  for convo in conversations:
    lookup[convo.pop(0)] = convo
  return lookup

lookup = load_conversations_from_corpus_simple()
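
One limitation of a plain dictionary like this: if the same input appears in more than one conversation, the later entry overwrites the earlier one. Below is a minimal sketch of one way to merge the replies instead. It uses a small hand-written list in place of the real corpus; the sample data and the `build_lookup` name are illustrative assumptions, not part of chatterbot-corpus.

```python
def build_lookup(conversations):
  """Map each first message to every reply seen for it, across all conversations."""
  lookup = {}
  for convo in conversations:
    # Slicing leaves the original conversation list unmodified
    question, replies = convo[0], convo[1:]
    # setdefault creates an empty list the first time a question is seen,
    # so replies from later conversations are appended rather than overwriting
    lookup.setdefault(question, []).extend(replies)
  return lookup

# Hypothetical sample data in the same list-of-lists shape as the corpus
sample = [
  ['Hello', 'Hi there!'],
  ['What is a computer?', 'A machine that processes data.'],
  ['Hello', 'Greetings!'],
]

sample_lookup = build_lookup(sample)
print(sample_lookup['Hello'])  # ['Hi there!', 'Greetings!']
```

Whether merging is worthwhile depends on how often inputs actually repeat in the corpus you load.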

We can now look up an input message in the `lookup` dictionary to find the corresponding responses from the loaded message history.

In [None]:
lookup['What language are you written in?']

In [None]:
lookup['What is a computer?']

Note that this will fail with a `KeyError` if we look up a message which isn't in the history.

In [None]:
lookup['How are you today?']
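
When a message is not a key in the dictionary, indexing with square brackets raises a `KeyError`. As a small sketch (using a hypothetical one-entry dictionary rather than the real corpus), `dict.get` returns a default value instead of raising:

```python
# Hypothetical stand-in for the corpus lookup table
toy_lookup = {
  'What is a computer?': ['A machine that processes data.'],
}

# dict.get returns the second argument when the key is missing
known = toy_lookup.get('What is a computer?', [])
unknown = toy_lookup.get('How are you today?', [])

print(known)    # ['A machine that processes data.']
print(unknown)  # []
```

An empty list is a convenient default here because the caller can test it with a plain `if` before choosing a reply.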

As before, the lookup is case-sensitive and punctuation-sensitive, so this also fails:

In [None]:
lookup['what is a computer']

We can handle this by normalizing the questions in the conversation history as we load them, and normalizing the user's input before we look it up.

In [None]:
def normalize_text(msg):
  msg = msg.lower()
  symbols = ['?','-',',',':',';']
  for symbol in symbols:
    msg = msg.replace(symbol, '')
  return msg
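
The `symbols` list above covers only a handful of characters, so a message ending in `!` or `.` would still miss. One possible alternative, sketched here under the hypothetical name `normalize_text_all`, strips every ASCII punctuation character using the standard library's `string.punctuation` and `str.translate`:

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_text_all(msg):
  # Lowercase, drop punctuation, and trim surrounding whitespace
  return msg.lower().translate(_PUNCT_TABLE).strip()

print(normalize_text_all('What is a computer?'))   # what is a computer
print(normalize_text_all('what is a computer!!'))  # what is a computer
```

This variant also removes apostrophes and hyphens, so it only works if the dictionary keys and the user's input both go through the same function.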

In [None]:
# Create a dict of msg->response from the files in the corpus
def load_conversations_from_corpus():
  # 1) Get the location of the corpus YAML files installed with the chatterbot corpus package
  data_path = os.path.join(os.path.dirname(inspect.getfile(chatterbot_corpus)), 'data/english')

  # 2) Build a list of conversations (each file holds the conversations for one topic)
  conversations = []
  for file in os.listdir(data_path):
    # Use a context manager so each file is closed once it has been parsed
    with open(os.path.join(data_path, file), 'r', encoding='utf-8') as f:
      convos = load(f)
    conversations += convos['conversations']

  # 3) Build a dictionary of all the msg->[response] pairs in every conversation
  lookup = {}
  for convo in conversations:
    lookup[normalize_text(convo.pop(0))] = convo # Note we're now normalizing the dictionary key. We're keeping the responses in their original case, with punctuation.
  return lookup

lookup = load_conversations_from_corpus()

In [None]:
lookup['what is a computer']

Or, using normalized user input:

In [None]:
lookup[normalize_text(input(''))]

#### Random Replies

You'll notice that the dictionary contains lists of responses, rather than just one response per input. Obviously, returning this list to the user would look strange.

A naive solution would be to return the first item in the list:

In [None]:
lookup['what is a computer'][0]

This would work, but it means that many of our responses would never be seen. We can make the chatbot a little less predictable (and seem a little more alive) by randomly choosing one of the suitable responses:

In [None]:
import random

In [None]:
def choose_response(msg):
  try:

    # Fetch the list of possible responses
    options = lookup[normalize_text(msg)]
    # Return a randomly selected item from the list (using the Python random library)
    return random.choice(options)

  # Handle the case where the input isn't in the dictionary
  except KeyError:
    return 'No suitable answers found.'

In [None]:
choose_response('What is a computer?')

If you run the above cell multiple times, you'll get different responses.

In computer science terms, functions like this (where the output depends on a random element, giving one of many potential outputs for the same input) are described as **non-deterministic**.
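
When testing a non-deterministic function, it can help to make the random choices repeatable. A minimal sketch, using a hypothetical list of replies: two `random.Random` instances seeded with the same value produce exactly the same sequence of choices.

```python
import random

# Hypothetical list of candidate replies
options = ['Reply A', 'Reply B', 'Reply C']

# Two generators seeded with the same value produce the same sequence
rng_one = random.Random(42)
rng_two = random.Random(42)

first_run = [rng_one.choice(options) for _ in range(5)]
second_run = [rng_two.choice(options) for _ in range(5)]

print(first_run == second_run)  # True
```

The seed value 42 is arbitrary; any fixed value gives the same repeatability.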

**TASK:**
* Modify your chatbot to give a randomized reply from this training data. 
* If the user's input isn't in the corpus, the bot should reply using your existing logic. 
* Ensure that your chemical symbol question still works.