<a href="https://colab.research.google.com/github/sishef/nlpworkshop/blob/pilot-updates/3_TrainingWithRealConversations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook 3: Training With Real Conversations

By now it's hopefully clear that hard-coding a response to every input a user could throw at our chatbot would be impractical.

Thankfully, there are other ways. In this session, we'll take a **corpus** (a large, structured body of text) of historical conversations and use it to select responses to user input. This means our bot can learn from real conversations, rather than relying on every conversation path being written into our code.

In [None]:
# This workshop uses chatterbot-corpus, a library which contains a corpus of conversations in YAML format
# You can view these raw files in the chatterbot-corpus GitHub repo: https://github.com/gunthercox/chatterbot-corpus/tree/master/chatterbot_corpus/data/english

In [8]:
import requests
import yaml
import inspect
import os

As a fairly simple first approach, we'll take all of these historical conversations and build a lookup table of known inputs and their replies.
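
To make the idea concrete before fetching the real corpus, here is a minimal sketch using a small hand-written sample. The sample data and the names `sample_conversations` and `sample_lookup` are assumptions for illustration only. Each conversation is a list whose first item is the input message and whose remaining items are the replies.

```python
# Hypothetical sample in the same shape as the corpus conversations:
# the first item is the input message, the rest are possible replies.
sample_conversations = [
    ['Hello', 'Hi there!', 'Hello to you too.'],
    ['What is your name?', 'I am a chatbot.'],
]

# Build the lookup table: input message -> list of replies.
sample_lookup = {}
for convo in sample_conversations:
    message, replies = convo[0], convo[1:]
    sample_lookup[message] = replies

print(sample_lookup['Hello'])  # ['Hi there!', 'Hello to you too.']
```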

In [20]:
# Build a dict mapping each opening message to its list of responses,
# using the YAML files in the corpus
def load_chatterbot_conversations_simple():
  data_files = [
      'ai.yml',
      'botprofile.yml',
      'computers.yml',
      #'conversations.yml',
      'emotion.yml',
      'food.yml',
      'gossip.yml',
      'greetings.yml',
      'health.yml',
      'history.yml',
      'humor.yml',
      'literature.yml',
      'money.yml',
      'movies.yml',
      'politics.yml',
      'psychology.yml',
      'science.yml',
      'sports.yml',
      'trivia.yml'
  ]

  chatterbot_base_url = 'https://raw.githubusercontent.com/gunthercox/chatterbot-corpus/master/chatterbot_corpus/data/english/'
  conversations = []
  for data_file in data_files:
    resp = requests.get(chatterbot_base_url + data_file)
    if resp.status_code != 200:
      print('Issue fetching {} - skipping'.format(data_file))
      continue  # Don't try to parse a failed download
    conversations = conversations + yaml.load(resp.content, Loader=yaml.FullLoader)['conversations']
  lookup = {}
  for convo in conversations:
    # The first item is the input message; the remaining items are the replies
    lookup[convo.pop(0)] = convo
  return lookup


lookup = load_chatterbot_conversations_simple()

We can now look up an input message in the `lookup` dictionary to find the corresponding responses from the conversation history we loaded.

In [17]:
lookup['What language are you written in?']

['I am written in Python.']

In [None]:
lookup['What is a computer?']

Note that this will fail with a `KeyError` if we look up a message which isn't in the history.

In [None]:
lookup['How are you today?']
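
Indexing a dictionary with a missing key raises a `KeyError`. One way to avoid the error, sketched here with a small stand-in dictionary (`demo_lookup` and its contents are assumptions for illustration, not the real corpus), is `dict.get`, which returns a default value instead of raising.

```python
# Stand-in for the corpus lookup (assumed sample data).
demo_lookup = {'What is a computer?': ['A machine that follows instructions.']}

# Indexing with a missing key raises KeyError...
try:
    demo_lookup['How are you today?']
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

# ...whereas dict.get returns the supplied default instead.
replies = demo_lookup.get('How are you today?', [])

print(missing_key_raised, replies)  # True []
```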

As before, our lookup is sensitive to both case and punctuation.

In [None]:
lookup['what is a computer']

We can handle this by normalizing the questions in the conversation history as we load them, and normalizing the user's input before we look it up.

In [21]:
def normalize_text(msg):
  msg = msg.lower()
  symbols = ['?','-',',',':',';']
  for symbol in symbols:
    msg = msg.replace(symbol, '')
  return msg
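
The symbol list above covers only a handful of characters, so a message ending in `.` or `!` would still miss. A possible broader variant is sketched below; the name `normalize_text_all_punctuation` is hypothetical and not part of the workshop code. It removes every ASCII punctuation character and tidies up whitespace.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def normalize_text_all_punctuation(msg):
    # Lowercase, drop punctuation, then collapse runs of whitespace.
    msg = msg.lower().translate(PUNCTUATION_TABLE)
    return ' '.join(msg.split())

print(normalize_text_all_punctuation('  What is a Computer?! '))  # what is a computer
```

Note that this also strips apostrophes (`you're` becomes `youre`), so whichever function is used must be applied to both the corpus questions and the user's input.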

In [22]:
# Create a dict of msg->response from the files in the corpus
def load_chatterbot_conversations():
  data_files = [
      'ai.yml',
      'botprofile.yml',
      'computers.yml',
      #'conversations.yml',
      'emotion.yml',
      'food.yml',
      'gossip.yml',
      'greetings.yml',
      'health.yml',
      'history.yml',
      'humor.yml',
      'literature.yml',
      'money.yml',
      'movies.yml',
      'politics.yml',
      'psychology.yml',
      'science.yml',
      'sports.yml',
      'trivia.yml'
  ]

  chatterbot_base_url = 'https://raw.githubusercontent.com/gunthercox/chatterbot-corpus/master/chatterbot_corpus/data/english/'
  conversations = []
  for data_file in data_files:
    resp = requests.get(chatterbot_base_url + data_file)
    if resp.status_code != 200:
      print('Issue fetching {} - skipping'.format(data_file))
      continue  # Don't try to parse a failed download
    conversations = conversations + yaml.load(resp.content, Loader=yaml.FullLoader)['conversations']
  lookup = {}
  for convo in conversations:
    # Normalize the input message before using it as the dictionary key
    lookup[normalize_text(convo.pop(0))] = convo
  return lookup


lookup = load_chatterbot_conversations()

In [23]:
lookup['what is a computer']

['A computer is an electronic device which takes information in digital form and performs a series of operations based on predetermined instructions to give some output.',
 "The thing you're using to talk to me is a computer.",
 'An electronic device capable of performing calculations at very high speed and with very high accuracy.',
 'A device which maps one set of numbers onto another set of numbers.']

Or, using normalized user input:

In [None]:
lookup[normalize_text(input(''))]

#### Random Replies

You'll notice that the dictionary contains lists of responses, rather than just one response per input. Returning the whole list to the user would look strange.

A naive solution would be to return the first item in the list:

In [24]:
lookup['what is a computer'][0]

'A computer is an electronic device which takes information in digital form and performs a series of operations based on predetermined instructions to give some output.'

This would work, but it means that many of our responses would never be seen. We can make the chatbot a little less predictable (and seem a little more alive) by randomly choosing one of the suitable responses:

In [None]:
import random

In [None]:
def choose_response(msg):
  try:

    # Fetch the list of possible responses
    options = lookup[normalize_text(msg)]
    # Return a randomly selected item from the list (using the Python random library)
    return random.choice(options)

  # Handle the case where the input isn't in the dictionary
  except KeyError:
    return 'No suitable answers found.'

In [None]:
choose_response('What is a computer?')

If you run the above cell multiple times, you'll get different responses.

In computer science terms, functions like this (where the output depends on a random element, so the same input can give any one of several possible outputs) are described as **non-deterministic**.
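
Python's random numbers are generated by a seeded algorithm, so the choices can be made repeatable, which is handy when testing. A small sketch, using a stand-in list of replies (assumed for illustration):

```python
import random

# Stand-in list of candidate replies (assumed sample data).
replies = ['Reply A', 'Reply B', 'Reply C', 'Reply D']

# Seeding the generator fixes the sequence of "random" choices,
# so the same seed always produces the same picks.
random.seed(42)
first_run = [random.choice(replies) for _ in range(5)]

random.seed(42)
second_run = [random.choice(replies) for _ in range(5)]

print(first_run == second_run)  # True
```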

**TASK:**
* Modify your chatbot to give a randomized reply from this training data. 
* If the user's input isn't in the corpus, the bot should reply using your existing logic. 
* Ensure that your chemical symbol question still works.
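
If you get stuck, one possible shape for the solution is sketched below. It is only an outline under stated assumptions: `demo_lookup` is a stand-in for the real corpus lookup, the inline normalization is simplified, and `fallback_reply` is a hypothetical placeholder for whatever logic your chatbot already has from the earlier notebooks.

```python
import random

# Stand-in for the corpus lookup built earlier (assumed sample data).
demo_lookup = {'what is a computer': ['A machine.', 'The thing you are using.']}

def fallback_reply(msg):
    # Hypothetical placeholder for your existing chatbot logic.
    return "Sorry, I don't understand."

def reply(msg):
    # Simplified normalization for this sketch only.
    key = msg.lower().replace('?', '')
    options = demo_lookup.get(key)
    if options:
        # The input is in the corpus: pick one of its replies at random.
        return random.choice(options)
    # Otherwise hand over to the existing logic.
    return fallback_reply(msg)

print(reply('What is a computer?'))
print(reply('Tell me a secret'))
```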