# Natural Language Understanding

Some more useful regex patterns to start with:

`r"\bme\b"` will match only the word "me"  
`[A-Z]{1}[a-z]*` will match any title case word.  
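
A quick check of the two patterns above, as a minimal sketch. The example sentences are made up for illustration, and the redundant `{1}` is dropped from the second pattern because `[A-Z]` already matches exactly one character.

```python
import re

# \b marks a word boundary, so the "me" inside "home" or "meals" is ignored
me_hits = re.findall(r"\bme\b", "tell me about home and meals")
print(me_hits)

# one capital letter followed by zero or more lowercase letters
title_hits = re.findall(r"[A-Z][a-z]*", "my friend Ada met Grace")
print(title_hits)
```

Note that the second pattern also matches a lone capital letter such as "I", since `[a-z]*` allows zero lowercase letters.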

If you're going to use a pattern several times, then store it with `re.compile()`.

Use the pipe operator (`|`) within a pattern to match any one of several alternatives, and `pattern.findall()` to collect multiple matches within a sentence.
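
The three ideas above (compiling, alternation and `findall()`) fit together roughly like this. The pattern name `greeting_pat` and the sentence are illustrative only.

```python
import re

# compile once, reuse many times; "|" means "any one of these alternatives"
greeting_pat = re.compile(r"\bhi\b|hola|heya")

sentence = "hi there, hola amigo, this is not a greeting"

# findall() returns every non-overlapping match as a list of strings
all_hits = greeting_pat.findall(sentence)

# search() stops at the first match and returns a match object (or None)
first_hit = greeting_pat.search(sentence)

print(all_hits)
print(first_hit.group())
```

The "hi" inside "this" is not picked up, because `\bhi\b` only matches "hi" as a whole word.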

## Flexibly match intents

In [109]:
import re
import time

In [110]:
intent_dict = {
    'goodbye': ['see ya', 'bye'],
    'greet': [r'\bhi\b', 'hola', 'heya'],
    'thankyou': ['appreciate', 'thank', r'\bta\b']
 }

In [111]:
# compile a dict of regex patterns that can look for any of the above pattern matches
intent_patterns = {}
for key, values in intent_dict.items():
    multi_pat = "|".join(values)
    compiled_pat = re.compile(multi_pat)
    # label this flexible pattern with the intent key it came with
    intent_patterns[key] = compiled_pat
intent_patterns

{'goodbye': re.compile(r'see ya|bye', re.UNICODE),
 'greet': re.compile(r'\bhi\b|hola|heya', re.UNICODE),
 'thankyou': re.compile(r'appreciate|thank|\bta\b', re.UNICODE)}

In [112]:
# Define a function to find the intent of a message
def find_intent(pat_dict, some_input):
    matched = None
    for intent, patterns in pat_dict.items():
        # keep this intent if its pattern matches; a later match overwrites an earlier one
        if patterns.search(some_input):
            matched = intent
    return matched

In [113]:
print(find_intent(intent_patterns, "hola! Como estas"))
print(find_intent(intent_patterns, "thankee sai"))
print(find_intent(intent_patterns, "see ya bud!"))

greet
thankyou
goodbye
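
As written, `find_intent()` is case-sensitive and reports only the last intent that matched. One possible variation is sketched below: compile with `re.IGNORECASE` and return every matching intent. The helper names `compile_intents` and `find_all_intents` are hypothetical and not part of the original notebook.

```python
import re

# hypothetical helper: same idea as the compile loop above, but case-insensitive
def compile_intents(intent_dict, flags=re.IGNORECASE):
    return {intent: re.compile("|".join(pats), flags)
            for intent, pats in intent_dict.items()}

# hypothetical helper: collect every matching intent instead of only the last
def find_all_intents(pat_dict, message):
    """Return every intent whose pattern matches, in dict order."""
    return [intent for intent, pat in pat_dict.items() if pat.search(message)]

demo_patterns = compile_intents({
    "goodbye": ["see ya", "bye"],
    "greet": [r"\bhi\b", "hola", "heya"],
    "thankyou": ["appreciate", "thank", r"\bta\b"],
})

matches = find_all_intents(demo_patterns, "Hi, thanks and BYE")
print(matches)
```

An empty list then signals "no intent found", which plays the same role as `None` in the original function.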


## Respond to the intent

In [114]:
response_dict = {
    'default': '...',
    'goodbye': 'Have a great day',
    'greet': 'Hi there',
    'thankyou': "no problem, that's my job"
 }

In [115]:
def answer(string_input, pat_dict, resps):
    # get the matched intent
    intent = find_intent(pat_dict, string_input.lower())
    # Use default as the fallback value
    key = "default"
    if intent in resps:
        key = intent
    return resps[key]

In [116]:
answer("See ya", intent_patterns, response_dict)

'Have a great day'

Update the wrapper function from module 1, which prints the exchange using a simple display template.

In [117]:
# update params to include the lookup dicts as default
def user_speaks(user_input, pat_dict=intent_patterns, resps=response_dict, user_format="USER:", bot_format="BOT:"):
    """Passes the user's input to response handler."""
    time.sleep(0.6)
    print(f"{user_format} {user_input}")
    # update the line below to use the new flexible match functions
    resp = answer(user_input, pat_dict, resps)
    time.sleep(0.6)
    return f"{bot_format} {resp}"

In [118]:
print(user_speaks("Hi hi cherry pie!"))
print(user_speaks("Ta very much my lovely..."))
print(user_speaks("Gotta go. Bye bye hunny pie."))

USER: Hi hi cherry pie!
BOT: Hi there
USER: Ta very much my lovely...
BOT: no problem, that's my job
USER: Gotta go. Bye bye hunny pie.
BOT: Have a great day


## Basic NER

A basic, regex-only take on Named Entity Recognition (NER): look for a cue word, then pull out any capitalised words.

In [119]:
def get_names(string_input):
    """Search a string for a cue that a name is being discussed, then look
    for proper nouns and return them if found."""
    # ensure None is returned if no match is found
    entity = None
    name_pat = re.compile("name|call")
    proper_noun_pat = re.compile("[A-Z]{1}[a-z]*")
    # look for a sentence about a named entity:
    if name_pat.search(string_input):
        hits = proper_noun_pat.findall(string_input)
        # only overwrite None when something was found (an empty list is not a name)
        if len(hits) > 0:
            # several hits means we need to concatenate values
            entity = " ".join(hits)
    return entity

In [120]:
print(get_names("my name is Jimmy."))
print(get_names("My name is Jimmy."))
# you can see how this is limited: a capitalised sentence start is picked up too,
# and lowercasing the input string would remove every match

Jimmy
My Jimmy


In [121]:
# Define answer_name() to respond using the extracted name
def answer_name(str_input):
    name = get_names(str_input)
    if name is None:
        return "You're mysterious, tell me your name."
    else:
        return f"Hello, {name}!"

In [122]:
# update params to include the lookup dicts as default
def user_speaks(user_input, pat_dict=intent_patterns, resps=response_dict, user_format="USER:", bot_format="BOT:"):
    """Passes the user's input to response handler."""
    time.sleep(0.6)
    print(f"{user_format} {user_input}")
    # update the line below to use the name retrieval funcs
    resp = answer_name(user_input)
    time.sleep(0.6)
    return f"{bot_format} {resp}"

In [123]:
print(user_speaks("i am called John Snow"))
print(user_speaks("my name is Spartacus"))
print(user_speaks("My name is Spartacus"))

USER: i am called John Snow
BOT: Hello, John Snow!
USER: my name is Spartacus
BOT: Hello, Spartacus!
USER: My name is Spartacus
BOT: Hello, My Spartacus!
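
The "My Spartacus" result comes from collecting every capitalised word in the sentence. One way around it, sketched below, is to capture only the capitalised words that directly follow the cue phrase. The pattern, the cue phrases and the function name `get_name_after_cue` are assumptions for illustration, not part of the original notebook.

```python
import re

# hypothetical pattern: a cue phrase, whitespace, then one or more capitalised words
NAME_AFTER_CUE = re.compile(
    r"(?:name is|called|call me)\s+((?:[A-Z][a-z]*\s?)+)"
)

def get_name_after_cue(text):
    """Return the capitalised words right after a cue phrase, or None."""
    match = NAME_AFTER_CUE.search(text)
    # group(1) may end with a trailing space, so strip it
    return match.group(1).strip() if match else None

print(get_name_after_cue("My name is Spartacus"))
print(get_name_after_cue("i am called John Snow"))
print(get_name_after_cue("my name is jimmy"))
```

This is still limited: it depends on the user capitalising their name, and it only knows the cue phrases listed in the pattern.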


## Wordvec with spaCy

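Great little intro to word vectors, where vectors of floats are assigned to tokens: words, word parts, letters or whole sentences. These can then be used within ML workflows. spaCy makes several models with word vectors available. Here we are using `en_core_web_md`, which comes with 300-dimensional word vectors trained on a large corpus (with the GloVe algorithm in older releases of the model). The smaller `en_core_web_sm` model does not include static word vectors.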

Tokens can be compared using the cosine similarity of their vectors:

* Vectors pointing in the same direction = 1
* Perpendicular vectors = 0
* Vectors pointing in opposite directions = -1
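
The three cases in the list can be checked with a minimal plain-Python sketch of cosine similarity, i.e. the dot product divided by the product of the vector lengths. The function name and the toy 2D vectors are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Assumes neither vector is all zeros (that would divide by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])          # same direction
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])     # opposite direction

print(same, perpendicular, opposite)
```

Only direction matters here, not length: `[1, 2]` and `[2, 4]` count as identical. In spaCy, `token.similarity(other_token)` gives a comparable estimate, by default the cosine over the vectors.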

In [124]:
import spacy
nlp = spacy.load('en_core_web_md')

In [125]:
n_dim = nlp.vocab.vectors_length
n_dim

300

In [126]:
# Use the nlp model on a string to get tokens:
doc = nlp("Hey Ho, ah let's go!")
doc

Hey Ho, ah let's go!

In [127]:
for token in doc:
    print(f"{token}: {token.vector[:7]}")
# showing the first 7 components of each token's word vector

Hey: [ 2.9      0.48218 -2.2693   0.27522 -7.1124   1.2409  -0.43371]
Ho: [-1.9577  -3.629   -4.1803   0.75524  2.439    4.1769  -1.2797 ]
,: [-3.3899  -4.7034  -0.56101  1.2291   4.3298  -1.0775  -1.3006 ]
ah: [ 3.5059   2.9413  -0.30366 -0.53069 -3.0985   3.9806  -2.8103 ]
let: [ 8.0705   6.2403  -5.6268  -0.6813  -3.603    2.8543  -0.82774]
's: [ 3.3163   9.7209  -3.1254  -5.1013  12.248    0.74676 -2.2017 ]
go: [ 1.484   8.3944 -8.3806  3.2081 -4.2582  1.9773 -2.7806]
!: [ 5.0891  -3.3753  -4.2695  -4.8156   3.8904   6.2171   0.26271]


In [128]:
len(doc[0].vector) == n_dim
# you can see that each token's vector has n_dim components

True

In [129]:
# download ATIS dataset

import pandas as pd
import requests
import json
URL = 'https://raw.githubusercontent.com/jkkummerfeld/text2sql-data/master/data/atis.json'
data = json.loads(requests.get(URL).text)
# Flattening JSON data
ATIS = pd.json_normalize(data)
sentences = []
for l in ATIS.sentences:
    for d in l:
        sentences.append(d["text"])
n_sent = len(sentences)


In [130]:
# prepare a 2D array for storing the vectors
import numpy as np
vec_array = np.zeros((n_sent, n_dim))
print(np.shape(vec_array))
vec_array

(5280, 300)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [131]:
# pass all the sentences to spaCy to calculate the vectors, storing them in our array
# (doc.vector is the average of the token vectors in the sentence)
for row, sentence in enumerate(sentences):
    doc = nlp(sentence)
    vec_array[row, :] = doc.vector
vec_array


array([[-0.66322637, -1.28980911, -1.05443728, ..., -3.43461752,
        -0.70707184,  0.85744095],
       [-0.72871637, -0.17177999, -3.97666264, ..., -4.20017004,
         0.16950625,  0.9615075 ],
       [ 0.55754501,  0.75949955, -2.94400024, ..., -1.26750004,
        -4.14976645,  0.34788665],
       ...,
       [-0.66783941,  2.86714268, -2.67127156, ...,  0.06596395,
        -3.84909701,  2.84442091],
       [-2.73585796,  3.96805525, -3.66085553, ...,  1.75896692,
        -7.27086449,  2.62955999],
       [-1.04818916,  3.22615075, -3.07146311, ..., -1.51073062,
        -2.75123072,  1.16789699]])
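
Once every sentence is a row in an array, cosine similarity can be used to find the rows closest to a query vector. The sketch below uses a toy 2D array so it runs without spaCy. The function name `most_similar_rows` and the small epsilon guard are assumptions for illustration.

```python
import numpy as np

def most_similar_rows(query_vec, matrix, top_n=3):
    """Indices of the rows of `matrix` closest to `query_vec` by cosine similarity."""
    # vector lengths; the small epsilon guards against all-zero rows
    row_norms = np.linalg.norm(matrix, axis=1) + 1e-12
    query_norm = np.linalg.norm(query_vec) + 1e-12
    # dot product of every row with the query, scaled by both lengths
    sims = (matrix @ query_vec) / (row_norms * query_norm)
    # argsort is ascending, so reverse it and keep the best top_n
    return np.argsort(sims)[::-1][:top_n], sims

toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
best, sims = most_similar_rows(np.array([1.0, 0.0]), toy, top_n=2)
print(best, sims)
```

With the notebook's own objects, the same call would look something like `most_similar_rows(nlp("some query").vector, vec_array)`, and the returned indices can then be used to look up entries in `sentences`.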