# Natural Language Understanding

This module starts with some more useful regex patterns:

`r"\bme\b"` will match only the word "me"  
`[A-Z]{1}[a-z]*` will match any title case word.  

If you're going to use a pattern several times, then store it with `re.compile()`.

Use the pipe operator (`|`) within a pattern to match any of several alternatives, and use `pattern.findall()` to return every match within a sentence.
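A minimal sketch of the patterns above in action. The example sentence and the variable names (`sentence`, `me_pat`, `title_pat`, `place_pat`) are made up for illustration and are not from the course material.

```python
import re

sentence = "Tell me about Mexico, Rome and some other places"

# \b marks a word boundary, so the "me" inside "some" or "Rome" is not matched.
me_pat = re.compile(r"\bme\b")
print(me_pat.findall(sentence))     # ['me']

# One capital letter followed by zero or more lowercase letters.
# [A-Z] is equivalent to [A-Z]{1}; the {1} quantifier is redundant.
title_pat = re.compile(r"[A-Z][a-z]*")
print(title_pat.findall(sentence))  # ['Tell', 'Mexico', 'Rome']

# The pipe lets a single compiled pattern match any of several alternatives.
place_pat = re.compile(r"Mexico|Rome|Paris")
print(place_pat.findall(sentence))  # ['Mexico', 'Rome']
```

Compiling each pattern once means it can be reused across many inputs without being rebuilt.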

## Flexibly match intents

In [1]:
import re
import time

In [2]:
intent_dict = {
    'goodbye': ['see ya', 'bye'],
    'greet': [r'\bhi\b', 'hola', 'heya'],
    'thankyou': ['appreciate', 'thank', r'\bta\b']
 }

In [3]:
# compile a dict of regex patterns that can look for any of the above pattern matches
intent_patterns = {}
for key, values in intent_dict.items():
    multi_pat = "|".join(values)
    compiled_pat = re.compile(multi_pat)
    # label this flexible pattern with the intent key it came with
    intent_patterns[key] = compiled_pat
intent_patterns

{'goodbye': re.compile(r'see ya|bye', re.UNICODE),
 'greet': re.compile(r'\bhi\b|hola|heya', re.UNICODE),
 'thankyou': re.compile(r'appreciate|thank|\bta\b', re.UNICODE)}

In [4]:
# Define a function to find the intent of a message
def find_intent(pat_dict, some_input):
    matched = None
    for intent, patterns in pat_dict.items():
        # Keep the intent if its pattern matches; if several match, the last one wins
        if patterns.search(some_input):
            matched = intent
    return matched

In [5]:
print(find_intent(intent_patterns, "hola! Como estas"))
print(find_intent(intent_patterns, "thankee sai"))
print(find_intent(intent_patterns, "see ya bud!"))

greet
thankyou
goodbye
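`find_intent` keeps whichever matching intent it sees last, so a message that touches several intents only reports one of them. As a rough sketch of one alternative (the name `find_all_intents` is hypothetical and not part of the course code), every matching intent could be returned instead:

```python
import re

# Same patterns as above, rebuilt here so the block runs on its own.
intent_patterns = {
    "goodbye": re.compile(r"see ya|bye"),
    "greet": re.compile(r"\bhi\b|hola|heya"),
    "thankyou": re.compile(r"appreciate|thank|\bta\b"),
}

def find_all_intents(pat_dict, some_input):
    """Return every intent whose pattern matches, in dictionary order."""
    return [intent for intent, pattern in pat_dict.items()
            if pattern.search(some_input)]

print(find_all_intents(intent_patterns, "hi, thank you and bye"))
# ['goodbye', 'greet', 'thankyou']
print(find_all_intents(intent_patterns, "nothing to see here"))
# []
```

The caller can then decide how to handle an empty list or several competing intents.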


## Respond to the intent

In [6]:
response_dict = {
    'default': '...',
    'goodbye': 'Have a great day',
    'greet': 'Hi there',
    'thankyou': "no problem, that's my job"
 }

In [7]:
def answer(string_input, pat_dict, resps):
    # get the matched intent
    intent = find_intent(pat_dict, string_input.lower())
    # Use default as the fallback value
    key = "default"
    if intent in resps:
        key = intent
    return resps[key]

In [8]:
answer("See ya", intent_patterns, response_dict)

'Have a great day'

Update the wrapper function from module 1, which uses a nice display template, so that it calls the new matching functions.

In [9]:
# update params to include the lookup dicts as default
def user_speaks(user_input, pat_dict=intent_patterns, resps=response_dict, user_format="USER:", bot_format="BOT:"):
    """Passes the user's input to response handler."""
    time.sleep(0.6)
    print(f"{user_format} {user_input}")
    # update the line below to use the new flexible match functions
    resp = answer(user_input, pat_dict, resps)
    time.sleep(0.6)
    return f"{bot_format} {resp}"

In [10]:
print(user_speaks("Hi hi cherry pie!"))
print(user_speaks("Ta very much my lovely..."))
print(user_speaks("Gotta go. Bye bye hunny pie."))

USER: Hi hi cherry pie!
BOT: Hi there
USER: Ta very much my lovely...
BOT: no problem, that's my job
USER: Gotta go. Bye bye hunny pie.
BOT: Have a great day


## Basic NER

NER stands for Named Entity Recognition: finding the names of people, places and other entities in text.

In [11]:
def get_names(string_input):
    """Searches a string for an indication that a name is being discussed, then searches
    for proper nouns and returns them (space-separated) if any are found."""
    # ensure None is returned if no match is found
    entity = None
    name_pat = re.compile("name|call")
    proper_noun_pat = re.compile("[A-Z]{1}[a-z]*")
    # look for a sentence about a named entity:
    if name_pat.search(string_input):
        proper_nouns = proper_noun_pat.findall(string_input)
        # only overwrite None when there is a hit, so an empty list is never returned
        if len(proper_nouns) > 0:
            # several hits means we need to concatenate values
            entity = " ".join(proper_nouns)
    return entity

In [12]:
print(get_names("my name is Jimmy."))
print(get_names("My name is Jimmy."))
# this approach is limited: it picks up any capitalised word (such as "My")
# and cannot work if the input string is lowercased first

Jimmy
My Jimmy
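The stray "My" in the second output appears because every capitalised word in the sentence is collected. One possible refinement, sketched here as an assumption rather than the course's approach, is to capture only the capitalised words that directly follow the cue phrase. The names `cue_name_pat` and `get_name_after_cue` are hypothetical.

```python
import re

# Hypothetical pattern: a cue phrase, whitespace, then one or more capitalised words.
cue_name_pat = re.compile(r"(?:name is|called)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)")

def get_name_after_cue(string_input):
    """Return the capitalised words following the cue phrase, or None if absent."""
    match = cue_name_pat.search(string_input)
    return match.group(1) if match else None

print(get_name_after_cue("My name is Jimmy."))      # Jimmy
print(get_name_after_cue("i am called John Snow"))  # John Snow
print(get_name_after_cue("my name is jimmy"))       # None
```

This still depends on capitalisation, so it shares the limitation of not working on lowercased input.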


In [13]:
# Define answer_name()
def answer_name(str_input):
    name = get_names(str_input)
    if name is None:
        return "You're mysterious, tell me your name."
    else:
        return f"Hello, {name}!"

In [14]:
# update params to include the lookup dicts as default
def user_speaks(user_input, pat_dict=intent_patterns, resps=response_dict, user_format="USER:", bot_format="BOT:"):
    """Passes the user's input to response handler."""
    time.sleep(0.6)
    print(f"{user_format} {user_input}")
    # update the line below to use the name retrieval funcs
    resp = answer_name(user_input)
    time.sleep(0.6)
    return f"{bot_format} {resp}"

In [15]:
print(user_speaks("i am called John Snow"))
print(user_speaks("my name is Spartacus"))
print(user_speaks("My name is Spartacus"))

USER: i am called John Snow
BOT: Hello, John Snow!
USER: my name is Spartacus
BOT: Hello, Spartacus!
USER: My name is Spartacus
BOT: Hello, My Spartacus!


## Wordvec with spaCy

A great little intro to word vectors, where vectors of floats are assigned to words, word parts, letters or sentences. These can then be used within ML workflows. spaCy makes several models with word vectors available. Here we are using `en_core_web_md`, described in the course as trained on a large corpus with the GloVe algorithm (the vector source can differ between spaCy model versions).

Tokens can be compared to others using their cosine similarity:

* Vectors point in the same direction = 1
* Vectors are perpendicular = 0
* Vectors point in opposite directions = -1
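The three cases above follow from the definition: the cosine similarity of vectors a and b is their dot product divided by the product of their lengths. A small worked sketch in plain Python, using made-up 2D vectors and a hypothetical helper named `cosine_similarity`:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))    # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))    # perpendicular: 0.0
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # opposite: -1.0
```

Only direction matters, not length: `[3, 4]` and `[6, 8]` score 1 even though one is twice as long. spaCy offers a comparable measure through `token.similarity(other_token)`.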

In [16]:
import spacy
nlp = spacy.load('en_core_web_md')


In [17]:
n_dim = nlp.vocab.vectors_length
n_dim

300

In [18]:
# Use the nlp model on a string to get tokens:
doc = nlp("Hey Ho, ah let's go!")
doc

Hey Ho, ah let's go!

In [19]:
for token in doc:
    print(f"{token}: {token.vector[:7]}")
# showing the first 7 components of each token's word vector

Hey: [ 2.9      0.48218 -2.2693   0.27522 -7.1124   1.2409  -0.43371]
Ho: [-1.9577  -3.629   -4.1803   0.75524  2.439    4.1769  -1.2797 ]
,: [-3.3899  -4.7034  -0.56101  1.2291   4.3298  -1.0775  -1.3006 ]
ah: [ 3.5059   2.9413  -0.30366 -0.53069 -3.0985   3.9806  -2.8103 ]
let: [ 8.0705   6.2403  -5.6268  -0.6813  -3.603    2.8543  -0.82774]
's: [ 3.3163   9.7209  -3.1254  -5.1013  12.248    0.74676 -2.2017 ]
go: [ 1.484   8.3944 -8.3806  3.2081 -4.2582  1.9773 -2.7806]
!: [ 5.0891  -3.3753  -4.2695  -4.8156   3.8904   6.2171   0.26271]


In [20]:
len(doc[0].vector) == n_dim
# you can see that each token's vector has n_dim components

True

In [25]:
# download ATIS dataset

import pandas as pd
import requests
import json
#URL = 'https://raw.githubusercontent.com/jkkummerfeld/text2sql-data/master/data/atis.json'
#data = json.loads(requests.get(URL).text)
## Flattening JSON data
#ATIS = pd.json_normalize(data)
#ATIS.head()
# As far as I could see, this source does not include the intent labels, so going with Kaggle instead


| | comments | old-name | query-split | sentences | sql | variables |
|---|---|---|---|---|---|---|
| 0 | [] | | train | [{'text': 'list all the flights that arrive at... | [SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM A... | [{'example': 'MKE', 'location': 'unk', 'name':... |
| 1 | [] | | train | [{'text': 'give me the flights leaving city_na... | [SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM A... | [{'example': 'BOSTON', 'location': 'unk', 'nam... |
| 2 | [] | | train | [{'text': 'what is the most expensive one way ... | [SELECT DISTINCT FAREalias0.FARE_ID FROM AIRPO... | [{'example': 'BOSTON', 'location': 'unk', 'nam... |
| 3 | [] | | train | [{'text': 'what flights return from city_name1... | [SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM A... | [{'example': 'PHILADELPHIA', 'location': 'unk'... |
| 4 | [] | | train | [{'text': 'can you list all flights from city_... | [SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM A... | [{'example': 'CHICAGO', 'location': 'unk', 'na... |


In [54]:
import os
from pyprojroot import here
# there is a kaggle api, but overkill for this project. Source is:
# https://www.kaggle.com/datasets/hassanamin/atis-airlinetravelinformationsystem?resource=download
colnames = ["label", "text"]
ATIS = pd.read_csv(os.path.join(here(), "data", "atis_intents.csv"), names = colnames)
sentences = ATIS.text
n_sent = len(sentences)
ATIS.head()

| | label | text |
|---|---|---|
| 0 | atis_flight | i want to fly from boston at 838 am and arriv... |
| 1 | atis_flight | what flights are available from pittsburgh to... |
| 2 | atis_flight_time | what is the arrival time in san francisco for... |
| 3 | atis_airfare | cheapest airfare from tacoma to orlando |
| 4 | atis_airfare | round trip fares from pittsburgh to philadelp... |


In [52]:
# prepare a 2D array for storing the vectors
import numpy as np
vec_array = np.zeros((n_sent, n_dim))
print(np.shape(vec_array))
vec_array

(4978, 300)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [55]:
# pass each sentence to spaCy and store its document vector (the average of its token vectors) in our array
for row, sentence in enumerate(sentences):
    doc = nlp(sentence)
    vec_array[row, :] = doc.vector
vec_array


array([[-0.0826674 , -0.58716798, -2.1964817 , ..., -1.25990617,
        -2.6429491 ,  1.24336207],
       [-1.80790913,  1.00080669, -3.05921483, ..., -1.52719319,
        -2.36632752,  1.1796701 ],
       [-1.36993992,  0.49113077,  0.07254934, ..., -1.3198179 ,
        -2.0484488 ,  2.16480947],
       ...,
       [-1.82230973, -1.06588328, -2.15213466, ..., -2.93899107,
        -2.00138211, -1.13957   ],
       [-2.00526071,  2.67443204, -2.5772748 , ..., -0.08352997,
        -2.17978454,  0.47018856],
       [-1.33432257,  3.42474079, -2.36559343, ...,  0.34180367,
        -3.09680247,  2.61888719]])

## Use an SVM to recognise intents

Various approaches to cosine similarity are discussed, but the exercises focus on fitting a support vector classifier to our data. As usual, this needs labelled data with a train/test split, so prepare this now.