# Introduction to Rule-Based NLP
In this example, we look at basic natural language preprocessing with hand-written rules.

## Basic Preprocessing
First we will perform typical basic preprocessing on a sentence.

The idea is to make downstream processing steps easier than they would be on the raw text. For example, we can reduce the number of edge cases in an input sentence (such as upper case or special characters) and split it into individual words so it's easier to search.

In [1]:
import string

def preprocess_sentence(raw):
    """ Basic string preprocessing function """
    # Convert all to lower case
    processed = raw.lower()

    # Remove punctuation
    processed = processed.translate(
        str.maketrans("", "", string.punctuation))
    
    # Tokenize the sentence on whitespace (any run of spaces counts as one separator)
    processed = processed.split()

    return processed


input_sentence = "Go to the kitchen, and then get me an apple."
print(input_sentence)

processed_sentence = preprocess_sentence(input_sentence)
print(processed_sentence)

Go to the kitchen, and then get me an apple.
['go', 'to', 'the', 'kitchen', 'and', 'then', 'get', 'me', 'an', 'apple']
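
Two side effects of these steps are worth keeping in mind. The sketch below uses only standard-library calls to show that deleting punctuation also strips apostrophes and hyphens, and that `split(" ")` and `split()` behave differently when the input contains repeated spaces. The helper name `strip_punctuation` and the sample strings are illustrative assumptions, not part of the original example.

```python
import string

def strip_punctuation(text):
    """Lowercase the text and delete ASCII punctuation (illustrative helper)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

# Apostrophes and hyphens are punctuation too, so word pieces get merged
print(strip_punctuation("Don't open the well-known door!"))
# dont open the wellknown door

# Repeated spaces: split(" ") keeps empty strings, split() does not
spaced = strip_punctuation("Go to  the kitchen.")
print(spaced.split(" "))  # ['go', 'to', '', 'the', 'kitchen']
print(spaced.split())     # ['go', 'to', 'the', 'kitchen']
```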


## Keyword Extraction
For simple use cases, we can often get away with directly searching for keywords in a sentence to extract meaning.

Suppose we expect the human to speak a sentence that may contain the name of a type of object and a type of location (or room), and we already know the list of objects and rooms beforehand because we have built up that knowledge base manually. This is common in rule-based NLP.

Since the processed sentence is already tokenized into a list of individual lowercase words, it is fairly straightforward to do some simple searching in Python to see if a sentence mentions any of the objects and/or locations in our knowledge base.

In [2]:
def extract_keywords(words, objects, locations):
    """ 
    Extracts target objects and locations from a list of words 
    given lists of possible objects and locations
    """

    target_object = None
    target_location = None

    # Extract object
    for obj in objects:
        if obj in words:
            target_object = obj
            break
    
    # Extract location
    for loc in locations:
        if loc in words:
            target_location = loc
            break

    return target_object, target_location


input_sentence = "Go to the kitchen, and then get me an apple."
objects = ["apple", "water", "snacks"]
locations = ["kitchen", "bedroom", "garage"]

words = preprocess_sentence(input_sentence)
(tgt_obj, tgt_loc) = extract_keywords(words, objects, locations)
print("Input sentence:  {}".format(input_sentence))
print("Target object:   {}".format(tgt_obj))
print("Target location: {}".format(tgt_loc))

Input sentence:  Go to the kitchen, and then get me an apple.
Target object:   apple
Target location: kitchen


## Test on Multiple Sentences
It is best practice to add some more test cases to our code to check whether it behaves as expected. 

Specifically, what if the sentence does not mention a valid object and/or location name? What if the human uses a synonymous word, or misspeaks/misspells a word? The exact-match rules do not handle these cases robustly: in the last sentence below, "snack" is not recognized because the knowledge base only lists "snacks".

In [3]:
objects = ["apple", "water", "snacks"]
locations = ["kitchen", "bedroom", "garage"]
input_sentences = ["Go to the kitchen, and then get me an apple",
                   "Bring me a bottle of water", 
                   "Drive over to the garage",
                   "Find a snack in my bedroom"]

for sentence in input_sentences:
    words = preprocess_sentence(sentence)
    (tgt_obj, tgt_loc) = extract_keywords(words, objects, locations)
    print("Input sentence:  {}".format(sentence))
    print("Target object:   {}".format(tgt_obj))
    print("Target location: {}".format(tgt_loc))
    print("")

Input sentence:  Go to the kitchen, and then get me an apple
Target object:   apple
Target location: kitchen

Input sentence:  Bring me a bottle of water
Target object:   water
Target location: None

Input sentence:  Drive over to the garage
Target object:   None
Target location: garage

Input sentence:  Find a snack in my bedroom
Target object:   None
Target location: bedroom
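
One possible way to soften these failures, while staying rule-based, is to add a synonym table and approximate string matching. The sketch below is an assumption-laden illustration rather than part of the original example: the function name `match_keyword`, the `synonyms` dictionary, and the `cutoff` value of 0.8 are all hypothetical choices. It relies on `difflib.get_close_matches` from the standard library. Note that it walks through the words of the sentence in order, so it returns the first match in the sentence rather than the first match in the knowledge base.

```python
import difflib

def match_keyword(words, vocabulary, synonyms=None, cutoff=0.8):
    """Return the first vocabulary entry matched exactly, via a synonym, or approximately."""
    synonyms = synonyms or {}
    for word in words:
        # 1. Exact match against the knowledge base
        if word in vocabulary:
            return word
        # 2. Hand-written synonym table (hypothetical, maps alias -> canonical name)
        canonical = synonyms.get(word)
        if canonical in vocabulary:
            return canonical
        # 3. Approximate match to tolerate small misspellings or plural forms
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if close:
            return close[0]
    return None


objects = ["apple", "water", "snacks"]
locations = ["kitchen", "bedroom", "garage"]
object_synonyms = {"drink": "water"}  # assumed alias for illustration

print(match_keyword(["find", "a", "snack", "in", "my", "bedroom"], objects))  # snacks
print(match_keyword(["get", "me", "a", "drink"], objects, object_synonyms))   # water
print(match_keyword(["go", "to", "the", "kitchn"], locations))                # kitchen
print(match_keyword(["drive", "over"], objects))                              # None
```

A lower `cutoff` tolerates more spelling variation but also raises the risk of false matches on short, unrelated words, so the value would need tuning against real input.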



## Text Processing with Regular Expressions
A more robust approach for extracting patterns from text is to use regular expressions. Python has a built-in `re` module for regular expressions, and you can refer to [the documentation](https://docs.python.org/3/library/re.html) for more details.

Regular expressions allow us to define complex patterns for search-and-replace operations. They have a significant learning curve, but they are worth learning because they are useful in many applications beyond NLP -- most notably for automating tedious manual tasks by scripting.
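
As a small illustration of search-and-replace, the sketch below uses `re.sub` to delete punctuation and collapse repeated whitespace. The function name `normalize_text` and the sample sentence are assumptions made for this example.

```python
import re

def normalize_text(text):
    """Remove punctuation and collapse whitespace (illustrative helper)."""
    # Delete every character that is neither a word character nor whitespace
    no_punct = re.sub(r"[^\w\s]", "", text)
    # Replace each run of whitespace with a single space and trim the ends
    return re.sub(r"\s+", " ", no_punct).strip()

print(normalize_text("Bring   me a bottle of water!!!"))
# Bring me a bottle of water
```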

In the example below, we can use regular expressions to search for the same object and location names without having to preprocess the sentence first. That is, we don't need to split up individual words or remove special characters, which means the regular expression can handle a wider set of human input. Because `re.search` matches anywhere in the string, a name is also found when it appears inside a longer token, such as "OFWATER" or "Kitchenette". We also add a pattern to the object checking logic such that you can enter a singular or plural form (assuming the plural is just the singular word with an `s` at the end, which is not always true).

In [4]:
import re

def extract_keywords_regexp(sentence, objects, locations):
    """ 
    Extracts target objects and locations from a raw sentence string
    given lists of possible objects and locations.
    This implementation uses regular expressions from the `re` module.
    If several names match, the last matching entry in each list is returned.
    """

    target_object = None
    target_location = None

    # Extract object
    for obj in objects:
        # Escape the name so it is matched literally; "s?" allows one optional plural "s"
        pattern = "(" + re.escape(obj) + ")s?"
        result = re.search(pattern, sentence, re.IGNORECASE)
        if result is not None:
            target_object = obj
    
    # Extract location
    for loc in locations:
        result = re.search(re.escape(loc), sentence, re.IGNORECASE)
        if result is not None:
            target_location = loc

    return target_object, target_location


objects = ["apple", "water", "snack"]
locations = ["kitchen", "bedroom", "garage", "living room"]
input_sentences = ["Go to the kitchen, and then get me an apple",
                   "Bring me a BOTTLE OFWATER!!!", 
                   "Drive over to the garage",
                   "Find a snack in my bedroom",
                   "Look for some apples in the Kitchenette!", 
                   "Can you search for snacks in the living room?"]

for sentence in input_sentences:
    (tgt_obj, tgt_loc) = extract_keywords_regexp(sentence, objects, locations)
    print("Input sentence:  {}".format(sentence))
    print("Target object:   {}".format(tgt_obj))
    print("Target location: {}".format(tgt_loc))
    print("")

Input sentence:  Go to the kitchen, and then get me an apple
Target object:   apple
Target location: kitchen

Input sentence:  Bring me a BOTTLE OFWATER!!!
Target object:   water
Target location: None

Input sentence:  Drive over to the garage
Target object:   None
Target location: garage

Input sentence:  Find a snack in my bedroom
Target object:   snack
Target location: bedroom

Input sentence:  Look for some apples in the Kitchenette!
Target object:   apple
Target location: kitchen

Input sentence:  Can you search for snacks in the living room?
Target object:   snack
Target location: living room
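
Note that in the output above, "Kitchenette" is reported as `kitchen` and "OFWATER" as `water`, because `re.search` matches substrings. If whole-word matches are preferred, one possible variation is to wrap the pattern in `\b` word boundaries, as in the sketch below. The function name `find_whole_word` is hypothetical, and the sketch assumes the knowledge base is non-empty and written in lower case. Unlike the function above, it returns the leftmost match in the sentence rather than the last match in the list, and it allows the optional `s` on every name.

```python
import re

def find_whole_word(sentence, vocabulary):
    """Return the first vocabulary name that appears as a whole word, or None."""
    # Build one alternation, longest names first; re.escape keeps names literal
    ordered = sorted(vocabulary, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # \b anchors the match at word boundaries; "s?" allows a simple plural
    pattern = r"\b(" + alternatives + r")s?\b"
    match = re.search(pattern, sentence, re.IGNORECASE)
    if match is None:
        return None
    # Group 1 holds the name without the optional "s"; lower-case it to
    # line up with a lower-case knowledge base
    return match.group(1).lower()


objects = ["apple", "water", "snack"]
locations = ["kitchen", "bedroom", "garage", "living room"]

print(find_whole_word("Look for some apples in the Kitchenette!", objects))    # apple
print(find_whole_word("Look for some apples in the Kitchenette!", locations))  # None
print(find_whole_word("Bring me a BOTTLE OFWATER!!!", objects))                # None
print(find_whole_word("Can you search for snacks in the living room?", locations))  # living room
```

Whether the stricter behavior is an improvement depends on the input: it avoids false positives like "Kitchenette", but it also gives up on run-together input like "OFWATER".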

