# 1. Load Data

In [15]:
import numpy as np
import pandas as pd

CODE_LOC = 'D:\\Git\\python-natural-language-processing\\nlp_in_short\\'   # !! Modify to the path of the folder containing "features.py"
DATA_LOC = 'D:\\Git\\python-natural-language-processing\\nlp_in_short\\sentences.csv'  # !! Modify this to the CSV data location

sentences = pd.read_csv(filepath_or_buffer=DATA_LOC)


In [13]:
sentences.head(10)

Unnamed: 0,SENTENCE,CLASS
0,"Sorry, I don't know about the weather.",S
1,That is a tricky question to answer.,C
2,What does OCM stand for,Q
3,MAX is a Mobile Application Accelerator,S
4,Can a dog see in colour?,Q
5,how are you,C
6,If you deploy a MySQL database in the Oracle c...,Q
7,who is dominic Fakename,Q
8,what's the weather like today?,C
9,Can the OCM host non Oracle software stacks?,Q


In [4]:
sentences.shape

(100, 2)

# 2. Feature Engineering - A Non-Standard, Bespoke Approach

Chapter 6 of the NLTK Book has a great deal of background and worked examples for classifying text using machine learning algorithms such as Naive Bayes Classifiers. A different bespoke approach involving home-grown feature engineering and a scikit-learn Random Forest model is outlined in this note.

The code snippet below takes a sentence and extracts its sequence of POS-tag triples (each run of three consecutive Part-of-Speech tags). Counting occurrences of particular triple patterns (or other POS-tag patterns) is one way to build features from a sentence.

In [11]:
# Extract some patterns of PoS sequences 
import nltk
from nltk import word_tokenize


list_of_triple_strings = [] #triple sequence of PoS tags
sentence = "Can a dog see in Colour?"

sentence_parsed = word_tokenize(sentence)
#print(sentence_parsed)
pos_tags = nltk.pos_tag(sentence_parsed)
#print(pos_tags)
pos = [ i[1] for i in pos_tags ]
print("Words mapped to Part of Speech Tags:",pos_tags)
print("PoS Tags:", pos)

n = len(pos)
for i in range(0, n-2):  # n tags yield n-2 overlapping triples
    t = "-".join(pos[i:i+3]) # join 3 consecutive tags into one string
    list_of_triple_strings.append(t)
    
print("Sequence of triples:", list_of_triple_strings)

Words mapped to Part of Speech Tags: [('Can', 'MD'), ('a', 'DT'), ('dog', 'NN'), ('see', 'NN'), ('in', 'IN'), ('Colour', 'NNP'), ('?', '.')]
PoS Tags: ['MD', 'DT', 'NN', 'NN', 'IN', 'NNP', '.']
Sequence of triples: ['MD-DT-NN', 'DT-NN-NN', 'NN-NN-IN', 'NN-IN-NNP', 'IN-NNP-.']
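
The counting step can be sketched as follows. This is a minimal illustration, not the code from features.py: the helper name `pos_ngrams` and the list `patterns_of_interest` are made up for the example, and the tag list is copied from the output above rather than recomputed with NLTK.

```python
from collections import Counter

def pos_ngrams(tags, n=3):
    """Return every run of n consecutive tags, joined with '-'."""
    return ["-".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# PoS tags copied from the output above
tags = ['MD', 'DT', 'NN', 'NN', 'IN', 'NNP', '.']

# Count how often each triple occurs in the sentence
triple_counts = Counter(pos_ngrams(tags, 3))

# Hypothetical patterns, chosen only for illustration
patterns_of_interest = ['MD-DT-NN', 'NN-NN-IN', 'VB-DT-NN']

# A Counter returns 0 for patterns that never occur,
# so every sentence maps to a vector of the same length
feature_vector = [triple_counts[p] for p in patterns_of_interest]
print(feature_vector)
```

Fixing the list of patterns up front means every sentence produces a vector of the same length, which is what a tabular model such as a Random Forest expects.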


# Extracting Features

After pre-processing the sentences (using the approach above), we can collect a set of triples for each class: Questions, Chat and Statements. The sets will overlap heavily, but some patterns should turn out to be characteristic of a single class.

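
To make the idea of shared and distinctive patterns concrete, here is a small sketch using Python sets. The triples per class are invented for illustration (they are not results from sentences.csv); the labels Q, C and S follow the CLASS column shown earlier.

```python
# Invented triples per class, for illustration only (not real results)
triples_by_class = {
    "Q": {"MD-DT-NN", "WP-VBZ-NN", "DT-NN-NN"},
    "C": {"WRB-VBP-PRP", "DT-NN-NN"},
    "S": {"NNP-VBZ-DT", "DT-NN-NN", "WP-VBZ-NN"},
}

# Patterns seen in every class carry little information
common = set.intersection(*triples_by_class.values())

# Patterns seen in exactly one class are candidate features
distinctive = {}
for label, triples in triples_by_class.items():
    others = set().union(
        *(t for other_label, t in triples_by_class.items() if other_label != label)
    )
    distinctive[label] = triples - others

print("Common to all classes:", sorted(common))
for label in triples_by_class:
    print(label, "only:", sorted(distinctive[label]))
```
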
## The features.py Features Generator
This is a custom Python module to extract features from a sentence, written for this ChatBot demo.

features.py is located here: https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/features.py

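After

`import features`

call

`feature_values = features.features_dict(id, sentence, c)`

to extract a dictionary of features for the given sentence. (Assigning the result to a name other than `features` avoids overwriting the imported module.)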

* The "id" can be any arbitrary ID value - it is simply passed in and returned unchanged as an identifier in the resulting dictionary.
* The "c" value can also be any arbitrary value representing the Class label - the idea is to supply an appropriate label so that the dict that is passed back has all the necessary information in it.
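
As an illustration of how the ID and class label travel through to the result, here is a self-contained stand-in. It is not the code from features.py: the function `features_dict_sketch` and the feature keys `wordCount` and `qMark` are hypothetical, chosen only to show the shape of a dictionary that carries an ID, some features and a class label.

```python
def features_dict_sketch(sent_id, sentence, c):
    """Hypothetical stand-in for features.features_dict (keys are made up)."""
    words = sentence.split()
    return {
        "id": sent_id,                                  # passed through unchanged
        "wordCount": len(words),                        # illustrative feature
        "qMark": int(sentence.strip().endswith("?")),   # illustrative feature
        "class": c,                                     # passed through unchanged
    }

# Sentences and labels taken from the sample rows shown earlier
rows = [
    ("0", "Can a dog see in colour?", "Q"),
    ("1", "how are you", "C"),
]

# One dictionary per sentence, each carrying its own ID and label
feature_rows = [features_dict_sketch(i, s, c) for i, s, c in rows]
for row in feature_rows:
    print(row)
```

A list of dictionaries like `feature_rows` can be passed directly to `pd.DataFrame` to get one row per sentence.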

The features that are generated, and the logic for generating them, are hard-coded in features.py. The module is not parameterised, which is a potential enhancement.

#### features.py POS Triples Extract

The features.py module includes a function
`get_triples(pos)`
which takes the Part-Of-Speech tags of a sentence and returns a list of strings of the form "POS-POS-POS", where each "POS" is a Part-Of-Speech tag.
### Example

In [17]:
import sys
sys.path.append(CODE_LOC)
import features

sentence = "Can a dog see in colour?"

sentence = features.strip_sentence(sentence)
print(sentence)

pos = features.get_pos(sentence)
triples = features.get_triples(pos)

print(triples)


Can a dog see in colour
['MD-DT-NN', 'DT-NN-NN', 'NN-NN-IN', 'NN-IN-NN']
