# 1. Load Data

https://github.com/edbullen/nltk/blob/master/classification.ipynb

In [4]:
import numpy as np
import pandas as pd

CODE_LOC = 'D:\\Git\\python-natural-language-processing\\chatbot\\'   # !! Modify to the path of the folder containing "features.py"
DATA_LOC = 'D:\\Git\\python-natural-language-processing\\chatbot\\sentences.csv'  # !! Modify this to the CSV data location

sentences = pd.read_csv(filepath_or_buffer = DATA_LOC)


In [5]:
sentences.head(10)

Unnamed: 0,SENTENCE,CLASS
0,"Sorry, I don't know about the weather.",S
1,That is a tricky question to answer.,C
2,What does OCM stand for,Q
3,MAX is a Mobile Application Accelerator,S
4,Can a dog see in colour?,Q
5,how are you,C
6,If you deploy a MySQL database in the Oracle c...,Q
7,who is dominic Fakename,Q
8,what's the weather like today?,C
9,Can the OCM host non Oracle software stacks?,Q


In [6]:
sentences.shape

(100, 2)
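
Before engineering features it can be worth checking how the labels are distributed, since a small or unbalanced class limits what a model can learn. The sketch below counts the CLASS values for the ten rows displayed above, pasted inline so that it runs without sentences.csv. The `sample_csv` and `class_counts` names are illustrative only.

```python
import csv
import io
from collections import Counter

# The ten rows shown by sentences.head(10), pasted inline for illustration.
sample_csv = '''Unnamed: 0,SENTENCE,CLASS
0,"Sorry, I don't know about the weather.",S
1,That is a tricky question to answer.,C
2,What does OCM stand for,Q
3,MAX is a Mobile Application Accelerator,S
4,Can a dog see in colour?,Q
5,how are you,C
6,If you deploy a MySQL database in the Oracle c...,Q
7,who is dominic Fakename,Q
8,what's the weather like today?,C
9,Can the OCM host non Oracle software stacks?,Q
'''

# Count how many sentences carry each class label.
reader = csv.DictReader(io.StringIO(sample_csv))
class_counts = Counter(row["CLASS"] for row in reader)
print(class_counts.most_common())
```

With the full DataFrame loaded above, `sentences['CLASS'].value_counts()` gives the same kind of summary.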

# 2. Feature Engineering - A Non-Standard, Bespoke Approach

Chapter 6 of the NLTK Book has a great deal of background and worked examples for classifying text using machine learning algorithms such as Naive Bayes Classifiers. A different bespoke approach involving home-grown feature engineering and a scikit-learn Random Forest model is outlined in this note.

The code snippet below is an example of taking a sentence and extracting a sequence of POS-tag triples from it. We can use this approach to build up features from a sentence by counting occurrences of triple patterns (or other POS-tag patterns).

In [7]:
# Extract some patterns of PoS sequences 
import nltk
from nltk import word_tokenize


list_of_triple_strings = [] #triple sequence of PoS tags
sentence = "Can a dog see in Colour?"

sentence_parsed = word_tokenize(sentence)
#print(sentence_parsed)
pos_tags = nltk.pos_tag(sentence_parsed)
#print(pos_tags)
pos = [ i[1] for i in pos_tags ]
print("Words mapped to Part of Speech Tags:",pos_tags)
print("PoS Tags:", pos)

n = len(pos)
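for i in range(0,n-3):  # note: stops at n-3, so the final triple (the one ending in the '.' tag) is skipped
    t = "-".join(pos[i:i+3]) # take 3 consecutive tags starting at position i and join them into one string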
    list_of_triple_strings.append(t)
    
print("Sequence of triples:", list_of_triple_strings)

Words mapped to Part of Speech Tags: [('Can', 'MD'), ('a', 'DT'), ('dog', 'NN'), ('see', 'NN'), ('in', 'IN'), ('Colour', 'NNP'), ('?', '.')]
PoS Tags: ['MD', 'DT', 'NN', 'NN', 'IN', 'NNP', '.']
Sequence of triples: ['MD-DT-NN', 'DT-NN-NN', 'NN-NN-IN', 'NN-IN-NNP']
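
The same idea generalises to POS n-grams of any length. The sketch below is a hypothetical helper (`pos_ngrams` is not part of NLTK or features.py) that takes the tag list printed above as a literal, so it runs without NLTK. Unlike the loop above, it slides the window right to the end of the list, so it also yields the final triple containing the punctuation tag.

```python
from collections import Counter

def pos_ngrams(tags, n=3):
    """Return every run of n consecutive tags, joined with '-'."""
    return ["-".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Tag list copied from the output above.
pos = ['MD', 'DT', 'NN', 'NN', 'IN', 'NNP', '.']

triples = pos_ngrams(pos, 3)
print(triples)

# Counting occurrences turns the patterns into numeric features.
triple_counts = Counter(triples)
print(triple_counts['MD-DT-NN'])
```

Passing `n=2` gives tag pairs instead; a list shorter than `n` simply produces an empty result.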


# Extracting Features

After pre-processing the sentences (using the approach above) we can build a set of triples for each class: Questions, Chat and Statements. The sets will overlap a lot, but hopefully some clear patterns emerge.
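
One way to look for such patterns is with set operations over the triples seen in each class. In the sketch below only the Question triples come from the example above; the Statement and Chat triples are made-up placeholders, not real output, and `triples_by_class` is a hypothetical name.

```python
# Triples observed per class. Only the 'Q' entry is taken from the example
# above; the 'S' and 'C' entries are invented purely for illustration.
triples_by_class = {
    'Q': {'MD-DT-NN', 'DT-NN-NN', 'NN-NN-IN', 'NN-IN-NNP'},
    'S': {'DT-NN-NN', 'NN-VBZ-DT', 'VBZ-DT-NN'},
    'C': {'WRB-VBP-PRP', 'DT-NN-NN'},
}

# Triples seen in every class carry little signal.
shared = set.intersection(*triples_by_class.values())

# Triples seen in exactly one class are candidate features for that class.
unique = {
    label: triples - set().union(
        *(t for other, t in triples_by_class.items() if other != label))
    for label, triples in triples_by_class.items()
}

print("shared:", shared)
print("unique to Q:", sorted(unique['Q']))
```
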
## The features.py Features Generator
This is a custom Python module to extract features from a sentence, written for this ChatBot demo.

features.py is located here: https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/features.py

Just

`import features`

and call

`f = features.features_dict(id, sentence, c)`

to extract a dictionary of features for the given sentence.

* The "id" can be any arbitrary ID value - it just gets passed in and passed back out as an identifier in the resultant dictionary.
* The "c" value can also be any arbitrary value representing the Class label - the idea is to supply an appropriate label so that the dict that is passed back has all the necessary information in it.

The actual features that are generated, and the logic behind them, are hard-coded in features.py (they are not parameterised - a potential enhancement that could be added).
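
To make the shape of the returned dictionary concrete, here is a minimal sketch of a feature builder. It is not the logic in features.py: `toy_features_dict` is a hypothetical stand-in that covers only a handful of the keys, and it takes a pre-computed list of POS tags so that it runs without NLTK.

```python
def toy_features_dict(sentence_id, sentence, pos_tags, c='X'):
    """Build a small feature dictionary for one sentence (illustrative only)."""
    words = sentence.rstrip('?.!').split()
    feats = {
        'id': sentence_id,                     # passed straight through
        'wordCount': len(words),
        'qMark': int(sentence.strip().endswith('?')),
        'class': c,                            # passed straight through
    }
    # One count per POS tag of interest.
    for tag in ('NN', 'NNP', 'NNS', 'VBZ', 'PRP', 'CD'):
        feats[tag] = pos_tags.count(tag)
    return feats

f = toy_features_dict('1', "Can a dog see in colour?",
                      ['MD', 'DT', 'NN', 'NN', 'IN', 'NN'], 'Q')
print(f)
```

The ID and class label come back unchanged, so each dictionary carries everything needed to write one labelled row of training data.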

#### features.py POS Triples Extract

The features.py module includes a function

`get_triples(pos)`

which returns a list of strings, each of the form "POS-POS-POS" where "POS" is a Part-Of-Speech tag.
### Example

In [9]:
import sys
sys.path.append(CODE_LOC)
import features

sentence = "Can a dog see in colour?"

sentence = features.strip_sentence(sentence)
print(sentence)

pos = features.get_pos(sentence)
triples = features.get_triples(pos)

print(triples)


Can a dog see in colour
['MD-DT-NN', 'DT-NN-NN', 'NN-NN-IN', 'NN-IN-NN']


In [14]:
#### Bespoke Features Generator Example - Get Python dictionary of features ####

sentences = ["Can a dog see in colour?",
             "Hey, How's it going?",
             "Oracle 12.2 will be released for on-premises users on 15 March 2017",
             "When will Oracle 12 be released"]

id = 1
for s in sentences:
    features_dict = features.features_dict(str(id),s)
    features_string, header = features.get_string(str(id),s)
    print(features_dict)
    print(features_string, "\n")
    id += 1

{'verbBeforeNoun': 1, 'stemmedCount': 4, 'CD': 0, 'qVerbCombo': 1, 'VBG': 0, 'endTuple0': 1, 'NNP': 0, 'NNS': 0, 'id': '1', 'startTuple0': 0, 'qTripleScore': 0, 'NN': 3, 'NNPS': 0, 'qMark': 1, 'PRP': 0, 'VBZ': 0, 'stemmedEndNN': 0, 'class': 'X', 'wordCount': 6, 'sTripleScore': 0, 'endTuple1': 0, 'endTuple2': 0}
1,6,4,1,1,1,0,0,0,3,0,0,0,0,0,0,1,0,0,0,0,X 

{'verbBeforeNoun': 0, 'stemmedCount': 3, 'CD': 0, 'qVerbCombo': 1, 'VBG': 1, 'endTuple0': 0, 'NNP': 2, 'NNS': 0, 'id': '2', 'startTuple0': 0, 'qTripleScore': 0, 'NN': 0, 'NNPS': 0, 'qMark': 1, 'PRP': 1, 'VBZ': 0, 'stemmedEndNN': 0, 'class': 'X', 'wordCount': 4, 'sTripleScore': 0, 'endTuple1': 0, 'endTuple2': 0}
2,4,3,1,1,0,1,0,2,0,0,0,1,0,0,0,0,0,0,0,0,X 

{'verbBeforeNoun': 0, 'stemmedCount': 8, 'CD': 3, 'qVerbCombo': 1, 'VBG': 0, 'endTuple0': 0, 'NNP': 1, 'NNS': 1, 'id': '3', 'startTuple0': 0, 'qTripleScore': 0, 'NN': 1, 'NNPS': 0, 'qMark': 0, 'PRP': 0, 'VBZ': 0, 'stemmedEndNN': 0, 'class': 'X', 'wordCount': 12, 'sTripleScore': 2, 

With this approach we can bulk-generate numeric features from a CSV file of sentences. If each sentence has a unique ID and a class label (S/Q/C), we can then try to build an ML classification model and assess its effectiveness.
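
A bulk run over a list of labelled sentences could look like the sketch below. It uses only the standard library, generates each ID by hashing the sentence (the same MD5 approach used later in this note), and takes the feature function as a parameter. `dump_features` and `stub_features` are hypothetical names; in practice `features.features_dict` would be passed in, assuming it accepts `(id, sentence, c)` as described above.

```python
import csv
import hashlib
import io

def dump_features(sentence_rows, feature_fn, keys, out):
    """Write one CSV row of features per (sentence, class) pair."""
    writer = csv.writer(out)
    writer.writerow(keys)
    for sentence, label in sentence_rows:
        sentence_id = hashlib.md5(sentence.encode('utf-8')).hexdigest()[:16]
        f = feature_fn(sentence_id, sentence, label)
        writer.writerow([f[key] for key in keys])

def stub_features(sentence_id, sentence, label):
    """Stand-in for features.features_dict, for illustration only."""
    return {'id': sentence_id,
            'wordCount': len(sentence.split()),
            'qMark': int(sentence.endswith('?')),
            'class': label}

rows = [("Can a dog see in colour?", 'Q'), ("how are you", 'C')]
buffer = io.StringIO()
dump_features(rows, stub_features, ['id', 'wordCount', 'qMark', 'class'], buffer)
print(buffer.getvalue())
```

Writing to an open file instead of the `io.StringIO` buffer would produce a features CSV on disk.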

# 3. Build a Machine Learning Model
In this section we load a features CSV file called featuresDump.csv into a Pandas data-frame. The data was generated with features.py reading in the sentences.csv file as described in the previous section. The featuresDump.csv data is then used to train a Random Forest model to predict whether a sentence is Chat, Statement or Question.

#### Load the Data

In [18]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FNAME = 'featuresDump.csv'

df = pd.read_csv(filepath_or_buffer=FNAME)
print(len(df),"rows loaded")

# strip any leading/trailing spaces from the column names and the class labels
df.columns = df.columns.str.strip()
df['class'] = df['class'].str.strip()

width = df.shape[1]

100 rows loaded


#### Training and testing split

In [56]:
#split into test and training (is_train: True / False col)
np.random.seed(seed=1)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']], df[~df['is_train']]
print(str(len(train)), " rows split into training set,", str(len(test)), "split into test set.")

# drop the first (id) column and the last (class) column; "width" was taken before is_train was added
# note: this rebinds the name "features" from the imported module to the list of column names
features = df.columns[1:width-1]
print("FEATURES = {}".format(features))

77  rows split into training set, 23 split into test set.
FEATURES = Index(['wordCount', 'stemmedCount', 'stemmedEndNN', 'CD', 'NN', 'NNP', 'NNPS',
       'NNS', 'PRP', 'VBG', 'VBZ', 'startTuple0', 'endTuple0', 'endTuple1',
       'endTuple2', 'verbBeforeNoun', 'qMark', 'qVerbCombo', 'qTripleScore',
       'sTripleScore'],
      dtype='object')
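
Because the mask is drawn at random, the split is only approximately 75/25 (77/23 here), and with so few rows a class can end up thinly represented in the test set. A stratified split avoids that by sampling each class separately. The sketch below is a standard-library illustration with a hypothetical `stratified_split` helper and made-up labels; scikit-learn's `train_test_split` with its `stratify` argument does a similar job on a DataFrame.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.75, seed=1):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Made-up label column: 40 questions, 40 statements, 20 chat sentences.
labels = ['Q'] * 40 + ['S'] * 40 + ['C'] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```
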


### Fit a Model with the Training Data-Set

In [57]:
# Fit an RF Model for "class" given features
clf = RandomForestClassifier(n_jobs=2, n_estimators = 100)
clf.fit(train[features], train['class'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=2, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

### Generate Predictions from the Test Data-Set

In [58]:
preds = clf.predict(test[features])
predout = pd.DataFrame({ 'id' : test['id'], 'predicted' : preds, 'actual' : test['class'] })

In [61]:
print(predout)

   actual                 id predicted
13      Q   31cedeb4e04fba02         Q
20      S   af7dd6b70d544b56         S
21      Q   584d5d4428d60a5f         S
24      Q   9140ee537fbe5390         Q
25      S   cabf9e317ba4a072         S
29      Q   3d25a26134f0e450         Q
32      S   280b0360e0d3ffc1         S
37      Q   0d4a13fc4cce6dab         Q
39      C   35179a54ea587953         C
40      C   8cdda20f1ae22213         C
43      Q   8798ff1fe7ac435d         Q
46      S   bc013bdd28614223         S
68      Q   7055c710336d670c         Q
70      Q   3b416352816dc854         Q
73      S   601fdf6ab85a9875         S
76      S   498b643ac17bcc7d         C
78      S   64e22039495c59bf         S
80      S   cc0c263a455bb702         S
82      C   8b1a9953c4611296         C
85      S   6b2d6039a794fb49         S
87      S   94590dd047fcbfce         S
91      Q   7a0fc645497df2c6         Q
96      S   ecef7fa7fcb25f20         S


In [62]:
## Cross-check accuracy ##
print(pd.crosstab(test['class'], preds, rownames=['actual'], colnames=['preds']))
print("\n",pd.crosstab(test['class'], preds, rownames=['actual']
                       , colnames=['preds']).apply(lambda r: round(r/r.sum()*100,2), axis=1) )

from sklearn.metrics import accuracy_score
print("\n\nAccuracy Score: ", round(accuracy_score(test['class'], preds),3) ) # fraction of test rows classified correctly

preds   C  Q   S
actual          
C       3  0   0
Q       0  8   1
S       1  0  10

 preds        C      Q      S
actual                      
C       100.00   0.00   0.00
Q         0.00  88.89  11.11
S         9.09   0.00  90.91


Accuracy Score:  0.913
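
The accuracy score is the diagonal of the cross-tab divided by the total, and the percentage table above is per-class recall. The sketch below recomputes both from the counts printed above; `confusion` and the other names are illustrative.

```python
# Rows are actual classes, columns are predicted classes (counts from above).
labels = ['C', 'Q', 'S']
confusion = [
    [3, 0, 0],    # actual C
    [0, 8, 1],    # actual Q
    [1, 0, 10],   # actual S
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall: of the rows that really belong to a class, how many were found.
recall = {label: confusion[i][i] / sum(confusion[i])
          for i, label in enumerate(labels)}

print("accuracy:", round(accuracy, 3))
print("recall:", {k: round(v * 100, 2) for k, v in recall.items()})
```
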


## Flaws in the Approach and Further Validation
The accuracy appears good, but the test set is only 23 rows and the approach taken probably means we have over-fitted the feature selection to this data. In the next section we try the model on a completely different data-set, taken from the Python FAQ at https://docs.python.org/3/faq/general.html


### Load Sentence Data and Generate Features

In [64]:
FNAME = 'pythonFAQ.csv'

import csv
import hashlib
import features

keys = ["id",
       "wordCount",
       "stemmedCount",
       "stemmedEndNN",
       "CD",
       "NN",
       "NNP",
       "NNPS",
       "NNS",
       "PRP",
       "VBG",
       "VBZ",
       "startTuple0",
       "endTuple0",
       "endTuple1",
       "endTuple2",
       "verbBeforeNoun",
       "qMark",
       "qVerbCombo",
       "qTripleScore",
       "sTripleScore",
       "class"]

rows = []

# the "with" block closes the file even if feature extraction raises an error
with open(FNAME, 'rt', newline='') as fin:
    reader = csv.reader(fin)
    next(reader) #assume we have a header

    for line in reader:
        sentence = line[0]
        c = line[1] #class label
        # first 16 hex characters of the sentence's MD5 hash, used as an id
        sentence_id = hashlib.md5(str(sentence).encode('utf-8')).hexdigest()[:16]
        f = features.features_dict(sentence_id,sentence,c)
        rows.append([f[key] for key in keys]) #one value per key, in column order

faq = pd.DataFrame(rows,columns= keys)

### Predict the Class of Sentence with Previously Built Model

In [65]:
# predict against FAQ test set 
featureNames = faq.columns[1:width-1] #remove the first ID col and last col =classifier
feqPreds = clf.predict(faq[featureNames])

predout = pd.DataFrame({'id':faq['id'],'predicted':feqPreds,'actual':faq['class']})

#### Cross Check Accuracy 

In [69]:
print(pd.crosstab(faq['class'],feqPreds,rownames=['actual'],colnames=['preds']))
print("\n",pd.crosstab(faq['class'], feqPreds, rownames=['actual'],
                       colnames=['preds']).apply(lambda r: round(r/r.sum()*100,2), axis=1) )

preds    C   Q   S
actual            
C       12   5   2
Q        0  14   2
S        0   3  13

 preds       C      Q      S
actual                     
C       63.16  26.32  10.53
Q        0.00  87.50  12.50
S        0.00  18.75  81.25


In [72]:
print("Accuracy Score:", round(accuracy_score(faq['class'], feqPreds) ,3) )

Accuracy Score: 0.765


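This could be summarised as "OK" but not great: accuracy drops from 0.913 on the held-out test rows to 0.765 on this unseen data.

Questions (87.5%) and Statements (81.25%) are still classified correctly more than 80% of the time, but only 63.16% of Chat sentences are. The feature extraction method could easily be expanded and enhanced.

The training data-set is also small (100 sentences).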
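
Recall is only half the picture. Reading the FAQ cross-tab column by column gives precision, i.e. how often a predicted label is right. The sketch below recomputes it from the counts printed above (the names are illustrative).

```python
# Rows are actual classes, columns are predicted classes (FAQ counts from above).
labels = ['C', 'Q', 'S']
faq_confusion = [
    [12, 5, 2],   # actual C
    [0, 14, 2],   # actual Q
    [0, 3, 13],   # actual S
]

# Precision: of the rows predicted as a class, how many really belong to it.
precision = {}
for j, label in enumerate(labels):
    predicted_as_label = sum(row[j] for row in faq_confusion)
    precision[label] = faq_confusion[j][j] / predicted_as_label

print({k: round(v, 3) for k, v in precision.items()})
```

On these counts every sentence predicted as Chat really is Chat, while the Question label is over-predicted: 8 of the 22 Question predictions are wrong.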

### 5. Ad-hoc testing and experiments

In [74]:
textout = {'Q':"QUESTION",'C':"CHAT",'S':"STATEMENT"}

my_sentence = "Scikit learn is a popular Python library for machine learning."
#my_sentence = "The cat is dead"
#my_sentence = "Is the cat dead"

my_features = features.features_dict('1',my_sentence,'X')

values =[]
for key in keys:
    values.append(my_features[key])

s = pd.Series(values)
width = len(s)
my_features = s[1:width-1] # drop the first item (id) and the last item (class label); only the feature values go to the model
predict = clf.predict([my_features])

print("\n \n Prediction is:",textout[predict[0].strip()])


 
 Prediction is: STATEMENT
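
For repeated experiments the steps above can be wrapped in a small function. The sketch below is hypothetical: `classify_sentence` is not part of features.py, and `StubModel` and `stub_features` stand in for the trained `clf` and for `features.features_dict` so that the block runs on its own.

```python
TEXT_OUT = {'Q': "QUESTION", 'C': "CHAT", 'S': "STATEMENT"}

def classify_sentence(sentence, model, feature_fn, keys):
    """Return the readable class predicted for a single sentence."""
    f = feature_fn('1', sentence, 'X')
    # Keep the model inputs only: drop the id and the class label.
    row = [f[key] for key in keys if key not in ('id', 'class')]
    label = model.predict([row])[0]
    return TEXT_OUT[label.strip()]

class StubModel:
    """Stand-in for the trained classifier: 'Q' if qMark is set, else 'S'."""
    def predict(self, rows):
        return ['Q' if row[1] else 'S' for row in rows]

def stub_features(sentence_id, sentence, label):
    """Stand-in for features.features_dict, for illustration only."""
    return {'id': sentence_id,
            'wordCount': len(sentence.split()),
            'qMark': int(sentence.strip().endswith('?')),
            'class': label}

demo_keys = ['id', 'wordCount', 'qMark', 'class']
print(classify_sentence("Is the cat dead?", StubModel(), stub_features, demo_keys))
print(classify_sentence("The cat is dead", StubModel(), stub_features, demo_keys))
```

With the objects from this notebook the call would be `classify_sentence(my_sentence, clf, features.features_dict, keys)`, assuming `features` still refers to the module.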
