## Objective: To create a classifier that identifies whether a given series of words is a question

In [None]:
"""
Two approaches are followed:

1) The first approach is the more naive one. It checks whether the provided text contains certain words such as
how, where, what, when etc. (referred to as q_words from here on).
Normally, I would only check whether a sentence begins with one of the q_words, but that would miss
questions in which the q_word occurs in the middle of the sentence.
Since my aim is to build an adequate question identifier, I would rather accept a few false positives than miss
an inquiry that might be relevant.
The situation is analogous to spam filtering, where the two kinds of error also have unequal costs: we would rather
let a spam email through as non-spam than have an important email labelled as spam and miss it.

2) The second approach uses the nps_chat corpus, which comprises sample chat posts and their corresponding dialogue act types.
There are 15 dialogue act types, such as "Statement", "Emotion", "ynQuestion" etc. The nps_chat corpus can be obtained
from the nltk library.
Posts whose dialogue act type is "ynQuestion" or "whQuestion" are labelled 1 and all other posts are labelled 0.
These labelled posts serve as the reference source (training data).
A Naive Bayes model is trained on this data and then used to classify the sample data provided.

"""

In [38]:
import nltk

Reference: https://www.nltk.org/book/ch06.html

In [39]:
nltk.download('nps_chat')

[nltk_data] Downloading package nps_chat to
[nltk_data]     C:\Users\suhit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\nps_chat.zip.


True

In [46]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\suhit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

### Approach 1

In [None]:
outcome = []

In [31]:
qwords = ["who", "what", "when", "where", "why", "how", "can", "does", "do"]

In [32]:
# Read the test inputs; the context manager closes the file afterwards
with open('test-inputs.txt', encoding='utf8') as f:
    lines = f.readlines()

In [33]:
for line in lines:
    lowered = line.lower()
    # Substring test: flags the line if any q_word occurs anywhere in it,
    # including inside longer words (e.g. "do" in "door").
    qexist = any(q in lowered for q in qwords)
    outcome.append(1 if qexist else 0)
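The check above is a substring test, so a q_word also matches inside a longer word. If that produces more false positives than wanted, one option is to compare whole tokens instead. The following is a minimal sketch of that variant, not the method used in this notebook; the helper name `has_q_word` and the sample sentences are made up for illustration.

```python
import re

# Same q_words as above, stored as a set for fast membership tests
Q_WORDS = {"who", "what", "when", "where", "why", "how", "can", "does", "do"}

def has_q_word(line, q_words=Q_WORDS):
    """Return True if any whole token of `line` is a q_word."""
    # Split into runs of letters, so "what's" yields "what" and "s"
    tokens = re.findall(r"[a-z]+", line.lower())
    return any(tok in q_words for tok in tokens)

# Hypothetical sample sentences
samples = ["Where is the station", "Please shut the door", "I showed him the candle"]

# Token-based flags versus the substring test used in the cell above
token_flags = [1 if has_q_word(s) else 0 for s in samples]
substring_flags = [1 if any(q in s.lower() for q in Q_WORDS) else 0 for s in samples]

print(token_flags)      # [1, 0, 0]
print(substring_flags)  # [1, 1, 1]
```

The token-based version gives up some recall in exchange (for example, it no longer matches "do" inside "don't"), so whether it is preferable depends on how costly false positives are for the task.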

In [34]:
filename = 'output.txt'

In [35]:
with open(filename, mode='w') as outfile:
    for o in outcome:
        outfile.write("%s\n" % o)
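Approach 1 deliberately accepts false positives in order to avoid missing questions, which amounts to favouring recall over precision. No gold labels are provided for the test inputs in this notebook, but assuming a hand-labelled list were available, the trade-off could be measured roughly as follows. The `gold` and `predicted` lists below are invented purely for illustration.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for the positive class (label 1)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # questions found
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # non-questions flagged
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # questions missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 1 = question, 0 = not a question
gold = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 1, 1]

precision, recall = precision_recall(gold, predicted)
print(precision, recall)  # 0.6 1.0
```

In this made-up example every question is caught (recall 1.0) at the cost of two false alarms (precision 0.6), which is the kind of behaviour the first approach aims for.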
    

### Approach 2 

In [40]:
from nltk.corpus import nps_chat

In [43]:
# Use the first 10,000 posts of the corpus
posts = nps_chat.xml_posts()[:10000]

In [44]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
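To make the shape of these features concrete, here is a small illustration that mirrors the function above but swaps `nltk.word_tokenize` for a simple regular-expression tokenizer, so that it runs without the punkt data. The stand-in tokenizer `simple_tokenize` is an assumption for this sketch and may split some text differently from NLTK.

```python
import re

def simple_tokenize(text):
    # Stand-in tokenizer (not NLTK's): runs of word characters,
    # or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def bag_of_words_features(text):
    """Map each lower-cased token to a 'contains(token)': True feature."""
    return {"contains({})".format(tok.lower()): True for tok in simple_tokenize(text)}

features = bag_of_words_features("Where are you?")
print(features)
# {'contains(where)': True, 'contains(are)': True, 'contains(you)': True, 'contains(?)': True}
```

Note that the question mark becomes a feature of its own, so the classifier can learn from punctuation as well as from words.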

In [None]:
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]

In [47]:
featurelist = []
classlist = []

In [57]:
for feature, dt in featuresets:
    featurelist.append(feature)
    classlist.append(dt)

In [58]:
# There are two question types: whQuestion and ynQuestion
classlist = [1 if 'Question' in classval else 0 for classval in classlist]
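After collapsing the dialogue act types into two classes, it can be useful to look at how many examples fall into each class, since Naive Bayes estimates a prior for each label from these counts. The sketch below applies the same mapping to a short, hand-written list of labels; the list is illustrative and not taken from the corpus.

```python
from collections import Counter

# Hand-written sample of dialogue act labels (not real corpus data)
sample_labels = ["Statement", "whQuestion", "Emotion", "ynQuestion", "Statement"]

# Same rule as above: any label containing "Question" becomes 1
binary = [1 if "Question" in label else 0 for label in sample_labels]
balance = Counter(binary)

print(binary)   # [0, 1, 0, 1, 0]
print(balance)  # Counter({0: 3, 1: 2})
```

Running `Counter(classlist)` on the notebook's own list would give the corresponding counts for the 10,000 posts.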

In [66]:
train_set = list(zip(featurelist,classlist))

#### For the test set

In [77]:
test_set = []

In [78]:

# Build one feature dict per input line; the file is closed automatically
with open('test-inputs.txt', encoding='utf8') as f:
    for line in f:
        test_set.append(dialogue_act_features(line))

In [79]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
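The classifier is trained on all of the labelled posts, so none are held back to estimate how well it generalises. One common remedy is to set aside part of the labelled data before training. The helper below is a minimal sketch of such a split; the name `split_train_dev`, the 10% default and the toy data are assumptions for illustration.

```python
import random

def split_train_dev(labeled, dev_fraction=0.1, seed=0):
    """Shuffle a list of (features, label) pairs and split it in two."""
    shuffled = list(labeled)               # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Toy stand-in for train_set: 20 (feature dict, label) pairs
toy_set = [({"contains(w{})".format(i): True}, i % 2) for i in range(20)]

train_part, dev_part = split_train_dev(toy_set, dev_fraction=0.2)
print(len(train_part), len(dev_part))  # 16 4
```

Applied to the notebook's `train_set`, the classifier could then be trained with `nltk.NaiveBayesClassifier.train(train_part)` and scored with `nltk.classify.accuracy(classifier, dev_part)`, as is done in the NLTK book chapter referenced at the top.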

In [83]:
output = classifier.classify_many(test_set)

In [84]:
filename = 'output_nltk.txt'

In [85]:
with open(filename, mode='w') as outfile:
    for o in output:
        outfile.write("%s\n" % o)
    